monitor

Framework for executing local monitoring checks through systems such as nagios, returning the results, and uploading them to a document store such as elasticsearch, redis, or couch.

Checks can be run with the accre-monitor cli endpoint. Each check is a simple python function which takes a single argument of an options payload passed through the command line as JSON, and returns a tuple of a status string and some JSON-serializable data (dict, etc…). A function is used as a check if decorated with monitor_command.

The status and data from a check are placed into a JSON object along with the time and the hostname of the node. This object is sent to one or more configured outputs. For example, the stdout output writes the object to standard output. Additional outputs, such as elasticsearch databases, may be added as needed.
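As a rough illustration of the record described above, the following sketch assembles such an object. The field names are taken from the sample output shown later in this document; the function name build_record and the use of local time are assumptions, not the package's actual implementation.

```python
import json
import socket
from datetime import datetime


def build_record(name, status, data):
    """Assemble a check record (hypothetical helper, not part of accre)."""
    return {
        "node": socket.gethostname(),        # hostname of the node running the check
        "date": datetime.now().isoformat(),  # ISO 8601 timestamp; local time is an assumption
        "name": name,
        "status": status,
        "data": data,
    }


record = build_record("loadavg", "OK", [1.13, 1.36, 1.42])
print(json.dumps(record, indent=2))
```

Each configured output would then receive this same object and decide how to serialize or transmit it.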

The status string from the check should be one of OK, WARNING, CRITICAL, or UNKNOWN. The exit code of the cli command will be determined based on this string as well as the configured monitoring system, such as nagios, which maps these strings to exit codes.

Functions to serve as monitoring checks should be added to a module in the monitor_checks package so that they will automatically appear as a CLI subcommand. A simple example check for load averages is as follows:

import os

from accre.monitor import monitor_command


@monitor_command
def loadavg(opts):
    """
    Check the load average (1, 5, 15 minutes) of the server.

    A JSON option of 'warning' and/or 'critical' may be given with
    a list of 1,5,15 min averages above which the check will return
    a warning or critical status
    """
    data = os.getloadavg()
    status = 'OK'
    if 'warning' in opts:
        for idx, val in enumerate(opts['warning']):
            if data[idx] > val:
                status = 'WARNING'
    if 'critical' in opts:
        for idx, val in enumerate(opts['critical']):
            if data[idx] > val:
                status = 'CRITICAL'
    return status, data

If the stdout output is configured, the check may be called as follows and returns the record shown:

$ accre-monitor loadavg
{
  "node": "vmps13",
  "date": "2017-12-11T21:24:28.496027",
  "name": "loadavg",
  "status": "OK",
  "data": [
    1.134765625,
    1.36279296875,
    1.41650390625
  ]
}

or this can be called with options for warning and critical limits:

$ accre-monitor --options '{"warning": [1, 1, 1], "critical": [2, 2, 2]}' loadavg
{
  "status": "WARNING",
  "date": "2017-12-11T21:24:18.842369",
  "node": "dhcp-129-59-132-42.n1.vanderbilt.edu",
  "data": [
    1.07275390625,
    1.35888671875,
    1.416015625
  ],
  "name": "loadavg"
}

For reasonably simple command options, the JSON format can be replaced with additional options of the form --option value. These values arrive as strings, but the helper function util.interpret_string_value will interpret them: a value that can be converted to a number is converted, and a value that contains commas is interpreted as a list. The above command may therefore be rewritten as accre-monitor loadavg --warning 1,1,1 --critical 2,2,2. This method of specifying options will not work for more complex options such as nested JSON objects or lists, or for values that contain whitespace or special characters. If the value is missing for one of these additional options, it is set to None, which allows the option to be used as a flag.
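The conversions described above might look roughly like the following sketch. This is a hypothetical stand-in written for illustration, not the code of util.interpret_string_value; the order of the conversions (int before float) and the recursive handling of list items are assumptions.

```python
def interpret_value(value):
    """Hypothetical stand-in for the conversions described in the text."""
    if value is None:   # option given without a value, i.e. used as a flag
        return None
    if "," in value:    # comma-separated values become a list
        return [interpret_value(item) for item in value.split(",")]
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            continue
    return value        # anything else stays a string


print(interpret_value("1,1,1"))  # [1, 1, 1]
print(interpret_value("2.5"))    # 2.5
print(interpret_value("eth0"))   # eth0
```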

Setting an output of nagios-stdout will return the information to STDOUT in a manner suitable for a nagios check, with a pipe separating a human readable short description from a single line JSON payload and then additional data split on the lines below. Setting an output of nagios-base64-stdout will return the JSON payload in a base64-encoded form.
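A minimal sketch of the layout described above follows. The function name, the exact spacing around the pipe, and the placement of the long text are assumptions made for illustration; the real outputs may differ in detail.

```python
import base64
import json


def format_nagios_output(short_text, record, long_text="", encode=False):
    """Hypothetical formatter: 'short text | payload', then extra lines."""
    payload = json.dumps(record)  # json.dumps emits a single line by default
    if encode:
        # Base64 keeps the payload free of characters that could confuse parsers
        payload = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    first_line = f"{short_text} | {payload}"
    return first_line + ("\n" + long_text if long_text else "")


print(format_nagios_output("load ok", {"status": "OK"}, "1 min: 1.13\n5 min: 1.36"))
```

Passing encode=True corresponds loosely to the nagios-base64-stdout variant, where the receiver must decode the payload before parsing it as JSON.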

Setting an output of nagios-nsca will send the information to the nagios server as an NSCA check. This option will only work with certain auditor checks configured on the nagios server.

Status check results can be posted to slack in the #monitor channel by adding slack-all as an output. To only post for warnings or above, use slack-warning, and use slack-critical to post only critical or unknown checks.
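The three Slack outputs amount to a severity threshold, which can be sketched as follows. The numeric ranks and the function name are assumptions made for this example, including the choice to treat UNKNOWN as at least as severe as CRITICAL.

```python
# Severity ranks are an assumption for this sketch; UNKNOWN is ranked with
# CRITICAL because slack-critical is described as posting both.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 2}
THRESHOLD = {"slack-all": 0, "slack-warning": 1, "slack-critical": 2}


def should_post(output, status):
    """Return True if a result with this status would be posted by the output."""
    return SEVERITY[status] >= THRESHOLD[output]


print(should_post("slack-warning", "OK"))        # False
print(should_post("slack-critical", "UNKNOWN"))  # True
```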

In order to improve the nagios (or other) output for human readability, a check may optionally return additional items in its output tuple. A third item if found will be a short (one-line) text description of the status, and a fourth will be a longer (multi-line) description.
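For instance, a check returning all four items could look like the sketch below. The check itself (disk_usage), its default limit of 90 percent, and the wording of the descriptions are hypothetical; inside the package the function would additionally be decorated with monitor_command.

```python
import shutil


def disk_usage(opts):
    """Report how full the root filesystem is (hypothetical example check)."""
    usage = shutil.disk_usage("/")
    percent = 100.0 * usage.used / usage.total
    limit = opts.get("warning", 90)  # default limit is an assumption
    status = "WARNING" if percent > limit else "OK"
    data = {"total": usage.total, "used": usage.used, "free": usage.free}
    # Third item: one-line summary; fourth item: multi-line detail
    short = f"root filesystem {percent:.1f}% used"
    long = "\n".join(f"{key}: {value} bytes" for key, value in data.items())
    return status, data, short, long


status, data, short, long = disk_usage({})
print(short)
print(long)
```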

The docstring of each monitoring check will also be interpreted and placed in the CLI help. All text in the docstring preceding a double newline will be shown for the command help, i.e. accre-monitor -h, and the full docstring will be shown for the subcommand help, i.e. accre-monitor subcommand -h.
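The split between the short and the full help text can be sketched as below. The helper name split_help is hypothetical, and the use of inspect.cleandoc to normalize indentation is an assumption about how the docstring is prepared.

```python
import inspect


def split_help(func):
    """Return (summary, full_text) from a function's docstring."""
    doc = inspect.cleandoc(func.__doc__ or "")
    summary = doc.split("\n\n", 1)[0]  # text before the first blank line
    return summary, doc


def loadavg(opts):
    """
    Check the load average (1, 5, 15 minutes) of the server.

    A JSON option of 'warning' and/or 'critical' may be given.
    """


summary, full = split_help(loadavg)
print(summary)
```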

Status checks that begin with auditor are reserved for the auditor to run and will not be added in the CLI unless the configuration file has an auditor field set to true in the monitoring section.

In order to prevent stale checks that are stuck executing from piling up on a node, an optional locking mechanism is available using flock(2) and a configurable directory for lock files. If used, the locks ensure that no more than one check of a given type is running on a given node for a given user. To use this feature, add a lockdir field to the configuration file specifying the directory in which to keep lock files. If the value is set to [DEFAULT], the directory will be /tmp/<<USERNAME>>/accre-monitor. When configured, the lock directory will be created if it does not already exist.
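The general technique, a non-blocking exclusive flock on a per-check lock file, can be sketched as follows. The function name try_lock and the lock file naming are hypothetical and do not reflect the package's internals; the sketch assumes a POSIX system where the fcntl module is available.

```python
import fcntl
import os
import tempfile


def try_lock(lockdir, check_name):
    """Return an open, locked file handle, or None if the lock is already held."""
    os.makedirs(lockdir, exist_ok=True)                 # create the lock directory if needed
    path = os.path.join(lockdir, check_name + ".lock")  # hypothetical file naming
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle  # the lock is held until this handle is closed


lockdir = tempfile.mkdtemp()
first = try_lock(lockdir, "loadavg")
second = try_lock(lockdir, "loadavg")  # same check: lock is already held
other = try_lock(lockdir, "disk")      # different check: separate lock file
print(first is not None, second is None, other is not None)
```

Because the operating system releases the lock when the holding process exits, a crashed check does not leave a stale lock behind, while a check that is stuck keeps later runs of the same check from starting.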

accre.monitor.EXITCODE_MAPS = {'nagios': {'CRITICAL': 2, 'OK': 0, 'UNKNOWN': 3, 'WARNING': 1}}

Mappings between status strings and the exit codes expected by monitoring systems (e.g. nagios).
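A lookup against this mapping might be written as in the sketch below. The mapping is copied from the value shown above so that the example is self-contained; the helper name exit_code_for and the fallback to UNKNOWN for unrecognized status strings are assumptions.

```python
# Copied from the documented value of accre.monitor.EXITCODE_MAPS
EXITCODE_MAPS = {"nagios": {"CRITICAL": 2, "OK": 0, "UNKNOWN": 3, "WARNING": 1}}


def exit_code_for(status, system="nagios"):
    """Translate a status string into an exit code (hypothetical helper)."""
    codes = EXITCODE_MAPS[system]
    # Falling back to UNKNOWN for unexpected strings is an assumption
    return codes.get(status, codes["UNKNOWN"])


print(exit_code_for("WARNING"))  # 1
print(exit_code_for("bogus"))    # 3
```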

accre.monitor.complete_check(name, check_retval, outputs=None, runtime=None)[source]

Build the json object for the check record, send it to all configured data stores or outputs, and return an exit code determined by the check status and configured monitoring system.

This function can be called by stand-alone checks, written as individual scripts outside the accre package, to propagate their results in the same way as the checks in this module. Calling this function will exit the interpreter.

Parameters:
  • name (str) – Name of the check being run

  • status (str) – Status of the check - OK, WARNING, CRITICAL, or UNKNOWN

  • check_retval (tuple(str)) – Return value of the check, of which the first item is the status string, the second is any object serializable as JSON with more detailed information about the monitoring check result, the third (optional) is a short one-line text description of the result, and the fourth (optional) is a long multi-line description.

  • outputs (list(str)) – Outputs to override those set in the configuration

  • runtime (float) – The wall clock execution time of the check, in seconds
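A stand-alone script using this function might be organized as in the following sketch. The check tmp_writable and its name string are hypothetical; the call to complete_check uses only the parameters documented above, and the import is placed inside main() so that the check function can be exercised without the accre package installed.

```python
import tempfile
import time


def tmp_writable():
    """Hypothetical stand-alone check: can a file be created in the temp dir?"""
    directory = tempfile.gettempdir()
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as exc:
        return "CRITICAL", {"dir": directory, "error": str(exc)}, f"{directory} not writable"
    return "OK", {"dir": directory}, f"{directory} is writable"


def main():
    # Requires the accre package; imported here so the check can be tested alone
    from accre.monitor import complete_check

    start = time.time()
    retval = tmp_writable()
    # complete_check exits the interpreter, so nothing after this call runs
    complete_check("tmp_writable", retval, outputs=["stdout"], runtime=time.time() - start)
```

The script would finish by calling main(); since complete_check exits the interpreter, any cleanup has to happen before that call.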

accre.monitor.maybe_acquire_lock(check_name, flock_id=None)[source]

Attempt to acquire a monitoring flock for the given check_name. Returns a tuple in the same format as a normal status check so that it can be passed directly to monitoring check completion. In the case of success, the status is ‘OK’, and in case of failure it is ‘CRITICAL’. Information about any exception raised will be passed as data.

Parameters:
  • check_name (str) – Name of the monitoring check for which a lock should be acquired

  • flock_id (int) – An ID number for the check so that a different flock is set for each check ID. This is useful for when you want multiple checks (up to some limit) to be executable in parallel on a given host.

Returns:

Tuple of lock acquisition status with similar format to a monitoring check

Return type:

tuple(str)

accre.monitor.monitor_command(f)[source]

Register a function to be available in the CLI as a monitoring check.

Any monitoring command should accept a single argument of type argparse.Namespace containing the parsed arguments given to the check. Use of this argument is optional and it may be ignored.

The command should return a tuple of two elements: a status string of “OK”, “WARNING”, “CRITICAL”, or “UNKNOWN”, followed by a JSON-serializable object (e.g. a simple dict) with check-specific information.

The docstring of the check will become the CLI help message for the check.