Source code for accre.monitor

'''
Framework for executing local monitoring checks through systems like
nagios, returning results, and also uploading to a document store
database such as elasticsearch, redis, or couch.

Checks can be run with the accre-monitor cli endpoint. Each check is
a simple python function which takes a single argument of an options
payload passed through the command line as JSON, and returns a tuple
of a status string and some JSON-serializable data (dict, etc...).
A function is used as a check if decorated with ``monitor_command``.

The status and data from a check are placed into a JSON object along
with the time and hostname of the node. This object is sent to one
or more configured outputs. For example, ``stdout`` will return the
data to standard output. Additional outputs, such as elasticsearch
databases, may be added as needed.

The status string from the check should be one of OK, WARNING, CRITICAL,
or UNKNOWN. The exit code of the cli command will be determined
based on this string as well as the configured monitoring system, such
as nagios, which maps these strings to exit codes.

Functions to serve as monitoring checks should be added to a module in the
``monitor_checks`` package so that they will automatically appear as a CLI
subcommand.
A simple example check for load averages is as follows:

.. code-block:: python

    @monitor_command
    def loadavg(opts):
        """
        Check the load average (1, 5, 15 minutes) of the server.

        A JSON option of 'warning' and/or 'critical' may be given with
        a list of 1,5,15 min averages above which the check will return
        a warning or critical status
        """
        data = os.getloadavg()
        status = 'OK'
        if 'warning' in opts:
            for idx, val in enumerate(opts['warning']):
                if data[idx] > val:
                    status = 'WARNING'
        if 'critical' in opts:
            for idx, val in enumerate(opts['critical']):
                if data[idx] > val:
                    status = 'CRITICAL'
        return status, data

If configured with the ``stdout`` output, this check may be called as
follows and will print the result shown:

.. code-block:: none

    $ accre-monitor loadavg
    {
      "node": "vmps13",
      "date": "2017-12-11T21:24:28.496027",
      "name": "loadavg",
      "status": "OK",
      "data": [
        1.134765625,
        1.36279296875,
        1.41650390625
      ]
    }

or this can be called with options for warning and critical limits:

.. code-block:: none

    $ accre-monitor --options '{"warning": [1, 1, 1], "critical": [2, 2, 2]}' loadavg
    {
      "status": "WARNING",
      "date": "2017-12-11T21:24:18.842369",
      "node": "dhcp-129-59-132-42.n1.vanderbilt.edu",
      "data": [
        1.07275390625,
        1.35888671875,
        1.416015625
      ],
      "name": "loadavg"
    }

For reasonably simple command options, the JSON format can be replaced with 
additional options of the form ``--option value``. These will be reported as
strings, but there is a helper function ``util.interpret_string_value`` which
will interpret the values such that if the value can be converted
to a number it will be, and if ``value`` contains commas it will be
interpreted as a list. So the above command may be rewritten as
``accre-monitor loadavg --warning 1,1,1 --critical 2,2,2``.
This method of specifying options will not work for more complex options such as
nested JSON objects or lists, or values that contain whitespace or 
special characters. If the value is missing for one of these additional
options, its value will be set as None, allowing for the use of flags.
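
The conversion rules described above can be sketched roughly as follows. This
is only an illustration of the stated behavior, not the actual
``util.interpret_string_value`` implementation; the name
``interpret_value_sketch`` and the choice to try ``int`` before ``float`` are
assumptions.

```python
def interpret_value_sketch(value):
    """Hypothetical stand-in for util.interpret_string_value."""
    # A bare flag arrives with no value at all.
    if value is None:
        return None
    # Commas mark a list; interpret each element on its own.
    if ',' in value:
        return [interpret_value_sketch(item) for item in value.split(',')]
    # Try the numeric types in order, falling back to the raw string.
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            continue
    return value


print(interpret_value_sketch('1,1,1'))   # [1, 1, 1]
print(interpret_value_sketch('2.5'))     # 2.5
print(interpret_value_sketch('vmps13'))  # vmps13
```

Under these assumptions, ``--warning 1,1,1`` would reach the check as a list
of integers, matching the JSON form of the same option.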

Setting an output of ``nagios-stdout`` will return the information to STDOUT in a manner suitable
for a nagios check, with a pipe separating a human readable short description from a single
line JSON payload and then additional data split on the lines below. Setting an output
of ``nagios-base64-stdout`` will return the JSON payload in a base64-encoded form.
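
A consumer of the ``nagios-base64-stdout`` output could recover the payload
along these lines. This is a sketch that assumes the first output line has the
form ``<short description> | <base64 JSON>``, as printed by ``_nagios_output``
in this module; the helper name ``decode_nagios_line`` is hypothetical.

```python
import base64
import json


def decode_nagios_line(line):
    """Split a nagios-style first line and decode its base64 JSON payload."""
    # Split on the last ' | ' so a pipe inside the description is harmless;
    # the base64 alphabet itself never contains a pipe or a space.
    short_msg, _, payload = line.rpartition(' | ')
    result = json.loads(base64.b64decode(payload).decode('utf-8'))
    return short_msg, result


# Build a sample line the same way the module does, then round-trip it.
record = {'name': 'loadavg', 'status': 'OK', 'data': [1.0, 1.2, 1.4]}
encoded = base64.b64encode(json.dumps(record).encode('utf-8')).decode('utf-8')
short_msg, decoded = decode_nagios_line('loadavg - OK | ' + encoded)
print(short_msg)          # loadavg - OK
print(decoded['status'])  # OK
```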

Setting an output of ``nagios-nsca`` will send the information to the nagios
server as an NSCA check. This option will only work with certain auditor checks
configured on the nagios server.

Status check results can be posted to slack in the #monitor channel by adding
``slack-all`` as an output. To only post for warnings or above, use
``slack-warning``, and use ``slack-critical`` to post only critical or
unknown checks.

To make the nagios (or other) output easier for humans to read, a check may
optionally return additional items in its output tuple. A third item, if
present, is a short (one-line) text description of the status, and a fourth
is a longer (multi-line) description.
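
A check returning all four items might look like the sketch below. The
``monitor_command`` defined here is only a stand-in so the example runs on its
own, and both the check name ``loadavg_verbose`` and its single-number
``warning`` option are hypothetical.

```python
import os


def monitor_command(f):
    # Stand-in decorator so this sketch runs on its own; a real check
    # would use the decorator provided by this module.
    return f


@monitor_command
def loadavg_verbose(opts):
    """
    Report load averages with human-readable descriptions.
    """
    data = os.getloadavg()
    status = 'OK'
    # Hypothetical option: a single threshold for the 1 minute average.
    if 'warning' in opts and data[0] > opts['warning']:
        status = 'WARNING'
    # Third item: one-line summary shown next to the status.
    short_text = '1 min load average is {0:.2f}'.format(data[0])
    # Fourth item: multi-line detail, one line per averaging window.
    labels = ('1 min', '5 min', '15 min')
    long_text = os.linesep.join(
        '{0}: {1:.2f}'.format(label, val) for label, val in zip(labels, data)
    )
    return status, data, short_text, long_text


print(loadavg_verbose({}))
```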

The docstring of each monitoring check will also be interpreted and placed 
in the CLI help. All text in the docstring preceding a double-newline will
be shown for the command help, i.e. ``accre-monitor -h`` and the full docstring
will be shown for the subcommand help, i.e. ``accre-monitor subcommand -h``.

Status checks that begin with ``auditor`` are reserved for the auditor to
run and will not be added in the CLI unless the configuration file has
an auditor field set to true in the monitoring section.

In order to prevent stale checks that are stuck executing from piling
up on a node, an optional locking mechanism is available using ``flock(2)``
and a configurable directory for lock files. If used, flocks will ensure that
there is never more than one check of a given type running on
a given node by a given user. To use this feature, add a ``lockdir`` field
to the configuration file specifying the directory to keep lock files.
If the value is set to ``[DEFAULT]`` it will be set to
``/tmp/<<USERNAME>>/accre-monitor``.
The lock directory will be created if it does not already exist when
configured.
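
The locking behavior described above can be illustrated with a small
stand-alone sketch. It assumes a platform where ``fcntl.flock`` is available;
the helper ``try_lock`` is hypothetical, and a temporary directory stands in
for the configured ``lockdir``.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open, locked file object, or None if already locked."""
    handle = open(path, 'w+')
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    # The caller must keep this object alive; closing it drops the lock.
    return handle


lockdir = tempfile.mkdtemp()  # stand-in for the configured lockdir
lockfile = os.path.join(lockdir, 'loadavg')
first = try_lock(lockfile)
second = try_lock(lockfile)   # attempted while the first lock is held
print(first is not None, second is None)
first.close()
```

Because the lock belongs to the open file, it is released automatically when
the process exits, which is why a stalled check keeps later runs of the same
check from starting until it is cleaned up.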
'''
import base64
import datetime
import fcntl
import getpass
import json
import os
import os.path
import socket
import sys
import collections.abc

import accre.util
from accre.nagios import send_nsca_notification
import accre.slack as slack
from accre.config import get_config
from accre.util import RedStr, YellowStr, GreenStr, PurpleStr

CONFIG = get_config()


#: Mappings between status strings and exit codes expected
#: by monitoring systems (e.g. nagios)
EXITCODE_MAPS = {
    'nagios': {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}
}


def monitor_command(f):
    """
    Register a function to be available in the CLI as a monitoring check.

    Any monitoring command should accept a single argument of a
    ``argparse.Namespace`` type which is the parsed arguments given to the
    check. Use of this argument is optional and it may be ignored.

    The command should return a tuple of two elements with a status string of
    "OK", "WARNING", "CRITICAL", or "UNKNOWN" and then a json serializable
    object (i.e. simple dict, etc...) with check-specific information.

    The docstring of the check will become the CLI help message for the
    check.
    """
    MONITORING_COMMANDS.append(f)
    return f


# Internal list of monitoring commands populated by the
# monitor_command decorator
MONITORING_COMMANDS = []


def maybe_acquire_lock(check_name, flock_id=None):
    """
    Attempt to acquire a monitoring flock for the given check_name.
    Returns a tuple of information as would a normal status check
    for easy passage to monitoring check completion. In the case
    of success, the status is 'OK', and in case of failure it is
    'CRITICAL'. Information about any exception returned will be passed
    as data.

    :param str check_name: Name of the monitoring check for which a
        lock should be acquired
    :param int flock_id: An ID number for the check so that a different
        flock is set for each check ID. This is useful for when you
        want multiple checks (up to some limit) to be executable in
        parallel on a given host.
    :returns: Tuple of lock acquisition status with similar format to
        a monitoring check
    :rtype: tuple(str) 
    """
    lockname = check_name
    if flock_id is not None:
        lockname += '.{0}'.format(flock_id)
    global _persist_lock
    lockdir = CONFIG['monitor']['lockdir']
    if lockdir == '[DEFAULT]':
        lockdir = os.path.join('/tmp', getpass.getuser(), 'accre-monitor')
    lockfile = os.path.join(lockdir, lockname)
    try:
        if not os.path.exists(lockdir):
            os.makedirs(lockdir)
        _persist_lock = open(lockfile, 'w+')
        fcntl.flock(_persist_lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return 'OK', 'OK'
    except Exception as e:
        data = {
            'lock_failure': (
                'Unable to acquire lock file {0} to run this check'
                .format(lockfile)
            ),
            'suggestion': 'Check for stalled monitor check processes',
            'exception': str(e)
        }
        return 'CRITICAL', data


# placeholder reference to hold lock file open for duration of process
_persist_lock = None


def complete_check(name, check_retval, outputs=None, runtime=None):
    """
    Build the json object for the check record, send it to
    all configured data stores or outputs, and return an
    exit code determined by the check status and configured
    monitoring system.

    This function can be called by stand-alone checks written
    as individual scripts outside the accre package to
    propagate information in the same way as the checks in this
    module. Calling this function will exit the interpreter.

    :param str name: Name of the check being run
    :param tuple(str) check_retval: Return value of the check, of which
        the first item is the status code, the second is any object
        serializable as JSON with more detailed information about the
        monitoring check result, the third (optional) is a short
        one-line text description of the result, and the fourth (optional)
        is a long multi-line description.
    :param list(str) outputs: Outputs to override those set in the
        configuration
    :param float runtime: The wall clock execution time of the check,
        in seconds
    """
    status, data, short_text, long_text = _process_check_retval(check_retval)
    checktime = accre.util.utcnow().isoformat()
    host = socket.gethostname()
    result = {
        'node': host,
        'date': checktime,
        'status': status,
        'data': data,
        'name': name,
    }
    if runtime is not None:
        result['runtime'] = runtime

    if outputs is None:
        outputs = CONFIG['monitor']['outputs'].split(',')

    if 'stdout' in outputs:
        print(json.dumps(result, indent=2))
    if 'nagios-stdout' in outputs:
        _nagios_output(result, short_text=short_text, long_text=long_text)
    if 'nagios-base64-stdout' in outputs:
        _nagios_output(result, short_text=short_text, long_text=long_text, b64=True)
    if 'nagios-nsca' in outputs:
        _nagios_nsca_output(result)
    if 'human-stdout' in outputs:
        _human_output(result, short_text=short_text, long_text=long_text)
    if 'slack-all' in outputs:
        _slack_output(result, short_text=short_text, level='OK')
    if 'slack-warning' in outputs:
        _slack_output(result, short_text=short_text, level='WARNING')
    if 'slack-critical' in outputs:
        _slack_output(result, short_text=short_text, level='CRITICAL')
    if 'slack-auditor' in outputs:
        _slack_output(
            result,
            pretext=short_text,
            channel='auditor',
            level='OK',
            username='auditor',
            icon='jarjar'
        )

    exitmap = EXITCODE_MAPS[CONFIG['monitor']['system']]
    if status not in exitmap:
        status = 'UNKNOWN'
    sys.exit(exitmap[status])


def _process_check_retval(retval):
    """
    Split the check return value into status, data,
    short, and long text components
    """
    # this is wrong, but we'll accept a string as
    # a status code
    if isinstance(retval, (str, bytes)):
        return retval, None, None, None

    result = [None, None, None, None]
    for idx in range(len(retval)):
        if idx > 3:
            break
        result[idx] = retval[idx]
    return tuple(result)


def _nagios_output(result, short_text=None, long_text=None, b64=False):
    """
    Print output in nagios format to STDOUT with full JSON result
    as $SERVICEPERFDATA$, the check status as $SERVICEOUTPUT$,
    and short and long descriptions added to the $SERVICEOUTPUT$
    and $LONGSERVICEOUTPUT$ respectively. If long_text is None, use
    the information in the data field as $LONGSERVICEOUTPUT$, separated
    by line if a mapping or iterable type, but print no additional lines
    if data is None.
    """
    if short_text is None:
        short_msg = '{0} - {1}'.format(result['name'], result['status'])
    else:
        short_msg = (
            '{0} - {1} - {2}'
            .format(result['name'], result['status'], short_text)
        )
    data = json.dumps(result)
    if b64:
        data = base64.b64encode(bytes(data, encoding='utf-8')).decode('utf-8')
    print('{0} | {1}'.format(short_msg, data))

    if long_text is None:
        if result['data'] is None:
            return
        elif isinstance(result['data'], (str, bytes)):
            print(result['data'])
        elif isinstance(result['data'], collections.abc.Mapping):
            for key in result['data']:
                print('{0}: {1}'.format(key, result['data'][key]))
        elif isinstance(result['data'], collections.abc.Iterable):
            for item in result['data']:
                print(item)
        else:
            print(result['data'])
    else:
        print(long_text)


def _nagios_nsca_output(result):
    """
    Send NSCA-formatted output to the nagios server.
    """
    if result['status'] in EXITCODE_MAPS['nagios']:
        status = EXITCODE_MAPS['nagios'][result['status']]
    else:
        status = 3
    host = result['node'].split('.')[0]
    send_nsca_notification(host, result['name'], status, result['data'])


def _human_output(result, short_text=None, long_text=None):
    """
    Print information in a human readable format, using only the
    status code with ANSI colors, the short and long text, and
    if long text is not available split out the data as with
    :func:`accre.monitor._nagios_output`.
    """
    if result['status'] == 'OK':
        print('[ {0} ]'.format(GreenStr('OK')), end=' ')
    elif result['status'] == 'WARNING':
        print('[ {0} ]'.format(YellowStr('WARNING')), end=' ')
    elif result['status'] == 'CRITICAL':
        print('[ {0} ]'.format(RedStr('CRITICAL')), end=' ')
    else:
        print('[ {0} ]'.format(PurpleStr('UNKNOWN')), end=' ')

    if short_text is not None:
        print(short_text)
    else:
        print('')

    if long_text is None:
        if result['data'] is None:
            return
        elif isinstance(result['data'], (str, bytes)):
            print(result['data'])
        elif isinstance(result['data'], collections.abc.Mapping):
            for key in result['data']:
                print('{0}: {1}'.format(key, result['data'][key]))
        elif isinstance(result['data'], collections.abc.Iterable):
            for item in result['data']:
                print(item)
        else:
            print(result['data'])
    else:
        print(long_text)


def _slack_output(
    result,
    pretext=None,
    short_text=None,
    level='CRITICAL',
    channel='monitor',
    username=None,
    icon=None    
):
    """
    Post a formatted message to slack if the result code is equal to or
    higher than level (OK < WARNING < CRITICAL < UNKNOWN). The result
    data is split into message attachment items. If pretext is given,
    that becomes the message pretext verbatim. If short_text is given,
    this becomes the message pretext along with the node and check name.
    """
    level_map = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}
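    # Treat any unrecognized status as UNKNOWN rather than raising KeyError
    status_level = level_map.get(result['status'], level_map['UNKNOWN'])
    if status_level < level_map[level]:
        return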

    colors = {
        'OK': 'good', 'WARNING': 'warning', 'CRITICAL': 'danger',
        'UNKNOWN': '#E066FF'
    }


    items = {'status check': 'No printable data'}
    if isinstance(result['data'], (str, bytes)):
        items = {'status check': str(result['data'])}
    elif isinstance(result['data'], collections.abc.Mapping):
        items = result['data']
    elif isinstance(result['data'], collections.abc.Iterable):
        items = {}
        for idx, item in enumerate(result['data']):
            items['Item {}'.format(idx + 1)] = item

    if pretext is None:
        pretext = 'Check {} run on {}: {}'.format(
            result['name'],
            result['node'],
            short_text
        )

    footer = None
    # 'runtime' is only present in the result when a runtime was supplied
    if result.get('runtime') is not None:
        footer = 'Execution time: {0} seconds'.format(result['runtime'])

    slack.status_message(
        pretext,
        items=items,
        color=colors.get(result['status'], colors['UNKNOWN']),
        channel=channel,
        username=username,
        icon=icon,
        footer=footer
    )