'''
Framework for executing local monitoring checks through systems like
nagios, returning results, and also uploading to a document store
database such as elasticsearch, redis, or couch.

Checks can be run with the accre-monitor cli endpoint. Each check is
a simple python function which takes a single argument of an options
payload passed through the command line as JSON, and returns a tuple
of a status string and some JSON-serializable data (dict, etc...).
A function is used as a check if decorated with ``monitor_command``.

The status and data from a check are placed into a JSON object along
with the time and hostname of the node. This object is sent to one
or more configured outputs. For example, ``stdout`` will print the
data to standard output. Additional outputs, such as elasticsearch
databases, may be added as needed.

The status string from the check should be one of OK, WARNING, CRITICAL,
or UNKNOWN. The exit code of the cli command is determined by this
string together with the configured monitoring system, such as nagios,
which maps these strings to exit codes.

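
As a minimal sketch of that mapping (the helper name ``exit_code_for`` is
hypothetical and is not part of this module), an unrecognized status string
can be treated as UNKNOWN before the lookup:

```python
# Illustrative only: exit_code_for is a hypothetical helper, not module API.
# The table mirrors the nagios entry of EXITCODE_MAPS.
NAGIOS_EXIT_CODES = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}


def exit_code_for(status, exitmap=NAGIOS_EXIT_CODES):
    """Return the exit code for a status string, defaulting to UNKNOWN."""
    if status not in exitmap:
        status = 'UNKNOWN'
    return exitmap[status]
```

With this table, ``exit_code_for('WARNING')`` gives 1 and any status outside
the four recognized strings gives 3.
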
Functions to serve as monitoring checks should be added to a module in the
``monitor_checks`` package so that they will automatically appear as a CLI
subcommand.

A simple example check for load averages is as follows:

.. code-block:: python

    @monitor_command
    def loadavg(opts):
        """
        Check the load average (1, 5, 15 minutes) of the server.

        A JSON option of 'warning' and/or 'critical' may be given with
        a list of 1,5,15 min averages above which the check will return
        a warning or critical status
        """
        data = os.getloadavg()
        status = 'OK'
        if 'warning' in opts:
            for idx, val in enumerate(opts['warning']):
                if data[idx] > val:
                    status = 'WARNING'
        if 'critical' in opts:
            for idx, val in enumerate(opts['critical']):
                if data[idx] > val:
                    status = 'CRITICAL'
        return status, data

This may be called as follows, if configured with the stdout output, and
will return the following:

.. code-block:: none

    $ accre-monitor loadavg
    {
      "node": "vmps13",
      "date": "2017-12-11T21:24:28.496027",
      "name": "loadavg",
      "status": "OK",
      "data": [
        1.134765625,
        1.36279296875,
        1.41650390625
      ]
    }

or this can be called with options for warning and critical limits:

.. code-block:: none

    $ accre-monitor --options '{"warning": [1, 1, 1], "critical": [2, 2, 2]}' loadavg
    {
      "status": "WARNING",
      "date": "2017-12-11T21:24:18.842369",
      "node": "dhcp-129-59-132-42.n1.vanderbilt.edu",
      "data": [
        1.07275390625,
        1.35888671875,
        1.416015625
      ],
      "name": "loadavg"
    }

For reasonably simple command options, the JSON format can be replaced with
additional options of the form ``--option value``. These will be reported as
strings, but the helper function ``util.interpret_string_value`` will
interpret the values so that a value that can be converted to a number is
converted, and a value containing commas is interpreted as a list. So the
above command may be rewritten as
``accre-monitor loadavg --warning 1,1,1 --critical 2,2,2``.

This method of specifying options will not work for more complex options
such as nested JSON objects or lists, or values that contain whitespace or
special characters. If the value is missing for one of these additional
options, its value will be set to None, allowing for the use of flags.

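
The conversion rules above can be sketched as follows. This is not the
actual ``util.interpret_string_value`` implementation, which is not shown
here; ``interpret_value`` is a hypothetical stand-in that assumes integers
are tried before floats and that list items are converted individually:

```python
def interpret_value(value):
    """
    Hypothetical stand-in for util.interpret_string_value.

    None (a bare flag) is passed through, comma separated values become
    a list, numeric strings become numbers, anything else stays a string.
    """
    if value is None:
        return None
    if ',' in value:
        return [interpret_value(item) for item in value.split(',')]
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value
```

Under these assumptions ``--warning 1,1,1`` would reach the check as
``[1, 1, 1]`` and ``--critical 2.5`` as ``2.5``.
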
Setting an output of ``nagios-stdout`` will return the information to STDOUT
in a manner suitable for a nagios check, with a pipe separating a human
readable short description from a single line JSON payload and then
additional data split on the lines below. Setting an output of
``nagios-base64-stdout`` will return the JSON payload in a base64-encoded
form.

Setting an output of ``nagios-nsca`` will send the information to the nagios
server as an NSCA check. This option will only work with certain auditor
checks configured on the nagios server.

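
To illustrate the first line of the ``nagios-stdout`` format, here is a
small sketch modelled on the private ``_nagios_output`` function in this
module. The name ``nagios_line`` is hypothetical, and the function returns
the line rather than printing it so that it is easy to inspect:

```python
import base64
import json


def nagios_line(result, short_text=None, b64=False):
    """Build the 'description | payload' line of a nagios check result."""
    summary = '{0} - {1}'.format(result['name'], result['status'])
    if short_text is not None:
        summary = '{0} - {1}'.format(summary, short_text)
    payload = json.dumps(result)
    if b64:
        # nagios-base64-stdout: encode the JSON payload
        payload = base64.b64encode(payload.encode('utf-8')).decode('utf-8')
    return '{0} | {1}'.format(summary, payload)
```

A consumer can split the line on the first `` | `` and, for the base64
variant, decode the second part before parsing it as JSON.
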
Status check results can be posted to slack in the #monitor channel by adding
``slack-all`` as an output. To only post for warnings or above, use
``slack-warning``, and use ``slack-critical`` to post only critical or
unknown checks.

In order to improve the nagios (or other) output for human readability, a
check may optionally return additional items in its output tuple. A third
item, if present, is a short (one-line) text description of the status,
and a fourth is a longer (multi-line) description.

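
A check that returns all four items might look like the sketch below. The
check name ``loadavg_verbose`` and its scalar ``warning``/``critical``
options are assumptions made for this example; in the package the function
would also be decorated with ``monitor_command``:

```python
import os


def loadavg_verbose(opts):
    """Check the 1 minute load average and describe the result."""
    one, five, fifteen = os.getloadavg()
    status = 'OK'
    # Thresholds are optional; without them the status stays OK
    if one > opts.get('warning', float('inf')):
        status = 'WARNING'
    if one > opts.get('critical', float('inf')):
        status = 'CRITICAL'
    data = {'1min': one, '5min': five, '15min': fifteen}
    # Third item: one-line summary
    short_text = 'load average is {0:.2f}'.format(one)
    # Fourth item: one line per value
    long_text = os.linesep.join(
        '{0}: {1:.2f}'.format(key, value) for key, value in data.items()
    )
    return status, data, short_text, long_text
```

Checks that return only a status and data remain valid; the extra items
are optional.
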
The docstring of each monitoring check will also be interpreted and placed
in the CLI help. All text in the docstring preceding a double-newline will
be shown for the command help, i.e. ``accre-monitor -h``, and the full
docstring will be shown for the subcommand help, i.e.
``accre-monitor subcommand -h``.

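
One way such a split could be done is sketched below. This is an
assumption about the approach and not the CLI's actual code;
``split_help`` is a hypothetical name:

```python
import inspect


def split_help(func):
    """
    Return (summary, full_text) for a check function's docstring.

    The summary is everything before the first blank line, joined
    into a single line.
    """
    doc = inspect.cleandoc(func.__doc__ or '')
    summary = []
    for line in doc.splitlines():
        if not line.strip():
            break
        summary.append(line.strip())
    return ' '.join(summary), doc
```

Keeping the first paragraph of a check docstring short therefore keeps the
top-level help listing compact.
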
Status checks that begin with ``auditor`` are reserved for the auditor to
run and will not be added in the CLI unless the configuration file has
an auditor field set to true in the monitoring section.

In order to prevent stale checks that are stuck executing from piling
up on a node, an optional locking mechanism is available using ``flock(2)``
and a configurable directory for lock files. If used, flocks will ensure that
there is never more than one check of a given type running on
a given node by a given user. To use this feature, add a ``lockdir`` field
to the configuration file specifying the directory in which to keep lock
files. If the value is set to ``[DEFAULT]`` it will be set to
``/tmp/<<USERNAME>>/accre-monitor``. The lock directory will be created if
it does not already exist when configured.
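
The locking idea can be sketched with the standard ``fcntl`` module. The
helper ``try_lock`` below is hypothetical and simplified compared with
``maybe_acquire_lock`` in this module; it returns the open file handle on
success, which the caller must keep referenced for as long as the lock
should be held:

```python
import fcntl
import os


def try_lock(lockdir, name):
    """Return an open, exclusively locked file handle, or None if busy."""
    os.makedirs(lockdir, exist_ok=True)
    handle = open(os.path.join(lockdir, name), 'w')
    try:
        # LOCK_NB makes the call fail immediately instead of waiting
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

The lock is released when the handle is closed or the process exits, so a
check that has finished never blocks the next run.
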
'''

import base64
import datetime
import fcntl
import getpass
import json
import os
import os.path
import socket
import sys
import collections.abc

import accre.util
from accre.nagios import send_nsca_notification
import accre.slack as slack
from accre.config import get_config
from accre.util import RedStr, YellowStr, GreenStr, PurpleStr


CONFIG = get_config()

#: Mappings between status strings and exit codes expected
#: by monitoring systems (i.e. nagios)
EXITCODE_MAPS = {
    'nagios': {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}
}


def monitor_command(f):
    """
    Register a function to be available in the CLI as a monitoring check.

    Any monitoring command should accept a single argument of an
    ``argparse.Namespace`` type which is the parsed arguments given to the
    check. Use of this argument is optional and it may be ignored.

    The command should return a tuple of two elements with a status string of
    "OK", "WARNING", "CRITICAL", or "UNKNOWN" and then a json serializable
    object (i.e. simple dict, etc...) with check-specific information.

    The docstring of the check will become the CLI help message for the
    check.
    """
    MONITORING_COMMANDS.append(f)
    return f


# Internal list of monitoring commands populated by the
# monitor_command decorator
MONITORING_COMMANDS = []


def maybe_acquire_lock(check_name, flock_id=None):
    """
    Attempt to acquire a monitoring flock for the given check_name.

    Returns a tuple of information as would a normal status check
    for easy passage to monitoring check completion. In the case
    of success, the status is 'OK', and in case of failure it is
    'CRITICAL'. Information about any exception raised will be passed
    as data.

    :param str check_name: Name of the monitoring check for which a
        lock should be acquired
    :param int flock_id: An ID number for the check so that a different
        flock is set for each check ID. This is useful for when you
        want multiple checks (up to some limit) to be executable in
        parallel on a given host.
    :returns: Tuple of lock acquisition status with similar format to
        a monitoring check
    :rtype: tuple(str)
    """
    global _persist_lock
    lockname = check_name
    if flock_id is not None:
        lockname += '.{0}'.format(flock_id)
    lockdir = CONFIG['monitor']['lockdir']
    if lockdir == '[DEFAULT]':
        lockdir = os.path.join('/tmp', getpass.getuser(), 'accre-monitor')
    lockfile = os.path.join(lockdir, lockname)
    try:
        if not os.path.exists(lockdir):
            os.makedirs(lockdir)
        # Keep the handle in a module-level reference so that the lock
        # is held until the process exits
        _persist_lock = open(lockfile, 'w+')
        fcntl.flock(_persist_lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return 'OK', 'OK'
    except Exception as e:
        data = {
            'lock_failure': (
                'Unable to acquire lock file {0} to run this check'
                .format(lockfile)
            ),
            'suggestion': 'Check for stalled monitor check processes',
            'exception': str(e)
        }
        return 'CRITICAL', data


# placeholder reference to hold lock file open for duration of process
_persist_lock = None


def complete_check(name, check_retval, outputs=None, runtime=None):
    """
    Build the json object for the check record, send it to
    all configured data stores or outputs, and return an
    exit code determined by the check status and configured
    monitoring system.

    This function can be called by stand-alone checks written
    as individual scripts outside the accre package to
    propagate information in the same way as the checks in this module.

    Calling this function will exit the interpreter.

    :param str name: Name of the check being run
    :param tuple(str) check_retval: Return value of the check, of which
        the first item is the status code, the second is any object
        serializable as JSON with more detailed information about the
        monitoring check result, the third (optional) is a short
        one-line text description of the result, and the fourth (optional)
        is a long multi-line description.
    :param list(str) outputs: Outputs to override those set in the
        configuration
    :param float runtime: The wall clock execution time of the check,
        in seconds
    """
    status, data, short_text, long_text = _process_check_retval(check_retval)
    checktime = accre.util.utcnow().isoformat()
    host = socket.gethostname()
    result = {
        'node': host,
        'date': checktime,
        'status': status,
        'data': data,
        'name': name,
    }
    if runtime is not None:
        result['runtime'] = runtime

    if outputs is None:
        outputs = CONFIG['monitor']['outputs'].split(',')

    if 'stdout' in outputs:
        print(json.dumps(result, indent=2))
    if 'nagios-stdout' in outputs:
        _nagios_output(result, short_text=short_text, long_text=long_text)
    if 'nagios-base64-stdout' in outputs:
        _nagios_output(
            result, short_text=short_text, long_text=long_text, b64=True
        )
    if 'nagios-nsca' in outputs:
        _nagios_nsca_output(result)
    if 'human-stdout' in outputs:
        _human_output(result, short_text=short_text, long_text=long_text)
    if 'slack-all' in outputs:
        _slack_output(result, short_text=short_text, level='OK')
    if 'slack-warning' in outputs:
        _slack_output(result, short_text=short_text, level='WARNING')
    if 'slack-critical' in outputs:
        _slack_output(result, short_text=short_text, level='CRITICAL')
    if 'slack-auditor' in outputs:
        _slack_output(
            result,
            pretext=short_text,
            channel='auditor',
            level='OK',
            username='auditor',
            icon='jarjar'
        )

    # Unrecognized status strings exit as UNKNOWN
    exitmap = EXITCODE_MAPS[CONFIG['monitor']['system']]
    if status not in exitmap:
        status = 'UNKNOWN'
    sys.exit(exitmap[status])


def _process_check_retval(retval):
    """
    Split the check return value into status, data,
    short, and long text components
    """
    # this is wrong, but we'll accept a string as
    # a status code
    if isinstance(retval, (str, bytes)):
        return retval, None, None, None
    result = [None, None, None, None]
    for idx in range(len(retval)):
        if idx > 3:
            break
        result[idx] = retval[idx]
    return tuple(result)


def _nagios_output(result, short_text=None, long_text=None, b64=False):
    """
    Print output in nagios format to STDOUT with the full JSON result
    as $SERVICEPERFDATA$, the check status as $SERVICEOUTPUT$,
    and short and long descriptions added to the $SERVICEOUTPUT$
    and $LONGSERVICEOUTPUT$ respectively. If long_text is None, use
    the information in the data field as $LONGSERVICEOUTPUT$ separated
    by line if a mapping or iterable type, but no additional lines if
    data is None.
    """
    if short_text is None:
        short_msg = '{0} - {1}'.format(result['name'], result['status'])
    else:
        short_msg = (
            '{0} - {1} - {2}'
            .format(result['name'], result['status'], short_text)
        )
    data = json.dumps(result)
    if b64:
        data = base64.b64encode(bytes(data, encoding='utf-8')).decode('utf-8')
    print('{0} | {1}'.format(short_msg, data))
    if long_text is None:
        if result['data'] is None:
            return
        elif isinstance(result['data'], (str, bytes)):
            print(result['data'])
        elif isinstance(result['data'], collections.abc.Mapping):
            for key in result['data']:
                print('{0}: {1}'.format(key, result['data'][key]))
        elif isinstance(result['data'], collections.abc.Iterable):
            for item in result['data']:
                print(item)
        else:
            print(result['data'])
    else:
        print(long_text)


def _nagios_nsca_output(result):
    """
    Send NSCA-formatted output to the nagios server.
    """
    if result['status'] in EXITCODE_MAPS['nagios']:
        status = EXITCODE_MAPS['nagios'][result['status']]
    else:
        # Unrecognized status strings are reported as UNKNOWN
        status = 3
    # NSCA results are keyed on the short hostname
    host = result['node'].split('.')[0]
    send_nsca_notification(host, result['name'], status, result['data'])


def _human_output(result, short_text=None, long_text=None):
    """
    Print information in a human readable format, using only the
    status code with ANSI colors, the short and long text, and
    if long text is not available split out the data as with
    :func:`accre.monitor._nagios_output`.
    """
    if result['status'] == 'OK':
        print('[ {0} ]'.format(GreenStr('OK')), end=' ')
    elif result['status'] == 'WARNING':
        print('[ {0} ]'.format(YellowStr('WARNING')), end=' ')
    elif result['status'] == 'CRITICAL':
        print('[ {0} ]'.format(RedStr('CRITICAL')), end=' ')
    else:
        print('[ {0} ]'.format(PurpleStr('UNKNOWN')), end=' ')
    if short_text is not None:
        print(short_text)
    else:
        print('')
    if long_text is None:
        if result['data'] is None:
            return
        elif isinstance(result['data'], (str, bytes)):
            print(result['data'])
        elif isinstance(result['data'], collections.abc.Mapping):
            for key in result['data']:
                print('{0}: {1}'.format(key, result['data'][key]))
        elif isinstance(result['data'], collections.abc.Iterable):
            for item in result['data']:
                print(item)
        else:
            print(result['data'])
    else:
        print(long_text)


def _slack_output(
    result,
    pretext=None,
    short_text=None,
    level='CRITICAL',
    channel='monitor',
    username=None,
    icon=None
):
    """
    Post a formatted message to slack if the result code is equal or
    higher than level (OK < WARNING < CRITICAL < UNKNOWN). The result
    data is split into message attachment items. If pretext is given,
    that becomes the message pretext verbatim. If short_text is given,
    this becomes the message pretext along with the node and check name.
    """
    level_map = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}
    # Treat unrecognized status strings as UNKNOWN, as complete_check does
    status = result['status'] if result['status'] in level_map else 'UNKNOWN'
    if level_map[status] < level_map[level]:
        return
    colors = {
        'OK': 'good', 'WARNING': 'warning', 'CRITICAL': 'danger',
        'UNKNOWN': '#E066FF'
    }
    items = {'status check': 'No printable data'}
    if isinstance(result['data'], (str, bytes)):
        items = {'status check': str(result['data'])}
    elif isinstance(result['data'], collections.abc.Mapping):
        items = result['data']
        # Each key/value pair is also echoed to STDOUT
        for key in result['data']:
            print('{0}: {1}'.format(key, result['data'][key]))
    elif isinstance(result['data'], collections.abc.Iterable):
        items = {}
        for idx, item in enumerate(result['data']):
            items['Item {}'.format(idx + 1)] = item
    if pretext is None:
        pretext = 'Check {} run on {}: {}'.format(
            result['name'],
            result['node'],
            short_text
        )
    footer = None
    # 'runtime' is only present when complete_check was given a runtime
    if result.get('runtime') is not None:
        footer = 'Execution time: {0} seconds'.format(result['runtime'])
    slack.status_message(
        pretext,
        items=items,
        color=colors[status],
        channel=channel,
        username=username,
        icon=icon,
        footer=footer
    )