monitor_checks.common

Common monitoring checks usable across multiple services or node types.

For an overview of the monitoring check framework see accre.monitor.

accre.monitor_checks.common.accresystemhealth(opts)[source]

Checks the general system (non-hardware) health of an ACCRE server.

This check is an omnibus check for commonly found system problems that can apply to any standard, CFE-managed node on the ACCRE infrastructure. The following things will be checked:

  1. System drive (/) free block space, warn if <20%, crit if <10%

  2. System drive (/) free inodes, warn if <20%, crit if <10%

  3. Systemd init process memory usage, warn if >50M, crit if >100M

  4. Time since the last completed CFE run, warn if >15 min, crit if >30 min

These thresholds are not currently configurable.
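
As a rough illustration of how the limits listed above map onto a status, the sketch below classifies a measured value against a warn/crit pair. The function names and the status strings are hypothetical and are not taken from the accre package; the block and inode percentages are read with the standard `os.statvfs` call.

```python
import os


def classify_low(value, warn, crit):
    """Status for metrics where a LOW value is bad (e.g. free space in %)."""
    if value < crit:
        return "CRITICAL"
    if value < warn:
        return "WARNING"
    return "OK"


def classify_high(value, warn, crit):
    """Status for metrics where a HIGH value is bad (e.g. memory in MB)."""
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


def free_percentages(path="/"):
    """Return (free block %, free inode %) for the filesystem holding path."""
    st = os.statvfs(path)
    # Some filesystems report zero totals; treat those as fully free.
    blocks = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 100.0
    inodes = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return blocks, inodes


if __name__ == "__main__":
    blocks, inodes = free_percentages("/")
    print("block space:", classify_low(blocks, warn=20, crit=10))
    print("inodes:", classify_low(inodes, warn=20, crit=10))
```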

accre.monitor_checks.common.certexpiry(opts)[source]

Check the certificate file specified by the cert option (required).

By default, report critical if the certificate has expired or will expire within the next 3 days, and warning if it will expire within the next 14 days. These limits can be changed with the critical and warning options.
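
The default limits can be pictured with a small sketch like the one below. It assumes the expiry time has already been read from the certificate as a timezone-aware `datetime`; the function name, parameter names, and status strings are hypothetical and do not come from the accre package.

```python
from datetime import datetime, timedelta, timezone


def classify_expiry(not_after, now=None, critical_days=3, warning_days=14):
    """Classify a certificate by the time left until not_after.

    An already expired certificate has a negative remaining time and is
    therefore reported as CRITICAL as well.
    """
    now = now or datetime.now(timezone.utc)
    remaining = not_after - now
    if remaining <= timedelta(days=critical_days):
        return "CRITICAL"
    if remaining <= timedelta(days=warning_days):
        return "WARNING"
    return "OK"
```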

accre.monitor_checks.common.checkping(opts)[source]

Ping the given server and decide whether it is suitably alive based on the supplied latency and packet loss values.

checkping [--verbose level] [--packets n] [--timeout dt] [--use-ipv4 | --use-ipv6] --warn latency,loss% --critical latency,loss% --host server

Options

--critical latency,loss%

CRITICAL if the latency (ms) or the percentage of lost packets is greater than the value provided

--host server

Server to ping

--packets n

Number of packets to send. Default is 5.

--timeout dt

How long to wait for a response in seconds. Default is 10.

--use-ipv4

Use IPv4 addresses

--use-ipv6

Use IPv6 addresses

--verbose level

Verbosity level to get extra information. Default is 0.

--warn latency,loss%

WARN if the latency (ms) or the percentage of lost packets is greater than the value provided

Additional data returned:

critical_latency

Critical latency specified (ms)

critical_lost_percent

Critical lost packets percent

error

Error message

host

Server name

latency

Average packet latency

packets

Packets sent

packet_loss

Percentage of packets lost

lost_percent

Percentage of packets lost

rta

Ping round trip time average (ms)

summary

Test summary information

timeout

Max wait (seconds)

warn_latency

Warning latency specified (ms)

warn_lost_percent

Warning lost packets percent
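
To make the `latency,loss%` threshold format concrete, here is a minimal sketch of how such a value could be parsed and compared against a measured round trip time and loss percentage. The helper names are hypothetical, and the real check may parse or compare differently.

```python
def parse_threshold(spec):
    """Split a 'latency,loss%' string into (latency_ms, loss_percent)."""
    latency_text, loss_text = spec.split(",", 1)
    return float(latency_text), float(loss_text.strip().rstrip("%"))


def classify_ping(rta_ms, lost_percent, warn, critical):
    """Return a status given measured values and two threshold strings."""
    critical_latency, critical_lost = parse_threshold(critical)
    warn_latency, warn_lost = parse_threshold(warn)
    # Either value exceeding its limit is enough to trigger the status.
    if rta_ms > critical_latency or lost_percent > critical_lost:
        return "CRITICAL"
    if rta_ms > warn_latency or lost_percent > warn_lost:
        return "WARNING"
    return "OK"
```

With `--warn 100,20% --critical 500,60%`, for example, a 150 ms average with no loss would come out as a warning under this sketch.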

accre.monitor_checks.common.checkread(opts)[source]

Check that it is possible to read and calculate a checksum of a randomly selected file in the specified directory.

Options:
--dir

Target directory
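
A read check of this kind might look roughly like the sketch below: pick one regular file at random from the directory, stream it through a hash, and report any I/O error. The function name, the returned dictionary keys, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
import os
import random


def check_read(directory):
    """Read and checksum one randomly chosen regular file in directory."""
    try:
        files = [entry.path for entry in os.scandir(directory) if entry.is_file()]
    except OSError as exc:
        return {"status": "CRITICAL", "error": str(exc)}
    if not files:
        return {"status": "CRITICAL", "error": "no regular files found"}

    target = random.choice(files)
    digest = hashlib.sha256()
    try:
        with open(target, "rb") as handle:
            # Read in 1 MiB chunks so large files do not fill memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
    except OSError as exc:
        return {"status": "CRITICAL", "file": target, "error": str(exc)}
    return {"status": "OK", "file": target, "checksum": digest.hexdigest()}
```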

accre.monitor_checks.common.checkssh(opts)[source]

Try to connect to the specified SSH server and port

checkssh [--verbose level] [--port port] [--timeout dt] [--use-ipv4 | --use-ipv6] --host server

Options
--host server

Server to probe

--port port

Port to use. Default is 22.

--timeout dt

How long to wait for a response in seconds. Default is 10.

--use-ipv4

Use IPv4 addresses

--use-ipv6

Use IPv6 addresses

--verbose level

Verbosity level to get extra information. Default is 0.

Additional data returned:

error

Error message

host

Server name

port

Port used

time

Time (seconds) to process the check

timeout

Max wait (seconds)

version

SSH server version string returned
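
The probe can be approximated with the standard `socket` module: open a TCP connection, read the identification line the server sends, and record how long it took. This is only a sketch under assumptions; the function names are hypothetical, IPv4/IPv6 selection is left out, and only the first line received is inspected.

```python
import socket
import time


def parse_banner(raw):
    """Return the SSH identification string, or None if raw is not one."""
    line = raw.split(b"\n", 1)[0].rstrip(b"\r").decode("ascii", "replace")
    return line if line.startswith("SSH-") else None


def probe_ssh(host, port=22, timeout=10.0):
    """Connect to host:port and collect data similar to the fields above."""
    result = {"host": host, "port": port, "timeout": timeout}
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            result["version"] = parse_banner(sock.recv(256))
    except OSError as exc:  # includes timeouts and refused connections
        result["error"] = str(exc)
    result["time"] = time.monotonic() - start
    return result
```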

accre.monitor_checks.common.checksum(opts)[source]

Efficiently calculates the checksum of the target file.

Options:
--file

Target file

--hash

Hash function (md5, sha1, sha224, sha256, sha384, sha512)
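
One common way to checksum a file efficiently is to feed it to `hashlib` in fixed-size chunks rather than reading it whole. The sketch below shows that approach; it is not necessarily how this check is implemented, and the function name, default hash, and chunk size are assumptions.

```python
import hashlib

# Hash names listed for the --hash option.
SUPPORTED_HASHES = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def file_checksum(path, hash_name="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at path, read chunk by chunk."""
    if hash_name not in SUPPORTED_HASHES:
        raise ValueError("unsupported hash function: %s" % hash_name)
    digest = hashlib.new(hash_name)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```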

accre.monitor_checks.common.diskusage(opts)[source]

Check the used space on a mounted volume (default /)

Accepts a mountpoint option to check the volume mounted at a different specified directory, and warning/critical options to alert on the fraction of total space used.
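
A used-fraction check can be sketched with `shutil.disk_usage` as follows. The default limits of 0.8 and 0.9 are placeholders chosen for the example, not the check's documented defaults, and the function name is hypothetical.

```python
import shutil


def disk_usage_status(mountpoint="/", warning=0.8, critical=0.9):
    """Return (status, used_fraction) for the volume holding mountpoint."""
    usage = shutil.disk_usage(mountpoint)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    if used_fraction > critical:
        status = "CRITICAL"
    elif used_fraction > warning:
        status = "WARNING"
    else:
        status = "OK"
    return status, used_fraction
```

Note that used/total computed this way can differ slightly from the percentage shown by `df`, which accounts for blocks reserved for the superuser.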

accre.monitor_checks.common.loadavg(opts)[source]

Check the load average (1, 5, 15 minutes) of the server.

The ‘warning’ and/or ‘critical’ options may each be given either as a list of three limits for the 1, 5, and 15 minute averages, or as a single number that applies to all three. The check returns warning or critical if any of the averages is above its limit.

If the cpuscaling option is given with a value of true, the critical and warning limits are multiplied by the number of logical CPU cores on the node (counting hyperthreading).
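
The limit handling described above can be sketched as follows. The sketch assumes the list form is a comma-separated string such as `4,3,2`; the function names and the `cores` override are hypothetical and exist only to make the example testable.

```python
import os


def parse_limits(spec):
    """Turn '4,3,2' or a single number into three (1, 5, 15 min) limits."""
    values = [float(part) for part in str(spec).split(",")]
    if len(values) == 1:
        return tuple(values * 3)
    if len(values) != 3:
        raise ValueError("expected one or three limits, got %r" % (spec,))
    return tuple(values)


def loadavg_status(loads, warning=None, critical=None, cpuscaling=False, cores=None):
    """Compare the three load averages against optional limits."""
    # os.cpu_count() reports logical CPUs, so hyperthreads are included.
    scale = (cores or os.cpu_count() or 1) if cpuscaling else 1
    for status, spec in (("CRITICAL", critical), ("WARNING", warning)):
        if spec is None:
            continue
        limits = parse_limits(spec)
        if any(load > limit * scale for load, limit in zip(loads, limits)):
            return status
    return "OK"


if __name__ == "__main__":
    print(loadavg_status(os.getloadavg(), warning="4,3,2", critical=8, cpuscaling=True))
```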