slurm

General-purpose wrappers that call Slurm commands and, where needed, convert their output into a machine-friendly format.

exception accre.slurm.ACCRESlurmError

Bases: ACCREError

Raised when an error occurs while running a slurm command.

accre.slurm.create_acc_qos(group, partition, gpus=2, flags='OverPartQOS', ssh=False)

Create a QOS record for use with a group and an accelerated partition to be applied to associations involving that group.

The QOS record created will be of the form “<group>_<partition>_acc”.

Parameters:
  • group (str) – group that this QOS should be used for

  • partition (str) – partition that this QOS should be used for

  • gpus (int) – maximum gpu resources for this QOS

  • flags (str) – comma separated list of flags for the QOS

  • ssh (bool) – Use ssh to run command on the remote scheduler
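
The naming convention described above can be sketched as a pair of small helpers. This is an illustration only: acc_qos_name and is_acc_qos are hypothetical names that are not part of accre.slurm, and the module may build or check the name differently.

```python
def acc_qos_name(group: str, partition: str) -> str:
    """Build a QOS name of the form "<group>_<partition>_acc".

    Hypothetical helper for illustration; not part of accre.slurm.
    """
    return f"{group}_{partition}_acc"


def is_acc_qos(name: str) -> bool:
    """Rough check that a QOS name could follow the accelerated pattern.

    Only tests for the "_acc" suffix and at least two underscores; it does
    not verify that the group or partition actually exist.
    """
    return name.endswith("_acc") and name.count("_") >= 2


# Example with made-up group and partition names.
print(acc_qos_name("mylab", "pascal"))
```

Because group names may themselves contain underscores, splitting such a name back into group and partition is ambiguous without a list of known partitions.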

accre.slurm.create_slurm_association(user, group, partition, fairshare=None, max_cpu=None, max_mem=None, qos=None, ssh=False)

Create a new slurm association for the given user, group, and partition, with the specified optional parameters.

Parameters:
  • user (str) – ACCRE user (VUNetID) to create the association for

  • group (str) – The group for the association

  • partition (str) – The cluster partition for the association

  • fairshare (int) – Optional fairshare for this association

  • max_cpu (int) – Optional maximum cores for this association

  • max_mem (str) – Optional maximum memory for this association

  • qos (str) – Optional qos for this association

  • ssh (bool) – Use ssh to run command on the remote scheduler

accre.slurm.delete_acc_qos(group, partition, ssh=False)

Delete a QOS record for use with a group and an accelerated partition.

The QOS record deleted will be of the form “<group>_<partition>_acc”.

Parameters:
  • group (str) – group for this QOS

  • partition (str) – partition for this QOS

  • ssh (bool) – Use ssh to run command on the remote scheduler

accre.slurm.delete_slurm_association(user, group, partition, ssh=False)

Delete a slurm association for the given user, group, and partition.

Parameters:
  • user (str) – ACCRE user (VUNetID) for the association

  • group (str) – The group for the association

  • partition (str) – The cluster partition for the association

  • ssh (bool) – Use ssh to run command on the remote scheduler

accre.slurm.get_acc_qos_records(ssh=False)

Return a list of dicts each with information about a slurm QOS record for an accelerated partition including the group, partition, number of gpus, and flags.

Note that this will only return information for qos records of the form “<group>_<partition>_acc” which are expected to correspond to accelerated partitions.

Parameters:

ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

list of QOS information dicts

Return type:

list(dict)

accre.slurm.get_default_groups(ssh=False)

Return a dict of usernames with their default group as a value.

Parameters:

ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

default scheduler group for each user

Return type:

dict(str, str)
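
As a usage sketch, the user-to-group mapping can be inverted to list the users that share each default group. The sample data below is made up, and users_by_default_group is a hypothetical helper, not part of accre.slurm; on a real system the input would come from get_default_groups().

```python
from collections import defaultdict


def users_by_default_group(default_groups):
    """Invert a {user: group} mapping into {group: [users...]}.

    Hypothetical helper for illustration; not part of accre.slurm.
    """
    grouped = defaultdict(list)
    # Sort so that each user list comes out in a stable, alphabetical order.
    for user, group in sorted(default_groups.items()):
        grouped[group].append(user)
    return dict(grouped)


# Made-up sample standing in for the result of get_default_groups().
sample = {"alice": "lab_a", "bob": "lab_b", "carol": "lab_a"}
by_group = users_by_default_group(sample)
```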

accre.slurm.get_nodes_infor(options, ssh=False)

Run the sinfo command to get node and core information for the given partition. This information can be used to draw cluster utilization graphs.

As with get_sacct_infor, the options are passed in by the caller, so the user is free to choose the specific data fields.

Parameters:

options – options passed to the sinfo command to get the nodes/cores information

Returns:

the given nodes/cores information from sinfo
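
A minimal sketch of how sinfo output might be turned into a utilization figure. It assumes the caller requested the sinfo %C field, which reports CPUs as allocated/idle/other/total, and that the options list excludes the command name itself. The partition name, sample output line, and helper names are made up for illustration.

```python
def parse_cpu_states(field):
    """Parse an sinfo %C value such as "128/64/8/200" into named counts."""
    allocated, idle, other, total = (int(part) for part in field.split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def utilization(cpus):
    """Fraction of CPUs currently allocated (0.0 if the total is zero)."""
    return cpus["allocated"] / cpus["total"] if cpus["total"] else 0.0


# Options one might pass to get_nodes_infor (assumed form, not confirmed):
options = ["--noheader", "--partition=production", "--format=%P %D %C"]

# Made-up sample line standing in for the raw sinfo output.
sample_line = "production 25 128/64/8/200"
partition, node_count, cpu_field = sample_line.split()
cpus = parse_cpu_states(cpu_field)
util = utilization(cpus)
```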

accre.slurm.get_runaway_jobs(ssh=False)

Return the output of the runaway jobs test.

Parameters:

ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

raw output from the slurm command

accre.slurm.get_sacct_infor(options, ssh=False)

Return the output of the sacct command from slurm. The options are passed in by the caller.

Parameters:

options – options for the sacct command, as a list of strings

Returns:

raw output from sacct
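
Since the output is returned raw, the caller is responsible for parsing it. The sketch below assumes the caller asks sacct for pipe-delimited, header-less output (the --parsable2 and --noheader flags) and that the options list excludes the command name itself. parse_parsable2 is a hypothetical helper, and the sample output is made up.

```python
def parse_parsable2(raw, fields):
    """Split pipe-delimited, header-less output into one dict per row.

    Hypothetical helper for illustration; not part of accre.slurm.
    """
    rows = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(fields, line.split("|"))))
    return rows


fields = ["JobID", "State", "Elapsed"]
options = ["--allusers", "--parsable2", "--noheader",
           "--format=" + ",".join(fields)]

# On a cluster this would be: raw = accre.slurm.get_sacct_infor(options)
# Made-up sample standing in for the raw sacct output:
sample_raw = "1001|COMPLETED|00:05:12\n1002|FAILED|00:00:03\n"
jobs = parse_parsable2(sample_raw, fields)
```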

accre.slurm.get_sacctmgr_status(options, ssh=False)

Return the output of a sacctmgr status check from slurm. The options are passed in by the caller.

Parameters:

options – options for the sacctmgr check, as a list of strings

Returns:

raw output from the sacctmgr status check

accre.slurm.get_sdiag_infor(options, ssh=False)

Return the output of the sdiag command from slurm. The options are passed in by the caller.

Parameters:

options – options for sdiag, as a list of strings

Returns:

raw output from sdiag

accre.slurm.get_slurm_associations(user=None, regular=True, accelerated=False, ssh=False)

Return a list of dicts each with information about a slurm association including the cluster, group, user, partition, fairshare, max_cpus, max_mem, max_runmins, and qos. Fields that are unset in slurm for the association are set to None. max_cpus, fairshare, and max_runmins are ints if set.

Parameters:
  • user (str) – If not None, only return associations for the specified user

  • regular (bool) – Show associations for regular (non-accelerated) partitions. These should all involve slurm accounts not ending in “_acc”.

  • accelerated (bool) – Show associations for accelerated partitions. These should all involve slurm accounts ending in “_acc”.

  • ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

list of association information dicts

Return type:

list(dict)
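
A sketch of how the returned records might be consumed, using the keys listed above. The sample records, including the cluster name, are made up. The sketch also assumes that the slurm account is stored under the group key, which the description implies but does not state outright. Both helper names are hypothetical.

```python
def split_by_kind(associations):
    """Separate records into (regular, accelerated) by the "_acc" suffix.

    Assumes the slurm account name is stored under the "group" key.
    """
    regular, accelerated = [], []
    for assoc in associations:
        target = accelerated if assoc["group"].endswith("_acc") else regular
        target.append(assoc)
    return regular, accelerated


def total_max_cpus(associations):
    """Sum max_cpus over records, skipping those where it is unset (None)."""
    return sum(a["max_cpus"] for a in associations if a["max_cpus"] is not None)


# Made-up records shaped like the documented return value.
sample = [
    {"cluster": "accre", "group": "lab_a", "user": "alice",
     "partition": "production", "fairshare": 10, "max_cpus": 64,
     "max_mem": None, "max_runmins": None, "qos": None},
    {"cluster": "accre", "group": "lab_a_acc", "user": "alice",
     "partition": "pascal", "fairshare": None, "max_cpus": None,
     "max_mem": None, "max_runmins": None, "qos": "lab_a_pascal_acc"},
]
regular, accelerated = split_by_kind(sample)
```

Unset fields arrive as None, so any arithmetic over fairshare, max_cpus, or max_runmins needs a check like the one in total_max_cpus.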

accre.slurm.get_slurm_users(ssh=False)

Return a list of dicts each with information about a slurm user including the name, default account (default), and administrative access (admin).

Parameters:

ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

list of user information dicts

Return type:

list(dict)

accre.slurm.get_squeue_infor(options, ssh=False)

Return the output of the squeue command from slurm. The options are passed in by the caller.

Parameters:

options – options for the squeue command, as a list of strings

Returns:

raw output from squeue

accre.slurm.groups_by_account(ssh=False)

Return a dictionary keyed by ACCRE slurm accounts with a list of slurm groups for each account.

Parameters:

ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

All groups for each account

Return type:

dict(str, list(str))

accre.slurm.list_compute_nodes(responding=True, ssh=False, hidden=False)

Return a list of compute nodes. By default, only nodes that are responding to the scheduler are included.

Parameters:
  • responding (bool) – Only show responding nodes if true

  • ssh (bool) – Use ssh to run command on the remote scheduler

  • hidden (bool) – Include nodes in hidden partitions

Returns:

all compute nodes

Return type:

list(str)

accre.slurm.modify_acc_qos_gpus(group, partition, gpus, ssh=False)

Modify a QOS record for use with a group and an accelerated partition to change the number of allowed gpus.

The QOS record modified will be of the form “<group>_<partition>_acc”.

Parameters:
  • group (str) – group for this QOS

  • partition (str) – partition for this QOS

  • gpus (int) – The new number of allowed gpus

  • ssh (bool) – Use ssh to run command on the remote scheduler

accre.slurm.run_slurm_command(arglist, ssh=False, timeout=60)

Run a specified slurm command with arguments and return standard output decoded to UTF-8.

If ssh is set to True, ssh to the configured slurm server to run the command. This requires either that non-interactive authentication is set up or that this is run interactively.

Parameters:
  • arglist (list(str)) – List of slurm command and arguments. Note that this will not be interpreted by a shell.

  • ssh (bool) – Use ssh to run command on the remote scheduler.

  • timeout (int) – set the timeout to this many seconds (default 60)

Returns:

Output of the command decoded to utf-8

Return type:

str
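
The behavior described above (no shell interpretation, an optional ssh hop, a timeout, and UTF-8 decoding) could be sketched with the standard subprocess module as follows. This is not the module's actual implementation: the host name and both function names are hypothetical, and the real function may wrap failures in ACCRESlurmError rather than letting subprocess exceptions propagate.

```python
import shlex
import subprocess
import sys

SLURM_HOST = "scheduler.example.org"  # hypothetical host, for illustration


def build_argv(arglist, ssh=False, host=SLURM_HOST):
    """Return the argument vector to execute, optionally wrapped in ssh."""
    if ssh:
        # The remote side runs a shell, so quote the command into one string.
        return ["ssh", host, shlex.join(arglist)]
    return list(arglist)


def run_command_sketch(arglist, ssh=False, timeout=60):
    """Run a command without a local shell and return stdout as a str."""
    result = subprocess.run(
        build_argv(arglist, ssh=ssh),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # raises subprocess.TimeoutExpired when exceeded
        check=True,       # raises subprocess.CalledProcessError on failure
    )
    return result.stdout.decode("utf-8")


# Demonstrated with the Python interpreter instead of a slurm command,
# so that the sketch runs on machines without Slurm installed.
output = run_command_sketch([sys.executable, "-c", "print('ok')"])
```

Passing the arguments as a list keeps the local side free of shell quoting issues; only the ssh case needs explicit quoting, because the remote command line is interpreted by the remote login shell.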

accre.slurm.set_default_group(user, group, ssh=False)

Set the default group for a user in slurm.

Parameters:
  • user (str) – ACCRE user (VUNetID)

  • group (str) – The desired default group

  • ssh (bool) – Use ssh to run command on the remote scheduler

accre.slurm.slurm_node_info(node, ssh=False)

Return a dictionary with the slurm properties of a specified node

Parameters:
  • node (str) – Node to collect information about

  • ssh (bool) – Use ssh to run command on the remote scheduler

Returns:

slurm properties for the specified node

Return type:

dict(str)
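
One plausible way to obtain such a dictionary is to split Key=Value tokens of the kind that scontrol prints for a node. The sketch below is an assumption about the approach, not the module's actual parsing; parse_node_line is a hypothetical helper and the sample line is made up.

```python
def parse_node_line(line):
    """Turn space-separated Key=Value tokens into a dict of strings.

    Simplification: values that contain spaces are not handled, and
    tokens without "=" are ignored.
    """
    props = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            props[key] = value
    return props


# Made-up sample in the style of one-line scontrol node output.
sample = "NodeName=cn1001 CPUAlloc=4 CPUTot=16 RealMemory=128000 State=MIXED"
props = parse_node_line(sample)
```

All values stay as strings here, so numeric fields need an explicit int() conversion before use.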

accre.slurm.slurm_nodes_info_by_state(states=('alloc', 'idle'), ssh=False, hidden=False)

Return a list of dictionaries with the slurm properties of each node in the cluster that has one of the specified states.

Parameters:
  • states (list(str)) – List of states from which to accept nodes for information collection. Defaults to alloc and idle.

  • ssh (bool) – Use ssh to run command on the remote scheduler

  • hidden (bool) – Allow queries for nodes in hidden partitions

Returns:

slurm properties for each node in the cluster with one of the specified states.

Return type:

list(dict(str))

accre.slurm.slurm_version()

Return the current slurm version.