slurm
General purpose wrappers to call Slurm commands and put output into a machine-friendly format if needed.
- exception accre.slurm.ACCRESlurmError
Bases: ACCREError
Error occurring when running a slurm command
- accre.slurm.create_acc_qos(group, partition, gpus=2, flags='OverPartQOS', ssh=False)
Create a QOS record for use with a group and an accelerated partition to be applied to associations involving that group.
The QOS record created will be of the form “<group>_<partition>_acc”.
- Parameters:
group (str) – group that this QOS should be used for
partition (str) – partition that this QOS should be used for
gpus (int) – maximum gpu resources for this QOS
flags (str) – comma separated list of flags for the QOS
ssh (bool) – Use ssh to run command on the remote scheduler
- accre.slurm.create_slurm_association(user, group, partition, fairshare=None, max_cpu=None, max_mem=None, qos=None, ssh=False)
Create a new slurm association for the given user, group, and partition with the specified optional parameters.
- Parameters:
user (str) – ACCRE user (VUNetID) to create the association for
group (str) – The group for the association
partition (str) – The cluster partition for the association
fairshare (int) – Optional fairshare for this association
max_cpu (int) – Optional maximum cores for this association
max_mem (str) – Optional maximum memory for this association
qos (str) – Optional qos for this association
ssh (bool) – Use ssh to run command on the remote scheduler
- accre.slurm.delete_acc_qos(group, partition, ssh=False)
Delete a QOS record for use with a group and an accelerated partition.
The QOS record deleted will be of the form “<group>_<partition>_acc”.
- Parameters:
group (str) – group for this QOS
partition (str) – partition for this QOS
ssh (bool) – Use ssh to run command on the remote scheduler
- accre.slurm.delete_slurm_association(user, group, partition, ssh=False)
Delete the slurm association for the given user, group, and partition.
- Parameters:
user (str) – ACCRE user (VUNetID) for the association
group (str) – The group for the association
partition (str) – The cluster partition for the association
ssh (bool) – Use ssh to run command on the remote scheduler
- accre.slurm.get_acc_qos_records(ssh=False)
Return a list of dicts each with information about a slurm QOS record for an accelerated partition including the group, partition, number of gpus, and flags.
Note that this will only return information for qos records of the form “<group>_<partition>_acc” which are expected to correspond to accelerated partitions.
- Parameters:
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
list of QOS information dicts
- Return type:
list(dict)
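The naming convention above can be illustrated with a small sketch that splits a QOS name of the form "<group>_<partition>_acc" back into its parts. This is not the library's implementation; the helper name and the sample QOS names are hypothetical, and it assumes the partition name itself contains no underscore.

```python
ACC_SUFFIX = "_acc"


def split_acc_qos_name(qos_name):
    """Return (group, partition) for a <group>_<partition>_acc name, else None.

    Assumption: the partition name contains no underscore, so the last
    underscore before the suffix separates the group from the partition.
    """
    if not qos_name.endswith(ACC_SUFFIX):
        return None
    stem = qos_name[: -len(ACC_SUFFIX)]
    group, sep, partition = stem.rpartition("_")
    if not sep or not group or not partition:
        return None
    return group, partition


# Made-up QOS names; only those matching the convention are kept.
names = ["bio_lab_pascal_acc", "normal", "chem_turing_acc"]
matches = [split_acc_qos_name(n) for n in names]
print([m for m in matches if m is not None])
```

Names that do not follow the convention yield None, which mirrors the note that only records of this form are reported.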
- accre.slurm.get_default_groups(ssh=False)
Return a dict of usernames with their default group as a value.
- Parameters:
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
default scheduler group for each user
- Return type:
dict(str, str)
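As a rough illustration of the returned mapping, the sketch below builds a user-to-default-group dict from pipe-separated text such as `sacctmgr -n -P show user format=User,DefaultAccount` might print. Whether the library uses that exact command is an assumption, and the helper name and sample data are hypothetical.

```python
def parse_default_groups(raw):
    """Map each user to a default account from 'user|account' lines."""
    defaults = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        user, _, account = line.partition("|")
        defaults[user] = account
    return defaults


# Made-up sample in the parsable, header-free format assumed above.
sample = "alice|biolab\nbob|chem\n"
print(parse_default_groups(sample))
```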
- accre.slurm.get_nodes_infor(options, ssh=False)
Run the sinfo command to get node and core information for a given partition. This information can be used to draw a cluster utilization graph.
The options are passed in by the caller in general form, as with get_sacct_infor, so the caller can choose the specific data fields.
- Parameters:
options – options passed to the sinfo command to get the nodes/cores information
- Returns:
the given nodes/cores information from sinfo
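To make the utilization use case concrete, here is a minimal sketch that turns sinfo's `%C` field (CPUs as allocated/idle/other/total) into a utilization fraction. The option list, the partition name "production", and the sample output are assumptions for illustration; the helper names are hypothetical and not part of accre.slurm.

```python
def parse_cpu_states(field):
    """Parse sinfo's %C field, formatted as allocated/idle/other/total."""
    allocated, idle, other, total = (int(x) for x in field.strip().split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def utilization(field):
    """Fraction of CPUs currently allocated, or 0.0 for an empty partition."""
    cpus = parse_cpu_states(field)
    return cpus["allocated"] / cpus["total"] if cpus["total"] else 0.0


# Options a caller might pass: no header, one partition, CPU counts only.
options = ["-h", "-p", "production", "-o", "%C"]

# Made-up raw output for those options.
sample_output = "1200/300/100/1600\n"
print(utilization(sample_output))
```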
- accre.slurm.get_runaway_jobs(ssh=False)
Return the output of the runaway jobs test.
- Parameters:
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
raw output from the slurm command
- accre.slurm.get_sacct_infor(options, ssh=False)
Return the output of the sacct command from slurm. The options are supplied by the caller.
- Parameters:
options – options for running the sacct command, as a list of strings
- Returns:
raw output from sacct
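Because the function returns raw text, callers need to parse it themselves. The sketch below shows one way to do so, assuming the caller requested pipe-separated output with a header row via `-P`. The option list, the sample output, and the helper name are illustrative assumptions.

```python
import csv
import io


def parse_parsable2(raw):
    """Turn pipe-separated output with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="|")
    return list(reader)


# Options expressed as a list of strings, as the parameter description asks.
options = ["-P", "--format=JobID,State,Elapsed", "-S", "2024-01-01"]

# Made-up raw output for those options.
sample = "JobID|State|Elapsed\n1001|COMPLETED|00:10:05\n1002|FAILED|00:00:12\n"
rows = parse_parsable2(sample)
print([row["JobID"] for row in rows if row["State"] == "FAILED"])
```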
- accre.slurm.get_sacctmgr_status(options, ssh=False)
Return the output of a sacctmgr status check from slurm. The options are supplied by the caller.
- Parameters:
options – options for the sacctmgr check, as a list of strings
- Returns:
raw output from the sacctmgr status check
- accre.slurm.get_sdiag_infor(options, ssh=False)
Return the output of the sdiag command from slurm. The options are supplied by the caller.
- Parameters:
options – options for sdiag, as a list of strings
- Returns:
raw output from sdiag
- accre.slurm.get_slurm_associations(user=None, regular=True, accelerated=False, ssh=False)
Return a list of dicts each with information about a slurm association including the cluster, group, user, partition, fairshare, max_cpus, max_mem, max_runmins, and qos. Fields that are unset in slurm for the association are set to None. max_cpus, fairshare, and max_runmins are ints if set.
- Parameters:
user (str) – If not None, only return associations for the specified user
regular (bool) – Show associations for regular (non-accelerated) partitions. These should all involve slurm accounts not ending in “_acc”.
accelerated (bool) – Show associations for accelerated partitions. These should all involve slurm accounts ending in “_acc”.
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
list of association information dicts
- Return type:
list(dict)
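The following sketch shows how a caller might work with association dicts of the shape described above, in particular the None values for unset fields and the "_acc" suffix used to distinguish accelerated associations. The sample records are made up, the helper names are hypothetical, and it is an assumption that the "group" key holds the slurm account name.

```python
def is_accelerated(assoc):
    """True if the association's account name ends in '_acc' (assumed in 'group')."""
    return assoc["group"].endswith("_acc")


def total_max_cpus(assocs):
    """Sum max_cpus over associations, skipping those where it is unset (None)."""
    return sum(a["max_cpus"] for a in assocs if a["max_cpus"] is not None)


# Made-up records using a subset of the documented fields.
associations = [
    {"user": "alice", "group": "biolab", "partition": "production", "max_cpus": 64},
    {"user": "alice", "group": "biolab_acc", "partition": "pascal", "max_cpus": None},
    {"user": "bob", "group": "chem", "partition": "production", "max_cpus": 32},
]

regular = [a for a in associations if not is_accelerated(a)]
print(len(regular), total_max_cpus(associations))
```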
- accre.slurm.get_slurm_users(ssh=False)
Return a list of dicts each with information about a slurm user including the name, default account (default), and administrative access (admin).
- Parameters:
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
list of user information dicts
- Return type:
list(dict)
- accre.slurm.get_squeue_infor(options, ssh=False)
Return the output of the squeue command from slurm. The options are supplied by the caller.
- Parameters:
options – options for running the squeue command, as a list of strings
- Returns:
raw output from squeue
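As one example of consuming the raw squeue text, the sketch below counts jobs per state. It assumes the caller asked for header-free output in the format `%i|%u|%T` (job ID, user, state); the option list, the sample output, and the helper name are illustrative assumptions.

```python
from collections import Counter


def count_states(raw):
    """Count jobs per state from 'jobid|user|state' lines, skipping malformed ones."""
    counts = Counter()
    for line in raw.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue
        counts[parts[2]] += 1
    return counts


# Options a caller might pass: no header, three pipe-separated fields.
options = ["-h", "-o", "%i|%u|%T"]

# Made-up raw output for those options.
sample = "101|alice|RUNNING\n102|bob|PENDING\n103|alice|RUNNING\n"
print(count_states(sample))
```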
- accre.slurm.groups_by_account(ssh=False)
Return a dictionary keyed by ACCRE slurm accounts with a list of slurm groups for each account.
- Parameters:
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
All groups for each account
- Return type:
dict(str, list(str))
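A mapping of this shape is often easier to query in the other direction. The sketch below inverts an account-to-groups dict into a group-to-account lookup; the sample data and the helper name are hypothetical, and it assumes each group belongs to exactly one account.

```python
def account_by_group(groups_by_acct):
    """Invert {account: [groups]} into {group: account}.

    Assumption: a group appears under only one account; if it appears under
    several, the last one seen wins.
    """
    inverted = {}
    for account, groups in groups_by_acct.items():
        for group in groups:
            inverted[group] = account
    return inverted


# Made-up mapping in the documented dict(str, list(str)) shape.
sample = {"science": ["biolab", "chem"], "engineering": ["robotics"]}
print(account_by_group(sample)["chem"])
```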
- accre.slurm.list_compute_nodes(responding=True, ssh=False, hidden=False)
Return a list of all compute nodes, by default only the ones responding to the scheduler.
- Parameters:
responding (bool) – Only show responding nodes if true
ssh (bool) – Use ssh to run command on the remote scheduler
hidden (bool) – Include nodes in hidden partitions
- Returns:
all compute nodes
- Return type:
list(str)
- accre.slurm.modify_acc_qos_gpus(group, partition, gpus, ssh=False)
Modify a QOS record for use with a group and an accelerated partition to change the number of allowed gpus.
The QOS record modified will be of the form “<group>_<partition>_acc”.
- Parameters:
group (str) – group for this QOS
partition (str) – partition for this QOS
gpus (int) – The new number of allowed gpus
ssh (bool) – Use ssh to run command on the remote scheduler
- accre.slurm.run_slurm_command(arglist, ssh=False, timeout=60)
Run a specified slurm command with arguments and return standard output decoded to UTF-8.
If ssh is set to True, ssh to the configured slurm server to run the command. This requires either that non-interactive authentication is set up or that this is run interactively.
- Parameters:
arglist (list(str)) – List of slurm command and arguments. Note that this will not be interpreted by a shell.
ssh (bool) – Use ssh to run command on the remote scheduler.
timeout (int) – set the timeout to this many seconds (default 60)
- Returns:
Output of the command decoded to utf-8
- Return type:
str
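The behavior described here (an argument list run without a shell, an optional ssh hop, a timeout, and UTF-8 decoding) could be sketched roughly as follows. This is not the library's code: the function name, the `ssh_host` parameter, and the `CommandError` class are hypothetical stand-ins, and wrapping failures in a single exception type is an assumption.

```python
import shlex
import subprocess
import sys


class CommandError(Exception):
    """Hypothetical stand-in for an error like ACCRESlurmError."""


def run_command(arglist, ssh_host=None, timeout=60):
    """Run a command given as a list of strings and return its stdout as str."""
    cmd = list(arglist)
    if ssh_host is not None:
        # ssh hands the remote command to a shell, so quote it as one string.
        cmd = ["ssh", ssh_host, shlex.join(cmd)]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,  # a non-zero exit status raises CalledProcessError
        )
    except (OSError, subprocess.SubprocessError) as exc:
        raise CommandError(f"command failed: {cmd!r}") from exc
    return result.stdout.decode("utf-8")


# Demonstrated with the Python interpreter so it runs without slurm installed.
print(run_command([sys.executable, "-c", "print('ok')"]).strip())
```

Passing a list rather than a single string avoids local shell interpretation, which matches the note on arglist above.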
- accre.slurm.set_default_group(user, group, ssh=False)
Set the default group for a user in slurm.
- Parameters:
user (str) – ACCRE user (VUNetID)
group (str) – The desired default group
ssh (bool) – Use ssh to run command on the remote scheduler
- accre.slurm.slurm_node_info(node, ssh=False)
Return a dictionary with the slurm properties of a specified node
- Parameters:
node (str) – Node to collect information about
ssh (bool) – Use ssh to run command on the remote scheduler
- Returns:
slurm properties for the specified node
- Return type:
dict(str)
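A dictionary of node properties like this could be built from the one-line `Key=Value` text that `scontrol -o show node <name>` prints. Whether the library does it this way is an assumption; the helper name and sample line are hypothetical. The parser is deliberately naive: values that contain spaces (a free-text reason, for example) would be cut at the first space.

```python
def parse_node_line(line):
    """Parse whitespace-separated Key=Value tokens into a dict of strings."""
    props = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens without an '=' sign
            props[key] = value
    return props


# Made-up one-line node description.
sample = "NodeName=cn1001 CPUAlloc=4 CPUTot=16 RealMemory=128000 State=MIXED Partitions=production"
info = parse_node_line(sample)
print(info["State"], info["CPUTot"])
```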
- accre.slurm.slurm_nodes_info_by_state(states=('alloc', 'idle'), ssh=False, hidden=False)
Return a list of dictionaries with the slurm properties of each node in the cluster that has one of the specified states.
- Parameters:
states (list(str)) – List of states from which to accept nodes for information collection. Defaults to alloc and idle.
ssh (bool) – Use ssh to run command on the remote scheduler
hidden (bool) – Allow queries for nodes in hidden partitions
- Returns:
slurm properties for each node in the cluster with one of the specified states.
- Return type:
list(dict(str))