Monitoring System and Job Status

This page is no longer being updated. Please see the Monitoring System and Job Status page on our new documentation site at https://mit-supercloud.github.io/supercloud-docs/ for the most up-to-date information.

The four actions you will probably take most often are checking system status and starting, monitoring, and stopping jobs. Because scheduling jobs is a longer topic, see the job submission page for an in-depth description of how to start a job. This page describes how to check the system for available resources, monitor a running job, and stop a running job.

Each of these tasks is done through the scheduler, which is Slurm on the MIT Supercloud system. This page and the job submission page describe some of the basic options for submitting, monitoring, and stopping jobs. More advanced options are described in Slurm's documentation, and this handy two-page guide gives a brief description of the commands and their options.

Checking System Status

Our wrapper command, LLGrid_status, prints a nicely formatted, easy-to-read summary of the system status:

[StudentX@login-0 ~]$ LLGrid_status
LLGrid: txe1 (running slurm 16.05.8)
============================================
Online Intel xeon-e5 nodes: 36
   Unclaimed nodes: 24
   Claimed slots: 172
   Claimed slots for exclusive jobs: 80
--------------------------------------------
   Available slots: 404

In the output, you can see the name of the system you are on (txe1 here), the scheduler in use (Slurm) and its version, the number of online nodes, the number of unclaimed nodes, and the number of available slots.
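If you want to use these numbers in a script, the labelled counts can be pulled out of the text. The following is a minimal Python sketch, assuming the output keeps the "label: number" layout shown above. The function name parse_status and the embedded sample are illustrative and not part of LLGrid_status itself.

```python
import re

# Sample text copied from the LLGrid_status output shown above.
SAMPLE_STATUS = """\
LLGrid: txe1 (running slurm 16.05.8)
============================================
Online Intel xeon-e5 nodes: 36
   Unclaimed nodes: 24
   Claimed slots: 172
   Claimed slots for exclusive jobs: 80
--------------------------------------------
   Available slots: 404
"""


def parse_status(text):
    """Return a dict mapping each label to its integer count.

    Hypothetical helper: it assumes every count appears on its own line
    as "<label>: <integer>", as in the sample above.
    """
    counts = {}
    for line in text.splitlines():
        # Lines such as the "LLGrid: txe1 ..." banner and the separators
        # do not end in an integer, so they are skipped.
        match = re.match(r"\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_status(SAMPLE_STATUS)
print(counts["Unclaimed nodes"], counts["Available slots"])  # prints: 24 404
```

On the system itself, the text passed to parse_status would presumably come from running LLGrid_status, for example by capturing its standard output with Python's subprocess module.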

You can also get information about the system status using the scheduler command sinfo:

[StudentX@login-0 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite     15  down* node-[039,049,051,058,063-072],raid-2
normal*      up   infinite      2    mix gpu-2,node-050
normal*      up   infinite     10  alloc gpu-1,node-[037-038,062,099,109-113]
normal*      up   infinite     24   idle node-[040-048,055-057,059-061,115-119,124,129-130],raid-1
normal*      up   infinite      1   down node-114
...
gpu          up   infinite      1  alloc gpu-1
dgx          up 1-00:00:00      1   idle txdgx

This command lists, for each partition, the nodes that are down (unavailable), fully allocated (alloc), partially allocated (mix), and idle. Note that the same node can appear in more than one partition: in the output above, gpu-1 is listed under both normal and gpu.
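Because a node can be listed under several partitions, adding up the NODES column over the whole output would count some nodes twice. The sketch below totals nodes per state within a single partition instead. It is a rough Python illustration that assumes the six-column layout shown above; the function name nodes_by_state is hypothetical, and the sample leaves out the elided "..." rows.

```python
from collections import Counter

# Rows copied from the sinfo output shown above (elided rows omitted).
SAMPLE_SINFO = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite     15  down* node-[039,049,051,058,063-072],raid-2
normal*      up   infinite      2    mix gpu-2,node-050
normal*      up   infinite     10  alloc gpu-1,node-[037-038,062,099,109-113]
normal*      up   infinite     24   idle node-[040-048,055-057,059-061,115-119,124,129-130],raid-1
normal*      up   infinite      1   down node-114
gpu          up   infinite      1  alloc gpu-1
dgx          up 1-00:00:00      1   idle txdgx
"""


def nodes_by_state(text, partition):
    """Total the NODES column per state for one partition."""
    totals = Counter()
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or truncated rows
        name, _avail, _limit, nodes, state, _nodelist = fields
        # In sinfo output a trailing "*" marks the default partition,
        # and a trailing "*" on a state marks nodes that are not responding.
        if name.rstrip("*") == partition:
            totals[state.rstrip("*")] += int(nodes)
    return totals


print(nodes_by_state(SAMPLE_SINFO, "normal")["idle"])  # prints: 24
```

Folding "down*" into "down" is a simplification chosen for this sketch; keep the raw state string if you need to tell the two apart.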

Monitoring Jobs

You can monitor your jobs using the LLstat command:

[StudentX@login-0 ~]$ LLstat
LLGrid: txe1 (running slurm 16.05.8)
JOBID     ARRAY_J    NAME        USER     START_TIME          PARTITION  CPUS  FEATURES  MIN_MEMORY  ST  NODELIST(REASON)
40986     40986      myJob       Student  2017-10-19T15:35:46 normal     1     xeon-e5   5G          R   gpu-2
40980_100 40980      myArrayJob  Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2
40980_101 40980      myArrayJob  Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2
40980_102 40980      myArrayJob  Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2

The output of the LLstat command lists, for each of your jobs, the job ID, name, start time, number of CPUs per task, status (R means running), and the node it is running on. If a job is in an error state, the output shows that as well.
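When an array job has many tasks, it can help to summarize this listing. The following Python sketch parses the table into one dictionary per row and counts running tasks per array job. It assumes the column layout shown above and that no field contains spaces; the function name parse_llstat is illustrative only.

```python
from collections import Counter

# Sample text copied from the LLstat output shown above.
SAMPLE_LLSTAT = """\
LLGrid: txe1 (running slurm 16.05.8)
JOBID     ARRAY_J    NAME        USER     START_TIME          PARTITION  CPUS  FEATURES  MIN_MEMORY  ST  NODELIST(REASON)
40986     40986      myJob       Student  2017-10-19T15:35:46 normal     1     xeon-e5   5G          R   gpu-2
40980_100 40980      myArrayJob  Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2
40980_101 40980      myArrayJob  Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2
40980_102 40980      myArrayJob  Student  2017-10-19T15:35:37 normal     1     xeon-e5   5G          R   gpu-2
"""


def parse_llstat(text):
    """Return one dict per job row, keyed by the column headers."""
    lines = text.splitlines()
    # The table header is the first line that starts with "JOBID".
    start = next(i for i, line in enumerate(lines) if line.startswith("JOBID"))
    header = lines[start].split()
    jobs = []
    for line in lines[start + 1:]:
        # Limit the splits so the last column is kept as a single field.
        fields = line.strip().split(None, len(header) - 1)
        if len(fields) == len(header):
            jobs.append(dict(zip(header, fields)))
    return jobs


jobs = parse_llstat(SAMPLE_LLSTAT)
# Count tasks in the running state ("R") for each array job ID.
running_per_array = Counter(job["ARRAY_J"] for job in jobs if job["ST"] == "R")
print(running_per_array["40980"])  # prints: 3
```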

Stopping Jobs

Jobs can be stopped using the LLkill command. Specify the job IDs you would like to stop as a comma-separated list, for example:

LLkill 40986,40980

This stops the jobs with job IDs 40986 and 40980. You can also use the LLkill command to stop all of your currently running jobs:

LLkill -u USERNAME
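If the job IDs come from a script rather than being typed by hand, the comma-separated argument can be assembled programmatically. This is a small Python sketch; the helper name llkill_command is hypothetical, and it only builds the argument list without running anything.

```python
def llkill_command(job_ids):
    """Build the argument list for stopping the given jobs with LLkill.

    Hypothetical helper: it only formats the command described above,
    joining the job IDs with commas.
    """
    if not job_ids:
        raise ValueError("at least one job ID is required")
    return ["LLkill", ",".join(str(job_id) for job_id in job_ids)]


command = llkill_command([40986, 40980])
print(" ".join(command))  # prints: LLkill 40986,40980
```

On a login node, the resulting list could then be handed to subprocess.run(command, check=True) to actually stop the jobs.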