Monitoring Jobs on compute.cla

Checking Job Status

To see the status of your job, enter qstat -a at the command prompt:

goldy@compute:~$ qstat -a

The result will look something like the following:

compute.cla.umn.edu:
                                                                        Req'd   Req'd      Elap
Job ID                   Username  Queue    Jobname    SessID NDS  TSK  Memory  Time     S Time
------------------------ --------- -------- ---------- ------ ---- ---- ------- -------- - --------
9876.compute.cla.umn.edu goldy     batch    myjob.pbs  60476  1    2    8gb     1:00:00  R 0:00:10
 
The output is mostly self-explanatory. The main item to note is the state (“S”) column, where “R” indicates that the job is running. Other entries you may see in that column are “Q” for “queued”, “E” for “exiting”, and “C” for “completed”.
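
If you only want to see jobs in a particular state, Torque's qselect command can be combined with qstat. The following is a quick sketch (it assumes qselect is available on compute.cla and uses the username goldy from the examples above):

goldy@compute:~$ qselect -u goldy -s R
goldy@compute:~$ qselect -u goldy -s R | xargs -r qstat -a

The first command prints just the IDs of your running jobs; the second feeds those IDs back to qstat -a for the full status lines. (The -r flag tells GNU xargs not to run qstat at all when qselect returns nothing.)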

The “qstat -f” command gives more detail on the jobs you have in the queue, including, for example, the execution host(s), the variable list, and the walltime remaining. More information on the qstat command can be found in its man page (man qstat).
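
For example, to dump the full record for the job shown earlier (job number 9876; substitute your own job ID) and to pull out just the walltime fields, something like the following should work:

goldy@compute:~$ qstat -f 9876
goldy@compute:~$ qstat -f 9876 | grep -i walltime

The grep line is only a convenience; it picks out the requested walltime and, for a running job, the walltime remaining from the full listing.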

Checking Job Array Status

To check the status of an entire job array, run qstat with the -t option. Each array element appears as a separate job in the queue, and the normal scheduling rules apply to each element. The name of the array is the job number assigned by PBS followed by a set of brackets. For example, if the assigned job number is 9876, the entire job array is denoted 9876[] and the individual jobs are 9876[1], 9876[2], 9876[3], and so on.

goldy@compute:~$ qstat -t

Job ID               Name             User     Time Use S Queue
-------------------- ---------------- -------- -------- - -----
2868[1].compute      test.pbs-1       goldy    00:00:06 R batch
2868[2].compute      test.pbs-2       goldy    0        R batch
2868[3].compute      test.pbs-3       goldy    0        Q batch
2868[4].compute      test.pbs-4       goldy    0        Q batch
2868[5].compute      test.pbs-5       goldy    0        Q batch
2868[6].compute      test.pbs-6       goldy    0        Q batch
2868[7].compute      test.pbs-7       goldy    0        Q batch
2868[8].compute      test.pbs-8       goldy    0        Q batch
2868[9].compute      test.pbs-9       goldy    0        Q batch
2868[10].compute     test.pbs-10      goldy    0        Q batch
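
You can also pass a single array element, or the whole array, to qstat -t. Using array 2868 from the output above (the quotes keep the shell from treating the brackets as a glob pattern):

goldy@compute:~$ qstat -t '2868[3]'
goldy@compute:~$ qstat -t '2868[]'

The first form should report only element 3; the second should expand to every element of the array.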

Checking Job Logs

By default, PBS writes a job's stdout and stderr logs to the job submission directory. (See the page on submitting jobs for information on how to have PBS log to another location.) If your job doesn't run as expected, check the stderr log for errors. If you submit a large batch of jobs, an easy way to check for errors is to look for stderr files whose size is greater than 0.

goldy@compute:~$ find . -type f -name "$JOBNAME.e*" -size +0c | xargs grep -iv loaded

Note: Module file loads and unloads get written to stderr, so Torque error logs will contain those messages even when nothing went wrong; the “| xargs grep -iv loaded” portion of the command above filters them out. If you aren't loading any modules when you run your job, you can drop that part and just check for error files with a size greater than 0. Here $JOBNAME is a placeholder for your job's name; substitute the name directly or set the variable before running the command.
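
As a concrete sketch using the earlier example (job script myjob.pbs, job number 9876; your file names will differ), the stdout and stderr logs normally land in the submission directory as <jobname>.o<jobid> and <jobname>.e<jobid>:

goldy@compute:~$ JOBNAME=myjob.pbs
goldy@compute:~$ find . -type f -name "$JOBNAME.e*" -size +0c
goldy@compute:~$ cat myjob.pbs.e9876

Setting JOBNAME first keeps the find command copy-and-pasteable; the final cat simply prints one error log so you can read the messages in full.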