CLA Compute Cluster FAQ

1. I'm running my batch job on the CLA compute cluster but I'm not seeing anything written to the output file that I specified in my script. Why is that?

2. I have a job that should take about 100 hours to run but hour 99 fell during the maintenance window so the cluster rebooted and I lost all my work. How can I prevent this from happening?

3. When I submit a job, how can I redirect the output and error files to a chosen path?

4. How can I direct the stdout and stderr to a single file?

5. I'm having trouble running a batch job on the CLA computing cluster. When I submit the job, I get a “No such file or directory” error.

 

1. I'm running my batch job on the CLA compute cluster but I'm not seeing anything written to the output file that I specified in my script. Why is that?

A: PBS writes the job output to the local scratch area on the node where your job is running. Those files won't be copied to the output file specified in your script until the job terminates. If you would like to monitor the progress of your batch job, you will need to include the appropriate code in your script to periodically write the relevant information to a file in your home directory (or other shared area of the filesystem that you have write access to). For example, if your script is iterating through a loop 10000 times and you want to know what iteration it's on, you can have the script write the value of 'i' (or whatever you are using) to your home directory on every 100th iteration. That said, a better solution is to add checkpointing to your script.
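The pattern above can be sketched in bash as follows. This is only an illustration; the file name "myjob.progress" and the loop bounds are made up for this example, and the real work of each iteration is elided:

```shell
#!/bin/bash
# Illustrative sketch: report progress to a file in the home directory
# every 100th iteration so the job can be monitored while it runs.
# "myjob.progress" and the 10000-iteration loop are example choices.
PROGRESS_FILE="$HOME/myjob.progress"

for (( i = 1; i <= 10000; i++ )); do
    # ... the real work for iteration $i would go here ...
    if (( i % 100 == 0 )); then
        echo "iteration $i" > "$PROGRESS_FILE"
    fi
done
```

While the job runs, you can then check its progress from a login node with "cat $HOME/myjob.progress".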

 

2. I have a job that should take about 100 hours to run but hour 99 fell during the maintenance window so the cluster rebooted and I lost all my work. How can I prevent this from happening?

A: The best way to prevent this is to add checkpointing to your script in order to save the state and allow it to restart where it left off. More specifics on how to do this can be found on our checkpointing page.
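As a rough illustration of the idea (the checkpoint file name "checkpoint.dat" and the 500-iteration interval are assumptions for this sketch, not cluster conventions), a bash loop can periodically record its state and resume from it after a restart:

```shell
#!/bin/bash
# Rough checkpointing sketch: save the loop counter periodically and,
# if a checkpoint exists at startup, resume from it instead of iteration 1.
# "checkpoint.dat" and the 500-step interval are arbitrary example choices.
CKPT="$HOME/checkpoint.dat"

start=1
if [ -f "$CKPT" ]; then
    start=$(( $(cat "$CKPT") + 1 ))   # resume after the last saved iteration
fi

for (( i = start; i <= 10000; i++ )); do
    # ... one unit of real work ...
    if (( i % 500 == 0 )); then
        echo "$i" > "$CKPT"           # record how far we have gotten
    fi
done

rm -f "$CKPT"                         # a finished run needs no checkpoint
```

If the node reboots mid-run, resubmitting the same job picks up from the last saved iteration rather than starting over. See the checkpointing page for application-specific approaches.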

 

3. When I submit a job, how can I redirect the output and error files to a chosen path?

A: You can do this in your PBS script by using PBS directives:

Output to redirect    PBS Directive          What it does
STDOUT                #PBS -o myprog.out     Redirects STDOUT to the myprog.out file,
                                             which will be located in $PBS_O_WORKDIR
STDERR                #PBS -e myprog.err     Redirects STDERR to the myprog.err file,
                                             which will be located in $PBS_O_WORKDIR
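For instance, the top of a job script using both directives might look like this (myprog.out, myprog.err, and myprog are example names):

```shell
#!/bin/bash
#PBS -o myprog.out    # STDOUT ends up in $PBS_O_WORKDIR/myprog.out
#PBS -e myprog.err    # STDERR ends up in $PBS_O_WORKDIR/myprog.err

cd $PBS_O_WORKDIR
./myprog              # hypothetical program
```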

 

4. How can I direct the stdout and stderr to a single file?

A: Use the “-j” PBS directive to merge the standard output and standard error streams into a single file:

 

PBS Directive    What it does
#PBS -j oe       Both streams are merged, intermixed, as standard output.
#PBS -j eo       Both streams are merged, intermixed, as standard error.
#PBS -j n        The two streams are kept in separate files. This is the
                 default when the “-j” directive is not used.
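For example, to capture everything a job prints in a single log file (myprog.log and myprog are example names):

```shell
#!/bin/bash
#PBS -j oe            # fold STDERR into STDOUT
#PBS -o myprog.log    # the merged stream lands in $PBS_O_WORKDIR/myprog.log

cd $PBS_O_WORKDIR
./myprog              # hypothetical program
```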

 

5. I'm having trouble running a batch job on the CLA computing cluster. When I submit the job, I get a “No such file or directory” error.

A: There are a few different things that can cause that error. Here are some things to check:

  1. Does your PBS script contain a “cd $PBS_O_WORKDIR” command?
  2. If so, are you submitting the job from the directory that contains your script?
  3. Did you create your PBS script on a Windows computer? If so, the carriage returns are the likely culprit. To fix the script:

       a. ssh to apollo.cla.umn.edu
       b. cd to the directory that contains your script
       c. rename your script to $scriptname.DIST (e.g., myscript.pbs.DIST)
       d. run the script through the dos2unix command:
              dos2unix -n myscript.pbs.DIST myscript.pbs

You can check that everything looks good by running the “cat -A” command on your script. It should look something like this (note the "$" symbol marking each end-of-line character):

#PBS -S /bin/bash$
#PBS -q batch$
#PBS -l nodes=1:ppn=4$
#PBS -l mem=32gb$
#PBS -l walltime=96:00:00$
#PBS -N myscript$
#PBS -o myscript.o$
#PBS -e myscript.e$
#PBS -m abe$
#PBS -M goldy@umn.edu$
$
cd $PBS_O_WORKDIR/$
module load python$
python myscript.py$