CLA Compute Cluster Intro

A cluster is a group of servers configured to work together so that, from the user's standpoint, they can essentially be viewed as one big computer. Configuring servers in this way allows for better resource allocation: users can request and reserve the resources they will need for the duration of their job. The cluster also provides parallel processing capabilities, in which a large task is broken down into smaller tasks that run simultaneously, which can greatly reduce the processing time needed to analyze the data.

Hardware and Configuration

The CLA cluster consists of a single head node (compute.cla.umn.edu) and 12 compute nodes. To manage these resources and allocate them efficiently, the cluster uses the TORQUE resource manager in conjunction with the Maui job scheduler. The cluster also has a small dedicated GPU available for testing code designed to run on this type of processor. If you would like to use the GPU, please let us know so that we can provide access to the appropriate queue.

Who can use the cluster?

Anyone with a CLA Linux/Unix account can use compute.cla and submit jobs to the batch and/or interactive queues. The other queues are restricted, with access granted on an individual, as-needed basis. Please see the tables in the Queues and Resource Limits section below for more information.

How do I use the cluster?

To use the cluster, you can either connect to LTS using an NX Client and then ssh to compute.cla.umn.edu, or ssh directly to compute.cla.umn.edu. More detailed information on accessing the cluster, including which connection method to choose, can be found at z.umn.edu/ltsconnect. Once you are on compute.cla.umn.edu, use the qsub command to submit your job to the cluster.
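For example, assuming your CLA username is goldy (substitute your own; the local prompt shown is only illustrative), connecting directly with ssh and then submitting a job looks like this:

mylaptop:~$ ssh [email protected]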

goldy@compute:~$ qsub myscript.pbs

More information on submitting jobs to the cluster can be found on our submitting jobs on compute.cla page.
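To give a sense of what such a script contains, here is a minimal sketch of what a file like myscript.pbs might look like; the job name, resource amounts, and program are placeholders to adapt to your own work:

#!/bin/bash
#PBS -N myjob                          # job name (placeholder)
#PBS -l nodes=1:ppn=2                  # request 1 node with 2 cores
#PBS -l mem=8gb                        # request 8 GB of RAM
#PBS -l walltime=04:00:00              # request 4 hours of walltime

cd $PBS_O_WORKDIR                      # start in the directory the job was submitted from
./myprogram input.dat > results.txt    # placeholder command; replace with your own program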

Why should I use the cluster?

There are several advantages to using the cluster instead of one of the CLA standalone computers. To start with, the cluster runs on newer hardware, with faster CPUs and more RAM than the standalone machines. It also has a GPU available for those interested in testing out GPU code. But the main reason to use the cluster has to do with resource allocation. When you submit a job on the cluster, you can specify parameters such as the number of cores you want, the amount of RAM you will need, and the maximum number of hours you expect your job to take. These resources are then reserved for you for the time requested or until your job finishes, whichever comes first.

If you were to run that same job on a standalone computer instead of the cluster, it would be contending for resources with whatever the other users on that computer happen to be doing at the time. If they were running jobs that require a lot of CPU and/or RAM, for example, there might not be enough resources left for your job to run in a timely manner. If you have ever logged in to a CLA standalone computer and found the system sluggish, or the program you were trying to use took an unusually long time to load, that's an indication that the system was bogged down by too many users contending for too few resources. This won't happen on the compute cluster because resources are allocated based on what each user requests for their job. You can specify, for example, that you will need 2 cores and 16GB of RAM for 24 hours, and those resources will be reserved for your use for that amount of time. On a standalone computer, you get whatever resources happen to be free at the moment, with no guarantee that they will be sufficient to run your job in a timely manner.
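As a rough illustration, the 2 cores / 16GB of RAM / 24 hours request described above could be expressed either on the qsub command line or as directives inside the script (myscript.pbs is a placeholder name):

goldy@compute:~$ qsub -l nodes=1:ppn=2,mem=16gb,walltime=24:00:00 myscript.pbs

or, equivalently, in the script itself:

#PBS -l nodes=1:ppn=2
#PBS -l mem=16gb
#PBS -l walltime=24:00:00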

Torque server and scheduler

As outlined above, a cluster differs from a group of standalone computers in that it offers users a ‘single system image’ in terms of the management of their jobs and the aggregate compute resources available. The software that CLA uses to integrate a group of servers into a cluster is called TORQUE.

TORQUE is a resource management system that is used for submitting and controlling jobs on the cluster. TORQUE manages jobs that users submit to various queues, with each queue representing a group of resources with attributes that are specific to that particular queue. TORQUE is based on the original open source Portable Batch System (OpenPBS) project and the terms TORQUE and PBS are often used interchangeably.

Whereas the TORQUE server provides a mechanism for submitting, launching, and tracking jobs on the cluster, the TORQUE scheduler is used to manage and schedule those jobs, determining when, where, and how jobs are run so as to maximize the output of the cluster.
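In practice, once a job has been submitted you interact with TORQUE and Maui through a handful of commands, for example (the username and job ID shown are just examples):

goldy@compute:~$ qstat -u goldy      # list your jobs and their states (Q = queued, R = running)
goldy@compute:~$ qstat -q            # show the available queues and their limits
goldy@compute:~$ checkjob 6822       # Maui's detailed view of a single job
goldy@compute:~$ qdel 6822           # cancel a job you no longer need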

Storage / scratch

Each of the compute nodes has a /scratch.local directory at the root of its file system. Since working with local files is much faster than working with files accessed over the network (e.g., files in your home directory or in /labs), this local scratch space is provided on each node to allow for faster file access. For example, if your program writes out files of intermediate and/or final results, writing those files to /scratch.local on the compute node will give you faster read/write access than writing them to a network share. When you submit a job, a directory is automatically created in /scratch.local for your scratch files. The name of the directory includes the PBS job ID of your job followed by ".compute.cla.umn.edu", e.g., 6822.compute.cla.umn.edu. Please note that for any files in /scratch.local that you want to save, your script MUST include commands to copy them from the local scratch directory to a destination on a network share (e.g., your home directory, /labs/$labname/foldername, etc.). Once your job terminates, your files in /scratch.local are deleted and cannot be recovered.
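As a sketch, a job script that works in local scratch and then copies its results back to a network share might look something like the following; the program, input file, and destination are placeholders, and it assumes the scratch directory name matches the value TORQUE places in $PBS_JOBID (e.g., 6822.compute.cla.umn.edu):

#!/bin/bash
#PBS -l nodes=1:ppn=2,mem=8gb,walltime=04:00:00

SCRATCH=/scratch.local/$PBS_JOBID          # job-specific scratch directory created for this job
cd $SCRATCH

cp $PBS_O_WORKDIR/input.dat .              # copy input from the submission directory (placeholder file)
~/bin/myprogram input.dat > results.txt    # placeholder program; writes its output to local scratch

# copy anything you want to keep back to a network share BEFORE the job ends
cp results.txt $PBS_O_WORKDIR/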

Queues and Resource Limits

The interactive and batch queues are available to everyone, but access to the highmem, GPU, and multinode queues is restricted; the tables below show the access restrictions and the default and maximum compute resources for each queue. If you find that you need more resources than the batch queue provides, let us know and we can add you to the highmem or multinode queues as needed. Access to the GPU queue is likewise restricted, so if you would like to use the GPU, please let us know and we can provide you access to that queue.

Linux Maintenance Window

Our normal maintenance window for Linux systems is the first Monday of the month from 5:00am to 6:00am. Servers may reboot or be unavailable during this time. To receive an email reminder of an upcoming maintenance window, join the list.

Routing Queue | Queue       | Access Control List | Direct Submission Allowed? | Interactive?
              | interactive | None                | Yes                        | Yes
batch*        |             | None                | Yes                        | No
              | short       | None                | No                         | No
              | long        | None                | No                         | No
              | highmem     | Yes                 | No                         | No
              | multinode   | Yes                 | No                         | No
gpu*          |             | Yes                 | Yes                        | Yes
              | gpu-prod    | Yes                 | No                         | Yes
              | gpu-dev     | Yes                 | No                         | Yes

*The batch and gpu queues are routing queues. Jobs submitted to these queues will be routed to the appropriate sub-queue based on the requested walltime, procs (cores), and RAM.

For access to the highmem or multinode queues, please email us at [email protected] with your needs.

The table below shows the default and maximum resources for each queue.

Queue       | Default Walltime | Default RAM (GB) | Default Procs (Cores) | Maximum Walltime | Maximum RAM (GB) | Maximum Procs (Cores)
interactive | 48:00:00         | 8                | 2                     | 168:00:00        | 64               | 10
short       | 48:00:00         | 8                | 2                     | 95:59:59         | 128              | 20
long        | 96:00:00         | 8                | 2                     | 336:00:00        | 64               | 10
highmem     | 24:00:00         | 32               | 2                     | 95:59:59         | 250              | 40
multinode   | 04:00:00         | 8                | 2                     | 168:00:00        | 750              | 120
gpu-prod    | 08:00:00         | 32               | 2                     | 48:00:00         | 96               | 20
gpu-dev     | 04:00:00         | 8                | 2                     | 04:00:00         | 8                | 2
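If your job needs a particular queue (and you have access to it), you can name the queue explicitly with qsub's -q option. For example (the resource amounts are only illustrative):

goldy@compute:~$ qsub -q batch myscript.pbs       # routed to short or long based on the resources requested
goldy@compute:~$ qsub -I -q interactive -l nodes=1:ppn=2,mem=8gb,walltime=08:00:00       # interactive session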