Checkpointing jobs on the cluster

What is checkpointing?

Checkpointing is a method of periodically saving the state of a job so that, in the event that the job is interrupted, it can be restarted at the point where it left off.

Why should I add checkpointing to my job script?

Without checkpointing, a system interruption, such as a maintenance reboot, power outage, or hardware failure, means your job has to be restarted from the beginning. If you have added checkpointing to your script (and your checkpointing is working properly), you can restart your job where it left off rather than lose hours of processing time by starting over.

Where do I place checkpoints in my script?

You can place checkpoints anywhere in your workflow, but keep in mind that checkpoints come with a performance cost. Every time you take a checkpoint, data is collected and written to disk, which can slow down execution of your job. So while you could, in theory, save the value of every variable at every step in your program, it's generally not in your best interest to do so. Be judicious about how often, and how much, you checkpoint.

When planning how to checkpoint a script, take into consideration the process of resuming the workflow. A good rule of thumb: if the time it takes to rerun a section of the script is greater than the time it takes to write the checkpoint state to disk, that section is a good candidate for checkpointing. Add a checkpoint after each significant part of your script that takes a long time to complete, such as after each pass through a complex loop whose calculations would take a lot of time to redo if the job were interrupted. Another example is a loop where each iteration runs quickly but the loop runs, say, 10^6 times. In a case like that, you may want to write the result to disk only once every 10^3 iterations rather than on every pass. Checkpointing isn't always straightforward, so it's important to analyze your code to determine what to checkpoint, and where, to ensure your checkpointing is done efficiently.
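The "checkpoint only every N iterations" idea above can be sketched as follows. This is a minimal illustration, not a prescribed pattern: the file name "state", the 10^6 iteration count, the checkpoint interval of 1000, and the running total standing in for real work are all assumptions made for the example.

```python
import os

N = 10**6            # total iterations (illustrative)
CHECKPOINT_EVERY = 1000  # write state only every 1000 iterations

# Recover the starting point and partial result if a state file exists;
# otherwise bootstrap at iteration 0.
start, total = 0, 0
if os.path.exists("state"):
    with open("state") as f:
        start, total = (int(x) for x in f.read().split())

for i in range(start, N):
    total += i  # stand-in for the real per-iteration work

    # Checkpoint only every CHECKPOINT_EVERY iterations, so the cost of
    # writing to disk is paid 1000 times instead of 10^6 times.
    if (i + 1) % CHECKPOINT_EVERY == 0:
        with open("state", "w") as f:
            f.write(f"{i + 1} {total}")

print(total)

# Cleanup once the loop completes
os.remove("state")
```

If the job is interrupted, at most CHECKPOINT_EVERY iterations of work are lost, which is the trade-off you tune when choosing the interval.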

How do I add checkpointing to my script?

This will vary depending on what your script is doing but the general formula for checkpointing can be broken down as:

  1. Look for a state file. The name of the file can be hardcoded or, preferably, passed as a parameter to the script.
  2. If the state file exists, restore the script to the place where it left off and continue from there.
  3. If the state file doesn't exist, bootstrap.
  4. Save the data and state periodically as the script runs. You will need to save everything that the script will need in order to restart at the place it left off.

Examples

The following examples should give you an idea of how to get started with checkpointing your script.

In R:


# Rscript --vanilla chkpoint_example.R
#
# chkpoint_example.R
#
# Example of an R script with checkpointing
#

# Recover starting point if it exists
start <- suppressWarnings(try(as.integer(readLines("state", n = 1)), silent = TRUE))

# Otherwise bootstrap at 1
if (inherits(start, "try-error") || is.na(start))
  start <- 1

# If start is greater than 1, we load the saved object from the last run.
# Otherwise, we load our initial data into a data frame.
if (start > 1) {
  load("saveddf.RData")
} else {
  df <- read.csv("mydata.csv", header = TRUE)
}

set.seed(1)
for (i in start:nrow(df)) {
  # Do some work on your data frame here...
  res <- lapply(df, ...)

  # Save the data frame object and index so the job can be restarted where it
  # left off in the event that it gets interrupted.
  df <- res
  save(df, file = "saveddf.RData")
  write(i, file = "state")
}

# Done processing. Save any results that you need before doing cleanup.
write.csv(df, "final_results.csv")

# Cleanup
unlink("saveddf.RData")
unlink("state")

In Python:


#!/usr/bin/env python
#
# chkpoint_example.py
#
# Example of a Python script with checkpointing
#

import pandas as pd
import random
import os
import sys

def process_data():
    # Recover starting point if it exists
    try:
        with open("state", "r") as file:
            start = int(file.read())

    # Otherwise bootstrap at 0
    except (FileNotFoundError, ValueError):
        start = 0

    # If start is not 0, we load the saved object from the last run.
    # Otherwise, we load our initial data into a data frame.
    if start != 0:
        df = pd.read_csv("saveddf.csv")
    else:
        df = pd.read_csv("mydata.csv")

    random.seed(1)
    # Resume at the saved row rather than reprocessing the whole data frame.
    for row in df.iloc[start:].itertuples(index=True):
        # Do some work on your data frame here...
        res = df.apply(...)

        # Save the data frame object and row number so the job can be restarted
        # where it left off in the event that it gets interrupted.
        res.to_csv("saveddf.csv")
        with open("state", "w") as file:
            file.write(str(row.Index))

    # Done processing. Save any results that you need before doing cleanup.
    res.to_csv("final_results.csv")

    # Cleanup
    os.remove("saveddf.csv")
    os.remove("state")

if __name__ == "__main__":
    sys.exit(process_data())

For other programming languages, the process is similar. Look for sections of your code that perform complex calculations and write the results of those calculations to disk, along with an index or other variable that would allow you to restart where the code was interrupted.