Gábor Samu
Creator of this blog.
Apr 12, 2022 11 min read

LSF hookin' up with the CRIU


With the unpredictable spring weather here in Southern Ontario, weekend projects are the order of the day. Whether it’s fixing my bike for spring, repairing things in the home which I’ve neglected for far too long, or topics relating to IT which have been percolating in my head, I am a textbook busybody.

A few decades back, when I was a support engineer at Platform Computing, I had my first experience working with clients using both kernel-level and user-level checkpoint and restart through the HPC workload scheduler Platform LSF (now IBM Spectrum LSF). I distinctly recall that the user-level library was a bit tricky, as you had to link your home-grown code against it, and it had numerous limitations which I can’t recall off the top of my head. Then, as now, LSF provided a number of ways for administrators to extend its capabilities using plug-ins. Checkpoint and restart is one example where plug-ins can be used. More about this later.

I’ve been keeping an eye on the project known as CRIU for some time. CRIU, which stands for Checkpoint/Restore In Userspace, provides checkpoint and restart functionality on Linux. I thought it might be an interesting weekend project to integrate CRIU with LSF. As it turns out, I was not blazing any trails here, as I found that others are already using CRIU with LSF today. Nevertheless, I decided to give it a try.

My system of choice for this tinkering was a dual-socket POWER9 based system running CentOS Stream 8 and IBM Spectrum LSF Suite for HPC v10.2.0.12. The LSF online documentation contains information on the specifications of the LSF plugins for checkpoint and restart. The plugins are known as echkpnt and erestart, where the “e” denotes external.

Here is a quick rundown on the steps to integrate CRIU with LSF.

  • It turns out that my system already had criu installed. It’s a dependency of runc, which was installed as part of podman. This step really depends on your distro. In my case, dnf provides criu was my friend.
# uname -a
Linux kilenc 4.18.0-373.el8.ppc64le #1 SMP Tue Mar 22 1539 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux

# criu

Usage:
  criu dump|pre-dump -t PID [<options>]
  criu restore [<options>]
  criu check [--feature FEAT]
  criu page-server
  criu service [<options>]
  criu dedup
  criu lazy-pages -D DIR [<options>]

Commands:
  dump           checkpoint a process/tree identified by pid
  pre-dump       pre-dump task(s) minimizing their frozen time
  restore        restore a process/tree
  check          checks whether the kernel support is up-to-date
  page-server    launch page server
  service        launch service
  dedup          remove duplicates in memory dump
  cpuinfo dump   writes cpu information into image file
  cpuinfo check  validates cpu information read from image file

Try -h|--help for more info
  • The criu command needs to be run as root to be able to checkpoint processes. As we are going to leverage criu directly in the LSF echkpnt and erestart scripts, I chose to enable sudo access for criu. To do this I simply added the following to /etc/sudoers.
gsamu   ALL=NOPASSWD:/usr/sbin/criu
  • Next, I tested that the basic criu functionality was working. I found this to be a useful blog on how to perform a simple test.

  • With criu installed and working (see step 3), the next step was to create the echkpnt and erestart scripts which would ultimately call the appropriate criu dump and criu restore commands. These scripts will be named echkpnt.criu and erestart.criu. The .criu extension denotes the checkpoint and restart method name in LSF. The checkpoint method is specified at the time of job submission in LSF.

The key for the echkpnt.criu script is to build out the list of PIDs for the job in question. For this I used an inelegant approach - simply scraping the output of the LSF bjobs -l command. This list of PIDs is then used as arguments to the criu dump command. The example echkpnt.criu script is included below.

Example echkpnt.criu script:
#!/bin/csh -f 

# Example external check pointing routine for CRIU (https://criu.org/Main_Page) 
# echkpnt  [-c] [-f] [-k|-s] -d chkpnt_dir process-group-id
# tasks:
# 
# 1) Check parameters
# 2) Get job PIDS 
# 3) Invoke appropriate criu command to checkpoint the job PIDS

setenv PATH /usr/bin:/bin:/usr/etc:$PATH
set usage="Usage: $0 [-k] -d chkpnt_dir process-group-id"

# 1) Check parameters
while (x$1 != x)
	switch ($1)
	case -k: 
	    set killflag=TRUE
	    shift
	    breaksw
	case -d: 
	    set chkpntdir=$2
	    shift
	    shift
	    breaksw
	case -c: 
	    shift
	    breaksw
	case -s: 
	    set killflag=TRUE
	    shift
	    breaksw
	case -f: 
	    shift
	    breaksw
	case -*: 
	    echo "Illegal argument $1"
	    echo "$usage"
	    exit 1
	    breaksw
	default: 
	    break
	endsw
end

if ($#argv != 1) then
	echo "$usage"
	exit 1
endif
set progrpid=$1

if ($?chkpntdir != 1) then
	echo "$usage"
	exit 1
endif

if (! -e $chkpntdir) then
	echo "The check point directory does not exist."
	exit 1
endif

if (! -d $chkpntdir) then
	echo "The check point directory is not a directory."
	exit 1
endif

#
# 2) Get job PIDS 
# We scrape the output of bjobs to get the PGID, PIDS for the job. 
# Right now this only considers a job with a single PGID. 
#
set bjobs=`bjobs -l $LSB_JOBID |grep PGID`
set jobpids=`echo $bjobs | awk '{for(i=6;i<=NF;i++)printf "%s ",$i;printf "\n"}'`
set chkpnt=`echo $chkpntdir|awk '{split($1,dir,".");print dir[1]}'`

#
# 3) Invoke appropriate criu command to checkpoint the job PIDS
# For the case when echkpnt -k is called (to checkpoint and terminate the job). 
# Otherwise, checkpoint the job and leave it running.
#  
foreach pid ($jobpids)  
    if ($?killflag == 1) then
       sudo criu dump -t $pid -j -D $chkpnt --shell-job --file-locks --ext-unix-sk --tcp-established
    else
       sudo criu dump -t $pid -j -D $chkpnt --leave-running --shell-job --file-locks --ext-unix-sk --tcp-established
    endif
end 

exit 0
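
The PID scraping in step 2 of the script above depends on a fixed field position in the PGID line. If grep returns only the PGID line as it appears in the sample bjobs output later in this post, field 6 is the third PID. As a rough alternative, here is a minimal sketch in POSIX sh that keys on the "PIDs:" label instead. The function name extract_pids and its optional skip argument are my own inventions for illustration and are not part of LSF. Like the original, it assumes the PIDs are not wrapped onto a continuation line.

```shell
#!/bin/sh
# extract_pids [SKIP]
# Hypothetical helper: reads `bjobs -l` output on stdin and prints the PIDs
# that follow the "PIDs:" label on each PGID line, space-separated.
# SKIP (default 0) drops that many leading PIDs from each PGID line, e.g. to
# leave out wrapper processes that sit above the actual job.
extract_pids() {
    awk -v skip="${1:-0}" '
        /PGID:/ {
            seen = 0; k = 0
            for (i = 1; i <= NF; i++) {
                # Print every field after the label once SKIP fields have passed.
                if (seen && ++k > skip) printf "%s%s", (n++ ? " " : ""), $i
                if ($i == "PIDs:") seen = 1
            }
        }
        END { if (n) printf "\n" }
    '
}
```

Inside a checkpoint script this could be used as `bjobs -l "$LSB_JOBID" | extract_pids`, with the result passed to criu dump one PID at a time.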

I used a simple approach as well for erestart.criu. As per the specification for erestart, the key is to create a new LSF jobfile which contains the appropriate criu restore invocation, pointing to the checkpoint data. The example erestart.criu script is included below.

Example erestart.criu script:
#!/bin/sh 

#
# Example external checkpoint restart routine for CRIU (https://criu.org/Main_Page).
# erestart  [-c] [-f] chkpnt_dir 
# tasks:
# 1) Check parameters
# 2) Check LSF env variables for checkpoint
# 3) Build a new job file that replaces the original command with "criu restore"
# 4) Put the new job file in .restart_cmd.
# 5) exit 0 to tell erestart that erestart.criu succeeded.
#

PATH=/usr/bin:/bin:/usr/etc:$PATH
export PATH
usage="Usage: $0 [-c] [-f] chkpnt_dir"

#
# 1) Check parameters
# "chkpnt_dir" is the new job_id
#
while [ "$1" != "" ]
do
        case $1 in
        -c)
            shift
	    ;;
        -f)
            shift
	    ;;
        *)
            break
	    ;;
        esac
done

#
# Save the chkpnt_dir for future
#
new_jobid="$1"

#
# 2) Check LSF env variables for checkpoint
#
if [ -f $LSB_CHKFILENAME ]
then
    :
else
    echo "Can not find $LSB_CHKFILENAME" 1>&2
    exec 2<&-
    exit 1
fi

# if LSB_CHKPNT_DIR is not defined, set it up (for LSF 3.1)
if [ _$LSB_CHKPNT_DIR = '_' ]; then
    LSB_CHKPNT_DIR=`dirname $LSB_CHKFILENAME`
fi

if [ -d $LSB_CHKPNT_DIR ]
then
    :
else
    echo "Can not find $LSB_CHKPNT_DIR" 1>&2
    exec 2<&-
    exit 1
fi

#
# 3) Build a new job file that replaces the original command with "criu restore"
#
new_jobfile=$LSB_CHKFILENAME.criu.restart
if [ -f "$new_jobfile" ]; then
    rm -rf $new_jobfile
fi
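# Copy the header of the original job file, up to and including the
# "User input" marker. Plain [ ] is used so this also works under a
# non-bash /bin/sh.
while IFS= read -r line
do
    printf '%s\n' "$line" >> "$new_jobfile"
    if [ "$line" = "# LSBATCH: User input" ]; then
        break
    fi
done < "$LSB_CHKFILENAME"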

echo "sudo criu restore -j -D $LSB_CHKPNT_DIR --shell-job" >> "$new_jobfile"
# Single quotes keep $? and the backticks literal, so they are evaluated
# when the generated job file runs, not while this script is writing it.
echo 'ExitStat=$?' >> "$new_jobfile"
echo "wait" >> "$new_jobfile"
echo "# LSBATCH: End user input" >> "$new_jobfile"
echo "true" >> "$new_jobfile"
echo 'exit `expr $? "|" $ExitStat`' >> "$new_jobfile"

chmod 700 $new_jobfile

#
# 4) Put the new job file in .restart_cmd.
#
echo LSB_RESTART_CMD=$new_jobfile > $LSB_CHKPNT_DIR/.restart_cmd
echo LSB_USE_MY_JOBFILE=Y >> $LSB_CHKPNT_DIR/.restart_cmd

# 5) exit 0 to tell erestart that erestart.criu succeeded.
exit 0
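
The job file construction in step 3 can also be written more compactly. The following is a sketch only, under the assumption that the original job file contains the "# LSBATCH: User input" marker line exactly as the script above expects. The function name write_restart_jobfile and its three arguments are hypothetical. The trailer mirrors the lines the script above appends, with $? kept literal so that it is evaluated when the job file runs.

```shell
#!/bin/sh
# write_restart_jobfile ORIG NEW CHKDIR
# Hypothetical helper: copy the header of the original LSF job file ORIG
# (through the "User input" marker) into NEW, then append a criu restore
# command pointing at the checkpoint images in CHKDIR.
write_restart_jobfile() {
    orig=$1
    new=$2
    chkdir=$3
    marker='# LSBATCH: User input'

    # sed prints every line up to and including the marker, then quits.
    sed "/^${marker}\$/q" "$orig" > "$new" || return 1

    # Unquoted here-document: $chkdir is expanded now.
    cat >> "$new" <<EOF
sudo criu restore -D $chkdir --shell-job
EOF

    # Quoted here-document: \$? and the backticks are written literally.
    cat >> "$new" <<'EOF'
ExitStat=$?
wait
# LSBATCH: End user input
true
exit `expr $? "|" $ExitStat`
EOF

    chmod 700 "$new"
}
```

In an erestart script this could be called as `write_restart_jobfile "$LSB_CHKFILENAME" "$LSB_CHKFILENAME.criu.restart" "$LSB_CHKPNT_DIR"`, using the same environment variables as the script above.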

  • With the echkpnt.criu and erestart.criu scripts in the $LSF_SERVERDIR directory, the process to perform a checkpoint and restart of LSF jobs is straightforward using the bchkpnt and brestart commands respectively. Here is a simple example.

  • Submit a job as checkpointable. The checkpoint method criu is specified as well as the location where the checkpoint data will be written to.

$ bsub -k "/home/gsamu/checkpoint_data method=criu" ./criu_test
Job <12995> is submitted to default queue <normal>.
  • The executable criu_test simply writes a message to standard out every 3 seconds.
$ bpeek 12995
<< output from stdout >>
0: Sleeping for three seconds ...
1: Sleeping for three seconds ...
2: Sleeping for three seconds ...
3: Sleeping for three seconds ...
4: Sleeping for three seconds ...
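
The source of criu_test is not shown in this post. For anyone wanting to follow along, a stand-in with the same visible behaviour could look something like the sketch below. The function name count_loop and its two optional arguments are hypothetical and are there only so the loop can be bounded for quick testing.

```shell
#!/bin/sh
# count_loop [ITERATIONS] [DELAY_SECONDS]
# Hypothetical stand-in for criu_test: prints a numbered message, sleeps,
# and repeats. ITERATIONS=0 (the default) loops forever; DELAY defaults to 3.
count_loop() {
    max=${1:-0}
    delay=${2:-3}
    i=0
    while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
        echo "$i: Sleeping for three seconds ..."
        sleep "$delay"
        i=$((i + 1))
    done
}
```

Saved as an executable script that ends with a call to `count_loop "$@"`, it could be submitted with bsub in the same way as ./criu_test above.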
  • Next, we see that LSF has detected the job PIDs. Now we’re ready to perform the checkpoint.

    $ bjobs -l 12995
     
    Job <12995>, User <gsamu>, Project <default>, Status <RUN>, Queue <normal>, Com
                         mand <./criu_test>, Share group charged </gsamu>
    Tue Apr 12 0828: Submitted from host <kilenc>, CWD <$HOME>, C
                         heckpoint directory </home/gsamu/checkpoint_data/12995>;
    Tue Apr 12 0829: Started 1 Task(s) on Host(s) <kilenc>, Alloc
                         ated 1 Slot(s) on Host(s) <kilenc>, Executio
                         n Home </home/gsamu>, Execution CWD </home/gsamu>;
    Tue Apr 12 0838: Resource usage collected.
                         MEM: 12 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 4
                         PGID: 418130;  PIDs: 418130 418131 418133 
     
     
     MEMORY USAGE:
     MAX MEM: 12 Mbytes;  AVG MEM: 6 Mbytes
     
     SCHEDULING PARAMETERS:
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -  
     loadStop    -     -     -     -       -     -    -     -     -      -      -  
     
     RESOURCE REQUIREMENT DETAILS:
     Combined: select[type == local] order[r15s:pg]
     Effective: select[type == local] order[r15s:pg] 

  • Initiate the checkpoint using the LSF bchkpnt command. The -k option is specified, which results in the job being checkpointed and then killed.

    $ bchkpnt -k 12995
    Job <12995> is being checkpointed

  • We see in the history of the job using the bhist command that the checkpoint was initiated and succeeded. The job was subsequently killed (TERM_CHKPNT).

    $ bhist -l 12995
     
    Job <12995>, User <gsamu>, Project <default>, Command <./criu_test>
    Tue Apr 12 0828: Submitted from host <kilenc>, to Queue <norm
                         al>, CWD <$HOME>, Checkpoint directory </home/gsamu/checkp
                         oint_data/12995>;
    Tue Apr 12 0829: Dispatched 1 Task(s) on Host(s) <kilenc>, Al
                         located 1 Slot(s) on Host(s) <kilenc>, Effec
                         tive RES_REQ <select[type == local] order[r15s:pg] >;
    Tue Apr 12 0831: Starting (Pid 418130);
    Tue Apr 12 0831: Running with execution home </home/gsamu>, Execution CWD <
                         /home/gsamu>, Execution Pid <418130>;
    Tue Apr 12 0814: Checkpoint initiated (actpid 419029);
    Tue Apr 12 0815: Checkpoint succeeded (actpid 419029);
    Tue Apr 12 0815: Exited with exit code 137. The CPU time used is 2.1 second
                         s;
    Tue Apr 12 0815: Completed <exit>; TERM_CHKPNT: job killed after checkpoint
                         ing;
    		     
      
    MEMORY USAGE:
    MAX MEM: 12 Mbytes;  AVG MEM: 11 Mbytes
     
    Summary of time in seconds spent in various states by  Tue Apr 12 0815
      PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
      1        0        346      0        0        0        347         

  • Restart the job from the checkpoint data with the LSF brestart command. A new jobID is assigned.

    $ brestart /home/gsamu/checkpoint_data/ 12995 
    Job <12996> is submitted to queue <normal>.
    
    $ bjobs -l 12996
     
    Job <12996>, User <gsamu>, Project <default>, Status <RUN>, Queue <normal>, Com
                         mand <./criu_test>, Share group charged </gsamu>
    Tue Apr 12 0857: Submitted from host <kilenc>, CWD <$HOME>, R
                         estart, Checkpoint directory </home/gsamu/checkpoint_data/
                         /12996>;
    Tue Apr 12 0858: Started 1 Task(s) on Host(s) <kilenc>, Alloc
                         ated 1 Slot(s) on Host(s) <kilenc>, Executio
                         n Home </home/gsamu>, Execution CWD </home/gsamu>;
    Tue Apr 12 0807: Resource usage collected.
                         MEM: 14 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 5
                         PGID: 420069;  PIDs: 420069 420070 420073 420074 420076 
     
     
     MEMORY USAGE:
     MAX MEM: 14 Mbytes;  AVG MEM: 14 Mbytes
     
     SCHEDULING PARAMETERS:
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -  
     loadStop    -     -     -     -       -     -    -     -     -      -      -  
     
     RESOURCE REQUIREMENT DETAILS:
     Combined: select[type == local] order[r15s:pg]
     Effective: select[type == local] order[r15s:pg] 

  • Viewing the standard output of the job, we see the point where it was killed and that it has picked up from where it left off.

    $ bpeek 12996
    << output from stdout >>
    0: Sleeping for three seconds ...
    1: Sleeping for three seconds ...
    2: Sleeping for three seconds ...
    3: Sleeping for three seconds ...
    4: Sleeping for three seconds ...
    ….
    ….
    110: Sleeping for three seconds ...
    111: Sleeping for three seconds ...
    112: Sleeping for three seconds ...
    113: Sleeping for three seconds ...
    /home/gsamu/.lsbatch/1649767708.12995: line 8: 418133 Killed                  ./criu_test
    114: Sleeping for three seconds ...
    115: Sleeping for three seconds ...
    116: Sleeping for three seconds ...
    117: Sleeping for three seconds ...
    118: Sleeping for three seconds ...
    119: Sleeping for three seconds ...
    120: Sleeping for three seconds ...
    ....
    ....

We’ve demonstrated how one can integrate CRIU checkpoint and restart with IBM Spectrum LSF using the echkpnt and erestart interfaces. As highlighted earlier, LSF provides a number of plugin interfaces which provide flexibility to organizations looking to do site-specific customizations.