Grid Community Toolkit
6.2.1567772254 (tag: v6.2.20190906)
GRAM Job Manager Scheduler Tutorial
This tutorial describes the steps needed to build a GRAM Job Manager Scheduler interface package. It is intended for anyone interested in adding support for a new scheduler interface to GRAM. It assumes some familiarity with GPT, autoconf, automake, and Perl. As a reference point, it refers to the code in the LSF Job Manager package.
This section deals with writing the Perl module which implements the interface between the GRAM job manager and the local scheduler. Consult the globus_gram_job_manager_script_interface section of this manual for a more detailed reference on the Perl modules which are used here.
The scheduler interface is implemented as a Perl module which is a subclass of the Globus::GRAM::JobManager module. Its name must match the scheduler type string used when the service is installed. For the LSF scheduler, the name is lsf, so the module name is Globus::GRAM::JobManager::lsf and it is stored in the file lsf.pm. Though there are several methods in the JobManager interface, the only ones which absolutely need to be implemented in a scheduler module are submit, poll, and cancel.
We'll begin by looking at the start of the lsf source module, lsf.in (the transformation to lsf.pm happens when the setup script is run). To begin the script, we import the GRAM support modules into the scheduler module's namespace, declare the module's namespace, and declare this module as a subclass of the Globus::GRAM::JobManager module. All scheduler packages will need to do this, substituting the name of the scheduler type being implemented where we see lsf below.
Next, we declare any system-specific values which will be substituted when the setup package scripts are run. In the LSF case, we need to know the paths to a few programs which interact with the scheduler:
The values surrounded by at-signs (such as @MPIRUN@) will be replaced with the paths to the named programs by the find-lsf-tools script described below.
Scheduler interfaces which need to set up some data before their other methods are called can override the new method, which acts as a constructor. Scheduler scripts which don't need any per-instance initialization do not need to provide a constructor; the Globus::GRAM::JobManager constructor will do the job.
If you do need to override this method, be sure to call the JobManager module's constructor to allow it to do its initialization, as in this example:
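As a rough illustration of the subclass declaration and constructor pattern, here is a minimal sketch. So that it runs without a Globus installation, it uses a hypothetical stand-in base class, Sketch::JobManager, in place of Globus::GRAM::JobManager. The default_queue field is also invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Globus::GRAM::JobManager so the sketch runs
# without a Globus installation. The real base class has its own constructor.
package Sketch::JobManager;

sub new
{
    my ($class, $job_description) = @_;
    my $self = { JobDescription => $job_description };
    return bless $self, $class;
}

# The scheduler module. In a real package this would be
# Globus::GRAM::JobManager::lsf, a subclass of Globus::GRAM::JobManager.
package Sketch::JobManager::lsf;

our @ISA = ('Sketch::JobManager');

sub new
{
    my $proto = shift;
    my $class = ref($proto) || $proto;

    # Let the base class do its initialization first.
    my $self = $class->SUPER::new(@_);

    # Scheduler-specific, per-instance setup goes here (illustrative value).
    $self->{default_queue} = 'normal';

    return $self;
}

package main;

my $manager = Sketch::JobManager::lsf->new({ executable => '/bin/date' });
print ref($manager), "\n";
```

The important step is the call to SUPER::new, which gives the base class a chance to populate the object before any scheduler-specific fields are added.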
The job interface methods are called with only one argument, the scheduler object itself. That object contains a Globus::GRAM::JobDescription object ($self->{JobDescription}) which includes the values from the RSL string associated with the request, as well as a few extra values:
Now, let's look at the methods which will interface to the scheduler.
All scheduler modules must implement the submit method. This method is called when the job manager wishes to submit the job to the scheduler. The information in the original job request RSL string is available to the scheduler interface through the JobDescription data member of its hash.
For most schedulers, this is the longest method to be implemented, as it must decide what to do with the job description and convert it to something which the scheduler can understand.
We'll look at some of the steps in the LSF manager code to see how the scheduler interface is implemented.
In the beginning of the submit method, we'll get our parameters and look up the job description in the manager-specific object:
Then we will check for values of the job parameters that we will be handling. For example, this is how we check for a valid job type in the LSF scheduler interface:
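The kind of check involved can be sketched as follows. This is not the LSF module's actual code: the accepted type names (single, multiple, mpi) are assumed from the job types discussed in this tutorial, and the helper name is hypothetical. A real module would return a GRAM error object rather than a boolean.

```perl
use strict;
use warnings;

# Job types this hypothetical scheduler interface accepts (assumed names).
my %SUPPORTED_JOBTYPE = map { $_ => 1 } qw(single multiple mpi);

# Returns true when the job type is absent (the default applies) or supported.
sub jobtype_is_supported
{
    my ($jobtype) = @_;
    return 1 unless defined $jobtype;
    return exists $SUPPORTED_JOBTYPE{$jobtype} ? 1 : 0;
}

print jobtype_is_supported('mpi') ? "supported\n" : "unsupported\n";
```

Rejecting unsupported values at the top of submit keeps the rest of the method free of special cases.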
The lsf module supports most of the core RSL attributes, so it does more processing to determine what to do with the values in the job description.
Once we've inspected the JobDescription we'll know what we need to tell the scheduler about so that it'll start the job properly. For LSF, we will construct a job description script and pass that to the bsub command. This script is a Bourne shell script with some special comments which LSF uses to decide what constraints to use when scheduling the job.
First, we'll open the new file, and write the file header:
Then, we'll add some special comments to pass job constraints to LSF:
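A minimal sketch of generating the file header and the #BSUB comment lines is given here. It takes a plain hash instead of a JobDescription object, and the hash keys and function name are hypothetical. The bsub options used (-q, -P, -W, -n, -o, -e) are standard LSF options, but which attributes the real LSF module maps to them is not shown in this tutorial.

```perl
use strict;
use warnings;

# Build the start of an LSF job script from a plain hash of job settings.
# The hash keys are illustrative; a real module reads a JobDescription object.
sub lsf_script_header
{
    my (%job) = @_;
    my @lines = ('#!/bin/sh', '# LSF batch job script (sketch)');

    push @lines, "#BSUB -q $job{queue}"         if defined $job{queue};
    push @lines, "#BSUB -P $job{project}"       if defined $job{project};
    push @lines, "#BSUB -W $job{max_wall_time}" if defined $job{max_wall_time};
    push @lines, "#BSUB -n $job{count}"         if defined $job{count};
    push @lines, "#BSUB -o $job{stdout}"        if defined $job{stdout};
    push @lines, "#BSUB -e $job{stderr}"        if defined $job{stderr};

    return join("\n", @lines) . "\n";
}

print lsf_script_header(queue => 'normal', count => 4, stdout => '/tmp/job.out');
```

Only attributes present in the request produce a #BSUB line, so the scheduler's own defaults apply to everything else.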
Before we start the executable in the LSF job description script, we will quote and escape the job's arguments so that they will be passed to the application as they were in the job submission RSL string:
At the end of the job description script, we actually run the executable named in the JobDescription. For LSF, we support a few different job types which require different startup commands. Here, we quote and escape the strings in the argument list so that the values of the arguments will be identical to those in the initial job request string. For this Bourne shell script, we double-quote each argument and escape the backslash (\), dollar-sign ($), double-quote ("), and backquote (`) characters, which are the characters that remain special inside double quotes. We will use this new string later in the script.
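The quoting step can be sketched as a small helper. This is an illustration, not the LSF module's code. It escapes the four characters that keep a special meaning inside double quotes in a Bourne shell: backslash, dollar sign, double quote, and backquote. The function names are hypothetical.

```perl
use strict;
use warnings;

# Double-quote one argument for a Bourne shell script, escaping the
# characters that remain special inside double quotes.
sub shell_quote
{
    my ($arg) = @_;
    $arg =~ s/\\/\\\\/g;    # backslash first, so later escapes are not doubled
    $arg =~ s/\$/\\\$/g;    # dollar sign
    $arg =~ s/"/\\"/g;      # double quote
    $arg =~ s/`/\\`/g;      # backquote
    return '"' . $arg . '"';
}

# Quote every argument and join them into one command-line fragment.
sub quote_arguments
{
    return join(' ', map { shell_quote($_) } @_);
}

print quote_arguments('hello world', 'cost: $5', 'say "hi"'), "\n";
```

Escaping the backslash before anything else matters: done later, it would double the backslashes introduced by the other substitutions.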
To end the LSF job description script, we will write the command line of the executable to the script. Depending on the job type of this submission, we will need to start either one or more instances of the executable, or the mpirun program which will start the job with the executable count in the JobDescription:
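The three cases can be sketched as a function that returns the lines to write to the job script. The job type names, the parameter list, and the use of mpirun's -np option are assumptions for illustration. The real module reads these values from the JobDescription and also handles details such as standard input redirection.

```perl
use strict;
use warnings;

# Return the script lines that launch the job. $args is an already-quoted
# argument string; $mpirun is the path to mpirun (hypothetical parameters).
sub launch_lines
{
    my ($jobtype, $executable, $args, $count, $mpirun) = @_;

    if ($jobtype eq 'mpi') {
        # mpirun starts $count processes itself.
        return ("$mpirun -np $count $executable $args");
    }
    if ($jobtype eq 'multiple') {
        # Start $count background copies, then wait for all of them.
        return ((map { "$executable $args &" } 1 .. $count), 'wait');
    }
    # 'single' (the default): run the executable once.
    return ("$executable $args");
}

print "$_\n" for launch_lines('multiple', '/bin/hostname', '"-s"', 2, '/usr/bin/mpirun');
```

For the multiple case, the trailing wait keeps the batch script alive until every background copy has exited, so the scheduler does not consider the job finished too early.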
Next, we submit the job to the scheduler. Be sure to close the script file before trying to redirect it into the submit command, or some of the script file may be buffered and things will fail in strange ways!
When the submission command returns, we check its output for the scheduler-specific job identifier. We will use this value to be able to poll or cancel the job.
The return value of the script should be either a GRAM error object or a reference to a hash of values. The Globus::GRAM::JobManager documentation lists the valid keys to that hash. For the submit method, we'll return the job identifier as the value of JOB_ID in the hash. If the scheduler returned a job status result, we could return that as well. LSF does not, so we'll just check for the job ID and return it, or if the job fails, we'll return an error object:
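Extracting the job identifier can be sketched as follows, assuming bsub reports success with a line of the form "Job <1234> is submitted to queue <normal>.". The function name is hypothetical. Where this sketch returns undef, a real module would return a GRAM error object.

```perl
use strict;
use warnings;

# Extract the LSF job ID from bsub output such as
# "Job <1234> is submitted to queue <normal>."
# Returns a hash reference with JOB_ID, or undef if no ID was found.
sub parse_bsub_output
{
    my ($output) = @_;
    if (defined $output && $output =~ /^Job <(\d+)> is submitted/m) {
        return { JOB_ID => $1 };
    }
    return undef;    # a real module would return a GRAM error object here
}

my $result = parse_bsub_output("Job <1234> is submitted to queue <normal>.\n");
print defined $result ? "JOB_ID=$result->{JOB_ID}\n" : "submission failed\n";
```

Matching on the whole phrase rather than on any number in the output avoids mistaking a warning message for a job ID.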
That finishes the submit method. Most of the functionality for the scheduler interface is now written. We just have a few more (much shorter) methods to implement.
All scheduler modules must also implement the poll method. The purpose of this method is to check for updates of a job's status, for example, to see if a job has finished.
When this method is called, we'll get the job ID (which we returned from the submit method above) as well as the original job request information in the object's JobDescription. In the LSF script, we'll pass the job ID to the bjobs program, and that will return the job's status information. We'll compare the status field from the bjobs output to see what job state we should return.
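One way to picture this comparison is a lookup table from bjobs STAT values to GRAM job states. The mapping here is an assumption for illustration and may differ from the one the LSF module uses. It returns state names as strings, where a real module would return the corresponding Globus::GRAM::JobState constants.

```perl
use strict;
use warnings;

# Map bjobs STAT values to GRAM job state names. A real module would use
# the Globus::GRAM::JobState constants instead of these strings.
my %STATE_FOR_STAT = (
    PEND  => 'PENDING',
    RUN   => 'ACTIVE',
    PSUSP => 'SUSPENDED',
    USUSP => 'SUSPENDED',
    SSUSP => 'SUSPENDED',
    DONE  => 'DONE',
    EXIT  => 'FAILED',
);

sub gram_state_for
{
    my ($stat) = @_;
    # Unrecognized values yield undef so the caller can decide what to do.
    return $STATE_FOR_STAT{$stat};
}

print gram_state_for('RUN'), "\n";
```

Returning undef for an unknown status lets the poll method report no state change instead of guessing.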
If the job fails, and there is a way to determine that from the scheduler, then the script should return in its hash both a JOB_STATE of FAILED and an ERROR value identifying the failure.
Here's an excerpt from the LSF scheduler module implementation:
All scheduler modules must also implement the cancel method. The purpose of this method is to cancel a running job.
As with the poll method described above, this method will be given the job ID as part of the JobDescription object held by the manager object. If the scheduler interface provides feedback that the job was cancelled successfully, then we can return a JOB_STATE change to the FAILED state. Otherwise we can return an empty hash reference, and let the poll method return the state change next time it is called.
To process a cancel in the LSF case, we will run the bkill command with the job ID.
All Perl modules are required to return a true value when they are loaded. To do this, make sure the last line of your module is the single statement 1; (the digit one followed by a semicolon).
Once we've written the job manager script, we need to get it installed so that the gatekeeper will be able to run our new service. We do this by writing a setup script. For LSF, we will write the script setup-globus-job-manager-lsf.pl, which we will list in the LSF package as the Post_Install_Program.
To set up the Gatekeeper service, our LSF setup script does the following:
First, our scheduler setup script probes for any system-specific information needed to interface with the local scheduler. For example, the LSF scheduler uses the mpirun, bsub, bqueues, bjobs, and bkill commands to submit, poll, and cancel jobs. We'll assume that the administrator who is installing the package has these commands in their path. We'll use an autoconf script to locate the executable paths for these commands and substitute them into our scheduler Perl module. In the LSF package, we have the find-lsf-tools script, which is generated during bootstrap by autoconf from the find-lsf-tools.in file:
If this script exits with a non-zero error code, then the setup script propagates the error to the caller and exits without installing the service.
Next, the setup script installs its Perl module into the Perl library directory and registers an entry in the Globus Gatekeeper's service directory. The program globus-job-manager-service (distributed in the job manager program setup package) performs both of these tasks. When run, it expects the scheduler Perl module to be located in the $GLOBUS_LOCATION/setup/globus directory.
If the scheduler script implements RSL attributes which are not part of the core set supported by the job manager, it must publish them in the job manager's data directory. If the scheduler script wants to set some default values of RSL attributes, it may also set those as the default values in the validation file.
The format of the validation file is described in the globus_gram_job_manager_rsl_validation_file section of the documentation. The validation file must be named scheduler-type.rvf and installed in the $GLOBUS_LOCATION/share/globus_gram_job_manager directory.
In the LSF setup script, we check the list of queues supported by the local LSF installation, and add a section of acceptable values for the queue RSL attribute:
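The idea can be sketched in two steps: collect the queue names from bqueues output, then format a validation-file entry for the queue attribute. This is an illustration, not the setup script's code. The sample output is abbreviated, the function names are hypothetical, and the "Attribute:" / "Values:" layout is an assumption that should be checked against the validation file documentation.

```perl
use strict;
use warnings;

# Collect queue names from "bqueues -w" style output: one header line,
# then one line per queue with the queue name in the first column.
sub queue_names
{
    my ($bqueues_output) = @_;
    my @lines = split /\n/, $bqueues_output;
    shift @lines;    # discard the header line

    my @names;
    for my $line (@lines) {
        push @names, $1 if $line =~ /^(\S+)/;
    }
    return @names;
}

# Format a validation-file entry listing the acceptable queue values.
# The "Attribute:" / "Values:" layout is an assumption; check it against
# the validation file documentation.
sub queue_rvf_entry
{
    my @queues = @_;
    return '' unless @queues;
    return "Attribute: queue\nValues: " . join(' ', @queues) . "\n";
}

my $sample = "QUEUE_NAME PRIO STATUS\nnormal 30 Open:Active\nnight 20 Open:Inact\n";
print queue_rvf_entry(queue_names($sample));
```

When no queues are found, the sketch emits nothing, so the queue attribute is left unrestricted instead of being given an empty list of acceptable values.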
Finally, the setup package should create and finalize a Grid::GPT::Setup object. The value of $package must be the same value as the gpt_package_metadata Name attribute in the package's metadata file. If either the new() or finish() method fails, then it is considered good practice to clean up any files created by the setup script. From setup-globus-job-manager-lsf.pl:
Now that we've written a job manager scheduler interface, we'll package it using GPT to make it easy for our users to build and install. We'll start by gathering the different files we've written above into a single directory, lsf.
If there are any scheduler-specific options defined for this scheduler module, or if there are any optional setup items, then it is good to provide a documentation page which describes these. For LSF, we describe the changes since the last version of this package in the file globus_gram_job_manager_lsf.dox. This file consists of a doxygen mainpage. See www.doxygen.org for information on how to write documentation with that tool.
Now, we'll write our configure.in script. This file is converted to the configure shell script by the bootstrap script below. Since we don't do any probes for compile-time tools or system characteristics, we just call the various initialization macros used by GPT, declare that we may provide doxygen documentation, and then output the files we need substitutions done on.
Now we'll write our metadata file, and put it in the pkgdata subdirectory of our package. The important things to note in this file are the package name and version, the post_install_program, and the setup sections. These define how the package distribution will be named, what command will be run by gpt-postinstall when this package is installed, and which setup dependencies will be written when the Grid::GPT::Setup object is finalized.
The automake Makefile.am for this package is short because there isn't any compilation needed for this package. We just need to define what needs to be installed into which directory, and what source files need to be put into our source distribution. For the LSF package, we need to list the lsf.in, find-lsf-tools, and setup-globus-job-manager-lsf.pl scripts as files to be installed into the setup directory. We need to add those files plus our documentation source file to the EXTRA_DIST variable so that they will be included in source distributions. The rest of the lines in the file are needed for proper interaction with GPT.
The final piece we need to write for our package is the bootstrap script. This script is the standard bootstrap script for a globus package, with an extra line to generate the find-lsf-tools script using autoconf.
With this all done, we can now try to build our new package. To do so, we'll need to run
If all of the files are written correctly, this should result in our package being installed into $GLOBUS_LOCATION. Once that is done, we should be able to run gpt-postinstall to configure our new job manager.
Now, we should be able to run the command
to start a gatekeeper configured to run a job manager using our new scripts. Running this will output a contact string (referred to as <contact-string> below), which we can use to connect to this new service. To do so, we'll run globus-job-run to submit a test job:
If the test above fails, or more complicated job failures are occurring, then you'll have to debug your scheduler interface. Here are a few tips to help you out.
Make sure that your script is valid Perl. If you run
you should get no output. If there are any diagnostics, correct them (in the lsf.in file), reinstall your package, and rerun the setup script.
Look at the Globus Toolkit Error FAQ and see if the failure is perhaps not related to your scheduler script at all.
Enable logging for the job manager. By default, the job manager is configured to log only when it notices a job failure. However, if your problem is that your script is not returning a failure code when you expect, you might want to enable logging always. To do this, modify the job manager configuration file to contain "-save-logfile always" in place of "-save-logfile on_error".
Add logging messages to your script: the JobManager object implements a log method, which allows you to write messages to the job manager log file. Do this as your methods are called to pinpoint where the error occurs.
Save the job description file when your script is run. This will allow you to run globus-job-manager-script.pl interactively (or in the Perl debugger). To save the job description file, you can do
in any of the methods you've implemented.