Note

The Grid Community Toolkit documentation was taken from the Globus Toolkit 6.0 documentation. As a result, there may be inaccuracies and outdated information. Please report any problems to the Grid Community Forums as GitHub issues.

GCT 6.0 GRAM5: User’s Guide

Introduction

GRAM services provide secure, remote job submission to different local resource managers in a Grid environment. This document describes the features of the GRAM service and gives an overview of the tools used to interact with it.

GRAM5 Overview

GRAM provides a uniform, remote interface for executing jobs on compute resources. GRAM jobs consist of file transfers and program execution on one or more compute elements managed by a local resource manager. The GRAM client can submit the job and then later poll for its status, or it can request that the GRAM service notify it when the job changes state or completes. While the job is executing, the client may send control messages to the GRAM service to monitor or modify the job. GRAM provides reliable job submission, job recovery in case of service or client failures, file staging, and asynchronous notification messages.

GRAM achieves its uniform interface through a domain-specific language called the Resource Specification Language (RSL). RSL provides a simple way to express job requirements, environment, and commands in a specification that is independent of the local resource manager that will actually execute the job.

The GRAM protocol is a two-phase protocol, so that when jobs are submitted to a GRAM service, they will not start until a client has received a contact handle to the job. The GRAM service will not clean up a job until it has received acknowledgment from the client that the job completion state has been received. In the case of transient errors, GRAM clients can reconnect to the GRAM service to determine job state, or to update information the job will need to stage output files.

The GRAM service has been built to work in the presence of client and service failures without losing state information about jobs. If a client exits and is restarted, it can request job state information, update URLs for output files to be staged to, and register a new address to receive job state callbacks. If a service exits and is restarted, it will resume processing all existing GRAM jobs from their previous state, and continue to send state updates to any clients which are registered for them.

GRAM provides file staging before and after a job runs, scratch directory management, and a cache location for common files. File staging is Grid-aware and accesses remote storage resources via the GridFTP, FTP, HTTPS, and HTTP protocols.

Because the GRAM service implements client callbacks for job state changes, clients can submit a number of jobs and be notified when each completes. This allows clients to be more responsive to changes in state than services which require polling for job completion.

GRAM Client Tools

There are a number of GRAM clients which can be used to interact with the GRAM service. The Grid Community Toolkit includes globusrun, globus-job-submit, and globus-job-run. Other projects provide higher-level tools which can be used to manage large sets of jobs.

Condor-G

Condor is a high-throughput job scheduler from the University of Wisconsin. It provides a facility called Condor-G to run jobs via GRAM. See the Condor documentation, especially the section on Grid Universe, which describes how to write Condor Classified Ads to run jobs using GRAM services. The gt5 Grid type provides the best performance for using GRAM with Condor-G.

Swift

The Swift system from the University of Chicago is a data-oriented, coarse-grained scripting language that supports dataset typing and mapping, dataset iteration, conditional branching, and procedural composition. The SwiftScript language can be used to create workflows that are executed on various services, including GRAM. See the Swift User’s Guide for information about using Swift.

GridWay

The GridWay Metascheduler enables large-scale, reliable and efficient sharing of computing resources: clusters, supercomputers, stand-alone servers. It supports different LRM systems (PBS, SGE, LSF, Condor) within a single organization or scattered across several administrative domains. The GridWay manual describes how to use GridWay.

GRAM APIs

In addition to the tools above, you can write your own GRAM clients, using the public APIs described in the GRAM5 Developer’s Guide. The client APIs there can be used to write custom applications that interact with GRAM services in C/C++ or Java.

Portals and Science Gateways that use GRAM

XSEDE

XSEDE provides a number of domain-specific science gateways and portals, which provide interfaces to various computation and data resources, including some managed by GRAM.

These include CGD’s Atmospheric Modeling & Predictability Section from NCAR, the UltraScan LIMS Portal at the Bioinformatics Core Facility at the University of Texas Health Science Center at San Antonio, the Social Informatics Data (SID) Grid at the University of Chicago, and the Southern California Earthquake Center headquartered at the University of Southern California.

Using GRAM5

Before Getting Started

GRAM and Security

GRAM uses the Grid Security Infrastructure for its security implementation, based on X.509 certificates and the TLSv1 protocol to authenticate user identities with GRAM services. Before using GRAM, you must first obtain a security credential. This is typically done by requesting a certificate from a site-specific CA, or by using a portal to obtain a temporary credential. In typical use, GRAM uses a proxy certificate which is a short-term credential digitally signed by a private key. Please read the Basic procedure for using GSI C to learn more about how to obtain and use a GSI credential before continuing this guide.

GRAM Resource Names

Before interacting with a GRAM service, you must know its contact address. GRAM uses a very flexible URL-like syntax to contain information about the service’s hostname, TCP port number, service name, and security identity. In the basic case, you will only need to use the service’s hostname to contact the service. However, if the service is configured to run on a non-standard port, or with a custom service name, or credential which doesn’t match its hostname, you will need to use one of the longer forms.

A fully-qualified resource name looks something like grid.example.org:2120/jobmanager-sge:/C=US/O=Example/OU=Grid/CN=host/grid.example.org. Breaking this down, the resource name includes:

| Name Component | Example | Meaning | Default |
|---|---|---|---|
| Host Name | grid.example.org | Host which the GRAM service is running on. | None. This is always a required component. |
| TCP Port | 2120 | TCP port which the GRAM service is listening on. If multiple GRAM services are running on the same machine, they may use alternate TCP ports. | 2119 |
| Service Name | jobmanager-sge | The name of the GRAM service on the given host. A host may provide access to multiple resources using different local resource managers. This name is used to distinguish which service to use for a particular job request. Typically, a host will provide a default entry called jobmanager which will interface with a batch computing or high-throughput scheduling system, and another called jobmanager-fork for simple non-compute jobs. | jobmanager |
| Credential Name | /C=US/O=Example/OU=Grid/CN=host/grid.example.org | The name of the credential which the GRAM service is using. This is only needed if the credential’s common name does not match the host name. | host@hostname |

Any component of the resource name may be omitted, except for the host name, and defaults will be used. The field separator : must be retained when skipping between name components.
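
The naming rules above can be sketched in code. The following Python snippet is an illustrative parser and is not part of GCT: the GramContact type, the parse_contact function, and the regular expression are assumptions made for this example. It only handles the host[:port][/service][:credential name] forms described above, and it returns None for the credential name when the default (derived from the host name) applies.

```python
import re
from typing import NamedTuple, Optional

DEFAULT_PORT = 2119             # default TCP port listed above
DEFAULT_SERVICE = "jobmanager"  # default service name listed above


class GramContact(NamedTuple):
    host: str
    port: int
    service: str
    subject: Optional[str]  # None means "use the default derived from the host"


# host[:port][/service][:subject]. The ':' separators stay in place even
# when the component before them is empty (for example host::subject).
_CONTACT_RE = re.compile(
    r"^(?P<host>[^:/]+)"
    r"(?::(?P<port>\d*))?"
    r"(?:/(?P<service>[^:]*))?"
    r"(?::(?P<subject>.+))?$"
)


def parse_contact(contact: str) -> GramContact:
    """Split a resource name into its components, filling in defaults."""
    match = _CONTACT_RE.match(contact)
    if match is None:
        raise ValueError(f"not a recognizable GRAM resource name: {contact!r}")
    port = match.group("port")
    service = match.group("service")
    return GramContact(
        host=match.group("host"),
        port=int(port) if port else DEFAULT_PORT,
        service=service if service else DEFAULT_SERVICE,
        subject=match.group("subject"),
    )


full = parse_contact(
    "grid.example.org:2120/jobmanager-sge:"
    "/C=US/O=Example/OU=Grid/CN=host/grid.example.org"
)
print(full)
print(parse_contact("grid.example.org"))
```

With only a host name, the sketch falls back to port 2119 and the jobmanager service, matching the defaults in the table.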

Basic Client Interface

This section describes the basic command-line interface for interacting with GRAM services. For these examples, we will use the GRAM resource named grid.example.org:2119/jobmanager-pbs. You will need to change that to a resource to which you have been granted access.

Batch and Interactive Use

The tools globus-job-run and globus-job-submit can both be used to submit jobs to GRAM resources. The difference is that globus-job-run waits until the job terminates before exiting and prints the job’s standard output and standard error after the job completes, while globus-job-submit submits the job and then exits immediately, printing the job contact to its standard output stream. The job can then be polled for status with the globus-job-status command, its output can be fetched with the globus-job-get-output command, and it can be cleaned up with the globus-job-clean command.

Running Basic Jobs with globus-job-run

The globus-job-run program provides a simple blocking command-line interface to the GRAM service. It submits a job to a GRAM resource and waits for the job to terminate. After the job terminates, the output and error streams of the job are sent to the output and error streams of globus-job-run. Note that truly interactive jobs are not supported with GRAM.

The globus-job-run program has command-line options to control most aspects of jobs run by GRAM. However, certain behaviors must be specified by defining an RSL string containing various job attributes. A more detailed description of the RSL language is included in the section on running jobs with globusrun below.

The following examples show some of the common command-line options to globus-job-run. Full globus-job-run documentation is available in the GRAM5 public interface guide.

Example 1. Minimal job using globus-job-run

The following command line submits a single instance of the /bin/hostname executable to the resource named by grid.example.org/jobmanager-pbs.

%  globus-job-run
node1.grid.example.org
Example 2. Multiprocess job using globus-job-run

The following command line submits ten instances of the executable /bin/hostname. The output of the job is the names of the ten hosts that the job ran on. The -np COUNT option causes globus-job-run to run COUNT instances of the executable.

%  globus-job-run
node1.grid.example.org
node3.grid.example.org
node2.grid.example.org
node10.grid.example.org
Example 3. Staging an executable file using globus-job-run

The following command line submits an executable which is local to the submit machine to the GRAM resource, then executes it. The executable is removed automatically from the GRAM resource after the job completes. The -s option prior to the executable name causes globus-job-run to stage the executable using GASS (an https-based protocol) from the machine running globus-job-run to the GRAM resource.

%  globus-job-run
node1.grid.example.org
Example 4. Providing an input file to a job using globus-job-run

The following command line submits a job to a GRAM resource. When this job runs, its standard input will be read from the file $HOME/inputfile.txt, which is located on the GRAM resource. The -stdin command-line option indicates this path.

%  globus-job-run
Hello, Grid
Example 5. Staging an input file to a job using globus-job-run

The following command line submits a job to a GRAM resource. When this job runs, its standard input will be read from the file inputfile.txt, which is located on the submit client machine. The -stdin -s command-line option combination causes the input file to be staged, as in the executable staging example above.

%  globus-job-run
Hello, staged input on the Grid
Example 6. Canceling an interactive job

This example shows how sending the SIGINT signal to globus-job-run (by pressing Ctrl-C, or by another system-specific mechanism) cancels a GRAM job.

%  globus-job-run

GRAM Job failed because the user cancelled the job (error code 8)
Example 7. Setting job environment variables with globus-job-run

The following command line submits one instance of the executable /usr/bin/env, setting some environment variables in the job environment beyond those set by GRAM5. The -env command-line option adds the named variable to the job environment. It may be present multiple times on the command line to set multiple environment variables.

%  globus-job-run
HOME=/home/juser
LOGNAME=juser
GLOBUS_GRAM_JOB_CONTACT=https://client.example.org:3882/16001579536700793196/5295612977485997184/
GLOBUS_LOCATION=/opt/globus-

Submitting Basic Jobs with globus-job-submit

A tool related to globus-job-run is globus-job-submit. This command submits a job to a GRAM5 service and then exits without waiting for the job to terminate. Other tools (globus-job-status, globus-job-cancel, globus-job-clean, and globus-job-get-output) allow further interaction with the job.

Important

When using globus-job-submit, the job output and state will remain on disk on the GRAM resource until one of globus-job-clean or globus-job-cancel is run for that job. Be sure to clean up your jobs!

The globus-job-submit program has most of the same command-line options as globus-job-run. When run, instead of displaying the output and error streams of the job, it prints the job contact, which is used with the other globus-job tools to interact with the job.

Example 8. globus-job-submit

This example shows the interaction of submitting a job via globus-job-submit, checking its status with globus-job-status, getting its output with globus-job-get-output, and then cleaning the job with globus-job-clean. Note that this example uses the jobmanager-fork service when retrieving output and cleaning the job. This allows those tasks to be done without waiting in the batch system. Most sites will allow these sorts of administrative jobs to be run on the GRAM node, but consult your system administrator to be sure. Also note that the job contact returned from globus-job-submit can be used to get information about the job from any computer, provided you have GRAM tools installed and your security environment set up.

%  globus-job-submit
https://grid.example.org:38843/16001600430615223386/5295612977486013582/
%  globus-job-status
PENDING
%  globus-job-status
ACTIVE
%  globus-job-status
DONE
%  globus-job-get-output
node1.grid.example.org
%  globus-job-clean

    WARNING: Cleaning a job means:
        - Kill the job if it still running, and
        - Remove the cached output on the remote resource

    Are you sure you want to cleanup the job now (Y/N) ?

y

Cleanup successful.
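
The submit, poll, fetch, and clean cycle above can also be driven from a script. The sketch below is a minimal Python illustration, not a GCT tool: poll_until_finished and job_status are hypothetical helper names, the assumption that globus-job-status takes the job contact as its only argument is inferred from the example, and treating FAILED as a terminal state next to DONE is also an assumption. The demonstration at the bottom uses a canned status sequence so that it runs without a GRAM service.

```python
import subprocess
import time
from typing import Callable, List

TERMINAL_STATES = {"DONE", "FAILED"}  # FAILED is assumed; DONE is shown above


def job_status(contact: str) -> str:
    """Ask globus-job-status for the state of one job (needs GRAM tools)."""
    result = subprocess.run(
        ["globus-job-status", contact],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def poll_until_finished(
    contact: str,
    get_status: Callable[[str], str] = job_status,
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until a terminal state is seen; return the distinct states in order."""
    seen: List[str] = []
    while True:
        state = get_status(contact)
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in TERMINAL_STATES:
            return seen
        sleep(interval)


# Demonstration with a canned sequence instead of a live GRAM service.
_canned = iter(["PENDING", "PENDING", "ACTIVE", "DONE"])
history = poll_until_finished(
    "https://grid.example.org:38843/16001600430615223386/5295612977486013582/",
    get_status=lambda _contact: next(_canned),
    sleep=lambda _seconds: None,
)
print(history)
```

Once a terminal state is reached, the output would still need to be fetched with globus-job-get-output and the job removed with globus-job-clean, as in the example above.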

Advanced Jobs with globus-job-run

Example 9. Using custom RSL clauses with globus-job-run

The following command line submits an MPI job using globus-job-run, setting the jobtype RSL attribute to mpi. Any RSL attribute understood by the LRM can be added to a job via this method.

%  globus-job-run
Hello, MPI (rank: 0, count: 5)
Hello, MPI (rank: 3, count: 5)
Hello, MPI (rank: 1, count: 5)
Hello, MPI (rank: 4, count: 5)
Hello, MPI (rank: 2, count: 5)
Example 10. Constructing RSL strings with globus-job-run

The globus-job-run program can also generate the RSL language description of a job based on the command-line options given to it. This example combines some of the features above and prints out the resulting RSL. This RSL string can be passed to tools such as globusrun to be run later.

%  globus-job-run -dumprsl
 &(jobtype=mpi)
    (executable="a.out")
    (environment= ("GRID" "1") ("TEST" "1"))
    (count=5)
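
An RSL string like the one printed above can also be assembled programmatically. The following Python sketch is an illustration only: build_rsl, rsl_relation, and quote are hypothetical helpers, and the quoting rule used here (wrap every string in double quotes and double any embedded quote) is an assumption about RSL rather than something stated in this guide. As a result it emits (jobtype="mpi") where the tool printed (jobtype=mpi).

```python
from typing import Mapping, Sequence, Union

RslValue = Union[str, int, Sequence[Sequence[str]]]


def quote(value: str) -> str:
    """Wrap a value in double quotes, doubling embedded quotes (assumed rule)."""
    return '"' + value.replace('"', '""') + '"'


def rsl_relation(name: str, value: RslValue) -> str:
    """Render one (name=value) relation."""
    if isinstance(value, int):
        return f"({name}={value})"
    if isinstance(value, str):
        return f"({name}={quote(value)})"
    # A sequence of sequences becomes a list of parenthesized groups,
    # as in the environment attribute shown above.
    groups = " ".join(
        "(" + " ".join(quote(item) for item in group) + ")" for group in value
    )
    return f"({name}={groups})"


def build_rsl(attributes: Mapping[str, RslValue]) -> str:
    """Join all relations into a single conjunction starting with '&'."""
    return "&" + "".join(rsl_relation(n, v) for n, v in attributes.items())


rsl = build_rsl({
    "jobtype": "mpi",
    "executable": "a.out",
    "environment": [("GRID", "1"), ("TEST", "1")],
    "count": 5,
})
print(rsl)
```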

Advanced GRAM Client with the globusrun tool

The globusrun tool provides a more flexible interface for submitting, monitoring, and canceling jobs. With this tool, most of the functionality of the GRAM5 APIs is made available from the command line.

One major difference between globusrun and the other tools described above is that globusrun uses the RSL language to provide the job description, instead of multiple command-line options to describe the various aspects of the job. The section on globus-job-run contained a brief example RSL in the -dumprsl example above.

The following sections show examples of the different modes that globusrun can run in. Full information about globusrun command-line options is available in the public interface guide.

Checking RSL Syntax

This example shows how to check that an RSL document contains a syntactically correct job description. Note that this mode does not do semantic validation of the RSL, so an RSL document that passes this test may not work when submitted to a GRAM5 service.

Example 11. Checking RSL Syntax
%  globusrun -p "&(executable=a.out)"

RSL Parsed Successfully...

%  globusrun -p "&/executable=a.out)"

ERROR: cannot parse RSL &/executable=a.out)

Syntax: globusrun [-help] [-f RSL file] [-s][-b][-d][...] [-r RM] [RSL]


Use -help to display full usage
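
Before calling globusrun -p, a script can run a quick local sanity check on an RSL string. The Python sketch below only verifies that parentheses are balanced outside of quoted text. It is far weaker than the real parser, the function name parens_balanced is hypothetical, and it assumes that only single and double quotes delimit quoted text.

```python
def parens_balanced(rsl: str) -> bool:
    """Return True when '(' and ')' pair up outside quoted text."""
    depth = 0
    quote_char = None  # the quote character we are inside, if any
    for ch in rsl:
        if quote_char is not None:
            if ch == quote_char:
                quote_char = None
            continue
        if ch in "\"'":
            quote_char = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0 and quote_char is None


print(parens_balanced("&(executable=a.out)"))
print(parens_balanced("&/executable=a.out)"))
```

The two calls use the same strings as the example above: the first is accepted and the second is rejected because its closing parenthesis has no opening partner.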

Checking Service Contacts

This example shows how to check that a globus-gatekeeper is running at a particular contact and that the client and service have mutually-trusted credentials.

Example 12. GRAM Authentication test
%  globusrun -a -r grid.example.org/jobmanager-pbs
GRAM Authentication test successful
%  globusrun -a -r grid.example.org/jobmanager-lsf
GRAM Authentication test failure: the gatekeeper failed to find the requested service
%  globusrun -a -r grid.example.org/jobmanager-pbs:[email protected]
GRAM Authentication test failure: an authorization operation failed
globus_xio_gsi: gss_init_sec_context failed.
GSS Major Status: Unexpected Gatekeeper or Service Name
globus_gsi_gssapi: Authorization denied: The name of the remote host
([email protected]), and the expected name for the remote host
(grid.example.org) do not match. This happens when the name in the host
certificate does not match the information obtained from DNS and is often a DNS
configuration problem.
Note

The DNS configuration problem was a common issue in GRAM2, but GRAM5 will not depend on DNS to resolve names for mutual authentication.

Checking GRAM service version

This example shows how to determine what software version of GRAM5 is deployed at a particular service contact.

Example 13. GRAM version check
%  globusrun -j -r grid.example.org/jobmanager-pbs:[email protected]
Toolkit version: 4.3.0-HEAD
Job Manager version: 10.5 (1256257907-0)
Note

This example shows the version number for an unreleased development version of GRAM5. The actual numbers returned will be different.

Note

This feature is new in GRAM5. When contacting a GRAM2 service, globusrun will display the following error message:

GRAM version check failed : an incoming HTTP message did not contain the expected information

Basic Interactive job with globusrun

This example shows how to submit an interactive job with globusrun. When the -s option is used, the output of the job command is returned to the client and displayed as if the command ran locally. This is similar to the behavior of the globus-job-run program described above.

Example 14. Basic Interactive Job
%  globusrun -s -r grid.example.org/jobmanager-pbs "&(executable=/bin/hostname)(count=5)"
node03.grid.example.org
node01.grid.example.org
node02.grid.example.org
node05.grid.example.org
node04.grid.example.org

Basic batch job with globusrun

This example shows how to submit, monitor, and cancel a batch job using globusrun. This method is useful for the case where the job may run for a long time, the job may be queued for a long time, or when there are network reliability issues between the client and service.

Example 15. Basic Batch Job
%  globusrun -b -r grid.example.org/jobmanager-pbs "&(executable=/bin/sleep)(arguments=500)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
https://grid.example.org:38824/16001608125017717261/5295612977486019989/
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
%  globusrun -status https://grid.example.org:38824/16001608125017717261/5295612977486019989/
PENDING
%  globusrun -k https://grid.example.org:38824/16001608125017717261/5295612977486019989/
%

Refreshing a GRAM5 Credential

The following example shows how to refresh the credential used by a job manager and a job.

Example 16. Refreshing a Credential
%  globusrun -refresh-proxy https://grid.example.org:38824/16001608125017717261/5295612977486019989/
%  echo $?
0
Note

In GCT 6.0, globusrun does not print any diagnostics when given the -refresh-proxy command-line option. Therefore, check the exit code as above to ensure that the refresh is successful.
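
Because the only feedback is the exit code, a script that refreshes credentials has to test it explicitly. The following Python sketch shows one way to do that. The helper names command_succeeded and refresh_proxy are hypothetical, and the demonstration at the bottom uses stand-in commands so that it runs on a machine without GRAM tools.

```python
import subprocess
import sys
from typing import Sequence


def command_succeeded(argv: Sequence[str]) -> bool:
    """Run a command quietly and report whether its exit code was 0."""
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode == 0


def refresh_proxy(contact: str) -> bool:
    """Wrap the command shown above; requires globusrun to be on the PATH."""
    return command_succeeded(["globusrun", "-refresh-proxy", contact])


# Stand-in commands that exit with 0 and 1, instead of a live GRAM service.
ok = command_succeeded([sys.executable, "-c", "raise SystemExit(0)"])
failed = command_succeeded([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)
```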

Dealing with credential expiration

When the Job Manager’s credential is about to expire, it sends a message to all clients registered for GLOBUS_GRAM_PROTOCOL_JOB_STATE_FAILED notifications that the job manager is terminating and that the job will continue to run without the job manager.

Any client which receives such a message can (if necessary) generate a new proxy as described above and then submit a restart request to start a job manager with a new credential. This job manager will resume monitoring the jobs which were started prior to proxy expiration.

In this example, globusrun displays an error message when the job manager’s proxy is about to expire. The user creates a new proxy and resumes monitoring the job with globusrun.

Example 17. Proxy Expiration Example
%  globusrun -r grid.example.org "&(executable=a.out)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
GLOBUS_GRAM_PROTOCOL_JOB_STATE_FAILED
GRAM Job failed because the user proxy expired (job is still running) (error code 131)
%  grid-proxy-init
Your identity: /DC=org/DC=example/OU=grid/CN=Joe User
Enter GRID pass phrase for this identity:
Creating proxy ........................................................................... Done
Your proxy is valid until: Tue Nov 10 04:25:03 2009
%  globusrun -r grid.example.org "&(restart=https://grid.example.org:1997/16001700477575114131/5295612977486005428/)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE

File staging

In addition to returning the standard output and error streams through globusrun, GRAM5 can do basic file management tasks: it can stage files to the GRAM5 service node before submitting a job, and stage files from the GRAM5 service node to a file service after the job completes.

GRAM5 file staging supports four URL schemes: ftp, gsiftp, http, and https. Note that for the https scheme, GRAM expects the file server to be running with the same identity as the client.

General file staging is controlled by three RSL attributes: file_stage_in, file_stage_in_shared, and file_stage_out. In addition, the files named by the RSL attributes executable and stdin may be staged in, and the files named by the RSL attributes stdout and stderr may be staged out.

The file_stage_in_shared RSL attribute instructs GRAM to store a local copy of the resource named by the URL in the GASS cache. This is useful if multiple concurrent jobs will be accessing one or more common files. The GASS cache will manage a reference count for files in the cache and remove them when all jobs that refer to them complete.

The following example shows how to stage a few files from a GridFTP server to the GRAM node. It uses the rsl_substitution mechanism to define a substitution variable to reduce the amount of redundancy in the job description.

Example 18. File stage in
%  globusrun -s -r grid.example.org/jobmanager-pbs \
    "&(rsl_substitution = (GRIDFTP_SERVER gsiftp://gridftp.example.org)) \
      (executable=/bin/ls) \
      (arguments=/tmp/staged_file) \
      (file_stage_in = (\$(GRIDFTP_SERVER)/staged_file /tmp/staged_file))"
/tmp/staged_file

The next example uses the file_stage_in_shared RSL attribute to stage a file into the cache. The file is transferred from the client using the GASS https server embedded in the globusrun program when the -s option is used.

Example 19. File stage in shared
%  globusrun -s -r grid.example.org/jobmanager-pbs \
    "&(executable=/bin/ls) \
      (arguments = -l /tmp/staged_file_link1) \
      (file_stage_in_shared = \
          (\$(GLOBUSRUN_GASS_URL)/staged_file1 /tmp/staged_file_link1))"
lrwxr-xr-x  1 juser   juser  120 Nov 11 20:37 /tmp/staged_file1 -> /home/juser/.globus/.gass_cache/local/md5/ff/771bded8a2c7dacc1a1c0fecafa0ce/md5/39/13ab3db7fc002ed54012083ae6ed1c/data

The final staging example uses the file_stage_out RSL attribute to transfer a file from the GRAM service to an FTP server using anonymous FTP.

Example 20. File stage out
%  globusrun -r grid.example.org/jobmanager-pbs \
    "&(executable=a.out) \
      (file_stage_out = (results.txt ftp://anonymous:[email protected]/incoming/results.txt))"
%
Note

In all of the above cases, multiple files may be staged using any combination of the supported URL schemes.
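
Staging several files in one attribute can be sketched as follows. This Python snippet is illustrative only: staging_relation and is_supported_url are hypothetical helpers, and the host www.example.org is a made-up second source. It renders one staging attribute from a list of (source, destination) pairs and checks a URL against the four schemes listed above.

```python
from typing import Iterable, Tuple
from urllib.parse import urlsplit

SUPPORTED_SCHEMES = {"ftp", "gsiftp", "http", "https"}
STAGING_ATTRIBUTES = {"file_stage_in", "file_stage_in_shared", "file_stage_out"}


def is_supported_url(url: str) -> bool:
    """True when the URL uses one of the four schemes listed above."""
    return urlsplit(url).scheme in SUPPORTED_SCHEMES


def staging_relation(attribute: str, transfers: Iterable[Tuple[str, str]]) -> str:
    """Render (attribute = (source destination) ...) for one staging attribute."""
    if attribute not in STAGING_ATTRIBUTES:
        raise ValueError(f"not a staging attribute: {attribute}")
    pairs = " ".join(f"({source} {dest})" for source, dest in transfers)
    return f"({attribute} = {pairs})"


stage_in = staging_relation("file_stage_in", [
    ("gsiftp://gridftp.example.org/staged_file", "/tmp/staged_file"),
    ("http://www.example.org/data.txt", "/tmp/data.txt"),  # hypothetical source
])
print(stage_in)
print(is_supported_url("gsiftp://gridftp.example.org/staged_file"))
print(is_supported_url("scp://gridftp.example.org/staged_file"))
```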

Temporary files and cleanup

GRAM5 supports creating a per-job scratch directory which can be used as a place to store files that will be automatically removed by GRAM when the job completes. It also supports an explicit list of files to remove when the job completes.

This example shows how to stage files into a scratch directory. It again uses the embedded GASS https server, stages to the GRAM service, then runs /bin/ls in the temporary directory. After the job completes, the contents of $(SCRATCH_DIRECTORY) and the directory itself are removed.

Example 21. Staging to scratch directory
%  globusrun -s -r grid.example.org/jobmanager-pbs \
    "&(scratch_dir = \$(HOME)) \
      (directory = \$(SCRATCH_DIRECTORY)) \
      (file_stage_in = \
          (\$(GLOBUSRUN_GASS_URL)/inputfile \$(SCRATCH_DIRECTORY)/inputfile)) \
      (executable = /bin/ls)"
inputfile

This example shows how to explicitly remove a file that was created by the job.

Example 22. Cleaning up a file
%  globusrun -s -r grid.example.org/jobmanager-pbs \
    "&(executable = /bin/touch) \
      (arguments = temporary_file) \
      (file_clean_up = temporary_file)"
%

Reliable job submit

The globusrun command supports a two-phase commit protocol to ensure that the client knows the contact of the job which has been created so that it can be monitored or canceled in the case of a client or service error. The two-phase commit affects both job submission and termination.

The two-phase protocol is enabled by using the two_phase RSL attribute, as in the next example. When this is enabled, job submission will fail with the error GLOBUS_GRAM_PROTOCOL_ERROR_WAITING_FOR_COMMIT. The client must respond to this error by sending either the GLOBUS_GRAM_PROTOCOL_JOB_SIGNAL_COMMIT_REQUEST signal, to commit the job to execution, or the GLOBUS_GRAM_PROTOCOL_JOB_SIGNAL_COMMIT_EXTEND signal, to delay the commit timeout. One of these signals must be sent before the two-phase commit timeout expires, or the job will be discarded by the GRAM service.

A two-phase protocol is also used at job termination if the save_state RSL attribute is used along with the two_phase attribute. When the job manager sends a callback with the job state set to GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE or GLOBUS_GRAM_PROTOCOL_JOB_STATE_FAILED, it will wait to clean up the job until the two-phase commit occurs. The client must reply with the GLOBUS_GRAM_PROTOCOL_JOB_SIGNAL_COMMIT_END signal to cause the job to be cleaned up. Otherwise, the job will be unloaded from memory until a client restarts the job and sends the signal.

Example 23. Two phase commit example

In this example, the user submits a job with a two_phase timeout of 30 seconds and the save_state attribute. The client must send commit signals to ensure the job runs.

%  globusrun -r grid.example.org/jobmanager-pbs \
    "&(two_phase = 30) \
      (save_state = yes) \
      (executable = a.out)"

globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
%
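
The client side of the two-phase exchange described above can be summarized as a small decision function. The Python sketch below is a simplified model and not the GRAM client API: the constant values are the protocol names from the text, while reply_for and its ready_to_run parameter are hypothetical. In the example above, globusrun handles these signals itself.

```python
from typing import Optional

WAITING_FOR_COMMIT = "GLOBUS_GRAM_PROTOCOL_ERROR_WAITING_FOR_COMMIT"
STATE_DONE = "GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE"
STATE_FAILED = "GLOBUS_GRAM_PROTOCOL_JOB_STATE_FAILED"
SIGNAL_COMMIT_REQUEST = "GLOBUS_GRAM_PROTOCOL_JOB_SIGNAL_COMMIT_REQUEST"
SIGNAL_COMMIT_EXTEND = "GLOBUS_GRAM_PROTOCOL_JOB_SIGNAL_COMMIT_EXTEND"
SIGNAL_COMMIT_END = "GLOBUS_GRAM_PROTOCOL_JOB_SIGNAL_COMMIT_END"


def reply_for(event: str, ready_to_run: bool = True) -> Optional[str]:
    """Pick the signal a client sends in response to an error or job state."""
    if event == WAITING_FOR_COMMIT:
        # Commit the job to execution, or ask for more time before the timeout.
        return SIGNAL_COMMIT_REQUEST if ready_to_run else SIGNAL_COMMIT_EXTEND
    if event in (STATE_DONE, STATE_FAILED):
        # With save_state and two_phase set, cleanup waits for this signal.
        return SIGNAL_COMMIT_END
    return None  # other states need no commit signal


print(reply_for(WAITING_FOR_COMMIT))
print(reply_for(STATE_DONE))
```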

Reconnecting to a job

If a job manager or client exits before a job has completed, the job will continue to run. The client can reconnect to a job manager and receive job state notifications and output using the restart RSL attribute.

Example 24. Restart example

This example uses globus-job-submit to submit a batch job and then globusrun to reconnect to the job.

%  globus-job-submit grid.example.org/jobmanager-pbs /bin/sleep 90
https://grid.example.org:38824/16001746665595486521/5295612977486005662/
%  globusrun -r grid.example.org/jobmanager-pbs \
    "&(restart = https://grid.example.org:38824/16001746665595486521/5295612977486005662/)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
%

Submitting a Java job

To submit a job that runs a Java program, the client must ensure that the job can find the Java interpreter and its classes. This example sets the default PATH and CLASSPATH environment variables and uses the shell to locate the path to the java program.

Example 25. Java example

This example uses globusrun to submit a Java job, staging jar files from a remote service.

%  globusrun -r grid.example.org/jobmanager-pbs \
    "&(environment = (PATH '/usr/bin:/bin') (CLASSPATH \$(SCRATCH_DIRECTORY)))
      (scratch_dir = \$(HOME))
      (directory = \$(SCRATCH_DIRECTORY))
      (rsl_substitution = (JAVA_SERVER http://java.example.org))
      (file_stage_in =
          (\$(JAVA_SERVER)/example.jar \$(SCRATCH_DIRECTORY)/example.jar)
          (\$(JAVA_SERVER)/support.jar \$(SCRATCH_DIRECTORY)/support.jar))
      (executable=/bin/sh)
      (arguments=-c 'java -jar example.jar')"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
%

Troubleshooting

GRAM Client Troubleshooting

Credential Problems

GRAM requires a client certificate and private key in order to authenticate with the GRAM service. If these are not available, the GRAM client will fail. In typical use, a user will create a temporary proxy certificate, derived either from their identity certificate issued by some certificate authority or from a service such as MyProxy. If a GRAM client command returns any error containing the string GSS Major Status, you’ve hit a credential problem. Look at the Troubleshooting Section of the GSI manual for details about how to diagnose and correct these errors. The tool with the -p command-line option is especially helpful for diagnosing some of these types of problems.

Connection Problems

There are a few things which can go wrong when trying to contact a GRAM service. These have slightly different error types which can help diagnose which problem is occurring.

Invalid Resource Name

If the hostname or TCP port you are using for a GRAM resource name is not correct, then the GRAM client will be unable to access the service. Errors of this type will look like this:

%  globus-job-run grid.example.org/jobmanager-fork /bin/hostname

GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12)

When this occurs, check with the resource administrator for correct resource naming so that you can contact the service.

Mutual Authentication Failure

GRAM performs mutual authentication, that is, both the client and service provide certificates indicating who they are. The service uses the client’s identity to map the user to a local unix account. The client uses the server’s identity to verify that the service is running with a host credential. The failure of the client to trust the server’s certificate will generate an error message that looks like this: globus_gsi_gssapi: Authorization denied: The expected name for the remote host ([email protected]) does not match the authenticated name of the remote host ([email protected]). This happens when the name in the host certificate does not match the information obtained from DNS and is often a DNS configuration problem.

This mismatch can happen for a number of reasons: a site administrator has multiple hosts sharing a certificate; a host has multiple DNS aliases and the client is not aware of which name the server is using for its certificate; or a host’s name has changed since the certificate was issued. The remedy for the client, after confirming with the GRAM administrator that the name after "authenticated name of the remote host" is the correct certificate name, is to use a form of the GRAM resource name which includes this name. For example, you can explicitly add the name to the abbreviated GRAM contact, so that instead of alias.example.org you would use alias.example.org::[email protected].

Certificate Trust Issues

Because of the mutual authentication, both GRAM users and services can hit problems if they do not trust their peer’s certificate or the Certificate Authority which issued it. The case where the client doesn’t trust the server’s certificate is the easier one to diagnose, because the error is reported on the client side; when the GRAM service doesn’t trust the client, it sends very little information back. In that case, working with the system administrator to get information from the GRAM logs will usually resolve the problem fairly easily.

If the service’s certificate is not trusted, the client will receive a message like this:

%  globus-job-run grid.example.org /bin/hostname
GRAM Job submission failed because an authentication operation failed
OpenSSL Error: s3_clnt.c:915: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash bbfccedf

This error indicates that the certificate chain from the service certificate to the client contained a self-signed certificate (usually an indication that it’s a CA certificate) which the client doesn’t trust, and includes the hash of the certificate name (bbfccedf in this case). If you hit this particular type of error, you should send the information to the GRAM administrator and determine which CA should be trusted and what its signing policy is, so that you can decide whether to add it to your local set of trust roots.
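As a rough illustration of using the hash from that message, the sketch below pulls the hash out of the error text and checks whether a matching CA file is present locally. It assumes the usual OpenSSL hashed-directory naming (`<hash>.0`) and falls back to `/etc/grid-security/certificates` when `X509_CERT_DIR` is unset; both are assumptions about your installation, and the function names are hypothetical.

```python
import os
import re

# Matches the trailing "with hash XXXXXXXX" part of the client-side error.
_HASH_RE = re.compile(r"with hash ([0-9a-fA-F]{8})\b")


def extract_ca_hash(error_text):
    """Return the CA certificate hash quoted in the error, or None."""
    match = _HASH_RE.search(error_text)
    return match.group(1).lower() if match else None


def ca_is_installed(ca_hash, cert_dir=None):
    """Check whether <hash>.0 exists in the trusted certificate directory."""
    # Assumption: use X509_CERT_DIR if set, otherwise the conventional
    # system-wide location. Your site may keep trust roots elsewhere.
    cert_dir = cert_dir or os.environ.get(
        "X509_CERT_DIR", "/etc/grid-security/certificates")
    return os.path.isfile(os.path.join(cert_dir, ca_hash + ".0"))


message = ("globus_gsi_callback_module: Can't get the local trusted CA "
           "certificate: Untrusted self-signed certificate in chain "
           "with hash bbfccedf")
print(extract_ca_hash(message))  # prints: bbfccedf
```

A missing file only tells you the CA is not in that directory; whether it should be trusted is still a decision to make with the GRAM administrator.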

Note

Different versions of OpenSSL produce different hashes for the same certificate names. If you upgrade a system (or transfer CA certificates between systems) to a different version of OpenSSL, you may hit this problem even if you think you have the CA certificate in your trusted certificate directory. If so, run the globus-update-certificate-dir program to update your hashes.

There are other reasons why a certificate might not be trusted (it’s in a revoked list, it has expired or was issued in the future, etc.). For more details, look at the troubleshooting information in the GSI user’s guide.

If for some reason the service does not trust your certificate, you’ll get a rather cryptic message from GRAM that looks like this:

%  globus-job-run grid.example.org /bin/hostname
GRAM Job submission failed because an authentication operation failed
globus_gsi_gssapi: Unable to verify remote side's credentials
globus_gsi_gssapi: Unable to verify remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1086: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate SSL alert number 42
 (error code 7)

To remedy this, consult the GRAM administrator to get information from the /var/log/globus-gatekeeper.log file to determine the reason why the gatekeeper didn’t like your certificate. Again it could be CA trust issues, clock skew, or a revoked certificate. The error in the gatekeeper log would typically look like the client-side trust issue above.

Authentication with the Remote Server Failed

Once the GRAM service has authenticated the client, it maps the client’s identity to a local user account using a grid-mapfile or other mapping service. If this fails, the client will receive a message that looks like this:

%  globus-job-run grid.example.org /bin/hostname
GRAM Job submission failed because authentication with the remote server failed (error code 7)

To remedy this, ask the system administrator of the GRAM resource to add you to the list of authorized users. Be sure to send your credential subject name to make it easier for them. To get that information, run the command grid-cert-info -s.

Unable to Find the Requested Service

Recall that a GRAM resource name includes a component called the service name. The default if not specified is jobmanager, but some sites may not use that name, or have a different LRM name than you expect. If you specify an incorrect service name, or the default is not present, you’ll get an error that looks like this:

%  globus-job-run grid.example.org /bin/hostname
GRAM Job submission failed because the gatekeeper failed to find the requested service (error code 93)

If you get this error, you’ll need to determine which services are available on that GRAM resource, either by asking the administrator or by looking at the entries in /etc/grid-services.

Failed to Run the Job Manager

The GRAM service is split between a privileged process called the globus-gatekeeper and a non-privileged process called the globus-job-manager which runs as a user process. If the globus-gatekeeper is unable to locate the globus-job-manager process, then this misconfiguration will show up like this:

%  globus-job-run grid.example.org /bin/hostname
GRAM Job submission failed because the gatekeeper failed to run the job manager (error code 47)

This is an installation mistake, and the administrator of the GRAM resource must fix this.
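The error codes quoted in the messages above can be collected into a small lookup, which may help when triaging failures from scripts. This is only a sketch: the hint texts paraphrase this section, the table covers just the four codes shown here, and the names `HINTS` and `explain` are hypothetical.

```python
import re

# Hints paraphrased from the connection problems described in this section.
HINTS = {
    7: "authentication failed: check certificate trust on both sides "
       "and whether you are in the authorized users list",
    12: "connection failed: check the host name and TCP port "
        "in the resource name",
    47: "gatekeeper could not run the job manager: installation problem "
        "for the resource administrator",
    93: "requested service not found: check the service name "
        "in the resource name",
}

# Client messages end with "(error code N)"; allow any whitespace
# between the words in case the message was wrapped.
_CODE_RE = re.compile(r"\(error\s+code\s+(\d+)\)")


def explain(client_output):
    """Return (code, hint) for the first error code found, or None."""
    match = _CODE_RE.search(client_output)
    if match is None:
        return None
    code = int(match.group(1))
    return code, HINTS.get(code, "no hint recorded for this code")


print(explain("GRAM Job submission failed because the gatekeeper failed "
              "to find the requested service (error code 93)"))
```

Note that code 7 covers more than one cause in this section, so the hint for it is deliberately broad.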

Jobs are Hanging

One problem GRAM users sometimes encounter is that it looks like jobs submitted to GRAM are not making any progress, even though the local resource manager thinks they’ve run. There are a couple of reasons why this might occur: GRAM is not getting the information it needs from the local resource manager, or the GRAM client is not getting the information it needs. We’ll cover diagnosing and handling the latter case in this document, as the other is a system administrator issue.

The way globus-job-run and globusrun determine that jobs have completed is via GRAM job state callbacks. These are messages sent by the GRAM service to the client node indicating that something significant has happened in the lifecycle of the job. If for some reason the GRAM service can not get those messages to the client, the client will not be able to detect job state changes.

In order to determine if this is the case, submit a job using globus-job-submit, and then use the globus-job-status command to see if the job state changes. If it does not, then consult the GRAM administrator, as there might be some problem with the installation. If it does, then for some reason the callbacks are not reaching the client. This might be caused by firewall issues or host naming issues.

The GRAM client sends a "callback contact" to the GRAM service when it submits a job, so that it can receive notifications. This contact is a reference to an HTTPS server embedded in the GRAM client which only handles GRAM state callbacks. As with all web servers, it has a URL which defines how to contact it, which in this case consists of the client host name and the service port number. If the host name that is used is not resolvable (such as for a laptop with a dynamic address), then the GRAM service will not be able to contact it. If that’s the case, you can set the GLOBUS_HOSTNAME environment variable to the IP address that your client can be reached at, and then submit your jobs. This will cause GRAM to publish that address instead of what it thinks the client’s host name is.

Another reason the GRAM service might be unable to send job state updates to a client is a firewall between the service and the client. If that’s the case, you might need to set the GLOBUS_TCP_PORT_RANGE environment variable to a comma-separated pair of numbers giving the minimum and maximum TCP port numbers to listen on. You might have to contact your site administrator to determine what TCP ports are allowed. If there are none, you can still use globus-job-submit and globus-job-status to track your job’s state changes, or use another tool like those mentioned in the section about client tools.
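The two environment variables discussed above can be set for a single client invocation rather than for a whole login session. The following sketch builds such an environment; the helper name `callback_environment`, the address 192.0.2.10, and the port range are all made up for illustration, so substitute values that apply to your network.

```python
import os


def callback_environment(hostname=None, port_range=None, base=None):
    """Return a copy of the environment with the GRAM callback settings.

    hostname   -> GLOBUS_HOSTNAME (address the service should call back)
    port_range -> GLOBUS_TCP_PORT_RANGE as "min,max"
    """
    env = dict(os.environ if base is None else base)
    if hostname is not None:
        env["GLOBUS_HOSTNAME"] = hostname
    if port_range is not None:
        low, high = port_range
        if not (0 < low <= high <= 65535):
            raise ValueError("port range must satisfy 0 < min <= max <= 65535")
        env["GLOBUS_TCP_PORT_RANGE"] = "%d,%d" % (low, high)
    return env


# Hypothetical values: a documentation-range address and an example range.
env = callback_environment("192.0.2.10", (40000, 40100), base={})
print(env["GLOBUS_TCP_PORT_RANGE"])  # prints: 40000,40100

# The result can be passed to a client command, for example:
#   subprocess.run(["globus-job-run", contact, "/bin/hostname"], env=env)
```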

Logs and Debugging

The GRAM service has a log file which contains information about the job as it is processed. These logs are located by default in /var/log/globus/gram_$USERNAME.log. There are some different logging levels available, as described in the GRAM Administrator’s Guide. These can be controlled on a per-job basis by adding the loglevel RSL attribute to your job description. The default is to log only FATAL and ERROR messages, but other levels can sometimes help understand what is going on.

Diagnosing LRM Errors

Sometimes, bugs creep into the LRM adapter scripts. When that occurs, the GRAM job will usually fail with an error like this:

GRAM Job failed because the job manager detected an invalid script status (error code 25)

If this occurs, you may have to work with a GRAM administrator to help debug this problem. One helpful thing you can do when reporting it is to save the GRAM internal script data so that it can be used outside of the GRAM service to see what the low-level error looks like. To do this, add the RSL fragment (savejobdescription = yes) to your job request. This will cause GRAM to leave a file called something like $HOME/gram_[0-9]*.pl in your home directory. You can use this with the internal tool /usr/share/globus/globus-job-manager-script.pl to try to submit the job to the LRM without using the GRAM service. The command line /usr/share/globus/globus-job-manager-script.pl -m will attempt to submit the job to the LRM. It will show all the information the LRM script sends to the GRAM service, which might include some Perl-language error or badly formatted output from the script (which must only output lines which begin with GRAM_SCRIPT_).
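If you build job requests programmatically, adding the fragment above amounts to appending one more clause to an RSL conjunction. The sketch below shows one way to do that; `add_rsl_clause` is a hypothetical helper, it only accepts conjunctions that start with `&`, and it deliberately refuses values that would need RSL quoting rather than guess at the quoting rules.

```python
import re

# Only plain tokens are accepted; anything needing RSL quoting is rejected.
_SIMPLE = re.compile(r"^[A-Za-z0-9_./-]+$")


def add_rsl_clause(rsl, name, value):
    """Append "(name = value)" to an RSL conjunction such as "&(a = b)"."""
    rsl = rsl.strip()
    if not rsl.startswith("&"):
        raise ValueError("expected an RSL conjunction starting with '&'")
    if not _SIMPLE.match(name) or not _SIMPLE.match(value):
        raise ValueError("only simple unquoted names and values are handled")
    return "%s(%s = %s)" % (rsl, name, value)


request = add_rsl_clause("&(executable = /bin/hostname)",
                         "savejobdescription", "yes")
print(request)
# prints: &(executable = /bin/hostname)(savejobdescription = yes)
```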

In some extreme cases, the savejobdescription option will not generate a file. If that’s the case, pass /dev/null as the argument to the -f command-line option. The problem is likely a Perl syntax error which will be reached before the job description is loaded.

Email Support

If all else fails, please send information about your problem to [email protected]. You’ll have to subscribe to a list before you can send an e-mail to it. See here for general e-mail lists and information on how to subscribe to a list and here for GRAM specific lists. Depending on the problem, you may be requested to file a bug report to the globus project’s Issue Tracker.

Command-line Client Reference Pages

This section contains reference pages for all of the tools described in the previous section. These pages contain all the command-line options for these tools. These are available as manpages in the documentation subpackages for the globus-gram-client-tools package.

GLOBUSRUN(1)

NAME

globusrun - Execute and manage jobs via GRAM

SYNOPSIS

globusrun [-help] [-usage] [-version] [-versions]

Description

The globusrun program submits and manages jobs run on a local or remote job host. The jobs are controlled by the globus-job-manager program which interfaces with a local resource manager that schedules and executes the job.

The globusrun program can be run in a number of different modes chosen by command-line options.

When -help, -usage, -version, or -versions command-line options are used, globusrun will print out diagnostic information and then exit.

When the -p or -parse command-line option is present, globusrun will verify the syntax of the RSL specification and then terminate. If the syntax is valid, globusrun will print out the string "RSL Parsed Successfully…​" and exit with a zero exit code; otherwise, it will print an error message and terminate with a non-zero exit code.

When the -a or -authenticate-only command-line option is present, globusrun will verify that the service named by RESOURCE_CONTACT exists and the client’s credentials are granted permission to access that service. If authentication is successful, globusrun will display the string "GRAM Authentication test successful" and exit with a zero exit code; otherwise it will print an explanation of the problem and exit with a non-zero exit code.

When the -j or -jobmanager-version command-line option is present, globusrun will attempt to determine the software version that the service named by RESOURCE_CONTACT is running. If successful, it will display both the Toolkit version and the Job Manager package version and exit with a zero exit code; otherwise, it will print an explanation of the problem and exit with a non-zero exit code.

When the -k or -kill command-line option is present, globusrun will attempt to terminate the job named by JOB_ID. If successful, globusrun will exit with zero; otherwise it will display an explanation of the problem and exit with a non-zero exit code.

When the -y or -refresh-proxy command-line option is present, globusrun will attempt to delegate a new X.509 proxy to the job manager which is managing the job named by JOB_ID. If successful, globusrun will exit with zero; otherwise it will display an explanation of the problem and exit with a non-zero exit code. This behavior can be modified by the -full-proxy or -D command-line options to enable full proxy delegation. The default is limited proxy delegation.

When the -status command-line option is present, globusrun will attempt to determine the current state of the job. If successful, the state will be printed to standard output and globusrun will exit with a zero exit code; otherwise, a description of the error will be displayed and it will exit with a non-zero exit code.

Otherwise, globusrun will submit the job to a GRAM service. By default, globusrun waits until the job has terminated or failed before exiting, displaying information about job state changes and at exit time, the job exit code if it is provided by the GRAM service.

The globusrun program can also function as a GASS file server to allow the globus-job-manager program to stage files between the machine on which globusrun is executed and the GRAM service node. This behavior is controlled by the -s, -o, and -w command-line options.

Jobs submitted by globusrun can be monitored interactively or detached. To have globusrun detach from the GRAM service after submitting the job, use the -b or -F command-line options.

Options

The full set of options to globusrun consists of:

-help

Display a help message to standard error and exit.

-usage

Display a one-line usage summary to standard error and exit.

-version

Display the software version of globusrun to standard error and exit.

-versions

Display the software version of all modules used by globusrun (including DiRT information) to standard error and then exit.

-p, -parse

Do a parse check on the job specification and print diagnostics. If a parse error occurs, globusrun exits with a non-zero exit code.

-f RSL_FILENAME, -file RSL_FILENAME

Read job specification from the file named by RSL_FILENAME.

-n, -no-interrupt

Disable handling of the SIGINT signal, so that the interrupt character (typically Control-C) causes globusrun to terminate without canceling the job.

-r RESOURCE_CONTACT, -resource RESOURCE_CONTACT

Submit the request to the resource specified by RESOURCE_CONTACT. A resource may be specified in the following ways:

  • HOST

  • HOST:PORT

  • HOST:PORT/SERVICE

  • HOST/SERVICE

  • HOST:/SERVICE

  • HOST::SUBJECT

  • HOST:PORT:SUBJECT

  • HOST/SERVICE:SUBJECT

  • HOST:/SERVICE:SUBJECT

  • HOST:PORT/SERVICE:SUBJECT

    If any of PORT, SERVICE, or SUBJECT is omitted, the defaults of 2119, jobmanager, and host@HOST are used respectively.
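To make the accepted forms concrete, here is a minimal sketch of a parser for the resource contact forms listed above. It is not the parser the GRAM client uses; `Contact` and `parse_contact` are hypothetical names, omitted components are returned as `None` so the caller can apply the defaults, and host names containing colons (such as IPv6 literals) are not handled.

```python
import re
from typing import NamedTuple, Optional


class Contact(NamedTuple):
    host: str
    port: Optional[int]
    service: Optional[str]
    subject: Optional[str]


# HOST [":" PORT] ["/" SERVICE] [":" SUBJECT]; PORT may be empty.
_CONTACT_RE = re.compile(
    r"(?P<host>[^:/]+)"
    r"(?::(?P<port>\d*))?"
    r"(?:/(?P<service>[^:]*))?"
    r"(?::(?P<subject>.+))?"
)


def parse_contact(text):
    """Split a resource contact string into its components."""
    match = _CONTACT_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a recognized resource contact: %r" % text)
    port, service, subject = match.group("port", "service", "subject")
    # "HOST:SUBJECT" is not one of the listed forms: a subject must follow
    # either a (possibly empty) port field or a service name.
    if subject is not None and port is None and service is None:
        raise ValueError("not a recognized resource contact: %r" % text)
    return Contact(match.group("host"),
                   int(port) if port else None,
                   service or None,
                   subject)


print(parse_contact("grid.example.org/jobmanager-fork"))
```

For example, `parse_contact("alias.example.org::host@grid.example.org")` yields the host `alias.example.org`, no port or service, and the subject `host@grid.example.org`.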

-j, -jobmanager-version

Print the software version being run by the service running at RESOURCE_CONTACT.

-k JOB_ID, -kill JOB_ID

Kill the job named by JOB_ID

-D, -full-proxy

Delegate a full impersonation proxy to the service. By default, a limited proxy is delegated when needed.

-y, -refresh-proxy

Delegate a new proxy to the service processing JOB_ID.

-status

Display the current status of the job named by JOB_ID.

-q, -quiet

Do not display job state change or exit code information.

-o, -output-enable

Start a GASS server within the globusrun application that allows access to its standard output and standard error streams only. Also, augment the RSL_SPECIFICATION with a definition of the GLOBUSRUN_GASS_URL RSL substitution and add stdout and stderr clauses which redirect the output and error streams of the job to the output and error streams of the interactive globusrun command. If this is specified, then globusrun acts as though the -q option were also specified.

-s, -server

Start a GASS server within the globusrun application that allows access to its standard output and standard error streams for writing and any file local to the globusrun invocation for reading. Also, augment the RSL_SPECIFICATION with a definition of the GLOBUSRUN_GASS_URL RSL substitution and add stdout and stderr clauses which redirect the output and error streams of the job to the output and error streams of the interactive globusrun command. If this is specified, then globusrun acts as though the -q option were also specified.

-w, -write-allow

Start a GASS server within the globusrun application that allows access to its standard output and standard error streams for writing and any file local to the globusrun invocation for reading or writing. Also, augment the RSL_SPECIFICATION with a definition of the GLOBUSRUN_GASS_URL RSL substitution and add stdout and stderr clauses which redirect the output and error streams of the job to the output and error streams of the interactive globusrun command. If this is specified, then globusrun acts as though the -q option were also specified.

-b, -batch

Terminate after submitting the job to the GRAM service. The globusrun program will exit after the job hits any of the following states: PENDING, ACTIVE, FAILED, or DONE. The GASS-related options can be used to stage input files, but standard output, standard error, and file staging after the job completes will not be processed.

-F, -fast-batch

Terminate after submitting the job to the GRAM service. The globusrun program will exit after it receives a reply from the service. The JOB_ID will be displayed to standard output before terminating so that the job can be checked with the -status command-line option or modified by the -refresh-proxy or -kill command-line options.

-d, -dryrun

Submit the job with the dryrun attribute set to true. When this is done, the job manager will prepare to start the job but stop short of submitting it to the local resource manager. This can be used to detect problems with the RSL_SPECIFICATION.
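As an illustration of how a few of these options combine, the sketch below assembles a globusrun argument list for a job description stored in a file. It uses only flags described in this list (-r, -f, -o, -s, -w, -b, -d); the function name, the keyword names, and the example file name `job.rsl` are hypothetical, and the list is built but not executed.

```python
# Maps a GASS access mode to the globusrun flag described above.
GASS_FLAGS = {"output": "-o", "server": "-s", "write": "-w"}


def globusrun_argv(contact, rsl_file, gass=None, batch=False, dryrun=False):
    """Build an argument list for globusrun without running it."""
    argv = ["globusrun", "-r", contact, "-f", rsl_file]
    if gass is not None:
        argv.append(GASS_FLAGS[gass])  # KeyError for an unknown mode
    if batch:
        argv.append("-b")
    if dryrun:
        argv.append("-d")
    return argv


print(globusrun_argv("grid.example.org", "job.rsl", gass="output"))
# prints: ['globusrun', '-r', 'grid.example.org', '-f', 'job.rsl', '-o']
```

The resulting list can be handed to `subprocess.run`, whose return code then follows the zero or non-zero exit conventions described in the Description section.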

Environment

The following variables affect the execution of globusrun:

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

Bugs

The globusrun program assumes any failure to contact the job means the job has terminated. In fact, this may be due to the globus-job-manager program exiting after all jobs it is managing have reached the DONE or FAILED states. In order to reliably detect job termination, the two_phase RSL attribute should be used.

See Also

globus-job-submit(1), globus-job-run(1), globus-job-clean(1), globus-job-get-output(1), globus-job-cancel(1)

GLOBUS-JOB-CANCEL(1)

NAME

globus-job-cancel - Cancel a GRAM batch job

SYNOPSIS

globus-job-cancel [-f | -force] [-q | -quiet] JOBID

Description

The globus-job-cancel program cancels the job named by JOBID. Any cached files associated with the job will remain until globus-job-clean is executed for the job.

By default, globus-job-cancel prompts the user prior to canceling the job. This behavior can be overridden by specifying the -f or -force command-line options.

Options

The full set of options to globus-job-cancel are:

-help, -usage

Display a help message to standard error and exit.

-version

Display the software version of the globus-job-cancel program to standard output.

-versions

Display the software version of the globus-job-cancel program including DiRT information to standard output.

-force, -f

Do not prompt to confirm job cancel and clean-up.

-quiet, -q

Do not print diagnostics for successful cancel. Implies -f.

ENVIRONMENT

The following variables affect the execution of globus-job-cancel:

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

GLOBUS-JOB-CLEAN(1)

NAME

globus-job-clean - Cancel and clean up a GRAM batch job

SYNOPSIS

globus-job-clean [-r RESOURCE | -resource RESOURCE]

[-f | -force] [-q | -quiet] JOBID

Description

The globus-job-clean program cancels the job named by JOBID if it is still running, and then removes any cached files on the GRAM service node related to that job. In order to do the file clean up, it submits a job which removes the cache files. By default this cleanup job is submitted to the default GRAM resource running on the same host as the job. This behavior can be controlled by specifying a resource manager contact string as the parameter to the -r or -resource option.

By default, globus-job-clean prompts the user prior to canceling the job. This behavior can be overridden by specifying the -f or -force command-line options.

Options

The full set of options to globus-job-clean are:

-help, -usage

Display a help message to standard error and exit.

-version

Display the software version of the globus-job-clean program to standard output.

-versions

Display the software version of the globus-job-clean program including DiRT information to standard output.

-resource RESOURCE, -r RESOURCE

Submit the clean-up job to the resource named by RESOURCE instead of the default GRAM service on the same host as the job contact.

-force, -f

Do not prompt to confirm job cancel and clean-up.

-quiet, -q

Do not print diagnostics for successful clean-up. Implies -f.

ENVIRONMENT

The following variables affect the execution of globus-job-clean:

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

GLOBUS-JOB-GET-OUTPUT(1)

NAME

globus-job-get-output - Retrieve the output and error streams from a GRAM job

SYNOPSIS

globus-job-get-output [-r RESOURCE | -resource RESOURCE]

[-out | -err] [-t LINES | -tail LINES] [-follow LINES | -f LINES] JOBID

Description

The globus-job-get-output program retrieves the output and error streams of the job named by JOBID. By default, globus-job-get-output will retrieve all output and error data from the job and display them to its own output and error streams. Other behavior can be controlled by using command-line options. The data retrieval is implemented by submitting another job which simply displays the contents of the first job’s output and error streams. By default this retrieval job is submitted to the default GRAM resource running on the same host as the job. This behavior can be controlled by specifying a particular resource manager contact string as the RESOURCE parameter to the -r or -resource option.

Options

The full set of options to globus-job-get-output are:

-help, -usage

Display a help message to standard error and exit.

-version

Display the software version of the globus-job-get-output program to standard output.

-versions

Display the software version of the globus-job-get-output program including DiRT information to standard output.

-resource RESOURCE, -r RESOURCE

Submit the retrieval job to the resource named by RESOURCE instead of the default GRAM service on the same host as the job contact.

-out

Retrieve only the standard output stream of the job. The default is to retrieve both standard output and standard error.

-err

Retrieve only the standard error stream of the job. The default is to retrieve both standard output and standard error.

-tail LINES, -t LINES

Print only the last LINES count lines of output from the data streams being retrieved. By default, the entire output and error file data is retrieved. This option can not be used along with the -f or -follow options.

-follow LINES, -f LINES

Print the last LINES count lines of output from the data streams being retrieved and then wait until canceled, printing any subsequent job output that occurs. By default, the entire output and error file data is retrieved. This option can not be used along with the -t or -tail options.

ENVIRONMENT

The following variables affect the execution of globus-job-get-output:

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

GLOBUS-JOB-RUN(1)

NAME

globus-job-run - Execute a job using GRAM

SYNOPSIS

globus-job-run [-dumprsl] [-dryrun] [-verify]

[-file ARGUMENT_FILE]

SERVICE_CONTACT

[-np PROCESSES | -count PROCESSES]

[-m MAX_TIME | -maxtime MAX_TIME]

[-p PROJECT | -project PROJECT]

[-q QUEUE | -queue QUEUE]

[-d DIRECTORY | -directory DIRECTORY] [-env NAME=VALUE]

[-stdin [-l | -s] STDIN_FILE] [-stdout [-l | -s] STDOUT_FILE] [-stderr [-l | -s] STDERR_FILE]

[-x RSL_CLAUSE]

[-l | -s] EXECUTABLE [ARGUMENT …]

Description

The globus-job-run program constructs a job description from its command-line options and then submits the job to the GRAM service running at SERVICE_CONTACT. The executable and arguments to the executable are provided on the command-line after all other options. Note that the -dumprsl, -dryrun, -verify, and -file command-line options must occur before the first non-option argument, the SERVICE_CONTACT.

The globus-job-run program provides similar functionality to globusrun in that it allows interactive start-up of GRAM jobs. However, unlike globusrun, it uses command-line parameters to define the job instead of RSL expressions.
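The ordering rule described above (early options, then SERVICE_CONTACT, then job options, then the executable and its arguments) is easy to get wrong when a command line is assembled by a script. The following sketch builds an argument list in that order using a handful of the documented options; the function and keyword names are hypothetical, and the list is only constructed, not executed.

```python
def job_run_argv(contact, executable, args=(), dryrun=False, count=None,
                 queue=None, env=None, stage_executable=False):
    """Build a globus-job-run argument list in the required order."""
    argv = ["globus-job-run"]
    if dryrun:
        argv.append("-dryrun")       # must precede SERVICE_CONTACT
    argv.append(contact)             # first non-option argument
    if count is not None:
        argv += ["-np", str(count)]
    if queue is not None:
        argv += ["-q", queue]
    for name, value in (env or {}).items():
        argv += ["-env", "%s=%s" % (name, value)]
    if stage_executable:
        argv.append("-s")            # stage the executable via GASS
    argv.append(executable)          # executable comes after all options
    argv.extend(args)
    return argv


print(job_run_argv("grid.example.org/jobmanager-fork", "/bin/hostname",
                   dryrun=True, count=2))
```

The same ordering applies to globus-job-submit, whose synopsis has the same shape.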

Options

The full set of options to globus-job-run are:

-help, -usage

Display a help message to standard error and exit.

-version

Display the software version of the globus-job-run program to standard output.

-versions

Display the software version of the globus-job-run program including DiRT information to standard output.

-dumprsl

Translate the command-line options to globus-job-run into an RSL expression that can be used with tools such as globusrun.

-dryrun

Submit the job request to the GRAM service with the dryrun option enabled. When this option is used, the GRAM service prepares to execute the job but stops before submitting the job to the LRM. This can be used to diagnose some problems such as missing files.

-verify

Submit the job request to the GRAM service with the dryrun option enabled and then without it enabled if the dryrun is successful.

-file ARGUMENT_FILE

Read additional command-line options from ARGUMENT_FILE.

-np PROCESSES, -count PROCESSES

Start PROCESSES instances of the executable as a single job.

-m MAX_TIME, -maxtime MAX_TIME

Schedule the job to run for a maximum of MAX_TIME minutes.

-p PROJECT, -project PROJECT

Request that the job use the allocation PROJECT when submitting the job to the LRM.

-q QUEUE, -queue QUEUE

Request that the job be submitted to the LRM using the named QUEUE.

-d DIRECTORY, -directory DIRECTORY

Run the job in the directory named by DIRECTORY. Input and output files will be interpreted relative to this directory. This directory must exist on the file system on the LRM-managed resource. If not specified, the job will run in the home directory of the user the job is running as.

-env NAME=VALUE

Define an environment variable named by NAME with the value VALUE in the job environment. This option may be specified multiple times to define multiple environment variables.

-stdin [-l | -s] STDIN_FILE

Use the file named by STDIN_FILE as the standard input of the job. If the -l option is specified, then this file is interpreted to be on a file system local to the LRM. If the -s option is specified, then this file is interpreted to be on the file system where globus-job-run is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

-stdout [-l | -s] STDOUT_FILE

Use the file named by STDOUT_FILE as the destination for the standard output of the job. If the -l option is specified, then this file is interpreted to be on a file system local to the LRM. If the -s option is specified, then this file is interpreted to be on the file system where globus-job-run is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

-stderr [-l | -s] STDERR_FILE

Use the file named by STDERR_FILE as the destination for the standard error of the job. If the -l option is specified, then this file is interpreted to be on a file system local to the LRM. If the -s option is specified, then this file is interpreted to be on the file system where globus-job-run is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

-x RSL_CLAUSE

Add a set of custom RSL attributes described by RSL_CLAUSE to the job description. The clause must be an RSL conjunction and may contain one or more attributes. This can be used to include attributes which can not be defined by other command-line options of globus-job-run.

-l

When included outside the context of -stdin, -stdout, or -stderr command-line options, the -l option alters the interpretation of the executable path. If the -l option is specified, then the executable is interpreted to be on a file system local to the LRM.

-s

When included outside the context of -stdin, -stdout, or -stderr command-line options, the -s option alters the interpretation of the executable path. If the -s option is specified, then the executable is interpreted to be on the file system where globus-job-run is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

ENVIRONMENT

The following variables affect the execution of globus-job-run:

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

See Also

globusrun(1), globus-job-submit(1), globus-job-clean(1), globus-job-get-output(1), globus-job-cancel(1)

GLOBUS-JOB-STATUS(1)

NAME

globus-job-status - Check the status of a GRAM5 job

SYNOPSIS

globus-job-status JOBID

Description

The globus-job-status program checks the status of a GRAM job by sending a status request to the job manager contact for that job specified by the JOBID parameter. If successful, it will print the job status to standard output. The states supported by globus-job-status are:

PENDING

The job has been submitted to the LRM but has not yet begun execution.

ACTIVE

The job has begun execution.

FAILED

The job has failed.

SUSPENDED

The job is currently suspended by the LRM.

DONE

The job has completed.

UNSUBMITTED

The job has been accepted by GRAM, but not yet submitted to the LRM.

STAGE_IN

The job has been accepted by GRAM and is currently staging files prior to being submitted to the LRM.

STAGE_OUT

The job has completed execution and is currently staging files from the service node to other http, GASS, or GridFTP servers.
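When callbacks cannot reach the client, polling with globus-job-status is the fallback, and the states above tell a script when to stop. The sketch below polls until DONE or FAILED is seen. It assumes globus-job-status prints just the state name on standard output, and the names `query_status` and `wait_for_job` are hypothetical; the status source is injectable so the loop can be exercised without a GRAM service.

```python
import subprocess
import time

# States after which no further change is expected.
TERMINAL_STATES = {"DONE", "FAILED"}


def query_status(job_id):
    """Run globus-job-status and return the state it prints."""
    result = subprocess.run(["globus-job-status", job_id],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def wait_for_job(job_id, get_status=query_status, interval=30.0,
                 max_polls=None):
    """Poll until a terminal state; return the distinct states observed."""
    seen = []
    polls = 0
    while max_polls is None or polls < max_polls:
        state = get_status(job_id)
        polls += 1
        if not seen or seen[-1] != state:
            seen.append(state)       # record only state changes
        if state in TERMINAL_STATES:
            break
        time.sleep(interval)
    return seen


# Exercise the loop with canned states instead of a real job.
canned = iter(["PENDING", "PENDING", "ACTIVE", "DONE"])
print(wait_for_job("example-job", get_status=lambda _id: next(canned),
                   interval=0))
# prints: ['PENDING', 'ACTIVE', 'DONE']
```

Keep in mind the limitation noted under Bugs for this command: a DONE result cannot be distinguished from the job manager having terminated for another reason.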

Options

The full set of options to globus-job-status are:

-help, -usage

Display a help message to standard error and exit.

-version

Display the software version of the globus-job-status program to standard output.

-versions

Display the software version of the globus-job-status program including DiRT information to standard output.

ENVIRONMENT

The following variables affect the execution of globus-job-status.

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

Bugs

The globus-job-status program can not distinguish between the case of the job manager terminating for any reason and the job being in the DONE state.

See Also

globusrun(1)

GLOBUS-JOB-SUBMIT(1)

NAME

globus-job-submit - Submit a batch job using GRAM

SYNOPSIS

globus-job-submit [-dumprsl] [-dryrun] [-verify]

[-file ARGUMENT_FILE]

SERVICE_CONTACT

[-np PROCESSES | -count PROCESSES]

[-m MAX_TIME | -maxtime MAX_TIME]

[-p PROJECT | -project PROJECT]

[-q QUEUE | -queue QUEUE]

[-d DIRECTORY | -directory DIRECTORY] [-env NAME=VALUE]

[-stdin [-l | -s] STDIN_FILE] [-stdout [-l | -s] STDOUT_FILE] [-stderr [-l | -s] STDERR_FILE]

[-x RSL_CLAUSE]

[-l | -s] EXECUTABLE [ARGUMENT …]

Description

The globus-job-submit program constructs a job description from its command-line options and then submits the job to the GRAM service running at SERVICE_CONTACT. The executable and arguments to the executable are provided on the command-line after all other options. Note that the -dumprsl, -dryrun, -verify, and -file command-line options must occur before the first non-option argument, the SERVICE_CONTACT.

The globus-job-submit program provides functionality similar to globusrun in that it allows batch submission of GRAM jobs. However, unlike globusrun, it uses command-line parameters to define the job instead of RSL expressions.

To retrieve the output and error streams of the job, use the program globus-job-get-output. To reclaim resources used by the job by deleting cached files and job state, use the program globus-job-clean. To cancel a batch job submitted by globus-job-submit, use the program globus-job-cancel.
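
The ordering rules described above can be sketched as a small helper that assembles an argument vector for globus-job-submit. This is an illustration only: the function name build_submit_argv and its parameters are hypothetical, it covers only a few of the documented options, and it builds the list without running the command.

```python
from typing import List, Mapping, Optional, Sequence


def build_submit_argv(
    service_contact: str,
    executable: str,
    arguments: Sequence[str] = (),
    *,
    dryrun: bool = False,
    count: Optional[int] = None,
    queue: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    stage_executable: bool = False,
) -> List[str]:
    """Assemble a globus-job-submit argument vector in the documented order."""
    argv = ["globus-job-submit"]
    # -dumprsl, -dryrun, -verify, and -file must precede SERVICE_CONTACT.
    if dryrun:
        argv.append("-dryrun")
    argv.append(service_contact)
    if count is not None:
        argv += ["-np", str(count)]
    if queue is not None:
        argv += ["-q", queue]
    for name, value in (env or {}).items():
        argv += ["-env", f"{name}={value}"]
    # -s stages the executable from the submitting host via GASS; -l treats
    # the path as local to the LRM (also the behavior when neither is given).
    argv.append("-s" if stage_executable else "-l")
    # The executable and its arguments come after all other options.
    argv.append(executable)
    argv.extend(arguments)
    return argv
```

For example, build_submit_argv("grid.example.org", "/bin/hostname", count=2) yields a list that can be passed to subprocess.run; the host name grid.example.org is a placeholder, not a real service contact.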

Options

The full set of options to globus-job-submit are:

-help, -usage

Display a help message to standard error and exit.

-version

Display the software version of the globus-job-submit program to standard output.

-versions

Display the software version of the globus-job-submit program including DiRT information to standard output.

-dumprsl

Translate the command-line options to globus-job-submit into an RSL expression that can be used with tools such as globusrun.

-dryrun

Submit the job request to the GRAM service with the dryrun option enabled. When this option is used, the GRAM service prepares to execute the job but stops before submitting the job to the LRM. This can be used to diagnose some problems such as missing files.

-verify

Submit the job request to the GRAM service with the dryrun option enabled and then without it enabled if the dryrun is successful.

-file ARGUMENT_FILE

Read additional command-line options from ARGUMENT_FILE.

-np PROCESSES, -count PROCESSES

Start PROCESSES instances of the executable as a single job.

-m MAX_TIME, -maxtime MAX_TIME

Schedule the job to run for a maximum of MAX_TIME minutes.

-p PROJECT, -project PROJECT

Request that the job use the allocation PROJECT when submitting the job to the LRM.

-q QUEUE, -queue QUEUE

Request that the job be submitted to the LRM using the named QUEUE.

-d DIRECTORY, -directory DIRECTORY

Run the job in the directory named by DIRECTORY. Input and output files will be interpreted relative to this directory. This directory must exist on the file system on the LRM-managed resource. If not specified, the job will run in the home directory of the user the job is running as.

-env NAME=VALUE

Define an environment variable named by NAME with the value VALUE in the job environment. This option may be specified multiple times to define multiple environment variables.

-stdin [-l | -s] STDIN_FILE

Use the file named by STDIN_FILE as the standard input of the job. If the -l option is specified, then this file is interpreted to be on a file system local to the LRM. If the -s option is specified, then this file is interpreted to be on the file system where globus-job-submit is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

-stdout [-l | -s] STDOUT_FILE

Use the file named by STDOUT_FILE as the destination for the standard output of the job. If the -l option is specified, then this file is interpreted to be on a file system local to the LRM. If the -s option is specified, then this file is interpreted to be on the file system where globus-job-submit is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

-stderr [-l | -s] STDERR_FILE

Use the file named by STDERR_FILE as the destination for the standard error of the job. If the -l option is specified, then this file is interpreted to be on a file system local to the LRM. If the -s option is specified, then this file is interpreted to be on the file system where globus-job-submit is being executed, and the file will be staged via GASS. If neither is specified, the local behavior is assumed.

-x RSL_CLAUSE

Add a set of custom RSL attributes described by RSL_CLAUSE to the job description. The clause must be an RSL conjunction and may contain one or more attributes. This can be used to include attributes that cannot be defined by other command-line options of globus-job-submit.

-l

When included outside the context of the -stdin, -stdout, or -stderr command-line options, the -l option alters the interpretation of the executable path. If the -l option is specified, then the executable is interpreted to be on a file system local to the LRM.

-s

When included outside the context of the -stdin, -stdout, or -stderr command-line options, the -s option alters the interpretation of the executable path. If the -s option is specified, then the executable is interpreted to be on the file system where globus-job-submit is being executed, and the file will be staged via GASS. If neither -l nor -s is specified, the local behavior is assumed.

ENVIRONMENT

The following variables affect the execution of globus-job-submit.

X509_USER_PROXY

Path to proxy credential.

X509_CERT_DIR

Path to trusted certificate directory.

See Also

globusrun(1), globus-job-run(1), globus-job-clean(1), globus-job-get-output(1), globus-job-cancel(1)

GRAM RSL Quick Reference

The GRAM RSL language is described in detail in the GRAM Developer’s Guide. For basic use, job description RSLs consist of a set of RSL attributes preceded by the & character. The basic job description looks like:

&
    (attribute = value )
    (attribute = value )
    ...
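
The basic form above can be produced programmatically. The following is a minimal sketch that renders a mapping of attribute names to values as a single RSL conjunction. The helper names are hypothetical, and the quoting rule (doubling an embedded double quote) is an assumption that should be checked against the GRAM Developer's Guide before use.

```python
from typing import Iterable, Mapping, Union

Value = Union[str, int, Iterable[str]]


def quote(value: str) -> str:
    # Assumption: a quoted RSL literal escapes an embedded double quote by
    # doubling it.
    return '"' + value.replace('"', '""') + '"'


def render_rsl(attributes: Mapping[str, Value]) -> str:
    """Render attributes as '&(name = value)(name = value)...'."""
    clauses = []
    for name, value in attributes.items():
        if isinstance(value, int):
            rendered = str(value)          # integers are written unquoted
        elif isinstance(value, str):
            rendered = quote(value)        # a single string value
        else:
            # A sequence of values, such as the arguments attribute.
            rendered = " ".join(quote(str(item)) for item in value)
        clauses.append(f"({name} = {rendered})")
    return "&" + "".join(clauses)
```

Quoting every string keeps an argument that contains a space as a single value, which matches the advice given for the arguments attribute.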

The following list contains the RSL attributes which are available in the core job manager. Other LRM-specific RSL attributes may also be available in some situations.

arguments

The command-line arguments for the executable. Use quotes if a space is required in a single argument.

count

The number of executions of the executable.

directory

Specifies the path of the directory the jobmanager will use as the default directory for the requested job.

dry_run

If dry_run = yes, the jobmanager will not submit the job for execution and will return success.

environment

The environment variables that will be defined for the executable in addition to the default set that is given to the job by the jobmanager.

executable

The name of the executable file to run on the remote machine. If the value is a GASS URL, the file is transferred to the remote gass cache before executing the job and removed after the job has terminated.

file_clean_up

Specifies a list of files which will be removed after the job is completed.

file_stage_in

Specifies a list of ("remote URL" "local file") pairs which indicate files to be staged to the nodes which will run the job.

file_stage_in_shared

Specifies a list of ("remote URL" "local file") pairs which indicate files to be staged into the cache. A symlink from the cache to the "local file" path will be made.

file_stage_out

Specifies a list of ("local file" "remote URL") pairs which indicate files to be staged from the job to a GASS-compatible file server.

gass_cache

Specifies location to override the GASS cache location.

gram_my_job

Obsolete and ignored.

host_count

Only applies to clusters of SMP computers, such as newer IBM SP systems. Defines the number of nodes ("pizza boxes") to distribute the "count" processes across.

job_type

Specifies how the jobmanager should start the job. Possible values are single (start only one process or thread, even if count > 1), multiple (start count processes or threads), mpi (use the appropriate method, e.g. mpirun, to start a program compiled with a vendor-provided MPI library; the program is started with count nodes), and condor (start condor jobs in the "condor" universe).

library_path

Specifies a list of paths to be appended to the system-specific library path environment variables.

loglevel

Override the default log level for this job. The value of this attribute consists of a combination of the strings FATAL, ERROR, WARN, INFO, DEBUG, and TRACE, joined by the | character.

logpattern

Override the default log path pattern for this job. The value of this attribute is a string (potentially containing RSL substitutions) that is evaluated to the path to write the log to. If the resulting string contains the string $(DATE) (or any other RSL substitution), it will be reevaluated at log time.

max_cpu_time

Explicitly set the maximum cputime for a single execution of the executable. The value is in minutes and goes through an atoi() conversion to obtain an integer. If the GRAM scheduler cannot set cputime, then an error will be returned.

max_memory

Explicitly set the maximum amount of memory for a single execution of the executable. The value is in megabytes and goes through an atoi() conversion to obtain an integer. If the GRAM scheduler cannot set maxMemory, then an error will be returned.

max_time

The maximum walltime or cputime for a single execution of the executable. Whether walltime or cputime applies is determined by the GRAM scheduler being interfaced. The value is in minutes and goes through an atoi() conversion to obtain an integer.

max_wall_time

Explicitly set the maximum walltime for a single execution of the executable. The value is in minutes and goes through an atoi() conversion to obtain an integer. If the GRAM scheduler cannot set walltime, then an error will be returned.

min_memory

Explicitly set the minimum amount of memory for a single execution of the executable. The value is in megabytes and goes through an atoi() conversion to obtain an integer. If the GRAM scheduler cannot set minMemory, then an error will be returned.

project

Target the job to be allocated to a project account as defined by the scheduler at the defined (remote) resource.

proxy_timeout

Obsolete and ignored. Now a job-manager-wide setting.

queue

Target the job to a queue (class) name as defined by the scheduler at the defined (remote) resource.

remote_io_url

Writes the given value (a URL base string) to a file, and adds the path to that file to the environment through the GLOBUS_REMOTE_IO_URL environment variable. If this is specified as part of a job restart RSL, the job manager will update the file’s contents. This is intended for jobs that want to access files via GASS, but the URL of the GASS server has changed due to a GASS server restart.

restart

Start a new job manager, but instead of submitting a new job, start managing an existing job. The job manager will search for the job state file created by the original job manager. If it finds the file and successfully reads it, it will become the new manager of the job, sending callbacks on status and streaming stdout/err if appropriate. It will fail if it detects that the old jobmanager is still alive (via a timestamp in the state file). If stdout or stderr was being streamed over the network, new stdout and stderr attributes can be specified in the restart RSL and the jobmanager will stream to the new locations (useful when output is going to a GASS server started by the client that’s listening on a dynamic port, and the client was restarted). The new job manager will return a new contact string that should be used to communicate with it. If a jobmanager is restarted multiple times, any of the previous contact strings can be given for the restart attribute.

rsl_substitution

Specifies a list of values which can be substituted into the values of other RSL attributes through the $(SUBSTITUTION) mechanism.

save_state

Causes the jobmanager to save its job state information to a persistent file on disk. If the job manager exits or is suspended, the client can later start up a new job manager which can continue monitoring the job.

savejobdescription

Save a copy of the job description to $HOME.

scratch_dir

Specifies the location to create a scratch subdirectory in. A SCRATCH_DIRECTORY RSL substitution will be filled with the name of the directory which is created.

stderr

The name of the remote file to store the standard error from the job. If the value is a GASS URL, the standard error from the job is transferred dynamically during the execution of the job. There are two accepted forms of this value. It can consist of a single destination: stderr = URL, or a sequence of destinations: stderr = (DESTINATION) (DESTINATION). In the latter case, the DESTINATION may itself be a URL or a sequence of an x-gass-cache URL followed by a cache tag.

stderr_position

Specifies where in the file remote standard error streaming should be restarted from. Must be 0.

stdin

The name of the file to be used as standard input for the executable on the remote machine. If the value is a GASS URL, the file is transferred to the remote gass cache before executing the job and removed after the job has terminated.

stdout

The name of the remote file to store the standard output from the job. If the value is a GASS URL, the standard output from the job is transferred dynamically during the execution of the job. There are two accepted forms of this value. It can consist of a single destination: stdout = URL, or a sequence of destinations: stdout = (DESTINATION) (DESTINATION). In the latter case, the DESTINATION may itself be a URL or a sequence of an x-gass-cache URL followed by a cache tag.

stdout_position

Specifies where in the file remote output streaming should be restarted from. Must be 0.

two_phase

Use a two-phase commit for job submission and completion. The job manager will respond to the initial job request with a WAITING_FOR_COMMIT error. It will then wait for a signal from the client before doing the actual job submission. The integer supplied is the number of seconds the job manager should wait before timing out. If the job manager times out before receiving the commit signal, or if a client issues a cancel signal, the job manager will clean up the job’s files and exit, sending a callback with the job status as GLOBUS_GRAM_PROTOCOL_JOB_STATE_FAILED. After the job manager sends a DONE or FAILED callback, it will wait for a commit signal from the client. If it receives one, it cleans up and exits as usual. If it times out and save_state was enabled, it will leave all of the job’s files in place and exit (assuming the client is down and will attempt a job restart later). The timeout value can be extended via a signal. When one of the following errors occurs, the job manager does not delete the job state file when it exits: GLOBUS_GRAM_PROTOCOL_ERROR_COMMIT_TIMED_OUT, GLOBUS_GRAM_PROTOCOL_ERROR_TTL_EXPIRED, GLOBUS_GRAM_PROTOCOL_ERROR_JM_STOPPED, GLOBUS_GRAM_PROTOCOL_ERROR_USER_PROXY_EXPIRED. In these cases, it can not be restarted, so the job manager will not wait for the commit signal after sending the FAILED callback.

username

Verify that the job is running as this user.
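
Several of the attributes above accept only a restricted set of values. The following is a rough sketch of a client-side check applied before submission. The function name check_attributes is hypothetical, the rules are taken from the descriptions in this guide rather than from the job manager itself, and every value is assumed to arrive as a string.

```python
from typing import List, Mapping

JOB_TYPES = {"single", "multiple", "mpi", "condor"}
LOG_LEVELS = {"FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"}
MINUTE_ATTRIBUTES = ("max_time", "max_wall_time", "max_cpu_time")


def check_attributes(attrs: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found in a mapping of RSL attributes."""
    problems = []
    job_type = attrs.get("job_type")
    if job_type is not None and job_type not in JOB_TYPES:
        problems.append(f"job_type {job_type!r} is not one of {sorted(JOB_TYPES)}")
    # count and the time limits are expected to be positive integers.
    for name in ("count",) + MINUTE_ATTRIBUTES:
        value = attrs.get(name)
        if value is not None and not (value.isdigit() and int(value) > 0):
            problems.append(f"{name} must be a positive integer, got {value!r}")
    # loglevel is a combination of level names joined by '|'.
    for level in attrs.get("loglevel", "").split("|"):
        if level and level not in LOG_LEVELS:
            problems.append(f"unknown loglevel {level!r}")
    # The streaming restart positions must be 0.
    for name in ("stdout_position", "stderr_position"):
        if attrs.get(name, "0") != "0":
            problems.append(f"{name} must be 0")
    return problems
```

An empty list means that none of these particular rules was violated; it does not guarantee that the job manager or the LRM will accept the job description.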

GRAM Error Message Reference

Errors

Table 1. GRAM5 Error Codes
Columns, in order for each entry below: Error Code, Reason, Possible Solutions

1

one of the RSL parameters is not supported

Check RSL documentation

2

the RSL length is greater than the maximum allowed

Use RSL substitutions to reduce length of RSL strings

3

an I/O operation failed

Enable trace logging and report to [email protected]

4

jobmanager unable to set default to the directory requested

Check that RSL directory attribute refers to a directory that exists on the target system.

5

the executable does not exist

Check that the RSL executable attribute refers to an executable that exists on the target system.

6

of an unused INSUFFICIENT_FUNDS

Unimplemented feature.

7

authentication with the remote server failed

Check that the contact string contains the proper X.509 DN.

8

the user cancelled the job

Don’t cancel jobs you want to complete.

9

the system cancelled the job

Check RSL requirements such as maximum time and memory are valid for the job.

10

data transfer to the server failed

Check gatekeeper and/or job manager logs to see why the process failed.

11

the stdin file does not exist

Check that the RSL stdin attribute refers to a file that exists on the target system or has a valid ftp, gsiftp, http, or https URL.

12

the connection to the server failed (check host and port)

Check that the service is running on the expected TCP/IP port. Check that no firewall prevents contacting that TCP/IP port. Check $GLOBUS_LOCATION/var/globus-gatekeeper.log for runtime configuration errors.

13

the provided RSL maxtime value is not an integer

Check that the RSL maxtime value evaluates to an integer.

14

the provided RSL count value is not an integer

Check that the RSL count value evaluates to an integer.

15

the job manager received an invalid RSL

Check that the RSL string can be parsed by using globusrun -p RSL.

16

the job manager failed in allowing others to make contact

Check job manager log.

17

the job failed when the job manager attempted to run it

Verify that the LRM is configured properly.

18

an invalid paradyn was specified

OBSOLETE IN GRAM2

19

the provided RSL jobtype value is invalid

The RSL jobtype attribute is not indicated as supported by the LRM. Valid jobtype values are single, multiple, mpi, and condor.

20

the provided RSL myjob value is invalid

OBSOLETE IN GRAM5

21

the job manager failed to locate an internal script argument file

Check that $GLOBUS_LOCATION/libexec/globus-job-manager-script.pl exists and is executable. Check that the LRM-specific perl module is located in $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/ directory and is valid. The command perl -I$GLOBUS_LOCATION/lib/perl $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/LRM.pm can be used to check if there are any syntax errors in the script.

22

the job manager failed to create an internal script argument file

Check that your home directory is writable and not full.

23

the job manager detected an invalid job state

Check job manager logs.

24

the job manager detected an invalid script response

Check job manager logs. This is likely a bug in the LRM script.

25

the job manager detected an invalid script status

Check job manager logs. This is likely a bug in the LRM script.

26

the provided RSL jobtype value is not supported by this job manager

Check that the RSL jobtype attribute is implemented by the LRM script. Note that some job types require configuration.

27

unused ERROR_UNIMPLEMENTED

LRM does not support some feature included in the job request.

28

the job manager failed to create an internal script submission file

Check that the user’s home file system is not full. Check job manager log.

29

the job manager cannot find the user proxy

Check that client is delegating a proxy when authenticating with the gatekeeper. Check that the user’s home filesystem and the /tmp file system are not full.

30

the job manager failed to open the user proxy

Check that the user’s home filesystem and the /tmp file system are not full.

31

the job manager failed to cancel the job as requested

Check that the user’s home filesystem and the /tmp file system are not full.

32

system memory allocation failed

Check job manager log for details.

33

the interprocess job communication initialization failed

OBSOLETE IN GRAM5

34

the interprocess job communication setup failed

OBSOLETE IN GRAM5

35

the provided RSL host count value is invalid

Check that the RSL host_count attribute evaluates to an integer.

36

one of the provided RSL parameters is unsupported

Check job manager log for details about invalid parameter.

37

the provided RSL queue parameter is invalid

Check that the RSL queue attribute evaluates to a string that corresponds to an LRM-specific queue name.

38

the provided RSL project parameter is invalid

Check that the RSL project attribute evaluates to a string that corresponds to an LRM-specific project name.

39

the provided RSL string includes variables that could not be identified

Check that all RSL substitutions are defined before being used in the job description.

40

the provided RSL environment parameter is invalid

Check that the RSL environment attribute contains a sequence of VARIABLE VALUE pairs.

41

the provided RSL dryrun parameter is invalid

Remove the RSL dryrun attribute from the job description.

42

the provided RSL is invalid (an empty string)

Include a non-empty RSL string in your job submission request.

43

the job manager failed to stage the executable

Check that the file service hosting the executable is reachable from the GRAM5 service node. Check that the executable exists on the file service node. Check that there is sufficient disk space in the user’s home directory on the service node to store the executable.

44

the job manager failed to stage the stdin file

Check that the file service hosting the standard input file is reachable from the GRAM5 service node. Check that the standard input file exists on the file service node. Check that there is sufficient disk space in the user’s home directory on the service node to store the standard input file.

45

the requested job manager type is invalid

OBSOLETE IN GRAM5

46

the provided RSL arguments parameter is invalid

OBSOLETE IN GRAM2

47

the gatekeeper failed to run the job manager

Check the gatekeeper or job manager logs for more information.

48

the provided RSL could not be properly parsed

Check that the RSL string can be parsed by using globusrun -p RSL.

49

there is a version mismatch between GRAM components

Ask the system administrator to upgrade the GRAM service to GRAM2 or GRAM5.

50

the provided RSL arguments parameter is invalid

Check that the RSL arguments attribute evaluates to a sequence of strings.

51

the provided RSL count parameter is invalid

Check that the RSL count attribute evaluates to a positive integer value.

52

the provided RSL directory parameter is invalid

Check that the RSL directory attribute evaluates to a string.

53

the provided RSL dryrun parameter is invalid

Check that the RSL dryrun attribute evaluates to either yes or no.

54

the provided RSL environment parameter is invalid

Check that the RSL environment attribute evaluates to a sequence of VARIABLE, VALUE pairs.

55

the provided RSL executable parameter is invalid

Check that the RSL executable attribute evaluates to a string value.

56

the provided RSL host_count parameter is invalid

Check that the RSL host_count attribute evaluates to a positive integer value.

57

the provided RSL jobtype parameter is invalid

Check that the RSL jobtype attribute evaluates to one of single, multiple, mpi, or condor.

58

the provided RSL maxtime parameter is invalid

Check that the RSL maxtime attribute evaluates to a positive integer value.

59

the provided RSL myjob parameter is invalid

OBSOLETE IN GRAM5.

60

the provided RSL paradyn parameter is invalid

OBSOLETE IN GRAM2.

61

the provided RSL project parameter is invalid

Check that the RSL project attribute evaluates to a string value.

62

the provided RSL queue parameter is invalid

Check that the RSL queue attribute evaluates to a string value.

63

the provided RSL stderr parameter is invalid

Check that the RSL stderr attribute evaluates to a string value or a sequence of DESTINATION URLs with optional CACHE_TAG string parameters.

64

the provided RSL stdin parameter is invalid

Check that the RSL stdin attribute evaluates to a string value.

65

the provided RSL stdout parameter is invalid

Check that the RSL stdout attribute evaluates to a string value or a sequence of DESTINATION URLs with optional CACHE_TAG string parameters.

66

the job manager failed to locate an internal script

Check job manager log for more details.

67

the job manager failed on the system call pipe()

OBSOLETE IN GRAM5

68

the job manager failed on the system call fcntl()

OBSOLETE IN GRAM2

69

the job manager failed to create the temporary stdout filename

OBSOLETE IN GRAM5

70

the job manager failed to create the temporary stderr filename

OBSOLETE IN GRAM5

71

the job manager failed on the system call fork()

OBSOLETE IN GRAM2

72

the executable file permissions do not allow execution

Check that the RSL executable attribute refers to an executable program or script.

73

the job manager failed to open stdout

Check that the RSL stdout attribute refers to one or more valid destination files or URLs.

74

the job manager failed to open stderr

Check that the RSL stderr attribute refers to one or more valid destination files or URLs.

75

the cache file could not be opened in order to relocate the user proxy

Check that the user’s home directory is writable and not full on the GRAM5 service node.

76

cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space

Check that the user’s home directory is writable and not full on the GRAM5 service node.

77

the job manager failed to insert the contact in the client contact list

Check job manager log.

78

the contact was not found in the job manager’s client contact list

Don’t attempt to unregister callback contacts that are not registered.

79

connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, …​

Check that the job manager process is running. Check that the job manager credential has not expired. Check that the job manager contact refers to the correct TCP/IP host and port. Check that the job manager contact is not blocked by a firewall.

80

the syntax of the job contact is invalid

Check the syntax of job contact string.

81

the executable parameter in the RSL is undefined

Include the RSL executable in all job requests.

82

the job manager service is misconfigured. condor arch undefined

Add the -condor-arch to the command-line or configuration file for a job manager configured to use the condor LRM.

83

the job manager service is misconfigured. condor os undefined

Add the -condor-os to the command-line or configuration file for a job manager configured to use the condor LRM.

84

the provided RSL min_memory parameter is invalid

Check that the RSL min_memory attribute evaluates to a positive integer value.

85

the provided RSL max_memory parameter is invalid

Check that the RSL max_memory attribute evaluates to a positive integer value.

86

the RSL min_memory value is not zero or greater

Check that the RSL min_memory attribute evaluates to a positive integer value.

87

the RSL max_memory value is not zero or greater

Check that the RSL max_memory attribute evaluates to a positive integer value.

88

the creation of a HTTP message failed

Check job manager log.

89

parsing incoming HTTP message failed

Check job manager log.

90

the packing of information into a HTTP message failed

Check job manager log.

91

an incoming HTTP message did not contain the expected information

Check job manager log.

92

the job manager does not support the service that the client requested

Check that the client is talking to the correct service.

93

the gatekeeper failed to find the requested service

OBSOLETE IN GRAM2

94

the jobmanager does not accept any new requests (shutting down)

Execute queries before the job has been cleaned up.

95

the client failed to close the listener associated with the callback URL

Call globus_gram_client_callback_disallow() with a valid callback contact.

96

the gatekeeper contact cannot be parsed

Check the syntax of the gatekeeper contact string you are attempting to contact.

97

the job manager could not find the poe command

OBSOLETE IN GRAM2

98

the job manager could not find the mpirun command

Configure the LRM script with mpirun in your path.

99

the provided RSL start_time parameter is invalid

OBSOLETE IN GRAM2

100

the provided RSL reservation_handle parameter is invalid

OBSOLETE IN GRAM2

101

the provided RSL max_wall_time parameter is invalid

Check that the RSL max_wall_time attribute evaluates to a positive integer.

102

the RSL max_wall_time value is not zero or greater

Check that the RSL max_wall_time attribute evaluates to a positive integer.

103

the provided RSL max_cpu_time parameter is invalid

Check that the RSL max_cpu_time attribute evaluates to a positive integer.

104

the RSL max_cpu_time value is not zero or greater

Check that the RSL max_cpu_time attribute evaluates to a positive integer.

105

the job manager is misconfigured, a scheduler script is missing

Check that the administrator has configured the LRM by running its setup script.

106

the job manager is misconfigured, a scheduler script has invalid permissions

Check that the administrator has installed the $GLOBUS_LOCATION/libexec/globus-job-manager-script.pl script. Check that the file system containing that script allows file execution.

107

the job manager failed to signal the job

OBSOLETE IN GRAM2

108

the job manager did not recognize/support the signal type

Check that your signal operation is using the correct signal constant.

109

the job manager failed to get the job id from the local scheduler

OBSOLETE IN GRAM2

110

the job manager is waiting for a commit signal

Send a two-phase commit signal to the job manager to acknowledge receiving the job contact from the job manager.

111

the job manager timed out while waiting for a commit signal

Send a two-phase commit signal to the job manager to acknowledge receiving the job contact from the job manager. Increase the two-phase commit time out for your job. Check that the job manager contact TCP/IP port is reachable from your client.

112

the provided RSL save_state parameter is invalid

Check that the RSL save_state attribute is set to yes or no.

113

the provided RSL restart parameter is invalid

Check that the RSL restart attribute evaluates to a string containing a job contact string.

114

the provided RSL two_phase parameter is invalid

Check that the RSL two_phase attribute evaluates to a positive integer.

115

the RSL two_phase value is not zero or greater

Check that the RSL two_phase attribute evaluates to a positive integer.

116

the provided RSL stdout_position parameter is invalid

OBSOLETE IN GRAM5

117

the RSL stdout_position value is not zero or greater

OBSOLETE IN GRAM5

118

the provided RSL stderr_position parameter is invalid

OBSOLETE IN GRAM5

119

the RSL stderr_position value is not zero or greater

OBSOLETE IN GRAM5

120

the job manager restart attempt failed

OBSOLETE IN GRAM2

121

the job state file doesn’t exist

Check that the job contact you are trying to restart matches one that the job manager returned to you.

122

could not read the job state file

Check that the state file directory is not full.

123

could not write the job state file

Check that the state file directory is not full.

124

old job manager is still alive

Contact the returned job manager contact to manage the job you are trying to restart.

125

job manager state file TTL expired

OBSOLETE IN GRAM2

126

it is unknown if the job was submitted

Check job manager log.

127

the provided RSL remote_io_url parameter is invalid

Check that the RSL remote_io_url attribute evaluates to a string value.

128

could not write the remote io url file

Check that the user’s home file system on the job manager service node is writable and not full.

129

the standard output/error size is different

Send a stdio update signal to redirect the job manager output to a new URL.

130

the job manager was sent a stop signal (job is still running)

Submit a restart request to monitor the job.

131

the user proxy expired (job is still running)

Generate a new proxy and then submit a restart request to monitor the job.

132

the job was not submitted by original jobmanager

OBSOLETE IN GRAM2

133

the job manager is not waiting for that commit signal

Do not send a commit signal to a job that is not waiting for a commit signal.

134

the provided RSL scheduler specific parameter is invalid

Check the LRM-specific documentation to determine what values are legal for the RSL extensions implemented by the LRM.

135

the job manager could not stage in a file

Check that the file service hosting the file to stage is reachable from the GRAM5 service node. Check that the file to stage exists on the file service node. Check that there is sufficient disk space in the user’s home directory on the service node to store the file to stage.

136

the scratch directory could not be created

Check that the directory named by the RSL scratch_dir attribute exists and is writable. Check that the directory named by the RSL scratch_dir attribute is not full.

137

the provided gass_cache parameter is invalid

Check that the RSL gass_cache attribute evaluates to a string.

138

the RSL contains attributes which are not valid for job submission

Do not use restart- or signal-only RSL attributes when submitting a job.

139

the RSL contains attributes which are not valid for stdio update

Do not use submit- or restart-only RSL attributes when sending a stdio update signal to a job.

140

the RSL contains attributes which are not valid for job restart

Do not use submit- or signal-only RSL attributes when restarting a job.

141

the provided RSL file_stage_in parameter is invalid

Check that the RSL file_stage_in attribute evaluates to a sequence of SOURCE DESTINATION pairs.

142

the provided RSL file_stage_in_shared parameter is invalid

Check that the RSL file_stage_in_shared attribute evaluates to a sequence of SOURCE DESTINATION pairs.

143

the provided RSL file_stage_out parameter is invalid

Check that the RSL file_stage_out attribute evaluates to a sequence of SOURCE DESTINATION pairs.

144

the provided RSL gass_cache parameter is invalid

Check that the RSL gass_cache attribute evaluates to a string.

145

the provided RSL file_cleanup parameter is invalid

Check that the RSL file_clean_up attribute evaluates to a sequence of strings.

146

the provided RSL scratch_dir parameter is invalid

Check that the RSL scratch_dir attribute evaluates to a string.

147

the provided scheduler-specific RSL parameter is invalid

Check the LRM-specific documentation to determine what values are legal for the RSL extensions implemented by the LRM.

148

a required RSL attribute was not defined in the RSL spec

Check that the RSL executable attribute is present in your job request RSL. Check that the RSL restart attribute is present in your restart RSL.

149

the gass_cache attribute points to an invalid cache directory

Check that the RSL gass_cache attribute evaluates to a directory that exists or can be created. Check that the user’s home file system is writable and not full.

150

the provided RSL save_state parameter has an invalid value

Check that the RSL save_state attribute has a value of yes or no.

151

the job manager could not open the RSL attribute validation file

Check that $GLOBUS_LOCATION/share/globus_gram_job_manager/globus-gram-job-manager.rvf is present and readable on the job manager service node. Check that $GLOBUS_LOCATION/share/globus_gram_job_manager/LRM.rvf is readable on the job manager service node if present.

152

the job manager could not read the RSL attribute validation file

Check that $GLOBUS_LOCATION/share/globus_gram_job_manager/globus-gram-job-manager.rvf is valid. Check that $GLOBUS_LOCATION/share/globus_gram_job_manager/LRM.rvf is valid if present.

153

the provided RSL proxy_timeout is invalid

Check that the RSL proxy_timeout attribute evaluates to a positive integer.

154

the RSL proxy_timeout value is not greater than zero

Check that the RSL proxy_timeout attribute evaluates to a positive integer.

155

the job manager could not stage out a file

Check that the source file being staged exists on the job manager service node. Check that the directory of the destination file being staged exists on the file service node. Check that the directory of the destination file being staged is writable by the user. Check that the destination file service is reachable by the job manager service node.

156

the job contact string does not match any which the job manager is handling

Check that the job contact string matches one returned from a job request.

157

proxy delegation failed

Check that the job manager service node trusts the signer of your credential. Check that you trust the signer of the job manager service node’s credential.

158

the job manager could not lock the state lock file

Check that the file system holding the job state directory supports POSIX advisory locking. Check that the job state directory is writable by the user on the service node. Check that the job state directory is not full.

159

an invalid globus_io_clientattr_t was used.

Check that you have initialized the globus_io_clientattr_t attribute prior to using it with the GRAM client API.

160

an null parameter was passed to the gram library

Check that you are passing legal values to all GRAM API calls.

161

the job manager is still streaming output

OBSOLETE IN GRAM5

162

the authorization system denied the request

Ask your GRAM system administrator to authorize your certificate.

163

the authorization system reported a failure

Check with your system administrator to verify that the authorization system is configured properly.

164

the authorization system denied the request - invalid job id

Check with your system administrator to verify that the authorization system is configured properly. Use a credential which is authorized to interact with a particular GRAM job.

165

the authorization system denied the request - not authorized to run the specified executable

Check with your system administrator to verify that the authorization system is configured properly. Use a credential which is authorized to interact with a particular GRAM job.

166

the provided RSL user_name parameter is invalid.

Check that the RSL user_name attribute evaluates to a string.

167

the job is not running in the account named by the user_name parameter.

Ask the GRAM system administrator to add an authorization entry to allow your credential to run jobs as the specified user account.
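
Client scripts often need to turn one of the numeric codes above into a message and a suggested action. The following is a minimal Python sketch of such a lookup. It is not part of the GRAM5 tools: the names GRAM_ERRORS and describe_error are hypothetical, and only a few rows of the table are transcribed, so unknown codes are reported as missing rather than guessed.

```python
# Hypothetical helper for illustration; not part of the GRAM5 client tools.
# Each entry is (message, resolution). A resolution of None marks a code
# that the table above lists as obsolete in GRAM5.
GRAM_ERRORS = {
    110: ("the job manager is waiting for a commit signal",
          "Send a two-phase commit signal to the job manager to "
          "acknowledge receiving the job contact."),
    116: ("the provided RSL stdout_position parameter is invalid", None),
    131: ("the user proxy expired (job is still running)",
          "Generate a new proxy and then submit a restart request "
          "to monitor the job."),
    156: ("the job contact string does not match any which the job "
          "manager is handling",
          "Check that the job contact string matches one returned "
          "from a job request."),
}


def describe_error(code):
    """Return a one-line description for a GRAM error code."""
    entry = GRAM_ERRORS.get(code)
    if entry is None:
        # Only a subset of the table is transcribed in this sketch.
        return f"GRAM error {code}: not in this excerpt of the table"
    message, resolution = entry
    if resolution is None:
        return f"GRAM error {code}: {message} (obsolete in GRAM5)"
    return f"GRAM error {code}: {message}. Suggested action: {resolution}"


if __name__ == "__main__":
    for code in (110, 116, 131, 999):
        print(describe_error(code))
```

Keeping the resolution separate from the message lets a script log the message verbatim while showing the suggested action only to interactive users.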

Known Problems in GRAM5

  • GT-45: Manager lock double-locked

  • GT-47: globus-job-manager null pointer dereference for some call paths

  • GT-52: SEG may deadlock with threads

  • GT-56: Tear-down of object requires multiple threads

  • GT-103: GRAM refresh credentials test sometimes fails because job terminates

  • GT-292: Service tags may not isolate services completely

  • GT-324: Behaviour of globus-job-status

  • GT-369: GRAM5 skips some SEG events for PBS batch system

  • GT-389: globusrun and globus-job-run don’t report job failures to user

  • GT-418: globus-gatekeeper leaves stale processes behind if port 2119 is probed

Usage statistics collection by the Globus Alliance

GRAM5-specific usage statistics

The following usage statistics are sent by default in a UDP packet (in addition to the GRAM component code, packet version, timestamp, and source IP address) at the end of each job.

  • Job Manager Session ID

  • dryrun used

  • RSL Host Count

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_UNSUBMITTED

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_FILE_STAGE_IN

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_FAILED

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_FILE_STAGE_OUT

  • Timestamp when job hit GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE

  • Job Failure Code

  • Number of times status is called

  • Number of times register is called

  • Number of times signal is called

  • Number of times refresh is called

  • Number of files named in file_clean_up RSL

  • Number of files being staged in (including executable, stdin) from http servers

  • Number of files being staged in (including executable, stdin) from https servers

  • Number of files being staged in (including executable, stdin) from ftp servers

  • Number of files being staged in (including executable, stdin) from gsiftp servers

  • Number of files being staged into the GASS cache from http servers

  • Number of files being staged into the GASS cache from https servers

  • Number of files being staged into the GASS cache from ftp servers

  • Number of files being staged into the GASS cache from gsiftp servers

  • Number of files being staged out (including stdout and stderr) to http servers

  • Number of files being staged out (including stdout and stderr) to https servers

  • Number of files being staged out (including stdout and stderr) to ftp servers

  • Number of files being staged out (including stdout and stderr) to gsiftp servers

  • Bitmask of used RSL attributes (values are 2^id from the gram5_rsl_attributes table)

  • Number of times unregister is called

  • Value of the count RSL attribute

  • Comma-separated list of string names of other RSL attributes not in the set defined in globus-gram-job-manager.rvf

  • Job type string

  • Number of times the job was restarted

  • Total number of state callbacks sent to all clients for this job
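
The bitmask entry in the list above packs one bit per RSL attribute, with each attribute contributing 2^id. A minimal Python sketch of building and unpacking such a mask follows. The ids and names used here are placeholders chosen for illustration; the real ids come from the gram5_rsl_attributes table, which is not reproduced in this guide.

```python
# Placeholder ids and names for illustration only. The real values come
# from the gram5_rsl_attributes table and are NOT reproduced here.
EXAMPLE_ATTRIBUTE_NAMES = {0: "attr_a", 1: "attr_b", 2: "attr_c", 5: "attr_f"}


def encode_attributes(attribute_ids):
    """Combine attribute ids into one bitmask, one bit (2^id) per id."""
    mask = 0
    for attribute_id in attribute_ids:
        mask |= 1 << attribute_id
    return mask


def decode_attributes(mask):
    """Return the sorted list of attribute ids whose bits are set in mask."""
    attribute_ids = []
    bit = 0
    while mask:
        if mask & 1:
            attribute_ids.append(bit)
        mask >>= 1
        bit += 1
    return attribute_ids


if __name__ == "__main__":
    mask = encode_attributes([0, 2, 5])
    print(mask)
    for attribute_id in decode_attributes(mask):
        # Fall back to the numeric id when the placeholder table has no name.
        print(EXAMPLE_ATTRIBUTE_NAMES.get(attribute_id, str(attribute_id)))
```

Because each attribute occupies its own bit, the mask records which attributes appeared in a job's RSL but not how many times or with what values.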

The following information can also be sent in a job status packet, but only if the system administrator explicitly enables it:

  • Value of the executable RSL attribute

  • Value of the arguments RSL attribute

  • IP address and port of the client that submitted the job

  • User DN of the client that submitted the job

In addition to job-related status, the job manager periodically sends information about its own execution status. The following information is sent by default in a UDP packet (in addition to the GRAM component code, packet version, timestamp, and source IP address) at job manager start and every hour during the job manager lifetime:

  • Job Manager Start Time

  • Job Manager Session ID

  • Job Manager Status Time

  • Job Manager Version

  • LRM

  • Poll used

  • Audit used

  • Number of restarted jobs

  • Total number of jobs

  • Total number of failed jobs

  • Total number of canceled jobs

  • Total number of completed jobs

  • Total number of dry-run jobs

  • Peak number of concurrently managed jobs

  • Number of jobs currently being managed

  • Number of jobs currently in the UNSUBMITTED state

  • Number of jobs currently in the STAGE_IN state

  • Number of jobs currently in the PENDING state

  • Number of jobs currently in the ACTIVE state

  • Number of jobs currently in the STAGE_OUT state

  • Number of jobs currently in the FAILED state

  • Number of jobs currently in the DONE state

Also, please see our policy statement on the collection of usage statistics.
