Admin Guide for PBSPro

  $Revision: 1.2 $

This document outlines everything needed to deploy the SPRUCE urgent computing environment on your local computing resource. All the software plug-ins and installation instructions are provided, along with notes on upcoming improvements. Please note that this guide is written for sites using PBSPro for resource management. If you are looking for other jobmanager/scheduler combinations, refer to our software page.

Table of Contents

  • Introduction
  • Limitations and Prerequisites
  • System Preparation
  • Priority Queue
  • SPRUCE_ROOT
  • Installation
  • Configuration Parameters
  • Building Spruce Components
  • Building Examples
  • SPRUCE Components
  • Final Installation
  • Testing the Deployment
  • Troubleshooting
Introduction

Special PRiority Urgent Computing Environment (SPRUCE) is software that supports urgent computing for distributed, large-scale high-performance computers. Government agencies such as NSF, DOE, NIH, and NASA have invested hundreds of millions of dollars in high performance computing centers. Often, those computers are linked via distributed computing environments, or Grids. Leveraging that infrastructure, SPRUCE provides a set of tools, interfaces, and policies to support urgent computing.

Applications that need urgent computing have deadlines --- late results are useless results. Knowing that a tornado will rip through a town is only useful if it can be determined while there is still time for action. Similarly, knowing how a storm will affect flood waters is important for both preparation activities and rescue plans. Finding fast, available compute cycles is critical.

The current configuration of SPRUCE works for Grid environments using Globus as well as local-submission jobs using the command-line. The instructions below will need modifications based on your local site deployment configuration for Globus and job queues. Please feel free to ask for help if you run into problems configuring SPRUCE for your local environment.

If you do not run Globus at all, please write to us and we will provide you with a completely Globus-independent distribution.


Limitations and Prerequisites

The current version has been tested and deployed for the configuration shown below. Other configurations may require additional tweaks to configuration variables or scripts.

  • Globus support extends to the GT 3.9.x and 4.0.x pre-WS versions. The older GT 2.4.x is not supported, nor is the web services (WS) version (we are working on this).
  • The current version of the SPRUCE software works with PBS Pro 7.1 or higher. One component, the 'submit-filter', is designed to validate job submissions before they are placed in the queue. Since PBS Pro does not have this feature yet, the filter script needs to be adapted into a job-submission wrapper suitable for your resource.
  • SPRUCE requires that a high-priority queue be enabled. Urgent computing jobs are funnelled into that queue after validation by SPRUCE. SPRUCE cannot be used effectively without a special high-priority queue.
  • You must have root permission for some of the tasks, including creating the new high-priority queue, installing the new Globus job manager, and installing the submit-filter. The install script assumes that a 'globus' account exists; for safety, installed files and scripts will be owned by user 'globus'.

If any of the above conditions do not hold true in your case, please contact the SPRUCE Team for further information.


System Preparation

Before configuring and building the software, two things must be set up: a high-priority job queue and a file area, SPRUCE_ROOT, for all the supporting scripts and tools.

The system is set up for the default behavior of 'next-to-run' status. If you want to extend the policies, please contact us to collaborate on this front.


Priority Queue

The first step is to create a new priority queue called spruce. Urgent computing jobs will be automatically routed to this queue by SPRUCE. Once created, the queue must be configured to be operational: at a minimum, this means enabling the queue and setting default nodes and walltimes. Other configuration attributes depend on local site settings.
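
As a sketch only (queue attributes and default values vary by site; the values below are placeholders to adapt), the queue could be created and configured with qmgr:

> qmgr -c "create queue spruce queue_type = execution"
> qmgr -c "set queue spruce enabled = true"
> qmgr -c "set queue spruce started = true"
> qmgr -c "set queue spruce resources_default.nodes = 1"
> qmgr -c "set queue spruce resources_default.walltime = 00:30:00"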

With the queue created, we can now set up the scheduler to give it the needed priority. The PRIORITY field must be set to a value larger than that of any of your other queues so that spruce becomes the highest-priority queue. After editing the configuration, restart the scheduler.
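
For example (the priority value is an assumption, and the restart command depends on how PBS Pro is installed at your site):

> qmgr -c "set queue spruce Priority = 150"
> /etc/init.d/pbs restart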


SPRUCE_ROOT

SPRUCE has binaries and shared scripts which must be accessible to users as well as system components. We put those components in a generic spruce directory. To make installation easier, we will assume a single source-code and install directory called SPRUCE_ROOT. You can put this file space anywhere on the front-end, such as /usr/local. For TeraGrid sites, it is convenient to use the "community space" and set SPRUCE_ROOT to $TG_COMMUNITY/spruce. As root, create the spruce directory that will serve as both the compile/build area and the bin area for the final scripts. The example below uses the TeraGrid community directory:

> mkdir $TG_COMMUNITY/spruce
> chmod 775 $TG_COMMUNITY/spruce



Installation

With the preceding steps complete, you are ready to configure, compile, and install the SPRUCE code. If you have not already downloaded the source, download version 2.0 from: http://spruce.uchicago.edu/download/spruce-rp.pbspro.v2.tar.bz2

Save the file into your SPRUCE_ROOT area. You should now have a file called SPRUCE_ROOT/spruce-rp.pbspro.v2.tar.bz2. The commands below extract the package contents.

> cd SPRUCE_ROOT
> bzcat spruce-rp.pbspro.v2.tar.bz2 | tar xf -
> cd spruce-rp


Look through the distribution. You will find the following pieces:

build/
The directory where build targets will be copied
docs/
Documentation
src/
Source code
src/config.parameters
All the local configuration settings that need to be set before building examples and scripts
src/resource_provider/
The scripts and code that must be compiled for spruce resources
src/examples/
Examples and tests for submitting and running spruce jobs.


Configuration Parameters

The next step is to configure the software before compilation. We use a very simple configuration system: a single configuration file whose variables are used to transform the *.in files. Edit the file src/config.parameters to prepare for compilation.

The fields of the configuration file are shown below.

# SPRUCE CONFIGURATION FILE:
#

# Please provide the paths below using the backslash (\) escape
# character. For example, a path of the form
# /foo/bar would be entered here as \/foo\/bar


# SPRUCE has a special "Job Manager" to handle incoming Globus-SPRUCE
# requests.  GT 3.9.x or GT 4.0.x is required.  Since many sites have
# multiple Globus versions installed, we ask here that you specify the
# path to the Globus used by SPRUCE.  If you only have one Globus
# installed, the value of $GLOBUS_LOCATION would be the right choice.

SPRUCE_GLOBUS=\/soft\/prews-gram-4.0.1-r3

# SPRUCE_ROOT path totally expanded, no environment variable shortcuts

SPRUCE_ROOT=\/soft\/community\/spruce

# Spruce job manager contact information; this varies by site
# and has to be modified to fit your local configuration.
# The example below shows the setup for the UC/ANL resources.

SPRUCE_JOB_MANAGER=tg-grid1.uc.teragrid.org

# This is the default resource property name used when making
# tokens, if you are not restricting access based on internal
# resource properties. Please note this is not a restriction based
# on resources, but rather on resource properties within a given resource.

DEFAULT_RESOURCE=ia64-compute
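
The backslash escaping above is needed because the build presumably substitutes these values with sed-style s/old/new/ expressions, in which '/' is the delimiter. A purely hypothetical illustration (the actual Makefile rule and marker syntax may differ):

> sed -e "s/SPRUCE_ROOT/\/soft\/community\/spruce/g" spruce_sub.in > spruce_sub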


NOTE: For the remainder of this document, we will use SPRUCE_ROOT and SPRUCE_GLOBUS to mean the values you set in the configuration file (without the escape characters). These do not need to be set as environment variables ($SPRUCE_ROOT); they are simply text placeholders in this documentation for your local config paths. So if you see 'cd SPRUCE_ROOT/bin', you will know that may mean 'cd $TG_COMMUNITY/spruce/bin' on your system.


Building Spruce Components

After editing the configuration file, you may begin the two-step build process. The first step is to build the spruce.pm.in and spruce.rvf files, the site-specific files that depend on your local Globus installation and hook into your resource manager and job scheduler; some manual editing is required here. The second step compiles the code and builds the generic scripts.

We will start by making a duplicate of the current job-manager perl module for queued submissions and modifying it to create a spruce job manager. Globus allows you to have multiple job managers, so you won't need to change anything for your existing job managers. Instead, we simply copy what works for your system and add in the SPRUCE patches.

Move to the src/resource_provider/ directory and copy the globus job-manager perl module that your system uses for queued submissions. For sites with PBSPro, it is likely the file pbs.pm. The copied file should be called spruce.pm.in. An example is provided below.

> cd SPRUCE_ROOT/spruce-rp/src/resource_provider
> cp SPRUCE_GLOBUS/lib/perl/Globus/GRAM/JobManager/pbs.pm spruce.pm.in


The hard part is editing the file spruce.pm.in to add the SPRUCE patches. They can be found by inspecting spruce.pm.uc-example, located in the same directory. We suggest opening the two files side by side. Because almost all PBS job managers are very similar, applying the small patches is actually easier than it may seem.

All of the patches are very clearly marked in the file.

# In spruce.pm.uc-example, patches are wrapped with:

############################################################## 
# [Patch description] : SPRUCE PATCH BEGIN [Patch number] 
############################################################## 
 
The patch code

############################################################## 
# [Patch description] : SPRUCE PATCH END [Patch number]
##############################################################


Since the .pm files under this directory are customized to suit site requirements, we cannot provide a generic solution; adding the patches in the right places should suffice. There are currently four patches.

Some sites may experience an NFS file-handle sync problem, which is unrelated to SPRUCE but can cause difficulties. You may want to include the NFS PROBLEM WORKAROUND in your spruce.pm as an additional, optional patch. In spruce.pm.uc-example, this code snippet is located between the following tags:


####################################################   
#  JP NAVARRO:  NFS PROBLEM WORKAROUND CODE BEGIN 
####################################################   

The patch code


#################################################### 
#  JP NAVARRO:  NFS PROBLEM WORKAROUND CODE END  
####################################################


With spruce.pm.in complete, the next step is to build the spruce.rvf file. Start by making a duplicate of the current job manager's resource verification file (rvf). You won't need to change anything for your existing job managers; we simply copy what works for your system and add in the SPRUCE patches.

For sites with PBSPro, it is likely the file pbs.rvf. The copied file should be called spruce.rvf. An example is provided below.

> cd SPRUCE_ROOT/spruce-rp/src/resource_provider
> cp SPRUCE_GLOBUS/share/globus_gram_job_manager/pbs.rvf spruce.rvf


Now add the content of the file spruce.rvf.in into this file as appropriate. This file allows an additional RSL parameter called 'urgency' to be used.

Content to be edited into the spruce.rvf file:


Attribute: urgency
Description: "Indicates urgent computations and has different 
levels specified for resolving among conflicting on demand jobs"
Values: yellow orange red
ValidWhen: GLOBUS_GRAM_JOB_SUBMIT
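
If the attribute block can simply go at the end of the file, appending it may be all that is needed:

> cat spruce.rvf.in >> spruce.rvf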


After you have edited the patches into spruce.pm.in and spruce.rvf, you are ready to build the remainder of the code.

Type make at the command prompt. Do not worry: nothing will be installed to system areas. All built files are copied to the spruce build directory, and no actions outside the spruce directories are performed. You will be able to inspect the build directory before copying files to the live areas. Example output from the compile step is shown below.

> cd SPRUCE_ROOT/spruce-rp/src/resource_provider
> make
Built token_authentication_ws.class
Built token_authentication
Built spruce_sub
Built submitfilter
Built spruce.pm
Built jobmanager-spruce
Built perl.config
Copied all files to build/resource_provider directory
Build process for resource_provider successful!  
 


Your build directory should now contain the following files:

> ls ../../build/resource_provider
spruce.pm          
spruce_sub
token_authentication_ws.class
token_authentication_ws.sh	
jobmanager-spruce  
spruce.rvf        
token_authentication  
submitfilter 
perl.config      
install-spruce
lib (directory with 14 jar files)


The build process for the resource_provider files is now complete. Proceed to building the examples to test the system.


Building Examples

Move to the examples/ subdirectory and build the examples. Once again, no system files will be modified; the results are placed in your build directory. Example output is shown below.

> cd SPRUCE_ROOT/spruce-rp/src/examples
> make
Built globus_test.rsl
Built qsub_test.pbs
Built qsub_test_spruce.pbs
Built spruce_test.pbs
Built helloworld MPI program
Copied all files to build/examples directory
Build process for examples completed successfully


Your build directory should now contain the following files.

> ls ../../build/examples
globus_test.rsl  
mpihello  
qsub_test.pbs  
qsub_test_spruce.pbs  
spruce_test.pbs


The examples are now built. Next, the files in the build directory can be installed to their system locations.


SPRUCE Components

With the build process for resource_provider complete, you can look over the components. SPRUCE_ROOT/spruce-rp/build/resource_provider consists of the following pieces:

spruce.rvf
The SPRUCE resource verification file. It is used internally by Globus components for incoming SPRUCE jobs.
spruce.pm
This file plugs into Globus and handles all SPRUCE jobs. Jobs submitted to Globus with a special SPRUCE RSL script are passed to spruce.pm where the parameters are parsed. It builds a submission script for the local queue system and then submits the job to the high-priority queue.
spruce_sub
A wrapper for qsub that can process urgent job submissions. It takes spruce-specific parameters and sets up the job script. Users cannot submit emergency computations directly via qsub.
token_authentication_ws.sh
This script checks with the Spruce Portal using Web Services to determine if the user has a valid, activated Urgent Computing Token.
token_authentication
Executable that performs encode/decode for queries sent to the SPRUCE Portal.
token_authentication_ws.class
This is the Java class file which performs the actual Web Services call.
lib
The lib directory containing all the AXIS2 Web Services jar files used for the token check.
submitfilter
This filter performs the SPRUCE token check and prevents users from submitting directly to the high-priority spruce queue. Submitted jobs are passed through the filter for validation; job scripts without an active token carrying the necessary permissions are rejected. The filter is built to use a special Torque feature, the 'submit filter'. Since PBSPro does not have a submit-filter or equivalent feature yet, you will have to modify it to work as a 'qsub' wrapper.
jobmanager-spruce
This is the job manager contact file for the newly built spruce.pm. It is simply a copy of the existing SPRUCE_GLOBUS/etc/grid-services/jobmanager-pbs with the name pbs changed to spruce.

Final Installation

The final installation has two parts. The first part is handled by a simple script. Go to SPRUCE_ROOT/spruce-rp/build/resource_provider. The script install-spruce copies all of the components into their respective places and fixes up the permissions on directories and files; it should be run as root. The components added to Globus will be owned by user 'globus' instead of root. The table below shows where the files are copied.

FILE                           DESTINATION                                    OWNER
jobmanager-spruce              SPRUCE_GLOBUS/etc/grid-services                globus
spruce.rvf                     SPRUCE_GLOBUS/share/globus_gram_job_manager    globus
spruce.pm                      SPRUCE_GLOBUS/lib/perl/Globus/GRAM/JobManager  globus
token_authentication           SPRUCE_ROOT/bin                                globus
token_authentication_ws.sh     SPRUCE_ROOT/bin                                globus
token_authentication_ws.class  SPRUCE_ROOT/bin                                globus
lib directory                  SPRUCE_ROOT/bin                                globus
spruce_sub                     SPRUCE_ROOT/bin                                globus



Run ./install-spruce to install the components above.
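
For example, as root:

> cd SPRUCE_ROOT/spruce-rp/build/resource_provider
> ./install-spruce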

The final piece is the submit filter, which must be modified to work as a qsub wrapper under PBSPro. This integration depends on your local cluster setup; please contact us if you need help integrating it into your system.
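
As a starting point, here is a minimal, hypothetical sketch of such a wrapper. It assumes the filter keeps Torque's submit-filter convention (job script on stdin, filtered script on stdout, nonzero exit status rejects); every path and name below is a placeholder to adapt:

#!/bin/sh
# Hypothetical qsub wrapper -- adapt to your site before use.
SPRUCE_ROOT=/soft/community/spruce       # placeholder install path
REAL_QSUB=/usr/pbs/bin/qsub.orig         # renamed vendor qsub (placeholder)

script="$1"                              # job script given as the argument
tmp=`mktemp` || exit 1

# Pass the script through the SPRUCE submitfilter for validation.
if "$SPRUCE_ROOT/bin/submitfilter" < "$script" > "$tmp"; then
    "$REAL_QSUB" "$tmp"; status=$?
else
    status=1                             # filter rejected the submission
fi
rm -f "$tmp"
exit $status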

Your installation is complete! The next section helps you test out the whole deployment to make sure everything is in place.


Testing the Deployment

Once the build process for the examples is complete, check the directory build/examples for the following files.

  • globus_test.rsl
  • qsub_test.pbs
  • qsub_test_spruce.pbs
  • spruce_test.pbs
  • mpihello

The .rsl and .pbs files are job submission scripts, and mpihello is a simple MPI Hello World program for testing the system. The example runs below let the admin test the functionality of the deployment. Several use cases are listed, with sample output provided for verification.

Normal 'qsub'

Since a submit-filter has been installed, this test checks whether any of the normal qsub functionality has been broken.

qsub_test.pbs is a simple PBS script with no reference to the priority queue.

NOTE: You may have to modify the resource name in all examples if ia64-compute is not supported at your site.

#!/bin/csh
#PBS -N qsub_job 
#PBS -l nodes=4:ia64-compute
#PBS -l walltime=0:10:00
#PBS -o qsub_out 
#PBS -e qsub_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello


Try a qsub submission to the default queue and verify the output.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> qsub qsub_test.pbs
<job-number>.<tg-master.some-string-here>

> qstat
                                                                   
                                                  Req'd  Req'd   Elap
Job ID Username Queue Jobname    SessID NDS   TSK Memory Time  S Time
---------------------------------------------------------------------   

## other job contents here if any

job-no uname default-q qsub_job  --      4  --    --  00:10 R   -- 

## other job contents here if any


The job should be submitted as any other qsub job to the default queue.

'qsub' to SPRUCE queue

Now, test the functionality of the submit-filter for a qsub request asking for access to the spruce queue. Any direct request of this form has to be rejected, since no token validation check has been performed. (The check is available only for Globus submissions carrying the urgency parameter, or for command-line submissions using the qsub wrapper spruce_sub.)

qsub_test_spruce.pbs is similar to the above script, but with the queue set to spruce.

#!/bin/csh
#PBS -q spruce
#PBS -N qsub_job 
#PBS -l nodes=4:ia64-compute
#PBS -l walltime=0:10:00
#PBS -o qsub_out 
#PBS -e qsub_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello


Try a qsub submission to the spruce queue and verify the output.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> qsub qsub_test_spruce.pbs

Cannot submit directly to 'spruce' queue.
Please use either 'spruce_sub' or Globus urgent job submission interface.
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice. 
>


The job should be rejected, with output indicating that the submit-filter blocked it.

'spruce_sub'

This example tests the functionality of the qsub wrapper provided by SPRUCE, named spruce_sub. The wrapper performs a token validation check for the user submitting the job and aborts the submission if no valid token is registered in their name at the portal. If a token is present, it takes the appropriate action based on the specified urgency level and the policy pertaining to it. (In this case we will try a red-level run, which submits to the spruce queue if a token is present.) The same example script can be used to check both acceptance and rejection.

A test token, to be entered at the SPRUCE Portal, is required to test this functionality. Please contact us to have one emailed for testing purposes; it has a lifetime of 24 hours once activated.

Please refer to the Users' Guide for information on how to use the token.

spruce_test.pbs is a general PBS job script with no reference to the spruce queue.

#!/bin/csh
#PBS -N spruce_job 
#PBS -l nodes=4:ia64-compute
#PBS -l walltime=0:10:00
#PBS -o spruce_out 
#PBS -e spruce_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello


Submit a job using the spruce_sub command; the usage is shown below:

> SPRUCE_ROOT/bin/spruce_sub

Usage: 
spruce_sub [urgency=yellow|orange|red] full_path_name_of_pbs_job_script
Currently all 3 levels submit to spruce queue with next in line priority.
Please note that only a PBS script (with full path included) can be
passed in; command line args are not currently supported.


The first run is without a valid token.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub urgency=red  spruce_test.pbs

Spruce token was invalid, aborting job submission.
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.  
>


The job should be rejected, indicating that there was no valid token for the user.

Now, go to the SPRUCE Token Management Portal, activate the test token, and add yourself as a valid user. Repeat the above run.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub urgency=red  spruce_test.pbs 
<job-number>.<tg-master.some-string-here>

> qstat

                                                 Req'd  Req'd   Elap
Job ID Username  Queue  Jobname  SessID NDS  TSK Memory Time  S Time
--------------------------------------------------------------------

## other job contents here if any

job-no uname  spruce spruce_job --   4  --    --  00:10 R   --

## other job contents here if any


The job should be submitted to the spruce queue, since the user now has a valid token.

Globus test

For the Globus test, the system currently supports only globusrun submissions. An RSL script with an additional parameter indicating the urgency level (analogous to the spruce_sub syntax) is given as input, and the same validation as for spruce_sub is performed.

The test flow is similar to the above, reusing the same token.

globus_test.rsl is an RSL script containing the right contact string for jobmanager-spruce.

+
(&
(resourceManagerContact = spruce-jm-contact/jobmanager-spruce)
(rsl_substitution = (HOMEDIR "SPRUCE_ROOT/spruce-rp/build/examples"))
(executable = $(HOMEDIR)/mpihello)
(jobType = mpi)
(host_types = ia64-compute)
(host_xcount = 4)
(urgency = red)
(stdout = $(HOMEDIR)/globus_stdout)
(stderr = $(HOMEDIR)/globus_stderr)
)


NOTE: Since the submit-filter is a global installation, it is also useful to replace the jobmanager-spruce contact with your local default jobmanager-pbs (or similar) and check that normal job submissions still work. Once the proxy is initialized, a submission with the default job manager should be successful.
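
For instance, in a copy of globus_test.rsl you might change the contact string as below and also remove the urgency line, since the 'urgency' attribute is defined only in spruce.rvf (the host name here is a placeholder):

(resourceManagerContact = your-host.example.org/jobmanager-pbs)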

First, go to the SPRUCE Token Management Portal and deactivate the token in your name. Then make sure you have a grid proxy initialized.

> grid-proxy-init
Your identity: your-DN-here 
Enter GRID pass phrase for this identity:
Creating proxy ........................................ Done
Your proxy is valid until: some-date-and-time


The first run is without a valid token.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> globusrun -o -f globus_test.rsl


The job does not run, and the command prompt returns almost immediately. A GRAM log file is written to your home directory, which you can grep to find the following line.

> cd ~/
> more gram_job_mgr_some-number.log
## lots of log messages

date-and-time-of-log JMI: while return_buf = No Valid Token found for 
user = your-uname, aborting urgent job submission 

## more log messages
>


Now, go to the SPRUCE Token Management Portal, log in with the test token, and add yourself as a valid user. Repeat the above run.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> globusrun -o -f globus_test.rsl


The job is submitted and the command prompt waits for completion, so check the queue status in another terminal window:

> qstat
                                                  Req'd  Req'd   Elap
Job ID Username Queue Jobname    SessID NDS   TSK Memory Time  S Time
------------------------------------------------------------------------

## other job contents here if any

job-no uname  spruce STDIN      --      4  --    --  00:10 R   --

## other job contents here if any



If all the above components show the right behavior, then the SPRUCE deployment is successful!

Congratulations, you have joined the Urgent Computing community!

Once everything is in place, you can run make distclean in SPRUCE_ROOT/spruce-rp/src to clear out the additional files created in both the build and src directories. The spruce.pm.in you created will, however, simply be moved to spruce.pm.in.hand-edits, and your config file changes will also be retained.
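
For example:

> cd SPRUCE_ROOT/spruce-rp/src
> make distclean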


Troubleshooting

Please contact the SPRUCE Team with any questions or problems using the SPRUCE software.
