
Admin Guide for Torque/Moab without Globus

  $Revision: 1.2 $

This document outlines everything needed to deploy the SPRUCE urgent computing environment on your local computing resource. All the software plug-ins and installation instructions are provided, along with insight on upcoming improvements. Please note that this guide is written for sites using Torque/Moab for resource management. If you are looking for other jobmanager/scheduler combinations, refer to our software page.

Table of Contents


Introduction

Special PRiority Urgent Computing Environment (SPRUCE) is software that supports urgent computing for distributed, large-scale high-performance computers. Government agencies such as NSF, DOE, NIH, and NASA have invested hundreds of millions of dollars in high performance computing centers. Often, those computers are linked via distributed computing environments, or Grids. Leveraging that infrastructure, SPRUCE provides a set of tools, interfaces, and policies to support urgent computing.

Applications that need urgent computing have deadlines: late results are useless results. Knowing that a tornado will rip through a town is only useful if it can be determined while there is still time to act. Similarly, knowing how a storm will affect flood waters is important for both preparation activities and rescue plans. Finding fast, available compute cycles is critical.

The current configuration of SPRUCE works for Grid environments that use only local job-submission tools. The instructions below will need modification based on your local site deployment configuration and job queues. Please feel free to ask for help if you run into problems configuring SPRUCE for your local environment.

[TOC]

Limitations and Prerequisites

The current version has been tested and deployed for the configuration shown below. Other configurations may require additional tweaks to configuration variables or scripts.

  • The current version of the SPRUCE software works with the Torque 2.0/Moab 4.0.x scheduler or higher. One component, the 'submit-filter', is designed explicitly to use a feature of Torque/Moab that validates job submissions before they are placed in the queue. To use an earlier version of Torque, you may have to install a job submission wrapper or hook in a similar feature.
  • SPRUCE requires that a high-priority queue be enabled. Urgent computing jobs are funneled into that queue after validation by SPRUCE. SPRUCE cannot be used effectively without a special high-priority queue.
  • You must have root permission for some of the tasks, including creating the new high-priority queue and installing the submit-filter.

If any of the above conditions do not hold true in your case, please contact the Spruce Team for further information.

[TOC]

System Preparation

Before configuring and building the software, two things must be set up: a high-priority job queue and a file area, SPRUCE_ROOT, for all the supporting scripts and tools.

The system is set up for the default behavior of 'next-to-run' status. If you want to extend the policies, please contact us to collaborate on this front.

[TOC]

Priority Queue

The first step is to create a new priority queue called spruce. Urgent computing jobs will be automatically routed to this queue by SPRUCE. For Torque, queue creation is done with the qmgr command (see [QMGR] for more details).

Once created, the queue must be configured to be operational. At a minimum, this includes setting the options started and enabled, along with resources_default.nodes and resources_default.walltime. Other configuration attributes depend on local site settings. (See [QCFG] for more information.) The following commands set up the queue for an ia64-compute resource with a default wall time of 15 minutes. Replace 'ia64-compute' in the commands below with the name of your resource class and '00:15:00' with a default wall time of your choice.

> qmgr -c "create queue spruce queue_type=execution"
> qmgr -c "set queue spruce started=true"
> qmgr -c "set queue spruce enabled=true"
> qmgr -c "set queue spruce resources_default.nodes=1:ia64-compute"
> qmgr -c "set queue spruce resources_default.walltime=00:15:00"
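
Before moving on, it is worth confirming that the queue was created with the intended attributes. A quick check (a session sketch, assuming Torque's client tools are available on this host) looks like:

```
> qmgr -c "list queue spruce"    # shows the attributes set above
> qstat -Q spruce                # queue summary; Ena and Str should be yes
```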


With the queue created, we can now configure the scheduler to grant the needed priority. For Moab, add the lines below to the configuration file moab.cfg:

CLASSCFG[spruce]            QDEF=ondemand
QOSCFG[ondemand]            PRIORITY=20


The PRIORITY field must be set to a value larger than that of any of your other queues so that spruce becomes the highest-priority queue. In this example, the default queues have priority 10, while the spruce queue is set to priority 20. After editing the Moab configuration, restart the scheduler.
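
How you restart depends on how Moab is run at your site; on many installations the scheduler can be recycled and the QoS settings inspected from the command line (a session sketch, assuming Moab's client tools are installed):

```
> mschedctl -R    # recycle (restart) the Moab server
> mdiag -q        # list QoS settings; the ondemand entry should show PRIORITY=20
```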

[TOC]

SPRUCE_ROOT

SPRUCE has binaries and shared scripts which must be accessible to users as well as system components. We put those components in a generic spruce directory. To make installation easier, we will assume a single source code and install directory called SPRUCE_ROOT. You can put this file space anywhere on the front-end, such as /usr/local. For TeraGrid sites, it is convenient to use the "community space" ($TG_COMMUNITY); such sites could set SPRUCE_ROOT to $TG_COMMUNITY/spruce. As root, create the spruce directory that will serve as both the compile/build area and the bin area for the final scripts. The example below uses the TeraGrid community directory:

> mkdir $TG_COMMUNITY/spruce
> chmod 775 $TG_COMMUNITY/spruce
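
For a site without a $TG_COMMUNITY area, the same setup under /usr/local (an illustrative choice; any front-end path works) would be:

```
> mkdir /usr/local/spruce
> chmod 775 /usr/local/spruce
```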


[TOC]

Installation

With the preceding steps complete, you are ready to configure, compile, and install the SPRUCE code. If you have not already downloaded the source, download version 2.0 from: http://spruce.uchicago.edu/download/spruce-rp.torque-moab-ng.v2.tar.bz2

Save the file into your SPRUCE_ROOT area. You should now have a file called SPRUCE_ROOT/spruce-rp.torque-moab-ng.v2.tar.bz2. The command below shows how to extract the package contents.

> cd SPRUCE_ROOT
> bzcat spruce-rp.torque-moab-ng.v2.tar.bz2 | tar xf -
> cd spruce-rp


Look through the distribution. You will find the following pieces:

build/
The directory where build targets will be copied
docs/
Documentation
src/
Source code
src/config.parameters
All the local configuration settings that need to be set before building examples and scripts
src/resource_provider/
The scripts and code that must be compiled for spruce resources
src/examples/
Examples and tests for submitting and running spruce jobs.

[TOC]

Configuration Parameters

The next step is to configure the software before compilation. We use a very simple configuration system. A single configuration file uses simple variables to transform *.in files. Edit the file src/config.parameters to prepare for compilation.

The fields of the configuration file are shown below.

# SPRUCE CONFIGURATION FILE:
#

# Please  provide the paths below with the escape
# character of backslash (\). For example, a path of the form -
# /foo/bar would be input here as \/foo\/bar


# SPRUCE_ROOT path totally expanded, no environment variable shortcuts

SPRUCE_ROOT=\/soft\/community\/spruce

# This is the default resource property name used when making
# tokens, if you are not restricting access based on internal
# resource properties. Please note this is not a restriction based
# on the resource itself, but rather on resource properties of a
# given resource.

DEFAULT_RESOURCE=systemx


NOTE: For the remainder of this document, we will use SPRUCE_ROOT to mean the value you set in the configuration file (without the escape characters). It does not need to be set as an environment variable ($SPRUCE_ROOT); it is simply a text placeholder in this documentation for your local path. So if you see 'cd SPRUCE_ROOT/bin', that may mean 'cd $TG_COMMUNITY/spruce/bin' on your system.
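
Escaping every slash by hand is error-prone. A small helper along these lines (a convenience sketch, not part of the SPRUCE distribution) produces the escaped form for you:

```shell
# Convert a plain path into the backslash-escaped form expected by
# src/config.parameters (convenience sketch, not shipped with SPRUCE).
path="/soft/community/spruce"
escaped=$(printf '%s' "$path" | sed 's/\//\\\//g')
echo "SPRUCE_ROOT=$escaped"   # prints SPRUCE_ROOT=\/soft\/community\/spruce
```

Paste the printed line directly into src/config.parameters.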

[TOC]

Building Spruce Components

After editing the configuration file, you may begin the build process.

Move to the src/resource_provider/ directory and type make at the command prompt. Do not worry: no files will be installed to any system areas. Instead, all files built will be copied to the spruce build directory; no actions outside the spruce directories are performed. You will be able to inspect the build directory before copying files to the live areas. Example output from the compile step is shown below.

> cd SPRUCE_ROOT/spruce-rp/src/resource_provider
> make
Built token_authentication_ws.class
Built token_authentication
Built spruce_sub
Built torque_submitfilter
Built perl.config
Copied all files to build/resource_provider directory
Build process for resource_provider successful!  
 


Your build directory should now contain the following files:

> ls ../../build/resource_provider
spruce_sub
token_authentication_ws.class
token_authentication_ws.sh	
token_authentication  
torque_submitfilter 
perl.config      
install-spruce
lib (directory with 14 jar files)


The build process for the resource_provider files is now complete. Proceed to building the examples to test the system.

[TOC]

Building Examples

Move to the examples/ subdirectory and build the examples. Once again, don't worry, no system files will be modified. The results will be installed in your build directory. An example of the output is shown below.

> cd SPRUCE_ROOT/spruce-rp/src/examples
> make
Built qsub_test.pbs
Built qsub_test_spruce.pbs
Built spruce_test.pbs
Built helloworld MPI program
Copied all files to build/examples directory
Build process for examples completed successfully


Your build directory should now contain the following files.

> ls ../../build/examples
mpihello  
qsub_test.pbs  
qsub_test_spruce.pbs  
spruce_test.pbs


The build process for the example files is now complete. Next, the files in the build directory can be installed to system locations.

[TOC]

SPRUCE Components

With the build process for resource_provider complete, you can look over the components. SPRUCE_ROOT/spruce-rp/build/resource_provider consists of the following pieces:

spruce_sub
A wrapper for qsub that can process urgent job submissions. It takes spruce-specific parameters and sets up the job script. Users cannot submit emergency computations directly via qsub.
token_authentication_ws.sh
This script checks with the Spruce Portal using Web Services to determine if the user has a valid, activated Urgent Computing Token.
token_authentication
Executable that performs encode/decode for queries sent to the Spruce Portal.
token_authentication_ws.class
This is the java class file which performs the actual Web Services call.
lib
This is the lib directory consisting of all the AXIS2 Web Services jar files to aid the token check.
torque_submitfilter
This filter performs the SPRUCE token check and prevents users from submitting directly to the high-priority spruce queue. It is built on a special Torque feature, the 'submit filter': jobs submitted to Torque are passed through the submit filter for validation, and job scripts without active tokens carrying the necessary permissions are rejected.
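
The submit-filter contract is worth understanding before installing the file. The sketch below is an illustration only, not the shipped torque_submitfilter (which additionally performs the token check): qsub pipes the job script to the filter on stdin, the filter writes the (possibly modified) script to stdout, and a non-zero exit rejects the submission. The body is wrapped in a hypothetical submit_filter function here so it can be exercised inline.

```shell
# Sketch of the Torque submit-filter contract (illustration only; the real
# torque_submitfilter also performs the SPRUCE token check).
submit_filter() {
    while IFS= read -r line; do
        case "$line" in
            "#PBS -q spruce"*)
                # A direct request for the spruce queue is refused.
                echo "Cannot submit directly to 'spruce' queue." >&2
                return 1
                ;;
        esac
        # Accepted lines are passed through to stdout unchanged.
        printf '%s\n' "$line"
    done
    return 0
}

# An ordinary script passes through unchanged.
printf '#PBS -N qsub_job\n' | submit_filter

# A direct request for the spruce queue exits non-zero.
printf '#PBS -q spruce\n' | submit_filter 2>/dev/null || echo "rejected"
```

The filter installed in the next section follows this same contract; the token check simply adds another rejection path.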

Final Installation

The final installation has two parts. The first part is handled by a simple script. Go to SPRUCE_ROOT/spruce-rp/build/resource_provider. The script called install-spruce will copy all of the components into their respective places and fix up the permissions on directories and files. It should be run as root. The table below shows where the files will be copied.

FILE                           DESTINATION       OWNER
token_authentication           SPRUCE_ROOT/bin   root
token_authentication_ws.sh     SPRUCE_ROOT/bin   root
token_authentication_ws.class  SPRUCE_ROOT/bin   root
lib directory                  SPRUCE_ROOT/bin   root
spruce_sub                     SPRUCE_ROOT/bin   root



Run ./install-spruce to install the components above.

The final piece is the submit filter, which must be installed into your resource management system. For Torque, this simply means dropping the file into the correct Torque location; Torque will see it and use it. Torque 2.0 looks for this file at /usr/local/sbin by default. If your site is already using a submit filter, you will need to merge the SPRUCE version with your local version. If you are not using one, just copy it into place and make it owned by root. An example is shown below:

> cp torque_submitfilter /usr/local/sbin/torque_submitfilter
> chmod 755 /usr/local/sbin/torque_submitfilter
> chown root.bin /usr/local/sbin/torque_submitfilter
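
If your Torque build looks for the filter somewhere other than /usr/local/sbin, Torque 2.x also lets you point at it explicitly via the SUBMITFILTER parameter in torque.cfg (the path below is the example location used above; check your Torque version's documentation for the exact behavior):

```
SUBMITFILTER /usr/local/sbin/torque_submitfilter
```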


Your installation is complete! The next section helps you test out the whole deployment to make sure everything is in place.

[TOC]

Testing the Deployment

Once the build process for the examples is complete, check the directory build/examples for the following files.

  • qsub_test.pbs
  • qsub_test_spruce.pbs
  • spruce_test.pbs
  • mpihello

The .pbs files are job submission scripts, and mpihello is a simple MPI Hello World program to test the system. The example runs below allow the admin to test functionality of the deployment. Various use-cases are listed, and sample output from them is provided for verification.

Normal 'qsub'

Since a submit-filter has been installed, this test checks that none of the normal qsub functionality has broken.

The qsub_test.pbs is a simple PBS script with no reference to the priority queue.

NOTE: You may have to modify the resource name in all examples if ia64-compute is not supported at your site.

#!/bin/csh
#PBS -N qsub_job 
#PBS -l nodes=4
#PBS -l walltime=0:10:00
#PBS -o qsub_out 
#PBS -e qsub_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello


Try a qsub submission to the default queue and verify the output.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> qsub qsub_test.pbs
<job-number>.<tg-master.some-string-here>

> qstat
                                                                   
                                                  Req'd  Req'd   Elap
Job ID Username Queue Jobname    SessID NDS   TSK Memory Time  S Time
---------------------------------------------------------------------   

## other job contents here if any

job-no uname default-q qsub_job  --      4  --    --  00:10 R   -- 

## other job contents here if any


The job should be submitted as any other qsub job to the default queue.

'qsub' to SPRUCE queue

Now, test the functionality of the submit-filter when a qsub request asks for access to the spruce queue. Any direct request of this form has to be rejected, since no token validation check has been performed. (This check is available only for command-line submission using the qsub wrapper called spruce_sub.)

The qsub_test_spruce.pbs is similar to the above script, but with the queue set to spruce.

#!/bin/csh
#PBS -q spruce
#PBS -N qsub_job 
#PBS -l nodes=4
#PBS -l walltime=0:10:00
#PBS -o qsub_out 
#PBS -e qsub_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello


Try a qsub submission to the spruce queue and verify the output.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> qsub qsub_test_spruce.pbs

Cannot submit directly to 'spruce' queue.
Please use 'spruce_sub' urgent job submission interface.
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice. 
>


The submission should be rejected, with the message indicating that the submit-filter turned it away.

'spruce_sub'

This example tests the functionality of the qsub wrapper provided by SPRUCE, spruce_sub. The wrapper performs a token validation check for the user submitting the job and aborts the submission if no valid token is registered in their name at the portal. If a token is present, it takes the appropriate action for the specified urgency level and the policy pertaining to it. (In this case, we will try a red-level run, which submits to the spruce queue if a token is present.) The same example script can be used to check both acceptance and rejection.

A test-token to input at the Spruce Portal is required to test this functionality. Please contact us to have one emailed to you; the token has a lifetime of 24 hours once activated.

Please refer to the Users' Guide for information on how to use the token.

The spruce_test.pbs is a general PBS job script with no reference to the spruce queue.

#!/bin/csh
#PBS -N spruce_job 
#PBS -l nodes=4
#PBS -l walltime=0:10:00
#PBS -o spruce_out 
#PBS -e spruce_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello


Submit a job using the spruce_sub command; running it without arguments shows the usage:

> SPRUCE_ROOT/bin/spruce_sub

Usage: 
spruce_sub [urgency=yellow|orange|red] full_path_name_of_pbs_job_script 
Currently all 3 levels submit to the spruce queue with next-in-line priority.
Please note that only a PBS script (with full path included) can be
passed in; command-line args are not currently supported.


The first run is without a valid token.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub urgency=red  spruce_test.pbs

Spruce token was invalid, aborting job submission.
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.  
>


The job should be rejected, indicating that there was no valid token for the user.

Now, go to the Spruce Token Management Portal, activate the test-token, and add yourself as a valid user. Repeat the above run.

> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub urgency=red  spruce_test.pbs 
<job-number>.<tg-master.some-string-here>

> qstat

                                                 Req'd  Req'd   Elap
Job ID Username  Queue  Jobname  SessID NDS  TSK Memory Time  S Time
--------------------------------------------------------------------

## other job contents here if any

job-no uname  spruce spruce_job --   4  --    --  00:10 R   --

## other job contents here if any


The job should be submitted to the spruce queue, since the user has a valid token.


If all the above components show the right behavior, then the spruce deployment is successful!

Congratulations, you have joined the Urgent Computing community!

Once everything is in place, you can run make distclean in SPRUCE_ROOT/spruce-rp/src to clear out all the additional files created in both the build and src directories.

[TOC]

Troubleshooting

Please contact the Spruce Team with any questions or problems using the SPRUCE software.

[TOC]