Admin Guide for Torque/Moab without Globus
$Revision: 1.2 $
This document outlines everything needed to deploy the SPRUCE urgent computing environment on your local computing resource. All the software plug-ins and installation instructions are provided, along with insight into upcoming improvements. Please note that this guide is designed for sites using Torque/Moab for resource management. If you are looking for other combinations of job manager and scheduler, refer to our software page.
Table of Contents
- Introduction
- Limitations and Prerequisites
- System Preparation
- Installation
- Testing the Deployment
- Troubleshooting
Introduction
Special PRiority Urgent Computing Environment (SPRUCE) is software that supports urgent computing for distributed, large-scale high-performance computers. Government agencies such as NSF, DOE, NIH, and NASA have invested hundreds of millions of dollars in high performance computing centers. Often, those computers are linked via distributed computing environments, or Grids. Leveraging that infrastructure, SPRUCE provides a set of tools, interfaces, and policies to support urgent computing.
Applications that need urgent computing have deadlines: late results are useless results. Knowing that a tornado will rip through a town is only useful if it can be determined while there is time for action. Similarly, knowing how a storm will affect flood waters is important for both preparation activities and rescue plans. Finding fast, available compute cycles is critical.
The current configuration of SPRUCE works for Grid environments which just use local-submission job tools. The instructions below will need modifications based on your local site deployment configuration and job queues. Please feel free to ask for help if you run into problems configuring SPRUCE for your local environment.
Limitations and Prerequisites
The current version has been tested and deployed for the configuration shown below. Other configurations may require additional tweaks to configuration variables or scripts.
- The current version of the SPRUCE software works with Torque 2.0 and the Moab 4.0.x scheduler, or higher. One component, the 'submit-filter', is designed explicitly to use a feature within Torque/Moab that validates job submissions before they are put in the queue. To use an earlier version of Torque, you may have to install a job submission wrapper or hook in a similar feature.
- SPRUCE requires a high-priority queue be enabled. Urgent computing jobs are funnelled into that queue after validation by SPRUCE. SPRUCE cannot be effectively used without a special high-priority queue.
- You must have root permission for some of the tasks, including making the new high-priority queue and the submit-filter.
If any of the above conditions do not hold true in your case, please contact the Spruce Team for further information.
System Preparation
Before configuring and building the software, two things must be set up: a high-priority job queue and a file area, SPRUCE_ROOT, for all the supporting scripts and tools.
The system is set up for the default behavior of 'next-to-run' status. If you want to extend the policies, please contact us for collaboration on this front.
Priority Queue
The first step is to make a new priority queue called spruce. Urgent computing jobs will be automatically routed to the queue by SPRUCE. For Torque, queue creation is done with the qmgr command (see [QMGR] for more details on the command).
Once created, the queue must be configured to be operational. At a minimum, this includes setting the options started and enabled, along with resources_default.nodes and resources_default.walltime. Other configuration attributes depend on local site settings. (See [QCFG] for more information). The following commands show setting up the queue for an ia64-compute resource with default wall time 15 minutes. Replace 'ia64-compute' in the commands below with the name of your resource class and '00:15:00' with default wall time of your choice.
> qmgr -c "create queue spruce queue_type=execution"
> qmgr -c "set queue spruce started=true"
> qmgr -c "set queue spruce enabled=true"
> qmgr -c "set queue spruce resources_default.nodes=1:ia64-compute"
> qmgr -c "set queue spruce resources_default.walltime=00:15:00"
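The queue setup commands above can be parameterized so that the resource class and default wall time are supplied once. The following is a minimal sketch, not part of the SPRUCE distribution: the function name spruce_queue_cmds is hypothetical, and it only prints the qmgr commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Print the qmgr commands that create and configure the spruce queue.
# $1 = resource class (default: ia64-compute, as in the example above)
# $2 = default wall time (default: 00:15:00)
# Nothing is executed here; the commands are only written to stdout.
spruce_queue_cmds() {
    resource="${1:-ia64-compute}"
    walltime="${2:-00:15:00}"
    cat <<EOF
qmgr -c "create queue spruce queue_type=execution"
qmgr -c "set queue spruce started=true"
qmgr -c "set queue spruce enabled=true"
qmgr -c "set queue spruce resources_default.nodes=1:${resource}"
qmgr -c "set queue spruce resources_default.walltime=${walltime}"
EOF
}
```

For example, spruce_queue_cmds x86-compute 00:30:00 (x86-compute being a made-up resource class) prints the five commands with those values substituted; after checking them, the output can be piped to sh by root.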
With the queue created, we can now set up the scheduler to allow the needed priority. For Moab, edit the configuration file moab.cfg. The code below is what needs to be added to the configuration file:
CLASSCFG[spruce] QDEF=ondemand
QOSCFG[ondemand] PRIORITY=20
The PRIORITY value must be larger than that of any of your other queues so that the spruce queue has the highest priority. In this example, the default queues are assumed to have priority 10, and the spruce queue is set to priority 20. After editing the Moab configuration, restart the scheduler.
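Before restarting the scheduler, it can be worth confirming that the ondemand priority really is the largest one configured. The sketch below is one way to do that under stated assumptions: it only compares PRIORITY values on QOSCFG lines (priorities set elsewhere, for example on CLASSCFG lines, are not examined), and the function name spruce_priority_is_highest is hypothetical.

```shell
#!/bin/sh
# Succeed only if the QOS named in $2 (default: ondemand) has a PRIORITY
# strictly greater than every other QOSCFG PRIORITY found in the file $1.
# Assumption: priorities are written as PRIORITY=<number> on QOSCFG[...] lines.
spruce_priority_is_highest() {
    awk -v qos="${2:-ondemand}" '
        $1 ~ /^QOSCFG\[/ {
            name = $1
            sub(/^QOSCFG\[/, "", name)
            sub(/\].*$/, "", name)
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^PRIORITY=/) {
                    split($i, kv, "=")
                    if (name == qos) {
                        mine = kv[2] + 0; found = 1
                    } else if (!seen || kv[2] + 0 > other) {
                        other = kv[2] + 0; seen = 1
                    }
                }
            }
        }
        END { exit !(found && (!seen || mine > other)) }
    ' "$1"
}
```

A call such as spruce_priority_is_highest moab.cfg returns a zero exit status when the check passes, so it can guard the restart step in a site script.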
SPRUCE_ROOT
SPRUCE has binaries and shared scripts which must be accessible to users as well as system components. We put those components in a generic spruce directory. To make installation easier, we will assume a single source code and install directory called SPRUCE_ROOT. You can put this file space anywhere on the front-end, such as /usr/local. For TeraGrid sites, it is convenient to use the "community space" ($TG_COMMUNITY). The TeraGrid sites could set up the SPRUCE_ROOT to be $TG_COMMUNITY/spruce. As root, create the spruce directory that will be both the compile/build area and the bin area for the final scripts. The example below leverages the TeraGrid Community directory:
> mkdir $TG_COMMUNITY/spruce
> chmod 775 $TG_COMMUNITY/spruce
Installation
With the preceding steps complete, you are ready to configure, compile, and install the SPRUCE code. If you have not already downloaded the source, download version 2.0 from: http://spruce.uchicago.edu/download/spruce-rp.torque-moab-ng.v2.tar.bz2
Save the file into your SPRUCE_ROOT area. You should now have a file called SPRUCE_ROOT/spruce-rp.torque-moab-ng.v2.tar.bz2. The command below shows how to extract the package contents.
> cd SPRUCE_ROOT
> bzcat spruce-rp.torque-moab-ng.v2.tar.bz2 | tar xf -
> cd spruce-rp
Look through the distribution. You will find the following pieces:
- build/
- The directory where build targets will be copied
- docs/
- Documentation
- src/
- Source code
- src/config.parameters
- All the local configuration settings that need to be set before building examples and scripts
- src/resource_provider/
- The scripts and code that must be compiled for spruce resources
- src/examples/
- Examples and tests for submitting and running spruce jobs.
- spruce_sub
- A wrapper for qsub that can process urgent job submissions. It takes spruce-specific parameters and sets up the job script. Users cannot submit emergency computations directly via qsub.
- token_authentication_ws.sh
- This script checks with the Spruce Portal using Web Services to determine if the user has a valid, activated Urgent Computing Token.
- token_authentication
- Executable that performs encode/decode for queries sent to the Spruce Portal.
- token_authentication_ws.class
- This is the Java class file which performs the actual Web Services call.
- lib
- This is the lib directory consisting of all the AXIS2 Web Services jar files to aid the token check.
- torque_submitfilter
- This filter helps perform the SPRUCE token check and prevents users from submitting directly to the high-priority spruce queue. It is built to use a special Torque feature, the 'submit filter'. Jobs submitted to Torque are passed through the submit filter for validation. Job scripts without active tokens that carry the necessary permissions are rejected.
- qsub_test.pbs
- qsub_test_spruce.pbs
- spruce_test.pbs
- mpihello
Configuration Parameters
The next step is to configure the software before compilation. We use a very simple configuration system. A single configuration file uses simple variables to transform *.in files. Edit the file src/config.parameters to prepare for compilation.
The fields of the configuration file are shown below.
# SPRUCE CONFIGURATION FILE:
#
# Please provide the paths below with the escape
# character of backslash (\). For example, a path of the form -
# /foo/bar would be input here as \/foo\/bar
# SPRUCE_ROOT path totally expanded, no environment variable shortcuts
SPRUCE_ROOT=\/soft\/community\/spruce
# This is the default resource property name used while making
# tokens, if you are not restricting access based on internal
# resource properties. Please note this is not restriction based
# on resource, but rather resource properties on a given resource.
DEFAULT_RESOURCE=systemx
NOTE: For the remainder of this document, we will use SPRUCE_ROOT to mean the value you set in the configuration file (without the escape characters). It does not need to be set as an environment variable ($SPRUCE_ROOT); it is simply a text placeholder for your local path in this documentation. So if you see 'cd SPRUCE_ROOT/bin', it may mean 'cd $TG_COMMUNITY/spruce/bin' on your system.
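Because the configuration file expects every forward slash in a path to be preceded by a backslash, typing the escaped form by hand is error-prone. A small helper can produce it; this is only a convenience sketch, and the function name escape_path is hypothetical.

```shell
#!/bin/sh
# Escape every forward slash in a path with a backslash, which is the form
# src/config.parameters asks for (for example /foo/bar becomes \/foo\/bar).
escape_path() {
    printf '%s\n' "$1" | sed 's|/|\\/|g'
}

# Example: prints \/soft\/community\/spruce
escape_path /soft/community/spruce
```

The printed value can then be pasted after SPRUCE_ROOT= in src/config.parameters.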
Building Spruce Components
After editing the configuration file, you may begin the build process.
Move to the src/resource_provider/ directory and type make at the command prompt. Do not worry: no files will be installed to any system areas. Instead, all files built will be copied to the spruce build directory. No actions outside the spruce directories are performed. You will be able to inspect the build directory before copying files to the live areas. Example output from the compile step is shown below.
> cd SPRUCE_ROOT/spruce-rp/src/resource_provider
> make
Built token_authentication_ws.class
Built token_authentication
Built spruce_sub
Built torque_submitfilter
Built perl.config
Copied all files to build/resource_provider directory
Build process for resource_provider successful!
Your build directory should now contain the following files:
> ls ../../build/resource_provider
spruce_sub
token_authentication_ws.class
token_authentication_ws.sh
token_authentication
torque_submitfilter
perl.config
install-spruce
lib (directory with 14 jar files)
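The listing above can also be checked mechanically. The sketch below tests for each expected name and reports anything missing; the function name check_build_dir is hypothetical, and the list of names is taken directly from the listing in this guide.

```shell
#!/bin/sh
# Verify that the resource_provider build directory ($1) contains every
# file named in this guide. Missing entries are reported on stderr and
# the function returns a nonzero status.
check_build_dir() {
    dir="$1"
    missing=0
    for f in spruce_sub token_authentication_ws.class token_authentication_ws.sh \
             token_authentication torque_submitfilter perl.config install-spruce lib; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

From src/resource_provider, it would be called as check_build_dir ../../build/resource_provider.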
The build process of the resource_provider files is now complete. Proceed to the building of examples to test the system.
Building Examples
Move to the examples/ subdirectory and build the examples. Once again, don't worry, no system files will be modified. The results will be installed in your build directory. An example of the output is shown below.
> cd SPRUCE_ROOT/spruce-rp/src/examples
> make
Built qsub_test.pbs
Built qsub_test_spruce.pbs
Built spruce_test.pbs
Built helloworld MPI program
Copied all files to build/examples directory
Build process for examples completed successfully
Your build directory should now contain the following files.
> ls ../../build/examples
mpihello qsub_test.pbs qsub_test_spruce.pbs spruce_test.pbs
The build process of the examples files is successful at this point. Next, the files in the build directory can be installed to system locations.
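SPRUCE Components
With the build process for resource_provider complete, you can look over the components. SPRUCE_ROOT/spruce-rp/build/resource_provider holds the pieces described in the component list under Installation: spruce_sub, token_authentication_ws.sh, token_authentication, token_authentication_ws.class, lib, and torque_submitfilter.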
Final Installation
The final installation has two parts. The first part is handled by a simple script. Go to SPRUCE_ROOT/spruce-rp/build/resource_provider. The script called install-spruce will copy all of the components into their respective places and fix up the permissions on directories and files. It should be run as root. The table below shows where the files will be copied.
| FILE | DESTINATION | OWNER |
| token_authentication | SPRUCE_ROOT/bin | root |
| token_authentication_ws.sh | SPRUCE_ROOT/bin | root |
| token_authentication_ws.class | SPRUCE_ROOT/bin | root |
| lib directory | SPRUCE_ROOT/bin | root |
| spruce_sub | SPRUCE_ROOT/bin | root |
Run ./install-spruce and install the components above.
The final piece is the submit filter, which must be copied into your resource management system. For Torque, it means simply dropping the file into the correct Torque space. Torque will see it and use it. Torque 2.0 looks for this file at /usr/local/sbin by default. If your site is already using a submit filter, you will need to merge the SPRUCE version with your local version. If you are not using one, just copy it into place and make it owned by root. An example is shown below:
> cp torque_submitfilter /usr/local/sbin/torque_submitfilter
> chmod 755 /usr/local/sbin/torque_submitfilter
> chown root.bin /usr/local/sbin/torque_submitfilter
Your installation is complete! The next section helps you test out the whole deployment to make sure everything is in place.
Testing the Deployment
Once the build process for the examples is complete, check the directory build/examples for the following files: qsub_test.pbs, qsub_test_spruce.pbs, spruce_test.pbs, and mpihello.
The .pbs files are job submission scripts, and mpihello is a simple MPI Hello World program to test the system. The example runs below allow the admin to test functionality of the deployment. Various use-cases are listed, and sample output from them is provided for verification.
Normal 'qsub'
Since a submit-filter has been installed, this test checks that normal job submission still works.
The qsub_test.pbs file is a simple PBS script with no reference to the priority queue.
NOTE: You may have to modify the resource name in all examples if ia64-compute is not supported at your site.
#!/bin/csh
#PBS -N qsub_job
#PBS -l nodes=4
#PBS -l walltime=0:10:00
#PBS -o qsub_out
#PBS -e qsub_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello
Try a qsub submission to the default queue and verify the output.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> qsub qsub_test.pbs
<job-number>.<tg-master.some-string-here>
> qstat
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
---------------------------------------------------------------------
## other job contents here if any
job-no uname default-q qsub_job -- 4 -- -- 00:10 R --
## other job contents here if any
The job should be submitted as any other qsub job to the default queue.
'qsub' to SPRUCE queue
Now, test the functionality of the submit-filter in case of a qsub request asking for access to the spruce queue. Any direct request of this form has to be rejected, since no token validation check has been performed (This check is available only for a command line submission using the qsub wrapper called spruce_sub).
The qsub_test_spruce.pbs file is similar to the script above, but with the queue set to spruce.
#!/bin/csh
#PBS -q spruce
#PBS -N qsub_job
#PBS -l nodes=4
#PBS -l walltime=0:10:00
#PBS -o qsub_out
#PBS -e qsub_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello
Try a qsub submission to the spruce queue and verify the output.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> qsub qsub_test_spruce.pbs
Cannot submit directly to 'spruce' queue. Please use 'spruce_sub' urgent job submission interface.
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.
>
The job should be rejected, indicating that it was rejected by the submit-filter.
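As a rough illustration of the kind of check exercised in this test, the sketch below passes a job script through unchanged and fails when the script asks for the spruce queue. It is not the torque_submitfilter shipped with SPRUCE, which also performs the token check. It assumes the filter receives the job script on standard input and writes it to standard output, and that a nonzero exit status signals rejection; confirm the exact exit code convention in the Torque documentation for your version. The function name reject_direct_spruce is hypothetical, and queue requests given on the qsub command line are not examined.

```shell
#!/bin/sh
# Read a job script on stdin, echo it to stdout, and return nonzero if the
# script contains a "#PBS -q spruce" directive.
# Illustration only: no token validation is performed here.
reject_direct_spruce() {
    script="$(cat)"
    printf '%s\n' "$script"
    if printf '%s\n' "$script" |
        grep -Eq '^#PBS[[:space:]]+-q[[:space:]]*spruce([[:space:]@]|$)'; then
        echo "Cannot submit directly to 'spruce' queue." >&2
        return 1
    fi
    return 0
}
```

Feeding qsub_test_spruce.pbs to this function on standard input yields a nonzero status, while qsub_test.pbs passes through with status zero.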
'spruce_sub'
This example tests the qsub wrapper provided by SPRUCE, named spruce_sub. The wrapper performs a token validation check for the user submitting the job and aborts the submission if the user does not have a valid token registered in their name at the portal. If a token is present, the wrapper takes the action required by the specified urgency level and its policy (in this case, we will try a red-level run, which submits to the spruce queue if a token is present). The same example script can be used to check both acceptance and rejection.
A test-token to input at the Spruce Portal is required to test the functionality. Please contact us to have one emailed, with a lifetime of 24 hours once activated, for testing purposes.
Please refer to the Users' Guide for information on how to use the token.
The spruce_test.pbs file is a general PBS job script with no reference to the spruce queue.
#!/bin/csh
#PBS -N spruce_job
#PBS -l nodes=4
#PBS -l walltime=0:10:00
#PBS -o spruce_out
#PBS -e spruce_err
#PBS -V
mpirun -np 4 SPRUCE_ROOT/spruce-rp/build/examples/mpihello
Submit a job using the spruce_sub command. Its usage is shown below:
> SPRUCE_ROOT/bin/spruce_sub
Usage: spruce-sub [urgency=yellow|orange|red] full_path_name_of_pbs_job_script
Currently all 3 levels submit to spruce queue with next in line priority.
Please make note that only a PBS script (with full path included) could be passed in, command line args not supported currently.
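Since the usage message asks for one of three urgency values and a full path to the job script, a site wrapper or test script might check both before calling spruce_sub. The two helpers below are a sketch of that idea; their names (valid_urgency_arg, full_script_path) are hypothetical and they are not part of SPRUCE.

```shell
#!/bin/sh
# Accept only the three urgency arguments listed in the usage message.
valid_urgency_arg() {
    case "$1" in
        urgency=yellow|urgency=orange|urgency=red) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a full path for a job script name: absolute names are kept as they
# are, relative names are prefixed with the current working directory.
full_script_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s/%s\n' "$(pwd)" "$1" ;;
    esac
}
```

A submission could then be written as SPRUCE_ROOT/bin/spruce_sub urgency=red "$(full_script_path spruce_test.pbs)".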
The first run is made without a valid token.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub urgency=red spruce_test.pbs
Spruce token was invalid, aborting job submission.
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.
>
The job should get rejected, indicating that there was no valid token for the user.
Now, go to the Spruce Token Management Portal, activate the test token, and add yourself as a valid user. Repeat the above run.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub urgency=red spruce_test.pbs
<job-number>.<tg-master.some-string-here>
> qstat
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------------------------------------------------------------
## other job contents here if any
job-no uname spruce spruce_job -- 4 -- -- 00:10 R --
## other job contents here if any
The job should be submitted to the spruce queue, since the user has a valid token.
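The queue a test job landed in can also be read out of the qstat listing with a one-line filter. This sketch assumes the column layout of the sample output above (queue in the third column, job name in the fourth); other qstat options or truncated names may shift or shorten those fields. The function name queue_of_job is hypothetical.

```shell
#!/bin/sh
# Read qstat output on stdin and print the queue (column 3) of every job
# whose name (column 4) equals $1. Column positions follow the sample
# listing in this guide and are an assumption about your qstat format.
queue_of_job() {
    awk -v name="$1" '$4 == name { print $3 }'
}
```

Piping the same qstat listing into queue_of_job spruce_job should print spruce for the accepted urgent job.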
If all the above components show the right behavior, then the spruce deployment is successful!
Congratulations, you have joined the Urgent Computing community!
Once everything is in place, you can run make distclean in SPRUCE_ROOT/spruce-rp/src to clear out the additional files created in both the build and src directories.
Troubleshooting
Please contact the Spruce Team with any questions or problems using the SPRUCE software.