Admin Guide for Cobalt
$Revision: 1.1 $
This document outlines everything needed to deploy the SPRUCE urgent computing environment on your local computing resource. All the software plug-ins and installation instructions are provided, along with insight on upcoming improvements. Please note that the current instruction guide is designed for sites using Cobalt for resource management. If you are looking for other combinations of jobmanager/scheduler, refer to our software page.
Table of Contents
- Limitations and Prerequisites
- System Preparation
- Testing the Deployment
Special PRiority Urgent Computing Environment (SPRUCE) is software that supports urgent computing for distributed, large-scale high-performance computers. Government agencies such as NSF, DOE, NIH, and NASA have invested hundreds of millions of dollars in high performance computing centers. Often, those computers are linked via distributed computing environments, or Grids. Leveraging that infrastructure, SPRUCE provides a set of tools, interfaces, and policies to support urgent computing.
Applications that need urgent computing have deadlines: late results are useless results. Knowing that a tornado will rip through a town is only useful if it can be determined while there is time for action. Similarly, knowing how a storm will affect flood waters is important for both preparation activities and rescue plans. Finding fast, available compute cycles is critical.
The current configuration of SPRUCE works for Grid environments using Globus as well as local-submission jobs using the command-line. The instructions below will need modifications based on your local site deployment configuration for Globus and job queues. Please feel free to ask for help if you run into problems configuring SPRUCE for your local environment.
If you do not run Globus at all, please write to us and we will provide you with a completely Globus-independent distribution.
Limitations and Prerequisites
The current version has been tested and deployed for the configuration shown below. Other configurations may require additional tweaks to configuration variables or scripts.
- Globus support covers the pre-WS versions of GT 3.9.x and 4.0.x. The earlier GT 2.4.x is not supported, and neither is the web services version (we are working on this).
- The current version of the SPRUCE software works with the Cobalt 0.97.0 scheduler. One component, the 'submit-filter', is designed explicitly to use a feature within Cobalt that validates job submissions before they are put in the queue.
- SPRUCE requires a high-priority queue be enabled. Urgent computing jobs are funnelled into that queue after validation by SPRUCE. SPRUCE cannot be effectively used without a special high-priority queue.
- You must have root permission for some of the tasks, including making the new high-priority queue and installing the new Globus Job Manager and the submit-filter. The install script assumes that a 'globus' account exists, and files and scripts are installed as user 'globus' for safety.
If any of the above conditions does not hold in your case, please contact the Spruce Team for further information.
System Preparation
Before configuring and building the software, two things must be set up: a high-priority job queue and a file area, SPRUCE_ROOT, for all the supporting scripts and tools.
The system is set up for the default behavior of 'next-to-run' status. If you want to extend the policies, please contact us for collaboration on this front.
The first step is to make a new priority queue called spruce. Urgent computing jobs will be automatically routed to the queue by SPRUCE.
Once created, the queue must be configured to be operational. At a minimum, this includes enabling the queue, setting default nodes and walltimes etc. Other configuration attributes depend on local site settings.
With the queue created, we can now set up the scheduler to give it the needed priority. The PRIORITY field must be set to a value larger than that of any of your other queues so that the spruce queue has the highest priority. After editing the configuration, restart the scheduler.
SPRUCE has binaries and shared scripts which must be accessible to users as well as system components. We put those components in a generic spruce directory. To make installation easier, we will assume a single source code and install directory called SPRUCE_ROOT. You can put this file space anywhere on the front-end, such as /usr/local. For TeraGrid sites, it is convenient to use the "community space" ($TG_COMMUNITY). The TeraGrid sites could set up the SPRUCE_ROOT to be $TG_COMMUNITY/spruce. As root, create the spruce directory that will be both the compile/build area and the bin area for the final scripts. The example below leverages the TeraGrid Community directory:
> mkdir $TG_COMMUNITY/spruce
> chmod 775 $TG_COMMUNITY/spruce
With the preceding steps complete, you are ready to configure, compile, and install the SPRUCE code. If you have not already downloaded the source, download version 2.0 from: http://spruce.uchicago.edu/download/spruce-rp.cobalt.v2.tar.bz2
Save the file into your SPRUCE_ROOT area. You should now have a file called SPRUCE_ROOT/spruce-rp.cobalt.v2.tar.bz2. The command below shows how to extract the package contents.
> cd SPRUCE_ROOT
> bzcat spruce-rp.cobalt.v2.tar.bz2 | tar xf -
> cd spruce-rp
Look through the distribution. You will find the following pieces:
- The directory where build targets will be copied
- Source code
- All the local configuration settings that need to be set before building examples and scripts
- The scripts and code that must be compiled for spruce resources
- Examples and tests for submitting and running spruce jobs.
- The SPRUCE resource verification file. It is used internally by Globus components for incoming SPRUCE jobs.
- This file plugs into Globus and handles all SPRUCE jobs. Jobs submitted to Globus with a special SPRUCE RSL script are passed to spruce.pm where the parameters are parsed. It builds a submission script for the local queue system and then submits the job to the high-priority queue.
- A wrapper for cqsub that can process urgent job submissions. It takes spruce-specific parameters and sets up the job script. Users cannot submit emergency computations directly via cqsub.
- This script checks with the Spruce Portal using Web Services to determine if the user has a valid, activated Urgent Computing Token.
- Executable that performs encode/decode for queries sent to the Spruce Portal.
- This is the Java class file which performs the actual Web Services call.
- This is the lib directory containing all the AXIS2 Web Services jar files needed for the token check.
- This filter performs the SPRUCE token check and prevents users from submitting directly to the high-priority spruce queue. It is built to use a special Cobalt feature, the 'submit_filter'. Jobs submitted to Cobalt are passed through the submit filter for validation. Job scripts without active tokens carrying the necessary permissions are rejected.
- This is the jobmanager contact file to the newly built spruce.pm. It is just a copy of existing SPRUCE_GLOBUS/etc/grid-services/jobmanager-cobalt with the name cobalt changed to spruce.
The next step is to configure the software before compilation. We use a very simple configuration system. A single configuration file uses simple variables to transform *.in files. Edit the file src/config.parameters to prepare for compilation.
The fields of the configuration file are shown below.
# SPRUCE CONFIGURATION FILE:
#
# Please provide the paths below with the escape
# character of backslash (\). For example, a path of the form -
# /foo/bar would be input here as \/foo\/bar

# SPRUCE has a special "Job Manager" to handle incoming Globus-SPRUCE
# requests. GT 3.9.x or GT 4.0.x is required. Since many sites have
# multiple Globus versions installed, we ask here that you specify the
# path to the Globus used by SPRUCE. If you only have one Globus
# installed, the value of $GLOBUS_LOCATION would be the right choice.
SPRUCE_GLOBUS=\/soft\/prews-gram-4.0.1-r3

# SPRUCE_ROOT path totally expanded, no environment variable shortcuts
SPRUCE_ROOT=\/soft\/community\/spruce

# Spruce job manager contact information, this varies by site,
# and has to be modified to fit into your local configuration.
# The example below shows the setup for the UC/ANL resources.
SPRUCE_JOB_MANAGER=tg-grid1.uc.teragrid.org

# This is the default resource property name used while making
# tokens, if you are not restricting access based on internal
# resource properties. Please note this is not restriction based
# on resource, but rather resource properties on a given resource.
DEFAULT_RESOURCE=ia64-compute
NOTE: For the remainder of this document, we will use SPRUCE_ROOT and SPRUCE_GLOBUS to mean the values you set in the configuration file (without the escape characters). These do not need to be set as environment variables ($SPRUCE_ROOT); they are simply text place-holders for your local config paths in this documentation. So if you see 'cd SPRUCE_ROOT/bin', that may mean 'cd $TG_COMMUNITY/spruce/bin' on your system.
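Escaping every slash by hand is easy to get wrong, so a small shell helper can do it. The sketch below is not part of the SPRUCE distribution; the function names escape_config_path and unescape_config_path are hypothetical, and it assumes only that the configuration values want each '/' written as '\/', as the comment in the configuration file describes.

```shell
# Hypothetical helper (not shipped with SPRUCE): write each '/' in a path
# as '\/' so the result can be pasted into src/config.parameters.
escape_config_path() {
    printf '%s\n' "$1" | sed 's/\//\\\//g'
}

# Reverse operation: recover the plain path from an escaped value.
unescape_config_path() {
    printf '%s\n' "$1" | sed 's/\\\//\//g'
}
```

For example, escape_config_path /soft/community/spruce prints \/soft\/community\/spruce, which matches the SPRUCE_ROOT value shown in the sample configuration.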
Building Spruce Components
After editing the configuration file, you may begin the two-step build process. The first step is to build the spruce.pm.in and spruce.rvf files, the site-specific files that depend on your local Globus installation and hook into your resource manager and job scheduler. Some manual editing steps are required. The second step compiles the code and builds the generic scripts.
We will start by making a duplicate of the current job-manager perl module for queued submissions and modify it to create a spruce job manager. Globus allows you to have multiple job managers, so you won't need to change anything for your existing job managers. Instead, we will simply copy what works for your system and add in the SPRUCE patches.
Move to the src/resource-provider/ directory and copy the Globus job-manager Perl module that your system uses for queued submissions. For sites with Cobalt, it is likely the file cobalt.pm. The copied file should be called spruce.pm.in. An example is provided below.
> cd SPRUCE_ROOT/spruce-rp/src/resource-provider
> cp SPRUCE_GLOBUS/lib/perl/Globus/GRAM/JobManager/cobalt.pm \
     spruce.pm.in
The hard part is to edit the file spruce.pm.in to add spruce patches. They can be found by inspecting spruce.pm.uc-example located in the same directory. We suggest opening the file in a side-by-side window. Because almost all job managers are very similar, applying the small patches is actually easier than it may seem.
All of the patches are very clearly marked in the file.
# In spruce.pm.uc-example, patches are wrapped with:
##############################################################
# [Patch description] : SPRUCE PATCH BEGIN [Patch number]
##############################################################
The patch code
##############################################################
# [Patch description] : SPRUCE PATCH END [Patch number]
##############################################################
Since the .pm files under this directory are customized to suit each site's requirements, we cannot provide a generic solution. Adding the patches in the right places should suffice. There are currently 4 patches.
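After hand-editing, it can help to confirm that every patch made it into spruce.pm.in. The following is a minimal sketch, assuming the BEGIN and END marker comments were copied along with each patch; the function name check_spruce_patches is hypothetical and not part of the distribution.

```shell
# Hypothetical helper: count the SPRUCE PATCH BEGIN and END markers in a
# file and succeed only if both counts equal the expected number.
check_spruce_patches() {
    file=$1
    expected=${2:-4}   # the guide states that 4 patches currently exist
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # grep -c prints 0 but exits non-zero when nothing matches
    begins=$(grep -c 'SPRUCE PATCH BEGIN' "$file" || true)
    ends=$(grep -c 'SPRUCE PATCH END' "$file" || true)
    echo "BEGIN markers: $begins, END markers: $ends"
    [ "$begins" -eq "$expected" ] && [ "$ends" -eq "$expected" ]
}
```

Running check_spruce_patches spruce.pm.in in the directory that holds the file reports the two counts and returns a non-zero status if either differs from 4. It only counts markers; it does not verify that the patch code itself is correct or in the right place.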
With spruce.pm.in edited, the next task is to build the spruce.rvf file. Start by making a duplicate of the current job manager resource verification file (rvf). You won't need to change anything for your existing job managers. Instead, we will simply copy what works for your system and add in the SPRUCE content.
For sites with Cobalt, it is likely the file cobalt.rvf. The copied file should be called spruce.rvf. An example is provided below.
> cd SPRUCE_ROOT/spruce-rp/src/resource-provider
> cp SPRUCE_GLOBUS/share/globus_gram_job_manager/cobalt.rvf spruce.rvf
Now add the content of the file spruce.rvf.in to this file as appropriate. This addition allows an extra RSL parameter called 'urgency' to be used.
Content to be edited into the spruce.rvf file:
Attribute: urgency
Description: "Indicates urgent computations and has different levels specified for resolving among conflicting on demand jobs"
Values: yellow orange red
ValidWhen: GLOBUS_GRAM_JOB_SUBMIT
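A quick way to confirm the edit is to search the file for the new attribute. The sketch below assumes the Attribute and Values lines appear in spruce.rvf exactly as shown above, each on its own line; the function name rvf_has_urgency is hypothetical.

```shell
# Hypothetical helper: succeed only if the given RVF file declares the
# 'urgency' attribute and a Values line with the three levels from this
# guide. This is a rough textual check, not a full RVF parser.
rvf_has_urgency() {
    file=$1
    grep -q '^Attribute:[[:space:]]*urgency[[:space:]]*$' "$file" &&
    grep -q '^Values:[[:space:]]*yellow orange red[[:space:]]*$' "$file"
}
```

For example, rvf_has_urgency spruce.rvf && echo "urgency attribute present" prints the message only when both lines are found.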
After you have edited in the patches to spruce.pm.in and spruce.rvf you are ready to build the remainder of the code.
Type make at the command prompt. Do not worry, no system files will be installed to any system areas. Instead, all files built will be copied to the spruce build directory. No actions outside the spruce directories are performed. You will be able to inspect the build directory before copying files to the live areas. Example output from the compile step is shown below.
> cd SPRUCE_ROOT/spruce-rp/src/resource_provider
> make
Built token_authentication_ws.class
Built token_authentication
Built spruce_sub
Built bgl_submitfilter
Built spruce.pm
Built jobmanager-spruce
Built perl.config
Copied all files to build/resource_provider directory
Build process for resource_provider successful!
Your build directory should now contain the following files:
> ls ../../build/resource_provider
spruce.pm          spruce_sub      token_authentication_ws.class  token_authentication_ws.sh
jobmanager-spruce  spruce.rvf      token_authentication           bgl_submitfilter
perl.config        install-spruce  lib (directory with 14 jar files)
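Rather than comparing the listing by eye, the check can be scripted. This is a minimal sketch that tests for the file names in the listing above; the function name missing_build_files is hypothetical, and it does not verify the contents of the files or the number of jar files in lib.

```shell
# Hypothetical helper: print the name of every expected build product that
# is missing from the given directory; succeed only if none are missing.
missing_build_files() {
    dir=$1
    missing=0
    for f in spruce.pm spruce_sub token_authentication_ws.class \
             token_authentication_ws.sh jobmanager-spruce spruce.rvf \
             token_authentication bgl_submitfilter perl.config \
             install-spruce; do
        if [ ! -e "$dir/$f" ]; then
            echo "$f"
            missing=1
        fi
    done
    # lib is expected to be a directory (it holds the Web Services jars)
    if [ ! -d "$dir/lib" ]; then
        echo "lib"
        missing=1
    fi
    [ "$missing" -eq 0 ]
}
```

Called as missing_build_files ../../build/resource_provider, it prints nothing and returns success when the build directory is complete.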
The build process of the resource_provider files is now complete. Proceed to building the examples to test the system.
Move to the examples/ subdirectory and build the examples. Once again, don't worry, no system files will be modified. The results will be installed in your build directory. An example of the output is shown below.
> cd SPRUCE_ROOT/spruce-rp/src/examples
> make
Built globus_test.rsl
Built cqsub_tests
Built helloworld MPI program
Copied all files to build/examples directory
Build process for examples completed successfully
Your build directory should now contain the following files.
> ls ../../build/examples
globus_test.rsl  mpihello  cqsub_tests
The build process of the examples files is now complete. Next, the files in the build directory can be installed to system locations.
With the build process for resource_provider complete, you can look over the components. SPRUCE_ROOT/spruce-rp/build/resource-provider consists of the following pieces:
The final installation has two parts. The first part is handled by a simple script. Go to SPRUCE_ROOT/spruce-rp/build/resource-provider. The script called install-spruce will copy all of the components into their respective places and fix up the permissions on directories and files. It should be run as root. The components added to Globus will be changed to be owned by user 'globus' instead of root. The table below shows where the files will be copied.
Run ./install-spruce and install the components above.
The final piece is the bgl_submitfilter, which must be copied into your resource management system. To do this, copy the file into place and add the submit filter path to the cobalt.conf configuration file.
[cqm]
filters: /path/to/bgl_submitfilter
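To confirm that the entry landed in the right section, the configuration file can be read back. The sketch below is an assumption-laden illustration: the function name cqm_filters is hypothetical, the location of cobalt.conf is site-specific and must be passed as an argument, and the parsing assumes the simple "[section]" and "key: value" (or "key = value") layout shown above.

```shell
# Hypothetical helper: print the value of the 'filters' key found in the
# [cqm] section of a Cobalt configuration file (no output if it is absent).
cqm_filters() {
    awk '
        # A line starting with "[" opens a new section; remember whether
        # it is the [cqm] section.
        /^\[/ { in_cqm = ($0 ~ /^\[cqm\][ \t]*$/); next }
        # Inside [cqm], strip the key and separator and print the value.
        in_cqm && /^filters[ \t]*[:=]/ {
            sub(/^filters[ \t]*[:=][ \t]*/, "")
            print
        }
    ' "$1"
}
```

If cqm_filters /your/path/to/cobalt.conf prints the path where you copied bgl_submitfilter, the entry is in place; empty output means the key is missing or sits under a different section.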
Your installation is complete! The next section helps you test out the whole deployment to make sure everything is in place.
Testing the Deployment
Once the build process for the examples is complete, check the directory build/examples for the following files.
The .rsl is essentially a job submission script, while cqsub_tests consists of the command line test details. mpihello is a simple MPI Hello World program to test the system. The example runs below allow the admin to test functionality of the deployment. Various use-cases are listed, and sample output from them is provided for verification.
Since a submit-filter has been installed, the first test checks that normal job submission still works.
Try a cqsub submission to the default queue and verify the output.
> cqsub -t 2 -n 2 SPRUCE_ROOT/spruce-rp/build/examples/mpihello
<job-number>.<tg-master.some-string-here>
> showq
ID       Owner    Submitted    ST    Class
-----------------------------------------------------------------
## other job contents here if any
job-no   uname    date         R     default-class
## other job contents here if any
The job should be submitted to the default queue like any other cqsub job.
'cqsub' to SPRUCE queue
Now, test how the submit-filter handles a cqsub request that asks for the spruce queue directly. Any direct request of this form must be rejected, since no token validation check has been performed. (This check is available only for Globus submissions that include the urgency parameter, or for command-line submissions using the cqsub wrapper called spruce_sub.)
> cqsub -t 2 -n 2 -q spruce SPRUCE_ROOT/spruce-rp/build/examples/mpihello
Cannot submit directly to 'spruce' queue. Please use either 'spruce_sub' or Globus urgent job submission interface.
>
The job should be rejected, with the message above coming from the submit-filter.
This example tests the cqsub wrapper provided by SPRUCE, called spruce_sub. The wrapper performs token validation checks for the user submitting the job and aborts the submission if the user does not have a valid token registered in their name at the portal. If a token is present, the wrapper takes the action dictated by the specified urgency level and its policy. (In this case, we will try a red-level run, which submits to the spruce queue if a token is present.) The same example script can be used to check both acceptance and rejection.
A test-token to input at the Spruce Portal is required to test the functionality. Please contact us to have one emailed, with a lifetime of 24 hours once activated, for testing purposes.
Please refer to the Users' Guide for information on how to use the token.
Submit a job using the spruce_sub command. Its usage is shown below.
> SPRUCE_ROOT/bin/spruce_sub
Usage: spruce_sub -u <urgency> <other parameters for cqsub>
Currently all three levels submit to spruce queue with next in line priority.
The first run is made without a valid token.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub -u red -t 2 -n 2 mpihello
Spruce token was invalid, aborting job submission.
Your job has been administratively rejected by the queueing system.
There may be a more detailed explanation prior to this notice.
>
The job should be rejected, indicating that there was no valid token for the user.
Now, go to the Spruce Token Management Portal, activate the test-token and add yourself as a valid user. Repeat the above run.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> SPRUCE_ROOT/bin/spruce_sub -u red -t 2 -n 2 mpihello
<job-number>.<tg-master.some-string-here>
> showq
ID       Owner    Submitted    ST    Class
----------------------------------------------------------------
## other job contents here if any
job-no   uname    date         R     spruce
## other job contents here if any
The job should be submitted to the spruce queue, since the user now has a valid token.
For the Globus test, note that the system currently supports only globusrun submissions. An RSL script with an additional parameter indicating the urgency level (similar to the spruce_sub syntax) must be supplied, and the same validation as for spruce_sub is performed.
The test flow is similar to the one above, and the same token is reused.
The globus_test.rsl is an RSL script indicating the right contact string for jobmanager-spruce.
+
(& (resourceManagerContact = spruce-jm-contact/jobmanager-spruce)
   (rsl_substitution = (HOMEDIR "SPRUCE_ROOT/spruce-rp/build/examples"))
   (executable = $(HOMEDIR)/mpihello)
   (jobType = mpi)
   (host_count = 4)
   (urgency = red)
   (stdout = $(HOMEDIR)/globus_stdout)
   (stderr = $(HOMEDIR)/globus_stderr)
)
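Before submitting, it can be worth checking that the urgency relation in an RSL file names one of the levels the rvf file allows (yellow, orange, red). The sketch below is a plain text match, not an RSL parser; the function name rsl_urgency is hypothetical, and it assumes the relation is written in the "(urgency = level)" form shown above.

```shell
# Hypothetical helper: print the urgency level named in an RSL file and
# succeed only if it is one of the levels listed in this guide.
rsl_urgency() {
    # Take the first "(urgency = <word>)" relation found in the file.
    level=$(sed -n 's/.*([ ]*urgency[ ]*=[ ]*\([A-Za-z]*\)[ ]*).*/\1/p' "$1" | head -n 1)
    echo "$level"
    case "$level" in
        yellow|orange|red) return 0 ;;
        *) return 1 ;;
    esac
}
```

For the example script, rsl_urgency globus_test.rsl should print red and return success; a missing or misspelled level yields a non-zero status.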
NOTE: Since the submit-filter is a global installation, it is useful to replace the jobmanager-spruce contact with your local default jobmanager-cobalt (or similar) and check normal job submissions as well. Once the proxy is initialized, a submission with the default job manager should be successful.
First, go to the Spruce Token Management Portal and deactivate the token in your name. Then make sure you have a grid-proxy initialized.
> grid-proxy-init
Your identity: your-DN-here
Enter GRID pass phrase for this identity:
Creating proxy ........................................ Done
Your proxy is valid until: some-date-and-time
The first run is made without a valid token.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> globusrun -o -f globus_test.rsl
The job does not run and the command prompt appears almost immediately. A gram log file is dumped into the home directory, which you can grep to find the following line.
> cd ~/
> more gram_job_mgr_some-number.log
## lots of log messages
date-and-time-of-log JMI: while return_buf = No Valid Token found for user = your-uname, aborting urgent job submission
## more log messages
>
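When several GRAM logs have accumulated, searching them all at once is quicker than paging through each. This is a minimal sketch, assuming the logs follow the gram_job_mgr_*.log naming seen above and contain the rejection message exactly as quoted; the function name logs_with_token_rejection is hypothetical.

```shell
# Hypothetical helper: list the GRAM job manager logs in a directory
# (default: the home directory) that contain the token-rejection message
# quoted in this guide.
logs_with_token_rejection() {
    dir=${1:-$HOME}
    grep -l 'No Valid Token found for user' "$dir"/gram_job_mgr_*.log 2>/dev/null
}
```

Running logs_with_token_rejection with no argument searches the home directory and prints the name of each matching log; no output means no rejection was logged there.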
Now, go to the Spruce Token Management Portal, log in with the test-token and add yourself as a valid user. Repeat the above run.
> cd SPRUCE_ROOT/spruce-rp/build/examples
> globusrun -o -f globus_test.rsl
The job is submitted and the command prompt waits for completion, so check the queue status in another terminal window:
> showq
ID       Owner    Submitted    ST    Class
----------------------------------------------------------------
## other job contents here if any
job-no   uname    date         R     spruce
## other job contents here if any
----------------------------------------------------------------
If all the above components show the right behavior, then the spruce deployment is successful!
Congratulations, you have joined the Urgent Computing community!
Once everything is in place, you can run make distclean in SPRUCE_ROOT/src to clear out the additional files created in both the build and src directories. The spruce.pm.in and spruce.rvf files that you created are not deleted; they are moved to spruce.pm.in.hand-edits and spruce.rvf.hand-edits respectively. The config file changes are also retained.
Please contact the Spruce Team with any questions or problems using the Spruce software.