Users Guide

  $Revision: 1.5 $

This document outlines everything a user needs to know about getting priority access to resources and submitting urgent jobs. Reading the Work Flow section first is recommended. Please note that this version uses Globus Toolkit Distinguished Names (DNs) as the primary user identifier. If your site requires other mechanisms, contact the SPRUCE Team for further information.

Introduction

Special PRiority and Urgent Computing Environment (SPRUCE) is software that supports Urgent Computing on distributed, large-scale high-performance computers. Government agencies such as NSF, DOE, NIH, and NASA have invested hundreds of millions of dollars in high performance computing centers. Often, those computers are linked via distributed computing environments, or Grids. Leveraging that infrastructure, SPRUCE provides a set of tools, interfaces, and policies to support Urgent Computing.

The SPRUCE Portal functions as the one-stop shop for obtaining priority access and monitoring usage. Users are provided with right-of-way tokens, which can be activated at the portal in case of an emergency. The identities of the team members who will be submitting urgent jobs should be added to the token. Once the token is activated, a special set of parameters indicating the urgency level must be included with each job submission request to the SPRUCE custom job manager. This job manager verifies that the user has the necessary permissions and acts according to the local policy for the requested urgency level, for example by giving the job next-to-run priority. Users can monitor the remaining time and token attributes from the portal.

Portal

The SPRUCE portal provides a single point of administration and authorization for urgent computing across an entire Grid. This section describes how to obtain priority access and manage the users who can submit urgent jobs. Step-by-step screenshots are provided for convenience.

The main Portal Home Page provides functionality to manage your tokens and user information. Built upon AJAX technologies and web services, the portal requires JavaScript to be turned on before proceeding.

User Portal Home Page with options to Manage token or view User Info

Right-Of-Way Token

A Right-of-Way Token is similar to the flashing lights and siren on an ambulance. Activating a Right-of-Way Token allows the user to turn on the siren and flashers and request that other users sharing the computing resources yield the right-of-way. The tokens are transferable. A scientist can give a Right-of-Way Token to staff while away on travel, just as a sysadmin might hand a pager and machine room key to another staff member.

The SPRUCE Right-of-Way Token has a 16-digit number printed on it, as shown below. This number is used as the login to the portal.

SPRUCE Right-of-Way Token

In case of an emergency, the person holding the token has to activate it from the web portal. Once activated, tokens have a finite lifetime. A typical token lifetime ranges from 4 to 24 hours, although this can vary from token to token. After activation, the user has a window lasting for the token lifetime in which to submit Urgent Computing jobs. When the time runs out, the user must spend another token. If no jobs, or only standard-priority jobs, are submitted, nothing happens and the token simply expires.
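
The submission window is simple arithmetic: activation time plus lifetime, minus the current time. The sketch below only illustrates that calculation. The function name token_seconds_left and its epoch-second arguments are assumptions made for this example and are not part of SPRUCE; the portal remains the authoritative source for the time remaining.

```shell
# Seconds left in a token's submission window (0 once it has expired).
# Usage: token_seconds_left ACTIVATED_EPOCH LIFETIME_HOURS [NOW_EPOCH]
# All three values are hypothetical inputs; SPRUCE does not expose them this way.
token_seconds_left() {
    activated=$1
    lifetime_hours=$2
    now=${3:-$(date +%s)}            # default to the current time
    left=$(( activated + lifetime_hours * 3600 - now ))
    [ "$left" -gt 0 ] || left=0      # an expired window is reported as 0
    echo "$left"
}
```

For a token with a 4-hour lifetime, calling token_seconds_left 1000 4 1000 prints 14400 (the full window), and any call made after the window has closed prints 0.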

If you would like to request tokens for your project, please contact the SPRUCE Team.

Token Info

If you wish to see information related to a particular token, manage the users associated with it, or check the remaining time, log in using the 'Manage Tokens' link on the portal home page. Enter the 16-digit token number exactly as it appears, including the hyphens, to log in to the portal.

Logging in with the token number, to manage the token

On login, the portal displays the key information about the token: its status, lifetime, maximum urgency level, expiration date, associated resources, and any users already associated with it.

Displayed information about the token

Activating Tokens

Depending on whether the token has been activated, the available option is either 'checktime' or 'activate'. If you have an unactivated token, you can turn it on by clicking the 'activate' option. A comment field opens, in which you should enter the reason for activating the token. You can then click 'activate' to proceed, or 'cancel' to stop the activation process.

Activating a token and entering comment about the reason for it.

A view of the token status is returned, where you can see that it is now 'Activated'.

Token status view showing 'Activated'.

Adding Users to Token

User identities can be added to the token both before and after activation. If you already know which users will be running the jobs, it is good practice to add them in advance, to reduce overhead during an emergency.

Once the token has been activated, the team members whose identities are on the token can submit emergency jobs. The form of identification in the current distribution is the Globus Toolkit Distinguished Name (DN). The PI holding the token needs to find the DN for each team member. This information is found in the grid-mapfile of any site. The typical command to find it is:

grep user-name-of-member /etc/grid-security/grid-mapfile

If you have any trouble identifying a DN, please contact the SPRUCE Team. The DN of each user, along with the user's real name and a contact email address, should be entered in the interface. All of these fields are required. Any number of users can be added per token, and everyone added gets permission to submit urgent jobs for as long as the token is alive.

NOTE: A user can have more than one DN listed under his name. It is sufficient to add any one of the listed DNs per user.
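
A plain grep also matches lines where the user name merely appears inside a longer account name or inside a DN. The sketch below is one way to narrow the match. It assumes the usual grid-mapfile layout of a double-quoted DN followed by one or more comma-separated local account names; the function name list_dns is hypothetical and is not part of SPRUCE or Globus.

```shell
# Print every DN mapped to a local account name in a grid-mapfile.
# Usage: list_dns USER [MAPFILE]
# Assumes lines of the form:  "DN with spaces" account[,account...]
list_dns() {
    user=$1
    map_file=${2:-/etc/grid-security/grid-mapfile}
    awk -v user="$user" '
        /^[ \t]*#/ { next }                            # skip comment lines
        match($0, /"[^"]*"/) {
            dn = substr($0, RSTART + 1, RLENGTH - 2)   # DN without the quotes
            rest = substr($0, RSTART + RLENGTH)        # local account name(s)
            gsub(/[ \t]/, "", rest)
            n = split(rest, accounts, ",")
            for (i = 1; i <= n; i++)
                if (accounts[i] == user) { print dn; break }
        }
    ' "$map_file"
}
```

Running list_dns user-name-of-member prints one DN per line, so a user with several DNs shows all of them and any one can be entered in the portal.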

Adding DN of 'Demo User' to the token

On completion, the user then shows up in the list of users associated with this token. Note that it was empty earlier in this example (previous screenshot).

Successfully added user 'Demo User' to the token

NOTE: You cannot modify an added user. If any information was entered incorrectly, you need to delete the user and add him again with the correct information.

Removing Users

If a user no longer needs to submit jobs, he can be removed from the active users list on a given token. Also, because user information cannot be modified, a user whose details were entered incorrectly must be removed and added again.

The list of users associated with a token has a 'remove' button beside each name. Click it and confirm that you want to remove that particular user.

Confirming removal of 'Demo User' from the token

On completion, the user is removed from the list of users associated with this token.

Successfully deleted user 'Demo User' from the token

Check Time

Once the token has been activated, the time remaining for submissions starts counting down. Clicking the 'checktime' button on an activated token shows the time left and how fresh this information is.

Time remaining on a token

User Info

Any user who wants to know whether he has any active tokens, and to see more information about them, can log in to the portal using his email address and DN. The user does not need to know the token number.

Logging in to check information related to a particular user

NOTE: The user has to log in using the same DN that was added to the token under his name. Logging in with any of the other DNs he may have will not return the right permissions.

On login, all pertinent information is displayed about active tokens and about tokens that are not yet activated but still unexpired. Details include the maximum urgency level, lifetime, resources, and so on.

Displayed information for 'Demo User'

NOTE: Users can submit urgent jobs only if all the resources requested in one call are covered by a single active token.
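
The rule in this note amounts to a subset test: every resource named in one request must appear in the resource list of the same active token. A minimal sketch of that test follows. The function name token_covers and the resource names used with it are made up for illustration, and the real check is performed by SPRUCE itself, not by user scripts.

```shell
# Succeed only when every requested resource is on the token's resource list.
# Usage: token_covers "res1 res2 ..." requested1 [requested2 ...]
# The first argument is a space-separated list (a hypothetical representation).
token_covers() {
    allowed=" $1 "
    shift
    for res in "$@"; do
        case $allowed in
            *" $res "*) ;;        # this resource is on the token
            *) return 1 ;;        # at least one resource is not covered
        esac
    done
    return 0
}
```

With a token listing "uc-anl ncsa", a request for uc-anl and ncsa passes, while a request for uc-anl and sdsc fails even if sdsc is on a different active token.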

Job Submission

We currently support two forms of job submission: the globusrun command provided by the Globus Toolkit, or direct command line submission akin to qsub, llsubmit, or bsub, depending on your local resource manager. The idea is to support both distributed Grids running Globus and traditional supercomputers.

NOTE: Direct submission to the underlying priority mechanism (queue, QOS, etc.) is restricted. The only way to get priority access is through the methods above, which check your authorization before processing the request.

Globus

The current software is compatible with the pre-WS components of Globus Toolkit 4.0.1. Depending on the site chosen, you need to identify the contact information and job manager name. The job manager is typically named jobmanager-spruce. SPRUCE currently supports submissions to a single resource using globusrun. The Resource Specification Language (RSL) is used to pass all configuration and resource requirements to the scheduler when using 'globusrun'.

When submitting an urgent computing job, the user needs to specify an additional RSL parameter called 'urgency'. This parameter has three valid values: yellow, orange, and red. More information about how to select the level suitable for your application and needs can be found here.

NOTE: Different sites usually have different policies associated with these urgency levels. The examples below assume that all three urgency levels map to the same policy of getting next-to-run status. Hence, they show that if you have a valid token, your job gets a running spot in the 'spruce queue', which has its priority set to make your job run immediately next.

UC/ANL resource manager contact: tg-grid1.uc.teragrid.org/jobmanager-spruce 
urgency=yellow/orange/red

Example resource manager contact and job manager for UC/ANL TG resource

An example RSL job file is given below. Please note that some parameters such as host_xcount may vary depending on the site you wish to submit to.

+
(&
(resourceManagerContact = tg-grid1.uc.teragrid.org/jobmanager-spruce)
(executable = $ENV{HOME}/spruce/demo/mpihello)
(jobType = mpi)
(directory = $ENV{HOME}/spruce/demo/)
(host_types = ia64-compute)
(host_xcount = 30)
(urgency = red)  
(stdout = $ENV{HOME}/spruce/demo/stdout)
(stderr = $ENV{HOME}/spruce/demo/stderr)
)           

Example RSL submission file to use with 'globusrun'
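
When the same job is submitted at different urgency levels, it can help to generate the RSL file rather than edit it by hand. The sketch below prints the example request above with the urgency and host count filled in, and rejects any urgency other than the three documented values. The function name make_urgent_rsl is an assumption for this example; every other field is copied unchanged from the example and still has to be adapted to your site.

```shell
# Print an RSL request like the example above for a chosen urgency level.
# Usage: make_urgent_rsl URGENCY [HOST_XCOUNT]
make_urgent_rsl() {
    urgency=$1
    count=${2:-30}
    case $urgency in
        yellow|orange|red) ;;
        *) echo "urgency must be yellow, orange, or red" >&2; return 1 ;;
    esac
    # Single-quoted lines are emitted verbatim, exactly as in the example.
    printf '%s\n' \
        '+' \
        '(&' \
        '(resourceManagerContact = tg-grid1.uc.teragrid.org/jobmanager-spruce)' \
        '(executable = $ENV{HOME}/spruce/demo/mpihello)' \
        '(jobType = mpi)' \
        '(directory = $ENV{HOME}/spruce/demo/)' \
        '(host_types = ia64-compute)' \
        "(host_xcount = $count)" \
        "(urgency = $urgency)" \
        '(stdout = $ENV{HOME}/spruce/demo/stdout)' \
        '(stderr = $ENV{HOME}/spruce/demo/stderr)' \
        ')'
}
```

For example, make_urgent_rsl red > globus_test.rsl writes a file that can then be passed to globusrun with the -f option.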

If the user does not have a valid token activated at the SPRUCE Portal, the job submission will be aborted and the gram_log will contain an error message pertinent to the situation. Otherwise, the job is submitted successfully, and by running qstat or an equivalent you can see that the job was submitted to the spruce queue to run as the immediate next queued job.

User does not have a valid token:

> globusrun -o -f globus_test.rsl
> more gram_job_mgr_some_number.log 

........
2/9 10:46:04 JMI: while return_buf = No Valid Token found for user = your_name,
             aborting urgent job submission 
........


User has a valid token with the policy of getting next in queue position:

> qstat

Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
238316.tg-master g40.batch        user1           00:00:00 R dque
238337.tg-master ...ps.10x2.randh user1                  0 Q dque
238349.tg-master STDIN            user2           00:00:00 R dque
238350.tg-master ...ps.10x2.randh user1                  0 Q dque
238353.tg-master job              user3                  0 R dque
238354.tg-master job              user3                  0 Q dque
238355.tg-master job              user3                  0 Q dque
238356.tg-master job              user3                  0 Q dque
238357.tg-master job              user3                  0 Q dque
238358.tg-master job              user4                  0 Q dque
238359.tg-master job              user4                  0 Q dque
238360.tg-master job              user4                  0 Q dque
238361.tg-master  STDIN           your_name              0 R spruce

Example job run with and without active token
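
Because the rejection is reported only in the job manager log, a small helper can save a manual search. The sketch below looks for the message shown in the log excerpt above and prints the user it names. It assumes the message keeps exactly that wording and layout, which may differ between versions; the function name rejected_user is hypothetical.

```shell
# Print the user named in the first token rejection found in the given
# job manager log file(s); return non-zero if no rejection is recorded.
# Usage: rejected_user LOGFILE...
rejected_user() {
    name=$(sed -n 's/.*No Valid Token found for user = \([^,]*\),.*/\1/p' "$@" | head -n 1)
    [ -n "$name" ] && echo "$name"
}
```

A typical use is to run rejected_user on the gram_job_mgr log files after a failed submission and, if it prints your user name, activate a token or have your DN added to one before resubmitting.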

The Wrapper: 'spruce_sub'

In order to make the system compatible with traditional supercomputers, and for users who wish to use direct command line job submission tools rather than Globus, SPRUCE provides a wrapper command called 'spruce_sub'.

The 'spruce_sub' wrapper works exactly the same way as local job submission, with an additional urgency command line parameter. This parameter accepts one of the three defined levels: yellow, orange, or red. Currently all of these levels map to the same policy of providing next-to-run status. Hence, the examples below show that after a successful submission you can see the job running in the spruce queue, which is able to place the job as the immediate next to run.

Usage: spruce_sub [urgency=yellow/orange/red] job_script 

Command line usage of the 'spruce_sub' command
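
A thin guard around the wrapper can catch a mistyped urgency level or a missing job script before anything is submitted. The sketch below is one possible form. The function name submit_urgent and the SPRUCE_SUB variable (which holds the site-specific path of the wrapper) are assumptions for this example and are not provided by SPRUCE.

```shell
# Validate the arguments, then hand them to the spruce_sub wrapper.
# Usage: submit_urgent URGENCY JOB_SCRIPT
# SPRUCE_SUB is a hypothetical variable naming the wrapper at your site.
submit_urgent() {
    case $1 in
        yellow|orange|red) ;;
        *) echo "urgency must be yellow, orange, or red" >&2; return 2 ;;
    esac
    if [ ! -r "$2" ]; then
        echo "cannot read job script: $2" >&2
        return 2
    fi
    "${SPRUCE_SUB:-spruce_sub}" "urgency=$1" "$2"
}
```

Setting SPRUCE_SUB=echo gives a dry run that prints the arguments instead of submitting the job.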


NOTE: For all TeraGrid resources, the location of the script is standard, as shown in the example below. The $TG_COMMUNITY variable is a TeraGrid-wide standard, so there should not be any problems accessing the script. If running from a non-TeraGrid local resource, please contact your administrator about the location of the script.

The job script can remain exactly the same as your original version. Nothing needs to be changed in it, because the urgency is indicated on the command line. If the user does not have a valid token activated at the SPRUCE Portal, the job submission will be aborted and an error message pertinent to the situation will be displayed. Otherwise, by running qstat or an equivalent, you can see that the job was successfully submitted to the spruce queue to run as the immediate next queued job.

Example PBS script - Any generic job submission script

#!/bin/csh
# Runs in the C shell
# Name of the job
#PBS -N spruce_job
# Number and type of nodes
#PBS -l nodes=4:ia64-compute:ppn=1
# Maximum wall clock run time
#PBS -l walltime=0:10:00
# Standard output
#PBS -o out
# Standard error
#PBS -e err
# Export environment variables to the job
#PBS -V
# Executable
mpirun -np 4 $HOME/spruce/demo/mpihello

User does not have a valid token:

> $TG_COMMUNITY/spruce/spruce-sub urgency=red helloworld.pbs 
 No Valid Token found for user:your_name, aborting urgent job submission
>

User has a valid token with the policy of getting next in queue position:

> qstat

Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
238316.tg-master g40.batch        user1           00:00:00 R dque
238337.tg-master ...ps.10x2.randh user1                  0 Q dque
238349.tg-master STDIN            user2           00:00:00 R dque
238350.tg-master ...ps.10x2.randh user1                  0 Q dque
238353.tg-master job              user3                  0 R dque
238354.tg-master job              user3                  0 Q dque
238355.tg-master job              user3                  0 Q dque
238358.tg-master job              user4                  0 Q dque
238359.tg-master job              user4                  0 Q dque
238361.tg-master spruce_job       your_name              0 R spruce

Example job run using 'spruce_sub' on the TG resources
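
On a busy system the qstat listing can be long, so filtering it by queue makes the urgent job easier to spot. The sketch below assumes PBS-style qstat output laid out as in the example above, with the queue in the last column and the job state in the column before it. The function name jobs_in_queue is hypothetical.

```shell
# Read qstat output on stdin; print "job_id state" for jobs in one queue.
# Usage: qstat | jobs_in_queue [QUEUE]    (QUEUE defaults to "spruce")
jobs_in_queue() {
    queue=${1:-spruce}
    awk -v queue="$queue" '
        # Job lines start with a numeric id such as 238361.tg-master;
        # the header and separator lines do not, so they are skipped.
        $1 ~ /^[0-9]+\./ && $NF == queue { print $1, $(NF - 1) }
    '
}
```

Applied to the listing above, it prints 238361.tg-master R, the one job in the spruce queue and its state.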

Troubleshooting

Please contact the SPRUCE Team with any questions or problems using the SPRUCE software.
