Frequently Asked Questions

Index

  1. What is SPRUCE?
  2. What is Urgent Computing?
  3. What is On Demand Computing?
  4. What can be classified as Emergency Decision Support?
  5. What is a Right-of-Way Token?
  6. Should I ask for a token after there is an emergency?
  7. How long does a Right-of-Way Token last?
  8. Do I need a separate token for each job and every person on my team?
  9. Can Urgent Computing jobs have different priorities?
  10. How soon are Urgent Computing jobs run? How are they handled?
  11. What are the possible Urgent Computing Resource Policies?
  12. Who decides what is an emergency?
  13. How does SPRUCE affect security?
  14. What is Warm Standby?
  15. What is SPRUCE Advisor?
  16. I am a resource provider, how do I add SPRUCE functionality to my Grid?
  17. What should I do if I have an application suitable for Urgent Computing?
  18. This sounds great, is SPRUCE running in production anywhere?
  19. Who can join the SPRUCE network?
  20. Any other exciting work happening?

Answers

What is SPRUCE?

High-performance modeling and simulation are playing a driving role in decision making and prediction. For time-critical emergency support applications such as severe weather prediction, flood modeling, and influenza modeling, late results can be useless. A specialized infrastructure is needed to provide computing resources quickly, automatically, and reliably. SPRUCE, the Special PRiority and Urgent Computing Environment, is a system that supports urgent or event-driven computing on both traditional supercomputers and distributed Grids.

[top]
What is Urgent Computing?

Urgent Computing refers to high-performance computing jobs that have an immediate need for resources. While all users, at one time or another, believe their job is important, some computations and simulations have real deadlines. They can be vital for emergency decision support or planning. For example, consider a wildfire simulation that uses real-time weather data to predict the fire's most likely path. Results from Urgent Computing must be received before the "deadline", the point at which they become irrelevant.

[top]
What is On Demand Computing?

Well, that depends on who you talk to. Computational scientists have sometimes used the term "On Demand Computing" to mean "I need resources immediately!" Or, in other words, "I can't wait in a batch queue, I need the resource now!" The pressing need could be because an on-line scientific instrument is about to become available, or an important visitor is standing in front of an interactive ultra-resolution display wall demo, waiting impatiently. However, On Demand Computing has also become a business term meaning "data center servers available for immediate rental to handle cyclic business needs". IBM, HP, and Sun are all selling "On Demand Computing". Just search for On Demand Computing and many sponsored links will pop up; IBM has widely advertised its "On Demand Center". Because of this naming ambiguity, and to avoid confusing the business folks, we encourage the scientific computing community to abandon the term On Demand Computing, unless of course they really do want to rent a dozen web servers for a couple of days to handle the increased traffic from their web site's year-end sales event.

[top]
What can be classified as Emergency Decision Support?

Simply put, Emergency Decision Support computation provides key insights or answers to aid decision makers during an emergency. For example, an emergency coordinator managing a hazardous materials team responding to a chemical spill may ask "given current weather conditions, which population centers are most at risk?" One of the keys to good Emergency Decision Support is interpretation: the data must be interpreted by a domain scientist before it is useful to a decision maker.

[top]
What is a Right-of-Way Token?

In an emergency, complexity becomes an enemy. Getting the necessary Urgent Computing cycles must be fast, easy, and transferable. To solve this, we support a very simple system. A "Right-of-Way Token" is similar to the flashing lights and siren on an ambulance. Activating a Right-of-Way Token allows the user to turn on the siren and flashers and request that other users sharing the computing resources yield the right-of-way. Tokens are transferable. A scientist can give his Right-of-Way Token to his staff while he is away on travel, just like a sysadmin might hand his pager and machine room key to another staff member.
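To make the analogy concrete, here is a minimal, hypothetical sketch (in Python) of activating a token through a portal-style web service. The portal address, endpoint, and parameter names are assumptions made for illustration only; they are not the actual SPRUCE interface, so please consult the User Guide for the real procedure.

    # Hypothetical sketch: activate a pre-issued Right-of-Way Token via a
    # portal-style web service. URL, endpoint, and parameters are made up.
    import urllib.parse
    import urllib.request

    PORTAL_URL = "https://spruce.example.org/portal"  # placeholder address

    def activate_token(token_id, username):
        """Activate the token and attach the submitting user to it."""
        data = urllib.parse.urlencode({
            "token": token_id,     # the token issued in advance
            "user": username,      # user who will submit urgent jobs under it
            "action": "activate",  # "turn on the siren and flashers"
        }).encode()
        with urllib.request.urlopen(PORTAL_URL + "/token", data=data) as resp:
            return resp.read().decode()  # e.g., confirmation and expiry time

    # Example (hypothetical): activate as soon as the emergency begins.
    # print(activate_token("WX-SEVERE-0421", "stormchaser"))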

[top]
Should I ask for a token after there is an emergency?

Scientists who run emergency decision support computations obtain tokens before any incident occurs. When an emergency strikes, the user simply takes the token "out of the pocket", activates it, adds himself to it, and starts submitting jobs. If needed, users can also be pre-added to the token to eliminate even that small delay.

[top]
How long does a Right-of-Way Token last?

Once activated, Tokens have a finite lifetime. Generally, a Token is good for several hours. In other words, after activation, the user has a window lasting several hours during which they can submit Urgent Computing jobs. When the time runs out, they must spend another Token. If no jobs, or only standard-priority jobs, are submitted, nothing happens and the Token simply expires.
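The lifetime rule is simple enough to state in a few lines of code. This is only a sketch of the concept; the actual window length ("several hours") is set by the resource administrators, and the four-hour figure below is an assumption.

    # Sketch of the lifetime rule: a token is usable for a fixed window after
    # activation. The 4-hour window is an assumed value, not a SPRUCE default.
    from datetime import datetime, timedelta

    TOKEN_LIFETIME = timedelta(hours=4)

    def token_is_live(activated_at, now=None):
        """True while urgent jobs may still be submitted under this token."""
        now = now or datetime.utcnow()
        return now < activated_at + TOKEN_LIFETIME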

[top]
Do I need a separate token for each job and every person on my team?

No, the whole team working on a single emergency computation at a given time can use the same token. Users are added to the token by logging into the Portal. Any number of jobs can be submitted (at varying priority levels if needed) as long as the token is alive. Once its lifetime has expired, a new token must be activated and users added to it.

[top]
Can Urgent Computing jobs have different priorities?

Yes, we currently support three levels of priority: critical (red), high (orange), and important (yellow). Naturally, there is no exact definition of each priority, but the intent is for jobs with higher priority to displace lower-priority jobs if resources are limited. The scientist submitting the job must choose the priority (a small, hypothetical submission sketch follows the list). Guidelines are as follows:

Critical (Red)
A large, imminent, life-threatening condition requires immediate HPC computing. Think "Disaster of Biblical Proportions".
High (Orange)
Urgent Computing is needed, real-life danger exists and timely results will reduce impact.
Important (Yellow)
Results are very important, but jobs should not supplant high priority or critical priority jobs.
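As a concrete illustration, the sketch below shows how a submission wrapper might accept one of the three urgency levels. The command name spruce_sub and the --urgency flag are hypothetical placeholders, not the documented SPRUCE interface; consult the User Guide for the real submission procedure.

    # Hypothetical submission wrapper: the command name and flag are made up.
    import subprocess

    URGENCY_LEVELS = {"red": "critical", "orange": "high", "yellow": "important"}

    def submit_urgent_job(job_script, urgency):
        """Submit a job script at the chosen urgency level (red/orange/yellow)."""
        if urgency not in URGENCY_LEVELS:
            raise ValueError("unknown urgency level: " + urgency)
        # A Right-of-Way Token must already be activated for this user.
        subprocess.run(["spruce_sub", "--urgency", urgency, job_script], check=True)

    # submit_urgent_job("wildfire_forecast.sh", "orange")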
[top]
How soon are Urgent Computing jobs run? How are they handled?

When Urgent Computing jobs are submitted to a resource, they will be run as soon as practical, based on the job's priority and local Urgent Computing Resource Policy. For example, if a site immediately kills all existing jobs to make room for a top-priority Urgent Computing job, the job will start within minutes. On the other hand, sites that don't support special actions for Urgent jobs will just run the job when it finally becomes next in the queue.

[top]
What are the possible Urgent Computing Resource Policies?

Each site participating in SPRUCE decides how it will handle urgent jobs. Generally, there are four policies a site can use (a short sketch of these policies follows the list):

No Support
Submitted jobs run normally.
Next To Run
Existing jobs will complete, but the Urgent Computing job will be next, before anything else that may be waiting in the queue.
Automated Preemption
Existing (normal priority) jobs are preempted. They are either killed or checkpointed, and the urgent job immediately started. The time to start the job is nearly constant, possibly several minutes.
Human-in-the-loop Preemption
A site manager must make the decision about killing or preempting existing jobs, and so there is a human in the loop.
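The sketch below restates the four policies as a simple dispatch table, purely for illustration; in practice each site implements its chosen behaviour inside its local scheduler, and the names here are assumptions.

    # Illustrative dispatch over the four site policies described above.
    from enum import Enum

    class UrgentPolicy(Enum):
        NO_SUPPORT = "no_support"                # urgent job runs like any other
        NEXT_TO_RUN = "next_to_run"              # placed at the head of the queue
        AUTOMATED_PREEMPTION = "auto_preempt"    # running jobs killed/checkpointed
        HUMAN_IN_THE_LOOP = "human_in_the_loop"  # operator confirms preemption

    def handle_urgent_job(policy, job_id):
        """Describe what the site does with an incoming urgent job."""
        actions = {
            UrgentPolicy.NO_SUPPORT: "queue %s normally" % job_id,
            UrgentPolicy.NEXT_TO_RUN: "run %s before anything else waiting" % job_id,
            UrgentPolicy.AUTOMATED_PREEMPTION: "preempt running jobs, start %s now" % job_id,
            UrgentPolicy.HUMAN_IN_THE_LOOP: "page the operator to make room for %s" % job_id,
        }
        return actions[policy]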
[top]
Who decides what is an emergency?

This is an often-asked question, but the answer is actually relatively straightforward: scientists who hold Right-of-Way Tokens get to decide. Currently, everyone in the United States has access to the telephone emergency 911 system. There is no policy board deciding who gets access, and nobody predicting ahead of time who might have a bona fide emergency and therefore permission to call. Instead, there is very strong social pressure and civil penalty for misusing the system. Similarly, if a research group has an Urgent Computing application with allocated cycles at a supercomputer center, they can request and receive Right-of-Way Tokens. Of course, using their tokens when there is no critical need (dialing 911 when there is no emergency) may get their privileges on the system revoked. However, scientists are a trustworthy lot, and we do not expect problems.

[top]
How does SPRUCE affect security?

SPRUCE does not have any effect on security. The "authentication" and "authorization" mechanisms for users to log in and submit jobs are not modified. A user must still have a valid account on the target machine and the permissions to access and run jobs on the platform. SPRUCE only changes how soon the job will run, and therefore how the resource is utilized.

[top]
What is Warm Standby?

When an emergency happens and urgent computing is required, there is no time to port applications or tune computer codes. Unless the application is already ready and waiting for the call, like a fireman at a fire station, it will probably be too late.

Therefore, we need to keep the applications used for Urgent Computing ready and waiting to be launched at a moment's notice. We call this "Warm Standby". The applications are completely prepared and tested on the platform, at scale, and have been shown to be accurate. The only things missing are the input parameters and where to send the results. Applications in Warm Standby are periodically tested for readiness and correctness, just like emergency civil defense sirens and the Emergency Alert System at a radio station.
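A periodic readiness test can be as simple as re-running a small validation case on a schedule and alerting when it fails. The sketch below is one way to express that idea; the test command, schedule, and alerting mechanism are assumptions, not part of SPRUCE itself.

    # Sketch of a periodic Warm Standby readiness check (assumed weekly).
    import subprocess
    import time

    CHECK_INTERVAL = 7 * 24 * 3600  # assumed: test once a week

    def readiness_check(test_command):
        """Run the application's small validation case; True if it still passes."""
        result = subprocess.run(test_command, capture_output=True)
        return result.returncode == 0

    def warm_standby_loop(test_command):
        while True:
            if not readiness_check(test_command):
                print("WARNING: warm-standby application failed its readiness test")
            time.sleep(CHECK_INTERVAL)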

[top]
What is SPRUCE Advisor?

One of the defining features of SPRUCE is the ability for resource providers to define their own policies on how urgent computation requests are handled. As such, a "red" urgent job at one resource may result in currently executing jobs being preempted, while another resource may simply designate the incoming request as "next-to-run". This flexibility further complicates the issue of resource selection: how does a user select the resource with the best likelihood of meeting a given deadline? To aid users in resource selection, SPRUCE users have the option of querying an automated SPRUCE "Advisor".

For a given workflow and deadline, the Advisor determines the likelihood of meeting the deadline on a pre-selected subset of resources at each urgency level. The likelihood is determined by generating a bound on the total turnaround time of the workflow. The total turnaround time consists of the transfer delay (i.e., input/output file staging), the pre-allocation delay (i.e., batch queue delay), and the execution delay. The Advisor makes use of historic data (e.g., NWS probe data, batch queue history, resource policy, etc.), live data (e.g., current queue state), and application-specific data (e.g., past performance history). From the ranked list of resources the Advisor generates, the user can identify the resource(s) most likely to meet their deadline. We are currently producing an Advisor prototype.
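The bound itself is simply the sum of the three delays, estimated per resource and per urgency level. The sketch below shows the idea with made-up numbers and resource names; the Advisor's actual models and data sources are those described above.

    # Sketch of the turnaround bound: transfer + queue (pre-allocation) + execution.
    def turnaround_bound(transfer_s, queue_s, execution_s):
        """Upper bound (seconds) on total workflow turnaround on one resource."""
        return transfer_s + queue_s + execution_s

    def rank_resources(estimates, deadline_s):
        """Rank resources by estimated turnaround and note which meet the deadline."""
        bounds = {name: turnaround_bound(*parts) for name, parts in estimates.items()}
        ranked = sorted(bounds.items(), key=lambda item: item[1])
        return [(name, bound, bound <= deadline_s) for name, bound in ranked]

    # Hypothetical estimates: (transfer, queue, execution) in seconds per resource.
    # rank_resources({"siteA": (120, 600, 1800), "siteB": (300, 60, 2400)}, 3600)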

[top]
I am a resource provider, how do I add SPRUCE functionality to my Grid?

There is a lot of documentation available that should answer most questions. SPRUCE is built on top of the local job manager or scheduler. We have versions supporting most popular job managers, such as Torque, Moab, PBS, LoadLeveler, and Catalina. If you don't find the distribution you are looking for, please Contact Us for more information.

[top]
What should I do if I have an application suitable for Urgent Computing?

Great! If you are looking to integrate SPRUCE into your existing workflows, we offer web services and plenty of documentation. For more information on how to use the system, please read our User Guide. If you are looking to request tokens to use on an existing SPRUCE-enabled production Grid, please Contact Us for more information.

[top]
This sounds great, is SPRUCE running in production anywhere?

Yes! The SPRUCE system is currently deployed on NSF TeraGrid resources at six sites: the University of Chicago/Argonne National Laboratory (UC/ANL), the National Center for Supercomputing Applications (NCSA), the National Center for Atmospheric Research (NCAR), Purdue University, the San Diego Supercomputer Center (SDSC), and the Texas Advanced Computing Center (TACC). Work is underway to bring other sites on board as well. We are also deployed on the Louisiana Optical Network Initiative (LONI) machines and on the Virginia Tech cluster. We are looking to extend the system into other institutions, so if you think your organization may benefit from this capability, please Contact Us.

[top]
Who can join the SPRUCE network?

Any application team that needs Urgent Computing during emergencies, or any resource provider who believes this capability will enhance their system, can join us! Please Contact Us so we can provide software customized for your needs.

[top]
Any other exciting work happening?

Absolutely! We are working on many interesting projects and collaborations. Work on the Advisor component is progressing rapidly, and we hope to become an official TeraGrid CTSS component. We are also exploring new avenues with Condor (urgent computing for high-throughput workloads) and HARC (advance reservations via tokens).

[top]