Workflow

Introduction

This document presents an overview of the SPRUCE workflow. It uses a lot of Urgent Computing jargon, so please read the FAQ first to frame the discussion.

Steps

[Figure: SPRUCE Workflow, Parts 1 and 2]

Step 1: An Urgent Computing request is created. This process begins with a trigger. The trigger could be a phone call to a scientist, or an automated message from a tsunami alarm buoy. The Urgent Computing request includes information on how quickly the results are needed (the deadline) and the criticality of the need. Other important details may include emergency contact information, initial data sets, and so on.
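As an illustration only, the information carried by a request in Step 1 could be modeled as a small record. This is a minimal sketch, not SPRUCE's actual data format: the class name `UrgentRequest`, the `Criticality` levels, and all field names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class Criticality(Enum):
    # Hypothetical levels; the text only says a criticality is recorded.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class UrgentRequest:
    # All field names here are illustrative assumptions.
    trigger: str                      # e.g. a phone call or a buoy alert
    deadline: datetime                # when the results are needed
    criticality: Criticality
    emergency_contacts: list = field(default_factory=list)
    initial_datasets: list = field(default_factory=list)

    def time_remaining(self, now: datetime) -> timedelta:
        """Time left before the deadline (negative once it has passed)."""
        return self.deadline - now


created = datetime(2024, 1, 1, 12, 0)
request = UrgentRequest(
    trigger="tsunami buoy alert",
    deadline=created + timedelta(hours=3),
    criticality=Criticality.HIGH,
    emergency_contacts=["on-call scientist"],
)
print(request.time_remaining(created))  # 3:00:00
```

Keeping the deadline as an absolute time makes it easy to recompute the remaining time at every later step of the workflow.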

[Figure: SPRUCE Right-of-Way Token]

Step 2: The scientist or operator receiving the Urgent Computing request must activate the SPRUCE system with a Right-of-Way Token (shown above). Right-of-Way Tokens are 16-digit numbers printed on a special card that is issued in advance and can be carried in your wallet. The Token is entered into the SPRUCE Science Gateway portal and authenticated. Right-of-Way Tokens have a lifetime of up to 24 hours. After the Token expires, a new Token may be used.
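The two Token properties stated in Step 2 (16 digits, lifetime of up to 24 hours once activated) can be sketched as simple checks. This is a hedged illustration, not the portal's real authentication logic: the function names, the tolerance for spaces and dashes, and the idea of clamping a requested lifetime to 24 hours are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

TOKEN_LIFETIME_MAX = timedelta(hours=24)  # "up to 24 hours" per the text


def is_well_formed(token: str) -> bool:
    """Check for 16 digits; ignoring spaces and dashes is an assumption."""
    digits = token.replace(" ", "").replace("-", "")
    return len(digits) == 16 and digits.isascii() and digits.isdigit()


def is_active(activated_at: Optional[datetime],
              lifetime: timedelta,
              now: datetime) -> bool:
    """True while an activated token is still within its lifetime."""
    if activated_at is None:          # never activated at the portal
        return False
    lifetime = min(lifetime, TOKEN_LIFETIME_MAX)
    return activated_at <= now < activated_at + lifetime


t0 = datetime(2024, 1, 1, 8, 0)
print(is_well_formed("1234 5678 9012 3456"))                       # True
print(is_active(t0, timedelta(hours=24), t0 + timedelta(hours=23)))  # True
print(is_active(t0, timedelta(hours=24), t0 + timedelta(hours=25)))  # False
```

A real deployment would also verify the Token against the Gateway's records; the format check above only catches typing mistakes.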

[Figure: SPRUCE Workflow, Parts 3 - 5]

Step 3: Choose a site and a resource for the job, along with the Urgent Computing parameters. The job should be submitted to a resource that is extremely reliable and where the code has run successfully many times before. It should also be submitted to a site that aggressively supports Urgent Computing by preempting jobs. Unfortunately, at this time, we don't have any tools to help you easily choose which sites have the most favorable Urgent Computing policy, current queue depth, and availability. Some of that data, however, is available via the SPRUCE portal. We are hoping to find help to expand this area and to make choosing the site that can best meet the deadline for the computation more automated. At the chosen site, the local Globus Gatekeeper and Job Manager handle the request and check that an activated Token is on file at the Gateway. If everything is correct, the job is submitted to the SPRUCE priority queue.
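Since no site-selection tool exists yet, the following is only a sketch of how the criteria named in Step 3 might be combined by hand. The `Site` record, its fields, the example site names, and the ordering rule (preemption first, then shorter queue, then more past successes) are hypothetical choices for illustration, not SPRUCE policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    # Hypothetical fields, loosely based on the criteria in Step 3.
    name: str
    preempts_for_urgent: bool   # site preempts running jobs for urgent work
    queue_depth: int            # jobs currently waiting
    successful_runs: int        # times this code has run there successfully


def rank_sites(sites, min_successful_runs=1):
    """Drop sites where the code is unproven, then order the rest."""
    candidates = [s for s in sites if s.successful_runs >= min_successful_runs]
    return sorted(
        candidates,
        key=lambda s: (not s.preempts_for_urgent, s.queue_depth, -s.successful_runs),
    )


sites = [
    Site("site-a", preempts_for_urgent=False, queue_depth=5, successful_runs=40),
    Site("site-b", preempts_for_urgent=True, queue_depth=30, successful_runs=12),
    Site("site-c", preempts_for_urgent=True, queue_depth=10, successful_runs=0),
]
best = rank_sites(sites)[0]
print(best.name)  # site-b
```

Here site-c is excluded because the code has never run there, and site-b outranks site-a because preemption matters more than queue depth under this particular ordering.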

Step 4: The job waits in the priority queue until resources are available. For sites that strongly support Urgent Computing by preempting existing jobs, the wait in the priority queue will be very short.

Step 5:  The job runs on a supercomputing resource.  The length of time from when the job was submitted until it eventually begins execution on the big iron is called the Time to Begin.  Adding the Time to Begin to the wall clock run time for the job gives us the Time to Solution, which must be less than the deadline for the results to be useful.
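The relationship in Step 5 is simple arithmetic: Time to Solution = Time to Begin + run time, and the result must be less than the deadline. A minimal sketch follows; the function names are assumptions, and the deadline is taken to be measured from the moment the job is submitted.

```python
from datetime import timedelta


def time_to_solution(time_to_begin: timedelta, run_time: timedelta) -> timedelta:
    """Queue wait plus wall clock run time."""
    return time_to_begin + run_time


def meets_deadline(time_to_begin: timedelta,
                   run_time: timedelta,
                   deadline: timedelta) -> bool:
    """Results are useful only if they arrive before the deadline."""
    return time_to_solution(time_to_begin, run_time) < deadline


deadline = timedelta(minutes=60)
run_time = timedelta(minutes=45)

# With preemption the wait is short and the deadline is met.
print(meets_deadline(timedelta(minutes=10), run_time, deadline))  # True
# A longer wait in the queue pushes the result past the deadline.
print(meets_deadline(timedelta(minutes=20), run_time, deadline))  # False
```

Because the run time is usually fixed by the code and the problem size, the Time to Begin is the part that site choice and preemption policy can actually reduce.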

[Figure: SPRUCE Workflow, Parts 6 - 7]

Step 6: After the results have been collected and possibly visualized, a Domain Specialist Interpreter will help read and explain the results. Similar to a radiologist at a hospital, the Interpreter understands which conditions lead to reliable results, when the program may be less accurate, and what the graphs and values all mean.

Step 7: Finally, the decision maker uses the results and discusses the important aspects of the data with the Interpreter before making her plans. If the situation has changed dramatically, the data may be stale, and a new Urgent Computing run may need to be started and the whole process repeated.