Welcome to the Workflows Quickstart Tutorial, Part 1. Learn how to launch and monitor a preconfigured workflow to analyze a single entity of genomic data in Terra.
There are three parts to the Workflows Quickstart. Each is independent, with its own learning objectives and time and cost estimates to complete. You should do the three in order, but you don’t need to do them in one sitting.
What you will learn
This exercise will give you a feel for the mechanics of running a workflow successfully on data in the data table. For input, you’ll use downsized sample data (stored in a public Google bucket) already referenced in the workspace "sample" table. The workflow is set up to take input from and write output to the table. For Part 1 of the Quickstart, the setup has been done for you. You will get an overview of the form, choose data (pre-loaded in the workspace table), and launch the workflow.
How much will it cost? How long will it take?
The exercise should take no more than fifteen minutes (unless you are in the queue a long time) and cost a few pennies.
Hint: Right-click to open the tutorial demo in a new tab
Before you start - Clone your own Quickstart workspace
The Workflows-Quickstart featured workspace is “Read only”. For hands-on practice, you'll need to be able to run workflows and store data in your workspace bucket. Making your own copy of the workspace allows you to do that since you're the owner. If you haven't already done so, you'll need to make your own copy of this workspace following the directions below.
Start by clicking on the circle with three dots at the upper right-hand corner and selecting Clone from the dropdown menu. Then follow the directions below to complete the form.
- Rename your copy something memorable
It may help to write down the name of your workspace
- Choose your billing project
Note that this can be "getting started" credits from GCP! Don’t worry, you’ll have plenty left over when you’ve completed the Quickstart.
- Do not select an Authorization Domain, since these are only required when using restricted-access data
- Click the “Clone Workspace” button to make your own copy
Step 1: Open the workflow setup form
Once you're in your own copy of the workspace, you'll be ready to get hands-on to learn about setting up and running workflows!
1.1. Start by going to the Workflows page.
1.2. Select the Part1_CRAM_to_BAM workflow by clicking on the card.
This will reveal the workflow configuration form where you'll set up the workflow to run on your data.
Some details about the quickstart workflows: The workflows in Parts 1 and 2 of the Quickstart are identical - they convert genomic files from one format (CRAM) to another (BAM) for downstream analysis. They’ve been renamed to simplify the instructions. This workflow should complete in just a few minutes once it starts running.
Step 2: Select data
2.1. Confirm root entity type = "sample". This is the table that contains the input data.
2.2. Click the "Select Data" button. This will take you to the Select Data form (below).
2.3. Select the Choose specific rows to process radio button.
2.4. Select the NA12878 sample.
2.5. Click the blue OK button to finalize your selection.
Additional pre-configured runtime options
Additional runtime and cost-saving options have been left at their defaults. These are fine to use in many cases (including the Quickstart). If you're curious, the default settings are listed below.
1. Workflow version (dropdown)
2. Input definition: "Run workflow(s) with inputs defined by the data table" (radio button)
3. Cost-saving options (follow the links for more information about each option)
- Use call caching (checked)
- Delete intermediate outputs (unchecked)
- Use reference disks (unchecked)
- Retry with more memory (unchecked)
Step 3: Confirm and launch
3.1. In the workflow configuration form, click the blue RUN ANALYSIS button to submit your workflow.
3.2. Click LAUNCH in the Confirm launch popup.
3.3. You'll be directed to the Job History page where you can monitor your submission status (highlighted below).
For job status updates, refresh the page.
Your submission is complete! What to expect
When your job completes successfully, you'll see a green checkmark in the Status column of the Job History page. This should only take a couple of minutes once the job starts running (see What happens when you launch a workflow for more details about things that can cause your job to remain in the submitted or queued stage).
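If you prefer to reason about job status outside the Job History page, the idea of "all workflows done and successful" can be sketched in a few lines of Python. This is only an illustration: the record layout (a `workflows` list whose items carry a `status` string) and the set of terminal state names are assumptions made for this example, not something the tutorial specifies.

```python
from collections import Counter

# Assumed terminal states; the names are illustrative, not taken from the tutorial.
TERMINAL_STATES = {"Succeeded", "Failed", "Aborted"}


def summarize_submission(submission: dict) -> dict:
    """Count workflow statuses and report whether every workflow has finished."""
    statuses = [wf.get("status", "Unknown") for wf in submission.get("workflows", [])]
    counts = Counter(statuses)
    # An empty submission is treated as "not finished" rather than trivially done.
    finished = bool(statuses) and all(s in TERMINAL_STATES for s in statuses)
    succeeded = finished and all(s == "Succeeded" for s in statuses)
    return {"counts": dict(counts), "finished": finished, "succeeded": succeeded}


# Hypothetical record for the single NA12878 workflow launched in this exercise.
example = {"workflows": [{"entity": "NA12878", "status": "Running"}]}
print(summarize_submission(example))
```

In this sketch, a submission whose only workflow is still running is reported as not finished, which matches the point at which you would refresh the page and check again.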
Once you see the green check, go back to the Data page.
Your data table will include three new columns (analysis_ready_BAI, analysis_ready_BAM, and CRAM_to_BAM_valdation_report).
Where’s the (generated) data stored? Generated data from a workflow is stored by default in the workspace bucket. You can check that the files are there by clicking on the “File” icon (bottom of the far left column) in the Data tab. Note that you will need to go down several directory levels to get to the data files (NA12878.bam and NA12878.bai).
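Because the output files sit several directories deep, it can help to think of the search as a filter over object names. The following is a minimal sketch, assuming you already have a list of object names from the bucket; the directory names in the example listing are hypothetical and will not match your workspace.

```python
from pathlib import PurePosixPath


def find_outputs(object_names, sample_id, suffixes=(".bam", ".bai")):
    """Return object names whose file name is <sample_id><suffix>, at any depth."""
    wanted = {f"{sample_id}{suffix}" for suffix in suffixes}
    return [name for name in object_names if PurePosixPath(name).name in wanted]


# Hypothetical listing; real directory names will differ.
listing = [
    "submissions/sub-0001/CramToBam/wf-0001/call-convert/NA12878.bam",
    "submissions/sub-0001/CramToBam/wf-0001/call-convert/NA12878.bai",
    "submissions/sub-0001/CramToBam/wf-0001/call-convert/stdout",
]
for path in find_outputs(listing, "NA12878"):
    print(path)
```

Matching on the file name alone, rather than the full path, means the sketch does not depend on how many directory levels separate the bucket root from the data files.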
Follow-up (thought) questions
Answer: It now has additional columns that include links to the generated data in the workspace bucket. Because the columns are added to the input table, generated data is associated with input data automatically.
Sample table after a completed run
Answer: The workflow generated the columns automatically because it was set up to write the generated metadata to the data table.
Hint: select the workflow card in the Workflows tab and compare the “Outputs” attributes to the new columns in the sample table.
Outputs configuration
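One way to picture how the Outputs attributes turn into new table columns is the small sketch below. It assumes each output is mapped to the current row with an expression of the form `this.<column>`; the workflow output names (`CramToBam.*`), the input column name, and all `gs://example-bucket` paths are hypothetical, while the three new column names are the ones listed in this tutorial.

```python
def write_outputs_to_row(row, output_expressions, produced):
    """Return a copy of row with each produced value stored in its target column."""
    updated = dict(row)
    for output_name, expression in output_expressions.items():
        if not expression.startswith("this."):
            continue  # only outputs mapped to the current row become columns
        column = expression[len("this."):]
        updated[column] = produced[output_name]
    return updated


# Hypothetical input row and output names, for illustration only.
row = {"sample_id": "NA12878", "CRAM": "gs://example-bucket/NA12878.cram"}
output_expressions = {
    "CramToBam.bam": "this.analysis_ready_BAM",
    "CramToBam.bai": "this.analysis_ready_BAI",
    "CramToBam.report": "this.CRAM_to_BAM_valdation_report",
}
produced = {
    "CramToBam.bam": "gs://example-bucket/outputs/NA12878.bam",
    "CramToBam.bai": "gs://example-bucket/outputs/NA12878.bai",
    "CramToBam.report": "gs://example-bucket/outputs/NA12878.validation_report",
}
new_row = write_outputs_to_row(row, output_expressions, produced)
print(sorted(set(new_row) - set(row)))
```

The table cells hold links to the files rather than the files themselves, which is why the sketch stores bucket paths as the column values.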
Answer: The new columns include metadata links to the generated data. The actual data is stored in the workspace bucket, which you can access by clicking on the "Files" link from the Data page. Note that you will need to go down several directory levels to find the actual data files.
Congratulations! You've completed Part 1 of the Workflows Quickstart!