Executing jobs on the cloud can be a little scary if you don't know how much you're spending. For peace of mind that you're not going over a project budget, read on for two ways to understand the cost of an executed workflow.
Workflow cost reporting: Built-in versus notebooks
There are primarily two means of tracking/estimating the cost of running workflows in Terra: Built-in (found in the Job History) and using a cost-reporting notebook. See the important differences below.
Built-in cost reports take a few hours to compile
Results from Terra's built-in cost report aren't available until some hours after the workflow has completed. It's best to allow a day to pass before checking in on the cost.
Notebooks allow cost reporting for workflows in progress
You can run the cost-estimating notebook at any point in the workflow execution and it will list how much the workflow has cost thus far. You can use this as an early indicator of whether you may exceed your budget.
Notebooks give more granular cost details
Unlike Terra's cost report, which only provides the total job cost, the notebook estimates task-level and instance-level costs (i.e. the cost of running a particular task within a workflow). This is useful when you want to know which task(s) are responsible for most of the cost.
The notebook gives an estimate; built-in reporting is a final cost
However (see below), this estimate is usually fairly accurate.
Option 1: Terra's built-in cost reporting
To help make cloud computing costs more transparent, you can have Terra display the cost for each executed workflow (including failed and aborted workflows) in the Job History page. In order to access spend reporting, you will configure spend reporting for your Terra Billing Project. Most users will also need to do a one-time setup on GCP, to allow Terra access to GCP's cost reports.
How accurate is the cost report? Terra's built-in cost reports come directly from Google. This is the actual cost for the given workflow (i.e. not an estimate). They are generated by accessing Google's Billing Repository, and can take several hours to log.
Step 1. Set up access (once per Cloud Billing account)
If you are not using a Broad Institute Google Cloud Billing account, you will need to grant Terra access to cost reports in the GCP console to enable workflow spend reporting.
Not able to follow these directions? You must have "Owner" permission on the Terra Billing Project and "Owner" or "Admin" permission on the Google Cloud Billing account. If you are not able to follow the directions below, or do not see the options in the screenshots (or they are greyed out), it is likely because you do not have sufficient permissions on the GCP Billing account. You will need to ask the owner for admin privileges on the Cloud Billing account or to set up a BigQuery dataset for billing export (follow directions below).
This may be the case, for example, if you are using a third-party reseller such as Onix.
-
1.1. Navigate to the billing account management page in Google Cloud Console.
1.2. Click on the name of the billing account you would like to enable billing exports for. If you have more than one, you will need to repeat for each one associated with a Terra Billing Project.
1.3. Click on Billing export in the left sidebar.
1.4. Under Standard usage cost, click the Edit Settings button.
1.5. Next you will configure/create the BigQuery dataset to export the billing data to.
Note that you must have an active Google project - not a Terra Billing project - tied to this billing account for this step.
The Billing project you select cannot be a Terra-generated workspace project. To check, make sure the ID in the dropdown is not of the format <Terra-Billing-project-name>--<workspace-name>. If the project name has that formatting, it was created by Terra and will not work for this step.
In that case, you will need to create a GCP-native project first. You can find step-by-step instructions here.
If you don't have a project, Google will prompt you to create one.
1.6. Create the project (if you already have an existing Google project, you may skip this step).
1.7. Select the Google project. This will be used to host the billing export BigQuery dataset (it cannot be a workspace project created by Terra; see 1.5 above).
1.8. From the dropdown menu, select (or create) the BigQuery dataset to store the billing export data.
If you don't already have a dataset in this Google project, Google will prompt you to create one: From this menu, click on Create new dataset as shown above (1.8), fill in this form, and select the Create dataset button at the bottom.
1.9. Click Save.
1.10. To view the BigQuery dataset, click on the link (the name of your BigQuery dataset) in the BigQuery export tab, to the right of Dataset name.
1.11. From the dataset tab, copy the Google project ID (to the left of the period) and BigQuery dataset name (to the right of the period) to a safe place. You will need them in the next step (2.4) in Terra.
The format is Google project ID : BigQuery dataset name (separated by a colon).
1.12. Hover over the person icon to the right of the project and dataset name and click Share dataset.
If this option is greyed out, it is most likely because you don't have the right permission. You will need to ask the owner of the Google Cloud Billing account to grant you permission to share a BigQuery dataset.
1.13. Type spend-reporting@terra.bio into the Add principals field and select BigQuery Data Viewer from the Select a role dropdown. This grants Terra permission to access the dataset you've just set up.
1.14. Click Add (to the right of the dropdown), then Done (at the bottom of the form).
Google BigQuery billing exports are now configured for this Cloud Billing account. You can confirm by expanding the menu under BigQuery Data Viewer (below).
The next step is to configure the workflow spend report.
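As a small aid for steps 1.5 and 1.11, the sketch below splits a copied dataset identifier into its project and dataset parts and flags project IDs that look Terra-generated. It is only an illustration: the function names are hypothetical, the double-hyphen test is a heuristic based on the `<Terra-Billing-project-name>--<workspace-name>` format described in step 1.5, and the helper accepts either a colon or a period as the separator because both are mentioned in step 1.11.

```python
import re
from typing import Tuple

# Assumption: dataset names contain only letters, digits, and underscores,
# so the last ":" or "." in the identifier separates project from dataset.
_DATASET_ID = re.compile(r"^(?P<project>.+)[:.](?P<dataset>[A-Za-z0-9_]+)$")


def split_dataset_id(identifier: str) -> Tuple[str, str]:
    """Split 'project:dataset' (or 'project.dataset') into its two parts."""
    match = _DATASET_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a project/dataset identifier: {identifier!r}")
    return match.group("project"), match.group("dataset")


def looks_terra_generated(project_id: str) -> bool:
    """Heuristic from step 1.5: Terra-created project IDs contain '--'."""
    return "--" in project_id


# Placeholder identifier; substitute the one you copied in step 1.11.
project, dataset = split_dataset_id("my-google-project:my_billing_dataset")
print(project, dataset, looks_terra_generated(project))
```

The two values returned by `split_dataset_id` correspond to the Dataset Project Name and Dataset Name fields requested in step 2.4.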
-
1.1. Navigate to the billing account management page in Google Cloud Console
1.2. Click on the name of the STRIDES billing account you would like to enable billing exports for. It will have the form NIH.NHLBI.BDC.Cohort#.Fellow.00#.
1.3. Click on "Billing export" in the left sidebar.
1.4. Under Standard usage cost, click the "Edit Settings" button.
1.6. From the dropdown menu under Dataset ID, select the Billing dataset. This is the pre-configured BigQuery dataset to store the billing export data.
1.7. Click Save.
1.8. To view the BigQuery dataset, click on the Billing link in the BigQuery export tab - to the right of Dataset name under Standard usage cost.
1.9. Copy the Google project ID to a safe place. You will need it in the next step (2.4) in Terra.
1.10. Hover over the person icon to the right of the project and dataset name and click Share dataset.
1.11. Type spend-reporting@terra.bio into the Add principals field and select BigQuery Data Viewer from the Select a role dropdown. This grants Terra permission to access the dataset you've just set up.
1.12. Click Add (to the right of the dropdown), then Done (at the bottom of the form).
Google BigQuery billing exports are now configured for the STRIDES Cloud billing account. You can confirm by expanding the menu under BigQuery Data Viewer (below).
-
If you use a third-party reseller, you will need to ask them to set up the spend reporting (in GCP console) for you following the steps below.
1. Create a BigQuery dataset for exporting billing data on GCP.
2. Share the dataset with spend-reporting@terra.bio as BigQuery Data Viewer.
3. Tell you the name and project of the dataset (to use in Step 2 below).
-
Broad GCP Billing account users can skip this first step and go directly to Step 2: Configure the workflow spend report.
Before you move on to Step 2
If you just completed Step 1, we recommend waiting several hours before starting Step 2 so that billable activity has time to be recorded in BigQuery. Otherwise, you may receive an error that the dataset cannot be found.
Step 2. Configure workflow spend report
Now that you have set up the GCP billing data export, you will set up spend reporting in Terra. You will only need to do this once per Terra Billing project.
2.1. Go to the Billing page by first clicking your name and selecting Billing from the main navigation menu (top left of any page in Terra).
2.2. Select the Terra Billing project associated with the workspace where you're running your workflow analysis.
Note that you need to be an owner to follow these steps. You'll know you're the owner if you see the Terra billing project listed under Owned by You in the top left column.
2.3. Click the pencil icon beside Workflow Spend Report Configuration to edit.
2.4. Fill in the Dataset Project Name (Project ID) and Dataset Name from GCP console (step 1.11 for general users or step 1.9 for STRIDES - above).
2.5. Click the OK button to save.
You will not get a confirmation message, but as long as you don't get an error message, your configuration should be saved.
If you get an error at this step (for example, that the dataset cannot be found), it is most likely because there is currently no data in the dataset.
To remedy this, try the following:
1. Run a small workflow in the workspace.
2. Wait 2-3 hours and follow steps 2.1 - 2.5 again.
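Before retrying, it can help to check whether the billing export has started writing to the dataset. The sketch below only builds a BigQuery SQL string that you could paste into the BigQuery console. It assumes that standard usage cost exports are written to tables whose names begin with `gcp_billing_export_v1_`; the function name and the example project and dataset values are hypothetical placeholders.

```python
def billing_tables_query(project_id: str, dataset: str) -> str:
    """Build a BigQuery SQL query listing billing export tables in a dataset."""
    return (
        "SELECT table_name\n"
        f"FROM `{project_id}.{dataset}.INFORMATION_SCHEMA.TABLES`\n"
        # Assumption: standard usage cost export tables use this name prefix.
        "WHERE STARTS_WITH(table_name, 'gcp_billing_export_v1_')"
    )


# Placeholder values; substitute the ones you saved in step 1.11 (or 1.9).
print(billing_tables_query("my-google-project", "my_billing_dataset"))
```

If the query returns no rows, the export has probably not produced any data yet, which is consistent with the error described above.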
How to find built-in workflow cost reporting
1. Navigate to the Job History page of the workspace.
This page includes all workflow submissions for the workspace.
2. Click the submission of interest in the far left column.
3. If your spend reporting has been correctly set up, you will find the Total Run Cost at the top right corner. Note that the spend report can take up to 24 hours to appear in Terra, as GCP cost reports have some delay.
If your submission included more than one execution, each will be listed separately under "Run Cost".
Option 2: Workflow cost estimate via Jupyter Notebooks
You can estimate the cost of running your workflow with a Workflow Cost Estimator notebook created for BioData Catalyst Powered by Terra (available in the biodata-catalyst/BioData Catalyst Collection workspace). Follow the instructions below to find and run the notebook. You can also find the Python code in the BioData Catalyst Git repository DataBiosphere/bdcat_notebooks.
Step 1. Import the notebook to your workspace
-
1.1. Navigate to the Notebooks page of the biodata-catalyst/BioData Catalyst Collection workspace. You will need the Workflow Cost Estimator notebook.
1.2. Click the three-dot icon to the left of the Workflow Cost Estimator notebook.
1.3. Select Copy to another workspace.
1.4. Import the notebook to the workspace where you ran or are running the workflow by entering your workspace name in the Destination field. Then click Copy.
-
1.1. Navigate to the Analyses page of the biodata-catalyst/BioData Catalyst Collection workspace. You will need the Workflow Cost Estimator notebook.
1.2. Click the three-dot icon to the right of the Workflow Cost Estimator notebook card.
1.3. Select Copy to another workspace.
1.4. Import the notebook to the workspace where you ran or are running the workflow by entering your workspace name in the Destination field. Then click Copy.
Step 2. Run the notebook
2.1. Click on the Workflow Cost Estimator notebook (in the workspace Notebooks or Analyses tab).
2.2. Either Open the notebook, or run in Playground mode.
2.3. If Jupyter is not set up, click the Create button to start a default Jupyter Cloud Environment.
2.4. When Jupyter is running, run each cell in the notebook. The notebook itself describes what each cell does and how to use it. For a description of how to run a notebook in Terra, see Interactive statistics and visualization with Jupyter notebooks, or the Interactive Jupyter notebooks video.
What to expect
The notebook uses FireCloud Service Selector (FISS) to request information on all the submitted jobs associated with the workspace.
The notebook will list all the submission IDs on the screen, and you'll choose which submission to process for cost estimates. The notebook will then use FISS to obtain metadata information about the particular submission - such as how many VMs were used, the number of CPUs used, the duration for each VM, etc. A cost formula uses this information to calculate the cost of running the workflow. The cost formula is based on the GCP's price estimate per resource.
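To make the idea of a cost formula concrete, here is a minimal sketch of how per-VM metadata (CPUs, memory, disk, duration, preemptible or not) could be turned into per-task cost estimates. This is not the notebook's actual code: the `VmUsage` class, the function names, and every rate in `RATES` are hypothetical placeholders rather than real GCP prices.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical hourly rates in USD. These are NOT real GCP prices; replace
# them with current values from the GCP pricing pages before use.
RATES = {
    "cpu_hour": 0.03,
    "memory_gb_hour": 0.004,
    "disk_gb_hour": 0.00005,
    "preemptible_factor": 0.2,  # assumed fraction of the on-demand price
}


@dataclass
class VmUsage:
    """Resources used by one VM, as they might be read from job metadata."""
    task: str
    cpus: int
    memory_gb: float
    disk_gb: float
    hours: float
    preemptible: bool = False


def vm_cost(vm: VmUsage, rates: Dict[str, float] = RATES) -> float:
    """Estimate the cost of one VM from its resources and run time."""
    compute = vm.cpus * rates["cpu_hour"] + vm.memory_gb * rates["memory_gb_hour"]
    if vm.preemptible:
        compute *= rates["preemptible_factor"]
    disk = vm.disk_gb * rates["disk_gb_hour"]
    return (compute + disk) * vm.hours


def cost_by_task(vms: Iterable[VmUsage]) -> Dict[str, float]:
    """Sum the estimated VM costs for each task name."""
    totals: Dict[str, float] = {}
    for vm in vms:
        totals[vm.task] = totals.get(vm.task, 0.0) + vm_cost(vm)
    return totals


# Made-up usage records, for illustration only.
usage = [
    VmUsage("align", cpus=4, memory_gb=16, disk_gb=100, hours=2.0),
    VmUsage("align", cpus=4, memory_gb=16, disk_gb=100, hours=2.0, preemptible=True),
    VmUsage("sort", cpus=1, memory_gb=4, disk_gb=50, hours=0.5),
]
for task, cost in cost_by_task(usage).items():
    print(f"{task}: ${cost:.4f}")
```

Grouping by task name is what makes it possible to see which task accounts for most of a workflow's estimated cost.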
How accurate are notebook-generated workflow cost estimates?
These cost estimates do not come directly from Google billing. Instead, the notebook calculates a cost estimate based on metadata from Terra. The estimates are usually very close to the real cost, though they could be slightly off. Below are descriptions of what's accounted for in the calculations, and a benchmark comparing the notebook results with Terra's built-in results.
Included GCP costs
Note that estimates tend to be lower than actual costs, because the notebook cost formula does not account for all possible GCP resources. The table below lists each parameter available in a WDL runtime block (i.e. what type of GCP resource is used for a task) and whether it's included in the cost formula.
WDL Runtime Parameters | Accounted for in Formula? |
CPU/GPU* | Yes |
Memory | Yes |
Preemptibles | Yes |
Disk | Yes |
Data Egress | No |
noAddress (rarely used) | No |
cpuPlatform (rarely used) | No |
zones (rarely used) | No |
* Currently, the cost formula assumes all instances are of type N1 listed here, which uses the least expensive type of CPU instance even if GPUs are being used.
Benchmark: Terra cost report versus notebook cost estimates
The two spend report options were benchmarked by running the Cram-to-Bam workflow N times on different CRAM samples. The notebook cost estimates and built-in cost reporting showed an average difference of $0.02 per sample/run.
Note that when running larger sample sets, there will be larger differences in the totals as minor per-sample differences accumulate.
CRAM_to_BAM - Sample Number | Terra Cost Report | Notebook Cost Estimator |
1 | $0.22 | $0.23 |
50 | $14.87 | $13.95 |
100 | $30.39 | $26.59 |
200 | $55.62 | $53.41 |
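The per-sample figure can be checked directly from the table above. The short sketch below divides the absolute difference in each row by the number of samples and averages the four results; the variable and function names are illustrative only.

```python
# (samples, Terra cost report, notebook estimate), copied from the table above
BENCHMARK = [
    (1, 0.22, 0.23),
    (50, 14.87, 13.95),
    (100, 30.39, 26.59),
    (200, 55.62, 53.41),
]


def per_sample_differences(rows):
    """Absolute difference between the two reports, divided by sample count."""
    return [abs(reported - estimated) / samples
            for samples, reported, estimated in rows]


diffs = per_sample_differences(BENCHMARK)
mean_diff = sum(diffs) / len(diffs)
print(f"mean per-sample difference: ${mean_diff:.2f}")  # prints $0.02
```

The total differences grow with the sample count (from $0.01 for one sample to several dollars for 100 or more), while the per-sample difference stays within a few cents.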