How do I retrieve the time and cost of my workflow?
Every featured workspace description includes a subsection giving the estimated time and cost of running a method on the sequence data in the data model. These estimates come from the time and cost of a completed workflow, gathered using a combination of Terra's built-in monitoring feature and Google’s BigQuery service. This document briefly describes how the time and cost are obtained and provides a link to a walkthrough for users interested in using the same approach.
The simpler of the two is the duration of an executed workflow, which is read from the monitor tab on the workspace submission page. Each submitted workflow lists some brief information, including its Submitted, Started, and End times. The submitted and start times are often identical, but they should not be confused. The submitted time is when the workflow was initiated (the “Launch Workflow” icon was clicked), and the start time is when the workflow has reserved a virtual machine (VM) and begins running tasks. Between submission and start there may be a delay while the workflow waits in the queue, caused by a high volume of users, network lag, or some other reason. The duration is therefore calculated from the start and end times listed for each submitted workflow, not from the submitted time.
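The duration calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `workflow_duration` is hypothetical, and the timestamp format is an assumption, so adjust `fmt` to match however the times appear when you copy them from the monitor tab.

```python
from datetime import datetime, timedelta


def workflow_duration(started: str, ended: str,
                      fmt: str = "%Y-%m-%d %H:%M:%S") -> timedelta:
    """Return the run time of a workflow from its start and end times.

    The default format string is an assumption, not the format Terra
    necessarily displays; pass a different `fmt` if needed.
    """
    start = datetime.strptime(started, fmt)
    end = datetime.strptime(ended, fmt)
    if end < start:
        raise ValueError("end time is earlier than start time")
    return end - start


# Example values are made up. Use the Started and End times,
# not the Submitted time, so queue delay is excluded.
duration = workflow_duration("2020-03-01 10:05:00", "2020-03-01 12:35:30")
print(duration)
```

Both timestamps are assumed to be in the same time zone.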
The cost of running a workflow is obtained using BigQuery. BigQuery is a Google Cloud service with a free usage tier that you can think of as a search tool for the metadata associated with a submitted workflow. Given a workflow ID, we can use BigQuery to search Google's database for that workflow and show the specific details of that job.
When a workflow is executed, a workflow ID is created for that particular run. This unique ID distinguishes the workflow from the hundreds to thousands of other workflows being executed in Terra. This ID, along with other billing-related IDs, is used to perform a search in BigQuery. The search results are returned as a tsv file in which each row is a resource usage (e.g. compute, network, use of preemptible) and the columns describe that resource, including its cost. The tsv file can be opened in Google Sheets or a local Excel sheet, and summing the cost column gives the total cost of the resources used by the workflow.
- Workflow data is not available in BigQuery until some hours after the workflow has completed. It's best to allow a day to pass before querying the database.
- The tsv includes columns for start and end time, but the times reported by BigQuery have been unreliable thus far, so you are better off using the times in the workspace monitoring tab.
- The workflow execution metadata used to make the tsv is not automatically exported from your billing project to BigQuery. This export has to be enabled before you can use the service.
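If you would rather not use a spreadsheet, the summing step described earlier (adding up the cost column of the downloaded tsv) can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `total_cost`, the column name `cost`, and the sample rows are all hypothetical, so check the header of your own tsv and pass the matching column name.

```python
import csv
import io
from decimal import Decimal


def total_cost(tsv_file, cost_column="cost"):
    """Sum one column of a tab-separated file.

    `cost_column` is an assumed header name; check your own export.
    Decimal is used so many small charges add up without float rounding.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    if not reader.fieldnames or cost_column not in reader.fieldnames:
        raise KeyError(f"column {cost_column!r} not found in tsv header")
    total = Decimal("0")
    for row in reader:
        value = (row[cost_column] or "").strip()
        if value:  # skip rows with an empty cost cell
            total += Decimal(value)
    return total


# Made-up sample data standing in for a downloaded tsv.
sample = io.StringIO(
    "description\tcost\n"
    "Preemptible compute\t0.012345\n"
    "Network egress\t0.000100\n"
    "Storage\t0.002000\n"
)
print(total_cost(sample))
```

To run it on a real download, open the file and pass the handle, for example `with open("results.tsv", newline="") as f: print(total_cost(f))`, where `results.tsv` is a placeholder file name.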