Last week we were very excited for our colleagues in the gnomAD team, who announced on their blog that the entire gnomAD dataset is now available for direct use or download from Google Cloud as well as Amazon Web Services and Microsoft Azure.
If you're not familiar with gnomAD, the name stands for Genome Aggregation Database; it's an amazing resource for medical and population genetics that aggregates variant information from tens of thousands of human exomes and whole genomes. It's the result of an ongoing collaborative effort between investigators at institutions across the world, which you can read more about here.
My immediate reaction was: this could be really useful to bioinformaticians and geneticists working in Terra! Yet at the same time, I know it can be challenging to figure out how to use a "free-standing" resource like this effectively; there are several ways you could interact with the gnomAD data hosted by Google Cloud from within Terra, and it may not be immediately obvious which approach is best for a given use case.
To address that, I put together a demo workspace that illustrates the three main approaches that I think would be useful:
- Running a workflow on the callset VCFs
- Exploring the callset using Hail in a notebook
- Exploring the callset using BigQuery in a notebook
Here's a brief rundown of each; you can find more details in the Dashboard section of the workspace.
Running a workflow on the gnomAD callset VCFs
If you want to run an analysis that might take a long time and/or can be automated, you probably want to do it as a workflow. But how do you run your workflow on the gnomAD data when all you're given is a web page with a list of file locations?
At this point, I should probably mention that the latest release of the gnomAD dataset (v3.1) includes variant sites-only VCFs for the entire variant callset, as well as a subset of the data in VCFs that includes sample-level genotype information. In both cases, the VCFs are split by chromosome to make file sizes manageable.
It's awkward to have to look up the paths to all those files manually (that's 2 x 24 for anyone who's counting), so my first move was to collate all the Google Cloud Storage file paths provided on the gnomAD website into a data table in my Terra workspace.
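If you'd rather generate such a table programmatically than copy paths by hand, a few lines of Python can build the TSV that Terra expects when importing a data table. This is just an illustrative sketch: the bucket path pattern below is a made-up placeholder, and `vcf_shard` is a hypothetical entity type name; check the gnomAD downloads page for the real file locations.

```python
# Sketch: build a Terra data-table TSV listing the 24 per-chromosome VCF paths.
# The PATH_TEMPLATE below is a placeholder, NOT the real gnomAD bucket layout.
import csv
import io

# chr1-chr22 plus chrX and chrY: the 24 per-chromosome VCFs in a release
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
PATH_TEMPLATE = ("gs://example-gnomad-bucket/release/3.1/vcf/genomes/"
                 "gnomad.genomes.v3.1.sites.{chrom}.vcf.bgz")

def build_data_table():
    """Return the TSV text Terra expects when importing an entity table."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # In Terra's import format, the first column header names the entity type
    writer.writerow(["entity:vcf_shard_id", "chromosome", "vcf_path"])
    for chrom in CHROMOSOMES:
        writer.writerow([f"gnomad_{chrom}", chrom,
                         PATH_TEMPLATE.format(chrom=chrom)])
    return buf.getvalue()

tsv_text = build_data_table()
print(tsv_text.splitlines()[0])  # prints the header row
```

Saving that text to a `.tsv` file and uploading it via the workspace Data tab would produce a table much like the one in the demo workspace.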
Then I set up a variant validation workflow with a basic configuration to serve as an example of how you could run the workflow on the gnomAD VCFs via the data table. To use this resource, just clone the workspace and try running the pre-configured workflow. Then you should be able to set up your own workflow in a similar way.
Exploring the gnomAD dataset with Hail
If you're interested in exploring the gnomAD dataset interactively, one great option is to use Hail, the gnomAD team's preferred toolkit for variant manipulation. One thing I learned about Hail in the process of putting this demo together is that it can access data stored in Google Cloud Storage without copying the entire file to the local disk of the VM or cluster you're running on, which makes it really well suited to working on the cloud.
Not coincidentally, the gnomAD team provides a copy of the dataset that is already formatted in Hail's native table format, and I really wanted to take advantage of that. So I created a fairly basic Jupyter notebook that shows how to read in the tables for the sites-only data and the subset-with-genotypes, which are formatted as a regular Hail table and a Hail MatrixTable, respectively. The notebook also shows some typical intro-level Hail commands in action, then points you to the Hail documentation, which includes full analysis examples, for more in-depth analysis guidance.
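The first steps of such a notebook can be sketched roughly as follows, assuming Hail is installed in the notebook environment (it is on Terra's Hail-enabled cloud environments). The `gs://` paths here are placeholders, not the real gnomAD locations; those are listed in the workspace Dashboard and on the gnomAD downloads page.

```python
# Sketch of reading the gnomAD Hail-format data in a notebook.
# The paths below are placeholders -- substitute the real gnomAD locations.
SITES_HT_PATH = "gs://example-gnomad-bucket/release/3.1/ht/gnomad.sites.ht"
SUBSET_MT_PATH = "gs://example-gnomad-bucket/release/3.1/mt/gnomad.subset.mt"

def peek_at_gnomad(sites_path=SITES_HT_PATH, subset_path=SUBSET_MT_PATH):
    """Read the sites-only Table and the subset MatrixTable, show a few rows."""
    import hail as hl  # imported here so the sketch parses without Hail present
    hl.init(default_reference="GRCh38")
    ht = hl.read_table(sites_path)          # sites-only data: a regular Table
    mt = hl.read_matrix_table(subset_path)  # subset with genotypes: MatrixTable
    ht.describe()      # print the table's schema
    print(ht.count())  # number of variant rows
    mt.rows().show(5)  # first few variant rows of the MatrixTable
    return ht, mt
```

From there, the Hail documentation's analysis examples cover filtering, annotation, and aggregation on these structures.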
As a sidebar, I took the opportunity to include some instructions for retrieving the data table contents programmatically from within the notebook via the Terra API, which is a neat trick in itself. I used that to retrieve the VCF file paths so that I could show how to import a VCF into a Hail table, which you may need to do for datasets where there is not already a convenient Hail table version available.
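The data-table retrieval can be done with the `firecloud` (FISS) client library that Terra notebook environments typically include. The sketch below is a rough outline under that assumption; the workspace namespace, name, entity type, and attribute name are all placeholders you'd replace with your own.

```python
# Sketch: fetch the rows of a workspace data table via the Terra API,
# using the firecloud/FISS client. All names below are placeholders.
def get_vcf_paths(namespace="my-billing-project", workspace="gnomad-demo",
                  entity_type="vcf_shard", attribute="vcf_path"):
    """Return the file-path attribute for every row of a workspace data table."""
    from firecloud import api as fapi  # available in Terra notebook images
    response = fapi.get_entities(namespace, workspace, entity_type)
    response.raise_for_status()
    # Each entity row carries its column values in an "attributes" dict
    return [row["attributes"][attribute] for row in response.json()]
```

The resulting list of `gs://` paths can then be handed straight to Hail's VCF import.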
Exploring the gnomAD dataset with BigQuery
Another option for interactive analysis of the gnomAD dataset is the BigQuery database version, which was prepared by the Google Cloud Life Sciences team as described here. In this case we no longer deal with files; instead, we send queries to the database, which returns results that we can load as dataframes and manipulate with our preferred data toolkit.
The advantage of this option is that you don't need to deal with files or file paths at all, so it can be a faster way to get to the answers you're looking for. The downside is that Google Cloud charges a fee for each query, based on the amount of data the query scans; if you compose your queries efficiently, the amount is usually very small, and you get free quota to work with gnomAD, so it's worth checking out. This option tends to be most popular with people who have experience working with databases and are familiar with SQL syntax.
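To give a feel for the query-based style, here is a hedged sketch using the `google-cloud-bigquery` client that Terra notebooks provide. The table name is a placeholder (the Google Cloud notebook in the workspace lists the real dataset and table names), and `my-billing-project` stands in for your own billing project.

```python
# Sketch: query the gnomAD BigQuery tables and get a pandas DataFrame back.
# The table name below is a placeholder, not the real gnomAD dataset name.
TABLE = "bigquery-public-data.gnomAD.example_table"

QUERY = f"""
SELECT reference_name, COUNT(*) AS n_variants
FROM `{TABLE}`
GROUP BY reference_name
ORDER BY n_variants DESC
"""

def run_query(query=QUERY, billing_project="my-billing-project"):
    """Send the query to BigQuery and return the result as a DataFrame."""
    from google.cloud import bigquery  # available in Terra notebook images
    client = bigquery.Client(project=billing_project)
    return client.query(query).to_dataframe()
```

Because BigQuery bills by data scanned, selecting only the columns you need (as above, rather than `SELECT *`) keeps the cost of a query like this small.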
The Google Cloud team made a notebook that demonstrates how this works, which includes several types of queries and goes quite far in illustrating the kinds of analyses this enables. They kindly gave me permission to include their notebook in my demo workspace, so it's right in there with the Hail notebook. May the toolkit that best applies to your use case win your favor!
Next step: go play!
I encourage you to check out the demo workspace, which is fully public, and see if it might be helpful to you in your work. If you'd like to try out the workflow or either notebook for yourself, just clone the workspace and have at it. Be sure to read the instructions in the workspace Dashboard; it has more hands-on details, and I also included some pointers to help you get started if you're new to Terra.
Let me know what you think about this resource; it was a fun little project that distracted me from what was otherwise a terrible week, so I'd love to hear from you if you find it helpful and/or if you want to see more resources like this.
There are no publication restrictions or embargoes on the gnomAD data. However, the project maintainers request that you please cite the flagship gnomAD paper if you use any of this data in your work.