Learn how to find, access, and set up AnVIL data for analysis in Terra, including controlled-access data you have permission to work with. This document provides instructions for selecting and setting up AnVIL data for analysis in the cloud, which AnVIL highly recommends.
AnVIL data may also be downloaded out of the cloud to local/institutional systems. Note that you will pay the costs charged by Google for doing so.
AnVIL data is now provided through the AnVIL Data Explorer. The AnVIL Gen3 data portal is no longer available; use the AnVIL Data Explorer instead.
Please note: All requests for AnVIL data are made through dbGaP. You do not need to submit a request through DUOS to access AnVIL data.
AnVIL data overview
- You can find AnVIL datasets in the AnVIL Data Explorer and the Data Use Oversight System (DUOS).
- Some AnVIL datasets are open-access, but most are controlled-access and require permission to access the data. You can obtain permission to access controlled data from dbGaP.
- The AnVIL Data Explorer provides faceted search and selection of AnVIL studies and the ability to export the user-selected data to either:
  - A Terra user workspace for analysis in Terra (preferred)
  - A TSV manifest, which can be used as input to download the data
- The structure and content of dataset tabular data are as provided by the data submitters. This includes the data dictionary, the names of tables and columns, and the values. There is often similarity among datasets submitted by the same consortium, yet substantial variation overall.
- AnVIL data is hosted by the Terra Data Repository (TDR), although researchers do not use TDR directly.
- AnVIL data is accessed using the GA4GH Data Repository Service (DRS) protocol.
- Each file is identified by a unique DRS URI.
- AnVIL data files are stored in Google Cloud Storage buckets in the GCP us-central1 region.
- These buckets are Requester Pays, meaning the user requesting the data is charged for all access to the data, including downloading. See Using Requester Pays workspaces/Buckets in Terra for more details.
- Note that when AnVIL data is analyzed in Terra and accessed via DRS URIs, the Google Cloud costs charged to the user for DRS data access are very low. Reading data costs less than a penny per 10,000 operations, and data transfer within the region is free.
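As an illustration of how a TSV manifest and DRS URIs fit together, the following is a minimal Python sketch that collects DRS URIs from a manifest. The column names (`file_name`, `file_size`, `drs_uri`), the sample rows, and the `read_drs_uris` helper are hypothetical; an actual manifest exported from the AnVIL Data Explorer may use different column names, so check the header of your own file first.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical manifest contents for illustration only.
# A real manifest may use different column names and identifiers.
SAMPLE_MANIFEST = (
    "file_name\tfile_size\tdrs_uri\n"
    "sample1.cram\t1024\tdrs://example.org/abc-123\n"
    "sample2.cram\t2048\tdrs://example.org/def-456\n"
    "notes.txt\t10\t\n"
)


def read_drs_uris(handle, column="drs_uri"):
    """Return the values in `column` that use the drs:// scheme."""
    reader = csv.DictReader(handle, delimiter="\t")
    uris = []
    for row in reader:
        # Missing columns or empty cells are treated as empty strings.
        value = (row.get(column) or "").strip()
        if urlparse(value).scheme == "drs":
            uris.append(value)
    return uris


uris = read_drs_uris(io.StringIO(SAMPLE_MANIFEST))
print(uris)
```

To run this against a file on disk, pass an open file handle instead of the `io.StringIO` object, for example `open("manifest.tsv", newline="")`, where `manifest.tsv` is a placeholder file name.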
Best practices for analyzing AnVIL data
Researchers analyzing AnVIL data are encouraged to use the Terra GCP analysis platform, which provides free on-demand access to AnVIL data and secure, scalable interactive and batch computation. Accessing AnVIL data on demand (via the GA4GH DRS protocol) from Terra GCP in the GCP us-central1 region minimizes data storage and egress costs. If data is downloaded to a Terra workspace bucket, standard Google storage costs will apply to that data.
Recommended AnVIL data access flow
Part 1: Set up billing/access > Part 2: Select data (Data Explorer) > Part 3: Export files for analysis
If you need to download data locally to analyze (note potential cost)
Researchers who choose to perform their analysis on local/institutional systems may do so. However, NHGRI AnVIL policy requires users to pay the data egress costs Google charges for files downloaded out of the cloud to local/institutional systems.
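Before downloading, it can help to make a rough estimate of the egress charge. The sketch below is simple arithmetic under stated assumptions: the `estimate_egress_cost` helper is hypothetical, and the rate used is a placeholder, not a real Google price. Look up the current Google Cloud network egress rate for your destination and substitute it.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_egress_cost(total_bytes, usd_per_gib):
    """Estimate the egress charge as (size in GiB) x (rate in USD per GiB)."""
    return (total_bytes / GIB) * usd_per_gib


# Placeholder rate for illustration only; NOT an actual Google Cloud price.
PLACEHOLDER_USD_PER_GIB = 0.10

# Example: 500 GiB of files selected for local download.
cost = estimate_egress_cost(500 * GIB, PLACEHOLDER_USD_PER_GIB)
print(f"Estimated egress cost: ${cost:.2f}")
```

The total size can be taken from a file-size column in the exported manifest, if one is present. The estimate ignores any tiered pricing or per-operation charges, so treat it as a rough planning figure rather than a bill.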