Learn how to find, access, and set up AnVIL data for analysis in Terra, including controlled-access data you have permission to work with. This document provides instructions for selecting and setting up AnVIL data for analysis in the cloud, which AnVIL strongly recommends.
AnVIL data may also be downloaded out of the cloud to local/institutional systems. Note that you will pay the costs charged by Google for doing so.
AnVIL data is now provided through the AnVIL Data Explorer. The AnVIL Gen3 data portal is no longer available; use the AnVIL Data Explorer instead.
AnVIL data overview
- You can find AnVIL datasets in the AnVIL Data Explorer and the Data Use Oversight System (DUOS).
- Some AnVIL datasets are open-access, but most are controlled-access and require permission to access the data. You can obtain permission to access controlled-access data from dbGaP or DUOS.
- The AnVIL Data Explorer provides faceted search and selection of AnVIL studies and the ability to export the user-selected data to either:
  - A Terra user workspace for analysis in Terra (preferred)
  - A TSV manifest, which can be used as input to download the data
- The structure and content of dataset tabular data are as provided by the data submitters. This includes the data dictionary, the names of tables and columns, and the values. There is often similarity among datasets submitted by the same consortium, yet substantial variation overall.
- AnVIL data is hosted by the Terra Data Repository (TDR), although researchers do not use TDR directly.
- AnVIL data is accessed using the GA4GH Data Repository Service (DRS) protocol.
- Each file is identified by a unique DRS URI.
- AnVIL data files are stored in Google Cloud Storage buckets in the GCP us-central1 region.
  - These buckets are Requester Pays, meaning the user requesting the data is charged for all access to the data, including downloading. See Using Requester Pays workspaces/Buckets in Terra for more details.
- Note that when AnVIL data is analyzed in Terra and accessed via DRS URIs, the user is not charged for data access.
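To make the manifest and DRS URI points above more concrete, here is a minimal Python sketch that reads a TSV manifest and splits each DRS URI into its parts. It rests on several assumptions that are not taken from AnVIL documentation: the column names `file_name` and `drs_uri`, the sample rows, and the helper names `parse_drs_uri` and `read_manifest` are all hypothetical, so check the header row of your own exported manifest. The sketch also uses a simplified rule to tell the two DRS URI forms apart (a slash after `drs://` means hostname-based, otherwise compact identifier) and does not handle compact identifiers that carry a provider code.

```python
import csv
import io

DRS_SCHEME = "drs://"


def parse_drs_uri(uri):
    """Split a DRS URI into (form, authority, object_id).

    Simplified assumption: a "/" after the scheme marks the hostname-based
    form (drs://<hostname>/<id>); otherwise the URI is treated as a compact
    identifier (drs://<prefix>:<accession>).
    """
    if not uri.startswith(DRS_SCHEME):
        raise ValueError(f"not a DRS URI: {uri!r}")
    rest = uri[len(DRS_SCHEME):]
    if "/" in rest:
        form = "hostname"
        authority, _, object_id = rest.partition("/")
    else:
        form = "compact"
        authority, _, object_id = rest.partition(":")
    if not authority or not object_id:
        raise ValueError(f"malformed DRS URI: {uri!r}")
    return form, authority, object_id


def read_manifest(tsv_text, uri_column="drs_uri"):
    """Parse the DRS URI of every row in a TSV manifest.

    The default column name is hypothetical; pass the name that your
    manifest actually uses.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [parse_drs_uri(row[uri_column]) for row in reader]


# Hypothetical manifest content, for illustration only.
SAMPLE_MANIFEST = (
    "file_name\tdrs_uri\n"
    "sample1.cram\tdrs://example.org/abc123\n"
    "sample2.cram\tdrs://example.prefix:def456\n"
)

if __name__ == "__main__":
    for form, authority, object_id in read_manifest(SAMPLE_MANIFEST):
        print(form, authority, object_id)
```

Parsing the URIs up front is a cheap way to catch a wrong column name or a truncated manifest before any data access is attempted.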
Best practices for analyzing AnVIL data
Researchers analyzing AnVIL data are encouraged to use the Terra GCP analysis platform, which provides free on-demand access to AnVIL data and secure, scalable interactive and batch computation. Working in Terra minimizes data storage and egress costs because on-demand access to AnVIL data (via the GA4GH DRS protocol) incurs no data access charges when the data is used in Terra GCP in the GCP us-central1 region. If data is downloaded to a Terra workspace bucket, standard Google storage costs will apply to that data.
Recommended AnVIL data access flow
Part 1: Set up billing/access > Part 2: Select data (Data Explorer) > Part 3: Export files for analysis
If you need to download data locally to analyze (note potential cost)
Researchers who choose to perform their analysis on local/institutional systems may do so. However, NHGRI AnVIL policy requires users to pay the data access and out-of-cloud download costs that Google charges for files downloaded to local/institutional systems.
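As a rough illustration of what paying for the download means in practice, the Python sketch below builds a `gsutil` command that copies one file out of a Requester Pays bucket. The `-u` option is how `gsutil` names the Google Cloud project to bill for the request. Everything else here is an assumption rather than an AnVIL-documented procedure: the helper name `requester_pays_copy_command`, the bucket path, and the project ID are hypothetical, and the sketch assumes you already have a `gs://` path for the file and an authenticated `gsutil` installation.

```python
def requester_pays_copy_command(gs_url, destination, billing_project):
    """Build the argument list for a gsutil copy from a Requester Pays bucket.

    billing_project is the Google Cloud project that is charged for the
    data access and download; gsutil receives it through the -u option.
    """
    if not gs_url.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got {gs_url!r}")
    if not billing_project:
        raise ValueError("a billing project is required for Requester Pays")
    return ["gsutil", "-u", billing_project, "cp", gs_url, destination]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    command = requester_pays_copy_command(
        "gs://example-bucket/path/sample1.cram",
        "./sample1.cram",
        "my-billing-project",
    )
    # Print the command instead of running it, so nothing is billed by
    # accident. To execute it, pass the list to subprocess.run(command, check=True).
    print(" ".join(command))
```

Building the command as a list and printing it first lets you review which project will be charged before any data leaves the cloud.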