How to stage data in your AnVIL deposit workspace

Allie Cliffe

Once you have prepared your omics object files and generated TSV files for each table in your data model, follow the directions below to deposit the object files and all TSV files into an AnVIL-owned data deposit workspace. Note that AnVIL on Azure instructions are in a second tab.

Step 1: Log into Terra/AnVIL

1.1. Go to https://anvil.terra.bio/#workspaces and log in with your Terra ID. If you have logged into Terra before, you will use the same login ID.

Logging in: Google or Microsoft SSO?
The Terra login is a Single Sign-On (SSO) that stands in for your Terra ID and doesn't determine what cloud platform you work in. You can use your Google ID (or G-Suite email) or Microsoft ID to log into Terra. If you already have a Terra ID, you should use that.

For more details, see How to use Terra on Azure with a Google login. If you don’t already have a Terra account, see How to register on Terra (Google SSO) or How to set up/register in Terra on Azure (Microsoft SSO).

1.2. The AnVIL Data Ingestion team will provide you with your submission workspace. Once logged into Terra, search for your data deposit workspace in Your Workspaces and click on the link to open it.

Step 2: Set up cloud storage

To facilitate ingestion into TDR, the workspace cloud storage needs to have a particular directory structure. Follow the directions below to set up the proper (GCP or Azure) storage hierarchy. 

  • 2.1. Click the file icon in the bottom left column of the Data page to expose the workspace cloud storage directory.

    AnVIL-GCP_Files-icon-in-the-workspace-data-page.png

    2.2. Click the New Folder icon at the top to create directory folders.

    AnVIL-GCP_Create-new-folder-in-workspace-Bucket-screenshot.png

    Top-level directory (one dataset/consent group)

    • data_files folder
      It should contain the data file objects organized in whatever structure you desire. TDR has no preference, though a common structure is a directory per sample.
    • tabular_data folder
      Houses the tabular metadata or phenotypic tables

      This folder should ONLY contain TSVs that will be made into data tables. Previously, these were the files that submitters would upload into the workspace data tables. Phenotypic data that isn't in a flat, tabular form and shouldn't end up in a data table should not be included in this folder.

    Example Cloud storage directory structure (single dataset/consent group)

    AnVIL-GCP_Screenshot-of-workspace-Bucket-directory-for-single-consent-code.png

    Note that AnVIL deposit workspaces have a notebooks directory by default. 

    Top-level and sub-directories (multiple datasets/consent groups)

    • Dataset 1 (phsID 1, Consent Code 1)
      • data_files
      • tabular_data
    • Dataset 2 (phsID 1, Consent Code 2)
      • data_files
      • tabular_data

    Example Cloud storage directory structure (multiple datasets/consent groups)

    AnVIL-GCP_Screenshot-of-workspace-Bucket-file-directory-with-multiple-content-codes.png

  • 2.1. Click the file icon in the right sidebar to expose the workspace cloud storage directory.

    2.2. Click the New Folder icon at the top to create directory folders.

    DataSubmitters_Create-file-directory-in-deposit-workspace_Screenshot.png

    Top-level directory (one dataset/consent group)

    • data_files folder
      It should contain the data file objects organized in whatever structure you desire. TDR has no preference, though a common structure is a directory per sample.
    • tabular_data folder
      Houses the tabular metadata or phenotypic tables

      This folder should ONLY contain TSVs that will be made into data tables. Previously, these were the files that submitters would upload into the workspace data tables. Phenotypic data that isn't in a flat, tabular form and shouldn't end up in a data table should not be included in this folder.

    Example Cloud storage directory structure (single dataset/consent group)

    Data-Submitters-Guide_Cloud-storage-directory-single-dataset_Screenshot.png

    Example Cloud storage directory structure (two datasets/consent groups)

    Data-Submitters-Guide_Cloud-storage-directory-two-datasets_Screenshot.png

    Top-level and sub-directories (multiple datasets/consent groups)

    • Dataset 1 (phsID 1, Consent Code 1)
      • data_files
      • tabular_data
    • Dataset 2 (phsID 1, Consent Code 2)
      • data_files
      • tabular_data
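Whichever platform you use, you can mirror the layout above locally before uploading. Below is a minimal sketch; the dataset directory names (dataset_1_c1, dataset_2_c2) are hypothetical placeholders for your own phsID/consent-code pairs:

```shell
# Build the expected staging layout locally, then upload it as-is.
# The "dataset_*" directory names are placeholders, not AnVIL requirements.
mkdir -p staging/dataset_1_c1/data_files staging/dataset_1_c1/tabular_data
mkdir -p staging/dataset_2_c2/data_files staging/dataset_2_c2/tabular_data

# List the resulting directory tree.
find staging -type d | sort
```

For a single dataset/consent group, the data_files and tabular_data folders sit directly at the top level instead of inside per-dataset directories.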

Step 3: Upload TSV files

Next, you will upload your tabular data (TSVs) to the appropriate directory in the deposit workspace. This includes clinical and phenotypic data as well as metadata: all the tables from your data model.

TSV files to copy to tabular_data

  • All clinical and phenotypic data
  • All TSV load files in the Data Model
    • biosample.tsv
    • donor.tsv
    • file.tsv
    • diagnosis.tsv (optional)
    • family.tsv (optional)
    • project.tsv (optional)
    • Any additional optional tables  
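On AnVIL (GCP), the TSV copies above can be scripted in a terminal. This is a hedged sketch: the bucket path is a placeholder for your own deposit workspace's bucket, and DRY_RUN=echo prints each command instead of running it (remove that line to actually copy):

```shell
# Copy each data-model TSV into the tabular_data folder.
# BUCKET is a placeholder; substitute your deposit workspace's bucket path.
BUCKET="gs://fc-example-deposit-bucket"
DRY_RUN=echo   # remove this line to actually perform the copies

for tsv in biosample.tsv donor.tsv file.tsv diagnosis.tsv family.tsv project.tsv; do
  $DRY_RUN gcloud storage cp "$tsv" "$BUCKET/tabular_data/"
done
```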

Step 4: Upload object data files

Last, you'll upload unstructured data files (omics files, images, etc.) to the data_files folder or sub-folder. 

Large object files staged in data_files

  • Omics data
  • Images  

Data Indexing

Note that AnVIL will generate a globally unique ID (GUID) for each object file staged in the data_files folder and add it to the file.tsv (crai_path and cram_path in the example below) when the data files are ingested.

Your TSVs should include columns for object file metadata (crai_path and cram_path); AnVIL will provide the drs://dg part.

file_id    biosample_id  crai_path                  cram_path                  donor_id
bio123890  bio123890     drs://dg.4503:dg.4503/a60  drs://dg.4503:dg.4503/af9  HG000097
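As a quick sanity check before staging, you can confirm that your file.tsv header matches the expected column names and order. The sketch below recreates the example row above as an actual tab-separated file and validates its header; the check itself is illustrative, not part of the AnVIL tooling:

```shell
# Recreate the example file.tsv shown above, tab-separated.
printf 'file_id\tbiosample_id\tcrai_path\tcram_path\tdonor_id\n'  > file.tsv
printf 'bio123890\tbio123890\tdrs://dg.4503:dg.4503/a60\tdrs://dg.4503:dg.4503/af9\tHG000097\n' >> file.tsv

# Verify the header names and order before uploading to tabular_data.
expected="$(printf 'file_id\tbiosample_id\tcrai_path\tcram_path\tdonor_id')"
[ "$(head -n 1 file.tsv)" = "$expected" ] && echo "header OK"
```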
  • In AnVIL (GCP), use the gcloud storage command line in a terminal to copy object files to the data_files directory.

    See How to move data between local and workspace storage (Option 2: gcloud storage) for step-by-step instructions.

  • In AnVIL on Azure, use the Microsoft Azure Storage Explorer App or AzCopy in a terminal to copy object files to the data_files directory.

    For step-by-step instructions, see How to bring your own data to Terra on Azure (Option 2: Microsoft Azure Storage Explorer App or Option 3: AzCopy in a terminal).
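The two command-line routes above can be sketched as follows. The bucket path, storage-account URL, container, and SAS token are all placeholders (Terra provides the real values), and DRY_RUN=echo prints each command instead of running it:

```shell
DRY_RUN=echo   # remove this line to actually perform the copies

# AnVIL (GCP): recursively copy a local data_files directory into the
# workspace bucket. The bucket path is a placeholder.
$DRY_RUN gcloud storage cp --recursive ./data_files "gs://fc-example-deposit-bucket/"

# AnVIL on Azure: the same copy via AzCopy. The account, container, and
# SAS token below are placeholders supplied by Terra on Azure.
$DRY_RUN azcopy copy ./data_files \
  "https://exampleaccount.blob.core.windows.net/sc-example-container?sv=EXAMPLE-SAS" \
  --recursive
```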

What happens next

Data Release

  1. Once your staging upload is complete, reach out to the AnVIL Support team, who will send you a submission checklist that will serve as a final sign-off for release.
  2. Once you provide release approval, the AnVIL Support team will coordinate the release with dbGaP, add appropriate access lists (e.g., dbGaP access lists), ingest the data into TDR, and release the dataset to general researchers.
  3. Upon release, the AnVIL Support team will coordinate the deletion of any deposit workspaces.

Additional resources

Data model resources

Data transfer resources

Analysis Resources


If you have additional questions, please reach out to the AnVIL ingestion team at anvil-data@broadinstitute.org.
