Terra on Azure: AnVIL data submitter's guide

Allie Cliffe

Step-by-step instructions to help (AnVIL) Data Submitters previously using TDR (GCP) get started uploading data to TDR (Azure).

Process overview and requirements

You’ll first stage data in a dedicated Terra on Azure workspace. Then the AnVIL team will ingest the data files and tables into TDR.

What’s the same

  • Log in with the same Terra ID (can be a Google or G Suite account)
  • Stage data in a deposit workspace

What’s different in Azure

  • View workspace cloud storage directory in the workspace (not in GCP console)
  • Azure-specific directory structure for CSVs and data files
  • Workspace storage identified with a SAS URL (updated every 8 hours)
  • Upload large files with AzCopy (versus gsutil)
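
Because the SAS URL is refreshed periodically, it can help to check whether the one you copied is still valid before starting a long upload. The sketch below is a minimal illustration, assuming the URL carries the standard Azure SAS expiry field (`se`) as an ISO 8601 UTC timestamp. The function names and the example URL are hypothetical and are not part of Terra.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def sas_expiry(sas_url: str) -> Optional[datetime]:
    """Return the expiry time from the SAS token's `se` field, or None if absent."""
    values = parse_qs(urlsplit(sas_url).query).get("se")
    if not values:
        return None
    # Assumption: `se` looks like 2024-01-01T08:00:00Z (UTC).
    expiry = datetime.fromisoformat(values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # A date-only value has no timezone; treat it as UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry


def is_expired(sas_url: str, now: Optional[datetime] = None) -> bool:
    """True if the SAS URL has no readable expiry or the expiry has passed."""
    expiry = sas_expiry(sas_url)
    if expiry is None:
        return True
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= expiry
```

If `is_expired` returns True, copy a fresh SAS URL from the workspace before uploading.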

For additional data submission support, reach out to the AnVIL Support team at anvil-data@broadinstitute.org

AnVIL provides data submitters with a submission workspace where you will stage data for ingestion (large data files and CSV files for each dataset table).

As the data submitter, you’re expected to abide by the following guidelines:

  • Only upload data from the current approved data submission.
  • Don’t run compute or analysis in this workspace without prior approval from the AnVIL program; use a separate workspace instead. Note that the WRITER role technically allows you to run computations here, but doing so without approval is not permitted.
  • Don’t copy or move primary data from this workspace without prior approval from the AnVIL program.

Next steps: Accessing the data

Note that once the data is ingested, you will be able to access it in TDR for analysis. Please do NOT clone this workspace for long-term use. This workspace will be deleted once your submission is complete.

Step 1: Log into Terra

1.1. Go to https://app.terra.bio/#workspaces and log in with your Terra ID. If you have logged into Terra before, you will use the same login ID.

Logging in: Google or Microsoft SSO?

The Terra login is a Single Sign-On (SSO) that stands in for your Terra ID and doesn’t determine what cloud platform you work in. You can use your Google ID (or G Suite email) or Microsoft ID to log into Terra. If you already have a Terra ID, you should use that.

For more details, see How to use Terra on Azure with a Google login. If you don’t already have a Terra account, see How to register on Terra (Google SSO) or How to set up/register in Terra on Azure (Microsoft SSO).

1.2. The AnVIL Data Ingestion team will provide you with your submission workspace. Once logged in, search for your data deposit workspace and click on the link to open it. 

Step 2: Set up your submission workspace

To facilitate ingestion into TDR, the workspace cloud storage needs to have a particular directory structure.

2.1. Click the file icon in the right sidebar to expose the workspace cloud storage directory.

2.2. Click the New Folder icon at the top to create directory folders.

DataSubmitters_Create-file-directory-in-deposit-workspace_Screenshot.png

Top-level directory (one dataset/consent group)

  • data_files folder
    It should contain the data file objects organized in whatever structure you desire. TDR has no preference, though a common structure is a directory per sample.
  • tabular_data folder
    Houses the tabular metadata or phenotypic tables

    This folder should ONLY contain TSVs/CSVs that will be made into data tables. Previously, these were the files that submitters would upload into the workspace data tables. If you have phenotypic data that isn’t in a flat, tabular form and shouldn’t end up in a data table, don’t include it in this folder.
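
If you assemble the folders locally before uploading, a quick check can catch stray files in tabular_data. The following is a minimal sketch, assuming a local staging directory that mirrors the workspace layout; the function name and the allowed-extension set are illustrative, not an AnVIL requirement beyond what is stated above.

```python
from pathlib import Path
from typing import List

# Assumption: table files are recognized by their extension.
ALLOWED_SUFFIXES = {".csv", ".tsv"}


def unexpected_tabular_files(dataset_dir: Path) -> List[Path]:
    """List files under <dataset_dir>/tabular_data that are not CSV or TSV."""
    tabular = dataset_dir / "tabular_data"
    return sorted(
        path
        for path in tabular.rglob("*")
        if path.is_file() and path.suffix.lower() not in ALLOWED_SUFFIXES
    )
```

An empty list means every file in the local tabular_data folder has a .csv or .tsv extension.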

Example Cloud storage directory structure (single dataset/consent group)

Data-Submitters-Guide_Cloud-storage-directory-single-dataset_Screenshot.png

Example Cloud storage directory structure (two datasets/consent groups)

Data-Submitters-Guide_Cloud-storage-directory-two-datasets_Screenshot.png

Top-level and sub-directories (multiple datasets/consent groups)

  • Dataset 1 (phsID 1, Consent Code 1)
    • data_files
    • tabular_data
  • Dataset 2 (phsID 1, Consent Code 2)
    • data_files
    • tabular_data
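
To prepare the same layout on your own machine before uploading, the structure above can be sketched as follows. This is only an illustration: the dataset folder names passed in are hypothetical placeholders, and the folders in the workspace itself are still created in Terra or by the upload tool.

```python
from pathlib import Path
from typing import Iterable, List

# The two required folders for each dataset/consent group.
SUBFOLDERS = ("data_files", "tabular_data")


def create_staging_layout(root: Path, dataset_names: Iterable[str]) -> List[Path]:
    """Create <root>/<dataset>/data_files and <root>/<dataset>/tabular_data locally."""
    created = []
    for name in dataset_names:
        for sub in SUBFOLDERS:
            folder = root / name / sub
            folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(folder)
    return created
```

For example, `create_staging_layout(Path("staging"), ["dataset_1", "dataset_2"])` creates four local folders, two per dataset.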

Step 3: Upload data to your Terra on Azure workspace

Tools for uploading files

If your files are smaller than 5 GB (per file), you can upload files directly in Terra.

For large numbers of large files, you can use the Microsoft Azure Storage Explorer App or AzCopy in a terminal. See Bring your own data to Terra on Azure for step-by-step instructions.
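
To decide which tool to use for each file, you could sort a local staging directory by file size. The sketch below is a rough illustration, assuming "5 GB" means 5 x 10^9 bytes (the exact limit Terra enforces is not stated here); the function name and the dictionary keys are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Assumption: "5 GB" interpreted as 5 * 10**9 bytes.
DIRECT_UPLOAD_LIMIT_BYTES = 5 * 10**9


def split_by_upload_method(
    root: Path, limit: int = DIRECT_UPLOAD_LIMIT_BYTES
) -> Dict[str, List[Path]]:
    """Group files under root by whether they are small enough for direct upload."""
    plan: Dict[str, List[Path]] = {"terra_ui": [], "azcopy_or_storage_explorer": []}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        small_enough = path.stat().st_size < limit
        key = "terra_ui" if small_enough else "azcopy_or_storage_explorer"
        plan[key].append(path)
    return plan
```

Files listed under `terra_ui` fall below the size threshold; everything else is a candidate for the Storage Explorer App or AzCopy.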

Data-Submitters-Guide_Upload-TSV-files_Screenshot.png

3.1. Upload CSV files to the tabular_data folder or sub-folder.

See How to bring your own data to Terra on Azure (option 1: Upload with Terra’s File Manager).

3.2. Upload data files to the data_files folder or sub-folder using the Microsoft Azure Storage Explorer App or AzCopy in a terminal.

See How to bring your own data to Terra on Azure (Option 2: Microsoft Azure Storage Explorer App or Option 3: AzCopy in a terminal).
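
When using AzCopy, the destination is the workspace SAS URL with the target folder inserted before the query string (the token). The sketch below only builds the command and does not run it. It assumes the SAS URL has the usual container-URL-plus-token shape; the function names, the example URL, and the folder names are hypothetical. `azcopy copy` with `--recursive` is the standard AzCopy form for uploading a directory, but follow the linked article for the exact steps.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def destination_url(sas_url: str, folder: str) -> str:
    """Insert a folder path into a SAS URL, keeping the token (query string) intact."""
    parts = urlsplit(sas_url)
    clean = folder.strip("/")
    path = parts.path.rstrip("/") + ("/" + clean if clean else "")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def azcopy_command(local_dir: str, sas_url: str, folder: str = "") -> List[str]:
    """Build an `azcopy copy` command that uploads local_dir under the given folder.

    Note: with a directory source, AzCopy places the directory itself under the
    destination, so uploading a local `data_files` directory into `dataset_1`
    yields `dataset_1/data_files/...`.
    """
    return ["azcopy", "copy", local_dir, destination_url(sas_url, folder), "--recursive"]
```

The returned list can be passed to `subprocess.run` or joined and pasted into a terminal. Because the SAS URL expires, build the command with a freshly copied URL.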

What happens next

Data Release

  • Once your staging upload is complete, reach out to the AnVIL Support team, who will send you a submission checklist that will serve as a final signoff for release.
  • Once you provide release approval, the AnVIL Support team will coordinate the release with dbGaP, add appropriate access lists (e.g. dbGaP access lists), ingest the data into TDR, and release the dataset to general researchers.
  • Upon release, the AnVIL Support team will coordinate the deletion of any deposit workspace.

Additional resources

For step-by-step instructions to transfer data from Terra (GCP) to Terra on Azure workspace storage, see Bring your own data to Terra on Azure.
