How to create a TDR dataset and ingest data with Zebrafish

Leyla Tarhan

Zebrafish is an interactive portal to create datasets and upload data to the Terra Data Repository. It’s especially useful if you’re not already familiar with using API endpoints.

Note that Zebrafish is currently only available for Terra-on-GCP. If you’re using Terra-on-Azure, or if your data are stored in a non-Google bucket, see Create a TDR dataset with APIs instead.

Step 1. Upload your data to the Cloud

You must store your data on the Cloud in order to ingest it to TDR with Zebrafish.

Your dataset will likely include the following components:

  • Data tables - Tabular data (tables) form the core of your dataset. These data tables may contain metadata (e.g., subject identifiers, age, etc.), phenotypic data (e.g., diagnoses, observations), or any other data that makes sense to store in a flat, relational structure. Generally, these will include references to the data files (e.g., .bam files) in your dataset. 
  • Data files - Data files are files that will be stored as objects in your dataset (e.g., .bam files, .vcf files, et cetera). Your tabular data tables should reference these data files - this will make it easier to analyze the data downstream, and enable fine-grained access controls.
  • A dataset schema file - The schema is a JSON file that specifies the structure of the tabular data tables within your dataset. This includes a definition of the tables, their columns and primary keys, the relationships between them, and any assets (i.e., schema subsets) that need to be defined to support the creation of snapshots downstream. For more on schemas, see Overview: Defining your TDR dataset schema and How to write a TDR dataset schema.
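As a rough illustration of what a schema file can contain, the sketch below builds a minimal two-table schema in Python and serializes it to JSON. The table and column names (subject, sample, bam_path) are hypothetical. The field names (tables, columns, primaryKey, relationships) are assumed from the TDR schema format, so check them against How to write a TDR dataset schema before relying on them.

```python
import json


def column(name, datatype, required=False, array_of=False):
    """Return one column definition (field names are assumed, see the note above)."""
    return {"name": name, "datatype": datatype, "array_of": array_of, "required": required}


def build_schema():
    """Build a minimal, hypothetical two-table schema with one relationship."""
    subject = {
        "name": "subject",
        "columns": [column("subject_id", "string", required=True), column("age", "integer")],
        "primaryKey": ["subject_id"],
    }
    sample = {
        "name": "sample",
        "columns": [
            column("sample_id", "string", required=True),
            column("subject_id", "string"),
            # Hypothetical column holding a gs:// path to a data file.
            column("bam_path", "string"),
        ],
        "primaryKey": ["sample_id"],
    }
    relationship = {
        "name": "sample_to_subject",
        "from": {"table": "sample", "column": "subject_id"},
        "to": {"table": "subject", "column": "subject_id"},
    }
    return {"tables": [subject, sample], "relationships": [relationship]}


# Serialize the schema so it can be saved as a .json file and uploaded.
schema_json = json.dumps(build_schema(), indent=2)
```

Writing schema_json to a file and uploading it alongside your data gives you a GCS URI that you can supply later as the Schema URI.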

You can store your data in a GCS bucket associated with a Terra workspace, or another GCS bucket outside of Terra.

  • 1.1. Create a Terra workspace (or use an existing one).

    Make sure that your workspace is on Google Cloud. To check this, confirm that the cloud name listed in the Cloud Information section of your workspace's dashboard is Google Cloud (not Azure):
    Screenshot of the Cloud Information section on an example workspace's dashboard. An orange box highlights the Cloud Name, which is Google Cloud.

    1.2. Upload the data to the Files section of the workspace's Data tab. See How to move data to/from a Google Bucket to learn how to do this.
    Screenshot of an example Terra workspace containing data for a TDR dataset in the Files section of the Data tab. Orange boxes highlight the Data tab and the Files section, which is listed on the left-hand panel.

    1.3. Add the Zebrafish production service account (zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com) to the workspace as a reader:
    Screenshot of the menu used to share a workspace. Orange boxes highlight the three-dot menu icon - which is located at the top right of any workspace screen - and the Share option.
    Screenshot showing Zebrafish's production service account being added to a workspace as a reader.

    1.4. Add your dataset's ingest service account to the workspace as a reader. The ingest service account is listed on the dataset's Dataset Summary dashboard in the TDR web interface.
    Screenshot of an example TDR dataset's Dataset Summary dashboard on the TDR web interface. An orange rectangle highlights the Ingest Service Account field.
  • 1.1. Upload the data to a Google Cloud Storage (GCS) bucket outside of Terra. Currently, Zebrafish can only ingest data from GCS buckets, not other Cloud providers.

    1.2. Add zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com to the bucket as a principal, with a role that contains the storage.objects.list permission.

    1.3. Add your dataset's ingest service account to the bucket as a principal with read access. The ingest service account is listed on the dataset's Dataset Summary dashboard in the TDR web interface.
    Screenshot of an example TDR dataset's Dataset Summary dashboard on the TDR web interface. An orange rectangle highlights the Ingest Service Account field.
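If your data live in a bucket outside of Terra, you can also script the access grants instead of using the Google Cloud console. The sketch below is one possible approach, not an official procedure. It assumes that the google-cloud-storage Python library is installed, that you are allowed to change the bucket's IAM policy, and that roles/storage.objectViewer (which includes storage.objects.list) is an acceptable role for your data.

```python
ZEBRAFISH_SA = "zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com"
# Assumption: a viewer role is sufficient; any role with list and read access should work.
VIEWER_ROLE = "roles/storage.objectViewer"


def viewer_binding(service_account_email, role=VIEWER_ROLE):
    """Build one IAM binding that grants `role` to a service account."""
    return {"role": role, "members": {f"serviceAccount:{service_account_email}"}}


def grant_read_access(bucket_name, service_account_emails):
    """Append read bindings to the bucket's IAM policy (this makes network calls)."""
    # Imported here so that viewer_binding() can be used without the library installed.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    for email in service_account_emails:
        policy.bindings.append(viewer_binding(email))
    bucket.set_iam_policy(policy)
```

To use it, call grant_read_access with your bucket's name and a list containing ZEBRAFISH_SA and your dataset's ingest service account.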

Formatting your data for TDR

Regardless of how you upload your data, you must follow these formatting rules:

  • Tabular data must be stored as flat, delimited CSV or TSV files.
  • All tabular data files must be located within the same directory.
  • If you have one file per table, it should have the same name as the table (for example, if your table is called “sample,” your file might be named “sample.tsv”).

    If you have multiple data files per table, save all files for that table in a sub-directory with the same name as the table. They should all include the table’s name in their file names. 

    For example, the bucket shown below includes a table called “sample.” TSV files that make up the data in the Sample table are stored inside the “sample” directory, and they all include “sample” in their names.

    Screenshot showing how files are organized and named for an example TDR dataset. There are three files, named 'sample.tsv', 'sample_2.tsv', and 'sample_3.tsv'. All three files are saved in a folder called 'sample'. All of these files contain data for the same table, which will be called 'sample.'

  • You do not need to save other files to the same GCS bucket as the tabular data, or as each other. However, they do need to be saved to the Cloud, and it is often simpler to save all data to the same bucket.
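Before uploading, it can help to check your file layout against the rules above. The sketch below is a simplified reading of those rules, not Zebrafish's own logic. It takes file paths relative to your tabular data directory, groups them by table, and reports any paths that do not fit the expected pattern.

```python
from collections import defaultdict
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".csv", ".tsv"}


def group_files_by_table(relative_paths):
    """Map table names to their files and collect paths that break the layout rules."""
    tables, problems = defaultdict(list), []
    for raw in relative_paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() not in ALLOWED_SUFFIXES:
            problems.append(f"{raw}: not a CSV or TSV file")
        elif len(path.parts) == 1:
            # One file per table: the file is named after the table.
            tables[path.stem].append(raw)
        elif len(path.parts) == 2 and path.parts[0] in path.name:
            # Several files per table: the sub-directory is named after the table,
            # and each file name contains the table name.
            tables[path.parts[0]].append(raw)
        else:
            problems.append(f"{raw}: expected <table>.tsv or <table>/<file containing table name>.tsv")
    return dict(tables), problems
```

For example, passing ["sample/sample.tsv", "sample/sample_2.tsv", "sample/sample_3.tsv"] groups all three files under the table "sample" and reports no problems.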

Step 2. Add Zebrafish to your TDR billing profile

2.1. Go to the addProfilePolicy endpoint in TDR and authorize Swagger.

2.2. Click Try it out.

2.3. Enter your TDR billing profile ID in the “id” field. If you don’t know this ID, run the enumerateProfiles endpoint and it will return your billing profile ID in the “id” field of the response.

2.4. Set the PolicyName field to user.

2.5. Replace the text in the Request body field with:

{
  "email": "zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com"
}

2.6. Click Execute.

2.7. To check whether you successfully added Zebrafish to your billing profile, enter your billing profile id in the retrieveProfilePolicies endpoint and click Execute. In the response body, you should see zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com listed among the members.

Screenshot showing the response body that is produced by running a retrieveProfilePolicies job using the Swagger API endpoints. An orange arrow highlights that the zebrafish production service account is listed in the 'members' field of the 'policies' object in the response body.
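If you prefer a script to the Swagger page, the same request can be built in Python. This is a minimal sketch under stated assumptions: the URL path is assumed to match the addProfilePolicy operation shown in Swagger, and the access token is one you obtain yourself for your Terra account (for example, with gcloud auth print-access-token). The function only builds the request; it does not send it.

```python
import json
import urllib.request

TDR_BASE_URL = "https://data.terra.bio"
ZEBRAFISH_SA = "zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com"


def build_add_policy_request(profile_id, access_token, policy_name="user"):
    """Build (but do not send) the request that adds Zebrafish to a billing profile."""
    # Assumption: this path mirrors the addProfilePolicy operation in Swagger.
    url = f"{TDR_BASE_URL}/api/resources/v1/profiles/{profile_id}/policies/{policy_name}/members"
    body = json.dumps({"email": ZEBRAFISH_SA}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned request to urllib.request.urlopen sends it. Confirm the URL against the Swagger page first, since the path here is an assumption.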

Step 3. Create a dataset and upload data

3.1. Sign in to https://zebrafish.dsde-prod.broadinstitute.org/ with your Terra credentials.

3.2. Select the wavy Pipeline Monitoring icon (top left).

Screenshot showing the Zebrafish website. An orange rectangle highlights the pipeline monitoring icon, which is a wavy line at the top left of the screen.

3.3. Click the New Ingestion button (top right).

Screenshot showing the Zebrafish website. An orange rectangle highlights the New Ingestion button, which is at the top right of the screen.

3.4. In the Ingestion Details form that appears, fill in your dataset's name in the TDR Dataset field. It should be a unique name that will help you identify the dataset, containing only letters and underscores.

Screenshot of the first form that appears after clicking on the New Ingestion button. There is a blank field for TDR Dataset and buttons to select dataset type, validation mode, and whether to create a snapshot.

3.5. Under Dataset Type, select New to create a new dataset and upload data to that new dataset. The “Existing” option is used to update the data in existing datasets (documentation forthcoming).

3.6. Select your validation mode.

  • NONE: no validation queries will be run.

  • GCS: run validation queries (including referential integrity checks and data profile validations) and write the results to a separate, controlled-access GCS bucket. If you want to see the results of this validation, you’ll have to fetch them via the Zebrafish validation API endpoint.

  • TDR: run validation queries and write the results to TDR as an additional table.

3.7. Optional: toggle create snapshot on to create a “full-view” snapshot that you can share with others at a later time. A full-view snapshot includes the full dataset. See How to create snapshots in TDR to learn how to create snapshots that capture a subset of the dataset, or after ingesting data.

3.8. Specify a manifest for the job. This is the configuration Zebrafish will use to process your ingestion job. There are two options: click create new manifest to create a new manifest from scratch, or click use existing manifest to re-use the settings from a previous Zebrafish job.

  • Enter general information

    3.9. Name the Manifest under Manifest Name.
    3.10. Under ingestion mode, select either source or transformed.
    • Source mode: Ingests your data “as-is”, with the expectation that your data are already organized in the way you want for the final dataset. This is the simplest option, which we recommend for most cases.

    • Transformed mode: Both ingests your data “as-is” and allows you to subsequently apply SQL-based transformations to your data and store the outputs in your dataset. In this mode, TDR will store both the original and transformed data in your dataset.

    Configure the dataset

    Screenshot of the configuration form on Zebrafish. There are fields for the manifest name and ingestion mode, and expandable menus for TDR Dataset, Tabular Data, and Data Files.
    3.11. Under Billing Profile ID, enter your TDR billing profile ID.
    3.12. Under Schema URI, include a GCS URI for your dataset’s schema file. This is an optional step; however, it’s highly recommended if you want to leverage the full functionality of TDR, as the default schema inference logic cannot detect primary keys and relationships between tables. Note that in either case, once you’ve created a dataset, you can't add or edit your schema through Zebrafish.
    • If your schema file is saved to a Terra workspace, navigate to the file in the Files section of the Data tab, then hover your mouse over the file’s name and click on the clipboard icon to copy the file’s URI to your clipboard.

    • If your schema file is saved to a non-Terra GCS bucket, navigate to the file on the Google Cloud console. Click on the three-dot icon at the right-hand end of your file’s row and then click copy gsutil URI to copy the file’s URI to your clipboard.

      Screenshot showing how to locate the gsutil URI for a file on the Google Cloud console. An orange rectangle highlights a three-dot icon at the far right side of an example file's row in a table. Another orange rectangle highlights the 'copy gsutil URI' option in the resulting menu.

    3.13. Under Description, enter a description for your dataset. No special characters are allowed.
    3.14. Add any optional settings that are appropriate for your dataset:
    • Region: specify the region where you’d like your data to be stored on TDR. The default is us-central1. See Google’s documentation for the list of available regions.

    • PHS ID: include your dataset’s PHS ID, if one exists. This is a study accession number used to identify studies in the database of Genotypes and Phenotypes (dbGaP).

    • Bulk mode - RECOMMENDED: this setting makes the ingest process more efficient, and we recommend using it by default. The downside is that the dataset can’t be manipulated in any other way while a bulk mode ingest job is running; however, this is not a concern for most cases.

    • Secure monitoring: includes additional logging to track all requests to access your data. These logs will be saved to the same location where your data are staged in the cloud (e.g., your Google bucket). The downside is that storing the additional logs can make jobs run with secure monitoring more expensive.

    • Self hosted dataset: When this is turned on, TDR can reference your data files in their own buckets, rather than physically copying them into your TDR dataset. As a result, it may cost less to store self-hosted datasets. However, self-hosted datasets can be more vulnerable to accidental changes, and they make it harder to track file versions.

    • Predictable file ids: When this is turned on, TDR will generate IDs for your files deterministically based on their properties, rather than randomly. As a result, the files will have the same IDs if you ingest them into multiple TDR datasets. This can make downstream analyses smoother. Note that this will only work if MD5s are present for your files.

    • Extra fields: If you want to specify additional settings, you can use JSON format to specify extra fields. See the createDataset API endpoint for a list of the available fields (click on the “schema” tab above the response body for more information about each field).
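    If you prepare your form values ahead of time, a small pre-flight check can catch two common mistakes: a dataset name with disallowed characters and a Schema URI that isn't a GCS URI. The sketch below only encodes the rules stated in this article (letters and underscores for the name, a gs:// path for the schema). Zebrafish may apply further checks that are not covered here.

```python
import re

# Dataset names may contain only letters and underscores.
DATASET_NAME_PATTERN = re.compile(r"[A-Za-z_]+")
# A GCS URI needs a bucket name and an object path after gs://.
GCS_URI_PATTERN = re.compile(r"gs://[^/]+/.+")


def check_form_values(dataset_name, schema_uri=None):
    """Return a list of problems; an empty list means the values look fine."""
    problems = []
    if not DATASET_NAME_PATTERN.fullmatch(dataset_name):
        problems.append("dataset name may contain only letters and underscores")
    # The schema URI is optional, so only check it when one is given.
    if schema_uri is not None and not GCS_URI_PATTERN.fullmatch(schema_uri):
        problems.append("schema URI should look like gs://bucket/path/to/schema.json")
    return problems
```

    An empty list means the values passed both checks. Otherwise, each entry describes one problem to fix before you submit the form.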

    Upload data tables

    Screenshot showing the Tabular Data section of the Zebrafish configuration menu. There are fields for bucket, source path, fileref mode, fileref detection, and configure file reference.
    3.15. Under Bucket, enter your GCS bucket’s name.
    • If your data are in a Terra workspace, you can find the bucket name in the Cloud Information section of your workspace’s dashboard, under “bucket name.”

      Screenshot showing the cloud information section on an example Terra workspace's dashboard.

    • If your data are in a Google bucket outside of Terra, you can find the bucket’s name by navigating to the bucket on the Google Cloud console.

    3.16. Under Source Path, enter the path from the bucket’s top level to the folder containing your tabular data.

    For example, if your tables are stored in a folder at the top level of your bucket, they might have a Google Storage path like gs://bucketname/tabular_data. In this case, the source path would be tabular_data.

    In contrast, if your tables are nested a few levels deeper, they might have a Google Storage path like gs://bucketname/paper1/dataset1/tabular_data. In this case, the source path would be paper1/dataset1/tabular_data.
    3.17. Under Fileref Mode, select how you want TDR to handle file references in your tables:
    • Inline: All columns containing string file references (i.e., all columns containing gs:// paths to files) will be converted to fileref type columns in TDR and have their values replaced with proper file references.

    • File_table: An additional table containing the file references will be added to the TDR dataset, and other tables referencing files will instead point to records in this new table. This allows TDR to manage all file references in a single location, and also provides a useful inventory of the files in the dataset.

    • Inline_and_table - RECOMMENDED: All columns containing string file references will be converted to fileref type columns in TDR and have their values replaced with proper file references. Additionally, a table containing file references will be added to the TDR dataset. We recommend this option for most cases.

    3.18. If you selected file_table or inline_and_table for your fileref mode, provide a name for your file table under file table name.
    3.19. Add any optional settings that are appropriate for your dataset:
    • Fileref detection: TDR will automatically attempt to detect file references in your data. If this is toggled on, TDR will look for any strings that start with gs:// and assume that these are file references. If you don’t have permission to access the files with these paths, it’s better to leave these strings as strings, rather than treating them as file references.

    • Configure file reference: Specify your file references manually. Only do this if your file references are not strings starting with gs://.
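    To preview which columns are likely to be treated as file references, you can scan a table yourself. The sketch below approximates the rule described above (values that start with gs://). It is not Zebrafish's actual detection logic, and the column names in the usage note are hypothetical.

```python
import csv
import io


def find_fileref_columns(table_text, delimiter="\t"):
    """Return the columns whose non-empty values all start with gs://."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    seen = {name: 0 for name in fieldnames}          # count of non-empty values
    all_gs = {name: True for name in fieldnames}     # still a candidate?
    for row in reader:
        for name in fieldnames:
            value = (row.get(name) or "").strip()
            if not value:
                continue  # empty cells do not count for or against a column
            seen[name] += 1
            if not value.startswith("gs://"):
                all_gs[name] = False
    return [name for name in fieldnames if seen[name] and all_gs[name]]
```

    For a table with the columns sample_id, bam_path, and notes, where only bam_path holds gs:// paths, the function returns ["bam_path"].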

    Optional: Upload files

    If you want to upload files that aren’t already referenced in your tabular data, fill in the data files section. In most cases, this shouldn’t be necessary because the best practice is to reference all of your files in your tabular data.

    To add a file that’s not in your tabular data, expand the data files section and click add configuration. Then, specify the files’ GCS bucket and (optionally) source paths.
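    Both the Tabular Data and Data Files sections ask for the bucket name and the source path as separate values. If you only have a full gs:// path on hand, a small helper like the sketch below can split it. The function name is ours, chosen for illustration.

```python
def split_gs_uri(uri):
    """Split gs://bucket/path/to/dir into (bucket, source_path)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    # Drop leading and trailing slashes so the path is relative to the bucket's top level.
    return bucket, path.strip("/")
```

    For example, split_gs_uri("gs://bucketname/paper1/dataset1/tabular_data") returns the bucket "bucketname" and the source path "paper1/dataset1/tabular_data".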

  • 3.9. To select the manifest you want to re-use, search by name or manifest ID.

    3.10. Once you’ve selected the manifest that you want to use, you can preview the manifest by clicking on the eye icon to the right of the manifest id. This will reveal the manifest’s full JSON.
    Screenshot showing an example of how to preview the selected manifest when running a Zebrafish job using an existing manifest. An orange rectangle highlights the eye icon used to show this preview.
    3.11. Once you’ve verified that this is the manifest you want, click Next.

Review and Submit

Once you've finished configuring or selecting your manifest, you’ll see a summary of the manifest's JSON. Edit this JSON as necessary (you can add fields) and then click Submit Ingestion to create your dataset and ingest your data.

If you're using an existing manifest, you'll only see the JSON fields that are different between this job and when the manifest was created. Go back and preview the manifest you selected if you need to double-check any of the fields not displayed in this summary.
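If you copy the previewed manifest JSON and the summary JSON out of Zebrafish, you can double-check the differences yourself. The sketch below is a generic comparison of two nested dictionaries, not a Zebrafish feature. The keys used in the usage note are placeholders and do not reflect the real manifest field names.

```python
def diff_fields(saved, current, prefix=""):
    """Return {dotted.path: (saved_value, current_value)} for every differing field."""
    changes = {}
    for key in sorted(set(saved) | set(current)):
        path = f"{prefix}{key}"
        old, new = saved.get(key), current.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested objects and report their fields with dotted paths.
            changes.update(diff_fields(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes
```

To use it, load both JSON documents with json.loads and pass the resulting dictionaries to diff_fields. A field that exists in only one of the two appears with None on the other side.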

What to expect

Once you've submitted the job to create your TDR dataset, you can monitor its progress on Zebrafish's Pipeline Monitoring dashboard:

Screenshot of the pipeline monitoring dashboard, where you can monitor the progress of a Zebrafish job.

Your job’s status will start at “queued,” then change to “running.” While the status is “running,” hover your mouse over the status to see which step is currently running.

Once the job has finished running successfully, the status will change to “Succeeded” and you should be able to see your data on the TDR website:

1. Log in to https://data.terra.bio/

2. In the datasets tab, click on your dataset’s name.

3. You’ll see a summary of your dataset, including files, tables, and their columns:

Screenshot of an example dataset on the Terra Data Repository's web interface.

4. Click on view dataset data to view your data tables.
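You can also confirm that the dataset exists from a script rather than the website. The sketch below builds a lookup request against the TDR API. The path and the filter query parameter are assumptions based on the enumerateDatasets operation in Swagger, so verify them there before use. As with the other API calls, you need an access token for your Terra account.

```python
import urllib.parse
import urllib.request

TDR_BASE_URL = "https://data.terra.bio"


def build_dataset_lookup_request(dataset_name, access_token):
    """Build (but do not send) a request that searches TDR datasets by name."""
    # Assumption: path and `filter` parameter mirror enumerateDatasets in Swagger.
    query = urllib.parse.urlencode({"filter": dataset_name})
    url = f"{TDR_BASE_URL}/api/repository/v1/datasets?{query}"
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers)
```

Passing the returned request to urllib.request.urlopen sends it and returns the response.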

Troubleshooting failed jobs

If the job encounters a problem, the dashboard will display a “failed” status. Hover your mouse over the status to see more information about why the job failed.

Next Steps

Once you've created a dataset and uploaded your data with Zebrafish, you're off and running! Next, you can continue adding to and editing your dataset. Once the data are ready, you can create a snapshot to share your data and analyze it in a workflow.

 
