Zebrafish is an interactive portal to create datasets and upload data to the Terra Data Repository. It’s especially useful if you’re not already familiar with using API endpoints.
Note that Zebrafish is currently only available for Terra-on-GCP. If you’re using Terra-on-Azure, or if your data are stored in a non-Google bucket, see Create a TDR dataset with APIs instead.
Step 1. Upload your data to the Cloud
You must store your data on the Cloud in order to ingest it to TDR with Zebrafish.
Your dataset will likely include the following components
- Data tables - Tabular data (tables) form the core of your dataset. These data tables may contain metadata (e.g., subject identifiers, age, etc.), phenotypic data (e.g., diagnoses, observations), or any other data that makes sense to store in a flat, relational structure. Generally, these will include references to the data files (e.g., .bam files) in your dataset.
- Data files - Data files are files that will be stored as objects in your dataset (e.g., .bam files, .vcf files, et cetera). Your tabular data tables should reference these data files - this will make it easier to analyze the data downstream, and enable fine-grained access controls.
- A dataset schema file - The schema is a JSON file that specifies the structure of the tabular data tables within your dataset. This includes a definition of the tables, their columns and primary keys, the relationships between them, and any assets (i.e., schema subsets) that need to be defined to support the creation of snapshots downstream. For more on schemas, see Overview: Defining your TDR dataset schema and How to write a TDR dataset schema.
You can store your data in a GCS bucket associated with a Terra workspace, or another GCS bucket outside of Terra.
-
1.1. Create a Terra workspace (or use an existing one).
Make sure that your workspace is on the Google CloudTo check this, make sure that the cloud name listed in the Cloud Information section of your workspace's dashboard is Google Cloud (not Azure):
1.2. Upload the data to the Files section of the workspace's Data tab. See How to move data to/from a Google Bucket to learn how to do this.
1.3. Add the Zebrafish production service account (zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com
) to the workspace as a reader:
1.4. Add your dataset's ingest service account to the workspace as a reader. The ingest service account is listed in the dataset's dataset summary dashboard on the TDR web interface.
-
1.1. Upload the data to a Google Cloud Storage (GCS) bucket outside of Terra. Currently, Zebrafish can only ingest data from GCS buckets, not other Cloud providers.
1.2. Addzebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com
to the bucket as a principal, with a role that contains the storage.object.list permission.
1.3 Add your dataset's ingest service account to the workspace as a reader. The ingest service account is listed in the dataset's dataset summary dashboard on the TDR web interface.
Formatting your data for TDR
Regardless of how you upload your data, you must follow these formatting rules:
- Tabular data must be stored as flat, delimited CSV or TSV files.
- All tabular data files must be located within the same directory.
-
If you have one file per table, it should have the same name as the table (for example, if your table is called “sample,” your file might be named “sample.tsv”).
If you have multiple data files per table, save all files for that table in a sub-directory with the same name as the table. They should all include the table’s name in their file names.
For example, the bucket shown below includes a table called “sample.” TSV files that make up the data in the Sample table are stored inside the “sample” directory, and they all include “sample” in their names.
- You do not need to save other files to the same GCS bucket as the tabular data, or as each other. However, they do need to be saved to the Cloud, and it is often simpler to save all data to the same bucket.
Step 2. Add Zebrafish to your TDR billing profile
2.1. Go to the addProfilePolicy endpoint in TDR and authorize Swagger.
2.2. Click Try it out.
2.3. Enter your TDR billing profile ID in the “id” field. If you don’t know this ID, run the enumerateProfiles endpoint and it will return your billing profile ID in the “id” field of the response.
2.4. Set the PolicyName field to user.
2.5. Replace the text in the Request body field with:
{
"email": "zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com"
}
2.6. Click Execute.
2.7. To check whether you successfully added Zebrafish to your billing profile, enter your billing profile id in the retrieveProfilePolicies endpoint and click Execute. In the response body, you should see zebrafish-prod-mep-cc@broad-dsde-prod.iam.gserviceaccount.com
listed among the members.
Step 3. Create a dataset and upload data
3.1. Sign into https://zebrafish.dsde-prod.broadinstitute.org/ with your Terra credentials.
3.2. Select the wavy Pipeline Monitoring icon (top left).
3.3. Click the New Ingestion button (top right).
3.4. In the Ingestion Details form that appears, fill in your dataset's name in the TDR Dataset field. It should be a unique name that will help you identify the dataset, containing only letters and underscores.
3.5. Under Dataset Type, select New to create a new dataset and upload data to that new dataset. The “Existing” option is used to update the data in existing datasets (documentation forthcoming).
3.6. Select your validation mode.
-
NONE: no validation queries will be run.
GCS: run validation queries (including referential integrity checks and data profile validations) and write out the results to a separate, controlled-access GCS bucket. If you want to see the results of this validation, you’ll have to fetch it via the Zebrafish validation API endpoint
TDR: run validation queries and write out the results to TDR, as an additional table.
3.7. Optional: toggle create snapshot on to create a “full-view” snapshot that you can share with others at a later time. A full-view snapshot includes the full dataset. See How to create snapshots in TDR to learn how to create snapshots that capture a subset of the dataset, or after ingesting data.
3.8. Specify a manifest for the job. This is the configuration Zebrafish will use to process your ingestion job. There are two options: click create new manifest to create a new manifest from scratch, or click use existing manifest to re-use the settings from a previous Zebrafish job.
-
Enter general information
3.9. Name the Manifest under Manifest Name.
3.10. Under ingestion mode, select either source or transformed.
-
Source mode: Ingests your data “as-is”, with the expectation that your data are already organized in the way you want for the final dataset. This is the simplest option, which we recommend for most cases.
Transformed mode: Both ingests your data “as-is” and allows you to subsequently apply SQL-based transformations to your data and store the outputs in your dataset. In this mode, TDR will store both the original and transformed data in your dataset.
Configure the dataset
3.11. Under Billing Profile ID, enter your TDR billing profile id.
3.12. Under Schema URI, include a GCS URI for your dataset’s schema file. This is an optional step; however, it’s highly recommended if you want to leverage the full functionality of TDR, as the default schema inference logic cannot detect primary keys and relationships between tables. Note that in either case, once you’ve created a dataset, you can't add or edit your schema through Zebrafish.
-
If your schema file is saved to a Terra workspace, navigate to the file in the Files section of the Data tab, then hover your mouse over the file’s name and click on the clipboard icon to copy the file’s URI to your clipboard.
If your schema file is saved to a non-Terra GCS bucket, navigate to the file on the Google Cloud Console. Click on the three-dot icon at the right-hand end of your file’s row and then click copy gsutil URI to copy the file’s URI to your clipboard.
3.14. Add any optional settings that are appropriate for your dataset
-
Region: specify the region where you’d like your data to be stored on TDR. The default is
us-central1
. See Google’s documentation for the list of available regions.PHS ID: include your dataset’s PHS id, if one exists. This is a study accession number used to identify studies in the database of Genotypes and Phenotypes (dbGaP).
Bulk mode - RECOMMENDED: this setting makes the ingest process more efficient, and we recommend using it by default. The downside is that the dataset can’t be manipulated in any other way while a bulk mode ingest job is running; however, this is not a concern for most cases.
Secure monitoring: includes additional logging to track all requests to access your data. These logs will be saved to the same location where your data are staged in the cloud (e.g., your Google bucket). The downside is that storing the additional logging can make jobs run with secure monitoring more expensive.
Self hosted dataset: When this is turned on, TDR can reference your data files in their own buckets, rather than physically copying them into your TDR dataset. As a result, it may cost less to store self-hosted datasets. However, self-hosted datasets can be more vulnerable to accidental changes, and makes it harder to track file versions.
Predictable file ids: When this is turned on, TDR will generate IDs for your files deterministically based on their properties, rather than randomly. As a result, the files will have the same IDs if you ingest them into multiple TDR datasets. This can make downstream analyses smoother. Note that this will only work if MD5s are present for your files.
Extra fields: If you want to specify additional settings, you can use JSON format to specify extra fields. See the createDataset API endpoint for a list of the available fields (click on the “schema” tab above the response body for more information about each field).
Upload data tables
3.15. Under Bucket, enter your GCS bucket’s name.
-
If your data is in a Terra workspace, you can find this id in the cloud information section of your workspace’s dashboard, under “bucket name.”
If your data is in a Google bucket outside of Terra, you can find the bucket’s name by navigating to the bucket on the google cloud console.
For example, if your tables are stored at the top level of your bucket, they might have a google storage path likegs://bucketname/tabular_data
— the source path would betabular_data
.
In contrast, if your tables are nested a few levels deeper, they might have a google storage path likegs://bucketname/paper1/dataset1/tabular_data
. In this case, the source path would bepaper1/dataset1/tabular_data.
3.16. Under Fileref Mode, select how you want TDR to handle file references in your tables
-
Inline: All columns containing string file references (i.e., all columns containing
gs://
paths to files) will be converted to fileref type columns in TDR and have their values replaced with proper file references.File_table: An additional table containing the file references will be added to the TDR dataset, and other tables referencing files will instead point to records in this new table. This allows TDR to manage all file references in a single location, and also provides a useful inventory of the files in the dataset.
Inline_and_table - RECOMMENDED: All columns containing string file references will be converted to fileref type columns in TDR and have their values replaced with property file references. Additionally, a table containing file references will be added to the TDR dataset. We recommend this option for most cases.
file_table
orinline_and_table
for your fileref mode, provide a name for your file table under file table name.
3.18. Add any optional settings that are appropriate for your dataset:
-
Fileref detection: TDR will automatically attempt to detect file references in your data. If this is toggled on, TDR will look for any strings that start with
gs://
and assume that these are file references. If you don’t have permission to access the files with these paths, it’s better to leave these strings as strings, rather than treating them as file references.Configure file reference: Specify your file references manually. Only do this if your file references are not strings starting with
gs://
Optional: Upload files
If you want to upload files that aren’t already referenced in your tabular data, fill in the data files section. In most cases, this shouldn’t be necessary because the best practice is to reference all of your files in your tabular data.
To add a file that’s not in your tabular data, expand the data files section and click add configuration. Then, specify the files’ GCS bucket and (optionally) source paths.
-
-
3.9. To select the manifest you want to re-use, search by name or manifest ID.
3.10. Once you’ve selected the manifest that you want to use, you can preview the manifest by clicking on the eye icon to the right of the manifest id. This will reveal the manifest’s full JSON.
3.11. Once you’ve verified that this is the manifest you want, click Next.
Review and Submit
Once you've finished configuring or selecting your manifest, you’ll see a summary of the manifest's JSON. Edit this JSON as necessary (you can add fields) and then click Submit Ingestion to create your dataset and ingest your data.
If you're using an existing manifest, you'll only see the JSON fields that are different between this job and when the manifest was created. Go back and preview the manifest you selected if you need to double-check any of the fields not displayed in this summary.
What to expect
Once you've submitted the job to create your TDR dataset, you can monitor its progress on Zebrafish's Pipeline Monitoring dashboard:
Your job’s status will start at “queued,” then change to “running.” While the status is “running,” hover your mouse over the status to see which step is currently running.
Once the job has finished running successfully, the status will change to “Succeeded” and you should be able to see your data on the TDR website:
1. Log into https://data.terra.bio/.
2. In the datasets tab, click on your dataset’s name.
3. You’ll see a summary of your dataset, including files, tables, and their columns:
4. Click on view dataset data to view your data tables.
Troubleshooting failed jobs
If the job encounters a problem, the dashboard will display a “failed” status. Hover your mouse over the status to see more information about why the job failed.
Next Steps
Once you've created a dataset and uploaded your data with Zebrafish, you're off and running! Next, you can continue adding to and editing your dataset. Once the data are ready, you can Create a snapshot to share your data and analyze your data in a workflow.