Managing data tables with the WDS API

Allie Cliffe
  • Updated

Learn how to create, modify, and delete workspace data tables in Terra on Azure with APIs (Application Programming Interface) using the Workspace Data Service (WDS) Python Client. Scripting with the WDS API is much like running on a local machine but with Terra's built-in security and cloud integration. You can use the WDS Client commands from a JupyterLab Notebook or from the command line.

To see some of these commands in action, see the Using Data Tables in a Notebook featured workspace

Why use scripting/APIs?

Why would you want to use Terra through an API when you can work directly in Terra?

Flexible data access

Interacting directly with back-end Application Programming Interfaces (APIs) gives greater flexibility when manipulating data and setting up data tables. Often APIs allow functionality that hasn’t (yet) been integrated into the platform user interface.

Automation and scalability

Maybe you hope to streamline your analysis while avoiding human errors. Using the APIs means you can automate much of the setup process, which lets you standardize and scale up your work.

Personal preference

Sometimes you may want to upload data and run workflows without using Terra's graphical user interface. Maybe you are more comfortable with scripting than clicking buttons. If you just prefer scripting, you can do that as well.

Overview: WDS Python client

The WDS client is a Python-based API and command-line interface to Terra on Azure’s Workspace Data Service, which powers data tables.

WDS API versus FISS

Terra on Google users may be familiar with the FISS API for programmatically interacting with Terra’s features. FISS will not be available in Terra on Azure. Instead, we plan to release new clients to interact programmatically with Terra on Azure. The Workspace Data Service (WDS) Python client is the first of these clients to be released.

R support

This client is callable from the R-program language using a package called ‘Reticulate’. We provide an example code for how to use Reticulate with the WDS Python client in this example R notebook.

Low-level API

The WDS client corresponds one-to-one with the WDS API. Currently, anything supported by the WDS API is callable through the WDS Python client.

Technical details

How to install in your local or cloud environment

You can manually install using pip3 install wds-client.

Example: Using data in a table as notebook inputs

You may wish to use data contained in your tables in an interactive analysis with a notebook. The example notebooks in the “Using data tables in a Notebook” Featured Workspace show you how to bring a data table into a notebook and convert it into a data frame for analysis.

For a list of workspace variables you can use in a notebook, click here to jump to the section below..

Example: Managing data tables with the WDS API

Managing a small collection of samples by manually creating a workspace table to track them and their metadata is fairly straightforward using a spreadsheet and the "Upload TSV" feature in Terra.

However, large projects can produce data on hundreds, thousands, or even hundreds of thousands of samples. The time to upload this data to cloud storage or reference the files in your workspace data model is significant. It is impractical, even infeasible, to manually create a workspace, upload data, and track potentially hundreds of fields of extra metadata by hand.

When projects grow, the best approach is to script data management. Scripting can cover both uploading data to your workspace storage and creating a workspace data table to track the file locations and all their extra metadata. This lets you efficiently create and manage workspaces containing thousands (or many more) of samples while minimizing errors and maximizing your time.

Click here for a list of useful subcommands for working with data tables.

Workspace Variables

Workspace Variables are used to set up your environment to sync with your Terra workspace permissions. All of these variables are available in the Jupyter Notebook. Below is a description of each workspace variable and where in your workspace you can find them.

  • The title of your workspace. You can find it at the top of any workspace page (after the billing project).

  • This is the full path of your workspace blob storage container. You will need this, for example, when pulling files from workspace storage into your cloud environment for analysis (in JupyterLab, for example).

    To find this variable, go to your workspace dashboard under the cloud information section. It is the Storage Container URL.

    For example, in the screenshot below, it is
    https://lze033433beed5b4a6a47de6.blob.core.windows.net/sc-b57267b1-ef5c-472c-840f-a71a6174bb53.

    <INSERT SCREENSHOT>

  • You can obtain the Workspace ID from the Storage Container URL. It is the UUID after “sc-”.
    From the example above, the workspace ID is b57267b1-ef5c-472c-840f-a71a6174bb53.

  • You can obtain the Workspace Storage Container ID from the Storage Container URL found above.

    It is the UUID after https://. From the example above, the storage container ID is lze033433beed5b4a6a47de6.

Useful subcommands for working with data tables

Upload a TSV file

# upload TSV from Jupyter VM into a data table

tsv_file_name = "TestType_uploaded.tsv";

# this will create a tsv on the notebook VM or locally where this code is run

testType = testType.to_csv(tsv_file_name, sep="\t", index = False)

# Upload entity to workspace data table with name "testType_uploaded"

response = records_client.upload_tsv(current_workspaceId, version,
"testType_uploaded", tsv_file_name)

print(response)

Delete specific records(s)/row(s) in a data table

from datetime import datetime

records_client = wds_client.RecordsApi(api_client)
dict_values = {"Colors":["green","red","blue"], "Number": 2023,
"DateTimeCreatedAt": datetime.now()}

record_request = wds_client.RecordRequest(attributes=dict_values);

# this will create a record with table name "testType" and record row name
"testRecord"


# if you dont provide the primary_key, the operation will complete and the
primary key column will be called "sys_name" by default


# in this context:

# "TestType" is the name of the Data Table

# "testRecord" is the value that will be populated for the primary key for the
row to be added (left most column in the Data Table)


# Primary key is the name of the column that defines the primary key for the
given Data Table


recordCreated = records_client.create_or_replace_record(current_workspaceId,
version, 'testType', 'testRecord', record_request, primary_key="column_key")

# delete record

record = records_client.delete_record(current_workspaceId, version, 'testType',
'testRecord')

List all the records/rows by name and id in a workspace

import pandas as pd

records_client = wds_client.RecordsApi(api_client)

# query records that were just added

records = records_client.get_records_as_tsv(current_workspaceId, version,
'testType')

print(records)

testType = pd.read_csv(records, sep='\t')

testType.head()

Write a dataframe as a TSV

import pandas as pd

records_client = wds_client.RecordsApi(api_client)

# query records that were just added

records = records_client.get_records_as_tsv(current_workspaceId, version,
'testType')

print(records)

testType = pd.read_csv(records, sep='\t')

testType.head()

traits_for_analysis.to_csv("file_to_tsv.tsv", sep="\t", index = False)

#This writes the file_to_tsv.tsv to your notebook VM. From here, you can use
azcopy to copy this tsv to your workspace storage container, or use code
from the upload TSV section.

Return the names of the record types/tables in a workspace

schema_instance = wds_client.SchemaApi(api_client)

# get workspace data specifics, such as what tables exist

workspace_ent_type =
schema_instance.describe_all_record_types(current_workspaceId, version)

for t in workspace_ent_type:

print ("name:", t.name ,"count:", t.count)

 

Was this article helpful?

0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.