Learn to perform Google Cloud operations like writing to a BigQuery dataset, running dsub jobs, and more. Did you know you can do many of these things in Terra already? This article explains how to leverage Terra notebooks and workflows to access additional Google Cloud features in Terra.
Getting Started with advanced Google Cloud features
The Terra platform is designed to remove some of the barriers to moving to the cloud: Terra interfaces directly with Google Cloud. However, there are many other Google Cloud features not yet available in the Terra platform. Some are on the horizon; others are niche capabilities that may never be integrated with Terra.
Examples of what you can do
- WRITE to BigQuery
- Interact with Cloud Storage buckets other than the workspace bucket
- Run dsub jobs
- Run Cloud Dataflow jobs
- Run Cloud Machine Learning (ML) Engine jobs
How to do it
Just because these features aren't built into Terra doesn't mean you cannot use them. You can access these advanced features through a Google Cloud project that you set up in the Google Cloud console, and connect it to Terra with a human-friendly personal Terra group by following the steps below.
Once you follow these three setup steps, you'll be able to use the Google Cloud project to leverage advanced Google Cloud features by running notebooks and workflows on Terra.
Before you start! Your Terra user ID must have a Google Cloud Billing account. To set up a Google Cloud-native project on the Google Cloud console, you need to be an owner or a user on a Google Cloud Billing account linked to Terra. If what you see on the console does not look like the screenshots, it may be because you do not have the right permissions on a Google Cloud Billing account.
To learn how to set up Google Cloud billing, and access $300 in free credits from Google, see How to set up billing in Terra.
Set up your Google Project in three steps
Step 1. Set up a Google Cloud-native project (on Google Cloud console)
1.1. Go to the Manage resources page in the Google Cloud console.
1.2. On the "Select organization" drop-down at the top of the page, select the organization in which you want to create a project. Free trial users can skip this step, as this list does not appear.
1.3. Select Create project.
1.4. In the New Project window, enter a project name and select a Billing account. This is the Cloud Billing account that will cover all Google Cloud costs incurred in your Google project.
1.5. Enter the parent organization or folder in the Location box. That resource will be the hierarchical parent of the new project.
1.6. When you're finished entering new project details, click Create.
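If you prefer the command line, the console steps above can be sketched as equivalent gcloud commands. This is a minimal sketch: the project ID, billing account ID, and folder ID below are placeholders, not values from this article.

```python
# Build the gcloud commands equivalent to the console steps above.
# PROJECT_ID, BILLING_ACCOUNT, and FOLDER_ID are placeholders -- substitute your own.
PROJECT_ID = "my-terra-project"           # must be globally unique
BILLING_ACCOUNT = "000000-000000-000000"  # your Cloud Billing account ID
FOLDER_ID = "123456789012"                # parent folder (or use --organization)

# Create the project under the parent folder.
create_cmd = [
    "gcloud", "projects", "create", PROJECT_ID,
    "--name", "My Terra project",
    "--folder", FOLDER_ID,
]

# Link the project to the Cloud Billing account.
link_billing_cmd = [
    "gcloud", "billing", "projects", "link", PROJECT_ID,
    "--billing-account", BILLING_ACCOUNT,
]

print(" ".join(create_cmd))
print(" ".join(link_billing_cmd))
```

Run the printed commands in a terminal (or a notebook cell) after substituting your own values.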
Step 2. Create a human-friendly personal Terra group
Why use a Terra group for external access? Each Terra user has a prebuilt "Proxy" Group for accessing resources outside of Terra.
However, your proxy group is not very human-friendly. If you're looking at a list of users with access to an external Google Cloud bucket, seeing that there's a grant to
PROXY_11564882405514439@firecloud.org is not helpful unless you happen to have a way to figure out what user is associated with that Proxy Group.
Instead, you can create a Terra group (with a sensible name) as an alias for your proxy. If your registered Terra account is
firstname.lastname@example.org, create a Terra group named, for example,
firstname_lastname_at_example_org. Don't add anyone else to this group. You can then make grants to the group's email address.
This group contains one member, namely the proxy group for
firstname.lastname@example.org. This is much easier for a human to recognize and remember.
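One possible convention for deriving a readable group name from a registered account email can be sketched as a small helper. This function is hypothetical (not part of Terra); it just illustrates the naming pattern described above.

```python
def friendly_group_name(email: str) -> str:
    """Turn an account email into a human-friendly group name,
    e.g. firstname.lastname@example.org -> firstname_lastname_at_example_org."""
    local, _, domain = email.partition("@")
    raw = f"{local}_at_{domain}"
    # Keep letters, digits, underscores, and dashes; replace everything
    # else (such as the dots in the email) with underscores.
    return "".join(ch if ch.isalnum() or ch in "_-" else "_" for ch in raw)

print(friendly_group_name("firstname.lastname@example.org"))
# -> firstname_lastname_at_example_org
```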
2.1. Go to your Groups page (Your name > "Groups" from the main navigation menu at top left of any page in Terra).
2.2. Click on the blue Create a new group button.
2.3. Enter your human-friendly user-ID (can be your Terra login - see screenshot below) and click the Create Group button.
When you create the group, Terra also creates a mirrored Google group (containing your Terra ID plus your built-in proxy group) that you can use to interface directly with Google Cloud.
You'll see the full name in your list of Groups (below). In the next step, you'll grant permission for this group to access the Google Cloud-native project you created in step 1:
Step 3. Add your Terra group on the Google project
This step allows you to work in Terra (i.e., a Terra notebook), while Terra acts on your behalf (as your "proxy") behind the scenes in the project you just set up in Google Cloud.
You will give your personal Terra group "Editor" permission (for more information about Google Cloud permissions, see IAM basic and predefined roles reference).
Consider group membership before giving Editor permission. If your Terra group includes additional people, be careful about which permissions you grant to the group. Be aware that editors can turn on a large number of services, including ones that can be expensive!
3.1. Go to IAM > Manage Resources in your new Google Cloud project and select Add Member.
3.2. Add your human-friendly personal Terra group as a member in your project permissions.
3.3. Give the group Editor permission.
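The three sub-steps above also have a command-line equivalent, sketched below. The project ID and group email are placeholders; substitute your own project and your personal Terra group's address.

```python
# Grant the Terra group the Editor role on the project (placeholders throughout).
PROJECT_ID = "my-terra-project"              # your Google Cloud-native project
GROUP_EMAIL = "my_group_name@firecloud.org"  # your personal Terra group's email

grant_cmd = [
    "gcloud", "projects", "add-iam-policy-binding", PROJECT_ID,
    "--member", f"group:{GROUP_EMAIL}",
    "--role", "roles/editor",
]
print(" ".join(grant_cmd))
```

Run the printed command in a terminal or a notebook cell; the same caution about the breadth of the Editor role applies here.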
What to expect and next steps
Once these three steps are complete, you can do many advanced Google Cloud tasks. In many cases, Terra will interface with Google Cloud on your behalf! Read on for details of how to do specific tasks. We will continue to add to this list.
Additional instructions and template notebooks
Below are a series of requested features that are not (yet!) available in Terra. Expand each section for step-by-step instructions - or a link to a notebook in the public workspace.
Why use an external bucket? To learn more about the benefits of using external buckets for storing shared data resources, see Best practices for sharing and protecting data resources.
1. Go to Google Cloud Storage Console.
2. Select your Google Cloud-native project from the dropdown and click Create bucket.
External Google Cloud bucket configuration tips. In general, you can use the default values when setting up your external bucket.
For customization details, see the Google documentation.
When you are done, you will see your external bucket in the console!
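The console steps above can also be sketched as a gsutil command. The project ID, location, and bucket name below are placeholders; bucket names must be globally unique.

```python
# Create an external bucket with gsutil (placeholders throughout).
PROJECT_ID = "my-terra-project"      # the Google Cloud-native project from step 1
BUCKET = "gs://my-external-bucket"   # bucket names are globally unique

mb_cmd = ["gsutil", "mb", "-p", PROJECT_ID, "-l", "US", BUCKET]
print(" ".join(mb_cmd))
```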
Why set a bucket to autodelete? When you're testing code, you may generate a lot of data that you don't want to keep (or pay for). To avoid cleaning up at the end of the day, set your storage bucket to delete its contents every day with the following steps.
1. Go to Google Cloud Storage console.
2. Select the bucket you want to set to automatically delete data by clicking the bucket name.
3. Select the Lifecycle tab.
4. Choose Add a Rule.
5. Follow the instructions to set up a custom rule.
If you set up a rule to delete contents after 1 day, for example, you will see that rule listed under the bucket's Lifecycle tab.
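The same lifecycle rule can be applied from the command line. This sketch builds the JSON configuration that `gsutil lifecycle set` expects; the bucket name is a placeholder.

```python
import json

# Lifecycle policy: delete any object more than 1 day old.
lifecycle = {
    "rule": [
        {"action": {"type": "Delete"}, "condition": {"age": 1}}
    ]
}

# Write the policy to a file, then apply it with:
#   gsutil lifecycle set lifecycle.json gs://my-external-bucket
with open("lifecycle.json", "w") as f:
    json.dump(lifecycle, f, indent=2)

print(json.dumps(lifecycle))
```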
There are times when you may not want to keep shared data in a workspace bucket, particularly if you're sharing large numbers of large data files with a large group. To learn more, see Best practices for sharing and protecting data resources.
For an end-to-end example of interacting with an external bucket, see this template notebook (Py 3 end to end demo.ipynb).
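At its simplest, interacting with an external bucket from a notebook comes down to a few gsutil object operations, sketched below with placeholder paths and file names.

```python
# Basic object operations against an external bucket (placeholders throughout).
BUCKET = "gs://my-external-bucket"

copy_in = ["gsutil", "cp", "results.csv", f"{BUCKET}/results/results.csv"]  # upload
list_cmd = ["gsutil", "ls", f"{BUCKET}/results/"]                           # list
copy_out = ["gsutil", "cp", f"{BUCKET}/results/results.csv", "."]           # download

for cmd in (copy_in, list_cmd, copy_out):
    print(" ".join(cmd))
```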
1. Go to BigQuery in the Google Cloud console and select the Google Cloud-native project you created above.
2. Select Create Dataset to the right of the project name.
3. In the dataset creation form, choose a unique dataset name and select the default table expiration.
In general, you would choose "Never". But if you are testing queries and saving those results as tables, you may generate a lot of tables that you don't want to keep (or pay for). To avoid having to clean up those tables at the end of the day, you can create a BigQuery dataset for test results that auto deletes its tables after your selected time period.
4. You will see your new BigQuery dataset in the Resources section on the far left.
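The dataset-creation steps above can also be sketched with the bq command-line tool, which takes the default table expiration in seconds. The project and dataset names are placeholders.

```python
# Create a BigQuery dataset whose tables auto-expire (placeholders throughout).
EXPIRATION_DAYS = 1
expiration_seconds = EXPIRATION_DAYS * 24 * 60 * 60  # bq expects seconds

mk_cmd = [
    "bq", "mk", "--dataset",
    f"--default_table_expiration={expiration_seconds}",
    "my-terra-project:my_test_dataset",  # PROJECT:DATASET
]
print(" ".join(mk_cmd))
```

Omit the `--default_table_expiration` flag if you want tables to be kept indefinitely ("Never" in the console form).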
Note: Before you can load data to BigQuery, you must have (at least) WRITE access to an existing BigQuery dataset. If you set up your own BigQuery dataset (above), you automatically have those permissions.
See an example notebook (Py 3 How to load data to BigQuery.ipynb) in a public Terra workspace.
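For a quick sense of what loading data involves, here is a sketch of a `bq load` invocation for a CSV file in Cloud Storage. The table, bucket, and schema below are placeholders, not values from the notebook.

```python
# Load a CSV from Cloud Storage into a BigQuery table (placeholders throughout).
load_cmd = [
    "bq", "load",
    "--source_format=CSV",
    "--skip_leading_rows=1",                       # skip the header row
    "my-terra-project:my_test_dataset.my_table",   # destination table
    "gs://my-external-bucket/data.csv",            # source file
    "name:STRING,value:FLOAT",                     # inline schema
]
print(" ".join(load_cmd))
```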
See this Google Cloud tutorial on running dsub jobs in Python.
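As a rough illustration of the shape of a dsub invocation, the sketch below builds a job that counts lines in an input file. The project, bucket paths, and region are placeholders; see the tutorial above for the authoritative walkthrough.

```python
# A minimal dsub job: count lines in an input file (placeholders throughout).
dsub_cmd = [
    "dsub",
    "--provider", "google-cls-v2",
    "--project", "my-terra-project",
    "--regions", "us-central1",
    "--logging", "gs://my-external-bucket/logging/",
    "--input", "INPUT_FILE=gs://my-external-bucket/data.csv",
    "--output", "OUTPUT_FILE=gs://my-external-bucket/out/line_count.txt",
    "--command", "wc -l ${INPUT_FILE} > ${OUTPUT_FILE}",
    "--wait",
]
print(" ".join(dsub_cmd))
```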
See this Google Cloud Quickstart on running Dataflow in Python.
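For orientation, launching a Beam pipeline on the Dataflow runner looks roughly like the sketch below, which runs Apache Beam's bundled wordcount example. The project, region, and bucket paths are placeholders; follow the Quickstart above for setup details.

```python
# Run Beam's wordcount example on Dataflow (placeholders throughout).
dataflow_cmd = [
    "python", "-m", "apache_beam.examples.wordcount",
    "--runner", "DataflowRunner",
    "--project", "my-terra-project",
    "--region", "us-central1",
    "--temp_location", "gs://my-external-bucket/tmp/",
    "--output", "gs://my-external-bucket/wordcount/output",
]
print(" ".join(dataflow_cmd))
```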