How to move data to/from a Google bucket (workspace or external)

Allie Hajian
  • Updated

Explore how to add data to - or download from - a Google bucket (your Workspace bucket or an external bucket). The best approach depends on how many files you have and what size they are, whether you're moving to or from local storage, and how comfortable you are with different options. 

Transfer between local storage and workspace bucket in Terra
(small numbers of small files) 

Uploading/downloading data in Terra

Only for transfers between workspace storage and local storage (e.g., laptop).

Recommended only for small numbers of small data files.


Note: This is the same kind of transfer as uploading or downloading a file on the internet. Because your local storage has no cloud-native "path", you can only transfer files stored on the system running your browser.

  • To upload a file

    1. Start from the workspace Data page.

    2. Select the "Files" icon on the lower left side.
    Screenshot of workspace data tab highlighting the files icon at the bottom left

    3. In the "files" section, click on the "+" button in the lower right corner.

  • To download a file from the Files section

    1. Start from the workspace Data page.

    2. Select the "Files" icon on the bottom of the left column (underneath "Other Data").

    3. Find the file you want to download. You may have to navigate down many levels of folders to reach it.

    4. Click on the file to download. This will open a pop-up window with multiple choices for copying the data, as well as the cost.

  • To download a file from a data table

    1. Start from the workspace Data page.

    2. Click on the table with the data to download on the left side of the screen. The example below is for the sample table.

    Any files available for download will be shown as a link in the sample row.
    Screenshot of sample table with two rows. In the first column is the unique sample ID (NA12878 and template_sample). In the second column (cram) is a clickable link to download the file.

    3. Click on a file link to open a pop-up window describing the size and cost of the download.

    4. Click on the “Download for $0.56” link to initiate the download. Note: This button starts the download immediately. You won't get another opportunity to verify before the download starts. However, you can cancel the download at any time during the process. 

    5. Repeat for any additional files you would like to download.

Transfer using gsutil (large files or large numbers of files)

When to use gsutil

  • Works well for all size transfers
  • Ideal for large file sizes or 1000s of files
  • Can be used for transfers between local storage and a bucket, workspace virtual machine (VM) or persistent disk and a Google bucket, as well as between Google buckets (external and workspace)

What is gsutil? gsutil is a Python application that lets you access Cloud Storage from the command line in a terminal.

The terminal you use can be run on your local machine (local instance) or built into the workspace Cloud Environment (workspace instance).

gsutil in a terminal - Step-by-step instructions

Depending on where your data are stored, you may need to run a particular instance of the terminal.

Diagram of three locations for data: local storage, workspace cloud storage (Google bucket) and Cloud Environment (VM disk or persistent disk). An arrow between the Cloud Environment and workspace storage shows that you can use workspace tools (the Cloud Environment terminal) to move or copy files between the Cloud Environment and workspace storage.

Step 1. Open gsutil in a terminal

You can run a terminal locally or in your workspace. Which one you use depends on where your data are located.

Which terminal instance should you use?

  • Moving data to or from the Cloud Environment VM/PD?
    Use the workspace terminal instance.

  • Moving data to or from local storage?
    Use a local terminal instance.

  • Google bucket to Google bucket transfer?
    You can use either instance.

  • Use for moving data to/from a cloud environment

    1.1. Look at the right side of any workspace page to see these icons, which lead to one of the best-kept secrets of Terra: a command line interface. Click on the (>_) icon to access what resembles a UNIX terminal.

    Screenshot of right sidebar with (from the top) the cloud environment rate, the cloud environment lightning logo, and the terminal logo

    1.2.  You will need to start a Cloud Environment first if one is not already running, as this is the virtual machine the terminal runs on.

    1.3. From here, you can perform command-line tasks including gsutil.

  • Use for moving data to/from local storage

    First open a terminal on your local machine. Then follow Google’s installation instructions for Cloud SDK or the directions below to install Google Cloud SDK, which includes gsutil.

    1.1. Run the following command using bash shells in your Terminal:
    curl https://sdk.cloud.google.com | bash 

    Or download google-cloud-sdk.zip or google-cloud-sdk.tar.gz and unpack it. Note: The command is only supported in bash shells.

    1.2. Restart your shell: exec -l $SHELL or open a new bash shell in your Terminal.

    1.3. Run gcloud init to authenticate.

    Verify gsutil installation

    Before uploading or downloading data using gsutil, first set a default project name using gcloud config set project MY_PROJECT. Then run gsutil ls to see all of the Cloud Storage buckets you have access to.

    Run gsutil ls -p PROJECT_NAME to list buckets for a specific project. You will need to have owner access to the project to run this command.
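As a quick pre-flight check before Step 2, a short shell function can confirm that both command-line tools are installed. This is only a sketch: check_cloud_sdk is a hypothetical helper name, not part of the Cloud SDK, and MY_PROJECT is the same placeholder used above.

```shell
# Hypothetical helper: confirm that gcloud and gsutil can be found by the shell.
# Prints "ok" and returns 0 when both exist; otherwise names the missing tool.
check_cloud_sdk() {
  for sdk_tool in gcloud gsutil; do
    if ! command -v "$sdk_tool" >/dev/null 2>&1; then
      echo "missing: $sdk_tool" >&2
      return 1
    fi
  done
  echo "ok"
}

# Example use (runs real commands, so it is left commented out here):
#   check_cloud_sdk && gcloud config set project MY_PROJECT && gsutil ls
```

If the function reports a missing tool, repeat the installation steps and restart your shell before continuing.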

Step 2. Run gsutil commands 

Once in a terminal (either on your local machine or in a Terra workspace), you can copy data from one place to another using the cp command:

gsutil cp WHERE_TO_COPY_DATA_FROM/FILENAME WHERE_TO_COPY_DATA_TO 

Additional details on the gsutil cp command can be found in the official Google gsutil documentation.

You must be an Owner or Writer to upload to a Google bucket, including the workspace bucket!
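For uploads you repeat often, the cp pattern above can be wrapped in a small shell function. The sketch below assumes you want to place one local file inside a folder in a bucket; upload_to_bucket and its argument order are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: copy one local file into a folder inside a bucket.
# Arguments: LOCAL_FILE BUCKET_URL FOLDER
upload_to_bucket() {
  up_file=$1
  up_bucket=${2%/}   # drop a trailing slash so the path has no "//"
  up_folder=$3
  case $up_bucket in
    gs://*) ;;
    *) echo "bucket URL must start with gs://" >&2; return 2 ;;
  esac
  # The trailing slash marks the destination as a folder, not an object name.
  gsutil cp "$up_file" "$up_bucket/$up_folder/"
}

# Example use (placeholder names, left commented out):
#   upload_to_bucket Example.bam gs://MY_GOOGLE_BUCKET gene_files
```

The gs:// check simply catches a mistyped bucket URL before gsutil runs; it does not verify that you have Owner or Writer access.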

  • To generate a manifest when uploading, use the -L option followed by the name of the manifest file to write.

    gsutil cp -L MANIFEST_FILE WHERE_TO_COPY_DATA_FROM/FILENAME WHERE_TO_COPY_DATA_TO
  • To copy the file "Example.bam" from an external bucket "gs://My_GCP_bucket" into the "gene_files" folder in a workspace bucket "gs://fc-7ac2cfe6-4ac5-4a00-add1-c9b3c84a36b7", use the command

    gsutil cp gs://My_GCP_bucket/Example.bam gs://fc-7ac2cfe6-4ac5-4a00-add1-c9b3c84a36b7/gene_files/

    Finding the full path to workspace bucket

    In Terra, you can find the full path to the workspace bucket by clicking the Clipboard icon on the right side of the workspace Dashboard.

    Screenshot of clipboard icon to copy the full path to the workspace bucket to the clipboard.

    Note: To download data to local storage, you must use gsutil in a terminal on your local machine.

    To download data from a bucket, reverse the order of the bucket URL and local file path:

    gsutil cp [bucket URL]/[file name] [local file path]

    Make sure to leave a space between the bucket URL and the local file path.

    gsutil cp gs://WorkspaceBucket/GeneFiles/example.bam /Users/Documents

    To download data from a bucket that is enabled with requester-pays, run the command this way.

    gsutil -u GOOGLE_BILLING_PROJECT cp gs://BUCKET_URL/FILE_NAME LOCAL_FILE_PATH

    To learn more about accessing files from a requester-pays enabled Google bucket, see the Google requester pays docs.

    Downloading folders

    If you're downloading folders, you'll need to use the -R flag to copy the folder and its contents:

    gsutil cp -R gs://EXAMPLE_BUCKET/FOLDER_1 LOCAL_FILE_PATH

    1. You can use the -m flag to copy the files in parallel.

    gsutil -m cp -R gs://EXAMPLE_BUCKET/FOLDER_1 LOCAL_FILE_PATH

    More gsutil instructions for working with large data can be found here, and an explanation of -m can be found here.

    2. It's also possible to maximize parallelization by configuring thread count using -o. Try this article about large file download optimization.

    gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' cp gs://BUCKET_URL/FILE_NAME LOCAL_FILE_PATH
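The folder, parallel, and requester-pays options described in this step can be combined in one command. The function below is a minimal sketch of that combination; download_folder is a hypothetical name, and the optional third argument stands in for the GOOGLE_BILLING_PROJECT placeholder.

```shell
# Hypothetical helper: download a bucket folder and its contents in parallel.
# Arguments: SOURCE_URL LOCAL_PATH [BILLING_PROJECT]
download_folder() {
  dl_src=$1
  dl_dest=$2
  dl_billing=${3:-}
  if [ -n "$dl_billing" ]; then
    # Requester-pays bucket: bill the named Google project for the transfer.
    gsutil -u "$dl_billing" -m cp -R "$dl_src" "$dl_dest"
  else
    gsutil -m cp -R "$dl_src" "$dl_dest"
  fi
}

# Example use (placeholder names, left commented out):
#   download_folder gs://EXAMPLE_BUCKET/FOLDER_1 LOCAL_FILE_PATH
#   download_folder gs://EXAMPLE_BUCKET/FOLDER_1 LOCAL_FILE_PATH GOOGLE_BILLING_PROJECT
```

Note that -u and -m are top-level gsutil options, so they go before cp, while -R belongs to the cp command itself.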

File validation / checksum generation

Per Google documentation: At the end of every upload or download the gsutil cp command validates that the checksum it computes for the source file/object matches the checksum the service computes. If the checksums do not match, gsutil will delete the corrupted object and print a warning message. This very rarely happens, but if it does, please contact gs-team@google.com.
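If you want to double-check a transfer yourself, you can compare the MD5 hash of a local file with the hash stored for the bucket object. The sketch below assumes that both gsutil hash -m and gsutil ls -L print a line containing "Hash (md5):" followed by the base64 value; extract_md5 and same_md5 are hypothetical helper names. Some objects (for example, composite objects) may have no MD5 recorded, in which case this check reports a mismatch.

```shell
# Hypothetical helper: pull the base64 MD5 value out of gsutil output on stdin.
extract_md5() {
  awk '/Hash \(md5\):/ { print $NF }'
}

# Hypothetical helper: succeed only if the local file and the bucket object
# report the same, non-empty MD5 hash.
# Arguments: LOCAL_FILE OBJECT_URL
same_md5() {
  md5_local=$(gsutil hash -m "$1" | extract_md5)
  md5_remote=$(gsutil ls -L "$2" | extract_md5)
  [ -n "$md5_local" ] && [ "$md5_local" = "$md5_remote" ]
}

# Example use (placeholder names, left commented out):
#   same_md5 example.bam gs://EXAMPLE_BUCKET/example.bam && echo "checksums match"
```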

Troubleshooting

The following are the most common errors our users encounter when moving data using gsutil. If you experience a different error, please contact frontline support and leave a comment on this article so we can update the information.

Cloud authorization error 

You may have trouble accessing your Terra workspaces if you authorized your Google Cloud SDK installation with a Google Account that is not registered in Terra and added to your workspace. You can verify which Google Account you've authorized with gcloud by running the following command: gcloud auth list.

  1. If the Google ID returned matches the one on your Terra workspace, you should be able to access your workspace. If you still can't, please contact your Project Manager.
  2. If the Google ID returned does not match the one on your Terra workspace, run the following command to specify the correct account:
    gcloud auth login GOOGLE_ACCOUNT
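The account comparison described above can also be scripted. This is a sketch under the assumption that you know which Google Account is registered in Terra; check_terra_account is a hypothetical helper name, and the --filter and --format options are used here only to print the active account on its own.

```shell
# Hypothetical helper: compare the active gcloud account with the expected one.
# Argument: the Google Account registered in Terra (for example, you@example.org)
check_terra_account() {
  expected_account=$1
  active_account=$(gcloud auth list --filter=status:ACTIVE --format="value(account)")
  if [ "$active_account" = "$expected_account" ]; then
    echo "match"
  else
    echo "mismatch: gcloud is using '${active_account:-no account}'" >&2
    return 1
  fi
}

# Example use (placeholder account, left commented out):
#   check_terra_account GOOGLE_ACCOUNT || gcloud auth login GOOGLE_ACCOUNT
```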

gsutil errors on Unix

When working on a Unix system, you need to tell it not to try to start a browser. Once you do that, you should receive a URL you can paste into your desktop browser.

To tell the system not to start a browser, use the command

gcloud auth login --no-launch-browser
