
Using Data from Google Bucket in R Jupyter Notebook or changing work directory


9 comments

  • Sushma Chaluvadi

    Hello,

    I haven't tested this myself just yet, but this Notebook has a section, "Option 2: Save file as tsv to your workspace bucket," that shows how to copy a file BACK to your workspace bucket. Perhaps try reversing the source and destination strings in:

    # copy tsv to workspace bucket; ws_bucket holds the gs:// path of the workspace bucket
    system(str_glue('gsutil cp traits_for_analysis.tsv {ws_bucket}/tsv-objects/ 2>&1'), intern = TRUE)
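
    Reversed, a minimal sketch of the bucket-to-notebook copy might look like this (the file and folder names are just the ones from the example above, not anything specific to your workspace):

    # copy the tsv from the workspace bucket back into the notebook's working directory
    system(str_glue('gsutil cp {ws_bucket}/tsv-objects/traits_for_analysis.tsv . 2>&1'), intern = TRUE)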

    Possibly useful!

  • Dan Spagnolo

    Doesn't seem to be doing what I need.

    It's not clear to me how to load data sitting in a Google Bucket into a Terra R notebook, e.g. reading gs://my-directory/file with something as simple as read.table(). This would be quite easy to do in a corresponding Python notebook.

    All the guides I am seeing in the Terra documentation reference BigQuery, which I am not using. I just have a standard Google Bucket.

  • Sushma Chaluvadi

    I'm not sure that the below still gets you exactly what you had in mind, but I was able to read a file named `inputs.yaml` from my Workspace bucket into the R kernel Notebook Runtime (renamed as inputs_R.yaml) with the following commands:

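    A minimal sketch of commands along those lines, assuming str_glue() is already available in the session and that the runtime exposes the workspace bucket path in the WORKSPACE_BUCKET environment variable:

    # gs:// path of the workspace bucket, as exposed on Terra notebook runtimes
    ws_bucket <- Sys.getenv('WORKSPACE_BUCKET')
    # copy inputs.yaml from the workspace bucket to the notebook's local disk, renaming it inputs_R.yaml
    system(str_glue('gsutil cp {ws_bucket}/inputs.yaml inputs_R.yaml 2>&1'), intern = TRUE)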

  • Dan Spagnolo

    Thanks! This looks like it is working, though I had to hard-code the bucket directory since I don't have str_glue() installed and wasn't sure what library it is part of.
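
    What I mean is something along these lines, with my real bucket path replaced by a placeholder and base R paste0() standing in for str_glue():

    # hard-coded workspace bucket path (placeholder)
    ws_bucket <- 'gs://my-workspace-bucket'
    # build the gsutil command with base R paste0() instead of str_glue()
    system(paste0('gsutil cp ', ws_bucket, '/inputs.yaml inputs_R.yaml 2>&1'), intern = TRUE)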

    So is this copying the file from the Google bucket to the virtual disk that is running the notebook? Is there a way to bypass doing so and just setwd() to a Google bucket? One of the files I need to use is quite large. In my Python notebook I just use the bucket directory as part of the file path, and it seems to work without anything being copied to the notebook directory.

    The file is also not in my Terra Google bucket, though I can copy it there if needed.

  • Sushma Chaluvadi

    I think library(readr) is the one you need for str_glue(); at least, that is the one that I had to import to get the str_glue() command to work without an error!


    Yes, effectively, these commands copy the file from the Terra workspace bucket into the Notebook runtime environment's persistent disk. At this time there isn't a way to bypass that, though I believe there is work planned to find a way to "mount" buckets to the runtime environments, which would remove the intermediate copying step.

    If your file is in a non-Terra Google bucket you *should* be able to copy it with the same commands. You do have to grant your Proxy Group (found in the "Profile" section from the upper left-hand "hamburger" menu) Storage Object Viewer permissions on your external bucket.
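
    For example, whoever administers the external bucket could grant that access with a one-line gsutil command, sketched here wrapped in system() like the other snippets (the proxy group address and bucket name are placeholders):

    # grant the Terra proxy group read access to objects in the external bucket
    # (must be run by someone who can change that bucket's IAM policy)
    system('gsutil iam ch group:PROXY_GROUP_ADDRESS@firecloud.org:objectViewer gs://my-external-bucket', intern = TRUE)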

    Re: "In my Python notebook I just use the bucket directory as part of the file path and it seems to work without anything copying to the notebook directory." I'm not quite sure what you mean by this, but if you can paste a line of code that might help. It will also allow the Terra support team to share ideas from their end!

  • Sushma Chaluvadi

    I was re-running these cells today and realized that I needed to add library(stringr) as well. Just a note in case there are errors when using the above code.

  • Jason Cerrato

    Hi Dan Spagnolo,

    Just wanted to check in here to see how things were going. Were Sushma's suggestions helpful in moving you forward?

    Kind regards,

    Jason

  • Dan Spagnolo

    Yes, Sushma's suggestions helped me greatly. Thanks Sushma Chaluvadi! I did have to use stringr as you mentioned, not readr.

    Is this the standard way to access data in an R notebook, Jason Cerrato?

    My confusion was that in a Python notebook, if my gsutil link is gs://my-directory and I set bucket = "gs://my-directory", then I could simply use open(bucket+my_file), or, in the case of importing files in Hail, import_bgen(bucket+my_file).

    R notebooks have the extra step of copying the needed files to the Notebook runtime environment's persistent disk. One of the files I need to use is 5 GB, and I got a warning that I should install and use crcmod with gsutil, but I was luckily able to download it without that install.
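
    In case that warning becomes a blocker for anyone else, a minimal sketch of installing crcmod from inside the notebook (assuming pip3 is on the runtime's PATH; the compiled extension may also need a C compiler on the image):

    # install the compiled crcmod module so gsutil can checksum large transfers quickly
    system('pip3 install --user -U crcmod 2>&1', intern = TRUE)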

  • Jason Cerrato

    Hi Dan,

    Sushma's suggestion is probably the best way to go about it at this time, as there isn't really a standard way of doing this in R. That said, our notebooks product manager has some plans to investigate ways we can create/support mechanisms for importing data down the road.

    Kind regards,

    Jason


