Using Data from Google Bucket in R Jupyter Notebook or changing work directory

Post author
Dan Spagnolo

Note: I am not looking to copy files to the notebook disk per https://support.terra.bio/hc/en-us/articles/360046617372-Analyzing-data-from-a-workspace-bucket-in-a-notebook

I would like to load a dataframe that lives in a Google Bucket I own and have used before with Terra Notebooks. In a Python-based environment, it seems I can just use "gs://my-directory/file", but this doesn't seem to work in R.

I also tried to use setwd(google_bucket) and got the error message:

Error in setwd(google_bucket): cannot change working directory



Any tips?

Comments

13 comments

  • Comment author
    Sushma Chaluvadi

    Hello,

    I haven't tested this myself just yet, but this Notebook has a section, "Option 2: Save file as tsv to your workspace bucket," with an example of how to copy a file BACK to your workspace bucket. Perhaps try reversing the source and destination strings in:

    #copy tsv to workspace bucket
    system(str_glue('gsutil cp traits_for_analysis.tsv {ws_bucket}/tsv-objects/ 2>&1'), intern = TRUE)

    Possibly useful!

    0
  • Comment author
    Dan Spagnolo

    Doesn't seem to be doing what I need.

    It's not clear to me how to load data sitting in a Google Bucket into a Terra R notebook, e.g. reading gs://my-directory/file with something as simple as read.table(), which would be quite easy to do in a corresponding Python notebook.

    All the guides I'm seeing in the Terra documentation reference BigQuery, which I'm not using. I just have a standard Google Bucket.

    1
  • Comment author
    Sushma Chaluvadi

    I'm not sure that the below gets you exactly what you had in mind, but I was able to read a file named `inputs.yaml` from my workspace bucket into the R-kernel notebook runtime (renamed inputs_R.yaml) with the following commands:
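    A sketch of what those commands likely were, based on the follow-up comments in this thread (library(stringr) for str_glue(), plus gsutil cp); the WORKSPACE_BUCKET environment variable is an assumption about the Terra notebook runtime, and hard-coding the gs:// path works just as well:

    ```r
    library(stringr)  # provides str_glue()

    # Terra notebook runtimes export the workspace bucket path in an
    # environment variable (assumption; hard-code the gs:// path otherwise).
    ws_bucket <- Sys.getenv("WORKSPACE_BUCKET")

    # Copy the file from the workspace bucket to the runtime's local disk,
    # renaming it along the way.
    system(str_glue("gsutil cp {ws_bucket}/inputs.yaml inputs_R.yaml 2>&1"),
           intern = TRUE)
    ```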


    0
  • Comment author
    Dan Spagnolo

    Thanks! This looks like it is working, though I had to hard-code the bucket directory since I don't have str_glue() available and wasn't sure which library it comes from.

    So is this copying the file from the Google bucket to the virtual disk that is running the notebook? Is there a way to bypass that and just setwd() to a Google bucket? One of the files I need to use is quite large. In my Python notebook I just use the bucket directory as part of the file path, and it seems to work without copying anything to the notebook directory.

    It is also not in my Terra Google bucket, though I can copy it there if needed.

    0
  • Comment author
    Sushma Chaluvadi

    I think library(readr) is the one you need for str_glue() -- at least, that is the one I had to import to get the str_glue() command to work without an error!


    Yes, effectively these commands copy the file from the Terra workspace bucket into the notebook runtime environment's persistent disk. At this time there isn't a way to bypass that, though I believe there is work planned to find a way to "mount" buckets to the runtime environments, which would remove the intermediate copying step.
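    Put together, the copy-then-read pattern from this thread looks roughly like the following; the bucket path and file name are illustrative, not from the original post:

    ```r
    library(stringr)  # provides str_glue()

    bucket <- "gs://my-directory"  # illustrative bucket path

    # Copy the file from the bucket to the runtime's persistent disk...
    system(str_glue("gsutil cp {bucket}/file.tsv . 2>&1"), intern = TRUE)

    # ...then read it locally with base R, as you would any on-disk file.
    df <- read.table("file.tsv", header = TRUE, sep = "\t")
    ```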

    If your file is in a non-Terra Google bucket, you *should* be able to copy it with the same commands. You do have to grant your Proxy Group (found in the "Profile" section from the upper-left "hamburger" menu) Storage Object Viewer permission on your external bucket.
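    As a sketch, granting that permission could look like the command below; the proxy group address and bucket name are placeholders (copy your real proxy group address from the Terra Profile page), and it must be run by someone with owner rights on the bucket:

    ```r
    # Hypothetical one-time setup: grant the Terra proxy group read access
    # to objects in an external bucket. Replace both placeholders.
    system(paste(
      "gsutil iam ch",
      "group:PROXY_example@firecloud.org:roles/storage.objectViewer",
      "gs://my-external-bucket",
      "2>&1"
    ), intern = TRUE)
    ```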

    Re: "In my Python notebook I just use the bucket directory as part of the file path and it seems to work without anything copying to the notebook directory" -- I'm not quite sure what you mean by this, but if you can paste a line of code, that might help. It will also allow the Terra support team to share ideas from their end!

    0
  • Comment author
    Sushma Chaluvadi

    I was re-running these cells again today and realized that I needed to add library(stringr) as well. Just a note in case there are errors when using the above code.

    0
  • Comment author
    Jason Cerrato

    Hi Dan Spagnolo,

    Just wanted to check in here to see how things were going. Were Sushma's suggestions helpful in moving you forward?

    Kind regards,

    Jason

    0
  • Comment author
    Dan Spagnolo

    Yes, Sushma's suggestions helped me greatly. Thanks Sushma Chaluvadi! I did have to use stringr as you mentioned, not readr.

    Is this the standard way to access data in an R notebook, Jason Cerrato?

    My confusion was that, in a Python notebook, if my gsutil link is gs://my-directory and I set bucket = "gs://my-directory", then I can simply use open(bucket + my_file), or, when importing files in Hail, import_bgen(bucket + my_file).

    R notebooks have the extra step of copying the needed files to the notebook runtime environment's persistent disk. One of the files I need is 5 GB, and I got a warning that I should install the crcmod module for gsutil, but luckily I was able to download the file without it.

    0
  • Comment author
    Jason Cerrato

    Hi Dan,

    Sushma's suggestion is probably the best way to go about it at this time, as there isn't really a standard way of doing this in R. That said, our notebooks product manager has plans to investigate ways we can create and support mechanisms for importing data down the road.

    Kind regards,

    Jason

    0
  • Comment author
    Aravind Easwar

    Do we still have to copy files from the Google bucket, or is there a way to access them without copying them to the persistent disk?

    0
  • Comment author
    Samantha (she/her)

    Hi Aravind Easwar,

    Thanks for your question. It would depend on what you're doing with the files. Can you explain what you are ultimately trying to achieve without copying them?

    Best,

    Samantha

    0
  • Comment author
    Aravind Easwar

    Hi Samantha (she/her),

    Apologies for the late reply. I would like to use the abundance files generated by kallisto for further analysis with sleuth. The abundance files are in the Google bucket. For the sleuth analysis I'm using RStudio, which uses a persistent disk for storage.

    Thanks,

    Aravind Easwar

    0
  • Comment author
    Samantha (she/her)

    Hi Aravind Easwar,

    In that case, you may be interested in the AnVIL R package, which lets you interact with Google buckets. You can see an example of how to use AnVIL in RStudio in this video: https://youtu.be/JAcCtTkkvJw?t=125
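    For instance, copying a kallisto abundance file to the persistent disk and reading it for sleuth might look like the sketch below; the function names come from the Bioconductor AnVIL package as I understand them, and the file path is illustrative, so verify both against the package documentation:

    ```r
    # install.packages("BiocManager"); BiocManager::install("AnVIL")
    library(AnVIL)

    bucket <- avbucket()   # gs:// path of the current workspace bucket
    gsutil_ls(bucket)      # list the files available in the bucket

    # Copy an abundance file (path illustrative) to the persistent disk,
    # then read it locally for use with sleuth.
    gsutil_cp(paste0(bucket, "/kallisto/abundance.tsv"), ".")
    abund <- read.delim("abundance.tsv")
    ```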

    I hope this helps. Let me know if you have any other questions.

    Best,

    Samantha


    0

Please sign in to leave a comment.