AMP-PD access large data

Post author
Katie Saund

Hi, 

I am working with the AMP-PD transcriptomics data. My goal is to save a sample x gene count matrix in my workspace bucket that I can use for RNAseq analyses with limma. I can easily create such a matrix for a small dataset by querying part of the `amp-pd-research:2021_v2_5release_0510_transcriptomics.feature_counts` table from a Jupyter notebook or RStudio session in Terra. I kept the dataset small by adding a LIMIT clause to the query. 

Ideally, I want to get all of the PPMI RNAseq feature counts from the Month 0 clinic visit. This is a lot of data. I've tried the methods listed below to get the data into R or RStudio, but they have all failed. Any advice on how to work with data this large in the Terra ecosystem? 

  1. Within RStudio, I queried the `amp-pd-research:2021_v2_5release_0510_transcriptomics.feature_counts` table and subset it to visit.month = 0 and study = PPMI. The result is so large that bq_query() fails both in RStudio and when run as an Rscript from the terminal. I get the following error:

    “Error: Exceeded rate limits: Your project:456925683206 exceeded quota for tabledata.list bytes per second per project. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors [rateLimitExceeded] Try increasing the `page_size` value of `bq_table_download()`”

    As suggested in the error message, I tried to fix this problem by increasing the page_size. I tried several different values, but none worked. If I picked a page_size equal to or smaller than the number of rows, I got the same error as quoted above. If I picked a value larger than the number of rows, I got a different error, and the function still failed to pull in the data. 

  2. I next tried to run the query interactively in the BigQuery editor. This was an improvement because at least the query completed: I got a result table with ~260,000,000 rows. I tried to save this output to a CSV file, which is an option available in BigQuery. Unfortunately, the table is >1GB, which is larger than the maximum allowable file size. The Google Cloud documentation mentions the ability to save larger results across multiple CSV files, but I did not see a way to do so in the BigQuery interface for my particular results.

  3. Finally, I tried to work with the aggregated.featureCounts.tsv file in the amppd Google Cloud bucket. I figured I would move this file to my workspace bucket, load it into R, and then filter it down to just the PPMI samples from Month 0 visits. To work with such a large file (~260GB), I increased the machine size to the maximum available: 624GB. I used the provided gcs_read_file function to read in the file. A progress message prints as the function runs, and it appeared to index the file without problems. Then, finally, it said “indexed 303.35GB in 26m, 198.10MB/s” and produced this error message: “Error: Error occurred during transmission”. I tried once more and it failed again, this time after indexing 302GB. 

Any advice on how to get all of the PPMI Month 0 RNAseq data into a genes x samples count matrix in RStudio, an R Jupyter Notebook, or my workspace bucket in the Terra ecosystem? 

Thank you,

Katie

Comments

  • Comment author
    Katie Saund

    Solution:  

    I was able to solve this issue by looping through the sample IDs in batches. On each pass through a for loop in R, I queried only a subset of the sample IDs, so each query result stayed small enough to download. I converted each query result from long to wide format, then concatenated the wide data frames into the final gene x sample count matrix.
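
The batching pattern described above can be sketched roughly as follows. This is only an illustration in plain Python, not the R code that was actually run: `fetch_batch` is a hypothetical stand-in for the real BigQuery call (which would restrict the query to the sample IDs in the current batch), and the sample IDs, gene IDs, counts, and batch size are made up.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# One long-format record: (sample_id, gene_id, count). The field names are assumptions.
LongRow = Tuple[str, str, int]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def long_to_wide(rows: Iterable[LongRow]) -> Dict[str, Dict[str, int]]:
    """Pivot long rows into {gene_id: {sample_id: count}}."""
    wide: Dict[str, Dict[str, int]] = defaultdict(dict)
    for sample_id, gene_id, count in rows:
        wide[gene_id][sample_id] = count
    return dict(wide)


def build_count_matrix(
    sample_ids: List[str],
    fetch_batch: Callable[[List[str]], Iterable[LongRow]],
    batch_size: int = 50,
) -> Dict[str, Dict[str, int]]:
    """Query one batch of samples at a time and merge the wide results."""
    matrix: Dict[str, Dict[str, int]] = defaultdict(dict)
    for batch in chunked(sample_ids, batch_size):
        wide = long_to_wide(fetch_batch(batch))
        # Each batch contributes new sample columns to every gene row.
        for gene_id, counts in wide.items():
            matrix[gene_id].update(counts)
    return dict(matrix)


def fake_fetch(batch: List[str]) -> List[LongRow]:
    """Hypothetical stand-in for the real query; returns made-up counts."""
    genes = ["ENSG_A", "ENSG_B"]
    return [(s, g, len(s) + i) for s in batch for i, g in enumerate(genes)]


samples = ["S1", "S2", "S3", "S4", "S5"]
matrix = build_count_matrix(samples, fake_fetch, batch_size=2)
```

With a real query function in place of `fake_fetch`, the batch size becomes the knob to tune: smaller batches keep each download small, at the cost of running more queries.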
