I'm the UCSC guy making a featured workspace involving the TOPMed aligner prior to go-live. I'm testing it by running it on the 1000 Genomes data available on Terra. That data has CRAM files but no CRAI files, which the aligner needs, so I have to generate the CRAI files with samtools or another tool before I can run the aligner.
If I put the CRAM files in my workspace bucket, samtools cannot open them.
!samtools index gs://[path to cram]
[E::hts_open_format] fail to open file 'gs://[path to cram]'
samtools index: failed to open "gs://[path to cram]": Protocol not supported
I'm not sure whether this is because the htslib in Terra's version of samtools was built without support for gs:// paths or because the notebook bucket can't "see" the workspace bucket. So I tried copying the CRAM into the notebook bucket using gsutil cp and got the following error output.
==> NOTE: You are downloading one or more large file(s), which would run significantly faster if you enabled sliced object downloads. This feature is enabled by default but requires that compiled crcmod be installed (see "gsutil help crcmod").
CommandException: Downloading this composite object requires integrity checking with CRC32c, but your crcmod installation isn't using the module's C extension, so the hash computation will likely throttle download performance. For help installing the extension, please see "gsutil help crcmod". To download regardless of crcmod performance or to skip slow integrity checks, see the "check_hashes" option in your boto config file.
I can't install crcmod without root permissions. Does anyone have any ideas? If it doesn't involve samtools, that's fine; I just need some way of indexing these CRAM files. I've considered using a custom Docker container with an installation of samtools that I know contains htslib, but that won't help if the real issue is that I can't transfer the CRAM into the notebook bucket.
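For reference, here is a minimal sketch of the workaround the gsutil message itself hints at: override the "check_hashes" boto option on the command line so the copy proceeds without compiled crcmod, index the local copy, then copy the index back next to the CRAM. This is an untested assumption about what might work in the notebook environment, not a confirmed fix. The function names `run` and `index_cram`, the `DRY_RUN` switch, and the choice to write the index to `<cram>.crai` beside the original are all made up for illustration.

```shell
# Sketch only. DRY_RUN=1 (the default here) prints each command instead of
# running it, so the steps can be reviewed before touching any bucket.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper. $1 is the gs:// URI of a CRAM (supply your own path).
index_cram() {
  ic_uri="$1"
  ic_local="$(basename "$ic_uri")"

  # Tell gsutil to skip the CRC32c check when compiled crcmod is unavailable.
  # This trades integrity checking for the ability to download at all.
  run gsutil -o "GSUtil:check_hashes=if_fast_else_skip" cp "$ic_uri" "$ic_local" || return 1

  # Index the local copy, so samtools never has to open a gs:// path.
  run samtools index "$ic_local" || return 1

  # samtools names the CRAM index <file>.crai; copy it beside the original.
  run gsutil cp "${ic_local}.crai" "${ic_uri}.crai"
}
```

Calling `index_cram` with a CRAM URI in dry-run mode prints the three commands; setting `DRY_RUN=0` would actually run them. Because the download's integrity check is skipped, a corrupted transfer would go unnoticed, so this only makes sense for test data.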
The workspace name is "TOPMed Alignment and Freeze8 Variant Calling" and has a Jupyter notebook with my process.