Pysam and SSL certificate problem
Hi All,
I've been trying to get this to work on Terra and I'm a little frustrated because it hasn't been easy.
I have a Python program that uses pysam, which is a samtools wrapper. Samtools helps read, write and manipulate BAM and SAM files. I need to use pysam to read RNA-seq reads aligned to given positions in various BAM files and process them, and I wanted to avoid localizing the BAM files to run the task. So I created a Docker image based on google/cloud-sdk, installed samtools and then pysam, and in the command section of the task in the WDL I used:
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
samtools view -h $bam
This works happily. I can see the header of the BAM file, and the BAM file hasn't been localized, which saves time and disk space.
However, attempting to use pysam to read the same BAM file fails with an error related to SSL certificate verification (Problem with the SSL CA cert):
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[E::hts_open_format] Failed to open file "gs://fc-secure-29fcb143-0827-4430-b92f-3fc3cdc76cb7/Samples/SRR7300571/Aligned.sortedByCoord.out.bam" : Input/output error
gs://fc-secure-29fcb143-0827-4430-b92f-3fc3cdc76cb7/Samples/SRR7300571/Aligned.sortedByCoord.out.bam
Traceback (most recent call last):
  File "/Scripts/Read_bam_file_pysam.py", line 37, in <module>
    main(sys.argv[1:])
  File "/Scripts/Read_bam_file_pysam.py", line 30, in main
    bam_file_ref = pysam.AlignmentFile(bam_file, "rb")
  File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 947, in pysam.libcalignmentfile.AlignmentFile._open
OSError: [Errno 5] could not open alignment file `gs://fc-secure-29fcb143-0827-4430-b92f-3fc3cdc76cb7/Samples/SRR7300571/Aligned.sortedByCoord.out.bam`: Input/output error
So samtools view -H on the BAM file gs://fc-secure-29fcb143-0827-4430-b92f-3fc3cdc76cb7/Samples/SRR7300571/Aligned.sortedByCoord.out.bam works,
but the same operation through pysam throws the error: Problem with the SSL CA cert (path? access rights?)
Does anyone know how to solve this problem, so that I can use pysam as seamlessly as samtools?
Thank you in advance,
MaVi
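For readers hitting the same message: libcurl error 77 means libcurl could not read the CA certificate file it was configured to use. One possible cause, not confirmed anywhere in this thread, is that the libcurl bundled with the pysam package expects a CA bundle at a path that does not exist in the Docker image. The sketch below assumes that htslib's libcurl backend honors the CURL_CA_BUNDLE environment variable, which should be checked against the htslib version in use. The candidate paths and the helper names find_ca_bundle and open_remote_bam are illustrative, not part of pysam.

```python
import os

# Common CA bundle locations; which one exists depends on the base image.
CANDIDATE_CA_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",  # Debian, Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",    # RHEL, CentOS, Fedora
    "/etc/ssl/cert.pem",                   # Alpine, macOS
)


def find_ca_bundle(candidates=CANDIDATE_CA_BUNDLES):
    """Return the first candidate path that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def open_remote_bam(url, candidates=CANDIDATE_CA_BUNDLES):
    """Point libcurl at a CA bundle (assumption: htslib reads
    CURL_CA_BUNDLE), then open the BAM with pysam."""
    if "CURL_CA_BUNDLE" not in os.environ:
        bundle = find_ca_bundle(candidates)
        if bundle is not None:
            os.environ["CURL_CA_BUNDLE"] = bundle
    # Imported late so the variable is already set when pysam is loaded.
    import pysam
    return pysam.AlignmentFile(url, "rb")
```

GCS_OAUTH_TOKEN still has to be exported as in the WDL command above. The same effect can be had by exporting CURL_CA_BUNDLE in the command section before calling the script.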
Comments
Hi MaVi,
Thank you for writing in about this issue. Can you share the workspace where you are seeing this issue with Terra Support by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.
Please provide us with the workspace URL, the submission ID, and the workflow ID.
If possible, can you also share the contents of your Dockerfile?
We’ll be happy to take a closer look as soon as we can!
Kind regards,
Samantha
Hi Samantha,
Thank you so much for your reply!
I have shared the workspace: WDLs_Tests
Here is the other information:
1. https://app.terra.bio/#workspaces/broad-firecloud-cptac/WDLs_Tests
2. Submission ID: d0cd7a2f-6187-4a39-8e92-af40e450b83e
3. Workflow ID: 29fcb143-0827-4430-b92f-3fc3cdc76cb7 (Read_Bam_File)
The Dockerfile is:
And I have a second one that calls the first one and copies the python script:
Once again, thank you so much !!!
MaVi
Hi MaVi,
Are you able to run that pysam Python script in your local terminal? In other words, is pysam able to access that file when you run it outside of Terra?
Best,
Samantha
Hi Samantha,
Thank you for taking the time to investigate this issue.
Outside Terra, yes! Pysam is able to access the file.
Best,
MaVi
Hi Samantha,
I wanted to tell you that I managed to create a new Docker image that integrates samtools and pysam. Now I can query BAM files in the Google bucket with both, without localizing them. I can inspect a BAM file at a few locations in the genome and it works correctly.
However, I have run into a new problem, and I think this time Terra is the culprit. When I have thousands of locations to investigate (using multiprocessing to speed up the search), I get the following error:
My hypothesis is that there is a firewall or something similar in Terra that closes the connection when there are many requests to a Google bucket. Please let me know if you have an idea how to prevent this new problem, so that I can finally get my program working on Terra and have happier holidays.
Thank you so much in advance!!
Best,
MaVi
Hi Maria,
Sorry for the delayed response. Regarding your new issue, it does seem like it could be due to the number of requests being made. Can you try setting your max_retries value to 1 or higher so the task gets retried if it fails? I see that the value is currently set to 0 by default in your WDL.
Best,
Samantha
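A task-level retry reruns the whole task. A different technique, not suggested in the thread, is to retry each failed read inside the script with exponential backoff, so that one dropped connection does not fail a run with thousands of queries. The sketch below is a minimal version under stated assumptions: that connection failures surface as OSError (as in the traceback earlier in this thread), and that each worker reopens the BAM file rather than sharing a handle across processes. The names with_retries and count_reads are hypothetical.

```python
import random
import time


def with_retries(func, attempts=5, base_delay=1.0,
                 retry_on=(OSError,), sleep=time.sleep):
    """Call func() until it succeeds or `attempts` calls have failed.

    The wait doubles after each failure, with random jitter so that
    parallel workers do not all retry at the same moment.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def count_reads(bam_url, contig, start, end):
    """Hypothetical worker: count reads overlapping one region."""
    import pysam  # imported here so with_retries has no pysam dependency

    def attempt():
        # Reopen on every attempt so a dropped connection is not reused.
        with pysam.AlignmentFile(bam_url, "rb") as bam:
            return bam.count(contig, start, end)

    return with_retries(attempt)
```

count_reads needs the BAM index to be reachable next to the file. Lowering the number of multiprocessing workers is a complementary knob if the failures track the request rate.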
Hi Samantha,
I hope you had great holidays!
I came back to this post as I am still facing the same problem. I tried max_retries, but the same error appears.
So I was wondering whether my problem is related to a quota (https://cloud.google.com/apis/docs/capping-api-usage) that is reached when several connections request information from the Google bucket.
Please let me know if this could be the problem, and how to increase the number of connections that can be requested!
Thank you so much in advance,
MaVi
Hi MaVi,
Can you please provide the submission ID for the recently failed jobs so we can take a closer look?
Best,
Samantha
Hi Samantha,
Sure, here is the submission ID: 09feba12-4fd3-43a4-a3da-92ba1210c22b
You can see it in the last task that was executed, call-module_2_counting_and_normalization/
Shard 2 got the message:
And after that, two of the other shards got the message:
Probably because the connection was also lost at some point.
Thank you so much for looking into this.
Best,
MaVi
Hi MaVi,
Are you running the workflow in a different workspace? I'm not able to find that submission ID in the workspace you previously shared, https://app.terra.bio/#workspaces/broad-firecloud-cptac/WDLs_Tests.
If you are using a different workspace, can you please share it with Terra Support and provide the workspace name?
Thanks,
Samantha
Hi Samantha,
Yes, indeed. I share the details of this workspace below. I am sorry for the confusion.
Here is the other information:
1. https://app.terra.bio/#workspaces/broad-firecloud-cptac/BamQuery
2. Submission ID: 09feba12-4fd3-43a4-a3da-92ba1210c22b
3. Workflow ID: 5ee6127a-9b01-4547-b385-74b62d455f79
Thanks so much again!
MaVi
Just chiming in here - I ran into the same issue with pysam (though outside of Terra), and it looks like this started happening with pysam 0.22.0. I'm not seeing the same SSL cert issue with pysam 0.21.0.
Hi Julia Kodysh
Thanks for your reply!
I just want to be sure that I understood your message correctly.
Did you encounter the same problem (SSL cert) with pysam 0.22.0? Or any of the other problems that I have posted here? :'(
Hi Samantha (she/her)
I wanted to know if you have any news regarding the last post about the connection closure.
Thanks!
MaVi
Yeah we encountered the exact same error with pysam 0.22.0 that went away when we downgraded to 0.21.0.
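To check whether an environment has the version reported here as failing, a script can read the installed pysam version at startup. This is a small sketch; the only version facts it encodes are the two reported in this thread (0.22.0 failing, 0.21.0 working), and the helper names are hypothetical.

```python
import re
from importlib import metadata

REPORTED_FAILING = (0, 22, 0)  # reported in this thread as showing the SSL error
REPORTED_WORKING = (0, 21, 0)  # reported in this thread as not showing it


def version_tuple(version):
    """Turn a string such as '0.22.0' into a tuple of ints, (0, 22, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def installed_pysam_version():
    """Return the installed pysam version string, or None if not installed."""
    try:
        return metadata.version("pysam")
    except metadata.PackageNotFoundError:
        return None


def matches_reported_failure(version):
    """True if `version` is the release reported as failing in this thread."""
    return version_tuple(version) == REPORTED_FAILING


if __name__ == "__main__":
    found = installed_pysam_version()
    if found is None:
        print("pysam is not installed")
    elif matches_reported_failure(found):
        print(f"pysam {found}: reported in this thread as hitting the SSL CA cert error")
    else:
        print(f"pysam {found}")
```

Pinning the version in the Dockerfile, for example with pip install pysam==0.21.0, is the downgrade described above.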
Hi Maria Virginia Ruiz Cuevas,
I believe the connection error might be coming from the python script you're running in your task, rather than Terra itself. Would it be possible to share the contents of that script to see exactly what commands are being run?
Best,
Samantha