Paired-Fastq-to-Unmapped-Bam can't open my fastq.gz files
Hi, Samantha (she/her), I'm having trouble completing a run of this sequence format converter workflow when using my own data. My fastq files are on google cloud, and I'm loading the data through tables as recommended.
It says it exited with return code 3, which has not been declared as a valid return code. I believe the basis of the error is that the program doesn't recognize my file as a gzip file:
htsjdk.samtools.SAMException: Error opening file: /cromwell_root/jc_genome/bk1.fastq.gz
    at htsjdk.samtools.util.IOUtil.openGzipFileForReading(IOUtil.java:729)
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:695)
    at htsjdk.samtools.util.IOUtil.openFileForBufferedReading(IOUtil.java:1007)
    at htsjdk.samtools.util.IOUtil.openFileForBufferedReading(IOUtil.java:1002)
    at htsjdk.samtools.fastq.FastqReader.<init>(FastqReader.java:77)
    at picard.sam.FastqToSam.fileToFastqReader(FastqToSam.java:432)
    at picard.sam.FastqToSam.doWork(FastqToSam.java:322)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.util.zip.ZipException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at htsjdk.samtools.util.IOUtil.openGzipFileForReading(IOUtil.java:726)
    ... 11 more
Does this mean the files must be gzipped for the program to work? There is another support thread here where the response to a similar issue indicates GATK workflows will not accept gzipped fasta files, so I assumed that would also be the case here. My files are stored on Google Cloud as gzipped files, but I set up the bucket to serve them uncompressed in order to accommodate that. It looks like I need to change that setting and serve the compressed file to this workflow? Or is there actually some other issue I'm not understanding, based on that error code?
It would be very helpful for beginners like me if the instructions explicitly stated which compression formats are compatible or required for the workflows.
My last run failed because my dates were not ISO formatted. Strict metadata formatting requirements like this would be another useful thing to know upfront. (And especially since that error isn't triggered until after it spends a couple hours of runtime transferring files.)
If these feel like standard things everyone should be able to just assume, please understand I'm brand new to cloud computing and genomic analysis! There is a pretty steep learning curve to get from understanding how things work in theory to actually getting any bit of a pipeline operational.
Comments
Hi Alex,
Thank you for writing in about this issue. Can you share the workspace where you are seeing this issue with Terra-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.
Please provide us with
We’ll be happy to take a closer look as soon as we can!
Kind regards,
Samantha
Hi! Thank you so much for getting back to me so quickly. I'm sorry I did not respond as quickly! I got overwhelmed at work and let this fall off my radar last night.
I have added Terra support to the project.
https://app.terra.bio/#workspaces/PracticeGenomics/Sequence-Format-Conversion%20copy
This is the link in the browser, but I'm unsure if that will work by itself. It seems unlikely to be a unique name? I assume you can get to the right place with the workflow ID, but let me know if there is something else I need to share.
Workflow: 2a8f700c-eaa7-4e50-b9a4-b049e8dc3730
Submission: d4946245-26bb-46d3-8ce1-c791f1c434f9
That's the most recent submission I ran, as referenced above. It says that the workflow configuration has been changed, but the only thing I've done was play with the values that could be accepted by the optional parameters. I've saved and unsaved elements there, but everything is actually still configured as it was when I ran the submission.
I came here to check if I had missed a response, and realized maybe I needed to tag you in every post, Samantha (she/her). If so, sorry about that!
Samantha (she/her), just an update. I tried setting the cache-control on my files to force them to be served compressed and tried the run again, but I received the same error code and same events in the log: it still thinks it's not a gzipped file.
I went back to my original files in storage to make sure they are actually gzipped and not just mislabeled that way, but they definitely are gzipped files.
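For anyone who wants to repeat that kind of check, here is a minimal sketch in Python. It assumes the file has been downloaded to a local disk, and the helper names `looks_gzipped` and `decompresses_cleanly` are made up for illustration. The first function compares the first two bytes with the gzip magic number (0x1f 0x8b), and the second reads the whole stream through Python's `gzip` module to catch truncation or corruption.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def decompresses_cleanly(path, chunk_size=1 << 20):
    """Return True if the whole file can be read through gzip without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard chunks until the end of the stream.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Not gzip at all, truncated, or corrupted data.
        return False
```

Calling `looks_gzipped("bk1.fastq.gz")` and `decompresses_cleanly("bk1.fastq.gz")` on a local copy only says something about that copy. It does not show what the workflow receives after localization, so a file that passes here could still arrive on the VM in a different form.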
I'm really at a loss. I assume something must be getting messed up in the file transfer, but don't understand enough about what's going on in the localization process to troubleshoot any of that.
Hi Alex Sticco,
Apologies for the delayed response. I'm unable to access your workspace. Can you confirm that the workspace was successfully shared with Terra-Support@firecloud.org? We should be listed as a current collaborator.
Best,
Samantha
I had shared it as instructed, but just happened to notice this text over the authorization domain: "Collaborators must be a member of all of these groups to access this workspace."
So, I suspect that was the issue. I have added the same email to my authorization domain. Please check again, and let me know if I need to adjust the role further.
Whoops, forgot to tag you again, Samantha (she/her). As I mentioned above, although I shared as instructed, I think I needed to also add the email to my authorization domain. Please check again to see if you have access now.
Hi Alex Sticco,
Thanks. I can confirm I have access to the workspace now. I'll take a look at your submission and get back to you as soon as I can.
Best,
Samantha
Hi Alex Sticco,
Thanks for your patience. The FASTQ files do not have to be in the gzipped format, but if the extension of the file is .gz, it needs to be gzipped properly. Unfortunately, I am not aware of an easy way to confirm whether it's gzipped properly - it's mostly trial and error when running into these issues. But since you received that error, it probably means that there is something wrong with your gzipped file.
Since you are new to cloud computing and genomic analysis, you may find the Biostars community forum helpful. There are a good number of posts regarding gzip issues: https://www.biostars.org/post/search/?query=gzip.
There is also a post on our GATK forum regarding the same "Not in GZIP format" error: https://gatk.broadinstitute.org/hc/en-us/community/posts/1260803912330-Caused-by-java-util-zip-ZipException-Not-in-GZIP-format. As suggested in the thread, renaming the files without the .gz extension could potentially be enough to resolve the error.
Best,
Samantha