Paired-Fastq-to-Unmapped-Bam can't open my fastq.gz files

Post author
Alex Sticco

Hi Samantha (she/her), I'm having trouble completing a run of this sequence-format-converter workflow with my own data. My FASTQ files are on Google Cloud, and I'm loading the data through tables as recommended.

The run exited with return code 3, which has not been declared as a valid return code. I believe the underlying error is that the program doesn't recognize my file as a gzip file:

htsjdk.samtools.SAMException: Error opening file: /cromwell_root/jc_genome/bk1.fastq.gz
	at htsjdk.samtools.util.IOUtil.openGzipFileForReading(IOUtil.java:729)
	at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:695)
	at htsjdk.samtools.util.IOUtil.openFileForBufferedReading(IOUtil.java:1007)
	at htsjdk.samtools.util.IOUtil.openFileForBufferedReading(IOUtil.java:1002)
	at htsjdk.samtools.fastq.FastqReader.<init>(FastqReader.java:77)
	at picard.sam.FastqToSam.fileToFastqReader(FastqToSam.java:432)
	at picard.sam.FastqToSam.doWork(FastqToSam.java:322)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.util.zip.ZipException: Not in GZIP format
	at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
	at htsjdk.samtools.util.IOUtil.openGzipFileForReading(IOUtil.java:726)
	... 11 more

Does this mean the files must be gzipped for the program to work? In another support thread here, the response to a similar issue indicated that GATK workflows will not accept gzipped FASTA files, so I assumed the same would apply here. My files are stored on Google Cloud as gzipped files, but I set up the bucket to serve them uncompressed to accommodate that. Does this mean I need to change that setting and serve the compressed file to this workflow? Or is there actually some other issue I'm not understanding, based on that error code?
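
For reference, here is how I've been reasoning about the serving behavior. The bucket path below is a placeholder, and the gsutil commands reflect my reading of the Cloud Storage transcoding docs rather than anything workflow-specific, so treat this as a sketch:

```shell
# If the object's metadata shows "Content-Encoding: gzip", Cloud Storage
# decompresses it in transit ("decompressive transcoding"), so the file
# arrives as plain text even though the stored bytes are gzipped:
#   gsutil stat gs://my-bucket/jc_genome/bk1.fastq.gz
# Clearing that header should make the object download exactly as stored:
#   gsutil setmeta -h "Content-Encoding:" gs://my-bucket/jc_genome/bk1.fastq.gz

# A real gzip stream always begins with the magic bytes 1f 8b, so you can
# check what a tool actually received:
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > demo.fastq.gz
head -c 2 demo.fastq.gz | od -An -tx1    # a true gzip file shows: 1f 8b
```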

It would be very helpful for beginners like me if the instructions explicitly stated which compression formats are compatible or required for the workflows. 

My last run failed because my dates were not ISO-formatted. Strict metadata-formatting requirements like this would be another useful thing to know up front, especially since that error isn't triggered until after a couple of hours of runtime spent transferring files.
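
(For anyone else who hits that date error: the date fields appear to require ISO 8601 values. These one-liners are my own illustration, not anything from the workflow docs:)

```shell
# ISO 8601 formats, e.g. 2024-05-01 or 2024-05-01T12:00:00Z.
date -u +%Y-%m-%d              # date only
date -u +%Y-%m-%dT%H:%M:%SZ    # full UTC timestamp
```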

If these feel like standard things everyone should be able to just assume, please understand that I'm brand new to cloud computing and genomic analysis! There is a pretty steep learning curve between understanding how things work in theory and actually getting any part of a pipeline operational.

Comments

9 comments

  • Comment author
    Samantha (she/her)

    Hi Alex,

Thank you for writing in about this issue. Can you share the workspace where you are seeing this issue with Terra-Support@firecloud.org? Click the Share button in your workspace; the Share option is in the three-dots menu at the top right.

    1. Add Terra-Support@firecloud.org to the User email field and press enter on your keyboard.
    2. Click Save.

     

    Please provide us with

    1. A link to your workspace
    2. The relevant submission ID
    3. The relevant workflow ID

    We’ll be happy to take a closer look as soon as we can!

    Kind regards,

    Samantha

  • Comment author
    Alex Sticco

    Hi! Thank you so much for getting back to me so quickly. I'm sorry I did not respond as quickly! I got overwhelmed at work and let this fall off my radar last night. 

    I have added Terra support to the project. 

    https://app.terra.bio/#workspaces/PracticeGenomics/Sequence-Format-Conversion%20copy

    This is the link from my browser, but I'm unsure whether it will work by itself, since the workspace name seems unlikely to be unique. I assume you can get to the right place with the workflow ID, but let me know if there is anything else I need to share.

    Workflow: 2a8f700c-eaa7-4e50-b9a4-b049e8dc3730

    Submission: d4946245-26bb-46d3-8ce1-c791f1c434f9 

    That's the most recent submission I ran, as referenced above. It says the workflow configuration has been changed, but the only thing I've done is play with the values the optional parameters will accept. I've saved and reverted a few things there, but everything is still configured exactly as it was when I ran the submission.

  • Comment author
    Alex Sticco

    I came here to check if I had missed a response, and realized maybe I needed to tag you in every post, Samantha (she/her). If so, sorry about that!

  • Comment author
    Alex Sticco

    Samantha (she/her), just an update: I tried setting the Cache-Control on my files to force them to be served compressed and ran the submission again, but I received the same error code and the same events in the log: it still thinks the file is not gzipped.

    I went back to my original files in storage to make sure they are actually gzipped and not just mislabeled that way, but they definitely are gzipped files.

    I'm really at a loss. I assume something must be getting messed up in the file transfer, but don't understand enough about what's going on in the localization process to troubleshoot any of that. 
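
    For completeness, this is the kind of check I ran. The file names here are just an illustration; `gzip -t` is the stock integrity test:

```shell
# gzip -t validates an entire gzip stream and exits nonzero otherwise,
# which distinguishes a real .gz from a plain file with a .gz name.
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > good.fastq.gz   # real gzip
printf '@read1\nACGT\n+\nIIII\n' > fake.fastq.gz             # mislabeled text

gzip -t good.fastq.gz && echo "good.fastq.gz: valid gzip"
gzip -t fake.fastq.gz 2>/dev/null || echo "fake.fastq.gz: not gzip"
```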

  • Comment author
    Samantha (she/her)

    Hi Alex Sticco,

    Apologies for the delayed response. I'm unable to access your workspace. Can you confirm that the workspace was successfully shared with Terra-Support@firecloud.org? We should be listed as a current collaborator.

    Best,

    Samantha

  • Comment author
    Alex Sticco

    I had shared it as instructed, but just happened to notice this text over the authorization domain: "Collaborators must be a member of all of these groups to access this workspace."

    So, I suspect that was the issue. I have added the same email to my authorization domain. Please check again, and let me know if I need to adjust the role further. 

  • Comment author
    Alex Sticco

    Whoops, forgot to tag you again, Samantha (she/her). As I mentioned above, although I shared as instructed, I think I needed to also add the email to my authorization domain. Please check again to see if you have access now.

  • Comment author
    Samantha (she/her)

    Hi Alex Sticco,

    Thanks. I can confirm I have access to the workspace now. I'll take a look at your submission and get back to you as soon as I can.

    Best,

    Samantha

  • Comment author
    Samantha (she/her)

    Hi Alex Sticco,

    Thanks for your patience. The FASTQ files do not have to be gzipped, but if a file's extension is .gz, it needs to be gzipped properly. Unfortunately, I am not aware of an easy way to confirm whether a file is gzipped properly; it's mostly trial and error when running into these issues. But since you received that error, it probably means there is something wrong with your gzipped file.

    As someone new to cloud computing and genomic analysis, you may find the Biostars community forum helpful. There are a good number of posts regarding gzip issues: https://www.biostars.org/post/search/?query=gzip.

    There is also a post on our GATK forum regarding the same "Not in GZIP format" error: https://gatk.broadinstitute.org/hc/en-us/community/posts/1260803912330-Caused-by-java-util-zip-ZipException-Not-in-GZIP-format. As suggested in the thread, renaming the files without the .gz extension could potentially be enough to resolve the error.
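
    A minimal sketch of that workaround, assuming the file's bytes are already plain text (for example, because it was decompressed in transit) and using an illustrative file name:

```shell
# htsjdk chooses its reader from the extension alone, so if the bytes are
# already plain FASTQ but the name ends in .gz, renaming is enough:
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq.gz   # plain text, wrong name
mv sample.fastq.gz sample.fastq                      # now read as plain FASTQ
head -n 1 sample.fastq                               # prints: @read1
```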

    Best,

    Samantha

