Hi, Samantha (she/her), I'm having trouble completing a run of this sequence format converter workflow with my own data. My FASTQ files are on Google Cloud, and I'm loading the data through tables as recommended.
It says it exited with return code 3, which has not been declared as a valid return code. I believe the basis of the error is that the program doesn't recognize my file as a gzip file:
htsjdk.samtools.SAMException: Error opening file: /cromwell_root/jc_genome/bk1.fastq.gz
    at htsjdk.samtools.util.IOUtil.openGzipFileForReading(IOUtil.java:729)
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:695)
    at htsjdk.samtools.util.IOUtil.openFileForBufferedReading(IOUtil.java:1007)
    at htsjdk.samtools.util.IOUtil.openFileForBufferedReading(IOUtil.java:1002)
    at htsjdk.samtools.fastq.FastqReader.<init>(FastqReader.java:77)
    at picard.sam.FastqToSam.fileToFastqReader(FastqToSam.java:432)
    at picard.sam.FastqToSam.doWork(FastqToSam.java:322)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.util.zip.ZipException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:165)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at htsjdk.samtools.util.IOUtil.openGzipFileForReading(IOUtil.java:726)
    ... 11 more
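For what it's worth, my understanding is that gzip files are supposed to start with the magic bytes 0x1f 0x8b, so I'm guessing that is roughly the kind of check that's failing here. Here's a small sketch of how I'd verify a file locally (the path is just a placeholder, and gzip.BadGzipFile needs Python 3.8 or newer):

    # Rough sketch: gzip files should start with the magic bytes 0x1f 0x8b.
    # The path below is only a placeholder for the localized file.
    import gzip

    path = "bk1.fastq.gz"  # placeholder path

    with open(path, "rb") as handle:
        magic = handle.read(2)

    if magic == b"\x1f\x8b":
        print("Looks like a real gzip file")
    else:
        print(f"Not gzip: first bytes are {magic!r}")

    # Reading it as gzip (which I assume is what htsjdk is doing) should
    # fail the same way if the bytes are actually plain text:
    try:
        with gzip.open(path, "rt") as fq:
            print(fq.readline().rstrip())
    except gzip.BadGzipFile as err:
        print(f"gzip module also rejects it: {err}")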
Does this mean the files must be gzipped for the program to work? There is another support thread here where the response to a similar issue says GATK workflows will not accept gzipped FASTA files, so I assumed that would also be the case here. My files are stored on Google Cloud as gzipped files, but I set up the bucket to serve them uncompressed to accommodate that. Do I need to change that setting and serve the compressed files to this workflow instead? Or is there actually some other issue I'm not understanding, based on that error code?
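If the problem really is the bucket serving decompressed bytes under a .gz name, I think something like this sketch would show whether Content-Encoding is set on the objects, which is my understanding of what triggers that transcoding. It uses the google-cloud-storage Python client, and the bucket and object names are just placeholders for mine:

    # Rough, untested sketch: inspect the object's metadata to see whether
    # Content-Encoding is set (which, as I understand it, makes Google Cloud
    # Storage serve the bytes decompressed). Names are placeholders.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-fastq-bucket")          # placeholder bucket name
    blob = bucket.get_blob("jc_genome/bk1.fastq.gz")   # returns None if missing

    if blob is not None:
        print("content_type:    ", blob.content_type)
        print("content_encoding:", blob.content_encoding)

        # My understanding is that clearing content_encoding (while leaving the
        # object itself gzipped) would make downloads deliver the compressed
        # bytes as-is:
        # blob.content_encoding = None
        # blob.patch()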
It would be very helpful for beginners like me if the instructions explicitly stated which compression formats are compatible with, or required by, each workflow.
My last run failed because my dates were not ISO formatted. Strict metadata formatting requirements like this would be another useful thing to know up front, especially since that error isn't triggered until after a couple of hours of runtime spent transferring files.
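For anyone else who hits the date issue, my understanding is that ISO 8601 for dates just means YYYY-MM-DD, and a conversion along these lines is all it takes (the input format below is only an illustration, not necessarily what your table will contain):

    # Sketch of converting a non-ISO date string to ISO 8601 (YYYY-MM-DD).
    # The input format here is a made-up example.
    from datetime import datetime

    raw = "10/27/2023"                                       # hypothetical original format
    iso = datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
    print(iso)                                               # -> 2023-10-27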
If these feel like standard things everyone should be able to just assume, please understand I'm brand new to cloud computing and genomic analysis! There is a pretty steep learning curve to get from understanding how things work in theory to actually getting any bit of a pipeline operational.