basename() does not function as expected on DRS URIs

Post author
Aisling O'Farrell

There seems to be a bug with the WDL function basename() when it is run on files that originate from DRS URIs. Instead of returning the file's basename, it returns the name of the folder the file is being localized into.

Expected and actual behavior

Normally, basename works like this, and this is the behavior I'd expect out of files from DRS URIs too.

basename("cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz") --> "shibainu.chr14.vcf.gz"

But if the file came from a DRS URI, this is the pattern I see:

basename("cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz") --> "0fbb8b5d-81a5-4928-a42d-7cac707f746e"
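As a rough illustration of the two cases, here is a minimal Python sketch. It uses posixpath.basename as a stand-in for WDL's basename() (an analogy only, not a claim about how Cromwell implements it) and shows that the value seen with DRS URIs is the same as the basename of the file's parent directory.

```python
import posixpath

path = "cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz"

# Expected: the last path component, i.e. the file name.
expected = posixpath.basename(path)

# Observed with DRS URIs: the name of the directory the file is localized into,
# which is what taking the basename of the parent directory would return.
observed = posixpath.basename(posixpath.dirname(path))

print(expected)  # shibainu.chr14.vcf.gz
print(observed)  # 0fbb8b5d-81a5-4928-a42d-7cac707f746e
```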

Specific example

Beth Sheets ran into this issue when running this workflow. The pipeline expects a vcf, vcf.gz, vcf.bgz, or bcf file. It uses a regex substitution to replace the file's extension with "gds" to generate the name of an output file. (I've left out the actual generation of that gds file for simplicity's sake in these code blocks; it is done via an R script that reads the .config text file to determine the name of its outputs.) So if vcf is "cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz" then output_file_name is "shibainu.chr14.gds", at least when run on non-controlled data.
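For reference, the extension swap can be tried outside of WDL. The sketch below applies the same pattern with Python's re module, which accepts this syntax; WDL's sub() is evaluated by Cromwell rather than Python, so treat this as an approximation. The helper name gds_name is made up for illustration.

```python
import posixpath
import re

# Same pattern as in the WDL task. Each alternative matches one supported
# extension, and the negative lookahead limits what may follow it.
EXT_PATTERN = r"\.vcf\.gz(?!.{1,})|\.vcf\.bgz(?!.{5,})|\.vcf(?!.{5,})|\.bcf(?!.{1,})"

def gds_name(path: str) -> str:
    """Swap a trailing VCF/BCF extension for .gds, then drop the directories."""
    return posixpath.basename(re.sub(EXT_PATTERN, ".gds", path))

prefix = "cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/"
names = [gds_name(prefix + "shibainu.chr14" + ext)
         for ext in (".vcf", ".vcf.gz", ".vcf.bgz", ".bcf")]
print(names)  # every entry is "shibainu.chr14.gds"
```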

The code looks like this:

task vcf2gds {
    input {
        File vcf
        String output_file_name = basename(sub(vcf, "\.vcf\.gz(?!.{1,})|\.vcf\.bgz(?!.{5,})|\.vcf(?!.{5,})|\.bcf(?!.{1,})", ".gds"))
    }
    command {
        echo "Generating config file"
        python << CODE
        import os
        f = open("vcf2gds.config", "a")
        f.write("vcf_file ~{vcf}\n")
        f.write("\ngds_file '~{output_file_name}'\n")
        f.close()
        CODE
    }

This works fine locally, but fails with DRS URIs, resulting in GDS files with extension-less names like 0fbb8b5d-81a5-4928-a42d-7cac707f746e. I did come up with a workaround/debugging setup, the relevant parts of which look like this:

task vcf2gds {
    input {
        File vcf
        String debug_basename = basename(vcf)
        String debug_basenamesub = basename(sub(vcf, "\.vcf\.gz(?!.{1,})|\.vcf\.bgz(?!.{5,})|\.vcf(?!.{5,})|\.bcf(?!.{1,})", ".gds"))
    }
    command {
        echo "Input vcf is: " | tee -a debug.txt
        echo "~{vcf}" | tee debug.txt
        echo "Basename of input vcf is: " | tee -a debug.txt
        echo "~{debug_basename}" | tee -a debug.txt
        echo "Basename of input vcf with a substitution is: " | tee -a debug.txt
        echo "~{debug_basenamesub}" | tee -a debug.txt

        echo "Generating config file"
        python << CODE
        import os
        import shutil
        # Work out the file name in Python instead of relying on WDL's basename()
        py_split = (os.path.basename("~{vcf}")).split(".vcf")
        py_basename = py_split[0]
        if len(py_split) != 1:
            py_ext = py_split[1]
        else:
            py_ext = ""
        # Copy the input into the working directory under its real name
        shutil.copy2("~{vcf}", "./%s.vcf%s" % (py_basename, py_ext))
        py_newname = "%s.gds" % py_basename

        f = open("vcf2gds.config", "a")
        f.write("vcf_file ~{vcf}\n")
        f.write("\ngds_file "+"'"+py_newname+"'\n")
        f.close()
        exit()
        CODE
    }

Strictly speaking this does work, but it is a tedious workaround that requires duplicating every input file of the pipeline. This is the output of debug.txt:

/cromwell_root/dg.4503_dg.4503/0fbb8b5d-81a5-4928-a42d-7cac707f746e/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bi_maf001.vcf.bgz
Basename of input vcf is:
0fbb8b5d-81a5-4928-a42d-7cac707f746e
Basename of input vcf with a substitution is:
0fbb8b5d-81a5-4928-a42d-7cac707f746e
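The name handling inside the workaround's Python heredoc can also be tried on its own. The sketch below restates that logic and applies it to the path from debug.txt; the helper name split_vcf_name is hypothetical, and the file copy is left out. It also shows a limitation of splitting on ".vcf": a .bcf input has nothing to split on, so its extension stays in the stem.

```python
import os

def split_vcf_name(path):
    """Hypothetical helper mirroring the workaround: split the file name on '.vcf'."""
    parts = os.path.basename(path).split(".vcf")
    stem = parts[0]
    ext = parts[1] if len(parts) != 1 else ""
    return stem, ext

localized = (
    "/cromwell_root/dg.4503_dg.4503/0fbb8b5d-81a5-4928-a42d-7cac707f746e/"
    "ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bi_maf001.vcf.bgz"
)
stem, ext = split_vcf_name(localized)
copy_name = "%s.vcf%s" % (stem, ext)  # name of the local copy
gds_file = "%s.gds" % stem            # name written to the config as gds_file
print(copy_name)
print(gds_file)

# A .bcf input has no ".vcf" in its name, so the stem keeps ".bcf".
print(split_vcf_name("shibainu.chr14.bcf"))  # ('shibainu.chr14.bcf', '')
```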

Credentials issue?

This happens regardless of whether Beth or I run the pipeline after both of us refreshed our user and developer creds. Although the data is accessed via DRS URIs, it is open-access. Furthermore the files are still getting localized, so I don't believe this is the typical issue with stale credentials.

That said, I think I read somewhere that Cromwell tries to evaluate variables outside the command section before files localize... maybe it's pulling the name of the destination directory before the files are actually localized to that directory?

Comments

5 comments

  • Comment author
    Jason Cerrato

    Hi Aisling,

    Thanks for flagging this up! I'll inform a member of our team and we'll file a bug report if needed. I'll get back to you as soon as I can.

    Kind regards,

    Jason

  • Comment author
    Aisling O'Farrell

    Thanks for always being quick to respond Jason! It's much appreciated!

  • Comment author
    Jason Cerrato

    Hey Aisling,

    We were able to validate this behavior and we've filed a bug report. I'll be happy to reach out once I get word that the bug is fixed.

    Thanks so much for bringing this up and for posting your workaround so that other users of the platform can use it in the meantime! We really appreciate it.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hey Aisling O'Farrell,

    This bug is now fixed! See our release notes here: https://support.terra.bio/hc/en-us/articles/4407470708251

    Thanks again for reporting this.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Aisling,

    The workflows team has reverted to the previous version of Cromwell because our recent change for basename() was causing unexpected issues with outputs defined by String-type inputs: https://support.terra.bio/hc/en-us/community/posts/4407522454811

    I will follow up when a new fix has been implemented.

    Kind regards,

    Jason

