basename() does not function as expected on DRS URIs

Post author
Ash O'Farrell

There seems to be a bug with the WDL function basename() when it is run on files that originate from DRS URIs. Instead of returning the file's basename, it returns the name of the folder the file is being localized into.

Expected and actual behavior

Normally, basename works like this, and this is the behavior I'd expect out of files from DRS URIs too.

basename("cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz") --> "shibainu.chr14.vcf.gz"

But if the file came from a DRS URI, this is the pattern I see:

basename("cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz") --> "0fbb8b5d-81a5-4928-a42d-7cac707f746e"
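To make the difference concrete, here is a minimal Python sketch of the two results. It uses Python's posixpath module purely as a stand-in for WDL's basename(); it is not how Cromwell implements the function. The variable names are made up for illustration. The value seen with DRS inputs is the same as the basename of the parent directory.

```python
import posixpath

# Example path copied from the post above.
localized = "cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz"

# What basename() is expected to return: the last path component.
expected = posixpath.basename(localized)

# What is observed for DRS-sourced files: the name of the parent directory.
observed = posixpath.basename(posixpath.dirname(localized))

print(expected)  # shibainu.chr14.vcf.gz
print(observed)  # 0fbb8b5d-81a5-4928-a42d-7cac707f746e
```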

Specific example

Beth Sheets ran into this issue when running this workflow. The pipeline expects a vcf, vcf.gz, vcf.bgz, or bcf file. It uses a regex to replace the file's extension with "gds" to generate the name of an output file. (I've left out the actual generation of that gds file for simplicity's sake in these code blocks; it is done via an R script that reads the .config text file to determine the name of its outputs.) So if vcf is "cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz" then output_file_name is "shibainu.chr14.gds", at least when run on non-controlled data.
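As a rough illustration of the intended renaming, the same regex can be tried outside Cromwell. This is only a sketch in Python: re.sub and posixpath.basename stand in for WDL's sub() and basename(), and the names EXTENSION_PATTERN and gds_name are hypothetical, not part of the workflow.

```python
import posixpath
import re

# Same pattern as in the workflow: each alternative matches one supported
# extension, with a negative lookahead limiting what may follow it.
EXTENSION_PATTERN = (
    r"\.vcf\.gz(?!.{1,})|\.vcf\.bgz(?!.{5,})|\.vcf(?!.{5,})|\.bcf(?!.{1,})"
)


def gds_name(path: str) -> str:
    """Swap the variant-file extension for .gds, then drop the directories."""
    return posixpath.basename(re.sub(EXTENSION_PATTERN, ".gds", path))


path = "cromwell-root/34930493049/0fbb8b5d-81a5-4928-a42d-7cac707f746e/shibainu.chr14.vcf.gz"
print(gds_name(path))  # shibainu.chr14.gds
```

This is the result the task should produce for any of the four supported extensions when basename() behaves as expected.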

The code looks like this:

task vcf2gds {
  input {
    File vcf
    String output_file_name = basename(sub(vcf, "\.vcf\.gz(?!.{1,})|\.vcf\.bgz(?!.{5,})|\.vcf(?!.{5,})|\.bcf(?!.{1,})", ".gds"))
  }
  command {
    echo "Generating config file"
    python << CODE
    import os
    # Write the config file that the downstream R script reads
    f = open("vcf2gds.config", "a")
    f.write("vcf_file ~{vcf}\n")
    f.write("\ngds_file '~{output_file_name}'\n")
    f.close()
    CODE
  }
}

This works fine locally, but fails with DRS URIs, resulting in GDS files with extension-less names like 0fbb8b5d-81a5-4928-a42d-7cac707f746e. I did come up with a workaround/debugging setup, the relevant parts of which look like this:

task vcf2gds {
  input {
    File vcf
    String debug_basename = basename(vcf)
    String debug_basenamesub = basename(sub(vcf, "\.vcf\.gz(?!.{1,})|\.vcf\.bgz(?!.{5,})|\.vcf(?!.{5,})|\.bcf(?!.{1,})", ".gds"))
  }
  command {
    echo "Input vcf is: " | tee -a debug.txt
    # No -a on the next tee, so it overwrites the header line written above
    echo "~{vcf}" | tee debug.txt
    echo "Basename of input vcf is: " | tee -a debug.txt
    echo "~{debug_basename}" | tee -a debug.txt
    echo "Basename of input vcf with a substitution is: " | tee -a debug.txt
    echo "~{debug_basenamesub}" | tee -a debug.txt

    echo "Generating config file"
    python << CODE
    import os
    import shutil
    # Derive the names at runtime from the localized path instead of WDL basename()
    py_split = (os.path.basename("~{vcf}")).split(".vcf")
    py_basename = py_split[0]
    if len(py_split) != 1:
        py_ext = py_split[1]
    else:
        py_ext = ""
    # Copy the input into the working directory under its real file name
    shutil.copy2("~{vcf}", "./%s.vcf%s" % (py_basename, py_ext))
    py_newname = "%s.gds" % py_basename

    f = open("vcf2gds.config", "a")
    f.write("vcf_file ~{vcf}\n")
    f.write("\ngds_file "+"'"+py_newname+"'\n")
    f.close()
    exit()
    CODE
  }
}

Strictly speaking this does work, but it is a tedious workaround that duplicates every input file of the pipeline. This is the output of debug.txt:

/cromwell_root/dg.4503_dg.4503/0fbb8b5d-81a5-4928-a42d-7cac707f746e/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bi_maf001.vcf.bgz
Basename of input vcf is:
0fbb8b5d-81a5-4928-a42d-7cac707f746e
Basename of input vcf with a substitution is:
0fbb8b5d-81a5-4928-a42d-7cac707f746e
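For comparison, here is a small Python sketch of what the workaround's string handling yields for the localized path shown in debug.txt. It mirrors the split on ".vcf" from the heredoc, but uses posixpath instead of os.path so it behaves the same on any platform; the function name workaround_names is hypothetical.

```python
import posixpath


def workaround_names(localized_path):
    """Return (copy_name, gds_name) the way the Python workaround derives them."""
    parts = posixpath.basename(localized_path).split(".vcf")
    stem = parts[0]
    # Anything after ".vcf" (for example ".gz" or ".bgz") is kept for the copy.
    ext = parts[1] if len(parts) != 1 else ""
    return "%s.vcf%s" % (stem, ext), "%s.gds" % stem


# Localized path copied from the debug.txt output above.
path = (
    "/cromwell_root/dg.4503_dg.4503/0fbb8b5d-81a5-4928-a42d-7cac707f746e/"
    "ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bi_maf001.vcf.bgz"
)
copy_name, gds_name = workaround_names(path)
print(copy_name)
print(gds_name)
```

Because the name is computed from the path the command actually sees at runtime, it does not depend on what basename() returned when the inputs were evaluated. One thing to be aware of: under this logic a file name that does not contain ".vcf" (a .bcf input, for instance) is kept whole as the stem, so the sketch gives sample.bcf.gds for sample.bcf.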

Credentials issue?

This happens regardless of whether Beth or I run the pipeline, even after both of us refreshed our user and developer credentials. Although the data is accessed via DRS URIs, it is open-access. Furthermore, the files are still getting localized, so I don't believe this is the typical issue with stale credentials.

That said, I think I read somewhere that Cromwell tries to evaluate variables outside the command section before files are localized. Maybe it's pulling the name of the destination directory before the files are actually localized to that directory?

Comments

9 comments

  • Comment author
    Jason Cerrato

    Hi Aisling,

    Thanks for flagging this up! I'll inform a member of our team and we'll file a bug report if needed. I'll get back to you as soon as I can.

    Kind regards,

    Jason

  • Comment author
    Ash O'Farrell

    Thanks for always being quick to respond Jason! It's much appreciated!

  • Comment author
    Jason Cerrato

    Hey Aisling,

    We were able to validate this behavior and we've filed a bug report. I'll be happy to reach out once I get word that the bug is fixed.

    Thanks so much for bringing this up and for posting your workaround so that other users of the platform can use it in the meantime! We really appreciate it.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato
    • Edited

    Hey Ash O'Farrell,

    This bug is now fixed! See our release notes here: https://support.terra.bio/hc/en-us/articles/4407470708251

    Thanks again for reporting this.

    Kind regards,

    Jason

  • Comment author
    Jason Cerrato

    Hi Aisling,

    The workflows team has reverted to the previous version of Cromwell because our recent change for basename() was causing unexpected issues with outputs defined by String-type inputs: https://support.terra.bio/hc/en-us/community/posts/4407522454811

    I will follow up when a new fix has been implemented.

    Kind regards,

    Jason

  • Comment author
    Megan Shand

    Jason Cerrato is there any update on when this might be fixed? 

  • Comment author
    Megan Shand
    • Edited

    For what it's worth, my own workaround for this was the following:

    task getBasename {
      input {
        File drs_vcf
      }
      command {
        ln -s ~{drs_vcf} .
        ls *vcf.gz > vcf_file.txt
      }
      output {
        String output_vcf_name = read_string("vcf_file.txt")
      }
    }

    This worked because I knew the file extension of the input file, but as long as you can get the base file name in a text file you can use read_string() to create the output path.

    Note that this needs to be a separate task from a task that tries to delocalize the output file.

  • Comment author
    Josh Evans

    Hi Megan,

    Thanks for writing in! Allow me to answer for Jason. Given the complex nature of this bug in the system, our engineering teams are still looking for the best way to implement a solution. While we don't currently have an ETA for a bug fix, I have already added your name to their bug report, which should help them in determining priority.

    I also wanted to thank you for providing your workaround. I believe this will be very helpful for users in the meantime. 

    Please let me know if you have any questions or need anything else.

    Best,

    Josh

  • Comment author
    Chris Whelan

    Hi,

    Is there any update on progress towards finding an alternate solution to this issue?

    Thanks,

    Chris
