Error message: File is larger than 10000000 Bytes

Allie Hajian

What this looks like

You’ve likely used one of the read_X functions in your WDL and exceeded the default limits set in Terra’s Cromwell instance. When Cromwell starts reading a file and passes the size limit for that function, it immediately stops reading and fails the workflow with this error message.

Cromwell limits

  • read_lines = 10MB
  • read_json = 10MB
  • read_tsv = 10MB
  • read_object = 10MB
  • read_boolean = 7 bytes
  • read_int = 19 bytes
  • read_float = 50 bytes
  • read_string = 128KB
  • read_map = 128KB

Workarounds

If you are using read_lines() with a large file of filenames and hitting this error, the best workaround is to split the large file by line count into multiple small files, scatter over the array of small files, and read the contents of each small file to get the filename. The same concept applies to the other read_X errors.
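As a quick illustration of the splitting step outside WDL (the bucket paths below are made up), coreutils split can break a list of filenames into one-line files. Note the -a flag: the default two-character suffixes allow at most 676 output files, so large lists need a wider suffix.

```shell
# Make a sample list of (hypothetical) file paths, one per line
printf 'gs://bucket/sample_%d.bam\n' $(seq 1 1000) > filenames.txt

# Split into one-line files; -a 4 permits up to 26^4 output files
# (the default suffix length of 2 caps out at 676 files)
mkdir -p sandbox
split -l 1 -a 4 filenames.txt sandbox/

# Each small file now holds exactly one filename
ls sandbox | wc -l
cat sandbox/aaaa
```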

Alternatively, you can pass these files in as workflow inputs, either individually or collected in a tar.
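A minimal sketch of the tar route (the file names here are invented for illustration): bundle the small inputs into one archive to pass as a single workflow input, then unpack it inside the receiving task.

```shell
# Collect many small input files into one tar
mkdir -p inputs
printf 'line %d\n' 1 2 3 | split -l 1 - inputs/sample_
tar -cf inputs.tar -C inputs .

# Inside the receiving task, unpack and glob the contents
mkdir -p unpacked
tar -xf inputs.tar -C unpacked
ls unpacked | wc -l
```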

Here are two example WDLs for inspiration.

Option 1

workflow w {
  File fileOfFilenames # 1GB in size

  # Split the large file into small individual files
  call splitFile { input: largeFile = fileOfFilenames }

  scatter (f in splitFile.tiny_files) {
    String fileName = read_string(f)
  }

  Array[String] filenames = fileName
}

task splitFile {
    File largeFile

    command {
        mkdir sandbox
        # -a 6 widens the output suffix; the default suffix length of 2
        # caps split at 676 (26^2) files and fails beyond that
        split -l 1 -a 6 ${largeFile} sandbox/
    }

    output {
        Array[File] tiny_files = glob("sandbox/*")
    }
    runtime {
        docker: "ubuntu:latest"
    }
}

Option 2

workflow use_file_of_filenames {
  File file_of_filenames
  call count_filenames_in_file { input: file_of_filenames = file_of_filenames }
  scatter (index in range(count_filenames_in_file.count)) {
    call operate_on_file { input: file_of_filenames = file_of_filenames, file_index = index }
  }
}

task count_filenames_in_file {
  File file_of_filenames
  command {
    wc -l < ${file_of_filenames}
  }
  output {
    Int count = read_int(stdout())
  }
  runtime {
    docker: "ubuntu:latest"
  }
}

task operate_on_file {
  File file_of_filenames
  Int file_index
  command {
    # 1: Get the appropriate file name from the list
    #    (range() is 0-based, so the line number is file_index + 1)
    FILE_URL=$(sed -n "$((${file_index} + 1))p" ${file_of_filenames})
    # 2: Operate on that file as a URL, e.g. localize it:
    # gsutil cp "$FILE_URL" .
  }
  runtime {
    docker: "ubuntu:latest"
  }
}
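The line lookup that operate_on_file needs can be done with sed; since range() yields 0-based indices, the line number is file_index + 1. (The file names below are placeholders.)

```shell
# A small file of (hypothetical) file URLs, one per line
printf 'gs://bucket/a.bam\ngs://bucket/b.bam\ngs://bucket/c.bam\n' > filenames.txt

file_index=1                                  # second entry, 0-based
sed -n "$((file_index + 1))p" filenames.txt   # prints gs://bucket/b.bam
```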


Comments

2 comments

  • Carmen Diaz Verdugo (edited)

    Hi Allie Hajian,

    I had the error message File is larger than 10000000 Bytes.

    I implemented Option 1 and it works, but only for small files. When I split a file bigger than 10 MB, the splitFile task returns an empty array. Do you know what the issue could be?

    If the largeFile has ~7,680 lines (well under 10 MB), it splits the file, scatters the read_string calls, and returns an Array[String] with the correct content.

    However, if the largeFile has ~107,520 lines (12.9 MB), it doesn't give any error; it just returns an empty array.

    Am I hitting another limit at that file size? Do you have any suggestions on how to solve it?

    Thank you!

  • Allie Hajian

    Carmen Diaz Verdugo Hello Carmen! I'm sorry you're having a hard time with your large files. Sadly, I am only the author of the article, not the methods expert you need. I submitted a request on your behalf to frontline support, who should be getting back to you soon. 

