
join_fastq is a bit of a bottleneck #44

Closed
evanfields opened this issue Jul 30, 2024 · 8 comments
Labels
enhancement New feature or request priority_2

Comments

@evanfields
Collaborator

I recently ran the cleaning and ribodepletion steps of the pipeline (this branch) on four pairs of read files where each read file was about 6GB compressed. Total time was about 5.5 elapsed hours (145 CPU hours). The wall time was significantly affected by join_fastq.py:
[attached image: pipeline execution timeline showing per-process wall time]

At a glance, it looks like join_fastq contributes 2-3 hours of the wall time. Mike suggests that for my use case just using the mix flag in BBMerge should be fine; I'm not sure about the general pipeline refactor use case.

@willbradshaw
Contributor

willbradshaw commented Jul 30, 2024 via email

@mikemc
Member

mikemc commented Aug 7, 2024

If we're able to modify the Python script so that it works on batches of reads (if not the entire read file) rather than looping over individual reads, I think that would give a big speedup. Even looping over individual reads to create the new reads may be OK, as long as we write in batches rather than calling SeqIO.write() for every read. I think I could prototype something in R using the Biostrings package reasonably quickly, but I haven't used Biopython for a while.
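To make the batched-writes idea concrete, here is a minimal sketch in pure standard-library Python (not Biopython; the function and parameter names like `join_batched` are illustrative, not from the actual join_fastq.py). It buffers the joined records and writes once per batch instead of once per read:

```python
import io

# Translation table for complementing DNA bases (N stays N).
COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def fastq_records(handle):
    """Yield (header, seq, qual) tuples from a FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual

def join_batched(r1, r2, out, batch_size=100_000):
    """Join read pairs as fwd + 'N' + RC(rev), buffering the output
    and flushing it to `out` once per batch rather than per read."""
    buf = []
    for (h1, s1, q1), (_h2, s2, q2) in zip(fastq_records(r1), fastq_records(r2)):
        rc = s2.translate(COMP)[::-1]        # reverse complement of R2
        header = h1.replace(" ", " joined ", 1)
        # '!' is Phred quality 0, matching the N spacer; R2 quality is reversed.
        buf.append(f"{header}\n{s1}N{rc}\n+\n{q1}!{q2[::-1]}\n")
        if len(buf) >= batch_size:
            out.write("".join(buf))
            buf.clear()
    out.write("".join(buf))  # flush any partial final batch
```

In real use `out` would be a (possibly gzip) file handle and `batch_size` tuned to available memory; the point is that the number of write calls scales with the number of batches, not the number of reads.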

@mikemc
Member

mikemc commented Aug 7, 2024

@willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?

@mikemc
Member

mikemc commented Aug 8, 2024

If we do end up wanting to refactor this for speed, here is some R code using the Biostrings package. It uses vectorized operations for the reverse-complementing and concatenation, working on chunks of reads rather than iterating over individual reads. My test files are only 5 read pairs, so I've set the chunk size to 2 reads for testing, but we'd want to set it as large as the instance's memory allows, probably at least 100K or 1M reads.

files <- XVector::open_input_files(c("test/test-r1.fastq", "test/test-r2.fastq"))
chunk_size <- 2

i <- 0
while (TRUE) {
  i <- i + 1
  ## Load `chunk_size` records at a time; an empty result means we're done.
  fwd <- Biostrings::readQualityScaledDNAStringSet(files[1], nrec = chunk_size)
  if (length(fwd) == 0L) {
    break
  }
  ## reverseComplement() on a QualityScaledDNAStringSet also reverses the qualities.
  rev_rc <- Biostrings::readQualityScaledDNAStringSet(files[2], nrec = chunk_size) |>
    Biostrings::reverseComplement()
  stopifnot(identical(length(fwd), length(rev_rc)))
  cat("processing chunk", i, "...\n")
  ## Vectorized concatenation: forward read + "N" spacer + RC'd reverse read.
  seq_new <- Biostrings::xscat(
    fwd,
    Biostrings::DNAString("N"),
    rev_rc
  )
  ## "!" is Phred quality 0, matching the N spacer.
  qual_new <- Biostrings::xscat(
    Biostrings::quality(fwd),
    Biostrings::PhredQuality("!"),
    Biostrings::quality(rev_rc)
  ) |>
    Biostrings::PhredQuality()
  ## Mark the joined reads in the FASTQ headers.
  names(seq_new) <- names(fwd) |> stringr::str_replace(" ", " joined ")
  joined <- Biostrings::QualityScaledDNAStringSet(seq_new, qual_new)
  Biostrings::writeQualityScaledXStringSet(
    joined, "test-joined-chunk.fastq.gz",
    append = TRUE, compress = TRUE
  )
}

On a small test file of 5 read pairs this gives output identical to the Python join_paired_reads() function.

Edit: I tried running a slightly cleaned-up version on a set of ~40M paired-end reads, processing 1M records at a time; it took ~40 min on my MacBook and used about 4.5 GB of memory. So it's still not speedy.

@willbradshaw
Contributor

> @willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?

Kraken2 and RC-sensitive single-end deduplication with Clumpify.

@willbradshaw
Contributor

I tried playing around with Awk here and managed to get a significant speedup; I think combining this with threading should get us to a pretty good place. I'll tackle this next quarter as part of a general attack on wall-clock bottlenecks.

@harmonbhasin
Collaborator

harmonbhasin commented Oct 8, 2024

Downgrading to p2, as this did not come up as a bottleneck when looking at Nextflow logs across three different datasets.

@willbradshaw
Contributor

Resolving this since it doesn't currently seem to be an issue and a lot of this will change after we implement #122.
