
join_fastq is a bit of a bottleneck #44

Closed
evanfields opened this issue Jul 30, 2024 · 8 comments
Labels
enhancement New feature or request priority_2

Comments

@evanfields
Collaborator

I recently ran the cleaning and ribodepletion steps of the pipeline (this branch) on four pairs of read files where each read file was about 6GB compressed. Total time was about 5.5 elapsed hours (145 CPU hours). The wall time was significantly affected by join_fastq.py:
[attached image: pipeline execution timeline showing per-process wall time]

At a glance, it looks like join_fastq contributes 2-3 hours of the wall time. Mike suggests that for my use case just using the mix flag in BBMerge should be fine; I'm not sure about the general pipeline refactor use case.

@willbradshaw
Contributor

willbradshaw commented Jul 30, 2024 via email

@mikemc
Member

mikemc commented Aug 7, 2024

If we're able to modify the Python script so that it works on batches of reads (if not the entire read file) rather than looping over individual reads, I think that would give a big speedup. Even looping over individual reads to create the new reads may be OK, as long as we write in batches rather than calling SeqIO.write() for every read. I think I could prototype something in R using the Biostrings package reasonably quickly, but I haven't used Biopython for a while.
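To make the batched-writes idea concrete, here is a minimal sketch in pure standard-library Python (not Biopython; the function and parameter names like `join_batched` are illustrative, not from the actual join_fastq.py). It buffers the joined records and writes once per batch instead of once per read:

```python
import io

# Translation table for complementing DNA bases (N stays N).
COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def fastq_records(handle):
    """Yield (header, seq, qual) tuples from a FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual

def join_batched(r1, r2, out, batch_size=100_000):
    """Join read pairs as fwd + 'N' + RC(rev), buffering the output
    and flushing it to `out` once per batch rather than per read."""
    buf = []
    for (h1, s1, q1), (_h2, s2, q2) in zip(fastq_records(r1), fastq_records(r2)):
        rc = s2.translate(COMP)[::-1]        # reverse complement of R2
        header = h1.replace(" ", " joined ", 1)
        # '!' is Phred quality 0, matching the N spacer; R2 quality is reversed.
        buf.append(f"{header}\n{s1}N{rc}\n+\n{q1}!{q2[::-1]}\n")
        if len(buf) >= batch_size:
            out.write("".join(buf))
            buf.clear()
    out.write("".join(buf))  # flush any partial final batch
```

In real use `out` would be a (possibly gzip) file handle and `batch_size` tuned to available memory; the point is that the number of write calls scales with the number of batches, not the number of reads.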

@mikemc
Member

mikemc commented Aug 7, 2024

@willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?

@mikemc
Member

mikemc commented Aug 8, 2024

If we do end up wanting to refactor this for speed, here is some R code using the Biostrings package. It uses vectorized operations for the reverse-complementing and concatenation, working on chunks of reads rather than iterating over individual reads. My test files are only 5 read pairs, so I've set the chunk size to 2 reads for testing, but we'd want to set it as large as the instance's memory allows, probably at least 100K or 1M reads.

files <- XVector::open_input_files(c("test/test-r1.fastq", "test/test-r2.fastq"))
chunk_size <- 2

i <- 0
while (TRUE) {
  i <- i + 1
  ## Load `chunk_size` records at a time; an empty result means we're done.
  fwd <- Biostrings::readQualityScaledDNAStringSet(files[1], nrec = chunk_size)
  if (length(fwd) == 0L) {
    break
  }
  ## reverseComplement() on a QualityScaledDNAStringSet also reverses the qualities.
  rev_rc <- Biostrings::readQualityScaledDNAStringSet(files[2], nrec = chunk_size) |>
    Biostrings::reverseComplement()
  stopifnot(identical(length(fwd), length(rev_rc)))
  cat("processing chunk", i, "...\n")
  ## Vectorized concatenation: forward read + "N" spacer + RC'd reverse read.
  seq_new <- Biostrings::xscat(
    fwd,
    Biostrings::DNAString("N"),
    rev_rc
  )
  ## "!" is Phred quality 0, matching the N spacer.
  qual_new <- Biostrings::xscat(
    Biostrings::quality(fwd),
    Biostrings::PhredQuality("!"),
    Biostrings::quality(rev_rc)
  ) |>
    Biostrings::PhredQuality()
  ## Mark the joined reads in the FASTQ headers.
  names(seq_new) <- names(fwd) |> stringr::str_replace(" ", " joined ")
  joined <- Biostrings::QualityScaledDNAStringSet(seq_new, qual_new)
  Biostrings::writeQualityScaledXStringSet(
    joined, "test-joined-chunk.fastq.gz",
    append = TRUE, compress = TRUE
  )
}

On a small test file of 5 read pairs this gives output identical to the Python join_paired_reads() function.

Edit: I tried running a slightly cleaned-up version on a set of ~40M paired-end reads, processing 1M records at a time; it took ~40 min on my MacBook and used about 4.5 GB of memory. So it's still not speedy.

@willbradshaw
Contributor

> @willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?

Kraken2 and RC-sensitive single-end deduplication with Clumpify.

@willbradshaw
Contributor

I tried playing around with Awk here and managed to get a significant speedup; I think combining this with threading should get us to a pretty good place. I'll tackle this next quarter as part of a general attack on wall-clock bottlenecks.

@harmonbhasin
Collaborator

harmonbhasin commented Oct 8, 2024

Downgrading to p2, as this did not come up as a bottleneck when looking at Nextflow logs across three different datasets.

@willbradshaw
Contributor

Resolving this since it doesn't currently seem to be an issue and a lot of this will change after we implement #122.
