join_fastq is a bit of a bottleneck #44
Opened by Evan Fields on 30 Jul 2024:

I recently ran the cleaning and ribodepletion steps of the pipeline (this branch: https://github.com/naobservatory/mgs-workflow/tree/ef_only_cleaning) on four pairs of read files, where each read file was about 6 GB compressed. Total time was about 5.5 elapsed hours (145 CPU hours). The wall time was significantly affected by join_fastq.py:

[image attachment: https://github.com/user-attachments/assets/174693ec-63b8-4fdd-884d-2ed4b1e6972d]

At a glance, it looks like join_fastq contributes 2-3 hours of the wall time. Mike suggests that for my use case just using the mix flag in BBMerge should be fine; I'm not sure about the general pipeline refactor use case.

Comments
Mm, yes, you're right. This is another case where a dumb serial Python script is eating up excess time.

I don't think the mix flag is a good fit for the broader use case, since I think it keeps the unmerged read pairs separate. But this should be pretty easy to parallelize to bring down the clock time.
If we're able to modify the Python script so that it's working on batches of reads (if not the entire read file) rather than looping over individual reads, I think that would give a big speedup. Possibly even looping over individual reads to create the new reads is OK, as long as we're writing in batches rather than issuing a separate write call per read.
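To make that concrete, here's a rough sketch of the kind of batched loop this could be. It's illustrative only, not the pipeline's actual join_fastq.py: it assumes plain uncompressed, in-sync 4-line FASTQ input, and the function name, file paths, batch size, and read-ID handling are placeholders.

# Rough illustration only, not the pipeline's join_fastq.py. Assumes plain,
# uncompressed, in-sync 4-line FASTQ records; the file paths, BATCH_SIZE, and
# the " joined" read-ID suffix are placeholder assumptions.
from itertools import islice

BATCH_SIZE = 100_000  # records to buffer before each write; tune to memory
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def join_batched(fwd_path, rev_path, out_path):
    with open(fwd_path) as fwd, open(rev_path) as rev, open(out_path, "w") as out:
        buffer = []
        while True:
            r1 = list(islice(fwd, 4))  # one forward FASTQ record
            r2 = list(islice(rev, 4))  # matching reverse record
            if len(r1) < 4 or len(r2) < 4:
                break
            header = r1[0].split()[0]
            # Join fwd + "N" + reverse-complemented rev, with "!" quality for the spacer
            seq = r1[1].strip() + "N" + r2[1].strip()[::-1].translate(COMPLEMENT)
            qual = r1[3].strip() + "!" + r2[3].strip()[::-1]
            buffer.append(f"{header} joined\n{seq}\n+\n{qual}\n")
            if len(buffer) >= BATCH_SIZE:
                out.write("".join(buffer))  # one buffered write per batch
                buffer.clear()
        if buffer:
            out.write("".join(buffer))

# e.g. join_batched("reads_1.fastq", "reads_2.fastq", "reads_joined.fastq")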
@willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?
If we do end up wanting to refactor this to speed it up, here is some R code using the Biostrings package that uses vectorized operations to do the RC'ing and concatenation, working on chunks of reads rather than iterating through individual reads. My test files are only 5 read pairs, so I've set the chunk size to 2 reads for testing, but we'd want to set it as big as the memory on the instance allows, probably at least 100K or 1M reads.

files <- XVector::open_input_files(c("test/test-r1.fastq", "test/test-r2.fastq"))
chunk_size <- 2
i <- 0
while (TRUE) {
i <- i + 1
## Load `chunk_size` records at a time.
fwd <- Biostrings::readQualityScaledDNAStringSet(files[1], nrec = chunk_size)
rev_rc <- Biostrings::readQualityScaledDNAStringSet(files[2], nrec = chunk_size) |>
Biostrings::reverseComplement()
if (length(fwd) == 0L) {
break
}
stopifnot(identical(length(fwd), length(rev_rc)))
cat("processing chunk", i, "...\n")
seq_new <- Biostrings::xscat(
fwd,
Biostrings::DNAString("N"),
rev_rc
)
qual_new <- Biostrings::xscat(
Biostrings::quality(fwd),
Biostrings::PhredQuality("!"),
Biostrings::quality(rev_rc)
) |>
Biostrings::PhredQuality()
names(seq_new) <- names(fwd) |> stringr::str_replace(" ", " joined ")
joined <- Biostrings::QualityScaledDNAStringSet(
seq_new, qual_new
)
Biostrings::writeQualityScaledXStringSet(joined, "test-joined-chunk.fastq.gz", append = TRUE, compress = TRUE)
}

On a small test file of 5 reads this gives identical results to the existing join_fastq.py script.

Edit: I tried running a slightly cleaned-up version on a set of ~40M paired-end reads, processing 1M records at a time, and it took ~40 min on my MacBook and used about 4.5 GB of memory. So it's still not speedy.
Kraken2 and RC-sensitive single-end deduplication with Clumpify.
I tried playing around with Awk here and managed to get a significant speedup; I think combining this with threading should get us to a pretty good place. Will tackle this next quarter as part of a general attack on clock-time bottlenecks.
Downgrading to p2 as this did not come up as a bottleneck when looking at Nextflow logs over three different datasets.
Resolving this since it doesn't currently seem to be an issue, and a lot of this will change after we implement #122.