-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bowtie2: do not overcommit #5985
bowtie2: do not overcommit #5985
Conversation
fixes galaxyproject#5983 by using a pipe bowtie2 and samtools run in parallel. since already bowtie2 alone uses more CPU than the assigned ones (less that 1 core) we should not do this, but run them serially. also properly redirects stderr.
One problem with the non piping is that you have to write to the disk one more time... that can be very expensive for some environments. Maybe there is a way to partition the thread usage so that samtools can use some of the available resources. |
tools/bowtie2/bowtie2_wrapper.xml
Outdated
#end if | ||
|
||
## output file | ||
#if str( $sam_options.sam_options_selector ) == "no" or (str( $sam_options.sam_opt ) == "false" and str($sam_options.reorder) == ''): | ||
| samtools sort --no-PG -@\${GALAXY_SLOTS:-2} -T "\${TMPDIR:-.}" -O bam -o '$output' | ||
> alignment.sam | ||
&& samtools sort --no-PG -@\${GALAXY_SLOTS:-2} -T "\${TMPDIR:-.}" -O bam -o '$output' alignment.sam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why no-PG... i think it's good to have the headers that describe the tools that worked on a bam in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand we record everything in Galaxy. So one could say it's redundant.
Also it makes the tests more "unstable", i.e. more lines that may differ due to version changes.
I really don't care. But we should make this an IUC guideline My impression was that we added --no-PG
to quite a few tools in the past and we should not go back and forth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see that complexity for testing, but i think that bams produced by galaxy do not only stay within galaxy.
E.g. they get submitted to SRA as data associated with scientific publications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should come to an IUC-wide decision with this, and I would go with @bwlang's arguments for recording everything in the header.
One case that comes to my mind immediately where --no-PG seems justified to me is with the reheader tool itself because you use it explicitly to shape the header like you think it's best - and even there we only make it optional to omit the additional @PG
record
tools-iuc/tool_collections/samtools/samtools_reheader/samtools_reheader.xml
Lines 19 to 21 in 57fa4dc
<param argument="--no-PG" name="no_pg" type="boolean" truevalue="--no-PG" falsevalue="" | |
label="Keep the new header unmodified?" | |
help="By default, samtools reheader will add itself to the chain of tools used to manipulate the target dataset, which will cause one or more @PG records to be inserted into the new header. If you want the output dataset to end up with the header of the source dataset exactly, you should activate this option." /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's another case of what I'd consider a valid use of --no-PG:
samtools view -bH localbam.bam --no-PG -o header.bam && |
bamtools_split_ref first gets the header of the BAM it's operating on so that it can add it back to "empty" split results that don't have reads in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed - those seem reasonable to me. I can't think of a case where view is important to the analysis history of a file - it would be implied if we had some analysis tool that only takes sam for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, so @bernt-matthias in your HISAT2 sorting solution:
tools-iuc/tools/hisat2/hisat2.xml
Line 351 in 57fa4dc
| samtools sort --no-PG -l 0 -T "\${TMPDIR:-.}" -O bam | samtools view --no-PG -O bam -@ \${GALAXY_SLOTS:-1} -o '${output_alignments}' |
I'd say that the
--no-PG
with samtools view is fine (since it's an implementation detail of how you write the result of sorting), but the samtools sort should be added as a PG record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess when you use samtools view for read filtering, that's when it deserves its own record?
Seems you have given the solution for HISAT2 quite some thought. What makes you think it's not the way to go here? |
the one extra thread needed by samtools sort or view could be subtracted from the ones used by bowtie2 with logic like that used in the samtools collection of tools: tools-iuc/tool_collections/samtools/macros.xml Lines 157 to 160 in 57fa4dc
|
It just seemed to be to complicated. Give me some time I will change it. Probably I would go for |
4f26923
to
45812b7
Compare
bf22085
to
1c04353
Compare
This reverts commit 6e1b4f2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good solution to me!
Co-authored-by: Wolfgang Maier <[email protected]>
fixes #5983
by using a pipe bowtie2 and samtools run in parallel. since already bowtie2 alone uses more CPU than the assigned ones (less than 1 core) we should not do this, but run them serially.
background had a user running 2 fastq files (each 2.6GB) against a file containing 500k contigs. Running bowtie there was extremely efficient (tested up to 40 cores), i.e. for
GALAXY_SLOTS
=X the CPU usage (according to top) was(X+1)*100%
.. probably because X alignment + 1 control thread are used.Alternatively we could do this, but I do not see a reason why we should.
also properly redirects
stderr
.FOR CONTRIBUTOR: