Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bowtie2: do not overcommit #5985

Merged
merged 8 commits into from
May 13, 2024

Conversation

bernt-matthias
Copy link
Contributor

fixes #5983

by using a pipe bowtie2 and samtools run in parallel. since already bowtie2 alone uses more CPU than the assigned ones (less than 1 core) we should not do this, but run them serially.
background had a user running 2 fastq files (each 2.6GB) against a file containing 500k contigs. Running bowtie there was extremely efficient (tested up to 40 cores), i.e. for GALAXY_SLOTS=X the CPU usage (according to top) was (X+1)*100% .. probably because X alignment + 1 control thread are used.

Alternatively we could do this, but I do not see a reason why we should.

also properly redirects stderr.

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

fixes galaxyproject#5983

by using a pipe bowtie2 and samtools run in parallel. since already
bowtie2 alone uses more CPU than the assigned ones (less that 1 core)
we should not do this, but run them serially.

also properly redirects stderr.
@bwlang
Copy link
Contributor

bwlang commented May 5, 2024

One problem with the non piping is that you have to write to the disk one more time... that can be very expensive for some environments. Maybe there is a way to partition the thread usage so that samtools can use some of the available resources.

#end if

## output file
#if str( $sam_options.sam_options_selector ) == "no" or (str( $sam_options.sam_opt ) == "false" and str($sam_options.reorder) == ''):
| samtools sort --no-PG -@\${GALAXY_SLOTS:-2} -T "\${TMPDIR:-.}" -O bam -o '$output'
> alignment.sam
&& samtools sort --no-PG -@\${GALAXY_SLOTS:-2} -T "\${TMPDIR:-.}" -O bam -o '$output' alignment.sam
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why no-PG... i think it's good to have the headers that describe the tools that worked on a bam in place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand we record everything in Galaxy. So one could say it's redundant.

Also it makes the tests more "unstable", i.e. more lines that may differ due to version changes.

I really don't care. But we should make this an IUC guideline My impression was that we added --no-PG to quite a few tools in the past and we should not go back and forth.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see that complexity for testing, but i think that bams produced by galaxy do not only stay within galaxy.
E.g. they get submitted to SRA as data associated with scientific publications.

Copy link
Contributor

@wm75 wm75 May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should come to an IUC-wide decision with this, and I would go with @bwlang's arguments for recording everything in the header.

One case that comes to my mind immediately where --no-PG seems justified to me is with the reheader tool itself because you use it explicitly to shape the header like you think it's best - and even there we only make it optional to omit the additional @PG record

<param argument="--no-PG" name="no_pg" type="boolean" truevalue="--no-PG" falsevalue=""
label="Keep the new header unmodified?"
help="By default, samtools reheader will add itself to the chain of tools used to manipulate the target dataset, which will cause one or more @PG records to be inserted into the new header. If you want the output dataset to end up with the header of the source dataset exactly, you should activate this option." />
.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another case of what I'd consider a valid use of --no-PG:

samtools view -bH localbam.bam --no-PG -o header.bam &&

bamtools_split_ref first gets the header of the BAM it's operating on so that it can add it back to "empty" split results that don't have reads in them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - those seem reasonable to me. I can't think of a case where view is important to the analysis history of a file - it would be implied if we had some analysis tool that only takes sam for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, so @bernt-matthias in your HISAT2 sorting solution:

| samtools sort --no-PG -l 0 -T "\${TMPDIR:-.}" -O bam | samtools view --no-PG -O bam -@ \${GALAXY_SLOTS:-1} -o '${output_alignments}'

I'd say that the --no-PG with samtools view is fine (since it's an implementation detail of how you write the result of sorting), but the samtools sort should be added as a PG record.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess when you use samtools view for read filtering, that's when it deserves its own record?

@wm75
Copy link
Contributor

wm75 commented May 6, 2024

Alternatively we could do this, but I do not see a reason why we should.

Seems you have given the solution for HISAT2 quite some thought. What makes you think it's not the way to go here?

@wm75
Copy link
Contributor

wm75 commented May 6, 2024

the one extra thread needed by samtools sort or view could be subtracted from the ones used by bowtie2 with logic like that used in the samtools collection of tools:

<token name="@ADDTHREADS@"><![CDATA[
##compute the number of ADDITIONAL threads to be used by samtools (-@)
addthreads=\${GALAXY_SLOTS:-1} && (( addthreads-- )) &&
]]></token>

@bernt-matthias
Copy link
Contributor Author

Seems you have given the solution for HISAT2 quite some thought. What makes you think it's not the way to go here?

It just seemed to be to complicated. Give me some time I will change it. Probably I would go for GALAXY_SLOTS - 1 threads for bowtie2 (here I will need to make sure that the result will be at least 1, since bowtie seems not to allow 0).

@bernt-matthias bernt-matthias marked this pull request as draft May 7, 2024 10:40
@bernt-matthias bernt-matthias changed the title bowtie2: do not use pipes bowtie2: do not overcommit May 7, 2024
@bernt-matthias bernt-matthias marked this pull request as ready for review May 7, 2024 15:40
Copy link
Contributor

@wm75 wm75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good solution to me!

Co-authored-by: Wolfgang Maier <[email protected]>
@bernt-matthias bernt-matthias merged commit 7c23f32 into galaxyproject:main May 13, 2024
9 of 11 checks passed
@bernt-matthias bernt-matthias deleted the bowtie-decouple branch May 13, 2024 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bowtie overcommits CPU ressources
3 participants