fix: lower required threads for annotation rules #1528
Merged
Description
The cluster has been quite clogged in this period in early 2025, and @Karl-Svard in prodbioinfo notified me about a BALSAMIC job for the rule cadd_annotate_somaticINDEL_research which, according to ganglia, was hardly using any threads or memory on the node, yet it required 36 cores and booked the whole node. This rule does not seem to require anywhere near that amount of resources, and the cluster could be freed up a bit if we lowered it.
I then looked at other similar rules and found a few bcftools commands that were also run on a whole node. This should not be necessary at all, since the VCFs are rarely even on the scale of 1 GB.
On top of this, the benchmark files specified in these rules shared the same name and would overwrite each other, meaning that we cannot track the benchmarks of these rules.
Link issue: #1529
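To make the change concrete, below is a minimal sketch of the pattern applied in this PR: lower the thread reservation for lightweight annotation and bcftools rules, and give each rule a unique benchmark file name so the benchmarks stop overwriting each other. The rule, file, and wildcard names are illustrative placeholders, not the exact BALSAMIC code.

```python
# Illustrative Snakemake sketch (names and paths are hypothetical):
# - threads lowered from a whole node (36) to a small fixed number
# - benchmark path made unique per rule so runs no longer overwrite each other

rule cadd_annotate_somaticINDEL_research:
    input:
        vcf="vcf/{case_name}.somatic.indel.research.vcf.gz",
    output:
        vcf="vcf/{case_name}.somatic.indel.research.cadd.vcf.gz",
    benchmark:
        # previously several rules pointed at the same benchmark file name
        "benchmarks/cadd_annotate_somaticINDEL_research_{case_name}.tsv"
    threads: 4  # was 36, i.e. the whole node
    shell:
        # actual CADD invocation omitted; shown schematically
        "run_cadd_annotation.sh {input.vcf} {output.vcf}"


rule bcftools_filter_research:
    input:
        vcf="vcf/{case_name}.somatic.snv.research.vcf.gz",
    output:
        vcf="vcf/{case_name}.somatic.snv.research.filtered.vcf.gz",
    benchmark:
        "benchmarks/bcftools_filter_research_{case_name}.tsv"
    threads: 2  # the input VCFs are well below 1 GB, so a whole node is unnecessary
    shell:
        "bcftools view --threads {threads} -f PASS -O z -o {output.vcf} {input.vcf}"
```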
Changed
Fixed
Documentation
Tests
Feature Tests
Run a WGS TN case and check the benchmark files for the adjusted rules to see whether the threads match the memory requirement
See table below from benchmark files for the adjusted rules:
All rules except the cadd_annotate rule used less than 90 MB of memory. The cadd_annotate rule used about 9.1 GB, which should be fine with 4 threads since the nodes have about 190 GB of memory available. Filling one of these nodes with, for instance, 9 such jobs (on a 36-core node, 36 / 4 = 9) would only use around 82 GB of RAM, which is far from causing memory issues.
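As a rough sanity check of that arithmetic, here is a sketch that reads the Snakemake benchmark TSVs and estimates worst-case memory when a node is packed with 4-thread jobs. The benchmarks/ directory and file glob are assumptions for illustration; the max_rss column is Snakemake's standard benchmark output.

```python
# Estimate worst-case memory when packing a 36-core node with 4-thread jobs,
# using the peak RSS reported in the Snakemake benchmark TSVs.
import glob

import pandas as pd

NODE_CORES = 36
THREADS_PER_JOB = 4
NODE_MEM_GB = 190

jobs_per_node = NODE_CORES // THREADS_PER_JOB  # 36 / 4 = 9

for path in glob.glob("benchmarks/cadd_annotate_*.tsv"):
    bench = pd.read_csv(path, sep="\t")
    peak_gb = bench["max_rss"].max() / 1024  # Snakemake reports max_rss in MB
    print(
        f"{path}: peak RSS ~{peak_gb:.1f} GB, "
        f"{jobs_per_node} jobs would use ~{peak_gb * jobs_per_node:.0f} GB "
        f"of the ~{NODE_MEM_GB} GB available on the node"
    )
```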
However, this was a WGS TN case, which, due to the filtration against the normal, has far fewer variants than a WGS TO case. So let's compare it to a WGS TO case as well; my guess is that we may need to increase the threads from 4 to some higher number to be safe. But let's compare...
Run a WGS TO case and check the benchmark files for the adjusted rules to see whether the threads match the memory requirement
Comparing with the TO case civilsole, we actually see even lower memory usage.
We don't have a previous benchmark file to compare against, as it was overwritten due to the benchmark files sharing the same name. But we can look at the log file from the rule to see how long it took to finish, and whether supplying 36 cores actually sped up the step.
It started at 19:24:05 and finished at 02:30:32, so it took roughly 7 hours; in comparison, this run, which used 4 cores, took about 6.5 hours. I think we can conclude that this step does not benefit from a large number of cores, and there doesn't seem to be a risk of memory issues either.
Pipeline Integrity Tests (.hk file)
Clinical Genomics Stockholm
Documentation
Panel of Normal specific criteria
User Changes
Infrastructure Changes
Validation criteria
Validation criteria to be added to validation report PR: [LINK-TO-VALIDATION-REPORT-PR from the validations repository]
Version specific criteria
Important
One of the below checkboxes for validation needs to be checked
Checklist
Important
Ensure that all checkboxes below are ticked before merging.
For Developers
For Reviewers
conditions where applicable, with satisfactory results.