Merge pull request #57 from Pathogen-Genomics-Cymru/tbprofiler
Tbprofiler
WhalleyT authored May 8, 2024
2 parents 446f587 + 604733d commit 6d1efb0
Showing 5 changed files with 60 additions and 60 deletions.
91 changes: 51 additions & 40 deletions README.md
@@ -67,8 +67,11 @@ Name of the bowtie index, e.g. hg19_1kgmaj<br />
Run [vcfmix](https://github.com/AlexOrlek/VCFMIX), yes or no. Set to no for synthetic samples<br />
* **resistance_profiler**<br />
Run resistance profiling for Mycobacterium tuberculosis. Either ["tb-profiler"](https://tbdr.lshtm.ac.uk/) or "none".
* **afanc_myco_db**<br />Path to the [afanc](https://github.com/ArthurVM/Afanc) database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
<br />
* **afanc_myco_db**<br />
Path to the [afanc](https://github.com/ArthurVM/Afanc) database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
* **update_tbprofiler**<br />
Update tb-profiler. Either "yes" or "no". "yes" may be useful when running outside of a container for the first time, as no tb-profiler database matching our reference will have been constructed yet. This is not needed with the climb, docker and singularity profiles, as the reference has already been added. Alternatively, you can run ```tb-profiler update_tbdb --match_ref <lodestone_dir>/resources/tuberculosis.fasta```.
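
As a rough illustration of the manual alternative mentioned for `update_tbprofiler`, the Python sketch below assembles the `tb-profiler update_tbdb --match_ref` command for a given checkout. The helper name `build_update_command` and its `lodestone_dir` argument are hypothetical; only the command itself and the `resources/tuberculosis.fasta` path come from the text above.

```python
from pathlib import PurePosixPath


def build_update_command(lodestone_dir):
    """Return the argv list for rebuilding the tb-profiler database.

    lodestone_dir is a hypothetical placeholder for wherever the
    pipeline was cloned; the reference path mirrors the README text.
    """
    reference = PurePosixPath(lodestone_dir) / "resources" / "tuberculosis.fasta"
    return ["tb-profiler", "update_tbdb", "--match_ref", str(reference)]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_update_command("/path/to/lodestone")))
```

The command is only printed here; passing the returned list to `subprocess.run` would execute it, assuming tb-profiler is installed and on the `PATH`.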


For more information on the parameters run `nextflow run main.nf --help`

@@ -83,49 +86,57 @@ NXF_VER=20.11.0-edge nextflow run main.nf -stub -config testing.config
## Checkpoints ##
Checkpoints used throughout this workflow to fail a sample/issue warnings:

processes preprocessing:checkFqValidity or preprocessing:checkBamValidity
1. (Fail) If sample does not pass fqtools 'validate' or samtools 'quickcheck', as appropriate.
**processes preprocessing:checkFqValidity or preprocessing:checkBamValidity**
1. (*Fail*) If sample does not pass fqtools 'validate' or samtools 'quickcheck', as appropriate.

process preprocessing:countReads\
2. (Fail) If sample contains < 100k pairs of raw reads.
**process preprocessing:countReads**

2. (*Fail*) If sample contains < 100k pairs of raw reads.

process preprocessing:fastp\
3. (Fail) If sample contains < 100k pairs of cleaned reads, required to all be > 50bp (cleaning using fastp with --length_required 50 --average_qual 10 --low_complexity_filter --correction --cut_right --cut_tail --cut_tail_window_size 1 --cut_tail_mean_quality 20).

process preprocessing:kraken2\
4. (Fail) If the top family hit is not Mycobacteriaceae\
5. (Fail) If there are fewer than 100k reads classified as Mycobacteriaceae \
6. (Warn) If the top family classification is mycobacterial, but this is not consistent with top genus and species classifications\
7. (Warn) If the top family is Mycobacteriaceae but no G1 (species complex) classifications meet minimum thresholds of > 5000 reads or > 0.5% of the total reads (this is not necessarily a concern as not all mycobacteria have a taxonomic classification at this rank)\
8. (Warn) If sample is mixed or contaminated - defined as containing reads > the 5000/0.5% thresholds from multiple non-human species\
9. (Warn) If sample contains multiple classifications to mycobacterial species complexes, each meeting the > 5000/0.5% thresholds\
10. (Warn) If no species classification meets the 5000/0.5% thresholds\
11. (Warn) If no genus classification meets the 5000/0.5% thresholds
**process preprocessing:fastp**

3. (*Fail*) If sample contains < 100k pairs of cleaned reads, required to all be > 50bp (cleaning using fastp with --length_required 50 --average_qual 10 --low_complexity_filter --correction --cut_right --cut_tail --cut_tail_window_size 1 --cut_tail_mean_quality 20).

**process preprocessing:kraken2**

4. (*Fail*) If the top family hit is not Mycobacteriaceae
5. (*Fail*) If there are fewer than 100k reads classified as Mycobacteriaceae
6. (*Warn*) If the top family classification is mycobacterial, but this is not consistent with top genus and species classifications
7. (*Warn*) If the top family is Mycobacteriaceae but no G1 (species complex) classifications meet minimum thresholds of > 5000 reads or > 0.5% of the total reads (this is not necessarily a concern as not all mycobacteria have a taxonomic classification at this rank)
8. (*Warn*) If sample is mixed or contaminated - defined as containing reads > the 5000/0.5% thresholds from multiple non-human species
9. (*Warn*) If sample contains multiple classifications to mycobacterial species complexes, each meeting the > 5000/0.5% thresholds
10. (*Warn*) If no species classification meets the 5000/0.5% thresholds
11. (*Warn*) If no genus classification meets the 5000/0.5% thresholds
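
The 5000-read / 0.5% thresholds used by the Kraken checkpoints can be sketched in Python as follows. This is an illustration only, not the pipeline's own code: the function name `meets_threshold` is hypothetical, and it assumes a classification passes when it exceeds either threshold, as the "or" in checkpoint 7 suggests.

```python
MIN_READS = 5000       # absolute read-count threshold quoted in the checkpoints
MIN_FRACTION = 0.005   # 0.5% of the total reads


def meets_threshold(reads, total_reads):
    """Return True if a classification exceeds either threshold.

    Assumes 'or' semantics: more than 5000 reads, or more than 0.5%
    of the total reads in the sample.
    """
    if total_reads <= 0:
        return False
    return reads > MIN_READS or (reads / total_reads) > MIN_FRACTION
```

Under this reading, a low-depth sample can still pass on the percentage alone, while a very deep sample can pass on the absolute count alone.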

process preprocessing:identifyBacterialContaminants\
12. (Fail) If regardless of what Kraken reports, Afanc does not make a species-level mycobacterial classification (note that we do not use Kraken mycobacterial classifications other than to determine whether 100k reads are family Mycobacteriaceae; for higher-resolution classification, we defer to Afanc)\
13. (Fail) If the sample is not contaminated and the top species hit is not one of the 10 supported Mycobacteria: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis\
14. (Fail) If the sample is not contaminated and the top species hit is contrary to the species expected (e.g. "avium" rather than "tuberculosis" - only tested if you provide that expectation)\
15. (Warn) If the top Afanc species hit, on the basis of highest % coverage, does not also have the highest median depth\
16. (Warn) If we are unable to associate an NCBI taxon ID to any given contaminant species, which means we will not be able to locate its genome, and thereby remove it as a contaminant\
17. (Warn) If we are unable to determine a URL for the latest RefSeq genome associated with a contaminant species' taxon ID\
18. (Warn) If no complete genome could be found for a contaminant species. The workflow will proceed with alignment-based contaminant removal, but you're warned that there's reduced confidence in detecting reads from this species
**process preprocessing:identifyBacterialContaminants**

12. (*Fail*) If regardless of what Kraken reports, Afanc does not make a species-level mycobacterial classification (note that we do not use Kraken mycobacterial classifications other than to determine whether 100k reads are family Mycobacteriaceae; for higher-resolution classification, we defer to Afanc)
13. (*Fail*) If the sample is not contaminated and the top species hit is not one of the 10 supported Mycobacteria: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis
14. (*Fail*) If the sample is not contaminated and the top species hit is contrary to the species expected (e.g. "avium" rather than "tuberculosis" - only tested if you provide that expectation)
15. (*Warn*) If the top Afanc species hit, on the basis of highest % coverage, does not also have the highest median depth
16. (*Warn*) If we are unable to associate an NCBI taxon ID to any given contaminant species, which means we will not be able to locate its genome, and thereby remove it as a contaminant
17. (*Warn*) If we are unable to determine a URL for the latest RefSeq genome associated with a contaminant species' taxon ID
18. (*Warn*) If no complete genome could be found for a contaminant species. The workflow will proceed with alignment-based contaminant removal, but you're warned that there's reduced confidence in detecting reads from this species

process preprocessing:downloadContamGenomes\
19. (Fail) If a contaminant is detected but we are unable to download a representative genome, and thereby remove it
**process preprocessing:downloadContamGenomes**

19. (*Fail*) If a contaminant is detected but we are unable to download a representative genome, and thereby remove it

process preprocessing:summarise\
20. (Fail) If after having taken an alignment-based approach to decontamination, Kraken still detects a contaminant species\
21. (Fail) If after having taken an alignment-based approach to decontamination, the top species hit is not one of the 10 supported Mycobacteria\
22. (Fail) If, after successfully removing contaminants, the top species hit is contrary to the species expected (e.g. "avium" rather than "tuberculosis" - only tested if you provide that expectation)

process clockwork:alignToRef\
23. (Fail) If < 100k reads could be aligned to the reference genome\
24. (Fail) If, after aligning to the reference genome, the average read mapping quality < 10\
25. (Fail) If < 50% of the reference genome was covered at 10-fold depth

process clockwork:minos\
26. (Warn) If sample is not TB, then it is not passed to a resistance profiler
**process preprocessing:summarise**

20. (*Fail*) If after having taken an alignment-based approach to decontamination, Kraken still detects a contaminant species
21. (*Fail*) If after having taken an alignment-based approach to decontamination, the top species hit is not one of the 10 supported Mycobacteria
22. (*Fail*) If, after successfully removing contaminants, the top species hit is contrary to the species expected (e.g. "avium" rather than "tuberculosis" - only tested if you provide that expectation)

**process clockwork:alignToRef**

23. (*Fail*) If < 100k reads could be aligned to the reference genome
24. (*Fail*) If, after aligning to the reference genome, the average read mapping quality < 10
25. (*Fail*) If < 50% of the reference genome was covered at 10-fold depth
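
The three alignment checkpoints above amount to simple threshold tests. The Python sketch below shows one way to express them; the function name and its inputs (`aligned_reads`, `mean_mapq`, `frac_covered_10x`) are hypothetical and would need to be derived from the alignment statistics by other means.

```python
def align_to_ref_failures(aligned_reads, mean_mapq, frac_covered_10x):
    """Return the list of failed alignment checkpoints (empty if all pass).

    frac_covered_10x is the fraction (0 to 1) of the reference genome
    covered at 10-fold depth or more.
    """
    failures = []
    if aligned_reads < 100_000:
        failures.append("fewer than 100k reads aligned to the reference genome")
    if mean_mapq < 10:
        failures.append("average read mapping quality below 10")
    if frac_covered_10x < 0.5:
        failures.append("less than 50% of the reference covered at 10-fold depth")
    return failures
```

Returning every failed check, rather than stopping at the first, keeps all reasons for a failure visible in one pass.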

**process clockwork:minos**

26. (*Warn*) If sample is not TB, then it is not passed to a resistance profiler

## Acknowledgements ##
For a list of direct authors of this pipeline, please see the contributors list. All of the software dependencies of this pipeline are recorded in the version.json.
11 changes: 1 addition & 10 deletions config/containers.config
@@ -1,13 +1,4 @@
params{
container_enabled = "true"
container_enabled = "true"
}


process {
update_tbprofiler = "false"


process {
withLabel:low_cpu {cpus = 2}
withLabel:normal_cpu { cpus = 8 }
withLabel:low_memory { memory = '5GB' }
8 changes: 5 additions & 3 deletions docker/Dockerfile.tbprofiler-0.9.8
@@ -11,7 +11,8 @@ ARG TBPROFILER_VER="6.2.0"
# this version is the shortened commit hash on the `master` branch here https://github.com/jodyphelan/tbdb/
# commits are found on https://github.com/jodyphelan/tbdb/commits/master
# this was the latest commit as of 2024-05-01
ARG TBDB_VER="152d603"

ARG TBDB_VER="e6a0040"

# LABEL instructions tag the image with metadata that might be important to the user
LABEL base.image="micromamba:1.3.0"
@@ -48,7 +49,8 @@ ENV PATH="/opt/conda/bin:${PATH}"
# Version of database can be confirmed at /opt/conda/share/tbprofiler/tbdb.version.json
# can also run 'tb-profiler list_db' to find the same version info
# In 5.0.1 updating_tbdb does not work with tb-profiler update_tbdb --commit ${TBDB_VER}
RUN tb-profiler update_tbdb --commit ${TBDB_VER}

WORKDIR /data
RUN tb-profiler update_tbdb --match_ref tuberculosis.fasta

#wants full path to reference
RUN tb-profiler update_tbdb --match_ref /data/tuberculosis.fasta --commit ${TBDB_VER}
6 changes: 0 additions & 6 deletions main.nf
@@ -92,12 +92,6 @@ if(!resistance_profilers.contains(params.resistance_profiler)){
exit 1, 'Invalid resistance profiler. Must be one of "tb-profiler" or "none" to skip.'
}

//tbprofiler container already has the reference genome in the DB, so skip if using docker
if((params.resistance_profiler == "tb-profiler") && (params.container_enabled == true)) {
update_tbprofiler = true
} else {
update_tbprofiler = false
}

resistance_profiler = params.resistance_profiler

4 changes: 3 additions & 1 deletion nextflow.config
@@ -35,7 +35,9 @@ params {
vcfmix = 'yes'

resistance_profiler = "tb-profiler"
update_tbprofiler = "yes"

update_tbprofiler = "no"


// path to singularity recipes directory (needed to strip software versions in getversion)
sing_dir = "${baseDir}/singularity"