Config restructure #621

Draft
wants to merge 13 commits into
base: dev

Conversation

jfy133
Member

@jfy133 jfy133 commented May 30, 2024

Closes #501

Proposal table: #501 (comment)

  • test (separate run)
  • test_single_end
  • test_alternatives
  • test_preassembly_binrefine -> might be problematic due to very slow download of CheckM (Aria2)/Genomad/Gunc databases...
  • test_hybrid_rm
  • test_nothing
  • test_extras
  • test_bigdb
  • test_full

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@jfy133 jfy133 marked this pull request as draft May 30, 2024 13:57

github-actions bot commented May 30, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 44015d4

+| ✅ 323 tests passed       |+
#| ❔   2 tests were ignored |#
!| ❗   5 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file [TODO: try and test using for --host_fasta and --host_genome]
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-11-28 13:56:34

@jfy133
Member Author

jfy133 commented Jun 22, 2024

Note: the CheckM database download is very slow - mirror it to an S3 bucket?

@jfy133 jfy133 mentioned this pull request Jun 24, 2024
11 tasks
@jfy133
Member Author

jfy133 commented Jul 18, 2024

TODO as of today (test_preassembly_binrefine):

  • Sort out using genomad database: maybe add optional decompression of tar.gz in addition to direct directory
  • Find best metaeuk db (mmseqs causing my laptop to hang, might be misconfigured resources though) (edit: pushed the modules yeast version, but should move to mag branch if we go with it)
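
The optional decompression in the first item could look roughly like this. It is a minimal shell sketch, not the pipeline's actual Nextflow logic: it accepts either a directory or a `.tar.gz` archive and prints a usable directory. The function name `resolve_db_dir` and the default `extracted_db` directory are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a database directory for an input that is
# either an existing directory or a .tar.gz archive.
resolve_db_dir() {
    local input="$1"
    local outdir="${2:-extracted_db}"   # assumed default extraction directory

    # Already a directory: use it as-is.
    if [ -d "$input" ]; then
        printf '%s\n' "$input"
        return 0
    fi

    # A .tar.gz archive: unpack it into outdir and report that directory.
    case "$input" in
        *.tar.gz)
            if [ -f "$input" ]; then
                mkdir -p "$outdir"
                tar -xzf "$input" -C "$outdir"
                printf '%s\n' "$outdir"
                return 0
            fi
            ;;
    esac

    echo "Unsupported database input: $input" >&2
    return 1
}
```

For example, `db_dir=$(resolve_db_dir genomad_db.tar.gz)` (file name made up) would unpack the archive into `extracted_db/` and store that path, while passing a directory returns it unchanged.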

@jfy133
Member Author

jfy133 commented Sep 3, 2024

Latest failure is likely due to the GUNC dmnd db being 13 GB...

And there is not a small version of it sadly.

Although it is just a DMND file... I wonder if I could just make one of a single genome myself and see what happens (expect failure though)

EDIT: That worked! The pass.GUNC score is nonsense, but should be fine for basic tests (we don't use that downstream)

## Download the FASTA (wget might not work - it's the B. fragilis coding sequences FASTA faa)
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta_cds_aa&id=1992822979&extrafeat=null&conwithfeat=on&hide-cdd=on&ncbi_phid=CE8C15326D6BB8C10000000006490560" -o sequence.txt

## Build diamond DB
diamond makedb --in sequence.txt -d gunc-test

## OR (with the taxdump and prot2acc files I downloaded a while ago)
diamond makedb --in sequence.txt -d gunc-test2 --taxonmap ~/cache/databases/acc2taxid/prot.accession2taxid.FULL.gz --taxonnodes ~/cache/databases/taxdmp/2024-02-02/nodes.dmp --taxonnames ~/cache/databases/taxdmp/2024-02-02/names.dmp

## run GUNC
gunc run --db_file gunc-test.dmnd -i genome.fna.gz

## Produces
$ cat GUNC.progenomes_2.1.maxCSS_level.tsv 
genome	n_genes_called	n_genes_mapped	n_contigs	taxonomic_level	proportion_genes_retained_in_major_clades	genes_retained_index	clade_separation_score	contamination_portion	n_effective_surplus_clades	mean_hit_identity	reference_representation_score	pass.GUNC
genome.fna.gz	4411	4122	2	kingdom	nan	nan	nan	nan	nan	nan	nan	nan



EDIT: I wonder if I need to add the taxonomy information in the diamond makedb step to build a database that produces proper results...

EDIT EDIT: nope, still all nans so maybe it doesn't matter
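
Since the values are all `nan` with a single-genome database, a basic test can only check the shape of the output, not its content. A minimal sketch of such a check, assuming the tab-separated layout shown above; the function name `check_tsv_shape` is hypothetical and is not part of GUNC or the pipeline:

```shell
# Hypothetical check: succeed only if there is at least one data row and
# every row has the same number of tab-separated fields as the header.
check_tsv_shape() {
    awk -F'\t' '
        NR == 1    { ncol = NF; next }   # remember the header width
        NF != ncol { bad = 1 }           # a row with a different width
                   { rows++ }            # count data rows
        END        { exit (bad || rows == 0) ? 1 : 0 }
    ' "$1"
}
```

It could be used as `check_tsv_shape GUNC.progenomes_2.1.maxCSS_level.tsv && echo "shape OK"`.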

@jfy133 jfy133 marked this pull request as ready for review November 28, 2024 13:47
@jfy133 jfy133 marked this pull request as draft November 28, 2024 13:47
@jfy133
Member Author

jfy133 commented Nov 28, 2024

For CAT_pack (while I'm waiting for the old GUNC-compatible prot2file to download)

## Download the amino-acid FASTA of the B. fragilis coding sequences
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta_cds_aa&id=1992822979&extrafeat=null&conwithfeat=on&hide-cdd=on&ncbi_phid=CE8C15326D6BB8C10000000006490560" -o sequence.txt

## Use my scripts to filter the nodes/names/acc2taxid files down to a single taxid

bash ~/bin/taxdmp_filter.sh 817 ## for nodes/names
bash ~/bin/accession2taxid_filter.sh 817 ## for accessions

## Repair headers of B. fragilis protein FASTA
sed 's/lcl|//g;s/_/ /2' sequence.txt > sequence_fixedheaders.txt

## Generate database
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes  input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test2/

## Test using uncompressed contigs from the metaSPAdes assembly (note: --no_stars was required for some reason, but this seems to happen only when there is a single genome in the database)
CAT_pack contigs -c SPAdes-test_minigut_sample2.scaffolds.fa -d ../../../cat_fakedb/test2/db/ -t ../../../cat_fakedb/test2/tax/ --no_stars --force

Seems to make something?
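
To see what the header-repair step above does, here is the same `sed` expression applied to a single made-up header. The accession and protein ID are placeholders, not taken from the actual download, so the real headers may split at a different point.

```shell
# Placeholder header in the NCBI coding-sequence style (values are made up).
header='>lcl|NZ_EXAMPLE.1_prot_WP_000000000.1_1 [gene=example]'

# Same expression as in the repair step: drop the "lcl|" prefix, then turn
# the second underscore on the line into a space.
fixed=$(printf '%s\n' "$header" | sed 's/lcl|//g;s/_/ /2')

printf '%s\n' "$fixed"
# prints: >NZ_EXAMPLE.1 prot_WP_000000000.1_1 [gene=example]
```

With this placeholder the first whitespace-delimited token becomes `NZ_EXAMPLE.1`. If the accession itself contains no underscore, the second underscore falls after `prot`, and the split lands there instead.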

@jfy133
Member Author

jfy133 commented Nov 28, 2024

Next time:

  • Add CAT_* modules so they can be added to the tests using the mini DB I made
  • Continue with making a tiny GUNC database again
