genome name changed in output NJ_tree #159

Closed
flass opened this issue Mar 8, 2021 · 6 comments · Fixed by #161
Labels
bug Something isn't working

Comments

@flass

flass commented Mar 8, 2021

Hi John,
I've got yet another minor bug to report here:

Versions
I am using PopPUNK v2.3.0 with pp-sketchlib 1.6.2, as provided by a conda environment built with

conda create -n poppunk230 -c defaults -c conda-forge -c bioconda poppunk==2.3.0 pp-sketchlib==1.6.2 graph-tool==2.37

Command used and output returned

poppunk_visualise --ref-db /lustre/scratch118/infgen/team216/fl4/poppunk_7kVc/950Vc --output 950Vc --threads 8 \
 --microreact --grapetree

Describe the bug
When uploading the visualisation output files to Microreact, I realised something was wrong: some genomes had their names edited in the NJ tree Newick file.

The name of the genome is RKI-ZBS2-CH129_TACAGC_L002.contigs_spades, but it appears in 950Vc_core_NJ.nwk as: 'RKI-ZBS2-CH129 TACAGC L002'
That is, with the underscores replaced by spaces.

The other output files (.csv and .dot) have the correct spelling, even though in some it's edited as well to drop the .contigs_spades suffix:
in the 950Vc_perplexity20.0_accessory_tsne.dot file:

... "RKI-ZBS2-CH129_TACAGC_L002"[x=24.0984845161438,y=199.08136367797852]; ...

in the 950Vc_grapetree_clusters.csv and 950Vc_microreact_clusters.csv files:

RKI-ZBS2-CH129_TACAGC_L002,96

in the 950Vc_clusters.csv file:

RKI-ZBS2-CH129_TACAGC_L002.contigs_spades,96

So the difference between 950Vc_core_NJ.nwk and 950Vc_microreact_clusters.csv leads to a bug when uploading to Microreact.
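A mismatch like this can be checked for before uploading. The following is a minimal sketch using only the Python standard library, not part of PopPUNK: the function names `tip_labels` and `ids_missing_from_tree` are hypothetical, the regular expression is a simplified Newick reader rather than a full parser, and the CSV is assumed to hold the sample ID in its first column.

```python
import csv
import io
import re

# A leaf label follows "(" or ","; it is either single-quoted or a run of
# characters containing no Newick punctuation. Simplified on purpose.
_LEAF = re.compile(r"[(,]\s*('(?:[^']|'')*'|[^\s(),:;'\[\]]+)")


def tip_labels(newick):
    """Return leaf labels as written in the Newick text, with quotes removed."""
    labels = []
    for raw in _LEAF.findall(newick):
        if raw.startswith("'"):
            # Inside quotes, a doubled quote stands for one literal quote.
            raw = raw[1:-1].replace("''", "'")
        labels.append(raw)
    return labels


def ids_missing_from_tree(newick, csv_text, skip_header=False):
    """List IDs from the first CSV column that have no identical tip label."""
    tips = set(tip_labels(newick))
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if skip_header:
        rows = rows[1:]
    return [row[0] for row in rows if row[0] not in tips]


# Hypothetical miniature of the two files; branch lengths and the cluster
# numbers other than 96 are made up for illustration.
example_tree = (
    "(('RKI-ZBS2-CH129 TACAGC L002':0.1,"
    "GCA_000387605.1_BJG-01_genomic:0.2):0.05,SRR7962186:0.3);"
)
example_csv = (
    "RKI-ZBS2-CH129_TACAGC_L002,96\n"
    "GCA_000387605.1_BJG-01_genomic,12\n"
    "SRR7962186,3\n"
)
print(ids_missing_from_tree(example_tree, example_csv))
```

With real files, pass the contents of the `.nwk` and `_microreact_clusters.csv` files and set `skip_header=True` if the CSV starts with a header row.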

Weirdly, many other genomes have underscores in their names, but none of the others have been changed. Is this due to some peculiarity of that name that is wrongly parsed when stripping the name tail? (It does contain dashes, underscores and dots.)

In this case there is only one name to correct, so it's easy to deal with, but I've had it before where many more names were missing or edited, which really prevented me from enjoying the Microreact viz.
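When many names are affected, the manual correction could be scripted. This is only a sketch of one possible workaround, assuming the sole change is underscores turned into spaces; `restore_names` is a hypothetical helper, not a PopPUNK function.

```python
def restore_names(tree_labels, known_names):
    """Map each tree label to the known sample name it most likely came from.

    Labels and names are compared after replacing spaces with underscores.
    A label is only renamed when exactly one known name matches; otherwise
    it is left unchanged.
    """
    def key(name):
        return name.replace(" ", "_")

    by_key = {}
    for name in known_names:
        by_key.setdefault(key(name), []).append(name)

    restored = {}
    for label in tree_labels:
        candidates = by_key.get(key(label), [])
        restored[label] = candidates[0] if len(candidates) == 1 else label
    return restored


# Names taken from the CSV output are treated as the reference spelling.
mapping = restore_names(
    ["RKI-ZBS2-CH129 TACAGC L002", "SRR7962186"],
    ["RKI-ZBS2-CH129_TACAGC_L002", "SRR7962186"],
)
print(mapping)
```

The returned dictionary can then be used to rewrite the labels in the tree file, or the IDs in the CSV, so that both sides agree.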

Best,

Florent

@nickjcroucher
Collaborator

@flass did you use dendropy or rapidnj for the tree? The latter will be much faster for a large dataset

@flass
Author

flass commented Mar 8, 2021

It was using the default tool. I assumed that was RapidNJ (it was indeed pretty fast to compute).

@nickjcroucher
Collaborator

The default is to use dendropy, and the changes are probably down to dendropy's treatment of underscores, which is complex: https://dendropy.org/primer/taxa.html. I would add --rapidnj rapidnj for faster tree building, which I think will also preserve names (assuming rapidnj is on your path).

We should still take a look at the dendropy behaviour. @flass, have you got a test set of ~10 sequences with odd names that we could use? @johnlees I think we need to add preserve_underscores=True to dendropy.PhylogeneticDistanceMatrix.from_csv(); I can do this as part of the changes to trees.py today.

@nickjcroucher nickjcroucher added the bug Something isn't working label Mar 9, 2021
@johnlees
Member

Ah do we not use that? Yes, we should add the flag. This will be addressed in #148 then
(I think this is all from 'old school' phylogenetic formats being more inflexible with names and their lengths)
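For context, the Newick convention treats underscores in unquoted labels as blanks, while single-quoted labels are taken literally. The sketch below illustrates that convention only; it is not dendropy's implementation, and `read_label` and `write_label` are hypothetical names.

```python
import re

# Characters that force quoting when writing a label: whitespace, Newick
# punctuation, and the underscore (which would otherwise be read as a blank).
_NEEDS_QUOTES = re.compile(r"[\s()\[\]':;,_]")


def read_label(token):
    """Interpret one label token following the Newick convention."""
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        # Quoted: literal text, with '' standing for a single quote.
        return token[1:-1].replace("''", "'")
    # Unquoted: underscores stand for blanks.
    return token.replace("_", " ")


def write_label(name):
    """Write a name so that read_label returns it unchanged."""
    if _NEEDS_QUOTES.search(name):
        return "'" + name.replace("'", "''") + "'"
    return name


name = "RKI-ZBS2-CH129_TACAGC_L002"
print(read_label(name))               # unquoted: underscores become blanks
print(read_label(write_label(name)))  # quoted: name survives the round trip
```

A reader or writer that applies the unquoted rule to a name containing underscores produces the kind of space-separated label reported in this issue.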

@johnlees johnlees linked a pull request Mar 10, 2021 that will close this issue
@flass
Author

flass commented Mar 10, 2021

@nickjcroucher see the list of names below:

GCA_000387605.1_BJG-01_genomic
RKI-ZBS2-CH129_TACAGC_L002.contigs_spades
GCA_000154005.2_Vibrio_cholerae_623-39_V1_genomic
GCA_006803105.1_ASM680310v1_genomic
220011_4_C4_L003_R1.contigs_velvet
GCA_007623975.1_ASM762397v1_genomic
22776_8#168.contigs_spades
220875_4_C7_L002_R1.contigs_velvet
220871_8_C3_L003_R1.contigs_velvet
SRR7962186.contigs_spades
GCA_003716435.1_ASM371643v1_genomic
GCA_007624145.1_ASM762414v1_genomic
221749_4_C3_L003_R1.contigs_velvet

@johnlees
Member

Fixed in 66ca2d2

@johnlees johnlees removed a link to a pull request Mar 17, 2021
@johnlees johnlees linked a pull request Mar 17, 2021 that will close this issue