genome name changed in output NJ_tree #159

flass · 2021-03-08T19:13:27Z

Hi John,
I've got yet another minor bug to report here:

Versions
I am using PopPUNK v2.3.0 with pp-sketchlib 1.6.2, as provided by a conda environment built with

conda create -n poppunk230 -c defaults -c conda-forge -c bioconda poppunk==2.3.0 pp-sketchlib==1.6.2 graph-tool==2.37

Command used and output returned

poppunk_visualise --ref-db /lustre/scratch118/infgen/team216/fl4/poppunk_7kVc/950Vc --output 950Vc --threads 8 \
 --microreact --grapetree

Describe the bug
When uploading the visualisation output files to Microreact, I realised something was wrong: one genome had its name edited in the NJ tree Newick file.

The name of the genome is RKI-ZBS2-CH129_TACAGC_L002.contigs_spades, but it appears in 950Vc_core_NJ.nwk as: 'RKI-ZBS2-CH129 TACAGC L002'
i.e. with the underscores replaced by spaces.

The other output files (.csv and .dot) have the correct spelling, although in some of them the name is also edited to drop the .contigs_spades suffix.
In the 950Vc_perplexity20.0_accessory_tsne.dot file:

... "RKI-ZBS2-CH129_TACAGC_L002"[x=24.0984845161438,y=199.08136367797852]; ...

in the 950Vc_grapetree_clusters.csv and 950Vc_microreact_clusters.csv files:

RKI-ZBS2-CH129_TACAGC_L002,96

in the 950Vc_clusters.csv file:

RKI-ZBS2-CH129_TACAGC_L002.contigs_spades,96

So the difference between 950Vc_core_NJ.nwk and 950Vc_microreact_clusters.csv leads to a bug when uploading to Microreact.
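A mismatch like this can be spotted before uploading by comparing the tip labels in the Newick file with the IDs in the clusters CSV. The following is a minimal standard-library sketch, not part of PopPUNK: the function names are hypothetical, the regex-based label extraction is a simplification rather than a full Newick parser, and it assumes the CSV has a header row with the sample ID in the first column.

```python
import csv
import io
import re

# A tip label directly follows "(" or ","; it is either a single-quoted
# string (with '' as an escaped quote) or a run of non-punctuation characters.
_TIP = re.compile(r"(?<=[(,])\s*(?:'((?:[^']|'')*)'|([^\s()\[\]':;,]+))")


def newick_tip_labels(newick: str) -> set:
    """Return the tip labels as written in the file (quotes stripped).

    Simplification: underscores in unquoted labels are kept as-is, and
    quoted internal node labels containing commas are not handled.
    """
    labels = set()
    for quoted, bare in _TIP.findall(newick):
        labels.add(quoted.replace("''", "'") if quoted else bare)
    return labels


def csv_ids(csv_text: str) -> set:
    """Return the first column of a CSV, skipping the (assumed) header row."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)
    return {row[0] for row in rows if row}


def name_mismatches(newick: str, csv_text: str):
    """Return (names only in the tree, names only in the CSV), both sorted."""
    tips = newick_tip_labels(newick)
    ids = csv_ids(csv_text)
    return sorted(tips - ids), sorted(ids - tips)
```

To use it, read the contents of 950Vc_core_NJ.nwk and 950Vc_microreact_clusters.csv into strings and pass them to `name_mismatches()`; two empty lists mean every tip has a matching row.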

Weirdly, many other genomes have underscores in their names, but none of the others has been changed. Is this due to something specific to that name that is wrongly parsed when stripping the name's tail? (Admittedly, it contains dashes, underscores and dots.)

In this case there is only one name to correct, so it's easy to deal with, but I've had cases before where many more names were missing or edited, which really prevented me from enjoying the Microreact visualisation.

Best,

Florent


nickjcroucher · 2021-03-08T20:43:17Z

@flass did you use dendropy or rapidnj for the tree? The latter will be much faster for a large dataset

flass · 2021-03-08T22:49:25Z

It was using the default tool; I assumed that was RapidNJ (it was indeed pretty fast to compute).

nickjcroucher · 2021-03-09T06:29:56Z

The default is to use dendropy, and the changes are probably down to dendropy's treatment of underscores, which is complex: https://dendropy.org/primer/taxa.html. I would add --rapidnj rapidnj for faster tree building, which I think will also preserve names (assuming rapidnj is on your path).

We should still take a look at the dendropy behaviour. @flass, have you got a test set of ~10 sequences with odd names that we could use? @johnlees, I think we need to add preserve_underscores=True to dendropy.PhylogeneticDistanceMatrix.from_csv(); I can do this as part of the changes to trees.py today.
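For context on the underscore handling: in the classic Newick convention, an underscore in an unquoted label stands for a blank, while a single-quoted label is taken literally. The sketch below illustrates that convention using only the standard library. It is not dendropy's implementation, the function names are hypothetical, and whether this is the exact path that altered the name in this issue is not confirmed here.

```python
import re

# Characters that force a label to be quoted under the Newick convention.
_NEEDS_QUOTES = re.compile(r"[\s()\[\]':;,]")


def read_label(token: str) -> str:
    """Decode one Newick label token under the classic convention."""
    if len(token) >= 2 and token[0] == token[-1] == "'":
        # Quoted label: taken literally, '' is an escaped single quote.
        return token[1:-1].replace("''", "'")
    # Unquoted label: underscores stand for blanks.
    return token.replace("_", " ")


def write_label(name: str) -> str:
    """Encode a name so that read_label() returns it unchanged."""
    if "_" in name or _NEEDS_QUOTES.search(name):
        return "'" + name.replace("'", "''") + "'"
    return name


# An unquoted name with underscores comes back with blanks instead.
print(read_label("RKI-ZBS2-CH129_TACAGC_L002"))
# Quoting on write makes the round trip lossless.
print(read_label(write_label("RKI-ZBS2-CH129_TACAGC_L002")))
```

A reader or writer that applies this convention on only one side of the round trip will turn underscores into spaces, which is why options that preserve underscores exist in tree libraries.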

johnlees · 2021-03-10T12:53:30Z

Ah do we not use that? Yes, we should add the flag. This will be addressed in #148 then
(I think this is all from 'old school' phylogenetic formats being more inflexible with names and their lengths)

flass · 2021-03-10T12:58:46Z

@nickjcroucher see the list of names below:

GCA_000387605.1_BJG-01_genomic
RKI-ZBS2-CH129_TACAGC_L002.contigs_spades
GCA_000154005.2_Vibrio_cholerae_623-39_V1_genomic
GCA_006803105.1_ASM680310v1_genomic
220011_4_C4_L003_R1.contigs_velvet
GCA_007623975.1_ASM762397v1_genomic
22776_8#168.contigs_spades
220875_4_C7_L002_R1.contigs_velvet
220871_8_C3_L003_R1.contigs_velvet
SRR7962186.contigs_spades
GCA_003716435.1_ASM371643v1_genomic
GCA_007624145.1_ASM762414v1_genomic
221749_4_C3_L003_R1.contigs_velvet

johnlees · 2021-03-17T16:32:31Z

Fixed in 66ca2d2

nickjcroucher added the bug Something isn't working label Mar 9, 2021

johnlees linked a pull request Mar 10, 2021 that will close this issue

Updates to MST #148

Merged

johnlees removed a link to a pull request Mar 17, 2021

Updates to MST #148

Merged

johnlees linked a pull request Mar 17, 2021 that will close this issue

Add alternate names, method description #161

Merged

4 tasks

johnlees closed this as completed in #161 Mar 22, 2021
