Issue with a large Pseudomonas dataset #299

RyanCFink · 2024-01-16T19:15:14Z

About the dataset

We have two datasets, one with 8046 Pseudomonas genomes, and
another, a subset of the first, with 6347 Pseudomonas aeruginosa genomes.
For the issues described here we will focus on the latter, but the issues are common to both.

About the issue

As part of our workflow, we create subsets based on the core and accessory distances, as infered by PopPUNK.
In this dataset we found that many of the distances for the genomes are 0 for both the accessory and the core genomes.

Here are the command we used to extract the distance table:

poppunk --create-db

poppunk \
    --create-db \
    --output poppunk_db \
    --r-files poppunk_samplesheet.tsv  \
    --threads 24

poppunk_extract_distances.py

poppunk_extract_distances.py \
    --distances poppunk_db/poppunk_db.dists \
    --output poppunk_db_distances.tsv

We also fit to the dbscan model:

poppunk \
    --fit-model dbscan \
    --ref-db poppunk_db \
    --output poppunk_db \
    --threads 24 \
    --k-step 3 --min-k 17 --max-k 41 --plot-fit 5

Here is the resulting figure:

We then tried to subset the genomes based on their "similarity" (1-distance). Here are the genomes retained for a few different thresholds:

You can see the number of retained genomes is very small compared to the input size (6347), this is something we've only encountered in this dataset.

Other datasets of similar size have yielded more sensible subsets, e.g. an 11373 genome dataset resulted in these subsets (with a few more thresholds displayed to show how it scales):

Some things we've tried

Running the PopPUNK QC prior to fitting the model didn't work, most type isolates we chose got removed during the QC and the QC failed:

  Graph-tools OpenMP parallelisation enabled: with 1 threads
  Running sequence QC
  Running QC on sketches
  93 samples failed
  Running distance QC
  Selected type isolate for distance QC is SAMD00012124
  5282 samples failed
  Traceback (most recent call last):
    File "/usr/local/bin/poppunk", line 11, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/PopPUNK/__main__.py", line 413, in main
      raise RuntimeError('Type isolate ' + qc_dict['type_isolate'] + \
  RuntimeError: Type isolate SAMD00012124 not found in isolates after QC; check name of type isolate and QC options

Adding some of our genomes to the BacPop Pseudomonas database using poppunk_assign. This seemingly didn't change anything on the original PopPUNK db.

Before adding to the database:

After:

Running an even smaller subset with only 100 random Pseudomonas aeruginosa genomes. This led to similar issues.

The text was updated successfully, but these errors were encountered:

johnlees · 2024-01-17T10:28:44Z

Regarding the QC options, could you try with:

--max-pi-dist 1 --max-a-dist 1 --max-zero-dist 20 --length-sigma 5 --prop-n 0.1

It would be useful to know from the output (which I think you should still get) the reasons isolates are being removed

Regarding query assignment: this does not update the distance plot but you should still get clusters assignments out. Are you having problems/errors in this output? A flag that can be helpful in these cases is --serial which means that assignments are made one by one, so if there are problematic input sequences they only cause issues in their own assignments. See also: https://poppunk.bacpop.org/troubleshooting.html#most-all-of-my-samples-merge-when-i-run-a-query

RyanCFink · 2024-01-25T17:44:05Z

We tried the extra flags you provided when running the QC, this time it ran to completion,
but the underlying issue remains the same:

PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
        (with backend: sketchlib v2.0.1
         sketchlib: /usr/local/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 1 threads
Running sequence QC
Running QC on sketches
93 samples failed
Running distance QC
0 samples failed
Removing 93 sequences
Recalculating random matches with strand_preserved = False
Calculating random match chances using Monte Carlo

Done

We were wondering, since you used a similar Pseudomonas dataset in the original PopPUNK paper and in bacpop, if you could share:

If there were any options different from the default when creating your pseudomonas db, and if so, if you could share
the commands that were used.
If you performed any cleanup on the genomes prior to running PopPUNK (e.g. removing plasmids)

johnlees · 2024-01-26T09:28:59Z

@samhorsfield96 I think you created the pseudomonas DB? Do you remember/know if there was anything special you did?

We wouldn't have changed the sequence files (e.g. removing plasmids), but we would have removed isolates. It still looks to me like you have a few isolates that need to be removed to get this to work. The other thing that would be useful is to find those which are giving zero core distance. You can do that with a command like:

sketchlib query dist poppunk_db | awk '$3 == 0 && $4 > 0.1 {print $1}' | sort | uniq -c | sort -k2,2 -n -r

Which should give you the isolates with the most 'bad' distances from most to least

samhorsfield96 · 2024-01-26T14:05:28Z

Hi both,

The database on the bacpop website was generated using:

poppunk --create-db --output db --r-files input.txt --qc-filter prune --length-sigma 1 --plot-fit 3 --min-k 17 --max-k 41 --sketch-size 100000

From memory I had to increase the sketch size and k-mer ranges to increase resolution at smaller core distances. I also reduced the genome size standard deviation boundary to remove likely contaminated genomes.

Let me know if you have any further questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with a large Pseudomonas dataset #299

Issue with a large Pseudomonas dataset #299

RyanCFink commented Jan 16, 2024

johnlees commented Jan 17, 2024 •

edited

Loading

RyanCFink commented Jan 25, 2024

johnlees commented Jan 26, 2024

samhorsfield96 commented Jan 26, 2024

Issue with a large Pseudomonas dataset #299

Issue with a large Pseudomonas dataset #299

Comments

RyanCFink commented Jan 16, 2024

About the dataset

About the issue

Some things we've tried

johnlees commented Jan 17, 2024 • edited Loading

RyanCFink commented Jan 25, 2024

johnlees commented Jan 26, 2024

samhorsfield96 commented Jan 26, 2024

johnlees commented Jan 17, 2024 •

edited

Loading