Accessory distance issue #135

Closed
sbberes opened this issue Dec 17, 2020 · 22 comments · Fixed by #160

Comments

@sbberes

sbberes commented Dec 17, 2020

I am running PopPUNK on a large set of sequences that essentially lack any accessory gene content. I can complete the first step of generating the DB, but when I try to run an initial model fit I get numerous warnings of the form "Accessory outlier at a= ...", followed by the final message "Distances failed quality control (change QC options to run anyway)". Can you suggest which QC options should be changed? I have tried adding --ignore-length and --core-only, with the same end result. Thanks for the software and any assistance.
Best regards,
Stephen Beres

@nickjcroucher
Collaborator

Hi Stephen - what version of the software are you using? The most recent versions have a --qc-filter continue flag to prevent runs halting at the database creation stage; the --max-a-dist 1.0 will prevent such filtering at the distance estimation stage. Hopefully those will help!
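
As a rough illustration of why a cutoff of 1.0 switches this check off: assuming accessory distances are bounded above by 1, no pair can exceed a maximum of 1.0. The sketch below is not PopPUNK's code. `accessory_outliers` is a hypothetical helper, the distances are made up, and the strict `>` comparison is an assumption.

```python
def accessory_outliers(accessory_dists, max_a_dist):
    """Return the distances that exceed max_a_dist (hypothetical QC check)."""
    return [a for a in accessory_dists if a > max_a_dist]


# Made-up accessory distances, for illustration only.
dists = [0.02, 0.35, 0.61, 0.995, 1.0]

flagged_strict = accessory_outliers(dists, 0.5)   # arbitrary stricter cutoff
flagged_relaxed = accessory_outliers(dists, 1.0)  # nothing can exceed 1.0

print(flagged_strict)
print(flagged_relaxed)
```

With the relaxed cutoff nothing is flagged, so it is still worth inspecting the distance plots by eye before trusting the fit.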

@johnlees changed the title from "John and Nicholas" to "Accessory distance issue" on Dec 17, 2020
@sbberes
Author

sbberes commented Dec 17, 2020 via email

@nickjcroucher
Collaborator

Congratulations on getting the vaccine so quickly! The conda version should be 2.2.0 at the moment, though we are hopeful of upgrading this in the next couple of days.

@johnlees
Member

If you cannot get 2.2.0, please make sure all your input is alphabetically sorted. See:
https://poppunk.readthedocs.io/en/latest/troubleshooting.html#when-i-look-at-my-clusters-on-a-tree-they-make-no-sense
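
One way to sort an input list before building the database is sketched below. It assumes the list is a tab-separated text file with the sample name in the first column, and that sorting by that name is what is needed. `sort_input_lines` is a hypothetical helper, not part of PopPUNK, and the sample names and paths are invented.

```python
def sort_input_lines(lines):
    """Sort non-empty lines alphabetically by their first tab-separated field."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    return sorted(rows, key=lambda row: row.split("\t")[0])


# Invented example rows: sample name, then path to the assembly.
example = [
    "sample_C\t/data/sample_C.fa\n",
    "sample_A\t/data/sample_A.fa\n",
    "sample_B\t/data/sample_B.fa\n",
]

sorted_rows = sort_input_lines(example)
for row in sorted_rows:
    print(row)
```

Python compares strings by code point, so upper-case names sort before lower-case ones; check that this matches the ordering the tool expects.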

Ideally you should upgrade to the most recent version, but I appreciate conda can be a pain to get right. My advice would be:

  • Make a totally clean conda install
  • Never install anything in the base environment
  • Make a new environment with conda create -n poppunk_env "poppunk>=2.2.0" (quote the version spec so the shell does not treat > as a redirect) and then use it with conda activate poppunk_env

@johnlees
Member

Also make sure the channel order is correct (conda-forge -> bioconda -> defaults) and see the advice here if there are problems: https://conda-forge.org/docs/user/tipsandtricks.html#using-multiple-channels

@sbberes
Author

sbberes commented Dec 17, 2020 via email

@sbberes
Author

sbberes commented Dec 22, 2020 via email

@johnlees
Member

Hi Stephen,
A quick look suggests that --max-a-dist 1 may solve your first issue, though be sure to take a look at your distanceDistribution.png and other output plots as well as the command line output.

The second issue may be a bug at our end; it was previously reported in #106 too.
I will have to look at this in more detail after we reopen in January, and will update you once I've had more of a chance to check.

@johnlees
Member

johnlees commented Jan 5, 2021

I have just released a new version of PopPUNK. Could you try updating by running conda install poppunk==2.3.0 pp-sketchlib==1.6.0 in your conda environment?

The --threshold option works in my tests with PopPUNK 2.3.0 and pp-sketchlib 1.6.0. Its format has changed slightly, so in your case the command would be:

poppunk --fit-model threshold --threshold 0.05 --distances 27065x442708/27065x442708.dists --output 27065x442708 --full-db --ref-db 27065x442708 --threads 54

(see https://poppunk.readthedocs.io/en/latest/model_fitting.html#threshold)
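
Conceptually, a threshold fit treats any pair of genomes whose distance falls below the cutoff as belonging to the same cluster. The following is a minimal sketch of that idea, not PopPUNK's implementation: `threshold_clusters` and the toy distances are hypothetical, and both the use of core distances and the strict `<` comparison are assumptions.

```python
def threshold_clusters(n_samples, core_dists, threshold):
    """Group samples linked by any chain of pairs with distance < threshold."""
    parent = list(range(n_samples))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), dist in core_dists.items():
        if dist < threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(n_samples):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy core distances between four samples (made-up values).
core = {
    (0, 1): 0.01, (1, 2): 0.03, (0, 2): 0.06,
    (2, 3): 0.20, (0, 3): 0.22, (1, 3): 0.21,
}
clusters = threshold_clusters(4, core, 0.05)
print(clusters)
```

In this toy case samples 0 and 2 end up together even though their own distance is above the cutoff, because both are linked to sample 1.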

If that still doesn't work, perhaps you could share your .h5 file with me so I can see if I can replicate your issue?

@sbberes
Author

sbberes commented Jan 5, 2021 via email

@flass

flass commented Mar 8, 2021

Hi there,
I just wanted to add to this discussion: I had to resort to using --max-a-dist 1 on my own data because some strains share very little of their genome. However, this led to a warning that a division by zero had been attempted, which I assume happened because some strain pairs did indeed have a distance equal to zero:

/lustre/scratch118/infgen/team216/fl4/miniconda3/envs/poppunk230/lib/python3.8/site-packages/PopPUNK/sketchlib.py:732: RuntimeWarning: divide by zero encountered in log
  args=(klist, np.log(pairwise)),
Fitting k-mer curve failed: Residuals are not finite in the initial point.
With mash input [0.0079,0.0007,0.0005,0.0003,0.0002,0.    ]
Check for low quality input genomes

I found that a good workaround was to use --max-a-dist 0.99, which filtered out quite a few genome pairs in the fitting run, but worked.

Florent
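
The warning itself can be reproduced outside PopPUNK with NumPy alone. The snippet below takes the values from the "With mash input" line of the log and applies `np.log`, as in the traceback; the variable names are illustrative, and the surrounding curve fit is not reproduced.

```python
import warnings

import numpy as np

# Values copied from the "With mash input" line in the log.
pairwise = np.array([0.0079, 0.0007, 0.0005, 0.0003, 0.0002, 0.0])

with warnings.catch_warnings():
    # Silence the "divide by zero encountered in log" RuntimeWarning here.
    warnings.simplefilter("ignore", RuntimeWarning)
    log_pairwise = np.log(pairwise)

finite_mask = np.isfinite(log_pairwise)
print(log_pairwise)
print(finite_mask)
```

The final zero becomes negative infinity, so a fit started from these values has no finite residuals to work with, which is consistent with the "Residuals are not finite in the initial point" message.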

@sbberes
Author

sbberes commented Mar 9, 2021 via email

@sbberes
Author

sbberes commented Mar 9, 2021 via email

@sbberes
Author

sbberes commented Mar 9, 2021 via email

@sbberes
Author

sbberes commented Mar 9, 2021 via email

@johnlees
Member

@flass Was this when using --plot-fit? I don't think this function should be called otherwise, so you could alternatively just omit that argument to avoid the error.

@flass

flass commented Mar 10, 2021

Yes, I was using the --plot-fit option. Good to know for future use. Thanks!

@johnlees
Member

@sbberes To try and answer your points in turn:

  • The RAM use (and time) of HDBSCAN fits on large datasets is indeed a problem, and one I ran into in parallel to you. I've fixed it in the recent code and it should now stay below a few GB in most cases (I was able to fit to 50k genomes using 8 cores in about three hours, and around 30 GB of memory).
  • I believe that fit refinement should also now benefit from similar improvements.

These fixes will be included in PopPUNK v2.4.0, which we hope will be out in the next couple of weeks.

With the scripts:
Yes, my apologies that these have changed. I had missed the warning from sklearn that these modules would be moved in v0.24. I've updated these paths to fix this. They are runnable from poppunk_calculate_rand_indices.py, and I have also updated the manual with this.

These will be in poppunk v2.4.0 also, but to run it now you can just download the updated standalone script from here, which will run from any path as long as you have your conda environment activated.
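
For readers unfamiliar with the quantity in the script's name, the Rand index between two clusterings is the fraction of sample pairs on which they agree (both place the pair together, or both place it apart). The plain-Python sketch below illustrates the definition only; `rand_index` is a hypothetical helper and is not the sklearn-based code in the script.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs treated the same way by both clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Same partition with different label names: perfect agreement.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two partitions that cut across each other.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

This version enumerates every pair, so it is only practical for small inputs; for datasets of the size discussed in this thread a contingency-table approach would be needed.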

@sbberes
Author

sbberes commented Mar 10, 2021 via email

@sbberes
Author

sbberes commented Mar 10, 2021 via email

@johnlees
Member

Looks like another deprecation I've missed! I've got a fix here for you to try:
https://raw.githubusercontent.com/johnlees/PopPUNK/30cd9c15b2503e090a0a2bf02617e724584dc576/scripts/poppunk_calculate_rand_indices.py

I don't have files to hand to test it myself so I hope that it just works, but let me know if there is still an error and I'll get set up to test it properly

@sbberes
Author

sbberes commented Mar 10, 2021 via email


4 participants