All numeric input file base names prevents labeling #178
Comments
Hi @louellette, Thanks for your interest in I think I see what the problem might be. It looks like IDs are being used that have the format of floating-point numbers, and I'm not catching that somewhere. That might be in the FASTA input, or the Would it please be possible for you to attach a minimal failing example (only needs a couple of genomes, and the L.
Thanks, I emailed the files but it looks like they did not go through due to the size of the fasta files. I'm attaching the labels.txt file here. I suppose you could just rename any old fasta file to the names in my labels.txt to test. Thank you, Lisa
Thanks, Lisa. I think you're correct about naming the FASTA files. I'll get onto that. L.
Hi Lisa, I've been investigating this, and I have a few comments. The problem is not with the The analysis results are represented in a Simply remapping the index to I'm coming to the end of my train journey, so will have to pause for a while. The root of the problem is L.
This was a knotty problem. The user had input files whose names were all interpretable as floating-point numbers. When the output CSV files were loaded, the file names were interpreted as the float64 datatype, but only in the dataframe index, not in the headers. This meant that the label/class files were not being used, and the sequence names in the graphical output had extra digits where the floats did not have an exact representation. The solution was to read the CSV file without the `index_col=0` argument, while specifying that column zero should be the str datatype. Once loaded, this column could be set as the new index for the dataframe. This fix forms the basis for release 0.2.10.
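The behaviour described above can be reproduced with a small made-up table. This is only a sketch and not the code from the release: the genome IDs, the `TABLE` string, and the helper names `load_naive` and `load_with_str_index` are all hypothetical, and it assumes the first header cell is empty, so that pandas names that column `Unnamed: 0`.

```python
import io

import pandas as pd

# Made-up matrix whose row and column labels are genome IDs that
# happen to look like floating-point numbers (hypothetical values).
TABLE = ",1234.5,1234.10\n1234.5,1.0,0.9\n1234.10,0.9,1.0\n"


def load_naive(handle):
    """Read with index_col=0, letting pandas infer the index dtype."""
    return pd.read_csv(handle, index_col=0)


def load_with_str_index(handle):
    """Read column zero as str, then promote it to the index."""
    # The empty first header cell is named "Unnamed: 0" by pandas.
    df = pd.read_csv(handle, dtype={"Unnamed: 0": str})
    df = df.set_index("Unnamed: 0")
    df.index.name = None
    return df


naive = load_naive(io.StringIO(TABLE))
fixed = load_with_str_index(io.StringIO(TABLE))

print(naive.index.dtype)    # inferred dtype of the row labels
print(list(naive.columns))  # column headers stay as strings
print(list(fixed.index))    # row labels preserved as strings
```

With the naive load, the row label `1234.10` is parsed as the float 1234.1 and no longer matches the string column header `1234.10` or a string key from a labels file. Reading column zero as str keeps the row and column labels identical.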
Thank you. I have implemented the workaround, which simply involved prefixing my filenames with a letter and using those file names in the labels.txt file. It works fine. Just FYI, the file names I was using are PATRIC genome IDs.
Applies changes that fix issue #178 and tidy surrounding code.
Hi Lisa, Release v0.2.10 should have fixed this bug - my test input/output is in the attached files. I hope it works for you. labels.txt I see why you ended up with filenames in that form, now. I'm surprised this hasn't come up before! There should be L.
Hi Lisa, That's v0.2.10 up at: I'll close this issue now, but please do raise any other problems you find, and thank you so much for finding this bug - I don't think I'd ever have noticed it. L.
Summary:
All numeric input file base names prevent labeling
Description:
If all the .fna files in the input directory have numeric base names, then the labels.txt file is not being used. It appears as if the file base names become floating-point. See attached pdf output.
Reproducible Steps:
average_nucleotide_identity.py -i -o pyani --labels labels.txt -g --gmethod mpl --gformat pdf --workers 16 -l pyani.log -m ANIb
Let me know if you need more details.
Current Output:
All looks good on terminal output.
Faecalibacterium_prausnitzii_ANIb_percentage_identity.pdf
Expected Output:
Labeled output.
pyani Version:
pyani version: 0.2.9
installed dependencies
If you are running a version of pyani v0.3 or later, then please run the command pyani listdeps at the command line, and enter the output below.
Python Version:
Python 3.7.4
Operating System:
ubuntu 16.04