Missing labels and captions in plots with default settings #221

Open
peterjc opened this issue Aug 19, 2020 · 8 comments
Labels
documentation: documentation is unclear or incomplete
question: how can I do this? why does it do that? where can I get this? etc.
visualisation: issues relating to plot outputs

@peterjc
Collaborator

peterjc commented Aug 19, 2020

The following workflow produced plots without labels or class colors (edited for brevity):

pyani index pyani_sample_genomes
pyani anim pyani_sample_genomes pyani_sample_genomes_anim
pyani plot --formats png,pdf pyani_sample_genomes_anim 1

The solution is to explicitly set labels and classes when calling pyani anim:

pyani index pyani_sample_genomes
pyani anim pyani_sample_genomes pyani_sample_genomes_anim \
--labels pyani_sample_genomes/labels.txt --classes pyani_sample_genomes/classes.txt
pyani plot --formats png,pdf pyani_sample_genomes_anim 1

Would it make sense to have --labels and --classes default to $DIR/labels.txt and $DIR/classes.txt if present when run on input directory $DIR?

If no labels are given, would it make sense to use the filename stems as the default labels?

(I'm also puzzled why the classes and labels are tied to the run; I expected pyani index $DIR to record them from $DIR/labels.txt and $DIR/classes.txt)
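
To make the suggested defaults concrete, here is a minimal sketch of the lookup being proposed: use $DIR/labels.txt when it exists, and fall back to the filename stem for any genome without a label. This is not pyani code. The function names, the set of genome file extensions, and the three-column tab-separated layout (hash, file stem, label) are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumed set of genome file extensions (hypothetical, for illustration only)
GENOME_SUFFIXES = {".fna", ".fasta", ".fa", ".fas"}


def read_labels(path: Path) -> Dict[str, str]:
    """Map file stem -> label from a tab-separated file.

    Assumes three columns per line: hash, file stem, label.
    Lines with fewer than three fields are skipped.
    """
    labels = {}
    for line in path.read_text().splitlines():
        fields = line.split("\t")
        if len(fields) >= 3:
            labels[fields[1]] = fields[2]
    return labels


def default_labels(genome_dir: Path, labels_file: Optional[Path] = None) -> Dict[str, str]:
    """Return a label for every genome file in genome_dir.

    Uses labels_file if given, otherwise genome_dir/labels.txt if present;
    any genome without an entry is labelled with its filename stem.
    """
    stems = sorted(p.stem for p in genome_dir.iterdir() if p.suffix in GENOME_SUFFIXES)
    candidate = labels_file if labels_file is not None else genome_dir / "labels.txt"
    provided = read_labels(candidate) if candidate.is_file() else {}
    return {stem: provided.get(stem, stem) for stem in stems}
```

With this fallback, a plot would always have some label for each genome, even when no labels file was supplied.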

@widdowquinn
Owner

widdowquinn commented Aug 19, 2020

That's expected behaviour, at the moment… v0.3.0a is in active development and should be considered unfinished.

Default behaviour might eventually turn out to use the hash, or the filestem, for labels (where not provided). My intention is that classes will be ignored if not specified.
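
As a rough illustration of that fallback order (a provided label first, then the hash or the filestem), the sketch below assumes the hash is an MD5 digest of the genome file's contents. Both function names and the use_hash switch are hypothetical and are not part of pyani.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fallback_label(path: Path, provided: Optional[str] = None, use_hash: bool = False) -> str:
    """Pick a label: the provided one if any, else the hash or the filestem."""
    if provided:
        return provided
    return file_md5(path) if use_hash else path.stem
```

The filestem is easier to read on a plot, while the hash stays stable if a file is renamed; which of the two becomes the default is left open above.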

Tying classes and labels to the run allows the user to rerun the same analysis with different labels and classes, e.g. for generating plots. It may be helpful to provide a way to override the database classes/labels specifically for an analysis, but at the moment, this is intended behaviour.

widdowquinn added the question label on Aug 19, 2020
@peterjc
Collaborator Author

peterjc commented Aug 19, 2020

I agree that the hash would be another practical default for labels when not provided.

I don't yet understand why you would tie the class and label metadata to the comparison computation stage. I wouldn't want to recompute the comparisons (even with recovery mode) just to re-plot with different classes (e.g. sample sites, or sample year) or different labels (e.g. sampling method).

Can we supply the classes and labels to the plotting (and report?) commands?

@widdowquinn
Owner

The database stores the results from previous comparisons, so you don't need to recompute them.

Considering a use-case:

  • I have a set of genomes I don't know how to classify
  • I use ANI to generate a likely classification (labels/classes are arbitrary at this point)
  • I plot the results of the analysis, and this tells me what the classes should be (and suggests labels, e.g. new species divisions)

Now I want to plot the results again, but with my new classes/labels. There are two obvious options:

  1. replot, but use a new classes/labels file specific to that plot
  2. "reanalyse" but with a new classes/labels file (this is computationally almost cost-free, as the results are stored)

Both will give the same file output. However, if you only redo the plot step, using a new class/label file, this gives an output that isn't consistent with the database (though you could make notes/log the changed labels/classes).

To have the database be "reproducible" such that plotting/writing tables for a particular run gives the same outputs each time with only the database as input, we'd need to capture the labels/classes files used for the plot, and remember that it's a combination of ANI run and plot command (and we could have arbitrarily many plots for a single run) that defines the output.

One goal is to have an interface onto the local database (Flask, or whatever is useful at the time), so that interactive plots can be produced as well as plots written statically to a file. These will get their information from the database. It makes sense in that context to have a "run" defined as the genomes + corresponding labels/classes. Changing labels/classes (keeping the same genomes) corresponds to another "run", in the same way that removing genomes, but keeping labels/classes, corresponds to another "run". When no new calculations are required, this is a straightforward database update in both cases.
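
The "run" definition can be sketched as a small data model. This is a hypothetical illustration, not pyani's actual database schema: Run, ComparisonStore, make_run and execute are invented names, and the comparison itself is a placeholder. It shows the property described above: a run with new labels is a different run, yet it triggers no new comparisons because results are stored per genome pair.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple

Pairs = Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class Run:
    """A run is the set of genomes plus their labels and classes."""
    genomes: FrozenSet[str]  # genome identifiers (e.g. hashes)
    labels: Pairs            # sorted (genome, label) pairs
    classes: Pairs           # sorted (genome, class) pairs


def make_run(genomes: Iterable[str], labels: Dict[str, str], classes: Dict[str, str]) -> Run:
    return Run(frozenset(genomes), tuple(sorted(labels.items())), tuple(sorted(classes.items())))


class ComparisonStore:
    """Caches pairwise results independently of any run's metadata."""

    def __init__(self) -> None:
        self._results: Dict[FrozenSet[str], float] = {}
        self.computed = 0  # number of comparisons actually calculated

    def get(self, a: str, b: str) -> float:
        key = frozenset((a, b))
        if key not in self._results:
            self._results[key] = self._compare(a, b)
            self.computed += 1
        return self._results[key]

    def _compare(self, a: str, b: str) -> float:
        return 0.0  # placeholder for the expensive ANI calculation


def execute(run: Run, store: ComparisonStore) -> None:
    """Ensure every genome pair in the run has a stored result."""
    for a, b in combinations(sorted(run.genomes), 2):
        store.get(a, b)
```

Executing a second run over the same genomes with different labels leaves store.computed unchanged, which is the "almost cost-free reanalysis" case.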

Now, I do see the utility of providing a classes/labels file at the pyani plot step, but it breaks that definition of "run" being "genomes + their labels/classes" that I want to keep for the more advanced interaction with the database. For quick and dirty outputs I see the attraction of having --classes/--labels options in pyani plot. Maybe that's worth implementing - but I'm quite keen on enforcing that "run" definition.
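
For comparison, the "quick and dirty" alternative would amount to something like the sketch below: labels supplied at plot time are layered over the labels stored with the run, without touching the stored values. The function labels_for_plot is hypothetical; no such option exists in pyani plot at the time of this discussion.

```python
from typing import Dict, Optional


def labels_for_plot(run_labels: Dict[str, str],
                    overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the labels to draw, applying plot-time overrides to a copy.

    The stored run labels are not modified, so the output is no longer
    determined by the database alone.
    """
    merged = dict(run_labels)
    for genome, label in (overrides or {}).items():
        if genome not in merged:
            raise KeyError(f"override for genome not in run: {genome}")
        merged[genome] = label
    return merged
```

Because the overrides live outside the database, reproducing such a plot would require keeping the override file alongside the run identifier.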

@peterjc
Collaborator Author

peterjc commented Aug 19, 2020

That did clarify your design goals, thank you.

The "quick and dirty" approach of --classes / --labels options in pyani plot is attractive, especially while "reanalyse" remains somewhat slow (even in --recovery mode).

@widdowquinn
Owner

I should really write this stuff down somewhere ;)

widdowquinn added the documentation label on Apr 18, 2021
widdowquinn added this to the 0.3.0 milestone on Apr 18, 2021
@widdowquinn
Owner

This is another thing that should go into the documentation: the design goals and motivation for the database integration, and how that affects the way we need to provide metadata for visualisation.

baileythegreen added the visualisation label on Jul 22, 2021
@baileythegreen
Contributor

Which part of the documentation? Design goals and motivation sounds like wiki material; there is already a bit of text in indexing.rst that seems related to the use of class and label files discussed here.

@widdowquinn
Owner

> Which part of the documentation? Design goals and motivation sounds like wiki material;

It does.

> there is already a bit of text in indexing.rst that seems related to the use of class and label files discussed here.

Yes, there is. As ever there may be a judgement call involved to decide what is appropriately user-facing (so goes in ReadTheDocs) and what is "motivation/design detail" (so goes in the Wiki) - and some items may be represented, with different levels of detail perhaps, in both places.
