Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entropy panel mk2 #1684

Merged
merged 35 commits into from
Aug 20, 2023
Merged

Entropy panel mk2 #1684

merged 35 commits into from
Aug 20, 2023

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Aug 4, 2023

For the full story please read through the commit messages. I'll try to highlight the main UI changes here, but perhaps it's better to just jump in and use the review app with fresh eyes.

Main UI changes

Overlapping CDSs are better shown

Click for details

By using opacity and jittering them based on the frame the CDS is in. The aim is not to have a perfect genome viewer (e.g. see how much space a "proper" genome browser needs!) but to expose what's there and allow you to hover over them / click them (clicking a CDS selects it, see next section).

image

nCoV, current Auspice on top, this PR on bottom; much better communication of the CDSs present (they were always there in the JSON, just visually obscured).

Hovering over the CDSs is now possible too:

image

image

Selecting a CDS focuses the entropy panel on the translated CDS

Click for details

When viewing a CDS we now have the main axis represent this and only this CDS, always shown going L→R, with tick labels representing amino acids. This better represents biology as we are able to ignore any artifacts upstream of a translation. The UI to view a CDS is to either click on a CDS or select a genotype color-by for that CDS.

This screenshot (top, bottom) shows the pre-capsid CDS for HepB. In the top view other CDSs are overlapping so most of the bars you see are from a different CDS (pre-capsid CDS is the orange coloured CDS, and then the green CDS is overlapping), and it was not possible to see the actual diversity of pre-capsid¹.

image

¹ I guess in this example the CDSs are in frame, so the diversity is identical, but barcharts from different CDSs obscuring each other has always been a problem in auspice

Segmented CDSs (nCoV example)

Click for details

Top view is our current genome annotation for nCoV, bottom is an example of what it could be (incomplete, as I've only added in 2 of the 16 proteins cleaved from the polyproteins)

image

Specifically, we can now view ORF1ab (PP1ab) in its entirety:

image

(The main axis now has two segments, which is what we have been referring to as ORF1a and ORF1b, separated by a -1 ribosomal frameshift.)

And since overlapping CDSs are better shown we can (finally!) show all the proteins which are produced, e.g. RdRp (a complex one, since it spans the -1 frameshift). The entropy panel conveys the expected lack of diversity here well I think.

image

CDSs which wrap around the origin (HepB example)

Click for details

Most of HepB's CDSs wrap the origin! Displaying the translation (i.e. the top half of the panel) is no different from any segmented gene, it's the zooming that's tricky here. My solution was to ~invert the brush; perhaps easier to explore than to explain!

image

Segmented CDSs may have codons which span multiple segments (referred to as the phase of the segment), and we handle this well now:

image

Entropy panel state is decoupled from the color-by

Click for details

Previously the entropy panel was tightly coupled to the color-by. It is still linked to it, but it can also operate independently if (and only if) you start interacting with it. Any change to the (genotype) color-by will bring it back in sync. For instance, I can now select Spike 452 color-by (which will always jump to that position in the entropy panel + select Spike), and then use the UI to select another CDS in the entropy panel without changing the colorBy. If (via the sidebar) I change to a different Spike position then the entropy panel will jump back to show Spike.

Proposed JSON schema

The current proposal is mostly backwards compatible, except segmented CDSs are ignored by previous Auspice versions. For instance here's how ORF1ab looks:

  "ORF1ab": {
    "gene": "Multiple CDSs with the same gene will be grouped together",
    "strand": "+",
    "segments":[
      {"start": 266, "end": 13468},
      {"start": 13468, "end": 21555}
    ],
    "display_name": "Text here is shown when hovering on the CDSs"
  },

And you can specify a CDS which wraps around the origin by using an end value beyond the end value of the genome (or create the segments yourself, either way works):

"nuc": {
  "start": 1, "end": 3182, "strand": "+"
},
"envL": {
  "start": 2850, "end": 4019, "strand": "+", "gene": "S"
},

Test datasets

Available via the review app:

  • Moneypox (no changes from staging site version) - useful for a genome with lots of CDSs
  • Zika (no changes from staging site version) - a very simple genome
  • HepB - a short genome but complex: overlapping CDSs, many CDSs which wrap the origin
  • ncov - A modified nCoV dataset which removes our ORF1a / 1b artifact and shows the ORF1ab polyprotein (2 segments, one with a -1 frameshift), the PP1a polyprotein (aka ORF1a) and 2 of the 16 proteins cleaved out of those polyproteins. Created using a small script to modify a nextclade nCoV tree so it shouldn't be taken as 100% perfect!

Things still to do (either here or in subsequent work):

styling

  • The y-jitter (currently using the CDS's frame) still results in hard-to-click / hard-to-see CDSs. I'm not sure of the ideal solution here, but we could probably allow a little more height in overall entropy panel (perhaps calculated at load time depending on how complicated the genome map is). Please come up with good suggestions!!!
  • I must better communicate which CDS is selected - currently I change the text/stroke colour, but it's not clear enough.
  • CDS text overlapping
  • "Simple" genomes (e.g. zika) have too much whitespace because all CDSs are in a single frame
  • The zoom elements are a little bit too small - originally the grey brush element extended above the annotations (but behind them), but this obscured the fact that you could pan the zoom by dragging the grey part. Arguably it's so small now it's still obscured.

Negative strand CDSs

  • All that's required is writing the code to handle the projection from genome space to protein space. And a whole lot of testing.

misc

  • I've done a lot here to remove duplicate / conflicting redux state but there is one which still exists: state.metadata.genomeAnnotations (used by the tree info boxes to order the genes).
  • Sometimes the entropy zoom URL params aren't quite right (e.g. missing when they should be there or vice versa)
  • Button to expand zoom (where possible)
  • Remove the wireframes / console.log statements
  • Help popup (similar to the ones we have in the sidebar)
  • Failing tests
  • Changelog
  • Add documentation / tutorial? (I won't let this hold up the PR, but it is important)

future work?

  • For nucleotide positions we used to check if it fell inside a gene, which allowed on-hover info boxes to show information such as "Nucleotide x (Codon y in protein z)". This never considered the case where a nucleotide fell in multiple genes (CDSs), so it was somewhat misleading. It was a generally useful idea, so we should bring it back in an improved form. It's plausible that we could find the AA positions for all CDSs which overlapped that nucleotide.
  • Main axis uses ticks which go above and below the axis, with an additional label to communicate the corresponding nt position in the genome (information which exists in the on-hover box)
  • Mousewheel zooming has been disabled. I should bring it back, but it needs a proper debounce.

Issues closed

Closes #1596
Closes #1651
Closes #1145
Closes #299
Closes #1516

Remove unused prop, fixup typos
Essentially an empty commit so that we can more easily track changes
in subsequent commits.
@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 4, 2023 03:55 Inactive
@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 4, 2023 04:03 Inactive
@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 7, 2023 05:27 Inactive
@rneher
Copy link
Member

rneher commented Aug 8, 2023

this is very cool! Looking forward to finally having this in auspice!

A few points, some of which you touch upon above.

  • When using the MPXV build, the zoom elements for the CDS get very small. We might need some way to zoom into the genome first. Or maybe put the zoom elements for the CDS on the CDS and keep the genome zoom below?
  • as you say, negative strand CDS are not properly accounted for. The majority of MPXV genes in the first third of the genome are on the negative strand. Auspice used to know about that.
  • we could add localAnnotation to the CDS along side the ranges to specify things that should only be shown in the CDS view, like RBD.

@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 8, 2023 08:42 Inactive
@jameshadfield
Copy link
Member Author

jameshadfield commented Aug 8, 2023

Thanks @rneher

-ve strand CDSs are now working, except that zooming within a CDS is backwards (will fix tomorrow). I've abandoned the use of frame for the y-offsets and instead just stack them as efficiently as possible to avoid any CDSs overlapping (in the bottom view, anyways, the top view used too much space that way so they overlap a little bit).

image

we could add localAnnotation to the CDS along side the ranges to specify things that should only be shown in the CDS view, like RBD.

Yeah! I played around with this yesterday, and I think it'll be very helpful. Here's how it looked for Spike (clicking on the local annotation toggled the highlighting). I'm going to leave it out of this PR, or at least get this PR finished before adding it in. I'd have confidence that we should add it to the JSON spec now tho.

image

@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 9, 2023 04:56 Inactive
@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 9, 2023 22:06 Inactive
@nextstrain-bot nextstrain-bot temporarily deployed to auspice-entropy-updates-rgiu2h August 10, 2023 03:00 Inactive
Currently the JSON annotation block is overly simple, and there are
many features we want to parse & display in Auspice. This will
inevitably involve an update to the JSON schema, but here we introduce
a new hierarchical structure `genomeMap` to represent
Genome → Chromosome[] → Gene[] → CDS[] → CDS_Segment[]

Parsing multiple CDS segments from a single CDS in the current JSON
schema is only possible for CDSs which wrap around the origin.

We provide a default color per gene, but allow per-CDS overrides from
the JSON.

Future commits will remove the `entropy.annotations` and `entropy.geneMap`
as the intention is that `entropy.genomeMap` is the canonical source of
this information.

Note - previously auspice treated CDSs & genes as the same thing.
They are now distinguished from each other, but this change will
take time to propagate through the codebase.

Note - Typescript errors in existing code are addressed largely using "any"
types, as the plan is to remove this code as part of this PR.
This is in preparation for upcoming changes to how annotations are
drawn.

The offsets (position calculations) remain the same, but now use named
constants to better convey the intention. Different elements are now
separated out so that we may move them independently -- e.g. the nav
annotations, brush and axis are no longer tied together by the same
offset.

There still remain some hardcoded pixel values, especially in the
brush code, which we will improve when we update that later on.

Wireframes are added to the brush elements to help with development.
They will be removed in an upcoming commit (on this branch), as the
zoom / brush will be significantly changed.
Previously we only had one tooltip (active while hovering on bars),
generated by `<InfoPanel>`. Here we abstract out the logic for
tooltip positioning & display and then delegate the contents of the
tooltip to the triggering event. This is in preparation for future
commits which will add tooltips for other events in the entropy panel.
Includes offsetting CDS segments by frame and on-hover tooltips. Negative strand CDSs are not yet considered. The styles aren't quite right but I'll tackle that once all the functionality is in place.
This also includes a (overdue) update to the offset calculations.
The rendered CDSs are similar to as before, but now with a frame offset
similar to that implemented for the navBar annotations
Previously the entropy panel could toggle between "aa" and "nuc", and
we stored this in `state.controls.mutType`. The aim of this PR is to
be able to select a CDS (or select "nucleotide view") within the entropy
panel rather than a aa-nuc toggle. We could simply add another bit of
state to represent this, but here I've chosen to remove mutType and
introduce `state.entropy.selectedCds`. Note that `mutType` still appears
in some functions, but it is derived from `selectedCds` rather than taken
from redux state.

This involved a lot of changes to the codebase, but it is significantly
improved because of it. The Entropy panel now no subscribes to redux
colorBy state, which addresses a series of bugs related to state-change-
ordering (see removed comment from `./src/actions/colors.js`).

There should be minimal changes to the UI, but there are a few:

* The AA-NT toggle has gone from the entropy panel. AA entropy can still
be shown by using an appropriate genotype colorBy (via UI or URL).

* When switching to nuc view in the entropy panel from an aa-genotype-
colorBy, or vice vera,  we previously would change the colorBy to the
default (often "country"). This is no longer the case, as I don't think
it's that user-friendly. But it's easy to add back if people like it!

* The way the entropy panel automatically zooms upon genotype changes
isn't working quite as before. This will be a big focus of a future commit,
and the current implementation is rather ad-hoc, so I'm not paying it
too much attention here.
I plan to refactor the entropy/counts calculation code in the next commit so it's helpful to have some sanity tests (as well as inspecting the UI).

It turns out no code uses the "x" property for aa-entropy/count data, so we remove it here.
genomeMap better describes the annotation structure we will be using
going forward in Auspice and having one source of truth (in redux
state) is important in avoiding bugs etc. There are other pieces of
state which also contain similar information which i'll replace with
genomeMap in forthcoming commits.

There are two functional changes:

- For nucleotide positions we used to check if it fell inside a gene, which
allowed on-hover info boxes to show information such as "Nucleotide x
(Codon y in protein z)". This never considered the case where a nucleotide
fell in multiple genes (CDSs), so it was somewhat misleading.
It was a generally useful idea, so we should bring it back in an
improved form.

- CDSs which have multiple segments now have their bars (mostly)
correctly displayed -- e.g. if a CDS wraps the origin then bars will
appear at the start & end of the genome. This will be fully addressed
when a subsequent commit moves the UI to representing the CDS by itself,
rather than as segments on the genome.

Aside: the terminology used around auspice re: "gene" vs "CDS" is now
rather inconsistent. I'll hopefully tidy this up in a forthcoming
commit.
The redux state for entropy panel zoom bounds was confusing and contained
a number of little bugs.

The requested bounds were stored as "zoomMin" and "zoomMax" in both
controls and entropy state. The controls state is only ever updated by a
URL query (i.e. dataset load / narrative page change), it doesn't
respond to actions which update the zoom bounds ('CHANGE_ZOOM'),
although these actions do change the URL query; I've not changed this
here (other than to add a comment), however a future commit will revamp
the entropy zooming code & can attend to this. The entropy state, while
remaining up-to-date, is never subscribed to, so we remove it.
Similarly, `entropy.zoomCoordinates` was never used.

The checking of these bounds on page load / 'CHANGE_ZOOM' action
previously used `controls.absoluteZoomMin` and
`controls.absoluteZoomMax`, but we now use the canonical `genomeMap`
state. These can therefore be removed, as can `entropy.lengthSequence`
which was used to create the controls state!

The checking of zoom bounds is much improved on page load, and previous
oddities where the URL query wouldn't update appropriately are no longer
(e.g. you can no longer have `?gmax=x` if x is beyond the genome
length).
We now make a guarantee at dataset-load time whether genotype colorings
are available: if mutations exist on the tree and the genomeMap exists
(which implies the `metadata.genome_annotations` exists and at least
defines the nuc coordinates) then we will always be able to have a
genotype coloring. This knowledge will allow upcoming work as we can now
be sure genotype colorings can only exist if they are supported by the
data.

Observations and bug-fixes from this commit:

- It's not possible to get a colorBy for a CDS (or 'nuc') which is not
in the genomeMap, or to specify a valid CDS but invalid position. If
such a CDS is specified in the URL query, we fallback to a default
colorBy at load time.

- It is no longer possible to have a partially-valid color-by.
Specifically, it was possible to specify multiple (genotype) positions,
and as long as one was valid then the color-by would be considered "ok".
This is now caught and the color-by changed to only those positions
which are valid, and a console.error printed.

- It's possible to have a CDS defined in the genomeMap which is never
observed on the tree. This will work as expected -- the tree coloring
will always be grey or a single color, depending on if the root-sequence
JSON is available. No entropy/counts bar will be created, so it will
appear to be zero.

- It's possible to have a CDS defined as a branch mutation which is not
present in the genomeMap, but this cannot be accessed as a coloring. It
will be shown in the branch hover/click info-box.
This is unused (and I believe it always was, as `geneMap` was essentially
the same data structure).
The information for gene (CDS) lengths is better found within the
`genomeMap`, as that covers more complex genome annotations than the
`geneLength` did. Furthermore, the `geneLength` calculations included
an off-by-one bug¹ which didn't allow us to color-by the final aa in the
CDS².

The validity of genotype positions is no longer checked within color-
scale creation, due to the genomeMap not being available to that
function and a lot of prop-drilling necessary to make it so. However a
previous commit guaranteed that genotype color-bys must be valid prior
to this point, so it's ok to skip this check.

¹ In the JSON structure, both start and end are 1-based (following GFF)
so (end-start) will undercount the nuc length by 1, which resulted
in aa lengths of `<int>.6667`.

² This maps to the stop codon, so it's arguable whether this was a
feature or a bug, but there can be mutations here, and when there are
Auspice should be able to display them.
Slightly complicated by the observation that ribosomal frameshifts (e.g.
in SARS-CoV-2's ORF1ab) mean that a single nucleotide can appear twice
within a CDS.
A python script to post-process a nCoV dataset to create annotations
which more closely match the biology. There may be some off-by-one
errors there, and it's incomplete, but it's very useful for testing. Of
course we would use Nextclade to actually generate the translations

P.S. The get-data script is run for Heroku review-apps
Allows CDSs to have both a display name and a description. The general
style and layout of the info-boxes is also improved.
GFF allows for strands '?' (features whose strandedness is relevant, but
unknown) and '.' (features that are not stranded), which are represented
by augur as '?' and null, respectively. (null comes from `None` in
python.) In both cases it's not a good idea to make an assumption of
strandedness, or to assume it's even a CDS.
The appearance now follows the date slider (sidebar), and the on-hover
icon correctly indicates whether bidirectional or unidirectional
dragging is possible. Changes requested in PR review
<#1684 (comment)>
It's not necessary (or a good idea) to keep a copy of the annotations in
redux state as opposed to using the canonical entropy.genomeMap. This is
the final piece of ~duplicated redux state identified as part of
<#1684>
jameshadfield added a commit that referenced this pull request Aug 17, 2023
A number of small improvements, many of which were suggested in review.

Added a "help" icon, following the pattern we use extensively in the
sidebar.

Updated the zoom triangle appearance and their cursor icons. Items 1 and
4 in
<#1684 (comment)>

Changed the "return to nuc" button to a "RESET LAYOUT" button which will
always return you to a fully zoomed out view of the genome. See added
comment for why this isn't greyed out when it is a no-op. Item 2 in
<#1684 (comment)>

Changed the panel title to indicate if we're looking at the genome or a
CDS, and add the CDS name to the title in the latter case. Also added a
small axis label indicating that the numbering system is AA for selected
CDSs (it's a bit cluttered to show 'Nt #' as well)
<#1684 (comment)>
jameshadfield added a commit that referenced this pull request Aug 17, 2023
A number of small improvements, many of which were suggested in review.

Added a "help" icon, following the pattern we use extensively in the
sidebar.

Updated the zoom triangle appearance and their cursor icons. Items 1 and
4 in
<#1684 (comment)>

Changed the "return to nuc" button to a "RESET LAYOUT" button which will
always return you to a fully zoomed out view of the genome. See added
comment for why this isn't greyed out when it is a no-op. Item 2 in
<#1684 (comment)>

Changed the panel title to indicate if we're looking at the genome or a
CDS, and add the CDS name to the title in the latter case. Also added a
small axis label indicating that the numbering system is AA for selected
CDSs (it's a bit cluttered to show 'Nt #' as well)
<#1684 (comment)>
A number of small improvements, many of which were suggested in review.

Added a "help" icon, following the pattern we use extensively in the
sidebar.

Updated the zoom triangle appearance and their cursor icons. Items 1 and
4 in
<#1684 (comment)>

Changed the "return to nuc" button to a "RESET LAYOUT" button which will
always return you to a fully zoomed out view of the genome. See added
comment for why this isn't greyed out when it is a no-op. Item 2 in
<#1684 (comment)>

Changed the panel title to indicate if we're looking at the genome or a
CDS, and add the CDS name to the title in the latter case. Also added a
small axis label indicating that the numbering system is AA for selected
CDSs (it's a bit cluttered to show 'Nt #' as well)
<#1684 (comment)>
@jameshadfield
Copy link
Member Author

Thanks for all the reviews. I think this PR is at a good point to wrap up and release which I plan to do ~early next week (and we can keep improving this after merge if we notice things).

Would it be possible to add an AA or NT label for the plotted units?

Good idea! I've added a AA pos label when a CDS is selected, but having "NT" felt a bit too cramped. I've also made the panel title indicate whether we're viewing the genome or a CDS, and the little help icon also explains the coordinates.

Could it be a unidirectional mouse cursor when at the end of the rope? ... Giving handle triangles a white center and gray outline could help to indicate that they are draggable.

Yup! Both done.

The two labels for ORF1ab ... are a bit confusing to me. I think these should just be "ORF1a" and "ORF1b"

(Discussed elsewhere) the best way forward is to continue to annotate these as separate CDSs, rather than a single segmented CDS which I demoed here. This allows people to continue to label mutations in an ORF1b coordinate system which has become quite common.

How difficult would it be to draw faint lines from the gray bounding box on the bottom track to the left and right edges of the top track?

Is there a reason for why CDSs are strictly rectangles in the lower track, while they have the helpful arrows on the top track?

Both possible -- I tried the arrows approach on the lower track and it looked quite messy (lots of aliasing), but I'm sure if we played around with it enough it'd work. I haven't tried the faint lines but think it's worth pursuing. The zooming approach will need further work to support large (>200kb) genomes well.

I'd think replacing "↵ nuc" with "RESET ZOOM" (or "RESET LAYOUT") would be more appropriate.

Done. It's a bridge too far to make this greyed out at the present moment, but we can tackle that aspect later on.

jameshadfield added a commit to nextstrain/augur that referenced this pull request Aug 17, 2023
See <nextstrain/auspice#1684> for the context
for these additions.
jameshadfield added a commit to nextstrain/zika-tutorial that referenced this pull request Aug 18, 2023
The precursor protein is cleaved into ~10 proteins including the
precusor membrane protein (prM). prM is subsequently cleaved by the host
protease furin resulting in protein M (aka MP).

The previous annotation included CDSs PRO and MP, but not the (parent)
prM protein. prM is cleaved directly from the polyprotein and is
functional, forming heterodimers with protein E (similar behaviour
across all flaviviruses I believe).

NCBI's genome viewer shows both prM as well as both the sub-proteins
after cleavage, pr and M. pr would correspond to the previous PRO
annotation. I've not seen protein pr referenced anywhere else so I've not
included that here.

Note that this change wasn't desirable before Auspice could display
overlapping CDSs <nextstrain/auspice#1684>.

References:
<https://www.microbiologyresearch.org/content/journal/jgv/10.1099/vir.0.18723-0>
<https://www.ncbi.nlm.nih.gov/gene/7751225>
jameshadfield added a commit to nextstrain/augur that referenced this pull request Aug 20, 2023
See <nextstrain/auspice#1684> for the context
for these additions.
jameshadfield added a commit to nextstrain/augur that referenced this pull request Aug 20, 2023
See <nextstrain/auspice#1684> for the context
for these additions.
@jameshadfield jameshadfield merged commit 84485a1 into master Aug 20, 2023
@jameshadfield jameshadfield deleted the entropy-updates branch August 20, 2023 21:50
jameshadfield added a commit to nextstrain/augur that referenced this pull request Mar 15, 2024
Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs are not
modified in this commit.

Supersedes #1333
jameshadfield added a commit to nextstrain/augur that referenced this pull request Mar 15, 2024
Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333
jameshadfield added a commit to nextstrain/augur that referenced this pull request Mar 16, 2024
Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
5 participants