-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add explicit option to control how densities are resolved, change how densities are resolved by default #9172
Conversation
I got stuck trying to fix the issues with density area marks, so I thought I would summarize what I'm trying to do so that others can help: I believe the recent issues with density marks stem from PR #9018, which made stacking the new default for area marks. Unfortunately stacked area marks also perform an imputation to zero by default as described here #5611 (comment) by @domoritz which leads to issues with stacked density as previously noted in #6624 (this became more obvious after the PR mentioned above since it made stacked areas the default). Thanks @mattijn for pointing out that this imputation was happening in vega/vega#3815. The reason the imputation leads to jagged densities is because grouped densities value are defined only for data values existing within their own group. So when putting two grouped densities together on the same axis, but without exactly the same extent over which the density is defined will lead to a lot of imputation to zero and spiky appearance. In #9106 @jonmmease fixed this by guaranteeing the same extent for all individual grouped densities. However this introduced a new issue, which is that densities are no longer cut off at the min/max values of the data within each group, but extends to the min/max values of the data within all groups leading to the appearance of values where there actually are none as shown in vega/vega#3815. For density area marks to work properly, we need these two things:
My hunch would be that limiting the range of each grouped density in the KDE transform in Vega and rolling back #9018, would be the most straightforward way to fix this, but I'm very open to hearing other suggestions. |
Here is another example of how the current behavior leads to unexpected outcomes. Take this spec that creates two densities which seem to cover the same data range:
If we add another chart to the layer, such as a mean line (but really any chart), suddenly it becomes apparent that the densities don't at all cover the same region and the one to the right now appears cut off for no reason since it looked like the extent was the same in the previous chart:
The reason seems to be that the layering is moving the transforms around, but this is not easy to understand and the inconsistent behavior is quite confusing. |
Gentle ping on this @jheer, @domoritz, and @kanitw. What do you think is the best way forward to fix the density specs? The decision we take here will also have an impact on the violin plots we have been talking about. My hunch would be that limiting the range of each grouped density in the KDE transform in Vega as suggested in vega/vega#3815, and rolling back #9018 would be the most straightforward way to fix this, but I'm very open to hearing other suggestions. |
If densities should be handled differently from other area marks when it comes to stacking, I think it would make sense to introduce enough knobs to support handling them differently and then to introduce a higher-level mark type (which could even create the density transform as well). I know @jheer was advocating for that for a long time 😅 so I'd say let's use this opportunity to do that. Imputation by default has issues but I still think it's a reasonable choice (as discussed in #5611). Would introducing xc as a channel as an alternative to centering a stacked area mark help as well? It would make for a more canonical representation of densities (https://vega.github.io/vega/examples/violin-plot/), I think. @yhoonkim and @kanitw since you introduced #9018, what are your thoughts on the issue/reverting? At this point, I am a bit lost between the discussion here and slack. Would it make sense to have a sync meeting where we can agree on a plan with @kanitw @yhoonkim @joelostblom @jheer and @domoritz? |
Co-authored-by: Dominik Moritz <[email protected]>
Outcome from meeting: let's get this option added but not change the default for now. |
@domoritz I updated this as per our discussion to change the default back to "shared", but now it's tunable which gives us more flexibility for designing the density mark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider the default and make sure we test different cases where resolve is set and not set in unit tests.
@@ -15,7 +15,8 @@ | |||
"type": "kde", | |||
"field": "IMDB Rating", | |||
"bandwidth": 0.3, | |||
"as": ["value", "density"] | |||
"as": ["value", "density"], | |||
"resolve": "shared" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be independent
. https://vega.github.io/vega/docs/transforms/kde/
Why should this be shared for non-stacked transforms? Maybe a reasonable default is to have it only shared if there is stacking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly from our discussion, we said it would be difficult to set the default value of the transform based on whether the encoding channel is stacked or not. I believe the main reason was that it would be hard to detect which encoding channel the density values are plotted on.
Leaving "shared" as the default would work in most situations, including most of the common ones, and if there are situations where the "independent" is needed, it would be easy to switch. To avoid that densities stack by default, we said we would turn off stacking as part of a mark_density
, but leave the default stacking behavior for the area mark as it is now to be consistent with other marks.
test/compile/data/density.test.ts
Outdated
as: ['A', 'density'] | ||
}); | ||
}); | ||
|
||
it('should add resolve shared if we group', () => { | ||
it('should only add resolve "shared" if we set it explicitly', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test this, we would need a case that doesn't set resolve
when it's not set in the transform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't resolve
always be in the transform sine the parameter has a default value? I think this test case should just be switched to "independent" instead of "shared" to reflect the updates in e3f4505. I did that in my latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, I'm really sorry, I missed that you had left new comments a while ago @domoritz. I just responded to the two comments you made and pushed a new commit for the test, let me know what you think.
@@ -15,7 +15,8 @@ | |||
"type": "kde", | |||
"field": "IMDB Rating", | |||
"bandwidth": 0.3, | |||
"as": ["value", "density"] | |||
"as": ["value", "density"], | |||
"resolve": "shared" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly from our discussion, we said it would be difficult to set the default value of the transform based on whether the encoding channel is stacked or not. I believe the main reason was that it would be hard to detect which encoding channel the density values are plotted on.
Leaving "shared" as the default would work in most situations, including most of the common ones, and if there are situations where the "independent" is needed, it would be easy to switch. To avoid that densities stack by default, we said we would turn off stacking as part of a mark_density
, but leave the default stacking behavior for the area mark as it is now to be consistent with other marks.
test/compile/data/density.test.ts
Outdated
as: ['A', 'density'] | ||
}); | ||
}); | ||
|
||
it('should add resolve shared if we group', () => { | ||
it('should only add resolve "shared" if we set it explicitly', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't resolve
always be in the transform sine the parameter has a default value? I think this test case should just be switched to "independent" instead of "shared" to reflect the updates in e3f4505. I did that in my latest commit.
Thank. I will take a look. Can you merge the latest main and make sure that the CI runs? |
@domoritz Thanks! Merge complete and CI ran successfully. |
Will add more description soon, just seeing if tests pass first because my local installation doesn't work.