Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitise bin calculations #6212

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Conversation

teunbrand
Copy link
Collaborator

@teunbrand teunbrand commented Dec 4, 2024

This PR aims to fix #6207.

However, the linked issue is merely an occasion to sanitise bin calculations across ggplot2.
Most binning functions in ggplot2 acted similar, but not quite identical. An example of this is stat_bin(boundary) has replaced the stat_bin(origin) argument since ggplot2 2.1.0, but stat_bin_2d(origin) continued to exist.

The main changes in this PR:

  • We have a new binning orchestrator compute_bins(), which is now used by all binning functionality. It is modelled after the stat_bin() functionality, as this was the most polished version. stat_bin() is able to deal with 0-width data elegantly, so this fixed geom_bin_2d() should recognise 0-width data #6207.
  • All binning stats now use boundary/center instead of origin. They were never formal arguments to most binning stats.
  • I also realised that stat_bin(keep.zeroes) introduced in Treatment options for zeroes in histograms #6139 was fulfilling the same/very similar role as stat_bin_2d(drop). However, stat_bin(drop) has been deprecated (also since 2.1.0). This PR renames stat_bin(keep.zeroes) to stat_bin(drop), effectively resurrecting a formerly deprecated argument. There is no need for backward compatibility here since keep.zeroes was introduced in this cycle of development.
  • StatBin2d is now a subclass of StatSummary2d, which prevents duplicated code.
  • stat_bin(bins) can be a function now, purely for symmetry with the stat_bin(breaks, binwidth) arguments that can also be a function. You can use this to do bins = nclass.Sturges for example.

Reprex from the linked issue:

devtools::load_all("~/packages/ggplot2/")
#> ℹ Loading ggplot2
df <- data.frame(x = 1:5, y = 1)

ggplot(df, aes(x, y)) +
  geom_bin_2d()
#> `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

Created on 2024-12-04 with reprex v2.1.1

binwidth <- dual_param(binwidth, list(NULL, NULL))
breaks <- dual_param(breaks, list(NULL, NULL))
fun = "mean", fun.args = list(),
boundary = 0, closed = NULL, center = NULL) {
bins <- dual_param(bins, list(x = 30, y = 30))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this is also done in setup_params(), but then the default argument to compute_group() should be bins = list(x = 30, y = 30), which causes a test to complain that there is a mismatch between compute_group() defaults and stat_*() constructor defaults.

Comment on lines -176 to -180
expect_snapshot_error(bin_breaks_width(3))
expect_snapshot_error(comp_bin(dat, binwidth = letters))
expect_snapshot_error(comp_bin(dat, binwidth = -4))

expect_snapshot_error(bin_breaks_bins(3))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer testing bin_breaks_width(3) and bin_breaks_bins(3) because these are always called from inside compute_bins(), which protects against improper ranges.

@arcresu
Copy link
Contributor

arcresu commented Jan 8, 2025

Thanks for this - it'll be great to have the bin handing unified!

I noticed that the documentation of the 2D binning layers has been a bit neglected and maybe this would be a good opportunity to bring them up to date? Specifically:

  • The existing docs and examples don't really explain the API enabled behind the scenes by dual_param(). We could explain what happens when giving a scalar (effectively recycled for x and y) vs. named vector/list vs. unnamed vector of length 2. Possibly in the details section rather than repeating for each param?
  • StatBin2d$compute_group(breaks) was already supported, but stat_bin_2d(breaks) wasn't documented. Especially now that you've brought parity to the accepted binning params, should they also be added to the docs and usage for the 2D layer constructors?

@teunbrand
Copy link
Collaborator Author

Thanks for these suggestions, I've adapted the docs

Copy link
Member

@thomasp85 thomasp85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - great work

}
check_numeric(breaks)
bins <- bin_breaks(breaks, closed)
return(bins)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we warn about ignoring the other arguments here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bins usually gets prepopulated which results in false positive warnings. We can warn about the other arguments, but I think it'd be slightly inconsistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

geom_bin_2d() should recognise 0-width data
3 participants