Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix groupreduce with var and std for Unitful types #2601

Merged
merged 6 commits into from
Jan 21, 2021
Merged

Fix groupreduce with var and std for Unitful types #2601

merged 6 commits into from
Jan 21, 2021

Conversation

nalimilan
Copy link
Member

For Numbers for which squaring changes the type, we need to use different types for the mean, the variance and the standard deviation.

Fixes #2600.

@bkamins Do you think we should add Unitful as a test dependency to test this? I wonder whether there are types in Base that would allow reproducing the problem. Otherwise we could create a custom type just for this.

@bkamins
Copy link
Member

bkamins commented Jan 16, 2021

I think adding Unitful.jl as [extras] dependency is not a problem.

@bkamins bkamins added the bug label Jan 16, 2021
@bkamins bkamins added this to the 1.0 milestone Jan 16, 2021
@bkamins
Copy link
Member

bkamins commented Jan 16, 2021

@nalimilan - when we merge this we should backport it to https://github.com/JuliaData/DataFrames.jl/tree/0.22_patches and make a release. Therefore can you please change the version to 0.22.3 in Project.toml in this PR?

(I can do the backport and the release later, unless you would prefer to do it)

@bkamins
Copy link
Member

bkamins commented Jan 16, 2021

So now we have:

m^2 and NaN are not dimensionally compatible.

On Julia 1.0 😢.

@nalimilan
Copy link
Member Author

Actually that's a legitimate failure that must be fixed on other versions too. The random values just happened to hit that case on Julia 1.0. I've pushed a fix and an adaptation of tests to more reliably cover the situation where a group contains a single row.

@nalimilan nalimilan marked this pull request as ready for review January 16, 2021 17:42
test/grouping.jl Outdated
df = DataFrame(a = [rand([1:4;missing], 19); 5],
x1 = rand(1:100, 20),
x2 = rand(1:100, 20) + im*rand(1:100, 20),
x4 = rand(1:100, 20) .* u"m")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add x5 column that would be like x4 but also contain missing value and then below, in particular we would test also var∘skipmissing etc. (and make sure that we have a group that has only missing values in it).

In general I would want to test what happens if fast aggregation functions get a vector of zero length (which is only possible with missings as currently our groups must have a positive length).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tested below. Do you mean we should also test it for Unitful? I'm afraid that will require duplicating the whole block, as prod doesn't apply to these so we can't just loop over different types of columns. And then there's also Complex and CategoricalArray...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I meant a test with Unitful. Maybe then just add a small test with e.g. var∘skipmissing on a concrete test DataFrame like: one group only, column is union of missing and Unitful and contains only missing values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's bite the bullet and add systematic tests like for floats. I've pushed a commit.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow - that was fast and precise. I would just add one more test as commented (hopefully it will pass so I approve the PR).

nalimilan and others added 6 commits January 21, 2021 09:52
For `Number`s for which squaring changes the type, we need to use different
types for the mean, the variance and the standard deviation.
Co-authored-by: Bogumił Kamiński <[email protected]>
# Test reduction over group with only missing values
gd = groupby_checked(df, :a, skipmissing=skip, sort=sort)
indices && @test gd.idx !== nothing # Trigger computation of indices
gd[1][:, :x2m] .= missing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 😄

@nalimilan nalimilan merged commit 34e53e0 into main Jan 21, 2021
@nalimilan nalimilan deleted the nl/agg branch January 21, 2021 15:05
@bkamins
Copy link
Member

bkamins commented Jan 21, 2021

will you make a patch release or should I do it?

@nalimilan
Copy link
Member Author

Please go ahead. :-)

bkamins pushed a commit that referenced this pull request Jan 21, 2021
For `Number`s for which squaring changes the type, we need to use different
types for the mean, the variance and the standard deviation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

combine() error with std() on GroupedDataFrame with Unitful columns
2 participants