Add mode #453

Merged: cigrainger merged 12 commits into main from cg/mode on Nov 12, 2023

@cigrainger (Member) commented Dec 17, 2022

See #452. This is a WIP because I'm not sure how to handle the fact that a series can have no mode or several. This gets particularly hairy in a groupby scenario where each group may have different lengths (which causes problems alongside other summary statistics like mean and median, where you know the length will be 1).
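To make the "no mode or multiple" concern concrete, here is a minimal sketch in plain Elixir. `ModeSketch.modes/1` is a hypothetical helper written for illustration only; it is not Explorer's implementation and makes no claim about how `Explorer.Series` resolves ties.

```elixir
defmodule ModeSketch do
  # Hypothetical helper, not part of Explorer.
  # Returns every value tied for the highest count, sorted.
  # An empty input has no mode, so the result is an empty list.
  def modes([]), do: []

  def modes(values) do
    freqs = Enum.frequencies(values)
    max_count = freqs |> Map.values() |> Enum.max()

    freqs
    |> Enum.filter(fn {_value, count} -> count == max_count end)
    |> Enum.map(fn {value, _count} -> value end)
    |> Enum.sort()
  end
end

IO.inspect(ModeSketch.modes([1, 2, 2, 3])) # one mode
IO.inspect(ModeSketch.modes([1, 1, 2, 2])) # two modes (a tie)
IO.inspect(ModeSketch.modes([]))           # no mode
```

The result length varies between 0, 1, and many, which is the property that makes mode awkward next to aggregations that always produce exactly one value.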

Pandas:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.mode.html#pandas.Series.mode
https://stackoverflow.com/a/54304691

R:
https://stackoverflow.com/questions/66972590/how-to-find-mean-median-mode-based-on-distinctive-groups-in-r
https://cran.r-project.org/web/packages/modeest/modeest.pdf

Okay! Thanks to #725 this is good to go. Closes #452.

@philss (Member) left a comment

> This gets particularly hairy in a groupby scenario where each group may have different lengths

I think this is fine, because I think the computation is going to occur inside the group's context. Maybe adding a test (or rather, changing an existing summarise or mutate test to include another column) would clarify things a little. WDYT?

lib/explorer/series.ex (outdated, resolved review thread)

@cigrainger (Member, Author) commented

> I think this is fine, because I think the computation is going to occur inside the group's context

Yes, between groups, but not within groups. So take the example where you summarise down to mean and mode. If mode is length 0 or >1, then you have a problem where within groups the aggregations are different lengths. Adding a test or two now to clarify as suggested 👍.
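The within-group mismatch described above can be sketched with plain Elixir maps and lists. The rows and column names here are made up for illustration, and the code does not use or represent Explorer's `summarise`.

```elixir
# Made-up rows; :group and :x are hypothetical column names.
rows = [
  %{group: "a", x: 1},
  %{group: "a", x: 1},
  %{group: "a", x: 2},
  %{group: "b", x: 3},
  %{group: "b", x: 4}
]

summary =
  rows
  |> Enum.group_by(& &1.group, & &1.x)
  |> Map.new(fn {group, xs} ->
    freqs = Enum.frequencies(xs)
    max_count = freqs |> Map.values() |> Enum.max()

    # Keep every value whose count equals the maximum (ties included).
    modes = for {value, ^max_count} <- freqs, do: value

    {group, %{mean: Enum.sum(xs) / length(xs), modes: Enum.sort(modes)}}
  end)

IO.inspect(summary)
```

Group "a" has a single mode, while group "b" has two tied values, so within "b" the mean has length 1 and the mode has length 2.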

@cigrainger (Member, Author) commented Dec 17, 2022

Well it clarified in a different direction: summarise w/ mode returns a single value per group -- a list! We'll have to pick up #401 anyway.

@josevalim (Member) left a comment

LGTM although it seems we are waiting on lists?

liamdiprose pushed a commit to liamdiprose/explorer that referenced this pull request Feb 16, 2023
lib/explorer/series.ex (two outdated, resolved review threads)
@cigrainger cigrainger marked this pull request as ready for review November 12, 2023 13:34
@doc """
Gets the most common value of the series.

## Supported dtypes

Review comment (Member):

Let’s just say all except lists? Otherwise it is easy for this to go out of date!

Reply (Member, Author):

Yep fair 😂

- non_list_dtypes = [
+ non_list_dtypes = non_list_types()
  list_dtypes = for dtype <- non_list_dtypes, do: {:list, dtype}
  non_list_dtypes ++ list_dtypes

Review comment (Member):

List types are recursive, so this is theoretically incomplete. Instead of building this list, what if we just append {:list, :any}?
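A small sketch of why a one-level comprehension is incomplete for recursive list types. The scalar dtype names below are a hypothetical subset chosen for illustration, not the library's actual list.

```elixir
# Hypothetical subset of scalar dtypes, for illustration only.
non_list_dtypes = [:integer, :string]

# Same construction as the snippet under review: one level of nesting.
list_dtypes = for dtype <- non_list_dtypes, do: {:list, dtype}
all_dtypes = non_list_dtypes ++ list_dtypes

# A list of integers is covered...
one_level = {:list, :integer} in all_dtypes

# ...but a list of lists of integers is not.
two_levels = {:list, {:list, :integer}} in all_dtypes

IO.inspect({one_level, two_levels})
```

However many levels are enumerated this way, a dtype nested one level deeper is always missing, which is why an explicit list can never be complete.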

Reply (Member, Author):

Sure, fine by me 👍

Reply (Member, Author):

So it looks like this is used for tests? I'm going to leave it how it was (I just wrapped the non-list ones into a separate function) and then I think we should revisit it in a future PR.

@billylanchantin (Contributor) left a comment

Woo! It's cool to see this PR get across the finish line!

I just had one suggestion about the docs: I think it's worth calling out ties.

lib/explorer/series.ex (outdated, resolved review thread)
cigrainger and others added 2 commits November 12, 2023 16:19

@josevalim (Member) left a comment

Ship it and I can look at the dtypes stuff. :)

@cigrainger cigrainger merged commit e0c02a4 into main Nov 12, 2023
4 checks passed
@cigrainger cigrainger deleted the cg/mode branch November 12, 2023 17:19
Linked issue: compute mode on grouped DataFrame
4 participants