Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: add mapdf(), restore the default Base.map() behaviour #1049

Closed
wants to merge 1 commit into from

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Sep 3, 2016

Currently, DataFrames.jl redefines how Base.map() works if applied to GroupedDataFrame or GroupApplied. Instead of returning the collection of the results, it tries to convert each f(sub_df) into DataFrame and returns GroupApplied object.

This PR restores the default map() behaviour (i.e. just returning the collection of the results). It allows e.g. using Base.map() to convert row groups into non-data frame objects.

The older version of map(f, ::GroupedDataFrame) is renamed into mapdf (similar to stackdf). In fact, it's rarely called from the user code, because there's by().

I also renamed GroupApplied into TransformedGroupedDataFrame, it should better describe what it is. TransformedGroupedDataFrame is not exported since it's clearly an internal type.

keep the semantic of Base.map() intact and rename the data frame-friendly
version of map() into mapdf().
@nalimilan
Copy link
Member

I wonder how this fits into the future jplyr.jl/Query.jl framework. Maybe we could wait a bit more, deprecate the current map and recommend using new syntax from these packages?

@alyst
Copy link
Contributor Author

alyst commented Sep 3, 2016

@nalimilan I must admit I wasn't aware of these packages until your comment. They look very exciting!

ATM the workaround to apply a function returning not a data frame is to use [.. for df in grp_df], but it is somewhat misleading that comprehension and map() give different results.
So another strategy would be

  • add eachgroup(grp_df) method returning GroupedDataFrameIterator
  • deprecate map and the collection interface for GroupedDataFrame
  • provide special method mapdf() (or whatever it could be called to meet the expectations of jplyr/query/...) to be used for data frame transformation when the grouping information should be preserved.

@nalimilan
Copy link
Member

OTOH map on Dict has switched to expecting functions to take and return a Pair, which allows it to return a Dict (JuliaLang/julia#17714). So there might be a case to make for the current behaviour with data frames.

@alyst
Copy link
Contributor Author

alyst commented Sep 4, 2016

Ah, yes, it does make sense that map(f, A) preserves the type of A.
What is still confusing is that map(f, Dict) throws an exception, when the result type of f is not Pair, whereas map(f, ::GroupedDataFrame) tries hard to convert the result into a DataFrame by doing wrap(f(sub_df)). I would rather keep the result of f "as is" and throw an exception if it's not dataframe-compatible.

@nalimilan
Copy link
Member

I would rather keep the result of f "as is" and throw an exception if it's not dataframe-compatible.

An issue is what we should consider as "dataframe-compatible". For dicts, that's not really a problem since it doesn't make much sense to convert a non-Pair object to a Pair (and that's the most convenient form anyway). But here, should we keep converting arrays to data frames?

@alyst
Copy link
Contributor Author

alyst commented Sep 4, 2016

Theoretically, map(f, Dict) can also accept Tuple{K,V}, but it doesn't.
The problem with accepting arrays or vectors is that DataFrames has to generate the column names.
I can imagine a situation when the user gets an output with x1.1, x2.1 etc columns and has the hard time finding which of the multiple grouping operations generated that. It would be better when the user has to explicitly convert array to frame providing the column names.

@nalimilan
Copy link
Member

It would be better when the user has to explicitly convert array to frame providing the column names.

Yeah, maybe. We could start raising a deprecation warning, and stop the automatic conversion) in the next major release (i.e. when porting to Nullable and breaking a lot of other things).

@alyst
Copy link
Contributor Author

alyst commented Sep 4, 2016

We could start raising a deprecation warning, and stop the automatic conversion) in the next major release

The other change in that direction could be hcat(). It currently silently renames the columns if there are name clashes (it often happens in map(f, grouped_df), when the output of f still contains the grouped columns). It could be regulated by duplicate_names=:warn/:error/:rename option, but personally I don't see how automatically renamed columns could be useful.

So I will submit another PR with the proposed behavior changes when NullableArrays would land.

I can either close this PR or strip it down to just renaming GroupApplied into TransformedGroupedDataFrame. What do you think?

@nalimilan
Copy link
Member

See #1520.

@nalimilan nalimilan closed this Sep 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants