-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DataFrame.explode/2
#751
Add DataFrame.explode/2
#751
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I like the Polars behaviour with multiple columns explosion, and that you kept it :)
I added one suggestion of validation, but the other one - about the out_df
- is optional for now.
Also, I'm going to release a new version without this, but we can work in another patch version soon.
Thank you for the PR!
@doc type: :single | ||
@spec explode(df :: DataFrame.t(), column :: column_name() | [column_name()]) :: DataFrame.t() | ||
def explode(%DataFrame{} = df, column_or_columns) do | ||
columns = to_existing_columns(df, List.wrap(column_or_columns)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could validate that columns
is a list of {:list, any()}
. Does it return error if we pass a column that is not a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just silently ignores the non-list columns and the operation is successful. Do you think we should raise on the Elixir side in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@costaraphael personally I think it should raise if one of the columns is not of a list dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@philss @josevalim done in a1ad581!
Since it seems like this is the first time this kind of validation is performed at the DataFrame
level (checking if the given columns are of a certain type), I've chosen not to introduce any new abstractions. I figured this could be done once there were a couple more examples of this. Let me know if you disagree!
@@ -748,6 +748,11 @@ defmodule Explorer.PolarsBackend.DataFrame do | |||
|> Shared.create_dataframe() | |||
end | |||
|
|||
@impl true | |||
def explode(df, columns) do | |||
Shared.apply_dataframe(df, df, :df_explode, [columns]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the callback could receive the out_df
with the "correct" dtype. It's OK to have the first version without it, but I think we can say what dtypes the out DF is going to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@costaraphael excellent! ✨
thanks!
💚 💙 💜 💛 ❤️ |
Hey folks!
With this one list processing becomes essentially complete IMO, as if there is no convenient list-specific function for some operation (like string trimming, or arithmetic operations), you can just explode the column, process the elements, and then aggregate them back into lists.
One thing that caught me by surprise while implementing this is that I assumed that passing multiple columns to explode would perform a cartesian-style expansion on the data frame, which is not the case. It instead matches the list elements one by one for each row of the exploded columns.
Since I can see how this can be a bit unexpected, I've addressed this behavior in the function docs. Let me know if you think it should be given more emphasis, or if you think it should force a different behavior.
Regarding the name,
explode
sounds nice and it matches exactly what happens in Polars, so I personally like it. It also gives us a nice error message for free when the list lengths don't match:The other name I could think of is
unnest
, which is the name of a similar operation in PostgreSQL, but the semantics are a bit different, and there would be extra work translating the error message. Let me know your thoughts on it!