Add nest, unnest, extract and extract! #3258

Draft
wants to merge 13 commits into
base: main
Conversation

bkamins
Member

@bkamins bkamins commented Dec 28, 2022

Fixes #3005, #2890, #3116, #2767.

The PR adds nest and unnest and introduces a scalar kwarg to flatten (which is needed in unnest).

flatten is ready for review.

nest and unnest require a discussion of whether we like the proposed API (they work, but we may decide to change the API).

Some important decisions I propose:

  • nest only works on GroupedDataFrame (the reason is to avoid complexity of group order specification); nesting is done always to DataFrame (to keep things simple); another not easy decision is syntax I proposed [:x, :y] => :z which means that columns :x and :y should be nested and stored in column :z (but I would like to confirm that we find it intuitive, as syntax :z => [:x, :y] also could be advocated for).
  • unnest supports both tables (e.g. DataFrame) and rows (e.g. Tables.AbstractRow) and has two options: flatten=true, when rows of the nested columns are flattened, and flatten=false (when they are left as is - this is probably useful, if we work with rows)

TODO:

  • add metadata
  • write tests
  • update manual

CC @nalimilan @pdeffebach @jariji

@bkamins bkamins mentioned this pull request Dec 29, 2022
@nalimilan
Member

nest only works on GroupedDataFrame (the reason is to avoid complexity of group order specification); nesting is done always to DataFrame (to keep things simple); another not easy decision is syntax I proposed [:x, :y] => :z which means that columns :x and :y should be nested and stored in column :z (but I would like to confirm that we find it intuitive, as syntax :z => [:x, :y] also could be advocated for).

Agreed. [:x, :y] => :z sounds more logical as the input is on the LHS and the output on the RHS.

unnest supports both tables (e.g. DataFrame) and rows (e.g. Tables.AbstractRow) and has two options: flatten=true, when rows of the nested columns are flattened, and flatten=false (when they are left as is - this is probably useful, if we work with rows)

Let's continue discussion at #3258 (comment).

Review comment on src/abstractdataframe/nest.jl (outdated), lines 167 to 168:
`cols` (default `:setequal`) and `promote` (default `true`) keyword arguments
have the same meaning as in [`push!`](@ref).
Member

Maybe repeat the explanation? Usually we don't require users to read other docstrings like this.

Member Author

When I started describing it I realized that anything except cols=:union and promote=true is not really useful. We can always add it later. So I opted for a simpler design for now and functions do not take these arguments.

@bkamins
Member Author

bkamins commented Jan 2, 2023

I propose to discuss this PR step by step.
Let us start with nest. Here the main question is whether we need it at all. The main reason to doubt it is that Ref already acts as a nesting operator in the operation specification syntax.

Start with some data frame:

julia> df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     1     12     22
   3 │     2     13     23
   4 │     2     14     24

First, groupby alone already gives us nested SubDataFrames, and maybe this is enough for most users:

julia> groupby(df, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     1     12     22
⋮
Last Group (2 rows): a = 2
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     13     23
   2 │     2     14     24

(note that such grouped data frame also supports convenient indexing to get the nested data frames, something which is not easily available if we indeed nest data)
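
As a small illustration of that indexing (a sketch only; the variable names are arbitrary), a group can be looked up either by position or by the value of its key:

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

first_group = gdf[1]                    # positional lookup, returns a SubDataFrame
group_a2 = gdf[(a=2,)]                  # lookup by the value of the grouping key
key_values = [k.a for k in keys(gdf)]   # key of every group, in group order
```

Both lookups return views into df, so no data is copied.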

Now, a basic pattern to nest some columns as NamedTuple is:

julia> combine(groupby(df, :a), AsTable([:b, :c]) => Ref => :bc)
2×2 DataFrame
 Row │ a      bc
     │ Int64  NamedTup…
─────┼─────────────────────────────────────
   1 │     1  (b = [11, 12], c = [21, 22])
   2 │     2  (b = [13, 14], c = [23, 24])

The only thing the user needs to remember here is that the :b and :c fields of the named tuples are views, which might actually be preferred in many scenarios.
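
A minimal sketch of what "views" means here, and of one way to get independent vectors instead (the copying variant is my own illustration, not something the PR implements):

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

# Fields of the nested named tuples are views into the columns of df.
nested_views = combine(gdf, AsTable([:b, :c]) => Ref => :bc)
is_view = nested_views.bc[1].b isa SubArray

# Copying every field before wrapping gives freshly allocated vectors.
nested_copies = combine(gdf, AsTable([:b, :c]) => (nt -> Ref(map(copy, nt))) => :bc)
is_vector = nested_copies.bc[1].b isa Vector
```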

If one wants to nest data frames one can just write:

julia> combine(groupby(df, :a), AsTable([:b, :c]) => Ref∘DataFrame => :bc)
2×2 DataFrame
 Row │ a      bc
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame

Now the columns are copied (as the DataFrame constructor copies data by default).

Finally note that operation specification syntax also works with row nesting (something we have not discussed but is useful):

julia> combine(groupby(df, :a), AsTable([:b, :c]) => ByRow(identity) => :bc)
4×2 DataFrame
 Row │ a      bc
     │ Int64  NamedTup…
─────┼─────────────────────────
   1 │     1  (b = 11, c = 21)
   2 │     1  (b = 12, c = 22)
   3 │     2  (b = 13, c = 23)
   4 │     2  (b = 14, c = 24)

In the implementation in the PR I used a slightly different pattern:

julia> combine(groupby(df, :a), x -> (bc = select(x, [:b, :c]),))
2×2 DataFrame
 Row │ a      bc
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame

but it was only because the implementation should support all cases (and AsTable can be problematic with compilation in case of extremely wide tables).
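
To make the comparison concrete, the pattern above can be wrapped in a small helper that accepts the proposed [:x, :y] => :z syntax. This is only a sketch: nest_sketch is a hypothetical name, and it is not the implementation in the PR.

```julia
using DataFrames

# Hypothetical helper (not the PR's `nest`): store the columns `cols` of every
# group as a DataFrame in a single new column named `target`.
function nest_sketch(gdf::GroupedDataFrame, (cols, target)::Pair)
    return combine(gdf, sdf -> NamedTuple{(Symbol(target),)}((select(sdf, cols),)))
end

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
nested = nest_sketch(groupby(df, :a), [:b, :c] => :bc)
```

Since select copies columns by default, the nested data frames here do not alias df.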

In summary, my question is: given these considerations, do we need to add nest? Maybe it is enough to add examples to the manual showing how nesting can be done?

@jariji
Contributor

jariji commented Jan 2, 2023

combine says

If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.

e.g.

julia> combine(groupby(df, :a), AsTable([:b, :c]) => DataFrame => :bc)
ERROR: ArgumentError: a single value or vector result is required (got DataFrame)

Is the reason for this error documented somewhere? i.e. why doesn't this just work?

The error message could be a good place to document the Ref/fill trick.


I don't like having to specify the columns twice. If it's okay for :a to be in the nested dataframes too I can just use

combine(groupby(df, :a), AsTable(:) => Ref∘DataFrame => :bc)

but sometimes it's desired that :a not be duplicated there, and I'd rather not have to spell it out. I'm not currently sure how big an issue to make of this.
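
One possible way to avoid spelling out the columns, sketched here as a suggestion rather than as part of the PR, is to ask the grouped data frame for its non-grouping columns with valuecols:

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

# valuecols(gdf) returns the names of all columns that are not grouping keys,
# so :a is left out of the nested data frames without being named explicitly.
nested = combine(gdf, AsTable(valuecols(gdf)) => Ref∘DataFrame => :bc)
```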

@bkamins
Member Author

bkamins commented Jan 2, 2023

i.e. why doesn't this just work?

The reason is safety (in other words: to allow for a non-breaking change in the future if we decide it is needed). For example, if a user writes:

AsTable([:b, :c]) => DataFrame

it would be ambiguous whether the user wants the produced data frame to be stored in one cell of a data frame or expanded (intuitively, users might expect it to be expanded just as vectors are expanded).

This is especially relevant when trying to detect rare cases when some function can either return a table or a scalar (i.e. when the return type of the operation is not type stable).
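
The two intents can currently be expressed explicitly. A minimal sketch, assuming the documented behavior of Ref (store a single value) and of an AsTable target (expand into columns):

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

# Intent 1: keep the produced data frame in a single cell, one row per group.
stored = combine(gdf, AsTable([:b, :c]) => Ref∘DataFrame => :bc)

# Intent 2: expand the produced data frame into columns and rows.
expanded = combine(gdf, AsTable([:b, :c]) => DataFrame => AsTable)
```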

The error message could be a good place to document the Ref/fill trick.

This is a good point. I will make a PR changing this.

@nalimilan
Member

Yeah nest isn't strictly needed. Its main advantage is that it's easier to discover, but it's not clear that nesting is really useful in DataFrames.jl thanks to the existence of GroupedDataFrame (that dplyr doesn't have).

That said, if we add unnest we should probably have nest for consistency. But the behavior with flatten=false doesn't really match what I would expect from unnest. Maybe the action of splitting a column into several ones would be better named separate or extract similar to the dplyr functions? In dplyr these only allow splitting strings, but we could make it more general and allow creating columns from any collection, including named tuples? The only peculiarity of named tuples (and dicts) is that appropriate column names can be extracted automatically.

@bkamins
Member Author

bkamins commented Jan 2, 2023

Maybe the action of splitting a column into several ones would be better named separate or extract similar to the dplyr functions?

This is what I had in mind. extract seems like a good name.

In dplyr these only allow splitting strings, but we could make it more general and allow creating columns from any collection, including named tuples?

This is something we do not need to add, as we already have it: AsTable as a target does exactly this.
The only limitation is that AsTable assumes a fixed schema for all rows.

What we need is a function designed to handle cases when each row potentially has a different schema.
And maybe also (this is not added, but we could add it) allowing for e.g. a missing value in a row, which would be expanded to missing values.
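
To show the behavior I mean by "different schema per row", here is a sketch that emulates it with the existing push! and cols=:union. It only illustrates the expected result, not how extract is implemented:

```julia
using DataFrames

rows = [(a=1, b=2), (a=3, c=4)]   # the second row has :c instead of :b

wide = DataFrame()
for row in rows
    # cols=:union adds columns that are not present yet and fills gaps with missing
    push!(wide, row; cols=:union)
end
```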

The only peculiarity of named tuples (and dicts) is that appropriate column names can be extracted automatically.

Currently dicts would not work, as they do not have a defined column order (we could change this, but then push! et al. should also be changed for consistency).

@jariji
Contributor

jariji commented Jan 2, 2023

it's not clear that nesting is really useful in DataFrames.jl thanks to the existence of GroupedDataFrame (that dplyr doesn't have).

I'm missing something, how does GroupedDataFrame substitute for df nesting?

My uninformed impression so far is that GroupedDataFrame partially substitutes for Pandas's row labels but that hierarchical column labels have no equivalent in DFjl and nested dataframes is my workaround for that missing feature.

@bkamins
Member Author

bkamins commented Jan 2, 2023

nested dataframes is my workaround for that missing feature.

Indeed nested column labels are not supported and a work-around for them is needed. However, my question is why do you use data frame for this. Normally a NamedTuple of scalars would be used here like:

julia> df = DataFrame(x=[(a="aa", b="bb"), (a="pp", b="qq")])
2×1 DataFrame
 Row │ x
     │ NamedTup…
─────┼──────────────────────
   1 │ (a = "aa", b = "bb")
   2 │ (a = "pp", b = "qq")

That is why I have said that normally I would expect flatten=false to be needed.

I'm missing something, how does GroupedDataFrame substitute for df nesting?

What we mean is that you can easily index into a GroupedDataFrame to get the portion of the source data frame for a certain combination of key column values. Notice that it naturally combines with column nesting of scalars.

The point is that if you nest a whole DataFrame you fix the row structure, while if you nest a NamedTuple of scalars you have nested columns but can still groupby different columns flexibly later.
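
A small sketch of that flexibility (the columns g and h and the data are made up for illustration):

```julia
using DataFrames

df = DataFrame(g=[1, 1, 2],
               h=["u", "v", "u"],
               x=[(a="aa", b="bb"), (a="pp", b="qq"), (a="rr", b="ss")])

# The nested column x does not constrain how rows are grouped later.
by_g = groupby(df, :g)
by_h = groupby(df, :h)

# With a fixed schema the nested column can be expanded again at any time.
flat = select(df, :g, :h, :x => identity => AsTable)
```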

@jariji
Contributor

jariji commented Jan 3, 2023

In my dataframe, each row specifies a regression model and the :df column has the data, including the regression residuals, similar to the broom vignette. What is your opinion about using this style in DFjl?

@bkamins
Member Author

bkamins commented Jan 3, 2023

You mean that "per group" you want to store different objects:

  • source data frame as one column
  • estimation results as another column
  • residuals etc. as yet another column

Then indeed nesting a data frame (or any other object) makes sense.

@bkamins
Member Author

bkamins commented Jan 4, 2023

@jariji, following up on my last comment: I have now realized that one cannot create such an object by nesting alone. That is, the use case where nesting seems to be needed is when you add different objects sequentially (if you did it in one shot they would have to have the same number of columns). Is this indeed your case, i.e. you nest only the columns needed for estimation of the model, and the other columns, which are also nested, are only added later?
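
A sketch of the sequential workflow I have in mind, with a hand-rolled least squares fit standing in for a real model (fit_ols and design are hypothetical helpers, not part of any package):

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)

# Step 1: nest only the columns needed for estimation.
nested = combine(groupby(df, :a), sdf -> (data = select(sdf, [:b, :c]),))

# Hypothetical stand-ins for a model fit: regress c on an intercept and b.
design(d) = [ones(nrow(d)) d.b]
fit_ols(d) = design(d) \ d.c

# Step 2: add further per-group objects as new columns, one at a time.
nested.coef = map(fit_ols, nested.data)
nested.resid = map((d, beta) -> d.c .- design(d) * beta, nested.data, nested.coef)
```

Each row of nested then holds the source data, the estimated coefficients and the residuals for one group.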

@bkamins
Member Author

bkamins commented Jan 5, 2023

OK, I have thought about how we should move forward. Although the functions are simple, I think we can keep them for user-friendliness. The functions are:

  • nest and unnest for working with tables
  • expand and expand! for expanding row-like sources (here the ! version makes sense to have, as this is most likely what the user will want to do since the number of rows does not change).

For now I gave tentative implementations (to show how things internally would work).
@nalimilan To move forward I need #3245 to be merged first and then I need to merge it into this PR.

@nalimilan
Member

In dplyr these only allow splitting strings, but we could make it more general and allow creating columns from any collection, including named tuples?

This is something we do not need to add as we already have it. As AsTable as target does exactly this. The only limitation is that AsTable assumes a fixed schema for all rows.

What we need is a function designed to handle cases when each row potentially has a different schema. And maybe also (this is not added, but we could add it) allowing for e.g. a missing value in a row, which would be expanded to missing values.

The schema isn't fixed either when splitting string columns into one or more columns: in some cases you might have no occurrence of the separator, in some cases one or more occurrences, giving one, two or more columns. That's why it seems to make sense to be able to support both strings and collections in the same function.

Why call it expand rather than separate or extract? expand is something completely different in dplyr.

Otherwise your proposal sounds good to me.

@bkamins
Member Author

bkamins commented Jan 9, 2023

Why call it expand rather than separate or extract?

I meant extract - fixed.

Now regarding:

The schema isn't fixed either when splitting string columns into one or more columns

This is not supported anyway, as currently there is no way to specify in the syntax that the string should be split.

What we provide is the following:

julia> df = DataFrame(x=["a,b", "c,d"])
2×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ a,b
   2 │ c,d

julia> select(df, :x => ByRow(x -> split(x, ',')) => AsTable)
2×2 DataFrame
 Row │ x1         x2
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d

julia> select(df, :x => ByRow(x -> split(x, ',')) => [:p, :q])
2×2 DataFrame
 Row │ p          q
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d

with the restriction that every string must be split into the same number of groups.

Maybe then - instead of adding extract and extract! we should change the implementation of => AsTable and => [:p, :q] syntaxes above and allow for varying number of columns to be produced by the expression (in which case we would make the cols=:union equivalent instead?).
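
For reference, this is roughly the result a varying number of pieces would have to produce. The sketch pads short rows with missing by hand; it is an emulation of the desired behavior, not a proposal for the implementation:

```julia
using DataFrames

df = DataFrame(x=["a,b", "c", "d,e,f"])   # one, two or three pieces per row

parts = [split(s, ',') for s in df.x]
width = maximum(length, parts)

# Rows with fewer pieces are padded with missing on the right.
padded = [get(p, i, missing) for p in parts, i in 1:width]
wide = DataFrame(padded, :auto)
```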

@jariji
Contributor

jariji commented Jan 9, 2023

Related PR for string-splitting with fixed number of splitpoints JuliaLang/julia#43557

@bkamins bkamins mentioned this pull request Feb 5, 2023
@bkamins bkamins changed the title from "Add nest, unnest, improve flatten" to "Add nest, unnest, extract and extract!" Feb 5, 2023
@bkamins
Member Author

bkamins commented Feb 5, 2023

OK - flatten is removed from this PR.
We leave it as WIP with unnest, nest, extract and extract!

@bkamins bkamins marked this pull request as draft February 5, 2023 08:17
@bkamins bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023
@bkamins
Member Author

bkamins commented Jun 4, 2023

Self-note. Investigate:

:src_column => only => AsTable

pattern
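
One possible reading of this pattern, sketched under the assumption that every group contains exactly one row with a nested data frame (the column name :bc is just an example), is that it undoes nesting:

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
nested = combine(groupby(df, :a), AsTable([:b, :c]) => Ref∘DataFrame => :bc)

# Every group of `nested` has a single row, so `only` picks out its data frame
# and the AsTable target expands it back into columns and rows.
unnested = combine(groupby(nested, :a), :bc => only => AsTable)
```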

@bkamins bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
@ohaaga

ohaaga commented May 14, 2024

Just learning Julia, so apologies if this is redundant, but I'd love to have nest/unnest (and convenience functions for mapping pipelined functions over columns that contain dataframes) for something like Jennifer Bryan's "row-oriented" workflow, which really helps to keep a project organized (in a real data structure, rather than with ad-hoc naming conventions, etc) when repeating multiple analyses over e.g. different geographic units.

https://github.com/jennybc/row-oriented-workflows

@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
Successfully merging this pull request may close these issues.

unnest feature: cols=:union argument (or something like it) for combine with AsTable
4 participants