Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Tables.jl interface for DataFrame(Rows|Columns) #2055

Merged
merged 7 commits into from
Dec 16, 2019

Conversation

tkf
Copy link
Contributor

@tkf tkf commented Dec 15, 2019

This PR adds Tables.jl interface for DataFrameRows and DataFrameColumns. It is useful for defining data manipulation functions expecting iterators. Example:

julia> tablemap(f, xs) = Tables.materializer(xs)(map(f, xs))
tablemap (generic function with 1 method)

julia> tablemap(x -> (A=x.a, B=x.b), eachrow(DataFrame(a=[3], b=[4])))
1×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 134

src/other/tables.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Dec 15, 2019

In general - this should not be problematic to add these definitions. Can you please elaborate in what cases you would find this useful (the issue is that you can always call parent on an iterator and get the data frame to use with Tables.jl).

src/other/tables.jl Outdated Show resolved Hide resolved
@tkf
Copy link
Contributor Author

tkf commented Dec 15, 2019

Thanks for the review!

Can you please elaborate in what cases you would find this useful (the issue is that you can always call parent on an iterator and get the data frame to use with Tables.jl).

I need this interface to make this test for Transducers.copy pass: https://github.com/tkf/Transducers.jl/pull/104/files#diff-da78dddbbd001af539780c4952b0e9fcR23-R27

I can't use parent in general as I want to process wrapper types as-is, e.g., SubArray-of-StructVector.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you. Let us just wait for @quinnj to confirm that all is OK with the usage of Tables.jl interface.

@@ -182,7 +182,7 @@ end

df2 = DataFrame!(eachrow(df))
@test df == df2
@test !any(((a,b),) -> a === b, zip(eachcol(df), eachcol(df2)))
@test all(((a,b),) -> a === b, zip(eachcol(df), eachcol(df2)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this PR, df2 = DataFrame!(eachrow(df)) does not copy columns any more. Is it an OK change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that it is even better. @nalimilan - are you OK with this?
If someone writes DataFrame! an explicit opt-out from copying, if possible, is assumed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense.

@@ -48,6 +48,18 @@ DataFrame!(x::Vector{<:NamedTuple}) =
"`$(typeof(x))` without allocating new columns: use " *
"`DataFrame(x)` instead"))

for T in [DataFrameRows, DataFrameColumns]
@eval begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need a loop? Isn't ::Union{DataFrameRows, DataFrameColumns} enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I hadn't seen @bkamins's comment above. I'd just repeat the Union without defining a custom type alias.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with both approaches. I wrote this as @bkamins preferred this approach. Ref: #2055 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with both - the @eval approach is used in Base often. But using Union without defining the alias is also OK (I just prefer not to introduce the alias here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does 22e32db look good?

@@ -182,7 +182,7 @@ end

df2 = DataFrame!(eachrow(df))
@test df == df2
@test !any(((a,b),) -> a === b, zip(eachcol(df), eachcol(df2)))
@test all(((a,b),) -> a === b, zip(eachcol(df), eachcol(df2)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense.

@nalimilan
Copy link
Member

Shouldn't materializer return a DataFramesRows/DataFramesColumns object rather than a DataFrame?

@tkf
Copy link
Contributor Author

tkf commented Dec 16, 2019

My understanding is that materializer(x)(input) isa typeof(x) does not have to hold. For example, there is a fallback definition materializer(x) = columntable which creates a NamedTuple-of-Vectors.

@nalimilan
Copy link
Member

Maybe it doesn't have to, but should it? Or why shouldn't it? :-)

@tkf
Copy link
Contributor Author

tkf commented Dec 16, 2019

Actually materializer(::DataFramesRows) = DataFramesRows ∘ DataFrame makes sense.

@bkamins
Copy link
Member

bkamins commented Dec 16, 2019

Well materializer is @quinnj's idea, so probably he can comment 😄. When I analyze it (it does not have a documentation) I get the understanding that it should materialize x, and DataFrameRows and DataFrameColums are views, while DataFrame is a materialized object (but maybe this is a wrong way to look at it - @quinnj can you please comment here what is the design purpose of materialize).

@bkamins
Copy link
Member

bkamins commented Dec 16, 2019

Actually materializer(::DataFramesRows) = DataFramesRows ∘ DataFrame makes sense.

As I have commented - whether it makes sense depends on the contract @quinnj wants materialize to support I think (I was writing in parallel).

@tkf
Copy link
Contributor Author

tkf commented Dec 16, 2019

@bkamins I just noticed your comment after writing a patch to do (something like) materializer(::DataFramesRows) = eachrow ∘ DataFrame. But I agree waiting for @quinnj's comment is a good idea. If returning a DataFrame is a better idea, I'll just remove that commit.

The reason why I thought returning a DataFramesRows make sense was that we can then have tablemap(f ∘ g, xs) == tablemap(f, tablemap(g, xs) which is a nice property.

Tables.materializer(itr::DataFrameRows) =
eachrow ∘ prefer_singleton_callable(Tables.materializer(parent(itr)))
Tables.materializer(itr::DataFrameColumns) =
eachcol ∘ prefer_singleton_callable(Tables.materializer(parent(itr)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @quinnj comments that we should return here DataFrameColumns and not DataFrame then we should inherit from itr if it was created with names positional argument set to true or false.

@quinnj
Copy link
Member

quinnj commented Dec 16, 2019

We don't currently have strict requirements on materializer; I think, as mentioned, it's preferable if it returns the same type as the input, but I don't think that's always feasible (immutable or view-like inputs, certain DB or file-based format tables, etc.).

These changes seem fine by me, though I will note that I recently tried to use DataFrameRows and wanted to open an issue up about it's useability: namely the fact that it doesn't inherit from <: AbstractDataFrame means things like names(df_rows) don't work, but would be nice if they did.

Anyway, I'm good w/ this.

@quinnj quinnj merged commit cac52ca into JuliaData:master Dec 16, 2019
@tkf tkf deleted the tables branch December 16, 2019 19:44
@bkamins
Copy link
Member

bkamins commented Dec 16, 2019

namely the fact that it doesn't inherit from <: AbstractDataFrame means things like names(df_rows) don't work, but would be nice if they did.

It cannot because it inherits from AbstractVector and we do not have multiple inheritance. I will open a PR adding common AbstractDataFrame methods for them.

@tkf
Copy link
Contributor Author

tkf commented Dec 17, 2019

Actually, can we go back to materializer(::DataFramesRows) = DataFrame? 😅

When I analyze it (it does not have a documentation) I get the understanding that it should materialize x, and DataFrameRows and DataFrameColums are views, while DataFrame is a materialized object

@bkamins I just realized that your point totally makes sense. I needed something I can push!/append! to. At the moment, DataFramesRows does not support this and I'm guessing DataFrames devs don't want to expand the API surface. If that's the case, can we go back to the original materializer definition?

Sorry, I should've tried to re-implement JuliaFolds/Transducers.jl#107 after the API was changed...

@bkamins
Copy link
Member

bkamins commented Dec 17, 2019

I'm guessing DataFrames devs don't want to expand the API surface

It is a simple rule: both objects are views so they should not be allowed to mutate the schema of the parent (i.e. changing numer of rows or columns or renaming them).

So I understand the change should be made to make materialize produce a DataFrame - right.?

Also just to confirm. This new DataFrame should be a copy of the source (not reuse the columns) - right?

@nalimilan
Copy link
Member

I guess returning a DataFrame would be OK if materializer is supposed to return a mutable table (like similar for arrays), but we would have to make this clear. Otherwise it sounds weird not to return DataFrameRows.

@tkf
Copy link
Contributor Author

tkf commented Dec 17, 2019

It is a simple rule: both objects are views so they should not be allowed to mutate the schema of the parent (i.e. changing numer of rows or columns or renaming them).

The "rule" sounds arbitrary to me, in the sense that you made it so you can change it. I can't find any explicit API contracts defined for eachrows and eachcols. I agree that the function name sounds like that the caller should not expect more than iterator API. However, if DataFrameRows is also used for Tables.rows #2051, it may makes sense to adding mutation support. For example, Tables.jl mentions that Tables.rows(table) === table is a reasonable choice if table is already a row iterator. This includes a table type that support push!.

@bkamins
Copy link
Member

bkamins commented Dec 18, 2019

I see your point, so let us wait for other to comment about their preference.

My thinking was that DtaFrameRows for AbstractDataFrame is like (v for x in x) in Base.

Also note that DataFrameRows might wrap a SubDataFrame in which case it will not be resizable anyway, as SubDataFrame is not resizable.

@nalimilan
Copy link
Member

@quinnj Do you think materializer is supposed to return a mutable table, i.e. to which rows or columns can be added?

@quinnj
Copy link
Member

quinnj commented Dec 19, 2019

I think materializer should return the most well-supported, "standard" table type for a family of table types. The point is that some higher-order table processor somewhere wants to manipulate a table and spit back out something like what you put in; so while it definitely makes sense to return the exact same type as input, I also think for general useability, it makes sense to return the most "standard" type, in this case a DataFrame.

@nalimilan
Copy link
Member

Yeah it's a tough choice. It's not ideal that if you passed a table which iterates rows (DataFrameRows) you get a table that cannot be iterated over (DataFrame). Though I guess you should call Tables.rows on the result anyway if you want to iterate over rows, so returning a DataFrame should be OK.

@tkf
Copy link
Contributor Author

tkf commented Dec 19, 2019

@bkamins @nalimilan @quinnj Thanks for the discussion! I opened #2058

@bkamins
Copy link
Member

bkamins commented Dec 27, 2019

I would prefer in the future to squash-merge PRs, as otherwise it is hard for me to write appropriate release notes. Thank you!

@quinnj
Copy link
Member

quinnj commented Dec 31, 2019

Huh......I think I did a rebase-merge on another repo and then it must remember your latest preference regardless of repo? I almost always squash, but there was a case on another repo where I wanted to rebase-merge and then I must have just hit the default here. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants