Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata on data frame and column level #3055

Merged
merged 97 commits into from
Sep 19, 2022
Merged

Metadata on data frame and column level #3055

merged 97 commits into from
Sep 19, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented May 22, 2022

This PR waits for JuliaData/DataAPI.jl#48.

I have done an initial implementation. Now we need to discuss for which methods metadata propagation should happen. For now I have implemented it for getindex.

I stopped at hcat - if we hcat several data frames, how do you think we should handle metadata. Options are:

  • drop all metadata
  • use only left table metadata
  • merge metadata by overwriting left table metadata with right table metadata
  • merge metadata by ignoring right table metadata that is already in left table metadata

Which one do we pick (when we have this decision it will naturally propagate to other cases).

@bkamins bkamins added this to the 1.4 milestone May 22, 2022
@bkamins
Copy link
Member Author

bkamins commented May 22, 2022

Fixes #2961 #35 #2276

@nalimilan
Copy link
Member

Another option to consider is to use metadata only if there are no conflicts between input data frames (i.e. it's present in one but absent from others, or equal in all data frames that have it). The advantage is that it would be order-independent.

FWIW, R's rbind (for both data.frame and tibble) takes variable labels from the first table, even in case of conflict; for some reason, dplyr's bind_rows drops them). Not sure what Stata does. Maybe @pdeffebach knows.

@pdeffebach
Copy link
Contributor

For joining in Stata, the left data frame takes precedence. I think this is the correct default, and we should do it in DataFrames.jl as well. See this gist describing Stata's behavior.

For hcat, since we don't have an option to overwrite column names (unless I'm forgetting), I think its fair for columns to keep their metadata even if they get renamed dup_column_1.

@bkamins
Copy link
Member Author

bkamins commented May 23, 2022

For joining in Stata, the left data frame takes precedence.

You mean that if left and right table have the same "table level" (not column level) metadata key, then value is kept from left table?

(please keep in mind that we will have two kind of metadata: table level and column level; now we are discussing table level metadata)

@pdeffebach
Copy link
Contributor

Ah. Sorry for the confusion.

Just did some research. It looks like Stata does not have named dataset-level dataset, for example "Date" or "Source". It's just a vector of strings. So Stata doesn't deal with this explicitly. All the notes just get added together.

But I still think having the left one be dominant is the right way to go.

@nalimilan
Copy link
Member

nalimilan commented May 23, 2022

Don't you think it would be confusing or even dangerous if doing hcat(januarydf, februarydf), with inputs having meta-data "month" => "January" and "month" => "February" respectively, gave a data frame with only "month" => "January"? I'd rather have at least some conflict detection, or just drop all metadata when calling hcat for now.

EDIT: joins are different as in leftjoin the first argument has the main role, and conversely for rightjoin; things are less clear for outerjoin and innerjoin.

@pdeffebach
Copy link
Contributor

Good point. But still, left taking precedence seems like a consistent default that will cause fewer headaches for users than something as destructive as getting rid of metadata.

@quinnj
Copy link
Member

quinnj commented May 23, 2022

I would also agree that having the left data be dominant makes sense. It's the table for which you're keeping all keys (+ rows) and the joining table is "additional", so it feels like that would make sense to me.

@nalimilan
Copy link
Member

@quinnj You're thinking about leftjoin, right? What about rightjoin?

@bkamins
Copy link
Member Author

bkamins commented May 23, 2022

@nalimilan - can you please have a look at the implementation? If it is OK for you I will go ahead and add:

  • manual section on metadata
  • tests
  • track all places where propagating metadata would make sense

I have implemented both table and column level metadata.

@bkamins bkamins changed the title Metadata on data frame level Metadata on data frame and column level May 23, 2022
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks good! The dict of dicts approach to store per-column metadata can always be improved later if needed.

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/other/tables.jl Outdated Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented May 24, 2022

The dict of dicts approach to store per-column metadata can always be improved later if needed.

We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I made "dict of dicts" as if only few columns have metadata it uses least memory. What alternatives do you see? Vector of dicts of dict of vectors?

@bkamins
Copy link
Member Author

bkamins commented May 24, 2022

Note: this PR needs to wait till #3047 is merged. Then it needs to be rebased (the reason is that in #3047 we add methods that have to handle metadata correctly)

@nalimilan
Copy link
Member

We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I made "dict of dicts" as if only few columns have metadata it uses least memory. What alternatives do you see? Vector of dicts of dict of vectors?

Yes. More precisely a Dict{String, Vector} holding Vector{Union{Nothing, Some}} objects with one entry per column equal to nothing if no metadata is set for a column. The advantage would be to use less memory and to reduce the number of objects that the GC has to track in the case where all columns share common metadata keys (the main use case I have in mind is column labels), and there are more columns than different metadata keys. But given that the vectors wouldn't be concretely typed I'm not sure the gain would be so large, and you're right that it wouldn't be as efficient if metadata is set only for some columns. It would also be more complex as metadata(df, col) would have to return a custom lazy AbstractDict that would be a view of this structure.

@bkamins
Copy link
Member Author

bkamins commented May 24, 2022

Ah - now I see we do not need to wait for #3047 as I intentionally kept there only functions that do not mutate list of columns. Problematic will be e.g. pushfirst! but I will open a PR for this later.

docs/src/man/metadata.md Outdated Show resolved Hide resolved
docs/src/man/metadata.md Outdated Show resolved Hide resolved
docs/src/man/metadata.md Outdated Show resolved Hide resolved
docs/src/man/metadata.md Outdated Show resolved Hide resolved
docs/src/lib/metadata.md Outdated Show resolved Hide resolved
docs/src/lib/metadata.md Outdated Show resolved Hide resolved
@pdeffebach
Copy link
Contributor

Looking at the PR right now, is it true that if the column :y has metadata, then

transform(df, :y => f => :y)

will destroy that metadata?

@nalimilan
Copy link
Member

nalimilan commented Jun 8, 2022

Yes. The idea is that transform(df, :y => ByRow(y -> y^2) => :y) will make metadata such as :unit => "m/s" incorrect.

@bkamins
Copy link
Member Author

bkamins commented Jun 8, 2022

only if f were identity or copy the metadata would be retained.

@pdeffebach
Copy link
Contributor

Okay. I guess the equivalent in Stata is replace x = ... if .... And we don't have that feature yet.

@bkamins
Copy link
Member Author

bkamins commented Jun 8, 2022

I guess the equivalent in Stata is replace x = ... if .... And we don't have that feature yet.

Could you please elaborate what you mean there? Thank you!

@bkamins
Copy link
Member Author

bkamins commented Sep 11, 2022

@nalimilan - I am done with the updates after your review. metadata.jl is significantly refactored.
I have resolved all comments that I believe are clear.
The only open is #3055 (comment), but I think it is OK what I do, so this also can be resolved if you agree.

This was referenced Sep 13, 2022
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm lost in all these tests. I guess that means they cover almost everything. :-D

src/other/metadata.jl Outdated Show resolved Hide resolved
test/metadata.jl Outdated Show resolved Hide resolved
test/metadata.jl Show resolved Hide resolved
test/metadata.jl Show resolved Hide resolved
test/metadata.jl Show resolved Hide resolved
test/metadata.jl Show resolved Hide resolved
test/metadata.jl Outdated Show resolved Hide resolved
test/metadata.jl Outdated Show resolved Hide resolved
test/metadata.jl Show resolved Hide resolved
test/metadata.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Sep 17, 2022

@nalimilan - I have applied all suggestions. Things to discuss that I left unresolved:

  • behavior of delete*! functions (now they silently do no-op on non-existent key, but error on non-existent column); I think it is OK
  • what name to use for :none metadata

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's see how it goes! :-)

@bkamins bkamins merged commit b01fd38 into main Sep 19, 2022
@bkamins bkamins deleted the bk/metadata branch September 19, 2022 20:57
@bkamins
Copy link
Member Author

bkamins commented Sep 19, 2022

Thank you! We are almost at 1.4 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants