Metadata on data frame and column level #3055

bkamins · 2022-05-22T21:47:56Z

This PR waits for JuliaData/DataAPI.jl#48.

I have done an initial implementation. Now we need to discuss for which methods metadata propagation should happen. For now I have implemented it for getindex.

I stopped at hcat. If we hcat several data frames, how do you think we should handle metadata? The options are:

- drop all metadata
- use only left table metadata
- merge metadata by overwriting left table metadata with right table metadata
- merge metadata by ignoring right table metadata that is already in left table metadata

Which one do we pick? (Once we have this decision, it will naturally propagate to other cases.)
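For reference, the four options above can be sketched on plain `Dict`s. This is only an illustration using Base Julia, not the PR's implementation; the keys and values are made up.

```julia
# Toy table-level metadata for two data frames passed to hcat (made-up values).
left  = Dict("source" => "survey A", "month" => "January")
right = Dict("month" => "February", "author" => "someone")

# Option 1: drop all metadata.
drop_all = Dict{String, String}()

# Option 2: use only the left table's metadata.
left_only = copy(left)

# Option 3: merge; the right table overwrites the left on shared keys.
# `merge` takes the value from the last collection listed.
right_wins = merge(left, right)

# Option 4: merge; right-table keys already present in the left are ignored.
left_wins = merge(right, left)
```

Options 3 and 4 differ only in the argument order passed to `merge`, which is why the result depends on the order of the tables.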

bkamins · 2022-05-22T21:50:25Z

Fixes #2961 #35 #2276

nalimilan · 2022-05-23T06:22:51Z

Another option to consider is to use metadata only if there are no conflicts between input data frames (i.e. it's present in one but absent from others, or equal in all data frames that have it). The advantage is that it would be order-independent.
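A minimal sketch of this conflict-free rule, assuming metadata is held in plain `Dict`s with `String` keys. The function name `merge_if_consistent` is hypothetical and not part of DataFrames.jl.

```julia
# Keep a key only if every table that defines it agrees on its value.
function merge_if_consistent(dicts::AbstractDict...)
    result = Dict{String, Any}()
    conflicted = Set{String}()   # keys already seen with differing values
    for d in dicts
        for (k, v) in d
            k in conflicted && continue
            if haskey(result, k)
                if !isequal(result[k], v)
                    delete!(result, k)
                    push!(conflicted, k)
                end
            else
                result[k] = v
            end
        end
    end
    return result
end

a = Dict("month" => "January", "source" => "survey")
b = Dict("month" => "February", "source" => "survey", "unit" => "EUR")
m = merge_if_consistent(a, b)   # keeps "source" and "unit", drops "month"
```

Because a conflicting key is dropped rather than resolved in favor of one side, `merge_if_consistent(a, b)` and `merge_if_consistent(b, a)` give the same result.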

FWIW, R's rbind (for both data.frame and tibble) takes variable labels from the first table, even in case of conflict; for some reason, dplyr's bind_rows drops them. Not sure what Stata does. Maybe @pdeffebach knows.

pdeffebach · 2022-05-23T13:21:47Z

For joining in Stata, the left data frame takes precedence. I think this is the correct default, and we should do it in DataFrames.jl as well. See this gist describing Stata's behavior.

For hcat, since we don't have an option to overwrite column names (unless I'm forgetting), I think it's fair for columns to keep their metadata even if they get renamed dup_column_1.

bkamins · 2022-05-23T13:30:45Z

> For joining in Stata, the left data frame takes precedence.

You mean that if the left and right tables have the same "table level" (not column level) metadata key, then the value is kept from the left table?

(Please keep in mind that we will have two kinds of metadata: table level and column level. For now we are discussing table level metadata.)

pdeffebach · 2022-05-23T14:27:16Z

Ah. Sorry for the confusion.

Just did some research. It looks like Stata does not have named dataset-level metadata, for example "Date" or "Source". It's just a vector of strings. So Stata doesn't deal with this explicitly. All the notes just get added together.

But I still think having the left one be dominant is the right way to go.

nalimilan · 2022-05-23T15:35:58Z

Don't you think it would be confusing or even dangerous if doing hcat(januarydf, februarydf), with inputs having metadata "month" => "January" and "month" => "February" respectively, gave a data frame with only "month" => "January"? I'd rather have at least some conflict detection, or just drop all metadata when calling hcat for now.

EDIT: joins are different as in leftjoin the first argument has the main role, and conversely for rightjoin; things are less clear for outerjoin and innerjoin.

pdeffebach · 2022-05-23T16:02:19Z

Good point. But still, left taking precedence seems like a consistent default that will cause fewer headaches for users than something as destructive as getting rid of metadata.

quinnj · 2022-05-23T18:20:25Z

I would also agree that having the left data be dominant makes sense. It's the table for which you're keeping all keys (+ rows) and the joining table is "additional", so it feels like that would make sense to me.

nalimilan · 2022-05-23T19:44:50Z

@quinnj You're thinking about leftjoin, right? What about rightjoin?

bkamins · 2022-05-23T21:55:15Z

@nalimilan - can you please have a look at the implementation? If it is OK for you I will go ahead and add:

- manual section on metadata
- tests
- track all places where propagating metadata would make sense

I have implemented both table and column level metadata.

nalimilan

Yes, looks good! The dict of dicts approach to store per-column metadata can always be improved later if needed.

src/dataframe/dataframe.jl

src/other/tables.jl

src/other/utils.jl

NEWS.md

bkamins · 2022-05-24T16:59:38Z

> The dict of dicts approach to store per-column metadata can always be improved later if needed.

We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not make such changes too often. I chose "dict of dicts" because it uses the least memory when only a few columns have metadata. What alternatives do you see? Vector of dicts or dict of vectors?

bkamins · 2022-05-24T18:50:17Z

Note: this PR needs to wait until #3047 is merged. Then it needs to be rebased (the reason is that in #3047 we add methods that have to handle metadata correctly).

nalimilan · 2022-05-24T19:29:11Z

> We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I made "dict of dicts" as if only few columns have metadata it uses least memory. What alternatives do you see? Vector of dicts or dict of vectors?

Yes. More precisely a Dict{String, Vector} holding Vector{Union{Nothing, Some}} objects with one entry per column equal to nothing if no metadata is set for a column. The advantage would be to use less memory and to reduce the number of objects that the GC has to track in the case where all columns share common metadata keys (the main use case I have in mind is column labels), and there are more columns than different metadata keys. But given that the vectors wouldn't be concretely typed I'm not sure the gain would be so large, and you're right that it wouldn't be as efficient if metadata is set only for some columns. It would also be more complex as metadata(df, col) would have to return a custom lazy AbstractDict that would be a view of this structure.
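To make the comparison concrete, here is a rough sketch of the two layouts under discussion, using Base Julia only. The column names, keys, and the helper `colmeta_from_vectors` are hypothetical; the helper builds a fresh `Dict` eagerly, whereas the comment above envisions a lazy `AbstractDict` view.

```julia
# Layout A ("dict of dicts"): column name => (metadata key => value).
# Only columns that actually have metadata get an entry.
dict_of_dicts = Dict{String, Dict{String, Any}}(
    "x" => Dict{String, Any}("label" => "Height", "unit" => "cm"),
)

# Layout B ("dict of vectors"): metadata key => one slot per column,
# with `nothing` marking columns where the key is unset.
colnames = ["x", "y", "z"]
dict_of_vectors = Dict{String, Vector{Union{Nothing, Some}}}(
    "label" => Union{Nothing, Some}[Some("Height"), nothing, nothing],
    "unit"  => Union{Nothing, Some}[Some("cm"), nothing, nothing],
)

# Hypothetical helper: collect one column's metadata from layout B.
function colmeta_from_vectors(store, colnames, col)
    i = findfirst(==(col), colnames)
    i === nothing && throw(ArgumentError("column $col not found"))
    out = Dict{String, Any}()
    for (key, slots) in store
        slot = slots[i]
        slot === nothing || (out[key] = something(slot))
    end
    return out
end

colmeta_from_vectors(dict_of_vectors, colnames, "x")
```

Layout B allocates one vector per key regardless of how many columns use it, which suits the "every column has a label" case; layout A allocates one inner dict per annotated column, which suits sparse metadata.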

bkamins · 2022-05-24T19:39:53Z

Ah - now I see we do not need to wait for #3047, as I intentionally kept there only functions that do not mutate the list of columns. Functions such as pushfirst! will be problematic, but I will open a PR for this later.

docs/src/man/metadata.md

docs/src/lib/metadata.md

pdeffebach · 2022-06-08T15:32:59Z

Looking at the PR right now, is it true that if the column :y has metadata, then

transform(df, :y => f => :y)

will destroy that metadata?

nalimilan · 2022-06-08T15:40:00Z

Yes. The idea is that transform(df, :y => ByRow(y -> y^2) => :y) will make metadata such as :unit => "m/s" incorrect.

bkamins · 2022-06-08T18:21:04Z

The metadata would be retained only if f were identity or copy.
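The rule described in this exchange can be written down as a small predicate. This is a sketch in Base Julia of the rule as stated here, not the code in the PR; `propagate_colmeta` is a hypothetical name.

```julia
# Hypothetical helper: which column metadata survives `col => f => col`?
# Metadata is kept only when the transformation cannot change the meaning
# of the values, i.e. when f is `identity` or `copy`.
propagate_colmeta(meta::AbstractDict, f) =
    (f === identity || f === copy) ? copy(meta) : empty(meta)

meta = Dict("unit" => "m/s")
kept    = propagate_colmeta(meta, identity)      # metadata kept
dropped = propagate_colmeta(meta, y -> y .^ 2)   # metadata dropped
```

Squaring the values would make `"unit" => "m/s"` wrong, so dropping the metadata is the conservative choice for any function other than `identity` or `copy`.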

pdeffebach · 2022-06-08T18:42:43Z

Okay. I guess the equivalent in Stata is replace x = ... if .... And we don't have that feature yet.

bkamins · 2022-06-08T18:58:31Z

> I guess the equivalent in Stata is replace x = ... if .... And we don't have that feature yet.

Could you please elaborate what you mean there? Thank you!

bkamins · 2022-09-11T21:19:20Z

@nalimilan - I am done with the updates after your review. metadata.jl is significantly refactored.
I have resolved all comments that I believe are clear.
The only open item is #3055 (comment), but I think what I do there is OK, so it can also be resolved if you agree.

nalimilan

I'm lost in all these tests. I guess that means they cover almost everything. :-D

src/other/metadata.jl

test/metadata.jl

bkamins · 2022-09-17T17:34:53Z

@nalimilan - I have applied all suggestions. Things to discuss that I left unresolved:

- behavior of the delete*! functions (currently they are a silent no-op for a non-existent key, but throw an error for a non-existent column); I think this is OK
- what name to use for :none metadata
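The delete semantics in the first item can be illustrated on a stand-in storage in Base Julia. The function `delete_colmeta!` and its arguments are hypothetical, and the use of `ArgumentError` for a missing column is an assumption made for this sketch.

```julia
# Hypothetical stand-in for per-column metadata: column => (key => value).
function delete_colmeta!(store::Dict{String, Dict{String, Any}},
                         colnames::Vector{String}, col::String, key::String)
    # A non-existent column is an error (exception type assumed here).
    col in colnames || throw(ArgumentError("column $col not found"))
    # `delete!` on a Dict is already a silent no-op for a missing key.
    haskey(store, col) && delete!(store[col], key)
    return store
end

store = Dict{String, Dict{String, Any}}(
    "x" => Dict{String, Any}("unit" => "cm"),
)
cols = ["x", "y"]

delete_colmeta!(store, cols, "x", "no_such_key")   # no-op, no error
delete_colmeta!(store, cols, "y", "unit")          # column exists, no metadata: no-op
```

Calling `delete_colmeta!(store, cols, "w", "unit")` would throw, since `"w"` is not a column.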

nalimilan

OK, let's see how it goes! :-)

bkamins · 2022-09-19T20:57:30Z

Thank you! We are almost at 1.4 release.

bkamins added feature metadata labels May 22, 2022

bkamins added this to the 1.4 milestone May 22, 2022

bkamins mentioned this pull request May 22, 2022

add metadata JuliaData/DataAPI.jl#48

Merged

bkamins changed the title ~~Metadata on data frame level~~ Metadata on data frame and column level May 23, 2022

nalimilan reviewed May 24, 2022

This was referenced May 26, 2022

Import R data frame attributes as metadata JuliaData/RData.jl#93

Merged

Redesign ReadStatMeta and add ReadStatColMeta for DataAPI.jl v1.13 junyuan-chen/ReadStatTables.jl#6

Merged

bkamins commented Jun 4, 2022

nalimilan reviewed Jun 4, 2022

nalimilan reviewed Jun 4, 2022

bkamins added 4 commits September 11, 2022 11:38

fix typos

788750e

clean-up after code review

a884974

improve docstring of transformation metadata

752beeb

finalize responses to code review

3f4e53c

bkamins added 2 commits September 12, 2022 10:13

fix KeyError to ArgumentError in tests

dec06ab

fix docstring

9f50fbe

This was referenced Sep 13, 2022

Require Julia 1.6 #3145

Merged

1-arg permutedims(df) #3115

Merged

nalimilan reviewed Sep 16, 2022

bkamins and others added 5 commits September 17, 2022 09:38

Apply suggestions from code review

6d63131

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Apply suggestions from code review

8a58d86

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Merge branch 'main' into bk/metadata

a95e4ea

changes after code review

a20c281

Merge branch 'bk/metadata' of https://github.com/JuliaData/DataFrames.jl into bk/metadata

1c1d120

nalimilan and others added 3 commits September 17, 2022 22:14

Wording and hyphenation consistency

99ed0d7

change :none to :default and add tests for custom functions

d8ae441

Merge branch 'main' into bk/metadata

6921efd

nalimilan approved these changes Sep 19, 2022

bkamins added 2 commits September 19, 2022 20:37

Merge branch 'main' into bk/metadata

175d593

update to DataAPI.jl 1.11

ac91787

bkamins merged commit b01fd38 into main Sep 19, 2022

bkamins deleted the bk/metadata branch September 19, 2022 20:57

This was referenced Sep 20, 2022

Metadata for columns and/or DataFrames #35

Closed

What metadata should be #2276

Closed

Add metadata #2961

Closed

kescobo mentioned this pull request Dec 6, 2022

Faster methods for building metadata with subsets of fields EcoJulia/Microbiome.jl#137

Closed

bkamins mentioned this pull request Dec 8, 2022

Segmentation fault Julia 1.8.2, DataFrames v1.4.3 #3227

Closed
