Metadata on data frame and column level #3055

Merged 97 commits on Sep 19, 2022
Changes from 87 commits
ddc429a
initial metadata implementation without select et. al.
bkamins Jun 23, 2022
6987eb9
updates after code review
bkamins Jun 26, 2022
2002ca6
update metadatata rules
bkamins Jun 27, 2022
6c4ea7c
add prompt to fix stack and unstack rules
bkamins Jun 27, 2022
e3f3350
add table-level hascometadata
bkamins Jun 27, 2022
ebef71d
rewrite the metadata rules again
bkamins Jun 29, 2022
a41162d
start adding tests
bkamins Jul 1, 2022
4c18372
start systematic tests of metadata propagation
bkamins Jul 1, 2022
7487134
continue adding tests and patching things
bkamins Jul 2, 2022
ba03e6c
continue adding tests
bkamins Jul 2, 2022
e4ffab7
various fixes
bkamins Jul 2, 2022
dbb886e
tests up to vcat
bkamins Jul 2, 2022
9f703e8
finished adding abstractdataframe metadata tests
bkamins Jul 3, 2022
7edb1f3
keep adding tests up to reshaping
bkamins Jul 3, 2022
0dc2402
fix docs location
bkamins Jul 3, 2022
8c755c6
continue adding tested functions
bkamins Jul 3, 2022
6159e7b
handle all easy functions tests
bkamins Jul 3, 2022
71c18d0
add metadata to broadcasting
bkamins Jul 3, 2022
2044db5
add handling of append!/prepend!
bkamins Jul 4, 2022
2d385a6
add leftjoin! metadata
bkamins Jul 4, 2022
17e0fb3
finish adding join metadata
bkamins Jul 4, 2022
06b5aa2
update documentation
bkamins Jul 4, 2022
a8f35cb
partially correct transformation metadata
bkamins Jul 4, 2022
8425494
finish tests for GroupedDataFrame
bkamins Jul 4, 2022
b8cd9ba
finalize metadata PR
bkamins Jul 4, 2022
6340d8a
fix typo
bkamins Jul 4, 2022
8bf4116
remove not needed code paths
bkamins Jul 4, 2022
2298808
fix indentation
bkamins Jul 4, 2022
5bf7ad5
Apply suggestions from code review
bkamins Jul 17, 2022
09f0aae
Apply suggestions from code review
bkamins Jul 17, 2022
be26b32
Apply suggestions from code review
bkamins Jul 17, 2022
652d4cb
Apply suggestions from code review
bkamins Jul 17, 2022
0c399e0
Apply suggestions from code review
bkamins Jul 17, 2022
cfbab04
Apply suggestions from code review
bkamins Jul 17, 2022
3a70e22
Apply suggestions from code review
bkamins Jul 17, 2022
defb8b0
Apply suggestions from code review
bkamins Jul 17, 2022
1691e46
Apply suggestions from code review
bkamins Jul 17, 2022
4146b4c
change dropallmetadata! to dropmetadata!
bkamins Jul 17, 2022
d08eec4
update nonunique and completecases
bkamins Jul 17, 2022
9a757c1
update implementation following the review
bkamins Jul 17, 2022
5a22904
update tests and manual
bkamins Jul 17, 2022
7deac18
fix NEWS.md entry
bkamins Jul 17, 2022
f1736ac
fix manual
bkamins Jul 17, 2022
e82b6e5
fix tests
bkamins Jul 17, 2022
887cba6
fix tests
bkamins Jul 17, 2022
773ecfc
update documentation and interface
bkamins Aug 6, 2022
b2c2471
initial implementation for DataFrame
bkamins Aug 6, 2022
adf789e
implement metadata API
bkamins Aug 6, 2022
68cb600
done abstractdataframe.jl
bkamins Aug 7, 2022
d3d7164
do changes up to (excluding hcat!)
bkamins Aug 8, 2022
5d816e7
finished dataframe.jl
bkamins Aug 15, 2022
9c5369b
update rules up to join
bkamins Aug 17, 2022
0a47603
finish porting to :note rules
bkamins Aug 21, 2022
1cead97
basic functionality tests working
bkamins Aug 22, 2022
201fd26
tests up to permutedims (including)
bkamins Aug 29, 2022
adbfc44
add push!/append! tests
bkamins Aug 29, 2022
df64989
added joins
bkamins Aug 30, 2022
52709cf
finished tests
bkamins Aug 30, 2022
ae96f43
fix error in hcat!
bkamins Aug 30, 2022
68c4573
fix another old style call
bkamins Aug 30, 2022
b83d9b3
Merge branch 'main' into bk/metadata
bkamins Aug 30, 2022
1d7dd78
adjust to broadcating rule changes in Julia 1.7
bkamins Aug 31, 2022
0100a75
fix typo
bkamins Aug 31, 2022
afb0c10
improve test coverage
bkamins Aug 31, 2022
db971e0
improve test coverage
bkamins Aug 31, 2022
789e410
Apply suggestions from code review
bkamins Sep 9, 2022
5a260e8
Apply suggestions from code review
bkamins Sep 9, 2022
a620e7b
Apply suggestions from code review
bkamins Sep 9, 2022
dbcce2d
Apply suggestions from code review
bkamins Sep 9, 2022
bae8510
Apply suggestions from code review
bkamins Sep 9, 2022
c9bcdc1
Apply suggestions from code review
bkamins Sep 9, 2022
1177844
Apply suggestions from code review
bkamins Sep 9, 2022
0a69318
Apply suggestions from code review
bkamins Sep 9, 2022
f0d2627
Apply suggestions from code review
bkamins Sep 9, 2022
cad4c0c
Apply suggestions from code review
bkamins Sep 10, 2022
9e14b4d
Update src/other/metadata.jl
bkamins Sep 10, 2022
d845373
fix table-level, column-level, and :note-style
bkamins Sep 10, 2022
92eeb64
Update src/dataframe/dataframe.jl
bkamins Sep 10, 2022
87f5742
start implementing changes after code review
bkamins Sep 10, 2022
b95fb03
use @eval and change _df_ to _table_
bkamins Sep 11, 2022
489b78c
one more fix of _df_ to _table_
bkamins Sep 11, 2022
788750e
fix typos
bkamins Sep 11, 2022
a884974
clean-up after code review
bkamins Sep 11, 2022
752beeb
improve docstring of transformation metadata
bkamins Sep 11, 2022
3f4e53c
finalize responses to code review
bkamins Sep 11, 2022
dec06ab
fix KeyError to ArgumentError in tests
bkamins Sep 12, 2022
9f50fbe
fix docstring
bkamins Sep 12, 2022
6d63131
Apply suggestions from code review
bkamins Sep 17, 2022
8a58d86
Apply suggestions from code review
bkamins Sep 17, 2022
a95e4ea
Merge branch 'main' into bk/metadata
bkamins Sep 17, 2022
a20c281
changes after code review
bkamins Sep 17, 2022
1c1d120
Merge branch 'bk/metadata' of https://github.com/JuliaData/DataFrames…
bkamins Sep 17, 2022
99ed0d7
Wording and hyphenation consistency
nalimilan Sep 17, 2022
d8ae441
change :none to :default and add tests for custom functions
bkamins Sep 17, 2022
6921efd
Merge branch 'main' into bk/metadata
bkamins Sep 18, 2022
175d593
Merge branch 'main' into bk/metadata
bkamins Sep 19, 2022
ac91787
update to DataAPI.jl 1.11
bkamins Sep 19, 2022
17 changes: 17 additions & 0 deletions NEWS.md
@@ -44,6 +44,12 @@
* New `threads` argument allows disabling multithreading in
`combine`, `select`, `select!`, `transform`, `transform!`, `subset` and `subset!`
([#3030](https://github.com/JuliaData/DataFrames.jl/pull/3030))
* Add support for table-level and column-level metadata using
DataAPI.jl interface
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))
* `completecases` and `nonunique` no longer throw an error when a data frame
with no columns is passed
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))

## Previously announced breaking changes

@@ -52,8 +58,19 @@
or older it is an in place operation.
([#3022](https://github.com/JuliaData/DataFrames.jl/pull/3022))

## Internal changes

* `DataFrame` is now a `mutable struct` and has three new fields
`metadata`, `colmetadata`, and `allnotemetadata`;
this change makes `DataFrame` objects serialized under
earlier versions of DataFrames.jl incompatible with version 1.4
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))

## Bug fixes

* Fix dispatch ambiguity in `rename` and `rename!` when only
the source data frame is passed
([#3055](https://github.com/JuliaData/DataFrames.jl/pull/3055))
* Make sure that `AsTable` accepts only valid argument
([#3064](https://github.com/JuliaData/DataFrames.jl/pull/3064))
* Make sure we avoid aliasing when repeating the same column
2 changes: 1 addition & 1 deletion Project.toml
@@ -26,7 +26,7 @@ Unicode = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
[compat]
CategoricalArrays = "0.10.0"
Compat = "3.46, 4.2"
DataAPI = "1.10"
DataAPI = "1.10" # change to 1.11 when DataAPI.jl is released
InvertedIndices = "1"
IteratorInterfaceExtensions = "0.1.1, 1"
Missings = "0.4.2, 1"
1 change: 1 addition & 0 deletions docs/make.jl
@@ -38,6 +38,7 @@ makedocs(
"Types" => "lib/types.md",
"Functions" => "lib/functions.md",
"Indexing" => "lib/indexing.md",
"Metadata" => "lib/metadata.md",
hide("Internals" => "lib/internals.md"),
]
],
14 changes: 14 additions & 0 deletions docs/src/lib/functions.md
@@ -189,3 +189,17 @@ pairs
```@docs
isapprox
```

## Metadata
```@docs
metadata
metadatakeys
metadata!
deletemetadata!
emptymetadata!
colmetadata
colmetadatakeys
colmetadata!
deletecolmetadata!
emptycolmetadata!
```
291 changes: 291 additions & 0 deletions docs/src/lib/metadata.md
@@ -0,0 +1,291 @@
# Metadata

## Design of metadata support

DataFrames.jl allows you to store and retrieve metadata at the table and column
level. This is supported using the functions defined by the DataAPI.jl interface:

* for table-level metadata: [`metadata`](@ref), [`metadatakeys`](@ref),
[`metadata!`](@ref), [`deletemetadata!`](@ref), [`emptymetadata!`](@ref);
* for column-level metadata: [`colmetadata`](@ref), [`colmetadatakeys`](@ref),
[`colmetadata!`](@ref), [`deletecolmetadata!`](@ref), [`emptycolmetadata!`](@ref).

Assume that we work with a data frame-like object `df` that has a column `col`
(referred to either via a `Symbol`, a string or an integer index).

Table-level metadata are key-value pairs that are attached to `df`.
Column-level metadata are key-value pairs that are attached to
a specific column `col` of `df` data frame.

Additionally, each metadata key-value pair has style information attached to
it.
In DataFrames.jl the metadata style influences how metadata is propagated when
`df` is transformed. The following metadata styles are supported:

* `:none`: Metadata having this style is considered to be attached to a concrete
state of `df`. This means that any operation on this data frame
invalidates such metadata and it is dropped in the result of such operation.
Note that this happens even if the operation eventually does not change
the data frame: the rule is that calling a function that might alter a data
frame drops such metadata; in this way it is possible to statically determine
whether metadata of styles other than `:note` is dropped after a function call.
Only two functions are exceptions that keep non-`:note`-style metadata, as these
operations are specifically designed to create an identical copy of the source
data frame:
- [`DataFrame`](@ref) constructor;
- [`copy`](@ref) of a data frame;
* `:note`: Metadata having this style is considered to be an annotation of
a table or a column that should be propagated under transformations
(exact propagation rules of such metadata are described below).
* All other metadata styles are allowed but they are currently treated as having
`:none`-style (this might change in the future if other standard metadata
styles are defined).
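
As a rough illustration of the two styles, the sketch below assumes a released DataFrames.jl (1.4 or later) in which the functions listed above are available. The table contents and the keys `"caption"` and `"checksum"` are made up for the example.

```julia
using DataFrames

# Hypothetical toy table; keys and values are illustrative only.
df = DataFrame(a=1:3, b=4:6)
metadata!(df, "caption", "toy table", style=:note)  # annotation that should propagate
metadata!(df, "checksum", "abc123", style=:none)    # tied to this exact state of df

# A transformation such as select keeps only :note-style metadata.
selected_keys = collect(metadatakeys(select(df, :a)))

# copy is one of the two documented exceptions that keep metadata of every style.
copied_keys = sort(collect(metadatakeys(copy(df))))
```

Here `selected_keys` is expected to contain only `"caption"`, while `copied_keys` should contain both keys.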

All DataAPI.jl metadata functions work with [`DataFrame`](@ref),
[`SubDataFrame`](@ref), [`DataFrameRow`](@ref)
objects, and objects returned by [`eachrow`](@ref) and [`eachcol`](@ref)
functions. In this section collectively these objects will be called
*data frame-like*, and follow the rules:

* objects returned by
[`eachrow`](@ref) and [`eachcol`](@ref) functions have the same metadata
as their parent `AbstractDataFrame`;
* [`SubDataFrame`](@ref) and [`DataFrameRow`](@ref) have only metadata from
their parent `DataFrame` that has `:note`-style.
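
The two rules above can be sketched as follows, assuming a released DataFrames.jl (1.4 or later); the keys `"caption"` and `"state"` are hypothetical and chosen only for illustration.

```julia
using DataFrames

df = DataFrame(a=1:3)
metadata!(df, "caption", "toy table", style=:note)
metadata!(df, "state", "rows 1:3 as loaded", style=:none)

# Objects returned by eachcol (and eachrow) report the same keys as their parent.
iterator_keys = sort(collect(metadatakeys(eachcol(df))))

# A SubDataFrame and a DataFrameRow expose only the :note-style entries.
view_keys = collect(metadatakeys(view(df, 1:2, :)))
row_keys = collect(metadatakeys(df[1, :]))
```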

Notably metadata is not supported for [`GroupedDataFrame`](@ref) as it does not
expose columns directly. You can inspect metadata of the `parent` of a
[`GroupedDataFrame`](@ref) or of any of its groups.

!!! note

DataFrames.jl allows users to extract out columns of a data frame
and perform operations on them. Such operations will not affect
metadata. Therefore, even if some metadata has `:none` style it might
no longer correctly describe the column's contents if the user mutates
columns directly.

### DataFrames.jl-specific design principles for use of metadata

DataFrames.jl supports storing any object as metadata values. However,
it is recommended to use strings as values of the metadata,
as some storage formats, like for example Apache Arrow, only support
strings.

For all functions that operate on column-level metadata, an `ArgumentError` is
thrown if the passed column is not present in the data frame.

If [`metadata!`](@ref) or [`colmetadata!`](@ref) is used to add metadata
to a [`SubDataFrame`](@ref) or a [`DataFrameRow`](@ref) then:

* using `:none` style for metadata throws an error;
* trying to add key-value pair for which a mapping for key already exists
with the `:none` style in the parent data frame throws an error.
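
A minimal sketch of these error cases, assuming a released DataFrames.jl (1.4 or later). The helper `caught` is not part of any package; it is defined here only to capture whatever exception a call throws.

```julia
using DataFrames

df = DataFrame(a=1:3)
sdf = view(df, 1:2, :)

# Hypothetical helper: run f and return the exception it throws, or nothing.
function caught(f)
    try
        f()
        return nothing
    catch err
        return err
    end
end

# Asking for metadata of a column that does not exist is an error
# (documented above as an ArgumentError).
missing_column = caught(() -> colmetadata(df, :no_such_column, "label"))

# Adding metadata with a style other than :note through a view is an error.
non_note_on_view = caught(() -> metadata!(sdf, "state", "first two rows", style=:none))

# :note-style metadata can be added through the view and is visible in the parent.
metadata!(sdf, "caption", "toy table", style=:note)
parent_caption = metadata(df, "caption")
```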

DataFrames.jl is designed so that there is no performance overhead due to metadata support
when there is no metadata in a data frame. Therefore, if you need maximum performance
for operations that do not rely on metadata, call `emptymetadata!` and
`emptycolmetadata!` before running these operations.

Processing metadata for `SubDataFrame` and `DataFrameRow` has more overhead
than for other types defined in DataFrames.jl that support metadata, because
they have more complex logic for handling it (they support only `:note`-style
metadata, which means that other metadata needs to be filtered out).

## Examples

Here is a simple example of how you can work with metadata in DataFrames.jl:

```jldoctest dataframe
julia> using DataFrames

julia> df = DataFrame(name=["Jan Krzysztof Duda", "Jan Krzysztof Duda",
"Radosław Wojtaszek", "Radosław Wojtaszek"],
date=["2022-Jun", "2021-Jun", "2022-Jun", "2021-Jun"],
rating=[2750, 2729, 2708, 2687])
4×3 DataFrame
Row │ name date rating
│ String String Int64
─────┼──────────────────────────────────────
1 │ Jan Krzysztof Duda 2022-Jun 2750
2 │ Jan Krzysztof Duda 2021-Jun 2729
3 │ Radosław Wojtaszek 2022-Jun 2708
4 │ Radosław Wojtaszek 2021-Jun 2687

julia> metadatakeys(df)
()

julia> metadata!(df, "caption", "ELO ratings of chess players", style=:note);

julia> collect(metadatakeys(df))
1-element Vector{String}:
"caption"

julia> metadata(df, "caption")
"ELO ratings of chess players"

julia> metadata(df, "caption", style=true)
("ELO ratings of chess players", :note)

julia> emptymetadata!(df);

julia> metadatakeys(df)
()

julia> colmetadatakeys(df)
()

julia> colmetadata!(df, :name, "label", "First and last name of a player", style=:note);

julia> colmetadata!(df, :date, "label", "Rating date in yyyy-u format", style=:note);

julia> colmetadata!(df, :rating, "label", "ELO rating in classical time control", style=:note);

julia> colmetadata(df, :rating, "label")
"ELO rating in classical time control"

julia> colmetadata(df, :rating, "label", style=true)
("ELO rating in classical time control", :note)

julia> collect(colmetadatakeys(df))
3-element Vector{Pair{Symbol, Base.KeySet{String, Dict{String, Tuple{Any, Any}}}}}:
:date => ["label"]
:rating => ["label"]
:name => ["label"]

julia> [only(names(df, col)) =>
[key => colmetadata(df, col, key) for key in metakeys] for
(col, metakeys) in colmetadatakeys(df)]
3-element Vector{Pair{String, Vector{Pair{String, String}}}}:
"date" => ["label" => "Rating date in yyyy-u format"]
"rating" => ["label" => "ELO rating in classical time control"]
"name" => ["label" => "First and last name of a player"]

julia> emptycolmetadata!(df);

julia> colmetadatakeys(df)
()
```

## Propagation of `:note`-style metadata

An important design feature of `:note`-style metadata is how it is handled when
data frames are transformed.

!!! note

The provided rules might slightly change in the future. Any change to
`:note`-style metadata propagation rules will not be considered as breaking
and can be done in any minor release of DataFrames.jl.
Such changes might be made based on users' feedback about what metadata
propagation rules are most convenient in practice.

The general design rules for propagation of `:note`-style metadata are as follows.

For operations that take a single data frame as an input:
* Table-level metadata is propagated to the returned data frame object.
* For column-level metadata:
- in all cases when a single column is transformed to
a single column and the name of the column does
not change (or is automatically changed e.g. to de-duplicate column names or
via column renaming in joins)
column-level metadata is preserved (example operations of this kind are
`getindex`, `subset`, joins, `mapcols`).
- in all cases when a single column is transformed with `identity` or `copy` to a single column,
column-level metadata is preserved even if column name is changed (example
operations of this kind are `rename`, or the `:x => :y` or
`:x => copy => :y` operation specification in `select`).
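
The single-table rules can be sketched with `select`, assuming a released DataFrames.jl (1.4 or later). The column names and the `"unit"` key are hypothetical.

```julia
using DataFrames

df = DataFrame(x=[1.0, 2.0])
colmetadata!(df, :x, "unit", "kg", style=:note)

res = select(df,
             :x => :y,                # renaming: metadata follows the column
             :x => ByRow(log) => :x,  # same name: metadata is kept (even if no longer accurate)
             :x => ByRow(log) => :z)  # new name and a real transformation: metadata is not kept

unit_y = colmetadata(res, :y, "unit")
unit_x = colmetadata(res, :x, "unit")
z_has_no_metadata = isempty(colmetadatakeys(res, :z))
```

The second transformation shows why keeping the source name can be risky: the `"unit"` entry survives although the values are now logarithms.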

For operations that take multiple data frames as their input two cases are distinguished:

- When there is a natural main table in the operation (`append!`, `prepend!`,
`leftjoin`, `leftjoin!`, `rightjoin`, `semijoin`, `antijoin`, `setindex!`):
- table-level metadata is taken from the main table;
- column-level metadata for columns from the main table is taken from the main table;
- column-level metadata for columns from the non-main table is taken only for
columns not present in the main table.
- When all tables are equivalent (`hcat`, `vcat`, `innerjoin`, `outerjoin`):
- table-level metadata is preserved only for keys which are defined
in all passed tables and have the same value;
- column-level metadata is preserved only for keys which are defined
in all passed tables that contain this column and have the same value.
In all these operations when metadata is preserved the values in the key-value
pairs are not copied (this is relevant in case of mutable values).
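
The two multi-table cases can be sketched as follows, assuming a released DataFrames.jl (1.4 or later). All table names, keys, and values are invented for the example: `leftjoin` stands for an operation with a main table and `vcat` for one where all tables are equivalent.

```julia
using DataFrames

left = DataFrame(id=[1, 2], a=[10, 20])
right = DataFrame(id=[1, 2], b=[30, 40])
extra = DataFrame(id=[3], a=[30])

for (tbl, src) in ((left, "left table"), (right, "right table"), (extra, "extra table"))
    metadata!(tbl, "source", src, style=:note)     # differs between tables
    metadata!(tbl, "license", "CC0", style=:note)  # identical in all tables
end
colmetadata!(right, :b, "label", "measurement b", style=:note)

# Main-table case: table-level metadata comes from the left table,
# column-level metadata of :b comes from the right table.
joined = leftjoin(left, right, on=:id)
joined_source = metadata(joined, "source")
joined_b_label = colmetadata(joined, :b, "label")

# Equivalent-tables case: only keys with matching values in all tables survive.
stacked_keys = collect(metadatakeys(vcat(left, extra)))
```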

!!! note

The rules for `:note`-style column-level metadata propagation are designed
to make the right decision in common cases. In particular, they assume that if
the source and target column names are the same then the metadata for the column is
not changed. While this is valid for many operations, it is not always true
in general. For example, the `:x => ByRow(log) => :x` transformation might
invalidate metadata if it contained the unit of measure of the variable. In such
cases the user must either use a different name for the output column,
set the metadata style to `:none` before the operation,
or manually drop or update such metadata from the `:x` column
after the transformation.

### Operations that preserve `:note`-style metadata

Most of the functions in DataFrames.jl just preserve table and column metadata
that has `:note`-style.
Some functions use a more complex logic, even if they follow the general rules
described above (in particular under any transformation all non-`:note`-style
metadata is always dropped). These are:

* [`describe`](@ref) drops all metadata.
* [`hcat`](@ref): propagates table-level metadata only for keys which are defined
in all passed tables and have the same value;
column-level metadata is preserved.
* [`vcat`](@ref): propagates table-level metadata only for keys which are defined
in all passed tables and have the same value;
column-level metadata is preserved only for keys which are defined
in all passed tables that contain this column and have the same value;
* [`stack`](@ref): propagates table-level metadata and column-level metadata
for identifier columns.
* [`unstack`](@ref): propagates table-level metadata and column-level metadata
for row keys columns.
* [`permutedims`](@ref): propagates table-level metadata and drops column-level
metadata.
* broadcasted assignment does not change target metadata;
under Julia earlier than 1.7 an operation of the kind `df.a .= s` does not drop non-`:note`-style
metadata; under Julia 1.7 or later this operation preserves only `:note`-style
metadata.
* broadcasting propagates table-level metadata if some key is present
in all passed data frames and value associated with it is identical in all
passed data frames; column-level metadata is propagated for columns if some
key for a given column is present in all passed data frames and value
associated with it is identical in all passed data frames.
* `getindex` preserves table-level metadata and column-level metadata
for selected columns
* `setindex!` does not affect table-level and column-level metadata
* [`push!`](@ref), [`pushfirst!`](@ref), [`insert!`](@ref) do not affect
table-level nor column-level metadata (even if they add new columns and pushed row is
a `DataFrameRow` or other value supporting metadata interface)
* [`append!`](@ref) and [`prepend!`](@ref) do not change table and column-level
metadata of the destination data frame, except that if new columns are added
and these columns have metadata in the appended/prepended table then this
metadata is preserved.
* [`leftjoin!`](@ref), [`leftjoin`](@ref): table and column-level metadata is
taken from the left table except for non-key columns from right table for which
metadata is taken from right table;
* [`rightjoin`](@ref): table and column-level metadata is taken from the right
table except for non-key columns from left table for which metadata is
taken from left table;
* [`innerjoin`](@ref), [`outerjoin`](@ref): propagates table-level metadata only for keys
that are defined in all passed data frames and have the same value;
column-level metadata is propagated for all columns except for key
columns, for which it is propagated only for keys that are defined
in all passed data frames and have the same value.
* [`semijoin`](@ref), [`antijoin`](@ref): table and column-level metadata is
taken from the left table.
* [`crossjoin`](@ref): propagates table-level metadata only for keys
that are defined in both passed data frames and have the same value;
propagates column-level metadata from both passed data frames.
* [`select`](@ref), [`select!`](@ref), [`transform`](@ref),
[`transform!`](@ref), [`combine`](@ref): propagate table-level metadata;
column-level metadata is propagated if:
a) a single column is transformed to a single column and the name of the column does not change
(this includes all column selection operations), or
b) a single column is transformed with `identity` or `copy` to a single column
even if column name is changed (this includes column renaming).
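
Two entries from the list above, `describe` and `getindex`, can be checked with a short sketch. It assumes a released DataFrames.jl (1.4 or later); the keys and values are hypothetical.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
metadata!(df, "caption", "toy table", style=:note)
colmetadata!(df, :a, "label", "first column", style=:note)

# describe builds a summary table and drops all metadata.
described_has_no_metadata = isempty(metadatakeys(describe(df)))

# getindex keeps table-level metadata and column-level metadata of the selected columns.
part = df[1:2, [:a]]
part_caption = metadata(part, "caption")
part_label = colmetadata(part, :a, "label")
```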