Document use of isequal for comparisons #3313

Merged
merged 1 commit into from
Apr 19, 2023
2 changes: 1 addition & 1 deletion docs/src/man/categorical.md
@@ -1,4 +1,4 @@
# Categorical Data
# [Categorical Data](@id man-categorical)

Often, we have to deal with columns in a data frame that take on a small number
of levels:
11 changes: 11 additions & 0 deletions docs/src/man/joins.md
@@ -137,6 +137,17 @@ julia> crossjoin(people, jobs, makeunique = true)
4 │ 40 Jane Doe 60 Astronaut
```

## Key value comparisons and floating point values

Key values from the two or more data frames are compared using the `isequal`
function. This is consistent with the `Set` and `Dict` types in Julia Base.
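
As a minimal sketch of what this means in practice, the snippet below uses only Julia Base (no data frames) to contrast `isequal` with `==` and to show that `Dict` and `Set` follow `isequal`. The variable names `d` and `s` are illustrative only.

```julia
# `==` follows IEEE 754 semantics; `isequal` is the notion of equality used for hashing.
@show NaN == NaN                   # false
@show isequal(NaN, NaN)            # true
@show 0.0 == -0.0                  # true
@show isequal(0.0, -0.0)           # false
@show isequal(missing, missing)    # true, whereas `missing == missing` is `missing`

# `Dict` and `Set` compare keys with `isequal` (together with `hash`).
d = Dict(NaN => "nan", 0.0 => "zero", -0.0 => "negative zero")
@show length(d)                    # 3: `0.0` and `-0.0` are distinct keys
@show d[NaN]                       # "nan": `NaN` can be looked up

s = Set([NaN, NaN, 0.0, -0.0])
@show length(s)                    # 3: the duplicate `NaN` collapses
```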

It is not recommended to use floating point numbers as keys: floating point
comparisons can be surprising and unpredictable. If you do use floating point
keys, note that by default an error is raised when keys include `-0.0`
(negative zero) or `NaN` values. This can be overridden by wrapping the key
values in a [categorical](@ref man-categorical) vector.
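
The following is a hedged sketch of the behavior described above, assuming the DataFrames.jl and CategoricalArrays.jl packages are installed. The data frames `left` and `right` and their columns are made up for illustration, and `innerjoin` is used only as an example join function; the exact error type and message may differ between versions, so the sketch just prints whatever is thrown.

```julia
using DataFrames
using CategoricalArrays

left = DataFrame(key = [1.0, NaN], a = ["x", "y"])
right = DataFrame(key = [NaN, 2.0], b = ["p", "q"])

# Plain floating point keys containing `NaN`: expected to raise an error by default.
try
    innerjoin(left, right, on = :key)
catch err
    println("join failed: ", typeof(err))
end

# Wrap the key columns in categorical vectors to opt in to joining on such values.
left_cat = copy(left)
left_cat.key = categorical(left_cat.key)
right_cat = copy(right)
right_cat.key = categorical(right_cat.key)

joined = innerjoin(left_cat, right_cat, on = :key)
println(joined)
```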

## Joining on key columns with different names

In order to join data frames on keys which have different names in the left and
86 changes: 42 additions & 44 deletions src/join/composer.jl
@@ -640,13 +640,12 @@ change in future releases.
- `df1`, `df2`, `dfs...`: the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed). If more than two data frames are joined
then only a column name or a vector of column names are allowed.
`on` is a required argument.
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). When joining only two data frames, a `left=>right` pair of names
can be used instead of a name, for the case where a key has different names
in `df1` and `df2` (it is allowed to mix names and name pairs in a vector).
Key values are compared using `isequal`. `on` is a required argument.
- `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
@@ -666,7 +665,7 @@ change in future releases.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; if equal to `:notequal` then missings are dropped in `df1` and `df2`
`on` columns; `isequal` is used for comparisons of rows for equality
`on` columns.
- `order` : if `:undefined` (the default) the order of rows in the result is
undefined and may change in future releases. If `:left` then the order of
rows from the left data frame is retained. If `:right` then the order of rows
@@ -799,11 +798,12 @@ change in future releases.
- `df1`, `df2`: the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed).
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). A `left=>right` pair of names can be used instead of a name, for
the case where a key has different names in `df1` and `df2` (it is allowed to
mix names and name pairs in a vector). Key values are compared using
`isequal`. `on` is a required argument.
- `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
@@ -826,8 +826,7 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
`isequal` is used for comparisons of rows for equality
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
- `order` : if `:undefined` (the default) the order of rows in the result is
undefined and may change in future releases. If `:left` then the order of
rows from the left data frame is retained. If `:right` then the order of rows
@@ -955,11 +954,12 @@ change in future releases.
- `df1`, `df2`: the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed).
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). A `left=>right` pair of names can be used instead of a name, for
the case where a key has different names in `df1` and `df2` (it is allowed to
mix names and name pairs in a vector). Key values are compared using
`isequal`. `on` is a required argument.
- `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
@@ -982,8 +982,7 @@ change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns;
`isequal` is used for comparisons of rows for equality
matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns.
- `order` : if `:undefined` (the default) the order of rows in the result is
undefined and may change in future releases. If `:left` then the order of
rows from the left data frame is retained (non-matching rows are put at the end).
@@ -1113,13 +1112,12 @@ This behavior may change in future releases.
- `df1`, `df2`, `dfs...` : the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed). If more than two data frames are joined
then only a column name or a vector of column names are allowed.
`on` is a required argument.
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). When joining only two data frames, a `left=>right` pair of names
can be used instead of a name, for the case where a key has different names
in `df1` and `df2` (it is allowed to mix names and name pairs in a vector).
Key values are compared using `isequal`. `on` is a required argument.
- `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
@@ -1143,7 +1141,7 @@ This behavior may change in future releases.
data frame and left unchanged.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; `isequal` is used for comparisons of rows for equality
matched.
- `order` : if `:undefined` (the default) the order of rows in the result is
undefined and may change in future releases. If `:left` then the order of
rows from the left data frame is retained (non-matching rows are put at the end).
@@ -1289,11 +1287,12 @@ The order of rows in the result is kept from `df1`.
- `df1`, `df2`: the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed).
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). A `left=>right` pair of names can be used instead of a name, for
the case where a key has different names in `df1` and `df2` (it is allowed to
mix names and name pairs in a vector). Key values are compared using
`isequal`. `on` is a required argument.
- `makeunique` : ignored as no columns are added to `df1` columns
(it is provided for consistency with other functions).
- `indicator` : Default: `nothing`. If a `Symbol` or string, adds categorical indicator
@@ -1307,8 +1306,7 @@ The order of rows in the result is kept from `df1`.
By default no check is performed.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
`isequal` is used for comparisons of rows for equality
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
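
The forms accepted by `on` can be sketched as follows. This is an illustrative example rather than part of the package's documentation: the data frames and column names are hypothetical, and `innerjoin` is used as the example join function.

```julia
using DataFrames

people = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
jobs = DataFrame(ID = [20, 40], Job = ["Lawyer", "Doctor"])
jobs_renamed = rename(jobs, :ID => :PersonID)

# A single key name shared by both data frames.
by_name = innerjoin(people, jobs, on = :ID)

# A key with different names: pass a `left => right` pair.
by_pair = innerjoin(people, jobs_renamed, on = :ID => :PersonID)

# Multiple keys: pass a vector, in which names and pairs may be mixed.
a = DataFrame(City = ["A", "B"], ID = [1, 2], x = [10, 20])
b = DataFrame(City = ["A", "B"], Code = [1, 2], y = [0.1, 0.2])
by_vector = innerjoin(a, b, on = [:City, :ID => :Code])
```

When a pair is used, the key column in the result takes the name from the left data frame.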

It is not allowed to join on columns that contain `NaN` or `-0.0` in the real or
imaginary part of the number. If you need to perform a join on such values use
@@ -1400,11 +1398,12 @@ The order of rows in the result is kept from `df1`.
- `df1`, `df2`: the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed).
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). A `left=>right` pair of names can be used instead of a name, for
the case where a key has different names in `df1` and `df2` (it is allowed to
mix names and name pairs in a vector). Key values are compared using
`isequal`. `on` is a required argument.
- `makeunique` : ignored as no columns are added to `df1` columns
(it is provided for consistency with other functions).
- `validate` : whether to check that columns passed as the `on` argument
@@ -1414,8 +1413,7 @@ The order of rows in the result is kept from `df1`.
By default no check is performed.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
`isequal` is used for comparisons of rows for equality
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
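
The three `matchmissing` settings can be compared with a small sketch. It uses `innerjoin` as the example join function (for which `:notequal` drops missings from both data frames) and hypothetical data; the exact error type raised under `:error` is not asserted here, only printed.

```julia
using DataFrames

df1 = DataFrame(id = [1, missing], a = ["x", "y"])
df2 = DataFrame(id = [missing, 1], b = ["p", "q"])

# Default (`:error`): a `missing` in an `on` column is expected to throw.
try
    innerjoin(df1, df2, on = :id)
catch err
    println("join failed: ", typeof(err))
end

# `:equal`: missings are allowed and match each other (via `isequal`).
matched = innerjoin(df1, df2, on = :id, matchmissing = :equal)

# `:notequal`: rows with missing keys are dropped before matching.
dropped = innerjoin(df1, df2, on = :id, matchmissing = :notequal)

@show nrow(matched)   # 2
@show nrow(dropped)   # 1
```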

It is not allowed to join on columns that contain `NaN` or `-0.0` in the real or
imaginary part of the number. If you need to perform a join on such values use
14 changes: 7 additions & 7 deletions src/join/inplace.jl
@@ -15,11 +15,12 @@ added to `df1`.
- `df1`, `df2`: the `AbstractDataFrames` to be joined

# Keyword Arguments
- `on` : A column name to join `df1` and `df2` on. If the columns on which
`df1` and `df2` will be joined have different names, then a `left=>right`
pair can be passed. It is also allowed to perform a join on multiple columns,
in which case a vector of column names or column name pairs can be passed
(mixing names and pairs is allowed).
- `on` : The names of the key columns on which to join the data frames.
This can be a single name, or a vector of names (for joining on multiple
columns). A `left=>right` pair of names can be used instead of a name, for
the case where a key has different names in `df1` and `df2` (it is allowed to
mix names and name pairs in a vector). Key values are compared using
`isequal`. `on` is a required argument.
- `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
@@ -30,8 +31,7 @@ added to `df1`.
the column name will be modified if `makeunique=true`.
- `matchmissing` : if equal to `:error` throw an error if `missing` is present
in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
`isequal` is used for comparisons of rows for equality
matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.

The columns added to `df1` from `df2` will support missing values.
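
A short sketch of this, assuming the docstring above belongs to the in-place `leftjoin!` function of DataFrames.jl (the function name is not visible in this hunk) and using made-up data:

```julia
using DataFrames

df1 = DataFrame(id = [1, 2, 3], a = ["x", "y", "z"])
df2 = DataFrame(id = [1, 3], b = [10, 30])

# Adds column `b` to `df1` in place; `id = 2` has no match in `df2`.
leftjoin!(df1, df2, on = :id)

@show eltype(df1.b)   # Union{Missing, Int64}
@show df1.b           # the row for `id = 2` holds `missing`
```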
