diff --git a/docs/src/man/categorical.md b/docs/src/man/categorical.md
index 5a7c0e78d1..11db4e12a9 100755
--- a/docs/src/man/categorical.md
+++ b/docs/src/man/categorical.md
@@ -1,4 +1,4 @@
-# Categorical Data
+# [Categorical Data](@id man-categorical)
 
 Often, we have to deal with columns in a data frame that take on a small number of
 levels:
diff --git a/docs/src/man/joins.md b/docs/src/man/joins.md
index 31dbd86bb9..d9696d5efa 100644
--- a/docs/src/man/joins.md
+++ b/docs/src/man/joins.md
@@ -137,6 +137,17 @@ julia> crossjoin(people, jobs, makeunique = true)
    4 │ 40 Jane Doe 60 Astronaut
 ```
 
+## Key value comparisons and floating point values
+
+Key values from the two or more data frames are compared using the `isequal`
+function. This is consistent with the `Set` and `Dict` types in Julia Base.
+
+It is not recommended to use floating point numbers as keys: floating point
+comparisons can be surprising and unpredictable. If you do use floating point
+keys, note that by default an error is raised when keys include `-0.0`
+(negative zero) or `NaN` values. This can be overridden by wrapping the key
+values in a [categorical](@ref man-categorical) vector.
+
 ## Joining on key columns with different names
 
 In order to join data frames on keys which have different names in the left and
diff --git a/src/join/composer.jl b/src/join/composer.jl
index 8a1d07ac20..5f79c725b8 100644
--- a/src/join/composer.jl
+++ b/src/join/composer.jl
@@ -640,13 +640,12 @@ change in future releases.
 - `df1`, `df2`, `dfs...`: the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed). If more than two data frames are joined
-  then only a column name or a vector of column names are allowed.
-  `on` is a required argument.
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). When joining only two data frames, a `left=>right` pair of names
+  can be used instead of a name, for the case where a key has different names
+  in `df1` and `df2` (it is allowed to mix names and name pairs in a vector).
+  Key values are compared using `isequal`. `on` is a required argument.
 - `makeunique` : if `false` (the default), an error will be raised
   if duplicate names are found in columns not joined on;
   if `true`, duplicate names will be suffixed with `_i`
@@ -666,7 +665,7 @@ change in future releases.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
   matched; if equal to `:notequal` then missings are dropped in `df1` and `df2`
-  `on` columns; `isequal` is used for comparisons of rows for equality
+  `on` columns.
 - `order` : if `:undefined` (the default) the order of rows in the result is
   undefined and may change in future releases. If `:left` then the order of
   rows from the left data frame is retained. If `:right` then the order of rows
@@ -799,11 +798,12 @@ change in future releases.
 - `df1`, `df2`: the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed).
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). A `left=>right` pair of names can be used instead of a name, for
+  the case where a key has different names in `df1` and `df2` (it is allowed to
+  mix names and name pairs in a vector). Key values are compared using
+  `isequal`. `on` is a required argument.
 - `makeunique` : if `false` (the default), an error will be raised
   if duplicate names are found in columns not joined on;
   if `true`, duplicate names will be suffixed with `_i`
@@ -826,8 +826,7 @@ change in future releases.
   data frame and left unchanged.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
-  `isequal` is used for comparisons of rows for equality
+  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
 - `order` : if `:undefined` (the default) the order of rows in the result is
   undefined and may change in future releases. If `:left` then the order of
   rows from the left data frame is retained. If `:right` then the order of rows
@@ -955,11 +954,12 @@ change in future releases.
 - `df1`, `df2`: the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed).
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). A `left=>right` pair of names can be used instead of a name, for
+  the case where a key has different names in `df1` and `df2` (it is allowed to
+  mix names and name pairs in a vector). Key values are compared using
+  `isequal`. `on` is a required argument.
 - `makeunique` : if `false` (the default), an error will be raised
   if duplicate names are found in columns not joined on;
   if `true`, duplicate names will be suffixed with `_i`
@@ -982,8 +982,7 @@ change in future releases.
   data frame and left unchanged.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-  matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns;
-  `isequal` is used for comparisons of rows for equality
+  matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns.
 - `order` : if `:undefined` (the default) the order of rows in the result is
   undefined and may change in future releases. If `:left` then the order of
   rows from the left data frame is retained (non-matching rows are put at the end).
@@ -1113,13 +1112,12 @@ This behavior may change in future releases.
 - `df1`, `df2`, `dfs...` : the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed). If more than two data frames are joined
-  then only a column name or a vector of column names are allowed.
-  `on` is a required argument.
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). When joining only two data frames, a `left=>right` pair of names
+  can be used instead of a name, for the case where a key has different names
+  in `df1` and `df2` (it is allowed to mix names and name pairs in a vector).
+  Key values are compared using `isequal`. `on` is a required argument.
 - `makeunique` : if `false` (the default), an error will be raised
   if duplicate names are found in columns not joined on;
   if `true`, duplicate names will be suffixed with `_i`
@@ -1143,7 +1141,7 @@ This behavior may change in future releases.
   data frame and left unchanged.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-  matched; `isequal` is used for comparisons of rows for equality
+  matched.
 - `order` : if `:undefined` (the default) the order of rows in the result is
   undefined and may change in future releases. If `:left` then the order of
   rows from the left data frame is retained (non-matching rows are put at the end).
@@ -1289,11 +1287,12 @@ The order of rows in the result is kept from `df1`.
 - `df1`, `df2`: the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed).
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). A `left=>right` pair of names can be used instead of a name, for
+  the case where a key has different names in `df1` and `df2` (it is allowed to
+  mix names and name pairs in a vector). Key values are compared using
+  `isequal`. `on` is a required argument.
 - `makeunique` : ignored as no columns are added to `df1` columns
   (it is provided for consistency with other functions).
 - `indicator` : Default: `nothing`. If a `Symbol` or string, adds categorical indicator
@@ -1307,8 +1306,7 @@ The order of rows in the result is kept from `df1`.
   By default no check is performed.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
-  `isequal` is used for comparisons of rows for equality
+  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
 
 It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
 imaginary part of the number. If you need to perform a join on such values use
@@ -1400,11 +1398,12 @@ The order of rows in the result is kept from `df1`.
 - `df1`, `df2`: the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed).
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). A `left=>right` pair of names can be used instead of a name, for
+  the case where a key has different names in `df1` and `df2` (it is allowed to
+  mix names and name pairs in a vector). Key values are compared using
+  `isequal`. `on` is a required argument.
 - `makeunique` : ignored as no columns are added to `df1` columns
   (it is provided for consistency with other functions).
 - `validate` : whether to check that columns passed as the `on` argument
@@ -1414,8 +1413,7 @@ The order of rows in the result is kept from `df1`.
   By default no check is performed.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
-  `isequal` is used for comparisons of rows for equality
+  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
 
 It is not allowed to join on columns that contain `NaN` or `-0.0` in real or
 imaginary part of the number. If you need to perform a join on such values use
diff --git a/src/join/inplace.jl b/src/join/inplace.jl
index d2b87d84b2..9f1a9e0c6c 100644
--- a/src/join/inplace.jl
+++ b/src/join/inplace.jl
@@ -15,11 +15,12 @@ added to `df1`.
 - `df1`, `df2`: the `AbstractDataFrames` to be joined
 
 # Keyword Arguments
-- `on` : A column name to join `df1` and `df2` on. If the columns on which
-  `df1` and `df2` will be joined have different names, then a `left=>right`
-  pair can be passed. It is also allowed to perform a join on multiple columns,
-  in which case a vector of column names or column name pairs can be passed
-  (mixing names and pairs is allowed).
+- `on` : The names of the key columns on which to join the data frames.
+  This can be a single name, or a vector of names (for joining on multiple
+  columns). A `left=>right` pair of names can be used instead of a name, for
+  the case where a key has different names in `df1` and `df2` (it is allowed to
+  mix names and name pairs in a vector). Key values are compared using
+  `isequal`. `on` is a required argument.
 - `makeunique` : if `false` (the default), an error will be raised
   if duplicate names are found in columns not joined on;
   if `true`, duplicate names will be suffixed with `_i`
@@ -30,8 +31,7 @@ added to `df1`.
   the column name will be modified if `makeunique=true`.
 - `matchmissing` : if equal to `:error` throw an error if `missing` is present
   in `on` columns; if equal to `:equal` then `missing` is allowed and missings are
-  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns;
-  `isequal` is used for comparisons of rows for equality
+  matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns.
 
 The columns added to `df1` from `df2` will support missing values.