Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite reshaping functions to improve performance #285

Merged
merged 29 commits into from
Oct 10, 2022
Merged
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
9bf7e88
init
etiennebacher Oct 3, 2022
37e9bb7
deal with names_sep and names_pattern
etiennebacher Oct 4, 2022
ffb5e3e
remove unused helper
etiennebacher Oct 4, 2022
b868c6c
fix row issues, make benchmark
etiennebacher Oct 4, 2022
9d12a41
start rewriting data_to_wide
etiennebacher Oct 5, 2022
4945f8b
almost finished, two tests to fix
etiennebacher Oct 5, 2022
bb1f729
update benchmark
etiennebacher Oct 5, 2022
2fb2d38
minor
etiennebacher Oct 5, 2022
b05c2a4
fix tests
etiennebacher Oct 6, 2022
dd92c9a
polish code for data_to_wide
etiennebacher Oct 6, 2022
3d107ff
polish code for data_to_long
etiennebacher Oct 7, 2022
4ad020b
finish polishing, reorganize files
etiennebacher Oct 7, 2022
cfb35bf
put reprex in benchmark file
etiennebacher Oct 7, 2022
1ccdd02
minor
etiennebacher Oct 7, 2022
a9d16ae
minor
etiennebacher Oct 7, 2022
6473150
Merge branch 'main' into rewrite-reshape
etiennebacher Oct 7, 2022
9f8688e
bump version and news [skip ci]
etiennebacher Oct 7, 2022
b20255f
comment
etiennebacher Oct 7, 2022
8b9630e
Update R/data_to_long.R
etiennebacher Oct 7, 2022
dfccecc
add comment [skip ci]
etiennebacher Oct 7, 2022
bdf1393
Merge branch 'rewrite-reshape' of https://github.com/easystats/datawi…
etiennebacher Oct 7, 2022
8ed9bc5
fix test broken by modified message
etiennebacher Oct 7, 2022
8e7ab2c
add missing snapshot
etiennebacher Oct 7, 2022
7f11408
remove old.R
etiennebacher Oct 7, 2022
18cf17a
forgot to devtools::document()
etiennebacher Oct 7, 2022
40036e8
move benchmarks to /dev [skip ci]
etiennebacher Oct 10, 2022
60b9797
update branch [skip ci]
etiennebacher Oct 10, 2022
bf87604
Merge branch 'main' into rewrite-reshape
etiennebacher Oct 10, 2022
145bf68
fix typo in news [skip ci]
etiennebacher Oct 10, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.6.2
Version: 0.6.2.1
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
Expand Down
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -220,6 +220,8 @@ export(kurtosis)
export(normalize)
export(object_has_names)
export(object_has_rownames)
export(old_data_to_long)
export(old_data_to_wide)
etiennebacher marked this conversation as resolved.
Show resolved Hide resolved
export(print_html)
export(print_md)
export(ranktransform)
Expand Down
7 changes: 7 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,10 @@
# datawizard (development version)

MAJOR CHANGES

* `data_to_long()` and `data_to_wide()` have had significant performance improvements,
sometimes as high as a ten-fold speedup.

# datawizard 0.6.2

BREAKING CHANGES
Expand Down
282 changes: 282 additions & 0 deletions R/data_to_long.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,282 @@
#' Reshape (pivot) data from wide to long
#'
#' This function "lengthens" data, increasing the number of rows and decreasing
#' the number of columns. This is a dependency-free base-R equivalent of
#' `tidyr::pivot_longer()`.
#'
#' @param data A data frame to pivot.
#' @param names_to The name of the new column that will contain the column
#' names.
#' @param names_prefix A regular expression used to remove matching text from
#' the start of each variable name.
#' @param names_sep,names_pattern If `names_to` contains multiple values, this
#' argument controls how the column name is broken up.
#' `names_pattern` takes a regular expression containing matching groups, i.e. "()".
#' @param values_to The name of the new column that will contain the values of
#' the pivoted variables.
#' @param values_drop_na If `TRUE`, will drop rows that contain only `NA` in the
#' `values_to` column. This effectively converts explicit missing values to
#' implicit missing values, and should generally be used only when missing values
#' in data were created by its structure.
#' @param rows_to The name of the column that will contain the row names or row
#' numbers from the original data. If `NULL`, will be removed.
#' @param ... Currently not used.
#' @inheritParams find_columns
#' @param cols Identical to `select`. This argument is here to ensure compatibility
#' with `tidyr::pivot_longer()`. If both `select` and `cols` are provided, `cols`
#' is used.
#' @param colnames_to Deprecated. Use `names_to` instead.
#'
#' @return If a tibble was provided as input, `reshape_longer()` also returns a
#' tibble. Otherwise, it returns a data frame.
#'
#' @examples
#' \donttest{
#' wide_data <- data.frame(replicate(5, rnorm(10)))
#'
#' # Default behaviour (equivalent to tidyr::pivot_longer(wide_data, cols = 1:5))
#' data_to_long(wide_data)
#'
#' # Customizing the names
#' data_to_long(wide_data,
#' select = c(1, 2),
#' names_to = "Column",
#' values_to = "Numbers",
#' rows_to = "Row"
#' )
#'
#' # Full example
#' # ------------------
#' if (require("psych")) {
#' data <- psych::bfi # Wide format with one row per participant's personality test
#'
#' # Pivot long format
#' data_to_long(data,
#' select = regex("\\d"), # Select all columns that contain a digit
#' names_to = "Item",
#' values_to = "Score",
#' rows_to = "Participant"
#' )
#'
#' if (require("tidyr")) {
#' reshape_longer(
#' tidyr::who,
#' select = new_sp_m014:newrel_f65,
#' names_to = c("diagnosis", "gender", "age"),
#' names_pattern = "new_?(.*)_(.)(.*)",
#' values_to = "count"
#' )
#' }
#' }
#' }
#'
#' @inherit data_rename seealso
#' @export

data_to_long <- function(
data,
select = "all",
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
values_to = "value",
values_drop_na = FALSE,
rows_to = NULL,
ignore_case = FALSE,
regex = FALSE,
...,
cols,
colnames_to
){

# Check args
if (!missing(colnames_to)) {
.is_deprecated("colnames_to", "names_to")
if (is.null(names_to)) {
names_to <- colnames_to
}
}

# Prefer "cols" over "select" for compat with tidyr::pivot_longer
if (!missing(cols)) {
select <- substitute(cols)
cols <- .select_nse(
select,
data,
exclude = NULL,
ignore_case = ignore_case,
regex = regex,
verbose = FALSE
)
} else {
if (!missing(select) || !is.null(select)) {
cols <- .select_nse(
select,
data,
exclude = NULL,
ignore_case = ignore_case,
regex = regex,
verbose = FALSE
)
} else {
insight::format_error(
"You need to specify columns to pivot, either with `select` or `cols`."
)
}
}

if (length(names_to) > 1 && is.null(names_sep) && is.null(names_pattern)) {
insight::format_error(
"If you supply multiple names in `names_to`, you must also supply one of `names_sep` or `names_pattern`."
)
}

# Remove tidyverse attributes, will add them back at the end
if (inherits(data, "tbl_df")) {
tbl_input <- TRUE
data <- as.data.frame(data, stringsAsFactors = FALSE)
} else {
tbl_input <- FALSE
}

if (any(names_to %in% setdiff(names(data), cols))) {
insight::format_error(
"Some values of the columns specified in 'names_to' are already present as column names.",
etiennebacher marked this conversation as resolved.
Show resolved Hide resolved
paste0(
"Either use another value in `names_to` or rename the following columns: ",
text_concatenate(names_to[which(names_to %in% setdiff(names(data), cols))])
)
)
}

# nothing to select?
if (!length(cols)) {
stop("No columns found for reshaping data.", call. = FALSE)
}

not_selected <- setdiff(names(data), cols)

# create a temp id so that we know how to rearrange the rows once the data is
# stacked
not_stacked <- data[, not_selected, drop = FALSE]
not_stacked[["_Rows"]] <- coerce_to_numeric(row.names(data))

# stack the selected columns
stacked_data <- .stack(data[, cols, drop = FALSE])[, 2:1]

# reorder the rows to have a repeated sequence when all vars are selected to
# pivot
#
# See with following example:
# wide_data <- data.frame(replicate(5, rnorm(10)))
# data_to_long(wide_data)

needs_to_rearrange <- length(not_selected) == 0 && is.null(rows_to)
if (isTRUE(needs_to_rearrange)) {
# https://stackoverflow.com/questions/73984957/efficient-way-to-reorder-rows-to-have-a-repeated-sequence
stacked_data <- stacked_data[c(
matrix(
seq_len(nrow(stacked_data)),
nrow = length(unique(stacked_data$ind)),
byrow = TRUE
)
), ]

row.names(stacked_data) <- NULL
}

stacked_data <- data_rename(stacked_data, "values", values_to)

# split columns if several names in names_to or names_pattern is specified
if (length(names_to) > 1) {

if (is.null(names_pattern)) {
# faster than strsplit
tmp <- utils::read.csv(
text = stacked_data$ind,
sep = names_sep,
stringsAsFactors = FALSE,
header = FALSE
)
names(tmp) <- paste0("V", seq_len(ncol(tmp)))
tmp[tmp == ""] <- NA

stacked_data$ind <- NULL
stacked_data <- cbind(tmp, stacked_data)

} else {
tmp <- regmatches(
unique(stacked_data$ind),
regexec(names_pattern, unique(stacked_data$ind))
)
tmp <- as.data.frame(do.call(rbind, tmp), stringsAsFactors = FALSE)
names(tmp) <- c("ind", names_to)
# faster than merge
stacked_data <- cbind(stacked_data, tmp[match(stacked_data[["ind"]], tmp[["ind"]]), -1])
etiennebacher marked this conversation as resolved.
Show resolved Hide resolved
stacked_data$ind <- NULL

}

}

stacked_data <- data_relocate(stacked_data, select = values_to, after = -1)

# reunite unselected data with stacked data
out <- cbind(
not_stacked, stats::setNames(stacked_data, c(names_to, values_to)),
row.names = NULL
)


if (!is.null(names_prefix)) {
if (length(names_to) > 1) {
insight::format_error(
"`names_prefix` only works when `names_to` is of length 1."
)
}
out[[names_to]] <- gsub(paste0("^", names_prefix), "", out[[names_to]])
}

# rearrange the rows with the temp id
if (length(not_selected) > 0) {
out <- data_arrange(out, "_Rows")
}

# Remove or rename the row index
if (is.null(rows_to)) {
out[["_Rows"]] <- NULL
} else {
out <- data_rename(out, "_Rows", rows_to)
}

if (values_drop_na) {
out <- out[!is.na(out[, values_to]), ]
}

# add back tidyverse attributes
if (isTRUE(tbl_input)) {
class(out) <- c("tbl_df", "tbl", "data.frame")
}

# reset row names
if (.has_numeric_rownames(data)) {
row.names(out) <- NULL
}

out
}


#' Code adapted from utils::stack (but largely modified)
#'
#' @noRd

.stack <- function(x) {
ind <- rep(names(x), times = lengths(x))
data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE)
}

#' @rdname data_to_long
#' @export
reshape_longer <- data_to_long
Loading