Consistent naming schema for data wrangling and transformation functions #57

Closed
strengejacke opened this issue Jan 26, 2022 · 14 comments · Fixed by #190 or #204
Labels
Discussion 🦜 Docs 📚 Improvements or additions to documentation

Comments

@strengejacke
Member

https://easystats.github.io/datawizard/reference/index.html

Due to aliases for some functions that start with data_, and pkgdown.yml using starts_with("data"), these functions are shown under two headers.

@strengejacke strengejacke added the Docs 📚 Improvements or additions to documentation label Jan 26, 2022
@IndrajeetPatil
Member

Why do we have these aliases?

I think only functions that have anything to do with data wrangling should carry the data_ prefix, and the others shouldn't.

So I would vote for removing the following aliases:

  • data_adjust (not used anywhere in the ecosystem)
  • data_to_numeric (re-exported from parameters)
  • data_rescale (this is the trickiest one since it is widely used in the ecosystem, and we have decided to prefer it over change_rescale)

These are the odd ones out.
We don't have data_center, data_winsorize, etc., so there is no reason to treat a few transformation functions specially.

What do you think @DominiqueMakowski, @mattansb, @bwiernik?

@bwiernik
Contributor

Not sure what you are considering "data wrangling" vs not

@IndrajeetPatil
Member

Yeah, that's a tough question.

Roughly speaking, data wrangling is anything that changes the structure of how the data is stored (and the relevant metadata) but leaves the values of the retained data intact.

So, for example, the following would all be data wrangling helpers:
data_partition, data_relocate, data_remove, data_rename, data_rename_rows, data_reorder, data_to_long, data_to_wide

The data transformation helpers, by contrast, change the values of the data. For example,
center, change_scale, adjust, etc.

Does that make a little bit of sense?
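
A minimal base R sketch of the distinction described above, using a made-up data frame. It does not call any datawizard function; it only illustrates "layout changes, values stay" versus "values change".

```r
# Toy data frame; the column names are made up for illustration.
df <- data.frame(id = 1:3, score = c(10, 20, 30))

# "Wrangling": change the layout (here, the column order) and keep every
# stored value intact.
relocated <- df[, c("score", "id")]
identical(relocated$score, df$score)
#> [1] TRUE

# "Transformation": the stored values themselves change.
centered <- df
centered$score <- df$score - mean(df$score)
centered$score
```

Under this reading, relocating or renaming columns would get the data_ prefix, while centering or rescaling would not.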

@strengejacke
Member Author

I would distinguish between data frame and variable transformations. The first is "data wrangling", maybe?

@DominiqueMakowski
Member

DominiqueMakowski commented Jan 27, 2022

I think all functions that take data as argument (and primarily do any sort of data manipulation/transformation) should start with data_

@DominiqueMakowski
Member

> I think all functions that take data as argument (and primarily do any sort of data manipulation/transformation) should start with data_

"...from now onwards" 😁

@bwiernik
Contributor

Aliases are free.

@IndrajeetPatil
Member

Also helps us avoid masking like these:

library(tidyr)
library(datawizard)
#> 
#> Attaching package: 'datawizard'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

Created on 2022-01-28 by the reprex package (v2.0.1.9000)

I think we should be retiring this particular alias ASAP.

@bwiernik
Contributor

It's not on CRAN yet I don't think, so rename it

This was referenced Feb 21, 2022
@IndrajeetPatil IndrajeetPatil changed the title from "duplicates in pkgdown reference" to "Consistent naming schema for data wrangling and transformation functions" Feb 21, 2022
@IndrajeetPatil IndrajeetPatil pinned this issue Feb 21, 2022
@DominiqueMakowski
Member

(Similarly, we should probably deprecate adjust(), centre() and demean() in favour of data_*)
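
One way such a deprecation could look, as a sketch only. Both function names below are hypothetical stand-ins, not datawizard's actual API; the only real call is base R's .Deprecated(), which emits a warning pointing to the replacement.

```r
# Hypothetical names: data_center_sketch() stands in for a data_* function,
# center_sketch() for the older unprefixed name that would be retired.
data_center_sketch <- function(x) {
  x - mean(x)
}

center_sketch <- function(x) {
  # .Deprecated() (base R) warns and names the function to use instead.
  .Deprecated("data_center_sketch")
  data_center_sketch(x)
}

# The old name keeps working during the transition, but warns on each call.
result <- suppressWarnings(center_sketch(c(1, 2, 3)))
result
```

Keeping the old name as a thin wrapper means existing code in the ecosystem continues to run while users are nudged toward the new name.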

@strengejacke
Member Author

Yes, this is a small inconsistency we have here:

  • The name of the first argument for functions that only work on data frames is data.
  • Whenever a function also has a numeric/factor/vector method, the first argument is named x.

(though I'm not sure this is consistent)

However, not all functions that start with data_ only work on data frames, see data_cut().

This means: We could rename center() to data_center() - however, I think there are some names that are too common to be named only data_*(). But after all, data_standardize() or data_center() would make sense in the context of the naming scheme, and those names don't look too awkward.
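
The argument-naming convention above can be sketched with an S3 generic. This is a hedged illustration with a hypothetical function name (data_center_sketch), not datawizard's implementation: because the generic has both a vector method and a data frame method, its first argument is named x.

```r
# Hypothetical generic; not a datawizard function. The first argument is `x`
# because the function accepts more than data frames.
data_center_sketch <- function(x, ...) {
  UseMethod("data_center_sketch")
}

# Vector method: subtract the mean.
data_center_sketch.numeric <- function(x, ...) {
  x - mean(x)
}

# Data frame method: apply the vector method to every numeric column and
# leave all other columns untouched.
data_center_sketch.data.frame <- function(x, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], data_center_sketch)
  x
}

data_center_sketch(c(1, 2, 3))
data_center_sketch(data.frame(a = c(1, 2, 3), b = c("x", "y", "z")))
```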

@strengejacke strengejacke reopened this Apr 22, 2022
@mattansb
Member

We can have data_*() for data frames, and vector_*() for vectors? It's a bit verbose, but also easy to use.

The yardstick package uses a somewhat similar schema (with *() for data frames, and *_vec() for vectors).
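
A rough sketch of what such a paired schema might look like. Both names are hypothetical and exist in neither datawizard nor yardstick; the select argument is also an assumption made for the example.

```r
# Hypothetical vector-level function: rescale a numeric vector to the
# range given by `to`.
vector_rescale_sketch <- function(x, to = c(0, 1)) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

# Hypothetical data-frame-level function: a thin wrapper that applies the
# vector function to the selected columns only.
data_rescale_sketch <- function(data, select, to = c(0, 1)) {
  data[select] <- lapply(data[select], vector_rescale_sketch, to = to)
  data
}

df <- data.frame(a = c(0, 5, 10), b = c(1, 2, 3))
rescaled <- data_rescale_sketch(df, select = "a")
rescaled$a
```

The data frame function stays a thin wrapper, so the logic lives in one place. Note that this sketch does not guard against constant columns, where the range is zero.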

@bwiernik
Contributor

I don't see the problem with having data_*() accept either vector or data frame input?

5 participants