Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column manipulation API cleanup #239

Closed
piever opened this issue Apr 15, 2019 · 5 comments
Closed

Column manipulation API cleanup #239

piever opened this issue Apr 15, 2019 · 5 comments

Comments

@piever
Copy link
Collaborator

piever commented Apr 15, 2019

Discussing the DataFrames API over at JuliaData/DataFrames.jl#1695 (comment), I got the impression that our column manipulation developed organically over time and may need a little bit of clean-up.

Specifically:

Problem: pushcol and setcol seem largely overlapping (if the name is new in setcol the column gets added). The method setcol(table, name, col) and setcol(table, name => col) are redundant and the multi argument metod setcol(table, pair1, pair2, pair3) can be a bit confusing combined with setcol(table, name, col). The name is a bit puzzling as setcol is singular but can be applied to many columns: it seems like the preference in Julia in this case is to go with the plural (say sum(v, dims = i)). Also, it's inconsistent that the getter / setter for columns are column / setcol.

Proposed solution: Deprecate both in favor of mutate(table, changes::Pair...). This is consistent with the replace(v, old, new) to replace(v, old => new) deprecation in Julia Base and with DataFrames intended mutate! syntax. Here the corresponding JuliaDBMeta macro, called @transform, should probably become @mutate. Related to this, we could also consider adding a mergecols to merge columns of two tables.

Problem: popcol is a poor name for various reasons. Unlike pop!, it does not return the item it takes out of the collection, the singular is odd as it allows to remove many columns, unlike pop! one can specify which column to remove.

Proposed solution: rename to deletecols (consistent with DataFrames: JuliaData/DataFrames.jl#1772), stop defaulting to last column. deletecols(t, ncols(t)) is not that bad and it's not clear to me why it should be more common to delete the last column instead of any other. If we want special syntax for that we may add a special selector Last() (along the lines of Keys()) but I don't think that's necessary.

renamecol also has similar issues as above and I think the correct signature should be renamecols(table, changes:::Pair...) (DataFrames uses rename, but I think it's nice to be consistent and use ...cols for everything).

With insertcol and friends I would also rename to insertcols and friends (similar to DataFrames.insertcols!) and again only support the name => col interface (like DataFrames does). For example insertcols(table, ind, name => v) or insertcolsbefore(table, col, name => v).

cc: @bkamins and @nalimilan

@nalimilan
Copy link
Member

+1 for a cleanup common to JuliaDB and DataFrames (as much as possible at least).

Regarding mutate, I'm actually a bit hesitant about the name: AFAICT dplyr chose that term because transform was already taken by base R. transform sounds like a more natural term than mutate, and it would be consistent to what JuliaDBMeta and DataFramesMeta already support. Tables.jl also has a function with that name already: JuliaData/CSV.jl#395 (comment).

@joshday
Copy link
Collaborator

joshday commented Apr 15, 2019

+💯 for mutate(table, changes::Pair...) (or transform).

We might want to do away with any kind of drop column operation and let people rely on

select(t, Not(col))

That's something that didn't exist when popcol was implemented. JuliaDB's special selectors are one of my favorite things about the package, so I'm happy to push them on users more.

@piever
Copy link
Collaborator Author

piever commented Apr 15, 2019

So as a first deprecation PR I could go for:

  1. deprecate pushcol and setcol to transform(table, changes::Pair....) (everything else being equal, I'd go for consistency with DataFramesMeta, JuliaDBMeta and Tables)
  2. deprecate popcol(t, cols...) in favor of select(t, Not(cols...)).

I hadn't thought about select(t, Not(cols)) but given that we are selecting the complementary (rather than actually deleting columns, as the object is immutable) this makes a lot of sense.

Question: for setcol and renamecol we also have a setco(t, (:a => col1, :b => col2)) method which maybe adds to the confusion. Should we just stick with transform(t, :a => col1, :b => col2) and deprecate the tuple version?

EDIT: actually DataFrames.rename! for example allows passing a pair container, so we should do that as well (consistently across functions).

Then we could have a separate PR with possible renamings of renamecol and insertcol

@piever
Copy link
Collaborator Author

piever commented Apr 15, 2019

I have a couple of API doubts:

  • rename or renamecols? I would be inclined to go for rename for consistency with DataFrames as it clearly is a column oriented operation
  • Should we special case NamedTuple to also work? Meaning transform(t, (name1 = :col => val, ...)). In the current implementation, it wants the last argument to iterate pairs but I imagine a named tuple could be automatically be replaced with pairs(nt).

@piever piever closed this as completed Apr 16, 2019
@nalimilan
Copy link
Member

Wow that was fast!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants