Column selectors should guarantee column order is preserved #20

Open
CameronBieganek opened this issue Apr 28, 2020 · 3 comments

CameronBieganek commented Apr 28, 2020

I think that all column selectors (other than arrays) should guarantee that the column order in the original table is preserved. One would certainly expect that to be the case for Between, though it's not explicitly mentioned in the docstring. It would be a bummer if you had foo(x, y) = 2x .+ y but Between(:x, :y) => foo happened to lower to [:y, :x] => foo instead of [:x, :y] => foo.

And I think it makes sense to guarantee column order preservation for the other selectors. E.g.

df = DataFrame(a=1, b=2, c=3)
select(df, Not(:b) => foo)

should be guaranteed to lower to

select(df, [:a, :c] => foo)

rather than

select(df, [:c, :a] => foo)
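
The guarantee asked for here can be sketched in plain Base Julia, without DataFrames.jl. The helpers `lower_not` and `lower_between` are hypothetical names used only for illustration; they are not part of DataAPI.jl. The sketch assumes the table's column order is available as a vector of symbols and that the named columns exist.

```julia
# Hypothetical helpers, not DataAPI.jl API. `cols` is the table's column order.

# Not(...): keep every column that is not excluded, in table order.
lower_not(cols::Vector{Symbol}, excluded::Symbol...) =
    filter(c -> c ∉ excluded, cols)

# Between(from, to): the contiguous run of columns, in table order.
# Assumes both `from` and `to` are present in `cols`.
function lower_between(cols::Vector{Symbol}, from::Symbol, to::Symbol)
    i = findfirst(==(from), cols)
    j = findfirst(==(to), cols)
    return cols[i:j]
end

cols = [:a, :b, :c]
lower_not(cols, :b)          # [:a, :c]
lower_between(cols, :a, :c)  # [:a, :b, :c]
```

Because both helpers only filter or slice `cols`, the result can never come out in an order other than the table's own.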

I'm not totally certain of the best way to specify the column ordering properties of Cols, but I think this specification makes sense:

  • Individual column selectors inside Cols are first lowered to (ordered) arrays.
    • The lowering of the individual column selectors (except for arrays) follows the rule above that table column order should be preserved.
  • Cols is then lowered as follows: Cols(A, B, C) ==> [A, B\A, C\(A ∪ B)] (where the arguments on the right side are splatted into the array).

Since setdiff on arrays preserves the order of the first argument to setdiff, we get the following behavior:

df = DataFrame(a=1, b=2, c=3)
Cols([:c, :b], [:a, :b]) == [:c, :b, :a]
Cols(r"[bc]", r"[ab]") == [:b, :c, :a]
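
The proposed lowering can be checked with a minimal sketch in Base Julia. `lower_cols` and `lower_regex` are hypothetical helpers written for this illustration, not functions from DataAPI.jl or DataFrames.jl, and each selector is assumed to have already been lowered to an ordered `Vector{Symbol}`.

```julia
# Hypothetical helpers, not DataAPI.jl or DataFrames.jl API.

# Cols(A, B, C) ==> [A..., (B \ A)..., (C \ (A ∪ B))...]
# setdiff keeps the order of its first argument, so each selection
# contributes its not-yet-seen columns in its own order.
function lower_cols(selections::Vector{Symbol}...)
    result = Symbol[]
    for sel in selections
        append!(result, setdiff(sel, result))
    end
    return result
end

# A regex selector is lowered against the table's column order.
lower_regex(cols::Vector{Symbol}, r::Regex) =
    filter(c -> occursin(r, String(c)), cols)

cols = [:a, :b, :c]
lower_cols([:c, :b], [:a, :b])                                   # [:c, :b, :a]
lower_cols(lower_regex(cols, r"[bc]"), lower_regex(cols, r"[ab]"))  # [:b, :c, :a]
```

Arrays keep the order the caller wrote, while the regex selectors fall back to table order, which is why the two calls give different results for the same set of columns.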

bkamins (Member) commented Apr 28, 2020

What you propose is exactly how it is implemented in DataFrames.jl (unless I introduced a bug in the code).

Essentially, the rule can be stated as: column selectors are evaluated left to right, and if a duplicate is encountered it is ignored.
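
Stated this way, the rule amounts to concatenating the lowered selections and keeping only the first occurrence of each column. A minimal sketch in Base Julia, where `left_to_right` is a hypothetical name and not the DataFrames.jl implementation:

```julia
# Hypothetical helper, not the DataFrames.jl implementation.
# Concatenate the selections in order; `unique` keeps the first
# occurrence of each element and drops later duplicates.
left_to_right(selections::Vector{Symbol}...) = unique(vcat(Symbol[], selections...))

left_to_right([:c, :b], [:a, :b])  # [:c, :b, :a]
left_to_right([:b, :c], [:a, :b])  # [:b, :c, :a]
```

Both calls give the same results as the set-difference formulation in the opening comment.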

CameronBieganek (Author) commented Apr 28, 2020

> What you propose is exactly how it is implemented in DataFrames.jl

Agreed, that is how it is currently implemented in DataFrames.jl. I just thought it might be a good idea to make the order guarantee explicit in DataAPI.jl.

bkamins (Member) commented Sep 4, 2021

Going back to this issue - would you care to make a PR implementing the proposed changes? Thank you!
