Convenience methods for getproperty and setproperty in DataFrames with new ownership rules #1753

Closed
pdeffebach opened this issue Mar 26, 2019 · 29 comments

Comments

@pdeffebach
Contributor

I have remarked about this on slack and in #1695. On @nalimilan's suggestion I am posting an issue to discuss how we can make df.col easier with non-literals.

Motivation: With #1695 (mostly) concluded, we will likely have the following two ways to get a single column from a DataFrame

  • df[:, :col]: a copy
  • df.col: a non-copy.
  • df[:col]: deprecated.

I really like df.col syntax. I find it intuitive and requires little typing. However it only works with the actual literal col. If you have a variable x representing the symbol :col, you cannot do df.x to get the column :col. You also can't just do df[:, x] because that has different behavior -- a copy rather than the exact same vector.

I want to find a syntax for

df.col = f.(df.col)

where you use the variable x to represent :col.

Alternative solutions for getproperty:

  • We can use getproperty(df, x) to have the same behavior as df.col.
  • We can maybe use select!(df, x) to have the same behavior.
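To make the function-call option concrete, here is a minimal sketch. It assumes DataFrames.jl is installed; the doubling function `f` and the sample data are made up for illustration. It shows `getproperty` and `setproperty!` standing in for `df.col` and `df.col = ...` when the column name lives in a variable.

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
x = :col                      # column name held in a variable

# Functional form of `df.col`: returns the stored vector, not a copy.
v = getproperty(df, x)
same_vector = v === df.col

# Functional form of `df.col = f.(df.col)`; `f` is an illustrative function.
f(e) = 2e
setproperty!(df, x, f.(getproperty(df, x)))
```

This works, but it loses the visual symmetry of `df.col = f.(df.col)`, which is the complaint raised in this issue.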

Alternative solutions for setproperty!:

  • setcol!(df, x, v) will work. However, I don't like it because you now have to keep track of an extra pair of parentheses.
df.col = map(v) do e 
    @match e begin
    ... pattern matching
    end
end

That is complicated enough as it is, without wrapping it in setcol!(df, x, ...).

Finally I like the symmetry of getproperty and setproperty looking the same. df.col = f.(df.col) is elegant.

Solutions:

Ideally:

  • df.$x = v would be valid Julia syntax, where the $ escapes whatever is following it.
  • you could define a struct and overload getproperty(df, s::ColumnSelector). This is not possible, however, because Julia doesn't evaluate the expression after the dot at all.

Pragmatically:

  • A macro? I explored this a bit and didn't get very far.

I'm at a bit of a dead end in terms of ideas for this. But I do think it's important. The average scientist codes in global scope, and we aren't going to get them to put things in functions if they have to re-write all their code in a less intuitive way in order to do that.

Thanks for reading my rant.

@bkamins
Member

bkamins commented Mar 26, 2019

Thanks for reading my rant.

I am done reading it 😄. Thank you for the input. This is indeed a hard nut to crack.

For an alternative to getproperty I thought of getcol(df, x). Note that x can be also a number.

For an alternative to setproperty! I was considering:

  • setcol!(df, x, v) or setcol!(df, x=>v)
  • setcol!(fun, df, x) (this syntax can be used to handle a do block, fun takes no parameters)
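A rough sketch of how these `setcol!` variants could look, assuming DataFrames.jl is installed. `setcol!` is only a proposal in this thread and not a DataFrames.jl function, so the definitions below are hypothetical and simply forward to `setproperty!`.

```julia
using DataFrames

# Hypothetical helpers: `setcol!` is proposed in this thread, not part of DataFrames.jl.
setcol!(df::AbstractDataFrame, x::Symbol, v::AbstractVector) = setproperty!(df, x, v)
setcol!(df::AbstractDataFrame, p::Pair{Symbol,<:AbstractVector}) = setcol!(df, first(p), last(p))
# Function-first method so that `do` blocks work; `fun` takes no arguments.
setcol!(fun::Function, df::AbstractDataFrame, x::Symbol) = setcol!(df, x, fun())

df = DataFrame(col = [1, 2, 3])
x = :col

setcol!(df, x, [10, 20, 30])          # three-argument form
setcol!(df, x => df.col .+ 1)         # pair form
setcol!(df, x) do                     # do-block form
    map(df.col) do e
        e > 15 ? "big" : "small"
    end
end
```

The do-block form keeps a long right-hand side readable, at the cost of not looking like an assignment.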

An alternative to setcol! could be mutate! with the semantics (or we could have both):

  • mutate!(df, x=>v...) (multiple pairs allowed)
  • mutate!(df; x=v...) (multiple kwargs allowed)
  • mutate!(fun, df) (this syntax can be used to handle a do block, fun takes no parameters and returns something that would normally be accepted by DataFrame constructor: Pair or Pairs, NamedTuple of vectors, etc. - here the list would have to get defined)

I am giving here my loose thoughts as I am still not sure what would be best.

Also, I would rather explore macros in the DataFramesMeta.jl package.

@pdeffebach
Contributor Author

I have found a hack!

julia> df = DataFrame(col = [1, 2, 3, 4, 5]);
julia> x = :col;
julia> df.:($x)
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

This might be too ugly to show to users though.

@bkamins
Member

bkamins commented Mar 27, 2019

This is nice indeed.
The only limitation of this approach (which means that we probably still need the other methods) is that x must be a Symbol, so it cannot be an integer, nor can you pass an expression that should be evaluated.

@pdeffebach
Contributor Author

We can define the following:

import Base.getproperty
getproperty(df::AbstractDataFrame, i::Int) = DataFrames.columns(df)[i]

But the evaluated symbol is a good point.

Just to be clear, there is no way to have a function like

col(df, :x1) = ... 

That has the same symmetry as df.x1 in terms of getproperty or setproperty? As in, this is a limitation in the way Julia parses expressions and should be done through a macro?

Another solution, which I like a lot, is having an object where you can overload getindex and setindex! however you want.

cols(df)[:x] = ... # should work, right?
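The wrapper idea could be sketched as follows, assuming DataFrames.jl is installed. The `Cols` type and the `cols` function are hypothetical names taken from the suggestion above; nothing here is an existing DataFrames.jl API.

```julia
using DataFrames

# Hypothetical wrapper type; `cols` is only an idea floated in this thread.
struct Cols{T<:AbstractDataFrame}
    df::T
end
cols(df::AbstractDataFrame) = Cols(df)

# Reading returns the stored vector itself, like `df.x`.
Base.getindex(c::Cols, x::Symbol) = getproperty(c.df, x)
# Writing rebinds the column name to `v`, like `df.x = v`.
Base.setindex!(c::Cols, v::AbstractVector, x::Symbol) = setproperty!(c.df, x, v)

df = DataFrame(x = [1, 2, 3])
name = :x                             # column name held in a variable
cols(df)[name] = 2 .* cols(df)[name]
```

Because indexing evaluates its argument, the same line works with a literal (`cols(df)[:x]`) and with a variable, which is what `df.x` cannot do.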

@bkamins
Member

bkamins commented Mar 28, 2019

cols(df)[:x] is a possible idea. Let us wait and see what other people think about it.
(There is a question of what the best name for it is, but the idea that cols(df) returns a column-oriented object is feasible IMO.)

@bkamins
Member

bkamins commented Mar 28, 2019

Also removal of df[:col] is orthogonal to "data frame ownership" - so if some important arguments in favor of retaining it are raised this probably could be reconsidered as not a single line of code was written yet to remove it 😄.

EDIT I am writing it as I keep answering to different questions in several places and people keep writing df[:col] although already for a long time we support df.col.

@oxinabox
Contributor

oxinabox commented Mar 28, 2019

I do not understand what is wrong with
@view df[:, :col]
or if a variable @view df[:, x].

It matches well with how views work in the rest of the language.
And doesn't it already exist with the behavior you want?

@bkamins
Member

bkamins commented Mar 28, 2019

I agree with what @oxinabox says, and I was voting for removing df[:col] because it is simply inconsistent with the rest of the design, but I just acknowledge that for some reason people keep writing df[:col] (maybe it is only because in the past it was the only option).

@pdeffebach
Contributor Author

@bkamins you are right, this may be premature considering we haven't even started deprecating df[:col] yet.

You correctly note that this issue is really about the planned deprecation of df[:col], but I think it does have to do with ownership in the sense that, with the planned changes in #1695, df.x is going to have different behavior than df[:, x] (one will copy and one will not). So the motivation for this discussion is to ensure we have the right methods that work in both global and local scope.

Here is my reasoning regarding @view df[:, x]

  • I don't think it has the same behavior. Maybe this is still up in the air, but @view df[:, :col] returns a view of the entirety of df.col, and doesn't return the vector itself.
  • It's scary enough to ward off new users. My audience for this syntax is the second-year econometrics / biology statistics student. In order for them to use @view df[:, x] we need to explain to them what a view is. Additionally, these students are going to program in global scope, but we want them to make the switch to functions. This disincentivizes that switch.
  • Even the more experienced user who understands views will be annoyed as they switch from prototyping in global scope to local scope.
  • Data frames are not like arrays, which is the reason why we have the getproperty method in the first place. It seems inconsistent to have two different philosophies for indexing data frames in local and global scope (or with literals and variables).

@bkamins
Member

bkamins commented Mar 29, 2019

To be clear, none of df.col, view(df, :, :col), or df[:, :col] is going to change its meaning.

We are considering removing df[:col] and view(df, :col). I guess that no one complains about removing view(df, :col), so I will concentrate on df[:col]. What problems will the removal of df[:col] cause for users?

  • getting the column from df
    • df.col does not support getting columns by their number and does not support invalid identifiers
    • one can use getproperty(df, :col) which solves the invalid identifier problem; we can consider adding getproperty(df, number) support; the problem with this is that getproperty is probably not an intuitive name
    • we can introduce getcol to have a better name; this should generally solve all the needs from the users on RHS
  • setting the column in df
    • df.col = val does not support setting columns by their number and does not support invalid identifiers
    • one can use setproperty!(df, :col, val) which solves the invalid identifier problem; we can consider adding setproperty!(df, number, val) support; the problem with this is that setproperty! is probably not an intuitive name
    • we can introduce setcol! to have a better name; however, this is a bit problematic as we will not be able to write some_function(df) = val, since that syntax defines a method instead of assigning a column

@view df[:, :col] and df[:, :col] are valid alternatives, but neither of them solves the LHS (assignment) problem, because on the LHS they update the old vector to hold the values of val rather than rebinding the :col name to a new vector. I think the problem is best understood when you try to rewrite the following code assuming df[:col] is deprecated:

df = DataFrame()
col = "my fancy name"
df[Symbol(col)] = 1:10

(of course adding setcol! would allow to handle this without a problem but it would not use assignment syntax)

@oxinabox
Contributor

Riiiight, ok, i hadn't considered the LHS problem.
I can see the argument for not deprecating setindex!(df, data, column::Symbol).
Crazy idea might be to be asymmetric about that, so getindex(df, column::Symbol) might not work (prob give a helpful warning).
Reason it might be sensible to not have this getindex, is that it would be awkward.

  1. it should be direct (viewish) column access since setindex!(df, data, column::Symbol) is.
  2. it should be a copy since all other getindex are copies, unless marked with @view

OTOH, my life would be easier if we left that getindex and setindex exactly as they are

@bkamins
Member

bkamins commented Mar 29, 2019

OTOH, my life would be easier if we left that getindex and setindex exactly as they are

That is my point - we have consistency against convenience issue here.

@nalimilan
Member

An available syntax is df[SOMETHING, col], with SOMETHING being a replacement for : which would indicate that no copy should be made. I haven't been able to find a good idea for that special object, though. It could be a name like inplace/view, or any symbol (available or used elsewhere with a different meaning) like +, *, ~, !, ^, | or even (). ! might kind of make sense for the similarity with f!.

@bkamins
Member

bkamins commented Mar 29, 2019

! makes the most sense to me as it is consistent with other notations.

@oxinabox
Contributor

Woah, that is so crazy, it might just work.

@bkamins
Member

bkamins commented Mar 29, 2019

So we might go this way:

  • df[!, :col] gets a column directly (this is simple);
  • @view df[!, :col] gets a view of this column (this is still simple);
  • df[!, cols] is problematic - should it be a SubDataFrame or a DataFrame (I opt for a DataFrame but @nalimilan pointed out that it might be better to use a SubDataFrame)
  • the issue is that we also have to define what @view df[!, cols] would mean - this is particularly tricky when someone would write @view df[!, :] (: as column selector dynamically resizes the view to reflect the changes in columns in the parent)

@oxinabox
Contributor

oxinabox commented Mar 29, 2019

df[!, :col] gets a column directly (this is simple);

@view df[!, :col] gets a view of this column (this is still simple);

Actually it's not. Treating DataFrames as their own thing, the raw column vector df[!, :col] is already a viewish thing.
@view df[!, :col] is equivalent to x::Vector = df[!, :col]; @view x[:]
which is a full-length view of a Vector.
From the perspective of DataFrames as their own thing, that is a view of a view, which is
1.) Weird
2.) Kind of pointless. (In general, for a vector, @view x[:] is pointless.)

As such it might be that @view df[!, :col] should be exactly the same as df[!, :col].
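For reference, this is what a full-length view of a plain vector looks like in Base Julia (no DataFrames involved; the sample vector is made up). It shows why `@view x[:]` adds a wrapper without adding any new capability.

```julia
x = [1, 2, 3]
v = @view x[:]                  # full-length view of the vector

is_wrapper = v isa SubArray     # a SubArray wrapper, not the vector itself
shares_data = parent(v) === x   # the view shares storage with `x`
v[1] = 10                       # writes through to `x`
```

So the view behaves like `x` for reading and writing, but it is a different object of a different type.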

df[!, cols] is problematic - should it be a SubDataFrame or a DataFrame (I opt for a DataFrame but @nalimilan pointed out that it might be better to use a SubDataFrame)

One factor worth considering is:
should df[!, cols1][!, cols2][!, col] = xs
mutate the original DataFrame,
as if df[!, col] = xs were called?

@nalimilan
Member

As such it might be that @view df[!, :col] should be exactly the same as df[!, :col].

I don't think we should really be concerned about this since it's a corner case. I'd tend to go with the most consistent solution. We could even throw an error for now.

One factor worth considering is:
should df[!, cols1][!, cols2][!, col] = xs
mutate the original DataFrame,
as if df[!, col] = xs were called?

If we return a SubDataFrame, it will throw an error. If we return a DataFrame, it won't mutate the original one.

My argument to return a SubDataFrame is that it makes it obvious that column vectors are shared (which we generally want to avoid for DataFrame in the ownership approach). The downside could be that it's more convenient to work with a DataFrame, but I'm not sure in what situations that would be the case (the example above is a bit convoluted).

@bkamins
Member

bkamins commented Mar 29, 2019

So here is my point. We deprecate df[:col] in favor of df[!, :col] and df[cols] in favor of df[!, cols]. The benefits of this approach are:

  • it is very simple to implement (no new bugs introduced)
  • it is very simple to write correct deprecation warnings
  • we do not have to reinvent the wheel (all is currently consistent around df[:col] and df[cols] and we inherit this behavior)
  • user code is very simple to migrate (they simply need to add ! in a few places and all will work without any other change)

Note that @view x[:] is a valid syntax and we should support it (as it is also supported in Base). Also note that df can be a DataFrame but it also can be a SubDataFrame (that is why I have said above to inherit all the functionality from current getindex and setindex! taking a single dimensional argument and they consistently handle all these cases).

So for example in the future:

df[!, cols1][!, cols2][!, col] = xs

should do exactly the same as what:

df[cols1][cols2][col] = xs

does currently.

EDIT
By "we deprecate" in the first sentence I mean a direct 1-to-1 deprecation without any change in the functionality.

@bkamins
Member

bkamins commented Mar 29, 2019

The only drawback is that df[!, cols] will create a DataFrame that shares the columns with the source, and I understand this is a concern of @nalimilan. Therefore we could as well allow df[!, col] but disallow df[!, cols] (and require users to call select that could be soon introduced).

EDIT The only problem is that current calls like df[cols] = x will have no convenient equivalent (but I think this is something that is very rarely, if ever, used).

@oxinabox
Contributor

oxinabox commented Mar 29, 2019

We could always throw an error for df[!, cols]

Perhaps though for now, we don't touch df[:col], leaving it as is.
Make a release with the other new features and return to it again later.

@bkamins
Member

bkamins commented Mar 29, 2019

This is what we do, the only change being that in #1742 calling df[cols] currently copies the columns.

@pdeffebach
Contributor Author

pdeffebach commented Mar 29, 2019

It would be cool to see how confused people are about df[:col] returning a vector and df[1] returning a row.

We could emphasize that df[:col] is only a replacement for df.col and just avoid defining df[cols::Vector{Symbol}] at all. You can't do that with getproperty and you can't use integer inputs with getproperty either.

There is still some semblance of logic because our goal is just to have a df.col <--------> df[:col] symmetry.

Would it make sense to have a feature request to Julia to add some sort of getproperty with evaluation?

@nalimilan
Member

It would be cool to see how confused people are about df[:col] returning a vector and df[1] returning a row.

There's no way you'll convince me of doing this madness. We're not Pandas! :-p

Also that would mean there's no short syntax for df[1], which would be weird.

Would it make sense to have a feature request to Julia to add some sort of getproperty with evaluation?

What do you mean?

@nickeubank
Contributor

FWIW:

  • Of the current proposals, my favorite is to get rid of df[:col], mostly use df.col, and offer setcol!(df, x) and getcol(df, x) for when people want to use variables for column names. I think setcol and getcol are exceedingly readable and clear, and we avoid the "ascii salad" problem we get with the ! operator, which I can live with, but just seems... weird? Not intuitive? That's not a notation that allows a casual reader to read your code and clearly know what's going on...
    • I definitely don't like using getproperty and setproperty -- property is just too vague a term (so it isn't obvious what code is doing if you read it), and it requires lots of typing, both of which I think adds to confusion.
  • I'm with @nalimilan on not having something like "df[:col] returning a vector and df[1] returning a row". I agree it seems cool, but I think that kind of cleverness is too weird for new users. (Again, I'm an educator for non-computer scientists, and my general experience is that these kinds of idiosyncrasies set off red flags as potential challenges for students. A consistent model of behavior, even if a little less efficient, is I think easier for new and even intermediate users.)

@bkamins
Member

bkamins commented Mar 29, 2019

Just to expand on setcol! we possibly could have both setcol!(df, col, v) and setcol!(fun, df, col) where the latter can be used with do syntax.

Also df[:, col] = v will be available to change the value of the column in-place (setcol! is only needed for replacing the column with a new vector).

@bkamins
Member

bkamins commented Mar 31, 2019

While we are unclear what to do with df[col] and df[cols], I think it is good to outline what other general methods should be added (as they are needed anyway and largely cover the use cases of df[col] and df[cols], except df[col] to get a raw vector). My current list is the following (ColSelector is anything that deletecols! would currently accept):

  • select(df, col::ColSelector; copycolumns::Bool=true): create a new DataFrame with selected columns
  • select!: the same as select but without the copycolumns kwarg; changes the DataFrame in place
  • mutate(df, ::Pair{Symbol,Any}...; copycolumns::Bool=true): create a new DataFrame with added columns specified by Pairs
  • mutate!: the same as mutate but without the copycolumns kwarg; changes the DataFrame in place
  • deletecols: the same as deletecols! but with a copycolumns::Bool=true kwarg, creating a new DataFrame

E.g. here mutate! would give us a replacement of df[col] = val (no matter if we keep it or not).

@bkamins
Member

bkamins commented Jul 12, 2019

@bkamins
Member

bkamins commented Jul 25, 2019

Closing this - we have df[!, col] to handle programmatic direct column access.
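As a short illustration of that resolution, here is a sketch assuming a DataFrames.jl version that supports `!` as a row selector. The sample data and the function `f` are made up. It contrasts `df[!, x]` (direct access) with `df[:, x]` (copy on read, in-place write) when the column name is held in a variable.

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])
x = :col                         # column name held in a variable

direct = df[!, x]                # the stored vector, same object as `df.col`
copied = df[:, x]                # a fresh copy

is_same = direct === df.col
is_copy = copied !== df.col

f(e) = e + 1                     # illustrative function
df[!, x] = f.(df[!, x])          # rebinds the column name to a new vector
df[:, x] = [0, 0, 0]             # overwrites the current column in place
```

The line `df[!, x] = f.(df[!, x])` is the variable-name counterpart of `df.col = f.(df.col)` that this issue set out to find.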

@bkamins bkamins closed this as completed Jul 25, 2019

5 participants