Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add select, select! and deletecols #1772

Merged
merged 11 commits into from
Apr 19, 2019

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Apr 10, 2019

A follow up of #1753.

Adds select, select! and deletecols. In this way we have a flexible range of functions for column subsetting in a DataFrame - in-place and creating a new DataFrame, with column copying or without it.

The functions always return a DataFrame (as opposed to df[col] vs df[cols]).

@bkamins
Copy link
Member Author

bkamins commented Apr 10, 2019

I get a strange error on Julia 0.7 - for some reason DataFrames.jl does not recognize the select! function in the tests. Anyone has an idea what could be the reason?

@bkamins
Copy link
Member Author

bkamins commented Apr 11, 2019

OK - I see why this fails on Julia 0.7. It defined (although deprecated) select and select! functions that are currently not in base.

@nalimilan - so the decision is if we want to keep supporting 0.7 (I would drop it), then I simply change the REQUIRE and testset specification. If you think we should keep supporting 0.7 could you hint what is a proper way to handle such a case using Compat.jl (especially as old select and select! got renamed)?

@nalimilan
Copy link
Member

Thanks. I'm hesitant about whether select(df, col) should return a data frame or a vector. In JuliaDB it returns a vector, so doing something different would make it harder to unify APIs.

The advantage of returning a data frame is that select would be clearly a table operation, very similar to mutate/transform. For example, if we extend it to allow things like select(df, col => normalize) and select(df, newcol = col => normalize, newcol2 = col2 => -), both would return a data frame. JuliaDB doesn't support the latter syntax: it requires you to write select(t, (newcol = col => normalize, newcol2 = col2 => -)). Not the end of the world, but it makes things more complex. Also using a one-element named tuple looks weird when you want to create a single column and get a data frame.

@nalimilan - so the decision is if we want to keep supporting 0.7 (I would drop it), then I simply change the REQUIRE and testset specification. If you think we should keep supporting 0.7 could you hint what is a proper way to handle such a case using Compat.jl (especially as old select and select! got renamed)?

Let's drop 0.7, it doesn't seem very useful at this point.

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
test/dataframe.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Apr 11, 2019

I'm hesitant about whether select(df, col) should return a data frame or a vector

That is why I like to make a PR as then things are on the table.

My position is that we should return a DataFrame because:

  • select guarantees that you get a DataFrame which I think users will like (one less thing to learn)
  • we have an identity select(df, inds) is the same as select!(df, inds)

@piever - could you please comment on the select issue?

I also envision that in the long run select should allow various column transformations like combine, but I wanted to keep the initial simple.

Regarding dropping 0.7 - I will make a separate PR (to keep things atomic) and then we can rebase this one (as probably we need some more discussion on this PR with the community).

@bkamins
Copy link
Member Author

bkamins commented Apr 12, 2019

Regarding select, select!, deletecols and the old deletecols!. The question is (it can be a separate PR) if we want to support them for SubDataFrame also. There are two levels of the question:

  • do we want to allow mutating the view by select! and deletecols! (here I would say no - the silent assumption is that views are not mutable) - so this should be an error;
  • do we want to allow select and deletecols - I would say yes, but the question is if we should return a SubDataFrame or allocate a new DataFrame

@piever
Copy link

piever commented Apr 12, 2019

The advantage of returning a data frame is that select would be clearly a table operation, very similar to mutate/transform. For example, if we extend it to allow things like select(df, col => normalize) and select(df, newcol = col => normalize, newcol2 = col2 => -), both would return a data frame. JuliaDB doesn't support the latter syntax: it requires you to write select(t, (newcol = col => normalize, newcol2 = col2 => -)). Not the end of the world, but it makes things more complex. Also using a one-element named tuple looks weird when you want to create a single column and get a data frame.

In the IndexedTables case, select(df, col => normalize) would return a vector, whereas select(df, (col => normalize,)) would return a one column table (named col), whereas to also rename it's either with pairs of pairs or named tuple: select(df, (:newcol => col => normalize,)) or select(df, (newcol = col => normalize,)) (of course the pair version is most useful when the symbol is generated programmatically).

I think design-wise IndexedTables generally prefers named tuples to many keyword arguments, for example table((a = rand(10), b = rand(10))).

I'd tend to agree that the trade off is that if one wants the "many args or many kwargs" version, than select(df, col) needs to be a DataFrame.

I'll list what I believe may be reasons for the current IndexedTables design:

  • one can use a wide variety of things as "selections", listed here. The important rule is IMO "Tuple of Selection – returns a table containing a column for every selector in the tuple.". This requires things like select(df, sym::Symbol)::AbstractVector and select(df, v::AbstractVector) = v but then is quite powerful:
julia> t = table((a=1:10, b=rand(10)));

julia> select(t, (:a, :b => log, :c => rand(10)))
Table with 10 rows, 3 columns:
a   b          c
────────────────────────
1   -2.02031   0.854743
2   -0.67662   0.726542
3   -0.318742  0.793672
4   -1.0202    0.0611947
5   -0.54456   0.987932
6   -0.312883  0.766104
7   -0.627879  0.0454886
8   -1.12356   0.907651
9   -1.42856   0.864524
10  -1.23728   0.734384
  • many table operations accept a keyword argument. For example map(f, t; select = All()) = map(f, select(t, select)), which is generally useful because we need a "thin table" to be sure the compiler doesn't collapse when iterating rows. In this case it's easier for map(f, t, select = col) to return a vector so that the user can write a f that expects a regular vector. For example map(log, t, select = :b) or filter(in([2, 3]), t, select = :a)

But I agree that these rules require some getting used to and maybe DataFrames user prefer a simpler: select returns a DataFrame kind of rule.

@bkamins
Copy link
Member Author

bkamins commented Apr 12, 2019

@piever - thank you for a detailed explanation (actually many of these things (or similar) in the long run should land in select for DataFrame). Also df[:col] currently returns a vector in DataFrames.jl, so having select(df, :col) do the same would not be an end of the world.

However, as you have commented at the end - I feel that having a guarantee that select always returns a data frame will make novice user's life simpler (and if someone really wants to select one column writing df[:col] is available).

@nalimilan
Copy link
Member

Thanks @piever. I see the logic behind the design of select in JuliaDB. OTOH I'm not completely convinced of the utility of select returning vectors, since the same result can be achieved by accessing the column vector using columns (in JuliaDB) or getindex (in DataFrame) and applying any transformation you want to it directly if needed.

At any rate, it would be nice if we could have a least one common method to access a column vector in JuliaDB and DataFrames. If that's not select, what can it be? Could JuliaDB support getproperty(t, col) for that? Otherwise DataFrames could support coumns(df, col) like JuliaDB, but columns(df) would be redundant with eachcol.

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2019

We chose eachcol over columns to be consistent with Base.

@piever
Copy link

piever commented Apr 14, 2019

The simplest JuliaDB syntax for selecting just one column is column(t, sym). There is a fair amount of redundancy in JuliaDB in that columns works in general (for symbol, int and tuple) whereas column is only to select one column as an AbstractVector. Maybe DataFrames could use column?

I'm not sure getproperty(t, col) will be implemented on IndexedTables, there was some negative feedback on it (see JuliaData/IndexedTables.jl#203 (comment)) and not many people have asked for it.

@nalimilan
Copy link
Member

We could support column(df, col), especially if we end up deprecating df[col]. A possible name for that operation was getcol, but indeed "get" isn't very Julian. At least it wouldn't be terrible to add this at some point if that helps developing a common table API.

But getproperty(::IndexedTable, ::Symbol) also makes sense to me. The comment by Jeff you refer to may be a bit different AFAICT since it's about StructArrays, which are more expected to behave like arrays of named tuples than tables. Anyway that can be discussed independently from column(df, col).

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2019

I am ok to add column(df, col) (even if in the end we decide we leave df[col]).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with that then unless somebody objects.

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2019

I will wait a bit for comments and add column to this PR if there are none (also please comment at some suitable time if the rest of the functionality is then what we want).

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2019

I have added it to show the changes (if we decide to drop it we can remove last 2 commits)

@piever
Copy link

piever commented Apr 14, 2019

In terms of "working with columns" API, JuliaDB uses the singular (pushcol, setcol, popcol, see here) even though it also allows to set many columns at the same time: I personally think the plural (deletecols as in this PR) makes a bit more sense, but don't have a strong opinion either way.

@nalimilan
Copy link
Member

In terms of "working with columns" API, JuliaDB uses the singular (pushcol, setcol, popcol, see here) even though it also allows to set many columns at the same time: I personally think the plural (deletecols as in this PR) makes a bit more sense, but don't have a strong opinion either way.

Actually we chose the plural at #1514 (comment) (and following comments), and we already use it for insertcols. The inconsistency with JuliaDB is a bit annoying.

@bkamins
Copy link
Member Author

bkamins commented Apr 15, 2019

Given the discussion in #1695 we might drop adding column for now (unless there are strong opinions to add it). The issue is related to whether we retain df[col] and df[col] = v syntax.

@nalimilan
Copy link
Member

Yeah, let's merge this PR without columns and discuss that later.

@nalimilan
Copy link
Member

Reading https://discourse.julialang.org/t/common-api-for-tabular-data-backends/21546 again, I recalled that select is basically equivalent to map if we consider data frames as collections of rows. I guess one difference which could justify having both is that select on a GroupedDataFrame will recycle scalars to ensure that the result has as many rows as the input (just like transform/mutate would).

Then these three functions would cover all interesting combinations of these two dimensions: 1) keep or drop remaining columns, 2) recycle scalars or not. The missing function is the one which keeps columns but doesn't recycle, which is the one which doesn't make sense since remaining columns have by definition one value per row. Does that make sense?

@bkamins
Copy link
Member Author

bkamins commented Apr 15, 2019

Then these three functions would cover all interesting combinations of these two dimensions

@nalimilan - could you please expand on this, as I am not 100% clear which functions in which use cases you mean.

@bkamins bkamins force-pushed the add_select_and_deletecols branch from 31c35ed to 7686bfd Compare April 15, 2019 14:00
@bkamins
Copy link
Member Author

bkamins commented Apr 15, 2019

OK - I have reverted the column commits for a separate discussion. Also adding functionality to select (which I think we should do in the long run) should be probably a separate PR.

@nalimilan
Copy link
Member

@nalimilan - could you please expand on this, as I am not 100% clear which functions in which use cases you mean.

I was referring to #1727 (comment).

@bkamins
Copy link
Member Author

bkamins commented Apr 15, 2019

So you mean the triplet transform/select/map in the case of GroupedDataFrame? Then yes I think that select for a GroupedDataFrame should preserve number of rows (thus implicitly recycle scalars in particular). But this is a separate PR I think

@bkamins
Copy link
Member Author

bkamins commented Apr 18, 2019

If there will be no more comments on this I am going to merge this PR.

@bkamins bkamins merged commit cc0bf2b into JuliaData:master Apr 19, 2019
@bkamins bkamins deleted the add_select_and_deletecols branch April 19, 2019 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants