Add select, select! and deletecols #1772

bkamins · 2019-04-10T19:40:45Z

A follow up of #1753.

Adds select, select! and deletecols. In this way we have a flexible range of functions for column subsetting in a DataFrame - in-place and creating a new DataFrame, with column copying or without it.

The functions always return a DataFrame (as opposed to df[col] vs df[cols]).

bkamins · 2019-04-10T23:02:26Z

I get a strange error on Julia 0.7 - for some reason DataFrames.jl does not recognize the select! function in the tests. Anyone has an idea what could be the reason?

bkamins · 2019-04-11T10:41:37Z

OK - I see why this fails on Julia 0.7. It defined (although deprecated) select and select! functions that are currently not in base.

@nalimilan - so the decision is if we want to keep supporting 0.7 (I would drop it), then I simply change the REQUIRE and testset specification. If you think we should keep supporting 0.7 could you hint what is a proper way to handle such a case using Compat.jl (especially as old select and select! got renamed)?

nalimilan · 2019-04-11T18:03:06Z

Thanks. I'm hesitant about whether select(df, col) should return a data frame or a vector. In JuliaDB it returns a vector, so doing something different would make it harder to unify APIs.

The advantage of returning a data frame is that select would be clearly a table operation, very similar to mutate/transform. For example, if we extend it to allow things like select(df, col => normalize) and select(df, newcol = col => normalize, newcol2 = col2 => -), both would return a data frame. JuliaDB doesn't support the latter syntax: it requires you to write select(t, (newcol = col => normalize, newcol2 = col2 => -)). Not the end of the world, but it makes things more complex. Also using a one-element named tuple looks weird when you want to create a single column and get a data frame.

@nalimilan - so the decision is if we want to keep supporting 0.7 (I would drop it), then I simply change the REQUIRE and testset specification. If you think we should keep supporting 0.7 could you hint what is a proper way to handle such a case using Compat.jl (especially as old select and select! got renamed)?

Let's drop 0.7, it doesn't seem very useful at this point.

src/dataframe/dataframe.jl

test/dataframe.jl

Co-Authored-By: bkamins <[email protected]>

bkamins · 2019-04-11T19:40:41Z

I'm hesitant about whether select(df, col) should return a data frame or a vector

That is why I like to make a PR as then things are on the table.

My position is that we should return a DataFrame because:

select guarantees that you get a DataFrame which I think users will like (one less thing to learn)
we have an identity select(df, inds) is the same as select!(df, inds)

@piever - could you please comment on the select issue?

I also envision that in the long run select should allow various column transformations like combine, but I wanted to keep the initial simple.

Regarding dropping 0.7 - I will make a separate PR (to keep things atomic) and then we can rebase this one (as probably we need some more discussion on this PR with the community).

bkamins · 2019-04-12T09:09:03Z

Regarding select, select!, deletecols and the old deletecols!. The question is (it can be a separate PR) if we want to support them for SubDataFrame also. There are two levels of the question:

do we want to allow mutating the view by select! and deletecols! (here I would say no - the silent assumption is that views are not mutable) - so this should be an error;
do we want to allow select and deletecols - I would say yes, but the question is if we should return a SubDataFrame or allocate a new DataFrame

piever · 2019-04-12T10:40:48Z

The advantage of returning a data frame is that select would be clearly a table operation, very similar to mutate/transform. For example, if we extend it to allow things like select(df, col => normalize) and select(df, newcol = col => normalize, newcol2 = col2 => -), both would return a data frame. JuliaDB doesn't support the latter syntax: it requires you to write select(t, (newcol = col => normalize, newcol2 = col2 => -)). Not the end of the world, but it makes things more complex. Also using a one-element named tuple looks weird when you want to create a single column and get a data frame.

In the IndexedTables case, select(df, col => normalize) would return a vector, whereas select(df, (col => normalize,)) would return a one column table (named col), whereas to also rename it's either with pairs of pairs or named tuple: select(df, (:newcol => col => normalize,)) or select(df, (newcol = col => normalize,)) (of course the pair version is most useful when the symbol is generated programmatically).

I think design-wise IndexedTables generally prefers named tuples to many keyword arguments, for example table((a = rand(10), b = rand(10))).

I'd tend to agree that the trade off is that if one wants the "many args or many kwargs" version, than select(df, col) needs to be a DataFrame.

I'll list what I believe may be reasons for the current IndexedTables design:

one can use a wide variety of things as "selections", listed here. The important rule is IMO "Tuple of Selection – returns a table containing a column for every selector in the tuple.". This requires things like select(df, sym::Symbol)::AbstractVector and select(df, v::AbstractVector) = v but then is quite powerful:

julia> t = table((a=1:10, b=rand(10)));

julia> select(t, (:a, :b => log, :c => rand(10)))
Table with 10 rows, 3 columns:
a   b          c
────────────────────────
1   -2.02031   0.854743
2   -0.67662   0.726542
3   -0.318742  0.793672
4   -1.0202    0.0611947
5   -0.54456   0.987932
6   -0.312883  0.766104
7   -0.627879  0.0454886
8   -1.12356   0.907651
9   -1.42856   0.864524
10  -1.23728   0.734384

many table operations accept a keyword argument. For example map(f, t; select = All()) = map(f, select(t, select)), which is generally useful because we need a "thin table" to be sure the compiler doesn't collapse when iterating rows. In this case it's easier for map(f, t, select = col) to return a vector so that the user can write a f that expects a regular vector. For example map(log, t, select = :b) or filter(in([2, 3]), t, select = :a)

But I agree that these rules require some getting used to and maybe DataFrames user prefer a simpler: select returns a DataFrame kind of rule.

bkamins · 2019-04-12T11:03:19Z

@piever - thank you for a detailed explanation (actually many of these things (or similar) in the long run should land in select for DataFrame). Also df[:col] currently returns a vector in DataFrames.jl, so having select(df, :col) do the same would not be an end of the world.

However, as you have commented at the end - I feel that having a guarantee that select always returns a data frame will make novice user's life simpler (and if someone really wants to select one column writing df[:col] is available).

nalimilan · 2019-04-14T15:47:37Z

Thanks @piever. I see the logic behind the design of select in JuliaDB. OTOH I'm not completely convinced of the utility of select returning vectors, since the same result can be achieved by accessing the column vector using columns (in JuliaDB) or getindex (in DataFrame) and applying any transformation you want to it directly if needed.

At any rate, it would be nice if we could have a least one common method to access a column vector in JuliaDB and DataFrames. If that's not select, what can it be? Could JuliaDB support getproperty(t, col) for that? Otherwise DataFrames could support coumns(df, col) like JuliaDB, but columns(df) would be redundant with eachcol.

bkamins · 2019-04-14T15:58:19Z

We chose eachcol over columns to be consistent with Base.

piever · 2019-04-14T16:25:56Z

The simplest JuliaDB syntax for selecting just one column is column(t, sym). There is a fair amount of redundancy in JuliaDB in that columns works in general (for symbol, int and tuple) whereas column is only to select one column as an AbstractVector. Maybe DataFrames could use column?

I'm not sure getproperty(t, col) will be implemented on IndexedTables, there was some negative feedback on it (see JuliaData/IndexedTables.jl#203 (comment)) and not many people have asked for it.

nalimilan · 2019-04-14T16:35:51Z

We could support column(df, col), especially if we end up deprecating df[col]. A possible name for that operation was getcol, but indeed "get" isn't very Julian. At least it wouldn't be terrible to add this at some point if that helps developing a common table API.

But getproperty(::IndexedTable, ::Symbol) also makes sense to me. The comment by Jeff you refer to may be a bit different AFAICT since it's about StructArrays, which are more expected to behave like arrays of named tuples than tables. Anyway that can be discussed independently from column(df, col).

bkamins · 2019-04-14T16:40:44Z

I am ok to add column(df, col) (even if in the end we decide we leave df[col]).

nalimilan

Let's go with that then unless somebody objects.

bkamins · 2019-04-14T16:52:35Z

I will wait a bit for comments and add column to this PR if there are none (also please comment at some suitable time if the rest of the functionality is then what we want).

bkamins · 2019-04-14T18:11:03Z

I have added it to show the changes (if we decide to drop it we can remove last 2 commits)

piever · 2019-04-14T19:50:00Z

In terms of "working with columns" API, JuliaDB uses the singular (pushcol, setcol, popcol, see here) even though it also allows to set many columns at the same time: I personally think the plural (deletecols as in this PR) makes a bit more sense, but don't have a strong opinion either way.

src/abstractdataframe/abstractdataframe.jl

nalimilan · 2019-04-15T08:14:33Z

In terms of "working with columns" API, JuliaDB uses the singular (pushcol, setcol, popcol, see here) even though it also allows to set many columns at the same time: I personally think the plural (deletecols as in this PR) makes a bit more sense, but don't have a strong opinion either way.

Actually we chose the plural at #1514 (comment) (and following comments), and we already use it for insertcols. The inconsistency with JuliaDB is a bit annoying.

bkamins · 2019-04-15T09:27:02Z

Given the discussion in #1695 we might drop adding column for now (unless there are strong opinions to add it). The issue is related to whether we retain df[col] and df[col] = v syntax.

nalimilan · 2019-04-15T12:04:08Z

Yeah, let's merge this PR without columns and discuss that later.

nalimilan · 2019-04-15T12:12:49Z

Reading https://discourse.julialang.org/t/common-api-for-tabular-data-backends/21546 again, I recalled that select is basically equivalent to map if we consider data frames as collections of rows. I guess one difference which could justify having both is that select on a GroupedDataFrame will recycle scalars to ensure that the result has as many rows as the input (just like transform/mutate would).

Then these three functions would cover all interesting combinations of these two dimensions: 1) keep or drop remaining columns, 2) recycle scalars or not. The missing function is the one which keeps columns but doesn't recycle, which is the one which doesn't make sense since remaining columns have by definition one value per row. Does that make sense?

bkamins · 2019-04-15T13:57:20Z

Then these three functions would cover all interesting combinations of these two dimensions

@nalimilan - could you please expand on this, as I am not 100% clear which functions in which use cases you mean.

bkamins · 2019-04-15T14:02:55Z

OK - I have reverted the column commits for a separate discussion. Also adding functionality to select (which I think we should do in the long run) should be probably a separate PR.

nalimilan · 2019-04-15T14:21:08Z

@nalimilan - could you please expand on this, as I am not 100% clear which functions in which use cases you mean.

I was referring to #1727 (comment).

bkamins · 2019-04-15T14:27:20Z

So you mean the triplet transform/select/map in the case of GroupedDataFrame? Then yes I think that select for a GroupedDataFrame should preserve number of rows (thus implicitly recycle scalars in particular). But this is a separate PR I think

bkamins · 2019-04-18T10:39:02Z

If there will be no more comments on this I am going to merge this PR.

bkamins added 5 commits April 10, 2019 21:36

add select, select! and deletecols

c72b4a8

small fix

60ee40a

fix typo

a3216e4

fix tests

b3b2a6f

fix corner cases

ff95c03

nalimilan reviewed Apr 11, 2019

View reviewed changes

nalimilan and others added 2 commits April 11, 2019 21:20

Apply suggestions from code review

271e883

Co-Authored-By: bkamins <[email protected]>

fixes after the review

ea6236b

bkamins mentioned this pull request Apr 11, 2019

remove Julia 0.7 dependencies #1774

Merged

bkamins added 4 commits April 11, 2019 21:50

fix error type

db7c312

fix test errors

2431e70

Merge branch 'master' into add_select_and_deletecols

afde22f

fix bug in select!

7686bfd

nalimilan approved these changes Apr 14, 2019

View reviewed changes

nalimilan reviewed Apr 15, 2019

View reviewed changes

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved

nalimilan mentioned this pull request Apr 15, 2019

Policy regarding in-place operations #1695

Closed

piever mentioned this pull request Apr 15, 2019

Column manipulation API cleanup JuliaData/IndexedTables.jl#239

Closed

bkamins force-pushed the add_select_and_deletecols branch from 31c35ed to 7686bfd Compare April 15, 2019 14:00

bkamins merged commit cc0bf2b into JuliaData:master Apr 19, 2019

bkamins deleted the add_select_and_deletecols branch April 19, 2019 14:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add select, select! and deletecols #1772

Add select, select! and deletecols #1772

bkamins commented Apr 10, 2019

bkamins commented Apr 10, 2019

bkamins commented Apr 11, 2019

nalimilan commented Apr 11, 2019

bkamins commented Apr 11, 2019

bkamins commented Apr 12, 2019

piever commented Apr 12, 2019

bkamins commented Apr 12, 2019

nalimilan commented Apr 14, 2019

bkamins commented Apr 14, 2019

piever commented Apr 14, 2019

nalimilan commented Apr 14, 2019

bkamins commented Apr 14, 2019

nalimilan left a comment

bkamins commented Apr 14, 2019

bkamins commented Apr 14, 2019

piever commented Apr 14, 2019 •

edited

Loading

nalimilan commented Apr 15, 2019

bkamins commented Apr 15, 2019

nalimilan commented Apr 15, 2019

nalimilan commented Apr 15, 2019

bkamins commented Apr 15, 2019

bkamins commented Apr 15, 2019

nalimilan commented Apr 15, 2019

bkamins commented Apr 15, 2019

bkamins commented Apr 18, 2019

Add select, select! and deletecols #1772

Add select, select! and deletecols #1772

Conversation

bkamins commented Apr 10, 2019

bkamins commented Apr 10, 2019

bkamins commented Apr 11, 2019

nalimilan commented Apr 11, 2019

bkamins commented Apr 11, 2019

bkamins commented Apr 12, 2019

piever commented Apr 12, 2019

bkamins commented Apr 12, 2019

nalimilan commented Apr 14, 2019

bkamins commented Apr 14, 2019

piever commented Apr 14, 2019

nalimilan commented Apr 14, 2019

bkamins commented Apr 14, 2019

nalimilan left a comment

Choose a reason for hiding this comment

bkamins commented Apr 14, 2019

bkamins commented Apr 14, 2019

piever commented Apr 14, 2019 • edited Loading

nalimilan commented Apr 15, 2019

bkamins commented Apr 15, 2019

nalimilan commented Apr 15, 2019

nalimilan commented Apr 15, 2019

bkamins commented Apr 15, 2019

bkamins commented Apr 15, 2019

nalimilan commented Apr 15, 2019

bkamins commented Apr 15, 2019

bkamins commented Apr 18, 2019

piever commented Apr 14, 2019 •

edited

Loading