A less frustrating way to select rows? #456

gitfoxi · 2013-12-19T22:18:26Z

The problem with select is that it can't access all functions, especially ones that are not part of Base. Maybe esc can be used, but it hasn't worked for me yet.

The problem with map is that it returns type DataArray{Any,1} even when all your elements are of type Bool. The [] operator will not accept this and wants you to convert it to type DataArray{Bool,1}.

using DataFrames
import FunctionalUtils.partial

Warning: New definition 
    map(Any,Zip{I<:(Any...,)}) at /Users/m/.julia/FunctionalUtils/src/FunctionalUtils.jl:142
is ambiguous with: 
    map(Union(Function,DataType),Any...) at reduce.jl:129.
To fix, define 
    map(Union(Function,DataType),Zip{I<:(Any...,)})
before the new definition.
Warning: New definition 
    map(Any,Dict{K,V}) at /Users/m/.julia/FunctionalUtils/src/FunctionalUtils.jl:144
is ambiguous with: 
    map(Union(Function,DataType),Any...) at reduce.jl:129.
To fix, define 
    map(Union(Function,DataType),Dict{K,V})
before the new definition.



df = DataFrame(A = ["foo", "bar", "baz"], B = [1,2,3])




3x2 DataFrame:
            A B
[1,]    "foo" 1
[2,]    "bar" 2
[3,]    "baz" 3





# Careful. Don't skip a step.
isfoo(x) = ismatch(r"foo", x)
tf = map(isfoo, df["A"])
tf = convert(DataArray{Bool, 1}, tf)
df[tf, :]




1x2 DataFrame:
            A B
[1,]    "foo" 1





# Want
df[map(partial(ismatch, r"foo"), df["A"]), :]


no method haskey(Index, (DataArray{Any,1},Range1{Int64}))
while loading In[11], in expression starting on line 2

 in get at /Users/m/.julia/DataFrames/src/dataframe.jl:812

 in getindex at dict.jl:136

 in getindex at dict.jl:145



# Want
select(:(map(partial(ismatch, r"foo"))), df)


partial not defined
while loading In[12], in expression starting on line 2

 in anonymous at /Users/m/.julia/DataFrames/src/dataframe.jl:1525

 in with at /Users/m/.julia/DataFrames/src/dataframe.jl:1526

 in select at /Users/m/.julia/DataFrames/src/dataframe.jl:1080


johnmyleswhite · 2013-12-19T22:23:46Z

To get this to work, you'd need to introduce some quoting to get partial to be handled appropriately when it's not evaluated in a different scope. It's not trivial to do this right in general.

johnmyleswhite · 2013-12-19T22:37:28Z

I know that @StefanKarpinski has some qualms with the use of expression indexing. While it's very common to do this in R, I am tempted to avoid allowing this in Julia. In its current form, it's basically a trap for the unwary.

johnmyleswhite · 2013-12-19T22:43:31Z

One thing I've been debating is removing select, with, etc as functions and replacing them with macros. One could (and I think I'm starting to) argue that all of R's delayed evaluation magic is really an elaborate set of examples of where you need macros.

StefanKarpinski · 2013-12-19T22:57:02Z

Yeah, that's kind of true. Macros are the better way to do this but it still feels a bit awkward.

johnmyleswhite · 2013-12-19T22:58:22Z

I agree that it's a little awkward, but it's what I've got so far. Any other ideas? Getting this right will make DataFrames a lot more usable in Julia.

gitfoxi · 2013-12-19T22:59:44Z

Here's how I will be getting through the day:

using DataFrames
import FunctionalUtils.partial

function (|>)(pred::DataArray{ASCIIString,1}, pat::Regex)
    convert(DataArray{Bool, 1}, map(partial(ismatch, pat), pred))
end

function (|>)(pred::DataArray{Bool,1}, df::DataFrame)
    df[pred, :]
end

function (|>)(pred::DataArray{Bool,1}, df::DataArray)
    df[pred]
end

######### test #############
using Base.Test
df = DataFrame(A = ["foo", "bar", "baz"], B = [1,2,3])

@test df["A"] |> r"b.." == [false, true, true]
@test typeof(df["A"] |> r"b..") == DataArray{Bool,1}

@test df["A"] |> r"b.." |> df == DataFrame(A = ["bar", "baz"], B = [2,3])
@test df["A"] |> r"b.." |> df["B"] == [2, 3]

StefanKarpinski · 2013-12-19T23:04:28Z

I actually think that if indexing were a staged function, it might help. @JeffBezanson – does that make sense?

johnmyleswhite · 2013-12-19T23:09:21Z

I don't understand staged functions well enough to have a sense for that. For me, the biggest problem is that, in a macro that would do something like select on a DataFrame, you typically want to rewrite any variable name that is a column name as a reference to that column while preserving every other symbol. But it's not clear that the resulting code is going to be efficient in general.
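The rewriting step described above can be sketched in a few lines of current Julia. This is only an illustration, not code from DataFrames: the helper name `rewrite` is hypothetical, the table is referred to by a plain symbol, and the set of column names is assumed to be passed in explicitly.

```julia
# Hypothetical helper: replace every symbol that names a column with a
# lookup into the table, and leave all other symbols untouched.
rewrite(ex, tbl::Symbol, cols) = ex                      # literals pass through
rewrite(s::Symbol, tbl::Symbol, cols) =
    s in cols ? Expr(:ref, tbl, QuoteNode(s)) : s        # column => tbl[:name]
rewrite(ex::Expr, tbl::Symbol, cols) =
    Expr(ex.head, [rewrite(a, tbl, cols) for a in ex.args]...)

cols = (:A, :B)                                # assumed to be known up front
ex = rewrite(:(B .> threshold), :df, cols)     # ex is now :(df[:B] .> threshold)
```

Note that `cols` has to be supplied when the expression is rewritten, and `threshold` is preserved only because it does not happen to be a column name.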

StefanKarpinski · 2013-12-19T23:13:10Z

Basically, it's a bit like a macro that operates after type information is known. If you make indexing a staged function, it should be possible to inject a "staged method" for the case where the thing you're indexing into is a data frame and in such cases, transform the expression for the indexes to do something different.
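For readers on current Julia: staged functions later became `@generated` functions. A minimal sketch of the "macro after type information is known" idea, assuming a hypothetical function `nselected` that is not part of any library:

```julia
# Inside the body of a generated function, `idx` is bound to the *type* of
# the argument. The returned expression is what gets compiled for that type.
@generated function nselected(idx)
    if idx <: AbstractVector{Bool}
        return :(count(idx))     # a Boolean mask selects one row per `true`
    else
        return :(length(idx))    # positional indices select one row each
    end
end

nselected([true, false, true])   # compiled as count(idx)
nselected([1, 2, 3])             # compiled as length(idx)
```

The branch is chosen once per argument type, so only types, never values, can influence the generated code.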

johnmyleswhite · 2013-12-19T23:15:27Z

Ok. That at least gives me an intuition.

johnmyleswhite · 2013-12-20T14:29:50Z

Thought about this more: not sure that either macros or staged functions would work, since you need not only type information, but also value information -- the correct interpretation of a select clause depends on knowing which variable names are bound to columns and which are free to vary in the calling environment. Without using eval, this seems like an intractable problem.

simonster · 2013-12-20T17:48:03Z

One idea related to what I suggested in #451 is to parametrize the DataFrame by its column names (either as symbols, n-tuples of Uint8, or hashes that are sufficiently unlikely to collide). Then the type information gives you the column name information. There is still a problem with the staged function approach, though: every function needs to be recompiled for every combination of column names, which may or may not be acceptable.
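A rough sketch of the parametrization idea in current Julia, assuming a hypothetical `NamedTable` type and `getcol` function (neither exists in DataFrames), with the column names stored as a tuple of symbols in the type parameter:

```julia
# Hypothetical table type whose column names are part of its type.
struct NamedTable{names}
    columns::Vector{Vector}
end

# The generator sees `names` and `col` as compile-time values, so the column
# position is resolved before the function body ever runs.
@generated function getcol(t::NamedTable{names}, ::Val{col}) where {names, col}
    i = 0
    for (k, n) in enumerate(names)
        if n === col
            i = k
        end
    end
    i == 0 && return :(throw(KeyError($(QuoteNode(col)))))
    return :(t.columns[$i])
end

t = NamedTable{(:A, :B)}(Vector[["foo", "bar", "baz"], [1, 2, 3]])
b = getcol(t, Val(:B))
```

As noted above, the cost is that `getcol` is compiled separately for every distinct combination of column names.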

gitfoxi · 2013-12-20T18:13:38Z

So I ain't getting my head around staged functions even though I googled it -- veering in another direction, what do you guys think of this concept: http://code.google.com/p/asq/

simonster · 2013-12-20T18:27:50Z

Another idea is to rewrite `select(fn(a), df)` (or `@select fn(a) df`) as:

df[fn(haskey(df, "a") ? df["a"] : a), :]

This is a more limited approach, but it might be sufficient?

Another possibility is to require users to put a symbol (e.g. ~ if no one really needs bitwise not) in front of variable names that are supposed to be bound to columns, so we don't have to guess. This might actually be a good idea, since it reduces confusion in the case where there is a variable named a in the calling scope and a column named a in the DataFrame.
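The marker idea might look roughly like this in current Julia. It is a sketch under assumptions: the macro name `@rows` and the helper `markcols` are hypothetical, and a `Dict` of vectors stands in for a DataFrame.

```julia
# Hypothetical helper: turn `~name` into a column lookup `tbl[:name]` and
# leave every unmarked symbol alone.
markcols(ex, tbl) = ex
function markcols(ex::Expr, tbl)
    if ex.head === :call && length(ex.args) == 2 &&
       ex.args[1] === :~ && ex.args[2] isa Symbol
        return Expr(:ref, tbl, QuoteNode(ex.args[2]))
    end
    return Expr(ex.head, [markcols(a, tbl) for a in ex.args]...)
end

# Hypothetical macro: unmarked names resolve in the caller's scope.
macro rows(tbl, cond)
    esc(markcols(cond, tbl))
end

tbl = Dict(:a => [1, 2, 3])     # stand-in for a DataFrame with a column `a`
a = 2                           # a variable `a` in the calling scope
mask = @rows(tbl, ~a .> a)      # expands to tbl[:a] .> a
```

Because the marker says which names are columns, the macro needs no knowledge of the table's contents at expansion time, and the column `a` and the variable `a` cannot be confused.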

simonster · 2013-12-20T18:31:58Z

@gitfoxi: That's been suggested before as well and also sounds quite useful. See #381.

gitfoxi · 2013-12-20T18:51:06Z

@simonster Cool. There was a lot of positivity surrounding the concept a month ago. It's probably easier to implement than SQL (but I don't know, a lot of people implement SQL) and it could be useful for things other than DataFrames like processing semi-structured streams of JSON or querying a database.

quinnj · 2017-09-07T04:10:12Z

Feel free to reopen if still relevant.

quinnj closed this as completed Sep 7, 2017
