A less frustrating way to select rows? #456

Closed
gitfoxi opened this issue Dec 19, 2013 · 17 comments

Comments

@gitfoxi

gitfoxi commented Dec 19, 2013

The problem with select is that it can't access all functions, especially ones that are not part of Base. Maybe esc can be used, but it hasn't worked for me yet.

The problem with map is that it returns type DataArray{Any,1} even when all your elements are of type Bool. The [] operator will not accept this and wants you to convert it to type DataArray{Bool,1}.

using DataFrames
import FunctionalUtils.partial

Warning: New definition 
    map(Any,Zip{I<:(Any...,)}) at /Users/m/.julia/FunctionalUtils/src/FunctionalUtils.jl:142
is ambiguous with: 
    map(Union(Function,DataType),Any...) at reduce.jl:129.
To fix, define 
    map(Union(Function,DataType),Zip{I<:(Any...,)})
before the new definition.
Warning: New definition 
    map(Any,Dict{K,V}) at /Users/m/.julia/FunctionalUtils/src/FunctionalUtils.jl:144
is ambiguous with: 
    map(Union(Function,DataType),Any...) at reduce.jl:129.
To fix, define 
    map(Union(Function,DataType),Dict{K,V})
before the new definition.



df = DataFrame(A = ["foo", "bar", "baz"], B = [1,2,3])




3x2 DataFrame:
            A B
[1,]    "foo" 1
[2,]    "bar" 2
[3,]    "baz" 3





# Careful. Don't skip a step.
isfoo(x) = ismatch(r"foo", x)
tf = map(isfoo, df["A"])
tf = convert(DataArray{Bool, 1}, tf)
df[tf, :]




1x2 DataFrame:
            A B
[1,]    "foo" 1





# Want
df[map(partial(ismatch, r"foo"), df["A"]), :]


no method haskey(Index, (DataArray{Any,1},Range1{Int64}))
while loading In[11], in expression starting on line 2

 in get at /Users/m/.julia/DataFrames/src/dataframe.jl:812

 in getindex at dict.jl:136

 in getindex at dict.jl:145



# Want
select(:(map(partial(ismatch, r"foo"))), df)


partial not defined
while loading In[12], in expression starting on line 2

 in anonymous at /Users/m/.julia/DataFrames/src/dataframe.jl:1525

 in with at /Users/m/.julia/DataFrames/src/dataframe.jl:1526

 in select at /Users/m/.julia/DataFrames/src/dataframe.jl:1080
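
For comparison, here is a minimal sketch of the same row selection in a current Julia 1.x release, which postdates this thread. It assumes `occursin` in place of `ismatch` (removed in Julia 1.0). Plain vectors stand in for the DataFrame columns so that the block runs without any packages; `A`, `B` and `isfoo` mirror the example above.

```julia
# Plain vectors standing in for the columns df["A"] and df["B"] above.
A = ["foo", "bar", "baz"]
B = [1, 2, 3]

# `occursin(pattern, string)` is the current spelling of `ismatch`.
isfoo(x) = occursin(r"foo", x)

# `map` infers the element type here, so no convert step is needed.
mask_map = map(isfoo, A)

# Broadcasting with a dot gives an equivalent Bool mask in one expression.
mask_dot = occursin.(r"foo", A)

# Either mask can be used directly for indexing.
selected = B[mask_dot]
```

With either mask, `selected` holds only the `B` entry from the row whose `A` value matches.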
@johnmyleswhite
Contributor

To get this to work, you'd need to introduce some quoting so that partial is handled appropriately when the expression is evaluated in a different scope. It's not trivial to do this right in general.

@johnmyleswhite
Contributor

I know that @StefanKarpinski has some qualms with the use of expression indexing. While it's very common to do this in R, I am tempted to avoid allowing this in Julia. In its current form, it's basically a trap for the unwary.

@johnmyleswhite
Contributor

One thing I've been debating is removing select, with, etc. as functions and replacing them with macros. One could (and I think I'm starting to) argue that all of R's delayed evaluation magic is really an elaborate set of examples of where you need macros.

@StefanKarpinski
Member

Yeah, that's kind of true. Macros are the better way to do this but it still feels a bit awkward.

@johnmyleswhite
Contributor

I agree that it's a little awkward, but it's what I've got so far. Any other ideas? Getting this right will make DataFrames a lot more usable in Julia.

@gitfoxi
Author

gitfoxi commented Dec 19, 2013

Here's how I will be getting through the day:

using DataFrames
import FunctionalUtils.partial

function (|>)(pred::DataArray{ASCIIString,1}, pat::Regex)
    convert(DataArray{Bool, 1}, map(partial(ismatch, pat), pred))
end

function (|>)(pred::DataArray{Bool,1}, df::DataFrame)
    df[pred, :]
end

function (|>)(pred::DataArray{Bool,1}, df::DataArray)
    df[pred]
end

######### test #############
using Base.Test
df = DataFrame(A = ["foo", "bar", "baz"], B = [1,2,3])

@test df["A"] |> r"b.." == [false, true, true]
@test typeof(df["A"] |> r"b..") == DataArray{Bool,1}

@test df["A"] |> r"b.." |> df == DataFrame(A = ["bar", "baz"], B = [2,3])
@test df["A"] |> r"b.." |> df["B"] == [2, 3]
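
The workaround above adds methods to `|>` for types it does not own, which can clash with other code that does the same. As a hedged alternative, the sketch below keeps `|>` untouched and pipes through ordinary closures instead. It assumes current Julia 1.x syntax, and `matching` and `rows_of` are hypothetical helper names, not part of DataFrames or FunctionalUtils. A NamedTuple of vectors stands in for the DataFrame so the block runs without packages.

```julia
# Hypothetical helpers; neither is part of DataFrames or FunctionalUtils.

# matching(pat) returns a function that turns a vector of strings into a Bool mask.
matching(pat::Regex) = strings -> occursin.(pat, strings)

# rows_of(x) returns a function that applies a mask to x.
# For a NamedTuple of column vectors, the mask is applied to every column.
rows_of(table::NamedTuple) = mask -> map(col -> col[mask], table)
rows_of(column::AbstractVector) = mask -> column[mask]

# Stand-in for the DataFrame used in the tests above.
table = (A = ["foo", "bar", "baz"], B = [1, 2, 3])

# The same three pipelines as in the tests, with the built-in |>.
mask = table.A |> matching(r"b..")
subset_table = table.A |> matching(r"b..") |> rows_of(table)
subset_column = table.A |> matching(r"b..") |> rows_of(table.B)
```

The pipelines read the same way as the originals, but the behavior lives in named functions rather than in new methods of a Base operator.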

@StefanKarpinski
Member

I actually think that if indexing were a staged function, it might help. @JeffBezanson – does that make sense?

@johnmyleswhite
Contributor

I don't understand staged functions well enough to have a sense for that. For me, the biggest problem is that, in a macro that would do something like select on a DataFrame, you typically want to rewrite any variable name that is a column name as a reference to that column while preserving every other symbol. But it's not clear that the resulting code is going to be efficient in general.

@StefanKarpinski
Member

Basically, it's a bit like a macro that operates after type information is known. If you make indexing a staged function, it should be possible to inject a "staged method" for the case where the thing you're indexing into is a data frame and in such cases, transform the expression for the indexes to do something different.

@johnmyleswhite
Contributor

Ok. That at least gives me an intuition.

@johnmyleswhite
Contributor

Thought about this more: not sure that either macros or staged functions would work, since you need not only type information, but also value information -- the correct interpretation of a select clause depends on knowing which variable names are bound to columns and which are free to vary in the calling environment. Without using eval, this seems like an intractable problem.

@simonster
Contributor

One idea related to what I suggested in #451 is to parametrize the DataFrame by its column names (either as symbols, n-tuples of Uint8, or hashes that are sufficiently unlikely to collide). Then the type information gives you the column name information. There is still a problem with the staged function approach, though: every function needs to be recompiled for every combination of column names, which may or may not be acceptable.
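
To make the idea of carrying column names in the type concrete, here is a rough sketch in current Julia 1.x. It assumes that `@generated` functions play the role of the staged functions discussed in this thread. `Table` and `column` are hypothetical names for illustration, not DataFrames API.

```julia
# Hypothetical type: the tuple of column names is a type parameter,
# so it is known to the compiler and not only at run time.
struct Table{names, T<:Tuple}
    columns::T
end

# Build a Table from a NamedTuple of column vectors.
Table(nt::NamedTuple{names}) where {names} =
    Table{names, typeof(values(nt))}(values(nt))

# The body of a generated function runs with only type information available.
# Because the names are part of the type, that is enough to find the column.
@generated function column(t::Table{names}, ::Val{name}) where {names, name}
    i = findfirst(==(name), names)
    if i === nothing
        msg = "no column named $name"
        return :(throw(ArgumentError($msg)))
    end
    return :(t.columns[$i])  # the position is baked into the generated code
end

t = Table((A = ["foo", "bar", "baz"], B = [1, 2, 3]))
b = column(t, Val(:B))
```

The cost mentioned above shows up here as well: each distinct combination of column names is a distinct `Table` type, so `column` and every other method that takes a `Table` is compiled again for each one.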

@gitfoxi
Author

gitfoxi commented Dec 20, 2013

So I ain't getting my head around staged functions even though I googled it -- veering in another direction, what do you guys think of this concept: http://code.google.com/p/asq/

@simonster
Contributor

Another idea is to rewrite select(fn(a), df) (or @select fn(a) df) as:

x[fn(haskey(x, "a") ? x["a"] : a), :]

This is a more limited approach, but it might be sufficient?

Another possibility is to require users to put a symbol (e.g. ~ if no one really needs bitwise not) in front of variable names that are supposed to be bound to columns, so we don't have to guess. This might actually be a good idea, since it reduces confusion in the case where there is a variable named a in the calling scope and a column named a in the DataFrame.
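
The marker idea can be sketched as a small macro in current Julia 1.x. This is an illustration under assumptions, not a proposed DataFrames API: `@selectrows` and `mark_columns` are hypothetical names, and a NamedTuple of vectors stands in for the DataFrame. Only `~name` is rewritten into a column lookup; every other symbol keeps its meaning in the calling scope.

```julia
# Hypothetical helper: rewrite `~name` into a column lookup on `tbl`
# and leave every other part of the expression untouched.
mark_columns(ex, tbl) = ex
function mark_columns(ex::Expr, tbl)
    if ex.head === :call && length(ex.args) == 2 &&
       ex.args[1] === :~ && ex.args[2] isa Symbol
        return Expr(:call, :getproperty, tbl, QuoteNode(ex.args[2]))
    end
    return Expr(ex.head, map(a -> mark_columns(a, tbl), ex.args)...)
end

# Hypothetical macro: keep the rows of `tbl` for which `cond` is true.
# The generated names avoid clashing with variables in the caller's scope.
macro selectrows(tbl, cond)
    t = gensym("tbl")
    m = gensym("mask")
    esc(quote
        let $t = $tbl, $m = $(mark_columns(cond, t))
            map(col -> col[$m], $t)
        end
    end)
end

table = (A = ["foo", "bar", "baz"], B = [1, 2, 3])
A = r"b.."  # a caller variable that shares its name with a column

# Plain `A` is the regex from the calling scope; `~A` is the column.
result = @selectrows table occursin.(A, ~A)
```

Because the marker is explicit, the macro never has to guess which names are columns, so it needs neither `eval` nor knowledge of the column names at expansion time. The trade-off is that `~` cannot be used as bitwise not inside the condition.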

@simonster
Contributor

@gitfoxi: That's been suggested before as well and also sounds quite useful. See #381.

@gitfoxi
Author

gitfoxi commented Dec 20, 2013

@simonster Cool. There was a lot of positivity surrounding the concept a month ago. It's probably easier to implement than SQL (but I don't know, a lot of people implement SQL) and it could be useful for things other than DataFrames like processing semi-structured streams of JSON or querying a database.

@quinnj
Member

quinnj commented Sep 7, 2017

Feel free to reopen if still relevant.

@quinnj closed this as completed Sep 7, 2017