A less frustrating way to select rows? #456
Comments
To get this to work, you'd need to introduce some quoting to get partial to be handled appropriately when it's not evaluated in a different scope. It's not trivial to do this right in general.
I know that @StefanKarpinski has some qualms with the use of expression indexing. While it's very common to do this in R, I am tempted to avoid allowing this in Julia. In its current form, it's basically a trap for the unwary.
One thing I've been debating is removing select, with, etc. as functions and replacing them with macros. One could argue (and I think I'm starting to) that all of R's delayed evaluation magic is really an elaborate set of examples of where you need macros.
Yeah, that's kind of true. Macros are the better way to do this but it still feels a bit awkward.
I agree that it's a little awkward, but it's what I've got so far. Any other ideas? Getting this right will make DataFrames a lot more usable in Julia.
Here's how I will be getting through the day:

```julia
using DataFrames
import FunctionalUtils.partial

# Pipe a string column into a regex to get a Bool mask of the matching rows.
function (|>)(pred::DataArray{ASCIIString,1}, pat::Regex)
    convert(DataArray{Bool, 1}, map(partial(ismatch, pat), pred))
end

# Pipe a Bool mask into a DataFrame to keep only the matching rows.
function (|>)(pred::DataArray{Bool,1}, df::DataFrame)
    df[pred, :]
end

# Pipe a Bool mask into a single column to keep only the matching elements.
function (|>)(pred::DataArray{Bool,1}, df::DataArray)
    df[pred]
end

######### test #############
using Base.Test
df = DataFrame(A = ["foo", "bar", "baz"], B = [1,2,3])
@test df["A"] |> r"b.." == [false, true, true]
@test typeof(df["A"] |> r"b..") == DataArray{Bool,1}
@test df["A"] |> r"b.." |> df == DataFrame(A = ["bar", "baz"], B = [2,3])
@test df["A"] |> r"b.." |> df["B"] == [2, 3]
```
I actually think that if indexing were a staged function, it might help. @JeffBezanson – does that make sense?
I don't understand staged functions well enough to have a sense for that. For me, the biggest problem is that, in a macro that would do something like select on a DataFrame, you typically want to rewrite any variable name that is a column name as a reference to that column while preserving every other symbol. But it's not clear that the resulting code is going to be efficient in general.
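To make the rewrite described above concrete, here is a rough sketch in current Julia syntax rather than the Julia of this thread. Everything in it is an assumption for illustration: `rewrite_columns` is a hypothetical helper, a plain `Dict` keyed by `Symbol` stands in for a DataFrame, and the column names are passed in by hand, which is exactly the information a macro would not have at expansion time.

```julia
# Hypothetical helper: replace every symbol that names a column with an
# indexing expression into `dfsym`, and leave all other symbols untouched.
function rewrite_columns(ex, cols, dfsym::Symbol)
    if ex isa Symbol
        return ex in cols ? Expr(:ref, dfsym, QuoteNode(ex)) : ex
    elseif ex isa Expr
        return Expr(ex.head, map(a -> rewrite_columns(a, cols, dfsym), ex.args)...)
    else
        return ex  # literals, line number nodes, and so on
    end
end

# Assumed column names; a real macro would somehow have to discover these.
cols = (:A, :B)
ex = rewrite_columns(:(B .> threshold), cols, :df)

# Wrap the rewritten expression in a function so that `df` and `threshold`
# are explicit arguments instead of globals.
select_fn = eval(:((df, threshold) -> $ex))

# A Dict standing in for a DataFrame (assumption for this sketch only).
df = Dict(:A => ["foo", "bar", "baz"], :B => [1, 2, 3])
mask = Base.invokelatest(select_fn, df, 1)
```

Here `B` is turned into `df[:B]` because it is listed in `cols`, while `threshold` is preserved as an ordinary variable.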
Basically, it's a bit like a macro that operates after type information is known. If you make indexing a staged function, it should be possible to inject a "staged method" for the case where the thing you're indexing into is a data frame and in such cases, transform the expression for the indexes to do something different.
Ok. That at least gives me an intuition.
Thought about this more: not sure that either macros or staged functions would work, since you need not only type information, but also value information -- the correct interpretation of a select clause depends on knowing which variable names are bound to columns and which are free to vary in the calling environment. Without using eval, this seems like an intractable problem.
One idea related to what I suggested in #451 is to parametrize the DataFrame by its column names (either as symbols, n-tuples of Uint8, or hashes that are sufficiently unlikely to collide). Then the type information gives you the column name information. There is still a problem with the staged function approach, though: every function needs to be recompiled for every combination of column names, which may or may not be acceptable.
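A minimal sketch of what "parametrize by column names" could look like, written in current Julia and assuming that `@generated` functions are a fair stand-in for the staged functions discussed here. `NamedTable` and `column` are hypothetical names for illustration, not anything from DataFrames.

```julia
# Hypothetical table type: the column names live in the type parameter
# `names` (a tuple of Symbols), so they are visible at compile time.
struct NamedTable{names, T<:Tuple}
    columns::T
end

# Convenience constructor: NamedTable{(:A, :B)}(colA, colB)
NamedTable{names}(cols...) where {names} = NamedTable{names, typeof(cols)}(cols)

# Generated function: the body runs on types, so the column position is
# looked up once per (names, name) combination and baked into the method.
@generated function column(t::NamedTable{names}, ::Val{name}) where {names, name}
    i = findfirst(isequal(name), names)
    i === nothing && return :(throw(KeyError($(QuoteNode(name)))))
    return :(t.columns[$i])
end

t = NamedTable{(:A, :B)}(["foo", "bar", "baz"], [1, 2, 3])
b = column(t, Val(:B))
```

The cost mentioned above shows up directly: `NamedTable{(:A, :B)}` and `NamedTable{(:A, :C)}` are different types, so any method taking a `NamedTable` is specialized separately for each set of column names.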
So I ain't getting my head around staged functions even though I googled it -- veering in another direction, what do you guys think of this concept: http://code.google.com/p/asq/
Another idea is to rewrite x[fn(haskey(x, "a") ? x["a"] : a), :] This is a more limited approach, but it might be sufficient? Another possibility is to require users to put a symbol (e.g.
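The `haskey` rewrite can be sketched as a small macro in current Julia. This is only an illustration under stated assumptions: `defer_lookup` and `@rows` are hypothetical names, and a `Dict` with `String` keys stands in for a DataFrame. The point is that the column-versus-variable decision is deferred to run time, so the macro never needs to know the column names.

```julia
# Replace each variable symbol `s` with `haskey(x, "s") ? x["s"] : s`.
function defer_lookup(ex, container)
    if ex isa Symbol
        key = String(ex)
        return :(haskey($(container), $key) ? $(container)[$key] : $ex)
    elseif ex isa Expr && ex.head == :call
        # Leave the function name alone; rewrite only its arguments.
        args = map(a -> defer_lookup(a, container), ex.args[2:end])
        return Expr(:call, ex.args[1], args...)
    elseif ex isa Expr
        return Expr(ex.head, map(a -> defer_lookup(a, container), ex.args)...)
    else
        return ex  # literals and other non-symbol leaves
    end
end

# Hypothetical macro: evaluate `ex` with column names resolved against `container`.
macro rows(container, ex)
    esc(defer_lookup(ex, container))
end

# A Dict standing in for a DataFrame (assumption for this sketch only).
x = Dict("a" => [1, 2, 3], "b" => [10, 20, 30])
c = 2

mask = @rows x (a .>= c)   # `a` resolves to the column, `c` to the local variable
selected = x["b"][mask]
```

One consequence of this design is that a column silently shadows any variable with the same name, and every symbol pays for a `haskey` check on each evaluation.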
@simonster Cool. There was a lot of positivity surrounding the concept a month ago. It's probably easier to implement than SQL (but I don't know, a lot of people implement SQL) and it could be useful for things other than DataFrames like processing semi-structured streams of JSON or querying a database.
Feel free to reopen if still relevant.
The problem with `select` is that it can't access all the functions -- especially ones that are not part of `Base`. Maybe `esc` can be used but it didn't work for me yet.
The problem with `map` is that it returns type `DataArray{Any,1}` regardless of whether all your elements are of type `Bool`; the `[]` operator will not accept this and wants you to convert it to type `DataArray{Bool,1}`.