Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Interface for high-performance indexing operations #71

Open
johnmyleswhite opened this issue Feb 2, 2014 · 21 comments
Open

Interface for high-performance indexing operations #71

johnmyleswhite opened this issue Feb 2, 2014 · 21 comments
Labels

Comments

@johnmyleswhite
Copy link
Member

As mentioned in JuliaData/DataFrames.jl#523, we might want to expose an "unsafe" interface to the underlying values of a DataArray for those trying to do high-performance work:

  • Check if an entry is NA by making isna return a reference. We could also implementing complex indexing for isna(da, inds...) but that seems like a lot of needless work.
  • Access the underlying values of a DataArray using something like values(da), which will have undefined values for any NA entries.

This would put us in a position to write code like:

dm = @data([1 2; 3 4])
isna(dm)[1, 1]
values(dm)[1, 1]

We could make this code very fast because it would be perfectly type stable. As a (probably too) radical step, we could even change getindex to implement the semantics of values.

@tshort
Copy link

tshort commented Feb 2, 2014

Overall, this is a great approach. One issue with this is that the result of isna() now becomes a placeholder type, not an array. That means you can't do df[isna(df[:colA]), :], at least without adding another getindex method to DataFrames.

@johnmyleswhite
Copy link
Member Author

Why is isna a placeholder now? I'm proposing just returning the "true" values of the na field instead of returning a copy of them.

@tshort
Copy link

tshort commented Feb 2, 2014

Got it. I misunderstood what you were planning to return.

On Sunday, February 2, 2014, John Myles White [email protected]
wrote:

Why is isna a placeholder now? I'm proposing just returning the "true"
values of the na field instead of returning a copy of them.

Reply to this email directly or view it on GitHubhttps://github.com//issues/71#issuecomment-33903565
.

@johnmyleswhite
Copy link
Member Author

To make sure everybody's on the same page here: this change would make DataArrays mutable in a potentially dangerous way, since you could modify one of the underlying values array and/or missingness mask without modifying the other.

@simonster
Copy link
Member

Could we do:

isna(da::DataArray, inds...) = getindex(da.na, inds...)
values(da::DataArray, inds...) = getindex(da.data, inds...)

Seems like this would be less dangerous, and also more easily adapted to deal with PooledDataArrays.

@johnmyleswhite
Copy link
Member Author

That's a way better idea than mine.

@johnmyleswhite
Copy link
Member Author

One of the things that troubles me about this approach (which I think is still such an improvement over our current interface that we should move forward with it regardless) is it makes getindex and setindex! less symmetric.

@simonster
Copy link
Member

I see these not as desirable parts of our API, but as stopgap solutions until Julia does a better job of handling union types. We don't have a type inference problem for setindex!, so it makes sense for it to be asymmetric.

@johnmyleswhite
Copy link
Member Author

Agreed.

@johnmyleswhite
Copy link
Member Author

Spent some time discussing this exact issue (re. compilation of R) with Duncan Temple Lang the other day. If we manage to find a solution to nominal/ordinal variables that allows us to rid of PDA's, it might be nice to introduce a @notna macro that lets us write things like:

da = @data([1, 2, 3, NA, 5, NA])
s = 0
for i in 1:length(da)
    if !isna(da, i)
        @notna begin
            s += (da[i] * da[i]) / sin(da[i])
        end
    end
end

This could then get expanded into some trivial function call like getdata that is exactly getindex for most types and is getindex(da.data) for DataArrays.

@johnmyleswhite
Copy link
Member Author

In 80068de, I set up @simonster's generalized definition of isna. If there's no objection, I can also go ahead and set up a definition for values, which should maybe get called something less-overloaded like getdata.

@johnmyleswhite
Copy link
Member Author

Over time, I've come to change my mind about this issue. I now think that we will want, in the future, to implement option types as the representation of scalar values with potential missingness. We'd keep DataArrays around, but have all scalar indexing into them return wrapped option types. This would solve all of the type stability issues, although it requires moving our idioms further from R's.

@simonster
Copy link
Member

If I'm understanding you correctly, with this change we'd have to give up on passing DataArrays to functions that accept AbstractArrays. This would seem to move our idioms further from Julia's as well, and I'm not sure that's worth it. The generality is often useful despite the performance cost.

An alternative approach might be to create a macro that converts getindex calls to getindex_wrapped, which could return a wrapper type.

@johnmyleswhite
Copy link
Member Author

Why would we have to give up on AbstractArray? Under the hypothetical option type, you could even get rid of DataArray and just use Array{Option{T}}. In practice, DataArray{T} would behave like Array{Option{T}} except that it would store things in a different format.

@simonster
Copy link
Member

We could have DataArray{T} <: AbstractArray{Option{T}}, but we'd have to figure out how to delegate methods to the contained element for that to be useful. (This is tractable for arithmetic operators in Base, but maybe not so tractable for package code.) Even if we can make that work, we can't have Option{T<:S} <: S, so a method that takes a Number wouldn't take an Option{Number}, and a method that takes an AbstractArray{T<:Number} wouldn't take an DataArray{T<:Number} (although maybe we could define NumberOption{T<:Number} <: Number and StringOption{T<:String} <: String and give up on other types?).

I think we could probably make the most common idioms work, but other code that works fine with Arrays and currently works slowly with DataArrays would not work at all. I'm still hoping we'll get better handling of union types at the compiler level...

@johnmyleswhite
Copy link
Member Author

I'd like to have a longer chat with the folks working on the compiler about their opinions on this topic. I feel like working with Union types isn't the ideal long run solution, because we're putting so much effort into teaching people to avoid Union types in the parts of their code that aren't interacting with DataArrays.

From my perspective, I think we'd want to maintain our current assumption that DataArray{T} <: AbstractArray{T}. I don't think we'd want to have Option{T} <: T. In my mind, the virtue of using Option is that it makes most usage of a potentially NA value an error, which I like. If you want to apply foo(x::Number) to Option{Float64}, you'd need to resolve the option's uncertainty using a get function that produces an actual number. Otherwise, you'd immediately get a fatal error -- either a no method error or an error about a missing value. As such, you wouldn't need to define functions to work on Option.

To elaborate on that: my thinking is that you'd want to start with the simplest possible implementation of Option, which would just act as a wrapper around values that requires people to decide how NA should be treated before you could treat the Option object as a value. Specifically, we'd implement get, which would allow you to deference an object as long as it's not NA. If the value was missing, you'd immediately get an NAException when calling get.

In cases in which NA could be replaced by a value like NaN and produce reasonable results, you'd use get(o, NaN) as a way of getting the value or replacing a missing value with an acceptable default. In that case, you'd be able to get all the benefits of languages like R and Python, which use NaN for NA for speed. We'd still be slower, because of branches, but we'd make it easier to write code that runs quickly.

Stefan, Jeff and the rest are visiting my office on Friday. I'm planning to spend some time chatting with them about this then.

@nalimilan
Copy link
Member

This makes much sense as this would make DataArrays just a special case of a wider feature which would be supported an optimized by Julia everywhere. But there's still the problem that an Option type would need to store two fields: the actual value, and whether the value is missing or not. Thus, for efficiency, DataArrays would not be simply an array of Option objects, but remain stored as they currently are. So I'm not sure how this simplifies the implementation, nor how it allows it to run faster...

The other possibility is to allow the compiler to mark that Option objects are NULL/NA in the same memory structure as non-NULL values, i.e. like using NaN payloads for floats, etc. Then DataArrays could just be arrays of Options and still be memory-efficient, and Julia would take care of optimizing code, knowing it can throw an error whenever a NULL/NA is accessed.

@johnmyleswhite
Copy link
Member Author

Yes, we would keep DataArray as is.

The speedup comes from code like the following:

s, n = 0.0, 0
for i in 1:length(da)
    x = da[i]
    if !isna(x)
        s += get(x)
        n += 1
    end
end

This code has no type-instability, which makes it much easier for Julia to optimize under the assumption that s will always be a Float64. Right now, you can never index into a DataArray without suffering the cost of type-instability because da[i] always produces Union(NAtype, Float64) as the output's type.

The sentinel values approach used by R is interesting, but doesn't generalize to arbitrary types. It's much easier to allow people to do something like:

s = 0.0
for i in 1:length(da)
    s += get(da[i], NaN)
end

In cases where this is a sensible default, this lets people choose it as appropriate for their problem. Trying to invent a sensible default for all problems doesn't work.

@nalimilan
Copy link
Member

@johnmyleswhite But wouldn't x = da[i] incur an overhead given that the actual value and its missingness would have to be retrieved from the two fields of da, and an Option type be constructed from that? If it can be made efficient then indeed it would be a good solution.

(I don't really agree that "sentinel values" do not generalize to arbitrary types: it's mostly a problem for basic types like integers. There are many ways of storing a NULL/NA value in more complex types, and as most of them are just composite types, one just needs to make one of its fields an Option and make it NULL/NA to mark that the whole type is NULL/NA.)

@johnmyleswhite
Copy link
Member Author

The construction of an immutable type has almost no cost.

@simonster
Copy link
Member

AFAIK, the reason we teach people to avoid union types in parts of their code not interacting with DataArrays is that they are slow, not that there is anything inherently wrong with them. If we didn't want people to use them at all, I don't think they'd exist at all. There's a difference between what is presently fast and what can be made fast. I would suggest that we design around the latter and provide (hopefully temporary) workarounds for the former.

Optimizing union types is probably a decent amount of work, but it's something that JITs for dynamic languages generally do (see polymorphic inline caching). This particular case should be reasonably simple since the possible types can still be statically inferred. I believe what is needed for good performance is 1) to pass union types of immutables on the stack or in registers instead of allocating them on the heap and 2) to emit two direct calls with a branch instead of calling via jl_apply_generic.

I'm not sure if we can make DataArray{T} <: AbstractArray{T} if indexing actually returns Option{T}. It is already a bit weird that we don't have DataArray{T} <: AbstractArray{Union(T,NAtype)}, and some code definitely breaks because of this. I'm also not convinced that it's helpful to break generic code on DataArrays because elements could be NA even if no NAs are actually present.

The Option API seems reasonable if the user knows that they will be operating on a DataArray with NA values, but it seems less reasonable if one wants functions that accept arbitrary AbstactArrays to work on DataArrays as well, for which the current strategy seems best. If we are focusing on what can be made fast and not what is presently fast, I think we should keep the current indexing behavior, but maybe also provide an indexing API that returns Option types for the time being.

Here is the set of things I think we need for indexing a DataArray to be fast:

  1. Scalar indexing operations on a DataArray should not allocate on the heap
  2. Operations on the return value of indexing into a DataArray should not incur dynamic dispatch overhead
  3. Scalar indexing operations on a DataArray should be fully inlined
  4. Checks for undefined references for data and na should be removed or hoisted outside of the loop

(1) and (2) can be satisfied by either the Option API or at the compiler level. (3) and (4) are problems for either approach, although both are on the list at JuliaLang/julia#3440.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants