Generic performance problems #523

Closed

johnmyleswhite opened this issue Feb 2, 2014 · 25 comments

@johnmyleswhite
Contributor

The example below does a good job of capturing where our current interface makes things excessively difficult for type inference to do its work:

using DataArrays, DataFrames

srand(1)
n = 5_000_000
a = rand(n)
b = rand(n)
da = data(a)
db = data(b)
df = DataFrame(a = da, b = db)

Base.values(da::DataArray) = da.data

function dot1(a::Vector, b::Vector)
    x = 0.0
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

function dot2(da::DataVector, db::DataVector)
    T = eltype(da)
    x = 0.0
    for i in 1:length(da)
        x += da[i]::T * db[i]::T
    end
    return x
end

function dot3(df::DataFrame)
    da, db = df[:a], df[:b]
    T = eltype(da)
    x = 0.0
    for i in 1:length(da)
        x += da[i]::T * db[i]::T
    end
    return x
end

function dot4(df::DataFrame)
    da, db = df[:a], df[:b]
    return dot2(da, db)
end

function dot5(da::DataVector, db::DataVector)
    x = 0.0
    for i in 1:length(da)
        x += da.data[i] * db.data[i]
    end
    return x
end

function dot6(da::DataVector, db::DataVector)
    x = 0.0
    for i in 1:length(da)
        x += values(da)[i] * values(db)[i]
    end
    return x
end

function dot7(da::DataVector, db::DataVector)
    x = 0.0
    for i in 1:length(da)
        if !(isna(da, i) || isna(da, i)) # (typo: da is tested twice -- see simonster's comment below)
            x += values(da)[i] * values(db)[i]
        end
    end
    return x
end

function dot8(a::Vector, b::Vector)
    x = 0.0
    for i in 1:length(a)
        if !(isnan(a[i]) || isnan(a[i])) # (typo: a is tested twice -- see simonster's comment below)
            x += a[i] * b[i]
        end
    end
    return x
end

t1 = @elapsed dot1(a, b)
t2 = @elapsed dot2(da, db)
t3 = @elapsed dot3(df)
t4 = @elapsed dot4(df)
t5 = @elapsed dot5(da, db)
t6 = @elapsed dot6(da, db)
t7 = @elapsed dot7(da, db)
t8 = @elapsed dot8(a, b)

t1 = @elapsed dot1(a, b)
t2 = @elapsed dot2(da, db)
t3 = @elapsed dot3(df)
t4 = @elapsed dot4(df)
t5 = @elapsed dot5(da, db)
t6 = @elapsed dot6(da, db)
t7 = @elapsed dot7(da, db)
t8 = @elapsed dot8(a, b)

t1 = @elapsed dot1(a, b)
t2 = @elapsed dot2(da, db)
t3 = @elapsed dot3(df)
t4 = @elapsed dot4(df)
t5 = @elapsed dot5(da, db)
t6 = @elapsed dot6(da, db)
t7 = @elapsed dot7(da, db)
t8 = @elapsed dot8(a, b)

@printf "Vectors: %f\n" t1
@printf "DataVectors: %f\n" t2
@printf "DataVectors in DataFrame: %f\n" t3
@printf "DataVectors in DataFrame w/ Indirection: %f\n" t4
@printf "DataVectors w/ unsafe access: %f\n" t5
@printf "DataVectors w/ safish access: %f\n" t6
@printf "DataVectors w/ branch costs: %f\n" t7
@printf "Vectors w/ branch costs: %f\n" t8

This gives the following results, which are pretty representative of other runs on my laptop:

  • Vectors: 0.008418
  • DataVectors: 0.221481
  • DataVectors in DataFrame: 1.747264
  • DataVectors in DataFrame w/ Indirection: 0.214700
  • DataVectors w/ unsafe access: 0.008278
  • DataVectors w/ safish access: 0.007597
  • DataVectors w/ branch costs: 0.039523
  • Vectors w/ branch costs: 0.011451

Until we come up with something better, we might want to embrace the element-wise isna function I added, as well as something like the values function I've defined here (which, like isna, was an idea that Tim Holy proposed). That would at least get us to a sane place when using DataArrays.

If we follow my proposal to implement categorical data using an enum-type stored in a DataArray, we only need to optimize the performance of DataArrays going forward.

That still leaves the question of how to optimize the DataFrames case. Needing to introduce indirection to get proper type inference and good performance isn't a great long-term strategy.

@johnmyleswhite
Contributor Author

One extremely radical suggestion is to change DataArrays so that getindex never returns NA and has completely undefined behavior if an indexed element is NA. I'm not really cool with it, but it is the nuclear option.
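Concretely, that nuclear option would amount to something like this sketch (assuming the data/na field layout DataArrays uses today):

# "Nuclear" getindex sketch: always return the raw stored value. If da.na[i]
# is true, the result is whatever value happens to sit in da.data[i].
Base.getindex(da::DataArray, i::Integer) = da.data[i]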

@ivarne

ivarne commented Feb 2, 2014

Wouldn't a getindex that throws an exception on NA be about as performant as undefined behaviour?
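For comparison, a sketch of that throwing variant (again assuming the data/na layout; NAException is meant to be the error type DataArrays already defines):

# Hypothetical throwing getindex: raise an error instead of returning NA.
function Base.getindex(da::DataArray, i::Integer)
    da.na[i] && throw(NAException())
    return da.data[i]
end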

@tshort
Contributor

tshort commented Feb 2, 2014

Great summary, John. Handling the DataArrays typing issue is probably best done with isna/value as you outlined in JuliaStats/DataArrays.jl#71.

For the DataFrames column indexing issue, there are a couple of options. The simplest is just to have a macro helper. For example,

@expose df a b

could be shorthand for:

a, b = df[:a], df[:b]
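A rough sketch of how such a macro might be written (illustrative only, not a tested implementation):

# Hypothetical @expose: bind each named column to a like-named local variable.
macro expose(df, cols...)
    exprs = [:($(esc(col)) = $(esc(df))[$(QuoteNode(col))]) for col in cols]
    Expr(:block, exprs...)
end

# @expose df a b  then expands to:  a = df[:a]; b = df[:b]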

Another option is a @with macro that automatically creates column references before operations. It might look like the following for the example above.

function dot3(df::DataFrame)
    @with df begin
        x = 0.0
        for i in 1:length(:a)
            x += value(:a)[i] * value(:b)[i]
        end
        x
    end
end

This could be done along with @select and/or @linq for a more comprehensive approach to using macros for deferred operations and to use automatic column references.

@johnmyleswhite
Contributor Author

I think we can throw exceptions just as efficiently.

@johnmyleswhite
Contributor Author

@tshort, will the method you're proposing be type-inferrable?

I think the main problem is that Julia compiles functions en masse and the types of the two columns can't be known from the input type of DataFrame.

@tshort
Contributor

tshort commented Feb 2, 2014

You're right on the type inference, John. That won't fix that.

@tshort
Contributor

tshort commented Feb 2, 2014

Other than using the approach outlined in #471, I can't think of a way around the type inference issue for columns that doesn't require new Julia language features.

@tshort
Contributor

tshort commented Feb 2, 2014

Actually, maybe @with could create a function definition and then call that function. Needed columns would be passed into the function. All of the processing would be done in the function. Because the function would be JIT'd based on the types of the columns, it should be type stable.
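In other words, @with would expand to a function-barrier pattern. A hand-written sketch of the code such a macro might generate (the kernel name is hypothetical):

# The kernel is compiled for the concrete column types passed in, so the loop
# inside it is type stable even though the DataFrame's column types are
# unknown to the caller.
function dot_kernel(da, db)
    x = 0.0
    for i in 1:length(da)
        x += values(da)[i] * values(db)[i]
    end
    return x
end

dot_with(df::DataFrame) = dot_kernel(df[:a], df[:b])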

@simonster
Contributor

Re: @ivarne's comment, unfortunately I think throwing errors on NA would cost us at least a little bit because it would make getindex uninlineable.

Another thing to consider is that the bounds checks for the NA array really cost us (probably mostly because they make indexing uninlineable):

function dot9(da::DataVector, db::DataVector)
    x = 0.0
    for i in 1:length(da)
        if !(Base.getindex_unchecked(da.na.chunks, i) || Base.getindex_unchecked(db.na.chunks, i))
            x += values(da)[i] * values(db)[i]
        end
    end
    return x
end

On my machine, this is nearly twice as fast as dot8 (note that there is a typo in the NA checks in dot7 and dot8; both test the first array twice). This is why I use the @bitenumerate macro in operators.jl, which iterates over the NA array instead of indexing into it. I am not sure how we can actually expose this performance in a sane way to user code, though.

In the long term, it would be great if Julia could create more efficient code for the "naive" variants of these functions. I have a couple of ideas for this.

First, type inference can already determine that getindex on a DataArray returns either NA or the element type of the DataArray, but we still have to box the return value, which requires allocation, and we perform dynamic method lookup at runtime even though there are only two possible methods we could call. Since the compiler knows there are only two possible return types, it could reserve stack space for both and call the appropriate method directly with a branch instead of having to do dynamic method lookup.

Second, for cases like dot3, the compiler could determine that types inside the loop depend on unknown types outside the loop, and automatically "outline" that loop into a separate function, effectively generating the same code as dot4.

Together I think these improvements could make the naive approach give decent performance, but I have no idea if/when we're likely to get them.

@nalimilan
Member

I'm wondering whether it wouldn't be simpler to use bitpatterns to signal NAs. All type inference issues would go away with such a design...
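For Float64 columns that would mean reserving one quiet-NaN payload as the NA sentinel. A minimal sketch (the particular bit pattern is an arbitrary choice for illustration):

# One arbitrary quiet-NaN bit pattern reserved to mean NA.
const NA_BITS = 0x7ff00000deadbeef
const NA_FLOAT64 = reinterpret(Float64, NA_BITS)

# An element is NA iff its bits match the sentinel exactly; NaNs produced
# by ordinary arithmetic have other payloads and are not treated as NA.
isna_bits(x::Float64) = reinterpret(Uint64, x) == NA_BITS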

@simonster
Contributor

@nalimilan This is only possible for floating point DataArrays, and I created an issue for it a while back (JuliaStats/DataArrays#5). For Bools, Ints, Strings, and other types, there are no bit patterns available that aren't otherwise valid. For DataFrames there's also the second problem of knowing the types of columns, which bit patterns can't fix.

@HarlanH
Contributor

HarlanH commented Feb 2, 2014

Or, similarly, some sort of option to return a distinct value of the user's choice instead of an NA, sorta like the old NAreplace() iterator did.
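Presumably something along these lines (a hypothetical accessor, not an existing API, assuming the data/na field layout used elsewhere in this thread):

# Return the stored value, or a caller-supplied default wherever the
# element is NA.
value_or(da::DataArray, i::Integer, default) = da.na[i] ? default : da.data[i]

The inner loop of dot7 could then be written as x += value_or(da, i, 0.0) * value_or(db, i, 0.0) with no branch in user code.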


@tshort
Contributor

tshort commented Feb 3, 2014

I coded an @with macro that passes column references to a function that's called. For John's test case, the following code is as fast as dot5.

function dot9(df::DataFrame)
    @with df begin
        x = 0.0
        for i in 1:length(:a)
            x += values(:a)[i] * values(:b)[i]
        end
        x
    end
end

Here is a gist with the code for @with and John's example:

https://gist.github.com/tshort/8778410#file-withmacro-md

@johnmyleswhite
Contributor Author

@tshort: Your approach is really cool. What do you think of creating a DataFramesMeta package to explore experimental applications of metaprogramming to DataFrames?

@tshort
Contributor

tshort commented Feb 4, 2014

@johnmyleswhite, I'll create the metaprogramming package. Is that appropriate under JuliaStats, or should I just put it under my own account?

@tshort
Contributor

tshort commented Feb 5, 2014

See the following for DataFramesMeta package that includes @with and @select:

https://github.com/tshort/DataFramesMeta.jl

It's already better than the expression-based approaches we had. There are plenty of options to expand this. It needs more eyeballs to see where the kinks are.

This package isn't in METADATA.

@johnmyleswhite
Contributor Author

Thanks so much for doing this. I'm so much happier having this functionality in a place where it can grow until it's mature enough to merge into DataFrames.

I'd be cool with moving this package into JuliaStats as long as we advertise right at the start that the package is experimental.

@nalimilan
Member

Looks great. @with seems quite similar to what I described about recoding data at #369 (comment); the only difference is that the for loop and [i] indexing would be implicit to iterate over rows.

@johnmyleswhite
Contributor Author

Tom, you've got the required permissions now.

@gpmauroy

I am new to Julia and still don't follow many details of the various proposed solutions I've read, including the @with macro from @tshort; I still have lots to learn about macros, expressions, symbols, etc. So please forgive me if my attempt at offering two cents below is lame.

My understanding is that there are two problems:

  1. NA: the overhead from dealing with NA values that could be anywhere.
  2. Type inference: the column-array types buried inside the DataFrame prevent the compiler from knowing the column types at compile time, hence generating sub-optimal code.

So here are my practical views on those.

1- NA overhead.
For practical purposes, I try to deal with NA early, in the data preprocessing/scrubbing phase. If I control the data source, I make sure it does not contain NA in the first place. If I don't control the data source and do get NA, then I want to deal with all of them one way or another: typically either filtering the corresponding records out or replacing the NA with a meaningful default value that accomplishes whatever strategy I have set for dealing with them, evaluated case by case, since the implications can be serious and I don't like letting those slip under the rug. So by the time I get into the modeling process (usually the biggest phase, though data preprocessing can be tedious/painful as well), I may have multiple versions of my original data in different shapes and forms, but I make sure I have no NA values left anywhere, with a good report from preprocessing on their statistics and implications. Perhaps I have been lucky or sheltered to have gotten away so easily without dealing with NA throughout my modeling process, and perhaps that is not practical for most, but it has worked for me so far.
So what I am driving at is that, to me, it would make sense to have two types of DataFrames:

1.1- NaDataFrame (sorry for the bad name!) == DataFrame as of today: a DataFrame dealing with NA and incurring the performance cost so it can do it right. This is the data frame I would use during my initial data preprocessing phase, which is in charge of preparing data frames for my modeling phase. I can put up with the performance hit in that phase because, other than the dev/test work usually done on a small data subset, I run it once and I am done until I get a new data set. So I am OK if it takes a long time to run (within reason, which the NA performance hit is).

1.2- DenseDataFrame, a data frame assuming no NA: getindex gets us straight to the data, no checking.
This is the one I would use in my modeling stage, where I typically spend most of the time. Here performance is critical, particularly on big data sets, because it means I can evaluate more models and run on bigger data sets; it can be mission critical.
Does it make sense?
Would it be easy to implement such that both types share a lot of the implementation through a super type?
(I am still a bit confused by the hierarchy having "empty" abstract super types; I don't yet see how to make effective use of them, but that is learning I have to do.)

2- Type inference.
Again approaching this one from a practical perspective as opposed to that of a language purist: even when I have hundreds of columns in my data frames, most columns share the same 1-3 underlying types -- e.g., one for my IDs (typically integer or string), one for my numeric values (typically Float64), and then my categorical variables, which I will assume to be strings for now (in theory they are a finite list, called factors in the R world, and could be enums in other languages when statically known, but I have not gotten there yet with Julia, so I will leave that case for now).

So, to make a long story short: we need to expose the column value types, but we don't have to expose each of them; we just need a practical approach to parametrizing a handful.
For this purpose, I would find it natural to be able to get a typed data frame, something that could look like:

TypedDataFrame{IdType, NumericType, CategoryType1, CategoryType2}

Those types would be very custom to each scenario, i.e., each data frame; they come with the specifics of the data frame at hand, for which we know which columns those types refer to. Then in your code you could annotate your arrays accordingly, and performance would follow.

As an example, below are two dot functions exemplifying this approach: dot9 with the type as a function parameter (ugly in my opinion, because it is redundant with the types in the data frame), and dot10 along the lines of TypedDataFrame, though there I use a simple container model because I don't know how to implement what I'd really want -- not ideal, but it hopefully shows the principle. Those two are top performers.

Does it make sense?
Would it be easy to implement such that the TypedDataFrame is a subtype of an AbstractDataFrame as opposed to being a container for a DataFrame?

module TypedDataFrames

using DataFrames

export AbstractTypedDataFrame, TypedDataFrame, valtypeof, dataframe

# Typed DataFrame.
abstract AbstractTypedDataFrame # <: AbstractDataFrame
immutable TypedDataFrame{T <: Number} <: AbstractTypedDataFrame
  df::DataFrame
  dvt::T
  TypedDataFrame(dfpar::DataFrame) =  new(dfpar, zero(T))
end
valtypeof(df::TypedDataFrame) = typeof(df.dvt) # T
dataframe(df::TypedDataFrame) = df.df
# Base.convert(DataFrame, df::TypedDataFrame) = df.df

end # module TypedDataFrames

function dot9{T}(df::DataFrame, dummyForType::T)
  da, db = df[:a].data::Array{T}, df[:b].data::Array{T} # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    x += da[i] * db[i]
  end
  return x
end

function dot10(tdf::TypedDataFrame)
  T = valtypeof(tdf); # Those would be specific to my data frame, one such call for each type.
  df = dataframe(tdf); # Ideally I would like tdf[:a] == df[:a] to bypass this call.
  da, db = df[:a].data::Array{T}, df[:b].data::Array{T} # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    x += da[i] * db[i]
  end
  return x
end

@printf "0. Vectors type as function param: %f\n" t0
@printf "1. Vectors Generic: %f\n" t1
@printf "2. DataVectors: %f\n" t2
@printf "3. DataVectors in DataFrame: %f\n" t3
@printf "4.2. DataVectors in DataFrame w/ Indirection: %f\n" t42
@printf "4.0. Vectors{T} in DataFrame w/ Indirection: %f\n" t40
@printf "4.1. Vectors in DataFrame w/ Indirection: %f\n" t41
@printf "5. DataVectors w/ unsafe access: %f\n" t5
@printf "6. DataVectors w/ safish access: %f\n" t6
@printf "7. DataVectors w/ branch costs: %f\n" t7
@printf "8. Vectors w/ branch costs: %f\n" t8
@printf "9. DataFrame Array type as function param, assumes no NA: %f\n" t9
@printf "10. TypedDataFrame Array type, assumes no NA: %f\n" t10

  0. Vectors type as function param: 0.010209
  1. Vectors Generic: 0.010306
  2. DataVectors: 0.311630
  3. DataVectors in DataFrame: 2.347728
  4.2. DataVectors in DataFrame w/ Indirection: 0.262216
  4.0. Vectors{T} in DataFrame w/ Indirection: 0.010246
  4.1. Vectors in DataFrame w/ Indirection: 0.010295
  5. DataVectors w/ unsafe access: 0.013386
  6. DataVectors w/ safish access: 0.013216
  7. DataVectors w/ branch costs: 0.050199
  8. Vectors w/ branch costs: 0.010699
  9. DataFrame Array type as function param, assumes no NA: 0.010436
  10. TypedDataFrame Array type, assumes no NA: 0.010065

dot0 is the same as dot1 with the type T explicit in Vector (it does not help performance); dot42 is the same as John's dot4; dot40 and dot41 are variants of dot4's indirection that resolve type inference through dot0 and dot1 by accessing the column arrays directly.
Here they are for completeness, even though they make the same point as before -- accessing the underlying array directly with a known type gets top performance:

function dot0{T}(a::Vector{T}, b::Vector{T})
    x = zero(T)
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

function dot40(df::DataFrame)
    T = eltype(df[:a])
    da, db = df[:a].data::Array{T}, df[:b].data::Array{T} # Assumes dense arrays, no NA value.
    return dot0(da, db)
end

function dot41(df::DataFrame)
    da, db = df[:a].data, df[:b].data # Assumes dense arrays, no NA value.
    return dot1(da, db)
end

@gpmauroy

Combining the above NA & Typed DataFrame strategies, I think we'd end up with something like:

  1. NaDataFrame == current DataFrame: with NA support, no type annotation -- the slowest, but the most robust and the least effort to use, as no type annotation in code is needed.
  2. DenseDataFrame: no NA support, no type annotation -- small performance gain over (1) if we know we have no NA but are too lazy to type the arrays.
  3. TypedDenseDataFrame: no NA support, with type annotation -- the fastest, with optimized speed, a couple orders of magnitude of performance improvement over non-parametric data frames.
  4. TypedNaDataFrame: with NA support, with type annotation -- small performance hit over (3) to support NA, still a couple orders of magnitude of performance improvement over any non-parametric data frame.

My updated performance tests look as follows.

function dot10_1(tdf::TypedDataFrame) # ~ NaDataFrame.
  T = valtypeof(tdf); # Unnecessary if parametric function.
  df = dataframe(tdf); # Unnecessary if we were to override the TypedDataFrame.[] operator.
  da, db = df[:a].data, df[:b].data # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    if !(isna(da, i))
      x += da[i] * db[i]
    end
  end
  return x
end

function dot10_2(tdf::TypedDataFrame) # ~ TypedDenseDataFrame.
  T = valtypeof(tdf); # Unnecessary if parametric function.
  df = dataframe(tdf); # Unnecessary if we were to override the TypedDataFrame.[] operator.
  da, db = df[:a].data::Array{T}, df[:b].data::Array{T} # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    x += da[i] * db[i]
  end
  return x
end

function dot10_3{T}(tdf::TypedDataFrame{T}) # ~ DenseDataFrame.
  df = dataframe(tdf); # Unnecessary if we were to override the TypedDataFrame.[] operator.
  da, db = df[:a].data, df[:b].data # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    x += da[i] * db[i]
  end
  return x
end

function dot10_4{T}(tdf::TypedDataFrame{T}) # ~ TypedDenseDataFrame
  df = dataframe(tdf); # Unnecessary if we were to override the TypedDataFrame.[] operator.
  da, db = df[:a].data::Array{T}, df[:b].data::Array{T} # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    x += da[i] * db[i]
  end
  return x
end

function dot10_5{T}(tdf::TypedDataFrame{T}) # ~ NaDataFrame.
  df = dataframe(tdf); # Unnecessary if we were to override the TypedDataFrame.[] operator.
  da, db = df[:a].data, df[:b].data # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    if !(isna(da, i))
      x += da[i] * db[i]
    end
  end
  return x
end

function dot10_6{T}(tdf::TypedDataFrame{T}) # ~ TypedNaDataFrame.
  df = dataframe(tdf); # Unnecessary if we were to override the TypedDataFrame.[] operator.
  da, db = df[:a].data::Array{T}, df[:b].data::Array{T} # Assumes dense arrays, no NA value.
  x = zero(T)
  for i in 1:length(da)
    if !(isna(da, i))
      x += da[i] * db[i]
    end
  end
  return x
end

@printf "0- Vectors type as function param: %f\n" t0
@printf "1- Vectors Generic: %f\n" t1
@printf "2- DataVectors: %f\n" t2
@printf "3- DataVectors in DataFrame: %f\n" t3
@printf "4.2- DataVectors in DataFrame w/ Indirection: %f\n" t4_2
@printf "4.0- Vectors{T} in DataFrame w/ Indirection: %f\n" t4_0
@printf "4.1- Vectors in DataFrame w/ Indirection: %f\n" t4_1
@printf "5- DataVectors w/ unsafe access: %f\n" t5
@printf "6- DataVectors w/ safish access: %f\n" t6
@printf "7- DataVectors w/ branch costs: %f\n" t7
@printf "8- Vectors w/ branch costs: %f\n" t8
@printf "9.1- DataFrame Array type parametric function, assumes no NA - no array type annotation: %f\n" t9_1
@printf "9.2- DataFrame Array type parametric function, assumes no NA - with array type annotation: %f\n" t9_2
@printf "10.1- TypedDataFrame parametric type, no parametric function, checks for NA - no array type annotation ~ NaDataFrame: %f\n" t10_1
@printf "10.2- TypedDataFrame parametric type, no parametric function, assumes no NA - with array type annotation ~ TypedDenseDataFrame: %f\n" t10_2
@printf "10.3- TypedDataFrame parametric type, parametric function, assumes no NA - no array type annotation ~ DenseDataFrame: %f\n" t10_3
@printf "10.4- TypedDataFrame parametric type, parametric function, assumes no NA - with array type annotation ~ TypedDenseDataFrame: %f\n" t10_4
@printf "10.5- TypedDataFrame parametric type, parametric function, checks for NA - no array type annotation ~ NaDataFrame: %f\n" t10_5
@printf "10.6- TypedDataFrame parametric type, parametric function, checks for NA - with array type annotation ~ TypedNaDataFrame: %f\n" t10_6

0- Vectors type as function param: 0.012684
1- Vectors Generic: 0.018507
2- DataVectors: 0.269310
3- DataVectors in DataFrame: 2.307525
4.2- DataVectors in DataFrame w/ Indirection: 0.305576
4.0- Vectors{T} in DataFrame w/ Indirection: 0.010220
4.1- Vectors in DataFrame w/ Indirection: 0.010038
5- DataVectors w/ unsafe access: 0.013021
6- DataVectors w/ safish access: 0.013105
7- DataVectors w/ branch costs: 0.033408
8- Vectors w/ branch costs: 0.010308
9.1- DataFrame Array type parametric function, assumes no NA - no array type annotation: 2.054180
9.2- DataFrame Array type parametric function, assumes no NA - with array type annotation: 0.012109
10.1- TypedDataFrame parametric type, no parametric function, checks for NA - no array type annotation ~ NaDataFrame: 2.420923
10.2- TypedDataFrame parametric type, no parametric function, assumes no NA - with array type annotation ~ TypedDenseDataFrame: 0.011458
10.3- TypedDataFrame parametric type, parametric function, assumes no NA - no array type annotation ~ DenseDataFrame: 2.092206
10.4- TypedDataFrame parametric type, parametric function, assumes no NA - with array type annotation ~ TypedDenseDataFrame: 0.010671
10.5- TypedDataFrame parametric type, parametric function, checks for NA - no array type annotation ~ NaDataFrame: 2.499859
10.6- TypedDataFrame parametric type, parametric function, checks for NA - with array type annotation ~ TypedNaDataFrame: 0.011038

We can simplify the parametric DataFrame definition a bit by removing the dummy variable for each type of interest, as the type annotation suffices -- we just need to parametrize the type accessor function valtypeof for each type of interest.

module TypedDataFrames

using DataFrames

export AbstractTypedDataFrame, TypedDataFrame, valtypeof, dataframe

# Typed DataFrame.
abstract AbstractTypedDataFrame # <: AbstractDataFrame
immutable TypedDataFrame{T <: Number} <: AbstractTypedDataFrame
  df::DataFrame
end
valtypeof{T}(::TypedDataFrame{T}) = T
dataframe(df::TypedDataFrame) = df.df
# Base.convert(DataFrame, df::TypedDataFrame) = df.df

end # module TypedDataFrames

We also don't need the type accessors (valtypeof) when we are willing to parametrize the functions taking a TypedDataFrame as an argument -- see dot10_3 to dot10_6 above as examples, vs. dot10_1 and dot10_2, which do need to call valtypeof because they are not parametric functions.
I suppose it is a matter of style. Personally, I prefer the type annotation over calling an extra function.

This strategy does not suffer from creating new data frames for every column addition or deletion, because the exposed types are conceptual: they apply to groups of columns, whatever makes sense for each data frame. To get optimized speed with this strategy, the extra usage effort is:

  1. Conceptualize the column types.
  2. Define the TypedDataFrame -- hopefully a good macro will simplify this process so it is concise.
  3. Type-annotate each array access for which performance is desired. This does not strike me as a big job, particularly when accessing columns by name (i.e., symbol), though it may be in some complex scenarios with lots of heterogeneous types. There might be enhanced strategies I cannot foresee at this point -- e.g., a smart macro utility to associate column symbols with their type among the exposed ones, or better, a fully automated initialization step (if possible; it is not clear to me that it is) checking the column types against the exposed ones, so that the type annotation could perhaps be embedded in a macro.

For anyone a bit familiar with macros, I suppose defining a TypedDataFrame could easily be embedded in a macro. And there ought to be a simple solution so TypedDataFrame objects behave just like regular DataFrames for accessing information -- at worst, override the [] operator and/or perhaps the Base.convert(DataFrame, df::TypedDataFrame) function; I am not sure what the proper way is here, and I have not experimented with it yet.

I am afraid this is nonsense to the language purist, but for practical purposes, for someone like me who likes explicit type annotation in the first place for type clarity (despite it being slightly more verbose), such a solution would work like a charm to get optimized performance in a practical way, without the ugliness of passing around dummy variables just for type information.

My questions are:

  1. How bad/ugly is such a strategy compared to all the others, which I am not yet at a point to fully understand and appreciate?
  2. What would be the simplest, most elegant, idiomatic way to define those data frame type variants so common code is reused? Is it possible to define those various types without a container model by leveraging a well-thought-out abstract type hierarchy relating them? If so, how?

I know this is a bit long, but I am trying to be as clear as possible, since I believe this would be an acceptable solution for me. Again, sorry if it is nonsense noise.

Unless this is just a hugely bad idea, I would appreciate any help in defining a macro to do the job, perhaps starting with the container model for lack of a better one -- defining macros is one of my next priorities to learn...

Thanks.

@johnmyleswhite
Contributor Author

@gpmauroy,

Thanks for your ideas. You've got a lot of interesting points. I've been suffering from a bout of RSI lately, so I can't respond to a message as long as yours in full. Instead, I'll try to be brief.

We have thought a lot about the issues you're raising. I would strongly encourage you to do more reading in this repo and in DataArrays, rather than try to implement something yourself without a full understanding of how we've been thinking about these concerns.

In that vein, here is a very high-level overview of our strategy:

  • Remove NA completely and replace it with the parametric type Nullable that was added to Julia's Base yesterday. This is the most important change by far.
  • Replace DataArray with NullableArray, which is similar, but produces a Nullable object when given a scalar index. This will remove the need for functions like isna(DataArray, Index).
  • Replace PooledDataArray with CategoricalArray and OrdinalArray to capture the full semantics of R's factors.
  • Clarify the way that DataFrames already work. In particular, they already support the insertion of Array columns in addition to the usage of DataArray. They do not support strong typing, but that is a point that others have considered before. It is a subtle problem, which would be very informative to work on, but it is also a problem I would suggest returning to after you've gained a greater mastery of how Julia works.

@gpmauroy

@johnmyleswhite
Thanks for your comments.

Sorry for my lengthy notes; I just wanted to express my points as clearly as possible.

I definitely have a lot to learn in Julia, I will keep on reading when I can.

All the enhancements you mention sound great. But I need a stopgap approach on 0.3 until we have a stable release with all of those great features, which will help resolve, in particular, the type inference issue that is the performance bottleneck.

I have plenty of questions, way too many and not appropriate to discuss in this ticket.
When I get a chance, I may drop a targeted question here and there in the user forum.

Thanks for all your contributions in the Julia community.

@johnmyleswhite
Contributor Author

There are two stopgaps:

  • Replace DataArray with Array as soon as possible
  • Do all of your work inside a function that takes columns as arguments.
df = DataFrame(A = 1:3, B = 2:4)
vA = array(df[:A]) # vA = convert(Vector, df[:A])
vB = array(df[:B]) # vB = convert(Vector, df[:B])
output = dowork(vA, vB)
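Here dowork is just a placeholder for whatever computation you need; the point is that it receives concretely typed Vectors, so everything inside it compiles to type-stable code. An illustrative definition:

# Placeholder kernel: vA and vB arrive as plain Vectors with concrete
# element types, so this loop is type stable and fast.
function dowork(vA::Vector, vB::Vector)
    s = zero(eltype(vA))
    for i in 1:length(vA)
        s += vA[i] * vB[i]
    end
    return s
end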

@quinnj
Member

quinnj commented Sep 8, 2017

It would be valuable to re-run some of the original benchmarks here w/ the transition from DataArray to Vector{T}/Vector{Union{T, Null}}, particularly on 0.7 current master (9/7/2017) where performance optimizations have been implemented for Vector{Union{T, Null}} where isbits(T) == true.
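A sketch of what such a re-run might look like (assuming a 0.7-dev build and the Nulls.jl package providing null and the Union{T, Null} element type; details here are assumptions):

using Nulls

n = 5_000_000
a = Union{Float64, Null}[rand() for _ in 1:n]
b = Union{Float64, Null}[rand() for _ in 1:n]

function dot_union(a, b)
    x = 0.0
    for i in 1:length(a)
        ai, bi = a[i], b[i]
        # Skip null entries; with an isbits element type the Union is stored
        # inline, so this check compiles to a cheap type-tag test.
        if !(ai isa Null || bi isa Null)
            x += ai * bi
        end
    end
    return x
end

@elapsed dot_union(a, b)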
