
WIP: collect without inference #135

Merged 11 commits into master on Mar 19, 2018

Conversation

@piever (Collaborator) commented Mar 18, 2018

Collect an iterable of Tuples or NamedTuples into a Columns object, without using inference. The Columns object is initialized from the type of the first element and progressively widened should that type change.
I assume the iterator always yields Tuples (or NamedTuples) of the same length and, in the NamedTuple case, always with the same field names.

The condition to trigger widening is that at least one field of the iterated tuple is not a subtype of the eltype of the corresponding column.

Widening happens element-wise, meaning I perform typejoin on a field-by-field basis.

To work with missing in the future, typejoin is not the correct operation, as:

julia> typejoin(Missing, Int)
Any

but Base.promote_typejoin should be sufficient to fix this.

The overall strategy is more or less a copy-paste of the strategy Base Julia uses to collect generators of unknown element type.
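To make the approach concrete, here is a minimal sketch of the idea, not the PR's actual code: plain Vectors stand in for the Columns object, only plain Tuples are handled, and Julia 0.6's start/done/next iteration protocol is used to match the diff discussed below.

# Minimal sketch (assumes a non-empty iterator of Tuples with a fixed number of fields).
function collect_tuples_sketch(itr)
    st = start(itr)
    el, st = next(itr, st)
    T = typeof(el)                          # Tuple type of the first element
    cols = Any[typeof(x)[x] for x in el]    # one Vector per field
    while !done(itr, st)
        el, st = next(itr, st)
        S = typeof(el)
        if !all(s <: t for (s, t) in zip(S.parameters, T.parameters))
            # widen field-by-field with typejoin and copy the old data over
            # (the PR refines this to copy only the offending columns)
            T = Tuple{[typejoin(s, t) for (s, t) in zip(S.parameters, T.parameters)]...}
            cols = Any[copy!(Vector{t}(length(c)), c) for (c, t) in zip(cols, T.parameters)]
        end
        for (c, x) in zip(cols, el)
            push!(c, x)
        end
    end
    return cols
end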

@piever changed the title from "collect without inference" to "WIP: collect without inference" on Mar 18, 2018

@nalimilan (Member) left a comment

AFAICT that should work, but will suffer from the problem described at JuliaLang/julia#25925, i.e. that a full copy will have to be made each time a missing value is encountered for the first time in a field. So it can be relatively efficient if the number of columns is small and some missing values appear early, but very slow if that's not the case.

src/collect.jl Outdated
while !done(itr, st)
el, st = next(itr, st)
S = typeof(el)
if all((s <: t) for (s, t) in zip(S.parameters, T.parameters))
Member:

Is this inferable if eltype(itr) is concrete or once T has become at least as wide as it? That's essential for performance in Base map.

Collaborator Author:

I don't have a strong enough intuition for what the compiler can optimize. Here Base uses el isa T || typeof(el) === T. What exactly should I check to test inferrability? Maybe making this a function, is_fieldwise_subtype(el::S, ::Type{T}) = all((s <: t) for (s, t) in zip(S.parameters, T.parameters)), would help the compiler, but I'm not sure.

Member:

@code_warntype will tell you whether the return type is inferred. Using a helper function could indeed help (maybe after marking it as Base.@pure, but the chances are that I'm wrong).

Collaborator Author:

I've checked with constant eltype and now it is type stable. Somehow the way to achieve that was to use a generated function to check is_fieldwise_subtype, which we should do anyway for performance. It's not inferrable in the NamedTuples case (the initialization step arrayof, which we already use in map, is not inferrable), but I hope all the NamedTuples business will be much better in 0.7.
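For reference, such a check can look roughly like this (a sketch, not the exact implementation): the @generated function only sees the types, so the field-wise subtype test becomes a compile-time constant.

@generated function is_fieldwise_subtype(el::S, ::Type{T}) where {S <: Tuple, T <: Tuple}
    # computed at compile time from the type parameters, so the collect loop
    # stays type stable as long as the element type does not change
    all(s <: t for (s, t) in zip(S.parameters, T.parameters)) ? :(true) : :(false)
end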

Collaborator Author:

Still, I should probably benchmark it to make sure it's doing OK compared with the inference-based map...

Member:

There's probably a way to achieve the same result without a generated function, but these things are always tricky...

@piever (Collaborator) commented Mar 18, 2018

Concerning missingness, I see the concern. One optimization I can do is to not copy the whole thing, but instead figure out which is the "offending column" and only copy that one (there is a setcol function to change one column without copying the rest: it should be pretty efficient). This way each column is copied at most once. Of course, if there was a way to widen a vector to a vector that allows missing without copying, this would be even better: is there progress on that front?
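A sketch of that idea (essentially the widencolumns logic reviewed further down; widen_one_column is a made-up name): when field l needs a wider eltype T, allocate one new vector, copy over the i-1 elements collected so far, and swap only that column in with setcol.

function widen_one_column(dest, i, l, ::Type{T}) where {T}
    newcol = Vector{T}(length(column(dest, l)))   # wider storage for field l (Julia 0.6 constructor)
    copy!(newcol, 1, column(dest, l), 1, i - 1)   # keep what was already collected
    return setcol(dest, l, newcol)                # the other columns are left untouched
end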

I'm starting to think that with the "columnar storage" trick the situation is not nearly as bad as with a vector of NamedTuples. If copying each column at most once is considered acceptable, we may be able to skip inference altogether (plus, with your promote_typejoin PR, we wouldn't really need to special-case the code so much to accommodate Missing).

@nalimilan (Member):

> Concerning missingness, I see the concern. One optimization I can do is to not copy the whole thing, but instead figure out which is the "offending column" and only copy that one (there is a setcol function to change one column without copying the rest: it should be pretty efficient). This way each column is copied at most once.

Good trick!

> Of course, if there was a way to widen a vector to a vector that allows missing without copying, this would be even better: is there progress on that front?

I don't think there's been any progress (the priority is merging breaking changes for 1.0).

> I'm starting to think that with the "columnar storage" trick the situation is not nearly as bad as with a vector of NamedTuples. If copying each column at most once is considered acceptable, we may be able to skip inference altogether (plus, with your promote_typejoin PR, we wouldn't really need to special-case the code so much to accommodate Missing).

Yeah, it should be more reasonable, at least if we are able to make convert non-copying. That's an interesting approach to test.

@piever (Collaborator) commented Mar 19, 2018

Benchmarks are very encouraging; for some reason this is actually faster than our current implementation in the type-stable case...

julia> using IndexedTables, BenchmarkTools

julia> s = table(rand(1000), rand(1000), names = [:x, :y]);

julia> f(i) = @NT(sum = i.x + i.y, diff = i.x - i.y)
f (generic function with 1 method)

julia> map1(f, s) = table(collectcolumns(f(i) for i in s), copy = false, presorted = true)
map1 (generic function with 1 method)

julia> @benchmark map($f, $s)
BenchmarkTools.Trial: 
  memory estimate:  105.02 KiB
  allocs estimate:  2554
  --------------
  minimum time:     64.973 μs (0.00% GC)
  median time:      72.453 μs (0.00% GC)
  mean time:        83.847 μs (7.97% GC)
  maximum time:     2.059 ms (91.21% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark map1($f, $s)
BenchmarkTools.Trial: 
  memory estimate:  17.83 KiB
  allocs estimate:  44
  --------------
  minimum time:     13.858 μs (0.00% GC)
  median time:      15.307 μs (0.00% GC)
  mean time:        17.561 μs (4.52% GC)
  maximum time:     1.198 ms (91.14% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> map1(f, s) == map(f,s)
true

I still need to:

  • Implement the unknown length iterator case

  • Generalize to work with scalars and just collect an Array

  • Decide on a name for collectcolumns (though maybe collectcolumns is not so bad)

And then we could start porting things to use this instead of inference.

@shashi (Collaborator) commented Mar 19, 2018

That's pretty awesome! The patch looks good to me!! :-)

Now the challenge is figuring out how to implement groupreduce and groupby as iterators

Then there is the elephant in the room --- join :-)

else
return :(false)
end
end
Collaborator:

I think this doesn't need to be @generated; a Base.@pure or @inline might work here. But I guess once a package has @generated, it really doesn't matter how many of them there are :-p

Collaborator Author:

Base.@pure seemed to not work as nicely; maybe we can leave it as is for now and do a "remove all generated functions" PR in the future (maybe on Julia 0.7).

src/collect.jl Outdated
copy!(newcol, 1, column(dest, l), 1, i-1)
new = setcol(new, l, newcol)
end
new
Collaborator:

really liking how this function reads.

@shashi (Collaborator) commented Mar 19, 2018

> but Base.promote_typejoin should be sufficient to fix this.

I like the sound of that!

@shashi (Collaborator) commented Mar 19, 2018

Oops, I think the current implementation of map allocates, probably because of JuliaLang/julia#15276, since we close over fs?

@shashi (Collaborator) commented Mar 19, 2018

If I just call f instead of fs (which understandably isn't inferable), the performance is 20% slower in the relevant portions (allocation + filling), but I think we can take that for the benefits and simplicity this PR affords.

@piever (Collaborator) commented Mar 19, 2018

I see; I couldn't really understand how this PR could be faster. 20% sounds fair: the last step of converting to a table could account for that. map2(f, s) = collectcolumns(f(i) for i in s) is much faster than map1, though, and I'm not sure why: I thought converting to a table was free.

In terms of groupby and groupreduce, I made the SizeUnknown implementation for them; after we merge this one I can see how to translate them to use it (groupreduce is already using this PR's strategy, but assuming a concrete grouping). For join we should actually change typejoin.

We should (before 0.7; after that there is @nalimilan's PR to rescue us) define our own promote_typejoin and add promote_typejoin(::Type{<:DataValue}, ::Type{T}) = DataValue{T}, and then it should work out of the box (though we may also need to tweak the s <: t part to allow Float64 <: DataValue{Float64}, of course using our own version of <:). Better to leave it for last though.
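Roughly, the idea would be the following (a hypothetical sketch: our_promote_typejoin and our_issubtype are placeholder names, nothing below exists in the package yet):

using DataValues

# widening rule: a DataValue column joined with a plain element type keeps the wrapper
our_promote_typejoin(::Type{<:DataValue}, ::Type{T}) where {T} = DataValue{T}
our_promote_typejoin(::Type{S}, ::Type{T}) where {S, T} = typejoin(S, T)

# relaxed `s <: t` check, so a plain Float64 element "fits" a DataValue{Float64} column
our_issubtype(::Type{S}, ::Type{T}) where {S, T} = S <: T
our_issubtype(::Type{S}, ::Type{DataValue{T}}) where {S, T} = S <: T || S <: DataValue{T}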

A possible strategy could be, in the following order:

  • Decide on a name for collectcolumns and merge this PR (could also leave collectcolumns)
  • port map, which should be trivial
  • port groupby and groupreduce which could be a bit trickier
  • think about join and revisit this PR to adjust as needed.

The tricky thing about group is that it also uses inference to decide whether to wrap f in a function or not, so maybe that's not so easy.

@shashi (Collaborator) commented Mar 19, 2018

> I thought converting to a table was free.

That is suspicious actually; it's the simplest wrapper when copy=false, presorted=true... Worth profiling.

@shashi (Collaborator) commented Mar 19, 2018

I was not timing the table creation, and that could have been the 20% overhead... I did some profiling; it looks like it's just the overhead of dispatching through many methods with keyword arguments.

@shashi (Collaborator) commented Mar 19, 2018

> define our own promote_typejoin and add promote_typejoin(::Type{<:DataValue}, ::Type{T}) = DataValue{T}

julia> promote_type(DataValue{Int}, Float64)
DataValues.DataValue{Float64}

why not use promote_type?

I think the 4-step plan sounds perfect. collectcolumns seems like a good name. Maybe collect_columns is easier to read, and mirrors the use of _ after collect in Base.

Cheers!

@piever (Collaborator) commented Mar 19, 2018

promote_type is actually an interesting idea; I think I'll switch to that. I like it because it will give a tighter type than typejoin; then I guess we'll have to see in practice whether that's enough to get join to work out of the box. I'll switch to that (and to collect_columns, which I also think reads better), add docstrings, and I think it's good to go.

@nalimilan (Member):

One potential issue with promote_type is that promote_type(Int, Float64) === Float64, which means that big integers would be silently rounded. OTOH that's what happens with cat on arrays already, so...
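For example, an Int beyond Float64's 53-bit integer range cannot round-trip once the column is promoted:

julia> promote_type(Int, Float64)
Float64

julia> Int(Float64(2^53 + 1)) == 2^53 + 1
false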

piever added 2 commits March 19, 2018 16:38
@piever (Collaborator) commented Mar 19, 2018

I saw that but I think it's better than returning an Array{Real}, especially if there are precedents like vcat. The use case of mixing up large Int64 and Float64 while wanting to preserve the Int64 exactly (where they are already at a size that risks overflow) seems limited.

@shashi (Collaborator) commented Mar 19, 2018

There's a _promote_type in JuliaDB.jl which does promote_type on all the fields of a tuple (using map_params). Maybe we should move that here and reuse it there. It's a minor detail that will need an in-tandem release. Everything looks good here: feel free to hit merge!
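The general shape of such a field-wise promotion (a sketch, not JuliaDB's actual _promote_type, which is map_params-based) is simply:

# promote each field of two Tuple types pairwise, e.g.
# fieldwise_promote_type(Tuple{Int, String}, Tuple{Float64, String}) == Tuple{Float64, String}
fieldwise_promote_type(::Type{S}, ::Type{T}) where {S <: Tuple, T <: Tuple} =
    Tuple{[promote_type(s, t) for (s, t) in zip(S.parameters, T.parameters)]...}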

@piever merged commit 11ffe1e into master on Mar 19, 2018
@piever deleted the pv/collect branch on March 19, 2018
@piever (Collaborator) commented Mar 19, 2018

One thing though: throughout the code, I'm assuming that the number of fields in the output Tup is constant throughout the data, but one of the tests maps a function which does not respect this property. The old code would give an Array of Tuples (as it is a case where inference failed). Should we just error in that case?

@shashi (Collaborator) commented Mar 19, 2018

Why can't we switch to Array{Tuple} in this case?

end
end

function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T}
Collaborator:

I guess this should have T <: Tup as well?


function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T}
sp, tp = S.parameters, T.parameters
idx = find(!(s <: t) for (s, t) in zip(sp, tp))
Collaborator:

Would be good to see if sp and tp have the same length, else return Array{Tuple}(length(dest)); copy!(newcol, 1, dest, 1, i-1)?

Collaborator Author:

Yep, I figured out that was missing when changing map, and luckily one test checked for that. More than same length, it should be same field names (we also shouldn't accept NamedTuples where things are named differently). I'll change that in the future map PR.

Collaborator Author:

I should also add a test for #101 in the map PR as that should work now.

@piever (Collaborator) commented Mar 19, 2018

It's actually not so trivial though, as the Array{NamedTuple} can get horribly inefficient (in the future Missing world): separate fields can become missing at different moments, and every time we'd need to copy everything (the trick I'm using here of only changing the relevant columns doesn't work). Not sure what the best course of action is: unless there are clear use cases it may be better to error.

@piever (Collaborator) commented Mar 19, 2018

If the user insists on having tuples of different lengths, he/she can always wrap them in a one-element Tuple.

@shashi (Collaborator) commented Mar 19, 2018

> and every time we'd need to copy everything

Why? Array{NamedTuple} should be wide enough for any tuple, even those containing missing?

@piever (Collaborator) commented Mar 19, 2018

Ah, fair enough: as soon as different lengths are encountered it will just widen to NamedTuple, without specializing further. I'll see if there's a way to include this without making the code too ugly...

OTOH, don't you think it'd be better to error and require the user to be explicit if he/she wants this funny Array of tuples of different lengths? For example collect_columns((ntuple(identity, i),) for i in 1:3).

EDIT: looking closer, it seems we do not need to change the code too much to allow this, and it's true that it would be surprising to have code that can collect anything except tuples of different lengths...

@davidanthoff (Contributor):

You should be able to get rid of copying entirely if you widen to DataValueArray instead of Array{DataValue} (or are you doing that already? From a quick glance at the code I got the impression that you don't). In that case you can just reuse the array with the values that you collected so far, stuff it into a new DataValueArray and the only significant new allocation in a widening operation would be the missing mask (which of course is not a problem because it would have been allocated at some point in any case).

On the flip side, it's probably not even needed, because with DataValue you will probably pick the right column type right away, based on the first element of the iteration, anyway.
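A sketch of that widening step, assuming DataValues.jl's DataValueArray(values, isna) constructor (widen_to_datavaluearray is a made-up name):

using DataValues

function widen_to_datavaluearray(col::Vector{T}) where {T}
    isna = fill(false, length(col))    # the only new allocation: the missingness mask
    return DataValueArray(col, isna)   # reuses `col` as the values buffer, no copy
end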

@piever (Collaborator) commented Mar 20, 2018

Thanks for the tip! It will probably be needed for join, as there one could get a DataValue because the row is missing, even when the field is of type Int, for example. I'll probably have to revisit this for join and will include your trick!

@shashi (Collaborator) commented Mar 20, 2018

We do the trick for DataValueArrays in TextParse. At some point (maybe with Parsing.jl) we should use this type of iterator approach there as well. The promotion rules there are quite complex (e.g. string columns start off as a PooledArray, then the pool is widened, and if there are a large number of uniques we switch to Array{String}).
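Not TextParse's actual code, but the last of those promotion steps could look roughly like this (the threshold and the helper name are made up):

using PooledArrays

# keep a string column pooled while the number of distinct values is small,
# otherwise fall back to a plain Vector{String}
function maybe_unpool(col::PooledArray{String}; max_pool_size::Int = 10_000)
    length(unique(col)) <= max_pool_size && return col
    return convert(Vector{String}, col)
end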

@piever mentioned this pull request on Apr 5, 2018