
WIP: collect without inference #135

Merged 11 commits into master on Mar 19, 2018

Conversation

@piever (Collaborator) commented Mar 18, 2018

Collect an iterable of Tuples or NamedTuples into a Columns object, without using inference. The Columns object is initialized from the type of the first element and progressively widened should that type change.
I assume the iterator always yields Tuples (or NamedTuples) of the same length and, in the NamedTuple case, always with the same field names.

The condition to trigger widening is that at least one field of the iterated tuple is not a subtype of the eltype of the corresponding column.

Widening happens element-wise, meaning I perform typejoin on a field-by-field basis.

To work with missing in the future, typejoin is not the correct operation, as:

julia> typejoin(Missing, Int)
Any

but Base.promote_typejoin should be sufficient to fix this.

The overall strategy is more or less a copy-paste of the strategy Base Julia uses to collect generators of unknown element type.
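To make the approach concrete, here is a minimal sketch of the idea, not the PR's actual code: plain Vectors stand in for the Columns object, only plain Tuples are handled, and Julia 0.6's start/done/next iteration protocol is used to match the diff discussed below.

# Minimal sketch (assumes a non-empty iterator of Tuples with a fixed number of fields).
function collect_tuples_sketch(itr)
    st = start(itr)
    el, st = next(itr, st)
    T = typeof(el)                          # Tuple type of the first element
    cols = Any[typeof(x)[x] for x in el]    # one Vector per field
    while !done(itr, st)
        el, st = next(itr, st)
        S = typeof(el)
        if !all(s <: t for (s, t) in zip(S.parameters, T.parameters))
            # widen field-by-field with typejoin and copy the old data over
            # (the PR refines this to copy only the offending columns)
            T = Tuple{[typejoin(s, t) for (s, t) in zip(S.parameters, T.parameters)]...}
            cols = Any[copy!(Vector{t}(length(c)), c) for (c, t) in zip(cols, T.parameters)]
        end
        for (c, x) in zip(cols, el)
            push!(c, x)
        end
    end
    return cols
end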

@piever changed the title from "collect without inference" to "WIP: collect without inference" on Mar 18, 2018

@nalimilan (Member) left a comment

AFAICT that should work, but will suffer from the problem described at JuliaLang/julia#25925, i.e. that a full copy will have to be made each time a missing value is encountered for the first time in a field. So it can be relatively efficient if the number of columns is small and some missing values appear early, but very slow if that's not the case.

src/collect.jl Outdated
while !done(itr, st)
el, st = next(itr, st)
S = typeof(el)
if all((s <: t) for (s, t) in zip(S.parameters, T.parameters))
Member:

Is this inferable if eltype(itr) is concrete or once T has become at least as wide as it? That's essential for performance in Base map.

Collaborator Author:

I don't have a strong enough intuition for what the compiler can optimize. Here Base uses el isa T || typeof(el) === T. What exactly should I check to test inferrability? Maybe making this a function, is_fieldwise_subtype(el::S, ::Type{T}) = all((s <: t) for (s, t) in zip(S.parameters, T.parameters)), would help the compiler, but I'm not sure.

Member:

@code_warntype will tell you whether the return type is inferred. Using a helper function could indeed help (maybe after marking it as Base.@pure, but the chances are that I'm wrong).

Collaborator Author:

I've checked with constant eltype and now it is type stable. Somehow the way to achieve that was to use a generated function to check is_fieldwise_subtype, which we should do anyway for performance. It's not inferrable in the NamedTuples case (the initialization step arrayof, which we already use in map, is not inferrable), but I hope all the NamedTuples business will be much better in 0.7.
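For reference, such a check can look roughly like this (a sketch, not the exact implementation): the @generated function only sees the types, so the field-wise subtype test becomes a compile-time constant.

@generated function is_fieldwise_subtype(el::S, ::Type{T}) where {S <: Tuple, T <: Tuple}
    # computed at compile time from the type parameters, so the collect loop
    # stays type stable as long as the element type does not change
    all(s <: t for (s, t) in zip(S.parameters, T.parameters)) ? :(true) : :(false)
end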

Collaborator Author:

Still, I should probably benchmark it to make sure it's doing OK compared with the inference-based map...

Member:

There's probably a way to achieve the same result without a generated function, but these things are always tricky...

@piever (Collaborator) commented Mar 18, 2018

Concerning missingness, I see the concern. One optimization I can do is to not copy the whole thing, but instead figure out which is the "offending column" and only copy that one (there is a setcol function to change one column without copying the rest: it should be pretty efficient). This way each column is copied at most once. Of course, if there was a way to widen a vector to a vector that allows missing without copying, this would be even better: is there progress on that front?
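A sketch of that idea (essentially the widencolumns logic reviewed further down; widen_one_column is a made-up name): when field l needs a wider eltype T, allocate one new vector, copy over the i-1 elements collected so far, and swap only that column in with setcol.

function widen_one_column(dest, i, l, ::Type{T}) where {T}
    newcol = Vector{T}(length(column(dest, l)))   # wider storage for field l (Julia 0.6 constructor)
    copy!(newcol, 1, column(dest, l), 1, i - 1)   # keep what was already collected
    return setcol(dest, l, newcol)                # the other columns are left untouched
end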

I'm starting to think that with the "columnar storage" trick the situation is not nearly as bad as with a vector of NamedTuples. If copying each column at most once is considered acceptable, we may be able to skip inference altogether (plus, with your promote_typejoin PR, we wouldn't really need to special-case the code so much to accommodate Missing).

@nalimilan (Member):

> Concerning missingness, I see the concern. One optimization I can do is to not copy the whole thing, but instead figure out which is the "offending column" and only copy that one (there is a setcol function to change one column without copying the rest: it should be pretty efficient). This way each column is copied at most once.

Good trick!

> Of course, if there was a way to widen a vector to a vector that allows missing without copying, this would be even better: is there progress on that front?

I don't think there's been any progress (the priority is merging breaking changes for 1.0).

> I'm starting to think that with the "columnar storage" trick the situation is not nearly as bad as with a vector of NamedTuples. If copying each column at most once is considered acceptable, we may be able to skip inference altogether (plus, with your promote_typejoin PR, we wouldn't really need to special-case the code so much to accommodate Missing).

Yeah, it should be more reasonable, at least if we are able to make convert non-copying. That's an interesting approach to test.

@piever (Collaborator) commented Mar 19, 2018

Benchmarks are very encouraging; for some reason this is actually faster than our current implementation in the type-stable case...

julia> using IndexedTables, BenchmarkTools

julia> s = table(rand(1000), rand(1000), names = [:x, :y]);

julia> f(i) = @NT(sum = i.x + i.y, diff = i.x - i.y)
f (generic function with 1 method)

julia> map1(f, s) = table(collectcolumns(f(i) for i in s), copy = false, presorted = true)
map1 (generic function with 1 method)

julia> @benchmark map($f, $s)
BenchmarkTools.Trial: 
  memory estimate:  105.02 KiB
  allocs estimate:  2554
  --------------
  minimum time:     64.973 μs (0.00% GC)
  median time:      72.453 μs (0.00% GC)
  mean time:        83.847 μs (7.97% GC)
  maximum time:     2.059 ms (91.21% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark map1($f, $s)
BenchmarkTools.Trial: 
  memory estimate:  17.83 KiB
  allocs estimate:  44
  --------------
  minimum time:     13.858 μs (0.00% GC)
  median time:      15.307 μs (0.00% GC)
  mean time:        17.561 μs (4.52% GC)
  maximum time:     1.198 ms (91.14% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> map1(f, s) == map(f,s)
true

I still need to:

  • Implement the unknown length iterator case

  • Generalize to work with scalars and just collect an Array

  • Decide on a name for collectcolumns (though maybe collectcolumns is not so bad)

And then we could start porting things to use this instead of inference.

@shashi (Collaborator) commented Mar 19, 2018

That's pretty awesome! The patch looks good to me!! :-)

Now the challenge is figuring out how to implement groupreduce and groupby as iterators

Then there is the elephant in the room --- join :-)

else
return :(false)
end
end
Collaborator:

I think this doesn't need to be @generated; a Base.@pure or @inline might work here. But I guess once a package has @generated, it really doesn't matter how many of them there are :-p

Collaborator Author:

Base.@pure seemed to not work as nicely; maybe we can leave it as is for now and do a "remove all generated functions" PR in the future (maybe on Julia 0.7).

src/collect.jl Outdated
copy!(newcol, 1, column(dest, l), 1, i-1)
new = setcol(new, l, newcol)
end
new
Collaborator:

really liking how this function reads.

@shashi (Collaborator) commented Mar 19, 2018

> but Base.promote_typejoin should be sufficient to fix this.

I like the sound of that!

@shashi (Collaborator) commented Mar 19, 2018

Oops, I think the current implementation of map allocates, probably because of JuliaLang/julia#15276, since we close over fs?

@shashi (Collaborator) commented Mar 19, 2018

If I just call f instead of fs (which understandably isn't inferable), the performance is 20% slower in the relevant portions (allocation + filling), but I think we can take that for the benefits and simplicity this PR affords.

@piever (Collaborator) commented Mar 19, 2018

I see; I couldn't really understand how this PR could be faster. 20% sounds fair: the last step of converting to a table could account for that. map2(f, s) = collectcolumns(f(i) for i in s) is much faster than map1, though, and I'm not sure why: I thought converting to a table was free.

In terms of groupby and groupreduce, I made the SizeUnknown implementation for them; after we merge this one I can see how to translate them to use it (groupreduce is already using this PR's strategy, but assuming a concrete grouping). For join we should actually change typejoin.

We should (before 0.7; after that there is @nalimilan's PR to rescue us) define our own promote_typejoin and add promote_typejoin(::Type{<:DataValue}, ::Type{T}) = DataValue{T}, and then it should work out of the box (though we may also need to tweak the s <: t part to allow Float64 <: DataValue{Float64}, of course using our own version of <:). Better to leave it for last though.
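Roughly, the idea would be the following (a hypothetical sketch: our_promote_typejoin and our_issubtype are placeholder names, nothing below exists in the package yet):

using DataValues

# widening rule: a DataValue column joined with a plain element type keeps the wrapper
our_promote_typejoin(::Type{<:DataValue}, ::Type{T}) where {T} = DataValue{T}
our_promote_typejoin(::Type{S}, ::Type{T}) where {S, T} = typejoin(S, T)

# relaxed `s <: t` check, so a plain Float64 element "fits" a DataValue{Float64} column
our_issubtype(::Type{S}, ::Type{T}) where {S, T} = S <: T
our_issubtype(::Type{S}, ::Type{DataValue{T}}) where {S, T} = S <: T || S <: DataValue{T}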

A possible strategy could be, in the following order:

  • Decide on a name for collectcolumns and merge this PR (could also leave collectcolumns)
  • port map, which should be trivial
  • port groupby and groupreduce which could be a bit trickier
  • think about join and revisit this PR to adjust as needed.

The tricky thing about group is that it also uses inference to decide whether to wrap f in a function or not, so maybe that's not so easy.

@shashi (Collaborator) commented Mar 19, 2018

> I thought converting to a table was free.

That is suspicious actually; it's the simplest wrapper when copy=false, presorted=true... Worth profiling.

@shashi (Collaborator) commented Mar 19, 2018

I was not timing the table creation, and that could have been the 20% overhead... I did some profiling; it looks like it's just the overhead of dispatching through many methods with keyword arguments.

@shashi (Collaborator) commented Mar 19, 2018

> define our own promote_typejoin and add promote_typejoin(::Type{<:DataValue}, ::Type{T}) = DataValue{T}

julia> promote_type(DataValue{Int}, Float64)
DataValues.DataValue{Float64}

why not use promote_type?

I think the 4-step plan sounds perfect. collectcolumns seems like a good name. Maybe collect_columns is easier to read, and mirrors the use of _ after collect in Base.

Cheers!

@piever (Collaborator) commented Mar 19, 2018

promote_type is actually an interesting idea; I think I'll switch to that. I like it because it will give a tighter type than typejoin; then I guess we'll have to see in practice whether that's enough to get join to work out of the box. I'll switch to that (and to collect_columns, which I also think reads better), add docstrings, and I think it's good to go.

@nalimilan (Member):

One potential issue with promote_type is that promote_type(Int, Float64) === Float64, which means that big integers would be silently rounded. OTOH that's what happens with cat on arrays already, so...
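For example, an Int beyond Float64's 53-bit integer range cannot round-trip once the column is promoted:

julia> promote_type(Int, Float64)
Float64

julia> Int(Float64(2^53 + 1)) == 2^53 + 1
false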

piever added 2 commits March 19, 2018 16:38
@piever (Collaborator) commented Mar 19, 2018

I saw that but I think it's better than returning an Array{Real}, especially if there are precedents like vcat. The use case of mixing up large Int64 and Float64 while wanting to preserve the Int64 exactly (where they are already at a size that risks overflow) seems limited.

@shashi (Collaborator) commented Mar 19, 2018

There's a _promote_type in JuliaDB.jl which does promote_type on all the fields of a tuple (using map_params). Maybe we should move that here and reuse it there. It's a minor detail that will need an in-tandem release. Everything looks good here: feel free to hit merge!
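The general shape of such a field-wise promotion (a sketch, not JuliaDB's actual _promote_type, which is map_params-based) is simply:

# promote each field of two Tuple types pairwise, e.g.
# fieldwise_promote_type(Tuple{Int, String}, Tuple{Float64, String}) == Tuple{Float64, String}
fieldwise_promote_type(::Type{S}, ::Type{T}) where {S <: Tuple, T <: Tuple} =
    Tuple{[promote_type(s, t) for (s, t) in zip(S.parameters, T.parameters)]...}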

@piever merged commit 11ffe1e into master on Mar 19, 2018
@piever deleted the pv/collect branch on March 19, 2018
@piever (Collaborator) commented Mar 19, 2018

One thing though: throughout the code, I'm assuming that the number of fields in the output Tup is constant throughout the data, but one of the tests maps a function which does not respect this property. The old code would give an Array of Tuples (as it is a case where inference failed). Should we just error in that case?

@shashi (Collaborator) commented Mar 19, 2018

Why can't we switch to Array{Tuple} in this case?

end
end

function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T}
Collaborator:

I guess this should have T <: Tup as well?


function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T}
sp, tp = S.parameters, T.parameters
idx = find(!(s <: t) for (s, t) in zip(sp, tp))
Collaborator:

Would be good to see if sp and tp have the same length, else return Array{Tuple}(length(dest)); copy!(newcol, 1, dest, 1, i-1)?

Collaborator Author:

Yep, I figured out that was missing when changing map, and luckily one test checked for that. More than same length, it should be same field names (we also shouldn't accept NamedTuples where things are named differently). I'll change that in the future map PR.

Collaborator Author:

I should also add a test for #101 in the map PR as that should work now.

@piever (Collaborator) commented Mar 19, 2018

It's actually not so trivial though, as the Array{NamedTuple} can get horribly inefficient (in the future Missing world): separate fields can become missing at different moments, and every time we'd need to copy everything (the trick I'm using here of only changing the relevant columns doesn't work). Not sure what the best course of action is: unless there are clear use cases it may be better to error.

@piever (Collaborator) commented Mar 19, 2018

If the user insists on having tuples of different lengths, he/she can always wrap them in a one-element Tuple.

@shashi (Collaborator) commented Mar 19, 2018

> and every time we'd need to copy everything

Why? Array{NamedTuple} should be wide enough for any tuple, even those containing missing?

@piever (Collaborator) commented Mar 19, 2018

Ah, fair enough: as soon as different lengths are encountered it will just widen to NamedTuple, without specializing further. I'll see if there's a way to include this without making the code too ugly...

OTOH, don't you think it'd be better to error and require the user to be explicit if he/she wants this funny Array of tuples of different lengths? For example collect_columns((ntuple(identity, i),) for i in 1:3).

EDIT: looking closer, it seems we do not need to change the code too much to allow this, and it's true that it would be surprising to have code that can collect anything except tuples of different lengths...

@davidanthoff (Contributor):

You should be able to get rid of copying entirely if you widen to DataValueArray instead of Array{DataValue} (or are you doing that already? From a quick glance at the code I got the impression that you don't). In that case you can just reuse the array with the values that you collected so far, stuff it into a new DataValueArray and the only significant new allocation in a widening operation would be the missing mask (which of course is not a problem because it would have been allocated at some point in any case).

On the flip side, it's probably not even needed, because with DataValue you will probably pick the right column type right away, based on the first element of the iteration, anyway.
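A sketch of that widening step, assuming DataValues.jl's DataValueArray(values, isna) constructor (widen_to_datavaluearray is a made-up name):

using DataValues

function widen_to_datavaluearray(col::Vector{T}) where {T}
    isna = fill(false, length(col))    # the only new allocation: the missingness mask
    return DataValueArray(col, isna)   # reuses `col` as the values buffer, no copy
end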

@piever (Collaborator) commented Mar 20, 2018

Thanks for the tip! It will probably be needed for join, as there one could get a DataValue because the row is missing, even when the field is of type Int, for example. I'll probably have to revisit this for join and will include your trick!

@shashi (Collaborator) commented Mar 20, 2018

We do the trick for DataValueArrays in TextParse. At some point (maybe with Parsing.jl) we should use this type of iterator approach there as well. The promotion rules there are quite complex (e.g. string columns start off as a PooledArray, then the pool is widened, and if there are a large number of uniques we switch to Array{String}).
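Not TextParse's actual code, but the last of those promotion steps could look roughly like this (the threshold and the helper name are made up):

using PooledArrays

# keep a string column pooled while the number of distinct values is small,
# otherwise fall back to a plain Vector{String}
function maybe_unpool(col::PooledArray{String}; max_pool_size::Int = 10_000)
    length(unique(col)) <= max_pool_size && return col
    return convert(Vector{String}, col)
end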

@piever mentioned this pull request on Apr 5, 2018