WIP: collect without inference #135
AFAICT that should work, but will suffer from the problem described at JuliaLang/julia#25925, i.e. that a full copy will have to be made each time a missing value is encountered for the first time in a field. So it can be relatively efficient if the number of columns is small and some missing values appear early, but very slow if that's not the case.
src/collect.jl
Outdated
while !done(itr, st)
    el, st = next(itr, st)
    S = typeof(el)
    if all((s <: t) for (s, t) in zip(S.parameters, T.parameters))
Is this inferable if `eltype(itr)` is concrete, or once `T` has become at least as wide as it? That's essential for performance in Base `map`.
I don't have a strong enough intuition to understand what the compiler can optimize. Here Base uses `el isa T || typeof(el) === T`. What should I check exactly to test inferability? Maybe making this a function `is_fieldwise_subtype(el::S, ::Type{T}) = all((s <: t) for (s, t) in zip(S.parameters, T.parameters))` would help the compiler, I'm not sure.
`@code_warntype` will tell you whether the return type is inferred. Using a helper function could indeed help (maybe after marking it as `Base.@pure`, but the chances are that I'm wrong).
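A minimal sketch of how such a check could look, assuming Julia 1.x rather than the 0.6 syntax used in the diff. The helper below is a hypothetical rewrite of the `is_fieldwise_subtype` proposed above (type parameters bound with `where`, `fieldtypes` instead of `.parameters`), not the code from this PR. No expected output is shown because what inference reports depends on the Julia version.

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

# Hypothetical helper: the same field-by-field check as in the diff,
# with S and T bound by `where` so the definition is valid on Julia 1.x.
is_fieldwise_subtype(::S, ::Type{T}) where {S<:Tuple, T<:Tuple} =
    all(s <: t for (s, t) in zip(fieldtypes(S), fieldtypes(T)))

el = (1, 2.0)

# Prints the typed code; non-concrete types in the output indicate
# places where inference could not pin down a type.
@code_warntype is_fieldwise_subtype(el, Tuple{Real, Real})

# Programmatic alternative: the return types inferred for these argument types.
rts = Base.return_types(is_fieldwise_subtype, (typeof(el), Type{Tuple{Real, Real}}))
println(rts)
```

If the single entry in `rts` is a concrete type such as `Bool`, the call is inferable for those argument types.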
I've checked with constant `eltype` and now it is type stable. Somehow the way to achieve that was to use a generated function to check `is_fieldwise_subtype`, which we should do anyway for performance. It's not inferable in the `NamedTuples` case (the initialization step `arrayof`, which we already use in `map`, is not inferable), but I hope all the `NamedTuples` business will be much better in 0.7.
Still, we should probably benchmark it to make sure it's doing OK compared with the inference-based `map`...
There's probably a way to achieve the same result without a generated function, but these things are always tricky...
Concerning missingness, I see the concern. One optimization I can do is to not copy the whole thing, but also figure out what is the "offending column" and only copy that one (there is a I'm starting to think that using the trick of "columnar storage" the situation is not nearly as bad as with a vector of |
Good trick!
I don't think there's been any progress (the priority is merging breaking changes for 1.0).
Yeah, it should be more reasonable, at least if we are able to make |
Benchmarks are very encouraging: for some reason this is actually faster than our current implementation in the type-stable case...
julia> using IndexedTables, BenchmarkTools
julia> s = table(rand(1000), rand(1000), names = [:x, :y]);
julia> f(i) = @NT(sum = i.x + i.y, diff = i.x - i.y)
f (generic function with 1 method)
julia> map1(f, s) = table(collectcolumns(f(i) for i in s), copy = false, presorted = true)
map1 (generic function with 1 method)
julia> @benchmark map($f, $s)
BenchmarkTools.Trial:
memory estimate: 105.02 KiB
allocs estimate: 2554
--------------
minimum time: 64.973 μs (0.00% GC)
median time: 72.453 μs (0.00% GC)
mean time: 83.847 μs (7.97% GC)
maximum time: 2.059 ms (91.21% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark map1($f, $s)
BenchmarkTools.Trial:
memory estimate: 17.83 KiB
allocs estimate: 44
--------------
minimum time: 13.858 μs (0.00% GC)
median time: 15.307 μs (0.00% GC)
mean time: 17.561 μs (4.52% GC)
maximum time: 1.198 ms (91.14% GC)
--------------
samples: 10000
evals/sample: 1
julia> map1(f, s) == map(f,s)
true
I still need to:
And then we could start porting things to use this instead of inference. |
That's pretty awesome! The patch looks good to me!! :-) Now the challenge is figuring out how to implement Then there is the elephant in the room --- |
    else
        return :(false)
    end
end
I think this doesn't need to be `@generated`; a `Base.@pure` or `@inline` might work here. But I guess once a package has `@generated`, it really doesn't matter how many of them there are :-p
`Base.@pure` seemed to not work as nicely; maybe we can leave it as is for now and do a "remove all generated functions" PR in the future (maybe in Julia 0.7).
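For readers who have not seen the pattern, here is a rough sketch of a generated-function version of the field-wise check, assuming Julia 1.x. The name `is_fieldwise_subtype_gen` and the extra length check are illustrative assumptions; the PR's actual body is only partially visible in the diff above.

```julia
# Hypothetical generated variant: the subtype test runs once per (S, T)
# pair when the method body is generated, so the compiled body is just
# the constant `true` or `false`.
@generated function is_fieldwise_subtype_gen(el::S, ::Type{T}) where {S<:Tuple, T<:Tuple}
    sp, tp = fieldtypes(S), fieldtypes(T)
    if length(sp) == length(tp) && all(s <: t for (s, t) in zip(sp, tp))
        return :(true)
    else
        return :(false)
    end
end

println(is_fieldwise_subtype_gen((1, 2.0), Tuple{Real, Real}))
println(is_fieldwise_subtype_gen((1, "a"), Tuple{Int, Int}))
```

Because the answer depends only on the argument types, the result is a compile-time constant for each call signature, which is presumably why this variant turned out type stable.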
src/collect.jl
Outdated
        copy!(newcol, 1, column(dest, l), 1, i-1)
        new = setcol(new, l, newcol)
    end
    new
really liking how this function reads.
I like the sound of that! |
Oops I think the current implementation of |
If I just call |
I see, I couldn't really understand how this PR could be faster. 20% sounds fair: the last step of converting to a In terms of We should (before 0.7, then there is A possible strategy could be, in the following order:
The tricky thing about |
That is suspicious actually, it's the simplest wrapper when |
I was not timing the table creation, and that could have been the 20% overhead... I did some profiling, it looks like it's just the overhead of dispatch calling many methods with keyword arguments. |
julia> promote_type(DataValue{Int}, Float64)
DataValues.DataValue{Float64}
why not use I think the 4-step plan sounds perfect. Cheers! |
One potential issue with |
I saw that but I think it's better than returning an |
There's a |
One thing though: throughout the code, I'm assuming that the number of fields in the output |
Why can't we switch to |
    end
end

function widencolumns(dest, i, el::S, ::Type{T}) where {S <: Tup, T}
I guess this should have `T <: Tup` as well?
function widencolumns(dest, i, el::S, ::Type{T}) where {S <: Tup, T}
    sp, tp = S.parameters, T.parameters
    idx = find(!(s <: t) for (s, t) in zip(sp, tp))
Would be good to see if `sp` and `tp` have the same length, else return `Array{Tuple}(length(dest)); copy!(newcol, 1, dest, 1, i-1)`?
Yep, figured out that was missing when changing `map`, and luckily one test checked for that. More than same length, same fieldnames (we also shouldn't accept `NamedTuples` where things are called differently). I'll change that in the future `map` PR.
I should also add a test for #101 in the `map` PR, as that should work now.
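The "same length, same fieldnames" condition discussed in this thread could be checked along these lines. This is only a sketch for Julia 1.x, where `NamedTuple` is built in; `same_layout` is a hypothetical name that does not appear in the PR.

```julia
# Hypothetical layout check: plain tuples must have the same number of
# fields, named tuples must have identical field names in the same order.
same_layout(::Type{S}, ::Type{T}) where {S<:Tuple, T<:Tuple} =
    fieldcount(S) == fieldcount(T)
same_layout(::Type{S}, ::Type{T}) where {S<:NamedTuple, T<:NamedTuple} =
    fieldnames(S) == fieldnames(T)
# Any other combination (e.g. a Tuple against a NamedTuple) does not match.
same_layout(::Type, ::Type) = false

println(same_layout(typeof((1, 2)), typeof((1.0, "a"))))
println(same_layout(typeof((a = 1, b = 2)), typeof((a = 1, c = 2))))
```

A collector could call such a check before the field-wise subtype test and fall back to a non-columnar container when it fails.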
It's actually not so trivial though, as the |
If the user insists on having tuples of different lengths, they can always wrap them in a one-element |
Why? |
Ah, fair enough, as soon as different lengths are encountered it will just give a
EDIT: looking closer it seems that we do not need to change the code too much to allow this and it's true that it's surprising to have code that can collect anything unless there are tuples of different lengths... |
You should be able to get rid of copying entirely if you widen to On the flip side, probably not even needed, because with |
Thanks for the tip! It will probably be needed for |
We do the trick for DataValueArrays in TextParse. At some point (maybe with Parsing.jl) we should use this type of iterator approach there as well. The promotion rules there are quite complex (e.g. string columns are started off as PooledArray, then the pool is widened, if there are a large number of uniques, we switch to Array{String}) |
Collect an iterable of `Tuples` or `NamedTuples` to a `Columns` object, without using inference. The `Columns` object is initialized with the type of the first element and progressively widened, should this type change.
I assume that the iterator iterates `Tuples` (or `NamedTuples`) always of the same length and, in the `NamedTuples` case, always with the same fieldnames.
The condition to trigger widening is that at least one field of the iterated tuple is not a subtype of the `eltype` of the corresponding column.
Widening happens element-wise, meaning I perform `typejoin` on a field-by-field basis.
To work with `missing` in the future, `typejoin` is not the correct operation, as:
but `Base.promote_typejoin` should be sufficient to fix this.
The overall strategy is more or less a copy-paste of the Base Julia strategy to collect generators of unknown type.