
Problem with SentinelArrays.ChainedVector when limit/skipto is set #963

Closed
bkamins opened this issue Jan 17, 2022 · 1 comment · Fixed by #964
bkamins commented Jan 17, 2022

@quinnj:
Start Julia with multiple threads.

Run:

using DataFrames
using CSV
df = DataFrame(x = repeat(["a", "b"], 10^6), y = rand(2*10^6));
f1 = tempname();
CSV.write(f1, df);
df2 = CSV.read(f1, DataFrame, header = false, skipto = 1001, limit = 10000);
sort(df2, 1)

The problem is that in both columns (which are SentinelArrays.ChainedVectors) the cumulative lengths of the underlying arrays do not match inds. Here is an example for Column2:

julia> df2.Column2.inds
4-element Vector{Int64}:
  2853
  5706
  8559
 10000

julia> cumsum(length.(df2.Column2.arrays))
4-element Vector{Int64}:
 2853
 5706
 8559
 9970

and you can see that 30 elements are missing (9970 stored vs. the 10000 recorded in inds).
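
A quick, ad-hoc way to check this invariant on any ChainedVector (using the internal inds and arrays fields shown above; this is only a debugging helper, not SentinelArrays API) is:

using SentinelArrays: ChainedVector

function chunks_consistent(v::ChainedVector)
    # The recorded cumulative chunk ends (inds) should equal the actual chunk lengths.
    return cumsum(length.(v.arrays)) == v.inds
end

# For the df2 above, chunks_consistent(df2.Column2) would return false,
# since the last entries disagree (10000 vs. 9970).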

If you redefine getindex like this to enable bounds checking:

function Base.getindex(A::ChainedVector, i::Integer)
    chunk, ix = index(A, i)
    x = A.arrays[chunk][ix]
    return x
end

then you get a BoundsError when trying to work with df2.
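
Concretely (an illustration based on the numbers above, not a transcript of an actual session):

df2.Column2[9970]   # still resolves: maps into real data in the last chunk
df2.Column2[10000]  # would throw BoundsError: inds claims this index exists,
                    # but the last chunk is 30 elements short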

bkamins added the bug label on Jan 17, 2022
quinnj added a commit that referenced this issue Jan 17, 2022
Fixes #963. The issue here is that although we were adjusting the number of
rows to a provided limit when multithreaded parsing, we failed to adjust
the actual column arrays to the correct size. This has been an issue since we
converted from the old `CSV.Column` custom array type to returning
"normal" arrays in the 0.7 -> 0.8 transition. With `CSV.Column`, we just
passed the final row total and it adjusted the size dynamically, without
physically resizing the underlying array. With regular arrays, however,
we need to ensure the array gets resized appropriately. This became more
apparent with the recently released pooling change, since the use of
`@inbounds` in the new `checkpooled!` routine turned what would have been a
BoundsError into a silent out-of-bounds access. I've taken out those
`@inbounds` uses for now to be more conservative. The fix is fairly
straightforward: if we adjust our final row count down to a user-provided
limit, we loop over the parsing tasks, "accumulate" rows until we hit the
limit, and then `resize!` or `empty!` the columns as appropriate.
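
For intuition, here is a rough sketch of that accumulation step (the function name and the assumption that each parsing task contributed one chunk per column are editorial illustrations, not CSV.jl internals):

# Hypothetical sketch: trim per-task column chunks down to a user-provided limit.
# `chunks` holds one array per parsing task for a single column.
function trim_to_limit!(chunks, limit)
    seen = 0
    for chunk in chunks
        if seen >= limit
            empty!(chunk)                 # entirely past the limit: drop it
        elseif seen + length(chunk) > limit
            resize!(chunk, limit - seen)  # straddles the limit: shrink it
            seen = limit
        else
            seen += length(chunk)         # fully within the limit
        end
    end
    return chunks
end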

quinnj commented Jan 17, 2022

Thanks for the report; fix is up: #964.

quinnj added a commit that referenced this issue Jan 19, 2022
* Fix use of limit in multithreaded parsing
