
Problem with SentinelArrays.ChainedVector when limit/skipto is set #963

Closed
bkamins opened this issue Jan 17, 2022 · 1 comment · Fixed by #964
bkamins commented Jan 17, 2022

@quinnj:
Start Julia with multiple threads.

Run:

using DataFrames
using CSV
df = DataFrame(x = repeat(["a", "b"], 10^6), y = rand(2*10^6));
f1 = tempname();
CSV.write(f1, df);
df2 = CSV.read(f1, DataFrame, header = false, skipto = 1001, limit = 10000);
sort(df2, 1)

The problem is that in both columns (which are SentinelArrays.ChainedVectors) the cumulative lengths of the underlying arrays do not match inds. Here is an example for Column2:

julia> df2.Column2.inds
4-element Vector{Int64}:
  2853
  5706
  8559
 10000

julia> cumsum(length.(df2.Column2.arrays))
4-element Vector{Int64}:
 2853
 5706
 8559
 9970

and you can see that 30 elements are missing (9970 stored vs. the 10000 recorded in inds).
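
A quick, ad-hoc way to check this invariant on any ChainedVector (using the internal inds and arrays fields shown above; this is only a debugging helper, not SentinelArrays API) is:

using SentinelArrays: ChainedVector

function chunks_consistent(v::ChainedVector)
    # The recorded cumulative chunk ends (inds) should equal the actual chunk lengths.
    return cumsum(length.(v.arrays)) == v.inds
end

# For the df2 above, chunks_consistent(df2.Column2) would return false,
# since the last entries disagree (10000 vs. 9970).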

If you redefine getindex like this to enable bounds checking:

function Base.getindex(A::ChainedVector, i::Integer)
    chunk, ix = index(A, i)
    x = A.arrays[chunk][ix]
    return x
end

then you get a BoundsError when trying to work with df2.
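
Concretely (an illustration based on the numbers above, not a transcript of an actual session):

df2.Column2[9970]   # still resolves: maps into real data in the last chunk
df2.Column2[10000]  # would throw BoundsError: inds claims this index exists,
                    # but the last chunk is 30 elements short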

bkamins added the bug label on Jan 17, 2022
quinnj added a commit that referenced this issue Jan 17, 2022
Fixes #963. The issue here is that although we were adjusting the number of
rows to a provided limit when multithreaded parsing, we failed to adjust
the actual column arrays to the correct size. This has been an issue since we
converted from the old `CSV.Column` custom array type to returning
"normal" arrays in the 0.7 -> 0.8 transition. With `CSV.Column`, we just
passed the final row total and it adjusted the size dynamically, without
physically resizing the underlying array. With regular arrays, however,
we need to ensure the array gets resized appropriately. This became more
apparent with the recently released pooling change, since the use of
`@inbounds` in the new `checkpooled!` routine turned what would have been a
BoundsError into a silent out-of-bounds access. I've taken out those
`@inbounds` uses for now to be more conservative. The fix is fairly
straightforward: if we adjust our final row count down to a user-provided
limit, we loop over the parsing tasks, "accumulate" rows until we hit the
limit, and then `resize!` or `empty!` the columns as appropriate.
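
For intuition, here is a rough sketch of that accumulation step (the function name and the assumption that each parsing task contributed one chunk per column are editorial illustrations, not CSV.jl internals):

# Hypothetical sketch: trim per-task column chunks down to a user-provided limit.
# `chunks` holds one array per parsing task for a single column.
function trim_to_limit!(chunks, limit)
    seen = 0
    for chunk in chunks
        if seen >= limit
            empty!(chunk)                 # entirely past the limit: drop it
        elseif seen + length(chunk) > limit
            resize!(chunk, limit - seen)  # straddles the limit: shrink it
            seen = limit
        else
            seen += length(chunk)         # fully within the limit
        end
    end
    return chunks
end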

quinnj commented Jan 17, 2022

Thanks for the report; fix is up: #964.

quinnj added a commit that referenced this issue Jan 19, 2022
* Fix use of limit in multithreaded parsing
