Issue with `Tables.rowtable` #827

ericphanson · 2021-04-16T13:12:28Z

I just ran into this error when trying to call Tables.rowtable on a CSV.File:

MethodError: no method matching CSV.PooledString(::String)

Closest candidates are:

CSV.PooledString() at /home/ec2-user/.julia/packages/CSV/CJfFO/src/utils.jl:8

convert(::Type{CSV.PooledString}, ::String)@basic.jl:232
cvt1@essentials.jl:322[inlined]
macro expansion@ntuple.jl:74[inlined]
ntuple@ntuple.jl:69[inlined]

I don't think I can share the file, unfortunately.

With this and apache/arrow-julia#167 I've found a good workaround has been map(NamedTuple, Tables.rows(...)). I wonder if this is also a schema issue or something else.
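For reference, the workaround above might look roughly like the following. This is a minimal sketch, not the original reproduction: the two-row in-memory CSV is a hypothetical stand-in for the real file, which can't be shared. It assumes the rows of a CSV.File can be converted with NamedTuple, as the workaround describes.

```julia
using CSV, Tables

# Hypothetical stand-in data; the real file can't be shared.
f = CSV.File(IOBuffer("x,y\na,b\nc,d\n"))

# Convert each row to a NamedTuple from the row's own values,
# instead of building the rows from Tables.schema(f) as Tables.rowtable does.
rows = map(NamedTuple, Tables.rows(f))
```

Because each NamedTuple is built from the values actually stored in the row, a wrong element type in the reported schema should not matter here.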


quinnj · 2021-04-16T18:19:29Z

Hmmm... I'm not quite sure how you're ending up with PooledString in a schema produced by CSV.File. In the following, for example, we end up with PooledArrays with string elements, but the schema is correctly inferred as Union{String, Missing}:

julia> f = CSV.File(IOBuffer("""x,y
       a,b
       a,b
       a,b
       a,b
       a,b
       a,b
       a,b
       a,b
       a,b

       """), ignoreemptylines=false);

julia> Tables.schema(f)
Tables.Schema:
 :x  Union{Missing, String}
 :y  Union{Missing, String}
 
julia> Tables.rowtable(f)
10-element Vector{NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}}:
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}((missing, missing))

If there's any way to anonymize the data or share it w/ me privately, I'd be interested in tracking this down.

ericphanson · 2021-04-16T21:00:30Z

I spent a bit of time trying to minimize and anonymize the data, and then eventually realized that this was enough to trigger it, aha:

julia> using Tables, CSV, Random

julia> Threads.nthreads()
2

julia> CSV.write("data.csv", (; col= [randstring(50) for _ = 1:50]))
"data.csv"

julia> Tables.schema(CSV.File("data.csv"; pool=true, threaded=true))
Tables.Schema:
 :col  PooledString

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)

(CSV v0.8.4, Tables v1.4.2)

It works fine without threading.

Fixes #827. The issue here is the column type of pooled columns was still `PooledString` when finished parsing, which is an internal-only type used while parsing to signal a column is being pooled. The fix is pretty straightforward: ensure the column type is `String` or `Union{String, Missing}` when we're done parsing.

quinnj · 2021-04-19T23:12:50Z

Fix is up: #828

* Fix CSV.File schema for pooled columns when multithreaded parsing
* finish test

ericphanson · 2021-04-20T14:21:56Z

Thanks!

aplavin · 2021-05-20T19:12:36Z

Is this fix released already? I seem to have the very same issue on the latest available CSV.jl 0.8.4.

quinnj · 2021-05-20T19:48:15Z

Oh whoops, looks like we forgot to do a patch release w/ this fix; I've gone ahead and done that here: f405361.

quinnj mentioned this issue Apr 19, 2021

Fix CSV.File schema for pooled columns when multithreaded parsing #828

Merged

quinnj closed this as completed in #828 Apr 20, 2021
