Issue with Tables.rowtable #827

Closed
ericphanson opened this issue Apr 16, 2021 · 6 comments · Fixed by #828

Comments

@ericphanson

I just ran into this error when trying to call Tables.rowtable on a CSV.File:

MethodError: no method matching CSV.PooledString(::String)

Closest candidates are:

CSV.PooledString() at /home/ec2-user/.julia/packages/CSV/CJfFO/src/utils.jl:8

convert(::Type{CSV.PooledString}, ::String)@basic.jl:232
cvt1@essentials.jl:322[inlined]
macro expansion@ntuple.jl:74[inlined]
ntuple@ntuple.jl:69[inlined]

I don't think I can share the file, unfortunately.

With this and apache/arrow-julia#167, I've found that a good workaround is map(NamedTuple, Tables.rows(...)). I wonder if this is also a schema issue or something else.
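
For readers who want to try the workaround, here is a minimal sketch. It assumes a small in-memory CSV standing in for the real file (which could not be shared), and that `NamedTuple` can be called on the rows of a `CSV.File`, as the comment above reports. It builds each row directly instead of going through `Tables.rowtable`, so it does not rely on the column types recorded in the schema.

```julia
using CSV, Tables

# Stand-in data (an assumption); the original file is not available.
f = CSV.File(IOBuffer("x,y\na,b\nc,d\n"))

# Build one NamedTuple per row directly, bypassing Tables.rowtable,
# which converts values to the types declared in the schema.
rows = map(NamedTuple, Tables.rows(f))
```

The element type of `rows` is inferred from the values actually produced, so it may differ from what `Tables.schema(f)` reports.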

@quinnj
Member

quinnj commented Apr 16, 2021

Hmmm... I'm not quite sure how you're ending up with PooledString in a schema produced by CSV.File. For example, in the following we end up with PooledArrays with string elements, but the schema is correctly inferred as Union{String, Missing}:

julia> f = CSV.File(IOBuffer("""x,y
                                       a,b
                                       a,b
                                       a,b
                                       a,b
                                       a,b
                                       a,b
                                       a,b
                                       a,b
                                       a,b

                                       """), ignoreemptylines=false)
julia> Tables.schema(f)
Tables.Schema:
 :x  Union{Missing, String}
 :y  Union{Missing, String}
 
julia> Tables.rowtable(f)
10-element Vector{NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}}:
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}(("a", "b"))
 NamedTuple{(:x, :y), Tuple{Union{Missing, String}, Union{Missing, String}}}((missing, missing))

If there's any way to anonymize the data or share it w/ me privately, I'd be interested in tracking this down.

@ericphanson
Author

I spent a bit of time trying to minimize and anonymize and then eventually realized that this was enough to trigger it aha:

julia> using Tables, CSV, Random

julia> Threads.nthreads()
2

julia> CSV.write("data.csv", (; col= [randstring(50) for _ = 1:50]))
"data.csv"

julia> Tables.schema(CSV.File("data.csv"; pool=true, threaded=true))
Tables.Schema:
 :col  PooledString

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)

(CSV v0.8.4, Tables v1.4.2)

It works fine without threading.

quinnj added a commit that referenced this issue Apr 19, 2021
Fixes #827. The issue here is the column type of pooled columns was
still `PooledString` when finished parsing, which is an internal-only
type used while parsing to signal a column is being pooled. The fix is
pretty straightforward: ensure the column type is `String` or
`Union{String, Missing}` when we're done parsing.
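
The idea in the commit message above can be illustrated with a small standalone sketch. This is not CSV.jl's actual code: `PooledMarker` and `finaltype` are hypothetical names, used only to show how a parse-time marker type might be mapped to a public type before the schema is reported.

```julia
# Hypothetical stand-in for an internal-only marker type; this is NOT
# CSV.jl's PooledString definition.
struct PooledMarker end

# Map a column's parse-time type to the type reported in the final schema.
# Ordinary types pass through unchanged; marker types become String types.
finaltype(::Type{T}) where {T} = T
finaltype(::Type{PooledMarker}) = String
finaltype(::Type{Union{PooledMarker, Missing}}) = Union{String, Missing}

# Types as they might look at the end of parsing, then after normalization.
parse_time_types = [Int64, PooledMarker, Union{PooledMarker, Missing}]
final_types = map(finaltype, parse_time_types)
```

After this step no marker type remains in `final_types`, which is the property a consumer such as `Tables.rowtable` needs in order to convert values safely.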
@quinnj
Member

quinnj commented Apr 19, 2021

Fix is up: #828

quinnj added a commit that referenced this issue Apr 20, 2021
* Fix CSV.File schema for pooled columns when multithreaded parsing

Fixes #827. The issue here is the column type of pooled columns was
still `PooledString` when finished parsing, which is an internal-only
type used while parsing to signal a column is being pooled. The fix is
pretty straightforward: ensure the column type is `String` or
`Union{String, Missing}` when we're done parsing.

* finish test
@ericphanson
Author

Thanks!

@aplavin

aplavin commented May 20, 2021

Is this fix released already? I seem to have the very same issue on the latest available CSV.jl 0.8.4.

@quinnj
Member

quinnj commented May 20, 2021

Oh whoops, looks like we forgot to do a patch release w/ this fix; I've gone ahead and done that here: f405361.
