Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are ignored columns still parsed? #942

Closed
ShuhuaGao opened this issue Nov 15, 2021 · 3 comments · Fixed by #943
Closed

Are ignored columns still parsed? #942

ShuhuaGao opened this issue Nov 15, 2021 · 3 comments · Fixed by #943

Comments

@ShuhuaGao
Copy link

Consider the following MWE.

julia> using CSV

julia> data = """
       name, age
       Jack, 12
       Tom, 10
       """
"name, age\nJack, 12\nTom, 10\n"

julia> CSV.read(IOBuffer(data), Tuple; select=[2], types=Dict(2=>Int32))
(Int32[12, 10],)

julia> CSV.read(IOBuffer(data), Tuple; select=[2], types=Int32)
┌ Warning: thread = 1 warning: error parsing Int32 around row = 1, col = 1: "Jack,", error=INVALID: SENTINEL | DELIMITED | INVALID_DELIMITER 
└ @ CSV ~/.julia/packages/CSV/nofYz/src/file.jl:637
┌ Warning: thread = 1 warning: error parsing Int32 around row = 2, col = 1: "Tom,", error=INVALID: SENTINEL | DELIMITED | INVALID_DELIMITER 
└ @ CSV ~/.julia/packages/CSV/nofYz/src/file.jl:637
(Int32[12, 10],)

In the second read above, since I have only selected column 2, a natural expectation is that only column 2 is parsed. In other words, I expect that types=Int32 applies to all select columns. However, it seems that column 1 is still parsed.

@nickrobinson251
Copy link
Collaborator

Are ignored columns still parsed?

Yes, even with select (or drop), all columns are parsed initially, and then only afterwards dropped.

e.g. here's where they get dropped (immediately after parsing)

CSV.jl/src/file.jl

Lines 312 to 322 in a68bb04

# delete any dropped columns from names, columns
names = ctx.names
if length(columns) > length(names)
# columns were widened during parsing, auto-generate trailing column names
names = makeunique(append!(names, [Symbol(:Column, i) for i = (length(names) + 1):length(columns)]))
end
for i = length(columns):-1:1
col = columns[i]
if col.willdrop
deleteat!(names, i)
deleteat!(columns, i)

(If i had to guess at why we parse them all, i'd guess it's because in multi-threaded parsing it's too hard to do the alternative of "ignoring" columns... but that's a guess)

But even taking for granted that's how thngs are implemented, I think this a good point:

I expect that types=Int32 applies to all select columns.

For example, we might be able to do something where we if we get types::Type{T} and select/drop, we expand types to be e.g. Dict(i => T for i in select) or typesf(i, name) = i in select ? T : nothing 🤔

@ShuhuaGao
Copy link
Author

Many thanks for the explanation. That makes sense if "it's too hard to do the alternative of "ignoring" columns" in multi-threading.

quinnj added a commit that referenced this issue Nov 16, 2021
Fixes #942. When dropping columns while parsing via `select` or `drop`,
we need to ensure the column type gets set as `HardMissing`.
@quinnj
Copy link
Member

quinnj commented Nov 16, 2021

Thanks for the report @ShuhuaGao; this is just an oversight in our parsing code when the types keyword is provided along with select/drop. A fix is up here: #943

quinnj added a commit that referenced this issue Nov 16, 2021
* Ensure dropped columns have type set as missing

Fixes #942. When dropping columns while parsing via `select` or `drop`,
we need to ensure the column type gets set as `HardMissing`.

* finish test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants