Are ignored columns still parsed? #942

ShuhuaGao · 2021-11-15T04:59:00Z

Consider the following MWE.

julia> using CSV

julia> data = """
       name, age
       Jack, 12
       Tom, 10
       """
"name, age\nJack, 12\nTom, 10\n"

julia> CSV.read(IOBuffer(data), Tuple; select=[2], types=Dict(2=>Int32))
(Int32[12, 10],)

julia> CSV.read(IOBuffer(data), Tuple; select=[2], types=Int32)
┌ Warning: thread = 1 warning: error parsing Int32 around row = 1, col = 1: "Jack,", error=INVALID: SENTINEL | DELIMITED | INVALID_DELIMITER 
└ @ CSV ~/.julia/packages/CSV/nofYz/src/file.jl:637
┌ Warning: thread = 1 warning: error parsing Int32 around row = 2, col = 1: "Tom,", error=INVALID: SENTINEL | DELIMITED | INVALID_DELIMITER 
└ @ CSV ~/.julia/packages/CSV/nofYz/src/file.jl:637
(Int32[12, 10],)

In the second read above, since I have only selected column 2, a natural expectation is that only column 2 is parsed. In other words, I expect that types=Int32 applies to all select columns. However, it seems that column 1 is still parsed.

The text was updated successfully, but these errors were encountered:

nickrobinson251 · 2021-11-15T11:57:30Z

Are ignored columns still parsed?

Yes, even with select (or drop), all columns are parsed initially, and then only afterwards dropped.

e.g. here's where they get dropped (immediately after parsing)

CSV.jl/src/file.jl

Lines 312 to 322 in a68bb04

    
           # delete any dropped columns from names, columns 
        
           names = ctx.names 
        
           if length(columns) > length(names) 
        
               # columns were widened during parsing, auto-generate trailing column names 
        
               names = makeunique(append!(names, [Symbol(:Column, i) for i = (length(names) + 1):length(columns)])) 
        
           end 
        
           for i = length(columns):-1:1 
        
               col = columns[i] 
        
               if col.willdrop 
        
                   deleteat!(names, i) 
        
                   deleteat!(columns, i)

(If i had to guess at why we parse them all, i'd guess it's because in multi-threaded parsing it's too hard to do the alternative of "ignoring" columns... but that's a guess)

But even taking for granted that's how thngs are implemented, I think this a good point:

I expect that types=Int32 applies to all select columns.

For example, we might be able to do something where we if we get types::Type{T} and select/drop, we expand types to be e.g. Dict(i => T for i in select) or typesf(i, name) = i in select ? T : nothing 🤔

ShuhuaGao · 2021-11-16T01:48:43Z

Many thanks for the explanation. That makes sense if "it's too hard to do the alternative of "ignoring" columns" in multi-threading.

Fixes #942. When dropping columns while parsing via `select` or `drop`, we need to ensure the column type gets set as `HardMissing`.

quinnj · 2021-11-16T04:44:30Z

Thanks for the report @ShuhuaGao; this is just an oversight in our parsing code when the types keyword is provided along with select/drop. A fix is up here: #943

* Ensure dropped columns have type set as missing Fixes #942. When dropping columns while parsing via `select` or `drop`, we need to ensure the column type gets set as `HardMissing`. * finish test

ShuhuaGao closed this as completed Nov 16, 2021

quinnj added a commit that referenced this issue Nov 16, 2021

Ensure dropped columns have type set as missing

e3d2f1a

Fixes #942. When dropping columns while parsing via `select` or `drop`, we need to ensure the column type gets set as `HardMissing`.

quinnj mentioned this issue Nov 16, 2021

Ensure dropped columns have type set as missing #943

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Are ignored columns still parsed? #942

Are ignored columns still parsed? #942

ShuhuaGao commented Nov 15, 2021

nickrobinson251 commented Nov 15, 2021

ShuhuaGao commented Nov 16, 2021

quinnj commented Nov 16, 2021

Are ignored columns still parsed? #942

Are ignored columns still parsed? #942

Comments

ShuhuaGao commented Nov 15, 2021

nickrobinson251 commented Nov 15, 2021

ShuhuaGao commented Nov 16, 2021

quinnj commented Nov 16, 2021