
Reading Transposed CSV with unequal rows reads only minimum size of data #834

Closed · shakisparki opened this issue May 9, 2021 · 2 comments

shakisparki commented May 9, 2021

Data (test.txt)

a, 0.1, 0.2, 0.3
b, 0.4

Code

julia> f = CSV.File(read("test.txt"), transpose=true, normalizenames=true)
1-element CSV.File{false}:
CSV.Row: (a = 0.1, b = 0.4)

Issue

The output ignores the remaining data in the first row (0.2, 0.3) instead of reading all of the data from both rows and reporting "missing" for the values row 2 does not have.
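For reference, here is a minimal reproduction using an in-memory buffer, along with the behavior this issue is asking for; the "expected" rows below are inferred from the description, not actual CSV.jl output:

using CSV

data = """
a, 0.1, 0.2, 0.3
b, 0.4
"""
f = CSV.File(IOBuffer(data); transpose=true, normalizenames=true)

# Actual (buggy) result: a single row, because parsing stops at the EOF of the shortest row:
#   CSV.Row: (a = 0.1, b = 0.4)
# Expected result: three rows, with `missing` filling out the shorter column:
#   CSV.Row: (a = 0.1, b = 0.4)
#   CSV.Row: (a = 0.2, b = missing)
#   CSV.Row: (a = 0.3, b = missing)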


quinnj commented May 11, 2021

Oof... this is kinda rough. The problem is that we reach EOF on row two right after 0.4, so parsing terminates. I guess instead of keying off EOF, we could maybe check whether each row's position buffer has terminated; I'll look into it, but it might be pretty tricky to make this work.
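A rough sketch (mine, not the actual CSV.jl internals) of the per-row idea above: keep parsing until every transposed row's buffer has terminated, emitting `missing` for rows that have already run out, instead of stopping at the first EOF.

# Toy model: `rows` are the file's lines already split on the delimiter, with the
# leading column name removed. Hypothetical helper, not part of CSV.jl.
function parse_transposed(rows::Vector{Vector{String}})
    ncols = maximum(length, rows)                    # keep going until the longest row ends
    out = [Vector{Union{Float64, Missing}}() for _ in rows]
    for col in 1:ncols
        for (i, row) in enumerate(rows)
            if col > length(row)                     # this row's position buffer has terminated
                push!(out[i], missing)
            else
                push!(out[i], parse(Float64, strip(row[col])))
            end
        end
    end
    return out
end

parse_transposed([["0.1", "0.2", "0.3"], ["0.4"]])
# => [[0.1, 0.2, 0.3], [0.4, missing, missing]]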

quinnj added a commit that referenced this issue May 23, 2021
This PR is a large refactoring of CSV.jl internals to simplify code, fix
a few bugs, and allow several desired improvements. Here's a list of
goals for the refactor:
  * Switch the `lazystrings` keyword argument to a more general
  `stringtype` argument, allowing users to pass
  `stringtype=PosLenString` to achieve the same effect that
  `lazystrings=true` previously provided; but in addition to just
  returning string columns as a lazy string type, it allows this new
  lazy string type (`PosLenString`) to be used in pooled columns as
  well (usage sketch after this commit message)
  * Change the header.jl file to context.jl and the `Header` struct to
  `Context`; the more descriptive name reflects that it holds all the
  variables and initial computation done before any real parsing work
  begins, so it provides "context" to parsing
  * Update multithreaded parsing to do "type sampling" before parsing
  begins; we were already scanning a number of rows per chunk when
  identifying where each chunk should start parsing; we now merge that
  operation with tracking the values parsed while identifying row
  starts, and use these sample values to figure out a column's initial
  type, build up pooled-column refpools, and determine other properties
  like max string size
  * Includes a fix for #834,
  since I was in the middle of a large refactoring anyway and it wasn't
  too hard to add while mucking around w/ everything
  * Big update to how pooling is done; I've tried to explain the new
  approach in great detail in the src/README.md document, which I plan
  on having as a sort of "dev docs" going forward
  * Getting rid of the internal "flags" bitmask in favor of a structured
  `Column` type that holds current type information, pooled-ness, parsed
  values, plus individual fields for things the flags were previously
  tracking. Overall, this helps simplify quite a bit of code
  * Introduces new behavior where _extra columns_ detected while parsing
  will be _added_ to the returned final columnset, instead of warned
  about; this is something I've wanted to do for a long time, under the
  general rule of "don't discard present data"; if it's there, parse it.

Things I'd still like to do before merging:
  * Review codegen/performance; from a quick glance at a few files, it
  seems that pooling is slower; it also seems like `detect` has some
  dynamic dispatch causing allocations
  * Review keyword argument defaults for `CSV.File`/`CSV.read`; we're
  getting very close to tagging a 1.0, so I'd like to do a thorough
  review of parsing options and make sure we feel confident in the
  user-facing API
  * Docs/tests update
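A hedged usage sketch of the `stringtype` change described in the first bullet above; the file name is a placeholder, and `PosLenString` is defined in the WeakRefStrings.jl package:

using CSV, WeakRefStrings

# Previously: CSV.File("data.csv"; lazystrings=true)
f = CSV.File("data.csv"; stringtype=PosLenString)    # lazy string columns, also usable in pooled columns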
quinnj added a commit that referenced this issue Jun 16, 2021
* CSV parsing internals refactoring

(Same refactor description as in the May 23 commit above, followed by these additional commits.)

* Allow mixing int/string/symbol in keyword args. Implements #810

* Add maximum file length check

* Fix some performance

* put back debug stmt

* fix CSV.Chunks and add at least one test

* Tweak default pooling per review comments. The main change in this
commit is that non-string columns won't participate when a fractional
threshold like `pool=0.1` is provided (see the brief illustration after
this commit list).

* Update src/README.md

Co-authored-by: Milan Bouchet-Valat <[email protected]>

* Support InlineString natively in CSV.jl

* Move PooledArrays code back to not rely on _always_ treating missing as ref 1 (though we actually still do that in the pool anyway).

* don't do inline strings over InlineString31 by default

* Fix tests

* fix tests again

* fix 32-bit

* fixes

* f

* Support Int32 parsing by default; add some more dev docs

* Support new downcast keyword arg (no docs yet)

* tests and fixes for downcast

Co-authored-by: Milan Bouchet-Valat <[email protected]>
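A brief, hedged illustration of the pooling tweak mentioned in the commit list above (placeholder file name; threshold semantics as I understand them, so treat this as a sketch):

using CSV

# With a fractional pool threshold, only string columns are candidates for pooling;
# numeric columns are left un-pooled even when they have few unique values.
f = CSV.File("data.csv"; pool=0.1)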

quinnj commented Jun 17, 2021

Fixed in #837

quinnj closed this as completed Jun 17, 2021