Reading Transposed CSV with unequal rows reads only minimum size of data #834
Comments
Oof... this is kinda rough. The problem is that we reach EOF on row two right after |
quinnj added a commit that referenced this issue on May 23, 2021:
This PR is a large refactoring of CSV.jl internals to simplify code, fix a few bugs, and allow several desired improvements. Here's a list of goals for the refactor:

* Switch the `lazystrings` keyword argument to a more general `stringtype` argument, allowing users to pass `stringtype=PosLenString` to achieve the same effect `lazystrings=true` previously provided; in addition to returning string columns as a lazy string type, this allows the new lazy string type (`PosLenString`) to be used in pooled columns as well (see the sketch after this list)
* Change the header.jl file to context.jl and the `Header` struct to `Context`; it's a more descriptive name, as it holds all the variables and initial computation done before any real parsing work begins, so it provides "context" to parsing
* Updated multithreaded parsing to do "type sampling" before parsing begins; we were already scanning a number of rows per chunk when identifying where each chunk should start parsing; we now merge that operation with tracking the values parsed while identifying row starts, and use these sample values to figure out a column's initial type, build up pooled-column refpools, and determine other properties like max string size
* Includes a fix for #834, since I was in the middle of a large refactoring anyway and it wasn't too hard to add while mucking around w/ everything
* Big update to how pooling is done; I've tried to explain the new approach in great detail in the src/README.md document, which I plan on having as a sort of "dev docs" going forward
* Getting rid of the internal "flags" bitmask in favor of a structured `Column` type that holds current type information, pooled-ness, parsed values, plus individual fields for the things the flags were previously tracking. Overall, this helps simplify quite a bit of code
* Introduces new behavior where _extra columns_ detected while parsing are _added_ to the returned final columnset instead of warned about; this is something I've wanted to do for a long time, under the general rule of "don't discard present data": if it's there, parse it.

Things I'd still like to do before merging:

* Review codegen/performance; from a quick glance at a few files, it seems that pooling is slower; it also seems like `detect` has some dynamic dispatch causing allocations
* Review keyword argument defaults for `CSV.File`/`CSV.read`; we're getting very close to tagging a 1.0, so I'd like to do a thorough review of parsing options and make sure we feel confident in the user-facing API
* Docs/tests update
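For illustration, a minimal sketch of the `stringtype` change described in the list above. This is not taken from the PR itself: the file name `data.csv` is hypothetical, and `PosLenString` is assumed here to be provided by the WeakRefStrings.jl package; the exact keyword surface is whatever the tagged CSV.jl release documents.

```julia
using CSV, WeakRefStrings  # WeakRefStrings assumed to provide `PosLenString`

# Previously, lazy string columns were requested with the boolean flag:
# f = CSV.File("data.csv"; lazystrings=true)

# After the refactor, the lazy string type is requested explicitly. Passing a
# string type this way also lets pooled columns hold `PosLenString` values.
f = CSV.File("data.csv"; stringtype=PosLenString)
```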
quinnj added a commit that referenced this issue on Jun 16, 2021:
* CSV parsing internals refactoring

This PR is a large refactoring of CSV.jl internals to simplify code, fix a few bugs, and allow several desired improvements. Here's a list of goals for the refactor:

* Switch the `lazystrings` keyword argument to a more general `stringtype` argument, allowing users to pass `stringtype=PosLenString` to achieve the same effect `lazystrings=true` previously provided; in addition to returning string columns as a lazy string type, this allows the new lazy string type (`PosLenString`) to be used in pooled columns as well
* Change the header.jl file to context.jl and the `Header` struct to `Context`; it's a more descriptive name, as it holds all the variables and initial computation done before any real parsing work begins, so it provides "context" to parsing
* Updated multithreaded parsing to do "type sampling" before parsing begins; we were already scanning a number of rows per chunk when identifying where each chunk should start parsing; we now merge that operation with tracking the values parsed while identifying row starts, and use these sample values to figure out a column's initial type, build up pooled-column refpools, and determine other properties like max string size
* Includes a fix for #834, since I was in the middle of a large refactoring anyway and it wasn't too hard to add while mucking around w/ everything
* Big update to how pooling is done; I've tried to explain the new approach in great detail in the src/README.md document, which I plan on having as a sort of "dev docs" going forward
* Getting rid of the internal "flags" bitmask in favor of a structured `Column` type that holds current type information, pooled-ness, parsed values, plus individual fields for the things the flags were previously tracking. Overall, this helps simplify quite a bit of code
* Introduces new behavior where _extra columns_ detected while parsing are _added_ to the returned final columnset instead of warned about; this is something I've wanted to do for a long time, under the general rule of "don't discard present data": if it's there, parse it.

Things I'd still like to do before merging:

* Review codegen/performance; from a quick glance at a few files, it seems that pooling is slower; it also seems like `detect` has some dynamic dispatch causing allocations
* Review keyword argument defaults for `CSV.File`/`CSV.read`; we're getting very close to tagging a 1.0, so I'd like to do a thorough review of parsing options and make sure we feel confident in the user-facing API
* Docs/tests update

* Allow mixing int/string/symbol in keyword args. Implements #810
* Add maximum file length check
* Fix some performance
* put back debug stmt
* fix CSV.Chunks and add at least one test
* Tweak default pooling per review comments. The main change in this commit is that non-string columns won't participate when `pool=0.1`, for example, is provided (see the sketch after this list)
* Update src/README.md Co-authored-by: Milan Bouchet-Valat <[email protected]>
* Support InlineString natively in CSV.jl
* Move PooledArrays code back to not rely on _always_ treating missing as ref 1 (though we actually still do that in the pool anyway).
* don't do inline strings over InlineString31 by default
* Fix tests
* fix tests again
* fix 32-bit
* fixes
* f
* Support Int32 parsing by default; add some more dev docs
* Support new downcast keyword arg (no docs yet)
* tests and fixes for downcast

Co-authored-by: Milan Bouchet-Valat <[email protected]>
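A hedged sketch of two items from the commit list above: the pooling tweak (a fractional `pool` value such as 0.1 applying only to string columns) and the new `downcast` keyword. The file name `data.csv` is hypothetical, and the exact defaults are those documented by the CSV.jl release that followed these commits.

```julia
using CSV, DataFrames

# With the tweak above, a fractional pool threshold such as 0.1 is applied to
# string columns; non-string columns no longer participate in pooling here.
df = CSV.read("data.csv", DataFrame; pool=0.1)

# The `downcast` keyword introduced in these commits asks the parser to store
# integer columns in the smallest integer type that fits the parsed values.
df_small = CSV.read("data.csv", DataFrame; downcast=true)
```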
Fixed in #837
Data (test.txt)

```
a, 0.1, 0.2, 0.3
b, 0.4
```
Code

```julia
julia> f = CSV.File(read("test.txt"), transpose=true, normalizenames=true)
1-element CSV.File{false}:
 CSV.Row: (a = 0.1, b = 0.4)
```
Issue
The output ignores the remaining data in the first row. It should instead read all of row 1, and also read row 2 while reporting `missing` for the entries row 2 does not have.
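For contrast, a sketch of the expected output once the fix lands: all three values from row 1 are read, and the shorter row 2 is padded with `missing`. The printed form below is illustrative, not copied from a real session.

```julia
julia> f = CSV.File(read("test.txt"), transpose=true, normalizenames=true)
3-element CSV.File:
 CSV.Row: (a = 0.1, b = 0.4)
 CSV.Row: (a = 0.2, b = missing)
 CSV.Row: (a = 0.3, b = missing)
```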