
Reading Transposed CSV with unequal rows reads only minimum size of data #834

Closed · shakisparki opened this issue May 9, 2021 · 2 comments

shakisparki commented May 9, 2021

Data (test.txt)

a, 0.1, 0.2, 0.3
b, 0.4

Code

julia> f = CSV.File(read("test.txt"), transpose=true, normalizenames=true)
1-element CSV.File{false}:
CSV.Row: (a = 0.1, b = 0.4)

Issue

The output ignores the remaining data in the first row (0.2, 0.3) instead of reading all of the data from both rows and reporting "missing" for the values row 2 does not have.
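For reference, here is a minimal reproduction using an in-memory buffer, along with the behavior this issue is asking for; the "expected" rows below are inferred from the description, not actual CSV.jl output:

using CSV

data = """
a, 0.1, 0.2, 0.3
b, 0.4
"""
f = CSV.File(IOBuffer(data); transpose=true, normalizenames=true)

# Actual (buggy) result: a single row, because parsing stops at the EOF of the shortest row:
#   CSV.Row: (a = 0.1, b = 0.4)
# Expected result: three rows, with `missing` filling out the shorter column:
#   CSV.Row: (a = 0.1, b = 0.4)
#   CSV.Row: (a = 0.2, b = missing)
#   CSV.Row: (a = 0.3, b = missing)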


quinnj commented May 11, 2021

Oof... this is kinda rough. The problem is that we reach EOF on row two right after 0.4, so parsing terminates. I guess instead of keying off EOF, we could maybe check whether each row's position buffer has terminated; I'll look into it, but it might be pretty tricky to make this work.
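A rough sketch (mine, not the actual CSV.jl internals) of the per-row idea above: keep parsing until every transposed row's buffer has terminated, emitting `missing` for rows that have already run out, instead of stopping at the first EOF.

# Toy model: `rows` are the file's lines already split on the delimiter, with the
# leading column name removed. Hypothetical helper, not part of CSV.jl.
function parse_transposed(rows::Vector{Vector{String}})
    ncols = maximum(length, rows)                    # keep going until the longest row ends
    out = [Vector{Union{Float64, Missing}}() for _ in rows]
    for col in 1:ncols
        for (i, row) in enumerate(rows)
            if col > length(row)                     # this row's position buffer has terminated
                push!(out[i], missing)
            else
                push!(out[i], parse(Float64, strip(row[col])))
            end
        end
    end
    return out
end

parse_transposed([["0.1", "0.2", "0.3"], ["0.4"]])
# => [[0.1, 0.2, 0.3], [0.4, missing, missing]]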

quinnj added a commit that referenced this issue May 23, 2021
This PR is a large refactoring of CSV.jl internals to simplify code, fix
a few bugs, and allow several desired improvements. Here's a list of
goals for the refactor:
  * Switch the `lazystrings` keyword argument to a more general
  `stringtype` argument, allowing users to pass
  `stringtype=PosLenString` to achieve the same effect that
  `lazystrings=true` previously provided; but in addition to just
  returning string columns as a lazy string type, it allows this new
  lazy string type (`PosLenString`) to be used in pooled columns as
  well (usage sketch after this commit message)
  * Change the header.jl file to context.jl and the `Header` struct to
  `Context`; the more descriptive name reflects that it holds all the
  variables and initial computation done before any real parsing work
  begins, so it provides "context" to parsing
  * Update multithreaded parsing to do "type sampling" before parsing
  begins; we were already scanning a number of rows per chunk when
  identifying where each chunk should start parsing; we now merge that
  operation with tracking the values parsed while identifying row
  starts, and use these sample values to figure out a column's initial
  type, build up pooled-column refpools, and determine other properties
  like max string size
  * Includes a fix for #834,
  since I was in the middle of a large refactoring anyway and it wasn't
  too hard to add while mucking around w/ everything
  * Big update to how pooling is done; I've tried to explain the new
  approach in great detail in the src/README.md document, which I plan
  on having as a sort of "dev docs" going forward
  * Getting rid of the internal "flags" bitmask in favor of a structured
  `Column` type that holds current type information, pooled-ness, parsed
  values, plus individual fields for things the flags were previously
  tracking. Overall, this helps simplify quite a bit of code
  * Introduces new behavior where _extra columns_ detected while parsing
  will be _added_ to the returned final columnset, instead of warned
  about; this is something I've wanted to do for a long time, under the
  general rule of "don't discard present data"; if it's there, parse it.

Things I'd still like to do before merging:
  * Review codegen/performance; from a quick glance at a few files, it
  seems that pooling is slower; it also seems like `detect` has some
  dynamic dispatch causing allocations
  * Review keyword argument defaults for `CSV.File`/`CSV.read`; we're
  getting very close to tagging a 1.0, so I'd like to do a thorough
  review of parsing options and make sure we feel confident in the
  user-facing API
  * Docs/tests update
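A hedged usage sketch of the `stringtype` change described in the first bullet above; the file name is a placeholder, and `PosLenString` is defined in the WeakRefStrings.jl package:

using CSV, WeakRefStrings

# Previously: CSV.File("data.csv"; lazystrings=true)
f = CSV.File("data.csv"; stringtype=PosLenString)    # lazy string columns, also usable in pooled columns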
quinnj added a commit that referenced this issue Jun 16, 2021
* CSV parsing internals refactoring

(Same refactor description as in the May 23 commit above, followed by these additional commits.)

* Allow mixing int/string/symbol in keyword args. Implements #810

* Add maximum file length check

* Fix some performance

* put back debug stmt

* fix CSV.Chunks and add at least one test

* Tweak default pooling per review comments. The main change in this
commit is that non-string columns won't participate when a fractional
threshold like `pool=0.1` is provided (see the brief illustration after
this commit list).

* Update src/README.md

Co-authored-by: Milan Bouchet-Valat <[email protected]>

* Support InlineString natively in CSV.jl

* Move PooledArrays code back to not rely on _always_ treating missing as ref 1 (though we actually still do that in the pool anyway).

* don't do inline strings over InlineString31 by default

* Fix tests

* fix tests again

* fix 32-bit

* fixes

* f

* Support Int32 parsing by default; add some more dev docs

* Support new downcast keyword arg (no docs yet)

* tests and fixes for downcast

Co-authored-by: Milan Bouchet-Valat <[email protected]>
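A brief, hedged illustration of the pooling tweak mentioned in the commit list above (placeholder file name; threshold semantics as I understand them, so treat this as a sketch):

using CSV

# With a fractional pool threshold, only string columns are candidates for pooling;
# numeric columns are left un-pooled even when they have few unique values.
f = CSV.File("data.csv"; pool=0.1)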

quinnj commented Jun 17, 2021

Fixed in #837

quinnj closed this as completed Jun 17, 2021