Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix select/drop on CSV.Rows #866

Merged
merged 2 commits into from
Aug 5, 2021
Merged

Fix select/drop on CSV.Rows #866

merged 2 commits into from
Aug 5, 2021

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Aug 5, 2021

The new column-widening functionality regressed using select/drop
keyword args for CSV.Rows. May fix #858.

quinnj added 2 commits August 4, 2021 23:22
The new column-widening functionality regressed using select/drop
keyword args for `CSV.Rows`. May fix #858.
@codecov
Copy link

codecov bot commented Aug 5, 2021

Codecov Report

Merging #866 (4703ffa) into main (d462c35) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #866   +/-   ##
=======================================
  Coverage   90.88%   90.88%           
=======================================
  Files           9        9           
  Lines        1898     1898           
=======================================
  Hits         1725     1725           
  Misses        173      173           
Impacted Files Coverage Δ
src/rows.jl 86.82% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d462c35...4703ffa. Read the comment docs.

@quinnj quinnj merged commit 3b62f10 into main Aug 5, 2021
@quinnj quinnj deleted the jq/selectrows branch August 5, 2021 05:58
quinnj added a commit that referenced this pull request Aug 19, 2021
* Fix select/drop on CSV.Rows

The new column-widening functionality regressed using select/drop
keyword args for `CSV.Rows`. May fix #858.

* fix tests
quinnj added a commit that referenced this pull request Aug 20, 2021
* lots of updates to manual/docs

* fix

* Fix select/drop on CSV.Rows (#866)

* Fix select/drop on CSV.Rows

The new column-widening functionality regressed using select/drop
keyword args for `CSV.Rows`. May fix #858.

* fix tests

* Add automatic support for reading gzipped csv files (#864)

* Add automatic support for reading gzipped csv files

Also fixes #733 by using a performant method for processing gzipped
files. The motivation here is to make this more "automatic" for users so
they don't have to fumble around and question whether they should use
GZip.jl or CodecZlib.jl or Zlib.jl or whatever. We know which is the
fastest (CodecZlib.jl), and it's a mature, properly binary-built package
at this point that works cross platform, so I submit that it's a
non-risky dependency to take on for the benefit it provides.

In terms of implementation, we take the same approach as data.table
`fread` here by decompressing the gzip file to a temp file, then
mmapping it. We track the temp files generated in this way and clean
them up at the end of `CSV.File` (I need to check `CSV.Rows` as well),
and we can also rely on having the default of `cleanup=true` for
`tempname()`, which means the julia process itself will try and delete
all its generated temp files when it exits.

Alternative approaches involving reading chunks of a
`GzipDecompressorStream` or having a new `BufferedIOVector <:
AbstractVector{UInt8}` which would buffer reads from a
`GzipDecompressorStream` input were both considered, but at the end of
the day, CSV.jl is hard-coded in several places to work on the input as
a `Vector{UInt8}` explicitly. Parsers.jl is more flexible, but it would
take non-trivial work to allow CSV.jl the same flexibility. Another
factor to consider here is that if we allowed many different `buf` types
(other than just `Vector{UInt8}`), then we dig up the recompilation
problems all over again (that we literally just finished burying with
the removal of type parameters from `Parsers.Options`). That's probably
the biggest factor for me personally to take the `fread` approach of
using tempfiles, in addition to being very performant.

So follow ups before merging are:
  * Check that `CSV.Rows` can clean up tempfiles
  * We can be more strict in a few places of requiring `Vector{UInt8}`
  in CSV.jl
  * Update doc examples to not need `transcode` explicitly

* add buffer_in_memory keyword arg and ensure _any_ source that is gzip will be decompressed automatically

* use mktemp instead of tempname

* make sure we hit buffer_in_memory codepath

* lots of updates to manual/docs

* add more writing examples
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using select with CSV.Rows
1 participant