
Overhaul CSV.jl docs #869

Merged · 6 commits into main from jq/manual · Aug 20, 2021
Conversation

quinnj (Member) commented Aug 18, 2021

It's been a long time coming. Changes include:

  • In the index.md, we have an opening "manual" section that discusses the high-level functionality of the package and gives an overview of each entry point (CSV.File, CSV.read, CSV.write, etc.)
  • We also have larger sections for each supported keyword argument to CSV.File, with notes on compat w/ CSV.Rows/CSV.Chunks. Almost all the keyword argument sections also have links to examples in the examples.md file
  • The examples.md file has been restructured to be easily linkable from the index.md file, and includes more in-depth comments inline for each example, which should almost entirely be copy-pastable to a local REPL.

I've tried to include as much information as will hopefully be useful to users between the keyword argument sections and the corresponding examples.

If there are pieces that aren't clear, please comment and let's get it cleaned up! This is going to be the basis for 1.0 docs, so we'd like them to be solid, stable, and very helpful. Specifically, some of the trickiest pieces to clarify are the interactions between keyword arguments (like how pool might affect types, for one example); if you know or can think of cases where it's unclear how one keyword argument interacts with another, just chime in and we'll get it sorted out.

Thanks for any and all review!

@quinnj quinnj requested review from nalimilan and bkamins August 18, 2021 03:09
codecov bot commented Aug 18, 2021

Codecov Report

Merging #869 (ff78471) into main (d462c35) will decrease coverage by 0.24%.
The diff coverage is 78.57%.


```
@@            Coverage Diff             @@
##             main     #869      +/-   ##
==========================================
- Coverage   90.88%   90.63%   -0.25%
==========================================
  Files           9        9
  Lines        1898     1923      +25
==========================================
+ Hits         1725     1743      +18
- Misses        173      180       +7
```
Impacted Files Coverage Δ
src/CSV.jl 37.50% <ø> (ø)
src/write.jl 85.25% <ø> (ø)
src/file.jl 96.07% <42.85%> (-0.72%) ⬇️
src/utils.jl 83.46% <82.14%> (+0.06%) ⬆️
src/chunks.jl 91.30% <100.00%> (ø)
src/context.jl 87.86% <100.00%> (ø)
src/rows.jl 86.82% <100.00%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

docs/src/examples.md (outdated review thread):
```julia
file = CSV.File(file; drop=(i, nm) -> i == 2)
```

### [Limiting number of rows from data](@id limit_example)
Member commented:
Ah, I had this comment above; I would move this section there, as it is a more natural flow.

```julia
using CSV

# This file contains a 3x3 identity matrix of `Float64`. By default, parsing
# will detect the delimiter and type, but we can also explicitly pass
# `delim=' '` and `types=Float64`, which tells parsing to treat each column as
# `Float64` without having to guess the type on its own.
```
Member commented:
Probably `ignorerepeated` should also be added here, as repeated delimiters typically happen in matrix data:

```julia
julia> rand(5,5)
5×5 Matrix{Float64}:
 0.885415   0.436992  0.966233  0.415127   0.595875
 0.0674271  0.586727  0.992303  0.236927   0.997092
 0.729953   0.720313  0.70169   0.922141   0.377983
 0.932993   0.257281  0.92218   0.0864256  0.0153523
 0.852068   0.591985  0.577147  0.966295   0.220887
```

```
CSV.RowWriter
```
That's quite a bit! Let's boil down a TL;DR:
* Just want to read a delimited file and do basic stuff with data? Use `CSV.File(file)` or `CSV.read(file, DataFrame)`
Member commented:
They are not equivalent; i.e. calling `DataFrame` on `CSV.File` output copies the columns (which will also change their type).
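To make the distinction concrete, a quick sketch (the file name here is made up; this is an illustration, not text from the PR):

```julia
using CSV, DataFrames

# CSV.read passes the sink straight through, avoiding an extra copy
# of the parsed columns:
df = CSV.read("data.csv", DataFrame)

# DataFrame(CSV.File(...)) materializes a CSV.File first, then copies
# the columns into the DataFrame (pass copycols=false to reuse them):
df2 = DataFrame(CSV.File("data.csv"))
```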


An argument that controls the precise type of string columns. Supported values are `InlineString` (the default), `PosLenString`, or `String`. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:

* `InlineString`: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than `String`. When string length has a high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.
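For reference, a minimal sketch of how this argument is passed (assuming this excerpt documents the `stringtype` keyword; the data is made up):

```julia
using CSV

csv = IOBuffer("name,id\nalice,x1\nbob,y2\n")

# request plain `String` columns instead of the default `InlineString`s
f = CSV.File(csv; stringtype=String)
```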
Member commented:

Maybe add a cross-reference where the user can learn more.

bkamins (Member) commented Aug 18, 2021

I am OOO for a week, with scattered access to the Internet, so I gave some quick comments. Thank you!

quinnj (Member, Author) commented Aug 18, 2021

> I am OOO for a week, with scattered access to Internet, so I gave some quick comments. Thank you!

Glad you were able to take a look! Thank you!

docs/src/examples.md (outdated review thread):
```julia
84044,3.4
"""

file = CSV.File(file; typemap=Dict(Int => String))
```
Contributor commented:

Nice! I've been using CSV for a year now and never knew this functionality existed.

SamRodkey commented:
Super generic drive-by docs suggestion based upon my personal experience as a super newb a couple of weeks ago:

I just wanted to dump a big vanilla Julia matrix of numbers to a CSV file, so I found myself reading the docs. I couldn't find a single fully reproducible example (all of the examples seemed to presume some familiarity with Tables.jl) of how to dump an Array{Float64,2} to a CSV file anywhere. It seemed like something that would naturally belong in a quick start section...

After skimming the Tables.jl docs and then reading a bunch of Discourse posts, someone pointed out to just use `writedlm` instead because it's simpler for this use case.

Two actionable suggestions, with apologies if this is addressed in the PR (haven't had a chance to review yet):

  • If CSV.jl is intended for this use case, could you add a single fully reproducible example to the quick start /introductory section? Maybe just use rand to generate a big matrix that gets dumped to a file?
  • If CSV.jl is not intended for this use case, could you highlight that early in the docs and describe when to use the generic `writedlm` from the DelimitedFiles standard library vs. CSV.jl functionality?

Thanks @quinnj for all your hard work!


## [`pool`](@id pool)

Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, a `Float64`, a vector of `Bool` or `Float64`, or a dict mapping column number/name to `Bool` or `Float64`.

  • As a single `Bool`, like `pool=true`, it controls absolutely whether a column will be pooled or not: all string columns will be pooled, regardless of cardinality.
  • As a `Float64`, the value should be between `0.0` and `1.0`, indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors.
  • When the `pool` argument is a single `Bool` or `Float64`, only string columns are considered for pooling. When a vector or dict is provided, pooling for any column can be specified as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, a vector provided to `pool` should have an element for each column in the data input, while a dict argument can map column number/name for specific columns; unspecified columns will not be pooled when the argument is a dict.
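A short sketch of the argument forms described above (made-up data; the threshold is chosen purely for illustration):

```julia
using CSV, DataFrames

csv = """
id,code
1,a
2,b
3,a
"""

# pool all string columns, regardless of cardinality
df1 = CSV.read(IOBuffer(csv), DataFrame; pool=true)

# pool string columns whose % of unique values is below the threshold
df2 = CSV.read(IOBuffer(csv), DataFrame; pool=0.5)

# per-column control via a dict; unspecified columns are not pooled
df3 = CSV.read(IOBuffer(csv), DataFrame; pool=Dict(:code => true))
```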
Member commented:
If I understand correctly, a single value for `pool` is applied to string columns only, while a vector or dict applies to any column. Maybe highlight that more?

Also, I do not see the default stated in this description (maybe I missed it, but I cannot find it).

quinnj (Member, Author) commented:

Ah yes, I considered this section not-quite-done. I think it'd be useful for you, @nalimilan, and I (and anyone else interested) to discuss what the default pooling behavior will be. On current main branch, the default pool value is 0.25, and it applies the same to single and multithreading. The multithreaded case has also been updated w/ the same behavior as single threaded: i.e. we'll parse the entire file, and post-parsing, we'll check the cardinality of potentially pooled columns to see if they pass the threshold and then convert to regular array or pooled array as appropriate. The use-case I'm worried about given the current state is if you have a large file that uses multiple threads to parse, and you have a very high cardinality string column (like a string id). In that case, we would do roughly double the work parsing that id column, only to discard the pooled values at the end.

An alternative I've considered is making the default `pool=false`. So by default, no surprises or potential slowdowns. If you want pooled columns, you'd have to pass a `pool` value or `pool=true` explicitly.

Member commented:
I am OK with either `pool=0.25` or `pool=false` as the default. I think the problems users have with pooled columns (like the `map` issue) should be resolved on our side.

quinnj (Member, Author) commented:
Would it be too weird to have `pool=false` for multithreaded parsing and `pool=0.25` for single-threaded? Or would we run into too many issues where the difference in array type would be surprising?

quinnj (Member, Author) commented Aug 19, 2021

> I just wanted to dump a big vanilla Julia matrix of numbers to a CSV file, so I found myself reading the docs. I couldn't find a single fully reproducible example (all of the examples seemed to presume some familiarity with Tables.jl) of how to dump an Array{Float64,2} to a CSV file anywhere. It seemed like something that would naturally belong in a quick start section...

@SamRodkey great suggestion! I just added a couple more examples explicitly for writing matrices; both to the high-level "start" and to the CSV.write docs specifically. These are exactly the kinds of suggestions that are the most helpful! Let me know if you have any more!
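For anyone landing here, the matrix-writing pattern under discussion looks roughly like this (a sketch assuming Tables.jl's `Tables.table` wrapper; the file name is made up):

```julia
using CSV, Tables

m = rand(5, 5)

# wrap the plain matrix in a Tables.jl-compatible table, then write it;
# Tables.table auto-generates column names Column1, Column2, ...
CSV.write("matrix.csv", Tables.table(m))
```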

quinnj added 4 commits August 19, 2021 16:36
* Fix select/drop on CSV.Rows

The new column-widening functionality regressed using select/drop
keyword args for `CSV.Rows`. May fix #858.

* fix tests
* Add automatic support for reading gzipped csv files

Also fixes #733 by using a performant method for processing gzipped
files. The motivation here is to make this more "automatic" for users so
they don't have to fumble around and question whether they should use
GZip.jl or CodecZlib.jl or Zlib.jl or whatever. We know which is the
fastest (CodecZlib.jl), and it's a mature, properly binary-built package
at this point that works cross platform, so I submit that it's a
non-risky dependency to take on for the benefit it provides.

In terms of implementation, we take the same approach as data.table
`fread` here by decompressing the gzip file to a temp file, then
mmapping it. We track the temp files generated in this way and clean
them up at the end of `CSV.File` (I need to check `CSV.Rows` as well),
and we can also rely on having the default of `cleanup=true` for
`tempname()`, which means the julia process itself will try and delete
all its generated temp files when it exits.

Alternative approaches involving reading chunks of a
`GzipDecompressorStream` or having a new `BufferedIOVector <:
AbstractVector{UInt8}` which would buffer reads from a
`GzipDecompressorStream` input were both considered, but at the end of
the day, CSV.jl is hard-coded in several places to work on the input as
a `Vector{UInt8}` explicitly. Parsers.jl is more flexible, but it would
take non-trivial work to allow CSV.jl the same flexibility. Another
factor to consider here is that if we allowed many different `buf` types
(other than just `Vector{UInt8}`), then we dig up the recompilation
problems all over again (that we literally just finished burying with
the removal of type parameters from `Parsers.Options`). That's probably
the biggest factor for me personally to take the `fread` approach of
using tempfiles, in addition to being very performant.

So follow ups before merging are:
  * Check that `CSV.Rows` can clean up tempfiles
  * We can be more strict in a few places of requiring `Vector{UInt8}`
  in CSV.jl
  * Update doc examples to not need `transcode` explicitly

* add buffer_in_memory keyword arg and ensure _any_ source that is gzip will be decompressed automatically

* use mktemp instead of tempname

* make sure we hit buffer_in_memory codepath
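As a rough sketch of the behavior these commits describe (assuming CodecZlib's stream API; file names made up):

```julia
using CSV, CodecZlib

# write a small gzipped csv
io = GzipCompressorStream(open("data.csv.gz", "w"))
write(io, "a,b\n1,2\n3,4\n")
close(io)

# CSV.File detects the gzipped source and decompresses automatically,
# to a tempfile by default, or entirely in memory when requested:
f = CSV.File("data.csv.gz")
g = CSV.File("data.csv.gz"; buffer_in_memory=true)
```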
@quinnj quinnj merged commit 17bffa1 into main Aug 20, 2021
@quinnj quinnj deleted the jq/manual branch August 20, 2021 04:16