
Overhaul CSV.jl docs #869

Merged · 6 commits into main from jq/manual · Aug 20, 2021
Conversation

quinnj (Member) commented Aug 18, 2021

It's been a long time coming. Changes include:

  • In the index.md, we have an opening "manual" section that discusses the high-level functionality of the package and gives an overview of each entry point (CSV.File, CSV.read, CSV.write, etc.)
  • We also have larger sections for each supported keyword argument to CSV.File, with notes on compat w/ CSV.Rows/CSV.Chunks. Almost all the keyword argument sections also have links to examples in the examples.md file
  • The examples.md file has been restructured to be easily linkable from the index.md file, and includes more in-depth comments inline for each example, which should almost entirely be copy-pastable to a local REPL.

I've tried to include as much information as will hopefully be useful to users between the keyword argument sections and the corresponding examples.

If there are pieces that aren't clear, please comment and let's get it cleaned up! This is going to be the basis for 1.0 docs, so we'd like them to be solid, stable, and very helpful. Specifically, some of the trickiest pieces to clarify are the interactions between keyword arguments (like how pool might affect types, for one example); if you know or can think of cases where it's unclear how one keyword argument interacts with another, just chime in and we'll get it sorted out.

Thanks for any and all review!

@quinnj quinnj requested review from nalimilan and bkamins August 18, 2021 03:09
codecov bot commented Aug 18, 2021

Codecov Report

Merging #869 (ff78471) into main (d462c35) will decrease coverage by 0.24%.
The diff coverage is 78.57%.


```
@@            Coverage Diff             @@
##             main     #869      +/-   ##
==========================================
- Coverage   90.88%   90.63%   -0.25%
==========================================
  Files           9        9
  Lines        1898     1923      +25
==========================================
+ Hits         1725     1743      +18
- Misses        173      180       +7
```
Impacted Files Coverage Δ
src/CSV.jl 37.50% <ø> (ø)
src/write.jl 85.25% <ø> (ø)
src/file.jl 96.07% <42.85%> (-0.72%) ⬇️
src/utils.jl 83.46% <82.14%> (+0.06%) ⬆️
src/chunks.jl 91.30% <100.00%> (ø)
src/context.jl 87.86% <100.00%> (ø)
src/rows.jl 86.82% <100.00%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

docs/src/examples.md (outdated review thread):
```julia
file = CSV.File(file; drop=(i, nm) -> i == 2)
```

### [Limiting number of rows from data](@id limit_example)
Member commented:
Ah, I had this comment above; I would move this section there, as it is a more natural flow.

```julia
using CSV

# This file contains a 3x3 identity matrix of `Float64`. By default, parsing
# will detect the delimiter and type, but we can also explicitly pass
# `delim=' '` and `types=Float64`, which tells parsing to treat each column as
# `Float64` without having to guess the type on its own.
```
Member commented:
Probably `ignorerepeated` should also be added here, as repeated delimiters typically happen in matrix data:

```julia
julia> rand(5,5)
5×5 Matrix{Float64}:
 0.885415   0.436992  0.966233  0.415127   0.595875
 0.0674271  0.586727  0.992303  0.236927   0.997092
 0.729953   0.720313  0.70169   0.922141   0.377983
 0.932993   0.257281  0.92218   0.0864256  0.0153523
 0.852068   0.591985  0.577147  0.966295   0.220887
```

```
CSV.RowWriter
```
That's quite a bit! Let's boil down a TL;DR:
* Just want to read a delimited file and do basic stuff with data? Use `CSV.File(file)` or `CSV.read(file, DataFrame)`
Member commented:
They are not equivalent; i.e. calling `DataFrame` on `CSV.File` output copies the columns (which will also change their type).
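To make the distinction concrete, a quick sketch (the file name here is made up; this is an illustration, not text from the PR):

```julia
using CSV, DataFrames

# CSV.read passes the sink straight through, avoiding an extra copy
# of the parsed columns:
df = CSV.read("data.csv", DataFrame)

# DataFrame(CSV.File(...)) materializes a CSV.File first, then copies
# the columns into the DataFrame (pass copycols=false to reuse them):
df2 = DataFrame(CSV.File("data.csv"))
```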


An argument that controls the precise type of string columns. Supported values are `InlineString` (the default), `PosLenString`, or `String`. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:

* `InlineString`: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than `String`. When string length has a high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.
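For reference, a minimal sketch of how this argument is passed (assuming this excerpt documents the `stringtype` keyword; the data is made up):

```julia
using CSV

csv = IOBuffer("name,id\nalice,x1\nbob,y2\n")

# request plain `String` columns instead of the default `InlineString`s
f = CSV.File(csv; stringtype=String)
```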
Member commented:

Maybe add a cross-reference where the user can learn more.

bkamins (Member) commented Aug 18, 2021

I am OOO for a week, with scattered access to the Internet, so I gave some quick comments. Thank you!

quinnj (Member, Author) commented Aug 18, 2021

> I am OOO for a week, with scattered access to Internet, so I gave some quick comments. Thank you!

Glad you were able to take a look! Thank you!

docs/src/examples.md (outdated review thread):
```julia
84044,3.4
"""

file = CSV.File(file; typemap=Dict(Int => String))
```
Contributor commented:

Nice! I've been using CSV for a year now and never knew this functionality existed.

SamRodkey commented:
Super generic drive-by docs suggestion based upon my personal experience as a super newb a couple of weeks ago:

I just wanted to dump a big vanilla Julia matrix of numbers to a CSV file, so I found myself reading the docs. I couldn't find a single fully reproducible example (all of the examples seemed to presume some familiarity with Tables.jl) of how to dump an Array{Float64,2} to a CSV file anywhere. It seemed like something that would naturally belong in a quick start section...

After skimming the Tables.jl docs and then reading a bunch of Discourse posts, someone pointed out to just use `writedlm` instead because it's simpler for this use case.

Two actionable suggestions, with apologies if this is addressed in the PR (haven't had a chance to review yet):

  • If CSV.jl is intended for this use case, could you add a single fully reproducible example to the quick start /introductory section? Maybe just use rand to generate a big matrix that gets dumped to a file?
  • If CSV.jl is not intended for this use case, could you highlight that early in the docs and describe when to use the generic `writedlm` from the DelimitedFiles standard library vs. CSV.jl functionality?

Thanks @quinnj for all your hard work!


## [`pool`](@id pool)

Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, a `Float64`, a vector of `Bool` or `Float64`, or a dict mapping column number/name to `Bool` or `Float64`.

  • As a single `Bool`, like `pool=true`, it controls absolutely whether a column will be pooled or not: all string columns will be pooled, regardless of cardinality.
  • As a `Float64`, the value should be between `0.0` and `1.0`, indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors.
  • When the `pool` argument is a single `Bool` or `Float64`, only string columns are considered for pooling. When a vector or dict is provided, pooling for any column can be specified as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, a vector provided to `pool` should have an element for each column in the data input, while a dict argument can map column number/name for specific columns; unspecified columns will not be pooled when the argument is a dict.
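A short sketch of the argument forms described above (made-up data; the threshold is chosen purely for illustration):

```julia
using CSV, DataFrames

csv = """
id,code
1,a
2,b
3,a
"""

# pool all string columns, regardless of cardinality
df1 = CSV.read(IOBuffer(csv), DataFrame; pool=true)

# pool string columns whose % of unique values is below the threshold
df2 = CSV.read(IOBuffer(csv), DataFrame; pool=0.5)

# per-column control via a dict; unspecified columns are not pooled
df3 = CSV.read(IOBuffer(csv), DataFrame; pool=Dict(:code => true))
```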
Member commented:
If I understand correctly, a single value for `pool` is applied to string columns only, while a vector or dict applies to any column. Maybe highlight that more?

Also, I do not see the default stated in this description (maybe I missed it, but I cannot find it).

quinnj (Member, Author) commented:

Ah yes, I considered this section not-quite-done. I think it'd be useful for you, @nalimilan, and I (and anyone else interested) to discuss what the default pooling behavior will be. On current main branch, the default pool value is 0.25, and it applies the same to single and multithreading. The multithreaded case has also been updated w/ the same behavior as single threaded: i.e. we'll parse the entire file, and post-parsing, we'll check the cardinality of potentially pooled columns to see if they pass the threshold and then convert to regular array or pooled array as appropriate. The use-case I'm worried about given the current state is if you have a large file that uses multiple threads to parse, and you have a very high cardinality string column (like a string id). In that case, we would do roughly double the work parsing that id column, only to discard the pooled values at the end.

An alternative I've considered is making the default `pool=false`. So by default, no surprises or potential slowdowns. If you want pooled columns, you'd have to pass a `pool` value or `pool=true` explicitly.

Member commented:
I am OK with either `pool=0.25` or `pool=false` as the default. I think the problems users have with pooled columns (like the `map` issue) should be resolved on our side.

quinnj (Member, Author) commented:
Would it be too weird to have `pool=false` for multithreaded parsing and `pool=0.25` for single-threaded? Or would we run into too many issues where the difference in array type would be surprising?

quinnj (Member, Author) commented Aug 19, 2021

> I just wanted to dump a big vanilla Julia matrix of numbers to a CSV file, so I found myself reading the docs. I couldn't find a single fully reproducible example (all of the examples seemed to presume some familiarity with Tables.jl) of how to dump an Array{Float64,2} to a CSV file anywhere. It seemed like something that would naturally belong in a quick start section...

@SamRodkey great suggestion! I just added a couple more examples explicitly for writing matrices; both to the high-level "start" and to the CSV.write docs specifically. These are exactly the kinds of suggestions that are the most helpful! Let me know if you have any more!
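For anyone landing here, the matrix-writing pattern under discussion looks roughly like this (a sketch assuming Tables.jl's `Tables.table` wrapper; the file name is made up):

```julia
using CSV, Tables

m = rand(5, 5)

# wrap the plain matrix in a Tables.jl-compatible table, then write it;
# Tables.table auto-generates column names Column1, Column2, ...
CSV.write("matrix.csv", Tables.table(m))
```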

quinnj added 4 commits August 19, 2021 16:36
* Fix select/drop on CSV.Rows

The new column-widening functionality regressed using select/drop
keyword args for `CSV.Rows`. May fix #858.

* fix tests
* Add automatic support for reading gzipped csv files

Also fixes #733 by using a performant method for processing gzipped
files. The motivation here is to make this more "automatic" for users so
they don't have to fumble around and question whether they should use
GZip.jl or CodecZlib.jl or Zlib.jl or whatever. We know which is the
fastest (CodecZlib.jl), and it's a mature, properly binary-built package
at this point that works cross platform, so I submit that it's a
non-risky dependency to take on for the benefit it provides.

In terms of implementation, we take the same approach as data.table
`fread` here by decompressing the gzip file to a temp file, then
mmapping it. We track the temp files generated in this way and clean
them up at the end of `CSV.File` (I need to check `CSV.Rows` as well),
and we can also rely on having the default of `cleanup=true` for
`tempname()`, which means the julia process itself will try and delete
all its generated temp files when it exits.

Alternative approaches involving reading chunks of a
`GzipDecompressorStream` or having a new `BufferedIOVector <:
AbstractVector{UInt8}` which would buffer reads from a
`GzipDecompressorStream` input were both considered, but at the end of
the day, CSV.jl is hard-coded in several places to work on the input as
a `Vector{UInt8}` explicitly. Parsers.jl is more flexible, but it would
take non-trivial work to allow CSV.jl the same flexibility. Another
factor to consider here is that if we allowed many different `buf` types
(other than just `Vector{UInt8}`), then we dig up the recompilation
problems all over again (that we literally just finished burying with
the removal of type parameters from `Parsers.Options`). That's probably
the biggest factor for me personally to take the `fread` approach of
using tempfiles, in addition to being very performant.

So follow ups before merging are:
  * Check that `CSV.Rows` can clean up tempfiles
  * We can be more strict in a few places of requiring `Vector{UInt8}`
  in CSV.jl
  * Update doc examples to not need `transcode` explicitly

* add buffer_in_memory keyword arg and ensure _any_ source that is gzip will be decompressed automatically

* use mktemp instead of tempname

* make sure we hit buffer_in_memory codepath
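As a rough sketch of the behavior these commits describe (assuming CodecZlib's stream API; file names made up):

```julia
using CSV, CodecZlib

# write a small gzipped csv
io = GzipCompressorStream(open("data.csv.gz", "w"))
write(io, "a,b\n1,2\n3,4\n")
close(io)

# CSV.File detects the gzipped source and decompresses automatically,
# to a tempfile by default, or entirely in memory when requested:
f = CSV.File("data.csv.gz")
g = CSV.File("data.csv.gz"; buffer_in_memory=true)
```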
@quinnj quinnj merged commit 17bffa1 into main Aug 20, 2021
@quinnj quinnj deleted the jq/manual branch August 20, 2021 04:16