Overhaul CSV.jl docs #869
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##             main     #869      +/-   ##
==========================================
- Coverage   90.88%   90.63%   -0.25%
==========================================
  Files           9        9
  Lines        1898     1923      +25
==========================================
+ Hits         1725     1743      +18
- Misses        173      180       +7
```
Continue to review full report at Codecov.
file = CSV.File(file; drop=(i, nm) -> i == 2)
```

### [Limiting number of rows from data](@id limit_example)
Ah, I had this comment above. I would move this section there, as it makes for a more natural flow.
docs/src/examples.md
Outdated
```julia
using CSV

# This file contains a 3x3 identity matrix of `Float64`. By default, parsing will detect the delimiter and type, but we can also explicitly pass `delim=' '` and `types=Float64`, which tells parsing to treat each column as `Float64`, without having to guess the type on its own.
`ignorerepeated` should probably also be added here, as repeated delimiters typically occur in matrix data:

```julia
julia> rand(5,5)
5×5 Matrix{Float64}:
 0.885415   0.436992  0.966233  0.415127   0.595875
 0.0674271  0.586727  0.992303  0.236927   0.997092
 0.729953   0.720313  0.70169   0.922141   0.377983
 0.932993   0.257281  0.92218   0.0864256  0.0153523
 0.852068   0.591985  0.577147  0.966295   0.220887
```
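To illustrate the suggestion, here is a hedged sketch of parsing space-aligned matrix data with `ignorerepeated` (the inline data and variable names are illustrative, not from the PR):

```julia
using CSV

# Space-aligned matrix data; column alignment produces runs of repeated delimiters
data = """
1.0  0.0  0.0
0.0  1.0  0.0
0.0  0.0  1.0
"""

# `ignorerepeated=true` treats consecutive delimiters as a single delimiter,
# so the aligned columns parse correctly as three `Float64` columns
file = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true, header=false, types=Float64)
```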
CSV.RowWriter
```
That's quite a bit! Let's boil down a TL;DR:
* Just want to read a delimited file and do basic stuff with data? Use `CSV.File(file)` or `CSV.read(file, DataFrame)`
They are not equivalent, i.e. calling `DataFrame` on `CSV.File` output copies columns (which will also change their type)
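A sketch of the distinction this comment raises (the file name is hypothetical):

```julia
using CSV, DataFrames

# `CSV.read` passes the sink type directly to parsing,
# avoiding an extra copy of each column
df1 = CSV.read("data.csv", DataFrame)

# `DataFrame(CSV.File(...))` materializes the `CSV.File` first, then copies
# columns into the `DataFrame`, which may also change their concrete types
df2 = DataFrame(CSV.File("data.csv"))
```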
An argument that controls the precise type of string columns. Supported values are `InlineString` (the default), `PosLenString`, or `String`. The various string types aim to be mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:

* `InlineString`: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types on the heap. For very large files with string columns that have fairly low variance in string length, this can provide much better GC interaction than `String`. When string length has high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest `InlineString` type that fits the longest string value. For short strings, that can mean a lot of wasted space when they're promoted to a wide fixed-width type.
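A minimal sketch of the argument described above (assuming the keyword is `stringtype`, with inline data for reproducibility):

```julia
using CSV

data = """
name,code
alice,A1
bob,B2
"""

# Request plain `String` columns instead of the default `InlineString`s,
# trading GC-friendliness for a single uniform string type
file = CSV.File(IOBuffer(data); stringtype=String)
```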
maybe add a cross-reference where the user can learn more.
I am OOO for a week, with scattered access to the Internet, so I gave some quick comments. Thank you!
Glad you were able to take a look! Thank you!
84044,3.4
"""

file = CSV.File(file; typemap=Dict(Int => String))
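A self-contained sketch of the `typemap` behavior shown in the diff (the inline data is illustrative):

```julia
using CSV

data = """
zipcode,value
84044,3.4
"""

# `typemap` remaps detected types: any column that would be parsed as `Int`
# is parsed as `String` instead (useful for identifiers like zip codes)
file = CSV.File(IOBuffer(data); typemap=Dict(Int => String))
```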
Nice! I've been using CSV for a year now and never knew this functionality existed.
Super generic drive-by docs suggestion based upon my personal experience as a super newb a couple of weeks ago: I just wanted to dump a big vanilla Julia matrix of numbers to a CSV file, so I found myself reading the docs. I couldn't find a single fully reproducible example (all of the examples seemed to presume some familiarity with Tables.jl) of how to dump an `Array{Float64,2}` to a CSV file anywhere. It seemed like something that would naturally belong in a quick-start section... After skimming the Tables.jl docs and then reading a bunch of Discourse posts, someone pointed out to just use `writedlm` instead because it's simpler for this use case. Two actionable suggestions, with apologies if this is addressed in the PR (haven't had a chance to review yet):
Thanks @quinnj for all your hard work!
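The matrix-writing use case raised above can be sketched like this (a hedged example; `Tables.table` wraps a plain matrix so CSV.jl can consume it, and the output filename is hypothetical):

```julia
using CSV, Tables

m = rand(3, 3)  # a plain Array{Float64,2}

# `Tables.table` wraps the matrix as a table with generated column
# names (:Column1, :Column2, ...), which `CSV.write` accepts directly
CSV.write("matrix.csv", Tables.table(m))
```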
## [`pool`](@id pool)

Argument that controls whether columns will be returned as `PooledArray`s. Can be provided as a `Bool`, a `Float64`, a vector of `Bool` or `Float64`, or a dict mapping column number/name to `Bool` or `Float64`. As a `Bool`, it controls absolutely whether a column will be pooled or not; if passed as a single `Bool` argument like `pool=true`, then all string columns will be pooled, regardless of cardinality. When passed as a `Float64`, the value should be between `0.0` and `1.0`, indicating the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if `pool=0.1`, then all string columns with a unique value % less than 10% will be returned as `PooledArray`, while other string columns will be normal string vectors. As mentioned, when the `pool` argument is a single `Bool` or `Float64`, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a `Bool` or `Float64`. Similar to the [types](@ref types) argument, a vector provided to `pool` should have an element for each column in the data input, while a dict argument can map column number/name to `Bool` or `Float64` for specific columns. Unspecified columns will not be pooled when the argument is a dict.
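A short sketch of the `pool` forms described above (inline data for reproducibility; the column names are illustrative):

```julia
using CSV

data = """
id,category
1,a
2,b
3,a
"""

# Pool every string column regardless of cardinality
f1 = CSV.File(IOBuffer(data); pool=true)

# Pool string columns whose fraction of unique values is below 50%
f2 = CSV.File(IOBuffer(data); pool=0.5)

# Per-column control via a dict; unspecified columns are not pooled
f3 = CSV.File(IOBuffer(data); pool=Dict(:category => true))
```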
If I understand correctly, if a single value for `pool` is used, it is applied to string columns only, and otherwise, if it is a vector or dict, to any columns. Maybe highlight it more?
Also, in this description I do not see what the default is (maybe I missed it, but I cannot find it).
Ah yes, I considered this section not-quite-done. I think it'd be useful for you, @nalimilan, and me (and anyone else interested) to discuss what the default pooling behavior will be. On the current `main` branch, the default pool value is `0.25`, and it applies the same to single and multithreading. The multithreaded case has also been updated with the same behavior as single-threaded: i.e. we'll parse the entire file, and post-parsing, we'll check the cardinality of potentially pooled columns to see if they pass the threshold, then convert to a regular array or pooled array as appropriate. The use-case I'm worried about given the current state is if you have a large file that uses multiple threads to parse, and you have a very high cardinality string column (like a string id). In that case, we would do roughly double the work parsing that id column, only to discard the pooled values at the end.
An alternative I've considered is making the default `pool=false`. So by default, no surprises or potential slowdowns. If you want pooled columns, you'd have to pass a pool value or `pool=true` explicitly.
I am OK with both `pool=0.25` and `pool=false` as a default. I think the problems users have with pooled columns (like the `map` issue) should be resolved on our side.
Would it be too weird to have `pool=false` for multithreaded parsing and `pool=0.25` for single-threaded? Would we run into too many issues where the difference of array type would be weird?
@SamRodkey great suggestion! I just added a couple more examples explicitly for writing matrices; both to the high-level "start" and to the
* Fix select/drop on CSV.Rows The new column-widening functionality regressed using select/drop keyword args for `CSV.Rows`. May fix #858. * fix tests
* Add automatic support for reading gzipped csv files Also fixes #733 by using a performant method for processing gzipped files. The motivation here is to make this more "automatic" for users so they don't have to fumble around and question whether they should use GZip.jl or CodecZlib.jl or Zlib.jl or whatever. We know which is the fastest (CodecZlib.jl), and it's a mature, properly binary-built package at this point that works cross platform, so I submit that it's a non-risky dependency to take on for the benefit it provides. In terms of implementation, we take the same approach as data.table `fread` here by decompressing the gzip file to a temp file, then mmapping it. We track the temp files generated in this way and clean them up at the end of `CSV.File` (I need to check `CSV.Rows` as well), and we can also rely on having the default of `cleanup=true` for `tempname()`, which means the julia process itself will try and delete all its generated temp files when it exits. Alternative approaches involving reading chunks of a `GzipDecompressorStream` or having a new `BufferedIOVector <: AbstractVector{UInt8}` which would buffer reads from a `GzipDecompressorStream` input were both considered, but at the end of the day, CSV.jl is hard-coded in several places to work on the input as a `Vector{UInt8}` explicitly. Parsers.jl is more flexible, but it would take non-trivial work to allow CSV.jl the same flexibility. Another factor to consider here is that if we allowed many different `buf` types (other than just `Vector{UInt8}`), then we dig up the recompilation problems all over again (that we literally just finished burying with the removal of type parameters from `Parsers.Options`). That's probably the biggest factor for me personally to take the `fread` approach of using tempfiles, in addition to being very performant. 
So follow-ups before merging are:
* Check that `CSV.Rows` can clean up tempfiles
* We can be more strict in a few places of requiring `Vector{UInt8}` in CSV.jl
* Update doc examples to not need `transcode` explicitly
* add buffer_in_memory keyword arg and ensure _any_ source that is gzip will be decompressed automatically
* use mktemp instead of tempname
* make sure we hit buffer_in_memory codepath
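The automatic gzip support described in the commit above could be exercised like this (a sketch; the filename is hypothetical, and `buffer_in_memory` is taken from the follow-up list):

```julia
using CSV

# Gzipped sources are detected and decompressed automatically (via CodecZlib.jl),
# so a .csv.gz file can be passed directly; by default it is decompressed
# to a temp file and mmapped, matching data.table's `fread` approach
file = CSV.File("data.csv.gz")

# Decompress into memory instead of a temp file
file2 = CSV.File("data.csv.gz"; buffer_in_memory=true)
```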
It's been a long time coming. Changes include:

* In `index.md`, we have an opening "manual" section that discusses the high-level functionality of the package and gives an overview of each entry point (`CSV.File`, `CSV.read`, `CSV.write`, etc.)
* Keyword arguments are documented for `CSV.File`, with notes of compat w/ `CSV.Rows`/`CSV.Chunks`. Almost all the keyword argument sections also have links to examples in the `examples.md` file
* The `examples.md` file has been restructured to be easily linkable from the `index.md` file, and includes more in-depth comments inline for each example, which should be almost entirely copy-pastable to a local REPL.

I've tried to dump as much information as hopefully is useful for users between the keyword argument sections and corresponding examples.
If there are pieces that aren't clear, please comment and let's get it cleaned up! This is going to be the basis for the 1.0 docs, so we'd like them to be solid, stable, and very helpful. Specifically, some of the trickiest pieces to clarify are the interactions between keyword arguments (like how `pool` might affect `types`, for one example); if you know or can think of cases where it's unclear how one keyword argument interacts with another, just chime in and we'll get it sorted out. Thanks for any and all review!