This repo is archived and the code has moved to Arrow CLI Tools.
Convert CSV files to Apache Parquet. You may also be interested in json2parquet, csv2arrow, or json2arrow.
You can get the latest releases from https://github.com/domoritz/csv2parquet/releases/.
cargo install csv2parquet
Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>
Arguments:
<CSV> Input CSV file
<PARQUET> Output file
Options:
-s, --schema-file <SCHEMA_FILE>
File with Arrow schema in JSON format
--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
--header <HEADER>
Set whether the CSV file has headers [possible values: true, false]
-d, --delimiter <DELIMITER>
Set the CSV file's column delimiter as a byte character [default: ,]
-c, --compression <COMPRESSION>
Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd]
-e, --encoding <ENCODING>
Sets encoding for any column [possible values: plain, rle, bit-packed, delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary]
--data-pagesize-limit <DATA_PAGESIZE_LIMIT>
Sets data page size limit
--dictionary-pagesize-limit <DICTIONARY_PAGESIZE_LIMIT>
Sets dictionary page size limit
--write-batch-size <WRITE_BATCH_SIZE>
Sets write batch size
--max-row-group-size <MAX_ROW_GROUP_SIZE>
Sets max size for a row group
--created-by <CREATED_BY>
Sets "created by" property
--dictionary
Sets flag to enable/disable dictionary encoding for any column
--statistics <STATISTICS>
Sets flag to enable/disable statistics for any column [possible values: none, chunk, page]
--max-statistics-size <MAX_STATISTICS_SIZE>
Sets max statistics size for any column. Applicable only if statistics are enabled
-p, --print-schema
Print the schema to stderr
-n, --dry
Only print the schema
-h, --help
Print help information
-V, --version
Print version information
The --schema-file option uses the same file format as --dry and --print-schema.
csv2parquet data.csv data.parquet
csv2parquet --header false <CSV> <PARQUET>
csv2parquet --header true --dry <CSV> <PARQUET>
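When the tool is called from a script, the invocations above can be assembled programmatically. The following is a minimal Python sketch, not part of csv2parquet itself: `build_command` is a hypothetical helper, and it only uses options listed in the help text above (`--header`, `--schema-file`, `--dry`).

```python
import shlex


def build_command(csv_path, parquet_path, header=None, schema_file=None, dry=False):
    """Assemble a csv2parquet argument list (hypothetical helper, not part of the tool)."""
    cmd = ["csv2parquet"]
    if header is not None:
        # --header takes an explicit value: true or false
        cmd += ["--header", "true" if header else "false"]
    if schema_file is not None:
        cmd += ["--schema-file", schema_file]
    if dry:
        # --dry only prints the schema
        cmd.append("--dry")
    # Positional arguments come last: <CSV> then <PARQUET>
    cmd += [csv_path, parquet_path]
    return cmd


cmd = build_command("data.csv", "data.parquet", header=True, dry=True)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming `csv2parquet` is installed and on the `PATH`.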
Below is an example of the schema file content:
{
"fields": [
{
"name": "col1",
"data_type": "Utf8",
"nullable": false,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
},
{
"name": "col2",
"data_type": "Utf8",
"nullable": false,
"dict_id": 0,
"dict_is_ordered": false,
"metadata": {}
}
],
"metadata": {}
}
Then pass the schema file schema.json with --schema-file in the command:
csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
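For CSV files with many columns, writing the schema file by hand gets tedious. The sketch below generates one with every column typed as `Utf8`. It assumes the field layout shown in the example above is what the tool accepts; that layout may differ between versions. `utf8_schema` is a hypothetical helper name.

```python
import json


def utf8_schema(column_names, nullable=False):
    """Build a schema dict with one Utf8 field per column (layout assumed from the example)."""
    fields = [
        {
            "name": name,
            "data_type": "Utf8",
            "nullable": nullable,
            "dict_id": 0,
            "dict_is_ordered": False,
            "metadata": {},
        }
        for name in column_names
    ]
    return {"fields": fields, "metadata": {}}


schema = utf8_schema(["col1", "col2"])
# Print the JSON; redirect the output to a file such as schema.json
print(json.dumps(schema, indent=2))
```

Saving the printed JSON as schema.json gives a file that can be passed to `--schema-file` as in the command above.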