
try arrow/feather format for speeding up data load #105

Closed
jangorecki opened this issue Oct 9, 2019 · 8 comments

@jangorecki
Contributor

jangorecki commented Oct 9, 2019

The R arrow package is now on CRAN.
When we used feather in the past, we were getting segfaults. In R it didn't even work for 1e7 rows; in Python, for 1e9.
We should try the new arrow format; it might speed up reading data.
related issues: #47 #99
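
For reference, a minimal sketch of what the load comparison could look like (not from the original comment; the file paths and dataset name are illustrative only):

library(arrow)
library(data.table)
setDTthreads(0L)
# hypothetical paths for one benchmark dataset, for illustration only
src_csv = "data/G1_1e7_1e2_0_0.csv"
src_fea = "data/G1_1e7_1e2_0_0.fea"
system.time(x <- fread(src_csv, stringsAsFactors=TRUE))   # current csv load path
system.time(y <- arrow::read_feather(src_fea))            # candidate arrow/feather load path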

@jangorecki jangorecki changed the title from "try arrow format for speeding in data load" to "try arrow format for speeding up data load" on Oct 9, 2019
@jangorecki jangorecki changed the title from "try arrow format for speeding up data load" to "try arrow/feather format for speeding up data load" on Nov 15, 2020
@jangorecki jangorecki added this to the 2.1.0 milestone Nov 15, 2020
@jangorecki
Contributor Author

Code to create feather files for all the csv files currently in use:

library(arrow)
library(data.table)
setDTthreads(0L)                       # let fread use all available threads
# names of the groupby (G1) datasets currently in use
g1 = sprintf("G1_%s_%s_0_%s", rep(c("1e7","1e8","1e9"), each=4L), rep(c("1e2","1e1","2e0","1e2"), times=3L), rep(c("0","0","0","1"), times=3L))
# names of the join (J1) datasets currently in use
j1 = sprintf("J1_%s_%s_0_0", rep(c("1e7","1e8","1e9"), each=4L), trimws(gsub("e+0","e", format(c(sapply(c(1e7,1e8,1e9), `/`, c(NA,1e6,1e3,1e0))), digits=1), fixed=TRUE)))
# read one csv and write it back as a feather (Arrow IPC) file
csv2fea = function(dn) {
  cat("csv2fea:", dn, "\n")
  df = fread(sprintf("data/%s.csv", dn), showProgress=FALSE, stringsAsFactors=TRUE, data.table=FALSE)
  arrow::write_feather(df, sprintf("data/%s.fea", dn))
  rm(df)
  TRUE
}
sapply(c(g1, j1), csv2fea)

@jangorecki
Contributor Author

jangorecki commented Nov 21, 2020

This change is blocked for now by arrow's capacity error "array cannot contain more than 2147483646 bytes", described in apache/arrow#8732 (it is a limitation in the R arrow package which is planned to be addressed in 3.0.0).
The error happens when trying to write the G1_1e9_2e0_0_0 dataset.

Moreover, there are some issues in other software, e.g. cuDF and ClickHouse, with reading arrow files.

Code related to that change will stay in the https://github.com/h2oai/db-benchmark/tree/arrow branch for now.

@jangorecki jangorecki removed this from the 2.1.0 milestone Nov 21, 2020
@nealrichardson

Have you considered using Parquet files instead of Feather? Parquet files are generally smaller on disk and also very fast to load. See https://ursalabs.org/blog/2020-feather-v2/ and https://arrow.apache.org/faq/ for some discussion of the tradeoffs. (Arrow and Parquet are closely related, and the C++/Python/R implementations of Parquet are part of the Arrow libraries.)

I'm not sure about the cudf and clickhouse issues you note above, but Dask and Spark can read Parquet files directly. Regarding the Arrow R package issue you reported, Parquet doesn't help that because the data goes through the Arrow format to get to Parquet.
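
For context, a minimal sketch of the single-file Parquet route in R arrow suggested above (not part of the original comment; paths and dataset name are illustrative only):

library(arrow)
library(data.table)
# hypothetical paths, for illustration only
df = fread("data/G1_1e7_1e2_0_0.csv", stringsAsFactors=TRUE, data.table=FALSE)
arrow::write_parquet(df, "data/G1_1e7_1e2_0_0.parquet")   # one dataset = one parquet file
pq = arrow::read_parquet("data/G1_1e7_1e2_0_0.parquet")   # loads back as a data.frame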

@jangorecki
Contributor Author

@nealrichardson Thank you for your comment and the useful links. I did use parquet before. Bigger parquet files (corresponding to the 50GB csv) written from Python could not be read in Spark, so in that sense they were not very portable. Portability is a plus because I can use the same file with multiple different tools. Although Spark doesn't seem to read Arrow directly, the Arrow format is meant to be a "cross platform" exchange format, so in theory Arrow is the way to go for my use case.
Loading speed is not a big concern, as long as it is faster than reading csv.
Moreover, I have many datasets, and it is easier to maintain them with a 1 dataset = 1 file structure rather than with directories like parquet.
Aside from having data files to read into memory, there is another aspect: being able to mmap files and compute out-of-memory queries. Arrow seems to fit well here. I had parquet out-of-memory computation working in Dask, but in the recent version 2.30.0 it no longer computes even a single query.
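
As an illustration of the out-of-memory angle, a minimal sketch using the arrow dataset API (assuming an arrow release where open_dataset() supports the feather/IPC format; the path and column names follow the benchmark's G1 schema but are illustrative here):

library(arrow)
library(dplyr)
ds = arrow::open_dataset("data/G1_1e9_1e2_0_0.fea", format = "feather")  # file is not read into memory here
ds %>%
  filter(id1 == "id001") %>%   # filter is evaluated during the scan
  select(id1, v1) %>%
  collect()                    # only the matching rows are materialized in memory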

@jangorecki
Contributor Author

jangorecki commented Dec 7, 2020

There is not much to do in this issue, at least until R arrow 3.0.0 is released. For details see apache/arrow#8732.
As for the suggested alternative:

You could also generate the data from Python (pyarrow) and I believe it would handle the chunking correctly.

Unfortunately it is not possible to read the 50GB csv data on a 128GB memory machine due to the memory required by Python, therefore I cannot generate arrow files with pyarrow for the 1e9 data size cases.
Note that this is the second attempt at using arrow/feather files. The previous attempt (feather v1) failed due to segfaults.

@jangorecki jangorecki added this to the 2.1.0 milestone Dec 7, 2020