
try arrow/feather format for speeding up data load #105

Closed
jangorecki opened this issue Oct 9, 2019 · 8 comments

@jangorecki
Contributor

jangorecki commented Oct 9, 2019

The R arrow package is now on CRAN.
When we used feather in the past, we were getting segfaults. In R it didn't even work for 1e7 rows; in Python, for 1e9.
We should try the new arrow format; it might speed up reading data.
related issues: #47 #99
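
For reference, a minimal sketch of what the load comparison could look like (not from the original comment; the file paths and dataset name are illustrative only):

library(arrow)
library(data.table)
setDTthreads(0L)
# hypothetical paths for one benchmark dataset, for illustration only
src_csv = "data/G1_1e7_1e2_0_0.csv"
src_fea = "data/G1_1e7_1e2_0_0.fea"
system.time(x <- fread(src_csv, stringsAsFactors=TRUE))   # current csv load path
system.time(y <- arrow::read_feather(src_fea))            # candidate arrow/feather load path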

@jangorecki jangorecki changed the title from "try arrow format for speeding in data load" to "try arrow format for speeding up data load" on Oct 9, 2019
@jangorecki jangorecki changed the title from "try arrow format for speeding up data load" to "try arrow/feather format for speeding up data load" on Nov 15, 2020
@jangorecki jangorecki added this to the 2.1.0 milestone Nov 15, 2020
@jangorecki
Contributor Author

Code to create feather files for all the csv files currently in use:

library(arrow)
library(data.table)
setDTthreads(0L)                       # let fread use all available threads
# names of the groupby (G1) datasets currently in use
g1 = sprintf("G1_%s_%s_0_%s", rep(c("1e7","1e8","1e9"), each=4L), rep(c("1e2","1e1","2e0","1e2"), times=3L), rep(c("0","0","0","1"), times=3L))
# names of the join (J1) datasets currently in use
j1 = sprintf("J1_%s_%s_0_0", rep(c("1e7","1e8","1e9"), each=4L), trimws(gsub("e+0","e", format(c(sapply(c(1e7,1e8,1e9), `/`, c(NA,1e6,1e3,1e0))), digits=1), fixed=TRUE)))
# read one csv and write it back as a feather (Arrow IPC) file
csv2fea = function(dn) {
  cat("csv2fea:", dn, "\n")
  df = fread(sprintf("data/%s.csv", dn), showProgress=FALSE, stringsAsFactors=TRUE, data.table=FALSE)
  arrow::write_feather(df, sprintf("data/%s.fea", dn))
  rm(df)
  TRUE
}
sapply(c(g1, j1), csv2fea)

@jangorecki
Contributor Author

jangorecki commented Nov 21, 2020

This change is blocked for now by arrow's capacity error "array cannot contain more than 2147483646 bytes", described in apache/arrow#8732 (it is a limitation in the R arrow package which is planned to be addressed in 3.0.0).
The error happens when trying to write the G1_1e9_2e0_0_0 dataset.

Moreover, there are some issues in other software, e.g. cuDF and ClickHouse, with reading arrow files.

Code related to that change will stay in the https://github.com/h2oai/db-benchmark/tree/arrow branch for now.

@jangorecki jangorecki removed this from the 2.1.0 milestone Nov 21, 2020
@nealrichardson

Have you considered using Parquet files instead of Feather? Parquet files are generally smaller on disk and also very fast to load. See https://ursalabs.org/blog/2020-feather-v2/ and https://arrow.apache.org/faq/ for some discussion of the tradeoffs. (Arrow and Parquet are closely related, and the C++/Python/R implementations of Parquet are part of the Arrow libraries.)

I'm not sure about the cudf and clickhouse issues you note above, but Dask and Spark can read Parquet files directly. Regarding the Arrow R package issue you reported, Parquet doesn't help that because the data goes through the Arrow format to get to Parquet.
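
For context, a minimal sketch of the single-file Parquet route in R arrow suggested above (not part of the original comment; paths and dataset name are illustrative only):

library(arrow)
library(data.table)
# hypothetical paths, for illustration only
df = fread("data/G1_1e7_1e2_0_0.csv", stringsAsFactors=TRUE, data.table=FALSE)
arrow::write_parquet(df, "data/G1_1e7_1e2_0_0.parquet")   # one dataset = one parquet file
pq = arrow::read_parquet("data/G1_1e7_1e2_0_0.parquet")   # loads back as a data.frame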

@jangorecki
Contributor Author

@nealrichardson Thank you for your comment and the useful links. I did use parquet before. Bigger parquet files (corresponding to the 50GB csv) written from Python could not be read in Spark, so in that sense they were not very portable. Portability is a plus because I can use the same file with multiple different tools. Although Spark doesn't seem to read Arrow directly, the Arrow format is meant to be a "cross platform" exchange format, so in theory Arrow is the way to go for my use case.
Loading speed is not a big concern, as long as it is faster than reading csv.
Moreover, I have many datasets, and it is easier to maintain them with a 1 dataset = 1 file structure rather than with directories like parquet.
Aside from having data files to read into memory, there is another aspect: being able to mmap files and compute out-of-memory queries. Arrow seems to fit well here. I had parquet out-of-memory computation working in Dask, but in the recent version 2.30.0 it no longer computes even a single query.
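
As an illustration of the out-of-memory angle, a minimal sketch using the arrow dataset API (assuming an arrow release where open_dataset() supports the feather/IPC format; the path and column names follow the benchmark's G1 schema but are illustrative here):

library(arrow)
library(dplyr)
ds = arrow::open_dataset("data/G1_1e9_1e2_0_0.fea", format = "feather")  # file is not read into memory here
ds %>%
  filter(id1 == "id001") %>%   # filter is evaluated during the scan
  select(id1, v1) %>%
  collect()                    # only the matching rows are materialized in memory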

@jangorecki
Contributor Author

jangorecki commented Dec 7, 2020

There is not much to do in this issue, at least until R arrow 3.0.0 is released. For details see apache/arrow#8732.
As for the suggested alternative:

You could also generate the data from Python (pyarrow) and I believe it would handle the chunking correctly.

Unfortunately it is not possible to read the 50GB csv data on a 128GB memory machine due to the memory required by Python, therefore I cannot generate arrow files with pyarrow for the 1e9 data size cases.
Note that this is the second attempt at using arrow/feather files. The previous attempt (feather v1) failed due to segfaults.

@jangorecki jangorecki added this to the 2.1.0 milestone Dec 7, 2020