try arrow/feather format for speeding up data load #105
Code to create feather files for all our csv currently in use:

```r
library(arrow)
library(data.table)
setDTthreads(0L)  # use all available threads

# groupby (G1) and join (J1) dataset names currently in use
g1 = sprintf("G1_%s_%s_0_%s", rep(c("1e7","1e8","1e9"), each=4L), rep(c("1e2","1e1","2e0","1e2"), times=3L), rep(c("0","0","0","1"), times=3L))
j1 = sprintf("J1_%s_%s_0_0", rep(c("1e7","1e8","1e9"), each=4L), trimws(gsub("e+0","e", format(c(sapply(c(1e7,1e8,1e9), `/`, c(NA,1e6,1e3,1e0))), digits=1), fixed=TRUE)))

# read each csv and write it back out as a feather file
csv2fea = function(dn) {
  cat("csv2fea:", dn, "\n")
  df = fread(sprintf("data/%s.csv", dn), showProgress=FALSE, stringsAsFactors=TRUE, data.table=FALSE)
  arrow::write_feather(df, sprintf("data/%s.fea", dn))
  rm(df)
  TRUE
}
sapply(c(g1, j1), csv2fea)
```
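For the read path, the corresponding loader would look roughly like this (a minimal sketch; `fea2dt` is a hypothetical helper name, and `dn` is a dataset name as above):

```r
library(arrow)
library(data.table)

# read a feather file back and convert it to a data.table for the benchmark
fea2dt = function(dn) {
  df = arrow::read_feather(sprintf("data/%s.fea", dn))
  setDT(df)[]
}
```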
This change is for now blocked by arrow's capacity error "array cannot contain more than 2147483646 bytes" described in apache/arrow#8732 (it is a limitation in the R arrow package which is planned to be addressed in 3.0.0). Moreover, there are some issues among different software about reading arrow files (for example in cudf and ClickHouse).
Code related to that change will stay in the https://github.com/h2oai/db-benchmark/tree/arrow branch for now.
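Until that fix lands, one conceivable workaround (an untested sketch, not what the benchmark uses; `write_feather_chunked` is a hypothetical helper) is to split an oversized data.frame into row chunks and write each chunk to its own feather file, keeping every individual array under the 2 GB limit:

```r
library(arrow)

# hypothetical helper: write df in row chunks so that no single column
# array exceeds arrow's 2147483646-byte capacity limit
write_feather_chunked = function(df, path_fmt, chunk_rows = 1e7L) {
  n = nrow(df)
  starts = seq(1L, n, by = chunk_rows)
  for (i in seq_along(starts)) {
    idx = starts[i]:min(starts[i] + chunk_rows - 1L, n)
    arrow::write_feather(df[idx, , drop = FALSE], sprintf(path_fmt, i))
  }
  invisible(length(starts))
}

# usage: write_feather_chunked(df, "data/G1_1e9_1e2_0_0_part%02d.fea")
```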
Have you considered using Parquet files instead of Feather? Parquet files are generally smaller on disk and also very fast to load. See https://ursalabs.org/blog/2020-feather-v2/ and https://arrow.apache.org/faq/ for some discussion of the tradeoffs. (Arrow and Parquet are closely related, and the C++/Python/R implementations of Parquet are part of the Arrow libraries.) I'm not sure about the cudf and clickhouse issues you note above, but Dask and Spark can read Parquet files directly. Regarding the Arrow R package issue you reported, Parquet doesn't help that because the data goes through the Arrow format to get to Parquet.
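For reference, the Parquet round trip in the R arrow package is symmetrical to the feather one (a minimal sketch; the data and file path are illustrative):

```r
library(arrow)

df = data.frame(id = 1:5, v = runif(5))
arrow::write_parquet(df, "data/example.parquet")   # typically smaller on disk than csv
df2 = arrow::read_parquet("data/example.parquet")  # returns a data.frame
```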
@nealrichardson Thank you for your comment and useful links. I did use Parquet before. Bigger Parquet files (corresponding to 50GB csv) written from Python could not be read in Spark, so in that sense they were not well portable. Portability is a plus because I can use the same file with multiple different tools. Although Spark doesn't seem to read Arrow directly, the Arrow format is meant to be a "cross platform" exchange format, so in theory Arrow is the way to go for my use case.
There is not much to do in this issue, at least until R arrow 3.0.0 is released. For details see apache/arrow#8732.
Unfortunately it is not possible to read 50GB of csv data on a 128GB memory machine due to the memory required by Python. Therefore I cannot generate arrow files with Python for now.
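If the csv cannot fit in memory, recent releases of the R arrow package can stream a csv dataset into the feather/IPC format without materializing it in RAM. A sketch assuming a recent arrow version (`open_dataset` with csv input and `write_dataset` were not available when this issue was filed, and the output is a directory of feather files rather than a single file):

```r
library(arrow)

# scan the csv lazily instead of loading it whole, then write it out
# as a feather dataset; requires a recent arrow release
ds = open_dataset("data/G1_1e9_1e2_0_0.csv", format = "csv")
write_dataset(ds, "data/G1_1e9_1e2_0_0_fea", format = "feather")
```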
R arrow package is now on CRAN. When using feather in the past we were getting segfaults. In R it didn't even work for 1e7 rows; in Python, for 1e9.
We should try the new arrow format, which might speed up reading data.
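A quick way to check the expected speed-up once the files exist (an illustrative sketch; the dataset name is an example and timings will vary by machine and data):

```r
library(arrow)
library(data.table)

dn = "G1_1e7_1e2_0_0"  # example dataset name
system.time(fread(sprintf("data/%s.csv", dn)))                # baseline: csv
system.time(arrow::read_feather(sprintf("data/%s.fea", dn)))  # feather
```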
related issues: #47 #99