Improve accuracy of fst_table documentation regarding random row access #143
Hi @martinblostein, thanks for submitting your issue! Indeed, the current implementation reads all blocks in a range and then discards the blocks that are not needed, which is far from ideal. The documentation you are referring to was written for

A better, faster implementation for row-subsetting is definitely required. The plan is to first make a map of the blocks that are required to fulfill the request. At the same time, a second (inverse) map can be generated that links the elements of a given block to their positions in the output vector. Although this can be done fairly fast (in

So allowing the user to specify a custom vector for row-indexing (as with

Some quick benchmarks:

# toy method to determine the block for each element in the specified row vector
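Rcpp::cppFunction("
  SEXP get_blocks(SEXP row_index, int vec_length) {
    // one counter per block of 4096 elements
    int nr_of_blocks = 1 + vec_length / 4096;
    SEXP selected_blocks_vec = Rf_allocVector(INTSXP, nr_of_blocks);
    PROTECT(selected_blocks_vec);
    int* selected_blocks = INTEGER(selected_blocks_vec);

    // Rf_allocVector does not zero the memory, so clear the counters first
    for (int i = 0; i < nr_of_blocks; i++) {
      selected_blocks[i] = 0;
    }

    int selection_length = LENGTH(row_index);
    int* selection_index = INTEGER(row_index);

    // count the number of selected elements that fall in each block
    for (int i = 0; i < selection_length; i++) {
      selected_blocks[selection_index[i] / 4096]++;
    }

    UNPROTECT(1);
    return selected_blocks_vec;
  }
")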
timing1 <- microbenchmark::microbenchmark(
get_blocks(1L:1e7L, 1e7L)
)
# speed
4e7 / median(timing1$time)
#> [1] 0.4496849
timing2 <- microbenchmark::microbenchmark(
get_blocks(1L:1e6L, 1e7L)
)
# speed
4e7 / median(timing2$time)
#> [1] 11.08621
random_selection <- sample(1L:1e7L, 1e6)
timing3 <- microbenchmark::microbenchmark(
get_blocks(random_selection, 1e7L)
)
# speed
4e7 / median(timing3$time)
#> [1] 27.12827

So the single-threaded speed with which a

Creating a multi-threaded version can bring the speed up to multiple GB/s for a full selection, so that will probably be acceptable speed-wise!
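
The two maps described in the plan above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not the fst implementation: the block size of 4096 elements is borrowed from the toy benchmark, rows are taken to be 1-based as in R, and the names `kBlockSize` and `build_block_map` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Assumed block size in elements; 4096 mirrors the toy benchmark above.
constexpr std::size_t kBlockSize = 4096;

// For every block that holds at least one requested row, collect the
// output positions that have to be filled from that block (the inverse map).
// The keys of the returned map are exactly the blocks that need to be read.
// Row numbers are assumed to be 1-based, as in R.
std::map<std::size_t, std::vector<std::size_t>>
build_block_map(const std::vector<std::size_t>& rows) {
    std::map<std::size_t, std::vector<std::size_t>> block_to_outputs;
    for (std::size_t out_pos = 0; out_pos < rows.size(); ++out_pos) {
        assert(rows[out_pos] >= 1);  // guard against 0-based input
        const std::size_t block = (rows[out_pos] - 1) / kBlockSize;
        block_to_outputs[block].push_back(out_pos);
    }
    return block_to_outputs;
}
```

With such a map, a reader could visit each key once, decompress that block, and copy the selected elements straight to the listed output positions, so a block shared by several requested rows is read only once.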
Hi @martinblostein, I've changed the documentation for the

Thanks again for pointing this out!
The documentation of fst::fst says:

However, it is not the case that only the requested subset of rows is read from file. What happens is that all rows between the first and last requested row are read into R, and then the rest are discarded.
It is important for the user to know this implementation detail. If they have a large fst file from which they need a small number of sparsely distributed rows, it would be substantially more efficient to query for the rows individually, instead of in one query. It would also influence how a user ought to sort their tables.
For a worst-case example, consider loading the first and last rows of a 10 million row table:
Of course, the best solution would be to alter the implementation so that this distinction is not important, but I imagine this is not so easy. (Could the minimal set of blocks be read in and then subsetted?)
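
To put a rough number on the difference, the sketch below counts how many blocks a first-to-last range read touches versus the minimal set of blocks that actually contain a requested row. It is a back-of-the-envelope illustration only: the block size of 4096 rows is an assumption taken from the toy benchmark in this thread, and `kRowsPerBlock`, `BlockCounts` and `count_blocks` are hypothetical names, not part of fst.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Assumed number of rows per block (not taken from the fst file format).
constexpr std::size_t kRowsPerBlock = 4096;

struct BlockCounts {
    std::size_t range_read;  // blocks spanned from the first to the last requested row
    std::size_t minimal;     // distinct blocks that contain a requested row
};

// Rows are 1-based, as in R.
BlockCounts count_blocks(const std::vector<std::size_t>& rows) {
    assert(!rows.empty());
    std::set<std::size_t> needed;
    std::size_t lo = (rows[0] - 1) / kRowsPerBlock;
    std::size_t hi = lo;
    for (std::size_t row : rows) {
        assert(row >= 1);
        const std::size_t block = (row - 1) / kRowsPerBlock;
        needed.insert(block);
        lo = std::min(lo, block);
        hi = std::max(hi, block);
    }
    return {hi - lo + 1, needed.size()};
}
```

Under these assumptions, `count_blocks({1, 10000000})` reports 2442 blocks for the range read against 2 for the minimal set, which is the gap the worst-case example is meant to show.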