
Improve accuracy of fst_table documentation regarding random row access #143

Closed
martinblostein opened this issue Mar 25, 2018 · 2 comments

Comments

@martinblostein
Contributor

martinblostein commented Mar 25, 2018

The documentation of fst::fst says:

When data is accessed, only the requested subset is read from file. This is possible because the fst file format allows full random access (in columns and rows) to the stored dataset.

However, it is not the case that only the requested subset of rows is read from file. What actually happens is that all rows between the first and last requested rows are read into R, and the rest are then discarded.

It is important for users to know this implementation detail: if they need a small number of sparsely distributed rows from a large fst file, it is substantially more efficient to query the rows individually rather than in a single query. It also affects how a user ought to sort their tables.

As a worst-case example, consider loading just the first and last rows of a 10-million-row table:

N <- 1e7L
x <- data.frame(a = 1:N, b = 'b', c = 'c', d = 'd')
tmp <- tempfile()
fst::write_fst(x, tmp)

ft <- fst::fst(tmp)

microbenchmark::microbenchmark(two_queries = dt1 <- rbind(ft[1,], ft[N,]),
                               one_query = dt2 <- ft[c(1, N),])

# Unit: milliseconds
#        expr        min         lq       mean     median         uq        max neval
# two_queries   1.036671   1.341872   1.440764   1.399377   1.666008   1.960236   100
#   one_query 103.749060 111.464089 124.397109 117.907903 147.771323 159.217269   100

Of course, the best solution would be to alter the implementation so that this distinction is not important, but I imagine this is not so easy. (Could the minimal set of blocks be read in and then subsetted?)

@MarcusKlik
Collaborator

MarcusKlik commented Mar 25, 2018

Hi @martinblostein, thanks for submitting your issue! Indeed, the current implementation reads all blocks in the requested range and then discards the unneeded ones, which is far from ideal.

The documentation you are referring to was written for read_fst(), where the range is specified with lower and upper bounds. It doesn't hold for the data.frame interface, so, as you say, that should be explained more clearly to the user.

A better, faster implementation of row subsetting is definitely required. The plan is to first build a map of the blocks needed to fulfill the request. At the same time, a second (inverse) map can be generated that links the elements of a given block to their positions in the output vector. Although this can be done fairly quickly (in C++), it often means we are just scanning a vector 1:N. Because it is not known in advance that such a vector covers a complete range, each element has to be inspected. When only the lower and upper bounds of the range are specified, this scan can be skipped altogether (much faster).

So allowing the user to specify a custom vector for row indexing (as with []) means the file read will be slower because of the required scan. If users have more knowledge (e.g. they already know the relevant ranges), that knowledge can be traded for speed by manually splitting the range.
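As a rough illustration of the two-map idea described above (a hypothetical sketch, not fst's actual internals; the 4096-row block size is taken from the toy code below), the first map marks which blocks must be read, and the inverse map records, for each output position, the block and offset the value comes from:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Block size used in the toy scanning code below (4096 rows per block)
constexpr int BLOCK_SIZE = 4096;

// For a 0-based row selection, return:
//  - needed[b]  == true if block b must be read from file
//  - inverse[i] == {block, offset within block} for output position i
std::pair<std::vector<bool>, std::vector<std::pair<int, int>>>
build_block_maps(const std::vector<int64_t>& rows, int64_t nrow) {
  std::vector<bool> needed(1 + nrow / BLOCK_SIZE, false);
  std::vector<std::pair<int, int>> inverse;
  inverse.reserve(rows.size());

  for (int64_t r : rows) {
    int block = static_cast<int>(r / BLOCK_SIZE);
    needed[block] = true;  // this block has to be decompressed/read
    inverse.emplace_back(block, static_cast<int>(r % BLOCK_SIZE));
  }
  return {needed, inverse};
}
```

With these two maps, only the marked blocks need to be read; the inverse map then places each block element directly at its output position.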

Some quick benchmarks:

# toy method to count, for each 4096-row block, how many elements of
# the specified row vector fall into that block
Rcpp::cppFunction("
SEXP get_blocks(SEXP row_index, int vec_length) {

  SEXP selected_blocks_vec = Rf_allocVector(INTSXP, 1 + vec_length / 4096);
  PROTECT(selected_blocks_vec);

  int* selected_blocks = INTEGER(selected_blocks_vec);

  // Rf_allocVector() does not zero the memory, so clear the counts first
  for (int b = 0; b < LENGTH(selected_blocks_vec); b++) {
    selected_blocks[b] = 0;
  }

  int selection_length = LENGTH(row_index);
  int* selection_index = INTEGER(row_index);

  for (int i = 0; i < selection_length; i++) {
    // R row numbers are 1-based: row r belongs to block (r - 1) / 4096
    selected_blocks[(selection_index[i] - 1) / 4096]++;
  }

  UNPROTECT(1);

  return selected_blocks_vec;
}
")

timing1 <- microbenchmark::microbenchmark(
  get_blocks(1L:1e7L, 1e7L)
)

# effective scan speed in GB/s, relative to the full 4e7-byte
# (1e7 x 4-byte integer) index vector; timings are in nanoseconds
4e7 / median(timing1$time)
#> [1] 0.4496849

timing2 <- microbenchmark::microbenchmark(
  get_blocks(1L:1e6L, 1e7L)
)

# effective speed in GB/s (relative to the full 1e7-element vector)
4e7 / median(timing2$time)
#> [1] 11.08621

random_selection <- sample(1L:1e7L, 1e6)
timing3 <- microbenchmark::microbenchmark(
  get_blocks(random_selection, 1e7L)
)

# effective speed in GB/s (relative to the full 1e7-element vector)
4e7 / median(timing3$time)
#> [1] 27.12827

So the single-threaded speed at which a 1:N vector can be scanned for relevant blocks is about 0.45 GB/s in this implementation. For a row selection covering 10 percent of the values, the effective speed (relative to the full vector) is much higher, at more than 10 GB/s, because only 10 percent of the rows are actually scanned.

Creating a multi-threaded version can bring the speed up to multiple GB/s for a full selection, so that will probably be acceptable speed-wise!
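One possible way to parallelise the scan (a sketch under my own assumptions, not the eventual fst implementation): split the row vector across threads, let each thread mark blocks in a private bitmap so no locking is needed, and merge the bitmaps with a logical OR afterwards:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int BLOCK_SIZE = 4096;

// Mark needed blocks using `nthreads` workers, each scanning a slice of `rows`
// (rows are 0-based here). Returns a bitmap with one entry per block.
std::vector<char> scan_blocks_parallel(const std::vector<int64_t>& rows,
                                       int64_t nrow, int nthreads) {
  const size_t nblocks = 1 + nrow / BLOCK_SIZE;
  // one private bitmap per thread: no synchronization needed while scanning
  std::vector<std::vector<char>> local(nthreads, std::vector<char>(nblocks, 0));
  std::vector<std::thread> workers;
  const size_t chunk = (rows.size() + nthreads - 1) / nthreads;

  for (int t = 0; t < nthreads; ++t) {
    workers.emplace_back([&, t] {
      size_t lo = t * chunk;
      size_t hi = std::min(rows.size(), lo + chunk);
      for (size_t i = lo; i < hi; ++i) {
        local[t][rows[i] / BLOCK_SIZE] = 1;  // mark block as needed
      }
    });
  }
  for (auto& w : workers) w.join();

  // merge the per-thread bitmaps (logical OR)
  std::vector<char> needed(nblocks, 0);
  for (int t = 0; t < nthreads; ++t) {
    for (size_t b = 0; b < nblocks; ++b) {
      needed[b] |= local[t][b];
    }
  }
  return needed;
}
```

Since each thread only writes to its own bitmap, the scan itself is embarrassingly parallel; the merge cost is proportional to the number of blocks, which is small (nrow / 4096).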

@MarcusKlik MarcusKlik self-assigned this Mar 25, 2018
@MarcusKlik MarcusKlik added this to the fst v0.8.6 milestone Mar 25, 2018
@MarcusKlik
Collaborator

Hi @martinblostein, I've changed the documentation for the fst() method to reflect that, currently, row selection is not true random access but reads a contiguous range of rows.

Thanks again for pointing this out!
