
Improve accuracy of fst_table documentation regarding random row access #143

Closed
martinblostein opened this issue Mar 25, 2018 · 2 comments

Comments

@martinblostein
Contributor

martinblostein commented Mar 25, 2018

The documentation of fst::fst says:

When data is accessed, only the requested subset is read from file. This is possible because the fst file format allows full random access (in columns and rows) to the stored dataset.

However, it is not the case that only the requested subset of rows is read from file. What actually happens is that all rows between the first and last requested rows are read into R, and the rest are then discarded.

It is important for users to know this implementation detail: if they need a small number of sparsely distributed rows from a large fst file, it is substantially more efficient to query the rows individually rather than in a single query. It also affects how a user ought to sort their tables.

As a worst-case example, consider loading just the first and last rows of a 10-million-row table:

N <- 1e7L
x <- data.frame(a = 1:N, b = 'b', c = 'c', d = 'd')
tmp <- tempfile()
fst::write_fst(x, tmp)

ft <- fst::fst(tmp)

microbenchmark::microbenchmark(two_queries = dt1 <- rbind(ft[1,], ft[N,]),
                               one_query = dt2 <- ft[c(1, N),])

# Unit: milliseconds
#        expr        min         lq       mean     median         uq        max neval
# two_queries   1.036671   1.341872   1.440764   1.399377   1.666008   1.960236   100
#   one_query 103.749060 111.464089 124.397109 117.907903 147.771323 159.217269   100

Of course, the best solution would be to alter the implementation so that this distinction is not important, but I imagine this is not so easy. (Could the minimal set of blocks be read in and then subsetted?)

@MarcusKlik
Collaborator

MarcusKlik commented Mar 25, 2018

Hi @martinblostein, thanks for submitting your issue! Indeed, the current implementation reads all blocks in the requested range and then discards the unneeded ones, which is far from ideal.

The documentation you are referring to was written for read_fst(), where the range is specified with lower and upper bounds. It doesn't hold for the data.frame interface, so, as you say, that should be explained more clearly to the user.

A better, faster implementation of row subsetting is definitely required. The plan is to first build a map of the blocks needed to fulfill the request. At the same time, a second (inverse) map can be generated that links the elements of a given block to their positions in the output vector. Although this can be done fairly quickly (in C++), it often means we are just scanning a vector 1:N. Because it is not known in advance that such a vector covers a complete range, each element has to be inspected. When only the lower and upper bounds of the range are specified, this scan can be skipped altogether (much faster).

So allowing the user to specify a custom vector for row indexing (as with []) means the file read will be slower because of the required scan. If users have more knowledge (e.g. they already know the relevant ranges), that knowledge can be traded for speed by manually splitting the range.
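As a rough illustration of the two-map idea described above (a hypothetical sketch, not fst's actual internals; the 4096-row block size is taken from the toy code below), the first map marks which blocks must be read, and the inverse map records, for each output position, the block and offset the value comes from:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Block size used in the toy scanning code below (4096 rows per block)
constexpr int BLOCK_SIZE = 4096;

// For a 0-based row selection, return:
//  - needed[b]  == true if block b must be read from file
//  - inverse[i] == {block, offset within block} for output position i
std::pair<std::vector<bool>, std::vector<std::pair<int, int>>>
build_block_maps(const std::vector<int64_t>& rows, int64_t nrow) {
  std::vector<bool> needed(1 + nrow / BLOCK_SIZE, false);
  std::vector<std::pair<int, int>> inverse;
  inverse.reserve(rows.size());

  for (int64_t r : rows) {
    int block = static_cast<int>(r / BLOCK_SIZE);
    needed[block] = true;  // this block has to be decompressed/read
    inverse.emplace_back(block, static_cast<int>(r % BLOCK_SIZE));
  }
  return {needed, inverse};
}
```

With these two maps, only the marked blocks need to be read; the inverse map then places each block element directly at its output position.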

Some quick benchmarks:

# toy method to count, for each 4096-row block, how many elements of
# the specified row vector fall into that block
Rcpp::cppFunction("
SEXP get_blocks(SEXP row_index, int vec_length) {

  SEXP selected_blocks_vec = Rf_allocVector(INTSXP, 1 + vec_length / 4096);
  PROTECT(selected_blocks_vec);

  int* selected_blocks = INTEGER(selected_blocks_vec);

  // Rf_allocVector() does not zero the memory, so clear the counts first
  for (int b = 0; b < LENGTH(selected_blocks_vec); b++) {
    selected_blocks[b] = 0;
  }

  int selection_length = LENGTH(row_index);
  int* selection_index = INTEGER(row_index);

  for (int i = 0; i < selection_length; i++) {
    // R row numbers are 1-based: row r belongs to block (r - 1) / 4096
    selected_blocks[(selection_index[i] - 1) / 4096]++;
  }

  UNPROTECT(1);

  return selected_blocks_vec;
}
")

timing1 <- microbenchmark::microbenchmark(
  get_blocks(1L:1e7L, 1e7L)
)

# effective scan speed in GB/s, relative to the full 4e7-byte
# (1e7 x 4-byte integer) index vector; timings are in nanoseconds
4e7 / median(timing1$time)
#> [1] 0.4496849

timing2 <- microbenchmark::microbenchmark(
  get_blocks(1L:1e6L, 1e7L)
)

# effective speed in GB/s (relative to the full 1e7-element vector)
4e7 / median(timing2$time)
#> [1] 11.08621

random_selection <- sample(1L:1e7L, 1e6)
timing3 <- microbenchmark::microbenchmark(
  get_blocks(random_selection, 1e7L)
)

# effective speed in GB/s (relative to the full 1e7-element vector)
4e7 / median(timing3$time)
#> [1] 27.12827

So the single-threaded speed at which a 1:N vector can be scanned for relevant blocks is about 0.45 GB/s in this implementation. For a row selection covering 10 percent of the values, the effective speed (relative to the full vector) is much higher, at more than 10 GB/s, because only 10 percent of the rows are actually scanned.

Creating a multi-threaded version can bring the speed up to multiple GB/s for a full selection, so that will probably be acceptable speed-wise!
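One possible way to parallelise the scan (a sketch under my own assumptions, not the eventual fst implementation): split the row vector across threads, let each thread mark blocks in a private bitmap so no locking is needed, and merge the bitmaps with a logical OR afterwards:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int BLOCK_SIZE = 4096;

// Mark needed blocks using `nthreads` workers, each scanning a slice of `rows`
// (rows are 0-based here). Returns a bitmap with one entry per block.
std::vector<char> scan_blocks_parallel(const std::vector<int64_t>& rows,
                                       int64_t nrow, int nthreads) {
  const size_t nblocks = 1 + nrow / BLOCK_SIZE;
  // one private bitmap per thread: no synchronization needed while scanning
  std::vector<std::vector<char>> local(nthreads, std::vector<char>(nblocks, 0));
  std::vector<std::thread> workers;
  const size_t chunk = (rows.size() + nthreads - 1) / nthreads;

  for (int t = 0; t < nthreads; ++t) {
    workers.emplace_back([&, t] {
      size_t lo = t * chunk;
      size_t hi = std::min(rows.size(), lo + chunk);
      for (size_t i = lo; i < hi; ++i) {
        local[t][rows[i] / BLOCK_SIZE] = 1;  // mark block as needed
      }
    });
  }
  for (auto& w : workers) w.join();

  // merge the per-thread bitmaps (logical OR)
  std::vector<char> needed(nblocks, 0);
  for (int t = 0; t < nthreads; ++t) {
    for (size_t b = 0; b < nblocks; ++b) {
      needed[b] |= local[t][b];
    }
  }
  return needed;
}
```

Since each thread only writes to its own bitmap, the scan itself is embarrassingly parallel; the merge cost is proportional to the number of blocks, which is small (nrow / 4096).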

@MarcusKlik MarcusKlik self-assigned this Mar 25, 2018
@MarcusKlik MarcusKlik added this to the fst v0.8.6 milestone Mar 25, 2018
@MarcusKlik
Collaborator

Hi @martinblostein, I've changed the documentation for the fst() method to reflect that, currently, row selection is not true random access but reads a contiguous range of rows.

Thanks again for pointing this out!
