Can `read_fst` use a filter condition beforehand? #274

hope-data-science · 2023-02-21T08:04:53Z

The from and to parameters in read_fst is great. I wonder if we can use a filter to select only the rows we needed. If this could be done on disk, it would be wonderful and make fst become a light and convenient database system, which would be great. I think this feature could boost fst to a new level. Any ideas to promote it? (I don't know the underlying tools, so I have no idea whether it would be easy to implemented or not. I just imagine that if we can just use the filter columns to retrieve the needed row ids and extract them using from and to might be efficient)

Thanks.

The text was updated successfully, but these errors were encountered:

MarcusKlik · 2023-12-01T09:37:32Z

Hi @hope-data-science, thanks for sharing your ideas!

And yes, you are very correct, the ability to apply a row-filter (and an ordering) during file reads is definitely the next step needed for fst.

If we have filtering and ordering we could do:

library(dplyr)
library(fst)

nr_of_rows <- 1000

# generate sample fst file
tibble(
  A = 1:nr_of_rows,
  B = LETTERS[sample(1:26, nr_of_rows, replace = TRUE)]
) |>
  write_fst("x.fst")


# filter and order during read
fst("x.fst") |>
  filter(A %% 2 == 0) |>
  arrange(B)

In the last steps, the following would happen:

first, the filter is calculated and stored in the fst_table object. for simple filters, the actual logical vector result might be calculated on the fly so that the columns needed for filtering don't have to be read into RAM, greatly reducing RAM requirements for large datasets.
second, the ordering would result in a (large-) integer vector that can be used to fill the result columns during the read action. This ensures that RAM is only used for the result data and not for intermediate results (again, reducing RAM requirements)

In future implementation, this mechanism could be extended with an on-disk sorting algorithm to allow sorting of larger-than-memory datasets and keep the data on disk as long as possible. Only when collect() is called, the data would actually be fully read.

To make this happen it's probably best to add a full dplyr implementation to the fst package first (so that the code above will actually run) and then implement the required back-end code in fstlib.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can `read_fst` use a filter condition beforehand? #274

Can `read_fst` use a filter condition beforehand? #274

hope-data-science commented Feb 21, 2023

MarcusKlik commented Dec 1, 2023 •

edited

Loading

Can read_fst use a filter condition beforehand? #274

Can read_fst use a filter condition beforehand? #274

Comments

hope-data-science commented Feb 21, 2023

MarcusKlik commented Dec 1, 2023 • edited Loading

Can `read_fst` use a filter condition beforehand? #274

Can `read_fst` use a filter condition beforehand? #274

MarcusKlik commented Dec 1, 2023 •

edited

Loading