You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When dealing with multivalued column but sparse (most document have 0 values), tantivy is currently very inefficient.
We need to build an start_offsets index that is a column as long as the number of documents.
In particular, in dynamic mode, if a user is filling quickwit with documents with a dynamic key. {"cart": {user_123123: ["item1", "item30"]}
We could end up creating a lot of multivalued columns.
The phenomenon could be stress full on indexing.
On merge, we would have to extend these start_offsets, and sometime lift an efficient, sparse optional column to a multivalue.
This problem has been observed on Airmail.
Possible solution
One possible solution would be to shield the multivalued column by an optional column marking the rows that have at least one document.
That solution is reasonable to implement and has being explored as a duct tape hack in #2429.
Main cons are:
multivalued columns will get more inefficient as every single call will have to go through the optional index access.
when facing dense multivalued columns, the index will get 1 bit per document larger per such column.
We can add FULL blocks or BIZARRO blocks to the optional index to fix this, but the extra branching could have some unexpected effected on scalar operations microoptimizations.
The text was updated successfully, but these errors were encountered:
The PR #2439 adds the optional index to the multivalue index based on #2429.
Extending the optional index with FULL blocks or BIZARRO blocks (store docis without a value) should be explored seperately. It may also make sense to have an encoding for the threshold between SPARSE and DENSE (5_120 elements in e u16::MAX block), when DENSE is relatively costly and SPARSE is relatively slow with the binary_search.
When dealing with multivalued column but sparse (most document have 0 values), tantivy is currently very inefficient.
We need to build an
start_offsets
index that is a column as long as the number of documents.In particular, in dynamic mode, if a user is filling quickwit with documents with a dynamic key.
{"cart": {user_123123: ["item1", "item30"]}
We could end up creating a lot of multivalued columns.
The phenomenon could be stress full on indexing.
On merge, we would have to extend these
start_offsets
, and sometime lift an efficient, sparse optional column to a multivalue.This problem has been observed on Airmail.
Possible solution
One possible solution would be to shield the multivalued column by an optional column marking the rows that have at least one document.
That solution is reasonable to implement and has being explored as a duct tape hack in #2429.
Main cons are:
We can add FULL blocks or BIZARRO blocks to the optional index to fix this, but the extra branching could have some unexpected effected on scalar operations microoptimizations.
The text was updated successfully, but these errors were encountered: