Fix inefficiency on multivalued but sparse column. #2431

fulmicoton · 2024-06-12T01:02:27Z

When dealing with multivalued column but sparse (most document have 0 values), tantivy is currently very inefficient.

We need to build an start_offsets index that is a column as long as the number of documents.

In particular, in dynamic mode, if a user is filling quickwit with documents with a dynamic key.
{"cart": {user_123123: ["item1", "item30"]}
We could end up creating a lot of multivalued columns.

The phenomenon could be stress full on indexing.

On merge, we would have to extend these start_offsets, and sometime lift an efficient, sparse optional column to a multivalue.

This problem has been observed on Airmail.

Possible solution

One possible solution would be to shield the multivalued column by an optional column marking the rows that have at least one document.

That solution is reasonable to implement and has being explored as a duct tape hack in #2429.

Main cons are:

multivalued columns will get more inefficient as every single call will have to go through the optional index access.
when facing dense multivalued columns, the index will get 1 bit per document larger per such column.
We can add FULL blocks or BIZARRO blocks to the optional index to fix this, but the extra branching could have some unexpected effected on scalar operations microoptimizations.

The text was updated successfully, but these errors were encountered:

PSeitz · 2024-06-18T01:34:14Z

The PR #2439 adds the optional index to the multivalue index based on #2429.

Extending the optional index with FULL blocks or BIZARRO blocks (store docis without a value) should be explored seperately. It may also make sense to have an encoding for the threshold between SPARSE and DENSE (5_120 elements in e u16::MAX block), when DENSE is relatively costly and SPARSE is relatively slow with the binary_search.

fulmicoton assigned PSeitz Jun 12, 2024

fulmicoton added the high priority label Jun 12, 2024

PSeitz mentioned this issue Jun 18, 2024

use optional index in multivalued index #2439

Merged

PSeitz mentioned this issue Jun 18, 2024

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

Open

PSeitz closed this as completed in #2439 Jun 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix inefficiency on multivalued but sparse column. #2431

Fix inefficiency on multivalued but sparse column. #2431

fulmicoton commented Jun 12, 2024

PSeitz commented Jun 18, 2024

Fix inefficiency on multivalued but sparse column. #2431

Fix inefficiency on multivalued but sparse column. #2431

Comments

fulmicoton commented Jun 12, 2024

Possible solution

PSeitz commented Jun 18, 2024