Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inefficiency on multivalued but sparse column. #2431

Closed
fulmicoton opened this issue Jun 12, 2024 · 1 comment · Fixed by #2439
Closed

Fix inefficiency on multivalued but sparse column. #2431

fulmicoton opened this issue Jun 12, 2024 · 1 comment · Fixed by #2439
Assignees

Comments

@fulmicoton
Copy link
Collaborator

When dealing with multivalued column but sparse (most document have 0 values), tantivy is currently very inefficient.

We need to build an start_offsets index that is a column as long as the number of documents.

In particular, in dynamic mode, if a user is filling quickwit with documents with a dynamic key.
{"cart": {user_123123: ["item1", "item30"]}
We could end up creating a lot of multivalued columns.

The phenomenon could be stress full on indexing.

On merge, we would have to extend these start_offsets, and sometime lift an efficient, sparse optional column to a multivalue.

This problem has been observed on Airmail.

Possible solution

One possible solution would be to shield the multivalued column by an optional column marking the rows that have at least one document.

That solution is reasonable to implement and has being explored as a duct tape hack in #2429.

Main cons are:

  • multivalued columns will get more inefficient as every single call will have to go through the optional index access.
  • when facing dense multivalued columns, the index will get 1 bit per document larger per such column.
    We can add FULL blocks or BIZARRO blocks to the optional index to fix this, but the extra branching could have some unexpected effected on scalar operations microoptimizations.
@PSeitz
Copy link
Contributor

PSeitz commented Jun 18, 2024

The PR #2439 adds the optional index to the multivalue index based on #2429.

Extending the optional index with FULL blocks or BIZARRO blocks (store docis without a value) should be explored seperately. It may also make sense to have an encoding for the threshold between SPARSE and DENSE (5_120 elements in e u16::MAX block), when DENSE is relatively costly and SPARSE is relatively slow with the binary_search.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants