Evaluate sparse histogram collection #1704

Closed
PSeitz opened this issue Nov 30, 2022 · 0 comments · Fixed by #1898
Comments

Contributor

PSeitz commented Nov 30, 2022

Currently, histogram buckets are densely pre-created based on the fast field's min/max values and the passed bounds. This allows fast computation of the bucket position for incoming values.
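The dense scheme can be sketched roughly as follows (an illustrative sketch, not tantivy's actual implementation; the min/max/interval values are made up):

```rust
// Illustrative sketch of dense histogram collection: one counter is
// pre-allocated per bucket between min and max, so the bucket position
// for an incoming value is a cheap offset computation.
fn main() {
    let (min, max, interval) = (0.0f64, 10.0f64, 2.0f64);
    // Pre-create every bucket in the [min, max] range up front.
    let num_buckets = ((max - min) / interval) as usize + 1;
    let mut counts = vec![0u64; num_buckets];
    for value in [0.0f64, 1.0, 5.0, 9.9] {
        // Fast path: bucket position is a direct offset, no lookup needed.
        let pos = ((value - min) / interval) as usize;
        counts[pos] += 1;
    }
    assert_eq!(counts, vec![2, 0, 1, 0, 1, 0]);
}
```

The pre-allocation is exactly what makes the sparse-data case below pathological: the vector's size depends on the value range, not on the number of non-empty buckets.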

In some scenarios this may cause issues with the max bucket limit (defaults to 65000) and server memory consumption.

Data example

{"value": 0}
{"value": 1_000_000_000}

A histogram query on value with interval 1 and min_doc_count > 0 would create 1 billion buckets and could overload the server, even though only 2 buckets would be returned.

An alternative would be to have sparse histogram collection. It could also be a hybrid of lazy dense collection with automatic switching to sparse.
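A sparse collector could key buckets by position in a hashmap instead of pre-allocating a dense vector, so memory scales with the number of non-empty buckets. A minimal sketch, assuming the data example above (this is illustrative, not the actual tantivy code):

```rust
use std::collections::HashMap;

// Illustrative sketch of sparse histogram collection: only buckets that
// actually receive a value are materialized, so the two documents from
// the data example produce two buckets instead of one billion.
fn main() {
    let interval = 1.0f64;
    let mut buckets: HashMap<i64, u64> = HashMap::new();
    for value in [0.0f64, 1_000_000_000.0] {
        // Bucket key follows the usual floor(value / interval) convention.
        let key = (value / interval).floor() as i64;
        *buckets.entry(key).or_insert(0) += 1;
    }
    assert_eq!(buckets.len(), 2);
    assert_eq!(buckets[&0], 1);
    assert_eq!(buckets[&1_000_000_000], 1);
}
```

The trade-off, noted in the commits below, is that hashing is slower than a direct vector offset in the dense case, which is what motivates a hybrid or a specialized hashmap.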

Sparse histogram collection may also be reusable for a future date histogram.

Related issues: #1703, quickwit-oss/quickwit#2503

PSeitz added a commit that referenced this issue Feb 22, 2023
Replaces histogram vec collection with a hashmap. This approach works much better for sparse data and enables use cases like drill downs (filter + small interval).
It is slower for dense cases (1.3x-2x slower). This can be alleviated with a specialized hashmap in the future.
closes #1704
closes #1370
PSeitz added a commit that referenced this issue Feb 23, 2023
* switch to sparse collection for histogram

Replaces histogram vec collection with a hashmap. This approach works much better for sparse data and enables use cases like drill downs (filter + small interval).
It is slower for dense cases (1.3x-2x slower). This can be alleviated with a specialized hashmap in the future.
closes #1704
closes #1370

* refactor, clippy

* fix bucket_pos overflow issue