
feat(aggregators/metric): Add a top_hits aggregator #2198

Merged
32 commits merged into quickwit-oss:main on Jan 26, 2024

Conversation

ditsuke
Contributor

@ditsuke ditsuke commented Oct 2, 2023

Summary

Implements the top_hits aggregator. The aggregator is backed by the new TopNComputer (originally a BinaryHeap, based on prior discussions on Discord#tantivy-dev).

ditsuke marked this pull request as draft October 2, 2023 19:14
@adamreichold
Collaborator

The aggregator is backed by a BinaryHeap based on prior discussions

Did those discussions already consider the recently introduced TopNComputer which might be a better choice?

@fulmicoton
Collaborator

Hello @ditsuke, is this something you actually have a need for? Can you describe the use case?

@ditsuke
Contributor Author

ditsuke commented Oct 3, 2023

The aggregator is backed by a BinaryHeap based on prior discussions

Did those discussions already consider the recently introduced TopNComputer which might be a better choice?

Yes, the idea behind the new algorithm came up in that discussion thanks to @fulmicoton. I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong

@ditsuke
Contributor Author

ditsuke commented Oct 3, 2023

Hello @ditsuke, is this something you actually have a need for? Can you describe the use case?

Hi @fulmicoton, we discussed the use-case on #tantivy-help some weeks back. Our use-case is querying for the top docs in each bucket (really the most recent post by some source).

For reference.
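For context, a rough Elasticsearch-style sketch of what such a request could look like: bucket posts by their source and keep the single most recent hit per bucket. The field names (`source`, `timestamp`) and the exact parameter set are illustrative assumptions, not the final API of this PR.

```rust
use serde_json::json;

fn main() {
    // Illustrative aggregation request: one terms bucket per "source",
    // with a nested top_hits keeping only the most recent document.
    // Field names are made up; parameters mirror the Elasticsearch shape.
    let agg_request = json!({
        "by_source": {
            "terms": { "field": "source" },
            "aggs": {
                "latest_post": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{ "timestamp": "desc" }]
                    }
                }
            }
        }
    });
    println!("{}", serde_json::to_string_pretty(&agg_request).unwrap());
}
```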

@adamreichold
Collaborator

I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong

I might be misunderstanding things since I wasn't part of that discussion, but my understanding was that the stable sort order needed for pagination is not so much a question of BinaryHeap versus TopNComputer, but rather whether ties are resolved in a stable manner by comparing document addresses when sorting the final result set?

@ditsuke
Contributor Author

ditsuke commented Oct 3, 2023

I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong

I might be misunderstanding things since I wasn't part of that discussion, but my understanding was that the stable sort order needed for pagination is not so much a question of BinaryHeap versus TopNComputer, but rather whether ties are resolved in a stable manner by comparing document addresses when sorting the final result set?

You're actually right! My initial impression was that the elimination step would be unstable for distributions with more than n/2 elements ≤ the median (i.e. when you can potentially eliminate more than n/2 elements and have to make a choice about which of the conflicting values to retain, given that the elimination has to be capped at n/2 elements). But in fact this isn't a problem anymore with our comparator falling back to the doc address on conflicts, so we should be able to use TopNComputer here safely.
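A minimal, self-contained illustration of that tie-breaking idea (plain tuples rather than tantivy's actual ComparableDoc/TopNComputer types): sort by the feature value first, then fall back to the doc id so equal values always land in the same order.

```rust
fn main() {
    // (sort value, doc id) pairs; there are ties on 9 and on 7.
    let mut hits: Vec<(u64, u32)> = vec![(7, 3), (9, 1), (7, 0), (9, 4)];

    // Descending by sort value, then ascending by doc id to resolve ties
    // deterministically, which keeps pagination stable across runs.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.cmp(&b.1)));

    assert_eq!(hits, vec![(9, 1), (9, 4), (7, 0), (7, 3)]);
}
```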

fulmicoton requested a review from PSeitz October 9, 2023 00:41
@fulmicoton
Collaborator

This PR is starting to look pretty good! Thank you @ditsuke.
Can you add unit tests?
@PSeitz, can you take care of the rest of the review?

@PSeitz
Contributor

PSeitz commented Oct 9, 2023

Fetching Docs

One tricky part that is still missing is fetching the actual content of the document. So far, aggregations are limited to fast fields due to the way they operate.
The question is when to fetch documents and from which data source.

Document's data source

Generally there are two data sources for a document's data in top_hits: the doc store and fast fields.

Doc store access is relatively expensive; fast field (doc values) access is cheap.

Fast field terms may be limited to a certain length, so some long texts may be missing there.

When to Fetch

Aggregation works roughly like this:

Segment Collection => Intermediate Result
Segment Collection => Intermediate Result
Segment Collection => Intermediate Result
... there could be 100 intermediate results

Intermediate results can be de/serialized, merged and converted to a final result.

Fetching a document's data could happen when converting to an intermediate result or to the final result.

When the data source is a fast field, fetching at the intermediate result step will be fine.

Fetch at Intermediate Result

This will cause some overhead: we would fetch, e.g., the top 10 for each intermediate result, but after merging 100 intermediate results we only keep the top 10. So we would fetch 1,000 documents just to keep 10. From the doc store that would be expensive; from the fast fields this should be fine (except maybe for very long texts).
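A quick back-of-the-envelope sketch of that overhead, using the numbers from the example above:

```rust
fn main() {
    // Fetching at the intermediate step retrieves the top N for every
    // intermediate result, even though only the top N survive the final merge.
    let intermediate_results = 100;
    let top_n = 10;
    let fetched = intermediate_results * top_n; // 1_000 document fetches
    let kept = top_n;                           // only 10 remain after merging
    println!("fetched {fetched} documents, kept only {kept}");
}
```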

Fetch at Final Result

This would require passing additional metadata like the segment id. In distributed scenarios like Quickwit, this may require even more metadata to resolve documents at the end, since the final result conversion may happen on a different node.

Conclusion

These two approaches are quite different and each requires its own implementation. Supporting both would be possible but definitely adds some complexity.

The main problem I see with fetching at the final result is that it requires access to the whole index on the node that assembles the final result. But I think that for this kind of aggregation most users are interested in only one or two fields anyway, not the whole doc.

We could limit the aggregation to only handle fields requested via the docvalue fields parameter.

I'm not sure which variant is the best approach; it very much depends on user queries. Supporting only docvalue fields is the simplest variant, so we could just start with that and change it to fetch at the final result if we run into issues.
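As a concrete sketch of that simplest variant, here is what a top_hits request restricted to doc-values fields could look like. The parameter names (`docvalue_fields`, `sort`, `size`) are assumptions modeled on the Elasticsearch API, not necessarily the final request format of this PR.

```rust
use serde_json::json;

fn main() {
    // Only the listed doc-values (fast field) columns are attached to each hit,
    // so no doc store access is needed. Field names are illustrative.
    let top_hits = json!({
        "top_hits": {
            "size": 10,
            "sort": [{ "timestamp": "desc" }],
            "docvalue_fields": ["title", "timestamp"]
        }
    });
    println!("{}", serde_json::to_string_pretty(&top_hits).unwrap());
}
```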

@adamreichold
Collaborator

Supporting only docvalue fields is the simplest variant, so we could just start with that and change it to fetch at the final result if we run into issues.

I think this sounds like the most reasonable approach from a risk management and incremental feature development perspective.

@ditsuke
Contributor Author

ditsuke commented Oct 10, 2023

This PR is starting to look pretty good! Thank you @ditsuke. Can you add unit tests?

Thank you! I'm working on the tests.

@ditsuke
Contributor Author

ditsuke commented Oct 10, 2023

Supporting only docvalue fields is the simplest variant, so we could just start with that and change it to fetch at the final result if we run into issues.

I think this sounds like the most reasonable approach from a risk management and incremental feature development perspective.

Agreed, that sounds reasonable. @PSeitz do we handle the docvalue fields support in this PR or with a follow-up?

ditsuke force-pushed the feat/aggregators/top-hits branch from 6cc3cd9 to 74654bf on October 10, 2023 11:23
@PSeitz
Contributor

PSeitz commented Oct 10, 2023

Agreed, that sounds reasonable. @PSeitz do we handle the docvalue fields support in this PR or with a follow-up?

It should be in this PR, or we can't add tests for the aggregation.

@ditsuke
Contributor Author

ditsuke commented Oct 15, 2023

@PSeitz: a rough dig at docvalue_fields in a26a353. Could you have a look and validate whether it's heading in the right direction?

ditsuke changed the title from "feat(aggregators/metric): Implement a top_hits aggregator" to "feat(aggregators/metric): Add a top_hits aggregator" on Oct 19, 2023
ditsuke marked this pull request as ready for review October 19, 2023 17:26
@ditsuke
Contributor Author

ditsuke commented Nov 10, 2023

@PSeitz thanks for the latest round of review, I'll get back to the PR early next week!

Since a (name, type) constitutes a unique column.
Introduces a translation step to bridge the difference between ColumnarReader's null-byte (`\0`) separated JSON field keys and the `.`-separated form used by SegmentReader. This should probably be the default behavior of ColumnarReader's public API, though.
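A minimal sketch of that translation step, assuming the columnar key simply uses the null byte as its path separator (the helper name is hypothetical, not the PR's actual API):

```rust
/// Hypothetical helper: turn a ColumnarReader-style key that separates JSON
/// path segments with `\0` into the `.`-separated form used by SegmentReader.
fn columnar_key_to_dotted(key: &str) -> String {
    key.replace('\0', ".")
}

fn main() {
    assert_eq!(columnar_key_to_dotted("attributes\0color"), "attributes.color");
}
```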
@ditsuke
Contributor Author

ditsuke commented Nov 27, 2023

@PSeitz I believe all review points have been resolved.

@ditsuke
Contributor Author

ditsuke commented Dec 10, 2023

Hi @PSeitz, I don't want to rush the review but I'm just tagging you in case this slipped through your notifications earlier. Please let me know if there are issues with any of the new updates here.

@fulmicoton
Collaborator

@PSeitz can you resume review?

@PSeitz
Contributor

PSeitz commented Jan 26, 2024

Looks good so far, except the segment ordinal part

@ditsuke
Contributor Author

ditsuke commented Jan 26, 2024

Looks good so far, except the segment ordinal part

Thank you, I dropped a comment about the SegmentOrdinal in the review thread.

PSeitz merged commit 0e04ec3 into quickwit-oss:main Jan 26, 2024
3 checks passed
@PSeitz
Contributor

PSeitz commented Jan 26, 2024

Thanks for the PR!

(and sorry for the slow Review)

@ditsuke
Contributor Author

ditsuke commented Jan 26, 2024

Thanks for the PR!

(and sorry for the slow Review)

Thanks for merging and the thorough review, very educational!

PSeitz pushed a commit that referenced this pull request Apr 10, 2024
* feat(aggregators/metric): Implement a top_hits aggregator

* fix: Expose get_fields

* fix: Serializer for top_hits request

Also removes the extraneous third-party serialization helper.

* chore: Avert panic on parsing invalid top_hits query

* refactor: Allow multiple field names from aggregations

* perf: Replace binary heap with TopNComputer

* fix: Avoid comparator inversion by ComparableDoc

* fix: Rank missing field values lower than present values

* refactor: Make KeyOrder a struct

* feat: Rough attempt at docvalue_fields

* feat: Complete stab at docvalue_fields

- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn

* test(unit): Add tests for top_hits aggregator

* fix: docfield_value field globbing

* test(unit): Include dynamic fields

* fix: Value -> OwnedValue

* fix: Use OwnedValue's native Null variant

* chore: Improve readability of test asserts

* chore: Remove DocAddress from top_hits result

* docs: Update aggregator doc

* revert: accidental doc test

* chore: enable time macros only for tests

* chore: Apply suggestions from review

* chore: Apply suggestions from review

* fix: Retrieve all values for fields

* test(unit): Update for multi-value retrieval

* chore: Assert term existence

* feat: Include all columns for a column name

Since a (name, type) constitutes a unique column.

* fix: Resolve json fields

Introduces a translation step to bridge the difference between ColumnarReader's null-byte (`\0`) separated JSON field keys and the `.`-separated form used by SegmentReader. This should probably be the default behavior of ColumnarReader's public API, though.

* chore: Address review on mutability

* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal

* chore: Revert erroneous grammar change