
feat(aggregators/metric): Add a top_hits aggregator #2198

Merged
32 commits merged into quickwit-oss:main on Jan 26, 2024

Conversation

ditsuke
Contributor

@ditsuke ditsuke commented Oct 2, 2023

Summary

Implements the top_hits aggregator. The aggregator is backed by the new TopNComputer (originally a BinaryHeap, based on prior discussions on Discord#tantivy-dev).

ditsuke marked this pull request as draft October 2, 2023 19:14
@adamreichold
Collaborator

The aggregator is backed by a BinaryHeap based on prior discussions

Did those discussions already consider the recently introduced TopNComputer which might be a better choice?

@fulmicoton
Collaborator

Hello @ditsuke, is this something you actually have a need for? Can you describe the use case?

@ditsuke
Contributor Author

ditsuke commented Oct 3, 2023

The aggregator is backed by a BinaryHeap based on prior discussions

Did those discussions already consider the recently introduced TopNComputer which might be a better choice?

Yes, the idea behind the new algorithm came up in that discussion thanks to @fulmicoton. I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong

@ditsuke
Contributor Author

ditsuke commented Oct 3, 2023

Hello @ditsuke, is this something you actually have a need for? Can you describe the use case?

Hi @fulmicoton, we discussed the use-case on #tantivy-help some weeks back. Our use-case is querying for the top docs in each bucket (really the most recent post by some source).

For reference.
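For context, a rough Elasticsearch-style sketch of what such a request could look like: bucket posts by their source and keep the single most recent hit per bucket. The field names (`source`, `timestamp`) and the exact parameter set are illustrative assumptions, not the final API of this PR.

```rust
use serde_json::json;

fn main() {
    // Illustrative aggregation request: one terms bucket per "source",
    // with a nested top_hits keeping only the most recent document.
    // Field names are made up; parameters mirror the Elasticsearch shape.
    let agg_request = json!({
        "by_source": {
            "terms": { "field": "source" },
            "aggs": {
                "latest_post": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{ "timestamp": "desc" }]
                    }
                }
            }
        }
    });
    println!("{}", serde_json::to_string_pretty(&agg_request).unwrap());
}
```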

@adamreichold
Collaborator

I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong

I might be misunderstanding things since I wasn't part of that discussion, but my understanding was that the stable sort order needed for pagination is not so much a question of BinaryHeap versus TopNComputer, but rather whether ties are resolved in a stable manner by comparing document addresses when sorting the final result set?

@ditsuke
Contributor Author

ditsuke commented Oct 3, 2023

I didn't use that here since it wasn't found suitable for pagination, but I'm happy to reconsider if that's wrong

I might be misunderstanding things since I wasn't part of that discussion, but my understanding was that the stable sort order needed for pagination is not so much a question of BinaryHeap versus TopNComputer, but rather whether ties are resolved in a stable manner by comparing document addresses when sorting the final result set?

You're actually right! My initial impression was that the elimination step would be unstable for distributions with more than n/2 elements ≤ the median (i.e. when you can potentially eliminate more than n/2 elements and have to make a choice about which of the conflicting values to retain, given that the elimination has to be capped at n/2 elements). But in fact this isn't a problem anymore with our comparator falling back to the doc address on conflicts, so we should be able to use TopNComputer here safely.
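A minimal, self-contained illustration of that tie-breaking idea (plain tuples rather than tantivy's actual ComparableDoc/TopNComputer types): sort by the feature value first, then fall back to the doc id so equal values always land in the same order.

```rust
fn main() {
    // (sort value, doc id) pairs; there are ties on 9 and on 7.
    let mut hits: Vec<(u64, u32)> = vec![(7, 3), (9, 1), (7, 0), (9, 4)];

    // Descending by sort value, then ascending by doc id to resolve ties
    // deterministically, which keeps pagination stable across runs.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.cmp(&b.1)));

    assert_eq!(hits, vec![(9, 1), (9, 4), (7, 0), (7, 3)]);
}
```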

fulmicoton requested a review from PSeitz October 9, 2023 00:41
@fulmicoton
Collaborator

This PR is starting to look pretty good! Thank you @ditsuke.
Can you add unit tests?
@PSeitz, can you take care of the rest of the review?

@PSeitz
Contributor

PSeitz commented Oct 9, 2023

Fetching Docs

One tricky part that is still missing is fetching the actual content of the document. So far, aggregations are limited to fast fields due to the way they operate.
The question is when to fetch documents and from which data source.

Document's data source

Generally there are two data sources for a document's data in top_hits: the doc store and fast fields.

Doc store access is relatively expensive; fast field (doc values) access is cheap.

Fast field terms may be limited to a certain length, so some long texts may be missing there.

When to Fetch

Aggregation works roughly like this:

Segment Collection => Intermediate Result
Segment Collection => Intermediate Result
Segment Collection => Intermediate Result
... there could be 100 intermediate results

Intermediate results can be de/serialized, merged and converted to a final result.

Fetching a document's data could happen when converting to an intermediate result or to the final result.

When the data source is a fast field, fetching at the intermediate result step will be fine.

Fetch at Intermediate Result

This will cause some overhead: we would fetch, e.g., the top 10 for each intermediate result, but after merging 100 intermediate results we only keep the top 10. So we would fetch 1,000 documents just to keep 10. From the doc store that would be expensive; from the fast fields this should be fine (except maybe for very long texts).
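A quick back-of-the-envelope sketch of that overhead, using the numbers from the example above:

```rust
fn main() {
    // Fetching at the intermediate step retrieves the top N for every
    // intermediate result, even though only the top N survive the final merge.
    let intermediate_results = 100;
    let top_n = 10;
    let fetched = intermediate_results * top_n; // 1_000 document fetches
    let kept = top_n;                           // only 10 remain after merging
    println!("fetched {fetched} documents, kept only {kept}");
}
```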

Fetch at Final Result

This would require passing additional metadata like the segment id. In distributed scenarios like Quickwit, this may require even more metadata to resolve documents at the end, since the final result conversion may happen on a different node.

Conclusion

These two approaches are quite different and each requires its own implementation. Supporting both would be possible but definitely adds some complexity.

The main problem I see with fetching at the final result is that it requires access to the whole index on the node that assembles the final result. But I think that for this kind of aggregation most users are interested in only one or two fields anyway, not the whole doc.

We could limit the aggregation to only handle fields requested via the docvalue fields parameter.

I'm not sure which variant is the best approach; it very much depends on user queries. Supporting only docvalue fields is the simplest variant, so we could just start with that and change it to fetch at the final result if we run into issues.
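As a concrete sketch of that simplest variant, here is what a top_hits request restricted to doc-values fields could look like. The parameter names (`docvalue_fields`, `sort`, `size`) are assumptions modeled on the Elasticsearch API, not necessarily the final request format of this PR.

```rust
use serde_json::json;

fn main() {
    // Only the listed doc-values (fast field) columns are attached to each hit,
    // so no doc store access is needed. Field names are illustrative.
    let top_hits = json!({
        "top_hits": {
            "size": 10,
            "sort": [{ "timestamp": "desc" }],
            "docvalue_fields": ["title", "timestamp"]
        }
    });
    println!("{}", serde_json::to_string_pretty(&top_hits).unwrap());
}
```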

@adamreichold
Collaborator

Supporting only docvalue fields is the simplest variant, so we could just start with that and change it to fetch at the final result if we run into issues.

I think this sounds like the most reasonable approach from a risk management and incremental feature development perspective.

@ditsuke
Contributor Author

ditsuke commented Oct 10, 2023

This PR is starting to look pretty good! Thank you @ditsuke. Can you add unit tests?

Thank you! I'm working on the tests.

@ditsuke
Contributor Author

ditsuke commented Oct 10, 2023

Supporting only docvalue fields is the simplest variant, so we could just start with that and change it to fetch at the final result if we run into issues.

I think this sounds like the most reasonable approach from a risk management and incremental feature development perspective.

Agreed, that sounds reasonable. @PSeitz do we handle the docvalue fields support in this PR or with a follow-up?

ditsuke force-pushed the feat/aggregators/top-hits branch from 6cc3cd9 to 74654bf on October 10, 2023 11:23
@PSeitz
Contributor

PSeitz commented Oct 10, 2023

Agreed, that sounds reasonable. @PSeitz do we handle the docvalue fields support in this PR or with a follow-up?

It should be in this PR, or we can't add tests for the aggregation.

@ditsuke
Contributor Author

ditsuke commented Oct 15, 2023

@PSeitz: a rough dig at docvalue_fields in a26a353. Could you have a look and validate whether it's heading in the right direction?

ditsuke changed the title from "feat(aggregators/metric): Implement a top_hits aggregator" to "feat(aggregators/metric): Add a top_hits aggregator" on Oct 19, 2023
ditsuke marked this pull request as ready for review October 19, 2023 17:26
@ditsuke
Contributor Author

ditsuke commented Nov 10, 2023

@PSeitz thanks for the latest round of review, I'll get back to the PR early next week!

Since a (name, type) constitutes a unique column.
Introduces a translation step to bridge the difference between ColumnarReader's null-byte (`\0`) separated JSON field keys and the `.`-separated form used by SegmentReader. This should probably be the default behavior of ColumnarReader's public API, though.
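A minimal sketch of that translation step, assuming the columnar key simply uses the null byte as its path separator (the helper name is hypothetical, not the PR's actual API):

```rust
/// Hypothetical helper: turn a ColumnarReader-style key that separates JSON
/// path segments with `\0` into the `.`-separated form used by SegmentReader.
fn columnar_key_to_dotted(key: &str) -> String {
    key.replace('\0', ".")
}

fn main() {
    assert_eq!(columnar_key_to_dotted("attributes\0color"), "attributes.color");
}
```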
@ditsuke
Contributor Author

ditsuke commented Nov 27, 2023

@PSeitz I believe all review points have been resolved.

@ditsuke
Contributor Author

ditsuke commented Dec 10, 2023

Hi @PSeitz, I don't want to rush the review but I'm just tagging you in case this slipped through your notifications earlier. Please let me know if there are issues with any of the new updates here.

@fulmicoton
Collaborator

@PSeitz can you resume review?

@PSeitz
Contributor

PSeitz commented Jan 26, 2024

Looks good so far, except the segment ordinal part

@ditsuke
Contributor Author

ditsuke commented Jan 26, 2024

Looks good so far, except the segment ordinal part

Thank you, I dropped a comment about the SegmentOrdinal in the review thread.

PSeitz merged commit 0e04ec3 into quickwit-oss:main Jan 26, 2024
3 checks passed
@PSeitz
Contributor

PSeitz commented Jan 26, 2024

Thanks for the PR!

(and sorry for the slow Review)

@ditsuke
Contributor Author

ditsuke commented Jan 26, 2024

Thanks for the PR!

(and sorry for the slow Review)

Thanks for merging and the thorough review, very educational!

PSeitz pushed a commit that referenced this pull request Apr 10, 2024
* feat(aggregators/metric): Implement a top_hits aggregator

* fix: Expose get_fields

* fix: Serializer for top_hits request

Also removes the extraneous third-party serialization helper.

* chore: Avert panic on parsing invalid top_hits query

* refactor: Allow multiple field names from aggregations

* perf: Replace binary heap with TopNComputer

* fix: Avoid comparator inversion by ComparableDoc

* fix: Rank missing field values lower than present values

* refactor: Make KeyOrder a struct

* feat: Rough attempt at docvalue_fields

* feat: Complete stab at docvalue_fields

- Rename "SearchResult*" => "Retrieval*"
- Revert Vec => HashMap for aggregation accessors.
- Split accessors for core aggregation and field retrieval.
- Resolve globbed field names in docvalue_fields retrieval.
- Handle strings/bytes and other column types with DynamicColumn

* test(unit): Add tests for top_hits aggregator

* fix: docfield_value field globbing

* test(unit): Include dynamic fields

* fix: Value -> OwnedValue

* fix: Use OwnedValue's native Null variant

* chore: Improve readability of test asserts

* chore: Remove DocAddress from top_hits result

* docs: Update aggregator doc

* revert: accidental doc test

* chore: enable time macros only for tests

* chore: Apply suggestions from review

* chore: Apply suggestions from review

* fix: Retrieve all values for fields

* test(unit): Update for multi-value retrieval

* chore: Assert term existence

* feat: Include all columns for a column name

Since a (name, type) constitutes a unique column.

* fix: Resolve json fields

Introduces a translation step to bridge the difference between ColumnarReader's null-byte (`\0`) separated JSON field keys and the `.`-separated form used by SegmentReader. This should probably be the default behavior of ColumnarReader's public API, though.

* chore: Address review on mutability

* chore: s/segment_id/segment_ordinal instances of SegmentOrdinal

* chore: Revert erroneous grammar change