[RFC] Search User Behavior Logging and Data Reuse for Relevance #4619

Closed
macohen opened this issue Sep 28, 2022 · 11 comments

Assignees: macohen
Labels: feature (New feature or request), Search:Relevance, Search (Search query, autocomplete ...etc)

Comments

@macohen
Contributor

macohen commented Sep 28, 2022

What/Why

What are you proposing?

Currently, there is no way for users of OpenSearch to get a full picture of how search is being used without building their own logging and metrics collection system. This is a request for comments to the community to discuss the need for a standardized logging schema and collection mechanism. We want to work with the community to understand where improvements would be most impactful in helping users understand how search is used in their applications and how they can tune results most effectively.

We believe that application builders using OpenSearch for e-commerce, product, and document-based search have a common set of needs in how they collect and expose data for analytics and reuse. Regarding analytics, we believe builders, business users, and relevance engineers want to see metrics out of the box for any search application, like top queries, top queries resulting in a high value action (HVA - like a purchase, stream, download, or whatever the builder defines), top queries with zero results, and top abandoned queries, as well as more advanced analytics like similar queries in the long tail that may be helped by synonyms, query rewrites/expansion, or other relevance tuning techniques. This same data can also be reused to feed manual judgement and automated learning to improve relevance in the index.
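
To make the out-of-the-box metrics idea concrete, here is a minimal sketch of the kind of report this data would enable, assuming a hypothetical `search_behavior_logs` index with illustrative `user_query` and `result_count` fields (the index name and schema are not part of this proposal):

```
# Top 10 queries that returned zero results, from a hypothetical behavior log index.
POST search_behavior_logs/_search
{
  "size": 0,
  "query": {
    "term": { "result_count": 0 }
  },
  "aggs": {
    "top_zero_result_queries": {
      "terms": { "field": "user_query.keyword", "size": 10 }
    }
  }
}
```

Top queries, top abandoned queries, and top HVA-producing queries would follow the same pattern over the same log data.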

What users have asked for this feature?

Highlight any research, proposals, requests or anecdotes that signal this is the right thing to build. Include links to GitHub Issues, Forums, Stack Overflow, Twitter, etc.

What problems are you trying to solve?

Template: When <a situation arises>, a <type of user> wants to <do something>, so they can <expected outcome>. (Example: When searching by postal code, a buyer wants to be required to enter a valid code so they don’t waste time searching for a clearly invalid postal code.)

  • When any search results are returned, search application builders want to report on the top requested queries so that they can learn about what their users intend to find.
  • When users search for content, a search relevance engineer wants to feed behavioral data back into the search system for automatic reranking.
  • When users search for content, a search relevance engineer wants to feed behavioral data back into the search system for manual tuning of search results.

What is the developer experience going to be?

Does this have a REST API? If so, please describe the API and any impact it may have on existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned, as well as any other API, CLI, or configuration changes that are planned as part of this feature.

  • Allow the user to submit an optional field containing the original, user-typed query (one possible shape is sketched after this list). Track that original query through all steps of querying the index: 1) user-typed query -> 2) rewritten query -> 3) results from OpenSearch -> 4) reranked results outside of OpenSearch -> 5) actions taken by the end users (query again, abandon search, some other high value action).
  • Initially, we are focused on adoption, so even if we started from the inside out with steps 2 and 3 above, it would be helpful. The API change would be to provide a place in the query DSL to optionally submit the original query. We could build that in as well, but only include it in logging and analysis if it is present.
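
As a sketch only (the `ext.user_behavior` block, field names, and IDs here are illustrative, not a settled API), the optional original-query field might ride along with a search request something like this:

```
# Hypothetical shape: the application executes its (possibly rewritten) query
# and passes the raw user-typed query alongside it for logging only.
POST products/_search
{
  "query": {
    "match": { "title": "running shoes" }
  },
  "ext": {
    "user_behavior": {
      "original_query": "runing shoes",
      "query_id": "q-2042"
    }
  }
}
```

OpenSearch would ignore this block for matching and scoring, and only pick it up for logging and analysis when it is present, matching the opt-in behavior described above.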

Are there any security considerations?

Describe if the feature has any security considerations or impact. What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite, so if they are not, we should highlight the reasons here.

  • New data will be logged inside OpenSearch; injection attacks are a possible concern.

Are there any breaking changes to the API?

If this feature will require breaking changes to any APIs, outline what those are and why they are needed. What is the path to minimizing impact? (example: add a new API and deprecate the old one)

What is the user experience going to be?

Describe the feature requirements and or user stories. You may include low-fidelity sketches, wireframes, APIs stubs, or other examples of how a user would use the feature via CLI, OpenSearch Dashboards, REST API, etc. Using a bulleted list or simple diagrams to outline features is okay. If this is net new functionality, call this out as well.

Are there breaking changes to the User Experience?

Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?

  • No breaking changes

Why should it be built? Any reason not to?

Describe the value that this feature will bring to the OpenSearch community, as well as what impact it has if it isn't built, or new risks if it is. Highlight opportunities for additional research.

  • Building this feature will standardize a set of reporting and data collection needs that are common across search applications and allow software engineers and relevance engineers to focus on higher level concerns out of the box like tuning queries, query rewriting, synonyms, and results reranking.
  • If it isn't built, users will either have no insight into search results and how to tune them, or they will keep building their own analytics and data collection applications without gaining an understanding of what is happening inside OpenSearch.
  • If it is built, one technical concern is the trade-off between adding latency to OpenSearch and adding complexity to the platform. Logging every request and each step, like rewrites, results returned from the index, reranking, and HVAs, could have an impact on an OpenSearch cluster if we decide to do all of this in OpenSearch. On the other hand, adding a whole new set of infrastructure to handle this level of data collection, even with a separate OpenSearch cluster, adds complexity to the architecture.

What will it take to execute?

Describe what it will take to build this feature. Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

Any remaining open questions?

What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation.

Questions for the Community

  • Do you have first-party (homegrown) or third-party analytics tools like Google Analytics, Adobe, or others? Would it make sense for us to connect the logging and metrics we propose to deliver inside OpenSearch with the clickstream/application metrics you have in those other systems?

Review & Validate this Proposal for tracking data through OpenSearch: opensearch-project/search-processor#12

@macohen macohen self-assigned this Sep 28, 2022
@saratvemulapalli saratvemulapalli added untriaged Indexing & Search enhancement Enhancement or improvement to existing feature or request feature New feature or request and removed enhancement Enhancement or improvement to existing feature or request labels Sep 29, 2022
@macohen macohen moved this from 🆕 New to Next (Next Quarter) in Search Project Board Oct 27, 2022
@macohen macohen changed the title [Feature Proposal] Search Logging Metrics/Monitoring [RFC] Search Logging Metrics/Monitoring Dec 6, 2022
@macohen macohen changed the title [RFC] Search Logging Metrics/Monitoring [RFC] Standard Search Logging Metrics and Data Reuse for Relevance Dec 6, 2022
@macrakis

I would split this into three parts:

  • Collecting and storing the basic data.
    • Some users will just use existing tools to parse and analyze the data.
    • This is useful even if the remaining steps aren't complete.
  • Making it available in an easily queryable form.
    • If it were easily queryable (using SQL or DSL or whatever), that would be fantastic for all kinds of analysis and reporting (a sketch follows after this list).
  • Building some standard reports.
    • Once the data is in an easily queryable form, it should be easy enough to develop standard reports using OpenSearch's standard reporting tools.
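
As a sketch of the "easily queryable" idea, assuming the events land in a hypothetical `search_behavior_logs` index with illustrative `user_query` and `event_type` fields, the SQL plugin already gives a familiar surface for this kind of reporting:

```
# Top 10 queries that led to a purchase, via the SQL plugin endpoint.
POST _plugins/_sql
{
  "query": "SELECT user_query, COUNT(*) AS hva_count FROM search_behavior_logs WHERE event_type = 'purchase' GROUP BY user_query ORDER BY hva_count DESC LIMIT 10"
}
```

The same question can be asked with a terms aggregation in the query DSL; the point is only that once the data is indexed, the existing query surfaces cover much of the reporting need.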

@macrakis

macrakis commented Jan 20, 2023

Re "Track that original query through all steps of querying the index: user typed 1) query -> 2) rewritten query -> 3) results from OpenSearch -> 4) reranked results outside of OpenSearch -> 5) actions taken by the end users (query again, abandon search, some other high value action)."
For relevance evaluation, it is generally not all that useful to track the intermediate steps, and as you say, it potentially has very high overhead. In any case, if the configuration of the search pipeline is well-defined, then the intermediate stages should be recoverable by re-running the query.
Recording the intermediate stages is no doubt helpful for debugging, but it is not central to logging for relevance tuning. Making basic logging efficient is very important, because you'd like to always run with it on, but logging of intermediate stages presumably only gets turned on for debugging, and so doesn't need to be very efficient.

@macohen
Contributor Author

macohen commented Feb 3, 2023

I think I get what you mean, and it helps to refine the ideas. I do agree that recording intermediate stages is not the highest priority to release immediately along with the logging done outside OpenSearch. I think there's a trade-off we need to consider in logging the debugging information, and we should explore this more. Either we go down the path of making sure the query as it ran is recoverable, which means making sure we have the right versions of plugins, analyzers, indices, OpenSearch itself, the QueryDSL, rerankers, etc.; once we're talking about external rerankers, there are more variables that we don't control. The other option would be to create a scalable system for optionally logging everything so we know what happens at every step. That would give more information, but it is certainly harder to scale. Looking for more feedback and options here as well.

@macohen macohen moved this from Next (Next Quarter) to 👀 In review in Search Project Board Feb 3, 2023
@sathishbaskar

  1. When running an analytics app on time-series data, the app owner wants to look at query counts by shard and time window to plan how to break the index and replicate further for additional throughput (a sketch follows after this list).
  2. When running an analytics app on time-series data, the app owner wants to look at unique query patterns (e.g. filters excluding values) and their resource usage (memory used, CPU time, I/O usage, number of segments/shards/indices hit, etc.) to plan which query patterns can be split off to a replica cluster.
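
As a sketch of the time-window part of the first story, assuming the log records carry illustrative `timestamp` and `index_name` fields (neither is defined by this RFC):

```
# Query volume per hour, broken down by target index.
POST search_behavior_logs/_search
{
  "size": 0,
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "1h" },
      "aggs": {
        "by_index": {
          "terms": { "field": "index_name.keyword", "size": 20 }
        }
      }
    }
  }
}
```

Shard-level and resource-usage breakdowns would need additional fields that this proposal does not yet define.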

@macrakis

macrakis commented Feb 6, 2023

For now, we're focusing on collecting the data for relevance tuning. In our first stage, we're looking at end-to-end behavior (user query to user actions). In our second stage, we'll be looking at the search pipeline. In both of those cases, we'll be collecting data for high-level performance statistics (latency end-to-end, latency per pipeline stage), but a future enhancement could collect shard-level data.

@macrakis

macrakis commented Feb 7, 2023

To elaborate on "high value action (HVA - like a purchase, stream, download, or whatever the builder defines)", here are some actions/events that a user might want to track either within a session or beyond it:

  • Hover-over (presumably showing some additional information)
  • Unhide details (maybe more than one type/level of this), UI might be “More…” or “…” (e.g. display abstract of document)
  • Clickthrough to detail page (metadata, helps to determine whether the content is useful)
  • Clickthrough to content/display/play/download page (the thing itself -- the user is consuming the content)
  • Play content (hit the PLAY button)
  • Buy content
  • Add to cart from SRP (search results page)
  • Add to cart from detail page
  • Add as contact/friend
  • Communicate (send message)
  • Remove from cart
  • Bookmark / add to wish list
  • Purchase from SRP (one-click)
  • Purchase/checkout from cart
  • Click on "related" product
  • Rating (upvote/downvote)
  • Write review
  • Apply to job / submit proposal to buyer
  • Be hired for job / buyer accepts proposal

The user should be able to define their own event types as well.
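
As a sketch only (field names are illustrative, not a proposed schema), a single logged event tying one of these actions back to a query might look like:

```
{
  "query_id": "q-2042",
  "session_id": "s-981",
  "user_query": "running shoes",
  "event_type": "add_to_cart",
  "page_type": "srp",
  "object_id": "sku-123",
  "value": 59.99,
  "timestamp": "2023-02-07T16:21:04Z"
}
```

Keeping the action, the page it happened on, and an optional dollar value as separate attributes is one way to sidestep the multi-dimensional vs. hierarchical question raised below.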

Search systems exist in many domains, with different object types (product, document, person, company, ...) and different actions (buy, read, contact, apply for job, ...). Should we try to unify actions, e.g., "add as friend" = "purchase"?

Should "add to cart from SRP" vs "from detail page" be different event types or the same event type, distinguished by page type (where is that recorded?).

Should we try to align with others' definitions of actions, e.g., Google Analytics recommended events (only some of which are relevant to search)? Is there an industry standard or convention we should be following?

Can events have additional information like "dollar value of action" -- that's mostly a generic analytics issue, but even for search analytics, there may be differences in user behavior around high-priced and low-priced items.

Do we need to provide explicit support for multi-dimensional events (action=buy, pagetype=detail) or a hierarchy of events (buy is a supercategory of buy-on-srp and buy-on-detail-page)? Or should we leave this to the user?

@macohen macohen changed the title [RFC] Standard Search Logging Metrics and Data Reuse for Relevance [RFC] Search User Behavior Logging and Data Reuse for Relevance Feb 24, 2023
@macohen macohen added Search Search query, autocomplete ...etc and removed Indexing & Search labels Mar 27, 2023
@reta
Collaborator

reta commented Jan 24, 2024

May be somewhat related to #72

@smacrakis

@reta Thanks for the comment. I think #72 is more about changing the results, while this issue is about measuring the results and user interaction with them both for analytics and as input to machine learning.

@ansjcy
Member

ansjcy commented Jan 24, 2024

Great proposal! I believe those are very valid user stories, and adding the support mentioned in the RFC will definitely improve visibility and the analytics experience. Also, the proposal overlaps with several query insights features we are building now. At a high level, in query insights we want to build a generic data collection, processing, and exporting framework, add support for query-level recommendations, and also build query insights dashboards to help users have better visibility into search performance.

we believe builders, business users, and relevance engineers want to see metrics out of the box for any search application like top queries, top queries resulting in a high value action (HVA - like a purchase, stream, download, or whatever the builder defines), top queries with zero results, top abandoned queries

We are trying to cover those use cases with the Top N Queries feature! In 2.12 we are releasing the latency-based top queries feature, but we will add more dimensions (like CPU and JVM usage) in future releases. Also, "top queries with zero results" and "top abandoned queries" are great use cases we can consider building into the feature as well :).
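
For anyone who wants to try the 2.12 piece of this, enabling latency-based collection and reading back the results looks roughly like the following (check the query insights documentation for the exact setting names):

```
# Enable latency-based top query collection (2.12+).
PUT _cluster/settings
{
  "persistent": {
    "search.insights.top_queries.latency.enabled": true
  }
}

# Retrieve the current top queries.
GET /_insights/top_queries
```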

more advanced analytics like similar queries in the long tail that may be helped by synonyms, query rewrites/expansion or other relevance tuning techniques

Good point! I believe finding "similar" queries and clustering those similar queries will be super useful. It would be valuable information to the user; furthermore, we can build query cost estimation if we have a robust query clustering method. It will facilitate a bunch of other features like query rewrite, query sandboxing, and tiered caching as well, since knowing "how expensive the query would be" can be a super important metric for those features.

Building this feature will standardize a set of reporting and data collection needs that are common across search applications and allow software engineers and relevance engineers to focus on higher level concerns out of the box like tuning queries, query rewriting, synonyms, and results reranking.

These components are actually being built in the query insights framework; it would be great if we could reuse some of them.
#11429

Logging every request and each step like rewrites, results returned from the index, reranking, and HVAs could have impact on an OpenSearch cluster if we decide to do all of this in OpenSearch

Agreed! We should be careful about this and do thorough evaluations of factors like feature availability, recommendation SLA, and cost when determining which component to choose for a given feature.

@jzonthemtn

Related to #12084

@macohen
Contributor Author

macohen commented Jan 30, 2024

Let's use this RFC as a point of historical reference for #12084

@macohen macohen closed this as not planned Jan 30, 2024
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Search Project Board Jan 30, 2024