Querying what's 'current' in a time series index is hard, there may be a better way #61349

Closed
andrewvc opened this issue Aug 19, 2020 · 5 comments
Labels
  • :Analytics/Aggregations (Aggregations)
  • >enhancement
  • :Search/Search (Search-related issues that do not fall into other categories)
  • Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo))
  • Team:Search (Meta label for search team)

Comments

@andrewvc
Contributor

andrewvc commented Aug 19, 2020

I work on Uptime / Heartbeat here at Elastic, and a frequent source of query complexity is querying over the subset of our time series data that represents the current state of the system. For instance, "Tell me how many monitors are up vs down". This sounds simple, but it's not, because each monitor has many documents over time and only the newest one reflects its current state. To accomplish this we must:

  1. Aggregate all the data by id
  2. Find the most recent document per ID via the @timestamp field
  3. Count the number of those most recent documents that have a monitor.status value of up vs. down.
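
As a rough illustration of those three steps, here is a minimal in-memory sketch in Python. It is not how Uptime does it; it assumes each document is a flat dict with the keys `monitor.id`, `@timestamp` and `monitor.status`, and that timestamps are ISO 8601 strings in a single format so that string comparison orders them correctly.

```python
from collections import Counter


def count_current_status(docs):
    """Count monitors by the status of their most recent document."""
    latest = {}  # monitor id -> newest document seen so far
    for doc in docs:
        monitor_id = doc["monitor.id"]  # assumed flat key, for illustration
        seen = latest.get(monitor_id)
        # Steps 1 and 2: group by id and keep only the newest document.
        if seen is None or doc["@timestamp"] > seen["@timestamp"]:
            latest[monitor_id] = doc
    # Step 3: count the statuses of the newest documents only.
    return Counter(doc["monitor.status"] for doc in latest.values())


# Hypothetical sample data: monitor "a" recovered, monitor "b" went down.
sample_docs = [
    {"monitor.id": "a", "@timestamp": "2020-08-19T10:00:00Z", "monitor.status": "down"},
    {"monitor.id": "a", "@timestamp": "2020-08-19T10:05:00Z", "monitor.status": "up"},
    {"monitor.id": "b", "@timestamp": "2020-08-19T10:00:00Z", "monitor.status": "up"},
    {"monitor.id": "b", "@timestamp": "2020-08-19T10:05:00Z", "monitor.status": "down"},
]

current_counts = count_current_status(sample_docs)
```

The hard part in Elasticsearch is that steps 1 and 2 have to happen across every document for ~100k monitors before step 3 can run.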

This gets even more complex when you add querying on top. What if you want the query to apply only to the most recent documents? If the query matches a value that was present only in an older document, you may display a stale 'current' status rather than the latest one.
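
To make the pitfall concrete, here is a small Python sketch under the same assumptions as before (flat dicts with hypothetical keys `monitor.id`, `@timestamp`, `monitor.status`, and ISO 8601 timestamp strings). It compares filtering before picking the latest document with filtering after.

```python
def latest_per_monitor(docs):
    """Return {monitor id: newest document} for the given documents."""
    latest = {}
    for doc in docs:
        seen = latest.get(doc["monitor.id"])
        if seen is None or doc["@timestamp"] > seen["@timestamp"]:
            latest[doc["monitor.id"]] = doc
    return latest


def is_down(doc):
    return doc["monitor.status"] == "down"


# Hypothetical data: "a" was down and has recovered, "b" is down right now.
docs = [
    {"monitor.id": "a", "@timestamp": "2020-08-19T10:00:00Z", "monitor.status": "down"},
    {"monitor.id": "a", "@timestamp": "2020-08-19T10:05:00Z", "monitor.status": "up"},
    {"monitor.id": "b", "@timestamp": "2020-08-19T10:00:00Z", "monitor.status": "up"},
    {"monitor.id": "b", "@timestamp": "2020-08-19T10:05:00Z", "monitor.status": "down"},
]

# Wrong order: query first, then take the latest match per monitor.
# Monitor "a" shows up because an old document matched.
filter_then_latest = latest_per_monitor([d for d in docs if is_down(d)])

# Right order: take the latest document per monitor, then query those.
latest_then_filter = {
    monitor_id: doc
    for monitor_id, doc in latest_per_monitor(docs).items()
    if is_down(doc)
}

print(sorted(filter_then_latest))  # ['a', 'b'], "a" is a stale match
print(sorted(latest_then_filter))  # ['b']
```

An ordinary query plus aggregation behaves like the first ordering, which is why the second needs extra work.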

We handle this today by using composite aggs and lots of post-processing in JS. We aim for our UI to handle large numbers of monitors, ~100k today. This essentially pushes the complexity of solving this difficult problem onto developers and isn't ideal.
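
For readers unfamiliar with the pattern, a sketch of paging through a composite aggregation with a `top_hits` sub-aggregation might look like the following. This is Python rather than the JS mentioned above and is not the actual Uptime code. The `search` parameter is a hypothetical callable that takes a request body and returns the parsed response; the field name `monitor.id` and the aggregation names are assumptions.

```python
def iter_latest_per_monitor(search, page_size=1000):
    """Yield (monitor id, newest _source) pairs, one composite page at a time.

    `search` is a hypothetical callable: request body dict -> response dict.
    """
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"monitor_id": {"terms": {"field": "monitor.id"}}}],
        }
        if after is not None:
            composite["after"] = after  # resume where the last page ended
        body = {
            "size": 0,  # only the aggregation result is needed
            "aggs": {
                "monitors": {
                    "composite": composite,
                    "aggs": {
                        "latest": {
                            "top_hits": {
                                "size": 1,
                                "sort": [{"@timestamp": {"order": "desc"}}],
                            }
                        }
                    },
                }
            },
        }
        agg = search(body)["aggregations"]["monitors"]
        for bucket in agg["buckets"]:
            newest = bucket["latest"]["hits"]["hits"][0]["_source"]
            yield bucket["key"]["monitor_id"], newest
        after = agg.get("after_key")
        # Stop when the page was empty or no continuation key came back.
        if after is None or not agg["buckets"]:
            break
```

Counting, filtering and sorting the yielded documents is the post-processing that ends up in application code.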

I've discussed this a bit with @polyfractal and we've covered a few different solutions.

  • Some new sort of aggregation or query-phase that makes this easier. Maybe we could have a two-phase query where the first aggregates and sorts, yielding a subset of docs that the second queries / aggregates over?
  • A way to split writes on the ingest node so we could create a 'current' index that gets only scripted upserts written to it, 1 per monitor ID
  • A way to do the same as above but with data frame transforms. One pain point with DFTs is that their lifecycle needs to be managed. For stack solutions we don't have DFT integrations yet, and I'm worried about them stopping / starting or not being around. It seems like a source of bugs. Ideally we could handle this the same way we handled streams, where they are bundled with an ingestion point and either the whole thing works or it doesn't.
  • A way to do more complex queries with multiple levels of joins (in this case a self join) in ES via scripting in a safe way.
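
To illustrate the 'current' index idea from the second bullet, here is a minimal Python sketch of the upsert rule such a write path would need. It models the index as a plain dict keyed by monitor ID; the function name and the field keys are hypothetical, and timestamps are assumed to be ISO 8601 strings in one format.

```python
def upsert_current(current, doc):
    """Keep at most one document per monitor ID, the newest one.

    `current` stands in for the 'current' index (monitor id -> document).
    Returns True if the document was written, False if it was stale.
    """
    monitor_id = doc["monitor.id"]
    existing = current.get(monitor_id)
    # Only overwrite when the incoming document is at least as new, so
    # late-arriving older documents cannot clobber the current state.
    if existing is None or doc["@timestamp"] >= existing["@timestamp"]:
        current[monitor_id] = doc
        return True
    return False


current_index = {}
upsert_current(current_index, {"monitor.id": "a", "@timestamp": "2020-08-19T10:05:00Z", "monitor.status": "up"})
# An older document arriving late is ignored.
upsert_current(current_index, {"monitor.id": "a", "@timestamp": "2020-08-19T10:00:00Z", "monitor.status": "down"})
```

With one document per monitor, "how many are up vs down" becomes an ordinary terms aggregation over the 'current' index.
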
@andrewvc added the >enhancement and needs:triage labels Aug 19, 2020
@polyfractal added the :Analytics/Aggregations and :Search/Search labels and removed the needs:triage label Aug 19, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine added the Team:Search and Team:Analytics labels Aug 19, 2020
@dgieselaar
Member

I can imagine something like this being useful for the Telemetry folks as well. cc @mindbat

@andrewvc
Contributor Author

After speaking with @benwtrent I think we can solve this with data frame transforms after all. I'm going to close this for now.

@hendrikmuhs

For random readers of the issue, looking for a solution:

https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-overview.html#latest-transform-overview
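
As a rough sketch of what a latest transform for this use case could look like, the Python snippet below builds a request body for `PUT _transform/<transform_id>`. The index names, the `monitor.id` key, the delay and the frequency are assumptions chosen for illustration, not values taken from this thread; check the linked documentation for the options your version supports.

```python
import json

# Hypothetical names and intervals; adjust to your own indices.
latest_transform_body = {
    "source": {"index": "heartbeat-*"},        # assumed source pattern
    "dest": {"index": "monitors-current"},     # hypothetical destination
    "latest": {
        "unique_key": ["monitor.id"],          # one output doc per monitor
        "sort": "@timestamp",                  # keep the newest doc per key
    },
    # Continuous mode: pick up new documents as they arrive.
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
    "frequency": "1m",
}

print(json.dumps(latest_transform_body, indent=2))
```

The destination index then holds one document per monitor, so up/down counts and filters can run against it directly.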
