[Observability] Homepage experience (Milestone 1) #68176

formgeist · 2020-06-03T20:14:42Z (edited by cauemarcondes)

Summary

As a continuation of #66931, we're looking to add a new view that will serve as the overview page when users already have data available for Logs, Metrics, APM, or Uptime.

Design proposal

▶️ Figma prototype

Chart panels

The overview page will consist of a number of sections, one per area of Observability, each containing one or more chart visualizations based on a high-level data query, e.g. the number of log events by log source.

Logs

The proposed chart panel for logs will be a log rate histogram grouped by log source. We will look for available indices matching the default setup for the Logs app. The list of lookups will be expanded as we investigate which indices would be interesting to auto-detect and visualize for other third-party log vendors, where we know we will have ECS-compatible data.

Data query

The log rate visualization already exists in the Log rate tab in the Logs app.

We will use the configured log indices in the Log settings.

TODO: Perhaps include an example ES query to get the same data
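A minimal sketch of what such a query could look like, written as a TypeScript function that builds an Elasticsearch search request body. It assumes the log source is identified by the ECS field `event.dataset` and the timestamp by `@timestamp`; the function name `buildLogRateQuery` and its parameters are hypothetical, and the target indices would be the ones configured in the Log settings.

```typescript
// Hypothetical helper: builds a search body for a log-rate histogram
// grouped by log source. Field names are ECS-based assumptions.
interface TimeRange {
  startTime: string; // ISO 8601, e.g. "2020-06-03T00:00:00Z"
  endTime: string;
}

function buildLogRateQuery(range: TimeRange, bucketSize: string) {
  return {
    size: 0, // only aggregations are needed, not the documents themselves
    query: {
      range: {
        "@timestamp": {
          gte: range.startTime,
          lte: range.endTime,
          format: "strict_date_optional_time",
        },
      },
    },
    aggs: {
      logRate: {
        date_histogram: {
          field: "@timestamp",
          fixed_interval: bucketSize, // e.g. "15m"
        },
        aggs: {
          bySource: {
            // Assumed log-source field; top 10 sources per bucket
            terms: { field: "event.dataset", size: 10 },
          },
        },
      },
    },
  };
}

// Usage: the body would be sent to `<configured log indices>/_search`.
const logRateBody = buildLogRateQuery(
  { startTime: "2020-06-03T00:00:00Z", endTime: "2020-06-03T12:00:00Z" },
  "15m",
);
console.log(JSON.stringify(logRateBody, null, 2));
```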

Metrics

The metrics section will consist of a chart panel based on system metrics aggregated at the host level only. Kubernetes and container metrics will be looked at in future iterations.

The different aggregates will show:

Number of hosts
CPU usage (used vs. available)
Memory usage (used vs. available)
Disk usage (used vs. allocated)
Inbound traffic MB/s
Outbound traffic MB/s

The progress bar visualization will indicate used vs. capacity.

Data query

TODO: Show ES query example to get the aforementioned data
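A possible starting point, sketched as a TypeScript function returning a search body. The field names (`host.name`, `event.module`, `system.cpu.total.norm.pct`, `system.memory.actual.used.pct`, `system.filesystem.used.pct`) are assumptions based on Metricbeat's `system` module, and `buildHostMetricsQuery` is a hypothetical name. Averaging over all documents in the range is a simplification of "used vs. available"; the traffic metrics are left out here because the network byte fields are assumed to be cumulative counters that would need an additional rate calculation.

```typescript
// Hypothetical helper: host-level summary stats for the Metrics panel.
// All field names are assumptions (Metricbeat system module + ECS).
function buildHostMetricsQuery(startTime: string, endTime: string) {
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          { range: { "@timestamp": { gte: startTime, lte: endTime } } },
          { term: { "event.module": "system" } },
        ],
      },
    },
    aggs: {
      // Number of distinct hosts reporting in the time range
      hosts: { cardinality: { field: "host.name" } },
      // Average normalized CPU usage (0..1) across all hosts
      cpu: { avg: { field: "system.cpu.total.norm.pct" } },
      // Average fraction of memory in use
      memory: { avg: { field: "system.memory.actual.used.pct" } },
      // Average fraction of filesystem space in use
      disk: { avg: { field: "system.filesystem.used.pct" } },
    },
  };
}

const hostMetricsBody = buildHostMetricsQuery("now-15m", "now");
console.log(JSON.stringify(hostMetricsBody, null, 2));
```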

APM

The APM data panel will show the number of services, transactions and error rate.

Data query

Aggregate total number of services
Aggregate total number of processor.event: transaction
Aggregate error rate across aggregate of processor.event: transaction
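A rough sketch of these three aggregates as a single search body, plus a helper for the error rate. It assumes APM documents carry `processor.event` and `service.name`, and it defines the error rate as error documents divided by transaction documents; that definition, and the names `buildApmOverviewQuery` and `errorRate`, are assumptions for illustration only.

```typescript
// Hypothetical helper: APM overview aggregates in one request.
function buildApmOverviewQuery(startTime: string, endTime: string) {
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          { range: { "@timestamp": { gte: startTime, lte: endTime } } },
          { terms: { "processor.event": ["transaction", "error"] } },
        ],
      },
    },
    aggs: {
      // Total number of distinct services
      services: { cardinality: { field: "service.name" } },
      // Document counts per event type ("transaction", "error")
      byEvent: { terms: { field: "processor.event" } },
    },
  };
}

interface TermsBucket {
  key: string;
  doc_count: number;
}

// Assumed definition: errors per transaction in the selected range.
function errorRate(buckets: TermsBucket[]): number {
  const count = (key: string) =>
    buckets.find((b) => b.key === key)?.doc_count ?? 0;
  const transactions = count("transaction");
  return transactions === 0 ? 0 : count("error") / transactions;
}

console.log(
  errorRate([
    { key: "transaction", doc_count: 200 },
    { key: "error", doc_count: 5 },
  ]),
);
```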

Uptime

The Uptime panel will show the number of pings over time, grouped by up/down status. The stats will show the total number of monitors, as well as how many of them are up and how many are down.

Data query

The uptime monitors visualization already exists in Uptime.

Aggregate number of pings grouped by up and down
Aggregate total number of monitors
Aggregate number of monitors reporting "up"
Aggregate number of monitors reporting "down"

TODO: Show example of ES query
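One way this could be sketched, assuming Heartbeat documents with `monitor.id` and `monitor.status` ("up"/"down"). Monitors are counted as up or down based on their most recent ping, fetched with a `top_hits` sub-aggregation per monitor. The function names and the 1000-monitor cap are hypothetical.

```typescript
// Hypothetical helper: Uptime overview aggregates (field names assumed).
function buildUptimeQuery(startTime: string, endTime: string, bucketSize: string) {
  return {
    size: 0,
    query: { range: { "@timestamp": { gte: startTime, lte: endTime } } },
    aggs: {
      // Pings over time, split by up/down
      pings: {
        date_histogram: { field: "@timestamp", fixed_interval: bucketSize },
        aggs: { status: { terms: { field: "monitor.status" } } },
      },
      // Total number of monitors
      monitors: { cardinality: { field: "monitor.id" } },
      // Latest ping per monitor, used to classify it as up or down
      latestPerMonitor: {
        terms: { field: "monitor.id", size: 1000 },
        aggs: {
          latest: {
            top_hits: {
              size: 1,
              sort: [{ "@timestamp": { order: "desc" } }],
              _source: ["monitor.status"],
            },
          },
        },
      },
    },
  };
}

// Counts monitors by the status of their most recent ping, as extracted
// from each `latestPerMonitor` bucket's top hit. Other values are ignored.
function countUpDown(latestStatuses: string[]): { up: number; down: number } {
  let up = 0;
  let down = 0;
  for (const status of latestStatuses) {
    if (status === "up") up += 1;
    else if (status === "down") down += 1;
  }
  return { up, down };
}

console.log(countUpDown(["up", "down", "up"]));
```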

Alerts and alerts activity

The alerts section will consist of two panels. The first, Alerts distribution, shows the number of alerts triggered, grouped by type.

The second panel will focus on recent activity, with direct links to alert detail views. Each alert entry will link to its alert detail page and show the total number of alert instances within the selected time range, along with its tags.

Resources

As another section of content, we will provide users with options to go straight to the documentation, the Discuss forum, or training resources.

Documentation (Note: Docs will provide a direct URL for the Observability section in docs)
Discuss forum
Training and webinars

News feed

The news feed will consist of Observability related blog posts and other industry-related stories.

Kibana news feeds can be set up by providing a .yml feed in the Newsfeed repository and using the Kibana news feed service to show the content.


elasticmachine · 2020-06-03T20:14:43Z

Pinging @elastic/apm-ui (Team:apm)

formgeist · 2020-06-09T09:28:21Z

Design update - 9 June 2020

We received some feedback on some specific areas of the design, so I've updated the examples above. Here's a quick changelog:

Updated the Metrics data panels with a KPI style for the traffic metrics as well and added the number of hosts as per feedback from @sorantis and @cyrille-leclerc
Replaced the section "add data" links with a "view in app" option that will link to each individual app for further investigation.

I've also put together a quick responsive layout example of how we want to let the data panels grow while retaining a fixed width for the alert column. Allowing the primary data panels (logs, metrics, etc.) to grow means inspecting the visualizations becomes easier on larger screens, whereas the alert visualization and activity feed don't benefit much from growing larger.

sorantis · 2020-06-09T10:16:45Z

It's worth noting that the initial scope for Metrics is Hosts only. Kubernetes and containers will only be considered in future iterations.

formgeist · 2020-06-09T11:08:43Z

It's worth noting that the initial scope for Metrics is Hosts only. Kubernetes and containers will only be considered in future iterations.

Thanks @sorantis, I've made a note of it in the Metrics section in the description, along with more specifics around each metric. I also updated the traffic metrics so they no longer show a progress bar, because we're simply showing aggregated traffic, not a typical used vs. allocated value as the bar implied. I had copied over the same stat component as the others and forgot to remove it.

sorantis · 2020-06-09T11:19:34Z

@formgeist what do you think about showing a tiny graph underneath the traffic metrics instead of a progress bar?

formgeist · 2020-06-09T13:40:35Z

@formgeist what do you think about showing a tiny graph underneath the traffic metrics instead of a progress bar?

We should be able to graph a time-series chart underneath, so here's an example of using a sparkline for the traffic metrics.

Thoughts?

felixbarny · 2020-06-29T13:00:57Z

Aggregate total number of processor.event: transaction

Aggregate error rate across aggregate of processor.event: transaction

As we're also showing the rate of errors, maybe a more useful metric than the total number of transactions would be the rate of transactions per minute. This metric depends less on a secondary context, the selected time frame, and is thus easier to understand on its own, rather than requiring users to check what the time range is. This gets especially complicated if a time range in the past is used, where you'd have to do some arithmetic to know how many hours are in that time range.

The same considerations apply to the log rate widget.
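To illustrate the normalization being suggested: a per-minute rate only depends on the total count and the length of the selected time range, so the user no longer has to do that arithmetic. A minimal TypeScript sketch (the function name is hypothetical):

```typescript
const MS_PER_MINUTE = 60000;

// Converts a total count over [startMs, endMs) into an average per-minute rate.
function ratePerMinute(totalCount: number, startMs: number, endMs: number): number {
  const minutes = (endMs - startMs) / MS_PER_MINUTE;
  if (minutes <= 0) {
    throw new RangeError("endMs must be after startMs");
  }
  return totalCount / minutes;
}

// 36,000 transactions over a 6-hour window is 100 transactions per minute,
// regardless of when that 6-hour window was.
const sixHoursMs = 6 * 60 * MS_PER_MINUTE;
console.log(ratePerMinute(36000, 0, sixHoursMs));
```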

sorenlouv · 2020-06-29T13:17:11Z

maybe a more useful metric than the total number of transactions would be the rate of transactions per minute

I agree, this would be easier to understand. This is also aligned with what we already show in APM.

formgeist · 2020-06-30T08:24:41Z

@felixbarny @sqren I think both suggestions are very reasonable - let's make sure to change the data contracts with the Logs team to be able to provide log rate per second/minute instead of the aggregate count. @cauemarcondes Will you open a new issue for this with the Logs UI team?

afgomez · 2020-07-01T12:26:02Z

maybe a more useful metric than the total number of transactions would be the rate of transactions per minute [...] The same considerations apply to the log rate widget.

The date histogram already shows a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

By using the log rate per minute instead of the count we will show two very similar metrics in two places. If we use the count, we show both total volume and rate (which, more is better, right? right?).

Is there a use case that I'm missing? Is the log rate per minute (vs per bucket size) such an interesting metric that deserves to exist on its own?

formgeist · 2020-07-01T13:49:01Z

The date histogram already shows a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

As I understand it, the visualization we've been referencing in the design is the Log entries visualization.

The challenge I see is that the bucket size is not dynamic in the current logs visualization; it's fixed to 15-minute buckets. I'm not sure about the reasoning behind that decision. And if we add the Transaction rate for APM, which will be dynamic down to per minute, it'll be hard for users to correlate the two charts if they want to. Maybe that's just because I'm not all that familiar with logs and rates.

afgomez · 2020-07-01T13:57:50Z

it's fixed to 15 minute buckets. Not sure about the reasoning behind that decision?

I think it's related to how the ML job processes the log entries, but don't quote me on that. @weltenwort can probably give you the right answer.

if we add the Transaction rate for APM, which will be dynamic down to per minute, it'll be hard for the user to correlate the two charts if they want to.

The query I'm writing for the dashboard will use whatever startTime, endTime and bucketSize are passed as parameters. I assume other plugins will use the same parameters, so the graphs should all be equivalent for the provided time range.

Edit: Ongoing work for the query #70413

jasonrhodes · 2020-07-01T13:58:53Z

Yeah, what @afgomez said: you can't really use the existing chart as a reference because it's tied completely to ML, and we are building something for the overview page that doesn't use ML at all for this rate.

sorenlouv · 2020-07-01T18:34:54Z

I think it's related to how the ML job processes the log entries, but don't quote me on that. @weltenwort can probably give you the right answer.

Off-topic: We also ran into this for APM. We went a little overboard and interpolated the ML values when the buckets are smaller than 15 minutes so they fit with our APM data. I don't think this is necessary, but it's nice now that we have it.

it's fixed to 15 minute bucket

I also thought that was the case but turns out the bucket size is dynamic (in this case the bucket size is 5265 minutes):

So perhaps the text that says "Bucket span: 15 minutes" should be updated to avoid confusion?

felixbarny · 2020-07-02T06:34:59Z

The date histogram already shows a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

Especially if there's a lot of variability in the chart, it's not always easy to know what the average is. If, for example, you want to compare the average log rate before vs. after a release, it will be really helpful to have that on the chart rather than having the user calculate it from all the data points in the chart.

Is there a use case that I'm missing? Is the log rate per minute (vs per bucket size) such an interesting metric that deserves to exist on its own?

I think it's even a benefit in terms of consistency if both the single metric and the date histogram chart refer to the exact same metric. I've seen this as a common practice in other dashboarding tools where you'd have certain metrics, like avg, min, max, in the legend for a graph next to the color and the label for the line. That's basically condensing all the values in the chart to a single value.

The challenge I see is that the bucket size is not dynamic in the current logs visualization, it's fixed to 15 minute buckets.

I think that ideally, the metric should be the same for the overall metric count and the metric shown in the date histogram chart. Maybe it's just me but I prefer to have normalized values that don't change as you change the date range. For example, instead of showing the number of total logs per bucket, we may normalize it to log rate per minute, no matter if the bucket size is 1m, 15m, or 5265m.
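A minimal sketch of the per-bucket normalization described above, assuming date-histogram buckets that expose a start `key` and a `doc_count`, with the bucket size known in minutes (the type and function names are hypothetical). A partial first or last bucket would under-report its rate unless its actual duration is used instead of the nominal bucket size.

```typescript
// Hypothetical shape of a date-histogram bucket: start time + document count.
interface HistogramBucket {
  key: number; // bucket start, epoch milliseconds
  doc_count: number;
}

interface RateBucket {
  key: number;
  perMinute: number;
}

// Divides every bucket's count by the bucket size, so the chart shows
// "log entries per minute" whether buckets are 1m, 15m, or 5265m wide.
function normalizeToPerMinute(
  buckets: HistogramBucket[],
  bucketSizeMinutes: number,
): RateBucket[] {
  if (bucketSizeMinutes <= 0) {
    throw new RangeError("bucketSizeMinutes must be positive");
  }
  return buckets.map((b) => ({
    key: b.key,
    perMinute: b.doc_count / bucketSizeMinutes,
  }));
}

console.log(normalizeToPerMinute([{ key: 0, doc_count: 150 }], 15));
```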

formgeist · 2020-07-02T10:52:23Z

We had a Zoom call to discuss the above feedback and next steps. We decided to continue showing the log rate normalized to a fixed unit (per minute). @afgomez will handle the changes in #70413

formgeist added Team:APM, Team:Observability, v7.9.0, and Feature:Observability Landing - Milestone 1 labels Jun 3, 2020

formgeist changed the title ~~[Observability] Landingpage - Milestone 2 - Overview with data~~ [Observability] Overview page with data Jun 3, 2020

formgeist changed the title ~~[Observability] Overview page with data~~ [Observability] Homepage experience (Milestone 1) Jun 3, 2020

formgeist assigned cauemarcondes Jun 3, 2020

formgeist added [zube]: Inbox Feature:Observability Landing - Milestone 1 and removed Feature:Observability Landing - Milestone 1 [zube]: Inbox labels Jun 3, 2020

This was referenced Jun 4, 2020

[Observability]Make a client to expose the indices to read data #68230

Closed

[APM] Make lightweight client for reading APM event data #67397

Closed

Add tootip sync mechanism elastic/elastic-charts#695

Closed

cauemarcondes added [zube]: In Progress and removed [zube]: (7.9) Planned for release labels Jun 8, 2020

This was referenced Jun 23, 2020

[Logs UI] Register function(s) for homepage data retrieval #68531

Closed

[Metrics UI] Register function(s) for homepage data retrieval #68532

Closed

andrewvc mentioned this issue Jun 23, 2020

[Uptime] Add Data Connector for observability homepage #69716

Closed

cauemarcondes mentioned this issue Jun 29, 2020

Chart loading status elastic/elastic-charts#202

Open

cauemarcondes mentioned this issue Jul 6, 2020

Observability overview page #69141

Merged

felixbarny mentioned this issue Jul 7, 2020

[Observability] Homepage Log rate metrics #70917

Open

cauemarcondes closed this as completed in #69141 Jul 8, 2020

zube bot added [zube]: Done and removed [zube]: In Progress labels Jul 8, 2020

sorenlouv removed the [zube]: Done label Jul 15, 2020
