[APM] Improve UI of list of services page and match with latest service map design #262

Closed
alex-fedotyev opened this issue Apr 30, 2020 · 14 comments

@alex-fedotyev

alex-fedotyev commented Apr 30, 2020

Summary of the problem (If there are multiple problems or use cases, prioritize them)
(7/15 - updating mock up to a better quality + adding health chart column)

APM Home page currently shows the list of all services.
As a user, I want to quickly get to the service that is having a problem, or to a service I am interested in for another reason.

The proposal is to improve the overall UI of the page, match the latest service map design, and introduce sparklines to the page:

[Mockup: Services inventory]

User stories
As a user, I want to visually identify when any of the services are experiencing problems such as higher response times, higher error rates, and/or a significant drop in request volume.

List known (technical) restrictions and requirements

If in doubt, don’t hesitate to reach out to the #observability-design Slack channel.

@elasticmachine
Contributor

Pinging @elastic/observability-design (design)

@alex-fedotyev alex-fedotyev changed the title Improve UI of list of services page and match with latest service map Improve UI for list of services page and match with latest service map Apr 30, 2020
@alex-fedotyev alex-fedotyev changed the title Improve UI for list of services page and match with latest service map Improve UI of list of services page and match with latest service map design Apr 30, 2020
@alex-fedotyev
Author

alex-fedotyev commented Jul 16, 2020

Added a better-quality mockup, plus one more column to display health as a spark chart.
The health chart would be a timeline of the alerts and ML anomalies raised for each service during the selected time frame.

Open question:

  • Currently, on the service map, service health is displayed as a color indicator that is calculated from the worst state across the selected time range.
  • Would it be proper to match that and show the health state as a color for each service together with the health timeline spark chart, or would showing the trend be enough?
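
To make the two options concrete, here is a minimal sketch of the "worst state across the selected time range" rule next to the per-bucket values a health spark chart would plot. The `Health` levels, the `HealthSample` shape, and the ranking are hypothetical and only illustrate the idea; they are not taken from the APM or service map code.

```typescript
// Hypothetical health levels, ranked from best (0) to worst (2).
type Health = "healthy" | "warning" | "critical";

const HEALTH_RANK: Record<Health, number> = {
  healthy: 0,
  warning: 1,
  critical: 2,
};

// One health observation per time bucket in the selected range (assumed shape).
interface HealthSample {
  timestamp: number; // epoch milliseconds
  health: Health;
}

// Single color indicator: the worst state seen anywhere in the range.
// Returns undefined when there are no samples at all.
function worstHealth(samples: HealthSample[]): Health | undefined {
  let worst: Health | undefined;
  for (const { health } of samples) {
    if (worst === undefined || HEALTH_RANK[health] > HEALTH_RANK[worst]) {
      worst = health;
    }
  }
  return worst;
}

// Spark chart input: one rank per bucket, in time order.
function healthTimeline(samples: HealthSample[]): number[] {
  return [...samples]
    .sort((a, b) => a.timestamp - b.timestamp)
    .map((s) => HEALTH_RANK[s.health]);
}

const lastHour: HealthSample[] = [
  { timestamp: 0, health: "healthy" },
  { timestamp: 60000, health: "critical" },
  { timestamp: 120000, health: "healthy" },
];

console.log(worstHealth(lastHour)); // critical
console.log(healthTimeline(lastHour)); // ranks per bucket: 0, 2, 0
```

In this example the color indicator alone would show "critical" for the whole range, while the timeline shows that the service was critical in only one bucket.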

@formgeist
Contributor

Here's a first draft of a hi-res mock of the proposed updates.

[Mockup: Main]

A few comments on the design:

  • I opted to display the health as a status that falls into the same severity levels we have from ML, which we should possibly also show in the service map (as those are the levels we're basing our current statuses on). I'm not sure a health trend is feasible at this time, but I'm happy to make an example of what that trend chart would look like as well.
  • The sparklines and example values are obviously not correct; this is just to get the layout and overall elements in place.

Thoughts on this?

@graphaelli
Member

I really like recovering the space used for the agent name and the addition of health - did you consider health in the first column? Not advocating for that but interested in thoughts on it.

Are the metrics next to the sparklines the latest value or the average over the period shown? Is there any interaction with the sparkline, like hover?

@alex-fedotyev
Author

alex-fedotyev commented Aug 5, 2020

Looks great already!

Regarding health indicators:

  • Using same health indicators as on the service map looks awesome!
  • I think going forward we should consider showing a health trend sparkchart to make the UX more useful for multi-hour or multi-day time ranges. Quickly seeing whether services were "red" the whole time or only for the last 15 minutes is very informative. Additionally, showing the health trend in a popup on the service map would keep the two aligned. @formgeist - could you put together a mock of how that could look?
  • @graphaelli My initial thought was to move the health indicators to the left as well. But now that I think the health indicators would include a health trend sparkchart, it feels better to keep them together with the other metrics and sparklines.

Metrics:

  • Good question about showing the last value vs. the average value. I think showing averages is more useful because it enables better column sorting logic. From a workflow perspective, the user's goal is to align the time range selection with the period when there were production issues, and then to pick out the slowest or most error-prone services.
  • Interactions could be rather simple, e.g. hovering shows the numeric value at that time. It would be cool to sync the mouse position across all charts, but I'm not sure that is reasonable with sparklines. Have you encountered any interesting designs for interactions with sparklines?

@alex-fedotyev
Author

Suggestion - would it be better to show total # of calls vs TPM?

This thought follows from my comment above on showing the average duration and average % of errors instead of the last value.
The goal of sorting by "Calls" is to show which services processed the most requests in the selected time frame. Calls/min risks showing values below 1 for lightly loaded services or staging environments, while the total # of calls will always be a whole number.
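
As a rough illustration of the trade-off, the sketch below computes both numbers for two made-up services over a 60-minute range. The `ServiceLoad` shape, the service names, and the counts are all hypothetical.

```typescript
// Hypothetical per-service request counts over the selected time range.
interface ServiceLoad {
  name: string;
  totalCalls: number;
}

// Average calls per minute over a range expressed in minutes.
function callsPerMinute(totalCalls: number, rangeMinutes: number): number {
  if (rangeMinutes <= 0) {
    throw new RangeError("rangeMinutes must be positive");
  }
  return totalCalls / rangeMinutes;
}

const rangeMinutes = 60;
const services: ServiceLoad[] = [
  { name: "checkout", totalCalls: 12000 },
  { name: "staging-batch", totalCalls: 30 },
];

for (const s of services) {
  // checkout: 12000 total, 200 calls/min
  // staging-batch: 30 total, 0.5 calls/min (a rate below 1)
  console.log(s.name, s.totalCalls, callsPerMinute(s.totalCalls, rangeMinutes));
}
```

Because every row is divided by the same range length, sorting by either column gives the same order; the difference is only in how the displayed number reads.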

@nehaduggal

I love the design!

I second Gil's feedback of moving the chart indicator to the left. Probably if we add a spark chart to track the health it might make more sense to have a column. For maps we were planning on adding alert based indicators too - how would that be represented on this health chart? Would the sparkline chart track both? Or would we add another column?

The agent types on the services page are aesthetically pleasing and aligned with the service maps view but from a user perspective I am not sure if they are actually helpful. By moving it to the left we are definitely claiming some pixels back and improving the look of this page but I was wondering if users would lose much if we took those off completely?

@eyalkoren
Contributor

eyalkoren commented Aug 6, 2020

This looks really good!

Regarding latest measurement vs. average (or max or whatever): maybe it makes sense to assume two different workflows. One is as @alex-fedotyev described: start an issue investigation based on a time range, where average/max makes sense because it allows you to sort and drill down.
The second workflow is using this view (or the service map, or metrics for that matter) as a dashboard constantly presenting the updated state/health of the environment. I think that using auto-refresh is a good indication of that, so maybe we can use it: if auto-refresh is enabled, show the latest measurement; otherwise show the aggregated value.
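
A minimal sketch of that rule, assuming the metric is available as a plain array of per-bucket values and that the auto-refresh state is passed in as a boolean. The function name and the choice of a mean as the aggregate are illustrative only.

```typescript
// Picks the number shown next to a sparkline.
// autoRefreshEnabled = true  -> dashboard workflow: show the latest measurement.
// autoRefreshEnabled = false -> investigation workflow: show the average over the range.
function displayValue(
  series: number[],
  autoRefreshEnabled: boolean
): number | undefined {
  if (series.length === 0) {
    return undefined; // nothing to show for an empty range
  }
  if (autoRefreshEnabled) {
    return series[series.length - 1];
  }
  const sum = series.reduce((acc, value) => acc + value, 0);
  return sum / series.length;
}

const responseTimesMs = [100, 200, 600];
console.log(displayValue(responseTimesMs, true)); // 600 (latest)
console.log(displayValue(responseTimesMs, false)); // 300 (average)
```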

Regarding agent icons - I think they are both nice visually and useful.

@formgeist
Contributor

Design update, 12 Aug 2020

First and foremost, thank you for all the positive and constructive feedback I've received on the initial mock. A lot of the same things were echoed, but there were possibly also some conflicting ideas. I've tried to distill most of them into the changes I've made to the updated mocks.

[Mockup: updated design]

In the updated design, I've made the following changes:

  • Moved the health status to the first column in the table.
  • Showing an example of non-Elastic agent e.g. Jaeger, which will use the default service type icon. (Specifically for Jaeger, we could choose to include the Jaeger icon, because it's officially supported).
  • Changed the health status indication to a badge. The current EuiHealth indicator is typically a dot and a label, but I find the three statuses harder to distinguish that way, so we're pursuing a change to EuiBadge. We'll propose to the EUI team that we reconsider the current EuiHealth component design and treat this as an enhancement.
  • Combined the agent and name columns, so the agent name is no longer a separate column (and therefore can no longer be sorted on). We weren't sure how often a user would sort the table by agent name, and they additionally have the option to filter by agent name using the UI filters available on the left.
  • Converted the sparkline charts next to each metric from an area to a line. This reduces noise when viewing the charts, especially in tables with a lot of rows per page.
  • Additionally, I suggest that we don't add any interactions for the sparklines at this time, both to reduce scope and because the user can go to each individual service landing page to get a better view of them.

Open questions which haven't been addressed yet:

  • What column should we sort by default? Health? Alphabetically until the users choose to sort by either of the columns available?
  • What happens if all services don't have ML anomaly detection? Do we simply hide the health status if all services report unknown and show a callout indicating that ML anomaly detection would be able to give this indication in the table?
  • Calls per min. vs. total calls as suggested by Alex
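
For the second question, one possible rule could be sketched as follows. The `HealthStatus` values, the `ServiceRow` shape, and the returned flags are hypothetical and not an existing Kibana API.

```typescript
// Hypothetical statuses; "unknown" stands for a service without ML anomaly detection.
type HealthStatus = "healthy" | "warning" | "critical" | "unknown";

interface ServiceRow {
  name: string;
  health: HealthStatus;
}

interface HealthColumnState {
  showHealthColumn: boolean; // render the health column at all
  showMlCallout: boolean; // suggest enabling ML anomaly detection
}

// Hide the column and show a callout only when every listed service is "unknown".
function healthColumnState(rows: ServiceRow[]): HealthColumnState {
  const anyKnown = rows.some((row) => row.health !== "unknown");
  return {
    showHealthColumn: anyKnown,
    showMlCallout: !anyKnown && rows.length > 0,
  };
}

console.log(
  healthColumnState([
    { name: "checkout", health: "unknown" },
    { name: "cart", health: "unknown" },
  ])
); // column hidden, callout shown
```

With a mix of known and unknown statuses, this sketch keeps the column and leaves the individual "unknown" rows as they are.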

I ❤️ all the ideas provided in the comments above, but I'm also aware of the time constraint we currently have with completing this design to allow the devs to pick it up and get cracking on it. I'm sure there'll be more feedback once we have a working version of the current proposal.

@alex-fedotyev
Author

Re: Sorting

  • I think that sorting by load (requests) descending would be a good default option.
  • In my experience, the most important services are often among the top 10-20 by load. Conversely, response time or error rate can mean different things for different services and may not bring the most important ones to the first page (depending on the service role, 50 ms could be too slow while 3 seconds is just fine!).

@graphaelli
Member

I buy the reasoning @alex-fedotyev.

> What happens if all services don't have ML anomaly detection? Do we simply hide the health status if all services report unknown and show a callout indicating that ML anomaly detection would be able to give this indication in the table?

That sounds a lot better than an unknown in every row.

> Calls per min. vs. total calls as suggested by Alex

I think a rate is preferable to the total. I like per min since that's what we show now.

@formgeist formgeist changed the title Improve UI of list of services page and match with latest service map design [APM] Improve UI of list of services page and match with latest service map design Aug 19, 2020
@formgeist
Contributor

Based on the feedback received from the last update, I've made some minor changes and extended mocks to show different states.

F (sorted by health)

F (without health)

Example of the health indication not available/all unknown, but without the callout as proposed

F (without health + callout)

Example of a callout enticing the user to add anomaly detection to get health indication on their Services list

@nehaduggal

The sorting should be two-fold. First arranged by health status to see the red services on top and then arranged by load. In case a user doesn't have ML enabled then it should default to load.
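
A minimal sketch of this two-fold sort, assuming each row carries a health status and a calls-per-minute value. The type names, the severity order, and the placement of "unknown" last are assumptions made for illustration.

```typescript
// Hypothetical statuses and row shape.
type HealthStatus = "critical" | "warning" | "healthy" | "unknown";

interface ServiceRow {
  name: string;
  health: HealthStatus;
  callsPerMinute: number;
}

// Lower number sorts first, so "red" services come out on top.
// Putting "unknown" last is an assumption, not something decided in the thread.
const SEVERITY_ORDER: Record<HealthStatus, number> = {
  critical: 0,
  warning: 1,
  healthy: 2,
  unknown: 3,
};

// With ML enabled: sort by health first, then by load (descending).
// Without ML: sort by load (descending) only.
function sortServices(rows: ServiceRow[], mlEnabled: boolean): ServiceRow[] {
  return [...rows].sort((a, b) => {
    if (mlEnabled) {
      const bySeverity = SEVERITY_ORDER[a.health] - SEVERITY_ORDER[b.health];
      if (bySeverity !== 0) {
        return bySeverity;
      }
    }
    return b.callsPerMinute - a.callsPerMinute;
  });
}

const rows: ServiceRow[] = [
  { name: "frontend", health: "healthy", callsPerMinute: 500 },
  { name: "billing", health: "critical", callsPerMinute: 10 },
  { name: "checkout", health: "critical", callsPerMinute: 50 },
  { name: "search", health: "warning", callsPerMinute: 100 },
];

console.log(sortServices(rows, true).map((r) => r.name));
// order: checkout, billing, search, frontend
console.log(sortServices(rows, false).map((r) => r.name));
// order: frontend, search, checkout, billing
```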

@formgeist
Contributor

Moving the design to implementation to be tracked in elastic/kibana#75252


7 participants