[Infra UI] : Enhanced Disk Space Support for Hosts #164151

roshan-elastic · 2023-08-17T09:59:37Z

🔗 Key Links

Hosts List

Issues

Note : This will be completed once the epic has been refined and issues created

Issues/Tasks


📖 Description

We are obscuring when a disk (mount point) may be running out of space because we average the space used across all mount points on the host, which hides the fact that an individual mount point may be nearly full.

This issue is to update the Host experience so the SRE can easily see when a host has a mount point that is close to running out of space, and can analyse it.

Background

A consistent use case called out by our users is the ability to 'catch' hosts before they run out of storage so they can address it before it causes the host to stop functioning correctly.

To cater for this, we show average(system.filesystem.used.pct) prominently throughout the Hosts experience:

Hosts List

Overview fly-out - KPI tile

Overview fly-out metrics

Disk Space by mount point (asset detail view)

What's the problem?

When a host has multiple mount points, multiple documents are emitted - one per mount point:

This allows Elastic to understand the space available per mount point:

Space per mount point for a host

However, when we show disk space for the host as a whole, we show the average(), so the user cannot tell that one volume is close to running out of space (it is disguised by the average across all mount points):

Average space per mount point for a host
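
To make the effect concrete, here is a minimal sketch with made-up numbers. The host name, mount points and usage values are hypothetical; the `used_pct` key stands in for system.filesystem.used.pct, and the document shape is simplified for illustration.

```python
from statistics import mean

# Hypothetical documents: one per mount point for a single host.
docs = [
    {"host": "host-a", "mount_point": "/", "used_pct": 0.30},
    {"host": "host-a", "mount_point": "/var", "used_pct": 0.97},
    {"host": "host-a", "mount_point": "/home", "used_pct": 0.20},
    {"host": "host-a", "mount_point": "/boot", "used_pct": 0.13},
]

# Averaging across mount points hides that /var is almost full.
avg_used = mean(d["used_pct"] for d in docs)
print(f"average used across mount points: {avg_used:.0%}")
```

With these values the host-level average lands around 40%, even though one mount point is at 97%.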

Here is an example from @Danouchka:

Example on his host

💡 Solution Proposal

This is a starting point for ideas; there are likely better solutions.

We swap out the current disk space usage metric average(system.filesystem.used.pct) for a max(system.filesystem.used.pct) so the user can always see at a glance if a host is about to run out of space:

Sample data illustrating the difference between average() and max()
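
As a rough sketch of what the swap means at the query level, the snippet below builds an Elasticsearch search body that aggregates system.filesystem.used.pct per host with either `avg` or `max`. This is not taken from the Hosts view implementation: the helper name `disk_usage_request`, the aggregation names, and the use of `host.name` as the grouping field are assumptions made for illustration.

```python
import json


def disk_usage_request(agg: str) -> dict:
    """Build a search body aggregating disk usage per host with `agg` ("avg" or "max")."""
    if agg not in ("avg", "max"):
        raise ValueError(f"unsupported aggregation: {agg}")
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "hosts": {
                # Assumption: hosts are identified by the host.name field.
                "terms": {"field": "host.name"},
                "aggs": {
                    "disk_used": {agg: {"field": "system.filesystem.used.pct"}}
                },
            }
        },
    }


# The proposal amounts to changing one word in the metric aggregation.
current = disk_usage_request("avg")
proposed = disk_usage_request("max")
print(json.dumps(proposed, indent=2))
```

The two bodies differ only in the aggregation type, which is why the change is cheap to make but changes what the number means to the user.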

Limitations

If we show max() instead of average(), users won't easily be able to dig into which mount point is running out of space (the information is available in the host detail view, but nothing leads them to it).

How might we allow them to easily debug which mount point is close to running out of space?
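
One possible direction, sketched purely as an illustration and not as a decided design: alongside the max() value, surface the mount point that produced it. The function name `fullest_mount_points`, the simplified document shape, and the sample values are all hypothetical.

```python
def fullest_mount_points(docs):
    """Return {host: (mount_point, used_pct)} for the fullest mount point of each host."""
    fullest = {}
    for doc in docs:
        host = doc["host"]
        current = fullest.get(host)
        # Keep the mount point with the highest used percentage seen so far.
        if current is None or doc["used_pct"] > current[1]:
            fullest[host] = (doc["mount_point"], doc["used_pct"])
    return fullest


# Hypothetical per-mount-point documents for two hosts.
sample = [
    {"host": "host-a", "mount_point": "/", "used_pct": 0.30},
    {"host": "host-a", "mount_point": "/var", "used_pct": 0.97},
    {"host": "host-b", "mount_point": "/", "used_pct": 0.55},
    {"host": "host-b", "mount_point": "/data", "used_pct": 0.41},
]

result = fullest_mount_points(sample)
for host, (mount, pct) in result.items():
    print(f"{host}: {mount} is the fullest mount point at {pct:.0%}")
```

Showing the mount point name next to the max() value would let the SRE go straight from the Hosts List to the volume that needs attention.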

✔️ Acceptance criteria

What must this feature have?

1. Must Have

Must be delivered in this issue in order for the release to be valuable

Name	Description	Notes
The Host List, fly-out and detail views must allow the SRE to know if space for a mount point on a host is running low at a glance	-	-

2. Should Have

Name	Description	Notes
The SRE should be able to easily debug which mount point is running out of space	-	Once the SRE sees a mount point is running out of space, they will want to debug which one it is and understand why.

3. Could Have

Would be nice to have but not critical

Name	Description	Notes
-	-	-

4. Will Not Have (for now)

Explicitly will not be looked at within this issue

Name	Description	Notes
Predictions of when it might run out	-	@Danouchka (senior solutions architect) has a very powerful illustration leveraging ML which predicts when mount points are running out of space (I'll try and get a video showing this)

🚗 Use Cases

A selection of use cases to think about

As an SRE, I need to understand whether hosts have any mount points which have run out of space, or could do so soon, so I can ensure this is handled before it impacts the performance of the host.

📈 Telemetry Process

Telemetry requirements must be part of the acceptance criteria (above) (defined by the Epic creator, e.g. the Product Manager) during refinement.
See Telemetry Process for full details/process/implementation conventions.


elasticmachine · 2023-08-17T09:59:40Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

elasticmachine · 2023-11-14T00:04:51Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

botelastic · 2024-05-12T13:51:52Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

roshan-elastic · 2024-08-15T10:03:47Z

A lot of this has been delivered. We can create new issues for more disk work when needed.

roshan-elastic mentioned this issue Aug 17, 2023

[Infra UI] Normalise network and Disk rates for Hosts across time ranges #164152

Closed

smith added the Team:obs-ux-infra_services label and removed the Team:Infra Monitoring UI - DEPRECATED label Nov 14, 2023

roshan-elastic changed the title ~~[Infra UI] : Disk Space Available - Handle Mount Points~~ [Infra UI] : Enhanced Disk Space Support for Hosts Nov 14, 2023

roshan-elastic mentioned this issue Mar 20, 2024

[Infra UI] Update Host Disk Partition Average Metrics to 'max' #179044

Closed

3 tasks

botelastic bot added the stale label May 12, 2024

smith added the enhancement label May 18, 2024

botelastic bot removed the stale label May 18, 2024

roshan-elastic closed this as not planned Aug 15, 2024
