Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] host metrics #2129

Merged
merged 16 commits into from
Mar 1, 2023
Merged

[RFC] host metrics #2129

merged 16 commits into from
Mar 1, 2023

Conversation

ruflin
Copy link
Contributor

@ruflin ruflin commented Jan 5, 2023

This RFC adds a proposal to bring host metrics to ECS. These metrics should build the foundation to deliver a minimal set of metrics related to a host. The current list looks as following:

  • host.cpu.system.norm.pct
  • host.cpu.user.norm.pct
  • host.fsstats.total_size.used (in bytes)
  • host.fsstats.total_size.total (in bytes)
  • host.fsstats.total_size.used.pct
  • host.load.norm.1
  • host.load.norm.5
  • host.load.norm.15
  • host.memory.actual.used.bytes
  • host.memory.actual.used.pct
  • host.memory.total
  • host.network.egress.bytes
  • host.network.ingress.bytes

One of the main challenges around this RFC is if we should prefix with host.* or system.*. See some more details in the RFC itself. It would be great to hear opinions around it.

This RFC adds a proposal to bring host metrics to ECS. These metrics should build the foundation to deliver a minimal set of metrics related to a host. The current list looks as following:

* host.cpu.system.norm.pct
* host.cpu.user.norm.pct
* host.fsstats.total_size.used (in bytes)
* host.fsstats.total_size.total (in bytes)
* host.fsstats.total_size.used.pct
* host.load.norm.1
* host.load.norm.5
* host.load.norm.15
* host.memory.actual.used.bytes
* host.memory.actual.used.pct
* host.memory.total
* host.network.egress.bytes
* host.network.ingress.bytes

One of the main challenges around this RFC is if we should prefix with `host.*` or `system.*`. See some more details in the RFC itself. It would be great to hear opinions around it.
@ruflin ruflin self-assigned this Jan 5, 2023
@ruflin ruflin mentioned this pull request Jan 5, 2023
@ruflin ruflin marked this pull request as ready for review January 9, 2023 09:35
@ruflin ruflin requested a review from a team as a code owner January 9, 2023 09:35
@ruflin
Copy link
Contributor Author

ruflin commented Jan 9, 2023

Converted from draft to review to start conversations.

@ruflin ruflin requested a review from neptunian January 9, 2023 09:36
@ebeahan
Copy link
Member

ebeahan commented Jan 13, 2023

What's the relationship between this new proposal and RFC 0005 - host metric fields?

cc @kaiyan-sheng who authored RFC 0005.

One of the main challenges around this RFC is if we should prefix with host.* or system.*. See some more details in the RFC itself. It would be great to hear opinions around it.

RFC 0005 established a small group of metric fields under host.*. With those fields already added to the spec, I'd propose continue to use host.*.

@ruflin
Copy link
Contributor Author

ruflin commented Jan 16, 2023

Thanks for pointing out https://github.com/elastic/ecs/blob/main/rfcs/text/0005-host-metric-fields.md @ebeahan It definitively points in the direction of using host.* for the fields. For now I will to with the assumption we keep the host.* prefix until someone objects.

There is an overlap with RFC 5 here and it expands on it. I was aware of the network and disk fields but missed the cpu field. @kaiyan-sheng Can you comment on where these fields are used today? How does it compare to the fields proposed here?

@ruflin
Copy link
Contributor Author

ruflin commented Feb 20, 2023

Can I get some reviews on this PR to get things moving?

@neptunian
Copy link

Can you comment on where these fields are used today? How does it compare to the fields proposed here?

I found this issue created to switch over to using some of the new ECS host fields. It looks like host.cpu.usage wasn't done. Currently it's calculated here as average of system.cpu.user.pct plus average of system.cpu.system.pct divided by system.cpu.cores. Comparing the result of this to host.cpu.usage is very close and would be exact if we were using the normalized system values instead. We should remove this calculation and switch to using host.cpu.usage.

So, from the list we can remove:

  • host.cpu.system.norm.pct
  • host.cpu.user.norm.pct
  • host.network.egress.bytes
  • host.network.ingress.bytes


## Concerns

Currently Elastic Agent and metricbeat ship data host/system metrics under the `system.*` prefix. This would change it to `host.*`. One of the reasons for this is that some metrics for network already exist under this prefix in ECS. Another advantage is that some of these fields might use newer field types like `gauge` and `counter` delivered by TSDB in Elasticsearch which is possible without a breaking change. One of the big advantages is it needs to be figured out how to migrate to it with the existing shippers.
Copy link

@neptunian neptunian Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently Elastic Agent and metricbeat ship data host/system metrics under the system.* prefix. This would change it to host.*. One of the reasons for this is that some metrics for network already exist under this prefix in ECS. Another advantage is that some of these fields might use newer field types like gauge and counter delivered by TSDB in Elasticsearch which is possible without a breaking change

Should this be under "Scope of impact"?

One of the big advantages is it needs to be figured out how to migrate to it with the existing shippers.

I'm confused by this sentence. How is this a big advantage? Seems more like a concern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The s/advantage/concern was a pretty big typo. Fixed it now.

The other part I moved under scope of impact. It somehow is a mix between both.

Copy link
Member

@ebeahan ebeahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some comments and nits, but overall no objections to the premise of the proposal and merging the RFC as stage 0.

- name: host.fsstats.total_size.used
type: long
format: bytes
time_series_metric: gauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the proposal itself, but I don't believe ECS supports a time_series_metric attribute today; we'll need to add in support for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ECS have a spec for the fields.yml file where this can be added?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no one source of truth for the ECS spec. I filed #2176 capturing the requirements needed to support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the issue. For the schema, maybe some parts can be copied over from the package-spec where the same fields.yml structure exists: https://github.com/elastic/package-spec


### CPU ###

# The CPU metrics must indicate under how much load the system is.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we need any guidance how these new proposed host.cpu.* fields are related to the existing host.cpu.usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added this to the list of concerns.

@ruflin
Copy link
Contributor Author

ruflin commented Feb 24, 2023

So, from the list we can remove:

host.cpu.system.norm.pct
host.cpu.user.norm.pct
host.network.egress.bytes
host.network.ingress.bytes

@neptunian Would you remove these from the initial proposal because they are not used or do you think we should keep them as they will be useful?

@neptunian
Copy link

neptunian commented Feb 24, 2023

So, from the list we can remove:

host.cpu.system.norm.pct
host.cpu.user.norm.pct
host.network.egress.bytes
host.network.ingress.bytes

@neptunian Would you remove these from the initial proposal because they are not used or do you think we should keep them as they will be useful?

Maybe I'm missing something. host.network.egress.bytes and host.network.ingress.bytes are already ECS fields. For the CPU usage it looks like we could use host.cpu.usage which is already an ECS field in place of host.cpu.system.norm.pct and host.cpu.user.norm.pct.

@ruflin
Copy link
Contributor Author

ruflin commented Feb 27, 2023

The egress fields I only put in for completness: https://github.com/elastic/ecs/pull/2129/files#diff-eb1c37e580b8c563a0079437420aa4214f94d05d5b683b743b34ca09b0695446R175 If it is confusing, can also remove these.

For the cpu usage, I think it is important for users to see what cpu usage of the kernel vs user space is. If we decide we don't need this, the host.cpu.usage value can be used. The part that is not clear to me, what exactly do we show for host.cpu.usage? Unfortunately the definition does not contain any details around it (which should be fixed).

@neptunian
Copy link

The egress fields I only put in for completness: https://github.com/elastic/ecs/pull/2129/files#diff-eb1c37e580b8c563a0079437420aa4214f94d05d5b683b743b34ca09b0695446R175 If it is confusing, can also remove these.

@ruflin I see, thanks.

For the cpu usage, I think it is important for users to see what cpu usage of the kernel vs user space is. If we decide we don't need this, the host.cpu.usage value can be used. The part that is not clear to me, what exactly do we show for host.cpu.usage? Unfortunately the definition does not contain any details around it (which should be fixed).

Sounds good to me. If it's helpful, host.cpu.usage looks to be the equivalent of system.cpu.total.pct / system.cpu.cores.

Screenshot 2023-02-28 at 10 37 53 AM

@ebeahan
Copy link
Member

ebeahan commented Mar 1, 2023

Will move forward merging as stage 0 with the two approvers. We'll continuing refining the details and addressing any outstanding concerns in subsequent stages.

@ebeahan ebeahan merged commit d5d48c9 into elastic:main Mar 1, 2023
@ruflin ruflin deleted the host-metrics branch March 2, 2023 06:08
@ruflin
Copy link
Contributor Author

ruflin commented Mar 2, 2023

Thanks @ebeahan for getting this in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants