Definition of 'hw.errors' metric for hardware network adapter and physical disks are ambiguous #3132

Closed
sebastien-rosset opened this issue Jan 23, 2023 · 0 comments · Fixed by #3344
Labels
spec:metrics Related to the specification/metrics directory

sebastien-rosset commented Jan 23, 2023

What are you trying to achieve?

Improve the definition of the hw.errors metric in the guidelines for hardware network adapters and physical disks. Add one or more attributes to qualify the type of error.

What did you expect to see?

The definition of the hw.errors metric for hardware network adapters should not be ambiguous. I'm guessing the intent is to count the number of packets that contained errors preventing them from being delivered, i.e., I/O errors, but the spec is not clear. Maybe the intent was to count any error that could happen on the adapter, including packet errors (I/O errors), hardware component failures, and chipset errors.

Additional context.

General Case

In the general case, the spec defines the hw.errors metric as the number of errors encountered by the component. Since the type of error is not qualified, this could include any type of error, such as hardware failure, firmware bugs, I/O bus errors, I/O device errors, machine check exceptions, etc. It's a bit odd to use a single metric for all of these types of errors.

hw.errors in Network adapter metrics

In the network adapter metrics, hw.errors is defined as the number of errors encountered by the network adapter.

For packet I/O errors, it's useful to know the direction of the error, i.e., the number of received packets that couldn't be delivered (malformed packet, CRC error, buffer full, etc.) versus the number of packets that could not be transmitted.

If the spec is not clear, different instruments may report network and disk hw.errors in different ways. Some implementations may report only I/O errors, while others could report hardware errors, which are very different.

For example, if a chipset error occurs, the network adapter may have to be replaced. If packets with a wrong CRC are received, the network adapter most likely has nothing to do with the problem. It could be a software issue or a hardware problem on a remote system. The errors could also be injected intentionally to test how systems handle network I/O errors.
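
To make the ambiguity concrete, here is a minimal sketch of how three instruments could read the same adapter and still report different hw.errors values. The raw counter names and numbers are made up for illustration; they are not taken from the spec or from any real driver.

```python
# Hypothetical raw counters read from a single network adapter.
# Names and values are illustrative assumptions only.
raw_counters = {
    "rx_crc_errors": 120,    # received packets with a bad CRC
    "rx_buffer_full": 30,    # received packets dropped, buffer full
    "tx_errors": 5,          # packets that could not be transmitted
    "chipset_faults": 1,     # hardware fault on the adapter itself
}


def hw_errors_io_only(counters):
    """Interpretation A: hw.errors counts packet I/O errors only."""
    return (
        counters["rx_crc_errors"]
        + counters["rx_buffer_full"]
        + counters["tx_errors"]
    )


def hw_errors_hardware_only(counters):
    """Interpretation B: hw.errors counts hardware faults only."""
    return counters["chipset_faults"]


def hw_errors_everything(counters):
    """Interpretation C: hw.errors counts every error on the component."""
    return sum(counters.values())


io_only = hw_errors_io_only(raw_counters)
hardware_only = hw_errors_hardware_only(raw_counters)
everything = hw_errors_everything(raw_counters)
```

All three readings are compatible with "the number of errors encountered by the network adapter", yet the values differ by two orders of magnitude and call for different operator responses.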

hw.errors in Physical Disk Errors

This is similar to network errors. Physical disks expose many error counters through SMART, including I/O errors (e.g., read errors and write errors) and errors that are not about I/O.

Proposal

  1. Add an optional direction attribute for hw.errors. This will help distinguish ingress from egress errors on the network interface. But see Attribute names in semantic guidelines should be hierarchical #3131.
  2. Add one more attribute to identify the type of error, e.g., whether it is an I/O error or a hardware error.
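
A rough sketch of what the two proposed attributes could look like on recorded data points, using plain Python rather than a real metrics SDK. The attribute name hw.error.type is borrowed from the commit message referenced in this issue; the attribute values ("receive", "transmit", "io", "hardware"), the component label, and the helper functions are assumptions made up for this example.

```python
from collections import Counter

# One counter, keyed by the full attribute set of each data point.
hw_errors = Counter()


def record_error(component, error_type, direction=None, count=1):
    """Add `count` errors to hw.errors under the given attributes.

    `direction` is optional, as in the proposal; it is omitted for
    errors that have no direction (e.g. a hardware fault).
    """
    attrs = {"component": component, "hw.error.type": error_type}
    if direction is not None:
        attrs["direction"] = direction
    # Sort so the same attributes always produce the same key.
    hw_errors[tuple(sorted(attrs.items()))] += count


def total(match):
    """Sum all data points whose attributes include every pair in `match`."""
    return sum(
        n
        for key, n in hw_errors.items()
        if all(dict(key).get(k) == v for k, v in match.items())
    )


record_error("network", "io", direction="receive", count=120)
record_error("network", "io", direction="transmit", count=5)
record_error("network", "hardware", count=1)

io_errors = total({"hw.error.type": "io"})
receive_errors = total({"direction": "receive"})
hardware_errors = total({"hw.error.type": "hardware"})
all_errors = total({})
```

With the attributes in place, a consumer can still sum everything to get the old unqualified count, but can also separate a likely adapter replacement (hardware) from received-packet problems that probably originate elsewhere.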
@sebastien-rosset sebastien-rosset added the spec:metrics Related to the specification/metrics directory label Jan 23, 2023
@sebastien-rosset sebastien-rosset changed the title from "Definition of 'hw.errors' metric for hardware network adapter is ambiguous" to "Definition of 'hw.errors' metric for hardware network adapter and physical disks are ambiguous" Jan 23, 2023
carlosalberto pushed a commit that referenced this issue Apr 11, 2023
…d attributes (#3344)

Fixes #3132, #3133

## Changes

In semantic conventions for hardware metrics:

* Updated network adapter description
* Changed `type` to `hw.error.type` for the `hw.errors` metric
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this issue Oct 31, 2024
… metrics, network metrics and attributes (open-telemetry#3344)

Fixes open-telemetry#3132, open-telemetry#3133

## Changes

In semantic conventions for hardware metrics:

* Updated network adapter description
* Changed `type` to `hw.error.type` for the `hw.errors` metric