Add volume ID append dimension for disk metrics #1156

Merged
merged 9 commits into from
May 2, 2024

@jefchien jefchien commented May 2, 2024

Description of the issue

The CloudWatch agent enables customers to collect disk utilization metrics from hosts, including total, used, free, and used_percent. These metrics carry the following dimensions: device (e.g. xvdb), filesystem path (e.g. /mnt/volume), and filesystem type (e.g. ext4). A device is the logical representation of a partition on a volume. There can be more than one partition in a volume.

Customers cannot determine which of their volumes are overprovisioned without post-processing these metrics to aggregate each metric per device. Furthermore, since a device is not synonymous with a volume, customers cannot aggregate usage on volumes with more than one partition. Also, on instances that use the Nitro system, the device names change for each disk mount when the instance is rebooted, so in those cases the device cannot be used to monitor disk usage.

The EC2 tagger already supports this dimension, but the support was broken and could not be configured.

Description of changes

To enable volume aggregation, VolumeId is being added as an option in the disk metrics append_dimensions.

{
  "metrics": {
    "metrics_collected": {
      "disk": {
        "append_dimensions": {
          "VolumeId": "{aws:VolumeId}"
        }
      }
    }
  }
}

If that specific dimension key/value pair is present in the disk metrics configuration, the agent caches a device-to-serial map on startup. This mapping is derived in two ways: from the host (Linux only) and from ec2:DescribeVolumes.

On a Linux host, the agent reads the serial from /sys/block/<device>/device/serial if it is available. This is similar to the gopsutil implementation (although gopsutil also includes the model, for some reason). For EBS volumes, the serial has the form vol0c241693efb58734a rather than the volume ID form vol-0303a1cc896c42d28. In those cases, the serial is reformatted to match the volume ID format.
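The serial lookup and reformatting described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names formatSerial and readSerial are hypothetical, and the rule "insert a hyphen after a leading vol" is an assumption based only on the two example strings given here.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// formatSerial converts an EBS-style serial such as "vol0c241693efb58734a"
// into the volume ID form "vol-0c241693efb58734a". Serials that do not start
// with "vol", or that already contain the hyphen, are returned trimmed but
// otherwise unchanged. (Hypothetical helper; the rule is an assumption.)
func formatSerial(serial string) string {
	serial = strings.TrimSpace(serial)
	if strings.HasPrefix(serial, "vol") && !strings.HasPrefix(serial, "vol-") {
		return "vol-" + strings.TrimPrefix(serial, "vol")
	}
	return serial
}

// readSerial reads <sysBlock>/<device>/device/serial and normalizes it.
// sysBlock would normally be "/sys/block"; it is a parameter so the function
// can be pointed at a temporary directory in tests.
func readSerial(sysBlock, device string) (string, error) {
	data, err := os.ReadFile(filepath.Join(sysBlock, device, "device", "serial"))
	if err != nil {
		return "", err
	}
	return formatSerial(string(data)), nil
}
```

A caller would invoke readSerial("/sys/block", "nvme0n1") for each block device and skip devices for which the file is missing.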

If ec2:DescribeVolumes permissions are provided to the agent, then the agent will make the call and add the mapping derived from the response to the cache as well.

If neither source is available and the dimension is present, an error is logged and the tagger tries again after 3 minutes.

This is only for disk metrics and requires the device attribute to be present on the metric when it reaches the EC2 tagger. Therefore, it is not compatible with the drop_device field.
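The dependency on the device attribute can be illustrated with a small lookup sketch. The function volumeIDForMetric is hypothetical, and exact-match lookup on the attribute named "device" is an assumption made for illustration; the real tagger may match device names differently.

```go
package main

// volumeIDForMetric returns the cached volume ID for a metric's device
// attribute. When the attribute is absent (for example because the device
// was dropped before the metric reached the tagger), there is nothing to
// look up and no VolumeId can be appended.
func volumeIDForMetric(attrs map[string]string, cache map[string]string) (string, bool) {
	device, ok := attrs["device"]
	if !ok {
		return "", false
	}
	// Assumption: cache keys match the device attribute exactly.
	volumeID, ok := cache[device]
	return volumeID, ok
}
```

This is why a metric without the device attribute cannot receive the dimension: the lookup key is gone before the tagger sees the metric.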

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Added unit tests. Built and ran on test hosts.

2024-05-02T02:06:05Z I! {"caller":"ec2tagger/ec2tagger.go:337","msg":"ec2tagger: EC2 tagger has started initialization.","kind":"processor","name":"ec2tagger","pipeline":"metrics/host"}
2024-05-02T02:06:05Z D! {"caller":"adapter/receiver.go:41","msg":"Starting adapter","kind":"receiver","name":"telegraf_disk","data_type":"metrics","receiver":"disk"}
2024-05-02T02:06:05Z I! {"caller":"[email protected]/service.go:169","msg":"Everything is ready. Begin running and processing data."}
2024-05-02T02:06:05Z D! {"caller":"ec2tagger/ec2tagger.go:387","msg":"Volume Serial Cache","kind":"processor","name":"ec2tagger","pipeline":"metrics/host","devices":["nvme1n1","nvme0n1"]}
2024-05-02T02:06:05Z I! {"caller":"ec2tagger/ec2tagger.go:484","msg":"ec2tagger: Initial retrieval of tags succeeded","kind":"processor","name":"ec2tagger","pipeline":"metrics/host"}
2024-05-02T02:06:05Z I! {"caller":"ec2tagger/ec2tagger.go:395","msg":"ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes","kind":"processor","name":"ec2tagger","pipeline":"metrics/host"}

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@jefchien jefchien requested a review from a team as a code owner May 2, 2024 13:43
@sethAmazon

What is the volume ID for non-disk metrics? Does it use the default device?

@sethAmazon

Can you please run a macOS test?

@jefchien jefchien merged commit e46fae5 into main May 2, 2024
6 checks passed
@jefchien jefchien deleted the ebs-volume branch May 2, 2024 22:43
sethAmazon added a commit that referenced this pull request May 7, 2024