-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Containerd metricbeat module #29247
Containerd metricbeat module #29247
Conversation
This pull request does not have a backport label. Could you fix it @MichaelKatsoulis? 🙏
NOTE: |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. 🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
"process_cpu_seconds_total": prometheus.Metric("system.total"), | ||
}, | ||
Labels: map[string]prometheus.LabelMap{ | ||
"container_id": p.KeyLabel("id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using openmetrics library here, is this intentional? Why not prometheus.KeyLabel()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By mistake
|
||
// init registers the MetricSet with the central registry. | ||
// The New method will be called after the setup of the module and before starting to fetch data | ||
func init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new module so I guess it would better fit under xpack
.
if err == nil { | ||
cpuUsageTotalPct := calcCpuTotalUsagePct(cpuUsageTotal.(float64), systemUsageDelta, | ||
float64(contCpus), cID, m.preContainerCpuTotalUsage) | ||
m.Logger().Infof("cpuUsageTotalPct for %+v is %+v", cID, cpuUsageTotalPct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe that's too noisy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remember it is a draft . I gave no intention of keeping it
var systemTotalNs int64 | ||
perContainerCpus := make(map[string]int) | ||
elToDel := -1 | ||
for i, event := range events { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional percentances calculations can be configured to be enabled/disabled on demand like what is happening in docker module so as to avoid overloading the in case we have too many containers/metrics and we are interested into getting that much detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
systemUsageDelta := float64(systemTotalNs) - m.preSystemCpuUsage | ||
|
||
// Calculate cpu total usage percentage | ||
cpuUsageTotal, err := event.GetValue("usage.total.ns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the rationale of these calculations into the PR's description and in the docs so as to have them clearly documented? I forsee getting questions about how these are calculated and we will have to dig into the code and struggle to understand these calculations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes of course!
Regarding the calculation of cpu usage percentage I followed this approach:
So in order to get the cpu usage percentage of each container I followed the approach we have in docker. For each container: In order to set the Basically, I took a point of reference, then see the difference in next batch of events. That way you can tell how much of the time was used by the container. Also as the @fearful-symmetry as you have worked with docker module in similar calculations, what do you think about this approach? |
|
||
var ( | ||
// HostParser validates Prometheus URLs | ||
hostParser = parse.URLHostParserBuilder{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this declared globally? I only see it being used in the init function?
// The New method will be called after the setup of the module and before starting to fetch data | ||
func init() { | ||
// Mapping of state metrics | ||
mapping := &prometheus.MetricsMapping{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why this map is declared here? I don't think that getMetricsetFactory
function is getting reused, so we could just put it in one place?
systemUsageDelta := float64(systemTotalNs) - m.preSystemCpuUsage | ||
|
||
// Calculate cpu total usage percentage | ||
cpuUsageTotal, err := event.GetValue("usage.total.ns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nearly identical logic block gets duplicated 3 times, we may want to try to abstract this away to the calcCpuTotalUsagePct
function so it's a bit easier to wrangle.
//} | ||
} | ||
|
||
func calcCpuTotalUsagePct(cpuUsageTotal, systemUsageDelta, contCpus float64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear on why these are three separate functions. They look really similar?
containerFields.Put("id", cID) | ||
event.Delete("id") | ||
} | ||
e, err := util.CreateEvent(event, "containerd.cpu") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're gonna borrow functions from other modules, we should move that code outside the module to somewhere generic.
As far as calculating the CPU percents, your logic seems sound? Generally, we track a previous value, calculate the delta between the previous total and the current, and divide that by the time between the deltas. Pay attention to the values that are being reported upstream, as not all platforms will have a particularly clear-cut idea of what a "total" is. Normalized CPU usage should be thought of as the total per-CPU. Put another way, the maximum value for For example, check out https://github.com/elastic/beats/blob/master/metricbeat/module/docker/cpu/helper.go and https://github.com/elastic/beats/blob/master/metricbeat/internal/metrics/cpu/metrics.go |
@ChrsMark I updated the PR based on your comments. Could you take another look? |
Co-authored-by: hendry-lim <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
and more specifically fields `containerd.cpu.usage.total.pct`, `containerd.cpu.usage.kernel.pct`, `containerd.cpu.usage.user.pct`. | ||
Default value is true. | ||
|
||
For memory metricset if `calcpct.memory` setting is set to true, memory usage percentages will be calculated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why calcpct.cpu
and calcpct.memory
were introduced? do we have some reason to make it configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Containerd module collects cpu, memory and blkio statistics about | ||
running containers controlled by containerd runtime. | ||
|
||
The current metricsets are: `cpu`, `blkio` and `memory`. They are not enabled by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but in blkio.asciidoc it is:
This is a default metricset. If the host module is unconfigured, this metricset is enabled by default.
doesn't it contradict each other?
the same for cpu and memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand how this text in blkio.asciidoc is generated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
// Calculate cpu total usage percentage | ||
cpuUsageTotal, err := event.GetValue("usage.total.ns") | ||
if err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think it could be helpful to add some debug logs if err != nil ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, we iterate a batch of events and we check if usage.total.ns
field is present. There is for sure one event in this batch that this field is not present and that is the one that has system.total
field. The reason is that process_cpu_seconds_total
containerd metric that is then mapped to system.total
field, does not include info about any container id (system wise metric not container specific). So that leads to an event that has only system.total
field. While the rest of the events have fields that are grouped together due to container_id
mainly. So with this if
I just want to be sure that we skip trying to calculate percentages for the event that only has system.total
(it does not include any other field, is not container specific). It is not an actual error to log anything. But I will add a comment.
if m.calcPct { | ||
inactiveFiles, err := event.GetValue("inactiveFiles") | ||
if err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in which cases inactiveFiles
can be not present in event? should here be added a debug log to explain why usage percentage was skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In no case. Only if there is an error. I will add a debug
… not present in memory event
@fearful-symmetry you have some requested changes that are blocking the merging of the PR. I can still merge it though, unless you want to make a final review |
@MichaelKatsoulis Sorry about that! No idea how I missed your ping last week. |
|
What does this PR do?
This PR creates containerd metricbeat module.
Cpu, memory and blkio metricsets are created.
Why is it important?
Containerd is a container runtime that implements Container Runtime Interface (CRI).
It is used as one of Kubernetes runtimes after k8s deprecating docker after v1.20.
Containerd if configured to expose metrics it provides useful informations about cpu, memory and blkio.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
kind create cluster --image 'kindest/node:v1.21.1' --config kind-cluster.yaml
/etc/containerd/config.toml
and addsystemctl restart containerd
metricbeat-daemonset-modules
ConfigMapcontainerd
fields getting populated.Use cases
Screenshots