-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-8331 client: Export client metrics via agent #13545
Conversation
Bug-tracker data: |
7a6fb84
to
e2f54b6
Compare
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13545/6/display/redirect |
208898e
to
3f99401
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be new entries added to utils/config/daos_agent.yml?
Also, could I ask for this small change? It would allow functional tests - specifically, the performance tests - to set the config
578d907
And since it's unused, no special testing is needed
Yes, good catch. I forgot about those.
I'll merge that in, thanks. Actually, I may try to add a ftest for this work, so that change makes it even easier. |
3f99401
to
fcb4958
Compare
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/284/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/279/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/366/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/361/log |
Just refreshed this patch. I did add the |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/480/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/347/log |
fcb4958
to
9182b98
Compare
9182b98
to
bfc9b03
Compare
Bug-tracker data: |
Functional on EL 9 Test Results (old)135 tests 131 ✅ 1h 31m 39s ⏱️ Results for commit 9894532. ♻️ This comment has been updated with latest results. |
Functional on EL 8.8 Test Results (old)135 tests 131 ✅ 1h 29m 5s ⏱️ Results for commit 9894532. ♻️ This comment has been updated with latest results. |
Functional Hardware Medium Test Results (old)130 tests 104 ✅ 2h 9m 52s ⏱️ Results for commit 9894532. ♻️ This comment has been updated with latest results. |
Functional Hardware Medium Verbs Provider Test Results (old)55 tests 54 ✅ 4h 7m 31s ⏱️ Results for commit 9894532. ♻️ This comment has been updated with latest results. |
Functional Hardware Large Test Results (old)64 tests 64 ✅ 28m 42s ⏱️ Results for commit 9894532. ♻️ This comment has been updated with latest results. |
Adds new agent config parameters and code to optionally export client metrics in Prometheus format. Example daos_agent.yml updates: telemetry_port: 9192 # export on port 9192 telemetry_retain: 5m # retain metrics for 5 minutes # after client exit Run-GHA: true Change-Id: I77864682cc19fa4c33f326d879e20704ef57a7ea Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>
bfc9b03
to
9894532
Compare
Bug-tracker data: |
Requesting early reviews while waiting for the base patch to land, TIA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not overly familiar with the telemetry code but these changes LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftest code changes LGTM. Thanks for adding!
if (tm_shmem.other_rw == 1) | ||
flags |= 0666; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is concerning to me -- allowing anyone to write to the root of client telemetry means anyone can change it, potentially crashing the agent or client process, or doing worse. IMO another way is needed to link client job telemetry from the root.
@@ -2717,7 +2727,7 @@ rm_ephemeral_dir(struct d_tm_context *ctx, struct d_tm_node_t *link) | |||
head = &shmem->sh_subregions; | |||
for (cur = conv_ptr(shmem, head->next); cur != head; cur = conv_ptr(shmem, cur->next)) { | |||
curr = d_list_entry(cur, __typeof__(*curr), rl_link); | |||
rc = rm_ephemeral_dir(ctx, curr->rl_link_node); | |||
rc = rm_ephemeral_dir(ctx, conv_ptr(shmem, curr->rl_link_node)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
var engLabels labelMap | ||
engLabels, name = extractEngineLabels(log, strings.Join(comps[compsIdx:], string(telemetry.PathSep))) | ||
for k, v := range engLabels { | ||
labels[k] = v | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractEngineLabels in the client telemetry?
@@ -1,447 +1,40 @@ | |||
// | |||
// (C) Copyright 2021-2022 Intel Corporation. | |||
// (C) Copyright 2024 Intel Corporation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - should be 2021-2024 right?
} | ||
} | ||
|
||
func (s *sourceMetricSchema) Prune() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - needs documentation comment
addFn func(logging.Logger, telemetry.Metric) *sourceMetric | ||
} | ||
|
||
MetricSource struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - needs documentation comment
s.enabled.SetFalse() | ||
} | ||
|
||
func (s *MetricSource) Collect(log logging.Logger, ch chan<- *sourceMetric) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - needs documentation comment
s.smSchema.Prune() | ||
} | ||
|
||
func (s *MetricSource) PruneSegments(log logging.Logger, maxSegAge time.Duration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs documentation comment
|
||
// adjustAttachInfo performs any necessary adjustments to the attach info | ||
// before returning it. | ||
func (c *InfoCache) adjustAttachInfo(resp *control.GetAttachInfoResp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We make other modifications to the GetAttachInfoResp struct in mgmt_rpc.go before returning to the client. Infocache has been caching only what the server sends back. I'm not exactly opposed to doing it this way, but it feels a little odd making our client-side modifications in two different layers.
Closed in favor of the approach in #14030. |
Adds new agent config parameters and code to
optionally export client metrics in Prometheus
format.
Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_retain: 5m # retain metrics for 5 minutes
# after client exit
Change-Id: I77864682cc19fa4c33f326d879e20704ef57a7ea
Required-githooks: true
Signed-off-by: Michael MacDonald [email protected]