DAOS-8331 client: Add client side metrics #14030
1. Move TLS to common so that both client and server can have TLS, to which metrics can be attached.
2. Add object metrics on the client side, enabled by export DAOS_CLIENT_METRICS=1. Client metrics are organized as "root/jobid/pid/xxxxx". The root/jobid/pid nodes are stored in an independent shared memory segment, which is only destroyed once all jobs are destroyed. During each DAOS thread's initialization, another shared memory segment (pid/xxx) is created, to which all metrics of the thread are attached. These metrics are destroyed once the thread exits, unless DAOS_CLIENT_METRICS_RETAIN is set, in which case they are retained and can be retrieved with daos_metrics --jobid.
3. Add DAOS_METRIC_DUMP_ENV to dump metrics from the current thread once it exits.
4. Some fixes in telemetry about conv_ptr when re-opening the shared memory.
5. Add a daos_metrics --jobid XXX option to retrieve all metrics of the job.

Required-githooks: true
Change-Id: Iab54954cd6b94233b37853087041ea0e867871dd
Signed-off-by: Di Wang <[email protected]>
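The "root/jobid/pid/xxxxx" layout described in item 2 can be pictured with a small Go sketch. This is only an illustration of the hierarchy: the function name clientMetricPath, the root value, and the sample metric name are hypothetical and are not taken from the DAOS source.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// clientMetricPath joins the components of the hierarchy described in
// the PR text: root, job ID, process ID, then the metric's own name.
// Hypothetical helper; DAOS builds these paths in its own telemetry code.
func clientMetricPath(root, jobID string, pid int, name string) string {
	return path.Join(root, jobID, strconv.Itoa(pid), name)
}

func main() {
	// "myio" is the job ID used in the review discussion; the PID and
	// the metric name are made up for the example.
	fmt.Println(clientMetricPath("root", "myio", 4242, "obj/update"))
}
```

Grouping by job ID first and PID second is what lets a single daos_metrics --jobid query walk every process of a job.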
Ticket title is 'Client side metrics/stats support for DAOS'
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14030/1/execution/node/269/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14030/1/execution/node/369/log
Adds new agent config parameters and code to optionally export client metrics in Prometheus format.

Example daos_agent.yml updates:
  telemetry_port: 9192 # export on port 9192
  telemetry_retain: 5m # retain metrics for 5 minutes after client exit

Run-GHA: true
Features: telemetry
Change-Id: I77864682cc19fa4c33f326d879e20704ef57a7ea
Required-githooks: true
Signed-off-by: Michael MacDonald <[email protected]>
Functional on EL 9 Test Results (old): 144 tests, 140 ✅, 2h 4m 0s ⏱️. Results for commit 0ae1f55.
Functional on EL 8.8 Test Results (old): 144 tests, 140 ✅, 2h 8m 14s ⏱️. Results for commit 0ae1f55.
Functional Hardware Large Test Results (old): 64 tests, 64 ✅, 32m 5s ⏱️. Results for commit 0ae1f55.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14030/13/execution/node/1436/log |
Test failures in the latest updates appear to be CR-related (CR20-27) and are most likely already covered in DAOS-15614.
I do see this problem with the dump file when running multiple ior processes on a single host. Most lines were associated with one PID out of the 16 processes launched, one line with another PID, and another line looked like it wasn't formatted correctly ("<pool_uuid>/EC_update/full_stripe,0", which is missing the "PID/pool" prefix). This was done with a single DAOS_JOBID value, launching 2 mpirun commands (each -np 8). A potential "workaround"(?) might be to set DAOS_JOBID=myio and D_CLIENT_METRICS_RETAIN=1 (boolean), configure telemetry_retain: 5m in daos_agent.yml (I chose 5 minutes arbitrarily to give me time after the mpirun/ior jobs finished), and run daos_metrics -j myio --csv after the ior jobs have finished. This produces metrics per PID associated with the job ID.
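One way to spot the mixed-up lines described above is to group dump lines by their leading PID component and flag the ones that lack it. The Go sketch below assumes each dump line has the form "path,value" and that a well-formed path starts with a numeric PID, as the quoted example suggests; the actual dump format may differ, and the function names and sample lines are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitDumpLine splits an assumed "path,value" dump line and reports
// whether the path starts with a numeric PID component.
func splitDumpLine(line string) (int, string, bool) {
	// Separate the metric path from its value at the last comma.
	idx := strings.LastIndex(line, ",")
	if idx < 0 {
		return 0, "", false
	}
	metricPath := line[:idx]
	// The first path component is expected to be the PID. A pool UUID
	// in that position fails the integer parse because of its hyphens.
	first, _, found := strings.Cut(metricPath, "/")
	if !found {
		return 0, metricPath, false
	}
	n, err := strconv.Atoi(first)
	if err != nil || n <= 0 {
		return 0, metricPath, false
	}
	return n, metricPath, true
}

// groupByPID buckets well-formed lines by PID and collects the rest.
func groupByPID(lines []string) (map[int][]string, []string) {
	byPID := make(map[int][]string)
	var malformed []string
	for _, line := range lines {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		pid, _, ok := splitDumpLine(line)
		if !ok {
			malformed = append(malformed, line)
			continue
		}
		byPID[pid] = append(byPID[pid], line)
	}
	return byPID, malformed
}

func main() {
	// Hypothetical sample lines modelled on the one quoted in the comment.
	lines := []string{
		"4242/pool/<pool_uuid>/EC_update/full_stripe,0",
		"4243/pool/<pool_uuid>/EC_update/full_stripe,3",
		"<pool_uuid>/EC_update/full_stripe,0",
	}
	byPID, malformed := groupByPID(lines)
	fmt.Println("PIDs seen:", len(byPID))
	fmt.Println("lines missing a PID prefix:", malformed)
}
```

Running something like this over a dump file from a multi-process job would show quickly whether lines from several PIDs ended up in the same file.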
I believe mjmac already addressed this in a follow-on PR that treats the path as a directory instead. I think it would be good to rename the env variable to D_CLIENT_METRICS_DUMP_DIR (in that follow-on PR).
I don't have performance numbers from Frontera, but I'm so far not seeing the segfaults or hangs I saw when testing #13517 |
@@ -57,11 +57,12 @@ type Config struct {
    FabricInterfaces []*NUMAFabricConfig `yaml:"fabric_ifaces,omitempty"`
    ProviderIdx      uint // TODO SRS-31: Enable with multiprovider functionality
    TelemetryPort    int  `yaml:"telemetry_port,omitempty"`
    TelemetryEnabled bool `yaml:"telemetry_enabled,omitempty"`
Is the idea of this to enable telemetry locally, with no prometheus export?
No. If it's set, telemetry is automatically enabled for all client processes. If it's not set, then clients have to enable telemetry manually using the env var. There's a config validation check that will fail if this is set to true and the port is not set.
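The validation rule mentioned in this reply (enabled set to true without a port is rejected) could look roughly like the Go sketch below. The struct is a cut-down, hypothetical stand-in for the agent Config with only the two fields from the diff, and the method name and error text are assumptions rather than the actual DAOS code.

```go
package main

import (
	"errors"
	"fmt"
)

// telemetryConfig is a hypothetical stand-in holding only the two
// telemetry fields shown in the diff under review.
type telemetryConfig struct {
	TelemetryPort    int
	TelemetryEnabled bool
}

// errPortRequired is an illustrative error; the real message may differ.
var errPortRequired = errors.New("telemetry_enabled requires telemetry_port to be set")

// validate rejects the one combination described in the reply:
// telemetry enabled for all clients while no export port is configured.
func (c telemetryConfig) validate() error {
	if c.TelemetryEnabled && c.TelemetryPort == 0 {
		return errPortRequired
	}
	return nil
}

func main() {
	configs := []telemetryConfig{
		{TelemetryPort: 9192, TelemetryEnabled: true},
		{TelemetryPort: 9192},
		{TelemetryEnabled: true},
	}
	for _, c := range configs {
		fmt.Printf("%+v -> %v\n", c, c.validate())
	}
}
```

A port without the enabled flag stays valid in this sketch, which matches the per-client opt-in described in the reply.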
Ah, okay. I had thought that adding a telemetry port would be enough to indicate we wanted it enabled.
That was the original approach after we discussed it, but @mchaarawi advocated for having the option to selectively enable telemetry on a per-client basis.
Great, thanks for confirming that. So, what are the next steps here? The PR has received two +2 reviews, but looks like it lost the race to landing so I'm going to have to address a conflict. Can we plan to get this merged as soon as it passes testing again? Any other issues that need to be sorted out, @mchaarawi? |
I am not clear on whether we have confirmed that telemetry collection is actually working with workloads like mdtest, ior, etc. When I tried it, it was not, and I was getting an error that nothing was collected. @daltonbohning, did you verify that metrics are collected?
In my limited ior testing, configuring retaining metrics for 5 minutes, running 2 mpirun -np 8 ior application runs (at the same time on the same single client), and subsequently running daos_metrics -j --csv seemed to indicate telemetry was collected for all 16 processes. Dalton's testing with this patch is likely more comprehensive, so would be interesting to get confirmation there too. |
Features: telemetry Required-githooks: true Change-Id: I18754a81a93c9ce055aec0c399c9f8b193db393e Signed-off-by: Michael MacDonald <[email protected]>
I would like to point out that this has now passed again, with |
Hi all, @mchaarawi pointed me here to see if I have any general feedback on this work.

For some background, I'm lead developer on the Darshan I/O characterization tool (https://www.mcs.anl.gov/research/projects/darshan/) right now, and have been working recently on new instrumentation modules for DAOS. I've basically just followed our typical strategy in Darshan of intercepting calls to various DAOS/DFS APIs and logging stats, timers, and other counters that are stored in a log when the app exits.

I think the metrics captured here would definitely be of interest to us as well. They seem complementary to what we've already been working on (Darshan's detailed per-file (or per-object) statistics of usage of DAOS APIs vs. aggregate RPC metrics for an app/process as here). I'm not sure if there was a plan on how to persist this sort of data, for example, on Aurora, but Darshan could potentially be a vehicle for that if it makes sense, since we've traditionally been deployed full-time on ALCF systems.

Without being an expert in the DAOS codebase and only skimming this PR, I just wanted to provide some quick feedback. I think all we need is for APIs to query these metrics at application shutdown time, so nothing fancy. I think I saw some discussion related to whether this data is accessible via application processes or via the

Obviously, not trying to get in the way as you all try to get this merged, just wanted to provide another perspective if it helps with any subsequent work here. Thanks!
Hi Shane, thanks for taking a look. I think there is absolutely an opportunity to build on what we've got here in order to nicely integrate with Darshan. The client-side telemetry implementation is based on the same library and APIs as the server-side stuff, so there's already a way to get at the metrics on the way out. The daos_agent integration is optional and provides a way to expose the telemetry in real time for monitoring via something like Prometheus. I think integration with Darshan should be relatively straightforward, mostly by teaching it how to read the client metrics. I'm not super familiar with Darshan's inner workings, but I've spent some time using it for workload analysis recently, so I have a general idea of how it works. Definitely interested in collaborating on that integration effort. |
I did some testing on Aurora yesterday and today, and it seems the issues I hit were fixed by the latest update plus the PR that treats the path as a directory. So I am OK with this now.
This all sounds great, thanks for the details @mjmac. We will definitely keep this on our radar and can try to see what the Darshan side of things looks like once we have a DAOS deployment that supports this client-side functionality. Will try to keep you posted on our progress and let you know if we have questions. |
This commit comprises two separate patches to enable optional collection and export of client-side telemetry.

The daos_agent configuration file includes new parameters to control collection and export of per-client telemetry. If the telemetry_port option is set, then per-client telemetry will be published in Prometheus format for real-time sampling of client processes. By default, the client telemetry will be automatically cleaned up on client exit, but may be optionally retained for some amount of time after client exit in order to allow for a final sample to be read.

Example daos_agent.yml updates:
  telemetry_port: 9192 # export on port 9192
  telemetry_enable: true # enable client telemetry for all connected clients
  telemetry_retain: 1m # retain metrics for 1 minute after client exit

If telemetry_enable is false (default), client telemetry may be enabled on a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the environment for clients that should collect telemetry.

Notes from the first patch by Di:
  * Move TLS to common, so both client and server can have TLS, to which metrics can be attached.
  * Add object metrics on the client side, enabled by export D_CLIENT_METRICS_ENABLE=1. Client metrics are organized as "/jobid/pid/xxxxx". During each DAOS thread's initialization, another shared memory segment (pid/xxx) is created, to which all metrics of the thread are attached. These metrics are destroyed once the thread exits, unless D_CLIENT_METRICS_RETAIN is set, in which case they are retained and can be retrieved with daos_metrics --jobid.
  * Add D_CLIENT_METRICS_DUMP_PATH to dump metrics from the current thread once it exits.
  * Some fixes in telemetry about conv_ptr when re-opening the shared memory.
  * Add a daos_metrics --jobid XXX option to retrieve all metrics of the job.
Required-githooks: true Change-Id: Ib80ff89f39d259e0dce26e0ae8388318f96a3540 Signed-off-by: Di Wang <[email protected]> Signed-off-by: Michael MacDonald <[email protected]> Co-authored-by: Di Wang <[email protected]> Signed-off-by: Michael MacDonald <[email protected]>
This commit comprises two separate patches to enable optional collection and export of client-side telemetry.

The daos_agent configuration file includes new parameters to control collection and export of per-client telemetry. If the telemetry_port option is set, then per-client telemetry will be published in Prometheus format for real-time sampling of client processes. By default, the client telemetry will be automatically cleaned up on client exit, but may be optionally retained for some amount of time after client exit in order to allow for a final sample to be read.

Example daos_agent.yml updates:
  telemetry_port: 9192 # export on port 9192
  telemetry_enable: true # enable client telemetry for all connected clients
  telemetry_retain: 1m # retain metrics for 1 minute after client exit

If telemetry_enable is false (default), client telemetry may be enabled on a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the environment for clients that should collect telemetry.

Notes from the first patch by Di:
  * Move TLS to common, so both client and server can have TLS, to which metrics can be attached.
  * Add object metrics on the client side, enabled by export D_CLIENT_METRICS_ENABLE=1. Client metrics are organized as "/jobid/pid/xxxxx". During each DAOS thread's initialization, another shared memory segment (pid/xxx) is created, to which all metrics of the thread are attached. These metrics are destroyed once the thread exits, unless D_CLIENT_METRICS_RETAIN is set, in which case they are retained and can be retrieved with daos_metrics --jobid.
  * Add D_CLIENT_METRICS_DUMP_PATH to dump metrics from the current thread once it exits.
  * Some fixes in telemetry about conv_ptr when re-opening the shared memory.
  * Add a daos_metrics --jobid XXX option to retrieve all metrics of the job.

Includes some useful ftest updates from the following commit:
  * DAOS-11626 test: Adding MD on SSD metrics tests (#13661)
    Adding tests for WAL commit, reply, and checkpoint metrics.
Signed-off-by: Phil Henderson <[email protected]> Signed-off-by: Michael MacDonald <[email protected]> Signed-off-by: Di Wang <[email protected]> Co-authored-by: Phil Henderson <[email protected]> Co-authored-by: Di Wang <[email protected]>
This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.
If the agent is configured to export daos client telemetry,
then libdaos-linked client processes will automatically
enable client-side telemetry and share it with the agent.
When the client processes exit, the agent will automatically
clean up any shared memory segments left behind by the clients.
Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_retain: 1m # retain metrics for 1 minute after client exit
Notes from the first patch (env vars not needed with agent management):
Move TLS to common, so both client and server can have TLS,
to which metrics can be attached.
Add object metrics on the client side, enabled by
export DAOS_CLIENT_METRICS=1. Client metrics are organized
as "root/jobid/pid/xxxxx".
Add DAOS_METRIC_DUMP_ENV to dump metrics from the current thread
once it exits.
Some fixes in telemetry about conv_ptr when re-opening the
shared memory.
Add a daos_metrics --jobid XXX option to retrieve all metrics
of the job.
Signed-off-by: Michael MacDonald [email protected]
Co-authored-by: Di Wang [email protected]
Signed-off-by: Di Wang [email protected]