
DAOS-8331 client: Add client side metrics #14030

Merged
merged 19 commits into master from mjmac/DAOS-8331 on Apr 19, 2024

Conversation

mjmac
Contributor

@mjmac mjmac commented Mar 20, 2024

This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.

If the agent is configured to export daos client telemetry,
then libdaos-linked client processes will automatically
enable client-side telemetry and share it with the agent.

When the client processes exit, the agent will automatically
clean up any shared memory segments left behind by the clients.

Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_retain: 1m # retain metrics for 1 minute after client exit
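
As a rough illustration of the retention and cleanup behaviour described above, here is a minimal Go sketch (assumptions: the cleanupSegment callback and the sketch package are hypothetical; this is not the agent's actual code):

    package sketch

    import (
        "fmt"
        "time"
    )

    // scheduleCleanup parses a retention window such as "1m" (the
    // telemetry_retain value) and delays cleanup of an exited client's
    // telemetry segment until that window has elapsed.
    // cleanupSegment is a hypothetical callback, not a real DAOS API.
    func scheduleCleanup(retain string, cleanupSegment func()) (*time.Timer, error) {
        d, err := time.ParseDuration(retain)
        if err != nil {
            return nil, fmt.Errorf("invalid telemetry_retain %q: %w", retain, err)
        }
        // Keep the metrics readable for the retention window, then
        // remove the client's shared memory segment.
        return time.AfterFunc(d, cleanupSegment), nil
    }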

Notes from the first patch (env vars not needed with agent management):

  1. Move TLS to common, so both client and server can have TLS
     to which metrics can be attached.

  2. Add object metrics on the client side, enabled by
     export DAOS_CLIENT_METRICS=1. Client metrics are organized
     as "root/jobid/pid/xxxxx".

root/jobid/pid are stored in an independent shared-memory segment,
which is only destroyed once all jobs are destroyed.

During each DAOS thread's initialization, another shmem (pid/xxx) is
created, to which all of the thread's metrics are attached. These metrics
are destroyed once the thread exits, though if DAOS_CLIENT_METRICS_RETAIN
is set, the client metrics are retained and can be retrieved with
daos_metrics --jobid.

  3. Add DAOS_METRIC_DUMP_ENV to dump metrics from the current thread
     once it exits.

  4. Some fixes in telemetry around conv_ptr when re-opening the
     shared memory.

  5. Add a daos_metrics --jobid XXX option to retrieve all metrics
     of the job.

Signed-off-by: Michael MacDonald [email protected]
Co-authored-by: Di Wang [email protected]
Signed-off-by: Di Wang [email protected]

1. Move TLS to common, so both client and server can have TLS
to which metrics can be attached.

2. Add object metrics on the client side, enabled by
export DAOS_CLIENT_METRICS=1. Client metrics are organized
as "root/jobid/pid/xxxxx".

root/jobid/pid are stored in an independent shared-memory segment,
which is only destroyed once all jobs are destroyed.

During each DAOS thread's initialization, another shmem (pid/xxx) is
created, to which all of the thread's metrics are attached. These metrics
are destroyed once the thread exits, though if DAOS_CLIENT_METRICS_RETAIN
is set, the client metrics are retained and can be retrieved with
daos_metrics --jobid.

3. Add DAOS_METRIC_DUMP_ENV to dump metrics from the current thread
once it exits.

4. Some fixes in telemetry around conv_ptr when re-opening the
shared memory.

5. Add a daos_metrics --jobid XXX option to retrieve all metrics
of the job.

Required-githooks: true
Change-Id: Iab54954cd6b94233b37853087041ea0e867871dd
Signed-off-by: Di Wang <[email protected]>
@mjmac mjmac requested review from a team as code owners March 20, 2024 19:02

github-actions bot commented Mar 20, 2024

Ticket title is 'Client side metrics/stats support for DAOS'
Status is 'In Review'
Labels: 'HPE'
https://daosio.atlassian.net/browse/DAOS-8331

@mjmac mjmac changed the title mjmac/DAOS 8331 DAOS-8331 client: Add client side metrics Mar 20, 2024
@mjmac mjmac marked this pull request as draft March 20, 2024 19:03
@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14030/1/execution/node/269/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14030/1/execution/node/369/log

Adds new agent config parameters and code to
optionally export client metrics in Prometheus
format.

Example daos_agent.yml updates:
  telemetry_port: 9192 # export on port 9192
  telemetry_retain: 5m # retain metrics for 5 minutes
                       # after client exit

Run-GHA: true
Features: telemetry
Change-Id: I77864682cc19fa4c33f326d879e20704ef57a7ea
Required-githooks: true
Signed-off-by: Michael MacDonald <[email protected]>
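
For context, a minimal Go sketch of exporting metrics in Prometheus format on the configured telemetry_port, using the standard client_golang library (illustrative only, not the daos_agent implementation; the metric name is a placeholder):

    package sketch

    import (
        "fmt"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // serveTelemetry publishes registered metrics in Prometheus exposition
    // format over HTTP on the given port (e.g. 9192 from telemetry_port).
    func serveTelemetry(port int) error {
        reg := prometheus.NewRegistry()
        updates := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "client_example_updates_total", // placeholder metric name
            Help: "Example client-side counter.",
        })
        reg.MustRegister(updates)

        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
        return http.ListenAndServe(fmt.Sprintf(":%d", port), mux)
    }

A Prometheus server (or a plain curl of the sketch's /metrics endpoint) can then sample the exported values while clients are running.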
@mjmac mjmac force-pushed the mjmac/DAOS-8331 branch from 6c8b674 to 0ae1f55 on March 20, 2024 20:14

Functional on EL 9 Test Results (old)

144 tests in 43 suites (43 files): 140 passed ✅, 4 skipped 💤, 0 failed ❌, in 2h 4m 0s ⏱️

Results for commit 0ae1f55.


Functional on EL 8.8 Test Results (old)

144 tests in 43 suites (43 files): 140 passed ✅, 4 skipped 💤, 0 failed ❌, in 2h 8m 14s ⏱️

Results for commit 0ae1f55.


Functional Hardware Large Test Results (old)

64 tests in 14 suites (14 files): 64 passed ✅, 0 skipped 💤, 0 failed ❌, in 32m 5s ⏱️

Results for commit 0ae1f55.

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14030/13/execution/node/1436/log

@mjmac
Contributor Author

mjmac commented Apr 12, 2024

Test failures in the latest updates appear to be CR-related: CR20-27

Most likely already covered in DAOS-15614.

@kccain
Contributor

kccain commented Apr 12, 2024

  • If you have multiple processes all trying to write to the same shared dump file, it's going to cause problems. Maybe the dump path should be interpreted as a parent directory, and each process should write to $pid.csv ?

It's very difficult to set a different env variable for every MPI process on the same node, as there could be many of them. So either changing the path to be a dir (where we can point it to /tmp on every node), or allowing logging to a single file and appending something like the PID of the process, is fine. I guess the latter can be a bit messy. For now I was just looking for some metrics I can look at to see what gets reported, but couldn't generate anything. I don't know if this is the issue I'm running into, though, as a "no metrics reported" error sounds like a more fundamental issue than a logging issue? I could be wrong of course.

I do see this problem with the dump file when running multiple ior processes on a single host. Most lines were associated with one PID out of the 16 procs launched, one line with another PID, and another line looked like it wasn't formatted right ("<pool_uuid>/EC_update/full_stripe,0", which is missing the "PID/pool" prefix). This was done with a single DAOS_JOBID value, launching 2 mpirun commands (each -np 8).

A potential "workaround"(?) might be to set DAOS_JOBID=myio, D_CLIENT_METRICS_RETAIN=1 (boolean), configure telemetry_retain: 5m in daos_agent.yml (I chose 5 minutes arbitrarily to give me time after the mpirun/ior jobs finished), and run daos_metrics -j myio --csv after the ior jobs finished. This produces metrics per PID associated with the job ID.

@mchaarawi
Contributor

mjmac already addressed this, I believe, in a follow-on PR to treat the path as a dir instead. I think it would be good to rename the env variable to D_CLIENT_METRICS_DUMP_DIR (in that follow-on PR).
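
For illustration, a minimal Go sketch of the per-process dump-file naming being suggested (assumed behaviour; the follow-on PR may use different names and details):

    package sketch

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // dumpFilePath builds a per-process dump file name, so that e.g. with
    // a dump dir of /tmp a process with PID 12345 writes /tmp/12345.csv
    // instead of all processes sharing one file.
    func dumpFilePath(dumpDir string) string {
        return filepath.Join(dumpDir, fmt.Sprintf("%d.csv", os.Getpid()))
    }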

kccain
kccain previously approved these changes Apr 12, 2024
@daltonbohning
Contributor

I don't have performance numbers from Frontera, but I'm so far not seeing the segfaults or hangs I saw when testing #13517

kjacque
kjacque previously approved these changes Apr 12, 2024
@@ -57,11 +57,12 @@ type Config struct {
    FabricInterfaces []*NUMAFabricConfig `yaml:"fabric_ifaces,omitempty"`
    ProviderIdx      uint // TODO SRS-31: Enable with multiprovider functionality
    TelemetryPort    int  `yaml:"telemetry_port,omitempty"`
    TelemetryEnabled bool `yaml:"telemetry_enabled,omitempty"`
Contributor

Is the idea of this to enable telemetry locally, with no prometheus export?

Contributor Author

Is the idea of this to enable telemetry locally, with no prometheus export?

No. If it's set, telemetry is automatically enabled for all client processes. If it's not set, then clients have to enable telemetry manually using the env var. There's a config validation check that will fail if this is set to true and the port is not set.
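
As an illustrative Go sketch of that validation rule (not the actual agent config code; the field names follow the struct shown in the diff above):

    package sketch

    import "errors"

    type telemetryConfig struct {
        Port    int  `yaml:"telemetry_port,omitempty"`
        Enabled bool `yaml:"telemetry_enabled,omitempty"`
    }

    // validate rejects a config that enables client telemetry for all
    // clients without giving the agent a port to export it on.
    func (c *telemetryConfig) validate() error {
        if c.Enabled && c.Port == 0 {
            return errors.New("telemetry_enabled is true but telemetry_port is not set")
        }
        return nil
    }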

Contributor

Ah, okay. I had thought that adding a telemetry port would be enough to indicate we wanted it enabled.

Contributor Author

Ah, okay. I had thought that adding a telemetry port would be enough to indicate we wanted it enabled.

That was the original approach after we discussed it, but @mchaarawi advocated for having the option to selectively enable telemetry on a per-client basis.

@mjmac
Contributor Author

mjmac commented Apr 15, 2024

I don't have performance numbers from Frontera, but I'm so far not seeing the segfaults or hangs I saw when testing #13517

Great, thanks for confirming that. So, what are the next steps here? The PR has received two +2 reviews, but looks like it lost the race to landing so I'm going to have to address a conflict. Can we plan to get this merged as soon as it passes testing again? Any other issues that need to be sorted out, @mchaarawi?

@mchaarawi
Contributor

I am not clear on whether we have confirmed that telemetry collection is actually working with workloads like mdtest, ior, etc. When I had tried, it was not, and I was getting the error that there is nothing collected. @daltonbohning, did you verify that there are metrics collected?

@kccain
Contributor

kccain commented Apr 16, 2024

In my limited ior testing (configuring metrics retention for 5 minutes, running 2 mpirun -np 8 ior application runs at the same time on the same single client, and subsequently running daos_metrics -j --csv), telemetry seemed to be collected for all 16 processes. Dalton's testing with this patch is likely more comprehensive, so it would be interesting to get confirmation there too.

Features: telemetry
Required-githooks: true

Change-Id: I18754a81a93c9ce055aec0c399c9f8b193db393e
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac dismissed stale reviews from kjacque and kccain via e781790 April 16, 2024 21:46
@mjmac
Contributor Author

mjmac commented Apr 18, 2024

I would like to point out that this has now passed again, with Features: telemetry, and had previously received two brave +1 reviews before it lost to a merge conflict. Kindly and humbly requesting re-reviews and a landing as soon as possible to avoid more churn. It may not be perfect, but I think that we've demonstrated that it provides a good base for future work without undue risk, particularly as the feature is disabled by default. (@mchaarawi, @kccain, @kjacque -- TIA)

@shanedsnyder

Hi all, @mchaarawi pointed me here to see if I have any general feedback on this work. For some background, I'm lead developer on the Darshan I/O characterization tool (https://www.mcs.anl.gov/research/projects/darshan/) right now, and have been working recently on new instrumentation modules for DAOS. I've basically just followed our typical strategy in Darshan of intercepting calls to various DAOS/DFS APIs and logging stats, timers, and other counters that are stored in a log when the app exits.

I think the metrics captured here would definitely be of interest to us as well; they seem complementary to what we've already been working on (Darshan's detailed per-file (or per-object) statistics of usage of DAOS APIs vs. the aggregate RPC metrics for an app/process here). I'm not sure if there was a plan on how to persist this sort of data, for example, on Aurora, but Darshan could potentially be a vehicle for that if it makes sense, since we've traditionally been deployed full-time on ALCF systems.

Without being an expert in the DAOS codebase and only skimming this PR, I just wanted to provide some quick feedback. I think all we need is for APIs to query these metrics at application shutdown time, so nothing fancy. I think I saw some discussion related to whether this data is accessible via application processes or via the daos_agent, and obviously in our case we would need to be able to access this information from the application itself.

Obviously, not trying to get in the way as you all try to get this merged, just wanted to provide another perspective if it helps with any subsequent work here. Thanks!

@mjmac
Contributor Author

mjmac commented Apr 19, 2024

Without being an expert in the DAOS codebase and only skimming this PR, I just wanted to provide some quick feedback. I think all we need is for APIs to query these metrics at application shutdown time, so nothing fancy. I think I saw some discussion related to whether this data is accessible via application processes or via the daos_agent, and obviously in our case we would need to be able to access this information from the application itself.

Hi Shane, thanks for taking a look. I think there is absolutely an opportunity to build on what we've got here in order to nicely integrate with Darshan. The client-side telemetry implementation is based on the same library and APIs as the server-side stuff, so there's already a way to get at the metrics on the way out.

The daos_agent integration is optional and provides a way to expose the telemetry in real time for monitoring via something like Prometheus. I think integration with Darshan should be relatively straightforward, mostly by teaching it how to read the client metrics. I'm not super familiar with Darshan's inner workings, but I've spent some time using it for workload analysis recently, so I have a general idea of how it works. Definitely interested in collaborating on that integration effort.

Contributor

@mchaarawi mchaarawi left a comment

I have done some testing on Aurora yesterday and today, and it seems the issues I hit were fixed with the latest update plus the PR to make the path a dir. So I am OK with this now.

@mjmac mjmac merged commit 0c3e72f into master Apr 19, 2024
52 checks passed
@mjmac mjmac deleted the mjmac/DAOS-8331 branch April 19, 2024 16:27
@shanedsnyder

This all sounds great, thanks for the details @mjmac. We will definitely keep this on our radar and can try to see what the Darshan side of things looks like once we have a DAOS deployment that supports this client-side functionality. Will try to keep you posted on our progress and let you know if we have questions.

mjmac added a commit that referenced this pull request Apr 20, 2024
This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.

The daos_agent configuration file includes new parameters to control
collection and export of per-client telemetry. If the telemetry_port option
is set, then per-client telemetry will be published in Prometheus format
for real-time sampling of client processes. By default, the client telemetry
will be automatically cleaned up on client exit, but may be optionally
retained for some amount of time after client exit in order to allow for
a final sample to be read.

Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_enable: true # enable client telemetry for all connected clients
telemetry_retain: 1m # retain metrics for 1 minute after client exit

If telemetry_enable is false (default), client telemetry may be enabled on
a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the
environment for clients that should collect telemetry.

Notes from the first patch by Di:

Move TLS to common, so both client and server can have TLS
to which metrics can be attached.

Add object metrics on the client side, enabled by
export D_CLIENT_METRICS_ENABLE=1. Client metrics are organized
as "/jobid/pid/xxxxx".

During each DAOS thread's initialization, another shmem (pid/xxx) is
created, to which all of the thread's metrics are attached. These metrics
are destroyed once the thread exits, though if D_CLIENT_METRICS_RETAIN is
set, the client metrics are retained and can be retrieved with
daos_metrics --jobid.

Add D_CLIENT_METRICS_DUMP_PATH to dump metrics from the current thread
once it exits.

Some fixes in telemetry around conv_ptr when re-opening the
shared memory.

Add a daos_metrics --jobid XXX option to retrieve all metrics
of the job.

Required-githooks: true

Change-Id: Ib80ff89f39d259e0dce26e0ae8388318f96a3540
Signed-off-by: Di Wang <[email protected]>
Signed-off-by: Michael MacDonald <[email protected]>
Co-authored-by: Di Wang <[email protected]>
Signed-off-by: Michael MacDonald <[email protected]>
mjmac added a commit that referenced this pull request Apr 21, 2024
mjmac added a commit that referenced this pull request Apr 21, 2024
mjmac added a commit that referenced this pull request Apr 21, 2024
mjmac added a commit that referenced this pull request Apr 22, 2024
mjmac added a commit that referenced this pull request Apr 23, 2024
mjmac added a commit that referenced this pull request Apr 24, 2024
mjmac added a commit that referenced this pull request Apr 29, 2024
This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.

The daos_agent configuration file includes new parameters to control
collection and export of per-client telemetry. If the telemetry_port option
is set, then per-client telemetry will be published in Prometheus format
for real-time sampling of client processes. By default, the client telemetry
will be automatically cleaned up on client exit, but may be optionally
retained for some amount of time after client exit in order to allow for
a final sample to be read.

Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_enable: true # enable client telemetry for all connected clients
telemetry_retain: 1m # retain metrics for 1 minute after client exit

If telemetry_enable is false (default), client telemetry may be enabled on
a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the
environment for clients that should collect telemetry.

Notes from the first patch by Di:

Move TLS to common, so both client and server can have TLS
to which metrics can be attached.

Add object metrics on the client side, enabled by
export D_CLIENT_METRICS_ENABLE=1. Client metrics are organized
as "/jobid/pid/xxxxx".

During each DAOS thread's initialization, another shmem (pid/xxx) is
created, to which all of the thread's metrics are attached. These metrics
are destroyed once the thread exits, though if D_CLIENT_METRICS_RETAIN is
set, the client metrics are retained and can be retrieved with
daos_metrics --jobid.

Add D_CLIENT_METRICS_DUMP_PATH to dump metrics from the current thread
once it exits.

Some fixes in telemetry around conv_ptr when re-opening the
shared memory.

Add a daos_metrics --jobid XXX option to retrieve all metrics
of the job.

Includes some useful ftest updates from the following commit:
* DAOS-11626 test: Adding MD on SSD metrics tests (#13661)
Adding tests for WAL commit, reply, and checkpoint metrics.

Signed-off-by: Phil Henderson <[email protected]>
Signed-off-by: Michael MacDonald <[email protected]>
Signed-off-by: Di Wang <[email protected]>
Co-authored-by: Phil Henderson <[email protected]>
Co-authored-by: Di Wang <[email protected]>