Backport of operator debug: fix pprof interval handling into release/1.7.x #20216

hc-github-team-nomad-core
Backport

This PR is auto-generated from #20206 to be assessed for backporting due to the inclusion of the label backport/1.7.x.

The below text is copied from the body of the original PR.


The nomad operator debug command saves a CPU profile for each interval, and names these files based on the interval.

The same function also takes a goroutine profile, heap profile, etc., but it is missing the logic to interpolate the interval into the file name. As a result, the operator debug command makes potentially many expensive profile requests and then overwrites the data from each one. Update the command to save every profile it scrapes, and number the files the same way as the existing CPU profile.
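As a rough illustration of the naming scheme described above, here is a minimal Go sketch. The helper `profileFileName` and its parameters are hypothetical and are not taken from the Nomad source; the only thing it mirrors is the `<name>_<NNNN>.<ext>` pattern visible in the file tree further down.

```go
package main

import "fmt"

// profileFileName is a hypothetical helper (not Nomad's actual function).
// It builds a file name such as "heap_0001.prof" by appending a zero-padded,
// four-digit interval number to the profile name, so that each interval's
// capture gets its own file instead of overwriting the previous one.
func profileFileName(name string, interval int, ext string) string {
	return fmt.Sprintf("%s_%04d.%s", name, interval, ext)
}

func main() {
	// Four pprof intervals, numbered from zero.
	for interval := 0; interval < 4; interval++ {
		fmt.Println(profileFileName("heap", interval, "prof"))
	}
	// Text dumps use the same numbering with a different extension.
	fmt.Println(profileFileName("goroutine-debug1", 2, "txt"))
}
```

Without the interval in the name, every capture of a given profile type would be written to the same path, which is the overwrite behavior this PR removes.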

Additionally, the command flags -pprof-interval and -pprof-duration were validated backwards. This meant that -pprof-interval was always coerced to equal -pprof-duration, so only a single profile was ever taken, at the start of the bundle. Correct the check and change the defaults to be more sensible.

Fixes: #20151


In addition to fixing up the tests as needed, I've tested this locally as follows.

$ nomad operator debug -duration 1m -stale=true -node-id=61c19030  -log-level=trace -pprof-interval=15s
Starting debugger...

Nomad CLI Version: Nomad v1.7.7-dev
BuildDate 2024-03-22T19:40:43Z
Revision f91127fc5492d930dc61af54fd4f1f2a6f01f109+CHANGES
           Region:
        Namespace:
          Servers: (1/1) [continuity.global]
          Clients: (1/1) [61c19030-0ba7-6927-d9b9-a5b9df52f4e4]
         Interval: 30s
         Duration: 1m
   pprof Interval: 15s

Capturing cluster data...
Consul - Skipping, no API address found
    Capture pprofInterval 0000
    Capture interval 0000
    Capture pprofInterval 0001
    Capture interval 0001
    Capture pprofInterval 0002
    Capture pprofInterval 0003
Created debug archive: nomad-debug-2024-03-22-200116Z.tar.gz

This results in the following file tree:

$ tar -xf nomad-debug-2024-03-22-200116Z.tar.gz
$ tree nomad-debug-2024-03-22-200116Z
nomad-debug-2024-03-22-200116Z
├── client
│   └── 61c19030-0ba7-6927-d9b9-a5b9df52f4e4
│       ├── agent-host.json
│       ├── allocs_0000.prof
│       ├── allocs_0001.prof
│       ├── allocs_0002.prof
│       ├── allocs_0003.prof
│       ├── goroutine_0000.prof
│       ├── goroutine_0001.prof
│       ├── goroutine_0002.prof
│       ├── goroutine_0003.prof
│       ├── goroutine-debug1_0000.txt
│       ├── goroutine-debug1_0001.txt
│       ├── goroutine-debug1_0002.txt
│       ├── goroutine-debug1_0003.txt
│       ├── goroutine-debug2_0000.txt
│       ├── goroutine-debug2_0001.txt
│       ├── goroutine-debug2_0002.txt
│       ├── goroutine-debug2_0003.txt
│       ├── heap_0000.prof
│       ├── heap_0001.prof
│       ├── heap_0002.prof
│       ├── heap_0003.prof
│       ├── monitor.log
│       ├── profile_0000.prof
│       ├── profile_0001.prof
│       ├── profile_0002.prof
│       ├── profile_0003.prof
│       ├── threadcreate_0000.prof
│       ├── threadcreate_0001.prof
│       ├── threadcreate_0002.prof
│       ├── threadcreate_0003.prof
│       ├── trace_0000.prof
│       ├── trace_0001.prof
│       ├── trace_0002.prof
│       └── trace_0003.prof
├── cluster
│   ├── agent-self.json
│   ├── cli-flags.json
│   ├── eventstream.json
│   ├── leader.json
│   ├── members.json
│   ├── namespaces.json
│   ├── nodes.json
│   └── regions.json
├── index.html
├── index.json
├── interval
│   ├── 0000
│   │   ├── allocations.json
│   │   ├── csi-plugins.json
│   │   ├── csi-volumes.json
│   │   ├── deployments.json
│   │   ├── evaluations.json
│   │   ├── jobs.json
│   │   ├── license.json
│   │   ├── metrics.json
│   │   ├── nodes.json
│   │   ├── operator-autopilot-health.json
│   │   ├── operator-raft.json
│   │   └── operator-scheduler.json
│   └── 0001
│       ├── allocations.json
│       ├── csi-plugins.json
│       ├── csi-volumes.json
│       ├── deployments.json
│       ├── evaluations.json
│       ├── jobs.json
│       ├── license.json
│       ├── metrics.json
│       ├── nodes.json
│       ├── operator-autopilot-health.json
│       ├── operator-raft.json
│       └── operator-scheduler.json
└── server
    └── continuity.global
        ├── agent-host.json
        ├── allocs_0000.prof
        ├── allocs_0001.prof
        ├── allocs_0002.prof
        ├── allocs_0003.prof
        ├── goroutine_0000.prof
        ├── goroutine_0001.prof
        ├── goroutine_0002.prof
        ├── goroutine_0003.prof
        ├── goroutine-debug1_0000.txt
        ├── goroutine-debug1_0001.txt
        ├── goroutine-debug1_0002.txt
        ├── goroutine-debug1_0003.txt
        ├── goroutine-debug2_0000.txt
        ├── goroutine-debug2_0001.txt
        ├── goroutine-debug2_0002.txt
        ├── goroutine-debug2_0003.txt
        ├── heap_0000.prof
        ├── heap_0001.prof
        ├── heap_0002.prof
        ├── heap_0003.prof
        ├── monitor.log
        ├── profile_0000.prof
        ├── profile_0001.prof
        ├── profile_0002.prof
        ├── profile_0003.prof
        ├── threadcreate_0000.prof
        ├── threadcreate_0001.prof
        ├── threadcreate_0002.prof
        ├── threadcreate_0003.prof
        ├── trace_0000.prof
        ├── trace_0001.prof
        ├── trace_0002.prof
        └── trace_0003.prof

8 directories, 102 files

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-operator-debug-interval/absolutely-desired-cardinal branch from cc95f11 to 2d86e81 Compare March 25, 2024 13:01
@tgross tgross merged commit 9f3c1d5 into release/1.7.x Mar 25, 2024
19 of 21 checks passed
@tgross tgross deleted the backport/b-operator-debug-interval/absolutely-desired-cardinal branch March 25, 2024 13:24
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 15, 2025