Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generate a single debug file for a long duration capture #10279

Merged
merged 32 commits into from
Jun 7, 2021

Conversation

dhiaayachi
Copy link
Collaborator

Current behaviour when generating dynamic debug files is to create multiple files in subfolders that represent each capture interval that make it harder to correlate logs and debug

This PR is to:

  • change the behaviour to create a single file for each captured item and append all the data under that file, the single file will still be under a subfolder with the capture starting timestamp (need to be discussed)
  • simplify the capture loop by using context to handle timeouts

dnephin and others added 3 commits May 20, 2021 15:49
The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.

Also:
- fix the API client to return errors on non-200 status codes for debug
  endpoints
- improve the failure messages when pprof data can not be collected

Co-Authored-By: Dhia Ayachi <[email protected]>
parallel runs create a race condition that fail the debug tests
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist
@github-actions github-actions bot added theme/api Relating to the HTTP API interface theme/cli Flags and documentation for the CLI interface labels May 21, 2021
@dhiaayachi dhiaayachi marked this pull request as draft May 21, 2021 20:10
@dhiaayachi dhiaayachi changed the title WIP: generate a single debug file for a long duration capture generate a single debug file for a long duration capture May 21, 2021
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thank you for working on this!

This is a bit different then what I was imagining. I'm actually not sure if these file formats are safe for concatenation. Maybe they are?

I think this might work, but collecting a single profile might also help us miss fewer things. I was thinking that we could split up captureDynamic and make a single API call for each of the profiles.

Looking at https://golang.org/pkg/net/http/pprof/ , we can specify the duration using the seconds parameter. That way we can specify seconds=120 (or whatever duration is set using the command line flag).

I guess that will also require changes to the HTTP client in api/debug, because right now it looks like it is not accepting any parameters.

@dhiaayachi
Copy link
Collaborator Author

Thank you @dnephin,
I was not sure if it's the right way of doing it neither. I will read more about the pprof api and attempt to do the change as you described it

@dhiaayachi
Copy link
Collaborator Author

dhiaayachi commented May 25, 2021

I went through the code to understand what is the current behaviour and my findings are the following:

  • 3 different type of data in the captureDynamic func are captured (logs, metrics, pprof)
  • the capture is done for a time span (config parameter interval)
  • we create a new file for each interval capture time and we do that for the duration parameter
  • the duration parameter have a grace period of 2 seconds added to it

So if we do a capture with duration=200ms and an interval=100ms we will end up with 22 folders ((2s + 200ms)/100ms) each having a file with the data for each type of capture we specified (logs, metrics, pprof)

To achieve what is stated in this PR we need to change the logic because capturing everything in the same file defy the logic of having a duration and an interval.

one option would be to simplify the usage and have interval = duration if not specified so we capture a single set of files for the total duration specified (including the grace period)

@dnephin I will need your insights on this one as I'm not sure what we want

In the meantime, I reverted the change I made and did some refactoring to clean up the code a bit

@vercel vercel bot temporarily deployed to Preview – consul May 25, 2021 16:40 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging May 25, 2021 16:40 Inactive
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist
@hashicorp-cla
Copy link

hashicorp-cla commented May 27, 2021

CLA assistant check
All committers have signed the CLA.

@dhiaayachi dhiaayachi changed the base branch from master to dhia/node-name-default-value May 27, 2021 19:05
@dhiaayachi dhiaayachi changed the base branch from dhia/node-name-default-value to master May 27, 2021 19:05
@vercel vercel bot temporarily deployed to Preview – consul May 31, 2021 17:27 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging May 31, 2021 17:27 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging May 31, 2021 17:28 Inactive
@vercel vercel bot temporarily deployed to Preview – consul May 31, 2021 17:28 Inactive
@vercel vercel bot temporarily deployed to Preview – consul May 31, 2021 18:38 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging May 31, 2021 18:38 Inactive
@vercel vercel bot temporarily deployed to Preview – consul June 4, 2021 13:01 Inactive
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good.

Not sure if other reviewers want to take another pass - might be worth a ping to see but I'll approve this 😄 .

There was one integration test failure that looks like a flake and I just re-ran it. If it fails again it might be worth double-checking but I expect it's unrelated as I don't think we even run this command in envoy integration tests anywhere.

website/content/commands/debug.mdx Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul June 7, 2021 13:15 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging June 7, 2021 13:15 Inactive
Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, couple minor things, but nothing blocking

go.mod Outdated Show resolved Hide resolved
command/debug/debug.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging June 7, 2021 16:28 Inactive
@vercel vercel bot temporarily deployed to Preview – consul June 7, 2021 16:28 Inactive
@vercel vercel bot temporarily deployed to Preview – consul June 7, 2021 16:30 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging June 7, 2021 16:30 Inactive
@dhiaayachi dhiaayachi merged commit 005ad9e into master Jun 7, 2021
@dhiaayachi dhiaayachi deleted the dhia/debug-generate-single-file branch June 7, 2021 17:00
@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/381945.

@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/381978.

@hc-github-team-consul-core
Copy link
Collaborator

🍒✅ Cherry pick of commit 005ad9e onto release/1.10.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Jun 7, 2021
* debug: remove the CLI check for debug_enabled

The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.

Also:
- fix the API client to return errors on non-200 status codes for debug
  endpoints
- improve the failure messages when pprof data can not be collected

Co-Authored-By: Dhia Ayachi <[email protected]>

* remove parallel test runs

parallel runs create a race condition that fail the debug tests

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* extract wait group outside the go routine to avoid a race condition

* capture pprof in a separate go routine

* perform a single capture for pprof data for the whole duration

* add missing vendor dependency

* add a change log and fix documentation to reflect the change

* create function for timestamp dir creation and simplify error handling

* use error groups and ticker to simplify interval capture loop

* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval

* refactor Logs capture routine and add log capture specific test

* improve error reporting when log test fail

* change test duration to 1s

* make time parsing in log line more robust

* refactor log time format in a const

* test on log line empty the earliest possible and return

Co-authored-by: Freddy <[email protected]>

* rename function to captureShortLived

* more specific changelog

Co-authored-by: Paul Banks <[email protected]>

* update documentation to reflect current implementation

* add test for behavior when invalid param is passed to the command

* fix argument line in test

* a more detailed description of the new behaviour

Co-authored-by: Paul Banks <[email protected]>

* print success right after the capture is done

* remove an unnecessary error check

Co-authored-by: Daniel Nephin <[email protected]>

* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22

Co-authored-by: Daniel Nephin <[email protected]>
Co-authored-by: Freddy <[email protected]>
Co-authored-by: Paul Banks <[email protected]>
@hc-github-team-consul-core
Copy link
Collaborator

🍒✅ Cherry pick of commit 005ad9e onto release/1.9.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Jun 7, 2021
* debug: remove the CLI check for debug_enabled

The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.

Also:
- fix the API client to return errors on non-200 status codes for debug
  endpoints
- improve the failure messages when pprof data can not be collected

Co-Authored-By: Dhia Ayachi <[email protected]>

* remove parallel test runs

parallel runs create a race condition that fail the debug tests

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* extract wait group outside the go routine to avoid a race condition

* capture pprof in a separate go routine

* perform a single capture for pprof data for the whole duration

* add missing vendor dependency

* add a change log and fix documentation to reflect the change

* create function for timestamp dir creation and simplify error handling

* use error groups and ticker to simplify interval capture loop

* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval

* refactor Logs capture routine and add log capture specific test

* improve error reporting when log test fail

* change test duration to 1s

* make time parsing in log line more robust

* refactor log time format in a const

* test on log line empty the earliest possible and return

Co-authored-by: Freddy <[email protected]>

* rename function to captureShortLived

* more specific changelog

Co-authored-by: Paul Banks <[email protected]>

* update documentation to reflect current implementation

* add test for behavior when invalid param is passed to the command

* fix argument line in test

* a more detailed description of the new behaviour

Co-authored-by: Paul Banks <[email protected]>

* print success right after the capture is done

* remove an unnecessary error check

Co-authored-by: Daniel Nephin <[email protected]>

* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22

Co-authored-by: Daniel Nephin <[email protected]>
Co-authored-by: Freddy <[email protected]>
Co-authored-by: Paul Banks <[email protected]>
@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit 005ad9e onto release/1.8.x failed! Build Log

dhiaayachi added a commit that referenced this pull request Jun 7, 2021
* debug: remove the CLI check for debug_enabled

The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.

Also:
- fix the API client to return errors on non-200 status codes for debug
  endpoints
- improve the failure messages when pprof data can not be collected

Co-Authored-By: Dhia Ayachi <[email protected]>

* remove parallel test runs

parallel runs create a race condition that fail the debug tests

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* extract wait group outside the go routine to avoid a race condition

* capture pprof in a separate go routine

* perform a single capture for pprof data for the whole duration

* add missing vendor dependency

* add a change log and fix documentation to reflect the change

* create function for timestamp dir creation and simplify error handling

* use error groups and ticker to simplify interval capture loop

* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval

* refactor Logs capture routine and add log capture specific test

* improve error reporting when log test fail

* change test duration to 1s

* make time parsing in log line more robust

* refactor log time format in a const

* test on log line empty the earliest possible and return

Co-authored-by: Freddy <[email protected]>

* rename function to captureShortLived

* more specific changelog

Co-authored-by: Paul Banks <[email protected]>

* update documentation to reflect current implementation

* add test for behavior when invalid param is passed to the command

* fix argument line in test

* a more detailed description of the new behaviour

Co-authored-by: Paul Banks <[email protected]>

* print success right after the capture is done

* remove an unnecessary error check

Co-authored-by: Daniel Nephin <[email protected]>

* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22

Co-authored-by: Daniel Nephin <[email protected]>
Co-authored-by: Freddy <[email protected]>
Co-authored-by: Paul Banks <[email protected]>
dhiaayachi added a commit that referenced this pull request Jun 7, 2021
…0360)

* debug: remove the CLI check for debug_enabled

The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.

Also:
- fix the API client to return errors on non-200 status codes for debug
  endpoints
- improve the failure messages when pprof data can not be collected

Co-Authored-By: Dhia Ayachi <[email protected]>

* remove parallel test runs

parallel runs create a race condition that fail the debug tests

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* snapshot the timestamp at the beginning of the capture

- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures
- capture append to the file if it already exist

* Revert "snapshot the timestamp at the beginning of the capture"

This reverts commit c2d0334

* Refactor captureDynamic to extract capture logic for each item in a different func

* extract wait group outside the go routine to avoid a race condition

* capture pprof in a separate go routine

* perform a single capture for pprof data for the whole duration

* add missing vendor dependency

* add a change log and fix documentation to reflect the change

* create function for timestamp dir creation and simplify error handling

* use error groups and ticker to simplify interval capture loop

* Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval

* refactor Logs capture routine and add log capture specific test

* improve error reporting when log test fail

* change test duration to 1s

* make time parsing in log line more robust

* refactor log time format in a const

* test on log line empty the earliest possible and return

Co-authored-by: Freddy <[email protected]>

* rename function to captureShortLived

* more specific changelog

Co-authored-by: Paul Banks <[email protected]>

* update documentation to reflect current implementation

* add test for behavior when invalid param is passed to the command

* fix argument line in test

* a more detailed description of the new behaviour

Co-authored-by: Paul Banks <[email protected]>

* print success right after the capture is done

* remove an unnecessary error check

Co-authored-by: Daniel Nephin <[email protected]>

* upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22

Co-authored-by: Daniel Nephin <[email protected]>
Co-authored-by: Freddy <[email protected]>
Co-authored-by: Paul Banks <[email protected]>

Co-authored-by: Daniel Nephin <[email protected]>
Co-authored-by: Freddy <[email protected]>
Co-authored-by: Paul Banks <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/api Relating to the HTTP API interface theme/cli Flags and documentation for the CLI interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

debug: improvements to the debug command
6 participants