
Explore the use of StackWalker in REST Client #42508

Closed
geoand opened this issue Aug 13, 2024 · 29 comments
Labels: area/rest-client, kind/enhancement, triage/wontfix

Comments

@geoand
Contributor

geoand commented Aug 13, 2024

Description

Currently in the REST Client, when quarkus.rest-client.capture-stacktrace is set to true (which is the default), we capture the entire stack trace of the Thread in order to enhance the debugging experience when something goes wrong.

However, @johnaohara has found that in some cases up to 20% of the user application's CPU time goes into capturing the stack trace. This is not only a waste of CPU cycles, but it also forces the JVM to bring the mutating threads to a safepoint on every call to the API.
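For context, a minimal sketch (not the actual REST Client implementation) of what a full stack capture looks like in Java; the class and method names are illustrative only:

```java
// Rough sketch of why full stack capture is costly: materializing a Throwable
// records every frame of the calling thread on each invocation.
public class FullStackCaptureSketch {

    static StackTraceElement[] captureFullStack() {
        // Filling in a Throwable walks the entire call stack of the current
        // thread; the deeper the application's stack, the more work per call.
        return new Throwable().getStackTrace();
    }

    public static void main(String[] args) {
        System.out.println("Captured " + captureFullStack().length + " frames");
    }
}
```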

Implementation ideas

No response

geoand added the kind/enhancement and area/rest-client labels on Aug 13, 2024

quarkus-bot bot commented Aug 13, 2024

/cc @cescoffier (rest-client)

@johnaohara
Member

Flamegraph of a sample application where quarkus.rest-client.capture-stacktrace is left at its default value of true:

[Screenshot: flamegraph with stack-trace capture enabled]

@geoand
Contributor Author

geoand commented Aug 13, 2024

Gotcha, thanks.

@geoand
Contributor Author

geoand commented Aug 13, 2024

We can also look into caching the part of the stacktrace we care about

geoand added a commit to geoand/quarkus that referenced this issue Aug 14, 2024
This is done by utilizing the StackWalker API
and limiting the number of frames captured

Closes: quarkusio#42508
geoand added a commit to geoand/quarkus that referenced this issue Aug 26, 2024
This is done by utilizing the StackWalker API
and limiting the number of frames captured

Closes: quarkusio#42508
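A minimal sketch of the approach described in the commit message above, walking the stack with the StackWalker API and capping the number of frames kept; the limit of 8 frames and the class name are illustrative, not necessarily what the PR uses:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LimitedStackCapture {

    private static final int MAX_FRAMES = 8; // illustrative limit, not the PR's value

    static List<StackWalker.StackFrame> captureLimitedStack() {
        // walk() exposes the frames as a lazy Stream, so only MAX_FRAMES of them
        // are ever materialized, unlike new Throwable().getStackTrace().
        return StackWalker.getInstance()
                .walk(frames -> frames.limit(MAX_FRAMES).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        captureLimitedStack().forEach(frame ->
                System.out.println(frame.getClassName() + "#" + frame.getMethodName()));
    }
}
```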
@geoand
Contributor Author

geoand commented Aug 26, 2024

@johnaohara would you be able to test #42544?

The reason I ask is that I wasn't able to reproduce the initial results, so I don't have a baseline.

@johnaohara
Member

Hi @geoand, I did not have a chance last week. It is a public holiday here today; I can verify the patch tomorrow.

@geoand
Contributor Author

geoand commented Aug 26, 2024

Thanks!

@geoand
Contributor Author

geoand commented Sep 3, 2024

@johnaohara when you get a chance to look at #42544, I'd be happy to hear about the results :)

@johnaohara
Member

@geoand checking now

@geoand
Contributor Author

geoand commented Sep 3, 2024

🙏🏽

@johnaohara
Member

@geoand quick update on the testing so far. I tried with your PR branch but saw some extra call stacks that I was not expecting, so I backported the change to 3.12.2 (the version that produced the above baseline), and I am still seeing the unexpected stacks; see:

[Screenshot: flamegraph showing the unexpected call stacks]

I am looking at the test now to see why these stacks are showing up in the flamegraph.

@gsmet
Member

gsmet commented Sep 3, 2024

@johnaohara are you talking about the ones from ExtLogRecord?

If so, I reported it here: #42858 and it should be fixed in current main with the update of SmallRye Common to 2.6.0. The PR might need a rebase.

@johnaohara
Member

@gsmet yeah, that is what I was talking about. But I cherry-picked the change onto 3.12.2, so I expected it to disappear.

I just realized what I did wrong; I will try running the test again.

@gsmet
Member

gsmet commented Sep 3, 2024

This particular issue is in all 3.13.x and in 3.14.1. It should be fixed in 3.14.2.

@johnaohara
Member

@geoand sorry about the noise yesterday.
Testing with #42544:

The CPU time for processing the stack frames (in my particular test) dropped by roughly 10% with the new implementation (691 CPU samples -> 623 CPU samples, i.e. a reduction of 68/691 ≈ 9.8%).

Application code went from spending 21.2% of its CPU time processing the stack to 19.2% with the change. Although that is an improvement, there is still considerable overhead on each invocation.

Before:

[Screenshot: flamegraph before the change]

After:

[Screenshot: flamegraph after the change]

@geoand
Contributor Author

geoand commented Sep 4, 2024

Thanks a lot @johnaohara!

Application code went from spending 21.2% of its CPU time processing the stack to 19.2% with the change. Although that is an improvement, there is still considerable overhead on each invocation.

Indeed....

I wonder if we should change the default to not capture the stack... @cescoffier @gsmet WDYT?

@johnaohara
Member

Obviously, setting quarkus.rest-client.capture-stacktrace=false removes all the calls.

I don't know if the stack traces were intended to be used in a production deployment or just for development.
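For reference, a minimal application.properties sketch of the switch mentioned above, assuming the standard Quarkus configuration mechanism:

```properties
# Disables stack-trace capture for REST Client calls (the property discussed above).
quarkus.rest-client.capture-stacktrace=false
```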

@geoand
Contributor Author

geoand commented Sep 4, 2024

They were intended for both

@gsmet
Member

gsmet commented Sep 4, 2024

How confusing is the stack trace without it?

I wonder if we should have the following behavior:

  • Enabled in dev and test mode / disabled in prod mode by default
  • Make sure you have at least the name of the REST Client around even in prod mode - can we at least push this info?

Not sure if it's feasible but that might help?
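As a purely illustrative sketch of this proposal (these are not existing defaults), per-profile values could already be set by users today via Quarkus configuration profiles:

```properties
# Illustrative only: explicit profile-specific settings using the %dev/%test/%prod
# configuration profiles, mirroring the proposed default behaviour.
%dev.quarkus.rest-client.capture-stacktrace=true
%test.quarkus.rest-client.capture-stacktrace=true
%prod.quarkus.rest-client.capture-stacktrace=false
```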

@geoand
Contributor Author

geoand commented Sep 4, 2024

How confusing is the stack trace without it?

It's almost meaningless...

Enabled in dev and test mode / disabled in prod mode by default

Right, I thought of that one as well, but I am still not convinced it's a good idea.

Make sure you have at least the name of the REST Client around even in prod mode - can we at least push this info?

We can do that, yeah. Actually, we already have the class and the method, so although there would be no real stack trace, at least you would know which method is at fault...
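As a hypothetical illustration of that point (the interface, helper, and message below are made up and not the actual Quarkus code), surfacing the class and method without capturing a stack might look like:

```java
import java.lang.reflect.Method;

public class ClientErrorMessageSketch {

    // A made-up client interface standing in for a real REST Client.
    interface GreetingClient {
        String hello();
    }

    // Build an exception message that names the failing client method,
    // without walking or capturing the caller's stack.
    static RuntimeException invocationFailure(Method restClientMethod, Throwable cause) {
        String where = restClientMethod.getDeclaringClass().getName()
                + "#" + restClientMethod.getName();
        return new RuntimeException("REST Client call to " + where + " failed", cause);
    }

    public static void main(String[] args) throws Exception {
        Method m = GreetingClient.class.getMethod("hello");
        System.out.println(
                invocationFailure(m, new java.io.IOException("connection refused")).getMessage());
    }
}
```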

@gsmet
Member

gsmet commented Sep 4, 2024

Yeah, better than nothing and better than slowing down the whole app.

Now, that's not the only place where our stack traces are borderline useless, unfortunately :/

@geoand
Contributor Author

geoand commented Sep 4, 2024

True, so if @cescoffier is also on board, I can make the change

@geoand
Contributor Author

geoand commented Sep 4, 2024

Actually, we already print the method that causes the failure, so nothing needs to be done on that front

@cescoffier
Member

What's the gist of the work that needs to be done?

@geoand
Contributor Author

geoand commented Sep 4, 2024

We would only change the default of the property that controls whether or not we capture the real stack.

@cescoffier
Member

Ah ok, makes sense.

@geoand
Contributor Author

geoand commented Sep 4, 2024

I'll do it tomorrow

@geoand
Contributor Author

geoand commented Sep 5, 2024

#43037 changes the default

@geoand
Contributor Author

geoand commented Sep 5, 2024

I'm going to close this as won't fix, since changing the default pretty much makes this obsolete.

geoand closed this as not planned (won't fix) on Sep 5, 2024
geoand added the triage/wontfix label on Sep 5, 2024