OOM in Quarkus 3.0.0.Beta1 caused by okio via OpenTelemetry #32238
/cc @brunobat (opentelemetry), @radcortez (opentelemetry)
Hi @snazy, not sure what kind of tests you are performing that lead to an OOM... I would need to have an example.
This makes sure everything is flushed in a timely fashion because the defaults are tuned for prod. See: Please let me know if this works for you.
It's a legit bug caused by the usage of okio/okhttp with that watchdog thread in tests. It's not about flushing: the watchdog thread never terminates and keeps all classes on the heap, eventually producing an OOM. Can repro at will:
You can also attach a debugger and check that the Watchdog thread keeps running and never terminates.
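As an alternative to a debugger, a test hook could check programmatically whether the watchdog thread is still alive. This is a minimal sketch, not part of Quarkus or okio: the class `LingeringThreads` and its methods are hypothetical, and it assumes okio names its watchdog thread "Okio Watchdog" (verify that against the okio version on your test classpath). Besides finding the thread, it prints the class loaders the live thread keeps reachable: the loader that defined the thread's class and its context class loader.

```java
import java.util.List;
import java.util.stream.Collectors;

public final class LingeringThreads {

    // Assumption: okio names its watchdog thread "Okio Watchdog".
    // Check this against the okio version on your test classpath.
    public static final String OKIO_WATCHDOG_NAME = "Okio Watchdog";

    private LingeringThreads() {}

    /** Returns every live thread whose name equals {@code name}. */
    public static List<Thread> findLiveThreads(String name) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .filter(t -> name.equals(t.getName()))
                .collect(Collectors.toList());
    }

    /** Prints, for each match, the class loaders the running thread keeps reachable. */
    public static void report(String name) {
        for (Thread t : findLiveThreads(name)) {
            System.out.printf("%s daemon=%b definingLoader=%s contextLoader=%s%n",
                    t.getName(), t.isDaemon(),
                    t.getClass().getClassLoader(), t.getContextClassLoader());
        }
    }

    public static void main(String[] args) {
        report(OKIO_WATCHDOG_NAME);
    }
}
```

Calling `report(...)` from a JUnit `@AfterAll` hook, for example, would show whether watchdog threads, each tied to a different test class loader, accumulate over a run.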
What memory settings are you using, @snazy?
At most a 1.5G heap IIRC, maybe less.
I cannot run the tests. I get a "too many open files" error before I reach any OOM.
There are not that many test classes; it OOMs pretty much "at the end".
I asked @alesj to please take a look.
I already got this on the
Whereas this never stopped: https://gist.github.com/alesj/60d7467f85644dd639869f3c23833818 ... 42 min and it would have kept going. @snazy, is there an easier way to reproduce this, without needing the whole Nessie project?
Probably: you could create a test project with many test classes, each producing an OTel event.
@snazy what if you set this for the
@snazy, any progress or luck with this?
Sorry, no luck so far. It's pretty clear that it's caused by okio; this is already known and has been hit before. I think the current idea is to remove okio. It's "only" a test issue (prod-like runs don't reload stuff).
So I assume the workaround we need is to have our own implementation of
Yes, and not just for this reason: we need to drop OkHttp for support reasons.
Good point.
So I assume this is one for @alesj due to the need for gRPC.
Yes, it would be nice.
I might be misreading
The OkHttp library can do both gRPC and HTTP; there is a config for that.
Yeah, I see that. My point is that in reality, And actually this is true, as the description of the original PR that introduced the class above says. |
Correct.
I saw that too, yeah. But from a quick read-through, it doesn't seem to change anything gRPC-related.
According to this, it's not possible to use the JDK HTTP Client because there is no trailing header support. @cescoffier @alesj I assume we could leverage the Vert.x
That's actually my preference as well.
What do you mean by trailing headers? Trailers? gRPC often uses trailers, and we do support that.
@cescoffier yeah, trailers.
So it should be possible to use the pure
Yes, it should be fine.
Cool, thanks.
I assume https://vertx.io/docs/vertx-grpc/java/#_message_level_api_2 is what we are after. I'll try to have a look next week.
I have a prototype here. I think it's a pretty good starting point; if you agree, I can move it forward. It should definitely be reviewed and improved by the Vert.x / gRPC / OTel folks.
I got native mode working, so now I think it's more a matter of hardening.
Includes a workaround for the Quarkus 3/OTel/okio OOM in tests. See quarkusio/quarkus#32238
Replace OkHttp tracing backend with Vert.x
Describe the bug
okio starts a thread via the class okio.AsyncTimeout.Watchdog once "it has to deal with a timeout". The thread is configured as a daemon thread, seems to have some shutdown logic, and also aborts when it's interrupted. This behavior isn't an issue in production use cases, but it is an issue when running tests.
okio.AsyncTimeout.Watchdog is loaded by a Quarkus class loader for every test, so the running thread implicitly keeps a reference to that class loader and, transitively, to all the resources it holds, just because the thread is still running. Setting the following parameters disables timeouts, so the watchdog thread is never started and the OOM doesn't happen. It may be a legit workaround for tests until the issue is fixed.
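The pinning described above can be illustrated without okio. A live thread is a GC root, so anything it references stays reachable; and the daemon flag only lets the JVM exit without waiting for the thread, it does nothing to stop the thread while a test JVM keeps running. The sketch below is a simplified stand-in, not okio's code: `PerTestLoader`, `startWatchdogLike` and `simulateTest` are hypothetical names. It shows the context-class-loader path (a new thread copies its creator's context class loader), because that path can be demonstrated with plain JDK classes; in the case above, the loader is also pinned because it defined okio.AsyncTimeout.Watchdog itself.

```java
public final class ClassLoaderPinningDemo {

    /** Hypothetical stand-in for the class loader a test framework creates per test. */
    static final class PerTestLoader extends ClassLoader {
        PerTestLoader(ClassLoader parent) {
            super(parent);
        }
    }

    /**
     * Starts a daemon thread that blocks until interrupted, mimicking a long-lived
     * watchdog. A new thread copies its creator's context class loader when constructed.
     */
    static Thread startWatchdogLike() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // stands in for "wait for the next timeout"
            } catch (InterruptedException e) {
                // Returning here ends the thread, so it no longer pins anything.
            }
        }, "watchdog-like");
        t.setDaemon(true); // only affects JVM exit, not the lifetime between tests
        t.start();
        return t;
    }

    /** Simulates one test: installs the per-test loader, starts the thread, restores the loader. */
    static Thread simulateTest(ClassLoader perTestLoader) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(perTestLoader);
        try {
            return startWatchdogLike();
        } finally {
            // The test is over, but the started thread keeps its own reference to perTestLoader.
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PerTestLoader loader = new PerTestLoader(ClassLoaderPinningDemo.class.getClassLoader());
        Thread watchdog = simulateTest(loader);
        System.out.println("alive after test: " + watchdog.isAlive());
        System.out.println("still references per-test loader: "
                + (watchdog.getContextClassLoader() == loader));
        watchdog.interrupt();
        watchdog.join();
        System.out.println("alive after interrupt: " + watchdog.isAlive());
    }
}
```

Repeating `simulateTest` once per test class, without ever stopping the threads, is the pattern that lets per-test loaders pile up on the heap.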
This behavior seems to have been introduced after Quarkus 3.0.0.Alpha5, but I'm not sure why, because okio seems to have behaved this way "forever".
Expected behavior
No response
Actual behavior
No response
How to Reproduce?
No response
Output of uname -a or ver
No response
Output of java -version
No response
GraalVM version (if different from Java)
No response
Quarkus version or git rev
No response
Build tool (i.e. output of mvnw --version or gradlew --version)
No response
Additional information
No response