Parallelization issues in getLowCardinalityKeyValues #4356
That would be appreciated.
I doubt this is caused by the filter. Check what happens when you stop [...]. I think this is caused by something adding a KeyValue to the [...]. Could you please upgrade Spring and Micrometer first and check if it is still an issue?
Hi @jonatan-ivanov, thanks for taking a look at this. I'm a colleague of roookeee, who is out of office today, so let me jump in.

Both the NullPointerException and the IndexOutOfBoundsException originated from: [...]

I understand this is a super rare case, and thus we only see this phenomenon a few times a day. My assumption is that onTerminalSignal is running in another thread than the observation, hence it can happen that both are modifying/reading the underlying LinkedHashMap simultaneously. Are there other means for us to make sure that usage pattern cannot happen? Like injecting a ConcurrentHashMap, overriding the SimpleObserver, ...?
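The hazard described here can be illustrated without Micrometer at all. The following JDK-only sketch (not Micrometer code) forces a structural modification of a `LinkedHashMap` during iteration in a single thread, which deterministically triggers the fail-fast `ConcurrentModificationException`; under a true cross-thread race, as in this issue, the failure modes are worse and can surface as `ArrayIndexOutOfBoundsException` or `NullPointerException` instead:

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("method", "GET");
        tags.put("status", "200");

        boolean failedFast = false;
        try {
            for (String key : tags.keySet()) {
                // Structural modification while iterating. In the real issue
                // this happens from another thread (onTerminalSignal vs. the
                // observation handler); here we force it deterministically.
                tags.put("outcome", "UNKNOWN");
            }
        } catch (ConcurrentModificationException e) {
            failedFast = true; // LinkedHashMap iterators are fail-fast
        }
        if (!failedFast) {
            throw new AssertionError("expected ConcurrentModificationException");
        }
        System.out.println("LinkedHashMap iterator failed fast as expected");
    }
}
```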
I am seeing a similar problem in app code that had been working great until we upgraded Spring Boot (including an upgrade of Micrometer). The use of observation is encapsulated in a method like so:

```java
private final ObservationRegistry registry = [default provided in Spring Boot];
private final Scheduler customScheduler =
        Schedulers.fromExecutor(ExecutorServiceMetrics.monitor(new ThreadPoolExecutor([at least 10 threads])));

public <T> Mono<T> observe(String name, Mono<T> mono) {
    return Mono.deferContextual(view -> mono.publishOn(customScheduler).contextWrite(view).name(name)
            .tap(observation(registry)).subscribeOn(customScheduler));
}
```

WebFlux controllers call functionality that uses the observe function in various places to wrap IO operations we want to be sure are not happening on the WebFlux threads. The upgrade notably consists of:
We get many exceptions of type ArrayIndexOutOfBoundsException as the MicrometerObservationListener adds tags to an observation while a TracingObservationHandler tries to read them. There is some magic happening above these (the OpenTelemetry agent has instrumented itself into the flow), so perhaps this is a bug in how they have hooked in? Using a debugger with only the current thread suspended, I captured the following:
and at the same time
and also at the same time
I don't believe our code is doing anything it is not supposed to; rather, we have chanced on a bug somewhere between Spring, Micrometer, and OpenTelemetry. Micrometer looks most likely given its proximity in the stack to the calling of the methods here. Any help would be greatly appreciated; it has been great to use this instrumentation.
I believe the use of [...]. The first stacktrace originates from the HTTP server processing the request. Could the use of [...]? I'm not very familiar with Reactor internals; perhaps @chemicL could provide some insight here?
Hey, nothing here rings a bell outside of the suspicious [...].
Looking at the OTel code, the [...]. The [...]. I'm not sure why [...].
I'd have the same guess as @chemicL. It looks like Reactor is being double-instrumented by Micrometer (the [...]).
I disabled the OpenTelemetry Reactor integration and still see the ArrayIndexOutOfBoundsException, but the stack traces are now like so. Add:
Get:
And another get:
Thanks for doing that, @markrichardstm. That clears the way for further investigation :) I believe there is merit in what @ttddyy noticed. The use of [...]

Consider changing the below:

```java
Scheduler customScheduler =
        Schedulers.fromExecutor(ExecutorServiceMetrics.monitor(new ThreadPoolExecutor([at least 10 threads])));
```

to use

```java
Scheduler customScheduler =
        Schedulers.fromExecutorService(ExecutorServiceMetrics.monitor(new ThreadPoolExecutor([at least 10 threads])));
```

I am assuming you can get back an [...]. Why would this help? Executing things in FIFO order and in isolation is essential in providing reactive-streams' guarantees, and the concurrent execution of both operators from the same chain (assuming they're happening against the same instance) should not be the case.
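As I understand the FIFO/isolation point being made here, the guarantee in question is the one a single-threaded `ExecutorService` provides out of the box: submitted tasks run one at a time, in submission order, so two operators from the same chain never execute concurrently against shared state. A JDK-only sketch (no Reactor or Micrometer involved) of that guarantee:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FifoDemo {
    public static void main(String[] args) throws InterruptedException {
        // A single-threaded ExecutorService runs submitted tasks one at a
        // time, in FIFO submission order: no two tasks ever overlap.
        ExecutorService single = Executors.newSingleThreadExecutor();
        List<Integer> order = new CopyOnWriteArrayList<>();
        for (int i = 0; i < 100; i++) {
            final int task = i;
            single.submit(() -> order.add(task));
        }
        single.shutdown();
        if (!single.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new AssertionError("executor did not terminate");
        }
        for (int i = 0; i < 100; i++) {
            if (order.get(i) != i) {
                throw new AssertionError("out of order at index " + i);
            }
        }
        System.out.println("100 tasks ran in FIFO order, never concurrently");
    }
}
```

Note this sketch uses one worker thread for determinism; the multi-threaded pool in the snippet above can start tasks in FIFO order yet still overlap their execution, which is presumably why handing Reactor a proper `ExecutorService` matters.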
Thanks @chemicL
I'm not sure I understand the FIFO comment, as this wraps a single Mono? I don't suppose you have any further documentation on this I could read? We are executing a mono flow that involves multiple calls to (sometimes different) async IO clients, and this means it has to jump threads to avoid blocking. The following is a contrived method, but I managed to see the same concurrent activity using a debugger with this. Note: the [...]

As you can see, this includes no custom schedulers, just the default delay scheduler simulating additional async activity happening.
And in the code that uses [...]. For the [...].
Here is a minimal example that you can play with; I asked someone to test and they could reproduce, so it should work. I had trouble recreating this without [...]. Hopefully the HELP.md explains how to use this, but please note the instructions regarding breakpoints: they must not suspend all threads (else that will prevent race conditions whilst debugging).
@chemicL sorry, I forgot to swap to boundedElastic in that example, but checking today it suffers the problem too. You can swap by replacing [...] with [...].
@markrichardstm I had a look at the project you provided and have some observations (wink, wink).

First, I had a suspicion that the problem is the `observe` method:

```java
public <T> Mono<T> observe(String name, Mono<T> mono) {
    return deferContextual(view -> mono.publishOn(customScheduler).contextWrite(view).name(name)
            .tap(observation(registry)).subscribeOn(customScheduler));
}
```

Let me annotate it a bit:

```java
public <T> Mono<T> observe(String name, Mono<T> mono) {
    return deferContextual(view -> mono.publishOn(customScheduler)
            .contextWrite(view) // <1>
            .name(name)
            .tap(observation(registry)) // <2>
            .subscribeOn(customScheduler));
}
```

We should analyze the chain returned from the [...]. Let's head on to the inner chain: [...]

I believe this is absolutely undesired and breaks the expectations that reactive-streams build upon; it can no longer be thread-agnostic as the state is mutable: [...]

As you can see, this comes from the mistake of replacing the [...]. Even when we fix the above and we also fix the [...]. So what's going on? It's called "previous values". As Reactor processes various things in parallel, it can, and will, happen that a particular [...]. Now, what's happening in the [...]? When the chain is completed, it just adds a new key-value pair to the low-cardinality keys. If this same [...]

There were some recent optimizations (#4959 and #5692), but I don't see them as being the reason for the issue, which is the use of:

```java
Collection<? extends KeyValue> keyValuesCollection = (Collection) keyValues;
return make((KeyValue[]) keyValuesCollection.toArray(new KeyValue[0]));
```

I believe, since recently, the [...]
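The hazard in that `toArray(new KeyValue[0])` call is that a copy is a two-step operation: read the size, then copy that many elements. If another thread mutates the collection between the two steps, the copy indexes past (or short of) the live contents. A deterministic, single-threaded simulation of that interleaving in plain Java (not Micrometer's actual code; the thread-shrinks-the-list direction is chosen here only because it reproduces the out-of-bounds failure reliably):

```java
import java.util.ArrayList;
import java.util.List;

public class SizeThenCopyDemo {
    public static void main(String[] args) {
        List<String> tags = new ArrayList<>(List.of("status=200", "method=GET", "uri=/demo"));

        // Step 1 of a toArray-style copy: snapshot the size.
        int snapshot = tags.size(); // 3

        // Simulate another thread mutating the list between the size read
        // and the element copy (in the real issue: concurrent mutation of
        // the Observation.Context key-values).
        tags.remove(0);

        boolean outOfBounds = false;
        try {
            String[] copy = new String[snapshot];
            for (int i = 0; i < snapshot; i++) {
                copy[i] = tags.get(i); // index 2 no longer exists
            }
        } catch (IndexOutOfBoundsException e) {
            outOfBounds = true;
        }
        if (!outOfBounds) {
            throw new AssertionError("expected IndexOutOfBoundsException");
        }
        System.out.println("size-then-copy interleaving -> IndexOutOfBoundsException");
    }
}
```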
Hello, I will add my two cents: after upgrading to Spring Boot 3.4.0 (which inherits Micrometer 1.14.1) I get this stacktrace quite often.
I think this is related to this issue. Thanks for reading this.
Using a ConcurrentHashMap for the Context key-values (low and high) avoids the issues with concurrent reads and writes. Related benchmarks and concurrency tests were added as well. See gh-4356

Co-authored-by: Jonatan Ivanov <[email protected]>
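The property that makes `ConcurrentHashMap` suitable here is that its iterators are weakly consistent: mutating the map mid-iteration never throws `ConcurrentModificationException`, and reads never observe torn state. A minimal JDK-only sketch of that guarantee (the same loop shape that fails fast on a `LinkedHashMap`):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WeaklyConsistentDemo {
    public static void main(String[] args) {
        Map<String, String> tags = new ConcurrentHashMap<>();
        tags.put("method", "GET");
        tags.put("status", "200");

        // Structural modification while iterating: a ConcurrentHashMap
        // iterator is weakly consistent and never throws
        // ConcurrentModificationException.
        for (String key : tags.keySet()) {
            tags.put("outcome", "UNKNOWN");
        }

        if (tags.size() != 3) {
            throw new AssertionError("expected 3 entries, got " + tags.size());
        }
        System.out.println("iterated and mutated without error: " + tags.size() + " entries");
    }
}
```

The trade-off is that a concurrent reader may or may not see an entry added mid-iteration, which is acceptable for snapshotting observation tags.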
Could those who were seeing this issue please try with Micrometer 1.14.2? We believe this change should have fixed it. |
@shakuzen, I can confirm that the issue from my side is fixed now. |
@shakuzen a quick test suggests it looks okay. I'll know more once we can run larger workloads through, which will likely not be this week.
We'll close this issue for now under the assumption that it is fixed by the mentioned change. If anyone finds that is not the case, please report back here with the details found and we can reopen the issue. |
@shakuzen we have put a larger workload through and not seen this exception, so it looks fixed. Thank you.
Describe the bug
We recently updated to Spring Boot 3.x, which changed its metric tagging to be based on the Micrometer `Observation` object. We are only creating one custom tag with a custom Spring `@Bean`, which was migrated to an `ObservationFilter`: [...]

We are getting the following errors (see collapsible sections below) once in a while. We suspect they are related to HTTP connection abortions, as they happen rarely and are not correlated to service load.
- java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds for length 6
- java.lang.NullPointerException: Cannot invoke "java.lang.Comparable.compareTo(Object)" because "a[runHi]" is null
We dug a bit and determined that the index out of bounds error can only occur when `addLowCardinalityKeyValue` and `getLowCardinalityKeyValues` are called in parallel; see the following Micrometer `Observation.Context.getLowCardinalityKeyValues()` code flow and inline comments: [...]

We found a closed issue for the other error, but that issue was closed without solving the error for us.
Environment
Minimal reproducer
We are trying to create a minimal reproducer, but this error occurs only about 10 times over a period of 24 hours with a constant load of > 300 requests/second and a peak load of around 1,500 requests/second (across 4 instances of our service).
Expected behavior
Adding custom tags via an `ObservationFilter` should not create errors.

Additional context
We currently expect this issue to occur when an in-flight HTTP request of a client is terminated abnormally. We want to tag such failing requests and their metrics anyway; we just thought this might help find the root cause.