Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] "Operator called default onErrorDropped" message caused by NullPointerException in freed connection #42429

Closed
3 tasks done
haraldk opened this issue Oct 17, 2024 · 8 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Bus

Comments

@haraldk
Copy link
Contributor

haraldk commented Oct 17, 2024

Describe the bug

Occasionally (several times a day) we get two errors in our application's log, first "{"az.sdk.message":"Retry attempts exhausted or exception was not retriable.","exception":"Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null", ..." followed by one or more "Operator called default onErrorDropped" messages.

I don't think messages are actually lost (not processed) due to this, but the message is still very annoying, and as it happens often it creates noise in the logs. It does happen in all our environments.

I think this might be related to #41637 but I was surprised to see that upgrading to the latest servicebus release (7.17.5) did not remove the issue.

Exception or Stack Trace

We get two error messages in our log, only the second contains a stack trace. But I believe they are related:

our-application: ERROR:  {"az.sdk.message":"Retry attempts exhausted or exception was not retriable.","exception":"Cannot invoke \"java.util.List.add(Object)\" because \"this._sessions\" is null","connectionId":"MF_5f89df_1729161349513","entityPath":"$cbs","tryCount":1}
our-application: ERROR:  Operator called default onErrorDropped
reactor.core.Exceptions$ErrorCallbackNotImplemented: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
	at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:91)
	at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:39)
	at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$15(ReactorConnection.java:342)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(Unknown Source)
	at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$16(ReactorConnection.java:339)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.request(MonoIgnoreThen.java:164)
	at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.request(FluxHide.java:152)
	at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:139)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
	at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
	at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:144)
	at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.request(Operators.java:2331)
	at com.azure.core.amqp.implementation.AmqpChannelProcessor.requestUpstream(AmqpChannelProcessor.java:336)
	at com.azure.core.amqp.implementation.AmqpChannelProcessor.lambda$onError$4(AmqpChannelProcessor.java:230)
	at reactor.core.publisher.LambdaMonoSubscriber.onNext(LambdaMonoSubscriber.java:171)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:270)
	at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:285)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

To Reproduce

I don't have exact reproducer code at this time, as the code is quite complicated. We use a standard ServiceBusProcessorClient created from a ServiceBusClientBuilder like this:

new ServiceBusClientBuilder()
                     .connectionString(queueProperties.connectionString())
                     .processor()
                     .queueName(queueProperties.name())
                     .disableAutoComplete()
                     .maxAutoLockRenewDuration(Duration.ofMinutes(10))
                     .processMessage(this::processMessage)
                     .processError(this::processError)
                     .buildProcessorClient()

However, none of our code is actually involved in the stack trace above, so I don't think the contents of processMessage or processError matters.

From what I can see, the only way the _sessions field in org.apache.qpid.proton.engine.impl.ConnectionImpl can be null, is someone first invoking doFree(). Ie. it seems that someone/something tries to create a new session from an already freed connection. This might be due to a race condition, or a simpler logic bug somewhere.

Let me know if you need more information to reproduce.

Code Snippet

No code at the moment.

Expected behavior

I would expect the connection to be freed and a new session created in a new, fresh session, without any exceptions or error messages in the log.

Setup (please complete the following information):

  • OS: n/a, runs in Kubernetes, on AKS
  • IDE: n/a, runs in Kubernetes, on AKS
  • Library/Libraries:
    • com.azure:azure-core:1.53.0
    • com.azure:azure-core-ampq:2.9.10
    • com.azure:azure-core-http-netty:1.15.5
    • com.azure:azure-json:1.3.0
    • com.azure:azure-messaging-servicebus:7.17.5
    • com.azure:azure-xml:1.1.0
    • org.apache.qpid:proton-j:0.34.1
  • Java version: 21
  • App Server/Environment: Tomcat
  • Frameworks: Spring Boot

Information Checklist
Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Bus labels Oct 17, 2024
Copy link

@anuchandy @conniey @lmolkova

Copy link

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@anuchandy
Copy link
Member

Hello @haraldk, it looks like you're encountering the same issue described here: #41736 , but for the Processor. As outlined in there, could you try setting the configuration "com.azure.core.amqp.cache" to "true" for version 7.17.5, i.e.,

new ServiceBusClientBuilder()
                     .connectionString(queueProperties.connectionString())
                     .configuration(new ConfigurationBuilder()
                         .putProperty("com.azure.core.amqp.cache", "true")
                         .build())
                     .processor()
                     .queueName(queueProperties.name())
                     .disableAutoComplete()
                     .maxAutoLockRenewDuration(Duration.ofMinutes(10))
                     .processMessage(this::processMessage)
                     .processError(this::processError)
                     .buildProcessorClient()

(In an upcoming release, the necessity to set this configuration will be removed.)

@anuchandy
Copy link
Member

@haraldk, just to let you know, if you're also using ServiceBusSenderClient, you need to apply the same configuration when building up the sender client as well.

@haraldk
Copy link
Contributor Author

haraldk commented Oct 17, 2024

@anuchandy Thanks! I'll try that out, and see if it helps!

It's not immediately clear to me (even after reading the linked issue) why setting this property to true will resolve our issue, is there somewhere I can read more on this setting? Are there other side effects of setting it, other than fixing the issue?

@anuchandy
Copy link
Member

anuchandy commented Oct 17, 2024

Hello @haraldk, the NPE was traced back to a bug in the internal caching system used by Service Bus and Event Hubs. We have incorporated (as part of September-2024 release) a new cache design which is enabled when the "com.azure.core.amqp.cache" configuration is set. This new internal cache is intended to offer the same functionality as the old one while resolving existing issues (including NPE, thread race conditions, and a rare thread leak), so there shouldn't be any side effect. We plan to delete the old cache in the upcoming releases and make new one default. More about this cache implementation is here if you're interested.

@haraldk
Copy link
Contributor Author

haraldk commented Oct 21, 2024

We've been running with the "com.azure.core.amqp.cache" property enabled in our stage environment during the weekend, and it looks promising! 👍🏻

@anuchandy
Copy link
Member

Glad hear that this issue has been resolved. I will now close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Bus
Projects
None yet
Development

No branches or pull requests

2 participants