[BUG] ServiceBusReceiverClient stops consuming messages after some time though messages are present in subscription. #26465
Comments
We have experienced the exact same issue, but we are using the Spring Cloud Stream binder. From our analysis, the issue occurs when we receive a non-retriable AMQP exception, which we have been able to correlate with internal server errors on the Service Bus; the last time this occurred there was server downtime of over an hour on the Azure side. The disconnect happens across all applications and pods at the exact same time when these errors occur. Currently we are trying to set up a health check on the binder in order to restart the pods when this occurs, as a temporary workaround while the issue persists. From what we can see, there is no adequate way of attaching our own logic to the health check. The ServiceBusQueueHealthIndicator that was added in a recent release only verifies that the binder managed to connect once and keeps returning "UP" even after we lose the connection to the Service Bus. Is there any way of overriding this health indicator class to create a more elaborate health check?
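For what it's worth, here is a minimal sketch of the kind of active-probe health indicator we have in mind, assuming Spring Boot Actuator and a dedicated synchronous ServiceBusReceiverClient used only for probing; the bean name and how Spring would prioritize it relative to the binder's own ServiceBusQueueHealthIndicator are assumptions, not something we have verified:

import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Active-probe indicator: performs a round trip to Service Bus on every health check
// instead of relying on a one-time "managed to connect" flag.
@Component("serviceBusActiveProbe")
public class ServiceBusActiveProbeHealthIndicator implements HealthIndicator {

    private final ServiceBusReceiverClient probeReceiver; // dedicated sync receiver, used only for probing

    public ServiceBusActiveProbeHealthIndicator(ServiceBusReceiverClient probeReceiver) {
        this.probeReceiver = probeReceiver;
    }

    @Override
    public Health health() {
        try {
            // peekMessage() forces a call to the service without locking or settling anything;
            // an empty entity is fine, we only care that the call itself succeeds.
            probeReceiver.peekMessage();
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}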
@anuchandy @ki1729, can you please follow up?
I think we have found a fix for this, but I'm not sure where to report it, so I'm replying on this issue report; happy to elaborate elsewhere. We're also seeing this issue when using ServiceBusReceiverAsyncClient. We also witnessed that this issue does not occur with ServiceBusProcessorClient. In our own project we have fixed this, in a very dirty way using reflection, where we also check the ServiceBusConnectionProcessor's isChannelClosed() and recreate the client when the channel is closed. We're now not seeing this issue any more in production. Happy to discuss further, or elaborate more if there are any doubts or questions. If I have any more details I'll report back.
Hello @nomisRev, underneath the ServiceBusProcessorClient is the same ServiceBusReceiverAsyncClient, so both share the same receive pipeline. Could you enable the SDK logging and share the logs from around the time the client stops receiving?
The logging instruction is here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/docs/troubleshooting.md
Hey @anuchandy,
Could you point me to this code? I could not find it in the SDK, and as I mentioned in the comment above, porting the fix from ServiceBusProcessorClient required going through reflection.
At no point is an error being received in receiveMessages(); the Flux simply stops emitting messages without any terminal signal.
We do have logging enabled, both custom logging and Azure AppInsights, but comparing with the other reports of this issue we cannot find any additional logging that points in the direction of any other problem. I will check again next week and share logs here. I'm also going to add additional logging to track the channel state.
Hi @nomisRev, can you please share the code snippet where you check the ServiceBusConnectionProcessor of the ServiceBusReceiverAsyncClient and close/recreate the ServiceBusReceiverAsyncClient when the channel is closed? We are facing the same issue: after some time the ServiceBusReceiverAsyncClient stops receiving messages.
Hey @maksatbolatov, of course. I have the code written in Kotlin with some custom things, but here is a translation of the code to Project Reactor in Java. If you have any questions I'd be happy to help :)
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient;
import com.azure.messaging.servicebus.implementation.ServiceBusConnectionProcessor;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import java.lang.reflect.Field;
import java.time.Duration;
class WorkAround {

    // Receive until the underlying AMQP channel is detected as closed, then resubscribe with
    // a fresh receive Flux. Flux.defer is needed so the recursive call is evaluated lazily on
    // resubscription instead of eagerly at assembly time (which would recurse forever).
    public static Flux<ServiceBusReceivedMessage> receiveMessagesForever(ServiceBusReceiverAsyncClient client) {
        return client.receiveMessages()
            .takeUntilOther(connectionClosed(client))
            .concatWith(Flux.defer(() -> receiveMessagesForever(client)));
    }
    // Extract the internal ServiceBusConnectionProcessor via reflection; it is not part of the
    // public API, so this can break with any SDK release.
    public static Mono<ServiceBusConnectionProcessor> conn(ServiceBusReceiverAsyncClient client) {
        return Mono.fromCallable(() -> {
            Field field = client.getClass().getDeclaredField("connectionProcessor");
            field.setAccessible(true);
            return (ServiceBusConnectionProcessor) field.get(client);
        });
    }
    // Poll the channel state every 30 seconds and complete once the channel reports closed.
    // takeUntil completes when the predicate returns true, so the predicate must be the closed
    // state itself (negating it would complete on the first check while the channel is open).
    public static Mono<Void> connectionClosed(ServiceBusReceiverAsyncClient client) {
        return conn(client).flatMap((connection) ->
            Flux.interval(Duration.ofSeconds(30))
                .map((ignored) -> connection.isChannelClosed())
                .takeUntil((isChannelClosed) -> isChannelClosed)
                .then()
        );
    }
}
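For completeness, a minimal usage sketch of the class above; the connection-string source and the topic/subscription names are placeholders, and the handler simply completes each message:

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient;

class WorkAroundUsage {
    public static void main(String[] args) throws InterruptedException {
        ServiceBusReceiverAsyncClient client = new ServiceBusClientBuilder()
            .connectionString(System.getenv("SB_CONN_STR"))  // placeholder environment variable
            .receiver()
            .topicName("mytopic")                            // placeholder entity names
            .subscriptionName("mysub")
            .buildAsyncClient();

        WorkAround.receiveMessagesForever(client)
            .flatMap(message -> client.complete(message).thenReturn(message))
            .subscribe(message -> System.out.println("Handled " + message.getMessageId()));

        // Keep the JVM alive so the subscription keeps receiving.
        Thread.sleep(Long.MAX_VALUE);
    }
}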
We are seeing similar issues in our production environment, where we have an Azure App Service connected to Service Bus with multiple queues either sending to, receiving from, or both, depending on the queue and business flow. We are using the ServiceBusSenderAsyncClient and ServiceBusReceiverAsyncClient in this App Service. We have had 4 randomly occurring occasions where a queue just starts filling up with messages and the receiver is no longer picking up messages. We get alerted because we have monitoring alerts set up, but at this time the quick "fix" is to restart the App Service, after which it starts picking up messages again. We haven't been able to figure out exactly what is causing the connection to be randomly lost, but regardless, we would expect that even if there were a network issue or a Service Bus server error, the ServiceBusReceiverAsyncClient would be able to reestablish a connection. I have found a way to replicate this in our nonprod environment by doing the following: disable the queue in the Azure portal while the application is running and receiving, wait several seconds, re-enable the queue, and then send a message to it.
At this point, sending the message to the queue works as expected without issues. The connection on the send side was reestablished; however, the message just sits in the queue and never gets picked up by the receiver because it has not reconnected. This may not be exactly what is happening in production to cause the lost connection in the first place, but we would expect that after making the queue "active" again in this test, our ServiceBusReceiverAsyncClient would notice the lost connection and make a new one, or at least keep attempting to reconnect until it succeeds. Even setting AmqpRetryOptions on the ServiceBusClientBuilder doesn't help. It seems there are a number of both open and closed issues similar to this one, though not identical; some of the similar ones were for Event Hubs and Blob Storage.
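For reference, a sketch of this kind of retry configuration; the values and entity name are illustrative assumptions, not the actual production settings:

import com.azure.core.amqp.AmqpRetryMode;
import com.azure.core.amqp.AmqpRetryOptions;
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient;
import java.time.Duration;

class RetryOptionsExample {
    static ServiceBusReceiverAsyncClient createReceiver() {
        // Generous retry settings; they only govern how retriable errors are retried and did
        // not help in our "disable and re-enable the queue" test.
        AmqpRetryOptions retryOptions = new AmqpRetryOptions()
            .setMode(AmqpRetryMode.EXPONENTIAL)
            .setMaxRetries(10)
            .setDelay(Duration.ofSeconds(1))
            .setMaxDelay(Duration.ofMinutes(1))
            .setTryTimeout(Duration.ofSeconds(60));

        return new ServiceBusClientBuilder()
            .connectionString(System.getenv("SB_CONN_STR"))  // placeholder environment variable
            .retryOptions(retryOptions)
            .receiver()
            .queueName("myqueue")                            // placeholder queue name
            .buildAsyncClient();
    }
}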
Hi @mseiler90, thank you for sharing the observations. I will take a look to understand what is happening when the queue is disabled. Could you take a look at these recommendations: #28573 (comment) and #28573 (comment)? I will be moving these recommendations to the Javadoc.
@anuchandy Thanks for the response. I should have mentioned that we are using Camel in this service and use Camel routes for all of our sending/receiving with Service Bus. Camel is using the ServiceBusSenderAsyncClient and ServiceBusReceiverAsyncClient, and perhaps I am just not familiar with it enough, but our expectation would be that we get this connection retry out of the box and do not have to build custom logic for it. We tried creating our own ServiceBusSenderAsyncClient and ServiceBusReceiverAsyncClient to set AmqpRetryOptions and such, in an effort to hopefully work around the issue until a fix is in place in the Azure SDK, but that didn't help and also defeats the purpose of what we thought we would get out of the box. It looks like getting this to work as expected would likely mean a complete rewrite and potentially skipping Camel, since it relies on these async clients.
Hi @mseiler90, thanks for the response. The receiveMessages() Flux of ServiceBusReceiverAsyncClient keeps retrying internally as long as the errors are retriable. The cases where the receive-flux notifies a terminal error (and hence no longer emits messages) to the application (subscriber) are: the configured retries are exhausted, or a non-retriable error occurs.
A few examples of non-retriable errors are: the app attempting to connect to a queue that does not exist, someone deleting the queue in the middle of receiving, the user explicitly initiating Geo-DR, or the user disabling the queue (yes, the test you did :)). These are events where the Service Bus service communicates to the SDK that a non-retriable error occurred. An example of a locally initiated non-retriable error is an SDK thread receiving an interrupted exception from outside the application.
I've created a custom class, ServiceBusIndefiniteRetryReceiverAsyncClient, that wraps the low-level client and recreates it whenever the receive Flux terminates:
import com.azure.core.util.logging.ClientLogger;
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient;
import com.azure.messaging.servicebus.models.ServiceBusReceiveMode;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
public final class ServiceBusIndefiniteRetryReceiverAsyncClient implements AutoCloseable {
private static final ClientLogger LOGGER = new ClientLogger(ServiceBusIndefiniteRetryReceiverAsyncClient.class);
// In rare cases, when the retry is exhausted or a non-retryable error occurs, back off
// for 4 seconds, matching the default server-busy time.
private static final Duration RETRY_WAIT_TIME = Duration.ofSeconds(4);
private final String connectionString;
private final String queueName;
private final AtomicReference<ServiceBusReceiverAsyncClient> currentLowLevelClient = new AtomicReference<>();
private final AtomicBoolean isClosed = new AtomicBoolean(false);
private final AtomicBoolean isInitial = new AtomicBoolean(true);
public ServiceBusIndefiniteRetryReceiverAsyncClient(String connectionString,
String queueName) {
this.connectionString = connectionString;
this.queueName = queueName;
this.currentLowLevelClient.set(createLowLevelClient());
}
public Flux<ServiceBusReceivedMessage> receiveMessages() {
return Flux.using(
() -> {
if (isClosed.get()) {
throw new IllegalStateException("Cannot perform receive on the closed client.");
}
if (!isInitial.getAndSet(false)) {
LOGGER.verbose("Creating a new LowLevelClient");
currentLowLevelClient.set(createLowLevelClient());
}
return currentLowLevelClient.get();
},
client -> {
return client.receiveMessages();
},
client -> {
LOGGER.verbose("Disposing current LowLevelClient");
client.close();
})
.retryWhen(
Retry.fixedDelay(Long.MAX_VALUE, RETRY_WAIT_TIME)
.filter(throwable -> {
if (isClosed.get()) {
return false;
}
LOGGER.warning("Current LowLevelClient's retry exhausted or a non-retryable error occurred.",
throwable);
return true;
}));
}
public Mono<Void> complete(ServiceBusReceivedMessage message) {
final ServiceBusReceiverAsyncClient lowLevelClient = currentLowLevelClient.get();
return lowLevelClient.complete(message);
}
public Mono<Void> abandon(ServiceBusReceivedMessage message) {
final ServiceBusReceiverAsyncClient lowLevelClient = currentLowLevelClient.get();
return lowLevelClient.abandon(message);
}
public Mono<Void> deadLetter(ServiceBusReceivedMessage message) {
final ServiceBusReceiverAsyncClient lowLevelClient = currentLowLevelClient.get();
return lowLevelClient.deadLetter(message);
}
@Override
public void close() throws Exception {
if (!isClosed.getAndSet(true)) {
this.currentLowLevelClient.get().close();
}
}
private ServiceBusReceiverAsyncClient createLowLevelClient() {
return new ServiceBusClientBuilder()
.connectionString(connectionString)
.receiver()
.receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
.queueName(queueName)
.disableAutoComplete()
.maxAutoLockRenewDuration(Duration.ZERO)
.prefetchCount(0)
.buildAsyncClient();
}
}
One last case where the receive-flux terminates (and hence no longer emits messages) is if the application throws an error from within the SDK callback. You should try-catch and log any application exception rather than letting it bubble up to the SDK. For example, this means the app will need to handle any exception coming from complete/abandon calls as well. Here is an example showing how to use the above class along with exception handling:
import com.azure.core.util.logging.ClientLogger;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import org.reactivestreams.Publisher;
import reactor.core.publisher.Mono;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
public final class MessageConsumeExample {
private static final ClientLogger LOGGER = new ClientLogger(MessageConsumeExample.class);
private final ServiceBusIndefiniteRetryReceiverAsyncClient client;
public MessageConsumeExample() {
String connectionString = System.getenv("CON_STR");
String queueName = System.getenv("Q_NAME");
client = new ServiceBusIndefiniteRetryReceiverAsyncClient(connectionString, queueName);
}
public void handleMessages() {
client.receiveMessages()
.flatMapSequential(new Function<ServiceBusReceivedMessage, Publisher<State>>() {
@Override
public Publisher<State> apply(ServiceBusReceivedMessage message) {
return handleMessage(message)
.onErrorResume(new Function<Throwable, Mono<State>>() {
@Override
public Mono<State> apply(Throwable businessError) {
try {
client.abandon(message).block();
return Mono.just(State.MESSAGE_ABANDONED);
} catch (Throwable abandonError) {
LOGGER.warning("Couldn't abandon message {}", message.getMessageId(), abandonError);
return Mono.just(State.MESSAGE_ABANDON_FAILED);
}
}
})
.flatMap(state -> {
if (state == State.HANDLING_SUCCEEDED) {
try {
client.complete(message).block();
return Mono.just(State.MESSAGE_COMPLETED);
} catch (Throwable completionError) {
LOGGER.warning("Couldn't complete message {}", message.getMessageId(), completionError);
return Mono.just(State.MESSAGE_COMPLETION_FAILED);
}
} else {
return Mono.just(state);
}
});
}
}, 1, 1)
.then()
.subscribe();
}
private Mono<State> handleMessage(ServiceBusReceivedMessage message) {
// Business logic that takes 5 seconds to process the message and randomly fails.
return Mono.fromCallable(() -> {
try {
TimeUnit.SECONDS.sleep(5);
} catch (InterruptedException e) {
e.printStackTrace();
}
return 1;
})
.flatMap(ignored -> {
LOGGER.info("Handling message: " + message.getMessageId());
final boolean handlingSucceeded = Math.random() < 0.5;
if (handlingSucceeded) {
return Mono.just(State.HANDLING_SUCCEEDED);
} else {
return Mono.error(
new RuntimeException("Business logic failed to handle message: " +
message.getMessageId()));
}
});
}
private enum State {
HANDLING_SUCCEEDED,
MESSAGE_COMPLETED,
MESSAGE_ABANDONED,
MESSAGE_COMPLETION_FAILED,
MESSAGE_ABANDON_FAILED
}
}
I'm unfamiliar with the Camel framework and don't know how it does the bridging with the Azure SDK. Do you have some cycles to create a git repo with a sample runnable Camel project showing how Camel creates and configures the Service Bus clients and how your routes send and receive?
Also, a minimal README to run the code locally.
@anuchandy thank you for such a detailed response. I can try to get a Camel example together in the next couple of days. I also tried the ServiceBusProcessorClient in a simple test service (not using Camel) and did the same "disable the Service Bus queue" test I mentioned previously, and with that it had no problems reconnecting and picking up messages again. That seems to align with what @nomisRev was describing. I did want to add some log messages that I got after enabling WARN-level logs in prod; this issue occurred again yesterday. Fortunately I have a monitor/alert in place watching queue depths, so I get alerted when this happens. I looked through the logs that Application Insights was able to give me and pulled out the relevant ones from when this happened again yesterday. I will post them below in case they give any additional insight for you or anyone else. Apologies for the format, but I didn't find a good way to export them in a nice-looking log-stream format out of Application Insights for just the specific logs I wanted to share here:
Hi @mseiler90, I should have mentioned it in my last comment: the HighLevelClient (ServiceBusProcessorClient) internally uses the LowLevelClient (ServiceBusReceiverAsyncClient) and recreates it when it terminates or when a periodic check finds the underlying channel closed. Now, coming to the "disable" test you did with the HighLevelClient, it recovered because "disable" results in a non-retriable "terminal" error (from the service) that the current LowLevelClient threw asynchronously, and the HighLevelClient recreated a new LowLevelClient. This is slightly different from Simon's case; from his comment, it appears that his app listens for a "terminal" error from the LowLevelClient, but he seems to run into a situation where the LowLevelClient never threw a "terminal" error, hence the app never gets to take action. Looking at the log you shared, it seems there was a failure. This is why I requested the Camel sample project; I don't know how Camel internally wires up the SDK clients. I hope this helps to clarify.
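As an illustration only (not an SDK recommendation), one way an application that uses the LowLevelClient directly could approximate this proactive recovery without reflection is to treat a long idle period as a signal to rebuild the client; the idle threshold, entity name, and factory below are assumptions, and on a genuinely quiet queue this will recycle the client unnecessarily:

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient;
import reactor.core.publisher.Flux;
import reactor.util.retry.Retry;
import java.time.Duration;

class IdleRestartReceive {

    // Hypothetical factory; replace with however the application builds its receiver.
    static ServiceBusReceiverAsyncClient newReceiver() {
        return new ServiceBusClientBuilder()
            .connectionString(System.getenv("SB_CONN_STR"))  // placeholder environment variable
            .receiver()
            .queueName("myqueue")                            // placeholder queue name
            .buildAsyncClient();
    }

    static Flux<ServiceBusReceivedMessage> receiveWithIdleRestart(Duration idleThreshold) {
        return Flux.using(
                IdleRestartReceive::newReceiver,
                // Error out if nothing has been emitted within the idle threshold.
                client -> client.receiveMessages().timeout(idleThreshold),
                // Dispose the old client before resubscribing with a new one.
                ServiceBusReceiverAsyncClient::close)
            // On timeout (or any other error) back off briefly, then rebuild and resubscribe.
            .retryWhen(Retry.backoff(Long.MAX_VALUE, Duration.ofSeconds(5)));
    }
}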
Hey @anuchandy, I can confirm that we're not throwing an exception from our own code; all of our code is wrapped in exception handling before it reaches the SDK. As mentioned in my comments above, we never receive a terminal error from receiveMessages(). Currently I can also see that the isChannelClosed-based workaround detects the closed channel and recreates the client as expected. I'm happy to discuss further, or share any more details if you need more information. I sadly cannot share our production code, but I can rewrite it, as I've done above, to share a patch. We seem to have success with this patch, but more time / battle testing in production will confirm that.
Hey @anuchandy, I made a quick demo Camel repo, which can be found here: https://github.com/mseiler90/camel-demo You asked for a number of configurations which I did not include, because Camel takes care of everything for us by default. We have the ability to set certain configurations, as seen in the Camel documentation here https://camel.apache.org/components/3.17.x/azure-servicebus-component.html, but I wanted to create this using the "out of the box" default settings which we are currently using in production. For example, we are not disabling auto-complete, we aren't setting our own retry options, and we don't have any custom logic built around the Service Bus clients. We have tried tests with things like setting retry options, but nothing has helped. As you can see from the example repo, Camel, with the help of the Azure SDK, is taking care of all of our configuration for us, as it should. For testing, if you just update the application.yml file with a Service Bus connection string and a queue name, you should be able to start it up in your IDE and it will automatically put a message on the queue every 10 seconds. There is a Camel route that listens on that same queue and just logs out the body of the message. Once that is up and running, disable the queue and, after several seconds or so, reactivate it. You will see that messages still get sent to the queue, but they will no longer get picked up until the application is restarted. As we have already discussed, the High Level Client successfully reconnects on the receiver, but the Low Level Client doesn't. We would just like to see the same "isChannelClosed" logic in the Low Level Client, and it seems everything would then work as we would hope. I do agree that we have an issue causing the connection to get lost to begin with, but regardless of how it is lost, we should be able to automatically reconnect on the receiver without writing custom logic to do so. I did find some additional information on our production issue. It seems that this has only been happening on a specific queue where we pick up the message and then send a REST API call to another service. I have found logs in Application Insights showing that there was a connection (read) timeout on that REST API call around the time the receiver stopped.
Hi @nomisRev, to give you some background, the messages arrive via an AMQP receive link, which is hosted on an AMQP session and connection. As you can see, the underlying types "react" to close (successful or error) events by creating a new link or connection. The expectation is that this "reactive" logic should take care of recovery, but there were edge cases (which are very hard to repro) where these "reactive" handlers never get to kick off recovery due to a missing signal. If you go through the SDK changelog, you can find multiple signal-loss issues root-caused via DEBUG logs. A signal-loss bug fix for the "reactive" recovery is coming in the mid-July release (details here). The historical reason for the "HighLevelClient" using a timer for a "proactive" channel check (in addition to the "LowLevelClient"'s "reactive" channel check) is something I need to check. Interestingly, you (so far) seem to have recovery problems not with the queue entity but with the topic entity. I'll kick off a long-running topic test with DEBUG enabled to see if we can get some helpful logs.
Hi @mseiler90, thanks for the Camel project; I'll use it to understand how Camel does the wiring with the SDK. But your observation, "It seems that this has only been happening a specific queue where we are picking up the message and then sending a REST API call to another service", is giving some hint. I think what could be happening here is that the REST call's connection (read) timeout surfaces as an error inside the message-handling callback and, if it bubbles into the SDK's receive pipeline, terminates the receive-flux without anything recreating the client.
As you investigate the REST API call connection (read) timeout, see if the app was running into the above flow. I'll find some time to go over the Camel sample to understand Camel's bridging with the SDK and see if there is any issue in the bridging.
Thanks @anuchandy. In case it helps, here is a link to the Service Bus component in the Camel repo: https://github.com/apache/camel/tree/main/components/camel-azure/camel-azure-servicebus
Hey @anuchandy, curious if you have had a chance to look into this any further? I see this was added to the Backlog milestone 12 days ago. What exactly does this mean? Is there work planned for this? We have been able to prevent the connection timeout errors from happening on that API call, as I explained above, so we haven't seen this issue happening anymore. However, we remain cautious: if something were to cause the disconnection again, we don't have anything to automatically health-check and reconnect unless we write custom logic, which would also mean not using the Camel Service Bus component.
Hi, is this issue still in progress? We have switched from the old Azure library to the new one, and since then we face the same issue: the threads are waiting and no longer consume any messages. Unfortunately this stop occurs quite often, at least once a day, and we cannot find a possible solution on our side. We have carefully checked our implementation, and all exceptions are caught in our message handler. The bad part is that we cannot reproduce the unexpected stop easily, but we have some ideas. If we set the maxConcurrentCalls property of the Service Bus processor to 1, everything works well and the processing of messages never stops. With a value of, for example, 10, we can force the stop of the processor only if we randomly cut the network connection for a short period of time (< 5 seconds). The threads stop working, stay in the "waiting" state, and never start working again. Our assumption is that the .block() calls in the message-settlement path (for example complete()) wait forever because no timeout is set.
Would it be an option to set a timeout to mitigate these unexpected stops of message processing? Additionally, we tried to add some exception handling to recognize this stop of message processing, but unfortunately we could not find a suitable place to receive the exception. Our processor configuration is created with a processError callback, but the error callback is never called. Thank you for the support.
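For context, here is a minimal sketch of the kind of processor setup described above; the connection string, queue name, and handler body are placeholders, and maxConcurrentCalls(10) mirrors the value mentioned in this comment:

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusErrorContext;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;
import com.azure.messaging.servicebus.ServiceBusReceivedMessageContext;
import java.util.concurrent.TimeUnit;

class ProcessorExample {
    public static void main(String[] args) throws InterruptedException {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
            .connectionString(System.getenv("SB_CONN_STR"))  // placeholder environment variable
            .processor()
            .queueName("myqueue")                            // placeholder queue name
            .maxConcurrentCalls(10)
            .processMessage(ProcessorExample::handleMessage)
            .processError(ProcessorExample::handleError)
            .buildProcessorClient();

        processor.start();
        TimeUnit.MINUTES.sleep(10);  // keep the process alive while the processor runs
        processor.close();
    }

    private static void handleMessage(ServiceBusReceivedMessageContext context) {
        try {
            System.out.println("Received " + context.getMessage().getMessageId()); // placeholder business logic
            context.complete();
        } catch (Exception e) {
            context.abandon(); // keep application errors out of the SDK pipeline
        }
    }

    private static void handleError(ServiceBusErrorContext context) {
        // In the scenario above, this callback is reportedly never invoked.
        System.err.println("Error from " + context.getErrorSource() + ": " + context.getException());
    }
}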
Hi, with regards to #26465 (comment) above, we've done a test on a local fork and found that calling the complete() action with a timeout does indeed let the thread recover after the timeout expires and process messages again as normal. It would be great if someone could provide thoughts on the proposed change in #33298; it has been created in a non-breaking, non-invasive fashion by only offering an additional method with a timeout input, without changing any current behavior. Thanks!
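Until something like that lands, here is a sketch of a purely application-side variant for code that settles messages through the async client: bound the settlement Mono with Reactor's timeout operator. The 30-second value is an arbitrary assumption, and this does not change the internal .block() calls inside the processor:

import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient;
import reactor.core.publisher.Mono;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

class BoundedSettlement {
    static Mono<Void> completeWithTimeout(ServiceBusReceiverAsyncClient receiver,
                                          ServiceBusReceivedMessage message) {
        return receiver.complete(message)
            .timeout(Duration.ofSeconds(30))  // assumed timeout value
            .onErrorResume(TimeoutException.class, e -> {
                // Log and move on: the message lock will eventually expire and the message
                // will be redelivered, trading a possible duplicate delivery for a stuck thread.
                System.err.println("complete() timed out for " + message.getMessageId());
                return Mono.empty();
            });
    }
}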
Hi @tux4ever @TheDevOps, I have created a separate issue to track the processor-hang scenario you described.
@liukun-msft Do you know if there is any further investigation into this issue (not the separate issue you created, but this one)?
Hi @mseiler90, the original question was being followed up by @anuchandy. I'll talk to him to find out the current status.
Hi @mseiler90, the design of the ServiceBusReceiverAsyncClient is that its receiveMessages() Flux terminates when retries are exhausted or a non-retriable error occurs, and it is then up to the caller to create a new client. The Apache Camel plugin is not taking care of re-creating the ServiceBusReceiverAsyncClient when that happens. From the log you shared, we can see that the receive Flux terminated but no new client was created, so consumption never resumed. As we don't own or have expertise in the Apache Camel plugin, we suggest you open a ticket in the Apache Camel plugin repo to either re-create the client on terminal errors or use the ServiceBusProcessorClient, which handles this re-creation internally.
Describe the bug
We are using ServiceBusReceiverClient as the receiver, but we have observed that, although the services were running and messages were present in the subscription, it stopped consuming messages.
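For illustration, a minimal sketch of the kind of synchronous receive setup described here; the entity names, connection-string source, and polling loop are assumptions rather than the reporter's actual code:

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import java.time.Duration;

class SubscriptionReceiveLoop {
    public static void main(String[] args) {
        ServiceBusReceiverClient receiver = new ServiceBusClientBuilder()
            .connectionString(System.getenv("SB_CONN_STR"))  // placeholder environment variable
            .receiver()
            .topicName("mytopic")                            // placeholder entity names
            .subscriptionName("mysub")
            .buildClient();

        while (true) {
            // Poll up to 10 messages, waiting at most 30 seconds per batch.
            for (ServiceBusReceivedMessage message : receiver.receiveMessages(10, Duration.ofSeconds(30))) {
                System.out.println("Received " + message.getMessageId());
                receiver.complete(message);
            }
        }
    }
}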
Exception or Stack Trace
To Reproduce
This issue occurs randomly.
Expected behavior
ServiceBusReceiverClient should continue consuming the messages available in the subscription.
Setup (please complete the following information):
Additional context
Similar to issue https://github.com/Azure/azure-sdk-for-java/issues/26138
Information Checklist
Kindly make sure that you have added all of the information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.