Kinesis binder locks timed out and messages not delivered #190
Comments
Duplicate of #186. Please consider upgrading to the latest Kinesis Binder: https://spring.io/blog/2023/03/27/spring-integration-for-aws-3-0-0-m2-and-spring-cloud-stream-kinesis-binder-4. For the current version I'd suggest using a short …
Thanks for the quick reply.
In May. Or, better to say, when Spring Cloud AWS 3.0 is GA.
@artembilan To reproduce:
Exception: org.springframework.messaging.core.DestinationResolutionException: no output-channel or replyChannel header available
2023-05-01T12:20:10.379-04:00 ERROR 25124 --- [-response-1-316] o.s.integration.handler.LoggingHandler : org.springframework.messaging.core.DestinationResolutionException: no output-channel or replyChannel header available
I've tried to debug it to find the root cause, but I just see that some remote call is failing because of a wrong endpoint; however, it is still unclear why it uses the wrong one.
Yeah... That one was fixed in the SNAPSHOT: 8223c8d. It was not released yet. Can you try against …
Yeah, I confirm, this error doesn't appear with the snapshot, thanks.
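For anyone wanting to verify against an unreleased build like that, a minimal Maven sketch for pulling Spring snapshot artifacts (the repository URL is the standard Spring snapshot repo; the exact snapshot version to use is not stated in this thread):

```xml
<!-- Add the Spring snapshot repository so unreleased binder builds can be resolved. -->
<repositories>
  <repository>
    <id>spring-snapshots</id>
    <url>https://repo.spring.io/snapshot</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```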
While it works as it should, with no errors, and shard consumers are able to allocate unused nodes, I faced another issue that may need attention.
2023-05-01T14:54:10.459-04:00 INFO 16112 --- [esis-consumer-1] a.i.k.KinesisMessageDrivenChannelAdapter : Got an exception java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.ExpiredIteratorException: The shard iterator has expired. Shard iterators are only valid for 300 seconds (Service: Kinesis, Status Code: 400, Request ID: AKSSPO8S0D2O3C4WWIF4QKMEL8VJSL2NANQN52N899PIFJFO7I09) during [ShardConsumer{shardOffset=KinesisShardOffset{iteratorType=AFTER_SEQUENCE_NUMBER, sequenceNumber='49640352981598798030864787941543021567332992185863766098', timestamp=null, stream='stream-v0', shard='shardId-000000000005', reset=false}, state=CONSUME}] task invocation.
Got it! It looks like we don't catch this kind of error in the …
It looks like the locks were being held, but the iterator had expired, as after restarting I saw that old messages started to be consumed.
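For illustration only (this is not the binder's actual internal fix), a minimal sketch of how a plain AWS SDK v2 consumer can recover from `ExpiredIteratorException` by requesting a fresh shard iterator after the last processed sequence number; the stream/shard names and the in-memory checkpoint are placeholders:

```java
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.ExpiredIteratorException;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.Record;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class ShardPoller {

    private final KinesisClient kinesis = KinesisClient.create();

    private String lastSequenceNumber; // placeholder checkpoint, kept in memory only

    public void poll(String streamName, String shardId) {
        String iterator = freshIterator(streamName, shardId);
        while (true) {
            GetRecordsResponse response;
            try {
                response = kinesis.getRecords(GetRecordsRequest.builder()
                        .shardIterator(iterator)
                        .build());
            }
            catch (ExpiredIteratorException ex) {
                // Shard iterators are only valid for 300 seconds: request a new one
                // after the last checkpointed sequence number and keep consuming.
                iterator = freshIterator(streamName, shardId);
                continue;
            }
            for (Record record : response.records()) {
                // process the record, then checkpoint its sequence number
                lastSequenceNumber = record.sequenceNumber();
            }
            iterator = response.nextShardIterator();
        }
    }

    private String freshIterator(String streamName, String shardId) {
        GetShardIteratorRequest.Builder request = GetShardIteratorRequest.builder()
                .streamName(streamName)
                .shardId(shardId);
        if (lastSequenceNumber != null) {
            request.shardIteratorType(ShardIteratorType.AFTER_SEQUENCE_NUMBER)
                    .startingSequenceNumber(lastSequenceNumber);
        }
        else {
            request.shardIteratorType(ShardIteratorType.TRIM_HORIZON);
        }
        return kinesis.getShardIterator(request.build()).shardIterator();
    }
}
```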
@artembilan Binder v4 doesn't have those dependencies, so this configuration is not working.
That's not correct. You can find its POM here (https://central.sonatype.com/artifact/org.springframework.cloud/spring-cloud-stream-binder-kinesis/3.0.0). As you see, there is a Spring Cloud AWS dependency there. So, what we did in v4 is just change that dependency to the newer Spring Cloud AWS generation. Probably those props are changed.
See this Microservices Patterns implementation where I really use that Kinesis Binder v4: https://github.com/artembilan/microservices-patterns-spring-integration/tree/main/distributed-tracing
Looks like the property to configure it was changed from `cloud.aws.region.static` to `spring.cloud.aws.region.static`.
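To make the rename concrete, an `application.yml` sketch showing the old and new forms of the region property quoted above (the `us-east-1` value is a placeholder; only the two property names come from this thread):

```yaml
# Kinesis Binder 3.x / Spring Cloud AWS 2.x style (old):
cloud:
  aws:
    region:
      static: us-east-1

# Kinesis Binder 4.x / Spring Cloud AWS 3.x style (new):
spring:
  cloud:
    aws:
      region:
        static: us-east-1
```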
@artembilan It worked perfectly fine with the 3.0 binder, and the header is populated on the sending side. So it worked before when all apps used 3.0 binders, with no changes except the binder. Do you have any idea why this can happen? PS: I've tried to add headers to the test application referenced in this ticket and it worked fine, so is it some kind of compatibility issue between the 3.0 and 4.0 binders?
Please raise a new issue with more details. Are you really sure that you produce the message with the header?
Yes, we use a single event stream for different kinds of messages, with routing based on headers.
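As an illustration of the kind of header-based routing being described, a minimal sketch of the producing side using Spring Cloud Stream's `StreamBridge` (the `output-out-0` binding name and the `eventType` header are hypothetical, not taken from the reporter's application):

```java
import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;
import org.springframework.stereotype.Component;

@Component
public class EventPublisher {

    private final StreamBridge streamBridge;

    public EventPublisher(StreamBridge streamBridge) {
        this.streamBridge = streamBridge;
    }

    // Send a payload to the shared stream with a custom header that the
    // consuming side can use to route the message to the right handler.
    public void publish(String payload, String eventType) {
        Message<String> message = MessageBuilder.withPayload(payload)
                .setHeader("eventType", eventType)   // hypothetical routing header
                .build();
        streamBridge.send("output-out-0", message);  // hypothetical binding name
    }
}
```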
New bug reported: #192
We use Spring Cloud Stream with Kinesis, and sometimes we see that messages are not being delivered.
After investigation, we found that the locks are not being processed correctly, and we have messages about it in the logs.
To Reproduce
Steps to reproduce the behavior:
You should see messages like:
Message sent: Message: 3
Received message: Message: 3
Message sent: Message: 4
At this point, only the first instance (started in step 3) is receiving all the messages and allocating the locks.
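For context, a minimal functional-style producer/consumer pair that produces logs of the shape shown above (a sketch of a typical reproducer, not the linked test project itself; the `produce`/`consume` binding names are assumptions):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ReproducerConfig {

    private final AtomicInteger counter = new AtomicInteger();

    // Bound to the Kinesis stream via spring.cloud.stream.bindings.produce-out-0.destination
    @Bean
    public Supplier<String> produce() {
        return () -> {
            String payload = "Message: " + counter.incrementAndGet();
            System.out.println("Message sent: " + payload);
            return payload;
        };
    }

    // Bound via spring.cloud.stream.bindings.consume-in-0.destination; with two pods
    // running, the shard locks decide which instance receives each message.
    @Bean
    public Consumer<String> consume() {
        return payload -> System.out.println("Received message: " + payload);
    }
}
```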
Version of the framework
org.springframework.cloud:spring-cloud-dependencies:2022.0.1
spring-cloud-stream-binder-kinesis: 3.0.0
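For reference, the versions listed above correspond roughly to this Maven setup (a sketch; the BOM-import style is an assumption about how the project is built):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Spring Cloud BOM from the report above -->
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-dependencies</artifactId>
      <version>2022.0.1</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <!-- Kinesis binder version from the report above -->
  <dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-stream-binder-kinesis</artifactId>
    <version>3.0.0</version>
  </dependency>
</dependencies>
```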
Expected behavior
Locks are allocated across the running nodes and messages are received.
Observed behavior
Messages are lost or delayed.
Timeouts in the logs:
```
2023-04-27T12:25:37.735-04:00 INFO 31356 --- [is-dispatcher-1] a.i.k.KinesisMessageDrivenChannelAdapter : The lock for key 'event-group:stream-v0:shardId-000000000009' was not renewed in time
java.util.concurrent.TimeoutException: null
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095) ~[na:na]
	at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumer.renewLockIfAny(KinesisMessageDrivenChannelAdapter.java:1035) ~[spring-integration-aws-2.5.4.jar:na]
	at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumer.execute(KinesisMessageDrivenChannelAdapter.java:947) ~[spring-integration-aws-2.5.4.jar:na]
	at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ConsumerDispatcher.run(KinesisMessageDrivenChannelAdapter.java:857) ~[spring-integration-aws-2.5.4.jar:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
```
Additional context
The issue likely happens when the shard locks are distributed between multiple nodes; it is not observed when all locks are allocated by one instance.
We are running 2 pods listening to a stream with 4 shards. Sometimes messages are not delivered, and it may be remediated by clearing the locks table. When locks are allocated to different pods, timeout messages like the one above appear in the logs.
The behavior is not consistent: sometimes it works as it should, and sometimes message delivery is stuck, but using the linked project it is reproducible almost all the time.
You may check lock distribution by configuring test credentials in .aws/config like:
aws_access_key_id=test
aws_secret_access_key=test
and executing getLockTableContent.sh / getLockTableContent.cmd.
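If those helper scripts are not at hand, the same check can be done directly with the AWS CLI; a sketch assuming a LocalStack endpoint on port 4566 and the default `SpringIntegrationLockRegistry` lock table name (both are assumptions, not stated in this report):

```sh
# Scan the DynamoDB lock table to see which node currently owns each shard lock.
aws dynamodb scan \
  --table-name SpringIntegrationLockRegistry \
  --endpoint-url http://localhost:4566 \
  --region us-east-1
```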