Kinesis PutRecords intermittently hangs with successful but empty response #4149
Comments
Hi, is this issue still persisting with the latest version of the SDK?
Hey there, the last time we tried reproducing the issue a couple of months ago, we were unable to do so. I'm guessing our network zone was having issues when this was reported, or perhaps the Kinesis service itself was having issues. Since we are no longer able to reproduce reliably, there may not be much we can do to investigate whether it's still something that can be improved here. Feel free to close if you need to do some housekeeping -- we can always re-open if it comes up again. Thanks!
Sure, please reopen this if it comes up again.
Describe the bug
We are experiencing an intermittent issue where calling Kinesis PutRecords via the Fluent Bit Kinesis Go plugin hangs until a timeout occurs. This appears to be similar to #301, #1037, and #1141 (but with a different API).
As of this writing, the plugin uses the default HTTP Client in the `aws.Config`. We have submitted a patch to the plugin as a temporary work-around to minimize the time spent hanging when it occurs.
We have enabled debug logging by patching the plugin to pass in `aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody)` to the `aws.Config`. When using the default HTTP Client, we see a read timeout (consistently after 5m30s). When we pass in an HTTP Client with a custom timeout, we see a context cancellation timeout. In either case, we observe that a 200 OK response is logged at the point of timeout, but the `Date` response header has an earlier time (generally within 1 second of the request), and it appears to have an empty response body. The SDK then retries sending the request and succeeds as normal, resulting in duplicate records.
Not sure if `PutRecords` should simply be included in https://github.com/aws/aws-sdk-go/blob/v1.41.12/service/kinesis/customizations.go#L16-L18 (#1166 resolution) or if there is any other investigation that should be done. BTW seems like the timeout is hardcoded there -- should it not make use of the HTTP Client timeout config that can be customized? I'm also curious if this would result in different behavior than our current work-around of providing the HTTP Client timeout (I'm assuming we'd still see the duplicate records)?
Version of AWS SDK for Go?
Observed using: v1.40.49, v1.41.9, and v1.41.12
Version of Go (`go version`)?
1.17
To Reproduce (observed behavior)
Can follow up with a minimal example, if needed. We're generating logs and using Fluent Bit's tail plugin as input and outputting to Kinesis. We'll generally see it occur when the requests per second are >= 300 (Fluent Bit flush interval of 1s). There is a chance our environment where we encounter this has potential network issues, but those would be difficult to identify/resolve.
Expected behavior
Ideally, we would observe neither the hanging behavior nor the duplicate records, but we understand that the at-least-once semantics of Kinesis usage can lead to some duplicates. This just seems to happen fairly often in our test environment.
Additional context
We can provide logs, request IDs, etc. as needed for any troubleshooting. We just need to sanitize a bit before doing so.
cc: @nikoo28