
Kinesis PutRecords intermittently hangs with successful but empty response #4149

Closed
3 tasks done
jpaskhay opened this issue Oct 27, 2021 · 4 comments

jpaskhay commented Oct 27, 2021

Describe the bug
We are experiencing an intermittent issue where calling Kinesis PutRecords via the Fluent Bit Kinesis Go plugin hangs until a timeout occurs. This appears to be similar to #301, #1037, and #1141 (but with a different API).

As of this writing, the plugin uses the default HTTP Client in the aws.Config. We have submitted a patch to the plugin as a temporary work-around to minimize the time spent hanging when it occurs.
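
For reference, the gist of that work-around looks something like the following (a minimal sketch, assuming the plugin lets us substitute the aws.Config/HTTP client it builds; the one-minute timeout is illustrative, not the exact value in the patch):

```go
package example

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// newKinesisClient builds a Kinesis client whose HTTP requests are capped by
// an overall client timeout, so a hung response fails fast instead of
// blocking until the read timeout we observed (~5m30s with the defaults).
func newKinesisClient(region string) (*kinesis.Kinesis, error) {
	cfg := &aws.Config{
		Region: aws.String(region),
		// The SDK's default HTTP client has no Timeout set; provide one.
		// (Illustrative value, not the exact value used in the patch.)
		HTTPClient: &http.Client{Timeout: time.Minute},
	}
	sess, err := session.NewSession(cfg)
	if err != nil {
		return nil, err
	}
	return kinesis.New(sess), nil
}
```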

We have enabled debug logging by patching the plugin to pass aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody) into the aws.Config. When using the default HTTP Client, we see a read timeout (consistently after 5m30s). When we pass in an HTTP Client with a custom timeout, we see a context cancellation timeout. In either case, a 200 OK response is logged at the point of timeout, but the Date response header shows an earlier time (generally within 1 second of the request) and the response body appears to be empty. The SDK then retries the request and succeeds as normal, resulting in duplicate records.
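
The logging side of that patch is essentially just the following (sketch; exactly how it gets wired into the plugin's existing aws.Config is omitted here):

```go
package example

import "github.com/aws/aws-sdk-go/aws"

// debugConfig returns an aws.Config enabling the SDK debug logging used to
// capture the retry, error, and HTTP-body details described above.
func debugConfig() *aws.Config {
	return &aws.Config{
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestRetries |
			aws.LogDebugWithRequestErrors |
			aws.LogDebugWithHTTPBody),
	}
}
```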

Not sure if PutRecords should simply be included in https://github.com/aws/aws-sdk-go/blob/v1.41.12/service/kinesis/customizations.go#L16-L18 (the #1166 resolution) or if there is other investigation that should be done. BTW, it seems like the read timeout there is hardcoded -- shouldn't it make use of the customizable HTTP Client timeout config? I'm also curious whether that would behave differently from our current work-around of setting the HTTP Client timeout (I'm assuming we'd still see the duplicate records).
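
To make that last question concrete, here is a hypothetical sketch of bounding each call with a context deadline via PutRecordsWithContext rather than relying on a hardcoded read timeout; the 30s value is arbitrary, and this is not what the plugin currently does:

```go
package example

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/service/kinesis"
)

// putRecordsWithDeadline bounds a single PutRecords call with a context
// deadline. If no complete response arrives in time, the call returns a
// context/RequestCanceled error promptly instead of hanging until a
// transport-level read timeout fires; the caller decides whether to retry.
func putRecordsWithDeadline(svc *kinesis.Kinesis, input *kinesis.PutRecordsInput) (*kinesis.PutRecordsOutput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) // arbitrary value
	defer cancel()
	return svc.PutRecordsWithContext(ctx, input)
}
```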

Version of AWS SDK for Go?
Observed using: v1.40.49, v1.41.9, and v1.41.12

Version of Go (go version)?
1.17

To Reproduce (observed behavior)
Can follow up with a minimal example, if needed. We're generating logs, using Fluent Bit's tail plugin as input, and outputting to Kinesis. We generally see it occur when the requests per second are >= 300 (Fluent Bit flush interval of 1s). There is a chance the environment where we encounter this has network issues, but those would be difficult to identify or resolve.

Expected behavior
Ideally, we would observe neither the hanging behavior nor the duplicate records, but we understand that Kinesis's at-least-once semantics can lead to some duplicates. This just seems to happen fairly often in our test environment.

Additional context
We can provide logs, request IDs, etc. as needed for any troubleshooting. We just need to sanitize a bit before doing so.

cc: @nikoo28

@jpaskhay jpaskhay added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Oct 27, 2021
@vudh1 vudh1 self-assigned this Apr 15, 2022
@vudh1 vudh1 added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Apr 20, 2022

vudh1 commented Apr 20, 2022

Hi, is this issue still persisting with the latest version of SDK?

@vudh1 vudh1 added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Apr 20, 2022
@jpaskhay
Author

> Hi, is this issue still persisting with the latest version of SDK?

Hey there, the last time we tried to reproduce the issue, a couple of months ago, we were unable to do so. I'm guessing our network zone was having issues when this was reported, or perhaps the Kinesis service itself was. Since we can no longer reproduce it reliably, there may not be much we can do to investigate whether there's still something that can be improved here. Feel free to close if you need to do some housekeeping -- we can always re-open if it comes up again. Thanks!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Apr 22, 2022

vudh1 commented Apr 22, 2022

Sure, please reopen this if it comes up again.

@vudh1 vudh1 closed this as completed Apr 22, 2022
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
