
Kinesis PutRecords intermittently hangs with successful but empty response #4149

Closed
3 tasks done
jpaskhay opened this issue Oct 27, 2021 · 4 comments

jpaskhay commented Oct 27, 2021

Describe the bug
We are experiencing an intermittent issue where calling Kinesis PutRecords via the Fluent Bit Kinesis Go plugin hangs until a timeout occurs. This appears to be similar to #301, #1037, and #1141 (but with a different API).

As of this writing, the plugin uses the default HTTP Client in the aws.Config. We have submitted a patch to the plugin as a temporary work-around to minimize the time spent hanging when it occurs.
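
For reference, the gist of that work-around looks something like the following (a minimal sketch, assuming the plugin lets us substitute the aws.Config/HTTP client it builds; the one-minute timeout is illustrative, not the exact value in the patch):

```go
package example

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// newKinesisClient builds a Kinesis client whose HTTP requests are capped by
// an overall client timeout, so a hung response fails fast instead of
// blocking until the read timeout we observed (~5m30s with the defaults).
func newKinesisClient(region string) (*kinesis.Kinesis, error) {
	cfg := &aws.Config{
		Region: aws.String(region),
		// The SDK's default HTTP client has no Timeout set; provide one.
		// (Illustrative value, not the exact value used in the patch.)
		HTTPClient: &http.Client{Timeout: time.Minute},
	}
	sess, err := session.NewSession(cfg)
	if err != nil {
		return nil, err
	}
	return kinesis.New(sess), nil
}
```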

We have enabled debug logging by patching the plugin to pass aws.LogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors | aws.LogDebugWithHTTPBody) into the aws.Config. When using the default HTTP Client, we see a read timeout (consistently after 5m30s). When we pass in an HTTP Client with a custom timeout, we see a context cancellation timeout. In either case, a 200 OK response is logged at the point of timeout, but the Date response header shows an earlier time (generally within 1 second of the request) and the response body appears to be empty. The SDK then retries the request and succeeds as normal, resulting in duplicate records.
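
The logging side of that patch is essentially just the following (sketch; exactly how it gets wired into the plugin's existing aws.Config is omitted here):

```go
package example

import "github.com/aws/aws-sdk-go/aws"

// debugConfig returns an aws.Config enabling the SDK debug logging used to
// capture the retry, error, and HTTP-body details described above.
func debugConfig() *aws.Config {
	return &aws.Config{
		LogLevel: aws.LogLevel(aws.LogDebugWithRequestRetries |
			aws.LogDebugWithRequestErrors |
			aws.LogDebugWithHTTPBody),
	}
}
```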

Not sure if PutRecords should simply be included in https://github.com/aws/aws-sdk-go/blob/v1.41.12/service/kinesis/customizations.go#L16-L18 (the #1166 resolution) or if there is other investigation that should be done. BTW, it seems like the read timeout there is hardcoded -- shouldn't it make use of the customizable HTTP Client timeout config? I'm also curious whether that would behave differently from our current work-around of setting the HTTP Client timeout (I'm assuming we'd still see the duplicate records).
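
To make that last question concrete, here is a hypothetical sketch of bounding each call with a context deadline via PutRecordsWithContext rather than relying on a hardcoded read timeout; the 30s value is arbitrary, and this is not what the plugin currently does:

```go
package example

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/service/kinesis"
)

// putRecordsWithDeadline bounds a single PutRecords call with a context
// deadline. If no complete response arrives in time, the call returns a
// context/RequestCanceled error promptly instead of hanging until a
// transport-level read timeout fires; the caller decides whether to retry.
func putRecordsWithDeadline(svc *kinesis.Kinesis, input *kinesis.PutRecordsInput) (*kinesis.PutRecordsOutput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) // arbitrary value
	defer cancel()
	return svc.PutRecordsWithContext(ctx, input)
}
```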

Version of AWS SDK for Go?
Observed using: v1.40.49, v1.41.9, and v1.41.12

Version of Go (go version)?
1.17

To Reproduce (observed behavior)
Can follow up with a minimal example, if needed. We're generating logs, using Fluent Bit's tail plugin as input, and outputting to Kinesis. We generally see it occur when the requests per second are >= 300 (Fluent Bit flush interval of 1s). There is a chance the environment where we encounter this has network issues, but those would be difficult to identify or resolve.

Expected behavior
Ideally, we would observe neither the hanging behavior nor the duplicate records, but we understand that Kinesis's at-least-once semantics can lead to some duplicates. This just seems to happen fairly often in our test environment.

Additional context
We can provide logs, request IDs, etc. as needed for any troubleshooting. We just need to sanitize a bit before doing so.

cc: @nikoo28

@jpaskhay jpaskhay added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Oct 27, 2021
@vudh1 vudh1 self-assigned this Apr 15, 2022
@vudh1 vudh1 added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Apr 20, 2022

vudh1 commented Apr 20, 2022

Hi, is this issue still persisting with the latest version of SDK?

@vudh1 vudh1 added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Apr 20, 2022
@jpaskhay
Author

> Hi, is this issue still persisting with the latest version of SDK?

Hey there, the last time we tried to reproduce the issue, a couple of months ago, we were unable to do so. I'm guessing our network zone was having issues when this was reported, or perhaps the Kinesis service itself was. Since we can no longer reproduce it reliably, there may not be much we can do to investigate whether there's still something that can be improved here. Feel free to close if you need to do some housekeeping -- we can always re-open if it comes up again. Thanks!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Apr 22, 2022

vudh1 commented Apr 22, 2022

Sure, please reopen this if it comes up again.

@vudh1 vudh1 closed this as completed Apr 22, 2022
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
