Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-9476] KinesisIO retry LimitExceededException #10973

Merged
merged 1 commit into from
Apr 3, 2020

Conversation

ameihm0912
Copy link

During Kinesis stream setup DescribeStream is used. This API call has
quota limits that can become very problematic when attempting to
configure multiple Kinesis streams in the same AWS account.

This change modifies the caller of describeStream (listShards) such that
transient failures are retried up to 10 times with a one second delay in
between each try (the time used to assess the quota), at which point if
it has still failed the exception is thrown.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@aromanenko-dev
Copy link
Contributor

Thank you for contribution! Have you seen this PR #9765 and a discussion over this topic there? I think it tries to solve the similar problem, if I'm not mistaken.

@ameihm0912
Copy link
Author

Thank you for contribution! Have you seen this PR #9765 and a discussion over this topic there? I think it tries to solve the similar problem, if I'm not mistaken.

@aromanenko-dev thank you for the response.

I had a quick look at the referenced PR and I suspect that may be solving a different issue, that being transient errors associated with hitting read throughput rate limiting.

Some additional context on the issue we are running into: in our case, we have a large number of Kinesis streams in a single account. By default, Kinesis DescribeStream calls are limited to 10 per second per account. So we easily run into this during graph construction as we have a large number of streams that need to be setup in the same pipeline. This results in a RuntimeException being thrown by the pipeline. So, our issue is happening during stream initialization right off the bat, where as the referenced PR I believe is more to address issues that are occurring while reading from an already initialized stream.

It is possible I am misinterpreting something in the other PR however.

@aromanenko-dev
Copy link
Contributor

aromanenko-dev commented Feb 27, 2020

@ameihm0912 Yes, I see your point and this is actually a different but similar issue imo.

One of the option to overcome this could be using internal AWS RetryPolicy as I showed in this comment - in this case you just need to create new AWSClientsProvider, based on current BasicKinesisProvider, and override the method getKinesisClient() with configured custom RetryPolicy and BackoffStrategy. Then set this new AWSClientsProvider by withAWSClientsProvider(AWSClientsProvider). Though, be aware that this policy will be applied to all other AmazonKinesis requests as well afaik.

Another option can using Beam FluentBackoffclass as many other Beam IOs do, like ClickHouseIO for example.

@ameihm0912
Copy link
Author

@aromanenko-dev interesting, that's definitely an option but as you mention I am unclear on what the implications of applying a global retry/backoff policy to all API requests would be. For this particular PR I tried to just focus on the DescribeStream initialization issue itself, but am happy to revisit the approach here if you prefer.

@aromanenko-dev
Copy link
Contributor

@ameihm0912 Could you use Beam FluentBackoff instead own implementation? I believe it should simplify the code.

@aromanenko-dev
Copy link
Contributor

aromanenko-dev commented Mar 4, 2020

Also, please, create new Jira issue for this feature and name this PR and commits using Jira ID as a prefix, like [BEAM-1234] Commit message where 1234 is Jira ID (see Contribution Guide for details).

@ameihm0912 ameihm0912 changed the title KinesisIO retry LimitExceededException [BEAM-9476] KinesisIO retry LimitExceededException Mar 9, 2020
@ameihm0912
Copy link
Author

@aromanenko-dev updated to use FluentBackoff and references to Jira issue added

@aromanenko-dev
Copy link
Contributor

retest this please

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it looks fine, thanks. Could you just add a unit test to test that this BackOff actually works?

@aromanenko-dev
Copy link
Contributor

@ameihm0912 kindly pinging to prevent this PR get staled

@ameihm0912
Copy link
Author

@ameihm0912 kindly pinging to prevent this PR get staled

@aromanenko-dev apologies for the delay here, I'll try to get that test added this week.

@aromanenko-dev
Copy link
Contributor

@ameihm0912 Not a problem, thank you for keep working on this!

@ameihm0912
Copy link
Author

@aromanenko-dev I have added a couple test cases to verify both the retry behavior and the retry limit behavior to KinesisMockReadTest. When you have a moment please let me know if that is OK.

@aromanenko-dev
Copy link
Contributor

aromanenko-dev commented Apr 2, 2020

@ameihm0912 Thanks, all is fine for me except that Spotless fails. Please, run gradlew spotlessApply (or more precise ./gradlew :sdks:java:io:kinesis:spotlessApply) to fix these violations and ./gradlew :sdks:java:io:kinesis:check before pushing to make sure that all tests and checks are passed.
Once it's fixed, I think it will be ready to merge.

During Kinesis stream setup DescribeStream is used. This API call has
quota limits that can become problematic when attempting to
configure multiple Kinesis streams in the same AWS account.

This change modifies the caller of describeStream (listShards) such that
transient failures are retried up to 10 times using FluentBackoff
starting with a one second backoff (the time used to assess the quota).
After ten attempts the exception will be thrown.
@ameihm0912
Copy link
Author

@aromanenko-dev this should be fixed up now

@ameihm0912
Copy link
Author

@aromanenko-dev I'm not too sure why the tests are indicated as still failing here, they seemed to pass for me locally. I had a look at the Jenkins output as well and don't see anything obvious in there.

@aromanenko-dev
Copy link
Contributor

Run Java PreCommit

@aromanenko-dev
Copy link
Contributor

Run JavaPortabilityApi PreCommit

@aromanenko-dev
Copy link
Contributor

Run Java PreCommit

@aromanenko-dev
Copy link
Contributor

Run JavaPortabilityApi PreCommit

@aromanenko-dev
Copy link
Contributor

@ameihm0912 I think it's not related to your changes, so I just rerun this tests

@ameihm0912
Copy link
Author

Looks like we still had a failure; if you'd like me to rebase this off current master let me know, I'm not sure if maybe that might help?

@aromanenko-dev
Copy link
Contributor

Run Java PreCommit

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok now and the fails were caused by flakiness in some not related tests.
LGTM and thanks for your contribution!

@aromanenko-dev aromanenko-dev merged commit a259c3a into apache:master Apr 3, 2020
@ameihm0912 ameihm0912 deleted the kinesisretry branch April 3, 2020 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants