-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix NPE on graceful shutdown before DDB LeaseCoordinator
starts.
#1157
Conversation
takerFuture.cancel(false); | ||
|
||
if (takerFuture != null) { | ||
takerFuture.cancel(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In handling this null case is there any chance that takerFuture is assigned to Null. Do we need to mark this somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any chance that takerFuture is assigned to Null.
Yes, takerFuture=null
at declaration and remains so until reassigned @ https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/leases/dynamodb/DynamoDBLeaseCoordinator.java#L236-L239 This is currently the only assignment, so it will not revert to null
.
Do we need to mark this somewhere?
Would you kindly clarify what is being asked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to understand the RCA for the bugs, first one says if scheduler is shutdown too fast after started, I dont think thats possible because start and shutdown are synchronized on the scheduler object. But its possible that if shutdown happened first before start could even run, but then we dont have the right behavior either, ideally if shutdown ran first and then start runs, we should disallow the start, so im thinking maybe we want to keep track of a flag for started and shutdown and only do shutdown processing if it was started and dont allow a start if it was already shutdown?
I dint quite understand the second linked issue from a quick look.
I think maybe thats what you called out in your description but a bias-for-action fix is to address NPE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[aravinda] I think maybe thats what you called out in your description but a bias-for-action fix is to address NPE?
Yeah, replication paths weren't provided in the issues but the stacktraces consistently implicated this line.
I agree that we should "disallow the start", and other more defensive/correct behaviors. However, time is a luxury and this bias-for-action PR is better than letting this linger another ~2.5yrs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with bias-for-action if it fixes the issues referenced, i.e. maybe they dont see the NPE with this fix, but still their application wont behave as they want which I assume is, cancel the start and shutdown gracefully? So is there value in doing this, i.e. if we have correct guarantees about only shutdown if started and only start if not already shutdown then this null check wont even be needed? Just a thought, please let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just asking if for some reason the function returned null do we need to log this for future reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[aravinda] maybe they dont see the NPE with this fix, but still their application wont behave as they want which I assume is, cancel the start and shutdown gracefully? So is there value in doing this
Correct on intent to shutdown (per Cx attestations), and yes: there is value in avoiding the NPE. This NPE apparently prevents KCL from terminating. Comments from the referenced issues:
Calling Scheduler.startGracefulShutdown on a scheduler too fast after the scheduler has been started results in a NullPointerException, and the scheduler is not shut down.
(emphasis added)
ref: #745 (comment)
NullPointerException happens which prevents worker to shut down and it continues operating in a corrupted state
(emphasis added)
ref: #900 (comment)
[brendan] I was just asking if for some reason the function returned null do we need to log this for future reference?
Thanks for the clarification. stopLeaseTaker()
has a void
return type. So long as the method doesn't throw an unchecked NPE (which is fixed in this PR), KCL is in a better state.
@Test | ||
public void testStopLeaseTakerBeforeStart() { | ||
leaseCoordinator.stopLeaseTaker(); | ||
assertTrue(leaseCoordinator.getAssignments().isEmpty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this test is checking that assignments are empty to show that it is before the LeaseTaker Starts. If this is the case, shouldn't we verify that it is empty before calling stopLeaseTaker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is checking that assignments are empty to show that it is before the LeaseTaker Starts
Not exactly. For one referenced issue, getAssignments()
happens immediately-after stopLeaseTaker()
. This check ensures that we don't solve one NPE only to expose/throw another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Perfect.
Issue #, if available:
Description of changes:
Fix NPE on graceful shutdown before DDB
LeaseCoordinator
has fully started.DynamoDBLeaseCoordinator
still has issues since it doesn't use a FSM. For example, initialization/startup could happen in parallel, or immediately following, a graceful shutdown. However, this PR provides a bias-for-action fix of the NPE.This PR supplants #901
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.