Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NPE on graceful shutdown before DDB LeaseCoordinator starts. #1157

Merged
merged 1 commit into from
Jun 28, 2023

Conversation

stair-aws
Copy link
Contributor

Issue #, if available:

Description of changes:
Fix NPE on graceful shutdown before DDB LeaseCoordinator has fully started.

DynamoDBLeaseCoordinator still has issues since it doesn't use a FSM. For example, initialization/startup could happen in parallel, or immediately following, a graceful shutdown. However, this PR provides a bias-for-action fix of the NPE.

This PR supplants #901

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@stair-aws stair-aws added bug v2.x Issues related to the 2.x version labels Jun 27, 2023
takerFuture.cancel(false);

if (takerFuture != null) {
takerFuture.cancel(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In handling this null case is there any chance that takerFuture is assigned to Null. Do we need to mark this somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any chance that takerFuture is assigned to Null.

Yes, takerFuture=null at declaration and remains so until reassigned @ https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/leases/dynamodb/DynamoDBLeaseCoordinator.java#L236-L239 This is currently the only assignment, so it will not revert to null.

Do we need to mark this somewhere?

Would you kindly clarify what is being asked?

Copy link

@aravindakidambi aravindakidambi Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand the RCA for the bugs, first one says if scheduler is shutdown too fast after started, I dont think thats possible because start and shutdown are synchronized on the scheduler object. But its possible that if shutdown happened first before start could even run, but then we dont have the right behavior either, ideally if shutdown ran first and then start runs, we should disallow the start, so im thinking maybe we want to keep track of a flag for started and shutdown and only do shutdown processing if it was started and dont allow a start if it was already shutdown?

I dint quite understand the second linked issue from a quick look.

I think maybe thats what you called out in your description but a bias-for-action fix is to address NPE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[aravinda] I think maybe thats what you called out in your description but a bias-for-action fix is to address NPE?

Yeah, replication paths weren't provided in the issues but the stacktraces consistently implicated this line.

I agree that we should "disallow the start", and other more defensive/correct behaviors. However, time is a luxury and this bias-for-action PR is better than letting this linger another ~2.5yrs.

Copy link

@aravindakidambi aravindakidambi Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with bias-for-action if it fixes the issues referenced, i.e. maybe they dont see the NPE with this fix, but still their application wont behave as they want which I assume is, cancel the start and shutdown gracefully? So is there value in doing this, i.e. if we have correct guarantees about only shutdown if started and only start if not already shutdown then this null check wont even be needed? Just a thought, please let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just asking if for some reason the function returned null do we need to log this for future reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[aravinda] maybe they dont see the NPE with this fix, but still their application wont behave as they want which I assume is, cancel the start and shutdown gracefully? So is there value in doing this

Correct on intent to shutdown (per Cx attestations), and yes: there is value in avoiding the NPE. This NPE apparently prevents KCL from terminating. Comments from the referenced issues:

Calling Scheduler.startGracefulShutdown on a scheduler too fast after the scheduler has been started results in a NullPointerException, and the scheduler is not shut down.

(emphasis added)
ref: #745 (comment)

NullPointerException happens which prevents worker to shut down and it continues operating in a corrupted state

(emphasis added)
ref: #900 (comment)


[brendan] I was just asking if for some reason the function returned null do we need to log this for future reference?

Thanks for the clarification. stopLeaseTaker() has a void return type. So long as the method doesn't throw an unchecked NPE (which is fixed in this PR), KCL is in a better state.

@Test
public void testStopLeaseTakerBeforeStart() {
leaseCoordinator.stopLeaseTaker();
assertTrue(leaseCoordinator.getAssignments().isEmpty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this test is checking that assignments are empty to show that it is before the LeaseTaker Starts. If this is the case, shouldn't we verify that it is empty before calling stopLeaseTaker?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is checking that assignments are empty to show that it is before the LeaseTaker Starts

Not exactly. For one referenced issue, getAssignments() happens immediately-after stopLeaseTaker(). This check ensures that we don't solve one NPE only to expose/throw another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Perfect.

@stair-aws stair-aws merged commit feadd5e into awslabs:master Jun 28, 2023
@stair-aws stair-aws deleted the graceful branch June 28, 2023 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug v2.x Issues related to the 2.x version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants