Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] EventHubListener causes message lost in shutdown #41784

Closed
yfujiwara-sansan opened this issue Feb 5, 2024 · 11 comments · Fixed by #41891
Closed

[BUG] EventHubListener causes message lost in shutdown #41784

yfujiwara-sansan opened this issue Feb 5, 2024 · 11 comments · Fixed by #41891
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@yfujiwara-sansan
Copy link

Library name and version

Microsoft.Azure.WebJobs.Extensions.EventHubs 6.0.2 and 5.5.0

Describe the bug

EventHubListener.PartitionProcessor progresses the checkpoint even when the application is shutting down (for example, configuration change, scaling, etc.). It causes message lost.
Note that I found that this issue is occurred occasionally.

I and my colleague guess that this issue should be fixed #36432, but reintroduced with #38067 as following out investigation results.

Our investigation results

In this situation, while _functionExecutionToken.IsCancellationRequested became true, linkedCts.IsCancellationRequested had not become true. So, the checkpoint was progressed even if the application execution had been cancelled.

By LinkedCancellationTokenSource source code, the following facts were found:

  • LinkdedCancellationTokenSource (linkedCts) is cancelled via callback of "linked" cancellation token (_functionExecutionToken's source).
  • The callback chain is invoked LIFO order.
  • The callback is also used in many places such as Task.Delay(CancellationToken) implementation.

By watching the callstack in the checkpointing, following sequence was occurred:

  1. WebHost calls listener's StopAsync()
  2. The listener calls CancellationTokenSource.Cancel() (source of _functionExecutionToken)
  3. The application cancellation is occurred as continuation of async / await, then awaits in function runtimes are finished as part of registered callback execution. Note that this callback should be occurred before setting linkedCts.IsCancellationRequested to true as described above. So, the checkpoint is progressed because linkedCts.IsCancellationRequested has not been true yet.

Expected behavior

The checkpoint is never progressed when application process is shutting down ( _functionExecutionToken is cancelled).

Actual behavior

The checkpoint is progressed occasionally.

Reproduction Steps

  1. Use event hub trigger with following in local.
  2. After Recieve method started, press Ctrl + C to shutdown process.
  3. The checkpoint should be progressed. You can investigate linkedCts and _functionExecutionToken states with break point in the checkpointing.
[FunctionName("Receive")]
[ExponentialBackoffRetry(5, "00:00:10", "00:10:00")]
public async Task Receive(
    [EventHubTrigger("%EventHubName%", Connection = "EventHubConnectionString")] EventData[] events,
    CancellationToken cancellationToken)
{
    await Task.Delay(TimeSpan.FromMinutes(3), cancellationToken);
}

Environment

  • Platform: Windows (Functions runtime v3 and v4, Azure App Service)
    • We reproduced in local environment, but message lost was occurred in production Azure environment multiple times.
@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-triage Workflow: This issue needs the team to triage. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 5, 2024
@jsquire jsquire added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-team-triage Workflow: This issue needs the team to triage. labels Feb 5, 2024
@jsquire
Copy link
Member

jsquire commented Feb 5, 2024

@JoshLove-msft: Would you please advise on whether this is related to the ask on the Functions team to expose whether an execution would be retried if the host is not shutting down? If so, please transfer to the Functions host repo.

@jsquire
Copy link
Member

jsquire commented Feb 5, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@JoshLove-msft
Copy link
Member

@yfujiwara-sansan, I couldn't reproduce having _functionExecutionToken be canceled with linkedCts still not being cancelled. That said, it looks like you are correct that cancellation is not atomic when it comes to linked token sources, so it is possible that one of the sources can be canceled while the linked source is still not canceled. I will make a fix to explicitly check each of the token sources when making the checkpointing decision.

@yfujiwara-sansan
Copy link
Author

@JoshLove-msft Thank you for your work! I guess that the repro code sometimes fails to repro due to scheduling of continuations.

By the way, as far as I read #38067 again, it looks that lInkedCts now can be passed to TryExecute bacause it is not lInked to the token which is cancelled In drain mode. I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

@JoshLove-msft
Copy link
Member

I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

The semantics of the token passed to the function are that it is signaled only when shutting down in a way that is not guaranteed to allow the function to complete execution. In terms of whether it makes sense to also cancel this token when partition ownership is lost, I would have to defer to @jsquire on that.

@yfujiwara-sansan
Copy link
Author

I understand the semantics of the token passeed to the app. The cancellation behavior for ownership lost should be discussed as a different issue because it is intentional behavior.

Thank you for your answer! I and my colleagues are waiting for fixking the race condition.

@jsquire
Copy link
Member

jsquire commented Feb 12, 2024

I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

The semantics of the token passed to the function are that it is signaled only when shutting down in a way that is not guaranteed to allow the function to complete execution. In terms of whether it makes sense to also cancel this token when partition ownership is lost, I would have to defer to @jsquire on that.

@JoshLove-msft : The token that the processor passes to OnProcessingEventBatch will get canceled when partition ownership is lost. I would assume that we're flowing that token into the Executor so that the Function is notified. Looking over the implementation, it seems that we're not passing that along, for some reason. Any idea why?

@JoshLove-msft
Copy link
Member

I don't think there was a good reason - it may have just been an oversight. Updated the PR to pass linkedCts token to TryExecute.

@JoshLove-msft
Copy link
Member

You are right again. Apologies for the oversight. I will add these updates in.

@JoshLove-msft
Copy link
Member

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants