
AIO version of EventHubConsumerClient has significant memory leak #36398

Closed
smoke opened this issue Jul 9, 2024 · 19 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. Messaging Messaging crew question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@smoke

smoke commented Jul 9, 2024

  • Package Name: azure-eventhub
  • Package Version: ^5.11.2 (maybe older as well)
  • Operating System: Linux
  • Python Version: 3.9, 3.12

Describe the bug
The AIO version of EventHubConsumerClient has a significant memory leak when async code is used inside the on_event callback.

To Reproduce
Steps to reproduce the behavior:

  1. Use reference implementation from https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/eventhub/azure-eventhub/samples/async_samples/recv_async.py
  2. Consume roughly 1.3-5k events per minute for 10-15 minutes
  3. Observe that memory only grows
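The growth in step 3 can also be quantified in-process rather than from pod metrics. Below is a minimal stdlib sketch (not part of the linked sample) that snapshots the traced Python heap with `tracemalloc`; the dummy allocation stands in for buffered events:

```python
import tracemalloc

# Illustrative sketch: snapshot the traced Python heap before and after
# some work; a steadily rising delta over time points at a Python-level
# leak rather than, say, allocator fragmentation.
def traced_bytes() -> int:
    """Return the currently traced Python heap usage in bytes."""
    current, _peak = tracemalloc.get_traced_memory()
    return current

tracemalloc.start()
baseline = traced_bytes()
buffers = [bytes(1024) for _ in range(1000)]  # stand-in for buffered events
grown = traced_bytes()
print(f"heap grew by {grown - baseline} bytes")
```

Calling `traced_bytes()` periodically from the consumer process and logging the delta makes the growth visible without relying on container-level metrics.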

Expected behavior
Long-running event processing should not leak memory.

Screenshots
AIO version memory usage over 12h:
(Note: the sharp drops are pod restarts triggered by the OOM killer.)
(image: AIO memory usage over 12h)

Sync version memory usage over 12h:
(image: sync memory usage over 12h)

Additional context
For comparison, the sync implementation has no issues at all: https://github.dev/Azure/azure-sdk-for-python/tree/main/sdk/eventhub/azure-eventhub/samples/sync_samples/recv.py

@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jul 9, 2024

github-actions bot commented Jul 9, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@swathipil swathipil added the Messaging Messaging crew label Jul 9, 2024
@swathipil
Member

Hi @smoke - Thanks for opening an issue! Are you using websocket, i.e. passing in TransportType.AmqpOverWebsocket? Or are you using the sample as is without any modifications?

@swathipil swathipil added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Jul 9, 2024

github-actions bot commented Jul 9, 2024

Hi @smoke. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@smoke
Author

smoke commented Jul 10, 2024

The sample as is, with a few lines added to send the events through httpx to another service.

@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Jul 10, 2024
@swathipil
Member

@smoke Would you also be able to provide debug-level logging, so we can see whether the memory jumps correlate with underlying errors?
To turn on logging:

```python
import logging
import sys

logger = logging.getLogger("azure.eventhub")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)
```

Then pass `logging_enable=True` during client construction: `EventHubConsumerClient(..., logging_enable=True)`.
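Before wiring this into the client, the logger setup itself can be sanity-checked with the stdlib alone. This is a sketch: the real DEBUG records come from the SDK once `logging_enable=True` is passed, and the captured list here just stands in for the console output you would inspect.

```python
import logging
import sys

# Verify that DEBUG records emitted on the "azure.eventhub" logger
# actually reach an attached handler.
logger = logging.getLogger("azure.eventhub")
logger.setLevel(logging.DEBUG)

captured = []

class ListHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        captured.append(record.getMessage())

logger.addHandler(ListHandler())
logger.addHandler(logging.StreamHandler(stream=sys.stdout))

logger.debug("link credit refreshed")  # stand-in for a real SDK debug line
print(captured)
```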

@swathipil swathipil added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Jul 10, 2024

@bretttegart

This memory leak is also impacting my application (forwarding large JSON audit events from Microsoft Entra to Azure Event Hub).

I reproduced the memory leak issue in a scratch environment. To do so, I wrote 200MB of dummy JSON data to a new event hub and downloaded it using the async reference implementation that Smoke linked in the issue. Debug logs from the receiver are attached.
debug.log

Unfortunately, the Sync implementation doesn't work in my environment due to a different issue with our dual-stack TCP outgoing proxy.

Please let me know if I can provide any additional context or logging!

@smoke
Author

smoke commented Jul 15, 2024

> @smoke Would you also be able to provide debug level logging to see if the memory jumps correlate with underlying errors? To turn on logging:
>
> ```python
> import logging
> import sys
> logger = logging.getLogger("azure.eventhub")
> logger.setLevel(logging.DEBUG)
> handler = logging.StreamHandler(stream=sys.stdout)
> logger.addHandler(handler)
> ```
>
> Pass in `logging_enable=True` during client construction: `EventHubConsumerClient(..., logging_enable=True)`

Please find the log file with the debug logger enabled: debug.log.tar.gz

Here is a graph of the memory usage, which clearly shows the leak:
(image: memory usage graph)

Note that, to make the scenario realistic, I used the following on_event handler:

```python
import asyncio

async def on_event(partition_context, event):
    # Put your code here.
    # If the operation is i/o intensive, async will have better performance.
    print("Received event from partition: {}.".format(partition_context.partition_id))

    ### changed code starts ###
    print('Simulating work')  # noqa: T201
    await asyncio.sleep(30000)
    ### changed code finishes ###

    await partition_context.update_checkpoint(event)
```

Update: I have tested with the unmodified code and there is no memory leak; apparently the problem only occurs when there is some async code inside on_event.

Here is the log from the pristine code: debug2.log.tar.gz

and the graph:
(image: memory usage graph)

@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Jul 15, 2024
@l0lawrence
Member

Hi @smoke,

A few questions:

Update: I have tested with code that does not have any additional code and there is no memory leak, apparently the problem is when there is some async code inside the on_event.

  1. When you say "some async code" here, are you referring to asyncio.sleep() specifically, or have you run into this with other async code as well?
  2. Have you tried sleep intervals other than 30,000 with different results? More info: after the first message from a consumer goes into on_event, you will continue to receive messages on that consumer and they will queue up in the background. When on_event awakens and completes for the current message, it takes the next one and sleeps again.
  3. What size are the events you are consuming?
  4. Are you using Azure Functions or any other env configurations we should consider?
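The queuing behavior described in question 2 can be illustrated with plain asyncio. This is a generic sketch of unbounded buffering, not the SDK's actual internals:

```python
import asyncio

# Sketch: a fast receiver fills an unbounded queue while a slow handler
# drains it one message at a time; the backlog (and memory) grows because
# put() never blocks on an unbounded asyncio.Queue.
async def demo() -> int:
    backlog: asyncio.Queue = asyncio.Queue()  # unbounded, like a prefetch buffer

    async def receiver() -> None:
        for i in range(1000):
            await backlog.put(f"event-{i}")  # completes immediately when not full

    async def slow_handler() -> None:
        while True:
            await backlog.get()
            await asyncio.sleep(0.01)  # stand-in for long-running work

    handler = asyncio.create_task(slow_handler())
    await receiver()
    depth = backlog.qsize()  # nearly all 1000 events still buffered
    handler.cancel()
    return depth

print(asyncio.run(demo()))
```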

@smoke
Author

smoke commented Jul 16, 2024

> A few questions:
>
> 1. When you say "some async code" here, are you referring to asyncio.sleep() specifically, or have you run into this with other async code as well?
> 2. Have you tried sleep intervals other than 30,000 with different results?
> 3. What size are the events you are consuming?
> 4. Are you using Azure Functions or any other env configurations we should consider?
  1. Yes, the memory leak happens with other regular async code, e.g. processing and sending the events using httpx.AsyncClient().
  2. The regular async code I have completes within anything from 100 ms to 10-20 s, and the result is the same. I guess on_event is forcefully executed in parallel by the consumer, creating a back pressure of messages that are read but unprocessed. In my opinion, parallel execution should be left for users to implement, so they can properly handle the back pressure.
  3. Not sure; it is difficult to check exactly, but it should be anywhere between 100 kB and 1-2 MB. If it is important, I can check further, let me know.
  4. K8s containers with amd64 arch on regular EC2 instances in an AWS EKS cluster; nothing out of the ordinary.

@l0lawrence
Member

Hi @smoke we were able to reproduce your issue and are looking into a fix, will update when we have more information.

@l0lawrence
Member

Hi @smoke, update: we have a fix merged into main. If you are able to test those changes by building a wheel (.whl) from that branch, that would help confirm that your issue has been resolved. If you are unable to, we will update this thread when the fix is released :)

@l0lawrence l0lawrence removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Sep 4, 2024
@l0lawrence l0lawrence added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Sep 4, 2024

github-actions bot commented Sep 4, 2024

Hi @smoke. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.


Hi @smoke, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@github-actions github-actions bot added the no-recent-activity There has been no recent activity on this issue. label Sep 11, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 26, 2024
@smoke
Author

smoke commented Sep 26, 2024

@l0lawrence I cannot do that right now, sorry. Can you link the PR or commit that fixes the issue here, and also make sure the issue is properly closed as resolved/fixed? That way we avoid:

github-actions bot closed this as not planned 6 hours ago

@github-actions github-actions bot reopened this Sep 26, 2024
@github-actions github-actions bot added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. no-recent-activity There has been no recent activity on this issue. labels Sep 26, 2024
@bretttegart

I packaged a new .whl for azure-eventhub from the Main branch (commit ef2521b) and duplicated @smoke's test condition (#36398 (comment)).

Memory usage on the latest build is now stable:
(image: memory usage plot)

I would appreciate it if we could cut a release with this patch included; I can't import arbitrary commits into my environment.

@l0lawrence
Member

Thanks for testing it out! We will work on getting this released and update you all when it is out.

@l0lawrence l0lawrence added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Sep 26, 2024
@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Sep 26, 2024

Hi @smoke. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

@l0lawrence
Member

Hi @smoke and @bretttegart, the fix is out in azure-eventhub 5.12.2. Thanks!

@github-actions github-actions bot locked and limited conversation to collaborators Jan 6, 2025