WIFI-13597: fix: modified kafka manager to use poll in producer #360

Merged (1 commit) on Jun 12, 2024

Conversation

i-chvets

Description

During the latest experiments with a large number of APs, we narrowed the memory growth down to the internal queue of the Kafka producer on the GW. With a large number of messages, the producer cannot empty this queue fast enough.
One noticeable suspect is the flush() call in KafkaManager.
It looks like flushing on every message slows the producer down to about 100 messages per second.

The solution is to use poll() instead, which allows faster message transmission at peak times.

Related Jira: https://telecominfraproject.atlassian.net/browse/WIFI-13597

Summary of changes:

  • Modified the code in KafkaManager to call poll instead of flush for every message sent. flush is now called only when the internal notification queue is empty, in idle times.
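
To illustrate the change described above, here is a minimal sketch of the "poll per message, flush when the queue is empty" pattern. The actual KafkaManager code is not shown in this conversation, so everything here is an assumption made for illustration: `FakeProducer` is a hypothetical stand-in that only records calls, its `produce`/`poll`/`flush` method names are modelled on common Kafka client APIs, and `drain` is a made-up helper name.

```python
from collections import deque


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer client; it only records calls."""

    def __init__(self):
        self.calls = []

    def produce(self, topic, payload):
        # A real client would append the message to its internal send queue.
        self.calls.append(("produce", topic))

    def poll(self, timeout):
        # A real client would serve pending delivery events here and return
        # without waiting for the internal queue to empty.
        self.calls.append(("poll", timeout))

    def flush(self):
        # A real client would block until every queued message is delivered.
        self.calls.append(("flush",))


def drain(queue, producer):
    """Send everything currently queued.

    poll() is called after each message; flush() is called once, and only
    after the notification queue has been emptied.
    """
    sent = 0
    while queue:
        topic, payload = queue.popleft()
        producer.produce(topic, payload)
        producer.poll(0)
        sent += 1
    if sent:
        # The queue is empty at this point, so this is the idle-time flush.
        producer.flush()
    return sent
```

Calling `drain(deque([("state", b"a"), ("state", b"b")]), FakeProducer())` records a produce/poll pair for each message followed by a single flush, instead of one blocking flush per message.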


Signed-off-by: Ivan Chvets <[email protected]>
@i-chvets i-chvets changed the title fix: modified kafka manager to use poll in producer WIFI-13597: fix: modified kafka manager to use poll in producer Jun 11, 2024

@stephb9959 stephb9959 left a comment


Have you tested this with very few devices, fewer than 5 for example? We tried this method in the past, but we found that messages were lingering in Kafka and not being received by other consumers on the bus. The fix looks good; I just want to make sure we are not trading one problem for another.

@i-chvets

i-chvets commented Jun 11, 2024


I ran experiments, and the queue goes down to zero pretty often; that is where flush() is executed. It looks like messages are being sent in batches of 10, which is fast enough to empty the queue.
Even with 10K APs, flush() was triggered multiple times a second under load (I will confirm with a 25K AP test as well). The message rate is still around 100 per second, but I guess that calling poll() on every message and executing flush() on batches of ~10 messages improves latency.

Memory did not go above 700MB for the whole hour of the test.

I ran tests with 1000 and 5000 APs for a short time, and all looked good. What indicates the Kafka message issue?
I will run a 5 AP simulation for 1 hour. What should I look for to confirm it is working properly?

@stephb9959

What we saw was a long delay in Kafka between publish and consume: 30 seconds to 1 minute. If you are not seeing that, then this is a good fix.
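
One possible way to check for the delay described above is to stamp each test message with its publish time and compare it with the receive time on the consumer side. The sketch below is only an illustration under stated assumptions: the helper names (`stamp`, `delay_seconds`, `is_lingering`), the JSON envelope, and the 30-second limit are hypothetical and are not part of the gateway code, and the comparison is only meaningful if the producer and consumer clocks are synchronized.

```python
import json
import time

# Assumed limit, taken from the low end of the delay range reported above.
MAX_DELAY_SECONDS = 30.0


def stamp(payload, now=None):
    """Wrap a payload with its publish time (seconds since the epoch)."""
    published_at = time.time() if now is None else now
    return json.dumps({"published_at": published_at, "payload": payload})


def delay_seconds(message, now=None):
    """Return how long a stamped message took to reach the consumer."""
    received_at = time.time() if now is None else now
    return received_at - json.loads(message)["published_at"]


def is_lingering(message, now=None, limit=MAX_DELAY_SECONDS):
    """True when the publish-to-consume delay reaches the limit."""
    return delay_seconds(message, now) >= limit
```

In a low-traffic run (for example the 5 AP simulation), a consumer could call `is_lingering` on every received message and log any that return True.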

@i-chvets

i-chvets commented Jun 12, 2024

Results of the 1-hour test with a 25,000 AP simulation: memory never grows beyond 1.5GB.
[Screenshot from 2024-06-12 09-49-29]


@stephb9959 stephb9959 left a comment


LGTM

@stephb9959 stephb9959 merged commit 02a0eef into master Jun 12, 2024
3 of 4 checks passed
@stephb9959 stephb9959 deleted the WIFI-13597-fix-kafka-producer-using-poll branch June 12, 2024 19:17
i-chvets added a commit to Telecominfraproject/wlan-cloud-owprov that referenced this pull request Jun 13, 2024
https://telecominfraproject.atlassian.net/browse/WIFI-13597

NOTE: This fix is a port of Telecominfraproject/wlan-cloud-ucentralgw#360

Summary of changes:
- Modified code in KafkaManager to use poll instead of flush for every
  message sent. flush is used only when the internal notification queue
  is empty, in idle times.

Signed-off-by: Ivan Chvets <[email protected]>
i-chvets added a commit to Telecominfraproject/wlan-cloud-analytics that referenced this pull request Jun 13, 2024
i-chvets added a commit to Telecominfraproject/wlan-cloud-ucentralsec that referenced this pull request Jun 13, 2024
i-chvets added a commit to Telecominfraproject/wlan-cloud-owls that referenced this pull request Jun 13, 2024
i-chvets added a commit to kinarasystems/wlan-cloud-owprov that referenced this pull request Sep 17, 2024