Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdkafka_performance: Don't sleep while waiting for delivery reports #1918

Merged

Conversation

ankon
Copy link
Contributor

@ankon ankon commented Aug 3, 2018

Sleeping here means we won't be looking for delivery reports, which in latency mode
is problematic: The latency is calculated in the delivery report callback based on the time
when that is called, which would be offset by the slept time.

In my case I'm using the tool with low message rates (1-10 messages/s), and the latencies reported
by the producer ended up between 1s (1 message/s) and 100ms (10 messages/s). Looking into this
showed that the sleeping for the rate_sleep means that the tool doesn't process delivery
reports until it has slept -- but the latency is based on the actual time of running the delivery report callback,
not the time the response was accepted by rdkafka.

Fix this problem by replacing the sleep with a busy loop that polls (with a hand-wavy timeout of at most 100ms).
I opted for not adding a latency_mode branch, but just do the polling in either case when the rate limiter is enabled.

if (rate_sleep) {
rd_ts_t next = rd_clock() + (rd_ts_t) rate_sleep;
do {
rd_kafka_poll(rk, RD_MIN(100, rate_sleep));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rd_kafka_poll()'s timeout is in milliseconds while rate_sleep is in microseconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest calculating the remaining time (of rate_sleep) in each loop and using that for timeout, which removes the need for RD_MIN()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done -- but now I ended up with an RD_MAX to avoid polling with a negative timeout :)

@ankon ankon force-pushed the pr/perf-low-message-rate-latency branch from 4ee1063 to 8fbea5a Compare August 9, 2018 09:42
if (rate_sleep) {
rd_ts_t next = rd_clock() + (rd_ts_t) rate_sleep;
do {
int poll_timeout = RD_MAX(0, (next - rd_clock()) / 1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this variable is only used once, move the right side directly into poll() argument

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need int cast as so:
(int)((next - rd_clock()) / 1000)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and re-indented to avoid a long line.

Sleeping here means we won't be looking for delivery reports, which in latency mode
is problematic: The latency is calculated in the delivery report callback based on the time
when that is called, which would be offset by the slept time.
@ankon ankon force-pushed the pr/perf-low-message-rate-latency branch from 8fbea5a to b9f3de2 Compare August 9, 2018 10:43
@edenhill edenhill merged commit a3236a9 into confluentinc:master Aug 9, 2018
@edenhill
Copy link
Contributor

edenhill commented Aug 9, 2018

Thank you for this!

@ankon ankon deleted the pr/perf-low-message-rate-latency branch August 9, 2018 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants