Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with producer - low throughput #3328

Closed
4 tasks done
sendaun opened this issue Apr 5, 2021 · 21 comments
Closed
4 tasks done

Performance issue with producer - low throughput #3328

sendaun opened this issue Apr 5, 2021 · 21 comments

Comments

@sendaun
Copy link

sendaun commented Apr 5, 2021

Description

My application has a throughput requirement of at least 5000 records per second (with records ranging from 200 to 300 bytes each). In this regard, I have been testing librdkafka throughput for past one week by tweaking various configuration parameters. So far the max through put i could witness is ~ 1560 records per second (no where near the performance numbers I read here).

Few observations to share:

(1) As suggested in various forums, I tried with various batch sizes and queue.buffering.max.ms (between 1ms to 1000ms), but hardly any difference in the throughput.

(2) The statistics shows significantly high internal latency. [Attached]
librd_stat.txt

(3) Adding a another producer instance, shows almost double the through put (~2k to 2.5k records per second)

(4) As observed in this issue: #2356, I also notice that reducing the record size from 300 bytes to 30 bytes shows a drastic increase in the throughput numbers (don't have the exact numbers, but significantly high)

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v1.6.1
  • Apache Kafka version: 2.7.0
  • Operating system: z/OS (Unix System Services)
  • librdkafka client configuration:
    client.id = rdkafka
    client.software.name = librdkafka
    metadata.broker.list = 192.0.1.2:9092
    message.max.bytes = 600000
    message.copy.max.bytes = 65535
    receive.message.max.bytes = 100000000
    max.in.flight.requests.per.connection = 1000000
    metadata.request.timeout.ms = 60000
    topic.metadata.refresh.interval.ms = 5000
    metadata.max.age.ms = 900000
    topic.metadata.refresh.fast.interval.ms = 250
    topic.metadata.refresh.fast.cnt = 10
    topic.metadata.refresh.sparse = true
    topic.metadata.propagation.max.ms = 30000
    debug =
    socket.timeout.ms = 60000
    socket.blocking.max.ms = 1
    socket.send.buffer.bytes = 0
    socket.receive.buffer.bytes = 0
    socket.keepalive.enable = false
    socket.nagle.disable = false
    socket.max.fails = 1
    broker.address.ttl = 1000
    broker.address.family = any
    enable.sparse.connections = true
    reconnect.backoff.jitter.ms = 0
    reconnect.backoff.ms = 100
    reconnect.backoff.max.ms = 10000
    statistics.interval.ms = 10000
    enabled_events = 0
    stats_cb = 0x1e60a4e0
    log_cb = 0x1e643cd0
    log_level = 6
    log.queue = false
    log.thread.name = true
    enable.random.seed = true
    log.connection.close = true
    socket_cb = 0x1e6afa58
    open_cb = 0x1e7db490
    default_topic_conf = 0x5000102b60
    internal.termination.signal = 0
    api.version.request = true
    api.version.request.timeout.ms = 10000
    api.version.fallback.ms = 0
    broker.version.fallback = 0.10.0
    security.protocol = plaintext
    ssl.ca.certificate.stores = Root
    enable.ssl.certificate.verification = true
    ssl.endpoint.identification.algorithm = none
    sasl.mechanisms = GSSAPI
    sasl.kerberos.service.name = kafka
    sasl.kerberos.principal = kafkaclient
    sasl.kerberos.kinit.cmd = kinit -R -t "%{sasl.kerberos.keytab}" -k %{sasl.kerberos.principal} || kinit -t "%{sasl.kerberos.keytab}" -k %{sasl.kerberos.principal}
    sasl.kerberos.min.time.before.relogin = 60000
    enable.sasl.oauthbearer.unsecure.jwt = false
    test.mock.num.brokers = 0
    partition.assignment.strategy = range,roundrobin
    session.timeout.ms = 10000
    heartbeat.interval.ms = 3000
    group.protocol.type = consumer
    coordinator.query.interval.ms = 600000
    max.poll.interval.ms = 300000
    enable.auto.commit = true
    auto.commit.interval.ms = 5000
    enable.auto.offset.store = true
    queued.min.messages = 100000
    queued.max.messages.kbytes = 65536
    fetch.wait.max.ms = 500
    fetch.message.max.bytes = 1048576
    fetch.max.bytes = 52428800
    fetch.min.bytes = 1
    fetch.error.backoff.ms = 500
    offset.store.method = broker
    isolation.level = read_committed
    enable.partition.eof = false
    check.crcs = false
    allow.auto.create.topics = false
    client.rack =
    transaction.timeout.ms = 60000
    enable.idempotence = false
    enable.gapless.guarantee = false
    queue.buffering.max.messages = 100000
    queue.buffering.max.kbytes = 1048576
    queue.buffering.max.ms = 1000
    message.send.max.retries = 2147483647
    retry.backoff.ms = 100
    queue.buffering.backpressure.threshold = 1
    compression.codec = none
    batch.num.messages = 1000
    batch.size = 1000000
    delivery.report.only.error = true
    sticky.partitioning.linger.ms = 10

Broker configuration: 3 brokers, 1 topic/ 3 partitions, RF:3

@edenhill
Copy link
Contributor

edenhill commented Apr 5, 2021

Are you perhaps doing synchronous producing, that is, you wait for each produce() to complete by calling flush() or poll(long_timeout)?
https://github.com/edenhill/librdkafka/wiki/FAQ#why-is-there-no-sync-produce-interface

@sendaun
Copy link
Author

sendaun commented Apr 5, 2021

Thanks for the quick response @edenhill . I am not doing synchronous produce. My producer application has a separate polling thread that polls indefinitely:
while(1)
{
rd_kafka_poll(rk, 1000);
}

Apart from this no poll (or flush()) done in my main thread that produce() the records.

@edenhill
Copy link
Contributor

edenhill commented Apr 6, 2021

What configuration properties are you setting? (it is quite a bit of work to see which ones are set to non-default values in the config dump above)
Alternatively use debug=conf to get them printed on startup.

Are there any process limits in play, perhaps number of concurrent threads running?

@sendaun
Copy link
Author

sendaun commented Apr 7, 2021

These are the only configurations set:

%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: Client configuration:
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: client.software.version = 1.6.1
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: metadata.broker.list = 192.0.1.2:9092
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: debug = conf
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: queue.buffering.max.ms = 1000
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: batch.num.messages = 1000
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: delivery.report.only.error = true

I checked my system configuration for process limits, current configuration can go upto 1500 concurrent threads:

CURRENT UNIX CONFIGURATION SETTINGS:
MAXPROCSYS = 1500 MAXPROCUSER = 250
MAXFILEPROC = 256000 MAXFILESIZE = NOLIMIT
MAXCPUTIME = 1000 MAXUIDS = 200
MAXPTYS = 256 MAXIOBUFUSER = 2048
MAXMMAPAREA = 40960 MAXASSIZE = 2G
MAXTHREADS = 1500 MAXTHREADTASKS = 1500
MAXCORESIZE = 4194304 MAXSHAREPAGES = 131072
IPCMSGQBYTES = 2147483647 IPCMSGQMNUM = 10000
IPCMSGNIDS = 500 IPCSEMNIDS = 500
IPCSEMNOPS = 25 IPCSEMNSEMS = 50
IPCSHMMPAGES = 2048 IPCSHMNIDS = 500
IPCSHMNSEGS = 500 IPCSHMSPAGES = 262144
SUPERUSER = BPXROOT FORKCOPY = COPY

@sendaun
Copy link
Author

sendaun commented Apr 7, 2021

  • Turned on protocol/msg logging. Attached for your reference.
    librd_log.txt

@edenhill
Copy link
Contributor

edenhill commented Apr 7, 2021

Oh, you are setting acks=0:

%7|1617778132.515|CONF|rdkafka#producer-1| [thrd:192.0.1.2:9092/bootstrap]: Topic "kpis" configuration (user-supplied):
%7|1617778132.515|CONF|rdkafka#producer-1| [thrd:192.0.1.2:9092/bootstrap]:   request.required.acks = 0

That will not work great since the client will not receive much back pressure or pacing from the broker but just blast away messages as quickly as it can, supposedly faster than the broker can handle.
Highly recommend never to use acks=0. Use acks=1 if you don't care about messaging guarantees, or acks=all if you do.

@sendaun
Copy link
Author

sendaun commented Apr 7, 2021

ack=1 did not yield much throughput gain either. It is still around 1724 records per second.
log (Attempt to produce 100K records ~ 300bytes each):
librd_log_acks1.txt

@edenhill
Copy link
Contributor

edenhill commented Apr 7, 2021

Recommend removing "batch.num.messages", it is better for the client to fill batches as large as possible.

@edenhill
Copy link
Contributor

edenhill commented Apr 7, 2021

You can also reduce linger.ms to 100 ms

@sendaun
Copy link
Author

sendaun commented Apr 7, 2021

No luck still, we came down to ~1100 rps.

librd_log_100ms.txt

@edenhill
Copy link
Contributor

edenhill commented Apr 7, 2021

When the socket buffers become full there's this 1s delay between sends:

%7|1617783327.850|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 704037+65536/999872 bytes, CorrId 7)
%7|1617783327.850|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 769573+65536/999872 bytes, CorrId 7)
%7|1617783328.851|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 835109+94419/999872 bytes, CorrId 7)
%7|1617783329.851|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 929528+65536/999872 bytes, CorrId 7)
%7|1617783329.852|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent ProduceRequest (v7, 999872 bytes @ 995064, CorrId 7)

Are you comfortable with rebuilding librdkafka yourself?
If so I'd like you to change this line from 1000 to 10 and then try to reproduce:
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L78

@sendaun
Copy link
Author

sendaun commented Apr 7, 2021

That did the trick! I see the throughput went up to ~ 25K to 33K rps.

librd_log_blockms_10.txt

Now, Will it make sense to change max_block time (rd_kafka_max_block_ms) configurable by client ?

Thanks for your precious time on this @edenhill

@edenhill
Copy link
Contributor

edenhill commented Apr 7, 2021

Awesome!
Are those throughput numbers more in line with your requirements?

The max_block_ms is a safety harness which shouldn't really be used, it is just there to make sure we don't over-poll or under-poll, but the code should wake up automatically when there are more messages.
I think this might be related to the queue wakeup issues discussed here: #2912

@sendaun
Copy link
Author

sendaun commented Apr 7, 2021

Indeed! -
The numbers I see exceeds my applications current throughput requirement. I will settle down on a nominal max_block_ms value so I don't over/under poll.

The issue can be closed. Thanks again!

@edenhill
Copy link
Contributor

edenhill commented Apr 7, 2021

Dup of #2912

@edenhill edenhill closed this as completed Apr 7, 2021
@edenhill
Copy link
Contributor

edenhill commented Apr 8, 2021

@sendaun Could you try the https://github.com/edenhill/librdkafka/tree/nokinitinnew branch? (without reducing max_block_ms)
I think it might fix this issue.

@sendaun
Copy link
Author

sendaun commented Apr 8, 2021

Sure will test and keep you posted on the result.

@sendaun
Copy link
Author

sendaun commented Apr 8, 2021

Hi @edenhill , appears does not fix the problem. Tested with multiple runs but unable to cross 1500 records per sec.

librd_log_nokinitinnew.txt

@edenhill
Copy link
Contributor

Hey, can you try to reproduce on this branch? https://github.com/edenhill/librdkafka/tree/qwakeupfix

@sendaun
Copy link
Author

sendaun commented Apr 14, 2021

Unable to locate the branch, Should I try master?

@edenhill
Copy link
Contributor

edenhill commented Apr 14, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants