KafkaProducerConnector producer.send() future call taking too much time #5506
Comments
There is only one consumer for the topic; why do you have so many partitions?
@style95 My bad, the offset is 524 and the partition was 0, as seen from the log message. I use the default Kafka configuration as in the openwhisk-deploy-kube repo.
@abdulmonum
@style95 I have one node for the invoker. There are no prewarmed containers (in OpenWhisk terminology), but I force 60 cold starts to create 60 containers for the same action. Once they are up and running, I invoke a workload (the same action) at a fixed arrival rate of 100 rps. A request takes around 0.02 seconds, so in principle there should be about 2 containers busy at all times, which is also the case. However, during this workload there are periodic bursts of 40-60+ jobs getting queued at the KafkaProducerConnector producer.send() future call; all of them then go to the invoker, and if around 60+ queued jobs get scheduled, that induces a cold start. I am not sure why there would be backpressure from the invoker when no cold starts are happening, and how backpressure from the invoker would put pressure on Kafka so that it cannot send the acknowledgment to the controller that the message has been produced.
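For reference, the expected steady-state concurrency in this setup follows from Little's law, using only the numbers stated in the comment above (100 req/s arrival rate, ~0.02 s service time):

L = \lambda W = 100\ \text{req/s} \times 0.02\ \text{s} = 2\ \text{concurrent requests}

so the steady-state load by itself would not account for bursts of 40-60 queued sends.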
@style95 I see these logs periodically in the kafka-event-provider pod, and I am not sure if this is related: [2024-08-25T19:45:40.323Z] [INFO] [??] [kafkatriggers] [canary] I found a canary. The last one was 60.168319 seconds ago.
@abdulmonum
Environment details:
I found that requests form a short queue when the controller schedules an activation and posts to the topic for the invoker to consume.
Here is an example log for a tid:
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ActionsApi] [marker:controller_blockingActivation_start:104]
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ActionsApi] action activation id: d3b7ed1f632b4bd1b7ed1f632b4bd101 [marker:controller_loadbalancer_start:104]
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ShardingContainerPoolBalancer] scheduled activation d3b7ed1f632b4bd1b7ed1f632b4bd101, action 'guest/[email protected]' (managed), ns 'guest', mem limit 256 MB (std), time limit 60000 ms (std) to invoker0/owdev-invoker-0
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ShardingContainerPoolBalancer] posting topic 'invoker0' with activation id 'd3b7ed1f632b4bd1b7ed1f632b4bd101' [marker:controller_kafka_start:104]
[2024-08-17T19:12:51.666Z] [DEBUG] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [KafkaProducerConnector] sent message: invoker0[0][524]
[2024-08-17T19:12:51.666Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ShardingContainerPoolBalancer] posted to invoker0[0][524] [marker:controller_kafka_finish:122:18]
[2024-08-17T19:12:51.666Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ActionsApi] [marker:controller_loadbalancer_finish:122:18]
The time between [ShardingContainerPoolBalancer] posting topic 'invoker0' and [KafkaProducerConnector] sent message: invoker0[0][524] is more than 30 ms (and in some cases even more), and this results in many requests forming a queue here and, subsequently, a burst of jobs being scheduled to the invoker.
Upon inspecting the source code of KafkaProducerConnector:
Future {
  blocking {
    try {
      producer.send(record, new Callback {
        override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
          if (exception == null)
            produced.trySuccess(ResultMetadata(metadata.topic(), metadata.partition(), metadata.offset()))
          else produced.tryFailure(exception)
        }
      })
    } catch {
      case e: Throwable =>
        produced.tryFailure(e)
    }
  }
}
The completion of produced.future is what takes time. Upon checking the logs further, I found that messages invoker0[0][519-523] are sent in the meantime, after the producer.send() call for offset 524, and only then is invoker0[0][524] sent. Why is this call so slow, and how can I improve performance here so that the short queue caused by the slow completion of the future does not form?
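As a diagnostic starting point (not OpenWhisk's actual configuration), here is a minimal, self-contained Scala sketch that reproduces the send-and-callback pattern above against a standalone Kafka client and prints the enqueue-to-acknowledgment latency per record. The bootstrap address, topic name, and the tuning values (acks, linger.ms, batch.size, max.in.flight.requests.per.connection) are assumptions chosen for illustration; only the ProducerConfig keys themselves are standard Kafka client options. Measuring this outside the controller can help separate client-side queueing (buffer, batching, metadata fetches) from broker acknowledgment latency (replication, acks setting).

import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata}
import org.apache.kafka.common.serialization.StringSerializer

object ProducerLatencyProbe {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    // Illustrative tuning knobs that commonly affect send/ack latency; values are assumptions, not OpenWhisk defaults.
    props.put(ProducerConfig.ACKS_CONFIG, "1")      // "all" waits for replica acknowledgments and is slower
    props.put(ProducerConfig.LINGER_MS_CONFIG, "0") // values > 0 delay sends to improve batching
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384")
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1") // preserves ordering but limits pipelining

    val producer = new KafkaProducer[String, String](props)
    val topic = "invoker0" // assumed, mirroring the topic in the logs above

    for (i <- 1 to 100) {
      val enqueuedAt = System.nanoTime()
      val record = new ProducerRecord[String, String](topic, s"activation-$i")
      // Same pattern as KafkaProducerConnector: the result is only known in the broker acknowledgment callback.
      producer.send(record, new Callback {
        override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
          val ackMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAt)
          if (exception == null)
            println(s"acked ${metadata.topic()}[${metadata.partition()}][${metadata.offset()}] after $ackMs ms")
          else
            println(s"failed after $ackMs ms: ${exception.getMessage}")
        }
      })
    }
    producer.flush()
    producer.close()
  }
}

If producer.send() itself returns quickly and the delay is between that return and the callback, the time is being spent in batching and broker acknowledgment (linger.ms, acks, replication); if send() itself blocks, the client-side buffer or metadata fetches are the likely bottleneck. Note also that the ordering observed in the logs (offsets 519-523 acknowledged before 524) is expected behavior: callbacks for records sent to the same partition fire in order, so a record enqueued behind several earlier records cannot be acknowledged before them.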