KafkaProducerConnector producer.send() future call taking too much time #5506

abdulmonum opened this issue Aug 22, 2024 · 6 comments

Environment details:

  • I am using a k8s setup from openwhisk-deploy-kube, with action limits increased to 99999.
  • One node hosts the OW core components; the other node hosts the invoker.
  • Using the old scheduler; a client generates a request rate of around 100 rps, and the function takes 0.02 seconds to complete (all invocations are blocking calls).
  • There are already 60 warm containers up and running before the 100 rps workload is generated.
  • One controller and one invoker node, both with a 1024 MB JVM heap.

I found that requests form a short queue while the controller schedules the activation and posts to the topic for the invoker to consume.
Here is an example log for one tid:

[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ActionsApi] [marker:controller_blockingActivation_start:104]
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ActionsApi] action activation id: d3b7ed1f632b4bd1b7ed1f632b4bd101 [marker:controller_loadbalancer_start:104]
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ShardingContainerPoolBalancer] scheduled activation d3b7ed1f632b4bd1b7ed1f632b4bd101, action 'guest/[email protected]' (managed), ns 'guest', mem limit 256 MB (std), time limit 60000 ms (std) to invoker0/owdev-invoker-0
[2024-08-17T19:12:51.632Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ShardingContainerPoolBalancer] posting topic 'invoker0' with activation id 'd3b7ed1f632b4bd1b7ed1f632b4bd101' [marker:controller_kafka_start:104]
[2024-08-17T19:12:51.666Z] [DEBUG] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [KafkaProducerConnector] sent message: invoker0[0][524]
[2024-08-17T19:12:51.666Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ShardingContainerPoolBalancer] posted to invoker0[0][524] [marker:controller_kafka_finish:122:18]
[2024-08-17T19:12:51.666Z] [INFO] [#tid_cbDR904isIhkDhBksSNU4GiocWngMPD1] [ActionsApi] [marker:controller_loadbalancer_finish:122:18]

The time between the [ShardingContainerPoolBalancer] "posting topic 'invoker0'" line and the [KafkaProducerConnector] "sent message: invoker0[0][524]" line is more than 30 ms (in some cases considerably more). This results in many requests forming a queue here and, subsequently, a burst of jobs being scheduled to the invoker.
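
To quantify how often this happens, the gap can be extracted per transaction id from the controller log. Below is a minimal sketch that assumes the log line format shown above; the file name controller.log and the object name are placeholders.

import java.time.{Duration, Instant}
import scala.collection.mutable
import scala.io.Source

// Sketch: per transaction id, compute the gap between the "posting topic"
// line and the "sent message" line, using the controller log format above.
object KafkaPostGap {
  private val logLine = raw"""\[(\S+)\] \[\w+\] \[#tid_(\S+)\] .*""".r

  def main(args: Array[String]): Unit = {
    val postingAt = mutable.Map[String, Instant]()
    for (line <- Source.fromFile("controller.log").getLines()) line match {
      case logLine(ts, tid) if line.contains("posting topic") =>
        postingAt(tid) = Instant.parse(ts)
      case logLine(ts, tid) if line.contains("sent message") =>
        postingAt.remove(tid).foreach { t0 =>
          println(s"$tid: ${Duration.between(t0, Instant.parse(ts)).toMillis} ms")
        }
      case _ => // unrelated log line
    }
  }
}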

Upon inspecting the source code in KafkaProducerConnector:

Future {
  blocking {
    try {
      producer.send(record, new Callback {
        override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
          if (exception == null)
            produced.trySuccess(ResultMetadata(metadata.topic(), metadata.partition(), metadata.offset()))
          else produced.tryFailure(exception)
        }
      })
    } catch {
      case e: Throwable =>
        produced.tryFailure(e)
    }
  }
}

produced.future.andThen {
  case Success(status) =>
    logging.debug(this, s"sent message: ${status.topic}[${status.partition}][${status.offset}]")
    sentCounter.next()
  case Failure(t) =>
    logging.error(this, s"sending message on topic '$topic' failed: ${t.getMessage}")
}

The completion of produced.future is what takes time. Upon checking the logs further, I found that messages invoker0[0][519-523] are sent in the meantime, after the producer.send() for partition 524, and then finally invoker0[0][524] is sent. Why does the call take so long, and how can I increase performance here so that the slow return of the future does not cause this short queue to form?
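
One way to narrow down where the time goes is to time the send path directly. Below is a minimal, self-contained sketch using the plain Kafka producer client (not the OpenWhisk connector itself); the broker address, topic, payload, and object name are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

// Sketch: measure how long producer.send() blocks the caller vs. how long
// the broker acknowledgement takes. Broker address and topic are placeholders.
object SendTiming {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val record = new ProducerRecord[String, String]("invoker0", "payload")
    val start = System.nanoTime()
    producer.send(record, new Callback {
      override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
        // elapsed time until the broker acknowledged the record
        println(f"broker ack after ${(System.nanoTime() - start) / 1e6}%.1f ms")
      }
    })
    // elapsed time spent inside send() itself (metadata fetch, buffer allocation, serialization)
    println(f"send() returned after ${(System.nanoTime() - start) / 1e6}%.1f ms")
    producer.flush()
    producer.close()
  }
}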

style95 (Member) commented Aug 23, 2024

There is only one consumer for the topic; why do you have so many partitions?
While one consumer can receive messages from multiple partitions, as far as I know it is not good for performance.

abdulmonum (Author) commented Aug 23, 2024

@style95 My bad, the offset is 524 and the partition was 0, as seen from the log message. I use the default Kafka configuration from the openwhisk-deploy-kube repo.

style95 (Member) commented Aug 26, 2024

@abdulmonum
How many nodes do you have for the invoker?
And how many prewarmed containers do you have?
The issue might be due to a limited number of invokers or slow container (pod) creation, which can cause backpressure.

abdulmonum (Author) commented Aug 26, 2024

@style95 I have one node for the invoker. There are no prewarmed containers (in OW terminology), but I force 60 cold starts to create 60 containers (for the same action), and once they are up and running, I invoke a workload (the same action) with a fixed arrival rate of 100 rps. A request takes around 0.02 seconds, so in principle there should be about 2 containers busy at all times, which is also the case. However, during this workload there are periodic bursts of 40-60+ jobs getting queued at the KafkaProducerConnector producer.send() future call; they all then go to the invoker, and when around 60+ queued jobs get scheduled at once, that induces a cold start. I am not sure why there would be backpressure from the invoker when no cold starts are happening. And how would backpressure from the invoker put pressure on Kafka such that the acknowledgment that the message was produced cannot be sent back to the controller?

abdulmonum (Author) commented:

@style95 I see these logs periodically in the kafka-event-provider pod and I am not sure if this is related:

[2024-08-25T19:45:40.323Z] [INFO] [??] [kafkatriggers] [canary] I found a canary. The last one was 60.168319 seconds ago.
[2024-08-25T19:46:40.503Z] [INFO] [??] [kafkatriggers] [canary] I found a canary. The last one was 60.178885 seconds ago.
[2024-08-25T19:47:40.649Z] [INFO] [??] [kafkatriggers] [canary] I found a canary. The last one was 60.1465 seconds ago.
[2024-08-25T19:48:40.824Z] [INFO] [??] [kafkatriggers] [canary] I found a canary. The last one was 60.173952 seconds ago.
[2024-08-25T19:49:40.829Z] [ERROR] [??] [kafkatriggers] [canary] Exception caught from changes feed. Restarting changes feed...
[2024-08-25T19:49:40.829Z] [ERROR] [??] [kafkatriggers] HTTPConnectionPool(host='owdev-couchdb.openwhisk.svc.cluster.local', port=5984): Read timed out.
[2024-08-25T19:49:40.829Z] [INFO] [??] [kafkatriggers] Shutting down existing DB client
[2024-08-25T19:49:40.830Z] [INFO] [??] [kafkatriggers] Resetting dropped connection: owdev-couchdb.openwhisk.svc.cluster.local
[2024-08-25T19:49:40.840Z] [INFO] [??] [kafkatriggers] Starting changes feed
[2024-08-25T19:49:40.843Z] [INFO] [??] [kafkatriggers] Starting new HTTP connection (1): owdev-couchdb.openwhisk.svc.cluster.local
[2024-08-25T19:49:40.872Z] [INFO] [??] [kafkatriggers] Database exists - connecting to it.

style95 (Member) commented Aug 26, 2024

@abdulmonum
Are you using the Kafka trigger to invoke your action?
Basically, your assumption is correct. If the pods are already created and your action only takes 0.02 seconds, it should be fine.
What is the batch size of your Kafka producer, and how much user memory is configured for the invoker?
Do you have any suspicious logs on the consumer side as well?
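
For context, the producer settings that usually influence this send latency are batch.size, linger.ms, and acks. A minimal sketch of setting them on a plain Kafka producer follows; the values, broker address, and object name are illustrative, not the openwhisk-deploy-kube defaults.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

object TunedProducer {
  // Illustrative settings only; not the openwhisk-deploy-kube defaults.
  def build(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "owdev-kafka:9092") // placeholder address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.LINGER_MS_CONFIG, "0")        // send immediately instead of waiting to fill a batch
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384")   // batch size in bytes (the Kafka default)
    props.put(ProducerConfig.ACKS_CONFIG, "1")             // wait for the partition leader only, not all replicas
    props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000") // how long send() may block on metadata or a full buffer
    new KafkaProducer[String, String](props)
  }
}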
