When apps stop fully running there's no mechanism to either restart them or abort the monitor #364

kkonstan-ovo · 2022-05-16T15:36:26Z

Under certain circumstances, e.g. network errors, a service might stop due to an error, and no attempt is made to restart it.

The regular health check notices that a service is not running, and therefore that the app is not fully running, but takes no remedial action:

ERROR Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService will stop due to error. (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
        at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) ~[kafka-clients-2.4.0.jar:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) ~[kafka-clients-2.4.0.jar:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) ~[kafka-clients-2.4.0.jar:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) ~[kafka-clients-2.4.0.jar:?]
        at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementHelper.minPartitionNum(MultiClusterTopicManagementService.java:324) ~[kafka-monitor-2.5.12.jar:2.5.12]
        at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementHelper.maybeCreateTopic(MultiClusterTopicManagementService.java:313) ~[kafka-monitor-2.5.12.jar:2.5.12]
        at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementRunnable.run(MultiClusterTopicManagementService.java:179) [kafka-monitor-2.5.12.jar:2.5.12]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
INFO Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService stopped. (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
INFO Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService shutdown completed (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
INFO TopicManagementService is not running. (com.linkedin.xinfra.monitor.apps.SingleClusterMonitor)
ERROR App single-cluster-monitor is not fully running. (com.linkedin.xinfra.monitor.XinfraMonitor)

Ideally, failing services should be restarted automatically.
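
To make the restart idea concrete, here is a rough sketch of a restart policy with capped exponential backoff and a retry limit. Everything in it is an assumption for illustration: `RestartPolicy`, `delayMs` and `restart` are hypothetical names and not part of xinfra-monitor, and the `BooleanSupplier` stands in for whatever would actually stop and re-start a failed service.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical restart policy (not part of xinfra-monitor): capped exponential backoff with a retry limit. */
final class RestartPolicy {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final int maxAttempts;

    RestartPolicy(long baseDelayMs, long maxDelayMs, int maxAttempts) {
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.maxAttempts = maxAttempts;
    }

    /** Delay before the given 1-based attempt: doubles on every attempt, never exceeds maxDelayMs. */
    long delayMs(int attempt) {
        // Cap the shift so that small base delays cannot overflow a long.
        int shift = Math.max(0, Math.min(attempt - 1, 30));
        return Math.min(maxDelayMs, baseDelayMs << shift);
    }

    /**
     * Calls startAttempt until it reports success or maxAttempts is reached.
     * Returns true if the service came back, false if the caller should give up
     * (for example by shutting the whole monitor down).
     */
    boolean restart(BooleanSupplier startAttempt) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Thread.sleep(delayMs(attempt));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return false;
            }
            if (startAttempt.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could construct this with something like `new RestartPolicy(1000, 60000, 5)` and treat a `false` result from `restart(...)` as the signal that the failure is not transient.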

If, however, dealing with the complexities of the above (backoff, retry limits, etc.) is not desirable within the context of this project, it should at the very least provide an option to shut down the monitor when an app is not fully running, so that the failure can be handled by the scheduler that runs xinfra-monitor.

At the moment it just logs an error:

https://github.com/linkedin/kafka-monitor/blob/master/src/main/java/com/linkedin/xinfra/monitor/XinfraMonitor.java#L143-L145
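
As a rough sketch of the simpler alternative, the health check could invoke an abort action in addition to logging. This is not the project's code: `FailFastHealthCheck`, the `abortOnFailure` flag and the `abortAction` callback are hypothetical names, and each app is modelled as a plain `BooleanSupplier` reporting whether it is running.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical fail-fast health check (not part of xinfra-monitor). */
final class FailFastHealthCheck {
    private final Map<String, BooleanSupplier> apps = new LinkedHashMap<>();
    private final boolean abortOnFailure;   // hypothetical opt-in setting
    private final Runnable abortAction;     // e.g. () -> System.exit(1)

    FailFastHealthCheck(boolean abortOnFailure, Runnable abortAction) {
        this.abortOnFailure = abortOnFailure;
        this.abortAction = abortAction;
    }

    /** Registers an app together with a probe that returns true while it is running. */
    void register(String name, BooleanSupplier isRunning) {
        apps.put(name, isRunning);
    }

    /**
     * Logs every app that is not running and returns how many there were.
     * If at least one is down and abortOnFailure is set, runs the abort action once.
     */
    int check() {
        int notRunning = 0;
        for (Map.Entry<String, BooleanSupplier> entry : apps.entrySet()) {
            if (!entry.getValue().getAsBoolean()) {
                System.err.println("App " + entry.getKey() + " is not fully running.");
                notRunning++;
            }
        }
        if (notRunning > 0 && abortOnFailure) {
            abortAction.run();
        }
        return notRunning;
    }
}
```

With the abort action set to exit with a non-zero status, whatever supervises the process would see the failure and could restart it. With the flag off, the behaviour stays as it is today: log only.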


github-actions · 2022-05-16T15:37:13Z

This is your first issue in the repository. Thank you for raising this issue.
