When apps stop fully running there's no mechanism to either restart them or abort the monitor #364

Open
kkonstan-ovo opened this issue May 16, 2022 · 1 comment

@kkonstan-ovo
Under certain circumstances, e.g. network errors, a service might stop due to an error, but no attempt is made to restart it.

The regular healthcheck notices that a service is not running, and therefore that the app is not fully running, but no remedial action is taken:

ERROR Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService will stop due to error. (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
        at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) ~[kafka-clients-2.4.0.jar:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) ~[kafka-clients-2.4.0.jar:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) ~[kafka-clients-2.4.0.jar:?]
        at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) ~[kafka-clients-2.4.0.jar:?]
        at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementHelper.minPartitionNum(MultiClusterTopicManagementService.java:324) ~[kafka-monitor-2.5.12.jar:2.5.12]
        at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementHelper.maybeCreateTopic(MultiClusterTopicManagementService.java:313) ~[kafka-monitor-2.5.12.jar:2.5.12]
        at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementRunnable.run(MultiClusterTopicManagementService.java:179) [kafka-monitor-2.5.12.jar:2.5.12]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
INFO Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService stopped. (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
INFO Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService shutdown completed (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
INFO TopicManagementService is not running. (com.linkedin.xinfra.monitor.apps.SingleClusterMonitor)
ERROR App single-cluster-monitor is not fully running. (com.linkedin.xinfra.monitor.XinfraMonitor)

Ideally, failing services should be restarted automatically.
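
To make the restart idea concrete, here is a rough sketch of a restart loop with exponential backoff and an attempt limit. Everything in it is an assumption for illustration: `RestartPolicy`, `backoffFor` and `restartWithBackoff` are hypothetical names, not part of xinfra-monitor, and the `restart` callback stands in for whatever would actually recreate and start the stopped service.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical helper, not part of xinfra-monitor: retries a restart action
 * with exponential backoff until it succeeds or the attempt limit is reached.
 */
final class RestartPolicy {
  private final int maxAttempts;
  private final Duration initialBackoff;
  private final Duration maxBackoff;

  RestartPolicy(int maxAttempts, Duration initialBackoff, Duration maxBackoff) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    this.maxAttempts = maxAttempts;
    this.initialBackoff = initialBackoff;
    this.maxBackoff = maxBackoff;
  }

  /** Wait before the given 1-based attempt: the initial backoff doubled per attempt, capped at maxBackoff. */
  Duration backoffFor(int attempt) {
    long millis = initialBackoff.toMillis();
    long cap = maxBackoff.toMillis();
    for (int i = 1; i < attempt && millis < cap; i++) {
      // Jump straight to the cap instead of doubling past it (also avoids overflow).
      millis = millis > cap / 2 ? cap : millis * 2;
    }
    return Duration.ofMillis(Math.min(millis, cap));
  }

  /**
   * Calls restart until it returns true or the attempts are exhausted.
   * Returns true if the service came back, false if the caller should give up.
   */
  boolean restartWithBackoff(BooleanSupplier restart) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        Thread.sleep(backoffFor(attempt).toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
      try {
        if (restart.getAsBoolean()) {
          return true;
        }
      } catch (RuntimeException e) {
        // Treat a failed restart like any other unsuccessful attempt and back off again.
        System.err.println("Restart attempt " + attempt + " failed: " + e);
      }
    }
    return false;
  }
}
```

A caller could build something like `new RestartPolicy(5, Duration.ofSeconds(1), Duration.ofMinutes(1))` and, if `restartWithBackoff` returns false, fall back to shutting the monitor down.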

If, however, dealing with the complexities of the above (backoff, retry limits, etc.) is not desirable within the context of this project, it should at the very least provide the option to shut down the monitor when an app is not fully running, so this can be dealt with by the scheduler that runs xinfra-monitor.

At the moment it just logs an error:

https://github.com/linkedin/kafka-monitor/blob/master/src/main/java/com/linkedin/xinfra/monitor/XinfraMonitor.java#L143-L145
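
As an illustration of the simpler option, the sketch below shows a health check that either only logs (the behaviour described in this issue) or runs an abort action when an app is not fully running. It is not the project's code: `FailFastHealthCheck`, the `abortOnFailure` flag and the `abortAction` callback are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch, not the project's code: a periodic health check that
 * logs when an app is not fully running and optionally aborts the process.
 */
final class FailFastHealthCheck implements Runnable {
  private final Map<String, BooleanSupplier> apps; // app name -> "is it fully running?"
  private final boolean abortOnFailure;            // assumed opt-in flag
  private final Runnable abortAction;              // e.g. () -> System.exit(1)

  FailFastHealthCheck(Map<String, BooleanSupplier> apps,
                      boolean abortOnFailure,
                      Runnable abortAction) {
    this.apps = new LinkedHashMap<>(apps);
    this.abortOnFailure = abortOnFailure;
    this.abortAction = abortAction;
  }

  @Override
  public void run() {
    for (Map.Entry<String, BooleanSupplier> app : apps.entrySet()) {
      if (!app.getValue().getAsBoolean()) {
        System.err.println("App " + app.getKey() + " is not fully running.");
        if (abortOnFailure) {
          // Hand the problem to whatever supervises the process.
          abortAction.run();
          return;
        }
      }
    }
  }
}
```

Passing `() -> System.exit(1)` as the abort action would let the scheduler that runs the monitor see a non-zero exit and apply its own restart policy. Keeping the flag off preserves the current log-only behaviour.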

@github-actions
This is your first issue in the repository. Thank you for raising this issue.
