Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

track blocked thread event and report statistics in metrics endpoint #1649

Merged
merged 5 commits into from
Oct 12, 2023

Conversation

khanguyen88
Copy link
Contributor

Vert.x monitors and reports blocked thread. This PR will include that info in the /healthcheck endpoint so we can use that info to restart Kroki when perfomance degrades too much due to many blocked threads.

@ggrossetie
Copy link
Member

I think that's a good idea.
Did you investigate to understand why threads are blocked? It would be better to fix the root cause rather than periodically restart the service.

@khanguyen88
Copy link
Contributor Author

I had quite a few outages due to blocked threads. So far the logs indicated that certain inputs will cause PlantUml/Ditaa to hang or go into infinite loop. I'm looking out for those bad inputs, but in the meantime, I'm adding as the first step to improve reliability of my server.

@ggrossetie
Copy link
Member

Are you using the latest version? We are now using single executable binaries (produced by GraalVM) for both PlantUML and Ditaa so it shouldn't block threads anymore.

@khanguyen88
Copy link
Contributor Author

yes, I'm on 0.22

Comment on lines 27 to 28
private final Map<String, Instant> eventLoopStats = new ConcurrentHashMap<>();
private final Map<String, Instant> workerStats = new ConcurrentHashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned since we don't do eviction since is basically a memory leak. We could use something like https://github.com/ben-manes/caffeine to use a time-based eviction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I switched to Caffeine's Cache to store the stats.

I was reluctant to use Caffeine as it was not included in the project and I didn't expect the 2 maps to grow above 100 entries.

Comment on lines 50 to 53
if (blockedThreadChecker != null) {
data.put("blockedWorkerPercentage", blockedThreadChecker.blockedWorkerThreadPercentage());
data.put("blockedEventLoopPercentage", blockedThreadChecker.blockedEventLoopThreadPercentage());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should export data using Prometheus format on /metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use Prometheus on my server but I'm happy to help. Can you create another issue and assign to me? I'll work on it when I have the bandwidth

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've created an issue

@khanguyen88 khanguyen88 requested a review from ggrossetie October 2, 2023 07:43
@khanguyen88
Copy link
Contributor Author

Hi @ggrossetie could you please review and merge this PR?

@ggrossetie ggrossetie changed the title track blocked thread event and report statistics in healthcheck endpoint track blocked thread event and report statistics in metrics endpoint Oct 12, 2023
@ggrossetie ggrossetie merged commit cc0a6d1 into yuzutech:main Oct 12, 2023
1 check passed
@boris-moduscreate
Copy link

@ggrossetie Any eta on a release which includes this? ❤️

@ggrossetie
Copy link
Member

I will try to push a new release before my vacation (mid January)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants