Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide blockhound integrations #4493

Merged
merged 71 commits into from
May 25, 2023

Conversation

jrhee17
Copy link
Contributor

@jrhee17 jrhee17 commented Oct 20, 2022

Motivation:

Armeria users often use BlockHound to determine whether there are any blocking calls in their applications.
Providing a basic BlockHound integration can help de-duplicate some code and debugging efforts.

I propose the following:

  • Our CI checks for blocking calls in our tests to keep our BlockHound integration up-to-date with changes.
  • We provide a lenient list of allowed blocking calls that may be commonly used even though not related directly to Armeria
    • Some JDK implementations internally use locks (ThreadPoolExecutor). As long as these locks are verified to be short-living, we can add it to our integration.
    • Although somewhat subjective, if short-living locks are held by Armeria integrations (not necessarily directly related to Armeria) I think it's reasonable to add it to our list.
      • HdrHistogram is used internally by Micrometer. Micrometer itself doesn't expect metrics to be recorded from event loops, so it may not be reasonable to expect Micrometer or HdrHistogram to provide Blockhound integrations. Since Armeria uses Micrometer to record metrics from event loops, I think it is reasonable that we add HdrHistogram related calls to our Blockhound integration.
    • The integration is public so that users can opt-out of some integrations if they choose to do so.
      • We can possibly provide more customization points if users feel some rules are too lenient.
  • I don't consider the current listing as complete/finalized, and expect that users can suggest additional rules in the future.

Also note that there are some blocking calls within the codebase, but this PR attempts to just introduce blockhound rather than remove such calls.

Modifications:

  • Modify the build action so that
    • The blockhound flag is set for java 19
    • Upload blockhound logs which contains a list of blocked threads and their stacktraces
    • Change the artifact upload name. Currently, because the name can be duplicated for java19 (mac, windows, self-hosted), only one of these is downloadable
  • Provide BlockHound integrations for core, brave, grpc, retrofit2, scala, sangria modules
  • Provide an internal testing BlockHound integration which allows test-specific blocking calls
    • This integration also adds a callback which logs blocking calls to a separate file
  • If the blockhound gradle property flag is specified, run tests with BlockHound enabled
    • Otherwise, BlockHound.install is not invoked and not enabled in the code-base
  • Introduce ReentrantShortLock which Armeria's BlockHound whitelists and change all usages of ReentrantLock
  • BlockingFastThreadLocalThread is now used for non-event loops since netty whitelists FastThreadLocalThread with permitBlockingCalls==true
  • ShutdownSupport::shutdown is now called from startStopExecutor. Although this method returns a CF, the call can actually be blocking
  • START_STOP_EXECUTOR is now a non-eventloop daemon thread instead of using the GlobalEventExecutor
    • This is because GlobalEventExecutor is an event loop.
  • Introduce BlockingUtils#blockingRun in case users want to run blocking calls from tests
  • Add the kotlinx-coroutines-debug dependency so that Coroutines BlockHound integration is loaded
  • Allow users to specify the ThreadFactory when creating a EventLoopExtension or EventLoopGroupExtension
    • This can be useful when users would like to create an event loop which allows blocking calls
  • SamlService, SamlDecorator now runs some calls from the blocking executor
  • ZookeeperEndpointGroup now uses a non-event-loop thread factory

Result:

  • Users can now use BlockHound integrations provided by armeria

@jrhee17
Copy link
Contributor Author

jrhee17 commented Oct 20, 2022

Note: There is a known issue where a Option AllowRedefinitionToAddDeleteMethods was deprecated.. is printed in logs. Will look into this as well.
https://bugs.openjdk.org/browse/JDK-8221528
reactor/BlockHound#33

@jrhee17 jrhee17 force-pushed the feature/blockhound-integration branch from 90e89e3 to 11e8e18 Compare November 4, 2022 04:52
@ikhoon
Copy link
Contributor

ikhoon commented Nov 10, 2022

As of Netty 4.1.85, FastThreadLocalThread permits to perform blocking calls. If the flag is enabled, the Netty Blockhound integration module does not assume the thread is a non-blocking thread. netty/netty#12978

@mauhiz
Copy link
Contributor

mauhiz commented Jan 25, 2023

Hi, let me report the following after trying blockhound with Armeria 1.20.3

Zipkin tracing can block when flushing spans

at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267)
	at zipkin2.reporter.ByteBoundedQueue.offer(ByteBoundedQueue.java:51)
	at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:259)
	at (...)
	at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
	at zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
	at zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
	at brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)
	at brave.internal.recorder.PendingSpans.finish(PendingSpans.java:122)
	at brave.RealSpan.finish(RealSpan.java:143)
	at brave.http.HttpHandler.handleFinish(HttpHandler.java:84)
	at brave.http.HttpServerHandler.handleSend(HttpServerHandler.java:200)
	at com.linecorp.armeria.server.brave.BraveService.lambda$maybeAddTagsToSpan$2(BraveService.java:138)

Though it sounds ignorable, I'm not sure if armeria integration should be the one ignoring it, as opposed to each application using it.

Reading Prometheus gauges (ExecutorServiceMetrics) can block

at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.size(ScheduledThreadPoolExecutor.java:1069)
	at io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics.lambda$monitor$0(ExecutorServiceMetrics.java:347)
	at io.micrometer.core.instrument.internal.DefaultGauge.value(DefaultGauge.java:53)
	at io.micrometer.prometheus.PrometheusMeterRegistry.lambda$newGauge$5(PrometheusMeterRegistry.java:335)
	at io.micrometer.prometheus.MicrometerCollector.collect(MicrometerCollector.java:75)
	at io.prometheus.client.Collector.collect(Collector.java:45)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:204)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:219)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:152)
	at io.prometheus.client.exporter.common.TextFormat.writeOpenMetrics100(TextFormat.java:202)
	at io.prometheus.client.exporter.common.TextFormat.writeFormat(TextFormat.java:57)
	at com.linecorp.armeria.server.metric.PrometheusExpositionService.doGet(PrometheusExpositionService.java:95)
	at com.linecorp.armeria.server.AbstractHttpService.serve(AbstractHttpService.java:53)

This looks ignorable.

recording in a Hdr histogram can cause a sleep

at java.base/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446)
	at org.HdrHistogram.WriterReaderPhaser.flipPhase(WriterReaderPhaser.java:244)
	at org.HdrHistogram.DoubleRecorder.performIntervalSample(DoubleRecorder.java:300)
	at org.HdrHistogram.DoubleRecorder.reset(DoubleRecorder.java:265)
	at io.micrometer.core.instrument.distribution.TimeWindowPercentileHistogram.resetBucket(TimeWindowPercentileHistogram.java:63)
	at io.micrometer.core.instrument.distribution.TimeWindowPercentileHistogram.resetBucket(TimeWindowPercentileHistogram.java:35)
	at io.micrometer.core.instrument.distribution.AbstractTimeWindowHistogram.rotate(AbstractTimeWindowHistogram.java:268)
	at io.micrometer.core.instrument.distribution.AbstractTimeWindowHistogram.recordLong(AbstractTimeWindowHistogram.java:221)
	at io.micrometer.core.instrument.distribution.TimeWindowPercentileHistogram.recordLong(TimeWindowPercentileHistogram.java:35)
	at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:182)
	at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:50)
	at com.linecorp.armeria.internal.common.metric.RequestMetricSupport.onResponse(RequestMetricSupport.java:115)

The sleep is about 1ms at most, so I suppose it can be ignored, but let's be aware that it could happen for each histogram that is being updated in a non-blocking thread.

Submitting a task to a blocking scheduler can block

at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267)
	at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:916)
	at java.base/java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:346)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705)
	at com.linecorp.armeria.common.util.DefaultBlockingTaskExecutor.execute(DefaultBlockingTaskExecutor.java:126)

The lock seems like it's never held for long.

Guava's MapMaker's ConcurrentMap can block

at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267)
	at com.linecorp.armeria.internal.shaded.guava.collect.MapMakerInternalMap$Segment.put(MapMakerInternalMap.java:1476)
	at com.linecorp.armeria.internal.shaded.guava.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:2419)
	at java.base/java.util.concurrent.ConcurrentMap.computeIfAbsent(ConcurrentMap.java:331)
	at com.linecorp.armeria.client.HttpClientFactory.pool(HttpClientFactory.java:404)
	at com.linecorp.armeria.client.HttpClientDelegate.acquireConnectionAndExecute0(HttpClientDelegate.java:169)

Again, the lock seems never held for long.

@minwoox
Copy link
Contributor

minwoox commented Mar 23, 2023

Is there any progress on this?

@jrhee17
Copy link
Contributor Author

jrhee17 commented Mar 23, 2023

Is there any progress on this?

Let me try working on this again when I have more time 😅

At the moment, the difficult part is figuring out why failed tests are failing. The pattern most often found is:

  1. An assertion is made on a thread other than the junit main thread
  2. I have to spend a significant amount of time figuring out at which point the test failed

Let me look into this again next week 😅 Sorry about the delay

@ikhoon
Copy link
Contributor

ikhoon commented Apr 21, 2023

If we introduce a general interface that marks all methods of its subclasses are safe to execute in an event loop, it would be easier to implement the BlockHound integration module.

The following example was borrowed from the change in #4837

public interface EventLoopSafe {}

class DestroyHandler implements BiFunction<Object, Throwable, Void>, EventLoopSafe {

	@Override
	public Void apply(Object unused, Throwable throwable) {
	    // BlockHound will report this blocking operation without EventLoopSafe
	    lock.lock();
	    try {
	       ...
	    } finally {
	        lock.unlock();
	    }
	    return null;
	}
}

@jrhee17
Copy link
Contributor Author

jrhee17 commented Apr 21, 2023

If we introduce a general interface that marks all methods of its subclasses are safe to execute in an event loop, it would be easier to implement the BlockHound integration module.

I'm not sure if I understood the intention, but could it be easier if we just introduce a lock which is ignorable by Blockhound?
I'm not sure if it's feasible to ensure all calls to lock are wrapped in such a function (especially if the call stack becomes larger)

@ikhoon
Copy link
Contributor

ikhoon commented Apr 21, 2023

In #4837, it is likely that DefaultHealthCheckerContext.destroy() is detected by BlockHound because of the usage of ReentrantLock.lock().

Question: Is it feasible to extend ReentrantLock and add it to allowBlockingCallsInside?
For example:

class ReentrantShortLock extends ReentrantLock { ... }

BlockHound.Builder builder = ...;
builder.allowBlockingCallsInside("com.linecorp.armeria.internal.common.ReentrantShortLock",
                                 "lock");

@jrhee17 jrhee17 force-pushed the feature/blockhound-integration branch from ddbdbc3 to 9967be7 Compare April 26, 2023 23:47
@jrhee17 jrhee17 added this to the 1.24.0 milestone May 4, 2023
@jrhee17
Copy link
Contributor Author

jrhee17 commented May 4, 2023

Sorry about the delay 🙏 I believe this PR is (finally) ready for review

@jrhee17 jrhee17 marked this pull request as ready for review May 4, 2023 04:54
@jrhee17 jrhee17 requested review from ikhoon and minwoox as code owners May 4, 2023 04:54
Copy link
Contributor

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! 🙇

public void applyTo(Builder builder) {
// short locks
builder.allowBlockingCallsInside("com.linecorp.armeria.client.HttpClientFactory",
"pool");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain which part of the method is a blocking call? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the report from @mauhiz contains the stacktrace. I was able to reproduce this failure from our CI as well.

#4493 (comment)

at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:267)
	at com.linecorp.armeria.internal.shaded.guava.collect.MapMakerInternalMap$Segment.put(MapMakerInternalMap.java:1476)
	at com.linecorp.armeria.internal.shaded.guava.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:2419)
	at java.base/java.util.concurrent.ConcurrentMap.computeIfAbsent(ConcurrentMap.java:331)
	at com.linecorp.armeria.client.HttpClientFactory.pool(HttpClientFactory.java:404)
	at com.linecorp.armeria.client.HttpClientDelegate.acquireConnectionAndExecute0(HttpClientDelegate.java:169)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks! 😉

builder.allowBlockingCallsInside("org.HdrHistogram.ConcurrentHistogram", "getCountAtIndex");
builder.allowBlockingCallsInside("org.HdrHistogram.WriterReaderPhaser", "flipPhase");

// StreamMessageInputStream internally uses a blocking queue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it uses add instead offer? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Thanks!

core/src/main/java/com/linecorp/armeria/server/Server.java Outdated Show resolved Hide resolved
build.gradle Outdated Show resolved Hide resolved
Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your hard work. 🙇‍♂️ This should be very useful to diagnose event loop blocking problems.

@Override
public void applyTo(Builder builder) {
// UUID.randomUUID is called by default
builder.allowBlockingCallsInside("graphql.execution.ExecutionId", "generate");
Copy link
Contributor

@ikhoon ikhoon May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of removing this rule and using ctx.id().text() as the ExecutionId?

  • I don't see that users would want to use the UUID.randomUUID for the default execution ID.
  • It would be a better choice to correlate an HTTP request with a GraphQL query.
final ExecutionInput executionInput =
        builder.context(ctx)
               .executionId(ExecutionId.from(ctx.id().text()))
               ...

builder.context(ctx)
.graphQLContext(GraphqlServiceContexts.graphqlContext(ctx))
.dataLoaderRegistry(dataLoaderRegistry)

We may add an API that takes an ExecutionId factory from GraphqlServiceBuilder later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue for this. #4867

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Seems like a simple enough workaround to include in this PR 👍

Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! 💯 👍

@jrhee17 jrhee17 force-pushed the feature/blockhound-integration branch from add4a98 to 7b6dd68 Compare May 24, 2023 05:39
* A {@link BlockHoundIntegration} for the brave module.
*/
@UnstableApi
public final class BraveBlockHoundIntegration implements BlockHoundIntegration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case a user needs to refer to our BlockHoundIntegration implementations manually? If not, can we move all BlockHoundIntegration into the internal package, given that they are loaded automatically via SPI?

Copy link
Contributor Author

@jrhee17 jrhee17 May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought users may want to exclude some BlockHoundIntegrations (not necessarily armeria's)

e.g.

BlockHound.install(BlockHoundIntegration... integrations) - same as BlockHound.install(), but adds user-provided integrations to the list.
BlockHound.builder().install() - will create a new builder, without discovering any integrations.
You may install them manually by using BlockHound.builder().with(new MyIntegration()).install().

https://github.com/reactor/BlockHound/blob/master/docs/customization.md#customization

If there is no need, I'm fine with hiding them. Let me know what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then it's fine as it is :-)

@jrhee17 jrhee17 force-pushed the feature/blockhound-integration branch from a0deb15 to 7b6dd68 Compare May 24, 2023 08:10
Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank a lot for tackling into this, @jrhee17. 🙇🙇🙇

@trustin trustin merged commit d853651 into line:main May 25, 2023
@srikiraju
Copy link

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants