Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statsd meter recordings are dropped when submitted from parallel threads #2880

Closed
hdv opened this issue Dec 1, 2021 · 17 comments · May be fixed by #2892
Closed

Statsd meter recordings are dropped when submitted from parallel threads #2880

hdv opened this issue Dec 1, 2021 · 17 comments · May be fixed by #2892
Labels
bug A general bug registry: statsd A StatsD Registry related issue regression A regression from a previous release
Milestone

Comments

@hdv
Copy link
Contributor

hdv commented Dec 1, 2021

Describe the bug
The following test fails when added to StatsdMeterRegistryPublishTest:

    @ParameterizedTest
    @EnumSource(StatsdProtocol.class)
    void receiveParallelMetricsSuccessfully(StatsdProtocol protocol) throws InterruptedException {
        final int N = 10;

        skipUdsTestOnWindows(protocol);
        serverLatch = new CountDownLatch(N);
        server = startServer(protocol, 0);
        final int port = getPort(protocol);

        meterRegistry = new StatsdMeterRegistry(getUnbufferedConfig(protocol, port), Clock.SYSTEM);
        startRegistryAndWaitForClient();
        Counter counter = Counter.builder("my.counter").register(meterRegistry);

        IntStream.range(0, N)
                .parallel()
                .forEach(ignored -> counter.increment());

        assertThat(serverLatch.await(3, TimeUnit.SECONDS)).isTrue();
    }

It succeeds when .parallel() line is commented out.

Environment

  • Micrometer version [e.g. 1.7.1]: latest Micrometer from main branch
  • Micrometer registry [e.g. prometheus]: statsd
  • OS: [e.g. macOS] macOS
  • Java version: [e.g. output of java -version]
    ~/projects/micrometer #v1.8.0 *3 !1 ❯ java --version  base
    openjdk 11.0.2 2019-01-15
    OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

To Reproduce
How to reproduce the bug:
See the test above

Expected behavior
Test should succeed

Additional context

  • Similar test succeeds for 1.7.6 release
  • Looks like this change might be related: Upgrade to Reactor 2020.0 #2588
  • increasing N to 1000 makes the test fail even without .parallel()
@hdv hdv added the bug A general bug label Dec 1, 2021
@shakuzen shakuzen added the registry: statsd A StatsD Registry related issue label Dec 1, 2021
@shakuzen shakuzen added this to the 1.8.x milestone Dec 1, 2021
@shakuzen shakuzen added the regression A regression from a previous release label Dec 1, 2021
@hdv
Copy link
Contributor Author

hdv commented Dec 1, 2021

Looks like the following change in StatsdCounter will make make things work for N=10:

    @Override
    public void increment(double amount) {
        if (!shutdown && amount > 0) {
            count.add(amount);
            
            // sink.tryEmitNext(lineBuilder.count((long) amount));

            sink.emitNext(lineBuilder.count((long) amount), (signal, failure) -> {
                System.out.println(failure);
                return failure == Sinks.EmitResult.FAIL_NON_SERIALIZED;
            });
    }

Problem is that we try to emit events to a sink from different threads and the current behavior of Sinks.many() sinks is to fail fast if events are not serialized. Apparently it was different with Processors before.
The danger of the busy-loop approach above is that it might increase CPU load as it will produce lots of retries.

This is all based on https://stackoverflow.com/questions/65379026/reactor-handle-fail-non-serialized-error-from-a-many-sink?noredirect=1&lq=1
Also see reactor/reactor-core#2382

@hdv
Copy link
Contributor Author

hdv commented Dec 1, 2021

This discovery was a little problematic as LogbackMetricsSuppressingManySink is masking the problem :-(

    @Override
    public Sinks.EmitResult tryEmitNext(String s) {
        LogbackMetrics.ignoreMetrics(() -> delegate.tryEmitNext(s));
        // We want to silently drop the element if not emitted for any reason
        // We do not use the returned result
        return Sinks.EmitResult.OK;
    }

It claims that everything is great while it isn't. And doesn't even debug log anything.
It should probably be smarter about it.

@hdv
Copy link
Contributor Author

hdv commented Dec 1, 2021

Actually, aforementioned busy-loop solution doesn't really work if I try to set N to 1000 in a test. Too much busy work is happening.

@shakuzen
Copy link
Member

shakuzen commented Dec 2, 2021

Thank you for the detailed investigation. That is very helpful. We will discuss with the Reactor team how we should handle this.

@shakuzen shakuzen changed the title Counter samples are dropped when submitted from parallel threads StatsdCounter increments are dropped when submitted from parallel threads Dec 2, 2021
@hdv
Copy link
Contributor Author

hdv commented Dec 2, 2021

Noticed the title change.
Just to make sure we're on the same page, the problem doesn't seem to be specific to counters but rather the way samples are fed to sink. All StatsD meters are affected.

@shakuzen shakuzen changed the title StatsdCounter increments are dropped when submitted from parallel threads Statsd meter recordings are dropped when submitted from parallel threads Dec 3, 2021
@simonbasle
Copy link
Contributor

can you try this as a workaround: change the underlying Sinks.Many<T> flavor to be:

Sinks.unsafe().many().unicast().onBackpressureBuffer(
    Queues.unboundedMultiproducer().get());

If I understand correctly, the sink is already subscribed to only once so Unicast is an option.

The above code will create an "unsafe" implementation of the sink, meaning there will be no attempt at detecting concurrent usage. Instead, it will rely on the MPSC properties of the unboundedMultiproducer queue for serialization.

@jonatan-ivanov
Copy link
Member

For the sake of documentation:

jonatan-ivanov added a commit to jonatan-ivanov/micrometer that referenced this issue Dec 4, 2021
until we figure out what to do.
fixes micrometer-metricsgh-2880
jonatan-ivanov added a commit to jonatan-ivanov/micrometer that referenced this issue Dec 4, 2021
@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Dec 4, 2021

I moved the code back to the deprecated DirectProcessor. Merging this until we find out what to do can be an option: main...jonatan-ivanov:revert-to-use-processor

@shakuzen
Copy link
Member

shakuzen commented Dec 6, 2021

I've updated the 1.8.0 release notes calling this out as a known issue at the top. We'll work on getting a release out in the next day or two with the fix going back to the DirectProcessor for now. Once we get a snapshot published, could you please try that out and let us know if it looks fixed on your end @hdv?

@hdv
Copy link
Contributor Author

hdv commented Dec 6, 2021 via email

@shakuzen
Copy link
Member

shakuzen commented Dec 7, 2021

The 1.8.1-SNAPSHOT now contains the change reverting to the deprecated API. Could you please let us know if it looks alright before we do the release @hdv? Thanks for reporting the issue and all the investigation including a test case. Much appreciated and sorry for the regression.

@shakuzen shakuzen modified the milestones: 1.8.x, 1.8.1 Dec 7, 2021
@hdv
Copy link
Contributor Author

hdv commented Dec 7, 2021

@shakuzen I'll report back shortly after I try it in a live test system.

However, I'd like to point one more thing out. If we set N equal to 1000 in the test, it still fails for both parallel and sequential cases. Any idea why that could happen?

@hdv
Copy link
Contributor Author

hdv commented Dec 7, 2021

Also I noticed that TcpClient batches samples disregarding the buffering setting. Is that intentional?

@hdv
Copy link
Contributor Author

hdv commented Dec 7, 2021

@shakuzen I can see that 1.8.1-SNAPSHOT behavior is back on par with 1.7.6! Confirmed in my test environment.

Please note, the questions above are still standing.

@jonatan-ivanov
Copy link
Member

@hdv Thank you very much confirming it!
I'm going to release a new patch round today to fix the most pressing/apparent issue and we will look into the other issues around this after that.

@jonatan-ivanov
Copy link
Member

@hdv New patch releases are out, you can use a release version now: https://github.com/micrometer-metrics/micrometer/releases/tag/v1.8.1

Thank you very much for reporting and investigating this!
Next steps: fixing the issues you mentioned above and migrating back to the Sink API.

@shakuzen
Copy link
Member

shakuzen commented Dec 8, 2021

However, I'd like to point one more thing out. If we set N equal to 1000 in the test, it still fails for both parallel and sequential cases. Any idea why that could happen?

Yes, we noticed that and have some ideas on why it is happening but need to investigate further. For what it's worth, it is the same behavior on 1.7.x, so it is pre-existing behavior, which is why we're considering it separate from this regression and did not block the patch release on it.

Also I noticed that TcpClient batches samples disregarding the buffering setting. Is that intentional?

Looking at the packet splitting in the TCP case is one of the things we need to look into more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug registry: statsd A StatsD Registry related issue regression A regression from a previous release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants