
An attempt to optimize AsyncConsumerWorkService.WorkPool dispatch loop #352

Closed
wants to merge 1 commit into from

Conversation

vendre21
Contributor

@vendre21 vendre21 commented Sep 14, 2017

Further discussion on this PR:
#350 (comment)

@bording
Collaborator

bording commented Sep 14, 2017

Adding a BlockingCollection to a code path that is supposed to be async defeats the purpose of having an async code path. This PR means the thread is blocked waiting for the collection.

@michaelklishin
Member

What problem does this PR solve? I don't think the comment linked answers this question.

@vendre21
Contributor Author

@YulerB has suggested an improvement to this class. A BlockingCollection is a good alternative to the Task.Delay and while loop, which hold the current thread inefficiently. So this PR should speed up consumers.
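Roughly, the two loop shapes being compared look like this (a simplified sketch only; the names and structure do not match the actual client code):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class DispatchLoopShapes
{
    // Old shape: poll a ConcurrentQueue, backing off with Task.Delay while it is empty.
    public static async Task PollingLoop(ConcurrentQueue<Action> queue, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            if (queue.TryDequeue(out var work))
            {
                work();
            }
            else
            {
                // Yields the thread while idle; the loop resumes on a pool thread later.
                await Task.Delay(10);
            }
        }
    }

    // Proposed shape: consume a BlockingCollection; the thread is parked while the queue is idle.
    public static void BlockingLoop(BlockingCollection<Action> queue, CancellationToken token)
    {
        // GetConsumingEnumerable throws OperationCanceledException once the token is cancelled.
        foreach (var work in queue.GetConsumingEnumerable(token))
        {
            work();
        }
    }
}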

@michaelklishin
Member

michaelklishin commented Sep 14, 2017 via email

@YulerB
Contributor

YulerB commented Sep 14, 2017

The BlockingCollection isn't blocking the asynchrony. It pauses execution when the queue is empty.

The code is greatly simplified by this PR.

It uses fewer op-codes in user code.

It's more readable.

It's now something people can understand.

The existing code took a long time to mentally check.

The existing code may have unforeseen bugs relating to the try methods failing, which I for one cannot follow.

This PR reduces the cyclomatic complexity of the method.

The BlockingCollection was designed for this purpose.

As for performance, I'm sure it's going to win out.

A lot of the code is smart, but apply the KISS principle where possible.

And it's cheaper for all of us to leverage the framework rather than implement our own.

@michaelklishin
Member

michaelklishin commented Sep 15, 2017

@YulerB @vendre21 so are there any benchmarks or profiling data that your team used to come up with this and can share?

I can see the argument that this simplifies the code. A BlockingCollection in this specific case may be OK. But from my experience (and I'm sure @bording would concur) humans are terrible at reasoning about concurrent program execution and efficiency even for a small number of workloads.
Benchmarks and profiling are more reliable tools than developer gut feeling.

@YulerB
Contributor

YulerB commented Sep 15, 2017

Please see bench code attached.

I had to pull code out of the client to test.

Batch script to run to collect results:

perfbenchworkpool new single
perfbenchworkpool new multiple
perfbenchworkpool old single
perfbenchworkpool old multiple

Each is run separately to get accurate results, since there seems to be a memory leak in the old version due to excessive locking.

Results:
C:>perfbenchworkpool new single
New Single Producer - Avg:17.36ms, Max:70ms, Min:10ms, Memory:78784

C:>perfbenchworkpool new multiple
New Multiple Producers - Avg:113.08ms, Max:385ms, Min:70ms, Memory:85568

C:>perfbenchworkpool old single
Old Single Producer - Avg:44.37ms, Max:152ms, Min:3ms, Memory:260016376

C:>perfbenchworkpool old multiple
Old Multiple Producers - Avg:179.63ms, Max:1114ms, Min:37ms, Memory:1439127520

Program.cs.txt

@YulerB
Contributor

YulerB commented Sep 15, 2017

What we see is that when Stop is called, the current code doesn't stop until the queue is empty. The updated code stops as soon as the cancellation token is cancelled.
So maybe not a memory leak.
I'll leave it up to you to figure out whether this is a bug.

@YulerB
Contributor

YulerB commented Sep 15, 2017

If it's a bug, put this in the while loop:

if (tokenSource.IsCancellationRequested) {
    break;
}

Then the numbers will probably be pretty much the same.
And if they are roughly the same, I'd change the code so this question/complexity doesn't come back to haunt us all.
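For context, a minimal sketch of where that check would sit in the old-style polling loop (names are illustrative and do not match the actual client code):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class EarlyStopLoop
{
    public static async Task RunAsync(ConcurrentQueue<Action> queue, CancellationTokenSource tokenSource)
    {
        while (true)
        {
            // The suggested early exit: stop dispatching as soon as the token is cancelled,
            // even if items are still queued (matching the BlockingCollection version's behaviour).
            if (tokenSource.IsCancellationRequested)
            {
                break;
            }

            if (queue.TryDequeue(out var work))
            {
                work();
            }
            else
            {
                await Task.Delay(10); // idle back-off while the queue is empty
            }
        }
    }
}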

@YulerB
Contributor

YulerB commented Sep 15, 2017

Updated bench and results
Program.cs.txt
New Single - Avg:108ms, Max:190ms, Min:85ms, Memory:214392, Processed:1202632
Old Single Producer - Avg:82.94ms, Max:300ms, Min:58ms, Memory:225480, Processed:584946
New Multiple - Avg:553.69ms, Max:771ms, Min:405ms, Memory:212472, Processed:4292341
Old Multiple - Avg:450.7ms, Max:626ms, Min:354ms, Memory:213776, Processed:2297539

I realize the BlockingCollection is slightly slower now after fixing the bug in the old while loop, but check out how many items got processed.

@michaelklishin
Member

@YulerB we intentionally process outstanding consumer operations before stopping. Not doing so would be considered a bug by some. With automatic acknowledgements this can effectively lose/ignore some deliveries, which is not necessarily a major issue given the "fire and forget" nature of that mode, but it would still be a surprising breaking change.

With manual acknowledgements it can result in a certain number of delivered messages being requeued, which is fine.

So I'd really like to preserve the current behavior.
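In loop form, the drain-on-stop behaviour being described is roughly the following (a sketch assuming a ConcurrentQueue-based loop, not the actual client code):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class DrainOnStop
{
    // Keep dispatching until cancellation has been requested AND the queue is empty,
    // so outstanding consumer operations are still processed after Stop() is called.
    public static async Task RunAsync(ConcurrentQueue<Action> queue, CancellationToken token)
    {
        while (true)
        {
            if (queue.TryDequeue(out var work))
            {
                work(); // dispatch outstanding work even after cancellation
            }
            else if (token.IsCancellationRequested)
            {
                break; // stop only once the queue has drained
            }
            else
            {
                await Task.Delay(10); // idle back-off while waiting for new work
            }
        }
    }
}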

@michaelklishin
Member

@YulerB averages and min/max values are not ideal in benchmarks; we switched PerfTest to use median/95th/99th percentiles earlier this year, for example. I'd also consider increasing the number of iterations to, say, 10M or 100M to let the benchmark run longer.

Thank you for producing it, by the way!

@michaelklishin
Member

@bording @danielmarbach so we seem to "win some, lose some" here. WDYT of this proposal? Blocking collection use aside, I'd agree that it is a certain simplification.

Are there any NServiceBus benchmarks you can run to compare the two approaches?

@danielmarbach
Collaborator

@michaelklishin I don't understand how simplification of the code is an argument here. And unfortunately the blocking collection cannot just be left aside in the argument we are having.

When you look at concurrency and parallelism, how you use the thread pool is essential. Every thread we can free up is worth freeing up, because during that time it can work on other things, and we reduce the chance of hill-climbing in the thread pool (ramping up threads). With the while loop in combination with the delay, we might pay a context switch, but we free up the thread when the queue is empty. The blocking collection, on the other hand, internally uses wait handles that block the thread entirely, plus some non-trivial cancellation token source linking that has other memory implications. When the blocking collection is empty, the thread is blocked and cannot be used. That might not be an issue from the perspective of a single consumer thread, but it becomes one when you look at the process as a whole, which potentially contains multiple such consumers.

I cannot judge whether you as maintainers are willing to make those trade-offs for the argument of simplicity. I would not.
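For comparison, the kind of non-blocking wait being contrasted with BlockingCollection could look roughly like this (a sketch only; this pattern is not part of the PR, and the names are illustrative):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class AsyncWaitLoop
{
    // A ConcurrentQueue paired with a SemaphoreSlim: WaitAsync hands the thread back to
    // the pool while the queue is empty, whereas BlockingCollection parks the thread.
    public static async Task RunAsync(ConcurrentQueue<Action> queue, SemaphoreSlim available, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Asynchronous wait; throws OperationCanceledException if cancelled while waiting.
            await available.WaitAsync(token);
            if (queue.TryDequeue(out var work))
            {
                work();
            }
        }
    }

    // The producer enqueues an item and releases the semaphore once per item.
    public static void Enqueue(ConcurrentQueue<Action> queue, SemaphoreSlim available, Action work)
    {
        queue.Enqueue(work);
        available.Release();
    }
}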

@michaelklishin
Member

@danielmarbach thanks for your feedback. I'd like to take a closer look at where this collection is used, since most concurrent systems are snowflakes, even though they are built from common primitives.

I have a couple of things I'd like to clarify. In this PR, a blocking collection is used in the consumer work service to "batch" pending operation dispatch in a loop. Can that thread be reused for something else in practice? Good question, I assume the answer is "yes" but I don't know enough about the .NET runtime or TPL task scheduling.

There is one WorkPool instance per channel as of #307. @danielmarbach is there a scenario in which once the number of consumers (or channels) grows, we can see increased contention around the BlockingCollection operations?

Currently this PR doesn't seem to be an obvious improvement even with 5 consumers. You win some on some metrics, you lose some in terms of latency.

@YulerB can you please add a few more versions of your benchmark that have 100, 250, 500 and 1000 "consumers" (tasks) sharing a pool? Those numbers may seem unusual, but on a system with 16, 32 or more cores, having that many consumers no longer seems crazy. I expect the results can be quite different from the currently posted ones with 1 and 5 "consumers".

Optimizing for a single consumer is not something our team usually does (even though we see arguments for that from time to time), both in client libraries and in RabbitMQ itself.

@michaelklishin michaelklishin changed the title AsyncConsumerWorkService improvement An attempt to optimize AsyncConsumerWorkService.WorkPool dispatch loop Sep 15, 2017
@YulerB
Contributor

YulerB commented Sep 16, 2017

So, if the items in the queue are allowed to complete when Stop is called, then we are queueing an action on a ConcurrentQueue only to queue it again on the thread pool queue, and we only need to stop queueing new work when Stop is called.

If that's the case, we could skip the WorkPool and queue directly on the thread pool.

internal class AsyncConsumerWorkService : ConsumerWorkService {
    // Channels for which Stop(model) has been called; work for them is no longer scheduled.
    readonly SynchronizedList<IModel> workPools = new SynchronizedList<IModel>();
    bool go = true;

    public void Schedule<TWork>(ModelBase model, TWork work) where TWork : Work {
        // Queue the work item straight onto the thread pool unless the service
        // or this particular channel has been stopped.
        if (go && !workPools.Contains(model)) Task.Run(() => work.Execute(model));
    }

    public void Stop(IModel model) {
        workPools.Add(model);
    }

    public void Stop() {
        go = false;
    }
}

@michaelklishin
Member

@YulerB WorkPool's responsibility is to offer a per-channel dispatch ordering guarantee. It's a port of the same idea in the Java client. Wouldn't the snippet above create a race condition between tasks for a single channel (there's no ordering guarantee between channels, by design)?
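To illustrate the distinction (a sketch only, not the client's actual WorkPool implementation): posting every item straight to Task.Run gives no ordering, while draining a per-channel queue sequentially preserves ordering within that channel.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class OrderingSketch
{
    // Unordered: each work item is queued to the thread pool independently, so two
    // items for the same channel can run in either order (the race described above).
    public static void DispatchUnordered(Func<Task> work)
    {
        Task.Run(work);
    }

    // Ordered per channel: one loop drains one channel's queue sequentially, awaiting
    // each item before starting the next. Separate channels get separate loops, so
    // there is still no ordering guarantee between channels.
    public static async Task DrainChannelQueueAsync(ConcurrentQueue<Func<Task>> channelQueue)
    {
        while (channelQueue.TryDequeue(out var work))
        {
            await work(); // the next item starts only after the previous one completes
        }
    }
}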

@YulerB
Contributor

YulerB commented Sep 18, 2017

If it's truly async, then you cannot guarantee ordering.
If it's a background thread processing the queue sequentially, then yes, you would have ordering.
In that case, async simply means on a single thread.

I reckon not all of the operations require ordering; maybe only message delivery does, and only for some applications. If so, we should add all entries directly to the thread pool except message delivery, which would be performed sequentially on the background thread.

So, there really are 3 use cases,

  1. Sync (Original)
  2. Background thread per channel (The current Async)
  3. Async (The code above)

I've also added await/async for read operations to our version. The BinaryReader is more trouble than it's worth. BitConverter is a great class.

@danielmarbach
Collaborator

danielmarbach commented Sep 18, 2017

If it’s truly async, then you cannot guarantee ordering.

Async != concurrent

If this is the case, async simply means on a single thread.

Async != single thread

The code above would offload an inherently IO-bound problem to the worker pool, which is not desirable. When the async consumer service was introduced, it was meant as a first step, an enabler for asynchronous consumer code. We knew the current model code is still IO-bound and blocking, but the consumer service would at least allow existing asynchronous third-party code to be executed asynchronously and to be combined more naturally with the new async-enabled APIs. Happy to talk this through, but I think we need to clarify a few terms first so that we are all talking about the same thing (no offence meant).

@YulerB
Contributor

YulerB commented Sep 18, 2017

@michaelklishin, yes it would create a race condition if ordering was required.
My use case doesn't require ordering for any of my consumers.
I'm processing tons of messages now concurrently (thanks @danielmarbach).

@michaelklishin
Member

@YulerB I don't subscribe to the idea that "if it's truly async you cannot guarantee ordering". You can guarantee per-channel dispatch ordering. Concurrently running consumer operations that require synchronisation are the application developer's concern, and libraries cannot fully avoid concurrency hazards.

@michaelklishin
Member

@YulerB thanks for your time on this, but unless we have benchmarks that prove this is more efficient with a very large number of "consumers", this PR has no chance of getting in. It's not an obvious improvement according to the basic benchmarks we have, and the subtle behaviour changes that were discussed and tested in @danielmarbach's async dispatcher PR are completely overlooked here.

@michaelklishin
Member

michaelklishin commented Sep 18, 2017 via email

@vendre21
Contributor Author

Thank you very much for everyone's effort on this PR. I'm going to close it, as it has been looked over enough.

@vendre21 vendre21 closed this Sep 18, 2017
@michaelklishin
Member

michaelklishin commented Sep 18, 2017 via email
