Using ConfigureAwait() 400% slower on completed task #26610

KristianWedberg · 2018-06-25T21:54:06Z

Issue

Calling ConfigureAwait() on a task takes a significant amount of time
(over 4x as long as the alternatives), and also performs an allocation,
even if the task is already completed. For scenarios with high volume where
the task is usually completed, this necessitates verbose workarounds that check
the task before calling ConfigureAwait(), e.g.:

var task = MethodAsync();
if (!task.IsCompleted)
    await task.ConfigureAwait(false);
var useTaskResult = task.GetAwaiter().GetResult();

GetAwaiter() is documented as 'for compiler use', which is a downside.
Alternatively, we can use Task<T>.Result:

var task = MethodAsync();
if (!task.IsCompleted)
    await task.ConfigureAwait(false);
var useTaskResult = task.Result;

But this wraps exceptions in an AggregateException, so its behavior
differs from the await path, which is a further downside.
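
The difference in exception shape between the three paths can be seen with a small sketch. The class name ExceptionShapeDemo and the "boom" message are illustrative assumptions, not part of the original benchmark; the sketch only reports which exception type each access pattern surfaces on an already faulted task.

```csharp
using System;
using System.Threading.Tasks;

public static class ExceptionShapeDemo
{
    // A task that has already faulted with a known exception type (illustrative).
    private static Task<int> Faulted() =>
        Task.FromException<int>(new InvalidOperationException("boom"));

    // Task<T>.Result wraps the failure in an AggregateException.
    public static string ViaResult()
    {
        try { return Faulted().Result.ToString(); }
        catch (Exception ex) { return ex.GetType().Name; }
    }

    // GetAwaiter().GetResult() rethrows the original exception.
    public static string ViaGetAwaiter()
    {
        try { return Faulted().GetAwaiter().GetResult().ToString(); }
        catch (Exception ex) { return ex.GetType().Name; }
    }

    // await also rethrows the original exception.
    public static async Task<string> ViaAwait()
    {
        try { return (await Faulted().ConfigureAwait(false)).ToString(); }
        catch (Exception ex) { return ex.GetType().Name; }
    }

    public static void Main()
    {
        Console.WriteLine(ViaResult());                          // AggregateException
        Console.WriteLine(ViaGetAwaiter());                      // InvalidOperationException
        Console.WriteLine(ViaAwait().GetAwaiter().GetResult());  // InvalidOperationException
    }
}
```

Callers that catch a specific exception type therefore need different catch clauses depending on whether the fast path used Result or the slow path used await.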

Expected Behavior

ConfigureAwait() should immediately check if the task is completed,
and if so immediately return (and without any allocation). This path should
be at least as fast as reading Task.IsCompleted.

This would allow replacing the four lines from the workarounds with an
equally fast one-liner:

var useTaskResult = await MethodAsync().ConfigureAwait(false);

Repro

The tests measure the time it takes to get the result from a successfully
completed task, using different methods. With Task<T>.Result as the
baseline, using ConfigureAwait() takes over four times as long (Scaled 4.10
to 4.32), irrespective of its parameter value and of whether it is awaited.

BenchmarkDotNet

BenchmarkDotNet=v0.10.14, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
Frequency=3417031 Hz, Resolution=292.6517 ns, Timer=TSC
  [Host]     : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2650.0
  DefaultJob : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2650.0

(StdErr columns removed)

                         Method |      Mean |     Error | Scaled | Allocated |
------------------------------- |----------:|----------:|-------:|----------:|
 ConfigureAwaitFalse_GetAwaiter | 11.163 ns | 0.0370 ns |   4.32 |      96 B |
  ConfigureAwaitTrue_GetAwaiter | 11.112 ns | 0.0538 ns |   4.30 |      96 B |
      Await_ConfigureAwaitFalse | 10.590 ns | 0.0185 ns |   4.10 |      96 B |
       Await_ConfigureAwaitTrue | 10.597 ns | 0.0326 ns |   4.10 |      96 B |
                          Await |  2.952 ns | 0.0072 ns |   1.14 |      84 B |
                         Result |  2.584 ns | 0.0048 ns |   1.00 |      84 B |
                     GetAwaiter |  2.506 ns | 0.0035 ns |   0.97 |      84 B |

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Threading.Tasks;

namespace VSThreading301
{
    [MemoryDiagnoser]
    [BenchmarkDotNet.Attributes.DisassemblyDiagnoser(printIL: true)]
    public class BenchCompletedTask
    {
        private volatile Task<int> _completedTask = Task.FromResult(1);
        const int Operations = 100_000;

        [Benchmark(OperationsPerInvoke = Operations)]
        public async Task<int> ConfigureAwaitFalse_GetAwaiter()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                if (!_completedTask.IsCompleted)
                    await _completedTask.ConfigureAwait(false);
                x += _completedTask.ConfigureAwait(false).GetAwaiter().GetResult();
            }
            return x;
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public async Task<int> ConfigureAwaitTrue_GetAwaiter()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                if (!_completedTask.IsCompleted)
                    await _completedTask.ConfigureAwait(true);
                x += _completedTask.ConfigureAwait(true).GetAwaiter().GetResult();
            }
            return x;
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public async Task<int> Await_ConfigureAwaitFalse()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                x += await _completedTask.ConfigureAwait(false);
            }
            return x;
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public async Task<int> Await_ConfigureAwaitTrue()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                x += await _completedTask.ConfigureAwait(true);
            }
            return x;
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public async Task<int> Await()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                x += await _completedTask;
            }
            return x;
        }

        [Benchmark(OperationsPerInvoke = Operations, Baseline = true)]
        public async Task<int> Result()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                if (!_completedTask.IsCompleted)
                    await _completedTask.ConfigureAwait(false);
                // Wraps exceptions in AggregateException:
                x += _completedTask.Result;
            }
            return x;
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public async Task<int> GetAwaiter()
        {
            await _completedTask.ConfigureAwait(false);
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                if (!_completedTask.IsCompleted)
                    await _completedTask.ConfigureAwait(false);
                // Avoids wrapping exceptions in AggregateException, but is 'for compiler use':
                x += _completedTask.GetAwaiter().GetResult();
            }
            return x;
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<BenchCompletedTask>();
        }
    }
}

AArnott · 2018-06-25T22:10:05Z

"Calling ConfigureAwait() on a task ... performs an allocation,"

Why do you say that? ConfigureAwait returns a ConfiguredTaskAwaitable which is a struct. Calling GetAwaiter() on that returns a ConfiguredTaskAwaiter which is also a struct. These don't constitute allocations.

https://github.com/dotnet/corefx/blob/ee77c46eb02a4a9865f39c2c4801f99619d1a4c8/src/Common/src/CoreLib/System/Runtime/CompilerServices/TaskAwaiter.cs#L431-L456
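
The struct claim can be checked with reflection. This is a minimal sketch; the class name AwaitableKindDemo is an illustrative assumption, while the two types inspected are the ones named above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

public static class AwaitableKindDemo
{
    // ConfigureAwait() on Task<int> returns ConfiguredTaskAwaitable<int>.
    public static bool AwaitableIsStruct() =>
        typeof(ConfiguredTaskAwaitable<int>).IsValueType;

    // GetAwaiter() on that awaitable returns the nested ConfiguredTaskAwaiter.
    public static bool AwaiterIsStruct() =>
        typeof(ConfiguredTaskAwaitable<int>.ConfiguredTaskAwaiter).IsValueType;

    // On a completed task, the configured awaiter reports completion and
    // GetResult() returns the value without suspending.
    public static int ReadCompleted()
    {
        Task<int> completed = Task.FromResult(1);
        ConfiguredTaskAwaitable<int>.ConfiguredTaskAwaiter awaiter =
            completed.ConfigureAwait(false).GetAwaiter();
        return awaiter.IsCompleted ? awaiter.GetResult() : -1;
    }

    public static void Main()
    {
        Console.WriteLine(AwaitableIsStruct()); // True
        Console.WriteLine(AwaiterIsStruct());   // True
        Console.WriteLine(ReadCompleted());     // 1
    }
}
```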

KristianWedberg · 2018-06-25T22:17:48Z

I rephrased it slightly just now - according to BenchmarkDotNet, the versions with ConfigureAwait() allocate 96 bytes per iteration, vs. 84 bytes per iteration without ConfigureAwait(), see 'Allocated' column above.

It is strange that the difference is 12 bytes, which is the minimum object size on 32-bit, but I'm running 64-bit (which has a minimum object size of 24 bytes.)

stephentoub · 2018-06-25T22:48:18Z

First, this is the corefx repo, so measuring .NET 4.7.1 isn't the right thing to measure. I suspect something is awry in your measurements on .NET Framework, as there shouldn't be such a difference in the allocation size when nothing is yielding. When I run this with .NET Core 2.1, I get results like this:

                         Method |      Mean |     Error |    StdDev | Scaled | ScaledSD | Allocated |
------------------------------- |----------:|----------:|----------:|-------:|---------:|----------:|
 ConfigureAwaitFalse_GetAwaiter | 10.479 ns | 0.1539 ns | 0.1439 ns |   4.64 |     0.11 |      72 B |
  ConfigureAwaitTrue_GetAwaiter | 10.291 ns | 0.1645 ns | 0.1539 ns |   4.55 |     0.11 |      72 B |
      Await_ConfigureAwaitFalse | 10.672 ns | 0.2089 ns | 0.2051 ns |   4.72 |     0.13 |      72 B |
       Await_ConfigureAwaitTrue | 10.452 ns | 0.1707 ns | 0.1597 ns |   4.62 |     0.12 |      72 B |
                          Await |  3.139 ns | 0.0277 ns | 0.0216 ns |   1.39 |     0.03 |      72 B |
                         Result |  2.262 ns | 0.0513 ns | 0.0480 ns |   1.00 |     0.00 |      72 B |
                     GetAwaiter |  2.323 ns | 0.0275 ns | 0.0215 ns |   1.03 |     0.02 |      72 B |

so no difference in allocation, and ConfigureAwait on an already completed task is around 3.5x slower than a plain await, but we're talking about nanoseconds here, and the difference between 3ns and 10ns... it's not hard to be 3x slower when we're measuring things at the level of a few instructions.

Yes, ConfigureAwait adds a few more instructions. Calling it requires creating another struct on the stack and initializing it, and then using it requires passing the values of those fields into the same helper used without ConfigureAwait, but without ConfigureAwait, a const value is passed in rather than a field. So there's a bit more work to be done.

I'm not sure what improvements you hope to see here, but if you have concrete ideas, you're welcome to submit a PR.

"ConfigureAwait() should immediately check if the task is completed, and if so immediately return (and without any allocation). This path should be at least as fast as reading Task.IsCompleted."

I do not understand the suggestion. This is not how await works. And there are no allocations here. The allocation you're seeing is the allocation for the Task<int> returned from your benchmark method and has nothing to do with what's being done inside the method: if you were to change your test to return 0; instead of return x; from your benchmarks, there should be 0 allocations.
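
One way to observe that the allocation belongs to the returned Task<int> rather than to the awaits is to compare task instances across calls. This is a sketch under an assumption: it relies on the runtime reusing a cached Task<int> for small integer results of a synchronously completing async method, which is an implementation detail rather than a documented guarantee. The class and method names are illustrative.

```csharp
using System;
using System.Threading.Tasks;

public static class ReturnedTaskDemo
{
    private static readonly Task<int> Completed = Task.FromResult(1);

    // Completes synchronously and returns a small constant.
    public static async Task<int> ReturnsZero()
    {
        await Completed.ConfigureAwait(false);
        return 0;
    }

    // Completes synchronously and returns a value like the benchmark's sum.
    public static async Task<int> ReturnsLarge()
    {
        await Completed.ConfigureAwait(false);
        return 100_000;
    }

    // True when two calls hand back the very same Task<int> object,
    // i.e. no new task was allocated for the second call.
    public static bool SameInstance(Func<Task<int>> method) =>
        ReferenceEquals(method(), method());

    public static void Main()
    {
        // Expected True on runtimes that cache small int results (assumption).
        Console.WriteLine(SameInstance(ReturnsZero));
        // A fresh Task<int> is created per call for this result.
        Console.WriteLine(SameInstance(ReturnsLarge)); // False
    }
}
```

Both methods perform the same await, so any difference between them comes from how the result task is produced, not from ConfigureAwait().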

benaadams · 2018-06-25T23:13:42Z

As an aside, you'd only want to use ConfigureAwait(false) for the actual await, so you can drop it from the .GetAwaiter().GetResult() part rather than creating the ConfiguredTaskAwaitable wrapper

e.g. change

if (!_completedTask.IsCompleted)
    await _completedTask.ConfigureAwait(false);
x += _completedTask.ConfigureAwait(false).GetAwaiter().GetResult();

to

if (!_completedTask.IsCompleted)
    await _completedTask.ConfigureAwait(false);
x += _completedTask.GetAwaiter().GetResult();

Which changes the results to

                         Method |      Mean |     Error | Scaled | Allocated |
------------------------------- |----------:|----------:|-------:|----------:|
 ConfigureAwaitFalse_GetAwaiter |  2.662 ns | 0.0094 ns |   0.99 |      72 B |
  ConfigureAwaitTrue_GetAwaiter |  2.644 ns | 0.0149 ns |   0.99 |      72 B |
      Await_ConfigureAwaitFalse | 11.123 ns | 0.0197 ns |   4.14 |      72 B |
       Await_ConfigureAwaitTrue | 11.138 ns | 0.0306 ns |   4.15 |      72 B |
                          Await |  3.134 ns | 0.0118 ns |   1.17 |      72 B |
                         Result |  2.685 ns | 0.0506 ns |   1.00 |      72 B |
                     GetAwaiter |  2.744 ns | 0.0248 ns |   1.02 |      72 B |

"The allocation you're seeing is the allocation for the Task<int> returned from your benchmark method and has nothing to do with what's being done inside the method: if you were to change your test to return 0; instead of return x; from your benchmarks, there should be 0 allocations."

Also, if your methods generally return synchronously (the completed path) rather than going async, changing the return type to ValueTask<int> rather than Task<int> should drop the allocations.

ValueTask version

                         Method |      Mean |     Error | Scaled | Allocated |
------------------------------- |----------:|----------:|-------:|----------:|
 ConfigureAwaitFalse_GetAwaiter |  2.649 ns | 0.0092 ns |   1.00 |       0 B |
  ConfigureAwaitTrue_GetAwaiter |  2.644 ns | 0.0114 ns |   1.00 |       0 B |
      Await_ConfigureAwaitFalse | 11.695 ns | 0.0138 ns |   4.43 |       0 B |
       Await_ConfigureAwaitTrue | 11.139 ns | 0.0451 ns |   4.22 |       0 B |
                          Await |  3.099 ns | 0.0076 ns |   1.17 |       0 B |
                         Result |  2.642 ns | 0.0049 ns |   1.00 |       0 B |
                     GetAwaiter |  2.639 ns | 0.0022 ns |   1.00 |       0 B |
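
A minimal sketch of the ValueTask<int> shape, for illustration only: the names ValueTaskDemo, GetValueAsync, SumAsync and CachedValue are assumptions and do not come from the benchmark above. On .NET Framework, ValueTask<T> comes from the System.Threading.Tasks.Extensions package; on .NET Core 2.0 and later it is built in.

```csharp
using System;
using System.Threading.Tasks;

public static class ValueTaskDemo
{
    private const int CachedValue = 42; // hypothetical precomputed result

    // Hypothetical method whose result is available synchronously:
    // the value is wrapped in a struct, so no Task<int> is created.
    public static ValueTask<int> GetValueAsync() => new ValueTask<int>(CachedValue);

    // Same loop shape as the benchmarks, but returning ValueTask<int>.
    public static async ValueTask<int> SumAsync(int operations)
    {
        int x = 0;
        for (int i = 0; i < operations; i++)
        {
            x += await GetValueAsync().ConfigureAwait(false);
        }
        return x;
    }

    public static void Main()
    {
        ValueTask<int> sum = SumAsync(3);
        Console.WriteLine(sum.IsCompletedSuccessfully); // True
        Console.WriteLine(sum.Result);                  // 126
    }
}
```

Because every await here completes synchronously, the async method never suspends and the ValueTask<int> it returns simply carries the result.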

KristianWedberg · 2018-06-25T23:28:11Z

@benaadams Yes, I included ConfigureAwait().GetAwaiter().GetResult() just to show that it was adding ConfigureAwait() that increased the run-time.

In these performance sensitive areas I'm using cached tasks where possible, otherwise ValueTask<T>.

KristianWedberg · 2018-06-26T13:54:11Z

@stephentoub OK, that makes sense regarding allocations, thanks.

As @benaadams intimated, this ConfigureAwait() performance difference seems to have the same cause as https://github.com/dotnet/coreclr/issues/18542.

Cutting away the async parts, I get a 10x speed difference between retrieving a flat struct and retrieving a nested struct, both with the same two fields (and irrespective of whether the fields are primitives or references). This explains the ConfigureAwait() performance, so hopefully the investigation in the other issue finds ways to tune nested structs.

BenchmarkDotNet=v0.10.14, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
Frequency=3417031 Hz, Resolution=292.6517 ns, Timer=TSC
  [Host]     : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2650.0
  DefaultJob : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2650.0

              Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
-------------------- |----------:|----------:|----------:|-------:|---------:|
   New_FlatReference | 0.2727 ns | 0.0007 ns | 0.0007 ns |   0.50 |     0.00 |
   Get_FlatReference | 0.5463 ns | 0.0010 ns | 0.0010 ns |   1.00 |     0.00 |
 New_NestedPrimitive | 0.8163 ns | 0.0013 ns | 0.0011 ns |   1.49 |     0.00 |
 Get_NestedPrimitive | 5.7104 ns | 0.0118 ns | 0.0105 ns |  10.45 |     0.03 |
 New_NestedReference | 1.3612 ns | 0.0015 ns | 0.0013 ns |   2.49 |     0.00 |
 Get_NestedReference | 5.2022 ns | 0.0231 ns | 0.0216 ns |   9.52 |     0.04 |
 Call_ConfigureAwait | 5.1707 ns | 0.0148 ns | 0.0138 ns |   9.46 |     0.03 |

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Jobs;
using BenchmarkDotNet.Running;
using System;
using System.Threading.Tasks;

namespace ConfigureAwait_VSThreading301
{
    //[ShortRunJob]
    [BenchmarkDotNet.Attributes.DisassemblyDiagnoser]
    public class BenchNestedStruct
    {
        private volatile Task<int> _completedTask = Task.FromResult(1);
        const int Operations = 100_000;


        public struct FlatReference
        {
            private Task _task;
            private bool _bool;

            public FlatReference(Task task, bool b)
            {
                _task = task;
                _bool = b;
            }
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public int New_FlatReference()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                new FlatReference(_completedTask, false);
            }
            return x;
        }

        public FlatReference GetFlatReference(bool b)
        {
            return new FlatReference(_completedTask, false);
        }

        [Benchmark(OperationsPerInvoke = Operations, Baseline = true)]
        public int Get_FlatReference()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                GetFlatReference(false);
            }
            return x;
        }

        // **************************************

        private long _long = 1;

        public struct FlatPrimitive
        {
            private long _long;
            private bool _bool;

            public FlatPrimitive(long l, bool b)
            {
                _long = l;
                _bool = b;
            }
        }

        public struct NestedPrimitive
        {
            private FlatPrimitive _FlatPrimitive;

            public NestedPrimitive(long l, bool b)
            {
                _FlatPrimitive = new FlatPrimitive(l, b);
            }
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public int New_NestedPrimitive()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                new NestedPrimitive(_long, false);
            }
            return x;
        }

        public NestedPrimitive GetNestedPrimitive(bool b)
        {
            return new NestedPrimitive(_long, false);
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public int Get_NestedPrimitive()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                GetNestedPrimitive(false);
            }
            return x;
        }

        // **************************************

        public struct NestedReference
        {
            private FlatReference _flatReference;

            public NestedReference(Task task, bool b)
            {
                _flatReference = new FlatReference(task, b);
            }
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public int New_NestedReference()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                new NestedReference(_completedTask, false);
            }
            return x;
        }

        public NestedReference GetNestedReference(bool b)
        {
            return new NestedReference(_completedTask, false);
        }

        [Benchmark(OperationsPerInvoke = Operations)]
        public int Get_NestedReference()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                GetNestedReference(false);
            }
            return x;
        }

        // **************************************

        [Benchmark(OperationsPerInvoke = Operations)]
        public int Call_ConfigureAwait()
        {
            int x = 0;
            for (int i = 0; i < Operations; i++)
            {
                _completedTask.ConfigureAwait(false);
            }
            return x;
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<BenchNestedStruct>();
        }
    }
}

stephentoub · 2018-06-26T14:23:57Z

Right. Thanks. So, I'm going to close this issue. To my knowledge there's nothing that's specific to ConfigureAwait here. If I'm wrong, please feel free to re-open and clarify.

stephentoub closed this as completed Jun 26, 2018

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 3.0 milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
