Potential improvement for ActivitySource name matching #708

reyang · 2020-05-30T16:51:59Z

The following code holds a hash set of ActivitySource names - converted to uppercase invariant:

opentelemetry-dotnet/src/OpenTelemetry/Trace/Configuration/OpenTelemetryBuilder.cs, line 73 in 577ae6c:

```csharp
public OpenTelemetryBuilder AddActivitySource(string activitySourceName)
```

The following code converts the incoming ActivitySource name to uppercase invariant and then does a hash lookup. This involves a string allocation and conversion. The time complexity is X * O(n), where n is the length of the input name and X is the cost of the hash lookup, which varies from 1 to m (where m is the number of activity sources) depending on hash collisions.

opentelemetry-dotnet/src/OpenTelemetry/Trace/Configuration/OpenTelemetrySdk.cs, line 69 in ace469d:

```csharp
ShouldListenTo = (activitySource) => openTelemetryBuilder.ActivitySourceNames.Contains(activitySource.Name.ToUpperInvariant()),
```

The potential improvement is to change the API from AddActivitySource to SetActivitySources and build a single pattern: pattern = new regex("|".join(map(sources, s => regex.escape(s))), options=COMPILED | CASE_INSENSITIVE).
The idea is that this yields a compiled DFA with the best performance, O(n) where n is the length of the input string, while avoiding the uppercase conversion and string allocation on every match.
In the future, this would also open the door to supporting patterns such as wildcards.
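
The pseudocode above could be sketched in C# roughly as follows. This is only an illustration: `ActivitySourceNameMatcher` and `Build` are hypothetical names, not part of OpenTelemetry, and the `\A`/`\z` anchors are an added assumption so that only whole names match. Note that `RegexOptions.Compiled` compiles the pattern to IL; it does not by itself guarantee DFA-style matching.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ActivitySourceNameMatcher
{
    // Builds one case-insensitive regex that matches any of the given names as a whole string.
    public static Regex Build(IEnumerable<string> sourceNames)
    {
        // Escape each name so characters such as '.' are matched literally.
        IEnumerable<string> alternatives = sourceNames.Select(name => Regex.Escape(name));

        // \A and \z anchor the match to the full input (an assumption of this sketch).
        string pattern = @"\A(?:" + string.Join("|", alternatives) + @")\z";

        return new Regex(
            pattern,
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}

public static class MatcherDemo
{
    public static void Main()
    {
        Regex matcher = ActivitySourceNameMatcher.Build(
            new[] { "Microsoft.Azure.Source1", "Microsoft.Azure.Source2" });

        // Case differs from the registered name, but no uppercase copy is allocated.
        Console.WriteLine(matcher.IsMatch("microsoft.azure.SOURCE1"));

        // Rejected because of the anchors; an unanchored pattern would match the prefix.
        Console.WriteLine(matcher.IsMatch("Microsoft.Azure.Source10"));
    }
}
```

Without the anchors, `IsMatch` would accept any input that merely contains a registered name, which is probably not the intended behavior for an allow-list.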


reyang · 2020-05-30T16:52:41Z

@tarekgh @noahfalk any further improvement ideas?

noahfalk · 2020-05-31T08:45:03Z

The runtime implementation for invoking the ShouldListenTo(ActivitySource) callback should only occur once for each (ActivityListener, ActivitySource) pair in the process and then the result is cached. This means any performance changes in this suggestion probably show up as a small difference in process startup time. Were you concerned about process startup time or perhaps there was a misunderstanding that the performance of this callback would influence steady-state throughput?

A small optimization that might be useful regardless is creating the HashSet using the constructor that takes a custom IEqualityComparer<T> and passing StringComparer.OrdinalIgnoreCase. This lets you avoid allocating new uppercase strings. I ran the benchmark below to show how the different options compare. It's possible I made mistakes, or that alternate variations would show more nuanced differences. I was surprised that the Regex lookup was as slow as it was, but I am not surprised overall that the case-insensitive hash lookup won the matchup.

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| CreateRegex | 27,194.91 ns | 285.468 ns | 267.027 ns | 6.0120 | 0.7629 | - | 37768 B |
| UpperCaseLoopkup | 76.41 ns | 0.807 ns | 0.755 ns | 0.0191 | - | - | 120 B |
| IgnoreCaseLookup | 44.47 ns | 0.267 ns | 0.250 ns | - | - | - | - |
| RegexLookup | 509.69 ns | 4.210 ns | 3.938 ns | - | - | - | - |

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace ConsoleApp24
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<ActivitySourceNameBenchmark>();
        }
    }

    [MemoryDiagnoser]
    public class ActivitySourceNameBenchmark
    {
        private Random r = new Random();
        // Default comparer: names are stored uppercased and the input is uppercased per lookup.
        HashSet<string> hs = new HashSet<string>();
        // Case-insensitive comparer: names are stored as-is, no per-lookup allocation.
        HashSet<string> hsIgnoreCase = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        string componentName = "SomeCompany.SomeComponent.MakeTheNameABitLonger";
        Regex regex;

        public ActivitySourceNameBenchmark()
        {
            hs.Add(componentName.ToUpperInvariant());
            hsIgnoreCase.Add(componentName);
            regex = new Regex(Regex.Escape(componentName), RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }

        // One-time cost of constructing the compiled regex.
        [Benchmark]
        public Regex CreateRegex() => new Regex(Regex.Escape(componentName), RegexOptions.Compiled | RegexOptions.IgnoreCase);

        [Benchmark]
        public bool UpperCaseLoopkup() => hs.Contains(componentName.ToUpperInvariant());

        [Benchmark]
        public bool IgnoreCaseLookup() => hsIgnoreCase.Contains(componentName);

        [Benchmark]
        public bool RegexLookup() => regex.IsMatch(componentName);

    }
}
```

reyang · 2020-05-31T17:10:20Z

+1 on considering the startup overhead of Regex. This might not be a problem for services, but I've definitely seen it matter a lot in device/app scenarios.

Looks like we're looking at this from different angles:

- I assume that in the majority of cases we will have a mismatch rather than a match. I was assuming the application has many different sources (could be hundreds or thousands, if source names are hierarchical) and we only consume from a small number of them.
- I assume that in the normal scenario we will subscribe to a list of sources instead of only one. Need @cijothomas to chime in on how many sources we subscribe to in a typical scenario. Probably won't be a big number?

@noahfalk looks like you have a more powerful machine than me. In C Runtime we used to give developers the low-end machines so they write fast code 🤣.

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|
| CreateRegex | 25,521.97 ns | 292.759 ns | 244.467 ns | 12.6038 | - | - | 26408 B |
| UpperCaseLoopkup | 80.04 ns | 1.025 ns | 0.909 ns | 0.0573 | - | - | 120 B |
| IgnoreCaseLookup | 50.40 ns | 0.985 ns | 0.967 ns | - | - | - | - |
| RegexLookup | 40.85 ns | 0.686 ns | 0.608 ns | - | - | - | - |

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApp24
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<ActivitySourceNameBenchmark>();
        }
    }

    [MemoryDiagnoser]
    public class ActivitySourceNameBenchmark
    {
        private Random r = new Random();
        HashSet<string> hs = new HashSet<string>();
        // UseRandomizedStringHashAlgorithm = 1
        HashSet<string> hsIgnoreCase = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        // Not one of the registered names below, so every lookup measures the mismatch path.
        private string componentName = "SomeCompany.SomeComponent.MakeTheNameABitLonger";
        Regex regex;

        public ActivitySourceNameBenchmark()
        {
            var componentNames = new List<string> {
                // if all the sources are controlled by the developer, there shouldn't be a concern of hash collision attack
                // collision should be rare, maybe we can do a custom perfect hash?
                "Microsoft.Azure.Source1",
                "Microsoft.Azure.Source2",
                "Microsoft.Azure.Source3",
                "Microsoft.Azure.Source4",
                "Microsoft.Azure.Source5",
                "Microsoft.Azure.Source6",
                "Microsoft.Azure.Source7",
                "Microsoft.Azure.Source8",
                "Microsoft.Azure.Source9",
            };
            var patterns = new List<string>();
            foreach (var name in componentNames)
            {
                hs.Add(name.ToUpperInvariant());
                hsIgnoreCase.Add(name);
                patterns.Add(Regex.Escape(name));
            }
            // One alternation over all escaped names: Name1|Name2|...
            var pattern = String.Join('|', patterns);
            regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }

        [Benchmark]
        public Regex CreateRegex() => new Regex(Regex.Escape(componentName), RegexOptions.Compiled | RegexOptions.IgnoreCase);

        [Benchmark]
        public bool UpperCaseLoopkup() => hs.Contains(componentName.ToUpperInvariant());

        [Benchmark]
        public bool IgnoreCaseLookup() => hsIgnoreCase.Contains(componentName);

        [Benchmark]
        public bool RegexLookup() => regex.IsMatch(componentName);

    }
}
```

reyang · 2020-05-31T17:22:02Z

> The runtime implementation for invoking the ShouldListenTo(ActivitySource) callback should only occur once for each (ActivityListener, ActivitySource) pair in the process and then the result is cached. This means any performance changes in this suggestion probably show up as a small difference in process startup time. Were you concerned about process startup time or perhaps there was a misunderstanding that the performance of this callback would influence steady-state throughput?

This is good to know; I wasn't aware of this caching. It has a small implication: the listener has to give a consistent result for a given name/version (so if we want to change the listener behavior, we should create a new one and discard the old one rather than modify it in place). That seems to be a very good choice considering the outcome vs. the implication.

In that case, it looks like HashSet<string>(StringComparer.OrdinalIgnoreCase) is the best option here.
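
A minimal sketch of that option, assuming a hypothetical `SourceNameRegistry` class standing in for the builder's name collection (it is not the actual OpenTelemetryBuilder code): names are stored as given, and the comparer handles casing, so no uppercase copy is created on add or on lookup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the builder's set of ActivitySource names.
public sealed class SourceNameRegistry
{
    // The comparer makes both hashing and equality case-insensitive (ordinal).
    private readonly HashSet<string> names =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public SourceNameRegistry Add(string activitySourceName)
    {
        if (string.IsNullOrWhiteSpace(activitySourceName))
        {
            throw new ArgumentException("Name must not be null or whitespace.", nameof(activitySourceName));
        }

        names.Add(activitySourceName); // stored as-is, no ToUpperInvariant()
        return this;
    }

    // A ShouldListenTo callback could then be: source => registry.Contains(source.Name)
    public bool Contains(string activitySourceName) => names.Contains(activitySourceName);
}

public static class RegistryDemo
{
    public static void Main()
    {
        var registry = new SourceNameRegistry()
            .Add("Microsoft.Azure.Source1")
            .Add("Microsoft.Azure.Source2");

        Console.WriteLine(registry.Contains("MICROSOFT.AZURE.SOURCE1"));
        Console.WriteLine(registry.Contains("Some.Other.Source"));
    }
}
```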

tarekgh · 2020-05-31T22:37:09Z

Yes, using StringComparer.OrdinalIgnoreCase is the preferred method here, as it shouldn't allocate extra objects and it should perform fast comparisons too.

The other question I have is: why use HashSet rather than something like Dictionary? I am asking because HashSet on the full framework has some perf issues which we fixed on .NET Core only.

cijothomas · 2020-06-01T19:16:21Z

We haven't finalized what the default listening model would be. It could be "listen to all except those explicitly turned off", "listen only to sources explicitly enabled", or something else.
This is tracked as 5a under #684.

For now, we can change the comparison to ignore case to avoid the string allocation.
@tarekgh There was no need for a dictionary here, as all we need is a set of names, i.e. just keys, not key/value pairs. I can change to Dictionary<string,bool> if it performs better. (I can try the benchmark and see this for myself as well :) )

tarekgh · 2020-06-01T19:19:37Z

> There was no need for dictionary here as all we need a set of names. i.e just keys, not key,value pairs. I can change to Dictionary<string,bool> if it performs better. (I can try the benchmark and see this for myself as well :) )

I would suggest measuring the perf of HashSet and Dictionary on the full framework, and then we can decide which to use. Just make sure you use OrdinalIgnoreCase with both when you measure. Let me know if you want any help from me on that.
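
One way such a measurement might be set up with BenchmarkDotNet is sketched below. The class name, the source names, and the choice of runtimes are assumptions for illustration, and the project would need to multi-target (for example net48 and netcoreapp3.1) for both jobs to run. No results are implied here.

```csharp
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net48)]        // full framework
[SimpleJob(RuntimeMoniker.NetCoreApp31)] // .NET Core, for comparison
public class SetVersusDictionaryBenchmark
{
    // Both collections use the same case-insensitive comparer, as suggested above.
    private readonly HashSet<string> set =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<string, bool> dictionary =
        new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

    private const string Hit = "microsoft.azure.SOURCE5";
    private const string Miss = "SomeCompany.SomeComponent.MakeTheNameABitLonger";

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 1; i <= 9; i++)
        {
            string name = "Microsoft.Azure.Source" + i;
            set.Add(name);
            dictionary[name] = true;
        }
    }

    [Benchmark(Baseline = true)]
    public bool HashSetHit() => set.Contains(Hit);

    [Benchmark]
    public bool HashSetMiss() => set.Contains(Miss);

    [Benchmark]
    public bool DictionaryHit() => dictionary.ContainsKey(Hit);

    [Benchmark]
    public bool DictionaryMiss() => dictionary.ContainsKey(Miss);
}

public static class SetVersusDictionaryProgram
{
    public static void Main() => BenchmarkRunner.Run<SetVersusDictionaryBenchmark>();
}
```

Measuring hits and misses separately matters here because, as noted earlier in the thread, most sources in a process may not be on the subscription list.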

cijothomas · 2020-06-13T06:12:41Z

The ShouldListenTo callback is not on the hot path, as it's only called when an ActivitySource is created.
(https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/ActivitySource.cs#L37)

GetRequestedDataUsingContext or GetRequestedDataUsingParentId are the ones invoked with every StartActivity.
https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/ActivitySource.cs#L118

As ShouldListenTo is only invoked once, and not with every Activity start, I think it's fine to leave it as is.

(We can still fix it if needed, but it's not a priority.)
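
The difference between the two callbacks can be observed with a small counting sketch. It assumes the released System.Diagnostics.DiagnosticSource 5.0+ API, where the per-activity callback is, as far as I can tell, named `Sample` rather than GetRequestedDataUsingContext; `ListenerCallbackCounter` and "Demo.Source" are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class ListenerCallbackCounter
{
    // Starts activityCount activities and reports how often each callback ran.
    public static (int ShouldListenToCalls, int SampleCalls) Count(int activityCount)
    {
        int shouldListenToCalls = 0;
        int sampleCalls = 0;

        using var listener = new ActivityListener
        {
            // Consulted when a source and a listener are first paired up.
            ShouldListenTo = source =>
            {
                shouldListenToCalls++;
                return source.Name == "Demo.Source";
            },
            // Consulted for each activity that is about to be created.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
            {
                sampleCalls++;
                return ActivitySamplingResult.AllData;
            },
        };
        ActivitySource.AddActivityListener(listener);

        using var source = new ActivitySource("Demo.Source");
        for (int i = 0; i < activityCount; i++)
        {
            // Disposing the activity stops it before the next one starts.
            using Activity activity = source.StartActivity("work");
        }

        return (shouldListenToCalls, sampleCalls);
    }

    public static void Main()
    {
        var (listenCalls, sampleCalls) = Count(1000);
        Console.WriteLine($"ShouldListenTo calls: {listenCalls}, Sample calls: {sampleCalls}");
    }
}
```

If the caching works as described in this thread, the first counter should stay small (one call per source visible to the listener) while the second grows with the number of activities started.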

reyang added the enhancement label May 30, 2020

reyang assigned cijothomas May 30, 2020

cijothomas closed this as completed Jul 1, 2020

reyang mentioned this issue Oct 13, 2020: Updating Status based on the new spec #1313 (merged)

reyang mentioned this issue Sep 17, 2021: AggregatorStore to use concurentdictionary #2339 (merged)

reyang mentioned this issue Oct 12, 2021: Benchmarks for HashSet, Dictionary comparison in .NET core 3.1, .NET 4.8, and .NET Core 5.0 #2473 (closed)
