Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential improvement for ActivitySource name matching #708

Closed
reyang opened this issue May 30, 2020 · 8 comments
Closed

Potential improvement for ActivitySource name matching #708

reyang opened this issue May 30, 2020 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@reyang
Copy link
Member

reyang commented May 30, 2020

The following code holds a hash set of ActivitySource names - converted to uppercase invariant:

public OpenTelemetryBuilder AddActivitySource(string activitySourceName)

The following code converts the incoming ActivitySource name to uppercase invariant, and do a hash lookup. This involves a string allocation/conversion, the time complexity is X * O(n), where n is the length of the input name, X is the hash lookup, which varies from 1 to m (where m is the number of activity sources) depending on the hash collision status.

ShouldListenTo = (activitySource) => openTelemetryBuilder.ActivitySourceNames.Contains(activitySource.Name.ToUpperInvariant()),

The potential improvement is to change the API from AddActivitySource to SetActivitySources, the pattern = new regex("|".join(map(sources, s => regex.escape(s)), options=COMPILED | CASE_INSENSITIVE).
This would result in a compiled DFA which has the best performance - O(n) (where n is the length of the input string), and yet avoid the string uppercase convert / allocation during each matching.
In the future, this also opens a potential to support pattern such like wildcards.

@reyang reyang added the enhancement New feature or request label May 30, 2020
@reyang
Copy link
Member Author

reyang commented May 30, 2020

@tarekgh @noahfalk any further improvement ideas?

@noahfalk
Copy link

The runtime implementation for invoking the ShouldListenTo(ActivitySource) callback should only occur once for each (ActivityListener, ActivitySource) pair in the process and then the result is cached. This means any performance changes in this suggestion probably show up as a small difference in process startup time. Were you concerned about process startup time or perhaps there was a misunderstanding that the performance of this callback would influence steady-state throughput?

A small optimization that might be useful regardless is creating the HashSet using the constructor that takes a custom IEqualityComparer<T> and using StringComparer.OrdinalIgnoreCase. This will let you avoid allocating new upper case strings. I did the benchmark below to show how different options compare. Its possible I made mistakes or alternate variations will let you see more nuanced differences. I was surprised that Regex lookup was as slow as it was, but I am not surprised overall that the case insensitive hash lookup won the match up.

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
CreateRegex 27,194.91 ns 285.468 ns 267.027 ns 6.0120 0.7629 - 37768 B
UpperCaseLoopkup 76.41 ns 0.807 ns 0.755 ns 0.0191 - - 120 B
IgnoreCaseLookup 44.47 ns 0.267 ns 0.250 ns - - - -
RegexLookup 509.69 ns 4.210 ns 3.938 ns - - - -
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace ConsoleApp24
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<ActivitySourceNameBenchmark>();
        }
    }

    [MemoryDiagnoser]
    public class ActivitySourceNameBenchmark
    {
        private Random r = new Random();
        HashSet<string> hs = new HashSet<string>();
        HashSet<string> hsIgnoreCase = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        string componentName = "SomeCompany.SomeComponent.MakeTheNameABitLonger";
        Regex regex;

        public ActivitySourceNameBenchmark()
        {
            hs.Add(componentName.ToUpperInvariant());
            hsIgnoreCase.Add(componentName);
            regex = new Regex(Regex.Escape(componentName), RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }

        [Benchmark]
        public Regex CreateRegex() => new Regex(Regex.Escape(componentName), RegexOptions.Compiled | RegexOptions.IgnoreCase);

        [Benchmark]
        public bool UpperCaseLoopkup() => hs.Contains(componentName.ToUpperInvariant());

        [Benchmark]
        public bool IgnoreCaseLookup() => hsIgnoreCase.Contains(componentName);

        [Benchmark]
        public bool RegexLookup() => regex.IsMatch(componentName);

    }
}

@reyang
Copy link
Member Author

reyang commented May 31, 2020

+1 on the consideration of startup overhead of RegEx. This might not be a problem for services but definitely I've seen a lot in device/app scenario.

Looks like we're looking from different angles:

  1. I assume that majority of the case we will has mismatch rather than match - I was assuming the application to have many different sources (e.g. could be hundreds or thousands, if source names are hierarchical) and we only consume from a small number of them.
  2. I assume that in the normal scenario we will subscribe to a list of sources instead of only one. Need @cijothomas to chime in on how many sources do we subscribe in typical scenario. Probably won't be a big number?

@noahfalk looks like you have a more powerful machine than me, in C Runtime we used to give developers the low end machines so they write fast code 🤣.

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
CreateRegex 25,521.97 ns 292.759 ns 244.467 ns 12.6038 - - 26408 B
UpperCaseLoopkup 80.04 ns 1.025 ns 0.909 ns 0.0573 - - 120 B
IgnoreCaseLookup 50.40 ns 0.985 ns 0.967 ns - - - -
RegexLookup 40.85 ns 0.686 ns 0.608 ns - - - -
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApp24
{
    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<ActivitySourceNameBenchmark>();
        }
    }

    [MemoryDiagnoser]
    public class ActivitySourceNameBenchmark
    {
        private Random r = new Random();
        HashSet<string> hs = new HashSet<string>();
        // UseRandomizedStringHashAlgorithm = 1
        HashSet<string> hsIgnoreCase = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        private string componentName = "SomeCompany.SomeComponent.MakeTheNameABitLonger";
        Regex regex;

        public ActivitySourceNameBenchmark()
        {
            var componentNames = new List<string> {
                // if all the sources are controlled by the developer, there shouldn't be a concern of hash collision attack
                // collision should be rare, maybe we can do a custom perfect hash?
                "Microsoft.Azure.Source1",
                "Microsoft.Azure.Source2",
                "Microsoft.Azure.Source3",
                "Microsoft.Azure.Source4",
                "Microsoft.Azure.Source5",
                "Microsoft.Azure.Source6",
                "Microsoft.Azure.Source7",
                "Microsoft.Azure.Source8",
                "Microsoft.Azure.Source9",
            };
            var patterns = new List<string>();
            foreach (var name in componentNames)
            {
                hs.Add(name.ToUpperInvariant());
                hsIgnoreCase.Add(name);
                patterns.Add(Regex.Escape(name));
            }
            var pattern = String.Join('|', patterns);
            regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }

        [Benchmark]
        public Regex CreateRegex() => new Regex(Regex.Escape(componentName), RegexOptions.Compiled | RegexOptions.IgnoreCase);

        [Benchmark]
        public bool UpperCaseLoopkup() => hs.Contains(componentName.ToUpperInvariant());

        [Benchmark]
        public bool IgnoreCaseLookup() => hsIgnoreCase.Contains(componentName);

        [Benchmark]
        public bool RegexLookup() => regex.IsMatch(componentName);

    }
}

@reyang
Copy link
Member Author

reyang commented May 31, 2020

The runtime implementation for invoking the ShouldListenTo(ActivitySource) callback should only occur once for each (ActivityListener, ActivitySource) pair in the process and then the result is cached. This means any performance changes in this suggestion probably show up as a small difference in process startup time. Were you concerned about process startup time or perhaps there was a misunderstanding that the performance of this callback would influence steady-state throughput?

This is good to know, I wasn't aware of this caching. This has a small implication that the listener has to give consistent result for the given name/ver (so if we want to change the listener behavior, we should create a new one and discard the old one rather than modify it in place). Seems to be a very good choice considering the outcome vs. implication.

In this case, looks like HashSet<string>(StringComparer.OrdinalIgnoreCase) is the best option here.

@tarekgh
Copy link
Contributor

tarekgh commented May 31, 2020

Yes using StringComparer.OrdinalIgnoreCase is the preferred method here as it shouldn't allocate extra objects and it should perform fast comparisons too.

The other question I have is, why using HashSet and not something like Dictionary? I am asking because HashSet on the full framework has some perf issues which we fixed on the net core only.

@cijothomas
Copy link
Member

We havent finalized on what would be the default listening model - it could or "listen to all except those explicitly turned off", or "listen to only sources explicitily enabled" or "something else".
This is tracked as 5a under #684

For now, we can change the comparison to ignorecase to avoid the string allocation.
@tarekgh There was no need for dictionary here as all we need a set of names. i.e just keys, not key,value pairs. I can change to Dictionary<string,bool> if it performs better. (I can try the benchmark and see this for myself as well :) )

@tarekgh
Copy link
Contributor

tarekgh commented Jun 1, 2020

There was no need for dictionary here as all we need a set of names. i.e just keys, not key,value pairs. I can change to Dictionary<string,bool> if it performs better. (I can try the benchmark and see this for myself as well :) )

I would say try to measure the perf when using HashSet and Dictionary on the full framework. and we can decide which to use. just make sure you use OrdinalIgnoreCase with both when you measure it. let me know if you want any help from me on that.

@cijothomas
Copy link
Member

cijothomas commented Jun 13, 2020

The ShouldListenTo callback is not in the hotpath as its only called when ActivitySource is created.
(https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/ActivitySource.cs#L37)

GetRequestedDataUsingContext or GetRequestedDataUsingParentId are the ones invoked with every StartActivity.
https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/ActivitySource.cs#L118

As ShouldListenTo is only invoked once, and not with every Activity Start, I think its fine to leave it as is.

(we can still fix it if needed, but not a priority)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants