
Optimize reflection of F# types #9714

Merged 1 commit on Jul 23, 2020
Conversation


@kerams kerams commented Jul 18, 2020

While compiling expression trees to Funcs for faster execution is probably overkill for one-off reflection functions, I reckon this extra step is worth it for their precomputed counterparts, which tend to be used in performance-critical scenarios such as serialization.

If there's interest, I can similarly improve other PreCompute functions.
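For context, the kind of change being described can be sketched like this (illustrative only, not the PR's actual code; `Person` and `precomputeGetter` are made up for the example): a `PreCompute`-style function pays the expression-compilation cost once and returns a Func that reads the property without any further reflection.

```fsharp
open System
open System.Linq.Expressions

type Person = { Name: string; Age: int }

// Compile (once) an obj -> obj getter for a single property.
// Subsequent calls go through the compiled delegate, not reflection.
let precomputeGetter (propName: string) =
    let prop = typeof<Person>.GetProperty propName
    let param = Expression.Parameter (typeof<obj>, "o")
    let body =
        Expression.Convert (
            Expression.Property (Expression.Convert (param, typeof<Person>), prop),
            typeof<obj>)
    Expression.Lambda<Func<obj, obj>>(body, param).Compile ()

let getAge = precomputeGetter "Age"
// getAge.Invoke (box { Name = "Ada"; Age = 42 }) returns box 42
```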


dnfadmin commented Jul 18, 2020

CLA assistant check
All CLA requirements met.


abelbraaksma commented Jul 18, 2020

Are you sure the Compile method is available in .NET Core? I don't see it mentioned here: https://docs.microsoft.com/en-us/dotnet/api/system.data.objects.compiledquery.compile; at the bottom it only lists .NET Framework.

Oh wait, I think you're using this: https://docs.microsoft.com/en-us/dotnet/api/system.linq.expressions.expression-1.compile?view=netcore-3.1

@abelbraaksma

This will greatly improve repeated calls, but I wonder where the threshold is, because compiling is quite expensive. I mean, is it beneficial from 5 calls up, or 500 calls?

I think this is a great improvement, but we should probably know how big the performance improvement is, and where it starts to improve. Did you test with BDN?

Would it be possible to have the best of both, for instance by calling it the old way and lazily compiling in a different thread (no idea if this is even feasible)? Once compilation is done, subsequent calls would come from the compiled one.
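That swap-in idea could look something like the following sketch (a hypothetical helper for this thread's discussion; `slowRead` and `compileFast` are stand-ins supplied by the caller, not APIs from the PR):

```fsharp
open System.Threading.Tasks

// Answer calls with the slow reflection-based reader right away, kick off
// compilation in the background, and swap in the compiled reader once ready.
// (A real implementation would need a volatile/Interlocked publish.)
let makeAdaptiveReader (slowRead: obj -> obj[]) (compileFast: unit -> (obj -> obj[])) =
    let current = ref slowRead
    Task.Run (System.Action (fun () -> current.Value <- compileFast ())) |> ignore
    fun (o: obj) -> current.Value o
```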


kerams commented Jul 18, 2020

I haven't benchmarked it in isolation, but I've seen a nice improvement in Fable.Remoting thanks to this approach. Let me put something together.

Would it be possible to have the best of both, for instance by calling it the old way and lazily compiling in a different thread (no idea if this is even feasible)? Once compilation is done, subsequent calls would come from the compiled one.

Oh, it should be possible by introducing some kind of cache for field readers (and, unfortunately, separate caches for the rest of the PreCompute methods). However, it seems like a lot of trouble and I'm not convinced it's worth the effort. I'd expect the caller to use the function A LOT more than just 500 times.

Say the threshold where precomputing pays off with this change (haven't measured this) is 10,000 invocations, but you only need 1,000. You'll get a performance penalty now, but if that's a huge problem, you can always switch to GetRecordFields, which does the record field lookup every time. The time spent on PropertyInfo lookups via reflection (1,000 × record field count of them) will be negligible in my opinion (unless you for some reason need to make those 1,000 calls over and over again in a new process).

What I'm not sure about are the memory consumption implications. How much space does a compiled expression the size of the one in compilePropGetterFunc take? They don't ever get GCed either, right?

The obvious solution to all of these concerns is having new methods, but then you won't get a performance boost (or, indeed, penalty in very specific cases) just by updating.


kerams commented Jul 18, 2020

open System
open System.Linq.Expressions
open System.Reflection
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
open FSharp.Reflection

type Record = {
    A: int
    B: int
    C: string
    D: string
    E: unit }

let compileRecordReaderFunc (recordType: Type) =
    let param = Expression.Parameter (typeof<obj>, "param")
    let typedParam = Expression.Variable recordType

    let expr =
        Expression.Lambda<Func<obj, obj[]>> (
            Expression.Block (
                [ typedParam ],
                Expression.Assign (typedParam, Expression.Convert (param, recordType)),
                Expression.NewArrayInit (typeof<obj>, [
                    for prop in recordType.GetProperties (BindingFlags.Instance ||| BindingFlags.Public) ->
                        Expression.Convert (Expression.Property (typedParam, prop), typeof<obj>) :> Expression
                ])
            ),
            param)
    expr.Compile ()

let compileRecordReaderFuncWithBuffer (recordType: Type) =
    let param = Expression.Parameter (typeof<obj>, "param")
    let typedParam = Expression.Variable recordType
    let buffer = Expression.Parameter typeof<obj[]>
    let props = recordType.GetProperties (BindingFlags.Instance ||| BindingFlags.Public)

    let expr =
        Expression.Lambda<Func<obj, obj[], int>> (
            Expression.Block (
                [ typedParam ],
                [
                    Expression.Assign (typedParam, Expression.Convert (param, recordType)) :> Expression

                    for i, prop in Array.indexed props do
                        let arrayAtIndex = Expression.ArrayAccess (buffer, Expression.Constant (i, typeof<int>))
                        Expression.Assign (arrayAtIndex, Expression.Convert (Expression.Property (typedParam, prop), typeof<obj>)) :> Expression

                    Expression.Constant (props.Length, typeof<int>) :> Expression
                ]
            ),
            [ param; buffer ])
    expr.Compile ()

[<MemoryDiagnoser>]
type Test () =
    let before = FSharpValue.PreComputeRecordReader typeof<Record>
    let after = compileRecordReaderFunc typeof<Record>
    let afterWithBuffer = compileRecordReaderFuncWithBuffer typeof<Record>
    let buffer : obj[] = Array.zeroCreate 100

    let record = { A = 1; B = 2; C = "3"; D = "4"; E = () }

    [<Benchmark(Baseline = true)>]
    member _.Before () =
        for i in 1 .. 1000 do
            before record |> ignore

    [<Benchmark>]
    member _.After () =
        for i in 1 .. 1000 do
            after.Invoke record |> ignore

    [<Benchmark>]
    member _.AfterWithProvidedBuffer () =
        for i in 1 .. 1000 do
            afterWithBuffer.Invoke (record, buffer) |> ignore

    [<Benchmark>]
    member _.Direct () =
        for i in 1 .. 1000 do
            [| box record.A; box record.B; box record.C; box record.D; box record.E |] |> ignore

    [<Benchmark>]
    member _.ReaderCompilation () =
        compileRecordReaderFunc typeof<Record> |> ignore

    [<Benchmark>]
    member _.GetRecordFields () =
        for i in 1 .. 1000 do
            FSharpValue.GetRecordFields record |> ignore

BenchmarkRunner.Run<Test> () |> ignore
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|------|-------|--------|-------|---------|-------|-------|-------|-----------|
| Before | 546.78 us | 10.889 us | 14.537 us | 1.00 | 0.00 | 12.6953 | - | - | 109.38 KB |
| After | 21.19 us | 0.411 us | 0.404 us | 0.04 | 0.00 | 13.3667 | - | - | 109.38 KB |
| AfterWithProvidedBuffer | 17.37 us | 0.241 us | 0.214 us | 0.03 | 0.00 | 5.7373 | - | - | 46.88 KB |
| Direct | 19.27 us | 0.376 us | 0.501 us | 0.04 | 0.00 | 13.3667 | - | - | 109.38 KB |
| ReaderCompilation | 191.33 us | 2.377 us | 2.224 us | 0.35 | 0.01 | 0.7324 | 0.2441 | - | 7.39 KB |
| GetRecordFields | 20,648.06 us | 200.123 us | 187.195 us | 37.47 | 0.86 | 500.0000 | - | - | 4132.85 KB |

So if I'm interpreting this right, compiling a single reader function for the entire record (as opposed to a function for each field, as in the original commit) costs ~350 invocations of the present-day version of the function from PreComputeRecordReader, and each call of the compiled function is ~20 times faster than a single invocation of the latter. That puts the threshold somewhere around 370.
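As a sanity check on that estimate, break-even is roughly the one-off compile cost divided by the per-call saving, using the means from the table above (which report microseconds per 1,000 calls):

```fsharp
// Numbers from the benchmark table above.
let compileCost   = 191.33          // ReaderCompilation: one compile, in us
let beforePerCall = 546.78 / 1000.  // Before: current PreComputeRecordReader
let afterPerCall  = 21.19 / 1000.   // After: compiled Func
let breakEven = compileCost / (beforePerCall - afterPerCall)
// breakEven comes out at roughly 364 invocations,
// consistent with the ~370 estimate in the comment above
```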

@baronfel

Having this as an option would be great. When I tested FSharp.SystemTextJson last year as part of my OpenF# talk, the use of the reflection-based members erased almost all of the performance benefit of using System.Text.Json. Having an out-of-the-box way to get that same information more efficiently would make that library even more of a no-brainer than it already is.


abelbraaksma commented Jul 18, 2020

Thanks for the benchmark, the numbers help understand the impact.

If I understand the PR correctly, this pre-compiles, then caches access to members of records when PreComputeRecordReader is used, right? And you deliberately didn't do it for GetRecordFields. Since the former method already has the word compute in it, it kinda makes sense. But I agree that it would be even better to have this for more functions in reflect.fs.

I am, however, a little worried about the initial overhead. I don't know in what contexts this code is usually used (well, in reflection), or whether we can somehow ascertain that the threshold of 350+ calls is reached.

An alternative would be to add functions that have Compile in the name, so that users can choose. Something like PreCompileRecordReader or GetCompiledRecordFields.

Yet another way is perhaps how dynamic works in C#, which caches the invoked member for future access, though I'm not sure if it uses Compile(). I mean, I think in C# you get the MethodInfo and a delegate is created, which is much cheaper than compiling a LINQ expression tree. I didn't check whether such an approach is feasible here, though.


kerams commented Jul 18, 2020

If I understand the PR correctly, this pre-compiles, then caches access to members of records when PreComputeRecordReader is used, right?

The property accessors are embedded in the returned function closure. You can refer to it as caching, but if you call the precompute method again with the same type, everything is compiled anew and you get a different closure back.

you deliberately didn't do it for GetRecordFields

Yes, that's the "one-off" variant I talked about in the OP. There isn't any sort of caching involved and each call looks up PropertyInfo of each field of the record type. See GetRecordFields method in the benchmark.

and if we can ascertain somehow that the threshold of 350+ calls is reached

Unless I misunderstood something, the caller is the one that needs to know how often they're going to need to read record fields and it is their responsibility to choose the appropriate API/approach.

An alternative would be do add functions that have Compile in the name, so that users can choose. Something like PreCompileRecordReader, GetCompiledRecordFields.

Sure, but it would also be fantastic if users could automatically reap the benefits of this change by simply updating to a new version of .NET/FSharp.Core. As we have established though, this does introduce additional overhead, so I'll let the powers that be decide whether or not a new set of methods is required.

and a delegate is created, which is much cheaper than compiling a LINQ tree

I think Delegate.CreateDelegate returns a delegate for a specific instance if used with instance methods. It's also quite a bit faster than plain Invoke, but nowhere near as fast as the compiled Func, which technically isn't even reflection anymore. Compare the Direct and After benchmarks. The only overhead stems from the need to allocate an array to return the results and the boxing of value types.
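The delegate approach under discussion, for reference (a minimal sketch; `Rec` is a throwaway type for the example): binding a getter's MethodInfo with Delegate.CreateDelegate yields an open instance delegate whose first argument is the instance.

```fsharp
open System

type Rec = { X: int }

// Bind the get_X method to a strongly typed delegate. No expression trees,
// no IL emission; the delegate calls the getter directly.
let getX =
    let m = typeof<Rec>.GetProperty("X").GetGetMethod ()
    Delegate.CreateDelegate (typeof<Func<Rec, int>>, m) :?> Func<Rec, int>

// getX.Invoke { X = 7 } returns 7
```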


abelbraaksma commented Jul 18, 2020

Unless I misunderstood something,

No, I think we're on the same page. I understand the PR better now, thanks for the explanations!

Sure, but it would also be fantastic if users could automatically reap the benefits of this change by simply updating to a new version of .NET/FSharp.Core.

I totally agree.

The property accessors are embedded in the returned function closure. You can refer to it as caching

I see. Since compiled functions are never GC'ed, it may be better to introduce global caching for this, or users may leak memory (but that may not be trivial, concurrency stuff et al, and I don't know what the general idea is about global caches from the Core lib).
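A global cache along those lines might be sketched as follows (hypothetical; `compileReader` stands in for whatever compilation step the PR performs):

```fsharp
open System
open System.Collections.Concurrent

// One compiled reader per Type, shared process-wide, so repeated
// PreCompute* calls for the same type don't pile up compiled delegates.
let private readerCache = ConcurrentDictionary<Type, obj -> obj[]> ()

let getCachedReader (compileReader: Type -> (obj -> obj[])) (t: Type) =
    // Note: under contention GetOrAdd may run the factory more than once,
    // but only one compiled reader is ever stored and returned.
    readerCache.GetOrAdd (t, fun ty -> compileReader ty)
```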

I think Delegate.CreateDelegate returns a delegate for a specific instance if used with instance methods.

There's only one method table, regardless of instance, so I doubt that. You pass the instance as first argument to a delegate if it's an instance method delegate.

It's also quite a bit faster than plain Invoke, but nowhere near as fast the compiled Func

I thought so too, but in my own timings (different kinds of reflection, though) I saw only a few percent difference.

Anyway, compiling is certainly the fastest once it's compiled, but the overhead of compiling is huge compared to delegates. Which is why I raised the suggestion as an alternative.

But whatever route we take, it's a great improvement :).


kerams commented Jul 19, 2020

I've added another method to the benchmark. This could potentially be an extra overload where results are written into the provided buffer, doing away with an array allocation.


kerams commented Jul 20, 2020

This is pretty cool: https://github.com/dadhi/FastExpressionCompiler, but I am not sure it's desirable to depend on it in FSharp.Core.

@abelbraaksma

Yeah, they try to keep FSharp.Core independent of other assemblies.

@cartermp cartermp left a comment

I think this is generally a good improvement for the serialization scenario.

@dsyme what are your thoughts here?


Daniel-Svensson commented Jul 22, 2020

Nice work @kerams.
Have you considered just using delegate invocations as an alternative?
I did some benchmarking on different approaches to getting property values a while back and found it surprisingly fast.
It also has the upside of working really fast on platforms where compiled expressions are interpreted (such as when Reflection.Emit is missing).

I share my findings below:
Note:

  • Update: I did a quick attempt at an F# version running on netcoreapp3.1, and there the expression version seems to be faster than delegates even for hundreds of items, so it looks like a good solution for that runtime.
  • The benchmark is in C# and available here; the timings are for creating a Func<object,object> and calling it N times.
    In your scenario you will do a single expression compile per type, so it cannot be directly translated; the total overhead will be lower.
  • I only measured on .NET Framework; measurements on .NET Core will be different.
  • The posted measurements are from my laptop, so the last results might show some increase in measured error (even if the CPU was capped to 50%).

For my scenario, delegates won over pure reflection after as few as 10 calls.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18363
Intel Core i5-8250U CPU 1.60GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
  [Host]    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0
  RyuJitX64 : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0

Job=RyuJitX64  Jit=RyuJit  Platform=X64  
| Method | NumInvocations | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|----------------|------|-------|--------|--------|-------|---------|-------|-------|-------|-----------|
| Reflection | 10 | 3.374 us | 0.1361 us | 0.3927 us | 3.424 us | 1.00 | 0.00 | - | - | - | - |
| ExpressionCompile | 10 | 533.485 us | 16.3923 us | 47.8171 us | 526.475 us | 160.25 | 22.56 | 0.9766 | - | - | 5216 B |
| DelegateInvoke | 10 | 16.636 us | 0.6175 us | 1.7717 us | 16.812 us | 4.99 | 0.77 | 0.2441 | - | - | 787 B |
| Reflection | 50 | 18.658 us | 1.6585 us | 4.6507 us | 17.166 us | 1.00 | 0.00 | - | - | - | - |
| ExpressionCompile | 50 | 740.353 us | 24.1893 us | 69.7915 us | 750.745 us | 41.72 | 9.86 | 0.9766 | - | - | 5216 B |
| DelegateInvoke | 50 | 18.863 us | 1.1237 us | 3.1325 us | 18.746 us | 1.07 | 0.29 | 0.2441 | - | - | 787 B |
| Reflection | 100 | 90.505 us | 1.7948 us | 3.1903 us | 90.785 us | 1.00 | 0.00 | - | - | - | - |
| ExpressionCompile | 100 | 1,239.046 us | 17.2747 us | 16.1587 us | 1,239.779 us | 14.08 | 0.47 | - | - | - | 5200 B |
| DelegateInvoke | 100 | 54.064 us | 4.7938 us | 14.1345 us | 61.546 us | 0.71 | 0.06 | 0.2441 | - | - | 787 B |
| Reflection | 500 | 291.448 us | 25.6986 us | 75.7730 us | 272.045 us | 1.00 | 0.00 | - | - | - | - |
| ExpressionCompile | 500 | 877.857 us | 31.8290 us | 93.8484 us | 900.854 us | 3.22 | 0.91 | 0.9766 | - | - | 5216 B |
| DelegateInvoke | 500 | 49.281 us | 1.9660 us | 5.5772 us | 49.284 us | 0.18 | 0.05 | 0.2441 | - | - | 787 B |
| Reflection | 1000 | 426.901 us | 18.8465 us | 55.5693 us | 423.196 us | 1.00 | 0.00 | - | - | - | - |
| ExpressionCompile | 1000 | 867.530 us | 70.6930 us | 208.4398 us | 817.176 us | 2.06 | 0.54 | 0.9766 | - | - | 5216 B |
| DelegateInvoke | 1000 | 49.232 us | 2.1033 us | 6.1353 us | 50.323 us | 0.12 | 0.02 | 0.2441 | - | - | 787 B |
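On the interpreted-expressions point raised above: the behavior can be exercised explicitly on .NET Core, which exposes a Compile overload taking a preferInterpretation flag (a sketch; the overload is not available on .NET Framework, as far as I know):

```fsharp
open System
open System.Linq.Expressions

let square =
    let x = Expression.Parameter (typeof<int>, "x")
    let expr = Expression.Lambda<Func<int, int>> (Expression.Multiply (x, x), x)
    // Passing true asks for the interpreter even where IL emission is
    // available, which makes it easy to compare the two modes.
    expr.Compile true

// square.Invoke 6 returns 36
```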

@KevinRansom KevinRansom left a comment

I'm okay with this change as is. The performance benefit when cached is excellent, and in serialization scenarios this will be noticeable and significant. The user of this API will certainly want to cache the result to eliminate generating the funcs a bunch of times. One-time uses of PreComputeRecordReader are probably fairly rare ... the clue is in the name.

Thank you for preparing this, and the performance analysis.

@cartermp cartermp merged commit d82a0eb into dotnet:master Jul 23, 2020
@kerams kerams deleted the reflection branch July 23, 2020 04:35

kerams commented Jul 23, 2020

@KevinRansom, I'd be more than happy to try to implement this for these methods in a similar fashion (and refactor the record reader to use a single compiled func instead of one for every record field because I did not expect this to get merged so quickly :))):

PreComputeRecordConstructor(Type, FSharpOption)
PreComputeUnionConstructor(UnionCaseInfo, FSharpOption)
PreComputeUnionReader(UnionCaseInfo, FSharpOption)
PreComputeUnionTagReader(Type, FSharpOption)
PreComputeRecordFieldReader(PropertyInfo)
PreComputeTupleConstructor(Type)
PreComputeTupleReader(Type)

Additionally, do overloads taking a buffer (see AfterWithProvidedBuffer in the benchmark) sound like something that would be worth adding as well?

@Daniel-Svensson, if you're only going to read a property as few as 100 or 1,000 times, does it really matter which option you choose? Your benchmark shows that the difference between the slowest and fastest is sub-millisecond (not sure what happened in the ExpressionCompile row at 100 invocations), and I have a hard time coming up with a plausible scenario where that would matter at all. When I set out to create this PR, I had a specific use case in mind: serialization in web servers. The compilation overhead gets amortized into nothing and you get superb performance for the (long) lifetime of the process.

Your point about interpreted expression trees is interesting. Do those platforms throw on Compile() or do they return a delegate that does the interpretation on every invocation?

@KevinRansom

@kerams, please take a look if you would like. We would certainly consider PRs for those APIs.

nosami pushed a commit to xamarin/visualfsharp that referenced this pull request Feb 23, 2021