Optimize reflection of F# types #9714
Conversation
Oh wait, I think you're using this: https://docs.microsoft.com/en-us/dotnet/api/system.linq.expressions.expression-1.compile?view=netcore-3.1
This will greatly improve repeated calls, but I wonder where the threshold is, because compiling is quite expensive. I mean, is it beneficial from 5 calls up, or from 500? I think this is a great improvement, but we should probably know how big the performance improvement is and where it starts to pay off. Did you test with BDN? Would it be possible to have the best of both, for instance by calling it the old way and lazily compiling in a different thread (no idea if this is even feasible)? Once compilation is done, subsequent calls come from the compiled one.
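A rough sketch of that lazy-compilation idea, assuming the compileRecordReaderFunc helper defined in the benchmark further down; makeHybridReader is a hypothetical name, not an FSharp.Core API:

open System
open System.Threading.Tasks
open Microsoft.FSharp.Reflection

// Hypothetical hybrid reader: serve calls through the cheap reflection-based
// reader immediately, compile the expression on the thread pool, and swap in
// the compiled delegate once it is ready. The race is benign: callers simply
// keep using the slow reader until the faster one is published.
let makeHybridReader (recordType: Type) : obj -> obj[] =
    let current = ref (FSharpValue.PreComputeRecordReader recordType)
    Task.Run (fun () ->
        // compileRecordReaderFunc is the expression-compiling helper
        // from the benchmark below.
        let fast = compileRecordReaderFunc recordType
        current.Value <- fun o -> fast.Invoke o) |> ignore
    fun record -> current.Value record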
I haven't benchmarked it in isolation, but I've seen a nice improvement in Fable.Remoting thanks to this approach. Let me put something together.
Oh, it should be possible by introducing some kind of a cache for field readers (and, unfortunately, separate caches for the rest of the PreCompute methods). However, it seems like a lot of trouble and I'm not convinced it's worth the effort. I'd expect the caller to use the function A LOT more than just 500 times. Say the threshold where precomputing pays off with this change (haven't measured this) is 10000 invocations, but you only need 1000. You'll get a performance penalty now, but if that's a huge problem, you can always switch to GetRecordFields.

What I'm not sure about are the memory consumption implications. How much space does a compiled expression the size of the one in PreComputeRecordReader take?

The obvious solution to all of these concerns is having new methods, but then you won't get a performance boost (or, indeed, a penalty in very specific cases) just by updating.
open System
open System.Linq.Expressions
open System.Reflection
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
open Microsoft.FSharp.Reflection

type Record = {
    A: int
    B: int
    C: string
    D: string
    E: unit }

// Compiles a Func that reads all public instance properties of the record
// into a freshly allocated obj array.
let compileRecordReaderFunc (recordType: Type) =
    let param = Expression.Parameter (typeof<obj>, "param")
    let typedParam = Expression.Variable recordType
    let expr =
        Expression.Lambda<Func<obj, obj[]>> (
            Expression.Block (
                [ typedParam ],
                Expression.Assign (typedParam, Expression.Convert (param, recordType)),
                Expression.NewArrayInit (typeof<obj>, [
                    for prop in recordType.GetProperties (BindingFlags.Instance ||| BindingFlags.Public) ->
                        Expression.Convert (Expression.Property (typedParam, prop), typeof<obj>) :> Expression
                ])),
            param)
    expr.Compile ()

// Variant that writes the property values into a caller-provided buffer
// instead of allocating, returning the number of fields written.
let compileRecordReaderFuncWithBuffer (recordType: Type) =
    let param = Expression.Parameter (typeof<obj>, "param")
    let typedParam = Expression.Variable recordType
    let buffer = Expression.Parameter typeof<obj[]>
    let props = recordType.GetProperties (BindingFlags.Instance ||| BindingFlags.Public)
    let expr =
        Expression.Lambda<Func<obj, obj[], int>> (
            Expression.Block (
                [ typedParam ],
                [
                    Expression.Assign (typedParam, Expression.Convert (param, recordType)) :> Expression
                    for i, prop in Array.indexed props do
                        let arrayAtIndex = Expression.ArrayAccess (buffer, Expression.Constant (i, typeof<int>))
                        Expression.Assign (arrayAtIndex, Expression.Convert (Expression.Property (typedParam, prop), typeof<obj>)) :> Expression
                    Expression.Constant (props.Length, typeof<int>) :> Expression
                ]),
            [ param; buffer ])
    expr.Compile ()

[<MemoryDiagnoser>]
type Test () =
    let before = FSharpValue.PreComputeRecordReader typeof<Record>
    let after = compileRecordReaderFunc typeof<Record>
    let afterWithBuffer = compileRecordReaderFuncWithBuffer typeof<Record>
    let buffer : obj[] = Array.zeroCreate 100
    let record = { A = 1; B = 2; C = "3"; D = "4"; E = () }

    [<Benchmark(Baseline = true)>]
    member _.Before () =
        for _ in 1 .. 1000 do
            before (box record) |> ignore

    [<Benchmark>]
    member _.After () =
        for _ in 1 .. 1000 do
            after.Invoke record |> ignore

    [<Benchmark>]
    member _.AfterWithProvidedBuffer () =
        for _ in 1 .. 1000 do
            afterWithBuffer.Invoke (record, buffer) |> ignore

    [<Benchmark>]
    member _.Direct () =
        for _ in 1 .. 1000 do
            [| box record.A; box record.B; box record.C; box record.D; box record.E |] |> ignore

    // Measures the compilation overhead alone, i.e. the one-time cost.
    [<Benchmark>]
    member _.ReaderCompilation () =
        compileRecordReaderFunc typeof<Record> |> ignore

    [<Benchmark>]
    member _.GetRecordFields () =
        for _ in 1 .. 1000 do
            FSharpValue.GetRecordFields record |> ignore

BenchmarkRunner.Run<Test> () |> ignore
So if I'm interpreting this right, compiling a single reader function for the entire record (as opposed to a function for each field, as in the original commit) costs roughly 350 invocations of the present-day version of PreComputeRecordReader.
Having this as an option would be great. When I tested FSharp.SystemTextJson last year as part of my OpenF# talk, the use of the reflection-based members erased almost all of the performance benefit of using System.Text.Json. Having an out-of-the-box way to get the same information more efficiently would make that library even more of a no-brainer than it already is.
Thanks for the benchmark, the numbers help understand the impact. If I understand the PR correctly, this pre-compiles and then caches access to the members of records.

I am, however, a little worried about the initial overhead. I don't know in what contexts this code is usually used (well, in reflection), and whether we can somehow ascertain that the threshold of 350+ calls is reached. An alternative would be to add new functions with this behaviour instead of changing the existing ones.

Yet another way is perhaps how dynamic works in C#, which caches the invoked member for future access, though I'm not sure if it uses the same mechanism.
The property accessors are embedded in the returned function closure. You can refer to it as caching, but if you call the precompute method again with the same type, everything is compiled anew and you get a different closure back.
Yes, that's the "one-off" variant I talked about in the OP. There isn't any sort of caching involved and each call performs the reflection lookups anew.
Unless I misunderstood something, the caller is the one that needs to know how often they're going to need to read record fields and it is their responsibility to choose the appropriate API/approach.
Sure, but it would also be fantastic if users could automatically reap the benefits of this change by simply updating to a new version of .NET/FSharp.Core. As we have established though, this does introduce additional overhead, so I'll let the powers that be decide whether or not a new set of methods is required.
No, I think we're on the same page. I understand the PR better now, thanks for the explanations!
I totally agree.
I see. Since compiled functions are never GC'ed, it may be better to introduce global caching for this, or users may leak memory (though that may not be trivial, concurrency and all, and I don't know what the general stance is on global caches in the Core library).
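A minimal sketch of such a global cache, keyed by record type; readerCache and getOrCompileRecordReader are hypothetical names, and compileRecordReaderFunc is the helper from the benchmark above:

open System
open System.Collections.Concurrent

// Hypothetical process-wide cache: at most one compiled reader is retained
// per record type, so repeated precompute calls stop leaking fresh delegates.
let private readerCache = ConcurrentDictionary<Type, Func<obj, obj[]>> ()

let getOrCompileRecordReader (recordType: Type) =
    // GetOrAdd may invoke the factory more than once under contention,
    // but all callers end up observing the same cached delegate.
    readerCache.GetOrAdd (recordType, fun t -> compileRecordReaderFunc t)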
There's only one method table, regardless of instance, so I doubt that. You pass the instance as first argument to a delegate if it's an instance method delegate.
I thought so too, but in my own timings (different kinds of reflection, though) I saw only a few percent difference. Anyway, compiling certainly gives the fastest invocations once it's done, but the overhead of compiling is huge compared to delegates, which is why I raised the suggestion as an alternative. But whatever route we take, it's a great improvement :)
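For comparison, the delegate-based alternative mentioned above looks roughly like this: an open-instance delegate over a property getter, created with Delegate.CreateDelegate. It is far cheaper to create than a compiled expression, though it covers one property at a time. makePropertyGetter is a hypothetical helper; Record is the type from the benchmark above:

open System

// Open-instance delegate: a property getter is an instance method, so the
// record itself becomes the delegate's first (and only) argument.
let makePropertyGetter<'T, 'V> (propertyName: string) : Func<'T, 'V> =
    let getMethod = typeof<'T>.GetProperty(propertyName).GetGetMethod ()
    Delegate.CreateDelegate (typeof<Func<'T, 'V>>, getMethod) :?> Func<'T, 'V>

// Usage: let getA = makePropertyGetter<Record, int> "A"
//        getA.Invoke { A = 1; B = 2; C = "3"; D = "4"; E = () } // returns 1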
I've added another method to the benchmark. This could potentially be an extra overload where results are written into the provided buffer, doing away with an array allocation.
This is pretty cool https://github.com/dadhi/FastExpressionCompiler, but I am not sure if it's desirable to depend on it in FSharp.Core.
Yeah, they try to keep FSharp.Core independent of other assemblies.
I think this is generally a good improvement for the serialization scenario.
@dsyme what are your thoughts here?
Nice work @kerams. I share my findings below: for my scenario, delegates won over pure reflection after as few as 10 calls.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18363
Intel Core i5-8250U CPU 1.60GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
[Host] : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0
RyuJitX64 : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0
Job=RyuJitX64 Jit=RyuJit Platform=X64
I'm okay with this change as is. The performance benefit when cached is excellent, and in serialization scenarios this will be noticeable and significant. The user of this API will certainly want to cache the result to eliminate generating the funcs a bunch of times. One-time uses of PreComputeRecordReader are probably fairly rare ... the clue is in the name.
Thank you for preparing this, and the performance analysis.
@KevinRansom, I'd be more than happy to try to implement this for these methods in a similar fashion (and refactor the record reader to use a single compiled func instead of one for every record field, because I did not expect this to get merged so quickly :))): PreComputeRecordConstructor(Type, FSharpOption). Additionally, do overloads taking a buffer (see the benchmark above) sound worthwhile?

@Daniel-Svensson, if you're only going to read a property as few as 100 or 1000 times, does it really matter which option you choose? Your benchmark shows that the difference between the slowest and the fastest is sub-millisecond (not sure what happened in ExpressionCompile 100), and I have a hard time coming up with a plausible scenario where that would matter at all. When I set out to create this PR, I had a specific use case in mind: serialization in web servers. The compilation overhead gets amortized into nothing and you get superb performance for the (long) lifetime of the process.

Your point about interpreted expression trees is interesting. Do those platforms throw on Compile?
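On the interpreted expression tree question: on .NET Core, Expression<TDelegate>.Compile also has an overload taking a preferInterpretation flag, which is, as far as I know, the path runtimes without JIT support fall back to instead of throwing. A minimal sketch (behaviour on specific platforms is worth verifying):

open System
open System.Linq.Expressions

// Compile true produces an interpreter-backed delegate: much cheaper to
// create than jitted code, but slower to invoke; roughly the trade-off
// no-JIT platforms make automatically.
let compileReader (expr: Expression<Func<obj, obj[]>>) (preferInterpretation: bool) =
    expr.Compile preferInterpretation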
@kerams, please take a look if you would like. Certainly we would consider PRs for those APIs.
While compiling expression trees to Funcs for faster execution is probably overkill for one-off reflection functions, I reckon this extra step is worth it for their precomputed counterparts, which tend to be used in performance-critical scenarios such as serialization. If there's interest, I can similarly improve other PreCompute functions.