Review of loading time #77
This gives an idea: step 4 is just… However, taken independently, there doesn't seem to be a massive slowdown.

In a fresh session, … Benchmarking stuff in a module is a bit weird, I guess.

This seems to point to CategoricalArrays being the culprit.

There seems to be a massive discrepancy between a terminal REPL and Juno... This is in a fresh REPL session (same Julia version); no problem here. Not sure how to analyse this. In both the REPL and Juno, loading MLJBase takes 13-15 seconds, which is way too much. When running timings in the module, it's always the…
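For reference, the standard way to reproduce these numbers is `@time using ...` in a fresh session (a minimal sketch, not the exact commands used in this thread; the figures in the comments are the ones quoted above):

```julia
# Run in a *fresh* Julia session: a second `using` of an already-loaded
# package is nearly free and would hide the real cost.
@time using CategoricalArrays   # one of the suspects above
@time using ScientificTypes     # ~7 s when loaded as part of the MLJ stack (see below)
@time using MLJBase             # reported at 13-15 s in both the REPL and Juno
```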
Right, so the culprit is to be found in Requires: making everything a hard dependency in ScientificTypes hugely decreases latency, 1.2s instead of 7s. I guess we could make that even less by removing…

Generally there are probably choices to be made in terms of where we have optional dependencies. While it makes total sense in MLJModels, I'm less convinced it makes sense in MLJBase or MLJ, and I definitely don't think it makes sense in ST. If optional deps were overhead-free, pushing for them could be a good idea, but it seems that's far from the case, so maybe let's add more rather than fewer hard deps, and eventually, when things stabilise and it's clearer what we can do, make stuff optional?
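For context, a rough sketch of the two styles being compared, optional dependency via Requires versus a hard dependency (illustrative only, not the actual ScientificTypes source; the CSV UUID is the one I believe is registered, so double-check it):

```julia
module WithOptionalCSV
# Requires style: the CSV glue only runs if the user also loads CSV, but
# Requires itself (and the deferred evaluation) adds load-time overhead.
using Requires

function __init__()
    @require CSV="336ed68f-0bac-5ca0-87d4-7b16caf5d00b" begin
        read_table(path) = CSV.File(path)   # hypothetical glue code
    end
end

end # module

module WithHardCSV
# Hard-dependency style: CSV is listed in Project.toml [deps] and always
# loaded; no Requires machinery, but every user pays CSV's own load time.
using CSV

read_table(path) = CSV.File(path)

end # module
```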
OK, so after a lot of head scratching and trial and error, here's what I did:

The loading time drops from 10-15s to... 2s. So I definitely think there's some thought to be had on this: if we don't want MLJ to be shunned because it's mega slow to start, we should start being smart about possibly having hard deps for things that will most likely be used anyway.

Suggestions
I think there's a larger discussion to be had, but basically using Requires should be avoided as much as possible. In MLJModels it's inevitable, but elsewhere we probably should not use it. By the way, the performance of Requires has apparently been a problem since 2017: JuliaPackaging/Requires.jl#39. There will probably be things we can't do by dropping optional deps, but I think it might constrain us in the right direction: code that is more generic and faster to load. I think we should make a strong move in that direction ASAP; loading times of 10-30s will make people run away from the package...

Some numbers

Too slow
OK
Instant
So from looking at this, it would seem that the only real reason to use Requires would be for CSV, and I don't really know why it would be a necessity to have that in MLJBase; other than that, using optional deps does not seem to serve a purpose. Beyond that, there may be things to shave in the way things are loaded in MLJModels (beyond Requires), like the way the metadata is read, but I think that if we sort this out first we will already see a huge improvement, and then we can tackle more minor details.
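One way to produce numbers like those above is to time each package in its own fresh process, for example (a sketch; the package list here is illustrative, not MLJBase's actual dependency list):

```julia
# Each package is timed in a separate fresh Julia process so that shared
# dependencies already loaded by an earlier measurement don't skew the result.
for pkg in ("CategoricalArrays", "Distributions", "Tables", "CSV")
    code = "@time using $pkg"
    run(`$(Base.julia_cmd()) --startup-file=no -e $code`)
end
```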
See also https://discourse.julialang.org/t/loading-time-for-2-packages-sum-of-individual-loading-time/30630. There also seems to be an issue with Julia 1.3: loading MLJModels on Julia 1.2 takes 3s; on Julia 1.3 it takes 13s...
See also this issue: JuliaLang/julia#33615. It may be that when this is fixed, the recommendations here can essentially be ignored.

Takes 3s now 👍
On Julia 1.5, even when precompiled:

julia> @time using MLJBase
 12.497485 seconds (22.42 M allocations: 1.155 GiB, 5.92% gc time)

pkg> st MLJBase
Status `~/Programa/Project.toml`
  [a7f614a8] MLJBase v0.14.6

😞 Is there any plan to move some of the bigger dependencies (HTTP, JLSO?) to MLJ? Using MLJBase in my package makes…
I think we should drop HTTP and the OpenML stuff; that should be a separate package. I don't know what JLSO is. Then there's the plan of taking the measures out of MLJBase, which would take out LossFunctions too...
JLSO.jl is part of MLJBase: https://github.com/alan-turing-institute/MLJBase.jl/blob/master/Project.toml#L16. It has a lot of dependencies:

(TestEnv) pkg> add JLSO
Updating registry at `~/.julia/registries/General`
######################################################################## 100.0%
Resolving package versions...
Updating `~/Programa/TestEnv/Project.toml`
[9da8a3cd] + JLSO v2.3.2
Updating `~/Programa/TestEnv/Manifest.toml`
[fbb218c0] + BSON v0.2.6
[944b1d66] + CodecZlib v0.7.0
[e2ba6199] + ExprTools v0.1.1
[8f5d6c58] + EzXML v1.1.0
[48062228] + FilePathsBase v0.9.4
[9da8a3cd] + JLSO v2.3.2
[682c06a0] + JSON v0.21.0
[94ce4f54] + Libiconv_jll v1.16.0+5
[f28f55f0] + Memento v1.1.0
[78c3b35d] + Mocking v0.7.1
[69de0a69] + Parsers v1.0.10
[3cdcf5f2] + RecipesBase v1.0.2
[cea106d9] + Syslogs v0.3.0
[f269a46b] + TimeZones v1.3.2
[3bb67fe8] + TranscodingStreams v0.9.5
[02c8fc9c] + XML2_jll v2.9.10+1
[83775a58] + Zlib_jll v1.2.11+15
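For anyone wanting to check this kind of fan-out programmatically rather than from the `pkg> add` output, something like the following should work (a sketch using the Pkg API as I understand it, run in an environment that already has JLSO):

```julia
using Pkg

# Walk the active environment and print the direct dependencies of JLSO.
for (uuid, info) in Pkg.dependencies()
    info.name == "JLSO" || continue
    println("JLSO depends on: ", join(sort(collect(keys(info.dependencies))), ", "))
end
```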
@ablaom what do you think? Should this serialisation business either be done differently so that it depends on fewer things, or otherwise be extracted from MLJBase? I've not looked at that part of the code at all, so no idea what's reasonable.
I agree it makes sense to move one or both of model serialisation and OpenML into new packages, if this is impacting the load times.

@cstjean Just curious, what package are you using that depends on MLJBase?
Yes, just confirming JLSO has a huge impact: times with or without pre-compilation are roughly halved by dropping JLSO. cc JLSO author @oxinabox, who may want to comment, but I think we should go ahead and move serialisation out. (Just FYI, JLSO is the fallback serialisation method; models can implement custom serialisation.) Thanks @cstjean for pointing this one out!
The alternative is to use Julia's native serialisation as the fallback. Actually, this is essentially what is happening: by default, JLSO uses native serialisation. However, it adds a wrapper that records the Julia version, the state of packages, and so forth, which is kind of nice for reproducibility.
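To make the comparison concrete, here is roughly what the two fallbacks look like side by side (API usage as I understand it from the JLSO and Serialization docs; not MLJBase's actual serialisation code, and `model_state` is just a stand-in value):

```julia
using Serialization          # stdlib, essentially no extra load-time cost

model_state = (coefs = [1.0, 2.0], bias = 0.5)   # stand-in for a fitted machine

# Native Julia serialisation: fast to load, but records nothing about the
# environment the file was written from.
serialize("model.jls", model_state)
restored = deserialize("model.jls")

using JLSO                   # heavier dependency; the file additionally
                             # records the Julia version and package environment
JLSO.save("model.jlso", :model => model_state)
restored_jlso = JLSO.load("model.jlso")[:model]
```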
@rofinn we should look into what is making JLSO have such an impact on load times.
Most of the JLSO load time is from Memento's…

The PR for a partial fix is open: invenia/Memento.jl#164
Update: As far as I can tell, the only issue brought up here that is left unresolved and not tracked elsewhere is the surprisingly long load time for LossFunctions.jl, so I am closing this now in favour of #570.
@ablaom skip to the last message for a summary
In line with JuliaAI/MLJModels.jl#118
Initial considerations:

ScientificTypes stands out by a huge margin (7 seconds; all other packages take around 0.5s). I'll keep investigating.

I'm a bit confused: loading ST independently takes 0.1s; loading it with CategoricalArrays is significantly slower, but still not 7s.