Error loading module on remote workers #19960

Closed
GregPlowman opened this issue Jan 10, 2017 · 5 comments
Labels: compiler:precompilation (Precompilation of modules), parallelism (Parallel or distributed computation)

Comments

@GregPlowman
Contributor

Originally reported on Discourse: https://discourse.julialang.org/t/error-loading-module-on-remote-workers/1049

module TestModule
    export test
    using Formatting
    test(x) = println(format(x, commas=true))
end
addprocs(...)    # add worker on remote machine 
using TestModule
test(1000)    # this runs locally as expected for all versions
remotecall_fetch(2, test, 1000)    # this works for v0.4.5 but errors on v0.4.7

On Julia v0.4.5 everything works as expected:

1,000
        From worker 2:  1,000

On Julia v0.4.7 (and v0.5.0) an error occurs:

ERROR: LoadError: On worker 2:
LoadError: LoadError: SystemError: opening file C:\Users\plowman\.julia\lib\v0.4\Compat.ji: No such file or directory
 in open at iostream.jl:90
 in open at iostream.jl:102
 in stale_cachefile at loading.jl:459
 in _require_search_from_serialized at loading.jl:114
 in require at loading.jl:249
 in include_string at loading.jl:295
 in include_from_node1 at loading.jl:336
 in require at loading.jl:273
 in include_string at loading.jl:295
 in include_from_node1 at loading.jl:336
 in eval at sysimg.jl:14
 in anonymous at multi.jl:1374
 in anonymous at multi.jl:920
 in run_work_thunk at multi.jl:661
 in run_work_thunk at multi.jl:670
 in anonymous at task.jl:58
while loading C:\Users\plowman\.julia\v0.4\Formatting\src\Formatting.jl, in expression starting on line 10
while loading C:\Users\plowman\OneDrive\Julia\ModulesTemp\TestModule\src\TestModule.jl, in expression starting on line 5
 in remotecall_fetch at multi.jl:747
 in remotecall_fetch at multi.jl:750
 in call_on_owner at multi.jl:793
 in wait at multi.jl:808
 in require at loading.jl:271
 in include at boot.jl:261
 in include_from_node1 at loading.jl:333
while loading C:\Users\plowman\OneDrive\Julia\Temp\RemoteWorkerTest.jl, in expression starting on line 22

  • I'm using Windows (the error occurs on both Windows 7 and Windows 10).

  • Everything works on Julia v0.4.5. The error occurs on v0.4.7, and a similar error occurs on v0.5.0.

  • The error occurs only when loading a module on workers on remote machines (everything works when the workers are local, on the same machine as the master process).

  • If Compat is installed and precompiled on the remote worker, then everything works. Interestingly:

    • Compat needs to be precompiled (the worker seems to be looking for Compat.ji).
    • The worker seems to ignore the location of the package directory (as reported by Pkg.dir()) on the remote worker. I'm guessing it might be using the same path as Pkg.dir() on the master process.
    • The Formatting package does not need to be installed on the remote machine (only its dependency Compat).
  • Another user on Discourse (thanks Patrick) tested this example on Linux using Julia v0.5.0 but could not reproduce the error.

Because everything works on v0.4.5 and errors on v0.4.7, I looked at which files in the error backtrace changed between these versions.
multi.jl and iostream.jl did not change, but loading.jl did.
#18230 changed loading.jl in v0.4.7 (apparently a backport of PR #18150).

@amitmurthy
Contributor

Nice detailed report. Good detective work on tracking down the changes causing it. Code loading is currently being revamped in 0.6 by Stefan.

@kshyatt kshyatt added the parallelism Parallel or distributed computation label Jan 10, 2017
@GregPlowman
Contributor Author

GregPlowman commented Jan 12, 2017

I assume from your comment that there won't be a fix in, say, 0.5.1?
Will the functionality be available in 0.6?

Although Julia internals and code loading are way beyond my understanding, I have been trying to find a fix so that I can upgrade to v0.5. I have managed to make a change that I think fixes the problem for me, but I'm not sure what the full implications might be. I was hoping someone could comment on whether this change is reasonable or is likely to cause me issues:

It seems that when loading a module, the code in _require_search_from_serialized() searches for cache paths always on node 1, but then checks each cache path to see if it's stale on the current process.

So maybe:

  • check should be done only when the current process is node 1:
-  if stale_cachefile(sourcepath, path_to_try)
+  if node == myid() && stale_cachefile(sourcepath, path_to_try)

OR

  • check should always be run on node 1 (similar to finding the cache paths on node 1)
-  if stale_cachefile(sourcepath, path_to_try)
+  if node == myid()
+      if stale_cachefile(sourcepath, path_to_try)
+          continue
+      end
+  else
+      if @fetchfrom node stale_cachefile(sourcepath, path_to_try)
+          continue
+      end
+  end

Here's the full function from v0.5.0:

function _require_search_from_serialized(node::Int, mod::Symbol, sourcepath::String, toplevel_load::Bool)
    if node == myid()
        paths = find_all_in_cache_path(mod)
    else
        paths = @fetchfrom node find_all_in_cache_path(mod)
    end

    for path_to_try in paths::Vector{String}
        if stale_cachefile(sourcepath, path_to_try)
            continue
        end
        restored = _require_from_serialized(node, mod, path_to_try, toplevel_load)
        if isa(restored, Exception)
            if isa(restored, ErrorException) && endswith(restored.msg, " uuid did not match cache file.")
                # can't use this cache due to a module uuid mismatch,
                # defer reporting error until after trying all of the possible matches
                DEBUG_LOADING[] && info("JL_DEBUG_LOADING: Failed to load $path_to_try because $(restored.msg)")
                continue
            end
            warn("Deserialization checks failed while attempting to load cache from $path_to_try.")
            throw(restored)
        else
            return restored
        end
    end
    return !isempty(paths)
end
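
To make the mismatch described above easier to see, here is a minimal toy model. It is only a sketch and not Base's loading code: each "node" gets its own pretend filesystem, and `FILES`, `find_cache_paths`, `is_stale`, `search_current` and `search_on_owner` are all hypothetical names invented for illustration, as is the cache path. It assumes, as in the report, that the cache file exists only on the master.

```julia
# Toy model: each "node" has its own filesystem, represented as a set of paths.
# Every name and path here is hypothetical; none of this is Base's loading code.
const FILES = Dict(
    1 => Set(["/home/master/.julia/lib/Compat.ji"]),  # master has the cache file
    2 => Set{String}(),                                # remote worker does not
)

# Stand-in for find_all_in_cache_path: cache files as seen by `node`.
find_cache_paths(node::Int) = sort!(collect(FILES[node]))

# Stand-in for stale_cachefile: opening a missing file throws, as in the report.
function is_stale(node::Int, path::String)
    path in FILES[node] || error("opening file $path: No such file or directory")
    return false  # pretend every cache file that exists is fresh
end

# Behaviour described in the report: paths come from the owner node,
# but the staleness check runs on the current process `me`.
search_current(owner::Int, me::Int) =
    [p for p in find_cache_paths(owner) if !is_stale(me, p)]

# Second suggestion: run the staleness check on the owner node as well.
search_on_owner(owner::Int, me::Int) =
    [p for p in find_cache_paths(owner) if !is_stale(owner, p)]
```

In this model, `search_on_owner(1, 2)` returns the master's cache path, while `search_current(1, 2)` throws because node 2 tries to open a file that exists only on node 1. Whether the worker can then actually load that cache file is a separate question that this sketch does not address.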

@GregPlowman
Contributor Author

GregPlowman commented Apr 3, 2017

I just tried on version 0.6.0-pre.alpha.0 and still receive the error.

Are there any plans for remote code loading to be reintroduced?
AFAICT, it last worked in v0.4.5.

cc @amitmurthy

@GregPlowman
Contributor Author

This is the error I receive on Julia v0.6.0:

module RemoteWorkerTest
    export test2
    using Formatting
    test2(x) = (printfmt(FormatSpec("#04x"), x); println())
end
Julia-0.6.0-pre.alpha> using RemoteWorkerTest
ERROR: On worker 2:
LoadError: SystemError: opening file C:\Users\plowman\.julia\lib\v0.6\Formatting.ji: No such file or directory
#systemerror#39 at .\error.jl:64 [inlined]
systemerror at .\error.jl:64
open at .\iostream.jl:104
open at .\iostream.jl:132
stale_cachefile at .\loading.jl:739
_require_search_from_serialized at .\loading.jl:221
require at .\loading.jl:409
include_string at .\loading.jl:485
include_from_node1 at .\loading.jl:542
eval at .\boot.jl:235
#682 at .\distributed\macros.jl:25
#93 at .\distributed\process_messages.jl:264
run_work_thunk at .\distributed\process_messages.jl:56
run_work_thunk at .\distributed\process_messages.jl:65 [inlined]
#86 at .\event.jl:73
while loading C:\Users\plowman\OneDrive\Julia\Modules\v0.6\RemoteWorkerTest\src\RemoteWorkerTest.jl, in expression starting on line 4
Stacktrace:
 [1] #remotecall_fetch#131(::Array{Any,1}, ::Function, ::Function, ::Base.Distributed.Worker, ::Base.Distributed.RRID, ::Vararg{Any,N} where N) at .\distributed\remotecall.jl:354
 [2] remotecall_fetch(::Function, ::Base.Distributed.Worker, ::Base.Distributed.RRID, ::Vararg{Any,N} where N) at .\distributed\remotecall.jl:346
 [3] #remotecall_fetch#134(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Base.Distributed.RRID, ::Vararg{Any,N} where N) at .\distributed\remotecall.jl:367
 [4] remotecall_fetch(::Function, ::Int64, ::Base.Distributed.RRID, ::Vararg{Any,N} where N) at .\distributed\remotecall.jl:367
 [5] call_on_owner(::Function, ::Future, ::Int64, ::Vararg{Int64,N} where N) at .\distributed\remotecall.jl:440
 [6] wait(::Future) at .\distributed\remotecall.jl:455
 [7] require(::Symbol) at .\loading.jl:451

@vtjnash
Member

vtjnash commented Jun 27, 2017

Fixed by #21695.

@vtjnash vtjnash closed this as completed Jun 27, 2017