RFC: User definable transports using ClusterManagers #9046
Conversation
Currently, the worker knows its pid only when it receives the first message. This leads to issues like
We can address this by exporting
With MPI, the workers are started by MPI. There is no special manager process; there are just workers -- if you want to have a manager, you need to use one of the workers. MPI offers an API to send messages between workers. There is also additional functionality (e.g. efficient broadcast and reduce operations) but that's too high level to be mapped to a ClusterManager. MPI has no pids; instead, workers are numbered sequentially starting from zero. It is possible to add new workers (by connecting several MPI jobs), but let's ignore that for the moment. Would this MPI model fit with your changes? Essentially, when Julia starts, all workers have already been set up and connected, and one uses small integers to address them?
It would broadly work in the following manner (pseudo code): Julia processes started with
MPIClusterManager implementation:
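Since the pseudo code itself is not reproduced above, the following is only a rough, hypothetical sketch of the shape such an MPIClusterManager could take under the `launch` / `connect_m2w` API proposed in this PR. The rank-to-pid convention, the fixed rank count and the bridging task are illustrative assumptions; the actual MPI calls are only indicated in comments.

```julia
# Hypothetical sketch only - not the pseudo code from this comment.
# All processes are assumed to have been started by mpirun already, so
# launch() has nothing to spawn; it just records one config entry per rank.
type MPIClusterManager <: ClusterManager end

function launch(::Type{MPIClusterManager}, config::Dict, launched::Array, c::Condition)
    nranks = 4                      # in reality: query the MPI library for the world size
    for rank in 1:nranks-1          # rank 0 acts as the Julia master
        wconfig = copy(config)
        wconfig[:rank] = rank       # Julia pid would map to MPI rank + 1
        push!(launched, wconfig)
        notify(c)
    end
end

function connect_m2w(::Type{MPIClusterManager}, pid::Int, config::Dict)
    read_stream  = StreamBuffer()   # multi.jl reads messages from `pid` here
    write_stream = StreamBuffer()   # multi.jl writes messages meant for `pid` here
    # A bridging task per worker would drain write_stream into MPI sends to
    # config[:rank], and push bytes received from that rank into read_stream.
    (read_stream, write_stream)
end
```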
When you receive data via MPI, you already know who sent the data; the additional headers are not necessary. Also, MPI already implements buffers, so it should not be necessary to implement buffers in Julia as well. Unbuffered reads/writes should work just fine. In an ideal world, the manager would not need to store any data regarding connections, ids, workers etc., as one can get all this information by querying MPI instead. In your code above, the whole
The
I'm sorry that I didn't look at your implementation yet, but do you think that the above is feasible? Your quick response and implementation are very encouraging.
The issue is in the different communication infrastructure implementations. In Julia, the model is:
We do not want to change the above model just for MPI. For MPI communication we are not aware of (and don't care about) its internal communication infrastructure. However, if we want Julia parallel constructs (
As part of the bridging, we send data received from a worker (via MPI) to the appropriate Julia task handling that particular worker connection via
We need to store the mapping between a Julia process id (used by
The additional header is required for just the initial state. A Julia worker does not know its Julia process id until the master connects to it and sends the pid.
The Julia pid can be sent only once, the first time a node sends any message to any other node - the same way that Julia does natively. That will do away with the requirement for a header. However, a header may still be useful in other circumstances - for example, if the MPIClusterManager is written to support both regular MPI calls as well as MPI-Julia bridge calls in the same user code. Also, MPI broadcast can probably be supported by having another version of
Maybe I am conflating two issues. Yes, this cluster manager can be used with MPI; this is good. We should probably proceed this way and address the remaining issues later. My other worry is that having one task per connection leads to O(n^2) tasks when there are n workers, if we assume a long-running application with a "random" communication pattern. This will not scale to the process counts I'm interested in (say, 10,000 workers). It will likely work fine if there are 100 workers, and that is probably the majority of the use cases. However -- at the moment, it is probably more important to support MPI at all than to worry about this kind of scalability. So, please go ahead, I like your proposal very much.
Tasks are very, very cheap. For example:
There is a segfault issue that came up with this code - #9066 - but the fact remains that tasks are not expensive at all. However, at 10K+ workers, we will need to have efficient broadcast mechanisms.
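The snippet being referred to is not reproduced above; the following is a hypothetical micro-benchmark of the same flavour, only to illustrate the point about task cost (timings will vary by machine):

```julia
# Time the creation of 10^5 trivial tasks, then separately time running
# 10^5 no-op tasks to completion. Both are typically well under a second.
@time tasks = [Task(() -> 1) for i in 1:10^5];

@time @sync for i in 1:10^5
    @async nothing
end
```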
I have added a proof-of-concept that uses ZMQ for transport. It is in
It uses a star topology as opposed to the native mesh network. All Julia nodes only connect to a "broker" process that listens on known ports 8100 and 8101 via ZMQ sockets. All commands must be run from
First, start the broker. In a new console type:
Next, start a Julia REPL and type:
Alternatively, head.jl, a test script, could be run. It just launches the requested number of workers, executes a simple command on all of them and exits.
NOTE: As stated, this is a proof-of-concept. A real Julia cluster using ZMQ will probably use different ZMQ socket types and optimize the transport further.
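Since the exact commands are not reproduced above, here is only an assumed outline of the flow being described; the file names and the `addprocs_zmq` helper are guesses for illustration, not necessarily what the proof-of-concept actually provides.

```julia
# Console 1: start the broker, which binds the known ZMQ ports 8100/8101
# (file name assumed for this sketch):
#   julia broker.jl

# Console 2: in a Julia REPL, load the ZMQ cluster manager and add workers
# through it (ZMQCM.jl and addprocs_zmq are assumed names):
include("ZMQCM.jl")
addprocs_zmq(4)                 # workers talk only to the broker (star topology)
@everywhere println(myid())     # a simple command executed on all workers
```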
cc @shashi |
Some timing information for the simple ZMQ vs native transport sockets:
With 8 workers:
1.7 seconds (ZMQ), 0.75 seconds (built-in TCP). Not too bad given that an extra hop is involved in the above model. Using ZMQ for transport will be beneficial when we need to scale to thousands of workers, or when bridging clusters of nodes in different locations, where the need to leverage extra computing resources justifies the extra network latency.

@eschnett do you think you can use the above as a template for an MPI cluster manager?
I think that any system allowing thousands of workers will have some kind of management software installed, such as Condor or MPI. But then, maybe someone discovers a cool project running thousands of Julia workers distributed over the globe...

@amitmurthy I will try this with MPI. I don't know when I'll have time for this, though.
Superseded by #9434
This is an attempt to make it possible to use alternate transport mechanisms for a Julia cluster.
NOTE: `multi.jl` has been split into `manager.jl` and `multi.jl`. The former has all code related to connection setup and cluster managers, while the latter has the multi-processing part.
Concept
Currently, all nodes are set up in a mesh network, with each node connected to every other node via TCP sockets.
From a scaling point of view, in some cases, a star configuration or even a tree configuration may be preferred.
This patch makes it possible, for example, to use Julia libraries that leverage MPI as well as the regular Julia parallel infrastructure.
NOTE: I am unfamiliar with MPI, so the MPI comments are based on a sketchy knowledge of MPI.
As an example, say we write a ZMQClusterManager and we interconnect thousands of Julia processes via a 0MQ router into a star network. So while a Julia worker connects to the 0MQ router with a single 0MQ socket, the interface between ZMQClusterManager and multi.jl is via a pair of "StreamBuffer"s. A "StreamBuffer" is a non-OS `AsyncStream`. It essentially wraps a `PipeBuffer` and read/write `Condition` variables to provide waitable in-process stream objects.
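To make the idea concrete, here is a minimal sketch of such a wrapper, assuming Base of that era; it is not the actual StreamBuffer implementation in this patch. Writers append to the in-memory buffer and notify a condition; readers block on the condition until data is available.

```julia
# Illustrative sketch only.
type StreamBuffer <: AsyncStream
    buffer::IOBuffer      # a PipeBuffer holding bytes in transit
    r_c::Condition        # readers wait on this until data arrives
end
StreamBuffer() = StreamBuffer(PipeBuffer(), Condition())

function Base.write(s::StreamBuffer, p::Vector{Uint8})
    n = write(s.buffer, p)
    notify(s.r_c)         # wake any task blocked in readavailable
    n
end

function Base.readavailable(s::StreamBuffer)
    while nb_available(s.buffer) == 0
        wait(s.r_c)       # yields to other tasks until a writer notifies
    end
    takebuf_array(s.buffer)
end
```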
While the implementation may have similarities with a `Channel{Uint8}` (#8507), it is different in the sense that `StreamBuffer`s implement `AsyncStream`, while `Channel` implements single-element `put!`, `take!`, etc.

Implementation
Custom cluster managers should implement the following methods (a minimal skeleton is sketched after the note below):

- `launch` - launches multiple worker processes
- `connect_m2w` - sets up connections between the master and workers. Optional.
- `connect_w2w` - sets up connections between the workers. Optional.
- `manage` - bookkeeping / interrupting workers

A default implementation of `connect_m2w` and `connect_w2w` using `TCPSocket` is provided in Base.

NOTE: Custom implementations which return "StreamBuffer"s may not even set up any actual "connection"s. They may just associate the Julia pid with the appropriate routing information.
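The skeleton below sketches how these four methods fit together for a hypothetical manager. `MyTransportManager`, the `:np` entry and the placeholder comments are assumptions for illustration; the signatures follow the descriptions in this section and the next.

```julia
# Hypothetical skeleton of a custom cluster manager for the proposed API.
type MyTransportManager <: ClusterManager end

function launch(::Type{MyTransportManager}, config::Dict, launched::Array, c::Condition)
    for i in 1:config[:np]              # :np assumed to carry the requested worker count
        wconfig = copy(config)
        wconfig[:connect_at] = i        # routing info the workers will need later
        # ... actually spawn / register the worker process here ...
        push!(launched, wconfig)
        notify(c)                       # lets addprocs start connecting in parallel
    end
end

function connect_m2w(::Type{MyTransportManager}, pid::Int, config::Dict)
    # Return (read_stream, write_stream): bytes read from the first were sent
    # by `pid`; bytes written to the second must be delivered to `pid`.
    r = StreamBuffer()
    w = StreamBuffer()
    # ... associate both streams with config[:connect_at] in the transport layer ...
    (r, w)
end

function connect_w2w(::Type{MyTransportManager}, pid::Int, config::Dict)
    connect_m2w(MyTransportManager, pid, config)   # same idea, driven by :connect_at
end

function manage(::Type{MyTransportManager}, id::Integer, config::Dict, op::Symbol)
    # op is one of :register, :deregister, :interrupt, :finalize
end
```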
Base provides:

The control flow starts with an `addprocs(np::Int; kwargs...)`, where one of the keyword args is `manager`, which defaults to `LocalManager()`.

`addprocs` calls `launch(::Type, config::Dict, launched::Array, c::Condition)` to start Julia workers.

The `launch` method, which is run in a separate task, should
- push an entry onto `launched::Array` for every worker launched. This will typically be a copy of `config` with additional fields that may be required to connect to the worker.
- `notify` the Condition variable `c` as and when a worker is launched, so that connections to the launched workers can be set up in parallel.

Base provides two `launch` methods: `launch{T<:LocalManager}(::Type{T}, ....` and `launch{T<:SSHManager}(::Type{T}, ....`
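With a manager like the hypothetical `MyTransportManager` sketched earlier, the entry point would then be a single call (shown here passing the type, matching the `::Type` dispatch above; an assumption of the sketch):

```julia
# Ask for 8 workers through the custom manager instead of the default LocalManager.
addprocs(8; manager=MyTransportManager)
```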
Next, `connect_m2w(::Type, pid::Int, config::Dict)` is called to set up connections between the master and workers. It is called for each and every `config` returned by the `launch` method above. `config` also has an additional key `:pid` which is the pid of the worker.

`connect_m2w` should return a tuple `(read_stream::AsyncStream, write_stream::AsyncStream)`. It can also add any additional key-value pairs to `config`.

Any bytes read from `read_stream` must have been sent from process `pid`. Any bytes written to `write_stream` should be delivered to process `pid`.
Base provides a default implementation which uses regular TCP sockets to set up the mesh network: `connect{T<:ClusterManager}(::Type{T}, ....)`. Both `read_stream` and `write_stream` are the same `TCPSocket` in this case.

`connect{T<:ClusterManager}` expects either `:io` or `:host`/`:port` to be defined in config, where io is the stdout of the launched worker. If `:io` is defined, it is used to read the host/port information that is printed by the worker. If `:host` is defined, it overrides the host read from io.

Both `launch` and `connect_m2w` could add a key `:connect_at`. The value of this field is sent to all workers to be used in worker-to-worker connection setups.
`connect_w2w(::Type, pid::Int, config::Dict)` is called to set up worker-to-worker connections. Since this is called from the workers, very limited config information is available. `config` in this case has a field `:connect_at` which can be used by the cluster manager's `connect_w2w` implementation to connect to `pid`.

`manage(::Type{T}, id::Integer, config::Dict, op::Symbol)` is called from the master process only, with `op` one of `:register`, `:deregister`, `:interrupt` and `:finalize`.
Startup and specifying the cluster manager in workers: workers are started with `--worker <cluster_manager>`, where `cluster_manager` is the type of cluster manager, the manager code itself being loaded via the `-e` or program options.

`Base.start_worker{T<:ClusterManager}(::Type{T}, out::IO)` is the current entry point for a worker, which listens on a free port and prints out connection information, after loading the cluster manager using `-e` or program options.

`process_messages(r_stream::AsyncStream, w_stream::AsyncStream; kwargs...)` is called once for every incoming connection/data from another node.
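As a worker-side illustration of the entry point just described, a startup script might look roughly like the following; the file name and the way the manager code gets loaded are assumptions:

```julia
# Hypothetical worker startup, loaded via -e or as the program file.
include("MyTransportManager.jl")                 # make the manager type available first
Base.start_worker(MyTransportManager, STDOUT)    # listen on a free port, print connection info
```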
TODO
- `StreamBuffer`s