Remote series and dataframes and distributed GC #932
Conversation
Very cool stuff! 🤯
So far I've only had one question. I'll probably have more after I read through again.
Receives a data structure and traverses it looking
for remote dataframes and series.

If any is found, it spawns a process on the remote node
and sets up a distributed garbage collector. This function
only traverses maps, lists, and tuples, it does not support
arbitrary structs (such as map sets).
Why do a traversal? If place accepted only series and dataframes, I think the API would be simpler.
Are we worried the following is expensive or inconvenient?
resource_map = %{"foo" => resource1, "bar" => resource2}
pids_map = Map.new(resource_map, fn {k, v} -> {k, place(v)} end)
There are two reasons why we do a traversal:

- Convenience: it is easier if the rest of the Explorer Series/DataFrame API does not have to care about explicitly annotating what needs to be placed.
- FLAME support: we want to integrate this with FLAME, allowing developers to run custom code within a FLAME that may return series/dataframes which are arbitrarily nested (see the sketch below).
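A rough sketch of the kind of nested result this is meant to support (the FLAME pool name and the map shape are made up for illustration; the arity-1 place call mirrors the snippet above):

result =
  FLAME.call(MyApp.RemotePool, fn ->
    %{
      df: Explorer.DataFrame.new(a: [1, 2, 3]),
      stats: {:ok, Explorer.Series.from_list([1.0, 2.0])}
    }
  end)

# one traversal places every series/dataframe it finds, however deeply
# nested, without the caller annotating anything
placed = Explorer.Remote.place(result)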
Amazing!! 🔥🐈⬛
{:stop, reason, state}
end

def handle_info({:DOWN, _, _, pid, _}, state) do
Shouldn't we monitor on hold to receive this message?
Yes, we should! I will try to add a test.
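A hypothetical sketch of what monitoring on hold could look like (the real hold signature and the exact shape of pids/refs are not shown in this diff, so the names below are assumptions):

# assumes hold arrives as a call carrying the pid to monitor, and that
# state keeps pids/refs maps as suggested by the snippet above
def handle_call({:hold, pid}, _from, %{pids: pids, refs: refs} = state) do
  ref = Process.monitor(pid)
  {:reply, :ok, %{state | pids: Map.put(pids, pid, ref), refs: Map.put(refs, ref, pid)}}
end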
noreply_or_stop(%{state | pids: pids, refs: refs})
end

def handle_info({:DOWN, owner_ref, _, _, reason}, %{owner_ref: owner_ref} = state) do
If the idea is to allow multiple nodes to call hold, why do we terminate eagerly when the owner goes down, given that we already do cleanup in the other :DOWN handler?
We need to do this if there are no pids or refs; I will change accordingly.
Yeah, but it should be enough to monitor on init and rely on the other clause, which wouldn't pop anything and would cause stop, no?
(the difference being that, if the initial node goes down and there are other nodes still referencing it, we keep the holder alive)
Automatically transfer data between nodes for remote series and dataframes and perform distributed garbage collection.

The functions in Explorer.DataFrame and Explorer.Series will automatically move operations on remote dataframes to the nodes they belong to. This module provides additional conveniences for manual placement.
Implementation details
There is a new module called Explorer.Remote. In order to understand what it does, we need to understand the challenges in working with remote series and dataframes.
Series and dataframes are actually NIF resources: they are pointers to blobs of memory operated by low-level libraries. Those are represented in Erlang/Elixir as references (the same as the one returned by make_ref/0). Once the reference is garbage collected (based on refcounting), those NIF resources are garbage collected and the memory is reclaimed.
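A quick way to see this in IEx (the data/resource fields are internal details of the Polars backend and may change, so treat this only as an illustration):

series = Explorer.Series.from_list([1, 2, 3])
# the backend struct wraps a NIF resource, which shows up as a plain reference
is_reference(series.data.resource)
#=> true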
When using Distributed Erlang, you may write this code:
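For instance, with :"remote@host" standing in for a connected node:

node = :"remote@host"
remote_series = :erpc.call(node, Explorer.Series, :from_list, [[1, 2, 3]])
# may fail: nothing on the remote node is holding the series alive
Explorer.Series.sum(remote_series)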
However, the code above will not work, because the series will be allocated in the remote node and the remote node won't hold a reference to said series! This means the series is garbage collected and, if we attempt to read it later on from the caller node, it will no longer exist. Therefore, we must explicitly place these resources in remote nodes by spawning processes to hold these references. That's what the place/2 function in this module does.

We also need to guarantee these resources are not kept forever by these remote nodes, so place/2 creates a local NIF resource that notifies the remote resources they have been GCed, effectively implementing a remote garbage collector.
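Putting it together, a sketch of the intended flow (the arity-1 place call assumes the second argument is optional, as in the snippet discussed above):

node = :"remote@host"
remote_series = :erpc.call(node, Explorer.Series, :from_list, [[1, 2, 3]])

# place/2 spawns a holder process on the remote node so the NIF resource
# stays alive, and returns a placed series we can keep using from here
series = Explorer.Remote.place(remote_series)
Explorer.Series.sum(series)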
TODO

- collect in dataframe transfer to the current node
- collect to series
- compute to dataframe
- node option to creation functions