TODO

Based on the design guidelines paper:
    Multi-queue optimization is useful: avoid "binding" a fast CPU core to a slow NPU; use multiple QPs instead
    Can we use their timer trick (client guesses the upper 4 bytes) that they use to fit an 8-byte value into the 4-byte immediate value?
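
Rough sketch of how such a trick could work. This is our guess at the mechanism, not the paper's confirmed method: the sender puts only the low 32 bits of a 64-bit timestamp into the immediate, and the receiver picks the upper 32 bits that land closest to its own clock estimate. The names low32 and reconstruct are made up for illustration, and it only works if the two clocks differ by less than 2^31 ticks.

```python
MASK32 = 0xFFFFFFFF


def low32(value: int) -> int:
    """Keep only the low 4 bytes, i.e. what fits in the immediate."""
    return value & MASK32


def reconstruct(imm: int, local_estimate: int) -> int:
    """Rebuild a 64-bit value from its low 32 bits.

    Assumption: local_estimate is within 2**31 of the true value, so
    the right upper half is the guess itself or one of its neighbours.
    """
    high = local_estimate >> 32
    candidates = [
        ((high + d) << 32) | imm
        for d in (-1, 0, 1)
        if 0 <= high + d <= MASK32
    ]
    # Pick the candidate closest to what the local clock says.
    return min(candidates, key=lambda c: abs(c - local_estimate))
```

The neighbour check matters when the sender's value and the receiver's estimate sit on opposite sides of a 32-bit wrap.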

Allocation metadata (which pages are owned by this object) must also be shipped with the object itself.
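
Sketch of what "metadata travels with the object" could look like on the wire. The layout is an assumption, not a decided format: a small header (object id, page count), then the list of owned page numbers, then the payload. pack_object and unpack_object are hypothetical names.

```python
import struct

# Assumed header layout: object id (u64) + number of owned pages (u32),
# little-endian, no padding.
HEADER = struct.Struct("<QI")


def pack_object(obj_id: int, pages: list, payload: bytes) -> bytes:
    """Serialize header, owned page numbers (u64 each), then payload."""
    meta = HEADER.pack(obj_id, len(pages))
    meta += struct.pack(f"<{len(pages)}Q", *pages)
    return meta + payload


def unpack_object(buf: bytes):
    """Inverse of pack_object: returns (obj_id, pages, payload)."""
    obj_id, n_pages = HEADER.unpack_from(buf, 0)
    pages = list(struct.unpack_from(f"<{n_pages}Q", buf, HEADER.size))
    payload = buf[HEADER.size + 8 * n_pages:]
    return obj_id, pages, payload
```

With this shape the receiver can rebuild the ownership list before touching the payload, at the cost of 8 bytes per owned page.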

Variables that we have to be careful about:
dev_attrs.max_sge: 30
dev_attrs.max_sge_rd: 30
dev_attrs.max_qp_rd_atom: 16
dev_attrs.max_qp_init_rd_atom: 16
dev_attrs.max_srq_wr: 32767
dev_attrs.max_srq_sge: 31
dev_attrs.max_pkeys: 128
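
Example of why these limits matter: a scatter/gather list longer than max_sge cannot go into one work request, so it has to be split. Minimal sketch, assuming the value 30 from the list above; split_sges is a hypothetical helper and the segments are just opaque items here, not real ibv_sge entries.

```python
MAX_SGE = 30  # dev_attrs.max_sge as reported above


def split_sges(segments: list, max_sge: int = MAX_SGE) -> list:
    """Split a scatter/gather list into chunks of at most max_sge entries.

    Each chunk would become one work request.
    """
    if max_sge <= 0:
        raise ValueError("max_sge must be positive")
    return [
        segments[i:i + max_sge]
        for i in range(0, len(segments), max_sge)
    ]
```

For instance, 65 segments need three work requests (30 + 30 + 5).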


Make data structure implementations to see what's required/useful
distributed trie?

Do not compare to the ideal scenario, compare to what's currently possible/doable

BetrFS and its follow-ons
Recursive move


Compare with the current state and how we do things currently
for small objects, our granularity is too large
    what is the API for developers
    reimplement an existing system with this and show what's better
    what do distributed applications do -> what are the existing solutions
    -> why they don't work
    refer to ramp explicitly since we're building on them
    write bullet points


Can the actor model be an application?
Computation sharding to go with data sharding
Anything can be passed along and shuffled like in Hadoop
    key, value

If an application needs to interact with a remote machine,
and we can do it fast enough that there is a gain in
latency/overall time,
we can ship the computation there and run that specific part remotely
(e.g. talking to a TCP service is faster through the
loopback without leaving the kernel, or simpler yet, reading
from local disk in a Hadoop-like computation model that is
much more flexible).
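
Back-of-the-envelope version of the "is shipping worth it" test. This is a toy cost model, assuming a fixed one-time cost to ship the computation and a constant per-operation latency on each side; should_ship and all parameters are hypothetical and the numbers below are made up, not measurements.

```python
def should_ship(n_ops: int,
                remote_op_latency_us: float,
                local_op_latency_us: float,
                ship_cost_us: float) -> bool:
    """True if shipping the computation beats doing every op remotely.

    remote_op_latency_us: cost of one operation done across the network
    local_op_latency_us:  cost of the same operation once we run next
                          to the data (loopback, local disk, ...)
    ship_cost_us:         one-time cost of moving the computation
    """
    stay_total = n_ops * remote_op_latency_us
    ship_total = ship_cost_us + n_ops * local_op_latency_us
    return ship_total < stay_total
```

With a 500 us shipping cost, 10 us remote ops and 1 us local ops, shipping wins for 100 operations (600 us vs 1000 us) but not for 10 (510 us vs 100 us).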