
Transfer reception dataflow is all wrong, performance is a nightmare, determinism went out the window, everything is terrible #185

Closed
pavel-kirienko opened this issue Nov 19, 2018 · 10 comments


@pavel-kirienko
Member

pavel-kirienko commented Nov 19, 2018

So the current reception architecture was created back in 2014, when I was working on a very, very early proof-of-concept (the protocol was first announced in 2014). A lot of ground-breaking changes have been introduced since then, so we ended up with this weird logic that was never designed to be this way; it just happened to end up like this through multiple partial mutations.

Currently, every received frame goes through the following pipeline:

  • First, the frame is parsed. At this point, it has three dimensions that need to be matched to a particular transfer receiver state: port ID (this is currently called "data type ID", but this concept is obsolete), source node ID, and kind (service or message).
  • The port ID (data type ID) is used to find the matching listener (TransferListener) in a linked list. The remaining dimensions are node ID and kind. Observe that the same listener serves both services and messages under the same port ID, which is odd.
  • The listener matches the required receiver state (TransferReceiver) using kind and source node ID (another linked list traversal).
  • The frame is fed into the matching TransferReceiver object which reconstructs the transfer.

So, plenty of unnecessary indirection and linear-time searching is afoot, which we must fix.
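
To make the indirection concrete, here is a stripped-down sketch of what the current lookup chain boils down to (the type names and fields are simplified for illustration and are not the actual libuavcan declarations):

#include <cstdint>

// Simplified illustration of the current reception structures (not the real libuavcan types).
struct TransferReceiver;              // reassembles transfers arriving from one source node

struct ReceiverEntry                  // (kind, source node ID) -> receiver; traversed linearly
{
    std::uint8_t      source_node_id;
    bool              is_service;
    TransferReceiver* receiver;
    ReceiverEntry*    next;
};

struct TransferListener               // one per port ID (a.k.a. data type ID)
{
    std::uint16_t     port_id;
    ReceiverEntry*    receivers;      // second linked-list traversal per frame
    TransferListener* next;           // first linked-list traversal per frame
};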

We could follow the example of libcanard and use uint32_t transfer descriptors: https://github.com/UAVCAN/libcanard/blob/master/DESIGN.md#rx-transfer-states, which would reduce the number of indirections to just one. Libcanard was designed much later, so we were able to get things right on the first attempt. However, there is a limitation that will be problematic for libuavcan: storing all states in a single linked list requires traversal of a huge linked list for every received frame.
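
For illustration, the descriptor idea boils down to packing the matching dimensions into one integer key, so that a single comparison per list node decides whether a frame belongs to a state (the field layout below is a sketch; see the linked DESIGN.md for libcanard's actual encoding):

#include <cstdint>

// Sketch only: pack port ID, transfer kind, and source node ID into one 32-bit key.
// The field positions are illustrative and do not necessarily match libcanard's layout.
constexpr std::uint32_t makeTransferDescriptor(const std::uint16_t port_id,
                                               const std::uint8_t  transfer_kind,
                                               const std::uint8_t  source_node_id)
{
    return static_cast<std::uint32_t>(port_id) |
           (static_cast<std::uint32_t>(transfer_kind)  << 16U) |
           (static_cast<std::uint32_t>(source_node_id) << 18U);
}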

An obvious solution is to shard the list in two by kind: one for messages, one for services. In this case, no extra dynamic indirection is introduced, since the number of options is fixed (always two) and the logic can be implemented statically (meaning that there will be two separate lists, one for services, the other for messages). The remaining number of dimensions to map through is reduced to two: port ID and source node ID.
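
In sketch form, the kind dimension would then be resolved statically rather than by traversal (the names here are hypothetical):

// Hypothetical sketch: two statically separate lists, one per transfer kind,
// so resolving the kind never costs a dynamic lookup.
struct ReceiverState;                            // placeholder for the per-(port, node) state

struct ReceiverList { ReceiverState* head = nullptr; };

ReceiverList message_states_;
ReceiverList service_states_;

ReceiverList& selectList(const bool is_service)
{
    return is_service ? service_states_ : message_states_;
}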

That seems better, but what if we're dealing with a hard real-time application? The worst-case frame processing complexity would be O(number_of_ports * number_of_nodes), which is, frankly, terrible, especially if we can't control the number of nodes on the bus, as is the case for COTS hardware.

How about we shard the lists further? Both the number of nodes and the number of ports have well-known limits: 128 and 65536/512, respectively. The latter is impractical to deal with, but how about node ID? If we were to really put our foot down and say that libuavcan can't work with memory-tight MCUs (let's face it, it's already the case, so what), we could keep a dedicated list of TransferReceiver objects per node ID, per kind. The right linked list can be resolved by constant-complexity static array indexing, and then the right receiver can be found by traversal with a complexity of O(number_of_ports), where the argument is well-controlled by the application (we always know how many subjects/services we subscribe to!). The resulting structure would be:

Map<uint16_t, TransferReceiver> receivers_by_port_id_[128 * 2];

Where uint16_t is the type of port ID, 128 is the worst case number of nodes, and 2 is the number of transfer kinds (services or messages). The Map template is just a thin wrapper over a linked list. Frame dispatching would look roughly like:

auto index = frame.getSourceNodeID() * 2 + ((frame.getTransferKind() == TransferKind::Service) ? 1 : 0);
auto& map = receivers_by_port_id_[index];  // Resolving node ID & kind in constant time
TransferReceiver& receiver = map.at(frame.getPortID());  // Resolving port ID in linear time
receiver.addFrame(frame);

This would be a monumental improvement to determinism & performance, and some improvement to the ROM footprint.

@pavel-kirienko
Member Author

Marked as "bug" because it is a serious problem, even though it is not visible from outside.

@antoinealb

If what you mostly seek is to remove the linear-time searching, it could be enough to store the data in self-balancing binary trees, couldn't it? We could cut it from O(number_of_ports * number_of_nodes) down to O(log(number_of_ports) + log(number_of_nodes)), which is already much better.

@pavel-kirienko
Member Author

I think we should separate the question of the data structure used for port ID mapping from the question of sharding. I am quite certain that sharding as described in my first post shall be implemented, because the gains are huge. Now, to the question of which data structure to use.

Let's have a look: http://bigocheatsheet.com/. What we care about is the worst-case complexity of searching and insertion. Finally, I can put my l33t gimp skills to good use:

[Screenshot: worst-case search and insertion complexity table from bigocheatsheet.com]

Ignoring the lame options, we have our linked list and four kinds of trees.

So the use case is simple: we receive a transfer and go look for the matching receiver, which is O(n) or O(log(n)) depending on the chosen container (where n is the number of ports). If there is one, it's a bingo. If not, we have to insert one, which is O(1) for a linked list or O(log(n)) for trees. Theory says that twice O(log(n)) is better than O(n), so trees win. Which one do we want?

@pavel-kirienko
Member Author

I would like to add also that the data structure should be heavily optimized for very small values of n. This is because most applications are unlikely to use more than a dozen subjects and a few services. This constraint essentially invalidates generic complexity analysis where n approaches infinity, possibly skewing the balance in favor of linked lists again.
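
To put rough, illustrative numbers on it: with n = 12 ports, a linear scan of a linked list visits at most 12 nodes, while a balanced tree needs about log2(12) ≈ 3.6, i.e. 4 comparisons per lookup. At that scale the asymptotic gap amounts to a handful of pointer chases, so node size, allocation behavior, and rebalancing overhead can easily dominate.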

@antoinealb

> I think we should separate the question of the data structure used for port ID mapping from the question of sharding. I am quite certain that sharding as described in my first post shall be implemented, because the gains are huge. Now, to the question of which data structure to use.

I think it is too expensive to shard the receive list, hence my proposal to not shard but use a more efficient data structure instead. Let's do the math:

Cost of sharding: We have to keep 256 Maps in memory. Let's assume that, since Map is just an abstraction over a single linked list, this means keeping 256 pointers in memory -> 1 KB of extra RAM usage. Is this correct? And is this an OK cost to pay?

We could also put all the TransferReceivers into hash maps, which would reduce the average lookup time without using additional memory.

@pavel-kirienko
Member Author

> Cost of sharding: We have to keep 256 Maps in memory. Let's assume that, since Map is just an abstraction over a single linked list, this means keeping 256 pointers in memory -> 1 KB of extra RAM usage. Is this correct? And is this an OK cost to pay?

Yep, looks right and perfectly acceptable. 1 KB is not that much; memory-limited applications can always opt for libcanard.

@thirtytwobits
Contributor

I'll revisit this bug in the future. My gut reaction is "egads! 256 std::maps!?!" but I won't rule it out.

@pavel-kirienko
Member Author

pavel-kirienko commented Nov 19, 2018

"egads! 256 std::maps!?!"

Just 256 instances of uavcan::Map<>, which are essentially just pointers. We don't use std::map<>; I couldn't wrestle the Allocator trait into working with our block allocator.
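
For reference, a minimal sketch of why an empty instance costs just one pointer (this is an illustration, not the actual uavcan::Map implementation):

// Illustration only (not the real uavcan::Map): a map over a singly-linked list whose
// nodes would come from the block allocator, so an empty instance is a single pointer.
template <typename Key, typename Value>
class ListMapSketch
{
    struct Node
    {
        Key   key;
        Value value;
        Node* next;
    };
    Node* head_ = nullptr;            // sizeof(ListMapSketch) == sizeof(void*)

public:
    Value* find(const Key& key)       // linear search; insertion omitted for brevity
    {
        for (Node* n = head_; n != nullptr; n = n->next)
        {
            if (n->key == key)
            {
                return &n->value;
            }
        }
        return nullptr;
    }
};

// 256 empty instances -> 256 head pointers -> 1 KB on a 32-bit target,
// which matches the estimate discussed above.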

@pavel-kirienko
Member Author

Invariance to the number of nodes in the network is extremely important. The ability of a node to adhere to real-time constraints should not be conditional on the number of other nodes on the network. This means that O(number_of_ports) is an improvement over O(log(number_of_ports) + log(number_of_nodes)), even if the latter offers nearly identical performance in most scenarios, because determinism.

@pavel-kirienko
Member Author

What I neglected to provide for is that there are two kinds of service transfers: requests and responses. It is possible for a node to act as both a server and a client of the same service ID at the same time, meaning that the proposed dual sharding between message and service transfers does not cover all cases. The sharding must be done across three kinds: messages, service requests, and service responses, resulting in the following arrangement:

Map<uint16_t, TransferReceiver> receivers_by_port_id_[128 * 3];

The rest still holds.
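
The dispatch index computation from the earlier sketch would change accordingly (again, the names and the enumerator order are illustrative):

// Sketch: kind_index is 0 for messages, 1 for service requests, 2 for service responses.
const auto kind_index = static_cast<std::size_t>(frame.getTransferKind());
const auto index      = static_cast<std::size_t>(frame.getSourceNodeID()) * 3U + kind_index;
auto& map = receivers_by_port_id_[index];                 // node ID & kind resolved in constant time
TransferReceiver& receiver = map.at(frame.getPortID());   // port ID resolved in linear time
receiver.addFrame(frame);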
