
Transfer reception dataflow is all wrong, performance is a nightmare, determinism went out the window, everything is terrible #185

Closed
pavel-kirienko opened this issue Nov 19, 2018 · 10 comments


@pavel-kirienko
Member

pavel-kirienko commented Nov 19, 2018

So the current reception architecture was created back in 2014, when I was working on a very, very early proof-of-concept (the protocol was first announced in 2014). A lot of ground-breaking changes have been introduced since then, so we ended up with this weird logic that was never designed to be this way; it just happened to end up like this through multiple partial mutations.

Currently, every received frame goes through the following pipeline:

  • First, the frame is parsed. At this point, it has three dimensions that need to be matched to a particular transfer receiver state: port ID (this is currently called "data type ID", but this concept is obsolete), source node ID, and kind (service or message).
  • The port ID (data type ID) is used to find the matching listener (TransferListener) in a linked list. The remaining dimensions are node ID and kind. Observe that the same listener serves both services and messages under the same port ID, which is odd.
  • The listener matches the required receiver state (TransferReceiver) using kind and source node ID (another linked list traversal).
  • The frame is fed into the matching TransferReceiver object which reconstructs the transfer.

So, plenty of unnecessary indirection and linear-time searching is afoot, which we must fix.
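
To make the indirection concrete, here is a stripped-down sketch of what the current lookup chain boils down to (the type names and fields are simplified for illustration and are not the actual libuavcan declarations):

#include <cstdint>

// Simplified illustration of the current reception structures (not the real libuavcan types).
struct TransferReceiver;              // reassembles transfers arriving from one source node

struct ReceiverEntry                  // (kind, source node ID) -> receiver; traversed linearly
{
    std::uint8_t      source_node_id;
    bool              is_service;
    TransferReceiver* receiver;
    ReceiverEntry*    next;
};

struct TransferListener               // one per port ID (a.k.a. data type ID)
{
    std::uint16_t     port_id;
    ReceiverEntry*    receivers;      // second linked-list traversal per frame
    TransferListener* next;           // first linked-list traversal per frame
};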

We could follow the example of libcanard and use uint32_t transfer descriptors: https://github.com/UAVCAN/libcanard/blob/master/DESIGN.md#rx-transfer-states, which would reduce the number of indirections to just one. Libcanard was designed much later, so we were able to get things right on the first attempt. However, there is a limitation that will be problematic for libuavcan: storing all states in a single linked list requires traversal of a huge linked list for every received frame.
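
For illustration, the descriptor idea boils down to packing the matching dimensions into one integer key, so that a single comparison per list node decides whether a frame belongs to a state (the field layout below is a sketch; see the linked DESIGN.md for libcanard's actual encoding):

#include <cstdint>

// Sketch only: pack port ID, transfer kind, and source node ID into one 32-bit key.
// The field positions are illustrative and do not necessarily match libcanard's layout.
constexpr std::uint32_t makeTransferDescriptor(const std::uint16_t port_id,
                                               const std::uint8_t  transfer_kind,
                                               const std::uint8_t  source_node_id)
{
    return static_cast<std::uint32_t>(port_id) |
           (static_cast<std::uint32_t>(transfer_kind)  << 16U) |
           (static_cast<std::uint32_t>(source_node_id) << 18U);
}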

An obvious solution is to shard the list in two by kind: one for messages, one for services. In this case, no extra dynamic indirection is introduced, since the number of options is fixed (always two) and the logic can be implemented statically (meaning that there will be two separate lists, one for services, the other for messages). The remaining number of dimensions to map through is reduced to two: port ID and source node ID.
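
In sketch form, the kind dimension would then be resolved statically rather than by traversal (the names here are hypothetical):

// Hypothetical sketch: two statically separate lists, one per transfer kind,
// so resolving the kind never costs a dynamic lookup.
struct ReceiverState;                            // placeholder for the per-(port, node) state

struct ReceiverList { ReceiverState* head = nullptr; };

ReceiverList message_states_;
ReceiverList service_states_;

ReceiverList& selectList(const bool is_service)
{
    return is_service ? service_states_ : message_states_;
}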

That seems better, but what if we're dealing with a hard real-time application? The worst-case frame processing complexity would be O(number_of_ports * number_of_nodes), which is, frankly, terrible, especially if we can't control the number of nodes on the bus, as is the case for COTS hardware.

How about we shard the lists further? Both the number of nodes and the number of ports have well-known limits: 128 and 65536/512, respectively. The latter is impractical to deal with, but how about node ID? If we were to really put our foot down and say that libuavcan can't work with memory-tight MCUs (let's face it, it's already the case, so what), we could keep a dedicated list of TransferReceiver objects per node ID, per kind. The right linked list can be resolved by constant-complexity static array indexing, and then the right receiver can be found by traversal with a complexity of O(number_of_ports), where the argument is well-controlled by the application (we always know how many subjects/services we subscribe to!). The resulting structure would be:

Map<uint16_t, TransferReceiver> receivers_by_port_id_[128 * 2];

Where uint16_t is the type of port ID, 128 is the worst case number of nodes, and 2 is the number of transfer kinds (services or messages). The Map template is just a thin wrapper over a linked list. Frame dispatching would look roughly like:

auto index = frame.getSourceNodeID() * 2 + ((frame.getTransferKind() == TransferKind::Service) ? 1 : 0);
auto& map = receivers_by_port_id_[index];  // Resolving node ID & kind in constant time
TransferReceiver& receiver = map.at(frame.getPortID());  // Resolving port ID in linear time
receiver.addFrame(frame);

This would be a monumental improvement to determinism & performance, and some improvement to the ROM footprint.

@pavel-kirienko
Member Author

Marked as "bug" because it is a serious problem, even though it is not visible from outside.

@antoinealb

If what you mostly seek is to remove the linear-time searching, it could be enough to store the data in self-balancing binary trees, couldn't it? We could cut it from O(number_of_ports * number_of_nodes) down to O(log(number_of_ports) + log(number_of_nodes)), which is already much better.

@pavel-kirienko
Member Author

I think we should separate the question of the data structure used for port ID mapping from the question of sharding. I am quite certain that sharding as described in my first post shall be implemented, because the gains are huge. Now, to the question of which data structure to use.

Let's have a look: http://bigocheatsheet.com/. What we care about is the worst-case complexity of searching and insertion. Finally, I can put my l33t gimp skills to good use:

[Screenshot: worst-case search and insertion complexity table from bigocheatsheet.com]

Ignoring the lame options, we have our linked list and four kinds of trees.

So the use case is simple: we receive a transfer and go look for the matching receiver, which is O(n) or O(log(n)) depending on the chosen container (where n is the number of ports). If there is one, it's a bingo. If not, we have to insert one, which is O(1) for a linked list or O(log(n)) for trees. Theory says that twice O(log(n)) is better than O(n), so trees win. Which one do we want?

@pavel-kirienko
Member Author

I would like to add also that the data structure should be heavily optimized for very small values of n. This is because most applications are unlikely to use more than a dozen subjects and a few services. This constraint essentially invalidates generic complexity analysis where n approaches infinity, possibly skewing the balance in favor of linked lists again.
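
To put rough, illustrative numbers on it: with n = 12 ports, a linear scan of a linked list visits at most 12 nodes, while a balanced tree needs about log2(12) ≈ 3.6, i.e. 4 comparisons per lookup. At that scale the asymptotic gap amounts to a handful of pointer chases, so node size, allocation behavior, and rebalancing overhead can easily dominate.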

@antoinealb

> I think we should separate the question of the data structure used for port ID mapping from the question of sharding. I am quite certain that sharding as described in my first post shall be implemented, because the gains are huge. Now, to the question of which data structure to use.

I think it is too expensive to shard the receive list, hence my proposal to not shard but use a more efficient data structure instead. Let's do the math:

Cost of sharding: We have to keep 256 Maps in memory. Let's assume that, since Map is just an abstraction over a single linked list, this means keeping 256 pointers in memory -> 1 KB of extra RAM usage. Is this correct? And is this an OK cost to pay?

We could also put all the TransferReceivers into hash maps, which would reduce the average lookup time without using additional memory.

@pavel-kirienko
Member Author

> Cost of sharding: We have to keep 256 Maps in memory. Let's assume that, since Map is just an abstraction over a single linked list, this means keeping 256 pointers in memory -> 1 KB of extra RAM usage. Is this correct? And is this an OK cost to pay?

Yep, looks right and perfectly acceptable. 1 KB is not that much; memory-limited applications can always opt for libcanard.

@thirtytwobits
Contributor

I'll revisit this bug in the future. My gut reaction is "egads! 256 std::maps!?!" but I won't rule it out.

@pavel-kirienko
Member Author

pavel-kirienko commented Nov 19, 2018

"egads! 256 std::maps!?!"

Just 256 instances of uavcan::Map<>, which are essentially just pointers. We don't use std::map<>; I couldn't wrestle the Allocator trait into working with our block allocator.
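
For reference, a minimal sketch of why an empty instance costs just one pointer (this is an illustration, not the actual uavcan::Map implementation):

// Illustration only (not the real uavcan::Map): a map over a singly-linked list whose
// nodes would come from the block allocator, so an empty instance is a single pointer.
template <typename Key, typename Value>
class ListMapSketch
{
    struct Node
    {
        Key   key;
        Value value;
        Node* next;
    };
    Node* head_ = nullptr;            // sizeof(ListMapSketch) == sizeof(void*)

public:
    Value* find(const Key& key)       // linear search; insertion omitted for brevity
    {
        for (Node* n = head_; n != nullptr; n = n->next)
        {
            if (n->key == key)
            {
                return &n->value;
            }
        }
        return nullptr;
    }
};

// 256 empty instances -> 256 head pointers -> 1 KB on a 32-bit target,
// which matches the estimate discussed above.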

@pavel-kirienko
Member Author

Invariance to the number of nodes in the network is extremely important. The ability of a node to adhere to real-time constraints should not be conditional on the number of other nodes on the network. This means that O(number_of_ports) is an improvement over O(log(number_of_ports) + log(number_of_nodes)), even if the latter offers nearly identical performance in most scenarios, because determinism.

@pavel-kirienko
Member Author

What I neglected to provide for is that there are two kinds of service transfers: requests and responses. It is possible for a node to act as both a server and a client of the same service ID at the same time, meaning that the proposed dual sharding between message and service transfers does not cover all cases. The sharding must be done across three kinds: messages, service requests, and service responses, resulting in the following arrangement:

Map<uint16_t, TransferReceiver> receivers_by_port_id_[128 * 3];

The rest still holds.
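
The dispatch index computation from the earlier sketch would change accordingly (again, the names and the enumerator order are illustrative):

// Sketch: kind_index is 0 for messages, 1 for service requests, 2 for service responses.
const auto kind_index = static_cast<std::size_t>(frame.getTransferKind());
const auto index      = static_cast<std::size_t>(frame.getSourceNodeID()) * 3U + kind_index;
auto& map = receivers_by_port_id_[index];                 // node ID & kind resolved in constant time
TransferReceiver& receiver = map.at(frame.getPortID());   // port ID resolved in linear time
receiver.addFrame(frame);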
