
IO performance is highly dependent on the size of the TX queue #189

Closed
pavel-kirienko opened this issue Dec 3, 2018 · 12 comments

@pavel-kirienko
Member

Whenever the library emits a frame, it puts it into a prioritized transmission queue. There is one such TX queue per interface. Each entry in the queue has a fixed lifetime so that if the interface is saturated or down, old entries can be removed in favor of newer ones.

The queue is a sorted linked list of entries, so insertion has linear time complexity. This causes severe performance degradation when an interface is down, especially in complex applications where the TX queues are allowed to use a lot of memory.

The worst thing about it is that it can lead to bad side effects, like a multi-interface node failing to meet its real-time constraints when an interface goes down because the library is suddenly eating all its processing time.

There are two okay solutions:

@ntakouris

  • Regarding the smarter log-time data structure, would a simple search tree be sufficient? Things could be pushed a bit further with a smarter self-balancing structure (like a red-black tree), though insertion becomes somewhat more costly. Memory use also grows by one bit per entry, since the color information must be stored, but the worst case is largely avoided and the amortized insertion cost is constant. A lighter-weight alternative would be an AVL tree.

I'm unaware of the details of this library, but if some nodes are accessed much more frequently than others, we could use a splay tree (it pushes the most-used nodes closer to the root). The drawback is that its internal representation changes on every read (although this could be disabled after a 'characteristic' number of reads, to make read performance more predictable).

I guess that in order to decide on the best approach, simulations need to be run with a characteristic payload that covers as many usage edge cases as possible.

(I'm sorry if this is totally irrelevant; I'm trying to find my way into UAVs.)

@pavel-kirienko
Member Author

Thanks for your input. Most applications that libuavcan (or UAVCAN in general) is designed for don't really care about the average case; we must optimize for the worst case. Hence, while the experience of non-real-time systems such as Android may still be valuable, it can't be applied here directly.

@ntakouris

A simple height-balanced tree (AVL) can be used, then. If extra lookup performance is needed at the cost of some extra work on insertion, a red-black tree is the way to go. Either can be implemented over an underlying array to take advantage of CPU caching.

(Android is just an example of how to back a map interface with an array.)

@pavel-kirienko
Member Author

Would you be interested to draft up a proof-of-concept implementation? (@thirtytwobits FYI)

@ntakouris

ntakouris commented Jan 19, 2019

I'd be willing to.

Basically, these changes need to be made in can_io.hpp / uc_can_io.cpp:

  • Modify CanTxQueue so that it has an underlying AVL tree (some extra balancing helper functions will be added) instead of the LinkedListRoot.
  • Change push, peek, remove, etc. to work with the AVL tree.
  • Find the usages of queue_ in uc_can_io.cpp and fix whatever the new structure breaks.
  • Tests.

I guess the backing AVL tree needs access to the allocator. Moving on, the TX queue can inherit from the tree so that we can do some smart remove-expired-nodes work while traversing. Some parts of the code traverse the linked list without using its member functions, but I think I can cope with that.

Am I missing something?

@pavel-kirienko
Member Author

That makes sense. I think the optimal approach is to implement the changes in the current upstream, and we will then reimplement them in the rewrite unless @thirtytwobits has a different view (his call).

I guess the backing AVL tree needs access to the allocator.

Either that, or you could use a fixed-capacity array as backing storage, as you described earlier. The size of the array could be parametrized via preprocessor options or template parameters, or a reference to the backing storage could simply be supplied from outside.

@ntakouris

Ok. I'm developing at https://github.com/Zarkopafilis/libuavcan/tree/txq (Check the latest commits if you want to see the progress -- it's not buildable yet)

One more thing I noticed is that CanIOManager::sendFromTxQueue uses peek to send the next frame, which is essentially the first non-expired frame. I guess this can easily be switched to finding the frame with the maximum QoS.

I'm developing without a backing array right now; it seems it could cause many problems with memory copying, since constantly adding and removing items could hurt performance.

With the tree, special care must be taken over when we clean up multiple expired frames. For now I'm sticking to removing only one when out of memory, with perhaps a third retry after cleaning up everything.

@thirtytwobits code reviews, enhancements and what-not are very welcome! :)

I've created an avl_tree under uavcan/util and I'm currently in the process of changing and interfacing uc_can_io.cpp.

@ntakouris

ntakouris commented Jan 20, 2019

Update: I just finished the best CanTxQueue interface I could come up with.

Currently it looks like this:

class UAVCAN_EXPORT CanTxQueue : public AvlTree<CanTxQueueEntry>
{
private:
    static bool AVLTxQueueEntryInsertComparator(const CanTxQueueEntry &lhs, const CanTxQueueEntry &rhs) {
        return rhs.frame.priorityHigherThan(lhs.frame);
    }

    ISystemClock& sysclock_;
    uint32_t rejected_frames_cnt_;

    void safeIncrementRejectedFrames();

public:
    CanTxQueue(IPoolAllocator& allocator, ISystemClock& sysclock, std::size_t allocator_quota)
        : AvlTree(allocator, allocator_quota, AVLTxQueueEntryInsertComparator)
        , sysclock_(sysclock)
        , rejected_frames_cnt_(0)
    {}

    ~CanTxQueue();

    /* Avl Tree allocates the AvlTree::Node, while this(CanTxQueue) allocates the CanTxQueueEntry
     * Same logic for removal. */
    void push(const CanFrame &frame, MonotonicTime tx_deadline, Qos qos, CanIOFlags flags);
    void remove(CanTxQueueEntry*& entry);

    /* Tries to look up rightmost Node. If the frame is expired, garbage-collects all the expired frames */
    const CanTxQueueEntry* getTopPriorityNonExpiredPendingEntry() const;

    /* When OOM, try to avoid garbage-collecting all the expired frames and instead swiftly remove one or two */
    void removeFirstWithQosLowerOrThenEqualThan(Qos qos) const;
    void removeFirstWithQosLargerThan(Qos qos) const;

    // The 'or equal' condition is necessary to avoid frame reordering.
    bool topPriorityHigherOrEqual(const CanFrame& rhs_frame) const;

    uint32_t getRejectedFrameCount() const { return rejected_frames_cnt_; }

    void removeOneExpiredFrame() const;
    void removeAllExpiredFrames() const;
};

It's "optimal" for the following reasons:

  • For smooth operation (that is, without expired nodes), everything happens in the fastest possible way.
  • Whenever there is an OOM error, multiple remedies are attempted in order: first, one expired node is removed; if that fails, we try to remove the first frame with lower QoS; if none is found, equal QoS is tried; if that does not work either, the first frame with higher QoS is tried; the last resort is to garbage-collect every expired frame.
  • When the driver asks for the top-priority pending frame from the _tx_queues, the rightmost node is returned, if not expired. If it is expired, all the expired frames are removed (because the driver is probably ready to transmit).

There is a comment: // The driver will give them a scrutinizing look before deciding whether he wants to accept them. -- which makes me wonder: what does this mean? Can we optimize the algorithm further given more information (if any) about what the driver expects?

(For now, frames are ordered by greatest priority. Perhaps some caching can be done, so that when the highest-priority entry is lower than all the other messages present on the bus, we perform some cleanups before retransmitting?) @pavel-kirienko

The OOM retry might not even be possible without a stop-the-world GC pause: even if we swiftly manage to find space for a new Entry, the AVL tree could fail to allocate a new Node immediately afterwards, because it has no knowledge of the outside world. This is an extreme edge case, but perhaps, on the slightest OOM, we should clean up as much data as we can and then proceed.

That's it for now. (Sorry for the wall of text). I still have some internal AVL-tree implementation to do while these sub-issues are resolved.

@pavel-kirienko
Member Author

Please see #195. I suggest we simplify the logic as described there, for the sake of robustness and determinism (and the ROM footprint too, although that's less important). Do you think that makes sense? If it does, then designing the new interface should be a no-brainer: no cleanups or traversals are needed, and no QoS is used.

@ntakouris

ntakouris commented Jan 24, 2019

I'm almost ready to do a PR but I get this in tests:

2: [----------] 1 test from Node
2: [ RUN      ] Node.Basic
2: UAVCAN: GlobalDataTypeRegistry: Reset; was frozen: 0, num msgs: 0, num srvs: 0
2: UAVCAN: CanIOManager: Memory blocks per iface: 513, total: 1024
2: UAVCAN: CanIOManager: Memory blocks per iface: 513, total: 1024
2: sizeof(uavcan::Node<0>): 2104
2: SIGNAL 11 RECEIVED; STACKTRACE:
2: /work/libuavcan/libuavcan_test_optim[0x657a58]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f14fd32d0e0]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan12CanIOManager4sendERKNS_8CanFrameENS_13MonotonicTimeES4_hNS_3QosEt+0x175)[0x6fcb25]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan10Dispatcher4sendERKNS_5FrameENS_13MonotonicTimeES4_NS_3QosEth+0x9d)[0x6ff88d]
2: /work/libuavcan/libuavcan_test_optim(_ZNK6uavcan14TransferSender4sendEPKhjNS_13MonotonicTimeES3_NS_12TransferTypeENS_6NodeIDENS_10TransferIDE+0x27b)[0x70590b]
2: /work/libuavcan/libuavcan_test_optim(_ZNK6uavcan14TransferSender4sendEPKhjNS_13MonotonicTimeES3_NS_12TransferTypeENS_6NodeIDE+0xc8)[0x705af8]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan20GenericPublisherBase14genericPublishERKNS_24StaticTransferBufferImplENS_12TransferTypeENS_6NodeIDEPNS_10TransferIDENS_13MonotonicTimeE+0xf0)[0x6f58c0]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan18NodeStatusProvider7publishEv+0xfd)[0x6f887d]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan18NodeStatusProvider15startAndPublishENS_16TransferPriorityE+0x39)[0x6f8969]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan4NodeILm1024EE5startENS_16TransferPriorityE+0x48)[0x56dee8]
2: /work/libuavcan/libuavcan_test_optim(_ZN15Node_Basic_Test8TestBodyEv+0x981)[0x5672a1]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x731afe]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4b)[0x72b986]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing4Test3RunEv+0xd1)[0x70daff]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8TestInfo3RunEv+0x103)[0x70e3c1]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8TestCase3RunEv+0xee)[0x70ea52]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x26b)[0x718463]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x732ef4]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x4b)[0x72c708]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8UnitTest3RunEv+0xa7)[0x716f77]
2: /work/libuavcan/libuavcan_test_optim(main+0x4c)[0x4f8f3c]
2: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f14fd319b45]
2: /work/libuavcan/libuavcan_test_optim[0x4fcf24]
2: UAVCAN: CanIOManager: Memory blocks per iface: 9, total: 16
2: UAVCAN: CanIOManager: Memory blocks per iface: 9, total: 16
2: UAVCAN: GenericSubscriber: Stop; dtname=uavcan.protocol.NodeStatus
2: UAVCAN: GenericSubscriber: Start as message listener; dtname=uavcan.protocol.NodeStatus
2: UAVCAN: GlobalDataTypeRegistry: Frozen; num msgs: 2, num srvs: 4
2: UAVCAN: OutgoingTransferRegistry: Created dtid=341 tt=2 dnid=0
2: UAVCAN: TransferSender: prio=16 dtid=341 tt=2 snid=2 dnid=0 sot=1 eot=0 togl=0 tid=0 payload=[]
2/2 Test #2: libuavcan_test_optim .............***Failed    1.09 sec

I'm using dockcross-linux-x64 on macOS Mojave and I don't think I messed with that one. Do you have any idea what is causing this?

@pavel-kirienko
Member Author

Signal 11 means segmentation fault. It could be a dangling pointer or something similar. I suggest using a debugger or Valgrind to find the culprit.

pavel-kirienko added a commit that referenced this issue Mar 6, 2019
[Review & Discuss] Update TxQueue to have a backing AVL Tree (Issues #189 #195)
@pavel-kirienko
Member Author

Fixed in #198
