
IO performance is highly dependent on the size of the TX queue #189

Closed
pavel-kirienko opened this issue Dec 3, 2018 · 12 comments

@pavel-kirienko
Member

Whenever the library emits a frame, it puts it into a prioritized transmission queue. There is one such TX queue per interface. Each entry in the queue has a fixed lifetime so that if the interface is saturated or down, old entries can be removed in favor of newer ones.

The queue is a sorted linked list of entries, so insertion has linear time complexity. This causes severe performance degradation when an interface is down, especially in complex applications where the TX queues are allowed to use a lot of memory.

The worst thing about it is that it can lead to bad side effects, like a multi-interface node failing to meet its real-time constraints when an interface goes down because the library is suddenly eating all its processing time.

There are two okay solutions:

@ntakouris

  • Regarding the smarter log-time data structure, would a simple search tree be sufficient? Things could be pushed a bit further with a smarter self-balancing structure (like a red-black tree), though insertion becomes somewhat more costly. Memory use also grows by one bit per entry, since the color information must be stored, but the worst case is largely avoided and the amortized insertion cost is constant. A lighter-weight alternative would be an AVL tree.

I'm unaware of the details of this library, but if some nodes are accessed much more frequently than others, we could use a splay tree (it pushes the most-used nodes closer to the root). The drawback is that its internal representation changes on every read (although this could be disabled after a 'characteristic' number of reads, to make read performance more predictable).

I guess that in order to decide on the best approach, simulations need to be run with a characteristic payload that covers as many usage edge cases as possible.

(I'm sorry if this is totally irrelevant; I'm trying to find my way into UAVs.)

@pavel-kirienko
Member Author

Thanks for your input. Most applications that libuavcan (or UAVCAN in general) is designed for don't really care about the average case; we must optimize for the worst case. Hence, while the experience of non-real-time systems such as Android may still be valuable, it can't be applied here directly.

@ntakouris

A simple height-balanced tree (AVL) can be used, then. If extra lookup performance is needed at the cost of some extra work on insertion, a red-black tree is the way to go. Either can be implemented over an underlying array to take advantage of CPU caching.

(Android is just an example of how to back a map interface with an array.)

@pavel-kirienko
Member Author

Would you be interested to draft up a proof-of-concept implementation? (@thirtytwobits FYI)

@ntakouris

ntakouris commented Jan 19, 2019

I'd be willing to.

Basically, these changes need to be made in can_io.hpp / uc_can_io.cpp:

  • Modify CanTxQueue so that it has an underlying AVL tree (some extra balancing helper functions will be added) instead of the LinkedListRoot.
  • Change push, peek, remove, etc. to work with the AVL tree.
  • Find the usages of queue_ in uc_can_io.cpp and fix whatever the new structure breaks.
  • Tests.

I guess the backing AVL tree needs access to the allocator. Moving on, the TX queue can inherit from the tree so that we can do some smart remove-expired-nodes work while traversing. Some parts of the code traverse the linked list without using its member functions, but I think I can cope with that.

Am I missing something?

@pavel-kirienko
Member Author

That makes sense. I think the optimal approach is to implement the changes in the current upstream, and we will then reimplement them in the rewrite unless @thirtytwobits has a different view (his call).

I guess the backing AVL tree needs access to the allocator.

Either that, or you could use a fixed-capacity array as backing storage, as you described earlier. The size of the array could be parametrized via preprocessor options or template parameters, or a reference to the backing storage could simply be supplied from outside.

@ntakouris

Ok. I'm developing at https://github.com/Zarkopafilis/libuavcan/tree/txq (Check the latest commits if you want to see the progress -- it's not buildable yet)

One more thing I noticed is that CanIOManager::sendFromTxQueue uses peek to send the next frame, which is essentially the first non-expired frame. I guess this can easily be switched to finding the frame with the maximum QoS.

I'm developing without a backing array right now; it seems it could cause many problems with memory copying, since constantly adding and removing items could hurt performance.

With the tree, special care must be taken over when we clean up multiple expired frames. For now I'm sticking to removing only one when out of memory, with perhaps a third retry after cleaning up everything.

@thirtytwobits code reviews, enhancements and what-not are very welcome! :)

I've created an avl_tree under uavcan/util and I'm currently in the process of changing and interfacing uc_can_io.cpp.

@ntakouris

ntakouris commented Jan 20, 2019

Update: I just finished the best CanTxQueue interface I could come up with.

Currently it looks like this:

class UAVCAN_EXPORT CanTxQueue : public AvlTree<CanTxQueueEntry>
{
private:
    static bool AVLTxQueueEntryInsertComparator(const CanTxQueueEntry &lhs, const CanTxQueueEntry &rhs) {
        return rhs.frame.priorityHigherThan(lhs.frame);
    }

    ISystemClock& sysclock_;
    uint32_t rejected_frames_cnt_;

    void safeIncrementRejectedFrames();

public:
    CanTxQueue(IPoolAllocator& allocator, ISystemClock& sysclock, std::size_t allocator_quota)
        : AvlTree(allocator, allocator_quota, AVLTxQueueEntryInsertComparator)
        , sysclock_(sysclock)
        , rejected_frames_cnt_(0)
    {}

    ~CanTxQueue();

    /* Avl Tree allocates the AvlTree::Node, while this(CanTxQueue) allocates the CanTxQueueEntry
     * Same logic for removal. */
    void push(const CanFrame &frame, MonotonicTime tx_deadline, Qos qos, CanIOFlags flags);
    void remove(CanTxQueueEntry*& entry);

    /* Tries to look up rightmost Node. If the frame is expired, garbage-collects all the expired frames */
    const CanTxQueueEntry* getTopPriorityNonExpiredPendingEntry() const;

    /* When OOM, try to avoid garbage-collecting all the expired frames and instead swiftly remove one or two */
    void removeFirstWithQosLowerOrThenEqualThan(Qos qos) const;
    void removeFirstWithQosLargerThan(Qos qos) const;

    // The 'or equal' condition is necessary to avoid frame reordering.
    bool topPriorityHigherOrEqual(const CanFrame& rhs_frame) const;

    uint32_t getRejectedFrameCount() const { return rejected_frames_cnt_; }

    void removeOneExpiredFrame() const;
    void removeAllExpiredFrames() const;
};

It's "optimal" for the following reasons:

  • For smooth operation (that is, without expired nodes), everything happens in the fastest possible way.
  • Whenever there is an OOM error, multiple remedies are attempted in order: first, one expired node is removed; if that fails, we try to remove the first frame with lower QoS; if none is found, equal QoS is tried; if that does not work either, the first frame with higher QoS is tried; the last resort is to garbage-collect every expired frame.
  • When the driver asks for the top-priority pending frame from the _tx_queues, the rightmost node is returned, if not expired. If it is expired, all the expired frames are removed (because the driver is probably ready to transmit).

There is a comment: // The driver will give them a scrutinizing look before deciding whether he wants to accept them. -- which makes me wonder: what does this mean? Can we optimize the algorithm further given more information (if any) about what the driver expects?

(For now, frames are ordered by greatest priority. Perhaps some caching can be done, so that when the highest-priority entry is lower than all the other messages present on the bus, we perform some cleanups before retransmitting?) @pavel-kirienko

The OOM retry might not even be possible without a stop-the-world GC pause: even if we swiftly manage to find space for a new Entry, the AVL tree could fail to allocate a new Node immediately afterwards, because it has no knowledge of the outside world. This is an extreme edge case, but perhaps, on the slightest OOM, we should clean up as much data as we can and then proceed.

That's it for now. (Sorry for the wall of text). I still have some internal AVL-tree implementation to do while these sub-issues are resolved.

@pavel-kirienko
Member Author

Please see #195. I suggest we simplify the logic as described there, for the sake of robustness and determinism (and the ROM footprint too, although that's less important). Do you think that makes sense? If it does, then designing the new interface should be a no-brainer: no cleanups or traversals are needed, and no QoS is used.

@ntakouris

ntakouris commented Jan 24, 2019

I'm almost ready to do a PR but I get this in tests:

2: [----------] 1 test from Node
2: [ RUN      ] Node.Basic
2: UAVCAN: GlobalDataTypeRegistry: Reset; was frozen: 0, num msgs: 0, num srvs: 0
2: UAVCAN: CanIOManager: Memory blocks per iface: 513, total: 1024
2: UAVCAN: CanIOManager: Memory blocks per iface: 513, total: 1024
2: sizeof(uavcan::Node<0>): 2104
2: SIGNAL 11 RECEIVED; STACKTRACE:
2: /work/libuavcan/libuavcan_test_optim[0x657a58]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f14fd32d0e0]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan12CanIOManager4sendERKNS_8CanFrameENS_13MonotonicTimeES4_hNS_3QosEt+0x175)[0x6fcb25]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan10Dispatcher4sendERKNS_5FrameENS_13MonotonicTimeES4_NS_3QosEth+0x9d)[0x6ff88d]
2: /work/libuavcan/libuavcan_test_optim(_ZNK6uavcan14TransferSender4sendEPKhjNS_13MonotonicTimeES3_NS_12TransferTypeENS_6NodeIDENS_10TransferIDE+0x27b)[0x70590b]
2: /work/libuavcan/libuavcan_test_optim(_ZNK6uavcan14TransferSender4sendEPKhjNS_13MonotonicTimeES3_NS_12TransferTypeENS_6NodeIDE+0xc8)[0x705af8]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan20GenericPublisherBase14genericPublishERKNS_24StaticTransferBufferImplENS_12TransferTypeENS_6NodeIDEPNS_10TransferIDENS_13MonotonicTimeE+0xf0)[0x6f58c0]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan18NodeStatusProvider7publishEv+0xfd)[0x6f887d]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan18NodeStatusProvider15startAndPublishENS_16TransferPriorityE+0x39)[0x6f8969]
2: /work/libuavcan/libuavcan_test_optim(_ZN6uavcan4NodeILm1024EE5startENS_16TransferPriorityE+0x48)[0x56dee8]
2: /work/libuavcan/libuavcan_test_optim(_ZN15Node_Basic_Test8TestBodyEv+0x981)[0x5672a1]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x65)[0x731afe]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4b)[0x72b986]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing4Test3RunEv+0xd1)[0x70daff]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8TestInfo3RunEv+0x103)[0x70e3c1]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8TestCase3RunEv+0xee)[0x70ea52]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x26b)[0x718463]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x65)[0x732ef4]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x4b)[0x72c708]
2: /work/libuavcan/libuavcan_test_optim(_ZN7testing8UnitTest3RunEv+0xa7)[0x716f77]
2: /work/libuavcan/libuavcan_test_optim(main+0x4c)[0x4f8f3c]
2: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f14fd319b45]
2: /work/libuavcan/libuavcan_test_optim[0x4fcf24]
2: UAVCAN: CanIOManager: Memory blocks per iface: 9, total: 16
2: UAVCAN: CanIOManager: Memory blocks per iface: 9, total: 16
2: UAVCAN: GenericSubscriber: Stop; dtname=uavcan.protocol.NodeStatus
2: UAVCAN: GenericSubscriber: Start as message listener; dtname=uavcan.protocol.NodeStatus
2: UAVCAN: GlobalDataTypeRegistry: Frozen; num msgs: 2, num srvs: 4
2: UAVCAN: OutgoingTransferRegistry: Created dtid=341 tt=2 dnid=0
2: UAVCAN: TransferSender: prio=16 dtid=341 tt=2 snid=2 dnid=0 sot=1 eot=0 togl=0 tid=0 payload=[]
2/2 Test #2: libuavcan_test_optim .............***Failed    1.09 sec

I'm using dockcross-linux-x64 on macOS Mojave and I don't think I messed with that one. Do you have any idea what is causing this?

@pavel-kirienko
Member Author

Signal 11 means segmentation fault. It could be a dangling pointer or something similar. I suggest using a debugger or Valgrind to find the culprit.

pavel-kirienko added a commit that referenced this issue Mar 6, 2019
[Review & Discuss] Update TxQueue to have a backing AVL Tree (Issues #189 #195)
@pavel-kirienko
Member Author

Fixed in #198
