TID distance of 1 may cause timed out state to be selected for incoming packet #129

Closed
magshaw opened this issue Jun 14, 2019 · 5 comments

@magshaw (Contributor) commented Jun 14, 2019

We've been running a system with lots of nodes, some sharing the same set of messages.

We found that after a target node has timed out once, the system enters a strange state where new messages (with a higher TID) pick up state from an old transfer and are not processed because of TID/TOGGLE mismatches.

On further inspection, tracing this back through the code, I think there is a bug in the computation of the not_previous_tid variable: it causes needs_restart to stay false, so the transfer is rejected for bad TIDs and TOGGLEs.

Replacing:

    // Original check: only a forward distance greater than 1 counts as "not the previous TID"
    const bool not_previous_tid =
        computeTransferIDForwardDistance((uint8_t) rx_state->transfer_id, TRANSFER_ID_FROM_TAIL_BYTE(tail_byte)) > 1;

with:

    // Proposed check: any non-zero forward distance counts as "not the previous TID"
    const bool not_previous_tid =
        computeTransferIDForwardDistance((uint8_t) rx_state->transfer_id, TRANSFER_ID_FROM_TAIL_BYTE(tail_byte)) != 0;

fixes this issue. This also seems plausible on its face, since a forward distance of 1 should indeed count as not_previous_tid.

Does this seem reasonable or are we maybe missing something here?

pavel-kirienko self-assigned this Jun 14, 2019
@pavel-kirienko (Member)

I apologize for the slow response. I read your post back in June, but I knew immediately that sorting this out was going to take time, so I had to table the issue until later. That later is now.

> Does this seem reasonable or are we maybe missing something here?

I've just spent an hour digging through my old records trying to understand what is happening. Long story short, the problem is that the arguments to computeTransferIDForwardDistance() are swapped. Being human, I am prone to mistakes of that sort.

This is the pseudocode from the Specification:

[figure: Specification pseudocode for comparing the received transfer-ID against the local state]

Observe that the local state is on the right; the received transfer-ID is on the left. The algorithm requires us to compute the number of increment operations that need to be applied to the received value in order to equalize it with the local state. This algorithm is implemented correctly in libuavcan and pyuavcan:

https://github.com/UAVCAN/libuavcan/blob/67e56232362aea9f9606d0e80454e6abcae5ff5a/libuavcan/src/transport/uc_transfer_receiver.cpp#L201

https://github.com/UAVCAN/pyuavcan/blob/281680b28657574818348ec5a67347f129525d44/pyuavcan/transport/can/_session/_transfer_receiver.py#L53
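
For context, the helper itself is plain modular arithmetic. Here is a minimal sketch, assuming the 5-bit transfer-ID of UAVCAN/CAN so distances wrap modulo 32 (behaviorally equivalent to the libcanard helper, not a verbatim quote of it):

    #include <stdint.h>

    #define TRANSFER_ID_BIT_LEN 5U  /* transfer-ID width on UAVCAN/CAN */

    /* Number of increments needed to advance `a` until it equals `b`,
     * wrapping modulo 2^TRANSFER_ID_BIT_LEN; inputs are assumed to lie
     * within the valid 5-bit range. */
    static int16_t computeTransferIDForwardDistance(uint8_t a, uint8_t b)
    {
        int16_t d = (int16_t)(b - a);
        if (d < 0)
        {
            d = (int16_t)(d + (int16_t)(1U << TRANSFER_ID_BIT_LEN));
        }
        return d;
    }

With this definition, computeTransferIDForwardDistance(received, local) == 1 identifies a frame that duplicates the previous transfer, which is exactly the case the > 1 comparison is meant to exclude.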

Notice that in libcanard the argument order is different, hence the problem you've observed. Swapping the argument order on your branch while keeping the correct comparison > 1 (strictly greater) makes your unit tests pass.
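
Concretely, using the variable names from the snippet quoted above, the corrected check would look roughly like this (a sketch of the argument swap, not the exact patch):

    /* Forward distance from the RECEIVED transfer-ID to the LOCAL state;
     * a distance of exactly 1 means this frame belongs to the previous transfer. */
    const bool not_previous_tid =
        computeTransferIDForwardDistance(TRANSFER_ID_FROM_TAIL_BYTE(tail_byte),
                                         (uint8_t) rx_state->transfer_id) > 1;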

@mike7c2 commented Aug 19, 2019

Hi Pavel, my apologies too, things have been hectic at my end! I will get this done and back to you tonight/tomorrow.

@pavel-kirienko (Member)

Fixed in #142

@ImFovery

We ran into the same issue and found the cause: the sender should use an independent transfer_id for each message, rather than having all messages share a single incrementing transfer_id.

@pavel-kirienko (Member)

A publishing node shall use a separate transfer-ID counter per subject (topic). Usage of a shared transfer-ID counter for all subjects is non-compliant and will cause issues.
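
For illustration, a publisher could keep one counter per subject rather than a single global one. This sketch is hypothetical and not part of the libcanard API; the names are invented:

    #include <stdint.h>

    /* Hypothetical per-subject publisher state: each subject (topic)
     * owns its own transfer-ID counter. */
    typedef struct
    {
        uint16_t subject_id;
        uint8_t  transfer_id;
    } SubjectPublisher;

    /* Return the transfer-ID to use for the next transfer on this
     * subject, then advance the counter within the 5-bit range. */
    static uint8_t nextTransferID(SubjectPublisher* pub)
    {
        const uint8_t tid = pub->transfer_id;
        pub->transfer_id = (uint8_t)((pub->transfer_id + 1U) % 32U);
        return tid;
    }

Sharing one counter across all subjects makes receivers observe large or irregular TID jumps on each individual subject, which trips exactly the distance checks discussed above.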
