ethos: move bulk of state machine out of ISR context #17265

miri64 · 2021-11-24T16:38:13Z

Contribution description

This tries to fix #17254 by eliminating the condition that causes the race altogether. It moves the following parts of ethos' state machine out of ISR context, inspired by the way it is done in slipdev:

Sending and replying to HELLO messages
Byte-unstuffing

Some escape handling is still needed in the ISR handler, due to ethos' protocol design, to determine if a received byte must go into the netdev queue (tsrb) or the STDIO queue (isrpipe), but the actual unstuffing is now done in the STDIO and netdev handler threads, respectively.

This causes a slight increase of ROM, as byte unstuffing needs to be now handled in both stdio and the netdev context, but we also gain some RAM, as the ethos_t members framesize, last_framesize, and accept_new are not needed anymore.

We also have the added benefits, that now multiple frames can be added to the respective pipes, separated by the FRAME_DELIMITER byte.

Testing procedure

Compile and flash examples/gnrc_border_router to a board of your choice. To be closer to the setup in the testbed, I used an iotlab-m3 and set ETHOS_BAUDRATE=500000 (but I also tested with samr21-xpro with the same ETHOS_BAUDRATE):

BOARD=iotlab-m3 ETHOS_BAUDRATE=500000 make -C examples/gnrc_border_router/ flash -j term

get the link-layer address of the ethos interface (the one with 6-byte hardware address) using ifconfig and pound that address with multiple ping instances:

for _ in $(seq 5); do
    sudo ping -c 1500 -f fe80::a804:28ff:fefa:f3f8%tap0 &
done

On master I get a significantly worse result, when it comes to packet loss, as with this PR at the cost of slightly higher latency (the cost of cleaner multiplexing, I guess):

master

[…]
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 890 received, 40.6667% packet loss, time 24033ms
rtt min/avg/max/mdev = 7.252/17.322/1016.708/33.617 ms, pipe 60, ipg/ewma 16.032/16.285 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 876 received, 41.6% packet loss, time 24037ms
rtt min/avg/max/mdev = 7.929/18.462/1019.208/47.469 ms, pipe 61, ipg/ewma 16.035/16.224 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 858 received, 42.8% packet loss, time 24041ms
rtt min/avg/max/mdev = 8.087/19.703/1021.549/58.747 ms, pipe 61, ipg/ewma 16.038/16.559 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 841 received, 43.9333% packet loss, time 24051ms
rtt min/avg/max/mdev = 7.576/17.265/1016.057/34.544 ms, pipe 60
, ipg/ewma 16.044/15.698 ms
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 856 received, 42.9333% packet loss, time 24053ms
rtt min/avg/max/mdev = 8.167/17.401/1016.099/34.267 ms, pipe 60, ipg/ewma 16.045/16.766 ms

This PR

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1326 received, 11.6% packet loss, time 21439ms
rtt min/avg/max/mdev = 13.336/41.125/55.888/3.944 ms, pipe 4, ipg/ewma 14.301/40.972 ms
...
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1315 received, 12.3333% packet loss, time 21521ms
rtt min/avg/max/mdev = 20.581/41.220/55.949/4.130 ms, pipe 5, ipg/ewma 14.356/41.088 ms
   
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1297 received, 13.5333% packet loss, time 21639ms
rtt min/avg/max/mdev = 12.583/41.019/56.037/4.523 ms, pipe 4, ipg/ewma 14.435/24.055 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1285 received, 14.3333% packet loss, time 21690ms
rtt min/avg/max/mdev = 13.235/41.065/55.867/4.555 ms, pipe 4, ipg/ewma 14.469/23.689 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1287 received, 14.2% packet loss, time 21706ms
rtt min/avg/max/mdev = 12.578/41.029/56.543/4.845 ms, pipe 4, ipg/ewma 14.480/21.693 ms

Issues/PRs references

Fixes #17254

miri64 · 2021-11-24T16:43:42Z

Forgot to mention: As the stdio now also needs special handling in the thread layer, this moves the stdio_ethos implementation to its own ethos submodule (maybe also a step towards #13469).

chrysn · 2021-11-24T17:29:01Z

I'm having a look, but I'm not too familiar with the ethos subsystem overall.

If we already had subsystem maintainer reviews, would you consider yourself suitable as an ethos subsystem maintainer? If so, I think I can provide sufficient review, otherwise I'm still doing the same but would ask around for someone who can give this a high-level Go just from the description you provide in the PR.

chrysn

Not related to concrete code changes (at least such that GitHub would let me annotate):

In stdio_uart, #ifdef M_S_ETHOS / uart_init(ETHOS_UART, ...); remains; is that intentional? (Seems like something was missed in the split).

Other than that (and barring testing which I'd do on the complete thing), the ethos move LGTM.

miri64 · 2021-11-24T18:36:08Z

In stdio_uart, #ifdef M_S_ETHOS / uart_init(ETHOS_UART, ...); remains; is that intentional? (Seems like something was missed in the split).

Arghs forgot to remove that change. Will do ASAP.

chrysn

Incomplete review; one nit, one "wat?".

drivers/ethos/ethos.c

chrysn · 2021-11-24T19:28:45Z

drivers/ethos/ethos.c

+                DEBUG("ethos _recv(): inbuf doesn't contain enough bytes.\n");
+                return -EIO;
+            }
+            tmp = ethos_unstuff_readbyte(ptr, (uint8_t)byte, &escaped, &frametype);


I don't understand the escaping yet (stay with me, I think I start understanding), but maybe you can clarify this easily: If a stdio_read() is started between having received an escape and the escaped data, will this not produce an erroneous read?

Say the buffer is filled with a text frame, and another frame (that won't fit in the buffer yet) is being filled in. When the buffer is full, stdio_read is called and (with speed varying depending on which high-priority threads are running) pulls out data byte by byte, creating on and off opportunity for isrpipe_write_one to push data.

Frametype stays text, data keeps flowing in and getting dropped on and off before the reader code even gets to the being-corrupted frame, and all that happens on isrpipe_write_one failure is that dev->inbuf gets touched (which is weird as I'd understand that to be network related).

The more I think of this, the more it seems to me that the puts("lost frame"); code in the ETHOS_FRAME_TYPE_TEXT case doesn't do what it should. Maybe it should set the frame type to FRAME_TYPE_ERRORED and thus ensure that the rest of the frame is discarded (rather than risking that individual characters get eaten and possibly misassembled at the stdio reading site)?

(Not sure of any of this, but it seems weird; gotta cut it short now and will continue at a later point.)

Not 100% sure I understood your scenario, so let me just describe how the buffers and escaping work:

For data, hello and hello reply packets, the frame-type is determined by the head, consisting of the escaped frame type or just (escaped) data, when it is data which is pushed to the tsrb with the remainder of the packet. Also: at the end of the frame there is a frame delimiter character (if it appears in the frame it is escaped). So if multiple frames are in the ring buffer, one frame will be read until the frame delimiter (and the type determined by its head) and the same is true for the next.

Text frames (which in the ISR arrive in sequence with data frames) are not put to the tsrb ring buffer, but to the isrpipe. Since only text frames go there (and since writing back the escape + type seemed to be more trouble than its worth) they are not identified to the handler thread with a head, but are delimited by the frame delimiter.

In both buffers, the data is escaped similar to SLIP, so if the frame delimiter or the escape byte appear within the packet, they are escaped with the byte sequence, {escape_byte, c ^ 0x20}.

Packet types (at least as far as I understand it) are only escaped at the beginning of the packet. This is where the protocol design is somewhat flawed, because it seems a bit like a "because we can approach" and also makes it necessary to do at least some rudimentary unstuffing to determine into which buffer (data or text) the frame should go.

Thanks for the input, with that I'd try to make the point more precise:

If overflow happens in _handle_char for Ethernet frames, the whole frame is dropped (that's how I understand the inbuf.{reads,writes} - 0), and thanks to the frametype being reset, further data is ignored (is it? ->frametype is also reset, but we're still in the case IN_FRAME of ethos_isr, so wouldn't the next byte trigger the default case assert?). Dropping a frame is relatively safe (a counter might be nice but adding features is not what this PR is about).

If overflow happens in _handle_char on text frames, the network frame is dropped (?; we're not in one), but otherwise the error is ignored.

As text may be read piecemeal, we can't "cancel" the whole text frame (some of it may have been consumed, that's OK), but IMO the sane behavior would be to discard the rest of the frame -- for the alternative is that while more bytes come in, some may be accepted and some not, and the ISR (rightfully, after this PR) feeds all the characters until end of frame to the isrpipe. For example, if the data stream is "H e l l o Escape EscapedEscape", and the Escape gets dropped but the EscapedEscape does not, the stdio reader would not read the bytes "H e l l o Escape" (which there should be) but "H e l l o EscapedEscape".

I think that on the UART, losing text is OK -- receiving wrong text is not. I have a slight preference for losing chunks of text over losing individual characters, but (now that escapes are decoded after the isrpipe) losing individual characters has the downside of possibly producing wrong text, and thus, losing a chunk (the rest of the text frame, ) it should be.

I think I start to understand what you mean. Will fix.

miri64 · 2021-11-24T22:34:06Z

Not related to concrete code changes (at least such that GitHub would let me annotate):
* In stdio_uart, `#ifdef M_S_ETHOS` / `uart_init(ETHOS_UART, ...);` remains; is that intentional? (Seems like something was missed in the split).
Other than that (and barring testing which I'd do on the complete thing), the ethos move LGTM.

Addressed with the other issues pointed out above.

chrysn · 2021-11-25T07:42:23Z

drivers/ethos/ethos.c

+            if (isrpipe_write_one(&ethos_stdio_isrpipe, c) < 0) {
+                //puts("lost frame");
+                dev->inbuf.reads = 0;
+                dev->inbuf.writes = 0;


This is the location where I should have started the thread on what losing bytes on the stdio byte would do to escaping.

I hardened the error handling in the following way in ~~febe559~~ ef54d1c (edit: removed the mutex_lock/mutex_unlock around the isrpipe's reset, as we are in ISR context for that function):

The previous frametype determines which buffer (tsrb or isrpipe) is operated on.

The buffer is reset (both reads and writes are set to 0)

A ETHOS_FRAME_DELIMITER is added to the now reset buffer (as the buffer was now reset it will even work if the error was caused by an overflown buffer), to tell the handler thread to stop reading and just truncate the frame as read so far.

miri64 · 2021-12-07T11:47:18Z

Rebased and adopted for the tsrb_clear() function introduced in #17337.

miri64 · 2021-12-10T09:28:57Z

Ping? This should go in for 2021.01 IMHO.

chrysn · 2021-12-14T18:26:07Z

LGTM, please squash. I'll give version merged after squashing a test run and then I think this can be done. Sorry for the holdup.

This moves the following parts of ethos' state machine out of ISR context: - Sending and replying to HELLO messages - Byte-unstuffing Some escape handling is still needed in the ISR handler, due to ethos' protocol design, to determine if a received byte must go into the netdev queue (tsrb) or the STDIO queue (isrpipe), but the actual unstuffing is now done in the STDIO and netdev handler threads, respectively.

miri64 · 2021-12-15T15:15:47Z

LGTM, please squash. I'll give version merged after squashing a test run and then I think this can be done. Sorry for the holdup.

Squashed.

chrysn

Tested by running gnrc_border_router on a microbit-v2.

Command line interaction works, network works, and success rate on flood pings is vastly improved.

ACK, and thanks for fixing this.

benpicco · 2022-02-03T10:01:21Z

Now this broke standalong ethos (without ethos_stdio)

/home/benpicco/dev/RIOT/drivers/ethos/ethos.c: In function '_fail_frame':
/home/benpicco/dev/RIOT/drivers/ethos/ethos.c:92:25: error: 'ethos_stdio_isrpipe' undeclared (first use in this function)
   92 |             tsrb_clear(&ethos_stdio_isrpipe.tsrb);
      |                         ^~~~~~~~~~~~~~~~~~~
/home/benpicco/dev/RIOT/drivers/ethos/ethos.c:92:25: note: each undeclared identifier is reported only once for each function it appears in
/home/benpicco/dev/RIOT/drivers/ethos/ethos.c:95:13: error: implicit declaration of function 'isrpipe_write_one' [-Werror=implicit-function-declaration]
   95 |             isrpipe_write_one(&ethos_stdio_isrpipe, ETHOS_FRAME_DELIMITER);
      |             ^~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

fjmolinas · 2022-03-30T11:04:51Z

With this PR for some reason, ethos + riotboot causes the shell to continuously exit, I'm investigating but if someone has an idea:

FEATURES_REQUIRED=riotboot USEMODULE=riotboot BOARD=dwm1001 make -C examples/gnrc_border_router/ flash term -j7
NETOPT_RX_END_IRQ not implemented by driver
gnrc_uhcpc: Using 8 as border interface and 7 as wireless interface.
udp_sendto(): sendto(): Cannot assign requested address
main(): This is RIOT! (Version: 2022.01-devel-1018-g820a3-HEAD)
RIOT border router example application
All up, running the shell now
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >

fjmolinas · 2022-03-31T06:35:00Z

With this PR for some reason, ethos + riotboot causes the shell to continuously exit, I'm investigating but if someone has an idea:

FEATURES_REQUIRED=riotboot USEMODULE=riotboot BOARD=dwm1001 make -C examples/gnrc_border_router/ flash term -j7
NETOPT_RX_END_IRQ not implemented by driver
gnrc_uhcpc: Using 8 as border interface and 7 as wireless interface.
udp_sendto(): sendto(): Cannot assign requested address
main(): This is RIOT! (Version: 2022.01-devel-1018-g820a3-HEAD)
RIOT border router example application
All up, running the shell now
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >

Its the second commit 820a397, but I didn't all the time too look into it as I hoped yesterday, is there anything obvious that could be at fault here @miri64?

fjmolinas · 2022-03-31T06:43:26Z

Its the second commit 820a397, but I didn't all the time too look into it as I hoped yesterday, is there anything obvious that could be at fault here @miri64?

I didn't test on iotlab-m3 but samr21-xpro and dwm1001

miri64 · 2022-03-31T07:32:50Z

Its the second commit 820a397, but I didn't all the time too look into it as I hoped yesterday, is there anything obvious that could be at fault here @miri64?

Not sure, but I observed the behavior you mention also sometimes on iotlab-m3. The cause of this is unclear. Maybe some combination of Ctrl+D or Ctrl+C isn't properly handled by the stdio part? Or does the node wake up like this in your case?

fjmolinas · 2022-03-31T07:43:30Z

Not sure, but I observed the behavior you mention also sometimes on iotlab-m3. The cause of this is unclear. Maybe some combination of Ctrl+D or Ctrl+C isn't properly handled by the stdio part? Or does the node wake up like this in your case?

It seems to happen after flashing only, resetting the BOARD after seems to fix it though.

miri64 · 2022-03-31T07:49:43Z

It seems to happen after flashing only, resetting the BOARD after seems to fix it though.

Still not clear to me if this happens when you just started the board or if you hit any key.

fjmolinas · 2022-03-31T07:55:38Z

It seems to happen after flashing only, resetting the BOARD after seems to fix it though.

Still not clear to me if this happens when you just started the board or if you hit any key.

At start, once the shell is launched.

miri64 · 2022-03-31T07:58:14Z

Ok, that I did not observe on iotlab-m3 :-/. Maybe there is a broken condition which makes the shell exit and then start again? Have you checked if the ethos part at the host side may be the cause?

miri64 added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label Nov 24, 2021

miri64 requested review from chrysn, jia200x and kaspar030 November 24, 2021 16:38

github-actions bot added Area: build system Area: Build system Area: drivers Area: Device drivers Area: sys Area: System labels Nov 24, 2021

miri64 mentioned this pull request Nov 24, 2021

ethos: race condition seems to cause packet loss #17254

Closed

miri64 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Nov 24, 2021

chrysn reviewed Nov 24, 2021

View reviewed changes

chrysn reviewed Nov 25, 2021

View reviewed changes

miri64 force-pushed the ethos/fix/threadify-state-machine branch from febe559 to ef54d1c Compare November 25, 2021 13:30

miri64 mentioned this pull request Dec 6, 2021

sys/tsrb: add tsrb_clear #17337

Merged

miri64 force-pushed the ethos/fix/threadify-state-machine branch from a3f8fa9 to 3e30300 Compare December 7, 2021 11:45

miri64 added 2 commits December 15, 2021 16:15

stdio_ethos: move to ethos_stdio

568be10

miri64 force-pushed the ethos/fix/threadify-state-machine branch from 3e30300 to 820a397 Compare December 15, 2021 15:15

chrysn approved these changes Dec 15, 2021

View reviewed changes

miri64 merged commit e062ad7 into RIOT-OS:master Dec 16, 2021

miri64 deleted the ethos/fix/threadify-state-machine branch December 16, 2021 11:24

fjmolinas added this to the Release 2022.01 milestone Jan 21, 2022

benpicco mentioned this pull request Feb 3, 2022

drivers/ethos: fix build without ethos_stdio #17608

Merged

fjmolinas mentioned this pull request Apr 21, 2022

stdio_ethos: infinite shell loop #17972

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ethos: move bulk of state machine out of ISR context #17265

ethos: move bulk of state machine out of ISR context #17265

miri64 commented Nov 24, 2021

miri64 commented Nov 24, 2021

chrysn commented Nov 24, 2021

chrysn left a comment

miri64 commented Nov 24, 2021

chrysn left a comment

chrysn Nov 24, 2021

miri64 Nov 24, 2021

chrysn Nov 25, 2021

miri64 Nov 25, 2021

miri64 commented Nov 24, 2021

chrysn Nov 25, 2021

miri64 Nov 25, 2021 •

edited

Loading

miri64 commented Dec 7, 2021

miri64 commented Dec 10, 2021

chrysn commented Dec 14, 2021 •

edited

Loading

miri64 commented Dec 15, 2021

chrysn left a comment

benpicco commented Feb 3, 2022 •

edited

Loading

fjmolinas commented Mar 30, 2022

fjmolinas commented Mar 31, 2022

fjmolinas commented Mar 31, 2022

miri64 commented Mar 31, 2022

fjmolinas commented Mar 31, 2022

miri64 commented Mar 31, 2022

fjmolinas commented Mar 31, 2022

miri64 commented Mar 31, 2022

ethos: move bulk of state machine out of ISR context #17265

ethos: move bulk of state machine out of ISR context #17265

Conversation

miri64 commented Nov 24, 2021

Contribution description

Testing procedure

master

This PR

Issues/PRs references

miri64 commented Nov 24, 2021

chrysn commented Nov 24, 2021

chrysn left a comment

Choose a reason for hiding this comment

miri64 commented Nov 24, 2021

chrysn left a comment

Choose a reason for hiding this comment

chrysn Nov 24, 2021

Choose a reason for hiding this comment

miri64 Nov 24, 2021

Choose a reason for hiding this comment

chrysn Nov 25, 2021

Choose a reason for hiding this comment

miri64 Nov 25, 2021

Choose a reason for hiding this comment

miri64 commented Nov 24, 2021

chrysn Nov 25, 2021

Choose a reason for hiding this comment

miri64 Nov 25, 2021 • edited Loading

Choose a reason for hiding this comment

miri64 commented Dec 7, 2021

miri64 commented Dec 10, 2021

chrysn commented Dec 14, 2021 • edited Loading

miri64 commented Dec 15, 2021

chrysn left a comment

Choose a reason for hiding this comment

benpicco commented Feb 3, 2022 • edited Loading

fjmolinas commented Mar 30, 2022

fjmolinas commented Mar 31, 2022

fjmolinas commented Mar 31, 2022

miri64 commented Mar 31, 2022

fjmolinas commented Mar 31, 2022

miri64 commented Mar 31, 2022

fjmolinas commented Mar 31, 2022

miri64 commented Mar 31, 2022

miri64 Nov 25, 2021 •

edited

Loading

chrysn commented Dec 14, 2021 •

edited

Loading

benpicco commented Feb 3, 2022 •

edited

Loading