Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ethos: move bulk of state machine out of ISR context #17265

Merged
merged 2 commits into from
Dec 16, 2021

Conversation

miri64
Copy link
Member

@miri64 miri64 commented Nov 24, 2021

Contribution description

This tries to fix #17254 by eliminating the condition that causes the race altogether. It moves the following parts of ethos' state machine out of ISR context, inspired by the way it is done in slipdev:

  • Sending and replying to HELLO messages
  • Byte-unstuffing

Some escape handling is still needed in the ISR handler, due to ethos' protocol design, to determine if a received byte must go into the netdev queue (tsrb) or the STDIO queue (isrpipe), but the actual unstuffing is now done in the STDIO and netdev handler threads, respectively.

This causes a slight increase of ROM, as byte unstuffing needs to be now handled in both stdio and the netdev context, but we also gain some RAM, as the ethos_t members framesize, last_framesize, and accept_new are not needed anymore.

We also have the added benefits, that now multiple frames can be added to the respective pipes, separated by the FRAME_DELIMITER byte.

Testing procedure

Compile and flash examples/gnrc_border_router to a board of your choice. To be closer to the setup in the testbed, I used an iotlab-m3 and set ETHOS_BAUDRATE=500000 (but I also tested with samr21-xpro with the same ETHOS_BAUDRATE):

BOARD=iotlab-m3 ETHOS_BAUDRATE=500000 make -C examples/gnrc_border_router/ flash -j term

get the link-layer address of the ethos interface (the one with 6-byte hardware address) using ifconfig and pound that address with multiple ping instances:

for _ in $(seq 5); do
    sudo ping -c 1500 -f fe80::a804:28ff:fefa:f3f8%tap0 &
done

On master I get a significantly worse result, when it comes to packet loss, as with this PR at the cost of slightly higher latency (the cost of cleaner multiplexing, I guess):

master

[…]
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 890 received, 40.6667% packet loss, time 24033ms
rtt min/avg/max/mdev = 7.252/17.322/1016.708/33.617 ms, pipe 60, ipg/ewma 16.032/16.285 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 876 received, 41.6% packet loss, time 24037ms
rtt min/avg/max/mdev = 7.929/18.462/1019.208/47.469 ms, pipe 61, ipg/ewma 16.035/16.224 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 858 received, 42.8% packet loss, time 24041ms
rtt min/avg/max/mdev = 8.087/19.703/1021.549/58.747 ms, pipe 61, ipg/ewma 16.038/16.559 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 841 received, 43.9333% packet loss, time 24051ms
rtt min/avg/max/mdev = 7.576/17.265/1016.057/34.544 ms, pipe 60
, ipg/ewma 16.044/15.698 ms
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 856 received, 42.9333% packet loss, time 24053ms
rtt min/avg/max/mdev = 8.167/17.401/1016.099/34.267 ms, pipe 60, ipg/ewma 16.045/16.766 ms

This PR

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1326 received, 11.6% packet loss, time 21439ms
rtt min/avg/max/mdev = 13.336/41.125/55.888/3.944 ms, pipe 4, ipg/ewma 14.301/40.972 ms
...
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1315 received, 12.3333% packet loss, time 21521ms
rtt min/avg/max/mdev = 20.581/41.220/55.949/4.130 ms, pipe 5, ipg/ewma 14.356/41.088 ms
   
--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1297 received, 13.5333% packet loss, time 21639ms
rtt min/avg/max/mdev = 12.583/41.019/56.037/4.523 ms, pipe 4, ipg/ewma 14.435/24.055 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1285 received, 14.3333% packet loss, time 21690ms
rtt min/avg/max/mdev = 13.235/41.065/55.867/4.555 ms, pipe 4, ipg/ewma 14.469/23.689 ms

--- fe80::a804:28ff:fefa:f3f8%tap0 ping statistics ---
1500 packets transmitted, 1287 received, 14.2% packet loss, time 21706ms
rtt min/avg/max/mdev = 12.578/41.029/56.543/4.845 ms, pipe 4, ipg/ewma 14.480/21.693 ms

Issues/PRs references

Fixes #17254

@miri64 miri64 added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label Nov 24, 2021
@github-actions github-actions bot added Area: build system Area: Build system Area: drivers Area: Device drivers Area: sys Area: System labels Nov 24, 2021
@miri64
Copy link
Member Author

miri64 commented Nov 24, 2021

Forgot to mention: As the stdio now also needs special handling in the thread layer, this moves the stdio_ethos implementation to its own ethos submodule (maybe also a step towards #13469).

@miri64 miri64 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Nov 24, 2021
@chrysn
Copy link
Member

chrysn commented Nov 24, 2021

I'm having a look, but I'm not too familiar with the ethos subsystem overall.

If we already had subsystem maintainer reviews, would you consider yourself suitable as an ethos subsystem maintainer? If so, I think I can provide sufficient review, otherwise I'm still doing the same but would ask around for someone who can give this a high-level Go just from the description you provide in the PR.

Copy link
Member

@chrysn chrysn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to concrete code changes (at least such that GitHub would let me annotate):

  • In stdio_uart, #ifdef M_S_ETHOS / uart_init(ETHOS_UART, ...); remains; is that intentional? (Seems like something was missed in the split).

Other than that (and barring testing which I'd do on the complete thing), the ethos move LGTM.

@miri64
Copy link
Member Author

miri64 commented Nov 24, 2021

  • In stdio_uart, #ifdef M_S_ETHOS / uart_init(ETHOS_UART, ...); remains; is that intentional? (Seems like something was missed in the split).

Arghs forgot to remove that change. Will do ASAP.

Copy link
Member

@chrysn chrysn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete review; one nit, one "wat?".

DEBUG("ethos _recv(): inbuf doesn't contain enough bytes.\n");
return -EIO;
}
tmp = ethos_unstuff_readbyte(ptr, (uint8_t)byte, &escaped, &frametype);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the escaping yet (stay with me, I think I start understanding), but maybe you can clarify this easily: If a stdio_read() is started between having received an escape and the escaped data, will this not produce an erroneous read?

Say the buffer is filled with a text frame, and another frame (that won't fit in the buffer yet) is being filled in. When the buffer is full, stdio_read is called and (with speed varying depending on which high-priority threads are running) pulls out data byte by byte, creating on and off opportunity for isrpipe_write_one to push data.

Frametype stays text, data keeps flowing in and getting dropped on and off before the reader code even gets to the being-corrupted frame, and all that happens on isrpipe_write_one failure is that dev->inbuf gets touched (which is weird as I'd understand that to be network related).

The more I think of this, the more it seems to me that the puts("lost frame"); code in the ETHOS_FRAME_TYPE_TEXT case doesn't do what it should. Maybe it should set the frame type to FRAME_TYPE_ERRORED and thus ensure that the rest of the frame is discarded (rather than risking that individual characters get eaten and possibly misassembled at the stdio reading site)?

(Not sure of any of this, but it seems weird; gotta cut it short now and will continue at a later point.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure I understood your scenario, so let me just describe how the buffers and escaping work:

For data, hello and hello reply packets, the frame-type is determined by the head, consisting of the escaped frame type or just (escaped) data, when it is data which is pushed to the tsrb with the remainder of the packet. Also: at the end of the frame there is a frame delimiter character (if it appears in the frame it is escaped). So if multiple frames are in the ring buffer, one frame will be read until the frame delimiter (and the type determined by its head) and the same is true for the next.

Text frames (which in the ISR arrive in sequence with data frames) are not put to the tsrb ring buffer, but to the isrpipe. Since only text frames go there (and since writing back the escape + type seemed to be more trouble than its worth) they are not identified to the handler thread with a head, but are delimited by the frame delimiter.

In both buffers, the data is escaped similar to SLIP, so if the frame delimiter or the escape byte appear within the packet, they are escaped with the byte sequence, {escape_byte, c ^ 0x20}.

Packet types (at least as far as I understand it) are only escaped at the beginning of the packet. This is where the protocol design is somewhat flawed, because it seems a bit like a "because we can approach" and also makes it necessary to do at least some rudimentary unstuffing to determine into which buffer (data or text) the frame should go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input, with that I'd try to make the point more precise:

  • If overflow happens in _handle_char for Ethernet frames, the whole frame is dropped (that's how I understand the inbuf.{reads,writes} - 0), and thanks to the frametype being reset, further data is ignored (is it? ->frametype is also reset, but we're still in the case IN_FRAME of ethos_isr, so wouldn't the next byte trigger the default case assert?). Dropping a frame is relatively safe (a counter might be nice but adding features is not what this PR is about).

  • If overflow happens in _handle_char on text frames, the network frame is dropped (?; we're not in one), but otherwise the error is ignored.

    As text may be read piecemeal, we can't "cancel" the whole text frame (some of it may have been consumed, that's OK), but IMO the sane behavior would be to discard the rest of the frame -- for the alternative is that while more bytes come in, some may be accepted and some not, and the ISR (rightfully, after this PR) feeds all the characters until end of frame to the isrpipe. For example, if the data stream is "H e l l o Escape EscapedEscape", and the Escape gets dropped but the EscapedEscape does not, the stdio reader would not read the bytes "H e l l o Escape" (which there should be) but "H e l l o EscapedEscape".

    I think that on the UART, losing text is OK -- receiving wrong text is not. I have a slight preference for losing chunks of text over losing individual characters, but (now that escapes are decoded after the isrpipe) losing individual characters has the downside of possibly producing wrong text, and thus, losing a chunk (the rest of the text frame, ) it should be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I start to understand what you mean. Will fix.

@miri64
Copy link
Member Author

miri64 commented Nov 24, 2021

Not related to concrete code changes (at least such that GitHub would let me annotate):

* In stdio_uart, `#ifdef M_S_ETHOS` / `uart_init(ETHOS_UART, ...);` remains; is that intentional? (Seems like something was missed in the split).

Other than that (and barring testing which I'd do on the complete thing), the ethos move LGTM.

Addressed with the other issues pointed out above.

if (isrpipe_write_one(&ethos_stdio_isrpipe, c) < 0) {
//puts("lost frame");
dev->inbuf.reads = 0;
dev->inbuf.writes = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the location where I should have started the thread on what losing bytes on the stdio byte would do to escaping.

Copy link
Member Author

@miri64 miri64 Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hardened the error handling in the following way in febe559 ef54d1c (edit: removed the mutex_lock/mutex_unlock around the isrpipe's reset, as we are in ISR context for that function):

  • The previous frametype determines which buffer (tsrb or isrpipe) is operated on.
  • The buffer is reset (both reads and writes are set to 0)
  • A ETHOS_FRAME_DELIMITER is added to the now reset buffer (as the buffer was now reset it will even work if the error was caused by an overflown buffer), to tell the handler thread to stop reading and just truncate the frame as read so far.

@miri64 miri64 force-pushed the ethos/fix/threadify-state-machine branch from febe559 to ef54d1c Compare November 25, 2021 13:30
@miri64 miri64 mentioned this pull request Dec 6, 2021
@miri64 miri64 force-pushed the ethos/fix/threadify-state-machine branch from a3f8fa9 to 3e30300 Compare December 7, 2021 11:45
@miri64
Copy link
Member Author

miri64 commented Dec 7, 2021

Rebased and adopted for the tsrb_clear() function introduced in #17337.

@miri64
Copy link
Member Author

miri64 commented Dec 10, 2021

Ping? This should go in for 2021.01 IMHO.

@chrysn
Copy link
Member

chrysn commented Dec 14, 2021

LGTM, please squash. I'll give version merged after squashing a test run and then I think this can be done. Sorry for the holdup.

This moves the following parts of ethos' state machine out of ISR
context:

- Sending and replying to HELLO messages
- Byte-unstuffing

Some escape handling is still needed in the ISR handler, due to ethos'
protocol design, to determine if a received byte must go into the
netdev queue (tsrb) or the STDIO queue (isrpipe), but the actual
unstuffing is now done in the STDIO and netdev handler threads,
respectively.
@miri64 miri64 force-pushed the ethos/fix/threadify-state-machine branch from 3e30300 to 820a397 Compare December 15, 2021 15:15
@miri64
Copy link
Member Author

miri64 commented Dec 15, 2021

LGTM, please squash. I'll give version merged after squashing a test run and then I think this can be done. Sorry for the holdup.

Squashed.

Copy link
Member

@chrysn chrysn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested by running gnrc_border_router on a microbit-v2.

Command line interaction works, network works, and success rate on flood pings is vastly improved.

ACK, and thanks for fixing this.

@miri64 miri64 merged commit e062ad7 into RIOT-OS:master Dec 16, 2021
@miri64 miri64 deleted the ethos/fix/threadify-state-machine branch December 16, 2021 11:24
@fjmolinas fjmolinas added this to the Release 2022.01 milestone Jan 21, 2022
@benpicco
Copy link
Contributor

benpicco commented Feb 3, 2022

Now this broke standalong ethos (without ethos_stdio)

/home/benpicco/dev/RIOT/drivers/ethos/ethos.c: In function '_fail_frame':
/home/benpicco/dev/RIOT/drivers/ethos/ethos.c:92:25: error: 'ethos_stdio_isrpipe' undeclared (first use in this function)
   92 |             tsrb_clear(&ethos_stdio_isrpipe.tsrb);
      |                         ^~~~~~~~~~~~~~~~~~~
/home/benpicco/dev/RIOT/drivers/ethos/ethos.c:92:25: note: each undeclared identifier is reported only once for each function it appears in
/home/benpicco/dev/RIOT/drivers/ethos/ethos.c:95:13: error: implicit declaration of function 'isrpipe_write_one' [-Werror=implicit-function-declaration]
   95 |             isrpipe_write_one(&ethos_stdio_isrpipe, ETHOS_FRAME_DELIMITER);
      |             ^~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

@fjmolinas
Copy link
Contributor

With this PR for some reason, ethos + riotboot causes the shell to continuously exit, I'm investigating but if someone has an idea:

FEATURES_REQUIRED=riotboot USEMODULE=riotboot BOARD=dwm1001 make -C examples/gnrc_border_router/ flash term -j7
NETOPT_RX_END_IRQ not implemented by driver
gnrc_uhcpc: Using 8 as border interface and 7 as wireless interface.
udp_sendto(): sendto(): Cannot assign requested address
main(): This is RIOT! (Version: 2022.01-devel-1018-g820a3-HEAD)
RIOT border router example application
All up, running the shell now
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 

@fjmolinas
Copy link
Contributor

With this PR for some reason, ethos + riotboot causes the shell to continuously exit, I'm investigating but if someone has an idea:

FEATURES_REQUIRED=riotboot USEMODULE=riotboot BOARD=dwm1001 make -C examples/gnrc_border_router/ flash term -j7
NETOPT_RX_END_IRQ not implemented by driver
gnrc_uhcpc: Using 8 as border interface and 7 as wireless interface.
udp_sendto(): sendto(): Cannot assign requested address
main(): This is RIOT! (Version: 2022.01-devel-1018-g820a3-HEAD)
RIOT border router example application
All up, running the shell now
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 

Its the second commit 820a397, but I didn't all the time too look into it as I hoped yesterday, is there anything obvious that could be at fault here @miri64?

@fjmolinas
Copy link
Contributor

Its the second commit 820a397, but I didn't all the time too look into it as I hoped yesterday, is there anything obvious that could be at fault here @miri64?

I didn't test on iotlab-m3 but samr21-xpro and dwm1001

@miri64
Copy link
Member Author

miri64 commented Mar 31, 2022

Its the second commit 820a397, but I didn't all the time too look into it as I hoped yesterday, is there anything obvious that could be at fault here @miri64?

Not sure, but I observed the behavior you mention also sometimes on iotlab-m3. The cause of this is unclear. Maybe some combination of Ctrl+D or Ctrl+C isn't properly handled by the stdio part? Or does the node wake up like this in your case?

@fjmolinas
Copy link
Contributor

Not sure, but I observed the behavior you mention also sometimes on iotlab-m3. The cause of this is unclear. Maybe some combination of Ctrl+D or Ctrl+C isn't properly handled by the stdio part? Or does the node wake up like this in your case?

It seems to happen after flashing only, resetting the BOARD after seems to fix it though.

@miri64
Copy link
Member Author

miri64 commented Mar 31, 2022

It seems to happen after flashing only, resetting the BOARD after seems to fix it though.

Still not clear to me if this happens when you just started the board or if you hit any key.

@fjmolinas
Copy link
Contributor

It seems to happen after flashing only, resetting the BOARD after seems to fix it though.

Still not clear to me if this happens when you just started the board or if you hit any key.

At start, once the shell is launched.

@miri64
Copy link
Member Author

miri64 commented Mar 31, 2022

Ok, that I did not observe on iotlab-m3 :-/. Maybe there is a broken condition which makes the shell exit and then start again? Have you checked if the ethos part at the host side may be the cause?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: build system Area: Build system Area: drivers Area: Device drivers Area: sys Area: System CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ethos: race condition seems to cause packet loss
4 participants