Proper ring buffer implementation, fixes #53 #58

MrJake222 · 2024-10-30T14:52:56Z

Now on UART0 too. Speeds up UART0 by 16%.

Only in Rx direction. I've tested another case of bidir buffering but it actually slowed things down by 25%.

Now on UART0 too. Speeds up UART0 by 16%. Only in Rx direction. I've tested another case of bidir buffering but it actually slowed things down by 25%.

MrJake222 · 2024-10-31T06:48:09Z

I actually have doubts whether it's the right way forward. Notice tinyusb already has a fifo! We're copying it to our "better" fifo.

We don't have rtos here. We have a constantly running polling loop, why not just use it? Poll uart0/1 for data, write it to tinyusb. Poll tinyusb, write to uart.

In rtos we could have a semaphore and unlock the polling loop.

MrJake222 · 2024-10-31T08:33:49Z

On the other hand with our own buffer we have control over it's size for example. TinyUSB one is only 64/512 bytes.

I'm thinking about implementing CTS line sensing. When inactive It'll block sending data to FPGA (prevent flooding it). Extra buffer (tx buffer, not implemented here) would come in handy.

Although above case should be handled by sending CTS line state up to the host. Which is unsupported by current tinyusb (there is a PR however).

I can't decide. What do you think about it?

josuah · 2024-10-31T10:39:01Z

Thank you for taking care of this in-place ring buffer implementation!
Good point about TinyUSB having its own mechanisms defined.

TinyUSB one is only 64/512 bytes

This is the USB packet size for FullSpeed and HighSpeed (therefore 64 bytes for RP2040), just enough for filling/emptying an USB packet, and is typical implementation of a CDC ACM device stack.
If needing performance, a byte-based FIFO is not the most efficient, a pread()/pwrite() (scatter-gather) API would be without memcpy() would be a better fit, and then tinyUSB CDC FIFO would get in the way.

block sending data to FPGA (prevent flooding it)

Hardware is usually able to decode commands coming from an UART in real-time, no RX buffer that can overflow, no?
For instance, with SPI, there is no hardware flow control, it all happens implicitly at the protocol level, by having a commands/responses interface.

implementing CTS line sensing

Probably a bit confusing to have it by default: non-default flags would need to be setup. Do feel free to experiment with it though! It could become some opt-in flag to tusb_config.h if that helps.

Poll uart0/1 for data, write it to tinyusb. Poll tinyusb, write to uart.

For USB reception from the host, tinyUSB loads the received bytes into a FIFO and lets the application empty it... The app better empty the FIFO every time for avoiding data to be dropped.
https://github.com/hathach/tinyusb/blob/master/src/class/cdc/cdc_device.c#L452-L453

For USB transmission to the host, tinyUSB pass the whole FIFO content as long as it is not empty, whether it contains 1 byte of wMaxPacketSize bytes depend on how much data was loaded into it by the application.
https://github.com/hathach/tinyusb/blob/master/src/class/cdc/cdc_device.c#L464

It looks like we just had some synchronization bug, maybe IRQs of UART and USB dead-locking each other, and adding a FIFO seemed like a way to avoid this. Maybe it is possible to use TinyUSB native ring buffers only, without bumping into that issue we had.

MrJake222 · 2024-10-31T11:57:52Z

If needing performance, a byte-based FIFO is not the most efficient, a pread()/pwrite() (scatter-gather) API would be without memcpy() would be a better fit, and then tinyUSB CDC FIFO would get in the way.

But the tusb fifo has block reads and writes, which essentially work the same as the posix api.

Hardware is usually able to decode commands coming from an UART in real-time, no RX buffer that can overflow, no?
For instance, with SPI, there is no hardware flow control, it all happens implicitly at the protocol level, by having a commands/responses interface.

Latency can come in the way. When interfacing over usb request/reply protocols can be really slow. I was thinking about dumping all of the requests and reading back replies after. In my use-case I'm sending 3 floats (x, y, z values) and the fpga risc-v softcore is replying received x, y, z (with printf, slow stuff). As an experiment I've tried to stop the RP2040 from sending data while the software is replying via CTS line. But I guess this would require the data to be buffered elsewhere and it's a pretty nasty corner-case of mine.

For USB reception from the host, tinyUSB loads the received bytes into a FIFO and lets the application empty it... The app better empty the FIFO every time for avoiding data to be dropped.

So we'd need transmit buffer nonetheless. If I dump a 40kB binary over USB, I'd certainly get overflowed. Not to mention this just drops whatever doesn't fit into uart at the moment. I will test sending a large file over to the FPGA to see if it drops something there. If it will, I say we should implement tx buffer also, because You've said not reading from cdc fifo's is a bad idea (or worse, stalling them waiting for uarts to become writable).

For USB transmission to the host, tinyUSB pass the whole FIFO content as long as it is not empty, whether it contains 1 byte of wMaxPacketSize bytes depend on how much data was loaded into it by the application.
https://github.com/hathach/tinyusb/blob/master/src/class/cdc/cdc_device.c#L464
It looks like we just had some synchronization bug, maybe IRQs of UART and USB dead-locking each other, and adding a FIFO seemed like a way to avoid this. Maybe it is possible to use TinyUSB native ring buffers only, without bumping into that issue we had.

I think I remember now. It seems the tinyusb fifo functions didn't like to be called from an interrupt (while tud_task() was running from the main loop) because write function can trigger a flush. And when this happens the interrupts can stall or deadlock. We could use dma-oriented functions but pointer to buffers are hidden and I guess It won't change soon. So the homebrew rx buffer is still required I think. Or a variation of tinyusb write with no-flush guarantee. I may create a question on their github about this.

josuah · 2024-10-31T13:28:07Z

It looks like there are a couple of things to improve for the USB CDC ACM <-> UART pipe.

Some are plain bugs (like deliberating dropping all data that does not fit).
Other might be enhancements (DMA-based API).
Other might be upstream contribution to TinyUSB.

Perhaps reviewing all of the USB CDC-ACM will be needed when going through the SDK rejuvenation as part of pico2-ice.

Thank you for your exploration! It will help organizing a solution.

added logging rb levels (uart0 only for now) rx heavy overflow (!), probably IRQs hog the cpu and it can't transfer it over CDC

Defined struct uart_wrap which abstracts uart0/1 properties like itf, hw instance, irq number/handler, buffers. Handled baud and cdc rx callbacks. Code in ice_usb.c got much cleaner. At least the uart part. Moved line_coding_cb() higher, not between rx_cb_table definition and usage. Renamed cdc_num to itf to match tinyusb api naming. Moved internal headers to src (debatable)

MrJake222 · 2024-11-03T14:26:47Z

Hello.

Is this (small) refactor okay?
I've extracted the uart handling to separate files. I plan to experiment with DMA later. The mess ice_usb.c was to get so much bigger. I think this is much better. Also introduced some fixes and changes (read commit messages please).

I've placed private sdk headers in src. Probably not the best solution, but we don't have private include folder. Maybe we should create one?

Still buffer is overflowing... I found out usb throughput is the limiting factor. tusb completes the transfer of 512 bytes in around 787 us thus the speed of CDC is limited to ~650kB/s. 1MBaud is not possible I fear.

The writes to cdc stall at first and then at the end it goes really smoothly, wonder why that is... DMA transfer size fits bulk packet size, 4kB overall buffer

MrJake222 · 2024-11-03T22:06:12Z

Hmm, USB CDC lags at first, then goes smoothly (see usb-cdc-results.txt). Wonder why is that? Maybe something hogs the cpu...

Anyway, DMA wasn't the answer to this one. However it works really nice. I wonder if there's a better way to detect idling DMA on rp2040

josuah

Thank you for this addition.

It looks like I've been using FIFOs and work queues for too long: as soon as I do bare-metal, I miss them to organize APIs... 😆

The extra complexity inserted by this pull request looks like required by what is aimed rather than accidental: ping-pong buffer of DMA transfers, the hardware being busy all of the time... This means structs are needed to keep track of the context... and slowly a driver API is built.

I left a few comments, but ultimately I think this is in good condition to be integrated as it is, thanks for the effort put behind it!

Regarding the slow ramp-up, it could be due to control commands being sent to tinyUSB making the CPU busy with control commands instead of filling the FIFO.

You could try to compare the slow-start behavior when starting transferring immediately after opening /dev/ttyACM0, or after some delay after opening /dev/ttyACM0 to see if giving time to the control commands to complete changes anything...

As long as it is not risking to get the feed stuck I see no issue.

josuah · 2024-11-04T10:46:50Z

src/rb.c

+}
+
+void rb_read_ack(struct rb* rb, unsigned int bytes) {
+    rb->read_index += (int)bytes;


Can the index be negative? It is probably easiest to use size_t for any variable that holds a size, which will save you from having to manage the sign.

josuah · 2024-11-04T10:48:49Z

src/rb.c

+
+void rb_read_ack(struct rb* rb, unsigned int bytes) {
+    rb->read_index += (int)bytes;
+    while (rb->read_index >= rb->size) rb->read_index -= rb->size; // faster than modulo (%)


Did you measure it? If not, is it good enough to leave it a rb->read_index %= rb->size?

josuah · 2024-11-04T10:50:17Z

src/rb.h

+#pragma once
+
+#include <tusb_config.h>
+// #define RB_BUFSIZE CFG_TUD_CDC_TX_BUFSIZE


Maybe you meant to remove this line now that buf is to be provided externally.

josuah · 2024-11-04T10:51:09Z

src/rb.h

+    int size;
+
+    int read_index;
+    unsigned long long read_total;
+
+    int write_index;
+    unsigned long long written_total;


As seen above, using size_t for each of these variables will save you some effort.

size_t is only 32-bit, I need 64-bit for totals

josuah · 2024-11-04T10:51:47Z

src/rb.h

+    // additionally to indexes (which will roll over) we defined read_total and written_total
+    // to mitigate ambiguity (when read_index==write_index is the buffer full or empty?)
+    // this variables will get large so we define them as 64bit (rolls over after 645 years at 10Gbit/s)


You can also solve this with a full boolean if you wish.

Yes, but it'll be accessed by two context, and race conditions will occur. Two separate sets of variables prevent this

josuah · 2024-11-04T11:41:10Z

src/ice_usb_uart.c

+        int wr;
+        for (wr=0; wr<bufsize; wr++) {
+            while (!uart_is_writable(uart->inst));
+            uart_putc(uart->inst, *buf++);


Does this reduce the advantage of a ring buffer?
No trouble with this though. It can be easily improved later.

josuah · 2024-11-04T11:43:57Z

src/rb.c

+    return MIN(rb_space_left(rb), rb->size - rb->write_index);
+}
+
+char* rb_read_ptr(struct rb* rb) {


I vote for something like rb_get_read_ptr() as read is a verb and misleading that it reads some pointer into the buffer.

josuah · 2024-11-04T11:46:36Z

src/rb.c

+    rb->written_total = 0;
+}
+
+int rb_data_left(struct rb* rb) {


Having int for return type is frequent for functions that can fail, returning < 0.

Having size_t makes it explicit that this returns how much data is left, rather than whether there is data left.

josuah · 2024-11-04T11:49:49Z

src/ice_usb_uart.c

+    // complementary channel will write to write_ptr
+    // need to skip another ack
+    char* waddr = rb_write_ptr_next_ack(&uart->rx_buf, DMA_TCNT);
+    dma_channel_set_write_addr(channel, waddr, false);


I did not understand why having 2 different pointers is required, is it not possible to just wait that one transaction finishes before configuring the next?

When dma_chA is started at initial position, increment the ring buffer position, then dma_chB is started at that new position.
When dma_chA completes, increment the ring buffer position, then dma_chA is started at that new position.
When dma_chB completes, increment the ring buffer position, then dma_chB is started at that new position.
When dma_chA completes, increment the ring buffer position, then dma_chA is started at that new position.
When dma_chB completes, increment the ring buffer position, then dma_chB is started at that new position.
...

This way there would be no need to look-up the next upcoming position, only update the position and pick the new value? Or are things more interleaved than this?

the interleaving is happening at a hardware level (chain_to field), This is done to avoid reconfiguring the channels when there is new data to be processed. When chA is being reconfigured, chB already copies data to it's portion of the buffer.

josuah · 2024-11-04T11:57:25Z

src/rb.h

+void rb_init(struct rb* rb, char* buf, int bufsize);
+int rb_data_left(struct rb* rb);
+int rb_data_left_continuous(struct rb* rb);
+int rb_space_left(struct rb* rb);
+int rb_space_left_continuous(struct rb* rb);
+char* rb_read_ptr(struct rb* rb);
+char* rb_write_ptr(struct rb* rb);
+char* rb_write_ptr_next_ack(struct rb* rb, unsigned int bytes);
+void rb_read_ack(struct rb* rb, unsigned int bytes);
+void rb_write_ack(struct rb* rb, unsigned int bytes);


This is a lot of API surface for a ring buffer.
I suspect that it is possible to remove a few functions that are only used internally.

Also, in this case, rb_read_ack() is always called with the same value as what rb_space_lelft_continuous() returned. It becomes tempting to combine 4 functions:

int rb_data_left(struct rb* rb); int rb_data_left_continuous(struct rb* rb); char* rb_read_ptr(struct rb* rb); void rb_read_ack(struct rb* rb, unsigned int bytes);

Into this one, that returns the next continuous buffer, and remove this region from the ring buffer at the same time:

char *rb_read(struct rb* rb, size_t *size);

Not important though.

What about reading buffer that won't fit into CDC or uart fifos?

I think the same thing as now: it would get a partial read operation, and size would get updated to the new size.

It is also good to keep the read and write side symmetric though, so your rb_data_left_continuous() is kind of great way to split the problem to be honest.

josuah · 2024-11-04T12:06:20Z

You can ignore the CI failure, it is not related to this PR...

MrJake222 · 2024-11-04T19:06:43Z

Thanks for the review, I'll implement (or disagree) with your suggestions. But this slow ramp up keeps me up at night...

Regarding the slow ramp-up, it could be due to control commands being sent to tinyUSB making the CPU busy with control commands instead of filling the FIFO.
You could try to compare the slow-start behavior when starting transferring immediately after opening /dev/ttyACM0, or after some delay after opening /dev/ttyACM0 to see if giving time to the control commands to complete changes anything...

sleep(5) isn't working. I've also used C program using poll to rule out slow python script. No success. I enabled tusb debugging and got this log.txt. Look at lines 29-32:

41146  512   1
41081  512   1
41114  512   1
41717  512   1

there are no usb logs there, yet getting tusb to accept 512 bytes in ice_usb_uart_rx_to_cdc() took over 41ms with 1 rejection, that is tud_cdc_n_write() returned 0 available space only once. Later in the file we get:

688  512   2
762  512  10
767  512   9
746  512  21

The function is polled more frequently.

As I'm writing it I saw tud_cdc_tx_complete_cb(). Maybe it's the way to go, not the polling approach I did (back with the first pull request). Will investigate. Any other ideas now? I'm a rookie with USB, really...

MrJake222 · 2024-11-06T12:28:39Z

Here with some news. Transmit logic is stalling the writes. This is long:

            while (!uart_is_writable(uart->inst));
            uart_putc(uart->inst, *buf++);

But I've tried with if also. What i didn't realize is that uart_putc WAITS for the character to be actually transmitted. So when using dr (data register) directly it solved the stalling. But now either CDC or our ring buffer is overflowing on transmits.

So It wasn't rx side fault at all I think. The question is now, should we keep the dma/alarm code? Is it worth the complexity of it? @josuah

josuah · 2024-11-06T12:59:51Z

Transmit logic is stalling the writes. This is long:

Good catch! It makes sense that UART is slower than USB.

should we keep the dma/alarm code?

Maybe it is still useful, did you try at max UART speed?

A ring buffer is useful in case of CDC ACM implementations to aggregate enough data to transmit at once to fill a full USB packet. That way, upon one UART byte received from the FPGA, there is not 1 USB packet sent to the host with just one byte, it would wait that enough data was accumulated onto the UART; something often put on USB TX side (aka USB IN from the host point of view).

I hoped to not limit your progress due to the limitations of the SDK, so am very glad to include your improvements.
Beyond including it in the SDK is the question of how to enable users to include these modifications without requiring to modify the SDK: even if contributions are welcome, having to change the SDK code to do some project is not ideal.

This boils down to how TinyUSB is organized: single-application in mind, not meant to be turned into an SDK, but configured directly by the final user (i.e. there is no per-CDC instance API). Which is great design as very simple to investigate, but difficult to make many things cohabitate.

Glad to receive your feedback on what the pico-ice SDK should become. Some leads:

MicroPython/CircuitPython (widely asked by the community)
A Zephyr port (since tinyVision.ai is active Zephyr side already)
Changes in the SDK to make USB a user responsibility, not SDK one, with just the default firmware to provide DFU/UF2.

All of this being a trade-off with the amount of time available for tinyVision.ai in the long therm.

MrJake222 · 2024-11-06T17:22:25Z

I'd choose an RTOS port. I don't like embedded python (too much memory used, unstable). I think this SDK should provide a clean way to passthrough UART. Honestly, I love it about it. With for example intel boards I had to use external uart dongle, and it provided only one uart. Here I can do however many I want (thanks to PIO) and it should be abstracted. User can always disable It and develop their own solution. Or bring up their ideas here (as I did)

MrJake222 · 2024-11-07T12:13:17Z

Or maybe this whole interconnection thing should be a part of the default firmware? SDK i think should implement board-specific protocols and programming (iCE40 cram, on-board spi bus, etc...) and no more. Kinda like pico-sdk provides convenience functions for hardware configuration but no more.

josuah · 2024-11-07T12:26:37Z

Clearly having it in the default firmware makes sense! Maybe the solution is to migrate away all these "USB features" from the SDK, turn them into examples, and turn the default firmware into an example too.

With exception of DFU, TinyUSB does not handle merging multiple source of configuration too easily. Hacks are used for getting tinyUF2 and the pico-sdk USB config merged along with custom ones. Doing this at example-level is much easier than doing this at SDK-level, which does not exclude custom configuration at build-time (i.e. how many SPIs, how many UARTs, which pins for the UARTs...).

Proper ring buffer implementation, fixes tinyvision-ai-inc#53

198553c

Now on UART0 too. Speeds up UART0 by 16%. Only in Rx direction. I've tested another case of bidir buffering but it actually slowed things down by 25%.

MrJake222 added 3 commits November 2, 2024 15:41

expose tx drop bug + cosmetics

bc2b9f0

Tx works, with tx ring buffer

78e4ff0

added logging rb levels (uart0 only for now) rx heavy overflow (!), probably IRQs hog the cpu and it can't transfer it over CDC

MrJake222 added 3 commits November 3, 2024 17:15

DMA works, but jittering

6d1d1af

DMA works, but it wasn't the issue

2485405

Still buffer is overflowing... I found out usb throughput is the limiting factor. tusb completes the transfer of 512 bytes in around 787 us thus the speed of CDC is limited to ~650kB/s. 1MBaud is not possible I fear.

USB speed testing and results

10f488b

The writes to cdc stall at first and then at the end it goes really smoothly, wonder why that is... DMA transfer size fits bulk packet size, 4kB overall buffer

MrJake222 force-pushed the uart-buffers branch from 3a4f7b6 to 10f488b Compare November 3, 2024 22:02

josuah reviewed Nov 4, 2024

View reviewed changes

MrJake222 mentioned this pull request Nov 6, 2024

Fast uart v2 #61

Merged

MrJake222 closed this Nov 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proper ring buffer implementation, fixes #53 #58

Proper ring buffer implementation, fixes #53 #58

MrJake222 commented Oct 30, 2024

MrJake222 commented Oct 31, 2024

MrJake222 commented Oct 31, 2024

josuah commented Oct 31, 2024

MrJake222 commented Oct 31, 2024

josuah commented Oct 31, 2024 •

edited

Loading

MrJake222 commented Nov 3, 2024

MrJake222 commented Nov 3, 2024

josuah left a comment

josuah Nov 4, 2024

josuah Nov 4, 2024

josuah Nov 4, 2024

josuah Nov 4, 2024

MrJake222 Nov 6, 2024

josuah Nov 4, 2024

MrJake222 Nov 6, 2024

josuah Nov 4, 2024

josuah Nov 4, 2024

josuah Nov 4, 2024

josuah Nov 4, 2024

MrJake222 Nov 6, 2024

josuah Nov 4, 2024

MrJake222 Nov 6, 2024

josuah Nov 6, 2024

josuah commented Nov 4, 2024

MrJake222 commented Nov 4, 2024

MrJake222 commented Nov 6, 2024 •

edited

Loading

josuah commented Nov 6, 2024

MrJake222 commented Nov 6, 2024

MrJake222 commented Nov 7, 2024

josuah commented Nov 7, 2024

Proper ring buffer implementation, fixes #53 #58

Proper ring buffer implementation, fixes #53 #58

Conversation

MrJake222 commented Oct 30, 2024

MrJake222 commented Oct 31, 2024

MrJake222 commented Oct 31, 2024

josuah commented Oct 31, 2024

MrJake222 commented Oct 31, 2024

josuah commented Oct 31, 2024 • edited Loading

MrJake222 commented Nov 3, 2024

MrJake222 commented Nov 3, 2024

josuah left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

josuah commented Nov 4, 2024

MrJake222 commented Nov 4, 2024

MrJake222 commented Nov 6, 2024 • edited Loading

josuah commented Nov 6, 2024

MrJake222 commented Nov 6, 2024

MrJake222 commented Nov 7, 2024

josuah commented Nov 7, 2024

josuah commented Oct 31, 2024 •

edited

Loading

MrJake222 commented Nov 6, 2024 •

edited

Loading