-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proper ring buffer implementation, fixes #53 #58
Conversation
Now on UART0 too. Speeds up UART0 by 16%. Only in Rx direction. I've tested another case of bidir buffering but it actually slowed things down by 25%.
I actually have doubts whether it's the right way forward. Notice tinyusb already has a fifo! We're copying it to our "better" fifo. We don't have rtos here. We have a constantly running polling loop, why not just use it? Poll uart0/1 for data, write it to tinyusb. Poll tinyusb, write to uart. In rtos we could have a semaphore and unlock the polling loop. |
On the other hand with our own buffer we have control over it's size for example. TinyUSB one is only 64/512 bytes. I'm thinking about implementing CTS line sensing. When inactive It'll block sending data to FPGA (prevent flooding it). Extra buffer (tx buffer, not implemented here) would come in handy. Although above case should be handled by sending CTS line state up to the host. Which is unsupported by current tinyusb (there is a PR however). I can't decide. What do you think about it? |
This is the USB packet size for FullSpeed and HighSpeed (therefore 64 bytes for RP2040), just enough for filling/emptying an USB packet, and is typical implementation of a CDC ACM device stack.
Hardware is usually able to decode commands coming from an UART in real-time, no RX buffer that can overflow, no?
Probably a bit confusing to have it by default: non-default flags would need to be setup. Do feel free to experiment with it though! It could become some opt-in flag to
For USB reception from the host, tinyUSB loads the received bytes into a FIFO and lets the application empty it... The app better empty the FIFO every time for avoiding data to be dropped. For USB transmission to the host, tinyUSB pass the whole FIFO content as long as it is not empty, whether it contains 1 byte of wMaxPacketSize bytes depend on how much data was loaded into it by the application. It looks like we just had some synchronization bug, maybe IRQs of UART and USB dead-locking each other, and adding a FIFO seemed like a way to avoid this. Maybe it is possible to use TinyUSB native ring buffers only, without bumping into that issue we had. |
But the tusb fifo has block reads and writes, which essentially work the same as the posix api.
Latency can come in the way. When interfacing over usb request/reply protocols can be really slow. I was thinking about dumping all of the requests and reading back replies after. In my use-case I'm sending 3 floats (x, y, z values) and the fpga risc-v softcore is replying
So we'd need transmit buffer nonetheless. If I dump a 40kB binary over USB, I'd certainly get overflowed. Not to mention this just drops whatever doesn't fit into uart at the moment. I will test sending a large file over to the FPGA to see if it drops something there. If it will, I say we should implement tx buffer also, because You've said not reading from cdc fifo's is a bad idea (or worse, stalling them waiting for uarts to become writable).
I think I remember now. It seems the tinyusb fifo functions didn't like to be called from an interrupt (while |
It looks like there are a couple of things to improve for the USB CDC ACM <-> UART pipe. Some are plain bugs (like deliberating dropping all data that does not fit). Perhaps reviewing all of the USB CDC-ACM will be needed when going through the SDK rejuvenation as part of Thank you for your exploration! It will help organizing a solution. |
added logging rb levels (uart0 only for now) rx heavy overflow (!), probably IRQs hog the cpu and it can't transfer it over CDC
Defined struct uart_wrap which abstracts uart0/1 properties like itf, hw instance, irq number/handler, buffers. Handled baud and cdc rx callbacks. Code in ice_usb.c got much cleaner. At least the uart part. Moved line_coding_cb() higher, not between rx_cb_table definition and usage. Renamed cdc_num to itf to match tinyusb api naming. Moved internal headers to src (debatable)
Hello. Is this (small) refactor okay? I've placed private sdk headers in src. Probably not the best solution, but we don't have private include folder. Maybe we should create one? |
Still buffer is overflowing... I found out usb throughput is the limiting factor. tusb completes the transfer of 512 bytes in around 787 us thus the speed of CDC is limited to ~650kB/s. 1MBaud is not possible I fear.
The writes to cdc stall at first and then at the end it goes really smoothly, wonder why that is... DMA transfer size fits bulk packet size, 4kB overall buffer
3a4f7b6
to
10f488b
Compare
Hmm, USB CDC lags at first, then goes smoothly (see Anyway, DMA wasn't the answer to this one. However it works really nice. I wonder if there's a better way to detect idling DMA on rp2040 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this addition.
It looks like I've been using FIFOs and work queues for too long: as soon as I do bare-metal, I miss them to organize APIs... 😆
The extra complexity inserted by this pull request looks like required by what is aimed rather than accidental: ping-pong buffer of DMA transfers, the hardware being busy all of the time... This means structs are needed to keep track of the context... and slowly a driver API is built.
I left a few comments, but ultimately I think this is in good condition to be integrated as it is, thanks for the effort put behind it!
Regarding the slow ramp-up, it could be due to control commands being sent to tinyUSB making the CPU busy with control commands instead of filling the FIFO.
You could try to compare the slow-start behavior when starting transferring immediately after opening /dev/ttyACM0, or after some delay after opening /dev/ttyACM0 to see if giving time to the control commands to complete changes anything...
As long as it is not risking to get the feed stuck I see no issue.
} | ||
|
||
void rb_read_ack(struct rb* rb, unsigned int bytes) { | ||
rb->read_index += (int)bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the index be negative? It is probably easiest to use size_t
for any variable that holds a size, which will save you from having to manage the sign.
|
||
void rb_read_ack(struct rb* rb, unsigned int bytes) { | ||
rb->read_index += (int)bytes; | ||
while (rb->read_index >= rb->size) rb->read_index -= rb->size; // faster than modulo (%) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you measure it? If not, is it good enough to leave it a rb->read_index %= rb->size
?
#pragma once | ||
|
||
#include <tusb_config.h> | ||
// #define RB_BUFSIZE CFG_TUD_CDC_TX_BUFSIZE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you meant to remove this line now that buf
is to be provided externally.
int size; | ||
|
||
int read_index; | ||
unsigned long long read_total; | ||
|
||
int write_index; | ||
unsigned long long written_total; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As seen above, using size_t
for each of these variables will save you some effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t
is only 32-bit, I need 64-bit for totals
// additionally to indexes (which will roll over) we defined read_total and written_total | ||
// to mitigate ambiguity (when read_index==write_index is the buffer full or empty?) | ||
// this variables will get large so we define them as 64bit (rolls over after 645 years at 10Gbit/s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also solve this with a full
boolean if you wish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it'll be accessed by two context, and race conditions will occur. Two separate sets of variables prevent this
int wr; | ||
for (wr=0; wr<bufsize; wr++) { | ||
while (!uart_is_writable(uart->inst)); | ||
uart_putc(uart->inst, *buf++); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this reduce the advantage of a ring buffer?
No trouble with this though. It can be easily improved later.
return MIN(rb_space_left(rb), rb->size - rb->write_index); | ||
} | ||
|
||
char* rb_read_ptr(struct rb* rb) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for something like rb_get_read_ptr()
as read
is a verb and misleading that it reads some pointer into the buffer.
rb->written_total = 0; | ||
} | ||
|
||
int rb_data_left(struct rb* rb) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having int
for return type is frequent for functions that can fail, returning < 0
.
Having size_t
makes it explicit that this returns how much data is left, rather than whether there is data left.
// complementary channel will write to write_ptr | ||
// need to skip another ack | ||
char* waddr = rb_write_ptr_next_ack(&uart->rx_buf, DMA_TCNT); | ||
dma_channel_set_write_addr(channel, waddr, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not understand why having 2 different pointers is required, is it not possible to just wait that one transaction finishes before configuring the next?
When dma_chA
is started at initial position, increment the ring buffer position, then dma_chB
is started at that new position.
When dma_chA
completes, increment the ring buffer position, then dma_chA
is started at that new position.
When dma_chB
completes, increment the ring buffer position, then dma_chB
is started at that new position.
When dma_chA
completes, increment the ring buffer position, then dma_chA
is started at that new position.
When dma_chB
completes, increment the ring buffer position, then dma_chB
is started at that new position.
...
This way there would be no need to look-up the next upcoming position, only update the position and pick the new value? Or are things more interleaved than this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the interleaving is happening at a hardware level (chain_to field), This is done to avoid reconfiguring the channels when there is new data to be processed. When chA is being reconfigured, chB already copies data to it's portion of the buffer.
void rb_init(struct rb* rb, char* buf, int bufsize); | ||
int rb_data_left(struct rb* rb); | ||
int rb_data_left_continuous(struct rb* rb); | ||
int rb_space_left(struct rb* rb); | ||
int rb_space_left_continuous(struct rb* rb); | ||
char* rb_read_ptr(struct rb* rb); | ||
char* rb_write_ptr(struct rb* rb); | ||
char* rb_write_ptr_next_ack(struct rb* rb, unsigned int bytes); | ||
void rb_read_ack(struct rb* rb, unsigned int bytes); | ||
void rb_write_ack(struct rb* rb, unsigned int bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of API surface for a ring buffer.
I suspect that it is possible to remove a few functions that are only used internally.
Also, in this case, rb_read_ack()
is always called with the same value as what rb_space_lelft_continuous()
returned. It becomes tempting to combine 4 functions:
int rb_data_left(struct rb* rb);
int rb_data_left_continuous(struct rb* rb);
char* rb_read_ptr(struct rb* rb);
void rb_read_ack(struct rb* rb, unsigned int bytes);
Into this one, that returns the next continuous buffer, and remove this region from the ring buffer at the same time:
char *rb_read(struct rb* rb, size_t *size);
Not important though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about reading buffer that won't fit into CDC or uart fifos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the same thing as now: it would get a partial read operation, and size
would get updated to the new size.
It is also good to keep the read and write side symmetric though, so your rb_data_left_continuous()
is kind of great way to split the problem to be honest.
You can ignore the CI failure, it is not related to this PR... |
Thanks for the review, I'll implement (or disagree) with your suggestions. But this slow ramp up keeps me up at night...
there are no usb logs there, yet getting tusb to accept 512 bytes in
The function is polled more frequently. As I'm writing it I saw |
Here with some news. Transmit logic is stalling the writes. This is long: while (!uart_is_writable(uart->inst));
uart_putc(uart->inst, *buf++); But I've tried with So It wasn't rx side fault at all I think. The question is now, should we keep the dma/alarm code? Is it worth the complexity of it? @josuah |
Good catch! It makes sense that UART is slower than USB.
Maybe it is still useful, did you try at max UART speed? A ring buffer is useful in case of CDC ACM implementations to aggregate enough data to transmit at once to fill a full USB packet. That way, upon one UART byte received from the FPGA, there is not 1 USB packet sent to the host with just one byte, it would wait that enough data was accumulated onto the UART; something often put on USB TX side (aka USB IN from the host point of view). I hoped to not limit your progress due to the limitations of the SDK, so am very glad to include your improvements. This boils down to how TinyUSB is organized: single-application in mind, not meant to be turned into an SDK, but configured directly by the final user (i.e. there is no per-CDC instance API). Which is great design as very simple to investigate, but difficult to make many things cohabitate. Glad to receive your feedback on what the pico-ice SDK should become. Some leads:
All of this being a trade-off with the amount of time available for tinyVision.ai in the long therm. |
I'd choose an RTOS port. I don't like embedded python (too much memory used, unstable). I think this SDK should provide a clean way to passthrough UART. Honestly, I love it about it. With for example intel boards I had to use external uart dongle, and it provided only one uart. Here I can do however many I want (thanks to PIO) and it should be abstracted. User can always disable It and develop their own solution. Or bring up their ideas here (as I did) |
Or maybe this whole interconnection thing should be a part of the default firmware? SDK i think should implement board-specific protocols and programming (iCE40 cram, on-board spi bus, etc...) and no more. Kinda like pico-sdk provides convenience functions for hardware configuration but no more. |
Clearly having it in the default firmware makes sense! Maybe the solution is to migrate away all these "USB features" from the SDK, turn them into examples, and turn the default firmware into an example too. With exception of DFU, TinyUSB does not handle merging multiple source of configuration too easily. Hacks are used for getting tinyUF2 and the pico-sdk USB config merged along with custom ones. Doing this at example-level is much easier than doing this at SDK-level, which does not exclude custom configuration at build-time (i.e. how many SPIs, how many UARTs, which pins for the UARTs...). |
Now on UART0 too. Speeds up UART0 by 16%.
Only in Rx direction. I've tested another case of bidir buffering but it actually slowed things down by 25%.