Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FR] Improved serial data processing (smarter reading, faster writing) #2835

Closed
rondlh opened this issue Sep 11, 2023 · 124 comments · Fixed by #2840
Closed

[FR] Improved serial data processing (smarter reading, faster writing) #2835

rondlh opened this issue Sep 11, 2023 · 124 comments · Fixed by #2840
Labels
enhancement New feature or request

Comments

@rondlh
Copy link

rondlh commented Sep 11, 2023

1. Serial reading
The current FW uses DMA for reading data from the serial ports, which is done in a very efficient way, the process is started and will run automatically in the background. Interrupts are generated when data is available, this happens when the serial line goes idle.
This is overall a very efficient approach, but has a small drawback. The received data only gets processed once the serial line goes idle, under most circumstances this is fine because the received messages are short and fragmented ("ok\n"). But if a long continues message is received, like the response to M43 (pins information), then a TFT buffer overrun can occur and a part of the received message is lost.
This issue can be solved by not waiting for the serial idle interrupt, but immediately starting to process the data that is being received. Based on the improvements in #2824 (improvements in Serial_Get) it is quite obvious how to implement this. I will provide some sample code soon.

The advantages:

  • the hardware dependent interrupt code is not needed anymore
  • buffer overruns become less likely, commands like M43 will work (better)

2. Serial writing
Practically the TFT only needs to receive a little bit of data ("ok\n" messages), but needs to write a lot of data to the motherboard (the gcode commands, about 40 characters per command).
The current serial write implementation is very slow, the software actually needs to wait after each byte until it's physically send over the serial line. This slows down the TFT response time, especially for lower baud rates and when the workload is high anyway. A buffered DMA based would significantly help to increase the TFT responsiveness.

Currently I have a DMA based serial write solution under test, which works in this way:

  • The message from Serial_Put is stored in a buffer memory and the program can continue immediately. If not enough buffer memory is available then the Serial_Put has to wait until enough space becomes available.
  • Once a complete message is in the buffer the DMA process is started and the Serial transfer will occur in the background, which will trigger an interrupt when done, then more data can be transferred is needed.

The advantages:

  • overall improved speed and responsiveness of the TFT
  • high scan rates (Serial_Get) even under heavy load
  • buffer overruns become less likely, commands like M43 will work

A disadvantage is that this code is hardware dependent (DMA setup).
I will post the STM32F2_4 sample code here soon for review and discussion.

I have some questions:

  1. Is it acceptable to provide this solution for one hardware platform only?

  2. Who can help test/implement a STM32F10x implementation?
    BTT TFT24 V1.1
    BTT TFT28 V1.0
    BTT TFT35 1.0/1.1/1.2/2.0
    BTT GD TFT24 V1.1
    BTT GD TFT35 V2.0
    MKS TFT28 V3.0/4.0
    MKS TFT28 NEW GENIUS
    MKS TFT32 V1.3/1.4
    MKS TFT32L V3

  3. Who can help test/implement a gd32f20x implementation?
    BTT GD TFT35 V3.0
    BTT GD TFT35 E3 V3.0
    BTT_GD TFT35 B1 V3.0
    BTT GD TFT43 V3.0
    BTT GD TFT43 V3.0
    BTT GD TFT50 V3.0

@rondlh rondlh added the enhancement New feature or request label Sep 11, 2023
@rondlh
Copy link
Author

rondlh commented Sep 11, 2023

Serial reading
Here the current Serial_Get, which has the "flag" update, it remembers what part of the serial read data was already scanned to see if the command is clear (commands end with '\n').

1. uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
2. {
3.   // if port index is out of range or no data to read from L1 cache
4.   if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex)
5.     return 0;
6. 
7.   DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1Data[portIndex];
8. 
9.   // make a static access to dynamically changed (by L1 cache's interrupt handler) variables/attributes
10.   uint16_t wIndex = dmaL1Data_ptr->wIndex;
11. 
12.   // L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
13.   uint16_t rIndex = dmaL1Data_ptr->rIndex;
14. 
15.   while (dmaL1Data_ptr->cache[rIndex] == ' ' && rIndex != wIndex)  // remove leading empty space, if any
16.   {
17.     rIndex = (rIndex + 1) % dmaL1Data_ptr->cacheSize;
18.   }
19. 
20.   for (uint16_t i = 0; i < (bufSize - 1) && rIndex != wIndex; )  // retrieve data until buf is full or L1 cache is empty
21.   {
22.     buf[i] = dmaL1Data_ptr->cache[rIndex];
23.     rIndex = (rIndex + 1) % dmaL1Data_ptr->cacheSize;
24. 
25.     if (buf[i++] == '\n')  // if data end marker is found
26.     {
27.       buf[i] = '\0';                                         // end character
28.       dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = rIndex;  // update queue's custom flag and reading index with rIndex
29. 
30.       return i;  // return the number of bytes stored in buf
31.     }
32.   }
33. 
34.   // if here, a partial message is present on the L1 cache (message not terminated by "\n").
35.   // We temporary skip the message until it is fully received updating also dmaL1Data_ptr->flag to
36.   // prevent to read again (multiple times) the same partial message on next function invokation
37. 
38.   dmaL1Data_ptr->flag = wIndex;  // update queue's custom flag with wIndex
39. 
40.   return 0;  // return the number of bytes stored in buf
41. }

Line 4 checks if new data is available: dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex
dmaL1Data[portIndex].wIndex is updated by the serial idle interrupt, which happens 1 serial frame time after the last serial byte was received. This means that during the time that serial data is being received you are not aware of any new data. Usually that is fine if it's an "ok\n" message for example. But for long messages this is not optimal.

wIndex could be updated every time that Serial_Get is executed (line 3):
dmaL1Data[portIndex].wIndex = dmaL1Data[portIndex].cacheSize - DMA_CHCNT(Serial[portIndex].dma_stream, Serial[portIndex].dma_channel);

The serial interrupt routine handling the wIndex update would not be needed anymore (line 3 to 9 below),

1. void USART_IRQHandler(uint8_t port)
2. {
3.   if ((USART_STAT0(Serial[port].uart) & (1<<4)) != 0)
4.   {
5.     USART_STAT0(Serial[port].uart);  // Clear interrupt flag
6.     USART_DATA(Serial[port].uart);
7. 
8.     dmaL1Data[port].wIndex = dmaL1Data[port].cacheSize - DMA_CHCNT(Serial[port].dma_stream, Serial[port].dma_channel);
9.   }
10. }

And of course the serial idle interrupt setup can be cancelled too:

USART_ITConfig(uart[port], usart_it, ENABLE);
USART_ClearITPendingBit(uart[port], usart_it);

Actually, much more can be cancelled, but I don't do that here, because it will be needed for the faster writing part...

@digant73
Copy link
Contributor

digant73 commented Sep 11, 2023

The serial writing based on DMA should improve the code for sure. For the change you propose on serial reading (update wIndex) I think it has a negative impact during a print (even with small replies (ok, temp notifications etc.) provided by mainboard). In that case, I see a lot of Serial_Get() invokation (more on faster TFT) reading a partial message (e.g. ok T:16.13 /0.00 B:16.64 /0.00 @:0 B@:0\n ). I would make some testing on that. If confirmed, I would prefer more to maintain the current logic and eventually increase the serial queue size (e.g. to fit M43 output) on TFT variants having enough RAM available (see definition SERIAL_PORT_QUEUE_SIZE in SerialConnection.c. Increase of serial queue size is also possible if #2820 is merged. It frees more than 2.2 KB of memory that could be assigned to serial queue

For example, change:
#define SERIAL_PORT_QUEUE_SIZE NOBEYOND(512, RAM_SIZE * 64, 4096)
to
#define SERIAL_PORT_QUEUE_SIZE NOBEYOND(512, RAM_SIZE * 96, 6144)

@rondlh
Copy link
Author

rondlh commented Sep 12, 2023

I see a lot of Serial_Get() invokation

Isn't Serial_Get invoked at a very high rate all the time anyway, not only when a serial idle interrupt is received? It seems Serial_Get is polling at very high rates all the time, but returns once it sees that no new data is received.
The "flag" improvement helps to limit the overhead, so worst case Serial_Get could analyze the received data once for every new byte received if it was aware that there was new data earlier.
It could be worth while to first check the new data (everything after the flag) for a '\n', only then continue with Serial_Get. In that case there is virtually no drawback anymore.
Either way, this is a tiny thing (just add 1 line) so it can be tested easily, leaving the idle interrupt enabled in the background also doesn't harm anything.
In my case I would need almost 11K buffer to handle the M43 message because it is send in 1 continues transmit. I use an ESP3D, and it annoys me a bit that I cannot capture the M43 response fully.

I will post the DMA serial writing files soon, that's much more interesting. The combination of the "smart" reading and DMA writing works very well for me.
I'm currently testing DMA writing with the ADVANCED_OK update #2824, and it works fine.

@digant73
Copy link
Contributor

digant73 commented Sep 12, 2023

In Serial_Get() try to put a KPI to count the number of partial messages. E.g. a counter after the following line

dmaL1Data_ptr->flag = wIndex; // update queue's custom flag with wIndex

Display the KPI in the Monitoring menu. Compare it with and without your proposed solution. I suspect a lot of partial (useless) messages with your solution (e.g. even for "ok\n" reply you possibly read 3 partial messages (one for "o", one for "k", one for "\n". If so, the solution is not efficient (consider how many useless partial read will be present for M43) and you could try with the check on presence of \n between the flag and the updated wIndex. The flag has to be updated in case \n is not found. I will try to implement that and make some testing

@rondlh
Copy link
Author

rondlh commented Sep 12, 2023

In Serial_Get() try to put a KPI to count the number of partial messages. E.g. a counter after the following line

dmaL1Data_ptr->flag = wIndex; // update queue's custom flag with wIndex

Display the KPI in the Monitoring menu. Compare it with and without your proposed solution. I suspect a lot of partial (useless) messages with your solution (e.g. even for "ok\n" reply you possibly read 3 partial messages (one for "o", one for "k", one for "\n". If so, the solution is not efficient (consider how many useless partial read will be present for M43) and you could try with the check on presence of \n between the flag and the updated wIndex. The flag has to be updated in case \n is not found. I will try to implement that and make some testing

OK, great, but let me do the testing... this is a minor thing... And I agree with your analysis, even an "ok\n" would probably be split because scanning is very fast, but I just think it doesn't matter especially because your "flag" update limits the scanning and scanning is fast, especially if I add some code to check for '\n' first.

I have something much more important for you to test... Copy it over #2824 test on STM32F2_4 platform.
src-user-HAL-DMA Writing.zip
srt-user-API-SerialConnection.zip

@digant73
Copy link
Contributor

ok. I will test it in the next days

@rondlh
Copy link
Author

rondlh commented Sep 12, 2023

@digant73 Thanks, I hope your printer doesn't explode.

I have a question, here the start of _Serial_Get

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  // if port index is out of range or no data to read from L1 cache
  if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex)
    return 0;

I wonder if the check !WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) is actually needed:

Serial_Get is called in 2 places:

  1. parseACK.c Serial_Get is called with Serial_Get(SERIAL_PORT, ..., SERIAL_PORT is always within PORT_1 and SERIAL_PORT_COUNT, of course you could misconfigure SERIAL_PORT, but then the TFT will never work anyway and a check and warning at startup would be more useful.

  2. serialConnection.c

    for (SERIAL_PORT_INDEX portIndex = PORT_2; portIndex < SERIAL_PORT_COUNT; portIndex++)
       Serial_Get(serialPort[portIndex].port, ...

Here it also seems that the check is always True.
Serial_Get is called very fast (> 35k per second), so removing the condition would release some resources.

In principle you could also pull out the continue blocking guard dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex
and convert it to an inline function newDataAvailable and then only call Serial_Get when actually something changed.

bool inline newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  return dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex;
}

So the while loop would look like:

while (newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)

Maybe not very beautiful, but certainly fast and efficient.

@digant73
Copy link
Contributor

digant73 commented Sep 12, 2023

At the moment I would avoid this kind of changes, They don't provide a significant improvement. Also all the functions/features invoked by loopProcess() are managed in the same way (call a function and make the proper checks in the function just to make the code more readable and manageable). The important part in the serial functions is to speed up the reading, checks and buffering.
I would simply provide just the essential code to reach the goal of reading also long reply messages without eccessive partial messages reading

@rondlh
Copy link
Author

rondlh commented Sep 12, 2023

It's a 2 part question, the first is, can "!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1)" be dropped?

The second part is what I believe you answered. I also do not expect great speed improvements from it.

@digant73
Copy link
Contributor

digant73 commented Sep 13, 2023

yes WITHIN... can be removed. Consider that (just to make all the API consistent) we should do the same also in Serial_GetReadingIndex. If possible I would change the argument type SERIAL_PORT_INDEX for those two functions to something limiting the range from PORT1 to last available port for the TFT variant

@rondlh
Copy link
Author

rondlh commented Sep 13, 2023

I just did a quick benchmark assuming there is no serial data available (if data is available then both approaches are about the same).
Approach 1: (Current situation) call Serial_Get, let it exit if no data is available
Approach 2. (New approach) First check if new data is available, only then call data available
Benchmarks:
1 million calls to the current Serial_Get [with no data available] takes about 442ms.
1 million checks like described above, avoiding a call to Serial_Get and no "WITHIN" check takes only about 109ms.
So a difference of 330ms. 1M / 35K is almost 30, so in total this simple change would save 11ms processing time per second, an overall improvement of 1.1%. I dare anybody here to find an improvement giving more benefit :D (I know a potential case, but that one I already released...)

The required changes are very small, and readability is still ok.

SerialCommunication.c (new function)

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  return (dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex);
}

Use the new function in the while-condition

void Serial_GetFromUART(void)
.
.
.
  while (newDataAvailable(portIndex) && Serial_Get(serialPort[portIndex].port, cmd, CMD_MAX_SIZE) != 0)  // if some data have been retrieved

The check in Serial_Get is not needed anymore, it's done in newDataAvailable and the WITHIN check is dropped

  //  if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex) // IRON, REMOVED
  //    return 0;

parseACK.c Use the new function in the while-condition

void parseACK(void)
{
  while (newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)  // if some data have been retrieved

I have this running already... worked the first time, I hope it stays that way :D

@digant73
Copy link
Contributor

digant73 commented Sep 13, 2023

I just did a quick benchmark assuming there is no serial data available (if data is available then both approaches are about the same). Approach 1: (Current situation) call Serial_Get, let it exit if no data is available Approach 2. (New approach) First check if new data is available, only then call data available Benchmarks: 1 million calls to the current Serial_Get [with no data available] takes about 442ms. 1 million checks like described above, avoiding a call to Serial_Get and no "WITHIN" check takes only about 109ms. So a difference of 330ms. 1M / 35K is almost 30, so in total this simple change would save 11ms processing time per second, an overall improvement of 1.1%. I dare anybody here to find an improvement giving more benefit :D (I know a potential case, but that one I already released...)

The required changes are very small, and readability is still ok.

SerialCommunication.c (new function)

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  return (dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex);
}

Use the new function in the while-condition

void Serial_GetFromUART(void)
.
.
.
  while (newDataAvailable(portIndex) && Serial_Get(serialPort[portIndex].port, cmd, CMD_MAX_SIZE) != 0)  // if some data have been retrieved

The check in Serial_Get is not needed anymore, it's done in newDataAvailable and the WITHIN check is dropped

  //  if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex) // IRON, REMOVED
  //    return 0;

parseACK.c Use the new function in the while-condition

void parseACK(void)
{
  while (newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)  // if some data have been retrieved

I have this running already... worked the first time, I hope it stays that way :D

good results.
It should be enough to update wIndex and then check the presence of \n on serial queue between indexes dmaL1Data[portIndex].flag and dmaL1Data[portIndex].wIndex. If found, proceed with a brutal raw copy (e.g. with memcpy) in the output buffer buf (consider that the serial queue is circular so in case the data are across the end of the queue and the beginning we should split the copy with two memcoy invokation). Or you could maintain the current code based on updating rIndex for each byte) but it should be slower.
In both cases (\n found or not found) dmaL1Data[portIndex].flag must be updated just to start from that point on the next Serial_Get invokation

UPDATE: if you use newDataAvailable() to decide if calling or not Serial_Get() you also need to update wIndex inside newDataAvailable otherwise Serial_Get will be invoked only when wIndex is updated by the interrupt handler (and this is not what you need for reading a big output like for M43). But if you put wIndex code in newDataAvailable it is possibly reducing the performance benefit so it could be more convenient to move the update and check directly on Serial_Get. That seems to me also a more readable solution overall considering that we are more interested to the scenario where the TFT is printing (so most of the time Serial_get is invoked) than the idle scenario you used for benchmark

@rondlh
Copy link
Author

rondlh commented Sep 13, 2023

UPDATE: if you use newDataAvailable() to decide if calling or not Serial_Get() you also need to update wIndex inside newDataAvailable otherwise Serial_Get will be invoked only when wIndex is updated by the interrupt handler (and this is not what you need for reading a big output like for M43). But if you put wIndex code in newDataAvailable it is possibly reducing the performance benefit so it could be more convenient to move the update and check directly on Serial_Get. That seems to me also a more readable solution overall

You are very right, like always :D You could probably guess that my code looks like this, but I didn't want to push that on you and everyone...

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  dmaL1DataRX[portIndex].wIndex = Get_wIndex(portIndex); // update wIndex, DMA is reading and storing serial data in the background
  return (dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex);
}

The updating of the flag is done in Serial_Get when new data is reported. It doesn't matter if the new data was reported by the interrupt or by updating wIndex manually. This works both for complete and incomplete messages... that is how you coded it...
This code is already running on my TFT, working fine

@digant73
Copy link
Contributor

digant73 commented Sep 14, 2023

UPDATE: if you use newDataAvailable() to decide if calling or not Serial_Get() you also need to update wIndex inside newDataAvailable otherwise Serial_Get will be invoked only when wIndex is updated by the interrupt handler (and this is not what you need for reading a big output like for M43). But if you put wIndex code in newDataAvailable it is possibly reducing the performance benefit so it could be more convenient to move the update and check directly on Serial_Get. That seems to me also a more readable solution overall

You are very right, like always :D You could probably guess that my code looks like this, but I didn't want to push that on you and everyone...

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  dmaL1DataRX[portIndex].wIndex = Get_wIndex(portIndex); // update wIndex, DMA is reading and storing serial data in the background
  return (dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex);
}

The updating of the flag is done in Serial_Get when new data is reported. It doesn't matter if the new data was reported by the interrupt or by updating wIndex manually. This works both for complete and incomplete messages... that is how you coded it... This code is already running on my TFT, working fine

If possible please, try the following implementation and verify the performance in particular under load (so receiving messages) more than on idle (not so much relevant IMHO). The code should provide better results in particular for reading mid (e.g. temp ACK) and long messages (e.g. output lines for M43 most of the time even longer than 250 chars).
Serial_GetWritingIndex() function is your Get_wIndex() function

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  dmaL1Data[portIndex].wIndex = Serial_GetWritingIndex(portIndex);

  if (dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex)  // if no data to read from L1 cache
    return 0;

  // wIndex: make a static access to dynamically changed (by L1 cache's interrupt handler) variables/attributes
  //
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1Data[portIndex];
  uint16_t wIndex = dmaL1Data_ptr->wIndex;
  uint16_t flag = dmaL1Data_ptr->flag;
  uint16_t cacheSize = dmaL1Data_ptr->cacheSize;
  char * cache = dmaL1Data_ptr->cache;

  while (cache[flag] != '\n' && flag != wIndex)  // check presence of "\n", if any
  {
    flag = (flag + 1) % cacheSize;
  }

  if (flag != wIndex)  // if "\n" was found, proceed with data copy
  {
    // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
    uint16_t rIndex = dmaL1Data_ptr->rIndex;

    while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
    {
      rIndex = (rIndex + 1) % cacheSize;
    }

    // tailEnd: last index on upper part of L1 cache
    // headStart: first index on lower part of L1 cache, if any is needed
    // msgSize: message size. Last +1 is for the terminating null character "\0" (code is optimized by the compiler)
    //
    uint16_t tailEnd = (rIndex <= flag) ? flag: cacheSize - 1;
    uint16_t headStart = (rIndex <= flag) ? flag + 1 : 0;
    uint16_t msgSize = (tailEnd - rIndex + 1) + ((headStart > flag) ? 0 : flag + 1) + 1;

    // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
    //
    // NOTE: the following check should never be matched if buf has a proper size or there is no reading error.
    //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
    //
    if (bufSize < msgSize)
      goto skip_copy;

    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

  skip_copy:
    // update queue's custom flag and reading index with next index
    dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

    return msgSize;  // return the number of bytes stored in buf
  }

  // if here, a partial message is present on the L1 cache (message not terminated by "\n").
  // We temporary skip the message until it is fully received updating also dmaL1Data_ptr->flag to
  // prevent to read again (multiple times) the same partial message on next function invokation

  // update queue's custom flag with flag (also equal to wIndex)
  dmaL1Data_ptr->flag = flag;

  return 0;  // return the number of bytes stored in buf
}

@rondlh
Copy link
Author

rondlh commented Sep 15, 2023

Your algorithm looks great, I like it! (please get rid of the "goto").
Using the flag to prevent scanning the same data over and over again, and then to pinpoint the end of the message is quite smart and efficient. It seems like the best of 2 worlds (interrupt vs. manual update of wIndex).

Do you want me to benchmark this code vs. the current Serial_Get? There is no doubt that this implementation is significantly faster and more efficient because Serial_Get doesn't do unneeded work anymore and it becomes aware of new complete messages significantly earlier.

The situation where Serial_Get is called when there is NO new data available (wIndex not changed) happens literally 10 to 100x more often than Serial_Get actually receives new data. So improving the case where no new data is available will save a lot of MCU cycles and allows for higher scanrates and thus a faster overall response. This is especially true when using the interrupt to update wIndex. It's still true (to a lesser extend) when wIndex is updated manually, this is because only very little data is received. So first checking a condition before making a function call that is exited after doing the same check is significantly faster. Making function calls is an expensive operation considering MCU cycles.

@digant73
Copy link
Contributor

digant73 commented Sep 15, 2023

yes please benchmark the code and eventually make the changes you want. I know idle state is more often than busy state but overall on idle state the TFT has nothing to do so it is not so much important to me (I would prefer a more compact code as we did with all other functionalities). Of course if you make the check of new available bytes before calling Serial_Get() you must add also the time spent on that check in the stats

@rondlh
Copy link
Author

rondlh commented Sep 15, 2023

If your idle state is faster, then you arrive at your busy state earlier.

OK, I can test it.

1980's code:

if (bufSize < msgSize)
      goto skip_copy;

    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

  skip_copy:

A bit nicer code would be:

if (bufSize >= msgSize)
{
    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

 }

@digant73
Copy link
Contributor

digant73 commented Sep 15, 2023

sure in the code cleanup can be applied (I used the goto simply because I could remove the check simply commenting out the lines as also reported in the inline comment without any change on code indentation)

EDIT:

I would use (in case bufSize is < msgSize 0 must also be ret:urned after flag and rIndex have been also updated)

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  dmaL1Data[portIndex].wIndex = Serial_GetWritingIndex(portIndex);

  if (dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex)  // if no data to read from L1 cache
    return 0;

  // wIndex: make a static access to dynamically changed (by L1 cache's interrupt handler) variables/attributes
  //
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1Data[portIndex];
  uint16_t wIndex = dmaL1Data_ptr->wIndex;
  uint16_t flag = dmaL1Data_ptr->flag;
  uint16_t cacheSize = dmaL1Data_ptr->cacheSize;
  char * cache = dmaL1Data_ptr->cache;

  while (cache[flag] != '\n' && flag != wIndex)  // check presence of "\n", if any
  {
    flag = (flag + 1) % cacheSize;
  }

  if (flag != wIndex)  // if "\n" was found, proceed with data copy
  {
    // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
    // tailEnd: last index on upper part of L1 cache
    // headStart: first index on lower part of L1 cache, if any is needed
    // msgSize: message size. Last +1 is for the terminating null character "\0" (code is optimized by the compiler)
    //
    uint16_t rIndex = dmaL1Data_ptr->rIndex;
    uint16_t tailEnd;
    uint16_t headStart;
    uint16_t msgSize;

    while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
    {
      rIndex = (rIndex + 1) % cacheSize;
    }

    if (rIndex <= flag)
    {
      tailEnd = flag;
      headStart = flag + 1;
      msgSize = (tailEnd - rIndex + 1) + 1;
    }
    else
    {
      tailEnd = cacheSize - 1;
      headStart = 0;
      msgSize = (tailEnd - rIndex + 1) + (flag + 1) + 1;
    }

    // update queue's custom flag and reading index with next index
    dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

    // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
    //
    // NOTE: the following check should never be matched if buf has a proper size and there is no reading error.
    //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
    //
    if (bufSize < msgSize)
      return 0;

    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

    return msgSize;  // return the number of bytes stored in buf
  }

  // if here, a partial message is present on the L1 cache (message not terminated by "\n").
  // We temporary skip the message until it is fully received updating also dmaL1Data_ptr->flag to
  // prevent to read again (multiple times) the same partial message on next function invokation

  // update queue's custom flag with flag (also equal to wIndex)
  dmaL1Data_ptr->flag = flag;

  return 0;  // return the number of bytes stored in buf
}

@rondlh
Copy link
Author

rondlh commented Sep 15, 2023

Very nice!
I can test how much time it takes for this function to process a "ok\n" and ADVANCED_OK message, compared to the current code. Should the message arrive character by character or in one go?

I'm a bit confused by the use of "SERIAL_PORT_INDEX portIndex" and "uint8_t port". It seems to be the same thing.
So this should be Serial_GetWritingIndex?

static inline uint16_t Serial_GetWritingIndex(uint8_t port) 
{
  return dmaL1Data[port].cacheSize - Serial[port].dma_stream->NDTR;
}

@digant73
Copy link
Contributor

Very nice! I can test how much time it takes for this function to process a "ok\n" and ADVANCED_OK message, compared to the current code. Should the message arrive character by character or in one go?

it should be as it is sent by mainboard and received by TFT (so one go)

I'm a bit confused by the use of "SERIAL_PORT_INDEX portIndex" and "uint8_t port". It seems to be the same thing. So this should be Serial_GetWritingIndex?

static inline uint16_t Serial_GetWritingIndex(uint8_t port) 
{
  return dmaL1Data[port].cacheSize - Serial[port].dma_stream->NDTR;
}

Yes, defined in Serial.h. Leave the type as the are now (SERIAL_PORT_INDEX in SerialConnection API and uint8_t in Serial API)

@rondlh
Copy link
Author

rondlh commented Sep 15, 2023

Here the benchmarks, 100K runs (STM32F207 @ 120MHz)
I have to use some tricks and simplifications to get some numbers, but this should be within 10% of the actual situation.

CURRENT SERIAL_GET
Message = "ok\n" ---> 119ms
Message = "ok P15 B7\n" ---> 266ms

NEW SERIAL_GET
Message = "ok\n": ---> 141ms
Message = "ok\n", flag points to "k" ---> 73ms (flag = rIndex + 2)
Message = "ok\n", flag points to "\n" ---> 53ms (flag = rIndex + 3)

Message = "ok P15 B7\n" ---> 305ms
Message = "ok P15 B7\n", flag point to "7" ---> 120ms (flag = rIndex + 9)
Message = "ok P15 B7\n", flag point to "\n" ---> 83ms (flag = rIndex + 10)

Conclusions:
When using the Serial Idle interrupt to check for new messages the current algorithm will be slightly faster (15%) than the new one.
When manually updating wIndex to allow flag to adjust to incoming data, the new algorithm will be about 2 to 3 times faster to respond when the message is complete.

@digant73
Copy link
Contributor

ok, many thanks for testing.
Unfortunately even with the changes, the new Serial_Get() is slower than the original one. If it was possible to program the interrupt handler to catch \n we will be able to obtain a perfect result. I will try to find a better Serial_Get()

@rondlh
Copy link
Author

rondlh commented Sep 15, 2023

Perhaps I don't understand what you really care about...
Could you please define your goals?

The busy state of Serial_Get takes a maximum of 1ms (max 400 commands/s), while the idle state takes 10x that amount of time. I showed you how to safe that time, but you say it doesn't matter.
The new algorithm is 15% slower when using the Serial Idle interrupt to detect new data, but if you check for new data every scan then it is 2-3 times faster in responding to new data. To achieve this it only has to update the flag when the message is incomplete.

Apart from that, the Serial idle interrupt is always 1 frame time behind, because the interrupt only comes after 1 idle frame (the time of 1 serial character). At 250K baud this means that the "slow" algorithm could have already run 20 times before the "fast" algorithm even starts.

So if you care about response time then this is the way to go, and you also prevent buffer overruns at the same time.

Using interrupt is possible of course, but it is not going to change much, because on the other side (loopProcess) is just polling for new data anyway. So the ISR could check for '\n', but so could the loopProcess, and that way we don't introduce another hesitation bug.

@rondlh
Copy link
Author

rondlh commented Sep 16, 2023

NEW SERIAL_GET
Message = "ok P15 B7\n" ---> 305ms
Message = "ok P15 B7\n", flag point to "7" ---> 120ms (flag = rIndex + 9)
Message = "ok P15 B7\n", flag point to "\n" ---> 53ms (flag = rIndex + 10)

UPDATE
New algorithm when replacing uint16_t with uint32_t (7 times)
Message = "ok P15 B7\n" ---> 278ms (the same as the current Serial_Get)

Additionally
Convert DMA_CIRCULAR_BUFFER to uint32_t
Message = "ok P15 B7\n" ---> 274ms

@digant73
Copy link
Contributor

Perhaps I don't understand what you really care about... Could you please define your goals?

The busy state of Serial_Get takes a maximum of 1ms (max 400 commands/s), while the idle state takes 10x that amount of time. I showed you how to safe that time, but you say it doesn't matter. The new algorithm is 15% slower when using the Serial Idle interrupt to detect new data, but if you check for new data every scan then it is 2-5 times faster in responding to new data. To achieve this it only has to update the flag when the message is incomplete.

It seems I didn't understand your previous report. If we rely on serial idle interrupt, the current algorithm was expected to be faster

Apart from that, the Serial idle interrupt is always 1 frame time behind, because the interrupt only comes after 1 idle frame (the time of 1 serial character). At 250K baud this means that the "slow" algorithm could have already run 20 times before the "fast" algorithm even starts.

So if you care about response time then this is the way to go, and you also prevent buffer overruns at the same time.

Using interrupt is possible of course, but it is not going to change much, because on the other side (loopProcess) is just polling for new data anyway. So the ISR could check for '\n', but so could the loopProcess, and that way we don't introduce another hesitation bug.

If the interrupt handler could be programmed to be invoked when \n is read from serial line (instead of waiting for the idle frame) there will be no more 1 frame delay and we could even maintain the current faster algorithm (and even speed up it more) and of course preventing buffer overruns (one goal missing in the current algorithm)

@rondlh
Copy link
Author

rondlh commented Sep 16, 2023

When I say "current algorithm" I mean the current official released code.

So what are the objective goals here? I thought this was about lowering the response time, but it seems that is not the goal here.

If the interrupt handler could be programmed to be invoked when \n is read from serial line (instead of waiting for the idle frame) there will be no more 1 frame delay and we could even maintain the current faster algorithm (and even speed up it more) and of course preventing buffer overruns (one goal missing in the current algorithm)

That is not possible, but you could do an interrupt per received character and report when '\n' is found. Of course this would be less efficient than using DMA.

@digant73
Copy link
Contributor

digant73 commented Sep 16, 2023

to avoid buffer overruns (as you also experienced with M43) and possibly improve performance on Serial_Get and response time.
In case the interrupt handler cannot be programmed to intercept \n, ok it seems we cannot improve the logic more. We could propose the new algorithm.
In case the interrupt handler can be programmed to intercept \n, we can even obtain better results with less code (e.g. in Serial_Get)

EDIT: Ok, based on your last post, we can exclude the possibility of an interrupt handler to intercept \n

@rondlh
Copy link
Author

rondlh commented Sep 16, 2023

  • Goals
  1. prevent buffer underruns
  2. improved Serial_Get performance

Both goals are achieved by your new algorithm.

Note that the serial data is received slower than the polling of Serial_Get.
This means that Serial_Get will let the flag walk along with the received data, and once '\n' is found is response 2-5 times faster than the old algorithm.

A close to optimal algorithm could be like this:

  1. Check if new data is available (flag!=wIndex), this is the only thing that needs to be done in the loopProcess (very efficient!).
  2. Check for '\n'
  3. Found --> call 'Serial_Get`
  4. Not found --> update flag

This is efficient and will allow for a significant increase in the scanrate and efficiency.

So in the newDataAvailable, which is run in the loop process:

  1. Check for flag!=wIndex
  2. Check for '\n'

I can test how much the scanrate will increase if you do this, I expect a significant increase, and efficiency is best like this too.

@digant73
Copy link
Contributor

ok good. waiting for the results

@rondlh
Copy link
Author

rondlh commented Sep 22, 2023

2% speed gain by only changing the serial port to 32-bit is useful is you ask me, especially considering it's a tiny change with no risk. It would be best if there was a consistent serial port type. Then the choice would be easy...
The official recommendation is not to use the smallest type your data fits in, but use 32-bit most of the time.
Do you have some good tools to do a global replacement in the code? I didn't make the changes manually.
Anyway, this is not the most urgent topic right now. Please check #2824... I found a potential issue.

Let me retest the effect of using 32-bit variables on Serial_Get again, to confirm I got it right the first time.

@digant73
Copy link
Contributor

digant73 commented Sep 23, 2023

I made some other optimizations in #2840. With wifi enabled it is now 180K on idle and 152K on printing (with wifi off it is 200K and 165K). Same results for interrupt based and DMA based.
But I also made a comparison with the old version (unbuffered) enabling DEFAULT_WRITE in Serial.c and I get the same values.
Is there any explanation for that? Compared to old version (unbuffered), I was expecting better results for interrupt and even more for DMA based writing.

EDIT: on STM32F10x I got basically half the performance than STM32F2. With wifi enabled it is now 90K on idle and 74K on printing (with wifi off it is 99K and 80K). Even in this case same result as for old version (DMA version is not yet implemented)

@rondlh
Copy link
Author

rondlh commented Sep 25, 2023

I made some other optimizations in #2840. With wifi enabled it is now 180K on idle and 152K on printing (with wifi off it is 200K and 165K). Same results for interrupt based and DMA based. But I also made a comparison with the old version (unbuffered) enabling DEFAULT_WRITE in Serial.c and I get the same values. Is there any explanation for that? Compared to old version (unbuffered), I was expecting better results for interrupt and even more for DMA based writing.

EDIT: on STM32F10x I got basically half the performance than STM32F2. With wifi enabled it is now 90K on idle and 74K on printing (with wifi off it is 99K and 80K). Even in this case same result as for old version (DMA version is not yet implemented)

Thanks for the speed comparison. I will checkout what you did in #2840 later.

The more data you send, the more benefit buffered (interrupt/DMA) writing will give. Unbuffered (the old way) writing is blocking, this means that if you send a line of gcode, you will have to wait until the last character in the gcode ('\n') is copied to the serial port. If you are printing the first layer (low speed) then you will not see much difference because there is not much data to send. But try an small oval in vase mode without ARC welder enabled, then the difference is clear. So basically buffered writing helps when it counts most, when the printer/TFT is busy.

Buffered writing via Interrupt or DMA is about the same in speed. The MCU can handle interrupts easily without noticeable speed loss. Still DMA is clearly more efficient because it's a background process done in dedicated hardware and only needs to be setup once and then only triggers 1 interrupt (sometimes 2, if the gcode line is split). Fewer interrupts is obviously better, but from a speed point of view there is not much difference. For Marlin the TFT sends lines of codes (Serial_PutChar is never used) so DMA writing is a good match.

BUG REPORT: I made a silly mistake with txBytes. txBytes should not be an array, because the serial ports are already the array. serial.c. This also affect the initialization, there txBytes needs to be added. I will post the full serial.c later for reference.

Corrected code at start of serial.c

// config for USART DMA channels
typedef struct
{
         USART_TypeDef *uart;   // uint32_t
         uint32_t dma_rcc;
         uint32_t dma_channel;
         DMA_Stream_TypeDef *dma_streamRX; // uint32
         DMA_Stream_TypeDef *dma_streamTX; // uint32
volatile uint32_t txBytes;
} SERIAL_CFG;

static SERIAL_CFG Serial[_UART_CNT] = {
// USART   DMA1 or DMA2      Channel RX_STREAM    TX_STREAM     txBytes
  { USART1, RCC_AHB1Periph_DMA2, 4, DMA2_Stream2, DMA2_Stream7, 0 },
  { USART2, RCC_AHB1Periph_DMA1, 4, DMA1_Stream5, DMA1_Stream6, 0 },
  { USART3, RCC_AHB1Periph_DMA1, 4, DMA1_Stream1, DMA1_Stream3, 0 },
  { UART4,  RCC_AHB1Periph_DMA1, 4, DMA1_Stream2, DMA1_Stream4, 0 },
  { UART5,  RCC_AHB1Periph_DMA1, 4, DMA1_Stream0, DMA1_Stream7, 0 },
  { USART6, RCC_AHB1Periph_DMA2, 5, DMA2_Stream1, DMA2_Stream6, 0 },
};

And replace: Serial[port].txBytes[port] with Serial[port].txBytes

@digant73
Copy link
Contributor

Ok,
In Monitoring menu, before loopProcess() invokation I put the following line:

// mustStoreCmd("M118 P0 A1 test\n");
// mustStoreCmd("M118 P0 A1 test test test test test test test test test test test test test test test test test\n");
mustStoreCmd("M43\n");

Just to load the system with different kind of gcode/ACK.
The first msg is for a short and balanced load (short gcode and short ACK).
The second msg is for a long and balanced load (long gcode and long ACK).
The third msg is for a short gcode and very long ACK load.

Even with those load scenarios I get the same values for both unbuffered and DMA/Interrupt based writing.

@rondlh
Copy link
Author

rondlh commented Sep 25, 2023

I just noticed that a well know saboteur is at it again, he is repeatedly force-pushing his "own" DMA write code.
He's so proud of his work that he blocks comments, so I cannot inform him where the bugs are, what a fool!
He basically took my code and made it slower just so he can claim it's his code. He has the STMF1 and GD32F2 code already.
That will save me some work in looking up the register names on the different platforms.

@digant73
Copy link
Contributor

Ohh I also saw it now. Since some time he banned me so I can't also provide any kind of feedback

@rondlh
Copy link
Author

rondlh commented Sep 25, 2023

OK, interesting, any number on how many bytes per second are send?
You can basically calculate how much MCU time you lose like this:

Serial write wait time = (10 x bytes send per second / baudrate).
So if you send 50 gcodes of 40 bytes, 2K Bytes/s, at 115200 baud, you will waste about 20000/115200 = 17%.
(At 250000 baud it would only be 8%).
Scan rate will drop with the same amount.

I tested like this:
In Cura drop a cylinder, convert to oval with base of 30x60mm.

  1. Disable arc welder
  2. enable vase mode.
    Print it, and turn up the speed slowly to 600% then you can see the scan rate drop considerably...
    If you want more effect lower the baudrate to 115200.

NOTE: When testing, I usually do not put filament, and lower the temp of bed and nozzle.
To stop the movement, and get more range to push the speed, I send a "M906 X50 Y50 E50" to stop the movement.
(TMC Driver set current, assuming you have TMC drivers installed)

BTW: Did you know that sounds can pause the printing?

@digant73
Copy link
Contributor

digant73 commented Sep 25, 2023

I saved the stats only for STM32F10x just yesterday. Below the results:

wifi ON

OLD      NEW       PATTERN
88K      88K       storeCmd("M118 P0 A1 test\n");
88K      88K       storeCmd("M118 P0 A1 test test test test test test test test test test test test test test test test test\n");
63K      63K       storeCmd("M43\n");

90K      90K       idle
74K      74K       printing


For the sound, I read one of your post some days ago. If there is a fix I can add it on one of my open PRs

@rondlh
Copy link
Author

rondlh commented Sep 25, 2023

If this is for buffered write then it makes sense. That's the whole purpose of buffered writing.
If you are using standard write, then it doesn't make sense, seems no loss of performance.
Because of this I added the info to your monitoring screen... so I don't get confused.
Quick testing here... hold on.

@digant73
Copy link
Contributor

digant73 commented Sep 25, 2023

Hmm, M118 P0 A1 test\n etc... is also replied by mainboard with ACK "test" that will be forwarded towards all the active serial ports (e.g. WiFi). In that case I will load both RX and TX interfaces and I would expect benefits from DMA/Interrupt implementation

@digant73
Copy link
Contributor

digant73 commented Sep 25, 2023

Oh I forgot also a question about the access to Serial[port].dma_streamRX->NDTR used to detect wIndex by the formula:

dmaL1DataRX[port].cacheSize - Serial[port].dma_streamRX->NDTR

Shall dma_streamRX be defined as volatile because NDTR is updated in the background by RX DMA?

@rondlh
Copy link
Author

rondlh commented Sep 25, 2023

Quick cylinder test (115200 baud):
Idle: 113K scans.
Printing 100% speed 93k-95k
300% speed 81k-85k
500% speed 70k

My conclusion: You are probably testing in buffered mode.

About dma_streamRX->NDTR:
I don't think "volatile" is needed, because every time Serial_GetWritingIndex is called the "fresh" value of NDTR is loaded. That will work fine without "volatile". It only becomes relevant if changes of the value inside your algorithm matter, which is not the case here. On the other side, adding volatile will not cause any harm because we always need the latest value anyway. So do whatever you think is best.

I would recommend a name change:
Serial_GetWritingIndex ---> Serial_GetWritingIndexRX
Serial_GetReadingIndex ---> Serial_GetReadingIndexRX

To stop me from getting confused too much

@digant73
Copy link
Contributor

digant73 commented Sep 25, 2023

Quick cylinder test (115200 baud): Idle: 113K scans. Printing 100% speed 93k-95k 300% speed 81k-85k 500% speed 70k

My conclusion: You are probably testing in buffered mode.

About dma_streamRX->NDTR: I don't think "volatile" is needed, because every time Serial_GetWritingIndex is called the "fresh" value of NDTR is loaded. That will work fine without "volatile". It only becomes relevant if changes of the value inside your algorithm matter, which is not the case here. On the other side, adding volatile will not cause any harm because we always need the latest value anyway. So do whatever you think is best.

I would recommend a name change: Serial_GetWritingIndex ---> Serial_GetWritingIndexRX Serial_GetReadingIndex ---> Serial_GetReadingIndexRX

To stop me from getting confused too much

Ok, I pushed the changes you asked for.

Try this simple test putting the following code just before loopProcess() in the Monitoring menu:

    if (!isFullCmdQueue())  // needed to avoid the busy message continuously redrawn on the menu
//      mustStoreCmd("M118 P0 A1 test\n");
//      mustStoreCmd("M118 P0 A1 test test test test test test test test test test test test test test test test test\n");
//      mustStoreCmd("M43\n");

Just uncomment the pattern you want to use, recompile the fw and flash it on TFT.
Move on the Monitoring menu and see the scan rate (you will also have always a full command queue).
If you have a WiFi connected see what you receive.

Also. If I send an M43 from WiFi I'm not able to receive a completely correct response. Just repeat the command and see if you get the same reply or not

@rondlh
Copy link
Author

rondlh commented Sep 25, 2023

OK, I'm not sure what you are trying to prove, but let me check...
(DMA WRITING MODE, 115200 baud to host, 250000 to ESP3D)

No lines added: 0 0 7 113K (Buffered gcode, pending gcodes, free tx, scan rate)
mustStoreCmd("M118 P0 A1 test\n") --> 20 7 0 113K some "test" on ESP3D
mustStoreCmd("M118 P0 A1 test test ... --> 20 7 0 114K lots of "test" on ESP3D
mustStoreCmd("M43\n") --> 20 7 0 117K, output is showing and is correct (complete)

Buzzer is triggered in these modes! Do you also see that?

Manually enter "M43" in ESP3D (no lines added in monitoring), works fine for me, message is correct and complete.


I see you did a lot of work on the serial writing code...
So the STM32F1 interrupt code is done already?
We still need the GD32F2 code? Interrupt first? Then DMA?


I just checked your Serial_Get code...

  #ifdef USE_INLINE_COPY
    while (rIndex <= cacheSize)
    {
      *(buf++) = cache[rIndex++];
    }

    rIndex = 0;
    uint32_t maxIndex = bufSize - (cacheSize - dmaL1Data_ptr->rIndex);  // used dmaL1Data_ptr->rIndex and not rIndex

    while (rIndex < maxIndex)
    {
      *(buf++) = cache[rIndex++];
    }
  #else
    memcpy(buf, &cache[rIndex], cacheSize - rIndex);
    buf += cacheSize - rIndex;

    memcpy(buf, cache, flag + 1);
    buf += flag;
  #endif

I think uint32_t maxIndex = bufSize - (cacheSize - dmaL1Data_ptr->rIndex); // used dmaL1Data_ptr->rIndex and not rIndex

can be simplified to: flag + 1, flag points to the first '\n'.

And

rIndex = 0;
while...
  *(buf++) = cache[rIndex++];

is think this is the same as:

*(buf++) = *(cache++)

I benchmarked these copy modes, and your method is about 25% faster for both 3 and 10 byte messages.


How about a KPI: RX/TX bytes/s: 123/345

@digant73
Copy link
Contributor

digant73 commented Sep 26, 2023

OK, I'm not sure what you are trying to prove, but let me check... (DMA WRITING MODE, 115200 baud to host, 250000 to ESP3D)

No lines added: 0 0 7 113K (Buffered gcode, pending gcodes, free tx, scan rate) mustStoreCmd("M118 P0 A1 test\n") --> 20 7 0 113K some "test" on ESP3D mustStoreCmd("M118 P0 A1 test test ... --> 20 7 0 114K lots of "test" on ESP3D mustStoreCmd("M43\n") --> 20 7 0 117K, output is showing and is correct (complete)

OK, and what are the results with unbuffered and interrupt based writing? Are those the same as for DMA mode?
In my printers I set 250K baudrate with mainboard and 500K with WiFi. With M43 I'm not able to get a correct output. I will try reducing mainboard connection to 115K.

Buzzer is triggered in these modes! Do you also see that?

I do not hear any sound. Do you mean the current buzzer implementation (loopBuzzer etc...) is affecting/invalidating the scan rate?

Manually enter "M43" in ESP3D (no lines added in monitoring), works fine for me, message is correct and complete.

Ok, as I said I'm not able to get a complete correct output. I will also try reducing the mainboard connection to 115K.

I see you did a lot of work on the serial writing code... So the STM32F1 interrupt code is done already? We still need the GD32F2 code? Interrupt first? Then DMA?

Yes, for STM31F10x the interrupt mode is working (tested in my TFT). The code is the same as in STM32F2_F4. Possibly, the DMA mode is also similar or equal to STM32F2_F4.
Yes GD32 has to be implemented. It's up to you what mode you want to provide first. Interrupt mode should be more easy to implement.

I just checked your Serial_Get code...

  #ifdef USE_INLINE_COPY
    while (rIndex <= cacheSize)
    {
      *(buf++) = cache[rIndex++];
    }

    rIndex = 0;
    uint32_t maxIndex = bufSize - (cacheSize - dmaL1Data_ptr->rIndex);  // used dmaL1Data_ptr->rIndex and not rIndex

    while (rIndex < maxIndex)
    {
      *(buf++) = cache[rIndex++];
    }
  #else
    memcpy(buf, &cache[rIndex], cacheSize - rIndex);
    buf += cacheSize - rIndex;

    memcpy(buf, cache, flag + 1);
    buf += flag;
  #endif

I think uint32_t maxIndex = bufSize - (cacheSize - dmaL1Data_ptr->rIndex); // used dmaL1Data_ptr->rIndex and not rIndex

can be simplified to: flag + 1, flag points to the first '\n'.

And

rIndex = 0;
while...
  *(buf++) = cache[rIndex++];

is think this is the same as:

*(buf++) = *(cache++)

I benchmarked these copy modes, and your method is about 25% faster for both 3 and 10 byte messages.

Sure, correct. I will apply the changes. Yes I tested a lot that code and I always got better performance than with memcpy

How about a KPI: RX/TX bytes/s: 123/345

Do you mean total read and written bytes per second or total read and written bytes for the main serial port (mainboard)?
I was also thinking to add a R and W index for the RX cache for the main serial port.

@rondlh
Copy link
Author

rondlh commented Sep 26, 2023

Buzzer is triggered in these modes! Do you also see that?

I do not hear any sound. Do you mean the current buzzer implementation (loopBuzzer etc...) is affecting/invalidating the scan rate?

Perhaps you have sound disable. I have a little mod of the sound system, where if I disable the sound, I still get the keypress sound. unless I also have disabled the keypress sound. I believe the sound is because there is a warning when you have too many (20) commands queued. I don't think the sounds affects anything, that's not my point. You just scared me a bit because I have the event triggered sounds under test.

I didn't test for unbuffered and interrupt writing yet, interrupt should be very close to DMA.
M43 should work if Wifi serial speed => host serial speed, I previously tested it at 250K/250K, and that worked fine.
So is your issue that you cannot get the M43 correctly? Or that you don't see performance degradation when using unbuffered writing?

Do you mean total read and written bytes per second or total read and written bytes for the main serial port (mainboard)?
I was also thinking to add a R and W index for the RX cache for the main serial port.

Right, it's about host communication, I don't care about the Wifi communication.
Also commands per second would be interesting.

I have Interrupt serial writing working for GD32F2... I'm working on DMA now...

@digant73
Copy link
Contributor

digant73 commented Sep 26, 2023

Buzzer is triggered in these modes! Do you also see that?

I do not hear any sound. Do you mean the current buzzer implementation (loopBuzzer etc...) is affecting/invalidating the scan rate?

Perhaps you have sound disable. I have a little mod of the sound system, where if I disable the sound, I still get the keypress sound. unless I also have disabled the keypress sound. I believe the sound is because there is a warning when you have too many (20) commands queued. I don't think the sounds affects anything, that's not my point. You just scared me a bit because I have the event triggered sounds under test.

I didn't test for unbuffered and interrupt writing yet, interrupt should be very close to DMA. M43 should work if Wifi serial speed => host serial speed, I previously tested it at 250K/250K, and that worked fine. So is your issue that you cannot get the M43 correctly? Or that you don't see performance degradation when using unbuffered writing?

strange we have different result in the same HW (BTT TFT35 V3), If possible (now that also the advanced ok PR is merged), download entirely my PR and make the test. I also tried defining a RX cache of 16K (the output of M43 is about 12K in my TFT) and even with that I'm not able to receive a complete (without some missing or other previously received output) output for M43. That is very strange and I'm adding other KPI in Monitoring menu also to verify W and R indexes.

Do you mean total read and written bytes per second or total read and written bytes for the main serial port (mainboard)?
I was also thinking to add a R and W index for the RX cache for the main serial port.

Right, it's about host communication, I don't care about the Wifi communication. Also commands per second would be interesting.

Yes I was also thinking to add commands per second (very useful).

I have Interrupt serial writing working for GD32F2... I'm working on DMA now...

Wow ultra fast! Oh I forgot to ask you to check the changes I made on _USART6 on void Serial_DMAClearITflags. Are the changes OK or wrong (e.g. DMA2->HIFCR = (0x3F << 16);)?

@rondlh
Copy link
Author

rondlh commented Sep 26, 2023

strange we have different result in the same HW (BTT TFT35 V3), If possible (now that also the advanced ok PR is merged), download entirely my PR and make the test. I also tried defining a RX cache of 16K (the output of M43 is about 12K in my TFT) and even with that I'm not able to receive a complete (without some missing or other previously received output) output for M43. That is very strange and I'm adding other KPI in Monitoring menu also to verify W and R indexes.

Great to see ADVANCED_OK is finally here, great job!
I have a rearranged frontend, backend, that might affect things a bit. I will do some specific test for M43 later (should work), and check the scanrates with unbuffered and interrupt serial writing and your special monitoring lines.

You did not change DMA2->HIFCR = (0x3F << 16); ... you changed the line above it. That's perfectly correct, more consistent that way.

In the same function, I wrote (a few times): (Hal\stm32f2_f4xx\serial.c)

// Channel to bits: Low  0: 0-5, 1: 6-11, 2: 16-21, 3: 22-27
// Channel to bits: High 4: 0-5, 5: 6-11, 6: 16-21, 7: 22-27

It should say

// Stream # to bits: Low  0: 0-5, 1: 6-11, 2: 16-21, 3: 22-27
// Stream # to bits: High 4: 0-5, 5: 6-11, 6: 16-21, 7: 22-27

This function is not needed:
Serial_DMAClearITflags

@digant73
Copy link
Contributor

strange we have different result in the same HW (BTT TFT35 V3), If possible (now that also the advanced ok PR is merged), download entirely my PR and make the test. I also tried defining a RX cache of 16K (the output of M43 is about 12K in my TFT) and even with that I'm not able to receive a complete (without some missing or other previously received output) output for M43. That is very strange and I'm adding other KPI in Monitoring menu also to verify W and R indexes.

Great to see ADVANCED_OK is finally here, great job! I have a rearranged frontend, backend, that might affect things a bit. I will do some specific test for M43 later (should work), and check the scanrates with unbuffered and interrupt serial writing and your special monitoring lines.

You did not change DMA2->HIFCR = (0x3F << 16); ... you changed the line above it. That's perfectly correct, more consistent that way.

right, I was referring to the line above

In the same function, I wrote (a few times): (Hal\stm32f2_f4xx\serial.c)

// Channel to bits: Low  0: 0-5, 1: 6-11, 2: 16-21, 3: 22-27
// Channel to bits: High 4: 0-5, 5: 6-11, 6: 16-21, 7: 22-27

It should say

// Stream # to bits: Low  0: 0-5, 1: 6-11, 2: 16-21, 3: 22-27
// Stream # to bits: High 4: 0-5, 5: 6-11, 6: 16-21, 7: 22-27

Ok, changed in the new update.
Also added new KPIs in Monitoring menu and the optimization on Serial_Get

Serial_DMAClearITflagsRX

This function is not needed: Serial_DMAClearITflags

Hmm I see that Serial_DMAClearITflagsRX is also not used at all. Is it needed somewhere in the code?

@rondlh
Copy link
Author

rondlh commented Sep 26, 2023

Hmm I see that Serial_DMAClearITflagsRX is also not used at all. Is it needed somewhere in the code?

It's in (or should be in):

  1. Serial_DMA_Config
  2. Serial_DeConfig

Actually, I think it's not needed, because it's never reset during the reading process, but leave it in please just to be safe.
GD32F2 all write modes supported (under test).
gd32f20x Serial.zip
stm32f2_f4xx Serial.zip
stm32f10x Serial.zip

UPDATE: I made some small changes, flag can be used in TX too, to replace txBytes. Here the AIOs.
UPDATE2: flag must be volatile

@digant73
Copy link
Contributor

digant73 commented Sep 27, 2023

Hmm I see that Serial_DMAClearITflagsRX is also not used at all. Is it needed somewhere in the code?

It's in (or should be in):

  1. Serial_DMA_Config
  2. Serial_DeConfig

Actually, I think it's not needed, because it's never reset during the reading process, but leave it in please just to be safe. GD32F2 all write modes supported (under test).

Yes, in those methods you originally provided the DMA RX clear and I replaced it with the DMA TX and RX clear (just to clear both).
Consider that (with the exception of main serial port) all the supplementary ports can be enabled/disabled form the Connection menu (Serial_DMA_Config and Serial_DeConfig are invoked). So it should be more safe to clear both TX and RX. Currently the Serial_DMAClearITflagsRX is no more invoked. By the way I would maintain it just to have a reference code. Obviously, not being used that function is not linked in the binary (so no waste of flash).

UPDATE: I want to make some small changes, flag can be used in TX too, to replace txBytes.

sure, make your changes and let me know

BTW: Do you know how to report a liar and saboteur?

I think you can simply provide BTT your review/opinion on our PR referring to this FR. Pretty much sure BTT reads all posts (even this one) and is probably aware although they are apparently not present.

@rondlh
Copy link
Author

rondlh commented Sep 27, 2023

Somebody is really not acting in the interest of everyone here, repeatedly boldly lying and worst of all, potentially harming the source code base we are nurturing.

Did you find the updates? What's next? STMF1 DMA write?

Test results:
250K/250K baud DMA Mode, BTT TFT35V3.0
M43 response is 8.3KB
Cylinder test (lots of small moves)
Idle Scan rate: 116K
M43 output looks ok, works while printing at 500%, but causes movement hesitations
Printing scan rate: 113K
Printing 500% speed scan rate: 112K
This is great, exactly what I was looking for :D (can we discuss on WhatsApp)?

@digant73
Copy link
Contributor

Somebody is really not acting in the interest of everyone here, repeatedly boldly lying and worst of all, potentially harming the source code base we are nurturing.

Did you find the updates? What's next? STMF1 DMA write?

Yes it seems STM32F10x DMA is the last one

Test results: 250K/250K baud DMA Mode M43 response is 8.3KB Cylinder test (lots of small moves) Idle Scan rate: 116K M43 output looks ok, works while printing at 500%, but causes movement hesitations Printing scan rate: 113K Printing 500% speed scan rate: 112K This is great, exactly what I was looking for :D (can we discuss on WhatsApp)?

good results. Although just yesterday (with the new KPI) I saw that after some time my TFT was sending no more gcodes (it is like there is a bug breaking the TX even with unbuffered mode. Do you have the same issue (e.g. use mustStoreCmd("M118 P0 A1 test\n"); in Monitoring menu and wait some minutes. Are TX, RX and scan rate KPI still the same as at the beginning or TX and RX KPI drop to 0 while scan rate jumps as like on idle state?

Yes we can talk even on WhatsApp although I prefer to always share info on public forum

@rondlh
Copy link
Author

rondlh commented Sep 27, 2023

UPDATE:
@kisslorand, please do not provide misinformation, what is written here is about code under development in #2840, which is clearly stated below. THIS HAS NOTHING TO DO WITH #2824, which is working fine and provides great added value to the users.
We are working together to identify issues before they are merged. This issue was identified and solved.
Your behavior is harming the community!

OK, loading up #2840
Enabled DEBUG_MONITORING
Added mustStoreCmd("M118 P0 A1 test\n"); before loopProcess.
Waiting... great effect on the notification area :D
Display froze after about 15 seconds
Restarted... lived 30 seconds this time
(I have persistent info enabled)
Reset to default values...
I see different failure modes

  1. Screen freezes, doesn't respond anymore
  2. Notification area is flashing, but screen doesn't respond anymore
  3. Notification message changes when tapping the screen (from "busy" to "SD card inserted" and "SD card removed", cycles to theses 3 messages when I tap the screen, but I cannot exit

Did you notice the difference in commands/s with and without ADVANCE_OK? With ADVANCED_OK is about 6x the amount of commands.
You can see the scanrate can drop significantly under extreme loads... I would suggest to use timer based priority slicing.

@digant73
Copy link
Contributor

digant73 commented Sep 27, 2023

use this

if (!isFullCmdQueue())
  mustStoreCmd("M118 P0 A1 test\n");

The freeze should not be present but after some time TX and RX KPI drops to 0 in my TFTs. I will see were the issue is this night (I don't think it's on your TX/RX implementation but a stupid bug in the code. The bug is present even with unbuffered mode)

Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants