[FR] Improved serial data processing (smarter reading, faster writing) #2835
Comments
Serial reading
Line 4 checks if new data is available:
The serial interrupt routine handling the wIndex update would not be needed anymore (line 3 to 9 below),
And of course the serial idle interrupt setup can be cancelled too:
Actually, much more can be cancelled, but I don't do that here, because it will be needed for the faster writing part... |
The DMA-based serial writing should certainly improve the code. For the change you propose on serial reading (updating wIndex), I think it has a negative impact during a print, even with the small replies (ok, temperature notifications, etc.) provided by the mainboard. In that case, I see a lot of Serial_Get() invocations (more on a faster TFT) reading a partial message (e.g. For example, change: |
Isn't Serial_Get invoked at a very high rate all the time anyway, not only when a serial idle interrupt is received? It seems Serial_Get is polling at very high rates all the time, but returns once it sees that no new data is received. I will post the DMA serial writing files soon, that's much more interesting. The combination of the "smart" reading and DMA writing works very well for me. |
In Serial_Get() try to put a KPI to count the number of partial messages. E.g. a counter after the following line
Display the KPI in the Monitoring menu and compare it with and without your proposed solution. I suspect a lot of partial (useless) messages with your solution: even for an "ok\n" reply you possibly read 3 partial messages (one for "o", one for "k", one for "\n"). If so, the solution is not efficient (consider how many useless partial reads there will be for M43), and you could try adding a check for the presence of \n between the flag and the updated wIndex. The flag has to be updated in case \n is not found. I will try to implement that and do some testing |
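The check suggested in this comment (look for '\n' only between the flag and the updated wIndex, and move the flag forward when no newline is found) could look roughly like the sketch below. It is not the code from the PR: `RxPort`, `lineAvailable` and `CACHE_SIZE` are hypothetical names, and the ring buffer is filled by hand instead of by DMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIZE 16u  // hypothetical ring buffer size, kept small for illustration

typedef struct {
  char     cache[CACHE_SIZE];  // ring buffer (filled by DMA on real hardware)
  uint16_t wIndex;             // next position the writer will fill
  uint16_t rIndex;             // start of the first unread message
  uint16_t flag;               // first position not yet scanned for '\n'
} RxPort;  // hypothetical type, not the firmware's own structure

// Scan only the bytes that arrived since the last call (flag .. wIndex).
// Returns true and stores the message length (including '\n') in *len
// when a complete line is available. Otherwise it records how far it
// scanned, so the same bytes are never examined twice.
static bool lineAvailable(RxPort *p, uint16_t *len)
{
  assert(p != NULL && len != NULL);

  uint16_t i = p->flag;

  while (i != p->wIndex)
  {
    char c = p->cache[i];
    i = (uint16_t)((i + 1u) % CACHE_SIZE);

    if (c == '\n')
    {
      *len = (uint16_t)((i + CACHE_SIZE - p->rIndex) % CACHE_SIZE);
      p->flag = i;  // resume after the newline next time
      return true;
    }
  }

  p->flag = i;  // no newline yet: the next call starts scanning from here
  return false;
}
```

With this shape a partial "o" or "ok" costs one short scan and returns false, and the caller only copies data once a complete line is present. The caller is expected to copy `len` bytes starting at `rIndex` and then advance `rIndex` itself.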
OK, great, but let me do the testing... this is a minor thing... And I agree with your analysis, even an "ok\n" would probably be split because scanning is very fast, but I just think it doesn't matter especially because your "flag" update limits the scanning and scanning is fast, especially if I add some code to check for '\n' first. I have something much more important for you to test... Copy it over #2824 test on STM32F2_4 platform. |
ok. I will test it in the next days |
@digant73 Thanks, I hope your printer doesn't explode. I have a question, here the start of _Serial_Get
I wonder if the check Serial_Get is called in 2 places:
Here it also seems that the check is always True. In principle you could also pull out the continue blocking guard
So the while loop would look like:
Maybe not very beautiful, but certainly fast and efficient. |
At the moment I would avoid this kind of change, as it doesn't provide a significant improvement. Also, all the functions/features invoked by loopProcess() are managed in the same way (call a function and make the proper checks inside the function, just to keep the code more readable and manageable). The important part in the serial functions is to speed up the reading, checks and buffering. |
It's a 2 part question, the first is, can "!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1)" be dropped? The second part is what I believe you answered. I also do not expect great speed improvements from it. |
yes |
I just did a quick benchmark assuming there is no serial data available (if data is available then both approaches are about the same). The required changes are very small, and readability is still ok. SerialCommunication.c (new function)
Use the new function in the while-condition
The check in Serial_Get is not needed anymore, it's done in newDataAvailable and the WITHIN check is dropped
parseACK.c Use the new function in the while-condition
I have this running already... worked the first time, I hope it stays that way :D |
good results. UPDATE: if you use |
You are very right, like always :D You could probably guess that my code looks like this, but I didn't want to push that on you and everyone...
The updating of the flag is done in Serial_Get when new data is reported. It doesn't matter if the new data was reported by the interrupt or by updating wIndex manually. This works both for complete and incomplete messages... that is how you coded it... |
If possible please, try the following implementation and verify the performance in particular under load (so receiving messages) more than on idle (not so much relevant IMHO). The code should provide better results in particular for reading mid (e.g. temp ACK) and long messages (e.g. output lines for M43 most of the time even longer than 250 chars).
|
Your algorithm looks great, I like it! (please get rid of the "goto"). Do you want me to benchmark this code vs. the current Serial_Get? There is no doubt that this implementation is significantly faster and more efficient because Serial_Get doesn't do unneeded work anymore and it becomes aware of new complete messages significantly earlier. The situation where Serial_Get is called when there is NO new data available (wIndex not changed) happens literally 10 to 100x more often than Serial_Get actually receives new data. So improving the case where no new data is available will save a lot of MCU cycles and allows for higher scan rates and thus a faster overall response. This is especially true when using the interrupt to update wIndex. It's still true (to a lesser extent) when wIndex is updated manually, because only very little data is received. So first checking a condition before making a function call that exits after doing the same check is significantly faster. Making function calls is an expensive operation in terms of MCU cycles. |
yes please benchmark the code and eventually make the changes you want. I know idle state is more often than busy state but overall on idle state the TFT has nothing to do so it is not so much important to me (I would prefer a more compact code as we did with all other functionalities). Of course if you make the check of new available bytes before calling Serial_Get() you must add also the time spent on that check in the stats |
If your idle state is faster, then you arrive at your busy state earlier. OK, I can test it. 1980's code:
A bit nicer code would be:
|
Sure, the code cleanup can be applied (I used the goto simply because I could remove the check by just commenting out the lines, as also reported in the inline comment, without any change to the code indentation). EDIT: I would use (in case bufSize is < msgSize, 0 must also be returned after flag and rIndex have also been updated)
|
Very nice! I'm a bit confused by the use of "SERIAL_PORT_INDEX portIndex" and "uint8_t port". It seems to be the same thing.
|
it should be as it is sent by mainboard and received by TFT (so one go)
Yes, defined in Serial.h. Leave the types as they are now ( |
Here the benchmarks, 100K runs (STM32F207 @ 120MHz) CURRENT SERIAL_GET NEW SERIAL_GET Message = "ok P15 B7\n" ---> 305ms Conclusions: |
ok, many thanks for testing. |
Perhaps I don't understand what you really care about... The busy state of Serial_Get takes a maximum of 1ms (max 400 commands/s), while the idle state takes 10x that amount of time. I showed you how to save that time, but you say it doesn't matter. Apart from that, the Serial idle interrupt is always 1 frame time behind, because the interrupt only comes after 1 idle frame (the time of 1 serial character). At 250K baud this means that the "slow" algorithm could have already run 20 times before the "fast" algorithm even starts. So if you care about response time then this is the way to go, and you also prevent buffer overruns at the same time. Using an interrupt is possible of course, but it is not going to change much, because the other side (loopProcess) is just polling for new data anyway. So the ISR could check for '\n', but so could the loopProcess, and that way we don't introduce another hesitation bug. |
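The point about the idle interrupt lagging one frame behind can be put in numbers. A minimal sketch, assuming 8N1 framing (10 bits per character, which matches the factor 10 used in the wait-time formula elsewhere in this thread); `frameTimeUs` and `BITS_PER_FRAME` are hypothetical names, not firmware identifiers:

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_FRAME 10u  // assumption: 1 start + 8 data + 1 stop bit (8N1)

// Duration of one serial character frame in microseconds (rounded down).
// The idle interrupt can only fire after the line has been silent for
// about this long, so it reports new data at least one frame late.
static uint32_t frameTimeUs(uint32_t baudrate)
{
  assert(baudrate > 0u);
  return (BITS_PER_FRAME * 1000000u) / baudrate;
}
```

At 250000 baud this gives 40 µs per character, and at 115200 baud about 86 µs. Any polling pass that completes in a few microseconds can therefore run many times within a single idle frame.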
NEW SERIAL_GET UPDATE Additionally |
It seems I didn't understand your previous report. If we rely on serial idle interrupt, the current algorithm was expected to be faster
If the interrupt handler could be programmed to be invoked when |
When I say "current algorithm" I mean the current official released code. So what are the objective goals here? I thought this was about lowering the response time, but it seems that is not the goal here.
That is not possible, but you could do an interrupt per received character and report when '\n' is found. Of course this would be less efficient than using DMA. |
to avoid buffer overruns (as you also experienced with M43) and possibly improve performance on Serial_Get and response time. EDIT: Ok, based on your last post, we can exclude the possibility of an interrupt handler to intercept \n |
Both goals are achieved by your new algorithm. Note that the serial data is received slower than the polling of Serial_Get. A close to optimal algorithm could be like this:
This is efficient and will allow for a significant increase in the scanrate and efficiency. So in the newDataAvailable, which is run in the loop process:
I can test how much the scanrate will increase if you do this, I expect a significant increase, and efficiency is best like this too. |
ok good. waiting for the results |
A 2% speed gain from only changing the serial port to 32-bit is useful if you ask me, especially considering it's a tiny change with no risk. It would be best if there was a consistent serial port type. Then the choice would be easy... Let me retest the effect of using 32-bit variables on Serial_Get again, to confirm I got it right the first time. |
I made some other optimizations in #2840. With wifi enabled it is now 180K on idle and 152K on printing (with wifi off it is 200K and 165K). Same results for interrupt based and DMA based. EDIT: on STM32F10x I got basically half the performance than STM32F2. With wifi enabled it is now 90K on idle and 74K on printing (with wifi off it is 99K and 80K). Even in this case same result as for old version (DMA version is not yet implemented) |
Thanks for the speed comparison. I will check out what you did in #2840 later. The more data you send, the more benefit buffered (interrupt/DMA) writing will give. Unbuffered writing (the old way) is blocking: if you send a line of gcode, you have to wait until the last character of the gcode ('\n') is copied to the serial port. If you are printing the first layer (low speed) then you will not see much difference because there is not much data to send. But try a small oval in vase mode without ARC welder enabled, then the difference is clear. So basically buffered writing helps when it counts most, when the printer/TFT is busy. Buffered writing via interrupt or DMA is about the same in speed. The MCU can handle interrupts easily without noticeable speed loss. Still, DMA is clearly more efficient because it's a background process done in dedicated hardware that only needs to be set up once and then only triggers 1 interrupt (sometimes 2, if the gcode line is split). Fewer interrupts is obviously better, but from a speed point of view there is not much difference. For Marlin the TFT sends lines of code (Serial_PutChar is never used), so DMA writing is a good match. BUG REPORT: I made a silly mistake with Corrected code at start of serial.c
And replace: |
Ok, // mustStoreCmd("M118 P0 A1 test\n"); Just to load the system with different kind of gcode/ACK. Even with those load scenarios I get the same values for both unbuffered and DMA/Interrupt based writing. |
I just noticed that a well-known saboteur is at it again, he is repeatedly force-pushing his "own" DMA write code. |
Ohh I also saw it now. Since some time he banned me so I can't also provide any kind of feedback |
OK, interesting, any numbers on how many bytes per second are sent? Serial write wait time = (10 x bytes sent per second / baudrate). I tested like this:
NOTE: When testing, I usually do not put filament, and lower the temp of bed and nozzle. BTW: Did you know that sounds can pause the printing? |
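The wait-time estimate quoted in this comment (10 x bytes sent per second / baudrate) gives the fraction of each second that an unbuffered writer spends blocked on the UART. A small sketch of that arithmetic, with `writeWaitFraction` as a hypothetical helper name and 10 bits per byte assumed (8N1 framing):

```c
#include <assert.h>
#include <stdint.h>

// Fraction of each second spent waiting for bytes to leave the UART when
// writing is unbuffered (blocking). Assumes 10 bits on the wire per byte.
// A result of 1.0 means the serial line is fully saturated.
static double writeWaitFraction(uint32_t bytesPerSecond, uint32_t baudrate)
{
  assert(baudrate > 0u);
  return (10.0 * (double)bytesPerSecond) / (double)baudrate;
}
```

For example, a made-up load of 12500 bytes/s at 250000 baud gives 0.5, so a blocking writer would spend about half of every second waiting. With buffered (interrupt or DMA) writing, that time is available to the main loop instead.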
I saved the stats only for STM32F10x just yesterday. Below the results:
For the sound, I read one of your post some days ago. If there is a fix I can add it on one of my open PRs |
If this is for buffered write then it makes sense. That's the whole purpose of buffered writing. |
Hmm, M118 P0 A1 test\n etc... is also replied by mainboard with ACK "test" that will be forwarded towards all the active serial ports (e.g. WiFi). In that case I will load both RX and TX interfaces and I would expect benefits from DMA/Interrupt implementation |
Oh I forgot also a question about the access to
Shall |
Quick cylinder test (115200 baud): My conclusion: You are probably testing in buffered mode. About I would recommend a name change: To stop me from getting confused too much |
Ok, I pushed the changes you asked for. Try this simple test putting the following code just before loopProcess() in the Monitoring menu:
Just uncomment the pattern you want to use, recompile the fw and flash it on TFT. Also. If I send an M43 from WiFi I'm not able to receive a completely correct response. Just repeat the command and see if you get the same reply or not |
OK, I'm not sure what you are trying to prove, but let me check... No lines added: 0 0 7 113K (Buffered gcode, pending gcodes, free tx, scan rate) Buzzer is triggered in these modes! Do you also see that? Manually enter "M43" in ESP3D (no lines added in monitoring), works fine for me, message is correct and complete. I see you did a lot of work on the serial writing code... I just checked your Serial_Get code...
I think can be simplified to: And
I think this is the same as:
I benchmarked these copy modes, and your method is about 25% faster for both 3 and 10 byte messages. How about a KPI: RX/TX bytes/s: 123/345 |
OK, and what are the results with unbuffered and interrupt based writing? Are those the same as for DMA mode?
I do not hear any sound. Do you mean the current buzzer implementation (loopBuzzer etc...) is affecting/invalidating the scan rate?
Ok, as I said I'm not able to get a complete correct output. I will also try reducing the mainboard connection to 115K.
Yes, for STM32F10x the interrupt mode is working (tested in my TFT). The code is the same as in STM32F2_F4. Possibly, the DMA mode is also similar or equal to STM32F2_F4.
Sure, correct. I will apply the changes. Yes I tested a lot that code and I always got better performance than with memcpy
Do you mean total read and written bytes per second or total read and written bytes for the main serial port (mainboard)? |
Perhaps you have sound disabled. I have a little mod of the sound system where, if I disable the sound, I still get the keypress sound, unless I have also disabled the keypress sound. I believe the sound is there because there is a warning when you have too many (20) commands queued. I don't think the sound affects anything, that's not my point. You just scared me a bit because I have the event-triggered sounds under test. I didn't test unbuffered and interrupt writing yet; interrupt should be very close to DMA.
Right, it's about host communication, I don't care about the Wifi communication. I have Interrupt serial writing working for GD32F2... I'm working on DMA now... |
Strange that we get different results on the same HW (BTT TFT35 V3). If possible (now that the advanced ok PR is also merged), download my PR entirely and run the test. I also tried defining an RX cache of 16K (the output of M43 is about 12K in my TFT), and even with that I'm not able to receive a complete output for M43 (without some parts missing or other previously received output mixed in). That is very strange, and I'm adding other KPIs in the Monitoring menu to also verify the W and R indexes.
Yes I was also thinking to add commands per second (very useful).
Wow ultra fast! Oh I forgot to ask you to check the changes I made on _USART6 on void Serial_DMAClearITflags. Are the changes OK or wrong (e.g. |
Great to see ADVANCED_OK is finally here, great job! You did not change In the same function, I wrote (a few times): (Hal\stm32f2_f4xx\serial.c)
It should say
This function is not needed: |
right, I was referring to the line above
Ok, changed in the new update. Serial_DMAClearITflagsRX
Hmm I see that Serial_DMAClearITflagsRX is also not used at all. Is it needed somewhere in the code? |
It's in (or should be in):
Actually, I think it's not needed, because it's never reset during the reading process, but leave it in please just to be safe. UPDATE: I made some small changes, flag can be used in TX too, to replace txBytes. Here the AIOs. |
Yes, in those methods you originally provided the DMA RX clear and I replaced it with the DMA TX and RX clear (just to clear both).
sure, make your changes and let me know
I think you can simply provide BTT your review/opinion on our PR referring to this FR. Pretty much sure BTT reads all posts (even this one) and is probably aware although they are apparently not present. |
Somebody is really not acting in the interest of everyone here, repeatedly boldly lying and worst of all, potentially harming the source code base we are nurturing. Did you find the updates? What's next? STMF1 DMA write? Test results: |
Yes it seems STM32F10x DMA is the last one
Good results. However, just yesterday (with the new KPI) I saw that after some time my TFT was sending no more gcodes (it is as if there is a bug breaking the TX, even in unbuffered mode). Do you have the same issue? E.g. use mustStoreCmd("M118 P0 A1 test\n"); in the Monitoring menu and wait some minutes. Are the TX, RX and scan rate KPIs still the same as at the beginning, or do the TX and RX KPIs drop to 0 while the scan rate jumps as in the idle state? Yes, we can talk even on WhatsApp, although I prefer to always share info on a public forum |
UPDATE: OK, loading up #2840
Did you notice the difference in commands/s with and without ADVANCED_OK? With ADVANCED_OK it is about 6x the number of commands. |
use this
The freeze should not be present, but after some time the TX and RX KPIs drop to 0 in my TFTs. I will see where the issue is tonight (I don't think it's in your TX/RX implementation but a stupid bug in the code; the bug is present even in unbuffered mode) |
1. Serial reading
The current FW uses DMA for reading data from the serial ports, which is very efficient: the process is started once and then runs automatically in the background. Interrupts are generated when data is available, which happens when the serial line goes idle.
This is overall a very efficient approach, but it has a small drawback. The received data only gets processed once the serial line goes idle. Under most circumstances this is fine because the received messages are short and fragmented ("ok\n"). But if a long continuous message is received, like the response to M43 (pins information), then a TFT buffer overrun can occur and part of the received message is lost.
This issue can be solved by not waiting for the serial idle interrupt, but immediately starting to process the data that is being received. Based on the improvements in #2824 (improvements in Serial_Get) it is quite obvious how to implement this. I will provide some sample code soon.
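One way to picture "not waiting for the idle interrupt" is to derive the write index directly from the DMA controller's remaining-transfer counter each time the main loop polls. The sketch below is only an illustration under that assumption: `wIndexFromRemaining` and `pollNewData` are hypothetical names, and the `remaining` argument stands in for the hardware counter (the stream's NDTR register on STM32F2/F4, assumed to stay in the range 1..cacheSize in circular mode).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// In circular DMA reception the controller counts down the number of
// transfers left before it wraps around. The write position in the ring
// buffer follows directly from that counter, so no interrupt is needed
// to learn how much data has arrived so far.
static uint16_t wIndexFromRemaining(uint16_t cacheSize, uint16_t remaining)
{
  assert(cacheSize > 0u);
  assert(remaining >= 1u && remaining <= cacheSize);
  return (uint16_t)(cacheSize - remaining);
}

// Refresh *wIndex from the (simulated) DMA counter.
// Returns true when bytes have arrived since the previous poll.
static bool pollNewData(uint16_t *wIndex, uint16_t cacheSize, uint16_t remaining)
{
  assert(wIndex != NULL);

  uint16_t latest  = wIndexFromRemaining(cacheSize, remaining);
  bool     changed = (latest != *wIndex);

  *wIndex = latest;
  return changed;
}
```

Polling this from the main loop lets the reader start consuming a long reply such as the M43 output while it is still arriving, instead of only after the line goes idle.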
The advantages:
2. Serial writing
Practically the TFT only needs to receive a little bit of data ("ok\n" messages), but needs to write a lot of data to the motherboard (the gcode commands, about 40 characters per command).
The current serial write implementation is very slow: the software has to wait after each byte until it has physically been sent over the serial line. This slows down the TFT response time, especially at lower baud rates and when the workload is already high. A buffered, DMA-based implementation would significantly increase the TFT responsiveness.
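To make the idea of buffered writing concrete, here is a rough sketch of a TX ring buffer whose pending data is handed to DMA in contiguous blocks. It is not the STM32F2_4 code referred to in this issue: `TxRing`, `txQueue`, `txNextChunk`, `txChunkDone` and `TX_SIZE` are hypothetical names, and the DMA transfer itself is left out, so only the bookkeeping is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_SIZE 16u  // hypothetical buffer size, kept small for illustration

typedef struct {
  char     cache[TX_SIZE];
  uint16_t wIndex;  // where the next queued byte is stored
  uint16_t rIndex;  // first byte whose transfer has not completed yet
} TxRing;  // hypothetical type, not the firmware's own structure

// Queue a message without waiting for the UART. Returns false (and queues
// nothing) when the message does not fit in the free space.
static bool txQueue(TxRing *t, const char *msg)
{
  assert(t != NULL && msg != NULL);

  size_t len  = strlen(msg);
  size_t used = (size_t)((t->wIndex + TX_SIZE - t->rIndex) % TX_SIZE);

  if (len > TX_SIZE - 1u - used) return false;  // one slot always stays empty

  for (size_t i = 0; i < len; i++)
  {
    t->cache[t->wIndex] = msg[i];
    t->wIndex = (uint16_t)((t->wIndex + 1u) % TX_SIZE);
  }
  return true;
}

// Next contiguous block to hand to the DMA controller. Data that wraps
// around the end of the buffer is sent as two blocks. Returns the block
// length, or 0 when nothing is pending.
static uint16_t txNextChunk(const TxRing *t, const char **start)
{
  assert(t != NULL && start != NULL);

  uint16_t end = (t->wIndex >= t->rIndex) ? t->wIndex : (uint16_t)TX_SIZE;

  *start = &t->cache[t->rIndex];
  return (uint16_t)(end - t->rIndex);
}

// To be called when a transfer has completed (on real hardware, from the
// DMA transfer-complete interrupt). Only then is the space released.
static void txChunkDone(TxRing *t, uint16_t len)
{
  assert(t != NULL);
  t->rIndex = (uint16_t)((t->rIndex + len) % TX_SIZE);
}
```

The caller returns right after `txQueue`, so the main loop never waits per byte. Releasing the space only in `txChunkDone` keeps new messages from overwriting bytes that the hardware is still sending.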
Currently I have a DMA based serial write solution under test, which works in this way:
The advantages:
A disadvantage is that this code is hardware dependent (DMA setup).
I will post the STM32F2_4 sample code here soon for review and discussion.
I have some questions:
Is it acceptable to provide this solution for one hardware platform only?
Who can help test/implement a STM32F10x implementation?
BTT TFT24 V1.1
BTT TFT28 V1.0
BTT TFT35 1.0/1.1/1.2/2.0
BTT GD TFT24 V1.1
BTT GD TFT35 V2.0
MKS TFT28 V3.0/4.0
MKS TFT28 NEW GENIUS
MKS TFT32 V1.3/1.4
MKS TFT32L V3
Who can help test/implement a gd32f20x implementation?
BTT GD TFT35 V3.0
BTT GD TFT35 E3 V3.0
BTT_GD TFT35 B1 V3.0
BTT GD TFT43 V3.0
BTT GD TFT50 V3.0