ClockSync.lf test seems to silently fail #2274
Comments
This may be related to lf-lang/reactor-c#332. However, it now looks like the same problem occurs on other platforms, not just Mac. |
Can you confirm whether clock sync is actually happening or not? And if not, why not? With that information it would be easier to come up with a fix. |
I think the clock sync is happening. Here is part of the log for the same test with a debug option on for both RTI and federates:
|
Interesting. Perhaps @edwardalee or @erlingrj know more? |
It's the RTI that throws the error. Are you sure the RTI is also from v0.7? |
I think so. I used the master branch of lingua-franca, and installed the RTI using the reactor-c version attached to the current master branch.
|
OK. It looks as if the RTI believes that the initial clock sync has not completed, while the federates believe it has. Before debugging this issue I would like to know if Edward has any clue; I know that they recently fixed clock sync after seeing that it did not work on RPis. In any case, I would recommend just using NTP or PTP to synchronize the RPis and disabling clock sync. It is not that hard to set up your own local NTP server on either your desktop or one of the RPis, and point the other NTP clients to it. It is also an educational experience, as NTP is a very important piece of the internet. |
I suspect the error message is occurring because the clock-sync thread is trying to communicate after the other side has shut down. |
OK, so then it is not a bug. Then I guess we need to know more about the clock-sync error that was measured. Can you explain the experiment in more detail? How did you actually measure clock sync precision based on lag? What kind of network was used? |
@fra-p has been tracking problems with clock sync using two RPi 5s. He is running a simple federated program where each federate drives a GPIO pin in a reaction triggered by a timer. He uses a logic analyzer to measure the time between transitions on the pins. He first identified an error where clock sync was not being done when the clock sync option was "init". This was fixed in PR 414 in reactor-c. However, there are still mysterious problems. Most weirdly, he sees that when clock sync is set to "init", then after the initial clock sync phase, the clocks drift by more than their natural drift. This is inexplicable and it only occurs if the clocks are initially significantly out-of-phase. |
That is strange. Are the clocks drifting faster than if we had just |
The problem that I got was when I used two Raspberry Pis.
What I found is that the lag differs depending on which RPi was the Destination printing the time. On v0.6.0, it shows a proper lag of 7~10 milliseconds on both sides. On v0.7.0 it shows an unstable lag. When RPi 1 was the Destination, it showed a lag under 1 millisecond, and when RPi 2 was the Destination, it showed a lag of around 100 milliseconds. I think the initial clock synchronization using TCP is not working properly. I also had the same problem when one device was the RPi and the other was a Windows machine running WSL Ubuntu 20.04. |
I suggest being careful with terminology: "drift" is the rate of change of clock synchronization error. "Offset" is the clock synchronization error. Above, for example, "24 second drift" should be "24 second initial offset." Technically, drift is unitless, but it's more useful to see it as having units of seconds per second. |
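To make the offset/drift distinction above concrete, here is a minimal sketch that fits a straight line to measured clock synchronization errors. The intercept is the initial offset (in seconds) and the slope is the drift (seconds per second, reported here in PPM). The function names and the sample values are hypothetical and chosen only for illustration; they are not measurements from this thread.

```python
def fit_offset_and_drift(samples):
    """Least-squares line through (elapsed_seconds, offset_seconds) samples.

    Returns (initial_offset_seconds, drift_seconds_per_second).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate drift")
    mean_t = sum(t for t, _ in samples) / n
    mean_e = sum(e for _, e in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one instant")
    cov = sum((t - mean_t) * (e - mean_e) for t, e in samples)
    drift = cov / var_t                      # slope: seconds of error per second
    initial_offset = mean_e - drift * mean_t # intercept: error at t = 0
    return initial_offset, drift


def to_ppm(drift):
    """Express a unitless drift (s/s) in parts per million."""
    return drift * 1e6


# Hypothetical data: a 24 s initial offset that grows by 20 us every 10 s.
samples = [(0.0, 24.0), (10.0, 24.00002), (20.0, 24.00004)]
offset0, drift = fit_offset_and_drift(samples)
print(f"initial offset: {offset0:.6f} s, drift: {to_ppm(drift):.3f} PPM")
```

In this made-up data set the offset is large (24 s) while the drift is small (about 2 PPM), which is why the two terms should not be used interchangeably.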
@fra-p Thanks for sharing those plots! I have some questions to understand it better.
In this plot it seems you have a drift of around 33 PPB (roughly from a 2us offset to a 2.5us offset between 5s and 25s). This is remarkably little and must be because PTP has very accurately estimated the drift. I think, if you stop ptpd, CLOCK_REALTIME will just continue with the last drift corrections. I cannot find info on the crystal oscillator on the RPi5, but read something about 100 PPM on a previous version.
Yes, we would expect the same behavior more-or-less. But could this not be explained by PTP this time having estimated an even more accurate relative drift between the two oscillators, before you turned it off?
In this plot we are seeing a drift of ~2PPM (last time was per billion), this is also very small actually. But the big question here is this: Was ptpd/ntpd/chronyd running at all after your reboot? Because if they were not, then I would expect this kind of drift.
Hm, is this the correct behavior? Should the initial clock sync not make the clocks appear equal? The one hosting the RTI should be the master, and the other should get an offset that it applies to all readings of the clock. This seems like a bug to me. But the other plots seem to be within expected behavior?
Great. FYI, there is critical logic that every time you read with |
In the program you share, there are no clock-sync target properties set. Can you explain what they were for the different experiments? Also, where is the RTI located? When you switch where each federate is running, do you also switch where the RTI is running?
This certainly looks like a clock sync error. It looks to me that RTI is always located at one of the RPIs and that the LF physical clocks (which in these cases are distinct from system clocks btw, It might be good for you to compare notes with @fra-p since he apparently has the initial clock sync working on the RPis. Lastly, I think we need to build a testbed for evaluating the real-time performance which can run as part of the CI. Would be a great project for a PhD student :D |
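The expectation stated above, that the machine hosting the RTI acts as master and the other side applies a fixed offset to every clock reading, can be sketched as follows. This is a generic round-trip offset estimate of the kind NTP-style protocols use, assuming a symmetric network delay. It is not taken from reactor-c, and the names `estimate_offset` and `OffsetClock` are hypothetical.

```python
def estimate_offset(t1, t2, t3, t4):
    """Estimate how far the follower clock is behind the master, in ns.

    t1: follower sends request    (follower clock)
    t2: master receives request   (master clock)
    t3: master sends reply        (master clock)
    t4: follower receives reply   (follower clock)
    Assumes the one-way delay is the same in both directions.
    """
    return ((t2 - t1) + (t3 - t4)) // 2


class OffsetClock:
    """Wraps a raw clock and adds a fixed offset to every reading."""

    def __init__(self, raw_clock):
        self._raw_clock = raw_clock
        self.offset = 0

    def now(self):
        return self._raw_clock() + self.offset


# Made-up exchange: master is 5 ms ahead, one-way delay is 200 us,
# and the master takes 100 us to reply. All values are in nanoseconds.
t1 = 1_000_000_000
t2 = 1_005_200_000
t3 = 1_005_300_000
t4 = 1_000_500_000
offset = estimate_offset(t1, t2, t3, t4)

clock = OffsetClock(lambda: t4)  # raw follower clock frozen at t4 for the demo
clock.offset = offset
print(offset, clock.now())
```

If the delay is not symmetric, the estimate is off by half the asymmetry, which is one reason a residual offset can remain after an initial sync.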
Note that if you are not running in an up-to-date master/main, then you won't have my bug fix that resulted in clock-sync being off by default rather than init by default. |
@erlingrj The RTI was running on a workstation, and the federates on separate RPis. @edwardalee I didn't understand. AFAIK, the RTI and federates do initial clock sync by default, even with the You mean that on the latest version (v0.7.0) clock sync is off by default? |
Hmm, it looks to me like my PR was merged into main two weeks ago, so it should be in 0.7.0, though I'm not sure. However, on re-examining that PR, it seems that initial clock synchronization will only be done if
but this won't work because then it will be impossible to turn clock synchronization off. |
As a workaround while we fix this, you should be able to turn on clock synchronization with the target parameter:
or
|
@edwardalee I'm sorry for my incorrect use of the terminology. I edited the original comment with the (hopefully) correct terms, thank you! @erlingrj thanks for your comments! For the experiment with I cannot tell if in the experiment with I also noticed that I get the same linear error behaviour with |
Update: I managed to get the LF |
I had a look at the code. The clock-sync option setting is implemented differently in the federate and the RTI.
Federate (Use compiler definitions)
The _LF_CLOCK_SYNC_INITIAL and _LF_CLOCK_SYNC_ON are only for federates. You can find out that these definitions are not used in
RTI (Use variables)
RTI has a variable After a federate Now the RTI decides the clock-sync mode for the federate depending on this port number. If the federate sends a port number If the federate sends a port number If the federate sends anything else, it means the clock-sync is on. The logic how the RTI works, is that if the port number is not So these are some interesting points.
|
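The port-number convention described in the comment above can be sketched as a small classifier. The two special port values are not reproduced in this thread, so they are passed in as parameters, and the reading that one of them means "off" and the other "init only" is an assumption. The placeholder numbers in the usage lines are arbitrary and are not the values reactor-c uses.

```python
from enum import Enum


class ClockSyncMode(Enum):
    OFF = "off"
    INIT = "init"
    ON = "on"


def mode_from_port(port, off_sentinel, init_sentinel):
    """Map the port number a federate reports to a clock-sync mode.

    off_sentinel and init_sentinel stand in for the two special values
    mentioned in the thread; any other port means clock sync is on.
    """
    if port == off_sentinel:
        return ClockSyncMode.OFF
    if port == init_sentinel:
        return ClockSyncMode.INIT
    return ClockSyncMode.ON


# Arbitrary placeholders for illustration only (NOT reactor-c's values).
OFF_PLACEHOLDER = 60001
INIT_PLACEHOLDER = 60002

for reported in (OFF_PLACEHOLDER, INIT_PLACEHOLDER, 15045):
    print(reported, mode_from_port(reported, OFF_PLACEHOLDER, INIT_PLACEHOLDER).value)
```

Because the mode is inferred from a value the federate sends, the RTI and the compiled federate have to agree on the convention, which is the matching problem raised in the following comment.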
This is another reason we should be code generating the RTI together with the federates. Currently, getting the clock sync parameters to match relies on the launch script being run. If you run the RTI and federates manually, you have to start the RTI to match the compiled federates. |
Also, it looks like my clock sync fixes have not been merged and are not in 0.7.0, so I don't think clock sync init will work in 0.7.0 unless it is explicitly specified in the target properties. |
Where are those changes then? |
The clock sync fixes are here: lf-lang/reactor-c#425 |
It wasn't, because the submodule apparently wasn't updated... Happy to release another patch soon. |
@edwardalee and @hokeun, can this issue be closed? |
Yes, I think this can be closed now. |
Description:
It looks like there is a problem with clock synchronization in federated LF. While @Jakio815 was running federated LF programs on distributed computers (Raspberry Pi), he found that clock sync did not work, and it did not work on release 0.7.0 either, showing a very high (1-100 ms) clock synchronization error when he measured the lag (physical time - logical time) for very simple federated examples.
Code version:
Platform
Test
Error log