Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NTP sometimes loads a wildly incorrect time #5

Closed
SmittyHalibut opened this issue May 14, 2021 · 19 comments
Closed

NTP sometimes loads a wildly incorrect time #5

SmittyHalibut opened this issue May 14, 2021 · 19 comments

Comments

@SmittyHalibut
Copy link
Collaborator

My guess is that we're getting an error packet, or not flushing buffers fully, or something. I'm pretty sure the NTPClient code is pretty trusting in the data it gets from the network and will happily parse something else (eg: an error) as a timestamp and return it.

Anyway. Detect this sort of failure if possible and retry.

@andrei-tatar
Copy link

Are you sure it's related to NTP? I also noticed sometimes a very weird clock value, but after some debugging, it seems the problem is the RTC. I added an api/info endpoint to my version of the wifi clock and after a random number of refreshes, the RTC value gets corrupted (I see it at 0 and on the next read it has some random value). I'm not sure what the issue is exactly, still looking into it.

image

@SmittyHalibut
Copy link
Collaborator Author

Nope! I am not sure at all. :-)

You've already done more research into the problem than I have. If the RTC is returning a bad value, I'd expect the stock firmware to have this same problem. So, that's weird.

We are polling the RTC often enough, we can just throw some logic in there that says "if the RTC is zero, or has changed more than 5x the amount of time we think has passed, then don't trust it." And keep ticking based on millis(). Or, trigger an NTP sync.

That's weird though. I really would like to understand why the RTC isn't trustworthy.

@andrei-tatar
Copy link

andrei-tatar commented Jan 29, 2022

As far as I can see, the RTC lib should return 0 if it can't read the time from the RTC and the TimeLib should ignore the time from RTC when it's 0. The weird thing is that the RTC time actually changes to some random value when all I have in the code is RTC.get().

So all I can think of is:

  1. there's noise on the i2c bus
  2. something else is using the i2c and the access between the tasks is not synchronized. As far as I know there's no other I2C device on the bus. Is it?

I'm trying now to add a Wire.flush before each RTC.get and trying to see if it still happens.

@andrei-tatar
Copy link

I'm trying now to add a Wire.flush before each RTC.get and trying to see if it still happens.

No luck :( ... Trying to reduce i2c clock frequency to 25kHz.

@SmittyHalibut
Copy link
Collaborator Author

Do you see syncProvider() on the console before the time goes weird? Is that where this is happening?

So far as I can remember, there isn't anything else on I2C.

@andrei-tatar
Copy link

andrei-tatar commented Jan 29, 2022

@SmittyHalibut the sync provider is set to sync only after 5 mins. I see the RTC time value getting corrupted way more often (sometimes after just a few seconds). I reduced the i2c clock at 10kHz and keeping an eye on it (doing batches of RTC.get periodically)... but it seems way more stable now. I haven't managed to get a corrupted value since.

  • maybe I need to add stronger pullups on the i2c bus (haven't checked what the values are - if any :)... )
  • maybe the RTC chip is a really low quality one that didn't pass the factory checks

So in normal use, I rarely see the time going wild, but the endpoint I added made it way easier to trigger a RTC value corruption (clock.getRtcTime() just does a return RTC.get()):
image

@aly-fly
Copy link
Owner

aly-fly commented Feb 14, 2022

I have very similar issue. However RTC is always right, NTP data is wildly incorrect.
ntp_now = ntpTimeClient.getEpochTime(); Serial.print("NTP time = "); Serial.println(ntpTimeClient.getFormattedTime()); --> if (ntp_now > 1644601505) { //is it valid - reasonable number? // Sync the RTC to NTP if needed. Serial.println("NTP, RTC, Diff: ");
Here I am trying to catch the lower part of incorrect values. NTP was sometimes returning zero or some small value. This code fixed most of the issues. However today it happened that NTP returned huge value starting with 2.......
But in the clock::begin section this code was still presenting correct time. Very weird.
ntpTimeClient.begin(); ntpTimeClient.update(); Serial.println(ntpTimeClient.getFormattedTime());
Is it possible there is too short delay between NTP queries on startup and server rejects the second request, and error message is not correctly parsed?

@SmittyHalibut
Copy link
Collaborator Author

I'm beginning to wonder whether we should take three NTP samples over some amount of time and only use it if all three "match" (+/- the amount of time obviously).

@aly-fly
Copy link
Owner

aly-fly commented Feb 15, 2022

It just did it again! Real Local time is 16:20 (UTC +1). Serial log:

Connected to WiFi
IP address: 10.3.9.xxx
NTP time = 15:20:53 ........ this is correct
syncProvider()
Getting NTP..NTP query done.
NTP time = 06:28:16 .......... failure right here
NTP, RTC, Diff:
2085978496 ......... incorrect (NTP)
1644938453 ......... correct (RTC)
441040043 ........... difference is substantial and of course it writes new incorrect data into RTC
Updating RTC

Those two queries to NTP are one after another, immediately. First one from Setup() and second one from Loop()-> SyncProvider().
I guess NTP server blocks us from sending requests too frequently and either returns garbage or error message (lost somewhere in the libraries). Will try to comment out the first request in Setup() and see if it makes any difference. Second step is diving into Time library to see if there is any error checking at all.

@andrei-tatar
Copy link

So for me changing the i2c freq to 10kHz seems to have fixed the issue. The time was always correct in the past couple of weeks.
In my code variant, I update NTP by default only once per hour.

@SmittyHalibut
Copy link
Collaborator Author

@andrei-tatar Those two could be related. Slowing down I2C to 10kHz will make it take longer to send the request, which might be just the amount of time the NTP server needs to let it in again. This is pure speculation on my part though.

@aly-fly I think I dug into the NTP library and noticed it has no error checking in it before, but I wouldn't bet more than a couple quid on that right now. My memory was that it just started pulling bytes out of the return packet without checking to make sure the response was a valid "here's your time" packet.

Assuming that's the case, we could:

  • Add error checking to the NTP library and submit a PR upstream. This is the "right" solution, but depends on the upstream accepting our PR, otherwise we are maintaining a branch and I hate maintaining branches.
  • Our code already keeps track of millis_at_last_ntp (or whatever the variable is called.) Using that, we can compare NTP and RTC values, see whether one of them is closer to the prediction based on millis() - millis_at_last_ntp, and pick the one that's closest. Ideally, we'd also keep track of millis_when_ntp_and_rtc_last_agreed and use that instead.

I just haven't had a chance to implement any of this error checking, been working on other projects. But I will happily review and merge a pull request submitted by someone else... ;-)

@aly-fly
Copy link
Owner

aly-fly commented Feb 15, 2022

Inside SyncProvider() I changed just this part of the code: ntpTimeClient.forceUpdate(); to if (ntpTimeClient.update()) {

  1. It has some basic checking built-in, so it's worth using the returned value (true/false).
  2. Update() has built-in timer to prevent sending requests faster than every minute. Avoid using ForceUpdate().

It is working fine for the last 20 or so reboots. Previously it has definitely failed at least once during such batch.
Regarding NTPClient library; NTP_PACKET_SIZE is 48 bytes long so for sure it has some more contents than just time. Function just takes 4 fixed bytes out of it and done. It should at least fill the buffer with zeroes before starting query and check some of the contents for correctness: https://www.potaroo.net/ispcol/2014-03/fig-1-1.png

@aly-fly
Copy link
Owner

aly-fly commented Feb 16, 2022

I think I found the reason. arduino-libraries/NTPClient#84 (comment)
Most NTP servers will not allow for fast repetitive queries.
Here is updated library with added error-checking
https://github.com/aly-fly/EleksTubeHAX/blob/main/EleksTubeHAX/NTPClient_AO.cpp

@SmittyHalibut
Copy link
Collaborator Author

Nice! I haven't looked at your code yet (still drinking coffee for the morning), but have you submitted a pull request to the upstream library with your fix? If not, please do.

@SmittyHalibut
Copy link
Collaborator Author

@andrei-tatar ignore my last post to you. I was conflating two things and my statement made no sense at all.

I think we are chasing two different problems here. @aly-fly's NTP problem which hopefully they've found and fixed, and the RTC problem you're talking about. I'm surprised the 10kHz clock rate fixes it. The RTC should be way more reliable than that. But there's no denying results. I'm just surprised, that's all.

The person with one clock always knows what time it is. The person with two clocks is never quite sure.

We have three clocks: millis(), NTP, and RTC. Millis is the most reliable, but the least precise. We can trust Millis to be right, give or take 10%.

  • If either NTP or RTC are wrong by more than 10% of the time since we last checked according to millis, then ignore it and sync to the other one.
  • If BOTH are off, and are the same, then millis must be off some how (uint32_T wrap?) and go ahead and use them.
  • If they're both off and do not agree, then use neither and skip syncing this time around.

Some logic like the above.

@aly-fly
Copy link
Owner

aly-fly commented Feb 24, 2022

I sent a Pull request that incorporates all my changes regarding NTP. Since implementation I didn't see any glitches in time synchronization.

@SmittyHalibut
Copy link
Collaborator Author

Brilliant. If your PR doesn't get accepted quickly, do me a favor and submit a PR against this repo that updates the documentation to mention your fork. Have it say to use your fork until the upstream accepts the PR.

@SmittyHalibut
Copy link
Collaborator Author

Gah! Assuming that's not already in here: #21

@SmittyHalibut
Copy link
Collaborator Author

Correction: Looks like your changes are in the local NTP Client code. I've merged your PR. I'm going to resolve this issue.

To anyone else who has problems with time after using this code, please open a new ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants