PID1 getting stuck printing "systemd[1]: Time has been changed" continuously #1143
Comments
Well, on Linux 32bit, time_t is 32bit iirc, hence this cannot work afaics... internally, systemd always calculates with 64bit values, but the kernel APIs might use something else. Did you check what sizeof(time_t) prints on your arch? |
sizeof(time_t) prints 4 |
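For reference, a 4-byte signed time_t can count at most 2^31 - 1 seconds past the Unix epoch, which falls on 19 January 2038 at 03:14:07 UTC. A minimal Python sketch of that arithmetic follows; the helper name `last_representable_time` is made up for illustration and is not part of systemd or the kernel.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_representable_time(time_t_bytes: int) -> datetime:
    """Return the latest UTC instant a signed time_t of this size can hold."""
    max_seconds = 2 ** (8 * time_t_bytes - 1) - 1
    return EPOCH + timedelta(seconds=max_seconds)


# sizeof(time_t) == 4 on the affected 32-bit ABIs discussed in this thread.
print(last_representable_time(4).isoformat())
```

Any RTC reading later than that instant cannot be expressed in the 32-bit userspace interfaces this thread is about.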
IRC with Lennart: poettering |
Note that this bug also hits on 32bit MIPS and 32bit ARM... |
Tjo! I've found a workaround for this issue, but it requires patches in both kernel and systemd:
And in systemd:
|
I think it shouldn't be the case that once we reach the magic day, user space stops booting. |
Especially as the kernel since 3.19 actually allows dates past 2038 from a battery-backed RTC, which could have a bad date because it was never initialized or because an attacker set it maliciously to DoS the system. |
We cannot work around a broken ABI in userspace. To fix the 2038 problem for good you need to update the ABI of your arch to provide a 64bit time_t. I will close this now, as I see nothing to fix here in systemd: when the ABI (in the kernel and in glibc) is updated to provide 64bit time_t then systemd should start working after a simple recompile (possibly adding a new define, a la |
True, you may not be able to work around this issue, but systemd could do better than completely locking up printing error messages when it happens. The bug is that PID 1 completely locks up, not that the time is incorrect. I have the exact issue that @antijn describes with a misbehaving RTC, and it makes my system unbootable. Could this be reopened with an emphasis on fixing the lockup, even though the time can't be fixed? |
I've sent the patch for the kernel to Thomas Gleixner, who's the maintainer; no response yet though. If he's at LinuxCon Europe in Dublin next week, I'm going to ask him in person. My plan was to resubmit my workaround for systemd when upstream Linux has queued the kernel change. Perhaps that can also be handled at LinuxCon if Lennart is there. Much easier to explain some issues in person... |
@antijn, thank you. That is a good approach. |
I am not at LinuxCon. Also, Gleixner is pretty aggressive towards me, so I am pretty sure I will not do anything on this issue. |
Too bad, but anyway, since any solution depends on the relative timerfd support getting upstream in the kernel first, I'll keep on plugging at that front, and come back to you when the patch in systemd actually has a chance of working. |
Hmmm, hit this bug while booting a previously working cubox-i (arm). I'm thinking the battery for the RTC went bad; the RTC battery in the cubox-i has a historically short life. I've replaced two out of three, this will make three out of three. So yeah, I'd say this is of significant importance, if an RTC battery failure can cause a non-booting system. |
I applied both patches and it resolved my rtc/fail-to-boot scenario. |
@cbxbiker61 hmm? i doubt that a failing battery would put the rtc into the year 2038... |
Hard to say, all I know is that:
|
Either way, this is something to fix in the kernel. We will not question the time information returned by the kernel. If it is not correct and a driver returns rubbish, then the kernel should be fixed, not systemd. Also, time_t being 32bit is a kernel issue too, it's nothing we can work around in userspace. |
In case it helps the community, let me share my experience with this annoying behavior: my ARMv5 NAS was stuck in this alert loop and then refused to boot (I got the logs using netconsole) because of a faulty RTC. |
Unfortunately, discussion on this has petered out on LKML with no clear solution; my patch is still in limbo. It is a real problem, and yes, it is a kernel problem, and it will only be completely solved when 32bit userspace is gone. But it is also a fact that the number of RTC models that fail in this fashion due to no/bad battery is increasing. There are probably also going to be more problems if manufacturers move towards using super-caps instead of batteries... |
I just ran into this on a Solidrun Ltd I4P-300-D (Freescale i.MX6 Quad ARMv7). I'd appreciate something being done to ensure I don't need to manually intervene (occasionally) after a reboot occurs. I didn't set my clock forward manually, so either:
To summarize: this isn't just something that affects people who manually set their clocks forward, it appears to affect normal users as well. |
Can you try to reproduce that with a kernel that has RTC_HCTOSYS not set? My understanding of the issue is that the RTC subsystem correctly sets a 64-bit system time, which then causes trouble because userspace and other parts of the kernel are not ready to handle 64-bit time. Having userspace instead of the kernel set the system time can solve that issue. If that is not the case, we have discussed adding an option to make sure the RTC subsystem doesn't set a time that doesn't fit in 32 bits. |
This just bit me hard after my CMOS battery died on an Asus x86 machine. I had to work around it with a variant of this: armbian/build#111 (comment) Please, don't break the boot this badly on x86 machines! Even booting a LiveUSB and resetting the clock from under that environment didn't fix the issue. |
I just hit this issue as well.
Is there really no way to detect and recover from a clearly bogus RTC date? |
One workaround would be to force the date to be after the release time of the software running on the system... There wasn't any systemd in 1943 as far as I know ;) |
@amoskvin sounds like an undetected overflow bug in the kernel... Really, there's nothing to fix here in systemd; the kernel should validate the time after reading it out of the RTC and make sure the time doesn't overflow and isn't unreasonably low. |
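A rough sketch of that idea in Python: treat any clock reading earlier than the software's release time as bogus and fall back to the release time. Everything here is hypothetical: `BUILD_EPOCH`, its value, and `plausible_time` are invented for illustration, and the upper bound is an extra check suggested by the rest of the thread, not by the comment above.

```python
# Hypothetical release time of the software image, in seconds since the Unix
# epoch (1456012800 is 2016-02-21 00:00:00 UTC, chosen only for illustration).
BUILD_EPOCH = 1_456_012_800

# Largest value a signed 32-bit time_t can hold.
INT32_MAX = 2**31 - 1


def plausible_time(rtc_seconds: int,
                   build_epoch: int = BUILD_EPOCH,
                   upper: int = INT32_MAX) -> int:
    """Return rtc_seconds if it looks sane, otherwise fall back to build_epoch.

    A reading before the release time cannot be right, and (as an added
    assumption) a reading beyond `upper` cannot be handled by 32-bit userspace.
    """
    if rtc_seconds < build_epoch or rtc_seconds > upper:
        return build_epoch
    return rtc_seconds
```

For example, `plausible_time(5388461523)` (a year-2140 reading) falls back to `BUILD_EPOCH`, while a reading shortly after the release time is passed through unchanged.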
While the year 2038 issue certainly needs to be fixed in the kernel and glibc on 32 bit systems, maybe it would be useful to have systemd agree with the rest of the system on the size of time_t. This would probably solve this issue. |
@poettering the problem is that systemd oversteps its role. When a whole system can boot and work reliably with a bad date, even to reach the point where it can ask for the proper date over the network, it should be allowed to do this. And systems have been reliably working this way for decades. You may not remember MS-DOS, but in its early versions the user was asked to enter the date at the end of boot. And yes, a broken RTC can cause any invalid date to be reported, and with regular systems it's not an issue. I'm switching to another distro because a Debian 8 cannot boot on my board for this precise reason. Thank you for deliberately ignoring issues your software causes to end users. |
@wtarreau again, the key of the issue is that when the clock overrun takes place, timerfd with TFD_TIMER_CANCEL_ON_SET keeps firing continuously and hence systemd busy loops. It's a kernel issue that this keeps firing continuously and it needs to be fixed there. Systemd cares in no way if the clock is bogus, we just end up busy looping because of a fucked kernel interface. |
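One way to picture the busy loop described here is a toy model in Python. It is not kernel or systemd code. It assumes the timer is armed at an absolute wall-clock expiry that cannot lie beyond the 32-bit limit, so once the clock is already past that limit the timer is ready on every poll. The names `AbsoluteTimer` and `count_wakeups` are invented for this sketch.

```python
INT32_MAX = 2**31 - 1  # latest expiry a signed 32-bit time_t can express


class AbsoluteTimer:
    """Toy stand-in for a timer armed at an absolute wall-clock time."""

    def __init__(self, expiry: int) -> None:
        self.expiry = expiry

    def is_ready(self, now: int) -> bool:
        # An absolute timer whose expiry is not in the future fires at once.
        return now >= self.expiry


def count_wakeups(now: int, polls: int, expiry: int = INT32_MAX) -> int:
    """Count how often an event loop would wake up in `polls` iterations."""
    timer = AbsoluteTimer(expiry)
    wakeups = 0
    for _ in range(polls):
        if timer.is_ready(now):
            wakeups += 1
            # The handler would log "Time has been changed" and re-arm the
            # timer at the same expiry, which is still in the past.
            timer = AbsoluteTimer(expiry)
    return wakeups
```

With a sane clock, `count_wakeups(1_456_012_800, 5)` gives 0; with a clock past the limit, `count_wakeups(INT32_MAX + 1, 5)` gives 5, i.e. one wakeup per loop iteration.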
So the root cause of this clearly is that the init system should not need to use timerfd in the first place. I don't know why systemd needs this while legacy init hasn't needed it for decades, but if there was a way to disable this dependency on such a recent (and possibly still broken) feature, it would definitely solve this issue. |
If it's a kernel bug as @poettering suggested, and if it's not done already, can someone report it and then link the report to this one? Thanks. Downstream bugs are welcome too. |
@rzr kernels have always had and will always have bugs. It's important to be able to work around them in userspace. As a lot of systems can boot regardless of this bug, it seems reasonable to expect systemd to simply disable the non-working feature when it detects that it loops and only emit a message "kernel bug detected, fix your system". Anyway it's not my problem anymore: I've switched away from Debian 8 and have no problems now. |
@wtarreau well, we expect bugs to be fixed where they are, and not hacked around in systemd. Sorry. |
Sure, let's never check any syscall's return code nor errno since the lower side may never fail and is expected to be perfect. Now I get a clearer view how this whole thing is designed, that's enough to scare me away from it. For me that's the difference between a quick proof-of-concept and production code. |
@wtarreau hmm? we check all return codes and errnos here and pretty much everywhere else. Except, that in this case the kernel reports no error, it just keeps firing the timerfd. But anyway, I think this discussion is pointless. You apparently have no idea what you are talking about, but just want to vent frustration. Let's just stop this then here, and please find a different place to vent that, this is a bug tracker. |
Just like the kernel disables spurious IRQs, systemd should disable a spurious timerfd. There's no black magic here. Basically manager_clock_watch() could start with something more or less like this
And yes it's a workaround for a kernel bug, just like the kernel works around hardware bugs. But it allows end users who chose neither systemd nor the kernel version to boot a system with a dead battery. |
I just ran into this problem on a brand new ARM-based embedded board with an RTC that had its clock way out at year 2128. The system was fine other than that systemd kept spewing this message and wouldn't allow any user logins. |
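The snippet proposed in the comment above is not preserved in this copy of the thread. As a separate and purely hypothetical illustration of the general idea (stop listening to an event source that fires far too often, much as the kernel does for spurious IRQs), here is a small Python sketch. The class name `SpuriousWakeupGuard` and its thresholds are invented for this example and do not come from systemd.

```python
from collections import deque


class SpuriousWakeupGuard:
    """Flag an event source as broken when it fires too often in a short window."""

    def __init__(self, max_events: int = 10, window_seconds: float = 1.0) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.timestamps = deque()
        self.disabled = False

    def record(self, now: float) -> bool:
        """Record one wakeup at monotonic time `now`.

        Returns True while the source should stay enabled, False once it
        has been judged spurious and should be disabled.
        """
        if self.disabled:
            return False
        self.timestamps.append(now)
        # Forget wakeups that fell out of the observation window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_events:
            self.disabled = True
        return not self.disabled
```

A caller would feed it a monotonic timestamp on every wakeup and, once `record()` returns False, close the event source and log a single warning instead of looping.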
Can someone affected by the issue try to compile a kernel with that patch: I believe this will solve the issue; however, I can't test it as I'm not running systemd on 32bit platforms. Note that I'm still not sure this is the correct way to go, as unsetting RTC_HCTOSYS should have the same effect, and I'm pretty sure no distribution actually needs it but they still activate it. |
@alexandrebelloni, unless I'm mistaken, this will only fix the issue at boot but will not prevent the system from going down just afterwards, so it can even be worse in some cases. For example if your system boots at 1/1/2038, it will suddenly die in this endless loop 18 days later. The real issue is the loop itself. I agree that timerfd should not trigger in loops, but the fact that it's a known design limitation should be taken into account by the caller, especially since this code is only there to detect time changes. Another solution based on your patch could be to tell systemd to stop using timerfd if the time is close to this limit. You may also make timerfd return an error if the current time is past this limit, but I believe that in this case systemd will loop on timerfd errors (and that doesn't solve the issue of running userland on existing boards shipping with their own kernels working fine with other distros). |
hctosys is setting the system time from the kernel. This means that 32bit system can get their time set to a date after the 31bit time_t overflow. This is currently an issue as userspace is not yet ready to handle those dates and may break. For example systemd's usage of timerfd shows that the timerfd will always fire immediately because it can't be set at a date after the current date. The new RTC_INVALID_2038 option will make sure that date after 03:09:07 on Jan 19 2038 are invalid. This is 5 minutes before the 31bit overflow. This leaves enough time for userspace to react and is short enough to make the issue visible. Signed-off-by: Alexandre Belloni <[email protected]>
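The margin mentioned in the commit message can be checked with a few lines of Python. This is only an arithmetic sketch: the names `CUTOFF`, `OVERFLOW`, `rtc_time_is_valid` and `seconds_before_overflow` are invented here and do not appear in the kernel patch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Last second a signed 32-bit time_t can represent.
OVERFLOW = EPOCH + timedelta(seconds=2**31 - 1)

# Cutoff quoted in the commit message: dates after this are treated as invalid.
CUTOFF = datetime(2038, 1, 19, 3, 9, 7, tzinfo=timezone.utc)


def seconds_before_overflow(moment: datetime) -> int:
    """Seconds remaining between `moment` and the 32-bit limit."""
    return int((OVERFLOW - moment).total_seconds())


def rtc_time_is_valid(moment: datetime) -> bool:
    """Mirror the rule described in the commit message: reject dates after CUTOFF."""
    return moment <= CUTOFF


print(seconds_before_overflow(CUTOFF))
```

The cutoff sits 300 seconds, i.e. the stated 5 minutes, before the limit.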
@wtarreau: for systemd, this will indeed only solve the issue at boot, but it also solves other issues with other parts of userspace, so I will probably queue it for v4.6. Anyway, in the current state, 32bit platforms will fail on 19/1/2038. At least this solves the current issue. I'll switch the default to n once libc and critical applications have switched to a 64bit time. Note that I've tried to reproduce the issue and I can't reproduce the infinite loop. It loops for a while but the boot process seems to continue. |
On Sat, Feb 20, 2016 at 05:14:03AM -0800, Alexandre Belloni wrote:
I agree.
That's precisely the problem I was discussing. I think it doesn't leave
It depends if my system boots more than 5 minutes before wrapping or not :-)
Interesting. In my case the loop happens on the serial port, so the messages Willy |
On 20/02/2016 at 05:46:43 -0800, Willy Tarreau wrote :
I don't believe any serious person will let its system run one year with
Yeah but it will have 10 minutes to get the real date and time. Which
I'm pretty sure it was not ntp as the board didn't have any network Alexandre Belloni |
Again, please fix the kernel to handle this less awfully. The kernel is borked here, systemd is not. If the option is to fix it properly in the kernel or work around in systemd, then the former is the right solution. Both are Open Source projects, hence there's really no reason to work around borkedness in other open source software. If you have hardware that comes up with the clock set to 2038 on an empty battery, then that too sounds like something the kernel should fix... |
@poettering I agree that the kernel needs fixing and I'm trying to work on that. From the reports I see on this bug, this is mainly the pcf8523 (present on the cubox-i) and I've spotted an issue in its driver. Can people affected tell which RTC they are using? I'd like to get more reports and I would prefer those reports to be on the rtc mailing list: |
My lsproduo ships an rs5c372a. I may write a longer report to the list once I am able to hack on it. |
On Sat, Feb 20, 2016 at 03:40:43PM -0800, Alexandre Belloni wrote:
A clearfog A1. I suspect the RTC is not yet well supported as it reports Regarding your point about 5-10mn being enough for an NTP request, yes Willy |
The one instance where I've seen this is on a custom board with a PCF8523, and it wasn't a dead battery. It appears that the first time the part is turned on, the date and time are random. In this case the year register was set to 0x80. Matt S. |
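To see why a year register of 0x80 is a problem, here is a small Python sketch. It assumes, as is common for RTC chips of this kind but not verified against the datasheet here, that the register holds two packed-BCD digits counted from the year 2000. The function names and the `base_year` parameter are illustrative only.

```python
def bcd_to_int(value: int) -> int:
    """Decode one packed-BCD byte (two decimal digits) into an integer."""
    high, low = value >> 4, value & 0x0F
    if high > 9 or low > 9:
        raise ValueError(f"not a valid BCD byte: {value:#04x}")
    return high * 10 + low


def rtc_year(year_register: int, base_year: int = 2000) -> int:
    """Turn a BCD year register into a calendar year (base_year is an assumption)."""
    return base_year + bcd_to_int(year_register)


print(rtc_year(0x80))
```

Under that assumption, 0x80 decodes to the year 2080, well past the 2038 limit of a 32-bit time_t.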
I have an old EeePC T91MT 32bit laptop with no battery that sets the date to 2060 for whatever reason. At least now I know I just need to set the clock in the bios and stop yanking out the power cable. |
On Tue, Feb 23, 2016 at 02:50:30AM -0800, David C. Bishop wrote:
Please also report the bug to your distro vendor, otherwise whatever the Willy |
Seems to be resolved in Debian Testing's kernel 4.6 for the OLinuXino A20 Micro. So, save off your data and upgrade your kernel, if you're running Debian. On my A20 hardware, a previous install of FBX-0.7 failed to boot with the "Time has been changed" error, while the current FBX-0.9 succeeds. |
I didn't push it upstream but I don't know what vendors are doing with their kernel. |
Currently there is booting issue of tizen 3.0 mobile 32bit platform with PID1 getting stuck printing "systemd[1]: Time has been changed" continuously. [1][2] This problem is related with rtc-s3c and now rtc-s3c is reporting time values over 2038 years like below: [ 5.124065] s3c-rtc 10590000.rtc: setting system clock to 2140-10-02 10:52:03 UTC (5388461523) Android MM N910S kernel uses only pmic rtc, not rtc-s3c so i think it's better to disable rtc-s3c. [1] systemd/systemd#1143 [2] https://patchwork.ozlabs.org/patch/585661 Change-Id: Idc580c2494aa309607dd835ca39411236f3366e6 Signed-off-by: Joonyoung Shim <[email protected]>
|
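The timestamp in the kernel log line quoted in that commit message can be cross-checked with a short Python sketch. The helper name `describe_timestamp` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def describe_timestamp(seconds: int):
    """Return (UTC datetime, exceeds_32bit) for a count of seconds since the epoch."""
    return EPOCH + timedelta(seconds=seconds), seconds > INT32_MAX


# Value taken from the s3c-rtc log line: "... 2140-10-02 10:52:03 UTC (5388461523)"
when, too_big = describe_timestamp(5388461523)
print(when.isoformat(), too_big)
```

The value converts back to the date printed by the driver and is larger than what a signed 32-bit time_t can hold.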
Difference is that old school init is able to bring up the system to usable state, while systemd doesn't but ends up in a limbo. |
By setting the date to sometime in 2038, we can make systemd getting stuck printing
systemd[1]: Time has been changed
over and over again. It seems that the timerfd the manager creates always has an event pending for us to process. I am not sure if this is a problem specific to our 32 bit setup (kernel / libc / systemd) or a generic problem. It would be nice if someone else could test it on their 32 bit setup.
Here PID1 prints
systemd[1]: Time has been changed
over and over again.