Configuration backup/safeguard against corruption #9046

Closed
thucar opened this issue Aug 7, 2020 · 18 comments
Labels
awaiting feedback (Action - Waiting for response or more information), troubleshooting (Type - Troubleshooting)

Comments

@thucar

thucar commented Aug 7, 2020

Have you looked for this feature in other issues and in the docs?
Yes

Is your feature request related to a problem? Please describe.
A prolonged period of power problems led to 6 separate devices with corrupted configs, all needing to be re-flashed over serial.
#8929

Describe the solution you'd like
It would be great to have a backup of the config stored in flash, along with sanity check/journaling to fall back to a working conf should the primary config get corrupted for whatever reason. Removing installed devices to re-flash them can be a real pain and damage the internal decoration of a finished setup.

Describe alternatives you've considered
Currently I'm having to reconsider installing Tasmota devices in limited access locations.

Additional context
I know that a corrupted config is something that is not very common. But it can be extremely disruptive if it happens to a large number of devices at the same time (as I recently experienced). I've been recommending Tasmota to friends and even helped less technically inclined people set up home automation systems based on Tasmota. I've been telling them how reliable and trouble-free my experience of many years has been so far. Now I need to be aware of the fact that a power outage or any other problems on the power line can cause their entire setup to fail.

@Jason2866
Collaborator

If you often have power outages, use SetOption65 1. It disables the fast power cycle device recovery function.

@thucar
Author

thucar commented Aug 7, 2020

I do not have power outages often. The previous one before this incident was maybe 4-5 years ago. But the severity of the incident is what makes me worried, not the frequency.

@Jason2866
Collaborator

Jason2866 commented Aug 7, 2020

There is no problem with the current Tasmota version. It is as stable as older versions.
The issue you mentioned, #8929, is a result of fast-power-cycle-device-recovery.
Anyway, losing power during a flash write can always render the device unusable. For critical stuff a UPS is needed to avoid the chance of damaged hardware. This is not a Tasmota problem and cannot be solved with software.

@thucar
Author

thucar commented Aug 7, 2020

If the consensus is that all devices in a section of a house getting soft-bricked is by design and nothing will be done about it, then I accept that and take note.

However, I do not agree with the statement that there is no problem and this is not something that can be solved via software. I have probably way more "smart home" devices in my house than the average user. Ranging from Z-Wave to Zigbee to WiFi. Not to mention wired devices. During this incident, none of the other devices showed any problems. Everything just worked after the power was fixed.

Do not get me wrong, I absolutely love Tasmota. I've been using and preaching for it for years. I just feel that this is something that should not just get dismissed with "Not our problem, our firmware works great" but rather it could be a point of discussion on "what can we do to overcome these situations - however unlikely, in the future"

@Jason2866
Collaborator

Please describe how to reproduce the issue. Since we do not have this issue ourselves, we can't search for the cause.

@thucar
Author

thucar commented Aug 7, 2020

Unfortunately it will be difficult to reproduce. The best I can do is offer one of my Tasmota flashed devices that is affected, to someone who is willing to troubleshoot this. Not sure if it would help in troubleshooting or not though.

@Jason2866
Collaborator

Thanks, but this would not help.

@PatMan6889

PatMan6889 commented Aug 9, 2020

I think I have a similar issue, but in my case it is because of a test setup with a breadboard and unreliable wiring. The ESP32 sometimes just does not start up anymore and does other weird things. Is there a way to do a clean shutdown to prevent any corruption when doing maintenance or shutting down power?

@arendst
Owner

arendst commented Aug 10, 2020

Currently there is no way to do a shutdown; only a clean restart using command restart 1 will save settings just before the restart.

A shutdown without a restart would mean to execute the restart code and stay in an early loop after the restart or finding a way to halt the processor(s) during restart. The only way out would be a power cycle.

I haven't thought about it yet.

@PatMan6889

Maybe it is possible to have a special restart option. Unfortunately I cannot code it.

arendst added a commit that referenced this issue Aug 10, 2020
Add command ``Restart 2`` to halt system. Needs hardware reset or power cycle to restart (#9046)
@arendst
Owner

arendst commented Aug 10, 2020

@Ingenieur89 try the latest development commit, which introduces command restart 2 to perform a clean shutdown and leave the system in a wait loop forever.

Please report if this solves your issue.

@PatMan6889

> @Ingenieur89 try the latest development commit, which introduces command restart 2 to perform a clean shutdown and leave the system in a wait loop forever.
>
> Please report if this solves your issue.

Works fine. No bugs or unexpected behaviour. Thanks a lot!

What I see when looking at "Flash write Count":
Restart 1 -> adds 1 count
Restart 2, after that restart by button (enable) or replugging power -> adds 1 count
Restart by button (enable) or replugging power (no console input) -> adds 3 counts

@ascillato2 added the awaiting feedback and troubleshooting labels Aug 10, 2020
@stefanbode
Contributor

stefanbode commented Aug 11, 2020

I can also comment on the bricked devices due to config corruption. If you lose power or have undervoltage during a flash write, you're in trouble. The checksum is already a big step forward. If we now change the pointer to the new config AFTER it was written, that may also help to start without a bad config. I use the user config override to provide WiFi and a minimal MQTT connection. This currently lets me regain control of the device in 99% of cases, even if it was completely reset. The remaining 1% is when Tasmota detects the config as good but the WiFi AP and credentials are garbage. I have disabled AP mode for security. In that case the USB cable comes into play, or short-pressing reset 10 times.
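The "switch the pointer only AFTER the write" idea can be sketched roughly as follows. This is a minimal illustration in Python, not Tasmota code: the `TwoSlotStore` class, its two-slot layout, and the `fail_during_write` flag are all hypothetical, and the sketch does not model corruption of the pointer itself.

```python
import zlib


class TwoSlotStore:
    """Hypothetical two-slot config store with a commit pointer."""

    def __init__(self):
        self.slots = [None, None]  # simulated flash areas
        self.active = None         # index of the last committed slot

    @staticmethod
    def _valid(record):
        # A record is a 4-byte CRC-32 followed by the config bytes.
        if record is None or len(record) < 4:
            return False
        stored = int.from_bytes(record[:4], "little")
        return zlib.crc32(record[4:]) == stored

    def save(self, config, fail_during_write=False):
        # Always write to the slot that is NOT currently active.
        target = 0 if self.active != 0 else 1
        record = zlib.crc32(config).to_bytes(4, "little") + config
        if fail_during_write:
            # Simulate power loss: only half of the record reaches flash.
            self.slots[target] = record[: len(record) // 2]
            return False
        self.slots[target] = record
        # Read back and verify before moving the pointer.
        if not self._valid(self.slots[target]):
            return False
        self.active = target
        return True

    def load(self):
        if self.active is None:
            return None
        return self.slots[self.active][4:]


store = TwoSlotStore()
store.save(b"wifi=home")
store.save(b"wifi=new", fail_during_write=True)  # torn write
print(store.load())  # still the previously committed config
```

Because the pointer only moves after a verified write, a torn write leaves the previously committed config in place.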

@PatMan6889

> There is no problem with the current Tasmota version. It is as stable as older versions.
> The issue you mentioned, #8929, is a result of fast-power-cycle-device-recovery.
> Anyway, losing power during a flash write can always render the device unusable. For critical stuff a UPS is needed to avoid the chance of damaged hardware. This is not a Tasmota problem and cannot be solved with software.

Regarding damaged hardware, can microcontrollers like an ESP get permanently damaged by brownouts? (No fix by just deleting and reflashing)

@Jason2866
Collaborator

I have never had a defective ESP82xx. All defects were in the power supply or the flash chip.

@arendst
Owner

arendst commented Aug 13, 2020

> The checksum is already a big step forward. If we now change the pointer to the new config AFTER it was written that may also help to start without a bad config.

There is no pointer to the latest config after a restart, for the same reason: the pointer may be corrupt too.

The way config resiliency currently works after a restart is that it tries to find the latest updated config from the config pool of eight 4k flash pages. Without corruption this works as expected.

With corruption, in theory, it searches the pool for the next older config and uses that one.

As this is theory, how well it works depends on legacy situations:

  • versions before 6.0 did not have CRC at all so the only way to find a supposedly valid page is by checking the cfg_holder value.
  • versions between 6.0 and 6.6.0.11 had a 16-bit CRC which doesn't trap all changes
  • versions after 6.6.0.11 have a 32-bit CRC AND a timestamp
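To illustrate why a narrower check can miss changes that a 32-bit CRC catches, here is a small Python sketch. It uses a plain 16-bit additive sum as a stand-in for a weaker check; that is an assumption for illustration only and is not Tasmota's actual 16-bit algorithm, which is not shown in this thread. `zlib.crc32` is the standard CRC-32 from Python's standard library.

```python
import zlib


def sum16(data: bytes) -> int:
    """Illustrative 16-bit additive checksum (NOT Tasmota's algorithm)."""
    return sum(data) & 0xFFFF


original = bytes([0x12, 0x34, 0x56, 0x78])
swapped = bytes([0x34, 0x12, 0x56, 0x78])  # first two bytes exchanged

# The additive sum ignores byte order, so the swap goes unnoticed.
sum16_detects = sum16(original) != sum16(swapped)

# CRC-32 detects any error burst of 32 bits or fewer, including this one.
crc32_detects = zlib.crc32(original) != zlib.crc32(swapped)

print(sum16_detects, crc32_detects)  # False True
```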

As said, due to legacy reasons the timestamp isn't used yet. The config version detection code depends on a valid config version value in the config page. If the page is corrupt, a valid version number cannot be found and the detection will fail too.

In all cases of detected corruption it should have loaded the default configuration but in practice it seems it often fails to detect corruption.

Considering that the 32-bit CRC AND the timestamp have now been active for almost a year, I think it's time to drop config resiliency support for versions before 6.6.0.11 and try to fix resiliency based on CRC AND timestamp only.
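A selection based on CRC and timestamp only could look roughly like the Python sketch below. It is an illustration under assumptions, not the actual Tasmota code: the page layout (a 4-byte CRC-32 followed by a 4-byte timestamp and the payload) and the names `make_page`, `page_timestamp` and `select_config` are hypothetical. Only the pool of eight 4k pages and the fallback to defaults come from the description above.

```python
import struct
import zlib

PAGE_SIZE = 4096  # 4k flash pages, as described above
POOL_SIZE = 8     # eight pages in the config pool


def make_page(timestamp: int, payload: bytes) -> bytes:
    """Build a page: CRC-32, then timestamp and payload (assumed layout)."""
    body = struct.pack("<I", timestamp) + payload
    body = body.ljust(PAGE_SIZE - 4, b"\xff")
    return struct.pack("<I", zlib.crc32(body)) + body


def page_timestamp(page: bytes):
    """Return the page's timestamp if its CRC matches, else None."""
    if len(page) != PAGE_SIZE:
        return None
    (stored_crc,) = struct.unpack_from("<I", page, 0)
    body = page[4:]
    if zlib.crc32(body) != stored_crc:
        return None
    (timestamp,) = struct.unpack_from("<I", body, 0)
    return timestamp


def select_config(pool, default):
    """Pick the valid page with the newest timestamp, else the default."""
    best = None
    for page in pool[:POOL_SIZE]:
        ts = page_timestamp(page)
        if ts is not None and (best is None or ts > best[0]):
            best = (ts, page)
    return best[1] if best else default


older = make_page(100, b"older config")
newer = make_page(200, b"newer config")
corrupt = bytearray(make_page(300, b"newest config"))
corrupt[10] ^= 0xFF  # simulate a damaged byte in the newest page

chosen = select_config([older, bytes(corrupt), newer], default=b"defaults")
print(page_timestamp(chosen))  # 200: the newest page that passes the CRC
```

The damaged page is skipped even though it carries the newest timestamp, and the default configuration is used only when no page in the pool passes the CRC.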

As an important note, this will definitely break OTA upgrades from versions before 6.6.0.11 to 8.4.0.2 in one step. But I have already noted the supported upgrade path in the ReleaseNotes.

arendst added a commit that referenced this issue Aug 13, 2020
- Add better config corruption recovery (#9046)
- Remove support for 1-step upgrade from versions before 6.6.0.11 to versions after 8.4.0.1
@arendst
Owner

arendst commented Aug 13, 2020

Give it a try.

@thucar
Author

thucar commented Aug 13, 2020

Thank you @arendst for putting your time and effort into figuring this out and trying to find a solution that would prevent scenarios like I experienced from happening in the future.
