Added node API to allow in-place firmware upgrade #2606

Closed
TerryE opened this issue Jan 10, 2019 · 12 comments

@TerryE
Collaborator

TerryE commented Jan 10, 2019

Missing feature

(This FR replaces #1496.) At the moment there is no facility for remote upgrade of the Lua Firmware.

Justification

Currently Lua applications can be field upgraded by updating the SPIFFS and LFS. However there is no functionality on the ESP8266 to similarly upgrade the firmware, so we can't update libraries or introduce new firmware functionality. We should provide Lua developers with the ability to remotely upgrade firmware on ESP8266 devices.

There are broadly two approaches to implementing this functionality.

  • The first is to base this on rBoot or equivalent technology and configure multiple OTA partitions. The major limitation of this approach is that it does not support the wide range of 1 MB flash devices such as the Sonoff and Shelly devices.
  • The second is to use an asymmetric SPIFFS-based staging approach as discussed below.

Workarounds

Devices must be upgraded by a field visit.

Proposed Approach

My suggestion is functionally the same approach as the LFS reload, except that by moving the code into a pre_load stub, there is nothing to prevent it also being used to load the firmware itself.

  • This is resilient because it uses a SPIFFS-to-flash copy.
  • The process is two-pass: the first pass is simply a dummy run that scans, inflates and checksums the image. The second pass does write to flash, but directly from SPIFFS.
  • SPIFFS is read-only during this process, so any power failure and restart will simply retry the firmware load.
  • There is only one base address for the firmware, and the Lua application needs some method of uploading the .bin.gz file into SPIFFS (e.g. a Lua-based WGET or FTP transfer) before initiating the reflash.
  • We can use a NAND-write region to allow the node API call to communicate context to the bootstrap.
  • The stub would need a non-ICACHE version of uzlib/inflate.c and some lua/lflash.c functions, as well as a cut-down read-only version of SPIFFS, so this isn't a small piece of work.
  • A major advantage of this approach is that it will enable remote upgrade of 1 MB flash devices.
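
For readers who want to see the shape of the two-pass process, here is a minimal host-side sketch in Python. It is not the proposed stub (which would be non-ICACHE C code reading from SPIFFS). The names `inflate_chunks`, `dry_run` and `reflash` are hypothetical, a `bytes` object stands in for the staged .bin.gz file, and a `bytearray` stands in for the flash. It assumes standard gzip framing, so the checksum is the CRC32 in the gzip trailer.

```python
import gzip
import zlib


def inflate_chunks(blob, chunk=64):
    """Stream-inflate a gzip image in small pieces, as a RAM-limited stub would."""
    d = zlib.decompressobj(wbits=31)  # 16 + 15: gzip framing, 32 KB window
    for i in range(0, len(blob), chunk):
        out = d.decompress(blob[i:i + chunk])
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail
    if not d.eof:
        raise ValueError("truncated image")


def dry_run(blob):
    """Pass 1: scan, inflate and checksum the image without writing anything."""
    size, crc = 0, 0
    for out in inflate_chunks(blob):  # raises zlib.error on a bad stream or CRC
        size += len(out)
        crc = zlib.crc32(out, crc)
    return size, crc


def reflash(blob, flash, base=0):
    """Pass 2 runs only if pass 1 succeeded; it writes straight into 'flash'."""
    size, crc = dry_run(blob)
    if base + size > len(flash):
        raise ValueError("image does not fit")
    pos = base
    for out in inflate_chunks(blob):
        flash[pos:pos + len(out)] = out
        pos += len(out)
    if zlib.crc32(bytes(flash[base:pos])) != crc:
        raise ValueError("flash verify failed")
    return size


image = bytes(range(256)) * 64          # stand-in firmware image
staged = gzip.compress(image)           # stand-in for the staged .bin.gz
flash = bytearray(b"\xff" * 32768)      # erased flash reads as 0xFF
written = reflash(staged, flash)
```

Because the staged file is never modified, a restart at any point can simply run both passes again from the top, which is the retry behaviour described in the list above.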

If we adopt this approach then I suggest that we move the LFS reflash into this pre-init and unify the two lots of code. The main reason that I would like to do this is that the LFS deflate format differs from RFC 1951 in one annoying aspect: the RFC stipulates a 32 KB dictionary, and on top of this I need another 3 × 2 KB buffers. I can't guarantee that I can map these at our current entry point, so I have dropped the dictionary size back down to a non-standard 16 KB. This means that whilst you can use the standard inflate (gzip) library and utilities host-side to inflate LFS images, I can't use the same standard deflate libraries to create them, as these will use a 32 KB dictionary, which craters at runtime.
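
The dictionary-size point can be illustrated host-side with Python's `zlib`, which lets the compressor's window be set explicitly. This is only a sketch: the helper names are hypothetical, gzip framing is an assumption made for the demo, and nothing here reproduces the actual LFS image format or the uzlib code. It shows that a stream deflated with a 16 KB window is still readable by a stock inflater, and works out the RAM figures quoted above.

```python
import gzip
import zlib

KB = 1024
BUFFERS = 3 * 2 * KB  # the three 2 KB working buffers mentioned above


def deflate_gzip(data, window_bits):
    """Compress with gzip framing and a 2**window_bits byte dictionary."""
    c = zlib.compressobj(9, zlib.DEFLATED, 16 + window_bits)
    return c.compress(data) + c.flush()


def inflate_ram(dict_bytes):
    """RAM needed by the inflater: dictionary plus working buffers."""
    return dict_bytes + BUFFERS


sample = b"NodeMCU LFS image " * 4096
small_window = deflate_gzip(sample, 14)       # 2**14 = 16 KB dictionary
restored = gzip.decompress(small_window)      # stock inflater, 32 KB window

ram_standard = inflate_ram(32 * KB)           # 38 KB
ram_reduced = inflate_ram(16 * KB)            # 22 KB
```

The asymmetry is the one described in the paragraph above: a decoder with a larger window can always read a stream written with a smaller one, but not necessarily the other way round.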

I have thought of using alternative approaches such as switching to a 16 KB ICACHE for the LFS unpack, as this would release another 16 KB of RAM buffers for the dictionary. But more thought on this is needed.

@marcelstoer
Member

This FR replaces #1496.

I'd rather not keep around both then, no?

@djphoenix
Contributor

What happens when power fails during unpack? Some fallback? Or stuck in-the-middle with required repair with PC?

@marcelstoer
Member

Hint for anyone interested in OTA... A Mongoose architect wrote an eye-opening (for me) article about things to consider if you want to do OTA reliably: https://www.embedded.com/design/prototyping-and-development/4443082/Updating-firmware-reliably

@TerryE
Collaborator Author

TerryE commented Jan 11, 2019

What happens when power fails during unpack? Some fallback? Or stuck in-the-middle with required repair with PC?

Yuri, I have been doing this crap for decades, so I am not about to start making beginner's mistakes 😄

Have a review of the LFS code, for example. Your Qs are entirely valid, but I have also already discussed and addressed most of these points. The one failure mode that we would be vulnerable to with this approach is that the new firmware image isn't functional. This could be mitigated in a number of ways, and we should certainly discuss and agree which is the most appropriate.

  • One option is not to bother. The firmware is just that, not the application. The failure mode that we are discussing is that the upgrade works, but the firmware build itself is not functional. A reasonable presumption here is that if a developer has a lot of devices in the field, then he or she should first test the firmware in a release environment. If I have 5 Sonoff switches running my NodeMCU build in my house, I currently have no remote upgrade facility. This change would give me one. If I've built a flawed image, then presumably I would test it on one Sonoff first, and the reversion is to go back to the old method of taking it apart and plugging in the flashing header.

  • An alternative approach for those configs that have a large enough SPIFFS would be to have a reversion image and some form of update confirmation. Doable, but a lot of complication and maybe a later option.
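
To make the second option concrete, here is one possible shape for "some form of update confirmation", sketched in Python. Everything in it is an assumption for illustration: the boot-attempt counter, the limit of three tries, and the names `BootContext`, `mark_flashed`, `next_action` and `confirm` are hypothetical and not part of any agreed design. On a device, the context would have to live somewhere that survives a reset.

```python
from dataclasses import dataclass

MAX_BOOT_TRIES = 3  # hypothetical limit, not a value from this thread


@dataclass
class BootContext:
    pending: bool = False  # a new image was flashed and is not yet confirmed
    tries: int = 0         # boots attempted since the new image was flashed


def mark_flashed(ctx):
    """Called once the new firmware has been written."""
    ctx.pending = True
    ctx.tries = 0


def next_action(ctx):
    """Called by the bootstrap on every reset; returns 'boot' or 'revert'."""
    if not ctx.pending:
        return "boot"
    if ctx.tries >= MAX_BOOT_TRIES:
        # The new image never confirmed itself: reflash the reversion image.
        ctx.pending = False
        ctx.tries = 0
        return "revert"
    ctx.tries += 1
    return "boot"


def confirm(ctx):
    """Called by the application once it judges the new firmware healthy."""
    ctx.pending = False
    ctx.tries = 0


ctx = BootContext()
mark_flashed(ctx)
actions = [next_action(ctx) for _ in range(4)]
```

The hard part is deciding when the application may call `confirm`, which this sketch deliberately leaves open.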

@TerryE
Collaborator Author

TerryE commented Jan 11, 2019

A Mongoose architect wrote an eye-opening (for me) article.

Thanks Marcel, this is a good intro to the sort of issues that you need to address, but nothing new for me in this case.

Yes, this sort of stuff is complicated but it is not new to IoT devices. Just remember that the guys who wrote the S/W for the Apollo programme computers had less total processing capacity on the Command Module and LEM than in a single ESP8266, but they still faced the same sorts of issues.

I would ask all reviewers to focus on specific functional issues at this stage, and accept that the implementers are competent at the basic technologies. An example of a functional issue here is the implementation of reversion images. Do we:

  • Defer this option in the first release, accepting that even without this the new functionality would be a major step forward in 99%+ of use cases.
  • Mandate the use of a reversion image, in which case we disqualify ourselves from targeting two of the largest IoT consumer device ranges: Sonoff and Shelly.
  • Facilitate it in a subsequent PR.

Also, even if we do facilitate automated image reversion, establishing a simple and robust algo for triggering this isn't easy.

@marcelstoer
Member

nothing new for me in this case

I know 😉

@dtran123

Having a rollback scenario would be nice indeed, but beggars can't be choosers. If we have the ability to upgrade forward reliably, I'd be happy with that. Support for rollback could be a subsequent enhancement. Presumably the "new" build has been tested at length before being deployed.
Still, there will be unforeseen, obscure edge cases where a device could end up bricked under specific conditions despite all the code reviews and QA testing.

TerryE self-assigned this May 1, 2019
@TerryE
Collaborator Author

TerryE commented May 19, 2019

Done in SDK 3.0 tranche 2 update.

TerryE closed this as completed May 19, 2019
TerryE reopened this Oct 29, 2019
@TerryE
Collaborator Author

TerryE commented Oct 29, 2019

Not sure why I closed this one. It wasn't part of the SDK 3.0 tranche 2 update.

@stale

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 2, 2021
@KT819GM

KT819GM commented Jun 2, 2021

Don't even think about that, 'stale bot'

stale bot removed the stale label Jun 2, 2021
@TerryE
Collaborator Author

TerryE commented Sep 26, 2021

@KT819GM Modestas, given the current environment and developer resourcing, I can't see us getting around to this, so I don't think it sensible to leave an aspirational issue open. Feel free to reopen it if you can identify a realistic development resource.

TerryE closed this as completed Sep 26, 2021