Policy of supporting Lua in ROM #2068
With the version that I built, I wasn't convinced that you could entirely rom-ify some functions (without a lot of hacking in the GC). The issue is that some objects need to be followed by the GC process, and so cannot be romified. [My memory is a little hazy, and I don't know if you can actually write such a function. My implementation had all sorts of corner cases where some of the internal vectors could not be rommed] |
It's quite straightforward to get the I'll do the usual trick of adding this code to a vanilla Lua 5.1 version first and hammering it there on the PC before porting it to NodeMCU. It's just a pity that MMAPing a file into an absolute address window with the But to my core Qs, if we can get a robust approach here should we push it through dev to master? |
As someone who often exhausts the heap, I'd be for such a change, assuming it's not crazy to support. The only downside, other than code complexity, perhaps, that leaps to mind is that running from flash means that it's less clear ahead of time when the flash chip will be engaged, so anyone trying to use the flash-associated GPIOs might be in for a surprise. Not to ask a stupid question, but does anything in lua 5.2 or 5.3 make this easier? (I recall reading somewhere that there was an effort to bring nodemcu over to 5.3; am I just making that up?) |
Getting Lua running out of flash would be a big boon given the RAM constraints. My Qs/concerns are:
|
Sorry for the long reply guys, but I've tried to cover all of Johny's and Nathaniel's points below.

Forward compatibility with the 5.3 work

I will do a separate update on the 5.3 work on #1661, but as to the specifics of this functionality, like Johny I see addressing the RAM constraints as a key criterion for the success of NodeMCU Lua, so my original intent was only to add this functionality to 5.3. What I've raised here is essentially a backport of the technology to 5.1. Unlike the rest of the 5.3 work, which from a user perspective is either out-of-the-box 5.3 functionality or compatibility with the existing NodeMCU/eLua module API, this is new and pretty decoupled from the rest of the 5.3 work. Moving it into 5.1 allows a more vigorous engagement with the current developer community to get a consensus on how this API should work -- as well as bringing forward the benefits to the community.

Flash buffer location and wear levelling

I see this as less of an issue for a number of reasons. The current Winbond chips such as the W25QxxFV series quote a life of more than 100,000 erase/program cycles, and even if the modules have the earlier generation NAND flash chips with a 10K cycle life, say, the mode that I suggest we use here, which is essentially a reboot-reload-reboot cycle, envisages a usecase more similar to the conventional C rebuild and flash life-cycle. Even during active development the module might see 10 reloads a day (so a 10K-cycle part would last roughly 1,000 days of that), and in production maybe 1 a month, so this isn't going to be an issue. It would also be trivial to consider refinements such as the

Search order for loading

The Lua local pl=package.loaders; pl[3],pl[2] = pl[2],pl[3] to reverse this search order. And note that you can only specify the module name as a require parameter; it is the loaders (or The load functions are different in that these don't have a searcher concept and so we need some simple method of encapsulating access to the ROM store at a Lua API level. 
Also accessing the ROM store is fundamentally different from any of the load functions ( This is why I would promote the use of modules rather than functions in the store, as this is more transparent and fits better into the Lua paradigm. Nonetheless, if we want to make a more transparent method of loading functions from the store or VFS, whatever we do is going to be slightly a botch because internally within the relevant load function this isn't encapsulated within the vfs, because you don't actually do a load with a stored routine. It's already loaded, just not bound to a closure. We have exactly the same issue today with
setmetatable( self, {__index=function(self, func) --upval: loadfile
func = self.prefix .. func
local f, msg
if not skiprom or not skiprom[func] then f = getrom(func) end -- handle ROM load
if not f then f,msg = loadfile( func..".lc") end
if msg then f, msg = loadfile(func..".lua") end
if msg then error (msg,2) end
if func:sub(8,8) ~= "_" then self[func] = f end
return f
end} ) or actually the ROM optimised version which saves about 300 bytes RAM: -- skiprom if defined is a global
setmetatable( self, {__index=getrom("autoloader")} ) The

Performance and alignment issues

Yes, access from Flash is roughly 13-25× slower than from RAM in the case of a cache miss, but at the moment executing every Lua VM instruction (4 bytes) involves reading 100s of bytes of Xtensa instructions from Flash to interpret this one instruction. However, flash access is RAM cached and this reduces the overall impact, though accessing from scattered flash address regions will increase cache fault rates, so IMO moving code into ROM will slightly decrease instruction execution performance. But we also have to balance the slight increase of runtime in accessing ROM-based constants and strings with the fact that all of these resources are in ROM and have been removed from the scope of the GC, so GC sweeps will be a lot shorter and required less often. A big runtime saving. Also, the RAM limitations mean that non-trivial Lua programs involve a lot of dynamic loading of code from SPIFFS, which is slow because of the double whammy of SPIFFS overheads and the Lua load process. Converting ROM Protos to encapsulated functions is fast. So I believe that the average Lua application will run faster overall.

Unaligned (in the Lua RTS nearly all byte) fetches are slow because of the overhead of the unaligned exception handler. -O2 instead of -Os helps for general string access but not for this. However, there are (inline macro assembler) techniques we can use to replace unaligned fetches by a two instruction aligned fetch and extract. But I see this as a second order optimisation for later. |
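To make the "aligned fetch and extract" idea above concrete, here is a minimal host-side sketch in plain Lua. It is only a model of the arithmetic, not the firmware technique itself: `flash_words` and `read_byte` are hypothetical names, flash is modelled as a table of little-endian 32-bit words, and the real optimisation would be done in C or inline assembler.

```lua
-- Toy model (assumption): flash is a table of aligned 32-bit little-endian
-- words. Word i (0-based) covers byte addresses 4*i .. 4*i+3.
local flash_words = { [0] = 0x44434241, [1] = 0x48474645 }  -- "ABCD", "EFGH"

-- Hypothetical helper: read one byte using only an aligned word read,
-- followed by an extract of the wanted byte lane.
local function read_byte(addr)
  local word = flash_words[math.floor(addr / 4)]  -- step 1: aligned 32-bit fetch
  local shift = 8 * (addr % 4)                    -- which byte lane in the word
  return math.floor(word / 2^shift) % 256         -- step 2: extract that lane
end

local chars = {}
for addr = 0, 7 do
  chars[#chars + 1] = string.char(read_byte(addr))
end
print(table.concat(chars))  --> ABCDEFGH
```

The point of the sketch is that no byte-wide load ever touches "flash": every access is a whole aligned word, so the exception path is never taken.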
Thanks much for the very detailed response! |
Nice comprehensive response, cheers. Just a minor comment regarding unaligned stuff; I really did mean unaligned (exception code 9), rather than the sub-32bit-wide loads ("load/store error", code 3). The latter we have our custom exception handler to patch up and recover with. Unaligned 32bit access however would still be fatal. It may not be an issue as you say "nearly all byte", but worth keeping an eye on. I'm in favour of the approach outlined here. Obviously we'd need to have good docs explaining how to use it, when the time comes :) Ah, and one more question: a 64k block is obviously larger than the amount of free RAM we currently have, and thus would likely go partially unused no matter how badly one tries to move code to it. Any ideas on how to get the best use out of it? I know you ruled out incremental freezes above... |
It's late for me so a quick response. I understand your exception code 9 point and I will check this, but I don't believe that this is an issue. Re 64Kb, the reason for two passes is to serialise the load process. There are two constraining factors: the size of the string table, and the size of the largest module that you need to load, because each file is loaded into RAM, then cloned into flash. We need to play to see how much of a constraint this is in practice, but whatever it is, it's still a lot better than current constraints. |
For those concerned with performance, I would argue that one of the key usecases for the ESP8266 (& ESP32) is IoT applications. For the majority of cases that I can think of, fast execution is not critical. So even if this resulted in a slightly slower execution time, I would be ok with it. I can see great benefit from this feature:
These are just 3 benefits that completely justify such an initiative. I would be curious what the effort estimate is for completing this task. Are we talking weeks or months (for a developer)? I believe we run the risk of losing many community members for the ESP8266 if we don't solve the RAM shortage, which appears to affect proper support of secured connections. |
The issue isn't so much absolute hours, but elapsed. The internals of the Lua engine are both subtle and complex, and I (TerryE) seem to have taken the short straw to get to grips with all this. The issue is that all of this work is unfunded and done in my spare time, threaded amongst my other commitments like finishing off a house that my wife and I are building -- and doing the Home Automation for the same, which needs its own ESP code. But as to your core point, it's man-days of work (my being a male) rather than person-weeks, as a lot of the foundation work is already done as part of my Lua 5.3 upgrade for NodeMCU. |
Would it help you if some of us were helping you with the funding? We could gather a few volunteers to help out with donations. I am willing to help if I know this feature will result in fixing the current problem with secured TCP connections that started with the SDK 2.x on the ESP8266. |
@dtran123, nah. don't need the dosh. I spent 35 years in IT and ended up on top of the techie shit-heap. I am now a gentleman living on a sinecure (pension). It's hours in the day and priorities that are my problems 😉 Let me crack on, whilst the brain is still working. |
@dtran123 Increasing available heap may help with secure SSL, but as I reported in #1707, I think it is already viable to work with (and verify) ECC keys instead of RSA keys. If you control both ends, I think this is the most immediate path forward. You'll have to tweak the mbedtls configuration file as done in nwf@c1ed48c (and likely want to cherry-pick @djphoenix's update to mbedtls first, djphoenix@4958a4a) and/or see if @marcelstoer can add some checkboxes to the web builder to achieve the same effect. @TerryE Please don't take any of that to mean that I amn't rooting for your success. If not donations, perhaps a beverage of your choosing if we're ever in the same place. :) |
I want to support this in any way I can. |
At the moment, I am thinking about work-arounds for some interesting catch-22s thrown up in standard Lua testing. The clone to flash process destructively overwrites the old version of the cloned ROstrt, but this in turn was a clone of an earlier version of the strt, so contains all of the strings like |
Would it help to have two segments of the in-flash data? One which was objects whose positions needed to remain invariant across updates, and one which could be overwritten at will at each clone? I presume the former can be relatively small and so loaded into RAM at the start of cloning, and then written back to flash only after the second segment has been constructed and any requisite additions made? |
Close. My current approach is to treat the first boot after flashing the Lua firmware as special. This is partly to solve some issues in the NodeMCU 5.3 version where you can declare TStrings at compile time. The RTS performs library initialisation then executes a clone before starting Lua execution. This just clones the base string table. The addresses of this first tranche of TStrings are then preserved across subsequent clones, so the tables which use them are OK. |
Please see my paper on this approach: LRO Functions in NodeMCU Lua. Sorry, it includes some typos and other errors, but I will fix these if I update it following any review comments. |
@TerryE This looks really well thought-out. I very much like the flash-block lifecycle tracking trick (1F -> 7 -> 3) and the multi-reboot design seems like it will work well without being too complicated. ETA: Is there any way we could compute the flash block on the host as part of the image build? Obviously not exclusively, given the intended |
I already have a lot of code that can do this in a 5.3 environment, but then again the standard NodeMCU make generates a |
I have been thinking about this cross build issue. It would in principle be a straightforward variation. The way to do this would be to extend
might build a flash image for downloading based on the Lua files in |
Just throwing another thought into the pot here. For the modules with mere 512k flash, would it be feasible to exclude the "freezer" support? Would it make sense to have the interface sitting in e.g. a |
@jmattsson, I'd already decided to do that (and also make this option disabled by default in early releases, at least) for two reasons: first, so that those that don't want it don't have the flash overhead, and second, just in case we find issues in early testing. If we can conditionally remove the code then we can safely release it into dev. A second point: what do we formally call this? Philip first proposed the idea and called it "freezer". Do we use that, or do we use "flash"? |
I've got the vanilla PC 5.1.5 version working fine now. This does a
The PC-based version has to support PIC Flash because of Linux address randomisation, and if we start looking at @nwf Nathaniel's suggestion of Host buildable images then we might want to do the same for the NodeMCU version. However, I suggest that we keep the NodeMCU version as simple as possible in its first iteration. I am not going to include byte-access optimisation in this first version so it will hit the aligned handler, but adding this as a second pass is pretty straightforward. One thing that did strike me is that as soon as the VM starts running, the minimal string table is around 10Kb. Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app even if it hasn't frozen any code into the flash. |
👍 |
I've just been doing an L8UI code review. (This is the instruction that triggers 99+% of the unaligned fetch from flash exceptions.) It isn't too bad at all: the 'hot' modules,
But we can do this sort of optimisation once we've got the basic code working. |
Testing this lot is a total bitch. If you use the gdbstub then you can't use uart0 for Lua input or output. So you have to hook up a second USB serial chip to the UART1 and get debug logging that way. I've got two methods of loading code: a RAD cycle based on spiffsimg'ing a small 32Kb FS with various test stubs, and potentially a telnet stub, but you've got to get your execution past the basic bootstrapping processes. I am still fighting EGC issues which are subtly different to the standard Lua, and I am juggling this with all of my other time pressures. At least the PC version works OK. It doesn't help that the gdbstub is very fragile: if you can get to a breakpoint then you can examine RAM, but flash-based exceptions just seem to bypass the GDB exception handler entirely and panic the CPU, so there's no opportunity for PM diagnosis. 😞 Any pearls of wisdom or even sarcasm welcomed 😄 |
@TerryE I am afraid I have no wisdom to add, and sarcasm seems like it won't help much. I don't suppose the ESP8266 believes in JTAG? |
Sure @TerryE , how's "if it was easy then some other idiot would already have done it"? ;) Is the gdb stub being bypassed because we've already hooked the flash exception? I think Philip changed it so the handlers would chain for anything we didn't handle though, so I might be way off. |
@nwf As far as I know JTAG is a BITE interfacing technology. I've got more than that level of access and diagnostics; as Johny says: if it was easy ... We're working way up the stack here. Philip has already done some extremely valuable ground-breaking to help. It's a balancing act: I accept what we've got for now and work around it, or get sidetracked in improving and integrating the built-in test. What is clear is that we should do a major rewrite of the Extension Developer FAQ to include stuff like using the gdb stub, logging to UART1, using the mapfile, ... Yes, Johny, perhaps we need to make the flash exception handler gdb aware in the As it stands I have Lua 5.3 working as a NodeMCU host build and ditto the flash variant of std Lua 5.1.5, but bootstrapping this into the ESP8266 just takes time and perseverance. It's just that my other commitments mean that the elapsed time is more than I'd prefer. Luckily I am an old enough fart that I've done quite a bit of this low level hacking professionally back in the day, so it's just a matter of dusting off the cobwebs. |
So, so great to hear that this is progressing!! Thank you very much @TerryE for your continued support! We all completely understand Terry that you are under no obligation to meet deadlines or schedules and that this is a 'done in your spare time' kinda thing, however, for those of us that are eagerly awaiting a dev version of this, do you have a rough estimate of when you think you'll be able to release something? Not looking for a commitment in any shape or form, just a realistic prediction of when we are likely to be able to start using NodeMCU again! :) I understand you "don't need the dosh" but if a donation would sweeten the deal, then please let us all know. Unfortunately money is all I can offer, I wish it was technical support, but this is all a little beyond me! Hope it's going well! |
No money. Just priorities, I'm sorry to say. I am up to my eyeballs in Lua and Node Red commissioning my home automation system for my new house. I'll take a break soon and spend a half a day getting to the bottom of this GC issue. As soon as I have a stable build, I will push a commit to my github fork. |
Hi @TerryE - Any update on this please? I am desperate to get my hands on a version of NodeMCU that allows me to comfortably connect securely and also have enough heap for other stuff too! Again, I know you have no obligation here, but do you have a rough idea of when you'll be able to find some time to complete this? I am currently looking to invest in a developer to get a usable version up and running and wanted to see what stage you were at first, before I engage them...? Many thanks :) |
Hi, @georeb. I've just moved into the house that I've been building for the last few years and am typing this in my office on the first floor (in England the ground floor = zeroth). I hope to work on this over the holiday break and get a version out for evaluation. As to your investing in a developer to do this, my advice is: don't bother. This is complex stuff because you've got standard Lua, the eLua hacks, and all of the ESP issues interplaying. The learning curve is huge. |
Any progress over the Christmas break @TerryE ? I understand that it'll be a learning curve employing a developer, but I don't have much choice. You are unfortunately the bottleneck and as I cannot interest you in payment, I have to pay someone that will. As always, I understand that you have no obligation; however I (and others) have been waiting 5 months for this now and I have to do something before it all gets superseded by something else :/ If others want to chip in to help out with developer costs, please get in touch. |
@georeb, we sold our old house and moved into the one we built on the 19th Dec and I've just been getting the HA system to the point where it will run the house's heating and environmental controls. After working 7 days a week on the new build, we've got to the point where my wife and I both have time available for our interests. By all means employ a programmer to do this work, but don't underestimate the learning curve. You will probably waste your money as I will beat her or him to this deliverable. |
Congrats on moving into your new build! :) How far off the deliverable would you say you are @TerryE ? |
Any update please? @TerryE |
Yup. Long story, short. I ran into a bit of a show stopper that has forced me to change my implementation strategy. The problem wasn't that the approach doesn't work but more of a scaling issue because of how the GC (and the EGC modifications) interact with the build process. The EGC includes extra GC pause / restart directives around some operations, and nested pause / restarts aren't honoured, so these would always restart the GC. The Lua GC will aggressively scan all collectables and mark any that aren't in Lua scope for GC. What this means is that the build process doesn't scale robustly, and above a certain size of flash image the GC could come in and collect elements that I was assembling for the flash. Getting around this by referencing them was creating extra overheads which hit the scaling issue even more. Either that or start making fundamental changes to the GC, which I just am not willing to do. So this issue was about robustly building a flash image on device, rather than executing it on the node once built. My alternative approach is to move the flash image building into the
The only complication is that the host environment must be a little endian architecture such as Intel or ARM, but the code has to cope with 32-bit and 64-bit host environments. I am junking the current eLua-based The |
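The nested pause / restart problem described above can be shown with a toy model. This sketch is not the EGC code and does not reflect what was eventually implemented; `make_flag_gc`, `make_counted_gc` and `build_step` are hypothetical names used only to contrast a single on/off flag (where an inner restart re-enables collection too early) with a depth counter (where it does not).

```lua
-- Toy model only: these are NOT the real EGC functions.

-- Variant 1: a single flag, so nesting is not honoured.
local function make_flag_gc()
  local gc = { running = true }
  function gc.pause()   gc.running = false end
  function gc.restart() gc.running = true  end
  return gc
end

-- Variant 2: a depth counter, so only the outermost restart re-enables GC.
local function make_counted_gc()
  local gc = { running = true, depth = 0 }
  function gc.pause()
    gc.depth = gc.depth + 1
    gc.running = false
  end
  function gc.restart()
    gc.depth = math.max(gc.depth - 1, 0)
    gc.running = (gc.depth == 0)
  end
  return gc
end

-- An outer build step pauses the collector, then calls an inner operation
-- that has its own pause/restart pair. Returns whether the collector is
-- live while the outer step is still assembling its data.
local function build_step(gc)
  gc.pause()                 -- outer: "do not collect while I assemble"
  gc.pause()                 -- inner operation pauses again
  gc.restart()               -- inner operation restarts
  local live = gc.running    -- state seen by the rest of the outer step
  gc.restart()               -- outer restart
  return live
end

print(build_step(make_flag_gc()))     --> true  (collector woke up too early)
print(build_step(make_counted_gc()))  --> false (still paused, as intended)
```

With the flag variant, anything the outer step has allocated but not yet anchored is exposed to collection, which matches the scaling failure described in the comment.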
You're right, sounds extremely complicated @TerryE !! Are we close?! :) |
OK, it looks as if I have ironed out most of the issues and can put together an evaluation PR. I just need to check that my build without all of the debug hooks works as anticipated. We will clearly need a tweak of the API stuff and I still have some bits to add. But the highlights so far are:
There's still a TODO list, for example:
I've just been playing with a test LFS which has 7 function files loaded, has 135 string constants in the ROM table, 22 are in the RAM string table and there is over 39Kb heap still available for the App, so this is all looking promising. I've also fixed a bug in the remote debugger and become adept at using this. I've also added some gdb macros which will help library developers examine the Lua stack, and I need to write all of this up in the developer guide sometime. |
How does the following get built?
I'm hoping that it will be possible to write a wrapper for |
Phillip, there's no need. Read up on
The local index = node.flash.index
local function loader_flash(module)
local r = node.flash.index(module)
return type(r) == 'function' and r -- or nil otherwise
end
if index then package.loaders[2] = loader_flash end
If you have some init module in flash then you can stick this fragment in it, then the only RAM overhead is the As far as how it gets built, you can either just stick the modules in |
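For anyone wanting to try the loader idea on a PC before LFS is available, here is a host-side sketch. It assumes a stub in place of `node.flash.index`: `fake_lfs` and `flash_index` are hypothetical stand-ins, and the searcher is inserted with `table.insert` rather than overwriting slot 2, so the normal file searcher is kept. `package.searchers` is used as a fallback because `package.loaders` only exists in Lua 5.1.

```lua
-- Hypothetical stand-in for the flash store: module name -> loader function.
local fake_lfs = {
  greeter = function(modname)
    return { hello = function() return "hello from " .. modname end }
  end,
}

-- Hypothetical stand-in for node.flash.index(name).
local function flash_index(name)
  return fake_lfs[name]
end

-- Lua 5.1 calls this table package.loaders; 5.2+ calls it package.searchers.
local searchers = package.loaders or package.searchers

-- Same shape as the loader_flash function quoted above.
local function loader_flash(module)
  local r = flash_index(module)
  return type(r) == "function" and r or nil
end

-- Insert after the preload searcher so the "flash" copy wins over files.
table.insert(searchers, 2, loader_flash)

local greeter = require("greeter")
print(greeter.hello())  --> hello from greeter
```

Because `require` caches the result in `package.loaded`, the stubbed loader only runs once per module, which is the behaviour you would want from a flash-resident module too.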
Another trick is that I include a dummy module -- preload a bunch of strings into the ROstrt and avoid the RAM overhead.
-- use debug.getstrings('RAM') to work out which you might want to add
-- for your application
local preload = "?.lc;?.lua", "@init.lua" -- , ... extend as you need or add more
OK you are wasting n × ( I was thinking about reverse engineering the compiler to preload all of the common strings used during compilation to drop the compilation overhead. |
@TerryE Makes sense. Looking forward to seeing this in action! |
Incidentally one of the best tricks to do with the debugger is to add a macro for lua_assert which does a debugger break and then enable this for your test code. The Lua API macros use lua_assert a lot to do validation so this will pick up a lot of consistency errors. You can also make heavy use of lua_assert in your own code. If not enabled then this all gets optimised away / removed by the GCC code generator at -O2. The real PITA with using the debugger is that you lose the ability to input strings through the UART input, so you need to use a telnet stub for interactive testing. I am thinking of having a variant assert stub which puts out a warning message to come out of your UART terminal session and start |
This is GREAT news @TerryE !! :) Thankyou. |
The Alpha version will stay in my fork until at least one other committer has checked it out. Then it will be pulled into dev. It will go into master on the following release cycle, but with the |
Excellent! Will the version in your fork be an adapted version of NodeMCU |
The way that the release cycle works is that we commit to dev, then batches of commits to dev once stable are then committed to Master. The only path to updating master is to move dev patches into it. So I am not sure what you mean by your repeated Q. There should be a master version with LFS support in the next 2-3 months, but the delay is only because of the dev to master promotion cycle. About half the community use dev builds to take advantage of the latest bug fixes etc. The delay ensures that we have a reasonable chance to give good usage coverage to any changes before moving them into master. |
What I meant was, will your version be a standard copy of the current MASTER with the addition of LFS? |
@georeb Unlikely; it's more likely to be a fork of dev, rather than master, since that's the target for merge. |
Okay, understood. Thanks |
I have just updated my Lua Flash Store (LFS) whitepaper so it now reflects the current LFS implementation. Anyone interested in this, please reread carefully. The LFS patch is so large that I have also had to split it into 5 commits, each of which is larger than a typical PR here. |
For those who are wondering about my delays here, I find it quite time consuming to cover all of the base test cases and their variants: float vs Integer build; host (luac) vs target (lua) firmware; without LFS; with LFS but not used; with LFS used. In my testing, I have come across a subtle architectural issue which relates to my implementation of GC marking, and this really needed reworking before I release this. We made quite a few compromises in getting the 0.9x versions of Lua out within the timescales that zeroday achieved. By now we have the luxury of a robust working 2.1 version. I don't want to compromise this by rushing out an LFS version too soon. |
See #2292 for further discussion. |
Although this issue links to earlier discussions in #1289 and #1661, I see this as a policy issue mainly for the committers, so can you all read this and give your comments so we can move forward on the basis of some form of consensus?
Of the ~45Kb RAM available on the ESP8266, typically half or more of this RAM is Lua compiled code and constant data as opposed to true R/W data. The facility to move Lua binary code into Flash will more than double the effective RAM available to programmers.
Do we add support for running Lua directly out of Flash?
If so do we add it to the current dev branch soon?
Background
A hierarchy of function prototypes and their associated vectors (constants, instructions, meta data for debug) is loaded into RAM when any Lua source or lc file is loaded into memory. Because in the Lua architecture, each Proto hierarchy can be bound to multiple closures (this closure creation is only done by executing the CLOSURE statement at runtime), such hierarchies are intrinsically read-only and therefore in principle ROMable.

The main complication here is that, like all other Lua resources, Proto hierarchies are garbage-collectable (and advanced Lua programmers exploit this collection). So IMO, the difficulties arise when devising the details of how any compiled Lua in ROM interacts nicely and stably with the GC: it's fairly straightforward to implement a scheme which works mostly, but we need one which works all of the time in a well determined manner if we proceed with this.

I haven't worked out a robust way of doing an incremental storage system, as Phil discusses in #128, and IMO this will be hard to realise. What I have worked out how to do is essentially a "freeze into flash, then reboot" approach.
lua or lc files.
node.rebuild_flash() function supplying a list of lua files that you wanted including into the ROM. This rebuild_flash call should be preferably called just after reboot. This call then either rebuilds flash block and reboots the ESP immediately on completion, or leaves the flash block unchanged and reboots with an error status.
require path and so can be executed by a require statement; the loadfile and dofile will also parse rom:module syntax and return or execute a closure accordingly.

Basic approach

rebuild_flash routine unhooks the current ROM table, and does two passes of loading the modules.
Proto hierarchies; (ii) to fill the RAM string table with all the strings needed to store the hierarchies.
Proto hierarchies are now in ROM, these hierarchies can now persist over reboot, and only the closure-based resources will occupy RAM.

This process is simple and robust, but the Lua RTS is built around the assumption that collectable objects don't move their location and that strings are interned. It will be impossible to return control to the invoking Lua after a successful load, and difficult to return control after a failed one, which is why this "reload flash and immediately reboot" option is the most robust.
This system would enable Lua programmers to be able to compile and execute significantly larger Lua programs within the ESP resources.
There are some extra wrinkles for the Lua 5.3 environment but I will park these for now. So comments so far?
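As a small illustration of the Background point that one Proto can be bound to several closures, the following plain-Lua sketch creates two closures from the same function body. The names are illustrative only. The closures are distinct objects with their own upvalue, but `string.dump` (which serialises the prototype, not the upvalue values) produces identical bytes for both, which is the read-only part that could in principle live in ROM.

```lua
-- The inner function body is compiled once into a single Proto.
-- Each call to make_counter() binds that Proto to a fresh closure
-- with its own private upvalue n.
local function make_counter()
  local n = 0
  return function()
    n = n + 1
    return n
  end
end

local a, b = make_counter(), make_counter()
a(); a()

print(a ~= b)       --> true   (two distinct closure objects in RAM)
print(a(), b())     --> 3   1  (separate upvalue state per closure)

-- string.dump serialises the prototype only, so both closures give the
-- same bytes: the compiled code and constants are shared and immutable.
print(string.dump(a) == string.dump(b))
```

Only the small closure objects and their upvalues need to be writable; the shared prototype is the candidate for moving out of RAM.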