Policy of supporting Lua in ROM #2068

Closed
TerryE opened this issue Aug 7, 2017 · 81 comments

@TerryE
Collaborator

TerryE commented Aug 7, 2017

Although this issue links to earlier discussions in #1289 and #1661, I see this as a policy issue mainly for the committers, so can you all read this and give your comments so we can move forward on the basis of some form of consensus?

Of the ~45Kb RAM available on the ESP8266, typically half or more of this RAM is Lua compiled code and constant data as opposed to true R/W data. The facility to move Lua binary code into Flash will more than double the effective RAM available to programmers.

Do we add support for running Lua directly out of Flash?

If so do we add it to the current dev branch soon?

Background

A hierarchy of function prototypes and their associated vectors (constants, instructions, meta data for debug) is loaded into RAM when any Lua source or lc file is loaded into memory. Because in the Lua architecture each Proto hierarchy can be bound to multiple closures (this closure creation is only done by executing the CLOSURE instruction at runtime), such hierarchies are intrinsically read-only and therefore in principle ROMable.

The main complication here is that, like all other Lua resources, Proto hierarchies are garbage-collectable (and advanced Lua programmers exploit this collection). So IMO, the difficulties arise when devising the details of how any compiled Lua in ROM interacts nicely and stably with the GC: it's fairly straightforward to implement a scheme which mostly works, but we need one which works all of the time in a well-determined manner if we proceed with this.

I haven't worked out a robust way of doing an incremental storage system, as Phil discusses in #128, and IMO this will be hard to realise. What I have worked out how to do is essentially a "freeze into flash, then reboot" approach.

  • Essentially what this does is to maintain a ROM string table and a set of Proto hierarchies in a fixed (64Kb) flash block within the ICACHE_FLASH address space.
  • Any Lua files that you want to move into ROM must be preloaded into SPIFFS as lua or lc files.
  • The flash block can be discarded and rebuilt using a new node.rebuild_flash() function, supplying a list of the Lua files that you want included in the ROM. This rebuild_flash call should preferably be made just after reboot. The call then either rebuilds the flash block and reboots the ESP immediately on completion, or leaves the flash block unchanged and reboots with an error status.
  • After the reboot, the modules in the flash block are in the require path and so can be executed by a require statement; the loadfile and dofile will also parse rom:module syntax and return or execute a closure accordingly.

Basic approach

  • This ROM-based Lua system uses two string tables as discussed in Rework of the lua source hierachy to support a unified apporach to ESP8266 and ESP32 #1661: the standard RAM-based table and a ROM-based one ahead of it in the search order. The rebuild_flash routine unhooks the current ROM table, and does two passes of loading the modules.
  • The first pass is a dummy load. This serves two purposes: (i) to calculate the total storage requirement for the Proto hierarchies; (ii) to fill the RAM string table with all the strings needed to store the hierarchies.
  • The size of the string table and all string resources is then calculated, and if the total size of code and string table fits within the flash block, the strings and string table are copied to flash and the RAM string table is pruned by the GC. The ROM table is rehooked and the RAM table replaced with an empty table.
  • The second pass is a flash buffer load, with all string constants resolved against the ROM table. Since all of the strings needed to load the Proto hierarchies are now in ROM, these hierarchies can now persist over reboot, and only the closure-based resources will occupy RAM.
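
The two-pass scheme in the list above can be modelled off-device. The following is a minimal sketch only, not the proposed implementation: modules are plain tables, the per-string header size `TSTRING_OVERHEAD` is an assumed figure, and the names `pass1` and `rebuild` are hypothetical. It shows the sizing pass deciding whether anything is written at all.

```lua
-- Simulation only: a "module" is a code size plus the strings it refers to.
local FLASH_BLOCK = 64 * 1024      -- fixed flash block size quoted above
local TSTRING_OVERHEAD = 16        -- assumed per-string header size (hypothetical)

-- Pass 1: dummy load. Totals the Proto storage and collects every
-- distinct string that the hierarchies refer to.
local function pass1(modules)
  local code_bytes, strings, string_bytes = 0, {}, 0
  for _, m in ipairs(modules) do
    code_bytes = code_bytes + m.code_bytes
    for _, s in ipairs(m.strings) do
      if not strings[s] then
        strings[s] = true
        string_bytes = string_bytes + TSTRING_OVERHEAD + #s + 1
      end
    end
  end
  return code_bytes, strings, string_bytes
end

-- Pass 2 only runs when everything fits; otherwise nothing is changed.
local function rebuild(modules)
  local code_bytes, strings, string_bytes = pass1(modules)
  local total = code_bytes + string_bytes
  if total > FLASH_BLOCK then
    return nil, "flash block overflow: " .. total .. " bytes"
  end
  local rom = { strings = strings, protos = {} }
  for _, m in ipairs(modules) do
    rom.protos[m.name] = m.code_bytes   -- stands in for the cloned Proto
  end
  return rom, total
end

local rom, total = rebuild{
  { name = "a", code_bytes = 1000, strings = { "print", "x" } },
  { name = "b", code_bytes = 2000, strings = { "print", "y" } },
}
```

Because the shared string "print" is counted once, the string cost is paid per distinct string rather than per module, which is the point of building the string table before cloning any code.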

This process is simple and robust, but the Lua RTS is built around the assumption that collectable objects don't move their location and that strings are interned. It will be impossible to return control to the invoking Lua after a successful load, and difficult to return control after a failed one, which is why this "reload flash and immediately reboot" option is the most robust.

This system would enable Lua programmers to be able to compile and execute significantly larger Lua programs within the ESP resources.

There are some extra wrinkles for the Lua 5.3 environment but I will park these for now. So comments so far?

@pjsg
Member

pjsg commented Aug 7, 2017

With the version that I built, I wasn't convinced that you could entirely rom-ify some functions (without a lot of hacking in the GC). The issue is that some objects need to be followed by the GC process, and so cannot be romified. [My memory is a little hazy, and I don't know if you can actually write such a function. My implementation had all sorts of corner cases where some of the internal vectors could not be rommed]

@TerryE
Collaborator Author

TerryE commented Aug 7, 2017

It's quite straightforward to get the Proto hierarchy depending only on ROM resources, so you can safely short-circuit the GC sweeping here. IIRC, your implementation also pushes closures into ROM, and this is bad news, IMO, because the upvalue chain will invariably point back into RAM.

I'll do the usual trick of adding this code to a vanilla Lua 5.1 version first and hammering it there on the PC before porting it to NodeMCU. It's just a pity that mmap()ing a file into an absolute address window with the MAP_FIXED attribute is such a dog on modern OS kernels which use address space randomisation. But getting around this is only a few dozen lines of code. At least this way you can hammer out the GC interactions using a decent gdb implementation.

But to my core Qs: if we can get a robust approach here, should we push it through dev to master?

@nwf
Member

nwf commented Aug 8, 2017

As someone who often exhausts the heap, I'd be for such a change, assuming it's not crazy to support. The only downside, other than code complexity, perhaps, that leaps to mind is that running from flash means that it's less clear ahead of time when the flash chip will be engaged, so anyone trying to use the flash-associated GPIOs might be in for a surprise.

Not to ask a stupid question, but does anything in lua 5.2 or 5.3 make this easier? (I recall reading somewhere that there was an effort to bring nodemcu over to 5.3; am I just making that up?)

@jmattsson
Member

Getting Lua running out of flash would be a big boon given the RAM constraints.

My Qs/concerns are:

  • Is it forward compatible with the 5.3 work? If not, I think we'd better hold off on this - it'll be a noticeable bump doing that jump without pulling this rug too.
  • Would the "frozen" flash buffer be in a fixed location, or a logically fixed location (i.e. "at the end of .text" or so)? If the former, would we need to do something to deal with uneven wear? I'd very much like to avoid that if possible. Having the "frozen" flash block just hanging off the end of the firmware would move it about depending on which modules were compiled in at least, which is probably Good Enough(tm).
  • Presumably there'd be a function to check whether/what's loaded into "frozen" flash? Main use case here being where you pre-build your file system and want to get some things "moved" to flash on first boot, automatically, without ending up in a reboot loop :)
  • Do you need to say require('rom:mymodule') to look into flash? If not, can we make it so? I could foresee a case where you have a read-only filesystem from which you've "frozen" some code over into flash, and if you then just require('mymodule') you'd need to search the frozen area first, but if we do that then you're likely to end up surprised because you're loading old code even though your .lua (or .lc) has been updated. Of course, it could be argued we're sticking to tradition given we already have the issue with .lc vs .lua...
  • Instead of a rom:module syntax, would it be possible to hook the "freezer" into the VFS read-only? That way you could simply do require('/freezer/module') or such, and would avoid introducing another namespace.
  • Are there any alignment changes needed to Proto in order to be able to read them straight from flash? How much impact would that have in cases where the freezer isn't used?

@TerryE
Collaborator Author

TerryE commented Aug 8, 2017

Sorry for the long reply guys, but I've tried to cover all of Johny's and Nathaniel's points below.

Forward compatibility with the 5.3 work

I will do a separate update on the 5.3 work on #1661, but as to the specifics of this functionality, like Johny I see addressing the RAM constraints as a key criterion for the success of NodeMCU Lua, so my original intent was only to add this functionality to 5.3. What I've raised here is essentially a backport of the technology to 5.1. Unlike the rest of the 5.3 work, which from a user perspective is either out-of-the-box 5.3 functionality or compatibility with the existing NodeMCU/eLua module API, this is new and pretty decoupled from the rest of the 5.3 work.

Moving it into 5.1 allows a more vigorous engagement with the current developer community to get a consensus on how this API should work -- as well as bringing forward the benefits to the community.

Flash buffer location and wear levelling

I see this as less of an issue for a number of reasons. The current Winbond chips such as the W25QxxFV series quote a life of more than 100,000 erase/program cycles, and even if the modules have the earlier generation NAND flash chips with a 10K cycle life, say, the mode that I suggest we use here, which is essentially a reboot-reload-reboot cycle, envisages a use case more similar to the conventional C rebuild and flash life-cycle. Even during active development the module might see 10 reloads a day, and in production maybe 1 a month, so this isn't going to be an issue. It would also be trivial to consider refinements such as the SPIFFS_FIXED_LOCATION type parameter.
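
The endurance argument above is simple arithmetic, sketched here for the quoted figures. It assumes each reload costs one erase/program cycle on the block, which is an assumption rather than a measured figure; `years_of_life` is a hypothetical helper.

```lua
-- Years until the quoted erase/program budget is used up, assuming one
-- erase of the flash block per reload (an assumption, not a measurement).
local function years_of_life(cycles, reloads_per_day)
  return cycles / reloads_per_day / 365
end

local dev_old  = years_of_life(10000, 10)      -- 10K-cycle part, active development
local dev_new  = years_of_life(100000, 10)     -- 100K-cycle part, active development
local prod_old = years_of_life(10000, 1 / 30)  -- 10K-cycle part, about one reload a month
print(string.format("%.1f %.1f %.0f", dev_old, dev_new, prod_old))
```

Even the worst case here (an older part reloaded ten times a day) lasts between two and three years of continuous development, and the production case runs to centuries.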

Search order for loading

The Lua require system uses a set of helpers defined in the package.loaders array. These can even be changed at a Lua programming level. Even so I recommend that the default order should be that the ROM should be searched first, for performance reasons, but that in a development mode the developer would be free to do in init.lua

  local pl=package.loaders; pl[3],pl[2] = pl[2],pl[3]

to reverse this search order. And note that you can only specify the module name as a require parameter; it is the loaders (or searchers in Lua 5.3) that determine where to look.
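
To illustrate why the order of the loaders matters, here is a minimal model of the search that require performs. It deliberately uses a local list rather than the real package.loaders, and `rom_store`, `file_store` and `find` are hypothetical stand-ins; real searchers also return an error string instead of nil on a miss.

```lua
-- Stand-ins for the ROM store and SPIFFS; both are hypothetical tables.
local rom_store  = { mymodule = function() return "rom copy" end }
local file_store = { mymodule = function() return "file copy" end,
                     other    = function() return "file only" end }

-- A searcher returns a loader function, or nil if it has no such module.
local function rom_searcher(name)  return rom_store[name]  end
local function file_searcher(name) return file_store[name] end

-- Minimal model of what require does with its list of searchers:
-- the first searcher that yields a loader wins.
local function find(searchers, name)
  for _, searcher in ipairs(searchers) do
    local loader = searcher(name)
    if loader then return loader() end
  end
  error("module '" .. name .. "' not found", 2)
end

local searchers = { rom_searcher, file_searcher }   -- default: ROM first
local default_hit = find(searchers, "mymodule")

-- Development mode: swap the two entries, as in the one-liner above.
searchers[1], searchers[2] = searchers[2], searchers[1]
local dev_hit = find(searchers, "mymodule")
```

With ROM first, a stale frozen copy shadows an updated file of the same name; after the swap the file copy wins, which is the behaviour a developer iterating on SPIFFS would want.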

The load functions are different in that these don't have a searcher concept and so we need some simple method of encapsulating accessing the ROM store at a Lua API level. Also accessing the ROM store is fundamentally different from any of the load functions (load, loadfile, loadstring, dofile) as these all execute a load operation which is expensive at runtime. The ROM store contains a set of compiled Proto hierarchies in memory, and all that is needed to convert them into a closure (which is a "function" in Lua terms) is to execute the CLOSURE VM instruction, and this needs a few lines of C to be executed as Protos are hidden from the Lua execution world.

This is why I would promote the use of modules rather than functions in the store, as this is more transparent and fits better into the Lua paradigm. Nonetheless, if we want to make a more transparent method of loading functions from the store or VFS, whatever we do is going to be slightly a botch, because internally within the relevant load function this isn't encapsulated within the vfs: you don't actually do a load with a stored routine. It's already loaded, just not bound to a closure.

We have exactly the same issue today with lc vs lua loads. The standard API leaves the handling of this and any precedence issues to the Lua programmer. I would just add rom to the list and still leave it to the programmer. My own standard template is to use an autoloader for my functions to hide all of the error handling and precedence issues. I myself would just extend this with one line:

setmetatable( self, {__index=function(self, func) --upval: loadfile
    func = self.prefix .. func
    local f, msg
    if not skiprom or not skiprom[func] then f = getrom(func) end  -- handle ROM load
    if not f then f,msg = loadfile( func..".lc") end
    if msg then f, msg = loadfile(func..".lua") end
    if msg then error (msg,2) end
    if func:sub(8,8) ~= "_" then self[func] = f end
    return f
  end} )

or actually the ROM optimised version which saves about 300 bytes RAM:

-- skiprom if defined is a global
setmetatable( self, {__index=getrom("autoloader")} )

The getrom function returns nil if the Proto isn't in the store, so this is the cheapest method of checking for existence. However, we could also have a debug.getromprotos in the same way we have debug.getregistry, though this would have to return a list of names since the Proto values are meaningless in Lua.
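
The nil return from getrom is enough to guard a first-boot freeze against a reboot loop. The sketch below runs off-device with stubbed versions of getrom and node.rebuild_flash (both stubs are hypothetical, as is `ensure_frozen`); on the device the rebuild call would reboot rather than return, and a real guard would also need to look at the error status after a failed rebuild.

```lua
-- Hypothetical stubs so that the sketch runs off-device.
local store, rebuilds = {}, 0
local function getrom(name) return store[name] end    -- nil when absent
local node = {
  rebuild_flash = function(files)
    rebuilds = rebuilds + 1
    for _, f in ipairs(files) do
      store[(f:gsub("%.%a+$", ""))] = function() end   -- strip .lua / .lc
    end
  end,
}

local wanted = { "autoloader.lc", "mymodule.lua" }

-- What init.lua might do on every boot: rebuild only if something is missing.
local function ensure_frozen(files)
  for _, f in ipairs(files) do
    local name = f:gsub("%.%a+$", "")
    if getrom(name) == nil then
      node.rebuild_flash(files)   -- on the device this reboots, not returns
      return true
    end
  end
  return false
end

local first  = ensure_frozen(wanted)   -- first boot: the store is empty
local second = ensure_frozen(wanted)   -- later boots: nothing to do
```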

Performance and alignment issues

Yes, access from Flash is roughly 13-25× slower than from RAM in the case of a cache miss, but at the moment executing every Lua VM instruction (4 bytes) involves reading 100s of bytes of Xtensa instructions from Flash to interpret this one instruction. However, flash access is RAM cached and this reduces the overall impact, though accessing scattered flash address regions will increase cache fault rates, so IMO moving code into ROM will slightly decrease instruction execution performance.

But we also have to balance the slight increase of runtime in accessing ROM-based constants and strings against the fact that all of these resources are in ROM and have been removed from the scope of the GC, so GC sweeps will be a lot shorter and required less often. A big runtime saving.

Also, the RAM limitations mean that non-trivial Lua programs involve a lot of dynamic loading of code from SPIFFS which is slow because of the double whammy of SPIFFS overheads and the Lua load process. Converting ROM Protos to encapsulated functions is fast.

So I believe that the average Lua application will run faster overall.

Unaligned (in the Lua RTS nearly all byte) fetches are slow because of the overhead of the unaligned exception handler. -O2 instead of -Os helps for general string access but not for this. However, there are (inline macro assembler) techniques we can use to replace unaligned fetches by a two-instruction aligned fetch and extract. But I see this as a second-order optimisation for later.
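
The "aligned fetch and extract" idea can be shown with plain arithmetic. This is a model only: flash is a Lua array of 32-bit little-endian words and `load_byte` is a hypothetical helper; on the device the same two steps would be an aligned word load followed by a bit-field extract in C or inline assembler.

```lua
-- Model flash as an array of 32-bit little-endian words (word 0 at index 1).
local flash = { 0x64636261, 0x00000065 }   -- the bytes "abcde" then padding

-- A byte load done as an aligned word load followed by an extract.
local function load_byte(addr)
  local word  = flash[math.floor(addr / 4) + 1]   -- aligned 32-bit fetch
  local shift = (addr % 4) * 8                    -- bit position of the byte
  return math.floor(word / 2 ^ shift) % 256       -- shift right, keep 8 bits
end

local chars = {}
for addr = 0, 4 do
  chars[#chars + 1] = string.char(load_byte(addr))
end
local text = table.concat(chars)
```

Every access to `flash` here is word-aligned, so no byte-wide load ever reaches the flash address space and no exception handler is involved.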

@nwf
Member

nwf commented Aug 8, 2017

Thanks much for the very detailed response!

@jmattsson
Member

Nice comprehensive response, cheers. Just a minor comment regarding unaligned stuff; I really did mean unaligned (exception code 9), rather than the sub-32bit-wide loads ("load/store error", code 3). The latter we have our custom exception handler to patch up and recover with. Unaligned 32bit access however would still be fatal. It may not be an issue as you say "nearly all byte", but worth keeping an eye on.

I'm in favour of the approach outlined here. Obviously we'd need to have good docs explaining how to use it, when the time comes :)

Ah, and one more question: a 64k block is obviously larger than the amount of free RAM we currently have, and thus would likely go partially unused no matter how badly one tries to move code to it. Any ideas on how to get the best use out of it? I know you ruled out incremental freezes above...

@TerryE
Collaborator Author

TerryE commented Aug 9, 2017

It's late for me so a quick response. I understand your exception code 9 point and I will check this, but I don't believe that this is an issue.

Re 64Kb, the reason for two passes is to serialise the load process. There are two constraining factors: the size of the string table, and the size of the largest module that you need to load, because each file is loaded into RAM, then cloned into flash. We need to play to see how much of a constraint this is in practice, but whatever it is, it's still a lot better than current constraints.

@dtran123

dtran123 commented Aug 9, 2017

For those concerned with performance, I would argue that one of the key usecases for the ESP8266 (& ESP32) is IoT applications. For the majority of cases that I can think of, fast execution is not critical. So even if this resulted in a slightly slower execution time, I would be ok with it.

I can see great benefit from this feature:

  • additional RAM to be able to properly support secured tcp/http connections for the ESP8266, additional cyphers and larger key sizes. I am less concerned for the ESP32.
  • significantly reduce frustrating heap crashes cases resulting in greater adoption of the Lua Nodemcu platform.
  • simplifying code (many times, ugly lua code has to be used to work around ram limitations)

These are just 3 benefits that completely justify such an initiative.

I would be curious what is the effort estimate for completing this task. Are we talking weeks or months ? (for a developer)

I believe we run the risk of losing many community members for the ESP8266 if we don't solve the RAM shortage which appears to affect proper support of secured connections.

@TerryE
Collaborator Author

TerryE commented Aug 9, 2017

The issue isn't so much absolute hours, but elapsed. The internals of the Lua engine are both subtle and complex, and I (TerryE) seem to have drawn the short straw to get to grips with all this. The issue is that all of this work is unfunded and done in my spare time, threaded amongst my other commitments like finishing off a house that my wife and I are building -- and doing the Home Automation for the same which needs its own ESP code. But as to your core point it's man-days of work (my being a male) rather than person-weeks, as a lot of the foundation work is already done as part of my Lua 5.3 upgrade for NodeMCU.

@dtran123

Would it help you if some of us were helping you with the funding? We could gather a few volunteers to help out with donations. I am willing to help if I know this feature will result in fixing the current problem with secured tcp connections that started with the SDK 2.x on the ESP8266.

@TerryE
Collaborator Author

TerryE commented Aug 10, 2017

@dtran123, nah. don't need the dosh. I spent 35 years in IT and ended up on top of the techie shit-heap. I am now a gentleman living on a sinecure (pension). It's hours in the day and priorities that are my problems 😉 Let me crack on, whilst the brain is still working.

@nwf
Member

nwf commented Aug 10, 2017

@dtran123 Increasing available heap may help with secure SSL, but as I reported in #1707, I think it is already viable to work with (and verify) ECC keys instead of RSA keys. If you control both ends, I think this is the most immediate path forward. You'll have to tweak the mbedtls configuration file as done in nwf@c1ed48c (and likely want to cherry-pick @djphoenix's update to mbedtls first, djphoenix@4958a4a) and/or see if @marcelstoer can add some checkboxes to the web builder to achieve the same effect.

@TerryE Please don't take any of that to mean that I amn't rooting for your success. If not donations, perhaps a beverage of your choosing if we're ever in the same place. :)

@georeb

georeb commented Aug 10, 2017

I want to support this in any way I can.
Donations, testing, beer... please let me know if I can help! 👍

@TerryE
Collaborator Author

TerryE commented Aug 10, 2017

At the moment, I am thinking about work-arounds for some interesting catch-22s thrown up in standard Lua testing. The clone to flash process destructively overwrites the old version of the cloned ROstrt, but this in turn was a clone of an earlier version of the strt, so contains all of the strings like package.loaders keys, and I need these to persist in the same locations, so that the loading code itself doesn't fall over. So I can't quite use a simple serial allocator for flash. It's an issue that I will need to address anyway with 5.3, but this is one to solve during some ZZZZZZZ or over a glass of wine, and not in the editor 😄

@nwf
Member

nwf commented Aug 11, 2017

Would it help to have two segments of the in-flash data? One which was objects whose positions needed to remain invariant across updates, and one which could be overwritten at will at each clone? I presume the former can be relatively small and so loaded into RAM at the start of cloning, and then written back to flash only after the second segment has been constructed and any requisite additions made?

@TerryE
Collaborator Author

TerryE commented Aug 11, 2017

Close. My current approach is to treat the first boot after flashing the Lua firmware as special. This is partly to solve some issues in the NodeMCU 5.3 version where you can declare TStrings at compile time. The RTS performs library initialisation then executes a clone before starting Lua execution. This just clones the base string table. The addresses of this first tranche of TStrings are then preserved across subsequent clones, so the tables which use them are OK.
Time for bed for me, as I am on UTC+0
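
The requirement that the base strings keep their addresses across clones can be pictured with a toy allocator. This is a sketch under simplifying assumptions (no TString headers, no alignment, and `layout` is a hypothetical name): the base tranche is always placed first and in a fixed order, so whatever is cloned afterwards cannot move it.

```lua
-- Lay strings out serially and return a map of string -> offset.
-- Base strings always go first, in a fixed order, so their offsets
-- do not depend on what else is cloned.
local function layout(base, extra)
  local offsets, next_free = {}, 0
  local function place(s)
    if offsets[s] == nil then
      offsets[s] = next_free
      next_free = next_free + #s + 1    -- string bytes plus a terminator
    end
  end
  for _, s in ipairs(base)  do place(s) end
  for _, s in ipairs(extra) do place(s) end
  return offsets, next_free
end

local base = { "loaders", "path", "__index" }
local first  = layout(base, { "init", "wifi" })
local second = layout(base, { "mqtt", "a_much_longer_name", "init" })
```

Here "init" lands at a different offset in the two clones, while every base string stays put, which is the property that tables holding those strings rely on.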

@TerryE
Collaborator Author

TerryE commented Aug 14, 2017

Please see my paper on this approach: LRO Functions in NodeMCU Lua. Sorry, it includes some typos and other errors, but I will fix these if I update it following any review comments.

@nwf
Member

nwf commented Aug 14, 2017

@TerryE This looks really well thought-out. I very much like the flash-block lifecycle tracking trick (1F -> 7 -> 3) and the multi-reboot design seems like it will work well without being too complicated.

ETA: Is there any way we could compute the flash block on the host as part of the image build? Obviously not exclusively, given the intended node.rebuild_flash() API, but in addition, perhaps?

@TerryE
Collaborator Author

TerryE commented Aug 14, 2017

I already have a lot of code that can do this in a 5.3 environment, but then again the standard NodeMCU make generates a luac which runs on the host and does the same as the current luac.cross, except for a -X option which allows you to run NodeMCU Lua on the host. This makes it simpler to implement this. There's no reason in principle why the equivalent shouldn't be done for 5.1, but let's get the on-chip version working and released first.

@TerryE
Collaborator Author

TerryE commented Aug 15, 2017

I have been thinking about this cross build issue. It would in principle be a straightforward variation. The way to do this would be to extend luac with an option -F <stringlist> where the string list would just be a text file list of strings to be included in the ROstrt, so

  luac.cross -O flash.bin -F default_strings.txt MyProj/*.lua

might build a flash image for downloading based on the Lua files in MyProj. A couple of wrinkles here: (i) luac.cross uses host-native pointers for its in-memory data references which are typically 64bit these days, not 32, so some resizing would need to be done; (ii) I'd need a relocatable format for these flash binaries. These are both trivial to address, but I don't want to get sidetracked on this just yet.

@jmattsson
Member

Just throwing another thought into the pot here.

For the modules with a mere 512k flash, would it be feasible to exclude the "freezer" support? Would it make sense to have the interface sitting in e.g. a freezer module with load(), isempty() and clear() functions, and then either #ifdef out or use a zero-size page/area for the storage if the module is not enabled? Just thinking that 64k might prevent people from using the old ESP01 modules with modern NodeMCU otherwise.
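
A rough model of that suggested interface, for discussion only: `new_freezer` is a hypothetical factory, the sizes are made up, and a capacity of zero stands in for a build with the storage area compiled out. Nothing here reflects an agreed API.

```lua
-- Hypothetical factory: capacity 0 models a build without the freezer area.
local function new_freezer(capacity)
  local used, modules = 0, {}
  local freezer = {}

  function freezer.isempty()
    return next(modules) == nil
  end

  function freezer.clear()
    used, modules = 0, {}
  end

  -- Returns true on success, or nil plus a message when there is no room.
  function freezer.load(name, size)
    if used + size > capacity then
      return nil, "freezer full or disabled"
    end
    used = used + size
    modules[name] = size
    return true
  end

  return freezer
end

local enabled  = new_freezer(64 * 1024)
local disabled = new_freezer(0)
local ok       = enabled.load("mymodule", 1200)
local no, why  = disabled.load("mymodule", 1200)
```

With a zero-size area the same Lua code runs on both builds and simply gets a nil back, so applications would not need their own compile-time switch.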

@TerryE
Collaborator Author

TerryE commented Aug 16, 2017

@jmattsson, I'd already decided to do that (and also make this option disabled by default in early releases, at least) for two reasons: first, so that those who don't want it don't have the flash overhead, and second just in case we find issues in early testing. If we can conditionally remove the code then we can safely release it into dev.

A second point: what do we formally call this? Philip first proposed the idea and called it "freezer". Do we use that, or do we use "flash"?

@TerryE
Collaborator Author

TerryE commented Aug 28, 2017

I've got the vanilla PC 5.1.5 version working fine now. This does a mmap() of a pseudo flash area into the VM, and then uses mprotect() to turn off write access whilst the VM is running.

  • I had a bit of fun with the GC as this still tries to mark fixed resources during GC sweeps even if it isn't doing this in string sweeps. This relates to its algo for weak tables, especially kv mode tables that are typically used for memoized functions and ephemeron tables (see PiL 17.2 and 17.3), but I can't see where strings in Flash being fixed would break the application behaviour. More to the point, I very much doubt that any IoT Lua application would fall foul of this, and having double the RAM available would help sweeten the pill if it does.

  • The other area is in the cascade clean-up of Proto hierarchies when a closure is GCed, but here the Proto fixing works as intended.

The PC-based version has to support PIC Flash because of Linux address randomisation, and if we start looking at @nwf Nathaniel's suggestion of Host buildable images then we might want to do the same for the NodeMCU version. However, I suggest that we keep the NodeMCU version as simple as possible in its first iteration. I am not going to include byte-access optimisation in this first version so it will hit the unaligned exception handler, but adding this as a second pass is pretty straightforward.

One thing that did strike me is that as soon as the VM starts running, the minimal string table is around 10Kb. Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app even if it hasn't frozen any code into the flash.

@jmattsson
Member

Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app

👍

@TerryE
Collaborator Author

TerryE commented Aug 30, 2017

I've just been doing an L8UI code review. (This is the instruction that triggers 99+% of the unaligned fetch from flash exceptions.) It isn't too bad at all: the 'hot' modules, lobject.c, lstring.c and ltable.c, have only 8, 7 and 11 respectively, and a couple of simple macro changes will avoid the material ones here, and in one case (luaO_log2()) a reasonable chunk of code can be replaced by a single asm instruction.

lgc.c does a lot of marking and sweeping so accesses here should really avoid byte-based bit diddling in flag byte fields. For example, the compiler generates a 3-instruction sequence to test a single bit: load a byte; shift the bit left into the sign bit; and branch if negative -- and this generates an unaligned exception. But the equivalent (load 32 bits; shift the bit left into the sign bit; and branch if negative) also takes 3 instructions, executes as fast and doesn't generate the exception.

lstrlib.c has a lot of character based manipulation, especially in the pattern-based searching and matching so ROM-based patterns will be bad news, but it is pretty straightforward to define a macro to clone the string into an alloca()ed copy (if the string is in ROM and less than some safe limit long) and this would be a single line addition per parameter to such hot routines.

But we can do this sort of optimisation once we've got the basic code working.

@TerryE
Collaborator Author

TerryE commented Sep 12, 2017

Testing this lot is a total bitch. If you use the gdbstub then you can't use uart0 for Lua input or output. So you have to hook up a second USB serial chip to the UART1 and get debug logging that way. I've got two methods of loading code: a RAD cycle based on spiffsimg'ing a small 32Kb FS with various test stubs, and potentially a telnet stub, but you've got to get your execution past the basic bootstrapping processes. I am still fighting EGC issues which are subtly different to the standard Lua, and I am juggling this with all of my other time pressures.

At least the PC version works OK.

It doesn't help that the gdbstub is very fragile: if you can get to a breakpoint then you can examine RAM, but flash-based exceptions just seem to bypass the GDB exception handler entirely and panic the CPU, so there's no opportunity for PM diagnosis. 😞

Any pearls of wisdom or even sarcasm welcomed 😄

@nwf
Member

nwf commented Sep 12, 2017

@TerryE I am afraid I have no wisdom to add, and sarcasm seems like it won't help much. I don't suppose the ESP8266 believes in JTAG?

@jmattsson
Member

Sure @TerryE , how's "if it was easy then some other idiot would already have done it"? ;)

Is the gdb stub being bypassed because we've already hooked the flash exception? I think Philip changed it so the handlers would chain for anything we didn't handle though, so I might be way off.

@TerryE
Collaborator Author

TerryE commented Sep 12, 2017

@nwf As far as I know JTAG is a BITE interfacing technology. I've got more than that level of access and diagnostics; as Johny says: if it was easy ... We're working way up the stack here. Philip has already done some extremely valuable ground-breaking to help. It's a balancing act: I accept what we've got for now and work around it, or get sidetracked in improving and integrating the built-in test. What is clear is that we should do a major rewrite of the Extension Developer FAQ to include stuff like using the gdb stub, logging to UART1, using the mapfile, ...

Yes, Johny, perhaps we need to make the flash exception handler gdb aware in the die branch. I will think about this.

As it stands I have Lua 5.3 working as a NodeMCU host build and ditto the flash variant of std Lua 5.1.5, but bootstrapping this into the ESP8266 just takes time and perseverance. It's just that my other commitments mean that the elapsed time is more than I'd prefer. Luckily I am an old enough fart that I've done quite a bit of this low level hacking professionally back in the day, so it's just a matter of dusting off the cobwebs.

@georeb

georeb commented Nov 4, 2017

So, so great to hear that this is progressing!! Thank you very much @TerryE for your continued support!

We all completely understand Terry that you are under no obligation to meet deadlines or schedules and that this is a 'done in your spare time' kinda thing, however, for those of us that are eagerly awaiting a dev version of this, do you have a rough estimate of when you think you'll be able to release something? Not looking for a commitment in any shape or form, just a realistic prediction of when we are likely to be able to start using NodeMCU again! :)

I understand you "don't need the dosh" but if a donation would sweeten the deal, then please let us all know. Unfortunately money is all I can offer, I wish it was technical support, but this is all a little beyond me!

Hope it's going well!

@TerryE
Collaborator Author

TerryE commented Nov 5, 2017

No money. Just priorities, I'm sorry to say. I am up to my eyeballs in Lua and Node Red commissioning my home automation system for my new house. I'll take a break soon and spend a half a day getting to the bottom of this GC issue.

As soon as I have a stable build, I will push a commit to my github fork.

@georeb

georeb commented Dec 17, 2017

Hi @TerryE - Any update on this please?

I am desperate to get my hands on a version of NodeMCU that allows me to comfortably connect securely and also have enough heap for other stuff too!

Again, I know you have no obligation here, but do you have a rough idea of when you'll be able to find some time to complete this?

I am currently looking to invest in a developer to get a usable version up and running and wanted to see what stage you were at first, before I engage them...?

Many thanks :)

@TerryE
Collaborator Author

TerryE commented Dec 17, 2017

Hi, @georeb. I've just moved into the house that I've been building for the last few years and am typing this in my office on the first floor (in England the ground floor = zeroth). I hope to work on this over the holiday break and get a version out for evaluation. As to your investing in a developer to do this, my advice is: don't bother. This is complex stuff because you've got standard Lua, the eLua hacks, and all of the ESP issues interplaying. The learning curve is huge.

@georeb

georeb commented Jan 10, 2018

Any progress over the Christmas break @TerryE ?

I understand that it'll be a learning curve employing a developer, but I don't have much choice. You are unfortunately the bottleneck and as I cannot interest you in payment, I have to pay someone that will. As always, I understand that you have no obligation; however I (and others) have been waiting 5 months for this now and I have to do something before it all gets superseded by something else :/

If others want to chip in to help out with developer costs, please get in touch.

@TerryE
Collaborator Author

TerryE commented Jan 12, 2018

@georeb, we sold our old house and moved into the one we built on the 19th Dec and I've just been getting the HA system to the point where it will run the house's heating and environmental controls. After working 7 days a week on the new build, we've got to the point where my wife and I both have time available for our interests.

By all means employ a programmer to do this work, but don't underestimate the learning curve. You will probably waste your money as I will beat her or him to this deliverable.

@georeb

georeb commented Jan 18, 2018

Congrats on moving into your new build! :) How far off the deliverable would you say you are @TerryE ?

@georeb

georeb commented Feb 1, 2018

Any update please? @TerryE

@TerryE
Collaborator Author

TerryE commented Feb 1, 2018

Yup. Long story, short. I ran into a bit of a show stopper that has forced me to change my implementation strategy. The problem wasn't that the approach doesn't work but more of a scaling issue because of how the GC (and the EGC modifications) interact with the build process. The EGC includes extra GC pause / restart directives around some operations and nested pause / restarts aren't honoured, so these would always restart the GC. The Lua GC will aggressively scan all collectables and mark any that aren't in Lua scope for GC.
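
The nesting problem can be illustrated with a small host-side model. This is only a sketch: the `gc` tables and their `pause` / `restart` functions are hypothetical stand-ins, not the EGC interface. A single flag loses the outer pause as soon as an inner restart runs, whereas a depth counter would not. The counter is shown purely for contrast and is not a change proposed in this thread.

```lua
-- Hypothetical model of GC pause/restart; not the real EGC interface.
local function new_flag_gc()
  local gc = { paused = false }
  function gc.pause()   gc.paused = true  end
  function gc.restart() gc.paused = false end
  function gc.running() return not gc.paused end
  return gc
end

local function new_counting_gc()
  local gc = { depth = 0 }
  function gc.pause()   gc.depth = gc.depth + 1 end
  function gc.restart() gc.depth = math.max(gc.depth - 1, 0) end
  function gc.running() return gc.depth == 0 end
  return gc
end

-- An outer operation pauses the GC, then calls something that
-- pauses and restarts it again before the outer work is done.
local function gc_runs_inside_outer_pause(gc)
  gc.pause()                     -- outer pause (e.g. around the image build)
  gc.pause()                     -- inner pause around some operation
  gc.restart()                   -- inner restart
  local running = gc.running()   -- is the GC live while the outer work continues?
  gc.restart()                   -- outer restart
  return running
end

print(gc_runs_inside_outer_pause(new_flag_gc()))      --> true  (restarted too early)
print(gc_runs_inside_outer_pause(new_counting_gc()))  --> false
```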

What this means is that the build process doesn't scale robustly, and above a certain size of flash image the GC could come in and collect elements that I was assembling for the flash. Getting around this by referencing them was creating extra overheads which hit the scaling issue even more. Either that or start making fundamental changes to the GC, which I just am not willing to do.

So this issue was about robustly building a flash image on device, rather than executing it on the node once built.

My alternative approach is to move the flash image building into the luac.cross build, so that doing a cross luac on a collection of files with the correct switch builds a PIC flash image, which you can copy into the SPIFFS on the target and execute a single API call to reload the LFS with this image and restart the node.

  • We need position independence to avoid the PITA of having to specify the image size from luac
  • An upside is that being able to work within a large host RAM environment removes all of the shoe-horning that we need to do when doing an on-node build.
  • It also means that we can easily support 128Kb LFS regions -- so long as the app can work within a 48Kb RAM for its data.
  • The build process is simple and fast.
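
One way to read the position-independence bullet: if the image stores its internal references as offsets from its own start, the loader only has to add whatever base address the LFS region ends up at, so the host build does not need to know that address in advance. The following is a toy model in plain Lua. The field names and addresses are made up and do not reflect the real image format.

```lua
-- Toy model: an "image" whose internal references are offsets from its start.
local image = {
  { name = "init",   proto_offset = 0x0040 },
  { name = "module", proto_offset = 0x0190 },
}

-- Relocate every offset against the base address chosen at load time.
local function relocate(img, base)
  local out = {}
  for i, entry in ipairs(img) do
    out[i] = { name = entry.name, proto_addr = base + entry.proto_offset }
  end
  return out
end

-- The same image can be loaded at two different (made-up) base addresses.
local a = relocate(image, 0x100000)
local b = relocate(image, 0x180000)
print(string.format("%s @ 0x%X / 0x%X", a[2].name, a[2].proto_addr, b[2].proto_addr))
--> module @ 0x100190 / 0x180190
```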

The only complication is that the host environment must be a little endian architecture such as Intel or ARM, but the code has to cope with both 32-bit and 64-bit host environments.

I am junking the current eLua-based cross-lua.lua approach, and the standard make now builds luac.cross as well (as I do with 5.3) so long as the host includes the standard build-essential toolchain.

The luac side is working. I am in the middle of stripping out the rebuild stuff from the ESP end and adding the small PIC loader. Another few days of dev work.

@georeb

georeb commented Feb 2, 2018

You're right, sounds extremely complicated @TerryE !!
So, this is sort of good news then? It's hard to tell, not being technical!

Are we close?! :)

@TerryE
Collaborator Author

TerryE commented Feb 11, 2018

OK, it looks as if I have ironed out most of the issues and can put together an evaluation PR. I just need to check that my build without all of the debug hooks works as anticipated. We will clearly need a tweak of the API stuff and I still have some bits to add. But the highlights so far are:

  • The standard make also builds luac.cross
  • luac.cross can build a PIC flash image from a file list of lua files.
  • The LFS can be up to 256Kb in size.
  • The standard make for tools also processes a local/lua directory, placing the luac.cross generated flash image in local/fs and this is then included in the SPIFFS image
  • The SPIFFS image make now parses user_config.h and honours the SPIFFS_MAX_FILESYSTEM_SIZE and SPIFFS_FIXED_LOCATION defines. (I have these at 32Kb and 1Mb resp for my testing.)
  • The API call node.flash.reload(filename) will reload the LFS and reboot the ESP.
  • The API call node.flash.index() returns a list of top level function names in the cache.
  • The API call node.flash.index(functionName) returns the corresponding top level function.
  • I have added a debug.getstrings() which can return a list of strings in either the ROM or RAM table.

There's still a TODO list, for example:

  • I still need to add an __index metafield so that node.flash.someFunction(args) has the expected result.
  • I still need to add some inline macros to optimise some of the unaligned exception counts.
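
The first TODO item might look roughly like the following. This is a host-side sketch only: `flash` is a hypothetical stand-in table with a stubbed `index` function, since the real `node.flash` table lives in the firmware and how its metatable will actually be wired up is not stated here.

```lua
-- Stand-in for the LFS: maps top-level function names to functions.
local lfs_functions = {
  hello = function(who) return "hello " .. who end,
}

-- Hypothetical stand-in for node.flash with an index(name) lookup.
local flash = {
  index = function(name) return lfs_functions[name] end,
}

-- Fall back to index(name) for any key that is not already in the table.
setmetatable(flash, {
  __index = function(t, key)
    return rawget(t, "index")(key)
  end,
})

print(flash.hello("world"))   --> hello world
print(flash.missing)          --> nil
```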

I've just been playing with a test LFS which has 7 function files loaded, has 135 string constants in the ROM table, 22 are in the RAM string table and there is over 39Kb heap still available for the App, so this is all looking promising.

I've also fixed a bug in the remote debugger and become adept at using this. I've also added some gdb macros which will help library developers examine the Lua stack, and I need to write all of this up in the developer guide sometime.

@pjsg
Member

pjsg commented Feb 11, 2018

How does the following get built?

local M = {}
M.add1 = function(x) return x + 1 end
return M

I'm hoping that it will be possible to write a wrapper for require that searches the file system first, and if not found, uses the version in the node.flash. (The rationale for that order is that it allows easy development by only having to upload the one file that you want to change)

@TerryE
Collaborator Author

TerryE commented Feb 11, 2018

I'm hoping that it will be possible to write a wrapper for require

Philip, there's no need. Read up on package.loaders. The require loader passes the module name to each handler in turn, and the handler then either

  • loads it and returns a Lua function variable (a TValue containing a LClosure in API terms) or
  • returns an error string, in which case the loader tries the next handler.

The package.loaders table is in RAM so the application can reorder the handlers or replace/add one (search for lua_CFunction loaders in app/lua/loadlib.c). We only use the second, loader_Lua, in NodeMCU, so you can replace any of the other 3 with your own Lua function:

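local index = node.flash.index
local function loader_flash(module)
  local r = index(module) -- index is the closure's single upvalue
  return type(r) == 'function' and r -- false otherwise, so require tries the next handler
end
-- slot 2 is loader_Lua (the file system), so take one of the unused slots
if index then package.loaders[3] = loader_flash end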

If you have some init module in flash then you can stick this fragment in it, then the only RAM overhead is the loader_flash LClosure with its one upval.
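
To try the handler protocol on a desktop Lua without an ESP8266, a stubbed version can be used. In this sketch `fake_lfs` and the module name `lfs_demo` are invented for illustration, and `package.searchers` is used as a fallback because Lua 5.2 and later renamed `package.loaders`.

```lua
-- Invented stand-in for the flash index: module name -> loader function.
local fake_lfs = {
  lfs_demo = function()
    local M = {}
    M.add1 = function(x) return x + 1 end
    return M
  end,
}

-- Handler following the protocol above: return a function on success,
-- or an error string so that require moves on to the next handler.
local function loader_flash(module)
  local f = fake_lfs[module]
  if type(f) == "function" then return f end
  return "\n\tno module '" .. module .. "' in fake LFS"
end

-- Lua 5.1 calls the list package.loaders; 5.2+ calls it package.searchers.
local handlers = package.loaders or package.searchers
-- Appending keeps the file system handlers ahead of the flash one.
handlers[#handlers + 1] = loader_flash

local m = require("lfs_demo")
print(m.add1(41))   --> 42
```

Because the stub is appended rather than placed in an earlier slot, a file of the same name on the search path would still win, which is the development-friendly order asked for above.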

As far as how it gets built, you can either just stick the modules in fs/lua and do a make, or you can do your own process. I am going to update my own provisioning system to be LFS aware, so this will all be seamless for me.

@TerryE
Collaborator Author

TerryE commented Feb 11, 2018

Another trick is that I include a dummy module preload which is just a single lua line:

-- preload a bunch of strings into the ROstrt and avoid the RAM overhead.
-- use debug.getstrings('RAM') to work out which you might want to add 
-- for your application
local preload = "?.lc;?.lua", "@init.lua" -- , ... extend as you need

or add more preload = .... if you have lots of strings that you want to preload into ROM. This creates a dummy module with just a load of LOADK instructions and a constant list of all of these strings, which luac.cross will then preload in the ROstrt, so you won't chew up your RAMstrt and have all of the associated GC overhead. You never need to call this; just including it in the compile is enough.
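
Generating that line from a list of strings could be scripted along these lines. The input list is hard-coded for illustration; on a node it might come from debug.getstrings('RAM'), though the exact return format of that call is an assumption here.

```lua
-- Build a 'local preload = ...' source line from a list of strings.
local function make_preload_line(strings)
  local quoted = {}
  for i, s in ipairs(strings) do
    quoted[i] = string.format("%q", s)   -- %q emits a valid Lua string literal
  end
  return "local preload = " .. table.concat(quoted, ", ")
end

-- Hard-coded sample; substitute the strings your application reports.
print(make_preload_line({ "?.lc;?.lua", "@init.lua" }))
--> local preload = "?.lc;?.lua", "@init.lua"
```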

OK you are wasting n × (TValue + Instruction) in the LFS to do this, but with up to 256Kb available and it never being called, do you care?

I was thinking about reverse engineering the compiler to preload all of the common strings used during compilation to drop the compilation overhead.

@pjsg
Member

pjsg commented Feb 12, 2018

@TerryE Makes sense. Looking forward to seeing this in action!

@TerryE
Collaborator Author

TerryE commented Feb 12, 2018

Incidentally one of the best tricks to do with the debugger is to add a macro for lua_assert which does a debugger break and then enable this for your test code. The Lua API macros use lua_assert a lot to do validation so this will pick up a lot of consistency errors. You can also make heavy use of lua_assert in your own code. If not enabled then this all gets optimised away / removed by the GCC code generator at -O2. The real PITA with using the debugger is that you lose the ability to input strings through the UART input, so you need to use a telnet stub for interactive testing.

I am thinking of having a variant assert stub which puts out a warning message to come out of your UART terminal session and start xtensa-lx106-elf-gdb before itself starting the GDB remote stub then issuing a break so that the host and target can rendezvous in a debug session, and this way you get the best of both interactive and debug use.

@georeb

georeb commented Feb 12, 2018

This is GREAT news @TerryE !! :) Thank you.
Is the plan to release a DEV version that will eventually be merged with the MASTER branch?

@TerryE
Collaborator Author

TerryE commented Feb 12, 2018

The Alpha version will stay in my fork until at least one other committer has checked it out. Then it will be pulled into dev. It will go into master on the following release cycle, but with the LUA_FLASH_STORE define in user_config.h commented out so that builds won't have LFS enabled by default. However individual developers will be able to enable it for their builds. We might subsequently switch it on by default, but that will be up to a consensus of the committers, not just me.

@georeb

georeb commented Feb 12, 2018

Excellent!

Will the version in your fork be an adapted version of NodeMCU MASTER branch?
Sorry for the, perhaps, obvious questions.

@TerryE
Collaborator Author

TerryE commented Feb 12, 2018

The way that the release cycle works is that we commit to dev, then batches of commits to dev once stable are then committed to Master. The only path to updating master is to move dev patches into it. So I am not sure what you mean by your repeated Q. There should be a master version with LFS support in the next 2-3 months, but the delay is only because of the dev to master promotion cycle.

About half the community use dev builds to take advantage of the latest bug fixes etc. The delay ensures that we have a reasonable chance to give good usage coverage to any changes before moving them into master.

@georeb

georeb commented Feb 12, 2018

Will the version in your fork be an adapted version of NodeMCU MASTER branch?

What I meant was, will your version be a standard copy of the current MASTER with the addition of LFS?
I hope this is a little clearer?

@nwf
Member

nwf commented Feb 12, 2018

@georeb Unlikely; it's more likely to be a fork of dev, rather than master, since that's the target for merge.

@georeb

georeb commented Feb 12, 2018

Okay, understood. Thanks

@TerryE
Collaborator Author

TerryE commented Feb 15, 2018

I have just updated my Lua Flash Store (LFS) whitepaper so it now reflects the current LFS implementation. Anyone interested in this, please reread carefully. The LFS patch is so large that I have also had to split it into 5 commits, each of which is larger than a typical PR here.

@TerryE
Collaborator Author

TerryE commented Feb 22, 2018

For those who are wondering about my delays here, I find it quite time consuming to cover all of the base test cases and their variants: float vs integer build; host (luac) vs target (lua) firmware; without LFS; with LFS but none used; with LFS used. In my testing, I have come across a subtle architectural issue which relates to my implementation of GC marking, and this really needed reworking before I release this.
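
Enumerating the cases listed above gives a feel for the size of the test matrix. The sketch simply takes the cross product of the three dimensions named in the comment, assuming every combination gets tested, which the comment does not actually state. The labels are paraphrased.

```lua
local number_builds = { "float", "integer" }
local firmware      = { "host (luac)", "target (lua)" }
local lfs_modes     = { "without LFS", "with LFS but none used", "with LFS used" }

-- Cross product of the three dimensions.
local cases = {}
for _, n in ipairs(number_builds) do
  for _, f in ipairs(firmware) do
    for _, m in ipairs(lfs_modes) do
      cases[#cases + 1] = n .. " / " .. f .. " / " .. m
    end
  end
end

print(#cases)   --> 12
```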

We made quite a few compromises in getting the 0.9x versions of Lua out within the timescales that zeroday achieved. By now we have the luxury of a robust working 2.1 version. I don't want to compromise this by rushing out an LFS version too soon.

@TerryE
Collaborator Author

TerryE commented Mar 8, 2018

See #2292 for further discussion.

@TerryE TerryE closed this as completed Mar 8, 2018