Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimising access to ROTables #2504

Closed
TerryE opened this issue Oct 3, 2018 · 12 comments
Closed

Optimising access to ROTables #2504

TerryE opened this issue Oct 3, 2018 · 12 comments

Comments

@TerryE
Copy link
Collaborator

TerryE commented Oct 3, 2018

Background for Enhancement

One of the eLua memory optimizations is to move the list of base functions and installed libraries out of _G and into ROM-based lists. An __index function has been added to the _G metatables, so that any global misses are also resolved against the base function list then the library ROM list.

The upside is a RAM saving, but the downside is a material runtime hit when accessing ROtables. The average access time for a variable cached in a local is 3 µSec, and for looking up one in _G is 8 µSec. However resolving ROM entries is a lot slower: 43 µSec in the case of the base functions such as print() and 132 µSec for library entries.

Incidentally this 132 µSec time is perhaps 5× better earlier builds than pre LFS versions since the string constants were byte-packed at -Os and c_strcmp()function compares the target key against each unaligned ROM table entry, thus constantly triggered unaligned exception handler.

@pjsg and I found that switching to -O2 word-aligned the ROM-based string contents meant that the flash-aware optimised ROM-based strcmp() routine does proper word-aligned data fetches in this case.

Even so, the implementation is pretty dumb in that the code always does linear table scan using a pairwise c_strcmp(), and this annoyed me, I decided to spend a day optimising this code.

  • First the idea of having different structures -- a ROtable for the list of modules and a different one-off structure for the base functions seemed daft. Why not just combine them into a unified master ROtable?

  • Unifying these into a single ROtable meant that I could drop the __index function and instead use Lua to do this by assigning _G.__index = <master_ROtable>

  • Doing this also made half of the lrotable.c functions redundant and I was left with one indexing by a lua string and one by an integer.

  • Since all keyed access to ROTables goes through a single access function, it seemed common sense to stick a small (16×4×2word) RAM lookaside cache in front of these accesses. One my benchmarks, this gives better than 99.5% hit rate for keyed ROTable access.

  • Since all keys are accessed via TSstrings and are therefore word aligned, I have changed the LRO_STRKEY(k) definition to include (STORE_ATTR char *)which forces word alignment independent of compiler options (just in case someone does a custom build and goes back to -Os. The key comparison also has an added (key ^ entry) & mask guard which does a word aligned comparison and only calls c_strcmp() if the first four chars match. This is code executes fast and is inline. The mask is to handle cases like ow\0 where this is set to 0x00ffffff, so the main limitations are now the linear nature of the search and the slow flash-fault access times.

This patch brings the average runtime for these lookups down:

  • 12 µSec down from 43 µSec for the base functions
  • 12 µSec down from 132 µSec for the library tables.

The isn't any difference in the new times because these are in the same table and accessed directly with a 99½% hit-rate. Of course this speed up also applies to reference such as file.open.

Justification

This represents an approximate 50× speed up in the larger ROTable access times compared to 6 months ago, for the cost of a ½kB RAM cache. Anyway, I've done the work. Is it worth raising as a PR? If the ½kB RAM is an issue, we can save about 1kB RAM from the DNS library with a simple patch.

BTW, another bonus is that you can now directly enumerate the ROM libraries:

for l,v in pairs(__index) do
  if type(v) == 'romtable' then
    print(l,v)
  end
end

Workarounds

The Lua environment runs slower than needed and you need to get up to techniques like

  local ow,file=ow,file

to get efficient access to ROM functions.

@nwf
Copy link
Member

nwf commented Oct 3, 2018

Please open a PR; this sounds like good work! I'm more than happy to give the ½kB up, even without the DNS patches (though those might be good, too).

@jmd13391
Copy link

jmd13391 commented Oct 4, 2018

I second @nwf

If I am reading @TerryE correctly, PR'ing both ROTables and DNS would give us much improved performance and +½kB RAM... Seems like a no-brainer to me.

@LDmitryL
Copy link

LDmitryL commented Oct 4, 2018

@TerryE I vote with both hands yes! Will it work on dev-esp32 as well?

@TerryE
Copy link
Collaborator Author

TerryE commented Oct 4, 2018

@jmd13391 At the mo we will lose ½Kb RAM, but I can do a tweak to DNS to claw back about 1Kb giving a net saving of ½Kb RAM but that saving is another PR.

@LDmitryL, I haven't merged any of the recent changes including LFS into esp-32, but on this note there is also stragic reason for this work: I want to restart my project to integrate the two ESP platforms under a new Lua 5.3 baseline. However I also want to maximise backwards compatability at a Lua application level and this involves retaining all of the nice NodeMCU extensions that we've added over the last four years. A key component is ROM support, both ROTables and LFS.

Unfortunately IMO, the original ROTables implementation is a cludge done by the eLua implementation, but at least it works. The Lua VM unifies Table and ROTable handling, but instead of ROTables being a proper Table variant, the code checks to see if the (Table *) is in the ROM address space, and in this case treats it as a ROTable pointer. This isn't going work well in the Lua 5.3 VM, so what we really need is for ROTables to have a proper Table header. This first step to getting there is what I have done here which is to separate the internal ROtable implementation from the rest of the VM and applications code.

Having a lookaside cache internal to ROTable lookup effectively eliminates the access overheads for ROtable fetches, which is the main usecase here. However there is a second usecase, which is an existance probe. This can occur with the VM's use of metamethods; for example if a field doesn't exist in a table, then the VM checks to see if there is an __index metamethod to resolve it. We use this feature in _G to magically add the ROM functions and module ROTables. This type of access can and will often yield a miss, and this currently requires a full scan of the ROTable vector.

It would be nice to have the option to tag those ROTable vectors that are sorted so we could have the option to do a faster binary chop instead.

I feel that the key to derisking the move to a unified Lua 5.3 for all ESP architectures is to make this bumpless at an application level. So the tl;dr is:

_This will eventually end up in ESP32, but let's take one step at a time.

@TerryE
Copy link
Collaborator Author

TerryE commented Oct 9, 2018

I have been reworking this over the last couple of days. I want to split the base library table (functions like print() and collectgarbage()) from the library modules table (ROTables like ow and file). This is easy to do using standard Lua: you set up the base functions ROTable as the __index pointer from _G and its __index pointer as the master ROTables pointer. The Lua VM handles all of the magic.

This makes the encapsulation cleaner. The downside is that the VM accesses the ROTables like file by

  • First trying _G (a miss)
  • The base function ROTable (a miss)
  • The master ROTables table (a hit)

The second step involves a failed full scan of the base function ROTable which adds about 50 µSec to the access time -- which is still a lot better than we have currently.

So why do I want to do this? (i) it makes the implementation a lot cleaner; and (ii) at the next stage, I will have a binary chop search option which will be a lot faster than the serial search and this will lop off about 40 µSec.

@jmattsson, @pjsg, @devsaurus are you tracking this? Any feedback / comments would be much appreciated. 😄

There is also nothing stopping you setting local ROM = getmetatable(__index).__index and then ROM.file etc is fast but as always you should always cash library references in locals, e.g.

local file, wifi, ow = file, wifi, ow

and the accesses like file.open() will run as fast as shit off a shovel.

What this is all really about is incrementally bringing our current Lua 5.1 plus library modules up to a level where we can seamlessly cut over to Lua 5.3

@TerryE
Copy link
Collaborator Author

TerryE commented Oct 12, 2018

The second step involves a failed full scan of the base function ROTable which adds about 50 µSec to the access time -- which is still a lot better than we have currently.

I have been thinking about this one, and 50 µSec is too big a hit to ignore. The main issue that I had here is that I can't copy @jmattsson 's linker magic which marshals the master ROTable table in RAM on host builds without a botch and I don't like botches, so I've added some #ifdef logic to use a single table on the target builds and and two tables on luac builds -- best of both worlds.

I've also added a ROM entry referring to itself in the master ROTables table so you can do a pairs(ROM) for loop to work out which libraries are loaded in the firmware. I'll do this push later today.

@tomsci
Copy link

tomsci commented Nov 10, 2018

I want to restart my project to integrate the two ESP platforms under a new Lua 5.3 baseline. However I also want to maximise backwards compatability at a Lua application level

👍 I'd like to get everything moved to 5.3 too, 5.1 feels so, well, dated now.

Unfortunately IMO, the original ROTables implementation is a cludge done by the eLua implementation, but at least it works. The Lua VM unifies Table and ROTable handling, but instead of ROTables being a proper Table variant, the code checks to see if the (Table *) is in the ROM address space, and in this case treats it as a ROTable pointer. This isn't going work well in the Lua 5.3 VM, so what we really need is for ROTables to have a proper Table header. This first step to getting there is what I have done here which is to separate the internal ROtable implementation from the rest of the VM and applications code.

As it happens I've been looking at this too the last couple of days. I've reimplemented rotables support on top of 5.3 as a LUA_TTABLE variant, although without using a Table struct as it's not really required to use the exact same type (cf closures and light c functions both being LUA_TFUNCTION but using completely different TValue representations). 5.3's variant types makes it much nicer than having to do it on 5.1. I didn't like the ROM address check either so I got rid of that and declared a MetatablePointer type in the places that used a raw Table * (which turned out to be always for metatables) and used the bottom bit as a flag for which type it was. Still a bit nasty but much more portable than the luaR_isrotable() fn.

I made some similar-sounding chops to lrotables.c to get rid of some of the redundant functions although I didn't make any performance changes. I'd be very interested to integrate with your performance changes, the O(N) search behaviour of rotables has always made me a bit sad when I thought too hard about it!

I have my changes up and running on an ESP32, with a few minor tweaks across the rest of the code base to remove checks for light functions etc, and a few other bits and bobs. The RAM usage is a bit higher, but as I haven't got to the readonly string optimisations yet I don't know what's an increased overhead of 5.3 vs what's a missing saving from the rom strings optimisation.

@TerryE would you be interested in setting up a shared 5.3+rotables2 branch?

@TerryE
Copy link
Collaborator Author

TerryE commented Nov 10, 2018

Tom, thanks for your offer, and I would welcome an active partnership with any other experienced Lua VM developer. I feel that your lupi project is good karma for this.

I've already done a first iteration over a year ago, but other priorities got in the way and since then, other priorities took over. The ESP architectures are in general a lot more capable than the Arduino / Atmel architectures, but also an equivalent step below the RPi-class ARM SBCs, so a Lua VM-based IoT platform is entirely workable with the ESPs but with some care. Also NodeNCU adds some awkward non-functional requirements to the port / implementation architecture.

  • RAM is a scarce resource, especially on the ESP8266s, and is typically the first constraint to scaling of embedded Lua IoT applications. So some form of modified Harvard support with the Lua VM is essential -- ROtables and my LFS patch pretty much do everything that we need here (though its a pity that we haven't managed to offer an ability to compile LFS regions on device yet).

  • The project is currently fractured by the ESP32 / ESP8266 fork, and I feel that there is consensus that this was a tactical success but strategic mistake. We don't have the resources to maintain diverged versions in the long term so any Lua 5.3 implementation should again unify the Lua and other sub-system components.

  • Performance is import. So our goal should be ensure that the new version is at least as RAM and runtime efficient as the current 5.1 versions. If either of these is worse then we create a barrier for active developers to move to 5.3

  • Maintaining stability. Up to a couple of years ago, the firmware suffered various gremlins that cause stability problems. We have pretty much eliminated these and we now get very few genuine bugreps relating to the core VM and runtimes. We need to keep this level of stability with the 5.3 versions.

  • Module support. WE have a large range of device-based modules. We ned a pretty seamless migration path from the 5.1 to the 5.3 VM environments for these.

  • Unnecessary changes to the Lua core code. The original developers of NodeMCU did a fantastic job getting Lua to run at all on the ESP8266, but in order to do this they did a lot of "hack" changes to the code that with a bit of planning should have been unnecessary. (e.g. the ESP8266 toolchain uses c_strlen(), etc., instead of the POSIX standard naming. This was resolved by global search and replace, rather then using make and define magic to keep the source unchanged. IMO, we should absolutely seek to keep changes to the Lua core to a minimum needed.

By way of an example to this last point, my first version (see my LROR in Lua 5.3 paper) went to some length to include RO TStrings into the compiler components, but I've subsequently put in diagnostic instrumentation and this added a lot of complexity / code changes for little gain as the Lus parser and compiler make almost no use of TString resources. We only need RO TString support within the VM.

The ROTable lookaside pretty much eliminates the runtime overheads of ROTable access for access-hit cases, but doesn't help on the cost of a miss, since this still requires a full scan if the ROTable is unordered. Using an explicit ROtable header will have a lot of benefits: we can tag whether the table has a meta entries or not so we short-circuit unneeded scans for __metatable, __index, etc.; we can tag ordered ROtables and use a binary chop access for these. We can drop the address range botch to determine whether a reference is a ROTable or a normal RW table.


So yes I am interested in a collaboration of Lua 5.3, but to be honest only if this a net gain in terms of my effort: we need to keep this compatible with the overall NodeMCU framework. I do have some personal constraints in that my wife and I have build a new house which we moved into at the beginning of the year, Doing this and finishing off some of my home automation details have been taking at lot of my time this last few months. I have some paperwork, and certification approvals that I need to do over the next few weeks. So I am planning to do this Lua 5.3 work over December.

@tomsci
Copy link

tomsci commented Nov 10, 2018

Gosh, a lot more info than I was expecting! I wasn't aware of the LROR paper; interesting reading, thanks.

The project is currently fractured by the ESP32 / ESP8266 fork, [...] any Lua 5.3 implementation should again unify the Lua and other sub-system components.

I think this is the crux of the problem - we have that huge fork and until that's resolved one way or another it's difficult to move forward with any other significant changes. As I understand it, the ESP32 branch tightens up the restrictions on where Lua code can be called from (to play nice with the FreeRTOS task-based model) which might mean a bunch of the eLua-derived changes to the core Lua are no longer required. This is speculation as I don't know all the liberties ESP8266 code might take in ISRs etc but I fear that an unforked 5.3 core might have problems (or are all the 5.1 eLua hacks just irrelevant for NodeMCU?). Meaning that migration to an ESP32-compliant model might be also be required to work with 5.3?

Maybe we should just bite the bullet and make the ESP32 branch use 5.3? :) Although that might just make the fragmentation worse. As a relative newcomer to the project I don't have a solution to how that might be resolved, but it certainly seems like the way forward would involve:

  • All the structural refactors from dev-esp32 to be taken back to master/dev and for that to be the definitive structure in the future.
  • Modules which take liberties not supported on the ESP32 to be modified to work on both.
  • Wherever possible, hacks like the lack of proper POSIX support in the non-OS SDK should be back-filled for ESP8266 rather than worked around, so we can have one reasonably sane C API going forward, including the VFS stuff - we should just make some conditionally-compiled wrappers so we can use the real io library and be done with it (making a compatibility file module, of course). I found some hacky #defines was enough to make at least 5.3's lauxlib.c work with the VFS API with no modifications to the code itself, so I don't think it's insurmountable.

we should absolutely seek to keep changes to the Lua core to a minimum needed

Definitely, one of the reasons for me playing around with a 5.3 port was to pare back the changes and work out for myself what was absolutely necessary and what was a legacy from eLua or things superseded by 5.3 or the improved APIs available in the ESP32 SDK. 5.3 is unfortunately still too RAM-oriented, I do wish the Lua devs would make a concerted effort in that direction in the future (although the light c functions, short string optimisations and variant tags were all definite improvements).

new version is at least as RAM and runtime efficient as the current 5.1 versions. If either of these is worse then we create a barrier for active developers to move to 5.3

I'll have to dig out my numbers but I don't recall the core 5.3 being appreciably worse RAM wise compared to 5.1 - I moaned a bit when TStrings got larger in 5.2 and the Lua devs shrunk them back down in 5.3. It'll probably depend on how good we can make the 5.3 static tables/strings/code optimisations, but I agree for ESP8266 especially, we mustn't go backwards on RAM footprint.

Perhaps this is something that needs the ESP8266/ESP32 schism to be resolved first? Much as I'd like to bang in some new features, perhaps the painful step of reconciling the forks needs to happen first, as part of a project roadmap that includes upgrading both to 5.3 after that?

So I am planning to do this Lua 5.3 work over December.

I have some time in the next week or two but December is likely to be quite busy, alas.

@TerryE
Copy link
Collaborator Author

TerryE commented Nov 10, 2018

As I understand it, the ESP32 branch tightens up the restrictions on where Lua code can be called from (to play nice with the FreeRTOS task-based model)

Not quite. In fact the Lua VM can sit happily within the RTOS, but it executes as a single threaded single process, with the majority of hardware and network related activities execution as non-blocking services with callbacks. Note that this execution model is uncannily similar similar to node.js which has excellent performance and scaling characteristics. So yes there will be some tidying to do to converge the ESP32 and ESP8266 variants, but I don't envisage any major architectural hurdles.

the 5.1 eLua hacks

In fact the 5.3 has already acquired 3 of these: lightweight C functions, lightweight userdata and the EGC, but implemented in a cleaner native Lua manner. About the only two major functional areas that I would like to add are those which are more Harvard-like: ROTables and LFS. LFS isn't an eLua concept anyway and if you look at the current dev ROTables implementation you will see that I've pretty much reworked this, so I see no eLa code being carried across into the 5.3 implementation.

Maybe we should just bite the bullet and make the ESP32 branch use 5.3?

Whilst the ESP32 is falling in price and is now readily obtainable, the ESP8266 is still an excellent solution for many applications and is under half the price; it's used in Sonoff and Shelly IoT devices, etc., so I don't see the demand for this SoC going away soon. I therefore don't think it wise to let the fork diverge further.

proper POSIX support

The dev and master build already include a POSIX build option -- luac.cross builds in a POSIX environment and unlike standard luac this includes a functional VM (see its -e option). OK, the implementation is a little cludgy, but I am fairly relaxed about this so long and the same Lua code-base compiles for both POSIX and SDK environments. (BTW, a number of people have suggested that we just migrate our 8266 variant to its RTOS version, but that major gotcha here is RAM: RTOS takes up too much RAM resources to offer a viable 8266 NodeMCU runtime environment.)

As far as 5.3 vs 5.1 goes in terms of RAM use, it is a case of swings and roundabouts: some wins and some losses but overall there is no material difference IMO for this sort of usecase. And because Lua 5.3 already implements Integer and Float as two separate Number subtypes, one thing that I do want to do is to drop the Integer build option all together. If you stick to using integer data types then Lua 5.3 will run faster than the 5.1 Integer build anyway. What we might want to support are 32 and 64 bit Number variants, so developers have the option of trading off runtime data use and data precision.

I have some time in the next week or two but December is likely to be quite busy, alas.

This is going to be a long haul. The core developers are quite constrained in terms of commitment. I am an OLD fart (I wrote my first programs over 50 years ago and I've seen OSes, architectures and languages come and go). But luckily my brain still works. I see this as a long term commitment. That's not to say that I don't welcome short term intense contributions, but this definitely a case of not letting short term quick wins compromise the overall roadmap.

@tomsci
Copy link

tomsci commented Nov 12, 2018

I don't envisage any major architectural hurdles

Good to know.

drop the Integer build option all together

Yes, there doesn't seem much benefit to keeping it on 5.3.

This is going to be a long haul. The core developers are quite constrained in terms of commitment.

Understood. I happen to be working on a project that's required me to make a few changes so I'm raising a bunch of stuff at the moment while it's fresh in my mind - but I'm not in a huge hurry and I don't plan to disappear at the end of it. I too would much prefer a solid long-term roadmap for the project.

@TerryE
Copy link
Collaborator Author

TerryE commented May 1, 2019

@tomsci we should continue this on #2726 if you are interested as this is a more specific implementation of the new points that you have raise 😄

@TerryE TerryE closed this as completed May 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants