-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimising access to ROTables #2504
Comments
Please open a PR; this sounds like good work! I'm more than happy to give the ½kB up, even without the DNS patches (though those might be good, too). |
@TerryE I vote with both hands yes! Will it work on dev-esp32 as well? |
@jmd13391 At the mo we will lose ½Kb RAM, but I can do a tweak to DNS to claw back about 1Kb giving a net saving of ½Kb RAM but that saving is another PR. @LDmitryL, I haven't merged any of the recent changes including LFS into esp-32, but on this note there is also stragic reason for this work: I want to restart my project to integrate the two ESP platforms under a new Lua 5.3 baseline. However I also want to maximise backwards compatability at a Lua application level and this involves retaining all of the nice NodeMCU extensions that we've added over the last four years. A key component is ROM support, both ROTables and LFS. Unfortunately IMO, the original ROTables implementation is a cludge done by the eLua implementation, but at least it works. The Lua VM unifies Table and ROTable handling, but instead of ROTables being a proper Table variant, the code checks to see if the (Table *) is in the ROM address space, and in this case treats it as a ROTable pointer. This isn't going work well in the Lua 5.3 VM, so what we really need is for ROTables to have a proper Table header. This first step to getting there is what I have done here which is to separate the internal ROtable implementation from the rest of the VM and applications code. Having a lookaside cache internal to ROTable lookup effectively eliminates the access overheads for ROtable fetches, which is the main usecase here. However there is a second usecase, which is an existance probe. This can occur with the VM's use of metamethods; for example if a field doesn't exist in a table, then the VM checks to see if there is an It would be nice to have the option to tag those ROTable vectors that are sorted so we could have the option to do a faster binary chop instead. I feel that the key to derisking the move to a unified Lua 5.3 for all ESP architectures is to make this bumpless at an application level. So the tl;dr is: _This will eventually end up in ESP32, but let's take one step at a time. |
I have been reworking this over the last couple of days. I want to split the base library table (functions like This makes the encapsulation cleaner. The downside is that the VM accesses the ROTables like
The second step involves a failed full scan of the base function ROTable which adds about 50 µSec to the access time -- which is still a lot better than we have currently. So why do I want to do this? (i) it makes the implementation a lot cleaner; and (ii) at the next stage, I will have a binary chop search option which will be a lot faster than the serial search and this will lop off about 40 µSec. @jmattsson, @pjsg, @devsaurus are you tracking this? Any feedback / comments would be much appreciated. 😄 There is also nothing stopping you setting local file, wifi, ow = file, wifi, ow and the accesses like What this is all really about is incrementally bringing our current Lua 5.1 plus library modules up to a level where we can seamlessly cut over to Lua 5.3 |
I have been thinking about this one, and 50 µSec is too big a hit to ignore. The main issue that I had here is that I can't copy @jmattsson 's linker magic which marshals the master ROTable table in RAM on host builds without a botch and I don't like botches, so I've added some I've also added a |
👍 I'd like to get everything moved to 5.3 too, 5.1 feels so, well, dated now.
As it happens I've been looking at this too the last couple of days. I've reimplemented rotables support on top of 5.3 as a I made some similar-sounding chops to lrotables.c to get rid of some of the redundant functions although I didn't make any performance changes. I'd be very interested to integrate with your performance changes, the O(N) search behaviour of rotables has always made me a bit sad when I thought too hard about it! I have my changes up and running on an ESP32, with a few minor tweaks across the rest of the code base to remove checks for light functions etc, and a few other bits and bobs. The RAM usage is a bit higher, but as I haven't got to the readonly string optimisations yet I don't know what's an increased overhead of 5.3 vs what's a missing saving from the rom strings optimisation. @TerryE would you be interested in setting up a shared 5.3+rotables2 branch? |
Tom, thanks for your offer, and I would welcome an active partnership with any other experienced Lua VM developer. I feel that your lupi project is good karma for this. I've already done a first iteration over a year ago, but other priorities got in the way and since then, other priorities took over. The ESP architectures are in general a lot more capable than the Arduino / Atmel architectures, but also an equivalent step below the RPi-class ARM SBCs, so a Lua VM-based IoT platform is entirely workable with the ESPs but with some care. Also NodeNCU adds some awkward non-functional requirements to the port / implementation architecture.
By way of an example to this last point, my first version (see my LROR in Lua 5.3 paper) went to some length to include RO TStrings into the compiler components, but I've subsequently put in diagnostic instrumentation and this added a lot of complexity / code changes for little gain as the Lus parser and compiler make almost no use of TString resources. We only need RO TString support within the VM. The ROTable lookaside pretty much eliminates the runtime overheads of ROTable access for access-hit cases, but doesn't help on the cost of a miss, since this still requires a full scan if the ROTable is unordered. Using an explicit ROtable header will have a lot of benefits: we can tag whether the table has a meta entries or not so we short-circuit unneeded scans for So yes I am interested in a collaboration of Lua 5.3, but to be honest only if this a net gain in terms of my effort: we need to keep this compatible with the overall NodeMCU framework. I do have some personal constraints in that my wife and I have build a new house which we moved into at the beginning of the year, Doing this and finishing off some of my home automation details have been taking at lot of my time this last few months. I have some paperwork, and certification approvals that I need to do over the next few weeks. So I am planning to do this Lua 5.3 work over December. |
Gosh, a lot more info than I was expecting! I wasn't aware of the LROR paper; interesting reading, thanks.
I think this is the crux of the problem - we have that huge fork and until that's resolved one way or another it's difficult to move forward with any other significant changes. As I understand it, the ESP32 branch tightens up the restrictions on where Lua code can be called from (to play nice with the FreeRTOS task-based model) which might mean a bunch of the eLua-derived changes to the core Lua are no longer required. This is speculation as I don't know all the liberties ESP8266 code might take in ISRs etc but I fear that an unforked 5.3 core might have problems (or are all the 5.1 eLua hacks just irrelevant for NodeMCU?). Meaning that migration to an ESP32-compliant model might be also be required to work with 5.3? Maybe we should just bite the bullet and make the ESP32 branch use 5.3? :) Although that might just make the fragmentation worse. As a relative newcomer to the project I don't have a solution to how that might be resolved, but it certainly seems like the way forward would involve:
Definitely, one of the reasons for me playing around with a 5.3 port was to pare back the changes and work out for myself what was absolutely necessary and what was a legacy from eLua or things superseded by 5.3 or the improved APIs available in the ESP32 SDK. 5.3 is unfortunately still too RAM-oriented, I do wish the Lua devs would make a concerted effort in that direction in the future (although the light c functions, short string optimisations and variant tags were all definite improvements).
I'll have to dig out my numbers but I don't recall the core 5.3 being appreciably worse RAM wise compared to 5.1 - I moaned a bit when TStrings got larger in 5.2 and the Lua devs shrunk them back down in 5.3. It'll probably depend on how good we can make the 5.3 static tables/strings/code optimisations, but I agree for ESP8266 especially, we mustn't go backwards on RAM footprint. Perhaps this is something that needs the ESP8266/ESP32 schism to be resolved first? Much as I'd like to bang in some new features, perhaps the painful step of reconciling the forks needs to happen first, as part of a project roadmap that includes upgrading both to 5.3 after that?
I have some time in the next week or two but December is likely to be quite busy, alas. |
Not quite. In fact the Lua VM can sit happily within the RTOS, but it executes as a single threaded single process, with the majority of hardware and network related activities execution as non-blocking services with callbacks. Note that this execution model is uncannily similar similar to
In fact the 5.3 has already acquired 3 of these: lightweight C functions, lightweight userdata and the EGC, but implemented in a cleaner native Lua manner. About the only two major functional areas that I would like to add are those which are more Harvard-like: ROTables and LFS. LFS isn't an eLua concept anyway and if you look at the current
Whilst the ESP32 is falling in price and is now readily obtainable, the ESP8266 is still an excellent solution for many applications and is under half the price; it's used in Sonoff and Shelly IoT devices, etc., so I don't see the demand for this SoC going away soon. I therefore don't think it wise to let the fork diverge further.
The As far as 5.3 vs 5.1 goes in terms of RAM use, it is a case of swings and roundabouts: some wins and some losses but overall there is no material difference IMO for this sort of usecase. And because Lua 5.3 already implements Integer and Float as two separate Number subtypes, one thing that I do want to do is to drop the Integer build option all together. If you stick to using integer data types then Lua 5.3 will run faster than the 5.1 Integer build anyway. What we might want to support are 32 and 64 bit Number variants, so developers have the option of trading off runtime data use and data precision.
This is going to be a long haul. The core developers are quite constrained in terms of commitment. I am an OLD fart (I wrote my first programs over 50 years ago and I've seen OSes, architectures and languages come and go). But luckily my brain still works. I see this as a long term commitment. That's not to say that I don't welcome short term intense contributions, but this definitely a case of not letting short term quick wins compromise the overall roadmap. |
Good to know.
Yes, there doesn't seem much benefit to keeping it on 5.3.
Understood. I happen to be working on a project that's required me to make a few changes so I'm raising a bunch of stuff at the moment while it's fresh in my mind - but I'm not in a huge hurry and I don't plan to disappear at the end of it. I too would much prefer a solid long-term roadmap for the project. |
Background for Enhancement
One of the eLua memory optimizations is to move the list of base functions and installed libraries out of
_G
and into ROM-based lists. An__index
function has been added to the_G
metatables, so that any global misses are also resolved against the base function list then the library ROM list.The upside is a RAM saving, but the downside is a material runtime hit when accessing ROtables. The average access time for a variable cached in a
local
is 3 µSec, and for looking up one in_G
is 8 µSec. However resolving ROM entries is a lot slower: 43 µSec in the case of the base functions such asprint()
and 132 µSec for library entries.Incidentally this 132 µSec time is perhaps 5× better earlier builds than pre LFS versions since the string constants were byte-packed at
-Os
andc_strcmp()
function compares the target key against each unaligned ROM table entry, thus constantly triggered unaligned exception handler.@pjsg and I found that switching to
-O2
word-aligned the ROM-based string contents meant that the flash-aware optimised ROM-basedstrcmp()
routine does proper word-aligned data fetches in this case.Even so, the implementation is pretty dumb in that the code always does linear table scan using a pairwise
c_strcmp()
, and this annoyed me, I decided to spend a day optimising this code.First the idea of having different structures -- a ROtable for the list of modules and a different one-off structure for the base functions seemed daft. Why not just combine them into a unified master ROtable?
Unifying these into a single ROtable meant that I could drop the __index function and instead use Lua to do this by assigning
_G.__index = <master_ROtable>
Doing this also made half of the lrotable.c functions redundant and I was left with one indexing by a lua string and one by an integer.
Since all keyed access to ROTables goes through a single access function, it seemed common sense to stick a small (16×4×2word) RAM lookaside cache in front of these accesses. One my benchmarks, this gives better than 99.5% hit rate for keyed ROTable access.
Since all keys are accessed via
TSstrings
and are therefore word aligned, I have changed theLRO_STRKEY(k)
definition to include(STORE_ATTR char *)
which forces word alignment independent of compiler options (just in case someone does a custom build and goes back to-Os
. The key comparison also has an added(key ^ entry) & mask
guard which does a word aligned comparison and only callsc_strcmp()
if the first four chars match. This is code executes fast and is inline. The mask is to handle cases likeow\0
where this is set to0x00ffffff
, so the main limitations are now the linear nature of the search and the slow flash-fault access times.This patch brings the average runtime for these lookups down:
The isn't any difference in the new times because these are in the same table and accessed directly with a 99½% hit-rate. Of course this speed up also applies to reference such as
file.open
.Justification
This represents an approximate 50× speed up in the larger ROTable access times compared to 6 months ago, for the cost of a ½kB RAM cache. Anyway, I've done the work. Is it worth raising as a PR? If the ½kB RAM is an issue, we can save about 1kB RAM from the DNS library with a simple patch.
BTW, another bonus is that you can now directly enumerate the ROM libraries:
Workarounds
The Lua environment runs slower than needed and you need to get up to techniques like
to get efficient access to ROM functions.
The text was updated successfully, but these errors were encountered: