Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lua 5.3 Compatible ROTables #2726

Closed
TerryE opened this issue Apr 14, 2019 · 10 comments
Closed

Lua 5.3 Compatible ROTables #2726

TerryE opened this issue Apr 14, 2019 · 10 comments
Assignees
Labels

Comments

@TerryE
Copy link
Collaborator

TerryE commented Apr 14, 2019

New feature

The current NodeMCU modules use a ROTable implementation taken directly from the eLua implementation. This is a design compromise that does allow a form of RO Table to be implemented in a Flash memory compatible data structure, but in a way that gives poor runtime performance and uses an address-range test to determine whether a table pointer points to a Table record in RAM or a vector of ROtable entries in Flash.

The serial search algorithm for ROTables gives poor access performance especially for table-miss accesses.

It is going to be very difficult to implement the same approach for Lua 5.3 which has already established an implementation architecture of using record variants for such subtypes.

Justification

So it makes sense to move to a Lua 5.3 compatible ROTable implementation as this will give performance benefits and this is the main aspect where our current Lua C API used in our modules will need reworking to be compatible with Lua 5.3. We can therefore prepare all of modules as a separate PR step to the Lua 5.3 migration itself. There are also performance benefits in doing this now.

Scope and impacts

In our app/modules we have the following numbers of ROTables in our C files: 45 × 1, 15 × 2, 4 × 3, 3 × 4, 1 × 5, 1 × 6. (e.g one module net.c has 6 ROTable declarations.). We have another 6 in app/lua and 1 in app/pm These all need updating to the new format.

The way to do this is to use a special to purpose Python conversion script that will automate the source conversion. So for example this will convert:

static const LUA_REG_TYPE wifi_ap_dhcp_map[] = {
  { LSTRKEY( "config" ),  LFUNCVAL( wifi_ap_dhcp_config ) },
  { LSTRKEY( "start" ),   LFUNCVAL( wifi_ap_dhcp_start ) },
  { LSTRKEY( "stop" ),    LFUNCVAL( wifi_ap_dhcp_stop ) },
  { LNILKEY, LNILVAL }
};

into the Lua 5.3 format:

LROT_BEGIN(wifi_ap_dhcp)
  LROT_FUNCENTRY(config, wifi_ap_dhcp_config)
  LROT_FUNCENTRY(start,  wifi_ap_dhcp_start)
  LROT_FUNCENTRY(stop,   wifi_ap_dhcp_stop)
LROT_END(wifi_ap_dhcp, NULL, 0)

Implementation notes and wrinkles:

  • The table entries will always be in ascending order to allow the access routine to do a binary chop access.
  • One of the main performance issues is that the Lua VM needs to check any table access miss for the presence of a meta table and meta methods such as __index this involves a table scan which can fail and as a result is currently a full table scan. With this PR ROTables are now implemented as a variant Table record pointing to an ordered luaR_entry vector. The Table record includes a pointer its metatable and if this is blank then no metatable search is required.
  • The aim with the batch conversion is that the old to new syntax will result in a total clean compile or not. So whilst the size of the PR is large, the actual risks are largely fixed.
  • We probably will need a temporary separate dev-LROTABLE branch to trial this one -- its not one that I would want to push into dev without multiple testers checking it out.
@TerryE
Copy link
Collaborator Author

TerryE commented Apr 24, 2019

In essence this change involves changing every ROTable declaration in our source base, because at the moment the ROTable is just a luaR_entry vector, but the new encapsulation uses a RO Table record pointing at the luaR_entry vector. There is a compelling runtime reason to do this.

Whilst PR #2505 introduced a cached lookaside that removed the need for the VM do a linear search of the ROTable in Flash for over 95% of table hits, a miss still requires a full table scan. In the case of metamethods like __index or __metatable or __gc, the VM has to do a full scan of the raw table first before searching for the metamethod and using it. All very flash access intensive, and this was the underlying mechanisms for #1590. The Table header includes a direct pointer to the metatable, so the runtime can immediately determine whether this exists and there is a flag bye which again tells it whether the __index attribute exists or not with a simple bit test.

Doing this bulk conversion is something that I want to automate as much as practical and we need to have confidence that we haven't screwed up anything in the process. So we will do this change in two steps, plus an optional one:

  1. We will do the bulk conversion of the table declarations in the modules, that is the LSTRKEY / LFUNCVAL declarations to the equivalent LROT_FUNCENTRY ones, etc. However, in this step the LROT macros simply act as wrappers to convert the static initialisers to the same content as the old macros did. In this way the new code still generates the same binary in the firmware, so execution paths are unchanged.
  2. In the second phase the table handling is updated so that both the RW and RO Tables use a standard Table header and the low level access routines are modified to reflect this. The LROT macros are now updated to generate the new data structures so the moving to the new execution variant only involve recompile and does not involve changes to the library source.
  3. The third optional step is to resort the table entries into ascending numeric order and tag the table header accordingly. The current search it linear, so if we access the table net, say the runtime has to scan the table of 26 base functions and get a miss before trying a library ROTable scan. Switching to a sorted table means that we can do a binary chop which takes 5 probes for a 26 entry table.

I have done the bulk of the edits as a WIP. There are a few funnies such as the UCG declarations which require handcrafted edits.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 24, 2019

I suspect that @jmattsson is the only other committer tracking this but just another update dump. The issue of using the eLua { LSTRKEY( "key" ), LFUNCVAL( some_func ) }approach is that the implementation is exposed at a source level and this is also not very compatible with how Lua does thing in general and how the 5.3 engine works more specifically. If I want to make any change to this then I have an average of about 50 lines to modify in some 80 files. Not good. using the LROT macros such as LROT_FUNCENTRY(key, func) entirely hides the implementation at source level, so now this block a ~4K lines spread across these files is invariant and all I need to change and test are a bunch of macros and some low level routines.

The eLua build supported 9 various configuration variants through two macros LUA_OPTIMIZE_MEMORY and MIN_OPT_LEVEL, though for at least the last 4 years our code is designed to work with only one of these. Even the host luac.cross build uses this combination, so I have decided to fix this and strip these macros and delete the dead code.

One piece of smarts that Jonny introduced was the compile / link time method of declaring libraries and Rotables. For historic (eLua) reasons these use different declaration mechanisms. I have changed this so that everything now uses a global ROM ROTable. I've got the code morphed and compiling clean. Just need to fill in some of the low level changes and do the testing now.

@jmattsson
Copy link
Member

This sounds really good!

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 25, 2019

Yup, I've got both the host and target compiling clean, and I need to start thorough testing on Mon (doing a family visit this long w/e). The idea is that we can switch in the Lua 5.3 engine with little or no further changes to the modules.

One simplification that I have made is that we only use char[] keyed ROTables, so there is no point is supporting numeric and string keys, since this doubles the word probes into Flash that you need to do to find a given key. This is a trivial change to lrotables.h plus ~ 10 lines change in lrotables.c

@jmattsson
Copy link
Member

I like the incremental ease-in of this. Should make the eventual 5.3 switch much less risky.

@TerryE
Copy link
Collaborator Author

TerryE commented Apr 29, 2019

@jmattsson, I've just added a PR. Have a scan and give me your thoughts. Same goes for anyone else interested.

I have given this a good exercise both with luac.cross -e <scriptname> and a firmware build with an LFS including the FTP server. It still feels odd having over 57Kb free RAM even when the FTP server is servicing an FTP client.

@TerryE
Copy link
Collaborator Author

TerryE commented May 1, 2019

@jmattsson Johnny has done a code review, and very much appreciated. In terms of number of lines of changed, this was a biggy, but most of these were batch global edits which for the current macros didn't change the actual compiled code in the modules.

There were some peephole optimisations to the ROTable scan algo which halved the number of word reads from flash per entry in a ROTable scan.

Johny has picked up some points but none are material, especially given the size of the change. I have a SDK 3.0 PR (#2732) that I want to merge first. I will then rebaseline this against the current dev and include the points that Johny picked up. What I would like to do is to get this one merged ASAC, because this makes source changes to every module and will therefore block most other PRs.

As I said in the intro post, once we have stabilised this change in dev, the next step is to implement the macro changes to add Table headers for each ROTable. And following that I can redo the Lua 5.3 port.

@TerryE TerryE self-assigned this May 1, 2019
@TerryE
Copy link
Collaborator Author

TerryE commented May 30, 2019

@jmattsson I've continued to look at the performance tuning here. What I find is quite counter intuitive. The linear search based on the1st word comparison is usually faster than the binary chop. The reason if that the linear search is a tight loop doing a work fetch every 4th word in the ROTable to do the scan, and since this is an equality match, little-endian vs big-endian issues are irrelevant. This runs fast.

The binary search must do a ROM strcmp() per test and this has a far higher unit cost. So the fastest option seems to be binary chop until the search window is ~10 entries or less and then do a linear scan on this -- not that we have many ROtables with more than 10 entries.

The lookaside cache is fast and effective but the major performance hit is actually on cache misses such as if the VM checks for a __gc method, so I found that a simple short circuit really helps here.

The bottom line is that ROTable access is now about 10× faster than the 1.5.x versions.

I'll push an update soon, and I'll use the same routines for Lua 5,3

@TerryE
Copy link
Collaborator Author

TerryE commented Jun 4, 2019

I've done another review cycle because my Lua 5.3 port is in progress. There's another tranche of minor updates here because some of the fields which are function no-ops in Lua5.1 are incorrect for Lua5.3.

One minor issue is that perhaps we should rename LROT_NUMENTRY() as nominally takes Lua_Number value but all uses (except for the singular case of math.PI) take an integer value. Perhaps we should change this to LROT_INTENTRY() as Lua5.3 has a separate numeric type LROT_FLOATENTRY(). This generates the same code as LROT_NUMENTRY() for Lua51 Float builds, and throws a compile error for Lua51 Integer builds.

See my Lua 5.3 port to NodeMCU Firmware for more details.

@TerryE TerryE added the Lua 5.3 label Jun 15, 2019
@TerryE
Copy link
Collaborator Author

TerryE commented Oct 28, 2019

This now implemented in our PR #2912, so I am closing this issue.

@TerryE TerryE closed this as completed Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants