-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spontaneous reset around concat and number to string conversions #795
Comments
I'm glad you raised the issue because I've got a similar one and I was dreading having to raise an issue about it. My issue is code which reads onewire temp sensors and POSTs the temperatures. If there's only 1 sensor attached it works fine; two or more sensors will all be reported the first time through the loop, then the device resets. It's not a problem with 2 or more sensors per se - if I disable wifi then the problem goes away; if I stop POSTing the results then the problem goes away; if I send lua commands to the device through the serial interface after booting, the problem goes away or changes. My conclusion so far is this is not a hardware error - the various advice around the net to add capacitors on the power supply doesn't solve it - and what's happening is the lua data structures are somehow being corrupted during execution, and sending commands to the device alters its memory usage to hide (or expose) the issue. Where I see it most clearly is a fragment of code: function conv2()
local pin = 3
startConversion(pin)
tmr.alarm(0, 1000, 0, function() conv1(pin) end)
end Sometimes lua reports an error which indicates that So I don't think your error is anything to do with string concatenation, or your code. Something is corrupting lua's data structures, and that part of code is just where the error manifests. If you change your code the problem might occur somewhere else, or disappear. I'd love to know how to debug or single-step the firmware. |
You can turn on the debug mode compile time option and get extra printing out the serial port. |
What version of the firmware are you using? Also, is there a test suite for the firmware? |
@nickandrew the _debug ON_ checkbox if your using the nodemcu-build.com site by @marcelstoer. If building on your own turn on |
I have built the debug version of the code, but it keeps printing task task_lua started and this prevents any of the uploaders that I have from uploading code to the board (esplorer and another one). Since the debug build is bigger than the original build, some of the original files are not present after the upgrade to the debug version of the build. How do people work around this? |
@pjsg Can you use luatool.py to upload the code? |
@nickandrew Nope -- it reads back the 'task task_lua started' message and throws an error. There is actually a bug in the current version in that the error message that it tries to print references a variable not in scope. Easy enough to fix, but it is still following an error path. |
@pjsg None of this is making debug easy (or, frankly, even possible). We need to be able to single-step, breakpoint and stacktrace an unmodified firmware. I see a lot of lua_assert() and LWIP_ASSERT() in the code. I'll try to enable them without all the additional "task task_lua started" type debugging messages. |
I built a firmware with DEVELOP_VERSION defined, and deleted the
|
success! I added these lines to user_config.h
I get the following:
then, when I trigger my bad request:
this is all on the current dev branch. The lvm.c assert failure is in the concat implementation as I suspected. However, this provides a useful pointer. btw, I also applied the following patches to reduce the amount of clutter:
|
My mapfile shows:
I don't know how to get an asm file to track down the exact offending instruction. |
OK -- I've fixed the bug in the implementation of concat -- pull request to follow. However, the assert failure in lobject.c is a little more confusing -- it may be a false positive. The code looks like:
pvalue is defined as
This assert will naturally fire if t1 is of type TROTABLE or TLIGHTFUNCTION. I don't know if the check_exp is wrong or whether this is another bug. |
The amount of free heap reported by I can't reconcile your I am just wondering if this realloc is failing silently and the RTS isn't picking this up. |
The assert is hidden inside |
@pjsg Philip, I am a little confused by what you've found. lua_assert isn't defined in normal lua and luac builds and therefore defaults to switch (ttype(t1)) {
case LUA_TLIGHTUSERDATA:
return pvalue(t1) == pvalue(t2);
case LUA_TROTABLE:
return rvalue(t1) == rvalue(t2);
case LUA_TLIGHTFUNCTION:
return fvalue(t1) == fvalue(t2); |
@TerryE In one of the comments above, I showed what I put in user_config.h to enable lua_assert -- it is three lines that defines lua_assert as a macro and then the usual override to ((void) 0) doesn't happen any more. I also enabled debug mode, but I'm not sure if that was actually required to make the lua_assert stuff actually work. I only tested on the integer build of the firmware. I have also seen (once or twice) lvm.c:459: L->top == L->ci->top || luaG_checkopenop(i) lvm.c:454: base <= L->top && L->top <= L->stack + L->stacksize This looks as though the main interpreter loop has a case where it fails to reset 'base' after a potential GC. |
nodemcu is a fork of eluas which is itself a fork of Lua.5.1.5. My presumption if I find an issue in the bits of the code base that have been added by this project is that the issue might be our coding bug. However for the Lua 5.1.5 code has been pounded to death and it is extremely unlikely that there is be this sort of bug remaining. My presumption in this case is that the problem is in my under standing or peripheral bits rather than the core implementation. This patch proposes a change to the core code, So -1 unless you provide compelling evidence that this is a genuine bug and that your patch fixes this. Incidentally I always test changes to Lua core functionality on a bog standard Lua 5.1.5 build under Linux on my PC where gdb and the Lua test suite is available to me before porting them to nodemcu. |
@TerryE Actually, this does appear to be nodemcu specific and not in standard 5.1.5. Rationale: The luaV_concat function calls tostring early on in the loop and if it returns non-zero, control passes to the else clause where (in the original nodemcu code) top is used unaltered. This is a problem if there has been a GC (inside tostring). tostring calls luaV_tostring The actual chance of this happening is small, but it was happening reproducibly for me. In the stock lua-5.1.5, the frealloc function is mapped (via l_alloc) onto native c runtime calls If this full gc moves the stack, then the top pointer from many stack frames above is now invalid. My PR fixes that problem by resetting it. Now, this analysis suggests that there may be other lurking problems -- the mainline lua code may assume that certain sorts of memory allocations cannot trigger GCs. This assumption is not true for nodemcu. |
?? base is a pointer into the lua stack. The lua stack is only adjusted in
The preceding is a great piece of analysis, but I feel that this assertion is the flaw in your thesis. The Lua GC never moves memory blocks. It uses a mark and sweep algo only to determine dead Lua objects and then frees the associated memory via What might be more plausible IMO, is that the SDK memory allocator is somehow getting its knickers in a twist. However if your thesis is correct then you should be able put assertion diagnostics in the old code to detect the error when occurring. For example replace your reassignment of You need to compare eLua sources with the EGC patch which is pretty similar to nodemcu. |
Ah - yes. However, following on from my previous message: luaC_fullgc calls singlestep Actually following this gives me a much better idea of when this might happen -- in particular when there is more than one thread around and when the stack in one of those threads can be shrunk significantly. I did have an instrumented version of the firmware that showed that the L->base did indeed change over a call to tostring. This was what pointed me in this direction in the first place. Actually, there are a number of places where a true C pointer into the stack is cached across calls. There is the case of 'top' in luaV_concat across tostring(). Also, in luaV_execute, 'base' is cached across calls. However, in many cases, it is reloaded at the end of the interpreted instruction (normally via the Protect macro). Presumably in the other cases, there is no chance that the GC will be called. |
Yes but the second call is guarded by a Think of Popper: you have a testable hypothesis. If the Anyway, it's nice to find someone else who is willing to dig into these internals! 😃 |
I have a testable hypothesis. I altered the luaV_tostring code as follows:
The change is to add the saving of L->base to oldbase and then the assert afterwards. This assert fires when I run my testcase. I have a simplified version of my testcase:
As I read the code, the call in traversestack is
so the question is whether the stack is fixed at this point. propagatemark doesn't set fixedstack I was thinking about this last night, and I'm coming to agree with you that this may not be the right fix. It certainly fixes the bug, but the right solution may be to do it much deeper in the code -- otherwise we will be playing whack-a-mole for a while. However, the penalty for doing it deeper in the code will be that more calls to GC will not be able to save memory from the stack. |
@pjsg Phillip, thanks for this, but I would like to take 24hrs to brood about this. I would really welcome your contribution to any resulting discussions. |
@pjsg Philip, I can reproduce this as well. Here is my minimal version: -- execute simple recursive function to grow stack and then concat which fails if OP2 numeric
do local function count(x) return x > 0 and count(x-1) end count(64) local x = "a"..1 end The issue is with the line if (!(ttisstring(top-2) || ttisnumber(top-2)) || !tostring(L, top-1)) {
Also note that replacing the concat with I am off visiting relatives for a couple of days but I'll put in a proper fix when I get back. Thanks again. Terry |
Why do we allow developers to visit relatives? :-) |
Thanks muchly... |
A corollary to this is that with the EGC running, any operation which does a malloc or realloc can trigger EGC and render invalid any such pointer. The stack can get trimmed and table hash tables if I recall. But the latter entries probably won't get cached, just the address of the the entry they point to. |
I was thinking of changing (for dev purposes) the GC so that it always trims the stack (unless isfixedstack is set) and see if that exposes anything. I'm not sure of the best way to find these potential issues. I have another assert failure which happens after I run the uploader to upload files and then I switch to esplorer. The first thing that esplorer does is to send '=node.heap()'. Sometimes, this causes an assert which seems to be associated with returned data. in lvm.c lua_assert(L->top == L->ci->top || luaG_checkopenop(i)); |
I found an interesting #define -- try adding
This does much more resizing of stacks. My code that makes sure that every realloc does move the block causes all sorts of problems in conjunction with the above #define. In particular, I get a block too big error when requesting memory. However, if I make the same changes to the 5.1.5 lua interpreter and enable the HARDSTACKTESTS, then I get no assert failures. I'm going to work on some more debug code and verify that it doesn't trigger in lua-5.1.5 and then port it into nodemcu. |
There is a major difference between the standard lua-5.1.5 and the nodemcu version. This is that the lua-5.1.5 memory allocator does not call luaV_fullgc when invoked. Yes, the code is there to call it, but the precondition is not met. If you enable the precondition (as in nodemcu firmware), then it too crashes spectacularly. In luaL_newstate, there is an extra line in the nodemcu firmware:
This passes the lua_state pointer into the lower level allocator -- and this enables the allocator to trigger a full gc. Testing the 5.1.5 interpreter is a much more pleasant experience as you can run it under gdb (or any other favorite debugger). However, there are lots of assert failures. I think that the HARDSTACKTESTS are broken as they keep shrinking the stack even when the upper levels are trying to increase them. It isn't clear to me whether the nodemcu firmware suffers from this particular problem. For lua-5.1.5, for example, in luaD_precall, it ensures that there is LUA_MINSTACK stack space. Then it does some more memory allocations, and then it can fail as the stack no longer has LUA_MINSTACK space. The stock 5.1.5 lua has no mechanism to be able to stop the fullgc from happening during these critical periods, but the nodemcu firmware does. I fear that trying to debug with 5.1.5 is a lost cause.... |
OK, I've had a trawl though all of the local StkID declarations and usage looking for cases where the variables can be invalidated by a realloc side-effect and this |
In the history of eLua, there was a build which ran as a host application, but I gave up on trying to get this to work. |
I'm making progress -- I only get one assert with any regularity
The other issue is that node.compile("somefile.lua") appears to work if it is not run inside a pcall environment. If run inside a pcall, then it completes, but manages to corrupt something and soon thereafter the wdt resets the platform, I ended up changing the way that the GC resized the stacks. In my model, the GC is never allowed to resize the stacks unless specifically allowed to -- i.e. the fixedstack() status is now effectively on all the time except for one place where everything is (I think) safe to move/resize things. Are other people seeing any asserts firing? |
Philip, can you split this into two asserts for testing. I suspect that it's the The fix/unfix stack operation is pretty lightweight. This was introduced as part of the ECG patch. We found a case that the author, Robert Jakabosky, missed. My inclination is to minimum scoped change unless there is a compelling otherwise and I feel that a shift from default "can resize" to "can''t resize" doesn't fit well here and might have other unintended consequences. This being said, I'll upload my PR for this fix today and you will see that I am extending the fixed scope. |
After tearing my hair out some more with regard to the node.compile problem, I reverted to the stock dev firmware. It fails rather worse (or better) depending on your point of view -- it just triggers a reboot almost immediately. If you upload this file to your nodemcu (and call it httpserver.lua) and then run, from the prompt, node.compile("httpserver.lua") -- what happens? |
I have just raised PR #812 to fix this ECG patch error. I when therough all of the cached stack pointers and only found this one case that Jakabosky missed. IMO, this closes the original issue here, but some of the wider points that Philip raises still stand, but probably merit being swept up into another issue. |
This is a minimal build (bit, file, gpio, net, node, rc, rtcfifo, rtcmem, rtctime, spi, tmr, uart, wifi) with #812 applied. |
I cut everything back, and my code is now stable (apart from the solitary assert which I have put into its own ticket). In particular, I have a custom implementation of c_realloc and c_free that performs extra checking and always moves blocks when realloc is called (to make sure that nobody is looking at the old block). Further it sets any block being freed to 0x55555555 so that code should trap if it tries to use a pointer from the freed bock. Even with this, it looks good! |
This one was fixed some time ago. |
I am tearing my hair out -- the following code (when executed) causes the system to restart:
The print of
node.heap()
shows 16816 -- so plenty of space left.This code does not cause a restart:
This file ends up being invoked by nodemcu-httpserver which ends up running this in a coroutine.
I'm running version:
I don't expect anybody to solve my problem, but I would appreciate some hints on how to track this down. Is there a way to build the nodemcu firmware on a linux box so that I could attach a debugger or similar to it?
The text was updated successfully, but these errors were encountered: