Adding an error traceback to panic() #1119
Comments
Yes please. |
More debug info is always good. |
This isn't quite as simple as I first thought. The cb stubs in all of the modules typically follow this type of pattern:
if(nud->cb_disconnect_ref != LUA_NOREF && nud->self_ref != LUA_NOREF) {
  lua_rawgeti(gL, LUA_REGISTRYINDEX, nud->cb_disconnect_ref);
  lua_rawgeti(gL, LUA_REGISTRYINDEX, nud->self_ref);
  lua_call(gL, 1, 0);
}
The issue is that if cb_disconnect throws an error, no error handler has been established. I need to play around to see exactly when an error panics, but I suspect that we will need some form of lua_call variant (and not just the standard xpcall) so that I can do a global search and replace of these cb's in the modules. More work is needed. |
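As a rough Lua-level illustration of what such a call variant would do, the sketch below runs a callback under a traceback handler instead of calling it unprotected. The helper name call_with_traceback is hypothetical and is not part of the firmware; the real change discussed here would be made in the C stubs.

```lua
-- table.unpack exists in Lua 5.3, the global unpack in Lua 5.1
local unpack = table.unpack or unpack

-- Hypothetical helper: call cb(...) with debug.traceback as the error handler,
-- so a failure yields the message plus a stack traceback instead of a panic.
local function call_with_traceback(cb, ...)
  local args, n = { ... }, select("#", ...)
  return xpcall(function() return cb(unpack(args, 1, n)) end, debug.traceback)
end

-- A stand-in for a module callback that throws.
local function on_disconnect(sock)
  error("disconnect failed")
end

local ok, report = call_with_traceback(on_disconnect, {})
if not ok then
  print(report)  -- error message followed by "stack traceback:" lines
end
```

The handler runs before the stack unwinds, which is why the traceback still includes the frames of the failing callback.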
The difference from standard batch Lua is that batch Lua makes a single call into Lua, so an unprotected error results in program termination. A typical NodeMCU application, by contrast, is fragmented into many callback tasks. It would be a really nice development feature to be able to declare a Note that this would be quite a complex change, since you would want to turn off all further callbacks to prevent the application continuing in some indeterminate way. Need to think about whether the advantages of doing this merit the effort involved. |
(Warning: stream of consciousness coming.) To my way of thinking there are two modes of operation of the NodeMCU. The first is during the development phase and the second during the operational (deployed) phase. I think that the requirements are different in these cases.
Development: On an unprotected error, you would like to be dropped into a debugger so that you can figure out what went wrong. This is beyond the capabilities of the platform, so generating a stack traceback is the first step. Having a way to run some Lua code when this happens would be useful. I might (for example) dump out the globals table etc. After this runs, I think that I'm happy to have the platform restart.
Operation: Typically there isn't anything attached to the serial console (or those pins might actually be repurposed). However, I'm interested in how the platform fails and how often. I would report this (maybe using syslog or MQTT) to another server. Given the danger of doing stuff once the panic is triggered, it might make more sense to save the critical data somewhere and report it after the platform restarts. For example, node.bootreason() returns information on a previous failure that can be reported after the restart.
Obviously, any of this functionality should be optional, but we could certainly document some design patterns and examples. As a side note, it would be useful if the Lua code could get access to the build information (e.g. the commit hash of the code, maybe the list of modules). This would allow the reporting to be more detailed. |
Any form of full implementation of the debug module is going to be fraught as a debug session is essentially synchronous to a given processing thread, but threads as such don't exist as a concept within the SDK. If we did have any form of interactive debug, then it would have to be some sort of diagnostic prologue to rebooting the chip because you would also need a global means of blocking any timer, network, GPIO cb's which would still continue to fire. I have been brooding about this one because it would be a really useful feature. Even the ability to establish a default at panic handler would be useful both for dev and prod. Even if just to capture a PS Edit: Sorry but it's difficult to post from a mobile phone when a passenger on a motorway! |
I don't think that doing the debug module is easy (or maybe even possible). I would like the stack traceback, and I would like to be able to register a piece of Lua to be called on a panic. It won't stop the panic, but I can always log some more stuff. Yes, there may be cases when the callback faults as well (e.g. it runs out of memory), but you can't have everything. I'd like the panic message as an argument to this callback. |
Not sure if there is much point. There is only one panic message and its content is a subset of the debug.traceback. As to your wider point, agreed. |
I've been brooding about this one and what the best way to proceed is. Philip and I have also been batting around the issue of debugging in development vs production, and I think that this plays into it. The main complication is that thinking all of this through is involved, so I don't want to rush into it. However, my current thinking is that
This is really just a heads-up for comment and ideas. To give you an idea of how this might work in a production environment, we might do the following in
do
local b=history.buffer(4096, true) -- 4K wrap around buffer and collect times
node.output(function(str) b:add(str) end) -- redirect output to wraparound logger
node.atpanic(function()
print(debug.traceback()) -- do a Lua traceback
b:find((b:find(0)<40) and 0 or -40)
b:dump(22514, '192.16.172') -- write last 40 or available lines to remote logger
end) -- exiting the panic function reboots the ESP
end
Does this sound like a useful addition? Of course, since I've used the object form, there is nothing to stop the application having multiple circular loggers, for example if one is also used for collecting history data. Likewise, there's no |
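To make the wrap-around logger idea concrete, here is a minimal pure-Lua sketch of a line-oriented ring buffer. It is not the proposed history module: new_ring, add and last are hypothetical names, and it counts lines rather than bytes, which is an assumption made to keep the example short.

```lua
-- Hypothetical ring buffer that keeps only the most recent `capacity` lines.
local function new_ring(capacity)
  local ring = { capacity = capacity, count = 0, next = 1, lines = {} }

  -- Store a line, overwriting the oldest one once the buffer is full.
  function ring:add(line)
    self.lines[self.next] = line
    self.next = self.next % self.capacity + 1
    if self.count < self.capacity then self.count = self.count + 1 end
  end

  -- Return up to n of the most recent lines, oldest first.
  function ring:last(n)
    n = math.min(n or self.count, self.count)
    local out = {}
    -- The oldest requested line sits n slots behind the write position.
    local idx = (self.next - 1 - n) % self.capacity + 1
    for i = 1, n do
      out[i] = self.lines[idx]
      idx = idx % self.capacity + 1
    end
    return out
  end

  return ring
end

local log = new_ring(3)
for _, line in ipairs({ "a", "b", "c", "d" }) do log:add(line) end
local recent = log:last(2)   -- the two newest lines
local all = log:last(10)     -- clamped to what is available
```

A panic handler in the style proposed above could then send the result of log:last(40) to a remote logger before the reboot.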
If I may make a suggestion, per discussion in #1085 with @devsaurus: Instead of |
@devyte Good suggestion. Will do :) |
I like this. I do want to be able to do a complete RAM dump to the remote server. However, I don't see how to do this when the callbacks are disabled. Having said that, I don't want perfect to be the enemy of good -- and what you are proposing would be a big step forwards. |
Philip, it's only the Lua callbacks that we'd need to inhibit which is why the dump will have to be written in C. In my first version of history, I limited the record size to 255 bytes but now I've decided that the only limit is that the record size must be less than the buffer size which must be less than 32Kb. BTW. Another one that I've gotten into the habit of doing is this which picks up any registry leaks: for k,v in pairs(debug.getregistry()) do print(k,v) end |
If it's any help to anyone, I have an ESPlorer snippet button that does this:
I use it everyday to detect leaks in the globals and in the registry, and it's just a click of a mouse button. Only issue is when I get an output like this:
I don't understand those number pairs, I'm not using numbers as counters or anything anywhere in my code, so I can't figure out how I could be leaking them... but still, very useful! listG() receives an optional argument. When given (i.e.: load the function manually into mem and call it from cmdline), it will list the contents of the arg t instead of _G.
|
You have to look at the register and unregister API source, and the PiL write-up. The registry is a standard Lua table that is not directly accessible from Lua but that the GC knows about. When a C routine registers a Lua variable, which could be a table or a function for example, it is returned an index to an entry whose value is this variable. The variable is now referenced, so the GC won't collect it as garbage. When it is unregistered, the value of the entry is set to an integer, dereferencing the previous Lua variable. These integer slots are reused (if you look, the integer->integer entries form a linked list starting at zero, and these are the ones that can be reused). You can do BTW: I would just do: function listG(t) for k,v in pairs(t or _G) do print(k,v) end end |
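The free-list behaviour described above can be modelled in a few lines of plain Lua. This is a simplified sketch, not the actual registry code: ref and unref are hypothetical stand-ins for the C reference API, and using 0 as the end-of-list marker is an assumption of the model.

```lua
-- Simplified model of a reference table whose freed slots form a linked list.
-- t[0] holds the head of the free list; each freed slot holds the next free slot.
local function ref(t, value)
  local free = t[0] or 0
  local slot
  if free ~= 0 then
    slot = free          -- reuse the most recently freed slot
    t[0] = t[slot]       -- advance the free-list head
  else
    slot = #t + 1        -- no free slot: append a new one
  end
  t[slot] = value
  return slot
end

local function unref(t, slot)
  t[slot] = t[0] or 0    -- freed slot now stores an integer, not the value
  t[0] = slot            -- and becomes the new head of the free list
end

local reg = {}
local a = ref(reg, "callback A")
local b = ref(reg, "callback B")
unref(reg, a)
-- At this point reg[0] == 1 and reg[1] == 0: integer->integer entries only.
local c = ref(reg, "callback C")   -- reuses the slot released by `a`
```

In this model, integer->integer pairs in a listing are freed slots waiting for reuse rather than leaked values.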
It struck me (in the process of digging in some concrete spurs for fence posts -- it's funny what goes on in the subconscious) that since I am proposing to replace all |
On a purely instinctual level, the |
Johny your suggestion was my original idea, but this approach requires some difficult changes to the Lua core and limits the panic handler to UART output. I switched to my last suggestion because the However, all of this is epilogue diagnostics; exiting the error handler function would trigger a reboot albeit possibly after a few seconds delay to allow the logger to complete any network logging. |
A couple of follow up Qs for feedback on my history module:
I would be interested in views. |
(And two years later I pick this up again, but with a new house built and lived in.) My previous comments about stuff in the registry is really a distraction. We have some facts:
My vote is for a simple 1-1 macro swap. @pjsg, @jmattsson, if this is OK, then I will prepare the PR. |
I'm actually quite happy with the current behavior -- if something fails in the app, then it panics and restarts. This protects you against all sorts of failures. What would you do in the error case (e.g. if a net.socket:onconnected callback throws an error)? |
@pjsg you've got the wrong end of the stick here. I am not suggesting we change the behaviour; I am suggesting that we change the level of diagnostics that we emit. Take this somewhat contrived ESPlorer snippet: loadstring([[
function fred(n,m)
--
print(m, n[m]())
end
]], "fred")()
loadstring([[
function dofred(t)
for k,v in pairs(t) do
fred(v,k)
end
end]], "dofred")() If I call this interactively, say
but if I wrap this with a
whereas I am suggesting that it would be a lot more helpful to the developer if we had something like:
The full stack trace makes it far easier to analyse the error |
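The difference between the two kinds of report can be reproduced in stock Lua. The sketch below is an illustration under assumptions, not the firmware's output: it uses plain local functions in place of the loadstring chunks and compares a bare pcall with an xpcall that installs debug.traceback as the handler.

```lua
-- fred fails when n[m] is not callable; dofred calls it for every table entry.
local function fred(n, m)
  local result = n[m]()   -- not a tail call, so fred stays on the stack
  return result
end

local function dofred(t)
  for k, v in pairs(t) do
    fred(v, k)
  end
end

-- Bare pcall: only the one-line error message survives.
local ok_short, short = pcall(dofred, { a = {} })

-- xpcall with a traceback handler: message plus the chain of calls.
local ok_full, full = xpcall(function() return dofred({ a = {} }) end,
                             debug.traceback)

print(short)
print(full)
```

The second report shows the chain of calls leading to the failure, which the one-line message cannot.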
I now 100% agree -- better diagnostics are always better! |
I can write a macro If we do this, strictly we won't be panicking at all. The handler will return control if the execution was successful, or print the error and reboot the system (actually issue a break 0:0, which does the same thing if the debugger hook isn't installed). My last Q is in the case of (i) above: do we think it better to reboot or to drop into the interactive prompt if init.lua throws an error? |
I still think that this one has legs. A lower priority TODO. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Just out of interest I am doing this as part of the Lua 5.1 to Lua 5.3 alignment so we can let this close. |
As part of my LCD patch, I also reimplemented the Lua error traceback, so that by default an error gives a proper Lua stack traceback, rather than a one-line goodbye. However, the RTS still has a catch-all panic() to trap any unprotected error calls. This puts out a very unhelpful error message:
PANIC: unprotected error in call to Lua API (<Lua routine name>)
It just struck me that it would be better to provide a proper Lua traceback. Is this a good idea? This is a fairly easy change.