Lua reentrancy, tasks, callbacks and interrupt callbacks. #851
I agree with the (obvious) conclusion - ISRs should not directly be calling into Lua. That's pretty much driver writing 101 - never do more than absolutely necessary inside the ISR.
I'm reading these related issues with interest, but I don't know what's going on well enough to contribute, sorry. In my issue #802 I expect something is getting corrupted, but I'm not sure why you mentioned "OW interrupts" in your initial comment. My Lua code uses 2 timers: one every 30 seconds to initiate each complete pass, and a 1-second one-shot to provide the necessary 1 second delay while the attached sensors convert their temperature. There's also net.tcp activity to POST results, so all up there'll be 2 timer callbacks and whatever net callbacks are required (on connection). No GPIO triggers.
Nick, you might be correct. I need to look at the OW code and see in detail how it does its bit diddling of the I/O pins. This may be triggering gpio triggers and causing your problems as a consequence. One simple way that might check this is to comment out the If YES / YES then it is highly likely that it is an interrupt related issue. Oh, the wonders of a binary chop!
The five ISR routines declared in the code are:
Of these, the pwm, rtc and spi ISRs stay out of Lua land. The uart ISR does implement a task post correctly (Thanks Johny), and the key.c file is not used in the firmware. So currently GPIO is the only ISR that is not implemented correctly. However, the SDK advises that implementers adopt the same post-task approach for some espconn functions, so I will sweep these up at the same time.
What about network callbacks, e.g.
Are these guaranteed to be called only when no Lua task is running?
These are SDK tasks within the above definition so will only be delivered when no Lua task is running. The issue is with calling sk:on('sent', function(sk) sk:close() end). Also, the connect function returns a success status or one of 4 error codes, which we ignore and don't return to the user; but thinking about this, I will keep the net changes as a separate issue / PR.
The following routines disable and then reenable all interrupts: I am not sure why gpio would want to disable all interrupts, rather than just the gpio ones, for example. @nodemcu @jmattsson any comments?
onewire.c uses GPIO to bit-bang the 1-Wire protocol. The timing is important, but I'm not sure whether it is critical, so ALL interrupts are disabled while bit-banging. gpio.c has a function lgpio_serout() for gpio.serout(), which serializes the code bytes out of a GPIO pin. ws2812.c and dht.c, if I read them right, also bit-bang a GPIO pin. So all of these simulate a protocol on a single GPIO pin. I think it's OK to disable all interrupts to ensure protocol timing, or at least to make sure every interrupt is very short compared to the protocol timing. Of course, if all interrupts are disabled, we should make sure that they are re-enabled within a very short time. readline.c, when getc() is called, only protects the rx buffer/FIFO pointer; if the read/write is protected, then there is no need to disable all interrupts, I think.
Thanks @nodemcu zeroday. Your thoughts echo mine. The issue relates to bit-banging and how resilient the application needs to be whilst doing this. On an Arduino it's easy because there is no OS or SDK to get in the way. With the single-core ESP8266 it's a little more complicated, in that the WiFi and connection-oriented TCP stacks require timely responsiveness and will tend to time out and get lost in some nasty state unless interrupts are disabled for less than 100 µs or so and control is passed back to the SDK every 50 ms. So the bit-banging works fine, but we then get developers bitching about why the network has died. Another aspect that I don't understand is what happens if you get an irom0 cache miss when interrupts are disabled, as these are handled just like Johny's non-aligned access by an interrupt handler (though ROM-based). It's a pity that we don't have access to the ROM and SDK code here.
Sorry, no idea about whether the cache miss interrupt would be impacted or not.
@jmattsson Johny, the tasking interface will allow any module to declare a single task callback and the address of this routine acts as a handle for that callback. A typical use will be as in the case of the UART and GPIO ISRs which will use the post to task function to schedule Lua land event-based triggers. We need to keep the ISR side lean, but runtime-safe. There are three alternative mechanisms for declaring these task callbacks:
Having weighed up the pros and cons, (1) is too complicated and (2) is too nasty, so I am going with (3). So my thinking is that the standard pattern for the ISR will use a standard memory structure to store the interrupt context (the handle for the task handler and any interrupt parameters), and this will be the uint32 parameter to the post. This will be statically allocated in RAM to avoid malloc overheads in the ISR. The Lua-side task processor can parse this and validate that the handle is one of the declared ones. If the ISR needs to support stacking multiple requests, then it will need to allocate an array of interrupt contexts and manage which to use for the current call. (This might be needed for GPIO triggers.) At the moment we only have 4 modules which use this interface, so the implementation will be simple and lean.
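The pattern described above might look roughly like the following host-side sketch. It is not the firmware's code: every name here (isr_context_t, task_declare, mock_post, gpio_isr, task_dispatch) is hypothetical, mock_post is a one-slot stand-in for the SDK's post call so the example can run off-target, and uintptr_t stands in for the 32-bit post parameter so the pointer cast is also valid on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is a host-side sketch only. */
typedef uintptr_t task_param_t;               /* stands in for the uint32 post parameter */
typedef void (*task_handler_t)(uint32_t arg);

typedef struct {
    task_handler_t handle;   /* address of the declared task callback */
    uint32_t       arg;      /* interrupt parameter, e.g. a pin number */
} isr_context_t;

#define MAX_HANDLERS 4
static task_handler_t declared[MAX_HANDLERS];
static size_t         n_declared;

/* One-slot stand-in for the SDK task queue. */
static task_param_t pending;
static int          has_pending;

/* A module declares its single task callback; the address is the handle. */
int task_declare(task_handler_t h) {
    if (h == NULL || n_declared >= MAX_HANDLERS) return 0;
    declared[n_declared++] = h;
    return 1;
}

/* Stand-in for the SDK post call: returns 0 if the slot is still occupied. */
static int mock_post(task_param_t p) {
    if (has_pending) return 0;
    pending = p;
    has_pending = 1;
    return 1;
}

/* ISR side: fill a statically allocated context (no malloc) and post it. */
static isr_context_t gpio_ctx;

int gpio_isr(task_handler_t h, uint32_t pin) {
    gpio_ctx.handle = h;
    gpio_ctx.arg    = pin;
    return mock_post((task_param_t)&gpio_ctx);
}

/* Task side: validate the handle against the declared ones before calling.
   Returns 1 if a handler ran, 0 if nothing was pending or the handle was unknown. */
int task_dispatch(void) {
    if (!has_pending) return 0;
    has_pending = 0;
    const isr_context_t *ctx = (const isr_context_t *)pending;
    for (size_t i = 0; i < n_declared; i++) {
        if (declared[i] == ctx->handle) {
            ctx->handle(ctx->arg);
            return 1;
        }
    }
    return 0;
}

/* Demo handlers: one that gets declared, one that never is. */
uint32_t last_pin_seen;
void on_gpio(uint32_t pin)            { last_pin_seen = pin; }
void undeclared_handler(uint32_t pin) { last_pin_seen = pin + 100; }
```

With a single static context, a second interrupt arriving before task_dispatch() runs is rejected by mock_post, which is the limitation that the "array of interrupt contexts" remark addresses.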
Option (3) sounds like the most appropriate approach, I agree. Queueing up interrupts (or the effects thereof) is always a challenging business, but I don't think this approach makes it any harder. Do we know whether it's possible to have multiple events of the same type/target posted to a task?
I think that a probable case is #845 and Philip's screwdriver test. The GPIO ISR currently reenables interrupts before calling the Lua world directly, possibly preempting an already running Lua thread. A bouncing contact or even two parallel triggers on mechanically coupled inputs could easily generate interrupts within tens of msec, causing nested preemptions. This is disastrous with the current implementation, but could still be damaging with the task-based one if a single interrupt context block is used by the ISR, because you could get a second interrupt triggered before the SDK has delivered the task to clear the first. Ideally we would need a mutex mechanism to control this, but I'd live with a simple test & set / clear flag. This way we could either use a single context, where the ISR sets it in use and the delivered task clears it (the ISR would then need collision detection logic to drop extra interrupts before this is cleared, which is not my preferred scheme), or we have a fixed (say 4) array of context blocks and the ISR picks a free one to queue the task.
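As a rough illustration of the second scheme (a fixed array of context blocks guarded by a test & set / clear flag), here is a minimal host-side sketch. The names gpio_ctx_t, ctx_claim, ctx_release and dropped_events are made up for this example, and C11 atomic_flag is an assumption standing in for whatever test-and-set primitive the firmware would actually use.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; a sketch of a fixed pool of interrupt context blocks. */
#define CTX_POOL_SIZE 4

typedef struct {
    atomic_flag in_use;   /* set by the ISR, cleared by the delivered task */
    uint32_t    pin;
    uint32_t    level;
} gpio_ctx_t;

static gpio_ctx_t ctx_pool[CTX_POOL_SIZE] = {
    { ATOMIC_FLAG_INIT, 0, 0 },
    { ATOMIC_FLAG_INIT, 0, 0 },
    { ATOMIC_FLAG_INIT, 0, 0 },
    { ATOMIC_FLAG_INIT, 0, 0 },
};

/* Count of interrupts dropped because every block was busy. */
uint32_t dropped_events;

/* ISR side: claim the first free block, or drop the event if none is free. */
gpio_ctx_t *ctx_claim(uint32_t pin, uint32_t level) {
    for (size_t i = 0; i < CTX_POOL_SIZE; i++) {
        /* test_and_set returns the previous value: false means we now own it */
        if (!atomic_flag_test_and_set(&ctx_pool[i].in_use)) {
            ctx_pool[i].pin   = pin;
            ctx_pool[i].level = level;
            return &ctx_pool[i];
        }
    }
    dropped_events++;
    return NULL;
}

/* Task side: call once the Lua callback for this block has been run. */
void ctx_release(gpio_ctx_t *ctx) {
    atomic_flag_clear(&ctx->in_use);
}
```

A burst longer than the pool size is dropped rather than queued, so the pool size sets how much contact bounce can be absorbed before events are lost.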
I'd like to add another device driver which makes use of this mechanism. I just want to be able to call system_os_post from my ISR, and then get a callback at a later stage. Then I will actually invoke the Lua function.
@pjsg, Philip, I'll document the API later today, and possibly do the PR if I can finish testing.
I think that the new gpio / task interface work addresses this, so we can now close this issue.
This issue was highlighted to me by #795 and #802. We have Lua instabilities manifesting themselves when the ESP is being hammered by GPIO or OW interrupts. How it manifests itself is with Lua stack corruptions which cause assert trips if lua_assert diagnostics are available and by a messy death if not.
There is a separate issue #846, which might be related but I think not. Because this stuff is terribly documented within the Espressif documentation, I am largely relying on wider experience and reverse engineering of the behaviour of the SDK. But if anyone else can provide better information, then please add to this discussion.
lua_lock() and lua_unlock() are optimised away. We don't have the option to enable them because the SDK is essentially non-preemptive and any lock collision would simply result in deadlock.
user_init() starts the main Lua thread and posts a task to run init.lua.
Callbacks (e.g. the tmr.alarm() callback) are executed using standard "calling Lua from C" API features and pick up a Lua thread (not necessarily the main thread -- see Updating modules that are incompatable with Lua coroutining #846).
Interrupt callbacks are attached through the ETS_XXXX_INTR_ATTACH() macros defined in SDK//include/ets_sys.h, which execute the ets_isr_attach() function. Now these seem to be true interrupt routines.
This last category often picks up the lua_State for the main or other thread, operates on the Lua stack using the stack API calls, and even calls Lua functions. This is a recipe for death. If these are true interrupt callbacks then there is no guarantee that the Lua stack will be in a fit state to be used, and calling any complex processing such as Lua code will break all of the timing guidelines (10 µs interrupts disabled) or allow reentrance of non-reentrant code.
What these interrupt cb routines should do is stay firmly outside the Lua environment (which really raises the question of whether we can even use the registry), and if they want to carry out any Lua processing then this should be done by doing a system_os_post() to schedule a non-interrupt callback. This will hit the latency of callbacks such as gpio.trig(). All Lua execution and access to the Lua stack and registry should be at the non-preemptive level.
On a related note, I see that the SDK also requires the same task approach for the espconn disconnect and abort functions, which we don't do, so we should roll up these changes at the same time.
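To make the "post, don't call" rule concrete, here is a small host-side simulation. It is only a sketch: post_event is a made-up stand-in for system_os_post(), the queue is a plain ring buffer rather than the SDK's task queue, and lua_side_callback merely models a Lua callback that is interrupted partway through.

```c
#include <stdint.h>

/* Hypothetical host-side model; none of these names come from the SDK. */
#define QUEUE_LEN 8
static uint32_t queue[QUEUE_LEN];
static unsigned q_head, q_tail;   /* head: next to run, tail: next free slot */

static int in_lua;                /* 1 while a Lua-side callback is running */
int        reentered;             /* set if a callback ever starts inside another */
unsigned   callbacks_run;

/* Stand-in for system_os_post(): returns 0 when the queue is full. */
static int post_event(uint32_t ev) {
    if (q_tail - q_head == QUEUE_LEN) return 0;
    queue[q_tail++ % QUEUE_LEN] = ev;
    return 1;
}

/* ISR: never touches Lua state, it only posts the event. */
int isr_fire(uint32_t ev) {
    return post_event(ev);
}

/* Models the non-interrupt callback that is allowed to call into Lua. */
static void lua_side_callback(uint32_t ev) {
    if (in_lua) reentered = 1;    /* would indicate a reentrancy bug */
    in_lua = 1;
    if (ev == 1) isr_fire(2);     /* simulate an interrupt arriving mid-callback */
    callbacks_run++;
    in_lua = 0;
}

/* Non-preemptive dispatcher: runs queued callbacks strictly one at a time. */
unsigned run_pending(void) {
    unsigned n = 0;
    while (q_head != q_tail) {
        lua_side_callback(queue[q_head++ % QUEUE_LEN]);
        n++;
    }
    return n;
}
```

The interrupt that fires during the first callback is only queued, so the second callback starts after the first has returned; the cost is the extra latency noted above for callbacks such as gpio.trig().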
Does anyone agree / disagree with this analysis?