Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lua reentrancy, tasks, callbacks and interrupt callbacks. #851

Closed
TerryE opened this issue Dec 14, 2015 · 16 comments
Closed

Lua reentrancy, tasks, callbacks and interrupt callbacks. #851

TerryE opened this issue Dec 14, 2015 · 16 comments
Assignees

Comments

@TerryE
Copy link
Collaborator

TerryE commented Dec 14, 2015

This issue was highlighted to me by #795 and #802. We have Lua instabilities manifesting themselves when the ESP is being hammered by GPIO or OW interrupts. How it manifests itself is with Lua stack corruptions which cause assert trips if lua_assert diagnostics are available and by a messy death if not.

There is a separate issue #846, which might be related but I think not. Because this stuff is terribly documented within the Espressif documentation, I am largely relying on wider experience and retroengineering of the behaviour of the SDK. But if anyone else can provide better information, then please add to this discussion.

  • Lua can either be build to be reentrant or not. In the case of the eLua implementations it is assumed that the RTS is non -rentrant so the lua_lock() and lua_unlock() are optimised away. We don't have the option to enable them because the SDK is essentially non-preemptive and any lock collision would simply result in deadlock.
  • The SDK uses two primary mechanisms for executing applications code: tasks and callbacks. Though in many ways tasks are simply a queued callback mechanism.
  • All task and most callback functions are non-preemptive, that is once a function is initiated, it will run to completion and return control to the SDK -- which in turn can then schedule the next runnable task / cb. The guideline is that no cb should take longer than 10 mSec, otherwise some functions such as net and wifi will start to fail.
  • user_init() starts the main Lua thread and posts a task to run the init.lua.
  • Any code chunks input through the UART are also posted as tasks.
  • Unlike standard Lua, the firmware never closes the main thread, but leaves it open when tasks complete, and hence globals, the thread's registry and dangling upvalues remain in scope.
  • Callbacks (such as a tmr.alarm() callback) are executed using standard "calling Lua from C" API features and pick up a Lua thread (not necessarily the main thread -- see Updating modules that are incompatable with Lua coroutining #846).
  • Some callbacks seem to be classed as "interrupt drivers", and it isn't clear to me if they are handled within the non-pre-emptive queueing system or are true interrupt routines. For example I am thinking specifically of cb routines bound by the ETS_XXXX_INTR_ATTACH() macros defined in SDK//include/ets_sys.h and execute the ets_isr_attach() function. Now these seems to be true interrupt routines.

This last category often pick up the Lua_state for the main or other thread and operate on the Lua stack using the stack API calls and even call Lua functions. This is a recipe for death. It these are true interrupt callbacks then there will be no guarantee that the Lua stack will be in a fit state to be used, and calling any complex processing such as Lua code will break all of the timing guidelines (10µS interrupts disabled) or allow reentrance of non-reentrant code.

What these interrupt cb routines should do is stay firmly outside the lua environment (which really begs the Q of whether we can even use the registry) and if they want to carry out any Lua processing then this should be done by doing a system_os_post() to schedule a non-interrupt callback. This will hit the latency of callbacks such as gpio.trig(). All Lua execution and access to Lua stack and registry should be at the non-preemptive level.

On a related note, I see that the SDK also requires the same task approach for the espconn disconnect and abort functions, which we don't do, so we should roll up these changes at the same time.

Does anyone agree / disagree with this analysis?

@jmattsson
Copy link
Member

I agree with the (obvious) conclusion - ISRs should not directly be calling into Lua. That's pretty much driver writing 101 - never do more than absolutely necessary inside the ISR.

@nickandrew
Copy link
Contributor

I'm reading these related issues with interest, but I don't know what's going on well enough to contribute, sorry.

In my issue #802 I expect something is getting corrupted, but I'm not sure why you mentioned "OW interrupts" in your initial comment. app/driver/onewire.c doesn't generate any interrupts as far as I know; all onewire operations run in the one thread to completion and they disable interrupts during time sensitive bit banging.

My lua code uses 2 timers; one every 30 seconds to initiate each complete pass, and a 1-second one shot to accomplish the necessary 1 second delay while the attached sensors convert their temperature. There's also net.tcp activity to POST results, so all up there'll be 2 timer callbacks and whatever net callbacks are required (on connection). No GPIO triggers.

@TerryE
Copy link
Collaborator Author

TerryE commented Dec 15, 2015

Nick, you might be correct. I need to look at the OW code and see in detail how it does its bit diddling of the I/O pins. This may be triggering gpio triggers and causing your problems as a consequence. One simple way that might check this is to comment out the #define GPIO_INTERRUPT_ENABLE in user_config.h. This removes all of the GPIO interrupt related code. (i) Does your OW code still work in general? (ii) Has this issue gone away?

If YES / YES then it is highly likely that it is an interrupt related issue.
If YES / NO then we've effectively ruled out that it is an interrupt related issue.

Oh, the wonders of a binary chop!

@TerryE
Copy link
Collaborator Author

TerryE commented Dec 16, 2015

The five ISR routines declared in the code are:

File Line Type ISR
driver/key.c 57 GPIO key_intr_handler(keys)
platform/platform.c 154 GPIO platform_gpio_intr_dispatcher(cb)
driver/pwm.c 376 FRC_TIMER1 pwm_tim1_intr_handler()
include/rtc/rtctime_internal.h 600 FRC_TIMER1 rtc_time_ccount_wrap_handler()
driver/spi.c 435 SPI spi_slave_isr_handler()
driver/uart.c 54 UART uart0_rx_intr_handler(&UartDev.rcv_buff))

Of these, the pwm, rtc, spi stay out of Lua land. The uart task does implement a task post correctly (Thanks Johny); and the key.c file is not used in the firmware. So currently GPIO is the only ISR that is not implemented correctly.

However, the SDK advises that implementers adopt same post task approach to implement some espconn functions, so I will sweep these up at the same time.

@nickandrew
Copy link
Contributor

What about network callbacks, e.g.

espconn_regist_recvcb(pesp_conn, net_socket_received);
espconn_regist_sentcb(pesp_conn, net_socket_sent);
espconn_regist_disconcb(pesp_conn, net_server_disconnected);
espconn_regist_reconcb(pesp_conn, net_server_reconnected);

Are these guaranteed to be called only when no Lua task is running?

@TerryE
Copy link
Collaborator Author

TerryE commented Dec 16, 2015

These are SDK tasks within the above definition so will only be delivered when no Lua task is running. The issues is with calling espconn_disconnect(), espconn_secure_disconnect() and espconn_regist_disconcb(). See the SDK warnings on these functions, and the details in Appendix 9. These can't be directly called from a espconn callback (which we do, BTW). Eg. in our current implementation for:

sk:on('sent', function(sk) sk:close() end)

Also the connect function returns a success status or one of 4 error codes which we ignore and don't return to the user, but thinking about this I will keep the net changes as a separate issue / PR

@TerryE
Copy link
Collaborator Author

TerryE commented Dec 28, 2015

The following routines disable and then reenable all interrupts: dhtlib/dht.c, app/driver/i2c_master.c (commented out) ,driver/onewire.c, driver/readline.c, modules/gpio.c, modules/ws2812.c (More routines disable the specific interrupt class that they are processing.) I would a lot happier if we had a robust definition of is and is not allowable for routines that do this.

I am not sure why gpio would want the disable all interrupts, rather than just gpio ones, for example.

@nodemcu @jmattsson any comments?

@nodemcu
Copy link
Collaborator

nodemcu commented Dec 29, 2015

onewire.c use gpio to simulate 1-wire protocol bit-bang, the timing is important, but I'am not sure if it is critical, so when bit-banging ALL interrupts is disabled.

gpio.c has a function lgpio_serout() for gpio.serout(), which serialize the code byte out of gpio pin.
(this is first added to simulate IR remote). also use bit-banging. if gpio.serout() is not called, gpio will not need to disable all interrupt.

ws2812.c and dht.c if I looked at it right also bit-banging out gpio pin.

so, all this simulated protocol from a single gpio pin. I think it's ok to disable all interrupt to ensure protocol timing, or at least make sure every interrupt is very short compared to the protocol timing.
not sure how long a SDK interrupt(for example for network receive/transfer) takes, so it may interrupt the timing.

Of course if all interrupt is disabled, should make sure that it will re-enabled in very short time.

readline.c, when getc() is called, is only protect the rx buffer/fifo pointer, if read/write is protected, then no need to disable all interrupt I think.
my point of view.

@TerryE
Copy link
Collaborator Author

TerryE commented Dec 29, 2015

Thanks @nodemcu zeroday. Your thoughts echo mine. The issue relates to bit-banging and how resilient the application needs to be whilst doing this. On an Arduino its easy because there is no OS or SDK to get in the way. With the single core ESP8266 it's a little more complicated in that the WiFi and connection oriented TCP stacks require timely responsiveness and will tend to timeout and get lost in some nasty state unless interrupts are disabled for less than 100 uSec or so and return passed back to the SDK every 50mSec.

So the bit banging works fine but we then get developers bitching about why the network has died.

Another aspect that I don't understand if what happens if you get an irom0 cash miss when interrupts are disabled as these are handled just like Johny's non-aligned access by an interrupt handler (though ROM-based). It's a port that we don't have access to the ROM and SDK code here.

@TerryE TerryE self-assigned this Jan 4, 2016
@nodemcu
Copy link
Collaborator

nodemcu commented Jan 5, 2016

sorry no idea about whether the cache miss interrupt would be impacted or not.
maybe can do a test with a flash read/write within block that interrupts globally disabled.

@TerryE
Copy link
Collaborator Author

TerryE commented Jan 15, 2016

@jmattsson Johny, the tasking interface will allow any module to declare a single task callback and the address of this routine acts as a handle for that callback. A typical use will be as in the case of the UART and GPIO ISRs which will use the post to task function to schedule Lua land event-based triggers. We need to keep the ISR side lean, but runtime-safe.

There are three alternative mechanisms for declaring these task callbacks:

  1. Replicate the NODEMCU_MODULE linker magic to allow these to be statically declared. This could be done in module rather than user_modules.c.
  2. Overload the property map to use the convention that a reserved method name is not really a method but the task callback interface.
  3. Add a new lua_declaretask() API call which can be called from the luaopen_XXXX callback.

Having weighed up the pros and cons (1) is too complicated, (2) is to nasty, so I am going with (3).

So my thinking is that the standard pattern for the ISR will use a standard memory structure to store the interrupt context (the handle for the task handler and any interrupt parameters), and this will be the uint32 parameter to the post. This will be statically allocated in RAM to avoid malloc overheads in the ISR. The Lua-side task processor can parse this and validate that the handle is one of the declared ones.

If the ISR needs to support stacking multiple requests, then it will need to allocate an array of interrupt context and manage which to use for the current call. (This might be needed for GPIO triggers).

At the moment we only have 4 modules which use this interface so the implementation will be simple and lean.

@jmattsson
Copy link
Member

Option (3) sounds like the most appropriate approach, I agree. Queueing up interrupts (or the effects thereof) is always a challenging business, but I don't think this approach makes it any harder. Do we know whether it's possible to have multiple events of the same type/target posted to a task?

@TerryE
Copy link
Collaborator Author

TerryE commented Jan 16, 2016

I think that a probable case is #845 and Philip's screwdriver test. The GPIO ISR currently reenables interrupts before calling the Lua world directly, possibly preempting an already running Lua thread. A bouncing contact or even two parallel triggers on mechanically coupled inputs could easily generate interrupts within 10s of msec, causing nested preemptions. This is disastrous with the current implementation, but could still be damaging with the task-based one if a single interrupt context block is used by the ISR because you could get a second interrupt triggered before the SDK has delivered the task to clear the first.

Ideally we would need a mutex mechanism to control this, but I'd live with a simple test & set / clear flag. This way we could either use a single context where the ISR sets it in use and the delivered task clears it; the ISR would then need collision detection logic to drop extra interrupts before this is cleared (not my preferred scheme) or we have a fixed (say 4) array of context blocks and the ISR picks a free one to queue the task.

@pjsg
Copy link
Member

pjsg commented Jan 19, 2016

I'd like to add another device driver which makes use of this mechanism. I just want to be able to call system_os_post from my ISR, and then get a callback at a later stage. Then I will actually invoke the LUA function....

@TerryE
Copy link
Collaborator Author

TerryE commented Jan 19, 2016

@pjsg, Philip I'll document the API later today, and possibly do the PR if I can finish testing.

@TerryE
Copy link
Collaborator Author

TerryE commented Feb 17, 2016

I think that the new gpio / task interface work address this so we can now close this issue.

@TerryE TerryE closed this as completed Feb 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants