Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPIO interrupts crash the device #874

Closed
nickandrew opened this issue Dec 23, 2015 · 14 comments
Closed

GPIO interrupts crash the device #874

nickandrew opened this issue Dec 23, 2015 · 14 comments
Assignees

Comments

@nickandrew
Copy link
Contributor

Enabling an interrupt handler for GPIO events can cause a crash.

I'm breaking this off from #871 as it is more fundamental than enabling safe posting of Lua tasks after an interrupt is handled.

In my test case, the NodeMCU is doing essentially nothing but it still crashes.

See https://github.com/nickandrew/nodemcu-firmware/commits/debug-gpio-intr and there's a 1-line commit which enables the crash to be demonstrated:

--- a/app/modules/node.c
+++ b/app/modules/node.c
@@ -471,6 +471,7 @@ static int node_setcpufreq(lua_State* L)
 // Lua: code = bootreason()
 static int node_bootreason (lua_State *L)
 {
+  platform_gpio_intr_init(3, GPIO_PIN_INTR_NEGEDGE);
   lua_pushnumber (L, rtc_get_reset_reason ());
   return 1;
 }

I have an Infrared Receiver attached to GPIO0. Its output is normally high; goes low with a demodulated infrared signal.

To test, I call node.bootreason() from lua to enable the gpio interrupt handler, then send a blast of infrared data at it (by mashing the keys on the remote). I can cause a reset within 10 seconds if I try. Note that there is no Lua code being run by this interrupt handler, as no function has been hooked into it.

I did the testing on SDK 1.4.0; will try it all again with 1.5.0.

@TerryE
Copy link
Collaborator

TerryE commented Dec 23, 2015

Nick, I've got the family here getting ready for Christmas, so I'll keep this short. The extra line enables HW interrupts for the given pin. Level changes on that pin are now captures by the gpio ISR and this can cause side effects. It's complicated and it needs to be robust. Sorry can't go into more details now. Maybe tonight after everyone has gone to bed, because that's when I usually end up clearing the backlog :(

@nickandrew
Copy link
Contributor Author

Tested with 1.5.0 and stressing the GPIO interrupt still crashes the device. Or maybe not even stressing it. If I had a signal generator, I could get a better idea.

@TerryE
Copy link
Collaborator

TerryE commented Dec 26, 2015

Nick, let me make the changes first and raise the PR before you test this. Thanks

@pjsg
Copy link
Member

pjsg commented Jan 3, 2016

@TerryE Are you going to use the eLua model for interrupts or some other model? I ask because the eLua model uses the lua_sethook mechanisms which render them useless for other purposes (AFAIK you can only have one hook registered at a time)

@TerryE
Copy link
Collaborator

TerryE commented Jan 3, 2016

It's a standard Lua mechanism rather than eLua, AFAIK. Yes you can only have one hook registered, What this means is that any library seeking to register a second hook my save the previous hook and daisy chain to it. But given that we don't use any by default, even if we don't do this it should be a major issue now.

Sorry Philip, I picked up the comment and though you were referring to your PR. The sethoook mechanism is currently used to hook debugging and diagnostics into the VM, not for interrupt processing -- which is broadly what you are wanting to enhance.

@TerryE TerryE self-assigned this Jan 4, 2016
@TerryE
Copy link
Collaborator

TerryE commented Jan 24, 2016

Nick, IMO the root causes are still the same as #845 so I will answer there.

@TerryE
Copy link
Collaborator

TerryE commented Jan 26, 2016

I've been going through the implementation for all of the GPIO devices and its a mess on a number of levels, and really needs a full rework. At the nub of this are two factors which really mismatch:

  • The architecture of the esp8266 and the non-OS SDK. The chipset is uni-core and has no DMA capability. It does have a Wifi coprocessor, but this is very low level and to support the high level house keeping functions needed to 'feed and water' the Wifi and TCP stacks, the SDK needs to schedule these functions with sufficient timeliness and regularity to avoid function queuing and overflow. To keep the SDK simple a non pre-emptive short-task framework adopted. This is the reason for to 50 µSec maximum interval for ISRs and interrupt masked code; the 15 mSec per task execution and the 500mSec maximum serialised continuous execution guidelines. Incidentally, this whole issue of memory blow-up as a result of resource queuing, is also the reason for the one async network primitive per task rule (a.k.a. "Don't do two sk:sends() in one callback").
  • The underlying assumptions which underpin the eLua Architecture. These are that the eLua framework runs procedurally over a bare metal chipset -- very much as is typical in Arduino implementations.

So we have task driven SDK with stringent timeout constraints vs. a Lua implementation really intended for a procedural execution over bare metal. Luckily as Lua is typically used as a scripting language for applications such as gaming kernels and database engines, it facilitates an implementation using the SDK tasking and callback framework. It's just that this is very different to that assumed in the eLua architectural overview:

eLua Architecture

OK, the picture looks OK but the lack of commonality of the codebase at the platform interface level really underlines how different nodemcu is from a more typical eLua implementation. I think that Zeroday and his colleagues at NodeMCU Inc did a fantastic job in their initial especially given that the SDK is closed source and given this, the documentation is very inadequate, but IMO the whole paradigm for GPIO handling and the related drivers layered on this needs revisiting to see if they will ever work properly, and if they aren't going to then we need to actively deprecate them.

  • Our GPIO implementation does not use the Espressif standard interface ROM (and SDK) GPIO interface (grep "PROVIDE (gpio" app/mapfile to see these) and as documented in sdk/esp_iot_sdk_v1.5.1/include/gpio.h.
  • It chronically breaks the 50 µSec limit.
  • It mixes Lua Land and ISRs.
  • Most of the layered drivers are Ardiuno ports and don't recognise the fundamental differences between the Arduino-way and the SDK-way. Given the lack of DMA and the 50 µSec limit, the standard approach
    of disabling interrupts and bit banging does work as far as the I/O goes, but when you re-enable interrupts either your Wifi and TCP stack have died or got into a twist and then panic the CPU.
  • So (excluding the new WS2812 driver) the only form of bit-banging that does work is an optimistic
    avoidance strategy. Here interrupts are not disabled for the entire sequence but might be disabled up to 50 µSec. Any delivered ISRs will delay timing, but in many cases this might still we entirely workable:
  • In the case of master-clocked protocols such as I2C and SPI, most slave peripherals are tolerant to this delay. We could never support the slave variants
  • One wire will miss-read, but this will be detected in the OW CRC, so the reading can be retried.
  • The GPIO serout protocol will never work stably, as is the case with PMW and RC protocols.
  • The old WS2812 protocol will never work stably, but the encoding scheme allows it to be overlaid on a UART stream and therefore use the onchip H/W FIFO.

Ultimately, if you want to understand the shortcomings of the ESP8266, you only need to compare it to the spec of the ESP32. IMO, we should be honest about these short-comings and deprecate drivers that we know to be unreliable.

I have more to say, but accidentally posted this comment in partial form, so I will now pause for feedback.

@pjsg
Copy link
Member

pjsg commented Jan 26, 2016

Thanks for this. I think that there are levels of unreliability in drivers. What is acceptable for a hobbyist would not be acceptable for a production system. I have a humidity controller based on the dht module and using gpio to control a relay that drives a misting pump for my carnivorous plant tank. Yes, the reads from the dht module sometimes fail, but I get an indication that this has happened, and it typically works when I retry a couple of seconds later. I don't know (and to some degree I don't care) whether this is due to interrupts happing at a bad time, or something else.

If the driver causes system failure (e.g. the current gpio in interrupt mode), then we should clearly deprecate it (or fix it). If the driver sometimes signals failure, then that is OK -- provided that this is documented and it is easy/possible to recover from this failure.

I don't know whether this code base is supposed to be production ready or hobbyist ready....

@TerryE
Copy link
Collaborator

TerryE commented Jan 26, 2016

Failure is detectable with retry is a perfectly good strategy. Enet used it happily for years. Relay and I/O pin logic is fine; using PWM I2C chip is fine (you can pick these up on eBay for few $); ditto any chip using @Alkorin Thomas's UART technique. But using the RT module will trash the Wifi, as can the serout function.

What I am trying to work out are the incremental steps that we can take where each has a tangible improvement. My current plan on this commit is to leave the current low level code which bypasses the ROM/SDK GPIO API in place, and fix the architect stuff around timing.

@pjsg
Copy link
Member

pjsg commented Jan 26, 2016

If you commit the new task code, then I'll undertake to improve the GPIO code so that it doesn't invoke LUA from the interrupt callback, and doesn't spend too long in the interrupt callback.

The only decision is whether to try and timestamp events and pass that information to the lua callback, or just do the simplest thing. I suspect that the timestamping approach is more useful, though more work and more prone to overflow.

@nickandrew
Copy link
Contributor Author

Per @TerryE's comment in #845, adding ICACHE_RAM_ATTR to relevant GPIO interrupt handlers seems to solve the crash. I was unable to reproduce the crash over 5 minutes of frenetic button pushing.

I tested this on both 1.5.0 and 1.5.1. The diff against 1.5.1 is:

diff --git a/app/modules/gpio.c b/app/modules/gpio.c
index d4c189e..7e5ceb5 100644
--- a/app/modules/gpio.c
+++ b/app/modules/gpio.c
@@ -29,7 +29,7 @@ void lua_gpio_unref(unsigned pin){
   gpio_cb_ref[pin] = LUA_NOREF;
 }

-void gpio_intr_callback( unsigned pin, unsigned level )
+void ICACHE_RAM_ATTR gpio_intr_callback( unsigned pin, unsigned level )
 {
   NODE_DBG("pin:%d, level:%d \n", pin, level);
   if(gpio_cb_ref[pin] == LUA_NOREF)
diff --git a/app/platform/platform.c b/app/platform/platform.c
index 1b06f7d..be5ed76 100755
--- a/app/platform/platform.c
+++ b/app/platform/platform.c
@@ -135,7 +135,7 @@ int platform_gpio_read( unsigned pin )
 }

 #ifdef GPIO_INTERRUPT_ENABLE
-static void platform_gpio_intr_dispatcher( void *arg) {
+static void ICACHE_RAM_ATTR platform_gpio_intr_dispatcher( void *arg) {
   platform_gpio_intr_handler_fn_t cb = arg;
   uint8 i, level;
   uint32 gpio_status = GPIO_REG_READ(GPIO_STATUS_ADDRESS);

This doesn't deal with the wider issue of Lua code being invoked from an interrupt handler (that's #845) and there's no point me submitting a PR for the above diff as Terry's rewrite of GPIO interrupt handling will override it soon enough. So closing.

@TerryE
Copy link
Collaborator

TerryE commented Jan 30, 2016

Thanks Nick, and sorry for my tardiness in getting this out but I've been spending 8 hrs a day working on the roof of our new build cutting roof window slots in the sarking, mounting roof windows, verge and fascia edging etc. and trying to keep ahead of my slater. I've just been to knackered after this to progress the coding, but at least this give me lots of mull time to sort out details in my head 😅

@shinji2009
Copy link

shinji2009 commented Oct 31, 2016

hi. can anybody please insert ICACHE_RAM_ATTR in the proper place/places of this script?

--sensor.lua
file.open( "sensorThreshold.lua", "r" )
sensorThreshold = file.read()
sensorThreshold = string.gsub(sensorThreshold, "%s+", "")
sensorThreshold = tonumber(sensorThreshold)
file.close()


--Set how many times you'd like the sensor to receive an input before sending an alert.

file.open( "timerThreshold.lua", "r" )
timerThreshold = file.read()
timerThreshold = string.gsub(timerThreshold, "%s+", "")
timerThreshold = tonumber(timerThreshold)*1000
file.close()
--print("Timer Reset: "..timerThreshold)

--The interaction of these two variables above will determine your tolerance for false alarms.
--For example if this device is a panic button and you want to send a message after 3 button presses within 10 seconds, set countThreshold = 3 and TimerResetThreshold = 10000.
--If this device is a Smoke Alarm Sound Sensor, and you want to wait until the alarm has been sounding for ~30 seconds before sending an alert, set countThreshold = 60, TimerResetThreshold = 60000 (60 loud sounds within 60 seconds).
--If this device is a Baby Noise SMS Alert, set countThreshold = 5, TimerResetThreshold = 60000 (5 loud sounds/cries within 60 seconds). Keep in mind you will have to calibrate your microphone sensor to a lower threshold than a Smoke Alarm Sensor.

buttonPin = 2 -- this is ESP-01 pin GPIO4
gpio.mode(buttonPin,gpio.INT,gpio.PULLUP)
count = 0

function resetCounter()
    count = 0
    end

function debounce (func)
    local last = 0
    local delay = 10000

    return function (bounceCheck)
        local now = tmr.now()
            if now - last < delay
            then return
            end
        last = now
        return func(bounceCheck)
    end
end

function onChange()
    if gpio.read(buttonPin) == 0
    then count = count + 1

        tmr.alarm(6,timerThreshold,0,function() 
        resetCounter()
        print("Time limit reached. Restarting Counter...")
        end)

            if count == sensorThreshold
            then 
                print("(" .. count .. ") sensor inputs counted! Sending Alert!")
                dofile("ifttt.lua")
            else print("(" .. count .. ") sensor inputs counted...")
            end     
    end
end

gpio.trig(buttonPin,"down", debounce(onChange))

@TerryE
Copy link
Collaborator

TerryE commented Oct 31, 2016

@shinji2009 ICACHE_RAM_ATTR is used in the development of C libraries for NodeMCU, not in Lua scripts. This is the wrong list for resolving issues with your Lua applications. Try one of the user forums.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants