Wifi connection lost..... No reconnection #3208

Closed
chathurangawijetunge opened this issue Jul 13, 2020 · 76 comments

@chathurangawijetunge

chathurangawijetunge commented Jul 13, 2020

NodeMCU 3.0.0.0 built on nodemcu-build.com provided by frightanic.com
branch: dev
commit: 2fa63a1
release:
release DTS: 202007071335
SSL: false
build type: integer
LFS: 0x40000 bytes total capacity
modules: file,gpio,mqtt,net,node,rtctime,sntp,tmr,uart,wifi
build 2020-07-08 00:54 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)

I am using the build above.
Over a long run, WiFi gets disconnected and does not reconnect automatically, even if I do a soft reset using node.restart(). If I toggle the power supply, it reconnects.
I have had this issue with 4 ESP-07 devices.
Not sure if it is a bug or not.

@KT819GM

KT819GM commented Jul 13, 2020

I've noticed the same issue in some cases (I have not found the exact conditions, so I have not reported it yet). The Wi-Fi monitor reports 201 when the connection is lost and not restored. It will not attempt to reconnect until a power cycle. Another strange behavior is with #define WIFI_STA_HOSTNAME: after writing the firmware you have to power cycle for the change to be reflected. I've only checked the 5.3 version. Any suggestions on how to catch the fault?

@chathurangawijetunge
Author

There is definitely something wrong with the wifi sta module: WiFi gets disconnected after some time, reports 201, and never reconnects until a power cycle.
I think this is a bug.

chathurangawijetunge changed the title from "node.restart() vs Hard Reset" to "Wifi connection lost..... No reconnection" on Jul 13, 2020
@KT819GM

KT819GM commented Jul 14, 2020

@chathurangawijetunge Do you maybe have any steps to reproduce this consistently? On the dev version with Lua 5.3, normal reconnection under normal circumstances works okay. I can't remember now the exact reason why the module got stuck on 201 and would not reconnect until power cycled.

@chathurangawijetunge
Author

chathurangawijetunge commented Jul 14, 2020

It only happens over a long run, more than 24 hours.

And it happens with both master and dev Lua 5.3.

I am getting wifi.eventmon.reason.AUTH_EXPIRE (2)
and
wifi.eventmon.reason.ASSOC_EXPIRE (4)
and sometimes
wifi.eventmon.reason.NO_AP_FOUND (201)

The device will not reconnect automatically, even with wifi.sta.disconnect() followed by wifi.sta.connect(),
or even with node.restart().
Only a hard reset via the RST pin or a power cycle makes it connect again.

3.0-master_20190907 works fine.
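
When collecting evidence for a report like this, it can help to log the reason name next to the numeric code on every disconnect. The following is a minimal sketch, not code from this thread: `REASONS`, `invert` and `make_logger` are hypothetical helper names, and the small `REASONS` table only lists codes mentioned in this issue. On a device the full table is assumed to be available as `wifi.eventmon.reason`.

```lua
-- Subset of the reason codes mentioned in this thread (name -> code).
-- On NodeMCU the complete table is assumed to be wifi.eventmon.reason.
local REASONS = {
  AUTH_EXPIRE = 2,
  ASSOC_EXPIRE = 4,
  NO_AP_FOUND = 201,
  ASSOC_FAIL = 203,
}

-- Build a code -> name lookup from a name -> code table.
local function invert(reasons)
  local names = {}
  for name, code in pairs(reasons) do
    names[code] = name
  end
  return names
end

-- Return a STA_DISCONNECTED callback that reports the reason by name.
-- `out` is any function taking one string, for example print.
local function make_logger(reasons, out)
  local names = invert(reasons)
  return function(T)
    local label = names[T.reason] or "UNKNOWN"
    out(string.format("STA disconnected from %s: %s (%d)",
                      T.SSID, label, T.reason))
  end
end

-- Register on NodeMCU only; guarded so the file also loads under plain Lua.
if wifi and wifi.eventmon then
  wifi.eventmon.register(wifi.eventmon.STA_DISCONNECTED,
                         make_logger(wifi.eventmon.reason, print))
end
```

Passing the output function in makes it easy to redirect the log to a file or an MQTT message later without touching the callback.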

@chathurangawijetunge
Author

chathurangawijetunge commented Jul 17, 2020

I've tried
wifi.sta.autoconnect(0)
wifi.sta.config({ssid="ssid",pwd="pwd",auto=false})
wifi.sta.connect()

The WiFi connection is a little more stable. Not sure why this is happening.

@chathurangawijetunge
Author

any updates with regards to above situation....?

@nwf
Member

nwf commented Aug 1, 2020

Glancing at the logs, I don't see anything interesting happening to wifi between 3.0-master_20190907 (i.e. 310faf7) and 2fa63a1, but you might have a go with git bisect to see what happens.

@chathurangawijetunge
Author

Yes
3.0-master_20190907 works fine, but the issue is with the new dev.

@chathurangawijetunge
Author

Any one having this issue....?
It happens in long run >24 hours

@nwf
Member

nwf commented Aug 8, 2020

It may well be that nobody but you is experiencing this problem (yet); perhaps all our long-running esp8266es are still back on master. Because you are able to reliably see it, please git bisect and tell us when the problem first appeared. Otherwise, I'm afraid you'll just be waiting for someone else to notice, and that might not happen.

@chathurangawijetunge
Author

chathurangawijetunge commented Aug 10, 2020

> It may well be that nobody but you is experiencing this problem (yet); perhaps all our long-running esp8266es are still back on master. Because you are able to reliably see it, please git bisect and tell us when the problem first appeared. Otherwise, I'm afraid you'll just be waiting for someone else to notice, and that might not happen.

I don't know exactly how to do a git bisect,
but
3.0-master_20190907 works fine
and the issue is with
3.0-master_20200610 and with dev.

NO_AP_FOUND (201) appears after about 12-24 hours, with no way to recover until a power cycle.

@KT819GM

KT819GM commented Aug 10, 2020

I will leave 2 devices running #995114b with Lua 5.3 (I've already obsoleted 5.1 in my head) on a weak WiFi connection (< -85 dBm) and will report back with results after a few days. As I mentioned before, I've also had the same issue with a few boards, but somehow I could not identify the problem, and now it works stably for me (at first glance).
Config will be:

station_cfg = {}
wifi.sta.sethostname("WiFitest")
wifi.sta.autoconnect(1)
station_cfg.ssid = "ssid"
station_cfg.pwd = "password"
station_cfg.save = true
wifi.sta.config(station_cfg)

With two eventmon

wifi.eventmon.register(wifi.eventmon.STA_CONNECTED, function(T)
    print("\n\tSTA - CONNECTED" .. "\n\tSSID: " .. T.SSID .. "\n\tBSSID: " ..
              T.BSSID .. "\n\tChannel: " .. T.channel)
end)
wifi.eventmon.register(wifi.eventmon.STA_DISCONNECTED, function(T)
    print("\n\tSTA - DISCONNECTED" .. "\n\tSSID: " .. T.SSID .. "\n\tBSSID: " ..
              T.BSSID .. "\n\treason: " .. T.reason)
    connectedToMqtt = false
end)

Because it will use mqtt, I will add additional check:

function MyMqtt.watch_mqtt()
    tmr.create():alarm(10000, tmr.ALARM_AUTO, function()
        if not connectedToMqtt and wifi.sta.getip() ~= nil and wifi.eventmon.STA_CONNECTED == 0 then
            m:close() print('Reconnecting to Mqtt!') collectgarbage()
            tmr.create():alarm(1000, tmr.ALARM_SINGLE, function()
                MyMqtt.Connect()
            end)
        elseif not connectedToMqtt and wifi.sta.getip() == nil then
            wifi.sta.config(station_cfg)
        end
    end)
end

MQTT will also serve as an indicator of a lost and not restored connection, if that happens.

@nwf
Member

nwf commented Aug 11, 2020

@KT819GM: wifi.eventmon.STA_CONNECTED == 0 is a comparison between two constants; I do not think it means what you think it means?
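
To illustrate the point: wifi.eventmon.STA_CONNECTED is an event identifier, so comparing it to 0 says nothing about the current link. One alternative is to keep a state flag that the event callbacks update and let the watchdog timer consult that flag. This is a minimal sketch under that assumption; `link`, `on_got_ip`, `on_disconnected` and `next_action` are hypothetical names, not part of the code posted above.

```lua
-- Connection state maintained by event callbacks (hypothetical names).
local link = { wifi_up = false, mqtt_up = false }

local function on_got_ip()
  link.wifi_up = true
end

local function on_disconnected()
  link.wifi_up = false
  link.mqtt_up = false   -- MQTT cannot survive a lost station link
end

-- Decide what a periodic watchdog should do, based only on the flags.
local function next_action(state)
  if not state.wifi_up then return "reconfigure_wifi" end
  if not state.mqtt_up then return "reconnect_mqtt" end
  return "none"
end

-- Register on NodeMCU only; guarded so the file also loads under plain Lua.
if wifi and wifi.eventmon then
  wifi.eventmon.register(wifi.eventmon.STA_GOT_IP, on_got_ip)
  wifi.eventmon.register(wifi.eventmon.STA_DISCONNECTED, on_disconnected)
end
```

The MQTT connect and offline callbacks would set and clear `link.mqtt_up` in the same way; the timer body then reduces to a dispatch on `next_action(link)`.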

@chathurangawijetunge
Author

chathurangawijetunge commented Aug 11, 2020

> @KT819GM: wifi.eventmon.STA_CONNECTED == 0 is a comparison between two constants; I do not think it means what you think it means?

At what point is MyMqtt.watch_mqtt() called?
Do we need to call wifi.sta.autoconnect(1) separately, given that the default in wifi.sta.config() is true?
And do we have to re-apply wifi.sta.config(station_cfg) if the connection gets lost, given that wifi.sta.autoconnect is enabled at the beginning?
And wifi.eventmon.STA_CONNECTED is always 0.

@nwf
Member

nwf commented Aug 11, 2020

@chathurangawijetunge Please either edit your earlier comments or merely refrain from making duplicate comments like that. They are, like the duplicate issues, not conducive to conversation.

@KT819GM

KT819GM commented Aug 11, 2020

> @KT819GM: wifi.eventmon.STA_CONNECTED == 0 is a comparison between two constants; I do not think it means what you think it means?

Yeah, it was a bit of a brain fart. I got stuck experimenting with wifi.NULLMODE, so I did even more stupid things like asking a question about it on Gitter, and that's the reason why I don't like to provide code examples, not being a programmer.

> @KT819GM: At what point is MyMqtt.watch_mqtt() called?
> Do we need to call wifi.sta.autoconnect(1) separately, given that the default in wifi.sta.config() is true?
> And do we have to re-apply wifi.sta.config(station_cfg) if the connection gets lost, given that wifi.sta.autoconnect is enabled at the beginning?
> And wifi.eventmon.STA_CONNECTED is always 0.

MyMqtt.watch_mqtt() is started after MQTT has connected. I did not post the full code for the reasons given above.
wifi.sta.autoconnect(1): I usually declare what I use; yes, it defaults to true anyway.
wifi.sta.config(station_cfg): this is what does the reconnection when wifi.sta.autoconnect(1) fails. As an example:
connect to WiFi, switch to wifi.NULLMODE, then switch back to wifi.STATION; wifi.sta.autoconnect(1) will not be active, and only wifi.sta.config(station_cfg) will put it back online. This is my dirty workaround for more stable WiFi.

p.s. both units online, one at constant -86 / -92 dBm

@pjsg
Member

pjsg commented Aug 11, 2020

There is definitely something weird going on on the dev branch. Today, the node that I'm working on got into a state where it wouldn't connect to the AP. It kept on giving eventmon reason 23 (a type of auth fail) and the AP also reported

Aug 11 22:21:43.273 | Debugging | Station 2c3a.e835.f4eb Authentication failed

I tried redoing the wifi.sta.config({ssid="correct ssid", pwd="correct pw"}) and it didn't help. I tried changing modes to wifi.NULLMODE and back to wifi.STATION -- no help. I tried power cycling the board. No help.

I tried disabling the AP that it was trying to connect to so that it would switch to a different AP. No help.

I switched to wifi.sta.config({ssid="correct ssid", pwd="incorrect pw"}) and then back to correct. No help.

I switched to wifi.sta.config({ssid="incorrect ssid", pwd="correct pw"}) and then back to correct. This worked.

The eventmon data showed the correct ssid.

This started after I loaded a new LFS image. I have no idea whether this is related -- I include it for completeness.

@chathurangawijetunge
Author

chathurangawijetunge commented Aug 15, 2020

This happens not just with dev but in 3.0-master_20200610 too.
The only stable version for me is 3.0-master_20190907.

@pjsg
Member

pjsg commented Aug 23, 2020

I'm adding some code so that I can read out the last 12k of flash (where the wifi setup information is stored) and see when/if it changes unexpectedly.

@chathurangawijetunge
Author

chathurangawijetunge commented Aug 29, 2020

> I'm adding some code so that I can read out the last 12k of flash (where the wifi setup information is stored) and see when/if it changes unexpectedly.

Any luck in finding the bug?
Or, if you can share the code, I can also try it with some modules, as losing WiFi is a major issue.

@pjsg
Member

pjsg commented Aug 29, 2020

I managed to reproduce it today. It turns out that (I think) SPIFFS writes into the flash area at the end of the flash chip and overwrites the wifi settings. This is ugly.

Normally the last 12k doesn't change -- even on a reboot. However, sometimes it does. It could have to do with reloading the LFS region, since that was when it happened. However, the LFS partition also got corrupted at that time, so I don't know whether I can really blame it. It was SPIFFS data that was found in the last 12k. I suppose that I ought to check that the SPIFFS partition doesn't overlap the end of the flash.

@TerryE
Collaborator

TerryE commented Aug 30, 2020

@pjsg, This shouldn't make any difference, because the SDK is supposed to use the PT now. See my comments in #3260. If there is some bit of code in our current SDK that is still writing to the old locations, then we have wider issues that we need to scope and understand. We are currently running an old 3.0 SDK version. My first instinct would be to rebaseline to a current version and see if that fixes the problem before abandoning use of the Partition Table.

@TerryE
Collaborator

TerryE commented Aug 30, 2020

We currently use SDK 3.0.1.

From what you say, saving the default wifi.sta.config ssid + password does not use the PT partition but writes to this old SDK 2.x area. We can just erase a chip and set the SPIFFS at 0x100000; then the 5-page region should be left at FF. I will try to see if saving SSID credentials corrupts this. I'll also rebaseline our SDK to 3.0.4 and see if this repeats the issue.

@pjsg
Member

pjsg commented Aug 30, 2020

This is my partition table:

[{"size":4096,"address":45056,"type":4},{"size":4096,"address":49152,"type":5},{"size":12288,"address":53248,"type":6},{"size":45056,"address":0,"type":101},{"size":507904,"address":65536,"type":102},{"size":131072,"address":573440,"type":103},{"size":3469312,"address":704512,"type":106}]

I just iterated through all the partition types and these were the values that were returned. This looks plausible, but nevertheless, when you do wifi.sta.config, it does overwrite the last 12k of flash (I have a 4MB flash chip).

@TerryE
Collaborator

TerryE commented Aug 30, 2020

Decoded, this is as below, and it looks pretty typical. It is worth moving SPIFFS to the 1 MB boundary with a 1 MB size, say, so you can see exactly what is writing to the forbidden region. Let me have a play.

type                                address   size
SYSTEM_PARTITION_RF_CAL             0xB000    0x1000
SYSTEM_PARTITION_PHY_DATA           0xC000    0x1000
SYSTEM_PARTITION_SYSTEM_PARAMETER   0xD000    0x3000
NODEMCU_PARTITION_EAGLEROM          0x0000    0xB000
NODEMCU_PARTITION_IROM0TEXT         0x10000   0x7C000
NODEMCU_PARTITION_LFS0              0x8C000   0x20000
NODEMCU_PARTITION_SPIFFS            0xAC000   0x34F000
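
The table can be sanity-checked with a few lines of plain Lua. This is only a sketch for checking the arithmetic: `check_layout` is a hypothetical helper, and the 4 MB flash size is taken from the earlier comment about the chip in use. It sorts the partitions by address, reports any overlap, and computes how much space remains between the end of the last partition and the end of flash.

```lua
local FLASH_SIZE = 0x400000  -- 4 MB chip, as stated earlier in the thread

local partitions = {
  { name = "NODEMCU_PARTITION_EAGLEROM",        addr = 0x00000, size = 0x0B000 },
  { name = "SYSTEM_PARTITION_RF_CAL",           addr = 0x0B000, size = 0x01000 },
  { name = "SYSTEM_PARTITION_PHY_DATA",         addr = 0x0C000, size = 0x01000 },
  { name = "SYSTEM_PARTITION_SYSTEM_PARAMETER", addr = 0x0D000, size = 0x03000 },
  { name = "NODEMCU_PARTITION_IROM0TEXT",       addr = 0x10000, size = 0x7C000 },
  { name = "NODEMCU_PARTITION_LFS0",            addr = 0x8C000, size = 0x20000 },
  { name = "NODEMCU_PARTITION_SPIFFS",          addr = 0xAC000, size = 0x34F000 },
}

-- Returns a list of overlap descriptions and the number of bytes left
-- between the end of the highest partition and the end of flash.
local function check_layout(parts, flash_size)
  local sorted = {}
  for i, p in ipairs(parts) do sorted[i] = p end
  table.sort(sorted, function(a, b) return a.addr < b.addr end)

  local overlaps = {}
  for i = 2, #sorted do
    local prev, cur = sorted[i - 1], sorted[i]
    if prev.addr + prev.size > cur.addr then
      overlaps[#overlaps + 1] = prev.name .. " overlaps " .. cur.name
    end
  end

  local last = sorted[#sorted]
  local tail = flash_size - (last.addr + last.size)
  return overlaps, tail
end

local overlaps, tail = check_layout(partitions, FLASH_SIZE)
print(#overlaps, string.format("0x%X", tail))
```

For this table the partitions are contiguous with no overlap, and SPIFFS ends at 0x3FB000, leaving 0x5000 bytes (20 KB) at the top of a 4 MB chip. So on paper SPIFFS does not reach the last 12k; any write there would have to come from outside the declared layout.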

@TerryE
Collaborator

TerryE commented Aug 30, 2020

I've just tried various combinations of setmode(), sta.config() getconfig, connect and autoconnect including valid and invalid SSID / password combos. All works as expected and it only seems to be the valid system partitions that are getting updated. Every dump of 0XFB000 is 20Kb × 0xFF, so if the SDK is writing to this, it's going to be on some weird error path.

@KT819GM

KT819GM commented Sep 10, 2020

On this unit wifi.sta.config() was running on every reboot, and it had the SSID and password saved in LFS. That is possibly why it reconnected after a power cycle. But it did not reconnect on a software restart, even though wifi.sta.config() was run then as well.

@TerryE
Collaborator

TerryE commented Sep 10, 2020

We are currently running on 3.0.1 and the current is 3.0.4. We'll rebaseline the SDK immediately after the next master drop. This might help.

@chathurangawijetunge
Author

chathurangawijetunge commented Sep 11, 2020

I have the following code running on 3 modules

local Status_led = 4
for i=0, 8, 1 do gpio.mode(i, gpio.OUTPUT) gpio.write(i,0) end
wifi.setmode(wifi.STATION)
wifi.sta.clearconfig()
wifi.sta.config({ssid="SSID" ,pwd="PWD"})
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
local Status_led_map={0,2,3,4,0,1} -- 1=Connected 2=Connecting 3=password error 4=AP not found
tmr.create():alarm(5000, 1, function()
    tmr.softwd(30)
    local WiFi_Status=Status_led_map[wifi.sta.status()+1]
    if WiFi_Status==0 then
        gpio.write(Status_led,0)
    else
        gpio.serout(Status_led,gpio.LOW,{180000,180000},WiFi_Status,nil)
    end
    print(wifi.sta.status())
end)
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After running overnight, all 3 went offline with error '201 AP not found'. This is in my init.lua, not in LFS (now).
I will try the same with SDK 3.0.4 and post the findings.

@KT819GM

KT819GM commented Sep 11, 2020

I've bumped SDK to 3.0.4 also and left one device alive with wifi.sta.config() inactive. Attached it to broker.hivemq.com on channel: 9741000/#. It send data every 20 sec, with content: {"Serial": "9741000", "Boot reason": "1", "Heap": "32776", "rssi": "-84", "Live": 269, "powered by Lua 5.3.5 on SDK 3.0.4(9532ceb)"} where Boot reason indicates 1 - power cycle, 2 - software restart, Live is tmr.time() counter in seconds.

@chathurangawijetunge
Author

> I've bumped SDK to 3.0.4 also and left one device alive with wifi.sta.config() inactive. Attached it to broker.hivemq.com on channel: 9741000/#. It send data every 20 sec, with content: {"Serial": "9741000", "Boot reason": "1", "Heap": "32776", "rssi": "-84", "Live": 269, "powered by Lua 5.3.5 on SDK 3.0.4(9532ceb)"} where Boot reason indicates 1 - power cycle, 2 - software restart, Live is tmr.time() counter in seconds.

It seems the device has had a WDT restart...
2020-09-11 18:38:08 Topic: 9741000/data Qos: 0 {"Serial": "9741000", "Boot reason": "4", "Heap": "32512", "rssi": "-81", "Live": 6968, "powered by Lua 5.3.5 on SDK 3.0.4(9532ceb)"}

@KT819GM

KT819GM commented Sep 11, 2020

> It seems the device has had a WDT restart...

Yeah, because "somebody" have done mqtt publish without checking if mqtt is available at all 😄. I've left it on battery and sadly can't fix that now. Still, if wifi will fail it should not reconnect even after restart.

@chathurangawijetunge
Author

chathurangawijetunge commented Sep 12, 2020

> > It seems the device has had a WDT restart...
>
> Yeah, because "somebody" have done mqtt publish without checking if mqtt is available at all 😄. I've left it on battery and sadly can't fix that now. Still, if wifi will fail it should not reconnect even after restart.

True, but in my experience this error happens only after more than about 24 hours, so if the device reboots in between, it might not show up.

@chathurangawijetunge
Author

chathurangawijetunge commented Sep 12, 2020

Not sure if this is related to this issue, but I have noticed that wifi.sta.clearconfig() doesn't clear the MAC address of the previously connected router.
And once I get the error '201 AP not found',

function listap(t)
    for k,v in pairs(t) do
        print(k.." : "..v)
    end
end
wifi.sta.getap(listap)

will do nothing...

wifi.sta.clearconfig() should clear all WiFi config, but after wifi.sta.clearconfig() and wifi.sta.getdefaultconfig() I get

ssid ="" (empty string)
pwd="" (should be nil)
bssid_set =0
bssid = 34:e8:94:04:7a:a0 (old bssid; should be ff:ff:ff:ff:ff:ff)

@chathurangawijetunge
Author

> > It seems the device has had a WDT restart...
>
> Yeah, because "somebody" have done mqtt publish without checking if mqtt is available at all 😄. I've left it on battery and sadly can't fix that now. Still, if wifi will fail it should not reconnect even after restart.

Your device is restarting, not running continuously.

@chathurangawijetunge
Author

chathurangawijetunge commented Sep 12, 2020

Even with Lua 5.3 the WiFi issue is still there. I guess I have to stick with 3.0-master_20190907 for my projects 🙁 where WiFi is stable.

@TerryE
Collaborator

TerryE commented Sep 14, 2020

@chathurangawijetunge, are you saying that 3.0-master_20190907 doesn't manifest this issue but 3.0-master_20200610 does? If so this piece of data will help to work out any underlying failure.

@KT819GM

KT819GM commented Sep 14, 2020

> Your device is restarting, not running continuously.

To be honest, this is the first time I have seen a watchdog restart, and I think it came from the part of the code I added from your example, tmr.softwd(30). Still, let's consider the test unsuccessful and move on.

@TerryE Please consider checking WiFi while UART writing is in progress: I can reliably get disconnections when simply transferring files to SPIFFS. It reconnects most times, but sometimes it does not.

@chathurangawijetunge
Author

chathurangawijetunge commented Sep 15, 2020

> @chathurangawijetunge, are you saying that 3.0-master_20190907 doesn't manifest this issue but 3.0-master_20200610 does? If so this piece of data will help to work out any underlying failure.

Yes @TerryE, I have had devices running on 3.0-master_20190907 for over 6 months without any issue.
But 3.0-master_20200610 gets WiFi disconnects, sometimes in about 2 hours and sometimes in 24 hours or more.
Yesterday I even bought a new ESP-12 and tested it. It got disconnected after about 14 hours with error 201 AP not found (only a power cycle or hard reset reconnects).
(3.0-master_20190907 is working perfectly.)
modules: file,gpio,mqtt,net,node,rtctime,sntp,tmr,uart,wifi

@pjsg
Member

pjsg commented Sep 16, 2020

I'm going to try the example above and see if it does anything strange. I'm using a regular nodemcu board with nothing attached.

However, I've been running a node off the dev branch for a while and after fixing the issue with spiffs overwriting the config, it has been rock solid.

@TerryE
Collaborator

TerryE commented Sep 16, 2020

Thanks Philip 😊

@pjsg
Member

pjsg commented Sep 16, 2020

After 24 hours, it is still running fine. Note that this code doesn't do anything except check for the status of the wifi. Some of the comments in this thread associated the failures with writing to spiffs.

Another aspect that is different is the radio environment. I have a number of Ubiquiti APs and I have pretty strong signal and I'm running WPA2.

When this fails, what environment does it fail in? @chathurangawijetunge

@chathurangawijetunge
Author

chathurangawijetunge commented Sep 17, 2020

> After 24 hours, it is still running fine. Note that this code doesn't do anything except check for the status of the wifi. Some of the comments in this thread associated the failures with writing to spiffs.
>
> Another aspect that is different is the radio environment. I have a number of Ubiquiti APs and I have pretty strong signal and I'm running WPA2.
>
> When this fails, what environment does it fail in? @chathurangawijetunge

My WiFi settings are as follows:
RTS/CTS Threshold = 2347
Wireless Mode = 802.11b+g+n
Channel Bandwidth = 20/40 MHz
Authentication Type = WPA2-PSK
Encryption = AES

I am also running the following code to check the switch status with a 50 ms timer:

if #(Out_Pin or {})~=3 then
Out_Pin={}
      Out_Pin[1] = 5 --GPIO-14  
      Out_Pin[2] = 6 --GPIO-12
      Out_Pin[3] = 7 --GPIO-13
else print("'user Define out put pin set") end

--if table.getn(Sw_Pin or {})~=3 then
if #(Sw_Pin or {})~=3 then
Sw_Pin={}
      Sw_Pin[1] = 0 --GPIO-4
      Sw_Pin[2] = 1 --GPIO-5
      Sw_Pin[3] = 2 --GPIO-16      
else print("'user Define Switch pin set") end
        
Timer_status = {} 
local Sw_Master = 3  -- GPIO0 and (+3.3v) 
local prese_ctn=0
----------------------------------------------------------------------
gpio.mode(Sw_Master,gpio.INPUT)
gpio.write(Sw_Master,0)
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if #(file.getcontents("led") or "")~=6 then file.remove("led") end
if file.open("led", "r") then
   for i=1, 3, 1 do
     gpio.write(Out_Pin[i],string.gsub(file.readline(),"\n",""))
   end     
   file.close()
end
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
function write_led_Status()
   file.open("led", "w")  
   for i=1,3, 1 do
     file.writeline(Timer_status[i]==nil and gpio.read(Out_Pin[i]) or 0)
   end
   file.close()
end
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
local function manual_on_off(pin) 
     gpio.write(Out_Pin[pin],gpio.read(Out_Pin[pin]) == 1 and 0 or 1)
     write_led_Status() 
     pcall(LED_ON_OFF,pin,gpio.read(Out_Pin[pin]) == 1 and "on" or "off",1)
end
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
local Sw_Clicks=0
local mytimer = tmr.create()
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
local mytimer1 = tmr.create()
local sw_sta=gpio.read(Sw_Master)
local debounce=0
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
local function check_master() 
   local function Process_switch_Press()
         if prese_ctn>=10 then 
            pcall(system_reboot,1)                  
         elseif prese_ctn>=1 and prese_ctn<=3 then
            manual_on_off(prese_ctn)
         elseif prese_ctn~=0 then
            pcall(Beep,3)
         end
         prese_ctn=0 
   end                
   if sw_sta==0 and gpio.read(Sw_Master)==1 and math.abs(tmr.now()-debounce)>250000 then
      debounce=tmr.now()
      prese_ctn=prese_ctn+1
      pcall(Beep,1)
      mytimer1:alarm(750, tmr.ALARM_SINGLE, function () 
           if gpio.read(Sw_Master)==1 then
              local long_press=tmr.time()
              mytimer1:alarm(500,1,function(t)
                 if math.abs(tmr.time()-long_press)==4 then 
                    t:stop() prese_ctn=0
                    pcall(Go_AP_Mode) 
                 elseif gpio.read(Sw_Master)==0 then
                      t:stop()
                      Process_switch_Press()
                 end
              end) 
           else    
           Process_switch_Press()
          end     
      end)
   end 
   sw_sta=gpio.read(Sw_Master)
end
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
local blink_ctn=0
tmr.create():alarm(50, 1, function()
    check_master()
    blink_ctn=blink_ctn>8 and 0 or blink_ctn+1
    for i=1,3, 1 do
        if Timer_status[i]=="timer" then
           gpio.write(Sw_Pin[i], blink_ctn<=4 and 1 or 0)
        else
           gpio.write(Sw_Pin[i],gpio.read(Out_Pin[i])==1 and 0 or 1) 
        end
    end 
end)
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I will remove the code above, check whether this problem is related to it, and report back.

@pjsg
Member

pjsg commented Sep 17, 2020 via email

@TerryE
Collaborator

TerryE commented Sep 17, 2020

I'm wondering if this is due to your 50ms timer

You are calling a pretty complicated function with various nested function calls every 50 ms. What happens if its execution time is near or over 50 ms? You will always have a task ready to run and so are breaking the SDK scheduling rules, as you are starving the WiFi stack of the ability to run low priority housekeeping. My FAQ and the SDK API guides warn that this might happen.

I am really tempted to close this unless you can do what the issue template asks for and that is to provide a minimal complete example that shows the failure mode.

@pjsg, the task scheduling rules are what they are. IMO, it would be impractical to try to detect when a Lua developer isn't following them.

@chathurangawijetunge
Author

> I'm wondering if this is due to your 50ms timer
>
> You are calling a pretty complicated function with various nested function calls every 50 ms. What happens if its execution time is near or over 50 ms? You will always have a task ready to run and so are breaking the SDK scheduling rules, as you are starving the WiFi stack of the ability to run low priority housekeeping. My FAQ and the SDK API guides warn that this might happen.
>
> I am really tempted to close this unless you can do what the issue template asks for and that is to provide a minimal complete example that shows the failure mode.
>
> @pjsg, the task scheduling rules are what they are. IMO, it would be impractical to try to detect when a Lua developer isn't following them.

I understand, but if this code has worked on 3.0-master_20190907 continuously for many months, why not with the new firmware?
As I said before, I have removed the code above and have 3 ESP-12 modules running now; I will post the results.

@marcelstoer
Member

> provide a minimal complete example that shows the failure mode

👍

> if this code works in 3.0-master_20190907 for continually over many months why not with the new firmware...?

That's a totally different question - a valid one, but kind of OT here. Nothing is ever going to be infinitely backwards compatible. Either our code or the Espressif SDK may change the behavior of your code. For our code we strive to mention breaking changes in the release notes.

@TerryE
Collaborator

TerryE commented Sep 18, 2020

> I understand... but if this code works in 3.0-master_20190907 for continually over many months why not with the new firmware...?

Feel free to ask the Q, and even try to answer it yourself. However if you want one of the maintainers to answer it for you and to fix the issue, then the first step is (as we ask) to supply a minimal, complete, and verifiable example that we can use to examine the core issue and determine a fix.

@KT819GM

KT819GM commented Sep 19, 2020

First, with all respect to the devs: don't take this as a "cry to developers / hammer developers to find a non-existing bug" thread. I've spent the last two days checking the commit history, so on the bright side I have learned to use git more than I thought I would ever need. As you already know, I'm also facing issues with WiFi, but unlike @chathurangawijetunge I can't tell exactly when I started facing them. I will try to describe this in as much detail and as properly as my non-native English allows:

  1. Like the thread OP, I'm facing issues with the WiFi module that recur rarely and are not consistently reproducible, which makes it hard to provide a verifiable example. Mostly:
     • wifi.eventmon.reason.NO_AP_FOUND 201
     • wifi.eventmon.reason.ASSOC_FAIL 203
  2. These events fire when the CPU is under heavy load, with timers firing repeatedly and, in my case, badly written code. Still, I'm fairly convinced that the C side of the Lua firmware should be responsible for maintaining the WiFi connection, regardless of whether the Lua part was written by a highly skilled dev like Terry or by me. Bad Lua code should be rewarded with a panic and reboot.

Leaving literary part aside

After Philip engaged in this thread, I removed most of the static WiFi config and left the first-time connection to be made by enduser_setup.start(), with these static config leftovers:

wifi.setmode(wifi.STATION)
wifi.sta.autoconnect(1)
wifi.sta.sethostname("TLStest")
wifi.setcountry({
    country = "LT",
    start_ch = 1,
    end_ch = 13,
    policy = wifi.COUNTRY_MANUAL
})

wifi.eventmon.register(wifi.eventmon.STA_CONNECTED, function(T)
    print("\n\tSTA - CONNECTED" .. "\n\tSSID: " .. T.SSID .. "\n\tBSSID: " ..
              T.BSSID .. "\n\tChannel: " .. T.channel)
end)
wifi.eventmon.register(wifi.eventmon.STA_DISCONNECTED, function(T)
    print("\n\tSTA - DISCONNECTED" .. "\n\tSSID: " .. T.SSID .. "\n\tBSSID: " ..
              T.BSSID .. "\n\treason: " .. T.reason)
    _G.connectedToMqtt = false
end)

I compiled the firmware with Lua 5.3, with 0x40000 for LFS and 0x80000 for SPIFFS (SPIFFS_FIXED_LOCATION and SPIFFS_SIZE_1M_BOUNDARY), with 28 modules, and with SSL and TLS enabled but not used in this code. I added Lua code which uses 2 tmr[auto] timers, connects to a non-TLS MQTT server and every 10 seconds sends a string of < 100 bytes of data into LFS. I left it to send data and logged it into "influxdb". After ~18 hours it stopped sending data, and on the UART I got code 203. For comparison, a few other devices were connected to the same network (some of them with exactly the same firmware, some with Arduino) and the same MQTT server, and they were still sending data. I'm pretty sure it can be stated that this wasn't the WiFi router's fault.

Moving further
I've observed some random disconnects with failure 201 when I was uploading a file into SPIFFS while Lua code was already running. I thought it was a coincidence, but I was able to repeat it, and on rare occasions it ended in a catastrophic failure where only a hard reset helped (power disconnect / reconnect). I still have no exact procedure to repeat this failure, as it seems to depend on the code running on the ESP8266, and the disconnect still occurs randomly. So I started looking for changes made to wifi.c in the commit history with git log --follow wifi.c. To me, as a non-programmer, it didn't make much sense, but I still noticed:

commit 98e428f12edb7869993b5fa3d0eda3976f52a8f4
Author: Terry Ellison <Terry@ellisons.org.uk>
Date:   Fri May 15 12:45:54 2020 +0100

    Update wifi..c to fix #3106

Which led me to read about:

Wifi resume occurs asynchronously, this means that the resume request will only be processed when control of the processor is passed back to the SDK (after MyResumeFunction() has completed). The resume callback also executes asynchronously and will only execute after wifi has resumed normal operation.

and

// If your application uses the light sleep functions and you wish the
// firmware to manage timer rescheduling over sleeps (the CPU clock is
// suspended so timers get out of sync) then enable the following options

Though I'm not using light sleep, "timers get out of sync" still makes a lot of sense to me for this problem, and could explain the problems which it seems only the thread OP and I are facing.

So, from all these most likely flawed observations: could any C dev check whether, if some bad Lua user code hogs CPU cycles (so that Wi-Fi events on the C side should have fired but failed because the CPU was busy), the Lua interpreter still gets the callback from the C module? And if not, maybe that is why it does not fire a reconnection, leaving the module in a not-connected state (201, 203) with eventmon showing a failure status? Also, what should I use to try to reconnect when a failure code occurs if I'm not using wifi.sta.config()?

Thank you to whoever takes a look at this "essay".

@TerryE
Collaborator

TerryE commented Sep 19, 2020

@KT819GM Modestas, I really appreciate this type of constructive feedback.

A couple of comments:

  • Forget about 98e428f. This was a one-line change to fix a compile error introduced when I was making the modules compilable under both Lua 5.1 and 5.3. (I missed this because I don't personally use the wifi.eventmon feature and it's not enabled by default.)
  • I discuss the SDK in my Lua Developer FAQ. Well worth a read, even though it is time for a major update. The SDK uses a non-preemptive FIFO-within-priority scheduler. Leaving aside the small realtime ISRs, everything is split into tasks which run to completion. 3 of the 32 priority levels are allocated to application use (our low, medium and high priorities). Most SDK services run at a higher level than these and will be scheduled preferentially, but some run below these, so if we continually repost at an application priority, then we can starve out some of the lower priority SDK housekeeping tasks. I note that wifi_eventmon.c breaks these rules. If you want to repost continually then you should use a timer to repost rather than task.post(). Even a 10 mSec gap will be enough to allow any pending low priority task to start, and thus avoid this starvation.
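
A minimal sketch of the timer-based repost described above, for illustration only and not taken from the NodeMCU sources. The names run_chunked and timer_schedule are hypothetical, and the 10 ms gap is simply the figure mentioned in the comment. The work loop takes its scheduler as a parameter so the logic can be exercised off-device; on the ESP8266 the scheduler would be a one-shot tmr alarm.

```lua
-- run_chunked(items, step, schedule): handle one item per task, then
-- hand control back before the next one. `schedule(fn)` must arrange
-- for fn to run later (hypothetical helper, not a NodeMCU API).
local function run_chunked(items, step, schedule)
  local i = 0
  local function next_chunk()
    i = i + 1
    if i > #items then return end  -- all items done, stop reposting
    step(items[i])
    schedule(next_chunk)           -- repost only via the scheduler
  end
  schedule(next_chunk)
end

-- Device binding (NodeMCU only): a 10 ms one-shot timer in place of an
-- immediate task repost, so pending low-priority SDK tasks can start.
local function timer_schedule(fn)
  tmr.create():alarm(10, tmr.ALARM_SINGLE, fn)
end
```

On a device this might be called as run_chunked(readings, publish, timer_schedule), where readings and publish stand in for whatever data and handler the application already has.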

Incidentally my time on the project is pro-bono as and when available; I am currently having some yard-work done by some contractors and doing some of the associated tasks myself, so my NodeMCU work is itself being starved out a bit until this work is concluded. I will post further when I have time 😄

@TerryE
Collaborator

TerryE commented Sep 19, 2020

@KT819GM see #3285. Try doing what I do, which is to leave wifi.eventmon disabled and roll your own monitor code in Lua on a 100 mSec timer, say, and see if this removes the issue. Thanks Terry.
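
A rough sketch of what such a polling monitor could look like, under stated assumptions: this is not Terry's actual code, check_link and MAX_BAD are hypothetical, and the policy of forcing a reconnect after about 10 seconds offline is an illustrative choice. Given the reports above that only a power cycle restored the link, it may well not cure the fault; it only shows the polling pattern. The decision logic is kept separate from the device calls so it can be run without hardware.

```lua
-- Decide what to do for one poll.
--   state:   table holding `bad`, the count of consecutive bad polls
--   status:  result of wifi.sta.status()
--   got_ip:  the "connected with IP" status value (wifi.STA_GOTIP)
--   max_bad: bad polls tolerated before asking for a reconnect (assumed)
local function check_link(state, status, got_ip, max_bad)
  if status == got_ip then
    state.bad = 0
    return "ok"
  end
  state.bad = state.bad + 1
  if state.bad >= max_bad then
    state.bad = 0
    return "reconnect"
  end
  return "wait"
end

-- Device binding: only runs where the NodeMCU wifi and tmr modules exist.
if wifi and tmr then
  local state = { bad = 0 }
  local MAX_BAD = 100  -- 100 polls x 100 ms, roughly 10 s offline (assumption)
  tmr.create():alarm(100, tmr.ALARM_AUTO, function()
    local action = check_link(state, wifi.sta.status(), wifi.STA_GOTIP, MAX_BAD)
    if action == "reconnect" then
      wifi.sta.disconnect()
      wifi.sta.connect()
    end
  end)
end
```

Each timer callback does a single status read and returns immediately, so the monitor itself leaves the SDK plenty of time between polls.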

@KT819GM

KT819GM commented Sep 19, 2020

@KT819GM see #3285. Try doing what I do, which is to leave wifi.eventmon disabled and roll your own monitor code in Lua on a 100 mSec timer, say, and see if this removes the issue. Thanks Terry.

Thank you, I will disable wifi.eventmon and leave a few devices running with wifi.sta.autoconnect() enabled on the latest dev.

Incidentally my time on the project is pro-bono as and when available; I am currently having some yard-work done by some contractors and doing some of the associated tasks myself, so my NodeMCU work is itself being starved out a bit until this work is concluded. I will post further when I have time 😄

Seems it would be better for some of us to come and do some of your yard-work, so you would have more time for NodeMCU things. I'm pretty sure I would currently be better at digging than I am at programming 😄

@chathurangawijetunge
Author

chathurangawijetunge commented Oct 5, 2020

I have no idea what you guys have done to fix this issue, or with other fixes, but I'm so happy to report that the new dev commit ebfce4a9111dd6e1c1c35352c2198a8f566639f8 is working fine with the same old code of mine, without any Wi-Fi issue for over 3 days now.

@nwf
Member

nwf commented Oct 5, 2020

@chathurangawijetunge We have, I think, done nothing to address this issue, which lends credence to the theory that your code is treading dangerously close to instability by occupying so much CPU time and denying the Espressif SDK stack the opportunity to run its tasks. If you have done nothing to correct your code, you should expect it to break again in the future, and I would ask you to please not file a similar issue with us until you can persuasively argue that your code is not starving the Espressif stack.

@nwf nwf closed this as completed Oct 5, 2020
@pjsg
Member

pjsg commented Oct 5, 2020

I think that all our tasks run below the Espressif tasks. I suspect that the root cause was the missing IRAM_CACHE_ATTR on one of the functions being called at interrupt level.
