Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sntp.sync in progress #3044

Closed
danjohn opened this issue Feb 21, 2020 · 20 comments
Closed

sntp.sync in progress #3044

danjohn opened this issue Feb 21, 2020 · 20 comments

Comments

@danjohn
Copy link

danjohn commented Feb 21, 2020

Actual behavior

Calling sntp.sync() breaks the box, issuing "PANIC: unprotected error in call to Lua API (samwere.lua:87: sync in progress)"

Expected behavior

It's fine, that smtp.sync() observes an ongoing time sync attempt, but should keep calm (and maybe return an error), but definitely should not panic (+restart the device)!

More complicating, there is no way(?) to find out if there is a sync on the way (there are several async code paths in that very project), other than try&crash. Not very deterministic.

One more complicating, even the suggested pcall(sync) freaks out the same way (how can that occur?)

Test code

sntp.sync()
xpcall(sntp.sync(nil, function() print"synced" end, nil), nil)

NodeMCU startup banner

NodeMCU 3.0.0.0 built on nodemcu-build.com provided by frightanic.com
	branch: master
	commit: 310faf7fcc9130a296f7f17021d48c6d717f5fb6
	release: 3.0-master_20190907
	release DTS: 201909070945
	SSL: true
	build type: float
	LFS: 0x20000
	modules: adc,coap,cron,crypto,dht,file,gpio,http,mdns,net,node,perf,rtcfifo,rtctime,sjson,sntp,tmr,uart,wifi,tls
 build 2019-09-23 05:80 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)
@nwf
Copy link
Member

nwf commented Feb 21, 2020

The existing sntp code is rather messy; IMHO #2819 is the way forward, but someone needs to finish the work.

@pjsg
Copy link
Member

pjsg commented Feb 21, 2020

I think that your use of xpcall is incorrect -- it ought to be:

sntp.sync()
xpcall(function() sntp.sync(nil, function() print"synced" end, nil) end, nil)

otherwise the second sntp.sync is executed as part of computing the arguments to the xpcall.

@danjohn
Copy link
Author

danjohn commented Feb 22, 2020

@pjsg: You mean, encapsulating a function call into another (lambda) function would help in catching panics? Well, I'll hopelessly try..., maybe I misunderstood your hint? TY!

@nwf: The error "handling" in smtp seems to stem from the dark mideval where ERROR=>PANICK. Can that easily be patched bevore the next Milestone lands (only just the panic/crash thing (leaving the system alive!), not touching newer enhancements)?

@nwf
Copy link
Member

nwf commented Feb 22, 2020

You can flatten that xpcall down to xpcall(sntp.sync, nil, function() print"synced" end), since that's how xpcall works: it feeds its 2nd-and-later arguments to its first argument. No need for a lambda. Otherwise, tho', @pjsg is correct and your use of xpcall is wrong, as xpcall(foo, ...) evaluates foo before the guard is in place (as you might imagine, this being a call-by-value language).

@danjohn
Copy link
Author

danjohn commented Feb 22, 2020

Ok, let's flat this down to even:

sntp.sync()
sntp.sync() --> Panic

IMHO calling a core function (twice) simply shouln't kill the device - no more, no less.

What about simply

+++ /home/ESP/build/sntp.c	2020-02-22 14:54:38.516989797 +0100
@@ -764,7 +764,7 @@
 #define sync_err(x) do { errmsg = x; goto error; } while (0)
 
   if (state)
-    return luaL_error (L, "sync in progress");
+    return //luaL_error (L, "sync in progress");
 
   char *state_err;
   state_err = state_init(L);

I didn't dig deeper to decide if that can break other functionalities (as am not in the position to know the design-in-mind of the originator of that code).

Your thoughts, bad/good idea for the next housekeeping round?

@nwf
Copy link
Member

nwf commented Feb 22, 2020

I would reject that patch. Functions, core or not, must be allowed to raise errors, and asynchronous handlers must be coded to deal with them. Documentation patches to correct examples would be very welcome, tho'.

@danjohn
Copy link
Author

danjohn commented Feb 28, 2020

I agree with you @nwf, that just commentig out obnoxious parts of the core is a strict 'no go'.
The arguments that let me judge differently in that part are:

  • there's no way for async handlers (=lua) to deal with that (maybe arising) situation.
  • as you stated above ("sntp is messy"): That part might need some rework someday; my suggestion was just an emergency fix, to stabilize working NodeMCUs around the globe.
  • if I understood that logic right, that was a design decision made in the early burn&crash days, that doesn't need to be strictly prolongued, as it more makes things worse than helps debugging now?

Another approach would be to make state (off the code excerpt above) visible to the lua layer, to catch it bevore calling-to-crash: but there's nothing to do with it other than "if sntp.state()~=progressing then sntp.sync(..)"... this shold be core logic IMHO.

I simply suggest to drop the exception in that case (when an ongoing sync likes to tell "WIP, don't bother me further", it shouldn't raise an exception, but just keep silently working it out). But I'm a late adaptor, not seeing the big picture, just making (maybe silly) assumptions!

"Docs patch": Well, that would be a red box saying "call 'sntp.sync' with care, as it might crash your box".. No offence ;-)

@nwf
Copy link
Member

nwf commented Feb 28, 2020

There is a way to deal with it, and that is pcall(sntp.sync, nil, function() print"synced" end) (note the lack of parentheses after sntp.sync there!), as we said earlier. This is the standard Lua mechanism for dealing with functions that raise exceptions and there's no reason to think that nodemcu should be different. Even if we do ultimately add a top-level exception handler callback (see, e.g., #2885), one will likely still want to reboot the device if that fires, because it's the only safe default.

The reason to not just silently swallow sntp.sync() when syncs are in progress is that you might want to know that a sync has stalled. You probably shouldn't be calling sntp.sync() sufficiently often that it overlaps with itself anyway, unless something has really gone wrong. IMHO rebooting is perhaps not the worst outcome from this situation anyway!

I'm not opposed, in principle, to exposing the syncing state of the sntp module to Lua, but I don't think if not sntp.syncing then sntp.sync() is any better than pcall(sntp.sync) and it would need documentation anyway.

Speaking of documentation, the documentation update could actually be quite a bit more specific than a warning box; in particular, sntp.sync should document that it throws this exception (and any others), and uses of sntp.sync should be wrapped in pcall.

@danjohn
Copy link
Author

danjohn commented Feb 28, 2020

TY, recoding the project.

All (most) sntp.sync calls are (now) protedected calls (with timed recalls on fail). Let's see, my hope has lowered. Trying.
In that context, also all calls to core (net.* , wifi.* , telnet.lua, all core?) will have to be protected as well, as they rise (unprotected) mem affords?
Oh my.
NodeMCU mesh will take a another my year...

BTT: Can you state a link where I can take part of editing the docs? I'm frustated on criss&cross searching on Github already, and magically forwarding to readthedocs... serious.
Sorry, non-regular player. ;)

@nwf
Copy link
Member

nwf commented Feb 28, 2020

You'll only need to pcall things that can raise exceptions and for which crashing/rebooting on failure is not the desired action. In general, exceptions are there to indicate programmer error or a pretty serious problem. (I continue to think that colliding sntp.syncs is a problem as it indicates that the network is pretty wedged or you're sampling way too fast.)

You should not expect that you will be able to respond in Lua to memory allocation failures: invoking the handler will likely attempt to allocate memory and that's... not likely to work out well. It is possible that large allocation failures will have left enough heap around for the failure path to run, but don't bet on it. The small heap size is definitely a leaky abstraction on NodeMCU; use LFS aggressively and try to keep an eye on node.heap() as you test parts of your application.

All exception sources should be documented, morally, but our documentation is... well, patches are welcome. Speaking of, the docs are right here in the source tree and are just mkdocs / Markdown files with some local conventions: https://github.com/nodemcu/nodemcu-firmware/tree/dev/docs .

@pjsg
Copy link
Member

pjsg commented Feb 28, 2020 via email

@danjohn
Copy link
Author

danjohn commented Feb 28, 2020

@pjsg sntp.sync(nil,nil,nil,1):
(Docs say: "synchronization will happen every 1000 seconds") -- that's one of the first things I stumbled upon, stepping into this project five years ago, but never mentioned to makers:
Is it really a good idea, to make all selled (some Mllions, I suppose) MCUs to query pool*.ntp.org all 15 minutes? Keeps the poor boys there from doing something bad, but... shouldn't be default, I feel.

BTT: I saw the problem (racing sntp.sync), when I started up a bunch of MCUs at midnight (I CRON'd them to do so), also having a time sync in boot code. No problem, when the nodes start up scattered over the day, but... unavoidable :crash:. Not a real USP, rather than a game stopper.

Easy thing to avoid, I first thought, but... it's 11 comments later, not to be soluted in near future :)

...(even thought of putting a metatable ( __call) to _G to catch all core raised shortcomings... badest idea ever so far, I know =)

@nwf
Copy link
Member

nwf commented Feb 28, 2020

If you are selling products based on NodeMCU, you should use a different *.pool.ntp.org name in your products so as not to impact our project (and so that other users of our project do not impact your users).

The correct version of pcall was given in the third comment and this is starting to feel like we're retreading rather than making progress.

@danjohn
Copy link
Author

danjohn commented Feb 28, 2020

ntp.org == 'our' project? ;)
Stopping that thread now, for a friendly future outcoming.

What's over:

  • core raises (sometimes mem-bound) issues, uncatchabe on the lua layer
  • core functions should provide all senseful stati to the lua layer to help decisions
  • it's a grown project with sharp edges inside

Nothing new ;-)

@danjohn danjohn closed this as completed Feb 28, 2020
@nwf
Copy link
Member

nwf commented Feb 28, 2020

Just to clarify: no, nodemcu is our project. Using nodemcu.pool.ntp.org in a large number of devices runs the risks that your devices harm our project and, dually, that our users harm your products.

@danjohn
Copy link
Author

danjohn commented Mar 7, 2020

Just to leave a conclusio:
My fail was to think, one could simply wrap an existing error-prone function call into xpcall() (bcs pcall() needs a syntax reorder to tear off the parameters).

I had to accept, that xpcall() i isn't just a magic wrapper for preventing a crash, but has implications to think of firstly -- tha argument is evaluated unprotected at call, bevore the protection greps, so there's nothing won really (in that case). Better break up the code and use pcall() (without x) indeed (as pjsg was nice enough to explain in early thread) and adding an error handler (obvious as anywhere told, uneasy as every lazy coder can tell 😃).

Well, well, well...it took a while looping the concepts thru personal wetware, but was a fine lecture in does&dont's. And some shortcomings remain (in an other thread..)!

TY both @pjsg&@nwf for keeping constitute on a noob!
(And yes nwf, I forgot about that default nodemcu.pool.ntp.org thing, read some yrs ago somewhere in the docs. My fault again, sorry!)

danjohn added a commit to danjohn/nodemcu-firmware that referenced this issue Mar 7, 2020
+Outcomings of <nodemcu#3044>
@danjohn
Copy link
Author

danjohn commented Mar 7, 2020

I think, I've broken the docs (!!! attention) dev...danjohn:patch-3

Anybody help me out that github jungle? ;-)

@HHHartmann
Copy link
Member

Adding an empty line before and after the note might help. But not sure.

Maybe the note about using the default server should be more prominent a bit further up.

Maybe mentioning the error message this issue is about would help others to better understand why protected calling is useful.

danjohn added a commit to danjohn/nodemcu-firmware that referenced this issue Mar 7, 2020
+Outcomings of <nodemcu#3044>
@danjohn
Copy link
Author

danjohn commented Mar 7, 2020

TY4response, @HHHartmann!

"Adding an empty line before and after the note" doesn't help (in the preview, bunches of needed text is striked and shown in red). Maybe that's just the github preview gone mad, but I'm afraid of making such crumble going online.
(dev...danjohn:patch-4 by now)

Any advices from regulars (nudging @pjsg @nwf @TerryE (sorry for that!)), please?
I tried to take part in making a better world (er, docs), but... I definitely stop that show bevore breaking something!

@pjsg
Copy link
Member

pjsg commented Mar 8, 2020

One thing to note is that you wrote smtp rather than sntp!

This is the way to generate a note in the text:

!!! attention

    Secure (`https`) connections come with quite a few limitations.  Please see
    the warnings in the [tls module](tls.md)'s documentation.

I don't know if your title is too long or if the text is not indented -- but I found this example in the mqtt documentation and it formats correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants