Categorising modules as EC or non-EC and non-EC compliant #991

Closed
TerryE opened this issue Jan 30, 2016 · 5 comments

@TerryE
Collaborator

TerryE commented Jan 30, 2016

We've had various discussions about the Espressif timing guidelines (e.g. the maximum 50 µSec period for interrupt-masked code) and the fact that many bit-banging modules ignore them.

My suggestion is that we define a criterion, "Espressif Compliant" (EC), meaning a module which conforms to these guidelines and which can therefore interoperate safely with WiFi and Flash functions. Modules such as RC which don't conform should all be flagged as non-EC in their documentation, along with a guideline on when and how to use such modules, if at all.

We could also do with an investigation / dialogue with Espressif on how to mitigate the impact of using a non-EC module - for example either

  • restart the ESP immediately after completion, or
  • close all timers, network callbacks, WiFi, TCP + UDP sockets and listeners before calling the module, then reset the station mode on return to reinitialize the network stack, and then re-establish any listeners etc.

Clearly some exploration is needed to work out exactly what needs to be done in the second case, but our goal here should be well-determined stability: if you want a stable system and you need to use a non-EC module, then you need to do XYZ before and after calling it, and here are the constraints / consequences.
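The "do XYZ before and after" pattern in the second option can be sketched as a small wrapper. This is only an illustration under stated assumptions: `non_ec_guard_t`, `run_non_ec` and the `demo_*` stubs are hypothetical names, not part of the SDK or of NodeMCU, and what the quiesce and restore steps must actually do is exactly the open question raised above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks; neither name exists in the SDK or in NodeMCU. */
typedef struct {
    void (*quiesce)(void *ctx); /* e.g. stop timers, close sockets and listeners */
    void (*restore)(void *ctx); /* e.g. reset station mode, recreate listeners */
} non_ec_guard_t;

/* The non-EC operation itself; returns 0 on success. */
typedef int (*non_ec_op_t)(void *ctx);

/* Run op between quiesce and restore. restore runs even if op fails,
 * so the network is always brought back. */
static int run_non_ec(const non_ec_guard_t *guard, non_ec_op_t op, void *ctx)
{
    assert(guard != NULL && op != NULL);
    if (guard->quiesce) guard->quiesce(ctx);
    int rc = op(ctx);
    if (guard->restore) guard->restore(ctx);
    return rc;
}

/* Host-side stubs that only record the order of the calls. */
typedef struct {
    char log[4];
    size_t len;
} demo_trace_t;

static void demo_note(void *ctx, char c)
{
    demo_trace_t *t = (demo_trace_t *)ctx;
    if (t->len < sizeof t->log - 1) t->log[t->len++] = c;
}

static void demo_quiesce(void *ctx) { demo_note(ctx, 'Q'); }
static void demo_restore(void *ctx) { demo_note(ctx, 'R'); }
static int demo_module(void *ctx) { demo_note(ctx, 'M'); return 0; }
```

Keeping the two steps in one guard structure means a module's documentation could name a single entry point rather than a checklist, but that is a design preference, not something the thread settles.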

@pjsg
Member

pjsg commented Jan 30, 2016

I think that trying to find out from Espressif why the recommendations are set the way that they are would also be important. 50us is a remarkably round number. Is there really a bit of hardware that needs poking within 50us? Or is this the constraint that if you don't want to lose packets, you have to service the RF component within that time?

I've been thinking of putting together the ultimate non-EC module that would allow easy experimentation. For example

hacker.delaymasked(100 [, mask]) -- wait for 100 us with interrupts masked
hacker.peek(address) -- read location
hacker.poke(address, value) -- write value to location

I'm not sure what the priority is of the interrupts, and it might make a difference which interrupts are masked off.

@nickandrew
Contributor

Ideally if a module disobeys the guidelines, nothing worse should happen (ignoring the wdt reset) than possibly losing WiFi packets. Radio is inherently imperfect; packets can be lost or corrupted in transit at any time and the protocols take that into account. But, I guess that's not the world we live in.

I too would like to know what it's doing under the covers that creates these time constraints. The end result of this plus non-preemptible user programs is a fragile platform, one which can't be considered, as Philip mentioned the other day, "production ready".

@nickandrew
Contributor

A sensible design, I would have thought, would run the time-critical WiFi operations in interrupt handlers at a minimum, and have the "userspace" preemptible by those and by timer interrupts. That would allow all network functions to operate unimpeded even if the userspace does "for (;;) {i++}".

But maybe the nonOS SDK is just an afterthought; good enough to run the AT firmware.

@TerryE
Collaborator Author

TerryE commented Jan 30, 2016

The 50 µSec figure is almost certainly probabilistically informed. Say some interrupts need a response within 250 µSec: with a 50 µSec max response the MTBF is 168 hrs, say, but with 100 µSec this will fall to 30 mins or whatever. If the failures are soft failures -- that is, an individual operation fails but the retry succeeds -- then this is a minor issue, but I've seen issues reported of the network latching into a failed state and panics occurring.
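To put rough numbers on the MTBF argument, here is a back-of-envelope sketch. It assumes independent deadline-critical events at a fixed rate and treats a single missed deadline as a failure; the 1000 events per second used in the note below is an arbitrary assumption, and the 168 hr and 30 min figures are only the illustrative values quoted above, not measurements.

```c
#include <assert.h>

/* Per-event miss probability implied by a given MTBF, assuming
 * independent events at a fixed rate and one miss == one failure. */
static double miss_probability(double events_per_second, double mtbf_s)
{
    assert(events_per_second > 0.0 && mtbf_s > 0.0);
    return 1.0 / (events_per_second * mtbf_s);
}

/* Inverse relation: MTBF in seconds for a given miss probability. */
static double mtbf_from_miss_probability(double events_per_second, double p_miss)
{
    assert(events_per_second > 0.0 && p_miss > 0.0);
    return 1.0 / (events_per_second * p_miss);
}
```

Under these assumptions, at 1000 events per second an MTBF of 168 hrs implies a miss probability of about 1.7e-9 per event and 30 mins implies about 5.6e-7, a factor of 336 regardless of the assumed rate. The point is only that a modest change in worst-case latency can move the tail probability, and hence the MTBF, by orders of magnitude.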

Nick, your scenario is typical of a round-robin scheduler, as found on interactive *nix systems. But the non-OS SDK is non-preemptive.
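What non-preemptive means in practice can be shown with a toy run-to-completion dispatcher on simulated time. This is a sketch, not the SDK's actual task mechanism: `run_to_completion`, `short_task` and `bit_bang_task` are hypothetical, all tasks are assumed to be posted at time zero, and the durations are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A task runs to completion and returns its simulated run time in us. */
typedef uint32_t (*task_fn)(void);

/* Run the tasks in queue order with no preemption. start_delay_us[i]
 * receives how long task i waited before it started; the return value
 * is the worst such wait. */
static uint32_t run_to_completion(const task_fn *tasks, size_t n,
                                  uint32_t *start_delay_us)
{
    assert(tasks != NULL && start_delay_us != NULL);
    uint32_t now = 0, worst = 0;
    for (size_t i = 0; i < n; i++) {
        start_delay_us[i] = now;
        if (now > worst) worst = now;
        now += tasks[i](); /* nothing can interrupt this call */
    }
    return worst;
}

/* Made-up workloads: a quick housekeeping task and a long bit-bang. */
static uint32_t short_task(void) { return 10; }
static uint32_t bit_bang_task(void) { return 500; }
```

With the queue `short_task, bit_bang_task, short_task`, the last task cannot start until 510 us have passed, however urgent it is. A preemptive scheduler would bound that wait by a time slice instead.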

@stale

stale bot commented Jun 7, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 7, 2019
@TerryE TerryE closed this as completed Jun 8, 2019