Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification on what happens during a terminated activation #1372

Open
josephliccini opened this issue Dec 4, 2018 · 26 comments · Fixed by #1451
Open

Clarification on what happens during a terminated activation #1372

josephliccini opened this issue Dec 4, 2018 · 26 comments · Fixed by #1451

Comments

@josephliccini
Copy link
Contributor

Hi!

Across browsers, we are seeing different behaviors for a terminated activation.

For Chrome:
If a worker is terminated during activation, it is placed back into the 'activating' state on next browser load. The activation handlers are re-run.

For FireFox:
If a worker is terminated during activation, the worker is discarded, and re-installation is required.

For Microsoft Edge:
If a worker is terminated during activation, the worker (on next browser load) is placed into the 'activated' state. The activation handlers are not re-run.

From the spec:
https://w3c.github.io/ServiceWorker/#activation-algorithm

Note: Once an active worker is activating, neither a runtime script error nor a force termination of the active worker prevents the active worker from getting activated.

Chrome's behavior seems to be what I'd expect as a developer, but I don't think that it's well-reflected in the spec.

Thoughts?

Thanks!

@jakearchibald
Copy link
Contributor

I think the Edge behaviour is correct here. I wouldn't expect the activate handler to run more than once.

@wanderview
Copy link
Member

wanderview commented Dec 4, 2018

But if the activate event did not complete, isn't it possible you will proceed with unexpected partial state? That was the reason why we chose to reinstall in firefox. I think just re-running activate event like chrome might be better to avoid using partial state with the old worker, though.

@josephliccini
Copy link
Contributor Author

josephliccini commented Dec 4, 2018

I agree with @wanderview -- the issue is the unexpected partial state.

For a concrete example, let's consider the Workbox precaching activation flow.

During installation, the assets are downloaded into a 'temp' cache.

During activation, the assets are transferred from the 'temp' cache into the 'active' cache.

A workbox service worker only serves assets from the 'active' cache.

If activation does not fully complete, but we move the worker into the active slot, some assets will be un-servable from the 'active' cache, because they did not complete moving from the 'temp' cache.

It would mean that all 'activate' handlers need to be built in a way that is compatible with a partial, non-deterministic completion point. This is possible, but needs to be clearly documented in the spec

@mfalken
Copy link
Member

mfalken commented Dec 6, 2018

This was discussed in #447, #659, #776. The design decision was that "fail to activate" could never happen.

I'm curious how you saw the behavior in Chrome. Chrome's intent is to implement that decision. In most cases, Chrome will not re-run the activate event handler.

BUT. We did run into one problem. During browser shutdown, Chrome closes each tab one-by-one in quick succession. We were seeing activation get triggered right at shutdown, and of course not finishing. The end result was a large % of service workers were activated who never properly finished their activate event handler, which seemed quite unfortunate.

Our hacky solution (since we don't know whether we're in shutdown or not at the layer we trigger activation in) was to delay activation by about 1 sec if it's triggered due to tab close. Then we try to activate. I think we also added an condition that if activation failed due to shutdown (we can sometimes detect it at that point), that the activation attempt is forgotten and it'd run it on the next worker startup.

The better solution is probably to detect shutdown before trying to activate, but we haven't done that.

EDIT: Most discussion was on #659.

@josephliccini
Copy link
Contributor Author

Yes, the issue is around browser shutdown. That is where I have been seeing the unexpected partial state in Edge.

It's this behavior in Chrome I was addressing in the original issue:

I think we also added an condition that if activation failed due to shutdown (we can sometimes detect it at that point), that the activation attempt is forgotten and it'd run it on the next worker startup.

Perhaps the activate handlers are not being re-run, but I am running into this behavior you described (or activation is being properly delayed through your 1s timeout)

Maybe this behavior and the timeout should be standard? Otherwise it seems browser shutdown during activation is going to cause issues for all apps using a service worker, depending on a predictable activated state.

@mfalken
Copy link
Member

mfalken commented Dec 7, 2018

I think probably Chrome should learn when it's in shutdown and not run activate in that case. I'd like us to remove that hack, especially since you noticed this.

The spec could likely say something like the browser may delay running activate if it expects it will not be able to complete, e.g., due to browser shutdown.

Other than browser shutdown, not finishing activate is a bit of an edge case and we can continue with the decision in #115, #447, #659, #776, unless we have new information to overturn that.

@makotoshimazu
Copy link

JFYI, regarding to the 1s timeout hack, I think it now happens on all activation unfortunately for addressing https://crbug.com/879069 with taking care of the browser termination case.

@josephliccini
Copy link
Contributor Author

@mattto I agree with you that terminated activation is an edge case. I think it's a tricky one because the browser is unable to really run code to respond to a terminated activation because the process becomes unavailable.

What is the next step for pitching this for inclusion in the spec?

@jeffposnick
Copy link
Contributor

👋 from the Workbox team.

We just ran into what looks to be someone running into inconsistent cache states due to Firefox failing activation (for reasons unknown) causing the service worker to re-install. Workbox currently should play nicely with Chrome's current behavior around a failed activation (re-firing the activate event) and now that we're aware that it's an issue, we can also adapt to Firefox's current behavior (starting from scratch with a new install event).

What I'm concerned about is that I'm not sure how we can deal with Edge's current behavior of not re-firing either activate or install (followed by activate), especially since it sounds like that's the "correct" behavior that might be standardized. If Edge's behavior were implemented everywhere, would we just have to check for temporary cache entries that need to be cleaned up each time the service worker starts up?

@mfalken
Copy link
Member

mfalken commented Dec 14, 2018

Well backing up a bit, this issue is only about activate and not install. install definitely has a notion of failing. If install failed to finish successfully, that worker will be discarded instead of ever becoming active. activate has no (in the spec) notion of failing. Once activate starts, that worker is already the active worker and is committed to reaching the activated state and start receiving events.

If Edge's behavior were implemented everywhere, would we just have to check for temporary cache entries that need to be cleaned up each time the service worker starts up?

Yes, I think that's right. The Edge behavior is implemented by Chrome too (at least, that's the intent), and is the currently specified behavior. It's only when activation was interrupted by browser shutdown that Chrome might re-run the event handler. So I think sites already needed to deal with the Edge behavior.

My plan is to teach Chrome to not start the activate handler if browser shutdown is what triggered activation (i.e., by closing the last tab that was using the incumbent worker). There is not much spec work required, except perhaps to add some text that browsers might delay activate if they don't expect to be able to finish the event, e.g, because it's during browser shutdown.

@jeffposnick
Copy link
Contributor

It's only when activation was interrupted by browser shutdown that Chrome might re-run the event handler.

What if Chrome's service worker process/thread crashes during activation, due to something outside the control of the JavaScript code being executed during the activate handler(s)? That's what I (think) I'm seeing with Firefox and in that scenario, as described above, Firefox ends up discarding the service worker and going back to install on the next navigation.

We're talking about edge cases here, but those are the most fun to code around, and from the Workbox side of things, we'd like to try to accommodate as many failure scenarios as is viable.

@mfalken
Copy link
Member

mfalken commented Dec 19, 2018

What if Chrome's service worker process/thread crashes during activation, due to something outside the control of the JavaScript code being executed during the activate handler(s)?

Good question! Chrome's algorithm is roughly:

  1. Browser asks the renderer to dispatch the activate event handler to the service worker.
  2. Browser waits for the renderer to report back that event handler finished. This includes detecting that the renderer died or the worker thread died, if the connection is gone the browser knows the event is done. Also the timeout mechanism can forcefully abort the event handler and declare it finished.
  3. Once it finished, the browser writes "activated" to disk.

As an exception in step 3 if the browser detects it's shutting down, it doesn't write "activated" to disk and will retry the next time the worker is used in some browser session.

This mostly matches the spec, except the spec doesn't account for crashing/shutdown between dispatching the activate event and running Update Worker State to activated in https://w3c.github.io/ServiceWorker/#activation-algorithm.

I'd like to change Chrome to remove the shutdown exception and instead only run the activate event if we're not in shutdown already. I could shuffle where we write "activated" to disk but either way there can be an edge case due to this cross-process model. If the browser process writes "activated" once it asks the renderer to dispatch the event, then a browser crash/shutdown occurs, the worker can become "activated" without activate ever firing from the POV of the developer. If the browser process writes "activated" after the event finishes, and a browser crash/shutdown occurs before that, the worker can get an activate event again in the next browser session.

@mfalken
Copy link
Member

mfalken commented Dec 19, 2018

I guess in summary there's no hard guarantees about activate due to the design that "there's no such thing as failing to activate".

@jeffposnick
Copy link
Contributor

All good to know. Thanks!

(On the Workbox side of things, we've got a plan in GoogleChrome/workbox#1793 to move any substantive work out of activate.)

@josephliccini
Copy link
Contributor Author

Does it make sense for the spec to have a note about this?

Something like:

*Note: * Make sure to design your activation handlers to do only non-essential work (like cleanup). This is because activation handlers may not always run to completion, especially in cases of browser termination during activation.

There's probably a better way to say it, but just having a note like this in the spec would help developers and library authors when designing their service worker activation handlers.

@mfalken
Copy link
Member

mfalken commented Dec 22, 2018

That's a great idea. Agreed.

@josephliccini
Copy link
Contributor Author

Finally got around to adding this note to the spec. Opened a PR #1451

@makotoshimazu
Copy link

Regarding to #1372 (comment), I'm wondering if we should explicitly allow retrying activate event when unexpected failure of activate event (e.g. browser shutdown, renderer crashes).
So far, we don't have a safe place to manipulate a chunk of persistent data atomically if we allow activate event termination. Any JS-observable exceptions seem to be out of the scope, but more transaction-like activate event could be more useful and easy to use.
What do you think?

@jakearchibald
Copy link
Contributor

Pre TPAC thoughts:

  • We need to decide what behaviour we want here.
  • We decided that activate rejection shouldn't roll back to the old service worker, since caches have now been updated.
  • However, if activate doesn't run to completion, maybe we should rerun it?

@wanderview
Copy link
Member

We could have some attribute on the event set to true if its a re-run. Might make it slightly easier for sites to handle?

@asutherland
Copy link

Re-run with retrying/similar attribute sounds good.

We should also consider adding something like BackgroundSync's SyncEvent's lastChance attribute as it's unlikely an implementation would want to perform an infinite number of retries and it might help non-bad-actor ServiceWorkers report in that they're having trouble and/or have them skip inessential steps. Bad-actors would presumably just be mitigated by only re-dispatching "activate" when the ServiceWorker would be eligible for having an event dispatched at it normally and maintaining existing lifetime limits based on that.

@asutherland
Copy link

Firefox needs to change its behavior here as part of addressing a shutdown data race that's similar to what @mfalken described in #1372 (comment). Is there interest in moving forward with explicitly standardizing re-run logic as well as potentially exposing retrying and lastChance?

@asutherland
Copy link

For the time being Firefox/Gecko is planning to move to:

  • Inhibit the "activate" event during shutdown, leaving the ServiceWorker in what amounts to its "activating" state where (relevant) functional events are delayed until the ServiceWorker becomes active.
  • The next time a functional event is dispatched at the ServiceWorker, the "activate" will be dispatched. As per the above, the (relevant) functional events will be enqueued until activation completes.
    • The general intent here is to conform to the behavior discussed around background fetch where a ServiceWorker will not be woken up if there isn't a controlled document involved.
    • For the quite common pinned-tab use-case this means that activation will happen when the browser next starts.
  • No retries at this time, although I'll probably structure the metadata so this is possible. (Since the choice is to have the activation happen at startup and thereby block page load, it would be undesirable to open the door to the page becoming permanently broken due to a very long-running activation that keeps being interrupted by the user restarting the browser in the hopes it will make the page load.)

@jakearchibald
Copy link
Contributor

@mfalken does Chrome ever 're-run' in an observable way? As in, is the 'activate' event listener called twice for a single service worker?

@mfalken
Copy link
Member

mfalken commented Feb 19, 2021

Sorry for the delay. Yes I think it happens if an event is dispatched during shutdown, we will dispatch the event again on the next browser session that uses the service worker.

@mangelozzi
Copy link

So is activate callback to say:

  1. The service worker is now activated, is there any code you want to run? If it raises an error then theres no consequence to the service worker still being active. (This is what I think the thread is leaning towards).
  2. Or is it.. Now is your change to activate the service worker, if its fails the service worker will be discarded?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants