Refetch server data from scratch after server upgrade #4793
Comments
Thinking about how often this logic would cause a refetch, I see on the API changelog that the feature level has been moving about every week or two. That's often enough that it could reduce somewhat the benefit of #3916 -- though every 1-2 weeks is still a lot less often than the status quo, for anyone who uses the app frequently.

In principle, we don't actually have any need to refetch on every upgrade -- only on upgrades that change how the app itself would behave through some feature-detection logic. In other words, when the version and/or feature level moves past one of the values that we compare them against somewhere in the app's code. There are only a handful of those per year; so one strategy that would reduce the number of refetches here to a very low level would be to keep all such versions in a central list somewhere in the code, and only trigger a refetch on an upgrade that changes where the version falls vs. that list.
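A minimal sketch of that strategy, with made-up values and hypothetical names (none of this is in the zulip-mobile codebase): keep every feature level the app compares against in one central list, and treat an upgrade as refetch-worthy only if it crosses one of them.

```js
// Hypothetical list of every feature level the app's feature-detection
// logic compares against; the real values would be collected from the code.
const featureLevelsWeCheck = [27, 44, 52, 89];

// True just if moving from `oldLevel` to `newLevel` changes the outcome of
// some feature-detection comparison, i.e. the upgrade crossed a threshold.
function crossesKnownThreshold(oldLevel, newLevel) {
  return featureLevelsWeCheck.some(
    threshold => (oldLevel >= threshold) !== (newLevel >= threshold),
  );
}

// e.g. crossesKnownThreshold(88, 90) === true  (crosses 89: refetch)
//      crossesKnownThreshold(90, 95) === false (no threshold crossed: skip)
```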
Also, when implementing this we should be sure to not just reload immediately, in order to avoid hitting the server with a thundering herd of reloading clients. Instead, following the API documentation at https://zulip.com/api/get-events#restart , we should spread the reload out randomly over a period of at least 10 minutes.
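For illustration, a sketch of that spread-out reload, assuming a plain timer is acceptable (hypothetical names; the real implementation would live in the app's event-handling code):

```js
// Spread the refetch over a window of at least 10 minutes, so the clients
// of a just-restarted server don't all reload at the same moment.
const RELOAD_SPREAD_MS = 10 * 60 * 1000;

function scheduleRefetchAfterUpgrade(refetchFromScratch) {
  const delayMs = Math.random() * RELOAD_SPREAD_MS;
  setTimeout(refetchFromScratch, delayMs);
}
```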
This thunk action includes registering an event queue, of course -- but it has also included some unrelated tasks: notification setup and an invocation of the broken outbox subsystem. Make this action only about registering a queue and starting a polling loop on it, pulling out that other stuff.

Also change its name to explicitly say what it does: not just the "initial fetch", which I think basically means POST /register [1], but also starting a polling loop, which should always be done synchronously after that. The fetch-PMs call stays in the function just because it's meant to fill in a shortcoming of the register-queue functionality on older servers. We still plan to remove that call when we desupport server 2.1.

Possibly the unrelated stuff should be grouped into a new thunk action of its own, but I don't yet see a nice name or description for such a group.

Next, we'll move this and its helpers to eventActions.js.

[1] See https://zulip.readthedocs.io/en/latest/subsystems/events-system.html#the-initial-data-fetch . But "initial" can be a bit misleading, because it could imply, for example, that this is only meant to run on startup and on first-time connection with an account. We run this with DEAD_QUEUE and ACCOUNT_SWITCH too, and will probably run it on `restart` events for zulip#4793.
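Roughly, the intended shape is something like the sketch below -- names like `registerAndStartPolling`, `registerComplete`, and `startEventPolling` are placeholders standing in for the codebase's register/polling helpers, not verified identifiers: the thunk does nothing but the POST /register and then starts polling the new queue.

```js
// Sketch only: register an event queue, record the initial data, and start
// the long-poll loop -- with notification setup and outbox work pulled out.
const registerAndStartPolling = () => async (dispatch, getState) => {
  // The "initial fetch": POST /register.
  const initialData = await api.registerForEvents(getAuth(getState()));
  dispatch(registerComplete(initialData));

  // Start polling the new queue, synchronously after registering it.
  dispatch(startEventPolling(initialData.queue_id, initialData.last_event_id));
};
```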
Since zulip/zulip#18205, the server sends its new version and feature level in an event each time it restarts; and since #4707, we use that to keep our record of the server's version and feature level up to date.
This is good because it means that when we subsequently make a request to the server, if we have any logic that needs to look at the feature level to interpret the results of that request, we'll use the correct feature level in order to behave correctly.
There still may be edge cases, though, where that isn't quite enough. As I wrote on the chat thread that led to the mentioned changes: there can have been places where we did something relying on the old feature level, and have data sticking around from that which is no longer quite right for the new one. There can also be places where we have something in flight at the time of the restart -- like some request we decided to make, where we prepared something based on the old version, and with the new version that's no longer right.
To be sure to put a bound on the effect of any such edge case, we should notice when a restart event indicates the server got upgraded (for which a change in `zulip_feature_level` should be enough), and then make sure to reload the data from scratch within some bounded period of time.

Right now we effectively get this automatically, because we already reload the server data from scratch on each app startup, plus whenever we've been offline for 10 minutes so that the queue was expired. Realistically that means we never go more than a day or so without reloading. So there's nothing we need to do about this now.
Once we're otherwise getting close to implementing long-lived event queues, though -- aka #3916 -- it'll start becoming possible for small data errors to persist indefinitely and to accumulate. This issue will be one of the things we should do to prevent such errors, before rolling long-lived event queues out broadly.
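Putting the pieces together, a minimal sketch of the eventual check (hypothetical names; `refetchFromScratch` stands in for whatever thunk reloads the server data): on a `restart` event, compare the feature level it carries with the one we have recorded, and if the server was upgraded, schedule a from-scratch reload at a random point within a bounded window.

```js
// At least 10 minutes, per https://zulip.com/api/get-events#restart .
const RELOAD_WINDOW_MS = 10 * 60 * 1000;

// Handle a `restart` event from the server's event queue.
function handleRestartEvent(event, recordedFeatureLevel, refetchFromScratch) {
  if (event.zulip_feature_level !== recordedFeatureLevel) {
    // The server was upgraded: reload from scratch within a bounded period,
    // spread out randomly to avoid a thundering herd of reloading clients.
    setTimeout(refetchFromScratch, Math.random() * RELOAD_WINDOW_MS);
  }
}
```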