-
Notifications
You must be signed in to change notification settings - Fork 478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for an alternative algorithm for events-based cache event tracking #5624
Comments
I think it'd be good to get some perf numbers between the past, current, and proposed solution. Is this something you need spire adopters in the community or in the wild to help collect data on?
Besides perf, reliability was also my biggest concern reading the proposal, so it's encouraging that so far signs are good :)
My only higher preference than simple code is no code at all :) Definitely a worthy goal to strive to. |
Absolutely. I do have a local benchmark set up to evaluate reliability and perf on a simulated load (e.g. something that churns entries at some rate and occasionally pauses and asserts that the cache in two or more SPIRE servers are in sync). Real life numbers are even better. Outside of perf, it would be invaluable to know whether or not the first two of three assumptions hold in your deployment. If you have any of the following metrics, it would be super helpful:
|
This algorithm has a number of issues:
I don't think that the current system is complete, and there's been a lot of questioning that has delayed the work deploying it. However, all the problems are performance based, and two solutions should fix this:
However, if the maintainers wish to discard the currently observed to be error free solution for a new one, it's the maintainer's call. However, this does seem a bit reactionary, as if it were directly in response to not accepting one of many back off algorithms. I'd rather see a sub-optimal but functional back off algorithm accepted over a new means of scanning the database that clearly would miss records that arrive after a few minute delay. For a volunteer effort, this also seems like re-inventing the solution by rewriting from scratch, an approach that is known to have costs and risks that often are under estimated. In many ways, this rolls back the clock on the current solution nearly six months, when these ideas were deemed insufficient to actually capture all changes. |
There's no perfect answer to this, as we have very few observed items in the field that led to the failure this system was designed to fix. Or, it is equally possible that people didn't inform me of details which were the input into the current design. So, |
Providing some numbers:
|
Hmm. I think that failover event blows up the first assumption "Number of events in the last N minutes is manageably sized" in a crippling way. In previous discussions I assumed that 1M registration changes were more or less distributed over a one day period (which is a reasonable ~3000 events every 5 minutes). If those registration changes are going to come in as fast as they can in the failover scenario, that number is no longer reasonable. |
@amoore877 I see a lot of focus on p99 numbers. The main issue is that anything that's below p99 works, but the failures are not statistically going to be captured by p99. A single long-lived database transaction will be enough to cause cache synchronization issues. If we focus on p99, then that means we only need 100 successful transactions to ignore the failing one, and our cache will still be out-of-sync. For this reason, I don't see this as an area where statistical analysis takes precedence over a very old, traditional concept of "code correctness". We had a system where one event outside of the p99 can break the system, and in the past we realized that saying it worked 99% of the time didn't buy sympathy for the one time it led to a running server that didn't reflect its backing database. That's why we treat every database change that might exist as something to consider until it cannot exit. It is an extreme point of view, but one that ensures that the cache will always be in-sync. Should we stop polling the "might newly exist" changes at the same frequency we poll the entire database for new records, we can reduce the impact of polling the "might newly exist" records a lot. The previously rejected submission reduced it by 1/60th, but that was an arbitrary setting. While I argued extensively for that solution (it had a lot of nice features, which led to its approach), any solution that polls with less frequency would be sufficient. With the proposed approach in this Issue, any transaction that managed to live past the setting for
The first scenario is undesirable, because there's a lot that goes on in a server restart, including denying services to the other unaffected SVIDs. The second scenario is undesirable, because nobody is going to make the call on what the frequency of "having the right data in the cache" is. Previous calls for this went unanswered. Late in our analysis, it was shared with me that Uber, the reporting entity, was updating their cache once a second. In many minds, any delays sill create failures on the Pod or process, leading to quick reloads. However, the faster the full reload, the more we will see the benefits of the db event approach disappear. At five seconds, it would be more efficient for the db event approach to simply not exist (but that also comes with 10,000 times more database CPU overhead). The third scenario basically is indeterminable. One can't surmise when the record might be updated. For this reason, I would feel more comfortable with a embarrassingly bad polling back off algorithm than re-writing the framework from scratch. I put too much of my social capital in attempting to get the previously rejected back off algorithm accepted. I regret this, and to a degree, I partially see this solution as a wholesale call to disregard all lessons learned for a "simpler time". Please do not be swayed by p99, because it means that after 100 cache updates, it's okay if your cache is out of sync once. I recommend that we choose any back off algorithm, even one known to be embarrassingly bad, instead of rewriting the currently existing solution which (to my knowledge) has been running error free for multiple releases. |
FWIW, this failover event would generate 1M skipped events to track for 12 hours (current transaction timeout value). Even with the backoff algorithm that has not yet been accepted, that would be roughly 16K queries a second for 12 hours. |
@azdagron I'm ok with polling once per hour, as long as there is one poll that is guaranteed to be beyond the limit of the transaction timeout. That said, I fully understand that the people who are going to be impacted really are the people that have the authority to set the limit. Unfortunately, there's no limit I've ever received. For this reason, I created a polling system that could be configured to poll at any limit, in any manner (some recommended exponential backoff, some indicated that a maximum limit would be desired, and linear backoff seemed better suited to the problem). For this reason the backoff system I previously proposed was configurable, which made it far more complex than a non-configurable system. The decision to make it configurable I now see as a mistake. I recommend that we implement your previous priority queue backoff, which I don't see as optimal, but at least adding it to a correct polling approach is preferable to possibly creating a new approach that would miss database changes. Of course, your approach uses random numbers, making it very difficult to test deterministically, but even that's better than "let's start all over again" which would likely recreate the issues that led to the solution we currently have. |
The experimental events-based cache relies on tracking change events for registration entries and attested nodes. The current algorithm relies on the monotonically increasing property of the event ID. Previous assumptions around arrival order of events and event ID increment stepping have contributed to a series of fixes to the current algorithm that have complicated the codebase (e.g. skipped event discovery and tracking) and contributed to pessimistic database query load, when one of the primary goals for the events-based cache is to reduce database load in the steady state. Organizations whose deployments tickle the conditions above are not realizing as much benefit from the events-based cache as should be possible.
I propose the following algorithm change to the events based cache:
processed_events
, which is initially empty.N
minutes are retrieved, based on the event CreatedAt column, into a listrecent_events
processed_events
andrecent_events
recent_events
but not inprocessed_events
(i.e. new events) are processed and added toprocessed_events
processed_events
but not inrecent_events
(i.e. old events) are removed fromprocessed_events
processed_events
andrecent_events
(i.e. already processed events) are ignoredWe can handle failures to talk to the database in a given poll cycle by either:
N
, last time we successfully polled), orThis algorithm assumes a few things:
N
minutes is manageably sized:Seems reasonable if we keep
N
small, e.g. 3 minutesN
minutes:As far as I know, the only mutating operation that has the potential for a long-lived transaction is that which prunes registration entries, which executes a single statement to delete all registration entries older than some threshold. When the registration entry set is large, this operation can take some time. We can likely reduce the per-transaction time by doing the deletion in small batches and more frequently.
The created_at column is not currently indexed, but can be.
PROS:
N
minute window.CONS:
N
minute window, but will fully reduce afterN
minutes.I have prototyped the above algorithm and so far have not witnessed split brain in the cache though more testing is required.
The text was updated successfully, but these errors were encountered: