Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full missed-event-reconciliation based on SQLTransactionTimeout Proposal #5470

Closed
stevend-uber opened this issue Sep 5, 2024 · 2 comments
Closed
Labels
triage/in-progress Issue triage is in progress

Comments

@stevend-uber
Copy link
Contributor

stevend-uber commented Sep 5, 2024

Currently SQLTransactionTimeout is used to do the following:

  • Initial cache hydration on spire-server startup (ref)
  • Pruning of missed events from cache (ref)

Assuming every eventID should exist, this puts all missing eventIDs into the in-memory cache. The proposal here is to remove the in-memory cache of assumed missed events and to instead make it a seen-events cache. The missed-events loop can then query for all events in the last SQLTransactionTimeout timespan (+wiggle room to account for clocks) and do a delta against the seen-events list. What this gives us is removing the assumption that every eventID should exist and enable SPIRE deployments with auto_increment_increment ≠ 1 without the included false-positive noise that comes with false-missed-events. This allows the same continued functionality at the cost of likely negligible CPU cycles.

Additional performance enhancements can additionally be made to do

SELECT * 
FROM Events within (timespan) 
WHERE eventID NOT IN (seen)

We can take the older of

{lastReconciledTime, now-SQLTransactionTimeout}

In order to account for db failure scenarios.

Additionally, this can be done in a separate go-routine so that it does not affect the performance of the hot-path of the normal cache sync.

@stevend-uber stevend-uber changed the title Full event-reconciliation based on SQLTransactionTimeout Proposal Full missed-event-reconciliation based on SQLTransactionTimeout Proposal Sep 5, 2024
@MarcosDY MarcosDY added the triage/in-progress Issue triage is in progress label Sep 5, 2024
@edwbuck
Copy link
Contributor

edwbuck commented Sep 6, 2024

SQLTransactionTimeout settings don't impact the initial fill of the cache, because it's filled on startup regardless of the setting, or the value of the setting.

What you have identified in your first item is the polling of "before the first" events to determine if a transaction was holding events before the first discovered event. Such polling only needs to occur for sql_transaction_timeout seconds, because after then, no transaction could hold an event prior to the first detected event. I'll open a new PR to improve the naming of element in this area, I agree it is confusing and not clear.

For the second point; yes, we remove the Event IDs from the polling list sql_transaction_timeout seconds after the gap they represent was detected, because it is now impossible for them to be in an uncommitted transaction that can still commit them to database.

Assuming every eventID should exist, this puts all missing eventIDs into the in-memory cache

The event ids are never in the cache, because the cache is a cache of Entries, not of Events. The Event IDs for gaps are in the update routines of the cache to implement knowledge of what Events might arrive late (or out of order). While they are present, let's clarify that this is a cache of Entries, not Events.

The approach you suggest works well for regions that are mostly empty. The returned data is small, and we currently implement that approach in the "before the first" gap polling. We scan for every Event before the first detected Event, we filter out the ones we successfully processed in prior updates, and we process any new ones.

For ranges that are primarily full, such an algorithm comes with many issues. The returned data is significant, especially due to it being polled frequently. For most systems, nearly 100% of the range will be full. Even in systems that create an Event Gap for each Event they register, nearly 50% of the range will be full. This incurs additional costs not mentioned above, including both CPU costs to marshall the data into the reply packets as well as I/O costs. Those costs would be felt by every user of SPIRE, even those that do not have database replication setups that impose Event gaps for every registered Event.

That's why #5341 and #5342 exist. They attempt to solve the issue of the costs of session creation and destruction for systems with large gaps in the Event table, through the following means:

  1. The server will use a logarithmic backoff polling police (calculated based on each gap's first detection time). This will reduce the number of times a gap is polled for the possible late arrival of an event.
  2. The server will use a batch query mechanism to request the gaps due for polling in the 5 second polling cycle in batches, reducing the number of created sessions and the impact of each.

I recommend that we close this issue, as it already is implemented for the "before the first" events, and the scenario of scanning already processed events incurs CPU and I/O costs that were not considered, as well as reestablishes the scaling issues (perhaps on a smaller scale) that the Event system was designed to remove.

I'll also prioritize #5341 and #5342 to ensure that our current polling of Event gaps reduces impact on the database by using logarithmic backoff and batch querying mechanisms respectively.

@amartinezfayo
Copy link
Member

Discussed during the contributor's meeting that having #5341 and #5342 in place will replace the need for this.
Keeping this open until both are addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage/in-progress Issue triage is in progress
Projects
None yet
Development

No branches or pull requests

4 participants