-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BCFR-899] MaxLogsKept implementation #14574
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
6e1376d
to
184422e
Compare
184422e
to
a064334
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive query 🤯!
Left a couple of nits
17c1357
to
e3491b8
Compare
06ecc5d
to
2225dac
Compare
reductionista
commented
Oct 1, 2024
Also: add block_number >= lower
Also: add deduping of ids for logs matching multiple filters
Also, remove some extraneous lines in orm_test.go
- Remove topics from SelectExcessLogs query - Early exit from loadFilters - upper >= end
76d4550
2fcfc91
to
76d4550
Compare
|
jmank88
approved these changes
Oct 16, 2024
dhaidashenko
approved these changes
Oct 16, 2024
This was referenced Oct 16, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
BCFR-899
Motivation
Presently, LogPoller supports only time-based retention, via the Retention field in the filters passed to RegisterFilter. The MaxLogsKept field was added earlier in anticipation of the need for also supporting recency-count based retention. One example of a case where time based retention is risky is the Transmit event in the OCR Contract Transmitter. No matter how long the retention period is set to, there's a chance the node will be down for longer than that, and miss logs when it comes back up. This would make a bad situation even worse, because the transmit event would never be picked up at all.
Solution
This implements the MaxLogsKept feature in LogPoller. When specified, this field tells LogPoller it's okay to prune logs matching a filter if there are at least MaxLogsKept more recent matching logs in the db.
In the example above, this avoid storing any more logs than needed while always having the latest transmit event available. In this case, older transmit events are no longer relevant if there is a more recent one.
In general, this should be just as useful for anything accessed only via ChainReader's GetLatestValue() method rather than QueryKey().
A log may be pruned either because it's too old in terms of time or in terms of the number of logs being saved. It need not satisfy both theRetention and MaxLogsKeptcriteria in order to get pruned.
Testing
This was tested by changing the MaxLogsKept setting on the ContractTransmitter filter passed to LogPoller from 0 to 1, and running the CCIP load tests.
Without paging, this query is one of the most cpu intensive. Similar to the DeleteExpiredLogs query, it must go though every row of the logs table unless LogPrunePageSize is set to non-zero. But it's slower than DeleteExpiredLogs because on top of that, it also has to group, sort, and count every log in each group in order to figure out how many there are and which ones are the excess logs eligible for deletion. Also similar to DeleteExpiredLogs, the final step is to merge together all the results for logs matching multiple filters by making sure that no log is dropped unless ALL of its matching filters agree that it came before their own MaxLogsKept threshold was hit.
Without paging, the SelectExcessLogs query grew in median duration linearly as the number of logs in the table increased. After a few hours it was taking longer than the Insert queries (and any other queries) and started causing timeouts to occur, not just for itself but for other queries as well. It eventually got to a point where the p90 & p99 charts were continuously in the 4-5s range, generating many critical errors as well as a lot of backlogged queries waiting on connections.
With a paging size of 4000, you could see that the query durations grew linearly at first with the size of the table, and then leveled off at slightly more than the insert queries. It still resulted in a couple fairly large bursts of critical errors (query timeouts) during the heaviest cpu usage.
A paging size of 1000-2000 worked much better. There were a much smaller # of timeouts, and only during a brief window of time. The p99 durations were noticeably lower than the insert query durations, aside from some very brief but high spikes which we believe are due to an unrelated bug (which has been fixed in Chainlink repo, but hasn't been back posted to CCIP yet) in not letting go of connections quickly enough (due to sql logging).
Aside from these spikes, the charts look pretty healthy... so we should retest once that bug is backported but before the MaxLogsKept feature is enabled.
Another scenario tested was the performance of the query when all filters have MaxLogsKept=0. This was tested both with and without the code which skips the query unless there is at least one filter with MaxLogsKept > 0. Obviously, skipping the query took it off the chart entirely... so the rest of the chart looked the same as before this PR. Without slipping the query, it was slightly faster than the MaxLogs=1 case, but not a very significant reduction. Based on this it was decided it did make sense to disable the pruning until this feature starts being used by at least 1 filter.
With MaxLogsKept=1:
![Screenshot 2024-10-15 at 5 54 53 PM](https://private-user-images.githubusercontent.com/2644901/376848321-fb97be8d-2e67-432b-80c0-54dcdb2d8e86.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMTk1NTYsIm5iZiI6MTczOTIxOTI1NiwicGF0aCI6Ii8yNjQ0OTAxLzM3Njg0ODMyMS1mYjk3YmU4ZC0yZTY3LTQzMmItODBjMC01NGRjZGIyZDhlODYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMjAyNzM2WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YmI5M2U0NjAyM2UwNDkyZDM5YmM0MmFjYjI3ZDcwMTkzODQ2NTcwYjgxMmQ4YjMyODFiZGNlNjRmNjkyYzUwZiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ._ITgXg9BOpz-B8bF6naBxO1M5VRci0erdLvDQTIZ9BE)
![Screenshot 2024-10-15 at 5 55 11 PM](https://private-user-images.githubusercontent.com/2644901/376848368-4e179f97-4e81-422a-bece-89eed1be6246.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMTk1NTYsIm5iZiI6MTczOTIxOTI1NiwicGF0aCI6Ii8yNjQ0OTAxLzM3Njg0ODM2OC00ZTE3OWY5Ny00ZTgxLTQyMmEtYmVjZS04OWVlZDFiZTYyNDYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMjAyNzM2WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9YzMxYzJmODllZTQ4OTE4MDU5NzIxNjU5YjgyYWViZGU5Nzc4YmE2MjE2MGJhZmI5NjRhYTVkNGNjZTRkMzljNSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.e25LSSsVavpOkVHp6PK23Ib0UZQZ3cR3pyC4kisDQAU)
With MaxLogsKept=0 (disabled):
![Screenshot 2024-10-15 at 5 58 09 PM](https://private-user-images.githubusercontent.com/2644901/376848486-f8de0a00-356e-4b98-a0f9-fb0eb9df735c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMTk1NTYsIm5iZiI6MTczOTIxOTI1NiwicGF0aCI6Ii8yNjQ0OTAxLzM3Njg0ODQ4Ni1mOGRlMGEwMC0zNTZlLTRiOTgtYTBmOS1mYjBlYjlkZjczNWMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMjAyNzM2WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MWRiYTZiOWZlOWI2NzM5ZDAxNDZhZDQyNzk1Yzk0ZTRkZmI1ZGU1YWY4NzBjNzBmZTNkNGVlYjZjYzYxMjYxNCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.QwpR-a2K_NVcofvlPLiutO1g914wmq4dBx5rOU5leH8)
![Screenshot 2024-10-15 at 5 58 46 PM](https://private-user-images.githubusercontent.com/2644901/376848529-17109b1a-96f6-4c3e-979e-26c745acac28.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMTk1NTYsIm5iZiI6MTczOTIxOTI1NiwicGF0aCI6Ii8yNjQ0OTAxLzM3Njg0ODUyOS0xNzEwOWIxYS05NmY2LTRjM2UtOTc5ZS0yNmM3NDVhY2FjMjgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTBUMjAyNzM2WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Mjg5NjkxMTg1NDBmYjFjYjRmZWJlZDQ4NTU5ZTgyMmEyNWZiOTE1NWRmYzk2YzA1MjNkZjRkNTUwYTI3OTdkNSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.yFxr1-4wmbw8eXWEUY2Mxsua8uTT0OkABRRS6as_xP0)
For both of these tests, the pruning of unmatched logs--which is also an expensive query--has been increased in frequency from every 20 ticks to every 4 ticks, to see what it looks like under the most severe conditions while they're both happening at once (with every 20 ticks, the test usually completes before the first one happens). On the MaxLogsKept=1 charts, SelectUnmatchedLogs query durations show up as cyan and SelectExcessLogs query durations as bright yellow. On the MaxLogsKept=0 charts, SelectUnmatchedLogs is pale yellow (and SelectExcessLogs does not run--as desired). LogPrunePageSize was set to 1000 for both.