
[BCFR-899] MaxLogsKept implementation #14574

Merged: 15 commits merged into develop on Oct 16, 2024

Conversation

@reductionista (Contributor) commented Sep 26, 2024

BCFR-899

Motivation

Presently, LogPoller supports only time-based retention, via the Retention field in the filters passed to RegisterFilter. The MaxLogsKept field was added earlier in anticipation of also supporting count-based retention. One example of a case where time-based retention is risky is the Transmit event in the OCR Contract Transmitter: no matter how long the retention period is set to, there's a chance the node will be down for longer than that and miss logs when it comes back up. This would make a bad situation even worse, because the transmit event would never be picked up at all.

Solution

This implements the MaxLogsKept feature in LogPoller. When specified, this field tells LogPoller it's okay to prune logs matching a filter if there are at least MaxLogsKept more recent matching logs in the database.
In the example above, this avoids storing any more logs than needed while always keeping the latest transmit event available; older transmit events are no longer relevant once there is a more recent one.
In general, this should be just as useful for anything accessed only via ChainReader's GetLatestValue() method rather than QueryKey().

A log may be pruned either because it's too old (time-based) or because too many newer matching logs have been saved (count-based). It need not satisfy both the Retention and MaxLogsKept criteria in order to be pruned.
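
For illustration, here is a minimal sketch of how a product could register such a filter, assuming the logpoller.Filter fields shown below; the event signature and contract address are placeholders, and the actual ContractTransmitter wiring differs:

```go
package example

import (
	"context"

	"github.com/ethereum/go-ethereum/common"

	"github.com/smartcontractkit/chainlink/v2/core/chains/evm/logpoller"
)

// registerTransmitFilter sketches a count-based filter registration: only the
// most recent Transmit log ever needs to survive pruning.
func registerTransmitFilter(ctx context.Context, lp logpoller.LogPoller, transmitEventSig common.Hash, contractAddr common.Address) error {
	return lp.RegisterFilter(ctx, logpoller.Filter{
		Name:        "OCR ContractTransmitter Transmit", // placeholder filter name
		EventSigs:   []common.Hash{transmitEventSig},    // placeholder event signature
		Addresses:   []common.Address{contractAddr},     // placeholder contract address
		Retention:   0,                                  // no time-based expiry
		MaxLogsKept: 1,                                  // prune anything older than the newest match
	})
}
```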

Testing

This was tested by changing the MaxLogsKept setting on the ContractTransmitter filter passed to LogPoller from 0 to 1, and running the CCIP load tests.

Without paging, this query is one of the most CPU-intensive. Similar to the DeleteExpiredLogs query, it must go through every row of the logs table unless LogPrunePageSize is set to non-zero. But it's slower than DeleteExpiredLogs because, on top of that, it also has to group, sort, and count every log in each group in order to figure out how many there are and which ones are the excess logs eligible for deletion. Also similar to DeleteExpiredLogs, the final step is to merge the results for logs matching multiple filters, making sure that no log is dropped unless ALL of its matching filters agree that it falls beyond their own MaxLogsKept threshold.
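
To make that grouping-and-merging step concrete, here is a much-simplified sketch of how such a query could be structured in Postgres. This is not the actual SelectExcessLogs query from orm.go: the table and column names are assumptions, and the real query also matches on topics and pages through the table by id.

```go
package example

// selectExcessLogsSketch is illustrative only. It ranks each log within every
// matching filter (newest first) and then keeps a log unless ALL of its matching
// filters already have at least max_logs_kept newer logs.
const selectExcessLogsSketch = `
WITH ranked AS (
    SELECT l.id,
           f.max_logs_kept,
           ROW_NUMBER() OVER (
               PARTITION BY f.name, l.address, l.event_sig
               ORDER BY l.block_number DESC, l.log_index DESC
           ) AS rn
    FROM evm.logs l
    JOIN evm.log_poller_filters f
      ON f.evm_chain_id = l.evm_chain_id
     AND f.address = l.address
     AND f.event = l.event_sig
    WHERE f.max_logs_kept > 0
)
SELECT id
FROM ranked
GROUP BY id
-- excess only if EVERY matching filter agrees this log is past its threshold
HAVING bool_and(rn > max_logs_kept)`
```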

Without paging, the SelectExcessLogs query grew in median duration linearly as the number of logs in the table increased. After a few hours it was taking longer than the Insert queries (and any other queries) and started causing timeouts to occur, not just for itself but for other queries as well. It eventually got to a point where the p90 & p99 charts were continuously in the 4-5s range, generating many critical errors as well as a lot of backlogged queries waiting on connections.

With a paging size of 4000, the query durations grew linearly at first with the size of the table, and then leveled off slightly above the insert query durations. It still resulted in a couple of fairly large bursts of critical errors (query timeouts) during the heaviest CPU usage.

A paging size of 1000-2000 worked much better. There were far fewer timeouts, and only during a brief window of time. The p99 durations were noticeably lower than the insert query durations, aside from some very brief but high spikes which we believe are due to an unrelated bug in not releasing connections quickly enough (caused by SQL logging); that bug has been fixed in the Chainlink repo but hasn't been backported to CCIP yet.
Aside from these spikes, the charts look pretty healthy, so we should retest once that bug is backported but before the MaxLogsKept feature is enabled.
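
For reference, the paging behavior amounts to a delete-in-batches loop; this is a sketch under the assumption of a caller-supplied delete function that removes at most pageSize excess logs per call and reports how many rows it removed (the real ORM method names differ):

```go
package example

import "context"

// pruneExcessLogs repeatedly deletes excess logs in pages of pageSize rows.
// deletePage is a hypothetical stand-in for the real ORM call; it deletes at
// most limit rows and returns the number actually removed. pageSize == 0 means
// no paging: a single unbounded pass.
func pruneExcessLogs(ctx context.Context, pageSize int64, deletePage func(ctx context.Context, limit int64) (int64, error)) (int64, error) {
	var total int64
	for {
		n, err := deletePage(ctx, pageSize)
		if err != nil {
			return total, err
		}
		total += n
		if pageSize == 0 || n < pageSize {
			// a short page (or an unbounded pass) means nothing is left to prune
			return total, nil
		}
	}
}
```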

Another scenario tested was the performance of the query when all filters have MaxLogsKept=0. This was tested both with and without the code that skips the query unless there is at least one filter with MaxLogsKept > 0. Skipping the query took it off the chart entirely, so the rest of the chart looked the same as before this PR. Without skipping the query, it was slightly faster than the MaxLogsKept=1 case, but not a very significant reduction. Based on this, it was decided that it does make sense to disable the pruning until this feature starts being used by at least one filter.
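
The skip described above comes down to a cheap guard before issuing the query; a sketch, assuming the registered filters are available as a map keyed by name (the real field names and locking in log_poller.go may differ):

```go
package example

import "github.com/smartcontractkit/chainlink/v2/core/chains/evm/logpoller"

// anyFilterUsesMaxLogsKept reports whether the expensive excess-logs pruning
// query needs to run at all: it can be skipped entirely unless at least one
// registered filter actually uses count-based retention.
func anyFilterUsesMaxLogsKept(filters map[string]logpoller.Filter) bool {
	for _, f := range filters {
		if f.MaxLogsKept > 0 {
			return true
		}
	}
	return false
}
```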

With MaxLogsKept=1:
[Screenshots: query duration charts, 2024-10-15]

With MaxLogsKept=0 (disabled):
[Screenshots: query duration charts, 2024-10-15]

For both of these tests, the pruning of unmatched logs (also an expensive query) was increased in frequency from every 20 ticks to every 4 ticks, to see what things look like under the most severe conditions with both prune operations happening at once (at every 20 ticks, the test usually completes before the first one happens). On the MaxLogsKept=1 charts, SelectUnmatchedLogs query durations show up as cyan and SelectExcessLogs query durations as bright yellow. On the MaxLogsKept=0 charts, SelectUnmatchedLogs is pale yellow (and SelectExcessLogs does not run, as desired). LogPrunePageSize was set to 1000 for both.

@reductionista force-pushed the BCFR-899-max-logs-kept branch 2 times, most recently from 6e1376d to 184422e on September 27, 2024 07:32
@reductionista marked this pull request as ready for review September 27, 2024 07:33
@reductionista requested review from a team as code owners September 27, 2024 07:33
@reductionista requested review from EasterTheBunny and removed request for a team September 27, 2024 07:33
@reductionista changed the base branch from develop to BCFR-900-log-poller-id-columns September 27, 2024 07:34
@reductionista requested review from a team as code owners September 27, 2024 07:34
@reductionista changed the title from [BCF-899] MaxLogsKept implementation to [BCFR-899] MaxLogsKept implementation on Sep 27, 2024
@dhaidashenko (Collaborator) left a comment:

Impressive query 🤯!
Left a couple of nits
Review threads (resolved): core/chains/evm/logpoller/orm.go, core/chains/evm/logpoller/log_poller.go
@reductionista force-pushed the BCFR-899-max-logs-kept branch 9 times, most recently from 17c1357 to e3491b8 on October 1, 2024 02:41
Base automatically changed from BCFR-900-log-poller-id-columns to develop October 1, 2024 06:16
@reductionista force-pushed the BCFR-899-max-logs-kept branch 2 times, most recently from 06ecc5d to 2225dac on October 1, 2024 22:08
Further review threads (resolved): core/chains/evm/logpoller/log_poller.go, core/chains/evm/logpoller/orm.go
@reductionista added this pull request to the merge queue Oct 16, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 16, 2024
@reductionista added this pull request to the merge queue Oct 16, 2024
Merged via the queue into develop with commit accbf0f Oct 16, 2024
128 of 129 checks passed
@reductionista deleted the BCFR-899-max-logs-kept branch October 16, 2024 16:06