
Background eviction #4191

Merged
merged 7 commits into stellar:master on Apr 25, 2024

Conversation

@SirTyson (Contributor) commented Feb 13, 2024

Description

This PR enables background eviction scans when running BucketListDB. This increases the time we have to scan for expired entries to evict and will improve eviction throughput.

It is currently in draft form, as it is based on #4172. This will also fail CI at the moment since it has not been updated with the new XDR changes.

TODO:

  • Additional test for entries that should be evicted in ledger N but are recreated during ledger N tx application. Done.
  • Graceful shutdown for the background eviction thread. Not required: the background eviction scan is sufficiently short-lived.
  • Rebase on updated XDR. Done.

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@MonsieurNicolas (Contributor):

I would recommend adding a flag of sorts that continues to perform eviction in the foreground:

  • That way we can merge this without performing an in-depth performance analysis (avoids letting the PR rot).
  • This gives us time to understand the impact of performing so much work in the background.
    • Hopefully we'll have protocol 20 live on pubnet soon, so we can test against the real pubnet dataset without having to fork it.
    • Right now large merges happen, but not that often, whereas this potentially cranks up the amount of disk transfer by quite a bit.
      • We may need to tweak I/O priority or hints given to the OS so as to avoid thrashing disk caches. Right now the only thing we do with worker threads is call runCurrentThreadWithLowPriority, which gives them a nice value of 5.
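For illustration only (not part of this PR), here is a POSIX-only sketch of the kind of OS hinting the last bullet refers to: lowering the calling thread's nice value and using posix_fadvise so a one-pass bucket scan does not displace hotter pages from the page cache. The function name and buffer handling are made up for the example.

```cpp
// Sketch only: deprioritize the scanning thread and hint the kernel that a
// sequentially scanned bucket file should not crowd out hot cache pages.
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

void
scanBucketFileWithLowImpact(char const* path)
{
    // Same idea as runCurrentThreadWithLowPriority: nice value 5 for this thread.
    setpriority(PRIO_PROCESS, 0, 5);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
    {
        return;
    }

    // Tell the kernel we will read sequentially, once.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[64 * 1024];
    while (read(fd, buf, sizeof(buf)) > 0)
    {
        // ... feed entries to the eviction scan ...
    }

    // Drop the pages we pulled in so we do not evict hotter cache entries.
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
}
```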

@marta-lokhova (Contributor) left a comment:

Sorry I'm late with reviewing; I only did a partial review so far. Thanks for taking a stab at this! Unfortunately, I found it too difficult to reason about the changes, due to a few things:

  • With the current interface, both the background and main threads have access to basically every component of core, because both hold references to the Application class. This is very dangerous, and makes it nearly impossible to verify the correctness of the code. Maybe we should take a step back and start with a diagram of interactions/dependencies between LedgerManager, BucketManager, etc., followed by a sketch of interface changes (for simplicity, start with the public API only; we can focus on intra-class synchronization later).

  • I think a design issue that makes things complicated is that the snapshot manages itself, so it has to query LedgerManager for the latest ledgers and BucketManager for the snapshot mutex (it feels weird that it has to acquire a mutex held by someone else to modify its own data). Why do we need state in the snapshot class in the first place? It could be a const wrapper around buckets (with no access to any of the managers) maintained by some higher-level abstraction, which just swaps copies with the new snapshot once ledger close is done. We could introduce a new class that manages this hand-off with some invariants: e.g. only the main thread can refresh the snapshot, while background threads can only retrieve a const copy of the state to do whatever is needed (see the sketch after this list).

  • As part of the interface cleanup, can we think about encapsulation and removal of unneeded mutexes? For example, with the changes suggested above, I think you won't need the LCL mutex. We should encapsulate mutexes as much as possible, potentially inside the "handoff manager", so that callers don't need to worry about synchronization (note that we'll still need to keep track of ledger numbers, but conceptually they're part of "the snapshot", so it would all be guarded with one mutex).

  • I'd suggest incorporating additional invariants, so we can detect bugs faster and blow up rather than go into UB land. Some ideas: assert main vs. non-main thread, future state management like we do in FutureBucket, etc.
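A minimal sketch of the hand-off idea above, with illustrative names (SnapshotHandoff and its members are not the PR's actual types): one mutex guards both the snapshot and its ledger number, the main thread publishes a new const snapshot at ledger close, and background threads only ever take a copy by value.

```cpp
// Sketch of a "hand-off manager": main thread swaps in a new const snapshot,
// background threads retrieve a consistent (snapshot, ledgerSeq) copy.
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

struct BucketListSnapshot
{
    // stand-in for an immutable view over the bucket list at some ledger
};

class SnapshotHandoff
{
    mutable std::mutex mMutex;
    std::shared_ptr<BucketListSnapshot const> mSnapshot;
    uint32_t mLedgerSeq{0};

  public:
    // Main thread only: publish the snapshot taken after closing ledgerSeq.
    void
    refresh(std::shared_ptr<BucketListSnapshot const> snap, uint32_t ledgerSeq)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mSnapshot = std::move(snap);
        mLedgerSeq = ledgerSeq;
    }

    // Any thread: take a consistent (snapshot, ledgerSeq) pair by value.
    std::pair<std::shared_ptr<BucketListSnapshot const>, uint32_t>
    copy() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return {mSnapshot, mLedgerSeq};
    }
};
```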

src/bucket/BucketManager.h (outdated):
@@ -207,9 +207,11 @@ class BucketManager : NonMovableOrCopyable
virtual void maybeSetIndex(std::shared_ptr<Bucket> b,
std::unique_ptr<BucketIndex const>&& index) = 0;

- virtual std::unique_ptr<SearchableBucketListSnapshot const>
+ virtual std::unique_ptr<SearchableBucketListSnapshot>
Contributor:

the removal of const here doesn't look right: any consumer of snapshots shouldn't be able to modify the underlying object.

Contributor Author:

While I agree in principle, the only members of SearchableBucketListSnapshot are mSnapshotManager, which is already const, and std::unique_ptr<BucketListSnapshot const> mSnapshot{};. Because SearchableBucketListSnapshot checks and automatically updates mSnapshot on each load, to be const correct mSnapshot would have to be mutable. I don't think it makes much sense for the only piece of non-const state to be mutable, and I don't think it's dangerous, because the underlying BucketListSnapshot is const regardless.
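For illustration, a simplified sketch (not the PR's actual class) of the const-correct alternative being weighed here: keeping the lazily refreshed cache while making the load path const forces the cached pointer to be mutable.

```cpp
// Sketch of the trade-off discussed above, with simplified names: if load()
// stays const but lazily refreshes a stale snapshot, the cached pointer must
// be mutable. The pointee (BucketListSnapshot const) is immutable either way.
#include <memory>

struct BucketListSnapshot
{
    // stand-in for an immutable view of the bucket list at some ledger
};

class SearchableBucketListSnapshot
{
    // mutable so that a const load() can still refresh a stale snapshot
    mutable std::unique_ptr<BucketListSnapshot const> mSnapshot;

  public:
    void
    load() const
    {
        // if (snapshot is stale) mSnapshot = /* take a fresh snapshot */;
        // ... search mSnapshot ...
    }
};
```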

src/ledger/LedgerManagerImpl.cpp (outdated):
}

auto iter = mBucketListDBBulkTimers.find(label);
if (iter == mBucketListDBBulkTimers.end())
mEvictionFuture.wait();
Contributor:

this is redundant: get calls wait already

if (numEntries != 0)
ZoneScoped;

if (!mEvictionFuture.valid())
Contributor:

this doesn't look right: why isn't the future valid if we get to this state (resolving the eviction scan)?

Contributor Author:

My understanding is that std::future::valid() returns false if the future has been default-initialized or has already had a call to get. It returns true if the future has been assigned a shared state (i.e. mEvictionFuture = task->get_future() for a std::packaged_task<EvictionResult()>) but get has not yet been called on it. Essentially, valid returns true if there is ready or pending work, and false if the work is done and has been retrieved, or if no work has been set yet.

This particular state is possible on startup, when we're closing our first ledger and haven't had a chance to call startBackgroundEvictionScan as part of closeLedger yet.
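For reference, a standalone illustration of the std::future semantics described above (standard-library behavior, not code from this PR):

```cpp
// Demonstrates valid()/get(): a default-constructed future is not valid, a
// future obtained from a packaged_task is valid until get() consumes the
// shared state.
#include <cassert>
#include <future>
#include <thread>

int
main()
{
    std::future<int> fut;            // default-initialized
    assert(!fut.valid());            // no shared state yet

    std::packaged_task<int()> task([] { return 42; });
    fut = task.get_future();
    assert(fut.valid());             // pending work

    std::thread worker(std::move(task));
    int result = fut.get();          // waits, then retrieves the value
    assert(result == 42);
    assert(!fut.valid());            // shared state released after get()

    worker.join();
    return 0;
}
```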

Contributor:

So is this only the case for tests then where we close v20 genesis ledger? In practice, genesis is always protocol 0, so it shouldn't need to do any eviction at all, right?

Contributor Author:

No, by "closing our first ledger", I don't mean the genesis ledger, but the first ledger that a node closes since that node's startup. Like during catchup after assuming state, we'd hit this path on the first call to ledgerClose, since the assume state path doesn't kick off the background eviction scan.

Contributor:

Like during catchup after assuming state, we'd hit this path on the first call to ledgerClose, since the assume state path doesn't kick off the background eviction scan.

Why not? It should, right? Otherwise we'd stall on that first ledger and have to block until the scan is done. It seems like we should kick off the eviction scan in LedgerManagerImpl::setLastClosedLedger, which is called by catchup.
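A rough sketch of that suggestion, under the assumption of simplified signatures (this is not the actual LedgerManagerImpl code, and the guard helper is hypothetical):

```cpp
// Sketch only: kick off the eviction scan for the next ledger while assuming
// state, so the first closeLedger after catchup finds a pending future rather
// than falling back to a synchronous scan. Signatures are simplified and
// shouldScheduleBackgroundEvictionScan is a hypothetical helper.
void
LedgerManagerImpl::setLastClosedLedger(LedgerHeaderHistoryEntry const& lcl)
{
    // ... existing logic that assumes bucket/ledger state for lcl ...

    if (shouldScheduleBackgroundEvictionScan(lcl)) // hypothetical guard
    {
        // Scan against the state we just assumed, for the ledger we close next.
        mApp.getBucketManager().startBackgroundEvictionScan(
            lcl.header.ledgerSeq + 1);
    }
}
```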

void
BucketManagerImpl::resolveBackgroundEvictionScan(
AbstractLedgerTxn& ltx, uint32_t ledgerSeq,
LedgerKeySet const& modifiedKeys)
{
Contributor:

please use asserts like threadIsMain in critical codepaths like this one
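As a sketch of that suggestion applied to the signature shown above (assuming threadIsMain and a release-mode assert such as stellar-core's releaseAssert; the body is elided):

```cpp
void
BucketManagerImpl::resolveBackgroundEvictionScan(
    AbstractLedgerTxn& ltx, uint32_t ledgerSeq,
    LedgerKeySet const& modifiedKeys)
{
    // Resolving the scan mutates ltx and must not run on a worker thread;
    // fail loudly if it ever does.
    releaseAssert(threadIsMain());

    // ... apply eviction results against ltx ...
}
```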

auto& metric =
mApp.getMetrics().NewTimer({"bucketlistDB", "bulk", label});
iter = mBucketListDBBulkTimers.emplace(label, metric).first;
startBackgroundEvictionScan(ledgerSeq);
Contributor:

we're on the main thread here scanning synchronously, so why not just do it with scanForEvictionLegacySQL that is much less of a foot gun?

Contributor Author:

While scanForEvictionLegacySQL and the background eviction scan should behave identically, their implementations are fairly different. The goal is to have background scanning by default and eventually deprecate the synchronous scan. For this reason, I think it's safer to only call one variant of the scan, and of the two I default to the background scan when it's enabled. Personally, I think it's less of a footgun to only call into one implementation instead of mixing and matching two.

auto iter = mBucketListDBBulkTimers.find(label);
if (iter == mBucketListDBBulkTimers.end())
mEvictionFuture.wait();
auto evictionCandidates = mEvictionFuture.get();
Contributor:

I think we should reset the current future as soon as we retrieve the result: some compilers keep lambdas alive as long as the future is alive (even if it's invalidated). I'm not sure this is specifically a problem with what we capture at the moment, but it's definitely a footgun.

Contributor Author:

I'm not sure what you mean by reset here? Looking at the interface of std::future, there are no functions I can see like reset.

Contributor:

Oh I mean stop referencing the underlying object so it's destroyed properly, like we do in FutureBucket

mOutputBucketFuture = std::shared_future<std::shared_ptr<Bucket>>();

Contributor Author:

That should only be an issue with shared_future, not a regular future, though, right? There's no shared state being referenced here that can be kept alive; mEvictionFuture.get() moves the underlying state into auto evictionCandidates.

Contributor:

Oh yeah, you're right, I didn't see it's just std::future. In this case, yes, get should release the shared state as per the standard (note that std::move doesn't actually guarantee a move here, and the object can still be copied, but it doesn't matter as long as the future releases its shared state).

src/bucket/BucketManagerImpl.cpp — two further review threads (outdated, resolved)
});

mEvictionFuture = task->get_future();
mApp.postOnBackgroundThread(bind(&task_t::operator(), task),
Contributor:

Current worker threads are all low priority, so I think we need a high-priority thread for eviction scans, since they are time-sensitive.
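For readers less familiar with the pattern in the snippet above, a standalone std-only illustration of the packaged_task/future hand-off (EvictionResult and the worker thread are placeholders for the real task and postOnBackgroundThread):

```cpp
// Wrap the scan in a packaged_task, keep its future, and hand the task to a
// worker thread. Thread-priority handling is elided; this only shows the
// hand-off.
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <thread>

struct EvictionResult
{
    int numEntriesEvicted{0};
};

int
main()
{
    using task_t = std::packaged_task<EvictionResult()>;
    auto task = std::make_shared<task_t>([] {
        // ... scan the bucket list for expired entries ...
        return EvictionResult{7};
    });

    std::future<EvictionResult> evictionFuture = task->get_future();

    // Stand-in for postOnBackgroundThread: run the task on a worker thread.
    std::thread worker(std::bind(&task_t::operator(), task));

    // Later, on the "main" thread, block until the scan result is ready.
    EvictionResult res = evictionFuture.get();
    std::cout << "evicted " << res.numEntriesEvicted << " entries\n";

    worker.join();
    return 0;
}
```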

@MonsieurNicolas (Contributor):

(Quoting the earlier recommendation above to keep a foreground-eviction flag and to tweak I/O priority / OS hints.)

Actually, something else related to this (as I am seeing reports in Discord of something that could be related to this):
if we switch I/O patterns to be spiky, there is a potential to exhaust burst capacity when using bucketsdb on top of EBS. When volumes run out of burst capacity, all I/O becomes extremely slow (depending on quotas), which will certainly impact not just core's ability to close ledgers, but also other collocated services like Horizon.

void
runCurrentThreadWithLowPriority()
{
constexpr auto const LOW_PRIORITY_NICE = 5;
runCurrentThreadWithPriority(THREAD_PRIORITY_LOWEST);
Contributor Author:

I'm not sure about changing worker threads from THREAD_PRIORITY_BELOW_NORMAL to THREAD_PRIORITY_LOWEST. Any input from Windows folks would be appreciated.
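For context, a sketch of how the two levels being debated map to OS primitives, assuming a simple hypothetical wrapper (not stellar-core's actual implementation):

```cpp
// Sketch of what "below normal" vs "lowest" means at the OS level. On
// Windows these are distinct SetThreadPriority levels; on POSIX both map to
// a positive nice value (larger = lower priority).
#if defined(_WIN32)
#include <windows.h>

void
lowerCurrentThreadPriority()
{
    // THREAD_PRIORITY_BELOW_NORMAL is one step below normal;
    // THREAD_PRIORITY_LOWEST is two steps below.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
}
#else
#include <sys/resource.h>

void
lowerCurrentThreadPriority()
{
    // On Linux, nice values are per-thread; 5 is mildly deprioritized,
    // 19 would be the floor.
    constexpr int LOW_PRIORITY_NICE = 5;
    setpriority(PRIO_PROCESS, 0, LOW_PRIORITY_NICE);
}
#endif
```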

@SirTyson SirTyson marked this pull request as ready for review April 10, 2024 03:04
@SirTyson SirTyson force-pushed the background-eviction branch 2 times, most recently from c9b2cbe to 4bf9087 Compare April 10, 2024 18:24
@graydon (Contributor) left a comment:

Looks .. fairly plausible! I mean I didn't see earlier iterations of this but as it stands I don't see anything serious left to change. A few minor things:

  • I'm a bit nervous about a thread / packaged task capturing an unsynchronized & to the statistics structure. I guess it's the only writer and the only reader looks at the stats once the task is complete, but it's a somewhat implicit / distant synchronization. I might like to see these change to atomics, or be bundled together with the result as a value type that gets returned from the packaged task rather than a lambda capture, or something.
  • The number of tests that have an isUsingBucketListDB branch that basically skips subsequent integrity checking makes me uneasy; I realize we haven't got rid of the legacy SQL path yet, so you're still getting logic coverage from the other branch, but I assume we will someday, and it'd be nice to actually do the integrity checking against the bucketlist DB too! So .. maybe at least file a followup issue to refactor the integrity checks there so that these branches can go away?
  • It'd be good to figure out what happened with the test hashes removed from the json files and reconcile to what they're "supposed" to be on master currently.

@SirTyson (Contributor, Author):

  • I'm a bit nervous about a thread / packaged task capturing an unsynchronized & to the statistics structure. I guess it's the only writer and the only reader looks at the stats once the task is complete, but it's a somewhat implicit / distant synchronization. I might like to see these change to atomics, or be bundled together with the result as a value type that gets returned from the packaged task rather than a lambda capture, or something.

While probably not strictly necessary, to be on the safe side I've refactored the stats to be thread safe. The medida metrics themselves are still passed by reference, but the library is thread safe and they should always outlive the eviction scan task, so I think we're ok there.
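A sketch of the value-type direction suggested above, with illustrative names (not the PR's actual eviction result or statistics types): returning the counters by value from the task means the only hand-off between threads is the future itself.

```cpp
// Bundle the statistics with the scan result so the background task returns
// everything by value and captures no unsynchronized references.
#include <cstdint>
#include <future>
#include <vector>

struct LedgerKeyStub
{
    uint64_t id{0};
};

struct EvictionResultAndStats
{
    std::vector<LedgerKeyStub> candidates; // entries eligible for eviction
    uint64_t entriesScanned{0};
    uint64_t bytesScanned{0};
};

std::future<EvictionResultAndStats>
startScan()
{
    return std::async(std::launch::async, [] {
        EvictionResultAndStats out;
        // ... scan buckets, filling out.candidates and the counters ...
        out.entriesScanned = 123;
        return out; // stats travel with the result through the future
    });
}
```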

  • The number of tests that have an isUsingBucketListDB branch that basically skips subsequent integrity checking makes me uneasy; I realize we haven't got rid of the legacy SQL path yet, so you're still getting logic coverage from the other branch, but I assume we will someday, and it'd be nice to actually do the integrity checking against the bucketlist DB too! So .. maybe at least file a followup issue to refactor the integrity checks there so that these branches can go away?

Opened #4295.

  • It'd be good to figure out what happened with the test hashes removed from the json files and reconcile to what they're "supposed" to be on master currently.

I think something here got mixed up in a rebase; after rebasing on master and regenerating the meta files, everything seems fine.

@graydon graydon dismissed marta-lokhova’s stale review April 25, 2024 20:04

Marta verbally agreed this was ready to land, but is on vacation

@graydon (Contributor) commented Apr 25, 2024

r+ 3b8a91e

@latobarita latobarita merged commit 45b8f9d into stellar:master Apr 25, 2024
15 checks passed
@marta-lokhova (Contributor):

Marta verbally agreed this was ready to land, but is on vacation

Unfortunately, I didn't have a chance to review/approve v2 of the PR since I was out. My previous review is on v1, and I've resolved some comments there.
