Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only flush after recovery for retryable IOError #11880

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 8 additions & 6 deletions db/db_impl/db_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -453,7 +453,7 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
if (shutdown_initiated_) {
s = Status::ShutdownInProgress();
}
if (s.ok()) {
if (s.ok() && context.flush_after_recovery) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this scenario be avoided if we check shutdown_initiated_ here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be the same as the existing check in line 453 (in the same critical section). I don't think it would work since db mutex is released/re-acquired several times in FlushAllColumnFamilies().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's true. Its dropping the locking before scheduling the flush.

// Since we drop all non-recovery flush requests during recovery,
// and new memtable may fill up during recovery,
// schedule one more round of flush.
Expand All @@ -472,14 +472,16 @@ Status DBImpl::ResumeImpl(DBRecoverContext context) {
// FlushAllColumnFamilies releases and re-acquires mutex.
if (shutdown_initiated_) {
s = Status::ShutdownInProgress();
} else {
Copy link
Contributor

@jaykorean jaykorean Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else if (context.flush_after_recovery) {

would do the same without re-checking s.ok() below, but not a big deal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean scheduling compaction in the else if block? We need to do that when context.flush_after_recovery is false too.

Copy link
Contributor

@jaykorean jaykorean Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh never mind. I missed the collapsed part of the code. I didn't see we were calling FlushAll() above between line 460 - 471

I originally meant line 456 would stay as if (s.ok()), but then you will trigger FlushAll() when flush_after_recovery is false.

You can disregard my comment above

for (auto cfd : *versions_->GetColumnFamilySet()) {
SchedulePendingCompaction(cfd);
}
MaybeScheduleFlushOrCompaction();
}
}

if (s.ok()) {
for (auto cfd : *versions_->GetColumnFamilySet()) {
SchedulePendingCompaction(cfd);
}
MaybeScheduleFlushOrCompaction();
}

// Wake up any waiters - in this case, it could be the shutdown thread
bg_cv_.SignalAll();

Expand Down
1 change: 1 addition & 0 deletions db/error_handler.cc
Original file line number Diff line number Diff line change
Expand Up @@ -679,6 +679,7 @@ void ErrorHandler::RecoverFromRetryableBGIOError() {
return;
}
DBRecoverContext context = recover_context_;
context.flush_after_recovery = true;
int resume_count = db_options_.max_bgerror_resume_count;
uint64_t wait_interval = db_options_.bgerror_resume_retry_interval;
uint64_t retry_count = 0;
Expand Down
9 changes: 6 additions & 3 deletions db/error_handler.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,13 @@ class DBImpl;
// FlushReason, which tells the flush job why this flush is called.
struct DBRecoverContext {
FlushReason flush_reason;
bool flush_after_recovery;

DBRecoverContext() : flush_reason(FlushReason::kErrorRecovery) {}

DBRecoverContext(FlushReason reason) : flush_reason(reason) {}
DBRecoverContext()
: flush_reason(FlushReason::kErrorRecovery),
flush_after_recovery(false) {}
DBRecoverContext(FlushReason reason)
: flush_reason(reason), flush_after_recovery(false) {}
};

class ErrorHandler {
Expand Down