Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: introduce LogSnapshot API #130967

Merged
merged 1 commit into from
Sep 25, 2024
Merged

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Sep 18, 2024

The API allows reading from raft log while RawNode continues making progress. Practically, this means we can unlock Replica.mu after having obtained the log snapshot, and can safely read from it while only holding Replica.raftMu.

To be used for:

  • Fetching committed entries in Ready handler while only holding raftMu (instead of both mu and raftMu).
  • Building leader->follower MsgApp messages on demand, driven by RACv2. Today these messages can be constructed (and incur IO) eagerly on any Step while holding mu.

Part of #128779
Part of #130955

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 18, 2024

@tbg @nvanbenschoten Only the last commit is contentful. The other commits are clean-ups, lmk if you'd like me to split them out or drop.

@pav-kv pav-kv marked this pull request as ready for review September 18, 2024 18:55
@pav-kv pav-kv requested a review from a team as a code owner September 18, 2024 18:55
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good the (lo,hi] clean-up is great. I tried to scrutinize all of the index arithmetic and didn't see anything that looked off.
Some minor comments and questions about testing but I'll approve this pending some reaction to the comments, happy to do another quick pass on the resulting diff.

pkg/raft/log_unstable.go Outdated Show resolved Hide resolved
pkg/raft/log_unstable.go Outdated Show resolved Hide resolved
pkg/raft/log.go Outdated Show resolved Hide resolved
pkg/raft/log.go Outdated
// first is the first available log index.
first uint64
// storage contains the stable log entries.
storage Storage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably saw my comment on the prototype. LogSnapshot is definitely a step in the right direction, so I don't think it would make sense to try to take Storage out of it at this point. In other words, ignore that comment as far as making code changes now goes. In spirit, I still think there's something there, but it may be for another day (which may never come).

// entries (e.g. the committed entries subject to state machine application),
// allow the application to read them from LogSnapshot in the Ready handler.
// This gives the application direct control on resource allocation, and
// flexibility to do raft log IO without blocking RawNode operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/raft/log.go Outdated Show resolved Hide resolved
pkg/raft/log.go Outdated Show resolved Hide resolved
pkg/raft/log.go Outdated Show resolved Hide resolved
pkg/raft/log.go Outdated Show resolved Hide resolved
pkg/raft/log_unstable_test.go Outdated Show resolved Hide resolved
craig bot pushed a commit that referenced this pull request Sep 18, 2024
130980: raft: clean up raftLog entry slice indexing r=tbg a=pav-kv

This PR cleans up various `raftLog/unstable` methods, and converts the sub-slicing methods to `(lo, hi]` semantics instead of `[lo, hi)`.

All raft log slices are constructed in context of being appended after a certain index, so `(lo, hi]` addressing makes more sense, and simplifies a bunch of +-1 arithmetics.

Related to #130967, #130955

Co-authored-by: Pavel Kalinnikov <[email protected]>
@pav-kv pav-kv force-pushed the raft-log-snapshot-api branch 4 times, most recently from 7854e24 to f41ae68 Compare September 19, 2024 00:09
@nvanbenschoten nvanbenschoten requested a review from tbg September 19, 2024 14:26
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r9, 5 of 5 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)


pkg/raft/log.go line 30 at r9 (raw file):

//
// To access it safely, the user must not mutate the underlying raft log storage
// between the snapshot is obtained and the reads are done.

"between when the ..."


pkg/raft/rawnode.go line 123 at r9 (raw file):

// long as the application guarantees immutability of the underlying log storage
// while the snapshot is being used. Typically, this means that the application
// does not run a Ready() handling cycle until the snapshot is released.

It also means that the application does not Step the RawNode with any incoming messages, right?


pkg/raft/storage.go line 89 at r10 (raw file):

// LogStorage is a read API for the raft log.
type LogStorage interface {

The distinction here between LogStorage and LogStorageRO doesn't feel very strong to me, and it's not clear why it is valuable to users for them to be split into two separate interfaces. What about establishing a snapshot makes LogStorage any less "read-only"? Is there a benefit to separating these out, given that any LogStorage is also a LogStorageRO? If anything, I would think we would want to go the other way around and specialize a LogStorageSnapshot interface which extends LogStorage with a guarantee of immutability.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/rawnode.go line 123 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It also means that the application does not Step the RawNode with any incoming messages, right?

No, RawNode can be stepped - this is the key idea. With LogSnapshot() method, we effectively return a consistent RawNode snapshot (only the subset of it that's interesting to us in this case - the log). Once Replica.mu is released, the RawNode can run forward, but that will not invalidate our snapshot. What invalidates the snapshot is when we mutate the storage (and so the storage-related queries to the snapshot read updated state which can be inconsistent with our RawNode snapshot). So while raftMu is held, our snapshot is good, despite steps that can occur under Replica.mu only.

The LogSnapshot idea can be extended to a whole RawNode snapshot. This is almost what Ready() returns, but it mostly returns a delta today. Instead it could return a whole snapshot + indicators of what the delta is. Then the Ready handler would iterate the delta and send it to storage/peers (under raftMu only). Note that storage can mutate under RawNode's nose, but that is fine - RawNode always keeps the yet unstable part of the state in memory, so the storage can do the updates asynchronously, and RawNode will never try to read those parts.

However, RawNode may try to read from storage while only Replica.mu is held, and it can be racing with raftMu-only writes. There is no conflict in which keys are read/written, but there might be some conflicts/races w.r.t. in-memory-data structures that are involved in these paths.

I think this all exists today, we're just rethinking it.


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The distinction here between LogStorage and LogStorageRO doesn't feel very strong to me, and it's not clear why it is valuable to users for them to be split into two separate interfaces. What about establishing a snapshot makes LogStorage any less "read-only"? Is there a benefit to separating these out, given that any LogStorage is also a LogStorageRO? If anything, I would think we would want to go the other way around and specialize a LogStorageSnapshot interface which extends LogStorage with a guarantee of immutability.

Yeah, I like your idea in saying that LogStorageSnapshot extends LogStorage. I tried to play with this, it looks like:

// LogStorageRO is a read-only API for the raft log.
type LogStorageRO interface {
	// ...
}

// LogStorageSnapshot is a read-only API for the raft log. It guarantees
// immutability of outside the raft package while the user holds this snapshot.
type LogStorageSnapshot LogStorageRO

// LogStorage is the API for the raft log.
type LogStorage interface {
	LogStorageRO

	// LogSnapshot returns an immutable point-in-time log storage snapshot.
	LogSnapshot() LogStorageSnapshot
}

Which is almost identical to what's in this PR, modulo the type alias. Not sure if the type alias is worth it in this case.

Alternatively, we could structure it like below, to avoid one extra type:

// LogStorage is a read-only API for the raft log.
type LogStorage interface {
	// ...
}

// LogStorageSnapshot is a read-only API for the raft log. It guarantees
// immutability outside the raft package while the user holds this snapshot.
type LogStorageSnapshot LogStorage

type Storage interface {
	LogStorage
	StateStorage

	// LogSnapshot returns an immutable point-in-time log storage snapshot.
	LogSnapshot() LogStorageSnapshot
}

Or simply:

// LogStorage is a read-only API for the raft log.
type LogStorage interface {
	// ...
}

type Storage interface {
	LogStorage
	StateStorage

	// LogSnapshot returns an immutable point-in-time log storage snapshot.
	// The returned snapshot guarantees immutability outside raft package while
	// the user holds this snapshot.
	LogSnapshot() LogStorage
}

WDYT?

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/rawnode.go line 123 at r9 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

No, RawNode can be stepped - this is the key idea. With LogSnapshot() method, we effectively return a consistent RawNode snapshot (only the subset of it that's interesting to us in this case - the log). Once Replica.mu is released, the RawNode can run forward, but that will not invalidate our snapshot. What invalidates the snapshot is when we mutate the storage (and so the storage-related queries to the snapshot read updated state which can be inconsistent with our RawNode snapshot). So while raftMu is held, our snapshot is good, despite steps that can occur under Replica.mu only.

The LogSnapshot idea can be extended to a whole RawNode snapshot. This is almost what Ready() returns, but it mostly returns a delta today. Instead it could return a whole snapshot + indicators of what the delta is. Then the Ready handler would iterate the delta and send it to storage/peers (under raftMu only). Note that storage can mutate under RawNode's nose, but that is fine - RawNode always keeps the yet unstable part of the state in memory, so the storage can do the updates asynchronously, and RawNode will never try to read those parts.

However, RawNode may try to read from storage while only Replica.mu is held, and it can be racing with raftMu-only writes. There is no conflict in which keys are read/written, but there might be some conflicts/races w.r.t. in-memory-data structures that are involved in these paths.

I think this all exists today, we're just rethinking it.

So my mental model is:

  • raft storage reads can occur when either Replica.mu or raftMu is held (for reads or writes)
  • raft storage writes occur only when raftMu is held for writes?

Is this accurate? If not, would it make sense to make it accurate - are there gaps in this thinking?

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)


pkg/raft/rawnode.go line 123 at r9 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

No, RawNode can be stepped - this is the key idea. With LogSnapshot() method, we effectively return a consistent RawNode snapshot (only the subset of it that's interesting to us in this case - the log). Once Replica.mu is released, the RawNode can run forward, but that will not invalidate our snapshot. What invalidates the snapshot is when we mutate the storage (and so the storage-related queries to the snapshot read updated state which can be inconsistent with our RawNode snapshot). So while raftMu is held, our snapshot is good, despite steps that can occur under Replica.mu only.

The LogSnapshot idea can be extended to a whole RawNode snapshot. This is almost what Ready() returns, but it mostly returns a delta today. Instead it could return a whole snapshot + indicators of what the delta is. Then the Ready handler would iterate the delta and send it to storage/peers (under raftMu only). Note that storage can mutate under RawNode's nose, but that is fine - RawNode always keeps the yet unstable part of the state in memory, so the storage can do the updates asynchronously, and RawNode will never try to read those parts.

However, RawNode may try to read from storage while only Replica.mu is held, and it can be racing with raftMu-only writes. There is no conflict in which keys are read/written, but there might be some conflicts/races w.r.t. in-memory-data structures that are involved in these paths.

I think this all exists today, we're just rethinking it.

I'm confused by the comment that "as long as the application guarantees immutability of the underlying log storage while the snapshot is being used" then. Can't a Step due to an incoming message mutate the log storage? We might get incoming MsgApps from a new leader.

Or are you specifically referring to the raft.Storage persistent storage interface and distinguishing that from the in-memory storage of log entries within the raft library? If so, what you're saying makes more sense because only a Ready handling cycle will mutate the persistent log storage, but then this comment could make that more clear. Maybe by replacing "the underlying log storage" with "the underlying persistent log storage exposed through the Storage interface", or something along those lines.


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Yeah, I like your idea in saying that LogStorageSnapshot extends LogStorage. I tried to play with this, it looks like:

// LogStorageRO is a read-only API for the raft log.
type LogStorageRO interface {
	// ...
}

// LogStorageSnapshot is a read-only API for the raft log. It guarantees
// immutability of outside the raft package while the user holds this snapshot.
type LogStorageSnapshot LogStorageRO

// LogStorage is the API for the raft log.
type LogStorage interface {
	LogStorageRO

	// LogSnapshot returns an immutable point-in-time log storage snapshot.
	LogSnapshot() LogStorageSnapshot
}

Which is almost identical to what's in this PR, modulo the type alias. Not sure if the type alias is worth it in this case.

Alternatively, we could structure it like below, to avoid one extra type:

// LogStorage is a read-only API for the raft log.
type LogStorage interface {
	// ...
}

// LogStorageSnapshot is a read-only API for the raft log. It guarantees
// immutability outside the raft package while the user holds this snapshot.
type LogStorageSnapshot LogStorage

type Storage interface {
	LogStorage
	StateStorage

	// LogSnapshot returns an immutable point-in-time log storage snapshot.
	LogSnapshot() LogStorageSnapshot
}

Or simply:

// LogStorage is a read-only API for the raft log.
type LogStorage interface {
	// ...
}

type Storage interface {
	LogStorage
	StateStorage

	// LogSnapshot returns an immutable point-in-time log storage snapshot.
	// The returned snapshot guarantees immutability outside raft package while
	// the user holds this snapshot.
	LogSnapshot() LogStorage
}

WDYT?

I was thinking something closer to:

type LogStorage interface {
    ...
    LogSnapshot() LogStorageSnapshot
}

type LogStorageSnapshot interface {
    LogStorage
    IsImmutableSnapshot() // annotation method
}

This allows users of the interface which require an immutable snapshot to use LogStorageSnapshot, and others that don't to use LogStorage.

@nvanbenschoten nvanbenschoten self-requested a review September 19, 2024 15:48
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/rawnode.go line 123 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm confused by the comment that "as long as the application guarantees immutability of the underlying log storage while the snapshot is being used" then. Can't a Step due to an incoming message mutate the log storage? We might get incoming MsgApps from a new leader.

Or are you specifically referring to the raft.Storage persistent storage interface and distinguishing that from the in-memory storage of log entries within the raft library? If so, what you're saying makes more sense because only a Ready handling cycle will mutate the persistent log storage, but then this comment could make that more clear. Maybe by replacing "the underlying log storage" with "the underlying persistent log storage exposed through the Storage interface", or something along those lines.

Correction:

  • raft storage reads under Replica.mu are valid only because they never try to read keys that raftMu protects (the stable part as seen by RawNode)

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/rawnode.go line 123 at r9 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Correction:

  • raft storage reads under Replica.mu are valid only because they never try to read keys that raftMu protects (the stable part as seen by RawNode)

NB: raft.Storage exposes more than the stable state. It may contain an in-flight suffix that's not yet synced. But raft will never try to read it until it's reported back as stable.

So what's required from LogSnapshot() is not exactly immutability of storage (though we can implement it this way, it's stronger/simpler), but it requires immutability of keys that RawNode would read. For example, if we take a LogSnapshot() when RawNode thinks that the stable log is <= (term, index), then there must be no in-flight or future writes below this key until the snapshot is "released".

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/log.go line 41 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You probably saw my comment on the prototype. LogSnapshot is definitely a step in the right direction, so I don't think it would make sense to try to take Storage out of it at this point. In other words, ignore that comment as far as making code changes now goes. In spirit, I still think there's something there, but it may be for another day (which may never come).

Yeah, I think it's possible to move the "snapshot" concept out of RawNode entirely. In some end state, I imagine the replica_raftstorage.go and logstore packages merged into the canonical LogStorage implementation which observes all raft writes.


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I was thinking something closer to:

type LogStorage interface {
    ...
    LogSnapshot() LogStorageSnapshot
}

type LogStorageSnapshot interface {
    LogStorage
    IsImmutableSnapshot() // annotation method
}

This allows users of the interface which require an immutable snapshot to use LogStorageSnapshot, and others that don't to use LogStorage.

Looks a bit recursive (which I was trying to avoid), but otherwise looks ok to me.

Let's consider the Storage.Entries call. We basically want it to be flexible w.r.t. which mutex is the "main" one being held. Today, it's always r.mu, so we have r.mu.AssertHeld() in the Entries() implementation (and it then uses the r.mu.stateLoader etc). But we also want to be able to hold raftMu only (which, in this implementation, requires using r.raftMu.stateLoader instead; I think we don't need the stateloader in this call in the first place and have an idea how to avoid it, but will address this separately).

With your interface, it would be something like:

if IsImmutableSnapshot() {
	r.raftMu.AssertHeld()
	return logstore.LoadEntries(..., r.raftMu.stateLoader, ...)
}
r.mu.AssertHeld()
return logstore.LoadEntries(..., r.mu.stateLoader, ...)

With my interface, the snapshot-ness would be a static parameter of the type:

func (replicaRaftStorage) Entries(...) {
	r.mu.AssertHeld()
	return logstore.LoadEntries(...)
}

func (replicaRaftStorageSnap) Entries(...) {
	r.raftMu.AssertHeld()
	return logstore.LoadEntries(...)
}

Longer-term, I prefer the second approach, but the first one makes sense in the short term while we need more flexibility before arriving at the right abstractions (or eliminating all read paths with only mu held).

@pav-kv pav-kv force-pushed the raft-log-snapshot-api branch from f41ae68 to 5f682a4 Compare September 23, 2024 13:36
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/log.go line 30 at r9 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"between when the ..."

Done


pkg/raft/rawnode.go line 123 at r9 (raw file):
@nvanbenschoten

If so, what you're saying makes more sense because only a Ready handling cycle will mutate the persistent log storage, but then this comment could make that more clear.

Clarified "the underlying log storage" with a note that it is what's behind the Storage interface.

@pav-kv pav-kv force-pushed the raft-log-snapshot-api branch from 5f682a4 to 14bd4e2 Compare September 23, 2024 14:01
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Looks a bit recursive (which I was trying to avoid), but otherwise looks ok to me.

Let's consider the Storage.Entries call. We basically want it to be flexible w.r.t. which mutex is the "main" one being held. Today, it's always r.mu, so we have r.mu.AssertHeld() in the Entries() implementation (and it then uses the r.mu.stateLoader etc). But we also want to be able to hold raftMu only (which, in this implementation, requires using r.raftMu.stateLoader instead; I think we don't need the stateloader in this call in the first place and have an idea how to avoid it, but will address this separately).

With your interface, it would be something like:

if IsImmutableSnapshot() {
	r.raftMu.AssertHeld()
	return logstore.LoadEntries(..., r.raftMu.stateLoader, ...)
}
r.mu.AssertHeld()
return logstore.LoadEntries(..., r.mu.stateLoader, ...)

With my interface, the snapshot-ness would be a static parameter of the type:

func (replicaRaftStorage) Entries(...) {
	r.mu.AssertHeld()
	return logstore.LoadEntries(...)
}

func (replicaRaftStorageSnap) Entries(...) {
	r.raftMu.AssertHeld()
	return logstore.LoadEntries(...)
}

Longer-term, I prefer the second approach, but the first one makes sense in the short term while we need more flexibility before arriving at the right abstractions (or eliminating all read paths with only mu held).

Made it even simpler for now, with no IsImmutableSnapshot annotation. We're going to need a replicaRaftStorage wrapper anyway (which will know that this is a snapshot), so exposing the method is unnecessary.

@pav-kv pav-kv force-pushed the raft-log-snapshot-api branch from 14bd4e2 to 153a169 Compare September 23, 2024 14:07
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)


pkg/raft/log.go line 662 at r12 (raw file):

	return LogSnapshot{
		first:    l.firstIndex(),
		storage:  storage,

Should we be calling storage.LogSnapshot() here?


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Made it even simpler for now, with no IsImmutableSnapshot annotation. We're going to need a replicaRaftStorage wrapper anyway (which will know that this is a snapshot), so exposing the method is unnecessary.

The annotation method was a way for this library to require that implementors of LogStorage also opt-in to the stronger requirements of LogStorageSnapshot before unintentionally implementing both. It would also serve as a convenient place to hang off commentary about how application-layer immutability is being provided.

It's not particularly useful right now when there are no separate LogStorageSnapshot implementations, but would be useful once we add some (e.g. MemoryStorageSnapshot).

@tbg tbg requested a review from nvanbenschoten September 24, 2024 13:43
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pav-kv)


pkg/raft/rawnode.go line 124 at r12 (raw file):

// (exposed through the Storage interface) while the snapshot is being used.
// Typically, this means that the application does not run a Ready() handling
// cycle until the snapshot is released.

Would we gain anything if we instead used a consistent iterator? It would act like a snapshot, and could simplify the locking story even further - you could basically get a LogSnapshot under replicaMu and it would stay valid until you're done with it and is explicitly released. (Just a thought experiment with the aim to understand the constraints better, and not in the way of merging this PR).


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The annotation method was a way for this library to require that implementors of LogStorage also opt-in to the stronger requirements of LogStorageSnapshot before unintentionally implementing both. It would also serve as a convenient place to hang off commentary about how application-layer immutability is being provided.

It's not particularly useful right now when there are no separate LogStorageSnapshot implementations, but would be useful once we add some (e.g. MemoryStorageSnapshot).

Just caught up on this thread - suggestion to wrap up this discussion, as it seems to be what's holding this PR open and there'll be plenty of opportunity to revisit.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/raft/rawnode.go line 124 at r12 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Would we gain anything if we instead used a consistent iterator? It would act like a snapshot, and could simplify the locking story even further - you could basically get a LogSnapshot under replicaMu and it would stay valid until you're done with it and is explicitly released. (Just a thought experiment with the aim to understand the constraints better, and not in the way of merging this PR).

Part of the log storage is the ad-hoc Sideloaded storage which does not support consistent iterators/snapshots, so we should hold raftMu or mu to prevent a ready handling section messing with the log while we read it.

If not for this quirk, we could use a storage iterator/snapshot, though atm we are capable of doing this without support from the storage engine, at no extra cost, which seems nice.

We are [ab]using the fact that raftMu is locked anyway, and we don't need more than one snapshot at a time. We would need > 1 only if there were multiple independent threads using a log snapshot to do some work: e.g. if we separate the proposer vs acceptor vs learner jobs, then the proposer needs a snapshot to send entries to followers, while the learner needs it to send committed entries to the state machine apply stack; in the meantime, the acceptor job may be appending entries to the log storage.

This snapshotting should be the application-side implementation detail. API-wise, raft should only say "this is a snapshot of the log storage" (and it cares only that it's immutable up to a particular log mark), and then the application implements it one way or another. In some future, raft log storage will be in its own engine, and taking an engine iterator/snapshot approach might be cheaper / more relevant.

The API allows reading from raft log while RawNode continues making
progress. Practically, this means we can unlock Replica.mu after having
obtained the log snapshot, and can safely read from it while only
holding Replica.raftMu.

Epic: none
Release note: none
@pav-kv pav-kv force-pushed the raft-log-snapshot-api branch from 153a169 to 3fdfbef Compare September 25, 2024 18:43
Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/raft/log.go line 662 at r12 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we be calling storage.LogSnapshot() here?

Not sure how to best implement it, but the idea was:

  • The change should be a no-op. There is existing code in RawNode calling log storage methods, so we want these to be served from the original log storage impl rather than a snapshot. So by default we don't call LogSnapshot().
  • Only when we want to use the snapshot with extended guarantees we want to call the new LogSnapshot(). Initially it'll be RACv2 only.

Since LogStorageSnapshot type "extends" LogStorage, it's fine to use the GCD of the two (the LogStorage type) to implement the common functionality.

Changed the type of storage: field back to LogStorage, to this effect.


pkg/raft/storage.go line 89 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just caught up on this thread - suggestion to wrap up this discussion, as it seems to be what's holding this PR open and there'll be plenty of opportunity to revisit.

Leaving as is for now, and will adjust API (if needed) in the next PR that implements LogSnapshot() for our raft storage.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 25, 2024

TFTRs! Will follow up with the interface implementation soon.

bors r=tbg

@craig craig bot merged commit d926d5e into cockroachdb:master Sep 25, 2024
22 of 23 checks passed
@pav-kv pav-kv deleted the raft-log-snapshot-api branch September 25, 2024 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants