write durability: always commit a write to both kube and spicedb, or neither #16
Conversation
Force-pushed from c4cd734 to b3be9e8
This adds a durable saga that writes to SpiceDB and kube. If the kube write fails, the SpiceDB write is reverted. If the program crashes, the process picks up where it left off when it restarts by reading state from a sqlite db.
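To make the flow concrete, here is a minimal sketch of the both-or-neither logic in plain Go. The helper names (writeToSpiceDB, writeToKube, rollbackSpiceDB) are placeholders, not the PR's actual activities; in the real code each step runs as a durable activity whose progress is recorded, which this sketch omits.

package saga

import (
    "context"
    "errors"

    corev1 "k8s.io/api/core/v1"
)

// Hypothetical step helpers; the PR registers its real steps as durable activities.
var (
    writeToSpiceDB  func(ctx context.Context, ns *corev1.Namespace) error
    writeToKube     func(ctx context.Context, ns *corev1.Namespace) error
    rollbackSpiceDB func(ctx context.Context, ns *corev1.Namespace) error
)

// createWithRollback writes to SpiceDB, then to kube, and reverts the SpiceDB
// write if the kube write fails, so the two stores commit together or not at all.
func createWithRollback(ctx context.Context, ns *corev1.Namespace) error {
    if err := writeToSpiceDB(ctx, ns); err != nil {
        return err // nothing to undo yet
    }
    if err := writeToKube(ctx, ns); err != nil {
        // kube write failed: revert the SpiceDB write so neither side commits
        if rbErr := rollbackSpiceDB(ctx, ns); rbErr != nil {
            return errors.Join(err, rbErr)
        }
        return err
    }
    return nil
}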
Force-pushed from b3be9e8 to 3be6842
e2e/proxy_test.go (outdated)

// paul creates chani's namespace
Expect(failpoint.Disable("github.com/authzed/spicedb-kubeapi-proxy/pkg/proxy/panicKubeWrite")).To(Succeed())
_, err = paulClient.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
If the task has been made durable, wouldn't it be restarted where it left off before crashing? If so, I wouldn't expect this call to succeed, because after disabling panicKubeWrite, chani's Create call should have eventually succeeded after the process restart.
What happens in this test is that the task runner catches the panic and records it as a failure of the WriteToKube activity. So the workflow continues, and because the activity errored, it runs CheckKube, finds that the record doesn't exist, and rolls back the write.
I think you're right that that's what would happen if we crashed the whole process, and maybe we should work on a test harness that lets us actually test that. Or maybe we can contribute something to durabletask to disable panic recovery for tests?
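For readers unfamiliar with that behavior, the panic-to-failure conversion described above amounts to something like the following plain Go wrapper; this is a sketch of the idea only, the actual recovery lives inside the task runner library, not in the proxy:

package saga

import "fmt"

// runActivity converts a panic inside an activity into an ordinary error, so the
// workflow sees a failed activity (e.g. WriteToKube) instead of a crashed process,
// and can continue with its compensation steps (CheckKube, rollback).
func runActivity(name string, fn func() error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("activity %s panicked: %v", name, r)
        }
    }()
    return fn()
}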
That would explain it, but it seems weird to have multiple ways to recover from an error:
- one that handles the scenario where the process does not crash
- one that handles the scenario where the process crashes
Plus there is a record of the task having failed, so why would the workflow ignore it? Or are we implicitly telling it that "we are ok" because we run the CheckKube task?
Force-pushed from e9c8ce0 to db94273
err := CreateNamespace(ctx, chaniClient, chaniNamespace)
Expect(err).ToNot(BeNil())
// pessimistic locking reports a conflict, optimistic locking reports already exists
Expect(k8serrors.IsConflict(err) || k8serrors.IsAlreadyExists(err)).To(BeTrue())
Why report different errors? This breaks the contract depending on which implementation is used - I think both implementations should yield the same result.
Different operations are happening in kube and spicedb in the two cases; I'm mostly just passing the errors back and not trying to obfuscate them. I can look into normalizing them.
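One possible normalization, sketched below and not part of the PR, would be to map the pessimistic mode's conflict error onto the "already exists" error the optimistic mode returns, so callers see a single error type regardless of locking mode:

package proxy

import (
    k8serrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime/schema"
)

// normalizeLockError maps the pessimistic lock's conflict onto the same
// "already exists" error that optimistic locking reports.
func normalizeLockError(err error, gr schema.GroupResource, name string) error {
    if k8serrors.IsConflict(err) {
        return k8serrors.NewAlreadyExists(gr, name)
    }
    return err
}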
This also removes the failpoint library in favor of a quick local version that doesn't require transforming the codebase.
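For context, a "quick local version" of a failpoint can be as small as a registry of enabled names checked at injection sites. The sketch below is hypothetical (the PR's actual code may differ); the Enable/Disable signatures are inferred from the e2e test usage shown above:

package failpoint

import "sync"

var (
    mu      sync.Mutex
    enabled = map[string]struct{}{}
)

// Enable turns on the named failpoint.
func Enable(name string) error {
    mu.Lock()
    defer mu.Unlock()
    enabled[name] = struct{}{}
    return nil
}

// Disable turns off the named failpoint.
func Disable(name string) error {
    mu.Lock()
    defer mu.Unlock()
    delete(enabled, name)
    return nil
}

// Inject panics at the call site if the named failpoint is enabled, letting
// tests simulate a crash between the SpiceDB and kube writes.
func Inject(name string) {
    mu.Lock()
    defer mu.Unlock()
    if _, ok := enabled[name]; ok {
        panic("failpoint triggered: " + name)
    }
}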
Force-pushed from db94273 to 0f25b07
Closes #3
This adds a durable saga that writes to spicedb and kube, with the goal of ensuring that a write happens in both, or neither, but not just one or the other.
There are two methods of writing implemented: a pessimistic lock that prevents other requests from attempting to create same object at the same time, and an optimistic lock that detects when there are conflicts and rolls back or forward as needed.
Pessimistic outline:
- a create namespace foo call comes in from user:evan
- the request is hashed: xxhash(create, namespace, foo)
- lock relationship: workflow:xxhash(create,namespace,foo)#id@workflow_id:caca56e8-388b-46ca-bf2a-7fe325defe68
- resource relationship: namespace:foo#creator@user:evan
- precondition: operation: OPERATION_MUST_NOT_MATCH, filter: workflow:xxhash(create,namespace,foo)#id@workflow_id:*

Optimistic outline:
- a create namespace foo call comes in from user:evan
There are pros and cons to each approach; for now both are supported, and we can configure them per request type or per instance of the proxy.
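To make the pessimistic outline above concrete, here is a sketch (not the PR's code) of what the lock write could look like with the authzed-go client, reusing the example identifiers from the outline:

package proxy

import (
    "context"

    v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
)

// acquireLockAndWrite writes the workflow lock relationship and the creator
// relationship in one transaction, guarded by a precondition that no workflow
// already holds the lock. lockID stands in for xxhash(create, namespace, foo).
func acquireLockAndWrite(ctx context.Context, client v1.PermissionsServiceClient, lockID, workflowID string) error {
    _, err := client.WriteRelationships(ctx, &v1.WriteRelationshipsRequest{
        OptionalPreconditions: []*v1.Precondition{{
            // fail the write if any workflow:<lockID>#id@... relationship exists
            Operation: v1.Precondition_OPERATION_MUST_NOT_MATCH,
            Filter: &v1.RelationshipFilter{
                ResourceType:       "workflow",
                OptionalResourceId: lockID,
                OptionalRelation:   "id",
            },
        }},
        Updates: []*v1.RelationshipUpdate{
            {
                Operation: v1.RelationshipUpdate_OPERATION_CREATE,
                Relationship: &v1.Relationship{
                    Resource: &v1.ObjectReference{ObjectType: "workflow", ObjectId: lockID},
                    Relation: "id",
                    Subject:  &v1.SubjectReference{Object: &v1.ObjectReference{ObjectType: "workflow_id", ObjectId: workflowID}},
                },
            },
            {
                Operation: v1.RelationshipUpdate_OPERATION_CREATE,
                Relationship: &v1.Relationship{
                    Resource: &v1.ObjectReference{ObjectType: "namespace", ObjectId: "foo"},
                    Relation: "creator",
                    Subject:  &v1.SubjectReference{Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "evan"}},
                },
            },
        },
    })
    return err
}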
The durability of this function means that inputs, outputs, and progress state are stored in a sqlite database. The goal is to be robust to service failures (SpiceDB and Kube API) and process failures (network dies, process crashes and restarts).
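The PR gets this persistence from its durable task runner rather than hand-written SQL, but purely to illustrate the resume-from-sqlite idea, recording step completion could look like the following sketch (table name and schema are made up, and a UNIQUE(workflow_id, step) constraint is assumed):

package saga

import (
    "context"
    "database/sql"

    _ "github.com/mattn/go-sqlite3" // illustrative driver choice only
)

// markStepDone records that a saga step completed, so a restarted process can
// skip it and resume from the first incomplete step.
func markStepDone(ctx context.Context, db *sql.DB, workflowID, step string) error {
    _, err := db.ExecContext(ctx,
        `INSERT OR IGNORE INTO saga_steps (workflow_id, step) VALUES (?, ?)`,
        workflowID, step)
    return err
}

// stepDone reports whether a saga step has already been recorded as complete.
func stepDone(ctx context.Context, db *sql.DB, workflowID, step string) (bool, error) {
    var n int
    err := db.QueryRowContext(ctx,
        `SELECT COUNT(*) FROM saga_steps WHERE workflow_id = ? AND step = ?`,
        workflowID, step).Scan(&n)
    return n > 0, err
}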
The tests make use of failpoints to inject faults at specific places, and then verify that either both writes effectively happened, or neither did.
This initial implementation just deals with namespace objects but should be fairly straightforward to make generic for other types. I'm assuming we'll spend time on that in #6.