-
Notifications
You must be signed in to change notification settings - Fork 205
handle-exception lifecycle doesn't appear to apply to write-batch #491
Comments
Yup, write-batch isn't handled like read-batch or the others. https://github.com/onyx-platform/onyx/blob/develop/src/onyx/peer/task_lifecycle.clj#L284 |
Hmm, I guess these should already be covered by after-batch which was placed in the task-lifecycle. |
Ah, it isn't covered by that because it doesn't invoke those task lifecycle calls via restartable-invocation. So that's the overall issue. |
@MichaelDrogalis this would be a good one to take if you have some time. Basically, I'm not seeing handle-exception be able to handle an exception thrown in write-batch. |
PR #505 successfully passes Jepsen tests without :onyx/restart-pred-fn. I think there might be additional task-lifecycle stages where we should allow restarts though. For example, build-new-segments and flow-retry-segments. I think build-new-segments isn't particularly risky, however I think that when a user really wants a job to be up no matter what, then we should cover all stages of the lifecycle. |
I remember that we discussed this a while back and decided not to restart on exceptions thrown by internal Onyx code - e.g. nothing a user could have written. If we have a bug in Onyx, it should crash hard, because it's unlikely that there's hope of recovery for doing the right thing. I think assign-windows falls under this category. |
I think the distinction should be whether the function is pure or not. If we're doing stateful things, they might be transitory / fixed if the peer is restarted. I think assign-windows, and possibly flow-retry-segments falls in this category (especially because it calls emit-latency which uses monitoring). |
Monitoring calls, even when they fail, should never crash the job. We need to be tolerant of a failure here and there to emit metrics. I still think that even if it's a transient, stateful issue, and it's Onyx's fault, restarting gives the user a false sense of security when something wrong is lurking. |
Merged. |
Found by jepsen. I switched out :onyx/restart-pred-fn for a restart lifecycle and the job still seems to be being killed under certain scenarios. The main one I noticed is in write-batch. I'll look into this.
The text was updated successfully, but these errors were encountered: