Skip to content
This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

handle-exception lifecycle doesn't appear to apply to write-batch #491

Closed
lbradstreet opened this issue Jan 20, 2016 · 9 comments
Closed
Assignees
Labels
Milestone

Comments

@lbradstreet
Copy link
Member

Found by jepsen. I switched out :onyx/restart-pred-fn for a restart lifecycle and the job still seems to be being killed under certain scenarios. The main one I noticed is in write-batch. I'll look into this.

@lbradstreet lbradstreet self-assigned this Jan 20, 2016
@lbradstreet
Copy link
Member Author

Yup, write-batch isn't handled like read-batch or the others. https://github.com/onyx-platform/onyx/blob/develop/src/onyx/peer/task_lifecycle.clj#L284

@lbradstreet lbradstreet added this to the 0.8.5 milestone Jan 20, 2016
@lbradstreet
Copy link
Member Author

Hmm, I guess these should already be covered by after-batch which was placed in the task-lifecycle.

@lbradstreet
Copy link
Member Author

Ah, it isn't covered by that because it doesn't invoke those task lifecycle calls via restartable-invocation. So that's the overall issue.

@lbradstreet
Copy link
Member Author

@MichaelDrogalis this would be a good one to take if you have some time.

Basically, I'm not seeing handle-exception be able to handle an exception thrown in write-batch.

@lbradstreet
Copy link
Member Author

PR #505 successfully passes Jepsen tests without :onyx/restart-pred-fn.

I think there might be additional task-lifecycle stages where we should allow restarts though. For example, build-new-segments and flow-retry-segments. I think build-new-segments isn't particularly risky, however I think that when a user really wants a job to be up no matter what, then we should cover all stages of the lifecycle.

@MichaelDrogalis
Copy link
Contributor

I remember that we discussed this a while back and decided not to restart on exceptions thrown by internal Onyx code - e.g. nothing a user could have written. If we have a bug in Onyx, it should crash hard, because it's unlikely that there's hope of recovery for doing the right thing. I think assign-windows falls under this category.

@lbradstreet
Copy link
Member Author

I think the distinction should be whether the function is pure or not. If we're doing stateful things, they might be transitory / fixed if the peer is restarted.

I think assign-windows, and possibly flow-retry-segments falls in this category (especially because it calls emit-latency which uses monitoring).

@MichaelDrogalis
Copy link
Contributor

Monitoring calls, even when they fail, should never crash the job. We need to be tolerant of a failure here and there to emit metrics.

I still think that even if it's a transient, stateful issue, and it's Onyx's fault, restarting gives the user a false sense of security when something wrong is lurking.

@MichaelDrogalis
Copy link
Contributor

Merged.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants