Improve panic recovery handling #5

Open
irvinlim opened this issue Mar 25, 2022 · 2 comments
Labels
area/stability Related to stability and reliability of the operator

Comments

@irvinlim
Member

The code currently lacks recovery routines in places where it could panic (e.g. nil pointer dereferences). Since we start many goroutines at different points, we need to investigate a robust way to ensure that we do not forget to handle panic recovery in any of them.

Some areas which require panic recovery:

  • Controller goroutines
  • HTTP handler goroutines
  • Other background workers (e.g. in controllerruntime)
@irvinlim irvinlim added good first issue Good for newcomers area/stability Related to stability and reliability of the operator labels Mar 25, 2022
@joaokorcz

Hey @irvinlim, how are you? I'd like to contribute to solving this issue. Is anyone else already working on it? I saw #60, and I think that problem is closely related to the one reported here.

@irvinlim
Member Author

irvinlim commented Feb 17, 2024

Hey @joaokorcz! Apologies for the late reply as I was on vacation.

This is more of a "blanket" issue to try to cover panic scenarios. I think there are a few problems we want to address:

  1. A panic in any goroutine can crash the whole controller (even if you defer a recover in the main goroutine, since recover only catches panics raised in its own goroutine), which degrades availability
  2. There isn't a good way to detect such crashes from a developer's perspective, so we have to rely on users to report them
  3. In certain scenarios, a panic might be the only way out of a bad situation (e.g. corrupt state that we can't recover from)

I appreciate the enthusiasm to contribute! However, I don't think this issue is well-scoped enough for me to point you to specific places where you could make immediate fixes. I'll remove the good first issue label, which I believe was applied incorrectly.

I'll mark some other issues as "good first issue" in a bit, so if you are still interested in contributing, do check those out!

@irvinlim irvinlim removed the good first issue Good for newcomers label Feb 17, 2024