Improve unwind info persisting failure handling #23
On some production machines, persisting the unwind info fails. We are currently investigating this and so far we don't know what the culprit is.
On those hosts we get pretty much 100% unwind errors, which should not happen. This led me to notice that errors while persisting the unwind info aren't handled properly.
For example, once a shard is full, the current code ignores this, wipes the in-memory shard, and assigns a new BPF shard. This is not correct.
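To illustrate the intended behaviour, here is a minimal sketch in Rust (assumed here to be the project's language). It is not the code in this PR: `PersistError`, `ShardState`, and `flush` are hypothetical names. The point it shows is that the in-memory shard is only cleared, and the next shard only selected, after persisting succeeds.

```rust
/// Hypothetical error type; the real project's errors may differ.
#[derive(Debug, PartialEq)]
enum PersistError {
    ShardFull,
}

/// Hypothetical in-memory view of the shard currently being filled.
struct ShardState {
    rows: Vec<u64>,
    shard_index: usize,
}

impl ShardState {
    /// Persist the buffered rows using `persist`. On failure the error is
    /// returned and the in-memory state is left untouched. The rows are
    /// cleared and the next shard is selected only on success.
    fn flush<F>(&mut self, mut persist: F) -> Result<(), PersistError>
    where
        F: FnMut(usize, &[u64]) -> Result<(), PersistError>,
    {
        persist(self.shard_index, &self.rows)?;
        self.rows.clear();
        self.shard_index += 1;
        Ok(())
    }
}
```

With this shape the caller sees the error and decides what to do next, instead of the failure being silently dropped.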
Test Plan
Forced some errors in this logic and verified that the current in-memory state wasn't wiped. We need failure injection during testing to ensure all these cases are covered and don't regress.
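One possible shape for that failure injection, sketched in Rust under the assumption that persisting can be put behind a trait. `UnwindInfoStore` and `FailingStore` are hypothetical names, not existing types in this repository. The test double fails on chosen calls and records successful writes.

```rust
/// Hypothetical abstraction over whatever writes unwind info to the BPF maps.
trait UnwindInfoStore {
    fn persist(&mut self, shard: usize, rows: &[u64]) -> Result<(), String>;
}

/// Test double that fails on the chosen call numbers (1-based) and
/// records every write that succeeded.
struct FailingStore {
    fail_on_calls: Vec<usize>,
    calls: usize,
    persisted: Vec<(usize, Vec<u64>)>,
}

impl UnwindInfoStore for FailingStore {
    fn persist(&mut self, shard: usize, rows: &[u64]) -> Result<(), String> {
        self.calls += 1;
        if self.fail_on_calls.contains(&self.calls) {
            return Err(format!("injected failure on call {}", self.calls));
        }
        self.persisted.push((shard, rows.to_vec()));
        Ok(())
    }
}
```

Tests could then drive the persisting logic through a `FailingStore` and assert that nothing is wiped on the calls that fail.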
cc @gmarler