Improve unwind info persisting failure handling #23

javierhonduco · 2024-04-23T11:50:42Z

On some production machines, persisting the unwind info fails. We are currently investigating this and so far we don't know what the culprit is.

On those hosts we get pretty much 100% unwind errors, which should not happen. This led me to notice that errors while persisting the unwind info aren't handled properly.

For example, once a shard is full, the current code ignores this, wipes the in-memory shard, and assigns a new BPF shard. This is not correct.
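To illustrate the intended behaviour, here is a minimal sketch in which the in-memory shard is only replaced once persisting has succeeded. Every name in it (`PersistError`, `Shard`, `ShardStore`, `UnwindInfoManager`, `AlwaysFails`) is hypothetical and chosen for illustration; none of it is taken from the actual code changed in this PR.

```rust
/// Hypothetical error for a failed write of a shard to its BPF map.
#[derive(Debug, PartialEq)]
enum PersistError {
    MapUpdateFailed,
}

/// Hypothetical in-memory shard holding unwind rows that are not yet persisted.
#[derive(Debug)]
struct Shard {
    id: u64,
    rows: Vec<u64>,
}

/// Hypothetical stand-in for whatever writes a shard into a BPF map.
trait ShardStore {
    fn persist(&mut self, shard: &Shard) -> Result<(), PersistError>;
}

/// A store whose writes always fail, useful to exercise the error path.
struct AlwaysFails;

impl ShardStore for AlwaysFails {
    fn persist(&mut self, _shard: &Shard) -> Result<(), PersistError> {
        Err(PersistError::MapUpdateFailed)
    }
}

struct UnwindInfoManager {
    current: Shard,
    capacity: usize,
}

impl UnwindInfoManager {
    /// Persists the current shard if it is full and only then starts a new one.
    /// Returns `Ok(true)` when a new shard was assigned, `Ok(false)` when the
    /// shard still has room. On error the in-memory shard is left untouched,
    /// so the caller can retry later.
    fn rotate_if_full<S: ShardStore>(&mut self, store: &mut S) -> Result<bool, PersistError> {
        if self.current.rows.len() < self.capacity {
            return Ok(false);
        }
        // `?` returns early on failure, before any in-memory state is modified.
        store.persist(&self.current)?;
        self.current = Shard {
            id: self.current.id + 1,
            rows: Vec::new(),
        };
        Ok(true)
    }
}
```

With `AlwaysFails` as the store, `rotate_if_full` returns the error and `current` keeps both its `id` and its rows, which is the property the old code violated.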

Test Plan

Forced some errors in this logic and confirmed that the current in-memory state wasn't wiped. We need failure injection during testing to ensure all these cases are covered and don't regress.
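As a rough idea of what failure injection could look like, the sketch below uses a test double that fails a configurable number of writes before succeeding. `InjectedFailure`, `FlakyStore` and `flush_pending` are hypothetical names made up for this example and do not exist in the codebase.

```rust
/// Hypothetical error produced by the injected failures.
#[derive(Debug, PartialEq)]
struct InjectedFailure;

/// Test double that fails the first `failures_left` writes, then succeeds.
struct FlakyStore {
    failures_left: u32,
    persisted: Vec<Vec<u64>>,
}

impl FlakyStore {
    fn new(failures: u32) -> Self {
        FlakyStore {
            failures_left: failures,
            persisted: Vec::new(),
        }
    }

    /// Records the rows on success; returns an error while failures remain.
    fn persist(&mut self, rows: &[u64]) -> Result<(), InjectedFailure> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            return Err(InjectedFailure);
        }
        self.persisted.push(rows.to_vec());
        Ok(())
    }
}

/// Writes the pending rows and clears them only after the write succeeded.
fn flush_pending(pending: &mut Vec<u64>, store: &mut FlakyStore) -> Result<(), InjectedFailure> {
    store.persist(pending.as_slice())?;
    pending.clear();
    Ok(())
}
```

A test can build `FlakyStore::new(2)`, call `flush_pending` three times, and assert that the pending rows survive the two failed attempts and are cleared only after the third one succeeds.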

cc @gmarler

On some production machines, persisting the unwind info fails. We are currently investigating this and so far we don't know what the culprit is. Typically after some attempts, persisting eventually succeeds.

On those hosts we get pretty much 100% unwind errors, which should not happen. This led me to notice that errors while persisting the unwind info aren't handled properly.

For example, once a shard is full, the current code ignores this, wipes the in-memory shard, and assigns a new BPF shard. This is not correct.

Test Plan
=========

Forced some errors in this logic and confirmed that the current in-memory state wasn't wiped. We need failure injection during testing to ensure all these cases are covered and don't regress.

javierhonduco force-pushed the improve-persisting-failures-handling branch from b978e90 to fdebff4 Compare April 23, 2024 11:50

javierhonduco force-pushed the improve-persisting-failures-handling branch from fdebff4 to 67e97db Compare April 23, 2024 13:23

javierhonduco merged commit ab2dd8d into main Apr 23, 2024
4 checks passed

javierhonduco deleted the improve-persisting-failures-handling branch April 23, 2024 13:39
