fix watchablestore runlock bug #13505
Conversation
Can you please add a test that would prevent future regression?
(force-pushed from 8deda4c to 7e6c29c)
It is difficult to reproduce this bug in a test case, because the probability of triggering it is low. First, we must make bolt db re-mmap by filling in a large amount of data. I have added some annotations in the code for now. Do you have any suggestions for a test case?
Not my area of expertise, but we are running unit tests with data race detection. Maybe we could add a test case that triggers the race in this scenario and rely on the race detector to report the error. Based on the code I think the fix is correct; I just want to make sure we are thinking long term about how to avoid making the same mistake again.
Great finding.
This looks like a correct bug fix. I have a patch for logging more information in case of failure. Shall we check in both changes and wait and see if we can get this reproduced?
My only concern is that holding the lock here can significantly reduce throughput. Shall we make a copy of the data and then call kvsToEvents? syncWatchers() is not a time-sensitive operation. @ptabor
Another related issue: #13067. In our company we have observed this in multiple versions of etcd, but only recently have we started seeing it more frequently. The reason might be that we have improved etcd's performance, so there is a higher chance of this happening.
Without benchmarking we will not know for sure (and your framework does not yet support RW transactions while watches are open, right?). The current kvsToEvents is in practice only parsing protos. I wouldn't consider that an especially heavy operation compared to making an additional copy of a [][]byte just to release the shared lock faster.
Correct, this part is not covered in the script yet. Also, syncWatchers() is called when there are lots of write operations and the watchers are out of sync, so holding the lock might alleviate(?) some of the write workload. I am okay with this; we can wait and see if there is any feedback later in case of a performance issue.
Btw, we may also need to backport this change to earlier versions.
It's a great fix, but the changelog hasn't been updated. Could you submit a follow-up PR to update the changelog?
First, note that the values (vs) returned by UnsafeRange are shallow copies: the byte slices point directly into bolt db's mmap'd memory.
So if bolt db re-mmaps, that memory is reclaimed, and dereferencing the stale pointers triggers SIGSEGV: