glusterfs client may crash/hang when the same file is written using different fds simultaneously when a race is hit #3065
@pranithk have you made any progress here?

Thank you for your contributions.

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.

This should be fixed.

Thank you for your contributions.

hi @pranithk, is there any method to reproduce this issue?
Gluster client crashed with:

Thread-1

On inspecting the core:

The frame is stuck in the `lock->post_op` list.

At the same time, Thread-26 is in the following state:

The old fd was in the process of getting closed (FLUSH fop), and at around the same time a write fop was issued on the new fd.
The flush fop in this case leads to an invalid state of the lock structure where `lock->acquired` is `true`, `lock->delay_timer` is `NULL` and `lock->release` is `false`. Ideally either `lock->release` should be `true` or `lock->delay_timer` should be non-`NULL`. This problem happens because `afr_wakeup_same_fd_delayed_op()` sets `lock->delay_timer` to `NULL` but does not set `lock->release` to `true`, which leads a write on a new fd to go for a lock instead of being put in the wait list. Since I was using a DEBUG build, it led to a crash. With a release build, it would have led to hangs because of stale inodelks.

Initially I thought setting `lock->release` to `true` would be the fix, but after thinking a bit more, it looks like the implementation of flush when pending post-op operations are present is not completely handled. It resumes the flush operation as soon as the first operation with the same fd finishes; instead, the flush operation should finish only after all the operations that were pending at the time of flush have finished.

FYI @itisravi @karthik-us @xhernandez I remember you guys working on some issues where there were stale inode locks.