-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TA1143] Fix deadlock issue in finish_async_tasks. #71
[TA1143] Fix deadlock issue in finish_async_tasks. #71
Conversation
Signed-off-by: satbir <[email protected]>
Signed-off-by: satbir <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes are good..
cmd/zrepl/zrepl.c
Outdated
/* First command should be OPEN */ | ||
while (zinfo == NULL) { | ||
if (open_zvol(fd, &zinfo) != 0) | ||
goto exit; | ||
snprintf(tinfo, 50, "io_receiver_%lu", zinfo->zvol_guid); | ||
prctl(PR_SET_NAME, tinfo, 0, 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the prctl man page:
The name can be up to 16 bytes long,
including the terminating null byte. (If the length of the string, including the terminating null byte, exceeds 16 bytes, the string is
silently truncated.)
we cannot store the whole guid there. We can number them from 1 to n (using increasing static counter). However I would probably just revert it to the previous code and just have a name io_receiver for all such threads. By using gdb we can always figure out which pool it belongs to (printing zinfo). Unless there is some advantage if the thread name is unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar command applies to the other thread below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted back this change
lib/libzrepl/mgmt_conn.c
Outdated
@@ -499,6 +499,8 @@ uzfs_zvol_mgmt_do_handshake(uzfs_mgmt_conn_t *conn, zvol_io_hdr_t *hdrp, | |||
*/ | |||
mgmt_ack.zvol_guid = dsl_dataset_phys( | |||
zv->zv_objset->os_dsl_dataset)->ds_guid; | |||
zinfo->zvol_guid = mgmt_ack.zvol_guid; | |||
LOG_INFO("Volume:%s has zvol_guid:%lu", zinfo->name, zinfo->zvol_guid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit but can we improve the message as "Zvol %s has guid %lu"
? this function is called also for ZVOL_OPCODE_PREPARE_FOR_REBUILD in which case guid is already assigned to zinfo->zvol_guid. So maybe if zinfo->zvol_guid != 0 ...
would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
lib/libzrepl/mgmt_conn.c
Outdated
@@ -595,9 +597,8 @@ finish_async_tasks(void) | |||
rc = reply_nodata(async_task->conn, async_task->status, | |||
async_task->hdr.opcode, async_task->hdr.io_seq); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not ignore rc here. If epoll fails then it is a serious error leading to internally inconsistent state which is difficult to recover from. we could just break if rc != 0 and change final return to return (rc). That will take care of releasing the mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed deadlock issue. I will raise a PR to address mgmt_thread exit issue which need discussion among team members.
Signed-off-by: satbir <[email protected]>
* Fixed hdr.len check in io_receiver thread Signed-off-by: satbir <[email protected]>
Fixed, deadlock issue in finish_async_tasks(). In error case, we were returning without releasing lock. Now we do not return even if there is error on that particular connection.
Some other minor fixes are...
Signed-off-by: satbir [email protected]