Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TA1143] Fix deadlock issue in finish_async_tasks. #71

Merged
merged 5 commits into from
Jun 20, 2018

Conversation

satbirchhikara
Copy link
Collaborator

Fixed, deadlock issue in finish_async_tasks(). In error case, we were returning without releasing lock. Now we do not return even if there is error on that particular connection.

Some other minor fixes are...

  1. Use GUID of volume for naming threads in uzfs_zvol_io_receiver() & uzfs_zvol_io_ack_sender().
  2. uzfs_zvol_io_receiver(), should check hdr->len for read and write and error out if hdr->len is zero for read/write.
  3. uzfs_zvol_worker() & uzfs_zvol_io_ack_sender(), zinfo counters need to be updated properly, as of today, all other opcode except writes are treated as read.
  4. uzfs_zvol_io_ack_sender(), should use SHUTDOWN(fd) before closing fd.
  5. Better to log thread exit with GUID so that we know which thread(related to which replica got exited).

Signed-off-by: satbir [email protected]

Copy link

@vishnuitta vishnuitta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes are good..

/* First command should be OPEN */
while (zinfo == NULL) {
if (open_zvol(fd, &zinfo) != 0)
goto exit;
snprintf(tinfo, 50, "io_receiver_%lu", zinfo->zvol_guid);
prctl(PR_SET_NAME, tinfo, 0, 0, 0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the prctl man page:

The name can  be  up  to  16  bytes  long,
              including  the  terminating  null  byte.  (If the length of the string, including the terminating null byte, exceeds 16 bytes, the string is
              silently truncated.)

we cannot store the whole guid there. We can number them from 1 to n (using increasing static counter). However I would probably just revert it to the previous code and just have a name io_receiver for all such threads. By using gdb we can always figure out which pool it belongs to (printing zinfo). Unless there is some advantage if the thread name is unique.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar command applies to the other thread below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted back this change

@@ -499,6 +499,8 @@ uzfs_zvol_mgmt_do_handshake(uzfs_mgmt_conn_t *conn, zvol_io_hdr_t *hdrp,
*/
mgmt_ack.zvol_guid = dsl_dataset_phys(
zv->zv_objset->os_dsl_dataset)->ds_guid;
zinfo->zvol_guid = mgmt_ack.zvol_guid;
LOG_INFO("Volume:%s has zvol_guid:%lu", zinfo->name, zinfo->zvol_guid);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit but can we improve the message as "Zvol %s has guid %lu" ? this function is called also for ZVOL_OPCODE_PREPARE_FOR_REBUILD in which case guid is already assigned to zinfo->zvol_guid. So maybe if zinfo->zvol_guid != 0 ... would be more appropriate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -595,9 +597,8 @@ finish_async_tasks(void)
rc = reply_nodata(async_task->conn, async_task->status,
async_task->hdr.opcode, async_task->hdr.io_seq);
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not ignore rc here. If epoll fails then it is a serious error leading to internally inconsistent state which is difficult to recover from. we could just break if rc != 0 and change final return to return (rc). That will take care of releasing the mutex.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed deadlock issue. I will raise a PR to address mgmt_thread exit issue which need discussion among team members.

@vishnuitta vishnuitta merged commit eb7d617 into mayadata-io:zfs-0.7-release Jun 20, 2018
@satbirchhikara satbirchhikara deleted the deadlock branch June 20, 2018 12:13
jkryl pushed a commit that referenced this pull request Jun 26, 2018
* Fixed hdr.len check in io_receiver thread

Signed-off-by: satbir <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants