[TA1143] Fix deadlock issue in finish_async_tasks. #71

satbirchhikara · 2018-06-20T06:24:38Z

Fixed a deadlock in finish_async_tasks(). In the error case, we were returning without releasing the lock. Now we no longer return early, even if there is an error on a particular connection.
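A minimal sketch of the bug and of the fix described above, assuming a single pthread mutex guards the async-task list. The names tasks_mtx, reply_to_conn(), finish_tasks_buggy() and finish_tasks_fixed() are hypothetical stand-ins, not the real zrepl code. In the buggy shape, an early return on a failed reply leaves the mutex locked, so the next caller of pthread_mutex_lock() blocks forever. In the fixed shape, the failure is counted and the function always reaches its single unlock.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the async-task list and its lock. */
#define NTASKS 3

static pthread_mutex_t tasks_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Simulated per-connection reply: fails only for the task at fail_idx. */
static int
reply_to_conn(int idx, int fail_idx)
{
	return (idx == fail_idx ? -1 : 0);
}

/* Buggy shape: the early return leaves tasks_mtx locked. */
static int
finish_tasks_buggy(int fail_idx)
{
	pthread_mutex_lock(&tasks_mtx);
	for (int i = 0; i < NTASKS; i++) {
		if (reply_to_conn(i, fail_idx) != 0)
			return (-1);	/* BUG: mutex is never released */
	}
	pthread_mutex_unlock(&tasks_mtx);
	return (0);
}

/*
 * Fixed shape: a failure on one connection is counted, the loop carries
 * on with the remaining tasks, and the single exit path always unlocks.
 * Returns the number of replies that failed.
 */
static int
finish_tasks_fixed(int fail_idx)
{
	int nfailed = 0;

	pthread_mutex_lock(&tasks_mtx);
	for (int i = 0; i < NTASKS; i++) {
		if (reply_to_conn(i, fail_idx) != 0)
			nfailed++;
	}
	pthread_mutex_unlock(&tasks_mtx);
	return (nfailed);
}
```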

Some other minor fixes:

* Use the GUID of the volume when naming threads in uzfs_zvol_io_receiver() and uzfs_zvol_io_ack_sender().
* uzfs_zvol_io_receiver() should check hdr->len for reads and writes, and error out if hdr->len is zero.
* uzfs_zvol_worker() and uzfs_zvol_io_ack_sender(): the zinfo counters need to be updated properly. Currently, every opcode other than write is counted as a read.
* uzfs_zvol_io_ack_sender() should call SHUTDOWN(fd) before closing fd.
* Log thread exit with the GUID, so that we know which thread (and for which replica) exited.
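A sketch of the hdr->len check from the io_receiver item above. The opcode values, the io_hdr_t layout and validate_io_hdr() are hypothetical and only loosely modelled on zvol_io_hdr_t; the point is that read and write requests with a zero length are rejected, while other opcodes are not affected.

```c
#include <stdint.h>

/* Hypothetical opcode values and header layout, for illustration only. */
typedef enum {
	OPCODE_READ = 1,
	OPCODE_WRITE,
	OPCODE_SYNC,
	OPCODE_HANDSHAKE
} opcode_t;

typedef struct {
	opcode_t opcode;
	uint64_t io_seq;
	uint64_t offset;
	uint64_t len;
} io_hdr_t;

/*
 * Reject read/write headers that carry no payload length; other
 * opcodes may legitimately have len == 0.
 * Returns 0 if the header is acceptable, -1 otherwise.
 */
static int
validate_io_hdr(const io_hdr_t *hdr)
{
	switch (hdr->opcode) {
	case OPCODE_READ:
	case OPCODE_WRITE:
		return (hdr->len == 0 ? -1 : 0);
	default:
		return (0);
	}
}
```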
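For the counter item, one way to avoid treating every non-write opcode as a read is to switch on the opcode and give each kind its own counter. The opcode names, the io_counters_t fields and count_request() are hypothetical; the real zinfo counters may be named and grouped differently.

```c
#include <stdint.h>

/* Hypothetical opcodes and counter fields; the real zinfo differs. */
typedef enum {
	OP_READ = 1,
	OP_WRITE,
	OP_SYNC,
	OP_OTHER
} op_t;

typedef struct {
	uint64_t reads;
	uint64_t writes;
	uint64_t syncs;
	uint64_t others;
} io_counters_t;

/*
 * Count each opcode in its own bucket instead of using
 * "if (write) ... else read", which files every non-write as a read.
 */
static void
count_request(io_counters_t *c, op_t op)
{
	switch (op) {
	case OP_READ:
		c->reads++;
		break;
	case OP_WRITE:
		c->writes++;
		break;
	case OP_SYNC:
		c->syncs++;
		break;
	default:
		c->others++;
		break;
	}
}
```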
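For the SHUTDOWN(fd) item, a sketch using the plain shutdown(2) and close(2) calls, assuming SHUTDOWN is a thin wrapper around shutdown(2); shutdown_and_close() is a hypothetical helper. shutdown() acts on the connection itself, so the peer sees EOF and, on Linux, a thread still blocked reading the same socket is woken up. close() only releases this descriptor, and the connection stays open while other references to it exist.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Shut down both directions of the connection, then release the
 * descriptor. Returns 0 on success, -1 if either call failed.
 */
static int
shutdown_and_close(int fd)
{
	int rc = shutdown(fd, SHUT_RDWR);

	if (close(fd) != 0)
		rc = -1;
	return (rc);
}
```

With a socketpair(), calling shutdown_and_close() on one end makes read() on the other end return 0 (EOF).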

Signed-off-by: satbir <[email protected]>

vishnuitta

Changes are good.

jkryl · 2018-06-20T09:47:44Z

cmd/zrepl/zrepl.c

 	/* First command should be OPEN */
 	while (zinfo == NULL) {
 		if (open_zvol(fd, &zinfo) != 0)
 			goto exit;
+		snprintf(tinfo, 50, "io_receiver_%lu", zinfo->zvol_guid);
+		prctl(PR_SET_NAME, tinfo, 0, 0, 0);


From the prctl(2) man page:

The name can be up to 16 bytes long, including the terminating null byte. (If the length of the string, including the terminating null byte, exceeds 16 bytes, the string is silently truncated.)

We cannot store the whole GUID there. We could number the threads from 1 to n (using an incrementing static counter). However, I would probably just revert to the previous code and use the name io_receiver for all such threads. Using gdb, we can always figure out which pool a thread belongs to (by printing zinfo), unless there is some advantage to having unique thread names.

A similar comment applies to the other thread below.
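To see the 16-byte limit quoted above in practice, a Linux-only sketch that tries to name the calling thread io_receiver_<guid> and reads the name back with PR_GET_NAME. name_thread_with_guid() and THREAD_NAME_LEN are hypothetical names; the 16-byte limit itself is the one stated in the man page.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Linux limit for a thread name: 16 bytes including the terminating NUL. */
#define THREAD_NAME_LEN 16

/*
 * Try to name the calling thread "io_receiver_<guid>" and copy the name
 * the kernel actually stored into out (at least THREAD_NAME_LEN bytes).
 * Returns 0 on success, -1 on failure.
 */
static int
name_thread_with_guid(uint64_t guid, char *out)
{
	char name[64];

	snprintf(name, sizeof (name), "io_receiver_%" PRIu64, guid);
	if (prctl(PR_SET_NAME, name, 0, 0, 0) != 0)
		return (-1);
	return (prctl(PR_GET_NAME, out, 0, 0, 0));
}
```

With a 20-digit GUID such as 12345678901234567890, only io_receiver_123 survives the truncation, so threads of different volumes could easily end up with the same name.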

Reverted this change.

jkryl · 2018-06-20T10:19:36Z

lib/libzrepl/mgmt_conn.c

@@ -499,6 +499,8 @@ uzfs_zvol_mgmt_do_handshake(uzfs_mgmt_conn_t *conn, zvol_io_hdr_t *hdrp,
 	 */
 	mgmt_ack.zvol_guid = dsl_dataset_phys(
 	    zv->zv_objset->os_dsl_dataset)->ds_guid;
+	zinfo->zvol_guid = mgmt_ack.zvol_guid;
+	LOG_INFO("Volume:%s has zvol_guid:%lu", zinfo->name, zinfo->zvol_guid);


Just a nit, but can we change the message to "Zvol %s has guid %lu"? This function is also called for ZVOL_OPCODE_PREPARE_FOR_REBUILD, in which case the guid is already assigned to zinfo->zvol_guid. So maybe something like if (zinfo->zvol_guid != 0) ... would be more appropriate.
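One possible reading of the suggestion, sketched with a hypothetical, trimmed-down zinfo_sketch_t and a hypothetical record_zvol_guid(): the guid is assigned and logged only when it is not yet set, so the later handshake for ZVOL_OPCODE_PREPARE_FOR_REBUILD neither reassigns nor re-logs it. printf stands in for LOG_INFO, and the sketch prints with PRIu64 because %lu only matches uint64_t on platforms where unsigned long is 64 bits.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, trimmed-down stand-in for the real zinfo structure. */
typedef struct {
	const char *name;
	uint64_t zvol_guid;
} zinfo_sketch_t;

/*
 * Record the guid learned during the handshake. If a guid is already
 * set (e.g. on a later rebuild handshake), leave it alone and log nothing.
 * Returns 1 if the guid was newly set, 0 otherwise.
 */
static int
record_zvol_guid(zinfo_sketch_t *zinfo, uint64_t guid)
{
	if (zinfo->zvol_guid != 0)
		return (0);
	zinfo->zvol_guid = guid;
	printf("Zvol %s has guid %" PRIu64 "\n", zinfo->name, guid);
	return (1);
}
```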

jkryl · 2018-06-20T10:23:09Z

lib/libzrepl/mgmt_conn.c

@@ -595,9 +597,8 @@ finish_async_tasks(void)
 			rc = reply_nodata(async_task->conn, async_task->status,
 			    async_task->hdr.opcode, async_task->hdr.io_seq);
 		}
+


We should not ignore rc here. If epoll fails, it is a serious error that leads to an internally inconsistent state, which is difficult to recover from. We could just break if rc != 0 and change the final return to return (rc). That will take care of releasing the mutex.
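A sketch of this suggestion, using hypothetical stand-ins (async_mtx, send_reply(), finish_tasks()) rather than the real finish_async_tasks() internals: the loop stops at the first failed reply with break rather than return, so control still reaches the unlock, and rc is returned to the caller.

```c
#include <pthread.h>

/* Hypothetical stand-ins; not the real zrepl data structures. */
static pthread_mutex_t async_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Simulated reply_nodata(): fails (like an epoll error) at fail_idx. */
static int
send_reply(int idx, int fail_idx)
{
	return (idx == fail_idx ? -1 : 0);
}

/*
 * Stop at the first failed reply, but leave the loop with break rather
 * than return, so the unlock below always runs; rc is then propagated.
 * *nsent receives the number of replies sent before the failure.
 */
static int
finish_tasks(int ntasks, int fail_idx, int *nsent)
{
	int rc = 0;

	*nsent = 0;
	pthread_mutex_lock(&async_mtx);
	for (int i = 0; i < ntasks; i++) {
		rc = send_reply(i, fail_idx);
		if (rc != 0)
			break;
		(*nsent)++;
	}
	pthread_mutex_unlock(&async_mtx);
	return (rc);
}
```

Unlike counting failures and carrying on, this stops sending replies once one fails, which fits the view that an epoll failure leaves the state too inconsistent to continue.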

Fixed the deadlock issue. I will raise a separate PR to address the mgmt_thread exit issue, which needs discussion among team members.

Signed-off-by: satbir <[email protected]>

… deadlock

* Fixed hdr.len check in io_receiver thread

Signed-off-by: satbir <[email protected]>

[TA1143] Fix deadlock issue in finish_async_tasks.

8e3e020

Signed-off-by: satbir <[email protected]>

satbirchhikara requested review from jkryl, payes, gila, vishnuitta and mynktl June 20, 2018 06:24

Fixed hdr.len check in io_receiver thread

c505fed

Signed-off-by: satbir <[email protected]>

vishnuitta approved these changes Jun 20, 2018

Merge branch 'zfs-0.7-release' into deadlock

1190f52

jkryl reviewed Jun 20, 2018

satbirchhikara added 2 commits June 20, 2018 16:46

Fixed some of the issues raised during code review.

b3755d1

Signed-off-by: satbir <[email protected]>

Merge branch 'deadlock' of https://github.com/satbirchhikara/zfs into…

534ee79

… deadlock

jkryl approved these changes Jun 20, 2018

vishnuitta merged commit eb7d617 into mayadata-io:zfs-0.7-release Jun 20, 2018

satbirchhikara deleted the deadlock branch June 20, 2018 12:13

jkryl pushed a commit that referenced this pull request Jun 26, 2018

[TA1143] Fix deadlock issue in finish_async_tasks. (#71)

9aa90df

* Fixed hdr.len check in io_receiver thread

Signed-off-by: satbir <[email protected]>
