-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replication functionality check-in. #4
Replication functionality check-in. #4
Conversation
cmd/zrepl/zrepl
Outdated
@@ -0,0 +1,228 @@ | |||
#! /bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file should be removed, it is automatically generated by libtool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jan. I forgot that .....now done
module/zfs/spa.c
Outdated
@@ -3505,9 +3505,14 @@ spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy, | |||
mutex_exit(&spa_namespace_lock); | |||
} | |||
|
|||
if (firstopen) | |||
if (firstopen) { | |||
#ifndef _UZFS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here uzfs_zvol_create_cb is conditional based on _UZFS but the same function in zfs_ioctl.c is conditional based on _KERNEL. For consistency reasons it looks this one should be _KERNEL too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely right. I fixed it. Consistency maintained.
cmd/zrepl/zrepl.c
Outdated
@@ -0,0 +1,976 @@ | |||
|
|||
#include </usr/include/arpa/inet.h> | |||
#include </usr/include/netdb.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/usr/include should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cmd/zrepl/zrepl.c
Outdated
* Free zio command along with buffer. | ||
*/ | ||
static void | ||
zio_cmd_free(zvol_io_cmd_t **cmd, zvol_op_code_t opcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't opcode stored in cmd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opcode is part of cmd->hdr, hdr is allocated from stack then we use hdr info to allocate cmd memory and buffer for read/write of size equal to hdr->len. Once we allocated cmd, we bcopy hdr to cmd->hdr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but if I understand correctly since hdr is part of cmd (no matter how it was constructed) it contains the opcode and we don't need to pass it explicitly as a parameter to this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I agree with you. I fixed it now.
hdr->offset, hdr->len, NULL); | ||
break; | ||
|
||
case ZVOL_OPCODE_SYNC: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't here be something what eventually calls zil_commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vitta, has provided zil_commit API very recently. I will call that API here. This was not added because I need to add SYNC command in Replica which I deferred for my next sprint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/* The event loop */ | ||
while (1) { | ||
int n, i; | ||
n = epoll_wait(efd, events, MAXEVENTS, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we bother with epoll if all we want to do is read data from one file descriptor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I changed it. It was very hard for me to address this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I did not provide a misleading advice in this case :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, absolutely not. It was constructive comment.
cmd/zrepl/zrepl.c
Outdated
} | ||
|
||
strncpy(mgmt_ack.volname, name, strlen(name)); | ||
mgmt_ack.port = 3232; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the port number should be defined somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* exchanging info like IP add, port etc. | ||
*/ | ||
static void | ||
uzfs_zvol_mgmt_thread(void *arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do I understand correctly that as time will go we will add more management commands on this connection? Or is the handshake command the only one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. You are right. There will be many more command, just to name few user/app triggered sync, snap creation, snap rollback/restrore, rebuild etc.......
cmd/zrepl/zrepl.c
Outdated
buf = kmem_alloc(1024, KM_SLEEP); | ||
nbytes = fread(buf, sizeof (char), 1024, fp); | ||
|
||
if (nbytes == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be nbytes <= 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Done 👍
* and spawn a new thread for each new connection req. | ||
*/ | ||
static void | ||
uzfs_zvol_io_conn_acceptor(void) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a problem when this thread terminates for whatever reason it can be. the main thread does not know about it and the program will continue to run but will not accept any new connections. I think it would have been better if acceptor was run on behalf on the main thread instead of dedicating a separate thread to it, which makes the synchronization more difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jan, this is by design, we know that if uzfs_zvol_io_conn_acceptor thread die down because of any reason, replica will not be able to accept new IO connections. Same goes with uzfs_zvol_mgmt_thread. I would discuss it with team here and get back to you. I am keeping this as it is for now until we have better way of doing things.
One simple approach would be to kill main replica process if any of uzfs_zvol_mgmt_thread or uzfs_zvol_io_conn_acceptor goes down. That way container will reboot system back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jan,
I appreciate efforts and time you dedicated for this review. So many valuable comments which could be potential bugs in production. I am going to push the changes (comments addressed). Please have a look and let me know your views/comments.
Just a small request for time being, I do not want to remove printf/log statements before I push these changes to main repo, so you can ignore for time being. These extra debug statements help in verifying functionality a lot.
Again thanks a lot !!!
} | ||
|
||
ZREPL_LOG("Count:%d Size: %ld\n", count, hdr.len); | ||
if (hdr.opcode == ZVOL_OPCODE_HANDSHAKE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes Jan you are right, thats what our end goal is, for time being (during in house testing) ASSERT had been added "ASSERT(!zinfo->is_io_ack_sender_created)" to catch any such issue where we are sending multiple handshake. Will change code.
} | ||
(void) pthread_mutex_unlock(&zinfo->zinfo_mutex); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jan. Make sense to test that case. As of today, code will panic if another op_code sent before handshake. As part of handling it, I will turn down any such request/op_code which come before handshake.
cmd/zrepl/zrepl.c
Outdated
} | ||
|
||
strncpy(mgmt_ack.volname, name, strlen(name)); | ||
mgmt_ack.port = 3232; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cmd/zrepl/zrepl.c
Outdated
buf = kmem_alloc(1024, KM_SLEEP); | ||
nbytes = fread(buf, sizeof (char), 1024, fp); | ||
|
||
if (nbytes == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Done 👍
* exchanging info like IP add, port etc. | ||
*/ | ||
static void | ||
uzfs_zvol_mgmt_thread(void *arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. You are right. There will be many more command, just to name few user/app triggered sync, snap creation, snap rollback/restrore, rebuild etc.......
cmd/zrepl/zrepl.c
Outdated
|
||
zfs_arc_max = (512 << 20); | ||
zfs_arc_min = (256 << 20); | ||
printf("zarcmax: %lu zarcmin:%lu\n", zfs_arc_max, zfs_arc_min); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jan to be candid, I have thought abt these values. I have inherited this code from one of file in uzfs. Will see what would be optimal value that we can place here. As of now I am adding TODO comments here.
} | ||
|
||
uzfs_zrepl_walk_pool_directory(); | ||
sleep(5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a random value. This delays is mainly because I want to load/open all datasets incore before someone start establishing connection and pumping data to it. We do little heavy task
loading/opening data set, which may take longer time. I know 5Sec is too much but since this is one time delay which is in boot path so I think it is okay to have some delay(if not 5sec). Please thoughts around ......
I have added this based on UT where I was starting replica and at same time started tgt, where tgt was little faster and it requested LUN/VOL which exist but replica unable to found it because replica was still busy loading/opening it.
|
||
while (1) { | ||
sleep(5); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. Focus was mainly on connections and IOs. These things like memory limit, signal handling etc ...left as it is. But these are equally important. Will fix these issues via separate bug.
cmd/zrepl/zrepl.c
Outdated
initialize_error: | ||
uzfs_zrepl_close_log(); | ||
uzfs_fini(); | ||
return (0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it.
module/zfs/spa.c
Outdated
@@ -3505,9 +3505,14 @@ spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy, | |||
mutex_exit(&spa_namespace_lock); | |||
} | |||
|
|||
if (firstopen) | |||
if (firstopen) { | |||
#ifndef _UZFS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely right. I fixed it. Consistency maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Satbir, thank you for addressing review comments from previous round. I'm fine with the changes. Just one thing which I would appreciate as writer of fio engine (and iscsit could benefit from it as well): if we could have one include file which does not depend on any other non-standard (zfs) header files which would contain definitions strictly related to over-the-wire protocol (various defines and structures). That way anyone implementing the replica protocol can simply include it and use the definitions. Also to avoid different layout of structures due to compiler differences, all protocol structures should be defined with packed
attribute and avoiding types which are architecture dependent. And one nit, I have noticed void pointer in one of the structures going over the wire - that should not be the case.
cmd/zrepl/zrepl.c
Outdated
* Free zio command along with buffer. | ||
*/ | ||
static void | ||
zio_cmd_free(zvol_io_cmd_t **cmd, zvol_op_code_t opcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but if I understand correctly since hdr is part of cmd (no matter how it was constructed) it contains the opcode and we don't need to pass it explicitly as a parameter to this function.
char *p = buf; | ||
ZREPL_ERRLOG("Trying to read nbytes: %lu\n", nbytes); | ||
while (nbytes) { | ||
count = read(fd, (void *)p, nbytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is blocking read then I think that the code looks good as it is
hdr->offset, hdr->len, NULL); | ||
break; | ||
|
||
case ZVOL_OPCODE_SYNC: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/* The event loop */ | ||
while (1) { | ||
int n, i; | ||
n = epoll_wait(efd, events, MAXEVENTS, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I did not provide a misleading advice in this case :-)
* and spawn a new thread for each new connection req. | ||
*/ | ||
static void | ||
uzfs_zvol_io_conn_acceptor(void) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
cmd/zrepl/zrepl.c
Outdated
|
||
zfs_arc_max = (512 << 20); | ||
zfs_arc_min = (256 << 20); | ||
printf("zarcmax: %lu zarcmin:%lu\n", zfs_arc_max, zfs_arc_min); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
|
||
uzfs_zrepl_walk_pool_directory(); | ||
sleep(5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. I think it would be good to add a brief comment explaining the reason here.
|
||
while (1) { | ||
sleep(5); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Jan for your comments. This reply is specific to comments that you have sent recently, there are few action items pending from my side(which I need to discuss with team today and get back to you ....ofcourse with fix if needed) from your earlier comments.
-
Regarding sending void pointer over wire: To be candid I do not know implications of sending void pointer over wire though I googled but did not found useful. Would be great if you share some thoughts around it. Other thing is that this pointer is sent by iSCSI tgt controller, after receiving ack from replica, controller need to know in which cmd wait queue it need to look for to send ack back to client. So all said, this pointer is pointing to one of the queue at controller side. I will see if I can inherit definition to make this pointer explicit but if there is dependency of "Inclusion" then I would love to leave it as it is(unless there is some big issue with sending void pointer over wire).
-
W.r.t your suggestion about replica code to be independent, I tried my best of doing so but there may be glitch because this replica code is splitted into two parts a) Loading zinfo during mount path/destroying zinfo during export b) Steady state IO acceptance and acknowledgement. So part "b" is something I would say independent.
cmd/zrepl/zrepl.c
Outdated
* Free zio command along with buffer. | ||
*/ | ||
static void | ||
zio_cmd_free(zvol_io_cmd_t **cmd, zvol_op_code_t opcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I agree with you. I fixed it now.
/* The event loop */ | ||
while (1) { | ||
int n, i; | ||
n = epoll_wait(efd, events, MAXEVENTS, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, absolutely not. It was constructive comment.
if ((hdr.opcode == ZVOL_OPCODE_WRITE) || | ||
(hdr.opcode == ZVOL_OPCODE_HANDSHAKE)) { | ||
count = uzfs_zvol_socket_read(fd, zio_cmd->buf, | ||
(sizeof (char) * hdr.len)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sizeof (char) can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be removed but IMHO it is not a good practice. Moreover there is no performance impact of it. Compiler replace it with 1 and since compiler know that multiplication with 1 yield nothing so compiler keep hdr.len and discard other stuff. It make code little readable.
zinfo->io_ack_waiting = 0; | ||
if ((zinfo->state == ZVOL_INFO_STATE_OFFLINE) || | ||
(zinfo->conn_closed == true)) { | ||
(void) pthread_mutex_unlock( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this unlock can be moved to the end of 'exit'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of today we have 2 exit in this function and in one the exit path we have lock held and in other exit path we do not have lock. Considering that we can not make it generic to release lock at exit. W.r.t to correctness/race, this thread would not be here unless Producer set conn_closed to true so we are okay to free zio_cmd without lock protection but if you insist I can take lock again and free zio_cmd(Completion Queue) here.
cmd/zrepl/zrepl.c
Outdated
char *buf = NULL; | ||
|
||
|
||
sfd = create_and_bind(mgmt_port); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'bind' is not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand we do not need to explicitly bind this since connect anyway bind socket before establishing connection to iSCSI controller.
goto exit; | ||
} | ||
|
||
packet = kmem_alloc((sizeof (mgmt_ack_t) + sizeof (*hdr)) * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
packet need to be freed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 ....done
@@ -385,6 +386,101 @@ uzfs_create_dataset(spa_t *spa, char *ds_name, uint64_t vol_size, | |||
return (error); | |||
} | |||
|
|||
/* uZFS Zvol create call back function */ | |||
int | |||
uzfs_zvol_create_cb(const char *ds_name, void *arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be duplicated code, can be merged with other function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. The other API which we are talking about has been developed very recently and it will take time for me to soak those change. Since this code has wider impact on export/create/destroy i.e we have to take care of refcount increments and decrements on dataset as part of create and destroy respectively and since this code is already tested and very much stable, I am little apprehensive to merge both of them right away. Will open an issue and fix it. I hope that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Vitta,
I honestly tried to address your comments. I would like to treat merging comments through separate issue. I hope that is fine. Please have a look and let me know.
Thanks & Regards,
Satbir Singh
if ((hdr.opcode == ZVOL_OPCODE_WRITE) || | ||
(hdr.opcode == ZVOL_OPCODE_HANDSHAKE)) { | ||
count = uzfs_zvol_socket_read(fd, zio_cmd->buf, | ||
(sizeof (char) * hdr.len)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be removed but IMHO it is not a good practice. Moreover there is no performance impact of it. Compiler replace it with 1 and since compiler know that multiplication with 1 yield nothing so compiler keep hdr.len and discard other stuff. It make code little readable.
goto exit; | ||
} | ||
|
||
packet = kmem_alloc((sizeof (mgmt_ack_t) + sizeof (*hdr)) * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 ....done
cmd/zrepl/zrepl.c
Outdated
char *buf = NULL; | ||
|
||
|
||
sfd = create_and_bind(mgmt_port); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand we do not need to explicitly bind this since connect anyway bind socket before establishing connection to iSCSI controller.
zinfo->io_ack_waiting = 0; | ||
if ((zinfo->state == ZVOL_INFO_STATE_OFFLINE) || | ||
(zinfo->conn_closed == true)) { | ||
(void) pthread_mutex_unlock( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of today we have 2 exit in this function and in one the exit path we have lock held and in other exit path we do not have lock. Considering that we can not make it generic to release lock at exit. W.r.t to correctness/race, this thread would not be here unless Producer set conn_closed to true so we are okay to free zio_cmd without lock protection but if you insist I can take lock again and free zio_cmd(Completion Queue) here.
@@ -385,6 +386,101 @@ uzfs_create_dataset(spa_t *spa, char *ds_name, uint64_t vol_size, | |||
return (error); | |||
} | |||
|
|||
/* uZFS Zvol create call back function */ | |||
int | |||
uzfs_zvol_create_cb(const char *ds_name, void *arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. The other API which we are talking about has been developed very recently and it will take time for me to soak those change. Since this code has wider impact on export/create/destroy i.e we have to take care of refcount increments and decrements on dataset as part of create and destroy respectively and since this code is already tested and very much stable, I am little apprehensive to merge both of them right away. Will open an issue and fix it. I hope that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes looks good.
Signed-off-by: satbir <[email protected]>
[ZoL#62] Replication functionality check-in Signed-off-by: satbir <[email protected]>
Replication functionality.
Description
Replication related changes.
Motivation and Context
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.