Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replication functionality check-in. #4

Merged
merged 16 commits into from
Mar 7, 2018

Conversation

satbirchhikara
Copy link
Collaborator

Replication functionality.

Description

Replication related changes.

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

cmd/zrepl/zrepl Outdated
@@ -0,0 +1,228 @@
#! /bin/bash
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file should be removed, it is automatically generated by libtool

Copy link
Collaborator Author

@satbirchhikara satbirchhikara Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jan. I forgot that .....now done

module/zfs/spa.c Outdated
@@ -3505,9 +3505,14 @@ spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
mutex_exit(&spa_namespace_lock);
}

if (firstopen)
if (firstopen) {
#ifndef _UZFS
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here uzfs_zvol_create_cb is conditional based on _UZFS but the same function in zfs_ioctl.c is conditional based on _KERNEL. For consistency reasons it looks this one should be _KERNEL too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely right. I fixed it. Consistency maintained.

@@ -0,0 +1,976 @@

#include </usr/include/arpa/inet.h>
#include </usr/include/netdb.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/usr/include should be removed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Free zio command along with buffer.
*/
static void
zio_cmd_free(zvol_io_cmd_t **cmd, zvol_op_code_t opcode)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't opcode stored in cmd?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opcode is part of cmd->hdr, hdr is allocated from stack then we use hdr info to allocate cmd memory and buffer for read/write of size equal to hdr->len. Once we allocated cmd, we bcopy hdr to cmd->hdr.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but if I understand correctly since hdr is part of cmd (no matter how it was constructed) it contains the opcode and we don't need to pass it explicitly as a parameter to this function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I agree with you. I fixed it now.

hdr->offset, hdr->len, NULL);
break;

case ZVOL_OPCODE_SYNC:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't here be something what eventually calls zil_commit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vitta, has provided zil_commit API very recently. I will call that API here. This was not added because I need to add SYNC command in Replica which I deferred for my next sprint.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/* The event loop */
while (1) {
int n, i;
n = epoll_wait(efd, events, MAXEVENTS, -1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we bother with epoll if all we want to do is read data from one file descriptor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I changed it. It was very hard for me to address this comment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I did not provide a misleading advice in this case :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, absolutely not. It was constructive comment.

}

strncpy(mgmt_ack.volname, name, strlen(name));
mgmt_ack.port = 3232;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the port number should be defined somewhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* exchanging info like IP add, port etc.
*/
static void
uzfs_zvol_mgmt_thread(void *arg)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do I understand correctly that as time will go we will add more management commands on this connection? Or is the handshake command the only one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. You are right. There will be many more command, just to name few user/app triggered sync, snap creation, snap rollback/restrore, rebuild etc.......

buf = kmem_alloc(1024, KM_SLEEP);
nbytes = fread(buf, sizeof (char), 1024, fp);

if (nbytes == 0) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be nbytes <= 0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. Done 👍

* and spawn a new thread for each new connection req.
*/
static void
uzfs_zvol_io_conn_acceptor(void)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a problem when this thread terminates for whatever reason it can be. the main thread does not know about it and the program will continue to run but will not accept any new connections. I think it would have been better if acceptor was run on behalf on the main thread instead of dedicating a separate thread to it, which makes the synchronization more difficult.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jan, this is by design, we know that if uzfs_zvol_io_conn_acceptor thread die down because of any reason, replica will not be able to accept new IO connections. Same goes with uzfs_zvol_mgmt_thread. I would discuss it with team here and get back to you. I am keeping this as it is for now until we have better way of doing things.

One simple approach would be to kill main replica process if any of uzfs_zvol_mgmt_thread or uzfs_zvol_io_conn_acceptor goes down. That way container will reboot system back.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator Author

@satbirchhikara satbirchhikara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jan,

I appreciate efforts and time you dedicated for this review. So many valuable comments which could be potential bugs in production. I am going to push the changes (comments addressed). Please have a look and let me know your views/comments.

Just a small request for time being, I do not want to remove printf/log statements before I push these changes to main repo, so you can ignore for time being. These extra debug statements help in verifying functionality a lot.

Again thanks a lot !!!

}

ZREPL_LOG("Count:%d Size: %ld\n", count, hdr.len);
if (hdr.opcode == ZVOL_OPCODE_HANDSHAKE) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes Jan you are right, thats what our end goal is, for time being (during in house testing) ASSERT had been added "ASSERT(!zinfo->is_io_ack_sender_created)" to catch any such issue where we are sending multiple handshake. Will change code.

}
(void) pthread_mutex_unlock(&zinfo->zinfo_mutex);
continue;
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jan. Make sense to test that case. As of today, code will panic if another op_code sent before handshake. As part of handling it, I will turn down any such request/op_code which come before handshake.

}

strncpy(mgmt_ack.volname, name, strlen(name));
mgmt_ack.port = 3232;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

buf = kmem_alloc(1024, KM_SLEEP);
nbytes = fread(buf, sizeof (char), 1024, fp);

if (nbytes == 0) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. Done 👍

* exchanging info like IP add, port etc.
*/
static void
uzfs_zvol_mgmt_thread(void *arg)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. You are right. There will be many more command, just to name few user/app triggered sync, snap creation, snap rollback/restrore, rebuild etc.......


zfs_arc_max = (512 << 20);
zfs_arc_min = (256 << 20);
printf("zarcmax: %lu zarcmin:%lu\n", zfs_arc_max, zfs_arc_min);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jan to be candid, I have thought abt these values. I have inherited this code from one of file in uzfs. Will see what would be optimal value that we can place here. As of now I am adding TODO comments here.

}

uzfs_zrepl_walk_pool_directory();
sleep(5);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a random value. This delays is mainly because I want to load/open all datasets incore before someone start establishing connection and pumping data to it. We do little heavy task
loading/opening data set, which may take longer time. I know 5Sec is too much but since this is one time delay which is in boot path so I think it is okay to have some delay(if not 5sec). Please thoughts around ......

I have added this based on UT where I was starting replica and at same time started tgt, where tgt was little faster and it requested LUN/VOL which exist but replica unable to found it because replica was still busy loading/opening it.


while (1) {
sleep(5);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. Focus was mainly on connections and IOs. These things like memory limit, signal handling etc ...left as it is. But these are equally important. Will fix these issues via separate bug.

initialize_error:
uzfs_zrepl_close_log();
uzfs_fini();
return (0);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

module/zfs/spa.c Outdated
@@ -3505,9 +3505,14 @@ spa_open_common(const char *pool, spa_t **spapp, void *tag, nvlist_t *nvpolicy,
mutex_exit(&spa_namespace_lock);
}

if (firstopen)
if (firstopen) {
#ifndef _UZFS
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely right. I fixed it. Consistency maintained.

Copy link

@jkryl jkryl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Satbir, thank you for addressing review comments from previous round. I'm fine with the changes. Just one thing which I would appreciate as writer of fio engine (and iscsit could benefit from it as well): if we could have one include file which does not depend on any other non-standard (zfs) header files which would contain definitions strictly related to over-the-wire protocol (various defines and structures). That way anyone implementing the replica protocol can simply include it and use the definitions. Also to avoid different layout of structures due to compiler differences, all protocol structures should be defined with packed attribute and avoiding types which are architecture dependent. And one nit, I have noticed void pointer in one of the structures going over the wire - that should not be the case.

* Free zio command along with buffer.
*/
static void
zio_cmd_free(zvol_io_cmd_t **cmd, zvol_op_code_t opcode)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but if I understand correctly since hdr is part of cmd (no matter how it was constructed) it contains the opcode and we don't need to pass it explicitly as a parameter to this function.

char *p = buf;
ZREPL_ERRLOG("Trying to read nbytes: %lu\n", nbytes);
while (nbytes) {
count = read(fd, (void *)p, nbytes);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is blocking read then I think that the code looks good as it is

hdr->offset, hdr->len, NULL);
break;

case ZVOL_OPCODE_SYNC:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/* The event loop */
while (1) {
int n, i;
n = epoll_wait(efd, events, MAXEVENTS, -1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I did not provide a misleading advice in this case :-)

* and spawn a new thread for each new connection req.
*/
static void
uzfs_zvol_io_conn_acceptor(void)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


zfs_arc_max = (512 << 20);
zfs_arc_min = (256 << 20);
printf("zarcmax: %lu zarcmin:%lu\n", zfs_arc_max, zfs_arc_min);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

uzfs_zrepl_walk_pool_directory();
sleep(5);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. I think it would be good to add a brief comment explaining the reason here.


while (1) {
sleep(5);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator Author

@satbirchhikara satbirchhikara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jan for your comments. This reply is specific to comments that you have sent recently, there are few action items pending from my side(which I need to discuss with team today and get back to you ....ofcourse with fix if needed) from your earlier comments.

  1. Regarding sending void pointer over wire: To be candid I do not know implications of sending void pointer over wire though I googled but did not found useful. Would be great if you share some thoughts around it. Other thing is that this pointer is sent by iSCSI tgt controller, after receiving ack from replica, controller need to know in which cmd wait queue it need to look for to send ack back to client. So all said, this pointer is pointing to one of the queue at controller side. I will see if I can inherit definition to make this pointer explicit but if there is dependency of "Inclusion" then I would love to leave it as it is(unless there is some big issue with sending void pointer over wire).

  2. W.r.t your suggestion about replica code to be independent, I tried my best of doing so but there may be glitch because this replica code is splitted into two parts a) Loading zinfo during mount path/destroying zinfo during export b) Steady state IO acceptance and acknowledgement. So part "b" is something I would say independent.

* Free zio command along with buffer.
*/
static void
zio_cmd_free(zvol_io_cmd_t **cmd, zvol_op_code_t opcode)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I agree with you. I fixed it now.

/* The event loop */
while (1) {
int n, i;
n = epoll_wait(efd, events, MAXEVENTS, -1);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, absolutely not. It was constructive comment.

if ((hdr.opcode == ZVOL_OPCODE_WRITE) ||
(hdr.opcode == ZVOL_OPCODE_HANDSHAKE)) {
count = uzfs_zvol_socket_read(fd, zio_cmd->buf,
(sizeof (char) * hdr.len));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof (char) can be removed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be removed but IMHO it is not a good practice. Moreover there is no performance impact of it. Compiler replace it with 1 and since compiler know that multiplication with 1 yield nothing so compiler keep hdr.len and discard other stuff. It make code little readable.

zinfo->io_ack_waiting = 0;
if ((zinfo->state == ZVOL_INFO_STATE_OFFLINE) ||
(zinfo->conn_closed == true)) {
(void) pthread_mutex_unlock(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this unlock can be moved to the end of 'exit'

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of today we have 2 exit in this function and in one the exit path we have lock held and in other exit path we do not have lock. Considering that we can not make it generic to release lock at exit. W.r.t to correctness/race, this thread would not be here unless Producer set conn_closed to true so we are okay to free zio_cmd without lock protection but if you insist I can take lock again and free zio_cmd(Completion Queue) here.

char *buf = NULL;


sfd = create_and_bind(mgmt_port);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'bind' is not required.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand we do not need to explicitly bind this since connect anyway bind socket before establishing connection to iSCSI controller.

goto exit;
}

packet = kmem_alloc((sizeof (mgmt_ack_t) + sizeof (*hdr)) *

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packet need to be freed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 ....done

@@ -385,6 +386,101 @@ uzfs_create_dataset(spa_t *spa, char *ds_name, uint64_t vol_size,
return (error);
}

/* uZFS Zvol create call back function */
int
uzfs_zvol_create_cb(const char *ds_name, void *arg)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be duplicated code, can be merged with other function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. The other API which we are talking about has been developed very recently and it will take time for me to soak those change. Since this code has wider impact on export/create/destroy i.e we have to take care of refcount increments and decrements on dataset as part of create and destroy respectively and since this code is already tested and very much stable, I am little apprehensive to merge both of them right away. Will open an issue and fix it. I hope that is fine.

Copy link
Collaborator Author

@satbirchhikara satbirchhikara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Vitta,

I honestly tried to address your comments. I would like to treat merging comments through separate issue. I hope that is fine. Please have a look and let me know.

Thanks & Regards,
Satbir Singh

if ((hdr.opcode == ZVOL_OPCODE_WRITE) ||
(hdr.opcode == ZVOL_OPCODE_HANDSHAKE)) {
count = uzfs_zvol_socket_read(fd, zio_cmd->buf,
(sizeof (char) * hdr.len));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be removed but IMHO it is not a good practice. Moreover there is no performance impact of it. Compiler replace it with 1 and since compiler know that multiplication with 1 yield nothing so compiler keep hdr.len and discard other stuff. It make code little readable.

goto exit;
}

packet = kmem_alloc((sizeof (mgmt_ack_t) + sizeof (*hdr)) *
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 ....done

char *buf = NULL;


sfd = create_and_bind(mgmt_port);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand we do not need to explicitly bind this since connect anyway bind socket before establishing connection to iSCSI controller.

zinfo->io_ack_waiting = 0;
if ((zinfo->state == ZVOL_INFO_STATE_OFFLINE) ||
(zinfo->conn_closed == true)) {
(void) pthread_mutex_unlock(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of today we have 2 exit in this function and in one the exit path we have lock held and in other exit path we do not have lock. Considering that we can not make it generic to release lock at exit. W.r.t to correctness/race, this thread would not be here unless Producer set conn_closed to true so we are okay to free zio_cmd without lock protection but if you insist I can take lock again and free zio_cmd(Completion Queue) here.

@@ -385,6 +386,101 @@ uzfs_create_dataset(spa_t *spa, char *ds_name, uint64_t vol_size,
return (error);
}

/* uZFS Zvol create call back function */
int
uzfs_zvol_create_cb(const char *ds_name, void *arg)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. The other API which we are talking about has been developed very recently and it will take time for me to soak those change. Since this code has wider impact on export/create/destroy i.e we have to take care of refcount increments and decrements on dataset as part of create and destroy respectively and since this code is already tested and very much stable, I am little apprehensive to merge both of them right away. Will open an issue and fix it. I hope that is fine.

Copy link

@vishnuitta vishnuitta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes looks good.

@vishnuitta vishnuitta merged commit 1d06753 into mayadata-io:zfs-0.7-release Mar 7, 2018
jkryl pushed a commit that referenced this pull request Jun 26, 2018
[ZoL#62] Replication functionality check-in
Signed-off-by: satbir <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants