
1 node down in a 3 node system caused many volumes to be Degraded/Faulted without a Target node #1714

Open
veenadong opened this issue Aug 8, 2024 · 5 comments
@veenadong

Mayastor 2.5.1 on a 3-node system: after taking 1 node down, many volumes cycle through rebuilding, faulting, and retrying, and end up Degraded/Faulted without a target node.

mayastor-io-engine-xs9sk_mayastor_io-engine-dd23276f2aaa3971af97201157038ba42b9e57962f307d92e693116527acd457.log:2024-08-07T18:19:38.950624012-07:00 stdout F [2024-08-08T01:19:38.950553630+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.69:8420/nqn.2019-05.io.openebs:b8d8cb2e-0160-4015-931a-53c3b125447cn1: failed to get I/O handle: failed to get IO channel for 10.245.250.69:8420/nqn.2019-05.io.openebs:b8d8cb2e-0160-4015-931a-53c3b125447cn1
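The volume states can be watched with the kubectl plugin while this happens; a minimal sketch, assuming the plugin's get subcommands (output columns may differ by plugin version):

# List all volumes with their state (Online/Degraded/Faulted) and target node
kubectl mayastor get volumes

# Inspect a single volume by UUID, e.g. the one from the log line above
kubectl mayastor get volume b8d8cb2e-0160-4015-931a-53c3b125447c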
Mayastor pods are running:

NAME                                            READY   STATUS    RESTARTS        AGE
mayastor-agent-core-5859d9898c-hfjrd            2/2     Running   8 (134m ago)    4h7m
mayastor-agent-ha-node-hmnn9                    1/1     Running   8 (136m ago)    39h
mayastor-agent-ha-node-zjm5c                    1/1     Running   0               3h1m
mayastor-api-rest-bb7656dd7-26ckt               1/1     Running   0               177m
mayastor-api-rest-bb7656dd7-857cr               1/1     Running   1 (136m ago)    6h19m
mayastor-api-rest-bb7656dd7-rdl5m               1/1     Running   0               177m
mayastor-csi-controller-56bb85db5f-bmd9k        5/5     Running   5 (136m ago)    4h1m
mayastor-csi-node-gqknk                         2/2     Running   16 (136m ago)   39h
mayastor-csi-node-p8l7n                         2/2     Running   0               3h1m
mayastor-etcd-0                                 1/1     Running   1 (135m ago)    4h
mayastor-etcd-1                                 1/1     Running   3 (136m ago)    39h
mayastor-etcd-2                                 0/1     Pending   0               170m
mayastor-io-engine-tp6cm                        2/2     Running   0               3h1m
mayastor-io-engine-xs9sk                        2/2     Running   8 (48m ago)     39h
mayastor-localpv-provisioner-6fd649f5fb-bcn8g   1/1     Running   2 (136m ago)    171m
mayastor-nats-0                                 3/3     Running   3 (136m ago)    4h
mayastor-nats-1                                 3/3     Running   0               170m
mayastor-nats-2                                 3/3     Running   0               170m
mayastor-operator-diskpool-c6b78cbdb-bqrqx      1/1     Running   0               171m

IO-engine pod is logging:

[2024-08-08T01:16:25.047340423+00:00 ERROR io_engine::bdev::nvmx::handle:handle.rs:388] I/O completed with PI error
[2024-08-08T01:16:25.047343437+00:00 ERROR io_engine::bdev::nvmx::handle:handle.rs:388] I/O completed with PI error

[2024-08-08T01:16:25.585879277+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -125, trtype:TCP adrfam:IPv4 traddr:10.245.250.71 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986
[2024-08-08T01:16:25.585921213+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command
[2024-08-08T01:16:25.585926825+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x5582015e9850
[2024-08-08T01:16:25.585970057+00:00 ERROR mayastor::spdk:nvme_qpair.c:797] CQ transport error -6 (No such device or address) on qpair id 3
[2024-08-08T01:16:25.585976264+00:00 ERROR io_engine::bdev::nvmx::qpair:qpair.rs:473] I/O qpair async connection polling error: UnknownErrno: Unknown errno self=QPair { qpair: 0x5582015e9850, ctrlr: 0x558201617f80, ctrlr_name: "10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1", state: Connecting, waiters: 15 }
[2024-08-08T01:16:25.585990227+00:00 ERROR io_engine::bdev::nvmx::qpair:qpair.rs:483] I/O qpair async connection failed self=QPair { qpair: 0x5582015e9850, ctrlr: 0x558201617f80, ctrlr_name: "10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1", state: Connecting, waiters: 15 } err=OpenBdev { source: UnknownErrno }
[2024-08-08T01:16:25.586064498+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to get I/O handle: failed to open bdev
[2024-08-08T01:16:25.586085241+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to get I/O handle: failed to open bdev

[2024-08-08T01:16:25.586346683+00:00 ERROR io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:368] Rebuild job #705 (running) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': failed to rebuild segment id=14 block=8969216 with error: Failed to get a handle for bdev 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1
[2024-08-08T01:16:25.586468900+00:00  INFO io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:426] Rebuild job #705 (failed: done) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': changing state from Running to Failed; current stats: RebuildStats { blocks_total: 629135326, blocks_recovered: 8958976, blocks_transferred: 333824, blocks_remaining: 620176350, progress: 1, blocks_per_task: 128, block_size: 512, tasks_total: 16, tasks_active: 0, start_time: 2024-08-08T01:15:17.756714484Z, is_partial: false }
[2024-08-08T01:16:25.586486051+00:00  INFO io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:606] Rebuild job #705 (failed: done) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': backend dropped; final stats: RebuildStats { blocks_total: 629135326, blocks_recovered: 8958976, blocks_transferred: 333824, blocks_remaining: 620176350, progress: 1, blocks_per_task: 128, block_size: 512, tasks_total: 16, tasks_active: 0, start_time: 2024-08-08T01:15:17.756714484Z, is_partial: false }
[2024-08-08T01:16:25.586728047+00:00 ERROR io_engine::bdev::nexus::nexus_bdev_rebuild:nexus_bdev_rebuild.rs:472] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open out-of-sync; rebuilding]: rebuild job failed with error: Failed to get a handle for bdev 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to open bdev: UnknownErrno: Unknown errno
[2024-08-08T01:16:25.591830223+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1049] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open out-of-sync; rebuilding]: closing child...
[2024-08-08T01:16:25.591843753+00:00 ERROR io_engine::core::descriptor:descriptor.rs:53] Bdev '3d646500-3214-4b3b-9849-139b3648e56e' is not claimed by this module 'NEXUS_CAS_MODULE'
[2024-08-08T01:16:25.591848712+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1072] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: destroying block device...
[2024-08-08T01:16:25.591874931+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1075] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: block device destroyed, waiting for removal...
[2024-08-08T01:16:25.591958581+00:00  INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="956e0c1b-59da-46ea-aa74-4f04bc4e75f6" child_device="3d646500-3214-4b3b-9849-139b3648e56e"
[2024-08-08T01:16:25.591966411+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1103] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: unplugging child...
[2024-08-08T01:16:25.591974582+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:632] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
[2024-08-08T01:16:25.592098633+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:655] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
[2024-08-08T01:16:25.592145370+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1147] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [closed (destroying) out-of-sync]: child successfully unplugged
[2024-08-08T01:16:25.592204944+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1086] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [closed out-of-sync]: child closed successfully
[2024-08-08T01:16:25.592215300+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:632] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: rebuild, reconfiguring I/O channels...
[2024-08-08T01:16:25.592330041+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:655] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: rebuild, reconfiguring I/O channels completed with result: Ok
[2024-08-08T01:16:29.631726005+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:4380] [nqn.2019-05.io.openebs:a2685ffc-437b-4286-8aa7-ab6bbd64e686] Submitting Keep Alive failed

Attached are the support bundles from the two running nodes:
mayastor-2024-08-08--00-35-44-UTC.tar.gz
mayastor-2024-08-08--00-34-37-UTC.tar.gz
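For reference, bundles like these are generated with the plugin's dump command; a minimal sketch, assuming the dump system subcommand and its -d/-n flags (names may vary by plugin version):

# Collect logs, resource state and etcd contents into a tarball under ./mayastor-dump
kubectl mayastor dump system -d ./mayastor-dump -n mayastor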

@avishnu (Member) commented Aug 8, 2024

What is the value of repl in the storage class? Did the node come back online?
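For context, repl here is the replica-count parameter of the Mayastor StorageClass; a minimal sketch of a 3-replica class (class name is illustrative):

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3-replicas   # illustrative name
parameters:
  repl: "3"        # number of data replicas per volume
  protocol: nvmf   # volumes are exported over NVMe-oF/TCP
provisioner: io.openebs.csi-mayastor
EOF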

@tiagolobocastro (Contributor)

Seems you have hit this bug: #1710

[2024-08-08T01:16:25.585879277+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -125, trtype:TCP adrfam:IPv4 traddr:10.245.250.71 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986
[2024-08-08T01:16:25.585921213+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command
[2024-08-08T01:16:25.585926825+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x5582015e9850

We have a fix in #1711.

@tiagolobocastro (Contributor)

Hmm, no logs are included in the bundle @veenadong; which plugin version did you use to create the bundle?

@veenadong (Author) commented Aug 8, 2024

> Hmm, no logs are included in the bundle @veenadong; which plugin version did you use to create the bundle?

We are running 2.5.1 and the plugin version is:
Kubectl Plugin (kubectl-mayastor) revision e8f685e3c9e1 (v2.5.0+0)
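That revision string is what the plugin prints for its version, presumably obtained with something like:

kubectl mayastor --version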

I see the fix in #1711 will be merged into 2.7? What is the ETA for it to be available in a GA release?

@tiagolobocastro (Contributor)

The plan is to release in a couple of weeks, along with previously missed control-plane functionality for updating pool labels after creation.
