feat(io-engine): listen on IPv6 unspecified by default #1743

Merged

Conversation

@michaelbeaumont (Contributor) commented Sep 21, 2024

I'm not yet able to fully build/test this because of SPDK issues, but it looks good in the editor.

I removed the indirection with macros and strings; I'm not sure if there was a particular need for it.

See #1731. This will need changes to the helm chart after it's merged.
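
For illustration, here is a minimal, hypothetical Rust sketch of what listening on the IPv6 unspecified address means in practice (this is not the io-engine code, and the port number is purely illustrative): on Linux with the default net.ipv6.bindv6only=0, a single socket bound to [::] is dual-stack, so IPv4 clients can still connect and appear as IPv4-mapped addresses.

```rust
use std::net::{Ipv6Addr, SocketAddr, TcpListener};

fn main() -> std::io::Result<()> {
    // Bind to the IPv6 unspecified address ([::]). With the Linux default
    // net.ipv6.bindv6only = 0 this socket is dual-stack: IPv4 peers connect
    // too and show up as IPv4-mapped addresses (::ffff:a.b.c.d).
    let addr = SocketAddr::from((Ipv6Addr::UNSPECIFIED, 10124)); // port chosen for illustration only
    let listener = TcpListener::bind(addr)?;
    println!("listening on {}", listener.local_addr()?);

    for stream in listener.incoming() {
        let stream = stream?;
        println!("accepted connection from {}", stream.peer_addr()?);
    }
    Ok(())
}
```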

@michaelbeaumont marked this pull request as ready for review September 29, 2024 21:12
auto-assign bot requested a review from dsavitskiy September 29, 2024 21:12
@tiagolobocastro (Contributor)

Sorry, it seems I somehow forgot to review this.

> I removed the indirection with macros and strings; I'm not sure if there was a particular need for it.

Not that I know of.

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Oct 4, 2024
@bors-openebs-mayastor

try

Build failed:

@tiagolobocastro (Contributor)

Tests passed, but we found a coredump...
I wonder if this is due to the recent SPDK update... @dsharma-dc any clues?

[2024-10-04T09:16:07.529Z] Thread 1 (LWP 9):
[2024-10-04T09:16:07.529Z] #0  0x00007fc6090d1bc7 in __pthread_kill_implementation () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
[2024-10-04T09:16:07.529Z] #1  0x00007fc609084b46 in raise () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
[2024-10-04T09:16:07.529Z] #2  0x00007fc60906f4b5 in abort () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
[2024-10-04T09:16:07.529Z] #3  0x00007fc60906f3d9 in __assert_fail_base.cold.0 () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
[2024-10-04T09:16:07.529Z] #4  0x00007fc60907d7b6 in __assert_fail () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
[2024-10-04T09:16:07.529Z] #5  0x000055d7f66d77cd in nvmf_tcp_req_complete (req=0x7fc6001d4f20) at tcp.c:3331
[2024-10-04T09:16:07.529Z] #6  0x000055d7f66d48b8 in nvmf_transport_req_complete (req=0x7fc6001d4f20) at transport.c:740
[2024-10-04T09:16:07.529Z] #7  0x000055d7f66a9be8 in _nvmf_request_complete (ctx=0x7fc6001d4f20) at ctrlr.c:4555
[2024-10-04T09:16:07.529Z] #8  0x000055d7f66aefe0 in nvmf_ctrlr_process_io_cmd (req=0x7fc6001d52b8) at ctrlr.c:4417
[2024-10-04T09:16:07.529Z] #9  0x000055d7f66aea76 in spdk_nvmf_request_exec (req=0x7fc6001d52b8) at ctrlr.c:4772
[2024-10-04T09:16:07.529Z] #10 0x000055d7f66da169 in nvmf_tcp_req_process (ttransport=0x55d7f7df76d0, tcp_req=0x7fc6001d52b8) at tcp.c:3093
[2024-10-04T09:16:07.529Z] #11 0x000055d7f66e1d8f in nvmf_tcp_h2c_data_payload_handle (ttransport=0x55d7f7df76d0, tqpair=0x55d7f82a4400, pdu=0x20000285b9c8) at tcp.c:1959
[2024-10-04T09:16:07.529Z] #12 0x000055d7f66e1a2f in _nvmf_tcp_pdu_payload_handle (tqpair=0x55d7f82a4400, pdu=0x20000285b9c8) at tcp.c:2020
[2024-10-04T09:16:07.529Z] #13 0x000055d7f66e0884 in nvmf_tcp_pdu_payload_handle (tqpair=0x55d7f82a4400, pdu=0x20000285b9c8) at tcp.c:2088
[2024-10-04T09:16:07.529Z] #14 0x000055d7f66dfc80 in nvmf_tcp_sock_process (tqpair=0x55d7f82a4400) at tcp.c:2438
[2024-10-04T09:16:07.529Z] #15 0x000055d7f66df624 in nvmf_tcp_sock_cb (arg=0x55d7f82a4400, group=0x7fc60001d7f0, sock=0x55d7f82a4230) at tcp.c:3225
[2024-10-04T09:16:07.529Z] #16 0x000055d7f670896c in sock_group_impl_poll_count (group_impl=0x7fc60001d8b0, group=0x7fc60001d7f0, max_events=32) at sock.c:728
[2024-10-04T09:16:07.529Z] #17 0x000055d7f67087e6 in spdk_sock_group_poll_count (group=0x7fc60001d7f0, max_events=32) at sock.c:754
[2024-10-04T09:16:07.529Z] #18 0x000055d7f670874a in spdk_sock_group_poll (group=0x7fc60001d7f0) at sock.c:705
[2024-10-04T09:16:07.529Z] #19 0x000055d7f66d761b in nvmf_tcp_poll_group_poll (group=0x7fc60001d760) at tcp.c:3382
[2024-10-04T09:16:07.529Z] #20 0x000055d7f66d4854 in nvmf_transport_poll_group_poll (group=0x7fc60001d760) at transport.c:728
[2024-10-04T09:16:07.529Z] #21 0x000055d7f66cb16b in nvmf_poll_group_poll (ctx=0x7fc6000014f0) at nvmf.c:157
[2024-10-04T09:16:07.529Z] #22 0x000055d7f671910e in thread_execute_poller (thread=0x55d7f7e0bc40, poller=0x7fc6000015b0) at thread.c:959
[2024-10-04T09:16:07.529Z] #23 0x000055d7f6712901 in thread_poll (thread=0x55d7f7e0bc40, max_msgs=0, now=54564662220878052) at thread.c:1085
[2024-10-04T09:16:07.529Z] #24 0x000055d7f6712765 in spdk_thread_poll (thread=0x55d7f7e0bc40, max_msgs=0, now=54564662220878052) at thread.c:1173
[2024-10-04T09:16:07.529Z] #25 0x000055d7f5adca7a in spdk_rs::thread::Thread::poll (self=0x7fc6000011f0) at spdk-rs/src/thread.rs:165

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Oct 4, 2024
@bors-openebs-mayastor

try

Build failed:

@tiagolobocastro (Contributor) commented Oct 4, 2024

Ah sorry, there was an actual failure in the nexus_rebuild_parallel test.
Here's the full log: ci.txt
nexus_rebuild_parallel log: nexus_rebuild_parallel.txt

[2024-10-04T09:47:35.840Z] Monitoring 20 volumes
[2024-10-04T09:47:35.840Z]     v0:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v1:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v2:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v3:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v4:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v5:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v6:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v7:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v8:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v9:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v10:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v11:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v12:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v13:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v14:  ONLINE     |  ONLINE     |  FAILED     | 
[2024-10-04T09:47:35.840Z]     v15:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v16:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v17:  ONLINE     |  ONLINE     |  ONLINE     | 
[2024-10-04T09:47:35.840Z]     v18:  ONLINE     |  ONLINE     |  REBUILD 93 | 
[2024-10-04T09:47:35.840Z]     v19:  ONLINE     |  ONLINE     |  REBUILD 09 | 
[2024-10-04T09:47:35.840Z] -
[2024-10-04T09:47:35.840Z] One or more volumes failed
[2024-10-04T09:47:35.841Z] thread 'nexus_rebuild_parallel' panicked at 'All volumes must go online: Status { code: Internal, message: "One or more volumes failed", source: None }', io-engine/tests/nexus_rebuild_parallel.rs:132:10

Is this just a timeout?

@michaelbeaumont (Contributor, Author)

Hmm, I haven't been able to look at what the test does yet. I'd suspect that with this change, tests would either fail completely (probably right at the start, since the gRPC server wouldn't be reachable) or not fail at all.
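
To make the reachability argument concrete, here is a hedged sketch (hypothetical code, not anything from this repository): on a dual-stack Linux host an IPv4 client still reaches a listener bound to [::], so the tests keep working; if the socket were v6-only, the connect would fail immediately, i.e. tests would break right at gRPC setup rather than midway through a rebuild.

```rust
use std::net::{Ipv6Addr, SocketAddr, TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // Listener on the IPv6 unspecified address, ephemeral port.
    let listener = TcpListener::bind(SocketAddr::from((Ipv6Addr::UNSPECIFIED, 0)))?;
    let port = listener.local_addr()?.port();

    // An IPv4 client connects to 127.0.0.1. With the Linux default
    // (dual-stack socket) this succeeds; with a v6-only socket it would
    // fail up front: the "fail completely at the start" case above.
    let client = TcpStream::connect(("127.0.0.1", port))?;
    let (_server_side, peer) = listener.accept()?;
    println!("IPv4 client {} reached the [::] listener (peer seen as {peer})", client.local_addr()?);
    Ok(())
}
```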

@tiagolobocastro (Contributor)

> Hmm, I haven't been able to look at what the test does yet. I'd suspect that with this change, tests would either fail completely (probably right at the start, since the gRPC server wouldn't be reachable) or not fail at all.

It runs many concurrent rebuilds and waits for them all to complete OK.
Actually, it seems one of them failed:

[2024-10-04T09:47:35.840Z]     v14:  ONLINE     |  ONLINE     |  FAILED     | 
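
For readers who haven't looked at the test either, the check it performs amounts to something like this simplified, hypothetical sketch (not the actual nexus_rebuild_parallel code): every child of every volume must end up Online, and a single Failed child aborts the wait, which is what v14 triggered above.

```rust
// Simplified model of the per-volume child states shown in the table above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ChildState {
    Online,
    Rebuilding(u8), // percent complete
    Failed,
}

/// Ok(true) once every child of every volume is Online, Ok(false) while
/// rebuilds are still running, Err if any child has failed (as v14 did).
fn all_online(volumes: &[Vec<ChildState>]) -> Result<bool, String> {
    for (i, children) in volumes.iter().enumerate() {
        if children.iter().any(|c| *c == ChildState::Failed) {
            return Err(format!("volume v{i} failed"));
        }
        if children.iter().any(|c| matches!(c, ChildState::Rebuilding(_))) {
            return Ok(false);
        }
    }
    Ok(true)
}

fn main() {
    use ChildState::*;
    let volumes = vec![
        vec![Online, Online, Online],
        vec![Online, Online, Failed],         // like v14 in the log above
        vec![Online, Online, Rebuilding(93)],
    ];
    println!("{:?}", all_online(&volumes)); // Err("volume v1 failed")
}
```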

@tiagolobocastro (Contributor)

Here's the failure:

[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.634568 | uri.rs:288         [I] ::create() 10.1.0.2:8420/nqn.2019-05.io.openebs:v15r0n1
[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.634589 | controller.rs:188  [D] 10.1.0.2:8420/nqn.2019-05.io.openebs:v15r0n1: new NVMe controller created
[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.642541 | nvme_fabric.c:607  [E] Connect command failed, rc -5, trtype:TCP adrfam:IPv4 traddr:10.1.0.2 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:v14r0
[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.642572 | nvme_fabric.c:612  [E] Connect command completed with error: sct 1, sc 1
[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.642591 | nvme_tcp.c:2426    [E] Failed to poll NVMe-oF Fabric CONNECT command
[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.642606 | nvme_tcp.c:2216    [E] Failed to connect tqpair=0x560f48d5a080
[2024-10-04T09:47:35.873Z] ms_2 :: 09:47:33.642733 | rebuild_job_bac... [E] Rebuild job #15 (running) 'bdev:///v14r2?uuid=537434d0-99d2-444f-b380-069447139ff7' -> 'nvmf+tcp://10.1.0.2:8420/nqn.2019-05.io.openebs:v14r0?uuid=e734bf0e-6698-4527-b1fd-8e4548bee53a' on nexus 'v14': failed to rebuild segment id=3 block=10624 with error: Write IO failed for bdev nvmf+tcp://10.1.0.2:8420/nqn.2019-05.io.openebs:v14r0?uuid=e734bf0e-6698-4527-b1fd-8e4548bee53a

@tiagolobocastro (Contributor)

And on the target:

[2024-10-04T09:47:35.880Z] ms_0 :: 09:47:33.641439 | ctrlr.c:313        [W] Duplicate QID detected, re-check in 1000us
[2024-10-04T09:47:35.880Z] ms_0 :: 09:47:33.642478 | ctrlr.c:305        [E] Got I/O connect with duplicate QID 1

I thought we had fully fixed this :(
Also not sure why this is only showing up now with your PR...
@dsharma-dc any clues?

@dsharma-dc (Contributor)

> And on the target:
>
> [2024-10-04T09:47:35.880Z] ms_0 :: 09:47:33.641439 | ctrlr.c:313        [W] Duplicate QID detected, re-check in 1000us
> [2024-10-04T09:47:35.880Z] ms_0 :: 09:47:33.642478 | ctrlr.c:305        [E] Got I/O connect with duplicate QID 1
>
> I thought we had fully fixed this :( Also not sure why this is only showing up now with your PR... @dsharma-dc any clues?

I have seen this very rarely over the last few weeks without this PR too, so I think it's unrelated to this change.

@dsharma-dc (Contributor)

bors try

bors-openebs-mayastor bot pushed a commit that referenced this pull request Oct 4, 2024
@bors-openebs-mayastor

try

Build succeeded:

@dsharma-dc (Contributor)

bors merge

@bors-openebs-mayastor

Build succeeded:

bors-openebs-mayastor bot merged commit a2a75d9 into openebs:develop Oct 7, 2024
8 checks passed
@michaelbeaumont deleted the feat/io-listen-ipv6 branch October 7, 2024 15:16