Test Intel ipu-plugin(VSP) PR-> 330 #274

Closed

Conversation

sudhar-krishnakumar
Contributor

Test Intel ipu-plugin(VSP) PR-> 330

@openshift-ci openshift-ci bot requested review from SalDaniele and vrindle January 27, 2025 22:50

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sudhar-krishnakumar
Once this PR has been reviewed and has the lgtm label, please assign bn222 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 27, 2025

openshift-ci bot commented Jan 27, 2025

Hi @sudhar-krishnakumar. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@SalDaniele
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2025
@SalDaniele
Contributor

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2025
@SalDaniele
Contributor

/retest

1 similar comment
@SalDaniele
Contributor

/retest


openshift-ci bot commented Jan 28, 2025

@sudhar-krishnakumar: all tests passed!

Full PR test history. Your PR dashboard.


@SalDaniele
Contributor

Passed initially with the workaround (WA) still in place. Manually re-running the test on this system to verify that the RHEL re-install works.

@SalDaniele
Contributor

SalDaniele commented Jan 28, 2025

Tested manually: after removing the WA for https://issues.redhat.com/browse/IIC-502, the run fails.

2025-01-28 11:00:29 INFO [th:139776695305792] (ipu.py:265): Keeping existing iso since url and size didn't change
2025-01-28 11:00:30 INFO [th:139776695305792] (ipu.py:267): setting boot source override
2025-01-28 11:00:30 INFO [th:139776695305792] (ipu.py:269): triggering reboot
2025-01-28 11:00:31 INFO [th:139776695305792] (ipu.py:271): sleeping 5 minutes
2025-01-28 11:05:31 INFO [th:139776695305792] (ipu.py:273): restarting Redfish
2025-01-28 11:05:31 INFO [th:139776695305792] (host.py:154): wsfd-advnetlab216-intel-ipu-imc.anl.eng.bos2.dc.redhat.com up, connecting with root
2025-01-28 11:05:31 INFO [th:139776695305792] (host.py:186): Attempting SSH connections on wsfd-advnetlab216-intel-ipu-imc.anl.eng.bos2.dc.redhat.com with logins: {'_username': 'root', '_hostname': 'wsfd-advnetlab216-intel-ipu-imc.anl.eng.bos2.dc.redhat.com'}, {'_username': 'root', '_hostname': 'wsfd-advnetlab216-intel-ipu-imc.anl.eng.bos2.dc.redhat.com', '_key_path': '/root/.ssh/id_rsa', '_pkey': <paramiko.rsakey.RSAKey object at 0x7f204da4aed0>}
2025-01-28 11:05:31 INFO [th:139776695305792] (host.py:194): Login successful on wsfd-advnetlab216-intel-ipu-imc.anl.eng.bos2.dc.redhat.com
2025-01-28 11:05:42 INFO [th:139776695305792] (ipu.py:275): unsetting boot source override
2025-01-28 11:05:43 INFO [th:139776695305792] (ipu.py:108): Boot command sent
2025-01-28 11:05:43 INFO [th:139776695305792] (ipu.py:58): Redfish boot triggered, attempting to connect to ACC at ip 172.16.3.16
2025-01-28 11:05:43 INFO [th:139776695305792] (host.py:154): 172.16.3.16 up, connecting with root
2025-01-28 11:05:43 INFO [th:139776695305792] (host.py:186): Attempting SSH connections on 172.16.3.16 with logins: {'_username': 'root', '_hostname': '172.16.3.16'}, {'_username': 'root', '_hostname': '172.16.3.16', '_key_path': '/root/.ssh/id_rsa', '_pkey': <paramiko.rsakey.RSAKey object at 0x7f204c5c9fd0>}, {'_username': 'root', '_hostname': '172.16.3.16'}
2025-01-28 11:05:44 INFO [th:139776695305792] (host.py:194): Login successful on 172.16.3.16
2025-01-28 11:05:44 INFO [th:139776695305792] (ipu.py:63): (returncode: 0, error: )
2025-01-28 11:05:44 INFO [th:139776695305792] (ipu.py:64): Connected to ACC
2025-01-28 11:05:44 INFO [th:139777003874112] (ipu.py:124): Workaround: cold booting the host since currently driver can't deal with host rebooting without coordination
2025-01-28 11:05:44 INFO [th:139777003874112] (bmc.py:11): https://wsfd-advnetlab216-drac.anl.eng.bos2.dc.redhat.com/redfish/v1/Systems/System.Embedded.1 root calvin
2025-01-28 11:06:02 INFO [th:139777003874112] (host.py:152): waiting for '172.16.3.16' to respond to ping
2025-01-28 12:06:03 ERROR [th:139777003874112] (logger.py:22): Waited for 1h for ping

SSHed into the system and verified that the card does not have connectivity:


[root@localhost ~]# ifconfig enp0s1f0d5 172.16.3.16/24
[root@localhost ~]# ping 172.16.3.1
PING 172.16.3.1 (172.16.3.1) 56(84) bytes of data.
From 172.16.3.16 icmp_seq=1 Destination Host Unreachable
From 172.16.3.16 icmp_seq=2 Destination Host Unreachable
From 172.16.3.16 icmp_seq=3 Destination Host Unreachable

--- 172.16.3.1 ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5165ms
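
For anyone reproducing this outside CDA: the final "Waited for 1h for ping" error in the log above is what a bounded polling loop reports when the target never answers. The following is only a rough sketch of that pattern, not CDA's actual host.py code. The names `wait_until` and `ping_once` are hypothetical, and `ping_once` assumes a Linux iputils `ping` that accepts `-c` (count) and `-W` (reply timeout in seconds).

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str) -> bool:
    """Send one ICMP echo request; True if the host replied (assumes iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s seconds have elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # caller decides how to report the timeout
        sleep(interval_s)


if __name__ == "__main__":
    # 172.16.3.16 is the ACC address from the log above; the 1h timeout mirrors it.
    if not wait_until(lambda: ping_once("172.16.3.16"), timeout_s=3600):
        print("Waited for 1h for ping")
```

The clock and sleep functions are injectable so that the timeout logic can be exercised without actually waiting an hour.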

@SalDaniele
Contributor

SalDaniele commented Jan 28, 2025

@sudhar-krishnakumar
Upon further investigation, it looks like the card does successfully come up with connectivity after RHEL install.

It is failing to come up with devmem run when the card is cold booted to ensure idpf is healthy: https://github.com/bn222/cluster-deployment-automation/blob/71ed7477e25eb882aab94baee9a7971a44b70c27/ipu.py#L126
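
For reference, a cold boot through the BMC is typically expressed in Redfish as a `ComputerSystem.Reset` action on the system resource (the log earlier in this thread shows bmc.py talking to `/redfish/v1/Systems/System.Embedded.1`). The sketch below only builds the two requests for a power-off followed by a power-on; it is an illustration under assumptions, not the code at the linked ipu.py line. The helper names are hypothetical, authentication and TLS handling are omitted, and the `ResetType` values a given BMC accepts should be checked against its `ResetType@Redfish.AllowableValues`.

```python
import json
import urllib.request
from typing import List

# Standard Redfish action path, relative to a ComputerSystem resource.
RESET_ACTION = "/Actions/ComputerSystem.Reset"


def build_reset_request(system_url: str, reset_type: str) -> urllib.request.Request:
    """Build (but do not send) a Redfish reset request for the given system URL."""
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        system_url.rstrip("/") + RESET_ACTION,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cold_boot_requests(system_url: str) -> List[urllib.request.Request]:
    """Cold boot modeled as a forced power-off followed by a power-on (assumption)."""
    return [
        build_reset_request(system_url, "ForceOff"),
        build_reset_request(system_url, "On"),
    ]
```

A caller would send each request in order (for example with `urllib.request.urlopen` plus credentials) and wait for the power state to settle between the two.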

@sudhar-krishnakumar
Contributor Author

@SalDaniele if we have to test both the RHEL install and the subsequent cold boot of the host, can you let us know which CDA commands to run manually in our cluster?

@SalDaniele
Contributor

@sudhar-krishnakumar can we close this? I think the bump has been merged in another PR

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2025
@openshift-merge-robot

PR needs rebase.


@sudhar-krishnakumar
Contributor Author

> @sudhar-krishnakumar can we close this? I think the bump has been merged in another PR

@SalDaniele Yes, I can close this PR; it was only for testing a change on a private branch. Your PR 280 included the SHA for this fix in ipu-plugin(VSP).

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

3 participants