Some s390x machines failing net tests with NoRouteToHostException #2807

Open
smlambert opened this issue Nov 2, 2022 · 24 comments

Comments

@smlambert
Contributor

smlambert commented Nov 2, 2022

As described in adoptium/aqa-tests#4039 (comment)

To make it easy for the infrastructure team to repeat and diagnose, please
answer the following questions:

Any other details:

@sxa
Member

sxa commented Nov 2, 2022

Testing on the other instance on:

@smlambert
Contributor Author

smlambert commented Nov 2, 2022

Yes and I have some reruns on various other machines going now as part of triage efforts, and will update the issue once results are in.

NoRouteToHostExceptions also seen on test-marist-sles15-s390x-2 - see https://ci.adoptopenjdk.net/job/Grinder/6086/

Those types of exceptions are not seen on test-marist-ubuntu2204-s390x-1, but other problems on that machine... issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

@sxa
Member

sxa commented Nov 2, 2022

> issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

I have a fix relating to the firewall configuration that I can try on there. The problem only occurs on the new Marist machines we've got, and based on past experience the fix will allow multicast to work. I've applied it on the ubuntu2204-s390x-1 machine referred to above and am regrinding at https://ci.adoptopenjdk.net/job/Grinder/6103/ to test:

iptables -I INPUT -m pkttype --pkt-type multicast -j ACCEPT

[EDIT: This has resolved the problem - everything in java_net passed, although https://ci.adoptopenjdk.net/job/Grinder/6103/testReport/tools_jlink_JLinkReproducibleTest/java/JLinkReproducibleTest/ failed which is not likely to be related to this issue]
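
A quick way to check whether multicast datagrams get through on a machine, before committing to a full Grinder run, might look like the sketch below. It is only an illustration, not something used in this thread: the function name `multicast_loopback_ok`, the group address `239.255.0.1` and port `50000` are all assumptions chosen for the example. A `False` result says only that the probe did not arrive, not why.

```python
import socket
import struct


def multicast_loopback_ok(group="239.255.0.1", port=50000, timeout=2.0):
    """Send one datagram to a multicast group and report whether this host receives it.

    The group, port and timeout defaults are arbitrary example values.
    """
    payload = b"multicast-probe"
    recv = send = None
    try:
        recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)

        # Receiver: bind to the port and join the group on the default interface.
        recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        recv.bind(("", port))
        mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        recv.settimeout(timeout)

        # Sender: keep the packet on the local network and loop it back to this host.
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        send.sendto(payload, (group, port))

        data, _ = recv.recvfrom(1024)
        return data == payload
    except OSError:
        # Covers timeouts, missing multicast routes and refused socket options.
        return False
    finally:
        for sock in (recv, send):
            if sock is not None:
                sock.close()


if __name__ == "__main__":
    print("multicast loopback ok:", multicast_loopback_ok())
```

Running it before and after adding the iptables rule would give a rough indication of whether the rule changed anything on that host.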

@sxa
Member

sxa commented Nov 2, 2022

> Yes and I have some reruns on various other machines going now as part of triage efforts, and will update the issue once results are in.
>
> NoRouteToHostExceptions also seen on test-marist-sles15-s390x-2 - see https://ci.adoptopenjdk.net/job/Grinder/6086/
>
> Those types of exceptions are not seen on test-marist-ubuntu2204-s390x-1, but other problems on that machine... issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/

I'm going to re-grind that one after removing a somewhat rogue entry in /etc/hosts. I don't /think/ it will have made all those fail, but we'll see; it depends on exactly what the tests are doing in terms of reverse host lookups ... https://ci.adoptopenjdk.net/job/Grinder/6104/ - If not, it's going to need someone to do some more low-level debugging.

[EDIT: As expected no real change - [https://ci.adoptopenjdk.net/job/Grinder/6086/testReport/java_net_httpclient_http2_TLSConnection/java/TLSConnection/] passed in the new run, but that may have just been luck]
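
The thread does not say what the rogue /etc/hosts entry was. As a hedged sketch of how such entries can be spotted, the helper below lists every address that hosts-format text maps to a given name, so duplicate or unexpected mappings stand out. The function name `addresses_for_name` and the sample data (documentation-range addresses and a made-up host name) are hypothetical.

```python
def addresses_for_name(hosts_text, name):
    """Return every address that hosts-format text maps to `name`, in file order."""
    matches = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            matches.append(address)
    return matches


# Hypothetical sample; on a real machine the text would come from /etc/hosts.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
192.0.2.10  testhost testhost.example   # stale entry from an earlier setup
127.0.1.1   testhost
"""

print(addresses_for_name(SAMPLE_HOSTS, "testhost"))
```

A name that maps to more than one address, or to an address the machine no longer owns, is a candidate for the kind of entry removed before the re-grind.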

@zdtsw
Contributor

zdtsw commented Nov 3, 2022

> > Yes and I have some reruns on various other machines going now as part of triage efforts, and will update the issue once results are in.
> > NoRouteToHostExceptions also seen on test-marist-sles15-s390x-2 - see https://ci.adoptopenjdk.net/job/Grinder/6086/
> > Those types of exceptions are not seen on test-marist-ubuntu2204-s390x-1, but other problems on that machine... issues appear to be mainly related to tests using multicast addresses. https://ci.adoptopenjdk.net/job/Grinder/6087/testReport/
>
> I'm going to re-grind that one after removing a somewhat rogue entry in /etc/hosts. I don't /think/ it will have made all those fail, but we'll see; it depends on exactly what the tests are doing in terms of reverse host lookups ... https://ci.adoptopenjdk.net/job/Grinder/6104/ - If not, it's going to need someone to do some more low-level debugging.

It still seems to be failing on java_net.

@smlambert
Contributor Author

smlambert commented Nov 6, 2022

| Node | Grinder link | Predominant type of failure |
| --- | --- | --- |
| test-marist-rhel7-s390x-2 | Grinder/6108 | NoRouteToHostException |
| test-marist-rhel8-s390x-2 | Grinder/6110 | NoRouteToHostException |
| test-marist-sles12-s390x-2 | Grinder/6112 | NoRouteToHostException |
| test-marist-sles15-s390x-1 | -- | offline |
| test-marist-sles15-s390x-2 | Grinder/6113 | NoRouteToHostException |
| test-marist-ubuntu1604-s390x-1 | Grinder/6102 | offline |
| test-marist-ubuntu1804-s390x-1 | -- | offline |
| test-marist-ubuntu1804-s390x-2 | -- | offline |
| test-marist-ubuntu1804-s390x-3 | -- | offline |
| test-marist-ubuntu1804-s390x-4 | Grinder/6101 | offline |
| test-marist-ubuntu2004-s390x-1 | Grinder/6111 | |
| test-marist-ubuntu2204-s390x-1 | Grinder/6103 | After fixes applied re: #2807 (comment), only JLinkReproducibleTest fails, which is a problematic testcase that should get excluded (JDK-8217166) |

@sxa
Member

sxa commented Nov 7, 2022

The above analysis suggests that we can resolve a lot of the issues on the RHEL/SLES systems by performing a similar firewall fix to assist the multicast packets to get through. It will be interesting to see how many other problems remain after doing that.

Bear in mind that many of the offline machines are the older ones that were replaced during September as part of the Marist machine migration, so that is expected. (They've been offline in Jenkins for a while, but now need to be fully removed.)

@smlambert
Contributor Author

Of note is that they do not appear as "offline"; https://ci.adoptopenjdk.net/label/hw.arch.s390x&&ci.role.test/ shows:
[Screenshot: Screen Shot 2022-11-07 at 11 03 17 AM]

where I would have expected to see the red X as with some other offline nodes:
[Screenshot: Screen Shot 2022-11-07 at 11 03 58 AM]

@sxa
Member

sxa commented Nov 8, 2022

Re-runs on RHEL/SLES systems after adding the same iptables rule:

| Node | Grinder link | Predominant type of failure |
| --- | --- | --- |
| test-marist-rhel7-s390x-2 | Grinder/6118 | 110 failures, NoRouteToHost/Timeouts |
| test-marist-rhel8-s390x-2 | Grinder/6122 | 1 failure - only JLinkReproducibleTest |
| test-marist-sles12-s390x-2 | Grinder/6117 | 117 failures |
| test-marist-sles15-s390x-2 | Grinder/6120 | 114 failures |

@sxa
Member

sxa commented Nov 15, 2022

(Comment removed as it was supposed to be in #2820)

@adamfarley
Contributor

adamfarley commented Oct 17, 2023

Update: This issue (or something like it) is still seen.

https://ci.adoptium.net/job/Test_openjdk11_hs_extended.openjdk_s390x_linux/140/

e.g. on https://ci.adoptium.net/computer/test-marist-sles12-s390x-2

[2023-10-08T07:59:44.878Z] Running test jdk_rmi_1 ...
...
[2023-10-08T07:30:30.158Z] java.lang.RuntimeException: java.rmi.ConnectIOException: Exception creating connection to: 148.100.74.193; nested exception is: 
[2023-10-08T07:30:30.158Z] 	java.net.NoRouteToHostException: No route to host (Host unreachable)

After a number of NoRouteToHostExceptions in other targets, the jdk_jfr_1 target appears to cause the entire job to fail, and I'm guessing it's related to this issue.

Have the other jobs associated with this issue failed as well? By "failed" I mean failed outright, i.e. a red job in Jenkins, not an "unsafe" one.
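
When scanning a long console log for this failure, a small script can count the NoRouteToHostException lines and collect the hosts the tests were trying to reach. The sketch below is an illustration only: the function name `summarize_no_route` is made up, and the regular expression assumes the "Exception creating connection to: <host>;" wording seen in the RMI excerpt above, which other test targets may not use.

```python
import re

# Matches the RMI message format seen in the log excerpt; other targets may word this differently.
TARGET = re.compile(r"Exception creating connection to: ([0-9A-Za-z.:\-]+);")


def summarize_no_route(log_text):
    """Return (number of NoRouteToHostException lines, sorted list of target hosts)."""
    count = 0
    targets = set()
    last_target = None
    for line in log_text.splitlines():
        match = TARGET.search(line)
        if match:
            # Remember the host; the nested exception follows on a later line.
            last_target = match.group(1)
        if "java.net.NoRouteToHostException" in line:
            count += 1
            if last_target is not None:
                targets.add(last_target)
                last_target = None
    return count, sorted(targets)


EXCERPT = """\
[2023-10-08T07:30:30.158Z] java.lang.RuntimeException: java.rmi.ConnectIOException: Exception creating connection to: 148.100.74.193; nested exception is:
[2023-10-08T07:30:30.158Z] \tjava.net.NoRouteToHostException: No route to host (Host unreachable)
"""

print(summarize_no_route(EXCERPT))
```

If every reported target turns out to be the machine's own address, that points at local routing or firewall configuration rather than an unreachable remote host.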

@sxa
Member

sxa commented Nov 2, 2023

Let's check if the outstanding problems are only on the SLES12 systems and whether they also occur in the docker SLES12 images that we have.

@smlambert
Contributor Author

During the March JDK22 release activities: in Grinder/9226, 50 compiler testcases fail with "no route to host" issues on test-marist-sles12-s390x-2.

FYI @steelhead31

@sxa
Member

sxa commented Nov 5, 2024

A little more narrowing down may be useful to verify the systems it's running on and then decide whether we need to continue supporting them, and potentially raise with Marist if it is across the board on all machines.

@sxa
Member

sxa commented Nov 22, 2024

It's been a while so time for a new table!
This uses the re-run from the jdk17u Grinder link above, but with the master branch of aqa-tests (since the one in the link doesn't work), the latest released version, and JDK_BRANCH=master.

| Machine | Job Link | Duration | Result |
| --- | --- | --- | --- |
| test-marist-rhel8-s390x-2 | 11764 | 2h32 | |
| test-marist-ubuntu2404-s390x-1 | 11765 | 4h24 | 3 failures in java/net/DatagramSocket and MulticastSocket - all timeouts - same as test-marist-ubuntu2204-s390x-1 below |
| test-docker-sles12-s390x-1 | 11766 | 2m14 | 1 failure: HttpsURLConnection/PostThruProxyWithAuth.java.PostThruProxyWithAuth, Cannot run program "hostname": error=2, No such file or directory (rerun*100). [*] Passes with hostname fixed |
| test-docker-sles12-s390x-1 | 11788 | | [*] |
| test-docker-ubi9-s390x-1 | 11767 | 2h04 | |
| test-docker-ubuntu2404-s390x-1... | 11768 | 2h03 | |
| test-marist-sles15-s390x-2 | 11769 | 11h | 82 failures |
| test-marist-rhel8-s390x-2 | 11770 | 5h53 | |
| test-marist-sles15-s390x-2 | 11771 | | 82 failures again. Many "No route to host" |
| test-marist-sles12-s390x-2 | 11787 | | 84 failures. See also: full extended.openjdk run aborted after 1d0h |
| test-marist-rhel7-s390x-2 | 11790 | | 74 failures, some "No route to host". New extended.openjdk run aborted after 1d0h |
| test-marist-ubuntu2204-s390x-1 | 11792 | 4h04m | 3 failures: DataGramSocketExample, DatagramSocketMulticasting and SetLoopbackModeIPv4. No "No route to host" errors |
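
To make the split between healthy and problematic machines explicit, the failure counts in the table above can be tallied per machine. The sketch below copies only the rows that list a failure count. The cut-off of 50 failures is an arbitrary assumption for illustration, and the names `RESULTS`, `worst_per_machine` and `machines_over` are made up for this example.

```python
# (machine, Grinder job number, failure count), copied from the rows that list a count.
RESULTS = [
    ("test-marist-ubuntu2404-s390x-1", 11765, 3),
    ("test-docker-sles12-s390x-1", 11766, 1),
    ("test-marist-sles15-s390x-2", 11769, 82),
    ("test-marist-sles15-s390x-2", 11771, 82),
    ("test-marist-sles12-s390x-2", 11787, 84),
    ("test-marist-rhel7-s390x-2", 11790, 74),
    ("test-marist-ubuntu2204-s390x-1", 11792, 3),
]


def worst_per_machine(results):
    """Map each machine to the highest failure count seen in any of its runs."""
    worst = {}
    for machine, _job, failures in results:
        worst[machine] = max(failures, worst.get(machine, 0))
    return worst


def machines_over(results, threshold=50):
    """Return machines whose worst run has at least `threshold` failures (assumed cut-off)."""
    return sorted(m for m, n in worst_per_machine(results).items() if n >= threshold)


for machine in machines_over(RESULTS):
    print(machine)
```

With these rows, any threshold between 4 and 74 selects the same three machines, so the exact value matters little here.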

[*] zypper install hostname has now been run on this container; note that the non-container SLES12 machine gets that command from the net-tools package instead. The second line for this host is the run after the package was added. This was already fixed in #3481, but the container has not been redeployed since. Also, since SLES12 has been out of regular support as of October, it probably makes sense to remove that system (we still have the docker container for SLES12, which is passing). It might be interesting to see if a RHEL7 container on the RHEL8 host works with podman, although RHEL7 is also now out of support and we'd likely need to resolve #3808 first to test, or fire up the pre-built container image on that machine.
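
A missing external command such as hostname can be caught before a test run with a simple pre-flight check. The following is a minimal sketch, assuming the list of commands to check is supplied by whoever runs it; the helper name `missing_commands` is hypothetical, and only hostname is known from this thread to be needed.

```python
import shutil


def missing_commands(commands):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # "hostname" is the command the PostThruProxyWithAuth failure complained about.
    missing = missing_commands(["hostname"])
    if missing:
        print("missing commands:", ", ".join(missing))
    else:
        print("all required commands found")
```

Running a check like this inside a freshly deployed container would flag the gap without waiting for a test to fail with error=2.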

From an earlier comment our RHEL8 machine was not previously passing the tests (despite the iptables fix being put in place) but it is in the latest table above.

Based on the above it's entirely possible that the subject message is only applicable to test-marist-sles15-s390x-2 now, and the others should be covered under separate issues ... Although based on adoptium/aqa-tests#5156 (comment) I'm going to try and run the whole of extended.openjdk on this machine to see if it hits any of these errors elsewhere in the compiler suite: https://ci.adoptium.net/job/Test_openjdk21_hs_extended.openjdk_s390x_linux/81/

Noting also that test-marist-rhel7-s390x-2 has been taken offline due to these exceptions so is not included in the above tests. Memo to self: iptables suggestion that worked on some machines previously is at #2807 (comment)

@sxa
Member

sxa commented Nov 28, 2024

The three particularly problematic machines have now been taken offline. The expectation is that the RHEL7 and SLES12 machines will be decommissioned, although we may wish to examine SLES15 further, and perhaps attempt a new SLES15 provision to see how that goes.
Affected machines:

  • test-marist-rhel7-s390x-2
  • test-marist-sles12-s390x-2
  • test-marist-sles15-s390x-2

@sxa sxa moved this from Todo to Paused/Blocked in 2024 4Q Adoptium Plan Nov 28, 2024