SCT fails with OSError: [Errno 98] Address already in use #6345

Closed
1 of 2 tasks
ilya-rarov opened this issue Jul 10, 2023 · 6 comments · Fixed by #6348

@ilya-rarov
Contributor

ilya-rarov commented Jul 10, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The artifacts-ubuntu2204-arm-test job failed with the error:

14:36:24  < t:2023-07-09 11:36:24,227 f:test_config.py  l:296  c:sdcm.test_config     p:INFO  > Initializing Argus connection...
14:36:24  < t:2023-07-09 11:36:24,703 f:tester.py       l:390  c:ArtifactsTest        p:INFO  > test_id 707e495b-0757-47c6-966e-4a613fa62c4f already exists in Argus with status: created
14:36:25  < t:2023-07-09 11:36:24,849 f:tester.py       l:405  c:ArtifactsTest        p:INFO  > sct_runner info in Argus TestRun is updated
14:36:26  Process SyncManager-3:
14:36:26  Traceback (most recent call last):
14:36:26    File "/usr/local/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
14:36:26      self.run()
14:36:26    File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
14:36:26      self._target(*self._args, **self._kwargs)
14:36:26    File "/usr/local/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
14:36:26      server = cls._Server(registry, address, authkey, serializer)
14:36:26    File "/usr/local/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
14:36:26      self.listener = Listener(address=address, backlog=16)
14:36:26    File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 453, in __init__
14:36:26      self._listener = SocketListener(address, family, backlog)
14:36:26    File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 596, in __init__
14:36:26      self._socket.bind(address)
14:36:26  OSError: [Errno 98] Address already in use
14:36:26  
14:36:26  Aborted!

The error occurred at the very beginning of the Run SCT Test () stage of the pipeline, while ClusterTester was being initialized, so the test never actually started.
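For readers unfamiliar with Errno 98, here is a minimal sketch of how a second bind to an address that is already taken fails. It uses a TCP loopback socket purely for illustration; the listener in the traceback belongs to a multiprocessing manager and may use a different socket family, and the helper name `bind_same_address_twice` is hypothetical.

```python
import errno
import socket


def bind_same_address_twice():
    """Bind two sockets to one address and return the errno of the second bind."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port; the first bind succeeds.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        address = first.getsockname()
        try:
            # The address is now owned by `first`, so this bind is rejected.
            second.bind(address)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    code = bind_same_address_twice()
    print(code == errno.EADDRINUSE)
```

On Linux `errno.EADDRINUSE` is 98, which matches the number in the log above.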

How frequently does it reproduce?

It reproduced in many other artifacts jobs
scylla-master:

scylla-enterprise:

and more

Installation details

Cluster size: 1 nodes (im4gn.xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: ami-022c8ce295ce9ac4c (aws: undefined_region)

Test: artifacts-ubuntu2204-arm-test
Test id: 707e495b-0757-47c6-966e-4a613fa62c4f
Test name: scylla-master/artifacts/artifacts-ubuntu2204-arm-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 707e495b-0757-47c6-966e-4a613fa62c4f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 707e495b-0757-47c6-966e-4a613fa62c4f

Logs:

Jenkins job URL
Argus

@soyacz
Contributor

soyacz commented Jul 11, 2023

@aleksbykov isn't it related to manager you added recently in #6302?
Maybe we can postpone starting it when needed only?

@aleksbykov
Contributor

@soyacz, yes, it looks related to that code. It failed while trying to initialize the new process for the event counter. But it seems to be an internal Python error, so even if the process is initialized later, it will hit the same error, because the address will still be in use. Maybe I am wrong. As a workaround, I can wrap the launch of this event process in try/except, so that a failure to start it doesn't fail the test. WDYT?

With perf and longevity tests I didn't hit this error.
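A rough sketch of the try/except workaround proposed above. The helper `start_optional_service` and its arguments are hypothetical and not part of SCT. Catching `EOFError` alongside `OSError` is an assumption: the `OSError` in the log is raised inside the manager's child process, and the parent may only see the connection to that child close.

```python
import logging

LOGGER = logging.getLogger(__name__)


def start_optional_service(start, name="events counter"):
    """Run `start()`; log and return False instead of failing the test."""
    try:
        start()
    except (OSError, EOFError) as exc:
        # Assumption: the service is optional for this test type, so a
        # startup failure is downgraded to a warning.
        LOGGER.warning("Could not start %s: %r", name, exc)
        return False
    return True
```

A caller would pass the bound `start` method of whatever object launches the process and skip counter-related steps when the helper returns `False`.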

@soyacz
Contributor

soyacz commented Jul 11, 2023

> @soyacz, yes, it looks related to that code. It failed while trying to initialize the new process for the event counter. But it seems to be an internal Python error, so even if the process is initialized later, it will hit the same error, because the address will still be in use. Maybe I am wrong. As a workaround, I can wrap the launch of this event process in try/except, so that a failure to start it doesn't fail the test. WDYT?

We don't count events in ArtifactsTest, so this could be a workaround.

> With perf and longevity tests I didn't hit this error.

Because we run only one test at a time on SCT runners.

How about dropping the manager altogether and using a Queue/Pipe as a bus for add/get/remove counter (or even not getting counters at all and just calling get_stats straight from EventsCounter), and making the counter registry internal to EventsCounter? WDYT?
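A minimal sketch of the Queue/Pipe idea, assuming a POSIX host where the "fork" start method is available. `EventsCounterSketch` and `_counter_loop` are hypothetical names and do not reflect the real EventsCounter API. The worker process owns the counters; commands travel over a queue and stats come back over a pipe, so no listener has to bind a named address.

```python
import multiprocessing as mp
from collections import Counter

# Assumption: POSIX host. "fork" keeps the sketch runnable as a plain script.
_CTX = mp.get_context("fork")


def _counter_loop(commands, replies):
    """Worker: keeps the counters private and serves commands in order."""
    counts = Counter()
    while True:
        op, payload = commands.get()
        if op == "add":
            counts[payload] += 1
        elif op == "get_stats":
            replies.send(dict(counts))
        elif op == "stop":
            break


class EventsCounterSketch:
    """Hypothetical stand-in for a counter process without a manager."""

    def __init__(self):
        self._commands = _CTX.Queue()
        self._reply_conn, worker_conn = _CTX.Pipe()
        self._process = _CTX.Process(
            target=_counter_loop,
            args=(self._commands, worker_conn),
            daemon=True,
        )

    def start(self):
        self._process.start()

    def add(self, event_name):
        self._commands.put(("add", event_name))

    def get_stats(self):
        # Commands from one producer are handled in order, so the reply
        # reflects every add() issued before this call.
        self._commands.put(("get_stats", None))
        return self._reply_conn.recv()

    def stop(self):
        self._commands.put(("stop", None))
        self._process.join(timeout=5)
```

Keeping the counters inside the worker means callers never hold shared proxy objects; they only send messages and read a snapshot.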

@fruch
Contributor

fruch commented Jul 11, 2023

One thing is for sure: in artifacts tests we can end up running multiple jobs on the same builder at the same time.

So the "address already in use" failure sounds related to two tests running at the same time.

We shouldn't have such a limit; it could backfire in multiple places.

@fruch
Contributor

fruch commented Jul 11, 2023

sounds like it's a bug in python 3.9 and 3.10
and we are pinning to 3.10.0-slim-bullseye, we should move to use 3.10.12-slim-bullseye
(and also start looking at moving to python 3.11)
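One way to catch a stale interpreter early is a startup guard, sketched below. The helper name is hypothetical, and 3.10.12 is simply the tag named in the comment above; which patch release first carries the relevant change is not stated in this thread.

```python
import sys

# Assumption: 3.10.12 is taken from the image tag proposed in the comment.
MINIMUM_PYTHON = (3, 10, 12)


def interpreter_is_recent_enough(version=None, minimum=MINIMUM_PYTHON):
    """Return True when `version` (default: the running interpreter) >= minimum."""
    current = tuple(sys.version_info[:3]) if version is None else tuple(version)
    return current >= minimum
```

A runner could log a warning, or refuse to start, when this returns `False`.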

@fruch
Contributor

fruch commented Jul 11, 2023

I'm opening a PR with rebuild of the image, I think it would solve this one.

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Jul 11, 2023
It looks like we are using a very early release of Python 3.10
and did not get the revert that was done in python/cpython#98503.
Rebuilding the image with the latest release of Python 3.10.

Fixes: scylladb#6345
Ref: python/cpython#98503
fruch added a commit that referenced this issue Jul 11, 2023
fruch added a commit that referenced this issue Jul 11, 2023
(cherry picked from commit 61bd889)
soyacz pushed a commit to soyacz/scylla-cluster-tests that referenced this issue May 9, 2024
(cherry picked from commit 61bd889)