
simcore datcore-adapter service periodically fails to connect otel-collector host in inhouse-master deploy #825

Closed
pcrespov opened this issue Sep 20, 2024 · 5 comments · Fixed by #826
Labels
FAST p:high-prio t:bug Something isn't working

Comments

@pcrespov
Member

pcrespov commented Sep 20, 2024

  • The datcore-adapter service seems to have a problem communicating with otel-collector.
  • Opening Graylog Log Analysis on master shows the following error recurring periodically:
Exception while exporting Span batch.
Traceback (most recent call last):
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 400, in request
    self.endheaders()
  File "/usr/local/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.11/http/client.py", line 1058, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.11/http/client.py", line 996, in send
    self.connect()
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 238, in connect
    self.sock = self._new_conn()
                ^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPConnection object at 0x7f3677767790>: Failed to resolve 'otel-collector' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/scu/.venv/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='otel-collector', port=4318): Max retries exceeded with url: /v1/traces (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f3677767790>: Failed to resolve 'otel-collector' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/scu/.venv/lib/python3.11/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 367, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 169, in export
    return self._export_serialized_spans(serialized_data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 139, in _export_serialized_spans
    resp = self._export(serialized_data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 114, in _export
    return self._session.post(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scu/.venv/lib/python3.11/site-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='otel-collector', port=4318): Max retries exceeded with url: /v1/traces (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f3677767790>: Failed to resolve 'otel-collector' ([Errno -2] Name or service not known)"))
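The root cause in the traceback is the DNS lookup itself: `socket.getaddrinfo` raises `socket.gaierror` for `otel-collector`, which urllib3 wraps as `NameResolutionError`, and requests finally surfaces it as `ConnectionError` inside the OpenTelemetry batch exporter. A minimal sketch for checking name resolution directly, independent of the exporter, could look like the following. The helper name `can_resolve` is hypothetical, and the assumption is that it is run from inside the datcore-adapter container so it sees the same network and DNS configuration as the exporter.

```python
import socket


def can_resolve(host: str, port: int) -> bool:
    """Return True if `host` resolves to at least one TCP address for `port`.

    Hypothetical helper: it performs the same kind of lookup urllib3 does
    before connecting (socket.getaddrinfo with SOCK_STREAM). A
    socket.gaierror here corresponds to the NameResolutionError above.
    """
    try:
        return len(socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Host and port taken from the traceback (OTLP/HTTP traces endpoint).
    host, port = "otel-collector", 4318
    print(f"{host}:{port} resolvable: {can_resolve(host, port)}")
```

Running this periodically from within the container would show whether the failure is intermittent DNS resolution of the `otel-collector` service name, rather than the collector rejecting requests.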
@pcrespov added this to the MartinKippenberger milestone on Sep 20, 2024
@pcrespov changed the title from "simcore_master_datcore-adapter service periodically fails to connect otel-collector host" to "simcore datcore-adapter service periodically fails to connect otel-collector host in inhouse-master deploy" on Sep 20, 2024
@mrnicegyu11 transferred this issue from ITISFoundation/osparc-simcore on Sep 30, 2024
@mrnicegyu11 added the t:bug, p:high-prio and FAST labels on Sep 30, 2024
@mrnicegyu11
Member

I hope my PR fixed this :--)

@mrnicegyu11 reopened this on Sep 30, 2024
@mrnicegyu11
Member

Please close this if it looks fixed.

@mrnicegyu11
Member

Again temporarily fixed on osparc.io; fixing this properly requires a production release.

@pcrespov
Member Author

pcrespov commented Oct 21, 2024

> again temporarily fixed on osparc.io, to fix this properly a prod-release is needed

@mrnicegyu11 I took the liberty of adding it in the next release issue ITISFoundation/osparc-simcore#6441

@mrnicegyu11
Member

I verified today that the issue is no longer present on either the on-prem or the AWS master deployment, closing :--)
