Health check extension returns 200 status code during errors #8276

Closed
william-tran opened this issue Mar 7, 2022 · 14 comments
Assignees: jpkrohling
Labels: bug, closed as inactive, extension/healthcheck, Stale

@william-tran commented Mar 7, 2022

Describe the bug
When I simulate exporter errors and use the health check extension with check_collector_pipeline enabled, I get a response like:

$ while true; do python3 test.py; curl -v localhost:13133; done

Traces cannot be uploaded; HTTP status code: 500, message: Internal Server Error
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 13133 (#0)
> GET / HTTP/1.1
> Host: localhost:13133
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 07 Mar 2022 17:06:40 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact
* Closing connection 0

Steps to reproduce

Run v0.46.0 with this config.yaml:

receivers:
  jaeger:
    protocols:
      thrift_http:

extensions:
  health_check:
    check_collector_pipeline:
      enabled: true
      interval: 1s
      exporter_failure_threshold: 1

exporters:
  otlphttp:
    # should result in connection refused
    endpoint: "http://localhost:55555"
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 10

service:
  extensions:
    - health_check
  pipelines:
    traces:
      receivers:
        - jaeger
      exporters:
        - otlphttp

And with this Python script, test.py:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({SERVICE_NAME: "my-helloworld-service"})
    )
)
tracer = trace.get_tracer(__name__)

# create a JaegerExporter
jaeger_exporter = JaegerExporter(
    collector_endpoint='http://localhost:14268/api/traces?format=jaeger.thrift',
)

# Create a BatchSpanProcessor and add the exporter to it
span_processor = BatchSpanProcessor(jaeger_exporter)

# add to the tracer
trace.get_tracer_provider().add_span_processor(span_processor)

with tracer.start_as_current_span("hello"):
    print("Hello world from OpenTelemetry Python!")

and this requirements.txt:

opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-jaeger-thrift

Install the dependencies:

$ pip install -r requirements.txt

Execute in a loop until you see errors:

$ while true; do python3 test.py; curl -v localhost:13133; done

What did you expect to see?
Health check eventually responds with a 5xx status code

What did you see instead?
Health check always responds with a 200 status code

What version did you use?
0.46.0

What config did you use?
See above

Environment
OS: tested locally on macOS

william-tran added the bug label Mar 7, 2022
jpkrohling self-assigned this Mar 7, 2022
@jpkrohling (Member)

I'm assigning this to myself as I'm the code owner, but I believe we haven't yet implemented reporting of the state of individual components.

@william-tran (Author) commented Mar 7, 2022

@jpkrohling Sorry, this might be a red herring: when I use interval: 1m instead, the health check eventually returns 500, but after a minute it reverts back to 200.
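
For reference, a minimal sketch of the extension block I used for that test (assuming the rest of config.yaml above is unchanged and check_collector_pipeline is enabled):

extensions:
  health_check:
    check_collector_pipeline:
      enabled: true
      interval: 1m
      exporter_failure_threshold: 1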

@william-tran (Author)

More context: when running a traces exporter like otlp or kafka, the TCP connection sometimes dies, and since there is no built-in connection restart, the exporter queue starts filling up. I want to restart otel-collector to re-establish connections, ideally before data loss occurs when the exporter queue hits capacity. Exposing queue usage as a metric (open-telemetry/opentelemetry-collector#4902) and then configuring a percent-of-capacity threshold for health check failure, e.g. "signal unhealthy when the queue reaches 95% of capacity", would be a way to prevent data loss.
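
For example, the restart I have in mind would be driven by something like a liveness probe against the extension's endpoint. A hypothetical sketch, assuming the collector runs in Kubernetes and the health_check extension listens on its default port 13133:

# Hypothetical pod spec fragment: restart the collector container once the
# health_check endpoint starts returning non-2xx responses.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:0.46.0
    livenessProbe:
      httpGet:
        path: /
        port: 13133
      periodSeconds: 10
      failureThreshold: 3

With check_collector_pipeline working as expected, a sustained run of exporter failures would flip the endpoint to 5xx and the container would be restarted.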

@ItsLastDay (Contributor)

I reported the same issue in #11780, with more technical details (e.g. an explanation of why the health check initially serves status 500 but reverts to 200 after a minute).

codeboten added the extension/healthcheck label Sep 16, 2022
@github-actions (bot)

Pinging code owners: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling (Member)

I still have this on my queue.

github-actions bot added the Stale label Oct 30, 2023
jpkrohling removed the Stale label Nov 1, 2023

github-actions bot added the Stale label Jan 1, 2024
github-actions bot commented Mar 1, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned Mar 1, 2024