Health check extension returns 200 status code during errors #8276

Closed
william-tran opened this issue Mar 7, 2022 · 14 comments
Assignees: jpkrohling
Labels: bug, closed as inactive, extension/healthcheck, Stale

@william-tran commented Mar 7, 2022

Describe the bug
When I simulate exporter errors and use the health check extension with check_collector_pipeline enabled, I get a response like:

$ while true; do python3 test.py; curl -v localhost:13133; done

Traces cannot be uploaded; HTTP status code: 500, message: Internal Server Error
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 13133 (#0)
> GET / HTTP/1.1
> Host: localhost:13133
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 07 Mar 2022 17:06:40 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact
* Closing connection 0

Steps to reproduce

Run v0.46.0 with this config.yaml:

receivers:
  jaeger:
    protocols:
      thrift_http:

extensions:
  health_check:
    check_collector_pipeline:
      enabled: true
      interval: 1s
      exporter_failure_threshold: 1

exporters:
  otlphttp:
    # should result in connection refused
    endpoint: "http://localhost:55555"
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 10

service:
  extensions:
    - health_check
  pipelines:
    traces:
      receivers:
        - jaeger
      exporters:
        - otlphttp

And with this Python script, test.py:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({SERVICE_NAME: "my-helloworld-service"})
    )
)
tracer = trace.get_tracer(__name__)

# create a JaegerExporter
jaeger_exporter = JaegerExporter(
    collector_endpoint='http://localhost:14268/api/traces?format=jaeger.thrift',
)

# Create a BatchSpanProcessor and add the exporter to it
span_processor = BatchSpanProcessor(jaeger_exporter)

# add to the tracer
trace.get_tracer_provider().add_span_processor(span_processor)

with tracer.start_as_current_span("hello"):
    print("Hello world from OpenTelemetry Python!")

and this requirements.txt:

opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-jaeger-thrift

Install the dependencies:

$ pip install -r requirements.txt

Execute in a loop until you see errors:

$ while true; do python3 test.py; curl -v localhost:13133; done

What did you expect to see?
Health check eventually responds with a 5xx status code

What did you see instead?
Health check always responds with a 200 status code

What version did you use?
0.46.0

What config did you use?
See above

Environment
OS: tested locally on macOS

william-tran added the bug label Mar 7, 2022
jpkrohling self-assigned this Mar 7, 2022
@jpkrohling (Member)

I'm assigning this to myself as I'm the code owner, but I believe we haven't yet implemented reporting of the state of individual components.

@william-tran (Author) commented Mar 7, 2022

@jpkrohling Sorry, this might be a red herring: when I use interval: 1m instead, the health check eventually returns 500, but after a minute it reverts back to 200.
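
For reference, a minimal sketch of the extension block I used for that test (assuming the rest of config.yaml above is unchanged and check_collector_pipeline is enabled):

extensions:
  health_check:
    check_collector_pipeline:
      enabled: true
      interval: 1m
      exporter_failure_threshold: 1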

@william-tran (Author)

More context: when running a traces exporter like otlp or kafka, the TCP connection sometimes dies, and since there is no built-in connection restart, the exporter queue starts filling up. I want to restart otel-collector to re-establish connections, ideally before data loss occurs when the exporter queue hits capacity. Exposing queue usage as a metric (open-telemetry/opentelemetry-collector#4902) and then configuring a percent-of-capacity threshold for health check failure, e.g. "signal unhealthy when the queue reaches 95% of capacity", would be a way to prevent data loss.
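
For example, the restart I have in mind would be driven by something like a liveness probe against the extension's endpoint. A hypothetical sketch, assuming the collector runs in Kubernetes and the health_check extension listens on its default port 13133:

# Hypothetical pod spec fragment: restart the collector container once the
# health_check endpoint starts returning non-2xx responses.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:0.46.0
    livenessProbe:
      httpGet:
        path: /
        port: 13133
      periodSeconds: 10
      failureThreshold: 3

With check_collector_pipeline working as expected, a sustained run of exporter failures would flip the endpoint to 5xx and the container would be restarted.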

@ItsLastDay (Contributor)

I reported the same issue in #11780, with more technical details (e.g. an explanation of why the health check initially serves status 500 but reverts to 200 after a minute).

codeboten added the extension/healthcheck label Sep 16, 2022
@github-actions (bot)

Pinging code owners: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling (Member)

I still have this on my queue.

github-actions bot added the Stale label Oct 30, 2023
jpkrohling removed the Stale label Nov 1, 2023

github-actions bot added the Stale label Jan 1, 2024
github-actions bot commented Mar 1, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned Mar 1, 2024