
(exporter-jaeger-thrift): Exception in forked environments #2837

Closed
jpmelos opened this issue Jul 22, 2022 · 20 comments
Labels: bug (Something isn't working), exporters, help wanted

Comments

@jpmelos

jpmelos commented Jul 22, 2022

Describe your environment
OTel packages v1.11.1 and v0.30b1.
Python 3.8.11, but the Python version should be irrelevant to this problem.
Bug spotted on macOS, but it should happen on all systems.

Steps to reproduce
This is really hard to reproduce because it depends heavily on timing. It's probably a race condition among threads.

What is the expected behavior?
Not seeing the exception.

What is the actual behavior?
This is the exception:

Traceback (most recent call last):
  File "/.../site-packages/opentelemetry/sdk/trace/export/__init__.py", line 358, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/.../site-packages/opentelemetry/exporter/jaeger/thrift/__init__.py", line 219, in export
    self._agent_client.emit(batch)
  File "/.../site-packages/opentelemetry/exporter/jaeger/thrift/send.py", line 70, in emit
    self.client.emitBatch(batch)
  File "/.../site-packages/opentelemetry/exporter/jaeger/thrift/gen/agent/Agent.py", line 61, in emitBatch
    self.send_emitBatch(batch)
  File "/.../site-packages/opentelemetry/exporter/jaeger/thrift/gen/agent/Agent.py", line 64, in send_emitBatch
    self._oprot.writeMessageBegin('emitBatch', TMessageType.ONEWAY, self._seqid)
  File "/.../site-packages/thrift/protocol/TCompactProtocol.py", line 157, in writeMessageBegin
    assert self.state == CLEAR
AssertionError

Additional context
None.

@jpmelos added the bug (Something isn't working) label Jul 22, 2022
@jpmelos changed the title from "Exception for BatchSpanExporter in forked environments" to "Exception for BatchSpanProcessor in forked environments" Jul 22, 2022
@jpmelos
Author

jpmelos commented Jul 22, 2022

I believe the problem comes from the fact that, even if the BatchSpanProcessor is now ready for forked environments (per #2242), the exporter objects it carries are not. So even if the processor restarts its queue and thread when a fork happens, the exporter it uses may be left in an inconsistent state after the fork.

I believe the solution here would be the processor receiving a callable that instantiates the exporter, so that it can create a new exporter after a fork. Currently, it receives a ready exporter instance that can't be easily recreated after the fork by the _at_fork_reinit function.

One specific instance, which is what causes the above exception, is the JaegerExporter. When instantiated, it creates an AgentClientUDP, which requires a coherent internal state to work properly (as it composes the messages to the collector). If a fork happens right in the middle of the processor-exporter-client compound composing a message to send, the forked compound (in the child process) could be left in an inconsistent state. The processor will restart the thread and the queue, but the exporter will still be in a state where it thinks it's in the middle of composing a message. The next time it tries to compose a message to send, it will fail.
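To illustrate the factory idea, here is a minimal sketch. This is a hypothetical API, not the actual SDK: the class name, the exporter_factory parameter, and the _at_fork_reinit behavior shown here only mirror the proposal.

import os

class FactoryBatchSpanProcessor:
    """Hypothetical processor that is given a callable instead of a ready exporter."""

    def __init__(self, exporter_factory):
        # Keep the factory around so a brand-new exporter can be built after a fork.
        self._exporter_factory = exporter_factory
        self._exporter = exporter_factory()
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._at_fork_reinit)

    def _at_fork_reinit(self):
        # In the child process: drop the possibly half-used exporter and build
        # a clean one, in addition to resetting the queue and worker thread.
        self._exporter = self._exporter_factory()

# Usage: pass a callable, e.g. FactoryBatchSpanProcessor(lambda: JaegerExporter())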

@srikanthccv
Member

I believe the solution here would be the processor receiving a callable that instantiates the exporter, so that it can create a new exporter after a fork. Currently, it receives a ready exporter instance that can't be easily recreated after the fork by the _at_fork_reinit function.

Or the exporter and any other component in the pipeline could implement the necessary hooks themselves (before, after_in_parent, after_in_child) to clean up any inconsistent state resulting from the fork.
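For reference, those hook names match the keyword arguments of Python's os.register_at_fork. A minimal sketch of how a component could register them (the hook bodies are placeholders):

import os

def _before_fork():
    # e.g. finish or discard any half-composed message before the fork
    pass

def _after_in_parent():
    # e.g. resume normal operation in the parent process
    pass

def _after_in_child():
    # e.g. recreate sockets and reset protocol state so the child starts clean
    pass

os.register_at_fork(
    before=_before_fork,
    after_in_parent=_after_in_parent,
    after_in_child=_after_in_child,
)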

@jpmelos
Author

jpmelos commented Jul 23, 2022

I personally wouldn't implement the solution that way, because of the separation of concerns principle.

We only need to make the processor fork-ready because it uses threads. A fork only forks the thread that calls the fork function[1]. So the batch processor is now fork-aware because it uses threads, and, after the fork, it restarts the thread and resets the queue. But notice that all of this only exists because the batch processor uses threads.

The fact that the batch processor uses threads is completely outside the exporter's concern. The exporter is not worried about which processor is being used; it just gets the traces from any processor, composes the message to be sent over the wire, and sends it.

So, in my opinion, implementing concepts related to fork-treatment in exporters is, from an architecture point of view, a mistake. I still believe that the correct treatment needs to reside entirely in the processor itself, which is where the threads in this scenario come from, and that the solution is to teach the processor how to instantiate the exporter so it can do that when it needs to: after it notices a fork.

But I'll leave the judgement call to whoever decides to tackle the problem.

[1] Barring some platform-specific options; on most systems that's the default, and in most cases that's what happens.
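A small POSIX-only demonstration of that behavior (assuming CPython, where the threading module's bookkeeping is reset in the child after a fork):

import os
import threading
import time

threading.Thread(target=lambda: time.sleep(5), daemon=True).start()
print("threads in parent:", threading.active_count())  # 2: main + worker

pid = os.fork()
if pid == 0:
    # Only the thread that called fork() is replicated into the child.
    print("threads in child:", threading.active_count())  # 1
    os._exit(0)
os.waitpid(pid, 0)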

@srikanthccv
Member

srikanthccv commented Jul 23, 2022

A fork only forks the thread that calls the fork function

What do you mean it only forks the thread that calls the fork?

So, in my opinion, implementing concepts related to fork-treatment in exporters is, from an architecture point of view, a mistake.

One specific instance, which is what causes the above exception, is the JaegerExporter. When instantiated, it creates an AgentClientUDP, which requires a coherent internal state to work properly (as it composes the messages to the collector). If a fork happens right in the middle of the processor-exporter-client compound composing a message to send, the forked compound (in the child process) could be left in an inconsistent state.

If I understand you correctly, the UDP client used by the specific exporter will end up in some inconsistent state when the fork happens. The exception will be raised regardless of the processor?

@jpmelos
Author

jpmelos commented Jul 23, 2022

What do you mean it only forks the thread that calls the fork?

A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html

Emphasis mine. I mean that.

If I understand you correctly, the UDP client used by the specific exporter will end up in some inconsistent state when the fork happens. The exception will be raised regardless of the processor?

Yes, but it is not the concern of the exporter to work correctly in forked environments, because it is not expecting that and it is not forking anything. The concern belongs to the processor to make everything it needs work in the forked environment it is creating.

@srikanthccv
Member

The concern belongs to the processor to make everything it needs work in the forked environment it is creating.

Just want to make sure we are on the same page: is there a misunderstanding that the batch processor is forking? No, it doesn't fork, but its implementation is made fork-aware so it works in programs that use fork.

@srikanthccv
Member

My understanding of the issue so far is that AgentClientUDP will possibly end up in an inconsistent state if forking happens at just the wrong time in the export pipeline. This will be an issue irrespective of the processor used (i.e. it should be an issue with SimpleSpanProcessor as well). It does appear to be an issue with this specific exporter and the encoding it uses. IMO the batch processor shouldn't be bothered with the state and re-creation of exporter instances just because one exporter among them all is prone to errors. Even if it did, the underlying issue would not really be fixed, if I understand the description correctly. I think the correct fix would be for the exporter to make sure it remains in a clean state in such scenarios. Feel free to send a PR and correct me if I have understood it wrong.
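A sketch of what "the exporter making sure it remains in a clean state" could look like. The wrapper class below is hypothetical, and the AgentClientUDP constructor arguments shown are assumptions; only the general pattern matters.

import os

from opentelemetry.exporter.jaeger.thrift.send import AgentClientUDP

class ForkAwareAgentClient:
    """Hypothetical wrapper that rebuilds the UDP client in a forked child."""

    def __init__(self, host_name="localhost", port=6831):
        self._host_name = host_name
        self._port = port
        self._client = AgentClientUDP(host_name=host_name, port=port)
        if hasattr(os, "register_at_fork"):
            # In the child, the client may be mid-message; replace it with a
            # fresh instance whose thrift protocol state starts out clean.
            os.register_at_fork(after_in_child=self._reinit)

    def _reinit(self):
        self._client = AgentClientUDP(host_name=self._host_name, port=self._port)

    def emit(self, batch):
        self._client.emit(batch)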

@jpmelos
Author

jpmelos commented Jul 23, 2022

I don't remember saying the processor was forking. And it seems like GitHub doesn't either.

That said, I did say the processor should be the one worried about making things work in a forked environment, and I said that because it is the processor creating the thing that causes the trouble, which is the thread that uses the exporter. If that thread didn't exist, this problem also probably wouldn't exist. For example, if the exporter were only used in the main thread of the process, there wouldn't be a fork at the wrong time (by definition, since the exporter itself doesn't fork).

There is plenty of code all over the place that doesn't work in forked environments. That is okay. In fact, most code does not work in forked environments. You can't pick any piece of software, insert a fork in an arbitrary place, and expect it to work. Not everything is built to work in forked environments, and being fork-aware is something that is very intentional. So, even if the problem happens in the exporter's line of execution, that doesn't make it the exporter's responsibility to be fork-aware. The exporter itself is made to work correctly in a non-forked environment, and then it is the processor's responsibility to make sure the exporter gets that environment, because it is the processor that creates the thread, and that is what doesn't work in forked environments (if not done carefully, like how it's being done now).

This will be an issue irrespective of the processor used

That is very much not the case. If there is not a separate thread, there won't be a fork in the middle of the execution of the exporter, because the exporter itself doesn't fork, and so the exporter will work just fine in both the parent and the child processes. The problem only happens if the processor itself creates separate threads that operate the exporter, which is the case with the BatchSpanProcessor.

But let's say that there's a process using the exporter, and it calls something that begins a thrift struct, and forks before calling the function that ends the struct. Well, that is misuse of the exporter, and it's not the exporter's job to fix itself. It is the job of the process to make sure that, if it needs to do that, it re-instantiates or otherwise fixes the exporter after the fork in the child process. Or maybe the process just wants to let the child run and end up with duplicate messages. 🤷‍♂️ I don't know.

Either way, my original point stands: the problem stems from the existence of a separate thread using the exporter, and that thread comes from the BatchSpanProcessor. It is the processor's responsibility to fix the situation.

@srikanthccv
Member

But let's say that there's a process using the exporter, and it calls something that begins a thrift struct, and forks before calling the function that ends the struct. Well, that is misuse of the exporter, and it's not the exporter's job to fix itself. It is the job of the process to make sure that, if it needs to do that, it re-instantiates or otherwise fixes the exporter after the fork in the child process.

This is where we have a difference of opinion. I believe the exporter should either fix itself and make itself work, or document that it doesn't. It shouldn't be the telemetry processor's (or any other component's) concern to make sure the exporter (or another component in the trace pipeline) works in different environments. That's what I think; I will let other members chime in and share what they have to say about this.

@falsedlah

Any update on this? I have the same problem: I get the exporter exception when the application starts and from time to time afterwards. My application uses forks.

@lzchen
Contributor

lzchen commented Oct 26, 2022

Currently, the BatchSpanProcessor can be considered to support forked environments for itself, meaning the thread creation, the API surface, and the worker behavior should all work in a forked environment.

@jpmelos

I said that because it is the processor creating the thing that causes the trouble, which is the thread that uses the exporter. If that thread didn't exist, this problem also probably wouldn't exist.

What "causes trouble" is when an exporter that does NOT support being in a forked environment is used in conjunction with it. If an exporter is designed so that it can be used in a forked environment, or if it does not even hold any state (it simply does the exporting, as you've mentioned above), this "troubled state" would not happen either. It is not solely due to the thread existing in the batch span processor. This seems to me (as @srikanthccv pointed out) to be a responsibility of the exporter. I do not believe it is the responsibility of the span processor to make sure that its components (the exporter) support being in a forked environment, just that it itself does.

That being said, claiming that "my OpenTelemetry telemetry pipeline works in a forked environment" is a bold statement that has never been promised. This probably extends to other custom components as well (if they ever have workers), not just the BatchSpanProcessor. We can provide guidance (similar to the hooks @srikanthccv suggested) if any of these issues ever come up, to make those components "fork-environment compatible".

Any other opinions?

@PeterJCLaw

Is this issue specific to the Jaeger exporter? If so, could we update the title to clarify that? (Even if this only affects that exporter under the BatchSpanProcessor, if the BatchSpanProcessor is otherwise fork-safe it would be useful to clarify that, so users know the BatchSpanProcessor is otherwise safe.)

@falsedlah

Is this issue specific to the Jaeger exporter? If so, could we update the title to clarify that? (Even if this only affects that exporter under the BatchSpanProcessor, if the BatchSpanProcessor is otherwise fork-safe it would be useful to clarify that, so users know the BatchSpanProcessor is otherwise safe.)

In my case, it is the otel exporter, and I still have the problem with the latest version.

@PeterJCLaw

@falsedlah please could you clarify which exporter class you're using? (There are lots of exporters provided as part of Open Telemetry)

@srikanthccv
Member

In my case, it is the otel exporter, and I still have the problem with the latest version.

What is the exception? At a minimum, you should share the stack trace.

@kayleejacques

In our case, we are using SimpleSpanProcessor, gunicorn with multithreading (3 threads by default), and a forked environment.
We followed https://opentelemetry-python.readthedocs.io/en/latest/examples/fork-process-model/README.html#gunicorn-post-fork-hook to set up the instrumentation.

opentelemetry-exporter-jaeger-thrift==1.10.0
opentelemetry-instrumentation==0.29b0

Code snippet in gunicorn_config.py (simplified version):

# simplified; fix_django_req and tags are defined elsewhere in our config
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

def post_fork(server, worker):
    ......
    DjangoInstrumentor().instrument(request_hook=fix_django_req)
    tracer_provider = TracerProvider(resource=Resource.create(tags))
    tracer_provider.add_span_processor(SimpleSpanProcessor(JaegerExporter()))
    trace.set_tracer_provider(tracer_provider)

We occasionally ran into the exact same exception as reported above.

This forced us to set the sampling rate VERY low, at 0.0001, to prevent the exception from recurring. While we still get spans, it's not very helpful due to the low sampling rate.

We hope to get some thoughts on the implementation above. Perhaps we are not supposed to run tracing with multiple threads this way?
Any thoughts are much appreciated.

@srikanthccv
Member

@kayleejacques, you confirmed the point I made in the thread above: it doesn't matter whether the span processor used is Simple or Batch; the exception will still be raised, since the root cause is the jaeger-thrift exporter, whose AgentClientUDP can be left in an inconsistent state when forked. Here are some suggestions for your implementation.

  • The SimpleSpanProcessor is blocking and not recommended in production environments. Please use BatchSpanProcessor instead.

  • If you want the traces to be exported more frequently (assuming you use a Simple processor for this reason), try adjusting the OTEL_BSP_EXPORT_* settings.

  • The opentelemetry-exporter-jaeger-{thrift/grpc} packages have been deprecated and will be removed in the next release. Please use the OTLP exporter instead, since Jaeger has supported native OTLP ingestion since version 1.35 (see the deprecation note below; a sketch of the suggested setup follows this list).

    [screenshot of the Jaeger deprecation note]
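A minimal sketch of the suggested setup in gunicorn_config.py, using BatchSpanProcessor together with the OTLP gRPC exporter; the endpoint and service name are assumptions, adjust them for your deployment.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def post_fork(server, worker):
    # Configure the pipeline after the fork, so the worker thread created by
    # the BatchSpanProcessor lives in the child process from the start.
    tracer_provider = TracerProvider(resource=Resource.create({"service.name": "my-service"}))
    exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    tracer_provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(tracer_provider)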

@kayleejacques

@srikanthccv much appreciate your speedy response. I will consult with the team about switching to BatchSpanProcessor and adjusting OTEL_BSP_EXPORT_* to start. The exporter migration might be a bit more complicated, as we run the Jaeger agent as a DaemonSet, so we will most likely have to switch to the OTel Collector as well. Thanks again!

@aabmass
Member

aabmass commented Aug 3, 2023

it doesn't matter whether the span processor used is Simple or Batch; the exception will still be raised, since the root cause is the jaeger-thrift exporter,

@srikanthccv if the issue is not with BatchSpanProcessor, would you mind updating the issue title to reflect the issue better?

@srikanthccv changed the title from "Exception for BatchSpanProcessor in forked environments" to "(exporter-jaeger-thrift): Exception in forked environments" Aug 4, 2023
@srikanthccv
Member

A fix could have been made to make the exporter fork-aware, but the Jaeger exporters have since been removed from the codebase.

@srikanthccv closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 21, 2024
7 participants