Fix panic when max_span_count is reached, add counter metric #104

nickbp · 2022-01-19T01:57:31Z

Panic seen in ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0 with log-level=debug:

panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap

goroutine 20 [running]:
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e
created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226

Also adds metric counter and logging to surface when things are hitting backpressure.

Signed-off-by: Nick Parker [email protected]

Which problem is this PR solving?

Fixes panic when recently added max_span_count flush is being exercised

Short description of the changes

Updates case type so that the panic no longer occurs
Adds debug log and metric counter to allow monitoring of the setting, since reaching the limit may indicate backpressure/problems with writing spans to ClickHouse. (In my case I was seeing the panic when ClickHouse's disk was full)

nickbp · 2022-01-20T05:58:11Z

After fixing the panic, I'm now seeing an issue where the worker.CLose() call blocks indefinitely and eventually jaeger-clickhouse stops logging any progress. I'm assuming that this is a deadlock somewhere in that CLose() logic, which eventually halts all progress. I guessed that it might be due to a double lock on the mutex and tried fixing that in 935ff78. But with that fix I'm still seeing the deadlock behavior, where the expected Existing batch finished message is never produced.

At this point I'm assuming the indefinite block on CLose() is due to an issue in the worker's internal sync/messaging handling. I'm thinking an easy way of resolving the situation would be to just discard new batches when the limit is being exceeded (while logging a warning/increasing a counter), rather than blocking on existing batches to finish. This then avoids the need for this special handling.

nickbp · 2022-01-21T01:40:54Z

I simplified the max_span_count handling, and am no longer seeing the earlier stalled behavior when the max was reached:

When max_span_count is exceeded, switched to discarding data (and recording it in log+metric), rather than waiting potentially indefinitely(!) for the existing workers to finish their current work
Moved pending span counter into pool, removing the need for mutexing
Treated max_span_count<=0 configuration as unlimited

nickbp · 2022-01-25T09:03:23Z

I've been running this for a few days now in a test environment where clickhouse is intentionally still backed up, and it's been holding stable.

Panic seen in `ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0` with `log-level=debug`: ``` panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap goroutine 20 [running]: github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc) github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199 github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300) github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226 ``` Also adds metric counter and logging to surface when things are hitting backpressure. Signed-off-by: Nick Parker <[email protected]>

Signed-off-by: Nick Parker <[email protected]>

The current limit logic can result in a stall where `worker.CLose()` never returns due to errors being returned from ClickHouse. This switches to a simpler system of discarding new work when the limit is reached, ensuring that we don't get backed up indefinitely in the event of a long outage. Also moves the count of pending spans to the parent pool: - Avoids race conditions where new work can be started before it's added to the count - Mutexing around the count is no longer needed Signed-off-by: Nick Parker <[email protected]>

Signed-off-by: Nick Parker <[email protected]>

nickbp mentioned this pull request Jan 24, 2022

Looking for maintainers #105

Closed

Nick Parker added 5 commits January 26, 2022 08:42

Potential fix for deadlock: Avoid holding mutex while waiting on close

030ef89

Signed-off-by: Nick Parker <[email protected]>

Include arbitrary worker_id in logs to differentiate between retry loops

2ffa2cb

Signed-off-by: Nick Parker <[email protected]>

Fix lint

237e63c

Signed-off-by: Nick Parker <[email protected]>

nickbp merged commit 34ad236 into jaegertracing:main Jan 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix panic when max_span_count is reached, add counter metric #104

Fix panic when max_span_count is reached, add counter metric #104

nickbp commented Jan 19, 2022

nickbp commented Jan 20, 2022

nickbp commented Jan 21, 2022

nickbp commented Jan 25, 2022

Fix panic when max_span_count is reached, add counter metric #104

Fix panic when max_span_count is reached, add counter metric #104

Conversation

nickbp commented Jan 19, 2022

Which problem is this PR solving?

Short description of the changes

nickbp commented Jan 20, 2022

nickbp commented Jan 21, 2022

nickbp commented Jan 25, 2022