Metric creation slowed down by unreachable collector with gRPC #3925
Removing the lock from these two methods makes the problem go away:

```go
// Temporality returns the Temporality to use for an instrument kind.
func (e *exporter) Temporality(k metric.InstrumentKind) metricdata.Temporality {
	start := time.Now()
	defer func() {
		fmt.Println("OTEL: exporter.Temporality took", time.Since(start))
	}()
	//e.clientMu.Lock()
	//defer e.clientMu.Unlock()
	return e.client.Temporality(k)
}

// Aggregation returns the Aggregation to use for an instrument kind.
func (e *exporter) Aggregation(k metric.InstrumentKind) aggregation.Aggregation {
	start := time.Now()
	defer func() {
		fmt.Println("OTEL: exporter.Aggregation took", time.Since(start))
	}()
	//e.clientMu.Lock()
	//defer e.clientMu.Unlock()
	return e.client.Aggregation(k)
}
```

Don't mind the `fmt.Println` calls; they are just there for timing. Anyway, given that the temporality and aggregation selectors aren't really using the client, is the sequentiality that you're trying to achieve via the lock needed there?
@MrAlias, can you tell me why this comment was added? Perhaps if I understood the rationale behind it, I could try to fix the issue myself. It's a serious issue, so we can't go to production with it. The problem is that I'm not sure that removing the lock for […]
I think that synchronization is required to avoid races when doing […]
Right, but that doesn't affect […]. At the moment I see these two being used for Temporality and Aggregation: […]
I see that the lock is currently needed to avoid a race, as the […].

EDIT: Personally I would simply use a […]
If there are no plans to get such information via the gRPC connection (for some reason), then yeah, I can have a stab at it. I'd like confirmation from @MrAlias before commencing any work, though, since apparently he wrote that bit.
It's been a while since I looked at this code but, if I recall correctly, the client lock was included to ensure synchronous access to all client methods so the HTTP and gRPC clients didn't have to manage concurrency. This was before the temporality and aggregation selection was added to the reader. As long as client implementations are updated to ensure they are concurrent-safe and the coordination with the […]
What do you think about having a leaner client and passing a […]? Both the aggregation and temporality selectors are already coming from […]. Logically speaking, the separation is already there. Instead of Client we could have:

```go
// Client handles the transmission of OTLP data to an OTLP receiving endpoint.
type Client interface {
	// UploadMetrics transmits metric data to an OTLP receiver.
	//
	// All retry logic must be handled by UploadMetrics alone, the Exporter
	// does not implement any retry logic. All returned errors are considered
	// unrecoverable.
	UploadMetrics(context.Context, *mpb.ResourceMetrics) error

	// ForceFlush flushes any metric data held by a Client.
	//
	// The deadline or cancellation of the passed context must be honored. An
	// appropriate error should be returned in these situations.
	ForceFlush(context.Context) error

	// Shutdown flushes all metric data held by a Client and closes any
	// connections it holds open.
	//
	// The deadline or cancellation of the passed context must be honored. An
	// appropriate error should be returned in these situations.
	//
	// Shutdown will only be called once by the Exporter. Once a return value
	// is received by the Exporter from Shutdown the Client will not be used
	// anymore. Therefore all computational resources need to be released
	// after this is called so the Client can be garbage collected.
	Shutdown(context.Context) error
}

type ConfigSelector interface {
	// Temporality returns the Temporality to use for an instrument kind.
	Temporality(metric.InstrumentKind) metricdata.Temporality

	// Aggregation returns the Aggregation to use for an instrument kind.
	Aggregation(metric.InstrumentKind) aggregation.Aggregation
}
```

And then in exporter.go:

```go
// exporter exports metrics data as OTLP.
type exporter struct {
	// Ensure synchronous access to the client across all functionality.
	clientMu sync.Mutex
	client   Client

	configSelector ConfigSelector

	shutdownOnce sync.Once
}
```

Here's a draft PR.
The aggregation and temporality methods may be called concurrently, both with themselves and with other methods of the Reader or Exporter. Implementations need to provide concurrent-safe methods. This change adds documentation about this requirement. Part of open-telemetry#3925
Description

I noticed that the creation of instruments while the agent/collector is down can easily take more than 10 seconds each.

I'm testing this by passing an invalid host (e.g. unreachable:4317) to my application, and I see these calls taking more than 10 seconds each: […]

That is a small switch that I have in my own instruments factory, which works as an adapter to your library. To me it looks like these operations all hold a mutex (i.e. e.clientMu): […]

I think Aggregation and Temporality are called when an instrument is created, but the same mutex is kept locked for the entire duration of Export as well, so maybe that is where the problem lies. I would expect Export to hold the lock just long enough to create a copy of the data that it needs to send, then unlock, then asynchronously take its time to send the data over the gRPC connection. Buffered channels can be used, and metrics can be discarded once the buffer is full.
Environment

Steps To Reproduce

You don't even need a collector. Just create many goroutines spawning instruments (also the same instrument over and over, since it should be cached in memory) and pass something like "unreachable:4317" as the gRPC endpoint.

This is my default retry config: […]

And this is how I initialize the meter provider: […]
Expected behavior
I would expect the creation of instruments not to take more than a few milliseconds at most.