Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enrichment/trace] Handle exception for span events #99

Merged
merged 6 commits into from
Oct 1, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 22 additions & 2 deletions enrichments/trace/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,18 @@ type SpanEventConfig struct {
// https://github.com/elastic/opentelemetry-dev/issues/374.
TimestampUs AttributeConfig `mapstructure:"timestamp_us"`
ProcessorEvent AttributeConfig `mapstructure:"processor_event"`

// For exceptions/errors
ErrorID AttributeConfig `mapstructure:"error_id"`
ErrorExceptionType AttributeConfig `mapstructure:"error_exception_type"`
ErrorExceptionMessage AttributeConfig `mapstructure:"error_exception_message"`
ErrorExceptionHandled AttributeConfig `mapstructure:"error_exception_handled"`
ErrorStacktrace AttributeConfig `mapstructure:"error_exception_stacktrace"`
ErrorGroupingKey AttributeConfig `mapstructure:"error_grouping_key"`

// For no exceptions/errors
Message AttributeConfig `mapstructure:"message"`
EventKind AttributeConfig `mapstructure:"event_kind"`
}

// AttributeConfig is the configuration options for each attribute.
Expand Down Expand Up @@ -124,8 +136,16 @@ func Enabled() Config {
RepresentativeCount: AttributeConfig{Enabled: true},
},
SpanEvent: SpanEventConfig{
TimestampUs: AttributeConfig{Enabled: true},
ProcessorEvent: AttributeConfig{Enabled: true},
TimestampUs: AttributeConfig{Enabled: true},
ProcessorEvent: AttributeConfig{Enabled: true},
ErrorID: AttributeConfig{Enabled: true},
ErrorExceptionType: AttributeConfig{Enabled: true},
ErrorExceptionMessage: AttributeConfig{Enabled: true},
ErrorExceptionHandled: AttributeConfig{Enabled: true},
ErrorStacktrace: AttributeConfig{Enabled: true},
ErrorGroupingKey: AttributeConfig{Enabled: true},
Message: AttributeConfig{Enabled: true},
EventKind: AttributeConfig{Enabled: true},
},
}
}
11 changes: 11 additions & 0 deletions enrichments/trace/internal/elastic/attributes.go
Original file line number Diff line number Diff line change
Expand Up @@ -47,4 +47,15 @@ const (
AttributeSpanDestinationServiceResource = "span.destination.service.resource"
AttributeSpanDurationUs = "span.duration.us"
AttributeSpanRepresentativeCount = "span.representative_count"

// span event attributes
AttributeParentID = "parent.id"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should solve that by adding a parent.id -> span_id alias in the logs data stream.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid the "parent" in parent.id is not the span, but the parent of span, i.e. parent.id != span_id.

I looked at the apm-data code, it appears that parent.id is really the span event's span's parent id. Not the span's id: https://github.com/elastic/apm-data/blob/cd0bb416903eebb5dc22e3eb6b6e4e9c75ad51fe/input/otlp/traces.go#L196

Parent id is passed onto the span event via cloning the span: https://github.com/elastic/apm-data/blob/cd0bb416903eebb5dc22e3eb6b6e4e9c75ad51fe/input/otlp/traces.go#L1072

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid the "parent" in parent.id is not the span, but the parent of span, i.e. parent.id != span_id.

I don't think this is true for errors. In case of errors, we specifically override the parent ID of the SpanEvent to be the SpanID (or `TransactionID for elastic transaction): https://github.com/elastic/apm-data/blob/bd5e34116889942c53b186deab0f76beaf182de0/input/otlp/traces.go#L1255-L1259

Copy link
Member

@carsonip carsonip Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wow, I missed that. Then is it a bug for logs from span events in apm-server?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this fix in apm-data look right? elastic/apm-data#374
If so, yes, I think we can have parent.id alias -> span.id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are discussing on slack the alternative to make the UI use span.id directly: https://elastic.slack.com/archives/C95SB62AG/p1727693736192519?thread_ts=1727462154.365409&cid=C95SB62AG

AttributeErrorID = "error.id"
AttributeErrorExceptionType = "error.exception.type"
AttributeErrorExceptionMessage = "error.exception.message"
AttributeErrorExceptionHandled = "error.exception.handled"
AttributeErrorStacktrace = "error.stack_trace"
AttributeErrorGroupingKey = "error.grouping_key"
AttributeEventKind = "event.kind"
AttributeMessage = "message"
lahsivjar marked this conversation as resolved.
Show resolved Hide resolved
)
82 changes: 80 additions & 2 deletions enrichments/trace/internal/elastic/span.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,11 @@
package elastic

import (
"crypto/md5"
"crypto/rand"
"encoding/hex"
"fmt"
"io"
"math"
"net"
"net/http"
Expand Down Expand Up @@ -152,7 +156,8 @@ func (s *spanEnrichmentContext) Enrich(span ptrace.Span, cfg config.Config) {
// Ensure all dependent attributes are handled.
s.normalizeAttributes()

if isElasticTransaction(span) {
isElasticTxn := isElasticTransaction(span)
if isElasticTxn {
s.enrichTransaction(span, cfg.Transaction)
} else {
s.enrichSpan(span, cfg.Span)
Expand Down Expand Up @@ -417,15 +422,38 @@ func (s *spanEnrichmentContext) setDestinationService(span ptrace.Span) {
}

type spanEventEnrichmentContext struct {
exception bool
exceptionType string
exceptionMessage string
exceptionStacktrace string

exception bool
exceptionEscaped bool
}

func (s *spanEventEnrichmentContext) enrich(
se ptrace.SpanEvent,
cfg config.SpanEventConfig,
) {
// TODO(lahsivjar): Is error/log context handling be done in es-exporter?
// Ref: https://github.com/elastic/apm-data/blob/59d33b8113d629f43dc82f9b3931ceda63a19a7a/input/otlp/traces.go#L1243-L1260
Copy link
Contributor Author

@lahsivjar lahsivjar Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[For reviewers] APM-data sets context for span-events where it sets fields like span.id, transaction.id, parent.id based on the span (ref), however, I am not sure if enrichments is the correct place to do this. Since the es-exporter could decide how to encode span-events into ES docs (separate or in the same doc as parent span) maybe it would be better to have this handled in the expoter? I am confused here, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that all properties that are already mapped as top level fields in the logs data stream should not be handled by enrichments but in the ES exporter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC: @carsonip This would require some changes in the spanevent handling in es-exporter. I think just copying some attributes from the parent span should be good enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Copy link
Member

@carsonip carsonip Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

span event span.id, transaction.id are already done in elasticsearchexporter https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/elasticsearchexporter/model.go#L645-L647

in that case, no change is needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need transaction.type for span events?

I don't know

If so, should they be in attributes.* or at top level?

If needed, it should be under attributes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that all properties that are already mapped as top level fields in the logs data stream should not be handled by enrichments but in the ES exporter.

If needed, it should be under attributes.

As they are under attributes, it is better done in enrichment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As they are under attributes, it is better done in enrichment

I don't think it is a good idea to do this as enrichments since the fields we are talking about here are derived from the span. The ES exporter decides HOW to extract and encode the span event (in a separate doc or in the same doc). Whlie currently we are encoding them as a separate doc making this dependent on enrichment doesn't feel correct to me.

(#99 (comment) discussion thread is also related to the same topic)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me and @carsonip talked about this IRL and decided to do it as enrichments for now due to the differences in the handling of error and log contexts in apm-data (ref). Once the specs are a bit more clearer around this we can revisit the enrichment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

222386e

The above commit adds transaction.{type, sampled} enrichment for span event.


// Extract top level span event information.
s.exception = se.Name() == "exception"
if s.exception {
se.Attributes().Range(func(k string, v pcommon.Value) bool {
switch k {
case semconv.AttributeExceptionEscaped:
s.exceptionEscaped = v.Bool()
case semconv.AttributeExceptionType:
s.exceptionType = v.Str()
case semconv.AttributeExceptionMessage:
s.exceptionMessage = v.Str()
case semconv.AttributeExceptionStacktrace:
s.exceptionStacktrace = v.Str()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregkalapos These attributes are mostly 1<>1 mappings however, I didn't see them in the otel-data plugin so added it here. Do you think we should do this here or in the otel-data plugin?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can create alias fields for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, aliases are how we have been handling such fields. Will wait for @gregkalapos to confirm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd be also for aliases as well.

One exception (no pun intended) may be the handled flag: from what I remember from the UI testing, the UI access that field, so it always needs to be present (We could question why, but from what I remember, that's how it is today), but exception.escaped is only recommended. So we always need to populate the handled flag.

That's the only edge case which comes to my mind, I'd expect the other ones to work with aliases.

Copy link
Member

@felixbarny felixbarny Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Could you add these aliases, Greg? One challenge will be how to add the attribute aliases just for logs, without these getting overridden with the standard set of attribute aliases. Maybe we'll need to add them to the alias definition that applies to all signals.
Also, I'm not sure if we need an alias for the exception stacktrace, since we likely need to fetch that from _source anyway, because IINM, we don't have a stored field for it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't even need the aliases, don't we? 🤔 This should all work out via passthrough.

Following could be sent via attributes and would be visible top level:

  • exception.type
  • exception.message
    ( above 2 are in SemConv - I don't think we even need to do anything with those here in the enrichment lib)
  • exception.handled: that is not part of OTel, se we need to set it here.

I opened a draft: elastic/elasticsearch#113636 I commented out all the aliases, the test is still green which queries exception.type top level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregkalapos I think the alias should be for error.exception.{type, message} -> exception.{type, message}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think we might need to put alias for stacktrace too: error.stack_trace -> exception.stacktrace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see.

Ok, we can do error.exception.{type, message} -> exception.{type, message}; I'll update the PR.

Same for stack trace.

}
return true
})
}

// Enrich span event attributes.
if cfg.TimestampUs.Enabled {
Expand All @@ -434,6 +462,43 @@ func (s *spanEventEnrichmentContext) enrich(
if cfg.ProcessorEvent.Enabled && s.exception {
se.Attributes().PutStr(AttributeProcessorEvent, "error")
}
if s.exceptionType == "" && s.exceptionMessage == "" {
// Span event does not represent an exception
se.Attributes().PutStr(AttributeEventKind, "event")
lahsivjar marked this conversation as resolved.
Show resolved Hide resolved
se.Attributes().PutStr(AttributeMessage, se.Name())
lahsivjar marked this conversation as resolved.
Show resolved Hide resolved
return
}

// Span event represents exception
if cfg.ErrorID.Enabled {
if id, err := newUniqueID(); err == nil {
se.Attributes().PutStr(AttributeErrorID, id)
}
}
if cfg.ErrorExceptionType.Enabled && s.exceptionType != "" {
se.Attributes().PutStr(AttributeErrorExceptionType, s.exceptionType)
}
if cfg.ErrorExceptionMessage.Enabled {
se.Attributes().PutStr(AttributeErrorExceptionMessage, s.exceptionMessage)
}
if cfg.ErrorExceptionHandled.Enabled {
se.Attributes().PutBool(AttributeErrorExceptionHandled, !s.exceptionEscaped)
}
if cfg.ErrorStacktrace.Enabled {
// TODO (lahsivjar): Where to parse stacktraces?
lahsivjar marked this conversation as resolved.
Show resolved Hide resolved
se.Attributes().PutStr(AttributeErrorStacktrace, s.exceptionStacktrace)
}
if cfg.ErrorGroupingKey.Enabled {
// See https://github.com/elastic/apm-data/issues/299
hash := md5.New()
// ignoring errors in hashing
if s.exceptionType != "" {
io.WriteString(hash, s.exceptionType)
} else if s.exceptionMessage != "" {
io.WriteString(hash, s.exceptionMessage)
}
se.Attributes().PutStr(AttributeErrorGroupingKey, hex.EncodeToString(hash.Sum(nil)))
lahsivjar marked this conversation as resolved.
Show resolved Hide resolved
}
}

// getRepresentativeCount returns the number of spans represented by an
Expand Down Expand Up @@ -540,3 +605,16 @@ var standardStatusCodeResults = [...]string{
"HTTP 4xx",
"HTTP 5xx",
}

func newUniqueID() (string, error) {
var u [16]byte
if _, err := io.ReadFull(rand.Reader, u[:]); err != nil {
return "", err
}

// convert to string
buf := make([]byte, 32)
hex.Encode(buf, u[:])

return string(buf), nil
}
38 changes: 33 additions & 5 deletions enrichments/trace/internal/elastic/span_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@
package elastic

import (
"crypto/md5"
"encoding/hex"
"net/http"
"testing"
"time"
Expand Down Expand Up @@ -764,18 +766,23 @@ func TestSpanEventEnrich(t *testing.T) {
name string
input ptrace.SpanEvent
config config.SpanEventConfig
errorID bool // indicates if the error ID should be present in the result
enrichedAttrs map[string]any
}{
{
name: "not_exception",
input: func() ptrace.SpanEvent {
event := ptrace.NewSpanEvent()
event.SetName("not_exception")
event.SetTimestamp(ts)
return event
}(),
config: config.Enabled().SpanEvent,
config: config.Enabled().SpanEvent,
errorID: false, // error ID is only present for exceptions
enrichedAttrs: map[string]any{
AttributeTimestampUs: ts.AsTime().UnixMicro(),
AttributeEventKind: "event",
AttributeMessage: "not_exception",
},
},
{
Expand All @@ -784,12 +791,25 @@ func TestSpanEventEnrich(t *testing.T) {
event := ptrace.NewSpanEvent()
event.SetName("exception")
event.SetTimestamp(ts)
event.Attributes().PutStr(semconv.AttributeExceptionType, "java.net.ConnectionError")
event.Attributes().PutStr(semconv.AttributeExceptionMessage, "something is wrong")
event.Attributes().PutStr(semconv.AttributeExceptionStacktrace, `Exception in thread "main" java.lang.RuntimeException: Test exception\\n at com.example.GenerateTrace.methodB(GenerateTrace.java:13)\\n at com.example.GenerateTrace.methodA(GenerateTrace.java:9)\\n at com.example.GenerateTrace.main(GenerateTrace.java:5)`)
return event
}(),
config: config.Enabled().SpanEvent,
config: config.Enabled().SpanEvent,
errorID: true,
enrichedAttrs: map[string]any{
AttributeTimestampUs: ts.AsTime().UnixMicro(),
AttributeProcessorEvent: "error",
AttributeTimestampUs: ts.AsTime().UnixMicro(),
AttributeProcessorEvent: "error",
AttributeErrorExceptionHandled: true,
AttributeErrorExceptionType: "java.net.ConnectionError",
AttributeErrorExceptionMessage: "something is wrong",
AttributeErrorStacktrace: `Exception in thread "main" java.lang.RuntimeException: Test exception\\n at com.example.GenerateTrace.methodB(GenerateTrace.java:13)\\n at com.example.GenerateTrace.methodA(GenerateTrace.java:9)\\n at com.example.GenerateTrace.main(GenerateTrace.java:5)`,
AttributeErrorGroupingKey: func() string {
hash := md5.New()
hash.Write([]byte("java.net.ConnectionError"))
return hex.EncodeToString(hash.Sum(nil))
}(),
},
},
} {
Expand All @@ -807,7 +827,15 @@ func TestSpanEventEnrich(t *testing.T) {
SpanEvent: tc.config,
})

assert.Empty(t, cmp.Diff(expectedAttrs, span.Events().At(0).Attributes().AsRaw()))
actual := span.Events().At(0).Attributes()
errorID, ok := actual.Get(AttributeErrorID)
assert.Equal(t, tc.errorID, ok, "error_id must be present for exception and must not be present for non-exception")
if tc.errorID {
assert.NotEmpty(t, errorID, "error_id must not be empty")
}
// Ignore error in actual diff since it is randomly generated
actual.Remove(AttributeErrorID)
assert.Empty(t, cmp.Diff(expectedAttrs, actual.AsRaw()))
})
}
}
Expand Down