Cramer's Experience #64

smeubank · 2024-02-28T13:26:08Z

Cramer's Experience

The Setup

Four services, all in TypeScript:

@peated/api - a fastify-based web service
@peated/web - a Remix application
@peated/worker - a faktory-based async worker (think Celery)
@peated/cli - manual command line driven interactions (think a backfill)

Running @sentry/node-experimental wherever possible, otherwise generally latest version of relevant SDKs.

The Objective

We want perfect instrumentation. That means:

Traces start in three locations:
- @peated/web (the Remix server hydration)
- @peated/cli
- @peated/worker (for things like cron jobs)
Traces have OTel-specified standards and semantics, and are complete end to end, with the correct root
All core fields have the correct data (such as the transaction field in errors)
All errors have traces (unless Sentry otherwise hard-prevents us), and spans create an actionable debug path for errors.
A single project, as we're modeling a micro service architecture, using a single DSN and a service tag to replicate upcoming changes.

The Numerous Issues

Traces are really fucking hard to instrument

We rely way too much on magical instrumentation, and while I know this is changing, it hasn't changed fast enough.

I'm using an experimental SDK because not only is it our future direction, but the auto instrumentation is superior.
Its not obvious how to actually use our SDK in many cases, because I can't cleanly always use OTel's documentation. For example, I don't ever create 'tracers', but instead call Sentry's startSpan. Our docs contain completely separate sections on how to actually implement tracing, and none of them are complete.

🎯 Should we expose more of OTel's API?

Sentry's implementation sometimes looks similar to OTel but isn't compatible. For example, startSpan has different args than OTel's implementation. Additionally I believe OTel flips the meaning of startSpan. We should change our terminology to avoid confusing developers.

🎯 Similar to above: are we moving away from OTel too much?

startSpan requires a callback, but not all code is callback driven. For example, my CLI uses commander, which uses a standard hook pattern (e.g. beforeCommand and afterCommand). I can't easily use a callback to instrument that. If I try to use startInactiveSpan it actually doesn't associate with the current parent (at least per our docs) so that has even more issues.

🎯 Again is this something we need to look at the OTel API and enable? Do we have a similar use case with our tools?

// this does not work correctly at all because of the missing parent
program
  .name("peated")
  .description("CLI for assisting with Peated")
  .hook("preSubcommand", async (thisCommand, subcommand) => {
    (thisCommand as MutatedCommand).span = Sentry.startInactiveSpan({
      op: "command",
      name: `commander.${subcommand.name()}`,
    });
  })
  .hook("postAction", async (thisCommand) => {
    await shutdownClient();
    if ((thisCommand as MutatedCommand).span) {
      (thisCommand as MutatedCommand).span.end();
    }
  });```

5) Continuing traces is needlessly complicated. OTel's APIs aren't any better here. We need obvious APIs to get the current traceparent, and to pass in a traceparent. These should be available as high level interfaces and not have complex abstractions.

For example, to grab the current context:

```typescript
// pull out traceparent to forward to faktory job
const activeContext = {};
propagation.inject(context.active(), activeContext);

await client.job(jobName, args, { traceContext: activeContext }).push();

And injecting it is even worse:

const activeContext = propagation.extract(
    context.active(),
    args.length > 1
    ? (
        args[1] as {
            traceContext: {
            traceId: string;
            baggage: any;
            };
        }
        ).traceContext
    : {},
);

return context.with(activeContext, async () => {
    return await Sentry.startSpan(
        //...

Using op and name, I believe per OTel's recs, does not play nicely with Sentry. We group on name and ignore op, which means we create transaction summaries that have mixed operations and makes things less actionable.

🎯 can we get reference to specifics of OTel here?
🎯 this leads to txn names and breaking error links and display name in Performance, detail below

    {
      op: "publish",
      name: `faktory.${jobName.toLowerCase()}`,
    },

Transaction is deprecated and y'all keep ignoring that we have an error monitoring product

scope.setTransaction binds event.transaction, which exists for very explicit reasons. It makes it much faster to triage an error as a human when the culprit is shown as myCoolRouteName than it does with openai.ts(lib:foo). Additionally that same transaction gives me reasonable aggregates if it happens on different routes, and in an ideal world, maps up to the same name as the span (meaning I can cross reference data).

https://peated.sentry.io/issues/5007350914/?query=is%3Aunresolved&referrer=issue-stream&stream_index=17

There are additional problems here that have continued to exist for fucking ages, such as needing to call two different APIs to update the current transaction name, and now we have deprecated setTransaction without implementing a valid replacement in the experimental SDK.

🎯 do we just need a setOpName or setTransactionName itself, i think we have a category list of APIs and names

In general, transaction MUST exist on errors, and almost certainly should be tightly coupled to the span (in which the span should probably always exists, rather than having two discrete code paths).

SourceMaps still suck

I can't even be asked to try to spend more time to set these up because its never fun and the debug timelime often takes hours. Not sure if I have any feedback here other than this is very unfun and it'd be a lot easier if I could just publish .map files and Sentry can pick up the slack.

Trace explorer is useless

📈 <3

Half the time I can't get our instrumentation right, and it requires perfection instrumentation.
Even if I get instrumentation right, I will likely get sampled, and again, the trace explorer requires perfect instrumentation

Its broken by design, and more importantly, it makes me think our product doesn't work. If I were a real customer doing a POC, I would have used a competitor by this point because of how broken the product looks.

I need spans to debug JavaScript

Stack traces are useless more and more frequently, and spans give you the picture of inputs. That is, a valid trace and span tree tells me how an error happened, and is almost required to be able to debug modern javascript errors.

🐛 why are they useless?

Except the span tree is buried in a trace explorer which is empty half the time or otherwise broken, and then error page itself is filled with upsells instead of just rendering the information that helps me debug the error.

🐛 issue details page declutter project 🙏

Replays are not the answer, and should be reserved for outlier conditions that are harder to debug (let alone, they are not useful on backend errors in many situations).

My feed is filled with performance issues

I literally only care about errors 99% of the time. I want to seek out performance issues, not be forced to triage them non-stop.

🐛 perf issue noise

trpc errors suck

🐛 otel gives us intsrumention but just performance, meaning SDK data might not be best for errors

https://peated.sentry.io/issues/4630767004/?query=is%3Aunresolved&referrer=issue-stream&stream_index=2

TRPCClientError
[
  {
    "received": "United States",
    "code": "invalid_enum_value",
    "options": [
      "Afghanistan",
      "Albania",
      "Algeria",
      "American Samoa",
      "Andorra",
      "Angola",
      "Anguilla",
      "Antarctica",
      "Ant...

So do Zod errors and a bunch of other shit.

🐛 ``

https://peated.sentry.io/issues/5007350836/?query=is%3Aunresolved&referrer=issue-stream&stream_index=10

Breadcrumbs are useless in so many contexts

🐛 i also always found breadcrumbs to be inconsistent and repetitive but customers like them supposedly

They end up with a loop of information because they're a buffer and not a tree. They need to have a hard sit down on why they continue to exist when a span tree is often the right solution to diagnostics.

For example:

for item in list:
  do_a_thing_that_gracefully_fails_and_logs()

Each of those will pollute breadcrumbs even tho most crumbs are only relevant to the specific function call instance.

wtf is Object.<anonymous>

🐛 is there anything we can do with anonymous? Can we inform the user better? Like with figma, detect and inform that is a platform issue

https://peated.sentry.io/issues/4546294867/?query=is%3Aunresolved&referrer=issue-stream&stream_index=13

wtf is Object captured as exception with keys: data, error, internal, status

📈 data issue or product inform issue?

https://peated.sentry.io/issues/5010841858/?query=is%3Aunresolved&referrer=issue-stream&stream_index=14

Transactions are shit all over the place even with OTel's better instrumentation

🎯 need further investigation data quality

https://peated.sentry.io/issues/5006626006/?query=is%3Aunresolved&referrer=issue-stream&statsPeriod=14d&stream_index=0

This should somehow have a route name or url or something, and it should have it because of OTel's superior instrumentation, but seemingly I'm missing a trace and I've got an absolute trash error.

SDK API 🎯

Give feedback

Our docs are incomplete, unclear what is true from OTel and what you CAN do with our API 🎯 Should we expose more of OTel's API? And can we adopt a more exhaustive approach to documenting ours?
we do not maintain the same args as OTel while some of our APIs are same/similar startSpan 🎯 Similar to above: are we moving away from OTel too much?
callbacks vs standard hook patterns problem 🎯 Again is this something we need to look at the OTel API and enable? Do we have a similar use case with our tools?
Using op and name, I believe per OTel's recs, does not play nicely with Sentry. We group on name and ignore op, which means we create transaction summaries that have mixed operations and makes things less actionable. 🎯 can we get reference to specifics of OTel here? 🎯 this leads to txn names and breaking error links and display name in Performance, detail below
How to update txn name? 🎯 do we just need a setOpName or setTransactionName itself, i think we have a category list of APIs and names
Transactions are shit all over the place even with OTel's better instrumentation** 🎯 need further investigation data quality
Options

Issues Platform

Give feedback

Stack traces are useless more often 🐛 why are they useless? I think this comments get's into if STs are still relevnat or more often we NEED spans/traces
error page itself is filled with upsells instead of just rendering the information that helps me debug the error. 🐛 issue details page declutter project 🙏
My feed is filled with performance issues 🐛 perf issue noise
trpc errors suck 🐛 otel gives us intsrumention but just performance, meaning SDK data might not be best for errors
Zod errors and a bunch of other shit. 🐛 ``
Breadcrumbs are useless in so many contexts 🐛 i also always found breadcrumbs to be inconsistent and repetitive but customers like them supposedly
wtf is Object. 🐛 is there anything we can do with anonymous? Can we inform the user better? Like with figma, detect and inform that is a platform issue
Options

The text was updated successfully, but these errors were encountered:

stephanie-anderson added the Team: Web Frontend label Mar 11, 2024

stephanie-anderson modified the milestones: [1] Discovery, Tracing Jul 10, 2024

smeubank closed this as completed Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cramer's Experience #64

Cramer's Experience #64

smeubank commented Feb 28, 2024 •

edited

Loading

SDK API 🎯

Issues Platform

Cramer's Experience #64

Cramer's Experience #64

Comments

smeubank commented Feb 28, 2024 • edited Loading

The Setup

The Objective

The Numerous Issues

SDK API 🎯

Issues Platform

smeubank commented Feb 28, 2024 •

edited

Loading