Semantic conventions for batch jobs #1640

mateuszrzeszutek · 2021-01-15T11:30:57Z

What are you trying to achieve?

I want to introduce some semantic conventions for batch jobs, since there are currently no conventions around that. I'm mostly interested in instrumenting Spring Batch applications.
There is already a Java spec, JSR-352, that describes a batch job API, and I was thinking of basing the trace spec on it: the concepts of job/step/chunk seem generic and language-agnostic enough (and there doesn't seem to be any other batch job specification).
Before diving into details, is there a place for this in the trace semantic conventions?


fbogsany · 2021-01-15T15:49:22Z

Good timing: we discussed this in the Ruby SIG meeting earlier this week. Background job queues are common in Ruby web applications, with the dominant implementations being Resque and Sidekiq. So far, we've instrumented those systems using the messaging semantic conventions, but they're really not a great fit for background/batch jobs. See open-telemetry/opentelemetry-ruby#547 (comment)

Oberon00 · 2021-01-15T16:25:58Z

Wouldn't this be a case where you just use an INTERNAL span without any attributes? Not every span needs to conform to a semantic convention. What information do you want to have on batch job spans?

fbogsany · 2021-01-15T16:59:20Z

> the concepts of job/step/chunk

^ this bit seems useful.

For Ruby batch job systems, relative to message systems:

- "receiver" spans are not particularly useful;
- the job class name is usually much more interesting as part of the span name than the queue name (e.g. MyJob enqueue vs default send);
- the ... enqueue suffix is more in keeping with the domain language than ... send;
- destination_kind is likely always queue, so probably isn't an interesting thing to specify (and certainly shouldn't be required).
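
To make the naming difference concrete, here is a minimal Python sketch. The two helper functions are hypothetical and not part of any OpenTelemetry API; the payload uses the usual Sidekiq keys (class, queue, jid) with made-up values.

```python
def messaging_span_name(queue: str, operation: str) -> str:
    """Messaging-convention style: '<destination> <operation>'."""
    return f"{queue} {operation}"


def job_span_name(job_class: str, operation: str) -> str:
    """Job-oriented style: '<job class> <operation>'."""
    return f"{job_class} {operation}"


# Hypothetical Sidekiq-like job payload; the values are invented for illustration.
job = {"class": "MyJob", "queue": "default", "jid": "abc123"}

print(messaging_span_name(job["queue"], "send"))   # default send
print(job_span_name(job["class"], "enqueue"))      # MyJob enqueue
```

With the job-oriented name, traces for different job classes on the same queue no longer collapse into a single span name.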

Oberon00 · 2021-01-15T17:02:10Z

For the "job class name": There are semantic conventions for code locations, see https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/span-general.md#source-code-attributes.

fbogsany · 2021-01-15T17:04:57Z

> For the "job class name": There are semantic conventions for code locations, see https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/span-general.md#source-code-attributes.

That's useful, but doesn't meet the expectations of users re: span names.

mateuszrzeszutek · 2021-01-15T17:17:46Z

> Good timing: we discussed this in the Ruby SIG meeting earlier this week.

Nice! I'll take a look at sidekiq & your instrumentation and try to extract common parts - at first it looks like the job span would be the only common thing, but maybe there's more. And there's the whole queue logic that's not there in Spring Batch/JSR-352.

> Wouldn't this be a case where you just use an INTERNAL span without any attributes? Not every span needs to conform to a semantic convention. What information do you want to have on batch job spans?

True, JSR-352 does not expose much information that we could store as attributes, but that does not mean there is none: there's the exit status of a job/step (an arbitrary string) and the job/step execution id (jid in Sidekiq?). And, at least in JSR-352/Spring Batch, job steps expose several metrics (read count, write count, ...) that could be used.
And probably the most important piece of information is the span name.
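
As a rough illustration of what such attributes could look like, here is a small Python sketch. The StepExecution class is a hypothetical stand-in for a JSR-352/Spring Batch step execution, and every attribute key below is an assumption made for this example, not an agreed convention.

```python
from dataclasses import dataclass


@dataclass
class StepExecution:
    """Hypothetical stand-in for a JSR-352/Spring Batch step execution."""
    execution_id: int
    step_name: str
    exit_status: str   # arbitrary, user-defined string
    read_count: int
    write_count: int


def step_attributes(execution: StepExecution) -> dict:
    """Map a step execution to span attributes (keys are illustrative only)."""
    return {
        "batch.step.name": execution.step_name,
        "batch.step.id": execution.execution_id,
        "batch.step.exit_status": execution.exit_status,
        "batch.step.read_count": execution.read_count,
        "batch.step.write_count": execution.write_count,
    }


attrs = step_attributes(StepExecution(7, "loadCsv", "COMPLETED", 100, 98))
for key, value in attrs.items():
    print(f"{key} = {value}")
```

The counters might arguably fit better as metrics than as span attributes; the sketch only shows that the data is available at the end of a step.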

mateuszrzeszutek · 2021-01-19T14:32:39Z

This is what I had in mind for my use case (Spring Batch):

1. JobLauncher/JobOperator/enqueue span that marks the (possibly) asynchronous start of a batch job processing.
   Name: Start Job <batch.job.name>
   Attributes:
   - batch.job.name: the name of the job (Spring Batch), or the job class name (resque/sidekiq), or the task name (celery);
   - batch.job.id: the job execution id (Spring Batch), job['jid'] in case of sidekiq.
2. Job span that wraps the whole processing of a batch job.
   Name: Job <batch.job.name>
   Attributes:
   - batch.job.name;
   - batch.job.id;
   - batch.job.exit_status: a plain string containing the exit status of a job. JSR-352/Spring Batch jobs (and steps) can return an arbitrary user-defined string as the exit status (e.g. error message saying why the job has failed). Not sure how this translates to Ruby/Python frameworks.
3. Step span that wraps the execution of a step. In JSR-352, a job consists of a series of steps: a step "encapsulates an independent, sequential phase of a batch job". In the simplest possible case, a job that only does one thing has only one step.
   Name: Job <batch.job.name>.<batch.step.name>
   Attributes:
   - batch.step.name: the name of the step;
   - batch.step.id: the step execution id;
   - batch.step.exit_status: a plain string containing the exit status of a step. The batch job may take action depending on the result of a step, e.g. send an email and stop further processing in case of failure.
4. Chunk span. In JSR-352, steps that process numerous items (e.g. read a CSV file and write records into DB) may be split into chunks, which are pretty much equivalent to database transactions. For example, a step that writes thousands of records may be configured to commit every 50 items. There are no attributes that can be set on this span, but an exception can be recorded here.
5. Item read, process and write spans. This is the lowest level we want to instrument. I think that this image and pseudocode fragment describes what happens here the best. Again, there are no attributes here, but it's worth having spans on the item level because of the exception visibility and better trace structure.
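
The span hierarchy described above can be sketched in Python roughly as follows. This is a toy recorder, not the OpenTelemetry API: the span() helper, the run_job/run_step functions, and the Chunk/ItemRead/ItemProcess/ItemWrite span names are assumptions made for illustration, and the chunk loop is a simplified version of chunk-oriented processing.

```python
from contextlib import contextmanager

recorded = []   # (depth, span name, attributes), in start order
_depth = 0


@contextmanager
def span(name, attributes=None):
    """Toy stand-in for a tracer: records the span and tracks nesting depth."""
    global _depth
    recorded.append((_depth, name, attributes or {}))
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1


def run_step(job_name, step_name, items, commit_interval):
    with span(f"Job {job_name}.{step_name}", {"batch.step.name": step_name}):
        # One chunk per commit_interval items, like one transaction each.
        for start in range(0, len(items), commit_interval):
            with span("Chunk"):
                processed = []
                for item in items[start:start + commit_interval]:
                    with span("ItemRead"):
                        value = item
                    with span("ItemProcess"):
                        processed.append(value * 2)
                with span("ItemWrite"):
                    pass  # write the processed chunk, then "commit"


def run_job(job_name, job_id, steps, commit_interval=2):
    attributes = {"batch.job.name": job_name, "batch.job.id": job_id}
    with span(f"Job {job_name}", attributes):
        for step_name, items in steps:
            run_step(job_name, step_name, items, commit_interval)


run_job("importUsers", 42, [("loadCsv", [1, 2, 3])])
for depth, name, _ in recorded:
    print("  " * depth + name)
```

Running it prints an indented tree with the job span at the root, the step span below it, and one Chunk span per commit interval, each containing the item-level spans.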

@fbogsany I believe that the first two spans that I've briefly described here match your use case with both Ruby libs that you've mentioned. I'm not sure about the other three, sidekiq/resque do not seem to have this sort of rigid job structure that spring batch has.

fbogsany · 2021-01-19T19:54:24Z

> I'm not sure about the other three, sidekiq/resque do not seem to have this sort of rigid job structure that spring batch has.

They don't have the Step span, but at Shopify, for example, we have a higher-level job execution framework that provides an equivalent of "chunks", so the Chunk span is relevant there. I'm not sure about the Item read, process and write spans.

weyert · 2021-04-06T15:45:38Z

Yeah, it doesn't look like Hangfire (.NET) [0] and Bree (Node.js) [1] have the same level of structure as Spring Batch (which I've never used); they look more simplified.

[0] https://docs.hangfire.io/en/latest/background-processing/processing-background-jobs.html
[1] https://jobscheduler.net/

Kiiv · 2024-11-28T15:26:16Z

Hi,

It seems that there are now some properties to enable instrumentation of Spring Batch: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/spring/README.md#settings

Should it also work with a JSR-352 implementation? I have some batch jobs implemented on top of JSR-352 running in an OpenLiberty server (no Spring at all). I tried enabling "otel.instrumentation.spring-batch.item.enabled", but nothing appears. I just want to be sure that I didn't miss something.

trask · 2024-11-30T02:50:53Z

> Should it also work with a JSR-352 implementation?

hi @Kiiv, I'd suggest asking your question in https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues or #otel-java

I noticed developers adding their own attributes to this namespace without going through the specification. We need to regulate this namespace through the specification, just like we do it for other semantic conventions.

mateuszrzeszutek mentioned this issue Jan 15, 2021

Instrument spring-batch: item-level spans open-telemetry/opentelemetry-java-instrumentation#2047

Merged

fbogsany mentioned this issue Jan 15, 2021

Adjust Sidekiq middlewares to match semantic conventions open-telemetry/opentelemetry-ruby#547

Merged

iNikem mentioned this issue Mar 29, 2021

[RFC] [Semantic Convention] "Job" traces open-telemetry/opentelemetry-specification#1582

Closed

alanwest mentioned this issue Jul 7, 2022

Hangfire jobs not considered transactions open-telemetry/opentelemetry-dotnet-contrib#489

Open


vitor-pinto-maersk mentioned this issue Oct 26, 2022

[Instrumentation.Hangfire] Enable Public API analyzer open-telemetry/opentelemetry-dotnet-contrib#734

Merged

maciej-szlosarczyk mentioned this issue Dec 21, 2022

MVBASE-1652 | Add Oban tracing mindvalley/mv-opentelemetry#39

Merged

danielgblanco transferred this issue from open-telemetry/opentelemetry-specification Dec 2, 2024
