OpenTelemetry Project Roadmap

OpenTelemetry is a healthy open source community, and contributors are free to propose new work streams or work on any part of the project that they desire. However, there's also value in the project focusing community members on specific bodies of work and releases, as this allows us to form a more cohesive set of capabilities (unified semantics, single implementations for each language, etc.) and to ship the project in a more impactful way (new signal types released across multiple languages at once, etc.).

This roadmap is not law, and it is not meant to be used to force people to work on specific projects -- this is an open source project and we (community members, the governance committee, etc.) are not anyone's boss. Rather, it exists to provide direction to new contributors, to the public and end users who want to know what's coming next, and to attempt to channel the bulk of our concentration and development efforts into the areas where they are most needed.

What We've Accomplished

OpenTelemetry was started in 2019 with the promise of making it easy and consistent for developers to capture distributed traces and metrics from their applications and infrastructure using SDKs, the Collector, and OTLP. Since then, we've delivered:

  • A specification for distributed traces and metrics, which defines the objects used to represent each, how to interact with them, and the types of data that are expected for different sources.
  • Semantic conventions for metadata, which define how resources and other components are represented. These are applied consistently across all signal types, which allows traces, metrics, and other signals to be processed consistently and correlated.
  • SDKs for twelve languages that allow developers to capture telemetry from their services and to create their own custom telemetry.
  • Automatic instrumentation for __ languages, which allows anyone to capture telemetry from their services without making code changes or redeploying.
  • Instrumentation libraries for thousands of pieces of software, including operating systems, container runtimes, databases, language runtimes, and libraries, all of which allow the Collector (for infrastructure and third-party applications), and SDKs and automatic instrumentation agents (for custom applications) to capture signals, capture metadata, and propagate context automatically.
  • A protocol, OTLP, for transmitting telemetry and metadata between OpenTelemetry components and to backends for processing.

Major Priorities

All of the following work streams are major areas of investment for OpenTelemetry: all of them have large groups of people already focused on them. Their priorities primarily reflect the ordering of when we expect them to be released.

These priorities are most useful when comparing them against each other. For example, P0 entries are more critical than P1 entries, which are in turn more critical than P2s. We do have some basic definitions of each priority level, however:

  • P0: We must do this within the specified time frame (if there is one), or we will fail as a project
  • P1: This is one of our most important major initiatives, and it will have a major impact on the project
  • P2: This work is important enough to be tracked and prioritized and is more important than items not on this list, but it is currently lower urgency or criticality than P0 and P1 entries

P0: Continued Investment in OpenTelemetry Artifacts

The project's top priority will always be ensuring that the capabilities and robustness of our existing artifacts (the Collector, SDKs, language agents, etc.) remain excellent. This work takes place in each language SIG and in the Collector SIG, and includes continual improvement to these components, making them even easier to use, providing and integrating more instrumentation libraries, working with the suppliers of telemetry sources to use the OpenTelemetry APIs natively, and more.

P1: Logs

OpenTelemetry established a logging SIG in mid-2020, with two goals:

  1. Providing a performant path for capturing logs from existing sources (typically text files on disk), where all captured logs have OpenTelemetry's metadata consistently applied to them.
  2. Providing a new, strongly-typed and extremely high-performance logging path for new applications, which allows logs to be authored and transmitted without being parsed from text, and which enforces the consistency of all metadata.

Much progress has been made on the first item, particularly through the donation of the Stanza logging agent to the OpenTelemetry Collector, and through the investments that the project has made in defining and stabilizing a data model and OTLP format for logs. However, much still remains before we can declare our logging work generally available:

  • We must extend the Collector's support for existing log sources to satisfy the requirements of more users.
  • Elastic donated the Elastic Common Schema to OpenTelemetry's semantic conventions. We must fully integrate this.
  • We must prototype the new logging path within our SDKs, update the designs based on feedback from the prototyping effort, build integrations for existing logging components and telemetry sources, implement these designs across all languages (as we did for traces and metrics), and test and revise these implementations from beta through GA.
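
The second goal above (a strongly-typed logging path that never round-trips through text) can be pictured with a minimal sketch. The `LogRecord` class, its field names, and the `emit` helper are hypothetical illustrations for this document; they are not the OpenTelemetry SDK API, and the IDs and values are made up.

```python
from dataclasses import dataclass, field
import time


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical structured log record (not the OpenTelemetry SDK type)."""
    body: str
    severity: str
    trace_id: str      # links the log to a distributed trace
    span_id: str       # links the log to a specific span in that trace
    resource: dict     # metadata describing the emitting service
    attributes: dict = field(default_factory=dict)
    timestamp_ns: int = field(default_factory=time.time_ns)


def emit(record: LogRecord, sink: list) -> None:
    # The record is handed over as a typed object, so nothing downstream
    # has to parse it back out of a line of text.
    sink.append(record)


sink = []
emit(
    LogRecord(
        body="payment failed",
        severity="ERROR",
        trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
        span_id="00f067aa0ba902b7",
        resource={"service.name": "checkout"},
        attributes={"payment.retry_count": 2},
    ),
    sink,
)
```

Because the trace and resource fields are required by the type, a record cannot be created without them, which is one way the consistency of metadata could be enforced.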

P1: Further Stabilizing Semantic Conventions

OpenTelemetry's consistent semantic conventions across all data types are a major source of the project's value, as they allow end-users and observability systems to both correlate related signals and set expectations about the metadata that should be present on telemetry captured from specific sources or types of interactions. For example, OpenTelemetry's semantics describe the expected metadata for traces, metrics, and logs captured as part of an HTTP 4xx response.
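
As a rough illustration of why shared conventions matter, the sketch below attaches the same attribute keys to a span, a metric, and a log for one HTTP 404 response, then selects all three with a single lookup. The `make_signal` and `correlate` helpers are hypothetical, the signals are plain dictionaries rather than SDK objects, and the attribute names are assumed to follow OpenTelemetry's HTTP semantic conventions; the roadmap itself does not list them.

```python
# Attribute keys assumed to follow OpenTelemetry's HTTP semantic conventions.
HTTP_404_ATTRIBUTES = {
    "http.request.method": "GET",
    "http.response.status_code": 404,
    "url.path": "/orders/42",
}
RESOURCE = {"service.name": "orders"}


def make_signal(kind, name, attributes, resource):
    """Hypothetical helper: build a plain-dict stand-in for a telemetry signal."""
    return {
        "kind": kind,
        "name": name,
        "attributes": dict(attributes),
        "resource": dict(resource),
    }


span = make_signal("span", "GET /orders/{id}", HTTP_404_ATTRIBUTES, RESOURCE)
metric = make_signal("metric", "http.server.request.duration", HTTP_404_ATTRIBUTES, RESOURCE)
log = make_signal("log", "order not found", HTTP_404_ATTRIBUTES, RESOURCE)


def correlate(signals, key, value):
    """Return the kinds of all signals whose attribute `key` equals `value`."""
    return [s["kind"] for s in signals if s["attributes"].get(key) == value]


matching = correlate([span, metric, log], "http.response.status_code", 404)
```

Since every signal uses the same key for the status code, one query finds the trace, the metric, and the log for the failing request without per-signal translation.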

However, we still need to define the semantic conventions for more scenarios, so that instrumentation authors can release stable instrumentation libraries and so that end-users and observability systems can take firmer dependencies on OpenTelemetry's metadata.

P2: Client Instrumentation (RUM)

We want OpenTelemetry to provide true end-to-end visibility to service owners, covering both end-to-end latency (including client app and internet latency) and the chain of backend service events and infrastructure-side performance stats that result from a single user interaction. This requires OpenTelemetry to start supporting webpage JS, mobile applications, and desktop applications. OpenTelemetry JS has technically supported capturing spans from web browsers since its first releases; however, this behavior was mostly unspecified, and there was no equivalent functionality for other types of client applications like those on Android, iOS, or Windows.

In late 2021, the Client Instrumentation SIG (often called the RUM SIG) was established. It seeks to specify client instrumentation behavior so that both the data captured from different types of client applications and the developer-facing telemetry interfaces in those applications are consistent. This SIG is currently completing its first round of spec work, which will need to be implemented by the JS, Swift, Java, and other SIGs once it is complete.

P2: Profiling

Distributed profiling has been a long-standing topic of discussion within OpenTelemetry, and contributors to other profiling projects have advocated for it to be added to OpenTelemetry as an additional signal type. In May of 2022, this work commenced within OpenTelemetry's profiling SIG.

Sampled heap and CPU profiles will allow OpenTelemetry to extend end-users' visibility to the performance of their actual code. While other profiling solutions allow this kind of inspection today, few are able to properly correlate profiles with application and infrastructure resource metadata, and even fewer are able to correlate profiling telemetry with distributed traces or other signals. Adding this to OpenTelemetry will allow analysis solutions and end-users to find instances of poor performance between services and then immediately chase these down to their root cause within code.

P2: OpenTelemetry Demo

OpenTelemetry launched a community demo SIG in May 2022, which will provide sample applications that demonstrate OpenTelemetry's capabilities to prospective end users, and also allow the community to better perform automated testing of OpenTelemetry components. The project's first release occurred in October 2022, and we will be further investing in the demo throughout 2023 and beyond.

P2: OpenTelemetry Control Plane

Since OpenTelemetry's inception, end-users and vendors have expressed a desire (a) to understand which SDKs, language agents, and Collectors are deployed within their environment (along with their status), and (b) to be able to change the configuration of these artifacts or possibly even update agent binaries.

Specification work is already underway to address both of these needs, and the OpAMP SIG has already produced a specification for OpAMP, the protocol that will drive these interactions. Over time, the SIGs that develop various OpenTelemetry artifacts will need to implement OpAMP to enable these scenarios.

Backlog

These topics have been discussed in the past, but have either been deliberately prioritized beneath the project's major ongoing priorities, or have yet to have a large group of contributors form up behind them.

  • eBPF instrumentation (in progress for the Collector via a small group of contributors).
  • Production debugging / dynamic log injection.
  • Automatically configuring the Collector to capture data from more sources.
  • Capturing telemetry (with appropriate semantics) from CI / CD systems.
  • Semantics for describing cloud spend and the environmental impact of cloud resource usage.
  • Enhanced user / organization extensibility of OpenTelemetry's semantic conventions.