Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTLP tracing #14125

Merged
merged 10 commits into from
Oct 10, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
254 changes: 218 additions & 36 deletions Cargo.lock

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions crates/sui-core/src/authority/authority_per_epoch_store.rs
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ use sui_types::transaction::{
AuthenticatorStateUpdate, CertifiedTransaction, SenderSignedData, SharedInputObject,
TransactionDataAPI, VerifiedCertificate, VerifiedSignedTransaction,
};
use tracing::{debug, error, error_span, info, trace, warn};
use tracing::{debug, error, info, instrument, trace, warn};
use typed_store::rocks::{
default_db_options, DBBatch, DBMap, DBOptions, MetricConf, TypedStoreError,
};
Expand Down Expand Up @@ -498,6 +498,7 @@ impl AuthorityEpochTables {
pub(crate) const MUTEX_TABLE_SIZE: usize = 1024;

impl AuthorityPerEpochStore {
#[instrument(name = "AuthorityPerEpochStore::new", level = "error", skip_all, fields(epoch = committee.epoch))]
pub fn new(
name: AuthorityName,
committee: Arc<Committee>,
Expand All @@ -514,9 +515,6 @@ impl AuthorityPerEpochStore {
let current_time = Instant::now();
let epoch_id = committee.epoch;

let span = error_span!("AuthorityPerEpochStore::new", ?epoch_id);
let _guard = span.enter();

let tables = AuthorityEpochTables::open(epoch_id, parent_path, db_options.clone());
let end_of_publish =
StakeAggregator::from_iter(committee.clone(), tables.end_of_publish.unbounded_iter());
Expand Down Expand Up @@ -798,6 +796,7 @@ impl AuthorityPerEpochStore {
.map(|t| t.into()))
}

#[instrument(level = "trace", skip_all)]
pub fn insert_tx_cert_and_effects_signature(
&self,
tx_digest: &TransactionDigest,
Expand Down
1 change: 1 addition & 0 deletions crates/sui-core/src/authority/authority_store.rs
Original file line number Diff line number Diff line change
Expand Up @@ -986,6 +986,7 @@ impl AuthorityStore {
///
/// Internally it checks that all locks for active inputs are at the correct
/// version, and then writes objects, certificates, parents and clean up locks atomically.
#[instrument(level = "debug", skip_all)]
pub async fn update_state(
&self,
inner_temporary_store: InnerTemporaryStore,
Expand Down
1 change: 1 addition & 0 deletions crates/sui-node/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ snap.workspace = true
git-version.workspace = true
const-str.workspace = true
url.workspace = true
humantime.workspace = true

sui-archival.workspace = true
sui-tls.workspace = true
Expand Down
64 changes: 54 additions & 10 deletions crates/sui-node/src/admin.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,12 @@ use axum::{
routing::{get, post},
Router,
};
use humantime::parse_duration;
use serde::Deserialize;
use std::net::{IpAddr, Ipv4Addr, SocketAddr};
use std::sync::Arc;
use sui_types::error::SuiError;
use telemetry_subscribers::FilterHandle;
use telemetry_subscribers::Filters;
use tracing::info;

// Example commands:
Expand All @@ -37,8 +38,19 @@ use tracing::info;
// View the node config (private keys will be masked):
//
// $ curl 'http://127.0.0.1:1337/node-config'
//
// Set a time-limited tracing config. After the duration expires, tracing will be disabled
// automatically.
//
// $ curl -X POST 'http://127.0.0.1:1337/enable-tracing?filter=info&duration=10s'
//
// Reset tracing to the TRACE_FILTER env var.
//
// $ curl -X POST 'http://127.0.0.1:1337/reset-tracing'

const LOGGING_ROUTE: &str = "/logging";
const TRACING_ROUTE: &str = "/enable-tracing";
const TRACING_RESET_ROUTE: &str = "/reset-tracing";
const SET_BUFFER_STAKE_ROUTE: &str = "/set-override-buffer-stake";
const CLEAR_BUFFER_STAKE_ROUTE: &str = "/clear-override-buffer-stake";
const FORCE_CLOSE_EPOCH: &str = "/force-close-epoch";
Expand All @@ -47,16 +59,13 @@ const NODE_CONFIG: &str = "/node-config";

struct AppState {
node: Arc<SuiNode>,
filter_handle: FilterHandle,
filters: Filters,
}

pub async fn run_admin_server(node: Arc<SuiNode>, port: u16, filter_handle: FilterHandle) {
let filter = filter_handle.get().unwrap();
pub async fn run_admin_server(node: Arc<SuiNode>, port: u16, filters: Filters) {
let filter = filters.get_log().unwrap();

let app_state = AppState {
node,
filter_handle,
};
let app_state = AppState { node, filters };

let app = Router::new()
.route(LOGGING_ROUTE, get(get_filter))
Expand All @@ -72,6 +81,8 @@ pub async fn run_admin_server(node: Arc<SuiNode>, port: u16, filter_handle: Filt
post(clear_override_protocol_upgrade_buffer_stake),
)
.route(FORCE_CLOSE_EPOCH, post(force_close_epoch))
.route(TRACING_ROUTE, post(enable_tracing))
.route(TRACING_RESET_ROUTE, post(reset_tracing))
.with_state(Arc::new(app_state));

let socket_address = SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), port);
Expand All @@ -87,8 +98,41 @@ pub async fn run_admin_server(node: Arc<SuiNode>, port: u16, filter_handle: Filt
.unwrap()
}

#[derive(Deserialize)]
struct EnableTracing {
filter: String,
duration: String,
}
Comment on lines +102 to +105
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to disable it too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It disables automatically after the duration expires. I guess it wouldn't hurt to be able to disable it actively in case you set too long a duration by accident.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a disable route


async fn enable_tracing(
State(state): State<Arc<AppState>>,
query: Query<EnableTracing>,
) -> (StatusCode, String) {
let Query(EnableTracing { filter, duration }) = query;

let Ok(duration) = parse_duration(&duration) else {
return (StatusCode::BAD_REQUEST, "invalid duration".into());
};

match state.filters.update_trace(filter, duration) {
Ok(()) => (
StatusCode::OK,
format!("tracing enabled for {:?}", duration),
),
Err(err) => (StatusCode::INTERNAL_SERVER_ERROR, err.to_string()),
}
}

async fn reset_tracing(State(state): State<Arc<AppState>>) -> (StatusCode, String) {
state.filters.reset_trace();
(
StatusCode::OK,
"tracing filter reset to TRACE_FILTER env var".into(),
)
}

async fn get_filter(State(state): State<Arc<AppState>>) -> (StatusCode, String) {
match state.filter_handle.get() {
match state.filters.get_log() {
Ok(filter) => (StatusCode::OK, filter),
Err(err) => (StatusCode::INTERNAL_SERVER_ERROR, err.to_string()),
}
Expand All @@ -98,7 +142,7 @@ async fn set_filter(
State(state): State<Arc<AppState>>,
new_filter: String,
) -> (StatusCode, String) {
match state.filter_handle.update(&new_filter) {
match state.filters.update_log(&new_filter) {
Ok(()) => {
info!(filter =% new_filter, "Log filter updated");
(StatusCode::OK, "".into())
Expand Down
8 changes: 4 additions & 4 deletions crates/sui-node/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -58,10 +58,8 @@ fn main() {
config.supported_protocol_versions = Some(SupportedProtocolVersions::SYSTEM_DEFAULT);

let runtimes = SuiRuntimes::new(&config);
let registry_service = {
let _enter = runtimes.metrics.enter();
mysten_metrics::start_prometheus_server(config.metrics_address)
};
let metrics_rt = runtimes.metrics.enter();
let registry_service = mysten_metrics::start_prometheus_server(config.metrics_address);
let prometheus_registry = registry_service.default_registry();

// Initialize logging
Expand All @@ -73,6 +71,8 @@ fn main() {
.with_prom_registry(&prometheus_registry)
.init();

drop(metrics_rt);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - otherwise when we initialize TelemetryConfig, there is no active tokio runtime, so it can't install the batch exporter.

info!("Sui Node version: {VERSION}");
info!(
"Supported protocol versions: {:?}",
Expand Down
26 changes: 19 additions & 7 deletions crates/sui-swarm/src/memory/container.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ use std::thread;
use sui_config::NodeConfig;
use sui_node::{SuiNode, SuiNodeHandle};
use sui_types::crypto::{AuthorityPublicKeyBytes, KeypairTraits};
use telemetry_subscribers::get_global_telemetry_config;
use tracing::{info, trace};

use super::node::RuntimeType;
Expand Down Expand Up @@ -44,12 +45,21 @@ impl Container {
let (cancel_sender, cancel_receiver) = tokio::sync::oneshot::channel();

let thread = thread::spawn(move || {
let span = tracing::span!(
tracing::Level::INFO,
"node",
name =% AuthorityPublicKeyBytes::from(config.protocol_key_pair().public()).concise(),
);
let _guard = span.enter();
let span = if get_global_telemetry_config()
.map(|c| c.enable_otlp_tracing)
.unwrap_or(false)
{
// we cannot have long-lived root spans when exporting trace data to otlp
mystenmark marked this conversation as resolved.
Show resolved Hide resolved
None
} else {
Some(tracing::span!(
tracing::Level::INFO,
"node",
name =% AuthorityPublicKeyBytes::from(config.protocol_key_pair().public()).concise(),
))
};

let _guard = span.as_ref().map(|span| span.enter());

let mut builder = match runtime {
RuntimeType::SingleThreaded => tokio::runtime::Builder::new_current_thread(),
Expand All @@ -63,7 +73,9 @@ impl Container {
builder
.on_thread_start(move || {
SPAN.with(|maybe_entered_span| {
*maybe_entered_span.borrow_mut() = Some(span.clone().entered());
if let Some(span) = &span {
*maybe_entered_span.borrow_mut() = Some(span.clone().entered());
}
});
})
.on_thread_stop(|| {
Expand Down
7 changes: 6 additions & 1 deletion crates/telemetry-subscribers/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,15 @@ tracing.workspace = true
tracing-appender.workspace = true
tracing-subscriber.workspace = true
workspace-hack.workspace = true
opentelemetry = { version = "0.20.0", features = ["rt-tokio"], optional = true }
opentelemetry-otlp = { version = "0.13.0", features = ["grpc-tonic"], optional = true }
tracing-opentelemetry = { version = "0.21.0", optional = true }
tokio.workspace = true

[features]
default = []
default = ["otlp"]
tokio-console = ["console-subscriber"]
otlp = ["tracing-opentelemetry", "opentelemetry", "opentelemetry-otlp"]

[dev-dependencies]
camino.workspace = true
35 changes: 17 additions & 18 deletions crates/telemetry-subscribers/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ You can also run the example and see output in ANSI color:
cargo run --example easy-init

## Features
- `jaeger` - this feature is enabled by default as it enables jaeger tracing
- `otlp` - this feature is enabled by default as it enables otlp tracing
- `json` - Bunyan formatter - JSON log output, optional
- `tokio-console` - [Tokio-console](https://github.com/tokio-rs/console) subscriber, optional

Expand All @@ -45,21 +45,27 @@ This output can easily be fed to backends such as ElasticSearch for indexing, al

NOTE: JSON output requires the `json` crate feature to be enabled.

### Jaeger (seeing distributed traces)
### OTLP

To see nested spans visualized with [Jaeger](https://www.jaegertracing.io), do the following:
#### Tracing locally:

1. Run this to get a local Jaeger container: `docker run -d -p6831:6831/udp -p6832:6832/udp -p16686:16686 jaegertracing/all-in-one:latest`
2. Set `enable_jaeger` config setting to true or set `TOKIO_JAEGER` env var
3. Run your app
4. Browse to `http://localhost:16686/` and select the service you configured using `service_name`
1. In `docker/grafana-local` run `docker compose up` to start a local grafana instance.
2. Set `TRACE_FILTER=<filter expression>` - for local use `TRACE_FILTER=sui=trace,info` is a good place to start.
3. Start the sui-node or other process.
4. Go to http://localhost:3000 (or [http://localhost:3000/ with traces already filtered to sui-node](http://localhost:3000/explore?panes=%7B%22iHz%22:%7B%22datasource%22:%22tempo%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22tempo%22,%22uid%22:%22tempo%22%7D,%22queryType%22:%22traceqlSearch%22,%22limit%22:20,%22filters%22:%5B%7B%22id%22:%22service-name%22,%22tag%22:%22service.name%22,%22operator%22:%22%3D%22,%22scope%22:%22resource%22,%22value%22:%5B%22sui-node%22%5D,%22valueType%22:%22string%22%7D,%7B%22id%22:%22span-name%22,%22tag%22:%22name%22,%22operator%22:%22%3D%22,%22scope%22:%22span%22,%22value%22:%5B%5D,%22valueType%22:%22string%22%7D,%7B%22id%22:%224f3681c5%22,%22operator%22:%22%3D%22,%22scope%22:%22span%22%7D%5D%7D%5D,%22range%22:%7B%22from%22:%22now-5m%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1)
5. Select `Tempo` as the data source.

NOTE: separate spans (which are not nested) are not connected as a single trace for now.
#### Tracing in production:

Jaeger subscriber is enabled by default but is protected by the jaeger feature flag. If you'd like to leave
out the Jaeger dependencies, you can turn off the default-features in your dependency:
Because tracing is expensive, it is not enabled by default. To enable trace exporting on a production machine:

telemetry = { url = "...", default-features = false }
1. Ensure the process was started with `TRACE_FILTER=off` - this enables the OTLP system but filters out all spans.

2. Using the filter expression and duration of your choice, run:

$ curl -X POST 'http://127.0.0.1:1337/enable-tracing?filter=sui-node=trace,info&duration=10s'

Tracing will automatically be disabled after the specified duration has elapsed, in order to avoid leaving tracing on unintentionally.

### Automatic Prometheus span latencies

Expand All @@ -69,13 +75,6 @@ span performance in production apps.

Enabling this layer can only be done programmatically, by passing in a Prometheus registry to `TelemetryConfig`.

### Span levels vs log levels

What spans are included for Jaeger output, automatic span latencies, etc.? These are controlled by
the `span_level` config attribute, or the `TS_SPAN_LEVEL` environment variable. Note that this is
separate from `RUST_LOG`, so that you can separately control the logging verbosity from the level of
spans that are to be recorded and traced.

### Live async inspection / Tokio Console

[Tokio-console](https://github.com/tokio-rs/console) is an awesome CLI tool designed to analyze and help debug Rust apps using Tokio, in real time! It relies on a special subscriber.
Expand Down
Loading
Loading