feat: add OpenTelemetry to node (#7102)
This PR adds OpenTelemetry to the monorepo and starts tracking the
following metrics in the node:
- resource usage (CPU, system memory)
- block height
- average block size (how many txs per block)
- mempool status
- time taken to generate the witness and prove the protocol circuits
- time to simulate the circuits (only relevant if using mocked proofs)

The witgen/proving times are recorded as gauges rather than histograms
because of a mismatch between how often Prometheus scrapes the metrics
and how rarely we generate proofs of the same type. In PromQL, recreating
a histogram quantile requires looking at the rate of change of the
individual buckets, but if proving takes tens of seconds the buckets
don't change often enough, so the query ends up dividing by 0. Using a
gauge gives us instantaneous values, but we potentially lose data (e.g.
if two proofs finish in the same scrape interval), and in the dashboard
the numbers won't "decay" (meaning that if for some reason the node stops
producing blocks, the proof duration will stay at the previous value).
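
To make the trade-off concrete, here is a small self-contained model of both
behaviours. It is only a sketch: it does not use the OpenTelemetry or
Prometheus libraries, every name in it is hypothetical, and
`quantileOverWindow` is a simplified stand-in for what PromQL's
`histogram_quantile` over a bucket rate does (finite buckets only, no `+Inf`
bucket).

```typescript
// Illustrative model only: none of these names come from the node's code.

interface Bucket { le: number; count: number } // cumulative count of observations <= le

// Rough stand-in for a quantile computed from the change in cumulative
// bucket counts between the start and the end of a query window.
function quantileOverWindow(q: number, start: Bucket[], end: Bucket[]): number {
  const deltas = end.map((b, i) => ({ le: b.le, count: b.count - (start[i]?.count ?? 0) }));
  const total = deltas.length > 0 ? deltas[deltas.length - 1]!.count : 0;
  if (total === 0) {
    return NaN; // no bucket changed in the window, so there is nothing to divide by
  }
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of deltas) {
    if (b.count >= rank && b.count > prevCount) {
      // Linear interpolation inside the bucket that contains the rank.
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

interface ProofEvent { finishedAtSec: number; durationSec: number }

// A gauge keeps only the last value written before each scrape.
function scrapeGauge(events: ProofEvent[], scrapeIntervalSec: number, untilSec: number): number[] {
  const sorted = [...events].sort((a, b) => a.finishedAtSec - b.finishedAtSec);
  const samples: number[] = [];
  let current = NaN; // nothing recorded yet
  let next = 0;
  for (let t = scrapeIntervalSec; t <= untilSec; t += scrapeIntervalSec) {
    while (next < sorted.length && sorted[next]!.finishedAtSec <= t) {
      current = sorted[next]!.durationSec; // last write wins
      next++;
    }
    samples.push(current);
  }
  return samples;
}

const empty: Bucket[] = [{ le: 10, count: 0 }, { le: 30, count: 0 }, { le: 60, count: 0 }];
const afterTwoProofs: Bucket[] = [{ le: 10, count: 0 }, { le: 30, count: 1 }, { le: 60, count: 2 }];

// Window in which two proofs finished: the median is defined.
const busyWindow = quantileOverWindow(0.5, empty, afterTwoProofs);
// Later window in which no proof finished: the quantile is undefined (NaN).
const idleWindow = quantileOverWindow(0.5, afterTwoProofs, afterTwoProofs);
// Two proofs (12s and 45s) finish before the first 10s scrape.
const gaugeSamples = scrapeGauge(
  [{ finishedAtSec: 3, durationSec: 12 }, { finishedAtSec: 7, durationSec: 45 }],
  10,
  30,
);

console.log({ busyWindow, idleWindow, gaugeSamples });
```

In this model the idle window yields `NaN` for the histogram, while the gauge
reports 45 at every scrape: the 12s proof is never seen, and the value does
not decay once proofs stop arriving.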


Three new components are added to the architecture:
- an instance of the OpenTelemetry collector (aggregates metrics pushed
by the node)
- an instance of Prometheus to scrape the collector for data
- an instance of Grafana to chart the data

The top-level docker-compose has been updated to include these
components under the `metrics` profile. To run the metrics stack:

```
$ docker compose --profile metrics up -d
```

Then e2e tests can be run with metrics exported to the collector and charted in Grafana:
```
OTEL_COLLECTOR_HOST=127.0.0.1:4318 yarn test e2e_block_building
```
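
For reference, OTLP over HTTP conventionally receives metrics on the
`/v1/metrics` path of the collector (port 4318 by default). The helper below
is a hypothetical sketch of deriving that URL from a base URL such as the
`OTEL_COLLECTOR_BASE_URL` default in docker-compose.yml. How the node
actually consumes the variable is not shown here, so treat the function name
and behaviour as assumptions.

```typescript
// Hypothetical helper: builds the OTLP/HTTP metrics endpoint from a base URL.
function otlpMetricsUrl(baseUrl: string): string {
  // Ensure a trailing slash so any path prefix on the base URL is preserved.
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return new URL("v1/metrics", base).href;
}

const composeDefault = otlpMetricsUrl("http://otel-collector:4318");
const localCollector = otlpMetricsUrl("http://127.0.0.1:4318/");
console.log(composeDefault, localCollector);
```

Note that the base URL needs a scheme (`http://`) for this to parse; a bare
`host:port` value would have to be prefixed first.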

Two Grafana dashboards are included with this PR.

Node stats:

![image](https://github.com/AztecProtocol/aztec-packages/assets/3816165/0536ffde-934f-46de-b864-c8464925de4c)

Protocol circuit stats:

![image](https://github.com/AztecProtocol/aztec-packages/assets/3816165/8e1e15ce-daaf-4d65-b5e8-a681335cac80)
alexghr authored Jun 24, 2024
1 parent fd92d46 commit 6bf2b72
Showing 63 changed files with 2,517 additions and 83 deletions.
11 changes: 9 additions & 2 deletions cspell.json
@@ -169,6 +169,9 @@
"nullifer",
"offchain",
"onchain",
"opentelemetry",
"otel",
"OTLP",
"otterscan",
"outdir",
"overlayfs",
@@ -253,6 +256,7 @@
"typegen",
"typeparam",
"undeployed",
"undici",
"unexclude",
"unexcluded",
"unprefixed",
@@ -270,6 +274,7 @@
"viem",
"wasms",
"webassembly",
"WITGEN",
"workdir",
"yamux",
"yarnrc",
@@ -301,5 +306,7 @@
"lib",
"*.cmake"
],
"flagWords": ["anonymous"]
}
"flagWords": [
"anonymous"
]
}
116 changes: 107 additions & 9 deletions docker-compose.yml
@@ -28,6 +28,8 @@ services:
- aztec:/var/lib/aztec
ports:
- 8080:8080/tcp
profiles:
- pxe

node:
image: aztecprotocol/aztec${AZTEC_DOCKER_TAG:-@sha256:03feac60e91f1aabf678cecbcd13271dda229120ec6007f2c1bac718ff550c70}
@@ -59,18 +61,34 @@ services:
P2P_ENABLED: true
PEER_ID_PRIVATE_KEY:
AZTEC_PORT: 8999
OTEL_COLLECTOR_BASE_URL: ${OTEL_COLLECTOR_BASE_URL:-http://otel-collector:4318}
secrets:
- ethereum-host
- p2p-boot-node
entrypoint: [
"/bin/sh",
"-c",
"export ETHEREUM_HOST=$$(cat /var/run/secrets/ethereum-host);\
export BOOTSTRAP_NODES=$$(cat /var/run/secrets/p2p-boot-node);\
test -z \"$$PEER_ID_PRIVATE_KEY\" -a ! -f /var/lib/aztec/p2p-private-key && node /usr/src/yarn-project/cli/dest/bin/index.js generate-p2p-private-key | head -1 | cut -d' ' -f 3 | tee /var/lib/aztec/p2p-private-key || echo 'Re-using existing P2P private key';\
test -z \"$$PEER_ID_PRIVATE_KEY\" && export PEER_ID_PRIVATE_KEY=$$(cat /var/lib/aztec/p2p-private-key);\
node /usr/src/yarn-project/aztec/dest/bin/index.js start --node --archiver",
]
entrypoint: |
/bin/sh -c '
export ETHEREUM_HOST=$$(cat /var/run/secrets/ethereum-host)
export BOOTSTRAP_NODES=$$(cat /var/run/secrets/p2p-boot-node)
test -z "$$PEER_ID_PRIVATE_KEY" -a ! -f /var/lib/aztec/p2p-private-key && node /usr/src/yarn-project/cli/dest/bin/index.js generate-p2p-private-key | head -1 | cut -d" " -f 3 | tee /var/lib/aztec/p2p-private-key || echo "Re-using existing P2P private key"
test -z "$$PEER_ID_PRIVATE_KEY" && export PEER_ID_PRIVATE_KEY=$$(cat /var/lib/aztec/p2p-private-key)
# if the stack is started with --profile metrics --profile node, give the collector a chance to start before the node
i=0
max=3
while ! curl --head --silent $$OTEL_COLLECTOR_BASE_URL > /dev/null; do
echo "OpenTelemetry collector not up. Retrying after 1s";
sleep 1;
i=$$((i+1));
if [ $$i -eq $$max ]; then
echo "OpenTelemetry collector at $$OTEL_COLLECTOR_BASE_URL not up after $${max}s. Running without metrics";
unset OTEL_COLLECTOR_BASE_URL;
break
fi;
done;
node /usr/src/yarn-project/aztec/dest/bin/index.js start --node --archiver
'
volumes:
- aztec:/var/lib/aztec
profiles:
@@ -94,8 +112,88 @@ services:
profiles:
- cli

otel-collector:
image: otel/opentelemetry-collector-contrib
configs:
- source: otel-collector-config
target: /etc/otelcol-contrib/config.yaml
profiles:
- metrics
ports:
- 4318:4318

prometheus:
image: prom/prometheus
profiles:
- metrics
configs:
- source: prometheus-config
target: /etc/prometheus/prometheus.yml

grafana:
image: grafana/grafana
ports:
- 3000:3000
profiles:
- metrics
volumes:
- ./grafana_dashboards:/etc/grafana/provisioning/dashboards
- grafana:/var/lib/grafana
configs:
- source: grafana-sources
target: /etc/grafana/provisioning/datasources/default.yml

volumes:
aztec:
grafana:

configs:
grafana-sources:
content: |
apiVersion: 1
datasources:
- name: Prometheus
uid: aztec-node-metrics
type: prometheus
url: http://prometheus:9090
editable: false
isDefault: true
jsonData:
timeInterval: 10s
prometheus-config:
content: |
global:
evaluation_interval: 30s
scrape_interval: 10s
scrape_configs:
- job_name: otel-collector
static_configs:
- targets: ['otel-collector:8888']
- job_name: aztec
static_configs:
- targets: ['otel-collector:8889']
otel-collector-config:
content: |
receivers:
otlp:
protocols:
http:
processors:
batch:
exporters:
prometheus:
endpoint: 0.0.0.0:8889
metric_expiration: 5m
service:
pipelines:
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
secrets:
aztec-node-url: