DP-1537: Limit Influx db volume getting huge overtime #20

Merged
AlexFernandes-MOVAI merged 8 commits into dev from feat/collect_fd_net_metrics
May 23, 2024

@AlexFernandes-MOVAI AlexFernandes-MOVAI commented May 17, 2024

content

AlexFernandes-MOVAI and others added 3 commits January 12, 2024 23:19
Releasing new version with better check on linux cpu_plugin
@AlexFernandes-MOVAI AlexFernandes-MOVAI changed the title from "Feat: add 2 metrics in the telegraf configuration too track issues around sockets and file descriptors" to "Feat: add 2 metrics in the telegraf configuration to track issues around sockets and file descriptors" May 17, 2024
@AlexFernandes-MOVAI AlexFernandes-MOVAI self-assigned this May 17, 2024
@AlexFernandes-MOVAI AlexFernandes-MOVAI requested review from a team, duartecoelhomovai and mariana-dias-alves and removed request for a team May 17, 2024 13:31
@AlexFernandes-MOVAI AlexFernandes-MOVAI added the "enhancement" label May 17, 2024
@AlexFernandes-MOVAI AlexFernandes-MOVAI changed the title from "Feat: add 2 metrics in the telegraf configuration to track issues around sockets and file descriptors" to "DP-1537: Limit Influx db volume getting huge overtime" May 22, 2024
mariana-dias-alves commented May 22, 2024

Test telegraf on tugbot_simulator 2.4.3-28 (https://github.com/MOV-AI/project-tugbot/releases/tag/2.4.3-28)

Production Level

  • no errors on start up
  • status healthy for all containers
  • telegraf logs with no errors
$ docker logs telegraf-tugbot_simulator-movai 
2024-05-22T10:17:37Z I! Loading config: /etc/telegraf/telegraf.conf
  • influxdb logs with no errors
$ docker logs influxdb-tugbot_simulator-movai 
influxdb init process in progress...
ts=2024-05-22T10:17:28.219673Z lvl=info msg="InfluxDB starting" log_id=0pJxjkbl000 version=1.8.10 branch=1.8 commit=688e697c51fd
ts=2024-05-22T10:17:28.219685Z lvl=info msg="Go runtime" log_id=0pJxjkbl000 version=go1.13.8 maxprocs=12
ts=2024-05-22T10:17:28.325722Z lvl=info msg="Using data dir" log_id=0pJxjkbl000 service=store path=/var/lib/influxdb/data
ts=2024-05-22T10:17:28.325774Z lvl=info msg="Compaction settings" log_id=0pJxjkbl000 service=store max_concurrent_compactions=6 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
ts=2024-05-22T10:17:28.325782Z lvl=info msg="Open store (start)" log_id=0pJxjkbl000 service=store trace_id=0pJxjl1G000 op_name=tsdb_open op_event=start
ts=2024-05-22T10:17:28.325805Z lvl=info msg="Open store (end)" log_id=0pJxjkbl000 service=store trace_id=0pJxjl1G000 op_name=tsdb_open op_event=end op_elapsed=0.024ms
ts=2024-05-22T10:17:28.326165Z lvl=info msg="Opened service" log_id=0pJxjkbl000 service=subscriber
ts=2024-05-22T10:17:28.326172Z lvl=info msg="Starting monitor service" log_id=0pJxjkbl000 service=monitor
ts=2024-05-22T10:17:28.326175Z lvl=info msg="Registered diagnostics client" log_id=0pJxjkbl000 service=monitor name=build
ts=2024-05-22T10:17:28.326177Z lvl=info msg="Registered diagnostics client" log_id=0pJxjkbl000 service=monitor name=runtime
ts=2024-05-22T10:17:28.326181Z lvl=info msg="Registered diagnostics client" log_id=0pJxjkbl000 service=monitor name=network
ts=2024-05-22T10:17:28.326183Z lvl=info msg="Registered diagnostics client" log_id=0pJxjkbl000 service=monitor name=system
ts=2024-05-22T10:17:28.326187Z lvl=info msg="Starting precreation service" log_id=0pJxjkbl000 service=shard-precreation check_interval=10m advance_period=30m
ts=2024-05-22T10:17:28.326192Z lvl=info msg="Starting snapshot service" log_id=0pJxjkbl000 service=snapshot
ts=2024-05-22T10:17:28.326208Z lvl=info msg="Starting continuous query service" log_id=0pJxjkbl000 service=continuous_querier
ts=2024-05-22T10:17:28.326629Z lvl=info msg="Starting HTTP service" log_id=0pJxjkbl000 service=httpd authentication=false
ts=2024-05-22T10:17:28.326652Z lvl=info msg="Listening on HTTP" log_id=0pJxjkbl000 service=httpd addr=127.0.0.1:8086 https=false
ts=2024-05-22T10:17:28.326902Z lvl=info msg="Starting retention policy enforcement service" log_id=0pJxjkbl000 service=retention check_interval=30m
ts=2024-05-22T10:17:28.327069Z lvl=info msg="Started listening on UDP" log_id=0pJxjkbl000 service=udp addr=:9096
ts=2024-05-22T10:17:28.327133Z lvl=info msg="Listening for signals" log_id=0pJxjkbl000
ts=2024-05-22T10:17:29.225889Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="SHOW DATABASES"
/init-influxdb.sh: running /docker-entrypoint-initdb.d/retention_policy.sh
Configuring production retention policies ...
Create DB: "telegraf"
ts=2024-05-22T10:17:29.235164Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="CREATE DATABASE telegraf"
ts=2024-05-22T10:17:29.249589Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="ALTER RETENTION POLICY autogen ON telegraf DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT"
Create DB: "logs"
ts=2024-05-22T10:17:29.261626Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="CREATE DATABASE logs"
ts=2024-05-22T10:17:29.274003Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="ALTER RETENTION POLICY autogen ON logs DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT"
Create DB: "metrics"
ts=2024-05-22T10:17:29.287776Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="CREATE DATABASE metrics"
ts=2024-05-22T10:17:29.300074Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="ALTER RETENTION POLICY autogen ON metrics DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT"
Create DB: "_internal"
ts=2024-05-22T10:17:29.312147Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="CREATE DATABASE _internal"
ts=2024-05-22T10:17:29.324114Z lvl=info msg="Executing query" log_id=0pJxjkbl000 service=query query="ALTER RETENTION POLICY autogen ON _internal DURATION 1d REPLICATION 1 SHARD DURATION 1h DEFAULT"
Configuring production retention policies: DONE

ts=2024-05-22T10:17:29.329766Z lvl=info msg="Signal received, initializing clean shutdown..." log_id=0pJxjkbl000
ts=2024-05-22T10:17:29.329777Z lvl=info msg="Waiting for clean shutdown..." log_id=0pJxjkbl000
[tcp] 2024/05/22 10:17:29 tcp.Mux: Listener at 127.0.0.1:8088 failed failed to accept a connection, closing all listeners - accept tcp 127.0.0.1:8088: use of closed network connection
ts=2024-05-22T10:17:29.329875Z lvl=info msg="Listener closed" log_id=0pJxjkbl000 service=snapshot
ts=2024-05-22T10:17:29.329874Z lvl=info msg="Shutting down monitor service" log_id=0pJxjkbl000 service=monitor
ts=2024-05-22T10:17:29.329893Z lvl=info msg="Terminating precreation service" log_id=0pJxjkbl000 service=shard-precreation
ts=2024-05-22T10:17:29.329904Z lvl=info msg="Terminating continuous query service" log_id=0pJxjkbl000 service=continuous_querier
ts=2024-05-22T10:17:29.329931Z lvl=info msg="Closing retention policy enforcement service" log_id=0pJxjkbl000 service=retention
ts=2024-05-22T10:17:29.329953Z lvl=info msg="Failed to read UDP message" log_id=0pJxjkbl000 service=udp error="read udp [::]:9096: use of closed network connection"
ts=2024-05-22T10:17:29.329977Z lvl=info msg="Service closed" log_id=0pJxjkbl000 service=udp
ts=2024-05-22T10:17:29.330003Z lvl=info msg="Closed service" log_id=0pJxjkbl000 service=subscriber
ts=2024-05-22T10:17:29.330149Z lvl=info msg="Server shutdown completed" log_id=0pJxjkbl000
ts=2024-05-22T10:17:29.339078Z lvl=info msg="InfluxDB starting" log_id=0pJxjozl000 version=1.8.10 branch=1.8 commit=688e697c51fd
ts=2024-05-22T10:17:29.339091Z lvl=info msg="Go runtime" log_id=0pJxjozl000 version=go1.13.8 maxprocs=12
ts=2024-05-22T10:17:29.440283Z lvl=info msg="Using data dir" log_id=0pJxjozl000 service=store path=/var/lib/influxdb/data
ts=2024-05-22T10:17:29.440351Z lvl=info msg="Compaction settings" log_id=0pJxjozl000 service=store max_concurrent_compactions=6 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
ts=2024-05-22T10:17:29.440383Z lvl=info msg="Open store (start)" log_id=0pJxjozl000 service=store trace_id=0pJxjpO0000 op_name=tsdb_open op_event=start
ts=2024-05-22T10:17:29.440513Z lvl=info msg="Open store (end)" log_id=0pJxjozl000 service=store trace_id=0pJxjpO0000 op_name=tsdb_open op_event=end op_elapsed=0.134ms
ts=2024-05-22T10:17:29.440593Z lvl=info msg="Opened service" log_id=0pJxjozl000 service=subscriber
ts=2024-05-22T10:17:29.440615Z lvl=info msg="Starting monitor service" log_id=0pJxjozl000 service=monitor
ts=2024-05-22T10:17:29.440631Z lvl=info msg="Registered diagnostics client" log_id=0pJxjozl000 service=monitor name=build
ts=2024-05-22T10:17:29.440647Z lvl=info msg="Registered diagnostics client" log_id=0pJxjozl000 service=monitor name=runtime
ts=2024-05-22T10:17:29.440672Z lvl=info msg="Registered diagnostics client" log_id=0pJxjozl000 service=monitor name=network
ts=2024-05-22T10:17:29.440689Z lvl=info msg="Registered diagnostics client" log_id=0pJxjozl000 service=monitor name=system
ts=2024-05-22T10:17:29.440711Z lvl=info msg="Starting precreation service" log_id=0pJxjozl000 service=shard-precreation check_interval=10m advance_period=30m
ts=2024-05-22T10:17:29.440767Z lvl=info msg="Starting snapshot service" log_id=0pJxjozl000 service=snapshot
ts=2024-05-22T10:17:29.440801Z lvl=info msg="Starting continuous query service" log_id=0pJxjozl000 service=continuous_querier
ts=2024-05-22T10:17:29.440833Z lvl=info msg="Starting HTTP service" log_id=0pJxjozl000 service=httpd authentication=false
ts=2024-05-22T10:17:29.441002Z lvl=info msg="Listening on HTTP" log_id=0pJxjozl000 service=httpd addr=[::]:8086 https=false
ts=2024-05-22T10:17:29.441035Z lvl=info msg="Starting retention policy enforcement service" log_id=0pJxjozl000 service=retention check_interval=30m
ts=2024-05-22T10:17:29.442510Z lvl=info msg="Started listening on UDP" log_id=0pJxjozl000 service=udp addr=:9096
ts=2024-05-22T10:17:29.443951Z lvl=info msg="Listening for signals" log_id=0pJxjozl000
ts=2024-05-22T10:18:09.542389Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW DATABASES"
ts=2024-05-22T10:18:09.543800Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="CREATE DATABASE logs"
ts=2024-05-22T10:18:09.549867Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW DATABASES"
ts=2024-05-22T10:18:09.551802Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="CREATE DATABASE metrics"
ts=2024-05-22T10:22:57.849237Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW DATABASES"
ts=2024-05-22T10:22:57.864747Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW RETENTION POLICIES ON telegraf"
ts=2024-05-22T10:22:57.874767Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW RETENTION POLICIES ON logs"
ts=2024-05-22T10:22:57.887510Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW RETENTION POLICIES ON metrics"
ts=2024-05-22T10:22:57.897718Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW RETENTION POLICIES ON _internal"
ts=2024-05-22T10:23:00.844752Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(io_service_bytes_recursive_write) AS \"write\", mean(io_service_bytes_recursive_read) AS \"read\" FROM telegraf.autogen.docker_container_blkio WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:00.848034Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT non_negative_derivative(max(read_bytes), 1s) AS \"read bytes\", non_negative_derivative(max(write_bytes), 1s) AS \"write bytes\" FROM telegraf.autogen.diskio WHERE time > now() - 15m AND time < now() AND host = 'health-node' GROUP BY time(2500ms)"
ts=2024-05-22T10:23:00.851584Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(load1) AS load1, mean(load5) AS load5, mean(load15) AS load15 FROM telegraf.autogen.system WHERE time > now() - 15m AND time < now() AND host = 'health-node' GROUP BY time(2500ms)"
ts=2024-05-22T10:23:00.865330Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(n_containers) AS mean_n_containers FROM telegraf.autogen.docker WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:00.865828Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(used_percent) AS last_used_percent FROM telegraf.autogen.mem WHERE time > now() - 15m AND time < now() AND host = 'health-node' GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:00.865906Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(memory_total) / 1024 / 1024 AS mean_memory_total FROM telegraf.autogen.docker WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:00.865980Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage_percent) AS mean_usage_percent FROM telegraf.autogen.docker_container_mem WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:00.888747Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(rx_bytes) AS rx_bytes, mean(tx_bytes) AS tx_bytes FROM telegraf.autogen.docker_container_net WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:00.888808Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage_system) AS system, mean(usage_iowait) AS iowait, mean(usage_user) AS \"user\", mean(usage_idle) AS idle FROM telegraf.autogen.cpu WHERE host = 'health-node' AND time > now() - 15m AND time < now() AND cpu = 'cpu-total' GROUP BY time(2500ms)"
ts=2024-05-22T10:23:00.889373Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage) AS mean_usage FROM telegraf.autogen.docker_container_mem WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:00.889462Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT last(n_cpus) AS n_cpus FROM telegraf.autogen.docker WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:00.889524Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage_percent) AS \"Usage per Container\" FROM telegraf.autogen.docker_container_cpu WHERE host = 'health-node' AND time > now() - 15m AND time < now() AND cpu = 'cpu-total' GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:00.889635Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(used_percent) AS used FROM telegraf.autogen.disk WHERE time > now() - 15m AND time < now() AND host = 'health-node' GROUP BY time(2500ms), path"
ts=2024-05-22T10:23:00.889737Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(n_containers_running) AS running FROM telegraf.autogen.docker WHERE host = 'health-node' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:00.893240Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SHOW TAG VALUES ON telegraf WITH KEY = host WHERE (_name = 'docker') AND (_tagKey = 'host')"
ts=2024-05-22T10:23:01.476802Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(io_service_bytes_recursive_write) AS \"write\", mean(io_service_bytes_recursive_read) AS \"read\" FROM telegraf.autogen.docker_container_blkio WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:01.477067Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(load1) AS load1, mean(load5) AS load5, mean(load15) AS load15 FROM telegraf.autogen.system WHERE time > now() - 15m AND time < now() AND host = '{{.Hostname}}' GROUP BY time(2500ms)"
ts=2024-05-22T10:23:01.481070Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(used_percent) AS last_used_percent FROM telegraf.autogen.mem WHERE time > now() - 15m AND time < now() AND host = '{{.Hostname}}' GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:01.492660Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage_system) AS system, mean(usage_iowait) AS iowait, mean(usage_user) AS \"user\", mean(usage_idle) AS idle FROM telegraf.autogen.cpu WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() AND cpu = 'cpu-total' GROUP BY time(2500ms)"
ts=2024-05-22T10:23:01.523339Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(memory_total) / 1024 / 1024 AS mean_memory_total FROM telegraf.autogen.docker WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:01.523418Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage_percent) AS mean_usage_percent FROM telegraf.autogen.docker_container_mem WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:01.523944Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(n_containers_running) AS running FROM telegraf.autogen.docker WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:01.524213Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT non_negative_derivative(max(read_bytes), 1s) AS \"read bytes\", non_negative_derivative(max(write_bytes), 1s) AS \"write bytes\" FROM telegraf.autogen.diskio WHERE time > now() - 15m AND time < now() AND host = '{{.Hostname}}' GROUP BY time(2500ms)"
ts=2024-05-22T10:23:01.524285Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT last(n_cpus) AS n_cpus FROM telegraf.autogen.docker WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:01.524801Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(n_containers) AS mean_n_containers FROM telegraf.autogen.docker WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms) fill(previous)"
ts=2024-05-22T10:23:01.535634Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(rx_bytes) AS rx_bytes, mean(tx_bytes) AS tx_bytes FROM telegraf.autogen.docker_container_net WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:01.535729Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage) AS mean_usage FROM telegraf.autogen.docker_container_mem WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() GROUP BY time(2500ms), container_name"
ts=2024-05-22T10:23:01.536606Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(used_percent) AS used FROM telegraf.autogen.disk WHERE time > now() - 15m AND time < now() AND host = '{{.Hostname}}' GROUP BY time(2500ms), path"
ts=2024-05-22T10:23:01.537006Z lvl=info msg="Executing query" log_id=0pJxjozl000 service=query query="SELECT mean(usage_percent) AS \"Usage per Container\" FROM telegraf.autogen.docker_container_cpu WHERE host = '{{.Hostname}}' AND time > now() - 15m AND time < now() AND cpu = 'cpu-total' GROUP BY time(2500ms), container_name"
  • telegraf.conf matches production settings
  • Monitoring dashboards status
    • Docker: every graph is populated
    • Logs: tables are all populated
    • Redis: all graphs are populated
    • System: missing graphs on Network (missing net plugin), CPU per process, Mem per process (missing procstat plugin)
    • Top: no results
  • influxdb container memory consumption for 36 minutes
    image
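
The retention settings applied during init can be pulled out of the InfluxDB log above rather than read by eye. A minimal sketch, assuming a POSIX shell with sed; `summarize_retention` is a hypothetical helper and not part of this PR:

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): read InfluxDB log lines on stdin
# and print "<database> <policy> <duration> <shard duration>" for every
# ALTER RETENTION POLICY statement found.
summarize_retention() {
  sed -n 's/.*ALTER RETENTION POLICY \([^ ]*\) ON \([^ ]*\) DURATION \([^ ]*\) REPLICATION [0-9]* SHARD DURATION \([^ ]*\).*/\2 \1 \3 \4/p'
}

# Usage (container name taken from the test above):
#   docker logs influxdb-tugbot_simulator-movai 2>&1 | summarize_retention
```

Run against the production log above, this should list telegraf, logs and metrics with a 1w duration and _internal with 1d, all with 1h shards.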

Debug Level

  • no errors on start up or update
  • status healthy for all containers
  • telegraf logs with no errors
$ docker logs telegraf-tugbot_simulator-movai 
2024-05-22T10:57:41Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-22T10:57:41Z I! Starting Telegraf 1.30.3 brought to you by InfluxData the makers of InfluxDB
2024-05-22T10:57:41Z I! Available plugins: 233 inputs, 9 aggregators, 31 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-05-22T10:57:41Z I! Loaded inputs: cpu disk diskio docker docker_log influxdb internal kernel linux_cpu linux_sysctl_fs mem netstat processes procstat redis swap system wireless
2024-05-22T10:57:41Z I! Loaded aggregators: 
2024-05-22T10:57:41Z I! Loaded processors: 
2024-05-22T10:57:41Z I! Loaded secretstores: 
2024-05-22T10:57:41Z I! Loaded outputs: influxdb
2024-05-22T10:57:41Z I! Tags enabled: host={{.Hostname}}
2024-05-22T10:57:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"{{.Hostname}}", Flush Interval:10s
  • influxdb logs with no errors
$ docker logs influxdb-tugbot_simulator-movai 
influxdb init process in progress...
ts=2024-05-22T10:57:32.053793Z lvl=info msg="InfluxDB starting" log_id=0pJ~1TaG000 version=1.8.10 branch=1.8 commit=688e697c51fd
ts=2024-05-22T10:57:32.053809Z lvl=info msg="Go runtime" log_id=0pJ~1TaG000 version=go1.13.8 maxprocs=12
ts=2024-05-22T10:57:32.164549Z lvl=info msg="Using data dir" log_id=0pJ~1TaG000 service=store path=/var/lib/influxdb/data
ts=2024-05-22T10:57:32.164727Z lvl=info msg="Compaction settings" log_id=0pJ~1TaG000 service=store max_concurrent_compactions=6 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
ts=2024-05-22T10:57:32.164756Z lvl=info msg="Open store (start)" log_id=0pJ~1TaG000 service=store trace_id=0pJ~1U10000 op_name=tsdb_open op_event=start
ts=2024-05-22T10:57:32.164850Z lvl=info msg="Open store (end)" log_id=0pJ~1TaG000 service=store trace_id=0pJ~1U10000 op_name=tsdb_open op_event=end op_elapsed=0.097ms
ts=2024-05-22T10:57:32.164912Z lvl=info msg="Opened service" log_id=0pJ~1TaG000 service=subscriber
ts=2024-05-22T10:57:32.164924Z lvl=info msg="Starting monitor service" log_id=0pJ~1TaG000 service=monitor
ts=2024-05-22T10:57:32.164933Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1TaG000 service=monitor name=build
ts=2024-05-22T10:57:32.164942Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1TaG000 service=monitor name=runtime
ts=2024-05-22T10:57:32.164974Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1TaG000 service=monitor name=network
ts=2024-05-22T10:57:32.164987Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1TaG000 service=monitor name=system
ts=2024-05-22T10:57:32.165006Z lvl=info msg="Starting precreation service" log_id=0pJ~1TaG000 service=shard-precreation check_interval=10m advance_period=30m
ts=2024-05-22T10:57:32.165043Z lvl=info msg="Starting snapshot service" log_id=0pJ~1TaG000 service=snapshot
ts=2024-05-22T10:57:32.165065Z lvl=info msg="Starting continuous query service" log_id=0pJ~1TaG000 service=continuous_querier
ts=2024-05-22T10:57:32.165134Z lvl=info msg="Starting HTTP service" log_id=0pJ~1TaG000 service=httpd authentication=false
ts=2024-05-22T10:57:32.165238Z lvl=info msg="Listening on HTTP" log_id=0pJ~1TaG000 service=httpd addr=127.0.0.1:8086 https=false
ts=2024-05-22T10:57:32.165272Z lvl=info msg="Starting retention policy enforcement service" log_id=0pJ~1TaG000 service=retention check_interval=30m
ts=2024-05-22T10:57:32.166754Z lvl=info msg="Started listening on UDP" log_id=0pJ~1TaG000 service=udp addr=:9096
ts=2024-05-22T10:57:32.168516Z lvl=info msg="Listening for signals" log_id=0pJ~1TaG000
ts=2024-05-22T10:57:33.079139Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="SHOW DATABASES"
/init-influxdb.sh: running /docker-entrypoint-initdb.d/retention_policy.sh
Configuring production retention policies ...
Create DB: "telegraf"
ts=2024-05-22T10:57:33.098293Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="CREATE DATABASE telegraf"
ts=2024-05-22T10:57:33.112252Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="ALTER RETENTION POLICY autogen ON telegraf DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT"
Create DB: "logs"
ts=2024-05-22T10:57:33.124475Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="CREATE DATABASE logs"
ts=2024-05-22T10:57:33.137929Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="ALTER RETENTION POLICY autogen ON logs DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT"
Create DB: "metrics"
ts=2024-05-22T10:57:33.150393Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="CREATE DATABASE metrics"
ts=2024-05-22T10:57:33.162498Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="ALTER RETENTION POLICY autogen ON metrics DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT"
Create DB: "_internal"
ts=2024-05-22T10:57:33.174719Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="CREATE DATABASE _internal"
ts=2024-05-22T10:57:33.186989Z lvl=info msg="Executing query" log_id=0pJ~1TaG000 service=query query="ALTER RETENTION POLICY autogen ON _internal DURATION 1d REPLICATION 1 SHARD DURATION 1h DEFAULT"
Configuring production retention policies: DONE

ts=2024-05-22T10:57:33.192798Z lvl=info msg="Signal received, initializing clean shutdown..." log_id=0pJ~1TaG000
ts=2024-05-22T10:57:33.192811Z lvl=info msg="Waiting for clean shutdown..." log_id=0pJ~1TaG000
ts=2024-05-22T10:57:33.192857Z lvl=info msg="Shutting down monitor service" log_id=0pJ~1TaG000 service=monitor
[tcp] 2024/05/22 10:57:33 tcp.Mux: Listener at 127.0.0.1:8088 failed failed to accept a connection, closing all listeners - accept tcp 127.0.0.1:8088: use of closed network connection
ts=2024-05-22T10:57:33.192879Z lvl=info msg="Terminating precreation service" log_id=0pJ~1TaG000 service=shard-precreation
ts=2024-05-22T10:57:33.192899Z lvl=info msg="Listener closed" log_id=0pJ~1TaG000 service=snapshot
ts=2024-05-22T10:57:33.192906Z lvl=info msg="Terminating continuous query service" log_id=0pJ~1TaG000 service=continuous_querier
ts=2024-05-22T10:57:33.192938Z lvl=info msg="Closing retention policy enforcement service" log_id=0pJ~1TaG000 service=retention
ts=2024-05-22T10:57:33.192961Z lvl=info msg="Failed to read UDP message" log_id=0pJ~1TaG000 service=udp error="read udp [::]:9096: use of closed network connection"
ts=2024-05-22T10:57:33.192982Z lvl=info msg="Service closed" log_id=0pJ~1TaG000 service=udp
ts=2024-05-22T10:57:33.193004Z lvl=info msg="Closed service" log_id=0pJ~1TaG000 service=subscriber
ts=2024-05-22T10:57:33.193024Z lvl=info msg="Server shutdown completed" log_id=0pJ~1TaG000
ts=2024-05-22T10:57:33.202779Z lvl=info msg="InfluxDB starting" log_id=0pJ~1Y4W000 version=1.8.10 branch=1.8 commit=688e697c51fd
ts=2024-05-22T10:57:33.202791Z lvl=info msg="Go runtime" log_id=0pJ~1Y4W000 version=go1.13.8 maxprocs=12
ts=2024-05-22T10:57:33.304070Z lvl=info msg="Using data dir" log_id=0pJ~1Y4W000 service=store path=/var/lib/influxdb/data
ts=2024-05-22T10:57:33.304135Z lvl=info msg="Compaction settings" log_id=0pJ~1Y4W000 service=store max_concurrent_compactions=6 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
ts=2024-05-22T10:57:33.304155Z lvl=info msg="Open store (start)" log_id=0pJ~1Y4W000 service=store trace_id=0pJ~1YU0000 op_name=tsdb_open op_event=start
ts=2024-05-22T10:57:33.304248Z lvl=info msg="Open store (end)" log_id=0pJ~1Y4W000 service=store trace_id=0pJ~1YU0000 op_name=tsdb_open op_event=end op_elapsed=0.095ms
ts=2024-05-22T10:57:33.304306Z lvl=info msg="Opened service" log_id=0pJ~1Y4W000 service=subscriber
ts=2024-05-22T10:57:33.304325Z lvl=info msg="Starting monitor service" log_id=0pJ~1Y4W000 service=monitor
ts=2024-05-22T10:57:33.304333Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1Y4W000 service=monitor name=build
ts=2024-05-22T10:57:33.304342Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1Y4W000 service=monitor name=runtime
ts=2024-05-22T10:57:33.304355Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1Y4W000 service=monitor name=network
ts=2024-05-22T10:57:33.304369Z lvl=info msg="Registered diagnostics client" log_id=0pJ~1Y4W000 service=monitor name=system
ts=2024-05-22T10:57:33.304381Z lvl=info msg="Starting precreation service" log_id=0pJ~1Y4W000 service=shard-precreation check_interval=10m advance_period=30m
ts=2024-05-22T10:57:33.304411Z lvl=info msg="Starting snapshot service" log_id=0pJ~1Y4W000 service=snapshot
ts=2024-05-22T10:57:33.304441Z lvl=info msg="Starting continuous query service" log_id=0pJ~1Y4W000 service=continuous_querier
ts=2024-05-22T10:57:33.304467Z lvl=info msg="Starting HTTP service" log_id=0pJ~1Y4W000 service=httpd authentication=false
ts=2024-05-22T10:57:33.304681Z lvl=info msg="Listening on HTTP" log_id=0pJ~1Y4W000 service=httpd addr=[::]:8086 https=false
ts=2024-05-22T10:57:33.304718Z lvl=info msg="Starting retention policy enforcement service" log_id=0pJ~1Y4W000 service=retention check_interval=30m
ts=2024-05-22T10:57:33.305266Z lvl=info msg="Started listening on UDP" log_id=0pJ~1Y4W000 service=udp addr=:9096
ts=2024-05-22T10:57:33.305464Z lvl=info msg="Listening for signals" log_id=0pJ~1Y4W000
ts=2024-05-22T10:58:13.445268Z lvl=info msg="Executing query" log_id=0pJ~1Y4W000 service=query query="SHOW DATABASES"
ts=2024-05-22T10:58:13.447219Z lvl=info msg="Executing query" log_id=0pJ~1Y4W000 service=query query="CREATE DATABASE logs"
ts=2024-05-22T10:58:13.454513Z lvl=info msg="Executing query" log_id=0pJ~1Y4W000 service=query query="SHOW DATABASES"
ts=2024-05-22T10:58:13.456517Z lvl=info msg="Executing query" log_id=0pJ~1Y4W000 service=query query="CREATE DATABASE metrics"
  • telegraf.conf matches debug settings
  • Monitoring dashboards status
    • Docker: every graph is populated
    • Logs: tables are all populated
    • Redis: all graphs are populated
    • System: missing graphs on Network (missing net plugin)
    • Top: table populated
  • influxdb container memory consumption for 34 minutes
    image
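
The "telegraf logs with no errors" check can be made repeatable. A minimal sketch, assuming Telegraf marks error lines with `E!` in the same position where the lines above carry `I!`; `count_telegraf_errors` is a hypothetical helper, not something shipped in this PR:

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): count Telegraf log lines on stdin
# that carry the error marker "E!". Prints 0 when there are none.
count_telegraf_errors() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c ' E! ' || true
}

# Usage (container name taken from the test above):
#   docker logs telegraf-tugbot_simulator-movai 2>&1 | count_telegraf_errors
```

A result of 0 matches the manual observation; any other value means the log needs a closer look.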

mariana-dias-alves commented May 22, 2024

Test telegraf on tugbot_simulator 2.3.1-31 (https://github.com/MOV-AI/project-tugbot/releases/tag/2.3.1-31)

influxdb container updated to v1.1.0

Production Level

  • no errors on start up
  • status healthy for all containers
  • telegraf logs with no errors
2024-05-22T15:23:55Z I! Loading config: /etc/telegraf/telegraf.conf
  • influxdb logs with no errors
  • telegraf.conf matches production settings
  • Monitoring dashboards status
    • Docker: every graph is populated
    • Logs: tables are empty
    • Redis: all graphs are populated
    • System: missing graphs on Network (missing net plugin), CPU per process, Mem per process (missing procstat plugin)
    • Top: no results
  • influxdb container memory consumption
    image

Debug Level

  • no errors on start up
  • status healthy for all containers
  • telegraf logs with no errors
$ docker logs telegraf-tugbot_simulator-movai 
2024-05-22T16:27:13Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-22T16:27:13Z I! Starting Telegraf 1.30.3 brought to you by InfluxData the makers of InfluxDB
2024-05-22T16:27:13Z I! Available plugins: 233 inputs, 9 aggregators, 31 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-05-22T16:27:13Z I! Loaded inputs: cpu disk diskio docker docker_log influxdb internal kernel linux_cpu linux_sysctl_fs mem netstat processes procstat redis swap system wireless
2024-05-22T16:27:13Z I! Loaded aggregators: 
2024-05-22T16:27:13Z I! Loaded processors: 
2024-05-22T16:27:13Z I! Loaded secretstores: 
2024-05-22T16:27:13Z I! Loaded outputs: influxdb
2024-05-22T16:27:13Z I! Tags enabled: host={{.Hostname}}
2024-05-22T16:27:13Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"{{.Hostname}}", Flush Interval:10s
  • influxdb logs with no errors
  • telegraf.conf matches debug settings
  • Monitoring dashboards status
    • Docker: every graph is populated
    • Logs: tables are all populated
    • Redis: all graphs are populated
    • System: missing graphs on Network (missing net plugin)
    • Top: table populated
  • memory consumption
    image
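
The same kind of check works for "influxdb logs with no errors". A minimal sketch that counts InfluxDB log lines whose `lvl=` field is anything other than `info` or `debug`; `count_influxdb_non_info` is a hypothetical helper, and lines without a `lvl=` field (such as the `[tcp]` listener message printed during the init restart) are deliberately ignored:

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): count InfluxDB log lines on stdin
# whose level is neither info nor debug. Lines without "lvl=" are skipped.
count_influxdb_non_info() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep 'lvl=' | grep -vcE 'lvl=(info|debug)( |$)' || true
}

# Usage (container name taken from the tests above):
#   docker logs influxdb-tugbot_simulator-movai 2>&1 | count_influxdb_non_info
```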

Resolved review threads: files/telegraf_light.conf (outdated), files/telegraf_debug.conf (outdated), files/telegraf_production.conf

@AlexFernandes-MOVAI AlexFernandes-MOVAI left a comment

@mariana-dias-alves please add back procstat and net plugins on the production config to keep iso functionality with 2.3.1.

For procstat, maybe you can limit to the metrics we display:

  ## Properties to collect
  ## Available options are "cpu", "limits", "memory", "mmap"
  properties = ["cpu", "memory"]

(can be removed on 2.4.3 but we will need to adapt the dashboards)
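
Whether a given input plugin is enabled in a config file can be checked before starting the stack. A minimal sketch, assuming plugins are declared as uncommented `[[inputs.<name>]]` table headers; `has_input` is a hypothetical helper and the file path is the one reviewed in this PR:

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): succeed if the Telegraf config
# file $1 contains an uncommented [[inputs.$2]] header.
has_input() {
  # Requiring "]]" right after the name keeps "net" from matching "netstat".
  grep -Eq "^[[:space:]]*\[\[inputs\.$2]]" "$1"
}

# Usage:
#   for plugin in procstat net; do
#     if has_input files/telegraf_production.conf "$plugin"; then
#       echo "$plugin: enabled"
#     else
#       echo "$plugin: missing"
#     fi
#   done
```

The exact-name match matters here because the debug config loads `netstat` while the dashboards are missing `net`.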

Resolved review threads: files/telegraf_production.conf (3 threads)
@mariana-dias-alves

> @mariana-dias-alves please add back procstat and net plugins on the production config to keep iso functionality with 2.3.1.
>
> For procstat, maybe you can limit to the metrics we display:
>
>   ## Properties to collect
>   ## Available options are "cpu", "limits", "memory", "mmap"
>   properties = ["cpu", "memory"]
>
> (can be removed on 2.4.3 but we will need to adapt the dashboards)

The option to specify properties in the procstat plugin was only added two weeks ago, and no official release includes it yet. I therefore cannot set it, even if I update the Dockerfile to the latest Telegraf release: influxdata/telegraf#15299

@AlexFernandes-MOVAI AlexFernandes-MOVAI merged commit 7b7abcb into dev May 23, 2024
2 checks passed
@AlexFernandes-MOVAI AlexFernandes-MOVAI deleted the feat/collect_fd_net_metrics branch May 23, 2024 22:42