beam.smp chewing CPU on v1.31.0 on arm #4307

Closed · leewillis77 opened this issue Oct 27, 2024 · 52 comments · Fixed by #4319

Labels: kind:bug (Something isn't working), note:help wanted (Extra attention is needed), note:needs investigation (The issue must be investigated first), undetermined (Not sure if this is considered a real bug)

Comments

@leewillis77 (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I've upgraded to v1.31.0, and after the upgrade I noticed that there's a beam.smp process chewing up CPU time:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1041 10000     20   0  192432  64284   4732 S 144.9   3.4   6:13.18 beam.smp

The system is idle currently (car is asleep, and neither teslamate nor grafana are open on any devices).

  • It's possible this was happening before the latest upgrade - I can't say for sure.

Expected Behavior

System is idle with no excessive CPU hogging.

Steps To Reproduce

No response

Relevant log output

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1041 10000     20   0  192432  64284   4732 S 144.9   3.4   6:13.18 beam.smp

Screenshots

No response

Additional data

No response

Type of installation

Docker

Version

v1.31.0

@swiffer (Contributor) commented Oct 27, 2024

Seems to be caused by RabbitMQ (quick Google search), which is not utilized by TeslaMate. Could that be?

@leewillis77 (Contributor, Author)

@swiffer There's nothing else running on the box other than teslamate, and the processes in question look like they're coming from docker:

systemd(1)-+-agetty(455)
           |-agetty(456)
           |-alsactl(331)
           |-avahi-daemon(344)---avahi-daemon(356)
           |-colord(1279)-+-{colord}(1280)
           |              `-{colord}(1282)
           |-containerd(454)-+-{containerd}(458)
           |                 |-{containerd}(459)
           |                 |-{containerd}(460)
           |                 |-{containerd}(461)
           |                 |-{containerd}(462)
           |                 |-{containerd}(480)
           |                 |-{containerd}(481)
           |                 |-{containerd}(482)
           |                 |-{containerd}(498)
           |                 |-{containerd}(812)
           |                 |-{containerd}(813)
           |                 `-{containerd}(814)
           |-containerd-shim(828)-+-tini(889)-+-beam.smp(1041)-+-erl_child_setup(1438)---inet_gethost(1486)---inet_gethost(1487)
           |                      |           |                |-{beam.smp}(1435)
           |                      |           |                |-{beam.smp}(1436)
           |                      |           |                |-{beam.smp}(1437)
           |                      |           |                |-{beam.smp}(1439)
           |                      |           |                |-{beam.smp}(1440)
           |                      |           |                |-{beam.smp}(1441)
           |                      |           |                |-{beam.smp}(1442)
           |                      |           |                |-{beam.smp}(1443)
           |                      |           |                |-{beam.smp}(1444)
           |                      |           |                |-{beam.smp}(1445)
           |                      |           |                |-{beam.smp}(1446)
           |                      |           |                |-{beam.smp}(1447)
           |                      |           |                |-{beam.smp}(1448)
           |                      |           |                |-{beam.smp}(1449)
           |                      |           |                |-{beam.smp}(1450)
           |                      |           |                |-{beam.smp}(1451)
           |                      |           |                |-{beam.smp}(1452)
           |                      |           |                |-{beam.smp}(1453)
           |                      |           |                |-{beam.smp}(1454)
           |                      |           |                |-{beam.smp}(1455)
           |                      |           |                |-{beam.smp}(1456)
           |                      |           |                |-{beam.smp}(1457)
           |                      |           |                `-{beam.smp}(1458)

I do have DISABLE_MQTT=true in my docker compose since I'm not using any MQTT services if that helps.

@swiffer (Contributor) commented Oct 27, 2024

OK, RabbitMQ is based on Erlang - that's where it's coming from. I wonder if #4296 is causing this.

@swiffer (Contributor) commented Oct 27, 2024

@leewillis77 (Contributor, Author)

Hi;

Thanks for the suggestion. I added the following under environment in my docker-compose.yml:

- ERL_MAX_PORTS=1024

It didn't make any difference - was that the correct way to make the change? E.g.

services:
  teslamate:
    image: teslamate/teslamate:latest
    restart: always
    environment:
      - DATABASE_USER=teslamate
      - DATABASE_NAME=teslamate
      - DATABASE_HOST=database
      - DISABLE_MQTT=true
      - ERL_MAX_PORTS=1024

The thread you linked to seems to be concerned with high memory usage, but the problem I'm experiencing is high CPU, not high memory.
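As a side note, one way to confirm the variable actually reaches the running container is to query its environment from the host (a minimal sketch; the service name teslamate is taken from the compose snippet above):

docker compose exec teslamate env | grep ERL_MAX_PORTS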

@s56oa commented Oct 27, 2024

I have the same issue.
Running in Docker on a Raspberry Pi. Completely fresh install of TeslaMate made today.
Version 1.31.0

[Screenshot: 2024-10-27 at 19:59:52]

@swiffer (Contributor) commented Oct 27, 2024

@leewillis77 yes, you are right, sorry

Could one of you try

ghcr.io/teslamate-org/teslamate/teslamate:pr-4300

That is the latest build before the Elixir/Erlang upgrade.

@leewillis77 (Contributor, Author)

I would if I knew how :)

@swiffer (Contributor) commented Oct 27, 2024

Just replace the image in the docker compose file, see here:

https://docs.teslamate.org/docs/development#testing

@leewillis77 (Contributor, Author)

I tried that, actually:

$ docker compose pull
[+] Pulling 3/3
 ✘ grafana Error   context canceled                                                                                                                                                          0.5s
 ✘ teslamate Error Head "https://ghcr.io/v2/teslamate-org/teslamate/teslamate/manifests/pr-4300": denied                                                                                     0.5s
 ✘ database Error  context canceled

@swiffer (Contributor) commented Oct 27, 2024

okay, please try again with

image: ghcr.io/teslamate-org/teslamate:pr-4300

the contrib doc had an issue...

@leewillis77 (Contributor, Author)

beam.smp CPU usage is minimal (1.3%) with ghcr.io/teslamate-org/teslamate:pr-4300 vs > 100% as originally reported with the latest release.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
14533 10000     20   0  156660  84144   6616 S   1.3   4.4   0:07.23 beam.smp

@swiffer (Contributor) commented Oct 27, 2024

@leewillis77 - thanks for testing. I guess someone else needs to step in, but it seems to be caused by the Erlang/Elixir update @JakobLichterfeld

For now it should be fine to stay on pr-4300 until a fix is on the horizon.
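For anyone wanting to do the same, pinning the tag in docker-compose.yml and recreating the container is enough (a sketch based on the compose example earlier in this thread; the rest of the service definition stays unchanged):

services:
  teslamate:
    image: ghcr.io/teslamate-org/teslamate:pr-4300   # pinned test build from this thread

followed by docker compose pull teslamate and docker compose up -d.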

@s56oa commented Oct 27, 2024

I can confirm the same with image: ghcr.io/teslamate-org/teslamate:pr-4300
Version 1.30.2-dev

beam.smp CPU usage has dropped significantly to around 1.1%.

[Screenshot: 2024-10-27 at 20:47:45]

@swiffer (Contributor) commented Oct 27, 2024

Out of curiosity - could you check pr-4303 as well, just to make sure whether it's caused by the dependency updates or by Erlang?

@leewillis77 (Contributor, Author)

Out of curiosity - could you check pr-4303 as well, just to make sure whether it's caused by the dependency updates or by Erlang?

beam.smp CPU is high (over 100%) with pr-4303.

@leewillis77 (Contributor, Author)

I'm unfortunately not particularly knowledgeable about the Erlang ecosystem, but looking back at the pstree output I posted in #4307 (comment), which references inet_gethost, I wonder whether erlang/otp#8961 may be related?
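One quick way to narrow this down without beam-specific tooling is to look at which OS threads inside beam.smp are busy (a sketch; PID 1041 is taken from the top output earlier in this thread):

# per-thread CPU usage for the beam.smp process; the thread names hint at
# whether normal schedulers, dirty schedulers, or async/IO threads are spinning
top -H -p 1041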

@baresel commented Oct 27, 2024

Same issue for me; I switched to pr-4300 and it fixed the problem for now.

@brianmay (Collaborator)

It sounds like in that linked issue the process is just hanging, not using lots of CPU.

@brianmay (Collaborator) commented Oct 27, 2024

Beam is a VM. You will probably have problems debugging this with system tools; I think you will need beam tools.

One such tool is the erlang observer. https://blog.sequinstream.com/how-we-used-elixirs-observer-to-hunt-down-bottlenecks/

Having said that, it may be fiddly to get this working in a Docker container. Either the observer needs to have X access somehow (e.g. c0b/docker-elixir#19 (comment)), or you need to run a local beam that connects to the remote beam somehow (https://stackoverflow.com/questions/42164298/using-the-erlang-observer-app-with-a-remote-elixir-phoenix-server-inside-docker). Either way it is likely to be fiddly.
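The second option usually boils down to starting a named local node with the release's cookie and connecting to the node running inside the container (a rough sketch only; the node name, the cookie, and the published EPMD/distribution ports are all assumptions and depend on how the release is started):

# on the host, start a local node with Erlang distribution enabled
iex --name debug@127.0.0.1 --cookie "$RELEASE_COOKIE"

# inside that iex session:
#   Node.connect(:"teslamate@127.0.0.1")   # remote node name is an assumption
#   :observer.start()                      # then pick the remote node from the Nodes menu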

Also: It seems that our nix flake doesn't compile in the observer. I tried some things, but so far they have all been failures. Later: Oh, wait, this fixes it:

diff --git a/mix.exs b/mix.exs
index d159eefd..afa1c98a 100644
--- a/mix.exs
+++ b/mix.exs
@@ -27,7 +27,7 @@ defmodule TeslaMate.MixProject do
   def application do
     [
       mod: {TeslaMate.Application, []},
-      extra_applications: [:logger, :runtime_tools]
+      extra_applications: [:logger, :runtime_tools, :wx, :observer]
     ]
   end

(solution copied from https://stackoverflow.com/a/76760603)
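With :wx and :observer added to extra_applications, a local dev session can then launch the GUI directly (a minimal sketch; it assumes a machine with a working display, so inside Docker it would still need the X forwarding discussed above):

iex -S mix
iex> :observer.start()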

JakobLichterfeld added the kind:bug, note:help wanted, undetermined, and note:needs investigation labels on Oct 28, 2024
@JakobLichterfeld (Collaborator)

I can reproduce the high CPU usage. Apart from the high CPU usage, the system runs fine with multiple Docker containers on a Raspberry Pi 3B+ and is currently logging a charge without problems.

@JakobLichterfeld (Collaborator) commented Oct 28, 2024

Thanks for your investigation!

Later: Oh, wait, this fixes it:

--> PR: #4311

@JakobLichterfeld (Collaborator)

Perhaps we can disable beam's busy waiting: https://stressgrid.com/blog/beam_cpu_usage/
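The scheduler busy-wait settings from that article can be passed to the VM via ERL_FLAGS in the compose file (a sketch, not something TeslaMate documents; the flag values are the ones suggested in the linked post):

services:
  teslamate:
    environment:
      # disable scheduler busy waiting for normal, dirty-CPU, and dirty-IO schedulers
      - ERL_FLAGS=+sbwt none +sbwtdcpu none +sbwtdio none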

@brianmay (Collaborator) commented Oct 28, 2024

Maybe a problem with Raspbian and the latest code? What version are people using? Bookworm? Bullseye? Or older?

It could also be an ARM-specific issue with the latest beam.

@brianmay (Collaborator) commented Oct 28, 2024

I propose that somebody create a pull request with the change to the Dockerfile reverted:

FROM elixir:1.16.2-otp-26 AS builder

We test that. Assuming that test succeeds, we then test:

FROM elixir:1.17.2-otp-26 AS builder

We should test the following also, but I suspect we can't because this is not a valid option:

FROM elixir:1.16.2-otp-27 AS builder

My suspicion is we will find otp-26 good and otp-27 bad, but it would be good to prove this is indeed the case.

On the other hand, if the problem is the elixir version, not the erlang version, we probably should try 1.17.3 also.
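Testing each candidate locally is mostly a matter of editing the FROM line, building an image, and pointing compose at it (a sketch; the local tag name is arbitrary, and any extra build arguments the Dockerfile may need are not shown):

# after changing the builder line in the Dockerfile, e.g.
#   FROM elixir:1.16.2-otp-26 AS builder
docker build -t teslamate:otp-test .

# then in docker-compose.yml:
#   image: teslamate:otp-test
docker compose up -d teslamate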

@cwanja (Collaborator) commented Oct 29, 2024

Maybe a problem with Raspbian and the latest code? What version are people using? Bookworm? Bullseye? Or older?

It could also be an ARM-specific issue with the latest beam.

Bullseye here.

Have not updated my Pi in months (do not judge me, ha). TeslaMate is the only thing running on it. Like others, after updating to 1.31.0, beam.smp shot up and chewed through CPU.

@swiffer (Contributor) commented Oct 29, 2024

I updated my comment with results on amd64 - it works without an increase in CPU time here. I guess having some exploration builds as proposed by brianmay is the best way to quickly see what is introducing the issue?

@baresel commented Oct 29, 2024

Maybe a problem with Raspbian and the latest code? What version are people using? Bookworm? Bullseye? Or older?

It could also be an ARM-specific issue with the latest beam.

I am on Bullseye too.

@JakobLichterfeld (Collaborator)

Maybe a problem with Raspbian and the latest code? What version are people using? Bookworm? Bullseye? Or older?

It could also be an ARM-specific issue with the latest beam.

Good point, I hadn't thought of that.
Armv7l, Raspbian Buster here.

I propose that somebody create a pull request with the change to the Dockerfile reverted

I will be the somebody :-) (as it will be a bit tricky to get the CI to build, since we prevent it when there are changes to the Dockerfile)

@swiffer (Contributor) commented Oct 29, 2024

Could someone affected try these three, please?

  • image: ghcr.io/teslamate-org/teslamate:pr-4314 (1.16.3-otp-26)
  • image: ghcr.io/teslamate-org/teslamate:pr-4315 (1.17.3-otp-26)
  • image: ghcr.io/teslamate-org/teslamate:pr-4316 (1.17.3-otp-27)

I tried these three because we know that 1.16.2-otp-26 is working and 1.17.2-otp-27 is not.

@JakobLichterfeld - I've been faster 😝

@baresel commented Oct 29, 2024

  • ghcr.io/teslamate-org/teslamate:pr-4314
    not affected, CPU load in normal ranges
  • ghcr.io/teslamate-org/teslamate:pr-4315
    not affected, same as 4300 or 4314
  • ghcr.io/teslamate-org/teslamate:pr-4316
    definitely affected, CPU load is jumping between 30% and 90%

I also checked the latest official release again; there the load is jumping between 90% and 140%, which is more than I can see in the 4316 testing release!

@swiffer (Contributor) commented Oct 29, 2024

OK, so it looks like OTP 27 is causing the issue on arm.

@baresel could you provide more details about your environment and hardware (OS, Docker version, hardware)?

@baresel commented Oct 29, 2024

OS Version: "Raspbian GNU/Linux 11 (bullseye)", Kernel: Linux 6.1.21-v8+
Docker Version: Docker version 20.10.5+dfsg1, build 55c4c88
Hardware: Raspi 4B, 4 GB RAM

@DrMichael (Collaborator)

Same here, 4314, 4315 are fine, 4316 has high load.

OS Version: "Raspbian GNU/Linux 11 (bullseye)", Kernel: Linux 6.1.70-v7+
Docker Version: Docker version 27.3.1, build ce12230
Hardware: Raspi 3B+, 1 GB RAM

@plutonn commented Oct 29, 2024

Hello, 4314, 4315 are fine, 4316 has high load.

OS Version: "Raspbian GNU/Linux 12 (bookworm) aarch64
", Kernel: Linux 6.1.21-v8+
Docker Version: Docker version 27.3.1, build ce12230
Hardware: Raspi 4B+, 8 GB RAM Rev 1.4

@JakobLichterfeld (Collaborator) commented Oct 29, 2024

I assume the OTP 27 docker image was built with different compiler flags. I suggest a patch release downgrading to 1.17.3-otp-26, even if I could not find any negative effect besides the "high" load shown in top on my Raspberry Pi 3B+ (edit: corrected the OTP version).
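In Dockerfile terms the proposed downgrade amounts to pinning the builder image back to OTP 26, matching the line format shown above (a sketch of the intended change, in line with what #4319 later describes):

FROM elixir:1.17.3-otp-26 AS builder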

@swiffer (Contributor) commented Oct 29, 2024

What I do find interesting (but maybe completely unrelated): RabbitMQ does not support OTP 27, as it has significant performance regressions in their use cases.

https://www.rabbitmq.com/docs/which-erlang#erlang-27-support

However, we seem to be affected on aarch64 only.

@swiffer (Contributor) commented Oct 29, 2024

@JakobLichterfeld you mean 1.17.3-otp-26 ?

@JakobLichterfeld (Collaborator) commented Oct 29, 2024

What I do find interesting (but maybe completely unrelated): RabbitMQ does not support OTP 27, as it has significant performance regressions in their use cases.

OK, that's a reason to stay with 1.17.3-otp-26 (edit: corrected OTP version)

@swiffer (Contributor) commented Oct 29, 2024

Why not 1.17.3-otp-26? (will update my existing PR)

@JakobLichterfeld (Collaborator) commented Oct 29, 2024

@JakobLichterfeld you mean 1.17.3-otp-26 ?

Yeah, sorry for the typo :-D (edited above to avoid confusion)

Why not 1.17.3-otp-26? (will update my existing PR)

Yeah, 1.17.3-otp-26

We will need to update the docs, the CI, the nix flake etc. as well

@JakobLichterfeld (Collaborator)

We will need to update the docs, the CI, the nix flake etc. as well

I will open a branch for it.

JakobLichterfeld changed the title from "beam.smp chewing CPU on v1.31.0" to "beam.smp chewing CPU on v1.31.0 on arm" on Oct 29, 2024
JakobLichterfeld added a commit that referenced this issue Oct 29, 2024
…4319)

* fix: use elixir-1.17.3-otp-26 to avoid beam.smp chewing CPU on arm

fixes: #4307

* docs: update changelog
@brianmay (Collaborator)

I had a theory that maybe the problem was that otp-27 doesn't work correctly with Bullseye. But I think the above tests clearly show Bookworm has the same issue.

Just a detail I found interesting.

@brianmay (Collaborator) commented Oct 29, 2024

erlang/otp#8322

OTP 27 has various optimizations; I wonder if one of these broke things for arm. https://www.erlang.org/blog/optimizations/

@brianmay (Collaborator)

Possibly this is not arm specific, but you notice it more on the slower processors.

@JakobLichterfeld (Collaborator)

OTP 27 has various optimizations; I wonder if one of these broke things for arm. https://www.erlang.org/blog/optimizations/

Actually a better explanation than mine, as I had assumed:

I assume the OTP 27 docker image was built with different compiler flags.

Possibly this is not arm specific, but you notice it more on the slower processors.

Yeah, good point; currently I do not want to dive too deep into this rabbit hole.
