We're experiencing a memory leak on one of our Nerves devices that seems to be related to SystemRegistry's use of ETS tables. See plot below for memory from the last 14 days. We're not 100% sure of the root cause, but it appears to be triggered by frequent uevent messages. We will try to generate some more data on this over the next couple of days. I'm happy to provide any info you need, so please let me know.
Environment

elixir -v: 1.8.1

Additional information about your host, target hardware or environment that may help
We're running Nerves on an Intel Compute Stick (STK1A32SC) using a custom system based on nerves_system_x86_64. Our Nerves application starts Docker, and Docker manages a couple of containers.
Whenever a Docker container is started it creates a new virtual network interface. And when a faulty container is restarted over and over and over again, that generates a lot of virtual network interfaces 😄 and thus a lot of uevent messages.
Current behavior
After noticing the rising memory use on a device, we decided to investigate. Using htop we found that the BEAM was using about 55% of RAM. On a freshly booted device, the BEAM uses about 5% and it usually stabilises around 12% of RAM. The device has 2 GB of RAM.
Comparing the memory usage reported by :erlang.memory/0 on the unhealthy node with that of a healthy node, we found the following:
| allocator | unhealthy device (bytes) | healthy device (bytes) | comparison (unhealthy / healthy) |
| --- | --- | --- | --- |
| atom | 594561 | 586369 | 1.01 |
| atom_used | 580202 | 559360 | 1.04 |
| binary | 1920088 | 2032104 | 0.94 |
| code | 13592541 | 13274957 | 1.02 |
| ets | 296828960 | 37491096 | 7.92 |
| processes | 61425888 | 46360584 | 1.32 |
| processes_used | 61424824 | 46359520 | 1.32 |
| system | 321617232 | 61810016 | 5.2 |
| total | 383043120 | 108170600 | 3.54 |
Seeing that ETS seemed to be using the most memory, we sorted all ETS tables by their memory usage (using the :memory stat reported by :ets.info/1) and found that the heavy hitters were tables owned by SystemRegistry. We then looked at dmesg to see if we could find anything odd, and found repeating patterns like this:
[Wed Oct 16 11:29:54 2019] veth0122b00: renamed from eth0
[Wed Oct 16 11:29:54 2019] br-884c2e680dd2: port 8(veth59b04b9) entered disabled state
[Wed Oct 16 11:29:54 2019] br-884c2e680dd2: port 8(veth59b04b9) entered disabled state
[Wed Oct 16 11:29:54 2019] device veth59b04b9 left promiscuous mode
[Wed Oct 16 11:29:54 2019] br-884c2e680dd2: port 8(veth59b04b9) entered disabled state
[Wed Oct 16 11:29:55 2019] br-884c2e680dd2: port 8(veth398d220) entered blocking state
[Wed Oct 16 11:29:55 2019] br-884c2e680dd2: port 8(veth398d220) entered disabled state
[Wed Oct 16 11:29:55 2019] device veth398d220 entered promiscuous mode
[Wed Oct 16 11:29:55 2019] IPv6: ADDRCONF(NETDEV_UP): veth398d220: link is not ready
[Wed Oct 16 11:29:55 2019] br-884c2e680dd2: port 8(veth398d220) entered blocking state
[Wed Oct 16 11:29:55 2019] br-884c2e680dd2: port 8(veth398d220) entered forwarding state
[Wed Oct 16 11:29:55 2019] eth0: renamed from veth69a99ec
[Wed Oct 16 11:29:55 2019] IPv6: ADDRCONF(NETDEV_CHANGE): veth398d220: link becomes ready
[Wed Oct 16 11:33:11 2019] veth69a99ec: renamed from eth0
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth398d220) entered disabled state
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth398d220) entered disabled state
[Wed Oct 16 11:33:11 2019] device veth398d220 left promiscuous mode
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth398d220) entered disabled state
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered blocking state
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered disabled state
[Wed Oct 16 11:33:11 2019] device veth17ce7b4 entered promiscuous mode
[Wed Oct 16 11:33:11 2019] IPv6: ADDRCONF(NETDEV_UP): veth17ce7b4: link is not ready
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered blocking state
[Wed Oct 16 11:33:11 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered forwarding state
[Wed Oct 16 11:33:11 2019] eth0: renamed from veth964b38b
[Wed Oct 16 11:33:11 2019] IPv6: ADDRCONF(NETDEV_CHANGE): veth17ce7b4: link becomes ready
[Wed Oct 16 11:36:27 2019] veth964b38b: renamed from eth0
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered disabled state
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered disabled state
[Wed Oct 16 11:36:27 2019] device veth17ce7b4 left promiscuous mode
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(veth17ce7b4) entered disabled state
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(vethb123857) entered blocking state
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(vethb123857) entered disabled state
[Wed Oct 16 11:36:27 2019] device vethb123857 entered promiscuous mode
[Wed Oct 16 11:36:27 2019] IPv6: ADDRCONF(NETDEV_UP): vethb123857: link is not ready
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(vethb123857) entered blocking state
[Wed Oct 16 11:36:27 2019] br-884c2e680dd2: port 8(vethb123857) entered forwarding state
[Wed Oct 16 11:36:28 2019] eth0: renamed from veth7beac03
[Wed Oct 16 11:36:28 2019] IPv6: ADDRCONF(NETDEV_CHANGE): vethb123857: link becomes ready
Approximately every 3 minutes, a virtual network interface was being added. This coincides with a periodic restart of a faulty container.
Based on this, we hypothesize that frequent and repeated addition and removal of network interfaces is a problem for SystemRegistry. To test our hypothesis (and system resilience 😄), we decided to delete all ETS tables owned by SystemRegistry to see how this would affect memory usage:
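Roughly, the cleanup looked like the sketch below (not the exact commands we used; matching the tables on the registered name of the owning process is an assumption about how SystemRegistry names its processes, and the whole thing is obviously destructive, so don't run it on a device you care about):

```elixir
# Find every ETS table whose owner looks like a SystemRegistry process and
# drop it. Note that :ets.delete/1 from the shell only succeeds for tables
# the shell is allowed to write to (e.g. :public tables); otherwise it raises.
:ets.all()
|> Enum.filter(fn tab ->
  with owner when is_pid(owner) <- :ets.info(tab, :owner),
       {:registered_name, name} when is_atom(name) <-
         Process.info(owner, :registered_name) do
    name |> Atom.to_string() |> String.contains?("SystemRegistry")
  else
    _ -> false
  end
end)
|> Enum.each(&:ets.delete/1)

# Check the effect on total memory afterwards.
:erlang.memory(:total)
```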
Total memory usage dropped from 383043120 to 88041888 bytes, roughly a 4x reduction. (And a little later the system rebooted to recover itself - hurray.)
I don't know what is causing the slow leak of memory in this case. When the Docker container is restarted, the virtual network interface is removed from the system (it doesn't show up in ip addr), but for some reason the SystemRegistry ETS tables keep growing.
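For anyone debugging something similar, the registry contents can also be inspected directly (a sketch; the [:state, :network_interface] path is an assumption based on how networking libraries commonly publish interface data into SystemRegistry, so adjust it for your setup):

```elixir
# Dump the global SystemRegistry state and pull out the interface subtree.
# The key layout under :state is assumed and may differ per system.
SystemRegistry.match(:_)
|> get_in([:state, :network_interface])
```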
Expected behavior
Memory usage should be stable regardless of how many times an interface is added and removed.