node: concurrent map read/write inside hostentry sync #225

fionera · 2023-06-14T19:28:35Z

After adding nodes fairly quickly we encountered a crash:

           hostsfile I Hosts entry:fatal error: concurrent map iteration and map write

goroutine 276 [running]:
runtime.throw({0x254e1d0?, 0x5?})
	GOROOT/src/runtime/panic.go:992 +0x71 fp=0xc0194ada50 sp=0xc0194ada20 pc=0x4386b1
runtime.mapiternext(0x22112c0?)
	GOROOT/src/runtime/map.go:871 +0x4eb fp=0xc0194adac0 sp=0xc0194ada50 pc=0x41006b
source.monogon.dev/metropolis/node/core/network/hostsfile.(*Service).Run(0xc001736050, {0x28b7fb0, 0xc00073a240})
	metropolis/node/core/network/hostsfile/hostsfile.go:190 +0xb0b fp=0xc0194ade48 sp=0xc0194adac0 pc=0x1cd2a8b
source.monogon.dev/metropolis/node/core/roleserve.(*workerHostsfile).run(0xc000140800, {0x28b7fb0, 0xc00073a240})
	metropolis/node/core/roleserve/worker_hostsfile.go:53 +0x2f1 fp=0xc0194adf40 sp=0xc0194ade48 pc=0x1e78af1
source.monogon.dev/metropolis/node/core/roleserve.(*workerHostsfile).run-fm({0x28b7fb0?, 0xc00073a240?})
	<autogenerated>:1 +0x39 fp=0xc0194adf68 sp=0xc0194adf40 pc=0x1e7df39
source.monogon.dev/metropolis/pkg/supervisor.(*supervisor).processSchedule.func1()
	metropolis/pkg/supervisor/supervisor_processor.go:249 +0x93 fp=0xc0194adfe0 sp=0xc0194adf68 pc=0x987b13
runtime.goexit()
	src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0194adfe8 sp=0xc0194adfe0 pc=0x46b061
created by source.monogon.dev/metropolis/pkg/supervisor.(*supervisor).processSchedule
	metropolis/pkg/supervisor/supervisor_processor.go:235 +0xea

After a reboot the node crashes while printing the hosts entries but without any error message.

Metropolis: this is /dev/ttyS1. Verbose node logs follow.

        panichandler E Failed to open core runtime log file: read-only file system
        panichandler W Continuing without persistent panic storage.
        panichandler I Panic console: /dev/tty0
        panichandler I Panic console: /dev/ttyS0
        panichandler I Panic console: /dev/ttyS1
                init I Starting Metropolis node init
                root I Board name: "[REDACTED]"
                root I No qemu fwcfg params.
                root I Retrieved node parameters from ESP
          k8s worker I Waiting for startup data...
     cplane launcher I Waiting for start data...
      k8s controller I Waiting for startup data...
           hostsfile I Waiting for curator connection...
          clusternet I Waiting for curator connection...
           rolefetch I Waiting for curator connection...
            nodemgmt I Waiting for cluster membership...
           heartbeat I Waiting for curator connection...
          net static I Configured interface "bond0"
          net static I Configured interface "enp65s0f0"
                root I Non-sealed configuration present. attempting to join cluster
                root I Joining an existing cluster.
                root I   Using TPM-secured configuration: false
                root I   Node Join public key: 1d83cf94261775598949149bd5c37b028dc313033680a0571d54a7f370cd9b0b
                root I   Directory:
                root I     Addresses:

The text was updated successfully, but these errors were encountered:

q3k · 2023-06-15T09:07:23Z

I think this is two separate bugs?

One is the concurrent map access (which should've been caught by the race detector, which we're still not running automatically in tests...).

The second is a crash when joining with an empty node directory.

q3k · 2023-06-15T09:08:18Z

Tentative fix for the first issue: https://review.monogon.dev/c/monogon/+/1817

q3k · 2023-06-15T09:13:50Z

This implements explicitly failing join if we don't have a cluster directory: https://review.monogon.dev/c/monogon/+/1818

This isn't a fix, but it should make this failure mode clearer to cluster operators.

Not sure how much time we wanna spend investigating the silent aspect of the crash. I expect it might be a quiet panic due to us not catching them so early on in the boot process. And without such a handler, the panics go straight into /dev/stderr, which in our case is likely not /dev/ttyS1.

fionera · 2023-06-15T09:23:22Z

We found the reason for the Silent fail last night. Its an issue with the BMC not being fast enough. I made a small patch I still have to push that removes printing the whole directory on startup, to reduce the amount of logs written to serial

lorenz · 2023-06-15T12:35:23Z

Closed in https://review.monogon.dev/c/monogon/+/1817

fionera added bug Something isn't working c/node Issues related to low-level node services and startup labels Jun 14, 2023

fionera assigned lorenz Jun 14, 2023

lorenz closed this as completed Jun 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

node: concurrent map read/write inside hostentry sync #225

node: concurrent map read/write inside hostentry sync #225

fionera commented Jun 14, 2023 •

edited by lorenz

Loading

q3k commented Jun 15, 2023

q3k commented Jun 15, 2023

q3k commented Jun 15, 2023 •

edited

Loading

fionera commented Jun 15, 2023

lorenz commented Jun 15, 2023

node: concurrent map read/write inside hostentry sync #225

node: concurrent map read/write inside hostentry sync #225

Comments

fionera commented Jun 14, 2023 • edited by lorenz Loading

q3k commented Jun 15, 2023

q3k commented Jun 15, 2023

q3k commented Jun 15, 2023 • edited Loading

fionera commented Jun 15, 2023

lorenz commented Jun 15, 2023

fionera commented Jun 14, 2023 •

edited by lorenz

Loading

q3k commented Jun 15, 2023 •

edited

Loading