
node: concurrent map read/write inside hostentry sync #225

Closed
fionera opened this issue Jun 14, 2023 · 5 comments
Labels
bug Something isn't working c/node Issues related to low-level node services and startup

Comments

fionera (Contributor) commented Jun 14, 2023

After adding nodes fairly quickly we encountered a crash:

           hostsfile I Hosts entry:fatal error: concurrent map iteration and map write

goroutine 276 [running]:
runtime.throw({0x254e1d0?, 0x5?})
	GOROOT/src/runtime/panic.go:992 +0x71 fp=0xc0194ada50 sp=0xc0194ada20 pc=0x4386b1
runtime.mapiternext(0x22112c0?)
	GOROOT/src/runtime/map.go:871 +0x4eb fp=0xc0194adac0 sp=0xc0194ada50 pc=0x41006b
source.monogon.dev/metropolis/node/core/network/hostsfile.(*Service).Run(0xc001736050, {0x28b7fb0, 0xc00073a240})
	metropolis/node/core/network/hostsfile/hostsfile.go:190 +0xb0b fp=0xc0194ade48 sp=0xc0194adac0 pc=0x1cd2a8b
source.monogon.dev/metropolis/node/core/roleserve.(*workerHostsfile).run(0xc000140800, {0x28b7fb0, 0xc00073a240})
	metropolis/node/core/roleserve/worker_hostsfile.go:53 +0x2f1 fp=0xc0194adf40 sp=0xc0194ade48 pc=0x1e78af1
source.monogon.dev/metropolis/node/core/roleserve.(*workerHostsfile).run-fm({0x28b7fb0?, 0xc00073a240?})
	<autogenerated>:1 +0x39 fp=0xc0194adf68 sp=0xc0194adf40 pc=0x1e7df39
source.monogon.dev/metropolis/pkg/supervisor.(*supervisor).processSchedule.func1()
	metropolis/pkg/supervisor/supervisor_processor.go:249 +0x93 fp=0xc0194adfe0 sp=0xc0194adf68 pc=0x987b13
runtime.goexit()
	src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0194adfe8 sp=0xc0194adfe0 pc=0x46b061
created by source.monogon.dev/metropolis/pkg/supervisor.(*supervisor).processSchedule
	metropolis/pkg/supervisor/supervisor_processor.go:235 +0xea

After a reboot, the node crashes while printing the hosts entries, but without any error message.

Metropolis: this is /dev/ttyS1. Verbose node logs follow.

        panichandler E Failed to open core runtime log file: read-only file system
        panichandler W Continuing without persistent panic storage.
        panichandler I Panic console: /dev/tty0
        panichandler I Panic console: /dev/ttyS0
        panichandler I Panic console: /dev/ttyS1
                init I Starting Metropolis node init
                root I Board name: "[REDACTED]"
                root I No qemu fwcfg params.
                root I Retrieved node parameters from ESP
          k8s worker I Waiting for startup data...
     cplane launcher I Waiting for start data...
      k8s controller I Waiting for startup data...
           hostsfile I Waiting for curator connection...
          clusternet I Waiting for curator connection...
           rolefetch I Waiting for curator connection...
            nodemgmt I Waiting for cluster membership...
           heartbeat I Waiting for curator connection...
          net static I Configured interface "bond0"
          net static I Configured interface "enp65s0f0"
                root I Non-sealed configuration present. attempting to join cluster
                root I Joining an existing cluster.
                root I   Using TPM-secured configuration: false
                root I   Node Join public key: 1d83cf94261775598949149bd5c37b028dc313033680a0571d54a7f370cd9b0b
                root I   Directory:
                root I     Addresses:
@fionera fionera added bug Something isn't working c/node Issues related to low-level node services and startup labels Jun 14, 2023
q3k (Contributor) commented Jun 15, 2023

I think this is two separate bugs?

One is the concurrent map access (which should've been caught by the race detector, which we're still not running automatically in tests...).

The second is a crash when joining with an empty node directory.
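The usual fix for this class of crash is to guard the shared map with a mutex and hand iterators a copy. A minimal sketch of that pattern, using hypothetical names (`hostsEntries`, `set`, `snapshot`) rather than the actual Metropolis types:

```go
package main

import (
	"fmt"
	"sync"
)

// hostsEntries is a hypothetical stand-in for the node->address map that
// the hostsfile service iterates over while other goroutines update it.
type hostsEntries struct {
	mu    sync.RWMutex
	hosts map[string]string
}

// set writes an entry under the write lock.
func (h *hostsEntries) set(name, addr string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.hosts == nil {
		h.hosts = make(map[string]string)
	}
	h.hosts[name] = addr
}

// snapshot copies the map under the read lock, so callers can iterate
// (e.g. to render hosts entries) without racing concurrent writers.
func (h *hostsEntries) snapshot() map[string]string {
	h.mu.RLock()
	defer h.mu.RUnlock()
	out := make(map[string]string, len(h.hosts))
	for k, v := range h.hosts {
		out[k] = v
	}
	return out
}

func main() {
	h := &hostsEntries{}
	var wg sync.WaitGroup
	// Concurrent writers; without the mutex, iterating h.hosts at the
	// same time would trip "concurrent map iteration and map write".
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			h.set(fmt.Sprintf("node-%d", i), fmt.Sprintf("10.0.0.%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(h.snapshot()), "entries")
}
```

Running the suite with `go test -race ./...` would flag the unguarded access pattern before it crashes in production.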

q3k (Contributor) commented Jun 15, 2023

Tentative fix for the first issue: https://review.monogon.dev/c/monogon/+/1817

q3k (Contributor) commented Jun 15, 2023

This implements explicitly failing join if we don't have a cluster directory: https://review.monogon.dev/c/monogon/+/1818

This isn't a fix, but it should make this failure mode clearer to cluster operators.

Not sure how much time we wanna spend investigating the silent aspect of the crash. I expect it might be a quiet panic due to us not catching them so early on in the boot process. And without such a handler, the panics go straight into /dev/stderr, which in our case is likely not /dev/ttyS1.
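An explicit check of that kind might look like the sketch below; the names (`clusterDirectory`, `checkJoinable`) are illustrative, not the actual change in the CL:

```go
package main

import (
	"errors"
	"fmt"
)

// clusterDirectory is a hypothetical stand-in for the directory of known
// cluster node addresses a joining node receives.
type clusterDirectory struct {
	addresses []string
}

// checkJoinable fails explicitly when the directory carries no addresses,
// instead of letting the join proceed and crash later without a useful
// message for the operator.
func checkJoinable(dir *clusterDirectory) error {
	if dir == nil || len(dir.addresses) == 0 {
		return errors.New("refusing to join cluster: node directory contains no addresses")
	}
	return nil
}

func main() {
	if err := checkJoinable(&clusterDirectory{}); err != nil {
		fmt.Println("join aborted:", err)
	}
}
```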

fionera (Contributor, Author) commented Jun 15, 2023

We found the reason for the silent failure last night. It's an issue with the BMC not being fast enough. I made a small patch (which I still have to push) that removes printing the whole directory on startup, to reduce the amount of logs written to serial.

lorenz (Contributor) commented Jun 15, 2023

@lorenz lorenz closed this as completed Jun 15, 2023