Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic serving: concurrent write to websocket connection #19506

Closed
Kamilcuk opened this issue Dec 16, 2023 · 9 comments · Fixed by #19932
Closed

panic serving: concurrent write to websocket connection #19506

Kamilcuk opened this issue Dec 16, 2023 · 9 comments · Fixed by #19932

Comments

@Kamilcuk
Copy link
Contributor

Kamilcuk commented Dec 16, 2023

Nomad version

$ /usr/bin/nomad --version
Nomad v1.7.2
BuildDate 2023-12-13T19:59:42Z
Revision 64e3dca9274b493e38a49fda3a70fd31d0485b91

Operating system and Environment details

Archlinux.

Issue

I am executing a lot of nomad job exec API commands to test some stuff and Nomad logs some panics:

    2023-12-16T22:56:25.020+0100 [DEBUG] http: request complete: method=GET path="/v1/client/allocation/9ececd9c-a005-40e5-b2ef-38b4c2454424/exec?command=%5B%22sh%22%2C%22-c%22%
2C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+f+-name+%5C%22%243%2A%5C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+d+-name+%
5C%22%243%2A%5C%22+%7C%5Cn++++++++++++++++while+IFS%3D+read+-r+line%3B+do+echo+%5C%22%24line%2F%5C%22%3B+done%5Cn++++++++++++%22%2C%22--%22%2C%22%2Fusr%2Fbi%22%2C%22%2Fusr%2F%22
%2C%22bi%22%5D&namespace=default&region=global&task=test_nomad_cp_complete&tty=false" duration=9.321342ms                                                                        
    2023-12-16T22:56:25.020+0100 [ERROR] http: http: panic serving 127.0.0.1:50746: concurrent write to websocket connection                                                     
goroutine 3830 [running]:                                                                                                                                                        
net/http.(*conn).serve.func1()                                                                                                                                                   
        net/http/server.go:1868 +0xb9                                                                                                                                            
panic({0x28e3c80?, 0x37baf50?})                                                                                                                                                  
        runtime/panic.go:920 +0x270                                                                                                                                              
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001227428, 0x1, {0xc00170e6c0?, 0x7fe1d8062338?, 0x7fe220b35a68?})                                                   
        github.com/gorilla/[email protected]/conn.go:617 +0x4b8                                                                                                                   
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc0010bcb00, 0xc000ac8900?, {0xc00170e6c0, 0x3a, 0x3a})                                                                       
        github.com/gorilla/[email protected]/conn.go:770 +0x127                                                                                                                   
github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl(0xc000b4a0e0, 0xc0010bcb00, 0xc00118db00)                                                                  
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:652 +0x5bd                                                                                                    
github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocExec(0xc000b4a0e0, {0xc0014346da, 0x24}, {0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                    
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:519 +0x685                                                                                                    
github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest(0xc000b4a0e0, {0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                                 
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:227 +0x15d                                                                                                    
github.com/hashicorp/nomad/command/agent.(*HTTPServer).registerHandlers.(*HTTPServer).wrap.func45({0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                       
        github.com/hashicorp/nomad/command/agent/http.go:715 +0x168                                                                                                              
net/http.HandlerFunc.ServeHTTP(0xc0009be540?, {0x37d6e70?, 0xc0009b67e0?}, 0xc000efbb00?)                                                                                        
        net/http/server.go:2136 +0x29                                                                                                                                            
github.com/hashicorp/nomad/command/agent.wrapCORSWithAllowedMethods.(*Cors).Handler.func1({0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                               
        github.com/rs/[email protected]/cors.go:236 +0x184                                                                                                                             
net/http.HandlerFunc.ServeHTTP(0xc000c7df80?, {0x37d6e70?, 0xc0009b67e0?}, 0xc000d6b9e8?)                                                                                        
        net/http/server.go:2136 +0x29                                                                                                                                            
net/http.(*ServeMux).ServeHTTP(0xc000c8ea10?, {0x37d6e70, 0xc0009b67e0}, 0xc000efbb00)                                                                                           
        net/http/server.go:2514 +0x142                                                                                                                                           
github.com/hashicorp/nomad/command/agent.NewHTTPServers.CompressHandler.CompressHandlerLevel.func3({0x37d6e70?, 0xc0009b67e0}, 0xc000efbb00)
        github.com/gorilla/[email protected]/compress.go:105 +0x58d
net/http.HandlerFunc.ServeHTTP(0x51f50c0?, {0x37d6e70?, 0xc0009b67e0?}, 0xc000d6bb50?)
        net/http/server.go:2136 +0x29
net/http.serverHandler.ServeHTTP({0xc000c7df20?}, {0x37d6e70?, 0xc0009b67e0?}, 0x6?)
        net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc00100cab0, {0x37e1438, 0xc00070d860})
        net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 264
        net/http/server.go:3086 +0x5cb

When running not -dev instance, sometimes Nomad process terminates (!!!):

    2023-12-16T23:51:19.381+0100 [INFO]  client: task exec session ended: exec_id=7f5f46fc-22ef-af79-ce50-e3a442c02aed
    2023-12-16T23:51:19.381+0100 [ERROR] http: request failed: method=GET path="/v1/client/allocation/e6e0b6a7-1dbe-01b6-4c7a-a7a3e3299dc4/exec?command=%5B%22sh%22%2C%22-c%22%2C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+f+-name+%5C%22%243%2A%5C%22%5Cn++++++++++++find+%5C%22%242%5C%22+-maxdepth+1+-mindepth+1+-type+d+-name+%5C%22%243%2A%5C%22+%7C%5Cn++++++++++++++++while+IFS%3D+read+-r+line%3B+do+echo+%5C%22%24line%2F%5C%22%3B+done%5Cn++++++++++++%22%2C%22--%22%2C%22.%2Fh%22%2C%22.%2F%22%2C%22h%22%5D&namespace=default&region=global&task=test_nomad_cp_complete&tty=false" error="websocket: close 1006 (abnormal closure): unexpected EOF" code=500
panic: concurrent write to websocket connection

goroutine 1809 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc000e82e88, 0x1, {0xc000df8568?, 0xdc899b?, 0x7f5cf22aa5b8?})
        github.com/gorilla/[email protected]/conn.go:617 +0x4b8
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc000ff5a20, 0x4f15760?, {0xc000df8568, 0x2, 0x2})
        github.com/gorilla/[email protected]/conn.go:770 +0x127
github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:608 +0x48b
created by github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl in goroutine 1806
        github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:593 +0x3a6

Reproduction steps

Run nomad 1.7.2 agent -dev.

Execute a lot of:

nomad alloc exec -t=false -i=false -task=test_nomad_cp_complete e6e0b6a7-1dbe-01b6-4c7a-a7a3e3299dc4 sh -c '                                                                   
            find "$2" -maxdepth 1 -mindepth 1 -type f -name "$3*"                                                                                                                
            find "$2" -maxdepth 1 -mindepth 1 -type d -name "$3*" |                                                                                                              
                while IFS= read -r line; do echo "$line/"; done                                                                                                                  
            ' -- ./h ./h

Expected Result

There should be no exceptions in logs.

Actual Result

There are exceptions in logs and occasionally Nomad process terminates.

Reproducible on 1.7.0, 1.7.1, 1.7.2.

Not reproducible on 1.6.3.

Job file (if appropriate)

variable "script" {
  type = string
  default = ""
}
job "test-test_nomad_cp_complete" {
  type = "batch"
  meta {
    uuid = uuidv4()
  }
  group "test_nomad_cp_complete" {
    reschedule {
      attempts = 0
    }
    restart {
      attempts = 0
    }
    task "test_nomad_cp_complete" {
      driver = "raw_exec"
      config {
        command = "sh"
        args = ["-xc", <<-EOF
          exec sleep 60
          ${var.script}
        EOF
        ]
      }
    }
  }
}

Nomad logs

https://pastebin.com/5vhAPCB1

https://pastebin.com/ydJLgzHp

@tgross
Copy link
Member

tgross commented Jan 17, 2024

Hi @Kamilcuk! I've been able to repro this with a lot of attempts but don't have an obvious root cause. I know we touched the Alloc Exec API a good bit to support Actions in Nomad 1.7.0, so presumably there's something we broke there. We'll investigate and report back.

@tgross tgross self-assigned this Jan 17, 2024
@Kamilcuk
Copy link
Contributor Author

Hi, I prepared the following reproducing script called ./reproducible.sh:

#!/bin/bash
set -xeuo pipefail
if [[ ! -d nomad-tools ]]; then
	# Setup the repo
	git clone https://github.com/Kamilcuk/nomad-tools.git
	cd nomad-tools
	git checkout 44a47aee1f4330d1e075b46285a88ebbd7303bd2
	pip install -e '.[test]'
	cd ..
fi
if ! nomad status test | grep -i running; then
	# Run the job if not running.
	nomad job run - <<'EOF'
	job "test" {
	  type = "batch"
	  meta {
		uuid = uuidv4()
	  }
	  group "test" {
		task "test" {
		  driver = "raw_exec"
		  config {
			command = "sh"
			args = ["-xc", "sleep 60"]
		  }
		}
	  }
	}
EOF
	while ! nomad status test | grep -i running; do sleep 1; done
fi
if ((1)); then
	# Run the test
	cd ./nomad-tools
	./integration_tests.sh -k test_nomad_cp_complete
	cd ..
fi

Execute nomad agent -dev version 1.7.3 on one terminal, and execute ./reproducible.sh on a second. Python3 and git and pip is needed.

I am consistently getting 2024-01-18T11:43:05.885+0100 [ERROR] http: http: panic serving 127.0.0.1:40300: concurrent write to websocket connectio errors in logs. I was able to reproduce with nomad 1.7.3 on my Arch Linux 6.7.0-zen3-1-zen and on WSL2 ubuntu.

@bra-fsn
Copy link

bra-fsn commented Jan 25, 2024

I can see the same sometimes (1.7.3 running on Ubuntu 22.04). Three nomad servers federated to another three and one client attached to each of those clusters (just a testbed).

@fqborges
Copy link
Contributor

I can see the same using v1.5.11 running in Debian 11.

I have this on my journalctl

Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: panic: concurrent write to websocket connection
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: goroutine 6432 [running]:
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001542e88, 0x1, {0xc001da8110?, 0x7fc7d80c5a50?>
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/gorilla/[email protected]/conn.go:617 +0x4b8
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/gorilla/websocket.(*Conn).WriteMessage(0xc004488f20, 0x4d0f750?, {0xc001da8110, 0x2, 0x2})
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/gorilla/[email protected]/conn.go:770 +0x127
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:605 +0x48b
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]: created by github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl in goroutine 6446
Jan 29 14:33:06 ip-172-16-1-119 nomad[1421528]:         github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:590 +0x3a6
Jan 29 14:33:06 ip-172-16-1-119 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 29 14:33:06 ip-172-16-1-119 systemd[1]: nomad.service: Failed with result 'exit-code'.

@Kamilcuk
Copy link
Contributor Author

Kamilcuk commented Feb 2, 2024

This error specifically when using nomad alloc exec. nomad alloc exec executes some queries first and then right right after them connects via the websocket. If I connect straight using only the websocket, even executing a lot of them, I am not able to reproduce this bug. Bottom line, this suggests, this issue is about integration of some other endpoints with the exec endpoint.

@tgross
Copy link
Member

tgross commented Feb 8, 2024

Hey folks, sorry about the delay on this. On the surface it looks like this was introduced in #19172 which shipped in Nomad 1.7.0 (with backports to 1.6.4 and 1.5.11), and there's definitely a bug in that PR, which I'll explain below. But that bug unfortunately doesn't explain the panic.

The relevant blocks of code are alloc_endpoint.go#L589-L655 and alloc_endpoint.go#L670-L688.

In #19172, we added a check if the error returned from decoding from the websocket was one of several benign "close errors". The trouble is that this check incorrectly assumed that errors other than those with valid websocket message error codes were of type HTTPCodedError.

But that's not the cause of the panic! When I hit this error:

websocket: close 1006 (abnormal closure): unexpected EOF"

while running a build with Go's data race detection on, I see the following data race reported:

data race
==================
WARNING: DATA RACE
Read at 0x00c00019d258 by goroutine 6211:
  github.com/gorilla/websocket.(*Conn).beginMessage()
      github.com/gorilla/[email protected]/conn.go:479 +0x56
  github.com/gorilla/websocket.(*Conn).WriteMessage()
      github.com/gorilla/[email protected]/conn.go:764 +0xee
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:652 +0x975
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocExec()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:519 +0xa08
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:227 +0x2ab
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm()
      <autogenerated>:1 +0x5d
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).registerHandlers.(*HTTPServer).wrap.func45()
      github.com/hashicorp/nomad/command/agent/http.go:716 +0x1e7
  net/http.HandlerFunc.ServeHTTP()
      net/http/server.go:2136 +0x47
  github.com/hashicorp/nomad/command/agent.wrapCORSWithAllowedMethods.(*Cors).Handler.func1()
      github.com/rs/[email protected]/cors.go:236 +0x2f1
  net/http.HandlerFunc.ServeHTTP()
      net/http/server.go:2136 +0x47
  net/http.(*ServeMux).ServeHTTP()
      net/http/server.go:2514 +0xbc
  github.com/hashicorp/nomad/command/agent.NewHTTPServers.CompressHandler.CompressHandlerLevel.func3()
      github.com/gorilla/[email protected]/compress.go:105 +0x738
  net/http.HandlerFunc.ServeHTTP()
      net/http/server.go:2136 +0x47
  net/http.serverHandler.ServeHTTP()
      net/http/server.go:2938 +0x2a1
  net/http.(*conn).serve()
      net/http/server.go:2009 +0xc24
  net/http.(*Server).Serve.func3()
      net/http/server.go:3086 +0x4f

Previous write at 0x00c00019d258 by goroutine 6214:
  github.com/gorilla/websocket.(*messageWriter).endMessage()
      github.com/gorilla/[email protected]/conn.go:546 +0xce
  github.com/gorilla/websocket.(*messageWriter).flushFrame()
      github.com/gorilla/[email protected]/conn.go:633 +0xb64
  github.com/gorilla/websocket.(*Conn).WriteMessage()
      github.com/gorilla/[email protected]/conn.go:770 +0x1c4
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:608 +0x768
  github.com/hashicorp/go-msgpack/codec.(*decReaderSwitch).readn1()
      github.com/hashicorp/[email protected]/codec/decode.go:2224 +0x47
  github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).readNextBd()
      github.com/hashicorp/[email protected]/codec/msgpack.go:742 +0x6b
  github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).TryDecodeAsNil()
      github.com/hashicorp/[email protected]/codec/msgpack.go:791 +0x55
  github.com/hashicorp/go-msgpack/codec.(*Decoder).mustDecode()
      github.com/hashicorp/[email protected]/codec/decode.go:2536 +0x4b
  github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode()
      github.com/hashicorp/[email protected]/codec/decode.go:2519 +0xe5
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl.func2()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:606 +0x324

Goroutine 6211 (running) created at:
  net/http.(*Server).Serve()
      net/http/server.go:3086 +0x80c
  github.com/hashicorp/nomad/command/agent.NewHTTPServers.func1()
      github.com/hashicorp/nomad/command/agent/http.go:191 +0xc4

Goroutine 6214 (finished) created at:
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).execStreamImpl()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:593 +0x66a
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).allocExec()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:519 +0xa08
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest()
      github.com/hashicorp/nomad/command/agent/alloc_endpoint.go:227 +0x2ab
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).ClientAllocRequest-fm()
      <autogenerated>:1 +0x5d
  github.com/hashicorp/nomad/command/agent.(*HTTPServer).registerHandlers.(*HTTPServer).wrap.func45()
      github.com/hashicorp/nomad/command/agent/http.go:716 +0x1e7
  net/http.HandlerFunc.ServeHTTP()
      net/http/server.go:2136 +0x47
  github.com/hashicorp/nomad/command/agent.wrapCORSWithAllowedMethods.(*Cors).Handler.func1()
      github.com/rs/[email protected]/cors.go:236 +0x2f1
  net/http.HandlerFunc.ServeHTTP()
      net/http/server.go:2136 +0x47
  net/http.(*ServeMux).ServeHTTP()
      net/http/server.go:2514 +0xbc
  github.com/hashicorp/nomad/command/agent.NewHTTPServers.CompressHandler.CompressHandlerLevel.func3()
      github.com/gorilla/[email protected]/compress.go:105 +0x738
  net/http.HandlerFunc.ServeHTTP()
      net/http/server.go:2136 +0x47
  net/http.serverHandler.ServeHTTP()
      net/http/server.go:2938 +0x2a1
  net/http.(*conn).serve()
      net/http/server.go:2009 +0xc24
  net/http.(*Server).Serve.func3()
      net/http/server.go:3086 +0x4f
==================

Which means these two writes are happening at the same time: alloc_endpoint.go#L608 and alloc_endpoint.go#L628. The write on line 652 shouldn't be happening until we send on the errCh, at which point the WriteMessage call on line 608 should already have completed.

So that's puzzling. I've got a fairly straightforward fix for the error handling bug. What I'm going to try to do next is move the WriteMessage on line 652 up into the same goroutine as all the other writes. If we can hit the bug even in that case, there's likely a bug in the upstream library that we may need to code around. Will pick that up tomorrow.

@tgross
Copy link
Member

tgross commented Feb 9, 2024

Draft PR with the fix is here: #19932 but I'm working up a test for it before I mark that ready for review.

@tgross
Copy link
Member

tgross commented Feb 12, 2024

#19932 has been merged and will ship in the next version of Nomad 1.7.x, with backports to 1.6.x and 1.5.x

Copy link

github-actions bot commented Jan 1, 2025

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 1, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

Successfully merging a pull request may close this issue.

4 participants