Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[🐛 BUG]: RR [v2.5.7] doesn't construct new workers after call resetting command #1168

Closed
1 task done
obrazkowv opened this issue Jun 10, 2022 · 60 comments
Closed
1 task done
Assignees
Labels
B-bug Bug: bug, exception F-need-verification

Comments

@obrazkowv
Copy link

obrazkowv commented Jun 10, 2022

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

Sometimes i faced with issue when rr doesn't start workers after resetting

I use the command for reloading application after deploying new version

php -r '
require_once "/var/www/..../vendor/autoload.php";
$rpc = \Spiral\Goridge\RPC\RPC::create("tcp://127.0.0.1:6001");
$rpc->call("resetter.Reset", "http");
' 2> /dev/null && echo "roadrunner restarted"

And this works fine. In logs of the roadrunner, i can find such logs.

{"level":"info","ts":1654876751.0767014,"logger":"http","msg":"HTTP plugin got restart request. Restarting..."}
{"level":"debug","ts":1654876751.482103,"logger":"server","msg":"worker constructed","pid":18535}
{"level":"debug","ts":1654876751.7076323,"logger":"server","msg":"worker constructed","pid":18539}
{"level":"debug","ts":1654876751.92825,"logger":"server","msg":"worker constructed","pid":18543}
{"level":"debug","ts":1654876752.1424391,"logger":"server","msg":"worker constructed","pid":18547}
{"level":"debug","ts":1654876752.3567128,"logger":"server","msg":"worker constructed","pid":18551}
{"level":"debug","ts":1654876752.5836997,"logger":"server","msg":"worker constructed","pid":18555}
{"level":"debug","ts":1654876752.7970486,"logger":"server","msg":"worker constructed","pid":18569}
{"level":"debug","ts":1654876753.012394,"logger":"server","msg":"worker constructed","pid":18573}
{"level":"debug","ts":1654876753.2264977,"logger":"server","msg":"worker constructed","pid":18579}
{"level":"debug","ts":1654876753.439048,"logger":"server","msg":"worker constructed","pid":18583}
{"level":"debug","ts":1654876753.6546476,"logger":"server","msg":"worker constructed","pid":18587}
{"level":"debug","ts":1654876753.8703601,"logger":"server","msg":"worker constructed","pid":18591}
{"level":"info","ts":1654876753.8704066,"logger":"http","msg":"HTTP workers Pool successfully restarted"}
{"level":"info","ts":1654876753.8704116,"logger":"http","msg":"HTTP handler listeners successfully re-added"}
{"level":"info","ts":1654876753.8704147,"logger":"http","msg":"HTTP plugin successfully restarted"}

But sometimes when i tried to reload workers i faced with unexpected behavior when rr doesn't construct new workers

{"level":"info","ts":1654865797.447119,"logger":"http","msg":"HTTP plugin got restart request. Restarting..."}
{"level":"debug","ts":1654865947.0437615,"logger":"rpc","msg":"Started RPC service","address":"tcp://127.0.0.1:6001","plugins":["informer","resetter"]}
{"level":"debug","ts":1654865947.3511188,"logger":"server","msg":"worker constructed","pid":26076}
{"level":"debug","ts":1654865947.6003292,"logger":"server","msg":"worker constructed","pid":26090}
{"level":"debug","ts":1654865947.8625834,"logger":"server","msg":"worker constructed","pid":26094}
{"level":"debug","ts":1654865948.083429,"logger":"server","msg":"worker constructed","pid":26098}
{"level":"debug","ts":1654865948.2959368,"logger":"server","msg":"worker constructed","pid":26102}
{"level":"debug","ts":1654865948.5081518,"logger":"server","msg":"worker constructed","pid":26106}
{"level":"debug","ts":1654865948.7213166,"logger":"server","msg":"worker constructed","pid":26110}
{"level":"debug","ts":1654865948.9349105,"logger":"server","msg":"worker constructed","pid":26114}
{"level":"debug","ts":1654865949.1486824,"logger":"server","msg":"worker constructed","pid":26118}
{"level":"debug","ts":1654865949.3617551,"logger":"server","msg":"worker constructed","pid":26122}
{"level":"debug","ts":1654865949.5777884,"logger":"server","msg":"worker constructed","pid":26127}
{"level":"debug","ts":1654865949.7925146,"logger":"server","msg":"worker constructed","pid":26131}

Construction of new workers was after restarting the service

And that's all. It seems like reload plugin was stopped by something. After that, i need to restart the service completely.

Also, i have such logs in syslog that appeared after restarting the service, but i don't know does it relate to the issue or not

rr[24396]: panic: runtime error: invalid memory address or nil pointer dereference
rr[24396]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0xd77dbd]
rr[24396]: goroutine 55954476 [running]:
rr[24396]: github.com/spiral/roadrunner-plugins/v2/http.(*Plugin).workers(...)
rr[24396]: #011github.com/spiral/roadrunner-plugins/[email protected]/http/plugin.go:358
rr[24396]: github.com/spiral/roadrunner-plugins/v2/http.(*Plugin).Workers(0x1529d40)
rr[24396]: #011github.com/spiral/roadrunner-plugins/[email protected]/http/plugin.go:342 +0xbd
rr[24396]: github.com/spiral/roadrunner-plugins/v2/informer.(*Plugin).Workers(...)
rr[24396]: #011github.com/spiral/roadrunner-plugins/[email protected]/informer/plugin.go:38
rr[24396]: github.com/spiral/roadrunner-plugins/v2/informer.(*rpc).Workers(0x2, {0xc000d5f490, 0x1}, 0xc000da9b78)
rr[24396]: #011github.com/spiral/roadrunner-plugins/[email protected]/informer/rpc.go:31 +0x5e
rr[24396]: reflect.Value.call({0xc0007b7500, 0xc00000f1b0, 0x13}, {0x17222dd, 0x4}, {0xc000befef8, 0x3, 0x3})
rr[24396]: #011reflect/value.go:543 +0x814
rr[24396]: reflect.Value.Call({0xc0007b7500, 0xc00000f1b0, 0x0}, {0xc000db46f8, 0x3, 0x3})
rr[24396]: #011reflect/value.go:339 +0xc5
rr[24396]: net/rpc.(*service).call(0xc0007a3700, 0xc000db47b0, 0xd79dd9, 0xc000d5ede0, 0xc0007ba300, 0xc000774360, {0x14c2860, 0xc000f168e0, 0x15ef4e0}, {0x156fa00, ...}, ...)
rr[24396]: #011net/rpc/server.go:377 +0x239
rr[24396]: created by net/rpc.(*Server).ServeCodec
rr[24396]: #011net/rpc/server.go:474 +0x405
systemd[1]: rr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: rr.service: Failed with result 'exit-code'.
systemd[1]: Stopped High-performance PHP application server.
systemd[1]: Started High-performance PHP application server.

PHP7.4 (opcache_cli enabled without storing in files)
Ubuntu 18.04 x86_64
rr version 2.5.7 (build time: 2021-11-13T16:43:25+0000, go1.17.2)

server:
  command: 'php /var/www/...../worker.php'
  relay: 'pipes'

rpc:
  listen: tcp://127.0.0.1:6001

http:
  address: 10.0.6.50:80
  pool:
    max_jobs: 50000
    num_workers: 12

logs:
  mode: production
  level: debug
  file_logger_options:
    log_output: /var/log/rr/access.log
    max_size: 100
    max_age: 48
    compress: true

Maybe i do something wrong when i try to reload workers?
Thank you for any help.

Version

2.5.7

Relevant log output

No response

@rustatian
Copy link
Member

Hey @obrazkowv 👋🏻 . Could you please try the 2.10.4-rc.1 bugfix release?

@rustatian
Copy link
Member

Also, rr version 2.5.7 is unsupported. Please, upgrade to the most recent version: v2.10.3

@rustatian rustatian changed the title RR doesn't construct new workers after call resetting command [🐛 BUG]: [🐛 BUG]: RR doesn't construct new workers after call resetting command Jun 10, 2022
@rustatian rustatian changed the title [🐛 BUG]: RR doesn't construct new workers after call resetting command [🐛 BUG]: RR [v2.5.7] doesn't construct new workers after call resetting command Jun 10, 2022
@obrazkowv
Copy link
Author

This issue is difficult to reproduce. It appears at different moment of time and i don't know what is the cause of this issue.
Can you help me figure out or decrease the range of possible problems? Is operation atomic? Can be this operation interrupted or once the line "HTTP plugin got restart request. Restarting" was logged it must be successfully reloaded in any case ?

@obrazkowv
Copy link
Author

I removed redirect stderr to /dev/null from resetting script and started to reset rr workers. After that i found, that sometimes i get different errors:

PHP Fatal error:  Uncaught Spiral\Goridge\RPC\Exception\ServiceException: Error 'resetter_rpc_reset: WorkerAllocate:
	http_plugin_reset:
	goridge_frame_receive: EOF' on tcp://127.0.0.1:6001 in /var/www/release/1654891148_72fe3eeb-b29c-438d-b650-45a9b8e0e2a1/..../vendor/spiral/goridge/src/RPC/RPC.php:128
Stack trace:
#0 /var/www/release/1654891148_72fe3eeb-b29c-438d-b650-45a9b8e0e2a1/..../vendor/spiral/goridge/src/RPC/RPC.php(98): Spiral\Goridge\RPC\RPC->decodeResponse() 

PHP Fatal error: Uncaught Spiral\Goridge\Exception\RelayException: Connection refused in /var/www/release/1654890903_e93bd70a-495c-4e8c-839f-e63c0bc12bad/..../vendor/spiral/goridge/src/SocketRelay.php

PHP Fatal error:  Uncaught Spiral\Goridge\Exception\HeaderException: Unable to read frame header: Success in /var/www/release/1654890879_ca3ca3a1-7ed6-4337-88e4-8ecaf4922a84/...../vendor/spiral/goridge/src/SocketRelay.php:193
Stack trace:
#0 /var/www/release/1654890879_ca3ca3a1-7ed6-4337-88e4-8ecaf4922a84/...../vendor/spiral/goridge/src/RPC/RPC.php(86): Spiral\Goridge\SocketRelay->waitFrame()
#1 Command line code(4): Spiral\Goridge\RPC\RPC->call()

@keywan-ghadami-oxid
Copy link

@obrazkowv was that after upgrade or still with the old version?

@obrazkowv
Copy link
Author

obrazkowv commented Jun 10, 2022

I was still on same version, but few minutes ago i upgraded the version to "rr version 2.10.3 (build time: 2022-06-02T10:51:28+0000, go1.18.2)" and even after that i faced with issue.

P.S. i found, that i have different version between rr binary and php lib. The version of php lib was 2.0.0, that already outdated and besides this, it mismatched to the version of the rr binary. So i will try to reproduce this problem with same version (2.10.3)

@obrazkowv
Copy link
Author

After upgrading to 2.10.3, unfortunately, nothing changed

PHP Fatal error:  Uncaught Spiral\Goridge\RPC\Exception\ServiceException: Error 'resetter_rpc_reset: WorkerAllocate:
	http_plugin_reset:
	static_pool_allocate_workers: EOF' on tcp://127.0.0.1:6001 in /var/www/release/1654895170_2c910500-12dd-479a-9b04-df78feed7509/..../vendor/spiral/goridge/src/RPC/RPC.php:128
Stack trace:
#0 /var/www/release/1654895170_2c910500-12dd-479a-9b04-df78feed7509/..../vendor/spiral/goridge/src/RPC/RPC.php(98): Spiral\Goridge\RPC\RPC->decodeResponse()
#1 Command line code(4): Spiral\Goridge\RPC\RPC->call()
#2 {main}
  thrown in /var/www/release/1654895170_2c910500-12dd-479a-9b04-df78feed7509/..../vendor/spiral/goridge/src/RPC/RPC.php on line 128

That's what i am doing, maybe it can help:
I use for project symbolic link for releasing new deploys, so i extract project in some directory, usually it's unique hash (1654895170_2c910500-12dd-479a-9b04-df78feed7509) and after i change symbolic link to this directory and call reset api.

I will try to debug what is happening via strace, maybe i will find something that seems useful and finally i will do tcpdump of 6001 port(rcp listener)

@obrazkowv
Copy link
Author

I started to call rr reset and at some moment it was stuck without any response. And any other command doesn't respond(rr workers). I have simple endpoint, that always answer with 200 code and i wrote the bash script, in which there is infinite loop that makes 10 calls simultaneously each iteration

bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:30 resetting plugin: [http]
2022/06/10 21:31:32 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:34 resetting plugin: [http]
2022/06/10 21:31:36 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:36 resetting plugin: [http]
2022/06/10 21:31:38 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:39 resetting plugin: [http]
2022/06/10 21:31:41 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:42 resetting plugin: [http]
2022/06/10 21:31:44 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:44 resetting plugin: [http]
2022/06/10 21:31:46 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:47 resetting plugin: [http]
2022/06/10 21:31:49 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:49 resetting plugin: [http]
2022/06/10 21:31:51 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:52 resetting plugin: [http]
2022/06/10 21:31:54 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:55 resetting plugin: [http]
2022/06/10 21:31:57 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:31:57 resetting plugin: [http]
2022/06/10 21:31:59 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:00 resetting plugin: [http]
2022/06/10 21:32:02 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:02 resetting plugin: [http]
2022/06/10 21:32:04 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:05 resetting plugin: [http]
2022/06/10 21:32:07 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:08 resetting plugin: [http]
2022/06/10 21:32:10 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:11 resetting plugin: [http]
2022/06/10 21:32:13 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:13 resetting plugin: [http]
2022/06/10 21:32:15 plugin reset: [http]
bar@baz:/etc/roadrunner-app$ rr reset
2022/06/10 21:32:15 resetting plugin: [http]

This logs i found in syslog

Jun 10 21:26:45 ip-..... rr[5644]: 2022-06-10T21:26:45.349Z#011ERROR#011container/stop.go:147#011vertices which are not stopped#011{"id": ["config.Plugin", "logger.Plugin", "server.Plugin", "http.Plugin", "
gzip.Plugin", "prometheus.Plugin", "send.Plugin", "memory.Plugin", "boltdb.Plugin", "redis.Plugin"]}
Jun 10 21:26:45 ip-..... rr[5644]: github.com/roadrunner-server/endure/pkg/container.(*Endure).shutdown
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/[email protected]/pkg/container/stop.go:147
Jun 10 21:26:45 ip-..... rr[5644]: github.com/roadrunner-server/endure/pkg/container.(*Endure).Shutdown
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/[email protected]/pkg/container/endure.go:463
Jun 10 21:26:45 ip-..... rr[5644]: reflect.Value.call
Jun 10 21:26:45 ip-..... rr[5644]: #011reflect/value.go:556
Jun 10 21:26:45 ip-..... rr[5644]: reflect.Value.Call
Jun 10 21:26:45 ip-..... rr[5644]: #011reflect/value.go:339
Jun 10 21:26:45 ip-..... rr[5644]: github.com/roadrunner-server/endure/pkg/fsm.(*FSMImpl).Transition
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/[email protected]/pkg/fsm/fsm.go:216
Jun 10 21:26:45 ip-..... rr[5644]: github.com/roadrunner-server/endure/pkg/container.(*Endure).Stop
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/[email protected]/pkg/container/endure.go:335
Jun 10 21:26:45 ip-..... rr[5644]: github.com/roadrunner-server/roadrunner/v2/internal/cli/serve.NewCommand.func1
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/roadrunner/v2/internal/cli/serve/command.go:103
Jun 10 21:26:45 ip-..... rr[5644]: github.com/spf13/cobra.(*Command).execute
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/spf13/[email protected]/command.go:856
Jun 10 21:26:45 ip-..... rr[5644]: github.com/spf13/cobra.(*Command).ExecuteC
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/spf13/[email protected]/command.go:974
Jun 10 21:26:45 ip-..... rr[5644]: github.com/spf13/cobra.(*Command).Execute
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/spf13/[email protected]/command.go:902
Jun 10 21:26:45 ip-..... rr[5644]: main.run
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/roadrunner/v2/cmd/rr/main.go:21
Jun 10 21:26:45 ip-..... rr[5644]: main.main
Jun 10 21:26:45 ip-..... rr[5644]: #011github.com/roadrunner-server/roadrunner/v2/cmd/rr/main.go:15
Jun 10 21:26:45 ip-..... rr[5644]: runtime.main
Jun 10 21:26:45 ip-..... rr[5644]: #011runtime/proc.go:250
Jun 10 21:26:45 ip-..... rr[5644]: error: endure_shutdown: Timeout: timeout exceed, some vertices may not be stopped and can cause memory leak
Jun 10 21:26:45 ip-..... systemd[1]: rr.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 21:26:45 ip-..... systemd[1]: rr.service: Failed with result 'exit-code'.
Jun 10 21:26:45 ip-..... systemd[1]: Stopped High-performance PHP application server.
Jun 10 21:26:45 ip-..... systemd[1]: Started High-performance PHP application server.
Jun 10 21:26:45 ip-..... rr[12803]: [INFO] RoadRunner server started; version: 2.10.3, buildtime: 2022-06-02T10:51:28+0000

And that what i found in rr logs

{"level":"info","ts":1654896733.7360613,"logger":"http","msg":"reset signal was received"}
{"level":"debug","ts":1654896733.8988416,"logger":"server","msg":"worker destroyed","pid":5786,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.8988907,"logger":"server","msg":"worker destroyed","pid":5783,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.8989246,"logger":"server","msg":"worker destroyed","pid":5785,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.898958,"logger":"server","msg":"worker destroyed","pid":5788,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.8989766,"logger":"server","msg":"worker destroyed","pid":5781,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.8990352,"logger":"server","msg":"worker destroyed","pid":5782,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.9004345,"logger":"server","msg":"worker destroyed","pid":5784,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.9007757,"logger":"server","msg":"worker destroyed","pid":5779,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.900811,"logger":"server","msg":"worker destroyed","pid":5780,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.9019005,"logger":"server","msg":"worker destroyed","pid":5789,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.9024315,"logger":"server","msg":"worker destroyed","pid":5787,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1654896733.9025087,"logger":"server","msg":"worker destroyed","pid":5778,"internal_event_name":"EventWorkerDestruct"}
{"level":"error","ts":1654896795.1365736,"logger":"server","msg":"no free workers in the pool, wait timeout exceed","reason":"no free workers","internal_event_name":"EventNoFreeWorkers","error":"worker_watcher_g
et_free_worker: NoFreeWorkers:\n\tcontext deadline exceeded"}
{"level":"error","ts":1654896795.1366465,"logger":"http","msg":"execute","start":1654896735.1363375,"elapsed":60.000307717,"error":"static_pool_exec: NoFreeWorkers:\n\tstatic_pool_exec:\n\tworker_watcher_get_fre
e_worker:\n\tcontext deadline exceeded"}
{"level":"info","ts":1654896795.136754,"logger":"http","msg":"reset signal was received"}

@rustatian
Copy link
Member

rustatian commented Jun 10, 2022

@obrazkowv Could you please update to v2.10.4-rc.1 ?

EDIT: v2.10.4

@rustatian
Copy link
Member

and after i change symbolic link to this directory and call reset API

RR won't re-read a new path when you update a symlink. You have to restart RR in that case.

The version of php lib was 2.0.0

What's the name of the lib?

@obrazkowv
Copy link
Author

RR won't re-read a new path when you update a symlink. You have to restart RR in that case.

What do you mean when you said restart rr? Do you mean reset workers or i need to restart the application fully? If i need to restart the application by stopping the serving process, how can i avoid downtime during restarting the application?

What's the name of the lib?

I mean spiral/roadrunner lib, that use in the application(workers)

@obrazkowv
Copy link
Author

@obrazkowv Could you please update to v2.10.4-rc.1 ?

EDIT: v2.10.4

No problem, i will try to test with new version today lately or tomorrow

@rustatian
Copy link
Member

rustatian commented Jun 11, 2022

@obrazkowv

What do you mean when you said restart rr? Do you mean reset workers or i need to restart the application fully? If i need to restart the application by stopping the serving process, how can i avoid downtime during restarting the application?

I'm not quite sure what are doing. You may explain your approach if you want.

I mean spiral/roadrunner lib, that use in the application(workers)

It's not quite necessary to have the same version. IIRC there were no BCs between RR-PHP and RR server.

@rustatian rustatian moved this to Backlog in General Jun 11, 2022
@obrazkowv
Copy link
Author

@obrazkowv

What do you mean when you said restart rr? Do you mean reset workers or i need to restart the application fully? If i need to restart the application by stopping the serving process, how can i avoid downtime during restarting the application?

I'm not quite sure what are doing. You may explain your approach if you want.

I faced with same behavior after upgrading to 2.10.4. I will try to reproduce this issue from scratch project, but i think that nothing will change because the answer from go is wrong as i think

For reproducing this issue, i start the app, and in another tab start simple bash script, that creates calls to the app(i think, that ab or wrk also will be good for this purpose) and finally, i do reset of workers until i face with error

#!/bin/bash

for i in {1..50000}
do
  curl --connect-timeout 1 10.0.6.232 & curl --connect-timeout 1 10.0.6.232 & curl --connect-timeout 1 10.0.6.232 
done
php -r '
require_once "/var/www/...../vendor/autoload.php";
$rpc = \Spiral\Goridge\RPC\RPC::create("tcp://127.0.0.1:6001");
$rpc->call("resetter.Reset", "http");
'
PHP Fatal error:  Uncaught Spiral\Goridge\RPC\Exception\ServiceException: Error 'resetter_rpc_reset: WorkerAllocate:
  	http_plugin_reset:
  	static_pool_allocate_workers: EOF' on tcp://127.0.0.1:6001 in /var/www/release/1654895170_2c910500-12dd-479a-9b04-df78feed7509/.../vendor/spiral/goridge/src/RPC/RPC.php:128
  Stack trace:
  #0 /var/www/release/1654895170_2c910500-12dd-479a-9b04-df78feed7509/.../vendor/spiral/goridge/src/RPC/RPC.php(98): Spiral\Goridge\RPC\RPC->decodeResponse()
  #1 Command line code(4): Spiral\Goridge\RPC\RPC->call()
  #2 {main}
    thrown in /var/www/release/1654895170_2c910500-12dd-479a-9b04-df78feed7509/.../vendor/spiral/goridge/src/RPC/RPC.php on line 128

@obrazkowv
Copy link
Author

I was able to reproduce this with code from scratch

<?php

use Nyholm\Psr7\Factory\Psr17Factory;
use Nyholm\Psr7\Response;
use Spiral\RoadRunner\Http\PSR7Worker;
use Spiral\RoadRunner\Worker;

require_once __DIR__ . '/vendor/autoload.php';

$worker = Worker::create();
$psrFactory = new Psr17Factory();
$worker = new PSR7Worker($worker, $psrFactory, $psrFactory, $psrFactory);

while (true) {
    try {
        $request = $worker->waitRequest();
        $worker->respond(new Response(200, [], 'Hello RoadRunner!'));
    } catch (Throwable $e) {
        $worker->respond(new Response(400, [], 'Error'));
    }
}

@obrazkowv
Copy link
Author

Also, i found, that you can skip sending requests and just create the loop of resetting commands.

So, i started the rr and started script

for i in {1..5000}; do rr reset; done;

And script stuck on receiving the signal

2022/06/15 19:26:06 resetting plugin: [http] 
2022/06/15 19:26:06 plugin reset: [http]
2022/06/15 19:26:06 resetting plugin: [http] 
2022/06/15 19:26:07 plugin reset: [http]
2022/06/15 19:26:07 resetting plugin: [http] 
2022/06/15 19:26:07 plugin reset: [http]
2022/06/15 19:26:07 resetting plugin: [http] 
2022/06/15 19:26:07 plugin reset: [http]
2022/06/15 19:26:07 resetting plugin: [http] 
^C
{"level":"info","ts":1655321167.1824,"logger":"http","msg":"plugin was successfully reset"}
{"level":"info","ts":1655321167.2000213,"logger":"http","msg":"reset signal was received"}
{"level":"debug","ts":1655321167.2285714,"logger":"server","msg":"worker destroyed","pid":11704,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2305667,"logger":"server","msg":"worker destroyed","pid":11697,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2307086,"logger":"server","msg":"worker destroyed","pid":11702,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2307572,"logger":"server","msg":"worker destroyed","pid":11699,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.230815,"logger":"server","msg":"worker destroyed","pid":11698,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2317677,"logger":"server","msg":"worker destroyed","pid":11700,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2355788,"logger":"server","msg":"worker destroyed","pid":11708,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2356284,"logger":"server","msg":"worker destroyed","pid":11705,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2356622,"logger":"server","msg":"worker destroyed","pid":11701,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.235707,"logger":"server","msg":"worker destroyed","pid":11707,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2357373,"logger":"server","msg":"worker destroyed","pid":11703,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.2360013,"logger":"server","msg":"worker destroyed","pid":11706,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.4157066,"logger":"server","msg":"worker is allocated","pid":11751,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.4899836,"logger":"server","msg":"worker is allocated","pid":11750,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.4980216,"logger":"server","msg":"worker is allocated","pid":11753,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.5465815,"logger":"server","msg":"worker is allocated","pid":11752,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.5658998,"logger":"server","msg":"worker is allocated","pid":11756,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.5740101,"logger":"server","msg":"worker is allocated","pid":11754,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.5841622,"logger":"server","msg":"worker is allocated","pid":11759,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.598021,"logger":"server","msg":"worker is allocated","pid":11755,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.6086805,"logger":"server","msg":"worker is allocated","pid":11760,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.616357,"logger":"server","msg":"worker is allocated","pid":11758,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.616763,"logger":"server","msg":"worker is allocated","pid":11757,"internal_event_name":"EventWorkerConstruct"}
{"level":"debug","ts":1655321167.621884,"logger":"server","msg":"worker is allocated","pid":11764,"internal_event_name":"EventWorkerConstruct"}
{"level":"info","ts":1655321167.6219273,"logger":"http","msg":"plugin was successfully reset"}
{"level":"info","ts":1655321167.6395772,"logger":"http","msg":"reset signal was received"}
{"level":"debug","ts":1655321167.675687,"logger":"server","msg":"worker destroyed","pid":11759,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6840167,"logger":"server","msg":"worker destroyed","pid":11753,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6897082,"logger":"server","msg":"worker destroyed","pid":11757,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6897633,"logger":"server","msg":"worker destroyed","pid":11760,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6897862,"logger":"server","msg":"worker destroyed","pid":11764,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.68981,"logger":"server","msg":"worker destroyed","pid":11755,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6898305,"logger":"server","msg":"worker destroyed","pid":11756,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6898513,"logger":"server","msg":"worker destroyed","pid":11758,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6898746,"logger":"server","msg":"worker destroyed","pid":11752,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.689904,"logger":"server","msg":"worker destroyed","pid":11751,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.689997,"logger":"server","msg":"worker destroyed","pid":11750,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.6900198,"logger":"server","msg":"worker destroyed","pid":11754,"internal_event_name":"EventWorkerDestruct"}
{"level":"debug","ts":1655321167.8721511,"logger":"server","msg":"worker is allocated","pid":11829,"internal_event_name":"EventWorkerConstruct"}
{"level":"info","ts":1655321167.890679,"logger":"http","msg":"reset signal was received"}

@rustatian
Copy link
Member

It's not possible to reset workers from workers.

@rustatian
Copy link
Member

I mean, you should use the ./rr reset command, but not the same command from the worker.

@obrazkowv
Copy link
Author

what do you mean?
I did reset from another shell. In one shell i started rr process and from another, i did the loop in which sequentially do resetting of workers

@rustatian
Copy link
Member

I started the same command on the server, which worked in the loop for eight days.

@rustatian
Copy link
Member

So, first of all, lets clarify some things:

php -r '
require_once "/var/www/...../vendor/autoload.php";
$rpc = \Spiral\Goridge\RPC\RPC::create("tcp://127.0.0.1:6001");
$rpc->call("resetter.Reset", "http");
'

Doing reset from the PHP is not possible because this command wasn't designed to work with PHP.

@obrazkowv
Copy link
Author

Can the power of server be related to the problem? I am testing this behavior on aws t3.medium. But I doubt this...

@obrazkowv
Copy link
Author

So, first of all, lets clarify some things:

php -r '
require_once "/var/www/...../vendor/autoload.php";
$rpc = \Spiral\Goridge\RPC\RPC::create("tcp://127.0.0.1:6001");
$rpc->call("resetter.Reset", "http");
'

Doing reset from the PHP is not possible because this command wasn't designed to work with PHP.

I have already changed the command, i am using "rr reset" command near with rr.yaml file, so i don't call it through php, but it doesn't help

@rustatian
Copy link
Member

Could you please try to do that with 1 RR and 1 rr reset in the loop? This is not a stress test, this is the test of the SO_REUSEPORT I guess.

@obrazkowv
Copy link
Author

Do you mean 1 worker in the config when you told 1 RR?
rr rest already 1 in the loop, it doesn't run in parallel, each iteration wait finish of prev iteration

@rustatian
Copy link
Member

I tried to do the same, at least for the last 10m there are no issues
image

@rustatian
Copy link
Member

image

@rustatian
Copy link
Member

I assume that there is smt wrong with your test cases. Keep in mind of SO_REUSEPORT used by the RR.

@rustatian
Copy link
Member

Do you mean one worker in the config when you told 1 RR?

I mean 1 RR process. You have dozen of them running in your htop screenshot. Some of them running under root, some under the ubuntu user. That's incorrect for sure.

@obrazkowv
Copy link
Author

i'll try to find what can be cause of this issue, maybe it's related to the settings of the OS.
If you didn't face with the problem, so maybe i did something wrong. Thank you for hint with SO_REUSEPORT.
I started such test because sometimes i faced with problem on production, when call reset command, so i guess, that i can reproduce this problem in such way.

@rustatian
Copy link
Member

As you may see here, there is no even 1 missed request:
image

(4 RR processed and 6 rr reset in a loop + wrk)

@obrazkowv
Copy link
Author

Do you mean one worker in the config when you told 1 RR?

I mean 1 RR process. You have dozen of them running in your htop screenshot. Some of them running under root, some under the ubuntu user. That's incorrect for sure.

You are wrong, i have only one main process. "rr reset" was called by ubuntu user
ps -aux | grep rr
ubuntu 6126 0.0 0.0 15872 1072 pts/5 R+ 21:12 0:00 grep --color=auto rr
root 21640 136 0.9 741508 35880 pts/0 Sl+ 19:58 100:34 rr serve
ubuntu 28408 0.0 0.6 740676 26544 pts/1 Sl+ 19:59 0:00 rr reset

@rustatian
Copy link
Member

You have a lot of the rr serve processes:
image

@rustatian
Copy link
Member

Also, as I mentioned (in general, this is not the issue, but...) you are running an rr serve processes from the root and rr reset from the ubuntu user.

@obrazkowv
Copy link
Author

image
I think that it can be threads or something anything.

@rustatian
Copy link
Member

rustatian commented Jun 15, 2022

It might be a child PHP processes; however, I don't know.

@rustatian
Copy link
Member

So, to summarize, I was trying to reproduce your case for eight days with a RR (with 30 workers, the latest PHP) and didn't see the result you showed.
I may only give you general advice:

  1. Use the same user for the rr and rr reset.
  2. Try a fresh ubuntu install.
  3. Use the same worker as I used: https://github.com/roadrunner-server/rr-e2e-tests/blob/master/php_test_files/psr-worker-bench.php

@obrazkowv
Copy link
Author

So, to summarize, I was trying to reproduce your case for eight days with a RR (with 30 workers, the latest PHP) and didn't see the result you showed. I may only give you general advice:

  1. Use the same user for the rr and rr reset.
  2. Try a fresh ubuntu install.
  3. Use the same worker as I used: https://github.com/roadrunner-server/rr-e2e-tests/blob/master/php_test_files/psr-worker-bench.php

I already said that you can skip the step of sending requests. The reset command freezes when you do sequential reset. Try to remove the sleep after each call, idk

I did it on ubuntu 18.04, Is it enough fresh?

I already have similar code

<?php

use Nyholm\Psr7\Factory\Psr17Factory;
use Nyholm\Psr7\Response;
use Spiral\RoadRunner\Http\PSR7Worker;
use Spiral\RoadRunner\Worker;

require_once __DIR__ . '/vendor/autoload.php';

$worker = Worker::create();
$psrFactory = new Psr17Factory();
$worker = new PSR7Worker($worker, $psrFactory, $psrFactory, $psrFactory);

while (true) {
    try {
        $request = $worker->waitRequest();

        $worker->respond(new Response(200, [], 'Hello RoadRunner!'));
    } catch (Throwable $e) {
        $worker->respond(new Response(400, [], 'Hello RoadRunner!'));
    }
}

And finally, i think, that different user doesn't matter to this issue, because i would be faced with this on first iteration.

Okey, i will try to find cause of the problem and if i find, i will add explanation, thank you for your time

@obrazkowv
Copy link
Author

When i have changed relay from pipe to socket, after some time i got warning. I hope it can help...
rr serve
[INFO] RoadRunner server started; version: 2.10.4, buildtime: 2022-06-10T23:29:30+0000
[WARN]: socket server listen, error: pipes frame send: write unix rr.sock->@: write: broken pipe

@rustatian
Copy link
Member

The reset command freezes when you do sequential reset. Try to remove the sleep after each call, idk

I tried different scenarios, but no freezes.

[WARN]: socket server listen, error: pipes frame send: write unix rr.sock->@: write: broken pipe

It is fine. This is not an error, that might happen when you're doing a lot of resets.

I did it on ubuntu 18.04, Is it enough fresh?

It's very old.

that different user doesn't matter to this issue

Also, as I mentioned (in general, this is not the issue, but...) you are running an rr serve processes from the root and rr reset from the ubuntu user.

@obrazkowv
Copy link
Author

obrazkowv commented Jun 15, 2022

I found the cause of issue, we are using OS signals in our php application, so each time, when i send the signal to the process, rr will freeze...

ps aux | grep worker.php | grep -v grep | awk '{print $2}' | xargs kill -SIGUSR2

So, now you should be able to reproduce this problem and i hope it helps you

@rustatian
Copy link
Member

I found the cause of issue, we are using OS signals in our php application, so each time, when i send the signal to the process, rr will freeze...

ps aux | grep worker.php | grep -v grep | awk '{print $2}' | xargs kill -SIGUSR2

So, now you should be able to reproduce this problem and i hope it helps you

If you send a SIGUSR2 syscall to the PHP app, this is not a problem. If you are using reset at the same time - this is a problem. You're not allowed to do that at the moment. This is not a bug, this is expected behavior.

@rustatian
Copy link
Member

I don't understand, why you need to stop your worker and then stop all the workers at the same time with rr reset? If you want to stop a worker - just send a SIGUSR2 and RR will reallocate such worker automatically. If you want to restart all the workers, send to all of them SIGUSR2 OR use rr reset. Why use both these approaches at the same time?

@obrazkowv
Copy link
Author

I don't understand, why you need to stop your worker and then stop all the workers at the same time with rr reset? If you want to stop a worker - just send a SIGUSR2 and RR will reallocate such worker automatically. If you want to restart all the workers, send to all of them SIGUSR2 OR use rr reset. Why use both these approaches at the same time?

Why do you think, that i can use it only for signaling to stop? I use this signal to notify about something another process and i have used it in the application before i started to use roadrunner, so i didn't even assume that it can be the problem. I think, that you should add this point in the documentation because it can be unexpected behavior for someone like me because you can change of the reaction on the signal in your app, i mean, you can use it not only for termination.

@rustatian
Copy link
Member

Why do you think, that i can use it only for signaling to stop?

Because if you send a syscall and worker won't stop, this is not a problem for the rr reset. The only problem was when the process exited after this signal.
So, generally speaking, you're free to use syscalls, but these syscalls should not stop the worker at the same time you're calling reset.

I think, that you should add this point in the documentation because it can be unexpected behavior for someone like me because you can change of the reaction on the signal in your app, i mean, you can use it not only for termination.

We're always welcome to the PRs 😉

You may also create a feature request with a proper description to support the scenario you described. Because rr reset (as in the ticket description) works as expected. The problem is in the simultaneous use of syscalls which leads to workers stopping with rr reset.

@obrazkowv
Copy link
Author

Why do you think, that i can use it only for signaling to stop?

Because if you send a syscall and worker won't stop, this is not a problem for the rr reset. The only problem was when the process exited after this signal. So, generally speaking, you're free to use syscalls, but these syscalls should not stop the worker at the same time you're calling reset.

I think, that you should add this point in the documentation because it can be unexpected behavior for someone like me because you can change of the reaction on the signal in your app, i mean, you can use it not only for termination.

We're always welcome to the PRs wink

You may also create a feature request with a proper description to support the scenario you described. Because rr reset (as in the ticket description) works as expected. The problem is in the simultaneous use of syscalls which leads to workers stopping with rr reset.

I think that this cannot be resolved in roadrunner app(application written on go) because you don't know which sent the signal and how it must handle. Developer must care about this when he starts using this library, but he must be aware of this potential issue.
But you can add tips to the site documentation if you think that it matters.

@rustatian
Copy link
Member

I think that this cannot be resolved in roadrunner app(application written on go) because you don't know which sent the signal and how it must handle.

Sorry for repeating this, but again, you're free to use system calls until they won't stop the worker. The problem is not in the syscall (any of them), the problem is in the syscalls which stop the worker at the same time with the rr reset. RR doesn't need to listen to syscalls sent to a PHP process or from a PHP process.
For example, kill -12 <PID> will stop the worker at the moment.

{"level":"debug","ts":1655365928.6281197,"logger":"server","msg":"worker stopped","internal_event_name":"EventWorkerWaitExit","error":"signal: user defined signal 2; process_wait: signal: user defined signal 2","errorCauses":[{"error":"signal: user defined signal 2"},{"error":"process_wait: signal: user defined signal 2"}]}
{"level":"debug","ts":1655365928.6687036,"logger":"server","msg":"worker is allocated","pid":3781786,"internal_event_name":"EventWorkerConstruct"}

As you can see, this is not a problem. RR will just create a new worker instead of stopping one. And then you may use a rr reset (but as we agreed, not at the same time).

So, feel free to use syscalls that lead to worker stopping OR rr reset. OR syscalls and rr reset but not at the same time.
If you want to support simultaneous use of the syscalls which lead to worker stopping and rr reset, please, describe your scenario in the feature request ticket.

@obrazkowv
Copy link
Author

obrazkowv commented Jun 16, 2022

I think that this cannot be resolved in roadrunner app(application written on go) because you don't know which sent the signal and how it must handle.

Sorry for repeating this, but again, you're free to use system calls until they won't stop the worker. The problem is not in the syscall (any of them), the problem is in the syscalls which stop the worker at the same time with the rr reset. RR doesn't need to listen to syscalls sent to a PHP process or from a PHP process. For example, kill -12 <PID> will stop the worker at the moment.

{"level":"debug","ts":1655365928.6281197,"logger":"server","msg":"worker stopped","internal_event_name":"EventWorkerWaitExit","error":"signal: user defined signal 2; process_wait: signal: user defined signal 2","errorCauses":[{"error":"signal: user defined signal 2"},{"error":"process_wait: signal: user defined signal 2"}]}
{"level":"debug","ts":1655365928.6687036,"logger":"server","msg":"worker is allocated","pid":3781786,"internal_event_name":"EventWorkerConstruct"}

As you can see, this is not a problem. RR will just create a new worker instead of stopping one. And then you may use a rr reset (but as we agreed, not at the same time).

So, feel free to use syscalls that lead to worker stopping OR rr reset. OR syscalls and rr reset but not at the same time. If you want to support simultaneous use of the syscalls which lead to worker stopping and rr reset, please, describe your scenario in the feature request ticket.

The main problem is that you don't know can you use signals or not right now, i mean simultaneously. I know how to resolve the issue in my environment, but it can be not so obvious in another case. And i need to add a point, that i use the signal to notify that something changed, so i use the signal not for stopping. You can set up the reaction on the specific signal in php and the worker won't stop after sending syscall.

https://www.gnu.org/software/libc/manual/html_node/Miscellaneous-Signals.html
"The SIGUSR1 and SIGUSR2 signals are set aside for you to use any way you want. They’re useful for simple interprocess communication, if you write a signal handler for them in the program that receives the signal."

Finally, i think, that we can close this issue.
Thank you

@rustatian
Copy link
Member

My pleasure 🤝

Repository owner moved this from Backlog to Unreleased in General Jun 16, 2022
@rustatian
Copy link
Member

@obrazkowv This is the correct description of the problem: #1180.

@rustatian rustatian moved this from Unreleased to Done in General Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-bug Bug: bug, exception F-need-verification
Projects
No open projects
Archived in project
Development

No branches or pull requests

3 participants