-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Heartbeat on non running activity #109
Comments
Hey @Zylius , could you please send your |
Hey sure,
|
I'll take a look, but it seems that the issue is related to the |
Thanks 🙇 , any ideas on how to fix it would be great. I can make my own custom RPC calls if needed for the heartbeat 🚀 . |
We already have these calls in the |
Yes seems to happend in both 2.5.2 and 2.5.0 :/ |
I mean, does it happen in the 2.4.* for example? |
Yeah, just recreated it with 2.4.2. IIRC I had it with 2.3.2 as well. |
Also, tbh, I find the issue, possible, I'm not 100% sure, but I find a place with a concurrent check for the activity heartbeats. In the RPC call (which is actually doing the PHP) we store the token in the concurrent hashmap, which is accessible even if we have an execution error. So, you have a small window to call this method from the RPC, see this token, return successfully from the call, but actually, this token will be removed after the error occurred. So, you'll see a successful call, but actually, this will be an error. PHP will call this activity and get an error writing to the closed pipe. |
That's weird, so if I understand you correctly the heartbeat should fail after the first unsuccessful rpc call to roadrunner (which will respond as a successful one). But that wouldn't be consistent with the strange stuff, that restarting the roadrunner instances somehow sometimes fixes the problem, yet sometimes it doesn't. I've checked the PHP code and it seems to be calling the correct RPC procedure roadrunner-temporal/activity/rpc.go Line 41 in a582c6c
|
Yes, everything is ok with PHP. Actually, this error But, If I understand correctly, you have these issues even with the running activities? |
Yeah, I send it out directly from the activity code in the example. the I used to think earlier that it might be caused by an activity timeout, but now I'm sure it's not cause I've tested with one liner code and temporal doesn't report any timeouts. |
So, then I'm pretty sure, that this is due to the concurrent access to the internal sync primitives. This is our bug with a 99% confidence, but I'll need to check this. |
It's important for our system stability, but we can postpone temporal a few weeks for the activities which require heartbeats. :) Thank you for your help!! |
You are welcome 😃 I'll try to resolve this issue tomorrow 🤖 |
Sorry for the delay @Zylius , have a lot of tickets to resolve. Will start ASAP to work on it (next few days). |
It's no issue we're having higher priority issues on our end as well! 🙇♂️ |
Hello, any news on this? 👀 |
@Zylius Sorry, I got sick 😢 |
Ahh the season of the year, wish you get well soon!!! |
@Zylius Ok 😃 Sorry for the delay (COVID-19). At the moment I'm waiting for the temporal/sdk 1.11 fix roadrunner-server/roadrunner-plugins#12 (comment). After that, I'll be able to release a beta version which you can test for the heartbeats. |
Thanks man, happy you got better! |
@Zylius I forked (temporarily) the temporal/go-sdk and excluded commit with the problem. I'll release RR v2.6.0-rc.1 tomorrow, so, you can test it. |
WIll be ready to test it! 😁 |
Ahh snap, still getting it.
Road runner version:
Dunno maybe it's an isolated issue in our environment? |
This is the regular issue, not the bug. The activity you are trying to ping is not running at that moment. |
This is not a critical error, you should treat this as a let's say, expected error when you are trying to check the HeartBeat. Because even the temporal go-sdk will return an error in that case. |
I'm confused, the activity is sending the heart beat in it's own code, like in this example I gave
How could the Activity |
Yep, I was ill when we discussed this issue, and didn't pay enough attention to it, sorry. The activity you showed here pinging the temporal server for the heartbeat. But the heartbeat in the issue pings this |
So you're saying #[ActivityInterface]
class TestActivity
{
#[ActivityMethod]
public function testActivity()
{
Activity::heartbeat('test'); // This causes failure, cause it's sent somehow async? And when temporal gets the request for heartbeat the activity is already closed?
}
} I tried this #[ActivityInterface]
class TestActivity
{
#[ActivityMethod]
public function testActivity()
{
Activity::heartbeat('test');
/// one second sleep so activity is surely not over when temporal gets the heart beat
sleep(1);
}
} Still the error. Now I'm all confused about the concept of heartbeats |
You may have a look at our tests, where we have samples with heartbeats -> https://github.com/temporalio/roadrunner-temporal/blob/master/tests/src/Workflow/SimpleHeartbeatWorkflow.php |
That's the issue. The test code is the same I use as an example here and the same I used to test the new rc-1. it throws me the |
In your sample, you have only 1 ping to the temporal server and then sleep, you may try to use the loop with the heartbeats like shown in the docs: https://docs.temporal.io/docs/php/activities/#activity-heart-beating without sleep. |
Also, make sure to set the |
Thank you very much! I'll try everything you've said :) You've been a great help. I'll report here if I find where I screwed up! |
Is there a reason why you don't have |
Weird, I think the yield makes it rarer, but I still can recreate it after a couple of roadrunner restarts and workflow relaunches.
THe code is #[WorkflowInterface]
class TestWorkflow
{
#[Workflow\WorkflowMethod]
public function launchActivity()
{
$options = ActivityOptions::new()
->withStartToCloseTimeout(CarbonInterval::seconds(50))
->withTaskQueue('internal_saga_queue')
->withRetryOptions(
(new RetryOptions())
->withMaximumAttempts(1)
);
yield Workflow::newUntypedActivityStub($options)->execute('testActivity');
}
} When I originally found this issue I was using yield. In the example I gave I'm not using it for the simplicities sake. |
Responded at the forum... I'm not able to reproduce this error. Can we try to dig deeper into other issues you having "broken pipe" and etc? I wonder if we are chasing the wrong bug. |
The broken pipe issue might be because of RPC plugin started before the activity RPC layer. The issue was fixed in the latest version. @Zylius Could you please retest the issue on this version -> https://github.com/spiral/roadrunner-binary/releases/tag/v2.6.3? |
Guys, just tried it with the v2.6.3, neither with the 🙇 Thanks |
Describe the bug
Sending out a heartbeat in activity method results in
heartbeat on non running activity
(at random times). The failed state for heartbeats seems to be set upon roadrunner launch (rr serve
command). If the activity worker launches with a 'failed' heartbeat state, it'll continue failing upon Heartbeats until it, or the Workflowsrr serve
instance is killed and restarted. I'm still not very clear on what fixes it. The more roadrunner instances I run (rr serve
) the higher the chance of this error.The failing heartbeats throw:
'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001
Or sometimes the worker (very rarely) enters a different state for heartbeats and failed with a socket error. IDK if it's related:
socket_send(): unable to write to socket [32]: Broken pipe
To Reproduce
rr serve
commands for the workflow part and activities separately. In our system workflows are contained in a single app, and activities are handled by different service apps. The morerr serve
instances for the activity app is launched, the higher the chance one will see the error. It's pretty easy to recreate, it doesn't require many workers. I would say on my end with one worker it's about 50/50 to see the heartbeat error.Expected behavior
Heartbeat to be sent out successfully.
Screenshots/Terminal output
Errors
Very rare (dunno if related)
Code samples:
Versions
Debian GNU/Linux 10 (buster)
, hostPop!_OS 21.04
temporalio/auto-setup
in ourdocker-compose.yaml
, we've seen this error with sample temporaldocker-compose-mysql-es.yaml
files as well.Additional context
I'm not sure, maybe this is an issue only with our setup, but any help would be greatly appreciated.
The text was updated successfully, but these errors were encountered: