[BUG] Heartbeat on non running activity #109

Zylius · 2021-10-26T10:44:18Z

Describe the bug

Sending out a heartbeat in activity method results in heartbeat on non running activity (at random times). The failed state for heartbeats seems to be set upon roadrunner launch (rr serve command). If the activity worker launches with a 'failed' heartbeat state, it'll continue failing upon Heartbeats until it, or the Workflows rr serve instance is killed and restarted. I'm still not very clear on what fixes it. The more roadrunner instances I run (rr serve) the higher the chance of this error.

The failing heartbeats throw: 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001

Or sometimes the worker (very rarely) enters a different state for heartbeats and failed with a socket error. IDK if it's related:
socket_send(): unable to write to socket [32]: Broken pipe

To Reproduce

Create a dummy workflow.
Create a dummy activity.
Send a heartbeat in the activity
Launch rr serve commands for the workflow part and activities separately. In our system workflows are contained in a single app, and activities are handled by different service apps. The more rr serve instances for the activity app is launched, the higher the chance one will see the error. It's pretty easy to recreate, it doesn't require many workers. I would say on my end with one worker it's about 50/50 to see the heartbeat error.
Launch the workflow
See the error.

Expected behavior

Heartbeat to be sent out successfully.

Screenshots/Terminal output

Errors

Temporal\Exception\TransportException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 
Spiral\Goridge\RPC\Exception\ServiceException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 in /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php:128
Stack trace:
#0 /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php(98): Spiral\Goridge\RPC\RPC->decodeResponse(Object(Spiral\Goridge\Frame), NULL)
#1 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Worker/Transport/Goridge.php(56): Spiral\Goridge\RPC\RPC->call('activities.Reco...', Array)
#2 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Internal/Activity/ActivityContext.php(137): Temporal\Worker\Transport\Goridge->call('activities.Reco...', Array)

Very rare (dunno if related)

ErrorException: Warning: socket_send(): unable to write to socket [32]: Broken pipe 
ErrorException: Warning: socket_send(): unable to write to socket [32]: Broken pipe in /home/docker/services/hyperwallet/vendor/spiral/goridge/src/SocketRelay.php:234
Stack trace:
#0 /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php(83): Spiral\Goridge\SocketRelay->send(Object(Spiral\Goridge\Frame))
#1 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Worker/Transport/Goridge.php(56): Spiral\Goridge\RPC\RPC->call('activities.Reco...', Array)
#2 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Internal/Activity/ActivityContext.php(137): Temporal\Worker\Transport\Goridge->call('activities.Reco...', Array)

Code samples:

<?php
declare(strict_types=1);

namespace App\Command;

use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;

#[WorkflowInterface]
class TemporalTestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(2))
            ->withTaskQueue('hyperwallet_command_bus');

        Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}

<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}

Versions

OS: In Docker Debian GNU/Linux 10 (buster), host Pop!_OS 21.04
Temporal Version: 1.12.2
Roadrunner version: 2.5.2
PHP composer versions:
- spiral/roadrunner: 2.5.0
- spiral/roadrunner-cli: 2.0.11
- spiral/roadrunner-http: 2.0.4
- spiral/roadrunner-worker: 2.1.3
- temporal/sdk: 1.0.4
We're using docker image temporalio/auto-setup in our docker-compose.yaml, we've seen this error with sample temporal docker-compose-mysql-es.yaml files as well.

Additional context

I'm not sure, maybe this is an issue only with our setup, but any help would be greatly appreciated.

The text was updated successfully, but these errors were encountered:

rustatian · 2021-10-26T10:47:39Z

Hey @Zylius , could you please send your .rr.yaml ?

Zylius · 2021-10-26T10:48:08Z

Hey sure,

rpc:
  listen: tcp://127.0.0.1:6001

server:
  command: "php bin/console t:w"

temporal:
  address: "temporal:7233"
  activities:
    num_workers: 1
  debug_level: 2
  codec: json

rustatian · 2021-10-26T10:50:21Z

I'll take a look, but it seems that the issue is related to the temporal-php. But, anyway, I'll check this.

Zylius · 2021-10-26T10:52:17Z

Thanks 🙇 , any ideas on how to fix it would be great. I can make my own custom RPC calls if needed for the heartbeat 🚀 .

rustatian · 2021-10-26T10:55:28Z

We already have these calls in the rpc.go file.
Does this issue happen in the roadrunner 2.5.0 ?

Zylius · 2021-10-26T11:03:53Z

Yes seems to happend in both 2.5.2 and 2.5.0 :/

rustatian · 2021-10-26T11:04:50Z

I mean, does it happen in the 2.4.* for example?

Zylius · 2021-10-26T11:08:29Z

Yeah, just recreated it with 2.4.2. IIRC I had it with 2.3.2 as well.

rustatian · 2021-10-26T11:10:05Z

Also, tbh, I find the issue, possible, I'm not 100% sure, but I find a place with a concurrent check for the activity heartbeats. In the RPC call (which is actually doing the PHP) we store the token in the concurrent hashmap, which is accessible even if we have an execution error. So, you have a small window to call this method from the RPC, see this token, return successfully from the call, but actually, this token will be removed after the error occurred. So, you'll see a successful call, but actually, this will be an error. PHP will call this activity and get an error writing to the closed pipe.

Zylius · 2021-10-26T11:27:01Z

That's weird, so if I understand you correctly the heartbeat should fail after the first unsuccessful rpc call to roadrunner (which will respond as a successful one). But that wouldn't be consistent with the strange stuff, that restarting the roadrunner instances somehow sometimes fixes the problem, yet sometimes it doesn't.

I've checked the PHP code and it seems to be calling the correct RPC procedure

roadrunner-temporal/activity/rpc.go

Line 41 in a582c6c

    
           func (r *rpc) RecordActivityHeartbeat(in RecordHeartbeatRequest, out *RecordHeartbeatResponse) error {

🤔 💭

rustatian · 2021-10-26T11:33:37Z

Yes, everything is ok with PHP.

Actually, this error Temporal\Exception\TransportException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 should only happen on the non-running activities.

But, If I understand correctly, you have these issues even with the running activities?

Zylius · 2021-10-26T11:41:25Z

Yeah, I send it out directly from the activity code in the example. the Activity::heartbeat('test'); line. The heartbeat is always during the activity execution. Since temporal-php closes the activity context only once the activity method is done.

I used to think earlier that it might be caused by an activity timeout, but now I'm sure it's not cause I've tested with one liner code and temporal doesn't report any timeouts.

rustatian · 2021-10-26T11:44:57Z

So, then I'm pretty sure, that this is due to the concurrent access to the internal sync primitives. This is our bug with a 99% confidence, but I'll need to check this.
I'll have a look tomorrow at this since I guess that is very important for you (or not) ?

Zylius · 2021-10-26T11:53:15Z

It's important for our system stability, but we can postpone temporal a few weeks for the activities which require heartbeats. :) Thank you for your help!!

rustatian · 2021-10-26T11:54:29Z

You are welcome 😃 I'll try to resolve this issue tomorrow 🤖

rustatian · 2021-10-28T13:54:08Z

Sorry for the delay @Zylius , have a lot of tickets to resolve. Will start ASAP to work on it (next few days).

Zylius · 2021-10-28T19:13:55Z

It's no issue we're having higher priority issues on our end as well! 🙇‍♂️

Zylius · 2021-11-04T13:56:56Z

Hello, any news on this? 👀

rustatian · 2021-11-04T15:40:26Z

@Zylius Sorry, I got sick 😢

Zylius · 2021-11-04T16:18:35Z

Ahh the season of the year, wish you get well soon!!!

rustatian · 2021-11-24T06:13:44Z

@Zylius Ok 😃 Sorry for the delay (COVID-19). At the moment I'm waiting for the temporal/sdk 1.11 fix roadrunner-server/roadrunner-plugins#12 (comment). After that, I'll be able to release a beta version which you can test for the heartbeats.

Zylius · 2021-11-24T10:26:52Z

Thanks man, happy you got better!

rustatian · 2021-11-25T19:10:40Z

@Zylius I forked (temporarily) the temporal/go-sdk and excluded commit with the problem. I'll release RR v2.6.0-rc.1 tomorrow, so, you can test it.

Zylius · 2021-11-25T19:14:11Z

WIll be ready to test it! 😁

rustatian · 2021-11-26T08:58:46Z

@Zylius here you are: https://github.com/spiral/roadrunner-binary/releases/tag/v2.6.0-rc.1

Zylius · 2021-11-26T11:26:55Z

Ahh snap, still getting it.

2021-11-26T11:20:33.778Z	ERROR	temporal    	Activity error.	{"Namespace": "default", "TaskQueue": "test_command_bus", "WorkerID": "6668@64ddfa348084@test_command_bus@2", "WorkflowID": "20c0e906-feb7-419d-95a8-690ec59bd64a", "RunID": "7f727c43-dba8-4df9-a05d-f462ddf67ba6", "ActivityType": "CreateTestCommand", "Attempt": 1, "Error": "Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Temporal\\Exception\\TransportException, retryable: true): Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Spiral\\Goridge\\RPC\\Exception\\ServiceException, retryable: true)"}

Road runner version:

docker@64ddfa348084:~/services/test$ ./rr --version
rr version 2.6.0-rc-1 (build time: 2021-11-26T08:53:07+0000, go1.17.2)

Dunno maybe it's an isolated issue in our environment?
It seems no one else is reporting such issues.

rustatian · 2021-11-26T11:28:02Z

This is the regular issue, not the bug. The activity you are trying to ping is not running at that moment.

rustatian · 2021-11-26T11:33:09Z

This is not a critical error, you should treat this as a let's say, expected error when you are trying to check the HeartBeat. Because even the temporal go-sdk will return an error in that case.

Zylius · 2021-11-26T11:34:02Z

I'm confused, the activity is sending the heart beat in it's own code, like in this example I gave

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}

How could the Activity testActivity not be running while we send out heartbeat in the testActivity code?
This is original error I opened this issue with.

rustatian · 2021-11-26T11:39:39Z

Yep, I was ill when we discussed this issue, and didn't pay enough attention to it, sorry. The activity you showed here pinging the temporal server for the heartbeat. But the heartbeat in the issue pings this testActivity for the heartbeat. You are trying to ping the testActivity after it Activity::heartbeat('test'); method ended.

Zylius · 2021-11-26T11:46:20Z

So you're saying

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure, cause it's sent somehow async? And when temporal gets the request for heartbeat the activity is already closed?
    }
}

I tried this

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); 
       /// one second sleep so activity is surely not over when temporal gets the heart beat
        sleep(1);
    }
}

Still the error.

Now I'm all confused about the concept of heartbeats

rustatian · 2021-11-26T11:47:04Z

You may have a look at our tests, where we have samples with heartbeats -> https://github.com/temporalio/roadrunner-temporal/blob/master/tests/src/Workflow/SimpleHeartbeatWorkflow.php
and the second: https://github.com/temporalio/roadrunner-temporal/blob/master/tests/src/Activity/HeartBeatActivity.php

Zylius · 2021-11-26T11:50:58Z

That's the issue. The test code is the same I use as an example here and the same I used to test the new rc-1. it throws me the heartbeat on non running activity error. Well I guess I'll try to avoid heartbeats on PHP for now.

rustatian · 2021-11-26T11:55:02Z

In your sample, you have only 1 ping to the temporal server and then sleep, you may try to use the loop with the heartbeats like shown in the docs: https://docs.temporal.io/docs/php/activities/#activity-heart-beating without sleep.
I'm not a PHP dev, sorry, I can't help a lot with a PHP (
You may ask for help (about how to correctly write the heartbeats) in our discord server in the temporal channel: https://discord.gg/TFeEmCs

rustatian · 2021-11-26T11:55:33Z

Also, make sure to set the ActivityOptions::new()->withStartToCloseTimeout(50) option in your workflow.

Zylius · 2021-11-26T12:13:49Z

Thank you very much! I'll try everything you've said :) You've been a great help. I'll report here if I find where I screwed up!

wolfy-j · 2021-12-02T12:54:42Z

Is there a reason why you don't have yield in your workflow on activity execution? Can you try to run one of the samples and see if you still have the issue?

Zylius · 2021-12-02T14:09:15Z

Weird, I think the yield makes it rarer, but I still can recreate it after a couple of roadrunner restarts and workflow relaunches.

2021-12-02T14:07:43.827Z	DEBUG	temporal    	ExecuteActivity	{"Namespace": "default", "TaskQueue": "saga_command_bus", "WorkerID": "5391@64ddfa348084@saga_command_bus@1", "WorkflowType": "TestWorkflow", "WorkflowID": "fdbbd843-3dc6-40f8-a84d-c2c4c04c9088", "RunID": "ec7e703a-736d-45ae-a2e6-d41e7f53ad60", "Attempt": 1, "ActivityID": "5", "ActivityType": "testActivity"}
2021-12-02T14:07:46.282Z	DEBUG	temporal    	ExecuteActivity	{"Namespace": "default", "TaskQueue": "saga_command_bus", "WorkerID": "5391@64ddfa348084@saga_command_bus@1", "WorkflowType": "TestWorkflow", "WorkflowID": "f80698b0-82e2-4452-9194-19444ba301aa", "RunID": "b4348441-c08b-4a12-8f49-126f64eeeab3", "Attempt": 1, "ActivityID": "5", "ActivityType": "testActivity"}
2021-12-02T14:07:48.254Z	DEBUG	temporal    	ExecuteActivity	{"Namespace": "default", "TaskQueue": "saga_command_bus", "WorkerID": "5391@64ddfa348084@saga_command_bus@1", "WorkflowType": "TestWorkflow", "WorkflowID": "091757be-a5aa-4a5c-8e71-4d4684759f67", "RunID": "cd615364-0d65-4b7c-aa5d-eb25c40cf9ca", "Attempt": 1, "ActivityID": "5", "ActivityType": "testActivity"}
2021-12-02T14:07:48.325Z	ERROR	temporal    	Activity error.	{"Namespace": "default", "TaskQueue": "internal_saga_queue", "WorkerID": "5391@64ddfa348084@internal_saga_queue@4", "WorkflowID": "091757be-a5aa-4a5c-8e71-4d4684759f67", "RunID": "cd615364-0d65-4b7c-aa5d-eb25c40cf9ca", "ActivityType": "testActivity", "Attempt": 1, "Error": "Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Temporal\\Exception\\TransportException, retryable: true): Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Spiral\\Goridge\\RPC\\Exception\\ServiceException, retryable: true)"}

THe code is

#[WorkflowInterface]
class TestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(50))
            ->withTaskQueue('internal_saga_queue')
            ->withRetryOptions(
                (new RetryOptions())
                    ->withMaximumAttempts(1)
            );

        yield Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}

When I originally found this issue I was using yield. In the example I gave I'm not using it for the simplicities sake.

wolfy-j · 2021-12-03T14:46:46Z

Responded at the forum... I'm not able to reproduce this error. Can we try to dig deeper into other issues you having "broken pipe" and etc? I wonder if we are chasing the wrong bug.

rustatian · 2021-12-03T16:19:06Z

The broken pipe issue might be because of RPC plugin started before the activity RPC layer. The issue was fixed in the latest version. @Zylius Could you please retest the issue on this version -> https://github.com/spiral/roadrunner-binary/releases/tag/v2.6.3?

Zylius · 2021-12-03T18:22:36Z

Guys, just tried it with the v2.6.3, neither with the yield nor without it can I reproduce the issue. I don't get the broken pipe, nor the heartbeat on non running activity anymore! I think it's fixed! Thank you so much you're great. Tried it on different queues etc.

🙇 Thanks

rustatian changed the title ~~[Bug] Heartbeat on non running activity~~ [BUG?] Heartbeat on non running activity Oct 26, 2021

rustatian self-assigned this Oct 26, 2021

rustatian added the F-need-verification label Oct 26, 2021

rustatian added this to the 1.0.11 milestone Oct 26, 2021

rustatian mentioned this issue Nov 25, 2021

fix(heartbeat): record hb context right after hb check #120

Merged

6 tasks

rustatian modified the milestones: 1.0.11, v1.1 Nov 30, 2021

Zylius closed this as completed Dec 3, 2021

rustatian changed the title ~~[BUG?] Heartbeat on non running activity~~ [BUG] Heartbeat on non running activity Dec 3, 2021

rustatian added B-bug Bug: bug, unhandled exception and removed F-need-verification labels Dec 3, 2021

rustatian mentioned this issue Dec 3, 2021

fix(edges): wrong vertices edge handling roadrunner-server/endure#111

Merged

6 tasks

[BUG] Heartbeat on non running activity #109

[BUG] Heartbeat on non running activity #109

Comments

Zylius commented Oct 26, 2021

Describe the bug

To Reproduce

Expected behavior

Screenshots/Terminal output

Errors

Code samples:

Versions

Additional context

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021 • edited Loading

rustatian commented Oct 26, 2021

Zylius commented Oct 26, 2021

rustatian commented Oct 26, 2021

rustatian commented Oct 28, 2021

Zylius commented Oct 28, 2021

Zylius commented Nov 4, 2021

rustatian commented Nov 4, 2021

Zylius commented Nov 4, 2021

rustatian commented Nov 24, 2021

Zylius commented Nov 24, 2021

rustatian commented Nov 25, 2021

Zylius commented Nov 25, 2021

rustatian commented Nov 26, 2021

Zylius commented Nov 26, 2021

rustatian commented Nov 26, 2021

rustatian commented Nov 26, 2021 • edited Loading

Zylius commented Nov 26, 2021

rustatian commented Nov 26, 2021 • edited Loading

Zylius commented Nov 26, 2021

rustatian commented Nov 26, 2021

Zylius commented Nov 26, 2021

rustatian commented Nov 26, 2021

rustatian commented Nov 26, 2021

Zylius commented Nov 26, 2021

wolfy-j commented Dec 2, 2021 • edited Loading

Zylius commented Dec 2, 2021 • edited Loading

wolfy-j commented Dec 3, 2021

rustatian commented Dec 3, 2021 • edited Loading

Zylius commented Dec 3, 2021

Zylius commented Oct 26, 2021 •

edited

Loading

rustatian commented Nov 26, 2021 •

edited

Loading

rustatian commented Nov 26, 2021 •

edited

Loading

wolfy-j commented Dec 2, 2021 •

edited

Loading

Zylius commented Dec 2, 2021 •

edited

Loading

rustatian commented Dec 3, 2021 •

edited

Loading