Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Heartbeat on non running activity #109

Closed
Zylius opened this issue Oct 26, 2021 · 41 comments
Closed

[BUG] Heartbeat on non running activity #109

Zylius opened this issue Oct 26, 2021 · 41 comments
Assignees
Labels
B-bug Bug: bug, unhandled exception
Milestone

Comments

@Zylius
Copy link

Zylius commented Oct 26, 2021

Describe the bug

Sending out a heartbeat in activity method results in heartbeat on non running activity (at random times). The failed state for heartbeats seems to be set upon roadrunner launch (rr serve command). If the activity worker launches with a 'failed' heartbeat state, it'll continue failing upon Heartbeats until it, or the Workflows rr serve instance is killed and restarted. I'm still not very clear on what fixes it. The more roadrunner instances I run (rr serve) the higher the chance of this error.

The failing heartbeats throw: 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001

Or sometimes the worker (very rarely) enters a different state for heartbeats and failed with a socket error. IDK if it's related:
socket_send(): unable to write to socket [32]: Broken pipe

To Reproduce

  1. Create a dummy workflow.
  2. Create a dummy activity.
  3. Send a heartbeat in the activity
  4. Launch rr serve commands for the workflow part and activities separately. In our system workflows are contained in a single app, and activities are handled by different service apps. The more rr serve instances for the activity app is launched, the higher the chance one will see the error. It's pretty easy to recreate, it doesn't require many workers. I would say on my end with one worker it's about 50/50 to see the heartbeat error.
  5. Launch the workflow
  6. See the error.

Expected behavior

Heartbeat to be sent out successfully.

Screenshots/Terminal output

Errors

Temporal\Exception\TransportException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 
Spiral\Goridge\RPC\Exception\ServiceException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 in /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php:128
Stack trace:
#0 /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php(98): Spiral\Goridge\RPC\RPC->decodeResponse(Object(Spiral\Goridge\Frame), NULL)
#1 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Worker/Transport/Goridge.php(56): Spiral\Goridge\RPC\RPC->call('activities.Reco...', Array)
#2 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Internal/Activity/ActivityContext.php(137): Temporal\Worker\Transport\Goridge->call('activities.Reco...', Array)

Very rare (dunno if related)

ErrorException: Warning: socket_send(): unable to write to socket [32]: Broken pipe 
ErrorException: Warning: socket_send(): unable to write to socket [32]: Broken pipe in /home/docker/services/hyperwallet/vendor/spiral/goridge/src/SocketRelay.php:234
Stack trace:
#0 /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php(83): Spiral\Goridge\SocketRelay->send(Object(Spiral\Goridge\Frame))
#1 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Worker/Transport/Goridge.php(56): Spiral\Goridge\RPC\RPC->call('activities.Reco...', Array)
#2 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Internal/Activity/ActivityContext.php(137): Temporal\Worker\Transport\Goridge->call('activities.Reco...', Array)

Code samples:

<?php
declare(strict_types=1);

namespace App\Command;

use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;

#[WorkflowInterface]
class TemporalTestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(2))
            ->withTaskQueue('hyperwallet_command_bus');

        Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}
<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}

Versions

  • OS: In Docker Debian GNU/Linux 10 (buster), host Pop!_OS 21.04
  • Temporal Version: 1.12.2
  • Roadrunner version: 2.5.2
  • PHP composer versions:
    • spiral/roadrunner: 2.5.0
    • spiral/roadrunner-cli: 2.0.11
    • spiral/roadrunner-http: 2.0.4
    • spiral/roadrunner-worker: 2.1.3
    • temporal/sdk: 1.0.4
  • We're using docker image temporalio/auto-setup in our docker-compose.yaml, we've seen this error with sample temporal docker-compose-mysql-es.yaml files as well.

Additional context

I'm not sure, maybe this is an issue only with our setup, but any help would be greatly appreciated.

@rustatian rustatian changed the title [Bug] Heartbeat on non running activity [BUG?] Heartbeat on non running activity Oct 26, 2021
@rustatian rustatian self-assigned this Oct 26, 2021
@rustatian rustatian added this to the 1.0.11 milestone Oct 26, 2021
@rustatian
Copy link
Collaborator

Hey @Zylius , could you please send your .rr.yaml ?

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

Hey sure,

rpc:
  listen: tcp://127.0.0.1:6001

server:
  command: "php bin/console t:w"

temporal:
  address: "temporal:7233"
  activities:
    num_workers: 1
  debug_level: 2
  codec: json

@rustatian
Copy link
Collaborator

I'll take a look, but it seems that the issue is related to the temporal-php. But, anyway, I'll check this.

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

Thanks 🙇 , any ideas on how to fix it would be great. I can make my own custom RPC calls if needed for the heartbeat 🚀 .

@rustatian
Copy link
Collaborator

We already have these calls in the rpc.go file.
Does this issue happen in the roadrunner 2.5.0 ?

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

Yes seems to happend in both 2.5.2 and 2.5.0 :/

@rustatian
Copy link
Collaborator

I mean, does it happen in the 2.4.* for example?

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

Yeah, just recreated it with 2.4.2. IIRC I had it with 2.3.2 as well.

@rustatian
Copy link
Collaborator

Also, tbh, I find the issue, possible, I'm not 100% sure, but I find a place with a concurrent check for the activity heartbeats. In the RPC call (which is actually doing the PHP) we store the token in the concurrent hashmap, which is accessible even if we have an execution error. So, you have a small window to call this method from the RPC, see this token, return successfully from the call, but actually, this token will be removed after the error occurred. So, you'll see a successful call, but actually, this will be an error. PHP will call this activity and get an error writing to the closed pipe.

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

That's weird, so if I understand you correctly the heartbeat should fail after the first unsuccessful rpc call to roadrunner (which will respond as a successful one). But that wouldn't be consistent with the strange stuff, that restarting the roadrunner instances somehow sometimes fixes the problem, yet sometimes it doesn't.

I've checked the PHP code and it seems to be calling the correct RPC procedure

func (r *rpc) RecordActivityHeartbeat(in RecordHeartbeatRequest, out *RecordHeartbeatResponse) error {
🤔 💭

@rustatian
Copy link
Collaborator

Yes, everything is ok with PHP.

Actually, this error Temporal\Exception\TransportException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 should only happen on the non-running activities.

But, If I understand correctly, you have these issues even with the running activities?

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

Yeah, I send it out directly from the activity code in the example. the Activity::heartbeat('test'); line. The heartbeat is always during the activity execution. Since temporal-php closes the activity context only once the activity method is done.

I used to think earlier that it might be caused by an activity timeout, but now I'm sure it's not cause I've tested with one liner code and temporal doesn't report any timeouts.

@rustatian
Copy link
Collaborator

So, then I'm pretty sure, that this is due to the concurrent access to the internal sync primitives. This is our bug with a 99% confidence, but I'll need to check this.
I'll have a look tomorrow at this since I guess that is very important for you (or not) ?

@Zylius
Copy link
Author

Zylius commented Oct 26, 2021

It's important for our system stability, but we can postpone temporal a few weeks for the activities which require heartbeats. :) Thank you for your help!!

@rustatian
Copy link
Collaborator

You are welcome 😃 I'll try to resolve this issue tomorrow 🤖

@rustatian
Copy link
Collaborator

Sorry for the delay @Zylius , have a lot of tickets to resolve. Will start ASAP to work on it (next few days).

@Zylius
Copy link
Author

Zylius commented Oct 28, 2021

It's no issue we're having higher priority issues on our end as well! 🙇‍♂️

@Zylius
Copy link
Author

Zylius commented Nov 4, 2021

Hello, any news on this? 👀

@rustatian
Copy link
Collaborator

@Zylius Sorry, I got sick 😢

@Zylius
Copy link
Author

Zylius commented Nov 4, 2021

Ahh the season of the year, wish you get well soon!!!

@rustatian
Copy link
Collaborator

@Zylius Ok 😃 Sorry for the delay (COVID-19). At the moment I'm waiting for the temporal/sdk 1.11 fix roadrunner-server/roadrunner-plugins#12 (comment). After that, I'll be able to release a beta version which you can test for the heartbeats.

@Zylius
Copy link
Author

Zylius commented Nov 24, 2021

Thanks man, happy you got better!

@rustatian
Copy link
Collaborator

@Zylius I forked (temporarily) the temporal/go-sdk and excluded commit with the problem. I'll release RR v2.6.0-rc.1 tomorrow, so, you can test it.

@Zylius
Copy link
Author

Zylius commented Nov 25, 2021

WIll be ready to test it! 😁

@rustatian
Copy link
Collaborator

@Zylius here you are: https://github.com/spiral/roadrunner-binary/releases/tag/v2.6.0-rc.1

@Zylius
Copy link
Author

Zylius commented Nov 26, 2021

Ahh snap, still getting it.

2021-11-26T11:20:33.778Z	ERROR	temporal    	Activity error.	{"Namespace": "default", "TaskQueue": "test_command_bus", "WorkerID": "6668@64ddfa348084@test_command_bus@2", "WorkflowID": "20c0e906-feb7-419d-95a8-690ec59bd64a", "RunID": "7f727c43-dba8-4df9-a05d-f462ddf67ba6", "ActivityType": "CreateTestCommand", "Attempt": 1, "Error": "Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Temporal\\Exception\\TransportException, retryable: true): Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Spiral\\Goridge\\RPC\\Exception\\ServiceException, retryable: true)"}

Road runner version:

docker@64ddfa348084:~/services/test$ ./rr --version
rr version 2.6.0-rc-1 (build time: 2021-11-26T08:53:07+0000, go1.17.2)

Dunno maybe it's an isolated issue in our environment?
It seems no one else is reporting such issues.

@rustatian
Copy link
Collaborator

This is the regular issue, not the bug. The activity you are trying to ping is not running at that moment.

@rustatian
Copy link
Collaborator

rustatian commented Nov 26, 2021

This is not a critical error, you should treat this as a let's say, expected error when you are trying to check the HeartBeat. Because even the temporal go-sdk will return an error in that case.

@Zylius
Copy link
Author

Zylius commented Nov 26, 2021

I'm confused, the activity is sending the heart beat in it's own code, like in this example I gave

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}

How could the Activity testActivity not be running while we send out heartbeat in the testActivity code?
This is original error I opened this issue with.

@rustatian
Copy link
Collaborator

rustatian commented Nov 26, 2021

Yep, I was ill when we discussed this issue, and didn't pay enough attention to it, sorry. The activity you showed here pinging the temporal server for the heartbeat. But the heartbeat in the issue pings this testActivity for the heartbeat. You are trying to ping the testActivity after it Activity::heartbeat('test'); method ended.

@Zylius
Copy link
Author

Zylius commented Nov 26, 2021

So you're saying

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure, cause it's sent somehow async? And when temporal gets the request for heartbeat the activity is already closed?
    }
}

I tried this

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); 
       /// one second sleep so activity is surely not over when temporal gets the heart beat
        sleep(1);
    }
}

Still the error.

Now I'm all confused about the concept of heartbeats

@rustatian
Copy link
Collaborator

@Zylius
Copy link
Author

Zylius commented Nov 26, 2021

That's the issue. The test code is the same I use as an example here and the same I used to test the new rc-1. it throws me the heartbeat on non running activity error. Well I guess I'll try to avoid heartbeats on PHP for now.

@rustatian
Copy link
Collaborator

In your sample, you have only 1 ping to the temporal server and then sleep, you may try to use the loop with the heartbeats like shown in the docs: https://docs.temporal.io/docs/php/activities/#activity-heart-beating without sleep.
I'm not a PHP dev, sorry, I can't help a lot with a PHP (
You may ask for help (about how to correctly write the heartbeats) in our discord server in the temporal channel: https://discord.gg/TFeEmCs

@rustatian
Copy link
Collaborator

Also, make sure to set the ActivityOptions::new()->withStartToCloseTimeout(50) option in your workflow.

@Zylius
Copy link
Author

Zylius commented Nov 26, 2021

Thank you very much! I'll try everything you've said :) You've been a great help. I'll report here if I find where I screwed up!

@rustatian rustatian modified the milestones: 1.0.11, v1.1 Nov 30, 2021
@wolfy-j
Copy link
Collaborator

wolfy-j commented Dec 2, 2021

Is there a reason why you don't have yield in your workflow on activity execution? Can you try to run one of the samples and see if you still have the issue?

@Zylius
Copy link
Author

Zylius commented Dec 2, 2021

Weird, I think the yield makes it rarer, but I still can recreate it after a couple of roadrunner restarts and workflow relaunches.

2021-12-02T14:07:43.827Z	DEBUG	temporal    	ExecuteActivity	{"Namespace": "default", "TaskQueue": "saga_command_bus", "WorkerID": "5391@64ddfa348084@saga_command_bus@1", "WorkflowType": "TestWorkflow", "WorkflowID": "fdbbd843-3dc6-40f8-a84d-c2c4c04c9088", "RunID": "ec7e703a-736d-45ae-a2e6-d41e7f53ad60", "Attempt": 1, "ActivityID": "5", "ActivityType": "testActivity"}
2021-12-02T14:07:46.282Z	DEBUG	temporal    	ExecuteActivity	{"Namespace": "default", "TaskQueue": "saga_command_bus", "WorkerID": "5391@64ddfa348084@saga_command_bus@1", "WorkflowType": "TestWorkflow", "WorkflowID": "f80698b0-82e2-4452-9194-19444ba301aa", "RunID": "b4348441-c08b-4a12-8f49-126f64eeeab3", "Attempt": 1, "ActivityID": "5", "ActivityType": "testActivity"}
2021-12-02T14:07:48.254Z	DEBUG	temporal    	ExecuteActivity	{"Namespace": "default", "TaskQueue": "saga_command_bus", "WorkerID": "5391@64ddfa348084@saga_command_bus@1", "WorkflowType": "TestWorkflow", "WorkflowID": "091757be-a5aa-4a5c-8e71-4d4684759f67", "RunID": "cd615364-0d65-4b7c-aa5d-eb25c40cf9ca", "Attempt": 1, "ActivityID": "5", "ActivityType": "testActivity"}
2021-12-02T14:07:48.325Z	ERROR	temporal    	Activity error.	{"Namespace": "default", "TaskQueue": "internal_saga_queue", "WorkerID": "5391@64ddfa348084@internal_saga_queue@4", "WorkflowID": "091757be-a5aa-4a5c-8e71-4d4684759f67", "RunID": "cd615364-0d65-4b7c-aa5d-eb25c40cf9ca", "ActivityType": "testActivity", "Attempt": 1, "Error": "Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Temporal\\Exception\\TransportException, retryable: true): Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 (type: Spiral\\Goridge\\RPC\\Exception\\ServiceException, retryable: true)"}

THe code is

#[WorkflowInterface]
class TestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(50))
            ->withTaskQueue('internal_saga_queue')
            ->withRetryOptions(
                (new RetryOptions())
                    ->withMaximumAttempts(1)
            );

        yield Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}

When I originally found this issue I was using yield. In the example I gave I'm not using it for the simplicities sake.

@wolfy-j
Copy link
Collaborator

wolfy-j commented Dec 3, 2021

Responded at the forum... I'm not able to reproduce this error. Can we try to dig deeper into other issues you having "broken pipe" and etc? I wonder if we are chasing the wrong bug.

@rustatian
Copy link
Collaborator

rustatian commented Dec 3, 2021

The broken pipe issue might be because of RPC plugin started before the activity RPC layer. The issue was fixed in the latest version. @Zylius Could you please retest the issue on this version -> https://github.com/spiral/roadrunner-binary/releases/tag/v2.6.3?

@Zylius
Copy link
Author

Zylius commented Dec 3, 2021

Guys, just tried it with the v2.6.3, neither with the yield nor without it can I reproduce the issue. I don't get the broken pipe, nor the heartbeat on non running activity anymore! I think it's fixed! Thank you so much you're great. Tried it on different queues etc.

🙇 Thanks

@Zylius Zylius closed this as completed Dec 3, 2021
@rustatian rustatian changed the title [BUG?] Heartbeat on non running activity [BUG] Heartbeat on non running activity Dec 3, 2021
@rustatian rustatian added B-bug Bug: bug, unhandled exception and removed F-need-verification labels Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-bug Bug: bug, unhandled exception
Projects
None yet
Development

No branches or pull requests

3 participants