
Call stays in active state upon channel error #900

Open
matiasgarciaisaia opened this issue Mar 3, 2022 · 8 comments

@matiasgarciaisaia
Member

See #778 for a similar issue.

We've observed calls in Verboice that stay in the active state for weeks, even though they actually finished, were cancelled, or errored. The one we've just seen (call in Verboice; broker's logs below) was placed via Twilio (call SID CA381882b5e5f24f714d0ba28eee084f3c in the InSTEDD 4 NCD project). The error was internal to the broker (maybe a communication issue with Surveda?); there are no errors on the Twilio side.

We should check what the error is and how to handle it properly.

04:32:41.937 [info] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") dial
                "verboice.surveda-mumbai.org/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc",
                "/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc",
                "POST /twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc HTTP/1.1",
04:32:57.756 [info] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") answer
                  "https://verboice.surveda-mumbai.org/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc"},
04:38:33.104 [error] [<0.24823.458>|91b9025e] session: Error during session "190487cd-1fa9-4707-a971-b397630e4fbc": exit:{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.105 [error] [emulator|undefined] unknown_module: Error in process <0.4328.459> on node 'verboice_broker@localhost' with exit value: {function_clause,[{http_uri,'-encode/1-lc$^0/1-0-',[{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}],[{file,"http_uri.erl"},{... 
04:38:33.106 [info] [<0.4046.459>|8da0604c] session: Call failed with reason {internal_error,{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}}
04:38:33.107 [warning] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") terminated with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.107 [error] [<0.4046.459>|undefined] unknown_module: gen_fsm {session,[49,57,48,52,56,55,99,100,45,49,102,97,57,45,52,55,48,55,45,97,57,55,49,45,98,51,57,55,54,51,48,101,52,102,98,99]} in state in_progress terminated with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.107 [error] [<0.4046.459>|undefined] unknown_module: CRASH REPORT Process <0.4046.459> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}} in gen_fsm:terminate/7 line 611
04:38:33.107 [error] [<0.145.0>|undefined] unknown_module: Supervisor session_sup had child "190487cd-1fa9-4707-a971-b397630e4fbc" started with {session,start_link,undefined} at <0.4046.459> exit with reason {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}} in context child_terminated

CC: @ggiraldez

@ysbaddaden
Contributor

It looks like a communication issue between Verboice and Surveda: the former failed to capture the MP3 file and timed out. If I understand the code correctly, the Twilio PBX crashed, the session handler noticed it and logged an internal error, but the call state never got changed from active?

@ysbaddaden
Contributor

I forgot to mention that I reviewed the issues logged in Sentry and found nothing remotely related, hence the network-issue hypothesis.

@ysbaddaden
Contributor

ysbaddaden commented Apr 26, 2022

Erratum: the capture function has nothing to do with downloading the MP3 file from Surveda! It is instead trying to capture the respondent's reply through Twilio (for a given question), and that is what times out.

  1. I believe it is this specific call (which in turn calls this in the actor):

    capture(Caption, Timeout, FinishOnKey, Min, Max, ?PBX(Pid)) ->
      case gen_server:call(Pid, {capture, Caption, Timeout, FinishOnKey, Min, Max}, timer:minutes(5)) of
        hangup -> throw(hangup);
        X -> X
      end.

  2. Either Twilio never responds, there was a network error, and/or the twilio_pbx actor crashed. In the end, the broker eventually hits the 5-minute timeout while waiting for a reply or a hangup. That crashes the actor, and the crash bubbles up to the session.
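     The timeout exit shape seen in the logs can be demonstrated in isolation. Below is a minimal sketch (a hypothetical `timeout_demo` module, not Verboice code) where the server never replies, so `gen_server:call/3` exits the caller with the same `{timeout, {gen_server, call, ...}}` reason that bubbles up to the session:

    ```erlang
    %% Hypothetical demo module: shows how a gen_server:call/3 timeout
    %% surfaces as exit({timeout, {gen_server, call, ...}}) in the caller,
    %% the same shape reported in the broker logs above.
    -module(timeout_demo).
    -behaviour(gen_server).
    -export([run/0, init/1, handle_call/3, handle_cast/2]).

    init([]) -> {ok, #{}}.

    %% The server never replies, simulating a PBX actor stuck waiting on Twilio.
    handle_call(capture, _From, State) ->
        timer:sleep(infinity),
        {reply, ok, State}.

    handle_cast(_Msg, State) -> {noreply, State}.

    run() ->
        {ok, Pid} = gen_server:start(?MODULE, [], []),
        try
            %% 100 ms here instead of the broker's timer:minutes(5)
            gen_server:call(Pid, capture, 100)
        catch
            exit:{timeout, {gen_server, call, _}} = Reason ->
                %% This is the exit that crashes the actor and reaches the session.
                {failed, {internal_error, Reason}}
        end.
    ```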

  3. It ends up calling session:finalize (with {failed, Reason}), where we properly report the error (Call failed with reason ~p), reschedule the call, and update the call log (though I'm not sure what the state value is):

    CallLog:update([{state, NewState}, {finished_at, calendar:universal_time()}] ++ FailInfo),

  4. It eventually calls session:terminate (I don't know how), where we report the error again (session (~p) terminated with reason ~p), but it may also enqueue a delayed_job (CallFlow::FusionTablesPush::Pusher)? The logic seems identical to CallLog#finish in the Ruby model.

    terminate(Reason, _, #state{session_id = SessionId, session = Session}) ->
      push_results(Session),
      case Reason of
        normal -> lager:info("Session (~p) terminated normally", [SessionId]);
        _ -> lager:warning("Session (~p) terminated with reason: ~p", [SessionId, Reason])
      end,
      poirot:pop().


Now, it seems Verboice considers the call to still be open even though the broker failed and reported it properly. Maybe the CallLog state is incorrectly updated in session:finalize? What about the delayed job?

@aeatencio
Contributor

aeatencio commented May 26, 2022

I tried hard to reproduce this bug, without success. I did reproduce the timeout error and got the same logs as in this issue.

My strongest hypothesis is very simple: this update fails.

As a consequence of this failure, in the Mumbai instance about 0.014% of the started calls remained "active" when they had actually failed.

SELECT sum(case when finished_at is null then 1 else 0 end) `unfinished`,
	count(1) `started`,
	sum(case when finished_at is null then 1 else 0 end) / count(1) * 100 as `percentage`
FROM `call_logs`
WHERE started_at is not null

|------------|---------|------------|
| unfinished | started | percentage |
|------------|---------|------------|
|        237 | 1725747 |     0.0137 |
|------------|---------|------------|
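If the hypothesis is right, the fix would make the failure visible instead of silent. A defensive sketch (hypothetical `safe_update_demo` module; the real code calls `CallLog:update/1`, here abstracted as a fun so the sketch is self-contained) that catches the DB error and returns it rather than letting the call stay "active" with no trace:

```erlang
%% Hypothetical sketch, not the actual Verboice code: wrap the call-log
%% update so a database failure is logged and surfaced instead of being lost.
-module(safe_update_demo).
-export([safe_update/1, run/0]).

safe_update(UpdateFun) ->
    try
        UpdateFun()
    catch
        Class:Reason ->
            %% In Verboice this would be lager:error/2; io:format keeps the
            %% sketch dependency-free.
            io:format("CallLog update failed: ~p:~p~n", [Class, Reason]),
            {error, Reason}
    end.

run() ->
    ok = safe_update(fun() -> ok end),                         % update succeeds
    {error, closed} = safe_update(fun() -> error(closed) end), % e.g. a dropped-connection failure
    ok.
```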

#912 doesn't fix the underlying failure; it will keep happening. But its consequence (calls stuck as active) won't be there anymore.

aeatencio pushed a commit that referenced this issue May 30, 2022
When a call remains active for too long Verboice considers there was an error and cancels it. For #900
@ysbaddaden
Contributor

Huh... is it possible that we retain a MySQL connection that gets closed during the X-minute timeout, and that this is only noticed when we try to update (EPIPE)? Could the connection be missing auto-reconnect or something?

@aeatencio
Contributor

Today we seem to have had a couple of (consistent) occurrences of this issue in STG, triggered by hanging up calls using Callcentric. Logs attached:
verboice_error_2.log
verboice_error.log

@ysbaddaden
Contributor

As outlined by @ggiraldez, as soon as we get the hang-up during the Gather operation, we try to update a row of call_logs, but the update fails, and the js_context column goes from null to an Erlang error term:

    {error,
     {unrecognized_value,
      {#Ref<0.0.931.17542>,
       {dict,1,16,16,8,80,48,
        {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
        {{[],[],[],[],[],[],[],[],[],[],
          [[#Ref<0.0.931.17542>|
            {dict,5,16,16,8,80,48,
             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
             {{[[<<"hub_url">>]],
               [],[],
               [[<<"_get_var">>|#Fun<session.17.60946945>],
                [<<"split_digits">>|#Fun<session.18.60946945>]],
               [],[],
               [[<<"phone_number">>,49,48,49]],
               [[<<"record_url">>|#Fun<session.16.60946945>]],
               [],[],[],[],[],[],[],[]}}}]],
          [],[],[],[],[]}}}}}},

@ysbaddaden
Contributor

I reproduced it with the Twilio simulator:

The simulator decided that a bunch of respondents wouldn't pick up the phone (i.e. no reply came in) and reported it by sending the no-answer status to Verboice. But Verboice didn't care and sent the first question anyway, expecting an answer (maybe real Twilio doesn't behave this way).

Maybe this is how the "capture timeout" above is triggered: Twilio reports an error status to Url or <Redirect>, Verboice overlooks it and sends the next question in response (which Twilio discards; it got a 200 status when delivering the callback and is happy), and then Verboice waits for an answer that will never come, hence the timeout during capture.
