
Call stays in active state upon channel error #900

Open
matiasgarciaisaia opened this issue Mar 3, 2022 · 8 comments

@matiasgarciaisaia
Member

See #778 for a similar issue.

We've observed calls in Verboice that stay in the active state for weeks, even though they actually finished, were cancelled, or errored. The one we've just seen (call in Verboice; broker's logs below) was placed via Twilio (call SID CA381882b5e5f24f714d0ba28eee084f3c in the InSTEDD 4 NCD project). The error was internal to the broker (maybe a communication issue with Surveda?); there are no errors on the Twilio side.

We should check what the error is and how to handle it properly.

04:32:41.937 [info] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") dial
                "verboice.surveda-mumbai.org/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc",
                "/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc",
                "POST /twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc HTTP/1.1",
04:32:57.756 [info] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") answer
                  "https://verboice.surveda-mumbai.org/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc"},
04:38:33.104 [error] [<0.24823.458>|91b9025e] session: Error during session "190487cd-1fa9-4707-a971-b397630e4fbc": exit:{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.105 [error] [emulator|undefined] unknown_module: Error in process <0.4328.459> on node 'verboice_broker@localhost' with exit value: {function_clause,[{http_uri,'-encode/1-lc$^0/1-0-',[{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}],[{file,"http_uri.erl"},{... 
04:38:33.106 [info] [<0.4046.459>|8da0604c] session: Call failed with reason {internal_error,{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}}
04:38:33.107 [warning] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") terminated with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.107 [error] [<0.4046.459>|undefined] unknown_module: gen_fsm {session,[49,57,48,52,56,55,99,100,45,49,102,97,57,45,52,55,48,55,45,97,57,55,49,45,98,51,57,55,54,51,48,101,52,102,98,99]} in state in_progress terminated with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.107 [error] [<0.4046.459>|undefined] unknown_module: CRASH REPORT Process <0.4046.459> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}} in gen_fsm:terminate/7 line 611
04:38:33.107 [error] [<0.145.0>|undefined] unknown_module: Supervisor session_sup had child "190487cd-1fa9-4707-a971-b397630e4fbc" started with {session,start_link,undefined} at <0.4046.459> exit with reason {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}} in context child_terminated

CC: @ggiraldez

@ysbaddaden
Contributor

It looks like a communication issue between Verboice and Surveda: the former failed to capture the MP3 file and timed out. If I understand the code correctly, the Twilio PBX crashed, the session handler noticed it and logged an internal error, but the call state never got changed from active?

@ysbaddaden
Contributor

I forgot to mention that I reviewed the issues logged in Sentry and found nothing remotely related, hence the network-issue hypothesis.

@ysbaddaden
Contributor

ysbaddaden commented Apr 26, 2022

Erratum: the capture function has nothing to do with downloading the MP3 file from Surveda! It is instead trying to capture the respondent's reply through Twilio (for a given question), and that is what times out.

  1. I believe it is this specific call (which in turn calls this in the actor):

    capture(Caption, Timeout, FinishOnKey, Min, Max, ?PBX(Pid)) ->
      case gen_server:call(Pid, {capture, Caption, Timeout, FinishOnKey, Min, Max}, timer:minutes(5)) of
        hangup -> throw(hangup);
        X -> X
      end.

  2. Either Twilio never responds, there was a network error, and/or the twilio_pbx actor crashed. In the end, the broker eventually hits the 5-minute timeout while waiting for a reply or a hangup. That crashes the actor, and the crash bubbles up to the session.
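     The timeout exit shape seen in the logs can be demonstrated in isolation. Below is a minimal sketch (a hypothetical `timeout_demo` module, not Verboice code) where the server never replies, so `gen_server:call/3` exits the caller with the same `{timeout, {gen_server, call, ...}}` reason that bubbles up to the session:

    ```erlang
    %% Hypothetical demo module: shows how a gen_server:call/3 timeout
    %% surfaces as exit({timeout, {gen_server, call, ...}}) in the caller,
    %% the same shape reported in the broker logs above.
    -module(timeout_demo).
    -behaviour(gen_server).
    -export([run/0, init/1, handle_call/3, handle_cast/2]).

    init([]) -> {ok, #{}}.

    %% The server never replies, simulating a PBX actor stuck waiting on Twilio.
    handle_call(capture, _From, State) ->
        timer:sleep(infinity),
        {reply, ok, State}.

    handle_cast(_Msg, State) -> {noreply, State}.

    run() ->
        {ok, Pid} = gen_server:start(?MODULE, [], []),
        try
            %% 100 ms here instead of the broker's timer:minutes(5)
            gen_server:call(Pid, capture, 100)
        catch
            exit:{timeout, {gen_server, call, _}} = Reason ->
                %% This is the exit that crashes the actor and reaches the session.
                {failed, {internal_error, Reason}}
        end.
    ```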

  3. It ends up calling session:finalize (with {failed, Reason}), where we properly report the error (Call failed with reason ~p), reschedule the call, and update the call log (though I'm not sure what the state value is):

    CallLog:update([{state, NewState}, {finished_at, calendar:universal_time()}] ++ FailInfo),

  4. It eventually calls session:terminate (I don't know how), where we report the error again (session (~p) terminated with reason ~p), but it may also enqueue a delayed_job (CallFlow::FusionTablesPush::Pusher)? The logic seems identical to CallLog#finish in the Ruby model.

    terminate(Reason, _, #state{session_id = SessionId, session = Session}) ->
      push_results(Session),
      case Reason of
        normal -> lager:info("Session (~p) terminated normally", [SessionId]);
        _ -> lager:warning("Session (~p) terminated with reason: ~p", [SessionId, Reason])
      end,
      poirot:pop().


Now, it seems Verboice considers the call to still be open even though the broker failed and reported it properly. Maybe the CallLog state is incorrectly updated in session:finalize? What about the delayed job?

@aeatencio
Contributor

aeatencio commented May 26, 2022

I tried hard to reproduce this bug, without success. I did reproduce the timeout error and got the same logs as in this issue.

My strongest hypothesis is very simple: this update fails.

As a consequence of this failure, in the Mumbai instance about 0.014% of the started calls remained "active" when they had actually failed.

SELECT sum(case when finished_at is null then 1 else 0 end) `unfinished`,
	count(1) `started`,
	sum(case when finished_at is null then 1 else 0 end) / count(1) * 100 as `percentage`
FROM `call_logs`
WHERE started_at is not null

|------------|---------|------------|
| unfinished | started | percentage |
|------------|---------|------------|
|        237 | 1725747 |     0.0137 |
|------------|---------|------------|
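If the hypothesis is right, the fix would make the failure visible instead of silent. A defensive sketch (hypothetical `safe_update_demo` module; the real code calls `CallLog:update/1`, here abstracted as a fun so the sketch is self-contained) that catches the DB error and returns it rather than letting the call stay "active" with no trace:

```erlang
%% Hypothetical sketch, not the actual Verboice code: wrap the call-log
%% update so a database failure is logged and surfaced instead of being lost.
-module(safe_update_demo).
-export([safe_update/1, run/0]).

safe_update(UpdateFun) ->
    try
        UpdateFun()
    catch
        Class:Reason ->
            %% In Verboice this would be lager:error/2; io:format keeps the
            %% sketch dependency-free.
            io:format("CallLog update failed: ~p:~p~n", [Class, Reason]),
            {error, Reason}
    end.

run() ->
    ok = safe_update(fun() -> ok end),                         % update succeeds
    {error, closed} = safe_update(fun() -> error(closed) end), % e.g. a dropped-connection failure
    ok.
```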

#912 doesn't fix the underlying failure; it will keep happening. But its consequence (calls stuck as active) won't be there anymore.

aeatencio pushed a commit that referenced this issue May 30, 2022
When a call remains active for too long Verboice considers there was an error and cancels it. For #900
@ysbaddaden
Contributor

Huh... is it possible that we retain a MySQL connection that gets closed during the X-minute timeout, and that this is only noticed when we try to update (EPIPE)? Could the connection be missing auto-reconnect or something?

@aeatencio
Contributor

Today we seem to have had a couple of (consistent) occurrences of this issue in STG, triggered by hanging up calls using Callcentric. Logs attached:
verboice_error_2.log
verboice_error.log

@ysbaddaden
Contributor

As outlined by @ggiraldez, as soon as we get the hang-up during the Gather operation, we try to update a row of call_logs, but the update fails, and the js_context column goes from null to an Erlang error term:

    {error,
     {unrecognized_value,
      {#Ref<0.0.931.17542>,
       {dict,1,16,16,8,80,48,
        {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
        {{[],[],[],[],[],[],[],[],[],[],
          [[#Ref<0.0.931.17542>|
            {dict,5,16,16,8,80,48,
             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
             {{[[<<"hub_url">>]],
               [],[],
               [[<<"_get_var">>|#Fun<session.17.60946945>],
                [<<"split_digits">>|#Fun<session.18.60946945>]],
               [],[],
               [[<<"phone_number">>,49,48,49]],
               [[<<"record_url">>|#Fun<session.16.60946945>]],
               [],[],[],[],[],[],[],[]}}}]],
          [],[],[],[],[]}}}}}},

@ysbaddaden
Contributor

I reproduced it with the Twilio simulator:

The simulator decided that a bunch of respondents wouldn't pick up the phone (i.e. no reply came in) and reported it by sending the no-answer status to Verboice. But Verboice didn't care and sent the first question anyway, expecting an answer (maybe real Twilio doesn't behave this way).

Maybe this is how the "capture timeout" above is triggered: Twilio reports an error status to Url or <Redirect>, Verboice overlooks it and sends the next question in response (which Twilio discards; it got a 200 status when delivering the callback and is happy), and then Verboice waits for an answer that will never come, hence the timeout during capture.
