ProxyNode failure #195

joeyparrish · 2024-11-08T18:20:23Z

Our Shaka Player History stream went down last night. The first errors in the log look like this:

Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn = connection.create_connection(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
Nov 08 09:54:17 xvfb-run[2896901]:     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
Nov 08 09:54:17 xvfb-run[2896901]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Nov 08 09:54:17 xvfb-run[2896901]: Traceback (most recent call last):
Nov 08 09:54:17 xvfb-run[2896901]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn = connection.create_connection(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
Nov 08 09:54:17 xvfb-run[2896901]:     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
Nov 08 09:54:17 xvfb-run[2896901]: During handling of the above exception, another exception occurred:
Nov 08 09:54:17 xvfb-run[2896901]: Traceback (most recent call last):
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
Nov 08 09:54:17 xvfb-run[2896901]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Nov 08 09:54:17 xvfb-run[2896901]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
Nov 08 09:54:17 xvfb-run[2896901]:     httplib_response = self._make_request(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
Nov 08 09:54:17 xvfb-run[2896901]:     self._validate_conn(conn)
Nov 08 09:54:17 xvfb-run[2896901]: During handling of the above exception, another exception occurred:
Nov 08 09:54:17 xvfb-run[2896901]: Traceback (most recent call last):
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1017, in _validate_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn.connect()
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
Nov 08 09:54:17 xvfb-run[2896901]:     conn = self._new_conn()
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
Nov 08 09:54:17 xvfb-run[2896901]:     httplib_response = self._make_request(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
Nov 08 09:54:17 xvfb-run[2896901]:     self._validate_conn(conn)
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     raise NewConnectionError(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1017, in _validate_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn.connect()
Nov 08 09:54:17 xvfb-run[2896901]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x76b1934a8ee0>: Failed to establish a new connection: [Errno -3>
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
Nov 08 09:54:17 xvfb-run[2896901]:     conn = self._new_conn()
Nov 08 09:54:17 xvfb-run[2896901]: During handling of the above exception, another exception occurred:
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     raise NewConnectionError(

This eventually culminates in:

Nov 08 10:14:40 xvfb-run[2896901]: 127.0.0.1 - - [08/Nov/2024 10:14:40] "PUT /player-source/video_720p_2M_h264_14147.mp4 HTTP/1.1" 500 -
Nov 08 10:14:40 xvfb-run[2896935]: E1108 10:14:40.138181 2896941 http_file.cc:258] HttpFile request failed: 7 (HTTP_FAILURE): HTTP response code said error, response code: 500.
Nov 08 10:14:40 xvfb-run[2896935]: E1108 10:14:40.138217 2896941 demuxer.cc:371] Failed to process sample 256 5 (FILE_FAILURE): Cannot close file http://localhost:37433/player-source/v>
Nov 08 10:14:40 xvfb-run[2896935]: E1108 10:14:40.138223 2896941 mp4_media_parser.cc:1030] Failed to process the sample.

There have been no more errors, but the logged throughput has fallen from 0.979x at 10:15:

Nov 08 10:14:56 xvfb-run[2896933]: frame=1417701 fps= 24 q=24.0 q=20.0 q=-0.0 q=-0.0 q=32.0 q=33.0 size=13834013KiB time=15:45:07.24 bitrate=1998.5kbits/s speed=0.979x

to 0.658x at 18:08:

Nov 08 18:08:04 xvfb-run[2896933]: frame=1417701 fps= 16 q=24.0 q=20.0 q=-0.0 q=-0.0 q=32.0 q=33.0 size=13834013KiB time=15:45:07.24 bitrate=1998.5kbits/s speed=0.658x

The system does not appear to be recovering gracefully from the initial errors.

The text was updated successfully, but these errors were encountered:

joeyparrish · 2024-11-10T03:05:21Z

The encoder for our live stream is currently in my home. Just found this in my personal email from the 8th:

[Name of internet provider] is currently conducting planned equipment maintenance in your area. During this window, your internet connection will be disrupted for approximately 2 hours.

So that explains the outage. It should be fairly simple to replicate by pulling a cable once we have a potential solution. Ideally, the encoder should be able to recover from an intermittent network outage.

After making the initial report, I neglected to restart the service. (Oops.) After checking it just now, it was still stuck. So it definitely does not recover with time only.

The service has been restarted for now.

joeyparrish · 2024-11-11T04:02:11Z

I'm running an experiment now where I deliberately block uploads with a firewall rule:

sudo iptables -A OUTPUT -p tcp --destination-port 443 -j REJECT --reject-with tcp-reset

If this adequately reproduces the outage, I can then test a recovery mechanism.

joeyparrish · 2024-11-11T04:44:00Z

After the first error in the log, I removed the firewall rule. According to tcpdump, Packager stops trying to write to the local ProxyNode. The ffmpeg throughput rate continues to drop in the logs. So I think this is a faithful enough repro.

joeyparrish · 2024-11-11T05:18:23Z

With a prospective fix in place, I blocked connections after 7 minutes of streaming. 12 minutes later, with ffmpeg throughput having fallen from 0.996x to 0.537x, I removed the block from iptables.

In that time, the delay of retries and timeouts meant that only 2x 4-second segments failed to write. The number of missing segments was found by looking at cloud storage contents.

After restoring connections, the average ffmpeg throughput has been climbing, and new segments are being written to cloud storage faster than realtime. (About 1.133x.) The rate of uploads was found by counting segments in cloud storage over time.

So I think the fix is working!

When uploading a live stream, if a single segment upload fails, we shouldn't give up. This adds an option to ignore HTTP output failures so that a single failed upload does not result in a hung stream. See shaka-project/shaka-streamer#195 for details.

Instruct Packager to ignore HTTP output failures, so that the pipeline overall doesn't fail during a live stream. Once the system recovers, segments will be missing, but the stream overall will survive. This requires a new release of Shaka Packager (v3.4.0). Closes shaka-project#195

joeyparrish added the type: bug Something isn't working correctly label Nov 8, 2024

joeyparrish self-assigned this Nov 8, 2024

github-actions bot added this to the v1.0 milestone Nov 8, 2024

joeyparrish modified the milestones: v1.0, v1.1 Nov 8, 2024

joeyparrish mentioned this issue Nov 11, 2024

feat: Add option to ignore HTTP output failures shaka-project/shaka-packager#1449

Merged

joeyparrish mentioned this issue Nov 11, 2024

feat: Recover from upload failures #211

Merged

joeyparrish closed this as completed in #211 Nov 11, 2024

joeyparrish closed this as completed in 278d775 Nov 11, 2024

shaka-bot mentioned this issue Nov 11, 2024

chore(main): release 1.1.0 #209

Merged

github-actions bot added the status: archived Archived and locked; will not be updated label Jan 10, 2025

github-actions bot locked as resolved and limited conversation to collaborators Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ProxyNode failure #195

ProxyNode failure #195

joeyparrish commented Nov 8, 2024

joeyparrish commented Nov 10, 2024

joeyparrish commented Nov 11, 2024 •

edited

Loading

joeyparrish commented Nov 11, 2024

joeyparrish commented Nov 11, 2024

ProxyNode failure #195

ProxyNode failure #195

Comments

joeyparrish commented Nov 8, 2024

joeyparrish commented Nov 10, 2024

joeyparrish commented Nov 11, 2024 • edited Loading

joeyparrish commented Nov 11, 2024

joeyparrish commented Nov 11, 2024

joeyparrish commented Nov 11, 2024 •

edited

Loading