Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProxyNode failure #195

Closed
joeyparrish opened this issue Nov 8, 2024 · 4 comments · Fixed by #211 or #209
Closed

ProxyNode failure #195

joeyparrish opened this issue Nov 8, 2024 · 4 comments · Fixed by #211 or #209
Assignees
Labels
status: archived Archived and locked; will not be updated type: bug Something isn't working correctly
Milestone

Comments

@joeyparrish
Copy link
Member

Our Shaka Player History stream went down last night. The first errors in the log look like this:

Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn = connection.create_connection(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
Nov 08 09:54:17 xvfb-run[2896901]:     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
Nov 08 09:54:17 xvfb-run[2896901]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Nov 08 09:54:17 xvfb-run[2896901]: Traceback (most recent call last):
Nov 08 09:54:17 xvfb-run[2896901]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn = connection.create_connection(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
Nov 08 09:54:17 xvfb-run[2896901]:     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
Nov 08 09:54:17 xvfb-run[2896901]: During handling of the above exception, another exception occurred:
Nov 08 09:54:17 xvfb-run[2896901]: Traceback (most recent call last):
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
Nov 08 09:54:17 xvfb-run[2896901]:     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
Nov 08 09:54:17 xvfb-run[2896901]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
Nov 08 09:54:17 xvfb-run[2896901]:     httplib_response = self._make_request(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
Nov 08 09:54:17 xvfb-run[2896901]:     self._validate_conn(conn)
Nov 08 09:54:17 xvfb-run[2896901]: During handling of the above exception, another exception occurred:
Nov 08 09:54:17 xvfb-run[2896901]: Traceback (most recent call last):
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1017, in _validate_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn.connect()
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
Nov 08 09:54:17 xvfb-run[2896901]:     conn = self._new_conn()
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
Nov 08 09:54:17 xvfb-run[2896901]:     httplib_response = self._make_request(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
Nov 08 09:54:17 xvfb-run[2896901]:     self._validate_conn(conn)
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     raise NewConnectionError(
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1017, in _validate_conn
Nov 08 09:54:17 xvfb-run[2896901]:     conn.connect()
Nov 08 09:54:17 xvfb-run[2896901]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x76b1934a8ee0>: Failed to establish a new connection: [Errno -3>
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
Nov 08 09:54:17 xvfb-run[2896901]:     conn = self._new_conn()
Nov 08 09:54:17 xvfb-run[2896901]: During handling of the above exception, another exception occurred:
Nov 08 09:54:17 xvfb-run[2896901]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in _new_conn
Nov 08 09:54:17 xvfb-run[2896901]:     raise NewConnectionError(

This eventually culminates in:

Nov 08 10:14:40 xvfb-run[2896901]: 127.0.0.1 - - [08/Nov/2024 10:14:40] "PUT /player-source/video_720p_2M_h264_14147.mp4 HTTP/1.1" 500 -
Nov 08 10:14:40 xvfb-run[2896935]: E1108 10:14:40.138181 2896941 http_file.cc:258] HttpFile request failed: 7 (HTTP_FAILURE): HTTP response code said error, response code: 500.
Nov 08 10:14:40 xvfb-run[2896935]: E1108 10:14:40.138217 2896941 demuxer.cc:371] Failed to process sample 256 5 (FILE_FAILURE): Cannot close file http://localhost:37433/player-source/v>
Nov 08 10:14:40 xvfb-run[2896935]: E1108 10:14:40.138223 2896941 mp4_media_parser.cc:1030] Failed to process the sample.

There have been no more errors, but the logged throughput has fallen from 0.979x at 10:15:

Nov 08 10:14:56 xvfb-run[2896933]: frame=1417701 fps= 24 q=24.0 q=20.0 q=-0.0 q=-0.0 q=32.0 q=33.0 size=13834013KiB time=15:45:07.24 bitrate=1998.5kbits/s speed=0.979x

to 0.658x at 18:08:

Nov 08 18:08:04 xvfb-run[2896933]: frame=1417701 fps= 16 q=24.0 q=20.0 q=-0.0 q=-0.0 q=32.0 q=33.0 size=13834013KiB time=15:45:07.24 bitrate=1998.5kbits/s speed=0.658x

The system does not appear to be recovering gracefully from the initial errors.

@joeyparrish joeyparrish added the type: bug Something isn't working correctly label Nov 8, 2024
@joeyparrish joeyparrish self-assigned this Nov 8, 2024
@github-actions github-actions bot added this to the v1.0 milestone Nov 8, 2024
@joeyparrish joeyparrish modified the milestones: v1.0, v1.1 Nov 8, 2024
@joeyparrish
Copy link
Member Author

The encoder for our live stream is currently in my home. Just found this in my personal email from the 8th:

[Name of internet provider] is currently conducting planned equipment maintenance in your area. During this window, your internet connection will be disrupted for approximately 2 hours.

So that explains the outage. It should be fairly simple to replicate by pulling a cable once we have a potential solution. Ideally, the encoder should be able to recover from an intermittent network outage.

After making the initial report, I neglected to restart the service. (Oops.) After checking it just now, it was still stuck. So it definitely does not recover with time only.

The service has been restarted for now.

@joeyparrish
Copy link
Member Author

joeyparrish commented Nov 11, 2024

I'm running an experiment now where I deliberately block uploads with a firewall rule:

sudo iptables -A OUTPUT -p tcp --destination-port 443 -j REJECT --reject-with tcp-reset

If this adequately reproduces the outage, I can then test a recovery mechanism.

@joeyparrish
Copy link
Member Author

After the first error in the log, I removed the firewall rule. According to tcpdump, Packager stops trying to write to the local ProxyNode. The ffmpeg throughput rate continues to drop in the logs. So I think this is a faithful enough repro.

@joeyparrish
Copy link
Member Author

With a prospective fix in place, I blocked connections after 7 minutes of streaming. 12 minutes later, with ffmpeg throughput having fallen from 0.996x to 0.537x, I removed the block from iptables.

In that time, the delay of retries and timeouts meant that only 2x 4-second segments failed to write. The number of missing segments was found by looking at cloud storage contents.

After restoring connections, the average ffmpeg throughput has been climbing, and new segments are being written to cloud storage faster than realtime. (About 1.133x.) The rate of uploads was found by counting segments in cloud storage over time.

So I think the fix is working!

joeyparrish added a commit to joeyparrish/shaka-packager that referenced this issue Nov 11, 2024
When uploading a live stream, if a single segment upload fails, we shouldn't give up.

This adds an option to ignore HTTP output failures so that a single failed upload does not result in a hung stream.  See shaka-project/shaka-streamer#195 for details.
joeyparrish added a commit to shaka-project/shaka-packager that referenced this issue Nov 11, 2024
When uploading a live stream, if a single segment upload fails, we shouldn't give up.

This adds an option to ignore HTTP output failures so that a single failed upload does not result in a hung stream.  See shaka-project/shaka-streamer#195 for details.
joeyparrish added a commit to joeyparrish/shaka-streamer that referenced this issue Nov 11, 2024
Instruct Packager to ignore HTTP output failures, so that the pipeline overall doesn't fail during a live stream.  Once the system recovers, segments will be missing, but the stream overall will survive.

This requires a new release of Shaka Packager (v3.4.0).

Closes shaka-project#195
@github-actions github-actions bot added the status: archived Archived and locked; will not be updated label Jan 10, 2025
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 10, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
status: archived Archived and locked; will not be updated type: bug Something isn't working correctly
Projects
None yet
1 participant