Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bbc] Fix BBCCoUkIPlayerPlaylistIE #28360

Closed

Conversation

dirkf
Copy link
Contributor

@dirkf dirkf commented Mar 5, 2021

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

The BBCCoUkIPlayerPlaylistIE extractor was failing, whereas BBCCoUkPlaylistIE still worked for the same pids: eg, the _TESTS.

In the iPlayer series pages, the playlist metadata is now sent in a JSON expression within a <script> element, and this patch extracts the title and description from it, and modifies the _VIDEO_ID_TEMPLATE to match the episode links within the entities subobject of the JSON.

Also, the regex used to get the next page link for BBCCoUkPlaylistBaseIE didn't allow for a newline that is now sent between tags.

For multi-series iPlayer playlists, each series page is also processed.

For multi-page iPlayer playlists, the pagination regex is extended to match the "Next Page" link on these pages.

@remitamine
Copy link
Collaborator

do mix multiple changes in a single PR.

@dirkf dirkf force-pushed the dirkf-BBCCoUkIPlayerPlaylistIE-patch branch from c2c0fd4 to 4863404 Compare March 16, 2021 12:45
Allow white space between tags as now sent by BBC.
@dirkf dirkf force-pushed the dirkf-BBCCoUkIPlayerPlaylistIE-patch branch from 4863404 to dc43160 Compare March 16, 2021 13:08
@dirkf dirkf changed the title Fix BBCCoUkIPlayerPlaylistIE [bbc] Fix BBCCoUkIPlayerPlaylistIE Mar 22, 2021
@remitamine
Copy link
Collaborator

  • does not handle programmes/series with multiple seasons.
  • does not handle seasons/groups with multiple pages.

@dirkf
Copy link
Contributor Author

dirkf commented Mar 23, 2021

The current version should deal with your comments, but the download tests aren't yet validated: CI is failing because Certificate did not match expected hostname: files.pythonhosted.org when installing nose with pip install nose in the Jython configurations.

@remitamine
Copy link
Collaborator

as i was looking at this, i think there is a better way to make an optimized version that would reduce the number of requests and still support all the expected features in addition to extracting more metadata for --flat-playlist option, i will try to implement it.

@dirkf
Copy link
Contributor Author

dirkf commented Mar 24, 2021

Feel free to use the test URLs from this PR for that. See also PR #23438 for a previous solution and set of test cases.

Generally, even though the BBC site support is split into different extractors, the BBCIE extractor in particular implements multiple extraction tactics and it's difficult to know which of these are still valid -- perhaps some telemetry would help, as well as refactoring the over-sized _real_extract() method.

It seems that two types of tactics are currently valid:

  • find some JSON in JS that is part of the page; extract links and metadata from the JSON;
  • find certain links, normally with a regex search, but mostly trying to replicate a DOM search similar to JS document.querySelectorAll() without the overhead of DOM.

Maybe generic methods can be created to support these tactics?

I have further updates to fix some of the failing test cases but will hold off if you're reimplementing.

@dirkf dirkf force-pushed the dirkf-BBCCoUkIPlayerPlaylistIE-patch branch 2 times, most recently from 6920932 to a5b6df4 Compare March 24, 2021 13:15
@remitamine
Copy link
Collaborator

Generally, even though the BBC site support is split into different extractors, the BBCIE extractor in particular implements multiple extraction tactics and it's difficult to know which of these are still valid -- perhaps some telemetry would help, as well as refactoring the over-sized _real_extract() method.

It seems that two types of tactics are currently valid:

  • find some JSON in JS that is part of the page; extract links and metadata from the JSON;
  • find certain links, normally with a regex search, but mostly trying to replicate a DOM search similar to JS document.querySelectorAll() without the overhead of DOM.

Maybe generic methods can be created to support these tactics?

the main BBC Extractor has to be split in the future, this can be done by slowly extracting code that is related to a specific URL pattern into it's own extractor, it would certainly make the code more manageable.

I have further updates to fix some of the failing test cases but will hold off if you're reimplementing.

it always depend on whether there is a better way of doing thing, in the case of BBC Reel code, even thought i would preferred to introduce the code in a different way, but as the code can be reused in the future, i did accept it with only small changes, but in this case, the method that i added is quite different from the code here, so there aren't much code that can be reused(except for the tests).

@dirkf
Copy link
Contributor Author

dirkf commented Mar 24, 2021

The changes I mentioned aren't specific to the iPlayer extractor (fixes to tests, morph playlists, support simorgh playlists).

Regarding the new code, please see #15710 (comment) for a case where what started out as an iPlayer URL redirected to a programmes page in which _call_api() returned None and caused a crash.

github-actions bot added a commit to hellopony/youtube-dl that referenced this pull request Mar 24, 2021
* https://github.com/ytdl-org/youtube-dl:
  [ard] improve clip id extraction(ytdl-org#22724)(closes ytdl-org#28528)
  release 2021.03.25
  [ChangeLog] Actualize [ci skip]
  [zoom] Add new extractor(closes ytdl-org#16597, closes ytdl-org#27002, closes ytdl-org#28531)
  [extractor] escape forgotten dot for hostnames in regular expression (ytdl-org#28530)
  [bbc] fix BBC IPlayer Episodes/Group extraction(closes ytdl-org#28360)
  [youtube] Fix default value for youtube_include_dash_manifest (closes ytdl-org#28523)
leshasmlesha pushed a commit to leshasmlesha/youtube-dl that referenced this pull request Apr 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants