[Feature] Default language on Transcript class #133

Open
arturoalcibia opened this issue Nov 11, 2021 · 13 comments
Labels
enhancement New feature or request

Comments

@arturoalcibia

arturoalcibia commented Nov 11, 2021

Hello! It'd be great to have the default language of a video available as an attribute on the TranscriptList class.

I've been able to get this by accessing the list of subtitles from this URL:

Ex:

If more than one subtitle is available, the XML will contain a "default_lang" key, which is the language the user chose for the video when uploading it.

I have a M.R. ready but wanted to submit it as an issue in case someone was already working on something similar or had a better approach.

@jdepoix
Owner

jdepoix commented Dec 16, 2021

Hi @arturoalcibia,
sorry for the late reply, somehow I must've missed this issue...

I am not really sure what functionality you are asking for exactly. You are currently able to retrieve transcripts in different languages using

YouTubeTranscriptApi.get_transcript(video_id, languages=['de', 'en'])

or

YouTubeTranscriptApi.list_transcripts(video_id).find_transcript(['de', 'en']).fetch()

What use case do you have which is not covered by these methods?

@arturoalcibia
Author

arturoalcibia commented Jan 4, 2022

Hi @jdepoix,

no worries.

This would give us access to the caption track the user intended to be played by default, which is usually the language of the video.

As an example, this video contains multiple manually created tracks: https://www.youtube.com/watch?v=UOgvbS4GkF0
But English is the one the user set to default.

You can find which track is set as the default by looking for the "defaultCaptionTrackIndex" key in the returned HTML.

In this case, the HTML has "defaultCaptionTrackIndex" set to 3, which corresponds to the English track.

Here's a quick and dirty snippet to get the index (which refers to the English track):

import requests
from youtube_transcript_api._transcripts import TranscriptListFetcher

videoId = 'UOgvbS4GkF0'

with requests.Session() as http_client:
    tListFetcher = TranscriptListFetcher(http_client)
    # Fetch the watch page once and reuse it (these are private helpers and may change).
    htmlContent = tListFetcher._fetch_video_html(videoId)
    captions_json = tListFetcher._extract_captions_json(htmlContent, videoId)
    # Index of the caption track the uploader marked as default (0 if the key is missing).
    defaultCaptionIndex = captions_json['audioTracks'][0].get('defaultCaptionTrackIndex', 0)
    print(defaultCaptionIndex)
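
A minimal sketch of how that index could be mapped back to a language code. It assumes the captions JSON has a `captionTracks` list whose entries carry a `languageCode`, and that `defaultCaptionTrackIndex` is an index into that list. The helper name `default_language_code` is made up for illustration, and the sample data is hand-written rather than taken from a real response.

```python
def default_language_code(captions_json):
    """Return the language code of the default caption track, or None.

    Assumes `defaultCaptionTrackIndex` is an index into `captionTracks`.
    """
    tracks = captions_json.get('captionTracks', [])
    audio_tracks = captions_json.get('audioTracks', [])
    if not tracks or not audio_tracks:
        return None
    # Same lookup as the snippet above; falls back to the first track.
    index = audio_tracks[0].get('defaultCaptionTrackIndex', 0)
    if not 0 <= index < len(tracks):
        return None
    return tracks[index].get('languageCode')


# Hand-written sample data, not a real YouTube response.
sample = {
    'captionTracks': [
        {'languageCode': 'de'},
        {'languageCode': 'es'},
        {'languageCode': 'fr'},
        {'languageCode': 'en'},
    ],
    'audioTracks': [{'defaultCaptionTrackIndex': 3}],
}
print(default_language_code(sample))  # en
```

Returning `None` for a missing or out-of-range index is just one possible choice; raising an error would work as well.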

I'd be happy to contribute with a proper M.R. on this.

@jdepoix
Owner

jdepoix commented Jan 4, 2022

Hi @arturoalcibia,

okay, that makes sense. In that case, the default language would have to be added as a param to the TranscriptList constructor, and the TranscriptList.build method would have to determine the default language and set it. The language_codes params on find_manually_created_transcript, find_generated_transcript and find_transcript would have to become optional; if they are not set, the default language would be used.
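
The change described above could look roughly like the following. This is a simplified stand-in, not the library's actual `TranscriptList`: the class `TranscriptListSketch`, its dict-based storage, the `default_language_code` parameter and the use of `LookupError` are all assumptions made for illustration.

```python
class TranscriptListSketch:
    """Toy stand-in for TranscriptList; not the real implementation."""

    def __init__(self, video_id, transcripts_by_language, default_language_code=None):
        self.video_id = video_id
        # Maps language code -> transcript (plain strings here for brevity).
        self._transcripts = transcripts_by_language
        # Hypothetical attribute: the language the uploader marked as default.
        self.default_language_code = default_language_code

    def find_transcript(self, language_codes=None):
        """Return the first available transcript for `language_codes`.

        If no codes are given, fall back to the uploader's default language.
        """
        if language_codes is None:
            if self.default_language_code is None:
                raise LookupError('no language codes given and no default language known')
            language_codes = [self.default_language_code]
        for code in language_codes:
            if code in self._transcripts:
                return self._transcripts[code]
        raise LookupError('no transcript found for {}'.format(language_codes))


transcripts = TranscriptListSketch(
    'UOgvbS4GkF0',
    {'de': 'German transcript', 'en': 'English transcript'},
    default_language_code='en',
)
print(transcripts.find_transcript())        # falls back to the default language
print(transcripts.find_transcript(['de']))  # explicit codes take precedence
```

Explicitly passed language codes keep working as before; only the no-argument call changes behaviour.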

Of course any contributions on this are very much welcome! 😊

My only concern is that this would change the default behaviour of this module and could break people's code if they expect English subtitles (since that's what they've been getting by simply calling get_transcript). However, using the default language provided by the uploader seems like a more fitting default for this module, so maybe we should accept this breaking change. Any thoughts on this?

@arturoalcibia
Author

Hi @jdepoix,

Sounds good. I agree that the breaking change seems worth it; adding an extra function or argument to return the default language seems like overkill and would get confusing. I also think having English as the default language feels arbitrary. Returning the default language provided by the user looks cleaner.

jdepoix added the enhancement label Jan 5, 2022
@arturoalcibia
Author

Hi @jdepoix,

I think I have a working version of this feature. Would it be possible to be added as a contributor so I can submit an M.R.?

@jdepoix
Owner

jdepoix commented Jan 5, 2022

Hi @arturoalcibia, you don't need to be a contributor to submit a PR. You can simply submit a PR from your fork. Read this to find out more!

@jdepoix
Owner

jdepoix commented Dec 6, 2022

Hi @arturoalcibia,
as this topic just came up in #177, is this something you are still working on? Is there anything I can help you with?

@arturoalcibia
Author

Hi @jdepoix,
My bad! I worked on it but forgot to submit the PR. If that's okay, I will submit it this weekend for review.

@jdepoix
Owner

jdepoix commented Dec 7, 2022

@arturoalcibia no worries, I always appreciate contributions of any kind 😊

@dcsilver

Any progress on this? I'm using the CLI, and it'd be great to have a flag that just returns the default language of the video.

@jdepoix
Owner

jdepoix commented Mar 20, 2023

@dcsilver I haven't done any active development on this. Apparently, @arturoalcibia has been working on a PR, but hasn't turned it in so far. Any news on this @arturoalcibia?

@KhaledLela

Any update about defaultAudioLanguage?

@jdepoix
Owner

jdepoix commented Jun 12, 2023

@KhaledLela sorry, I haven't done any development on this and @arturoalcibia has unfortunately never turned in that PR.
