This issue was moved to a discussion.

Proposal: Support multilingual transcripts #367

Closed
ryan-lp opened this issue Apr 14, 2022 · 3 comments

ryan-lp commented Apr 14, 2022

Some podcasts are multilingual, where each episode might use a different language, or even where a single episode may switch between multiple languages.

It is already possible to list multiple languages in the channel of the RSS feed (e.g. <language>en,es</language> on the channel), and perhaps there should also be a similar optional tag on each item that defaults to the channel language, because that may be helpful when each episode is in a different language.

But when a single episode contains multiple languages, we also need a way to tag which text belongs to which language within the transcript.

I am not sure if there is an obvious way to do it in every format, but for JSON, we can add an optional language property to each segment which defaults to the item's language in the RSS feed, as follows:

    {
      "speaker": "Darth Vader",
      "startTime": 0.5,
      "endTime": 0.75,
      "body": "I",
      "language": "en"
    }
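To illustrate the proposed default, here is a minimal Python sketch of how a consumer might resolve each segment's language. It assumes the segments sit in a top-level "segments" array and that the item's language has already been read from the RSS feed; the function name resolve_segment_languages is hypothetical and not part of any spec.

```python
import json


def resolve_segment_languages(transcript_json: str, item_language: str) -> list[tuple[str, str]]:
    """Return (body, language) pairs for every segment.

    Assumption: segments live in a top-level "segments" array.
    A segment without the proposed "language" property falls back
    to the language of the RSS item, as the proposal suggests.
    """
    transcript = json.loads(transcript_json)
    return [
        (segment["body"], segment.get("language", item_language))
        for segment in transcript["segments"]
    ]


if __name__ == "__main__":
    sample = json.dumps({
        "segments": [
            {"speaker": "Darth Vader", "startTime": 0.5, "endTime": 0.75,
             "body": "I", "language": "en"},
            # No "language" property here, so the item language applies.
            {"speaker": "Darth Vader", "startTime": 0.75, "endTime": 1.0,
             "body": "soy"},
        ]
    })
    print(resolve_segment_languages(sample, item_language="es"))
```

Because the property is optional, existing single-language transcripts would stay valid without changes.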

For WebVTT, maybe this information could be placed in a comment.
For SRT, maybe this information could be encoded in parentheses or some other type of brackets.
For HTML, maybe this can use the lang attribute.
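Of the three suggestions above, the HTML one is the easiest to sketch. The following Python snippet shows one possible rendering, assuming each segment becomes a span element carrying a lang attribute; the function name segments_to_html and the choice of span are illustrative assumptions, not something the transcript spec prescribes.

```python
from html import escape


def segments_to_html(segments: list[dict], default_language: str) -> str:
    """Render segments as <span> elements tagged with the HTML lang attribute.

    Assumption: one <span> per segment; a segment without a "language"
    key inherits default_language (e.g. the item's language).
    """
    parts = []
    for segment in segments:
        language = segment.get("language", default_language)
        parts.append(
            f'<span lang="{escape(language, quote=True)}">{escape(segment["body"])}</span>'
        )
    return " ".join(parts)


if __name__ == "__main__":
    print(segments_to_html(
        [{"body": "I", "language": "en"}, {"body": "soy"}],
        default_language="es",
    ))
```

How the same information would be carried in WebVTT or SRT remains an open question in this proposal.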

saerdnaer (Contributor) commented

Why not lang instead of language?

ryan-lp (Author) commented Apr 17, 2022

The RSS spec defines a language tag, while the HTML spec defines a lang attribute. I have followed the existing style of each format, and the JSON format uses unabbreviated words. (Otherwise we could certainly make most of the other JSON names shorter too, but there is no difference after compression.)

ryan-lp (Author) commented Apr 25, 2022

> perhaps there should also be a similar optional tag on each item that defaults to the channel language, because that may be helpful when each episode is in a different language.

For this part of the problem, we can actually use the existing xml:lang attribute:

    <item xml:lang="pt">
      ...
    </item>
    <item xml:lang="es">
      ...
    </item>
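As a rough sketch of how a feed consumer could read this, the Python snippet below uses the standard library's xml.etree.ElementTree to pick up xml:lang on each item and fall back to the channel's language tag when the attribute is absent. The function name item_languages and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def item_languages(feed_xml: str) -> list[str]:
    """Return the language of each <item>, defaulting to the channel's <language>."""
    channel = ET.fromstring(feed_xml).find("channel")
    channel_language = channel.findtext("language", default="")
    return [item.get(XML_LANG, channel_language) for item in channel.findall("item")]


if __name__ == "__main__":
    # Hypothetical feed used only for this example.
    feed = """<rss version="2.0"><channel>
      <language>en</language>
      <item xml:lang="pt"><title>A</title></item>
      <item xml:lang="es"><title>B</title></item>
      <item><title>C</title></item>
    </channel></rss>"""
    print(item_languages(feed))
```

The xml: prefix is predefined by the XML specification, so the feed needs no extra namespace declaration for this to parse.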

Technically, these existing language/lang tags and attributes are specified to hold only one language, although in practice creators of multilingual podcasts do use comma-delimited lists in the language tag to hold multiple languages.
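A consumer that wants to tolerate this practice could split the value itself. The helper below is a minimal sketch; the name split_languages is hypothetical, and it assumes a plain comma separator with optional surrounding whitespace.

```python
def split_languages(value: str) -> list[str]:
    """Split a language value such as "en,es" into individual codes.

    Assumption: codes are separated by commas; blank entries are dropped.
    A single-language value yields a one-element list.
    """
    return [code.strip() for code in value.split(",") if code.strip()]


if __name__ == "__main__":
    print(split_languages("en,es"))
    print(split_languages("pt"))
```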

@Podcastindex-org Podcastindex-org locked and limited conversation to collaborators Mar 25, 2023
@daveajones daveajones converted this issue into discussion #483 Mar 25, 2023
