
Proposed Element : Transcript #9829

Open
brennanyoung opened this issue Oct 3, 2023 · 5 comments
Labels
accessibility Affects accessibility addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest topic: media

Comments

@brennanyoung

brennanyoung commented Oct 3, 2023

What problem are you trying to solve?

WCAG calls for transcripts as a text alternative for audio-only speech, and for spoken video soundtracks in some cases.

HTML does not have a native <transcript> element. Making a well-engineered transcript is not trivial. A babel of half-baked solutions is to be found in the wild, and implementations differ greatly from site to site.

This is a missed opportunity for a content type which is extremely common across the web, leading to a degraded/uneven experience of transcripts for assistive tech users.

I am calling for a baseline transcript solution to be offered via declarative markup. Assistive technologies would thus be able to present this content with predictable affordances, regardless of where (on the web) it is found, or how it may be visually presented.

It is the inconsistency of implementation (especially regarding AT experience) that I would hope to resolve by introducing this element.

What solutions exist today?

A babel of half-baked solutions is to be found on the web, along with a small handful of well-engineered examples (such as the one offered by AblePlayer), but implementations, and the expected patterns of "consumption", differ greatly from site to site.

It's not obvious what the best semantic markup for a transcript should be today, and yet a transcript has a relatively consistent format and a clear and distinct semantic role.

A possible choice today might be an <ol>, since a transcript is indeed an ordered collection. However, the typical affordances for lists offered by ATs (such as announcing the item index and the total item count) are of questionable value for a transcript.

A closer fit might be <dl> with the timestamp as the term and the utterances as the data. I could be convinced that this is the way forward (perhaps with some special attributes), except that when I look at most transcript implementations "in the wild" they do not use these elements. Support for <dl> in ATs is improving, but not great, with some accessibility experts recommending against it.
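To make the comparison concrete, here is a sketch of how a transcript might be marked up with today's elements; the class name, wrapper, and content are illustrative, not a recommendation:

```html
<!-- A transcript using existing HTML: <dl>, with the timestamp as the
     term and the utterance as the data. All names here are illustrative. -->
<section aria-label="Transcript">
  <dl class="transcript">
    <dt><time>00:00</time></dt>
    <dd>Welcome to the show.</dd>

    <dt><time>00:07</time></dt>
    <dd>Today we are talking about transcripts on the web.</dd>
  </dl>
</section>
```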

If the existing "semantic pool" in HTML does not offer a good fit for a transcript, the temptation to reach for non-semantic divs and spans in "home cooked" solutions instead will be high. Examples of transcripts using non-semantic HTML are easily found. The differing implementations offer no way to leverage user experience and expectations. An explicit transcript element type would usefully constrain and simplify the way transcripts are authored for the benefit of all users.

How would you solve it?

Transcript semantics require some sort of outer wrapper, for example an element perhaps called <transcript>

An optional attribute might link the transcript to a time-based media element elsewhere in the DOM, perhaps reusing the for attribute; or the time-based media element itself might indicate the id of the transcript, in a way rather similar to aria-details. Either would be acceptable: the direction of the association is less important than establishing a standard way of associating the two elements.

It may be preferable to support (but not require) an association made as a simple descendant relationship (e.g. a transcript appearing inside the subtree of an <audio> element would be understood as the transcript for that audio; and in the case of multi-language video soundtracks, the different transcripts could each carry a language attribute, etc.).
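Both directions of association can be sketched as follows; note that <transcript> is hypothetical markup, not an existing element, and the reuse of the for attribute is equally an assumption:

```html
<!-- Hypothetical markup: association by containment. -->
<video src="talk.webm" controls>
  <transcript lang="en"> ... </transcript>
  <transcript lang="de"> ... </transcript>
</video>

<!-- Hypothetical markup: association by reference. -->
<audio id="episode-12" src="episode-12.mp3" controls></audio>
<transcript for="episode-12"> ... </transcript>
```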

The DOM subtree of a transcript would consist of timestamps (using the existing <time> element), and utterances, which may or may not deserve their own element.

The structure of each cue could be similar to the implicit pairing of <dt> and <dd> within a <dl>, although I think it may be better if each cue is explicitly wrapped so that there is no doubt which time belongs with which utterance, for example:

<cue>
<time>00:00</time>
<utterance>lorem ipsum</utterance>
</cue>

Simple CSS selectors and rules can be imagined to style, hide, or show the timestamps.

It is a reasonable expectation that visible timestamps behave like hyperlinks, jumping to exactly that moment in the associated media. Again, it would be ideal if user agents could provide this styling and behavior by default from the declarative code.
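Until such a default exists, the expected behavior can be approximated with a few lines of script. This is only a sketch: the list markup and the data-seconds attribute are assumptions, while currentTime is the standard HTMLMediaElement property:

```html
<audio id="episode" src="episode.mp3" controls></audio>
<ol class="transcript">
  <li><a href="#" data-seconds="0"><time>00:00</time></a> Welcome to the show.</li>
  <li><a href="#" data-seconds="7"><time>00:07</time></a> Today's topic is transcripts.</li>
</ol>
<script>
  // Clicking a timestamp seeks the associated media element.
  const media = document.getElementById("episode");
  for (const link of document.querySelectorAll(".transcript a[data-seconds]")) {
    link.addEventListener("click", (event) => {
      event.preventDefault();
      media.currentTime = Number(link.dataset.seconds);
    });
  }
</script>
```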

Anything else?

Note: Some users prefer timestamps to be presented, others prefer them to be suppressed. A simple baked-in toggle attribute to handle this would go a very long way. (Timestamp announcements from a screen reader get old very quickly.)
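Such a toggle could be honored with a one-line stylesheet rule; both the element and the attribute name here are assumptions, reusing the cue markup sketched above:

```html
<style>
  /* Hypothetical: hide timestamps when the (proposed) attribute is present. */
  transcript[hide-timestamps] time {
    display: none;
  }
</style>
<transcript hide-timestamps>
  <cue><time>00:00</time><utterance>lorem ipsum</utterance></cue>
</transcript>
```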

@brennanyoung brennanyoung added addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest labels Oct 3, 2023
@nigelmegitt

nigelmegitt commented Oct 3, 2023

On the assumption that you want to reference an external transcript resource, I would suggest looking at DAPT for the payload format choice.

@prlbr

prlbr commented Nov 13, 2023

<audio> and <video> elements can have nested <track> elements that reference different kinds of subtitles/transcriptions/descriptions in the WebVTT format.

Can this be leveraged or do the transcriptions have to be encoded in native HTML?
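For reference, this is what the existing mechanism looks like; kind, srclang, and default are standard <track> attributes (the file names are placeholders), and notably none of the standard kind values is a transcript:

```html
<video src="talk.webm" controls>
  <!-- Standard track kinds: subtitles, captions, descriptions, chapters, metadata. -->
  <track kind="captions" srclang="en" src="talk-captions.vtt" default>
  <track kind="descriptions" srclang="en" src="talk-descriptions.vtt">
</video>
```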

@brennanyoung

brennanyoung commented Nov 14, 2023

Leveraging <track> elements seems like an obvious choice as a data source for a user-agent default transcript view. If we can avoid the requirement to encode the transcript directly in HTML, that would be great.

The content of time-based text tracks is not included in the accessibility tree. Typically only the "current cue" will be surfaced, and that is (currently) the job of the content author, so it simply may not happen in many cases.

Interesting that the HTML5 spec explicitly mentions transcripts under both captions and subtitles, although I would like to note that WCAG treats transcripts and captions as separate kinds of content. The data is often near-identical, but the presentation and the intended pattern of consumption differ greatly.

So... among any other goals, we may be looking at a predictable baseline way for .vtt and .srt files to be converted into rich text (presumably some kind of DOM subtree), with the intention that it may be consumed independently of the playback state of the audio.

@brennanyoung

Regex for VTT cues
^(\d{2}:\d{2}:\d{2}[.,]\d{3})\s-->\s(\d{2}:\d{2}:\d{2}[.,]\d{3})\n(.*(?:\r?\n(?!\r?\n).*)*)

Example Replace pattern
<cue><a href="#"><time>$1</time></a><utterance>$3</utterance></cue>
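Applied in script, the pattern above can be exercised like this; a Node-runnable sketch, with vttToMarkup as an illustrative function name:

```javascript
// Convert WebVTT-style cues to the cue/utterance markup sketched earlier.
// The regex is the one given above; group 2 (the end time) is unused.
const cuePattern =
  /^(\d{2}:\d{2}:\d{2}[.,]\d{3})\s-->\s(\d{2}:\d{2}:\d{2}[.,]\d{3})\n(.*(?:\r?\n(?!\r?\n).*)*)/gm;

function vttToMarkup(vtt) {
  return vtt.replace(
    cuePattern,
    '<cue><a href="#"><time>$1</time></a><utterance>$3</utterance></cue>'
  );
}

const sample = "00:00:00.000 --> 00:00:05.000\nWelcome to the show.";
console.log(vttToMarkup(sample));
// -> <cue><a href="#"><time>00:00:00.000</time></a><utterance>Welcome to the show.</utterance></cue>
```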

@Malvoz

Malvoz commented Nov 14, 2023

FYI this was also brought up in #7499 / WICG/proposals#45. I did not intend to shut down that discussion by any means by posting the following:

Apparently, there already exists a proposal for <transcript>:

/cc @chaals

Other useful resources:

cc @accessabilly

@annevk annevk added topic: media accessibility Affects accessibility labels Jan 8, 2024