Preserving timed text (and pagination issue?) #32
also relevant via @xshy216 #10 (comment) |
An update on the latest thinking, and a chance to recap some of the current progress.
Deferring pagination exploration
After talking to @rememberlenny I decided to defer trying out pagination in favor of an approach that tries to single out the paragraphs that have changed and align only those.
Options for aligning only the paragraphs that changed
There are two ways in which you could do that,
Word level timings and clickable words
Slightly unrelated, but relevant, similar to the DraftJs approach of using entities in
example
[
  {
    "type": "timedText",
    "speaker": "James Jacoby",
    "start": 1.41,
    "previousTimings": "0",
    "startTimecode": "00:00:01",
    "children": [
      {
        "text": "So tell me, let’s start at the beginning.",
        "words": [
          {
            "end": 1.63,
            "start": 1.41,
            "text": "So"
          },
          {
            "end": 2.175,
            "start": 1.63,
            "text": "tell"
          },
          {
            "end": 2.72,
            "start": 2.175,
            "text": "me,"
          },
          {
            "end": 2.9,
            "start": 2.72,
            "text": "let’s"
          },
          {
            "end": 3.14,
            "start": 2.9,
            "text": "start"
          },
          {
            "end": 3.21,
            "start": 3.14,
            "text": "at"
          },
          {
            "start": 3.21,
            "end": 3.28,
            "text": "the"
          },
          {
            "end": 4.88,
            "start": 4.346666666666666,
            "text": "beginning."
          }
        ]
      }
    ]
  },
...
We can add onDoubleClick={handleTimedTextClick} And use a
Paragraph changes
Option 2 assumes that paragraphs are not changing, e.g. by splitting or merging a paragraph, or that this is being handled separately from the alignment process. For now I've disabled splitting and merging paragraphs, via
example
// TODO: revisit logic for
// - splitting paragraph via enter key
// - merging paragraph via delete
// - merging paragraphs via deleting across paragraphs
const handleOnKeyDown = (event) => {
  console.log('event.key', event.key);
  if (event.key === 'Enter') {
    // intercept Enter
    event.preventDefault();
    console.log('For now disabling enter key to split a paragraph, while figuring out the alignment issue');
    return;
  }
  if (event.key === 'Backspace') {
    const selection = editor.selection;
    console.log('selection', selection);
    console.log(selection.anchor.path[0], selection.focus.path[0]);
    // across paragraphs
    if (selection.anchor.path[0] !== selection.focus.path[0]) {
      console.log('For now cannot merge paragraph via delete across paragraphs, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
    // beginning of a paragraph
    if (selection.anchor.offset === 0 && selection.focus.offset === 0) {
      console.log('For now cannot merge paragraph via delete, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
  }
};
option 2. identify paragraphs that have changed
One idea from @rememberlenny is that if you don't run the alignment on every keystroke or when the user stops typing (both possible optimizations to consider, via @gridinoc), then you need to find which paragraphs have changed and align only those. I found that lodash
example
/**
 * Update timestamps using stt-align module
 * @param {*} currentContent - slate js value
 * @param {*} dpeTranscript - original transcript in dpe format
 * @return slateJS value
 */
// TODO: do optimization mentioned in TODOs below and try out on 5 hours long to see if UI still freezes.
// TODO: in stt-align-node if all the words are completely different, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed
  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);
  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph, to be able to update the words attribute in the paragraph
  // and easily replace that paragraph in the slate editor content.
  // Obv this wouldn't work if we re-enable the edge cases disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
This gives you a list of paragraphs that have changed, and because we added indexes via ids, we can easily and quickly identify them and run alignment on individual paragraphs.
option 2. align individual paragraphs that have changed
Once you have the individual paragraphs that need aligning you can run
example
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });
  return newCurrentContent;
};
full example
// TODO: do optimization mentioned in TODOs below and try out on 5 hours long to see if UI still freezes.
// TODO: in stt-align-node if all the words are completely different, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed
  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);
  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph, to be able to update the words attribute in the paragraph
  // and easily replace that paragraph in the slate editor content.
  // Obv this wouldn't work if we re-enable the edge cases disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);
  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });
  return newCurrentContent;
};
up next.
See the latest commit of PR #36 for more details on this.
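The `comparator` passed to `_.differenceWith` is not shown in the snippets above. As a rough sketch only, here is one way the "which paragraphs changed" step could look in plain JavaScript. The names `sameText` and `changedParagraphIndexes` are hypothetical, and this version compares paragraphs position by position, so it is not a drop-in equivalent of `_.differenceWith` (which compares each paragraph against the whole original list). Like the code above, it assumes paragraphs are not split or merged.

```javascript
// Hypothetical comparator: two paragraphs count as "the same" when the text
// of their first child matches (the node shape used in the example above).
const sameText = (a, b) => a.children[0].text === b.children[0].text;

// Returns the indexes of paragraphs whose text differs from the original
// paragraph at the same position. Assumes no splitting or merging happened.
function changedParagraphIndexes(currentContent, originalContent, comparator = sameText) {
  const changed = [];
  currentContent.forEach((paragraph, index) => {
    const original = originalContent[index];
    if (!original || !comparator(paragraph, original)) {
      changed.push(index);
    }
  });
  return changed;
}
```

Only the paragraphs at the returned indexes would then need to go through the alignment step.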
Refactor/clean up
Also
And
|
Some thoughts after recent refactor #36
on 💡
This would mean that you are running the STT alignment against the most recent re-alignment, as opposed to the original STT data. But it would give flexibility to handle changing paragraphs, as well as to skip alignment of paragraphs that might not need it. Still unsure about the frequency of the alignment: definitely on save, but not sure if it should happen on pause in typing, maybe not for now. Need to check performance against longer files (1 to 5 hours, for example) |
Updated storybook demo https://pietropassarelli.com/slate-transcript-editor/ to reflect this PR #36
to recap
Some things I am not sure about
extra / stretch goal
|
PR #36 recap
|
This has been merged to master and deployed as alpha releases, to test it out and make it easier to revert if needed. Closing this for now. |
Working on PR #30, I ran into an issue figuring out the right logic to paginate the transcript.
The issue
TL;DR: The issue is that when the user corrects the text, they might delete, substitute or insert words. These operations tend to lose the time-codes originally associated with each word. The alignment module currently in use loses performance for transcripts over 1 hour, so we are considering pagination as a quick fix.
If you truly want the TL;DR version skip to the Pagination heading. Otherwise, read on for more context.
Context
Some quick background for those new to the project.
slate-transcript-editor builds on top of the lessons learned from developing @bbc/react-transcript-editor (based on draftJs).
As the name suggests, slate-transcript-editor is built on top of slateJs, augmenting it with functionality specific to the transcript editing domain.
For more on "draftjs vs slatejs" for this use case, see these notes.
It is a react transcript editor component to allow users to correct automated transcriptions of audio or video generated from speech to text services.
It is used in use cases such as autoEdit, an app to edit audio/video interviews, as well as other situations where users might need to correct transcriptions.
The ambition is to have a component that takes in timed text (e.g. a list of words with start times), allows the user to correct the text (providing some convenience features, such as pause while typing, and keeping some kind of correspondence between the text and audio/video) and on save returns timed text in the same json format (referred to, for convenience, as dpe format, after the digital paper edit project where it was first formalized).
As part of slate-transcript-editor this dpe format is then converted into the slateJs data model.
See the storybook demo to see the slate-transcript-editor react component in practice.
Over time in this domain folks have tried a variety of approaches to solve this problem.
compute the timings
By listening to character insertions and deletions and detecting word boundaries, you could estimate the time-codes. This is a very fiddly approach, as there are a lot of edge cases to handle, e.g. what if a user deletes a whole paragraph? And over time the accuracy of the time-codes slowly degrades (if a lot of corrections are made to the text, e.g. if the STT is not very accurate).
alignment - server side - Aeneas
Some folks have had some success running server side alignment.
For example, in pietrop/fact2_transcription_editor the editor was one giant content-editable div, and on save it would send a plain text version to the server (literally using .innerText). @frisch1 would then align it server side against the original media using the aeneas aligner by @pettarin.
Aeneas converts the text into speech (TTS) and then compares that waveform against the original media to very quickly produce the alignment, restoring time-codes at either word or line level depending on your preferences.
Aeneas uses dynamic time warping (DTW) over Mel-frequency cepstral coefficients (MFCCs) (🤯). You can read more about how Aeneas works in the How Does This Thing Work? section of their docs.
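To give a feel for the dynamic time warping part, here is a minimal sketch of DTW over two plain lists of numbers. This is not Aeneas code: a real aligner compares frames of MFCC vectors rather than single numbers, and the function name `dtwDistance` is made up for this illustration.

```javascript
// Minimal dynamic time warping (DTW) between two numeric sequences.
// Returns the cumulative cost of the cheapest monotonic alignment.
function dtwDistance(a, b) {
  const n = a.length;
  const m = b.length;
  // cost[i][j] = minimal cumulative cost of aligning a[0..i) with b[0..j)
  const cost = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  cost[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const distance = Math.abs(a[i - 1] - b[j - 1]);
      // Extend the cheapest of: stretch a, stretch b, or advance both.
      cost[i][j] = distance + Math.min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]);
    }
  }
  return cost[n][m];
}

// A "stretched" copy of the same signal aligns at zero cost.
dtwDistance([1, 2, 3], [1, 2, 2, 3]); // 0
```

The warping is what lets a synthesized (TTS) rendition line up with a recording spoken at a different pace.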
This approach for fact2_transcription_editor was somewhat successful; Aeneas is very fast. However
side note on word level time-codes and clickable words
I should mention that in fact2_transcription_editor you could click on individual words to jump to the corresponding point in the media.
With something equivalent to
A pattern I had first come across in hyperaud.io's blog description of "hypertranscripts" by @maboa & @gridinoc
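As a rough illustration of the clickable-words pattern from this side note: assuming each word is rendered in its own element carrying its start time in a `data-start` attribute (the attribute name and the helper `makeWordClickHandler` are hypothetical, not taken from fact2_transcription_editor or hyperaud.io), a single click handler can seek the media.

```javascript
// Builds a click handler that seeks the given media element to the start
// time stored on the clicked word element (e.g. <span data-start="2.72">).
function makeWordClickHandler(mediaElement) {
  return function handleWordClick(event) {
    const start = parseFloat(event.target.dataset.start);
    if (Number.isNaN(start)) {
      return; // the click landed on something that is not a timed word
    }
    mediaElement.currentTime = start;
  };
}
```

In a browser this could be attached once on the transcript container, e.g. `container.addEventListener('click', makeWordClickHandler(videoElement))`, rather than once per word.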
STT based alignment - Gentle
Some folks have also used Gentle, by @maxhawkins, a forced aligner based on Kaldi as a way to get alignment info.
I've personally used it for autoEdit2 as an open source offline option for users to get transcriptions. But I haven't used it for alignment, as STT-based alignment is slower than the TTS-based one.
alignment - client side - option 1 (stt-align)
Another option is to run the alignment client side, by doing a diff between the human-corrected (accurate) text and the timed text from the STT engine, and transposing the time-codes from the second to the first.
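A minimal sketch of this diff-and-transpose idea, to make it concrete. It uses a plain longest-common-subsequence (LCS) match on lowercased words, which is a stand-in for the difflib-style matching used by stt-align and not its actual algorithm. The function name `transposeTimings` and the punctuation stripping are assumptions made for this example.

```javascript
// Copies timings from STT words onto the words of the corrected text wherever
// the two sequences match; unmatched (inserted/changed) words stay untimed.
function transposeTimings(sttWords, correctedText) {
  const corrected = correctedText.split(/\s+/).filter(Boolean);
  const normalize = (s) => s.toLowerCase().replace(/[.,!?;:]/g, '');
  const a = sttWords.map((w) => normalize(w.text));
  const b = corrected.map(normalize);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const result = corrected.map((text) => ({ text }));
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result[j].start = sttWords[i].start;
      result[j].end = sttWords[i].end;
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      i++; // STT word was deleted or replaced by the user
    } else {
      j++; // word was inserted or changed by the user; left without timings
    }
  }
  return result;
}
```

Words that come back without `start`/`end` are the ones that need their time-codes estimated from their neighbors.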
some more background and info on this solution
This solution was first introduced by @chrisbaume in bbc/dialogger (presented at textAV 2017). It modified CKEditor (at the time draftJS was not around yet) and ran the alignment server side in a custom python module, sttalign.py
With @chrisbaume's help I converted the python code into a node module stt-align-node which is used in @bbc/react-transcript-editor and slate-transcript-editor
One issue in converting from python to the node version is that for diffing python uses difflib, which is part of the standard library, while in the node module we use difflib.js, which might not be as performant (❓ 🤷♂️ )
When a word is inserted (e.g. it was not recognized by the STT service and the user adds it manually), this type of alignment has no time-codes for it. We add time-codes back by interpolating the time-codes of neighboring words. In the python version the interpolation is done via numpy, which linearly interpolates the missing times.
In the node version the interpolation is done via the everpolate module and again it might not be as performant as the python version (❓ 🤷♂️ ).
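To illustrate what the interpolation step does, here is a small sketch in plain JavaScript. It does not use numpy or everpolate, and it is not the stt-align-node implementation: `interpolateStarts` is a hypothetical helper that fills missing `start` values by spacing them evenly between the nearest timed neighbors, and copies the nearest known value for words before the first or after the last timed word.

```javascript
// Fills in missing `start` times by linear interpolation (by word index)
// between the nearest words that still have a numeric `start`.
function interpolateStarts(words) {
  const result = words.map((word) => ({ ...word }));
  const known = [];
  result.forEach((word, index) => {
    if (typeof word.start === 'number') known.push(index);
  });
  if (known.length === 0) return result; // nothing to anchor against

  for (let k = 0; k < known.length - 1; k++) {
    const left = known[k];
    const right = known[k + 1];
    const step = (result[right].start - result[left].start) / (right - left);
    for (let i = left + 1; i < right; i++) {
      result[i].start = result[left].start + step * (i - left);
    }
  }

  // Words outside the timed range take the nearest known start time.
  const first = known[0];
  const last = known[known.length - 1];
  for (let i = 0; i < first; i++) result[i].start = result[first].start;
  for (let i = last + 1; i < result.length; i++) result[i].start = result[last].start;
  return result;
}
```

For example, two untimed words between a word starting at 1s and a word starting at 4s would be given start times of 2s and 3s.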
However, in @bbc/react-transcript-editor and slate-transcript-editor, initially every time the user stopped typing for longer than a few seconds we'd trigger a save, which was preceded by an alignment. This became very slow, especially for long transcriptions (e.g. approximately over 1 hour), because whether you changed a paragraph or just one word, it would run the alignment across the whole text, which turned out to be a pretty expensive operation.
This led to removing user-facing word-level time-codes in the slateJs version to improve performance on long transcriptions, and to removing auto save. However, on long transcriptions, even with manual save, the stt-align-node module can sometimes temporarily freeze the UI for a few seconds 😬 or, in the worst case, even crash the page 😓 ☠️
more on retaining speaker labels after alignment
There is also a workaround for retaining speaker labels at paragraph level when using this module to run the alignment. The module itself only aligns the words. To re-introduce the speakers, you just compare the aligned words with the paragraphs with speaker info. Example of converting into slateJs format or into dpe format from slateJs
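A sketch of that comparison, under the assumption that each paragraph carries `speaker`, `start` and `end`, and each aligned word carries `start`, `end` and `text`. The function name `groupWordsBySpeakerParagraph` and the exact output shape are made up for this example and are not the repo's converter.

```javascript
// Assigns each aligned word to the paragraph whose time range contains the
// word's start time, so the paragraph's speaker label is carried over.
function groupWordsBySpeakerParagraph(alignedWords, paragraphs) {
  return paragraphs.map((paragraph) => {
    const words = alignedWords.filter(
      (word) => word.start >= paragraph.start && word.start < paragraph.end
    );
    return {
      speaker: paragraph.speaker,
      start: paragraph.start,
      text: words.map((word) => word.text).join(' '),
      words,
    };
  });
}
```

This relies on the aligned words having usable start times, which is one reason the interpolation of missing timings matters.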
Which is why in PR #30 we are considering pagination. But before a closer look into that, let's consider one more option.
alignment - client side - option 2 (web-aligner)
Another option explored by @chrisbaume at textAV 2017 was to make a webaligner (example here and code of the example here) to create a simple, lightweight client-side forced aligner for timed text, leveraging the browser audio API (AudioContext) and doing computation similar to Aeneas (? not sure about this last sentence ?).
This option is promising, but was never fully fleshed out to a usable state. It might also only work when aligning short sentences due to browser limitations (?).
5. Overtyper
Before considering pagination, a completely different approach to the UX problem of correcting text is overtyper by @alexnorton & @maboa from textAV 2017. You follow along a range of words being highlighted as the media plays. To correct, you start typing from the last correct word you heard until the next correct one, so that the system can adjust/replace/insert all the ones in between. This makes the alignment problem a lot narrower, and new word timings can be more easily computed.
This is promising, but unfortunately, as far as I know, there hasn't been a lot of user testing of this approach to validate it.
Pagination
For slate-transcript-editor we've been using (option 3) client side alignment with stt-align-node to restore time-codes on the user's save.
However, because of the performance issue on large transcriptions, we've been considering pagination (PR #30) but ran into a few issues.
For now we can assume the transcription comes as one payload from the server. And I've been splitting it into one hour chunks.
The idea is that the slateJs editor can be responsible for the text editing part, while alignment, save, and export in various formats can be done in the parent component to provide a cohesive interface that, for example, merges all the pages into one doc before exporting but only updates the current chunk when saving.
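As a sketch of what the parent component could do, assuming the transcript is a flat list of words with `start` times in seconds: the helper names (`splitIntoPages`, `mergePages`, `updatePage`) are hypothetical, and a real implementation would probably want to split on paragraph boundaries rather than on individual words.

```javascript
const ONE_HOUR_IN_SECONDS = 3600;

// Groups words into pages according to the hour their start time falls in.
function splitIntoPages(words, pageSeconds = ONE_HOUR_IN_SECONDS) {
  const pages = [];
  for (const word of words) {
    const pageIndex = Math.floor(word.start / pageSeconds);
    while (pages.length <= pageIndex) pages.push([]);
    pages[pageIndex].push(word);
  }
  return pages;
}

// Merges all the pages back into one list, e.g. before exporting.
function mergePages(pages) {
  return pages.flat();
}

// Replaces only the page being edited, leaving the other pages untouched.
function updatePage(pages, pageIndex, newWords) {
  return pages.map((page, index) => (index === pageIndex ? newWords : page));
}
```

With this split, alignment on save only ever has to run over the words of the current page.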
questions
I am going to continue to try a few other things here, but any thoughts, ideas 💡 or examples of React best practice for paginating text editors are much appreciated.
Quick disclaimer: Last but not least this is my best effort to collect info on this topic in order to frame the problem and hopefully get closer to a solution, if some of these are not as accurate as they should be, feel free to let me know in the comments.