Preserving timed text (and pagination issue?) #32

Closed
pietrop opened this issue Feb 5, 2021 · 6 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

pietrop commented Feb 5, 2021

While working on PR #30, I ran into an issue figuring out the right logic to paginate the transcript.

The issue

TL;DR: When the user corrects the text, they might delete, substitute or insert words. These operations tend to lose the time-codes originally associated with each word. The alignment module currently in use loses performance for transcripts over 1 hour, so we are considering pagination as a quick fix.

If you truly want the TL;DR version, skip to the Pagination heading. Otherwise, read on for more context.

Context

Some quick background for those new to the project.

slate-transcript-editor builds on top of the lessons learned from developing @bbc/react-transcript-editor (based on draftJs).

As the name suggests slate-transcript-editor is built on top of slateJs augmenting it with transcript editing domain specific functionalities.

For more on "draftjs vs slatejs" for this use case, see these notes.

It is a react transcript editor component that allows users to correct automated transcriptions of audio or video generated by speech to text services.

It is used in use cases such as autoEdit, an app to edit audio/video interviews, as well as in other situations where users might need to correct transcriptions.

The ambition is to have a component that takes in timed text (eg a list of words with start times), allows the user to correct the text (providing some convenience features, such as pause while typing, and keeping some kind of correspondence between the text and the audio/video) and on save returns timed text in the same json format (referred to, for convenience, as dpe format, after the digital paper edit project where it was first formalized).

{
  "words": [
    {
      "end": 0.46, // in seconds
      "start": 0,
      "text": "Hello"
    },
    {
      "end": 1.02,
      "start": 0.46,
      "text": "World"
    },
    ...
  ],
  "paragraphs": [
    {
      "speaker": "SPEAKER_A",
      "start": 0,
      "end": 3
    },
    {
      "speaker": "SPEAKER_B",
      "start": 3,
      "end": 19.2
    },
    ...
  ]
}

As part of slate-transcript-editor this dpe format is then converted into slateJs data model.
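To make the conversion a bit more concrete, here is a minimal sketch of how dpe words could be grouped into slateJs-style paragraph nodes. The function name `dpeToSlateSketch` and the exact node shape are assumptions for illustration, this is not the project's actual conversion code.

```javascript
// Hypothetical helper (not the real conversion used by slate-transcript-editor).
// Groups dpe words into one Slate-like block per dpe paragraph.
function dpeToSlateSketch({ words, paragraphs }) {
  return paragraphs.map((paragraph) => {
    // A word belongs to a paragraph if it starts inside the paragraph's time range.
    const paragraphWords = words.filter(
      (word) => word.start >= paragraph.start && word.start < paragraph.end
    );
    return {
      type: 'timedText', // assumed block type name
      speaker: paragraph.speaker,
      start: paragraph.start,
      children: [{ text: paragraphWords.map((word) => word.text).join(' ') }],
    };
  });
}

const exampleDpe = {
  words: [
    { start: 0, end: 0.46, text: 'Hello' },
    { start: 0.46, end: 1.02, text: 'World' },
  ],
  paragraphs: [{ speaker: 'SPEAKER_A', start: 0, end: 3 }],
};
const exampleSlateValue = dpeToSlateSketch(exampleDpe);
```

Note that in this sketch the word level timings are dropped from the editor value, which is exactly why they need to be restored via alignment later on.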

See the storybook demo to see the slate-transcript-editor react component in practice.

Over time in this domain folks have tried a variety of approaches to solve this problem.

compute the timings

By listening to char insertion and deletion, and detecting word boundaries, you could estimate the time-codes. This is a very fiddly approach, as there are a lot of edge cases to handle. Eg what if a user deletes a whole paragraph? And over time the accuracy of the time-codes slowly fades (if a lot of corrections are made to the text, eg if the STT is not very accurate).

alignment - server side - Aeneas

Some folks have had some success running server side alignment.
For example in pietrop/fact2_transcription_editor the editor was one giant content editable div, and on save it would send a plain text version to the server (literally using .innerText). @frisch1 would then align it server side against the original media using the aeneas aligner by @pettarin.

Aeneas converts the text into speech (TTS) and then compares that waveform against the original media to very quickly produce the alignment, restoring time-codes either at word or line level depending on your preferences.

Aeneas uses dynamic time warping over Mel-frequency cepstral coefficients (MFCC) (🤯). You can read more about how Aeneas works in the How Does This Thing Work? section of their docs.
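For a rough intuition of what dynamic time warping does, here is a toy sketch over two plain lists of numbers. This is not how Aeneas is implemented (it works on MFCC vectors per audio frame and is heavily optimized), and the function name `dtwCost` is made up for this example.

```javascript
// Toy dynamic time warping: minimal cumulative distance between two 1-D sequences,
// allowing either sequence to be "stretched" in time.
function dtwCost(a, b) {
  const n = a.length;
  const m = b.length;
  // cost[i][j] = minimal cumulative cost of aligning the first i items of a with the first j items of b
  const cost = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  cost[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const distance = Math.abs(a[i - 1] - b[j - 1]);
      // Either advance in a, advance in b, or advance in both.
      cost[i][j] = distance + Math.min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]);
    }
  }
  return cost[n][m];
}

// The second sequence is the first one with the middle value held for longer,
// so it can be warped onto the first at no cost.
const warpedCost = dtwCost([1, 2, 3], [1, 2, 2, 3]);
```

The path through the cost table (not returned here) is what maps each frame of the synthesized audio to a frame of the real audio, and therefore each word to a time.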

This approach for fact2_transcription_editor was somewhat successful, and Aeneas is very fast. However

  • the alignment is only done on save to the database.
  • If a user continues to edit the page, over time more and more of the time-codes will disappear until they refresh the page and the "last saved and aligned" transcript gets fetched from the db.
  • And to set this up as "a reusable component" you'd always have to pair it with a server side module to do the alignment.
  • Aeneas is great, but in its current form it does not exist as an npm module (as far as I am aware?). It's written in python and has some system dependencies such as ffmpeg, a TTS engine etc..

side note on word level time-codes and clickable words

I should mention that in fact2_transcription_editor you could click on individual words to jump to the corresponding point in the media.

With something equivalent to

<span data-start-time="0" data-end-time="0.46" class="words"> Hello </span> ...

A pattern I first came across in hyperaud.io's blog description of "hypertranscripts" by @maboa & @gridinoc.
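As a minimal sketch of how such spans can drive the player, a click handler could read the data attribute and seek the media element. The handler name `seekToClickedWord` is hypothetical, and this is not the code from fact2_transcription_editor.

```javascript
// Hypothetical click handler: reads data-start-time from the clicked span
// and moves the media playhead there. Returns whether a seek happened.
function seekToClickedWord(event, mediaElement) {
  const startTime = parseFloat(event.target.dataset.startTime);
  if (Number.isNaN(startTime)) return false; // the click was not on a timed word
  mediaElement.currentTime = startTime;
  return true;
}

// In a browser this could be wired up with event delegation on the transcript container, eg:
// transcriptContainer.addEventListener('click', (event) => seekToClickedWord(event, videoElement));
```

Using one delegated listener on the container, rather than one listener per word, keeps the cost low for transcripts with many thousands of words.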

STT based alignment - Gentle

Some folks have also used Gentle, by @maxhawkins, a forced aligner based on Kaldi as a way to get alignment info.

I've personally used it for autoEdit2 as an open source offline option for users to get transcriptions. But I haven't used it for alignment, as STT based alignment is slower than the TTS based one.

alignment - client side - option 1 (stt-align)

Another option is to run the alignment client side, by doing a diff between the human corrected (accurate) text and the timed text from the STT engine, and transposing the time-codes from the second to the first.
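To illustrate the idea of transposing time-codes via a diff, here is a toy sketch. It uses a plain longest common subsequence table rather than difflib, so it is a stand-in for the technique and not the stt-align-node implementation. The function name `transposeTimecodes` is made up.

```javascript
// Toy diff-based alignment: corrected words that match an STT word inherit its timings,
// everything else is left with null timings (to be filled in by interpolation).
function transposeTimecodes(sttWords, correctedText) {
  const normalize = (text) => text.toLowerCase().replace(/[.,!?;:]/g, '');
  const corrected = correctedText.trim().split(/\s+/).filter(Boolean);
  const a = sttWords.map((word) => normalize(word.text));
  const b = corrected.map(normalize);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const result = corrected.map((text) => ({ text, start: null, end: null }));
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result[j].start = sttWords[i].start;
      result[j].end = sttWords[i].end;
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      i++; // STT word was deleted or replaced by the user
    } else {
      j++; // corrected word has no STT counterpart
    }
  }
  return result;
}
```

The table is quadratic in the number of words, which gives a feel for why running this kind of diff over a whole multi-hour transcript on every save gets expensive.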

some more background and info on this solution

This solution was first introduced by @chrisbaume in bbc/dialogger (presented at textAV 2017). It modified CKEditor (at the time draftJS was not around yet) and ran the alignment server side in a custom python module, sttalign.py.

With @chrisbaume's help I converted the python code into a node module, stt-align-node, which is used in @bbc/react-transcript-editor and slate-transcript-editor.

One issue in converting from python to the node version is that for diffing python uses difflib, which is part of the standard library, while in the node module we use difflib.js, which might not be as performant (❓ 🤷‍♂️ ).

When a word is inserted (eg it was not recognized by the STT service and the user adds it manually), in this type of alignment there are no time-codes for it. Via interpolation of the time-codes of neighboring words, we add some time-codes back. In the python version the interpolation is done via numpy, to linearly interpolate the missing times.

In the node version the interpolation is done via the everpolate module, and again it might not be as performant as the python version (❓ 🤷‍♂️ ).
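As a sketch of what the interpolation step does, here is a hand-rolled linear interpolation over word index. It is a stand-in for the numpy / everpolate calls, not the code used in either module, and it only fills in start times to keep the example short. The function name `interpolateMissingStarts` is made up.

```javascript
// Fills in missing (null) start times by linear interpolation between the
// nearest words that do have a start time, using the word index as the x axis.
function interpolateMissingStarts(words) {
  const known = words
    .map((word, index) => ({ index, start: word.start }))
    .filter((point) => point.start !== null);
  if (known.length === 0) return words.map((word) => ({ ...word }));

  return words.map((word, index) => {
    if (word.start !== null) return { ...word };
    const previous = [...known].reverse().find((point) => point.index < index);
    const next = known.find((point) => point.index > index);
    // At the edges there is only one known neighbour, so reuse its time.
    if (!previous) return { ...word, start: next.start };
    if (!next) return { ...word, start: previous.start };
    const ratio = (index - previous.index) / (next.index - previous.index);
    return { ...word, start: previous.start + ratio * (next.start - previous.start) };
  });
}

const interpolatedExample = interpolateMissingStarts([
  { text: 'Hello', start: 0 },
  { text: 'big', start: null },
  { text: 'World', start: 1 },
]);
```

The interpolated times are only estimates, which is one reason accuracy can drift when a lot of words are inserted in a row.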

However in @bbc/react-transcript-editor and slate-transcript-editor, initially every time the user stopped typing for longer than a few seconds, we'd trigger a save, which was preceded by an alignment. This became very un-performant, especially for long transcriptions (eg approximately over 1 hour), because whether you changed a paragraph or just one word, it would run the alignment across the whole text, which turned out to be a pretty expensive operation.

This led to removing user facing word level time-codes in the slateJs version, to improve performance on long transcriptions, and to removing auto save. However, on long transcriptions, even with manual save, sometimes the stt-align-node module can temporarily freeze the UI for a few seconds 😬 or in the worst case scenario sometimes even crash the page 😓 ☠️

more on retaining speaker labels after alignment

There is also a workaround for retaining speaker labels at paragraph level when using this module to run the alignment.

The module itself only aligns the words. To re-introduce the speakers, you just compare the aligned words with the paragraphs with speaker info. See the example of converting into slateJs format or into dpe format from slateJs.
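One way this comparison could work is sketched below, assuming the aligned words come back in the same order and number as the words in the edited paragraphs. The function name `regroupWordsBySpeaker` and the paragraph shape (`speaker`, `text`) are assumptions for illustration, not the project's actual code.

```javascript
// Hypothetical helper: slices the flat list of aligned words back into paragraphs,
// using the word count of each paragraph's text, and carries over the speaker label.
function regroupWordsBySpeaker(alignedWords, paragraphs) {
  let cursor = 0;
  return paragraphs.map(({ speaker, text }) => {
    const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
    const words = alignedWords.slice(cursor, cursor + wordCount);
    cursor += wordCount;
    return {
      speaker,
      // Empty paragraphs have no timed words, so their times are left as null.
      start: words.length > 0 ? words[0].start : null,
      end: words.length > 0 ? words[words.length - 1].end : null,
      words,
    };
  });
}
```

The paragraph start and end times are derived from the first and last word, so they stay in sync with the re-aligned words.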

Which is why in PR #30 we are considering pagination. But before a closer look into that, let's consider one more option.

alignment - client side - option 2 (web-aligner)

Another option explored by @chrisbaume at textAV 2017 was to make a webaligner (example here and code of the example here): a simple lightweight client-side forced aligner for timed text, leveraging the browser audio API (AudioContext) and doing computation similar to Aeneas (? not sure about this last sentence?).

This option is promising, but was never fully fleshed out to a usable state. It might also only work when aligning short sentences due to browser limitations(?).

5. Overtyper

Before considering pagination, a completely different approach to the UX problem of correcting text is overtyper by @alexnorton & @maboa from textAV 2017, where you follow along a range of words being highlighted as the media plays. To correct, you start typing from the last correct word you heard until the next correct one, so that the system can adjust/replace/insert all the ones in between. This makes the alignment problem a lot narrower, and new word timings can be more easily computed.

This is promising, but unfortunately as far as I know there hasn't been a lot of user testing of this approach to validate it.

Pagination

For slate-transcript-editor we've been using (option 3) client side alignment with stt-align-node to restore time-codes on the user's save.

However because of the performance issue on large transcriptions, we've been considering pagination - PR #30 - but ran into a few issues.

For now we can assume the transcription comes as one payload from the server, and I've been splitting it into one hour chunks.

The idea is that the slateJs editor can be responsible for the text editing part, while alignment, save, and export in various formats can be done in the parent component, to provide a cohesive interface that, for example, merges all the pages into one doc before exporting but only updates the current chunk when saving.
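A minimal sketch of the chunking idea, working on a flat list of timed words for simplicity (the PR works on the transcript and paragraphs, so this is a simplification). All function names here are hypothetical.

```javascript
const ONE_HOUR_IN_SECONDS = 3600;

// Splits a flat list of timed words into chunks, based on the hour each word starts in.
function splitIntoChunks(words, chunkDuration = ONE_HOUR_IN_SECONDS) {
  const chunks = [];
  words.forEach((word) => {
    const chunkIndex = Math.floor(word.start / chunkDuration);
    if (!chunks[chunkIndex]) chunks[chunkIndex] = [];
    chunks[chunkIndex].push(word);
  });
  // Drop the holes left by chunks that contain no words.
  return chunks.filter(Boolean);
}

// Saving only replaces the chunk currently being edited.
function saveChunk(chunks, currentIndex, updatedWords) {
  return chunks.map((chunk, index) => (index === currentIndex ? updatedWords : chunk));
}

// Exporting merges all the chunks back into one list.
function mergeChunks(chunks) {
  return chunks.flat();
}
```

Because `saveChunk` returns a new array and leaves the untouched chunks as the same objects, it would fit with keeping the chunks in react state, although whether that performs well enough is exactly the open question below.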

questions
  1. Should these chunks be stored in the state of the parent component, or is there a performance issue in doing that in react?
  2. Should you loop through the chunks in the render method and only display the current one? Is this a good pattern to use or is there a better one?
  3. Should the state of the slateJS editor be held in the parent component? (this seemed to cause a performance issue)
  4. On change of the slateJS editor, do we just update the current chunk or also the array of chunks? (this seemed to cause a performance issue)

I am going to continue to try a few other things here, but any thoughts, ideas 💡 or examples of react best practices for paginating text editors are much appreciated.

Quick disclaimer: Last but not least this is my best effort to collect info on this topic in order to frame the problem and hopefully get closer to a solution, if some of these are not as accurate as they should be, feel free to let me know in the comments.

pietrop commented Feb 5, 2021

also relevant via @xshy216 #10 (comment)

pietrop commented Feb 10, 2021

An update on the latest thinking, and a chance to recap some of the current progress.

Deferring pagination exploration

After talking to @rememberlenny I decided to defer trying out pagination, in favor of an approach that tries to single out the paragraphs that have changed and align only those.

Options for aligning only the paragraphs that changed

There are two ways in which you could do that:

  1. One is to use the slateJs api for onKeyDown and/or onChange to keep some sort of list that tracks where changes in the doc have been made, based on the user's cursor and selection. For now this seems laborious.
  2. The other is to compare the paragraphs, single out those that have changed, and only run the alignment for those (using pietrop/stt-align-node).

Word level timings and clickable words

Slightly unrelated, but relevant: similar to the DraftJs approach of using entities in @bbc/react-transcript-editor, but somehow way more performant, we can bring back clickable words by adding the words list as an attribute of the text child node, alongside the text attribute.

example
[
  {
    "type": "timedText",
    "speaker": "James Jacoby",
    "start": 1.41,
    "previousTimings": "0",
    "startTimecode": "00:00:01",
    "children": [
      {
        "text": "So tell me, let’s start at the beginning.",
        "words": [
          {
            "end": 1.63,
            "start": 1.41,
            "text": "So"
          },
          {
            "end": 2.175,
            "start": 1.63,
            "text": "tell"
          },
          {
            "end": 2.72,
            "start": 2.175,
            "text": "me,"
          },
          {
            "end": 2.9,
            "start": 2.72,
            "text": "let’s"
          },
          {
            "end": 3.14,
            "start": 2.9,
            "text": "start"
          },
          {
            "end": 3.21,
            "start": 3.14,
            "text": "at"
          },
          {
            "start": 3.21,
            "end": 3.28,
            "text": "the"
          },
          {
            "end": 4.88,
            "start": 4.346666666666666,
            "text": "beginning."
          }
        ]
      }
    ]
  },
  ...
  

We can add onDoubleClick to the renderLeaf component.

onDoubleClick={handleTimedTextClick}

And use a getSelectionNodes helper function that uses the slateJS selection/cursor position to return the timecode of the current word.
Assuming the text has not been edited, comparing the selection offset against the cumulative char count of the words list gives you the start time of the word being clicked on (if that makes sense?).
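To spell out the offset idea, here is a minimal sketch. The function name `findWordAtOffset` is hypothetical (it is not the getSelectionNodes helper), and it assumes the paragraph text is the words joined by single spaces.

```javascript
// Maps a character offset within the paragraph text to the timed word under the cursor.
function findWordAtOffset(words, offset) {
  let wordStartOffset = 0;
  for (const word of words) {
    const wordEndOffset = wordStartOffset + word.text.length;
    if (offset <= wordEndOffset) return word;
    // +1 for the single space assumed between two words
    wordStartOffset = wordEndOffset + 1;
  }
  // Offset past the end of the text: fall back to the last word (undefined if there are no words).
  return words[words.length - 1];
}

const clickedWord = findWordAtOffset(
  [
    { start: 1.41, end: 1.63, text: 'So' },
    { start: 1.63, end: 2.175, text: 'tell' },
    { start: 2.175, end: 2.72, text: 'me,' },
  ],
  4 // cursor inside "tell"
);
// clickedWord.start can then be used to set the media currentTime
```

As soon as the text is edited the char counts no longer line up with the words list, which is why this only holds between alignments.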

Paragraph changes

Option 2 assumes that the paragraphs are not changing, eg via splitting or merging a paragraph, OR that this is being handled separately from the alignment process.

For now I've disabled splitting and merging paragraphs via the Enter and Backspace keys (eg if Backspace is pressed at the beginning of a paragraph). However you can still delete multiple words within one paragraph.

example
// TODO: revisit logic for
// - splitting paragraph via enter key
// - merging paragraph via delete
// - merging paragraphs via deleting across paragraphs
const handleOnKeyDown = (event) => {
  console.log('event.key', event.key);
  if (event.key === 'Enter') {
    // intercept Enter
    event.preventDefault();
    console.log('For now disabling enter key to split a paragraph, while figuring out the alignment issue');
    return;
  }
  if (event.key === 'Backspace') {
    const selection = editor.selection;
    // editor.selection is null when there is no cursor in the editor
    if (!selection) return;
    console.log('selection', selection);
    console.log(selection.anchor.path[0], selection.focus.path[0]);
    // selection spans across paragraphs
    if (selection.anchor.path[0] !== selection.focus.path[0]) {
      console.log('For now cannot merge paragraphs via delete across paragraphs, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
    // cursor at the beginning of a paragraph
    if (selection.anchor.offset === 0 && selection.focus.offset === 0) {
      console.log('For now cannot merge paragraphs via delete, while figuring out the alignment issue');
      event.preventDefault();
      return;
    }
  }
};

option 2. identify paragraphs that have changed

One idea from @rememberlenny is that if you don't run the alignment on every keystroke or when the user stops typing (which are both possible optimizations to consider - via @gridinoc), then you need to find which paragraphs have changed and only align those.

I found that lodash differenceWith is pretty snappy, and you can specify a comparator function, which allows you to, for example, only compare the text attribute of the child node, as opposed to the whole paragraph block.

example
/**
 * Update timestamps using the stt-align module
 * @param {*} currentContent - slate js value
 * @param {*} dpeTranscript - original STT transcript in dpe format
 * @return slateJS value
 */
// TODO: do the optimizations mentioned in the TODOs below and try it out on a 5 hour long transcript to see if the UI still freezes.
// TODO: in stt-align-node, if all the words are completely different, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed

  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);

  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph,to be able to update the words attribute in the paragraph and easily replace that paragraph in the
  // slate editor content.
  // Obv this wouldn't work, if re-enable the edge cases, disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);

// This gives you a list of paragraphs that have changed, and because we added indexes via ids, we can easily and quickly identify them and run alignment on individual paragraphs.
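The `comparator` passed to `differenceWith` is not shown in this snippet. A minimal sketch of what it could look like is below, assuming it only compares the text of the first child node. To keep the example self-contained, `differenceWith` here is a plain JS stand-in written to behave like lodash's `_.differenceWith(array, values, comparator)`, not lodash itself.

```javascript
// Assumed comparator: two paragraphs count as "the same" if the text of their first child matches.
const comparator = (currentParagraph, originalParagraph) =>
  currentParagraph.children[0].text === originalParagraph.children[0].text;

// Plain JS stand-in for lodash differenceWith: keeps the items of `array`
// for which no item in `values` is considered equal by `compare`.
const differenceWith = (array, values, compare) =>
  array.filter((item) => !values.some((other) => compare(item, other)));

const originalParagraphs = [
  { children: [{ text: 'Hello World' }] },
  { children: [{ text: 'How are you' }] },
];
const editedParagraphs = [
  { id: 0, children: [{ text: 'Hello World' }] },
  { id: 1, children: [{ text: 'How are you?' }] },
];
const changedParagraphs = differenceWith(editedParagraphs, originalParagraphs, comparator);
// changedParagraphs only contains the paragraph with id 1
```

One caveat of this kind of comparison: each current paragraph is checked against every original paragraph, not just the one at the same position, so a paragraph edited to read exactly like another original paragraph would not be flagged as changed.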

option 2. align individual paragraphs that have changed

Once you have the individual paragraphs that need aligning, you can run alignSTT on each and replace them in the slateJs editor's current content value (the list of paragraphs).

example
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);

  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });
  return newCurrentContent;
};

full example
// TODO: do the optimizations mentioned in the TODOs below and try it out on a 5 hour long transcript to see if the UI still freezes.
// TODO: in stt-align-node, if all the words are completely different, it seems to freeze.
// Look into why in stt-align-node github repo etc..
export const updateTimestampsHelper = (currentContent, dpeTranscript) => {
  // TODO: figure out if can remove the cloneDeep option
  const newCurrentContent = _.cloneDeep(currentContent);
  // trying to align only text that changed

  // TODO: ideally, you save the slate converted content in the parent component when
  // component is initialized so don't need to re-convert this from dpe all the time.
  const originalContentSlateFormat = convertDpeToSlate(dpeTranscript);

  // TODO: add the ID further upstream to be able to skip this step.
  // we are adding the index for the paragraph,to be able to update the words attribute in the paragraph and easily replace that paragraph in the
  // slate editor content.
  // Obv this wouldn't work, if re-enable the edge cases, disabled above in handleOnKeyDown
  const currentSlateContentWithId = currentContent.map((paragraph, index) => {
    const newParagraph = { ...paragraph };
    newParagraph.id = index;
    return newParagraph;
  });
  const diffParagraphs = _.differenceWith(currentSlateContentWithId, originalContentSlateFormat, comparator);

  diffParagraphs.forEach((diffParagraph) => {
    // TODO: figure out if can remove the cloneDeep option
    let newDiffParagraph = _.cloneDeep(diffParagraph);
    let alignedWordsTest = alignSTT(newDiffParagraph.children[0], newDiffParagraph.children[0].text);
    newDiffParagraph.children[0].words = alignedWordsTest;
    // also adjust paragraph timecode
    // NOTE: in current implementation paragraphs cannot be modified, so this part is not necessary
    // but keeping because eventually will handle use cases where paragraphs are modified.
    newDiffParagraph.start = alignedWordsTest[0].start;
    newDiffParagraph.startTimecode = shortTimecode(alignedWordsTest[0].start);
    newCurrentContent[newDiffParagraph.id] = newDiffParagraph;
  });
  return newCurrentContent;
};

up next

See the latest commit of PR #36 for more details on this.

  • Handle splitting a paragraph via Enter. Eg split the associated list of word objects into the two new paragraphs
  • Handle merging paragraphs via Backspace. Eg merge the lists of words from the two old paragraphs
  • Handle regular delete within a paragraph

Refactor/clean up

  • see if we can remove the need for cloneDeep
  • see if we can remove convertDpeToSlate for the comparison. Eg save in state the slateJs value from before the last change(?)
  • if optimizing to run it on char change or on stop typing, we could pass the current paragraph and skip the differenceWith computation step. (Altho we would need to figure out how to handle the case where the user corrects one paragraph, then quickly goes to the next one, eg without triggering an alignment in between)

Also

  • consider what happens if you hit Enter with a selection that spans across multiple paragraphs. Do you need to remove those stt words lists from the paragraph block, or should this stay disabled for now? For now I've intercepted and disabled it instead
  • consider what happens if you hit Backspace with a selection that spans across multiple paragraphs. Do you need to remove those stt words lists from the paragraph block, or should this stay disabled for now?

And

  • figure out if, instead of the ♻️ alignment btn, we should bring back some logic (similar to @gridinoc's suggestion) to run the alignment programmatically, eg on every keystroke or when the user stops typing. This would need to be debounced, and could make use of requestIdleCallback to make it more efficient.
  • add an option to insert new text to replace and re-align the current one, since multi paragraph delete is now disabled.
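For reference, a minimal sketch of what debouncing the alignment could look like. This is a generic hand-rolled debounce, not the code from the PR (a library version such as lodash debounce would work too), and the `timers` parameter is only there so the sketch can be exercised with fake timers.

```javascript
// Minimal debounce: `fn` only runs once `wait` ms have passed without a new call.
function debounce(
  fn,
  wait,
  timers = {
    setTimeout: (callback, ms) => setTimeout(callback, ms),
    clearTimeout: (id) => clearTimeout(id),
  }
) {
  let timeoutId = null;
  return (...args) => {
    // Every new call cancels the pending one, so only the last call in a burst runs.
    if (timeoutId !== null) timers.clearTimeout(timeoutId);
    timeoutId = timers.setTimeout(() => {
      timeoutId = null;
      fn(...args);
    }, wait);
  };
}

// Hypothetical usage: re-align 1.5 seconds after the user stops typing.
// const debouncedAlign = debounce(() => handleRestoreTimecodes(), 1500);
// In browsers that support it, the callback could defer the heavy work further, eg:
// debounce(() => requestIdleCallback(() => handleRestoreTimecodes()), 1500);
```

`handleRestoreTimecodes` in the comments is a placeholder name for whatever function triggers the alignment.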

pietrop changed the title from "Preserving timed text and pagination issue" to "Preserving timed text (and pagination issue?)" on Feb 16, 2021
pietrop commented Feb 17, 2021

Some thoughts after recent refactor #36

  • Handle splitting a paragraph via Enter. Eg split the associated list of word objects into the two new paragraphs
  • Handle merging paragraphs via Backspace. Eg merge the lists of words from the two old paragraphs
  • Handle regular delete within a paragraph
  • consider what happens if you hit Enter with a selection that spans across multiple paragraphs. Do you need to remove those stt words lists from the paragraph block, or should this stay disabled for now? For now I've intercepted and disabled it instead
  • consider what happens if you hit Backspace with a selection that spans across multiple paragraphs. Do you need to remove those stt words lists from the paragraph block, or should this stay disabled for now? For now I've intercepted and disabled it instead

One 💡: what if you are not allowed to completely delete a paragraph? It could make things easier for alignment, as a paragraph would always have timed words associated with it.

  • But 💡 as you are not creating new empty paragraphs (Enter only works within a paragraph, to split it), and since delete now also merges and preserves timecodes, when we run the alignment could we just compare the words attribute with the text of the block, and align only the blocks where the text is different from the text in the words?

    • eg if the word count is the same but the text is different, only replace the text in the words list and keep the time-codes etc..
    • if the word count is different, then run sttAlignNode? etc...?

This would mean that you are running the STT align against the most recent re-alignment, as opposed to the original STT data. But it would give the flexibility to handle changing paragraphs, as well as skip the alignment of paragraphs that might not need it.
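A minimal sketch of that decision logic for a single paragraph block. The function name `realignParagraph` is hypothetical, and `alignFn` stands in for a call to a real aligner such as stt-align-node; its `(words, text)` signature is an assumption made for this example.

```javascript
const splitWords = (text) => text.trim().split(/\s+/).filter(Boolean);

// Decides per paragraph whether alignment can be skipped, patched in place, or needs the aligner.
function realignParagraph(paragraph, alignFn) {
  const child = paragraph.children[0];
  const textWords = splitWords(child.text);
  const timedWords = child.words.map((word) => word.text);

  // Text unchanged (ignoring white space): skip alignment and return the paragraph as is.
  if (textWords.join(' ') === splitWords(timedWords.join(' ')).join(' ')) return paragraph;

  let words;
  if (textWords.length === child.words.length) {
    // Same word count: keep the time-codes and only swap in the corrected text.
    words = child.words.map((word, index) => ({ ...word, text: textWords[index] }));
  } else {
    // Different word count: fall back to the aligner provided by the caller.
    words = alignFn(child.words, child.text);
  }
  return { ...paragraph, children: [{ ...child, words }] };
}
```

Since unchanged paragraphs are returned as the same object, a caller could also use a simple identity check to know whether anything needs to be written back to the editor.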

Still unsure about the frequency of the alignment: definitely on save, but not sure if it should happen on pause typing, maybe not for now. Need to check performance against a longer file (eg 1 to 5 hours).

pietrop commented Feb 18, 2021

Updated storybook demo https://pietropassarelli.com/slate-transcript-editor/ to reflect this PR #36

to recap

  • double clicking on a word takes you to that point in the media (as opposed to before, where it was paragraph level only)
  • still no word level highlight by design, to keep it performant, but open to adding it if there's some good 💡
  • handles splitting a paragraph (and splits the corresponding words list associated with the paragraph, using the cursor char offset)
  • handles delete at the beginning of a paragraph to merge two paragraphs (+ recombines the words lists into the new paragraph and moves the cursor/selection)
  • disables splitting a paragraph while selecting text (for now?)
  • disables deleting text across paragraphs
  • handles deleting text within a paragraph
  • alignment btn / restore timecodes works by comparing the slateJs words list and the text in blocks/paragraphs for changes in text, while ignoring white spaces. This way only the paragraphs that have changed are aligned with stt-align-node
  • there's a flag to check if the text has been modified; if it has not, alignment is skipped when saving or exporting from the editor, as an optimization.
  • save btn runs alignment
  • export btn runs alignment
  • refactored to use material UI for ease of theming and portability.
  • as a side effect, alignment localized at paragraph level means you can also run alignment in a live use case, with interim results populating the editor - see http://localhost:6006/?path=/story/live--editable

Some things I am not sure about

  • I think it would be neat to use a timer, or some kind of debounce, to bring back auto save. But I am not sure if I am doing it right. I have one in place for the optional "pause while typing", but it seems like introducing a timer that way on key down might introduce performance issues? 🤷‍♂️ any thoughts or 💡 ❓
  • auto save could also run auto alignment with the same logic when the user stops typing, if there have been any changes - if it doesn't affect performance.

extra / stretch goal

  • one thing that I found myself using for certain projects was selecting the whole text and replacing it with an accurate transcription (without speakers), in order to use the editor to re-align it and export a time-coded version. This wouldn't work without the possibility of bulk deleting or replacing paragraphs. So I added a dedicated btn with a prompt where you can paste the new text, and it runs the alignment and replaces the slateJs content, while preserving the slateJs paragraph breaks. (altho not sure if that's good. I might revisit that, it might be better if it does the paragraph breaks based on the line breaks of the input text. or maybe it's not an issue for now 🤷‍♂️ ) This would probably still freeze the UI for a long transcript, well over 1 hour.

pietrop commented Feb 19, 2021

PR #36 recap

  • changed pause while typing to use debounce instead of a timer
  • got debounce working for alignment when the user stops typing, but commented it out for now coz I cannot properly assess if it affects performance
  • can consider adding auto save as part of the debounced alignment, if it doesn't affect performance.
  • inserting text + Enter does the alignment before the split
  • deleting text before merging two paragraphs does the alignment before the merge

pietrop commented Feb 27, 2021

This has been merged to master and deployed as alpha releases, to test it out and make it easier to revert back if needed.
Will bump up the version when there's more confidence that it was a successful refactor that didn't introduce 🐞

Closing this for now.

pietrop closed this as completed on Feb 27, 2021