
enable hot-word boosting #3297

Merged: 48 commits into mozilla:master on Sep 24, 2020

Conversation

@JRMeyer (Contributor) commented Aug 31, 2020

This PR enables hot-word boosting (with immediate support in the C and Python clients) via the new flag --hot_words.

The flag takes a string of word:boost pairs, with pairs separated by commas and each word separated from its boost by a colon, e.g. --hot_words "friend:1.5,enemy:20.4". Each boost is a floating-point number between -inf and +inf.
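
For illustration, the pair-splitting a client might do looks roughly like this (a sketch only; parse_hot_words and all surrounding names are hypothetical, not the PR's actual client code):

    #include <sstream>
    #include <string>
    #include <unordered_map>

    // Split "friend:1.5,enemy:20.4" into {word -> boost}.
    std::unordered_map<std::string, float>
    parse_hot_words(const std::string& flag) {
      std::unordered_map<std::string, float> hot_words;
      std::stringstream ss(flag);
      std::string entry;
      while (std::getline(ss, entry, ',')) {
        auto colon = entry.rfind(':');
        if (colon == std::string::npos) continue;  // skip malformed entries
        hot_words[entry.substr(0, colon)] = std::stof(entry.substr(colon + 1));
      }
      return hot_words;
    }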

The boost is applied as an addition to the log likelihood of a candidate word sequence, as given by the KenLM language model. Since the LM score is a log probability, 0.0 corresponds to 100% likelihood and negative infinity to 0% likelihood; the KenLM model therefore always returns some negative number.

For example, if KenLM returns -3.5 as the log likelihood for the word sequence "i like cheese" and we add 3 to this number, we get -0.5, increasing the likelihood of that sequence. On the other hand, if we add -3, we get -6.5, decreasing its likelihood. Adding a negative number as a boost will therefore make the decoder "avoid" certain words.
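
A quick check of that arithmetic in code (natural logs used for illustration; KenLM itself reports log10 scores):

    #include <cmath>
    #include <cstdio>

    int main() {
      double lm_score = -3.5;            // log likelihood of "i like cheese"
      double boosted  = lm_score + 3.0;  // additive boost -> -0.5

      // Addition in log space is multiplication in probability space:
      std::printf("%.3f\n", std::exp(lm_score));  // ~0.030
      std::printf("%.3f\n", std::exp(boosted));   // ~0.607 (exp(3) ~ 20x larger)
      return 0;
    }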

@lissyx (Collaborator) commented Sep 1, 2020

This PR enables hot-word boosting (from the C client) with the two new flags --hot_words and --boost_coefficient.

Can we not limit this to the C client? It's very likely that people will want to use this part of the API from elsewhere, and in the current state it's completely unknown whether that works or not.

if (!hot_words_.empty()) {
  // increase prob of prefix for every word
  // that matches a word in the hot-words list
  for (std::string word : ngram) {
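
(The diff is truncated here; presumably the loop body looks up each word and adds its boost to the score. A minimal sketch under that assumption, taking hot_words_ to be a map from word to boost and cond_prob a hypothetical accumulator, not the PR's actual names:)

    for (const std::string& word : ngram) {
      auto it = hot_words_.find(word);
      if (it != hot_words_.end()) {
        cond_prob += it->second;  // word is hot: add its boost to the log score
      }
    }
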
lissyx (Collaborator): have you measured perf impact with scorers?

JRMeyer (Contributor, Author): perf as in WER? or perf as in latency?

lissyx (Collaborator): latency

JRMeyer (Contributor, Author): I have not measured the latency effects yet, no. Are there any TC jobs that do this, or should I profile locally? What do you recommend?

lissyx (Collaborator): Unfortunately, you'd have to do it locally. Using perf should be quite easy.

@lissyx (Collaborator) left a review comment:

Please expose it in the API as a real list of words, and please add:

  • basic CI testing for that feature
  • usage in different bindings would really be a good thing (Python, JS, .Net, Java) if you can

Also, it looks like your current code breaks training and the CTC decoder, so please fix that.

@reuben (Contributor) commented Sep 1, 2020

For example, if KenLM returns -3.5 as the likelihood for the word sequence "i like cheese", if we multiply this number by 0.5, we get -1.75, therefore doubling the likelihood of that sequence.

This isn't how log probabilities work, you're making exponential increases in the probability here. exp(-3.5) ~= 0.03 and exp(-1.75) ~= 0.17. This, combined with the fact that a single word will be boosted several times in the same beam as it appears in multiple n-grams, makes it hard to reason about the behavior of the coefficient. It should probably be an additive factor (multiplication in probability space).


@lissyx (Collaborator) commented Sep 1, 2020

@JRMeyer To keep your API simpler, I suggest you move to a single entry point:

DEEPSPEECH_EXPORT
int DS_AddHotWord(ModelState* aCtx, const char* word, float boostCoefficient)

This entry point would add a new word to your std::vector (or a set, perhaps, since that would guarantee uniqueness). If the hot word does not exist, we add it with the given boost; if it is already in the set, we update its coefficient.

Depending on the use case, it could also be useful to expose (though I'm unsure it is really required):

DEEPSPEECH_EXPORT
int DS_ClearHotWords(ModelState* aCtx)

This would simply re-initialize the set of hot words.

With this API, you could more easily expose and update all our bindings (const char** is a bit painful via SWIG) to make the feature available.
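
For illustration, a client loop using these proposed entry points might look like this (a sketch, assuming the proposal is adopted as written; DS_CreateModel, DS_EnableExternalScorer, and DS_FreeModel are the existing model-lifecycle calls, and paths and boost values are placeholders):

    ModelState* ctx = NULL;
    DS_CreateModel("deepspeech.pbmm", &ctx);
    DS_EnableExternalScorer(ctx, "kenlm.scorer");

    DS_AddHotWord(ctx, "friend", 1.5f);    // add a hot word with its boost
    DS_AddHotWord(ctx, "friend", 2.0f);    // same word again: update the boost
    DS_AddHotWord(ctx, "enemy", -20.4f);   // a negative boost penalizes a word

    DS_ClearHotWords(ctx);                 // re-init the set of hot words
    DS_FreeModel(ctx);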

@JRMeyer (Contributor, Author) commented Sep 10, 2020

For example, if KenLM returns -3.5 as the likelihood for the word sequence "i like cheese", if we multiply this number by 0.5, we get -1.75, therefore doubling the likelihood of that sequence.

This isn't how log probabilities work, you're making exponential increases in the probability here. exp(-3.5) ~= 0.03 and exp(-1.75) ~= 0.17. This, combined with the fact that a single word will be boosted several times in the same beam as it appears in multiple n-grams, makes it hard to reason about the behavior of the coefficient. It should probably be an additive factor (multiplication in probability space).

Even though my initial intuition was wrong about how the boosting compounds, I still like the UX. Namely, if you're using this feature and trying to find the right boosting coefficient for your data, you know to sweep between 0 and 1, which isn't hard.

With an additive effect, the search space grows from (0, 1) to (0, infinity). The math is better, but the UX seems worse. I made the changes in 184189c, but I still have doubts. Thoughts?

@carlfm01 (Collaborator) left a review comment:

Nice @JRMeyer, just missing the following on the IDeepSpeech interface:

        /// <summary>
        /// Add a hot-word.
        /// </summary>
        /// <param name="aWord">Some word</param>
        /// <param name="aBoost">Some boost</param>
        /// <exception cref="ArgumentException">Thrown on failure.</exception>
        public void AddHotWord(string aWord, float aBoost);

        /// <summary>
        /// Erase entry for a hot-word.
        /// </summary>
        /// <param name="aWord">Some word</param>
        /// <exception cref="ArgumentException">Thrown on failure.</exception>
        public void EraseHotWord(string aWord);

        /// <summary>
        /// Clear all hot-words.
        /// </summary>
        /// <exception cref="ArgumentException">Thrown on failure.</exception>
        public void ClearHotWords();

@JRMeyer (Contributor, Author) commented Sep 22, 2020

Nice @JRMeyer, just missing the following on the IDeepSpeech interface: […]

I set these as unsafe void in 5432f56

@carlfm01 (Collaborator) commented Sep 22, 2020

I set these as unsafe void in

Sorry, I forgot to delete the public, you did it right.

@lissyx self-requested a review on September 24, 2020 at 13:58
@lissyx (Collaborator) left a review comment:

This is now looking quite good; just fix the Android test execution and make sure to squash into one commit.

@JRMeyer merged commit 1eb155e into mozilla:master on Sep 24, 2020
@JRMeyer (Contributor, Author) commented Sep 24, 2020

@lissyx -- it's one commit, but all the previous commit messages got appended into the one commit message :/ It doesn't look pretty, but yes, it is one commit.

@lissyx mentioned this pull request on Sep 25, 2020
reuben referenced this pull request in reuben/STT on Nov 16, 2020; the squashed commit message follows:
* enable hot-word boosting

* more consistent ordering of CLI arguments

* progress on review

* use map instead of set for hot-words, move string logic to client.cc

* typo bug

* pointer things?

* use map for hotwords, better string splitting

* add the boost, not multiply

* cleaning up

* cleaning whitespace

* remove <set> inclusion

* change typo set-->map

* rename boost_coefficient to boost

X-DeepSpeech: NOBUILD

* add hot_words to python bindings

* missing hot_words

* include map in swigwrapper.i

* add Map template to swigwrapper.i

* emacs intermediate file

* map things

* map-->unordered_map

* typu

* typu

* use dict() not None

* error out if hot_words without scorer

* two new functions: remove hot-word and clear all hot-words

* starting to work on better error messages

X-DeepSpeech: NOBUILD

* better error handling + .Net ERR codes

* allow for negative boosts:)

* adding TC test for hot-words

* add hot-words to python client, make TC test hot-words everywhere

* only run TC tests for C++ and Python

* fully expose API in python bindings

* expose API in Java (thanks spectie!)

* expose API in dotnet (thanks spectie!)

* expose API in javascript (thanks spectie!)

* java lol

* typo in javascript

* commenting

* java error codes from swig

* java docs from SWIG

* java and dotnet issues

* add hotword test to android tests

* dotnet fixes from carlos

* add DS_BINARY_PREFIX to tc-asserts.sh for hotwords command

* make sure lm is on android for hotword test

* path to android model + nit

* path

* path