enable hot-word boosting #3297
Can we not limit that to the C client? It's very likely that people will want to use this part of the API from elsewhere, and in the current state it's completely unknown whether this works or not.
if (!hot_words_.empty()) {
  // increase prob of prefix for every word
  // that matches a word in the hot-words list
  for (std::string word : ngram) {
have you measured perf impact with scorers?
perf as in WER? or perf as in latency?
latency
I have not measured the latency effects yet, no.
Are there any TC jobs that do this, or should I profile locally? What do you recommend?
Unfortunately, you'd have to do it locally. Using perf should be quite easy.
Please expose it in the API as a real list of words, and please add:
- basic CI testing for that feature
- usage in different bindings would really be a good thing (Python, JS, .Net, Java) if you can
Also, it looks like your current code breaks training and ctc decoder, so please fix that.
This isn't how log probabilities work, you're making exponential increases in the probability here. exp(-3.5) ~= 0.03 and exp(-1.75) ~= 0.17. This, combined with the fact that a single word will be boosted several times in the same beam as it appears in multiple n-grams, makes it hard to reason about the behavior of the coefficient. It should probably be an additive factor (multiplication in probability space).
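The arithmetic in this comment can be checked with a short sketch. The function names `scale_log_prob` and `add_boost` are made up for illustration and are not part of the DeepSpeech code base; the values -3.5 and 0.5 mirror the numbers quoted in the comment. Multiplying a log probability by a coefficient raises the probability to that power, so the gain depends on the starting probability, whereas adding a boost multiplies every probability by the same factor.

```python
import math


def scale_log_prob(log_prob: float, coefficient: float) -> float:
    """Multiply a log probability by a coefficient (same as p ** coefficient)."""
    return log_prob * coefficient


def add_boost(log_prob: float, boost: float) -> float:
    """Add a boost in log space (same as p * exp(boost) in probability space)."""
    return log_prob + boost


if __name__ == "__main__":
    for log_prob in (-3.5, -7.0):
        scaled = scale_log_prob(log_prob, 0.5)
        added = add_boost(log_prob, 1.0)
        # Ratio of boosted probability to original probability.
        scaled_gain = math.exp(scaled) / math.exp(log_prob)
        added_gain = math.exp(added) / math.exp(log_prob)
        print(f"log_prob={log_prob}: scaling gain x{scaled_gain:.2f}, "
              f"additive gain x{added_gain:.2f}")
```

With scaling, the gain grows as the original sequence gets less likely; with addition, the gain is the constant factor exp(boost) regardless of the starting probability.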
@JRMeyer To keep your API simpler, I suggest you move to a single entry point:
This entry point would add a new Depending on usecase, it could also be cool to expose (though I'm unsure it is really required):
This would simply re-init the set of hot words With this API, you could more easily expose and update all our bindings ( |
Even though my initial intuition was wrong about how the boosting compounds, I still like the UX. Namely, if you're using this feature and trying to find the right boosting coefficient for your data, you would know to sweep between 0 and 1, which isn't hard. With an additive effect, the search space now goes from
Nice @JRMeyer, just missing the following on the IDeepSpeech interface:
/// <summary>
/// Add a hot-word.
/// </summary>
/// <param name="aWord">Some word</param>
/// <param name="aBoost">Some boost</param>
/// <exception cref="ArgumentException">Thrown on failure.</exception>
public void AddHotWord(string aWord, float aBoost);
/// <summary>
/// Erase entry for a hot-word.
/// </summary>
/// <param name="aWord">Some word</param>
/// <exception cref="ArgumentException">Thrown on failure.</exception>
public void EraseHotWord(string aWord);
/// <summary>
/// Clear all hot-words.
/// </summary>
/// <exception cref="ArgumentException">Thrown on failure.</exception>
public void ClearHotWords();
I set these as
Sorry, I forgot to delete the public, you did it right.
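For readers following the binding discussion, here is a minimal sketch of the semantics that the three operations (add, erase, clear) suggest. The class `HotWordStore` and its method names are hypothetical stand-ins written for illustration; this is not the DeepSpeech binding API, and raising an error when erasing an unknown word is an assumption based only on the "Thrown on failure" doc comments.

```python
class HotWordStore:
    """Hypothetical in-memory map from hot-word to boost value."""

    def __init__(self) -> None:
        self._boosts: dict[str, float] = {}

    def add_hot_word(self, word: str, boost: float) -> None:
        """Add a hot-word, or overwrite its boost if already present."""
        if not word:
            raise ValueError("hot-word must be a non-empty string")
        self._boosts[word] = float(boost)

    def erase_hot_word(self, word: str) -> None:
        """Remove one hot-word; assumed to fail if the word is unknown."""
        if word not in self._boosts:
            raise ValueError(f"unknown hot-word: {word!r}")
        del self._boosts[word]

    def clear_hot_words(self) -> None:
        """Remove every hot-word."""
        self._boosts.clear()

    def boost_for(self, word: str) -> float:
        """Return the boost for a word, or 0.0 when it is not a hot-word."""
        return self._boosts.get(word, 0.0)
```

Keeping the words in a map keyed by word means that adding the same word twice simply updates its boost, which matches the "use map instead of set" direction taken during review.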
This is now looking quite good. Just fix the Android test execution, and make sure to squash into one commit.
@lissyx -- it's one commit, but all the previous commit messages got appended into the one commit message :/ it doesn't look pretty, but yes it is one commit
* enable hot-word boosting
* more consistent ordering of CLI arguments
* progress on review
* use map instead of set for hot-words, move string logic to client.cc
* typo bug
* pointer things?
* use map for hotwords, better string splitting
* add the boost, not multiply
* cleaning up
* cleaning whitespace
* remove <set> inclusion
* change typo set-->map
* rename boost_coefficient to boost X-DeepSpeech: NOBUILD
* add hot_words to python bindings
* missing hot_words
* include map in swigwrapper.i
* add Map template to swigwrapper.i
* emacs intermediate file
* map things
* map-->unordered_map
* typu
* typu
* use dict() not None
* error out if hot_words without scorer
* two new functions: remove hot-word and clear all hot-words
* starting to work on better error messages X-DeepSpeech: NOBUILD
* better error handling + .Net ERR codes
* allow for negative boosts:)
* adding TC test for hot-words
* add hot-words to python client, make TC test hot-words everywhere
* only run TC tests for C++ and Python
* fully expose API in python bindings
* expose API in Java (thanks spectie!)
* expose API in dotnet (thanks spectie!)
* expose API in javascript (thanks spectie!)
* java lol
* typo in javascript
* commenting
* java error codes from swig
* java docs from SWIG
* java and dotnet issues
* add hotword test to android tests
* dotnet fixes from carlos
* add DS_BINARY_PREFIX to tc-asserts.sh for hotwords command
* make sure lm is on android for hotword test
* path to android model + nit
* path
* path
This PR enables hot-word boosting (immediate support in the C and Python clients) with the new flag `--hot_words`.

The flag takes a string of words and their respective boosts, separated by commas and colons, as such: `--hot_words "friend:1.5,enemy:20.4"`. The boost takes a floating-point number between `-inf` and `inf`.

The boosting is applied as an addition to the log-likelihood of a candidate word sequence, given by the KenLM language model. Since the LM score is a log probability, `0.0` corresponds to 100% likelihood and negative infinity to 0% likelihood. As such, we will always have some negative number from the KenLM model.

For example, if KenLM returns `-3.5` as the likelihood for the word sequence "i like cheese" and we add `3` to this number, we get `-0.5`, therefore increasing the likelihood of that sequence. On the other hand, if we add `-3` to the likelihood, we decrease the likelihood of that sequence. Adding a negative number as a boost will make the decoder "avoid" certain words.