Vocabulary needs to also be shortened in #removeNgramsWithCountsLessThan: #21

Open
myroslavarm opened this issue Mar 10, 2020 · 1 comment

Comments

@myroslavarm
Contributor

In this method we delete ngrams and reduce history counts. I think the vocabulary needs to be cleaned up too (for instance, when a word's history count drops to zero).

The main idea of this method is to get rid of tokens and token sequences that we consider irrelevant, in order to speed up reading from a file and lookups within the model. Always keeping every vocabulary entry defeats that purpose.
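To make the idea concrete, here is a minimal sketch of pruning the vocabulary together with the ngrams. It is written in Python rather than Pharo purely for illustration. The function name `prune_ngrams_and_vocabulary`, the use of tuples for ngrams, and the rule "keep a word only if it still occurs in some surviving ngram" are all assumptions, not the library's actual behavior.

```python
from collections import Counter


def prune_ngrams_and_vocabulary(ngram_counts, vocabulary, threshold):
    """Hypothetical helper: drop ngrams seen fewer than `threshold` times,
    then keep only the vocabulary words that still occur in a surviving ngram."""
    kept = Counter({ngram: count
                    for ngram, count in ngram_counts.items()
                    if count >= threshold})
    # Collect every word that is still part of at least one surviving ngram.
    still_used = {word for ngram in kept for word in ngram}
    return kept, vocabulary & still_used


# Toy data (made up for illustration).
ngram_counts = Counter({("the", "cat"): 5, ("the", "dog"): 1, ("a", "cat"): 3})
vocabulary = {"the", "cat", "dog", "a"}

kept, vocab = prune_ngrams_and_vocabulary(ngram_counts, vocabulary, 2)
print(sorted(vocab))
```

In this toy example "dog" only ever appeared in an ngram that fell below the threshold, so it is the one word that leaves the vocabulary.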

@myroslavarm
Contributor Author

I am still not sure about a good solution for reducing the vocabulary, but I think the history needs to be reduced as well, perhaps using something like this:

historyCounts := historyCounts rejectWithOccurrences: [ :each :count |
		count < aNumber ]

Because technically, if we are reducing the ngrams using the same threshold, then those words we are "throwing out" of ngramCounts will have the same or even higher occurrence in historyCounts and therefore should be safe to delete. Tell me what you think, @olekscode.
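The argument above can be checked on a toy example. The sketch below is in Python rather than Pharo, and it assumes that a history is an ngram without its last word and that a history's count is the sum of the counts of all ngrams sharing it. The name `prune_with_histories` is hypothetical, and the library may maintain these counts differently.

```python
from collections import Counter


def prune_with_histories(ngram_counts, threshold):
    """Hypothetical helper: apply the same threshold to ngram counts
    and to history counts."""
    # Assumption: history count = sum of the counts of ngrams with that history.
    history_counts = Counter()
    for ngram, count in ngram_counts.items():
        history_counts[ngram[:-1]] += count

    kept_ngrams = Counter({ngram: count
                           for ngram, count in ngram_counts.items()
                           if count >= threshold})
    kept_histories = Counter({history: count
                              for history, count in history_counts.items()
                              if count >= threshold})
    return kept_ngrams, kept_histories


# Toy data (made up for illustration).
ngram_counts = Counter({
    ("the", "cat"): 5,
    ("the", "dog"): 1,
    ("a", "cat"): 1,
    ("a", "dog"): 1,
    ("my", "cat"): 1,
})

kept_ngrams, kept_histories = prune_with_histories(ngram_counts, 2)

# No surviving ngram loses its history.
assert all(ngram[:-1] in kept_histories for ngram in kept_ngrams)
```

Under these assumptions a history's count is never lower than the count of any ngram that uses it, so rejecting histories below the threshold cannot remove one that a surviving ngram still needs. The reverse does not hold: the history `("a",)` keeps a total count of 2 although neither of its ngrams survives, so a threshold on history counts alone may leave some unused histories behind.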
