Vocabulary needs to also be shortened in #removeNgramsWithCountsLessThan: #21

myroslavarm · 2020-03-10T16:48:06Z

In this method we delete ngrams and reduce history counts. I think the vocabulary needs to be cleaned up too (for instance, when a word's history count drops to zero).

The main idea of this method is to get rid of tokens and token sequences that we consider irrelevant, in order to speed up reading from file and lookup within the model. Always keeping every vocabulary entry defeats that purpose.

myroslavarm · 2020-03-16T17:18:34Z

I am still not sure about a good solution for reducing the vocabulary, but I think the history needs to be reduced further too, perhaps using something like this:

historyCounts := historyCounts rejectWithOccurrences: [ :each :count |
	count < aNumber ]

Technically, if we are reducing the ngrams using the same threshold, then the words we are "throwing out" of ngramCounts will have the same or an even higher occurrence count in historyCounts, so it should be safe to delete them. Tell me what you think, @olekscode.
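To make the threshold argument concrete, here is a minimal sketch in Python (not Pharo, and not the library's actual code). It assumes an ngram is a tuple whose history is everything but the last token, and that a history's count is the sum of the counts of the ngrams that start with it. The names `ngram_counts`, `history_counts`, `threshold`, and the toy data are all hypothetical. Under those assumptions a history's count is never lower than the count of any of its ngrams, so filtering both tables with the same threshold never removes the history of a surviving ngram. The vocabulary is then rebuilt from the surviving ngrams only.

```python
from collections import Counter

# Hypothetical toy bigram counts: (history..., last word) -> count.
ngram_counts = Counter({
    ("the", "cat"): 3,
    ("the", "dog"): 1,
    ("a", "bird"): 1,
})
threshold = 2

# Assumption: a history's count is the sum over all ngrams sharing that history.
history_counts = Counter()
for ngram, count in ngram_counts.items():
    history_counts[ngram[:-1]] += count

# Apply the same threshold to both tables, as proposed in the comment above.
kept_ngrams = {ng: c for ng, c in ngram_counts.items() if c >= threshold}
kept_histories = {h: c for h, c in history_counts.items() if c >= threshold}

# Every surviving ngram still has its history, because
# history count >= ngram count >= threshold.
assert all(ng[:-1] in kept_histories for ng in kept_ngrams)

# Rebuild the vocabulary from the tokens that still occur in some ngram.
vocabulary = {word for ng in kept_ngrams for word in ng}
```

Note that this sketch filters the history counts as they were before pruning. If the method has already reduced the history counts while deleting ngrams, the same invariant still holds, since a surviving ngram contributes at least the threshold to its history.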

myroslavarm mentioned this issue on Mar 20, 2020: Shorten vocabulary and history for speed (myroslavarm/CompletionSorting#17, open)
