Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances fully lazy. #1756

adamreichold · 2023-01-03T14:38:41Z

No description provided.

codecov-commenter · 2023-01-03T14:56:29Z

Codecov Report

Merging #1756 (b288f2b) into main (b22f966) will decrease coverage by 0.01%.
The diff coverage is 86.20%.

❗ Current head b288f2b differs from pull request most recent head 789b5e6. Consider uploading reports for the commit 789b5e6 to get more accurate results

@@            Coverage Diff             @@
##             main    #1756      +/-   ##
==========================================
- Coverage   94.13%   94.12%   -0.02%     
==========================================
  Files         267      267              
  Lines       50900    50903       +3     
==========================================
- Hits        47917    47910       -7     
- Misses       2983     2993      +10

Impacted Files	Coverage Δ
src/query/fuzzy_query.rs	`90.47% <73.33%> (-0.59%)`	⬇️
src/directory/ram_directory.rs	`90.50% <100.00%> (ø)`
src/directory/watch_event_router.rs	`95.79% <100.00%> (ø)`
src/indexer/json_term_writer.rs	`99.79% <100.00%> (ø)`
src/indexer/segment_updater.rs	`94.40% <100.00%> (-1.05%)`	⬇️
src/postings/compression/mod.rs	`100.00% <100.00%> (ø)`
src/query/boolean_query/boolean_query.rs	`92.66% <100.00%> (ø)`
src/query/boolean_query/boolean_weight.rs	`93.43% <100.00%> (ø)`
src/query/query_parser/logical_ast.rs	`80.00% <100.00%> (ø)`
src/schema/text_options.rs	`100.00% <100.00%> (ø)`
... and 7 more


ChillFish8 · 2023-01-03T23:59:38Z

src/query/fuzzy_query.rs

+        static BUILDER: Lazy<RwLock<FxHashMap<(u8, bool), LevenshteinAutomatonBuilder>>> =
+            Lazy::new(Default::default);
+
+        loop {


Do we really want a RwLock here? Although this makes the logic lazy, it adds overhead every time we create the scorer. I believe the standard library RwLock still has performance issues when several threads access it concurrently, even for reads only, and this is potentially a very hot path.

Secondly, this removes the cap on the maximum edit distance per term. That isn't bad in itself, but it does let people pick a value that is too high; 3 tends to be the realistic maximum you'd use, although it depends on the word length.
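
For readers without the full diff, the pattern under review can be sketched roughly as follows. This is not the PR's code: `Builder` is a hypothetical stand-in for `LevenshteinAutomatonBuilder`, `with_builder` is an invented helper, and std's `OnceLock` replaces the `Lazy` wrapper from the diff. It only illustrates why every call pays for a read lock and why a loop is needed.

```rust
use std::collections::HashMap;
use std::sync::{OnceLock, RwLock};

/// Hypothetical stand-in for `LevenshteinAutomatonBuilder`, which is
/// expensive to construct in the real crate.
pub struct Builder {
    pub distance: u8,
    pub transposition_cost_one: bool,
}

impl Builder {
    pub fn new(distance: u8, transposition_cost_one: bool) -> Self {
        Builder { distance, transposition_cost_one }
    }
}

type Cache = RwLock<HashMap<(u8, bool), Builder>>;

fn cache() -> &'static Cache {
    static CACHE: OnceLock<Cache> = OnceLock::new();
    CACHE.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Runs `f` with the cached builder for the given parameters,
/// constructing the builder on first use.
pub fn with_builder<R>(
    distance: u8,
    transposition_cost_one: bool,
    f: impl FnOnce(&Builder) -> R,
) -> R {
    let key = (distance, transposition_cost_one);
    loop {
        {
            // Fast path: a shared read lock is taken on every call,
            // which is the overhead questioned in the review.
            let map = cache().read().unwrap();
            if let Some(builder) = map.get(&key) {
                return f(builder);
            }
        } // The read guard is dropped here, before the write lock is taken.

        // Slow path: take the write lock and insert if still missing,
        // then loop back to the read path.
        cache()
            .write()
            .unwrap()
            .entry(key)
            .or_insert_with(|| Builder::new(distance, transposition_cost_one));
    }
}
```

In this sketch, a call such as `with_builder(2, true, |b| b.distance)` constructs the builder for `(2, true)` on first use and only takes the read lock afterwards. Note that the map accepts any `u8` distance, which is the missing cap raised in the second point.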

> Do we really want a RwLock here? Although this makes the logic lazy, it adds overhead every time we create the scorer. I believe the standard library RwLock still has performance issues when several threads access it concurrently, even for reads only, and this is potentially a very hot path.

Yes, I am somewhat unhappy with the RwLock too, but some form of synchronization is necessary if laziness is desired.

Alternatives I see are:

Using a Mutex, which should be faster but would prevent concurrent reads after the relevant builders have been initialized. Alternatively, an Arc<LevenshteinAutomatonBuilder> could be cached and cloned under the mutex so that usage can proceed without holding it.

If we do accept the additional indirection of Arc<LevenshteinAutomatonBuilder>, we could also use ArcSwap (already a dependency) instead of RwLock, and publish a new, larger hash table whenever another LevenshteinAutomatonBuilder has to be initialized; the new table would contain copies of all the previous Arc<LevenshteinAutomatonBuilder> instances. (LevenshteinAutomatonBuilder itself does not implement Clone, but I guess that could be changed upstream to avoid the additional Arc indirection in this case.)
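
The first alternative (cache an `Arc` and clone it under a `Mutex`) might look roughly like the sketch below. It is illustrative only: `Builder` is a hypothetical stand-in for `LevenshteinAutomatonBuilder`, and `shared_builder` is an invented name, not part of tantivy.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for `LevenshteinAutomatonBuilder`.
pub struct Builder {
    pub distance: u8,
    pub transposition_cost_one: bool,
}

impl Builder {
    pub fn new(distance: u8, transposition_cost_one: bool) -> Self {
        Builder { distance, transposition_cost_one }
    }
}

/// Returns a shared handle to the builder for the given parameters.
/// The mutex is held only for the lookup (and for the first
/// construction); callers then use the `Arc` without any lock.
pub fn shared_builder(distance: u8, transposition_cost_one: bool) -> Arc<Builder> {
    static CACHE: OnceLock<Mutex<HashMap<(u8, bool), Arc<Builder>>>> = OnceLock::new();

    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();

    // Construction happens while the lock is held, so concurrent callers
    // wait instead of building the same builder twice.
    let builder = map
        .entry((distance, transposition_cost_one))
        .or_insert_with(|| Arc::new(Builder::new(distance, transposition_cost_one)));

    Arc::clone(builder)
}
```

The trade-off in this sketch is one short critical section plus one atomic reference-count increment per call, in exchange for lock-free use of the builder afterwards.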

> Secondly, this removes the cap on the maximum edit distance per term. That isn't bad in itself, but it does let people pick a value that is too high; 3 tends to be the realistic maximum you'd use, although it depends on the word length.

The limitation seems artificial IMHO and is tied to the eager construction of the cached LevenshteinAutomatonBuilder instances, which is why I removed it after making things lazy.

I have also read about 3 being a realistic limit, but I think this would be better served by a soft limit, e.g. advice given in the documentation of the new and new_prefix methods.

Opened quickwit-oss/levenshtein-automata#13 to enable cloning of the builders.

Could we have a hash map (or arrays) with a OnceCell as the value? That would remove the RwLock, the loop, and the potentially expensive redundant construction of Levenshtein builders when threads race.

> We could have a hashmap (or arrays) with once_cell as value?

We could lazily initialize only the entries and eagerly construct the data structure itself, but if we keep the static limit on the distance, then the current approach - lazily constructing all builders at once - seems the most reasonable to me and I would just remove the TODO.

Pushed an implementation of that approach, i.e. using [[OnceCell<LevenshteinAutomatonBuilder>; 2]; 3].
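
A minimal sketch of that layout is shown below. Two assumptions: std's `OnceLock` is used in place of the `OnceCell` named in the comment so the example has no external dependencies, and `Builder` and `builder_for` are hypothetical stand-ins rather than the merged code.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for `LevenshteinAutomatonBuilder`.
pub struct Builder {
    pub distance: u8,
    pub transposition_cost_one: bool,
}

impl Builder {
    pub fn new(distance: u8, transposition_cost_one: bool) -> Self {
        Builder { distance, transposition_cost_one }
    }
}

// Outer index: distance (0, 1 or 2). Inner index: transposition flag.
// The array itself is built eagerly (it is only six empty cells);
// each builder is constructed on first use.
static AUTOMATON_BUILDER: [[OnceLock<Builder>; 2]; 3] = [
    [OnceLock::new(), OnceLock::new()],
    [OnceLock::new(), OnceLock::new()],
    [OnceLock::new(), OnceLock::new()],
];

/// Returns the cached builder, or an error for unsupported distances.
pub fn builder_for(
    distance: u8,
    transposition_cost_one: bool,
) -> Result<&'static Builder, String> {
    let row = AUTOMATON_BUILDER.get(distance as usize).ok_or_else(|| {
        format!(
            "Levenshtein distance of {} is not allowed. Choose a value less than {}",
            distance,
            AUTOMATON_BUILDER.len()
        )
    })?;

    // `get_or_init` runs the closure at most once per cell; later calls
    // are a cheap initialized-check with no lock and no loop.
    Ok(row[transposition_cost_one as usize]
        .get_or_init(|| Builder::new(distance, transposition_cost_one)))
}
```

Because the array length is fixed, this sketch keeps the static cap on the distance: `builder_for(3, false)` returns an error rather than constructing a builder.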


fulmicoton · 2023-01-04T09:14:26Z

src/query/fuzzy_query.rs

+            .get(self.transposition_cost_one as usize)
+            .unwrap()


Suggested change

-            .get(self.transposition_cost_one as usize)
-            .unwrap()
+            [self.transposition_cost_one as usize]

Wouldn't that be the same?

It would certainly be the same, but it formats very unfavorably:

    let automaton_builder = AUTOMATON_BUILDER
        .get(self.distance as usize)
        .ok_or_else(|| {
            InvalidArgument(format!(
                "Levenshtein distance of {} is not allowed. Choose a value less than {}",
                self.distance,
                AUTOMATON_BUILDER.len()
            ))
        })?[self.transposition_cost_one as usize]
        .get_or_init(|| {
            LevenshteinAutomatonBuilder::new(self.distance, self.transposition_cost_one)
        });

which is why I went for this somewhat verbose way of doing it. But I am not really invested in this.

Which one do you prefer after seeing the formatting, [..] or get(..).unwrap()?


Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances lazy. (789b5e6)


fulmicoton approved these changes Jan 4, 2023


fulmicoton merged commit 1afa5bf into quickwit-oss:main Jan 6, 2023

adamreichold deleted the lazy-levenshtein-builder branch January 6, 2023 07:58

This was referenced Jan 13, 2023

truncation comment PSeitz/tantivy#30

Closed

use stats PSeitz/tantivy#31

Closed

Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023

Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances lazy. (quickwit-oss#1756) (89ba851)

PSeitz mentioned this pull request Jan 31, 2023

update lz4 flex PSeitz/tantivy#33

Closed


