-
-
Notifications
You must be signed in to change notification settings - Fork 681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances fully lazy. #1756
Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances fully lazy. #1756
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1756 +/- ##
==========================================
- Coverage 94.13% 94.12% -0.02%
==========================================
Files 267 267
Lines 50900 50903 +3
==========================================
- Hits 47917 47910 -7
- Misses 2983 2993 +10
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
src/query/fuzzy_query.rs
Outdated
static BUILDER: Lazy<RwLock<FxHashMap<(u8, bool), LevenshteinAutomatonBuilder>>> = | ||
Lazy::new(Default::default); | ||
|
||
loop { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want a RwLock here? Although this makes the logic lazy, it adds additional overhead to every read we try to create the scorer, and I believe the standard library RwLocks still have some performance issues plaguing them when being accessed concurrently by several threads even when reading only, and this is potentially a very hot path.
Secondly, this removes the cap on the maximum edit distance per term, which although isn't bad in itself, does open the potential for people to pick a number too high, 3 tends to be the realistic maximum you'd use, although it depends on the word length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want a RwLock here? Although this makes the logic lazy, it adds additional overhead to every read we try to create the scorer, and I believe the standard library RwLocks still have some performance issues plaguing them when being accessed concurrently by several threads even when reading only, and this is potentially a very hot path.
Yes, I am somewhat unhappy with the RwLock
too, but some form of synchronization will have to be necessary if laziness is desired.
Alternatives I see are:
- Using
Mutex
which should be faster but would prevent concurrent reads after the relevant builders have been initialized. Alternatively,Arc<LevenshteinAutomatonBuilder>
could be cached and cloned under the mutex so that usage can proceed without it. - But if we do go for the additional indirection of
Arc<LevenshteinAutomatonBuilder>
, we might also useArcSwap
instead ofRwLock
(already a dependency) and always publish a new (larger) hash table when anotherLevenshteinAutomatonBuilder
had to be initialized containing copies of all the previousArc<LevenshteinAutomatonBuilder>
instances. (LevenshteinAutomatonBuilder
itself is not clone, but I guess that could also be changed upstream to avoid the additionalArc
indirection in this case.)
Secondly, this removes the cap on the maximum edit distance per term, which although isn't bad in itself, does open the potential for people to pick a number too high, 3 tends to be the realistic maximum you'd use, although it depends on the word length.
The limitation seems artificial IMHO and related to the eager construction of the cached LevenshteinAutomatonBuilder
instances which is why I removed it after making things lazy.
I have also read about 3 being a realistic limit, but I think this would better be served by a soft limit, e.g. advice given in the documentation of the new
and new_prefix
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened quickwit-oss/levenshtein-automata#13 to enable cloning of the builders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have a hashmap (or arrays) with once_cell as value? That would remove the RwLock , the loop, and the possible expensive multiple levenshtein buider building upon race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have a hashmap (or arrays) with once_cell as value?
We could lazily initialize only the entries and eagerly construct the data structure itself, but if we keep the static limit on the distance, then the current approach - lazily constructing all builders at once - seems to most reasonable to me and I would just remove the TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed an implementation of that approach, i.e. using [[OnceCell<LevenshteinAutomatonBuilder>; 2]; 3]
.
.get(self.transposition_cost_one as usize) | ||
.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.get(self.transposition_cost_one as usize) | |
.unwrap() | |
[self.transposition_cost_one as usize] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would not that be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would certainly be the same, but it formats very unfavorably:
let automaton_builder = AUTOMATON_BUILDER
.get(self.distance as usize)
.ok_or_else(|| {
InvalidArgument(format!(
"Levenshtein distance of {} is not allowed. Choose a value less than {}",
self.distance,
AUTOMATON_BUILDER.len()
))
})?[self.transposition_cost_one as usize]
.get_or_init(|| {
LevenshteinAutomatonBuilder::new(self.distance, self.transposition_cost_one)
});
which is why I went for this somewhat verbose way of doing it. But I am not really invested in this.
Which one do you prefer after seeing the formatting, [..]
or get(..).unwrap()
?
No description provided.