Term accelerated searches using bloomfilter #30
Comments
Added a pattern table to bloomdb that can be used to select a specific filter. Each pattern is stored as its own bloom filter byte array; when a search term is included in the saved patterns (checked with the bloommatch UDF), bloom search is activated using the filter of that pattern. For simplicity, each filter is assigned a single pattern when it is created, and the search term is matched against that pattern. Later this can be changed to multiple patterns per filter and vice versa. |
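A minimal Python sketch of the selection step described above, for illustration only. The filter size, hash scheme, and the names `make_filter`, `bloom_match` and `select_filter` are assumptions made for this example, and `bloom_match` assumes the UDF performs a bitwise containment test; none of this is taken from the project code.

```python
import hashlib

SIZE_BITS = 1024   # assumed filter size, for illustration
NUM_HASHES = 3     # assumed number of hash functions


def make_filter(tokens):
    """Build a bloom filter byte array containing the given tokens."""
    bits = bytearray(SIZE_BITS // 8)
    for token in tokens:
        for i in range(NUM_HASHES):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            pos = int.from_bytes(digest[:8], "big") % SIZE_BITS
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


def bloom_match(needle, haystack):
    """Assumed semantics: every bit set in needle is also set in haystack."""
    return len(needle) == len(haystack) and all(
        (n & h) == n for n, h in zip(needle, haystack)
    )


def select_filter(search_term, pattern_rows):
    """Return the filter of the first pattern row that may contain the term.

    pattern_rows is a list of (pattern_bits, data_filter) pairs, one pattern
    per filter as in the comment above. Returns None when nothing matches.
    """
    term_bits = make_filter([search_term])
    for pattern_bits, data_filter in pattern_rows:
        if bloom_match(term_bits, pattern_bits):
            return data_filter
    return None
```

With `rows = [(make_filter(["error", "timeout"]), "filter_for_errors")]`, calling `select_filter("error", rows)` returns the filter stored with that pattern. As with any bloom filter, a term that was never saved can still match by chance.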
Changing this to support multiple patterns per filter. |
Implemented a schema with a pattern table and a junction table between patterns and filters. The condition walker selects only the filters whose pattern matches the search term and runs the bloommatch UDF against temp table filters generated from the filter types and the search term. Next:
|
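The many-to-many schema from the previous comment could look roughly like the sketch below, using an in-memory SQLite database. The table names (`search_pattern`, `bloom_filter`, `pattern_to_filter`), the sample rows, and the `pattern_matches` helper function are hypothetical; the real schema and the way patterns are matched may differ.

```python
import re
import sqlite3


def build_demo_db():
    """Create a pattern table, a filter table and a junction table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical SQL helper: does the stored regex match the whole token?
    conn.create_function(
        "pattern_matches", 2,
        lambda pattern, token: int(re.fullmatch(pattern, token) is not None),
    )
    conn.executescript("""
        CREATE TABLE search_pattern (
            id INTEGER PRIMARY KEY,
            pattern TEXT NOT NULL UNIQUE
        );
        CREATE TABLE bloom_filter (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE pattern_to_filter (
            pattern_id INTEGER NOT NULL REFERENCES search_pattern(id),
            filter_id INTEGER NOT NULL REFERENCES bloom_filter(id),
            PRIMARY KEY (pattern_id, filter_id)
        );
        INSERT INTO search_pattern VALUES
            (1, '[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+'),
            (2, '[a-f0-9]{8}');
        INSERT INTO bloom_filter VALUES
            (1, 'ip_filter'), (2, 'net_filter'), (3, 'hex_filter');
        INSERT INTO pattern_to_filter VALUES (1, 1), (1, 2), (2, 3);
    """)
    return conn


def filters_for_term(conn, token):
    """Names of all filters linked to a pattern that matches the token."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.name
        FROM bloom_filter f
        JOIN pattern_to_filter pf ON pf.filter_id = f.id
        JOIN search_pattern p ON p.id = pf.pattern_id
        WHERE pattern_matches(p.pattern, ?)
        ORDER BY f.name
        """,
        (token,),
    ).fetchall()
    return [name for (name,) in rows]
```

Here one pattern selects two filters and a token that matches no pattern selects none, so only the selected filters would need a bloommatch check.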
Changes to be made:
|
Testing a version with pattern matching against tokenized search terms. |
New changes to be made
|
Created a new walker that finds all dynamic bloomfilter tables that have a pattern match with the tokenized search term; this will be used to select the tables to join with the main query. (Combined with the Condition Walker.) |
Created classes to hold dynamic tables and temp tables |
Internal PR |
Updates to the filtertype table: the pattern varchar length was increased to 2048 and pattern was added to the unique composite index. |
Fixed issues with filter size selection in the temp tables generated for the bloommatch condition. Limited tokenizers to use only major tokens, to match dpf_03. Filtering is working in QA (pth-07 5.3.0-22-gbd5da88a). The test example without bloom took 16-18 s. |
Fixing an issue where the table pattern-match filtering from metadata was fetching the whole table's data into Java memory; the fetch is now limited to checking only 1 row and only the PK field. |
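The idea behind limiting the fetch can be sketched as an existence check. The example below is in Python with SQLite rather than the project's Java code; the simplified `filtertype` layout (`id`, `pattern`) and the `pattern_matches` helper are assumptions for illustration.

```python
import re
import sqlite3


def build_demo_db():
    """In-memory table with a primary key and a pattern column."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical SQL helper: does the stored regex match the whole token?
    conn.create_function(
        "pattern_matches", 2,
        lambda pattern, token: int(re.fullmatch(pattern, token) is not None),
    )
    conn.execute(
        "CREATE TABLE filtertype (id INTEGER PRIMARY KEY, pattern TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO filtertype (pattern) VALUES (?)",
        [(r"\d+",), (r"[a-z]+",)],
    )
    return conn


def has_pattern_match(conn, token):
    """Existence check: select only the PK column and stop after one row."""
    row = conn.execute(
        "SELECT id FROM filtertype WHERE pattern_matches(pattern, ?) LIMIT 1",
        (token,),
    ).fetchone()
    return row is not None
```

Only a single primary key value ever reaches the client, however many rows match, which is all that is needed to decide whether a table takes part in the join.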
Duplicate rows appear on multiple pattern matches when multiple tables are joined; testing a fix using group by logfile.id. Update: group by is too slow. The false positives may be caused by a null-on-null bloommatch check when a joined pattern match table has no matching logfiles for the index. |
Adding a null check after the bloommatch condition for the bloom filters fixed the duplicate issues and sped up the query with multiple joined tables. |
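One way a NULL filter could slip through a match condition, and how an explicit null check removes it, is sketched below. `naive_bloommatch` is a hypothetical stand-in that treats NULL as an empty byte array; it is not the behaviour of the real bloommatch UDF, and the `logfile` and `pattern_bloom` tables are invented for the example.

```python
import sqlite3


def naive_bloommatch(needle, haystack):
    """Hypothetical UDF stand-in: NULL (None) is treated as empty bytes,
    so a comparison against NULL vacuously succeeds."""
    needle = needle or b""
    haystack = haystack or b""
    return int(all((n & h) == n for n, h in zip(needle, haystack)))


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.create_function("bloommatch", 2, naive_bloommatch)
    conn.executescript("""
        CREATE TABLE logfile (id INTEGER PRIMARY KEY, path TEXT);
        CREATE TABLE pattern_bloom (logfile_id INTEGER, filter BLOB);
        INSERT INTO logfile VALUES (1, 'a.log'), (2, 'b.log');
    """)
    # Only logfile 1 has a filter row; logfile 2 gets NULL from the LEFT JOIN.
    conn.execute("INSERT INTO pattern_bloom VALUES (1, ?)", (bytes([0b0111]),))
    return conn


def matching_logfiles(conn, term_filter, null_check):
    """Ids of logfiles whose joined filter matches the term filter."""
    condition = "bloommatch(?, pb.filter)"
    if null_check:
        condition += " AND pb.filter IS NOT NULL"
    rows = conn.execute(
        f"""
        SELECT l.id
        FROM logfile l
        LEFT JOIN pattern_bloom pb ON pb.logfile_id = l.id
        WHERE {condition}
        ORDER BY l.id
        """,
        (term_filter,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the term filter `bytes([0b0001])`, the query without the null check also returns logfile 2, which has no filter row at all; with the check only logfile 1 remains.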
Fixed bug with multiple search terms, tested and working in QA. |
|
Refactoring: move all tokenization to the PatternMatch class and move all bloommatch condition generation steps to the BloomFilterTempTable class.
|
Will split the refactoring into another PR and implement the changes requested in review. |
Rebased onto main. |
After the logic review meeting, refactoring to make the code clearer to review.
|
Refactoring
New classes:
Other changes:
|
Fixing tokenization of the incoming query search term: it worked with the old bloom filters, which contained every token, but does not work with the new bloom filter tables, which contain only pattern-filtered tokens, since some of the tokens added from the search term are not present in them. |
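The mismatch can be illustrated with a small Python sketch. The tokenizer here is hypothetical, and a plain `set` stands in for the bloom filter so the example stays deterministic; the point is only that the query tokens are reduced to those the table's pattern would have admitted before membership is checked.

```python
import re


def tokenize(term):
    """Hypothetical tokenizer: the whole term plus its alphanumeric pieces."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", term) if p]
    return [term] + [p for p in parts if p != term]


def naive_may_contain(stored, term):
    """Old behaviour: every generated token must be present."""
    return all(token in stored for token in tokenize(term))


def pattern_aware_may_contain(stored, term, pattern):
    """Only tokens matching the table's pattern are required to be present."""
    relevant = [t for t in tokenize(term) if re.fullmatch(pattern, t)]
    return bool(relevant) and all(token in stored for token in relevant)
```

For a table whose pattern is `\d+\.\d+\.\d+\.\d+` and which stores only `10.0.0.1`, the naive check rejects the search term `10.0.0.1` because sub-tokens such as `10` were never stored, while the pattern-aware check accepts it.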
Allow search string pattern to be accelerated without using a global bloomfilter