Term accelerated searches using bloomfilter #30

Closed
elliVM opened this issue Mar 12, 2024 · 23 comments


elliVM commented Mar 12, 2024

Allow search string pattern to be accelerated without using a global bloomfilter

elliVM self-assigned this Mar 12, 2024

elliVM commented Mar 18, 2024

Added a pattern table to bloomdb that can be used to select a specific filter. Each pattern is stored as its own bloom filter byte array. When a search term is included in the saved patterns (checked with the bloommatch UDF), it activates bloom search using the filter of that pattern.

For simplicity, when a filter is created it is assigned a single pattern that the search term is matched against. Later this can be changed to multiple patterns per filter and vice versa.
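
A minimal Python sketch of the idea, not the project's code: a bloom filter is built from tokens, and a bloommatch-style check tests whether every bit set for the search term is also set in the stored filter. The filter size, hash count, hashing scheme, and the function names `bloom_bits` and `bloommatch` are assumptions made for illustration; the real bloommatch UDF may differ.

```python
import hashlib

SIZE_BITS = 256   # assumed filter size in bits
NUM_HASHES = 3    # assumed number of hash functions


def bloom_bits(tokens):
    """Build a bloom filter (as an int bitmask) containing the given tokens."""
    bits = 0
    for token in tokens:
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % SIZE_BITS)
    return bits


def bloommatch(term_bits, filter_bits):
    """True when every bit set for the term is also set in the filter."""
    return (term_bits & filter_bits) == term_bits
```

A term that was added to the filter always matches. A term that was not added usually does not match, but can match by chance, which is the usual bloom filter false positive.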


elliVM commented Mar 20, 2024

Changing to support multiple patterns per filter


elliVM commented Mar 25, 2024

Implemented a schema with a pattern table and a junction table between patterns and filters. The condition walker selects only the filters whose pattern matches the search term, and runs the bloommatch UDF against temp table filters generated from the filter types and the search term.

Next:

  • Add testing
  • Disable bloom if no filters found
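
A rough sketch of such a many-to-many schema, using SQLite from Python purely for illustration. All table and column names here (`pattern`, `filter`, `pattern_filter`) are hypothetical and are not taken from bloomdb.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pattern (id INTEGER PRIMARY KEY, pattern TEXT NOT NULL UNIQUE);
    CREATE TABLE filter  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- junction table: a pattern can belong to many filters and vice versa
    CREATE TABLE pattern_filter (
        pattern_id INTEGER NOT NULL REFERENCES pattern(id),
        filter_id  INTEGER NOT NULL REFERENCES filter(id),
        PRIMARY KEY (pattern_id, filter_id)
    );
    INSERT INTO pattern VALUES (1, 'uuid'), (2, 'ipv4');
    INSERT INTO filter  VALUES (10, 'small'), (20, 'large');
    INSERT INTO pattern_filter VALUES (1, 10), (1, 20), (2, 20);
""")


def filters_for_pattern(pattern_name):
    """Return the ids of all filters linked to the named pattern."""
    rows = conn.execute(
        """SELECT f.id FROM filter f
           JOIN pattern_filter pf ON pf.filter_id = f.id
           JOIN pattern p ON p.id = pf.pattern_id
           WHERE p.pattern = ? ORDER BY f.id""",
        (pattern_name,),
    ).fetchall()
    return [row[0] for row in rows]
```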


elliVM commented Mar 27, 2024

Changes to be made:

  • Only one pattern per filter is needed, so remove the junction table and move the pattern to filtertype.
  • Use regex for matching instead of the UDF; start without tokenization.
  • Update the schema: move the pattern to the filtertype table as a datatype that can be used with regex.
  • Later, tokenize the search term before matching.
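
The regex step without tokenization could look roughly like the sketch below. The rows in `FILTER_TYPES` and the two patterns are invented examples, and the sketch requires the pattern to match the whole search term, which is an assumption; the actual SQL regex semantics may be a partial match instead.

```python
import re

# Hypothetical rows of the filtertype table: (id, pattern)
FILTER_TYPES = [
    (1, r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),
    (2, r"\d{1,3}(\.\d{1,3}){3}"),
]


def matching_filter_types(search_term):
    """Ids of filter types whose regex pattern matches the whole search term."""
    return [
        filter_type_id
        for filter_type_id, pattern in FILTER_TYPES
        if re.fullmatch(pattern, search_term)
    ]
```

If the returned list is empty, no filter type applies and the search would fall back to running without bloom.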


elliVM commented Apr 4, 2024

Testing version with pattern matching against tokenized search terms


elliVM commented Apr 12, 2024

New changes to be made

  • Create a database table for each stored regex pattern with a bloom filter
  • Join all filter tables that have a regex pattern match with incoming archive search term
  • Run bloommatch UDF for each logfile and select those that match any of the joined filters
  • To run bloommatch, create a temp table for each bloom filter table that has a pattern match
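
The four steps above can be sketched in Python as follows. This is an illustration under assumptions, not the implementation: `PATTERN_TABLES` stands in for the per-pattern database tables, the `term_filter` argument stands in for the temp table holding the search term's filter, and `bloommatch` is a simplified stand-in for the UDF operating on byte arrays.

```python
import re


def bloommatch(term_filter: bytes, stored_filter: bytes) -> bool:
    """Stand-in for the UDF: all term bits must be set in the stored filter."""
    return len(term_filter) == len(stored_filter) and all(
        (t & s) == t for t, s in zip(term_filter, stored_filter)
    )


# Hypothetical per-pattern tables: regex pattern -> {logfile id: filter bytes}
PATTERN_TABLES = {
    r"[0-9a-f-]{36}": {1: b"\x0f", 2: b"\xf0"},
    r"\d+\.\d+\.\d+\.\d+": {2: b"\x03", 3: b"\x0c"},
}


def matching_logfiles(search_term, term_filter):
    """Logfile ids that match in any table whose pattern matches the term."""
    logfiles = set()
    for pattern, table in PATTERN_TABLES.items():
        if re.fullmatch(pattern, search_term):  # table joins only on a pattern match
            logfiles |= {
                logfile_id
                for logfile_id, stored in table.items()
                if bloommatch(term_filter, stored)
            }
    return sorted(logfiles)
```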


elliVM commented Apr 22, 2024

Created a new walker that finds all dynamic bloomfilter tables that have a pattern match with the tokenized search term. It will be used to select the tables to join with the main query (combined with the Condition Walker).


elliVM commented Apr 29, 2024

Created classes to hold dynamic tables and temp tables


elliVM commented May 7, 2024

Internal PR


elliVM commented Jun 3, 2024

Updates to the filtertype table: the pattern varchar length was increased to 2048, and pattern was added to the unique composite index.
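
As an illustration of what such a composite index enforces, here is a sketch using SQLite from Python. The column names `expectedElements` and `targetFpp`, the index name, and the helper `add_filter_type` are assumptions; only the `pattern` column and its length come from the comment above. SQLite does not enforce the declared varchar length, so the sketch only demonstrates the uniqueness rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE filtertype (
        id               INTEGER PRIMARY KEY,
        expectedElements INTEGER NOT NULL,       -- assumed column
        targetFpp        REAL    NOT NULL,       -- assumed column
        pattern          VARCHAR(2048) NOT NULL
    );
    CREATE UNIQUE INDEX uix_filtertype
        ON filtertype (expectedElements, targetFpp, pattern);
""")


def add_filter_type(expected_elements, target_fpp, pattern):
    """Insert a filter type; return False if the same combination already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO filtertype (expectedElements, targetFpp, pattern) "
                "VALUES (?, ?, ?)",
                (expected_elements, target_fpp, pattern),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With pattern in the index, two filter types can share the same size parameters as long as their patterns differ.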


elliVM commented Jul 5, 2024

Fixed issues with filter size selection in the temp tables generated for the bloommatch condition. Limited the tokenizers to use only major tokens, to match dpf_03.

Filtering works in QA (pth-07 5.3.0-22-gbd5da88a).

Test example
index=alert_examples earliest=-999d "c3468f80-4273-4867-9b66-3f470787c365"

without bloom took 16-18s
with bloom 3-6s


elliVM commented Jul 8, 2024

Fixing an issue where pattern match filtering of tables from metadata fetched the whole table into Java memory. The fetch is now limited to 1 row and only the PK field.
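
The shape of such an existence check, sketched with SQLite from Python. The table `uuid_filters`, its columns, and the `regexp` helper are hypothetical; SQLite needs a user-defined `regexp` function for its `REGEXP` operator, which is why one is registered here.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" as regexp(Y, X), so the pattern is the first argument.
conn.create_function(
    "regexp", 2, lambda pattern, value: int(re.search(pattern, value) is not None)
)
conn.executescript("""
    CREATE TABLE filtertype (id INTEGER PRIMARY KEY, pattern TEXT NOT NULL);
    CREATE TABLE uuid_filters (
        id INTEGER PRIMARY KEY, filter_type_id INTEGER, filter BLOB
    );
    INSERT INTO filtertype VALUES (1, '^[0-9a-f-]{36}$');
""")
conn.executemany(
    "INSERT INTO uuid_filters (filter_type_id, filter) VALUES (1, ?)",
    [(bytes(1024),) for _ in range(100)],
)


def table_matches(search_term):
    """Check for a pattern match by reading one primary key, never the filter bytes."""
    row = conn.execute(
        """SELECT t.id FROM uuid_filters t
           JOIN filtertype ft ON ft.id = t.filter_type_id
           WHERE ? REGEXP ft.pattern
           LIMIT 1""",
        (search_term,),
    ).fetchone()
    return row is not None
```

Selecting only the key column with `LIMIT 1` keeps the result to a single small value regardless of how many filters the table holds.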


elliVM commented Jul 8, 2024

Multiple pattern matches produce duplicate rows when multiple tables are joined. Testing a fix using GROUP BY logfile.id.

Update: GROUP BY is too slow. The false positives may be caused by a null-on-null bloommatch check when a joined pattern match table has no matching logfiles for the index.


elliVM commented Jul 15, 2024

Adding a null check after the bloommatch condition for bloom filters fixed the duplicate issues and sped up queries with multiple joined tables.
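
A sketch of how a null guard changes the result of a LEFT JOIN, using SQLite from Python. It assumes, as the earlier comment suspects, that a bloommatch call with NULL on both sides counts as a match; the stand-in `bloommatch` function is written to behave that way on purpose. The tables `logfile`, `pattern_a`, and `term_a` and their columns are hypothetical.

```python
import sqlite3


def bloommatch(term, stored):
    # Assumption mirroring the suspected problem: NULL against NULL counts as a match.
    if term is None and stored is None:
        return 1
    if term is None or stored is None:
        return 0
    return int(
        len(term) == len(stored)
        and all((t & s) == t for t, s in zip(term, stored))
    )


conn = sqlite3.connect(":memory:")
conn.create_function("bloommatch", 2, bloommatch)
conn.executescript("""
    CREATE TABLE logfile (id INTEGER PRIMARY KEY);
    CREATE TABLE pattern_a (logfile_id INTEGER, filter_type_id INTEGER, filter BLOB);
    CREATE TABLE term_a (filter_type_id INTEGER, filter BLOB);
    INSERT INTO logfile VALUES (1), (2), (3);
    INSERT INTO pattern_a VALUES (1, 1, X'0F'), (2, 1, X'F0');
    INSERT INTO term_a VALUES (1, X'05');
""")

BASE_QUERY = """
    SELECT l.id FROM logfile l
    LEFT JOIN pattern_a a ON a.logfile_id = l.id
    LEFT JOIN term_a t ON t.filter_type_id = a.filter_type_id
    WHERE bloommatch(t.filter, a.filter) {guard}
    ORDER BY l.id
"""


def matching_logfiles(null_check):
    """Run the join with or without the null guard on the stored filter."""
    guard = "AND a.filter IS NOT NULL" if null_check else ""
    return [row[0] for row in conn.execute(BASE_QUERY.format(guard=guard))]
```

In this sketch logfile 3 has no filter row, so both sides of the check are NULL. Without the guard it is returned as a false positive; with the guard only logfile 1 remains.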


elliVM commented Jul 17, 2024

Fixed bug with multiple search terms, tested and working in QA.


elliVM commented Jul 22, 2024

  • Updated old tests to company standards


elliVM commented Jul 24, 2024

Refactoring: move all tokenization to the PatternMatch class and all bloommatch condition generation steps to the BloomFilterTempTable class.


elliVM commented Aug 5, 2024

  • QA testing showed good results with a small number of matches. The performance gain fell gradually as the number of matches increased, but all queries were still faster with bloom enabled.
  • Both a single large table and multiple tables performed well.

elliVM linked a pull request Aug 12, 2024 that will close this issue
ronja-ui added the review label Aug 21, 2024
q22u removed the review label Aug 22, 2024
ronja-ui added and removed the review label Aug 26, 2024

elliVM commented Aug 26, 2024

Will split the refactoring into a separate PR and implement the changes requested in review.


elliVM commented Sep 12, 2024

rebased to main

elliVM added the review label Sep 13, 2024

elliVM commented Sep 16, 2024

After the logic review meeting, refactoring to make the code clearer to review:

  • Check single responsibility of classes
  • Split classes into smaller pieces where possible
  • Reduce coupling with use of interfaces
  • Better naming of classes and methods

elliVM removed the review label Sep 16, 2024

elliVM commented Sep 18, 2024

Refactoring

New classes:

| Class | Responsibility |
| --- | --- |
| PatternMatchTables | Finds bloomdb tables that match a pattern condition |
| CategoryTableImpl | Temp table from a bloom filter table that can return a CategoryTableCondition |
| Created | CategoryTable that is created to the database |
| WithFilterTypes | CategoryTable with its filters inserted |
| TableFilters | Inserts filters of a CategoryTable |
| TableFilterTypesFromMetadata | Fetches the different filter types of a table from metadata |
| CategoryTableCondition | Condition that compares the category table's filter bytes against the bloom filter table's filter bytes with the bloommatch UDF; selects the same size and bloom term id |
| PatternMatchCondition | Condition that checks if any of the given tokens match bloomdb.filtertype.pattern |

Other changes:

  • Many method naming changes
  • Added interfaces CategoryTable, TableRecords, BloomQueryCondition
  • Added missing withoutFilters option and implementation to IndexStatementCondition
  • Equality methods for all new classes
  • Tests for all new classes


elliVM commented Sep 23, 2024

Fixing tokenization of the incoming search term. Tokenization worked with the old bloom filters, which contained every token, but it does not work with the new bloom filter tables, which contain only pattern-filtered tokens, because some of the tokens it adds are not present in the filter.
The search term will not be tokenized for now. Multiple values can still be searched, but they have to be split into separate search terms, and each search term has to regex match a pattern.
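
A small sketch of why tokenizing the search term breaks against pattern-filtered filters. A Python set stands in for the bloom filter so the behaviour is deterministic, and `tokenize` is a hypothetical tokenizer (whole words plus their parts split on `-`), not the project's tokenizer.

```python
import re

UUID_PATTERN = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


def tokenize(text):
    """Hypothetical tokenizer: each word plus its parts split on '-'."""
    tokens = set()
    for word in text.split():
        tokens.add(word)
        tokens.update(part for part in word.split("-") if part)
    return tokens


def build_pattern_filter(log_line, pattern):
    """Keep only tokens that fully match the pattern (set stands in for the filter)."""
    return {token for token in tokenize(log_line) if re.fullmatch(pattern, token)}


def might_contain(filter_tokens, required_tokens):
    """True when every required token is present in the filter."""
    return set(required_tokens) <= filter_tokens
```

A filter built from a log line that contains a UUID holds only the full UUID. Checking the tokenized search term also requires sub-tokens such as `c3468f80`, which were never stored, so the logfile would be wrongly skipped. Checking the untokenized term succeeds.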

elliVM added and removed the review label Sep 30, 2024
elliVM closed this as completed Oct 7, 2024