Bf udf #18

Merged
merged 5 commits into from
Oct 23, 2023

kortemik
Member

Spark allows running multiple aggregations in the same run by using

      .agg(tokenAggregatorColumn0, tokenAggregatorColumn1, tokenAggregatorColumn2)

therefore simplifying this so that:

  • Tokenizer is now its own separate UDF; run it before applying the BloomFilterAggregator
  • BloomFilterAggregator no longer has overhead from its internal byte[] (de-)serialization between rows

One gets the separate bloom filters in their dedicated columns. Use BloomFilter.readFrom(bais) and .expectedFpp to check whether a filter is healthy and should be used. Health can be checked by comparing .expectedFpp to the configured fpp (false positive probability).
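
The health check described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not code from this pull request: the class `BloomFilterHealth` and its methods are hypothetical, and the fill-ratio formula is an assumption about how an expected fpp is commonly estimated. In real use the estimate would come from calling .expectedFpp on the filter obtained with BloomFilter.readFrom(bais), and only the comparison step would be needed.

```java
// Hypothetical helper for illustration only; not part of the pull request.
public final class BloomFilterHealth {
    private BloomFilterHealth() {}

    /**
     * Estimates the false positive probability from the filter's fill ratio:
     * (set bits / total bits) ^ number of hash functions.
     * Assumption: a real filter would report this value itself via .expectedFpp.
     */
    public static double estimatedFpp(long setBits, long totalBits, int numHashFunctions) {
        if (totalBits <= 0 || setBits < 0 || setBits > totalBits || numHashFunctions <= 0) {
            throw new IllegalArgumentException("invalid bloom filter dimensions");
        }
        double fillRatio = (double) setBits / (double) totalBits;
        return Math.pow(fillRatio, numHashFunctions);
    }

    /** A filter counts as healthy while its estimated fpp stays at or below the configured fpp. */
    public static boolean isHealthy(double estimatedFpp, double configuredFpp) {
        return estimatedFpp <= configuredFpp;
    }

    public static void main(String[] args) {
        double configuredFpp = 0.01; // the fpp the filter was sized for (example value)

        // 10% of the bits set with 7 hash functions: far below the configured fpp.
        double lightlyFilled = estimatedFpp(100, 1000, 7);
        // 90% of the bits set: the filter is saturated and should not be used.
        double overfilled = estimatedFpp(900, 1000, 7);

        System.out.println("lightly filled healthy: " + isHealthy(lightlyFilled, configuredFpp));
        System.out.println("overfilled healthy: " + isHealthy(overfilled, configuredFpp));
    }
}
```

A filter that fails this check has absorbed more tokens than it was sized for, so lookups against it would return too many false positives to be worth using.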

StrongestNumber9 merged commit f8bb6cd into teragrep:main Oct 23, 2023