Bf udf #18

kortemik · 2023-10-20T13:29:10Z

Spark allows running multiple aggregations in the same run by using

      .agg(tokenAggregatorColumn0, tokenAggregatorColumn1, tokenAggregatorColumn2)

This PR therefore simplifies the design so that:

Tokenizer is now its own separate UDF; run it before applying the BloomFilterAggregator
BloomFilterAggregator no longer has overhead from (de-)serializing its internal byte[] between rows

Each Bloom filter ends up in its own dedicated column. Use BloomFilter.readFrom(bais) and .expectedFpp to check whether a filter is healthy and should be used. Health can be checked by comparing the filter's expected FPP (false positive probability) to the FPP it was configured with.
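The health check can be sketched in plain Java as below. This is a minimal illustration, not code from this PR: the `FilterHealth` class, its method names, and the 0.01 target are all hypothetical. It assumes the common estimate of a Bloom filter's false positive probability, (set bits / total bits) raised to the number of hash functions, and it assumes a filter counts as healthy while that estimate stays at or below the configured FPP. With a filter read back via BloomFilter.readFrom(bais), the estimate would come from the filter itself rather than from `estimateFpp`.

```java
// Hypothetical helper for illustration only; not part of this PR.
final class FilterHealth {
    private FilterHealth() {}

    /** Estimates FPP as (fraction of set bits) ^ (number of hash functions). */
    static double estimateFpp(long setBits, long bitSize, int numHashFunctions) {
        if (bitSize <= 0 || setBits < 0 || setBits > bitSize || numHashFunctions <= 0) {
            throw new IllegalArgumentException("invalid filter dimensions");
        }
        return Math.pow((double) setBits / bitSize, numHashFunctions);
    }

    /** Assumption: a filter is healthy while its estimate is at or below the target. */
    static boolean isHealthy(double expectedFpp, double targetFpp) {
        return expectedFpp <= targetFpp;
    }

    public static void main(String[] args) {
        double targetFpp = 0.01; // assumed configured FPP
        double sparse = estimateFpp(100, 1000, 3);    // lightly filled filter
        double saturated = estimateFpp(500, 1000, 3); // half of the bits set
        System.out.println("sparse healthy: " + isHealthy(sparse, targetFpp));
        System.out.println("saturated healthy: " + isHealthy(saturated, targetFpp));
    }
}
```

A filter that fails this check has absorbed more tokens than it was sized for, so lookups against it would produce more false positives than the configured FPP allows.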

kortemik and others added 5 commits October 20, 2023 13:34

wip

a19252e

Merge branch 'teragrep:main' into bf-udf

25da2d3

TokenizerUDF created, BloomFilterAggregator optimized for it

a74ce98

make BloomFilterBufferTest to test, although it is still broken

a0e86cf

TokenizerTest

bfc3101

kortemik requested review from StrongestNumber9 and elliVM October 20, 2023 13:29

elliVM approved these changes Oct 23, 2023

StrongestNumber9 approved these changes Oct 23, 2023

StrongestNumber9 merged commit f8bb6cd into teragrep:main Oct 23, 2023
