Optimize single-column FilterFunctions to only run on distinct values for DictionaryBlocks #13289

oerling · 2019-08-27T00:59:46Z

Note that the DictionaryBlocks are produced by scan only for string columns. This therefore depends on the introduction of a selective slice dictionary reader.

mbasmanova

@oerling This would be a nice optimization. Thanks for contributing. I have some comments.

mbasmanova · 2019-08-27T11:25:42Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+    private Block previousDictionary;
+    private Page dictionaryPage;
+
+    private static final byte FILTER_NOT_EVALUATED = 0;


static member variables go to the beginning of the class before instance member variables

mbasmanova · 2019-08-27T11:25:52Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+    // as the dictionary inside the block is physically the same.
+    private byte[] dictionaryResults;
+    private Block previousDictionary;
+    private Page dictionaryPage;


The dictionaryPage variable is not used.

mbasmanova · 2019-08-27T11:30:08Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+    // If the function has a single argument and this is a DictionaryBlock, we can cache results. The cache is valid as long
+    // as the dictionary inside the block is physically the same.
+    private byte[] dictionaryResults;
+    private Block previousDictionary;


Holding on to dictionary block will work with the current implementation of the readers, but strictly speaking this is not safe. SelectiveStreamReader.getBlockView API returns a temporary view into the data and the caller is not supposed to use this data after next call to SelectiveStreamReader.read. The data is both the block and the dictionary block it is referencing. Hence, it is not safe to hold on to a dictionary from the previous batch. How important is it to re-use filter result across batches? If important, let's think about ways to make it less fragile.
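The concern here is about the cache key: cached results are valid only while the dictionary reference stays the same. A minimal sketch of that invalidation rule follows. It assumes a hypothetical `DictionaryResultCache` class and uses a plain `Object` in place of Presto's `Block`; it is not the PR's code.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Presto: caches one byte per dictionary entry
// and resets the cache whenever a different dictionary object shows up.
final class DictionaryResultCache
{
    static final byte NOT_EVALUATED = 0;

    private Object previousDictionary;
    private byte[] results = new byte[0];

    // Returns the result array for this dictionary, cleared if the dictionary changed.
    byte[] resultsFor(Object dictionary, int entryCount)
    {
        // Reference comparison: equal content in a different object is treated as a new dictionary.
        if (dictionary != previousDictionary) {
            previousDictionary = dictionary;
            if (results.length < entryCount) {
                // A fresh array is zero-filled, which already means NOT_EVALUATED.
                results = new byte[entryCount];
            }
            else {
                Arrays.fill(results, 0, entryCount, NOT_EVALUATED);
            }
        }
        return results;
    }
}
```

The reference check cannot detect a producer that refills the same object with new content, which is the fragility raised in this comment.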

mbasmanova · 2019-08-27T11:34:29Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+            previousDictionary = dictionary;
+            int numEntries = dictionary.getPositionCount();
+            dictionaryPage = new Page(numEntries, dictionary);
+            if (dictionaryResults == null || dictionaryResults.length < numEntries) {


It is preferable to use Arrays.ensureCapacity method for allocating arrays.

    dictionaryResults = ensureCapacity(dictionaryResults, dictionary.getPositionCount());
    fill(dictionaryResults, FILTER_NOT_EVALUATED);

If it is important to avoid extra fill for newly allocated array, consider adding another ensureCapacity method that takes a value used to initialize new array or re-set existing one.

    dictionaryResults = ensureCapacity(dictionaryResults, dictionary.getPositionCount(), FILTER_NOT_EVALUATED);

    public static byte[] ensureCapacity(byte[] buffer, int capacity, byte initialValue)
    {
        if (buffer == null || buffer.length < capacity) {
            byte[] newBuffer = new byte[capacity];
            if (initialValue != 0) {
                fill(newBuffer, initialValue);
            }
            return newBuffer;
        }
        fill(buffer, initialValue);
        return buffer;
    }

mbasmanova · 2019-08-27T11:35:40Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+                continue;
+            }
+            if (result == FILTER_PASSED) {
+                positions[outputCount++] = position;


you need to carry over any errors that occurred while evaluating earlier filters:

    positions[outputCount] = position;
    errors[outputCount] = errors[i];
    outputCount++;

same below

mbasmanova · 2019-08-27T11:42:27Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+            int position = positions[i];
+            int dictionaryPosition = block.getId(position);
+            byte result = dictionaryResults[dictionaryPosition];
+            if (result == FILTER_FAILED) {


Is there any particular reason not to use switch?

    switch (result) {
        case FILTER_FAILED:
            break;
        case FILTER_PASSED:
            positions[outputCount] = position;
            errors[outputCount] = errors[i];
            outputCount++;
            break;
        case FILTER_NOT_EVALUATED:
            try {
                if (predicate.evaluate(session, page, position)) {
                    positions[outputCount] = position;
                    errors[outputCount] = errors[i];
                    outputCount++;
                    dictionaryResults[dictionaryPosition] = FILTER_PASSED;
                }
                else {
                    dictionaryResults[dictionaryPosition] = FILTER_FAILED;
                }
            }
            catch (RuntimeException e) {
                // We do not record errors in the dictionary results.
                positions[outputCount] = position;
                errors[outputCount] = e; // keep last error
                outputCount++;
            }
            break;
        default:
            verify(false, "Unexpected filter result: " + result);
    }
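For readers who want to run the loop outside Presto, here is a self-contained sketch of the same idea. It is a simplified stand-in, not the PR's implementation: a `long[]` plays the role of the dictionary, an `int[]` of ids plays the role of the DictionaryBlock, and the class name `DictionaryFilterSketch` is hypothetical.

```java
import java.util.function.LongPredicate;

// Simplified stand-in for filterWithDictionary; all names here are illustrative.
final class DictionaryFilterSketch
{
    private static final byte FILTER_NOT_EVALUATED = 0;
    private static final byte FILTER_PASSED = 1;
    private static final byte FILTER_FAILED = 2;

    private final LongPredicate predicate;
    private long[] previousDictionary;
    private byte[] dictionaryResults = new byte[0];

    DictionaryFilterSketch(LongPredicate predicate)
    {
        this.predicate = predicate;
    }

    // Compacts positions and errors in place to the rows that pass (or throw); returns how many remain.
    int filter(long[] dictionary, int[] ids, int[] positions, int positionCount, RuntimeException[] errors)
    {
        if (dictionary != previousDictionary) {
            previousDictionary = dictionary;
            dictionaryResults = new byte[dictionary.length]; // zero-filled, i.e. FILTER_NOT_EVALUATED
        }
        int outputCount = 0;
        for (int i = 0; i < positionCount; i++) {
            int position = positions[i];
            int dictionaryPosition = ids[position];
            switch (dictionaryResults[dictionaryPosition]) {
                case FILTER_FAILED:
                    break;
                case FILTER_PASSED:
                    positions[outputCount] = position;
                    errors[outputCount] = errors[i]; // carry over an error from an earlier filter
                    outputCount++;
                    break;
                case FILTER_NOT_EVALUATED:
                    try {
                        if (predicate.test(dictionary[dictionaryPosition])) {
                            positions[outputCount] = position;
                            errors[outputCount] = errors[i];
                            outputCount++;
                            dictionaryResults[dictionaryPosition] = FILTER_PASSED;
                        }
                        else {
                            dictionaryResults[dictionaryPosition] = FILTER_FAILED;
                        }
                    }
                    catch (RuntimeException e) {
                        // Errors are not cached; the row is kept with the latest error.
                        positions[outputCount] = position;
                        errors[outputCount] = e;
                        outputCount++;
                    }
                    break;
                default:
                    throw new IllegalStateException("Unexpected filter result");
            }
        }
        return outputCount;
    }
}
```

Writes to `positions` and `errors` only touch indices at or below the current read index `i`, so compacting in place does not clobber entries that are still to be read.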

mbasmanova · 2019-08-27T11:43:20Z

presto-hive/src/test/java/com/facebook/presto/hive/TestHivePushdownFilterQueries.java

@@ -334,6 +334,9 @@ public void testFilterFunctions()
        // filter function on numeric and boolean columns
        assertFilterProject("if(is_returned, linenumber, orderkey) % 5 = 0", "linenumber");

+        // Filter functions on dictionary encoded columns
+        assertQuery("SELECT orderkey, linenumber, shipmode, shipinstruct FROM lineitem WHERE shipmode LIKE '%R%' and shipinstruct LIKE '%CO%'");


This test fails because there is no varchar reader yet. FilterFunction can be unit tested though. This might be a better way to test it anyway.

oerling · 2019-08-28T23:14:32Z

I addressed the comments and added a unit test.

Concerning sharing filter results between batches: the lifetime of the dictionary is at least a row group and more often a stripe, which means that a reasonable run estimate is between 10K and 1M values. With a batch of at most 1K, we have 10 to 1K reuses for a first filter. The dictionary cardinalities that we see are somewhere in the hundreds, based on just point observations. So reuse across batches is actually important.

Concerning memory ownership: the precedent in DictionaryAwarePageFilter is that the base block of consecutive DictionaryBlocks is compared with ==. The base block of a DictionaryBlock in readers, Aria or other, is a VariableWidthBlock over a byte[] that is made at the start of the Stripe or RowGroup, depending on the scope. If the VariableWidthBlock stays the same, then the byte[] stays the same, unless the underlying byte[] is reused in another Stripe/RowGroup. Doing this would break the memory assumptions of the pipeline. Specifically, aggregation, for example, expects to keep references to incoming blocks forever, and hence the base byte[] of a string dictionary is very definitely RAII (resource acquisition is initialization).

On general principles, I think it is bad practice to keep data indefinitely live by holding a reference to the container for the sake of a single element. This is a real problem for big group-bys with tons of cross-region references, where GC goes to 20+% of CPU just tracking these. This should be fixed, but not now.

Therefore, until there is a memory ownership model where producers own their memory, which I think is how it should be, the assumption that the dictionary byte[] is immutable is solid. It would be entirely appropriate to document that the base of a DictionaryBlock, also in the case of a view, is immutable. If you have memory reuse throughout, then you'd give the dictionary base a serial number.

There is also a dictionarySourceId in DictionaryBlock which could be used to indicate that two DictionaryBlocks have the same dictionary. This is even serializable. It is not used for this purpose in filtering, but as far as I can tell we could use it for this without breaking anything.
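The reuse estimate in this reply is a division: values that share one dictionary, divided by the batch size. A small sketch using the figures quoted above (10K to 1M values per dictionary, batches of at most 1K); the class and method names are hypothetical.

```java
final class ReuseEstimate
{
    private ReuseEstimate() {}

    // Number of batches that share one dictionary, rounding up for a partial last batch.
    static long batchesPerDictionary(long valuesPerDictionary, int batchSize)
    {
        return (valuesPerDictionary + batchSize - 1) / batchSize;
    }
}
```

With a row group of 10,000 values this gives 10 batches, and with a stripe of 1,000,000 values it gives 1,000, matching the range stated in the reply.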

oerling · 2019-08-28T23:17:47Z

This is not very urgent since first we’ll want Slice readers to be in. The test in the query tests will not work until then.

mbasmanova

@oerling Orri, thanks for adding the unit test. Let's remove the failing integration test, address the comments and merge this. The integration test will be added in @bhhari's PR that introduces the varchar reader.

mbasmanova · 2019-09-04T14:25:01Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

@@ -53,6 +68,10 @@ public int filter(Page page, int[] positions, int positionCount, RuntimeExceptio
        checkArgument(positionCount <= positions.length);
        checkArgument(positionCount <= errors.length);

+        if (inputChannels.length == 1 && page.getBlock(0) instanceof DictionaryBlock && deterministic) {


nit: I'd put the cheapest check first, e.g. if (deterministic && ...)

mbasmanova · 2019-09-04T14:29:27Z

presto-orc/src/main/java/com/facebook/presto/orc/FilterFunction.java

+                        }
+                    }
+                    catch (RuntimeException e) {
+                        // We do not record errors in the dictionary results.


Just to confirm my understanding: we'll keep re-evaluating the filter if it throws an error, right?

mbasmanova · 2019-09-04T14:29:55Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+    private static final long UNLUCKY = 13;
+
+    @Test
+    public void TestFilter()


method names start with a lower case: TestFilter -> testFilter

mbasmanova · 2019-09-04T14:30:45Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+        checkFilter(filter, otherDictionary, otherDictionaryPositions, otherDictionaryPositions.length);
+
+        // Repeat test on a dictionary with different content to make sure that cached results are not reused.
+        Block numbers2 = makeNumbers(1, 1001);


numbers2 variable is not used; remove it

mbasmanova · 2019-09-04T14:41:06Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+        return array;
+    }
+
+    private static Block makeNumbers(long from, long to)


input arguments can be integers; then toIntExact won't be needed

I think looping from 0 to count is easier to read

    private static Block makeNumbers(int from, int to)
    {
        int count = to - from;
        long[] array = new long[count];
        for (int i = 0; i < count; i++) {
            array[i] = from + i;
        }
        return new LongArrayBlock(count, Optional.empty(), array);
    }

mbasmanova · 2019-09-04T14:49:44Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+        ConnectorSession session = new TestingConnectorSession(ImmutableList.of());
+        FilterFunction filter = new FilterFunction(session, true, new IsOddPredicate());
+        checkFilter(filter, numbers, allPositions, allPositions.length);
+        Block dictionaryNumbers = new DictionaryBlock(numbers, makePositions(0, 1000, 1));


use allPositions here

mbasmanova · 2019-09-04T14:58:07Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+    {
+        Block numbers = makeNumbers(0, 1000);
+        int[] allPositions = makePositions(0, 1000, 1);
+        ConnectorSession session = new TestingConnectorSession(ImmutableList.of());


I'd initialize session and filter in the beginning of the method and add an empty line after that. These variables are not changing and are the same for all the test cases here.

I'd put an empty line after each call to assertFilter

    ConnectorSession session = new TestingConnectorSession(ImmutableList.of());
    FilterFunction filter = new FilterFunction(session, true, new IsOddPredicate());

    Block numbers = makeNumbers(0, 1000);
    ...

mbasmanova · 2019-09-04T15:00:11Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+
+        // Repeat test on a dictionary with different content to make sure that cached results are not reused.
+        Block numbers2 = makeNumbers(1, 1001);
+        Block dictionary2Numbers = new DictionaryBlock(numbers, makePositions(0, 1000, 1));


use allPositions here

dictionary2Numbers appears to be the same as dictionaryNumbers (the pointers are different, though). Is this intentional? The comment "dictionary with different content" appears misleading.

mbasmanova · 2019-09-04T15:01:15Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+        checkFilter(filter, numbers, allPositions, allPositions.length);
+        Block dictionaryNumbers = new DictionaryBlock(numbers, makePositions(0, 1000, 1));
+        checkFilter(filter, dictionaryNumbers, allPositions, allPositions.length);
+        // Sparse coverage of the same dictionary


I'm seeing that we are not testing the case where subsequent call is reusing previous results, but still evaluates some new values. To cover this, we need to run filter on sparse values first, then on all values.

    Block dictionaryNumbers = new DictionaryBlock(numbers, allPositions);

    // Sparse coverage of the dictionary values
    int[] sparsePositions = makePositions(1, 300, 3);
    assertFilter(filter, dictionaryNumbers, sparsePositions, sparsePositions.length);

    // Full coverage of the dictionary values
    assertFilter(filter, dictionaryNumbers, allPositions, allPositions.length);
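The property this ordering exercises can be checked directly by counting predicate calls: after a sparse pass followed by a full pass, each dictionary entry should have been evaluated exactly once. A minimal sketch of that check follows, using a hypothetical `CachedEvaluationCounter` class rather than Presto's FilterFunction, under the assumption that a single dictionary is used throughout.

```java
import java.util.function.LongPredicate;

// Hypothetical miniature of the cached filter, used only to count predicate calls.
final class CachedEvaluationCounter
{
    private final LongPredicate predicate;
    private final long[] dictionary;
    private final byte[] results; // 0 = not evaluated, 1 = passed, 2 = failed
    private int evaluations;

    CachedEvaluationCounter(long[] dictionary, LongPredicate predicate)
    {
        this.dictionary = dictionary;
        this.predicate = predicate;
        this.results = new byte[dictionary.length];
    }

    // Returns how many of the given dictionary positions pass, evaluating each entry at most once.
    int countPassing(int[] dictionaryPositions)
    {
        int passing = 0;
        for (int position : dictionaryPositions) {
            if (results[position] == 0) {
                evaluations++;
                results[position] = (byte) (predicate.test(dictionary[position]) ? 1 : 2);
            }
            if (results[position] == 1) {
                passing++;
            }
        }
        return passing;
    }

    // Total number of times the predicate has actually been invoked.
    int evaluations()
    {
        return evaluations;
    }
}
```

Running a sparse pass first and a full pass second means the second call mixes cached entries with new ones, which is the case the comment asks to cover.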

mbasmanova · 2019-09-04T15:02:03Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+        // Sparse coverage of the same dictionary
+        int[] sparsePositions = makePositions(1, 300, 3);
+        checkFilter(filter, dictionaryNumbers, sparsePositions, sparsePositions.length);
+        // Test with a different dictionary over the same numbers. Results are reused.


"Results are reused." I think this is not correct because otherDictionary is a new object and therefore won't match any previous dictionary.

mbasmanova · 2019-09-04T15:06:55Z

@oerling Commit title is a bit too long. How about shortening it to Optimize single-column filter functions applied to dictionary blocks and updating the description to

This optimization will apply to dictionary blocks produced by scanning low 
cardinality string columns.

oerling · 2019-09-05T18:11:38Z

Thank you for the review. Updated the PR.

> nit: I'd put the cheapest check first, e.g. if (deterministic && ...)

A: Yes. The thinking was that deterministic is nearly always true, hence last. But it can be first as well; it costs nothing. For readability, one could say that the check that best expresses the key intent, i.e. the most selective one, could go first.

> Just to confirm my understanding: we'll keep re-evaluating the filter if it throws an error, right?

A: Yes. If there is no reordering, the first error is fatal and caching would make no sense. If there is reordering, errors are not frequent; if they are frequent, the filter that has them goes last, which stops the occurrence of the error, that is, the thing masking the error goes before the error. So not remembering errors saves code paths that would otherwise need tests and, as said above, there is no loss from this.

> numbers2 variable is not used; remove it

A: Changed. Now used on the next line, as originally intended. Good catch.

> dictionary2Numbers appears to be the same as dictionaryNumbers (the pointers are different though) - is this intentional; the comment dictionary with different content appears misleading

A: Changed. The point is that numbers2 and numbers have different content. Reuse would give wrong results.

> Results are reused.- I think this is not correct because otherDictionary is a new object and therefore won't match any previous dictionary.

A: otherDictionary is different, but otherDictionary.getDictionary() == dictionaryNumbers.getDictionary().
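The answer about not remembering errors can be illustrated with a small sketch: pass/fail results are cached per dictionary entry, while an entry whose evaluation throws stays unevaluated and is retried on the next access. The class `UncachedErrorSketch` is hypothetical and only mirrors the behavior described in the reply.

```java
import java.util.function.LongPredicate;

// Hypothetical miniature: caches pass/fail per dictionary entry but never caches an exception.
final class UncachedErrorSketch
{
    private final long[] dictionary;
    private final LongPredicate predicate;
    private final byte[] results; // 0 = not evaluated, 1 = passed, 2 = failed
    private int evaluations;

    UncachedErrorSketch(long[] dictionary, LongPredicate predicate)
    {
        this.dictionary = dictionary;
        this.predicate = predicate;
        this.results = new byte[dictionary.length];
    }

    // Answers from the cache when possible; a throwing entry is left as "not evaluated".
    boolean test(int position)
    {
        if (results[position] != 0) {
            return results[position] == 1;
        }
        evaluations++;
        // If the predicate throws here, the assignment below is skipped and the entry stays at 0.
        boolean passed = predicate.test(dictionary[position]);
        results[position] = (byte) (passed ? 1 : 2);
        return passed;
    }

    // Total number of times the predicate has actually been invoked.
    int evaluations()
    {
        return evaluations;
    }
}
```

The trade-off is the one stated in the reply: a value that always throws is re-evaluated on every row that references it, in exchange for having no separate cached-error code path to test.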

mbasmanova

@oerling Looks good to me % one comment.

mbasmanova · 2019-09-05T19:33:00Z

presto-orc/src/test/java/com/facebook/presto/orc/TestFilterFunction.java

+        // Repeat test on a DictionaryBlock over different content to make sure that cached results are not reused.
+        Block numbers2 = makeNumbers(1, 1001);
+        Block dictionary2Numbers = new DictionaryBlock(numbers2, allPositions);
+        assertFilter(filter, dictionary2Numbers, allPositions, allPositions.length);


don't use a1, a2, a3 names; here, I'd inline numbers2 and dictionary2Numbers to avoid naming problem: assertFilter(filter, new DictionaryBlock(makeNumbers(1, 1001), allPositions), allPositions, allPositions.length);

Applies the filter at most once on any of the distinct values in a DictionaryBlock over the same dictionary. This optimizes filtering of low cardinality string columns.

oerling · 2019-09-05T20:45:29Z

Changed.

mbasmanova · 2019-09-05T22:44:30Z

@oerling Thank you, Orri.

oerling requested a review from mbasmanova August 27, 2019 00:59

facebook-github-bot added the CLA Signed label Aug 27, 2019

mbasmanova self-assigned this Aug 27, 2019

mbasmanova requested a review from a team August 27, 2019 11:45

mbasmanova added the aria Presto Aria performance improvements label Aug 27, 2019

mbasmanova requested changes Aug 27, 2019

oerling force-pushed the filter-function-with-dictionary branch from 0ee7fdd to 19b30d6 Compare August 28, 2019 22:43

mbasmanova requested changes Sep 4, 2019

oerling force-pushed the filter-function-with-dictionary branch 2 times, most recently from d0f0d52 to d385c51 Compare September 5, 2019 18:11

mbasmanova reviewed Sep 5, 2019

Optimize single-column FilterFunctions on DictionaryBlocks

a5f060e

Applies the filter at most once on any of the distinct values in a DictionaryBlock over the same dictionary. This optimizes filtering of low cardinality string columns.

oerling force-pushed the filter-function-with-dictionary branch from d385c51 to a5f060e Compare September 5, 2019 20:43

mbasmanova requested a review from a team September 5, 2019 22:44

mbasmanova approved these changes Sep 5, 2019

mbasmanova merged commit 29bb4c6 into prestodb:master Sep 5, 2019

neeradsomanchi mentioned this pull request Sep 9, 2019

Release notes for 0.226 #13340

Closed
