Optimize single-column FilterFunctions to only run on distinct values for DictionaryBlocks #13289
@oerling This would be a nice optimization. Thanks for contributing. I have some comments.
private Block previousDictionary;
private Page dictionaryPage;

private static final byte FILTER_NOT_EVALUATED = 0;
static member variables go to the beginning of the class before instance member variables
// as the dictionary inside the block is physically the same.
private byte[] dictionaryResults;
private Block previousDictionary;
private Page dictionaryPage;
The dictionaryPage variable is not used.
// If the function has a single argument and this is a DictionaryBlock, we can cache results. The cache is valid as long
// as the dictionary inside the block is physically the same.
private byte[] dictionaryResults;
private Block previousDictionary;
Holding on to dictionary block will work with the current implementation of the readers, but strictly speaking this is not safe. SelectiveStreamReader.getBlockView API returns a temporary view into the data and the caller is not supposed to use this data after next call to SelectiveStreamReader.read. The data is both the block and the dictionary block it is referencing. Hence, it is not safe to hold on to a dictionary from the previous batch. How important is it to re-use filter result across batches? If important, let's think about ways to make it less fragile.
previousDictionary = dictionary;
int numEntries = dictionary.getPositionCount();
dictionaryPage = new Page(numEntries, dictionary);
if (dictionaryResults == null || dictionaryResults.length < numEntries) {
It is preferable to use Arrays.ensureCapacity method for allocating arrays.
dictionaryResults = ensureCapacity(dictionaryResults, dictionary.getPositionCount());
fill(dictionaryResults, FILTER_NOT_EVALUATED);
If it is important to avoid extra fill for newly allocated array, consider adding another ensureCapacity method that takes a value used to initialize new array or re-set existing one.
dictionaryResults = ensureCapacity(dictionaryResults, dictionary.getPositionCount(), FILTER_NOT_EVALUATED);
public static byte[] ensureCapacity(byte[] buffer, int capacity, byte initialValue)
{
    if (buffer == null || buffer.length < capacity) {
        byte[] newBuffer = new byte[capacity];
        if (initialValue != 0) {
            // a freshly allocated array is already zero-filled
            fill(newBuffer, initialValue);
        }
        return newBuffer;
    }
    fill(buffer, initialValue);
    return buffer;
}
    continue;
}
if (result == FILTER_PASSED) {
    positions[outputCount++] = position;
you need to carry over any errors that occurred while evaluating earlier filters:
positions[outputCount] = position;
errors[outputCount] = errors[i];
outputCount++;
same below
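To illustrate why the error has to move together with the position: positions and errors are parallel arrays, and the filter compacts the surviving positions to the front. The following is a minimal self-contained sketch, not code from the PR. The `compact` helper and the `boolean[] keep` mask (standing in for the filter result) are assumptions made for illustration only.

```java
import java.util.Arrays;

final class ErrorCarryOverSketch
{
    private ErrorCarryOverSketch() {}

    // Hypothetical helper: keeps the positions whose mask entry is true and
    // moves the matching error slot along with each surviving position.
    static int compact(int[] positions, int positionCount, RuntimeException[] errors, boolean[] keep)
    {
        int outputCount = 0;
        for (int i = 0; i < positionCount; i++) {
            if (keep[i]) {
                positions[outputCount] = positions[i];
                errors[outputCount] = errors[i]; // carry over the earlier filter's error
                outputCount++;
            }
        }
        return outputCount;
    }

    public static void main(String[] args)
    {
        int[] positions = {10, 20, 30, 40};
        RuntimeException earlier = new IllegalStateException("raised by an earlier filter");
        RuntimeException[] errors = {null, null, earlier, null};
        boolean[] keep = {false, true, true, false};

        int count = compact(positions, positions.length, errors, keep);
        System.out.println(count + " positions kept: " + Arrays.toString(Arrays.copyOf(positions, count)));
        System.out.println("error still attached to position 30: " + (errors[1] == earlier));
    }
}
```

Without the `errors[outputCount] = errors[i]` line, the error recorded for position 30 would stay at index 2 and no longer line up with the compacted positions.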
int position = positions[i];
int dictionaryPosition = block.getId(position);
byte result = dictionaryResults[dictionaryPosition];
if (result == FILTER_FAILED) {
Is there any particular reason not to use switch?
switch (result) {
    case FILTER_FAILED:
        break;
    case FILTER_PASSED:
        positions[outputCount] = position;
        errors[outputCount] = errors[i];
        outputCount++;
        break;
    case FILTER_NOT_EVALUATED:
        try {
            if (predicate.evaluate(session, page, position)) {
                positions[outputCount] = position;
                errors[outputCount] = errors[i];
                outputCount++;
                dictionaryResults[dictionaryPosition] = FILTER_PASSED;
            }
            else {
                dictionaryResults[dictionaryPosition] = FILTER_FAILED;
            }
        }
        catch (RuntimeException e) {
            // We do not record errors in the dictionary results.
            positions[outputCount] = position;
            errors[outputCount] = e; // keep last error
            outputCount++;
        }
        break;
    default:
        verify(false, "Unexpected filter result: " + result);
}
@@ -334,6 +334,9 @@ public void testFilterFunctions()
// filter function on numeric and boolean columns
assertFilterProject("if(is_returned, linenumber, orderkey) % 5 = 0", "linenumber");

// Filter functions on dictionary encoded columns
assertQuery("SELECT orderkey, linenumber, shipmode, shipinstruct FROM lineitem WHERE shipmode LIKE '%R%' and shipinstruct LIKE '%CO%'");
This test fails because there is no varchar reader yet. FilterFunction can be unit tested though. This might be a better way to test it anyway.
Force-pushed from 0ee7fdd to 19b30d6.
I addressed the comments and added a unit test.
Concerning sharing filter results between batches:
The lifetime of the dictionary is at least a row group and more often a stripe, which means that a reasonable run estimate is between 10K and 1M values. With a batch of at most 1K values, that gives 10 to 1K reuses for a first filter. The dictionary cardinalities that we see are somewhere in the hundreds, based on just point observations. So reuse across batches is actually important.
Concerning memory ownership:
The precedent in DictionaryAwarePageFilter is that the base block of consecutive DictionaryBlocks is compared with ==.
The base block of a DictionaryBlock in readers, Aria or other, is a VariableWidthBlock over a byte[] that is made at the start of the Stripe or RowGroup, depending on the scope.
If the VariableWidthBlock stays the same, then the byte[] stays the same, unless the underlying byte[] is reused in another Stripe/RowGroup. Doing this would break the memory assumptions of the pipeline. Specifically, aggregation, for example, expects to keep references to incoming blocks forever, and hence the base byte[] of a string dictionary is very definitely RAII (resource acquisition is initialization).
On general principles, I think it is a bad practice to keep data indefinitely live by having a reference to the container for the sake of a single element. This is a real problem in terms of big group bys with tons of cross region refs and GC goes to 20+% of CPU just tracking these.
This should be fixed but not now.
Therefore, until there is a memory ownership model where producers own their memory, which I think is how it should be, the assumption on the immutability of the Dictionary byte[] is solid.
It would be entirely appropriate to document that the base of a DictionaryBlock, also in the case of a view, is immutable. If you have memory reuse throughout, then you'd give the dictionary base a serial number.
There is also a dictionarySourceId in DictionaryBlock which could be used to indicate that two DictionaryBlocks have the same dictionary. This is even serializable. This is not used for this purpose in filtering but we could use it for this, as far as I can tell, without breaking anything.
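The serial-number idea can be sketched as follows. This is only an illustration under stated assumptions: `IdentifiedDictionary`, its `id` field, and `SourceIdKeyedCacheSketch` are hypothetical stand-ins, not Presto classes, and the sketch does not use the real dictionarySourceId API. It shows a result cache keyed by an id, so that a producer which reuses its buffer can still invalidate cached results by issuing a new id.

```java
import java.util.function.LongPredicate;

final class SourceIdKeyedCacheSketch
{
    private static final byte NOT_EVALUATED = 0;
    private static final byte PASSED = 1;
    private static final byte FAILED = 2;

    // Hypothetical stand-in for a dictionary whose producer stamps it with a serial number.
    static final class IdentifiedDictionary
    {
        final long id;
        final long[] values;

        IdentifiedDictionary(long id, long[] values)
        {
            this.id = id;
            this.values = values;
        }
    }

    private final LongPredicate predicate;
    private byte[] results = new byte[0];
    private long cachedId = -1; // assumes real ids are non-negative
    int evaluations; // counts predicate calls, for the demonstration only

    SourceIdKeyedCacheSketch(LongPredicate predicate)
    {
        this.predicate = predicate;
    }

    boolean test(IdentifiedDictionary dictionary, int index)
    {
        if (dictionary.id != cachedId) {
            // A different serial number means the cached results no longer apply.
            results = new byte[dictionary.values.length];
            cachedId = dictionary.id;
        }
        if (results[index] == NOT_EVALUATED) {
            evaluations++;
            results[index] = predicate.test(dictionary.values[index]) ? PASSED : FAILED;
        }
        return results[index] == PASSED;
    }

    public static void main(String[] args)
    {
        SourceIdKeyedCacheSketch cache = new SourceIdKeyedCacheSketch(value -> value % 2 == 1);
        long[] buffer = {1, 2, 3};
        cache.test(new IdentifiedDictionary(7, buffer), 0);
        cache.test(new IdentifiedDictionary(7, buffer), 0); // same id: answered from the cache
        buffer[0] = 4; // the producer reuses the buffer for new content...
        boolean passed = cache.test(new IdentifiedDictionary(8, buffer), 0); // ...and issues a new id
        System.out.println("evaluations=" + cache.evaluations + ", passed=" + passed);
    }
}
```

In this sketch a reference comparison on `buffer` would wrongly report the dictionary as unchanged after the buffer is overwritten, which is the situation the serial number is meant to cover.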
This is not very urgent since first we’ll want Slice readers to be in. The test in the query tests will not work until then.
@@ -53,6 +68,10 @@ public int filter(Page page, int[] positions, int positionCount, RuntimeExceptio
checkArgument(positionCount <= positions.length);
checkArgument(positionCount <= errors.length);

if (inputChannels.length == 1 && page.getBlock(0) instanceof DictionaryBlock && deterministic) {
nit: I'd put the cheapest check first, e.g. if (deterministic && ...)
    }
}
catch (RuntimeException e) {
    // We do not record errors in the dictionary results.
Just to confirm my understanding: we'll keep re-evaluating the filter if it throws an error, right?
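The behaviour being asked about can be sketched in isolation. This is a simplified illustration, not the PR's FilterFunction: `UncachedErrorSketch` and the predicate that throws on the value 13 are assumptions for the example. A dictionary entry whose evaluation throws stays in the not-evaluated state, so it is evaluated again on the next call, while entries that evaluate normally are cached.

```java
import java.util.function.LongPredicate;

final class UncachedErrorSketch
{
    private static final byte NOT_EVALUATED = 0;
    private static final byte PASSED = 1;
    private static final byte FAILED = 2;

    private final LongPredicate predicate;
    private final long[] dictionary;
    private final byte[] results;
    int evaluations; // counts predicate calls, for the demonstration only

    UncachedErrorSketch(LongPredicate predicate, long[] dictionary)
    {
        this.predicate = predicate;
        this.dictionary = dictionary;
        this.results = new byte[dictionary.length];
    }

    // Answers from the cache when possible. Exceptions propagate without being
    // recorded, so the same entry is evaluated again on the next call.
    boolean test(int index)
    {
        if (results[index] == NOT_EVALUATED) {
            evaluations++;
            boolean passed = predicate.test(dictionary[index]); // may throw
            results[index] = passed ? PASSED : FAILED;
        }
        return results[index] == PASSED;
    }

    public static void main(String[] args)
    {
        LongPredicate isOddButFailsOn13 = value -> {
            if (value == 13) {
                throw new IllegalArgumentException("unlucky value");
            }
            return value % 2 == 1;
        };
        UncachedErrorSketch filter = new UncachedErrorSketch(isOddButFailsOn13, new long[] {13, 7});

        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                filter.test(0);
            }
            catch (IllegalArgumentException e) {
                // the error is reported to the caller, not cached
            }
            filter.test(1);
        }
        System.out.println("evaluations=" + filter.evaluations);
    }
}
```

Over three rounds the throwing entry is evaluated three times and the normal entry once.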
private static final long UNLUCKY = 13;

@Test
public void TestFilter()
method names start with a lower case: TestFilter -> testFilter
checkFilter(filter, otherDictionary, otherDictionaryPositions, otherDictionaryPositions.length);

// Repeat test on a dictionary with different content to make sure that cached results are not reused.
Block numbers2 = makeNumbers(1, 1001);
numbers2 variable is not used; remove it
    return array;
}

private static Block makeNumbers(long from, long to)
- input arguments can be integers; then toIntExact won't be needed
- I think looping from 0 to count is easier to read
private static Block makeNumbers(int from, int to)
{
    int count = to - from;
    long[] array = new long[count];
    for (int i = 0; i < count; i++) {
        array[i] = from + i;
    }
    return new LongArrayBlock(count, Optional.empty(), array);
}
ConnectorSession session = new TestingConnectorSession(ImmutableList.of());
FilterFunction filter = new FilterFunction(session, true, new IsOddPredicate());
checkFilter(filter, numbers, allPositions, allPositions.length);
Block dictionaryNumbers = new DictionaryBlock(numbers, makePositions(0, 1000, 1));
use allPositions here
{
    Block numbers = makeNumbers(0, 1000);
    int[] allPositions = makePositions(0, 1000, 1);
    ConnectorSession session = new TestingConnectorSession(ImmutableList.of());
- I'd initialize session and filter in the beginning of the method and add an empty line after that. These variables are not changing and are the same for all the test cases here.
- I'd put an empty line after each call to assertFilter
ConnectorSession session = new TestingConnectorSession(ImmutableList.of());
FilterFunction filter = new FilterFunction(session, true, new IsOddPredicate());
Block numbers = makeNumbers(0, 1000);
...
// Repeat test on a dictionary with different content to make sure that cached results are not reused.
Block numbers2 = makeNumbers(1, 1001);
Block dictionary2Numbers = new DictionaryBlock(numbers, makePositions(0, 1000, 1));
- use allPositions here
- dictionary2Numbers appears to be the same as dictionaryNumbers (the pointers are different though) - is this intentional? The comment "dictionary with different content" appears misleading.
checkFilter(filter, numbers, allPositions, allPositions.length);
Block dictionaryNumbers = new DictionaryBlock(numbers, makePositions(0, 1000, 1));
checkFilter(filter, dictionaryNumbers, allPositions, allPositions.length);
// Sparse coverage of the same dictionary
I'm seeing that we are not testing the case where subsequent call is reusing previous results, but still evaluates some new values. To cover this, we need to run filter on sparse values first, then on all values.
Block dictionaryNumbers = new DictionaryBlock(numbers, allPositions);
// Sparse coverage of the dictionary values
int[] sparsePositions = makePositions(1, 300, 3);
assertFilter(filter, dictionaryNumbers, sparsePositions, sparsePositions.length);
// Full coverage of the dictionary values
assertFilter(filter, dictionaryNumbers, allPositions, allPositions.length);
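What the sparse-then-full ordering exercises can be seen in a stripped-down model of the optimization. This is a sketch under stated assumptions, not the PR's code: plain `long[]` and `int[]` arrays stand in for the dictionary and the ids of a DictionaryBlock, error handling is omitted, and `makePositions(from, count, step)` is a local helper whose argument meaning is assumed. An evaluation counter shows that the second call reuses the cached entries and evaluates only the dictionary values it has not seen yet.

```java
import java.util.function.LongPredicate;

final class DictionaryFilterSketch
{
    private static final byte NOT_EVALUATED = 0;
    private static final byte PASSED = 1;
    private static final byte FAILED = 2;

    private final LongPredicate predicate;
    private long[] previousDictionary;
    private byte[] results;
    int evaluations; // counts predicate calls, for the demonstration only

    DictionaryFilterSketch(LongPredicate predicate)
    {
        this.predicate = predicate;
    }

    // ids[position] is the dictionary index of the row at that position.
    // Surviving positions are compacted to the front of the positions array.
    int filter(long[] dictionary, int[] ids, int[] positions, int positionCount)
    {
        if (dictionary != previousDictionary) {
            // New dictionary (by reference): start from an all-zero, i.e. unevaluated, cache.
            previousDictionary = dictionary;
            results = new byte[dictionary.length];
        }
        int outputCount = 0;
        for (int i = 0; i < positionCount; i++) {
            int position = positions[i];
            int id = ids[position];
            if (results[id] == NOT_EVALUATED) {
                evaluations++;
                results[id] = predicate.test(dictionary[id]) ? PASSED : FAILED;
            }
            if (results[id] == PASSED) {
                positions[outputCount++] = position;
            }
        }
        return outputCount;
    }

    // Assumed meaning: count positions starting at from, spaced step apart.
    static int[] makePositions(int from, int count, int step)
    {
        int[] positions = new int[count];
        for (int i = 0; i < count; i++) {
            positions[i] = from + i * step;
        }
        return positions;
    }

    public static void main(String[] args)
    {
        long[] dictionary = new long[1000];
        int[] ids = new int[1000];
        for (int i = 0; i < 1000; i++) {
            dictionary[i] = i;
            ids[i] = i;
        }
        DictionaryFilterSketch filter = new DictionaryFilterSketch(value -> value % 2 == 1);

        int[] sparse = makePositions(1, 300, 3);
        filter.filter(dictionary, ids, sparse, sparse.length);
        System.out.println("evaluations after sparse pass: " + filter.evaluations);

        int[] all = makePositions(0, 1000, 1);
        filter.filter(dictionary, ids, all, all.length);
        System.out.println("evaluations after full pass: " + filter.evaluations);
    }
}
```

If the order were reversed (full pass first), the sparse pass would hit only cached entries and the mixed path would go untested.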
// Sparse coverage of the same dictionary
int[] sparsePositions = makePositions(1, 300, 3);
checkFilter(filter, dictionaryNumbers, sparsePositions, sparsePositions.length);
// Test with a different dictionary over the same numbers. Results are reused.
"Results are reused." - I think this is not correct because otherDictionary is a new object and therefore won't match any previous dictionary.
Force-pushed from d0f0d52 to d385c51.
Thank you for the review. Updated the PR.
nit: I'd put the cheapest check first, e.g. if (deterministic && ...)
A: Yes. The thinking was that deterministic is nearly always true, hence last. But it can be first as well; it costs nothing. For readability, one could say that the check that best expresses the key intent, i.e. the most selective one, could be first.
Just to confirm my understanding: we'll keep re-evaluating the filter if it throws an error, right?
A: Yes. If there is no reordering, the first error kills the query and caching would make no sense. If there is reordering, errors are not frequent; if they are frequent, the filter that has them goes last, which stops the occurrence of the error, that is, the thing masking the error goes before the error. So not remembering errors saves code paths that would otherwise need tests and, as said above, there will be no loss from this.
numbers2 variable is not used; remove it
A: Changed. Now used on the next line, as originally intended. Good catch.
dictionary2Numbers appears to be the same as dictionaryNumbers (the pointers are different though) - is this intentional; the comment dictionary with different content appears misleading
A: Changed. The point is that numbers2 and numbers have different content. Reuse would give wrong results.
"Results are reused." - I think this is not correct because otherDictionary is a new object and therefore won't match any previous dictionary.
A: otherDictionary is different but otherDictionary.getDictionary() == dictionaryNumbers.getDictionary().
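The distinction in that last answer, identity of the wrapper versus identity of the dictionary it wraps, can be illustrated with a small stand-in. `DictionaryView` and `canReuseResults` are hypothetical names for this sketch and are not Presto's DictionaryBlock API. Only the comparison pattern `previous.getDictionary() == current.getDictionary()` is taken from the discussion.

```java
final class SharedDictionarySketch
{
    private SharedDictionarySketch() {}

    // Hypothetical stand-in for a dictionary-encoded block: a shared base array plus per-row ids.
    static final class DictionaryView
    {
        private final long[] dictionary;
        private final int[] ids;

        DictionaryView(long[] dictionary, int[] ids)
        {
            this.dictionary = dictionary;
            this.ids = ids;
        }

        long[] getDictionary()
        {
            return dictionary;
        }

        int getId(int position)
        {
            return ids[position];
        }
    }

    static boolean canReuseResults(DictionaryView previous, DictionaryView current)
    {
        // Reference comparison of the base dictionary, not of the wrapper objects.
        return previous.getDictionary() == current.getDictionary();
    }

    public static void main(String[] args)
    {
        long[] numbers = {0, 1, 2, 3};
        DictionaryView first = new DictionaryView(numbers, new int[] {0, 1, 2, 3});
        DictionaryView other = new DictionaryView(numbers, new int[] {1, 3});
        DictionaryView copy = new DictionaryView(numbers.clone(), new int[] {1, 3});

        System.out.println("same wrapper object: " + (first == other));
        System.out.println("reuse for view over the same base: " + canReuseResults(first, other));
        System.out.println("reuse for view over a copied base: " + canReuseResults(first, copy));
    }
}
```

A copied base has equal content but a different reference, so in this sketch its results are not reused, which errs on the side of re-evaluating.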
@oerling Commit title is a bit too long. How about shortening it to
Optimize single-column filter functions applied to dictionary blocks
and updating the description to
This optimization will apply to dictionary blocks produced by scanning low cardinality string columns.
@oerling Looks good to me % one comment.
// Repeat test on a DictionaryBlock over different content to make sure that cached results are not reused.
Block numbers2 = makeNumbers(1, 1001);
Block dictionary2Numbers = new DictionaryBlock(numbers2, allPositions);
assertFilter(filter, dictionary2Numbers, allPositions, allPositions.length);
don't use a1, a2, a3 names; here, I'd inline numbers2 and dictionary2Numbers to avoid naming problem: assertFilter(filter, new DictionaryBlock(makeNumbers(1, 1001), allPositions), allPositions, allPositions.length);
Applies the filter at most once on any of the distinct values in a DictionaryBlock over the same dictionary. This optimizes filtering of low cardinality string columns.
Force-pushed from d385c51 to a5f060e.
Changed.
@oerling Thank you, Orri.
Note that the DictionaryBlocks are produced by scan only for string columns. This therefore depends on the introduction of a selective slice dictionary reader.