-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial refactoring of pushdown subscripts rule #12534
Initial refactoring of pushdown subscripts rule #12534
Conversation
c9953e4
to
b5207bc
Compare
@oerling Addressed offline comments from Orri. |
} | ||
|
||
@Override | ||
public int hashCode() | ||
{ | ||
return path.hashCode(); | ||
return Objects.hash(path, getName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because getName
always returns path.toString
, do we really need it? Similar question for equals? If so can we comment why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan You are correct. I got confused with this class. It extends Symbol, but it doesn't work like Symbol at all. For example, toSymbolReference will return an invalid symbol with empty name... Let me revert changes to hashCode and equals. In some followup PR, I hope to remove this class altogether.
{ | ||
requireNonNull(path, "path is null"); | ||
if (path.size() <= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would create a private static checkArgument
method for these checks
|
||
public static PathElement allSubscripts() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me what this means, can you add a comment for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan Let's say you have column coordinates of type array(row(x, y)) and the query is:
SELECT c.x FROM t CROSS JOIN UNNEST(coordinates) as u(c)
By default, table scan will produce coordinates column where each array element contains both x and y. However, the query only uses x. To express that, you'll use SubscriptPath coordinates[*].x
. allSubscripts is used to express *
.
I can't come up with a good comment to put here. Any ideas? I'd like to avoid restating the obvious, e.g. "Represents all subscripts of an array or map."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of expressing all subscripts as its own subclass of PathElement
, rather than overloading LongSubscript
with the magic number? It could be singleton.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan I think is it better. We can then serialize it as "*" which will read better. Would you like me to make this change in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! 👍
@@ -93,28 +95,24 @@ | |||
@Override | |||
public void setReferencedSubfields(List<SubfieldPath> subfields, int depth) | |||
{ | |||
HashMap<String, ArrayList<SubfieldPath>> subfieldPaths = new HashMap(); | |||
Map<String, List<SubfieldPath>> subfieldPaths = new HashMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this so that the key is included in the name? i.e. what is the String
?
} | ||
else { | ||
return assignments.get(symbol); | ||
SubfieldPath path = ((SymbolWithSubfieldPath) symbol).getPath(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename getPath
to getSubfieldPath
for clarity
|
||
private static boolean prefixExists(SubfieldPath subfieldPath, Collection<SubfieldPath> subfieldPaths) | ||
{ | ||
List<PathElement> path = subfieldPath.getPath(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename getPath
here to something like getPathParts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan I did that in #12571 (getPathElements)
@@ -73,7 +75,7 @@ | |||
// The set of subscripts for which data needs to be | |||
// returned. Other positions can be initialized to null. If this | |||
// is null, values for all subscripts must be returned. | |||
long[] subscripts; | |||
private int[] subscripts; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it because we can never have such a ridiculously high subscript? If so, why are we using LongSubscript instead of for example IntegerSubscript?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan LongSubscript is used for both array subscript (where values can only be integers) and map subscript (where values can be both integers and longs, e.g. map<integer, T> and map<bigint, T>).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it makes sense to differentiate it when we parse? From my reading we should be able to tell the difference when we construct the XxSubscript
(correct me if I'm wrong?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan At this point, I don't see any practical value of doing that. It would require examining the type of the SubscriptExpression#base, which is not readily available.
else if (field != null) { | ||
base = new DereferenceExpression(base, new Identifier(field)); | ||
else if (element instanceof LongSubscript) { | ||
base = new SubscriptExpression(base, new LongLiteral(String.valueOf(((LongSubscript) element).getIndex()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's some extra parenthesis here and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan I don't see that. Would you show a rewrite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I didn't read it correctly.
b5207bc
to
667b3cb
Compare
@tdcmeehan Tim, thank you for review. Comments addressed. Would you take another look? |
elements.add(new SubfieldPath.LongSubscript(Long.valueOf(literal.getValue()))); | ||
} | ||
else { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this cause NPE on any of deferenceOrSubscriptExpressionToPath
's callsites?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan It might. Some callers check for null and some don't. This is an existing behavior, hence, I'd like to revisit it in some future PR.
private static boolean prefixExists(SubfieldPath subfieldPath, Collection<SubfieldPath> subfieldPaths) | ||
{ | ||
List<PathElement> path = subfieldPath.getPath(); | ||
for (SubfieldPath otherSubfiledPath : subfieldPaths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider adding a isPrefix
method to SubfieldPath
. (Previously I mentioned something comparable etc, don't think it's necessary.)
// Compute the leaf subfields. If we have a.b.c and | ||
// a.b then a.b is the result. If a path is a prefix | ||
// of another path, then the longer is discarded. | ||
List<SubfieldPath> leafPaths = new ArrayList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read it a couple of times now, and I believe this comment isn't accurate. I think it actually returns a.b.c
, not a.b
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan "a.b".isPrefix("a.b.c") returns true, hence, "a.b" will be added to leafPaths, but "a.b.c" won't.
9c50fed
to
5963ccf
Compare
@tdcmeehan Tim, thank you for review. I added isPrefix method to SubfieldPath and AllSubscripts class. Would you take another look? |
} | ||
if (subscript.getSubscript() == PathElement.allSubscripts) { | ||
checkArgument(subscript instanceof LongSubscript, "List reader needs a PathElement with a subscript"); | ||
if (subscript == allSubscripts()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it can be serialized/deserialized, is it safe to use reference equality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if practically it will never happen (not sure), we could ensure it by using @JsonCreator on a static factory which always returns the static constant, and making the constructor private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdcmeehan Good point. It is not safe as of right now, but it will be safe once we use custom deserialization from #12571 (it won't be calling new AllSubscripts(), but use allSubscripts() factory method.) That said, I could change this to .equals() and implement AllSubscripts.equals(). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Extracted pushdown subscripts logic into its own rule, PushDownSubscripts, from PruneUnreferencedOutputs. - Replaced Subfield.PathElement with a hierarchy of classes: empty base class PathElement + NestedField, IntegerSubscript, StringSubscript sub-classes. NestedField represents subfield of a struct: c.a, IntegerSubscript - element of an array or map with integer keys: c[2], StringSubscript - element of a map with varchar keys c["a"].
5963ccf
to
ea81487
Compare
@tdcmeehan Thank you, Tim. I updated AllSubscripts with @JsonCreator on a static factory. |
from PruneUnreferencedOutputs.
PathElement + NestedField, IntegerSubscript, StringSubscript sub-classes.
NestedField represents subfield of a struct: c.a, IntegerSubscript - element
of an array or map with integer keys: c[2], StringSubscript - element of a
map with varchar keys c["a"].
Supersedes #12515.