Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial refactoring of pushdown subscripts rule #12534

Merged

Conversation

mbasmanova
Copy link
Contributor

  • Extracted pushdown subscripts logic into its own rule, PushDownSubscripts,
    from PruneUnreferencedOutputs.
  • Replaced Subfield.PathElement with a hierarchy of classes: empty base class
    PathElement + NestedField, IntegerSubscript, StringSubscript sub-classes.
    NestedField represents subfield of a struct: c.a, IntegerSubscript - element
    of an array or map with integer keys: c[2], StringSubscript - element of a
    map with varchar keys c["a"].

Supersedes #12515.

@mbasmanova
Copy link
Contributor Author

@oerling Addressed offline comments from Orri.

}

@Override
public int hashCode()
{
return path.hashCode();
return Objects.hash(path, getName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because getName always returns path.toString, do we really need it? Similar question for equals? If so can we comment why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan You are correct. I got confused with this class. It extends Symbol, but it doesn't work like Symbol at all. For example, toSymbolReference will return an invalid symbol with empty name... Let me revert changes to hashCode and equals. In some followup PR, I hope to remove this class altogether.

{
requireNonNull(path, "path is null");
if (path.size() <= 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would create a private static checkArgument method for these checks


public static PathElement allSubscripts()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what this means, can you add a comment for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan Let's say you have column coordinates of type array(row(x, y)) and the query is:

SELECT c.x FROM t CROSS JOIN UNNEST(coordinates) as u(c)

By default, table scan will produce coordinates column where each array element contains both x and y. However, the query only uses x. To express that, you'll use SubscriptPath coordinates[*].x. allSubscripts is used to express *.

I can't come up with a good comment to put here. Any ideas? I'd like to avoid restating the obvious, e.g. "Represents all subscripts of an array or map."

Copy link
Contributor

@tdcmeehan tdcmeehan Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of expressing all subscripts as its own subclass of PathElement, rather than overloading LongSubscript with the magic number? It could be singleton.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan I think is it better. We can then serialize it as "*" which will read better. Would you like me to make this change in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! 👍

@@ -93,28 +95,24 @@
@Override
public void setReferencedSubfields(List<SubfieldPath> subfields, int depth)
{
HashMap<String, ArrayList<SubfieldPath>> subfieldPaths = new HashMap();
Map<String, List<SubfieldPath>> subfieldPaths = new HashMap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this so that the key is included in the name? i.e. what is the String?

}
else {
return assignments.get(symbol);
SubfieldPath path = ((SymbolWithSubfieldPath) symbol).getPath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename getPath to getSubfieldPath for clarity


private static boolean prefixExists(SubfieldPath subfieldPath, Collection<SubfieldPath> subfieldPaths)
{
List<PathElement> path = subfieldPath.getPath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename getPath here to something like getPathParts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan I did that in #12571 (getPathElements)

@@ -73,7 +75,7 @@
// The set of subscripts for which data needs to be
// returned. Other positions can be initialized to null. If this
// is null, values for all subscripts must be returned.
long[] subscripts;
private int[] subscripts;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it because we can never have such a ridiculously high subscript? If so, why are we using LongSubscript instead of for example IntegerSubscript?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan LongSubscript is used for both array subscript (where values can only be integers) and map subscript (where values can be both integers and longs, e.g. map<integer, T> and map<bigint, T>).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it makes sense to differentiate it when we parse? From my reading we should be able to tell the difference when we construct the XxSubscript (correct me if I'm wrong?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan At this point, I don't see any practical value of doing that. It would require examining the type of the SubscriptExpression#base, which is not readily available.

else if (field != null) {
base = new DereferenceExpression(base, new Identifier(field));
else if (element instanceof LongSubscript) {
base = new SubscriptExpression(base, new LongLiteral(String.valueOf(((LongSubscript) element).getIndex())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some extra parenthesis here and below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan I don't see that. Would you show a rewrite?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I didn't read it correctly.

@mbasmanova mbasmanova force-pushed the refactor-aria-rules branch from b5207bc to 667b3cb Compare April 2, 2019 15:29
@mbasmanova
Copy link
Contributor Author

@tdcmeehan Tim, thank you for review. Comments addressed. Would you take another look?

elements.add(new SubfieldPath.LongSubscript(Long.valueOf(literal.getValue())));
}
else {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this cause NPE on any of deferenceOrSubscriptExpressionToPath's callsites?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan It might. Some callers check for null and some don't. This is an existing behavior, hence, I'd like to revisit it in some future PR.

private static boolean prefixExists(SubfieldPath subfieldPath, Collection<SubfieldPath> subfieldPaths)
{
List<PathElement> path = subfieldPath.getPath();
for (SubfieldPath otherSubfiledPath : subfieldPaths) {
Copy link
Contributor

@tdcmeehan tdcmeehan Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider adding a isPrefix method to SubfieldPath. (Previously I mentioned something comparable etc, don't think it's necessary.)

// Compute the leaf subfields. If we have a.b.c and
// a.b then a.b is the result. If a path is a prefix
// of another path, then the longer is discarded.
List<SubfieldPath> leafPaths = new ArrayList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read it a couple of times now, and I believe this comment isn't accurate. I think it actually returns a.b.c, not a.b.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan "a.b".isPrefix("a.b.c") returns true, hence, "a.b" will be added to leafPaths, but "a.b.c" won't.

@mbasmanova mbasmanova force-pushed the refactor-aria-rules branch 3 times, most recently from 9c50fed to 5963ccf Compare April 2, 2019 19:46
@mbasmanova
Copy link
Contributor Author

@tdcmeehan Tim, thank you for review. I added isPrefix method to SubfieldPath and AllSubscripts class. Would you take another look?

}
if (subscript.getSubscript() == PathElement.allSubscripts) {
checkArgument(subscript instanceof LongSubscript, "List reader needs a PathElement with a subscript");
if (subscript == allSubscripts()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it can be serialized/deserialized, is it safe to use reference equality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if practically it will never happen (not sure), we could ensure it by using @JsonCreator on a static factory which always returns the static constant, and making the constructor private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan Good point. It is not safe as of right now, but it will be safe once we use custom deserialization from #12571 (it won't be calling new AllSubscripts(), but use allSubscripts() factory method.) That said, I could change this to .equals() and implement AllSubscripts.equals(). What do you think?

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the improvement suggestion here:

#12534 (comment)

LGTM

- Extracted pushdown subscripts logic into its own rule, PushDownSubscripts,
from PruneUnreferencedOutputs.
- Replaced Subfield.PathElement with a hierarchy of classes: empty base class
PathElement + NestedField, IntegerSubscript, StringSubscript sub-classes.
NestedField represents subfield of a struct: c.a, IntegerSubscript - element
of an array or map with integer keys: c[2], StringSubscript - element of a
map with varchar keys c["a"].
@mbasmanova mbasmanova force-pushed the refactor-aria-rules branch from 5963ccf to ea81487 Compare April 3, 2019 00:24
@mbasmanova
Copy link
Contributor Author

@tdcmeehan Thank you, Tim. I updated AllSubscripts with @JsonCreator on a static factory.

@mbasmanova mbasmanova merged commit 51a2e30 into prestodb:aria-scan-research Apr 3, 2019
@highker highker added the aria Presto Aria performance improvements label Apr 4, 2019
@mbasmanova mbasmanova deleted the refactor-aria-rules branch May 24, 2019 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aria Presto Aria performance improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants