-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split Painless AST into a "user" tree and an "ir" tree #51278
Conversation
Pinging @elastic/es-core-infra (:Core/Infra/Scripting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. I like the new design, and while I understand this is only the start, the separation should make adding new optimizations much easier long term.
One general comment: Could we keep setters as void? I see a mix of a builder pattern for the setters, but the nodes are not builders, they are simply mutable.
@rjernst Thank you for the feedback. I will make the change to remove the builder pattern in this case as after speaking with you I see how this is confusing. |
@rjernst I've removed the builder pattern from the "ir" nodes in favor of standard getters and setters. Please take a look again when you have a chance. |
modules/lang-painless/src/main/java/org/elasticsearch/painless/PainlessScriptEngine.java
Show resolved
Hide resolved
|
||
/* ---- begin tree structure ---- */ | ||
|
||
private final List<ExpressionNode> argumentNodes = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the actual arguments, correct? If so, consider renaming to arguments
. It's a bit confusing to have an ArgumentsNode
type with field argumentNodes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to suggestions here. I know the names are a bit confusing and clash, but they do both describe accurately what each item is. I have divided the data in the "ir" nodes with two types of data - tree structure and local data. All tree structure members end with Node
to help differentiate this from typical data. This also creates a general consistency between all nodes.
private boolean post; | ||
private Operation operation; | ||
private boolean read; | ||
private boolean cat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a cat
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cat
is short for concatenation for String
types. These names are copied directly from their equivalent "user" node, so I would prefer to leave this for now as mechanical, but renaming should be a future change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's clear when reading the code further down, but a word or two as a comment would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment here and in BinaryMathNode that also does concatenations.
|
||
/* ---- end tree structure */ | ||
|
||
public ArgumentsNode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a default constructor insufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to have more than one constructor early on in the refactor, but wanted to show it was okay to create nodes from scratch. I will remove these as they are no longer necessary.
@stu-elastic Thank you for the review. I have removed all the default constructors from the "ir" nodes. |
|
||
@Override | ||
protected int accessElementCount() { | ||
return 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can ya comment on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, happy to - currently, all of the nodes that contain code allowing for the storage of values such as variables or fields have several additional methods attached to them including accessElementCount
. These methods are used by the AssignmentNode to do the majority of the work. AssignmentNode is also responsible for two additional items - compound assignment and a value being read from either pre or post assignment (++ and -- operator or x = y = z
where y must be read from post assignment). To do this all the storeable nodes follow a common pattern to do compound assignment, etc. This requires knowledge of how many ASM stack elements have been placed on the stack in order to access the actual value which is what accessElementCount
is returning. As an example in this case, BraceSubNode
may refer to an array access where on the stack the array reference along with the index is placed prior to accessing the actual value. accessElementCount
refers to these two values and returns 2. This allows for some shortcutting by AssignmentNode to re-read the value if necessary. Take for instance x.y.z[2] += 1;
. To access z[2]
, we have already accessed x and y. If we didn't shortcut straight to z[2]
again to write the value we would have to double access x and y. Hopefully, this makes some sense.
Edit: I don't want to expend too much effort here adding additional comments because the intention with further refactoring is for this specific code to go away in favor of storeable nodes getting more responsibility from assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start dude.
@stu-elastic Thank you again for the reviews. |
@rjernst Thanks for the review again as well. |
Painless has accumulated literal years of technical debt. The current AST handles all semantic checking and code generation. This AST contains a great deal of mutable state behaving as local input and output. The semantic phase changes the current AST tree structure for type and method resolution along with implicit casting and constant folding. The culmination of this is a great deal of mutable state with numerous edge cases making the current AST difficult to change/improve and particularly fragile to manage moving forward.
By splitting the AST into two separate trees we can eventually make the "user" tree an immutable representation of the code the user actually typed in. The "user" tree then runs through a semantic phase from which an "ir" tree is generated. The "ir" tree is mutable for additional external phases that can either add additional state/features or optimization through "ir" tree manipulation.
Phases manipulating the "ir" tree are much easier to reason about since we will guarantee that the "ir" tree generated from the "user" tree is semantically correct. Each additional phase will run under the assumption that the "ir" tree remains semantically correct. In turn, we can then decouple customized logic from the "ir" tree such as additional exceptions to encapsulate a script's
execute
method, so the "ir" tree nodes are relatively generic. Instead, an additional phase would add exception nodes to wrap theexecute
method. Each of these phases has the potential to be per script context allowing for further optimization phases such as possibly using read only variables to reduce has lookups from Maps, or method folding for SQL, etc.While it is possible to update the existing AST to accommodate for some optimizations, for the reasons outlined above it is at best incredibly difficult as maintaining correctness becomes even harder with each additional change.
This PR takes the existing Painless AST responsible for both semantic checking and writing Java ASM and splits it into two separate trees. The first tree, termed the "user" tree, is now responsible for semantic checking and generation of a second "ir" tree. The second tree, termed the "ir" tree, is responsible for generating the Java ASM bytecode. This change takes the nodes in nearly a 1:1 ratio with the exception of some improved super classes for the "ir" nodes to help with reduction of boilerplate getters/setters. The Painless AST remains mutable for this PR. This change simply takes the existing AST nodes, splits them into a "user" node and an equivalent "ir" node. Each "user" node will generate it's equivalent "ir" node during a separate phase, but eventually this will be combined into a single phase.
This is the minimum (or very close to it) required to effectively split the trees, and have them actually wired together. There is a huge amount of boilerplate accounting for the large increase in number of lines mostly due to making the "ir" nodes in a builder-style format, and then wiring each "user" node to generate an equivalent "ir" node. Each piece of tree structure and data in the "ir" nodes is available for manipulation to allow for maximum flexibility during external phases. This also allows the "user" tree to easily generate "ir" nodes as it gathers each piece of data necessary for code generation.
The code here is a bit raw, but this is an important first step to achieve the goals outlined above. The code in this PR generates scripts that are identical to scripts generated by the current code.
Given that this really is giant, one alternative strategy to break this up would be to check in 10-20 "ir" nodes at a time through 5-10 PRs leaving them isolated and inactive, then have a PR to wire the two trees together.