Proposal to Optimize Join Operations with Chunking Using VALUES Statement #16508
Comments
Looks quite good! We don't really care about the chunk size though, right? Would it not always be better to send them all in one go? Also, we could probably optimize this even more to only send the values for a specific shard in that chunk. Kinda like how we do it for
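A rough sketch of the per-shard idea from the comment above: before sending a batch, split it so each shard only receives the rows it can actually match. The `Row` type, the `groupByShard` name, and the `shardOf` callback are all made up for illustration; a real implementation would resolve shards through the vindex rather than a callback.

```go
package main

// Row is a hypothetical LHS row reduced to the join key plus one carried column.
type Row struct {
	ID  int64
	Foo string
}

// groupByShard splits one batch of LHS rows into per-shard batches.
// shardOf is an assumed stand-in for whatever resolves a row to its target shard.
func groupByShard(rows []Row, shardOf func(Row) string) map[string][]Row {
	out := make(map[string][]Row)
	for _, r := range rows {
		shard := shardOf(r)
		out[shard] = append(out[shard], r)
	}
	return out
}
```

Each map entry would then become one VALUES query routed to a single shard instead of a scatter.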
Similarly, de-duping would go a long way for many queries. Currently, the same query for the same row in the right table is run once for each occurrence in the left. Batching would allow de-duping, at least within the batching window.
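To illustrate the de-duping point, here is a minimal sketch that collapses repeated join keys within one batch while remembering which distinct key each LHS row maps to, so the RHS results can be fanned back out afterwards. The function name and the choice of `int64` keys are assumptions for the example only.

```go
package main

// dedupeKeys returns the distinct join keys in first-seen order, plus, for
// each input position, the index of its key in the distinct slice.
func dedupeKeys(keys []int64) (distinct []int64, pos []int) {
	seen := make(map[int64]int, len(keys))
	pos = make([]int, len(keys))
	for i, k := range keys {
		idx, ok := seen[k]
		if !ok {
			idx = len(distinct)
			seen[k] = idx
			distinct = append(distinct, k)
		}
		pos[i] = idx
	}
	return distinct, pos
}
```

Only `distinct` would be sent to the RHS; `pos` lets the engine reuse one RHS result for every LHS row that shares the key.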
I have also thought about similar issues before and considered rewriting RHS equality queries to use IN, just like in your example with RHS. The query would become
SELECT u.name FROM user AS u WHERE u.bar = 'foo' AND u.id IN (:ue_ids)
I would like to ask about the differences between using the
The IN clause is useful when only retrieving data from the RHS columns. For instance:
SELECT ue.id, ue.foo, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.id
Here, you can use the IN clause to optimize the query:
SELECT ue.id, ue.foo, ue.bar
FROM user_extra ue
WHERE ue.id IN (:ue_ids)
However, if columns from the LHS are needed as well, the IN clause falls short. Consider this query:
SELECT u.foo, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.id
In this case, the result needs columns from both sides of the join, so each LHS value has to stay paired with its join key. Using the VALUES statement, the RHS query becomes:
SELECT u.foo, ue.bar
FROM user_extra ue
JOIN (VALUES ROW(1, 'foo'), ROW(2, 'bar')) AS u(id, foo)
ON u.id = ue.id
This approach keeps the paired values together, ensuring data integrity across the join operation.
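For what it's worth, the VALUES table in the query above could be generated from a batch of LHS rows along these lines. This is only a sketch: `valuesClause` is a made-up helper, and it emits `?` placeholders with a flattened argument list as an assumption (Vitess itself uses named bind variables, and whether placeholders are accepted in this position should be checked against the target MySQL version).

```go
package main

import "strings"

// valuesClause renders "VALUES ROW(?, ?), ROW(?, ?), ..." with one ROW per
// input row, and returns the row values flattened in placeholder order.
// An empty batch yields an empty string, since "VALUES" alone is not valid SQL.
func valuesClause(rows [][]interface{}) (string, []interface{}) {
	if len(rows) == 0 {
		return "", nil
	}
	var sb strings.Builder
	var args []interface{}
	sb.WriteString("VALUES ")
	for i, row := range rows {
		if i > 0 {
			sb.WriteString(", ")
		}
		sb.WriteString("ROW(")
		for j := range row {
			if j > 0 {
				sb.WriteString(", ")
			}
			sb.WriteString("?")
		}
		sb.WriteString(")")
		args = append(args, row...)
	}
	return sb.String(), args
}
```

Keeping each row's values adjacent in the argument list is what preserves the pairing between the join key and the carried LHS column.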
Multi-Join and Expression Optimization Clarification
I wanted to clarify how multi-join situations and complex expressions are handled with the ValuesJoin optimization:
Example Query
SELECT u.name, o.order_date, p.product_name, u.loyalty_score + o.order_value AS customer_value
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'USA' AND o.status = 'completed' AND p.category = 'electronics';
Execution Steps
Key Points
Talking this over with @harshit-gangal, we don't think it actually makes sense to fall back to single rows here. If the RHS is a scatter, it would just mean that we need to send many scatter queries, which is not really preferable. We don't see a situation right now that would require falling back, so currently we are thinking all joins would be
Choosing ApplyJoin or ValuesJoin?
@frouioui and I talked about whether there are any situations where it would still make sense to use
// Determine the join mode based on MySQL version and query context
mode := ValuesJoin
if version < 8.0.19 || (ctx.wantsFastFirstRow() && !rhs.isGreedy()) {
    mode = ApplyJoin
}
Explanation:
Greedy Operator in Query Planning:
Here's a list of MySQL operators or constructs that benefit from fast first row retrieval:
OLTP/OLAP plan cache
In OLTP mode, all operators are greedy. Today we do not differentiate between plans for these two modes. We should probably start doing so if we introduce the greedy property on operators.
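The mode selection snippet above is pseudocode (`8.0.19` is not a Go literal), so here is one way it might look as compilable Go. Everything in it is a sketch: the `Version` type, `chooseJoinMode`, and passing the two planner properties as plain booleans are assumptions made for the example, not existing Vitess APIs. The 8.0.19 threshold is the MySQL release that introduced the VALUES statement.

```go
package main

// JoinMode mirrors the two strategies discussed in this thread.
type JoinMode int

const (
	ApplyJoin JoinMode = iota
	ValuesJoin
)

// Version is a hypothetical major.minor.patch triple for the backing MySQL.
type Version struct{ Major, Minor, Patch int }

// Less reports whether v is an older release than o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Patch < o.Patch
}

// minValuesVersion is the first MySQL release with the VALUES statement.
var minValuesVersion = Version{8, 0, 19}

// chooseJoinMode falls back to ApplyJoin when VALUES is unavailable, or when
// the query wants a fast first row and the RHS is not greedy anyway.
func chooseJoinMode(v Version, wantsFastFirstRow, rhsIsGreedy bool) JoinMode {
	if v.Less(minValuesVersion) || (wantsFastFirstRow && !rhsIsGreedy) {
		return ApplyJoin
	}
	return ValuesJoin
}
```

With these definitions, a greedy RHS always gets ValuesJoin on a new enough MySQL, since batching cannot delay a first row that has to wait for the full result regardless.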
Summary
This proposal suggests an optimization for join operations in Vitess by introducing chunking of rows using the VALUES statement in MySQL. The goal is to reduce the number of network round-trips and improve query performance by batching data transfer.
Background
Currently, Vitess handles joins between sharded tables by using bind variables to fetch rows from the right-hand side (RHS) of the join, based on values obtained from the left-hand side (LHS), a so-called nested loop join. This approach can lead to many network round-trips, especially when the join involves a significant number of rows, impacting overall performance.
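To make the round-trip cost concrete: a nested loop join issues one RHS query per LHS row, whereas batching issues one per chunk. The sketch below only shows the chunking step; `chunkKeys` and the treatment of a non-positive size as "send everything at once" are assumptions for illustration.

```go
package main

// chunkKeys splits the LHS join keys into consecutive batches of at most
// size keys. A non-positive size is treated as "one batch with everything".
func chunkKeys(keys []int64, size int) [][]int64 {
	if size <= 0 {
		size = len(keys)
	}
	var out [][]int64
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		out = append(out, keys[start:end])
	}
	return out
}
```

For 10 LHS rows and a chunk size of 4, this gives 3 batches, so 3 RHS round-trips instead of 10.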
Proposal
I propose leveraging the VALUES statement to batch multiple rows together, thus minimizing the number of network round-trips. The engine would generate a query using the VALUES clause for the RHS of the join, allowing multiple rows to be processed in a single request.
The VALUES statement in MySQL is used to construct a set of rows that can be treated as a table. It allows you to define multiple rows of data, where each row is represented as ROW(value1, value2, ...). This virtual table can then be used in various SQL operations, such as joins or unions, making it a useful tool for batch processing data in queries. For example, VALUES ROW(1, 'foo'), ROW(2, 'bar') creates a temporary result set with two rows and two columns, which can be aliased and joined with other tables in a query.
Example Change:
The current process uses a query like:
With bind variables, the RHS query becomes:
The proposed query with chunking:
In this case, ROW(1,2), ROW(4,5) represent batched data for the join.
Row-by-Row Fallback
In scenarios where the LHS yields few rows and the RHS can be optimized using a SingleUnique index, the engine should dynamically decide to revert to row-by-row processing. This would occur at runtime, ensuring that the most efficient execution strategy is chosen based on the data characteristics.
Considerations
VALUES in this context needs to be validated across different MySQL versions and configurations used in Vitess.
Conclusion
This could give us a nice performance boost in a lot of situations, and it does not sound too hard to implement. I think it could be worth spending some time on.