Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

Closed
jlefevre opened this issue Mar 3, 2019 · 1 comment · Fixed by #56
Closed

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

jlefevre opened this issue Mar 3, 2019 · 1 comment · Fixed by #56
Assignees
Labels
enhancement New feature or request

Comments

@jlefevre
Copy link
Member

jlefevre commented Mar 3, 2019

This project will develop object-class methods methods to sort/group query result sets. This requires extending the current code (select/project/basic-aggregations - min/max/sum/count) to support groupby and/or orderby.

Skyhook flatbuffer data format is specified here.

Current processing includes select, project, simple aggregates min/max/count/sum. This is a row-oriented layout that uses Google Flatbuffers. Each row (or agg across all rows) is processed in a loop here.

For each row, first all of the predicates are checked for each column (data schema), and if they all pass, the projected columns (query_schema) are encoded into the return flatbuffer based upon their data type here.

Currently, simple aggregates are being accumulated during the applyPredicates function into a typed predicate object over all rows that pass (nothing is encoded into the return flatbuffer until the very end) here.

SQL GROUP BY statement re-organizes the rows returned by the query as indicated here / here.

This project requires implementing SQL GROUP BY by grouping the rows of the return flatbuffer by similar column values (such as group by the same ShippingDate). One straightforward way to do this is for each unique value of the grouping column, create a hashmap keyed on that value and store the associated row numbers in a list as that key's value. After all rows have been evaluated, iterate over the map and just encode the rows of each group using the existing encoding switch statement.

This project may require refactoring the encoding switch into a separate function so it can be reused by the existing code and the new groupby code. Testing can be done using our TPCH Lineitem table data, schema specified here.

A very similar method will also need implemented for our column-oriented processing function here, that uses Apache Arrow format.

Once the groupby functionality as described above is correct, the approach should be further optimized as appropriate to minimize memory footprint and/or processing complexity.

@jlefevre
Copy link
Member Author

jlefevre commented Mar 3, 2019

An extension of this is to sort the groups returned (SQL ORDER BY) and computing the current simple aggregates (min/max/sum/count) along with the GROUP BY statement. The simple aggregates are already computed here during applyPredicates, and one could use this existing agg computation per row, and then use the existing agg encoding to encode the simple agg of each group once you are ready to encode the groups into the return flatbuffer.

Sorting could preferably be done in a more general way so that it could used to sort the groups or all of the rows in the return flatbuffer.

@ivotron ivotron transferred this issue from uccross/skyhookdm-ceph May 31, 2020
@jlefevre jlefevre added the enhancement New feature or request label Jul 14, 2020
@aditigupta17 aditigupta17 linked a pull request Jul 15, 2020 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants