Add SQL GROUP BY to Skyhook Flatbuffer processing #26

jlefevre · 2019-03-03T18:54:06Z

This project will develop object-class methods methods to sort/group query result sets. This requires extending the current code (select/project/basic-aggregations - min/max/sum/count) to support groupby and/or orderby.

Skyhook flatbuffer data format is specified here.

Current processing includes select, project, simple aggregates min/max/count/sum. This is a row-oriented layout that uses Google Flatbuffers. Each row (or agg across all rows) is processed in a loop here.

For each row, first all of the predicates are checked for each column (data schema), and if they all pass, the projected columns (query_schema) are encoded into the return flatbuffer based upon their data type here.

Currently, simple aggregates are being accumulated during the applyPredicates function into a typed predicate object over all rows that pass (nothing is encoded into the return flatbuffer until the very end) here.

SQL GROUP BY statement re-organizes the rows returned by the query as indicated here / here.

This project requires implementing SQL GROUP BY by grouping the rows of the return flatbuffer by similar column values (such as group by the same ShippingDate). One straightforward way to do this is for each unique value of the grouping column, create a hashmap keyed on that value and store the associated row numbers in a list as that key's value. After all rows have been evaluated, iterate over the map and just encode the rows of each group using the existing encoding switch statement.

This project may require refactoring the encoding switch into a separate function so it can be reused by the existing code and the new groupby code. Testing can be done using our TPCH Lineitem table data, schema specified here.

A very similar method will also need implemented for our column-oriented processing function here, that uses Apache Arrow format.

Once the groupby functionality as described above is correct, the approach should be further optimized as appropriate to minimize memory footprint and/or processing complexity.

jlefevre · 2019-03-03T19:27:13Z

An extension of this is to sort the groups returned (SQL ORDER BY) and computing the current simple aggregates (min/max/sum/count) along with the GROUP BY statement. The simple aggregates are already computed here during applyPredicates, and one could use this existing agg computation per row, and then use the existing agg encoding to encode the simple agg of each group once you are ready to encode the groups into the return flatbuffer.

Sorting could preferably be done in a more general way so that it could used to sort the groups or all of the rows in the return flatbuffer.

ivotron transferred this issue from uccross/skyhookdm-ceph May 31, 2020

jlefevre assigned aditigupta17 Jul 14, 2020

jlefevre added the enhancement New feature or request label Jul 14, 2020

aditigupta17 linked a pull request Jul 15, 2020 that will close this issue

Add query validation for projection and aggregation #56

Merged

4 tasks

jlefevre closed this as completed in #56 Aug 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

jlefevre commented Mar 3, 2019

jlefevre commented Mar 3, 2019 •

edited

Loading

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

Comments

jlefevre commented Mar 3, 2019

jlefevre commented Mar 3, 2019 • edited Loading

jlefevre commented Mar 3, 2019 •

edited

Loading