
[FEATURE]: Comparison of aggregate metrics in source vs target for specific columns #503

Closed
1 task done
ajalisatgi opened this issue Jun 28, 2024 · 0 comments · Fixed by #740
Labels
feat/recon making sure that remorphed query produces the same results as original

ajalisatgi commented Jun 28, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Category of feature request

Reconcile

Problem statement

Remorph's recon module cannot be configured to compare aggregate summary metrics for specific columns based on user input.

Proposed Solution

Give users the ability to define configuration-driven options that compare summary metrics for specific columns in the source vs. the target.

Additional Context

Example:
Comparison between a Snowflake table (snftable.T1) and a Databricks table (dltable.T1):
query1 --> SELECT SUM(colA), SUM(colB) FROM snftable.T1 GROUP BY colC
query2 --> SELECT SUM(colA), SUM(colB) FROM dltable.T1 GROUP BY colC
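A configuration-driven setup for the pair of queries above could be sketched as follows. All names here (`AggregateRule`, `build_aggregate_query`) are hypothetical illustrations of the request, not Remorph's actual API:

```python
from dataclasses import dataclass


@dataclass
class AggregateRule:
    """Hypothetical user-supplied rule: which aggregate to apply,
    to which columns, grouped by which key."""
    agg: str        # e.g. "sum", "min", "max", "count"
    columns: list   # columns to aggregate
    group_by: list  # grouping columns


def build_aggregate_query(table: str, rule: AggregateRule) -> str:
    """Render one side (source or target) of the comparison query."""
    select_list = ", ".join(
        f"{rule.agg}({c}) AS {rule.agg}_{c}" for c in rule.columns
    )
    group_list = ", ".join(rule.group_by)
    return f"SELECT {group_list}, {select_list} FROM {table} GROUP BY {group_list}"


rule = AggregateRule(agg="sum", columns=["colA", "colB"], group_by=["colC"])
src_sql = build_aggregate_query("snftable.T1", rule)
tgt_sql = build_aggregate_query("dltable.T1", rule)
```

The same rule drives both sides, so source and target are guaranteed to compute the same metrics over the same grouping.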

@ajalisatgi ajalisatgi added the enhancement New feature or request label Jun 28, 2024
@vijaypavann-db vijaypavann-db self-assigned this Jul 8, 2024
@nfx nfx added feat/recon making sure that remorphed query produces the same results as original and removed enhancement New feature or request labels Jul 19, 2024
@nfx nfx closed this as completed in #740 Aug 2, 2024
@nfx nfx closed this as completed in de431bc Aug 2, 2024
nfx added a commit that referenced this issue Aug 2, 2024
* Aggregate Queries Reconciliation ([#740](#740)). This release introduces several changes to enhance the functionality of the project, including the implementation of Aggregate Queries Reconciliation, addressing issue [#503](#503). A new property, `aggregates`, has been added to the base class of the query builder module to support aggregate queries reconciliation. A `generate_final_reconcile_aggregate_output` function has been added to generate the final reconcile output for aggregate queries. A new SQL file creates a table called `aggregate_details` to store details about aggregate reconciles, and a new column, `operation_name`, has been added to the `main` table in the `installation` reconciliation query. Additionally, new classes and methods have been introduced for handling aggregate queries and their reconciliation, and new SQL tables and columns have been created for storing and managing rules for aggregating data in the context of query reconciliation. Unit tests have been added to ensure the proper functioning of aggregate queries reconciliation and reconcile aggregate data in the context of missing records.
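The reconcile step this entry describes, matching per-group aggregates and flagging missing records, can be sketched roughly like this. The function name and result shape are illustrative, not the actual `aggregate_details` schema:

```python
def reconcile_aggregates(source_rows: dict, target_rows: dict) -> dict:
    """Compare {group_key: aggregate_value} maps from source and target.

    Returns mismatched values plus groups present on only one side.
    """
    shared = source_rows.keys() & target_rows.keys()
    mismatched = {
        k: (source_rows[k], target_rows[k])
        for k in shared
        if source_rows[k] != target_rows[k]
    }
    return {
        "mismatched": mismatched,
        "missing_in_target": sorted(source_rows.keys() - target_rows.keys()),
        "missing_in_source": sorted(target_rows.keys() - source_rows.keys()),
    }


source = {"c1": 100, "c2": 250, "c3": 75}   # e.g. SUM(colA) per colC, source side
target = {"c1": 100, "c2": 240, "c4": 10}   # the same aggregate, target side
report = reconcile_aggregates(source, target)
```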
* Generate GROUP BY / PIVOT ([#747](#747)). The LogicalPlanGenerator class in the remorph library has been updated to support generating GROUP BY and PIVOT clauses for SQL queries. A new private method, "aggregate", has been added to handle two types of aggregates: GroupBy and Pivot. For GroupBy, it generates a GROUP BY clause with specified grouping expressions. For Pivot, it generates a PIVOT clause where the specified column is used as the pivot column and the specified values are used as the pivot values, compatible with Spark SQL. If the aggregate type is unsupported, a TranspileException is thrown. Additionally, new test cases have been introduced for the LogicalPlanGenerator class in the com.databricks.labs.remorph.generators.sql package to support testing the transpilation of Aggregate expressions with GROUP BY and PIVOT clauses, ensuring proper handling and transpilation of these expressions.
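As a rough illustration of the two aggregate shapes the generator distinguishes (a Python sketch, not the Scala `LogicalPlanGenerator` API; the function name is invented):

```python
def generate_aggregate(kind, exprs, pivot_column=None, pivot_values=None):
    """Render either a GROUP BY clause or a Spark-style PIVOT clause."""
    if kind == "group_by":
        # GroupBy: emit the grouping expressions directly.
        return "GROUP BY " + ", ".join(exprs)
    if kind == "pivot":
        # Pivot: aggregate expressions, pivot column, and pivot values.
        values = ", ".join(pivot_values)
        return f"PIVOT ({', '.join(exprs)} FOR {pivot_column} IN ({values}))"
    # Mirrors the TranspileException thrown for unsupported aggregate types.
    raise ValueError(f"unsupported aggregate type: {kind}")
```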
* Implement error strategy for Snowflake parsing and use error strategy for all parser instances ([#760](#760)). In this release, we have developed an error strategy specifically for Snowflake parsing that translates raw token names and parser rules into more user-friendly SQL error messages. This strategy is applied consistently across all parser instances, ensuring a unified error handling experience. Additionally, we have refined the DBL_DOLLAR rule in the SnowflakeLexer grammar to handle escaped dollar signs correctly. These updates improve the accuracy and readability of error messages for SQL authors, regardless of the parsing tool or transpiler used. Furthermore, we have updated the TSQL parsing error strategy to match the new Snowflake error strategy implementation, providing a consistent error handling experience across dialects.
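The kind of translation described, raw token names into readable SQL terms, might look like the following sketch. The mapping and function are invented for illustration and do not reflect the actual error-strategy implementation:

```python
# Hypothetical table translating raw lexer token names into SQL-author-friendly terms.
FRIENDLY_NAMES = {
    "ID": "an identifier",
    "STRING": "a string literal",
    "LPAREN": "'('",
    "SEMI": "';'",
}


def friendly_error(offending: str, expected_tokens: list) -> str:
    """Build a readable parse-error message from raw token names."""
    expected = " or ".join(FRIENDLY_NAMES.get(t, t) for t in expected_tokens)
    return f"unexpected '{offending}'; expected {expected}"
```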
* Incremental improvement to error messages - article selection ([#711](#711)). In this release, we have implemented an incremental improvement to the error messages generated during T-SQL code parsing. This change introduces a new private method, `articleFor`, which determines whether to use `a` or `an` in the generated messages based on the first letter of the following word. The `generateMessage` method has been updated to use this new method when constructing the initial error message and subsequent messages when there are multiple expected tokens. This improvement ensures consistent use of articles `a` or `an` in the error messages, enhancing their readability for software engineers working with T-SQL code.
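The article-selection logic described is simple enough to sketch directly (a Python rendering of the idea behind the Scala `articleFor`, not the actual method):

```python
def article_for(word: str) -> str:
    """Choose 'a' or 'an' based on the first letter of the following word."""
    return "an" if word[:1].lower() in "aeiou" else "a"


def expected_message(tokens: list) -> str:
    """List expected tokens, each with a grammatically correct article."""
    return ", ".join(f"{article_for(t)} {t}" for t in tokens)
```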
* TSQL: Adds tests and support for SELECT OPTION(...) generation ([#755](#755)). In this release, we have added support for generating code for the TSQL `SELECT ... OPTION(...)` clause in the codebase. This new feature includes the ability to transpile any query hints supplied with a SELECT statement as comments in the output code, allowing for easier assessment of query performance after transpilation. The OPTION clause is now generated as comments, including MAXRECURSION, string options, boolean options, and auto options. Additionally, we have added new tests and updated the TSqlAstBuilderSpec test class with new and updated test cases to cover the new functionality. The implementation is focused on generating code for the OPTION clause, and does not affect the actual execution of the query. The changes are limited to the ExpressionGenerator class and its associated methods, and the TSqlRelationBuilder class, without affecting other parts of the codebase.
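Emitting the hints as comments, so they survive transpilation without affecting execution, can be sketched like this (the function is illustrative, not the `ExpressionGenerator` API):

```python
def transpile_option_hints(select_sql: str, hints: list) -> str:
    """Append TSQL OPTION(...) hints as trailing SQL comments.

    The hints are preserved for later performance assessment but have
    no effect on how the transpiled query executes.
    """
    if not hints:
        return select_sql
    comments = "\n".join(f"-- OPTION: {h}" for h in hints)
    return f"{select_sql}\n{comments}"
```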
* TSQL: IR implementation of MERGE ([#719](#719)). The open-source library has been updated to include a complete implementation of the TSQL MERGE statement's IR (Intermediate Representation), bringing it in line with Spark SQL. The `LogicalPlanGenerator` class now includes a `generateMerge` method, which generates the SQL code for the MERGE statement, taking a `MergeIntoTable` object containing the target and source tables, merge condition, and merge actions as input. The `MergeIntoTable` class has been added as a case class to represent the logical plan of the MERGE INTO command and extends the `Modification` trait. The `LogicalPlanGenerator` class also includes a new `generateWithOptions` method, which generates SQL code for the WITH OPTIONS clause, taking a `WithOptions` object containing the input and options as children. Additionally, the `TSqlRelationBuilder` class has been updated to handle the MERGE statement's parsing, introducing new methods and updating existing ones, such as `visitMerge`. The `TSqlToDatabricksTranspiler` class has been updated to include support for the TSQL MERGE statement, and the `ExpressionGenerator` class has new tests for options, columns, and arithmetic expressions. A new optimization rule, `TrapInsertDefaultsAction`, has been added to handle the behavior of the DEFAULT keyword during INSERT statements. The commit also includes test cases for the `MergeIntoTable` logical operator and the T-SQL merge statement in the `TSqlAstBuilderSpec`.
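The shape of the generated statement, target, source, condition, and matched/not-matched actions, can be sketched as follows (a simplified illustration of what `generateMerge` produces, not its actual signature):

```python
def generate_merge(target: str, source: str, condition: str,
                   matched_action: str, not_matched_action: str) -> str:
    """Render a Spark-SQL-style MERGE statement from its logical parts."""
    return (
        f"MERGE INTO {target} USING {source} ON {condition} "
        f"WHEN MATCHED THEN {matched_action} "
        f"WHEN NOT MATCHED THEN {not_matched_action}"
    )


sql = generate_merge(
    "t", "s", "t.id = s.id",
    "UPDATE SET t.v = s.v",
    "INSERT (id, v) VALUES (s.id, s.v)",
)
```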
@nfx nfx mentioned this issue Aug 2, 2024