Fix serialization of MultipleErrors #1177
Merged
Conversation
No description provided.
nfx approved these changes on Nov 8, 2024:
lgtm
sundarshankar89 added a commit that referenced this pull request on Dec 2, 2024:
* Added support for the `format_datetime` function in Presto to Databricks ([#1250](#1250)). A new `format_datetime` function has been added to the `Parser` class in `presto.py` to support formatting datetime values in Presto on Databricks. It uses the `DateFormat.from_arg_list` method from the `local_expression` module to format datetime values according to a specified format string. To ensure compatibility between Presto and Databricks, a new test file, `test_format_datetime_1.sql`, demonstrates the `format_datetime` function in Presto and its Databricks equivalent, `DATE_FORMAT` (see the first sketch after this list). This standalone change adds new functionality without modifying existing code.
* Added support for Snowflake `SUBSTR` ([#1238](#1238)). This commit enhances the library's Snowflake support by adding the `SUBSTR` function, which was previously unsupported and existed only as an alternative to `SUBSTRING`. The project now fully supports both functions, and `SUBSTRING` can be used interchangeably with `SUBSTR` via the new `withConversionStrategy(SynonymOf("SUBSTR"))` method. This commit supersedes a previous pull request that lacked a GPG signature and includes a test for `SUBSTR`. The `ARRAY_SLICE` function has also been updated to match Snowflake's behavior, and the project now supports a more comprehensive list of SQL functions with their corresponding arity.
* Added support for the `json_size` function in Presto ([#1236](#1236)). A new `json_size` function for Presto determines the size of a JSON object or array and returns an integer. Two new methods, `_build_json_size` and `get_json_object`, handle JSON objects and arrays differently, and the Parser and Tokenizer classes of the Presto dialect have been updated to include the new function. An alternative implementation for Databricks using SQL functions is provided, and a test case covers a fixed `is not null` error for `json_extract` in the Databricks generator. A new Presto test file exercises `json_extract`, and a new method, `GetJsonObject`, extracts a JSON object from a given path. The `json_extract` function has also been updated to extract the value associated with a specified key from JSON data in both Presto and Databricks.
* Enclosed subqueries in parentheses ([#1232](#1232)). This PR changes the ExpressionGenerator and LogicalPlanGenerator classes so that subqueries are correctly enclosed in parentheses during code generation. Previously, subqueries were not always parenthesized, producing incorrect code such as `SELECT * FROM SELECT * FROM t WHERE a > 'a' WHERE a > 'b'`. The issue is addressed by enclosing subqueries in parentheses in the `in` and `scalarSubquery` methods and by adding new match cases for `ir.Filter` in the `LogicalPlanGenerator` class, while taking care not to double the parentheses in the `.. IN (SELECT ...)` pattern (see the subquery sketch after this list). No new methods were added; existing functionality was modified so that the generated SQL is correct. Test cases are included in a separate PR.
* Fixed serialization of MultipleErrors ([#1177](#1177)). The encoding of errors in the `com.databricks.labs.remorph.coverage` package has been improved with an update to `encoders.scala`. `MultipleErrors` instances are now serialized by calling `asJson` on each contained error instead of encoding only its message, so all relevant information about each error is included in the encoded output. Users who handle multiple errors and need a precise serialized representation benefit from this change.
* Fixed Presto `strpos` and `array_average` functions ([#1196](#1196)). This PR introduces new classes, `Locate` and `NamedStruct`, in `local_expression.py` to handle the `STRPOS` and `ARRAY_AVERAGE` functions in a Databricks environment, ensuring compatibility with Presto SQL. The `STRPOS` function, which locates the position of a substring within a string, now uses the `Locate` class and emits a warning about implementation differences between Presto and Databricks SQL. A new method, `_build_array_average`, handles the `ARRAY_AVERAGE` function in Databricks, which computes the average of an array while accommodating nulls, integers, and doubles (see the sketch after this list). Two SQL test cases demonstrate `ARRAY_AVERAGE` with arrays of integers and doubles. These changes promote consistent behavior between Presto and Databricks for `STRPOS` and `ARRAY_AVERAGE`, easing migration between the systems.
* Handled Presto UNNEST cross join to Databricks lateral view ([#1209](#1209)). This release introduces handling of Presto UNNEST cross joins in Databricks using the lateral view feature (see the sketch after this list). New methods improve efficiency and robustness when handling UNNEST cross joins, and new test cases for Presto and Databricks cover UNNEST cross joins, array construction and flattening, and JSON parsing. Some limitations and issues remain and will be addressed in future work; the acceptance tests have been updated accordingly, with certain tests now expected to pass while others may still fail.
* Implemented remaining TSQL set operations ([#1227](#1227)). This pull request enhances the TSql parser by parsing and converting the set operations `UNION [ALL]`, `EXCEPT`, and `INTERSECT` to the Intermediate Representation (IR). The grammar already recognized these operations, but they were not being converted to the IR. This change resolves issues [#1126](#1126) and [#1102](#1102) and includes new unit, transpiler, and functional tests covering these set operations, including precedence rules (see the set-operations sketch after this list). The commit also introduces a new test file, `union-all.sql`, demonstrating correct handling of simple `UNION ALL` operations and consistent output across TSQL and Databricks SQL.
* Supported multiple columns in the ORDER BY clause for ARRAYAGG ([#1228](#1228)). This commit enhances the ARRAYAGG and LISTAGG functions by supporting multiple columns in the ORDER BY clause and sorting in both ascending and descending order. A new method, `sortArray`, handles multiple sort orders. The changes also improve the ARRAYAGG function in the Snowflake dialect by supporting multiple ORDER BY columns, each with an optional DESC keyword (see the sketch after this list). The `WithinGroupParams` dataclass in the local expression module now carries a list of tuples of order columns and their sort direction. These changes provide more flexibility and control over the output of the ARRAYAGG and LISTAGG functions.
* Added TSQL parser support for `(LHS) UNION RHS` queries ([#1211](#1211)). The TSQL parser now supports unions written as `(SELECT a from b) UNION [ALL] SELECT x from y`, allowing the union of two SELECT queries with an optional ALL keyword to include duplicate rows. A new case statement in the `TSqlRelationBuilder` class handles this form, creating a `SetOperation` object from the left-hand and right-hand sides with an `is_all` flag based on the presence of ALL. Support for parsing right-associative UNION clauses has also been added, and new test cases verify the correct translation of TSQL set operations to Databricks SQL (covered in the set-operations sketch after this list), resolving issue [#1127](#1127).
* Added support for inline columns in CTEs ([#1184](#1184)). Inline columns in Common Table Expressions (CTEs) are now supported for Snowflake across several components. This includes updates to the AST (Abstract Syntax Tree) for better TSQL translation and a new case class, `KnownInterval`, for handling intervals. A new method, `DealiasInlineColumnExpressions`, in the `SnowflakePlanParser` class parses inline columns in CTEs, and the class constructor has been modified to include it (see the CTE sketch after this list). A new private case class, `InlineColumnExpression`, allows more efficient processing of Snowflake CTEs, and the SnowflakeToDatabricksTranspiler has been updated to support inline columns in CTEs, as demonstrated by a new test case.
* Implemented AST for positional column identifiers ([#1181](#1181)). This change introduces an AST for positional column identifiers in the Snowflake dialect, specifically in the `ExpressionGenerator` class. The new `NameOrPosition` type represents a column identifier either by name or by position; the `Id` and `Position` classes inherit from `NameOrPosition`, and the `nameOrPosition` method returns the appropriate SQL representation. Because Databricks does not support positional column identifiers, the generator side does not yet support this feature: the table schema is required to properly translate queries that use positional identifiers. This increases the system's flexibility in handling Snowflake query structures, with more comprehensive generator-side support possible in the future.
* Implemented GROUP BY ALL ([#1180](#1180)). The `GROUP BY ALL` clause is now supported in the LogicalPlanGenerator class, with a new case handling the GroupByAll type and an updated implementation for the Pivot type. A new case object, `GroupByAll`, has been added to the sealed trait `GroupType` in `relations.scala`, and a new test case in SnowflakeToDatabricksTranspilerTest checks the correct transpilation of `GROUP BY ALL` from Snowflake SQL syntax to Databricks SQL syntax (see the sketch after this list).

Dependency updates:

* Bump codecov/codecov-action from 4 to 5 ([#1210](#1210)).
* Bump sqlglot from 25.30.0 to 25.32.1 ([#1254](#1254)).
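The sketches below illustrate several of the items above. First, a minimal before/after sketch of the `format_datetime` translation; the table and column names are invented for illustration, and the transpiler's exact output may differ. Presto's `format_datetime` takes Joda-Time patterns while Databricks' `DATE_FORMAT` takes Spark datetime patterns; simple patterns such as `yyyy-MM-dd` coincide in both.

```sql
-- Presto source (illustrative names)
SELECT format_datetime(order_ts, 'yyyy-MM-dd') AS order_day FROM orders;

-- Databricks target produced via DATE_FORMAT
SELECT DATE_FORMAT(order_ts, 'yyyy-MM-dd') AS order_day FROM orders;
```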
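Subquery sketch: the shape of the parenthesization fix, restated as plain SQL. The comment shows the invalid nesting that could previously be emitted; the queries show the corrected form and the `IN (SELECT ...)` pattern that must not gain a second pair of parentheses.

```sql
-- Previously, nesting could produce invalid SQL along the lines of:
--   SELECT * FROM SELECT * FROM t WHERE a > 'a' WHERE a > 'b'
-- With the fix, the inner query is enclosed:
SELECT * FROM (SELECT * FROM t WHERE a > 'a') WHERE a > 'b';

-- The IN pattern keeps a single pair of parentheses:
SELECT * FROM t WHERE a IN (SELECT a FROM s);
```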
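The `STRPOS` and `ARRAY_AVERAGE` items, sketched as SQL. The `LOCATE` mapping reflects Databricks' substring-first argument order (one of the implementation differences the warning refers to). The `ARRAY_AVERAGE` rewrite shown is one possible Databricks formulation using higher-order functions, not necessarily the transpiler's exact output.

```sql
-- Presto: 1-based position of a substring (0 if absent)
SELECT strpos('hello world', 'world');
-- Databricks: LOCATE takes the substring first
SELECT LOCATE('world', 'hello world');

-- Presto: average of array elements, ignoring NULLs
SELECT array_average(ARRAY[1, 2, NULL, 3]);
-- One possible Databricks rewrite with higher-order functions
SELECT AGGREGATE(
         FILTER(ARRAY(1, 2, NULL, 3), x -> x IS NOT NULL),
         CAST(0 AS DOUBLE),
         (acc, x) -> acc + x,
         acc -> acc / SIZE(FILTER(ARRAY(1, 2, NULL, 3), x -> x IS NOT NULL)));
```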
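A sketch of the UNNEST cross join mapping; the table `t` and its array column `items` are invented for illustration.

```sql
-- Presto: UNNEST in a cross join
SELECT t.id, u.item
FROM t CROSS JOIN UNNEST(t.items) AS u(item);

-- Databricks: the same shape via LATERAL VIEW EXPLODE
SELECT t.id, u.item
FROM t LATERAL VIEW EXPLODE(t.items) u AS item;
```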
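Set-operations sketch covering both TSQL items: example inputs of the kind the parser now converts to the IR, with the standard T-SQL precedence rule noted inline. Table and column names are illustrative.

```sql
-- The (LHS) UNION RHS form now parses:
(SELECT a FROM b) UNION ALL SELECT x FROM y;

-- Mixed set operations follow standard T-SQL precedence:
-- INTERSECT binds tighter than UNION and EXCEPT.
SELECT a FROM t1
UNION
SELECT a FROM t2
INTERSECT
SELECT a FROM t3;
```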
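A sketch of multi-column ORDER BY support for ARRAYAGG. The Snowflake query is the kind of input now handled; the Databricks query shows one way to express a mixed-direction sort by collecting structs and sorting with a comparator. Column and table names are invented, and the transpiler's actual `sortArray`-based output may differ.

```sql
-- Snowflake: ARRAY_AGG with a multi-column, mixed-direction ORDER BY
SELECT ARRAY_AGG(name) WITHIN GROUP (ORDER BY dept ASC, salary DESC)
FROM employees;

-- One possible Databricks formulation: sort collected structs with a
-- comparator, then project the aggregated column back out
SELECT TRANSFORM(
         ARRAY_SORT(
           COLLECT_LIST(STRUCT(dept, salary, name)),
           (l, r) -> CASE
                       WHEN l.dept   < r.dept   THEN -1
                       WHEN l.dept   > r.dept   THEN  1
                       WHEN l.salary > r.salary THEN -1  -- DESC on salary
                       WHEN l.salary < r.salary THEN  1
                       ELSE 0
                     END),
         s -> s.name) AS names
FROM employees;
```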
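CTE sketch: what dealiasing inline columns means conceptually. The second query shows the equivalent CTE after the inline column list is pushed into the body as aliases; the actual rewrite performed by `DealiasInlineColumnExpressions` may differ in detail.

```sql
-- Snowflake: CTE with an inline column list
WITH cte(a, b) AS (SELECT 1, 2)
SELECT a, b FROM cte;

-- Conceptually dealiased equivalent
WITH cte AS (SELECT 1 AS a, 2 AS b)
SELECT a, b FROM cte;
```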
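A sketch of the `GROUP BY ALL` transpilation; `sales`, `region`, `product`, and `amount` are illustrative. Databricks SQL also accepts `GROUP BY ALL`, so the clause can carry over directly; the explicit equivalent is shown for comparison.

```sql
-- Snowflake: group by every non-aggregated SELECT expression
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY ALL;

-- Explicit equivalent
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY region, product;
```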
gueniai pushed a commit that referenced this pull request on Dec 2, 2024.
sundarshankar89 added a commit to sundarshankar89/remorph that referenced this pull request on Jan 2, 2025.
sundarshankar89 added a commit to sundarshankar89/remorph that referenced this pull request on Jan 3, 2025.