
Implement remaining TSQL set operations. #1227

Merged - 19 commits from feature/tsql-set-ops into main, Nov 28, 2024
Conversation

@asnare (Contributor) commented Nov 21, 2024

Changes

This PR implements support for parsing set operations with TSql: UNION [ALL], EXCEPT and INTERSECT.

The grammar previously supported these but they were not being converted to the IR.
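As a rough illustration of what "converted to the IR" means here, the sketch below folds a parsed chain of set-operation clauses into nested IR nodes. All types are toy stand-ins: names like SetOperation and IntersectSetOp echo identifiers that appear later in this thread, but the exact shapes are assumptions rather than the PR's actual code.

// Toy sketch only - not the project's actual IR or parser types.
sealed trait SetOpKind
case object UnionSetOp extends SetOpKind
case object ExceptSetOp extends SetOpKind
case object IntersectSetOp extends SetOpKind

sealed trait Plan
case class Relation(name: String) extends Plan
case class SetOperation(lhs: Plan, rhs: Plan, kind: SetOpKind, isAll: Boolean) extends Plan

// Stand-in for a parsed sqlUnion clause (the real code works with TSqlParser.SqlUnionContext).
case class Clause(kind: SetOpKind, isAll: Boolean, rhs: Plan)

// Left-to-right fold: "a UNION ALL b EXCEPT c" becomes
// SetOperation(SetOperation(a, b, UnionSetOp, isAll = true), c, ExceptSetOp, isAll = false).
// Note that a plain left fold ignores INTERSECT's higher precedence, which is what the
// buildIntersectOperations discussion further down addresses.
def buildSetOperations(first: Plan, clauses: Seq[Clause]): Plan =
  clauses.foldLeft(first) { (lhs, c) => SetOperation(lhs, c.rhs, c.kind, c.isAll) }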

Linked issues

Resolves #1126.
Resolves #1102.

Tests

  • added unit tests
  • added transpiler tests
  • added functional tests

@asnare added labels: enhancement, feat/ir, tech debt, sql/tsql - Nov 21, 2024
@asnare self-assigned this Nov 21, 2024
github-actions bot commented Nov 21, 2024

Coverage test results

464 tests (+5): 427 ✅ (+5), 37 💤 (±0), 0 ❌ (±0), in 4s ⏱️ (±0s)
6 suites (±0), 6 files (±0)

Results for commit c5ed175. ± Comparison against base commit a08cd61.


@asnare marked this pull request as ready for review November 27, 2024 12:22
@asnare requested a review from a team as a code owner November 27, 2024 12:22
@asnare requested a review from vil1 November 27, 2024 12:23
@jimidle (Contributor) left a comment


LGTM - other than maybe moving where the TODO is so we just get one created by the bot and it points to the explanatory text? But that is not a blocker for approval.

@@ -2794,6 +2794,7 @@ queryExpression
;

sqlUnion
// TODO: Handle INTERSECT precedence in the grammar; it has higher precedence than EXCEPT and UNION ALL.
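For readers outside this thread: the precedence rule the TODO refers to is that in TSQL (and ANSI SQL generally), INTERSECT binds tighter than EXCEPT and UNION [ALL]. A toy illustration of the required grouping (illustrative types only, not the project's IR):

sealed trait Rel
case class Table(name: String) extends Rel
case class Union(l: Rel, r: Rel) extends Rel
case class Intersect(l: Rel, r: Rel) extends Rel

// "a UNION b INTERSECT c" must group as a UNION (b INTERSECT c):
val correct = Union(Table("a"), Intersect(Table("b"), Table("c")))
// ...not as (a UNION b) INTERSECT c, which a naive left-to-right rule would produce:
val wrong = Intersect(Union(Table("a"), Table("b")), Table("c"))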
Contributor:

I will look into this - the TODO should have raised an issue?

Contributor (Author):

It turns out that the TODO detection was a one-off thing. I've created issue #1255 to cover this.

I've also updated the comment with a reference to the relevant TSql documentation.

@@ -2802,6 +2803,7 @@ querySpecification
;

selectOptionalClauses
// TODO: Fix ORDER BY; it needs to be outside the set operations instead of between.
Contributor:

And also this: sometimes the original author implemented things just by reading the syntax diagram from MS, which does not always deal with cases like this.

Contributor (Author):

Yup, understood: I missed it as well. I've created issue #1256 to cover this, and added a comment with a reference to some documentation where this is mentioned.
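To spell the ORDER BY issue out for future readers, a toy illustration (again with illustrative types only): in "SELECT a FROM t1 UNION SELECT a FROM t2 ORDER BY a", the ORDER BY applies to the whole union result, so it must sit outside the set operation.

sealed trait Rel
case class Query(table: String) extends Rel
case class Union(l: Rel, r: Rel) extends Rel
case class Sort(rel: Rel, by: String) extends Rel

// What the query should produce:
val correct = Sort(Union(Query("t1"), Query("t2")), by = "a")
// What a grammar that keeps ORDER BY inside selectOptionalClauses (i.e. between
// the set operands) would produce instead:
val wrong = Union(Query("t1"), Sort(Query("t2"), by = "a"))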

private[this] def buildIntersectOperations(
    lhs: ir.LogicalPlan,
    remainingContexts: Seq[TSqlParser.SqlUnionContext]): ir.LogicalPlan = {
  /*
Contributor:

Maybe this should be where the TODO goes, as it has a better explanation?

}
}
}

@tailrec
private[this] def buildIntersectOperations(
Contributor:

I wonder if we could simplify that a bit by first going through the seq of contexts and splitting it at each instance of INTERSECT, producing a Seq[Seq[SqlUnionContext]] that we'd then traverse again to produce the actual SetOperation(..., IntersectSetOp).

Something along the lines of:

// Sketch as suggested (with the agg parameter threaded through the helper calls):
def appendNext(agg: Seq[Seq[SqlUnionContext]], item: SqlUnionContext) = Seq(item) +: agg
def appendFirst(agg: Seq[Seq[SqlUnionContext]], item: SqlUnionContext) =
  if (agg.isEmpty) Seq(Seq(item))
  else (item +: agg.head) +: agg.tail

remainingContexts
  .foldLeft(Seq.empty[Seq[SqlUnionContext]]) {
    case (agg, item) if item.INTERSECT() == null => appendFirst(agg, item)
    case (agg, item) => appendNext(agg, item)
  }
  .map { chunk =>
    // here you get seqs of SqlUnionContext that aren't INTERSECT, in reverse order,
    // so you may foldRight/reduceRight
    ???
  }
  .reduceRight { (lhs, rhs) =>
    // here you build your SetOperation(..., IntersectSetOp, ...)
    ???
  }

@vil1 (Contributor) commented Nov 28, 2024:

Having slept on it, maybe the existing @tailrec approach would still be more ... approachable. Your call.

Contributor (Author):

So I agree with you and my first draft tried to do exactly what you suggest.

  1. Implement .spans(p: T => Boolean): Seq[Seq[T]] to divide up the INTERSECT chunks.
  2. Use map/reduce in the way you suggest.

However, I got caught up a bit trying to implement .spans() as elegantly as this (I tried using builders, like Scala's collection internals) and decided that:

  • It wasn't worth the effort, given that this will shortly be refactored once the grammar is updated to handle the precedence rules natively.
  • It wasn't clearly easier to follow, although this was more a reflection on what I wrote than on the problem.

Given that I think we've agreed to address the precedence of INTERSECT in the grammar, I'd rather not spend much time on this because it's going to disappear shortly anyway.

@vil1: Does this sound okay?
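For the record, a minimal sketch of the .spans helper discussed above (hypothetical, and as noted it was not included in the PR): it opens a new chunk at every element matching the predicate.

// Hypothetical helper, not part of the PR.
def spans[T](xs: Seq[T])(p: T => Boolean): Seq[Seq[T]] =
  xs.foldLeft(Vector.empty[Vector[T]]) {
    case (acc, x) if p(x) || acc.isEmpty => acc :+ Vector(x)
    case (acc, x)                        => acc.init :+ (acc.last :+ x)
  }

// spans(Seq(1, 2, 9, 3, 9, 4))(_ == 9)  ==>  Seq(Seq(1, 2), Seq(9, 3), Seq(9, 4))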

…lained.

It looks like both are probably just reflecting the rules of ANSI SQL.
@asnare enabled auto-merge November 28, 2024 14:32
@asnare added this pull request to the merge queue Nov 28, 2024
Merged via the queue into main with commit a5bbdb6 Nov 28, 2024
8 checks passed
@asnare deleted the feature/tsql-set-ops branch November 28, 2024 14:40
sundarshankar89 added a commit that referenced this pull request Dec 2, 2024
*  Added support for format_datetime function in presto to Databricks ([#1250](#1250)). In this release, we have added support for the `format_datetime` function in Presto to Databricks. This feature provides users with more flexibility and customization when formatting datetime values. A new entry, `"FORMAT_DATETIME": local_expression.DateFormat.from_arg_list,` has been added to the `_FUNCTION_MAP` dictionary in the `Parser` class, causing the `DateFormat.from_arg_list` method to be called when the `FORMAT_DATETIME` token is encountered during parsing. This method creates a `DateFormat` object based on the list of arguments provided to the `format_datetime` function. To ensure the proper functioning of this feature, a new SQL file has been introduced in the `functional/presto/test_format_datetime` directory, containing four test cases demonstrating the usage of the `format_datetime` function with various format strings. These tests cover the current_timestamp, current_date, and the Unix timestamp 1732723200. On Databricks, the equivalent DATE_FORMAT function is used for the same formatting. These changes simplify migration of datetime formatting logic from Presto to Databricks, improving interoperability between the two systems.
* Added support for SnowFlake `SUBSTR` ([#1238](#1238)). This commit adds support for the SnowFlake `SUBSTR` function, which was previously unsupported, addressing a missing feature in the previous version. SnowFlake has both `SUBSTR` and `SUBSTRING` functions, and this change focuses on implementing `SUBSTR` using the `FunctionBuilder` class, a sealed trait used to represent different types of functions. The new case for `SUBSTR` uses the same function definition as `SUBSTRING`, and an existing case for `SUBSTRING` has been updated to include a conversion strategy that allows it to be used as a synonym for `SUBSTR`. This ensures that both functions can be used interchangeably, improving the compatibility of the code with SnowFlake's SQL syntax and enhancing the user experience for those leveraging the SnowFlake platform. The commit also includes a test case demonstrating the usage of the `SUBSTR` function and acknowledges the contribution of Andrew Snare.
* Added support for json_size function in presto ([#1236](#1236)). This revision adds support for the `json_size` function in Presto to calculate the size of a JSON object or array, and provides an alternative implementation for Databricks using SQL functions. The `Is` expression's `this` attribute is now used with the `sql` method for correct evaluation of `NOT Is` expressions. A bug in the Databricks generator with `json_extract` returning an error for non-null values has been fixed, and a new test case has been added for this issue. The `NamedStruct` class has been updated with `arg_types`. These changes improve compatibility and functionality with Presto and Databricks, and enhance JSON functionality of the Presto module. A test case demonstrates the usage of `json_size` and `json_extract` functions with various JSON inputs, including maps, arrays, and strings, to ensure proper error handling in both Presto and Databricks.
* Enclosed subqueries in parenthesis ([#1232](#1232)). The ExpressionGenerator and LogicalPlanGenerator classes have been updated to enclose subqueries in parentheses, addressing an issue where the absence of such enclosure resulted in incorrect SQL code. The `enclosed` method was added to ensure values are enclosed in parentheses if they are not already, and the `in` and `scalarSubquery` methods have been updated to properly enclose values in parentheses. The LogicalPlanGenerator's Filter case has been updated to handle the `.. IN(SELECT...)` pattern and avoid doubling enclosing parentheses. These changes improve the correctness of code generation and increase reliability, ensuring accurate SQL generation and execution by the database engine. Test cases have been added, but the diff does not contain them, indicating they are located elsewhere in the codebase.
* Fixed presto strpos and array_average functions ([#1196](#1196)). This commit introduces new classes Locate and NamedStruct in the Func class of local_expression.py to handle the STRPOS and ARRAY_AVERAGE functions in the context of migrating SQL code from Presto to Databricks. The Locate class is used for the STRPOS function, implemented as LOCATE in Databricks SQL with a different third parameter. A warning is logged to inform users of this difference. The NamedStruct class is used in the ARRAY_AVERAGE function implementation, which calculates the average of an array, handling nulls and supporting both integer and double data types. The Parser and Tokenizer classes for the Presto dialect are updated to include new implementations for STRPOS and ARRAY_AVERAGE. New methods such as _build_str_position and _build_array_average are introduced, and test cases for both functions are updated in the Presto SQL and Databricks SQL dialects. These changes improve the compatibility of Presto SQL and Databricks SQL, enhancing functionality and ensuring a smoother transition for users migrating from Presto SQL.
* Added presto Unnest cross join to Databricks lateral view ([#1209](#1209)). This release introduces several changes to the SnowflakeAcceptanceSuite test suite to improve handling of Presto unnest cross join functionality on Databricks using lateral view. New SQL test files, including "test_skip_unsupported_operations_10.sql", "arrays/test_array_construct_1.sql", "arrays/test_array_construct_2.sql", and "functions/parse_json/test_parse_json_3.sql", have been added to support previously failing tests due to unsupported operations. Additionally, new methods have been implemented to handle the unnest cross join operation when translating Presto code to Databricks lateral view. A new test case, "test_unnest_1.sql", has been added to demonstrate cross join functionality using UNNEST in Presto and its equivalent in Databricks using the LATERAL VIEW EXPLODE function. Another test case has been added to showcase the handling of UNNEST cross join in Presto and its equivalent implementation using the LATERAL VIEW EXPLODE function in Databricks. A new test SQL script has been introduced to verify array construction and flattening functionality in both Snowflake and Databricks dialects. Lastly, a test has been added to compare functionality between Snowflake and Databricks when processing JSON data, highlighting similarities in functionality.
* Implemented remaining TSQL set operations ([#1227](#1227)). This pull request adds support for parsing and converting TSQL set operations, including `UNION [ALL]`, `EXCEPT`, and `INTERSECT`, to the intermediate representation (IR) in our open-source library. Previously, the grammar recognized these set operations, but they were not being processed or converted correctly. This pull request resolves issues [#1126](#1126) and [#1102](#1102) and includes new unit tests, transpiler tests, and functional tests to ensure the correct implementation of these set operations. Two functional tests for `INTERSECT` and `EXCEPT` have been added, demonstrating the correct handling of these operations in TSQL queries. The `UNION` operation has also been updated with new functional tests for simple `UNION` queries. This enhancement improves the library's compatibility and reliability in handling a wider range of TSQL queries involving set operations.
* Added support for multiple columns in the ORDER BY clause for ARRAYAGG ([#1228](#1228)). This commit introduces support for multiple columns in the ORDER BY clause for the ARRAYAGG and LISTAGG functions in Snowflake, allowing for more complex ordering of aggregated arrays. The sortArray and sortingLambda methods have been updated to handle multiple sort orders, and the _get_within_group_params and _create_named_struct_for_cmp methods have been updated accordingly. The ToArray class has been adapted to work with the new order_cols parameter, and the test files have been updated to demonstrate the new functionality. This change enhances the functionality of the local expression module in the databricks/labs/remorph/snow package, providing more flexibility and expressivity in SQL queries. This change is backward compatible, and existing queries that use the ARRAYAGG function will continue to work as before. The change is tested and validated through the provided test cases.
* Added TSQL parser support for `(LHS) UNION RHS` queries ([#1211](#1211)). In this update, the TSQL parser has been enhanced to support a specific form of UNION queries where the UNION operator is used to combine the result sets of two SELECT statements, with the first SELECT statement enclosed in parentheses. This new functionality is implemented in the `TSqlRelationBuilder` class, with the addition of a new case branch to handle this specific form of UNION queries. The commit also includes new test cases in `TSqlRelationBuilderSpec` to verify this functionality, as well as a new test file `TsqlToDatabricksTranspilerTest.scala` for testing the correct translation of these queries into Databricks SQL dialect. These updates improve the parser's flexibility and utility for handling a wider range of TSQL queries, and resolve issue [#1127](#1127) related to the UNION parsing behavior.
* Added support for inline columns in CTEs ([#1184](#1184)). The latest change to the open-source library adds support for inline columns in Common Table Expressions (CTEs) for the Snowflake backend, enhancing its flexibility and functionality. This enhancement includes the addition of a new method `KnownInterval` for Abstract Syntax Tree (AST) representation and the modification of the `NamedTable` case class in the `relations.scala` file to allow for more customization options, such as specifying additional options for named tables and enabling streaming mode. The `SnowflakePlanParser` class of the `com.databricks.labs.remorph` project has also been updated with a new method `DealiasInlineColumnExpressions` to handle the parsing of inline columns in CTEs. Additionally, the `SnowflakeRelationBuilder` class now includes a new `visitCTEColumn` method to handle the parsing of inline columns and create `InlineColumnExpression` objects. The new `InlineColumnExpression` class takes a column name and value as arguments. The SnowflakeToDatabricksTranspiler has also been updated to support inline columns in CTEs, as demonstrated in a new test case that transpiles CTEs with multiple inline columns correctly. These changes improve the system's efficiency and make it easier for users to work with Snowflake.
* Implemented AST for positional column identifiers ([#1181](#1181)). In this release, we have implemented an Abstract Syntax Tree (AST) for positional column identifiers as part of the Snowflake functionality, addressing issue [#1181](#1181). The new type `NameOrPosition` has been introduced, allowing for both column names and positions to be used in the `InsertIntoTable` case class. The `SnowflakeExpressionBuilder` class now supports visiting `ColumnPositionContext` nodes and building `ir.NameOrPosition` objects, representing column identifiers using either names or positions. Two new methods, `visitColumnPosition` and `buildPosition`, have been added to handle positional column identifiers. However, please note that positional column identifier support is not yet available on the generator side for Databricks, which requires the table's schema for proper query translation involving positional column identifiers. The SnowflakeExpressionBuilderSpec test suite has been updated to include new tests for translating column positions and aliases in Snowflake SQL expressions, ensuring the parser correctly handles column positions and provides the correct AST elements for further processing.
* Implemented GROUP BY ALL ([#1180](#1180)). The SQL module's LogicalPlanGenerator class has been updated to support the GROUP BY ALL clause, allowing all columns to be used in the grouping. This is achieved by adding a new case to the match statement for the aggregate.group_type. In the PIVOT clause, values for pivoting are now generated using the expr.generate method, simplifying the code. The SnowflakeRelationBuilder class has also been updated to handle the GROUP BY ALL clause in Snowflake SQL queries. Additionally, the SnowflakeToDatabricksTranspiler has implemented the GROUP BY ALL feature, allowing for querying all unique combinations of columns in a specified table. A new test case has been added to ensure the correct behavior of this feature. These changes bring more flexibility and functionality to the SQL module and the transpiler.

Dependency updates:

 * Bump codecov/codecov-action from 4 to 5 ([#1210](#1210)).
 * Bump sqlglot from 25.30.0 to 25.32.1 ([#1254](#1254)).
sundarshankar89 added a commit that referenced this pull request Dec 2, 2024
*  Added support for format_datetime function in presto to Databricks ([#1250](#1250)). A new `format_datetime` function has been added to the `Parser` class in the `presto.py` file to provide support for formatting datetime values in Presto on Databricks. This function utilizes the `DateFormat.from_arg_list` method from the `local_expression` module to format datetime values according to a specified format string. To ensure compatibility and consistency between Presto and Databricks, a new test file `test_format_datetime_1.sql` has been added, containing SQL queries that demonstrate the usage of the `format_datetime` function in Presto and its equivalent in Databricks, `DATE_FORMAT`. This standalone change adds new functionality without modifying any existing code.
* Added support for SnowFlake `SUBSTR` ([#1238](#1238)). This commit enhances the library's SnowFlake support by adding the `SUBSTR` function, which was previously unsupported and existed only as an alternative to `SUBSTRING`. The project now fully supports both functions, and the `SUBSTRING` function can be used interchangeably with `SUBSTR` via the new `withConversionStrategy(SynonymOf("SUBSTR"))` method. Additionally, this commit supersedes a previous pull request that lacked a GPG signature and includes a test for the `SUBSTR` function. The `ARRAY_SLICE` function has also been updated to match SnowFlake's behavior, and the project now supports a more comprehensive list of SQL functions with their corresponding arity.
* Added support for json_size function in presto ([#1236](#1236)). A new `json_size` function for Presto has been added, which determines the size of a JSON object or array and returns an integer. Two new methods, `_build_json_size` and `get_json_object`, have been implemented to handle JSON objects and arrays differently, and the Parser and Tokenizer classes of the Presto class have been updated to include the new json_size function. An alternative implementation for Databricks using SQL functions is provided, and a test case is added to cover a fixed `is not null` error for json_extract in the Databricks generator. Additionally, a new test file for Presto has been added to test the functionality of the `json_extract` function in Presto, and a new method `GetJsonObject` is introduced to extract a JSON object from a given path. The `json_extract` function has also been updated to extract the value associated with a specified key from JSON data in both Presto and Databricks.
* Enclosed subqueries in parenthesis ([#1232](#1232)). This PR introduces changes to the ExpressionGenerator and LogicalPlanGenerator classes to ensure that subqueries are correctly enclosed in parentheses during code generation. Previously, subqueries were not always enclosed in parentheses, leading to incorrect code. This issue has been addressed by enclosing subqueries in parentheses in the `in` and `scalarSubquery` methods, and by adding new match cases for `ir.Filter` in the `LogicalPlanGenerator` class. The changes also take care to avoid doubling enclosing parentheses in the `.. IN(SELECT...)` pattern. New methods have not been added, and existing functionality has been modified to ensure that subqueries are correctly enclosed in parentheses, leading to the generation of correct SQL code. Test cases have been included in a separate PR. These changes improve the correctness of the generated code, avoiding issues such as `SELECT * FROM SELECT * FROM t WHERE a > `a` WHERE a > 'b'` and ensuring that the generated code includes parentheses around subqueries.
* Fixed serialization of MultipleErrors ([#1177](#1177)). In the latest release, the encoding of errors in the `com.databricks.labs.remorph.coverage` package has been improved with an update to the `encoders.scala` file. The change involves a fix for serializing `MultipleErrors` instances using the `asJson` method on each error instead of just the message. This modification ensures that all relevant information about each error is included in the encoded output, improving the accuracy of serialization for `MultipleErrors` class. Users who handle multiple errors and require precise serialization representation will benefit from this enhancement, as it guarantees comprehensive information encoding for each error instance.
* Fixed presto strpos and array_average functions ([#1196](#1196)). This PR introduces new classes `Locate` and `NamedStruct` in the `local_expression.py` file to handle the `STRPOS` and `ARRAY_AVERAGE` functions in a Databricks environment, ensuring compatibility with Presto SQL. The `STRPOS` function, used to locate the position of a substring within a string, now uses the `Locate` class and emits a warning regarding differences in implementation between Presto and Databricks SQL. A new method `_build_array_average` has been added to handle the `ARRAY_AVERAGE` function in Databricks, which calculates the average of an array, accommodating nulls, integers, and doubles. Two SQL test cases have been added to demonstrate the use of the `ARRAY_AVERAGE` function with arrays containing integers and doubles. These changes promote compatibility and consistent behavior between Presto and Databricks when dealing with `STRPOS` and `ARRAY_AVERAGE` functions, enhancing the ability to migrate between the systems smoothly.
* Handled presto Unnest cross join to Databricks lateral view ([#1209](#1209)). This release introduces new features and updates for handling Presto UNNEST cross joins in Databricks, utilizing the lateral view feature. New methods have been added to improve efficiency and robustness when handling UNNEST cross joins. Additionally, new test cases have been implemented for Presto and Databricks to ensure compatibility and consistency between the two systems in handling UNNEST cross joins, array construction and flattening, and parsing JSON data. Some limitations and issues remain, which will be addressed in future work. The acceptance tests have also been updated, with certain tests now expected to pass, while others may still fail. This release aims to improve the functionality and compatibility of Presto and Databricks when handling UNNEST cross joins and JSON data.
* Implemented remaining TSQL set operations ([#1227](#1227)). This pull request enhances the TSql parser by adding support for parsing and converting the set operations `UNION [ALL]`, `EXCEPT`, and `INTERSECT` to the Intermediate Representation (IR). Initially, the grammar recognized these operations, but they were not being converted to the IR. This change resolves issues [#1126](#1126) and [#1102](#1102) and includes new unit, transpiler, and functional tests, ensuring the correct behavior of these set operations, including precedence rules. The commit also introduces a new test file, `union-all.sql`, demonstrating the correct handling of simple `UNION ALL` operations, ensuring consistent output across TSQL and Databricks SQL platforms.
* Supported multiple columns in the ORDER BY clause for ARRAYAGG ([#1228](#1228)). This commit enhances the ARRAYAGG and LISTAGG functions by adding support for multiple columns in the ORDER BY clause and sorting in both ascending and descending order. A new method, sortArray, has been introduced to handle multiple sort orders. The changes also improve the functionality of the ARRAYAGG function in the Snowflake dialect by supporting multiple columns in the ORDER BY clause, with an optional DESC keyword for each column. The `WithinGroupParams` dataclass has been updated in the local expression module to include a list of tuples for the order columns and their sorting direction. These changes provide increased flexibility and control over the output of the ARRAYAGG and LISTAGG functions.
* Added TSQL parser support for `(LHS) UNION RHS` queries ([#1211](#1211)). In this release, we have implemented support for a new form of UNION in the TSQL parser, specifically for queries formatted as `(SELECT a from b) UNION [ALL] SELECT x from y`. This allows the union of two SELECT queries with an optional ALL keyword to include duplicate rows. The implementation includes a new case statement in the `TSqlRelationBuilder` class that handles this form of UNION, creating a `SetOperation` object with the left-hand side and right-hand side of the union, and an `is_all` flag based on the presence of the ALL keyword. Additionally, we have added support for parsing right-associative UNION clauses in TSQL queries, enhancing the flexibility and expressiveness of the TSQL parser for more complex and nuanced queries. The commit also includes new test cases to verify the correct translation of TSQL set operations to Databricks SQL, resolving issue [#1127](#1127). This enhancement allows for more accurate parsing of TSQL queries that use the UNION operator in various formats.
* Added support for inline columns in CTEs ([#1184](#1184)). In this release, we have added support for inline columns in Common Table Expressions (CTEs) in Snowflake across various components of our open-source library. This includes updates to the AST (Abstract Syntax Tree) for better TSQL translation and the introduction of the new case class `KnownInterval` for handling intervals. We have also implemented a new method, `DealiasInlineColumnExpressions`, in the `SnowflakePlanParser` class to parse inline columns in CTEs and modify the class constructor to include this new method. Additionally, a new private case class `InlineColumnExpression` has been introduced to allow for more efficient processing of Snowflake CTEs. The SnowflakeToDatabricksTranspiler has also been updated to support inline columns in CTEs, as demonstrated by a new test case. These changes improve compatibility, precision, and usability of the codebase, providing a better overall experience for software engineers working with CTEs in Snowflake.
*  Implemented AST for positional column identifiers ([#1181](#1181)). The recent change introduces an Abstract Syntax Tree (AST) for positional column identifiers in the Snowflake project, specifically in the `ExpressionGenerator` class. The new `NameOrPosition` type represents a column identifier, either by name or position. The `Id` and `Position` classes inherit from `NameOrPosition`, and the `nameOrPosition` method has been added to check and return the appropriate SQL representation. However, due to Databricks' lack of positional column identifier support, the generator side does not yet support this feature. This means that the schema of the table is required to properly translate queries involving positional column identifiers. This enhancement increases the system's flexibility in handling Snowflake's query structures, with the potential for more comprehensive generator-side support in the future.
* Implemented GROUP BY ALL ([#1180](#1180)). The `GROUP BY ALL` clause is now supported in the LogicalPlanGenerator class of the remorph project, with the addition of a new case to handle the GroupByAll type and updated implementation for the Pivot type. A new case object called `GroupByAll` has been added to the relations.scala file's sealed trait "GroupType". A new test case has been implemented in the SnowflakeToDatabricksTranspilerTest class to check the correct transpilation of the `GROUP BY ALL` clause from Snowflake SQL syntax to Databricks SQL syntax. These changes allow for more flexibility and control in grouping operations and enable the implementation of specific functionality for the GROUP BY ALL clause in Snowflake, improving compatibility with Snowflake SQL syntax.

Dependency updates:

 * Bump codecov/codecov-action from 4 to 5 ([#1210](#1210)).
 * Bump sqlglot from 25.30.0 to 25.32.1 ([#1254](#1254)).
@sundarshankar89 mentioned this pull request Dec 2, 2024
gueniai pushed a commit that referenced this pull request Dec 2, 2024
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request Jan 2, 2025
sundarshankar89 added a commit to sundarshankar89/remorph that referenced this pull request Jan 2, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request Jan 3, 2025
sundarshankar89 added a commit to sundarshankar89/remorph that referenced this pull request Jan 3, 2025