TSQL: Implement grammar for CREATE DATABASE and CREATE DATABASE SCOPED OPTION #788
Merged
Conversation
Coverage tests results: 401 tests ±0, 154 ✅ ±0, 4s ⏱️ -1s. For more details on these failures, see this check. Results for commit 7c5f271. ± Comparison against base commit a01a719. ♻️ This comment has been updated with latest results.
- NB: I have a PR open against the grammar format repo to stop it writing out the grammar when it has not changed, which otherwise triggers more build steps than it should.
a76f43f to 932e70b
nfx approved these changes (Aug 13, 2024)
lgtm
nfx added a commit that referenced this pull request (Aug 23, 2024):
* Added Translation Support for `!` as `commands` and `&` for `Parameters` ([#771](#771)). This commit adds translation support for using "!" as commands and "&" as parameters in Snowflake code within the remorph tool, enhancing compatibility with Snowflake syntax. The "!set exit_on_error=true" command, which previously caused an error, is now treated as a comment and prepended with `--` in the output. The "&" symbol, previously unrecognized, is converted to its Databricks equivalent "$", which represents parameters, allowing for proper handling of Snowflake SQL code containing "!" commands and "&" parameters. These changes improve the compatibility and robustness of remorph with Snowflake code and enable more efficient processing of Snowflake SQL statements. Additionally, the commit introduces a new test suite for Snowflake commands, enhancing code coverage and ensuring proper functionality of the transpiler. (A minimal sketch of this rewrite follows this list.)
* Added `LET` and `DECLARE` statements parsing in Snowflake PL/SQL procedures ([#548](#548)). This commit introduces support for parsing `DECLARE` and `LET` statements in Snowflake PL/SQL procedures, enabling variable declaration and assignment. It adds new grammar rules, refactors code using ScalaSubquery, and implements IR visitors for `DECLARE` and `LET` statements with Variable Assignment and ResultSet Assignment. The `RETURN` statement and parameterized expressions are also now supported. Note that `CURSOR` is not yet covered. These changes allow for improved processing and handling of Snowflake PL/SQL code, enhancing the overall functionality of the library.
* Added logger statements in get_schema function ([#756](#756)). In this release, enhanced logging has been implemented in the Metadata (Schema) fetch functions, specifically in the `get_schema` function and other metadata fetch functions within the Oracle and SnowflakeDataSource modules. The changes include logger statements that log the schema query, start time, and end time, providing better visibility into the performance and behavior of these functions during debugging or monitoring. The logging functionality is implemented using the built-in `logging` module, and timestamps are obtained using the `datetime` module. In the SnowflakeDataSource class, a RuntimeError or PySparkException will be raised if the user's current role lacks the necessary privileges to access the specified Information Schema object. The INFORMATION_SCHEMA table in Snowflake is used to fetch the schema, with the query modified to handle unquoted and quoted identifiers and the ordinal position of columns. The `get_schema_query` function has also been updated for better formatting of the SQL query used to fetch schema information. The schema fetching method remains unchanged, but these enhancements provide more detailed logging for debugging and monitoring purposes.
* Aggregates Reconcile CLI Implementation ([#770](#770)). This commit introduces a new command-line interface (CLI) for reconcile jobs, specifically for aggregated data. This change adds a new parameter, "operation_name", to the run method in the runner.py file, which determines the type of reconcile operation to perform. A new function, _trigger_reconcile_aggregates, has been implemented to reconcile aggregate data based on provided configurations and log the reconciliation process outcome. Additionally, new methods for defining job parameters and settings, such as `max_concurrent_runs` and "parameters", have been included. This CLI implementation enhances the customizability and control of the reconciliation process for users, allowing them to focus on specific use cases and data aggregations. The changes also include new test cases in test_runner.py to ensure the proper behavior of the ReconcileRunner class when the `aggregates-reconcile` operation_name is set.
* Aggregates Reconcile Updates ([#784](#784)). This commit introduces significant updates to the `Table Deployment` feature, enabling it to support `Aggregate Tables` deployment and modifying the persistence logic for tables. Notable changes include the addition of a new `aggregates` attribute to the `Table` class in the configuration, which allows users to specify aggregate functions and optionally group by specific columns. The reconcile process now captures mismatch data, missing rows in the source, and missing rows in the target in the recon metrics tables. Furthermore, the aggregates reconcile process supports various aggregate functions like min, max, count, sum, avg, median, mode, percentile, stddev, and variance. The documentation has been updated to reflect these improvements. The commit also removes the `percentile` function from the reconciliation configuration and modifies the `aggregate_metrics` SQL query, enhancing the flexibility of the `Table Deployment` feature for `Aggregate Tables`. Users should note that the `percentile` function is no longer a valid option and should update their code accordingly.
* Aggregates Reconcile documentation ([#779](#779)). In this commit, the Aggregates Reconcile utility has been enhanced with new documentation and visualizations for improved understanding and usability. The utility now includes a flow diagram, visualization, and README file illustrating how it compares specific aggregate metrics between source and target data residing on Databricks. A new configuration sample is added, showcasing the reconciliation of two tables using various aggregate functions, join columns, transformations, filters, and JDBC ReaderOptions configurations. The commit also introduces two Mermaid flowchart diagrams, depicting the reconciliation process with and without a `group by` operation. Additionally, new flow diagram visualizations in PNG and GIF formats have been added, aiding in understanding the process flow of the Aggregates Reconcile feature. The reconcile configuration samples in the documentation have also been updated with a spelling correction for clarity.
* Bump sqlglot from 25.6.1 to 25.8.1 ([#749](#749)). In this version update, the `sqlglot` dependency has been bumped from 25.6.1 to 25.8.1, bringing several bug fixes and new features related to various SQL dialects such as BigQuery, DuckDB, and T-SQL. Notable changes include support for BYTEINT in BigQuery, improved parsing and transpilation of StrToDate in ClickHouse, and support for SUMMARIZE in DuckDB. Additionally, there are bug fixes for DuckDB and T-SQL, including wrapping the left IN clause JSON extract arrow operand and handling JSON_QUERY with a single argument. The update also includes refactors and changes to the ANNOTATORS and PARSER modules to improve dialect-aware annotation and consistency. This pull request is compatible with `sqlglot` version 25.6.1 and below and includes a detailed list of commits and their corresponding changes.
* Generate window functions ([#772](#772)). In this release, we have added support for generating SQL `WINDOW` and `SortOrder` expressions in the `ExpressionGenerator` class. This enhancement includes the ability to generate a `WINDOW` expression with a window function, partitioning and ordering clauses, and an optional window frame, using the `window` and `frameBoundary` methods. The `sortOrder` method now generates the SQL `SortOrder` expression, which includes the expression to sort by, the sort direction, and null ordering. Additional methods `orNull` and `doubleQuote` return a string representing a NULL value and a string enclosed in double quotes, respectively. These changes provide increased flexibility for handling more complex expressions in SQL. Additionally, new test cases have been added to the `ExpressionGeneratorTest` to ensure the correct generation of SQL window functions, specifically the `ROW_NUMBER()` function with various partitioning, ordering, and framing specifications. These updates improve the robustness and functionality of the `ExpressionGenerator` class for generating SQL window functions. (A window-rendering sketch follows this list.)
* Implement TSQL specific function call mapper ([#765](#765)). This commit introduces several new features to enhance compatibility between TSQL and Databricks SQL. A new method, `interval`, has been added to generate a Databricks SQL compatible string for intervals in a TSQL expression. The `expression` method has been updated to handle certain functions directly, improving translation efficiency. Specifically, the DATEADD function is now translated to Databricks SQL's DATE_ADD, ADD_MONTHS, and xxx + INTERVAL n {days|months|etc} constructs. The changes also include a new sealed trait `KnownIntervalType`, a new case class `KnownInterval`, and a new class `TSqlCallMapper` for mapping TSQL functions to Databricks SQL equivalents. Furthermore, the commit introduces new tests for TSQL specific function call mappers, ensuring proper translation of TSQL functions to Databricks SQL compatible constructs. These improvements collectively facilitate better integration and compatibility between TSQL and Databricks SQL. (A `DATEADD` mapping sketch follows this list.)
* Improve TSQL and Snowflake parser and lexer ([#757](#757)). In this release, the open-source library's Snowflake and TSQL lexers and parsers have been improved for better functionality and robustness. For the Snowflake lexer, unnecessary escape sequence processing has been removed, and various options have been corrected to be simple strings. The lexer now accepts a question mark as a placeholder for prepared statements in Snowflake statements. The TSQL lexer has undergone minor improvements, such as aligning the catch-all rule name with Snowflake. The Snowflake parser now accepts the question mark as a `PARAM` placeholder and simplifies the `typeFileformat` rule to accept a single `STRING` token. Additionally, several new keywords have been added to the TSQL lexer, improving consistency and clarity. These changes aim to simplify lexer and parser rules, enhance option handling and placeholders, and ensure consistency between Snowflake and TSQL.
* Patch Information Schema Predicate Pushdown for Snowflake ([#764](#764)). In this release, we have implemented Information Schema Predicate Pushdown for Snowflake, resolving issue [#7](#7).
* TSQL: Implement correct grammar for CREATE TABLE in all forms ([#796](#796)). In this release, the TSqlLexer's CREATE TABLE statement grammar has been updated and expanded to support new keywords and improve accuracy. The newly added keywords `EDGE`, `FILETABLE`, `NODE`, and `NODES` enable correct parsing of CREATE TABLE statements using graph nodes and FILETABLE functionality. Existing keywords such as `DROP_EXISTING`, `DYNAMIC`, `FILENAME`, and `FILTER` have been refined for better precision. Furthermore, the introduction of the `tableIndices` rule standardizes the order of columns in the table. These enhancements improve the T-SQL parser's robustness and consistency, benefiting users in creating and managing tables in their databases.
* TSQL: Implement grammar for CREATE DATABASE and CREATE DATABASE SCOPED OPTION ([#788](#788)). In this release, we have implemented the TSQL grammar for `CREATE DATABASE` and `CREATE DATABASE SCOPED OPTION` statements, addressing inconsistencies with the TSQL documentation. The implementation was initially intended to cover the entire process from grammar to code generation. However, to simplify other DDL statements, the work was split into separate grammar-only pull requests. The diff introduces new methods such as `createDatabaseScopedCredential`, `createDatabaseOption`, and `databaseFilestreamOption`, while modifying the existing `createDatabase` method. The `createDatabaseScopedCredential` method handles the creation of a database scoped credential, which was previously part of `createDatabaseOption`. The `createDatabaseOption` method now focuses on handling individual options, while `databaseFilestreamOption` deals with filesystem specifications. Note that certain options, like `DEFAULT_LANGUAGE`, `DEFAULT_FULLTEXT_LANGUAGE`, and more, have been marked as TODO and will be addressed in future updates.
* TSQL: Improve transpilation coverage ([#766](#766)). In this update, various enhancements have been made to improve the coverage of TSQL transpilation and address bugs in code generation, particularly for the `ExpressionGenerator` class in the `com/databricks/labs/remorph/generators/sql` package, and the `TSqlExpressionBuilder`, `TSqlFunctionBuilder`, `TSqlCallMapper`, and `QueryRunner` classes. Changes include adding support for new cases, modifying code generation behavior, improving test coverage, and updating existing tests for better TSQL code generation. Specific additions include new methods for handling bitwise operations, converting CHECKSUM_AGG calls to a sequence of MD5 function calls, and handling Fn instances. The `QueryRunner` class has been updated to include both the actual and expected outputs in error messages for better debugging purposes. Additionally, the test file for the `DATEADD` function has been updated to ensure proper syntax and consistency. All these modifications aim to improve the reliability, accuracy, and compatibility of TSQL transpilation, ensuring better functionality and coverage for the Remorph library's transformation capabilities. (A sketch of one possible `CHECKSUM_AGG` expansion follows this list.)
* [chore] speedup build process by not running unit tests twice ([#842](#842)). In this commit, the build process for the open-source library has been optimized by removing the execution of unit tests during the build phase of the Maven build process. A new plugin configuration for the Apache Maven Surefire Plugin has been added, with the group ID set to "org.apache.maven.plugins", the artifact ID set to "maven-surefire-plugin", and the version set to "3.1.2". The configuration for this plugin includes a `skipTests` attribute set to "true", ensuring that tests are not run twice, thereby improving build speed. The existing ScalaTest Maven plugin configuration remains unchanged, allowing Scala tests to still be executed during the test phase. Additionally, the Maven Compiler Plugin has been upgraded to version 3.11.0, and the release parameter has been set to 8, ensuring that the Java compiler used during the build process is compatible with Java 8. The version numbers for several libraries, including os-lib, mainargs, ujson, scalatest, and exec-maven-plugin, are now defined using properties, allowing Maven to manage and cache these libraries more efficiently. These changes improve the build process's performance and reliability without affecting existing functionality.
* [internal] better errors for call mapper ([#816](#816)). In this release, the `ExpressionGenerator` class in the `com.databricks.labs.remorph.generators.sql` package has been updated to handle exceptions during the conversion of input functions to Databricks expressions. A try-catch block has been added to catch `IndexOutOfBoundsException` and provide a more descriptive error message, including the name of the problematic function and the error message associated with the exception. A `TranspileException` with the message `not implemented` is now thrown when encountering a function for which a translation to Databricks expressions is not available. The `IsTranspiledFromSnowflakeQueryRunner` class in the `com.databricks.labs.remorph.coverage` package has also been updated to include the name of the exception class in the error message for better error identification when a non-fatal error occurs during parsing. Additionally, the import statement for `Formatter` has been moved to ensure alphabetical order. These changes improve error handling and readability, thereby enhancing the overall user experience for developers interacting with the codebase. (An error-wrapping sketch follows this list.)
* [snowflake] map more functions to Databricks SQL ([#826](#826)). This commit introduces new private methods `andPredicate` and `orPredicate` to the ExpressionGenerator class in the `com.databricks.labs.remorph.generators.sql` package, enhancing the generation of SQL expressions for AND and OR logical operators, and improving the readability and correctness of complex logical expressions. The LogicalPlanGenerator class in the `sql` package now supports more flexibility in inserting data into a target relation, enabling users to choose between overwriting the existing data or appending to it. The `FROM_JSON` function in the CallMapper class has been updated to accommodate an optional third argument, providing more flexibility in handling JSON-related transformations. A new class, `CastParseJsonToFromJson`, has been introduced to improve the performance of data processing pipelines that involve parsing JSON data in Snowflake using the `PARSE_JSON` function. Additional Snowflake SQL functions have been mapped to Databricks SQL IR, enhancing compatibility and functionality. The ExpressionGeneratorTest class now generates predicates without parentheses, simplifying and improving readability. Mappings for several Snowflake functions to Databricks SQL have been added, enhancing compatibility with Databricks SQL. The `sqlFiles` sequence in the `NestedFiles` class is now sorted before being mapped to `AcceptanceTest` objects, ensuring a consistent order for testing and debugging purposes. A semicolon has been added to the end of a SQL query in a test file for Snowflake DML insert functionality, ensuring proper query termination. (A predicate-rendering sketch follows this list.)
* [sql] generate `INSERT INTO ...` ([#823](#823)). In this release, we have made significant updates to our open-source library. The ExpressionGenerator.scala file has been updated to convert boolean values to lowercase instead of uppercase when generating INSERT INTO statements, ensuring SQL code consistency. A new method `insert` has been added to the `LogicalPlanGenerator` class to generate INSERT INTO SQL statements based on the `InsertIntoTable` input. We have introduced a new case class `InsertIntoTable` that extends `Modification` to simplify the API for DML operations other than SELECT. The SQL ExpressionGenerator now generates boolean literals in lowercase, and new test cases have been added to ensure the correct generation of INSERT and JOIN statements. Lastly, we have added support for generating INSERT INTO statements in SQL for specified database tables, improving cross-platform compatibility. These changes aim to enhance the library's functionality and ease of use for software engineers. (An `INSERT INTO` rendering sketch follows this list.)
* [sql] generate basic JSON access ([#835](#835)). In this release, we have added several new features and improvements to our open-source library. The `ExpressionGenerator` class now includes a new method, `jsonAccess`, which generates SQL code to access a JSON object's properties, handling different types of elements in the path. The `TO_JSON` function in the `StructsToJson` class has been updated to accept an optional expression as an argument, enhancing its flexibility. The `SnowflakeCallMapper` class now includes a new method, `lift`, and a new feature to generate basic JSON access, with corresponding updates to test cases and methods. The SQL logical plan generator has been refined to generate star projections with escaped identifiers, handling complex table and database names. We have also added new methods and test cases to the `SnowflakeCallMapper` class to convert Snowflake structs into JSON strings and cast Snowflake values to specific data types. These changes improve the library's ability to handle complex JSON data structures, enhance functionality, and ensure the quality of generated SQL code. (A JSON path-rendering sketch follows this list.)
* [sql] generate basic `CREATE TABLE` definition ([#829](#829)). In this release, the open-source library's SQL generation capabilities have been enhanced with the addition of a new `createTable` method to the `LogicalPlanGenerator` class. This method generates a `CREATE TABLE` definition for a given `ir.CreateTableCommand`, producing a SQL statement with a comma-separated list of column definitions. Each column definition includes the column name, data type, and any applicable constraints, generated using the `DataTypeGenerator.generateDataType` method and the newly-introduced `constraint` method. Additionally, the `project` method has been updated to incorporate a `FROM` clause in the generated SQL statement when the input of the project node is not `ir.NoTable()`. These improvements extend the functionality of the `LogicalPlanGenerator` class, allowing it to generate `CREATE TABLE` statements for input catalog ASTs, thereby better supporting data transformation use cases. A new test for the `CreateTableCommand` has been added to the `LogicalPlanGeneratorTest` class to validate the correct transpilation of the `CreateTableCommand` to a `CREATE TABLE` SQL statement. (A `CREATE TABLE` rendering sketch follows this list.)
* [sql] generate basic `TABLESAMPLE` ([#830](#830)). In this commit, the open-source library's `LogicalPlanGenerator` class has been updated to include a new method, `tableSample`, which generates SQL representations of table sampling operations. Previously, the class only handled `INSERT`, `DELETE`, and `CREATE TABLE` commands. With this enhancement, the generator can now produce SQL statements using the `TABLESAMPLE` clause, allowing for the selection of a sample of data from a table based on various sampling methods and a seed value for repeatable sampling. The newly supported sampling methods include row-based probabilistic, row-based fixed amount, and block-based sampling. Additionally, a new test case has been added for the `LogicalPlanGenerator` related to the `TableSample` class, validating the correct transpilation of named tables and fixed row sampling into the `TABLESAMPLE` clause with specified parameters. This improvement ensures that the generated SQL code accurately represents the desired table sampling settings. (A `TABLESAMPLE` rendering sketch follows this list.)

Dependency updates:

* Bump sqlglot from 25.6.1 to 25.8.1 ([#749](#749)).
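The sketches referenced above follow. All are minimal, self-contained Scala, written to illustrate the behaviour described in the changelog entries; none of them is remorph's actual code, and all names are illustrative assumptions unless a changelog entry names them. First, the `!` command and `&` parameter rewrite from [#771](#771): `!` client commands become `--` comments, and `&name` references become `$name`.

```scala
object SnowflakeCommandRewrite {
  // A "!..." client command is not executable SQL on Databricks,
  // so it is kept as a comment by prepending "--".
  private def commentOutCommand(line: String): String =
    if (line.trim.startsWith("!")) s"-- $line" else line

  // Snowflake "&param" references become Databricks "$param" parameters.
  private def rewriteParameters(line: String): String =
    line.replaceAll("&(\\w+)", "\\$$1")

  def rewrite(script: String): String =
    script.linesIterator.map(l => rewriteParameters(commentOutCommand(l))).mkString("\n")
}

// SnowflakeCommandRewrite.rewrite("!set exit_on_error=true\nSELECT &v FROM t")
//   ==> "-- !set exit_on_error=true\nSELECT $v FROM t"
```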
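For the window-function entry ([#772](#772)), a simplified sketch of how a `WINDOW` expression can be assembled from a function, partitioning, ordering, and an optional frame. The `SortOrder` shape and its defaults are assumptions for illustration; remorph's IR classes differ.

```scala
// Illustrative model only, not remorph's IR.
final case class SortOrder(expr: String, direction: String = "ASC", nulls: String = "NULLS LAST")

def window(
    fn: String,
    partitionBy: Seq[String] = Seq.empty,
    orderBy: Seq[SortOrder] = Seq.empty,
    frame: Option[String] = None): String = {
  val partition =
    if (partitionBy.isEmpty) None else Some(partitionBy.mkString("PARTITION BY ", ", ", ""))
  val order =
    if (orderBy.isEmpty) None
    else Some(orderBy.map(o => s"${o.expr} ${o.direction} ${o.nulls}").mkString("ORDER BY ", ", ", ""))
  // Drop the absent clauses and join whatever remains inside OVER (...).
  val spec = (partition ++ order ++ frame).mkString(" ")
  s"$fn OVER ($spec)"
}

// window("ROW_NUMBER()", Seq("dept"), Seq(SortOrder("salary", "DESC")))
//   ==> ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC NULLS LAST)
```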
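For the TSQL call-mapper entry ([#765](#765)), a sketch of the `DATEADD` translation rule described there: days map to `DATE_ADD`, months to `ADD_MONTHS`, and other parts to `expr + INTERVAL n ...`. Only a few date-part spellings are handled, and the helper name is an assumption.

```scala
// Hedged sketch of the DATEADD(part, n, expr) rewrite; remorph's
// KnownIntervalType covers more parts than shown here.
def dateAdd(part: String, n: Int, expr: String): String =
  part.toLowerCase match {
    case "day" | "dd" | "d"     => s"DATE_ADD($expr, $n)"   // days use DATE_ADD
    case "month" | "mm" | "m"   => s"ADD_MONTHS($expr, $n)" // months use ADD_MONTHS
    case "year" | "yy" | "yyyy" => s"$expr + INTERVAL $n YEARS"
    case "hour" | "hh"          => s"$expr + INTERVAL $n HOURS"
    case other                  => throw new IllegalArgumentException(s"unsupported DATEADD part: $other")
  }

// dateAdd("hh", 2, "col") ==> col + INTERVAL 2 HOURS
```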
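The transpilation-coverage entry ([#766](#766)) mentions converting `CHECKSUM_AGG` calls into a sequence of MD5 function calls, but the notes do not show the exact rewrite. The following is only one plausible shape, stated as an assumption: hash each value, aggregate deterministically, then hash the aggregate, since Databricks SQL has no `CHECKSUM_AGG`.

```scala
// Assumed expansion, not remorph's verified output: per-value MD5, a sorted
// collected list for determinism, then MD5 over the joined result.
def checksumAgg(column: String): String =
  s"MD5(CONCAT_WS(',', SORT_ARRAY(COLLECT_LIST(MD5(CAST($column AS STRING))))))"
```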
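For the call-mapper error entry ([#816](#816)), a self-contained sketch of the described try-catch behaviour: unknown functions raise `TranspileException("not implemented")`, and an `IndexOutOfBoundsException` is rewrapped with the function name. `Fn` and the translation table are simplified stand-ins for the real IR.

```scala
final case class Fn(name: String, args: Seq[String])
final case class TranspileException(message: String) extends RuntimeException(message)

// Only functions with a known Databricks translation appear in the table.
val translations: PartialFunction[Fn, String] = {
  case Fn("IFF", Seq(cond, t, f)) => s"IF($cond, $t, $f)"
}

def convert(fn: Fn): String =
  try {
    if (translations.isDefinedAt(fn)) translations(fn)
    else throw TranspileException("not implemented") // no known translation
  } catch {
    case e: IndexOutOfBoundsException => // malformed argument list: name the culprit
      throw TranspileException(s"${fn.name}: ${e.getMessage}")
  }
```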
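For the `andPredicate`/`orPredicate` methods mentioned in the Snowflake mapping entry ([#826](#826)), a sketch of AND/OR rendering that adds parentheses only where precedence requires them. The `Pred` model is an illustrative assumption.

```scala
object Predicates {
  sealed trait Pred
  final case class Atom(sql: String) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // OR binds more loosely than AND, so an OR nested under an AND
  // needs parentheses; everything else is emitted bare.
  def predicate(p: Pred): String = p match {
    case Atom(sql)        => sql
    case And(left, right) => s"${operand(left)} AND ${operand(right)}"
    case Or(left, right)  => s"${predicate(left)} OR ${predicate(right)}"
  }

  private def operand(p: Pred): String = p match {
    case o: Or => s"(${predicate(o)})"
    case other => predicate(other)
  }
}

// import Predicates._
// predicate(And(Atom("a = 1"), Or(Atom("b = 2"), Atom("c = 3"))))
//   ==> a = 1 AND (b = 2 OR c = 3)
```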
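For the `INSERT INTO` entry ([#823](#823)), a sketch showing the lowercase boolean literals it describes plus basic statement assembly. Helper names are illustrative.

```scala
// Literal rendering: booleans lowercase (per the entry above),
// strings single-quoted with embedded quotes doubled.
def literal(v: Any): String = v match {
  case b: Boolean => if (b) "true" else "false"
  case s: String  => s"'${s.replace("'", "''")}'"
  case other      => other.toString
}

def insertInto(table: String, columns: Seq[String], rows: Seq[Seq[Any]]): String = {
  val values = rows.map(_.map(literal).mkString("(", ", ", ")")).mkString(", ")
  s"INSERT INTO $table (${columns.mkString(", ")}) VALUES $values"
}

// insertInto("t", Seq("id", "flag"), Seq(Seq(1, true)))
//   ==> INSERT INTO t (id, flag) VALUES (1, true)
```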
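For the JSON-access entry ([#835](#835)), a sketch of rendering a path of keys and array indices. The exact operator syntax remorph emits may differ, so treat this as an assumption about shape, not output.

```scala
sealed trait PathElem
final case class Key(name: String) extends PathElem
final case class Index(i: Int) extends PathElem

// Plain identifiers use dotted access; anything else falls back to brackets.
def jsonAccess(column: String, path: Seq[PathElem]): String =
  column + path.map {
    case Key(name) if name.matches("[A-Za-z_][A-Za-z0-9_]*") => s".$name"
    case Key(name)                                           => s"['$name']"
    case Index(i)                                            => s"[$i]"
  }.mkString

// jsonAccess("payload", Seq(Key("order"), Index(0), Key("id"))) ==> payload.order[0].id
```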
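For the `CREATE TABLE` entry ([#829](#829)), a sketch of column-definition rendering in the order the entry describes: name, data type, then constraints. `Column` is an illustrative stand-in for `ir.CreateTableCommand`.

```scala
final case class Column(name: String, dataType: String, constraints: Seq[String] = Seq.empty)

// Each definition is "name type constraint..."; definitions are comma-separated.
def createTable(table: String, columns: Seq[Column]): String = {
  val defs = columns
    .map(c => (Seq(c.name, c.dataType) ++ c.constraints).mkString(" "))
    .mkString(", ")
  s"CREATE TABLE $table ($defs)"
}

// createTable("t", Seq(Column("id", "INT", Seq("NOT NULL")), Column("name", "STRING")))
//   ==> CREATE TABLE t (id INT NOT NULL, name STRING)
```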
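Finally, for the `TABLESAMPLE` entry ([#830](#830)), a sketch covering the two row-based methods plus a `REPEATABLE` seed. Block-based sampling is omitted here, and the rendering is an assumption rather than remorph's exact output.

```scala
sealed trait SamplingMethod
final case class RowProbability(percent: Double) extends SamplingMethod // probabilistic
final case class RowFixedAmount(rows: Long) extends SamplingMethod      // fixed row count

def tableSample(table: String, method: SamplingMethod, seed: Option[Long] = None): String = {
  val spec = method match {
    case RowProbability(p) => s"$p PERCENT"
    case RowFixedAmount(n) => s"$n ROWS"
  }
  // A seed makes the sample repeatable across runs.
  val repeatable = seed.map(s => s" REPEATABLE ($s)").getOrElse("")
  s"$table TABLESAMPLE ($spec)$repeatable"
}

// tableSample("events", RowFixedAmount(10), Some(99))
//   ==> events TABLESAMPLE (10 ROWS) REPEATABLE (99)
```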
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request (Jan 2, 2025).
* Added Translation Support for `!` as `commands` and `&` for `Parameters` ([databrickslabs#771](databrickslabs#771)). This commit adds translation support for using "!" as commands and "&" as parameters in Snowflake code within the remorph tool, enhancing compatibility with Snowflake syntax. The "!set exit_on_error=true" command, which previously caused an error, is now treated as a comment and prepended with `--` in the output. The "&" symbol, previously unrecognized, is converted to its Databricks equivalent "$", which represents parameters, allowing for proper handling of Snowflake SQL code containing "!" commands and "&" parameters. These changes improve the compatibility and robustness of remorph with Snowflake code and enable more efficient processing of Snowflake SQL statements. Additionally, the commit introduces a new test suite for Snowflake commands, enhancing code coverage and ensuring proper functionality of the transpiler. * Added `LET` and `DECLARE` statements parsing in Snowflake PL/SQL procedures ([databrickslabs#548](databrickslabs#548)). This commit introduces support for parsing `DECLARE` and `LET` statements in Snowflake PL/SQL procedures, enabling variable declaration and assignment. It adds new grammar rules, refactors code using ScalaSubquery, and implements IR visitors for `DECLARE` and `LET` statements with Variable Assignment and ResultSet Assignment. The `RETURN` statement and parameterized expressions are also now supported. Note that `CURSOR` is not yet covered. These changes allow for improved processing and handling of Snowflake PL/SQL code, enhancing the overall functionality of the library. * Added logger statements in get_schema function ([databrickslabs#756](databrickslabs#756)). In this release, enhanced logging has been implemented in the Metadata (Schema) fetch functions, specifically in the `get_schema` function and other metadata fetch functions within Oracle, SnowflakeDataSource modules. The changes include logger statements that log the schema query, start time, and end time, providing better visibility into the performance and behavior of these functions during debugging or monitoring. The logging functionality is implemented using the built-in `logging` module and timestamps are obtained using the `datetime` module. In the SnowflakeDataSource class, RuntimeError or PySparkException will be raised if the user's current role lacks the necessary privileges to access the specified Information Schema object. The INFORMATION_SCHEMA table in Snowflake is used to fetch the schema, with the query modified to handle unquoted and quoted identifiers and the ordinal position of columns. The `get_schema_query` function has also been updated for better formatting for the SQL query used to fetch schema information. The schema fetching method remains unchanged, but these enhancements provide more detailed logging for debugging and monitoring purposes. * Aggregates Reconcile CLI Implementation ([databrickslabs#770](databrickslabs#770)). The `Aggregates Reconcile CLI Implementation` commit introduces a new command-line interface (CLI) for reconcile jobs, specifically for aggregated data. This change adds a new parameter, "operation_name", to the run method in the runner.py file, which determines the type of reconcile operation to perform. A new function, _trigger_reconcile_aggregates, has been implemented to reconcile aggregate data based on provided configurations and log the reconciliation process outcome. 
Additionally, new methods for defining job parameters and settings, such as `max_concurrent_runs` and "parameters", have been included. This CLI implementation enhances the customizability and control of the reconciliation process for users, allowing them to focus on specific use cases and data aggregations. The changes also include new test cases in test_runner.py to ensure the proper behavior of the ReconcileRunner class when the `aggregates-reconcile` operation_name is set. * Aggregates Reconcile Updates ([databrickslabs#784](databrickslabs#784)). This commit introduces significant updates to the `Table Deployment` feature, enabling it to support `Aggregate Tables` deployment and modifying the persistence logic for tables. Notable changes include the addition of a new `aggregates` attribute to the `Table` class in the configuration, which allows users to specify aggregate functions and optionally group by specific columns. The reconcile process now captures mismatch data, missing rows in the source, and missing rows in the target in the recon metrics tables. Furthermore, the aggregates reconcile process supports various aggregate functions like min, max, count, sum, avg, median, mode, percentile, stddev, and variance. The documentation has been updated to reflect these improvements. The commit also removes the `percentile` function from the reconciliation configuration and modifies the `aggregate_metrics` SQL query, enhancing the flexibility of the `Table Deployment` feature for `Aggregate Tables`. Users should note that the `percentile` function is no longer a valid option and should update their code accordingly. * Aggregates Reconcile documentation ([databrickslabs#779](databrickslabs#779)). In this commit, the Aggregates Reconcile utility has been enhanced with new documentation and visualizations for improved understanding and usability. The utility now includes a flow diagram, visualization, and README file illustrating how it compares specific aggregate metrics between source and target data residing on Databricks. A new configuration sample is added, showcasing the reconciliation of two tables using various aggregate functions, join columns, transformations, filters, and JDBC ReaderOptions configurations. The commit also introduces two Mermaid flowchart diagrams, depicting the reconciliation process with and without a `group by` operation. Additionally, new flow diagram visualizations in PNG and GIF formats have been added, aiding in understanding the process flow of the Aggregates Reconcile feature. The reconcile configuration samples in the documentation have also been updated with a spelling correction for clarity. * Bump sqlglot from 25.6.1 to 25.8.1 ([databrickslabs#749](databrickslabs#749)). In this version update, the `sqlglot` dependency has been bumped from 25.6.1 to 25.8.1, bringing several bug fixes and new features related to various SQL dialects such as BigQuery, DuckDB, and T-SQL. Notable changes include support for BYTEINT in BigQuery, improved parsing and transpilation of StrToDate in ClickHouse, and support for SUMMARIZE in DuckDB. Additionally, there are bug fixes for DuckDB and T-SQL, including wrapping left IN clause json extract arrow operand and handling JSON_QUERY with a single argument. The update also includes refactors and changes to the ANNOTATORS and PARSER modules to improve dialect-aware annotation and consistency. 
This pull request is compatible with `sqlglot` version 25.6.1 and below and includes a detailed list of commits and their corresponding changes. * Generate window functions ([databrickslabs#772](databrickslabs#772)). In this release, we have added support for generating SQL `WINDOW` and `SortOrder` expressions in the `ExpressionGenerator` class. This enhancement includes the ability to generate a `WINDOW` expression with a window function, partitioning and ordering clauses, and an optional window frame, using the `window` and `frameBoundary` methods. The `sortOrder` method now generates the SQL `SortOrder` expression, which includes the expression to sort by, sort direction, and null ordering. Additional methods `orNull` and `doubleQuote` return a string representing a NULL value and a string enclosed in double quotes, respectively. These changes provide increased flexibility for handling more complex expressions in SQL. Additionally, new test cases have been added to the `ExpressionGeneratorTest` to ensure the correct generation of SQL window functions, specifically the `ROW_NUMBER()` function with various partitioning, ordering, and framing specifications. These updates improve the robustness and functionality of the `ExpressionGenerator` class for generating SQL window functions. * Implement TSQL specific function call mapper ([databrickslabs#765](databrickslabs#765)). This commit introduces several new features to enhance compatibility between TSQL and Databricks SQL. A new method, `interval`, has been added to generate a Databricks SQL compatible string for intervals in a TSQL expression. The `expression` method has been updated to handle certain functions directly, improving translation efficiency. Specifically, the DATEADD function is now translated to Databricks SQL's DATE_ADD, ADD_MONTHS, and xxx + INTERVAL n {days|months|etc} constructs. The changes also include a new sealed trait `KnownIntervalType`, a new case class `KnownInterval`, and a new class `TSqlCallMapper` for mapping TSQL functions to Databricks SQL equivalents. Furthermore, the commit introduces new tests for TSQL specific function call mappers, ensuring proper translation of TSQL functions to Databricks SQL compatible constructs. These improvements collectively facilitate better integration and compatibility between TSQL and Databricks SQL. * Improve TSQL and Snowflake parser and lexer ([databrickslabs#757](databrickslabs#757)). In this release, the open-source library's Snowflake and TSQL lexers and parsers have been improved for better functionality and robustness. For the Snowflake lexer, unnecessary escape sequence processing has been removed, and various options have been corrected to be simple strings. The lexer now accepts a question mark as a placeholder for prepared statements in Snowflake statements. The TSQL lexer has undergone minor improvements, such as aligning the catch-all rule name with Snowflake. The Snowflake parser now accepts the question mark as a `PARAM` placeholder and simplifies the `typeFileformat` rule to accept a single `STRING` token. Additionally, several new keywords have been added to the TSQL lexer, improving consistency and clarity. These changes aim to simplify lexer and parser rules, enhance option handling and placeholders, and ensure consistency between Snowflake and TSQL. * Patch Information Schema Predicate Pushdown for Snowflake ([databrickslabs#764](databrickslabs#764)). 
In this release, we have implemented Information Schema Predicate Pushdown for Snowflake, resolving issue [databrickslabs#7](databrickslabs#7) * TSQL: Implement correct grammar for CREATE TABLE in all forms ([databrickslabs#796](databrickslabs#796)). In this release, the TSqlLexer's CREATE TABLE statement grammar has been updated and expanded to support new keywords and improve accuracy. The newly added keywords 'EDGE', 'FILETABLE', 'NODE', and `NODES` enable correct parsing of CREATE TABLE statements using graph nodes and FILETABLE functionality. Existing keywords such as 'DROP_EXISTING', 'DYNAMIC', 'FILENAME', and `FILTER` have been refined for better precision. Furthermore, the introduction of the `tableIndices` rule standardizes the order of columns in the table. These enhancements improve the T-SQL parser's robustness and consistency, benefiting users in creating and managing tables in their databases. * TSQL: Implement grammar for CREATE DATABASE and CREATE DATABASE SCOPED OPTION ([databrickslabs#788](databrickslabs#788)). In this release, we have implemented the TSQL grammar for `CREATE DATABASE` and `CREATE DATABASE SCOPED OPTION` statements, addressing inconsistencies with TSQL documentation. The implementation was initially intended to cover the entire process from grammar to code generation. However, to simplify other DDL statements, the work was split into separate grammar-only pull requests. The diff introduces new methods such as `createDatabaseScopedCredential`, `createDatabaseOption`, and `databaseFilestreamOption`, while modifying the existing `createDatabase` method. The `createDatabaseScopedCredential` method handles the creation of a database scoped credential, which was previously part of `createDatabaseOption`. The `createDatabaseOption` method now focuses on handling individual options, while `databaseFilestreamOption` deals with filesystem specifications. Note that certain options, like `DEFAULT_LANGUAGE`, `DEFAULT_FULLTEXT_LANGUAGE`, and more, have been marked as TODO and will be addressed in future updates. * TSQL: Improve transpilation coverage ([databrickslabs#766](databrickslabs#766)). In this update, various enhancements have been made to improve the coverage of TSQL transpilation and address bugs in code generation, particularly for the `ExpressionGenerator` class in the `com/databricks/labs/remorph/generators/sql` package, and the `TSqlExpressionBuilder`, `TSqlFunctionBuilder`, `TSqlCallMapper`, and `QueryRunner` classes. Changes include adding support for new cases, modifying code generation behavior, improving test coverage, and updating existing tests for better TSQL code generation. Specific additions include new methods for handling bitwise operations, converting CHECKSUM_AGG calls to a sequence of MD5 function calls, and handling Fn instances. The `QueryRunner` class has been updated to include both the actual and expected outputs in error messages for better debugging purposes. Additionally, the test file for the `DATEADD` function has been updated to ensure proper syntax and consistency. All these modifications aim to improve the reliability, accuracy, and compatibility of TSQL transpilation, ensuring better functionality and coverage for the Remorph library's transformation capabilities. * [chore] speedup build process by not running unit tests twice ([databrickslabs#842](databrickslabs#842)). 
* [chore] speedup build process by not running unit tests twice ([databrickslabs#842](databrickslabs#842)). In this commit, the build process has been optimized by skipping the execution of unit tests during the Maven build phase. The Apache Maven Surefire Plugin has been added, with group ID "org.apache.maven.plugins", artifact ID "maven-surefire-plugin", and version "3.1.2"; its configuration sets the `skipTests` attribute to "true", ensuring that tests are not run twice and thereby speeding up the build. The existing ScalaTest Maven plugin configuration remains unchanged, so Scala tests still execute during the test phase. Additionally, the Maven Compiler Plugin has been upgraded to version 3.11.0 with the release parameter set to 8, ensuring that the Java compiler used during the build is compatible with Java 8. The version numbers for several libraries, including os-lib, mainargs, ujson, scalatest, and exec-maven-plugin, are now defined using properties, allowing Maven to manage and cache these libraries more efficiently. These changes improve the build's performance and reliability without affecting existing functionality.
* [internal] better errors for call mapper ([databrickslabs#816](databrickslabs#816)). In this release, the `ExpressionGenerator` class in the `com.databricks.labs.remorph.generators.sql` package has been updated to handle exceptions during the conversion of input functions to Databricks expressions. A try-catch block has been added to catch `IndexOutOfBoundsException` and provide a more descriptive error message, including the name of the problematic function and the message associated with the exception. A `TranspileException` with the message `not implemented` is now thrown when encountering a function for which no translation to Databricks expressions is available. The `IsTranspiledFromSnowflakeQueryRunner` class in the `com.databricks.labs.remorph.coverage` package has also been updated to include the name of the exception class in the error message, for better error identification when a non-fatal error occurs during parsing. Additionally, the import statement for `Formatter` has been moved to keep imports in alphabetical order. These changes improve error handling and readability for developers working in the codebase.
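
  A hedged sketch of the wrapping idea; `TranspileException` and the error-message content come from the notes above, while the helper's name and signature are invented for illustration:

  ```scala
  // Illustrative only: wrap a conversion step so failures name the offending function.
  final case class TranspileException(msg: String) extends RuntimeException(msg)

  def convertOrExplain[A](functionName: String)(convert: => A): A =
    try convert
    catch {
      case e: IndexOutOfBoundsException =>
        throw TranspileException(s"Error converting function '$functionName': ${e.getMessage}")
    }
  ```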
* [snowflake] map more functions to Databricks SQL ([databrickslabs#826](databrickslabs#826)). This commit introduces new private methods `andPredicate` and `orPredicate` to the `ExpressionGenerator` class in the `com.databricks.labs.remorph.generators.sql` package, improving the generation of SQL expressions for the AND and OR logical operators and the readability and correctness of complex logical expressions. The `LogicalPlanGenerator` class in the `sql` package now supports more flexibility when inserting data into a target relation, letting users choose between overwriting the existing data or appending to it. The `FROM_JSON` function in the `CallMapper` class has been updated to accommodate an optional third argument, providing more flexibility in handling JSON-related transformations. A new class, `CastParseJsonToFromJson`, has been introduced to improve the performance of data-processing pipelines that parse JSON data in Snowflake using the `PARSE_JSON` function. Additional Snowflake SQL functions have been mapped to the Databricks SQL IR, enhancing compatibility and functionality. The `ExpressionGeneratorTest` class now generates predicates without parentheses, simplifying them and improving readability. The `sqlFiles` sequence in the `NestedFiles` class is now sorted before being mapped to `AcceptanceTest` objects, ensuring a consistent order for testing and debugging. A semicolon has been added to the end of a SQL query in a test file for Snowflake DML insert functionality, ensuring proper query termination.
* [sql] generate `INSERT INTO ...` ([databrickslabs#823](databrickslabs#823)). In this release, `ExpressionGenerator.scala` has been updated to render boolean literals in lowercase rather than uppercase, ensuring SQL code consistency. A new method `insert` has been added to the `LogicalPlanGenerator` class to generate INSERT INTO SQL statements based on the `InsertIntoTable` input, and a new case class `InsertIntoTable` that extends `Modification` has been introduced to simplify the API for DML operations other than SELECT. New test cases ensure the correct generation of INSERT and JOIN statements, and support has been added for generating INSERT INTO statements for specified database tables, improving cross-platform compatibility.
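
  A minimal sketch of this rendering, assuming a toy literal type (the real generator walks remorph's `InsertIntoTable` plan node rather than raw strings):

  ```scala
  // Toy literal type for illustration.
  sealed trait Lit
  final case class BoolLit(value: Boolean) extends Lit
  final case class StrLit(value: String) extends Lit

  def renderLit(l: Lit): String = l match {
    case BoolLit(b) => b.toString // lowercase: "true" / "false"
    case StrLit(s)  => "'" + s.replace("'", "''") + "'" // escape embedded quotes
  }

  def insertInto(table: String, columns: Seq[String], row: Seq[Lit]): String =
    s"INSERT INTO $table (${columns.mkString(", ")}) VALUES (${row.map(renderLit).mkString(", ")})"

  // insertInto("t", Seq("id", "active"), Seq(StrLit("a"), BoolLit(true)))
  //   ==> INSERT INTO t (id, active) VALUES ('a', true)
  ```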
* [sql] generate basic JSON access ([databrickslabs#835](databrickslabs#835)). In this release, the `ExpressionGenerator` class gains a new method, `jsonAccess`, which generates SQL code to access a JSON object's properties, handling the different types of elements in the path. The `TO_JSON` function in the `StructsToJson` class has been updated to accept an optional expression as an argument, enhancing its flexibility. The `SnowflakeCallMapper` class now includes a new method, `lift`, and support for generating basic JSON access, with corresponding updates to test cases. The SQL logical plan generator has been refined to generate star projections with escaped identifiers, handling complex table and database names. New methods and test cases have also been added to the `SnowflakeCallMapper` class to convert Snowflake structs into JSON strings and to cast Snowflake values to specific data types. These changes improve the library's ability to handle complex JSON data structures and the quality of the generated SQL.
* [sql] generate basic `CREATE TABLE` definition ([databrickslabs#829](databrickslabs#829)). In this release, a new `createTable` method has been added to the `LogicalPlanGenerator` class. It generates a `CREATE TABLE` definition for a given `ir.CreateTableCommand`, producing a SQL statement with a comma-separated list of column definitions. Each column definition includes the column name, data type, and any applicable constraints, generated using the `DataTypeGenerator.generateDataType` method and the newly introduced `constraint` method. Additionally, the `project` method has been updated to emit a `FROM` clause in the generated SQL statement when the input of the project node is not `ir.NoTable()`. These improvements allow the `LogicalPlanGenerator` to generate `CREATE TABLE` statements for input catalog ASTs, better supporting data-transformation use cases. A new test has been added to the `LogicalPlanGeneratorTest` class to validate the correct transpilation of a `CreateTableCommand` to a `CREATE TABLE` SQL statement.
* [sql] generate basic `TABLESAMPLE` ([databrickslabs#830](databrickslabs#830)). In this commit, the `LogicalPlanGenerator` class has been updated with a new method, `tableSample`, which generates SQL representations of table-sampling operations. Previously, the class only handled `INSERT`, `DELETE`, and `CREATE TABLE` commands; with this enhancement, the generator can now produce SQL statements using the `TABLESAMPLE` clause, selecting a sample of data from a table using one of several sampling methods and a seed value for repeatable sampling. The newly supported sampling methods include row-based probabilistic, row-based fixed amount, and block-based sampling. A new test case validates the correct transpilation of named tables and fixed-row sampling into the `TABLESAMPLE` clause with the specified parameters.

Dependency updates:

* Bump sqlglot from 25.6.1 to 25.8.1 ([databrickslabs#749](databrickslabs#749)).
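
A hedged sketch of how the `TABLESAMPLE` rendering described in [databrickslabs#830](databrickslabs#830) above might look; the case-class names and clause shapes are assumptions for illustration, not remorph's actual IR:

```scala
// Illustrative sampling methods mirroring the three cases named in the change notes.
sealed trait SamplingMethod
final case class RowProbability(percent: Double) extends SamplingMethod   // row-based probabilistic
final case class RowFixedAmount(rows: Long) extends SamplingMethod        // row-based fixed amount
final case class BlockProbability(percent: Double) extends SamplingMethod // block-based

def tableSample(method: SamplingMethod, seed: Option[Long]): String = {
  val spec = method match {
    case RowProbability(p)   => s"($p PERCENT)"
    case RowFixedAmount(n)   => s"($n ROWS)"
    case BlockProbability(p) => s"BLOCK ($p)"
  }
  val repeatable = seed.map(s => s" REPEATABLE ($s)").getOrElse("")
  s"TABLESAMPLE $spec$repeatable"
}

// tableSample(RowFixedAmount(50), Some(42L)) ==> "TABLESAMPLE (50 ROWS) REPEATABLE (42)"
```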
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request on Jan 3, 2025
The TSQL grammar for `CREATE DATABASE` and `CREATE DATABASE SCOPED OPTION` was not consistent with the TSQL documentation. Here we correct the grammar errors.

While it was originally my intention to implement these all the way through to code gen, the DDL needs some serious work, and I would rather implement the grammar for it all in a series of grammar-only PRs, then come back through and implement everything through to codegen in a subsequent set of PRs. The reason is that as each DDL statement is fixed at the grammar level, all the other DDL statements will gradually simplify at the same time, and I do not wish to keep changing the grammar and then have to tweak previously defined code gen.
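
For context, these are representative statements of the kind the corrected grammar is intended to accept — illustrative samples using standard T-SQL forms, not test cases from this PR:

```scala
// Illustrative T-SQL samples; driving the actual TSqlParser over them is left out,
// since the parser entry point is not shown in this PR description.
val createDatabaseSamples = Seq(
  "CREATE DATABASE Sales",
  "CREATE DATABASE Sales CONTAINMENT = PARTIAL",
  "CREATE DATABASE SCOPED CREDENTIAL AppCred WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '...'"
)
createDatabaseSamples.foreach(println)
```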