[FEATURE]: Create an Upgrade script to handle changes in newer versions #769

Closed
1 task done
vijaypavann-db opened this issue Aug 5, 2024 · 0 comments · Fixed by #777
Labels
enhancement (New feature or request) · feat/recon (making sure that remorphed query produces the same results as original)

Comments

@vijaypavann-db (Contributor) commented on Aug 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Category of feature request

Reconcile

Problem statement

A new column, `operation_name`, has been added to the main table.
The reconciliation process fails to write data to the existing main table because of this schema change.

Please find the error below:

AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: ).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.
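
For context, the two remedies named in the error map roughly onto the following PySpark calls. This is a minimal sketch only; the table name, the columns, and the `recon_df` stand-in are illustrative and are not the actual reconcile writer code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the reconcile output, now carrying the new column.
recon_df = spark.createDataFrame(
    [("recon-001", "row", "EXECUTE")],
    ["recon_id", "category", "operation_name"],
)

# Remedy 1: enable schema merging for this write only.
(recon_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("reconcile.main"))

# Remedy 2: enable automatic schema evolution for the whole session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```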

Proposed Solution

To address these scenarios, we need to develop an upgrade script that runs during installation and contains detailed instructions for upgrading from one version to the next,
for example from 0.4.0 to 0.4.1.
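
A minimal sketch of what such an upgrade step could look like for this particular change, assuming the column can simply be added in place with `ALTER TABLE`; the table name and helper are hypothetical, not the script that eventually shipped in #777:

```python
from pyspark.sql import SparkSession

def upgrade_main_table(spark: SparkSession, table: str = "reconcile.main") -> None:
    """Add the operation_name column introduced after v0.4.0, if it is missing."""
    existing = {field.name for field in spark.table(table).schema.fields}
    if "operation_name" not in existing:
        spark.sql(f"ALTER TABLE {table} ADD COLUMNS (operation_name STRING)")

if __name__ == "__main__":
    upgrade_main_table(SparkSession.builder.getOrCreate())
```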

Additional Context

Check for ways to automate the process while running reconcile.
The current changes cover the schema and the workflow.
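
One way the automation could be approached, sketched here purely as an assumption (the expected-column set, table name, and helper name are illustrative):

```python
from pyspark.sql import SparkSession

# Illustrative subset of the columns the current release expects.
EXPECTED_COLUMNS = {"recon_id", "operation_name"}

def ensure_schema_is_current(spark: SparkSession, table: str = "reconcile.main") -> None:
    """Fail fast before a reconcile run if the main table predates the schema change."""
    actual = {field.name for field in spark.table(table).schema.fields}
    missing = EXPECTED_COLUMNS - actual
    if missing:
        raise RuntimeError(
            f"{table} is missing columns {sorted(missing)}; "
            "run the upgrade script before reconciling."
        )
```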

@vijaypavann-db added the enhancement and feat/recon labels on Aug 5, 2024
@vijaypavann-db changed the title from "[FEATURE]: Create an Upgrade script to handle (Schema) changes in newer versions" to "[FEATURE]: Create an Upgrade script to handle changes in newer versions" on Aug 7, 2024
@nfx closed this as completed in #777 on Sep 18, 2024
@nfx closed this as completed in commit 1c6c890 on Sep 18, 2024
vil1 pushed a commit that referenced this issue Sep 20, 2024
@nfx mentioned this issue on Oct 10, 2024
nfx added a commit that referenced this issue Oct 10, 2024
* Added private key authentication for sf ([#917](#917)). This commit adds support for private key authentication to the Snowflake data source connector, providing users with more flexibility and security. The `cryptography` library is used to process the user-provided private key, with priority given to the `pem_private_key` secret, followed by the `sfPassword` secret. If neither secret is found, an exception is raised. However, password-based authentication is still used when JDBC options are provided, as Spark JDBC does not currently support private key authentication. A new exception class, `InvalidSnowflakePemPrivateKey`, has been added for handling invalid or malformed private keys. Additionally, new tests have been included for reading data with private key authentication, handling malformed private keys, and checking for missing authentication keys. The notice has been updated to include the `cryptography` library's copyright and license information.
* Added support for `PARSE_JSON` and `VARIANT` datatype ([#906](#906)). This commit introduces support for the `PARSE_JSON` function and `VARIANT` datatype in the Snowflake parser, addressing issue [#894](#894). The implementation removes the experimental dialect, enabling support for the `VARIANT` datatype and using `PARSE_JSON` for it. The `variant_explode` function is also utilized. During transpilation to Snowflake, whenever the `:` operator is encountered in the SELECT statement, everything will be treated as a `VARIANT` on the Databricks side to handle differences between Snowflake and Databricks in accessing variant types. These changes are implemented using ANTLR.
* Added upgrade script and modified metrics sql ([#990](#990)). In this release, the open-source library has been updated with several improvements to the metrics system, database upgrades, and setup process. The setup_spark_remote.sh script now checks if the Spark server is running by pinging localhost:4040, rather than sleeping for a fixed time, allowing for faster execution and more accurate server status determination. The metrics table's insert statement has been updated to cast values to Bigint for better handling of larger counts. An upgrade script has been added to facilitate required modifications, and the setup_spark_remote.sh script has been modified to validate URLs. A new Python file for upgrading the metrics table's data types has been added, which includes a function to recreate the table with the correct data types for specific columns. The upgrade_common module now includes several functions for upgrading database tables, and a new unit test file, test_upgrade_common.py, has been added with test cases for these functions. Lastly, the upgrade script for v0.4.0 has been updated to simplify the process of checking if the main table in the reconcile metadata needs to be recreated and to add an `operation_name` column. These changes improve the library's functionality, accuracy, and robustness, particularly for larger datasets and upgrading processes, enhancing the overall user experience.
* Basic CTAS Implementation ([#968](#968)). This pull request adds basic support for the CREATE TABLE AS SELECT (CTAS) statement in Snowflake, enabling users to create a new table by selecting data from an existing table or query. In the LogicalPlanGenerator class, a new method has been implemented to handle CTAS statements, which generates the appropriate SQL command for creating a table based on the result of a select query. The SnowflakeDDLBuilder class now includes a relationBuilder class member for building relations based on Snowflake DDL input, and the visitCreateTableAsSelect method has been overridden to parse CTAS statements and construct corresponding IR objects. The test suite has been expanded to include a new spec for CTAS statements and a test case for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;". Additionally, a new test file "test_ctas_complex.sql" has been added, containing SQL statements for creating a new table by selecting columns from multiple tables and computing new columns through various data manipulations. The implementation also includes adding new SQL statements for CTAS in both Snowflake and Databricks dialects, allowing for testing the CTAS functionality for both.
* Create repeatable estimator for Snowflake query history ([#924](#924)). This commit introduces a new coverage estimation tool for analyzing query history in a database, initially implemented for Snowflake. The tool parses and transpiles query history into Databricks SQL and reports on the percentage of query history it can process. It includes a new `SnowflakePlanParser` class that handles Snowflake query plans, a `SqlGenerator` class that generates Databricks SQL from optimized logical plans, and a `dialect` method that returns the dialect string. The long-term plan is to extend this functionality to other supported databases and dialects and include a report on SQL complexity. Additionally, test cases have been added to the `AnonymizerTest` class to ensure the correct functionality of the `Anonymizer` class, which anonymizes executed queries when provided with a `PlanParser` object. The `Anonymizer` class is intended to be used as part of the coverage estimation tool, which will provide analysis of query history for various databases.
* Created a mapping dict for algo for each dialect at layer level ([#934](#934)). A series of changes have been implemented to improve the reconciliation process and the handling of hash algorithms in the open-source library. A mapping dictionary algorithm to dialect has been introduced at the layer level to enhance the reconciliation process. The `get_hash_transform` function now accepts a new `layer` argument and returns a list of hash algorithms from the `HashAlgoMapping` dictionary. A new `HashAlgoMapping` class has been added to map algorithms to a dialect for hashing, replacing the previous `DialectHashConfig` class. A new function `get_dialect` has been introduced to retrieve the dialect based on the layer. The `_hash_transform` function and the `build_query` method have been updated to use the `layer` parameter when determining the dialect. These changes provide more precise control over the algorithm used for hash transformation based on the source layer and the target dialect, resulting in improved reconciliation performance and accuracy.
* Fetch TableDefinitions from Snowflake ([#904](#904)). A new `SnowflakeTableDefinitions` class has been added to simplify the discovery of Snowflake table metadata. This class establishes a connection with a Snowflake database through a Connection object, and provides methods such as `getDataType` and `getTableDefinitionQuery` to parse data types and generate queries for table definitions. Moreover, it includes a `getTableDefinitions` method to retrieve all table definitions in a Snowflake database as a sequence of `TableDefinition` objects, which encapsulates various properties of each table. The class also features methods for retrieving all catalogs and schemas in a Snowflake database. Alongside `SnowflakeTableDefinitions`, a new test class, `SnowflakeTableDefinitionTest`, has been introduced to verify the behavior of `getTableDefinitions` and ensure that the class functions as intended, adhering to the desired behavior.
* Guide user on missing configuration file ([#930](#930)). In this commit, the `_verify_recon_table_config` method in the `runner.py` file of the `databricks/labs/remorph` package has been updated to handle missing reconcile table configurations during installation. When the reconcile table configuration is not found, an error message will now display the name of the requested configuration file. This enhancement helps users identify the specific configuration file they need to provide in their workspace, addressing issue [#919](#919). This commit is co-authored by Ludovic Claude.
* Implement more missing visitor functions for Snowflake and TSQL ([#975](#975)). In this release, we have added several missing visitor methods for the Snowflake and TSQL builder classes to improve the reliability and maintainability of our parser. Previously, when a visitor method was missing, the default visitor was called, causing the visit of all children of the ParseTree, which was not ideal. This could lead to incorrect results due to a slight modification in the ANTLR grammar inadvertently breaking the visitor. In this release, we have implemented several new visitor methods for both Snowflake and TSQL builder classes, including the `visitDdlCommand` method in the `SnowflakeDDLBuilder` class and the `visitDdlClause` method in the `TSqlDDLBuilder` class. These new methods ensure that the ParseTree is traversed correctly and that the correct IR node is returned. The `visitDdlCommand` method checks for different types of DDL commands, such as create, alter, drop, and undrop, and calls the appropriate method for each type. The `visitDdlClause` method contains a sequence of methods corresponding to different DDL clauses and calls the first non-null method in the sequence. These changes significantly improve the robustness of our parser and enhance the reliability of our code.
* Introduce typed errors ([#981](#981)). This commit introduces typed errors in the form of a new class, `UnexpectedNode`, and several case classes including `ParsingError`, `UnsupportedDataType`, `WrongNumberOfArguments`, `UnsupportedArguments`, and `UnsupportedDateTimePart` in various packages, as part of the ongoing effort to replace exception throwing with returning `Result` types in future pull requests. These changes will improve error handling and provide more context and precision for errors, facilitating debugging and maintenance of the remorph library and data type generation functionality. The `TranspileException` class is now constructed with specific typed error instances, and the `ErrorCollector` and `ErrorDetail` classes have been updated to use `ParsingError`. Additionally, the `SnowflakeCallMapper` and `SnowflakeTimeUnits` classes have been updated to use the new typed error mechanism, providing more precise error handling for Snowflake-specific functions and expressions.
* Miscellaneous improvements to Snowflake parser ([#952](#952)). This diff brings several miscellaneous improvements to the Snowflake parser in the open-source library, targeting increased parse and transpilation success rates. The modifications include updating the `colDecl` rule to allow optional data types, introducing an `objectField` rule, and enabling date and timestamp literals as strings. Additionally, the parser has been refined to handle identifiers more efficiently, such as hashes within the AnonymizerTest. The expected Ast for certain test cases has also been updated to improve parser accuracy. These changes aim to create a more robust and streamlined Snowflake parser, minimizing parsing errors and enhancing overall user experience for project adopters. Furthermore, the error handling and reporting capabilities of the Snowflake parser have been improved with new case classes, `IndividualError` and `ErrorsSummary`, and updated error messages.
* Moved intermediate package out of parsers ([#972](#972)). In this release, the `intermediate` package has been refactored out of the `parsers` package, aligning with the design principle that parsers should depend on the intermediate representation instead of the other way around. This change affects various classes and methods across the project, all of which have been updated to import the `intermediate` package from its new location. No new functionality has been introduced, but the refactoring improves the package structure and dependency management. The `EstimationAnalyzer` class in the `coverage/estimation` package has been updated to import classes from the new location of the `intermediate` package, and its `evaluateTree` method has been updated to use the new import path for `LogicalPlan` and `Expression`. Other affected classes include `SnowflakeTableDefinitions`, `SnowflakeLexer`, `SnowflakeParser`, `SnowflakeTypeBuilder`, `GeneratorContext`, `DataTypeGenerator`, `IRHelpers`, and multiple test files.
* Patch Function without Brackets ([#907](#907)). This commit introduces new lexer and parser rules to handle Snowflake SQL functions without parentheses, specifically impacting CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new rules allow these functions to be used without parentheses, consistent with Snowflake SQL. This change fixes functional tests and includes documentation for the affected functions. However, there is a pending task to add or fix more test cases to ensure comprehensive testing of the new rules. Additionally, the syntax of the SELECT statement for the CURRENT_TIMESTAMP function has been updated, removing the need for the parameter 'col1'. This change simplifies the syntax for certain SQL functions in the codebase and improves the consistency and reliability of the functional tests.
* Root Table ([#936](#936)). The PR #<pr-number> introduces a new class `TableGraph` that extends `DependencyGraph` and implements the `LazyLogging` trait. This class builds a graph of tables and their dependencies based on query history and table definitions. It provides methods to add nodes and edges, build the graph, and retrieve root, upstream, and downstream tables. The `DependencyGraph` trait offers a more structured and flexible way to handle table dependencies. This change is part of the Root Table feature (issue [#936](#936)) that identifies root tables in a graph of table dependencies, closing issue [#23](#23). The PR includes a new `TableGraphTest` class that demonstrates the use of these methods and verifies their behavior for better data flow understanding and optimization.
* Snowflake Merge Implementation ([#964](#964)). In this release, we have implemented the Merge statement for the Snowflake parser, which enables updating or deleting rows in a target table based on matches with a source table, and inserting new rows into the target table when there are no matches. This feature includes updates to the SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for proper handling of column names and MERGE queries. Additionally, we have added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec, and SnowflakeToDatabricksTranspiler to ensure the accurate translation and execution of MERGE statements for the Snowflake dialect. These changes bring important database migration and synchronization capabilities to our open-source library, improving its functionality and usability for software engineers.
* TSQL: Implement CREATE TABLE ([#911](#911)). This commit implements the TSQL CREATE TABLE command and its various options and forms, including CTAS, graph node syntax, and analytics variants, as well as syntactical differences for SQL Server. The DDL and DML visitors have been moved from the AST and Relation visitors to separate classes for better responsibility segregation. The LogicalPlanGenerator class has been updated to generate unique constraints, primary keys, foreign keys, check constraints, default value constraints, and identity constraints for the CREATE TABLE command. Additionally, new classes for generating SQL options and handling unresolved options during transpilation have been added to enhance the parser's capability to manage various options and forms. These changes improve the transpilation of TSQL code and the organization of the codebase, making it easier to maintain and extend.
* Transpile Snow ARRAY_SORT function ([#973](#973)). In this release, we have implemented support for the Snowflake ARRAY_SORT function in our open-source library. This feature has been added as part of issue [#973](#973), and it involves the addition of two new private methods, `arraySort` and `makeArraySort`, to the `SnowflakeCallMapper` class. The `arraySort` method takes a sequence of expressions as input and sorts the array using the `makeArraySort` method. The `makeArraySort` method handles both null and non-null values, sorts the array in ascending or descending order based on the provided parameter, and determines the position of null or small values based on the nulls first parameter. The sorted array is then returned as an `ir.ArraySort` expression. This functionality allows for the sorting of arrays in Snowflake SQL to be translated to equivalent code in the target language. This enhancement simplifies the process of working with arrays in Snowflake SQL and provides users with a more streamlined experience.
* Transpile Snow MONTHS_BETWEEN function correctly ([#963](#963)). In this release, the remorph library's SnowflakeCallMapper class in the com/databricks/labs/remorph/parsers/snowflake/rules package has been updated to handle the MONTHS_BETWEEN function. A new case has been added that creates a MonthsBetween object with the first two arguments of the function call and a boolean value of true. This change enhances compatibility and ensures that the output accurately reflects the intended functionality. Additionally, new test cases have been introduced to the SnowflakeCallMapperSpec for the transpilation of the MONTHS_BETWEEN function. These test cases demonstrate accurate mapping of the function to the MonthsBetween class and proper casting of inputs as dates or timestamps, improving the reliability and precision of date and time calculations.
* Updated Installation to handle install errors ([#962](#962)). In this release, we've made significant improvements to the `remorph` project, addressing and resolving installation errors that were occurring during the installation process in development mode. We've introduced a new `ProductInfo` class in the `wheels` module, which provides information about the products being installed. This change replaces the use of `WheelsV2` in two test functions. Additionally, we've updated the `workspace_installation` method in `application.py` to handle installation errors more effectively, addressing the dependency on workspace `.remorph` due to wheels. We've also added new methods to `installation.py` to manage local and remote version files, and updated the `_upgrade_reconcile_workflow` function to ensure the correct wheel path is used during installation. These changes improve the overall quality of the codebase, making it easier for developers to adopt and maintain the project, and ensure a more seamless installation experience for users.
* Updated catalog operations logging ([#910](#910)). In this release, the setup process for the catalog, schema, and volume in the configurator module has been simplified and improved. The previous implementation repeatedly prompted the user for input until the correct input was provided or a maximum number of attempts was reached. The updated code now checks if the catalog, schema, or volume already exists and either uses it or prompts the user to create it once. If the user does not have the necessary privileges to use the catalog, schema, or volume, an error message is logged and the installation is aborted. New methods have been added to check for necessary privileges, such as `has_necessary_catalog_access`, `has_necessary_schema_access`, and `has_necessary_volume_access`, which return a boolean indicating whether the user has the necessary privileges and log an error message with the missing privileges if not. The logging for catalog operations in the install.py file has also been updated to check for privileges at the end of the process and list any missing privileges for each catalog object. Additionally, changes have been made to the unit tests for the ResourceConfigurator class to ensure that the system handles cases where the user does not have the necessary permissions to access catalogs, schemas, or volumes, preventing unauthorized access and maintaining the security and integrity of the system.
* Updated remorph reconcile workflow to use wheels instead of pypi ([#884](#884)). In this release, the installation process for the Remorph library has been updated to allow for the use of locally uploaded wheel files instead of downloading the package from PyPI. This change includes updates to the `install` and `_deploy_jobs` methods in the `recon.py` file to accept a new `wheel_paths` argument, which is used to pass the path of the Remorph wheel file to the `deploy_recon_job` method. The `_upgrade_reconcile_workflow` function in the `v0.4.0_add_main_table_operation_name_column.py` file has also been updated to upload the wheel package to the workspace and pass its path to the `deploy_reconcile_job` method. Additionally, the `deploy_recon_job` method in the `JobDeployment` class now accepts a new `wheel_file` argument, which represents the name of the wheel file for the remorph library. These changes address issues faced by customers with no public internet access and enable the use of new features before they are released on PyPI. The `test_recon.py` file in the `tests/unit/deployment` directory has also been updated to reflect these changes.
* Upgrade script Implementation ([#777](#777)). In this release, we've implemented an upgrade script as part of pull request [#777](#777), which resolves issue [#769](#769). This change introduces a new `Upgrades` class in `application.py` that accepts `product_info` and `installation` as parameters and includes a cached property `wheels` for improved performance. Additionally, we've added new methods to the `WorkspaceInstaller` class for handling upgrade-related tasks, including the creation of a `ProductInfo` object, interacting with the Databricks SDK, and handling potential errors. We've also added a test case to ensure that upgrades are applied correctly on more recent versions. These changes are part of our ongoing effort to enhance the management and application of upgrades to installed products.
* bug fix for to_array function ([#961](#961)). A bug fix has been implemented to improve the `TO_ARRAY` function in our open-source library. Previously, this function expected only one parameter, but it has been updated to accept two parameters, with the second being optional. This change brings the function in line with other functions in the class, improving flexibility and ensuring backward compatibility. The `TO_ARRAY` function is used to convert a given expression to an array if it is not null and return null otherwise. The commit also includes updates to the `Generator` class, where a new entry for the `ToArray` expression has been added to the `expression_map` dictionary. Additionally, a new `ToArray` class has been introduced as a subclass of `Func`, allowing the function to handle a variable number of arguments more gracefully. Relevant updates have been made to the functional tests for the `to_array` function for both Snowflake and Databricks SQL, demonstrating its handling of null inputs and comparing it with the corresponding ARRAY function in each SQL dialect. Overall, these changes enhance the functionality and adaptability of the `TO_ARRAY` function.
* feat: Implement all of TSQL predicates except for SOME ALL ANY ([#922](#922)). In this commit, we have implemented the IR generation for several TSQL predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT, thereby improving the parser's ability to handle a wider range of TSQL syntax. The `TSqlParser` class has been updated with new methods and changes to existing ones, including the addition of new labeled expressions to the `predicate` rule. Additionally, we have corrected an error in the LIKE predicate's implementation, allowing the ESCAPE character to accept a full expression that evaluates to a single character at runtime, rather than assuming it to be a single character at parse time. These changes provide more flexibility and adherence to the TSQL standard, enhancing the overall functionality of the project for our adopters.
nfx added a commit that referenced this issue Oct 10, 2024
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 2, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 2, 2025
* Added private key authentication for sf
([databrickslabs#917](databrickslabs#917)). This
commit adds support for private key authentication to the Snowflake data
source connector, providing users with more flexibility and security.
The `cryptography` library is used to process the user-provided private
key, with priority given to the `pem_private_key` secret, followed by
the `sfPassword` secret. If neither secret is found, an exception is
raised. However, password-based authentication is still used when JDBC
options are provided, as Spark JDBC does not currently support private
key authentication. A new exception class,
`InvalidSnowflakePemPrivateKey`, has been added for handling invalid or
malformed private keys. Additionally, new tests have been included for
reading data with private key authentication, handling malformed private
keys, and checking for missing authentication keys. The notice has been
updated to include the `cryptography` library's copyright and license
information.
* Added support for `PARSE_JSON` and `VARIANT` datatype
([databrickslabs#906](databrickslabs#906)). This
commit introduces support for the `PARSE_JSON` function and `VARIANT`
datatype in the Snowflake parser, addressing issue
[databrickslabs#894](databrickslabs#894). The
implementation removes the experimental dialect, enabling support for
the `VARIANT` datatype and using `PARSE_JSON` for it. The
`variant_explode` function is also utilized. During transpilation to
Snowflake, whenever the `:` operator is encountered in the SELECT
statement, everything will be treated as a `VARIANT` on the Databricks
side to handle differences between Snowflake and Databricks in accessing
variant types. These changes are implemented using ANTLR.
* Added upgrade script and modified metrics sql
([databrickslabs#990](databrickslabs#990)). In this
release, the open-source library has been updated with several
improvements to the metrics system, database upgrades, and setup
process. The setup_spark_remote.sh script now checks if the Spark server
is running by pinging localhost:4040, rather than sleeping for a fixed
time, allowing for faster execution and more accurate server status
determination. The metrics table's insert statement has been updated to
cast values to Bigint for better handling of larger counts. An upgrade
script has been added to facilitate required modifications, and the
setup_spark_remote.sh script has been modified to validate URLs. A new
Python file for upgrading the metrics table's data types has been added,
which includes a function to recreate the table with the correct data
types for specific columns. The upgrade_common module now includes
several functions for upgrading database tables, and a new unit test
file, test_upgrade_common.py, has been added with test cases for these
functions. Lastly, the upgrade script for v0.4.0 has been updated to
simplify the process of checking if the main table in the reconcile
metadata needs to be recreated and to add an `operation_name` column.
These changes improve the library's functionality, accuracy, and
robustness, particularly for larger datasets and upgrading processes,
enhancing the overall user experience.
* Basic CTAS Implementation
([databrickslabs#968](databrickslabs#968)). This
pull request adds basic support for the CREATE TABLE AS SELECT (CTAS)
statement in Snowflake, enabling users to create a new table by
selecting data from an existing table or query. In the
LogicalPlanGenerator class, a new method has been implemented to handle
CTAS statements, which generates the appropriate SQL command for
creating a table based on the result of a select query. The
SnowflakeDDLBuilder class now includes a relationBuilder class member
for building relations based on Snowflake DDL input, and the
visitCreateTableAsSelect method has been overridden to parse CTAS
statements and construct corresponding IR objects. The test suite has
been expanded to include a new spec for CTAS statements and a test case
for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;".
Additionally, a new test file "test_ctas_complex.sql" has been added,
containing SQL statements for creating a new table by selecting columns
from multiple tables and computing new columns through various data
manipulations. The implementation also includes adding new SQL
statements for CTAS in both Snowflake and Databricks dialects, allowing
for testing the CTAS functionality for both.
* Create repeatable estimator for Snowflake query history
([databrickslabs#924](databrickslabs#924)). This
commit introduces a new coverage estimation tool for analyzing query
history in a database, initially implemented for Snowflake. The tool
parses and transpiles query history into Databricks SQL and reports on
the percentage of query history it can process. It includes a new
`SnowflakePlanParser` class that handles Snowflake query plans, a
`SqlGenerator` class that generates Databricks SQL from optimized
logical plans, and a `dialect` method that returns the dialect string.
The long-term plan is to extend this functionality to other supported
databases and dialects and include a report on SQL complexity.
Additionally, test cases have been added to the `AnonymizerTest` class
to ensure the correct functionality of the `Anonymizer` class, which
anonymizes executed queries when provided with a `PlanParser` object.
The `Anonymizer` class is intended to be used as part of the coverage
estimation tool, which will provide analysis of query history for
various databases.
* Created a mapping dict for algo for each dialect at layer level
([databrickslabs#934](databrickslabs#934)). A series
of changes have been implemented to improve the reconciliation process
and the handling of hash algorithms in the open-source library. A
dictionary mapping each dialect to its hash algorithm has been introduced
at the layer (source/target) level. The `get_hash_transform`
function now accepts a new `layer` argument and returns a list of hash
algorithms from the `HashAlgoMapping` dictionary. A new
`HashAlgoMapping` class has been added to map algorithms to a dialect
for hashing, replacing the previous `DialectHashConfig` class. A new
function `get_dialect` has been introduced to retrieve the dialect based
on the layer. The `_hash_transform` function and the `build_query`
method have been updated to use the `layer` parameter when determining
the dialect. These changes provide more precise control over the
algorithm used for hash transformation based on the source layer and the
target dialect, resulting in improved reconciliation performance and
accuracy.
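Conceptually, the hash algorithm is now resolved from both the dialect and the layer. The sketch below is a hypothetical Python rendering of that idea; the class and function names echo the ones mentioned above, but the concrete algorithms, fields, and signatures are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HashAlgoMapping:
    source: Callable[[str], str]   # hash expression used on the source layer
    target: Callable[[str], str]   # hash expression used on the target layer

def sha256_expr(column: str) -> str:
    return f"sha2(concat_ws('|', {column}), 256)"

def md5_expr(column: str) -> str:
    return f"md5(concat_ws('|', {column}))"

# Illustrative only: which algorithm each dialect uses per layer.
HASH_ALGO_MAPPING = {
    "snowflake": HashAlgoMapping(source=sha256_expr, target=sha256_expr),
    "oracle": HashAlgoMapping(source=md5_expr, target=sha256_expr),
}

def get_hash_transform(dialect: str, layer: str) -> Callable[[str], str]:
    mapping = HASH_ALGO_MAPPING[dialect]
    return mapping.source if layer == "source" else mapping.target
```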
* Fetch TableDefinitions from Snowflake
([databrickslabs#904](databrickslabs#904)). A new
`SnowflakeTableDefinitions` class has been added to simplify the
discovery of Snowflake table metadata. This class establishes a
connection with a Snowflake database through a Connection object, and
provides methods such as `getDataType` and `getTableDefinitionQuery` to
parse data types and generate queries for table definitions. Moreover,
it includes a `getTableDefinitions` method to retrieve all table
definitions in a Snowflake database as a sequence of `TableDefinition`
objects, each of which encapsulates the properties of a table. The class
also features methods for retrieving all catalogs and schemas in a
Snowflake database. Alongside `SnowflakeTableDefinitions`, a new test
class, `SnowflakeTableDefinitionTest`, has been introduced to verify the
behavior of `getTableDefinitions` and ensure that the class functions as
intended.
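Under the hood this amounts to querying Snowflake's INFORMATION_SCHEMA over a connection and folding the rows into `TableDefinition`-like records. A rough Python sketch over a generic DB-API connection is shown below; the query shape and record fields are assumptions rather than the exact query built by `getTableDefinitionQuery`.

```python
from dataclasses import dataclass

@dataclass
class TableDefinition:
    catalog: str
    schema: str
    table: str
    columns: list[tuple[str, str]]  # (column_name, data_type)

def get_table_definitions(conn, catalog: str) -> list[TableDefinition]:
    """Collect table definitions from Snowflake's INFORMATION_SCHEMA."""
    cursor = conn.cursor()
    cursor.execute(
        f"""
        SELECT table_schema, table_name, column_name, data_type
        FROM {catalog}.INFORMATION_SCHEMA.COLUMNS
        ORDER BY table_schema, table_name, ordinal_position
        """
    )
    tables: dict[tuple[str, str], TableDefinition] = {}
    for schema, table, column, data_type in cursor.fetchall():
        key = (schema, table)
        if key not in tables:
            tables[key] = TableDefinition(catalog, schema, table, [])
        tables[key].columns.append((column, data_type))
    return list(tables.values())
```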
* Guide user on missing configuration file
([databrickslabs#930](databrickslabs#930)). In this
commit, the `_verify_recon_table_config` method in the `runner.py` file
of the `databricks/labs/remorph` package has been updated to handle
missing reconcile table configurations during installation. When the
reconcile table configuration is not found, an error message will now
display the name of the requested configuration file. This enhancement
helps users identify the specific configuration file they need to
provide in their workspace, addressing issue
[databrickslabs#919](databrickslabs#919). This
commit is co-authored by Ludovic Claude.
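The behaviour boils down to failing fast with the name of the file the installer expected. A hypothetical sketch of that check follows; the loader call, error type, and filename scheme are assumptions based on the description above.

```python
class ReconcileConfigNotFoundError(Exception):
    pass

def verify_recon_table_config(installation, source: str) -> None:
    # e.g. "recon_config_snowflake.json"; the real filename scheme may differ.
    filename = f"recon_config_{source}.json"
    try:
        installation.load_table_config(filename)  # hypothetical loader
    except FileNotFoundError as err:
        raise ReconcileConfigNotFoundError(
            f"Cannot find reconcile table configuration `{filename}` in the "
            "workspace. Please create it before running the reconcile workflow."
        ) from err
```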
* Implement more missing visitor functions for Snowflake and TSQL
([databrickslabs#975](databrickslabs#975)). In this
release, we have added several missing visitor methods for the Snowflake
and TSQL builder classes to improve the reliability and maintainability
of our parser. Previously, when a visitor method was missing, the
default visitor was called, causing the visit of all children of the
ParseTree, which was not ideal: a slight modification to the ANTLR grammar
could inadvertently break a visitor and produce incorrect results.
Specifically, we have implemented new visitor
methods for both Snowflake and TSQL builder classes, including the
`visitDdlCommand` method in the `SnowflakeDDLBuilder` class and the
`visitDdlClause` method in the `TSqlDDLBuilder` class. These new methods
ensure that the ParseTree is traversed correctly and that the correct IR
node is returned. The `visitDdlCommand` method checks for different
types of DDL commands, such as create, alter, drop, and undrop, and
calls the appropriate method for each type. The `visitDdlClause` method
contains a sequence of methods corresponding to different DDL clauses
and calls the first non-null method in the sequence. These changes
significantly improve the robustness of our parser and enhance the
reliability of our code.
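The failure mode being fixed is easiest to see with a generic visitor: when a node kind has no dedicated `visit...` method, the default implementation only walks the children, so whatever the node itself carries is silently dropped. The Python sketch below is a language-neutral illustration of that pattern, not the ANTLR-generated code.

```python
class Node:
    def __init__(self, kind, children=(), payload=None):
        self.kind, self.children, self.payload = kind, list(children), payload

class DefaultVisitor:
    def visit(self, node):
        method = getattr(self, f"visit_{node.kind}", self.visit_children)
        return method(node)

    def visit_children(self, node):
        # Fallback: visit children and keep only the last result, so any
        # information carried by `node` itself is silently lost.
        result = None
        for child in node.children:
            result = self.visit(child)
        return result

class DdlVisitor(DefaultVisitor):
    def visit_ddl_command(self, node):
        # Explicit handling: dispatch on the command type instead of falling
        # back to the lossy child walk.
        return {"command": node.payload,
                "children": [self.visit(c) for c in node.children]}

tree = Node("ddl_command", [Node("identifier", payload="t1")], payload="CREATE")
print(DdlVisitor().visit(tree))  # {'command': 'CREATE', 'children': [None]}
```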
* Introduce typed errors
([databrickslabs#981](databrickslabs#981)). This
commit introduces typed errors in the form of a new class,
`UnexpectedNode`, and several case classes including `ParsingError`,
`UnsupportedDataType`, `WrongNumberOfArguments`, `UnsupportedArguments`,
and `UnsupportedDateTimePart` in various packages, as part of the
ongoing effort to replace exception throwing with returning `Result`
types in future pull requests. These changes will improve error handling
and provide more context and precision for errors, facilitating
debugging and maintenance of the remorph library and data type
generation functionality. The `TranspileException` class is now
constructed with specific typed error instances, and the
`ErrorCollector` and `ErrorDetail` classes have been updated to use
`ParsingError`. Additionally, the `SnowflakeCallMapper` and
`SnowflakeTimeUnits` classes have been updated to use the new typed
error mechanism, providing more precise error handling for
Snowflake-specific functions and expressions.
* Miscellaneous improvements to Snowflake parser
([databrickslabs#952](databrickslabs#952)). This
diff brings several miscellaneous improvements to the Snowflake parser
in the open-source library, targeting increased parse and transpilation
success rates. The modifications include updating the `colDecl` rule to
allow optional data types, introducing an `objectField` rule, and
enabling date and timestamp literals as strings. Additionally, the
parser has been refined to handle identifiers more efficiently, such as
hashes within the AnonymizerTest. The expected AST for certain test
cases has also been updated to improve parser accuracy. These changes
aim to create a more robust and streamlined Snowflake parser, minimizing
parsing errors and enhancing overall user experience for project
adopters. Furthermore, the error handling and reporting capabilities of
the Snowflake parser have been improved with new case classes,
`IndividualError` and `ErrorsSummary`, and updated error messages.
* Moved intermediate package out of parsers
([databrickslabs#972](databrickslabs#972)). In this
release, the `intermediate` package has been refactored out of the
`parsers` package, aligning with the design principle that parsers
should depend on the intermediate representation instead of the other
way around. This change affects various classes and methods across the
project, all of which have been updated to import the `intermediate`
package from its new location. No new functionality has been introduced,
but the refactoring improves the package structure and dependency
management. The `EstimationAnalyzer` class in the `coverage/estimation`
package has been updated to import classes from the new location of the
`intermediate` package, and its `evaluateTree` method has been updated
to use the new import path for `LogicalPlan` and `Expression`. Other
affected classes include `SnowflakeTableDefinitions`, `SnowflakeLexer`,
`SnowflakeParser`, `SnowflakeTypeBuilder`, `GeneratorContext`,
`DataTypeGenerator`, `IRHelpers`, and multiple test files.
* Patch Function without Brackets
([databrickslabs#907](databrickslabs#907)). This
commit introduces new lexer and parser rules to handle Snowflake SQL
functions without parentheses, specifically impacting CURRENT_DATE,
CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new
rules allow these functions to be used without parentheses, consistent
with Snowflake SQL. This change fixes functional tests and includes
documentation for the affected functions. However, there is a pending
task to add or fix more test cases to ensure comprehensive testing of
the new rules. Additionally, the syntax of the SELECT statement for the
CURRENT_TIMESTAMP function has been updated, removing the need for the
parameter 'col1'. This change simplifies the syntax for certain SQL
functions in the codebase and improves the consistency and reliability
of the functional tests.
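Spark SQL likewise accepts these niladic functions without parentheses, so the transpiled form stays close to the input. A quick Databricks-side check, assuming a running Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both forms are valid Spark SQL; the parser change lets the Snowflake side
# accept the bracket-less spelling as well.
spark.sql("SELECT current_date, current_timestamp").show(truncate=False)
spark.sql("SELECT current_date(), current_timestamp()").show(truncate=False)
```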
* Root Table
([databrickslabs#936](databrickslabs#936)). This PR
introduces a new class `TableGraph` that extends
`DependencyGraph` and implements `LazyLogging` trait. This class builds
a graph of tables and their dependencies based on query history and
table definitions. It provides methods to add nodes and edges, build the
graph, and retrieve root, upstream, and downstream tables. The
`DependencyGraph` trait offers a more structured and flexible way to
handle table dependencies. This change is part of the Root Table feature
(issue [databrickslabs#936](databrickslabs#936))
that identifies root tables in a graph of table dependencies, closing
issue [databrickslabs#23](databrickslabs#23). The PR
includes a new `TableGraphTest` class that demonstrates the use of these
methods and verifies their behavior for better data flow understanding
and optimization.
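The root-table idea itself is simple: given table-to-table lineage edges derived from query history, a root table is one that never appears on the downstream side of an edge. A small Python sketch of that computation, independent of the Scala `TableGraph` API:

```python
def root_tables(edges: list[tuple[str, str]]) -> set[str]:
    """Given (upstream, downstream) lineage edges, return tables that are
    never produced from another table, i.e. the roots of the graph."""
    upstream = {u for u, _ in edges}
    downstream = {d for _, d in edges}
    return (upstream | downstream) - downstream

# Example lineage: raw -> staging -> mart, plus a reference table feeding mart.
edges = [("raw.orders", "staging.orders"),
         ("staging.orders", "mart.orders"),
         ("ref.currency", "mart.orders")]
print(root_tables(edges))  # {'raw.orders', 'ref.currency'}
```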
* Snowflake Merge Implementation
([databrickslabs#964](databrickslabs#964)). In this
release, we have implemented the Merge statement for the Snowflake
parser, which enables updating or deleting rows in a target table based
on matches with a source table, and inserting new rows into the target
table when there are no matches. This feature includes updates to the
SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for
proper handling of column names and MERGE queries. Additionally, we have
added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec,
and SnowflakeToDatabricksTranspiler to ensure the accurate translation
and execution of MERGE statements for the Snowflake dialect. These
changes bring important database migration and synchronization
capabilities to our open-source library, improving its functionality and
usability for software engineers.
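Databricks SQL supports the same MERGE construct, so the translation is largely structural. A representative target-side statement, assuming Delta tables `target` and `source` keyed on `id` (not the exact output of the transpiler):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes Delta tables `target` and `source` exist and share an `id` key.
spark.sql("""
    MERGE INTO target t
    USING source s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.value = s.value
    WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")
```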
* TSQL: Implement CREATE TABLE
([databrickslabs#911](databrickslabs#911)). This
commit implements the TSQL CREATE TABLE command and its various options
and forms, including CTAS, graph node syntax, and analytics variants, as
well as syntactical differences for SQL Server. The DDL and DML visitors
have been moved from the AST and Relation visitors to separate classes
for better responsibility segregation. The LogicalPlanGenerator class
has been updated to generate unique constraints, primary keys, foreign
keys, check constraints, default value constraints, and identity
constraints for the CREATE TABLE command. Additionally, new classes for
generating SQL options and handling unresolved options during
transpilation have been added to enhance the parser's capability to
manage various options and forms. These changes improve the
transpilation of TSQL code and the organization of the codebase, making
it easier to maintain and extend.
* Transpile Snow ARRAY_SORT function
([databrickslabs#973](databrickslabs#973)). In this
release, we have implemented support for the Snowflake ARRAY_SORT
function in our open-source library. This feature has been added as part
of issue [databrickslabs#973](databrickslabs#973),
and it involves the addition of two new private methods, `arraySort` and
`makeArraySort`, to the `SnowflakeCallMapper` class. The `arraySort`
method takes a sequence of expressions as input and sorts the array
using the `makeArraySort` method. The `makeArraySort` method handles
both null and non-null values, sorts the array in ascending or
descending order based on the provided parameter, and determines the
position of null values based on the nulls-first parameter. The
sorted array is then returned as an `ir.ArraySort` expression. This
functionality allows for the sorting of arrays in Snowflake SQL to be
translated to equivalent code in the target language. This enhancement
simplifies the process of working with arrays in Snowflake SQL and
provides users with a more streamlined experience.
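For the common case, Snowflake's `ARRAY_SORT(arr, TRUE, TRUE)` lines up with Spark's `sort_array`, which sorts ascending with nulls first; other combinations of sort order and null placement need the extra handling described above. A minimal Databricks-side illustration, not the exact expression emitted by `SnowflakeCallMapper`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ascending sort, nulls first: roughly ARRAY_SORT(arr, TRUE, TRUE) in Snowflake.
spark.sql("SELECT sort_array(array(3, 1, NULL, 2), true) AS sorted").show()
# Descending sort, nulls last: roughly ARRAY_SORT(arr, FALSE, FALSE).
spark.sql("SELECT sort_array(array(3, 1, NULL, 2), false) AS sorted").show()
```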
* Transpile Snow MONTHS_BETWEEN function correctly
([databrickslabs#963](databrickslabs#963)). In this
release, the remorph library's SnowflakeCallMapper class in the
com/databricks/labs/remorph/parsers/snowflake/rules package has been
updated to handle the MONTHS_BETWEEN function. A new case has been added
that creates a MonthsBetween object with the first two arguments of the
function call and a boolean value of true. This change enhances
compatibility and ensures that the output accurately reflects the
intended functionality. Additionally, new test cases have been
introduced to the SnowflakeCallMapperSpec for the transpilation of the
MONTHS_BETWEEN function. These test cases demonstrate accurate mapping
of the function to the MonthsBetween class and proper casting of inputs
as dates or timestamps, improving the reliability and precision of date
and time calculations.
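Spark's `months_between` takes the same two date/timestamp arguments plus a `roundOff` flag, which is presumably what the boolean `true` above corresponds to. A quick Databricks-side check, assuming a Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# months_between(end, start, roundOff); with roundOff=true the result is
# rounded to 8 decimal places.
spark.sql(
    "SELECT months_between(DATE'2024-03-31', DATE'2024-01-31', true) AS diff"
).show()
```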
* Updated Installation to handle install errors
([databrickslabs#962](databrickslabs#962)). In this
release, we've made significant improvements to the `remorph` project,
addressing and resolving installation errors that were occurring during
the installation process in development mode. We've introduced a new
`ProductInfo` class in the `wheels` module, which provides information
about the products being installed. This change replaces the use of
`WheelsV2` in two test functions. Additionally, we've updated the
`workspace_installation` method in `application.py` to handle
installation errors more effectively, addressing the dependency on
workspace `.remorph` due to wheels. We've also added new methods to
`installation.py` to manage local and remote version files, and updated
the `_upgrade_reconcile_workflow` function to ensure the correct wheel
path is used during installation. These changes improve the overall
quality of the codebase, making it easier for developers to adopt and
maintain the project, and ensure a more seamless installation experience
for users.
* Updated catalog operations logging
([databrickslabs#910](databrickslabs#910)). In this
release, the setup process for the catalog, schema, and volume in the
configurator module has been simplified and improved. The previous
implementation repeatedly prompted the user for input until the correct
input was provided or a maximum number of attempts was reached. The
updated code now checks if the catalog, schema, or volume already exists
and either uses it or prompts the user to create it once. If the user
does not have the necessary privileges to use the catalog, schema, or
volume, an error message is logged and the installation is aborted. New
methods have been added to check for necessary privileges, such as
`has_necessary_catalog_access`, `has_necessary_schema_access`, and
`has_necessary_volume_access`, which return a boolean indicating whether
the user has the necessary privileges and log an error message with the
missing privileges if not. The logging for catalog operations in the
install.py file has also been updated to check for privileges at the end
of the process and list any missing privileges for each catalog object.
Additionally, changes have been made to the unit tests for the
ResourceConfigurator class to ensure that the system handles cases where
the user does not have the necessary permissions to access catalogs,
schemas, or volumes, preventing unauthorized access and maintaining the
security and integrity of the system.
* Updated remorph reconcile workflow to use wheels instead of pypi
([databrickslabs#884](databrickslabs#884)). In this
release, the installation process for the Remorph library has been
updated to allow for the use of locally uploaded wheel files instead of
downloading the package from PyPI. This change includes updates to the
`install` and `_deploy_jobs` methods in the `recon.py` file to accept a
new `wheel_paths` argument, which is used to pass the path of the
Remorph wheel file to the `deploy_recon_job` method. The
`_upgrade_reconcile_workflow` function in the
`v0.4.0_add_main_table_operation_name_column.py` file has also been
updated to upload the wheel package to the workspace and pass its path
to the `deploy_reconcile_job` method. Additionally, the
`deploy_recon_job` method in the `JobDeployment` class now accepts a new
`wheel_file` argument, which represents the name of the wheel file for
the remorph library. These changes address issues faced by customers
with no public internet access and enable the use of new features before
they are released on PyPI. The `test_recon.py` file in the
`tests/unit/deployment` directory has also been updated to reflect these
changes.
* Upgrade script Implementation
([databrickslabs#777](databrickslabs#777)). In this
release, we've implemented an upgrade script as part of pull request
[databrickslabs#777](databrickslabs#777), which
resolves issue
[databrickslabs#769](databrickslabs#769). This
change introduces a new `Upgrades` class in `application.py` that
accepts `product_info` and `installation` as parameters and includes a
cached property `wheels` for improved performance. Additionally, we've
added new methods to the `WorkspaceInstaller` class for handling
upgrade-related tasks, including the creation of a `ProductInfo` object,
interacting with the Databricks SDK, and handling potential errors.
We've also added a test case to ensure that upgrades are applied
correctly on more recent versions. These changes are part of our ongoing
effort to enhance the management and application of upgrades to
installed products.
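For the schema change that originally motivated this issue, a versioned upgrade step typically adds the missing column to the existing Delta table rather than relying on schema merge at write time. The sketch below shows that simpler add-column variant; the table name is an assumption, and the actual v0.4.0 script additionally checks whether the main table needs to be recreated and redeploys the reconcile workflow.

```python
from pyspark.sql import SparkSession

def upgrade_main_table(spark: SparkSession,
                       table: str = "remorph.reconcile.main") -> None:
    """Add the `operation_name` column introduced in newer versions, if absent."""
    existing = {f.name for f in spark.read.table(table).schema.fields}
    if "operation_name" not in existing:
        spark.sql(f"ALTER TABLE {table} ADD COLUMNS (operation_name STRING)")

if __name__ == "__main__":
    upgrade_main_table(SparkSession.builder.getOrCreate())
```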
* bug fix for to_array function
([databrickslabs#961](databrickslabs#961)). A bug
fix has been implemented to improve the `TO_ARRAY` function in our
open-source library. Previously, this function expected only one
parameter, but it has been updated to accept two parameters, with the
second being optional. This change brings the function in line with
other functions in the class, improving flexibility and ensuring
backward compatibility. The `TO_ARRAY` function is used to convert a
given expression to an array if it is not null and return null
otherwise. The commit also includes updates to the `Generator` class,
where a new entry for the `ToArray` expression has been added to the
`expression_map` dictionary. Additionally, a new `ToArray` class has
been introduced as a subclass of `Func`, allowing the function to handle
a variable number of arguments more gracefully. Relevant updates have
been made to the functional tests for the `to_array` function for both
Snowflake and Databricks SQL, demonstrating its handling of null inputs
and comparing it with the corresponding ARRAY function in each SQL
dialect. Overall, these changes enhance the functionality and
adaptability of the `TO_ARRAY` function.
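Semantically, the Databricks-side equivalent wraps a non-NULL value in a single-element array and propagates NULL otherwise. A hedged illustration of that mapping (the expression generated by remorph may differ in detail):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snowflake: TO_ARRAY(col) returns NULL for NULL input, otherwise a 1-element array.
spark.sql("""
    SELECT
        CASE WHEN col IS NULL THEN NULL ELSE array(col) END AS to_array_equivalent
    FROM VALUES (1), (CAST(NULL AS INT)) AS t(col)
""").show()
```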
* feat: Implement all of TSQL predicates except for SOME ALL ANY
([databrickslabs#922](databrickslabs#922)). In this
commit, we have implemented the IR generation for several TSQL
predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT,
thereby improving the parser's ability to handle a wider range of TSQL
syntax. The `TSqlParser` class has been updated with new methods and
changes to existing ones, including the addition of new labeled
expressions to the `predicate` rule. Additionally, we have corrected an
error in the LIKE predicate's implementation, allowing the ESCAPE
character to accept a full expression that evaluates to a single
character at runtime, rather than assuming it to be a single character
at parse time. These changes provide more flexibility and adherence to
the TSQL standard, enhancing the overall functionality of the project
for our adopters.
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 3, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 3, 2025