[FEATURE]: Create an Upgrade script to handle changes in newer versions #769

Closed
1 task done
vijaypavann-db opened this issue Aug 5, 2024 · 0 comments · Fixed by #777
Labels
enhancement (New feature or request) · feat/recon (making sure that remorphed query produces the same results as original)

Comments

@vijaypavann-db (Contributor) commented on Aug 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Category of feature request

Reconcile

Problem statement

A new column, `operation_name`, has been added to the main table.
The reconciliation process fails to write data to the existing main table because of this schema change.

Please find the error below:

AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: ).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.
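
For context, the two remedies named in the error map roughly onto the following PySpark calls. This is a minimal sketch only; the table name, the columns, and the `recon_df` stand-in are illustrative and are not the actual reconcile writer code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the reconcile output, now carrying the new column.
recon_df = spark.createDataFrame(
    [("recon-001", "row", "EXECUTE")],
    ["recon_id", "category", "operation_name"],
)

# Remedy 1: enable schema merging for this write only.
(recon_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("reconcile.main"))

# Remedy 2: enable automatic schema evolution for the whole session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```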

Proposed Solution

To address these scenarios, we need to develop an upgrade script that runs during installation and contains detailed instructions for upgrading from one version to the next,
for example from 0.4.0 to 0.4.1.
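
A minimal sketch of what such an upgrade step could look like for this particular change, assuming the column can simply be added in place with `ALTER TABLE`; the table name and helper are hypothetical, not the script that eventually shipped in #777:

```python
from pyspark.sql import SparkSession

def upgrade_main_table(spark: SparkSession, table: str = "reconcile.main") -> None:
    """Add the operation_name column introduced after v0.4.0, if it is missing."""
    existing = {field.name for field in spark.table(table).schema.fields}
    if "operation_name" not in existing:
        spark.sql(f"ALTER TABLE {table} ADD COLUMNS (operation_name STRING)")

if __name__ == "__main__":
    upgrade_main_table(SparkSession.builder.getOrCreate())
```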

Additional Context

Check for ways to automate the process while running reconcile.
The current changes cover the schema and the workflow.
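
One way the automation could be approached, sketched here purely as an assumption (the expected-column set, table name, and helper name are illustrative):

```python
from pyspark.sql import SparkSession

# Illustrative subset of the columns the current release expects.
EXPECTED_COLUMNS = {"recon_id", "operation_name"}

def ensure_schema_is_current(spark: SparkSession, table: str = "reconcile.main") -> None:
    """Fail fast before a reconcile run if the main table predates the schema change."""
    actual = {field.name for field in spark.table(table).schema.fields}
    missing = EXPECTED_COLUMNS - actual
    if missing:
        raise RuntimeError(
            f"{table} is missing columns {sorted(missing)}; "
            "run the upgrade script before reconciling."
        )
```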

@vijaypavann-db added the enhancement and feat/recon labels on Aug 5, 2024
@vijaypavann-db changed the title from "[FEATURE]: Create an Upgrade script to handle (Schema) changes in newer versions" to "[FEATURE]: Create an Upgrade script to handle changes in newer versions" on Aug 7, 2024
@nfx closed this as completed in #777 on Sep 18, 2024
@nfx closed this as completed in commit 1c6c890 on Sep 18, 2024
vil1 pushed a commit that referenced this issue Sep 20, 2024
@nfx mentioned this issue on Oct 10, 2024
nfx added a commit that referenced this issue Oct 10, 2024
* Added private key authentication for sf ([#917](#917)). This commit adds support for private key authentication to the Snowflake data source connector, providing users with more flexibility and security. The `cryptography` library is used to process the user-provided private key, with priority given to the `pem_private_key` secret, followed by the `sfPassword` secret. If neither secret is found, an exception is raised. However, password-based authentication is still used when JDBC options are provided, as Spark JDBC does not currently support private key authentication. A new exception class, `InvalidSnowflakePemPrivateKey`, has been added for handling invalid or malformed private keys. Additionally, new tests have been included for reading data with private key authentication, handling malformed private keys, and checking for missing authentication keys. The notice has been updated to include the `cryptography` library's copyright and license information.
* Added support for `PARSE_JSON` and `VARIANT` datatype ([#906](#906)). This commit introduces support for the `PARSE_JSON` function and `VARIANT` datatype in the Snowflake parser, addressing issue [#894](#894). The implementation removes the experimental dialect, enabling support for the `VARIANT` datatype and using `PARSE_JSON` for it. The `variant_explode` function is also utilized. During transpilation to Snowflake, whenever the `:` operator is encountered in the SELECT statement, everything will be treated as a `VARIANT` on the Databricks side to handle differences between Snowflake and Databricks in accessing variant types. These changes are implemented using ANTLR.
* Added upgrade script and modified metrics sql ([#990](#990)). In this release, the open-source library has been updated with several improvements to the metrics system, database upgrades, and setup process. The setup_spark_remote.sh script now checks if the Spark server is running by pinging localhost:4040, rather than sleeping for a fixed time, allowing for faster execution and more accurate server status determination. The metrics table's insert statement has been updated to cast values to Bigint for better handling of larger counts. An upgrade script has been added to facilitate required modifications, and the setup_spark_remote.sh script has been modified to validate URLs. A new Python file for upgrading the metrics table's data types has been added, which includes a function to recreate the table with the correct data types for specific columns. The upgrade_common module now includes several functions for upgrading database tables, and a new unit test file, test_upgrade_common.py, has been added with test cases for these functions. Lastly, the upgrade script for v0.4.0 has been updated to simplify the process of checking if the main table in the reconcile metadata needs to be recreated and to add an `operation_name` column. These changes improve the library's functionality, accuracy, and robustness, particularly for larger datasets and upgrading processes, enhancing the overall user experience.
* Basic CTAS Implementation ([#968](#968)). This pull request adds basic support for the CREATE TABLE AS SELECT (CTAS) statement in Snowflake, enabling users to create a new table by selecting data from an existing table or query. In the LogicalPlanGenerator class, a new method has been implemented to handle CTAS statements, which generates the appropriate SQL command for creating a table based on the result of a select query. The SnowflakeDDLBuilder class now includes a relationBuilder class member for building relations based on Snowflake DDL input, and the visitCreateTableAsSelect method has been overridden to parse CTAS statements and construct corresponding IR objects. The test suite has been expanded to include a new spec for CTAS statements and a test case for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;". Additionally, a new test file "test_ctas_complex.sql" has been added, containing SQL statements for creating a new table by selecting columns from multiple tables and computing new columns through various data manipulations. The implementation also includes adding new SQL statements for CTAS in both Snowflake and Databricks dialects, allowing for testing the CTAS functionality for both.
* Create repeatable estimator for Snowflake query history ([#924](#924)). This commit introduces a new coverage estimation tool for analyzing query history in a database, initially implemented for Snowflake. The tool parses and transpiles query history into Databricks SQL and reports on the percentage of query history it can process. It includes a new `SnowflakePlanParser` class that handles Snowflake query plans, a `SqlGenerator` class that generates Databricks SQL from optimized logical plans, and a `dialect` method that returns the dialect string. The long-term plan is to extend this functionality to other supported databases and dialects and include a report on SQL complexity. Additionally, test cases have been added to the `AnonymizerTest` class to ensure the correct functionality of the `Anonymizer` class, which anonymizes executed queries when provided with a `PlanParser` object. The `Anonymizer` class is intended to be used as part of the coverage estimation tool, which will provide analysis of query history for various databases.
* Created a mapping dict for algo for each dialect at layer level ([#934](#934)). A series of changes have been implemented to improve the reconciliation process and the handling of hash algorithms in the open-source library. A mapping dictionary algorithm to dialect has been introduced at the layer level to enhance the reconciliation process. The `get_hash_transform` function now accepts a new `layer` argument and returns a list of hash algorithms from the `HashAlgoMapping` dictionary. A new `HashAlgoMapping` class has been added to map algorithms to a dialect for hashing, replacing the previous `DialectHashConfig` class. A new function `get_dialect` has been introduced to retrieve the dialect based on the layer. The `_hash_transform` function and the `build_query` method have been updated to use the `layer` parameter when determining the dialect. These changes provide more precise control over the algorithm used for hash transformation based on the source layer and the target dialect, resulting in improved reconciliation performance and accuracy.
* Fetch TableDefinitions from Snowflake ([#904](#904)). A new `SnowflakeTableDefinitions` class has been added to simplify the discovery of Snowflake table metadata. This class establishes a connection with a Snowflake database through a Connection object, and provides methods such as `getDataType` and `getTableDefinitionQuery` to parse data types and generate queries for table definitions. Moreover, it includes a `getTableDefinitions` method to retrieve all table definitions in a Snowflake database as a sequence of `TableDefinition` objects, which encapsulates various properties of each table. The class also features methods for retrieving all catalogs and schemas in a Snowflake database. Alongside `SnowflakeTableDefinitions`, a new test class, `SnowflakeTableDefinitionTest`, has been introduced to verify the behavior of `getTableDefinitions` and ensure that the class functions as intended, adhering to the desired behavior.
* Guide user on missing configuration file ([#930](#930)). In this commit, the `_verify_recon_table_config` method in the `runner.py` file of the `databricks/labs/remorph` package has been updated to handle missing reconcile table configurations during installation. When the reconcile table configuration is not found, an error message will now display the name of the requested configuration file. This enhancement helps users identify the specific configuration file they need to provide in their workspace, addressing issue [#919](#919). This commit is co-authored by Ludovic Claude.
* Implement more missing visitor functions for Snowflake and TSQL ([#975](#975)). In this release, we have added several missing visitor methods for the Snowflake and TSQL builder classes to improve the reliability and maintainability of our parser. Previously, when a visitor method was missing, the default visitor was called, causing the visit of all children of the ParseTree, which was not ideal. This could lead to incorrect results due to a slight modification in the ANTLR grammar inadvertently breaking the visitor. In this release, we have implemented several new visitor methods for both Snowflake and TSQL builder classes, including the `visitDdlCommand` method in the `SnowflakeDDLBuilder` class and the `visitDdlClause` method in the `TSqlDDLBuilder` class. These new methods ensure that the ParseTree is traversed correctly and that the correct IR node is returned. The `visitDdlCommand` method checks for different types of DDL commands, such as create, alter, drop, and undrop, and calls the appropriate method for each type. The `visitDdlClause` method contains a sequence of methods corresponding to different DDL clauses and calls the first non-null method in the sequence. These changes significantly improve the robustness of our parser and enhance the reliability of our code.
* Introduce typed errors ([#981](#981)). This commit introduces typed errors in the form of a new class, `UnexpectedNode`, and several case classes including `ParsingError`, `UnsupportedDataType`, `WrongNumberOfArguments`, `UnsupportedArguments`, and `UnsupportedDateTimePart` in various packages, as part of the ongoing effort to replace exception throwing with returning `Result` types in future pull requests. These changes will improve error handling and provide more context and precision for errors, facilitating debugging and maintenance of the remorph library and data type generation functionality. The `TranspileException` class is now constructed with specific typed error instances, and the `ErrorCollector` and `ErrorDetail` classes have been updated to use `ParsingError`. Additionally, the `SnowflakeCallMapper` and `SnowflakeTimeUnits` classes have been updated to use the new typed error mechanism, providing more precise error handling for Snowflake-specific functions and expressions.
* Miscellaneous improvements to Snowflake parser ([#952](#952)). This diff brings several miscellaneous improvements to the Snowflake parser in the open-source library, targeting increased parse and transpilation success rates. The modifications include updating the `colDecl` rule to allow optional data types, introducing an `objectField` rule, and enabling date and timestamp literals as strings. Additionally, the parser has been refined to handle identifiers more efficiently, such as hashes within the AnonymizerTest. The expected Ast for certain test cases has also been updated to improve parser accuracy. These changes aim to create a more robust and streamlined Snowflake parser, minimizing parsing errors and enhancing overall user experience for project adopters. Furthermore, the error handling and reporting capabilities of the Snowflake parser have been improved with new case classes, `IndividualError` and `ErrorsSummary`, and updated error messages.
* Moved intermediate package out of parsers ([#972](#972)). In this release, the `intermediate` package has been refactored out of the `parsers` package, aligning with the design principle that parsers should depend on the intermediate representation instead of the other way around. This change affects various classes and methods across the project, all of which have been updated to import the `intermediate` package from its new location. No new functionality has been introduced, but the refactoring improves the package structure and dependency management. The `EstimationAnalyzer` class in the `coverage/estimation` package has been updated to import classes from the new location of the `intermediate` package, and its `evaluateTree` method has been updated to use the new import path for `LogicalPlan` and `Expression`. Other affected classes include `SnowflakeTableDefinitions`, `SnowflakeLexer`, `SnowflakeParser`, `SnowflakeTypeBuilder`, `GeneratorContext`, `DataTypeGenerator`, `IRHelpers`, and multiple test files.
* Patch Function without Brackets ([#907](#907)). This commit introduces new lexer and parser rules to handle Snowflake SQL functions without parentheses, specifically impacting CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new rules allow these functions to be used without parentheses, consistent with Snowflake SQL. This change fixes functional tests and includes documentation for the affected functions. However, there is a pending task to add or fix more test cases to ensure comprehensive testing of the new rules. Additionally, the syntax of the SELECT statement for the CURRENT_TIMESTAMP function has been updated, removing the need for the parameter 'col1'. This change simplifies the syntax for certain SQL functions in the codebase and improves the consistency and reliability of the functional tests.
* Root Table ([#936](#936)). The PR #<pr-number> introduces a new class `TableGraph` that extends `DependencyGraph` and implements the `LazyLogging` trait. This class builds a graph of tables and their dependencies based on query history and table definitions. It provides methods to add nodes and edges, build the graph, and retrieve root, upstream, and downstream tables. The `DependencyGraph` trait offers a more structured and flexible way to handle table dependencies. This change is part of the Root Table feature (issue [#936](#936)) that identifies root tables in a graph of table dependencies, closing issue [#23](#23). The PR includes a new `TableGraphTest` class that demonstrates the use of these methods and verifies their behavior for better data flow understanding and optimization.
* Snowflake Merge Implementation ([#964](#964)). In this release, we have implemented the Merge statement for the Snowflake parser, which enables updating or deleting rows in a target table based on matches with a source table, and inserting new rows into the target table when there are no matches. This feature includes updates to the SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for proper handling of column names and MERGE queries. Additionally, we have added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec, and SnowflakeToDatabricksTranspiler to ensure the accurate translation and execution of MERGE statements for the Snowflake dialect. These changes bring important database migration and synchronization capabilities to our open-source library, improving its functionality and usability for software engineers.
* TSQL: Implement CREATE TABLE ([#911](#911)). This commit implements the TSQL CREATE TABLE command and its various options and forms, including CTAS, graph node syntax, and analytics variants, as well as syntactical differences for SQL Server. The DDL and DML visitors have been moved from the AST and Relation visitors to separate classes for better responsibility segregation. The LogicalPlanGenerator class has been updated to generate unique constraints, primary keys, foreign keys, check constraints, default value constraints, and identity constraints for the CREATE TABLE command. Additionally, new classes for generating SQL options and handling unresolved options during transpilation have been added to enhance the parser's capability to manage various options and forms. These changes improve the transpilation of TSQL code and the organization of the codebase, making it easier to maintain and extend.
* Transpile Snow ARRAY_SORT function ([#973](#973)). In this release, we have implemented support for the Snowflake ARRAY_SORT function in our open-source library. This feature has been added as part of issue [#973](#973), and it involves the addition of two new private methods, `arraySort` and `makeArraySort`, to the `SnowflakeCallMapper` class. The `arraySort` method takes a sequence of expressions as input and sorts the array using the `makeArraySort` method. The `makeArraySort` method handles both null and non-null values, sorts the array in ascending or descending order based on the provided parameter, and determines the position of null or small values based on the nulls first parameter. The sorted array is then returned as an `ir.ArraySort` expression. This functionality allows for the sorting of arrays in Snowflake SQL to be translated to equivalent code in the target language. This enhancement simplifies the process of working with arrays in Snowflake SQL and provides users with a more streamlined experience.
* Transpile Snow MONTHS_BETWEEN function correctly ([#963](#963)). In this release, the remorph library's SnowflakeCallMapper class in the com/databricks/labs/remorph/parsers/snowflake/rules package has been updated to handle the MONTHS_BETWEEN function. A new case has been added that creates a MonthsBetween object with the first two arguments of the function call and a boolean value of true. This change enhances compatibility and ensures that the output accurately reflects the intended functionality. Additionally, new test cases have been introduced to the SnowflakeCallMapperSpec for the transpilation of the MONTHS_BETWEEN function. These test cases demonstrate accurate mapping of the function to the MonthsBetween class and proper casting of inputs as dates or timestamps, improving the reliability and precision of date and time calculations.
* Updated Installation to handle install errors ([#962](#962)). In this release, we've made significant improvements to the `remorph` project, addressing and resolving installation errors that were occurring during the installation process in development mode. We've introduced a new `ProductInfo` class in the `wheels` module, which provides information about the products being installed. This change replaces the use of `WheelsV2` in two test functions. Additionally, we've updated the `workspace_installation` method in `application.py` to handle installation errors more effectively, addressing the dependency on workspace `.remorph` due to wheels. We've also added new methods to `installation.py` to manage local and remote version files, and updated the `_upgrade_reconcile_workflow` function to ensure the correct wheel path is used during installation. These changes improve the overall quality of the codebase, making it easier for developers to adopt and maintain the project, and ensure a more seamless installation experience for users.
* Updated catalog operations logging ([#910](#910)). In this release, the setup process for the catalog, schema, and volume in the configurator module has been simplified and improved. The previous implementation repeatedly prompted the user for input until the correct input was provided or a maximum number of attempts was reached. The updated code now checks if the catalog, schema, or volume already exists and either uses it or prompts the user to create it once. If the user does not have the necessary privileges to use the catalog, schema, or volume, an error message is logged and the installation is aborted. New methods have been added to check for necessary privileges, such as `has_necessary_catalog_access`, `has_necessary_schema_access`, and `has_necessary_volume_access`, which return a boolean indicating whether the user has the necessary privileges and log an error message with the missing privileges if not. The logging for catalog operations in the install.py file has also been updated to check for privileges at the end of the process and list any missing privileges for each catalog object. Additionally, changes have been made to the unit tests for the ResourceConfigurator class to ensure that the system handles cases where the user does not have the necessary permissions to access catalogs, schemas, or volumes, preventing unauthorized access and maintaining the security and integrity of the system.
* Updated remorph reconcile workflow to use wheels instead of pypi ([#884](#884)). In this release, the installation process for the Remorph library has been updated to allow for the use of locally uploaded wheel files instead of downloading the package from PyPI. This change includes updates to the `install` and `_deploy_jobs` methods in the `recon.py` file to accept a new `wheel_paths` argument, which is used to pass the path of the Remorph wheel file to the `deploy_recon_job` method. The `_upgrade_reconcile_workflow` function in the `v0.4.0_add_main_table_operation_name_column.py` file has also been updated to upload the wheel package to the workspace and pass its path to the `deploy_reconcile_job` method. Additionally, the `deploy_recon_job` method in the `JobDeployment` class now accepts a new `wheel_file` argument, which represents the name of the wheel file for the remorph library. These changes address issues faced by customers with no public internet access and enable the use of new features before they are released on PyPI. The `test_recon.py` file in the `tests/unit/deployment` directory has also been updated to reflect these changes.
* Upgrade script Implementation ([#777](#777)). In this release, we've implemented an upgrade script as part of pull request [#777](#777), which resolves issue [#769](#769). This change introduces a new `Upgrades` class in `application.py` that accepts `product_info` and `installation` as parameters and includes a cached property `wheels` for improved performance. Additionally, we've added new methods to the `WorkspaceInstaller` class for handling upgrade-related tasks, including the creation of a `ProductInfo` object, interacting with the Databricks SDK, and handling potential errors. We've also added a test case to ensure that upgrades are applied correctly on more recent versions. These changes are part of our ongoing effort to enhance the management and application of upgrades to installed products.
* bug fix for to_array function ([#961](#961)). A bug fix has been implemented to improve the `TO_ARRAY` function in our open-source library. Previously, this function expected only one parameter, but it has been updated to accept two parameters, with the second being optional. This change brings the function in line with other functions in the class, improving flexibility and ensuring backward compatibility. The `TO_ARRAY` function is used to convert a given expression to an array if it is not null and return null otherwise. The commit also includes updates to the `Generator` class, where a new entry for the `ToArray` expression has been added to the `expression_map` dictionary. Additionally, a new `ToArray` class has been introduced as a subclass of `Func`, allowing the function to handle a variable number of arguments more gracefully. Relevant updates have been made to the functional tests for the `to_array` function for both Snowflake and Databricks SQL, demonstrating its handling of null inputs and comparing it with the corresponding ARRAY function in each SQL dialect. Overall, these changes enhance the functionality and adaptability of the `TO_ARRAY` function.
* feat: Implement all of TSQL predicates except for SOME ALL ANY ([#922](#922)). In this commit, we have implemented the IR generation for several TSQL predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT, thereby improving the parser's ability to handle a wider range of TSQL syntax. The `TSqlParser` class has been updated with new methods and changes to existing ones, including the addition of new labeled expressions to the `predicate` rule. Additionally, we have corrected an error in the LIKE predicate's implementation, allowing the ESCAPE character to accept a full expression that evaluates to a single character at runtime, rather than assuming it to be a single character at parse time. These changes provide more flexibility and adherence to the TSQL standard, enhancing the overall functionality of the project for our adopters.
nfx added a commit that referenced this issue Oct 10, 2024
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 2, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 2, 2025
* Added private key authentication for sf
([databrickslabs#917](databrickslabs#917)). This
commit adds support for private key authentication to the Snowflake data
source connector, providing users with more flexibility and security.
The `cryptography` library is used to process the user-provided private
key, with priority given to the `pem_private_key` secret, followed by
the `sfPassword` secret. If neither secret is found, an exception is
raised. However, password-based authentication is still used when JDBC
options are provided, as Spark JDBC does not currently support private
key authentication. A new exception class,
`InvalidSnowflakePemPrivateKey`, has been added for handling invalid or
malformed private keys. Additionally, new tests have been included for
reading data with private key authentication, handling malformed private
keys, and checking for missing authentication keys. The notice has been
updated to include the `cryptography` library's copyright and license
information.
* Added support for `PARSE_JSON` and `VARIANT` datatype
([databrickslabs#906](databrickslabs#906)). This
commit introduces support for the `PARSE_JSON` function and `VARIANT`
datatype in the Snowflake parser, addressing issue
[databrickslabs#894](databrickslabs#894). The
implementation removes the experimental dialect, enabling support for
the `VARIANT` datatype and using `PARSE_JSON` for it. The
`variant_explode` function is also utilized. During transpilation to
Snowflake, whenever the `:` operator is encountered in the SELECT
statement, everything will be treated as a `VARIANT` on the Databricks
side to handle differences between Snowflake and Databricks in accessing
variant types. These changes are implemented using ANTLR.
* Added upgrade script and modified metrics sql
([databrickslabs#990](databrickslabs#990)). In this
release, the open-source library has been updated with several
improvements to the metrics system, database upgrades, and setup
process. The setup_spark_remote.sh script now checks if the Spark server
is running by pinging localhost:4040, rather than sleeping for a fixed
time, allowing for faster execution and more accurate server status
determination. The metrics table's insert statement has been updated to
cast values to Bigint for better handling of larger counts. An upgrade
script has been added to facilitate required modifications, and the
setup_spark_remote.sh script has been modified to validate URLs. A new
Python file for upgrading the metrics table's data types has been added,
which includes a function to recreate the table with the correct data
types for specific columns. The upgrade_common module now includes
several functions for upgrading database tables, and a new unit test
file, test_upgrade_common.py, has been added with test cases for these
functions. Lastly, the upgrade script for v0.4.0 has been updated to
simplify the process of checking if the main table in the reconcile
metadata needs to be recreated and to add an `operation_name` column.
These changes improve the library's functionality, accuracy, and
robustness, particularly for larger datasets and upgrading processes,
enhancing the overall user experience.
* Basic CTAS Implementation
([databrickslabs#968](databrickslabs#968)). This
pull request adds basic support for the CREATE TABLE AS SELECT (CTAS)
statement in Snowflake, enabling users to create a new table by
selecting data from an existing table or query. In the
LogicalPlanGenerator class, a new method has been implemented to handle
CTAS statements, which generates the appropriate SQL command for
creating a table based on the result of a select query. The
SnowflakeDDLBuilder class now includes a relationBuilder class member
for building relations based on Snowflake DDL input, and the
visitCreateTableAsSelect method has been overridden to parse CTAS
statements and construct corresponding IR objects. The test suite has
been expanded to include a new spec for CTAS statements and a test case
for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;".
Additionally, a new test file "test_ctas_complex.sql" has been added,
containing SQL statements for creating a new table by selecting columns
from multiple tables and computing new columns through various data
manipulations. The implementation also includes adding new SQL
statements for CTAS in both Snowflake and Databricks dialects, allowing
for testing the CTAS functionality for both.
* Create repeatable estimator for Snowflake query history
([databrickslabs#924](databrickslabs#924)). This
commit introduces a new coverage estimation tool for analyzing query
history in a database, initially implemented for Snowflake. The tool
parses and transpiles query history into Databricks SQL and reports on
the percentage of query history it can process. It includes a new
`SnowflakePlanParser` class that handles Snowflake query plans, a
`SqlGenerator` class that generates Databricks SQL from optimized
logical plans, and a `dialect` method that returns the dialect string.
The long-term plan is to extend this functionality to other supported
databases and dialects and include a report on SQL complexity.
Additionally, test cases have been added to the `AnonymizerTest` class
to ensure the correct functionality of the `Anonymizer` class, which
anonymizes executed queries when provided with a `PlanParser` object.
The `Anonymizer` class is intended to be used as part of the coverage
estimation tool, which will provide analysis of query history for
various databases.
* Created a mapping dict for algo for each dialect at layer level
([databrickslabs#934](databrickslabs#934)). A series
of changes have been implemented to improve the reconciliation process
and the handling of hash algorithms in the open-source library. A
dictionary mapping each dialect to its hash algorithm has been introduced
at the layer (source/target) level. The `get_hash_transform`
function now accepts a new `layer` argument and returns a list of hash
algorithms from the `HashAlgoMapping` dictionary. A new
`HashAlgoMapping` class has been added to map algorithms to a dialect
for hashing, replacing the previous `DialectHashConfig` class. A new
function `get_dialect` has been introduced to retrieve the dialect based
on the layer. The `_hash_transform` function and the `build_query`
method have been updated to use the `layer` parameter when determining
the dialect. These changes provide more precise control over the
algorithm used for hash transformation based on the source layer and the
target dialect, resulting in improved reconciliation performance and
accuracy.
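Conceptually, the hash algorithm is now resolved from both the dialect and the layer. The sketch below is a hypothetical Python rendering of that idea; the class and function names echo the ones mentioned above, but the concrete algorithms, fields, and signatures are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HashAlgoMapping:
    source: Callable[[str], str]   # hash expression used on the source layer
    target: Callable[[str], str]   # hash expression used on the target layer

def sha256_expr(column: str) -> str:
    return f"sha2(concat_ws('|', {column}), 256)"

def md5_expr(column: str) -> str:
    return f"md5(concat_ws('|', {column}))"

# Illustrative only: which algorithm each dialect uses per layer.
HASH_ALGO_MAPPING = {
    "snowflake": HashAlgoMapping(source=sha256_expr, target=sha256_expr),
    "oracle": HashAlgoMapping(source=md5_expr, target=sha256_expr),
}

def get_hash_transform(dialect: str, layer: str) -> Callable[[str], str]:
    mapping = HASH_ALGO_MAPPING[dialect]
    return mapping.source if layer == "source" else mapping.target
```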
* Fetch TableDefinitions from Snowflake
([databrickslabs#904](databrickslabs#904)). A new
`SnowflakeTableDefinitions` class has been added to simplify the
discovery of Snowflake table metadata. This class establishes a
connection with a Snowflake database through a Connection object, and
provides methods such as `getDataType` and `getTableDefinitionQuery` to
parse data types and generate queries for table definitions. Moreover,
it includes a `getTableDefinitions` method to retrieve all table
definitions in a Snowflake database as a sequence of `TableDefinition`
objects, each of which encapsulates the properties of a table. The class
also features methods for retrieving all catalogs and schemas in a
Snowflake database. Alongside `SnowflakeTableDefinitions`, a new test
class, `SnowflakeTableDefinitionTest`, has been introduced to verify the
behavior of `getTableDefinitions` and ensure that the class functions as
intended.
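Under the hood this amounts to querying Snowflake's INFORMATION_SCHEMA over a connection and folding the rows into `TableDefinition`-like records. A rough Python sketch over a generic DB-API connection is shown below; the query shape and record fields are assumptions rather than the exact query built by `getTableDefinitionQuery`.

```python
from dataclasses import dataclass

@dataclass
class TableDefinition:
    catalog: str
    schema: str
    table: str
    columns: list[tuple[str, str]]  # (column_name, data_type)

def get_table_definitions(conn, catalog: str) -> list[TableDefinition]:
    """Collect table definitions from Snowflake's INFORMATION_SCHEMA."""
    cursor = conn.cursor()
    cursor.execute(
        f"""
        SELECT table_schema, table_name, column_name, data_type
        FROM {catalog}.INFORMATION_SCHEMA.COLUMNS
        ORDER BY table_schema, table_name, ordinal_position
        """
    )
    tables: dict[tuple[str, str], TableDefinition] = {}
    for schema, table, column, data_type in cursor.fetchall():
        key = (schema, table)
        if key not in tables:
            tables[key] = TableDefinition(catalog, schema, table, [])
        tables[key].columns.append((column, data_type))
    return list(tables.values())
```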
* Guide user on missing configuration file
([databrickslabs#930](databrickslabs#930)). In this
commit, the `_verify_recon_table_config` method in the `runner.py` file
of the `databricks/labs/remorph` package has been updated to handle
missing reconcile table configurations during installation. When the
reconcile table configuration is not found, an error message will now
display the name of the requested configuration file. This enhancement
helps users identify the specific configuration file they need to
provide in their workspace, addressing issue
[databrickslabs#919](databrickslabs#919). This
commit is co-authored by Ludovic Claude.
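The behaviour boils down to failing fast with the name of the file the installer expected. A hypothetical sketch of that check follows; the loader call, error type, and filename scheme are assumptions based on the description above.

```python
class ReconcileConfigNotFoundError(Exception):
    pass

def verify_recon_table_config(installation, source: str) -> None:
    # e.g. "recon_config_snowflake.json"; the real filename scheme may differ.
    filename = f"recon_config_{source}.json"
    try:
        installation.load_table_config(filename)  # hypothetical loader
    except FileNotFoundError as err:
        raise ReconcileConfigNotFoundError(
            f"Cannot find reconcile table configuration `{filename}` in the "
            "workspace. Please create it before running the reconcile workflow."
        ) from err
```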
* Implement more missing visitor functions for Snowflake and TSQL
([databrickslabs#975](databrickslabs#975)). In this
release, we have added several missing visitor methods for the Snowflake
and TSQL builder classes to improve the reliability and maintainability
of our parser. Previously, when a visitor method was missing, the
default visitor was called, causing the visit of all children of the
ParseTree, which was not ideal: a slight modification to the ANTLR grammar
could inadvertently break a visitor and produce incorrect results.
Specifically, we have implemented new visitor
methods for both Snowflake and TSQL builder classes, including the
`visitDdlCommand` method in the `SnowflakeDDLBuilder` class and the
`visitDdlClause` method in the `TSqlDDLBuilder` class. These new methods
ensure that the ParseTree is traversed correctly and that the correct IR
node is returned. The `visitDdlCommand` method checks for different
types of DDL commands, such as create, alter, drop, and undrop, and
calls the appropriate method for each type. The `visitDdlClause` method
contains a sequence of methods corresponding to different DDL clauses
and calls the first non-null method in the sequence. These changes
significantly improve the robustness of our parser and enhance the
reliability of our code.
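The failure mode being fixed is easiest to see with a generic visitor: when a node kind has no dedicated `visit...` method, the default implementation only walks the children, so whatever the node itself carries is silently dropped. The Python sketch below is a language-neutral illustration of that pattern, not the ANTLR-generated code.

```python
class Node:
    def __init__(self, kind, children=(), payload=None):
        self.kind, self.children, self.payload = kind, list(children), payload

class DefaultVisitor:
    def visit(self, node):
        method = getattr(self, f"visit_{node.kind}", self.visit_children)
        return method(node)

    def visit_children(self, node):
        # Fallback: visit children and keep only the last result, so any
        # information carried by `node` itself is silently lost.
        result = None
        for child in node.children:
            result = self.visit(child)
        return result

class DdlVisitor(DefaultVisitor):
    def visit_ddl_command(self, node):
        # Explicit handling: dispatch on the command type instead of falling
        # back to the lossy child walk.
        return {"command": node.payload,
                "children": [self.visit(c) for c in node.children]}

tree = Node("ddl_command", [Node("identifier", payload="t1")], payload="CREATE")
print(DdlVisitor().visit(tree))  # {'command': 'CREATE', 'children': [None]}
```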
* Introduce typed errors
([databrickslabs#981](databrickslabs#981)). This
commit introduces typed errors in the form of a new class,
`UnexpectedNode`, and several case classes including `ParsingError`,
`UnsupportedDataType`, `WrongNumberOfArguments`, `UnsupportedArguments`,
and `UnsupportedDateTimePart` in various packages, as part of the
ongoing effort to replace exception throwing with returning `Result`
types in future pull requests. These changes will improve error handling
and provide more context and precision for errors, facilitating
debugging and maintenance of the remorph library and data type
generation functionality. The `TranspileException` class is now
constructed with specific typed error instances, and the
`ErrorCollector` and `ErrorDetail` classes have been updated to use
`ParsingError`. Additionally, the `SnowflakeCallMapper` and
`SnowflakeTimeUnits` classes have been updated to use the new typed
error mechanism, providing more precise error handling for
Snowflake-specific functions and expressions.
* Miscellaneous improvements to Snowflake parser
([databrickslabs#952](databrickslabs#952)). This
diff brings several miscellaneous improvements to the Snowflake parser
in the open-source library, targeting increased parse and transpilation
success rates. The modifications include updating the `colDecl` rule to
allow optional data types, introducing an `objectField` rule, and
enabling date and timestamp literals as strings. Additionally, the
parser has been refined to handle identifiers more efficiently, such as
hashes within the AnonymizerTest. The expected AST for certain test
cases has also been updated to improve parser accuracy. These changes
aim to create a more robust and streamlined Snowflake parser, minimizing
parsing errors and enhancing overall user experience for project
adopters. Furthermore, the error handling and reporting capabilities of
the Snowflake parser have been improved with new case classes,
`IndividualError` and `ErrorsSummary`, and updated error messages.
* Moved intermediate package out of parsers
([databrickslabs#972](databrickslabs#972)). In this
release, the `intermediate` package has been refactored out of the
`parsers` package, aligning with the design principle that parsers
should depend on the intermediate representation instead of the other
way around. This change affects various classes and methods across the
project, all of which have been updated to import the `intermediate`
package from its new location. No new functionality has been introduced,
but the refactoring improves the package structure and dependency
management. The `EstimationAnalyzer` class in the `coverage/estimation`
package has been updated to import classes from the new location of the
`intermediate` package, and its `evaluateTree` method has been updated
to use the new import path for `LogicalPlan` and `Expression`. Other
affected classes include `SnowflakeTableDefinitions`, `SnowflakeLexer`,
`SnowflakeParser`, `SnowflakeTypeBuilder`, `GeneratorContext`,
`DataTypeGenerator`, `IRHelpers`, and multiple test files.
* Patch Function without Brackets
([databrickslabs#907](databrickslabs#907)). This
commit introduces new lexer and parser rules to handle Snowflake SQL
functions without parentheses, specifically impacting CURRENT_DATE,
CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new
rules allow these functions to be used without parentheses, consistent
with Snowflake SQL. This change fixes functional tests and includes
documentation for the affected functions. However, there is a pending
task to add or fix more test cases to ensure comprehensive testing of
the new rules. Additionally, the syntax of the SELECT statement for the
CURRENT_TIMESTAMP function has been updated, removing the need for the
parameter 'col1'. This change simplifies the syntax for certain SQL
functions in the codebase and improves the consistency and reliability
of the functional tests.
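Spark SQL likewise accepts these niladic functions without parentheses, so the transpiled form stays close to the input. A quick Databricks-side check, assuming a running Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both forms are valid Spark SQL; the parser change lets the Snowflake side
# accept the bracket-less spelling as well.
spark.sql("SELECT current_date, current_timestamp").show(truncate=False)
spark.sql("SELECT current_date(), current_timestamp()").show(truncate=False)
```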
* Root Table
([databrickslabs#936](databrickslabs#936)). This PR
introduces a new class `TableGraph` that extends
`DependencyGraph` and implements `LazyLogging` trait. This class builds
a graph of tables and their dependencies based on query history and
table definitions. It provides methods to add nodes and edges, build the
graph, and retrieve root, upstream, and downstream tables. The
`DependencyGraph` trait offers a more structured and flexible way to
handle table dependencies. This change is part of the Root Table feature
(issue [databrickslabs#936](databrickslabs#936))
that identifies root tables in a graph of table dependencies, closing
issue [databrickslabs#23](databrickslabs#23). The PR
includes a new `TableGraphTest` class that demonstrates the use of these
methods and verifies their behavior for better data flow understanding
and optimization.
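The root-table idea itself is simple: given table-to-table lineage edges derived from query history, a root table is one that never appears on the downstream side of an edge. A small Python sketch of that computation, independent of the Scala `TableGraph` API:

```python
def root_tables(edges: list[tuple[str, str]]) -> set[str]:
    """Given (upstream, downstream) lineage edges, return tables that are
    never produced from another table, i.e. the roots of the graph."""
    upstream = {u for u, _ in edges}
    downstream = {d for _, d in edges}
    return (upstream | downstream) - downstream

# Example lineage: raw -> staging -> mart, plus a reference table feeding mart.
edges = [("raw.orders", "staging.orders"),
         ("staging.orders", "mart.orders"),
         ("ref.currency", "mart.orders")]
print(root_tables(edges))  # {'raw.orders', 'ref.currency'}
```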
* Snowflake Merge Implementation
([databrickslabs#964](databrickslabs#964)). In this
release, we have implemented the Merge statement for the Snowflake
parser, which enables updating or deleting rows in a target table based
on matches with a source table, and inserting new rows into the target
table when there are no matches. This feature includes updates to the
SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for
proper handling of column names and MERGE queries. Additionally, we have
added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec,
and SnowflakeToDatabricksTranspiler to ensure the accurate translation
and execution of MERGE statements for the Snowflake dialect. These
changes bring important database migration and synchronization
capabilities to our open-source library, improving its functionality and
usability for software engineers.
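Databricks SQL supports the same MERGE construct, so the translation is largely structural. A representative target-side statement, assuming Delta tables `target` and `source` keyed on `id` (not the exact output of the transpiler):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes Delta tables `target` and `source` exist and share an `id` key.
spark.sql("""
    MERGE INTO target t
    USING source s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.value = s.value
    WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")
```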
* TSQL: Implement CREATE TABLE
([databrickslabs#911](databrickslabs#911)). This
commit implements the TSQL CREATE TABLE command and its various options
and forms, including CTAS, graph node syntax, and analytics variants, as
well as syntactical differences for SQL Server. The DDL and DML visitors
have been moved from the AST and Relation visitors to separate classes
for better responsibility segregation. The LogicalPlanGenerator class
has been updated to generate unique constraints, primary keys, foreign
keys, check constraints, default value constraints, and identity
constraints for the CREATE TABLE command. Additionally, new classes for
generating SQL options and handling unresolved options during
transpilation have been added to enhance the parser's capability to
manage various options and forms. These changes improve the
transpilation of TSQL code and the organization of the codebase, making
it easier to maintain and extend.
* Transpile Snow ARRAY_SORT function
([databrickslabs#973](databrickslabs#973)). In this
release, we have implemented support for the Snowflake ARRAY_SORT
function in our open-source library. This feature has been added as part
of issue [databrickslabs#973](databrickslabs#973),
and it involves the addition of two new private methods, `arraySort` and
`makeArraySort`, to the `SnowflakeCallMapper` class. The `arraySort`
method takes a sequence of expressions as input and sorts the array
using the `makeArraySort` method. The `makeArraySort` method handles
both null and non-null values, sorts the array in ascending or
descending order based on the provided parameter, and determines the
position of null values based on the nulls-first parameter. The
sorted array is then returned as an `ir.ArraySort` expression. This
functionality allows for the sorting of arrays in Snowflake SQL to be
translated to equivalent code in the target language. This enhancement
simplifies the process of working with arrays in Snowflake SQL and
provides users with a more streamlined experience.
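For the common case, Snowflake's `ARRAY_SORT(arr, TRUE, TRUE)` lines up with Spark's `sort_array`, which sorts ascending with nulls first; other combinations of sort order and null placement need the extra handling described above. A minimal Databricks-side illustration, not the exact expression emitted by `SnowflakeCallMapper`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ascending sort, nulls first: roughly ARRAY_SORT(arr, TRUE, TRUE) in Snowflake.
spark.sql("SELECT sort_array(array(3, 1, NULL, 2), true) AS sorted").show()
# Descending sort, nulls last: roughly ARRAY_SORT(arr, FALSE, FALSE).
spark.sql("SELECT sort_array(array(3, 1, NULL, 2), false) AS sorted").show()
```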
* Transpile Snow MONTHS_BETWEEN function correctly
([databrickslabs#963](databrickslabs#963)). In this
release, the remorph library's SnowflakeCallMapper class in the
com/databricks/labs/remorph/parsers/snowflake/rules package has been
updated to handle the MONTHS_BETWEEN function. A new case has been added
that creates a MonthsBetween object with the first two arguments of the
function call and a boolean value of true. This change enhances
compatibility and ensures that the output accurately reflects the
intended functionality. Additionally, new test cases have been
introduced to the SnowflakeCallMapperSpec for the transpilation of the
MONTHS_BETWEEN function. These test cases demonstrate accurate mapping
of the function to the MonthsBetween class and proper casting of inputs
as dates or timestamps, improving the reliability and precision of date
and time calculations.
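Spark's `months_between` takes the same two date/timestamp arguments plus a `roundOff` flag, which is presumably what the boolean `true` above corresponds to. A quick Databricks-side check, assuming a Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# months_between(end, start, roundOff); with roundOff=true the result is
# rounded to 8 decimal places.
spark.sql(
    "SELECT months_between(DATE'2024-03-31', DATE'2024-01-31', true) AS diff"
).show()
```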
* Updated Installation to handle install errors
([databrickslabs#962](databrickslabs#962)). In this
release, we've made significant improvements to the `remorph` project,
addressing and resolving installation errors that were occurring during
the installation process in development mode. We've introduced a new
`ProductInfo` class in the `wheels` module, which provides information
about the products being installed. This change replaces the use of
`WheelsV2` in two test functions. Additionally, we've updated the
`workspace_installation` method in `application.py` to handle
installation errors more effectively, addressing the dependency on
workspace `.remorph` due to wheels. We've also added new methods to
`installation.py` to manage local and remote version files, and updated
the `_upgrade_reconcile_workflow` function to ensure the correct wheel
path is used during installation. These changes improve the overall
quality of the codebase, making it easier for developers to adopt and
maintain the project, and ensure a more seamless installation experience
for users.
* Updated catalog operations logging
([databrickslabs#910](databrickslabs#910)). In this
release, the setup process for the catalog, schema, and volume in the
configurator module has been simplified and improved. The previous
implementation repeatedly prompted the user for input until the correct
input was provided or a maximum number of attempts was reached. The
updated code now checks if the catalog, schema, or volume already exists
and either uses it or prompts the user to create it once. If the user
does not have the necessary privileges to use the catalog, schema, or
volume, an error message is logged and the installation is aborted. New
methods have been added to check for necessary privileges, such as
`has_necessary_catalog_access`, `has_necessary_schema_access`, and
`has_necessary_volume_access`, which return a boolean indicating whether
the user has the necessary privileges and log an error message with the
missing privileges if not. The logging for catalog operations in the
install.py file has also been updated to check for privileges at the end
of the process and list any missing privileges for each catalog object.
Additionally, changes have been made to the unit tests for the
ResourceConfigurator class to ensure that the system handles cases where
the user does not have the necessary permissions to access catalogs,
schemas, or volumes, preventing unauthorized access and maintaining the
security and integrity of the system.
* Updated remorph reconcile workflow to use wheels instead of pypi
([databrickslabs#884](databrickslabs#884)). In this
release, the installation process for the Remorph library has been
updated to allow for the use of locally uploaded wheel files instead of
downloading the package from PyPI. This change includes updates to the
`install` and `_deploy_jobs` methods in the `recon.py` file to accept a
new `wheel_paths` argument, which is used to pass the path of the
Remorph wheel file to the `deploy_recon_job` method. The
`_upgrade_reconcile_workflow` function in the
`v0.4.0_add_main_table_operation_name_column.py` file has also been
updated to upload the wheel package to the workspace and pass its path
to the `deploy_reconcile_job` method. Additionally, the
`deploy_recon_job` method in the `JobDeployment` class now accepts a new
`wheel_file` argument, which represents the name of the wheel file for
the remorph library. These changes address issues faced by customers
with no public internet access and enable the use of new features before
they are released on PyPI. The `test_recon.py` file in the
`tests/unit/deployment` directory has also been updated to reflect these
changes.
* Upgrade script Implementation
([databrickslabs#777](databrickslabs#777)). In this
release, we've implemented an upgrade script as part of pull request
[databrickslabs#777](databrickslabs#777), which
resolves issue
[databrickslabs#769](databrickslabs#769). This
change introduces a new `Upgrades` class in `application.py` that
accepts `product_info` and `installation` as parameters and includes a
cached property `wheels` for improved performance. Additionally, we've
added new methods to the `WorkspaceInstaller` class for handling
upgrade-related tasks, including the creation of a `ProductInfo` object,
interacting with the Databricks SDK, and handling potential errors.
We've also added a test case to ensure that upgrades are applied
correctly on more recent versions. These changes are part of our ongoing
effort to enhance the management and application of upgrades to
installed products.
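For the schema change that originally motivated this issue, a versioned upgrade step typically adds the missing column to the existing Delta table rather than relying on schema merge at write time. The sketch below shows that simpler add-column variant; the table name is an assumption, and the actual v0.4.0 script additionally checks whether the main table needs to be recreated and redeploys the reconcile workflow.

```python
from pyspark.sql import SparkSession

def upgrade_main_table(spark: SparkSession,
                       table: str = "remorph.reconcile.main") -> None:
    """Add the `operation_name` column introduced in newer versions, if absent."""
    existing = {f.name for f in spark.read.table(table).schema.fields}
    if "operation_name" not in existing:
        spark.sql(f"ALTER TABLE {table} ADD COLUMNS (operation_name STRING)")

if __name__ == "__main__":
    upgrade_main_table(SparkSession.builder.getOrCreate())
```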
* bug fix for to_array function
([databrickslabs#961](databrickslabs#961)). A bug
fix has been implemented to improve the `TO_ARRAY` function in our
open-source library. Previously, this function expected only one
parameter, but it has been updated to accept two parameters, with the
second being optional. This change brings the function in line with
other functions in the class, improving flexibility and ensuring
backward compatibility. The `TO_ARRAY` function is used to convert a
given expression to an array if it is not null and return null
otherwise. The commit also includes updates to the `Generator` class,
where a new entry for the `ToArray` expression has been added to the
`expression_map` dictionary. Additionally, a new `ToArray` class has
been introduced as a subclass of `Func`, allowing the function to handle
a variable number of arguments more gracefully. Relevant updates have
been made to the functional tests for the `to_array` function for both
Snowflake and Databricks SQL, demonstrating its handling of null inputs
and comparing it with the corresponding ARRAY function in each SQL
dialect. Overall, these changes enhance the functionality and
adaptability of the `TO_ARRAY` function.
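Semantically, the Databricks-side equivalent wraps a non-NULL value in a single-element array and propagates NULL otherwise. A hedged illustration of that mapping (the expression generated by remorph may differ in detail):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snowflake: TO_ARRAY(col) returns NULL for NULL input, otherwise a 1-element array.
spark.sql("""
    SELECT
        CASE WHEN col IS NULL THEN NULL ELSE array(col) END AS to_array_equivalent
    FROM VALUES (1), (CAST(NULL AS INT)) AS t(col)
""").show()
```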
* feat: Implement all of TSQL predicates except for SOME ALL ANY
([databrickslabs#922](databrickslabs#922)). In this
commit, we have implemented the IR generation for several TSQL
predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT,
thereby improving the parser's ability to handle a wider range of TSQL
syntax. The `TSqlParser` class has been updated with new methods and
changes to existing ones, including the addition of new labeled
expressions to the `predicate` rule. Additionally, we have corrected an
error in the LIKE predicate's implementation, allowing the ESCAPE
character to accept a full expression that evaluates to a single
character at runtime, rather than assuming it to be a single character
at parse time. These changes provide more flexibility and adherence to
the TSQL standard, enhancing the overall functionality of the project
for our adopters.
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 3, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this issue Jan 3, 2025