Release v0.7.0 (#992)
* Added private key authentication for sf
([#917](#917)). This
commit adds support for private key authentication to the Snowflake data
source connector, providing users with more flexibility and security.
The `cryptography` library is used to process the user-provided private
key, with priority given to the `pem_private_key` secret, followed by
the `sfPassword` secret. If neither secret is found, an exception is
raised. However, password-based authentication is still used when JDBC
options are provided, as Spark JDBC does not currently support private
key authentication. A new exception class,
`InvalidSnowflakePemPrivateKey`, has been added for handling invalid or
malformed private keys. Additionally, new tests have been included for
reading data with private key authentication, handling malformed private
keys, and checking for missing authentication keys. The NOTICE file has
been updated to include the `cryptography` library's copyright and
license information. A minimal sketch of the key-loading fallback is
shown below.
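
As a rough illustration of the fallback described above, this sketch loads a PEM key with `cryptography` and falls back to password authentication. The helper name, the returned dictionary shape, and the use of `ValueError` in place of `InvalidSnowflakePemPrivateKey` are assumptions for illustration, not remorph's actual code.

```python
from cryptography.hazmat.primitives import serialization


def build_snowflake_auth(secrets: dict) -> dict:
    """Prefer a PEM private key over password authentication (illustrative helper)."""
    pem = secrets.get("pem_private_key")
    if pem is not None:
        try:
            key = serialization.load_pem_private_key(pem.encode(), password=None)
        except ValueError as err:
            # remorph raises InvalidSnowflakePemPrivateKey for this case
            raise ValueError(f"invalid or malformed PEM private key: {err}") from err
        # Snowflake clients generally accept the key as DER-encoded PKCS#8 bytes
        der_key = key.private_bytes(
            encoding=serialization.Encoding.DER,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption(),
        )
        return {"private_key": der_key}
    if "sfPassword" in secrets:
        return {"password": secrets["sfPassword"]}
    raise ValueError("neither pem_private_key nor sfPassword secret was found")
```
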
* Added support for `PARSE_JSON` and `VARIANT` datatype
([#906](#906)). This
commit introduces support for the `PARSE_JSON` function and `VARIANT`
datatype in the Snowflake parser, addressing issue
[#894](#894). The
implementation removes the experimental dialect, enabling support for
the `VARIANT` datatype and using `PARSE_JSON` for it. The
`variant_explode` function is also used. During transpilation from
Snowflake, whenever the `:` operator is encountered in a SELECT
statement, the accessed expression is treated as a `VARIANT` on the
Databricks side to handle the differences between Snowflake and
Databricks in accessing variant types. These changes are implemented
using ANTLR.
* Added upgrade script and modified metrics sql
([#990](#990)). In this
release, the open-source library has been updated with several
improvements to the metrics system, database upgrades, and setup
process. The setup_spark_remote.sh script now checks if the Spark server
is running by pinging localhost:4040, rather than sleeping for a fixed
time, allowing for faster execution and more accurate server status
determination. The metrics table's insert statement has been updated to
cast values to `BIGINT` for better handling of larger counts. An upgrade
script has been added to facilitate required modifications, and the
setup_spark_remote.sh script has been modified to validate URLs. A new
Python file for upgrading the metrics table's data types has been added,
which includes a function to recreate the table with the correct data
types for specific columns. The upgrade_common module now includes
several functions for upgrading database tables, and a new unit test
file, test_upgrade_common.py, has been added with test cases for these
functions. Lastly, the upgrade script for v0.4.0 has been updated to
simplify the process of checking if the main table in the reconcile
metadata needs to be recreated and to add an `operation_name` column.
These changes make the metrics handling and upgrade path more robust,
particularly for larger datasets; the readiness-check pattern now used
by setup_spark_remote.sh is sketched below.
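
The readiness check that replaced the fixed sleep can be pictured with a small Python sketch of the same polling pattern. The actual script is shell-based; the URL and timeout values here are placeholders.

```python
import time
import urllib.request
from urllib.error import URLError


def wait_for_spark(url: str = "http://localhost:4040", timeout: float = 120.0) -> bool:
    """Poll the Spark UI until it answers instead of sleeping for a fixed time."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True  # the server responded, so it is up
        except (URLError, OSError):
            time.sleep(2)  # not reachable yet, retry shortly
    return False
```
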
* Basic CTAS Implementation
([#968](#968)). This
pull request adds basic support for the CREATE TABLE AS SELECT (CTAS)
statement in Snowflake, enabling users to create a new table by
selecting data from an existing table or query. In the
LogicalPlanGenerator class, a new method has been implemented to handle
CTAS statements, which generates the appropriate SQL command for
creating a table based on the result of a select query. The
SnowflakeDDLBuilder class now includes a relationBuilder class member
for building relations based on Snowflake DDL input, and the
visitCreateTableAsSelect method has been overridden to parse CTAS
statements and construct corresponding IR objects. The test suite has
been expanded to include a new spec for CTAS statements and a test case
for the CTAS statement "CREATE TABLE t1 AS SELECT c1, c2 FROM t2;".
Additionally, a new test file "test_ctas_complex.sql" has been added,
containing SQL statements for creating a new table by selecting columns
from multiple tables and computing new columns through various data
manipulations. The implementation also adds new CTAS SQL statements in
both the Snowflake and Databricks dialects, allowing the CTAS
functionality to be tested against both.
* Create repeatable estimator for Snowflake query history
([#924](#924)). This
commit introduces a new coverage estimation tool for analyzing query
history in a database, initially implemented for Snowflake. The tool
parses and transpiles query history into Databricks SQL and reports on
the percentage of query history it can process. It includes a new
`SnowflakePlanParser` class that handles Snowflake query plans, a
`SqlGenerator` class that generates Databricks SQL from optimized
logical plans, and a `dialect` method that returns the dialect string.
The long-term plan is to extend this functionality to other supported
databases and dialects and include a report on SQL complexity.
Additionally, test cases have been added to the `AnonymizerTest` class
to ensure the correct functionality of the `Anonymizer` class, which
anonymizes executed queries when provided with a `PlanParser` object.
The `Anonymizer` class is intended to be used as part of the coverage
estimation tool, which will provide analysis of query history for
various databases; the fingerprinting idea is sketched below.
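
A minimal Python sketch of the query-fingerprinting idea behind the `Anonymizer`. The real implementation is Scala and works on parsed plans rather than regexes; this is only an approximation of the concept, and the function names are invented.

```python
import hashlib
import re


def anonymize(query: str) -> str:
    """Mask literals so structurally identical queries look the same."""
    masked = re.sub(r"'[^']*'", "'?'", query)           # string literals
    masked = re.sub(r"\b\d+(?:\.\d+)?\b", "?", masked)  # numeric literals
    return masked


def fingerprint(query: str) -> str:
    """Stable hash of the anonymized query, usable as a history key."""
    return hashlib.sha256(anonymize(query).lower().encode()).hexdigest()[:16]


# Two queries that differ only in literals share a fingerprint.
assert fingerprint("SELECT * FROM t WHERE id = 42") == fingerprint(
    "SELECT * FROM t WHERE id = 7"
)
```
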
* Created a mapping dict for algo for each dialect at layer level
([#934](#934)). A series
of changes have been implemented to improve the reconciliation process
and the handling of hash algorithms in the open-source library. A
dictionary mapping each dialect to its hash algorithm has been
introduced at the layer level. The `get_hash_transform`
function now accepts a new `layer` argument and returns a list of hash
algorithms from the `HashAlgoMapping` dictionary. A new
`HashAlgoMapping` class has been added to map algorithms to a dialect
for hashing, replacing the previous `DialectHashConfig` class. A new
function `get_dialect` has been introduced to retrieve the dialect based
on the layer. The `_hash_transform` function and the `build_query`
method have been updated to use the `layer` parameter when determining
the dialect. These changes give more precise control over the hash
algorithm used for a given source layer and target dialect, improving
reconciliation performance and accuracy; a rough sketch of the mapping
shape follows.
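
The mapping shape can be sketched roughly as follows. The dialect names, the algorithm choices, and the simplified `get_hash_transform` signature (the real one returns a list of algorithms) are illustrative assumptions, not the library's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HashAlgoMapping:
    """Which hash expression to apply on each side of the reconciliation."""
    source: Callable[[str], str]
    target: Callable[[str], str]


def sha256_expr(col: str) -> str:
    return f"sha2({col}, 256)"


def md5_expr(col: str) -> str:
    return f"md5({col})"


# Illustrative contents only: pick the hashing strategy per source dialect.
HASH_ALGO_BY_DIALECT = {
    "snowflake": HashAlgoMapping(source=sha256_expr, target=sha256_expr),
    "oracle": HashAlgoMapping(source=md5_expr, target=md5_expr),
}


def get_hash_transform(dialect: str, layer: str) -> Callable[[str], str]:
    # The real get_hash_transform returns a list of algorithms; a single
    # callable is returned here to keep the sketch short.
    mapping = HASH_ALGO_BY_DIALECT[dialect]
    return mapping.source if layer == "source" else mapping.target
```
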
* Fetch TableDefinitions from Snowflake
([#904](#904)). A new
`SnowflakeTableDefinitions` class has been added to simplify the
discovery of Snowflake table metadata. This class establishes a
connection with a Snowflake database through a Connection object, and
provides methods such as `getDataType` and `getTableDefinitionQuery` to
parse data types and generate queries for table definitions. Moreover,
it includes a `getTableDefinitions` method to retrieve all table
definitions in a Snowflake database as a sequence of `TableDefinition`
objects, each encapsulating the properties of a table. The class also
provides methods for retrieving all catalogs and schemas in a Snowflake
database. Alongside `SnowflakeTableDefinitions`, a new test class,
`SnowflakeTableDefinitionTest`, has been introduced to verify that
`getTableDefinitions` behaves as intended. A rough sketch of the kind
of metadata query and grouping involved is shown below.
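
A rough sketch of the kind of metadata query and row grouping such a discovery class performs. The SQL text and the `TableDefinition` fields shown here are assumptions, not the exact query or case class used by `SnowflakeTableDefinitions`, which is implemented in Scala.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TableDefinition:
    catalog: str
    schema: str
    table: str
    columns: list[tuple[str, str]]  # (column name, data type)


# Roughly the kind of metadata query such a discovery class issues; the
# exact SQL used by SnowflakeTableDefinitions may differ.
COLUMNS_QUERY = """
SELECT table_catalog, table_schema, table_name, column_name, data_type
FROM information_schema.columns
ORDER BY table_catalog, table_schema, table_name, ordinal_position
"""


def group_into_definitions(rows) -> list[TableDefinition]:
    """Fold flat (catalog, schema, table, column, type) rows into TableDefinitions."""
    grouped = defaultdict(list)
    for catalog, schema, table, column, data_type in rows:
        grouped[(catalog, schema, table)].append((column, data_type))
    return [TableDefinition(c, s, t, cols) for (c, s, t), cols in grouped.items()]
```
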
* Guide user on missing configuration file
([#930](#930)). In this
commit, the `_verify_recon_table_config` method in the `runner.py` file
of the `databricks/labs/remorph` package has been updated to handle
missing reconcile table configurations during installation. When the
reconcile table configuration is not found, an error message will now
display the name of the requested configuration file. This enhancement
helps users identify the specific configuration file they need to
provide in their workspace, addressing issue
[#919](#919). This
commit is co-authored by Ludovic Claude.
* Implement more missing visitor functions for Snowflake and TSQL
([#975](#975)). In this
release, we have added several missing visitor methods for the Snowflake
and TSQL builder classes to improve the reliability and maintainability
of our parser. Previously, when a visitor method was missing, the
default visitor was called, which visited all children of the
ParseTree; a slight modification to the ANTLR grammar could therefore
inadvertently break the visitor and produce incorrect results. In this
release, we have implemented several new visitor
methods for both Snowflake and TSQL builder classes, including the
`visitDdlCommand` method in the `SnowflakeDDLBuilder` class and the
`visitDdlClause` method in the `TSqlDDLBuilder` class. These new methods
ensure that the ParseTree is traversed correctly and that the correct IR
node is returned. The `visitDdlCommand` method checks for different
types of DDL commands, such as create, alter, drop, and undrop, and
calls the appropriate method for each type. The `visitDdlClause` method
contains a sequence of methods corresponding to different DDL clauses
and calls the first non-null method in the sequence. These changes
make the parser noticeably more robust; the dispatch-versus-fallback
idea is sketched below.
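
The dispatch-versus-fallback problem can be illustrated with a small Python sketch. It mimics the shape of an ANTLR visitor rather than reproducing the generated Scala classes, and the node and method names are invented for the example.

```python
class Node:
    def __init__(self, kind: str, children=(), value=None):
        self.kind, self.children, self.value = kind, list(children), value


class Visitor:
    def visit(self, node: Node):
        # Dispatch to an explicit visitFoo method when one exists,
        # otherwise fall back to the generic child walk.
        method = getattr(self, f"visit{node.kind}", self.visit_children)
        return method(node)

    def visit_children(self, node: Node):
        # ANTLR-like default: walk every child and keep whatever the last
        # non-null result happens to be, which can silently lose structure.
        result = None
        for child in node.children:
            result = self.visit(child) or result
        return result


class DdlBuilder(Visitor):
    def visitDdlCommand(self, node: Node):
        # An explicit method returns a well-formed IR node instead of
        # relying on the fallback walk above.
        return {"ir": "DdlCommand", "parts": [self.visit(c) for c in node.children]}


tree = Node("DdlCommand", [Node("CreateTable", value="t1")])
# The child has no explicit visitor, so the fallback yields None for it.
print(DdlBuilder().visit(tree))  # {'ir': 'DdlCommand', 'parts': [None]}
```
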
* Introduce typed errors
([#981](#981)). This
commit introduces typed errors in the form of a new class,
`UnexpectedNode`, and several case classes including `ParsingError`,
`UnsupportedDataType`, `WrongNumberOfArguments`, `UnsupportedArguments`,
and `UnsupportedDateTimePart` in various packages, as part of the
ongoing effort to replace exception throwing with returning `Result`
types in future pull requests. These changes will improve error handling
and provide more context and precision for errors, facilitating
debugging and maintenance of the remorph library and data type
generation functionality. The `TranspileException` class is now
constructed with specific typed error instances, and the
`ErrorCollector` and `ErrorDetail` classes have been updated to use
`ParsingError`. Additionally, the `SnowflakeCallMapper` and
`SnowflakeTimeUnits` classes have been updated to use the new typed
error mechanism, providing more precise error handling for
Snowflake-specific functions and expressions; a rough illustration of
the typed-error pattern follows.
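
A rough illustration of the typed-error-plus-result pattern in Python. The error class names echo the ones listed above, but their fields, the `Result` wrapper, and `map_data_type` are illustrative stand-ins for the Scala case classes, not the library's API.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ParsingError:
    line: int
    message: str


@dataclass(frozen=True)
class UnsupportedDataType:
    type_name: str


TranspileError = Union[ParsingError, UnsupportedDataType]


@dataclass(frozen=True)
class Result:
    """Carries either a value or a typed error instead of throwing."""
    value: Optional[str] = None
    error: Optional[TranspileError] = None


def map_data_type(name: str) -> Result:
    # Illustrative mapping only; not the actual type rules.
    mapping = {"NUMBER": "DECIMAL(38, 0)", "TEXT": "STRING"}
    if name not in mapping:
        return Result(error=UnsupportedDataType(name))  # typed error, no exception
    return Result(value=mapping[name])


print(map_data_type("GEOGRAPHY"))  # Result(value=None, error=UnsupportedDataType(...))
```
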
* Miscellaneous improvements to Snowflake parser
([#952](#952)). This
diff brings several miscellaneous improvements to the Snowflake parser
in the open-source library, targeting increased parse and transpilation
success rates. The modifications include updating the `colDecl` rule to
allow optional data types, introducing an `objectField` rule, and
enabling date and timestamp literals as strings. Additionally, the
parser has been refined to handle identifiers more gracefully, such as
the hashes used in the AnonymizerTest, and the expected AST for certain
test cases has been updated to improve accuracy. Together, these
changes reduce parsing errors and make the Snowflake parser more
robust. Furthermore, the error handling and reporting capabilities of
the Snowflake parser have been improved with new case classes,
`IndividualError` and `ErrorsSummary`, and updated error messages.
* Moved intermediate package out of parsers
([#972](#972)). In this
release, the `intermediate` package has been refactored out of the
`parsers` package, aligning with the design principle that parsers
should depend on the intermediate representation instead of the other
way around. This change affects various classes and methods across the
project, all of which have been updated to import the `intermediate`
package from its new location. No new functionality has been introduced,
but the refactoring improves the package structure and dependency
management. The `EstimationAnalyzer` class in the `coverage/estimation`
package has been updated to import classes from the new location of the
`intermediate` package, and its `evaluateTree` method has been updated
to use the new import path for `LogicalPlan` and `Expression`. Other
affected classes include `SnowflakeTableDefinitions`, `SnowflakeLexer`,
`SnowflakeParser`, `SnowflakeTypeBuilder`, `GeneratorContext`,
`DataTypeGenerator`, `IRHelpers`, and multiple test files.
* Patch Function without Brackets
([#907](#907)). This
commit introduces new lexer and parser rules to handle Snowflake SQL
functions without parentheses, specifically impacting CURRENT_DATE,
CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP. The new
rules allow these functions to be used without parentheses, consistent
with Snowflake SQL. This change fixes functional tests and includes
documentation for the affected functions. However, there is a pending
task to add or fix more test cases to ensure comprehensive testing of
the new rules. Additionally, the syntax of the SELECT statement for the
CURRENT_TIMESTAMP function has been updated, removing the need for the
parameter 'col1'. This change simplifies the syntax for certain SQL
functions in the codebase and improves the consistency and reliability
of the functional tests.
* Root Table
([#936](#936)). This PR
introduces a new class, `TableGraph`, that extends `DependencyGraph`
and mixes in the `LazyLogging` trait. This class builds
a graph of tables and their dependencies based on query history and
table definitions. It provides methods to add nodes and edges, build the
graph, and retrieve root, upstream, and downstream tables. The
`DependencyGraph` trait offers a more structured and flexible way to
handle table dependencies. This change is part of the Root Table feature
(issue [#936](#936))
that identifies root tables in a graph of table dependencies, closing
issue [#23](#23). The PR
also includes a new `TableGraphTest` class that demonstrates these
methods and verifies their behavior. The root/upstream/downstream idea
is sketched below.
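
An illustrative Python sketch of the root/upstream/downstream idea. The real `TableGraph` is a Scala class built from query history and table definitions; the edge-adding API and the example table names here are invented.

```python
from collections import defaultdict


class TableGraph:
    """Tiny sketch of a table dependency graph with root lookups."""

    def __init__(self):
        self.upstream = defaultdict(set)    # table -> tables it reads from
        self.downstream = defaultdict(set)  # table -> tables that read from it

    def add_edge(self, source: str, target: str):
        # target is built from source, e.g. INSERT INTO target SELECT ... FROM source
        self.upstream[target].add(source)
        self.downstream[source].add(target)

    def roots(self):
        # Root tables are those with no upstream dependencies.
        all_tables = set(self.upstream) | set(self.downstream)
        return {t for t in all_tables if not self.upstream[t]}


g = TableGraph()
g.add_edge("raw_orders", "stg_orders")
g.add_edge("stg_orders", "fct_orders")
print(g.roots())  # {'raw_orders'}
```
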
* Snowflake Merge Implementation
([#964](#964)). In this
release, we have implemented the Merge statement for the Snowflake
parser, which enables updating or deleting rows in a target table based
on matches with a source table, and inserting new rows into the target
table when there are no matches. This feature includes updates to the
SnowflakeDMLBuilder and SnowflakeExpressionBuilder classes, allowing for
proper handling of column names and MERGE queries. Additionally, we have
added test cases to the SnowflakeASTBuilder, SnowflakeDMLBuilderSpec,
and SnowflakeToDatabricksTranspiler to ensure the accurate translation
and execution of MERGE statements for the Snowflake dialect. These
changes bring important database migration and synchronization
capabilities to our open-source library, improving its functionality and
usability for software engineers.
* TSQL: Implement CREATE TABLE
([#911](#911)). This
commit implements the TSQL CREATE TABLE command and its various options
and forms, including CTAS, graph node syntax, and analytics variants, as
well as syntactical differences for SQL Server. The DDL and DML visitors
have been moved from the AST and Relation visitors to separate classes
for better responsibility segregation. The LogicalPlanGenerator class
has been updated to generate unique constraints, primary keys, foreign
keys, check constraints, default value constraints, and identity
constraints for the CREATE TABLE command. Additionally, new classes for
generating SQL options and handling unresolved options during
transpilation have been added to enhance the parser's capability to
manage various options and forms. These changes improve the
transpilation of TSQL code and the organization of the codebase, making
it easier to maintain and extend.
* Transpile Snow ARRAY_SORT function
([#973](#973)). In this
release, we have implemented support for the Snowflake ARRAY_SORT
function in our open-source library. This feature has been added as part
of issue [#973](#973),
and it involves the addition of two new private methods, `arraySort` and
`makeArraySort`, to the `SnowflakeCallMapper` class. The `arraySort`
method takes a sequence of expressions as input and delegates to
`makeArraySort`, which handles both null and non-null values, sorts the
array in ascending or descending order based on the provided parameter,
and places null values according to the nulls-first parameter. The
result is returned as an `ir.ArraySort` expression, so array sorting in
Snowflake SQL can be translated to equivalent code in the target
dialect. The intended sort semantics are sketched below.
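
The sort semantics the mapping targets can be illustrated with a plain Python function. This is only a behavioral sketch; whether it matches Snowflake's exact defaults for null ordering is not asserted here.

```python
def array_sort(values, ascending=True, nulls_first=True):
    """Behavioral sketch: sort the non-null values, then place the nulls."""
    non_null = sorted((v for v in values if v is not None), reverse=not ascending)
    nulls = [None] * (len(values) - len(non_null))
    return nulls + non_null if nulls_first else non_null + nulls


print(array_sort([3, None, 1, 2], ascending=True, nulls_first=False))
# -> [1, 2, 3, None]
```
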
* Transpile Snow MONTHS_BETWEEN function correctly
([#963](#963)). In this
release, the remorph library's SnowflakeCallMapper class in the
com/databricks/labs/remorph/parsers/snowflake/rules package has been
updated to handle the MONTHS_BETWEEN function. A new case has been added
that creates a MonthsBetween object with the first two arguments of the
function call and a boolean value of true. This change enhances
compatibility and ensures that the output accurately reflects the
intended functionality. Additionally, new test cases have been
introduced to the SnowflakeCallMapperSpec for the transpilation of the
MONTHS_BETWEEN function. These test cases demonstrate accurate mapping
of the function to the MonthsBetween class and proper casting of inputs
as dates or timestamps, improving the reliability and precision of date
and time calculations.
* Updated Installation to handle install errors
([#962](#962)). In this
release, we've made significant improvements to the `remorph` project,
addressing and resolving installation errors that were occurring during
the installation process in development mode. We've introduced a new
`ProductInfo` class in the `wheels` module, which provides information
about the products being installed. This change replaces the use of
`WheelsV2` in two test functions. Additionally, we've updated the
`workspace_installation` method in `application.py` to handle
installation errors more effectively, addressing the dependency on
workspace `.remorph` due to wheels. We've also added new methods to
`installation.py` to manage local and remote version files, and updated
the `_upgrade_reconcile_workflow` function to ensure the correct wheel
path is used during installation. These changes make installation more
reliable for both developers working on the project and end users.
* Updated catalog operations logging
([#910](#910)). In this
release, the setup process for the catalog, schema, and volume in the
configurator module has been simplified and improved. The previous
implementation repeatedly prompted the user for input until the correct
input was provided or a maximum number of attempts was reached. The
updated code now checks if the catalog, schema, or volume already exists
and either uses it or prompts the user to create it once. If the user
does not have the necessary privileges to use the catalog, schema, or
volume, an error message is logged and the installation is aborted. New
methods have been added to check for necessary privileges, such as
`has_necessary_catalog_access`, `has_necessary_schema_access`, and
`has_necessary_volume_access`, which return a boolean indicating whether
the user has the necessary privileges and log an error message with the
missing privileges if not. The logging for catalog operations in the
install.py file has also been updated to check for privileges at the end
of the process and list any missing privileges for each catalog object.
Additionally, the unit tests for the ResourceConfigurator class have
been updated to cover cases where the user lacks the permissions needed
to access catalogs, schemas, or volumes.
* Updated remorph reconcile workflow to use wheels instead of pypi
([#884](#884)). In this
release, the installation process for the Remorph library has been
updated to allow for the use of locally uploaded wheel files instead of
downloading the package from PyPI. This change includes updates to the
`install` and `_deploy_jobs` methods in the `recon.py` file to accept a
new `wheel_paths` argument, which is used to pass the path of the
Remorph wheel file to the `deploy_recon_job` method. The
`_upgrade_reconcile_workflow` function in the
`v0.4.0_add_main_table_operation_name_column.py` file has also been
updated to upload the wheel package to the workspace and pass its path
to the `deploy_reconcile_job` method. Additionally, the
`deploy_recon_job` method in the `JobDeployment` class now accepts a new
`wheel_file` argument, which represents the name of the wheel file for
the remorph library. These changes address issues faced by customers
with no public internet access and enable the use of new features before
they are released on PyPI. The `test_recon.py` file in the
`tests/unit/deployment` directory has also been updated to reflect these
changes.
* Upgrade script Implementation
([#777](#777)). In this
release, we've implemented an upgrade script as part of pull request
[#777](#777), which
resolves issue
[#769](#769). This
change introduces a new `Upgrades` class in `application.py` that
accepts `product_info` and `installation` as parameters and includes a
cached property `wheels` for improved performance. Additionally, we've
added new methods to the `WorkspaceInstaller` class for handling
upgrade-related tasks, including the creation of a `ProductInfo` object,
interacting with the Databricks SDK, and handling potential errors.
We've also added a test case to ensure that upgrades are applied
correctly on more recent versions. These changes are part of our ongoing
effort to enhance the management and application of upgrades to
installed products.
* bug fix for to_array function
([#961](#961)). A bug
fix has been implemented to improve the `TO_ARRAY` function in our
open-source library. Previously, this function expected only one
parameter, but it has been updated to accept two parameters, with the
second being optional. This change brings the function in line with
other functions in the class, improving flexibility and ensuring
backward compatibility. The `TO_ARRAY` function converts a given
expression to an array and returns null when the input is null. The
commit also includes updates to the `Generator` class,
where a new entry for the `ToArray` expression has been added to the
`expression_map` dictionary. Additionally, a new `ToArray` class has
been introduced as a subclass of `Func`, allowing the function to handle
a variable number of arguments more gracefully. Relevant updates have
been made to the functional tests for the `to_array` function for both
Snowflake and Databricks SQL, demonstrating its handling of null inputs
and comparing it with the corresponding ARRAY function in each SQL
dialect. The null-handling semantics are sketched below.
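
The null-handling semantics described above can be sketched in plain Python. The pass-through of existing arrays and the unused second parameter are assumptions for illustration, since the entry does not spell out what the optional second argument means.

```python
def to_array(value, _second=None):
    """NULL stays NULL; an existing array passes through; anything else is wrapped."""
    # _second is accepted but unused here, mirroring the optional second parameter.
    if value is None:
        return None
    if isinstance(value, list):
        return value  # assumption: arrays pass through unchanged
    return [value]


print(to_array(None))    # None
print(to_array(42))      # [42]
print(to_array([1, 2]))  # [1, 2]
```
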
* feat: Implement all of TSQL predicates except for SOME ALL ANY
([#922](#922)). In this
commit, we have implemented the IR generation for several TSQL
predicates including IN, IS, BETWEEN, LIKE, EXISTS, and FREETEXT,
thereby improving the parser's ability to handle a wider range of TSQL
syntax. The `TSqlParser` class has been updated with new methods and
changes to existing ones, including the addition of new labeled
expressions to the `predicate` rule. Additionally, we have corrected an
error in the LIKE predicate's implementation, allowing the ESCAPE
character to accept a full expression that evaluates to a single
character at runtime, rather than assuming it to be a single character
at parse time. These changes provide more flexibility and adherence to
the TSQL standard, enhancing the overall functionality of the project
for our adopters.

nfx authored Oct 10, 2024 · 1 parent 5f27d15 · commit 9679215
2 changed files with 29 additions and 1 deletion.