
Support function translation to Databricks SQL in TSql and Snowflake #414

Merged
merged 10 commits into main from feature/funcconverter
Jun 4, 2024

Conversation

jimidle
Contributor

@jimidle jimidle commented May 30, 2024

Some functions must be translated from their TSQL or Snowflake forms into the equivalent IR for Databricks SQL. In some cases a function must be translated in one dialect, say TSql, but is directly equivalent in another, say Snowflake.

Here we upgrade the FunctionBuilder system to be dialect aware, and provide a ConversionStrategy system that supports anything from a simple name translation to a more complicated, dialect-specific IR representation when there is no direct equivalent.

For example, the TSQL code

SELECT ISNULL(x, 0)

should translate to

SELECT IFNULL(x, 0)

in Databricks SQL, whereas the Snowflake SQL

SELECT ISNULL(col)

is directly equivalent to Databricks SQL and needs no conversion.
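
To make this concrete, here is a minimal, self-contained Scala sketch of what such a dialect-aware conversion strategy can look like. Every name in it (SourceDialect, ConversionStrategy, Rename, AsIs, FunctionConverter, CallFunction) is illustrative only and is not taken from the actual code in this PR.

```scala
// Illustrative sketch only; all names are hypothetical and may not match the PR.
sealed trait SourceDialect
case object TSql extends SourceDialect
case object Snowflake extends SourceDialect

// A tiny stand-in for the IR: a function call with a name and argument strings.
final case class CallFunction(name: String, args: Seq[String])

// A ConversionStrategy decides how a source-dialect function becomes Databricks SQL IR.
trait ConversionStrategy {
  def convert(call: CallFunction): CallFunction
}

// Simple case: only the function name changes.
final case class Rename(newName: String) extends ConversionStrategy {
  def convert(call: CallFunction): CallFunction = call.copy(name = newName)
}

// No conversion needed: the call is already valid Databricks SQL.
case object AsIs extends ConversionStrategy {
  def convert(call: CallFunction): CallFunction = call
}

object FunctionConverter {
  // Dialect-aware lookup: ISNULL is renamed when it comes from TSql,
  // but passed through unchanged when it comes from Snowflake.
  def strategyFor(dialect: SourceDialect, name: String): ConversionStrategy =
    (dialect, name.toUpperCase) match {
      case (TSql, "ISNULL")      => Rename("IFNULL")
      case (Snowflake, "ISNULL") => AsIs
      case _                     => AsIs
    }

  def convert(dialect: SourceDialect, call: CallFunction): CallFunction =
    strategyFor(dialect, call.name).convert(call)
}

object Demo extends App {
  // TSql: ISNULL(x, 0) becomes IFNULL(x, 0); Snowflake: ISNULL(col) is unchanged.
  println(FunctionConverter.convert(TSql, CallFunction("ISNULL", Seq("x", "0"))))
  println(FunctionConverter.convert(Snowflake, CallFunction("ISNULL", Seq("col"))))
}
```

Running the two `convert` calls shows the asymmetry described above: the TSql call is renamed to IFNULL, while the Snowflake call passes through unchanged.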


codecov bot commented May 30, 2024

Codecov Report

Attention: Patch coverage is 98.61111% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.90%. Comparing base (d628677) to head (af08311).

Files | Patch % | Lines
...h/parsers/snowflake/SnowflakeFunctionBuilder.scala | 66.66% | 3 Missing ⚠️
...abricks/labs/remorph/parsers/FunctionBuilder.scala | 99.59% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #414      +/-   ##
==========================================
- Coverage   97.99%   97.90%   -0.10%     
==========================================
  Files          58       61       +3     
  Lines        3743     3765      +22     
  Branches      470      467       -3     
==========================================
+ Hits         3668     3686      +18     
- Misses         42       46       +4     
  Partials       33       33              



github-actions bot commented May 30, 2024

Coverage tests results

354 tests  ±0   246 ✅ ±0   6s ⏱️ ±0s
  2 suites ±0     0 💤 ±0 
  2 files   ±0   108 ❌ ±0 


Results for commit af08311. ± Comparison against base commit d628677.

♻️ This comment has been updated with latest results.

@jimidle jimidle force-pushed the feature/funcconverter branch from f56ed2f to ee946d9 May 30, 2024 17:19
@jimidle jimidle requested review from vil1 and sundarshankar89 May 30, 2024 17:19
@jimidle jimidle marked this pull request as ready for review May 30, 2024 17:19
@jimidle jimidle requested a review from a team as a code owner May 30, 2024 17:19
@jimidle jimidle changed the title from "Support function translation to Databricks SQL in TSQl and Snowflake" to "Support function translation to Databricks SQL in TSql and Snowflake" May 30, 2024
Contributor

@vil1 vil1 left a comment

I suggest changes to this PR in #418

jimidle pushed a commit that referenced this pull request May 31, 2024
Refactoring and improvements for `FunctionBuilder`. 

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
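
As a rough illustration of the data model described in this commit, here is a hedged Scala sketch; the case classes and the `functionDefinition` lookup are inferred from the bullet points above (StandardFunction in particular is a placeholder name), and the real signatures in the repository may well differ.

```scala
// Sketch only: shapes inferred from the commit message, not copied from the repository.
sealed trait SourceDialect
case object TSql extends SourceDialect
case object Snowflake extends SourceDialect

sealed trait FunctionArity
final case class FixedArity(arity: Int) extends FunctionArity
final case class VariableArity(argMin: Int, argMax: Int) extends FunctionArity

sealed trait FunctionType
case object StandardFunction extends FunctionType         // placeholder name for the "convertible" case
case object NotConvertibleFunction extends FunctionType   // replaces the old isConvertible boolean

final case class FunctionDefinition(arity: FunctionArity, functionType: FunctionType)

object FunctionBuilder {
  // Lookup by (dialect, name) so a function can be defined for one dialect only,
  // e.g. ISNULL for TSql and IFNULL for Snowflake.
  def functionDefinition(dialect: SourceDialect, name: String): Option[FunctionDefinition] =
    (dialect, name.toUpperCase) match {
      case (TSql, "ISNULL")      => Some(FunctionDefinition(FixedArity(2), StandardFunction))
      case (Snowflake, "IFNULL") => Some(FunctionDefinition(FixedArity(2), StandardFunction))
      case (_, "CONCAT")         => Some(FunctionDefinition(VariableArity(2, Int.MaxValue), StandardFunction))
      case _                     => None
    }
}
```

With the lookup keyed on (dialect, name), `functionDefinition(TSql, "ISNULL")` yields a fixed arity of 2 while the same name under Snowflake yields None, matching the dialect-specific definitions the bullet points call for.
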
jimidle pushed a commit that referenced this pull request May 31, 2024
Refactoring and improvements for `FunctionBuilder`. 

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@jimidle jimidle force-pushed the feature/funcconverter branch from df9862b to cfcadbb May 31, 2024 13:13
@vil1 vil1 self-requested a review June 3, 2024 07:15
Contributor

@vil1 vil1 left a comment

LGTM

@sundarshankar89 sundarshankar89 added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@jimidle jimidle force-pushed the feature/funcconverter branch from 2f85878 to 7a6e005 June 3, 2024 14:30
@jimidle jimidle added the `pr/ready-to-merge` label (this PR is safe to merge) Jun 3, 2024
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Jun 3, 2024
jimidle pushed a commit that referenced this pull request Jun 3, 2024
Refactoring and improvements for `FunctionBuilder`.

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@jimidle jimidle force-pushed the feature/funcconverter branch from 7a6e005 to 85d765f June 3, 2024 16:59
jimidle pushed a commit that referenced this pull request Jun 3, 2024
Refactoring and improvements for `FunctionBuilder`.

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@jimidle jimidle force-pushed the feature/funcconverter branch from 85d765f to 8149749 June 3, 2024 17:09
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jun 3, 2024
…414)

Some functions must be translated from their TSQL or Snowflake forms into
the equivalent IR for Databricks SQL. In some cases a function must be
translated in one dialect, say TSql, but is directly equivalent in
another, say Snowflake.

Here we upgrade the FunctionBuilder system to be dialect aware, and
provide a ConversionStrategy system that supports anything from a simple
name translation to a more complicated, dialect-specific IR
representation when there is no direct equivalent.

For example, the TSQL code
```tsql
SELECT ISNULL(x, 0)
```

should translate to

```sql
SELECT IFNULL(x, 0)
```

in Databricks SQL, whereas the Snowflake SQL

```snowflake
SELECT ISNULL(col)
```

is directly equivalent to Databricks SQL and needs no conversion.

---------

Co-authored-by: Valentin Kasas <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@nfx nfx removed the `pr/ready-to-merge` label (this PR is safe to merge) Jun 3, 2024
jimidle and others added 7 commits June 3, 2024 13:16
  - Some functions must be translated from TSQL or Snowflake versions into the
    equivalent IR for Databricks SQL.
  - Provides a ConversionStrategy system that allows for simple name translations
    or more complicated IR representations when there is no equivalent.
  - Modifies FunctionBuilder and related classes to accept different dialects.
Refactoring and improvements for `FunctionBuilder`.

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@nfx nfx requested a review from vil1 June 3, 2024 19:42
@jimidle jimidle force-pushed the feature/funcconverter branch from 8149749 to 83e5d11 June 3, 2024 20:25
Contributor

@vil1 vil1 left a comment

LGTM

@vil1 vil1 requested a review from nfx June 4, 2024 07:15
Collaborator

@nfx nfx left a comment

Lgtm

@nfx nfx added this pull request to the merge queue Jun 4, 2024
Merged via the queue into main with commit eac0207 Jun 4, 2024
9 of 10 checks passed
@nfx nfx deleted the feature/funcconverter branch June 4, 2024 07:40
nfx added a commit that referenced this pull request Jul 1, 2024
* Added Oracle ojdbc8 dependent library during reconcile Installation ([#474](#474)). In this release, the `deployment.py` file in the `databricks/labs/remorph/helpers` directory has been updated to add the `ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function, enabling the reconciliation process to access the Oracle Data source and pull data for reconciliation between Oracle and Databricks. The `JDBCReaderMixin` class in the `jdbc_reader.py` file has also been updated to include the Oracle ojdbc8 dependent library for reconciliation during the `reconcile` process. This involves installing the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent library and updating the driver class to `oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary `driver_class` has been added, which maps the driver name to the corresponding class name, allowing for dynamic driver class selection during the `_get_jdbc_reader` method call. The `test_read_data_with_options` unit test has been updated to test the Oracle connector for reading data with specific options, including the use of the correct driver class and specifying the database table for data retrieval, improving the accuracy and reliability of the reconciliation process.
* Added TSQL coverage tests in the generated report artifact ([#452](#452)). In this release, we have added new TSQL coverage tests and Snowflake coverage tests to the generated report artifact in the CI/CD pipeline. These tests are executed using Maven with the updated command "mvn --update-snapshots -B test -pl coverage --file pom.xml --fail-at-end" and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml --fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o coverage-result.json" respectively, and the "continue-on-error: true" option is added to allow the pipeline to proceed even if the tests fail. Additionally, we have introduced a new constructor to the `CommentBasedQueryExtractor` class, which accepts a `dialect` parameter and allows for easier configuration of the start and end comments for different SQL dialects. We have also updated the CommentBasedQueryExtractor for Snowflake and added two TSQL coverage tests to the generated report artifact to ensure that the `QueryExtractor` is working correctly for TSQL queries. These changes will help ensure thorough testing and identification of TSQL and Snowflake queries during the CI/CD process.
* Added full support for analytical windowing functions ([#401](#401)). In this release, full support for analytical windowing functions has been implemented, addressing issue [#401](#401). The functions were previously specified in the parser grammar but have been moved to the standard function lookup table for more consistent handling. This enhancement allows for the use of analytical aggregate functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN GROUP` syntax and an `OVER` clause, enabling more complex queries and data analysis. The `FixedArity` and `VariableArity` classes have been updated with new methods for the supported functions, and appropriate examples have been provided to demonstrate their usage in SQL.
* Added parsing for STRPOS in presto ([#462](#462)). A new feature has been added to the remorph/snow package's presto module to parse the STRPOS function in SQL code. This has been achieved by importing the locate_to_strposition function from sqlglot.dialects.dialect and incorporating it into the FUNCTIONS dictionary in the Parser class. This change enables the parsing of the STRPOS function, which returns the position of the first occurrence of a substring in a string. The implementation has been tested with a SQL file containing two queries for Presto SQL using STRPOS and Databricks SQL using LOCATE, both aimed at finding the position of the letter `l` in the string 'Hello world', starting the search from the second position. This feature is particularly relevant for software engineers working on data processing and analytics projects involving both Presto and Databricks SQL, as it ensures compatibility and consistent behavior between the two for string manipulation functions. The commit is part of issue [#462](#462), and the diff provided includes a new SQL file with test cases for the STRPOS function in Presto and Locate function in Databricks SQL. The test cases confirm if the `hello` string is present in the greeting_message column of the greetings_table. This feature allows users to utilize the STRPOS function in Presto to determine if a specific substring is present in a string.
* Added validation for join columns for all query builders and limiting rows for reports ([#413](#413)). In this release, we've added validation for join columns in all query builders, ensuring consistent and accurate data joins. A limit on the number of rows displayed for reports has been implemented with a default of 50. The `compare.py` and `execute.py` files have been updated to include validation, and the `QueryBuilder` and `HashQueryBuilder` classes have new methods for validating join columns. The `SamplingQueryBuilder`, `ThresholdQueryBuilder`, and `recon_capture.py` files have similar updates for validation and limiting rows for reports. The `recon_config.py` file now has a new return type for the `get_join_columns` method, and a new method `test_no_join_columns_raise_exception()` has been added in the `test_threshold_query.py` file. These changes aim to enhance data consistency, accuracy, and efficiency for software engineers.
* Adds more coverage tests for functions to TSQL coverage ([#420](#420)). This commit adds new coverage tests for various TSQL functions, focusing on the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are identical in Databricks SQL. The tests include cases with and without the DISTINCT keyword to ensure consistent behavior between TSQL and Databricks. For the GROUPING and GROUPING_ID functions, which have some differences, tests and examples of TSQL and Databricks SQL code are provided. The CHECKSUM_AGG function, not directly supported in Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The CUME_DIST function, identical in both systems, is also tested. Additionally, a new test file for the STDEV function and updated tests for the VAR function are introduced, enhancing the reliability and robustness of TSQL conversions in the project.
* Catalog, Schema Permission checks ([#492](#492)). This release introduces enhancements to the Catalog and Schema functionality, with the addition of permission checks that raise explicit `Permission Denied` exceptions. The logger messages have been updated for clarity and a new variable, README_RECON_REPO, has been created to reference the readme file for the recon_config repository. The ReconcileUtils class has been modified to handle scenarios where the recon_config file is not found or corrupted during loading, providing clear error messages and guidance for users. The unit tests for the install feature have been updated with permission checks for Catalog and Schema operations, ensuring robust handling of permission denied errors. These changes improve the system's error handling and provide clearer guidance for users encountering permission issues.
* Changing the secret name acc to install script ([#432](#432)). In this release, the `recon` function in the `execute.py` file of the `databricks.labs.remorph.reconcile` package has been updated to dynamically generate the secret name instead of hardcoding it as "secret_scope". This change utilizes the new `get_key_form_dialect` function to create a secret name specific to the source dialect being used in the reconciliation process. The `get_dialect` function, along with `DatabaseConfig`, `TableRecon`, and the newly added `get_key_form_dialect`, have been imported from `databricks.labs.remorph.config`. This enhancement improves the security and flexibility of the reconciliation process by generating dynamic and dialect-specific secret names.
* Feature/recon documentation ([#395](#395)). This commit introduces a new reconciliation process, enhancing data consistency between sources, co-authored by Ganesh Dogiparthi, ganeshdogiparthi-db, and SundarShankar89. The README.md file provides detailed documentation for the reconciliation process. A new binary file, docs/transpile-install.gif, offers installation instructions or visual aids, while a mermaid flowchart in `report_types_visualisation.md` illustrates report generation for data, rows, schema, and overall reconciliation. No existing functionality was modified, ensuring the addition of valuable features for software engineers adopting this project.
* Fixing issues in sample query builder to handle Null's and zero ([#457](#457)). This commit introduces improvements to the sample query builder's handling of Nulls and zeroes, addressing bug [#450](#450). The changes include updated SQL queries in the test threshold query file with COALESCE and TRIM functions to replace Null values with a specified string, ensuring consistent comparison of datasets. The query store in test_execute.py has also been enhanced to handle NULL and zero values using COALESCE, improving overall robustness and consistency. Additionally, new methods such as build_join_clause, trim, and coalesce have been added to enhance null handling in the query builder. The commit also introduces the MockDataSource class, a likely test implementation of a data source, and updates the log_and_throw_exception function for clearer error messaging.
* Implement Lakeview Dashboard Publisher ([#405](#405)). In this release, we've introduced the `DashboardPublisher` class in the `dashboard_publisher.py` module to streamline the process of creating and publishing dashboards in Databricks Workspace. This class simplifies dashboard creation by accepting an instance of `WorkspaceClient` and `Installation` and providing methods for creating and publishing dashboards with optional parameter substitution. Additionally, we've added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json', which contains a dashboard definition for a data reconciliation feature. This dashboard includes various widgets for filtering and displaying reconciliation results. We've also added a test file for the Lakeview Dashboard Publisher feature, which includes tests to ensure that the `DashboardPublisher` can create dashboards using specified file paths and parameters. These new features and enhancements are aimed at improving the user experience and streamlining the process of creating and publishing dashboards in Databricks Workspace.
* Integrate recon metadata reconcile cli ([#444](#444)). A new CLI command, `databricks labs remorph reconcile`, has been added to initiate the Data Reconciliation process, loading `reconcile.yml` and `recon_config.json` configuration files from the Databricks Workspace. If these files are missing, the user is prompted to reinstall the `reconcile` module and exit the command. The command then triggers the `Remorph_Reconciliation_Job` based on the Job ID stored in the `reconcile.yml` file. This simplifies the reconcile execution process, requiring users to first configure the `reconcile` module and generate the `recon_config_<SOURCE>.json` file using `databricks labs remorph install` and `databricks labs remorph generate-recon-config` commands. The new CLI command has been manually tested and includes unit tests. Integration tests and verification on the staging environment are pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi, and SundarShankar89.
* Introduce coverage tests ([#382](#382)). This commit introduces coverage tests and updates the GitHub Actions workflow to use Java 11 with Corretto distribution, improving testing and coverage analysis for the project. Coverage tests are added as part of the remorph project with the introduction of a new module for coverage and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is modified to change the parent project version to 0.2.0-SNAPSHOT, ensuring accurate assessment and maintenance of code coverage during development. In addition, a new Main object within the com.databricks.labs.remorph.coverage package is implemented for running coverage tests using command-line arguments, along with the addition of a new file QueryRunner.scala and case classes for ReportEntryHeader, ReportEntryReport, and ReportEntry for capturing and reporting on the status and results of parsing and transpilation processes. The `Cache Maven packages` step is removed and replaced with two new steps: `Run Unit Tests with Maven` and "Run Coverage Tests with Maven." The former executes unit tests and generates a test coverage report, while the latter downloads remorph-core jars as artifacts, executes coverage tests with Maven, and uploads coverage tests results as json artifacts. The `coverage-tests` job runs after the `test-core` job and uses the same environment, checking out the code with full history, setting up Java 11 with Corretto distribution, downloading remorph-core-jars artifacts, and running coverage tests with Maven, even if there are errors. The JUnit report is also published, and the coverage tests results are uploaded as json artifacts, providing better test coverage and more reliable code for software engineers adopting the project.
* Presto approx percentile func fix ([#411](#411)). The remorph library has been updated to support the Presto database system, with a new module added to the config.py file to enable robust and maintainable interaction. An `APPROX_PERCENTILE` function has been implemented in the `presto.py` file of the `sqlglot.dialects.presto` package, allowing for approximate percentile calculations in Presto and Databricks SQL. A test file has been included for both SQL dialects, with queries calculating the approximate median of the height column in the people table. The new functionality enhances the compatibility and versatility of the remorph library in working with Presto databases and improves overall project functionality. Additionally, a new test file for Presto in the snowflakedriver project has been introduced to test expected exceptions, further ensuring robustness and reliability.
* Raise exception if reconciliation fails for any table ([#412](#412)). In this release, we have implemented significant changes to improve exception handling and raise meaningful exceptions when reconciliation fails for any table in our open-source library. A new exception class, `ReconciliationException`, has been added as a child of the `Exception` class, which takes two optional parameters in its constructor, `message` and `reconcile_output`. The `ReconcileOutput` property has been created for accessing the reconcile output object. The `InvalidInputException` class now inherits from `ValueError`, making the code more explicit with the type of errors being handled. A new method, `_verify_successful_reconciliation`, has been introduced to check the reconciliation output status and raise a `ReconciliationException` if any table fails reconciliation. The `test_execute.py` file has been updated to raise a `ReconciliationException` if reconciliation for a specific report type fails, and new tests have been added to the test suite to ensure the correct behavior of the `reconcile` function with and without raising exceptions.
* Removed USE catalog/schema statement as lsql has added the feature ([#465](#465)). In this release, the usage of `USE` statements for selecting a catalog and schema has been removed in the `get_sql_backend` function, thanks to the new feature provided by the lsql library. This enhancement improves code readability, maintainability, and enables better integration with the SQL backend. The commit also includes changes to the installation process for reconciliation metadata tables, providing more clarity and simplicity in the code. Additionally, several test functions have been added or modified to ensure the proper functioning of the `get_sql_backend` function in various scenarios, including cases where a warehouse ID is not provided or when executing SQL statements in a notebook environment. An error simulation test has also been added for handling `DatabricksError` exceptions when executing SQL statements using the `DatabricksConnectBackend` class.
* Sampling with clause query to have `from dual` in from clause for oracle source ([#464](#464)). In this release, we've added the `get_key_from_dialect` function, replacing the previous `get_key_form_dialect` function, to retrieve the key associated with a given dialect object, serving as a unique identifier for the dialect. This improvement enhances the flexibility and readability of the codebase, making it easier to locate and manipulate dialect objects. Additionally, we've modified the 'sampling_query.py' file to include `from dual` in the `from` clause for Oracle sources in a sampling query with a clause, enabling sampling from Oracle databases. The `_insert_into_main_table` method in the `recon_capture.py` file of the `databricks.labs.remorph.reconcile` module has been updated to ensure accurate key retrieval for the specified dialect, thereby improving the reconciliation process. These changes resolve issues [#458](#458) and [#464](#464), enhancing the functionality of the sampling query builder and providing better support for various databases.
* Support function translation to Databricks SQL in TSql and Snowflake ([#414](#414)). This commit introduces a dialect-aware FunctionBuilder system and a ConversionStrategy system to enable seamless translation of SQL functions between TSQL, Snowflake, and Databricks SQL IR. The new FunctionBuilder system can handle both simple name translations and more complex conversions when there is no direct equivalent. For instance, TSQL's ISNULL function translates to IFNULL in Databricks SQL, while Snowflake's ISNULL remains unchanged. The commit also includes updates to the TSqlExpressionBuilder and new methods for building and visiting various contexts, enhancing compatibility and expanding the range of supported SQL dialects. Additionally, new tests have been added in the FunctionBuilderSpec to ensure the correct arity and function type for various SQL functions.
* TSQL: Create coverage tests for TSQL -> Databricks functions ([#415](#415)). This commit introduces coverage tests for T-SQL functions and their equivalent Databricks SQL implementations, focusing on the DATEADD function's `yy` keyword. The DATEADD function is translated to the ADD_MONTHS function in Databricks SQL, with the number of months multiplied by 12. This ensures functional equivalence between T-SQL and Databricks SQL for date addition involving years. The tests are written as SQL scripts and are located in the `tests/resources/functional/tsql/functions` directory, covering various scenarios and possible engine differences between T-SQL and Databricks SQL. The conversion process is documented, and future automation of this documentation is considered.
* TSQL: Implement WITH CTE ([#443](#443)). With this commit, we have extended the TSQL functionality by adding support for Common Table Expressions (CTEs). CTEs are temporary result sets that can be defined within a single execution of a SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement, allowing for more complex and efficient queries. The implementation includes the ability to create a CTE with an optional name and a column list, followed by a SELECT statement that defines the CTE. CTEs can be self-referential and can be used to simplify complex queries, improving code readability and performance. This feature is particularly useful for cases where multiple queries rely on the same intermediate result set, as it enables reusing the results without having to repeat the query.
* TSQL: Implement functions with specialized syntax ([#430](#430)). This commit introduces new data type conversion functions and JSON manipulation capabilities to T-SQL, addressing issue [#430](#430). The newly implemented features include `NEXT VALUE FOR sequence`, `CAST(col TO sometype)`, `TRY_CAST(col TO sometype)`, `JSON_ARRAY`, and `JSON_OBJECT`. These functions support specialized syntax for handling data type conversions and JSON operations, including NULL value handling using `NULL ON NULL` and `ABSENT ON NULL` syntax. The `TSqlFunctionBuilder` class has been updated to accommodate these changes, and new test cases have been added to the `TSqlFunctionSpec` test class in Scala. This enhancement enables SQL-based querying and data manipulation with increased functionality for T-SQL parser and function evaluations.
* TSQL: Support DISTINCT in SELECT list and aggregate functions ([#400](#400)). This commit adds support for the `DISTINCT` keyword in T-SQL for use in the `SELECT` list and aggregate functions such as `COUNT`. When used in the `SELECT` list, `DISTINCT` ensures unique values of the specified expression are returned, and in aggregate functions like `COUNT`, it considers only distinct values of the specified argument. This change aligns with the SQL standard and enhances the functionality of the T-SQL parser, providing developers with greater flexibility and control when using `DISTINCT` in complex queries and aggregate functions. The default behavior in SQL, `ALL`, remains unchanged, and the parser has been updated to accommodate these improvements.
* TSQL: Update the SELECT statement to support XML workspaces ([#451](#451)). This release introduces updates to the TSQL Select statement grammar to correctly support XMLWORKSPACES in accordance with the latest specification. Although Databricks SQL does not currently support XMLWORKSPACES, this change is a syntax-only update to enable compatibility with other platforms that do support it. Newly added components include 'xmlNamespaces', 'xmlDeclaration', 'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection', 'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives', and 'xmlColumnDefinition'. These additions enable the creation, configuration, and usage of XML schemas and indexes, as well as the specification of XML namespaces and directives. A new test file for functional tests has been included to demonstrate the use of XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While this update does not affect the existing codebase's functionality, it does enable support for XMLWORKSPACES syntax in TSQL, facilitating easier integration with other platforms that support it. Please note that Databricks SQL does not currently support XML workspaces.
* Test merge queue ([#424](#424)). In this release, the Scalafmt configuration has been updated to version 3.8.0, with changes to the formatting of Scala code. The `danglingParentheses` preset option has been set to "false", removing dangling parentheses from the code. Additionally, the `configStyleArguments` option has been set to `false` under "optIn". These modifications to the configuration file are likely to affect the formatting and style of the Scala code in the project, ensuring consistent and organized code. This change aims to enhance the readability and maintainability of the codebase.
* Updated bug and feature yml to support reconcile ([#390](#390)). The open-source library has been updated to improve issue and feature categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new options for TranspileParserError, TranspileValidationError, and TranspileLateralColumnAliasError have been added to the `label: Category of Bug / Issue` field. Additionally, a new option for ReconcileError has been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE` directory has also been updated, introducing a required dropdown menu labeled "Category of feature request." This dropdown offers options for Transpile, Reconcile, and Other categories, ensuring accurate classification and organization of incoming feature requests. The modifications aim to enhance clarity for maintainers in reviewing and prioritizing issue resolutions and feature implementations related to reconciliation functionality.
* Updated the documentation with json config examples ([#486](#486)). In this release, the Remorph Reconciliation tool on Databricks has been updated to include JSON config examples for various config elements such as jdbc_reader_options, column_mapping, transformations, thresholds, and filters. These config elements enable users to define source and target data, join columns, JDBC reader options, select and drop columns, column mappings, transformations, thresholds, and filters. The update also provides examples in both Python and JSON formats, as well as instructions for installing the necessary Oracle JDBC library on a Databricks cluster. This update enhances the tool's functionality, making it easier for software engineers to reconcile source data with target data on Databricks.
* Updated uninstall flow ([#476](#476)). In this release, the `uninstall` functionality of the `databricks labs remorph` tool has been updated to align with the latest changes made to the `install` refactoring. The `uninstall` flow now utilizes a new `MockInstallation` class, which handles the uninstallation process and takes a dictionary of configuration files and their corresponding contents as input. The `uninstall` function has been modified to return `False` in two cases, either when there is no remorph directory or when the user decides not to uninstall. A `MockInstallation` object is created for the reconcile.yml file, and appropriate exceptions are raised in the aforementioned cases. The `uninstall` function now uses a `WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on the input arguments, to handle the uninstallation process. Additionally, the `MockPrompts` class is used to prompt the user for confirmation before uninstalling remorph.
* Updates to developer documentation and add grammar formatting to maven ([#490](#490)). The developer documentation has been updated to include grammar formatting instructions and support for dialects other than Snowflake. The Maven build cycle has been modified to format grammars before ANTLR processes them, enhancing readability and easing conflict resolution during maintenance. The TSqlLexer.g4 file has been updated with formatting instructions and added dialect recognition. These changes ensure that grammars are consistently formatted and easily resolvable during merges. Engineers adopting this project should reformat the grammar file before each commit, following the provided formatting instructions and reference link. Grammar modifications in the TSqlParser.g4 file, such as alterations in partitionFunction and freetextFunction rules, improve structure and readability.
* Upgrade sqlglot from 23.13.7 to 25.1.0 ([#473](#473)). In the latest release, the sqlglot package has been upgraded from version 23.13.7 to 25.1.0, offering potential new features, bug fixes, and performance improvements for SQL processing. The package dependency for numpy has been updated to version 1.26.4, which may introduce new functionality, improve existing features, or fix numpy integration issues. Furthermore, the addition of the types-pytz package as a dependency provides type hints for pytz, enhancing codebase type checking and static analysis capabilities. Specific modifications to the test_sql_transpiler.py file include updating the expected result in the test_parse_query function and removing unnecessary whitespaces in the transpiled_sql assertion in the test_procedure_conversion function. Although the find_root_tables function remains unchanged, the upgrade to sqlglot promises overall functionality enhancements, which software engineers can leverage in their projects.
* Use default_factory in recon_config.py ([#431](#431)). In this release, the default value handling for the `status` field in the `DataReconcileOutput` and `ReconcileTableOutput` classes has been improved to comply with Python 3.11. Previously, a mutable default value was used, causing a `ValueError` issue. This has been addressed by implementing the `default_factory` argument in the `field` function to ensure a new instance of `StatusOutput` is created for each class. Additionally, `MismatchOutput` and `ThresholdOutput` classes now also utilize `default_factory` for consistent and robust default value handling, enhancing the overall code quality and preventing potential issues arising from mutable default values.
* edit distance ([#501](#501)). In this release, we have implemented an `edit distance` feature for calculating the difference between two strings using the LEVENSHTEIN function. This has been achieved by adding a new method, `anonymous_sql`, to the `Generator` class in the `databricks.py` file. The method takes expressions of the `Anonymous` type as arguments and calls the `LEVENSHTEIN` function if the `this` attribute of the expression is equal to "EDITDISTANCE". Additionally, a new test file has been introduced for the anonymous user in the functional snowflake test suite to ensure the accurate calculation of string similarity using the EDITDISTANCE function. This change includes examples of using the EDITDISTANCE function with different parameters and compares it with the LEVENSHTEIN function available in Databricks. It addresses issue [#500](#500), which was related to testing the edit distance functionality.
@nfx nfx mentioned this pull request Jul 1, 2024
nfx added a commit that referenced this pull request Jul 1, 2024
* Added Oracle ojdbc8 dependent library during reconcile Installation
([#474](#474)). In this
release, the `deployment.py` file in the
`databricks/labs/remorph/helpers` directory has been updated to add the
`ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function,
enabling the reconciliation process to access the Oracle Data source and
pull data for reconciliation between Oracle and Databricks. The
`JDBCReaderMixin` class in the `jdbc_reader.py` file has also been
updated to include the Oracle ojdbc8 dependent library for
reconciliation during the `reconcile` process. This involves installing
the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent
library and updating the driver class to
`oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary
`driver_class` has been added, which maps the driver name to the
corresponding class name, allowing for dynamic driver class selection
during the `_get_jdbc_reader` method call. The
`test_read_data_with_options` unit test has been updated to test the
Oracle connector for reading data with specific options, including the
use of the correct driver class and specifying the database table for
data retrieval, improving the accuracy and reliability of the
reconciliation process.
* Added TSQL coverage tests in the generated report artifact
([#452](#452)). In this
release, we have added new TSQL coverage tests and Snowflake coverage
tests to the generated report artifact in the CI/CD pipeline. These
tests are executed using Maven with the updated command "mvn
--update-snapshots -B test -pl coverage --file pom.xml --fail-at-end"
and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml
--fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o
coverage-result.json" respectively, and the "continue-on-error: true"
option is added to allow the pipeline to proceed even if the tests fail.
Additionally, we have introduced a new constructor to the
`CommentBasedQueryExtractor` class, which accepts a `dialect` parameter
and allows for easier configuration of the start and end comments for
different SQL dialects. We have also updated the
CommentBasedQueryExtractor for Snowflake and added two TSQL coverage
tests to the generated report artifact to ensure that the
`QueryExtractor` is working correctly for TSQL queries. These changes
will help ensure thorough testing and identification of TSQL and
Snowflake queries during the CI/CD process.
* Added full support for analytical windowing functions
([#401](#401)). In this
release, full support for analytical windowing functions has been
implemented, addressing issue
[#401](#401). The
functions were previously specified in the parser grammar but have been
moved to the standard function lookup table for more consistent
handling. This enhancement allows for the use of analytical aggregate
functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN
GROUP` syntax and an `OVER` clause, enabling more complex queries and
data analysis. The `FixedArity` and `VariableArity` classes have been
updated with new methods for the supported functions, and appropriate
examples have been provided to demonstrate their usage in SQL.
* Added parsing for STRPOS in presto
([#462](#462)). A new
feature has been added to the remorph/snow package's presto module to
parse the STRPOS function in SQL code. This has been achieved by
importing the locate_to_strposition function from
sqlglot.dialects.dialect and incorporating it into the FUNCTIONS
dictionary in the Parser class. This change enables the parsing of the
STRPOS function, which returns the position of the first occurrence of a
substring in a string. The implementation has been tested with a SQL
file containing two queries for Presto SQL using STRPOS and Databricks
SQL using LOCATE, both aimed at finding the position of the letter `l`
in the string 'Hello world', starting the search from the second
position. This feature is particularly relevant for software engineers
working on data processing and analytics projects involving both Presto
and Databricks SQL, as it ensures compatibility and consistent behavior
between the two for string manipulation functions. The commit is part of
issue [#462](#462), and
the diff provided includes a new SQL file with test cases for the STRPOS
function in Presto and Locate function in Databricks SQL. The test cases
confirm if the `hello` string is present in the greeting_message column
of the greetings_table. This feature allows users to utilize the STRPOS
function in Presto to determine if a specific substring is present in a
string.
* Added validation for join columns for all query builders and limiting
rows for reports
([#413](#413)). In this
release, we've added validation for join columns in all query builders,
ensuring consistent and accurate data joins. A limit on the number of
rows displayed for reports has been implemented with a default of 50.
The `compare.py` and `execute.py` files have been updated to include
validation, and the `QueryBuilder` and `HashQueryBuilder` classes have
new methods for validating join columns. The `SamplingQueryBuilder`,
`ThresholdQueryBuilder`, and `recon_capture.py` files have similar
updates for validation and limiting rows for reports. The
`recon_config.py` file now has a new return type for the
`get_join_columns` method, and a new method
`test_no_join_columns_raise_exception()` has been added in the
`test_threshold_query.py` file. These changes aim to enhance data
consistency, accuracy, and efficiency for software engineers.
* Adds more coverage tests for functions to TSQL coverage
([#420](#420)). This
commit adds new coverage tests for various TSQL functions, focusing on
the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are
identical in Databricks SQL. The tests include cases with and without
the DISTINCT keyword to ensure consistent behavior between TSQL and
Databricks. For the GROUPING and GROUPING_ID functions, which have some
differences, tests and examples of TSQL and Databicks SQL code are
provided. The CHECKSUM_AGG function, not directly supported in
Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The
CUME_DIST function, identical in both systems, is also tested.
Additionally, a new test file for the STDEV function and updated tests
for the VAR function are introduced, enhancing the reliability and
robustness of TSQL conversions in the project.
* Catalog, Schema Permission checks
([#492](#492)). This
release introduces enhancements to the Catalog and Schema functionality,
with the addition of permission checks that raise explicit `Permission
Denied` exceptions. The logger messages have been updated for clarity
and a new variable, README_RECON_REPO, has been created to reference the
readme file for the recon_config repository. The ReconcileUtils class
has been modified to handle scenarios where the recon_config file is not
found or corrupted during loading, providing clear error messages and
guidance for users. The unit tests for the install feature have been
updated with permission checks for Catalog and Schema operations,
ensuring robust handling of permission denied errors. These changes
improve the system's error handling and provide clearer guidance for
users encountering permission issues.
* Changing the secret name acc to install script
([#432](#432)). In this
release, the `recon` function in the `execute.py` file of the
`databricks.labs.remorph.reconcile` package has been updated to
dynamically generate the secret name instead of hardcoding it as
"secret_scope". This change utilizes the new `get_key_form_dialect`
function to create a secret name specific to the source dialect being
used in the reconciliation process. The `get_dialect` function, along
with `DatabaseConfig`, `TableRecon`, and the newly added
`get_key_form_dialect`, have been imported from
`databricks.labs.remorph.config`. This enhancement improves the security
and flexibility of the reconciliation process by generating dynamic and
dialect-specific secret names.
* Feature/recon documentation
([#395](#395)). This
commit introduces a new reconciliation process, enhancing data
consistency between sources, co-authored by Ganesh Dogiparthi,
ganeshdogiparthi-db, and SundarShankar89. The README.md file provides
detailed documentation for the reconciliation process. A new binary
file, docs/transpile-install.gif, offers installation instructions or
visual aids, while a mermaid flowchart in
`report_types_visualisation.md` illustrates report generation for data,
rows, schema, and overall reconciliation. No existing functionality was
modified, ensuring the addition of valuable features for software
engineers adopting this project.
* Fixing issues in sample query builder to handle Null's and zero
([#457](#457)). This
commit introduces improvements to the sample query builder's handling of
Nulls and zeroes, addressing bug
[#450](#450). The
changes include updated SQL queries in the test threshold query file
with COALESCE and TRIM functions to replace Null values with a specified
string, ensuring consistent comparison of datasets. The query store in
test_execute.py has also been enhanced to handle NULL and zero values
using COALESCE, improving overall robustness and consistency.
Additionally, new methods such as build_join_clause, trim, and coalesce
have been added to enhance null handling in the query builder. The
commit also introduces the MockDataSource class, a likely test
implementation of a data source, and updates the log_and_throw_exception
function for clearer error messaging.
* Implement Lakeview Dashboard Publisher
([#405](#405)). In this
release, we've introduced the `DashboardPublisher` class in the
`dashboard_publisher.py` module to streamline the process of creating
and publishing dashboards in Databricks Workspace. This class simplifies
dashboard creation by accepting an instance of `WorkspaceClient` and
`Installation` and providing methods for creating and publishing
dashboards with optional parameter substitution. Additionally, we've
added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json',
which contains a dashboard definition for a data reconciliation feature.
This dashboard includes various widgets for filtering and displaying
reconciliation results. We've also added a test file for the Lakeview
Dashboard Publisher feature, which includes tests to ensure that the
`DashboardPublisher` can create dashboards using specified file paths
and parameters. These new features and enhancements are aimed at
improving the user experience and streamlining the process of creating
and publishing dashboards in Databricks Workspace.
* Integrate recon metadata reconcile cli
([#444](#444)). A new
CLI command, `databricks labs remorph reconcile`, has been added to
initiate the Data Reconciliation process, loading `reconcile.yml` and
`recon_config.json` configuration files from the Databricks Workspace.
If these files are missing, the user is prompted to reinstall the
`reconcile` module and exit the command. The command then triggers the
`Remorph_Reconciliation_Job` based on the Job ID stored in the
`reconcile.yml` file. This simplifies the reconcile execution process,
requiring users to first configure the `reconcile` module and generate
the `recon_config_<SOURCE>.json` file using `databricks labs remorph
install` and `databricks labs remorph generate-recon-config` commands.
The new CLI command has been manually tested and includes unit tests.
Integration tests and verification on the staging environment are
pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi,
and SundarShankar89.
* Introduce coverage tests
([#382](#382)). This
commit introduces coverage tests and updates the GitHub Actions workflow
to use Java 11 with Corretto distribution, improving testing and
coverage analysis for the project. Coverage tests are added as part of
the remorph project with the introduction of a new module for coverage
and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is
modified to change the parent project version to 0.2.0-SNAPSHOT,
ensuring accurate assessment and maintenance of code coverage during
development. In addition, a new Main object within the
com.databricks.labs.remorph.coverage package is implemented for running
coverage tests using command-line arguments, along with the addition of
a new file QueryRunner.scala and case classes for ReportEntryHeader,
ReportEntryReport, and ReportEntry for capturing and reporting on the
status and results of parsing and transpilation processes. The `Cache
Maven packages` step is removed and replaced with two new steps: `Run
Unit Tests with Maven` and "Run Coverage Tests with Maven." The former
executes unit tests and generates a test coverage report, while the
latter downloads remorph-core jars as artifacts, executes coverage tests
with Maven, and uploads coverage tests results as json artifacts. The
`coverage-tests` job runs after the `test-core` job and uses the same
environment, checking out the code with full history, setting up Java 11
with Corretto distribution, downloading remorph-core-jars artifacts, and
running coverage tests with Maven, even if there are errors. The JUnit
report is also published, and the coverage tests results are uploaded as
json artifacts, providing better test coverage and more reliable code
for software engineers adopting the project.
* Presto approx percentile func fix
([#411](#411)). The
remorph library has been updated to support the Presto database system,
with a new module added to the config.py file to enable robust and
maintainable interaction. An `APPROX_PERCENTILE` function has been
implemented in the `presto.py` file of the `sqlglot.dialects.presto`
package, allowing for approximate percentile calculations in Presto and
Databricks SQL. A test file has been included for both SQL dialects,
with queries calculating the approximate median of the height column in
the people table. The new functionality enhances the compatibility and
versatility of the remorph library in working with Presto databases and
improves overall project functionality. Additionally, a new test file
for Presto in the snowflakedriver project has been introduced to test
expected exceptions, further ensuring robustness and reliability.
* Raise exception if reconciliation fails for any table
([#412](#412)). In this
release, we have implemented significant changes to improve exception
handling and raise meaningful exceptions when reconciliation fails for
any table in our open-source library. A new exception class,
`ReconciliationException`, has been added as a child of the `Exception`
class, which takes two optional parameters in its constructor, `message`
and `reconcile_output`. The `ReconcileOutput` property has been created
for accessing the reconcile output object. The `InvalidInputException`
class now inherits from `ValueError`, making the code more explicit with
the type of errors being handled. A new method,
`_verify_successful_reconciliation`, has been introduced to check the
reconciliation output status and raise a `ReconciliationException` if
any table fails reconciliation. The `test_execute.py` file has been
updated to raise a `ReconciliationException` if reconciliation for a
specific report type fails, and new tests have been added to the test
suite to ensure the correct behavior of the `reconcile` function with
and without raising exceptions.
* Removed USE catalog/schema statement as lsql has added the feature
([#465](#465)). In this
release, the usage of `USE` statements for selecting a catalog and
schema has been removed in the `get_sql_backend` function, thanks to the
new feature provided by the lsql library. This enhancement improves code
readability, maintainability, and enables better integration with the
SQL backend. The commit also includes changes to the installation
process for reconciliation metadata tables, providing more clarity and
simplicity in the code. Additionally, several test functions have been
added or modified to ensure the proper functioning of the
`get_sql_backend` function in various scenarios, including cases where a
warehouse ID is not provided or when executing SQL statements in a
notebook environment. An error simulation test has also been added for
handling `DatabricksError` exceptions when executing SQL statements
using the `DatabricksConnectBackend` class.
* Sampling with clause query to have `from dual` in from clause for
oracle source
([#464](#464)). In this
release, we've added the `get_key_from_dialect` function, replacing the
previous `get_key_form_dialect` function, to retrieve the key associated
with a given dialect object, serving as a unique identifier for the
dialect. This improvement enhances the flexibility and readability of
the codebase, making it easier to locate and manipulate dialect objects.
Additionally, we've modified the 'sampling_query.py' file to include
`from dual` in the `from` clause for Oracle sources in a sampling query
with a clause, enabling sampling from Oracle databases. The
`_insert_into_main_table` method in the `recon_capture.py` file of the
`databricks.labs.remorph.reconcile` module has been updated to ensure
accurate key retrieval for the specified dialect, thereby improving the
reconciliation process. These changes resolve issues
[#458](#458) and
[#464](#464), enhancing
the functionality of the sampling query builder and providing better
support for various databases.
* Support function translation to Databricks SQL in TSql and Snowflake
([#414](#414)). This
commit introduces a dialect-aware FunctionBuilder system and a
ConversionStrategy system to enable seamless translation of SQL
functions between TSQL, Snowflake, and Databricks SQL IR. The new
FunctionBuilder system can handle both simple name translations and more
complex conversions when there is no direct equivalent. For instance,
TSQL's ISNULL function translates to IFNULL in Databricks SQL, while
Snowflake's ISNULL remains unchanged. The commit also includes updates
to the TSqlExpressionBuilder and new methods for building and visiting
various contexts, enhancing compatibility and expanding the range of
supported SQL dialects. Additionally, new tests have been added in the
FunctionBuilderSpec to ensure the correct arity and function type for
various SQL functions.
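The Scala `FunctionBuilder`/`ConversionStrategy` machinery is dialect
specific; a conceptual Python sketch of the simple name-translation
case looks like this (illustrative only, not the actual implementation):

```python
# Conceptual sketch of dialect-aware function-name translation; the real
# implementation is the Scala FunctionBuilder / ConversionStrategy.
RENAMES = {
    ("tsql", "ISNULL"): "IFNULL",  # TSQL ISNULL(x, y) -> Databricks IFNULL(x, y)
    # Snowflake's ISNULL needs no entry: it already matches Databricks SQL.
}


def translate_function(dialect: str, name: str) -> str:
    """Return the Databricks SQL name for a source-dialect function."""
    return RENAMES.get((dialect, name.upper()), name)


assert translate_function("tsql", "ISNULL") == "IFNULL"
assert translate_function("snowflake", "ISNULL") == "ISNULL"
```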
* TSQL: Create coverage tests for TSQL -> Databricks functions
([#415](#415)). This
commit introduces coverage tests for T-SQL functions and their
equivalent Databricks SQL implementations, focusing on the DATEADD
function's `yy` keyword. The DATEADD function is translated to the
ADD_MONTHS function in Databricks SQL, with the number of years
multiplied by 12 to produce months. This ensures functional equivalence
between T-SQL and
Databricks SQL for date addition involving years. The tests are written
as SQL scripts and are located in the
`tests/resources/functional/tsql/functions` directory, covering various
scenarios and possible engine differences between T-SQL and Databricks
SQL. The conversion process is documented, and future automation of this
documentation is considered.
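A hedged helper showing the year-to-month arithmetic these tests
exercise (the real rewrite happens inside the Scala transpiler, not in
Python):

```python
# Illustrative only: mirrors the arithmetic checked by the coverage tests.
def dateadd_years_to_databricks(column: str, years: int) -> str:
    """Rewrite TSQL DATEADD(yy, years, column) as Databricks ADD_MONTHS."""
    return f"ADD_MONTHS({column}, {years * 12})"


# DATEADD(yy, 2, hire_date) becomes:
print(dateadd_years_to_databricks("hire_date", 2))  # ADD_MONTHS(hire_date, 24)
```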
* TSQL: Implement WITH CTE
([#443](#443)). With
this commit, we have extended the TSQL functionality by adding support
for Common Table Expressions (CTEs). CTEs are temporary result sets that
can be defined within a single execution of a SELECT, INSERT, UPDATE,
DELETE, or CREATE VIEW statement, allowing for more complex and
efficient queries. The implementation includes the ability to create a
CTE with an optional name and a column list, followed by a SELECT
statement that defines the CTE. CTEs can be self-referential and can be
used to simplify complex queries, improving code readability and
performance. This feature is particularly useful for cases where
multiple queries rely on the same intermediate result set, as it enables
reusing the results without having to repeat the query.
* TSQL: Implement functions with specialized syntax
([#430](#430)). This
commit introduces new data type conversion functions and JSON
manipulation capabilities to T-SQL, addressing issue
[#430](#430). The newly
implemented features include `NEXT VALUE FOR sequence`, `CAST(col AS
sometype)`, `TRY_CAST(col AS sometype)`, `JSON_ARRAY`, and
`JSON_OBJECT`. These functions support specialized syntax for handling
data type conversions and JSON operations, including NULL value handling
using `NULL ON NULL` and `ABSENT ON NULL` syntax. The
`TSqlFunctionBuilder` class has been updated to accommodate these
changes, and new test cases have been added to the `TSqlFunctionSpec`
test class in Scala. This enhancement enables SQL-based querying and
data manipulation with increased functionality for T-SQL parser and
function evaluations.
* TSQL: Support DISTINCT in SELECT list and aggregate functions
([#400](#400)). This
commit adds support for the `DISTINCT` keyword in T-SQL for use in the
`SELECT` list and aggregate functions such as `COUNT`. When used in the
`SELECT` list, `DISTINCT` ensures unique values of the specified
expression are returned, and in aggregate functions like `COUNT`, it
considers only distinct values of the specified argument. This change
aligns with the SQL standard and enhances the functionality of the T-SQL
parser, providing developers with greater flexibility and control when
using `DISTINCT` in complex queries and aggregate functions. The default
behavior in SQL, `ALL`, remains unchanged, and the parser has been
updated to accommodate these improvements.
* TSQL: Update the SELECT statement to support XML workspaces
([#451](#451)). This
release introduces updates to the TSQL Select statement grammar to
correctly support XMLWORKSPACES in accordance with the latest
specification. Although Databricks SQL does not currently support
XMLWORKSPACES, this change is a syntax-only update to enable
compatibility with other platforms that do support it. Newly added
components include 'xmlNamespaces', 'xmlDeclaration',
'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection',
'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives',
and 'xmlColumnDefinition'. These additions enable the creation,
configuration, and usage of XML schemas and indexes, as well as the
specification of XML namespaces and directives. A new test file for
functional tests has been included to demonstrate the use of
XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While
this update does not affect the existing codebase's functionality, it
does enable support for XMLWORKSPACES syntax in TSQL, facilitating
easier integration with other platforms that support it. Please note
that Databricks SQL does not currently support XML workspaces.
* Test merge queue
([#424](#424)). In this
release, the Scalafmt configuration has been updated to version 3.8.0,
with changes to the formatting of Scala code. The `danglingParentheses`
preset option has been set to `false`, removing dangling parentheses
from the code. Additionally, the `configStyleArguments` option has been
set to `false` under `optIn`. These modifications to the configuration
file are likely to affect the formatting and style of the Scala code in
the project, ensuring consistent and organized code. This change aims to
enhance the readability and maintainability of the codebase.
* Updated bug and feature yml to support reconcile
([#390](#390)). The
open-source library has been updated to improve issue and feature
categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new
options for TranspileParserError, TranspileValidationError, and
TranspileLateralColumnAliasError have been added to the `label: Category
of Bug / Issue` field. Additionally, a new option for ReconcileError has
been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE`
directory has also been updated, introducing a required dropdown menu
labeled "Category of feature request." This dropdown offers options for
Transpile, Reconcile, and Other categories, ensuring accurate
classification and organization of incoming feature requests. The
modifications aim to enhance clarity for maintainers in reviewing and
prioritizing issue resolutions and feature implementations related to
reconciliation functionality.
* Updated the documentation with json config examples
([#486](#486)). In this
release, the Remorph Reconciliation tool on Databricks has been updated
to include JSON config examples for various config elements such as
jdbc_reader_options, column_mapping, transformations, thresholds, and
filters. These config elements enable users to define source and target
data, join columns, JDBC reader options, select and drop columns, column
mappings, transformations, thresholds, and filters. The update also
provides examples in both Python and JSON formats, as well as
instructions for installing the necessary Oracle JDBC library on a
Databricks cluster. This update enhances the tool's functionality,
making it easier for software engineers to reconcile source data with
target data on Databricks.
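For orientation, a table entry combining these elements might look
roughly like the Python sketch below; the element names are taken from
the list above, but the authoritative schema is the one in the updated
documentation:

```python
# Illustrative config only; consult the remorph docs for the exact schema.
table_conf = {
    "source_name": "orders",
    "target_name": "orders",
    "join_columns": ["order_id"],
    "jdbc_reader_options": {"number_partitions": 4, "partition_column": "order_id"},
    "column_mapping": [{"source_name": "ord_dt", "target_name": "order_date"}],
    "transformations": [
        {"column_name": "amount", "source": "TRIM(amount)", "target": "amount"}
    ],
    "thresholds": [
        {"column_name": "amount", "lower_bound": "-5%", "upper_bound": "5%", "type": "float"}
    ],
    "filters": {"source": "region = 'EU'", "target": "region = 'EU'"},
}
```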
* Updated uninstall flow
([#476](#476)). In this
release, the `uninstall` functionality of the `databricks labs remorph`
tool has been updated to align with the latest changes made to the
`install` refactoring. The `uninstall` flow now utilizes a new
`MockInstallation` class, which handles the uninstallation process and
takes a dictionary of configuration files and their corresponding
contents as input. The `uninstall` function has been modified to return
`False` in two cases, either when there is no remorph directory or when
the user decides not to uninstall. A `MockInstallation` object is
created for the reconcile.yml file, and appropriate exceptions are
raised in the aforementioned cases. The `uninstall` function now uses a
`WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on
the input arguments, to handle the uninstallation process. Additionally,
the `MockPrompts` class is used to prompt the user for confirmation
before uninstalling remorph.
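A hedged sketch of how such a test can be wired up; `MockInstallation`
and `MockPrompts` come from `databricks-labs-blueprint`, while the
uninstall entry point shown is assumed rather than copied from the
remorph source:

```python
# Sketch only: the uninstall call below is hypothetical.
from databricks.labs.blueprint.installation import MockInstallation
from databricks.labs.blueprint.tui import MockPrompts

installation = MockInstallation({"reconcile.yml": {"data_source": "snowflake", "job_id": "123"}})
prompts = MockPrompts({r"Do you want to uninstall.*": "yes"})

# Hypothetical entry point; the real flow builds a WorkspaceUnInstaller
# from these objects and returns False if there is nothing to remove or
# the user declines.
# assert uninstall(installation=installation, prompts=prompts) is False
```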
* Updates to developer documentation and add grammar formatting to maven
([#490](#490)). The
developer documentation has been updated to include grammar formatting
instructions and support for dialects other than Snowflake. The Maven
build cycle has been modified to format grammars before ANTLR processes
them, enhancing readability and easing conflict resolution during
maintenance. The TSqlLexer.g4 file has been updated with formatting
instructions and added dialect recognition. These changes ensure that
grammars are consistently formatted and easily resolvable during merges.
Engineers adopting this project should reformat the grammar file before
each commit, following the provided formatting instructions and
reference link. Grammar modifications in the TSqlParser.g4 file, such as
alterations in partitionFunction and freetextFunction rules, improve
structure and readability.
* Upgrade sqlglot from 23.13.7 to 25.1.0
([#473](#473)). In the
latest release, the sqlglot package has been upgraded from version
23.13.7 to 25.1.0, offering potential new features, bug fixes, and
performance improvements for SQL processing. The package dependency for
numpy has been updated to version 1.26.4, which may introduce new
functionality, improve existing features, or fix numpy integration
issues. Furthermore, the addition of the types-pytz package as a
dependency provides type hints for pytz, enhancing codebase type
checking and static analysis capabilities. Specific modifications to the
test_sql_transpiler.py file include updating the expected result in the
test_parse_query function and removing unnecessary whitespaces in the
transpiled_sql assertion in the test_procedure_conversion function.
Although the find_root_tables function remains unchanged, the upgrade to
sqlglot promises overall functionality enhancements, which software
engineers can leverage in their projects.
* Use default_factory in recon_config.py
([#431](#431)). In this
release, the default value handling for the `status` field in the
`DataReconcileOutput` and `ReconcileTableOutput` classes has been
improved to comply with Python 3.11. Previously, a mutable default value
was used, causing a `ValueError` issue. This has been addressed by
implementing the `default_factory` argument in the `field` function to
ensure a new instance of `StatusOutput` is created for each class.
Additionally, `MismatchOutput` and `ThresholdOutput` classes now also
utilize `default_factory` for consistent and robust default value
handling, enhancing the overall code quality and preventing potential
issues arising from mutable default values.
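A minimal sketch of the pattern (class and field names mirror the
release note rather than the exact remorph source):

```python
# default_factory gives every instance its own StatusOutput; a shared
# mutable default would raise ValueError under Python 3.11.
from dataclasses import dataclass, field


@dataclass
class StatusOutput:
    row: bool | None = None
    column: bool | None = None
    schema: bool | None = None


@dataclass
class ReconcileTableOutput:
    status: StatusOutput = field(default_factory=StatusOutput)
```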
* edit distance
([#501](#501)). In this
release, we have implemented an `edit distance` feature for calculating
the difference between two strings using the LEVENSHTEIN function. This
has been achieved by adding a new method, `anonymous_sql`, to the
`Generator` class in the `databricks.py` file. The method takes
expressions of the `Anonymous` type as arguments and calls the
`LEVENSHTEIN` function if the `this` attribute of the expression is
equal to "EDITDISTANCE". Additionally, a new test file covering
anonymous function handling has been introduced in the functional
Snowflake test suite to ensure the accurate calculation of string
similarity using the
EDITDISTANCE function. This change includes examples of using the
EDITDISTANCE function with different parameters and compares it with the
LEVENSHTEIN function available in Databricks. It addresses issue
[#500](#500), which was
related to testing the edit distance functionality.
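A hedged sketch of the idea behind the change, expressed as an override
of sqlglot's `Generator.anonymous_sql` (the real change lives in
remorph's Databricks generator, and sqlglot internals can shift between
versions):

```python
# Sketch only: maps Snowflake's EDITDISTANCE onto Databricks LEVENSHTEIN.
from sqlglot import exp
from sqlglot.dialects.databricks import Databricks


class PatchedGenerator(Databricks.Generator):
    def anonymous_sql(self, expression: exp.Anonymous) -> str:
        # Anonymous covers functions sqlglot has no dedicated expression for.
        if expression.name.upper() == "EDITDISTANCE":
            return self.func("LEVENSHTEIN", *expression.expressions)
        return super().anonymous_sql(expression)
```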
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request Jan 2, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request Jan 3, 2025
* Added Oracle ojdbc8 dependent library during reconcile Installation
([databrickslabs#474](databrickslabs#474)). In this
release, the `deployment.py` file in the
`databricks/labs/remorph/helpers` directory has been updated to add the
`ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function,
enabling the reconciliation process to access the Oracle Data source and
pull data for reconciliation between Oracle and Databricks. The
`JDBCReaderMixin` class in the `jdbc_reader.py` file has also been
updated to include the Oracle ojdbc8 dependent library for
reconciliation during the `reconcile` process. This involves installing
the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent
library and updating the driver class to
`oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary
`driver_class` has been added, which maps the driver name to the
corresponding class name, allowing for dynamic driver class selection
during the `_get_jdbc_reader` method call. The
`test_read_data_with_options` unit test has been updated to test the
Oracle connector for reading data with specific options, including the
use of the correct driver class and specifying the database table for
data retrieval, improving the accuracy and reliability of the
reconciliation process.
* Added TSQL coverage tests in the generated report artifact
([databrickslabs#452](databrickslabs#452)). In this
release, we have added new TSQL coverage tests and Snowflake coverage
tests to the generated report artifact in the CI/CD pipeline. These
tests are executed using Maven with the updated command "mvn
--update-snapshots -B test -pl coverage --file pom.xml --fail-at-end"
and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml
--fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o
coverage-result.json" respectively, and the "continue-on-error: true"
option is added to allow the pipeline to proceed even if the tests fail.
Additionally, we have introduced a new constructor to the
`CommentBasedQueryExtractor` class, which accepts a `dialect` parameter
and allows for easier configuration of the start and end comments for
different SQL dialects. We have also updated the
CommentBasedQueryExtractor for Snowflake and added two TSQL coverage
tests to the generated report artifact to ensure that the
`QueryExtractor` is working correctly for TSQL queries. These changes
will help ensure thorough testing and identification of TSQL and
Snowflake queries during the CI/CD process.
* Added full support for analytical windowing functions
([databrickslabs#401](databrickslabs#401)). In this
release, full support for analytical windowing functions has been
implemented, addressing issue
[databrickslabs#401](databrickslabs#401). The
functions were previously specified in the parser grammar but have been
moved to the standard function lookup table for more consistent
handling. This enhancement allows for the use of analytical aggregate
functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN
GROUP` syntax and an `OVER` clause, enabling more complex queries and
data analysis. The `FixedArity` and `VariableArity` classes have been
updated with new methods for the supported functions, and appropriate
examples have been provided to demonstrate their usage in SQL.
* Added parsing for STRPOS in presto
([databrickslabs#462](databrickslabs#462)). A new
feature has been added to the remorph/snow package's presto module to
parse the STRPOS function in SQL code. This has been achieved by
importing the locate_to_strposition function from
sqlglot.dialects.dialect and incorporating it into the FUNCTIONS
dictionary in the Parser class. This change enables the parsing of the
STRPOS function, which returns the position of the first occurrence of a
substring in a string. The implementation has been tested with a SQL
file containing two queries for Presto SQL using STRPOS and Databricks
SQL using LOCATE, both aimed at finding the position of the letter `l`
in the string 'Hello world', starting the search from the second
position. This feature is particularly relevant for software engineers
working on data processing and analytics projects involving both Presto
and Databricks SQL, as it ensures compatibility and consistent behavior
between the two for string manipulation functions. The commit is part of
issue [databrickslabs#462](databrickslabs#462), and
the diff provided includes a new SQL file with test cases for the STRPOS
function in Presto and Locate function in Databricks SQL. The test cases
confirm if the `hello` string is present in the greeting_message column
of the greetings_table. This feature allows users to utilize the STRPOS
function in Presto to determine if a specific substring is present in a
string.
* Added validation for join columns for all query builders and limiting
rows for reports
([databrickslabs#413](databrickslabs#413)). In this
release, we've added validation for join columns in all query builders,
ensuring consistent and accurate data joins. A limit on the number of
rows displayed for reports has been implemented with a default of 50.
The `compare.py` and `execute.py` files have been updated to include
validation, and the `QueryBuilder` and `HashQueryBuilder` classes have
new methods for validating join columns. The `SamplingQueryBuilder`,
`ThresholdQueryBuilder`, and `recon_capture.py` files have similar
updates for validation and limiting rows for reports. The
`recon_config.py` file now has a new return type for the
`get_join_columns` method, and a new method
`test_no_join_columns_raise_exception()` has been added in the
`test_threshold_query.py` file. These changes aim to enhance data
consistency, accuracy, and efficiency for software engineers.
* Adds more coverage tests for functions to TSQL coverage
([databrickslabs#420](databrickslabs#420)). This
commit adds new coverage tests for various TSQL functions, focusing on
the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are
identical in Databricks SQL. The tests include cases with and without
the DISTINCT keyword to ensure consistent behavior between TSQL and
Databricks. For the GROUPING and GROUPING_ID functions, which have some
differences, tests and examples of TSQL and Databicks SQL code are
provided. The CHECKSUM_AGG function, not directly supported in
Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The
CUME_DIST function, identical in both systems, is also tested.
Additionally, a new test file for the STDEV function and updated tests
for the VAR function are introduced, enhancing the reliability and
robustness of TSQL conversions in the project.
* Catalog, Schema Permission checks
([databrickslabs#492](databrickslabs#492)). This
release introduces enhancements to the Catalog and Schema functionality,
with the addition of permission checks that raise explicit `Permission
Denied` exceptions. The logger messages have been updated for clarity
and a new variable, README_RECON_REPO, has been created to reference the
readme file for the recon_config repository. The ReconcileUtils class
has been modified to handle scenarios where the recon_config file is not
found or corrupted during loading, providing clear error messages and
guidance for users. The unit tests for the install feature have been
updated with permission checks for Catalog and Schema operations,
ensuring robust handling of permission denied errors. These changes
improve the system's error handling and provide clearer guidance for
users encountering permission issues.
* Changing the secret name acc to install script
([databrickslabs#432](databrickslabs#432)). In this
release, the `recon` function in the `execute.py` file of the
`databricks.labs.remorph.reconcile` package has been updated to
dynamically generate the secret name instead of hardcoding it as
"secret_scope". This change utilizes the new `get_key_form_dialect`
function to create a secret name specific to the source dialect being
used in the reconciliation process. The `get_dialect` function, along
with `DatabaseConfig`, `TableRecon`, and the newly added
`get_key_form_dialect`, have been imported from
`databricks.labs.remorph.config`. This enhancement improves the security
and flexibility of the reconciliation process by generating dynamic and
dialect-specific secret names.
* Feature/recon documentation
([databrickslabs#395](databrickslabs#395)). This
commit introduces a new reconciliation process, enhancing data
consistency between sources, co-authored by Ganesh Dogiparthi,
ganeshdogiparthi-db, and SundarShankar89. The README.md file provides
detailed documentation for the reconciliation process. A new binary
file, docs/transpile-install.gif, offers installation instructions or
visual aids, while a mermaid flowchart in
`report_types_visualisation.md` illustrates report generation for data,
rows, schema, and overall reconciliation. No existing functionality was
modified, ensuring the addition of valuable features for software
engineers adopting this project.
* Fixing issues in sample query builder to handle Null's and zero
([databrickslabs#457](databrickslabs#457)). This
commit introduces improvements to the sample query builder's handling of
Nulls and zeroes, addressing bug
[databrickslabs#450](databrickslabs#450). The
changes include updated SQL queries in the test threshold query file
with COALESCE and TRIM functions to replace Null values with a specified
string, ensuring consistent comparison of datasets. The query store in
test_execute.py has also been enhanced to handle NULL and zero values
using COALESCE, improving overall robustness and consistency.
Additionally, new methods such as build_join_clause, trim, and coalesce
have been added to enhance null handling in the query builder. The
commit also introduces the MockDataSource class, a likely test
implementation of a data source, and updates the log_and_throw_exception
function for clearer error messaging.
* Implement Lakeview Dashboard Publisher
([databrickslabs#405](databrickslabs#405)). In this
release, we've introduced the `DashboardPublisher` class in the
`dashboard_publisher.py` module to streamline the process of creating
and publishing dashboards in Databricks Workspace. This class simplifies
dashboard creation by accepting an instance of `WorkspaceClient` and
`Installation` and providing methods for creating and publishing
dashboards with optional parameter substitution. Additionally, we've
added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json',
which contains a dashboard definition for a data reconciliation feature.
This dashboard includes various widgets for filtering and displaying
reconciliation results. We've also added a test file for the Lakeview
Dashboard Publisher feature, which includes tests to ensure that the
`DashboardPublisher` can create dashboards using specified file paths
and parameters. These new features and enhancements are aimed at
improving the user experience and streamlining the process of creating
and publishing dashboards in Databricks Workspace.
* Integrate recon metadata reconcile cli
([databrickslabs#444](databrickslabs#444)). A new
CLI command, `databricks labs remorph reconcile`, has been added to
initiate the Data Reconciliation process, loading `reconcile.yml` and
`recon_config.json` configuration files from the Databricks Workspace.
If these files are missing, the user is prompted to reinstall the
`reconcile` module and exit the command. The command then triggers the
`Remorph_Reconciliation_Job` based on the Job ID stored in the
`reconcile.yml` file. This simplifies the reconcile execution process,
requiring users to first configure the `reconcile` module and generate
the `recon_config_<SOURCE>.json` file using `databricks labs remorph
install` and `databricks labs remorph generate-recon-config` commands.
The new CLI command has been manually tested and includes unit tests.
Integration tests and verification on the staging environment are
pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi,
and SundarShankar89.
* Introduce coverage tests
([databrickslabs#382](databrickslabs#382)). This
commit introduces coverage tests and updates the GitHub Actions workflow
to use Java 11 with Corretto distribution, improving testing and
coverage analysis for the project. Coverage tests are added as part of
the remorph project with the introduction of a new module for coverage
and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is
modified to change the parent project version to 0.2.0-SNAPSHOT,
ensuring accurate assessment and maintenance of code coverage during
development. In addition, a new Main object within the
com.databricks.labs.remorph.coverage package is implemented for running
coverage tests using command-line arguments, along with the addition of
a new file QueryRunner.scala and case classes for ReportEntryHeader,
ReportEntryReport, and ReportEntry for capturing and reporting on the
status and results of parsing and transpilation processes. The `Cache
Maven packages` step is removed and replaced with two new steps: `Run
Unit Tests with Maven` and "Run Coverage Tests with Maven." The former
executes unit tests and generates a test coverage report, while the
latter downloads remorph-core jars as artifacts, executes coverage tests
with Maven, and uploads coverage tests results as json artifacts. The
`coverage-tests` job runs after the `test-core` job and uses the same
environment, checking out the code with full history, setting up Java 11
with Corretto distribution, downloading remorph-core-jars artifacts, and
running coverage tests with Maven, even if there are errors. The JUnit
report is also published, and the coverage test results are uploaded as
JSON artifacts, providing better test coverage and more reliable code
for software engineers adopting the project.
* Presto approx percentile func fix
([databrickslabs#411](databrickslabs#411)). The
remorph library has been updated to support Presto as a source, with
Presto support added to the config.py file. An `APPROX_PERCENTILE`
function has been implemented in the `presto.py` file of the
`sqlglot.dialects.presto` package, enabling approximate percentile
calculations when translating between Presto and Databricks SQL. A test
file covers both SQL dialects, with queries that compute the approximate
median of the height column in the people table. Additionally, a new
test file for Presto in the snowflakedriver project exercises expected
exceptions, further ensuring robustness. A minimal sqlglot-based sketch
follows this entry.
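This sketch is illustrative only, not the remorph implementation: it uses sqlglot's public `transpile` API to preview how a Presto `APPROX_PERCENTILE` call is rendered for the Databricks dialect, mirroring the median-of-height example mentioned above.

```python
# Illustrative only: preview the Presto -> Databricks rendering of
# APPROX_PERCENTILE using sqlglot's public transpile API.
import sqlglot

presto_sql = "SELECT APPROX_PERCENTILE(height, 0.5) FROM people"
databricks_sql = sqlglot.transpile(presto_sql, read="presto", write="databricks")[0]
print(databricks_sql)
```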
* Raise exception if reconciliation fails for any table
([databrickslabs#412](databrickslabs#412)). In this
release, we have implemented significant changes to improve exception
handling and raise meaningful exceptions when reconciliation fails for
any table in our open-source library. A new exception class,
`ReconciliationException`, has been added as a child of the `Exception`
class, which takes two optional parameters in its constructor, `message`
and `reconcile_output`. A property has been added for accessing the
attached `ReconcileOutput` object. The `InvalidInputException`
class now inherits from `ValueError`, making the code more explicit with
the type of errors being handled. A new method,
`_verify_successful_reconciliation`, has been introduced to check the
reconciliation output status and raise a `ReconciliationException` if
any table fails reconciliation. The `test_execute.py` file has been
updated to raise a `ReconciliationException` if reconciliation for a
specific report type fails, and new tests have been added to the test
suite to ensure the correct behavior of the `reconcile` function with
and without raising exceptions.
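A minimal sketch of the exception pattern described above; the class and constructor parameter names follow the entry, while the bodies are illustrative rather than the actual remorph code.

```python
# Hedged sketch: exception types mirroring the entry above (bodies illustrative).
class ReconciliationException(Exception):
    """Raised when reconciliation fails for one or more tables."""

    def __init__(self, message: str = "", reconcile_output=None):
        super().__init__(message)
        self._reconcile_output = reconcile_output

    @property
    def reconcile_output(self):
        # Access the ReconcileOutput object attached to the failure.
        return self._reconcile_output


class InvalidInputException(ValueError):
    """Raised for invalid reconciliation input."""
```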
* Removed USE catalog/schema statement as lsql has added the feature
([databrickslabs#465](databrickslabs#465)). In this
release, the usage of `USE` statements for selecting a catalog and
schema has been removed in the `get_sql_backend` function, thanks to the
new feature provided by the lsql library. This enhancement improves code
readability and maintainability and enables better integration with the
SQL backend. The commit also includes changes to the installation
process for reconciliation metadata tables, providing more clarity and
simplicity in the code. Additionally, several test functions have been
added or modified to ensure the proper functioning of the
`get_sql_backend` function in various scenarios, including cases where a
warehouse ID is not provided or when executing SQL statements in a
notebook environment. An error simulation test has also been added for
handling `DatabricksError` exceptions when executing SQL statements
using the `DatabricksConnectBackend` class.
* Sampling with clause query to have `from dual` in from clause for
oracle source
([databrickslabs#464](databrickslabs#464)). In this
release, we've added the `get_key_from_dialect` function, replacing the
previous `get_key_form_dialect` function, to retrieve the key associated
with a given dialect object, serving as a unique identifier for the
dialect. This improvement enhances the flexibility and readability of
the codebase, making it easier to locate and manipulate dialect objects.
Additionally, we've modified the `sampling_query.py` file so that the
WITH-clause sampling query includes `from dual` in its `from` clause for
Oracle sources, enabling sampling from Oracle databases. The
`_insert_into_main_table` method in the `recon_capture.py` file of the
`databricks.labs.remorph.reconcile` module has been updated to ensure
accurate key retrieval for the specified dialect, thereby improving the
reconciliation process. These changes resolve issues
[databrickslabs#458](databrickslabs#458) and
[databrickslabs#464](databrickslabs#464), enhancing
the functionality of the sampling query builder and providing better
support for various databases.
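A hypothetical sketch of the renamed helper; the real implementation may differ, but the intent described above is to map a dialect object to a stable string key.

```python
# Hypothetical sketch: derive a lowercase key ("oracle", "snowflake", ...) from
# a sqlglot Dialect instance to use as its unique identifier.
from sqlglot.dialects.dialect import Dialect


def get_key_from_dialect(input_dialect: Dialect) -> str:
    return type(input_dialect).__name__.lower()
```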
* Support function translation to Databricks SQL in TSql and Snowflake
([databrickslabs#414](databrickslabs#414)). This
commit introduces a dialect-aware FunctionBuilder system and a
ConversionStrategy system to enable seamless translation of SQL
functions from TSQL and Snowflake into the Databricks SQL IR. The new
FunctionBuilder system can handle both simple name translations and more
complex conversions when there is no direct equivalent. For instance,
TSQL's ISNULL function translates to IFNULL in Databricks SQL, while
Snowflake's ISNULL remains unchanged. The commit also includes updates
to the TSqlExpressionBuilder and new methods for building and visiting
various contexts, enhancing compatibility and expanding the range of
supported SQL dialects. Additionally, new tests have been added in the
FunctionBuilderSpec to ensure the correct arity and function type for
various SQL functions.
* TSQL: Create coverage tests for TSQL -> Databricks functions
([databrickslabs#415](databrickslabs#415)). This
commit introduces coverage tests for T-SQL functions and their
equivalent Databricks SQL implementations, focusing on the DATEADD
function's `yy` keyword. The DATEADD function is translated to the
ADD_MONTHS function in Databricks SQL, with the number of years
multiplied by 12 to yield months. This ensures functional equivalence between T-SQL and
Databricks SQL for date addition involving years. The tests are written
as SQL scripts and are located in the
`tests/resources/functional/tsql/functions` directory, covering various
scenarios and possible engine differences between T-SQL and Databricks
SQL. The conversion process is documented, and future automation of this
documentation is considered.
* TSQL: Implement WITH CTE
([databrickslabs#443](databrickslabs#443)). With
this commit, we have extended the TSQL functionality by adding support
for Common Table Expressions (CTEs). CTEs are temporary result sets that
can be defined within a single execution of a SELECT, INSERT, UPDATE,
DELETE, or CREATE VIEW statement, allowing for more complex and
efficient queries. The implementation includes the ability to create a
CTE with an optional name and a column list, followed by a SELECT
statement that defines the CTE. CTEs can be self-referential and can be
used to simplify complex queries, improving code readability and
performance. This feature is particularly useful for cases where
multiple queries rely on the same intermediate result set, as it enables
reusing the results without having to repeat the query.
* TSQL: Implement functions with specialized syntax
([databrickslabs#430](databrickslabs#430)). This
commit introduces new data type conversion functions and JSON
manipulation capabilities to T-SQL, addressing issue
[databrickslabs#430](databrickslabs#430). The newly
implemented features include `NEXT VALUE FOR sequence`, `CAST(col AS
sometype)`, `TRY_CAST(col AS sometype)`, `JSON_ARRAY`, and
`JSON_OBJECT`. These functions support specialized syntax for handling
data type conversions and JSON operations, including NULL value handling
using `NULL ON NULL` and `ABSENT ON NULL` syntax. The
`TSqlFunctionBuilder` class has been updated to accommodate these
changes, and new test cases have been added to the `TSqlFunctionSpec`
test class in Scala. This enhancement enables SQL-based querying and
data manipulation with increased functionality for T-SQL parser and
function evaluations.
* TSQL: Support DISTINCT in SELECT list and aggregate functions
([databrickslabs#400](databrickslabs#400)). This
commit adds support for the `DISTINCT` keyword in T-SQL for use in the
`SELECT` list and aggregate functions such as `COUNT`. When used in the
`SELECT` list, `DISTINCT` ensures unique values of the specified
expression are returned, and in aggregate functions like `COUNT`, it
considers only distinct values of the specified argument. This change
aligns with the SQL standard and enhances the functionality of the T-SQL
parser, providing developers with greater flexibility and control when
using `DISTINCT` in complex queries and aggregate functions. The default
behavior in SQL, `ALL`, remains unchanged, and the parser has been
updated to accommodate these improvements.
* TSQL: Update the SELECT statement to support XML workspaces
([databrickslabs#451](databrickslabs#451)). This
release introduces updates to the TSQL Select statement grammar to
correctly support XMLWORKSPACES in accordance with the latest
specification. Although Databricks SQL does not currently support
XMLWORKSPACES, this change is a syntax-only update to enable
compatibility with other platforms that do support it. Newly added
components include 'xmlNamespaces', 'xmlDeclaration',
'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection',
'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives',
and 'xmlColumnDefinition'. These additions enable the creation,
configuration, and usage of XML schemas and indexes, as well as the
specification of XML namespaces and directives. A new test file for
functional tests has been included to demonstrate the use of
XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While
this update does not affect existing functionality, it enables support
for XMLWORKSPACES syntax in TSQL, facilitating easier integration with
other platforms that support it.
* Test merge queue
([databrickslabs#424](databrickslabs#424)). In this
release, the Scalafmt configuration has been updated to version 3.8.0,
with changes to the formatting of Scala code. The `danglingParentheses`
preset option has been set to "false", removing dangling parentheses
from the code. Additionally, the `configStyleArguments` option has been
set to `false` under "optIn". These modifications to the configuration
file are likely to affect the formatting and style of the Scala code in
the project, ensuring consistent and organized code. This change aims to
enhance the readability and maintainability of the codebase.
* Updated bug and feature yml to support reconcile
([databrickslabs#390](databrickslabs#390)). The
open-source library has been updated to improve issue and feature
categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new
options for TranspileParserError, TranspileValidationError, and
TranspileLateralColumnAliasError have been added to the `label: Category
of Bug / Issue` field. Additionally, a new option for ReconcileError has
been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE`
directory has also been updated, introducing a required dropdown menu
labeled "Category of feature request." This dropdown offers options for
Transpile, Reconcile, and Other categories, ensuring accurate
classification and organization of incoming feature requests. The
modifications aim to enhance clarity for maintainers in reviewing and
prioritizing issue resolutions and feature implementations related to
reconciliation functionality.
* Updated the documentation with json config examples
([databrickslabs#486](databrickslabs#486)). In this
release, the Remorph Reconciliation tool on Databricks has been updated
to include JSON config examples for various config elements such as
jdbc_reader_options, column_mapping, transformations, thresholds, and
filters. These config elements enable users to define source and target
data, join columns, JDBC reader options, select and drop columns, column
mappings, transformations, thresholds, and filters. The update also
provides examples in both Python and JSON formats, as well as
instructions for installing the necessary Oracle JDBC library on a
Databricks cluster. This update enhances the tool's functionality,
making it easier for software engineers to reconcile source data with
target data on Databricks.
* Updated uninstall flow
([databrickslabs#476](databrickslabs#476)). In this
release, the `uninstall` functionality of the `databricks labs remorph`
tool has been updated to align with the latest changes made to the
`install` refactoring. The `uninstall` flow now utilizes a new
`MockInstallation` class, which handles the uninstallation process and
takes a dictionary of configuration files and their corresponding
contents as input. The `uninstall` function has been modified to return
`False` in two cases: when there is no remorph directory, or when
the user decides not to uninstall. A `MockInstallation` object is
created for the `reconcile.yml` file, and appropriate exceptions are
raised in the aforementioned cases. The `uninstall` function now uses a
`WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on
the input arguments, to handle the uninstallation process. Additionally,
the `MockPrompts` class is used to prompt the user for confirmation
before uninstalling remorph.
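The test doubles named above come from databricks-labs-blueprint, which remorph builds on; the snippet below is a rough sketch of how a test might stub the installed `reconcile.yml` and the confirmation prompt (the config content and prompt pattern are illustrative).

```python
# Rough sketch, not the actual test: stub an installed reconcile.yml and answer
# "yes" to the uninstall confirmation prompt.
from databricks.labs.blueprint.installation import MockInstallation
from databricks.labs.blueprint.tui import MockPrompts

installation = MockInstallation({"reconcile.yml": {"data_source": "snowflake"}})  # illustrative content
prompts = MockPrompts({r".*Do you want to uninstall.*": "yes"})  # illustrative pattern
```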
* Updates to developer documentation and add grammar formatting to maven
([databrickslabs#490](databrickslabs#490)). The
developer documentation has been updated to include grammar formatting
instructions and support for dialects other than Snowflake. The Maven
build cycle has been modified to format grammars before ANTLR processes
them, enhancing readability and easing conflict resolution during
maintenance. The TSqlLexer.g4 file has been updated with formatting
instructions and added dialect recognition. These changes ensure that
grammars are consistently formatted and easily resolvable during merges.
Engineers adopting this project should reformat the grammar file before
each commit, following the provided formatting instructions and
reference link. Grammar modifications in the TSqlParser.g4 file, such as
alterations in partitionFunction and freetextFunction rules, improve
structure and readability.
* Upgrade sqlglot from 23.13.7 to 25.1.0
([databrickslabs#473](databrickslabs#473)). In the
latest release, the sqlglot package has been upgraded from version
23.13.7 to 25.1.0, offering potential new features, bug fixes, and
performance improvements for SQL processing. The package dependency for
numpy has been updated to version 1.26.4, which may introduce new
functionality, improve existing features, or fix numpy integration
issues. Furthermore, the addition of the types-pytz package as a
dependency provides type hints for pytz, enhancing codebase type
checking and static analysis capabilities. Specific modifications to the
test_sql_transpiler.py file include updating the expected result in the
test_parse_query function and removing unnecessary whitespaces in the
transpiled_sql assertion in the test_procedure_conversion function.
Although the find_root_tables function remains unchanged, the upgrade to
sqlglot promises overall functionality enhancements, which software
engineers can leverage in their projects.
* Use default_factory in recon_config.py
([databrickslabs#431](databrickslabs#431)). In this
release, the default value handling for the `status` field in the
`DataReconcileOutput` and `ReconcileTableOutput` classes has been
improved to comply with Python 3.11. Previously, a mutable default value
was used, which raises a `ValueError` under Python 3.11. This has been addressed by
implementing the `default_factory` argument in the `field` function to
ensure a new instance of `StatusOutput` is created for each class.
Additionally, `MismatchOutput` and `ThresholdOutput` classes now also
utilize `default_factory` for consistent and robust default value
handling, enhancing the overall code quality and preventing potential
issues arising from mutable default values.
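A minimal sketch of the `default_factory` pattern described above; the field names inside `StatusOutput` are illustrative.

```python
# Minimal sketch of the default_factory fix; StatusOutput field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class StatusOutput:
    row: bool | None = None
    column: bool | None = None


@dataclass
class DataReconcileOutput:
    # A plain `status: StatusOutput = StatusOutput()` default raises ValueError
    # on Python 3.11; default_factory builds a fresh StatusOutput per instance.
    status: StatusOutput = field(default_factory=StatusOutput)
```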
* edit distance
([databrickslabs#501](databrickslabs#501)). In this
release, we have implemented an `edit distance` feature for calculating
the difference between two strings using the LEVENSHTEIN function. This
has been achieved by adding a new method, `anonymous_sql`, to the
`Generator` class in the `databricks.py` file. The method takes
expressions of the `Anonymous` type as arguments and calls the
`LEVENSHTEIN` function if the `this` attribute of the expression is
equal to "EDITDISTANCE". Additionally, a new test file has been
introduced for anonymous function handling in the functional Snowflake
test suite to ensure the accurate calculation of string similarity using the
EDITDISTANCE function. This change includes examples of using the
EDITDISTANCE function with different parameters and compares it with the
LEVENSHTEIN function available in Databricks. It addresses issue
[databrickslabs#500](databrickslabs#500), which was
related to testing the edit distance functionality.
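A rough sketch of the generator hook described above, built on sqlglot's public expression and generator classes; it is illustrative rather than the exact remorph code.

```python
# Illustrative sketch: render Snowflake's EDITDISTANCE as LEVENSHTEIN when
# generating Databricks SQL; other anonymous calls are rendered unchanged.
from sqlglot import expressions as exp
from sqlglot.dialects.databricks import Databricks


class _Generator(Databricks.Generator):
    def anonymous_sql(self, expression: exp.Anonymous) -> str:
        if expression.name.upper() == "EDITDISTANCE":
            return self.func("LEVENSHTEIN", *expression.expressions)
        # Fall back to a plain NAME(arg1, arg2, ...) rendering.
        return self.func(expression.name, *expression.expressions)
```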