
Support function translation to Databricks SQL in TSql and Snowflake #414

Merged
merged 10 commits into main from feature/funcconverter
Jun 4, 2024

Conversation

jimidle
Contributor

@jimidle jimidle commented May 30, 2024

Some functions must be translated from their TSQL or Snowflake forms into the equivalent IR for Databricks SQL. In some cases a function must be translated in one dialect, say TSql, but is directly equivalent in another, say Snowflake.

Here we upgrade the FunctionBuilder system to be dialect aware, and provide a ConversionStrategy system that supports anything from a simple name translation to a more complicated, dialect-specific IR representation when there is no direct equivalent.

For example, the TSQL code

SELECT ISNULL(x, 0)

should translate to

SELECT IFNULL(x, 0)

in Databricks SQL, whereas the Snowflake SQL

SELECT ISNULL(col)

is directly equivalent to Databricks SQL and needs no conversion.
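
To make this concrete, here is a minimal, self-contained Scala sketch of what such a dialect-aware conversion strategy can look like. Every name in it (SourceDialect, ConversionStrategy, Rename, AsIs, FunctionConverter, CallFunction) is illustrative only and is not taken from the actual code in this PR.

```scala
// Illustrative sketch only; all names are hypothetical and may not match the PR.
sealed trait SourceDialect
case object TSql extends SourceDialect
case object Snowflake extends SourceDialect

// A tiny stand-in for the IR: a function call with a name and argument strings.
final case class CallFunction(name: String, args: Seq[String])

// A ConversionStrategy decides how a source-dialect function becomes Databricks SQL IR.
trait ConversionStrategy {
  def convert(call: CallFunction): CallFunction
}

// Simple case: only the function name changes.
final case class Rename(newName: String) extends ConversionStrategy {
  def convert(call: CallFunction): CallFunction = call.copy(name = newName)
}

// No conversion needed: the call is already valid Databricks SQL.
case object AsIs extends ConversionStrategy {
  def convert(call: CallFunction): CallFunction = call
}

object FunctionConverter {
  // Dialect-aware lookup: ISNULL is renamed when it comes from TSql,
  // but passed through unchanged when it comes from Snowflake.
  def strategyFor(dialect: SourceDialect, name: String): ConversionStrategy =
    (dialect, name.toUpperCase) match {
      case (TSql, "ISNULL")      => Rename("IFNULL")
      case (Snowflake, "ISNULL") => AsIs
      case _                     => AsIs
    }

  def convert(dialect: SourceDialect, call: CallFunction): CallFunction =
    strategyFor(dialect, call.name).convert(call)
}

object Demo extends App {
  // TSql: ISNULL(x, 0) becomes IFNULL(x, 0); Snowflake: ISNULL(col) is unchanged.
  println(FunctionConverter.convert(TSql, CallFunction("ISNULL", Seq("x", "0"))))
  println(FunctionConverter.convert(Snowflake, CallFunction("ISNULL", Seq("col"))))
}
```

Running the two `convert` calls shows the asymmetry described above: the TSql call is renamed to IFNULL, while the Snowflake call passes through unchanged.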


codecov bot commented May 30, 2024

Codecov Report

Attention: Patch coverage is 98.61111% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.90%. Comparing base (d628677) to head (af08311).

Files | Patch % | Lines
...h/parsers/snowflake/SnowflakeFunctionBuilder.scala | 66.66% | 3 Missing ⚠️
...abricks/labs/remorph/parsers/FunctionBuilder.scala | 99.59% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #414      +/-   ##
==========================================
- Coverage   97.99%   97.90%   -0.10%     
==========================================
  Files          58       61       +3     
  Lines        3743     3765      +22     
  Branches      470      467       -3     
==========================================
+ Hits         3668     3686      +18     
- Misses         42       46       +4     
  Partials       33       33              



github-actions bot commented May 30, 2024

Coverage tests results

354 tests  ±0   246 ✅ ±0   6s ⏱️ ±0s
  2 suites ±0     0 💤 ±0 
  2 files   ±0   108 ❌ ±0 


Results for commit af08311. ± Comparison against base commit d628677.

♻️ This comment has been updated with latest results.

@jimidle jimidle force-pushed the feature/funcconverter branch from f56ed2f to ee946d9 May 30, 2024 17:19
@jimidle jimidle requested review from vil1 and sundarshankar89 May 30, 2024 17:19
@jimidle jimidle marked this pull request as ready for review May 30, 2024 17:19
@jimidle jimidle requested a review from a team as a code owner May 30, 2024 17:19
@jimidle jimidle changed the title from "Support function translation to Databricks SQL in TSQl and Snowflake" to "Support function translation to Databricks SQL in TSql and Snowflake" May 30, 2024
Contributor

@vil1 vil1 left a comment

I suggest changes to this PR in #418

jimidle pushed a commit that referenced this pull request May 31, 2024
Refactoring and improvements for `FunctionBuilder`. 

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
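
As a rough illustration of the data model described in this commit, here is a hedged Scala sketch; the case classes and the `functionDefinition` lookup are inferred from the bullet points above (StandardFunction in particular is a placeholder name), and the real signatures in the repository may well differ.

```scala
// Sketch only: shapes inferred from the commit message, not copied from the repository.
sealed trait SourceDialect
case object TSql extends SourceDialect
case object Snowflake extends SourceDialect

sealed trait FunctionArity
final case class FixedArity(arity: Int) extends FunctionArity
final case class VariableArity(argMin: Int, argMax: Int) extends FunctionArity

sealed trait FunctionType
case object StandardFunction extends FunctionType         // placeholder name for the "convertible" case
case object NotConvertibleFunction extends FunctionType   // replaces the old isConvertible boolean

final case class FunctionDefinition(arity: FunctionArity, functionType: FunctionType)

object FunctionBuilder {
  // Lookup by (dialect, name) so a function can be defined for one dialect only,
  // e.g. ISNULL for TSql and IFNULL for Snowflake.
  def functionDefinition(dialect: SourceDialect, name: String): Option[FunctionDefinition] =
    (dialect, name.toUpperCase) match {
      case (TSql, "ISNULL")      => Some(FunctionDefinition(FixedArity(2), StandardFunction))
      case (Snowflake, "IFNULL") => Some(FunctionDefinition(FixedArity(2), StandardFunction))
      case (_, "CONCAT")         => Some(FunctionDefinition(VariableArity(2, Int.MaxValue), StandardFunction))
      case _                     => None
    }
}
```

With the lookup keyed on (dialect, name), `functionDefinition(TSql, "ISNULL")` yields a fixed arity of 2 while the same name under Snowflake yields None, matching the dialect-specific definitions the bullet points call for.
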
jimidle pushed a commit that referenced this pull request May 31, 2024
Refactoring and improvements for `FunctionBuilder`. 

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@jimidle jimidle force-pushed the feature/funcconverter branch from df9862b to cfcadbb May 31, 2024 13:13
@vil1 vil1 self-requested a review June 3, 2024 07:15
Contributor

@vil1 vil1 left a comment

LGTM

@sundarshankar89 sundarshankar89 added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@jimidle jimidle force-pushed the feature/funcconverter branch from 2f85878 to 7a6e005 June 3, 2024 14:30
@jimidle jimidle added the `pr/ready-to-merge` label (this PR is safe to merge) Jun 3, 2024
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Jun 3, 2024
jimidle pushed a commit that referenced this pull request Jun 3, 2024
Refactoring and improvements for `FunctionBuilder`.

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@jimidle jimidle force-pushed the feature/funcconverter branch from 7a6e005 to 85d765f June 3, 2024 16:59
jimidle pushed a commit that referenced this pull request Jun 3, 2024
Refactoring and improvements for `FunctionBuilder`.

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@jimidle jimidle force-pushed the feature/funcconverter branch from 85d765f to 8149749 June 3, 2024 17:09
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@jimidle jimidle added this pull request to the merge queue Jun 3, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jun 3, 2024
…414)

Some functions must be translated from their TSQL or Snowflake forms into
the equivalent IR for Databricks SQL. In some cases a function must be
translated in one dialect, say TSql, but is directly equivalent in
another, say Snowflake.

Here we upgrade the FunctionBuilder system to be dialect aware, and
provide a ConversionStrategy system that supports anything from a simple
name translation to a more complicated, dialect-specific IR
representation when there is no direct equivalent.

For example, the TSQL code
```tsql
SELECT ISNULL(x, 0)
```

should translate to

```sql
SELECT IFNULL(x, 0)
```

in Databricks SQL, whereas the Snowflake SQL

```snowflake
SELECT ISNULL(col)
```

is directly equivalent to Databricks SQL and needs no conversion.

---------

Co-authored-by: Valentin Kasas <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 3, 2024
@nfx nfx removed the `pr/ready-to-merge` label (this PR is safe to merge) Jun 3, 2024
jimidle and others added 7 commits June 3, 2024 13:16
  - Some functions must be translated from TSQL or Snowflake versions into the
    equivalent IR for Databricks SQL.
  - Provides a ConversionStrategy system that allows for simple name translations
    or more complicated IR representations when there is no equivalent.
  - Modifies FunctionBuilder and related classes to accept different dialects.
Refactoring and improvements for `FunctionBuilder`.

- Introduce a `FunctionDefinition` that contains a `FunctionArity` and a
`FunctionType`
- Add a `NotConvertibleFunction` case of `FunctionType` (and get rid of
the `isConvertible` boolean)
- Lookup function definitions by name and dialect, allowing for
dialect-specific definitions (`case (Snowflake, "FOO") => ...`)
- Define `IFNULL` only for Snowflake and `ISNULL` only for TSql
@nfx nfx requested a review from vil1 June 3, 2024 19:42
@jimidle jimidle force-pushed the feature/funcconverter branch from 8149749 to 83e5d11 June 3, 2024 20:25
Contributor

@vil1 vil1 left a comment

LGTM

@vil1 vil1 requested a review from nfx June 4, 2024 07:15
Collaborator

@nfx nfx left a comment

Lgtm

@nfx nfx added this pull request to the merge queue Jun 4, 2024
Merged via the queue into main with commit eac0207 Jun 4, 2024
9 of 10 checks passed
@nfx nfx deleted the feature/funcconverter branch June 4, 2024 07:40
nfx added a commit that referenced this pull request Jul 1, 2024
* Added Oracle ojdbc8 dependent library during reconcile Installation ([#474](#474)). In this release, the `deployment.py` file in the `databricks/labs/remorph/helpers` directory has been updated to add the `ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function, enabling the reconciliation process to access the Oracle Data source and pull data for reconciliation between Oracle and Databricks. The `JDBCReaderMixin` class in the `jdbc_reader.py` file has also been updated to include the Oracle ojdbc8 dependent library for reconciliation during the `reconcile` process. This involves installing the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent library and updating the driver class to `oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary `driver_class` has been added, which maps the driver name to the corresponding class name, allowing for dynamic driver class selection during the `_get_jdbc_reader` method call. The `test_read_data_with_options` unit test has been updated to test the Oracle connector for reading data with specific options, including the use of the correct driver class and specifying the database table for data retrieval, improving the accuracy and reliability of the reconciliation process.
* Added TSQL coverage tests in the generated report artifact ([#452](#452)). In this release, we have added new TSQL coverage tests and Snowflake coverage tests to the generated report artifact in the CI/CD pipeline. These tests are executed using Maven with the updated command "mvn --update-snapshots -B test -pl coverage --file pom.xml --fail-at-end" and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml --fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o coverage-result.json" respectively, and the "continue-on-error: true" option is added to allow the pipeline to proceed even if the tests fail. Additionally, we have introduced a new constructor to the `CommentBasedQueryExtractor` class, which accepts a `dialect` parameter and allows for easier configuration of the start and end comments for different SQL dialects. We have also updated the CommentBasedQueryExtractor for Snowflake and added two TSQL coverage tests to the generated report artifact to ensure that the `QueryExtractor` is working correctly for TSQL queries. These changes will help ensure thorough testing and identification of TSQL and Snowflake queries during the CI/CD process.
* Added full support for analytical windowing functions ([#401](#401)). In this release, full support for analytical windowing functions has been implemented, addressing issue [#401](#401). The functions were previously specified in the parser grammar but have been moved to the standard function lookup table for more consistent handling. This enhancement allows for the use of analytical aggregate functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN GROUP` syntax and an `OVER` clause, enabling more complex queries and data analysis. The `FixedArity` and `VariableArity` classes have been updated with new methods for the supported functions, and appropriate examples have been provided to demonstrate their usage in SQL.
* Added parsing for STRPOS in presto ([#462](#462)). A new feature has been added to the remorph/snow package's presto module to parse the STRPOS function in SQL code. This has been achieved by importing the locate_to_strposition function from sqlglot.dialects.dialect and incorporating it into the FUNCTIONS dictionary in the Parser class. This change enables the parsing of the STRPOS function, which returns the position of the first occurrence of a substring in a string. The implementation has been tested with a SQL file containing two queries for Presto SQL using STRPOS and Databricks SQL using LOCATE, both aimed at finding the position of the letter `l` in the string 'Hello world', starting the search from the second position. This feature is particularly relevant for software engineers working on data processing and analytics projects involving both Presto and Databricks SQL, as it ensures compatibility and consistent behavior between the two for string manipulation functions. The commit is part of issue [#462](#462), and the diff provided includes a new SQL file with test cases for the STRPOS function in Presto and Locate function in Databricks SQL. The test cases confirm if the `hello` string is present in the greeting_message column of the greetings_table. This feature allows users to utilize the STRPOS function in Presto to determine if a specific substring is present in a string.
* Added validation for join columns for all query builders and limiting rows for reports ([#413](#413)). In this release, we've added validation for join columns in all query builders, ensuring consistent and accurate data joins. A limit on the number of rows displayed for reports has been implemented with a default of 50. The `compare.py` and `execute.py` files have been updated to include validation, and the `QueryBuilder` and `HashQueryBuilder` classes have new methods for validating join columns. The `SamplingQueryBuilder`, `ThresholdQueryBuilder`, and `recon_capture.py` files have similar updates for validation and limiting rows for reports. The `recon_config.py` file now has a new return type for the `get_join_columns` method, and a new method `test_no_join_columns_raise_exception()` has been added in the `test_threshold_query.py` file. These changes aim to enhance data consistency, accuracy, and efficiency for software engineers.
* Adds more coverage tests for functions to TSQL coverage ([#420](#420)). This commit adds new coverage tests for various TSQL functions, focusing on the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are identical in Databricks SQL. The tests include cases with and without the DISTINCT keyword to ensure consistent behavior between TSQL and Databricks. For the GROUPING and GROUPING_ID functions, which have some differences, tests and examples of TSQL and Databricks SQL code are provided. The CHECKSUM_AGG function, not directly supported in Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The CUME_DIST function, identical in both systems, is also tested. Additionally, a new test file for the STDEV function and updated tests for the VAR function are introduced, enhancing the reliability and robustness of TSQL conversions in the project.
* Catalog, Schema Permission checks ([#492](#492)). This release introduces enhancements to the Catalog and Schema functionality, with the addition of permission checks that raise explicit `Permission Denied` exceptions. The logger messages have been updated for clarity and a new variable, README_RECON_REPO, has been created to reference the readme file for the recon_config repository. The ReconcileUtils class has been modified to handle scenarios where the recon_config file is not found or corrupted during loading, providing clear error messages and guidance for users. The unit tests for the install feature have been updated with permission checks for Catalog and Schema operations, ensuring robust handling of permission denied errors. These changes improve the system's error handling and provide clearer guidance for users encountering permission issues.
* Changing the secret name acc to install script ([#432](#432)). In this release, the `recon` function in the `execute.py` file of the `databricks.labs.remorph.reconcile` package has been updated to dynamically generate the secret name instead of hardcoding it as "secret_scope". This change utilizes the new `get_key_form_dialect` function to create a secret name specific to the source dialect being used in the reconciliation process. The `get_dialect` function, along with `DatabaseConfig`, `TableRecon`, and the newly added `get_key_form_dialect`, have been imported from `databricks.labs.remorph.config`. This enhancement improves the security and flexibility of the reconciliation process by generating dynamic and dialect-specific secret names.
* Feature/recon documentation ([#395](#395)). This commit introduces a new reconciliation process, enhancing data consistency between sources, co-authored by Ganesh Dogiparthi, ganeshdogiparthi-db, and SundarShankar89. The README.md file provides detailed documentation for the reconciliation process. A new binary file, docs/transpile-install.gif, offers installation instructions or visual aids, while a mermaid flowchart in `report_types_visualisation.md` illustrates report generation for data, rows, schema, and overall reconciliation. No existing functionality was modified, ensuring the addition of valuable features for software engineers adopting this project.
* Fixing issues in sample query builder to handle Null's and zero ([#457](#457)). This commit introduces improvements to the sample query builder's handling of Nulls and zeroes, addressing bug [#450](#450). The changes include updated SQL queries in the test threshold query file with COALESCE and TRIM functions to replace Null values with a specified string, ensuring consistent comparison of datasets. The query store in test_execute.py has also been enhanced to handle NULL and zero values using COALESCE, improving overall robustness and consistency. Additionally, new methods such as build_join_clause, trim, and coalesce have been added to enhance null handling in the query builder. The commit also introduces the MockDataSource class, a likely test implementation of a data source, and updates the log_and_throw_exception function for clearer error messaging.
* Implement Lakeview Dashboard Publisher ([#405](#405)). In this release, we've introduced the `DashboardPublisher` class in the `dashboard_publisher.py` module to streamline the process of creating and publishing dashboards in Databricks Workspace. This class simplifies dashboard creation by accepting an instance of `WorkspaceClient` and `Installation` and providing methods for creating and publishing dashboards with optional parameter substitution. Additionally, we've added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json', which contains a dashboard definition for a data reconciliation feature. This dashboard includes various widgets for filtering and displaying reconciliation results. We've also added a test file for the Lakeview Dashboard Publisher feature, which includes tests to ensure that the `DashboardPublisher` can create dashboards using specified file paths and parameters. These new features and enhancements are aimed at improving the user experience and streamlining the process of creating and publishing dashboards in Databricks Workspace.
* Integrate recon metadata reconcile cli ([#444](#444)). A new CLI command, `databricks labs remorph reconcile`, has been added to initiate the Data Reconciliation process, loading `reconcile.yml` and `recon_config.json` configuration files from the Databricks Workspace. If these files are missing, the user is prompted to reinstall the `reconcile` module and exit the command. The command then triggers the `Remorph_Reconciliation_Job` based on the Job ID stored in the `reconcile.yml` file. This simplifies the reconcile execution process, requiring users to first configure the `reconcile` module and generate the `recon_config_<SOURCE>.json` file using `databricks labs remorph install` and `databricks labs remorph generate-recon-config` commands. The new CLI command has been manually tested and includes unit tests. Integration tests and verification on the staging environment are pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi, and SundarShankar89.
* Introduce coverage tests ([#382](#382)). This commit introduces coverage tests and updates the GitHub Actions workflow to use Java 11 with Corretto distribution, improving testing and coverage analysis for the project. Coverage tests are added as part of the remorph project with the introduction of a new module for coverage and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is modified to change the parent project version to 0.2.0-SNAPSHOT, ensuring accurate assessment and maintenance of code coverage during development. In addition, a new Main object within the com.databricks.labs.remorph.coverage package is implemented for running coverage tests using command-line arguments, along with the addition of a new file QueryRunner.scala and case classes for ReportEntryHeader, ReportEntryReport, and ReportEntry for capturing and reporting on the status and results of parsing and transpilation processes. The `Cache Maven packages` step is removed and replaced with two new steps: `Run Unit Tests with Maven` and "Run Coverage Tests with Maven." The former executes unit tests and generates a test coverage report, while the latter downloads remorph-core jars as artifacts, executes coverage tests with Maven, and uploads coverage tests results as json artifacts. The `coverage-tests` job runs after the `test-core` job and uses the same environment, checking out the code with full history, setting up Java 11 with Corretto distribution, downloading remorph-core-jars artifacts, and running coverage tests with Maven, even if there are errors. The JUnit report is also published, and the coverage tests results are uploaded as json artifacts, providing better test coverage and more reliable code for software engineers adopting the project.
* Presto approx percentile func fix ([#411](#411)). The remorph library has been updated to support the Presto database system, with a new module added to the config.py file to enable robust and maintainable interaction. An `APPROX_PERCENTILE` function has been implemented in the `presto.py` file of the `sqlglot.dialects.presto` package, allowing for approximate percentile calculations in Presto and Databricks SQL. A test file has been included for both SQL dialects, with queries calculating the approximate median of the height column in the people table. The new functionality enhances the compatibility and versatility of the remorph library in working with Presto databases and improves overall project functionality. Additionally, a new test file for Presto in the snowflakedriver project has been introduced to test expected exceptions, further ensuring robustness and reliability.
* Raise exception if reconciliation fails for any table ([#412](#412)). In this release, we have implemented significant changes to improve exception handling and raise meaningful exceptions when reconciliation fails for any table in our open-source library. A new exception class, `ReconciliationException`, has been added as a child of the `Exception` class, which takes two optional parameters in its constructor, `message` and `reconcile_output`. The `ReconcileOutput` property has been created for accessing the reconcile output object. The `InvalidInputException` class now inherits from `ValueError`, making the code more explicit with the type of errors being handled. A new method, `_verify_successful_reconciliation`, has been introduced to check the reconciliation output status and raise a `ReconciliationException` if any table fails reconciliation. The `test_execute.py` file has been updated to raise a `ReconciliationException` if reconciliation for a specific report type fails, and new tests have been added to the test suite to ensure the correct behavior of the `reconcile` function with and without raising exceptions.
* Removed USE catalog/schema statement as lsql has added the feature ([#465](#465)). In this release, the usage of `USE` statements for selecting a catalog and schema has been removed in the `get_sql_backend` function, thanks to the new feature provided by the lsql library. This enhancement improves code readability, maintainability, and enables better integration with the SQL backend. The commit also includes changes to the installation process for reconciliation metadata tables, providing more clarity and simplicity in the code. Additionally, several test functions have been added or modified to ensure the proper functioning of the `get_sql_backend` function in various scenarios, including cases where a warehouse ID is not provided or when executing SQL statements in a notebook environment. An error simulation test has also been added for handling `DatabricksError` exceptions when executing SQL statements using the `DatabricksConnectBackend` class.
* Sampling with clause query to have `from dual` in from clause for oracle source ([#464](#464)). In this release, we've added the `get_key_from_dialect` function, replacing the previous `get_key_form_dialect` function, to retrieve the key associated with a given dialect object, serving as a unique identifier for the dialect. This improvement enhances the flexibility and readability of the codebase, making it easier to locate and manipulate dialect objects. Additionally, we've modified the 'sampling_query.py' file to include `from dual` in the `from` clause for Oracle sources in a sampling query with a clause, enabling sampling from Oracle databases. The `_insert_into_main_table` method in the `recon_capture.py` file of the `databricks.labs.remorph.reconcile` module has been updated to ensure accurate key retrieval for the specified dialect, thereby improving the reconciliation process. These changes resolve issues [#458](#458) and [#464](#464), enhancing the functionality of the sampling query builder and providing better support for various databases.
* Support function translation to Databricks SQL in TSql and Snowflake ([#414](#414)). This commit introduces a dialect-aware FunctionBuilder system and a ConversionStrategy system to enable seamless translation of SQL functions between TSQL, Snowflake, and Databricks SQL IR. The new FunctionBuilder system can handle both simple name translations and more complex conversions when there is no direct equivalent. For instance, TSQL's ISNULL function translates to IFNULL in Databricks SQL, while Snowflake's ISNULL remains unchanged. The commit also includes updates to the TSqlExpressionBuilder and new methods for building and visiting various contexts, enhancing compatibility and expanding the range of supported SQL dialects. Additionally, new tests have been added in the FunctionBuilderSpec to ensure the correct arity and function type for various SQL functions.
* TSQL: Create coverage tests for TSQL -> Databricks functions ([#415](#415)). This commit introduces coverage tests for T-SQL functions and their equivalent Databricks SQL implementations, focusing on the DATEADD function's `yy` keyword. The DATEADD function is translated to the ADD_MONTHS function in Databricks SQL, with the number of months multiplied by 12. This ensures functional equivalence between T-SQL and Databricks SQL for date addition involving years. The tests are written as SQL scripts and are located in the `tests/resources/functional/tsql/functions` directory, covering various scenarios and possible engine differences between T-SQL and Databricks SQL. The conversion process is documented, and future automation of this documentation is considered.
* TSQL: Implement WITH CTE ([#443](#443)). With this commit, we have extended the TSQL functionality by adding support for Common Table Expressions (CTEs). CTEs are temporary result sets that can be defined within a single execution of a SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement, allowing for more complex and efficient queries. The implementation includes the ability to create a CTE with an optional name and a column list, followed by a SELECT statement that defines the CTE. CTEs can be self-referential and can be used to simplify complex queries, improving code readability and performance. This feature is particularly useful for cases where multiple queries rely on the same intermediate result set, as it enables reusing the results without having to repeat the query.
* TSQL: Implement functions with specialized syntax ([#430](#430)). This commit introduces new data type conversion functions and JSON manipulation capabilities to T-SQL, addressing issue [#430](#430). The newly implemented features include `NEXT VALUE FOR sequence`, `CAST(col TO sometype)`, `TRY_CAST(col TO sometype)`, `JSON_ARRAY`, and `JSON_OBJECT`. These functions support specialized syntax for handling data type conversions and JSON operations, including NULL value handling using `NULL ON NULL` and `ABSENT ON NULL` syntax. The `TSqlFunctionBuilder` class has been updated to accommodate these changes, and new test cases have been added to the `TSqlFunctionSpec` test class in Scala. This enhancement enables SQL-based querying and data manipulation with increased functionality for T-SQL parser and function evaluations.
* TSQL: Support DISTINCT in SELECT list and aggregate functions ([#400](#400)). This commit adds support for the `DISTINCT` keyword in T-SQL for use in the `SELECT` list and aggregate functions such as `COUNT`. When used in the `SELECT` list, `DISTINCT` ensures unique values of the specified expression are returned, and in aggregate functions like `COUNT`, it considers only distinct values of the specified argument. This change aligns with the SQL standard and enhances the functionality of the T-SQL parser, providing developers with greater flexibility and control when using `DISTINCT` in complex queries and aggregate functions. The default behavior in SQL, `ALL`, remains unchanged, and the parser has been updated to accommodate these improvements.
* TSQL: Update the SELECT statement to support XML workspaces ([#451](#451)). This release introduces updates to the TSQL Select statement grammar to correctly support XMLWORKSPACES in accordance with the latest specification. Although Databricks SQL does not currently support XMLWORKSPACES, this change is a syntax-only update to enable compatibility with other platforms that do support it. Newly added components include 'xmlNamespaces', 'xmlDeclaration', 'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection', 'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives', and 'xmlColumnDefinition'. These additions enable the creation, configuration, and usage of XML schemas and indexes, as well as the specification of XML namespaces and directives. A new test file for functional tests has been included to demonstrate the use of XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While this update does not affect the existing codebase's functionality, it does enable support for XMLWORKSPACES syntax in TSQL, facilitating easier integration with other platforms that support it. Please note that Databricks SQL does not currently support XML workspaces.
* Test merge queue ([#424](#424)). In this release, the Scalafmt configuration has been updated to version 3.8.0, with changes to the formatting of Scala code. The `danglingParentheses` preset option has been set to "false", removing dangling parentheses from the code. Additionally, the `configStyleArguments` option has been set to `false` under "optIn". These modifications to the configuration file are likely to affect the formatting and style of the Scala code in the project, ensuring consistent and organized code. This change aims to enhance the readability and maintainability of the codebase.
* Updated bug and feature yml to support reconcile ([#390](#390)). The open-source library has been updated to improve issue and feature categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new options for TranspileParserError, TranspileValidationError, and TranspileLateralColumnAliasError have been added to the `label: Category of Bug / Issue` field. Additionally, a new option for ReconcileError has been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE` directory has also been updated, introducing a required dropdown menu labeled "Category of feature request." This dropdown offers options for Transpile, Reconcile, and Other categories, ensuring accurate classification and organization of incoming feature requests. The modifications aim to enhance clarity for maintainers in reviewing and prioritizing issue resolutions and feature implementations related to reconciliation functionality.
* Updated the documentation with json config examples ([#486](#486)). In this release, the Remorph Reconciliation tool on Databricks has been updated to include JSON config examples for various config elements such as jdbc_reader_options, column_mapping, transformations, thresholds, and filters. These config elements enable users to define source and target data, join columns, JDBC reader options, select and drop columns, column mappings, transformations, thresholds, and filters. The update also provides examples in both Python and JSON formats, as well as instructions for installing the necessary Oracle JDBC library on a Databricks cluster. This update enhances the tool's functionality, making it easier for software engineers to reconcile source data with target data on Databricks.
* Updated uninstall flow ([#476](#476)). In this release, the `uninstall` functionality of the `databricks labs remorph` tool has been updated to align with the latest changes made to the `install` refactoring. The `uninstall` flow now utilizes a new `MockInstallation` class, which handles the uninstallation process and takes a dictionary of configuration files and their corresponding contents as input. The `uninstall` function has been modified to return `False` in two cases, either when there is no remorph directory or when the user decides not to uninstall. A `MockInstallation` object is created for the reconcile.yml file, and appropriate exceptions are raised in the aforementioned cases. The `uninstall` function now uses a `WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on the input arguments, to handle the uninstallation process. Additionally, the `MockPrompts` class is used to prompt the user for confirmation before uninstalling remorph.
* Updates to developer documentation and add grammar formatting to maven ([#490](#490)). The developer documentation has been updated to include grammar formatting instructions and support for dialects other than Snowflake. The Maven build cycle has been modified to format grammars before ANTLR processes them, enhancing readability and easing conflict resolution during maintenance. The TSqlLexer.g4 file has been updated with formatting instructions and added dialect recognition. These changes ensure that grammars are consistently formatted and easily resolvable during merges. Engineers adopting this project should reformat the grammar file before each commit, following the provided formatting instructions and reference link. Grammar modifications in the TSqlParser.g4 file, such as alterations in partitionFunction and freetextFunction rules, improve structure and readability.
* Upgrade sqlglot from 23.13.7 to 25.1.0 ([#473](#473)). In the latest release, the sqlglot package has been upgraded from version 23.13.7 to 25.1.0, offering potential new features, bug fixes, and performance improvements for SQL processing. The package dependency for numpy has been updated to version 1.26.4, which may introduce new functionality, improve existing features, or fix numpy integration issues. Furthermore, the addition of the types-pytz package as a dependency provides type hints for pytz, enhancing codebase type checking and static analysis capabilities. Specific modifications to the test_sql_transpiler.py file include updating the expected result in the test_parse_query function and removing unnecessary whitespaces in the transpiled_sql assertion in the test_procedure_conversion function. Although the find_root_tables function remains unchanged, the upgrade to sqlglot promises overall functionality enhancements, which software engineers can leverage in their projects.
* Use default_factory in recon_config.py ([#431](#431)). In this release, the default value handling for the `status` field in the `DataReconcileOutput` and `ReconcileTableOutput` classes has been improved to comply with Python 3.11. Previously, a mutable default value was used, causing a `ValueError` issue. This has been addressed by implementing the `default_factory` argument in the `field` function to ensure a new instance of `StatusOutput` is created for each class. Additionally, `MismatchOutput` and `ThresholdOutput` classes now also utilize `default_factory` for consistent and robust default value handling, enhancing the overall code quality and preventing potential issues arising from mutable default values.
* edit distance ([#501](#501)). In this release, we have implemented an `edit distance` feature for calculating the difference between two strings using the LEVENSHTEIN function. This has been achieved by adding a new method, `anonymous_sql`, to the `Generator` class in the `databricks.py` file. The method takes expressions of the `Anonymous` type as arguments and calls the `LEVENSHTEIN` function if the `this` attribute of the expression is equal to "EDITDISTANCE". Additionally, a new test file has been introduced for the anonymous user in the functional snowflake test suite to ensure the accurate calculation of string similarity using the EDITDISTANCE function. This change includes examples of using the EDITDISTANCE function with different parameters and compares it with the LEVENSHTEIN function available in Databricks. It addresses issue [#500](#500), which was related to testing the edit distance functionality.
@nfx nfx mentioned this pull request Jul 1, 2024
nfx added a commit that referenced this pull request Jul 1, 2024
* Added Oracle ojdbc8 dependent library during reconcile Installation
([#474](#474)). In this
release, the `deployment.py` file in the
`databricks/labs/remorph/helpers` directory has been updated to add the
`ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function,
enabling the reconciliation process to access the Oracle Data source and
pull data for reconciliation between Oracle and Databricks. The
`JDBCReaderMixin` class in the `jdbc_reader.py` file has also been
updated to include the Oracle ojdbc8 dependent library for
reconciliation during the `reconcile` process. This involves installing
the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent
library and updating the driver class to
`oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary
`driver_class` has been added, which maps the driver name to the
corresponding class name, allowing for dynamic driver class selection
during the `_get_jdbc_reader` method call. The
`test_read_data_with_options` unit test has been updated to test the
Oracle connector for reading data with specific options, including the
use of the correct driver class and specifying the database table for
data retrieval, improving the accuracy and reliability of the
reconciliation process.
* Added TSQL coverage tests in the generated report artifact
([#452](#452)). In this
release, we have added new TSQL coverage tests and Snowflake coverage
tests to the generated report artifact in the CI/CD pipeline. These
tests are executed using Maven with the updated command "mvn
--update-snapshots -B test -pl coverage --file pom.xml --fail-at-end"
and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml
--fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o
coverage-result.json" respectively, and the "continue-on-error: true"
option is added to allow the pipeline to proceed even if the tests fail.
Additionally, we have introduced a new constructor to the
`CommentBasedQueryExtractor` class, which accepts a `dialect` parameter
and allows for easier configuration of the start and end comments for
different SQL dialects. We have also updated the
CommentBasedQueryExtractor for Snowflake and added two TSQL coverage
tests to the generated report artifact to ensure that the
`QueryExtractor` is working correctly for TSQL queries. These changes
will help ensure thorough testing and identification of TSQL and
Snowflake queries during the CI/CD process.
* Added full support for analytical windowing functions
([#401](#401)). In this
release, full support for analytical windowing functions has been
implemented, addressing issue
[#401](#401). The
functions were previously specified in the parser grammar but have been
moved to the standard function lookup table for more consistent
handling. This enhancement allows for the use of analytical aggregate
functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN
GROUP` syntax and an `OVER` clause, enabling more complex queries and
data analysis. The `FixedArity` and `VariableArity` classes have been
updated with new methods for the supported functions, and appropriate
examples have been provided to demonstrate their usage in SQL.
* Added parsing for STRPOS in presto
([#462](#462)). A new
feature has been added to the remorph/snow package's presto module to
parse the STRPOS function in SQL code. This has been achieved by
importing the locate_to_strposition function from
sqlglot.dialects.dialect and incorporating it into the FUNCTIONS
dictionary in the Parser class. This change enables the parsing of the
STRPOS function, which returns the position of the first occurrence of a
substring in a string. The implementation has been tested with a SQL
file containing two queries for Presto SQL using STRPOS and Databricks
SQL using LOCATE, both aimed at finding the position of the letter `l`
in the string 'Hello world', starting the search from the second
position. This feature is particularly relevant for software engineers
working on data processing and analytics projects involving both Presto
and Databricks SQL, as it ensures compatibility and consistent behavior
between the two for string manipulation functions. The commit is part of
issue [#462](#462), and
the diff provided includes a new SQL file with test cases for the STRPOS
function in Presto and Locate function in Databricks SQL. The test cases
confirm if the `hello` string is present in the greeting_message column
of the greetings_table. This feature allows users to utilize the STRPOS
function in Presto to determine if a specific substring is present in a
string.
* Added validation for join columns for all query builders and limiting
rows for reports
([#413](#413)). In this
release, we've added validation for join columns in all query builders,
ensuring consistent and accurate data joins. A limit on the number of
rows displayed for reports has been implemented with a default of 50.
The `compare.py` and `execute.py` files have been updated to include
validation, and the `QueryBuilder` and `HashQueryBuilder` classes have
new methods for validating join columns. The `SamplingQueryBuilder`,
`ThresholdQueryBuilder`, and `recon_capture.py` files have similar
updates for validation and limiting rows for reports. The
`recon_config.py` file now has a new return type for the
`get_join_columns` method, and a new method
`test_no_join_columns_raise_exception()` has been added in the
`test_threshold_query.py` file. These changes aim to enhance data
consistency, accuracy, and efficiency for software engineers.
* Adds more coverage tests for functions to TSQL coverage
([#420](#420)). This
commit adds new coverage tests for various TSQL functions, focusing on
the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are
identical in Databricks SQL. The tests include cases with and without
the DISTINCT keyword to ensure consistent behavior between TSQL and
Databricks. For the GROUPING and GROUPING_ID functions, which have some
differences, tests and examples of TSQL and Databicks SQL code are
provided. The CHECKSUM_AGG function, not directly supported in
Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The
CUME_DIST function, identical in both systems, is also tested.
Additionally, a new test file for the STDEV function and updated tests
for the VAR function are introduced, enhancing the reliability and
robustness of TSQL conversions in the project.
* Catalog, Schema Permission checks
([#492](#492)). This
release introduces enhancements to the Catalog and Schema functionality,
with the addition of permission checks that raise explicit `Permission
Denied` exceptions. The logger messages have been updated for clarity
and a new variable, README_RECON_REPO, has been created to reference the
readme file for the recon_config repository. The ReconcileUtils class
has been modified to handle scenarios where the recon_config file is not
found or corrupted during loading, providing clear error messages and
guidance for users. The unit tests for the install feature have been
updated with permission checks for Catalog and Schema operations,
ensuring robust handling of permission denied errors. These changes
improve the system's error handling and provide clearer guidance for
users encountering permission issues.
* Changing the secret name acc to install script
([#432](#432)). In this
release, the `recon` function in the `execute.py` file of the
`databricks.labs.remorph.reconcile` package has been updated to
dynamically generate the secret name instead of hardcoding it as
"secret_scope". This change utilizes the new `get_key_form_dialect`
function to create a secret name specific to the source dialect being
used in the reconciliation process. The `get_dialect` function, along
with `DatabaseConfig`, `TableRecon`, and the newly added
`get_key_form_dialect`, have been imported from
`databricks.labs.remorph.config`. This enhancement improves the security
and flexibility of the reconciliation process by generating dynamic and
dialect-specific secret names.
* Feature/recon documentation
([#395](#395)). This
commit introduces a new reconciliation process, enhancing data
consistency between sources, co-authored by Ganesh Dogiparthi,
ganeshdogiparthi-db, and SundarShankar89. The README.md file provides
detailed documentation for the reconciliation process. A new binary
file, docs/transpile-install.gif, offers installation instructions or
visual aids, while a mermaid flowchart in
`report_types_visualisation.md` illustrates report generation for data,
rows, schema, and overall reconciliation. No existing functionality was
modified, ensuring the addition of valuable features for software
engineers adopting this project.
* Fixing issues in sample query builder to handle Null's and zero
([#457](#457)). This
commit introduces improvements to the sample query builder's handling of
Nulls and zeroes, addressing bug
[#450](#450). The
changes include updated SQL queries in the test threshold query file
with COALESCE and TRIM functions to replace Null values with a specified
string, ensuring consistent comparison of datasets. The query store in
test_execute.py has also been enhanced to handle NULL and zero values
using COALESCE, improving overall robustness and consistency.
Additionally, new methods such as build_join_clause, trim, and coalesce
have been added to enhance null handling in the query builder. The
commit also introduces the MockDataSource class, a likely test
implementation of a data source, and updates the log_and_throw_exception
function for clearer error messaging.
* Implement Lakeview Dashboard Publisher
([#405](#405)). In this
release, we've introduced the `DashboardPublisher` class in the
`dashboard_publisher.py` module to streamline the process of creating
and publishing dashboards in Databricks Workspace. This class simplifies
dashboard creation by accepting an instance of `WorkspaceClient` and
`Installation` and providing methods for creating and publishing
dashboards with optional parameter substitution. Additionally, we've
added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json',
which contains a dashboard definition for a data reconciliation feature.
This dashboard includes various widgets for filtering and displaying
reconciliation results. We've also added a test file for the Lakeview
Dashboard Publisher feature, which includes tests to ensure that the
`DashboardPublisher` can create dashboards using specified file paths
and parameters. These new features and enhancements are aimed at
improving the user experience and streamlining the process of creating
and publishing dashboards in Databricks Workspace.
* Integrate recon metadata reconcile cli
([#444](#444)). A new
CLI command, `databricks labs remorph reconcile`, has been added to
initiate the Data Reconciliation process, loading `reconcile.yml` and
`recon_config.json` configuration files from the Databricks Workspace.
If these files are missing, the user is prompted to reinstall the
`reconcile` module and exit the command. The command then triggers the
`Remorph_Reconciliation_Job` based on the Job ID stored in the
`reconcile.yml` file. This simplifies the reconcile execution process,
requiring users to first configure the `reconcile` module and generate
the `recon_config_<SOURCE>.json` file using `databricks labs remorph
install` and `databricks labs remorph generate-recon-config` commands.
The new CLI command has been manually tested and includes unit tests.
Integration tests and verification on the staging environment are
pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi,
and SundarShankar89.
* Introduce coverage tests
([#382](#382)). This
commit introduces coverage tests and updates the GitHub Actions workflow
to use Java 11 with Corretto distribution, improving testing and
coverage analysis for the project. Coverage tests are added as part of
the remorph project with the introduction of a new module for coverage
and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is
modified to change the parent project version to 0.2.0-SNAPSHOT,
ensuring accurate assessment and maintenance of code coverage during
development. In addition, a new Main object within the
com.databricks.labs.remorph.coverage package is implemented for running
coverage tests using command-line arguments, along with the addition of
a new file QueryRunner.scala and case classes for ReportEntryHeader,
ReportEntryReport, and ReportEntry for capturing and reporting on the
status and results of parsing and transpilation processes. The `Cache
Maven packages` step is removed and replaced with two new steps: `Run
Unit Tests with Maven` and "Run Coverage Tests with Maven." The former
executes unit tests and generates a test coverage report, while the
latter downloads remorph-core jars as artifacts, executes coverage tests
with Maven, and uploads coverage tests results as json artifacts. The
`coverage-tests` job runs after the `test-core` job and uses the same
environment, checking out the code with full history, setting up Java 11
with Corretto distribution, downloading remorph-core-jars artifacts, and
running coverage tests with Maven, even if there are errors. The JUnit
report is also published, and the coverage tests results are uploaded as
json artifacts, providing better test coverage and more reliable code
for software engineers adopting the project.
* Presto approx percentile func fix
([#411](#411)). The
remorph library has been updated to support the Presto database system,
with a new module added to the config.py file to enable robust and
maintainable interaction. An `APPROX_PERCENTILE` function has been
implemented in the `presto.py` file of the `sqlglot.dialects.presto`
package, allowing for approximate percentile calculations in Presto and
Databricks SQL. A test file has been included for both SQL dialects,
with queries calculating the approximate median of the height column in
the people table. The new functionality enhances the compatibility and
versatility of the remorph library in working with Presto databases and
improves overall project functionality. Additionally, a new test file
for Presto in the snowflakedriver project has been introduced to test
expected exceptions, further ensuring robustness and reliability.
* Raise exception if reconciliation fails for any table
([#412](#412)). In this
release, we have implemented significant changes to improve exception
handling and raise meaningful exceptions when reconciliation fails for
any table in our open-source library. A new exception class,
`ReconciliationException`, has been added as a child of the `Exception`
class, which takes two optional parameters in its constructor, `message`
and `reconcile_output`. The `ReconcileOutput` property has been created
for accessing the reconcile output object. The `InvalidInputException`
class now inherits from `ValueError`, making the code more explicit with
the type of errors being handled. A new method,
`_verify_successful_reconciliation`, has been introduced to check the
reconciliation output status and raise a `ReconciliationException` if
any table fails reconciliation. The `test_execute.py` file has been
updated to raise a `ReconciliationException` if reconciliation for a
specific report type fails, and new tests have been added to the test
suite to ensure the correct behavior of the `reconcile` function with
and without raising exceptions.
* Removed USE catalog/schema statement as lsql has added the feature
([#465](#465)). In this
release, the usage of `USE` statements for selecting a catalog and
schema has been removed in the `get_sql_backend` function, thanks to the
new feature provided by the lsql library. This enhancement improves code
readability, maintainability, and enables better integration with the
SQL backend. The commit also includes changes to the installation
process for reconciliation metadata tables, providing more clarity and
simplicity in the code. Additionally, several test functions have been
added or modified to ensure the proper functioning of the
`get_sql_backend` function in various scenarios, including cases where a
warehouse ID is not provided or when executing SQL statements in a
notebook environment. An error simulation test has also been added for
handling `DatabricksError` exceptions when executing SQL statements
using the `DatabricksConnectBackend` class.
* Sampling with clause query to have `from dual` in from clause for
oracle source
([#464](#464)). In this
release, we've added the `get_key_from_dialect` function, replacing the
previous `get_key_form_dialect` function, to retrieve the key associated
with a given dialect object, serving as a unique identifier for the
dialect. This improvement enhances the flexibility and readability of
the codebase, making it easier to locate and manipulate dialect objects.
Additionally, we've modified the 'sampling_query.py' file to include
`from dual` in the `from` clause for Oracle sources in a sampling query
with a clause, enabling sampling from Oracle databases. The
`_insert_into_main_table` method in the `recon_capture.py` file of the
`databricks.labs.remorph.reconcile` module has been updated to ensure
accurate key retrieval for the specified dialect, thereby improving the
reconciliation process. These changes resolve issues
[#458](#458) and
[#464](#464), enhancing
the functionality of the sampling query builder and providing better
support for various databases.
* Support function translation to Databricks SQL in TSql and Snowflake
([#414](#414)). This
commit introduces a dialect-aware FunctionBuilder system and a
ConversionStrategy system to enable seamless translation of SQL
functions between TSQL, Snowflake, and Databricks SQL IR. The new
FunctionBuilder system can handle both simple name translations and more
complex conversions when there is no direct equivalent. For instance,
TSQL's ISNULL function translates to IFNULL in Databricks SQL, while
Snowflake's ISNULL remains unchanged. The commit also includes updates
to the TSqlExpressionBuilder and new methods for building and visiting
various contexts, enhancing compatibility and expanding the range of
supported SQL dialects. Additionally, new tests have been added in the
FunctionBuilderSpec to ensure the correct arity and function type for
various SQL functions.
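The Scala `FunctionBuilder`/`ConversionStrategy` machinery is dialect
specific; a conceptual Python sketch of the simple name-translation
case looks like this (illustrative only, not the actual implementation):

```python
# Conceptual sketch of dialect-aware function-name translation; the real
# implementation is the Scala FunctionBuilder / ConversionStrategy.
RENAMES = {
    ("tsql", "ISNULL"): "IFNULL",  # TSQL ISNULL(x, y) -> Databricks IFNULL(x, y)
    # Snowflake's ISNULL needs no entry: it already matches Databricks SQL.
}


def translate_function(dialect: str, name: str) -> str:
    """Return the Databricks SQL name for a source-dialect function."""
    return RENAMES.get((dialect, name.upper()), name)


assert translate_function("tsql", "ISNULL") == "IFNULL"
assert translate_function("snowflake", "ISNULL") == "ISNULL"
```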
* TSQL: Create coverage tests for TSQL -> Databricks functions
([#415](#415)). This
commit introduces coverage tests for T-SQL functions and their
equivalent Databricks SQL implementations, focusing on the DATEADD
function's `yy` keyword. The DATEADD function is translated to the
ADD_MONTHS function in Databricks SQL, with the number of years
multiplied by 12 to produce months. This ensures functional equivalence
between T-SQL and
Databricks SQL for date addition involving years. The tests are written
as SQL scripts and are located in the
`tests/resources/functional/tsql/functions` directory, covering various
scenarios and possible engine differences between T-SQL and Databricks
SQL. The conversion process is documented, and future automation of this
documentation is considered.
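A hedged helper showing the year-to-month arithmetic these tests
exercise (the real rewrite happens inside the Scala transpiler, not in
Python):

```python
# Illustrative only: mirrors the arithmetic checked by the coverage tests.
def dateadd_years_to_databricks(column: str, years: int) -> str:
    """Rewrite TSQL DATEADD(yy, years, column) as Databricks ADD_MONTHS."""
    return f"ADD_MONTHS({column}, {years * 12})"


# DATEADD(yy, 2, hire_date) becomes:
print(dateadd_years_to_databricks("hire_date", 2))  # ADD_MONTHS(hire_date, 24)
```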
* TSQL: Implement WITH CTE
([#443](#443)). With
this commit, we have extended the TSQL functionality by adding support
for Common Table Expressions (CTEs). CTEs are temporary result sets that
can be defined within a single execution of a SELECT, INSERT, UPDATE,
DELETE, or CREATE VIEW statement, allowing for more complex and
efficient queries. The implementation includes the ability to create a
CTE with an optional name and a column list, followed by a SELECT
statement that defines the CTE. CTEs can be self-referential and can be
used to simplify complex queries, improving code readability and
performance. This feature is particularly useful for cases where
multiple queries rely on the same intermediate result set, as it enables
reusing the results without having to repeat the query.
* TSQL: Implement functions with specialized syntax
([#430](#430)). This
commit introduces new data type conversion functions and JSON
manipulation capabilities to T-SQL, addressing issue
[#430](#430). The newly
implemented features include `NEXT VALUE FOR sequence`, `CAST(col AS
sometype)`, `TRY_CAST(col AS sometype)`, `JSON_ARRAY`, and
`JSON_OBJECT`. These functions support specialized syntax for handling
data type conversions and JSON operations, including NULL value handling
using `NULL ON NULL` and `ABSENT ON NULL` syntax. The
`TSqlFunctionBuilder` class has been updated to accommodate these
changes, and new test cases have been added to the `TSqlFunctionSpec`
test class in Scala. This enhancement enables SQL-based querying and
data manipulation with increased functionality for T-SQL parser and
function evaluations.
* TSQL: Support DISTINCT in SELECT list and aggregate functions
([#400](#400)). This
commit adds support for the `DISTINCT` keyword in T-SQL for use in the
`SELECT` list and aggregate functions such as `COUNT`. When used in the
`SELECT` list, `DISTINCT` ensures unique values of the specified
expression are returned, and in aggregate functions like `COUNT`, it
considers only distinct values of the specified argument. This change
aligns with the SQL standard and enhances the functionality of the T-SQL
parser, providing developers with greater flexibility and control when
using `DISTINCT` in complex queries and aggregate functions. The default
behavior in SQL, `ALL`, remains unchanged, and the parser has been
updated to accommodate these improvements.
* TSQL: Update the SELECT statement to support XML workspaces
([#451](#451)). This
release introduces updates to the TSQL Select statement grammar to
correctly support XMLWORKSPACES in accordance with the latest
specification. Although Databricks SQL does not currently support
XMLWORKSPACES, this change is a syntax-only update to enable
compatibility with other platforms that do support it. Newly added
components include 'xmlNamespaces', 'xmlDeclaration',
'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection',
'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives',
and 'xmlColumnDefinition'. These additions enable the creation,
configuration, and usage of XML schemas and indexes, as well as the
specification of XML namespaces and directives. A new test file for
functional tests has been included to demonstrate the use of
XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While
this update does not affect the existing codebase's functionality, it
does enable support for XMLWORKSPACES syntax in TSQL, facilitating
easier integration with other platforms that support it. Please note
that Databricks SQL does not currently support XML workspaces.
* Test merge queue
([#424](#424)). In this
release, the Scalafmt configuration has been updated to version 3.8.0,
with changes to the formatting of Scala code. The `danglingParentheses`
preset option has been set to `false`, removing dangling parentheses
from the code. Additionally, the `configStyleArguments` option has been
set to `false` under `optIn`. These modifications to the configuration
file are likely to affect the formatting and style of the Scala code in
the project, ensuring consistent and organized code. This change aims to
enhance the readability and maintainability of the codebase.
* Updated bug and feature yml to support reconcile
([#390](#390)). The
open-source library has been updated to improve issue and feature
categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new
options for TranspileParserError, TranspileValidationError, and
TranspileLateralColumnAliasError have been added to the `label: Category
of Bug / Issue` field. Additionally, a new option for ReconcileError has
been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE`
directory has also been updated, introducing a required dropdown menu
labeled "Category of feature request." This dropdown offers options for
Transpile, Reconcile, and Other categories, ensuring accurate
classification and organization of incoming feature requests. The
modifications aim to enhance clarity for maintainers in reviewing and
prioritizing issue resolutions and feature implementations related to
reconciliation functionality.
* Updated the documentation with json config examples
([#486](#486)). In this
release, the Remorph Reconciliation tool on Databricks has been updated
to include JSON config examples for various config elements such as
jdbc_reader_options, column_mapping, transformations, thresholds, and
filters. These config elements enable users to define source and target
data, join columns, JDBC reader options, select and drop columns, column
mappings, transformations, thresholds, and filters. The update also
provides examples in both Python and JSON formats, as well as
instructions for installing the necessary Oracle JDBC library on a
Databricks cluster. This update enhances the tool's functionality,
making it easier for software engineers to reconcile source data with
target data on Databricks.
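For orientation, a table entry combining these elements might look
roughly like the Python sketch below; the element names are taken from
the list above, but the authoritative schema is the one in the updated
documentation:

```python
# Illustrative config only; consult the remorph docs for the exact schema.
table_conf = {
    "source_name": "orders",
    "target_name": "orders",
    "join_columns": ["order_id"],
    "jdbc_reader_options": {"number_partitions": 4, "partition_column": "order_id"},
    "column_mapping": [{"source_name": "ord_dt", "target_name": "order_date"}],
    "transformations": [
        {"column_name": "amount", "source": "TRIM(amount)", "target": "amount"}
    ],
    "thresholds": [
        {"column_name": "amount", "lower_bound": "-5%", "upper_bound": "5%", "type": "float"}
    ],
    "filters": {"source": "region = 'EU'", "target": "region = 'EU'"},
}
```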
* Updated uninstall flow
([#476](#476)). In this
release, the `uninstall` functionality of the `databricks labs remorph`
tool has been updated to align with the latest changes made to the
`install` refactoring. The `uninstall` flow now utilizes a new
`MockInstallation` class, which handles the uninstallation process and
takes a dictionary of configuration files and their corresponding
contents as input. The `uninstall` function has been modified to return
`False` in two cases, either when there is no remorph directory or when
the user decides not to uninstall. A `MockInstallation` object is
created for the reconcile.yml file, and appropriate exceptions are
raised in the aforementioned cases. The `uninstall` function now uses a
`WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on
the input arguments, to handle the uninstallation process. Additionally,
the `MockPrompts` class is used to prompt the user for confirmation
before uninstalling remorph.
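A hedged sketch of how such a test can be wired up; `MockInstallation`
and `MockPrompts` come from `databricks-labs-blueprint`, while the
uninstall entry point shown is assumed rather than copied from the
remorph source:

```python
# Sketch only: the uninstall call below is hypothetical.
from databricks.labs.blueprint.installation import MockInstallation
from databricks.labs.blueprint.tui import MockPrompts

installation = MockInstallation({"reconcile.yml": {"data_source": "snowflake", "job_id": "123"}})
prompts = MockPrompts({r"Do you want to uninstall.*": "yes"})

# Hypothetical entry point; the real flow builds a WorkspaceUnInstaller
# from these objects and returns False if there is nothing to remove or
# the user declines.
# assert uninstall(installation=installation, prompts=prompts) is False
```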
* Updates to developer documentation and add grammar formatting to maven
([#490](#490)). The
developer documentation has been updated to include grammar formatting
instructions and support for dialects other than Snowflake. The Maven
build cycle has been modified to format grammars before ANTLR processes
them, enhancing readability and easing conflict resolution during
maintenance. The TSqlLexer.g4 file has been updated with formatting
instructions and added dialect recognition. These changes ensure that
grammars are consistently formatted and easily resolvable during merges.
Engineers adopting this project should reformat the grammar file before
each commit, following the provided formatting instructions and
reference link. Grammar modifications in the TSqlParser.g4 file, such as
alterations in partitionFunction and freetextFunction rules, improve
structure and readability.
* Upgrade sqlglot from 23.13.7 to 25.1.0
([#473](#473)). In the
latest release, the sqlglot package has been upgraded from version
23.13.7 to 25.1.0, offering potential new features, bug fixes, and
performance improvements for SQL processing. The package dependency for
numpy has been updated to version 1.26.4, which may introduce new
functionality, improve existing features, or fix numpy integration
issues. Furthermore, the addition of the types-pytz package as a
dependency provides type hints for pytz, enhancing codebase type
checking and static analysis capabilities. Specific modifications to the
test_sql_transpiler.py file include updating the expected result in the
test_parse_query function and removing unnecessary whitespaces in the
transpiled_sql assertion in the test_procedure_conversion function.
Although the find_root_tables function remains unchanged, the upgrade to
sqlglot promises overall functionality enhancements, which software
engineers can leverage in their projects.
* Use default_factory in recon_config.py
([#431](#431)). In this
release, the default value handling for the `status` field in the
`DataReconcileOutput` and `ReconcileTableOutput` classes has been
improved to comply with Python 3.11. Previously, a mutable default value
was used, causing a `ValueError` issue. This has been addressed by
implementing the `default_factory` argument in the `field` function to
ensure a new instance of `StatusOutput` is created for each class.
Additionally, `MismatchOutput` and `ThresholdOutput` classes now also
utilize `default_factory` for consistent and robust default value
handling, enhancing the overall code quality and preventing potential
issues arising from mutable default values.
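A minimal sketch of the pattern (class and field names mirror the
release note rather than the exact remorph source):

```python
# default_factory gives every instance its own StatusOutput; a shared
# mutable default would raise ValueError under Python 3.11.
from dataclasses import dataclass, field


@dataclass
class StatusOutput:
    row: bool | None = None
    column: bool | None = None
    schema: bool | None = None


@dataclass
class ReconcileTableOutput:
    status: StatusOutput = field(default_factory=StatusOutput)
```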
* edit distance
([#501](#501)). In this
release, we have implemented an `edit distance` feature for calculating
the difference between two strings using the LEVENSHTEIN function. This
has been achieved by adding a new method, `anonymous_sql`, to the
`Generator` class in the `databricks.py` file. The method takes
expressions of the `Anonymous` type as arguments and calls the
`LEVENSHTEIN` function if the `this` attribute of the expression is
equal to "EDITDISTANCE". Additionally, a new test file covering
anonymous function handling has been introduced in the functional
Snowflake test suite to ensure the accurate calculation of string
similarity using the
EDITDISTANCE function. This change includes examples of using the
EDITDISTANCE function with different parameters and compares it with the
LEVENSHTEIN function available in Databricks. It addresses issue
[#500](#500), which was
related to testing the edit distance functionality.
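A hedged sketch of the idea behind the change, expressed as an override
of sqlglot's `Generator.anonymous_sql` (the real change lives in
remorph's Databricks generator, and sqlglot internals can shift between
versions):

```python
# Sketch only: maps Snowflake's EDITDISTANCE onto Databricks LEVENSHTEIN.
from sqlglot import exp
from sqlglot.dialects.databricks import Databricks


class PatchedGenerator(Databricks.Generator):
    def anonymous_sql(self, expression: exp.Anonymous) -> str:
        # Anonymous covers functions sqlglot has no dedicated expression for.
        if expression.name.upper() == "EDITDISTANCE":
            return self.func("LEVENSHTEIN", *expression.expressions)
        return super().anonymous_sql(expression)
```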
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request Jan 2, 2025
sundarshankar89 pushed a commit to sundarshankar89/remorph that referenced this pull request Jan 3, 2025
* Added Oracle ojdbc8 dependent library during reconcile Installation
([databrickslabs#474](databrickslabs#474)). In this
release, the `deployment.py` file in the
`databricks/labs/remorph/helpers` directory has been updated to add the
`ojdbc8` library as a `MavenLibrary` in the `_job_recon_task` function,
enabling the reconciliation process to access the Oracle Data source and
pull data for reconciliation between Oracle and Databricks. The
`JDBCReaderMixin` class in the `jdbc_reader.py` file has also been
updated to include the Oracle ojdbc8 dependent library for
reconciliation during the `reconcile` process. This involves installing
the `com.oracle.database.jdbc:ojdbc8:23.4.0.24.05` jar as a dependent
library and updating the driver class to
`oracle.jdbc.driver.OracleDriver` from `oracle`. A new dictionary
`driver_class` has been added, which maps the driver name to the
corresponding class name, allowing for dynamic driver class selection
during the `_get_jdbc_reader` method call. The
`test_read_data_with_options` unit test has been updated to test the
Oracle connector for reading data with specific options, including the
use of the correct driver class and specifying the database table for
data retrieval, improving the accuracy and reliability of the
reconciliation process.
* Added TSQL coverage tests in the generated report artifact
([databrickslabs#452](databrickslabs#452)). In this
release, we have added new TSQL coverage tests and Snowflake coverage
tests to the generated report artifact in the CI/CD pipeline. These
tests are executed using Maven with the updated command "mvn
--update-snapshots -B test -pl coverage --file pom.xml --fail-at-end"
and "mvn --update-snapshots -B exec:java -pl coverage --file pom.xml
--fail-at-end -Dexec.args="-i tests/resources/functional/snowflake -o
coverage-result.json" respectively, and the "continue-on-error: true"
option is added to allow the pipeline to proceed even if the tests fail.
Additionally, we have introduced a new constructor to the
`CommentBasedQueryExtractor` class, which accepts a `dialect` parameter
and allows for easier configuration of the start and end comments for
different SQL dialects. We have also updated the
CommentBasedQueryExtractor for Snowflake and added two TSQL coverage
tests to the generated report artifact to ensure that the
`QueryExtractor` is working correctly for TSQL queries. These changes
will help ensure thorough testing and identification of TSQL and
Snowflake queries during the CI/CD process.
* Added full support for analytical windowing functions
([databrickslabs#401](databrickslabs#401)). In this
release, full support for analytical windowing functions has been
implemented, addressing issue
[databrickslabs#401](databrickslabs#401). The
functions were previously specified in the parser grammar but have been
moved to the standard function lookup table for more consistent
handling. This enhancement allows for the use of analytical aggregate
functions, such as FIRST_VALUE and PERCENTILE_CONT, with a `WITHIN
GROUP` syntax and an `OVER` clause, enabling more complex queries and
data analysis. The `FixedArity` and `VariableArity` classes have been
updated with new methods for the supported functions, and appropriate
examples have been provided to demonstrate their usage in SQL.
* Added parsing for STRPOS in presto
([databrickslabs#462](databrickslabs#462)). A new
feature has been added to the remorph/snow package's presto module to
parse the STRPOS function in SQL code. This has been achieved by
importing the locate_to_strposition function from
sqlglot.dialects.dialect and incorporating it into the FUNCTIONS
dictionary in the Parser class. This change enables the parsing of the
STRPOS function, which returns the position of the first occurrence of a
substring in a string. The implementation has been tested with a SQL
file containing two queries for Presto SQL using STRPOS and Databricks
SQL using LOCATE, both aimed at finding the position of the letter `l`
in the string 'Hello world', starting the search from the second
position. This feature is particularly relevant for software engineers
working on data processing and analytics projects involving both Presto
and Databricks SQL, as it ensures compatibility and consistent behavior
between the two for string manipulation functions. The commit is part of
issue [databrickslabs#462](databrickslabs#462), and
the diff provided includes a new SQL file with test cases for the STRPOS
function in Presto and Locate function in Databricks SQL. The test cases
confirm if the `hello` string is present in the greeting_message column
of the greetings_table. This feature allows users to utilize the STRPOS
function in Presto to determine if a specific substring is present in a
string.
* Added validation for join columns for all query builders and limiting
rows for reports
([databrickslabs#413](databrickslabs#413)). In this
release, we've added validation for join columns in all query builders,
ensuring consistent and accurate data joins. A limit on the number of
rows displayed for reports has been implemented with a default of 50.
The `compare.py` and `execute.py` files have been updated to include
validation, and the `QueryBuilder` and `HashQueryBuilder` classes have
new methods for validating join columns. The `SamplingQueryBuilder`,
`ThresholdQueryBuilder`, and `recon_capture.py` files have similar
updates for validation and limiting rows for reports. The
`recon_config.py` file now has a new return type for the
`get_join_columns` method, and a new method
`test_no_join_columns_raise_exception()` has been added in the
`test_threshold_query.py` file. These changes aim to enhance data
consistency, accuracy, and efficiency for software engineers.
* Adds more coverage tests for functions to TSQL coverage
([databrickslabs#420](databrickslabs#420)). This
commit adds new coverage tests for various TSQL functions, focusing on
the COUNT, MAX, MIN, STDEV, STDEVP, SUM, and VARP functions, which are
identical in Databricks SQL. The tests include cases with and without
the DISTINCT keyword to ensure consistent behavior between TSQL and
Databricks. For the GROUPING and GROUPING_ID functions, which have some
differences, tests and examples of TSQL and Databicks SQL code are
provided. The CHECKSUM_AGG function, not directly supported in
Databricks SQL, is tested using MD5 and CONCAT_WS for equivalence. The
CUME_DIST function, identical in both systems, is also tested.
Additionally, a new test file for the STDEV function and updated tests
for the VAR function are introduced, enhancing the reliability and
robustness of TSQL conversions in the project.
* Catalog, Schema Permission checks
([databrickslabs#492](databrickslabs#492)). This
release introduces enhancements to the Catalog and Schema functionality,
with the addition of permission checks that raise explicit `Permission
Denied` exceptions. The logger messages have been updated for clarity
and a new variable, README_RECON_REPO, has been created to reference the
readme file for the recon_config repository. The ReconcileUtils class
has been modified to handle scenarios where the recon_config file is not
found or corrupted during loading, providing clear error messages and
guidance for users. The unit tests for the install feature have been
updated with permission checks for Catalog and Schema operations,
ensuring robust handling of permission denied errors. These changes
improve the system's error handling and provide clearer guidance for
users encountering permission issues.
* Changing the secret name acc to install script
([databrickslabs#432](databrickslabs#432)). In this
release, the `recon` function in the `execute.py` file of the
`databricks.labs.remorph.reconcile` package has been updated to
dynamically generate the secret name instead of hardcoding it as
"secret_scope". This change utilizes the new `get_key_form_dialect`
function to create a secret name specific to the source dialect being
used in the reconciliation process. The `get_dialect` function, along
with `DatabaseConfig`, `TableRecon`, and the newly added
`get_key_form_dialect`, have been imported from
`databricks.labs.remorph.config`. This enhancement improves the security
and flexibility of the reconciliation process by generating dynamic and
dialect-specific secret names.
* Feature/recon documentation
([databrickslabs#395](databrickslabs#395)). This
commit introduces a new reconciliation process, enhancing data
consistency between sources, co-authored by Ganesh Dogiparthi,
ganeshdogiparthi-db, and SundarShankar89. The README.md file provides
detailed documentation for the reconciliation process. A new binary
file, docs/transpile-install.gif, offers installation instructions or
visual aids, while a mermaid flowchart in
`report_types_visualisation.md` illustrates report generation for data,
rows, schema, and overall reconciliation. No existing functionality was
modified, ensuring the addition of valuable features for software
engineers adopting this project.
* Fixing issues in sample query builder to handle Null's and zero
([databrickslabs#457](databrickslabs#457)). This
commit introduces improvements to the sample query builder's handling of
Nulls and zeroes, addressing bug
[databrickslabs#450](databrickslabs#450). The
changes include updated SQL queries in the test threshold query file
with COALESCE and TRIM functions to replace Null values with a specified
string, ensuring consistent comparison of datasets. The query store in
test_execute.py has also been enhanced to handle NULL and zero values
using COALESCE, improving overall robustness and consistency.
Additionally, new methods such as build_join_clause, trim, and coalesce
have been added to enhance null handling in the query builder. The
commit also introduces the MockDataSource class, a likely test
implementation of a data source, and updates the log_and_throw_exception
function for clearer error messaging.
* Implement Lakeview Dashboard Publisher
([databrickslabs#405](databrickslabs#405)). In this
release, we've introduced the `DashboardPublisher` class in the
`dashboard_publisher.py` module to streamline the process of creating
and publishing dashboards in Databricks Workspace. This class simplifies
dashboard creation by accepting an instance of `WorkspaceClient` and
`Installation` and providing methods for creating and publishing
dashboards with optional parameter substitution. Additionally, we've
added a new JSON file, 'Remorph-Reconciliation-Substituted.lvdash.json',
which contains a dashboard definition for a data reconciliation feature.
This dashboard includes various widgets for filtering and displaying
reconciliation results. We've also added a test file for the Lakeview
Dashboard Publisher feature, which includes tests to ensure that the
`DashboardPublisher` can create dashboards using specified file paths
and parameters. These new features and enhancements are aimed at
improving the user experience and streamlining the process of creating
and publishing dashboards in Databricks Workspace.
* Integrate recon metadata reconcile cli
([databrickslabs#444](databrickslabs#444)). A new
CLI command, `databricks labs remorph reconcile`, has been added to
initiate the Data Reconciliation process, loading `reconcile.yml` and
`recon_config.json` configuration files from the Databricks Workspace.
If these files are missing, the user is prompted to reinstall the
`reconcile` module and exit the command. The command then triggers the
`Remorph_Reconciliation_Job` based on the Job ID stored in the
`reconcile.yml` file. This simplifies the reconcile execution process,
requiring users to first configure the `reconcile` module and generate
the `recon_config_<SOURCE>.json` file using `databricks labs remorph
install` and `databricks labs remorph generate-recon-config` commands.
The new CLI command has been manually tested and includes unit tests.
Integration tests and verification on the staging environment are
pending. This feature was co-authored by Bishwajit, Ganesh Dogiparthi,
and SundarShankar89.
* Introduce coverage tests
([databrickslabs#382](databrickslabs#382)). This
commit introduces coverage tests and updates the GitHub Actions workflow
to use Java 11 with Corretto distribution, improving testing and
coverage analysis for the project. Coverage tests are added as part of
the remorph project with the introduction of a new module for coverage
and updating the artifact version to 0.2.0-SNAPSHOT. The pom.xml file is
modified to change the parent project version to 0.2.0-SNAPSHOT,
ensuring accurate assessment and maintenance of code coverage during
development. In addition, a new Main object within the
com.databricks.labs.remorph.coverage package is implemented for running
coverage tests using command-line arguments, along with the addition of
a new file QueryRunner.scala and case classes for ReportEntryHeader,
ReportEntryReport, and ReportEntry for capturing and reporting on the
status and results of parsing and transpilation processes. The `Cache
Maven packages` step is removed and replaced with two new steps: `Run
Unit Tests with Maven` and "Run Coverage Tests with Maven." The former
executes unit tests and generates a test coverage report, while the
latter downloads remorph-core jars as artifacts, executes coverage tests
with Maven, and uploads coverage tests results as json artifacts. The
`coverage-tests` job runs after the `test-core` job and uses the same
environment, checking out the code with full history, setting up Java 11
with Corretto distribution, downloading remorph-core-jars artifacts, and
running coverage tests with Maven, even if there are errors. The JUnit
report is also published, and the coverage test results are uploaded as
JSON artifacts, providing better test coverage and more reliable code
for software engineers adopting the project.
* Presto approx percentile func fix
([databrickslabs#411](databrickslabs#411)). The
remorph library has been updated to support Presto as a source, with
Presto support added to the config.py file. An `APPROX_PERCENTILE`
function has been implemented in the `presto.py` file of the
`sqlglot.dialects.presto` package, enabling approximate percentile
calculations when translating between Presto and Databricks SQL. A test
file covers both SQL dialects, with queries that compute the approximate
median of the height column in the people table. Additionally, a new
test file for Presto in the snowflakedriver project exercises expected
exceptions, further ensuring robustness. A minimal sqlglot-based sketch
follows this entry.
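This sketch is illustrative only, not the remorph implementation: it uses sqlglot's public `transpile` API to preview how a Presto `APPROX_PERCENTILE` call is rendered for the Databricks dialect, mirroring the median-of-height example mentioned above.

```python
# Illustrative only: preview the Presto -> Databricks rendering of
# APPROX_PERCENTILE using sqlglot's public transpile API.
import sqlglot

presto_sql = "SELECT APPROX_PERCENTILE(height, 0.5) FROM people"
databricks_sql = sqlglot.transpile(presto_sql, read="presto", write="databricks")[0]
print(databricks_sql)
```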
* Raise exception if reconciliation fails for any table
([databrickslabs#412](databrickslabs#412)). In this
release, we have implemented significant changes to improve exception
handling and raise meaningful exceptions when reconciliation fails for
any table in our open-source library. A new exception class,
`ReconciliationException`, has been added as a child of the `Exception`
class, which takes two optional parameters in its constructor, `message`
and `reconcile_output`. A property has been added for accessing the
attached `ReconcileOutput` object. The `InvalidInputException`
class now inherits from `ValueError`, making the code more explicit with
the type of errors being handled. A new method,
`_verify_successful_reconciliation`, has been introduced to check the
reconciliation output status and raise a `ReconciliationException` if
any table fails reconciliation. The `test_execute.py` file has been
updated to raise a `ReconciliationException` if reconciliation for a
specific report type fails, and new tests have been added to the test
suite to ensure the correct behavior of the `reconcile` function with
and without raising exceptions.
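A minimal sketch of the exception pattern described above; the class and constructor parameter names follow the entry, while the bodies are illustrative rather than the actual remorph code.

```python
# Hedged sketch: exception types mirroring the entry above (bodies illustrative).
class ReconciliationException(Exception):
    """Raised when reconciliation fails for one or more tables."""

    def __init__(self, message: str = "", reconcile_output=None):
        super().__init__(message)
        self._reconcile_output = reconcile_output

    @property
    def reconcile_output(self):
        # Access the ReconcileOutput object attached to the failure.
        return self._reconcile_output


class InvalidInputException(ValueError):
    """Raised for invalid reconciliation input."""
```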
* Removed USE catalog/schema statement as lsql has added the feature
([databrickslabs#465](databrickslabs#465)). In this
release, the usage of `USE` statements for selecting a catalog and
schema has been removed in the `get_sql_backend` function, thanks to the
new feature provided by the lsql library. This enhancement improves code
readability and maintainability and enables better integration with the
SQL backend. The commit also includes changes to the installation
process for reconciliation metadata tables, providing more clarity and
simplicity in the code. Additionally, several test functions have been
added or modified to ensure the proper functioning of the
`get_sql_backend` function in various scenarios, including cases where a
warehouse ID is not provided or when executing SQL statements in a
notebook environment. An error simulation test has also been added for
handling `DatabricksError` exceptions when executing SQL statements
using the `DatabricksConnectBackend` class.
* Sampling with clause query to have `from dual` in from clause for
oracle source
([databrickslabs#464](databrickslabs#464)). In this
release, we've added the `get_key_from_dialect` function, replacing the
previous `get_key_form_dialect` function, to retrieve the key associated
with a given dialect object, serving as a unique identifier for the
dialect. This improvement enhances the flexibility and readability of
the codebase, making it easier to locate and manipulate dialect objects.
Additionally, we've modified the `sampling_query.py` file so that the
WITH-clause sampling query includes `from dual` in its `from` clause for
Oracle sources, enabling sampling from Oracle databases. The
`_insert_into_main_table` method in the `recon_capture.py` file of the
`databricks.labs.remorph.reconcile` module has been updated to ensure
accurate key retrieval for the specified dialect, thereby improving the
reconciliation process. These changes resolve issues
[databrickslabs#458](databrickslabs#458) and
[databrickslabs#464](databrickslabs#464), enhancing
the functionality of the sampling query builder and providing better
support for various databases.
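A hypothetical sketch of the renamed helper; the real implementation may differ, but the intent described above is to map a dialect object to a stable string key.

```python
# Hypothetical sketch: derive a lowercase key ("oracle", "snowflake", ...) from
# a sqlglot Dialect instance to use as its unique identifier.
from sqlglot.dialects.dialect import Dialect


def get_key_from_dialect(input_dialect: Dialect) -> str:
    return type(input_dialect).__name__.lower()
```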
* Support function translation to Databricks SQL in TSql and Snowflake
([databrickslabs#414](databrickslabs#414)). This
commit introduces a dialect-aware FunctionBuilder system and a
ConversionStrategy system to enable seamless translation of SQL
functions from TSQL and Snowflake into the Databricks SQL IR. The new
FunctionBuilder system can handle both simple name translations and more
complex conversions when there is no direct equivalent. For instance,
TSQL's ISNULL function translates to IFNULL in Databricks SQL, while
Snowflake's ISNULL remains unchanged. The commit also includes updates
to the TSqlExpressionBuilder and new methods for building and visiting
various contexts, enhancing compatibility and expanding the range of
supported SQL dialects. Additionally, new tests have been added in the
FunctionBuilderSpec to ensure the correct arity and function type for
various SQL functions.
* TSQL: Create coverage tests for TSQL -> Databricks functions
([databrickslabs#415](databrickslabs#415)). This
commit introduces coverage tests for T-SQL functions and their
equivalent Databricks SQL implementations, focusing on the DATEADD
function's `yy` keyword. The DATEADD function is translated to the
ADD_MONTHS function in Databricks SQL, with the number of years
multiplied by 12 to yield months. This ensures functional equivalence between T-SQL and
Databricks SQL for date addition involving years. The tests are written
as SQL scripts and are located in the
`tests/resources/functional/tsql/functions` directory, covering various
scenarios and possible engine differences between T-SQL and Databricks
SQL. The conversion process is documented, and future automation of this
documentation is considered.
* TSQL: Implement WITH CTE
([databrickslabs#443](databrickslabs#443)). With
this commit, we have extended the TSQL functionality by adding support
for Common Table Expressions (CTEs). CTEs are temporary result sets that
can be defined within a single execution of a SELECT, INSERT, UPDATE,
DELETE, or CREATE VIEW statement, allowing for more complex and
efficient queries. The implementation includes the ability to create a
CTE with an optional name and a column list, followed by a SELECT
statement that defines the CTE. CTEs can be self-referential and can be
used to simplify complex queries, improving code readability and
performance. This feature is particularly useful for cases where
multiple queries rely on the same intermediate result set, as it enables
reusing the results without having to repeat the query.
* TSQL: Implement functions with specialized syntax
([databrickslabs#430](databrickslabs#430)). This
commit introduces new data type conversion functions and JSON
manipulation capabilities to T-SQL, addressing issue
[databrickslabs#430](databrickslabs#430). The newly
implemented features include `NEXT VALUE FOR sequence`, `CAST(col AS
sometype)`, `TRY_CAST(col AS sometype)`, `JSON_ARRAY`, and
`JSON_OBJECT`. These functions support specialized syntax for handling
data type conversions and JSON operations, including NULL value handling
using `NULL ON NULL` and `ABSENT ON NULL` syntax. The
`TSqlFunctionBuilder` class has been updated to accommodate these
changes, and new test cases have been added to the `TSqlFunctionSpec`
test class in Scala. This enhancement enables SQL-based querying and
data manipulation with increased functionality for T-SQL parser and
function evaluations.
* TSQL: Support DISTINCT in SELECT list and aggregate functions
([databrickslabs#400](databrickslabs#400)). This
commit adds support for the `DISTINCT` keyword in T-SQL for use in the
`SELECT` list and aggregate functions such as `COUNT`. When used in the
`SELECT` list, `DISTINCT` ensures unique values of the specified
expression are returned, and in aggregate functions like `COUNT`, it
considers only distinct values of the specified argument. This change
aligns with the SQL standard and enhances the functionality of the T-SQL
parser, providing developers with greater flexibility and control when
using `DISTINCT` in complex queries and aggregate functions. The default
behavior in SQL, `ALL`, remains unchanged, and the parser has been
updated to accommodate these improvements.
* TSQL: Update the SELECT statement to support XML workspaces
([databrickslabs#451](databrickslabs#451)). This
release introduces updates to the TSQL Select statement grammar to
correctly support XMLWORKSPACES in accordance with the latest
specification. Although Databricks SQL does not currently support
XMLWORKSPACES, this change is a syntax-only update to enable
compatibility with other platforms that do support it. Newly added
components include 'xmlNamespaces', 'xmlDeclaration',
'xmlSchemaCollection', 'xmlTypeDefinition', 'createXmlSchemaCollection',
'xmlIndexOptions', 'xmlIndexOption', 'openXml', 'xmlCommonDirectives',
and 'xmlColumnDefinition'. These additions enable the creation,
configuration, and usage of XML schemas and indexes, as well as the
specification of XML namespaces and directives. A new test file for
functional tests has been included to demonstrate the use of
XMLWORKSPACES in TSQL and its equivalent syntax in Databricks SQL. While
this update does not affect existing functionality, it enables support
for XMLWORKSPACES syntax in TSQL, facilitating easier integration with
other platforms that support it.
* Test merge queue
([databrickslabs#424](databrickslabs#424)). In this
release, the Scalafmt configuration has been updated to version 3.8.0,
with changes to the formatting of Scala code. The `danglingParentheses`
preset option has been set to "false", removing dangling parentheses
from the code. Additionally, the `configStyleArguments` option has been
set to `false` under "optIn". These modifications to the configuration
file are likely to affect the formatting and style of the Scala code in
the project, ensuring consistent and organized code. This change aims to
enhance the readability and maintainability of the codebase.
* Updated bug and feature yml to support reconcile
([databrickslabs#390](databrickslabs#390)). The
open-source library has been updated to improve issue and feature
categorization. In the `.github/ISSUE_TEMPLATE/bug.yml` file, new
options for TranspileParserError, TranspileValidationError, and
TranspileLateralColumnAliasError have been added to the `label: Category
of Bug / Issue` field. Additionally, a new option for ReconcileError has
been included. The `feature.yml` file in the `.github/ISSUE_TEMPLATE`
directory has also been updated, introducing a required dropdown menu
labeled "Category of feature request." This dropdown offers options for
Transpile, Reconcile, and Other categories, ensuring accurate
classification and organization of incoming feature requests. The
modifications aim to enhance clarity for maintainers in reviewing and
prioritizing issue resolutions and feature implementations related to
reconciliation functionality.
* Updated the documentation with json config examples
([databrickslabs#486](databrickslabs#486)). In this
release, the Remorph Reconciliation tool on Databricks has been updated
to include JSON config examples for various config elements such as
jdbc_reader_options, column_mapping, transformations, thresholds, and
filters. These config elements enable users to define source and target
data, join columns, JDBC reader options, select and drop columns, column
mappings, transformations, thresholds, and filters. The update also
provides examples in both Python and JSON formats, as well as
instructions for installing the necessary Oracle JDBC library on a
Databricks cluster. This update enhances the tool's functionality,
making it easier for software engineers to reconcile source data with
target data on Databricks.
* Updated uninstall flow
([databrickslabs#476](databrickslabs#476)). In this
release, the `uninstall` functionality of the `databricks labs remorph`
tool has been updated to align with the latest changes made to the
`install` refactoring. The `uninstall` flow now utilizes a new
`MockInstallation` class, which handles the uninstallation process and
takes a dictionary of configuration files and their corresponding
contents as input. The `uninstall` function has been modified to return
`False` in two cases: when there is no remorph directory, or when
the user decides not to uninstall. A `MockInstallation` object is
created for the `reconcile.yml` file, and appropriate exceptions are
raised in the aforementioned cases. The `uninstall` function now uses a
`WorkspaceUnInstallation` or `WorkspaceUnInstaller` object, depending on
the input arguments, to handle the uninstallation process. Additionally,
the `MockPrompts` class is used to prompt the user for confirmation
before uninstalling remorph.
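The test doubles named above come from databricks-labs-blueprint, which remorph builds on; the snippet below is a rough sketch of how a test might stub the installed `reconcile.yml` and the confirmation prompt (the config content and prompt pattern are illustrative).

```python
# Rough sketch, not the actual test: stub an installed reconcile.yml and answer
# "yes" to the uninstall confirmation prompt.
from databricks.labs.blueprint.installation import MockInstallation
from databricks.labs.blueprint.tui import MockPrompts

installation = MockInstallation({"reconcile.yml": {"data_source": "snowflake"}})  # illustrative content
prompts = MockPrompts({r".*Do you want to uninstall.*": "yes"})  # illustrative pattern
```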
* Updates to developer documentation and add grammar formatting to maven
([databrickslabs#490](databrickslabs#490)). The
developer documentation has been updated to include grammar formatting
instructions and support for dialects other than Snowflake. The Maven
build cycle has been modified to format grammars before ANTLR processes
them, enhancing readability and easing conflict resolution during
maintenance. The TSqlLexer.g4 file has been updated with formatting
instructions and added dialect recognition. These changes ensure that
grammars are consistently formatted and easily resolvable during merges.
Engineers adopting this project should reformat the grammar file before
each commit, following the provided formatting instructions and
reference link. Grammar modifications in the TSqlParser.g4 file, such as
alterations in partitionFunction and freetextFunction rules, improve
structure and readability.
* Upgrade sqlglot from 23.13.7 to 25.1.0
([databrickslabs#473](databrickslabs#473)). In the
latest release, the sqlglot package has been upgraded from version
23.13.7 to 25.1.0, offering potential new features, bug fixes, and
performance improvements for SQL processing. The package dependency for
numpy has been updated to version 1.26.4, which may introduce new
functionality, improve existing features, or fix numpy integration
issues. Furthermore, the addition of the types-pytz package as a
dependency provides type hints for pytz, enhancing codebase type
checking and static analysis capabilities. Specific modifications to the
test_sql_transpiler.py file include updating the expected result in the
test_parse_query function and removing unnecessary whitespaces in the
transpiled_sql assertion in the test_procedure_conversion function.
Although the find_root_tables function remains unchanged, the upgrade to
sqlglot promises overall functionality enhancements, which software
engineers can leverage in their projects.
* Use default_factory in recon_config.py
([databrickslabs#431](databrickslabs#431)). In this
release, the default value handling for the `status` field in the
`DataReconcileOutput` and `ReconcileTableOutput` classes has been
improved to comply with Python 3.11. Previously, a mutable default value
was used, which raises a `ValueError` under Python 3.11. This has been addressed by
implementing the `default_factory` argument in the `field` function to
ensure a new instance of `StatusOutput` is created for each class.
Additionally, `MismatchOutput` and `ThresholdOutput` classes now also
utilize `default_factory` for consistent and robust default value
handling, enhancing the overall code quality and preventing potential
issues arising from mutable default values.
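A minimal sketch of the `default_factory` pattern described above; the field names inside `StatusOutput` are illustrative.

```python
# Minimal sketch of the default_factory fix; StatusOutput field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class StatusOutput:
    row: bool | None = None
    column: bool | None = None


@dataclass
class DataReconcileOutput:
    # A plain `status: StatusOutput = StatusOutput()` default raises ValueError
    # on Python 3.11; default_factory builds a fresh StatusOutput per instance.
    status: StatusOutput = field(default_factory=StatusOutput)
```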
* edit distance
([databrickslabs#501](databrickslabs#501)). In this
release, we have implemented an `edit distance` feature for calculating
the difference between two strings using the LEVENSHTEIN function. This
has been achieved by adding a new method, `anonymous_sql`, to the
`Generator` class in the `databricks.py` file. The method takes
expressions of the `Anonymous` type as arguments and calls the
`LEVENSHTEIN` function if the `this` attribute of the expression is
equal to "EDITDISTANCE". Additionally, a new test file has been
introduced for anonymous function handling in the functional Snowflake
test suite to ensure the accurate calculation of string similarity using the
EDITDISTANCE function. This change includes examples of using the
EDITDISTANCE function with different parameters and compares it with the
LEVENSHTEIN function available in Databricks. It addresses issue
[databrickslabs#500](databrickslabs#500), which was
related to testing the edit distance functionality.
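A rough sketch of the generator hook described above, built on sqlglot's public expression and generator classes; it is illustrative rather than the exact remorph code.

```python
# Illustrative sketch: render Snowflake's EDITDISTANCE as LEVENSHTEIN when
# generating Databricks SQL; other anonymous calls are rendered unchanged.
from sqlglot import expressions as exp
from sqlglot.dialects.databricks import Databricks


class _Generator(Databricks.Generator):
    def anonymous_sql(self, expression: exp.Anonymous) -> str:
        if expression.name.upper() == "EDITDISTANCE":
            return self.func("LEVENSHTEIN", *expression.expressions)
        # Fall back to a plain NAME(arg1, arg2, ...) rendering.
        return self.func(expression.name, *expression.expressions)
```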