Schedule the migration progress workflow to run daily #3485

Merged: 24 commits, Jan 9, 2025

@asnare asnare commented Jan 7, 2025

Changes

This PR updates the UCX installation so that the migration progress workflow runs automatically once per day.

Changes include:

  • Refactoring some of the plumbing that we use for managing/installing workflows.
  • Allowing workflows to have a (Cron-based) schedule attached to them.
  • Allowing the default schedule to be overridden via configuration during installation. (Will be addressed in a new PR.)
  • Configuring the migration progress workflow to run daily, by default at 5 a.m. (UTC).
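As a rough illustration of the default described above, here is a minimal sketch of a daily 5 a.m. UTC schedule. It assumes the schedule is expressed in Quartz cron syntax (seconds, minutes, hours, day-of-month, month, day-of-week); `DailySchedule` and `daily_at` are hypothetical names for this example and are not the UCX or Databricks SDK types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailySchedule:
    """Hypothetical stand-in for a job schedule; not the UCX or SDK type."""

    quartz_cron_expression: str
    timezone_id: str


def daily_at(hour: int, minute: int = 0, timezone_id: str = "Etc/UTC") -> DailySchedule:
    """Build a once-per-day schedule (assumes Quartz cron field order)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid time of day: {hour}:{minute}")
    # Quartz fields: seconds minutes hours day-of-month month day-of-week
    return DailySchedule(f"0 {minute} {hour} * * ?", timezone_id)


schedule = daily_at(5)
print(schedule.quartz_cron_expression)  # 0 0 5 * * ?
print(schedule.timezone_id)             # Etc/UTC
```

Keeping the time zone explicit avoids the schedule shifting with a workspace's local settings.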

Functionality

  • updated relevant user documentation
  • modified existing workflow: migration-progress-experimental

Tests

  • added unit tests, plus coverage from existing unit tests
  • added integration tests, plus coverage from existing integration tests

@asnare asnare added the documentation and enhancement labels Jan 7, 2025
@asnare asnare self-assigned this Jan 7, 2025
@asnare asnare changed the title from "Scheduler the migration progress workflow to run daily" to "Schedule the migration progress workflow to run daily" Jan 7, 2025
asnare and others added 3 commits January 7, 2025 10:12
Steps was (I believe) the historic term for workflows, when migration was represented as a sequence of steps that corresponded 1:1 to workflows.
## Changes
Add fixtures to context and change usages accordingly

### Linked issues
Resolves #3428 and #3429

### Functionality
- [x] refactored tests

### Tests
- [x] use existing integration tests
@asnare asnare force-pushed the schedule-migration-progress-workflow branch from 918b027 to f914167 Compare January 7, 2025 12:36
@asnare asnare force-pushed the schedule-migration-progress-workflow branch from f914167 to 73e2e39 Compare January 7, 2025 13:17
@asnare asnare force-pushed the schedule-migration-progress-workflow branch from 4e480ee to df6bdcc Compare January 7, 2025 13:45
@asnare asnare marked this pull request as ready for review January 7, 2025 14:57
@asnare asnare requested a review from a team as a code owner January 7, 2025 14:57
github-actions bot commented Jan 7, 2025

✅ 57/57 passed, 5 flaky, 1h48m58s total

Flaky tests:

  • 🤪 test_job_failure_propagates_correct_error_message_and_logs (5m39.746s)
  • 🤪 test_running_real_remove_backup_groups_job (5m17.611s)
  • 🤪 test_repair_run_workflow_job (11m19.121s)
  • 🤪 test_installation_with_dependency_upload (8m22.347s)
  • 🤪 test_running_real_migration_progress_job (5m25.857s)

Running from acceptance #7938

Member

@JCZuurmond JCZuurmond left a comment

Looks solid, some small pointers

@@ -919,11 +919,11 @@ The output is processed and displayed in the migration dashboard using the in `r

## [EXPERIMENTAL] Migration Progress Workflow
Member

It is not experimental anymore if we automate to run it daily

Contributor Author

I thought about this, but reasoned that we really don't know how well it works yet because I don't think many people (manually) trigger it. As such I thought it prudent to first see what kind of reports come in (if any) before removing the experimental label.

Does that make sense at all?

src/databricks/labs/ucx/runtime.py (resolved)
"tags": tags,
"job_clusters": self._job_clusters(job_clusters),
"email_notifications": email_notifications,
"schedule": workflow.schedule,
Member

This works with None? Probably covered indirectly by integration tests

Contributor Author

Yes: this dictionary is eventually used as **kwargs and schedule is an argument that defaults to None.
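To illustrate the point in the reply above, here is a small sketch of why a `None` entry is harmless when a settings dictionary is unpacked as `**kwargs`. The `create_job` function is a hypothetical stand-in for whatever call eventually receives the settings; it is not the real SDK signature.

```python
from typing import Any, Optional


def create_job(
    *,
    name: str,
    tags: dict[str, str],
    schedule: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical receiver of the job settings; not the real SDK call."""
    created: dict[str, Any] = {"name": name, "tags": tags}
    # Passing schedule=None explicitly is the same as omitting the argument.
    if schedule is not None:
        created["schedule"] = schedule
    return created


# A workflow without a schedule: the key is present but its value is None.
settings = {"name": "unscheduled-workflow", "tags": {"version": "dev"}, "schedule": None}
job = create_job(**settings)
print("schedule" in job)  # False
```

Because the keyword already defaults to `None`, the dictionary does not need a special case for workflows that have no schedule.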

@asnare asnare enabled auto-merge January 9, 2025 14:11
@asnare asnare added this pull request to the merge queue Jan 9, 2025
Merged via the queue into main with commit db29a63 Jan 9, 2025
7 checks passed
@asnare asnare deleted the schedule-migration-progress-workflow branch January 9, 2025 14:47
gueniai added a commit that referenced this pull request Jan 16, 2025
*  Implement disposition field in SQL backend ([#3477](#3477)). This commit introduces the `query_statement_disposition` configuration value to handle large SQL queries during assessment results export for workspaces with numerous findings. A new parameter is added to the `config.yml` file, allowing users to specify the disposition method for running large SQL statements. The modification includes changes to the `databricks labs install ucx` and `databricks labs ucx export-assessment` commands and updates to the SqlBackend definition. The `Disposition` enum is utilized to specify the disposition method in tests, which have been manually verified. This feature, developed by Michele Daddetta and Guenia Izquierdo Delgado, resolves issue [#3447](#3447) and is based on changes from PR [#3455](#3455).
* AWS role issue with external locations pointing to the root of a storage account ([#3510](#3510)). In this release, the `AWSResources` class in `aws.py` has been updated to improve S3 bucket ARN pattern matching by modifying the regular expression pattern for matching. The `_identify_missing_paths` function in `access.py` has been enhanced to check for AWS role compatibility with external locations that point to the root of a storage account using `PurePath` class. Additionally, new unit tests have been added to `tests/unit/aws/test_access.py` to ensure the correct creation of all necessary UC roles, including the new external location `s3://BUCKET4` with an appropriate access level. These changes improve the accuracy of ARN pattern matching and enhance compatibility checking and testing for AWS roles and external locations. This release is part of the ongoing development of the AWS assessment tool and addresses issues [#3510](#3510) and [#3505](#3505).
* Added dashboards to migration progress dashboard ([#3314](#3314)). This commit, co-authored by Guenia Izquierdo Delgado, modifies the migration progress dashboard to include linting resources, adds new dashboards, and improves overall functionality and maintainability. The changes include modifying the existing 'Migration [main]' dashboard and updating associated unit and integration tests. New dashboards such as `Dashboards migrated` and `Dashboard pending migration` provide valuable insights into the migration progress, displaying successful migrations and pending migration status by owner. The commit also reorganizes some existing queries and adds new methods to support the new functionality, addressing dependencies from issue [#3424](#3424) and progressing work on issue [#3045](#3045), while breaking up issue [#3112](#3112).
* Added history log encoder for dashboards ([#3424](#3424)). This commit introduces a history log encoder for dashboards in the context of a larger application, addressing issues [#3368](#3368) and [#3369](#3369). The `experimental-migration-progress` workflow has been modified, and new classes, properties, and methods have been added to handle dashboard-related progress encoding. Specifically, the `Dashboard` class, `DashboardOwnership` class, and `DashboardProgressEncoder` class have been introduced, along with several methods for assessing dashboard ownership. These changes are tested through manual testing, unit tests, and integration tests. Additionally, the existing `TableProgressEncoder` class has been updated with new tests for failure scenarios involving tables that have not been migrated. The `WorkspacePathOwnership` method has been added to determine the owner of a given workspace path, and a new unit test has been added to test table creation from historical data.
* Create specific failure for Python syntax error while parsing with Astroid ([#3498](#3498)). This commit enhances the Python linting functionality in our open-source library by introducing a specific failure message for syntax errors that occur during code parsing with Astroid. Previously, a generic `system-error` message was displayed, which provided limited guidance for users. Now, a new failure type called `python-parse-error` is displayed when a SyntaxError is raised during parsing, with detailed information such as the error message, line, and column numbers. This change aligns the failure type with `sql-parse-error` and adds a default GitHub issue template to report the error. Additionally, the commit renames `system-error` to `python-parse-error` to maintain consistency and updates the README to explain the new failure type. The commit also includes new unit tests to ensure that the new failure type is being handled correctly, and modifies the Python linting-related code to add a new method `Tree.maybe_parse()` to handle syntax errors.
* DBR 16 and later support ([#3481](#3481)). This pull request introduces support for Databricks Runtime (DBR) 16 in the optional conversion of Hive Metastore (HMS) tables to external tables within the `migrate-tables` workflow. The update includes modifications to the existing `migrate-tables` workflow, such as the addition of a `_get_entity_storage_locations` method to check for the presence of the `entityStorageLocations` property in the table metadata, which is required for the `CatalogTable` constructor in DBR 16.0. The changes have been tested manually on DBR16, passed integration tests on DBR15, and verified on a staging environment using DBR16. Additionally, the `test_running_real_assessment_job` function in `test_workflows.py` has been updated to include the `skip_job_wait=True` parameter when running the `run_workflow` method for the `assessment` workflow, improving testing efficiency. The commit also includes a deprecated test case for converting managed tables to external before migrating, with a note about its failure from DBR 16.0 onwards due to a JDK update. The test case remains unchanged, but the note serves as a reminder for further investigation. The `run_workflow` function in the test cases has been modified to include a `skip_job_wait` parameter, allowing tests to bypass waiting for job completion, reducing overall test runtime and improving the developer experience.
* Exclude ucx dashboards from Lakeview dashboard crawler ([#3450](#3450)). In this release, we have introduced modifications to the `assessment` workflow, specifically in the `dashboards.py` file, to exclude dashboards from the UCX package in the Lakeview dashboard crawler and prevent false positives. The `lakeview_crawler` method in the `application.py` file has been updated to include a new argument `exclude_dashboard_ids`, set to the list of dashboard IDs in the `install_state.dashboards` object. This ensures that these dashboards are excluded from the crawler. Additionally, two new unit tests have been added to ensure the exclusion functionality works correctly. The first test checks if the crawler skips the dashboard with the ID specified in the `exclude_dashboard_ids` parameter, and the second test ensures that the `exclude_dashboard_ids` parameter takes priority over the `include_dashboard_ids` parameter when both are provided. The changes have been manually tested and verified on the staging environment, and the linked issue [#3441](#3441) has been resolved.
* Fixed issue in installing UCX on UC enabled workspace ([#3501](#3501)). In this release, we have updated the UCX policy definition for `spark_version` from a fixed value to an allowlist with a default value. This change resolves an issue where enabling UC on a workspace caused the cluster definition to take on `single_user` and `user_isolation` values instead of `Legacy_Single_User` and 'Legacy_Table_ACL'. The policy was found to be overriding these values, and changing `spark_version` from fixed to allowlist resolved the issue. Additionally, the job definition now uses the default value if no value is provided by setting `apply_policy_default_values` to true. This change resolves issue [#3420](#3420). No new methods have been added, and existing functionality has not been significantly altered. To test this change, updated unit tests, integration tests, and a static installation test should be performed. The code modification includes a new method called `test_job_cluster_on_uc_enabled_workspace` which tests the behavior of installation on a UC-enabled workspace, verifying that the correct data security modes are set for different job clusters. The changes in this release are backward compatible and do not affect existing functionality. The modification to the UCX policy ensures that the correct spark version and node type are selected, while also allowing for flexibility in data security modes. The updated tests provide confidence in the correct behavior of the installation process on both standard and UC-enabled workspaces.
* Fixed typo in workflow name (in error message) ([#3491](#3491)). This PR fixes a minor typo in an error message that appears when group permissions fail to migrate successfully. The typo, found in the name of the workflow for validating permissions, has been corrected from `validate-group-permissions` to `validate-groups-permissions`. The error message now points users at the correct workflow when group permissions need to be re-validated during migration. No new methods have been introduced, and no other parts of the codebase are affected.
* Refactor `PipelineMigrator`'s to add `include_pipeline_ids` ([#3495](#3495)). In this release, the `PipelineMigrator` class in the `pipelines_migrate.py` file has been refactored to enhance the pipeline migration process. The refactor introduces a new parameter `include_pipeline_ids`, which allows users to specify a list of pipelines to migrate. Previously, users could only skip pipelines that were already migrated or explicitly specified using the `skip_pipeline_ids` parameter. With this refactor, users now have more control over the migration process by being able to explicitly include and exclude pipelines using the `include_pipeline_ids` and `exclude_pipeline_ids` parameters, respectively. Additionally, the implementation of the `PipelineMigrator` class has been simplified, and unit tests and integration tests have been updated to reflect these changes.
* Schedule the migration progress workflow to run daily ([#3485](#3485)). This PR introduces a daily schedule for the UCX installation's migration progress workflow, refactoring workflow management/installation plumbing to enable Cron-based scheduling and setting the default schedule for the migration progress workflow to run at 5 a.m. UTC. Relevant user documentation has been updated, and the existing `migration-progress-experimental` workflow has been modified. New test methods have been added to check for the presence of workflows and tasks, as well as validate the workflow's schedule and pause status. These changes improve automation and maintainability of the UCX installation process, while ensuring that existing functionalities are working correctly.
* Scope crawled pipelines in PipelineCrawler ([#3513](#3513)). In this release, the `PipelineCrawler` class in the `databricks/labs/ucx/assessment` directory has been updated with a new optional argument `include_pipeline_ids` in the constructor. This argument is a list of strings that represent the IDs of pipelines to be crawled. If not provided, all pipelines will be crawled. The `_crawl` method has been modified to accept a list of pipeline IDs and now obtains a list of pipeline IDs instead of pipeline objects. For each pipeline ID, the method tries to get the pipeline and extract its configuration, while also checking for any failures. Additionally, assertions have been added to ensure that the `pipeline_id` and `spec.configuration` attributes are not `None`. A new test function `test_include_pipeline_ids()` has been introduced to verify the functionality of this argument. These changes improve the functionality of the `PipelineCrawler` class by allowing users to crawl specific pipelines based on their IDs.
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to >=0.9.1,<0.11 ([#3519](#3519)). In this update, the requirement for the `databricks-labs-blueprint` package has been updated to a version greater than or equal to 0.9.1 and strictly less than 0.11, previously it was greater than or equal to 0.9.1 and strictly less than 0.10. This change allows the latest version of the package to be used. Additionally, the commit includes release notes, a changelog, and commit information for the updated package, as well as instructions for Dependabot commands and options. The changes are limited to the `pyproject.toml` file and do not have any impact on other parts of the codebase.
* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2 ([#3500](#3500)). In this pull request, we have updated the version requirement for the `sqlglot` dependency in the 'pyproject.toml' file. The previous version constraint was for a version greater than or equal to 25.5.0 and less than 26.1, but it has been relaxed to permit versions greater than or equal to 25.5.0 and less than 26.2. This change was made to enable the use of the latest version of 'sqlglot', which includes several new features, bug fixes, and breaking changes as detailed in the 26.1.0 changelog. We have also included the commit history for the `sqlglot` repository to provide further context and reference. This update aims to ensure compatibility with the latest version of `sqlglot` while also providing transparency regarding the changes implemented.
* Updated sqlglot requirement from <26.2,>=25.5.0 to >=25.5.0,<26.3 ([#3528](#3528)). In this release, we have updated the required version constraint of the `sqlglot` library in the `pyproject.toml` file. The previous constraint `>=25.5.0,<26.2` has been updated to `>=25.5.0,<26.3`. This change allows the project to utilize the latest version of `sqlglot` within the newly specified range while maintaining compatibility with the project's existing requirements. Notably, this update does not introduce any new methods to the project; it only affects the version constraint for the `sqlglot` library. Software engineers integrating this project can now benefit from the latest `sqlglot` versions within the specified range.
* Updated table-migration workflows to also capture updated migration progress into the history log ([#3239](#3239)). This pull request enhances the table-migration workflows by logging updated migration progress in the history log, providing improved visibility into the migration process. The workflows, including `migrate-tables`, `migrate-external-hiveserde-tables-in-place-experimental`, `migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`, and `migrate-tables-in-mounts-experimental`, have been updated to include this new logging functionality. In addition to these changes, the documentation has been updated to reflect which workflows update which tables, and the `TableMigrationStatus` data initialization behavior has been modified. New and updated unit and integration tests have been manually tested to ensure the changes are functioning correctly. Co-authored by Serge Smertin and Cor Zuurmond.

Dependency updates:

 * Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2 ([#3500](#3500)).
 * Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to >=0.9.1,<0.11 ([#3519](#3519)).
@gueniai gueniai mentioned this pull request Jan 16, 2025
gueniai added a commit that referenced this pull request Jan 23, 2025
* Implement disposition field in SQL backend
([#3477](#3477)). In this
release, we've added a new `query_statement_disposition` configuration
option for the SQL backend used in the `databricks labs ucx`
command-line interface. This option allows users to choose the
disposition method for running large SQL queries during assessment
results export, preventing failures in cases of large workspaces with
high volumes of findings. The new option is included in the `config.yml`
file and used in the SqlBackend definition. The commit also includes
updates to the `workspace_cli.py` file and addresses issue
[#3447](#3447). The
`disposition` parameter has been added to the
`StatementExecutionBackend` method, and the `Disposition` enum from the
`databricks.sdk.service.sql` module has been added to the `config.py`
file. The changes have been manually tested and are included in the
modified `databricks labs install ucx` and `databricks labs ucx
export-assessment` commands.
* AWS role issue with external locations pointing to the root of a
storage account
([#3510](#3510)). This
release includes a modification to enhance AWS role access for external
locations pointing to the root of a storage account, addressing issue
[#3510](#3510) and closing
issue [#3505](#3505). The
`aws.py` file in the `src/databricks/labs/ucx/assessment/` directory has
been updated to improve S3 bucket ARN pattern matching, now allowing
optional trailing slashes for greater flexibility. In the `access.py`
file within the `aws` directory of the `databricks/labs/ucx` package,
the `_identify_missing_paths` method now checks if the
`role.resource_path` is a parent of the external location path or if
they match exactly, allowing root-level external locations to be
recognized as compatible with AWS roles. A new method,
`AWSUCRoleCandidate`, has been added to the `AWSResources` class, and
several test cases have been updated or added to ensure proper
functionality with UC roles and AWS resources, including handling cases
with multiple role creations.
* Added assert to make sure installation is finished before
re-installation
([#3546](#3546)). In the
latest release, we've addressed an issue (PR #3546) where the
reinstallation of a software component was starting before the initial
installation was complete, causing a warning message to be suppressed
and the test to fail. To rectify this, we have enhanced the integration
tests and added an assert to ensure that the installation is finished
before attempting reinstallation. A new function called
`wait_for_installation_to_finish` has been introduced to manage the
waiting process. Furthermore, we have updated the
`test_compare_remote_local_install_versions` function to accept
`installation_ctx` instead of `ws` as a parameter, ensuring proper
configuration and loading of the installation before test execution.
These changes guarantee that the test will pass if the installation is
finished before the reinstallation is attempted.
* Added dashboards to migration progress dashboard
([#3314](#3314)). The
release notes have been updated to reflect the new features and changes
in the migration progress dashboard. This commit includes the addition
of dashboards to track the migration progress, with linting resources
added to ensure code quality. The commit also modifies the existing
dashboard "Migration [main]" and updates both unit and integration
tests. Specific new files and methods have been added to enhance
functionality, including the tracking of dashboard migration, and new
fixtures have been introduced to improve testing. The changes depend on
several issues and break up others to progress functionality. Overall,
this commit enhances the migration progress dashboard's capabilities,
making it more efficient and reliable for tracking migration progress.
* Added history log encoder for dashboards
([#3424](#3424)). A history
log encoder for dashboards has been added, addressing issues
[#3368](#3368) and
[#3369](#3369), which
modifies the existing `experimental-migration-progress` workflow. This
enhancement introduces a `DashboardProgressEncoder` class that encodes
Dashboard objects into Historical records, appending inventory snapshots
to the history table. The changes include adding new methods for
handling object types such as directories, and updating the `is_delta`
property of the `Table` class. The commit also includes new tests:
manually tested, unit tests added, and integration tests added.
Specifically, `test_table_progress_encoder_table_failures` has been
updated to include a new parameter, `is_migrated_table`, which, if set
to False, adds `Pending migration` to the list of failures. The
`is_used_table` parameter has been removed, and its functionality is no
longer part of this commit. The changes are tested through manual, unit,
and integration testing, ensuring the proper encoding of migration
progress and identifying relevant failures.
* Create specific failure for Python syntax error while parsing with
Astroid ([#3498](#3498)). In
this release, the Python linting-related code has been updated to
introduce a specific failure type for syntax errors that occur while
parsing code using Astroid. Previously, such errors resulted in a
generic `system-error` message, but with this change, a new failure type
called `python-parse-error` has been introduced. This new error type
includes the start and end line and column numbers of the error and is
accompanied by a new issue URL for reporting the error on the UCX
GitHub. The `system-error` failure type has been renamed to
`python-parse-error` to maintain consistency with the `sql-parse-error`
failure type. Additionally, a new method `Tree.maybe_parse()` has been
introduced to improve error detection and reporting during Python
linting. A unit test has been added to ensure the new failure type is
working as intended, and a generic failure is kept for directing users
to create GitHub issues for surfacing other issues.
* DBR 16 and later support
([#3481](#3481)). This
release adds support for Databricks Runtime (DBR) 16 and later, enabling
the optional conversion of Hive Metastore (HMS) tables to external
tables within the `migrate-tables` workflow. The change includes a new
static method `_get_entity_storage_locations` to check for the presence
of the `entityStorageLocations` property on table metadata. The existing
`_convert_hms_table_to_external` method has been updated to use this new
method and to include the `entityStorageLocations` constructor argument
if present. The changes have been manually tested for DBR 16, tested
with existing integration tests for DBR 15, and verified on the staging
environment with DBR 16. Additionally, the `skip_job_wait=True`
parameter has been added to specific test function calls to improve test
execution time. This release also resolves an issue with a failed test
in DBR16 due to a JDK update.
* Delete stale code: `NotebookLinter._load_source_from_run_cell`
([#3529](#3529)). In this
release, we have improved the code linting functionality in the
NotebookLinter class of our open-source library by removing the
`_load_source_from_run_cell` method in the sources.py file. This method,
previously used to load source code from run cells in a notebook, has
been identified as stale code and is no longer required. Consequently,
this change affects the `databricks labs ucx lint-local-code` command
and results in cleaner and more maintainable code. Furthermore, updated
and added unit tests have been included in this commit, which have been
manually tested to ensure that the changes do not adversely impact
existing functionality, thus progressing issue
[#3514](#3514).
* Exclude ucx dashboards from Lakeview dashboard crawler
([#3450](#3450)). In this
release, the functionality of the `assessment` workflow has been
improved to exclude certain dashboard IDs from the Lakeview dashboard
crawler. This change has been made to address the issue of false
positive dashboards and affects the `_crawl` method in the
`dashboards.py` file. The excluded dashboard IDs are now obtained from
the `install_state.dashboards` object. Additionally, new methods have
been added to the `test_dashboards.py` file in the `unit/assessment`
directory to test the exclusion functionality, including a test to
ensure that the exclude parameter takes priority over the include
parameter. The commit also includes unit tests, manual tests, and
screenshots to verify the changes on the staging environment. Overall,
this modification enhances the accuracy of the dashboard crawler and
simplifies the process of identifying and assessing relevant dashboards.
* Fixed issue in installing UCX on UC enabled workspace
([#3501](#3501)). This pull
request introduces changes to the UCX installer to address an issue
([#3420](#3420)) with
installing UCX on UC-enabled workspaces. It updates the UCX policy by
changing the `spark_version` parameter from `fixed` to `allowlist` with
a default value, allowing the cluster definition to take `single_user`
and `user_isolation` values instead of `Legacy_Single_User` and
'Legacy_Table_ACL'. Additionally, the job definition has been updated to
use the default value when not explicitly provided. The changes are
implemented in the `test_policy.py` file and impact the
`test_job_cluster_policy` and `test_job_cluster_on_uc_enabled_workspace`
methods. The pull request also includes updates to unit tests and
integration tests to ensure the correct behavior of the updated UCX
policy and job definition. The target audience is software engineers
adopting this project, with changes involving adjusting policy
definitions and testing job cluster behavior under different
configurations. Issue
[#3501](#3501) is also
resolved with these changes.
* Fixed typo in workflow name (in error message)
([#3491](#3491)). This PR
fixes a minor typo in the error message of the
`validate_groups_permissions` method in the `workflows.py` file. The
message misspelled the workflow name as `validate-group-permissions`;
it now reads `validate-groups-permissions`, so the correct workflow
name is displayed. The functionality of the code is unaffected by this
change, and no new methods have been added. To clarify, the
`validate_groups_permissions` method verifies whether group permissions
have been migrated correctly, and if not, raises a ValueError with an
error message suggesting the use of the `validate-groups-permissions`
workflow for validation after the API has caught up.
* Make link to issue template url safe
([#3508](#3508)). In this
commit, the `_definitely_failure` function in the `python_ast.py` file
has been modified to make the link to the issue template URL safe using
Python's `urllib`. This change ensures that any special characters in
the source code passed to the function will be properly displayed in the
issue template. If the source code cannot be parsed, the function
creates a link to the issue template for reporting a bug in the UCX
library, including the source code as part of the issue body. With this
commit, the source code is now passed through the
`urllib.parse.quote_plus` function before being added to the issue body,
making it url-safe and improving the robustness and user-friendliness of
the library. This change has been introduced in issue
[#3498](#3498) and has been
manually tested.
* Refactor `PipelineMigrator`'s to add `include_pipeline_ids`
([#3495](#3495)). In this
refactoring, the `PipelineMigrator` has been updated to introduce an
`include_pipeline_ids` option, replacing the previous
`skip_pipeline_ids` flag. This change allows users to specify the list
of pipelines to migrate, providing better control over the migration
process. The `PipelinesMigrator` constructor,
`_get_pipelines_to_migrate`, and `migrate_pipelines` methods have been
modified to accommodate this new flag. The `_migrate_pipeline` method
now accepts the pipeline ID instead of a `PipelineInfo` object.
Additionally, the unit tests have been updated to include the new
`include_flag` parameter, which facilitates testing various scenarios
with different pipeline lists. Although the commit does not show changes
to test files, integration tests should be updated to reflect the new
`include-pipeline-ids` flag functionality. This improvement resolves
issue [#3492](#3492) and
enhances the overall flexibility of the `PipelineMigrator`.
* Rename Python AST's `Tree` methods for clarity
([#3524](#3524)). In this
release, the `Tree` class in the Python AST library has been updated for
improved code clarity and functionality. The `append_` methods have been
renamed to `attach_` for better accuracy, and now include docstrings for
increased understanding. These methods have been updated to always
return `None`. A new method, `attach_child_tree`, has been added,
allowing for traversal from both parent and child and propagating any
module references. Several new methods and functionalities have been
introduced to improve the class, while extensive unit testing has been
conducted to ensure functionality. Additionally, the diff includes test
cases for various functionalities, such as inferring values when
attaching trees and verifying spark module propagation, as well as tests
to ensure that certain operations are not supported. This change, linked
to issues [#3514](#3514) and
[#3520](#3520), may affect
any code that calls these methods and relies on their return values.
However, the added docstrings and unit tests will help ensure your code
continues to function correctly.
* Schedule the migration progress workflow to run daily
([#3485](#3485)). This PR
introduces changes to the UCX installation process to schedule the
migration progress workflow to run automatically once a day, with the
default schedule set to run at 5 a.m. UTC. It includes refactoring the
plumbing used for managing and installing workflows, enabling them to
have a Cron-based schedule. The relevant user documentation has been
updated, and the existing `migration-progress-experimental` workflow has
been modified. Additionally, unit and integration tests have been
added/modified to ensure the proper functioning of the updated code, and
new functions have been added to verify the workflow's schedule and task
detection.
* Scope crawled pipelines in PipelineCrawler
([#3513](#3513)). In this
release, the `PipelineCrawler` class in the `pipelines.py` file has been
updated to include a new optional argument `include_pipeline_ids` in its
constructor. This argument allows users to filter the pipelines that are
crawled by specifying a list of pipeline IDs. The `_crawl` method has
been modified to check if `include_pipeline_ids` is not `None` and to
filter the list of pipelines accordingly. The class now also checks if
each pipeline exists before getting its configuration, and logs a
warning message if the pipeline is not found. Previously, a `NotFound`
exception was raised. Additionally, the code has been updated to use
`pipeline.spec.configuration` instead of
`pipeline_response.spec.configuration` to get the pipeline
configuration. These changes have been tested through new and updated
unit tests, including a test for handling creators' user names. Overall,
these updates provide improved functionality and flexibility for
crawling pipelines.
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)). In this
release, we have updated the version requirement of the
`databricks-labs-blueprint` package to be greater than or equal to 0.9.1
and less than 0.11. This change allows us to use the latest version of
the package and includes bug fixes and dependency updates. The hosted
runner has been patched in version 0.10.1 to address issues with
publishing artifacts in the release workflow. Release notes for previous
versions are also provided in the commit. These updates are intended to
improve the overall functionality and stability of the library.
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)). In this
release, the `databricks-sdk` package requirement has been updated to
allow version 0.41.0, which brings new features, improvements, bug fixes,
and API changes. Among the new features are the addition of
`serving.http_request` for calling external functions, and recovery on
download failures in the Files API client. Although the specifics of the
functionality added and changed are not detailed, the focus of this
release appears to be on bug fixes and internal enhancements.
Additionally, the API has undergone changes, including added and altered
methods and fields; however, specific information about these changes
has not been provided in the release notes.
* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)). The
upper bound of the `sqlglot` requirement has been raised in this release:
the package must still be version 25.5.0 or higher, but now less than 26.2
instead of less than 26.1. This change makes sqlglot 26.1 installable. That
version includes several breaking changes, new features, bug fixes, and
modifications to various dialects such as hive, postgres, tsql, and
sqlite. Moreover, the tokenizer has been updated to accept
underscore-separated number literals. However, the specific impact of
these changes on the project is not detailed in the commit message, and
software engineers should thoroughly test and review the changes to
ensure seamless functionality.
* Updated sqlglot requirement from <26.2,>=25.5.0 to >=25.5.0,<26.3
([#3528](#3528)). In this
update, we have modified the version constraint for the `sqlglot`
dependency from `>=25.5.0,<26.2` to `>=25.5.0,<26.3` in the
`pyproject.toml` file. Sqlglot is a Python-based SQL parser and
optimizer, and this change allows us to adopt the latest version of
sqlglot within the specified version range. This update addresses
potential security vulnerabilities and incorporates performance
enhancements and bug fixes, ensuring that our library remains up-to-date
and secure.
* Updated table-migration workflows to also capture updated migration
progress into the history log
([#3239](#3239)). This pull
request updates the table-migration workflows to log not only the tables
that still need to be migrated, but also the progress of the migration.
The affected workflows include `migrate-tables`,
`migrate-external-hiveserde-tables-in-place-experimental`,
`migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`,
and `migrate-tables-in-mounts-experimental`. The encoder for
table-history has been refactored to improve control over when the
`TableMigrationStatus` data is refreshed. The documentation has been
updated to reflect the changes in each workflow. Additionally, both unit
and integration tests have been added and updated to ensure the changes
work as intended and resolve any conflicts. A new
`ProgressTrackingInstallation` class has been added to support this
functionality. The changes have been manually tested and include
modifications to the existing workflows, new methods, and a renamed
method. The `mock_workspace_client` function has been replaced, and the
`external_locations.resolve_mount` method and other methods have not
been called. The `TablesCrawler` object's `snapshot` method has been
called once to retrieve the list of tables in the Hive metastore. The
migration record workflow run is also updated to include the workflow
run information in the `workflow_runs` table. These changes are expected
to improve the accuracy and reliability of the table-migration
workflows.
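The URL-safety fix described for #3508 can be illustrated with a minimal sketch. The base URL, the template name, and the `build_issue_link` helper are hypothetical and chosen only for illustration; the only detail taken from the notes above is the use of `urllib.parse.quote_plus` on the source code before it is placed in the issue body.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

# Hypothetical issue URL and template name, used only for illustration.
ISSUE_URL = "https://github.com/example-org/example-repo/issues/new"


def build_issue_link(source_code: str) -> str:
    """Return an issue link whose query string safely embeds the source code."""
    # quote_plus escapes characters such as '&', '#', '%' and newlines,
    # which would otherwise truncate or corrupt the query string.
    body = quote_plus(f"Source code that failed to parse:\n{source_code}")
    return f"{ISSUE_URL}?template=bug.yml&body={body}"


link = build_issue_link("print('a & b')  # 100%")
# Decoding the query string recovers the original text unchanged.
decoded_body = parse_qs(urlsplit(link).query)["body"][0]
print(decoded_body)
```

Without the quoting step, the `&` and `#` in the snippet would be read as a parameter separator and a fragment marker, so only part of the source code would reach the issue form.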

Dependency updates:

* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)).
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)).
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)).
gueniai added a commit that referenced this pull request Jan 23, 2025
* Implement disposition field in SQL backend ([#3477](#3477)). This commit adds a `query_statement_disposition` configuration option for the SQL backend in the UCX tool, allowing users to specify the disposition of SQL statements during assessment results export and preventing failures when dealing with large workspaces and a large number of findings. The new configuration option is added to the `config.yml` file and used by the `SqlBackend` definition. The `databricks labs install ucx` and `databricks labs ucx export-assessment` commands have been modified to support this new functionality. A new `Disposition` enum has been added to the `databricks.sdk.service.sql` module. This change resolves issue [#3447](#3447) and is related to pull request [#3455](#3455). The functionality has been manually tested.
* AWS role issue with external locations pointing to the root of a storage account ([#3510](#3510)). The `AWSResources` class in the `aws.py` file has been updated to enhance the regular expression pattern for matching S3 bucket names, now including an optional group for trailing slashes and any subsequent characters. This allows for recognition of external locations pointing to the root of a storage account, addressing issue [#3505](#3505). The `access.py` file within the AWS module has also been updated, introducing a new `path` variable and updating a for loop condition to accurately identify missing paths in external locations referencing the root of a storage account. New unit tests have been added to `tests/unit/aws/test_access.py`, including a `test_uc_roles_create_all_roles` method that checks the creation of all possible UC roles when none exist and external locations with and without folders. Additionally, the `backend` fixture has been updated to include a new external location `s3://BUCKET4`, and various tests have been updated to incorporate this location and handle errors appropriately.
* Added assert to make sure installation is finished before re-installation ([#3546](#3546)). In this release, we have added an assertion to ensure that the installation process is completed before attempting to reinstall, addressing a previous issue where the reinstallation was starting before the first installation was finished, causing a warning to not be raised and resulting in a test failure. We have introduced a new function `wait_for_installation_to_finish()`, which retries loading the installation if it is not found, with a timeout of 2 minutes. This function is utilized in the `test_compare_remote_local_install_versions` test to ensure that the installation is finished before proceeding. Furthermore, we have extracted the warning message to a variable `error_message` for better readability. This change enhances the reliability of the installation process.
* Added dashboards to migration progress dashboard ([#3314](#3314)). This commit introduces significant updates to the migration progress dashboard, adding dashboards, linting resources, and modifying existing components. The changes include a new dashboard displaying the number of dashboards pending migration, with the data sourced from the `ucx_catalog.multiworkspace.objects_snapshot` table. The existing 'Migration [main]' dashboard has been updated, and unit and integration tests have been adapted accordingly. The commit also renames several SQL files, updates the percentage UDF, grant, job, cluster, table, and pipeline migration progress queries, and resolves linting compatibility issues related to Unity Catalog. The changes depend on issue [#3424](#3424), progress issue [#3045](#3045), and break up issue [#3112](#3112). The new dashboard aims to enhance the migration process and ensure a smooth transition to the Unity Catalog.
* Added history log encoder for dashboards ([#3424](#3424)). A new history log encoder for dashboards has been added, addressing issues [#3368](#3368) and [#3369](#3369), and modifying the existing `experimental-migration-progress` workflow. This update includes the addition of the `DashboardOwnership` class, used to generate ownership information for dashboards, and the `DashboardProgressEncoder` class, responsible for encoding progress data related to dashboards. The new functionality is tested through manual, unit, and integration testing. In the `Table` class, the `from_table_info` and `from_historical_data` methods have been added, allowing for the creation of `Table` instances from `TableInfo` objects and historical data dictionaries with more flexibility and safety. The `test_tables.py` file in the `integration/progress` directory has also been updated to include a new test function for checking table failures. These changes improve the tracking and management of dashboard IDs, enhance user name retrieval, and ensure the accurate determination of object ownership.
* Create specific failure for Python syntax error while parsing with Astroid ([#3498](#3498)). This commit enhances the Python linting functionality in our open-source library by introducing a specific failure message, `python-parse-error`, for syntax errors encountered during code parsing using Astroid. Previously, a generic `system-error` message was used, which has been renamed to maintain consistency with the existing `sql-parse-error` message. This change provides clearer failure indicators and includes more detailed information about the error location. Additionally, modifications to Python linting-related code, unit test additions, and updates to the README guide users on handling these new error types have been implemented. A new method, `Tree.maybe_parse()`, has been introduced to parse Python code and detect syntax errors, ensuring more precise error handling for users.
* DBR 16 and later support ([#3481](#3481)). This pull request introduces support for Databricks Runtime (DBR) 16 and later in the code that converts Hive Metastore (HMS) tables to external tables within the `migrate-tables` workflow. The changes include the addition of a new static method `_get_entity_storage_locations` to handle the new `entityStorageLocations` property in DBR 16 and the modification of the `_convert_hms_table_to_external` method to account for this property. Additionally, the `run_workflow` function in the `assessment` workflow now has the `skip_job_wait` parameter set to `True`, which allows the workflow to continue running even if a job within it fails. The changes have been manually tested for DBR 16, verified in a staging environment, and existing integration tests have been run for DBR 15. The diff also includes updates to the `test_table_migration_convert_manged_to_external` method to skip job waiting during testing, enabling the test to run successfully on DBR 16.
* Delete stale code: `NotebookLinter._load_source_from_run_cell` ([#3529](#3529)). In this update, we have removed the stale code `NotebookLinter._load_source_from_run_cell`, which was responsible for loading the source code from a run cell in a notebook. This change is a part of the ongoing effort to address issue [#3514](#3514) and enhances the overall codebase. Additionally, we have modified the existing `databricks labs ucx lint-local-code` command to update the code linting functionality. We have conducted manual testing to ensure that the changes function as intended and have added and modified several unit tests. The `_load_source_from_run_cell` method is no longer needed, as it was part of a deprecated functionality. The modifications to the `databricks labs ucx lint-local-code` command impact the way code linting is performed, ultimately improving the efficiency and maintainability of the codebase.
* Exclude ucx dashboards from Lakeview dashboard crawler ([#3450](#3450)). In this release, we have enhanced the `lakeview_crawler` method in the open-source library to exclude UCX dashboards and prevent false positives. This has been achieved by adding a new optional argument, `exclude_dashboard_ids`, to the `__init__` method, which takes a list of dashboard IDs to exclude from the crawler. The `_crawl` method has been updated to skip dashboards whose IDs match the ones in the `exclude_dashboard_ids` list. The change includes unit tests and manual testing to ensure proper functionality and has been verified on the staging environment. These updates improve the accuracy and reliability of the dashboard crawler, providing better results for software engineers utilizing this library.
* Fixed issue in installing UCX on UC enabled workspace ([#3501](#3501)). This PR introduces changes to the `ClusterPolicyInstaller` class, updating the `spark_version` policy definition from a fixed value to an allowlist with a default value. This resolves an issue where, when UC is enabled on a workspace, the cluster definition takes on `single_user` and `user_isolation` values instead of `Legacy_Single_User` and `Legacy_Table_ACL`. The job definition is also updated to use the default value when not explicitly provided. These changes improve compatibility with UC-enabled workspaces, ensuring the correct values for `spark_version` in the cluster definition. The PR includes updates to unit tests and installation tests, addressing issue [#3420](#3420).
* Fixed typo in workflow name (in error message) ([#3491](#3491)). This PR fixes a minor typo in the error message raised by the `validate_groups_permissions` method in the `workflows.py` file. The workflow name mentioned in the error message was misspelled, with `groups` written as `group`. The corrected name is `validate-groups-permissions`. This change does not introduce any new methods or modify any existing functionality; it only makes the error message accurate, so that users are pointed at the right workflow when troubleshooting.
* HMS Federation Glue Support ([#3526](#3526)). This commit introduces support for HMS Federation Glue in the open-source library, resolving issue [#3011](#3011). The changes include adding a new command, `migrate-glue-credentials`, to migrate Glue credentials to UC storage credentials in the federation glue for HMS. The `AWSResourcePermissions` class has been updated to include a new parameter `config` for HMS Federation Glue configuration and the `load_uc_compatible_roles` method now accepts an optional `resource_type` parameter for filtering compatible roles based on the provided type. Additionally, the `ExternalLocations` class has been updated to handle S3 resource type when identifying missing external locations. The commit also includes several bug fixes, new classes, methods, and changes to the existing methods to handle AWS Glue resources, and updates to the integration tests. Overall, these changes add significant functionality for AWS Glue support in the HMS Federation Glue feature.
* Make link to issue template url safe ([#3508](#3508)). In this release, we have updated the `python_ast.py` file to enhance the encoding of the link to the issue template for bug reports. By utilizing the `urllib.parse.quote_plus()` function from Python's standard library, we have ensured that any special characters in the provided source code will be properly encoded. This eliminates the risk of issues arising from incorrectly interpreted characters, enhancing the reliability of the bug reporting process. This change, initially introduced in issue [#3498](#3498), has been thoroughly tested to guarantee its correct functioning. The rest of the file remains unaffected, preserving its original functionality.
* Refactor `PipelineMigrator`'s to add `include_pipeline_ids` ([#3495](#3495)). In this release, the `PipelineMigrator` class has been refactored to enhance pipeline migration functionality. The `skip-pipeline-ids` flag has been replaced with `include-pipeline-ids`, allowing users to specify a list of pipelines to migrate, rather than listing pipelines to skip. Additionally, the `exclude_pipeline_ids` functionality has been added to provide even more granularity in pipeline selection. The `migrate_pipelines` method now prioritizes `include_pipeline_ids` and `exclude_pipeline_ids` parameters to determine the list of pipelines for migration. The `_migrate_pipeline` method has been updated to accept a string pipeline ID and now checks if the pipeline has already been migrated. Several support methods, such as `_clone_pipeline`, have also been refactored for improved functionality. Although no new methods were added, the behavior of the `migrate_pipelines` method has changed. While unit tests have been updated to cover the changes, integration tests have not been modified yet. Ensure thorough testing to prevent any new issues or breaks in existing functionality.
* Release v0.54.0 ([#3530](#3530)). 0.54.0 brings several enhancements and bug fixes to the UCX library. A `query_statement_disposition` option is added to the SQL backend to handle large SQL queries during assessment results export, preventing potential failures in large workspaces with high volumes of findings. AWS role compatibility checks are improved for external locations pointing to the root of a storage account. Dashboards are enhanced with added migration progress dashboards and a history log encoder. New failure types are introduced for Python syntax errors during parsing and SQL parsing errors. The library now supports DBR 16 and later versions, with optional conversion of Hive Metastore tables to external tables in the `migrate-tables` workflow. The `PipelineMigrator` functionality is refactored to add an `include_pipeline_ids` parameter for better control over the migration process. Various dependency updates, including `databricks-labs-blueprint`, `databricks-sdk`, and `sqlglot`, are included in this release, which bring new features, improvements, and bug fixes, as well as API changes. Please thoroughly test and review the changes to ensure seamless functionality.
* Rename Python AST's `Tree` methods for clarity ([#3524](#3524)). In this release, we have made significant improvements to the clarity of the Python AST's `Tree` methods in the `python_analyzer.py` file. The `append_` and `extend_` methods have been renamed to `attach_` to better reflect their functionality. These methods now always return `None`. New methods such as `attach_child_tree`, `attach_nodes`, and `extend_globals` have been introduced to enhance the functionality of the library. The `attach_child_tree` method allows for attaching one tree as a child of another tree, propagating module references and enabling traversal from both the parent and child trees. The `attach_nodes` method sets the parent of the attached nodes and adds them to the body of the tree. Additionally, docstrings have been added, and unit testing has been expanded. The changes include modifications to code linting, existing command functionalities, and manual testing to ensure compatibility. These enhancements improve the clarity, functionality, and flexibility of the Python AST's `Tree` methods.
* Revert "Release v0.54.0" ([#3569](#3569)). In version 0.53.1, we have reverted changes from 0.54.0 to address issues with the previous release and ensure proper propagation to PyPI. This version includes various updates such as implementing a disposition field in the SQL backend, improving ARN pattern matching for AWS roles, adding dashboards to migration progress, enhancing Python linting functionality, and adding support for DBR 16 in converting Hive Metastore tables to external tables. We have also excluded UCX dashboards from the Lakeview dashboard crawler, refactored PipelineMigrator's to add include_pipeline_ids, and updated the sqlglot and databricks-labs-blueprint requirements. Additionally, several issues related to installation, typo in workflow name, and table-migration workflows have been fixed. The sqlglot requirement has been updated from <26.1,>=25.5.0 to >=25.5.0,<26.2, and databricks-labs-blueprint from <0.10,>=0.9.1 to >=0.9.1,<0.11. This release does not introduce any new methods or change existing functionality, but focuses on addressing bugs and improving functionality.
* Schedule the migration progress workflow to run daily ([#3485](#3485)). This PR introduces a daily scheduling mechanism for the UCX installation's migration progress workflow, allowing it to run automatically once per day at 5 a.m. UTC. It includes refactoring the plumbing for managing and installing workflows, enabling them to have a Cron-based schedule. Relevant user documentation has been updated, and unit and integration tests have been added or updated to ensure the changes function as intended. A new test has been added to verify the migration-progress workflow is installed with a schedule attached, checking the workflow schedule's quartz cron expression, time zone, and pause status, as well as confirming that the workflow is unpaused upon installation. The PR also introduces new methods to manage workflow scheduling and configure cron-based schedules.
* Scope crawled pipelines in PipelineCrawler ([#3513](#3513)). In the latest release, we have introduced a new optional argument, `include_pipeline_ids`, in the constructor of the `PipelinesCrawler` class located in the `databricks/labs/ucx/assessment` module. This argument allows users to filter pipelines based on a list of pipeline IDs, improving the crawler's flexibility and efficiency in processing pipelines. In the `_crawl` method of the `PipelinesCrawler` class, a new behavior has been implemented based on the value of `include_pipeline_ids`. If the argument is not `None`, then the method uses the pipeline IDs from this list instead of retrieving all pipelines. Additionally, two unit tests have been added to verify the functionality of this new argument and ensure that the crawler handles cases where a pipeline is not found or its specification is missing. A new parameter, `force_refresh`, has also been added to the `snapshot` function. This release aims to provide a more efficient and customizable pipeline crawling experience for users.
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to >=0.9.1,<0.11 ([#3519](#3519)). In this update, the requirement for the `databricks-labs-blueprint` library has been changed from version range `<0.10,>=0.9.1` to a new range of `>=0.9.1,<0.11`. This change allows for the use of the latest version of the library while maintaining compatibility with the current project setup, and is based on information from the library's releases and changelog. The commit includes a list of commits and dependencies for the updated library. This update was automatically implemented by Dependabot, a tool that handles dependency updates and conflict resolution, ensuring a seamless integration process for engineers adopting the project.
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42 ([#3553](#3553)). In this release, we have updated the `databricks-sdk` package requirement to permit version 0.41 while excluding version 0.42. This update includes several improvements and new features in version 0.41, such as the addition of the `serving.http_request` method for calling external functions and enhancements to the Files API client to recover from download failures. The commit also includes bug fixes, internal changes, and updates to the API for better functionality and compatibility. It is essential to note that these changes have been made to ensure compatibility with the latest features and improvements in the `databricks-sdk` package.
* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2 ([#3500](#3500)). In this release, we have updated the version requirement for the sqlglot package. The allowed range is now 25.5.0 or higher and less than 26.2; previously it was 25.5.0 or higher and less than 26.1. This change allows for the most recent version of sqlglot to be installed, while still maintaining compatibility with the current codebase. The newly permitted version 26.1.0 of sqlglot introduces breaking changes, including normalizing before qualifying tables, requiring the `AS` token in CTEs for all dialects except spark and databricks, supporting Unicode in sqlite, mysql, tsql, postgres, and oracle, parsing ASCII into Unicode to facilitate transpilation, and improving transpilation of CHAR[ACTER]_LENGTH. Additionally, that version includes several bug fixes and new features.
* Updated sqlglot requirement from <26.2,>=25.5.0 to >=25.5.0,<26.3 ([#3528](#3528)). In this release, we have updated the version constraint for the `sqlglot` dependency in our project's `pyproject.toml` file. The previous constraint allowed versions from 25.5.0 up to, but not including, 26.2, while the new constraint allows versions from 25.5.0 up to, but not including, 26.3. This change was made to ensure that we can use the latest version of sqlglot while also preventing the version from reaching 26.3. Additionally, the commit includes detailed information about the specific commits and changes made in the updated version of sqlglot, providing valuable insights for software engineers working with this open-source library.
* Updated table-migration workflows to also capture updated migration progress into the history log ([#3239](#3239)). The table-migration workflows have been updated to log not only the tables that still need to be migrated, but also the updated progress information into the history log, ensuring a more comprehensive record of migration progress. The affected workflows include `migrate-tables`, `migrate-external-hiveserde-tables-in-place-experimental`, `migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`, and `migrate-tables-in-mounts-experimental`. The encoder for table-history has been updated to prevent implicit refresh of `TableMigrationStatus` data during initialization. Additionally, the documentation has been updated to reflect which workflows update which tables. New and updated unit and integration tests, as well as manual testing, have been conducted to ensure the functionality of the changes.
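The daily schedule described for #3485 (a quartz cron expression, a time zone, and an unpaused status) can be sketched as follows. The `WorkflowSchedule` dataclass and the `describe` helper are hypothetical stand-ins written for this illustration; they are not the classes used by UCX or the Databricks SDK. The cron expression assumes the standard Quartz field order of seconds, minutes, hours, day-of-month, month, day-of-week.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSchedule:
    """Hypothetical stand-in for a schedule attached to a workflow."""

    quartz_cron_expression: str
    timezone_id: str = "UTC"
    paused: bool = False  # an installed workflow is expected to be unpaused


# Quartz fields: seconds minutes hours day-of-month month day-of-week.
# "0 0 5 * * ?" fires once per day at 05:00:00.
DAILY_AT_5AM_UTC = WorkflowSchedule(quartz_cron_expression="0 0 5 * * ?")


def describe(schedule: WorkflowSchedule) -> str:
    """Render the time of day, time zone and pause status of a daily schedule."""
    seconds, minutes, hours, *_ = schedule.quartz_cron_expression.split()
    state = "paused" if schedule.paused else "unpaused"
    time_of_day = f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"
    return f"{time_of_day} {schedule.timezone_id} ({state})"


print(describe(DAILY_AT_5AM_UTC))  # 05:00:00 UTC (unpaused)
```

A test like the one mentioned above would check the same three properties on the installed job: the cron expression, the time zone, and that the schedule is not paused.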

Dependency updates:

 * Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2 ([#3500](#3500)).
 * Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to >=0.9.1,<0.11 ([#3519](#3519)).
 * Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42 ([#3553](#3553)).
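To make the effect of the widened ranges concrete, here is a minimal sketch that checks versions against a `>=lower,<upper` range. The `parse_version` and `satisfies` helpers are hypothetical and handle only plain numeric versions such as `26.1.0`; real tooling would typically rely on the `packaging` library, which also understands pre-releases and other specifier forms.

```python
from __future__ import annotations


def parse_version(text: str, width: int = 3) -> tuple[int, ...]:
    """Turn '26.1' into (26, 1, 0) so versions compare component by component."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# The old sqlglot range excluded 26.1.0; the new range admits it.
print(satisfies("26.1.0", "25.5.0", "26.1"))  # False
print(satisfies("26.1.0", "25.5.0", "26.2"))  # True
# The upper bound is exclusive, so 26.2.0 itself is still rejected.
print(satisfies("26.2.0", "25.5.0", "26.2"))  # False
```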
@gueniai gueniai mentioned this pull request Jan 23, 2025
gueniai added a commit that referenced this pull request Jan 23, 2025
* Implement disposition field in SQL backend
([#3477](#3477)). This
commit adds a `query_statement_disposition` configuration option for the
SQL backend in the UCX tool, allowing users to specify the disposition
of SQL statements during assessment results export and preventing
failures when dealing with large workspaces and a large number of
findings. The new configuration option is added to the `config.yml` file
and used by the `SqlBackend` definition. The `databricks labs install
ucx` and `databricks labs ucx export-assessment` commands have been
modified to support this new functionality. A new `Disposition` enum has
been added to the `databricks.sdk.service.sql` module. This change
resolves issue
[#3447](#3447) and is
related to pull request
[#3455](#3455). The
functionality has been manually tested.
* AWS role issue with external locations pointing to the root of a
storage account
([#3510](#3510)). The
`AWSResources` class in the `aws.py` file has been updated to enhance
the regular expression pattern for matching S3 bucket names, now
including an optional group for trailing slashes and any subsequent
characters. This allows for recognition of external locations pointing
to the root of a storage account, addressing issue
[#3505](#3505). The
`access.py` file within the AWS module has also been updated,
introducing a new `path` variable and updating a for loop condition to
accurately identify missing paths in external locations referencing the
root of a storage account. New unit tests have been added to
`tests/unit/aws/test_access.py`, including a
`test_uc_roles_create_all_roles` method that checks the creation of all
possible UC roles when none exist and external locations with and
without folders. Additionally, the `backend` fixture has been updated to
include a new external location `s3://BUCKET4`, and various tests have
been updated to incorporate this location and handle errors
appropriately.
* Added assert to make sure installation is finished before
re-installation
([#3546](#3546)). In this
release, we have added an assertion to ensure that the installation
process is completed before attempting to reinstall, addressing a
previous issue where the reinstallation was starting before the first
installation was finished, causing a warning to not be raised and
resulting in a test failure. We have introduced a new function
`wait_for_installation_to_finish()`, which retries loading the
installation if it is not found, with a timeout of 2 minutes. This
function is utilized in the `test_compare_remote_local_install_versions`
test to ensure that the installation is finished before proceeding.
Furthermore, we have extracted the warning message to a variable
`error_message` for better readability. This change enhances the
reliability of the installation process.
* Added dashboards to migration progress dashboard
([#3314](#3314)). This
commit introduces significant updates to the migration progress
dashboard, adding dashboards, linting resources, and modifying existing
components. The changes include a new dashboard displaying the number of
dashboards pending migration, with the data sourced from the
`ucx_catalog.multiworkspace.objects_snapshot` table. The existing
'Migration [main]' dashboard has been updated, and unit and integration
tests have been adapted accordingly. The commit also renames several SQL
files, updates the percentage UDF, grant, job, cluster, table, and
pipeline migration progress queries, and resolves linting compatibility
issues related to Unity Catalog. The changes depend on issue
[#3424](#3424), progress
issue [#3045](#3045), and
break up issue
[#3112](#3112). The new
dashboard aims to enhance the migration process and ensure a smooth
transition to the Unity Catalog.
* Added history log encoder for dashboards
([#3424](#3424)). A new
history log encoder for dashboards has been added, addressing issues
[#3368](#3368) and
[#3369](#3369), and
modifying the existing `migration-progress-experimental` workflow. This
update includes the addition of the `DashboardOwnership` class, used to
generate ownership information for dashboards, and the
`DashboardProgressEncoder` class, responsible for encoding progress data
related to dashboards. The new functionality is tested through manual,
unit, and integration testing. In the `Table` class, the
`from_table_info` and `from_historical_data` methods have been added,
allowing for the creation of `Table` instances from `TableInfo` objects
and historical data dictionaries with more flexibility and safety. The
`test_tables.py` file in the `integration/progress` directory has also
been updated to include a new test function for checking table failures.
These changes improve the tracking and management of dashboard IDs,
enhance user name retrieval, and ensure the accurate determination of
object ownership.
* Create specific failure for Python syntax error while parsing with
Astroid ([#3498](#3498)).
This commit enhances the Python linting functionality in our open-source
library by introducing a specific failure message, `python-parse-error`,
for syntax errors encountered during code parsing using Astroid.
Previously, a generic `system-error` message was used, which has been
renamed to maintain consistency with the existing `sql-parse-error`
message. This change provides clearer failure indicators and includes
more detailed information about the error location. Additionally,
modifications to Python linting-related code, unit test additions, and
updates to the README guide users on handling these new error types have
been implemented. A new method, `Tree.maybe_parse()`, has been
introduced to parse Python code and detect syntax errors, ensuring more
precise error handling for users.
* DBR 16 and later support
([#3481](#3481)). This pull
request introduces support for Databricks Runtime (DBR) 16 and later in
the code that converts Hive Metastore (HMS) tables to external tables
within the `migrate-tables` workflow. The changes include the addition
of a new static method `_get_entity_storage_locations` to handle the new
`entityStorageLocations` property in DBR16 and the modification of the
`_convert_hms_table_to_external` method to account for this property.
Additionally, the `run_workflow` function in the `assessment` workflow
now has the `skip_job_wait` parameter set to `True`, which allows the
workflow to continue running even if a job within it fails. The changes
have been manually tested for DBR16, verified in a staging environment,
and existing integration tests have been run for DBR 15. The diff also
includes updates to the
`test_table_migration_convert_manged_to_external` method to skip job
waiting during testing, enabling the test to run successfully on DBR 16.
* Delete stale code: `NotebookLinter._load_source_from_run_cell`
([#3529](#3529)). In this
update, we have removed the stale code
`NotebookLinter._load_source_from_run_cell`, which was responsible for
loading the source code from a run cell in a notebook. This change is a
part of the ongoing effort to address issue
[#3514](#3514) and enhances
the overall codebase. Additionally, we have modified the existing
`databricks labs ucx lint-local-code` command to update the code linting
functionality. We have conducted manual testing to ensure that the
changes function as intended and have added and modified several unit
tests. The `_load_source_from_run_cell` method is no longer needed, as
it was part of a deprecated functionality. The modifications to the
`databricks labs ucx lint-local-code` command impact the way code
linting is performed, ultimately improving the efficiency and
maintainability of the codebase.
* Exclude ucx dashboards from Lakeview dashboard crawler
([#3450](#3450)). In this
release, we have enhanced the `lakeview_crawler` method in the
open-source library to exclude UCX dashboards and prevent false
positives. This has been achieved by adding a new optional argument,
`exclude_dashboard_ids`, to the `__init__` method, which takes a list of
dashboard IDs to exclude from the crawler. The `_crawl` method has been
updated to skip dashboards whose IDs match the ones in the
`exclude_dashboard_ids` list. The change includes unit tests and manual
testing to ensure proper functionality and has been verified on the
staging environment. These updates improve the accuracy and reliability
of the dashboard crawler, providing better results for software
engineers utilizing this library.
* Fixed issue in installing UCX on UC enabled workspace
([#3501](#3501)). This PR
introduces changes to the `ClusterPolicyInstaller` class, updating the
`spark_version` policy definition from a fixed value to an allowlist
with a default value. This resolves an issue where, when UC is enabled
on a workspace, the cluster definition takes on `single_user` and
`user_isolation` values instead of `Legacy_Single_User` and
`Legacy_Table_ACL`. The job definition is also updated to use the
default value when not explicitly provided. These changes improve
compatibility with UC-enabled workspaces, ensuring the correct values
for `spark_version` in the cluster definition. The PR includes updates
to unit tests and installation tests, addressing issue
[#3420](#3420).
* Fixed typo in workflow name (in error message)
([#3491](#3491)). This PR
fixes a typo in the error message displayed by the
`validate_groups_permissions` method in the `workflows.py` file. The
workflow name mentioned in the message was misspelled; the corrected
name is `validate-groups-permissions`. The change introduces no new
methods and does not modify existing functionality. It only makes the
error message accurate, so that users can identify the right workflow
when troubleshooting.
* HMS Federation Glue Support
([#3526](#3526)). This
commit introduces support for HMS Federation Glue in the open-source
library, resolving issue
[#3011](#3011). The changes
include adding a new command, `migrate-glue-credentials`, to migrate
Glue credentials to UC storage credentials in the federation glue for
HMS. The `AWSResourcePermissions` class has been updated to include a
new parameter `config` for HMS Federation Glue configuration and the
`load_uc_compatible_roles` method now accepts an optional
`resource_type` parameter for filtering compatible roles based on the
provided type. Additionally, the `ExternalLocations` class has been
updated to handle S3 resource type when identifying missing external
locations. The commit also includes several bug fixes, new classes,
methods, and changes to the existing methods to handle AWS Glue
resources, and updates to the integration tests. Overall, these changes
add significant functionality for AWS Glue support in the HMS Federation
Glue feature.
* Make link to issue template url safe
([#3508](#3508)). In this
release, we have updated the `python_ast.py` file to enhance the
encoding of the link to the issue template for bug reports. By utilizing
the `urllib.parse.quote_plus()` function from Python's standard library,
we have ensured that any special characters in the provided source code
will be properly encoded. This eliminates the risk of issues arising
from incorrectly interpreted characters, enhancing the reliability of
the bug reporting process. This change, initially introduced in issue
[#3498](#3498), has been
thoroughly tested to guarantee its correct functioning. The rest of the
file remains unaffected, preserving its original functionality.
* Refactor `PipelineMigrator`'s to add `include_pipeline_ids`
([#3495](#3495)). In this
release, the `PipelineMigrator` class has been refactored to enhance
pipeline migration functionality. The `skip-pipeline-ids` flag has been
replaced with `include-pipeline-ids`, allowing users to specify a list
of pipelines to migrate, rather than listing pipelines to skip.
Additionally, the `exclude_pipeline_ids` functionality has been added to
provide even more granularity in pipeline selection. The
`migrate_pipelines` method now prioritizes `include_pipeline_ids` and
`exclude_pipeline_ids` parameters to determine the list of pipelines for
migration. The `_migrate_pipeline` method has been updated to accept a
string pipeline ID and now checks if the pipeline has already been
migrated. Several support methods, such as `_clone_pipeline`, have also
been refactored for improved functionality. Although no new methods were
added, the behavior of the `migrate_pipelines` method has changed. While
unit tests have been updated to cover the changes, integration tests
have not been modified yet. Ensure thorough testing to prevent any new
issues or breaks in existing functionality.
* Release v0.54.0
([#3530](#3530)). 0.54.0
brings several enhancements and bug fixes to the UCX library. A
`query_statement_disposition` option is added to the SQL backend to
handle large SQL queries during assessment results export, preventing
potential failures in large workspaces with high volumes of findings.
AWS role compatibility checks are improved for external locations
pointing to the root of a storage account. Dashboards are enhanced with
added migration progress dashboards and a history log encoder. New
failure types are introduced for Python syntax errors during parsing and
SQL parsing errors. The library now supports DBR 16 and later versions,
with optional conversion of Hive Metastore tables to external tables in
the `migrate-tables` workflow. The `PipelineMigrator` functionality is
refactored to add an `include_pipeline_ids` parameter for better control
over the migration process. Various dependency updates, including
`databricks-labs-blueprint`, `databricks-sdk`, and `sqlglot`, are
included in this release, which bring new features, improvements, and
bug fixes, as well as API changes. Please thoroughly test and review the
changes to ensure seamless functionality.
* Rename Python AST's `Tree` methods for clarity
([#3524](#3524)). In this
release, we have made significant improvements to the clarity of the
Python AST's `Tree` methods in the `python_analyzer.py` file. The
`append_` and `extend_` methods have been renamed to `attach_` to better
reflect their functionality. These methods now always return `None`. New
methods such as `attach_child_tree`, `attach_nodes`, and
`extend_globals` have been introduced to enhance the functionality of
the library. The `attach_child_tree` method allows for attaching one
tree as a child of another tree, propagating module references and
enabling traversal from both the parent and child trees. The
`attach_nodes` method sets the parent of the attached nodes and adds
them to the body of the tree. Additionally, docstrings have been added,
and unit testing has been expanded. The changes include modifications to
code linting, existing command functionalities, and manual testing to
ensure compatibility. These enhancements improve the clarity,
functionality, and flexibility of the Python AST's `Tree` methods.
* Revert "Release v0.54.0"
([#3569](#3569)). In version
0.53.1, we have reverted changes from 0.54.0 to address issues with the
previous release and ensure proper propagation to PyPI. This version
includes various updates such as implementing a disposition field in the
SQL backend, improving ARN pattern matching for AWS roles, adding
dashboards to migration progress, enhancing Python linting
functionality, and adding support for DBR 16 in converting Hive
Metastore tables to external tables. We have also excluded UCX
dashboards from the Lakeview dashboard crawler, refactored
`PipelineMigrator` to add `include_pipeline_ids`, and updated the sqlglot
and databricks-labs-blueprint requirements. Additionally, several issues
related to installation, typo in workflow name, and table-migration
workflows have been fixed. The sqlglot requirement has been updated from
<26.1,>=25.5.0 to >=25.5.0,<26.2, and databricks-labs-blueprint from
<0.10,>=0.9.1 to >=0.9.1,<0.11. This release does not introduce any new
methods or change existing functionality, but focuses on addressing bugs
and improving functionality.
* Schedule the migration progress workflow to run daily
([#3485](#3485)). This PR
introduces a daily scheduling mechanism for the UCX installation's
migration progress workflow, allowing it to run automatically once per
day at 5 a.m. UTC. It includes refactoring the plumbing for managing and
installing workflows, enabling them to have a Cron-based schedule.
Relevant user documentation has been updated, and unit and integration
tests have been added or updated to ensure the changes function as
intended. A new test verifies that the migration-progress workflow is
installed with a schedule attached, checking the schedule's Quartz cron
expression, time zone, and pause status, and confirming that the
workflow is unpaused upon installation. The PR also introduces new
methods to manage workflow scheduling and configure cron-based
schedules.
* Scope crawled pipelines in PipelineCrawler
([#3513](#3513)). In the
latest release, we have introduced a new optional argument,
`include_pipeline_ids`, in the constructor of the `PipelinesCrawler` class
located in the `databricks/labs/ucx/assessment` module. This argument
allows users to filter pipelines based on a list of pipeline IDs,
improving the crawler's flexibility and efficiency in processing
pipelines. In the `_crawl` method of the `PipelinesCrawler` class, a new
behavior has been implemented based on the value of
`include_pipeline_ids`. If the argument is not `None`, then the method
uses the pipeline IDs from this list instead of retrieving all
pipelines. Additionally, two unit tests have been added to verify the
functionality of this new argument and ensure that the crawler handles
cases where a pipeline is not found or its specification is missing. A
new parameter, `force_refresh`, has also been added to the `snapshot`
function. This release aims to provide a more efficient and customizable
pipeline crawling experience for users.
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)). In this
update, the requirement for the `databricks-labs-blueprint` library has
been changed from the version range `<0.10,>=0.9.1` to the new range
`>=0.9.1,<0.11`. This change allows for the use of the latest version of
the library while maintaining compatibility with the current project
setup, and is based on information from the library's releases and
changelog. The commit includes a list of commits and dependencies for
the updated library. This update was automatically implemented by
Dependabot, a tool that handles dependency updates and conflict
resolution, ensuring a seamless integration process for engineers
adopting the project.
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)). In this
release, we have updated the `databricks-sdk` package requirement to
permit version 0.41 while excluding version 0.42. This update includes
several improvements and new features in version 0.41, such as the
addition of the `serving.http_request` method for calling external
functions and enhancements to the Files API client to recover from
download failures. The commit also includes bug fixes, internal changes,
and updates to the API for better functionality and compatibility. It is
essential to note that these changes have been made to ensure
compatibility with the latest features and improvements in the
`databricks-sdk` package.
* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)). In this
release, we have updated the version requirement for the sqlglot
package. The allowed range is now at least 25.5.0 and below 26.2;
previously it was at least 25.5.0 and below 26.1. This change allows for the
most recent version of sqlglot to be installed, while still maintaining
compatibility with the current codebase. The update is necessary due to
breaking changes introduced in version 26.1.0 of sqlglot, including
normalizing before qualifying tables, requiring the `AS` token in CTEs
for all dialects except spark and databricks, supporting Unicode in
sqlite, mysql, tsql, postgres, and oracle, parsing ASCII into Unicode to
facilitate transpilation, and improving transpilation of
CHAR[ACTER]_LENGTH. Additionally, several bug fixes and new features
have been added in this update.
* Updated sqlglot requirement from <26.2,>=25.5.0 to >=25.5.0,<26.3
([#3528](#3528)). In this
release, we have updated the version constraint for the `sqlglot`
dependency in the project's `pyproject.toml` file. The previous
constraint allowed versions from 25.5.0 up to, but not including, 26.2;
the new constraint allows versions from 25.5.0 up to, but not including,
26.3. This lets the project use the 26.2.x releases of sqlglot while
excluding 26.3 and later. The commit also lists the commits and changes
included in the updated version of sqlglot.
* Updated table-migration workflows to also capture updated migration
progress into the history log
([#3239](#3239)). The
table-migration workflows have been updated to log not only the tables
that still need to be migrated, but also the updated progress
information into the history log, ensuring a more comprehensive record
of migration progress. The affected workflows include `migrate-tables`,
`migrate-external-hiveserde-tables-in-place-experimental`,
`migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`,
and `migrate-tables-in-mounts-experimental`. The encoder for
table-history has been updated to prevent implicit refresh of
`TableMigrationStatus` data during initialization. Additionally, the
documentation has been updated to reflect which workflows update which
tables. New and updated unit and integration tests, as well as manual
testing, have been conducted to ensure the functionality of the changes.
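
The daily schedule described for [#3485](#3485) above can be pictured with a small sketch. It assumes that a run at 5 a.m. UTC is written as the Quartz expression `0 0 5 * * ?`; the release notes do not state the exact string UCX uses, so treat it as an assumption. `DailySchedule` and `next_daily_run` are hypothetical names for illustration, not UCX or Databricks SDK classes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Quartz cron fields: second, minute, hour, day-of-month, month, day-of-week.
# "0 0 5 * * ?" fires at 05:00:00 every day. Whether UCX uses this exact
# expression is an assumption made for this sketch.
DAILY_AT_FIVE = "0 0 5 * * ?"


@dataclass(frozen=True)
class DailySchedule:
    """Hypothetical stand-in for the three properties the new test checks."""

    quartz_cron_expression: str = DAILY_AT_FIVE
    timezone_id: str = "UTC"
    paused: bool = False  # the workflow is expected to be unpaused after install


def next_daily_run(now: datetime, hour: int = 5) -> datetime:
    """Return the first hh:00 UTC instant strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = now_utc.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now_utc:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

For example, a call with 04:59 UTC on 9 January 2025 yields 05:00 UTC the same day, while a call at exactly 05:00 UTC yields 05:00 UTC on 10 January.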

Dependency updates:

* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)).
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)).
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)).
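
Among the changes listed above, [#3508](#3508) makes the link to the issue template URL safe with `urllib.parse.quote_plus()`. The sketch below shows the general technique only: the base URL, the `title` and `body` parameter names, and the `build_issue_link` helper are assumptions for illustration and are not taken from `python_ast.py`.

```python
from urllib.parse import quote_plus

# Hypothetical base URL, used only for this illustration.
ISSUE_URL = "https://example.com/issues/new"


def build_issue_link(title: str, source_code: str) -> str:
    """Build a link whose query values are percent-encoded.

    quote_plus() replaces spaces with "+" and escapes characters such as
    "&", "=" and newlines, so source code cannot break the query string.
    """
    query = f"title={quote_plus(title)}&body={quote_plus(source_code)}"
    return f"{ISSUE_URL}?{query}"
```

Without the encoding, an `&` or `=` inside the offending source code would be read as a query-string delimiter and truncate the reported snippet.
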
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
Status: Done
3 participants