
[TECH DEBT]: Rewrite PipelinesMigrator.skip_pipeline_ids to an include_pipeline_ids #3492

Closed · 1 task done
JCZuurmond opened this issue Jan 7, 2025 · 1 comment · Fixed by #3495

Labels: enhancement (New feature or request), needs-triage

Comments

@JCZuurmond (Member)

Is there an existing issue for this?

- [x] I have searched the existing issues

Problem statement

The PipelinesMigrator cannot be scoped in the integration tests, so it currently migrates many pipelines, which is slow and flaky.

Proposed Solution

Rewrite PipelinesMigrator's skip_pipeline_ids into an include_pipeline_ids so the migrator can be scoped to an explicit list of pipelines.

Additional Context

No response

@pritishpai (Contributor)

Added the PR as per the issue. I am still wondering whether we should keep both flags: end users with many pipelines who only need to skip a couple would otherwise have to pass a long list to include-pipeline-ids instead of a short one to skip-pipeline-ids. We could design it so that only one of the two flags can be used per command run.
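
A minimal sketch of how such mutual exclusion could be enforced, assuming a hypothetical options holder (the class name, fields, and error wording are illustrative, not the shipped API):

```python
from dataclasses import dataclass, field


@dataclass
class PipelineMigrationOptions:
    """Hypothetical holder for the two scoping flags discussed above."""

    include_pipeline_ids: list[str] = field(default_factory=list)
    skip_pipeline_ids: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject ambiguous runs where both scoping flags are supplied.
        if self.include_pipeline_ids and self.skip_pipeline_ids:
            raise ValueError("Use include_pipeline_ids or skip_pipeline_ids, not both")

    def effective_ids(self, all_pipeline_ids: list[str]) -> list[str]:
        """Resolve the set of pipelines a run would actually migrate."""
        if self.include_pipeline_ids:
            return [p for p in all_pipeline_ids if p in self.include_pipeline_ids]
        return [p for p in all_pipeline_ids if p not in self.skip_pipeline_ids]
```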

@pritishpai pritishpai moved this from Todo to Ready for Review in UCX Jan 8, 2025
github-merge-queue bot pushed a commit that referenced this issue Jan 10, 2025
## Changes
Replace the skip-pipeline-ids flag with include-pipeline-ids, which allows
the end user to migrate only the pipelines in the list

### Linked issues
Resolves #3492

### Functionality

- [x] modified flag for migrate-pipelines

### Tests
- [ ] modified unit tests
- [ ] modified integration tests
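
A minimal sketch of the shape this change implies for the migrator, assuming the databricks-sdk `WorkspaceClient` pipelines listing; the private helper names follow the issue discussion but the bodies are illustrative:

```python
from databricks.sdk import WorkspaceClient


class PipelinesMigrator:
    """Sketch: migrate only the pipelines named in include_pipeline_ids (all when None)."""

    def __init__(self, ws: WorkspaceClient, include_pipeline_ids: list[str] | None = None):
        self._ws = ws
        self._include_pipeline_ids = include_pipeline_ids

    def _get_pipelines_to_migrate(self) -> list[str]:
        # List every pipeline in the workspace, then narrow to the requested ids.
        all_ids = [p.pipeline_id for p in self._ws.pipelines.list_pipelines()]
        if self._include_pipeline_ids is None:
            return all_ids
        return [pid for pid in all_ids if pid in self._include_pipeline_ids]

    def migrate_pipelines(self) -> None:
        for pipeline_id in self._get_pipelines_to_migrate():
            self._migrate_pipeline(pipeline_id)

    def _migrate_pipeline(self, pipeline_id: str) -> None:
        ...  # per-pipeline migration, out of scope for this sketch
```
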
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in UCX Jan 10, 2025
gueniai added a commit that referenced this issue Jan 23, 2025
* Implement disposition field in SQL backend
([#3477](#3477)). In this
release, we've added a new `query_statement_disposition` configuration
option for the SQL backend used in the `databricks labs ucx`
command-line interface. This option allows users to choose the
disposition method for running large SQL queries during assessment
results export, preventing failures in cases of large workspaces with
high volumes of findings. The new option is included in the `config.yml`
file and used in the SqlBackend definition. The commit also includes
updates to the `workspace_cli.py` file and addresses issue
[#3447](#3447). The
`disposition` parameter has been added to the
`StatementExecutionBackend` method, and the `Disposition` enum from the
`databricks.sdk.service.sql` module has been added to the `config.py`
file. The changes have been manually tested and are included in the
modified `databricks labs install ucx` and `databricks labs ucx
export-assessment` commands.
* AWS role issue with external locations pointing to the root of a
storage account
([#3510](#3510)). This
release includes a modification to enhance AWS role access for external
locations pointing to the root of a storage account, addressing issue
[#3510](#3510) and closing
issue [#3505](#3505). The
`aws.py` file in the `src/databricks/labs/ucx/assessment/` directory has
been updated to improve S3 bucket ARN pattern matching, now allowing
optional trailing slashes for greater flexibility. In the `access.py`
file within the `aws` directory of the `databricks/labs/ucx` package,
the `_identify_missing_paths` method now checks if the
`role.resource_path` is a parent of the external location path or if
they match exactly, allowing root-level external locations to be
recognized as compatible with AWS roles. A new method,
`AWSUCRoleCandidate`, has been added to the `AWSResources` class, and
several test cases have been updated or added to ensure proper
functionality with UC roles and AWS resources, including handling cases
with multiple role creations.
* Added assert to make sure installation is finished before
re-installation
([#3546](#3546)). In the
latest release, we've addressed an issue where the
reinstallation of a software component was starting before the initial
installation was complete, causing a warning message to be suppressed
and the test to fail. To rectify this, we have enhanced the integration
tests and added an assert to ensure that the installation is finished
before attempting reinstallation. A new function called
`wait_for_installation_to_finish` has been introduced to manage the
waiting process. Furthermore, we have updated the
`test_compare_remote_local_install_versions` function to accept
`installation_ctx` instead of `ws` as a parameter, ensuring proper
configuration and loading of the installation before test execution.
These changes guarantee that the test will pass if the installation is
finished before the reinstallation is attempted.
* Added dashboards to migration progress dashboard
([#3314](#3314)). The
release notes have been updated to reflect the new features and changes
in the migration progress dashboard. This commit includes the addition
of dashboards to track the migration progress, with linting resources
added to ensure code quality. The commit also modifies the existing
dashboard "Migration [main]" and updates both unit and integration
tests. Specific new files and methods have been added to enhance
functionality, including the tracking of dashboard migration, and new
fixtures have been introduced to improve testing. The changes depend on
several other issues and split others up so the work can land
incrementally. Overall,
this commit enhances the migration progress dashboard's capabilities,
making it more efficient and reliable for tracking migration progress.
* Added history log encoder for dashboards
([#3424](#3424)). A history
log encoder for dashboards has been added, addressing issues
[#3368](#3368) and
[#3369](#3369), which
modifies the existing `experimental-migration-progress` workflow. This
enhancement introduces a `DashboardProgressEncoder` class that encodes
Dashboard objects into Historical records, appending inventory snapshots
to the history table. The changes include adding new methods for
handling object types such as directories, and updating the `is_delta`
property of the `Table` class. The commit also includes new tests:
manually tested, unit tests added, and integration tests added.
Specifically, `test_table_progress_encoder_table_failures` has been
updated to include a new parameter, `is_migrated_table`, which, if set
to False, adds `Pending migration` to the list of failures. The
`is_used_table` parameter has been removed together with its
functionality. The changes are tested through manual, unit,
and integration testing, ensuring the proper encoding of migration
progress and identifying relevant failures.
* Create specific failure for Python syntax error while parsing with
Astroid ([#3498](#3498)). In
this release, the Python linting-related code has been updated to
introduce a specific failure type for syntax errors that occur while
parsing code using Astroid. Previously, such errors resulted in a
generic `system-error` message, but with this change, a new failure type
called `python-parse-error` has been introduced. This new error type
includes the start and end line and column numbers of the error and is
accompanied by a new issue URL for reporting the error on the UCX
GitHub. The `system-error` failure type has been renamed to
`python-parse-error` to maintain consistency with the `sql-parse-error`
failure type. Additionally, a new method `Tree.maybe_parse()` has been
introduced to improve error detection and reporting during Python
linting. A unit test has been added to ensure the new failure type is
working as intended, and a generic failure is kept for directing users
to create GitHub issues for surfacing other issues.
* DBR 16 and later support
([#3481](#3481)). This
release adds support for Databricks Runtime (DBR) 16 and later, enabling
the optional conversion of Hive Metastore (HMS) tables to external
tables within the `migrate-tables` workflow. The change includes a new
static method `_get_entity_storage_locations` to check for the presence
of the `entityStorageLocations` property on table metadata. The existing
`_convert_hms_table_to_external` method has been updated to use this new
method and to include the `entityStorageLocations` constructor argument
if present. The changes have been manually tested for DBR 16, tested
with existing integration tests for DBR 15, and verified on the staging
environment with DBR 16. Additionally, the `skip_job_wait=True`
parameter has been added to specific test function calls to improve test
execution time. This release also resolves an issue with a failing test
on DBR 16 caused by a JDK update.
* Delete stale code: `NotebookLinter._load_source_from_run_cell`
([#3529](#3529)). In this
release, we have improved the code linting functionality in the
NotebookLinter class of our open-source library by removing the
`_load_source_from_run_cell` method in the sources.py file. This method,
previously used to load source code from run cells in a notebook, has
been identified as stale code and is no longer required. Consequently,
this change affects the `databricks labs ucx lint-local-code` command
and results in cleaner and more maintainable code. Furthermore, updated
and added unit tests have been included in this commit, which have been
manually tested to ensure that the changes do not adversely impact
existing functionality, thus progressing issue
[#3514](#3514).
* Exclude ucx dashboards from Lakeview dashboard crawler
([#3450](#3450)). In this
release, the functionality of the `assessment` workflow has been
improved to exclude certain dashboard IDs from the Lakeview dashboard
crawler. This change has been made to address the issue of false
positive dashboards and affects the `_crawl` method in the
`dashboards.py` file. The excluded dashboard IDs are now obtained from
the `install_state.dashboards` object. Additionally, new methods have
been added to the `test_dashboards.py` file in the `unit/assessment`
directory to test the exclusion functionality, including a test to
ensure that the exclude parameter takes priority over the include
parameter. The commit also includes unit tests, manual tests, and
screenshots to verify the changes on the staging environment. Overall,
this modification enhances the accuracy of the dashboard crawler and
simplifies the process of identifying and assessing relevant dashboards.
* Fixed issue in installing UCX on UC enabled workspace
([#3501](#3501)). This pull
request introduces changes to the UCX installer to address an issue
([#3420](#3420)) with
installing UCX on UC-enabled workspaces. It updates the UCX policy by
changing the `spark_version` parameter from `fixed` to `allowlist` with
a default value, allowing the cluster definition to take `single_user`
and `user_isolation` values instead of `Legacy_Single_User` and
`Legacy_Table_ACL`. Additionally, the job definition has been updated to
use the default value when not explicitly provided. The changes are
implemented in the `test_policy.py` file and impact the
`test_job_cluster_policy` and `test_job_cluster_on_uc_enabled_workspace`
methods. The pull request also includes updates to unit tests and
integration tests to ensure the correct behavior of the updated UCX
policy and job definition. The target audience is software engineers
adopting this project, with changes involving adjusting policy
definitions and testing job cluster behavior under different
configurations. Issue
[#3501](#3501) is also
resolved with these changes.
* Fixed typo in workflow name (in error message)
([#3491](#3491)). This PR
includes a fix for a minor typo in the error message of the
`validate_groups_permissions` method in the `workflows.py` file. The
typo resulted in the incorrect spelling of `group` as `groups` in the
workflow name. The fix simply changes `groups` to `group` in the error
message, ensuring accurate workflow name display. The functionality of
the code remains unaffected by this change, and no new methods have been
added. To clarify, the `validate_groups_permissions` method verifies
whether group permissions have been migrated correctly, and if not,
raises a ValueError with an error message suggesting the use of the
`validate-group-permissions` workflow for validation after the API has
caught up. This fix resolves the typo issue and maintains the expected
behavior of the code.
* Make link to issue template url safe
([#3508](#3508)). In this
commit, the `_definitely_failure` function in the `python_ast.py` file
has been modified to make the link to the issue template URL safe using
Python's `urllib`. This change ensures that any special characters in
the source code passed to the function will be properly displayed in the
issue template. If the source code cannot be parsed, the function
creates a link to the issue template for reporting a bug in the UCX
library, including the source code as part of the issue body. With this
commit, the source code is now passed through the
`urllib.parse.quote_plus` function before being added to the issue body,
making it url-safe and improving the robustness and user-friendliness of
the library. This change has been introduced in issue
[#3498](#3498) and has been
manually tested (see the URL-quoting sketch after this list).
* Refactor `PipelineMigrator` to add `include_pipeline_ids`
([#3495](#3495)). In this
refactoring, the `PipelineMigrator` has been updated to introduce an
`include_pipeline_ids` option, replacing the previous
`skip_pipeline_ids` flag. This change allows users to specify the list
of pipelines to migrate, providing better control over the migration
process. The `PipelinesMigrator` constructor,
`_get_pipelines_to_migrate`, and `migrate_pipelines` methods have been
modified to accommodate this new flag. The `_migrate_pipeline` method
now accepts the pipeline ID instead of a `PipelineInfo` object.
Additionally, the unit tests have been updated to include the new
`include_flag` parameter, which facilitates testing various scenarios
with different pipeline lists. Although the commit does not show changes
to test files, integration tests should be updated to reflect the new
`include-pipeline-ids` flag functionality. This improvement resolves
issue [#3492](#3492) and
enhances the overall flexibility of the `PipelineMigrator`.
* Rename Python AST's `Tree` methods for clarity
([#3524](#3524)). In this
release, the `Tree` class in the Python AST library has been updated for
improved code clarity and functionality. The `append_` methods have been
renamed to `attach_` for better accuracy, and now include docstrings for
increased understanding. These methods have been updated to always
return `None`. A new method, `attach_child_tree`, has been added,
allowing for traversal from both parent and child and propagating any
module references. Several new methods and functionalities have been
introduced to improve the class, while extensive unit testing has been
conducted to ensure functionality. Additionally, the diff includes test
cases for various functionalities, such as inferring values when
attaching trees and verifying spark module propagation, as well as tests
to ensure that certain operations are not supported. This change, linked
to issues [#3514](#3514) and
[#3520](#3520), may affect
any code that calls these methods and relies on their return values.
However, the added docstrings and unit tests will help ensure your code
continues to function correctly.
* Schedule the migration progress workflow to run daily
([#3485](#3485)). This PR
introduces changes to the UCX installation process to schedule the
migration progress workflow to run automatically once a day, with the
default schedule set to run at 5 a.m. UTC. It includes refactoring the
plumbing used for managing and installing workflows, enabling them to
have a Cron-based schedule. The relevant user documentation has been
updated, and the existing `migration-progress-experimental` workflow has
been modified. Additionally, unit and integration tests have been
added/modified to ensure the proper functioning of the updated code, and
new functions have been added to verify the workflow's schedule and task
detection.
* Scope crawled pipelines in PipelineCrawler
([#3513](#3513)). In this
release, the `PipelineCrawler` class in the `pipelines.py` file has been
updated to include a new optional argument `include_pipeline_ids` in its
constructor. This argument allows users to filter the pipelines that are
crawled by specifying a list of pipeline IDs. The `_crawl` method has
been modified to check if `include_pipeline_ids` is not `None` and to
filter the list of pipelines accordingly. The class now also checks if
each pipeline exists before getting its configuration, and logs a
warning message if the pipeline is not found. Previously, a `NotFound`
exception was raised. Additionally, the code has been updated to use
`pipeline.spec.configuration` instead of
`pipeline_response.spec.configuration` to get the pipeline
configuration. These changes have been tested through new and updated
unit tests, including a test for handling creators' user names. Overall,
these updates provide improved functionality and flexibility for
crawling pipelines (see the filtering sketch at the end of these notes).
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)). In this
release, we have updated the version requirement of the
`databricks-labs-blueprint` package to be greater than or equal to 0.9.1
and less than 0.11. This change allows us to use the latest version of
the package and includes bug fixes and dependency updates. The hosted
runner has been patched in version 0.10.1 to address issues with
publishing artifacts in the release workflow. Release notes for previous
versions are also provided in the commit. These updates are intended to
improve the overall functionality and stability of the library.
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)). In this
release, the `databricks-sdk` package requirement has been updated to
version 0.41.0, which brings new features, improvements, bug fixes, and
API changes. Among the new features are the addition of
'serving.http_request' for calling external functions, and recovery on
download failures in the Files API client. Although the specifics of the
functionality added and changed are not detailed, the focus of this
release appears to be on bug fixes and internal enhancements.
Additionally, the API has undergone changes, including added and altered
methods and fields, however, specific information about these changes
has not been provided in the release notes.
* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)). A critical
update has been implemented in this release for the `sqlglot` package,
which has been updated to version 25.5.0 or higher, but less than 26.2.
This change is essential to leverage the latest version of sqlglot while
avoiding any breaking changes introduced in version 26.1. The new
version includes several breaking changes, new features, bug fixes, and
modifications to various dialects such as hive, postgres, tsql, and
sqlite. Moreover, the tokenizer has been updated to accept
underscore-separated number literals. However, the specific impact of
these changes on the project is not detailed in the commit message, and
software engineers should thoroughly test and review the changes to
ensure seamless functionality.
* Updated sqlglot requirement from <26.2,>=25.5.0 to >=25.5.0,<26.3
([#3528](#3528)). In this
update, we have modified the version constraint for the `sqlglot`
dependency from `>=25.5.0,<26.2` to `>=25.5.0,<26.3` in the
`pyproject.toml` file. Sqlglot is a Python-based SQL parser and
optimizer, and this change allows us to adopt the latest version of
sqlglot within the specified version range. This update addresses
potential security vulnerabilities and incorporates performance
enhancements and bug fixes, ensuring that our library remains up-to-date
and secure.
* Updated table-migration workflows to also capture updated migration
progress into the history log
([#3239](#3239)). This pull
request updates the table-migration workflows to log not only the tables
that still need to be migrated, but also the progress of the migration.
The affected workflows include `migrate-tables`,
`migrate-external-hiveserde-tables-in-place-experimental`,
`migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`,
and `migrate-tables-in-mounts-experimental`. The encoder for
table-history has been refactored to improve control over when the
`TableMigrationStatus` data is refreshed. The documentation has been
updated to reflect the changes in each workflow. Additionally, both unit
and integration tests have been added and updated to ensure the changes
work as intended and resolve any conflicts. A new
`ProgressTrackingInstallation` class has been added to support this
functionality. The changes have been manually tested and include
modifications to the existing workflows, new methods, and a renamed
method. The `mock_workspace_client` function has been replaced, and the
`external_locations.resolve_mount` method and other methods have not
been called. The `TablesCrawler` object's `snapshot` method has been
called once to retrieve the list of tables in the Hive metastore. The
migration record workflow run is also updated to include the workflow
run information in the `workflow_runs` table. These changes are expected
to improve the accuracy and reliability of the table-migration
workflows.
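
A minimal sketch of the URL-quoting step from the `Make link to issue template url safe` entry above; the template URL and query parameter are illustrative assumptions, not the exact UCX implementation:

```python
from urllib.parse import quote_plus

# Hypothetical issue-template URL; the real UCX template link may differ.
ISSUE_TEMPLATE_URL = "https://github.com/databrickslabs/ucx/issues/new"


def issue_link_for(source_code: str) -> str:
    """Build a bug-report link whose body safely embeds arbitrary source code."""
    # quote_plus escapes special characters (' ' -> '+', '&' -> '%26', ...)
    # so the snippet cannot break out of the query string.
    return f"{ISSUE_TEMPLATE_URL}?body={quote_plus(source_code)}"
```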

Dependency updates:

* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)).
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)).
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)).
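
A minimal sketch of the `include_pipeline_ids` filtering described in the `Scope crawled pipelines in PipelineCrawler` entry; the crawler internals are assumptions reconstructed from the notes above:

```python
import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

logger = logging.getLogger(__name__)


class PipelineCrawler:
    """Sketch: crawl pipeline configurations, optionally restricted to given ids."""

    def __init__(self, ws: WorkspaceClient, include_pipeline_ids: list[str] | None = None):
        self._ws = ws
        self._include_pipeline_ids = include_pipeline_ids

    def _crawl(self) -> list[dict]:
        pipeline_ids = [p.pipeline_id for p in self._ws.pipelines.list_pipelines()]
        if self._include_pipeline_ids is not None:
            pipeline_ids = [pid for pid in pipeline_ids if pid in self._include_pipeline_ids]
        configurations = []
        for pid in pipeline_ids:
            try:
                # Check the pipeline still exists before reading its configuration.
                pipeline = self._ws.pipelines.get(pid)
            except NotFound:
                logger.warning(f"Pipeline not found: {pid}")  # log instead of raising
                continue
            spec = pipeline.spec
            configurations.append(spec.configuration if spec and spec.configuration else {})
        return configurations
```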