
[TECH DEBT]: Rewrite PipelinesMigrator.skip_pipeline_ids to an include_pipeline_ids #3492

Closed · 1 task done
JCZuurmond opened this issue Jan 7, 2025 · 1 comment · Fixed by #3495

Labels: enhancement (New feature or request), needs-triage

Comments

@JCZuurmond (Member)

Is there an existing issue for this?

- [x] I have searched the existing issues

Problem statement

The PipelinesMigrator cannot be scoped in the integration tests, so it currently migrates many pipelines, which is slow and flaky.

Proposed Solution

Rewrite PipelinesMigrator's skip_pipeline_ids into an include_pipeline_ids so the migrator can be scoped to an explicit list of pipelines.

Additional Context

No response

@pritishpai (Contributor)

Added the PR as per the issue. I am still wondering whether we should keep both flags: end users with many pipelines who only need to skip a couple would otherwise have to pass a long list to include-pipeline-ids instead of a short one to skip-pipeline-ids. We could design it so that only one of the two flags can be used per command run.
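
A minimal sketch of how such mutual exclusion could be enforced, assuming a hypothetical options holder (the class name, fields, and error wording are illustrative, not the shipped API):

```python
from dataclasses import dataclass, field


@dataclass
class PipelineMigrationOptions:
    """Hypothetical holder for the two scoping flags discussed above."""

    include_pipeline_ids: list[str] = field(default_factory=list)
    skip_pipeline_ids: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject ambiguous runs where both scoping flags are supplied.
        if self.include_pipeline_ids and self.skip_pipeline_ids:
            raise ValueError("Use include_pipeline_ids or skip_pipeline_ids, not both")

    def effective_ids(self, all_pipeline_ids: list[str]) -> list[str]:
        """Resolve the set of pipelines a run would actually migrate."""
        if self.include_pipeline_ids:
            return [p for p in all_pipeline_ids if p in self.include_pipeline_ids]
        return [p for p in all_pipeline_ids if p not in self.skip_pipeline_ids]
```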

@pritishpai pritishpai moved this from Todo to Ready for Review in UCX Jan 8, 2025
github-merge-queue bot pushed a commit that referenced this issue Jan 10, 2025
## Changes
Replace the skip-pipeline-ids flag with include-pipeline-ids, which allows
the end user to migrate only the pipelines in the list

### Linked issues
Resolves #3492

### Functionality

- [x] modified flag for migrate-pipelines

### Tests
- [ ] modified unit tests
- [ ] modified integration tests
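
A minimal sketch of the shape this change implies for the migrator, assuming the databricks-sdk `WorkspaceClient` pipelines listing; the private helper names follow the issue discussion but the bodies are illustrative:

```python
from databricks.sdk import WorkspaceClient


class PipelinesMigrator:
    """Sketch: migrate only the pipelines named in include_pipeline_ids (all when None)."""

    def __init__(self, ws: WorkspaceClient, include_pipeline_ids: list[str] | None = None):
        self._ws = ws
        self._include_pipeline_ids = include_pipeline_ids

    def _get_pipelines_to_migrate(self) -> list[str]:
        # List every pipeline in the workspace, then narrow to the requested ids.
        all_ids = [p.pipeline_id for p in self._ws.pipelines.list_pipelines()]
        if self._include_pipeline_ids is None:
            return all_ids
        return [pid for pid in all_ids if pid in self._include_pipeline_ids]

    def migrate_pipelines(self) -> None:
        for pipeline_id in self._get_pipelines_to_migrate():
            self._migrate_pipeline(pipeline_id)

    def _migrate_pipeline(self, pipeline_id: str) -> None:
        ...  # per-pipeline migration, out of scope for this sketch
```
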
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in UCX Jan 10, 2025
gueniai added a commit that referenced this issue Jan 23, 2025
* Implement disposition field in SQL backend
([#3477](#3477)). In this
release, we've added a new `query_statement_disposition` configuration
option for the SQL backend used in the `databricks labs ucx`
command-line interface. This option allows users to choose the
disposition method for running large SQL queries during assessment
results export, preventing failures in cases of large workspaces with
high volumes of findings. The new option is included in the `config.yml`
file and used in the SqlBackend definition. The commit also includes
updates to the `workspace_cli.py` file and addresses issue
[#3447](#3447). The
`disposition` parameter has been added to the
`StatementExecutionBackend` method, and the `Disposition` enum from the
`databricks.sdk.service.sql` module has been added to the `config.py`
file. The changes have been manually tested and are included in the
modified `databricks labs install ucx` and `databricks labs ucx
export-assessment` commands.
* AWS role issue with external locations pointing to the root of a
storage account
([#3510](#3510)). This
release includes a modification to enhance AWS role access for external
locations pointing to the root of a storage account, addressing issue
[#3510](#3510) and closing
issue [#3505](#3505). The
`aws.py` file in the `src/databricks/labs/ucx/assessment/` directory has
been updated to improve S3 bucket ARN pattern matching, now allowing
optional trailing slashes for greater flexibility. In the `access.py`
file within the `aws` directory of the `databricks/labs/ucx` package,
the `_identify_missing_paths` method now checks if the
`role.resource_path` is a parent of the external location path or if
they match exactly, allowing root-level external locations to be
recognized as compatible with AWS roles. A new method,
`AWSUCRoleCandidate`, has been added to the `AWSResources` class, and
several test cases have been updated or added to ensure proper
functionality with UC roles and AWS resources, including handling cases
with multiple role creations.
* Added assert to make sure installation is finished before
re-installation
([#3546](#3546)). In the
latest release, we've addressed an issue where the
reinstallation of a software component was starting before the initial
installation was complete, causing a warning message to be suppressed
and the test to fail. To rectify this, we have enhanced the integration
tests and added an assert to ensure that the installation is finished
before attempting reinstallation. A new function called
`wait_for_installation_to_finish` has been introduced to manage the
waiting process. Furthermore, we have updated the
`test_compare_remote_local_install_versions` function to accept
`installation_ctx` instead of `ws` as a parameter, ensuring proper
configuration and loading of the installation before test execution.
These changes guarantee that the test will pass if the installation is
finished before the reinstallation is attempted.
* Added dashboards to migration progress dashboard
([#3314](#3314)). The
release notes have been updated to reflect the new features and changes
in the migration progress dashboard. This commit includes the addition
of dashboards to track the migration progress, with linting resources
added to ensure code quality. The commit also modifies the existing
dashboard "Migration [main]" and updates both unit and integration
tests. Specific new files and methods have been added to enhance
functionality, including the tracking of dashboard migration, and new
fixtures have been introduced to improve testing. The changes depend on
several other issues and split others up so the work can land
incrementally. Overall,
this commit enhances the migration progress dashboard's capabilities,
making it more efficient and reliable for tracking migration progress.
* Added history log encoder for dashboards
([#3424](#3424)). A history
log encoder for dashboards has been added, addressing issues
[#3368](#3368) and
[#3369](#3369), which
modifies the existing `experimental-migration-progress` workflow. This
enhancement introduces a `DashboardProgressEncoder` class that encodes
Dashboard objects into Historical records, appending inventory snapshots
to the history table. The changes include adding new methods for
handling object types such as directories, and updating the `is_delta`
property of the `Table` class. The commit also includes new tests:
manually tested, unit tests added, and integration tests added.
Specifically, `test_table_progress_encoder_table_failures` has been
updated to include a new parameter, `is_migrated_table`, which, if set
to False, adds `Pending migration` to the list of failures. The
`is_used_table` parameter has been removed together with its
functionality. The changes are tested through manual, unit,
and integration testing, ensuring the proper encoding of migration
progress and identifying relevant failures.
* Create specific failure for Python syntax error while parsing with
Astroid ([#3498](#3498)). In
this release, the Python linting-related code has been updated to
introduce a specific failure type for syntax errors that occur while
parsing code using Astroid. Previously, such errors resulted in a
generic `system-error` message, but with this change, a new failure type
called `python-parse-error` has been introduced. This new error type
includes the start and end line and column numbers of the error and is
accompanied by a new issue URL for reporting the error on the UCX
GitHub. The `system-error` failure type has been renamed to
`python-parse-error` to maintain consistency with the `sql-parse-error`
failure type. Additionally, a new method `Tree.maybe_parse()` has been
introduced to improve error detection and reporting during Python
linting. A unit test has been added to ensure the new failure type is
working as intended, and a generic failure is kept for directing users
to create GitHub issues for surfacing other issues.
* DBR 16 and later support
([#3481](#3481)). This
release adds support for Databricks Runtime (DBR) 16 and later, enabling
the optional conversion of Hive Metastore (HMS) tables to external
tables within the `migrate-tables` workflow. The change includes a new
static method `_get_entity_storage_locations` to check for the presence
of the `entityStorageLocations` property on table metadata. The existing
`_convert_hms_table_to_external` method has been updated to use this new
method and to include the `entityStorageLocations` constructor argument
if present. The changes have been manually tested for DBR 16, tested
with existing integration tests for DBR 15, and verified on the staging
environment with DBR 16. Additionally, the `skip_job_wait=True`
parameter has been added to specific test function calls to improve test
execution time. This release also resolves an issue with a failing test
on DBR 16 caused by a JDK update.
* Delete stale code: `NotebookLinter._load_source_from_run_cell`
([#3529](#3529)). In this
release, we have improved the code linting functionality in the
NotebookLinter class of our open-source library by removing the
`_load_source_from_run_cell` method in the sources.py file. This method,
previously used to load source code from run cells in a notebook, has
been identified as stale code and is no longer required. Consequently,
this change affects the `databricks labs ucx lint-local-code` command
and results in cleaner and more maintainable code. Furthermore, updated
and added unit tests have been included in this commit, which have been
manually tested to ensure that the changes do not adversely impact
existing functionality, thus progressing issue
[#3514](#3514).
* Exclude ucx dashboards from Lakeview dashboard crawler
([#3450](#3450)). In this
release, the functionality of the `assessment` workflow has been
improved to exclude certain dashboard IDs from the Lakeview dashboard
crawler. This change has been made to address the issue of false
positive dashboards and affects the `_crawl` method in the
`dashboards.py` file. The excluded dashboard IDs are now obtained from
the `install_state.dashboards` object. Additionally, new methods have
been added to the `test_dashboards.py` file in the `unit/assessment`
directory to test the exclusion functionality, including a test to
ensure that the exclude parameter takes priority over the include
parameter. The commit also includes unit tests, manual tests, and
screenshots to verify the changes on the staging environment. Overall,
this modification enhances the accuracy of the dashboard crawler and
simplifies the process of identifying and assessing relevant dashboards.
* Fixed issue in installing UCX on UC enabled workspace
([#3501](#3501)). This pull
request introduces changes to the UCX installer to address an issue
([#3420](#3420)) with
installing UCX on UC-enabled workspaces. It updates the UCX policy by
changing the `spark_version` parameter from `fixed` to `allowlist` with
a default value, allowing the cluster definition to take `single_user`
and `user_isolation` values instead of `Legacy_Single_User` and
`Legacy_Table_ACL`. Additionally, the job definition has been updated to
use the default value when not explicitly provided. The changes are
implemented in the `test_policy.py` file and impact the
`test_job_cluster_policy` and `test_job_cluster_on_uc_enabled_workspace`
methods. The pull request also includes updates to unit tests and
integration tests to ensure the correct behavior of the updated UCX
policy and job definition. The target audience is software engineers
adopting this project, with changes involving adjusting policy
definitions and testing job cluster behavior under different
configurations. Issue
[#3501](#3501) is also
resolved with these changes.
* Fixed typo in workflow name (in error message)
([#3491](#3491)). This PR
includes a fix for a minor typo in the error message of the
`validate_groups_permissions` method in the `workflows.py` file. The
typo resulted in the incorrect spelling of `group` as `groups` in the
workflow name. The fix simply changes `groups` to `group` in the error
message, ensuring accurate workflow name display. The functionality of
the code remains unaffected by this change, and no new methods have been
added. To clarify, the `validate_groups_permissions` method verifies
whether group permissions have been migrated correctly, and if not,
raises a ValueError with an error message suggesting the use of the
`validate-group-permissions` workflow for validation after the API has
caught up. This fix resolves the typo issue and maintains the expected
behavior of the code.
* Make link to issue template url safe
([#3508](#3508)). In this
commit, the `_definitely_failure` function in the `python_ast.py` file
has been modified to make the link to the issue template URL safe using
Python's `urllib`. This change ensures that any special characters in
the source code passed to the function will be properly displayed in the
issue template. If the source code cannot be parsed, the function
creates a link to the issue template for reporting a bug in the UCX
library, including the source code as part of the issue body. With this
commit, the source code is now passed through the
`urllib.parse.quote_plus` function before being added to the issue body,
making it url-safe and improving the robustness and user-friendliness of
the library. This change has been introduced in issue
[#3498](#3498) and has been
manually tested (see the URL-quoting sketch after this list).
* Refactor `PipelineMigrator` to add `include_pipeline_ids`
([#3495](#3495)). In this
refactoring, the `PipelineMigrator` has been updated to introduce an
`include_pipeline_ids` option, replacing the previous
`skip_pipeline_ids` flag. This change allows users to specify the list
of pipelines to migrate, providing better control over the migration
process. The `PipelinesMigrator` constructor,
`_get_pipelines_to_migrate`, and `migrate_pipelines` methods have been
modified to accommodate this new flag. The `_migrate_pipeline` method
now accepts the pipeline ID instead of a `PipelineInfo` object.
Additionally, the unit tests have been updated to include the new
`include_flag` parameter, which facilitates testing various scenarios
with different pipeline lists. Although the commit does not show changes
to test files, integration tests should be updated to reflect the new
`include-pipeline-ids` flag functionality. This improvement resolves
issue [#3492](#3492) and
enhances the overall flexibility of the `PipelineMigrator`.
* Rename Python AST's `Tree` methods for clarity
([#3524](#3524)). In this
release, the `Tree` class in the Python AST library has been updated for
improved code clarity and functionality. The `append_` methods have been
renamed to `attach_` for better accuracy, and now include docstrings for
increased understanding. These methods have been updated to always
return `None`. A new method, `attach_child_tree`, has been added,
allowing for traversal from both parent and child and propagating any
module references. Several new methods and functionalities have been
introduced to improve the class, while extensive unit testing has been
conducted to ensure functionality. Additionally, the diff includes test
cases for various functionalities, such as inferring values when
attaching trees and verifying spark module propagation, as well as tests
to ensure that certain operations are not supported. This change, linked
to issues [#3514](#3514) and
[#3520](#3520), may affect
any code that calls these methods and relies on their return values.
However, the added docstrings and unit tests will help ensure your code
continues to function correctly.
* Schedule the migration progress workflow to run daily
([#3485](#3485)). This PR
introduces changes to the UCX installation process to schedule the
migration progress workflow to run automatically once a day, with the
default schedule set to run at 5 a.m. UTC. It includes refactoring the
plumbing used for managing and installing workflows, enabling them to
have a Cron-based schedule. The relevant user documentation has been
updated, and the existing `migration-progress-experimental` workflow has
been modified. Additionally, unit and integration tests have been
added/modified to ensure the proper functioning of the updated code, and
new functions have been added to verify the workflow's schedule and task
detection.
* Scope crawled pipelines in PipelineCrawler
([#3513](#3513)). In this
release, the `PipelineCrawler` class in the `pipelines.py` file has been
updated to include a new optional argument `include_pipeline_ids` in its
constructor. This argument allows users to filter the pipelines that are
crawled by specifying a list of pipeline IDs. The `_crawl` method has
been modified to check if `include_pipeline_ids` is not `None` and to
filter the list of pipelines accordingly. The class now also checks if
each pipeline exists before getting its configuration, and logs a
warning message if the pipeline is not found. Previously, a `NotFound`
exception was raised. Additionally, the code has been updated to use
`pipeline.spec.configuration` instead of
`pipeline_response.spec.configuration` to get the pipeline
configuration. These changes have been tested through new and updated
unit tests, including a test for handling creators' user names. Overall,
these updates provide improved functionality and flexibility for
crawling pipelines (see the filtering sketch at the end of these notes).
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)). In this
release, we have updated the version requirement of the
`databricks-labs-blueprint` package to be greater than or equal to 0.9.1
and less than 0.11. This change allows us to use the latest version of
the package and includes bug fixes and dependency updates. The hosted
runner has been patched in version 0.10.1 to address issues with
publishing artifacts in the release workflow. Release notes for previous
versions are also provided in the commit. These updates are intended to
improve the overall functionality and stability of the library.
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)). In this
release, the `databricks-sdk` package requirement has been updated to
version 0.41.0, which brings new features, improvements, bug fixes, and
API changes. Among the new features are the addition of
'serving.http_request' for calling external functions, and recovery on
download failures in the Files API client. Although the specifics of the
functionality added and changed are not detailed, the focus of this
release appears to be on bug fixes and internal enhancements.
Additionally, the API has undergone changes, including added and altered
methods and fields, however, specific information about these changes
has not been provided in the release notes.
* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)). A critical
update has been implemented in this release for the `sqlglot` package,
which has been updated to version 25.5.0 or higher, but less than 26.2.
This change is essential to leverage the latest version of sqlglot while
avoiding any breaking changes introduced in version 26.1. The new
version includes several breaking changes, new features, bug fixes, and
modifications to various dialects such as hive, postgres, tsql, and
sqlite. Moreover, the tokenizer has been updated to accept
underscore-separated number literals. However, the specific impact of
these changes on the project is not detailed in the commit message, and
software engineers should thoroughly test and review the changes to
ensure seamless functionality.
* Updated sqlglot requirement from <26.2,>=25.5.0 to >=25.5.0,<26.3
([#3528](#3528)). In this
update, we have modified the version constraint for the `sqlglot`
dependency from `>=25.5.0,<26.2` to `>=25.5.0,<26.3` in the
`pyproject.toml` file. Sqlglot is a Python-based SQL parser and
optimizer, and this change allows us to adopt the latest version of
sqlglot within the specified version range. This update addresses
potential security vulnerabilities and incorporates performance
enhancements and bug fixes, ensuring that our library remains up-to-date
and secure.
* Updated table-migration workflows to also capture updated migration
progress into the history log
([#3239](#3239)). This pull
request updates the table-migration workflows to log not only the tables
that still need to be migrated, but also the progress of the migration.
The affected workflows include `migrate-tables`,
`migrate-external-hiveserde-tables-in-place-experimental`,
`migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`,
and `migrate-tables-in-mounts-experimental`. The encoder for
table-history has been refactored to improve control over when the
`TableMigrationStatus` data is refreshed. The documentation has been
updated to reflect the changes in each workflow. Additionally, both unit
and integration tests have been added and updated to ensure the changes
work as intended and resolve any conflicts. A new
`ProgressTrackingInstallation` class has been added to support this
functionality. The changes have been manually tested and include
modifications to the existing workflows, new methods, and a renamed
method. The `mock_workspace_client` function has been replaced, and the
`external_locations.resolve_mount` method and other methods have not
been called. The `TablesCrawler` object's `snapshot` method has been
called once to retrieve the list of tables in the Hive metastore. The
migration record workflow run is also updated to include the workflow
run information in the `workflow_runs` table. These changes are expected
to improve the accuracy and reliability of the table-migration
workflows.
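
A minimal sketch of the URL-quoting step from the `Make link to issue template url safe` entry above; the template URL and query parameter are illustrative assumptions, not the exact UCX implementation:

```python
from urllib.parse import quote_plus

# Hypothetical issue-template URL; the real UCX template link may differ.
ISSUE_TEMPLATE_URL = "https://github.com/databrickslabs/ucx/issues/new"


def issue_link_for(source_code: str) -> str:
    """Build a bug-report link whose body safely embeds arbitrary source code."""
    # quote_plus escapes special characters (' ' -> '+', '&' -> '%26', ...)
    # so the snippet cannot break out of the query string.
    return f"{ISSUE_TEMPLATE_URL}?body={quote_plus(source_code)}"
```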

Dependency updates:

* Updated sqlglot requirement from <26.1,>=25.5.0 to >=25.5.0,<26.2
([#3500](#3500)).
* Updated databricks-labs-blueprint requirement from <0.10,>=0.9.1 to
>=0.9.1,<0.11
([#3519](#3519)).
* Updated databricks-sdk requirement from <0.41,>=0.40 to >=0.40,<0.42
([#3553](#3553)).
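
A minimal sketch of the `include_pipeline_ids` filtering described in the `Scope crawled pipelines in PipelineCrawler` entry; the crawler internals are assumptions reconstructed from the notes above:

```python
import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

logger = logging.getLogger(__name__)


class PipelineCrawler:
    """Sketch: crawl pipeline configurations, optionally restricted to given ids."""

    def __init__(self, ws: WorkspaceClient, include_pipeline_ids: list[str] | None = None):
        self._ws = ws
        self._include_pipeline_ids = include_pipeline_ids

    def _crawl(self) -> list[dict]:
        pipeline_ids = [p.pipeline_id for p in self._ws.pipelines.list_pipelines()]
        if self._include_pipeline_ids is not None:
            pipeline_ids = [pid for pid in pipeline_ids if pid in self._include_pipeline_ids]
        configurations = []
        for pid in pipeline_ids:
            try:
                # Check the pipeline still exists before reading its configuration.
                pipeline = self._ws.pipelines.get(pid)
            except NotFound:
                logger.warning(f"Pipeline not found: {pid}")  # log instead of raising
                continue
            spec = pipeline.spec
            configurations.append(spec.configuration if spec and spec.configuration else {})
        return configurations
```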