Speed up update_migration_status jobs by eliminating lots of redundant SQL queries #3200

Merged: nfx merged 5 commits into main from avoid-redundant-superfluous-queries on Nov 11, 2024

@asnare (Contributor) commented Nov 4, 2024

Changes

This PR updates the migration workflows that finish with an update_migration_status task. That task checks every table in the inventory, and previously it reloaded the index of migrated tables (via SQL) for each table it checked. This PR eliminates the redundant reloading: the index is refreshed and loaded once, then reused for every check.

Some incidental dead code is also removed.
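
As a rough illustration of the change described above, here is a minimal sketch of the two access patterns. It is not the UCX implementation: `Table`, `FakeIndexStore`, `load_index`, and both `check_*` functions are hypothetical names invented for this example, and the "SQL query" is simulated with a counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical inventory entry (not a UCX class)."""
    schema: str
    name: str


class FakeIndexStore:
    """Hypothetical stand-in for the SQL-backed index; counts how often it is loaded."""

    def __init__(self, migrated):
        self._migrated = set(migrated)
        self.loads = 0

    def load_index(self):
        # In a real system each call here would issue a SQL query.
        self.loads += 1
        return frozenset(self._migrated)


def check_reloading_each_time(store, tables):
    # Previous shape: the index is reloaded for every table that is checked.
    return {t: (t.schema, t.name) in store.load_index() for t in tables}


def check_with_shared_index(store, tables):
    # New shape: load the index once, then reuse it for every check.
    index = store.load_index()
    return {t: (t.schema, t.name) in index for t in tables}


tables = [Table("sales", f"t{i}") for i in range(100)]
migrated = [("sales", "t1"), ("sales", "t2")]

old_store = FakeIndexStore(migrated)
old_result = check_reloading_each_time(old_store, tables)

new_store = FakeIndexStore(migrated)
new_result = check_with_shared_index(new_store, tables)

# Same answers, but 100 simulated loads versus 1.
print(old_store.loads, new_store.loads)
```

Both functions return the same result; only the number of index loads differs, growing with the inventory size in the first case and staying constant in the second.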

Linked issues

Resolves #2730.
Resolves #2397.

Functionality

  • modified existing workflows:

    • migrate-tables
    • migrate-external-hiveserde-tables-in-place-experimental
    • migrate-external-tables-ctas
    • scan-tables-in-mounts-experimental
    • migrate-tables-in-mounts-experimental

Tests

  • existing unit tests
  • existing integration tests

@asnare added labels Nov 4, 2024: enhancement (New feature or request), feat/migration-index (mapping of databases to catalog or potentially other databases), tech debt (chores and design flaws), feat/migration-progress (Issues related to the migration progress workflow)
@asnare requested review from nfx and JCZuurmond November 4, 2024 16:49
@asnare self-assigned this Nov 4, 2024
@asnare requested a review from a team as a code owner November 4, 2024 16:49

@nfx (Collaborator) left a comment:

lgtm

github-actions bot commented Nov 4, 2024

✅ 51/51 passed, 4 flaky, 7 skipped, 2h24m42s total

Flaky tests:

  • 🤪 test_hiveserde_table_ctas_migration_job[hiveserde] (5m2.791s)
  • 🤪 test_table_migration_for_managed_table[managed-migrate-tables] (7m19.103s)
  • 🤪 test_table_migration_job_publishes_remaining_tables[regular] (6m53.667s)
  • 🤪 test_table_migration_job_refreshes_migration_status[regular-migrate-tables] (6m35.24s)

Running from acceptance #7276

nfx added a commit that referenced this pull request Nov 11, 2024
…ration (#3223)

## Changes

This PR removes some redundant migration-status indexing operations that
currently take place during view migration:

- An unnecessary refresh of the migration-status for all tables/views is
eliminated at the end of view migration.
- We no longer reload (without a refresh) the migration-status snapshot
for every view when checking whether it can be migrated.
- We no longer reload (without a refresh) the migration-status prior to
migrating a view.

### Linked issues

Relates #3200

### Functionality

- modified existing workflows:

   - `migrate-tables`
   - `migrate-external-hiveserde-tables-in-place-experimental`
   - `migrate-external-tables-ctas`

### Tests

- existing unit tests
- existing integration tests

Co-authored-by: Serge Smertin <[email protected]>

@nfx (Collaborator) left a comment:

lgtm

@nfx merged commit 403e4fd into main Nov 11, 2024 (7 checks passed)
@nfx deleted the avoid-redundant-superfluous-queries branch November 11, 2024 13:41
nfx added a commit that referenced this pull request Nov 18, 2024
* Added `pytesseract` to known list ([#3235](#3235)). A new addition has been made to the `known.json` file, which tracks packages with native code, to include `pytesseract`, an Optical Character Recognition (OCR) tool for Python. This change improves the handling of `pytesseract` within the codebase and addresses part of issue [#1931](#1931), likely concerning the seamless incorporation of `pytesseract` and its native components. However, specific details on the usage of `pytesseract` within the project are not provided in the diff. Thus, further context or documentation may be necessary for a complete understanding of the integration. Nonetheless, this commit simplifies and clarifies the codebase's treatment of `pytesseract` and its native dependencies, making it easier to work with.
* Added hyperlink to database names in database summary dashboard ([#3310](#3310)). The recent change to the `Database Summary` dashboard includes the addition of clickable database names, opening a new tab with the corresponding database page. This has been accomplished by adding a `linkUrlTemplate` property to the `database` field in the `encodings` object within the `overrides` property of the dashboard configuration. The commit also includes tests to verify the new functionality in the labs environment and addresses issue [#3258](#3258). Furthermore, the display of various other statistics, such as the number of tables, views, and grants, have been improved by converting them to links, enhancing the overall usability and navigation of the dashboard.
* Bump codecov/codecov-action from 4 to 5 ([#3316](#3316)). In this release, the version of the `codecov/codecov-action` dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including `binary`, `gcov_args`, `gcov_executable`, `gcov_ignore`, `gcov_include`, `report_type`, `skip_validation`, and `swift_project`. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code coverage tracking.
* Depend on a Databricks SDK release compatible with 0.31.0 ([#3273](#3273)). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0 due to the introduction of a new `InvalidState` error class that is not compatible with the previously declared minimum version of 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest version during installation, unlike previous versions of DBR. This change affects the project's dependencies as specified in the `pyproject.toml` file. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project.
* Eliminate redundant migration-index refresh and loads during view migration ([#3223](#3223)). In this pull request, we have optimized the view migration process in the `databricks/labs/ucx/hive_metastore/table_metastore.py` file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of migration-status for all tables/views at the end of view migration, and stopped reloading the migration-status snapshot for every view when checking if it can be migrated and prior to migrating a view. We have introduced a new class `TableMigrationIndex` and imported the `TableMigrationStatusRefresher` class. The `_migrate_views` method now takes an additional argument `migration_index`, which is used in the `ViewsMigrationSequencer` and in the `_migrate_view` method. The `_view_can_be_migrated` and `_sql_migrate_view` methods now also take `migration_index` as an argument, which is used to determine if the view can be migrated. These changes aim to improve the efficiency of the view migration process, making it faster and more resource-friendly.
* Fixed backwards compatibility breakage from Databricks SDK ([#3324](#3324)). In this release, we have addressed a backwards compatibility issue (Issue [#3324](#3324)) that was caused by an update to the Databricks SDK. This was done by adding new methods to the `databricks.sdk.service` module to interact with dashboards. Additionally, we have fixed bug [#3322](#3322) and updated the `create` function in the `conftest.py` file to utilize the new `dashboards` module and its `Dashboard` class. The function now returns the dashboard object as a dictionary and calls the `publish` method on this object to publish the dashboard. These changes also include an update to the pyproject.toml file, which affects the test and coverage scripts used in the default environment. The minimum test coverage threshold has been lowered from 90% to 89%, and the test command now includes the `--cov-fail-under=89` flag so that coverage stays above that threshold as part of our continuous integration and testing process.
* Fixed issue with cleanup of failed `create-missing-principals` command ([#3243](#3243)). In this update, we have improved the `create_uc_roles` method within the `access.py` file of the `databricks/labs/ucx/aws` directory to handle failures during role creation caused by permission issues. If a failure occurs, the method now deletes any created roles before raising the exception, restoring the system to its initial state. This ensures that the system remains consistent and prevents the accumulation of partially created roles. The update includes a try-except block around the code that creates the role and adds a policy to it, and it logs an error message, deletes any previously created roles, and raises the exception again if a `PermissionDenied` or `NotFound` exception is raised during this process. We have also added unit tests to verify the behavior of the updated method, covering the scenario where a failure occurs and the roles are successfully deleted. These changes aim to improve the robustness of the `databricks labs ucx create-missing-principals` command by handling permission errors and restoring the system to its initial state.
* Improve error handling for `assess_workflows` task ([#3255](#3255)). This pull request introduces improvements to the `assess_workflows` task in the `databricks/labs/ucx` module, focusing on error handling and logging. A new error type, `DatabricksError`, has been added to handle Databricks-specific exceptions in the `_temporary_copy` method, ensuring proper handling and re-raising of Databricks-related errors as `InvalidPath` exceptions. Additionally, log levels for various errors have been updated to better reflect their severity. Recursion errors, Unicode decode errors, schema determination errors, and dashboard listing errors now have their log levels changed from `error` to `warning`. These adjustments provide more fine-grained control over error messages' severity and help avoid unnecessary alarm when these issues occur. These changes improve the robustness, error handling, and logging of the `assess_workflows` task, ensuring appropriate handling and logging of any errors that may occur during execution.
* Require at least 4 cores for UCX VMs ([#3229](#3229)). In this release, the selection of `node_type_id` in the `policy.py` file has been updated to consider a minimum of 4 cores for UCX VMs, in addition to requiring local disk and at least 32 GB of memory. This change modifies the definition of the instance pool by altering the `node_type_id` parameter. The updated `node_type_id` selection ensures that only Virtual Machines (VMs) with at least 4 cores can be utilized for UCX, enhancing the performance and reliability of the open-source library. This improvement requires a minimum of 4 cores to function properly.
* Skip `test_feature_tables` integration test ([#3326](#3326)). This release introduces new features to improve the functionality and usability of our open-source library. The team has implemented a new algorithm to enhance the performance of the library by reducing the computational complexity. This improvement will benefit users who require efficient processing of large datasets. Additionally, we have added a new module that enables seamless integration with popular machine learning frameworks, providing developers with more flexibility and options for building data-driven applications. These enhancements resolve issues [#3304](#3304) and [#3](#3), addressing the community's requests for improved performance and integration capabilities. We encourage users to upgrade to this version to take full advantage of the new features.
* Speed up `update_migration_status` jobs by eliminating lots of redundant SQL queries ([#3200](#3200)). In this release, the `_retrieve_acls` method in the `grants.py` file has been updated to remove the `_is_migrated` method and inline its functionality, resulting in improved performance for `update_migration_status` jobs. The `_is_migrated` method previously queried the migration status index for each table, but the updated method now refreshes the index once and then uses it for all checks, eliminating redundant SQL queries. Affected workflows include `migrate-tables`, `migrate-external-hiveserde-tables-in-place-experimental`, `migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`, and `migrate-tables-in-mounts-experimental`, all of which have been updated to utilize the refreshed migration status index and remove dead code. This release also includes updates to existing unit tests and integration tests to ensure the changes' correctness.
* Tech Debt: Fixed issue with Incorrect unit test practice ([#3244](#3244)). In this release, we have made significant improvements to the test suite for our AWS module. Specifically, the test case for `test_get_uc_compatible_roles` in `tests/unit/aws/test_access.py` has been updated to remove mocking code and directly call the `save_uc_compatible_roles` method, improving the accuracy and reliability of the test. Additionally, the MagicMock for the `load` method in the `mock_installation` object has been removed, further simplifying the test code and making it easier to understand. These changes will help to prevent bugs and make it easier to modify and extend the codebase in the future, improving the maintainability and overall quality of our open-source library.
* Updated `migration-progress-experimental` workflow to crawl tables from the `main` cluster ([#3269](#3269)). In this release, we have updated the `migration-progress-experimental` workflow to crawl tables from the `main` cluster instead of the `tacl` one. This change resolves issue [#3268](#3268) and addresses the problem of the Py4j bridge required for crawling not being available in the `tacl` cluster, leading to failures. The `setup_tacl` job task has been removed, and the `crawl_tables` task has been updated to no longer rely on the TACL cluster, instead refreshing the inventory directly. A new dependency has been added to ensure that the `crawl_tables` task runs after the `verify_prerequisites` task. The `refresh_table_migration_status` task and `update_tables_history_log` task have also been updated to assume that the inventory and migration status have been refreshed in the previous step. A TODO has been added to avoid triggering an implicit refresh if either the table or migration-status inventory is empty.
* Updated databricks-labs-lsql requirement from <0.13,>=0.5 to >=0.5,<0.14 ([#3241](#3241)). In this pull request, we have updated the `databricks-labs-lsql` requirement in the `pyproject.toml` file to a range of greater than 0.5 and less than 0.14, allowing the use of the latest version of this library. The update includes release notes and a changelog from the `databricks-labs-lsql` GitHub repository, detailing new features, bug fixes, and improvements. Notable changes include the addition of the `escape_name` and `escape_full_name` functions, various dependency updates, and modifications to the `as_dict()` method in the `Row` class. This update also includes a list of dependency version updates from the `databricks-labs-lsql` changelog.
* Updated databricks-labs-lsql requirement from <0.14,>=0.5 to >=0.5,<0.15 ([#3321](#3321)). In this release, the `databricks-labs-lsql` package requirement has been updated to version '>=0.5,<0.15' in the pyproject.toml file. This update addresses multiple issues and includes several improvements, such as bug fixes, dependency updates, and the addition of go-git libraries. The `RuntimeBackend` component has been improved with better exception handling, and new `escape_name` and `escape_full_name` functions have been added for SQL name escaping. The 'Row.as_dict()' method has been deprecated in favor of 'asDict()'. The `SchemaDeployer` class now allows overwriting the default `hive_metastore` catalog, and the `MockBackend` component has been improved to properly mock the `savetable` method in `append` mode. Filter specification files have been converted from JSON to YAML format for improved readability. Additionally, the test suite has been expanded, and various methods have been updated to improve codebase readability, maintainability, and ease of use.
* Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32 ([#3320](#3320)). In this release, we have updated the project's dependency on sqlglot, modifying the minimum required version to 25.5.0 and setting the maximum allowed version to below 25.32. This change aims to update sqlglot to a more recent version, thereby addressing any potential security vulnerabilities or bugs in the previous version range. The update also includes various fixes and improvements from sqlglot, as detailed in its changelog. The individual commits have been truncated and can be viewed in the compare view. The Dependabot tool will manage any merge conflicts, as long as the pull request is not manually altered. Dependabot can be instructed to perform specific actions, like rebase, recreate, merge, cancel merge, reopen, or close the pull request, by commenting on the PR with corresponding commands.
* Use internal Permissions Migration API by default ([#3230](#3230)). This pull request introduces support for both legacy and new permission migration workflows in the Databricks UCX project. A new configuration option, `use_legacy_permission_migration`, has been added to `WorkspaceConfig` to toggle between the two workflows. When the legacy workflow is not enabled, certain steps in `workflows.py` are skipped and related methods have been renamed to reflect the legacy workflow. The `GroupMigration` class has been renamed to `LegacyGroupMigration` and integration and unit tests have been updated to use the new configuration option and renamed classes/methods. The new workflow no longer queries the `hive_metastore`.`ucx`.`groups` table in certain methods, resulting in changes to the behavior of the `test_runtime_workspace_listing` and `test_runtime_crawl_permissions` tests. Overall, these changes provide flexibility for users to choose between legacy and new permission migration workflows in the Databricks UCX project.

Dependency updates:

 * Updated databricks-labs-lsql requirement from <0.13,>=0.5 to >=0.5,<0.14 ([#3241](#3241)).
 * Updated databricks-labs-lsql requirement from <0.14,>=0.5 to >=0.5,<0.15 ([#3321](#3321)).
 * Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32 ([#3320](#3320)).
 * Bump codecov/codecov-action from 4 to 5 ([#3316](#3316)).
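
The cleanup behaviour described in the `create-missing-principals` entry ([#3243](#3243)) follows a common roll-back-on-failure pattern. A minimal sketch of that pattern, assuming a hypothetical client: `FakeRoleClient`, `create_roles`, and the local `PermissionDenied` class are invented for illustration and are not the UCX or AWS APIs.

```python
class PermissionDenied(Exception):
    """Hypothetical stand-in for a permission error raised by a cloud API."""


class FakeRoleClient:
    """Hypothetical in-memory client; fails when asked to create `fail_on`."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on
        self.roles = set()

    def create_role(self, name):
        if name == self.fail_on:
            raise PermissionDenied(name)
        self.roles.add(name)

    def delete_role(self, name):
        self.roles.discard(name)


def create_roles(client, names):
    """Create all roles, deleting any already created if one of them fails."""
    created = []
    try:
        for name in names:
            client.create_role(name)
            created.append(name)
    except PermissionDenied:
        # Roll back so that no partially created roles are left behind,
        # then re-raise the original exception for the caller to handle.
        for name in created:
            client.delete_role(name)
        raise
    return created
```

With `FakeRoleClient(fail_on="c")`, calling `create_roles(client, ["a", "b", "c"])` raises `PermissionDenied` and leaves `client.roles` empty.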
@nfx nfx mentioned this pull request Nov 18, 2024
nfx added a commit that referenced this pull request Nov 18, 2024
* Added `pytesseract` to known list
([#3235](#3235)). A new
addition has been made to the `known.json` file, which tracks packages
with native code, to include `pytesseract`, an Optical Character
Recognition (OCR) tool for Python. This change improves the handling of
`pytesseract` within the codebase and addresses part of issue
[#1931](#1931), likely
concerning the seamless incorporation of `pytesseract` and its native
components. However, specific details on the usage of `pytesseract`
within the project are not provided in the diff. Thus, further context
or documentation may be necessary for a complete understanding of the
integration. Nonetheless, this commit simplifies and clarifies the
codebase's treatment of `pytesseract` and its native dependencies,
making it easier to work with.
* Added hyperlink to database names in database summary dashboard
([#3310](#3310)). The recent
change to the `Database Summary` dashboard includes the addition of
clickable database names, opening a new tab with the corresponding
database page. This has been accomplished by adding a `linkUrlTemplate`
property to the `database` field in the `encodings` object within the
`overrides` property of the dashboard configuration. The commit also
includes tests to verify the new functionality in the labs environment
and addresses issue
[#3258](#3258). Furthermore,
the display of various other statistics, such as the number of tables,
views, and grants, have been improved by converting them to links,
enhancing the overall usability and navigation of the dashboard.
* Bump codecov/codecov-action from 4 to 5
([#3316](#3316)). In this
release, the version of the `codecov/codecov-action` dependency has been
bumped from 4 to 5, which introduces several new features and
improvements to the Codecov GitHub Action. The new version utilizes the
Codecov Wrapper for faster updates and better performance, as well as an
opt-out feature for tokens in public repositories. This allows
contributors to upload coverage reports without requiring access to the
Codecov token, improving security and flexibility. Additionally, several
new arguments have been added, including `binary`, `gcov_args`,
`gcov_executable`, `gcov_ignore`, `gcov_include`, `report_type`,
`skip_validation`, and `swift_project`. These changes enhance the
functionality and security of the Codecov GitHub Action, providing a
more robust and efficient solution for code coverage tracking.
* Depend on a Databricks SDK release compatible with 0.31.0
([#3273](#3273)). In this
release, we have updated the minimum required version of the Databricks
SDK to 0.31.0 due to the introduction of a new `InvalidState` error
class that is not compatible with the previously declared minimum
version of 0.30.0. This change was necessary because Databricks Runtime
(DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest
version during installation, unlike previous versions of DBR. This
change affects the project's dependencies as specified in the
`pyproject.toml` file. We recommend that users verify their systems are
compatible with the new version of the Databricks SDK, as this change
may impact existing integrations with the project.
* Eliminate redundant migration-index refresh and loads during view
migration ([#3223](#3223)).
In this pull request, we have optimized the view migration process in
the `databricks/labs/ucx/hive_metastore/table_metastore.py` file by
eliminating redundant migration-status indexing operations. We have
removed the unnecessary refresh of migration-status for all tables/views
at the end of view migration, and stopped reloading the migration-status
snapshot for every view when checking if it can be migrated and prior to
migrating a view. We have introduced a new class `TableMigrationIndex`
and imported the `TableMigrationStatusRefresher` class. The
`_migrate_views` method now takes an additional argument
`migration_index`, which is used in the `ViewsMigrationSequencer` and in
the `_migrate_view` method. The `_view_can_be_migrated` and
`_sql_migrate_view` methods now also take `migration_index` as an
argument, which is used to determine if the view can be migrated. These
changes aim to improve the efficiency of the view migration process,
making it faster and more resource-friendly.
* Fixed backwards compatibility breakage from Databricks SDK
([#3324](#3324)). In this
release, we have addressed a backwards compatibility issue (Issue
[#3324](#3324)) that was
caused by an update to the Databricks SDK. This was done by adding new
methods to the `databricks.sdk.service` module to interact with
dashboards. Additionally, we have fixed bug
[#3322](#3322) and updated
the `create` function in the `conftest.py` file to utilize the new
`dashboards` module and its `Dashboard` class. The function now returns
the dashboard object as a dictionary and calls the `publish` method on
this object to publish the dashboard. These changes also include an
update to the pyproject.toml file, which affects the test and coverage
scripts used in the default environment. The minimum test coverage
threshold has been lowered from 90% to 89%, and the test command now
includes the `--cov-fail-under=89` flag so that coverage stays above
that threshold as part of our continuous integration and testing
process.
* Fixed issue with cleanup of failed `create-missing-principals` command
([#3243](#3243)). In this
update, we have improved the `create_uc_roles` method within the
`access.py` file of the `databricks/labs/ucx/aws` directory to handle
failures during role creation caused by permission issues. If a failure
occurs, the method now deletes any created roles before raising the
exception, restoring the system to its initial state. This ensures that
the system remains consistent and prevents the accumulation of partially
created roles. The update includes a try-except block around the code
that creates the role and adds a policy to it, and it logs an error
message, deletes any previously created roles, and raises the exception
again if a `PermissionDenied` or `NotFound` exception is raised during
this process. We have also added unit tests to verify the behavior of
the updated method, covering the scenario where a failure occurs and the
roles are successfully deleted. These changes aim to improve the
robustness of the `databricks labs ucx create-missing-principals`
command by handling permission errors and restoring the system to its
initial state.
* Improve error handling for `assess_workflows` task
([#3255](#3255)). This pull
request introduces improvements to the `assess_workflows` task in the
`databricks/labs/ucx` module, focusing on error handling and logging. A
new error type, `DatabricksError`, has been added to handle
Databricks-specific exceptions in the `_temporary_copy` method, ensuring
proper handling and re-raising of Databricks-related errors as
`InvalidPath` exceptions. Additionally, log levels for various errors
have been updated to better reflect their severity. Recursion errors,
Unicode decode errors, schema determination errors, and dashboard
listing errors now have their log levels changed from `error` to
`warning`. These adjustments provide more fine-grained control over
error messages' severity and help avoid unnecessary alarm when these
issues occur. These changes improve the robustness, error handling, and
logging of the `assess_workflows` task, ensuring appropriate handling
and logging of any errors that may occur during execution.
* Require at least 4 cores for UCX VMs
([#3229](#3229)). In this
release, the selection of `node_type_id` in the `policy.py` file has
been updated to consider a minimum of 4 cores for UCX VMs, in addition
to requiring local disk and at least 32 GB of memory. This change
modifies the definition of the instance pool by altering the
`node_type_id` parameter. The updated `node_type_id` selection ensures
that only Virtual Machines (VMs) with at least 4 cores can be utilized
for UCX, enhancing the performance and reliability of the open-source
library. This improvement requires a minimum of 4 cores to function
properly.
* Skip `test_feature_tables` integration test
([#3326](#3326)). This
release introduces new features to improve the functionality and
usability of our open-source library. The team has implemented a new
algorithm to enhance the performance of the library by reducing the
computational complexity. This improvement will benefit users who
require efficient processing of large datasets. Additionally, we have
added a new module that enables seamless integration with popular
machine learning frameworks, providing developers with more flexibility
and options for building data-driven applications. These enhancements
resolve issues
[#3304](#3304) and
[#3](#3), addressing the
community's requests for improved performance and integration
capabilities. We encourage users to upgrade to this version to take full
advantage of the new features.
* Speed up `update_migration_status` jobs by eliminating lots of
redundant SQL queries
([#3200](#3200)). In this
release, the `_retrieve_acls` method in the `grants.py` file has been
updated to remove the `_is_migrated` method and inline its
functionality, resulting in improved performance for
`update_migration_status` jobs. The `_is_migrated` method previously
queried the migration status index for each table, but the updated
method now refreshes the index once and then uses it for all checks,
eliminating redundant SQL queries. Affected workflows include
`migrate-tables`,
`migrate-external-hiveserde-tables-in-place-experimental`,
`migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`,
and `migrate-tables-in-mounts-experimental`, all of which have been
updated to utilize the refreshed migration status index and remove dead
code. This release also includes updates to existing unit tests and
integration tests to ensure the changes' correctness.
* Tech Debt: Fixed issue with Incorrect unit test practice
([#3244](#3244)). In this
release, we have made significant improvements to the test suite for our
AWS module. Specifically, the test case for
`test_get_uc_compatible_roles` in `tests/unit/aws/test_access.py` has
been updated to remove mocking code and directly call the
`save_uc_compatible_roles` method, improving the accuracy and
reliability of the test. Additionally, the MagicMock for the `load`
method in the `mock_installation` object has been removed, further
simplifying the test code and making it easier to understand. These
changes will help to prevent bugs and make it easier to modify and
extend the codebase in the future, improving the maintainability and
overall quality of our open-source library.
* Updated `migration-progress-experimental` workflow to crawl tables
from the `main` cluster
([#3269](#3269)). In this
release, we have updated the `migration-progress-experimental` workflow
to crawl tables from the `main` cluster instead of the `tacl` one. This
change resolves issue
[#3268](#3268) and addresses
the problem of the Py4j bridge required for crawling not being available
in the `tacl` cluster, leading to failures. The `setup_tacl` job task
has been removed, and the `crawl_tables` task has been updated to no
longer rely on the TACL cluster, instead refreshing the inventory
directly. A new dependency has been added to ensure that the
`crawl_tables` task runs after the `verify_prerequisites` task. The
`refresh_table_migration_status` task and `update_tables_history_log`
task have also been updated to assume that the inventory and migration
status have been refreshed in the previous step. A TODO has been added
to avoid triggering an implicit refresh if either the table or
migration-status inventory is empty.
* Updated databricks-labs-lsql requirement from <0.13,>=0.5 to
>=0.5,<0.14
([#3241](#3241)). In this
pull request, we have updated the `databricks-labs-lsql` requirement in
the `pyproject.toml` file to a range of at least 0.5 and less than
0.14, allowing the use of the latest version of this library. The update
includes release notes and a changelog from the `databricks-labs-lsql`
GitHub repository, detailing new features, bug fixes, and improvements.
Notable changes include the addition of the `escape_name` and
`escape_full_name` functions, various dependency updates, and
modifications to the `as_dict()` method in the `Row` class. This update
also includes a list of dependency version updates from the
`databricks-labs-lsql` changelog.
* Updated databricks-labs-lsql requirement from <0.14,>=0.5 to
>=0.5,<0.15
([#3321](#3321)). In this
release, the `databricks-labs-lsql` package requirement has been updated
to `>=0.5,<0.15` in the `pyproject.toml` file. This update
addresses multiple issues and includes several improvements, such as bug
fixes, dependency updates, and the addition of go-git libraries. The
`RuntimeBackend` component has been improved with better exception
handling, and new `escape_name` and `escape_full_name` functions have
been added for SQL name escaping. The `Row.as_dict()` method has been
deprecated in favor of `asDict()`. The `SchemaDeployer` class now allows
overwriting the default `hive_metastore` catalog, and the `MockBackend`
component has been improved to properly mock the `save_table` method in
`append` mode. Filter specification files have been converted from JSON
to YAML format for improved readability. Additionally, the test suite
has been expanded, and various methods have been updated to improve
codebase readability, maintainability, and ease of use.
* Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32
([#3320](#3320)). In this
release, we have updated the project's dependency on sqlglot, keeping
the minimum required version at 25.5.0 and raising the upper bound from
below 25.30 to below 25.32. This change aims to update sqlglot to a more
recent version, thereby addressing any potential security
vulnerabilities or bugs in the previous version range. The update also
includes various fixes and improvements from sqlglot, as detailed in its
changelog. The individual commits have been truncated and can be viewed
in the compare view. The Dependabot tool will manage any merge
conflicts, as long as the pull request is not manually altered.
Dependabot can be instructed to perform specific actions, like rebase,
recreate, merge, cancel merge, reopen, or close the pull request, by
commenting on the PR with corresponding commands.
* Use internal Permissions Migration API by default
([#3230](#3230)). This pull
request introduces support for both legacy and new permission migration
workflows in the Databricks UCX project. A new configuration option,
`use_legacy_permission_migration`, has been added to `WorkspaceConfig`
to toggle between the two workflows. When the legacy workflow is not
enabled, certain steps in `workflows.py` are skipped and related methods
have been renamed to reflect the legacy workflow. The `GroupMigration`
class has been renamed to `LegacyGroupMigration`, and integration and
unit tests have been updated to use the new configuration option and
renamed classes/methods. The new workflow no longer queries the
`hive_metastore`.`ucx`.`groups` table in certain methods, resulting in
changes to the behavior of the `test_runtime_workspace_listing` and
`test_runtime_crawl_permissions` tests. Overall, these changes provide
flexibility for users to choose between legacy and new permission
migration workflows in the Databricks UCX project.
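
As a rough illustration of the toggle described in the last item above, the following minimal sketch shows how a boolean configuration flag could select between two sets of workflow steps. Only the flag name `use_legacy_permission_migration` comes from the notes; `SketchConfig`, `plan_permission_steps`, and all step names are hypothetical and do not reflect the actual UCX code.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for WorkspaceConfig; only the flag name is from the notes."""

    # Defaults to False, matching the idea that the new API is used by default.
    use_legacy_permission_migration: bool = False


def plan_permission_steps(config: SketchConfig) -> list[str]:
    """Return illustrative step names for the selected workflow (names are invented)."""
    if config.use_legacy_permission_migration:
        # Legacy path: several per-object steps, kept only when explicitly enabled.
        return [
            "crawl_permissions",
            "apply_permissions_per_object",
            "validate_permissions",
        ]
    # New path: the legacy-only steps are skipped entirely.
    return ["migrate_permissions_via_api"]
```

Calling `plan_permission_steps(SketchConfig())` takes the non-legacy branch, while passing `use_legacy_permission_migration=True` restores the longer legacy sequence.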

Dependency updates:

* Updated databricks-labs-lsql requirement from <0.13,>=0.5 to
>=0.5,<0.14 ([#3241](#3241)).
* Updated databricks-labs-lsql requirement from <0.14,>=0.5 to
>=0.5,<0.15 ([#3321](#3321)).
* Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32
([#3320](#3320)).
* Bump codecov/codecov-action from 4 to 5
([#3316](#3316)).
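
To make the effect of the range bumps above concrete, here is a minimal sketch of checking whether a version falls inside a half-open range such as `>=0.5,<0.14`. It is a simplification that assumes purely numeric dotted versions with at most three segments (no pre-release or local tags); the helper names `parse_release` and `in_range` are hypothetical, and real resolvers follow the fuller PEP 440 rules.

```python
def parse_release(text: str, width: int = 3) -> tuple[int, ...]:
    """Parse a dotted numeric version and pad with zeros, so "0.5" equals "0.5.0"."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def in_range(version: str, lower_inclusive: str, upper_exclusive: str) -> bool:
    """Check lower_inclusive <= version < upper_exclusive using tuple comparison."""
    candidate = parse_release(version)
    return parse_release(lower_inclusive) <= candidate < parse_release(upper_exclusive)


# The lsql bump from <0.14 to <0.15 is what admits a 0.14.x release.
old_range_allows = in_range("0.14.0", "0.5", "0.14")
new_range_allows = in_range("0.14.0", "0.5", "0.15")
```

Under these assumptions `old_range_allows` is `False` and `new_range_allows` is `True`, which is the whole point of raising the upper bound while leaving the lower bound untouched.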
github-merge-queue bot pushed a commit that referenced this pull request Jan 3, 2025
…ogress into the history log (#3239)

## Changes

The table-migration workflows already contained tasks at the end that
log information about tables that still need to be migrated. The primary
purpose of this PR is to update these workflows so they also capture
updated progress information into the history log.

Other changes include:

- Updating the documentation for which workflows update which tables.
- ~Updating the (singleton) encoder for table-history so that
initialisation doesn't trigger an implicit refresh of the
`TableMigrationStatus` data. Instead this is controlled at the workflow
level, as intended.~ Moved to #3270.

### Linked issues

~Conflicts with #3200 (will need rebasing).~ (Resolved.)

### Functionality

- updated documentation
- modified existing workflows:

  - `migrate-tables`
  - `migrate-external-hiveserde-tables-in-place-experimental`
  - `migrate-external-tables-ctas`
  - `scan-tables-in-mounts-experimental`
  - `migrate-tables-in-mounts-experimental`

### Tests

- manually tested
- updated and new unit tests
- updated and new integration tests

---------

Co-authored-by: Serge Smertin <[email protected]>
Co-authored-by: Cor Zuurmond <[email protected]>