Hive Metastore on multiple workspaces may point to the same assets. We need to dedupe upgrades. #335

Open
1 task done
Tracked by #891
nfx opened this issue Sep 29, 2023 · 7 comments
Assignees
Labels
cloud/azure issues related to Azure feat/account-level cross-workspace installations feat/cli CLI commands feat/migration-index mapping of databases to catalog or potentially other databases migrate/managed go/uc/upgrade Upgrade Managed Tables and Jobs step/assign metastore go/uc/upgrade Assign Metastore

Comments

@nfx
Collaborator

nfx commented Sep 29, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

Need to handle duplication of credentials & prefixes across different workspaces

  • Prefixes that show up on more than one workspace.
  • Prefixes that show up on more than one workspace with different credentials

Proposed Solution

  1. Addressing prefix conflicts/duplications requires special processing. We have the following options:
  • Prefixes that show up on more than one workspace.
    • If already upgraded, ignore
    • If not, warn and upgrade later
  • Prefixes that show up on more than one workspace with different credentials
    • Prompt, confirm choice of credentials
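The two cases above could be told apart with a small grouping pass. The following is only a sketch: `PrefixUsage`, `classify_prefixes`, the workspace IDs, and the storage URLs are hypothetical names for illustration and are not part of UCX.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixUsage:
    # Hypothetical record: one storage prefix as seen from one workspace.
    workspace_id: str
    prefix: str
    credential: str


def classify_prefixes(usages):
    """Label every prefix that shows up on more than one workspace."""
    by_prefix = defaultdict(list)
    for usage in usages:
        by_prefix[usage.prefix].append(usage)

    report = {}
    for prefix, rows in by_prefix.items():
        workspaces = {row.workspace_id for row in rows}
        if len(workspaces) < 2:
            continue  # seen on a single workspace only, nothing to dedupe
        credentials = {row.credential for row in rows}
        # Same credential everywhere: plain duplicate (ignore or warn).
        # Different credentials: someone has to confirm which one to use.
        report[prefix] = "conflicting-credentials" if len(credentials) > 1 else "duplicate"
    return report


usages = [
    PrefixUsage("ws-a", "abfss://data@acct.dfs.core.windows.net/sales", "cred-1"),
    PrefixUsage("ws-b", "abfss://data@acct.dfs.core.windows.net/sales", "cred-1"),
    PrefixUsage("ws-a", "abfss://data@acct.dfs.core.windows.net/hr", "cred-1"),
    PrefixUsage("ws-b", "abfss://data@acct.dfs.core.windows.net/hr", "cred-2"),
    PrefixUsage("ws-a", "abfss://data@acct.dfs.core.windows.net/tmp", "cred-1"),
]
report = classify_prefixes(usages)
```

Prefixes labelled `duplicate` would fall under "ignore or warn", while `conflicting-credentials` would trigger the prompt.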

Additional Context

Requires:

#910

  1. Create an exception list at the account level. The list should contain:
    1. Tables that show up on more than one workspace (pointing to the same cloud storage location)
    2. Tables that show up on more than one workspace with different metadata
    3. Tables that show up on more than one workspace with different ACLs
  2. Addressing table conflicts/duplications requires special processing. We have the following options:
    1. Define a "master" and create derivative objects as views
    2. Flag and skip the dupes
    3. Duplicate the data and create dupes
  3. Consider upgrading a workspace at a time. Highlight the conflict with prior upgrades.
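As a rough illustration of the exception list in item 1, tables could be grouped by storage location and compared on metadata and ACLs. This is a sketch under assumptions: `TableRecord`, `build_exception_list`, and the reason labels are made up for this example and do not reflect the UCX inventory schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRecord:
    # Hypothetical, simplified view of one table in one workspace.
    workspace_id: str
    name: str          # e.g. "db1.tbl1"
    location: str      # cloud storage location
    columns: tuple     # ((column_name, column_type), ...)
    grants: frozenset  # {(principal, privilege), ...}


def build_exception_list(records):
    """Return (location, reasons) for tables that show up on several workspaces."""
    by_location = defaultdict(list)
    for record in records:
        by_location[record.location].append(record)

    exceptions = []
    for location, rows in sorted(by_location.items()):
        if len({row.workspace_id for row in rows}) < 2:
            continue  # only one workspace points at this location
        reasons = ["shared-location"]
        if len({row.columns for row in rows}) > 1:
            reasons.append("different-metadata")
        if len({row.grants for row in rows}) > 1:
            reasons.append("different-acls")
        exceptions.append((location, reasons))
    return exceptions


records = [
    TableRecord("ws-a", "db1.tbl1", "abfss://c@acct.dfs.core.windows.net/tbl1",
                (("id", "int"),), frozenset({("analysts", "SELECT")})),
    TableRecord("ws-b", "db1.tbl1", "abfss://c@acct.dfs.core.windows.net/tbl1",
                (("id", "bigint"),), frozenset({("analysts", "SELECT")})),
    TableRecord("ws-a", "db1.tbl2", "abfss://c@acct.dfs.core.windows.net/tbl2",
                (("id", "int"),), frozenset()),
]
exceptions = build_exception_list(records)
```

Each entry in `exceptions` could then be resolved with one of the options from item 2 (master plus views, skip, or duplicate).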

Now for tables, there also needs to be a report on table/db inconsistency - like
A: db1.tbl1, db1.tbl3
B: db1.tbl2
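A report like the one above might be produced as follows. This is a minimal sketch; `inconsistency_report` and the shape of its input (workspace name mapped to a set of `db.table` names) are assumptions made for illustration.

```python
from collections import defaultdict


def inconsistency_report(inventories):
    """For each database present on 2+ workspaces, list tables not shared by all of them."""
    # database -> workspace -> set of table names
    per_db = defaultdict(dict)
    for workspace, tables in inventories.items():
        for full_name in tables:
            database, table = full_name.split(".", 1)
            per_db[database].setdefault(workspace, set()).add(table)

    report = {}
    for database, by_workspace in per_db.items():
        if len(by_workspace) < 2:
            continue  # database exists on one workspace only
        shared = set.intersection(*by_workspace.values())
        only = {
            workspace: sorted(tables - shared)
            for workspace, tables in by_workspace.items()
            if tables - shared
        }
        if only:
            report[database] = only
    return report


inventories = {
    "A": {"db1.tbl1", "db1.tbl3"},
    "B": {"db1.tbl2"},
}
report = inconsistency_report(inventories)
```

The result maps each database to the tables that are specific to each workspace, which could be exported for the review step described next.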

The team(s) driving the UC migration within the account would make a decision after some time in review (of an Excel spreadsheet). By the way, we can split the UCX installation across different Azure subscriptions, and every installation would just focus on defining the target catalog mapping per database. But here are the unanswered questions:

  • two workspaces, same dbs, all different tables and columns (all managed tables, effectively)
  • two workspaces, same dbs, 90% same tables, 10% are different tables
  • two workspaces, two different dbs

We can technically support both db_to_catalog and workspace_to_catalog, even at the same time, but db_to_catalog will override workspace_to_catalog. We also need default_catalog_for_workspace if workspace_to_catalog is set (the default catalog for all workspaces is set per metastore).
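The precedence described here could look roughly like this. It is only a sketch: the three mapping names come from the paragraph above, but `resolve_catalog`, its signature, and the sample values are hypothetical.

```python
def resolve_catalog(workspace, database, db_to_catalog, workspace_to_catalog,
                    default_catalog_for_workspace):
    """Pick the target catalog: db_to_catalog wins over workspace_to_catalog."""
    if database in db_to_catalog:
        return db_to_catalog[database]
    if workspace in workspace_to_catalog:
        return workspace_to_catalog[workspace]
    # Assumption: the default is a single fallback catalog name.
    return default_catalog_for_workspace


db_to_catalog = {"sales": "sales_catalog"}
workspace_to_catalog = {"ws-a": "ws_a_catalog"}

by_db = resolve_catalog("ws-a", "sales", db_to_catalog, workspace_to_catalog, "main")
by_workspace = resolve_catalog("ws-a", "hr", db_to_catalog, workspace_to_catalog, "main")
by_default = resolve_catalog("ws-b", "hr", db_to_catalog, workspace_to_catalog, "main")
```

Keeping the lookup in one function makes the override order explicit, so a later per-table override could be added as one more check at the top.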

We can also do another override for tables, but we have unanswered questions:

  • What if same db, same workspace, same table, but different columns/order/types? Ignore and keep in hive metastore? And then rerun the scan for tables and grants?
  • What if during migration catalog/database/table were deleted either from hms and/or uc?

Speaking of metastores, in the beginning, there needs to be a workspace_to_metastore mapping with default_metastore_for_workspace. Can we come up with a good default mapping here? Coarse or fine grained? Select between the two? Ask for inline input? How many conflicts do we expect, to justify the need to create/support a custom mapping?

The last, very important question is what future-proof configuration format we might need for this mapping.

@nfx nfx added enhancement New feature or request feat/account-level cross-workspace installations step/assessment go/uc/upgrade - Assessment Step step/assign metastore go/uc/upgrade Assign Metastore labels Sep 29, 2023
@pohlposition pohlposition added migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step migrate/managed go/uc/upgrade Upgrade Managed Tables and Jobs labels Oct 2, 2023
@nfx nfx added the feat/migration-index mapping of databases to catalog or potentially other databases label Oct 2, 2023
@nfx nfx added this to UCX Oct 3, 2023
@nfx nfx moved this to Refined in UCX Oct 3, 2023
@zpappa zpappa moved this from Refined to Triage in UCX Oct 16, 2023
@zpappa zpappa moved this from Triage to Refined in UCX Oct 16, 2023
@pohlposition
Contributor

Thinking of an incremental approach here:

  1. Add workspace-id to the schema name
    • ex: ucx-##########
    • This can be retrieved via the workspace URL; probably best added to the SDK
    • This ensures that each workspace inventory is isolated, regardless of whether there is an external HMS
    • No need to alter "destroy" scripts
  2. Add workspace-id as a column to every inventory table to treat tables as multi-tenant
    • ex: workspace_id
    • This ensures that data is always segmented, even if a user overrides the config for the schema name
  3. Add metastore object to inventory
    • Not sure if there is a natural identifier or not
    • Capture if this metastore is internal, external, glue, config, etc.
  4. Add metastore-id column to any related inventory object
  5. Create a workflow to switch inventory from HMS to UC
    • These tasks would ideally be included at the end of the workflow for Step 1 "Assign Metastore to Workspace"
    • Copy inventory data from the HMS ucx-########## schema to the ucx CATALOG in the same set of tables
    • Since each table has a workspace-id column and the same schema, this is a simple INSERT INTO SELECT * FROM
    • Run the Destroy workflow to purge data from HMS
    • Update the config file to redirect the ucx location from HMS to UC
    • From this point forward, UCX in this workspace will reference the ucx CATALOG
  6. When running the "mapping" portion for any object, UCX logic will examine the data in the UC tables and highlight any mappings that must be reviewed and approved manually
    • Each object will have a mapping table from old to new
    • Exporting / importing to a Google Sheet would be a good interface for this
    • The same process can be run for each new workspace that gets added
  7. Migration of objects is performed based on the mapping tables
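The copy step in the workflow above ("a simple INSERT INTO SELECT * FROM") might be generated per inventory table along these lines. This is a sketch with assumptions: the target schema name `inventory` inside the `ucx` catalog, the table names, and the helper functions are hypothetical, and the statements are only built as strings here, not executed.

```python
def quote(identifier):
    """Backtick-quote an identifier; 'ucx-<workspace id>' contains a hyphen."""
    return "`" + identifier.replace("`", "``") + "`"


def copy_inventory_statements(workspace_id, tables,
                              target_catalog="ucx", target_schema="inventory"):
    """Build one INSERT INTO ... SELECT * per inventory table."""
    source_schema = f"ucx-{workspace_id}"  # per-workspace schema in HMS
    source = f"{quote('hive_metastore')}.{quote(source_schema)}"
    target = f"{quote(target_catalog)}.{quote(target_schema)}"
    # Rows stay attributable because every table carries a workspace_id column.
    return [
        f"INSERT INTO {target}.{quote(table)} SELECT * FROM {source}.{quote(table)}"
        for table in tables
    ]


statements = copy_inventory_statements("1234567890", ["tables", "grants"])
for statement in statements:
    print(statement)
```

Because source and target tables share the same schema, including the `workspace_id` column, no column list is needed in the generated statements.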

@nfx nfx moved this from Refined to Month Backlog in UCX Dec 6, 2023
@nfx nfx added the cloud/azure issues related to Azure label Feb 5, 2024
@nfx nfx removed migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step enhancement New feature or request step/assessment go/uc/upgrade - Assessment Step labels Apr 15, 2024
@nfx nfx moved this from Month Backlog to Quarter Backlog in UCX May 2, 2024
@nfx nfx moved this from Quarter Backlog to Month Backlog in UCX May 6, 2024
@nfx nfx moved this from Month Backlog to Design in UCX May 6, 2024
@nfx nfx added the feat/cli CLI commands label Jul 3, 2024
@HariGS-DB HariGS-DB self-assigned this Jul 17, 2024
@nfx nfx self-assigned this Jul 17, 2024
@william-conti
Contributor

This ticket does not explicitly mention the scenario where workspace 1 is using an instance profile which only has read permission on a table, but workspace 2 is using a different instance profile that has write permission on the same table.

When we migrate a table to UC on the first workspace using the Glue metastore, we need to make sure that all permissions are gathered across all workspaces.
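One way to picture "gathering permissions across all workspaces" is a union of the observed grants per table before the first migration. This is only a sketch: `gather_permissions`, the tuple layout, and the profile names are hypothetical and not taken from UCX.

```python
from collections import defaultdict


def gather_permissions(observations):
    """Union (principal, privilege) pairs per table across every workspace.

    observations: iterable of (workspace_id, table, principal, privilege).
    """
    merged = defaultdict(set)
    for _workspace_id, table, principal, privilege in observations:
        merged[table].add((principal, privilege))
    return dict(merged)


observations = [
    ("workspace-1", "db1.tbl1", "profile-read-only", "READ"),
    ("workspace-2", "db1.tbl1", "profile-read-write", "READ"),
    ("workspace-2", "db1.tbl1", "profile-read-write", "WRITE"),
]
permissions = gather_permissions(observations)
```

With this view, migrating `db1.tbl1` from workspace 1 would still carry over the write permission that is only visible from workspace 2.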

@nfx nfx assigned JCZuurmond and unassigned nfx Aug 2, 2024
@HariGS-DB HariGS-DB removed their assignment Aug 13, 2024
@JCZuurmond JCZuurmond moved this from Design to Active Backlog in UCX Sep 11, 2024
@JCZuurmond
Member

Start with a federated query for a (new) validate command. Dashboard might come later

@JCZuurmond
Member

> Start with a federated query for a (new) validate command. Dashboard might come later

Is already implemented for external locations in #2341

@JCZuurmond
Member

Verify whether the command works for an external Hive metastore; if so, close the issue.

@gueniai gueniai moved this to Todo in UCX Dec 19, 2024
@JCZuurmond JCZuurmond moved this from Todo to In Progress in UCX Jan 8, 2025
@gueniai gueniai moved this from In Progress to Blocked/Hold in UCX Jan 10, 2025
@gueniai
Collaborator

gueniai commented Jan 23, 2025

Waiting for a demo environment so that we can test this without disrupting other people's work.

@gueniai
Collaborator

gueniai commented Feb 18, 2025

Depends on #3563

Projects
Status: Blocked/Hold
Development

No branches or pull requests

6 participants