[FEATURE]: Migrate direct filesystem access to UC tables #2021

Open
nfx opened this issue Jul 2, 2024 · 3 comments
Labels
migrate/code Abstract Syntax Trees and other dark magic migrate/volumes migrate from raw DBFS mounts to UC Volumes

Comments

@nfx
Collaborator

nfx commented Jul 2, 2024

We have instances of spark.read.format("delta").load("s3a://prefix/...") in the code, but we want to migrate them to spark.table("catalog.schema.table") to follow UC practices. This builds on top of "tables in mounts". See:

Do we migrate to UC Volumes?

yes

Do we resolve mounts?

yes
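
A minimal sketch of what resolving mounts could involve, assuming the mount table is available as a plain dict from mount point to cloud URL (in a workspace it would presumably be built from something like dbutils.fs.mounts()). The function name resolve_mount and the sample mounts are hypothetical, for illustration only:

```python
from __future__ import annotations


def resolve_mount(path: str, mounts: dict[str, str]) -> str | None:
    """Translate a DBFS mount path into the cloud URL behind it.

    Returns None when the path is not under any known mount.
    """
    # Accept the dbfs:/mnt/..., /dbfs/mnt/... and bare /mnt/... spellings.
    for prefix in ("dbfs:", "/dbfs"):
        if path.startswith(prefix + "/"):
            path = path[len(prefix):]
            break
    # Try the longest mount point first so nested mounts resolve correctly.
    for mount_point in sorted(mounts, key=len, reverse=True):
        root = mount_point.rstrip("/")
        if path == root or path.startswith(root + "/"):
            return mounts[mount_point].rstrip("/") + path[len(root):]
    return None


# Hypothetical mount table, for illustration only.
MOUNTS = {
    "/mnt/raw": "s3a://prefix/raw",
    "/mnt/raw/archive": "s3a://cold/archive",
}
```

Matching on a full path segment (rather than a bare string prefix) keeps /mnt/rawdata from being mistaken for something under /mnt/raw.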

Do we resolve dbutils.widgets.get()?

if possible

where to store mappings? add a prefix in the table mapping?

TBD

what scans all jobs?

  1. assessment workflow
  2. migration-progress (new) workflow on a daily schedule

See:

what determines all direct filesystem accesses?

  • Extend FromDbfsFolder, DirectFilesystemAccessMatcher, and FromTable to return file access.
  • Add new matchers for open('/dbfs/...') literals.
  • Modify WorkflowLinter to persist this information in a new table.
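
As a rough illustration of the kind of matching described in the list above, the following sketch walks a Python AST and reports string literals that look like direct filesystem locations, flagging those passed to open(...). It is a standalone toy and not the existing UCX matchers: DirectAccess, DIRECT_PREFIXES, find_direct_access and the sample paths are all hypothetical.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass

# Assumed set of prefixes that indicate direct filesystem access.
DIRECT_PREFIXES = ("s3a://", "s3://", "abfss://", "gs://", "dbfs:/", "/dbfs/", "/mnt/")


@dataclass(frozen=True)
class DirectAccess:
    path: str       # the string literal found in the source
    line: int       # 1-based line number of the literal
    via_open: bool  # True when the literal is the first argument of open(...)


def find_direct_access(source: str) -> list[DirectAccess]:
    tree = ast.parse(source)
    # Remember which nodes are passed as the first positional argument to open().
    open_args = {
        id(node.args[0])
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "open"
        and node.args
    }
    found = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and node.value.startswith(DIRECT_PREFIXES)
        ):
            found.append(DirectAccess(node.value, node.lineno, id(node) in open_args))
    return sorted(found, key=lambda access: access.line)


# Hypothetical notebook source, for illustration only.
SAMPLE = '''
df = spark.read.format("delta").load("s3a://prefix/events")
with open("/dbfs/mnt/raw/config.json") as f:
    pass
other = spark.table("catalog.schema.table")
'''

for access in find_direct_access(SAMPLE):
    print(access)
```

The sketch only sees literals; paths assembled at runtime (for example from dbutils.widgets.get()) would need some form of value inference on top of this.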
@nfx nfx added the migrate/code Abstract Syntax Trees and other dark magic label Jul 2, 2024
@nfx nfx added this to UCX Jul 2, 2024
@github-project-automation github-project-automation bot moved this to Triage in UCX Jul 2, 2024
@nfx nfx moved this from Triage to Design in UCX Jul 2, 2024
@nfx nfx added the migrate/volumes migrate from raw DBFS mounts to UC Volumes label Jul 11, 2024
@ericvergnaud
Contributor

ericvergnaud commented Aug 1, 2024

As I understand it, a plan for implementing this feature could look as follows:

  • assessment

    • A linter that detects direct filesystem access already exists. We need this linter to generate dedicated advice in the relevant scenarios: read or write (or are all scenarios relevant, i.e. since we're migrating the file to UC, do all usages need to be migrated?).
    • The generated Advice will also state whether the file needs to be migrated to a UC volume. If so, it will be migrated with the same file name.
    • The Advice would differ depending on whether a table pointing to the underlying file already exists or needs to be created (to a certain extent that's already the case).
  • migration, there are various scenarios:

    1. the file name cannot be inferred

    2. the table does not exist, and the file must be migrated to a UC volume

    3. the table already exists, but the file must be migrated to a UC volume

      (a scenario where the file is already migrated to a UC volume is deemed impossible)

    There is no plan for 1; manual migration is required.
    For 2, once the table is created, we fall back to 3.

    • migration is done through a new cli command: migrate-direct-file-access
    • for each required migration, the command will perform the following:
      • if the table needs to be created, ucx will suggest a unique name inferred from the location, which the user can accept or change. The name is checked for uniqueness.
      • the corresponding tuple (current_location, uc_location, uc_table_to_use_or_create, file_is_migrated, table_is_created) is saved
    • if the table needs to be created, a table pointing to the UC location is created and the tuple is updated accordingly
    • if the file needs to be migrated to a UC volume, it is migrated and the tuple is updated accordingly
    • the source code is transformed
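
To make the final "source code is transformed" step concrete, here is a minimal sketch of one way it could work, using Python's ast module to turn spark.read.format(...).load("<path>") into spark.table("<name>"). The LoadToTable class, the rewrite helper and the location-to-table dict are hypothetical; this is not how UCX does it, and since ast.unparse drops comments and formatting, a real tool would likely want a formatting-preserving approach.

```python
from __future__ import annotations

import ast


class LoadToTable(ast.NodeTransformer):
    """Rewrite spark.read.format(...).load("<path>") into spark.table("<name>")."""

    def __init__(self, tables: dict[str, str]):
        self._tables = tables  # hypothetical mapping: location -> UC table name

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)
        path = self._loaded_path(node)
        if path is None or path not in self._tables:
            return node  # leave anything we cannot map untouched
        table_call = ast.Call(
            func=ast.Attribute(
                value=ast.Name(id="spark", ctx=ast.Load()), attr="table", ctx=ast.Load()
            ),
            args=[ast.Constant(value=self._tables[path])],
            keywords=[],
        )
        return ast.copy_location(table_call, node)

    @staticmethod
    def _loaded_path(node: ast.Call) -> str | None:
        # Match <reader>.load("literal") with exactly one positional argument ...
        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr == "load"):
            return None
        if len(node.args) != 1 or node.keywords:
            return None
        arg = node.args[0]
        if not (isinstance(arg, ast.Constant) and isinstance(arg.value, str)):
            return None
        # ... where <reader> is spark.read.format(...). The format argument
        # itself is deliberately not inspected in this sketch.
        reader = func.value
        if not (isinstance(reader, ast.Call) and isinstance(reader.func, ast.Attribute)):
            return None
        if reader.func.attr != "format":
            return None
        read = reader.func.value
        is_spark_read = (
            isinstance(read, ast.Attribute)
            and read.attr == "read"
            and isinstance(read.value, ast.Name)
            and read.value.id == "spark"
        )
        return arg.value if is_spark_read else None


def rewrite(source: str, tables: dict[str, str]) -> str:
    tree = LoadToTable(tables).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+; comments and formatting are lost


# Hypothetical location and table name, for illustration only.
print(rewrite(
    'df = spark.read.format("delta").load("s3a://prefix/events")',
    {"s3a://prefix/events": "catalog.schema.table"},
))
```

Calls whose location is not in the mapping are returned unchanged, which matches scenario 1 of the plan where manual migration is required.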

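The bookkeeping tuple and the name suggestion from the plan above could be sketched roughly as follows. The field names are taken from the tuple in the plan; the FileAccessMigration class, suggest_table_name, the naming rule (last path segment, suffixed until unique) and the sample locations are assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileAccessMigration:
    # Field names follow the tuple in the plan; the class itself is hypothetical.
    current_location: str
    uc_location: str
    uc_table_to_use_or_create: str
    file_is_migrated: bool = False
    table_is_created: bool = False


def suggest_table_name(location: str, taken: set[str]) -> str:
    """Derive a table name from the last path segment, suffixing until unique."""
    segment = location.rstrip("/").rsplit("/", 1)[-1]
    base = re.sub(r"\W+", "_", segment).strip("_").lower() or "table"
    candidate, counter = base, 1
    while candidate in taken:
        counter += 1
        candidate = f"{base}_{counter}"
    return candidate


# Hypothetical locations and names, for illustration only.
name = suggest_table_name("s3a://prefix/Daily-Events/", taken={"daily_events"})
record = FileAccessMigration(
    current_location="s3a://prefix/Daily-Events/",
    uc_location="/Volumes/main/raw/landing/Daily-Events",
    uc_table_to_use_or_create=f"main.raw.{name}",
)
# Each completed step produces an updated record rather than mutating in place.
record = replace(record, table_is_created=True)
```

Keeping the two flags separate lets an interrupted run resume: a record with table_is_created set but file_is_migrated unset only needs the file migration and the source transformation.
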
@ericvergnaud
Contributor

@nfx @FastLee can you comment on the plan proposed above?

@ericvergnaud
Contributor

Are we looking to migrate access for any format (csv, parquet, json, delta...) or only delta?
