Support Athena with Delta tables #6351

Open
ozkatz opened this issue Aug 8, 2023 · 4 comments
Labels
area/integrations, AWS, no stale

Comments

@ozkatz
Collaborator

ozkatz commented Aug 8, 2023

Currently, using Athena with lakeFS works by registering symlinks into Glue.

For Delta tables, this won't work (or worse: it will cause deleted Parquet files to be queried as well).

For Delta we should either generate symlinks based on the Delta log, or find another way to query lakeFS from Athena.

@kesarwam
Contributor

kesarwam commented Aug 8, 2023

Yes, it queries deleted files also

@ozkatz
Collaborator Author

ozkatz commented Aug 10, 2023

Some additional context:

Blindly exporting symlinks is a bad idea for open table formats: symlinks work with Hive-style tables, so any Parquet file appearing in the symlink will be queried (including files the table format considers deleted), and partition information might be lost. So that's a no-go.

An almost viable option was to export a "shallow clone" of the delta table: as a post-hook, write a single file, named <ref id>/<table name>/_delta_log/00000000000000000000.json. Inside it, specify the schema, metadata and list of files from the latest calculated snapshot of the table we're exporting - but use absolute URIs to point to the physical addresses of the files making up the given table.
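
A minimal sketch of what generating such a file could look like, assuming the standard Delta commit actions (`protocol`, `metaData`, `add`) written as newline-delimited JSON. The function names, the shape of the `files` input, and the protocol versions chosen here are assumptions made for illustration and are not taken from lakeFS:

```python
import json
import uuid


def build_shallow_clone_commit(schema_string, partition_columns, files, created_time_ms):
    """Return the text of a version-0 Delta commit that lists `files` by absolute URI.

    `files` is an iterable of dicts with the keys uri, size, modification_time_ms
    and partition_values (a hypothetical input shape for this sketch).
    """
    actions = [
        # Protocol versions are an assumption; a real export would copy the source table's.
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {
            "id": str(uuid.uuid4()),
            "format": {"provider": "parquet", "options": {}},
            "schemaString": schema_string,
            "partitionColumns": list(partition_columns),
            "configuration": {},
            "createdTime": created_time_ms,
        }},
    ]
    for f in files:
        actions.append({"add": {
            "path": f["uri"],  # absolute physical address instead of a relative path
            "partitionValues": f["partition_values"],
            "size": f["size"],
            "modificationTime": f["modification_time_ms"],
            "dataChange": True,
        }})
    # A Delta commit file holds one JSON action per line.
    return "\n".join(json.dumps(a) for a in actions) + "\n"


def commit_key(ref_id, table_name):
    """Object key of the single commit file: <ref id>/<table name>/_delta_log/<20 zeros>.json"""
    return f"{ref_id}/{table_name}/_delta_log/{0:020d}.json"
```

The post-hook would then write the returned text to `commit_key(ref_id, table_name)` in the export location.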

I tested this (without lakeFS) by constructing a JSON file that points to arbitrary s3://... paths on the same bucket but in another directory. This works well for Unity! It might also work well for Spark on Glue (haven't tried). For Athena, though, I'm hitting this wall: trinodb/trino#17011. I see exactly this behavior on Athena (v3).
It seems like this fix: trinodb/trino#17038 (once merged, released in Trino, picked up by Athena and made available...) will solve it, but it might take quite a while to get there.

Other options that might work:

  1. Reading the Delta snapshot and exporting symlinks in a way that preserves partitioning information and only includes "live" Parquet files (essentially, exporting the Delta table as a Hive table).
  2. Doing the same as above ^ but instead of symlinks, exporting as Iceberg, which should default to absolute paths anyway.
  3. Using Athena's federated querying abilities. This has a few downsides. It requires quite a bit of development work: the connector would be specific to Delta on lakeFS and would have to implement parts of Delta such as predicate pushdown and other low-level capabilities. The other downside is the operational cost for the user: having to install Lambda functions from the AWS marketplace, set up IAM for them, etc.

Not a fan of any of the above :)
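
For reference, option 1 above can be sketched roughly as follows: replay the `add` and `remove` actions of the Delta log to find the live files, then group them by Hive-style partition directory. This is only an illustration under simplifying assumptions: it ignores checkpoints and null partition values, the function names are hypothetical, and the exact symlink layout that lakeFS and Glue expect is not covered in this thread.

```python
import json
from collections import defaultdict


def live_files(commits):
    """Replay Delta commits (oldest first) and return {path: partitionValues} for live files.

    `commits` is an iterable of commit file contents (newline-delimited JSON).
    Checkpoints and action types other than add/remove are ignored in this sketch.
    """
    live = {}
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                live[add["path"]] = add.get("partitionValues", {})
            elif "remove" in action:
                # A removed file must no longer be visible to queries.
                live.pop(action["remove"]["path"], None)
    return live


def symlink_manifests(live, partition_columns, table_root):
    """Return {partition directory: manifest text}, one file URI per line.

    Null partition values are not handled here.
    """
    manifests = defaultdict(list)
    for path, values in sorted(live.items()):
        partition_dir = "/".join(f"{c}={values[c]}" for c in partition_columns)
        # Delta paths are usually relative to the table root; keep absolute URIs as-is.
        uri = path if "://" in path else f"{table_root.rstrip('/')}/{path}"
        manifests[partition_dir].append(uri)
    return {d: "\n".join(uris) + "\n" for d, uris in manifests.items()}
```

Because only files that survive the replay are listed, deleted Parquet files stay out of the export, and the partition values come from the Delta log rather than from the object paths.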

@johnnyaug johnnyaug removed their assignment Sep 21, 2023

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

@github-actions github-actions bot added the stale label Dec 21, 2023

Closing this issue because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Dec 29, 2023
@itaiad200 itaiad200 added the no stale label and removed the stale label Dec 30, 2023
@itaiad200 itaiad200 reopened this Dec 30, 2023
@github-actions github-actions bot added the stale label Mar 20, 2024
@arielshaqed arielshaqed removed the stale label Mar 27, 2024

6 participants