General question - OPTIMIZE internal workings #4145

Open
keenanwells-tatari opened this issue Feb 11, 2025 · 0 comments

I'm looking for some information regarding how the internals of the OPTIMIZE command work. I've been looking for docs on this but haven't found them.

We have a job that will routinely optimize our Delta tables based on certain criteria, one of which is the table's average file size (which we compute from the DESCRIBE DETAIL columns as sizeInBytes / numFiles).
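
For context, the check is roughly the following. This is a minimal sketch: it takes the average file size as sizeInBytes divided by numFiles, and the 128 MiB threshold and the function names are hypothetical, not something taken from Delta.

```python
from typing import Optional

# Hypothetical threshold: tables whose average file is smaller than this are
# candidates for OPTIMIZE. The value is illustrative, not a Delta default.
TARGET_AVG_FILE_BYTES = 128 * 1024 * 1024


def average_file_size(num_files: int, size_in_bytes: int) -> Optional[float]:
    """Average bytes per file, or None for a table with no files.

    The two arguments correspond to the numFiles and sizeInBytes columns
    returned by DESCRIBE DETAIL.
    """
    if num_files <= 0:
        return None
    return size_in_bytes / num_files


def needs_optimize(
    num_files: int,
    size_in_bytes: int,
    threshold: int = TARGET_AVG_FILE_BYTES,
) -> bool:
    """True when the table has more than one file and a small average size."""
    avg = average_file_size(num_files, size_in_bytes)
    # A table with a single file has nothing to compact.
    return avg is not None and num_files > 1 and avg < threshold
```

The job calls something like `needs_optimize(row["numFiles"], row["sizeInBytes"])` per table, where `row` is the single DESCRIBE DETAIL result row.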

I want to make sure I understand two things: how OPTIMIZE determines which files need to be compacted when it runs against a given table, and what affects the time it takes to complete. I've noticed that an OPTIMIZE run on a table that has already been optimized completes much more quickly, so I'm assuming that the _delta_log is read first and only files that meet a certain size threshold are compacted.
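
To make that assumption concrete, here is a rough sketch of the behavior I'm picturing. It is not taken from the Delta source; `plan_bins`, `min_file_size`, and `max_bin_size` are hypothetical names, and the greedy grouping is just one plausible strategy.

```python
from typing import List, Tuple

File = Tuple[str, int]  # (path, size in bytes), as listed in the table's log


def select_small_files(files: List[File], min_file_size: int) -> List[File]:
    """Keep only files below the size threshold; larger files are left alone."""
    return [f for f in files if f[1] < min_file_size]


def plan_bins(
    files: List[File], min_file_size: int, max_bin_size: int
) -> List[List[File]]:
    """Greedily group small files into bins of at most max_bin_size bytes.

    Each returned bin would be rewritten as one larger file. Bins holding a
    single file are dropped, since rewriting them would gain nothing.
    """
    candidates = sorted(select_small_files(files, min_file_size), key=lambda f: f[1])
    bins: List[List[File]] = []
    current: List[File] = []
    current_size = 0
    for path, size in candidates:
        # Close the current bin when the next file would overflow it.
        if current and current_size + size > max_bin_size:
            bins.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        bins.append(current)
    return [b for b in bins if len(b) > 1]
```

Under this model, a second run finds no bins worth rewriting and only pays for listing the files, which would match the fast repeat runs I'm seeing.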

Anyways, any specific docs or code that maintainers could point me to that would have more details would be much appreciated 🙏

Thanks for all the work on this framework.
