feat: support reading delta tables with delta plugin #263
I'm still not thrilled about special-casing the `delta` plugin like this; would it be better if we simply maintained the dictionary of source-created data frames (or dataframe-like equivalents) in the `Environment` instance, instead of inside of the `Plugin` class, for all of the sources? Would that work?
In my opinion, we have to distinguish between materializations. Correct me if I am wrong:
If you create a table from the df, you have the df instance and copy the data into the duckdb process; if you create a view from the df, you only hold the representation and a pointer to the df, and load it the first time the view is resolved.
This is why the other plugins work with table materialization and without per-cursor configuration.
Therefore I am not sure it is a good idea to hold every df instance in a dictionary in the Environment. Maybe I could make it so that we return a dataframe from the load function and, if the materialization is a view, register it in the Environment. This could work and would also solve the other plugins' problems with view materialization.
Let me try it! Just thought about it while answering 😁
Hi @jwills, I did the refactor now, can you take another look?
I moved the logic into the Environment and abstracted it over the plugins for the case where a plugin sets its materialization to view.
What I would still like to do, though I am not sure where exactly, is to set the default materialization for the delta plugin to view, because that is the better default, but it is not that important.
Curious, why the indirection instead of just referencing `df` directly in the create view statement?
There are two reasons
Should this use `to_pyarrow_dataset()` to enable filter pushdown and avoid materializing the entire delta table in memory?
Hi Alexander, thank you very much for your comments.
I tried hard to make it push down predicates, so I was looking into the duckdb execution graph, which, as I see it now, is exactly the same for the table and the dataset (see a few comments above). But now that you mention it, I remember reading in an article that there is a difference, and as I see in the example here, they use a dataset too.
So I will change it to the dataset. Thank you very much for the good catch!
Prefer https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory ?
I did it similarly to the other test cases, but your suggestion makes sense to me. I will have to look into how to do it.
Thank you for the tip!
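For reference, the `TemporaryDirectory` pattern suggested above is self-cleaning: the directory and everything inside it is removed when the context manager exits, so the test needs no manual teardown. A minimal sketch (the `delta_table` path name is just illustrative):

```python
import tempfile
from pathlib import Path

# The directory exists only inside the with-block; everything in it is
# deleted automatically on exit, even if the test raises.
with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "delta_table"
    target.mkdir()
    (target / "data.txt").write_text("hello")
    assert (target / "data.txt").read_text() == "hello"
    saved = Path(tmpdir)

# After the block, the temporary directory is gone.
print(saved.exists())  # False
```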