Creating DeltaTable object slow #2518

Closed
braaannigan opened this issue May 15, 2024 · 6 comments
Labels
binding/python (Issues for the Python package), bug (Something isn't working)

Comments

@braaannigan
Contributor

Environment

Delta-rs version: 0.17.4

Binding: python

Environment:

  • Cloud provider: AWS
  • OS: macOS
  • Other:

Bug

What happened:
I have a DeltaTable on S3 partitioned by date, with about 60 dates. The partitions have been compacted and vacuumed, so each has one file. I append to this table 100 times a day, so the transaction log has about 6000 JSON files.

When I try to create the DeltaTable object it takes 30 seconds.

What you expected to happen: I expected this operation to be faster but I'm not sure if that's a reasonable expectation?

How to reproduce it: No repro example, I'm just trying to establish if there is something unusual here
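For anyone wanting to put a number on the load time described above, a minimal timing sketch could look like the following. The helper name `time_table_load` and its `open_table` hook are hypothetical (the hook only exists so the timing logic can be tried without S3); by default it assumes the `deltalake` package and its `DeltaTable(table_uri, storage_options=...)` constructor.

```python
import time


def time_table_load(table_uri, storage_options=None, open_table=None):
    """Construct a table object once and return (table, elapsed_seconds).

    open_table is a hypothetical hook for swapping in a stand-in class;
    when omitted, deltalake.DeltaTable is used.
    """
    if open_table is None:
        # Imported lazily so the helper has no hard dependency on deltalake.
        from deltalake import DeltaTable
        open_table = DeltaTable

    start = time.perf_counter()
    table = open_table(table_uri, storage_options=storage_options)
    elapsed = time.perf_counter() - start
    return table, elapsed


# Usage (the bucket and path are placeholders):
#   table, seconds = time_table_load("s3://my-bucket/my-table")
#   print(f"version {table.version()} loaded in {seconds:.1f}s")
```

Running it before and after any change to the table's log should show whether the change affected construction time.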

More details:

braaannigan added the bug label on May 15, 2024
@ion-elgreco
Collaborator

Did you checkpoint?

@braaannigan
Contributor Author

> Did you checkpoint?

Haven't come across that before, how do I do it?

@ion-elgreco
Collaborator

With the latest version it should automatically checkpoint every 100 commits, but you can also do it manually by calling DeltaTable.create_checkpoint().
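A minimal sketch of the manual route, assuming the `deltalake` Python package: `create_checkpoint()` and `version()` are methods on `DeltaTable`, while the wrapper name `checkpoint_now` and the S3 path in the usage comment are made up for illustration.

```python
def checkpoint_now(dt):
    """Write a checkpoint at the table's current version and return that version.

    dt is expected to be a deltalake.DeltaTable; anything exposing
    create_checkpoint() and version() will work.
    """
    dt.create_checkpoint()
    return dt.version()


# Usage (the bucket and path are placeholders):
#   from deltalake import DeltaTable
#   dt = DeltaTable("s3://my-bucket/my-table")
#   checkpoint_now(dt)
```

After a checkpoint exists, a newly constructed `DeltaTable` should be able to start from it rather than replaying every commit file, which is why it is the first thing to check for a table with thousands of commits.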

@PeterKeDer
Contributor

PeterKeDer commented May 16, 2024

We're also seeing a similar issue where constructing the DeltaTable takes 20-30 seconds, which is longer than we expected.

For context, we're using version 0.17.1. Our table is on AWS S3 and has 20000 transaction logs. We have a very recent checkpoint (only 5 versions behind the latest). The checkpoint parquet file is 8 MB in size and has 20000 rows.
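To make the effect of a recent checkpoint concrete, here is a rough sketch of which `_delta_log` files a reader needs in principle, following the Delta protocol's naming (20-digit zero-padded version, `.json` for commits, `.checkpoint.parquet` for a single-part checkpoint). The function names are hypothetical, and this models the protocol only, not what delta-rs actually does internally.

```python
def log_file_name(version):
    """Commit file name for a version: 20-digit zero-padded number plus .json."""
    return f"{version:020d}.json"


def files_to_read(latest_version, checkpoint_version=None):
    """List the _delta_log files needed to reconstruct latest_version.

    Simplified: assumes a single-part checkpoint and ignores everything
    else a real reader has to handle.
    """
    if checkpoint_version is None:
        # No checkpoint: every commit from version 0 onwards is replayed.
        return [log_file_name(v) for v in range(latest_version + 1)]
    if checkpoint_version > latest_version:
        raise ValueError("checkpoint cannot be newer than the latest version")
    # With a checkpoint: the checkpoint itself plus the commits after it.
    files = [f"{checkpoint_version:020d}.checkpoint.parquet"]
    files += [log_file_name(v) for v in range(checkpoint_version + 1, latest_version + 1)]
    return files
```

For a table with 20000 commits (versions 0 to 19999) and a checkpoint 5 versions back, `files_to_read(19999, 19994)` lists 6 files, versus 20000 without a checkpoint.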

I traced through the performance with some custom debug logs. Here are the operations that take the most time:

  • The commit JSON files and the checkpoint parquet file are loaded twice during instantiation
    • Is there a reason we're loading them two separate times? If we remove one, we could save 5-10s
  • Replaying the transactions takes up to 10s
    • Are these performance numbers expected? I'm not too familiar with the exact operations, but this feels slow considering there's only 20000 rows

Edit: disregard these numbers, we were running on debug mode 😅

@braaannigan
Contributor Author

@ion-elgreco Shall I make a PR to document checkpointing a bit more?

rtyler added the binding/python label on May 17, 2024
@rtyler
Member

rtyler commented May 17, 2024

@braaannigan Improving our documentation is always welcome! I'm going to close this in the meantime.

rtyler closed this as completed on May 17, 2024