Count rows as a metadata-only operation #1223
Hi @Visorgood - thank you for raising this issue. As you mentioned,
Woot! I'm a big fan of this idea as well, and if implemented well, I think we could extend a similar approach to other aggregations that rely on statistics like min and max. However, there are challenges in making this a purely metadata-only operation:
This is a great idea! We should leverage Iceberg's robust metadata whenever possible. As mentioned, this would be an optimization for querying an Iceberg table under specific circumstances. There is some prior art here: Trino has implemented it for
I think this is an optimization for the engine side.
Thanks @Visorgood for reaching out here, and that's an excellent idea. We actually already do this in a project like DataHub, see: https://github.com/datahub-project/datahub/blob/0e62c699fc2e4cf2d3525e899037b8277541cfd6/metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py#L141-L162 There are some limitations that @sungwy already pointed out, such as applying a filter. There are a couple more: when you have positional deletes, the row counts are no longer accurate. You would need to apply the deletes and then count, but this requires computation. Also, the upper and lower bounds are truncated by default when the column is a string. For DataHub this is fine, but you need to be aware of the limitations. That said, I do think there is value in a special API to quickly get table/column statistics, and I think the metadata tables are the right place to add it. WDYT?
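For reference, a minimal sketch of the metadata-only approach described above (summing per-file record counts from the scan plan, roughly what the linked DataHub profiler does); the catalog and table names are assumptions:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # assumed catalog name
table = catalog.load_table("db.events")  # assumed table identifier

# Sum the per-file record counts stored in the manifests. Only metadata
# (manifest list and manifests) is read; no data files are opened.
metadata_count = sum(task.file.record_count for task in table.scan().plan_files())

# Caveats, as discussed above: a row_filter only prunes at file granularity,
# so partially matching files are counted in full, and positional/equality
# deletes are not applied, so deleted rows are still included.
print(metadata_count)
```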
Hi @Visorgood, @Fokko, please assign this to me.
@tusharchou Thanks. I was noodling on this, and instead of having a
Hi @Fokko, I agree that positional deletes are confusing to the user. Hence, this value cannot be used as a business metric, but it might help with skewness analysis or load evaluation. I would like to add more test cases, so please suggest any pytest I can reference.
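As a starting point, a sketch of the kind of pytest being discussed; the `table` fixture is an assumption, and the equality only holds for tables without delete files:

```python
def test_metadata_row_count_matches_materialized_count(table):
    scan = table.scan()
    # Metadata-only count: sum the record counts of the planned data files.
    metadata_count = sum(task.file.record_count for task in scan.plan_files())
    # Ground truth: materialize the scan and count the rows.
    assert metadata_count == scan.to_arrow().num_rows
```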
We have the same use case and concerns about loading too much data into memory for counting; the way I'm doing it is to use
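The rest of that comment was not preserved; one way to keep an Arrow-based count from pulling every column into memory (a sketch, assuming a table with a narrow `id` column) is to project a single field before materializing:

```python
# "id" is an assumed column name; any small column works.
row_count = (
    table.scan(row_filter="status = 'active'", selected_fields=("id",))
    .to_arrow()
    .num_rows
)
```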
Feature Request / Improvement
Hello!
I'm using PyIceberg 0.7.1
I have a use case where I need to count rows given a certain filter, and I was expecting it to be doable with PyIceberg as a metadata-only operation, given that manifest files contain counts of rows in each data file.
I figured out this code to count rows:
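The snippet itself did not survive in this thread; a plausible reconstruction of the kind of code being described, with an assumed catalog, table, and filter expression, is:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # assumed catalog name
table = catalog.load_table("db.events")  # assumed table identifier
query = "status = 'active'"              # assumed filter expression

# Materializes every matching row as a PyArrow table, then counts.
row_count = table.scan(row_filter=query).to_arrow().num_rows
```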
but this is loading the filtered data (using the `query` expression) into memory first, and only then calculating the count. I couldn't figure out code that would return the result without first converting to either a `duckdb` or a `pyarrow` dataframe. Is there a way to do such an operation without loading data into memory - as a metadata-only operation?
If not, I believe this would be a good feature to have in PyIceberg.
I have tried Daft, which is supposed to be a "fully lazily optimized query engine interface on top of PyIceberg tables", but it still seems to need to load data into memory, even when I do `.limit(1)`.