Add predicate pushdown #75

Closed
wants to merge 10 commits into from

Conversation

mike-luabase

No description provided.

@mike-luabase
Author

@samansmink I made the updates here. I tried to remove the data dir from the commit, but wasn't sure of the best way to do that.

@mike-luabase
Author

mike-luabase commented Oct 10, 2024

Enhance query performance by filtering data at the metadata level, reducing the amount of data read during scans.

Key Changes

  • Extended IcebergManifestEntry:

    • Added lower_bounds and upper_bounds maps to store column statistics.
  • Utility Function:

    • Implemented IcebergUtils::GetFullPath to resolve file paths accurately.
  • Metadata Retrieval:

    • Added GetEntries template in IcebergTable to fetch relevant manifest entries, excluding deleted ones.
  • Predicate Evaluation:

    • Created EvaluatePredicateAgainstStatistics to assess if data files satisfy query predicates based on their statistics.
    • For each predicate:
      • Identifies the column involved.
      • Checks if the column has defined lower and upper bounds.
      • Based on the comparison type (e.g., =, >, <), determines if the predicate can be satisfied given the file's bounds.
      • If any predicate fails, the file is excluded from the scan.
  • Scan Expression Modification:

    • Updated MakeScanExpression to filter data_file_entries using predicates before scanning.
  • Binding Function Enhancements:

    • Added logging to IcebergScanBindReplace, which now prepares the list of data files based on the predicate results.

This implementation uses Iceberg metadata to filter data files before the scan starts, so files whose statistics rule out a match are never read.
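The predicate evaluation steps listed above can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: `CompareOp`, `Predicate`, `FileStats`, `PredicateMayMatch` and `FileMayMatch` are hypothetical names, bounds are simplified to 64-bit integers, and keeping a file when a bound is missing is an assumption (the description only says the check looks for defined bounds).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified types for illustration; not the extension's classes.
enum class CompareOp { EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL };

struct Predicate {
    int column_id;    // column the predicate refers to
    CompareOp op;     // comparison type
    int64_t constant; // value the column is compared against
};

// Per-file column statistics, mirroring the lower_bounds / upper_bounds maps.
struct FileStats {
    std::map<int, int64_t> lower_bounds;
    std::map<int, int64_t> upper_bounds;
};

// Returns false only when the bounds prove that no row in the file can match.
bool PredicateMayMatch(const Predicate &pred, const FileStats &stats) {
    auto lower = stats.lower_bounds.find(pred.column_id);
    auto upper = stats.upper_bounds.find(pred.column_id);
    if (lower == stats.lower_bounds.end() || upper == stats.upper_bounds.end()) {
        return true; // assumption: without both bounds the file cannot be pruned
    }
    const int64_t lo = lower->second;
    const int64_t hi = upper->second;
    assert(lo <= hi);
    switch (pred.op) {
    case CompareOp::EQUAL:
        return lo <= pred.constant && pred.constant <= hi;
    case CompareOp::GREATER_THAN:
        return hi > pred.constant;
    case CompareOp::GREATER_THAN_OR_EQUAL:
        return hi >= pred.constant;
    case CompareOp::LESS_THAN:
        return lo < pred.constant;
    case CompareOp::LESS_THAN_OR_EQUAL:
        return lo <= pred.constant;
    }
    return true; // unknown comparison: stay conservative and keep the file
}

// A file is scanned only if every predicate may be satisfied by its bounds.
bool FileMayMatch(const std::vector<Predicate> &predicates, const FileStats &stats) {
    for (const auto &pred : predicates) {
        if (!PredicateMayMatch(pred, stats)) {
            return false;
        }
    }
    return true;
}
```

In this sketch a file whose column 1 has bounds [10, 20] is kept for `col1 = 15` and skipped for `col1 > 20`. The check is deliberately conservative: it can only exclude files, never rows, so the regular filter still has to run on whatever is scanned.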

@carlopi carlopi requested a review from samansmink October 14, 2024 13:15
Contributor

@Mytherin Mytherin left a comment

Thanks for the PR! Some comments from my side. Could you also look at the failing CI?

@@ -1,2 +0,0 @@
count
Contributor

Do we need to delete all of these files?



#include <sstream>
#include "boost/any.hpp"
#include <any>
Contributor

This code is generated - should this be modified?


return make_uniq<ComparisonExpression>(ExpressionType::COMPARE_NOT_DISTINCT_FROM, std::move(data_filename_expr),
Contributor

All these format changes make the code hard to review - can we leave the old format in place?

@mike-luabase
Author

@Mytherin thanks for the review! I cleaned up the issues here: #78

will close this one.

@mike-luabase
Author

closing, see above.

2 participants