hydro is a collection of Python-based Apache Spark and Delta Lake extensions.
See Key Functionality for concrete use cases.
hydro is intended for developers and engineers of any skill level who work with Delta Lake tables and Spark DataFrames in Python.
hydro is compatible with the Databricks platform and with any other environment where PySpark and Delta Lake can be installed, such as a laptop.
hydro is well tested but not yet battle-hardened. Use it at your own risk.
Install with pip:
pip install spark-hydro
Documentation: https://christophergrant.github.io/hydro
- Correctly apply Slowly Changing Dimension (SCD) updates to Delta Lake tables - hydro.delta.scd and hydro.delta.bootstrap_scd2
- Query Delta Log metadata to retrieve file-level statistics quickly and efficiently, even on petabyte-scale tables - hydro.delta.file_stats and hydro.delta.partition_stats
- Infer the schema of JSON columns - hydro.spark.infer_json_schema
- Drop nested fields from a Spark DataFrame - hydro.spark.drop_fields
- Quality of life improvements like hydro.delta.detail_enhanced and hydro.spark.fields
- And more... check the docs!
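
For readers new to Slowly Changing Dimensions, here is a minimal plain-Python sketch of the Type 2 idea: when a tracked value changes, the current row is closed and a new current row is opened. It runs without Spark or Delta Lake and does not use hydro or reflect its API. The names `Row` and `scd2_merge` are hypothetical and exist only for this illustration; see the docs for the actual signatures of hydro.delta.scd and hydro.delta.bootstrap_scd2.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """One version of a dimension member (hypothetical, for illustration only)."""
    key: str                    # business key of the dimension member
    value: str                  # the tracked attribute
    start: date                 # first day this version is valid
    end: Optional[date] = None  # None means "still current"


def scd2_merge(history: List[Row], updates: Dict[str, str], effective: date) -> List[Row]:
    """Apply {key: new_value} updates as of `effective` and return a new history."""
    result = []
    keys_with_current_row = set()
    for row in history:
        is_current = row.end is None
        if is_current:
            keys_with_current_row.add(row.key)
        if is_current and row.key in updates and updates[row.key] != row.value:
            # The value changed: close the old version and open a new one.
            result.append(replace(row, end=effective))
            result.append(Row(row.key, updates[row.key], effective))
        else:
            # Closed rows, untouched keys, and unchanged values pass through as-is.
            result.append(row)
    # Keys without a current row are inserted as new dimension members.
    for key, value in updates.items():
        if key not in keys_with_current_row:
            result.append(Row(key, value, effective))
    return result


history = [Row("cust-1", "Berlin", date(2023, 1, 1))]
history = scd2_merge(history, {"cust-1": "Paris", "cust-2": "Oslo"}, date(2024, 6, 1))
for row in history:
    print(row)
```

After the merge, the history holds the closed "Berlin" row, a current "Paris" row for the same key, and a current row for the new key. Re-applying an update with an unchanged value leaves the history as it was, which is what keeps repeated loads from creating duplicate versions.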
Contributions are welcome!
Please create an issue and discuss before starting work on a feature to make sure that it aligns with the future of the project.