This repo has all the resources you need to become an amazing data engineer focused on Delta Lake with Apache Spark!
This data pipeline architecture implements a modern data processing and storage solution using a stack of open-source technologies. The system is containerized using Docker and Docker Compose for easy deployment and scalability.
- DBeaver: A universal database tool for managing database connections and running queries
- Pandas: Python data analysis library for data manipulation and preprocessing
- Beeline: Command-line JDBC client for connecting to HiveServer2-compatible endpoints (such as Kyuubi) and executing queries
- Kyuubi SQL Gateway: Serves as the SQL interface gateway, providing multi-tenant Spark SQL engines (see the connection sketch after this list)
- Apache Spark: Core processing engine for large-scale data processing and analytics
- Metastore & Metastore DB:
  - Hive Metastore service for managing metadata
  - Stores metadata about Spark SQL tables, partitions, and schemas
- Delta Lake: Storage layer that brings ACID transactions and versioning to data lakes
- Hadoop AWS: Provides the S3A connector used to read and write data in S3-compatible object storage
- MinIO: Object storage system compatible with the Amazon S3 API
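
Kyuubi exposes a HiveServer2-compatible Thrift endpoint, so any Hive-aware client can talk to it. As a rough sketch (the host, the default frontend port 10009, and the anonymous user below are assumptions; adjust them to your docker-compose setup), a Python client using PyHive might look like this:

```python
from pyhive import hive  # pip install "pyhive[hive]"

# Kyuubi speaks the HiveServer2 Thrift protocol, so the standard
# Hive client libraries can connect to it directly.
conn = hive.connect(host="localhost", port=10009, username="anonymous")
cursor = conn.cursor()

cursor.execute("SHOW DATABASES")
print(cursor.fetchall())

cursor.close()
conn.close()
```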
- Data enters the system through various input sources (DBeaver, Pandas, or Beeline)
- Kyuubi SQL Gateway manages the routing of queries and data to Spark engines
- Apache Spark processes the data, interacting with the Metastore for metadata management
- Processed data is stored in Delta Lake format
- Final data is persisted in the storage layer through the Hadoop AWS (S3A) connector to MinIO, as sketched below
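
To make the flow concrete, here is a minimal PySpark sketch that writes a small Delta table to MinIO over S3A. It assumes the delta-spark and hadoop-aws packages are on the classpath; the endpoint, credentials, and bucket name are placeholders for whatever your docker-compose environment defines:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-minio-demo")
    # Enable Delta Lake on this Spark session
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Point the S3A connector (hadoop-aws) at the MinIO container
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Write a small example dataset in Delta format to the object store
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("s3a://warehouse/demo_table")

# Read it back to confirm the round trip
spark.read.format("delta").load("s3a://warehouse/demo_table").show()
```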
The entire architecture is containerized using Docker, with Docker Compose orchestrating the multi-container application. This ensures:
- Consistent development and production environments
- Easy deployment and scaling
- Isolated services with defined networking
- Simplified dependency management
- Multi-tenant Support: Through Kyuubi SQL Gateway
- ACID Compliance: Provided by Delta Lake
- Metadata Management: Via Hive Metastore
- Scalable Processing: Using Apache Spark
- S3-Compatible Storage: Through MINIO
- Multiple Access Methods: Supporting various data tools and interfaces
This architecture is suitable for:
- Data warehousing
- Real-time analytics
- Large-scale ETL operations
- Data lake implementations
- Business intelligence applications
```mermaid
flowchart LR
dbeaver[DBeaver]
pandas[Pandas]
beeline[Beeline]
kyuubi[Kyuubi SQL Gateway]
spark[Apache Spark]
metastore[Metastore]
metastoredb[Metastore DB]
deltalake[Delta Lake]
hadoop[Hadoop AWS]
storage[Storage/MinIO]
subgraph Input Sources
dbeaver
pandas
beeline
end
subgraph Processing Layer
kyuubi
spark
metastore
metastoredb
end
subgraph Storage Layer
deltalake
hadoop
storage
end
dbeaver --> kyuubi
pandas --> kyuubi
beeline --> kyuubi
kyuubi --> spark
spark --> metastore
spark --> deltalake
metastore --> metastoredb
deltalake --> hadoop
hadoop --> storage
```
Operation | Parquet | ORC | Delta Lake | Apache Iceberg |
---|---|---|---|---|
Basic Create | `CREATE TABLE t1 (id INT) USING PARQUET` | `CREATE TABLE t1 (id INT) USING ORC` | `CREATE TABLE t1 (id INT) USING DELTA` | `CREATE TABLE t1 (id INT) USING ICEBERG` |
Partitioned Table | `CREATE TABLE t1 (id INT, dt DATE) USING PARQUET PARTITIONED BY (dt)` | `CREATE TABLE t1 (id INT, dt DATE) USING ORC PARTITIONED BY (dt)` | `CREATE TABLE t1 (id INT, dt DATE) USING DELTA PARTITIONED BY (dt)` | `CREATE TABLE t1 (id INT, dt DATE) USING ICEBERG PARTITIONED BY (dt)` |
Insert Data | `INSERT INTO t1 VALUES (1)` | `INSERT INTO t1 VALUES (1)` | `INSERT INTO t1 VALUES (1)` | `INSERT INTO t1 VALUES (1)` |
Merge/Upsert | Not supported | Not supported | `MERGE INTO t1 USING t2 ON t1.id = t2.id WHEN MATCHED THEN UPDATE ...` | `MERGE INTO t1 USING t2 ON t1.id = t2.id WHEN MATCHED THEN UPDATE ...` |
Time Travel | Not supported | Not supported | `SELECT * FROM t1 VERSION AS OF 1` | `SELECT * FROM t1 VERSION AS OF 1234567` |
Schema Evolution | Manual `ALTER TABLE` | Manual `ALTER TABLE` | `ALTER TABLE t1 ADD COLUMNS (name STRING)` | `ALTER TABLE t1 ADD COLUMN name STRING` |
Streaming Write | `writeStream.format("parquet")` | `writeStream.format("orc")` | `writeStream.format("delta")` | `writeStream.format("iceberg")` |
Delete Records | Not supported | Not supported | `DELETE FROM t1 WHERE id = 1` | `DELETE FROM t1 WHERE id = 1` |
Update Records | Not supported | Not supported | `UPDATE t1 SET col = 'value' WHERE id = 1` | `UPDATE t1 SET col = 'value' WHERE id = 1` |
Vacuum/Clean | Manual file delete | Manual file delete | `VACUUM t1 RETAIN 168 HOURS` | `CALL catalog.system.rewrite_data_files('t1')` |
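
For the Delta Lake column above, the same merge and time-travel operations are also exposed through the Python API. A minimal sketch, assuming the delta-spark package is installed and a Delta table `t1` with an `id` column already exists:

```python
from delta.tables import DeltaTable

# Hypothetical source of updates; in practice this could be any DataFrame
updates = spark.createDataFrame([(1, "updated"), (3, "new")], ["id", "col"])

target = DeltaTable.forName(spark, "t1")

# Equivalent of the MERGE INTO statement in the table above
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: query the table as of an earlier version
spark.sql("SELECT * FROM t1 VERSION AS OF 0").show()
```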
Feature | Parquet | ORC | Delta Lake | Apache Iceberg |
---|---|---|---|---|
ACID Transactions | ❌ | ❌ | ✅ | ✅ |
Schema Evolution | Limited | Limited | ✅ | ✅ |
Time Travel | ❌ | ❌ | ✅ | ✅ |
CDC Support | ❌ | ❌ | ✅ | ✅ |
Partition Evolution | ❌ | ❌ | ❌ | ✅ |
Row-level Operations | ❌ | ❌ | ✅ | ✅ |
Streaming Support | Limited | Limited | ✅ | ✅ |
Cloud Storage Support | ✅ | ✅ | ✅ | ✅ |
Metadata Management | File-based | File-based | Transaction Log | Table Format |
Unity Catalog Support | ❌ | ❌ | ✅ | ✅ |
Delta Sharing Support | ❌ | ❌ | ✅ | Limited |
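
As an illustration of the CDC Support row, Delta Lake's Change Data Feed can be read back through the DataFrame API once the `delta.enableChangeDataFeed` table property is enabled (a minimal sketch; the table name and starting version are placeholders):

```python
# Assumes a Delta table created with 'delta.enableChangeDataFeed' = 'true'
# (see the table properties examples further below).
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")  # return the change feed instead of the current snapshot
    .option("startingVersion", 0)      # read changes from the first table version onward
    .table("delta_table")
)

# Each change row carries _change_type, _commit_version, and _commit_timestamp columns
changes.show(truncate=False)
```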
Property | Parquet | ORC | Delta Lake | Apache Iceberg |
---|---|---|---|---|
File Format | Columnar | Columnar | Parquet-based | Parquet (default), ORC, or Avro |
Compression | Snappy/GZIP | ZLIB/Snappy | Inherits Parquet | Inherits Parquet |
Statistics | Basic | Advanced | Advanced | Advanced |
Metadata Storage | File Level | File Level | Transaction Log | Metadata Tables |
Format | Best Used For |
---|---|
Parquet | Static analytics data; simple data lake storage; when only columnar storage is needed |
ORC | Hive workloads; Hadoop ecosystem; legacy systems integration |
Delta Lake | Data lakehouse architecture; real-time + batch processing; when ACID is required; Databricks ecosystem |
Iceberg | Cloud-native data lakes; multi-table transactions; complex schema evolution; vendor-neutral solution |
```sql
-- Delta Lake Properties
CREATE TABLE delta_table (id INT)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);

-- Iceberg Properties
CREATE TABLE iceberg_table (id INT)
USING ICEBERG
TBLPROPERTIES (
  'format-version' = '2',
  'write.merge.mode' = 'merge-on-read'
);

-- Parquet Properties
CREATE TABLE parquet_table (id INT)
USING PARQUET
TBLPROPERTIES (
  'parquet.compression' = 'SNAPPY'
);

-- ORC Properties
CREATE TABLE orc_table (id INT)
USING ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY'
);
```
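
These property blocks can also be issued from a Spark session. A small sketch (reusing the Delta table name from the snippet above) that creates the table and then lists the properties it ended up with:

```python
# Create the Delta table with the properties shown above
spark.sql("""
    CREATE TABLE IF NOT EXISTS delta_table (id INT)
    USING DELTA
    TBLPROPERTIES (
        'delta.enableChangeDataFeed' = 'true',
        'delta.autoOptimize.optimizeWrite' = 'true'
    )
""")

# Verify which properties are set on the table
spark.sql("SHOW TBLPROPERTIES delta_table").show(truncate=False)
```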
Data Engineering Communities
Sample Delta Lake Configuration Projects