A test data management tool with automated data generation, validation, and cleanup.

Generate data for databases, files, messaging systems, or HTTP requests via the UI, Scala/Java SDK, or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.
Scala/Java examples can be found here.
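A typical run wires a generation task and a validation task together and executes them via Spark. The sketch below simply composes the Postgres generation and Parquet validation patterns shown later in this README; the `PlanRun` base class and the `execute` entry point are assumptions about the Scala SDK, so check the linked Scala/Java examples for the exact class and method names.

```scala
// Sketch only: PlanRun and execute(...) are assumed SDK entry points.
class CustomerPlan extends PlanRun {
  // Generate accounts into Postgres with a patterned, unique account_id
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

  // Validate that the downstream Parquet output has the expected record count
  val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
    .validation(validation.count.isEqual(1000))

  execute(postgresTask, parquetValidation)
}
```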
- Batch and/or event data generation
- Maintain relationships across any dataset
- Create custom data generation/validation scenarios
- Data validation
- Clean up generated and downstream data
- Suggest data validations
- Metadata discovery
- Detailed report of generated data and validation results
- Alerts to be notified of results
- Run as GitHub Action
Follow the quick start instructions here.
Data Caterer supports the data sources below. Check here for the full roadmap.
| Data Source Type | Data Source | Support |
|---|---|---|
| Cloud Storage | AWS S3 | ✅ |
| Cloud Storage | Azure Blob Storage | ✅ |
| Cloud Storage | GCP Cloud Storage | ✅ |
| Database | Cassandra | ✅ |
| Database | MySQL | ✅ |
| Database | Postgres | ✅ |
| Database | Elasticsearch | ❌ |
| Database | MongoDB | ❌ |
| File | CSV | ✅ |
| File | Delta Lake | ✅ |
| File | JSON | ✅ |
| File | Iceberg | ✅ |
| File | ORC | ✅ |
| File | Parquet | ✅ |
| File | Hudi | ❌ |
| HTTP | REST API | ✅ |
| Messaging | Kafka | ✅ |
| Messaging | Solace | ✅ |
| Messaging | ActiveMQ | ❌ |
| Messaging | Pulsar | ❌ |
| Messaging | RabbitMQ | ❌ |
| Metadata | Data Contract CLI | ✅ |
| Metadata | Great Expectations | ✅ |
| Metadata | Marquez | ✅ |
| Metadata | OpenAPI/Swagger | ✅ |
| Metadata | OpenMetadata | ✅ |
| Metadata | Open Data Contract Standard (ODCS) | ✅ |
| Metadata | Amundsen | ❌ |
| Metadata | Datahub | ❌ |
| Metadata | Solace Event Portal | ❌ |
Different ways to run Data Caterer based on your use case:
Design motivations and details can be found here.
Check here for the full list of roadmap items.
Data Caterer is set up under a usage-based pricing model for the latest application version, with different pricing tiers based on how much you use Data Caterer. This also includes support and feature requests. The current open-source version will be kept for those who want to continue using it.
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer") // name and URL
```
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count.isEqual(1000))
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
```
```scala
kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(kafkaTask -> List("account_id"))
)
```
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerField(5, "account_id"))
```
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(1).max(5), "account_id"))
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```
Suppose you also want to delete the data in Cassandra, because your job consumed the data in Postgres and pushed it to Cassandra:
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("account_id"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```
```scala
parquet("customer_parquet", "/data/parquet/customer")
  .fields(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))
```
```scala
http("my_http")
  .fields(metadataSource.openApi("/data/http/petstore.json"))
```
```scala
parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
```