A test data management tool with automated data generation, validation, and cleanup.

Generate data for databases, files, messaging systems, or HTTP requests via the UI, Scala/Java SDK, or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.
Scala/Java examples can be found here.
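A typical run wires a generation task and a validation task together and executes them via Spark. The sketch below simply composes the Postgres generation and Parquet validation patterns shown later in this README; the `PlanRun` base class and the `execute` entry point are assumptions about the Scala SDK, so check the linked Scala/Java examples for the exact class and method names.

```scala
// Sketch only: PlanRun and execute(...) are assumed SDK entry points.
class CustomerPlan extends PlanRun {
  // Generate accounts into Postgres with a patterned, unique account_id
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

  // Validate that the downstream Parquet output has the expected record count
  val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
    .validation(validation.count.isEqual(1000))

  execute(postgresTask, parquetValidation)
}
```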
- Batch and/or event data generation
- Maintain relationships across any dataset
- Create custom data generation/validation scenarios
- Data validation
- Clean up generated and downstream data
- Suggest data validations
- Metadata discovery
- Detailed report of generated data and validation results
- Alerts to be notified of results
- Run as GitHub Action
Follow the quick start instructions here.
Data Caterer supports the data sources below. Check here for the full roadmap.
| Data Source Type | Data Source | Support |
|---|---|---|
| Cloud Storage | AWS S3 | ✅ |
| Cloud Storage | Azure Blob Storage | ✅ |
| Cloud Storage | GCP Cloud Storage | ✅ |
| Database | Cassandra | ✅ |
| Database | MySQL | ✅ |
| Database | Postgres | ✅ |
| Database | Elasticsearch | ❌ |
| Database | MongoDB | ❌ |
| File | CSV | ✅ |
| File | Delta Lake | ✅ |
| File | JSON | ✅ |
| File | Iceberg | ✅ |
| File | ORC | ✅ |
| File | Parquet | ✅ |
| File | Hudi | ❌ |
| HTTP | REST API | ✅ |
| Messaging | Kafka | ✅ |
| Messaging | Solace | ✅ |
| Messaging | ActiveMQ | ❌ |
| Messaging | Pulsar | ❌ |
| Messaging | RabbitMQ | ❌ |
| Metadata | Data Contract CLI | ✅ |
| Metadata | Great Expectations | ✅ |
| Metadata | Marquez | ✅ |
| Metadata | OpenAPI/Swagger | ✅ |
| Metadata | OpenMetadata | ✅ |
| Metadata | Open Data Contract Standard (ODCS) | ✅ |
| Metadata | Amundsen | ❌ |
| Metadata | Datahub | ❌ |
| Metadata | Solace Event Portal | ❌ |
Different ways to run Data Caterer based on your use case:
Design motivations and details can be found here.
Check here for the full list of roadmap items.
Data Caterer is set up under a usage-based pricing model for the latest application version, with different pricing tiers based on how much you use Data Caterer. This also includes support and feature requests. The current open-source version will be kept for those who want to continue using it.
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer") // name and URL
```
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count.isEqual(1000))
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
```
```scala
kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(kafkaTask -> List("account_id"))
)
```
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerField(5, "account_id"))
```
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(1).max(5), "account_id"))
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```
Suppose you also want to delete the data in Cassandra, because your job consumed the data in Postgres and pushed it to Cassandra:
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("account_id"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```
```scala
parquet("customer_parquet", "/data/parquet/customer")
  .fields(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))
```
```scala
http("my_http")
  .fields(metadataSource.openApi("/data/http/petstore.json"))
```
```scala
parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
```