Docs update #62

Merged
17 commits, merged Mar 7, 2023
264 changes: 172 additions & 92 deletions README.md
@@ -1,8 +1,30 @@
# Datagen CLI

This command line interface application allows you to take schemas defined in JSON (`.json`), Avro (`.avsc`), or SQL (`.sql`) and produce believable fake data to Kafka in JSON or Avro format.

The benefits of using this datagen tool are:
- You can specify what values are generated using the expansive [FakerJS API](https://fakerjs.dev/api/) to craft data that more faithfully imitates your use case. This allows you to more easily apply business logic downstream.
- This is a relatively simple CLI tool compared to other Kafka data generators that require Kafka Connect.
- When using the `avro` output format, datagen connects to Schema Registry. This allows you to take advantage of the [benefits](https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/) of using Schema Registry.
- Often when you generate random data, your downstream join results won't make sense because it's unlikely a randomly generated field in one dataset will match a randomly generated field in another. With this datagen tool, you can specify relationships between your datasets so that related columns will match up, resulting in meaningful joins downstream. Jump to the [end-to-end ecommerce tutorial](./examples/ecommerce.md) for a full example.

> :construction: Specifying relationships between datasets currently requires using JSON for the input schema.

## Installation

### npm

```
npm install -g @materializeinc/datagen
```
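After installing, a quick way to confirm the CLI is on your `PATH` is to print its version and help text (both flags appear in the Usage section below):

```bash
datagen --version
datagen -h
```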

### Docker

```
docker pull materialize/datagen
```
### From Source


```bash
git clone https://github.com/MaterializeInc/datagen.git
@@ -11,113 +33,190 @@
npm install
npm link
```

## Setup

Create a file called `.env` with the following environment variables

```bash
# Connect to Kafka
SASL_USERNAME=
SASL_PASSWORD=
SASL_MECHANISM=
KAFKA_BROKERS=

# Connect to Schema Registry if using '--format avro'
SCHEMA_REGISTRY_URL=
SCHEMA_REGISTRY_USERNAME=
SCHEMA_REGISTRY_PASSWORD=
```

The `datagen` program will read the environment variables from `.env` in the current working directory.
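As a rough illustration, a filled-in `.env` might look like the following. Every value here is a placeholder, and details such as the accepted SASL mechanisms or how multiple brokers are listed depend on your cluster, so adjust accordingly:

```bash
# Placeholder credentials -- replace with your own cluster details
SASL_USERNAME=my-api-key
SASL_PASSWORD=my-api-secret
SASL_MECHANISM=plain
KAFKA_BROKERS=broker-1.example.com:9092

# Only needed with '--format avro'
SCHEMA_REGISTRY_URL=https://schema-registry.example.com
SCHEMA_REGISTRY_USERNAME=sr-user
SCHEMA_REGISTRY_PASSWORD=sr-password
```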


## Usage

```bash
datagen -h
```

```
Usage: datagen [options]

Fake Data Generator

Options:
  -V, --version             output the version number
  -s, --schema <char>       Schema file to use
  -f, --format <char>       The format of the produced data (choices: "json", "avro", default: "json")
  -n, --number <char>       Number of records to generate. For infinite records, use -1 (default: "10")
  -c, --clean               Clean (delete) Kafka topics and schema subjects previously created
  -dr, --dry-run            Dry run (no data will be produced to Kafka)
  -d, --debug               Output extra debugging information
  -w, --wait <int>          Wait time in ms between record production
  -rs, --record-size <int>  Record size in bytes, eg. 1048576 for 1MB
  -p, --prefix <char>       Kafka topic and schema registry prefix
  -h, --help                display help for command
```

## Quick Examples

See example input schema files in the [examples](./examples) and [tests](./tests) folders.

### Quickstart

1. Iterate through a schema defined in SQL 10 times without actually interacting with Kafka or Schema Registry (a dry run), and print extra debugging output.
```bash
datagen --schema tests/products.sql --format avro --dry-run --debug
```

1. Same as above, but actually create the schema subjects and Kafka topics, and actually produce the data. There is less output because debug mode is off.
```bash
datagen \
--schema tests/products.sql \
--format avro
```

1. Same as above, but produce to Kafka continuously. Press `Ctrl+C` to quit.
```bash
datagen \
-s tests/products.sql \
-f avro \
-n -1
```

1. If you want to generate a larger payload, you can use the `--record-size` option to specify the number of bytes of junk data to add to each record. Here, each record carries 1MB of padding, so generating 1000 records produces roughly 1GB of data:
```bash
datagen \
-s tests/products.sql \
-f avro \
-n 1000 \
--record-size 1048576
```
This adds a `recordSizePayload` field of the specified size to each record before it is sent to Kafka.

> :notebook: The 'Max Message Size' of your Kafka cluster needs to be set to a value higher than 1MB for this to work.

1. Clean (delete) the topics and schema subjects created above
```bash
datagen \
--schema tests/products.sql \
--format avro \
--clean
```
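The remaining flags from the help output combine in the same way. For example, here is a sketch (the flag values are arbitrary) that produces records continuously, waits one second between records, and prefixes the topic and schema subject names:

```bash
datagen \
  --schema tests/products.sql \
  --format json \
  --number -1 \
  --wait 1000 \
  --prefix dev_
```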

### Generate records with sequence numbers

To simulate auto incrementing primary keys, you can use the `iteration.index` variable in the schema.

This is particularly useful when you want to generate a small set of records with sequence of IDs, for example 1000 records with IDs from 1 to 1000:

```json
[
  {
    "_meta": {
      "topic": "mz_datagen_users"
    },
    "id": "iteration.index",
    "name": "internet.userName",
    "email": "internet.exampleEmail",
    "phone": "phone.imei",
    "website": "internet.domainName",
    "city": "address.city",
    "company": "company.name",
    "age": "datatype.number",
    "created_at": "datatype.datetime"
  }
]
```

Example:

```
datagen \
-s tests/iterationIndex.json \
-f json \
-n 1000 \
--dry-run
```

### Docker

Run the Docker container the same way you would run the CLI locally, except:
- include `--rm` to remove the container when it exits
- include `-it` (interactive terminal/TTY) so the output looks the same as it does locally (e.g. colors)
- mount your `.env` file and schema files into the container
- note that the working directory inside the container is `/app`

```
docker run \
--rm -it \
-v ${PWD}/.env:/app/.env \
-v ${PWD}/tests/schema.json:/app/blah.json \
materialize/datagen -s blah.json -n 1 --dry-run
```
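A similar invocation should work for the other formats; for example, mounting an Avro input schema instead (the mount paths here are illustrative):

```
docker run \
  --rm -it \
  -v ${PWD}/.env:/app/.env \
  -v ${PWD}/tests/schema.avsc:/app/schema.avsc \
  materialize/datagen -s schema.avsc -f avro -n 10 --dry-run
```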

## Input Schemas

You can define input schemas using JSON (`.json`), Avro (`.avsc`), or SQL (`.sql`). Within those schemas, you use the [FakerJS API](https://fakerjs.dev/api/) to define the data that is generated for each field.

You can pass arguments to `faker` methods by escaping quotes. For example, here is [datatype.number](https://fakerjs.dev/api/datatype.html#number) with `min` and `max` arguments:

```
"datatype.number({\"min\": 100, \"max\": 1000})"
```
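For instance, dropped into a JSON input schema, the escaped call sits in a field value just like an argument-free method would (the field name here is only an illustration):

```json
{
  "age": "datatype.number({\"min\": 18, \"max\": 95})"
}
```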

> :construction: Right now, JSON is the only kind of input schema that supports generating relational data.
### JSON Schema

Here is the general syntax for a JSON input schema:

```json
[
  {
    "_meta": {
      "topic": "<my kafka topic>",
      "key": "<field to be used for kafka record key>",
      "relationships": [
        {
          "topic": "<topic for dependent dataset>",
          "parent_field": "<field in this dataset>",
          "child_field": "<matching field in dependent dataset>",
          "records_per": <number of records in dependent dataset per record in this dataset>
        },
        ...
      ]
    },
    "<my first field>": "<method from the faker API>",
    "<my second field>": "<another method from the faker API>",
    ...
  },
  {
    ...
  },
  ...
]
```

Go to the [end-to-end ecommerce tutorial](./examples/ecommerce.md) to walk through an example that uses a JSON input schema with relational data.
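As a smaller, made-up sketch of the same syntax (these topics and fields are not from the tutorial), here is a `users` dataset that generates two related `purchases` records per user:

```json
[
  {
    "_meta": {
      "topic": "mz_datagen_users",
      "key": "id",
      "relationships": [
        {
          "topic": "mz_datagen_purchases",
          "parent_field": "id",
          "child_field": "user_id",
          "records_per": 2
        }
      ]
    },
    "id": "datatype.number",
    "name": "internet.userName"
  },
  {
    "_meta": {
      "topic": "mz_datagen_purchases",
      "key": "id"
    },
    "id": "datatype.number",
    "user_id": "datatype.number",
    "product": "commerce.productName"
  }
]
```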


### SQL Schema

The SQL schema option allows you to use a `CREATE TABLE` statement to define what data is generated. You specify the [FakerJS API](https://fakerjs.dev/api/) method using a `COMMENT` on the column. Here is an example:

```sql
CREATE TABLE "ecommerce"."products" (
@@ -130,46 +229,27 @@
);
```


This will produce the desired mock data to the topic `ecommerce.products`.
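Because the column definitions are collapsed in the diff above, here is a separate, hypothetical table that spells out the pattern, assuming the same `namespace.method` strings used in the JSON schemas also work as `COMMENT` values:

```sql
CREATE TABLE "mydb"."users" (
  "id" int PRIMARY KEY,
  "name" varchar COMMENT 'internet.userName',
  "email" varchar COMMENT 'internet.exampleEmail',
  "created_at" datetime COMMENT 'datatype.datetime'
);
```

Presumably this would produce data to a `mydb.users` topic, mirroring how the table above maps to `ecommerce.products`.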

### Avro Schema

> :construction: Avro input schema currently does not support arbitrary FakerJS methods. Instead, data is randomly generated based on the type.

Here is an example Avro input schema from `tests/schema.avsc` that will produce data to a topic called `products`:

```
{
  "type": "record",
  "name": "products",
  "namespace": "exp.products.v1",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "productId", "type": ["null", "string"] },
    { "name": "title", "type": "string" },
    { "name": "price", "type": "int" },
    { "name": "isLimited", "type": "boolean" },
    { "name": "sizes", "type": ["null", "string"], "default": null },
    { "name": "ownerIds", "type": { "type": "array", "items": "string" } }
  ]
}
```
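To try the schema above end to end, the same flags from the Quick Examples apply. For example, a dry run:

```bash
datagen \
  --schema tests/schema.avsc \
  --format avro \
  --number 5 \
  --dry-run
```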
6 changes: 3 additions & 3 deletions datagen.js
@@ -17,22 +17,22 @@
const dataGenerator = require('./src/dataGenerator');
const fs = require('fs');
const { program, Option } = require('commander');

program.name('datagen').description('Fake Data Generator').version('0.1.3');

program
    .requiredOption('-s, --schema <char>', 'Schema file to use')
    .addOption(
        new Option('-f, --format <char>', 'The format of the produced data')
            .choices(['json', 'avro'])
            .default('json')
    )
    .addOption(
        new Option(
            '-n, --number <char>',
            'Number of records to generate. For infinite records, use -1'
        ).default('10')
    )
    .option('-c, --clean', 'Clean (delete) Kafka topics and schema subjects previously created')
    .option('-dr, --dry-run', 'Dry run (no data will be produced to Kafka)')
    .option('-d, --debug', 'Output extra debugging information')
    .option('-w, --wait <int>', 'Wait time in ms between record production', parseInt)
File renamed without changes.