sql: new functions `crdb_internal.generate_test_objects` and `.gen_rand_ident` #94027

knz · 2022-12-21T01:19:23Z

tldr: the new function

   crdb_internal.generate_test_objects(<parameters>)

generates multiple SQL objects at once. It is meant for use by tests.

Additionally, a new flag --expand-schema was added to cockroach demo to exercise and demonstrate this new feature. When
specified, demo gradually expands the workload schema
by the specified number of schema objects, adding databases
and tables similar to the ones defined by the workload.

For example:

$ cockroach demo movr --expand-schema=10000
... wait a bit ...

> SHOW DATABASES;
... observe: similar databases to movr have been created, 100 tables each

The parameters argument is a JSONB object, containing at least the
following fields:

"counts": the counts of objects to create.
"names": the pattern to use to name generated objects.

The pattern and the counts array influence object creation as follows:

Counts	Pattern	Creates databases?	Creates schemas?	Creates tables?	Total objects created
N	`tb`	no	no	yes: `<currentdb>.<currentschema>.tb1`...	N
N	`sc.tb`	no	no	yes: `<currentdb>.sc.tb1`...	N
N	`db.sc.tb`	no	no	yes: `db.sc.tb1`...	N
N2, N1	`sc.tb`	no	yes: `<currentdb>.sc1`...	yes: `<currentdb>.sc1.tb1`...	N2 + (N1 x N2)
N2, 0	`sc.tb`	no	yes: `<currentdb>.sc1`...	no	N2
N2, N1	`db.sc.tb`	no	yes: `db.sc1`...	yes: `db.sc1.tb1`...	N2 + (N1 x N2)
N3, N2, N1	`db.sc.tb`	yes: `db1`...	yes: `db1.sc1`...	yes: `db1.sc1.tb1`...	(N3 x 2) + (N3 x N2) + (N1 x N2 x N3)
N3, N2, 0	`db.sc._`	yes: `db1`...	yes: `db1.sc1`...	no	(N3 x 2) + (N3 x N2)
N2, 0	`db.sc._`	no	yes: `db.sc1`...	no	N2
N3, 0, 0	`db._._`	yes: `db1`...	no	no	N3 x 2
N3, 0, N2	`db.tb`	yes: `db1`...	no	yes: `db1.public.tb1`...	(N3 x 2) + (N3 x N2)
0	`tb`	no	no	no	0
0, N2	`prefix.tb`	no	no	no	0
0, N2, N1	`db.sc.tb`	no	no	no	0
0	`prefix.tb`	no	no	no	0
0	`db.sc.tb`	no	no	no	0
0, N1	`db.sc._`	no	no	no	0

NB: the reason why the requested number of databases is doubled in the
total count of objects created is that a public schema is created for
each new database.
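
As a rough cross-check of the "Total objects created" column, here is a minimal Go sketch of the arithmetic. It only covers the rows where the number of counts matches a `tb`, `sc.tb` or `db.sc.tb` shaped pattern; the helper name `totalObjects` is made up for this illustration and is not part of the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// totalObjects is a hypothetical helper that mirrors the formulas in the
// table above. counts is ordered like the "counts" array: the last entry is
// the number of tables, preceded by schemas, preceded by databases.
// It assumes non-negative counts.
func totalObjects(counts []int) (int, error) {
	switch len(counts) {
	case 1:
		// Only tables, in an existing schema.
		return counts[0], nil
	case 2:
		// New schemas, each with its own tables.
		schemas, tables := counts[0], counts[1]
		return schemas + schemas*tables, nil
	case 3:
		// New databases (each one also gets a public schema, hence the
		// factor 2), new schemas per database, new tables per schema.
		dbs, schemas, tables := counts[0], counts[1], counts[2]
		return 2*dbs + dbs*schemas + dbs*schemas*tables, nil
	default:
		return 0, errors.New("expected between 1 and 3 counts")
	}
}

func main() {
	for _, c := range [][]int{{10}, {3, 4}, {2, 3, 4}, {2, 3, 0}, {0, 3, 4}} {
		n, err := totalObjects(c)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(c, "->", n)
	}
}
```

Note how a zero in a leading position makes every product vanish, which matches the all-zero rows at the bottom of the table.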

The overload generate_test_objects(pattern : string, count : integer) is provided as a convenience alias for
generate_test_objects('{"names":pattern, "counts":[count]}'::JSONB).

Likewise, generate_test_objects(pattern : string, counts : []integer) is also provided as a convenience alias for
generate_test_objects('{"names":pattern, "counts":counts}'::JSONB).
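
To illustrate what the two aliases expand to, here is a small Go sketch that builds the JSONB parameters from a pattern and a list of counts. The struct and the function `paramsFor` are hypothetical names used only for this example; only the "names" and "counts" field names come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// testObjectsParams holds the two fields that the convenience overloads
// fill in. The Go type itself is an assumption made for this sketch.
type testObjectsParams struct {
	Names  string `json:"names"`
	Counts []int  `json:"counts"`
}

// paramsFor returns the JSON text that the (pattern, count) and
// (pattern, counts) overloads are described as being aliases for.
func paramsFor(pattern string, counts ...int) (string, error) {
	b, err := json.Marshal(testObjectsParams{Names: pattern, Counts: counts})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	single, _ := paramsFor("tb", 10)
	nested, _ := paramsFor("db.sc.tb", 2, 3, 4)
	fmt.Println(single)
	fmt.Println(nested)
}
```

The resulting string is what one would pass, cast to JSONB, to the single-argument form of the function.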

The configuration parameters may also contain the following fields:

"names": pattern to use to name the generated objects (default:
"_", see note about "table_templates" below).
"counts": counts of generated objects (default: [10]).
"dry_run": prepare the schema but do not actually write it
(default: false).
"seed": random seed to use (default: auto).
"randomize_columns": whether to randomize the column names on tables
(default: true).
"table_templates": table templates to use.
If the last part of "names" is "_", the name of the template
will be used as base pattern during name generation for tables.
Otherwise, the last part of "names" will be used as pattern.
If no table templates are specified, a simple template is used.
"name_gen": configuration for the name generation, see below.

Guaranteed properties:

when a requested object count is zero, no object of that type (and
no object of sub-types) gets created. This preserves the
mathematical properties:
- calls with a count of zero are idempotent.
- if the function is called sequentially
  with counts N and M, the total number of objects is always N+M
  regardless of the values of N and M.
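
A toy model may help to see why these two properties hold together. The sketch below is not the PR's implementation: it assumes, for illustration only, that names are numbered and that a name already present is skipped rather than overwritten. The `catalog` type and its `generate` method are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// catalog is a toy stand-in for the set of existing table names.
type catalog map[string]struct{}

// generate adds exactly n new names of the form <base><number>,
// skipping any name that is already taken. With n == 0 it does nothing.
func (c catalog) generate(base string, n int) {
	for i, added := 1, 0; added < n; i++ {
		name := base + strconv.Itoa(i)
		if _, taken := c[name]; taken {
			continue // conflict with an earlier call: try the next number
		}
		c[name] = struct{}{}
		added++
	}
}

func main() {
	c := catalog{}
	c.generate("tb", 3) // first call with N = 3
	c.generate("tb", 0) // a zero count leaves the catalog unchanged
	c.generate("tb", 4) // second call with M = 4
	fmt.Println(len(c)) // N + M objects in total
}
```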

Also for testing and troubleshooting, another function is provided:

crdb_internal.gen_rand_ident(name_pattern: string, count: int, parameters: jsonb) -> string

It produces SQL identifiers using the same algorithm as used by
generate_test_objects.

Its parameters are:

"seed": the seed to use for the pseudo-random generator (default:
random).
"number": whether to add a number to the generated names (default
true). When enabled, occurrences of the character '#' in the name
pattern are replaced by the number. If '#' is not present, the
number is added at the end.
"noise": whether to add noise to the generated names (default
true). When enabled, every probability option below that is left at
zero gets a non-zero probability. (To enable noise generally but disable
one type of noise, set its probability to -1.)
"punctuate": probability of adding punctuation to the generated
names.
"quote": probability of adding single or double quotes to the
generated names.
"emote": probability of adding emojis to the generated names.
"space": probability of adding simple spaces to the generated
names.
"whitespace": probability of adding complex whitespace to the
generated names.
"capitals": probability of using capital letters in the generated
names. Note: the name pattern must contain ASCII letters already for
capital letters to be used.
"diacritics": probability of adding diacritics in the generated
names.
"diacritic_depth": max number of diacritics to add at a time
(default 1).
"zalgo": special option that overrides diacritics and
diacritic_depth (default false).
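
The "number" rule is simple enough to sketch in a few lines of Go. This only illustrates the substitution as described above (replace every '#', else append); the function name `applyNumber` is made up, and the real generator also applies the noise options, which are left out here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// applyNumber is a hypothetical helper: it replaces every '#' in the
// pattern with the number, or appends the number when no '#' is present.
func applyNumber(pattern string, n int) string {
	num := strconv.Itoa(n)
	if strings.Contains(pattern, "#") {
		return strings.ReplaceAll(pattern, "#", num)
	}
	return pattern + num
}

func main() {
	fmt.Println(applyNumber("tb", 3))  // prints "tb3"
	fmt.Println(applyNumber("t#b", 3)) // prints "t3b"
}
```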

For convenience, the overload crdb_internal.gen_rand_ident(name_pattern: string, count: int)
is provided as an alias to gen_rand_ident(pattern, count, '{}'::JSONB).

Release note: None

dt · 2022-12-22T15:58:19Z

two quick drive-by ergonomics questions:

How would you feel about including "test" or "testing" in the name, to make its purpose clearer, e.g. "crdb_internal.generate_test_objects" ?
How would you feel about different arity overloads instead of an array of variable length, so that the parameters can be named?

knz · 2022-12-22T16:22:43Z

> How would you feel about including "test" or "testing" in the name, to make its purpose clearer, e.g. "crdb_internal.generate_test_objects" ?

no objection

> How would you feel about different arity overloads instead of an array of variable length, so that the parameters can be named?

let's delay this convo until I show you the degree of configurability i'm aiming for.

knz · 2022-12-23T17:23:22Z

@dt okay can you look again at the proposed interface (see updated PR description)

Now I am willing to discuss UX. I'd also like the common case to be simple to type.

knz · 2022-12-25T22:02:39Z

So this is good for a first round of review, tests and all.

@postamar @ajwerner I'm thinking the schema team would like to own this; how do you feel about that?

postamar

Thanks for doing this. This LGTM modulo a couple of easy fixes related to building and validating descriptors. The rest is just nits, questions and suggestions, but nothing I wouldn't be OK with being merged.

postamar · 2023-01-03T15:08:13Z

pkg/sql/catalog/randgen/randgen.go

+}
+
+// Catalog interfaces with the schema resolver and privilege checker.
+type Catalog interface {


Can we make this interface package-private, if possible? There's no benefit to exporting it, is there?

My concern here is that exporting something yet again named "catalog" potentially makes the codebase more confusing.

postamar · 2023-01-03T15:15:08Z

pkg/sql/sem/eval/deps.go

+	// objets quickly.
+	// Note: we pass parameters as a string to avoid a package
+	// dependency to randgen from users of this interface;
+	GenerateTestObjects(ctx context.Context, parameters string) (string, error)


I get why the param and the return value are strings but I wonder if interface{} wouldn't be more convenient. What do you think? I don't feel strongly about this, in any case.

postamar · 2023-01-03T15:19:24Z

pkg/sql/sem/builtins/builtins.go

+
+Parameters:` + randgencfg.ConfigDoc,
+			Volatility: volatility.Volatile,
+		},


Would it be possible to redefine this and the overload above as UDFs instead?

postamar · 2023-01-03T15:21:54Z

pkg/sql/sem/builtins/generator_builtins.go

+See the documentation of the other gen_rand_ident overload for details.
+`,
+			volatility.Volatile,
+		),


Same comment as above: can this overload be defined as a UDF instead?

postamar · 2023-01-03T15:36:19Z

pkg/util/randident/api.go

+	// interrupted when the algorithm has a hard time avoiding
+	// conflicts.
+	GenerateMultiple(ctx context.Context, count int, conflictNames map[string]struct{}) ([]string, error)
+}


nit: I'm not sure whether this interface serves much purpose. It's implemented only once and that's unlikely to change; furthermore, GenerateMultiple could just as well be defined as a function taking this interface. I understand it documents the API and I like that, but this could easily be done otherwise, like with a package docstring or something like that. I don't feel strongly about this.

postamar · 2023-01-03T16:36:59Z

pkg/sql/catalog/randgen/templates.go

+
+	// Look up the descriptors from the IDs.
+	descs, err := g.coll.GetImmutableDescriptorsByID(ctx, g.txn,
+		tree.CommonLookupFlags{Required: true},


nit: Required: true is always implied by GetImmutableDescriptorsByID. One of the many sad little warts of the collection API which I hope to get rid of soon, see #93813 if you're interested in this.

postamar · 2023-01-03T16:40:51Z

pkg/sql/catalog/randgen/templates.go

+		},
+		PrimaryIndex: descpb.IndexDescriptor{
+			ID:                  1,
+			Name:                "pk",


Consider using tabledesc.PrimaryKeyIndexName here

postamar · 2023-01-03T16:42:51Z

pkg/sql/catalog/randgen/util.go

+	_ = db.ForEachSchema(func(_ descpb.ID, name string) error {
+		res[name] = struct{}{}
+		return nil
+	})


Consider using coll.GetAllSchemasInDatabase here, unless you don't care that this won't work when temporary schemas are involved.

postamar · 2023-01-03T16:45:55Z

pkg/sql/catalog/randgen/tables.go

+		tmpl.desc.PrimaryIndex.Name = idxName
+	}
+
+	tb := tabledesc.NewBuilder(&tmpl.desc).BuildCreatedMutableTable()


You should call RunPostDeserializationChanges on the builder before building the descriptor here and elsewhere to make sure this functionality ages gracefully.

postamar · 2023-01-03T16:56:27Z

pkg/sql/catalog/randgen/batch.go

+}
+
+// queueDesc queues a modified descriptor for (later) write to KV.
+func (g *testSchemaGenerator) queueDescMut(ctx context.Context, desc catalog.MutableDescriptor) {


I think it's worth calling validate.Self(clusterversion.TestingClusterVersion, desc) right here. Validation will be performed anyway when the transaction commits but since a lot of descriptors are involved, if a validation failure occurs it might be difficult to make sense of it.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @postamar)

pkg/sql/catalog/randgen/api.go line 30 at r2 (raw file):