-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: new functions crdb_internal.generate_test_objects
and .gen_rand_ident
#94027
Conversation
57b4ac5
to
dcabfa4
Compare
two quick drive-by ergonomics questions:
|
no objection
let's delay this convo until I show you the degree of configurability i'm aiming for. |
crdb_internal.generate_objects
crdb_internal.generate_test_objects
dcabfa4
to
9323408
Compare
@dt okay can you look again at the proposed interface (see updated PR description) Now I am willing to discuss UX. I'd also like the common case to be simple to type. |
a66de42
to
7e1de53
Compare
crdb_internal.generate_test_objects
.gen_rand_ident
.gen_rand_ident
crdb_internal.generate_test_objects
and .gen_rand_ident
1a9b034
to
2c263cb
Compare
7a1cf1c
to
d97ff1a
Compare
22fbe6c
to
91d092b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. This LGTM modulo a couple of easy fixes related to building and validating descriptors. The rest is just nits, questions and suggestions, but nothing I wouldn't be OK with being merged.
} | ||
|
||
// Catalog interfaces with the schema resolver and privilege checker. | ||
type Catalog interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this interface package-private, if possible? There's no benefit to exporting it, is there?
My concern here is that exporting something yet again named "catalog" potentially makes the codebase more confusing.
// objets quickly. | ||
// Note: we pass parameters as a string to avoid a package | ||
// dependency to randgen from users of this interface; | ||
GenerateTestObjects(ctx context.Context, parameters string) (string, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get why the param and the return value are strings but I wonder if interface{}
wouldn't be more convenient. What do you think? I don't feel strongly about this, in any case.
|
||
Parameters:` + randgencfg.ConfigDoc, | ||
Volatility: volatility.Volatile, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to redefine this and the overload above as UDFs instead?
See the documentation of the other gen_rand_ident overload for details. | ||
`, | ||
volatility.Volatile, | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: can this overload be defined as a UDF instead?
// interrupted when the algorithm has a hard time avoiding | ||
// conflicts. | ||
GenerateMultiple(ctx context.Context, count int, conflictNames map[string]struct{}) ([]string, error) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm not sure whether this interface serves much purpose. It's implemented only once and that's unlikely to change, furthermore GenerateMultiple
could just as well be defined as a function taking this interface. I understand it documents the API and I like that, but this could easily be done otherwise, like with a package docstring or something like that. I don't feel strongly about this.
|
||
// Look up the descriptors from the IDs. | ||
descs, err := g.coll.GetImmutableDescriptorsByID(ctx, g.txn, | ||
tree.CommonLookupFlags{Required: true}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Required: true
is always implied by GetImmutableDescriptorsByID
. One of the many sad little warts of the collection API which I hope to get rid of soon, see #93813 if you're interested in this.
pkg/sql/catalog/randgen/templates.go
Outdated
}, | ||
PrimaryIndex: descpb.IndexDescriptor{ | ||
ID: 1, | ||
Name: "pk", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using tabledesc.PrimaryKeyIndexName
here
_ = db.ForEachSchema(func(_ descpb.ID, name string) error { | ||
res[name] = struct{}{} | ||
return nil | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using coll.GetAllSchemasInDatabase
here, unless you don't care that this won't work when temporary schemas are involved.
tmpl.desc.PrimaryIndex.Name = idxName | ||
} | ||
|
||
tb := tabledesc.NewBuilder(&tmpl.desc).BuildCreatedMutableTable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should call RunPostDeserializationChanges
on the builder before building the descriptor here and elsewhere to make sure this functionality ages gracefully.
} | ||
|
||
// queueDesc queues a modified descriptor for (later) write to KV. | ||
func (g *testSchemaGenerator) queueDescMut(ctx context.Context, desc catalog.MutableDescriptor) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth calling validate.Self(clusterversion.TestingClusterVersion, desc)
right here. Validation will be performed anyway when the transaction commits but since a lot of descriptors are involved, if a validation failure occurs it might be difficult to make sense of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @postamar)
pkg/sql/catalog/randgen/api.go
line 30 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Should this be named
RandomSchemaGenerator
instead? If so, propagate the name change.
While the names are randomized, the structure of the schema is not. We should also have such a thing as a random schema generator but this is not it.
pkg/sql/catalog/randgen/batch.go
line 30 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
I think it's worth calling
validate.Self(clusterversion.TestingClusterVersion, desc)
right here. Validation will be performed anyway when the transaction commits but since a lot of descriptors are involved, if a validation failure occurs it might be difficult to make sense of it.
I don't buy it. If validation failure occurs, it's only because of some property of the templates and that would show up at just 1 descriptor created.
pkg/sql/catalog/randgen/randgen.go
line 117 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Should this error come with a hint?
Done.
pkg/sql/catalog/randgen/randgen.go
line 123 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Dumb question perhaps but is there any point in allowing non-admin users to bulk generate tests objects in the first place?
You tell me :)
The view i've taken so far is that this code is not doing anything that a user couldn't do via CREATE statements in the first place. Only the rate is different, although again a client could achieve similar things via concurrent client connections.
Maybe we could ask a PM for advice?
pkg/sql/catalog/randgen/randgen.go
line 160 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Can we make this interface package-private, if possible? There's no benefit to exporting it, is there?
My concern here is that exporting something yet again named "catalog" potentially makes the codebase more confusing.
I don't understand how to do this. Doesn't the linter require that all types in a public function/method are also public?
pkg/sql/catalog/randgen/randgen.go
line 225 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
I would find it useful to group the fields in this struct according to their lifecycle/mutability, either by rearranging the lines in which they appear, or by introducing sub-structs. Some fields are part of the "generator state" so to speak (like
baseSchema
,rand
), some are part of the generator output (b
etc), some are immutable configuration values (cfg
,kvtrace
etc) and some are dependencies (cat
,coll
, etc).
Done.
pkg/sql/catalog/randgen/tables.go
line 152 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
uber nit: FYI the prevailing convention in
pkg/sql/catalog/...
is to just call thesedb
andsc
Done.
pkg/sql/catalog/randgen/tables.go
line 193 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
You should call
RunPostDeserializationChanges
on the builder before building the descriptor here and elsewhere to make sure this functionality ages gracefully.
Absolutely not. This causes the performance to tank by 50% (those checks are expensive!!!) and for no good reason:
- it's the same template used every time.
- we're 100% controlling the template here. If there's a bug, it would be caught by the test suite.
pkg/sql/catalog/randgen/templates.go
line 205 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Consider using
tabledesc.PrimaryKeyIndexName
here
Done.
pkg/sql/catalog/randgen/util.go
line 57 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Consider using
coll.GetAllSchemasInDatabase
here, unless you don't care that this won't work when temporary schemas are involved.
I do care about temporary schemas and this is exercised in tests.
pkg/sql/sem/builtins/builtins.go
line 6434 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Would it be possible to redefine this and the overload above as UDFs instead?
Done.
pkg/sql/sem/builtins/generator_builtins.go
line 575 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Same comment as above: can this overload be defined as a UDF instead?
Done.
pkg/sql/sem/eval/deps.go
line 220 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
I get why the param and the return value are strings but I wonder if
interface{}
wouldn't be more convenient. What do you think? I don't feel strongly about this, in any case.
I've tried it the other ways, and it results in code that's slightly harder to read/maintain.
pkg/sql/logictest/testdata/logic_test/rand_ident
line 12 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
nit: Could
seed
beomitempty
and its value be 0 in tests? It's no worse than 123 which is used everywhere (I think)
I've made it omitempty because zero is, indeed, a valid rng seed value and we shouldn't treat it as special.
However, I do see value in using "123" in tests, it has better explanatory power (using "0" would make the user wonder if it's special in some way").
pkg/sql/logictest/testdata/logic_test/test_objects
line 1 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
Consider renaming this file to
gen_test_objects
perhaps?test_objects
can give the impression that this is about testing objects.
Done.
pkg/util/randident/namegen_test.go
line 110 at r2 (raw file):
Previously, postamar (Marius Posta) wrote…
nit: perhaps having
testNameGen
take astruct{always, never, sometimes []pred}
type might be simpler?
Done.
…nd_ident` tldr: the new function ``` crdb_internal.generate_test_objects(<parameters>) ``` generates multiple SQL objects at once. It is meant for use by tests that want to exercise schemas of up to tens of thousands of descriptors. Additionally, a new flag `--expand-schema` was added to `cockroach demo` which exercises and demonstrates this new feature. When specified, `demo` will gradually *multiply* the workload schema by the specified number of schema objects, by adding databases and tables similar to the one defined by the workload. For example: ``` $ cockroach demo movr --expand-schema=10000 ... wait a bit ... > SHOW DATABASES; ... observe: similar databases to movr have been created, 100 tables each ``` The parameters argument to the SQL built-in function is a JSONB object, containing at least the following fields: - `"counts"`: the counts of objects to create. - `"names"`: the pattern to use to name generated objects. The pattern and counts array influences object creation as follows: | Counts | Pattern | Creates databases? | Creates schemas? | Creates tables? | Total objects created | |------------|-------------|--------------------|---------------------------|-------------------------------------------|---------------------------------------| | N | `tb` | no | no | yes: `<currentdb>.<currentschema>.tb1`... | N | | N | `sc.tb` | no | no | yes: `<currentdb>.sc.tb1`... | N | | N | `db.sc.tb` | no | no | yes: `db.sc.tb1`... | N | | N2, N1 | `sc.tb` | no | yes: `<currentdb>.sc1`... | yes: `<currentdb>.sc1.tb1`... | N2 + (N1 x N2) | | N2, 0 | `sc.tb` | no | yes: `<currentdb>.sc1`... | no | N2 | | N2, N1 | `db.sc.tb` | no | yes: `db.sc1`... | yes: `db.sc1.tb1`... | N2 + (N1 x N2) | | N3, N2, N1 | `db.sc.tb` | yes: `db1`... | yes: `db1.sc1`... | yes: `db1.sc1.tb1`... | (N3 x 2) + (N3 x N2) + (N1 x N2 x N3) | | N3, N2, 0 | `db.sc._` | yes: `db1`... | yes: `db1.sc1`... | no | (N3 x 2) + (N3 x N2) | | N2, 0 | `db.sc._` | no | yes: `db.sc1`... | no | N2 | | N3, 0, 0 | `db._._` | yes: `db1`... | no | no | N3 x 2 | | N1, 0, N2 | `db.tb` | yes: `db1`... | no | yes: `db1.public.tb1`... | (N3 x 2) + (N3 x N2) | | 0 | `tb` | no | no | no | 0 | | 0, N2 | `prefix.tb` | no | no | no | 0 | | 0, N2, N1 | `db.sc.tb` | no | no | no | 0 | | 0 | `prefix.tb` | no | no | no | 0 | | 0 | `db.sc.tb` | no | no | no | 0 | | 0, N1 | `db.sc._` | no | no | no | 0 | NB: the reason why the requested number of databases is doubled in the total count of objects created is that a `public` schema is created for each new database. The convenience overload `generate_test_objects(pattern : string, count : integer)` is also provided as a convenience alias for `generate_test_objects('{"names":pattern, "counts":[count]}'::JSONB)`. Likewise, `generate_test_objects(pattern : string, counts : []integer)` is also provided as a convenience alias for `generate_test_objects('{"names":pattern, "counts":counts}'::JSONB)`. The configuration parameters may also contain the following fields: - `"names"`: pattern to use to name the generated objects (default: "_", see note about `"table_templates"` below). - `"counts"`: counts of generated objects (default: [10]). - `"dry_run"`: prepare the schema but do not actually write it (default: false). - `"seed"`: random seed to use (default: auto). - `"randomize_columns"`: whether to randomize the column names on tables (default: true). - `"table_templates"`: table templates to use. If the last part of `"names"` is "_", the name of the template will be used as base pattern during name generation for tables. Otherwise, the last part of "names" will be used as pattern. If no table templates are specified, a simple template is used. - `"name_gen"`: configuration for the name generation, see below. Guaranteed properties: - when a requested object count is zero, no object of that type (and no object of sub-types) gets created. This preserves the mathematical properties: - calls with a count of zero are idempotent. - if the function is called sequentially with counts N and M, the total number of objects is always N+M regardless of the values of N and M. Also for testing and troubleshooting, another function is provided: ``` crdb_internal.gen_rand_ident(name_pattern: string, count: int, parameters: jsonb) -> string ``` It produces SQL identifiers using the same algorithm as used by `generate_test_objects`. Its parameters are: - `"seed"`: the seed to use for the pseudo-random generator (default: random). -` "number"`: whether to add a number to the generated names (default true). When enabled, occurrences of the character '#' in the name pattern are replaced by the number. If '#' is not present, the number is added at the end. - `"noise"`: whether to add noise to the generated names (default true). It adds a non-zero probability for each of the probability options below left to zero. (To enable noise generally but disable one type of noise, set its probability to -1.) - `"punctuate"`: probability of adding punctuation to the generated names. - `"quote"`: probabiltiy of adding single or double quotes to the generated names. - `"emote"`: probability of adding emojis to the generated names. - `"space"`: probability of adding simple spaces to the generated names. - `"whitespace"`: probability of adding complex whitespace to the generated names. - `"capitals"`: probability of using capital letters in the generated names. Note: the name pattern must contain ASCII letters already for capital letters to be used. - `"diacritics"`: probability of adding diacritics in the generated names. - `"diacritic_depth"`: max number of diacritics to add at a time (default 1). - `"zalgo"`: special option that overrides diacritics and diacritic_depth (default false). For convenience, the overload `crdb_internal.gen_rand_ident(name_pattern: string, count: int)` is provided as an alias to `gen_rand_ident(pattern, count, '{}'::JSONB)`. The following two cluster settings control access to the feature: - `sql.schema.test_object_generator.enabled` (default true) - `sql.schema.test_object_generator.non_admins.enabled` (default false) Release note: None
i've made it admin-only by default like discussed. |
bors r=postamar |
Build succeeded: |
Re. I'm fine with all your other decisions. |
See #94840 to that effect. As it turns out, the test actually picked up something. Nothing worth worrying about, but enough to justify its existence. |
Fixes #94026.
tldr: the new function
generates multiple SQL objects at once. It is meant for use by tests.
Additionally, a new flag
--expand-schema
was added tocockroach demo
which exercises and demonstrates this new feature. Whenspecified,
demo
will gradually multiply the workload schemaby the specified number of schema objects, by adding databases
and tables similar to the one defined by the workload.
For example:
The parameters argument is a JSONB object, containing at least the
following fields:
"counts"
: the counts of objects to create."names"
: the pattern to use to name generated objects.The pattern and counts array influences object creation as follows:
tb
<currentdb>.<currentschema>.tb1
...sc.tb
<currentdb>.sc.tb1
...db.sc.tb
db.sc.tb1
...sc.tb
<currentdb>.sc1
...<currentdb>.sc1.tb1
...sc.tb
<currentdb>.sc1
...db.sc.tb
db.sc1
...db.sc1.tb1
...db.sc.tb
db1
...db1.sc1
...db1.sc1.tb1
...db.sc._
db1
...db1.sc1
...db.sc._
db.sc1
...db._._
db1
...db.tb
db1
...db1.public.tb1
...tb
prefix.tb
db.sc.tb
prefix.tb
db.sc.tb
db.sc._
NB: the reason why the requested number of databases is doubled in the
total count of objects created is that a
public
schema is created foreach new database.
The convenience overload
generate_test_objects(pattern : string, count : integer)
is also provided as a convenience alias forgenerate_test_objects('{"names":pattern, "counts":[count]}'::JSONB)
.Likewise,
generate_test_objects(pattern : string, counts : []integer)
is also provided as a convenience alias forgenerate_test_objects('{"names":pattern, "counts":counts}'::JSONB)
.The configuration parameters may also contain the following fields:
"names"
: pattern to use to name the generated objects (default:"_", see note about
"table_templates"
below)."counts"
: counts of generated objects (default: [10])."dry_run"
: prepare the schema but do not actually write it(default: false).
"seed"
: random seed to use (default: auto)."randomize_columns"
: whether to randomize the column names on tables(default: true).
"table_templates"
: table templates to use.If the last part of
"names"
is "_", the name of the templatewill be used as base pattern during name generation for tables.
Otherwise, the last part of "names" will be used as pattern.
If no table templates are specified, a simple template is used.
"name_gen"
: configuration for the name generation, see below.Guaranteed properties:
no object of sub-types) gets created. This preserves the
mathematical properties:
with counts N and M, the total number of objects is always N+M
regardless of the values of N and M.
Also for testing and troubleshooting, another function is provided:
It produces SQL identifiers using the same algorithm as used by
generate_test_objects
.Its parameters are:
"seed"
: the seed to use for the pseudo-random generator (default:random).
-
"number"
: whether to add a number to the generated names (defaulttrue). When enabled, occurrences of the character '#' in the name
pattern are replaced by the number. If '#' is not present, the
number is added at the end.
"noise"
: whether to add noise to the generated names (defaulttrue). It adds a non-zero probability for each of the probability
options below left to zero. (To enable noise generally but disable
one type of noise, set its probability to -1.)
"punctuate"
: probability of adding punctuation to the generatednames.
"quote"
: probabiltiy of adding single or double quotes to thegenerated names.
"emote"
: probability of adding emojis to the generated names."space"
: probability of adding simple spaces to the generatednames.
"whitespace"
: probability of adding complex whitespace to thegenerated names.
"capitals"
: probability of using capital letters in the generatednames. Note: the name pattern must contain ASCII letters already for
capital letters to be used.
"diacritics"
: probability of adding diacritics in the generatednames.
"diacritic_depth"
: max number of diacritics to add at a time(default 1).
"zalgo"
: special option that overrides diacritics anddiacritic_depth (default false).
For convenience, the overload
crdb_internal.gen_rand_ident(name_pattern: string, count: int)
is provided as an alias to
gen_rand_ident(pattern, count, '{}'::JSONB)
.Release note: None