Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make resource autonames determinstic #8631

Merged
merged 18 commits into from
Jan 20, 2022
Merged

Make resource autonames determinstic #8631

merged 18 commits into from
Jan 20, 2022

Conversation

Frassle
Copy link
Member

@Frassle Frassle commented Dec 23, 2021

Description

This adds a new property to resource state, a sequence number that starts at 1 when created and incremented on each replace operation. This sequence number is passed to the provider Check function and can be used plus the resource URN to build a hash for the random suffix of an autoname (see NewUniqueHexV2).

For existing resources the sequence number will be zero (default int value) and NewUniqueHexV2 will fall back to the old logic of a non-deterministic suffix.
This will also be fallen back to if an old CLI is used which won't read or maintain the sequence numbers from the state file, implicitly resetting all resources sequence numbers to zero (that is unknown).
An old CLI will also not pass sequenceNumber to check via GRPC and so even if using a new provider that supports sequence numbers GRPC will marshal the number as zero and so NewUniqueHexV2 will fall back to non-determinism.

The reason for treating zero as unknown rather than the first valid sequence number is for the following scenario:

  1. Imagine we create a new resource using a new CLI and provider that uses deterministic names. We init sequence number as 0.
  2. [Optionally zero or more times] We then do an update that causes a replace and increment the sequence number for a new name.
  3. We then run an update using an old CLI and lose all sequence number information.
  4. We then run an update using a new CLI again and need to do a replace, we need to ensure Check makes a new auto-name but we don't know what sequence number is safe to use. If we build the name with 0 it could clash if the current resource was created with a deterministic name from 0. If we build the name with 1 it could clash if the current resource was replaced and used a deterministic for that operation and currently had the name built from 1 (or any higher integer).

Given the above if sequence number is zero we just keep the existing non-deterministic behavior to avoid name clashes. However once we've done a replace with non-deterministic behavior we can safely reset back to using sequence numbers (we've just created a new resource with a random name so we know that deterministic name 1 won't clash with it). To mark this state we store a -1 for sequence number in the state file. -1 is never passed to providers as part of Check, it's either translated to 0 (in the case we're calling Check on the existing resource) or 1 when we go to create a replacement resource.

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrants bumping the Pulumi Service API version

@Frassle Frassle requested a review from a team December 23, 2021 13:12
@codecov
Copy link

codecov bot commented Dec 23, 2021

Codecov Report

Merging #8631 (22fc82f) into master (40a4507) will increase coverage by 0.03%.
The diff coverage is 82.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8631      +/-   ##
==========================================
+ Coverage   59.35%   59.38%   +0.03%     
==========================================
  Files         637      637              
  Lines       97765    97844      +79     
  Branches     1385     1386       +1     
==========================================
+ Hits        58025    58102      +77     
+ Misses      36466    36464       -2     
- Partials     3274     3278       +4     
Impacted Files Coverage Δ
sdk/go/common/apitype/core.go 100.00% <ø> (ø)
sdk/go/common/resource/plugin/provider.go 67.61% <ø> (ø)
sdk/nodejs/dynamic/index.ts 21.05% <ø> (ø)
sdk/nodejs/provider/server.ts 5.01% <0.00%> (ø)
sdk/proto/go/provider.pb.go 34.26% <0.00%> (-0.12%) ⬇️
...dk/python/lib/pulumi/runtime/proto/provider_pb2.py 100.00% <ø> (ø)
sdk/nodejs/proto/provider_pb.js 26.34% <20.00%> (-0.03%) ⬇️
pkg/resource/deploy/step_generator.go 84.06% <90.00%> (-0.04%) ⬇️
sdk/go/common/resource/resource_id.go 67.16% <90.00%> (+17.16%) ⬆️
pkg/backend/display/json.go 59.28% <100.00%> (ø)
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40a4507...22fc82f. Read the comment docs.

@t0yv0
Copy link
Member

t0yv0 commented Jan 4, 2022

@Frassle there was some discussion around it, is this ready for review or are some more changes coming?

@Frassle
Copy link
Member Author

Frassle commented Jan 4, 2022

I need to plumb sequence number through for the dynamic providers still, but I think otherwise this is ready to be looked at. Things to keep in mind if reviewing this:

  1. Is this even the 'right' thing to do for fixing autoname determinism. I haven't come up with anything else but maybe there's a generally better way to solve this?
  2. Is this safely compatible with old cli versions
  3. Is this the best way to change the public go api to plumb through this new information (sequence number)

@@ -585,8 +596,13 @@ func (sg *stepGenerator) generateStepsFromDiff(
//
// Note that if we're performing a targeted replace, we already have the correct inputs.
if prov != nil && !sg.isTargetedReplace(urn) {
// Increment the sequence number (if it's known) before calling check so we get a new autoname
if new.SequenceNumber != 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main place where we're incrementing it seems. Right?

Why is it under prov != nil && !sg.isTargetedReplace(urn) condition? Looks like otherwise we don't even call Check. In other words, if we are doing targeted replace, will sequence auto-increment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the code higher in the file (~350) should handle this case now. If we're recreating or doing a targeted replace we need to increment.

var failures []plugin.CheckFailure
inputs, failures, err = prov.Check(urn, nil, goal.Properties, allowUnknowns)
inputs, failures, err = prov.Check(urn, nil, goal.Properties, allowUnknowns, new.SequenceNumber)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the code below. Planner may be cascading along the dependency graph to delete and re-create some other resources. I am wondering if sequencenumber for these resources would auto-increment, and if so, how. It sounds like it should? Since it's an indirect replacement.

NewDeleteReplacementStep(sg.deployment, old, true),
NewReplaceStep(sg.deployment, old, new, diff.ReplaceKeys, diff.ChangedKeys, diff.DetailedDiff, false),
NewCreateReplacementStep(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should increment and I think that will get caught by the step generator seeing that 'recreating' is true for those resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it would seem so from:

// Mark the condemned resource as deleted. We won't know until later in the deployment whether
// or not we're going to be replacing this resource.
sg.deletes[dependentResource.URN] = true

and this definition of recreating

	// We may be re-creating this resource if it got deleted earlier in the execution of this deployment.
	_, recreating := sg.deletes[urn]

However I do not fully grok the state machine here wrt sg.deletes and ordering.

It would seem that we'd need some sort of circularity here that might be absent, that is:

  • we handle seq-num in generateSteps
  • generateSteps decides to do some extra replacements based on deleteBeforeReplace and dep graph
  • in that case those replacements need seq-num handling so these dep resources need to undergo generateSteps

But it would seem that generateSteps does not recur, in fact it would seem it is called once per resource in the order as RegisterResource calls arrive.

Let me run a quick experiment to clarify further.

@t0yv0
Copy link
Member

t0yv0 commented Jan 4, 2022

It's looking decent, I'm fumbling a bit verifying where 0 is correct and where 1 is correct.

One thing that strikes me as really counter-intuitive is that Check is where auto-naming happens. We'd want to document that thoroughly at the definition place for Check and in the dev docs. Check really seems to return inputs in CheckResult and that's kind of confusing.

Another exercise that helped me - we can link PRs here, is trying to build pulumi-azure-native and pulumi-aws against this PR. For this to work resource.NewUniqueHex needs to be replaced with resource.NewUniqueHexV2 and sequencenumber passed through. I could sort of see how it happens in Check in pulumi-azure-native, but it's not obvious. It seems to subject a lot more props to auto-naming than "name". Couldn't see it at all in the TF-bridge based pulumi-aws. Having these PRs linked in here can help review/verify the call sequence assumptions.

Another random thought. For the case of imports with ResourceOption import, impl probably should stick to 0 sequencenumber until the user removes the import option and the resource is managed by pulumi, at which point an update will assign 1.

@t0yv0
Copy link
Member

t0yv0 commented Jan 4, 2022

Another thing I would like to understand better is precise location of where replacement is decided, that would be the right place to increment the sequence number.

Which component is responsible for deciding if a replacement is to happen?

It sounds like it should be the engine, and it does so in cases of cascading replaces code noted above. But providers Diff can return a result that indicates that a certain property change forces a replacement? Thinking of func (r DiffResult) Replace() bool here.

Thinking out loud of what we could have.. The replacements only ever are finished off with a provider.Create right; a provider.Update is not a replacement. There's some good verbiage in https://pulumi-developer-docs.readthedocs.io/en/latest/providers/implementers-guide.html?highlight=Check#check about how replacement works. TODO for myself, check the flow in the code tomorrow, expecting to find that every Create initiated in order to replace a resource is receiving results of Check (re-checked inputs) with incremented sequence-number; see if we can use a phantom type just for this tracking to make it super-obvious.

@Frassle
Copy link
Member Author

Frassle commented Jan 5, 2022

It's looking decent, I'm fumbling a bit verifying where 0 is correct and where 1 is correct.

0 is the default value of falling back to the old non-deterministic logic. So new stuff managed totally by up to date engines will get set to 1 and then increment from that. Old CLIs or providers or existing resources will be 0. We should work out a path to let people easily opt existing resources into the new scheme (because even a replace won't do it automatically)

One thing that strikes me as really counter-intuitive is that Check is where auto-naming happens. We'd want to document that thoroughly at the definition place for Check and in the dev docs. Check really seems to return inputs in CheckResult and that's kind of confusing.

Yeh I did wonder if autonames should be in engine but then other stuff might also need randomness and the engine doesn't really understand "names" just "ids" which are different.

Another exercise that helped me - we can link PRs here, is trying to build pulumi-azure-native and pulumi-aws against this PR. For this to work resource.NewUniqueHex needs to be replaced with resource.NewUniqueHexV2 and sequencenumber passed through. I could sort of see how it happens in Check in pulumi-azure-native, but it's not obvious. It seems to subject a lot more props to auto-naming than "name". Couldn't see it at all in the TF-bridge based pulumi-aws. Having these PRs linked in here can help review/verify the call sequence assumptions.

I'll have a look at doing that.

Another random thought. For the case of imports with ResourceOption import, impl probably should stick to 0 sequencenumber until the user removes the import option and the resource is managed by pulumi, at which point an update will assign 1.

Currently imports consider this a brand new resources which will be safe to deterministically autoname in the event it gets replaced later and so set the sequence number to 1. This is the same for reads and the import command. I think this is safe because imports will call check and diff against the result of read so that will check the names match. Got to admit I feel like the import/external/read resources stuff is pretty confusing and I think I need to double check a lot of this.

Another thing I would like to understand better is precise location of where replacement is decided, that would be the right place to increment the sequence number.

Which component is responsible for deciding if a replacement is to happen?

It sounds like it should be the engine, and it does so in cases of cascading replaces code noted above. But providers Diff can return a result that indicates that a certain property change forces a replacement? Thinking of func (r DiffResult) Replace() bool here.

Replacements are decided by the step generator, one of the bits of info towards that decision is the provider diff saying it should get replaced. But it looks like changing an external resource to an internal managed one also triggers a replace.

Thinking out loud of what we could have.. The replacements only ever are finished off with a provider.Create right; a provider.Update is not a replacement. There's some good verbiage in https://pulumi-developer-docs.readthedocs.io/en/latest/providers/implementers-guide.html?highlight=Check#check about how replacement works. TODO for myself, check the flow in the code tomorrow, expecting to find that every Create initiated in order to replace a resource is receiving results of Check (re-checked inputs) with incremented sequence-number; see if we can use a phantom type just for this tracking to make it super-obvious.

Yes replaces are done by provider Create and so should always call Check first.

@pgavlin
Copy link
Member

pgavlin commented Jan 5, 2022

While I think that moving towards to a deterministic Check is a Good Thing, I'm a bit nervous about this approach's practicality as regards fixing the issues we're seeing with update constraints. Two things are bothering me:

  1. Most importantly, this approach requires that a user updates all of their providers if they're hitting issues with non-determinism. That is a pretty big thing to ask, especially if the providers in use are pulled in indirectly (e.g. via the use of an MLC).
  2. This approach requires that provider authors are aware of and in control of any sources of non-determinism. In the case of tfbridge, for example, we are aware of and in control of the non-determinism introduced by auto naming, but the TF provider itself (or even the cloud provider!) may introduce additional non-determinism.

I believe that (2) is probably going to be a problem no matter what we do. I think that all we can do there is think of possible escape hatches that can be applied by the user and think of possibilities for handling such properties in providers (e.g. by considering properties unknown1).

I think we might be able to address (1) within the context of update constraints by taking a different approach, however. We already need to control for the non-determinism of autonames in the sense that we don't want autonames to be regenerated each time a resource is Checked. To achieve that, we pass Check both the inputs from the program and the last checked inputs from the state file. If a name is present in the last checked inputs, the auto naming code pulls that name instead of generating a new name. What if we stored the last checked inputs for each resource being created or replaced in a set of update constraints and fed that into the appropriate call to Check? I think that might let update constraints work with autonames without requiring any changes besides updating the Pulumi CLI.

Footnotes

  1. A potentially crazy idea here is adding another bit to Output that marks a value as non-deterministic. Providers could then mark known outputs as non-deterministic in order to signal to the constraint checker that a mismatch in the value should be considered benign, but the value would otherwise be treated as known. This would be a pretty big change, to say the least, but it might be worth some discussion.

@t0yv0
Copy link
Member

t0yv0 commented Jan 5, 2022

(1) yes, (2) not sure why non-determinism is in play? Naively sequence counter increments on every replace per resource, non-determinism doesn't cause reorderings of replaces right, a resource can be replaced only once at a time (as if locking on resource)? Perhaps if you could elaborate that.

@t0yv0
Copy link
Member

t0yv0 commented Jan 5, 2022

What if we stored the last checked inputs for each resource being created or replaced in a set of update constraints and fed that into the appropriate call to Check? I think that might let update constraints work with autonames without requiring any changes besides updating the Pulumi CLI.

So instead of Check(olds=nil, news=news) preceding Create, you'd call Check(olds={name: x}, news=news) restored from preview via state, or from current engine session?

Here is a transcript of Check calls from a program forcing a bucket replacement. This helped me a lot to follow the discussion.


I0105 10:22:46.188029   56144 provider_plugin.go:610] CHECK: (*plugin.detailedCheckLog)(0xc00167f3e0)({
 urn: (resource.URN) (len=67) "urn:pulumi:dev::autonaming-example::aws:s3/bucket:Bucket::my-bucket",
 olds: (resource.PropertyMap) (len=5) {
  (resource.PropertyKey) (len=10) "__defaults": (resource.PropertyValue) {[{acl} {bucket} {forceDestroy}]},
  (resource.PropertyKey) (len=3) "acl": (resource.PropertyValue) {private},
  (resource.PropertyKey) (len=6) "bucket": (resource.PropertyValue) {my-bucket-15ae09c},
  (resource.PropertyKey) (len=12) "forceDestroy": (resource.PropertyValue) {false},
  (resource.PropertyKey) (len=10) "versioning": (resource.PropertyValue) {map[__defaults:{[{mfaDelete}]} enabled:{true} mfaDelete:{false}]}
 },
 news: (resource.PropertyMap) (len=1) {
  (resource.PropertyKey) (len=10) "versioning": (resource.PropertyValue) {map[enabled:{false}]}
 },
 allowUnknowns: (bool) false,
 result: (resource.PropertyMap) (len=5) {
  (resource.PropertyKey) (len=12) "forceDestroy": (resource.PropertyValue) {false},
  (resource.PropertyKey) (len=10) "versioning": (resource.PropertyValue) {map[__defaults:{[{mfaDelete}]} enabled:{false} mfaDelete:{false}]},
  (resource.PropertyKey) (len=10) "__defaults": (resource.PropertyValue) {[{acl} {bucket} {forceDestroy}]},
  (resource.PropertyKey) (len=3) "acl": (resource.PropertyValue) {private},
  (resource.PropertyKey) (len=6) "bucket": (resource.PropertyValue) {my-bucket-15ae09c}
 }
})

// Calls provider.Diff() with the results of the above Check

I0105 10:22:46.194906   56144 provider_plugin.go:610] CHECK: (*plugin.detailedCheckLog)(0xc0017c9050)({
 urn: (resource.URN) (len=67) "urn:pulumi:dev::autonaming-example::aws:s3/bucket:Bucket::my-bucket",
 olds: (resource.PropertyMap) <nil>,
 news: (resource.PropertyMap) (len=1) {
  (resource.PropertyKey) (len=10) "versioning": (resource.PropertyValue) {map[enabled:{false}]}
 },
 allowUnknowns: (bool) false,
 result: (resource.PropertyMap) (len=5) {
  (resource.PropertyKey) (len=10) "__defaults": (resource.PropertyValue) {[{acl} {bucket} {forceDestroy}]},
  (resource.PropertyKey) (len=3) "acl": (resource.PropertyValue) {private},
  (resource.PropertyKey) (len=6) "bucket": (resource.PropertyValue) {my-bucket-c024088},
  (resource.PropertyKey) (len=12) "forceDestroy": (resource.PropertyValue) {false},
  (resource.PropertyKey) (len=10) "versioning": (resource.PropertyValue) {map[__defaults:{[{mfaDelete}]} enabled:{false} mfaDelete:{false}]}
 }
})

// Calls provider.Create with the result of the above Check

What seems a bit tough is that engine does not know which props like "bucket" are subject to auto-naming (provider does), and feeding olds instead of nil whole-sale without specializing on props seems to be running the danger of bleeding some state in a replace? the nil was there for a reason?

@Frassle
Copy link
Member Author

Frassle commented Jan 6, 2022

While I think that moving towards to a deterministic Check is a Good Thing, I'm a bit nervous about this approach's practicality as regards fixing the issues we're seeing with update constraints. Two things are bothering me:

Most importantly, this approach requires that a user updates all of their providers if they're hitting issues with non-determinism. That is a pretty big thing to ask, especially if the providers in use are pulled in indirectly (e.g. via the use of an MLC).

Just to point out any idea that requires any extra data from the provider would also require updates. So if we want to avoid this we have to solve this entirely via engine changes.

This approach requires that provider authors are aware of and in control of any sources of non-determinism. In the case of tfbridge, for example, we are aware of and in control of the non-determinism introduced by auto naming, but the TF provider itself (or even the cloud provider!) may introduce additional non-determinism.
I believe that (2) is probably going to be a problem no matter what we do. I think that all we can do there is think of possible escape hatches that can be applied by the user and think of possibilities for handling such properties in providers (e.g. by considering properties unknown1).

I'd hope that Check in all cases CAN be made deterministic, the simple fix for things that aren't deterministic is to make them <computed>. And we could just make autonames <computed> instead of trying to make them deterministic but that has knock on effects elsewhere (we saw issues with computed names making multiple updates very hard in our circular resource hackathon). Plus this would still require updating all the providers.

I think we might be able to address (1) within the context of update constraints by taking a different approach, however. We already need to control for the non-determinism of autonames in the sense that we don't want autonames to be regenerated each time a resource is Checked. To achieve that, we pass Check both the inputs from the program and the last checked inputs from the state file. If a name is present in the last checked inputs, the auto naming code pulls that name instead of generating a new name. What if we stored the last checked inputs for each resource being created or replaced in a set of update constraints and fed that into the appropriate call to Check? I think that might let update constraints work with autonames without requiring any changes besides updating the Pulumi CLI.

I've already tried this and as Anton points out you hit problems with properties changing from nil to non-nil. Taking his bucket example and imagining you run it with a plan then during preview you wouldn't pass acl or forceDestroy but the result of Check includes them (and _defaults). If you save that result in the plan and then use that to "fill in" the inputs during update you end up passing acl and forceDestroy to Check which then returns back a different _defaults which then fails constraint validation. And there's nothing in the engine to know which fields should be filled in and which should be left blank for defaults (The closet thing on the schema is RequiredInputs but names aren't required either because of autonaming).

Footnotes
A potentially crazy idea here is adding another bit to Output that marks a value as non-deterministic. Providers could then mark known outputs as non-deterministic in order to signal to the constraint checker that a mismatch in the value should be considered benign, but the value would otherwise be treated as known. This would be a pretty big change, to say the least, but it might be worth some discussion.

We could do this but:

  1. It still would require us updating all the providers to mark names as non-deterministic
  2. It likely makes a lot of other things more difficult, including the ideas we had for circular resources

@pgavlin
Copy link
Member

pgavlin commented Jan 6, 2022

Just to point out any idea that requires any extra data from the provider would also require updates. So if we want to avoid this we have to solve this entirely via engine changes.

Yep, that's right. Which is maybe impossible.

If you save that result in the plan and then use that to "fill in" the inputs during update you end up passing acl and forceDestroy to Check which then returns back a different _defaults which then fails constraint validation.

I thought we weren't validating the post-Check inputs?

Just to be clear about the flow I'm thinking of here:

  1. During a preview:
    a. If a resource is being created, save the results of this call to Check to the constraints
    b. If a resource is being replaced, save the results of this call to Check to the constraints
  2. During an update:
    a. If a resource is being created, pass the saved Check results as the old inputs in this call to Check
    b. If a resource is being replaces, pass the saved Check results as the old inputs in this call to Check

If I understand the state of the world correctly, we'll validate the resource in 2a. against its pre-Check inputs, so the results of Check won't come into play, and in 2b. we'll already have validates the diff constraints by the time we get to the second Check. Is that right?

@pgavlin
Copy link
Member

pgavlin commented Jan 6, 2022

(for reference, here's an extremely hacky branch that saves post-check inputs in the situations I described: https://github.com/pulumi/pulumi/compare/pgavlin/savedInputs. It seems to work as I would expect, though I'm likely missing something as I've only tried some basic scenarios: https://asciinema.org/a/P9owoTmZITmhbuAffgryn63KO)

@Frassle
Copy link
Member Author

Frassle commented Jan 6, 2022

I thought we weren't validating the post-Check inputs?

No we are validating post-Check inputs because that's what's stored in the state file that we can diff againt.

@t0yv0 t0yv0 self-requested a review January 19, 2022 20:24
Copy link
Member

@t0yv0 t0yv0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through this again, looking good (safe).

For my information the new common/apitype/core.go ResourceV3 - where does it interact with the state backend (such as Pulumi Service stack state backend)? Couldn't find it at a glance.

@Frassle
Copy link
Member Author

Frassle commented Jan 19, 2022

For my information the new common/apitype/core.go ResourceV3 - where does it interact with the state backend (such as Pulumi Service stack state backend)? Couldn't find it at a glance.

PatchUpdateCheckpoint in "pkg/backend/httpstate/client/client.go" I think. I think the service just stores the data as sent, so I don't think it needs updating to be aware of sequence numbers.

@Frassle Frassle merged commit ec3ef44 into master Jan 20, 2022
@pulumi-bot pulumi-bot deleted the fraser/sequenceNumber branch January 20, 2022 11:18
julienp added a commit that referenced this pull request Aug 26, 2024
NewUniqueHexV2 is unused, let’s get rid of it.

This was introduced in #8631, used in some providers, but then reverted https://github.com/search?q=org%3Apulumi+NewUniqueHexV2&type=pullrequests
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2024
NewUniqueHexV2 is unused, let’s get rid of it.

This was introduced in #8631, used
in some providers, but then reverted
https://github.com/search?q=org%3Apulumi+NewUniqueHexV2&type=pullrequests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants