Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Rollout PlanResourceChange #1785

Closed
5 of 6 tasks
Tracked by #1796
t0yv0 opened this issue Mar 22, 2024 · 42 comments · Fixed by #2380
Closed
5 of 6 tasks
Tracked by #1796

[EPIC] Rollout PlanResourceChange #1785

t0yv0 opened this issue Mar 22, 2024 · 42 comments · Fixed by #2380
Assignees
Labels
kind/epic Large new features or investments resolution/fixed This issue was fixed
Milestone

Comments

@t0yv0
Copy link
Member

t0yv0 commented Mar 22, 2024

🚧

PlanResourceChange flag enables a rewritten implementation of the lifecycle for SDKv2 based resources that more closely aligns with how TF CLI exercises the providers. Building out a plan to roll it out as the default behavior in the bridge. Currently it is rolled out selectively on a per-resource basis where it is shown to fix significant issues.

Technicalities

What PlanResourceChange flag does: it currently is a partial rewrite of the resource lifecycle methods for bridging the SDKv2-based resources. It ties into as much as possible into TF gRPC implementations to work "exactly-as" TF. Several known areas where this improves things:

  • RawState, RawConfig and RawPlan are computed as-in TF; when they differ this may lead to panics in providers
  • Does not call CustomizeDiff repeatedly, matching TF behavior more closely and solving complicated bugs

Cross-testing is essential to pin down corner cases and get full confidence before broader rollout.

Rollout

We will start the rollout as follows:

  • random
  • cloudflare
  • The rest of tier 2
  • GCP and Azure
  • AWS
  • Cleanup

Instances of TF behaviour regressing against non-PRC:

@VenelinMartinov
Copy link
Contributor

VenelinMartinov commented Apr 1, 2024

#1822 is likely the GCP Meta handling issue - it does have a repro.

@VenelinMartinov
Copy link
Contributor

Possible issue in GCP bucket: pulumi/pulumi-gcp#1952 - I attempted to enroll it in PlanResourceChange to check if it fixes an issue. Did not fix the original issue but caused a lot of test failures.

At least one was legitimate - the resource started refreshing dirty when no labels are specified.

t0yv0 added a commit to pulumi/pulumi-aws that referenced this issue May 2, 2024
Recent changes in hashicorp/terraform-provider-aws#37111 introduced a Diff customizer that leads to errors in the Pulumi
version of the provider due to discrepancies in the order in which Diff customizer functions are applied between Pulumi
and Terraform. This change fixes the problem by applying the experimental PlanResourceChange flag to the affected
resource.

A regression test is included.

See also: pulumi/pulumi-terraform-bridge#1785
t0yv0 added a commit to pulumi/pulumi-aws that referenced this issue May 2, 2024
…3888)

Recent changes in hashicorp/terraform-provider-aws#37111 introduced a
Diff customizer that leads to errors in the Pulumi version of the
provider due to discrepancies in the order in which Diff customizer
functions are applied between Pulumi and Terraform. This change fixes
the problem by applying the experimental PlanResourceChange flag to the
affected resource.

Fixes #3887

A regression test is included.

See also: pulumi/pulumi-terraform-bridge#1785
@t0yv0
Copy link
Member Author

t0yv0 commented May 9, 2024

Some updates. There's still some very basic issues around unknown handling (#1943) causing customer P1s in AWS (pulumi/pulumi-aws@8c302b9) for resources where this was speculatively rolled out to fix other P1s or diff issues like in the case of LaunchTemplate.

Given this situation, after #1943 it would be great to open a bridge PR that runs all the downstream tests with this feature turned on to see if we can spot any more basic simple failures and proactively fix them.

The issue in question could have been located by #1858 but I would bet also regular integration tests would flag it somewhere.

@t0yv0
Copy link
Member Author

t0yv0 commented May 9, 2024

ON a positive note we landed #1927 which together with this flag is shown to close 4 resource cycling issues in AWS, and likely actually fixes a lot more.

@VenelinMartinov
Copy link
Contributor

VenelinMartinov commented May 9, 2024

We could add an env var which default enables this perhaps?

Then run downstream tests with the env var enabled. Iterate until all downstream tests succeed?

EDIT: This was done in fb50112

PULUMI_ENABLE_PLAN_RESOURCE_CHANGE=1 enables PlanResourceChange for all resources in all bridged providers.

@t0yv0
Copy link
Member Author

t0yv0 commented May 9, 2024

Exactly. The issues from actual integration tests likely cover the most important basics.

@iwahbe iwahbe added kind/epic Large new features or investments and removed kind/task Work that's part of an ongoing epic labels May 29, 2024
@iwahbe iwahbe changed the title Rollout PlanResourceChange [EPIC] Rollout PlanResourceChange May 29, 2024
@VenelinMartinov
Copy link
Contributor

Should we finish #1858 and maybe #1864 before rolling this out?

@t0yv0
Copy link
Member Author

t0yv0 commented May 29, 2024

As discussed today:

For failing downstream checks do not over-index on upgrade tests flagging Updates, this is not unexpected and we can weaken the tests to just weed out replacements.

VenelinMartinov added a commit that referenced this issue Jun 5, 2024
part of #1785

This change adds a normalisation step for collections when recovering
cty values to pass to terraform. This ensures we represent them
similarly to terraform.

In practice this means that all block collections need to be passed to
TF providers as an empty collection instead of nil. This should get rid
of quite a few subtle discrepancies in the data we pass to the TF
provider code. These sometimes result in panics since we pass unexpected
nils.

This gets rid of all known input discrepancies discovered so far through
cross-testing.

The full rules for what is a block are
[here](https://github.com/hashicorp/terraform-plugin-sdk/blob/1f499688ebd9420768f501d4ed622a51b2135ced/helper/schema/core_schema.go#L60).
It is essentially properties with schema: typeList or typeSet with a
Resource Elem.

fixes #1970
fixes #1915
fixes #1964
fixes #1767 
fixes #1762 

TODO: [DONE] remove the MaxItemsOne default hacks introduced in
#1725 (opened
#2049)

---------

Co-authored-by: Anton Tayanovskyy <[email protected]>
Co-authored-by: Ian Wahbe <[email protected]>
@VenelinMartinov
Copy link
Contributor

Discovered a new issue with GCP and PRC: pulumi/pulumi-gcp#2078

@VenelinMartinov
Copy link
Contributor

#2039 reproes on non-PRC, removing it from the list.

@VenelinMartinov
Copy link
Contributor

pulumi/pulumi-gcp#2078 is actually an upstream bug/behaviour uncovered by PRC

@VenelinMartinov
Copy link
Contributor

1967 and 2047 are addressed by #2065 so we only need to run down all the failures in #1962

VenelinMartinov added a commit to pulumi/pulumi-azure that referenced this issue Aug 19, 2024
Enables PlanResourceChange by default in pulumi-azure.

Part of pulumi/pulumi-terraform-bridge#1785

Also contains #2325
Also contains #2331

fixes #2322
fixes #2323
fixes #2330
@VenelinMartinov
Copy link
Contributor

Azure is done: pulumi/pulumi-azure#2306

@VenelinMartinov
Copy link
Contributor

Investigating PRC test failures in AWS here: pulumi/pulumi-aws#4410

@VenelinMartinov
Copy link
Contributor

Started on the bridge cleanup #2380 so that we can base other work off of that.

@VenelinMartinov
Copy link
Contributor

Possible issue with user-defined timeout overrides for Create under PRC: #2386

@t0yv0
Copy link
Member Author

t0yv0 commented Sep 6, 2024

Progress this week on fixing the emergent custom timeouts issue 2386; next up: roll out to AWS, cleanup and close. AWS rollout delayed as waiting for a bridge release to pass all downstream tests.

@mjeffryes mjeffryes modified the milestones: 0.109, 0.110 Sep 12, 2024
@t0yv0
Copy link
Member Author

t0yv0 commented Sep 14, 2024

pulumi/pulumi-aws#4457 is a consequence we believe, but it is quite tricky. The low-level behavior is more inline with matching TF (our goal) but at the higher level we have gaps to make the users be able to get unstuck.

@t0yv0
Copy link
Member Author

t0yv0 commented Sep 14, 2024

pulumi/pulumi-gcp#2372 is this a consequence of PRC rollout?

@mjeffryes
Copy link
Member

Yes, the behavior is actually more consistent with the upstream behavior with PRC, but this results in unexpected diffs for when there are labels set as ""

@VenelinMartinov
Copy link
Contributor

Another persistent diff fixed by PRC: pulumi/pulumi-newrelic#875

@t0yv0
Copy link
Member Author

t0yv0 commented Sep 18, 2024

One more fix possibly: pulumi/pulumi-aws#1784 per @corymhall

@VenelinMartinov
Copy link
Contributor

New issue with detailed diff when a set contains an unknown element: #2427

@pulumi-bot
Copy link
Contributor

This issue has been addressed in PR #2380 and shipped in release v3.91.0.

@VenelinMartinov
Copy link
Contributor

possibly fixed with this: pulumi/pulumi-azuread#1338

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/epic Large new features or investments resolution/fixed This issue was fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants