Properly handle recurring downtimes definitions #1092

Merged
13 commits merged into master from armcburney/recurring_downtimes
Jun 17, 2021

@armcburney armcburney commented Jun 1, 2021

Overview

This PR implements a workaround for handling recurring downtimes with the Datadog Terraform provider. Recurring downtimes differ from regular 'one-off' downtimes in that each subsequent recurrence is scheduled as a new downtime derived from the previous one. Conceptually, we can think of this as a "linked list" of downtimes, where each downtime is scheduled from the definition of the previous recurrence. All fields are copied over from the previous parent, with the exception of fields like start and end, which are calculated from the recurrence attribute.
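To make the "linked list" model concrete, here is a minimal Python sketch of how a child downtime could be derived from the previous one. It is illustrative only: the `Downtime` class and the `schedule_next` helper are hypothetical names, not part of the Datadog API or of the provider (which is written in Go). It also assumes that a `days` recurrence advances by a fixed 86400 seconds per day, ignoring any time zone or daylight-saving adjustment, and the child ID is made up.

```python
from dataclasses import dataclass, replace
from typing import Optional

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Downtime:
    id: int
    start: int  # POSIX timestamp in seconds
    end: int    # POSIX timestamp in seconds
    message: str
    period_days: int  # recurrence period; assumes recurrence type "days"
    parent_id: Optional[int] = None  # None for the original parent


def schedule_next(previous: Downtime, new_id: int) -> Downtime:
    """Copy every field from the previous downtime, then shift start/end by one period."""
    shift = previous.period_days * SECONDS_PER_DAY
    return replace(
        previous,
        id=new_id,
        start=previous.start + shift,
        end=previous.end + shift,
        parent_id=previous.id,
    )


parent = Downtime(id=1337, start=1620929400, end=1620929700,
                  message="test this out", period_days=1)
child = schedule_next(parent, new_id=1338)  # 1338 is a made-up ID
print(child.start, child.end)  # 1621015800 1621016100
```

Under these assumptions the child carries a new ID and shifted start/end values, while every other field matches the downtime it was scheduled from.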

The Datadog Terraform provider keeps a reference to the original parent downtime ID. Since subsequent recurrences are scheduled as new downtimes (with new IDs), updates made in the UI/API to the current 'child' downtime were previously not recognized when comparing the downtime's state with what is stored in Terraform. Moreover, after downtimes expire, we delete them from our database after a certain period of time. This behavior was recently changed so that we don't delete an expired downtime if it is the first downtime in the recurrence chain (i.e., the original parent downtime).

Instead, we now return that downtime with a new active_child field in our GET /api/v1/downtime API, which we use to compare state with Terraform. This way, updates from the UI/API are propagated back to Terraform. Additionally, when making updates through Terraform, we call the PUT /api/v1/downtime/{downtime_id} endpoint with the active_child definition and the active_child's ID, so that changes from Terraform are applied to the currently active recurrence.
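The read and update flow described here can be sketched roughly as follows. This is Python pseudologic, not the provider's Go implementation: the helper names are hypothetical, and the response body is a trimmed, assumed shape in which `active_child` is a nested downtime definition with its own `id`.

```python
from typing import Any, Dict


def effective_definition(parent: Dict[str, Any]) -> Dict[str, Any]:
    """Return the definition to compare against Terraform state.

    Falls back to the parent itself when no active_child is present
    (for example, a non-recurring downtime).
    """
    return parent.get("active_child") or parent


def update_path(parent: Dict[str, Any]) -> str:
    """Build the PUT path, targeting the active child when there is one."""
    target = effective_definition(parent)
    return f"/api/v1/downtime/{target['id']}"


# Assumed, trimmed response body for the parent downtime; IDs are made up.
response = {
    "id": 1337,
    "message": "test this out",
    "active_child": {"id": 1338, "message": "test this out edit ui"},
}

print(effective_definition(response)["message"])  # test this out edit ui
print(update_path(response))  # /api/v1/downtime/1338
```

Terraform keeps tracking the parent ID (1337 here), while reads and writes are redirected to whichever child is currently active.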

Caveats

Caveat One

UPDATE: To prevent superfluous diffs every time the downtime is rescheduled, we don't check the start/end boundaries for changes if the recurring downtime is a child. The downside of this approach is that if the start/end values are changed in the UI on a child recurring downtime, the diff will not be picked up by Terraform. We plan to iterate on this solution to address this shortcoming, but feel the benefits of having Terraform handle the child/parent references are worth merging in.
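As a rough illustration of this behavior (a Python sketch with hypothetical names, not the provider's actual Go code), the comparison could skip start and end only when the remote definition is a child:

```python
from typing import Any, Dict, Set

# Fields that change on every reschedule, per the caveat above.
RESCHEDULED_FIELDS = {"start", "end"}


def changed_fields(config: Dict[str, Any], remote: Dict[str, Any],
                   is_child: bool) -> Set[str]:
    """Return the names of fields whose configured and remote values differ."""
    ignored = RESCHEDULED_FIELDS if is_child else set()
    return {
        name
        for name, value in config.items()
        if name not in ignored and remote.get(name) != value
    }


config = {"start": 1620929400, "end": 1620929700, "message": "test this out"}
remote = {"start": 1621015800, "end": 1621016100, "message": "test this out edit ui"}

print(sorted(changed_fields(config, remote, is_child=True)))   # ['message']
print(sorted(changed_fields(config, remote, is_child=False)))  # ['end', 'message', 'start']
```

The first call shows the trade-off: the message edit is still detected, but a start/end change made in the UI on a child would go unnoticed.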

Since new downtimes are scheduled each time a recurrence is rescheduled, fields like start and end will perpetually differ after the first schedule when running terraform plan/terraform apply.

$ terraform plan
datadog_downtime.recurring_downtime: Refreshing state... [id=1337]

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # datadog_downtime.recurring_downtime will be updated in-place
  ~ resource "datadog_downtime" "recurring_downtime" {
      ~ end             = 1621016100 -> 1620929700
        id              = "1337"
      ~ message         = "test this out edit ui" -> "test this out"
      ~ start           = 1621015800 -> 1620929400
        # (6 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

I recommend we update the documentation around recurring downtimes to describe this problem and recommend an approach to mitigating it (i.e., using ignore_changes to ignore the start and end updates).

resource "datadog_downtime" "recurring_downtime" {
  scope        = ["*"]
  start        = 1620929400
  end          = 1620929700
  timezone     = "EST"
  monitor_tags = ["github:armcburney"]
  message      = "test this out"

  recurrence {
    type   = "days"
    period = 1
  }

  lifecycle {
    ignore_changes = [
      start,
      end
    ]
  }
}

Caveat Two

Recurring downtimes created before 2021-05-13 using Terraform will need to be deleted and recreated for the references to work with the newest version of the Terraform provider. All recurring downtimes created after 2021-05-13 will have the updated parent/child references, ensuring they’ll work as expected with the latest version of the provider. We apologize for this inconvenience.

Fixes

@armcburney armcburney marked this pull request as ready for review June 2, 2021 11:59
@armcburney armcburney requested review from a team as code owners June 2, 2021 11:59
@phillip-dd

thanks @armcburney! This makes sense to me, except for this part:

fields like start and end will perpetually differ after the first schedule

From a customer perspective I don't think this is what we want - the provider should not be showing a diff when this is working as expected. As well, if customers use ignore_changes, then they won't be able to see any true diffs if the recurrence is updated in the UI.

Some other options:

  • ignore start/end explicitly in the provider
  • always compare start/end on the parent downtime (this would only catch changes if the original parent was changed)
  • compare duration: e.g. end-start
  • something else?
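
For reference, the "compare duration" option might look something like the sketch below. It is a hypothetical Python helper for illustration only, not code from this PR or from the provider.

```python
def duration_changed(config_start: int, config_end: int,
                     remote_start: int, remote_end: int) -> bool:
    """True when the remote downtime's length differs from the configured length."""
    return (config_end - config_start) != (remote_end - remote_start)


# Rescheduled one day later with the same 300-second length: no diff.
print(duration_changed(1620929400, 1620929700, 1621015800, 1621016100))  # False
# End extended by 60 seconds in the UI: diff.
print(duration_changed(1620929400, 1620929700, 1621015800, 1621016160))  # True
```

Note that this check would not notice a change that moves start and end by the same amount, since the duration stays the same.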


@phillip-dd phillip-dd left a comment


👍 couple of questions, but this looks good to me. Definitely an improvement!

gotestsum --hide-summary skipped --format testname --debug --packages $(TEST) -- $(TESTARGS) -timeout=30s

# Run acceptance tests (this runs integration CRUD tests through the terraform test framework)
testacc: get-test-deps fmtcheck lint
testacc: get-test-deps fmtcheck

I saw there was a slack conversation about this, @therve can you just confirm this is what the recommendation was?


Yes we'll come back to fix it later.

@@ -50,6 +50,7 @@ resource "datadog_downtime" "foo" {

### Optional

- **active_child_id** (Number) The id corresponding to the downtime object definition of the active child for the original parent recurring downtime. This field will only exist on recurring downtimes.

if we include this, I think it should be actually read only.


Bumping on this, is this part auto generated or a copy/paste?


The doc is auto generated. But to avoid this, we should remove the line below since the attribute is read only:

phillip-dd previously approved these changes Jun 15, 2021

@phillip-dd phillip-dd left a comment


lgtm - we may need to update the PR title for the change log

@therve therve changed the title from "[MA-2231] Properly handle recurring downtimes definitions in terraform." to "Properly handle recurring downtimes definitions" Jun 17, 2021
@therve

therve commented Jun 17, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@therve therve enabled auto-merge (squash) June 17, 2021 13:45
@therve therve merged commit 8a594bb into master Jun 17, 2021
@therve therve deleted the armcburney/recurring_downtimes branch June 17, 2021 14:02
@NBParis NBParis linked an issue Jun 18, 2021 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Downtime recreated after recurrence
4 participants