
Feature request: Please support parent_id on recurring Downtimes #109

Closed
ardrigh opened this issue Oct 26, 2018 · 15 comments

Comments

@ardrigh

ardrigh commented Oct 26, 2018

Our team is trying to use Terraform to manage a scheduled monthly downtime for Datadog. It occurs on the first day of the month for one hour.

I imported the existing downtime monitor to avoid manually adding the start and end values, and it worked fine until the next downtime was completed and the id value changed.

I asked about this behaviour in the Datadog Slack channel and was told this is how downtime monitors work: the downtime runs under its first id, and when that occurrence completes, a new downtime is created under a new id, with its parent_id set to the original id.

If the Datadog provider could process this extra information, the downtime would not appear in the plan as a new resource. It would hopefully manage the changing id value transparently in the state file somehow.
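To illustrate the rotation described above, here is a small Python sketch. The field names follow the downtime objects returned by the Datadog API, but the numeric ids and values are made up:

```python
# Hypothetical downtime objects illustrating the id rotation (ids are made up).
original = {"id": 1001, "parent_id": None,
            "recurrence": {"type": "months", "period": 1}}

# After the first occurrence completes, the API exposes a new downtime whose
# parent_id points back at the previous id:
next_occurrence = {"id": 1002, "parent_id": 1001,
                   "recurrence": {"type": "months", "period": 1}}

# Terraform's state still records id 1001, so a refresh finds nothing under
# that id, and the plan proposes a fresh create.
assert next_occurrence["id"] != original["id"]
assert next_occurrence["parent_id"] == original["id"]
```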

Terraform Version

Terraform v0.11.10

Affected Resource(s)

datadog_downtime

Terraform Configuration Files

resource "datadog_downtime" "scheduled_outage" {
  scope      = ["host:host.example.com"]
  monitor_id = 0000001

  recurrence {
    type   = "months"
    period = 1
  }

  message = "host downtime for monthly backup of vm. Notify: @slack-devops"

  lifecycle {
    ignore_changes = ["start", "end", "active", "disabled"]
  }
}

Expected Behavior

terraform plan will look for updates to the datadog_downtime resource but will not consider it an addition.

Actual Behavior

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  + datadog_downtime.scheduled_outage
      id:                  <computed>
      message:             "host downtime for monthly backup of vm. Notify: @slack-devops"
      monitor_id:          "0000001"
      recurrence.#:        "1"
      recurrence.0.period: "1"
      recurrence.0.type:   "months"
      scope.#:             "1"
      scope.0:             "host:host.example.com"


Plan: 1 to add, 0 to change, 0 to destroy.

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. terraform init
  2. terraform plan

References

The API code examples show the parent_id field, but it is not mentioned in the linked documentation.
https://docs.datadoghq.com/api/?lang=python#schedule-monitor-downtime

@vanvlack

Added ParentId to the client in PR zorkian/go-datadog-api#227. Once that's accepted, we can likely make changes here to bring it in.

@vanvlack

ParentId is now supported as of zorkian/go-datadog-api#227 being merged

@ardrigh
Author

ardrigh commented Apr 23, 2019

@vanvlack thanks very much for getting that piece done.

I don't know what code is required to add support in this provider.

I believe that if the provider could compare a new parent_id against the list of previous ids in the state file, it would avoid the downtime being seen as a new resource, and thus avoid creating duplicates.
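A minimal sketch of that matching rule, with hypothetical helper and field names (the real provider is written in Go; Python is used here only for brevity):

```python
def find_current_downtime(all_downtimes, known_ids):
    """Return the live downtime whose id or parent_id matches an id that
    the state file has previously recorded for this resource, if any."""
    for downtime in all_downtimes:
        if downtime["id"] in known_ids or downtime.get("parent_id") in known_ids:
            return downtime
    return None

# ids previously recorded in state for this resource (hypothetical values)
known_ids = {1001, 1002}

live_downtimes = [
    {"id": 2005, "parent_id": None},   # unrelated downtime
    {"id": 1003, "parent_id": 1002},   # current occurrence of our downtime
]

current = find_current_downtime(live_downtimes, known_ids)
assert current["id"] == 1003
```

This only works while the immediate predecessor's id is still in the known set, which is exactly the limitation discussed further down in this thread.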

@bal2ag

bal2ag commented May 1, 2019

I wanted to add some color to this issue. We've been trying to get our entire monitoring infrastructure defined in Terraform, and recurring downtimes are the only resource we've been unable to manage there. We use recurring downtimes almost exclusively, for example for anomaly monitors that get noisy during off hours.

Because the recurring downtime model changes the ID on every recurrence, it breaks Terraform's expectation that re-applying an unchanged configuration causes no backend resource changes. This was quite frustrating: from Terraform's perspective, the originally created downtime simply "disappears", and re-applying the same configuration causes a 400 error, since it tries to create a new recurring downtime in the past (you have to specify start and end in absolute epoch time).

I know I'm not being super helpful by describing a problem we already know exists, but it might be worth updating the Terraform documentation to note that recurring downtimes don't work as expected until this issue is resolved (I also think it's closer to a bug than a feature request, IMHO). Happy to provide additional insight from our experience, but I suspect many people have taken a similar path and simply reverted to managing downtimes through Datadog's UI.

@pdecat
Contributor

pdecat commented Sep 17, 2019

Hi, I've looked into implementing this using the downtime parent_id field.
As each new occurrence of the downtime takes its parent id from its immediate predecessor, this effectively forms a linked list of downtime items.
But because completed downtime items are deleted from the https://api.datadoghq.com/api/v1/downtime/ endpoint after a few hours, the full linked list back to the original downtime cannot be rebuilt, so the identity of the current downtime item cannot be determined with certainty.

Steps to reproduce:

  1. create a downtime for 1h with a 1 day recurring period, let's say it gets id 0001,
  2. verify its attributes with curl -s "https://api.datadoghq.com/api/v1/downtime/0001?api_key=${DATADOG_API_KEY}&application_key=${DATADOG_APP_KEY}", as expected its parent_id is null
  3. wait until it completes, it is still accessible using the above command for a few hours (must be a batch process of some kind)
  4. after a few hours, the above command will fail with HTTP/1.1 404 Not Found and payload {"errors":["Downtime not found"]}
  5. get all currently existing downtimes with curl -s "https://api.datadoghq.com/api/v1/downtime?api_key=${DATADOG_API_KEY}&application_key=${DATADOG_APP_KEY}"; you'll find a downtime whose parent_id field points at the original downtime, let's say it gets id 0002
  6. verify its attributes with curl -s "https://api.datadoghq.com/api/v1/downtime/0002?api_key=${DATADOG_API_KEY}&application_key=${DATADOG_APP_KEY}", as expected its parent_id is 0001
  7. wait until it completes and is replaced by another occurrence with id 0003
  8. after a few hours, the above command will fail with HTTP/1.1 404 Not Found and payload {"errors":["Downtime not found"]}
  9. at that point, there is no longer any link to the original downtime id.
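The break in the chain described by the steps above can be sketched in Python (the dicts stand in for API responses, and the ids are made up):

```python
def walk_to_original(downtime_by_id, start_id):
    """Follow parent_id links from the current downtime towards the
    original; return the original's id, or None once the chain is broken
    because an intermediate item was purged by the API."""
    current_id = start_id
    while True:
        downtime = downtime_by_id.get(current_id)
        if downtime is None:
            return None                 # link purged: chain cannot be rebuilt
        if downtime["parent_id"] is None:
            return current_id           # reached the original downtime
        current_id = downtime["parent_id"]

# Day 3 (step 9): only 0003 remains; 0002 has been purged, so the walk from
# 0003 can no longer reach the original 0001.
day3 = {"0003": {"parent_id": "0002"}}
assert walk_to_original(day3, "0003") is None
```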

I can think of at least two options that could make this work:

  • The downtime API could expose an original_parent_id field on each downtime item. That way, the link could always be restored. Better yet, it could add an option to query downtimes by that field to avoid having to retrieve all downtimes and search client side.
  • The Datadog API could expose some kind of downtime generator item whose id would stay stable over time.

@pdecat
Contributor

pdecat commented Sep 17, 2019

FWIW, I pushed a POC here: https://github.com/pdecat/terraform-provider-datadog/tree/recurrent_downtimes (https://github.com/pdecat/terraform-provider-datadog/commit/44f4ecd27b36371e9ca4cb8f0855d90c2d1a3947)

Applied this yesterday (Monday 2019/09/16):

resource "datadog_downtime" "test" {
  disabled   = false
  message    = "Managed by Terraform. Imported from web."
  monitor_id = null
  scope      = ["*"]

  start = 1568647200
  end   = 1568647300

  timezone = "Europe/Paris"

  recurrence {
    period = 1
    type   = "days"
  }
}

Today's plan with 2.4.0 (Tuesday 2019/09/17):

Terraform will perform the following actions:

  # datadog_downtime.test will be created
  + resource "datadog_downtime" "test" {
      + disabled = false
      + end      = 1568647300
      + id       = (known after apply)
      + message  = "Managed by Terraform. Imported from web."
      + scope    = [
          + "*",
        ]
      + start    = 1568647200
      + timezone = "Europe/Paris"

      + recurrence {
          + period = 1
          + type   = "days"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Update:

And as expected, the day after (Wednesday 2019/09/18), this no longer works because the first child of the original downtime was deleted:

Terraform will perform the following actions:

  # datadog_downtime.test will be created
  + resource "datadog_downtime" "test" {
      + disabled = false
      + end      = 1568647300
      + id       = (known after apply)
      + message  = "Managed by Terraform. Imported from web."
      + scope    = [
          + "*",
        ]
      + start    = 1568647200
      + timezone = "Europe/Paris"

      + recurrence {
          + period = 1
          + type   = "days"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

@vanvlack

vanvlack commented Sep 18, 2019

@pdecat any reason we need to know that original parent_id at all? We can assume that if a parent_id exists, it is a repeating downtime. Wonder if there is a way forward with that assumption?

edit: realizing this doesn't actually help us, as changes will need to somehow be tied to the new rotated monitors...

@ardrigh
Author

ardrigh commented Sep 18, 2019

@pdecat any reason we need to know that original parent_id at all? We can assume that if a parent_id exists, it is a repeating downtime. Wonder if there is a way forward with that assumption?

edit: realizing this doesn't actually help us, as changes will need to somehow be tied to the new rotated monitors...

The parent_id indicates it is an existing resource, but without the full history of that chain you can only use data from the resource as written in the Terraform code, and apart from its name, I don't think you could safely rely on those fields.

It might need a request to Datadog to support something like a grandparent_id field, if they don't provide a way to query the history of a parent_id back to the original id value.

I am happy to put in a support query to see what they say.

@platinummonkey
Contributor

👋 this is something we’re looking to address in the nearish future, among some other changes making downtimes (mostly) immutable (to address other edge cases people have run into).

Thank you for this helpful feedback 😄

@ardrigh
Author

ardrigh commented Oct 3, 2019

@platinummonkey that's great news. Please keep us updated on any progress 🍻

@bal2ag

bal2ag commented Jan 10, 2020

@platinummonkey has there been any progress towards making recurring downtimes manageable in Terraform?

@MrLemur

MrLemur commented Feb 18, 2021

@platinummonkey Just wondering if there has been any progress on this yet?

@platinummonkey
Contributor

@phillip-dd ^

@phillip-dd
Contributor

We're tracking this internally and have some work queued that should address this.

@NBParis

NBParis commented Jun 18, 2021

Hello,

Thanks for your patience on this.
I’m happy to share that the issue has been addressed and recurring downtimes are now properly handled with the new version of the terraform provider (v3.1.0 - see updates here).

You have to update your Terraform provider to version 3.1.0 to benefit from the fix.

The PR that addresses this issue contains a very detailed description of the change made and of the remaining caveat that we are still working to improve.

I'll go ahead and resolve this issue, but feel free to let us know if you have any questions or feedback.

Thanks again for reporting this issue and helping us improve the Terraform provider to better manage downtimes.

@NBParis NBParis closed this as completed Jun 18, 2021