chore: send alerts can wait forever and fail for broken workflows #9638

Merged
hamidzr merged 5 commits into main from hz-cialerts
Jul 12, 2024

Conversation

hamidzr
Contributor

@hamidzr hamidzr commented Jul 11, 2024

Ticket

https://hpe-aiatscale.atlassian.net/browse/RM-372

Description

Adds a condition to skip over workflows that have exceeded their timeout instead of reporting them as active, so that the send-alerts job does not wait on them indefinitely.

Why

The send-alerts job currently:

  1. takes too long (5 hours in these cases)
  2. does not fail if another CI job takes too long

This job stayed active for 5 hours:
https://app.circleci.com/pipelines/github/determined-ai/determined/57843/workflows/1a75f6bf-77b2-45e5-8155-9f2f5dcc96c4/jobs/2742238

Testing the timeout: https://app.circleci.com/pipelines/github/determined-ai/determined/58023/workflows/53afd482-abd0-403a-9346-0de85400e533/jobs/2756232

Context: https://hpe-aiatscale.slack.com/archives/C04C9JXB1C2/p1720448458815989

Test Plan

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Jul 11, 2024

netlify bot commented Jul 11, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit cd7b409
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66901121e3d3600008c3ac49

@hamidzr hamidzr self-assigned this Jul 11, 2024

codecov bot commented Jul 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.00%. Comparing base (d4c50b5) to head (cd7b409).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9638   +/-   ##
=======================================
  Coverage   52.99%   53.00%           
=======================================
  Files        1255     1255           
  Lines      152884   152884           
  Branches     3233     3234    +1     
=======================================
+ Hits        81015    81029   +14     
+ Misses      71718    71704   -14     
  Partials      151      151           
Flag Coverage Δ
backend 44.19% <ø> (+0.02%) ⬆️
harness 72.77% <ø> (ø)
web 51.37% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 7 files with indirect coverage changes

@hamidzr hamidzr changed the title chore: skip the wait for nightlies chore: send alerts can wait forever and fail for broken workflows Jul 11, 2024
@hamidzr hamidzr requested a review from NicholasBlaskey July 11, 2024 16:13
@hamidzr hamidzr force-pushed the hz-cialerts branch 2 times, most recently from 8ebd932 to ff129d7 Compare July 11, 2024 16:51
@hamidzr hamidzr marked this pull request as ready for review July 11, 2024 17:34
@hamidzr hamidzr requested a review from a team as a code owner July 11, 2024 17:34
@hamidzr hamidzr requested a review from JComins000 July 11, 2024 17:34
    for w in workflows["items"]:
        if w["name"] in workflows_to_skip:
            continue

        workflow_id = w["id"]
        if not workflows_are_running and w["stopped_at"] is None:
            print(f"waiting for at least workflow {w['name']} to finish")
            workflows_are_running = True
            created_at = datetime.datetime.strptime(w["created_at"], "%Y-%m-%dT%H:%M:%SZ")
Contributor

how about this

            created_at = datetime.datetime.strptime(w["created_at"], "%Y-%m-%dT%H:%M:%SZ")
            if created_at < earliest_accepted_time:
                print(f"workflow {w['name']} timed out.")
                # TODO: add support for reporting as a timeout or failure.
                continue
            print(f"waiting for at least workflow {w['name']} to finish")
            workflows_are_running = True

Contributor Author

@hamidzr hamidzr Jul 11, 2024

I went to change it, but then I thought about it again, and IMO the existing one makes it easier to show that no behavior has changed other than not setting the workflow as running.

Contributor

sure, it's just style at the end of the day
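
For readers following this thread, here is a minimal, self-contained sketch of the check being discussed: an unfinished workflow only counts as "running" if it was created within the timeout window. It is not the code that was merged. The function name `any_workflow_running`, the `now` and `timeout` parameters, and the 2-hour default are assumptions made for illustration; the field names (`items`, `name`, `created_at`, `stopped_at`), the timestamp format, and the variable `earliest_accepted_time` come from the diff and the suggestion above. The sketch also returns early instead of tracking `workflows_are_running`, and it assumes `now` is a naive UTC datetime, since parsing with a literal `Z` yields naive datetimes.

```python
import datetime

# Hypothetical timeout; the PR does not state the real value.
WORKFLOW_TIMEOUT = datetime.timedelta(hours=2)


def any_workflow_running(workflows, workflows_to_skip, now, timeout=WORKFLOW_TIMEOUT):
    """Return True if an unfinished workflow exists that is neither skipped nor timed out."""
    earliest_accepted_time = now - timeout
    for w in workflows["items"]:
        if w["name"] in workflows_to_skip:
            continue
        if w["stopped_at"] is not None:
            # Already finished, so there is nothing to wait for.
            continue
        created_at = datetime.datetime.strptime(w["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        if created_at < earliest_accepted_time:
            # Unfinished but older than the timeout: do not treat it as active.
            print(f"workflow {w['name']} timed out.")
            continue
        print(f"waiting for at least workflow {w['name']} to finish")
        return True
    return False


# Example data (made up) in the shape used by the diff above.
now = datetime.datetime(2024, 7, 11, 12, 0, 0)
workflows = {
    "items": [
        {"name": "nightly", "created_at": "2024-07-11T11:00:00Z", "stopped_at": None},
        {"name": "stuck", "created_at": "2024-07-11T05:00:00Z", "stopped_at": None},
        {"name": "done", "created_at": "2024-07-11T11:30:00Z", "stopped_at": "2024-07-11T11:45:00Z"},
    ]
}
still_running = any_workflow_running(workflows, {"nightly"}, now)
```

With this data, `nightly` is skipped by name, `stuck` is older than the timeout, and `done` has already stopped, so nothing is reported as running and the caller would stop waiting.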

@JComins000
Contributor

JComins000 commented Jul 11, 2024

why don't you make an issue for this? the issue content should explain the problem or "why" you're doing this work. it's your current pr description.
the pr description should explain the implementation or "how" you're doing the work. something like

adding a condition to skip over workflows that have elapsed their timeout instead of always setting workflowsRunning to true

@hamidzr hamidzr assigned JComins000 and unassigned hamidzr Jul 11, 2024
@hamidzr
Contributor Author

hamidzr commented Jul 12, 2024

thanks for the review!

@hamidzr hamidzr merged commit a498008 into main Jul 12, 2024
88 of 101 checks passed
@hamidzr hamidzr deleted the hz-cialerts branch July 12, 2024 14:53