Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cml runner long job #583

Merged
merged 14 commits into from
Jul 5, 2021
Merged

Cml runner long job #583

merged 14 commits into from
Jul 5, 2021

Conversation

DavidGOrtega
Copy link
Contributor

@DavidGOrtega DavidGOrtega commented Jun 8, 2021

Introduces restart all the workflows with jobs not yet finished.

  • Adds restart workflows with tracked and unfinished jobs
  • Solves 72 hours max time in Github
  • Refactors the code in cml parseRunnerLog and cml-runner shutdown
  • solves unknown bug in logs due to incomplete string comparison success:

closes #174
closes #208 partially?

@DavidGOrtega DavidGOrtega temporarily deployed to internal June 8, 2021 16:03 Inactive
@DavidGOrtega
Copy link
Contributor Author

DavidGOrtega commented Jun 8, 2021

#208 is pending because:

  • We might find desirable to still handle timeouts
  • In GH is near imposible to know if the job is canceled due to timeout, at least from the logs.

The only solution (and actually the original one) is to control the timer to not reach the MAX GH timeout time and restart a few min before reach it. This would solve the max time limit in GH respecting every timeout defined by the users.

What do you think?

@casperdcl @0x2b3bfa0 @Suor @mnrozhkov @dmpetrov

@DavidGOrtega DavidGOrtega temporarily deployed to internal June 8, 2021 16:10 Inactive
@DavidGOrtega DavidGOrtega marked this pull request as draft June 8, 2021 16:10
@DavidGOrtega DavidGOrtega temporarily deployed to internal June 8, 2021 16:33 Inactive
@DavidGOrtega DavidGOrtega temporarily deployed to internal June 8, 2021 17:03 Inactive
@DavidGOrtega DavidGOrtega temporarily deployed to internal June 8, 2021 17:06 Inactive
@casperdcl casperdcl marked this pull request as ready for review June 8, 2021 18:46
bin/cml-runner.js Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
@casperdcl
Copy link
Contributor

we should also maybe add an option (default true) to auto-post a warning comment saying "restarting due to timeout?"

@DavidGOrtega
Copy link
Contributor Author

we should also maybe add an option (default true) to auto-post a warning comment saying "restarting due to timeout?"

I liked this idea and has been explored before but the runner needs to know the SHA. Aside that I think that what we should do is do proper logging #22 That way they could setup the logging config and being able to log everything and this would also be nice for DVC viewer

@DavidGOrtega DavidGOrtega temporarily deployed to internal June 18, 2021 11:34 Inactive
Copy link
Contributor

@casperdcl casperdcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be configurable (--retry-timeout, similar to --idle-timeout) - here's the gist...

bin/cml-runner.js Outdated Show resolved Hide resolved
bin/cml-runner.js Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
bin/cml-runner.js Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
@DavidGOrtega DavidGOrtega temporarily deployed to internal June 18, 2021 15:04 Inactive
@0x2b3bfa0
Copy link
Member

0x2b3bfa0 commented Jun 22, 2021

@DavidGOrtega, do you want a release manager review? They're free this week!

@DavidGOrtega
Copy link
Contributor Author

@0x2b3bfa0 please!

@0x2b3bfa0 0x2b3bfa0 self-requested a review June 23, 2021 10:34
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to internal June 30, 2021 20:07 Inactive
bin/cml-runner.js Outdated Show resolved Hide resolved
bin/cml-runner.js Outdated Show resolved Hide resolved
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to internal June 30, 2021 23:08 Inactive
@0x2b3bfa0 0x2b3bfa0 force-pushed the cml-runner/long-job branch from 496070a to eddeae9 Compare June 30, 2021 23:52
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to internal June 30, 2021 23:53 Inactive
@0x2b3bfa0 0x2b3bfa0 temporarily deployed to internal July 1, 2021 00:05 Inactive
@DavidGOrtega DavidGOrtega temporarily deployed to internal July 4, 2021 21:34 Inactive
@DavidGOrtega DavidGOrtega force-pushed the cml-runner/long-job branch from 1d6e728 to 3e6545f Compare July 4, 2021 23:44
@DavidGOrtega DavidGOrtega temporarily deployed to internal July 4, 2021 23:44 Inactive
@DavidGOrtega DavidGOrtega temporarily deployed to internal July 5, 2021 00:50 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflow timeout a better scenario Spot instances- Runner must be able to restart workflow
3 participants