Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add job run-time limit flag to neuro run #1325

Merged
merged 14 commits into from
Feb 26, 2020
Merged

Add job run-time limit flag to neuro run #1325

merged 14 commits into from
Feb 26, 2020

Conversation

atemate
Copy link
Contributor

@atemate atemate commented Feb 11, 2020

Support run-time limit enforcement functionality: neuro-inc/platform-api#1044

neuromation/api/utils.py Outdated Show resolved Hide resolved
neuromation/cli/job.py Outdated Show resolved Hide resolved
@atemate atemate marked this pull request as ready for review February 11, 2020 15:05
@serhiy-storchaka
Copy link
Contributor

You can use the timeout command.

@atemate
Copy link
Contributor Author

atemate commented Feb 11, 2020

You can use the timeout command.

@serhiy-storchaka sorry I didn't understand you: for what?

@serhiy-storchaka
Copy link
Contributor

I meant that the user can use the timeout command instead of neuro option to limit the job runtime.

$ neuro run timeout 1h job-command

If add such feature in neuro, I think it should time suffixes m for minutes and h for hours as in admin add-user-quota. It is also compatible with timeout.

@serhiy-storchaka
Copy link
Contributor

I think --limit is not good name. We can add other limits in future.

Copy link
Contributor

@asvetlov asvetlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update ./docs/jobs_reference.rts at least as well.

neuromation/api/utils.py Outdated Show resolved Hide resolved
neuromation/api/config.py Outdated Show resolved Hide resolved
@atemate
Copy link
Contributor Author

atemate commented Feb 19, 2020

@asvetlov, a few questions if you don't mind:

  1. Any objections from using pytimeparse? It provides the parser for free format time-delta definition like 2d 5h 34m or 2 days, 5 hours, 34 minutes or 5hr34m56s. However, it's officially supported only up to Pyhton 3.6. If we refuse to use it, I'll write similar but simplified regex-based parser to parse strings like 2d 5h 34m, not a problem.
  2. If we change the name of the flag to neuro run --timeout, is that a problem that neuro exec --timeout already exists, but --timeout there has slightly different meaning?
  3. How can we (and should we) write the default config value to the user's yml-config? I see method calc_columns addresses the default value COLUMNS if no value found in config, so here we repeat this algorithm for method calc_default_job_timeout. However, I thought it might be useful if we write this default value to the yml-config, and just let the user modify it -- it's easier.

@asvetlov
Copy link
Contributor

@asvetlov, a few questions if you don't mind:

  1. Any objections from using pytimeparse? It provides the parser for free format time-delta definition like 2d 5h 34m or 2 days, 5 hours, 34 minutes or 5hr34m56s. However, it's officially supported only up to Pyhton 3.6. If we refuse to use it, I'll write similar but simplified regex-based parser to parse strings like 2d 5h 34m, not a problem.

A free format is not good for usage in CLI parameters: 2d 5h 34m should be enclosed in brackets, 2 days, 5 hours, 34 minutes is even worse because of its length.

  1. If we change the name of the flag to neuro run --timeout, is that a problem that neuro exec --timeout already exists, but --timeout there has slightly different meaning?

Please use --life-span, the term is very well known, short enough and unambiguous.

  1. How can we (and should we) write the default config value to the user's yml-config? I see method calc_columns addresses the default value COLUMNS if no value found in config, so here we repeat this algorithm for method calc_default_job_timeout. However, I thought it might be useful if we write this default value to the yml-config, and just let the user modify it -- it's easier.

The philosophy of user-provided configuration files is that they are never modified by the application itself. Even a command for user config file changing is redundant -- every software engineer knows how to use a text editor. Please use the hard-coded default value if absent.

P.S. I wonder why do you use yml-config term for toml file format?

Artem Yushkovskiy added 2 commits February 20, 2020 14:34
@atemate
Copy link
Contributor Author

atemate commented Feb 20, 2020

Re-worked this PR considering the comments.

  1. Not a big deal not to use a side library, the hand-written parser turned out to be small and supporting more concise format: 1d 2h 3m 4s or 1d2h3m4s or 1d 3m or 1d or 0 (with or without spaces, where some parts may be missing, or 0 as an edge-case since 0s is the same as 0m).
  2. I never heard of --life-span in the sense of timeout, but it does not matter for me, I did the --life-span.
  3. I would think of a user-config as of a normal Unix config (a .conf file somewhere inside /etc/), which contains all sections with defaults in the right format but commented out. But I agree that, comparing to Unix utilities, neuro is evolving fastly, and the format of the user-config is changed fastly either. So this is apparently the part of documentation, to teach user the format.
    BTW, note the format of the new section:
[job.default-life-span]
days = 1
hours = 2
minutes = 3

P.S. I wonder why do you use yml-config term for toml file format?

my bad, a typo-like mistake.

Artem Yushkovskiy added 2 commits February 20, 2020 15:19
@atemate
Copy link
Contributor Author

atemate commented Feb 20, 2020

New question: What should be the short option for --life-time? Should it be at all?

Comment on lines 1190 to 1191
elif seconds < 0:
raise click.UsageError(f"Job life span must be non-negative, got: '{result}'")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, our input parser does not allow negative values, so they can come only from the config. I understand, it's better to check it during user config validation, but it's quicker to do it here, after parsing.

  1. Should I add more validation code to _validate_user_config(), similarly as it is with _validate_alias()?
  2. Why don't we use trafaret? Doesn't it have enough expressive power to validate aliases, does it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, our input parser does not allow negative values, so they can come only from the config. I understand, it's better to check it during user config validation, but it's quicker to do it here, after parsing.

  1. Should I add more validation code to _validate_user_config(), similarly as it is with _validate_alias()?

Please keep as is until we implement something like neuro config validate command.

  1. Why don't we use trafaret? Doesn't it have enough expressive power to validate aliases, does it?

Trafaret provides machine-readable error messages which should be translated to human-readable texts before printing. It makes trafaret virtually useless for our needs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use extract_error? It might be easier to use a package that was designed to write custom validators, rather than to implement our own from scratch.

@asvetlov
Copy link
Contributor

New question: What should be the short option for --life-time? Should it be at all?

If you are in doubt -- don't use short name, that's fine.

neuromation/api/config.py Outdated Show resolved Hide resolved
neuromation/api/jobs.py Outdated Show resolved Hide resolved
neuromation/cli/job.py Outdated Show resolved Hide resolved
neuromation/cli/job.py Outdated Show resolved Hide resolved
@@ -889,6 +927,13 @@ def format_fail(job: str, reason: Exception) -> str:
if not wait_start:
detach = True

job_life_span = await calc_life_span(root.client, life_span)
if not root.quiet:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rather prefer to see this printout in explicit verbose mode only. In both normal and quiet modes the message should be absent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the information that your job will die automatically after certain (in some cases, implicit) amount of time a more important information rather than, say, Shortcuts? I think, if the job has a timeout, it should be printed in non-quiet mode. Other question is, perhaps, a better place for it is formatters/jobs.py?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of important information about the job. We need a balance between insufficient and too verbose outputs.
Moving reports about remaining time is a very cool idea.
I see the REST API inconsistency here though: the deadline timestamp is not returned in JobDescription data structure. Seems like the server doesn't file this info, that's sad. In my mind, it should be an absolute UNIX timestamp; relative times are not reliable. The error is one less significant digit due to network latency, 1 minute in your case.
It raises another question: should the absolute timestamp be used as input in REST API? Passing a relative time in Python API is ok.

Comment on lines 1190 to 1191
elif seconds < 0:
raise click.UsageError(f"Job life span must be non-negative, got: '{result}'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, our input parser does not allow negative values, so they can come only from the config. I understand, it's better to check it during user config validation, but it's quicker to do it here, after parsing.

  1. Should I add more validation code to _validate_user_config(), similarly as it is with _validate_alias()?

Please keep as is until we implement something like neuro config validate command.

  1. Why don't we use trafaret? Doesn't it have enough expressive power to validate aliases, does it?

Trafaret provides machine-readable error messages which should be translated to human-readable texts before printing. It makes trafaret virtually useless for our needs.

Copy link
Contributor

@asvetlov asvetlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document

neuromation/cli/job.py Outdated Show resolved Hide resolved
neuromation/cli/job.py Outdated Show resolved Hide resolved
neuromation/api/jobs.py Show resolved Hide resolved
@@ -134,6 +135,9 @@ Jobs
cannot be scheduled because the lack of computation
cluster resources (memory, CPU/GPU etc).

:param float life_span: job run-time limit in the format '1d2h3m4s'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The '1d2h3m4s' is invalid float.
What is None for? I assume it is for server default, please correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None to disable. I'd actually leave the default API value for 1d, not for None as it is now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not enforce the default on server side?
It is more reliable.
I feel potential problems if we want to change 1day to another default eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long time ago, when I was implementing this feature on the backend, my very first version enforced default server-side as you suggest. However, after several review we agreed to pass default value by the client so that:

  1. no problems with DB migration, or if we decide not to write the default value to the database, then no implicit logic;
  2. much less code
  3. not over-complicated protocol (0 means limit=0)
  4. easier to change the default value per-user: no need to put it to the database.

docs/jobs_reference.rst Outdated Show resolved Hide resolved
@atemate atemate requested a review from asvetlov February 21, 2020 17:26
Copy link
Contributor

@asvetlov asvetlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last neats.

return seconds


async def calc_default_life_span(client: Client) -> timedelta:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline the function in calc_life_span() please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? it's harder to test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is never used alone, it is always called from calc_life_span().
You can test calc_life_span(client, None) instead of calc_default_life_span(client), should prodice the same result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the matter of taste. changed.

seconds = delta.total_seconds()
if seconds == 0:
return None
assert seconds > 0, f"life-span cannot be negative, got: {seconds}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please raise ValueError instead of AssertionError here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's impossible situation, and we need this check only in case in order to protect ourselves from sending negative values to the server one day if we change something in future (for example, we are going to change toml parsers). So I cannot test this check properly. Here's what we can do:

  1. leave assertion
  2. change to ValueError but still keep it untested
  3. remove the check completely

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read assert expr, msg as something that you want to expose to a user.
If the situation is impossible I expect bare assert expr at least. Like for the sake of mypy correctness, we have assert var is not None very often without a custom provided error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was always thinking that assert expr is a bad practice since if anything ever goes wrong, the user will not get any idea on the reason, just AssertionError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to bare assert expr

Artem Yushkovskiy added 2 commits February 25, 2020 18:36
@atemate atemate merged commit 7808497 into master Feb 26, 2020
@atemate atemate deleted the ay/lifetime branch February 26, 2020 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants