
Provision, start, stop subcommands #722

Closed · drawlerr wants to merge 39 commits

Conversation

@drawlerr (Contributor) commented Jul 4, 2019

Implement provision, start and stop subcommands for independently managing the benchmarking candidate.

Relates #697
Closes #733

@dliappis (Contributor) commented Jul 8, 2019

@drawlerr I took an initial pass and have the following top-level comments:

  1. The stop command doesn't seem to wipe away unneeded files (see #697 (comment), #stop)

    e.g. after ./rally stop --trial-id 9cbdf669-d2c4-424e-8a55-df4ee1d17ce0

    I still see:

    $ du -sh ~/.rally/benchmarks/races/9cbdf669-d2c4-424e-8a55-df4ee1d17ce0/rally-node-0/*
    0B	/Users/dl/.rally/benchmarks/races/9cbdf669-d2c4-424e-8a55-df4ee1d17ce0/rally-node-0/heapdump
    423M	/Users/dl/.rally/benchmarks/races/9cbdf669-d2c4-424e-8a55-df4ee1d17ce0/rally-node-0/install
    48K	/Users/dl/.rally/benchmarks/races/9cbdf669-d2c4-424e-8a55-df4ee1d17ce0/rally-node-0/logs
    
  2. Does --node-ip get used for all_node_ips?

    To set up a cluster we need all_node_ips to contain all IP addresses, as it's used by the elasticsearch.yml template for discovery.

  3. I know this is what #697 specifies and this PR implements it well, but the trial-id needs to be collected by some orchestrator (shell script, Ansible play, ...) and currently it's printed between the Rally banner and the success message:

    $ ./rally provision --node-ip 127.0.0.1 --docker --distribution-version=7.2.0
    
        ____        ____
       / __ \____ _/ / /_  __
      / /_/ / __ `/ / / / / /
     / _, _/ /_/ / / / /_/ /
    /_/ |_|\__,_/_/_/\__, /
                    /____/
    
    c9a4f171-48b4-48ad-8f4e-28c9ad262922
    
    -------------------------------
    [INFO] SUCCESS (took 1 seconds)
    -------------------------------
    

    So the process invoking provision will have a hard time matching the id; should this be prefixed with something like trial-id: ...? I was thinking of a simple JSON structure, but then again the orchestrator could have trouble skipping the noise and matching the JSON structure across multiple lines.
    Alternatively we could skip printing the banner and the final info message, but perhaps it's an unnecessary complication. @danielmitterdorfer WDYT?

@dliappis (Contributor) left a comment

I left some initial comments.

@drawlerr (Contributor, Author) commented Jul 8, 2019

> @drawlerr I took an initial pass and have the following top-level comments:
>
>   1. The stop command doesn't seem to wipe away unneeded files (see #697 (comment), #stop)
>      e.g. after ./rally stop --trial-id 9cbdf669-d2c4-424e-8a55-df4ee1d17ce0
>      I still see: (the du output shown above)

I didn't realize this was a requirement of the "stop" subcommand. Sounds maybe like a separate "cleanup" command?
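
For illustration, a minimal sketch of what such a cleanup might remove, based on the directory layout in the du output above; the function name, flag, and directory handling are hypothetical, not Rally's actual API:

```
import os
import shutil

# Hypothetical cleanup: delete the heavyweight per-node directories left
# behind after "stop", keeping logs unless told otherwise.
def cleanup_trial(trial_id, root="~/.rally/benchmarks/races", keep_logs=True):
    trial_dir = os.path.join(os.path.expanduser(root), trial_id)
    for node_dir in os.listdir(trial_dir):  # e.g. rally-node-0
        node_path = os.path.join(trial_dir, node_dir)
        for subdir in ["heapdump", "install"] + ([] if keep_logs else ["logs"]):
            shutil.rmtree(os.path.join(node_path, subdir), ignore_errors=True)
```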

>   2. Does --node-ip get used for all_node_ips?
>      To set up a cluster we need all_node_ips to contain all IP addresses, as it's used by the elasticsearch.yml template for discovery.

It does, but it should probably be a list instead of a single element. I'll fix it.
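
For reference, a minimal sketch of accepting multiple addresses, assuming an argparse-based CLI; the plural option name --node-ips and the comma-separated format are illustrative, not necessarily the PR's final interface:

```
import argparse

parser = argparse.ArgumentParser()
# Split a comma-separated value into a list so all_node_ips can carry
# every address needed by the elasticsearch.yml discovery template.
parser.add_argument("--node-ips", type=lambda s: s.split(","), default=[],
                    help="comma-separated IP addresses of all cluster nodes")

args = parser.parse_args(["--node-ips", "192.168.14.3,192.168.14.4"])
print(args.node_ips)  # ['192.168.14.3', '192.168.14.4']
```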

>   3. I know this is what #697 specifies and this PR implements it well, but the trial-id needs to be collected by some orchestrator (shell script, Ansible play, ...) and currently it's printed between the Rally banner and the success message:
>
>      (the provision console output shown above)
>
>      So the process invoking provision will have a hard time matching the id; should this be prefixed with something like trial-id: ...? I was thinking of a simple JSON structure, but then again the orchestrator could have trouble skipping the noise and matching the JSON structure across multiple lines.
>      Alternatively we could skip printing the banner and the final info message, but perhaps it's an unnecessary complication. @danielmitterdorfer WDYT?

This is tricky. I used quiet mode (--quiet) to suppress all output except for the "forced" output of the UUID. (see integration-test.sh for example usage)
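
For example, an orchestrator could capture the id roughly like this (a Python sketch assuming --quiet leaves the UUID as the only line on stdout, as in integration-test.sh; requires Python 3.7+ for capture_output):

```
import subprocess
import uuid

# Run provision quietly and treat the single stdout line as the trial id.
result = subprocess.run(
    ["./rally", "provision", "--quiet", "--node-ip", "127.0.0.1",
     "--distribution-version=7.2.0"],
    capture_output=True, text=True, check=True)
trial_id = str(uuid.UUID(result.stdout.strip()))  # ValueError if not a bare UUID
print(trial_id)
```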

@dliappis (Contributor) commented Jul 9, 2019

Unrelated, but mentioning it since this PR touches provisioner_test.py: I observed that most of the unit tests under BareProvisionerTests (e.g. test_prepare_without_plugins), while succeeding in their asserts, have an underlying failure with subprocess.Popen() invoked from process#run_subprocess_with_logging. It manifests e.g. if you run the unit test from the IDE:

Ran 1 test in 0.007s

OK
Could not execute command.
Traceback (most recent call last):
  File "/Users/dl/source/elastic/rally/esrally/utils/process.py", line 52, in exit_status_as_bool
    return_code = runnable()
  File "/Users/dl/source/elastic/rally/esrally/utils/jvm.py", line 33, in <lambda>
    return process.exit_status_as_bool(lambda: process.run_subprocess_with_logging("%s/bin/java %s -version" % (java_home, option)))
  File "/Users/dl/source/elastic/rally/esrally/utils/process.py", line 76, in run_subprocess_with_logging
    with subprocess.Popen(command_line_args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=env) as command_line_process:
  File "/Users/dl/.pyenv/versions/3.6.5/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/Users/dl/.pyenv/versions/3.6.5/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/javas/java8/bin/java': '/usr/local/javas/java8/bin/java'

Process finished with exit code 0

This can be fixed by patching the aforementioned function, e.g. with @mock.patch("esrally.utils.process.run_subprocess_with_logging"), as sketched below.
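
A minimal sketch of that fix, assuming the test class and method named above; the patched return value fakes a successful java -version probe:

```
from unittest import TestCase, mock

class BareProvisionerTests(TestCase):
    # Patching run_subprocess_with_logging keeps the test from invoking a
    # real java binary from a hardcoded local path.
    @mock.patch("esrally.utils.process.run_subprocess_with_logging")
    def test_prepare_without_plugins(self, run_subprocess_with_logging):
        run_subprocess_with_logging.return_value = 0
        # ... existing assertions unchanged ...
```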

@dliappis (Contributor) left a comment

I took another pass and left a few more comments re: use of --node-ips and how it translates to all_node_ips.

Resolved review threads: tests/mechanic/provisioner_test.py, esrally/mechanic/mechanic.py
"list": dispatch_list,
"race": race,
"download": mechanic.download,
"provision": mechanic.provision,
A Contributor left a comment:

I think we want to document the provision and start / stop commands either in this PR or in a separate PR. @danielmitterdorfer WDYT?

A Member replied:

PRs should be self-contained, thus I'd opt for documenting it in this PR.

@danielmitterdorfer (Member) commented:

> I think we want to document the provision and start / stop commands either in this PR or in a separate PR. @danielmitterdorfer WDYT?

Docs should always go along with the code changes, i.e. they should be included in this PR not in a separate one.

@danielmitterdorfer (Member) left a comment

I did a first pass and left some comments and suggestions. On a high level I think the following needs to be addressed:

  • Handling of telemetry devices
  • Only allowing to provision one node

I think we also need an option to clean up an installation on termination (i.e. in the stop subcommand). IMHO this does not need to be a separate subcommand.

I also spotted a lot of unrelated changes (like reformattings; the PR also seems to be based on a yet-unmerged PR) which made it harder to review, and I suggest that we move such changes to dedicated PRs in the future.

Resolved review threads: docs/migrate.rst, tests/metrics_test.py, integration-test.sh, esrally/reporter.py
"list": dispatch_list,
"race": race,
"download": mechanic.download,
"provision": mechanic.provision,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRs should be self-contained thus I'd opt for documenting it in this PR.

Resolved review threads: esrally/mechanic/telemetry.py (two threads), esrally/rally.py (two threads)
In the argument parser setup:

    action="store_true",
    default=False
    )
    for p in [start_parser, stop_parser]:
A Member left a comment:

This is missing support to enable telemetry devices.
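
For illustration, one way this could be wired up, assuming the for loop in the hunk above and an argparse-based parser; the option name mirrors the race subcommand's --telemetry flag, and the help text is illustrative:

```
import argparse

subparsers = argparse.ArgumentParser().add_subparsers()
start_parser = subparsers.add_parser("start")
stop_parser = subparsers.add_parser("stop")

# Hypothetical sketch: expose telemetry devices on both subcommands.
for p in [start_parser, stop_parser]:
    p.add_argument("--telemetry", default="",
                   help="enable the given telemetry devices (comma-separated list)")
```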

@danielmitterdorfer (Member) commented:

Thanks so much for working on this PR! As we've discussed offline earlier, there are too many changes that we need to make before addressing this properly (mainly around gathering metrics properly with the new subcommand structure) and we should tackle them first in separate PRs. When we've finished this work, we can revisit this change again. It is very likely that this will happen in a separate PR but we will draw from your work here and the knowledge you've acquired. As such, I'm going to close this one unmerged as it is likely that this code as it is now will not be merged but rather be used as reference later on.

Development

Successfully merging this pull request may close these issues.

Missing attach_to_node call for telemetry devices in DockerLauncher