-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update DiskIo telemetry device to persist the counters #721
Conversation
Ensure that DiskIo telemetry does not rely on Rally being a parent process of Elasticsearch and persists the disk counters at the beginning of a benchmark and can read it again afterwards. Relates to elastic#697
Accidentally clicked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything here looks rather sane to me, thanks! I haven't tested it with an external metric store yet but worked fine with the in-memory metrics store.
Update: I've also tested this PR (i.e. without #722) with an external metric store and seems to be alright.
I left some suggestions.
esrally/mechanic/telemetry.py
Outdated
read_bytes = 0 | ||
write_bytes = 0 | ||
io.ensure_dir(self.log_root) | ||
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This for example produces something like:
.rally/benchmarks/races/2019-07-04-15-05-34/rally-node-0/telemetry/defaults-rally-node-0.io
I don't think self.car.safe_name
(in this case defaults
) provides value in the filename, in which case you don't even need to pass car in the constructor. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I was trying to be consistent with what we do fo gcstats and jfr and other telemetry files. So not sure whether to go for simplicity or consistency. What's your preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contrary to other telemetry devices this one is not meant to be inspected by the user. Instead this is a temporary state file that should also be removed when the benchmark ends (i.e. in on_benchmark_stop
). Hence, I'd opt for simplicity here.
Also, we should maybe use a different name than log_file
here? It's not exactly a log file.
Finally, I think we should use os.path.join
instead of /
when creating paths.
esrally/mechanic/telemetry.py
Outdated
self.logger.warning("Process I/O counters are not supported on this platform. Falling back to less accurate disk " | ||
"I/O counters.") | ||
except RuntimeError: | ||
self.logger.exception("Could not determine I/O stats at benchmark start.") | ||
with open(log_file, "wt", encoding="utf-8") as f: | ||
diskio_str = "%d %d" % (read_bytes, write_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered writing to a dict and dumping it as json instead? What you have here works perfectly, but I wonder if standardizing to json for all structures we have in Rally makes sense.
So here we'd have {"read_bytes": read_bytes, "write_bytes": write_bytes}
and then on the next line it could be json.dump(diskio_str, f))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your json idea, I will change it to use that. Thanks!
esrally/mechanic/telemetry.py
Outdated
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name) | ||
io_str = "" | ||
with open(log_file, "rt", encoding="utf-8") as f: | ||
io_str = f.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my earlier suggestion whether we could json.load instead and use it as a dictionary without the need to split.
esrally/mechanic/telemetry.py
Outdated
write_bytes = process_end.write_bytes - self.process_start.write_bytes | ||
elif self.disk_start: | ||
if process_end: | ||
read_bytes = process_end.read_bytes - int(io_bytes[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The json approach, as a benefit, can ensure type safety in a type unsafe language and you can get rid of the int conversions.
esrally/mechanic/launcher.py
Outdated
@@ -295,7 +295,7 @@ def _start_node(self, node_configuration, node_count_on_host): | |||
enabled_devices = self.cfg.opts("mechanic", "telemetry.devices") | |||
telemetry_params = self.cfg.opts("mechanic", "telemetry.params") | |||
node_telemetry = [ | |||
telemetry.DiskIo(self.metrics_store, node_count_on_host), | |||
telemetry.DiskIo(self.metrics_store, node_count_on_host, node_telemetry_dir, car, node_name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my question below whether car
is actually needed.
@dliappis I wonder if I do need to do something about self.metrics_store as well. It's initialized in the launcher, but by the time stop/start is executed we don't have it anymore.. |
@ebadyano this needs to be looked together with #722; as per #697 (comment) the race-id will be provided to the start/stop commands and each should still initialize the metric stores used for recoding telemetry and other metric records so in theory if all this initialization is done in #722 the code here should work as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few suggestions.
esrally/mechanic/telemetry.py
Outdated
write_bytes = 0 | ||
io.ensure_dir(self.log_root) | ||
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name) | ||
console.info("%s: Writing start I/O stats to [%s]" % (self.human_name, log_file), logger=self.logger) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is only a temporary file that is necessary due to the way the start
and stop
subcommands will work, we should not inform the user about it.
esrally/mechanic/telemetry.py
Outdated
self.logger.warning("Process I/O counters are not supported on this platform. Falling back to less accurate disk " | ||
"I/O counters.") | ||
except RuntimeError: | ||
self.logger.exception("Could not determine I/O stats at benchmark start.") | ||
diskio_str = {"read_bytes": read_bytes, "write_bytes": write_bytes} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we avoid the _str
suffix? Also, I think we could just inline the dict
declaration and get rid of the variable completely.
esrally/mechanic/telemetry.py
Outdated
process_end = sysstats.process_io_counters(self.process) | ||
disk_end = sysstats.disk_io_counters() | ||
io.ensure_dir(self.log_root) | ||
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use os.path.join
instead of /
here?
esrally/mechanic/telemetry.py
Outdated
disk_end = sysstats.disk_io_counters() | ||
io.ensure_dir(self.log_root) | ||
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name) | ||
io_str = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
esrally/mechanic/telemetry.py
Outdated
write_bytes = process_end.write_bytes - self.process_start.write_bytes | ||
elif self.disk_start: | ||
if process_end: | ||
read_bytes = process_end.read_bytes - io_bytes['read_bytes'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use double quotes for string constants by default (there are cases where it makes sense to use single quotes, e.g. when there are a lot of double quotes used in the string constant, it is easier to use single quotes but here this is not the case).
esrally/mechanic/telemetry.py
Outdated
@@ -776,17 +777,21 @@ def _store_merge_times(self, merge_times): | |||
|
|||
|
|||
class DiskIo(InternalTelemetryDevice): | |||
human_name = "Disk IO" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we get rid of the console message (see my comment below), this will be unused and we can remove it as well.
@danielmitterdorfer Thank you for the review, I addressed your comments. Could you please take another look? Thanks! |
With this commit we remove support in Rally to benchmark Elasticsearch 1.x. Relates elastic#715 Relates elastic#716
With this commit we add the JVM flag `ExitOnOutOfMemoryError` unconditionally when Elasticsearch is configured by Rally. Previously we had to check whether the JVM in use supports this flag because we supported Elasticsearch 1.x which can be run with Java 1.7. As the JVM flag has only been introduced with Java 8, we had a check in place. Now that we have dropped support for Elasticsearch 1.x (in elastic#716), we can safely assume that the JVM supports this flag and unconditionally set it. Relates elastic#715 Relates elastic#723
To check the complete list of exposed track parameters in an integration test we rely on parameters in rally-tracks which can change over time[1]. Be more lenient by just checking the unused track parameters. Relates elastic#726 [1] elastic/rally-tracks@4080dc9
With this commit we allow (again) to set the distribution version as a command line parameter even in cases where it does not seem to be useful initially, like the `benchmark-only` pipeline. However, if Rally fails to determine the cluster version (for whatever reason) it can be useful to be able to override this on the command line. Similarly, it can be useful to override this when benchmarking a source build that is *not* based on master but rather of an older branch. In these cases, Rally would still take the right decisions with some guidance by the user. Relates elastic#728
With this commit we remove support for Elasticsearch 1.x when retrieving cluster metadata. This is a leftover from elastic#716 Relates elastic#716 Relates elastic#729
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of comments.
@@ -17,6 +17,8 @@ | |||
|
|||
import random | |||
import collections | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused import
disk_io_counters.side_effect = [process_start, process_stop] | ||
process_io_counters.side_effect = [None, None] | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: two newlines instead of one.
@mock.patch("esrally.metrics.EsMetricsStore.put_count_node_level") | ||
def test_diskio_process_io_counters(self, metrics_store_node_count, process_io_counters): | ||
Diskio = namedtuple("Diskio", "read_bytes write_bytes") | ||
process_start = Diskio(10,10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing space after the comma.
def test_diskio_process_io_counters(self, metrics_store_node_count, process_io_counters): | ||
Diskio = namedtuple("Diskio", "read_bytes write_bytes") | ||
process_start = Diskio(10,10) | ||
process_stop = Diskio(11,11) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing space after the comma.
@mock.patch("esrally.metrics.EsMetricsStore.put_count_node_level") | ||
def test_diskio_disk_io_counters(self, metrics_store_node_count, process_io_counters, disk_io_counters): | ||
Diskio = namedtuple("Diskio", "read_bytes write_bytes") | ||
process_start = Diskio(10,10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing space after the comma.
cfg = create_config() | ||
metrics_store = metrics.EsMetricsStore(cfg) | ||
|
||
device = telemetry.DiskIo(metrics_store, 1, tmp_dir, "rally0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in tests we should prefer named parameters for arguments, e.g. I had to look up what the 1
refers to.
cfg = create_config() | ||
metrics_store = metrics.EsMetricsStore(cfg) | ||
|
||
device = telemetry.DiskIo(metrics_store, 2, tmp_dir, "rally0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in tests we should prefer named parameters for arguments.
t.detach_from_node(node, running=False) | ||
|
||
metrics_store_node_count.assert_has_calls([ | ||
mock.call("rally0", "disk_io_write_bytes", 1, "byte"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand why we expect 1 byte here although the diff is 3 bytes (it's due to the fact that there are two nodes on that machine). How about a short comment? Wdyt?
node = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t) | ||
t.attach_to_node(node) | ||
t.on_benchmark_start() | ||
t.on_benchmark_stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably "simulate" that file-based persistence works by creating a new instance here (with the same constructor parameters)?
node = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t) | ||
t.attach_to_node(node) | ||
t.on_benchmark_start() | ||
t.on_benchmark_stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably "simulate" that file-based persistence works by creating a new instance here (with the same constructor parameters).
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.
Relates to #697