Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DiskIo telemetry device to persist the counters #721

Closed
wants to merge 12 commits into from

Conversation

ebadyano
Copy link
Contributor

@ebadyano ebadyano commented Jul 3, 2019

Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to #697

Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
@ebadyano ebadyano requested a review from dliappis July 3, 2019 16:13
@ebadyano ebadyano changed the title Update DiskIo telemetry device to persist the counters WIP: Update DiskIo telemetry device to persist the counters Jul 3, 2019
@ebadyano
Copy link
Contributor Author

ebadyano commented Jul 3, 2019

Accidentally clicked create PR before marking it a draft PR, so adding WIP to the title.

@ebadyano ebadyano added enhancement Improves the status quo :Telemetry Telemetry Devices that gather additional metrics labels Jul 3, 2019
@ebadyano ebadyano added this to the 1.3.0 milestone Jul 3, 2019
@ebadyano ebadyano added the :misc Changes that don't affect users directly: linter fixes, test improvements, etc. label Jul 3, 2019
Copy link
Contributor

@dliappis dliappis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything here looks rather sane to me, thanks! I haven't tested it with an external metric store yet but worked fine with the in-memory metrics store.

Update: I've also tested this PR (i.e. without #722) with an external metric store and seems to be alright.

I left some suggestions.

read_bytes = 0
write_bytes = 0
io.ensure_dir(self.log_root)
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for example produces something like:

.rally/benchmarks/races/2019-07-04-15-05-34/rally-node-0/telemetry/defaults-rally-node-0.io

I don't think self.car.safe_name (in this case defaults) provides value in the filename, in which case you don't even need to pass car in the constructor. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I was trying to be consistent with what we do fo gcstats and jfr and other telemetry files. So not sure whether to go for simplicity or consistency. What's your preference?

Copy link
Member

@danielmitterdorfer danielmitterdorfer Jul 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to other telemetry devices this one is not meant to be inspected by the user. Instead this is a temporary state file that should also be removed when the benchmark ends (i.e. in on_benchmark_stop). Hence, I'd opt for simplicity here.

Also, we should maybe use a different name than log_file here? It's not exactly a log file.

Finally, I think we should use os.path.join instead of / when creating paths.

self.logger.warning("Process I/O counters are not supported on this platform. Falling back to less accurate disk "
"I/O counters.")
except RuntimeError:
self.logger.exception("Could not determine I/O stats at benchmark start.")
with open(log_file, "wt", encoding="utf-8") as f:
diskio_str = "%d %d" % (read_bytes, write_bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered writing to a dict and dumping it as json instead? What you have here works perfectly, but I wonder if standardizing to json for all structures we have in Rally makes sense.

So here we'd have {"read_bytes": read_bytes, "write_bytes": write_bytes} and then on the next line it could be json.dump(diskio_str, f))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your json idea, I will change it to use that. Thanks!

log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name)
io_str = ""
with open(log_file, "rt", encoding="utf-8") as f:
io_str = f.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my earlier suggestion whether we could json.load instead and use it as a dictionary without the need to split.

write_bytes = process_end.write_bytes - self.process_start.write_bytes
elif self.disk_start:
if process_end:
read_bytes = process_end.read_bytes - int(io_bytes[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The json approach, as a benefit, can ensure type safety in a type unsafe language and you can get rid of the int conversions.

@@ -295,7 +295,7 @@ def _start_node(self, node_configuration, node_count_on_host):
enabled_devices = self.cfg.opts("mechanic", "telemetry.devices")
telemetry_params = self.cfg.opts("mechanic", "telemetry.params")
node_telemetry = [
telemetry.DiskIo(self.metrics_store, node_count_on_host),
telemetry.DiskIo(self.metrics_store, node_count_on_host, node_telemetry_dir, car, node_name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my question below whether car is actually needed.

@ebadyano
Copy link
Contributor Author

ebadyano commented Jul 4, 2019

@dliappis I wonder if I do need to do something about self.metrics_store as well. It's initialized in the launcher, but by the time stop/start is executed we don't have it anymore..

@dliappis
Copy link
Contributor

dliappis commented Jul 5, 2019

@dliappis I wonder if I do need to do something about self.metrics_store as well. It's initialized in the launcher, but by the time stop/start is executed we don't have it anymore..

@ebadyano this needs to be looked together with #722; as per #697 (comment) the race-id will be provided to the start/stop commands and each should still initialize the metric stores used for recoding telemetry and other metric records so in theory if all this initialization is done in #722 the code here should work as is.

@ebadyano ebadyano changed the title WIP: Update DiskIo telemetry device to persist the counters Update DiskIo telemetry device to persist the counters Jul 9, 2019
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few suggestions.

write_bytes = 0
io.ensure_dir(self.log_root)
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name)
console.info("%s: Writing start I/O stats to [%s]" % (self.human_name, log_file), logger=self.logger)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is only a temporary file that is necessary due to the way the start and stop subcommands will work, we should not inform the user about it.

self.logger.warning("Process I/O counters are not supported on this platform. Falling back to less accurate disk "
"I/O counters.")
except RuntimeError:
self.logger.exception("Could not determine I/O stats at benchmark start.")
diskio_str = {"read_bytes": read_bytes, "write_bytes": write_bytes}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid the _str suffix? Also, I think we could just inline the dict declaration and get rid of the variable completely.

process_end = sysstats.process_io_counters(self.process)
disk_end = sysstats.disk_io_counters()
io.ensure_dir(self.log_root)
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use os.path.join instead of / here?

disk_end = sysstats.disk_io_counters()
io.ensure_dir(self.log_root)
log_file = "%s/%s-%s.io" % (self.log_root, self.car.safe_name, self.node.node_name)
io_str = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

write_bytes = process_end.write_bytes - self.process_start.write_bytes
elif self.disk_start:
if process_end:
read_bytes = process_end.read_bytes - io_bytes['read_bytes']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use double quotes for string constants by default (there are cases where it makes sense to use single quotes, e.g. when there are a lot of double quotes used in the string constant, it is easier to use single quotes but here this is not the case).

@@ -776,17 +777,21 @@ def _store_merge_times(self, merge_times):


class DiskIo(InternalTelemetryDevice):
human_name = "Disk IO"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we get rid of the console message (see my comment below), this will be unused and we can remove it as well.

esrally/mechanic/telemetry.py Outdated Show resolved Hide resolved
@ebadyano
Copy link
Contributor Author

@danielmitterdorfer Thank you for the review, I addressed your comments. Could you please take another look? Thanks!

danielmitterdorfer and others added 7 commits July 16, 2019 10:59
With this commit we remove support in Rally to benchmark Elasticsearch
1.x.

Relates elastic#715
Relates elastic#716
With this commit we add the JVM flag `ExitOnOutOfMemoryError`
unconditionally when Elasticsearch is configured by Rally. Previously we
had to check whether the JVM in use supports this flag because we
supported Elasticsearch 1.x which can be run with Java 1.7. As the JVM flag
has only been introduced with Java 8, we had a check in place. Now that
we have dropped support for Elasticsearch 1.x (in elastic#716), we can safely
assume that the JVM supports this flag and unconditionally set it.

Relates elastic#715
Relates elastic#723
To check the complete list of exposed track parameters in an integration
 test we rely on parameters in rally-tracks which can change over
 time[1].
Be more lenient by just checking the unused track parameters.

Relates elastic#726

[1] elastic/rally-tracks@4080dc9
With this commit we allow (again) to set the distribution version as a
command line parameter even in cases where it does not seem to be useful
initially, like the `benchmark-only` pipeline. However, if Rally fails
to determine the cluster version (for whatever reason) it can be useful
to be able to override this on the command line. Similarly, it can be
useful to override this when benchmarking a source build that is *not*
based on master but rather of an older branch. In these cases, Rally
would still take the right decisions with some guidance by the user.

Relates elastic#728
With this commit we remove support for Elasticsearch 1.x when retrieving
cluster metadata. This is a leftover from elastic#716

Relates elastic#716
Relates elastic#729
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of comments.

@@ -17,6 +17,8 @@

import random
import collections
import os
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import

disk_io_counters.side_effect = [process_start, process_stop]
process_io_counters.side_effect = [None, None]


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: two newlines instead of one.

@mock.patch("esrally.metrics.EsMetricsStore.put_count_node_level")
def test_diskio_process_io_counters(self, metrics_store_node_count, process_io_counters):
Diskio = namedtuple("Diskio", "read_bytes write_bytes")
process_start = Diskio(10,10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing space after the comma.

def test_diskio_process_io_counters(self, metrics_store_node_count, process_io_counters):
Diskio = namedtuple("Diskio", "read_bytes write_bytes")
process_start = Diskio(10,10)
process_stop = Diskio(11,11)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing space after the comma.

@mock.patch("esrally.metrics.EsMetricsStore.put_count_node_level")
def test_diskio_disk_io_counters(self, metrics_store_node_count, process_io_counters, disk_io_counters):
Diskio = namedtuple("Diskio", "read_bytes write_bytes")
process_start = Diskio(10,10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing space after the comma.

cfg = create_config()
metrics_store = metrics.EsMetricsStore(cfg)

device = telemetry.DiskIo(metrics_store, 1, tmp_dir, "rally0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in tests we should prefer named parameters for arguments, e.g. I had to look up what the 1 refers to.

cfg = create_config()
metrics_store = metrics.EsMetricsStore(cfg)

device = telemetry.DiskIo(metrics_store, 2, tmp_dir, "rally0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in tests we should prefer named parameters for arguments.

t.detach_from_node(node, running=False)

metrics_store_node_count.assert_has_calls([
mock.call("rally0", "disk_io_write_bytes", 1, "byte"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand why we expect 1 byte here although the diff is 3 bytes (it's due to the fact that there are two nodes on that machine). How about a short comment? Wdyt?

node = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)
t.attach_to_node(node)
t.on_benchmark_start()
t.on_benchmark_stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably "simulate" that file-based persistence works by creating a new instance here (with the same constructor parameters)?

node = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)
t.attach_to_node(node)
t.on_benchmark_start()
t.on_benchmark_stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably "simulate" that file-based persistence works by creating a new instance here (with the same constructor parameters).

@ebadyano ebadyano closed this Jul 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo :misc Changes that don't affect users directly: linter fixes, test improvements, etc. :Telemetry Telemetry Devices that gather additional metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants