Cirrus: Collect runner.sh stats #8107

cevich · 2020-10-22T12:35:53Z

On several occasions, there have been questions about CPU/Memory/IO trends in testing over time. Start collecting this data for all jobs, using a common/stable format so that trending analysis can be performed within/across multiple Cirrus-CI builds. This PR doesn't add any related tooling, it simply arranges for the collection of the data.

cevich · 2020-11-04T19:30:04Z

@edsantiago not urgent, but PTAL when you have a second. I'm slightly concerned about burying your logformatter URL above the `time --verbose` output. Unfortunately, `time` needs to run at a high level, since runner.sh calls itself as a user and inside a container for several tests. Is this worth worrying about?

If so, I was thinking of using the `--output=file` option and teaching logcollector.sh how to scoop it up.
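For readers unfamiliar with the two approaches being weighed here, the following is a minimal sketch of wrapping a command in GNU `time` so that the report goes to a file rather than to stderr. The function name `run_with_stats` is hypothetical and is not taken from this PR; `STATS_LOGFILE` is the variable name mentioned later in the review, and the fallback branch is an assumption made so that the sketch also runs where no stats file is configured.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): wrap a command in GNU time when
# STATS_LOGFILE is set, otherwise run the command unchanged.
run_with_stats() {
    if [[ -n "${STATS_LOGFILE:-}" && -x /usr/bin/time ]]; then
        # --verbose prints the full resource report; --output writes it to
        # a file instead of stderr, keeping it out of the test log.
        /usr/bin/time --verbose --output="$STATS_LOGFILE" "$@"
    else
        # No stats file requested (e.g. outside CI): just run the command.
        "$@"
    fi
}

run_with_stats echo "hello from the wrapped command"
```

Because the wrapper sits in front of the top-level command, the report covers everything that command does in its own process tree, which is the reason given above for running `time` at a high level.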

edsantiago · 2020-11-04T19:58:53Z

It's not clear to me what problem this is trying to solve. ISTR it had something to do with the Cirrus agent timeouts, but didn't you solve that by keeping the VMs around and looking at console logs? In what context will these timings be useful, and how?

(Please do not answer inline. If you still see a real use for this, please answer in the commit message, so a future maintainer can understand this)

cevich · 2020-11-05T14:10:15Z

> It's not clear to me what problem this is trying to solve.

A valid point, I should probably spell that out :smile:

It's true, resource use was originally suspected in the agent-stopped-responding problem. However, there have been other cases where resource use (especially memory) has been a concern: none of these VMs have swap, by design. So this is mainly about retaining a long-term record of CPU/Memory/Disk usage trends for high-level analysis. That's why it doesn't matter whether the data lands in the log output or in an artifact file, so long as it is consistent going forward.

cevich · 2020-11-05T14:14:08Z

PR description updated.

edsantiago · 2020-11-05T14:40:14Z

I would've preferred the actual git commit message (humans run git log, they seldom visit the PR web page) but that doesn't seem worth the suffering of going through CI again. What do you need from me? It LGTM, but the PR is marked Draft/WIP.

edsantiago · 2020-11-05T14:41:09Z

OMG this just completely changed from a 1-liner to something super complicated. I withdraw my LGTM. Will take another look later.

cevich · 2020-11-05T14:46:31Z

> I withdraw my LGTM. Will take another look later.

Yes please. I decided to go the "store into file" route, since it doesn't bury your logformatter URL and (in theory) an artifact file should be easier to post-process if it's separate from other content.
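As an illustration of why a standalone file is easy to post-process, here is a rough sketch that pulls a single field out of a `time --verbose` report. The helper name `stats_field` is hypothetical, and the sample values are made up purely for demonstration; only the field labels follow the GNU `time` verbose format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the value of one field
# from a `time --verbose` report. Each report line has the form
# "<TAB>Label: value", so match the label literally and print the rest.
stats_field() {
    awk -v label="$1: " '
        { pos = index($0, label) }
        pos > 0 { print substr($0, pos + length(label)); exit }
    ' "$2"
}

# Made-up sample report, for illustration only.
sample=$(mktemp)
printf '\tUser time (seconds): 12.34\n' > "$sample"
printf '\tMaximum resident set size (kbytes): 204800\n' >> "$sample"

stats_field "Maximum resident set size (kbytes)" "$sample"   # prints 204800
rm -f "$sample"
```

Using `index()` rather than a regular expression avoids having to escape the parentheses that appear in several of the labels.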

cevich · 2020-11-05T14:50:31Z

> I would've preferred the actual git commit message

I had thought the PR description becomes part of the merge-commit message, but looking back, it seems only the PR title is included 😞 Since I'm re-running tests anyway, I'll update the commit message as well...

edsantiago · 2020-11-05T21:39:01Z

Well, this is an interesting monkey wrench! time, with the --output option, uses fd3:

$ bash -c '/usr/bin/time --verbose --output=/tmp/foo.log ls -l /proc/self/fd'
total 0
lrwx------. 1 esm esm 64 Nov  5 14:23 0 -> /dev/pts/11
lrwx------. 1 esm esm 64 Nov  5 14:23 1 -> /dev/pts/11
lrwx------. 1 esm esm 64 Nov  5 14:23 2 -> /dev/pts/11
l-wx------. 1 esm esm 64 Nov  5 14:23 3 -> /tmp/foo.log

The tests, though, expect a normal situation in which only stdin/stdout/stderr are open.

I think the easiest, sanest way to work around this would be to add exec 3<&- to the top of runner.sh, with a HUGE comment explaining what it does and why.
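A small sketch of the suggested workaround, for illustration: it simulates a parent that leaves FD 3 open, then closes it with `exec 3<&-` and checks the state before and after. The helper `fd3_is_open` and the use of `/dev/null` as a stand-in for the stats file are assumptions made for this demonstration, not code from runner.sh.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds only if FD 3 is currently open for writing.
fd3_is_open() {
    { true >&3; } 2>/dev/null
}

# Simulate inheriting an open FD 3, as happens under `time --output=...`.
exec 3>/dev/null
if fd3_is_open; then before=open; else before=closed; fi

# The workaround: close FD 3 for this shell and every child it spawns later.
exec 3<&-
if fd3_is_open; then after=open; else after=closed; fi

echo "before: $before, after: $after"   # prints "before: open, after: closed"
```

Since child processes inherit the shell's open descriptors, closing FD 3 once near the top of the script means everything launched afterwards starts with only stdin, stdout, and stderr.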

cevich · 2020-11-06T20:50:31Z

> Well, this is an interesting monkey wrench!

I'm more than happy to have provided at least some of the entertainment 🤣 Wow, what a good catch!

So IIUC: Some tests/scripts/processes assume (e.g.) /proc/self/fd/3 is available, when in fact /usr/bin/time has it open. This causes said tests/scripts/processes to tuck-tail-twixt-legs and run for the hills?

Geeze...well ya, closing the FD in the script, with a comment, is going to be far easier than filing a "bug" against /usr/bin/time 😃 (assuming the necessary convincing is even attainable)

TBH, I don't think fixing this in runner.sh is such a bad thing...literally everything executes through that path (which is also how we get timing stats on everything).

contrib/cirrus/runner.sh

cevich · 2020-11-09T16:41:58Z

Stats-wise, using `time` in front of the ssh client is basically useless. This makes sense, as a new session (process group) is created. All the other cases seem to look okay, including the "inside a container" case.

cevich · 2020-11-09T17:15:31Z

@edsantiago I can understand the CPU/Memory being different for podman vs podman-remote, but does this seem reasonable/justifiable to you:

sys podman fedora-33 root host wall-clock time compared to sys podman fedora-33 root host.

In my mind, they should be virtually identical, no?

(just an observation)

edsantiago · 2020-11-09T18:46:00Z

> sys podman fedora-33 root host wall-clock time compared to sys podman fedora-33 root host.

(I'm making the assumption that you intended to write "remote" in the second link, not "podman").

Are you asking, is something wrong with time? That seems unlikely, since the results it reports match those listed in the main accordion tab by Cirrus.

Are you asking, why do podman-remote tests run more quickly than podman-local? I believe that's because podman-remote has a lot of skips. From a quick glance, I know that some of those tests (--help, podman login) take tens of seconds, so it seems plausible that that could explain some of the difference. I'm not sure I want to spend the effort to confirm test-by-test though.

If I misunderstood your question, I apologize, and could you please elaborate?

cevich · 2020-11-09T18:55:34Z

> (I'm making the assumption that you intended to write "remote" in the second link, not "podman").

Woops.

> I believe that's because podman-remote has a lot of skips. From a quick glance, I know that some of those tests (--help, podman login) take tens of seconds, so it seems plausible that that could explain some of the difference. I'm not sure I want to spend the effort to confirm test-by-test though.

That answers my query, thanks.

On several occasions, there have been questions about CPU/Memory/IO trends in testing over time. Start collecting this data for all jobs, using a common/stable format so that trending analysis can be performed within/across multiple Cirrus-CI builds. This PR doesn't add any related tooling, it simply arranges for the collection of the data.

Stats generation is done at the orchestration level to guarantee they reflect everything happening inside `runner.sh`. For example, the container-based tests re-exec `runner.sh` inside a container, but we're only interested in the top-level stats.

Update all tasks to include collection of the stats file. Unfortunately, due to the way the Cirrus-CI YAML parser works, it is *not* possible to alias the artifacts collection more clearly, for example:

```yaml
always:
    <<: *runner_stats
    <<: *logs_artifacts
```

Signed-off-by: Chris Evich <[email protected]>

edsantiago

LGTM

edsantiago · 2020-11-09T22:09:43Z

contrib/cirrus/runner.sh

+# operations depend on making use of FD3, and so it must be explicitly
+# closed here (and for all further child-processes).
+# STATS_LOGFILE assumed empty/undefined outside of Cirrus-CI (.cirrus.yml)
+# shellcheck disable=SC2154


You probably don't need this shellcheck directive any more, but it's not worth re-pushing for this.

woops. I'll kill it next time I come across it.

openshift-ci-robot · 2020-11-09T22:10:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [edsantiago]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

edsantiago · 2020-11-12T15:38:08Z

Sorry for late response, this was churning in CI when I last looked at it.

/lgtm
/hold

rhatdan · 2020-11-12T19:07:13Z

/hold cancel

cevich requested a review from edsantiago October 22, 2020 12:35

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2020

cevich force-pushed the measure_testing_stats branch from 7ea631e to 90ff3a6 Compare October 22, 2020 14:31

cevich mentioned this pull request Oct 22, 2020

Enlarge fedora VM starting disk containers/automation_images#30

Merged

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2020

cevich force-pushed the measure_testing_stats branch from 90ff3a6 to 63f7460 Compare October 23, 2020 18:22

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2020

cevich force-pushed the measure_testing_stats branch from 63f7460 to 4eb1af8 Compare October 26, 2020 15:42

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 1, 2020

cevich force-pushed the measure_testing_stats branch from 4eb1af8 to 0cb2cb4 Compare November 4, 2020 19:21

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2020

cevich force-pushed the measure_testing_stats branch from 0cb2cb4 to 2fb0b3a Compare November 5, 2020 14:40

cevich force-pushed the measure_testing_stats branch from 2fb0b3a to 15a0e82 Compare November 5, 2020 14:44

cevich force-pushed the measure_testing_stats branch 3 times, most recently from 062aca2 to 6da39a1 Compare November 5, 2020 15:03

cevich marked this pull request as ready for review November 5, 2020 15:05

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2020

cevich force-pushed the measure_testing_stats branch 2 times, most recently from ba0321c to 63bb6b5 Compare November 6, 2020 21:06

cevich commented Nov 6, 2020

cevich force-pushed the measure_testing_stats branch 2 times, most recently from dcaed3f to fb59c41 Compare November 6, 2020 21:09

edsantiago reviewed Nov 7, 2020

cevich force-pushed the measure_testing_stats branch from fb59c41 to 525bf8d Compare November 9, 2020 15:49

cevich force-pushed the measure_testing_stats branch 3 times, most recently from 451722f to 9315816 Compare November 9, 2020 19:28

cevich force-pushed the measure_testing_stats branch from 9315816 to f44af20 Compare November 9, 2020 19:32

edsantiago approved these changes Nov 9, 2020

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2020

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020

openshift-ci-robot assigned edsantiago Nov 12, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 12, 2020

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020

openshift-merge-robot merged commit 2e9c942 into containers:master Nov 12, 2020

cevich deleted the measure_testing_stats branch June 30, 2021 18:12

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
