Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cirrus: Collect runner.sh stats #8107

Merged

Conversation

cevich
Copy link
Member

@cevich cevich commented Oct 22, 2020

On several occasions, there have been questions about CPU/Memory/IO trends in testing over time. Start collecting this data for all jobs, using a common/stable format so that trending analysis can be performed within/across multiple Cirrus-CI builds. This PR doesn't add any related tooling, it simply arranges for the collection of the data.

@cevich cevich requested a review from edsantiago October 22, 2020 12:35
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2020
@cevich cevich force-pushed the measure_testing_stats branch from 7ea631e to 90ff3a6 Compare October 22, 2020 14:31
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2020
@cevich cevich force-pushed the measure_testing_stats branch from 90ff3a6 to 63f7460 Compare October 23, 2020 18:22
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2020
@cevich cevich force-pushed the measure_testing_stats branch from 63f7460 to 4eb1af8 Compare October 26, 2020 15:42
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 1, 2020
@cevich cevich force-pushed the measure_testing_stats branch from 4eb1af8 to 0cb2cb4 Compare November 4, 2020 19:21
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2020
@cevich
Copy link
Member Author

cevich commented Nov 4, 2020

@edsantiago not urgent, but PTAL when you have a second. I'm slightly concerned about burying your logformatter URL above the time --verbose output. Unf. time needs to run at a high-level, since runner.sh calls itself as a user and inside a container for several tests. Is this worth worrying about?

If so, I was thinking to use the --output=file option, and teach logcollector.sh how to scoop it up.

@edsantiago
Copy link
Member

It's not clear to me what problem this is trying to solve. ISTR it had something to do with the Cirrus agent timeouts, but didn't you solve that by keeping the VMs around and looking at console logs? In what context will these timings be useful, and how?

(Please do not answer inline. If you still see a real use for this, please answer in the commit message, so a future maintainer can understand this)

@cevich
Copy link
Member Author

cevich commented Nov 5, 2020

It's not clear to me what problem this is trying to solve.

A valid point, I should probably spell that out:smile:

It's true, resource use was originally suspected in the agent-stopped-responding problem. However, there have been other cases where (especially memory usage) has been a concern - None of these VMs have swap, by design. So this is mainly about retaining a record of CPU/Memory/Disk usage trends long-term, for high-level analysis. That's why having it in the log output or in an artifact file doesn't matter...so long as it is consistent going forward.

@cevich
Copy link
Member Author

cevich commented Nov 5, 2020

PR description updated.

@cevich cevich force-pushed the measure_testing_stats branch from 0cb2cb4 to 2fb0b3a Compare November 5, 2020 14:40
@edsantiago
Copy link
Member

I would've preferred the actual git commit message (humans run git log, they seldom visit the PR web page) but that doesn't seem worth the suffering of going through CI again. What do you need from me? It LGTM, but the PR is marked Draft/WIP.

@edsantiago
Copy link
Member

OMG this just completely changed from a 1-liner to something super complicated. I withdraw my LGTM. Will take another look later.

@cevich cevich force-pushed the measure_testing_stats branch from 2fb0b3a to 15a0e82 Compare November 5, 2020 14:44
@cevich
Copy link
Member Author

cevich commented Nov 5, 2020

I withdraw my LGTM. Will take another look later.

Yes please. I decided to go the "store into file" route, since it doesn't bury your logformatter URL and (in theory) an artifact file should be easier to post-process if it's separates from other content.

@cevich
Copy link
Member Author

cevich commented Nov 5, 2020

I would've preferred the actual git commit message

I had thought the PR description becomes part of the merge-commit message, but looking back, it seems only the PR title is included 😞 Since I'm re-running tests anyway, I'll update the commit message as well...

@cevich cevich force-pushed the measure_testing_stats branch 3 times, most recently from 062aca2 to 6da39a1 Compare November 5, 2020 15:03
@cevich cevich marked this pull request as ready for review November 5, 2020 15:05
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2020
@edsantiago
Copy link
Member

Well, this is an interesting monkey wrench! time, with the --output option, uses fd3:

$ bash -c '/usr/bin/time --verbose --output=/tmp/foo.log ls -l /proc/self/fd'                                                                         -
total 0
lrwx------. 1 esm esm 64 Nov  5 14:23 0 -> /dev/pts/11
lrwx------. 1 esm esm 64 Nov  5 14:23 1 -> /dev/pts/11
lrwx------. 1 esm esm 64 Nov  5 14:23 2 -> /dev/pts/11
l-wx------. 1 esm esm 64 Nov  5 14:23 3 -> /tmp/foo.log

The tests, though, expect a normal situation in which only stdin/stdout/stderr are open.

I think the easiest, sanest way to work around this would be to add exec 3<&- to the top of runner.sh, with a HUGE comment explaining what it does and why.

@cevich
Copy link
Member Author

cevich commented Nov 6, 2020

Well, this is an interesting monkey wrench!

I'm more than happy to have provided at least some of the entertainment 🤣 Wow, what a good catch!

So IIUC: Some tests/scripts/processes assume (4.e.g.) /proc/self/fd/3 is available where in fact /usr/bin/time has it open. This causes said tests/scripts/processes to tuck-tail-twix-legs and run for the hills?

Geeze...well ya, closing the FD in the script, with a comment, is going to be far easier than filing a "bug" against /usr/bin/time 😃 (assuming the necessary convincing is even attainable)

TBH, I don't think fixing this in runner.sh is such a bad thing...literally everything executes through that path (which is also how we get timing stats on everything).

@cevich cevich force-pushed the measure_testing_stats branch 2 times, most recently from ba0321c to 63bb6b5 Compare November 6, 2020 21:06
contrib/cirrus/runner.sh Outdated Show resolved Hide resolved
@cevich cevich force-pushed the measure_testing_stats branch 2 times, most recently from dcaed3f to fb59c41 Compare November 6, 2020 21:09
contrib/cirrus/runner.sh Outdated Show resolved Hide resolved
@cevich cevich force-pushed the measure_testing_stats branch from fb59c41 to 525bf8d Compare November 9, 2020 15:49
@cevich
Copy link
Member Author

cevich commented Nov 9, 2020

Stats-wise, using time in front of the ssh client is basically useless.. This makes sense as a new session (process group) is created. All the other cases seem to look okay, including the "inside a container" case:.

@cevich
Copy link
Member Author

cevich commented Nov 9, 2020

@edsantiago I can understand the CPU/Memory being different for podman vs podman-remote, but does this seem reasonable/justifiable to you:

sys podman fedora-33 root host wall-clock time compared to sys podman fedora-33 root host.

In my mind, they should be virtually identical, no?

(just an observation)

@edsantiago
Copy link
Member

sys podman fedora-33 root host wall-clock time compared to sys podman fedora-33 root host.

(I'm making the assumption that you intended to write "remote" in the second link, not "podman").

Are you asking, is something wrong with time? That seems unlikely, since the results it reports match those listed in the main accordion tab by Cirrus.

Are you asking, why do podman-remote tests run more quickly than podman-local? I believe that's because podman-remote has a lot of skips. From a quick glance, I know that some of those tests (--help, podman login) take tens of seconds, so it seems plausible that that could explain some of the difference. I'm not sure I want to spend the effort to confirm test-by-test though.

If I misunderstood your question, I apologize, and could you please elaborate?

@cevich
Copy link
Member Author

cevich commented Nov 9, 2020

(I'm making the assumption that you intended to write "remote" in the second link, not "podman").

Woops.

I believe that's because podman-remote has a lot of skips. From a quick glance, I know that some of those tests (--help, podman login) take tens of seconds, so it seems plausible that that could explain some of the difference. I'm not sure I want to spend the effort to confirm test-by-test though.

That answers my query, thanks.

@cevich cevich force-pushed the measure_testing_stats branch 3 times, most recently from 451722f to 9315816 Compare November 9, 2020 19:28
On several occasions, there have been questions about CPU/Memory/IO
trends in testing over time. Start collecting this data for all jobs,
using a common/stable format so that trending analysis can be performed
within/across multiple Cirrus-CI builds. This PR doesn't add any related
tooling, it simply arranges for the collection of the data.

Stats generation is done at the orchestration level to guarantee they
reflect everything happening inside `runner.sh`.  For example, the
container-based tests re-exec `runner.sh` inside a container, but
we're only interested in the top-level stats.

Update all tasks to include collection of the stats file.
Unfortunately, due to the way the Cirrus-CI YAML parser works,
it is *not* possible to alias the artifacts collection more clearly,
for example:

```yaml
    always:
        <<: *runner_stats
        <<: *logs_artifacts
```

Signed-off-by: Chris Evich <[email protected]>
@cevich cevich force-pushed the measure_testing_stats branch from 9315816 to f44af20 Compare November 9, 2020 19:32
Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

# operations depend on making use of FD3, and so it must be explicitly
# closed here (and for all further child-processes).
# STATS_LOGFILE assumed empty/undefined outside of Cirrus-CI (.cirrus.yml)
# shellcheck disable=SC2154
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need this shellcheck directive any more, but it's not worth re-pushing for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops. I'll kill it next time I come across it.

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2020
@edsantiago
Copy link
Member

Sorry for late response, this was churning in CI when I last looked at it.

/lgtm
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 12, 2020
@rhatdan
Copy link
Member

rhatdan commented Nov 12, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@openshift-merge-robot openshift-merge-robot merged commit 2e9c942 into containers:master Nov 12, 2020
@cevich cevich deleted the measure_testing_stats branch June 30, 2021 18:12
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants