Cirrus: Collect runner.sh stats #8107
Conversation
Force-pushed from 7ea631e to 90ff3a6
Force-pushed from 90ff3a6 to 63f7460
Force-pushed from 63f7460 to 4eb1af8
Force-pushed from 4eb1af8 to 0cb2cb4
@edsantiago not urgent, but PTAL when you have a second. I'm slightly concerned about burying your logformatter URL. If so, I was thinking to use the artifact-file approach instead.
It's not clear to me what problem this is trying to solve. ISTR it had something to do with the Cirrus agent timeouts, but didn't you solve that by keeping the VMs around and looking at console logs? In what context will these timings be useful, and how? (Please do not answer inline. If you still see a real use for this, please answer in the commit message, so a future maintainer can understand this.)
A valid point, I should probably spell that out 😄 It's true, resource use was originally suspected in the agent-stopped-responding problem. However, there have been other cases where resource use (especially memory usage) has been a concern - none of these VMs have swap, by design. So this is mainly about retaining a record of CPU/memory/disk usage trends long-term, for high-level analysis. That's why whether it lands in the log output or in an artifact file doesn't matter...so long as it is consistent going forward.
PR description updated.
Force-pushed from 0cb2cb4 to 2fb0b3a
I would've preferred the actual git commit message (humans run `git log`).
OMG this just completely changed from a 1-liner to something super complicated. I withdraw my LGTM. Will take another look later.
Force-pushed from 2fb0b3a to 15a0e82
Yes please. I decided to go the "store into file" route, since it doesn't bury your logformatter URL and (in theory) an artifact file should be easier to post-process if it's separate from other content.
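(For illustration only: a rough sketch of what the "store into file" route could look like at the orchestration layer. The wrapper command, the file name, and the use of `CIRRUS_TASK_NAME` are assumptions, not the actual contents of this PR.)

```bash
# Hypothetical sketch -- not the actual .cirrus.yml/runner.sh change.
# Wrap the runner with GNU time so resource stats land in a separate
# artifact file instead of being mixed into (and burying) the test log.
STATS_LOGFILE="${CIRRUS_TASK_NAME:-local}-runner_stats.log"  # assumed naming
/usr/bin/time --verbose --output="$STATS_LOGFILE" ./contrib/cirrus/runner.sh
```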
I had thought the PR description becomes part of the merge-commit message, but looking back, it seems only the PR title is included 😞 Since I'm re-running tests anyway, I'll update the commit message as well...
Force-pushed from 062aca2 to 6da39a1
Well, this is an interesting monkey wrench!

```console
$ bash -c '/usr/bin/time --verbose --output=/tmp/foo.log ls -l /proc/self/fd' -
total 0
lrwx------. 1 esm esm 64 Nov 5 14:23 0 -> /dev/pts/11
lrwx------. 1 esm esm 64 Nov 5 14:23 1 -> /dev/pts/11
lrwx------. 1 esm esm 64 Nov 5 14:23 2 -> /dev/pts/11
l-wx------. 1 esm esm 64 Nov 5 14:23 3 -> /tmp/foo.log
```

The tests, though, expect a normal situation in which only stdin/stdout/stderr are open. I think the easiest, sanest way to work around this would be to add something in the script that closes FD 3.
I'm more than happy to have provided at least some of the entertainment 🤣 Wow, what a good catch! So IIUC: some tests/scripts/processes assume (e.g.) /proc/self/fd/3 is available for their own use, where in fact /usr/bin/time has it open. This causes said tests/scripts/processes to tuck-tail-twixt-legs and run for the hills? Geeze...well ya, closing the FD in the script, with a comment, is going to be far easier than filing a "bug" against /usr/bin/time 😃 (assuming the necessary convincing is even attainable). TBH, I don't think fixing this in /usr/bin/time itself is realistic anyway.
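(A minimal sketch of the agreed-upon workaround, assuming a `STATS_LOGFILE` guard like the one visible in the review snippet further down; the exact placement and wording are guesses, not the verbatim runner.sh change.)

```bash
# Sketch only -- not the verbatim runner.sh change.
# /usr/bin/time keeps its --output file open on FD 3, but tests expect only
# stdin/stdout/stderr to be open, so close the inherited FD here for this
# shell and all further child processes.
if [[ -n "${STATS_LOGFILE:-}" ]]; then
    exec 3>&-
fi
```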
Force-pushed from ba0321c to 63bb6b5
Force-pushed from dcaed3f to fb59c41
Force-pushed from fb59c41 to 525bf8d
Stats-wise, using `time` in front of the ssh client is basically useless. This makes sense, as a new session (process group) is created. All the other cases seem to look okay, including the "inside a container" case.
@edsantiago I can understand the CPU/Memory being different for podman vs podman-remote, but does this seem reasonable/justifiable to you: sys podman fedora-33 root host wall-clock time compared to sys podman fedora-33 root host. In my mind, they should be virtually identical, no? (just an observation)
(I'm making the assumption that you intended to write "remote" in the second link, not "podman".) Are you asking, is something wrong with […]? Are you asking, why do podman-remote tests run more quickly than podman-local? I believe that's because podman-remote has a lot of […]. If I misunderstood your question, I apologize, and could you please elaborate?
Woops. That answers my query, thanks.
Force-pushed from 451722f to 9315816
On several occasions, there have been questions about CPU/Memory/IO trends in testing over time. Start collecting this data for all jobs, using a common/stable format so that trending analysis can be performed within/across multiple Cirrus-CI builds. This PR doesn't add any related tooling, it simply arranges for the collection of the data.

Stats generation is done at the orchestration level to guarantee they reflect everything happening inside `runner.sh`. For example, the container-based tests re-exec `runner.sh` inside a container, but we're only interested in the top-level stats.

Update all tasks to include collection of the stats file. Unfortunately, due to the way the Cirrus-CI YAML parser works, it is *not* possible to alias the artifacts collection more clearly, for example:

```yaml
always:
    <<: *runner_stats
    <<: *logs_artifacts
```

Signed-off-by: Chris Evich <[email protected]>
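(Illustration only: because the two merge keys above collide, each task presumably repeats the stats-artifact entry next to the existing logs anchor. The artifact name and path below are assumptions, not copied from the real `.cirrus.yml`.)

```yaml
# Hypothetical per-task stanza -- names and paths are assumed for illustration.
always:
    runner_stats_artifacts:
        path: ./*-runner_stats.log
        type: text/plain
    <<: *logs_artifacts
```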
Force-pushed from 9315816 to f44af20
LGTM
```bash
# operations depend on making use of FD3, and so it must be explicitly
# closed here (and for all further child-processes).
# STATS_LOGFILE assumed empty/undefined outside of Cirrus-CI (.cirrus.yml)
# shellcheck disable=SC2154
```
You probably don't need this shellcheck directive any more, but it's not worth re-pushing for this.
Woops. I'll kill it next time I come across it.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, edsantiago

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing […]
Sorry for the late response, this was churning in CI when I last looked at it. /lgtm
/hold cancel