Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

many: deal with incorrect status from osbuild better #827

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mvo5
Copy link
Collaborator

@mvo5 mvo5 commented Feb 7, 2025

go.mod: upadate to latest images to get PR#1200

This commit updates the images library to pull in the fix [0] for
the overly long messages from osbuild. This should test the
(previously) failing test that runs the centos-9 ISO with an
attached terminal.

[0] osbuild/images#1200


test: run the centos-9 test with an attached terminal

This commit changes the centos-9 test to run with podman -t so
that we have a test-case that uses the terminal progress. This
is prompted by:
a) we have no integration test currently that uses the terminal
progress for the full build
b) a konflux failure/memory leak that showed because there the
test is run with -t


This commit ensures that we do restrict the memory of the bib
test conatiner to catch excessive memory usage. This is prompted
by a memory leak when dealing with unrecoverable status messages
that lead to failures in konflux.


This commit changes the progress parser (again) to deal with
errors from the osbuild json progress scanner. On errors we
will now exit right away and potentially kill osbuild but
provide an error message that hints how to workaround the
issue.

The original code assumed we get transient errors like json
decode issues. However it turns out that this is not always
the case, some errors from a bufio.Scanner are unrecoverable
(like ErrTooBig) and trying again just leads to an endless
loop. We can also not "break" wait for the build to finish
because that would appear like the progress is broken
forever and we would still have to report an error (just
much later).

defer func() {
// ensure osbuild is stopped even if we exit early
if cmd.Process != nil {
cmd.Process.Kill()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may need a bit more care, maybe only do it in hte error case (as the happy case does a cmd.Wait already so no killing needed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine as is but I would strongly recommend using PR_SET_PDEATHSIG if it's not already in use. That's a kernel-enforced way to lifecycle bind the child to the parent, such that if e.g. the parent dies due to SIGSEGV or whatever, the child gets killed too.

@mvo5 mvo5 marked this pull request as ready for review February 7, 2025 21:05
@mvo5 mvo5 force-pushed the fix-konflux-oom branch 3 times, most recently from 0f80abb to 837d891 Compare February 10, 2025 08:59
Copy link
Member

@ondrejbudai ondrejbudai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

PR_SET_PDEATHSIG feels like a nice follow-up, right, @mvo5 ?

@mvo5
Copy link
Collaborator Author

mvo5 commented Feb 11, 2025

Looks good, thanks!

PR_SET_PDEATHSIG feels like a nice follow-up, right, @mvo5 ?

Yeah, I think we can look into this in a followup, while writing this I realized its actually more complicated because killing osbuild hard will probably leave a bunch of (dandling) bind mounts around. So the current approach is to send a SIGINT and hope for the best, osbuild should cleanup after itself. I'm not sure we have good options here but fortunately the error condition (bad progress reported) is rare. The other thing we could try is to make sure that the osbuild status scanner always advances, then we could also keep trying until we get valid json again (replacing the current bufio scanner essentially, however in the general case when we never get valid output again we would still appear hung for an extended period of time, but maybe a worthwhle tradeoff over erroring?)

mvo5 added 3 commits February 11, 2025 17:22
This commit changes the progress parser (again) to deal with
errors from the osbuild json progress scanner. On errors we
will now exit right away and potentially kill osbuild but
provide an error message that hints how to workaround the
issue.

The original code assumed we get transient errors like json
decode issues. However it turns out that this is not always
the case, some errors from a bufio.Scanner are unrecoverable
(like ErrTooBig) and trying again just leads to an endless
loop. We can also not "break" and wait for the build to finish
because that would appear like the progress is broken
forever and we would still have to report an error (just
much later).
This commit ensures that we do restrict the memory of the bib
test conatiner to catch excessive memory usage. This is prompted
by a memory leak when dealing with unrecoverable status messages
that lead to failures in konflux.
This commit changes the centos-9 test to run with `podman -t` so
that we have a test-case that uses the `terminal` progress. This
is prompted by:
a) we have no integration test currently that uses the terminal
   progress for the full build
b) a konflux failure/memory leak that showed because there the
   test is run with `-t`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants