Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube play: cannot unmarshal ... PodRmReport.RmReport.Err #16154

Open
edsantiago opened this issue Oct 13, 2022 · 19 comments
Open

kube play: cannot unmarshal ... PodRmReport.RmReport.Err #16154

edsantiago opened this issue Oct 13, 2022 · 19 comments
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. kube remote Problem is in podman-remote

Comments

@edsantiago
Copy link
Member

[+1526s] not ok 392 podman kube play - hostport
...
$ podman-remote kube down /tmp/podman_bats.rYmm67/testpod.yaml
Error: json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error
[ rc=125 (** EXPECTED 0 **) ]

Three instances, first in August. Root/rootless. Always remote.

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Oct 13, 2022
@github-actions github-actions bot removed the remote Problem is in podman-remote label Oct 13, 2022
@edsantiago
Copy link
Member Author

An interesting side note: two of these three failures are associated with #15367 (the everything-hosed-forever flake): as soon as this unmarshal error happens, all future podman invocations fail catastrophically.

@edsantiago edsantiago added the remote Problem is in podman-remote label Oct 13, 2022
@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@mtrmac
Copy link
Collaborator

mtrmac commented Dec 16, 2022

A weird thing is that there is no RmReport member in pkg/domain/entities.PodRmReport.

@edsantiago
Copy link
Member Author

Three new instances, but slightly different. These are all VolumeRmReport.VolumeRmReport.Err instead of PodRmReport. And they don't triple-fail, or even single-fail (the test passes), so my flake logger cannot detect them. I have to cirrus-flake-grep which will only catch it if it manifests in a real flake.

# podman-remote [options] play kube /tmp/podman_test2607771438/kube.yaml
         Error: error tearing down workloads "json: cannot unmarshal string into Go struct field VolumeRmReport.VolumeRmReport.Err of type error" after kube play error "playing YAML file: importing volumes is not supported for remote requests"

All of them in int remote fedora-37 root host:

@edsantiago
Copy link
Member Author

Seen Jan 27 in sys remote f37 rootless:

not ok 511 podman kube play read-only
...
$ podman-remote --url unix:/tmp/podman_tmp_dEuC kube down -
Error: json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error
[ rc=125 (** EXPECTED 0 **) ]

Summary so far: happening both root and rootless, int and sys, f36 and f37, and still happening many months after first reported.

@mtrmac
Copy link
Collaborator

mtrmac commented Jan 30, 2023

I guess something like

diff --git a/pkg/bindings/errors.go b/pkg/bindings/errors.go
index d9dfa95a6..af7ab0c45 100644
--- a/pkg/bindings/errors.go
+++ b/pkg/bindings/errors.go
@@ -7,6 +7,7 @@ import (
        "io"
 
        "github.com/containers/podman/v4/pkg/errorhandling"
+       "github.com/sirupsen/logrus"
 )
 
 var (
@@ -15,7 +16,8 @@ var (
 
 func handleError(data []byte, unmarshalErrorInto interface{}) error {
        if err := json.Unmarshal(data, unmarshalErrorInto); err != nil {
-               return err
+               logrus.Errorf("Error unmarshaling into %#v, data %q", unmarshalErrorInto, data)
+               panic("handleError failed")
        }
        return unmarshalErrorInto.(error)
 }
@@ -35,7 +37,10 @@ func (h APIResponse) ProcessWithError(unmarshalInto interface{}, unmarshalErrorI
        }
        if h.IsSuccess() || h.IsRedirection() {
                if unmarshalInto != nil {
-                       return json.Unmarshal(data, unmarshalInto)
+                       if err := json.Unmarshal(data, unmarshalInto); err != nil {
+                               logrus.Errorf("Error unmarshaling into %#v, data %q", unmarshalInto, data)
+                               panic("APIResponse.ProcessWithError failed")
+                       }
                }
                return nil
        }

could help diagnose, but if this is an infrequent flake we can’t trigger, a draft-only PR probably wouldn’t help, and I’m not comfortable with shipping those panics to users.

Maybe we could just log the data and type in the error, and keep that shipped on the main branch until this triggers again…

@edsantiago
Copy link
Member Author

Hey, could "unexpected end of JSON input" be a variation of this bug? Seen in podman machine test, f37 aarch64 rootless:

Machine init complete
         To start your machine run:
         
         	podman machine start 7ad7ab2c8b50
         
         output: Image resized. Machine init complete To start your machine run: podman machine start 7ad7ab2c8b50
         /var/tmp/go/src/github.com/containers/podman/bin/podman-remote machine start 7ad7ab2c8b50
         /var/tmp/go/src/github.com/containers/podman/bin/podman-remote machine list
         Migrating machine ""
         Error: checking VM active: unexpected end of JSON input
         NAME           VM TYPE     CREATED        LAST UP        CPUS        MEMORY      DISK SIZE
         7ad7ab2c8b50*  qemu        7 seconds ago  7 seconds ago  1           2.147GB     107.4GB

@mtrmac
Copy link
Collaborator

mtrmac commented Jan 30, 2023

Maybe we could just log the data and type in the error, and keep that shipped on the main branch until this triggers again…

Let’s start with that: #17282 .

@edsantiago
Copy link
Member Author

Yay!. Does this help?

# podman-remote --url unix:/tmp/podman_tmp_PyCl kube down /tmp/podman_bats.yuokza/test.yaml
Error: unmarshaling into &entities.PlayKubeReport{Pods:[]entities.PlayKubePod(nil), Volumes:[]entities.PlayKubeVolume(nil), PlayKubeTeardown:entities.PlayKubeTeardown{StopReport:[]*entities.PodStopReport{(*entities.PodStopReport)(0xc0003b4540)}, RmReport:[]*entities.PodRmReport{(*entities.PodRmReport)(0xc0002701a0)}, VolumeRmReport:[]*entities.VolumeRmReport(nil)}, Secrets:[]entities.PlaySecret(nil)}, data "{\"Pods\":null,\"Volumes\":null,\"StopReport\":[{\"Errs\":null,\"Id\":\"23925178e788af05632a6613f214378c0d2a3c64ca078339fc0bd906b3ae4820\"}],\"RmReport\":[{\"Err\":\"1 error occurred:\\n\\t* removing container 919c374a24cadedcba2d95849dcf5c4aa17d0e757965436b5530d14fa28ab695 from pod 23925178e788af05632a6613f214378c0d2a3c64ca078339fc0bd906b3ae4820: removing container 919c374a24cadedcba2d95849dcf5c4aa17d0e757965436b5530d14fa28ab695 root filesystem: 1 error occurred:\\n\\t* unlinkat /var/lib/containers/storage/overlay-containers/919c374a24cadedcba2d95849dcf5c4aa17d0e757965436b5530d14fa28ab695/userdata/shm: device or resource busy\\n\\n\\n\\n\",\"Id\":\"23925178e788af05632a6613f214378c0d2a3c64ca078339fc0bd906b3ae4820\"}],\"VolumeRmReport\":null,\"Secrets\":null}\n": json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error

(Sorry for the long line)

@mtrmac
Copy link
Collaborator

mtrmac commented Feb 3, 2023

It definitely does.


The underlying cause, in RmReport:

1 error occurred:
	* removing container 919c374a24cadedcba2d95849dcf5c4aa17d0e757965436b5530d14fa28ab695 from pod 23925178e788af05632a6613f214378c0d2a3c64ca078339fc0bd906b3ae4820: removing container 919c374a24cadedcba2d95849dcf5c4aa17d0e757965436b5530d14fa28ab695 root filesystem: 1 error occurred:
	* unlinkat /var/lib/containers/storage/overlay-containers/919c374a24cadedcba2d95849dcf5c4aa17d0e757965436b5530d14fa28ab695/userdata/shm: device or resource busy

That looks rather like some other flakes, e.g. #11594 (comment)


And then there seems to be a design issue in error handling of the Kube commands: With the disclaimer that I’m very new to this code, and I am just reading it, not testing it in practice:

Summary: The error interface can’t be directly transferred over JSON.

E.g. consider how

func (ic *ContainerEngine) PodRm(ctx context.Context, namesOrIds []string, opts entities.PodRmOptions) ([]*entities.PodRmReport, error) {
works (outside of Kube): The PodRmReport.Err field is manually set by the remote’s client caller, i.e. it is not transferred directly as JSON. (Well, actually it is,
return &report, response.Process(&report)
does end up unmarshaling the Err field, but as long as there is no error, the value is nil and that is passed through JSON just fine.)

If the server fails, a different HTTP status triggers transfer not of PodRmReport, but of a specialized

type ErrorModel struct {
JSON format; and the callee detects that and manually populates PodRmReport.Err.

That’s the way this seems to work for the standalone PodRmReport. But that mechanism only works at the top level of the request; meanwhile play kube just embeds the full PodRmReport (and other structs) into the total result, as a single unit, with no special-case handling of error values; because PodRmReport can’t actually transfer Err through raw JSON, that then breaks.

Fixing this seems to mean changing the structure of entities.PlayKubeTeardown so that it does not naively embed the top-level “Report” structures, but some variant that does not contain a raw error field but maybe instances of errorhandling.ErrorModel. (That would, in theory, be an API break, but the existing API can’t be correctly consumed by Podman itself in error situations, so…)

And as much as I originally looked into this bug because of a stupid oversight (#16154 (comment) mixes up type and field names, there is actually no mystery in that), I’d like to leave this play kube reorganization to someone familiar with the code.

@edsantiago
Copy link
Member Author

Seen just now in f37 remote rootless, and again everything after that is hosed (#17216) even though it's not a "podman rm":

Error: unmarshalling into &entities.PlayKubeReport{Pods:[]entities.PlayKubePod(nil),
   Volumes:[]entities.PlayKubeVolume(nil),
   PlayKubeTeardown:entities.PlayKubeTeardown{StopReport:[]*entities.PodStopReport{(*entities.PodStopReport)(0xc0005f2810)},
   RmReport:[]*entities.PodRmReport{(*entities.PodRmReport)(0xc00046e1c0)},
   VolumeRmReport:[]*entities.VolumeRmReport(nil)},
   Secrets:[]entities.PlaySecret(nil)},
   data "{\"Pods\":null,\"Volumes\":null,\"StopReport\":[{\"Errs\":null,\"Id\":\"447565dbf1252d30bb7ccee54d90301af053f5c62a8fad9c784c7a8129edaa73\"}],\"RmReport\":[{\"Err\":\"2 errors occurred:\
 \\t* removing container fe02d5e5ad813a31fd7a806006644c7068d5f00b5eb74e1e3e0e3ab3a642225a from pod 447565dbf1252d30bb7ccee54d90301af053f5c62a8fad9c784c7a8129edaa73: removing container fe02d5e5ad813a31fd7a806006644c7068d5f00b5eb74e1e3e0e3ab3a642225a root filesystem: 1 error occurred:\
 \\t* unlinkat /home/some29111dude/.local/share/containers/storage/overlay/b0007db8a5b78536fa46c4d4f06aa36700c48ca3d69f598012993742a7e87f0c/merged: device or resource busy\
 \
 \
 \\t* removing container 18fbc78049db4a79ea0cbd24e46c4a3dce60e8fd13bd753160be5be8abaafcaf from pod 447565dbf1252d30bb7ccee54d90301af053f5c62a8fad9c784c7a8129edaa73: removing container 18fbc78049db4a79ea0cbd24e46c4a3dce60e8fd13bd753160be5be8abaafcaf root filesystem: 1 error occurred:\
 \\t* unlinkat /home/some29111dude/.local/share/containers/storage/overlay/64e6dcc8a2ddd7396a85052a892829941bbce37ab9bc367666d7db30e18acc27/merged: device or resource busy\
 \
 \
 \
 \",\"Id\":\"447565dbf1252d30bb7ccee54d90301af053f5c62a8fad9c784c7a8129edaa73\"}],\"VolumeRmReport\":null,\"Secrets\":null}
 ": json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error

@edsantiago
Copy link
Member Author

Now seen in debian also. remote, root.

@edsantiago
Copy link
Member Author

f37 remote root, and once it happens, all future tests are hosed:

not ok 535 podman kube play - default log driver
...
#  podman-remote --url unix:/tmp/podman_tmp_i1Vb kube down /tmp/podman_bats.8ABmsh/test.yaml
Error: unmarshalling into
&entities.PlayKubeReport{
  Pods:[]entities.PlayKubePod(nil),
      Volumes:[]entities.PlayKubeVolume(nil),
      PlayKubeTeardown:entities.PlayKubeTeardown{
          StopReport:[]*entities.PodStopReport{
                (*entities.PodStopReport)(0xc0006f5b00)},
                  RmReport:[]*entities.PodRmReport{
                        (*entities.PodRmReport)(0xc00011c920)
                    },
                  VolumeRmReport:[]*entities.VolumeRmReport(nil),
                  SecretRmReport:[]*entities.SecretRmReport{}
                },
              Secrets:[]entities.PlaySecret(nil),
              ServiceContainerID:""
            },
    data "{
    \"Pods\":null,
    \"Volumes\":null,
    \"StopReport\":[
        {
            \"Errs\":null,
            \"Id\":\"e99d70f60aaa8182801659dcf1e361cec618ff546de03d48f6d95ed3e74c063c\"
        }
    ],
    \"RmReport\":[
        {
            \"Err\":\"1 error occurred:\\n\\t* removing container 6e4e1af5522f8cade47cb6a01646259df0e629471b38444adb6f8371c153097f from pod e99d70f60aaa8182801659dcf1e361cec618ff546de03d48f6d95ed3e74c063c: removing container 6e4e1af5522f8cade47cb6a01646259df0e629471b38444adb6f8371c153097f root filesystem: 1 error occurred:\\n\\t* unlinkat /var/lib/containers/storage/overlay/f33f7a528eafe7ccb32652ad381133af5007aafe0c9761f94eb0597cca81414f/merged: device or resource busy\\n\\n\\n\\n\",
            \"Id\":\"e99d70f60aaa8182801659dcf1e361cec618ff546de03d48f6d95ed3e74c063c\"
        }
    ],
    \"VolumeRmReport\":null,
    \"SecretRmReport\":[],
    \"Secrets\":null,
    \"ServiceContainerID\":\"\"}\n": json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error

I'm really quite certain this is related to #17216

@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@vrothberg
Copy link
Member

I'm really quite certain this is related to #17216

Yes, me too. I close this issue here so we can focus on #17216

@mtrmac
Copy link
Collaborator

mtrmac commented Apr 24, 2023

My understanding, per the above, is that the #17216 underlying cause is the same, but on top, this demonstrates a bug in PodRmReport error handling.

@vrothberg
Copy link
Member

That's a fair point. Let's reopen to look into the error reporting.

@vrothberg vrothberg reopened this Apr 25, 2023
@vrothberg
Copy link
Member

Had a quick look. The underlying issue is that the returned type includes a type error which cannot be (un)marshalled. Those need to be turned into strings and then be converted into errors as done for other endpoints.

@vrothberg vrothberg added the kube label Apr 25, 2023
@edsantiago
Copy link
Member Author

Here's a recent one:

[+1437s] not ok 635 [700] podman play with init container
...
<+022ms> # # podman-remote --url unix:///var/tmp/podman_tmp_6K3d kube down /var/tmp/podman_bats.5MGJtt/test.yaml
<+470ms> # Error: unmarshalling into &types.PlayKubeReport{Pods:[]types.PlayKubePod(nil), Volumes:[]types.PlayKubeVolume(nil), PlayKubeTeardown:types.PlayKubeTeardown{StopReport:[]*types.PodStopReport{(*types.PodStopReport)(0xc00033e510)}, RmReport:[]*types.PodRmReport{(*types.PodRmReport)(0xc00033e810)}, VolumeRmReport:[]*types.VolumeRmReport(nil), SecretRmReport:[]*types.SecretRmReport{}}, Secrets:[]types.PlaySecret(nil), ServiceContainerID:"", ExitCode:(*int32)(nil)}, data "{\"Pods\":null,\"Volumes\":null,\"StopReport\":[{\"Errs\":null,\"Id\":\"10321db6fad090213edd0e6c34007ec23d487e3f83e3273e3ef9501e486feace\"}],\"RmReport\":[{\"RemovedCtrs\":{\"af19e7c3ea879539cf3b692907da987f2f8375e0f271c15937615b772465096d\":\"cleaning up container af19e7c3ea879539cf3b692907da987f2f8375e0f271c15937615b772465096d: removing container af19e7c3ea879539cf3b692907da987f2f8375e0f271c15937615b772465096d from runtime: `/usr/bin/crun delete --force af19e7c3ea879539cf3b692907da987f2f8375e0f271c15937615b772465096d` failed: exit status 1\",\"e6331f85910c1d80bcf0da2bf2432a7ad90933b94f65eee7c78a2a7e63ddf62f\":\"a container that depends on container e6331f85910c1d80bcf0da2bf2432a7ad90933b94f65eee7c78a2a7e63ddf62f could not be removed: container state improper\"},\"Err\":\"not all containers could be removed from pod 10321db6fad090213edd0e6c34007ec23d487e3f83e3273e3ef9501e486feace: removing pod containers\",\"Id\":\"10321db6fad090213edd0e6c34007ec23d487e3f83e3273e3ef9501e486feace\"}],\"VolumeRmReport\":null,\"SecretRmReport\":[],\"Secrets\":null,\"ServiceContainerID\":\"\",\"ExitCode\":null}\n": json: cannot unmarshal string into Go struct field PodRmReport.RmReport.RemovedCtrs of type error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. kube remote Problem is in podman-remote
Projects
None yet
Development

No branches or pull requests

3 participants