Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)' #4284

Closed
YaLTeR opened this issue Jan 31, 2023 · 35 comments · Fixed by #4405
Labels
difficulty/medium medium complexity/difficutly issue priority/high

Comments

@YaLTeR
Copy link

YaLTeR commented Jan 31, 2023

Host system details

● fedora:fedora/rawhide/x86_64/silverblue
                  Version: Rawhide.20230128.n.0 (2023-01-28T06:06:38Z)
               BaseCommit: 50f6d64e26e190a512a14cffbf7a7137771b56f2ef46d77a8ea1a61ef835755e
             GPGSignature: Valid signature by 6A51BBABBA3D5467B6171221809A8D7CEB10B464
      RemovedBasePackages: adwaita-qt5 1.4.2-2.fc38 firefox firefox-langpacks 109.0-1.fc38
                           qgnomeplatform-qt5 0.9.0-8.fc38
          LayeredPackages: alacritty dnf exa fish flatpak-builder gdb gjs-debuginfo glib2-debuginfo
                           gnome-shell-debuginfo hotspot langpacks-ru mesa-dri-drivers-debuginfo mozjs102-debuginfo
                           mutter-debuginfo perf sysprof
                Initramfs: --include /etc/initramfs-overlay /

Expected vs actual behavior

$ rpm-ostree update
⠒ Scanning metadata: 3
Enabled rpm-md repositories: fedora-cisco-openh264 rawhide-modular rawhide rawhide-debuginfo rawhide-source fedora-rawhide-nodebug
⠴ Scanning metadata: 3
thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'rpm-ostree update' terminated by signal SIGABRT (Abort)

Expected:

No panic.

Steps to reproduce it

Not reproducible.

Would you like to work on the issue?

Should be assigned to someone else.

@cgwalters
Copy link
Member

Do you have a coredump from this? Try using coredumpctl and coredumpctl info to try to extract a stack trace.

@YaLTeR
Copy link
Author

YaLTeR commented Feb 11, 2023

I think there's some issue with how rpm-ostree is built because there's nothing in the stack trace:

                #0  0x00007f87f7c6cb94 __pthread_kill_implementation (libc.so.6 + 0x8eb94)
                #1  0x00007f87f7c1baee raise (libc.so.6 + 0x3daee)
                #2  0x00007f87f7c0487f abort (libc.so.6 + 0x2687f)
                #3  0x0000558a579cc927 _ZN11panic_abort18__rust_start_panic5abort17h6ac263e9d2a6eb7eE (rpm-ostree + 0x985927)
                #4  0x00007f87e4021b30 n/a (n/a + 0x0)

I also tried reproducing with RUST_BACKTRACE=1 and that's empty too. Looks like the debug symbols are stripped but not available in any other package.

@YaLTeR
Copy link
Author

YaLTeR commented Feb 13, 2023

┌ ~
└─ env RUST_BACKTRACE=full rpm-ostree update
⠤ Receiving metadata objects: 1/(estimating) 49 bytes/s 196 bytes                                                       2 metadata, 0 content objects fetched; 788 B transferred in 5 seconds; 0 bytes content written
Receiving metadata objects: 1/(estimating) 49 bytes/s 196 bytes... done
⠒ Checking out tree 0e96fdf...                                                                                          Enabled rpm-md repositories: fedora-cisco-openh264 rawhide-modular rawhide rawhide-debuginfo rawhide-source fedora-rawhide-nodebug
⠐ Checking out tree 0e96fdf...                                                                                          thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
stack backtrace:
   0:     0x55adf3155b4a - <unknown>
   1:     0x55adf317ef8e - <unknown>
   2:     0x55adf314e485 - <unknown>
   3:     0x55adf3155915 - <unknown>
   4:     0x55adf315705f - <unknown>
   5:     0x55adf3156d9b - <unknown>
   6:     0x55adf315776c - <unknown>
   7:     0x55adf31574c2 - <unknown>
   8:     0x55adf3155ffc - <unknown>
   9:     0x55adf3157212 - <unknown>
  10:     0x55adf2952b83 - <unknown>
  11:     0x55adf2952c5d - <unknown>
  12:     0x55adf2be0665 - <unknown>
  13:     0x55adf2a7875f - <unknown>
⠐ Checking out tree 0e96fdf...                                                                                            14:     0x7fe1084673a2 - g_closure_invoke
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gclosure.c:832:7
  15:     0x7fe108495f58 - signal_emit_unlocked_R.isra.0
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gsignal.c:3796:8
  16:     0x7fe1084855ea - g_signal_emit_valist
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gsignal.c:3549:5
  17:     0x7fe108485813 - g_signal_emit
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gsignal.c:3606:3
⠒ Checking out tree 0e96fdf...                                                                                            18:     0x7fe107f999eb - on_signal_received
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gio/gdbusproxy.c:897:3
  19:     0x7fe107f88070 - emit_signal_instance_in_idle_cb
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gio/gdbusconnection.c:3791:5
⠓ Checking out tree 0e96fdf...                                                                                            20:     0x7fe107d9b49d - g_idle_dispatch
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:6124:15
  21:     0x7fe107d9c540 - g_main_dispatch
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:3444:28
  22:     0x7fe107d9c540 - g_main_context_dispatch
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:4162:7
  23:     0x7fe107df44e8 - g_main_context_iterate.isra.0
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:4238:5
  24:     0x7fe107d9baef - g_main_loop_run
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:4438:5
  25:     0x55adf2a76868 - <unknown>
  26:     0x55adf2a76a64 - <unknown>
  27:     0x55adf2a69cf5 - <unknown>
  28:     0x55adf2a6dae6 - <unknown>
  29:     0x55adf2a675a4 - <unknown>
  30:     0x55adf2bf554c - <unknown>
  31:     0x55adf2a1d216 - <unknown>
  32:     0x55adf2a63892 - <unknown>
  33:     0x55adf30b1189 - <unknown>
  34:     0x55adf30a16eb - <unknown>
  35:     0x55adf30af09a - <unknown>
  36:     0x55adf315daf3 - <unknown>
  37:     0x7fe10713bc57 - start_thread
  38:     0x7fe1071c1a70 - __clone3
  39:                0x0 - <unknown>
fish: Job 1, 'env RUST_BACKTRACE=full rpm-ost…' terminated by signal SIGABRT (Abort)

@istvan-derda
Copy link

istvan-derda commented Mar 19, 2023

Random (maybe helpful) report:
Happened to me too. Happened for the first time today when removing the layered gcc package (could that do something?) Will now keep happening every time I do something with the ostree (install/uninstall/rebase), but only the first time. if I run the command a second time it works.

Error happens on fedora silverblue 38 but not on 37

image

@YaLTeR
Copy link
Author

YaLTeR commented Mar 19, 2023

It's been happening to me randomly like ⅓ of the time, I think this is just a race condition of some kind. Still happens on this version btw:

rpm-ostree:
 Version: '2023.2'
 Git: 8e81b02cfc3dabcda9af7d4275200efedbe145c7
 Features:
  - rust
  - compose
  - container
  - fedora-integration

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Mar 25, 2023
@cgwalters
Copy link
Member

cgwalters commented Mar 25, 2023

Sorry about the delay on this. I put up #4348 which should help us debug. Once that PR lands a build should show up in https://copr.fedorainfracloud.org/coprs/g/CoreOS/continuous/

Then env G_MESSAGES_DEBUG=rpm-ostree,rpm-ostreed RUST_LOG=rpm_ostree=debug,rpmostree_rust=debug rpm-ostree rebase fedora:fedora/x86_64/coreos/next

should get us a bit more debugging information. Also this is a client-server setup, so we may also need to do e.g.

$ systemctl edit --runtime rpm-ostreed

Then paste in there:

[Service]
Environment=G_MESSAGES_DEBUG=rpm-ostreed
Environment=RUST_LOG=rpm_ostree=debug,rpmostree_rust=debug

And systemctl restart rpm-ostreed.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Mar 28, 2023
@cgwalters
Copy link
Member

Ok @YaLTeR the debug patch is available in https://copr.fedorainfracloud.org/coprs/g/CoreOS/continuous/ - let me know if you need help trying out enabling that build alongside injecting the debug configuration above!

@cgwalters
Copy link
Member

Also cc @istvan-derda since you hit this too

@YaLTeR
Copy link
Author

YaLTeR commented Apr 1, 2023

Got a new one with the debug logging:

⠠ Importing rpm-md  75% [███████████████░░░░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:52.724: txn progress PercentProgress
⠤ Importing rpm-md  80% [████████████████░░░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:52.918: txn progress PercentProgress
⠦ Importing rpm-md  85% [█████████████████░░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:52.964: txn progress PercentProgress
⠖ Importing rpm-md  90% [██████████████████░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:53.041: txn progress TaskBegin
thread 'tokio-runtime-worker' panicked at 'Overwriting task: "Importing rpm-md"', rust/src/console_progress.rs:179:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'env G_MESSAGES_DEBUG=rpm-ostree…' terminated by signal SIGABRT (Abort)

@cgwalters
Copy link
Member

OK, sadly I did the wrong thing with the debug logging...just put up #4353 which should really help this time. What's going wrong is we're trying to output a new progress bar when an old one is still active, and we need to know exactly which are old and new.

@YaLTeR
Copy link
Author

YaLTeR commented Apr 6, 2023

Would it be possible to update the COPR with RPMs including that PR? Or just send the RPMs here for F38

@YaLTeR
Copy link
Author

YaLTeR commented Apr 8, 2023

Btw, for me the crash reproduces basically every time if I update with rpm-ostree update and at the same time do flatpak update -y.

@BrainBlasted
Copy link

I'm getting this with a brand-new F38 Silverblue install FWIW

@osslate
Copy link

osslate commented Apr 17, 2023

Have been experiencing this intermittently with F38 also. Sometimes rerunning rpm-ostree upgrade works, but other times the core dump happens consecutively.

For me this is happening at this point:

⠙ Running posttrans scripts...                                                                                                                                                   thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

@YaLTeR
Copy link
Author

YaLTeR commented Apr 19, 2023

I built an rpm from main, will try to reproduce again.

@YaLTeR
Copy link
Author

YaLTeR commented Apr 20, 2023

⠉ Receiving objects; 83% (506/605) 3,1 MB/s 146,6 MB                                                                    (rpm-ostree update:26250): rpm-ostreed-DEBUG: 22:18:41.679: txn progress DownloadProgress
⠈ Receiving objects; 84% (509/605) 3,1 MB/s 148,9 MB                                                                    (rpm-ostree update:26250): rpm-ostreed-DEBUG: 22:18:41.909: txn progress Message
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates updates-debuginfo updates-source fedora fedora-debuginfo fedora-source updates-archive
⠈ Receiving objects; 84% (509/605) 3,1 MB/s 148,9 MB                                                                    (rpm-ostree update:26250): rpm-ostreed-DEBUG: 22:18:42.219: txn progress PercentProgress
thread 'tokio-runtime-worker' panicked at 'expected non-task progress type; current message is Receiving objects; 84% (509/605) 3,1 MB/s 148,9 MB', rust/src/console_progress.rs:128:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'env G_MESSAGES_DEBUG=rpm-ostree…' terminated by signal SIGABRT (Abort)

@YaLTeR
Copy link
Author

YaLTeR commented Apr 20, 2023

And a second one for the bulk (flatpak update -y took a while, so I managed to get two crashes):

DEBUG Creating progress "Running post scripts" with type Task
⠖ Running post scripts...                                                                                               (rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:11.695: txn progress TaskEnd
DEBUG Ending progress Task; suffix=Some("done")
Running post scripts... done
(rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:11.696: txn progress TaskBegin
DEBUG Creating progress "Running posttrans scripts" with type Task
⠦ Running posttrans scripts...                                                                                          (rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:15.202: txn progress Message
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates updates-debuginfo updates-source fedora fedora-debuginfo fedora-source updates-archive
⠒ Running posttrans scripts...                                                                                          (rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:15.573: txn progress PercentProgress
thread 'tokio-runtime-worker' panicked at 'expected non-task progress type; current message is Running posttrans scripts', rust/src/console_progress.rs:128:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'env G_MESSAGES_DEBUG=rpm-ostree…' terminated by signal SIGABRT (Abort)

@ejgr-mtsiw
Copy link

Was having the same error

[someone@somewhere ~]$ rpm-ostree update
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates fedora rpmfusion-nonfree-nvidia-driver
Importing rpm-md... done
rpm-md repo 'fedora-cisco-openh264' (cached); generated: 2023-03-14T10:56:46Z solvables: 4
rpm-md repo 'fedora-modular' (cached); generated: 2023-04-14T09:25:02Z solvables: 1082
rpm-md repo 'updates-modular' (cached); generated: 2018-02-20T19:18:14Z solvables: 0
rpm-md repo 'updates' (cached); generated: 2023-04-20T04:37:27Z solvables: 3831
rpm-md repo 'fedora' (cached); generated: 2023-04-14T09:32:40Z solvables: 69222
rpm-md repo 'rpmfusion-nonfree-nvidia-driver' (cached); generated: 2023-04-08T18:30:02Z solvables: 31
⠈ Receiving metadata objects: 0/(estimating) -/s 0 bytes...
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates fedora rpmfusion-nonfree-nvidia-driver
⠁ Receiving metadata objects: 0/(estimating) -/s 0 bytes...
thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Just did a rpm-ostree cleanup -b -p -r -m and everything ran fine after that.
I don't think all the options are needed, but it was the command I had in history and I just ran it on a whim...
I guess the problem is related to some cached content...?

@shoop
Copy link

shoop commented Apr 21, 2023

Piling on here, randomly on overlay of RPM fusion mesa-vdpau-drivers-freeworld: rpm-ostree install mesa-vdpau-drivers-freeworld. Just updated to F38 (pretty smooth apart from RPM fusion stuff, thanks!). Deployment completed on repeating the command.

coredumpctl info:

stijn@tangaloor: ~
$ coredumpctl info
           PID: 3879 (rpm-ostree)
           UID: 1000 (stijn)
           GID: 1000 (stijn)
        Signal: 6 (ABRT)
     Timestamp: Fri 2023-04-21 10:42:34 CEST (1min 38s ago)
  Command Line: rpm-ostree install mesa-vdpau-drivers-freeworld
    Executable: /usr/bin/rpm-ostree
 Control Group: /user.slice/user-1000.slice/[email protected]/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3cf7a228-1d18-4007-aa7e-6363284335bb.scope
          Unit: [email protected]
     User Unit: vte-spawn-3cf7a228-1d18-4007-aa7e-6363284335bb.scope
         Slice: user-1000.slice
     Owner UID: 1000 (stijn)
       Boot ID: 302b227720394945a6a38ccd187a901b
    Machine ID: 49cf6ab9e0d14993ba7805d3a9dd578b
      Hostname: tangaloor
       Storage: /var/lib/systemd/coredump/core.rpm-ostree.1000.302b227720394945a6a38ccd187a901b.3879.1682066554000000.zst (present)
  Size on Disk: 764.2K
       Message: Process 3879 (rpm-ostree) of user 1000 dumped core.
                
                Module libcrypt.so.2 from rpm libxcrypt-4.4.33-7.fc38.x86_64
                Module libbrotlicommon.so.1 from rpm brotli-1.0.9-11.fc38.x86_64
                Module libsasl2.so.3 from rpm cyrus-sasl-2.1.28-9.fc38.x86_64
                Module libevent-2.1.so.7 from rpm libevent-2.1.12-8.fc38.x86_64
                Module libkeyutils.so.1 from rpm keyutils-1.6.1-6.fc38.x86_64
                Module libkrb5support.so.0 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libcom_err.so.2 from rpm e2fsprogs-1.46.5-4.fc38.x86_64
                Module libk5crypto.so.3 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libkrb5.so.3 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libbrotlidec.so.1 from rpm brotli-1.0.9-11.fc38.x86_64
                Module libgssapi_krb5.so.2 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libpsl.so.5 from rpm libpsl-0.21.2-2.fc38.x86_64
                Module libssh.so.4 from rpm libssh-0.10.4-4.fc38.x86_64
                Module libidn2.so.0 from rpm libidn2-2.3.4-2.fc38.x86_64
                Module libnghttp2.so.14 from rpm nghttp2-1.52.0-1.fc38.x86_64
                Module libblkid.so.1 from rpm util-linux-2.38.1-4.fc38.x86_64
                Module libattr.so.1 from rpm attr-2.5.1-6.fc38.x86_64
                Module libyaml-0.so.2 from rpm libyaml-0.2.5-9.fc38.x86_64
                Module libmagic.so.1 from rpm file-5.44-3.fc38.x86_64
                Module libzck.so.1 from rpm zchunk-1.3.1-1.fc38.x86_64
                Module libcurl.so.4 from rpm curl-7.87.0-7.fc38.x86_64
                Module libaudit.so.1 from rpm audit-3.1-2.fc38.x86_64
                Module liblua-5.4.so from rpm lua-5.4.4-9.fc38.x86_64
                Module libpopt.so.0 from rpm popt-1.19-2.fc38.x86_64
                Module librpm_sequoia.so.1 from rpm rust-rpm-sequoia-1.3.0-1.fc38.x86_64
                Module libpcre2-8.so.0 from rpm pcre2-10.42-1.fc38.1.x86_64
                Module libffi.so.8 from rpm libffi-3.4.4-2.fc38.x86_64
                Module libselinux.so.1 from rpm libselinux-3.5-1.fc38.x86_64
                Module libmount.so.1 from rpm util-linux-2.38.1-4.fc38.x86_64
                Module libgmodule-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libxml2.so.2 from rpm libxml2-2.10.4-1.fc38.x86_64
                Module libbz2.so.1 from rpm bzip2-1.0.8-13.fc38.x86_64
                Module liblz4.so.1 from rpm lz4-1.9.4-2.fc38.x86_64
                Module libzstd.so.1 from rpm zstd-1.5.5-1.fc38.x86_64
                Module liblzma.so.5 from rpm xz-5.4.1-1.fc38.x86_64
                Module libacl.so.1 from rpm acl-2.3.1-6.fc38.x86_64
                Module libostree-1.so.1 from rpm ostree-2023.1-2.fc38.x86_64
                Module libz.so.1 from rpm zlib-1.2.13-3.fc38.x86_64
                Module libcrypto.so.3 from rpm openssl-3.0.8-2.fc38.x86_64
                Module libssl.so.3 from rpm openssl-3.0.8-2.fc38.x86_64
                Module libmodulemd.so.2 from rpm libmodulemd-2.14.0-5.fc38.x86_64
                Module libsolv.so.1 from rpm libsolv-0.7.22-4.fc38.x86_64
                Module libsolvext.so.1 from rpm libsolv-0.7.22-4.fc38.x86_64
                Module librepo.so.0 from rpm librepo-1.15.1-2.fc38.x86_64
                Module libjson-c.so.5 from rpm json-c-0.16-4.fc38.x86_64
                Module libsystemd.so.0 from rpm systemd-253.2-1.fc38.x86_64
                Module librpmio.so.9 from rpm rpm-4.18.1-1.fc38.x86_64
                Module librpm.so.9 from rpm rpm-4.18.1-1.fc38.x86_64
                Module libglib-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libgobject-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libgio-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libpolkit-gobject-1.so.0 from rpm polkit-122-3.fc38.1.x86_64
                Module libarchive.so.13 from rpm libarchive-3.6.1-4.fc38.x86_64
                Module libjson-glib-1.0.so.0 from rpm json-glib-1.6.6-4.fc38.x86_64
                Module libcap.so.2 from rpm libcap-2.48-6.fc38.x86_64
                Stack trace of thread 3880:
                #0  0x00007fc118468b94 __pthread_kill_implementation (libc.so.6 + 0x8eb94)
                #1  0x00007fc118417aee raise (libc.so.6 + 0x3daee)
                #2  0x00007fc11840087f abort (libc.so.6 + 0x2687f)
                #3  0x0000565195440917 _ZN11panic_abort18__rust_start_panic5abort17h3c900776fa44b50cE (rpm-ostree + 0x99e917)
                #4  0x00007fc104029720 n/a (n/a + 0x0)
                ELF object binary architecture: AMD x86-64

@ghost
Copy link

ghost commented May 7, 2023

I also have this time to time no idea where it come, at first i've thinked it would come from some source blocked, but i never have checked more since i've seen some other had also the problem.

I say i have it to see any potential fix, and let know that one more user have the problem.

@garrett
Copy link

garrett commented May 8, 2023

FWIW, this happens on every one of my systems for the past few months; I thought it would be fixed quickly (as these things usually are), and I've tried all the workarounds listed here (clearing cache and such), but it still happens. I have to try to upgrade 2 to 5 times before it finishes with "no changes" (even though there are changes according to rpm-ostree status and reboots); it's likely because the last time, when it's successful, has no changes to the previous time, when it failed with the error.

I do have overrides on my machines, and I do have third party repos enabled (OBS for darktable on my personal desktop and a copr for a patched mutter on both personal and work) — but this was even happening on my work machine without any additional repos (before I added the patched triple-buffered mutter). However, I do have a local package on my work laptop (for VPN certificates) and layered Fedora packages on both.

@garrett
Copy link

garrett commented May 8, 2023

I've asked my coworkers who also use Silveblue and they haven't seen the issue. They don't layer packages, however.

So my guess is that it's really related to layering packages.

@ghost
Copy link

ghost commented May 8, 2023

I've asked my coworkers who also use Silveblue and they haven't seen the issue. They don't layer packages, however.

So my guess is that it's really related to layering packages.

The worst it's this bug wasn't there on 37

@cgwalters cgwalters added the difficulty/medium medium complexity/difficutly issue label May 8, 2023
@cgwalters
Copy link
Member

@garrett Can you try grabbing some of the debug info using env G_MESSAGES_DEBUG=rpm-ostree,rpm-ostreed RUST_LOG=rpm_ostree=debug,rpmostree_rust=debug rpm-ostree upgrade or so from #4284 (comment) ?

It'd be particularly helpful to use rpm-ostree-2023.3 as a base for updating.

I've tried briefly again today to reproduce this, and took another tour through the code trying to retrofit these symptoms to code paths; but no luck so far. If we don't succeed soon I'll try just spamming debug log prints around the relevant code paths I can think of and have someone who is hitting this update to that.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 8, 2023
This is actually quite important for general "understanding of
the system" purposes. We can close this FIXME without needing
to print the transaction type.

Motivated specifically by a theory that this may relate
to coreos#4284
if somehow the client gets confused as to which transaction it's
monitoring.
@garrett
Copy link

garrett commented May 9, 2023

@cgwalters: Sure!

rpm-ostree:
 Version: '2023.3'
 Git: 8ab6f143a0ecad8b125b47dee8bbeb2e99f1b215
 Features:
  - rust
  - compose
  - container
  - fedora-integration

On my work laptop, before I realized I should modify the service too:

rpm-ostree--rebase--debug.txt

On my personal workstation, where I did modify the service before:

rpm-ostree--debug--personal.txt

The second of these is probably more useful, as it should have more debugging info:

@cgwalters
Copy link
Member

In

Writing rpmdb...done
(rpm-ostree upgrade:37328): rpm-ostreed-DEBUG: 11:37:59.031: txn progress TaskBegin
DEBUG Creating progress "Writing OSTree commit" with type Task
⠒                                                                               (rpm-ostree upgrade:37328): rpm-ostreed-DEBUG: 11:38:09.273: txn progress Message
Enabled rpm-md repositories: fedora-modular updates-modular updates fedora graphics_darktable graphics_darktable_master copr:copr.fedorainfracloud.org:calcastor:gnome-patched updates-archive
(rpm-ostree upgrade:37328): rpm-ostreed-DEBUG: 11:38:09.377: txn progress PercentProgress
thread 'tokio-runtime-worker' panicked at 'expected non-task progress type; current message is Writing OSTree commit', rust/src/console_progress.rs:128:13

It's really strange to me that we're seeing the "Enabled rpm-md repositories" message at that point...that seems like a leftover bit that should have been seen earlier.

I'm starting to build up a theory that this is somehow related to us iterating a new main context in the commit path, but we have leftover queued idle work on the mainloop or so? But not yet clear to me what that is...

@garrett Can you also attach the daemon-side logs from that run? i.e. journalctl -u rpm-ostreed around that time.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 9, 2023
This is actually quite important for general "understanding of
the system" purposes.

Motivated specifically by a theory that this may relate
to coreos#4284
if somehow the client gets confused as to which transaction it's
monitoring.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 9, 2023
- Allocate a unique serial per Progress instance
- Output the serial alongside the text, which is really a unique
  string format that will tell us where in the code it's being
  called
- Add `g_debug()` output with both in the constructors, and
  just the serial in other methods

This should help us figure out on the daemon side which specific
code is trying to do a percent progress while we have a plain
task.

cc coreos#4284
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 10, 2023
This is actually quite important for general "understanding of
the system" purposes.

Motivated specifically by a theory that this may relate
to coreos#4284
if somehow the client gets confused as to which transaction it's
monitoring.
@garrett
Copy link

garrett commented May 11, 2023

@cgwalters: Apologies for the delay! Here are the logs from ostreed on May 09 from my personal machine:

ostreed-may09.txt

I think the relevant entries should be around 11:37 - 11:38.

@cgwalters
Copy link
Member

All of the repeated "loading fedora" messages looked very odd to me, and digging in this is coming from the "sysroot changed' signal. We don't have even debug logging for invocation of these messages (will add) but...that gave me a clue.

I think I'm starting to understand this bug. Basically what I think is happening right now is that there are two processes talking to the daemon (gnome-software and /usr/bin/rpm-ostree).

While the human has invoked e.g. /usr/bin/rpm-ostree upgrade, at the same time gnome-software is calling one of the other os methods that are not transactions (and hence not serialized). I'm not sure which yet; maybe e.g. get_deployments_rpm_diff?

But basically at some point, one of these other methods ends up calling something in the core that tries to refresh rpm-md, and that in turn will try to invoke the progress APIs, and that will cause this crash.

First I'm going to add a lot more useful debug logging to this path to verify.

I think the best fix here is to change our output infrastructure to only route through the dbus txn progress for the transaction thread.

@cgwalters
Copy link
Member

Ahhh yes I think the core bug here is everything using os_create_dnf_context_simple (which is the "not transactional inspection" stuff) calling rpmostree_context_download_metadata which will indeed enable the progress infrastructure. That'd completely explain the race here.

@cgwalters
Copy link
Member

Yep, this is really easy to reproduce now that I understand the bug; having layered packages makes it much much more likely (but I think it may be reproducible without).

Basically in one shell: while rpm-ostree upgrade >/dev/null; do :; done
And in another shell, use this handy helper to do rpmostree_busctl_call_os WhatProvides as 1 provided-testing-daemon. That really quickly causes the first client to crash.

So yes, will look at reworking the transaction progress stuff to be thread-local.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 11, 2023
This is needed because for transactions (which are always run in
a thread today) we want output to go to the transaction's DBus
progress.  But for other methods which are not transactions, we don't
have a channel for status reporting, so output needs to
continue to go to the journal.   If we mix these two things due to
concurrent method invocations, the client may get confused and crash.

Closes: coreos#4284
@cgwalters
Copy link
Member

OK, got a PR up in #4405

@juhp
Copy link

juhp commented May 15, 2023

I also hit this several times today in the office, but when I came home 'update' worked first time.

@cgwalters
Copy link
Member

Can one or more people try out
https://copr.fedorainfracloud.org/coprs/g/CoreOS/continuous/build/5919148/
which has the fix merged?

cgwalters added a commit that referenced this issue May 15, 2023
This is actually quite important for general "understanding of
the system" purposes.

Motivated specifically by a theory that this may relate
to #4284
if somehow the client gets confused as to which transaction it's
monitoring.
@YaLTeR
Copy link
Author

YaLTeR commented May 16, 2023

Well, it's just a single sample, but I just did rpm-ostree update together with flatpak update -y, which usually triggered the issue, and it went fine.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 18, 2023
Particularly to get the fix for coreos#4284
out.
cgwalters added a commit that referenced this issue May 18, 2023
Particularly to get the fix for #4284
out.
@cgwalters
Copy link
Member

This is now up as https://bodhi.fedoraproject.org/updates/FEDORA-2023-ce6c9dcd69

lukewarmtemp pushed a commit to lukewarmtemp/rpm-ostree that referenced this issue Jun 7, 2023
- Allocate a unique serial per Progress instance
- Output the serial alongside the text, which is really a unique
  string format that will tell us where in the code it's being
  called
- Add `g_debug()` output with both in the constructors, and
  just the serial in other methods

This should help us figure out on the daemon side which specific
code is trying to do a percent progress while we have a plain
task.

cc coreos#4284
lukewarmtemp pushed a commit to lukewarmtemp/rpm-ostree that referenced this issue Jun 7, 2023
This is needed because for transactions (which are always run in
a thread today) we want output to go to the transaction's DBus
progress.  But for other methods which are not transactions, we don't
have a channel for status reporting, so output needs to
continue to go to the journal.   If we mix these two things due to
concurrent method invocations, the client may get confused and crash.

Closes: coreos#4284
lukewarmtemp pushed a commit to lukewarmtemp/rpm-ostree that referenced this issue Jun 7, 2023
This is actually quite important for general "understanding of
the system" purposes.

Motivated specifically by a theory that this may relate
to coreos#4284
if somehow the client gets confused as to which transaction it's
monitoring.
lukewarmtemp pushed a commit to lukewarmtemp/rpm-ostree that referenced this issue Jun 7, 2023
Particularly to get the fix for coreos#4284
out.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty/medium medium complexity/difficutly issue priority/high
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants