Turning off graphics card hangs indefinitely #105

OctarineSourcerer · 2019-04-03T17:35:44Z

Starting nvidia-xrun seems to work fine, but when I exit from Xorg, it hangs on:

Removing Nvidia bus from the kernel
1

Opening htop on another tty shows that `tee /sys/bus/pci/devices/0000:00:01.0/remove` is still running. If I try to terminate it, whether with SIGTERM or SIGKILL, one core goes to 100% CPU usage and the process never seems to end.

I think similar symptoms have been talked about by @michelesr as part of issue #94

michelesr · 2019-04-03T17:49:15Z

@OctarineSorcerer can you reproduce this issue if you use the nvidia-xrun script from my fork?

OctarineSourcerer · 2019-04-03T18:44:54Z

Nope, using your fork it seems to stop just fine - after those lines it shows

Enabling powersave for the PCIe controller
auto

michelesr · 2019-04-03T18:50:28Z

That's the expected behavior, as `tee` will echo `auto` to the terminal in addition to writing it to the file. If everything worked fine you should see a decrease in the power drain in powertop after the bus PM is set to `auto`. As I suspected, there might be something wrong with the logic in the upstream script, but so far I haven't managed to get in touch with @Witko to discuss it.
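
For anyone wondering about the `tee` behavior, here is a minimal sketch. The helper name `write_and_echo` is made up for illustration; in the script the target would be the controller's `power/control` file under sysfs, which needs root, so an ordinary file is enough to try it out.

```shell
# write_and_echo VALUE FILE: write VALUE to FILE and also print it on
# stdout, which is what `echo VALUE | tee FILE` does.
write_and_echo() {
    printf '%s\n' "$1" | tee "$2"
}

# Example with a scratch file instead of a sysfs file:
# write_and_echo auto /tmp/control-demo
```

So the `auto` line on the terminal is just the echoed copy of what was written.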

OctarineSourcerer · 2019-04-03T18:57:09Z

That makes sense - give me a couple minutes, and I'll test out the powertop results

EDIT: @michelesr Power drain in powertop went down from ~18W to ~10-12W, and I also couldn't see it anymore on powertop's "Device stats" tab. I'll second that the power management in your fork seems to work well, without the issues here.

michelesr · 2019-04-03T19:32:23Z

Good, just to clarify: Linux PM should automatically put the bus (and the device) in power saving mode when it's not utilized, as long as the value of the PM control is set to auto for both (the bus and the card). However, the reason the device needs to be removed from the tree is that some programs (like GNOME Shell, or Xorg when using the modesetting driver) will cause the nvidia module to be loaded, and then the card will never go into power saving mode because the driver keeps it always active.

Removing the device from the tree is a (dirty) workaround to prevent those programs from loading the nvidia kernel module, so that Linux can put the bus in power saving mode.
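
A rough way to check the PM control value for the bus and the card is to read `power/control` under each device's sysfs directory. This is only a sketch: `pm_control` and its optional sysfs root argument are hypothetical names, and the IDs in the comments are the ones from this thread, which may differ on other machines.

```shell
# pm_control DEVICE_ID [SYSFS_ROOT]: print the runtime PM control value
# ("auto" or "on") of a PCI device, or "absent" if the device is not in
# the tree (for example after it has been removed).
pm_control() {
    dev_dir="${2:-/sys}/bus/pci/devices/$1"
    if [ -r "$dev_dir/power/control" ]; then
        cat "$dev_dir/power/control"
    else
        echo absent
    fi
}

# IDs taken from this thread; check yours with `lspci -D`.
# pm_control 0000:00:01.0   # PCIe controller (bus)
# pm_control 0000:01:00.0   # NVIDIA card
```

The second argument exists only so the function can be pointed at a scratch directory instead of the real `/sys`.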

OctarineSourcerer · 2019-04-03T19:38:54Z

I see, seems like a sensible workaround, and for now I'm using your fork.
I think your fork also uses /sys/bus/pci/devices/0000:01:00.0/remove to remove the device from the tree - Any idea why the same hangs in vanilla nvidia-xrun?

michelesr · 2019-04-03T19:49:03Z

That's probably because the BUS_ID environment variable is set to 0000:00:01.0, which is not the device but the PCI controller. Removing the PCI controller doesn't seem safe, as it might be in use by the kernel, which hangs when trying to remove it.

My script instead disables the card 0000:01:00.0 and then sets the power control of the bus 0000:00:01.0 to auto, telling the kernel that it can be put in power saving mode when it's not utilized (this is equivalent to toggling Bad into Good in the powertop tunables for the PCI Express bus).

EDIT: the actual reason for the hang is different; read my next comment.
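
The sequence described above could be sketched roughly like this. It is not the actual code of the fork: `turn_off_gpu`, `SYSFS_ROOT`, `CONTROLLER_ID` and `DEVICE_ID` are names invented for this example, and the IDs are the ones quoted in this thread.

```shell
# SYSFS_ROOT is a hypothetical override so the sketch can be tried on a
# scratch directory; on a real system it stays /sys and needs root.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
CONTROLLER_ID="${CONTROLLER_ID:-0000:00:01.0}"   # PCIe controller (bus)
DEVICE_ID="${DEVICE_ID:-0000:01:00.0}"           # NVIDIA card

turn_off_gpu() {
    pci="$SYSFS_ROOT/bus/pci/devices"
    # Remove the card (not the controller) from the device tree.
    if [ -e "$pci/$DEVICE_ID/remove" ]; then
        echo 1 > "$pci/$DEVICE_ID/remove"
    fi
    # Allow the kernel to put the controller in power saving mode when idle.
    echo auto > "$pci/$CONTROLLER_ID/power/control"
}
```

The point of the sketch is which ID each write targets: `remove` goes to the card, `power/control` goes to the controller.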

ghost · 2019-04-11T15:03:49Z

can reproduce. the @michelesr fork works.

michelesr · 2019-04-11T20:51:14Z

The reason for the hang is that the kernel modules aren't really unloaded, as specified in #95 and #101.

If you unload the modules properly (via rmmod or modprobe -r) then you can issue the remove action on the PCI controller without making the kernel hang, but that (at least on my system) seems to leave the controller in a high power usage state. Instead, setting the controller power control to auto will enable proper power saving, while removing the card from the kernel will prevent further loading of the nvidia kernel module.

If you don't use the nvidia card at all you can simply uninstall the nvidia proprietary driver and use powertop to set the bus power control to auto and have proper power saving, but of course if you're using nvidia-xrun it means you need the card, so you need the driver as well (and the remove trick to avoid accidental loading and power drain), and that's exactly what my fork is doing.
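
As an illustration of unloading the modules properly, a sketch could look like the following. The function name `unload_nvidia` and the variable `NVIDIA_MODULES` are hypothetical, and the list assumes the usual four modules of the proprietary driver; adjust it if your setup differs. Dependent modules go first and the core `nvidia` module last.

```shell
# Unload order: dependents first, the core nvidia module last.
NVIDIA_MODULES="nvidia_drm nvidia_modeset nvidia_uvm nvidia"

# unload_nvidia [CMD...]: run "CMD <module>" for each module, stopping at
# the first failure. CMD defaults to `modprobe -r` (needs root); pass
# `echo modprobe -r` for a dry run that only prints the commands.
unload_nvidia() {
    [ "$#" -gt 0 ] || set -- modprobe -r
    for mod in $NVIDIA_MODULES; do
        "$@" "$mod" || return 1
    done
}

# Dry run:
# unload_nvidia echo modprobe -r
```

Stopping at the first failure matters here, since going on to the remove action while a module is still loaded is what triggers the hang.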

neworld · 2019-04-20T08:56:39Z

I am using the nvidia-xrun-pm fork. According to dmesg, it seems the nvidia module crashed because the device was still in use:

[85030.559331] nvidia-uvm: Unloaded the UVM driver in 8 mode
[85030.671758] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[85030.672015] WARNING: CPU: 11 PID: 5607 at /build/nvidia-lts/src/NVIDIA-Linux-x86_64-418.56/kernel/nvidia/nv.c:4296 nvidia_remove+0x33a/0x360 [nvidia]
[85030.672016] Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) tun nvidia(POE) ipmi_devintf ipmi_msghandler xt_nat xt_tcpudp veth ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc overlay ccm rfcomm fuse snd_hda_codec_hdmi cmac bnep snd_hda_codec_realtek snd_hda_codec_generic arc4 wacom hid_multitouch cdc_ether usbnet btusb btrtl uvcvideo btbcm btintel videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev r8152 cdc_acm mii media ecdh_generic iTCO_wdt iTCO_vendor_support joydev mousedev mei_wdt nls_iso8859_1 nls_cp437 vfat fat dell_laptop dell_wmi dell_smbios wmi_bmof intel_wmi_thunderbolt mxm_wmi
[85030.672072]  dell_wmi_descriptor msr squashfs i915 loop ath10k_pci ath10k_core ath mac80211 intel_rapl kvmgt x86_pkg_temp_thermal intel_powerclamp snd_hda_intel coretemp vfio_mdev mdev snd_hda_codec vfio_iommu_type1 vfio i2c_algo_bit kvm_intel drm_kms_helper kvm snd_hda_core snd_hwdep irqbypass tpm_crb drm dcdbas cfg80211 snd_pcm intel_cstate snd_timer intel_gtt psmouse agpgart snd intel_uncore rtsx_pci_ms syscopyarea input_leds sysfillrect mei_me soundcore idma64 sysimgblt memstick intel_rapl_perf fb_sys_fops pcspkr i2c_i801 ucsi_acpi mei typec_ucsi rfkill typec intel_lpss_pci intel_lpss intel_pch_thermal processor_thermal_device intel_soc_dts_iosf i2c_hid int3403_thermal int340x_thermal_zone tpm_tis tpm_tis_core tpm dell_smo8800 wmi evdev rng_core battery mac_hid pcc_cpufreq int3400_thermal intel_hid
[85030.672125]  acpi_thermal_rel sparse_keymap ac crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid dm_crypt dm_mod rtsx_pci_sdmmc crct10dif_pclmul serio_raw mmc_core crc32_pclmul atkbd crc32c_intel libps2 ghash_clmulni_intel pcbc ahci libahci libata aesni_intel aes_x86_64 crypto_simd xhci_pci cryptd scsi_mod glue_helper xhci_hcd rtsx_pci i8042 serio [last unloaded: nvidia_uvm]
[85030.672160] CPU: 11 PID: 5607 Comm: tee Tainted: P        W  OE     4.19.34-1-lts #1
[85030.672162] Hardware name: Dell Inc. XPS 15 9570/07GHH0, BIOS 1.7.0 12/25/2018
[85030.672375] RIP: 0010:nvidia_remove+0x33a/0x360 [nvidia]
[85030.672378] Code: fe ff ff 48 89 de 48 89 ef e8 a2 59 6d 00 e9 92 fe ff ff 8b 93 78 04 00 00 48 c7 c6 e0 42 1c c2 bf 04 00 00 00 e8 06 8d 00 00 <0f> 0b e8 0f 93 00 00 eb f9 48 89 de 48 89 ef e8 72 96 6d 00 e9 32
[85030.672380] RSP: 0018:ffffb05690d17d28 EFLAGS: 00010246
[85030.672382] RAX: 0000000000000044 RBX: ffff9fb200158000 RCX: 0000000000000000
[85030.672383] RDX: ffff9fb25c4de680 RSI: ffff9fb25c4d6588 RDI: ffff9fb25c4d6588
[85030.672385] RBP: ffff9fad7a9e0000 R08: 0000000000000ef1 R09: 0000000000000001
[85030.672386] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9fb257b73000
[85030.672387] R13: ffffffffc2220210 R14: 0000000000000060 R15: ffff9fb01c200ce0
[85030.672389] FS:  00007fa34f8b4540(0000) GS:ffff9fb25c4c0000(0000) knlGS:0000000000000000
[85030.672390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85030.672392] CR2: 00005641b98df1b8 CR3: 0000000812ff0006 CR4: 00000000003606e0
[85030.672393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85030.672395] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85030.672396] Call Trace:
[85030.672406]  pci_device_remove+0x3b/0xc0
[85030.672411]  device_release_driver_internal+0x183/0x250
[85030.672415]  pci_stop_bus_device+0x69/0x90
[85030.672418]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[85030.672422]  remove_store+0x75/0x90
[85030.672425]  kernfs_fop_write+0x116/0x190
[85030.672429]  __vfs_write+0x36/0x1a0
[85030.672432]  ? __switch_to+0x143/0x490
[85030.672435]  vfs_write+0xa9/0x1a0
[85030.672437]  ksys_write+0x4f/0xb0
[85030.672441]  do_syscall_64+0x4e/0x100
[85030.672445]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[85030.672448] RIP: 0033:0x7fa34f7db7e8
[85030.672449] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 6d 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[85030.672451] RSP: 002b:00007ffec0b80338 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[85030.672453] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa34f7db7e8
[85030.672454] RDX: 0000000000000002 RSI: 00007ffec0b80460 RDI: 0000000000000003
[85030.672455] RBP: 00007ffec0b80460 R08: 00005641b98dd600 R09: 00007fa34f8b4540
[85030.672457] R10: 00000000000001b6 R11: 0000000000000246 R12: 00005641b98dd520
[85030.672459] R13: 0000000000000

This causes a full system hang; I am not even able to restart the PC:

[85954.939548] INFO: task systemd-tmpfile:21029 blocked for more than 120 seconds.
[85954.939549]       Tainted: P        W  OE     4.19.34-1-lts #1
[85954.939549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

neworld · 2019-04-20T09:54:14Z

I tried the enable and disable scripts from this wiki, and found that there are some differences between nvidia-xrun and that wiki.

michelesr · 2019-05-05T22:38:11Z

@OctarineSorcerer the code from my fork has been merged here, can you please check whether it works for you so that this issue can be closed?

OctarineSourcerer · 2019-05-06T21:48:38Z

@michelesr Starting nvidia-xrun and then exiting the session caused no hang, so I'm happy to close the issue. However, nvidia-xrun-pm.service is not yet included in the PKGBUILD of the AUR package, so it's not entirely at parity with yours. I did see your comment mentioning this in the AUR, but thought I should mention it somewhere here just so it's seen.

wioo · 2019-07-24T07:19:45Z

This issue is not resolved. If the nvidia_drm module can't be unloaded for whatever reason (and consequently all modules fail to unload), the computer will go rogue and a reboot is required.

There should be some check after unload_modules to see if nvidia modules unloaded successfully.
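
Such a check could be sketched as below, assuming it runs right after the script's unload step. The function name `nvidia_modules_loaded` is made up for this example, and it simply looks for module names starting with `nvidia` in `/proc/modules`, where the first field of every line is the module name.

```shell
# nvidia_modules_loaded [MODULES_FILE]: succeed (exit 0) if any module
# whose name starts with "nvidia" is still listed. MODULES_FILE defaults
# to /proc/modules.
nvidia_modules_loaded() {
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }' "${1:-/proc/modules}"
}

# Hypothetical use after the unload step:
# if nvidia_modules_loaded; then
#     echo "nvidia modules still loaded, not removing the device" >&2
#     exit 1
# fi
```

Bailing out before the remove step would at least leave the system usable when the unload fails.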

rihardsk · 2019-07-25T18:27:33Z

@wioo have you tried loading nvidia_drm with modeset=0? I had the exact same issue and that solved it. See also this comment on issue #117 which seems very much related.
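
To see which value a loaded nvidia_drm is currently using, the module parameter can be read from sysfs. This is a sketch under the assumption that the driver exposes `modeset` under `/sys/module/nvidia_drm/parameters/`, where it typically reads Y or N; `drm_modeset_state` and its optional root argument are hypothetical names.

```shell
# drm_modeset_state [SYSFS_ROOT]: print the modeset parameter of a loaded
# nvidia_drm module, or "not loaded" when the module is absent.
drm_modeset_state() {
    param="${1:-/sys}/module/nvidia_drm/parameters/modeset"
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "not loaded"
    fi
}

# One-off load with modesetting disabled (needs root):
#   modprobe nvidia_drm modeset=0
# Persistent setting, in a file under /etc/modprobe.d/:
#   options nvidia_drm modeset=0
```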

wioo · 2019-07-25T18:56:18Z

Thanks for the reply, @rihardsk.
Sure, I can load it with modeset=0, but then PRIME Synchronization will not work.

mj-saunders · 2022-06-12T04:03:28Z

Any idea what the state of this issue is upstream?
I'm still having the same problem.

tee pegs a cpu when leaving X.

Was thinking to try @michelesr's fork, but noticed it's been a few years since an update...

michelesr · 2022-06-12T09:37:22Z

> Was thinking to try @michelesr's fork, but noticed it's been a few years since an update...

Probably not a good idea: my fork was merged upstream a long time ago, so there's no benefit in using it, and it's probably outdated.

mj-saunders · 2022-06-12T14:55:03Z

> > Was thinking to try @michelesr's fork, but noticed it's been a few years since an update...
>
> Probably not a good idea, my fork was merged upstream a long time ago, so there's no benefit in using that and it's probably outdated

Noted, thank you kindly. Seems I was a bit more behind in my updates than I thought.

Now to fix a nvidia-xrun crash instead :) but that's another story.

michelesr · 2022-06-12T15:03:39Z

> Now to fix a nvidia-xrun crash instead :) but that's another story.

Not sure if it's a viable solution for you, but you might want to try PRIME offload instead

mj-saunders · 2022-06-12T17:59:00Z

> > Now to fix a nvidia-xrun crash instead :) but that's another story.
>
> Not sure if it's a viable solution for you, but you might want to try PRIME offload instead

Might try it out perhaps. Originally used bumblebee, but have been using nvidia-xrun for a couple of years or so now; quite like it for my use-case.
We'll see though, thank you again.

Anyway, getting off topic now I guess

michelesr mentioned this issue May 3, 2019

Merge nvidia-xrun-pm into nvidia-xrun #108

Merged

Witko added this to the Version 0.4.0 milestone May 6, 2019

OctarineSourcerer closed this as completed May 6, 2019
