Turning off graphics card hangs indefinitely #105

Closed
OctarineSourcerer opened this issue Apr 3, 2019 · 21 comments

@OctarineSourcerer

Starting nvidia-xrun seems to work fine, but when I exit from Xorg, it hangs on:

Removing Nvidia bus from the kernel
1

Opening htop on another tty shows that tee /sys/bus/pci/devices/0000:00:01.0/remove is still running. If I try to terminate it, whether with SIGTERM or SIGKILL, one core goes to 100% CPU usage and the process never seems to exit.

I think similar symptoms have been talked about by @michelesr as part of issue #94

@michelesr
Contributor

@OctarineSourcerer Can you reproduce this issue if you use the nvidia-xrun script from my fork?

@OctarineSourcerer
Author

Nope, using your fork it seems to stop just fine - after those lines it shows

Enabling powersave for the PCIe controller
auto

@michelesr
Contributor

michelesr commented Apr 3, 2019 via email

@OctarineSourcerer
Author

OctarineSourcerer commented Apr 3, 2019

That makes sense - give me a couple minutes, and I'll test out the powertop results

EDIT: @michelesr Power drain in powertop went down from ~18W to ~10-12W, and I also couldn't see it anymore on powertop's "Device stats" tab. I'll second that the power management in your fork seems to work well, without the issues here.

@michelesr
Contributor

michelesr commented Apr 3, 2019

Good. Just to clarify: Linux PM should automatically put the bus (and the device) into power saving mode when it's not in use, as long as the PM control value is set to auto for both the bus and the card. However, the device needs to be removed from the tree because some programs (like GNOME Shell, or Xorg when using the modesetting driver) cause the nvidia module to be loaded, and the driver then keeps the card active so it never enters power saving mode.

Removing the device from the tree is a (dirty) workaround that prevents those programs from loading the nvidia kernel module, so that Linux can put the bus into power saving mode.
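
The two sysfs writes described in this comment can be sketched roughly as follows. This is a minimal illustration, not the fork's actual script: the function names and the SYSFS_ROOT override (which exists only so the logic can be tried against a scratch directory) are assumptions, and the PCI addresses in the usage comment are the ones quoted in this thread, which will differ between machines.

```shell
#!/bin/sh
# Sketch only: the function names and SYSFS_ROOT are illustrative
# assumptions, not part of nvidia-xrun.

# Root of sysfs; overridable so the functions can be tried on a scratch tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Allow runtime power management for a PCI device or bridge by writing
# "auto" to its power/control attribute.
pci_enable_runtime_pm() {
    printf 'auto\n' > "$SYSFS_ROOT/bus/pci/devices/$1/power/control"
}

# Detach a PCI device from the kernel's device tree by writing "1" to its
# remove attribute. The device stays gone until a PCI rescan or a reboot.
pci_remove_device() {
    printf '1\n' > "$SYSFS_ROOT/bus/pci/devices/$1/remove"
}

# Example (needs root on a real system; addresses vary per machine):
#   pci_remove_device     0000:01:00.0   # the card
#   pci_enable_runtime_pm 0000:00:01.0   # the bridge it sits behind
```

Keeping the address as a parameter avoids hard-coding which entry is the card and which is the controller.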

@OctarineSourcerer
Author

I see, seems like a sensible workaround, and for now I'm using your fork.
I think your fork also uses /sys/bus/pci/devices/0000:01:00.0/remove to remove the device from the tree. Any idea why the same operation hangs in vanilla nvidia-xrun?

@michelesr
Contributor

michelesr commented Apr 3, 2019

That's probably because the BUS_ID environment variable is set to 0000:00:01.0, which is not the device but the PCI controller. Removing the PCI controller doesn't seem safe, as it might be in use by the kernel, which hangs when trying to remove it.

My script instead disables the card (0000:01:00.0) and then sets the power control of the bus (0000:00:01.0) to auto, telling the kernel that it can be put into power saving mode when it's not in use (this is equivalent to toggling Bad to Good in the powertop tunables for the PCI Express bus).

EDIT: actually, the hang has a different cause; see my next comment.
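
Whatever the cause of the hang, the card and the controller it sits behind can be told apart through sysfs. The sketch below is one possible way to look up the parent of a PCI device; pci_parent_of and SYSFS_ROOT are hypothetical names introduced here, and it assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/sh
# Sketch only: pci_parent_of is a hypothetical helper, not part of nvidia-xrun.

# Root of sysfs; overridable so the helper can be tried on a scratch tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the directory a PCI device is nested in.
# Entries in bus/pci/devices are symlinks into the device hierarchy, where a
# device's directory lives inside its parent bridge's directory. For a device
# directly on the root bus this prints the root complex name (e.g. pci0000:00)
# rather than a bridge address.
pci_parent_of() {
    [ -e "$SYSFS_ROOT/bus/pci/devices/$1" ] || return 1
    real_path=$(readlink -f "$SYSFS_ROOT/bus/pci/devices/$1") || return 1
    basename "$(dirname "$real_path")"
}

# Example:
#   pci_parent_of 0000:01:00.0
```

Comparing the result with the value of BUS_ID is a quick way to check whether a script is about to remove the card or the bridge.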

@ghost

ghost commented Apr 11, 2019

Can reproduce. The @michelesr fork works.

@michelesr
Contributor

michelesr commented Apr 11, 2019

The reason for the hang is that the kernel modules aren't really unloaded, as specified in #95 and #101.

If you unload the modules properly (via rmmod or modprobe -r), you can issue the remove action on the PCI controller without making the kernel hang, but that (at least on my system) seems to put the controller into a high power usage state. Instead, setting the controller's power control to auto enables proper power saving, while removing the card from the kernel prevents further loading of the nvidia kernel module.

If you don't use the nvidia card at all, you can simply uninstall the proprietary nvidia driver and use powertop to set the bus power control to auto for proper power saving. But of course, if you're using nvidia-xrun you need the card, so you need the driver as well (plus the remove trick to avoid accidental loading and power drain), and that's exactly what my fork does.
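
Unloading the modules "properly" before touching the PCI device could look something like the sketch below. It is not the code from the fork or from nvidia-xrun: unload_nvidia_modules and UNLOAD_CMD are names made up for this illustration, and the module list assumes the usual four modules of the proprietary driver, unloaded dependents first.

```shell
#!/bin/sh
# Sketch only: unload_nvidia_modules and UNLOAD_CMD are illustrative names,
# not nvidia-xrun's implementation.

# Command used to unload one module; overridable for dry runs.
UNLOAD_CMD="${UNLOAD_CMD:-modprobe -r}"

# Unload the NVIDIA modules, dependents first, stopping at the first failure
# so the caller never goes on to remove the PCI device while a module is
# still loaded.
unload_nvidia_modules() {
    for module in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # UNLOAD_CMD is deliberately unquoted so "modprobe -r" splits into words.
        if ! $UNLOAD_CMD "$module"; then
            echo "failed to unload $module" >&2
            return 1
        fi
    done
}

# Example (as root):
#   unload_nvidia_modules && echo "modules unloaded"
```

Setting UNLOAD_CMD=echo gives a dry run that only prints the module names in the order they would be unloaded.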

@neworld

neworld commented Apr 20, 2019

I am using the nvidia-xrun-pm fork. According to dmesg, it seems the Nvidia module crashed because the device was still in use:

[85030.559331] nvidia-uvm: Unloaded the UVM driver in 8 mode
[85030.671758] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[85030.672015] WARNING: CPU: 11 PID: 5607 at /build/nvidia-lts/src/NVIDIA-Linux-x86_64-418.56/kernel/nvidia/nv.c:4296 nvidia_remove+0x33a/0x360 [nvidia]
[85030.672016] Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) tun nvidia(POE) ipmi_devintf ipmi_msghandler xt_nat xt_tcpudp veth ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc overlay ccm rfcomm fuse snd_hda_codec_hdmi cmac bnep snd_hda_codec_realtek snd_hda_codec_generic arc4 wacom hid_multitouch cdc_ether usbnet btusb btrtl uvcvideo btbcm btintel videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common videodev r8152 cdc_acm mii media ecdh_generic iTCO_wdt iTCO_vendor_support joydev mousedev mei_wdt nls_iso8859_1 nls_cp437 vfat fat dell_laptop dell_wmi dell_smbios wmi_bmof intel_wmi_thunderbolt mxm_wmi
[85030.672072]  dell_wmi_descriptor msr squashfs i915 loop ath10k_pci ath10k_core ath mac80211 intel_rapl kvmgt x86_pkg_temp_thermal intel_powerclamp snd_hda_intel coretemp vfio_mdev mdev snd_hda_codec vfio_iommu_type1 vfio i2c_algo_bit kvm_intel drm_kms_helper kvm snd_hda_core snd_hwdep irqbypass tpm_crb drm dcdbas cfg80211 snd_pcm intel_cstate snd_timer intel_gtt psmouse agpgart snd intel_uncore rtsx_pci_ms syscopyarea input_leds sysfillrect mei_me soundcore idma64 sysimgblt memstick intel_rapl_perf fb_sys_fops pcspkr i2c_i801 ucsi_acpi mei typec_ucsi rfkill typec intel_lpss_pci intel_lpss intel_pch_thermal processor_thermal_device intel_soc_dts_iosf i2c_hid int3403_thermal int340x_thermal_zone tpm_tis tpm_tis_core tpm dell_smo8800 wmi evdev rng_core battery mac_hid pcc_cpufreq int3400_thermal intel_hid
[85030.672125]  acpi_thermal_rel sparse_keymap ac crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto algif_skcipher af_alg hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid dm_crypt dm_mod rtsx_pci_sdmmc crct10dif_pclmul serio_raw mmc_core crc32_pclmul atkbd crc32c_intel libps2 ghash_clmulni_intel pcbc ahci libahci libata aesni_intel aes_x86_64 crypto_simd xhci_pci cryptd scsi_mod glue_helper xhci_hcd rtsx_pci i8042 serio [last unloaded: nvidia_uvm]
[85030.672160] CPU: 11 PID: 5607 Comm: tee Tainted: P        W  OE     4.19.34-1-lts #1
[85030.672162] Hardware name: Dell Inc. XPS 15 9570/07GHH0, BIOS 1.7.0 12/25/2018
[85030.672375] RIP: 0010:nvidia_remove+0x33a/0x360 [nvidia]
[85030.672378] Code: fe ff ff 48 89 de 48 89 ef e8 a2 59 6d 00 e9 92 fe ff ff 8b 93 78 04 00 00 48 c7 c6 e0 42 1c c2 bf 04 00 00 00 e8 06 8d 00 00 <0f> 0b e8 0f 93 00 00 eb f9 48 89 de 48 89 ef e8 72 96 6d 00 e9 32
[85030.672380] RSP: 0018:ffffb05690d17d28 EFLAGS: 00010246
[85030.672382] RAX: 0000000000000044 RBX: ffff9fb200158000 RCX: 0000000000000000
[85030.672383] RDX: ffff9fb25c4de680 RSI: ffff9fb25c4d6588 RDI: ffff9fb25c4d6588
[85030.672385] RBP: ffff9fad7a9e0000 R08: 0000000000000ef1 R09: 0000000000000001
[85030.672386] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9fb257b73000
[85030.672387] R13: ffffffffc2220210 R14: 0000000000000060 R15: ffff9fb01c200ce0
[85030.672389] FS:  00007fa34f8b4540(0000) GS:ffff9fb25c4c0000(0000) knlGS:0000000000000000
[85030.672390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85030.672392] CR2: 00005641b98df1b8 CR3: 0000000812ff0006 CR4: 00000000003606e0
[85030.672393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85030.672395] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85030.672396] Call Trace:
[85030.672406]  pci_device_remove+0x3b/0xc0
[85030.672411]  device_release_driver_internal+0x183/0x250
[85030.672415]  pci_stop_bus_device+0x69/0x90
[85030.672418]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[85030.672422]  remove_store+0x75/0x90
[85030.672425]  kernfs_fop_write+0x116/0x190
[85030.672429]  __vfs_write+0x36/0x1a0
[85030.672432]  ? __switch_to+0x143/0x490
[85030.672435]  vfs_write+0xa9/0x1a0
[85030.672437]  ksys_write+0x4f/0xb0
[85030.672441]  do_syscall_64+0x4e/0x100
[85030.672445]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[85030.672448] RIP: 0033:0x7fa34f7db7e8
[85030.672449] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 6d 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[85030.672451] RSP: 002b:00007ffec0b80338 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[85030.672453] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa34f7db7e8
[85030.672454] RDX: 0000000000000002 RSI: 00007ffec0b80460 RDI: 0000000000000003
[85030.672455] RBP: 00007ffec0b80460 R08: 00005641b98dd600 R09: 00007fa34f8b4540
[85030.672457] R10: 00000000000001b6 R11: 0000000000000246 R12: 00005641b98dd520
[85030.672459] R13: 0000000000000

This causes a full system hang; I am not even able to restart the PC:

[85954.939548] INFO: task systemd-tmpfile:21029 blocked for more than 120 seconds.
[85954.939549]       Tainted: P        W  OE     4.19.34-1-lts #1
[85954.939549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

@neworld

neworld commented Apr 20, 2019

I tried the enable and disable scripts from this wiki, and found that there are some differences between nvidia-xrun and that wiki.

@michelesr
Contributor

@OctarineSourcerer The code from my fork has been merged here. Can you please check whether it works for you, so that this issue can be closed?

@Witko added this to the Version 0.4.0 milestone May 6, 2019

@OctarineSourcerer
Author

@michelesr Starting nvidia-xrun and then exiting the session caused no hang, so I'm happy to close the issue. However, nvidia-xrun-pm.service is not yet included in the PKGBUILD of the AUR package, so it's not entirely at parity with your fork yet. I did see your comment mentioning this on the AUR, but thought I should mention it here as well so it's seen.

@wioo

wioo commented Jul 24, 2019

This issue is not resolved. If the nvidia_drm module can't be unloaded for whatever reason (and consequently all the modules fail to unload), the computer will go rogue and a reboot is required.

There should be some check after unload_modules to see whether the nvidia modules were unloaded successfully.
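
A check of the kind suggested here could be sketched as below. This is only an illustration, not code from nvidia-xrun: nvidia_modules_loaded and MODULES_FILE are hypothetical names, and the module names matched are the usual four modules of the proprietary driver.

```shell
#!/bin/sh
# Sketch only: nvidia_modules_loaded is a hypothetical helper.

# List of loaded modules; overridable so the check can be tried on a sample file.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# Succeeds (exit status 0) if nvidia, nvidia_drm, nvidia_modeset or nvidia_uvm
# is still listed, and prints the names of the modules that were found.
nvidia_modules_loaded() {
    awk '$1 ~ /^nvidia(_drm|_modeset|_uvm)?$/ { print $1; found = 1 }
         END { exit !found }' "$MODULES_FILE"
}

# Example: refuse to remove the card while the driver is still present.
#   if nvidia_modules_loaded >/dev/null; then
#       echo "nvidia modules still loaded, not removing the device" >&2
#       exit 1
#   fi
```

Bailing out at this point leaves the card powered, but avoids writing to the remove attribute while the driver still holds the device.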

@rihardsk

@wioo have you tried loading nvidia_drm with modeset=0? I had the exact same issue and that solved it. See also this comment on issue #117 which seems very much related.
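
For anyone wanting to try this suggestion, the sketch below shows one way to inspect the current setting and, in comments, the usual ways of passing the parameter. nvidia_drm_modeset and SYSFS_ROOT are hypothetical names used only for this illustration, and reading the parameter file may require root.

```shell
#!/bin/sh
# Sketch only: nvidia_drm_modeset is a hypothetical helper.

# Root of sysfs; overridable so the helper can be tried on a scratch tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the current value of nvidia_drm's modeset parameter ("Y" or "N").
# Fails if the module is not loaded or the file is not readable.
nvidia_drm_modeset() {
    cat "$SYSFS_ROOT/module/nvidia_drm/parameters/modeset" 2>/dev/null
}

# To load the module with modesetting off for one session (as root):
#   modprobe nvidia_drm modeset=0
# To make the setting persistent, the usual mechanism is a line such as
#   options nvidia_drm modeset=0
# in a file under /etc/modprobe.d/.
```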

@wioo

wioo commented Jul 25, 2019

Thanks for the reply, @rihardsk.
Sure, I can load it with modeset=0, but then PRIME Synchronization will not work.

@mj-saunders

Any idea of the state of this issue upstream?
I'm still having the same problem.

tee pegs a CPU when leaving X.

Was thinking to try @michelesr's fork, but noticed it's been a few years since an update...

@michelesr
Contributor

> Was thinking to try @michelesr's fork, but noticed it's been a few years since an update...

Probably not a good idea, my fork was merged upstream a long time ago, so there's no benefit in using that and it's probably outdated

@mj-saunders

> > Was thinking to try @michelesr's fork, but noticed it's been a few years since an update...
>
> Probably not a good idea, my fork was merged upstream a long time ago, so there's no benefit in using that and it's probably outdated

Noted, thank you kindly. Seems I was a bit more behind in my updates than I thought.

Now to fix a nvidia-xrun crash instead :) but that's another story.

@michelesr
Contributor

> Now to fix a nvidia-xrun crash instead :) but that's another story.

Not sure if it's a viable solution for you, but you might want to try PRIME offload instead

@mj-saunders

> > Now to fix a nvidia-xrun crash instead :) but that's another story.
>
> Not sure if it's a viable solution for you, but you might want to try PRIME offload instead

Might try it out perhaps. Originally used bumblebee, but have been using nvidia-xrun for a couple of years or so now; quite like it for my use-case.
We'll see though, thank you again.

Anyway, getting off topic now I guess
