SP3 - i2c_designware devices spam interrupts, reducing performance and battery life #101

jkatzmewing · 2021-06-04T15:35:18Z

(Currently Xubuntu 21.04, with latest linux-surface kernel)

My Surface Pro 3 performs really badly out of the box with any Linux distro, with lots of micro-stutter, applications taking a long time to start up, lots of heat and short battery life... Looking at powertop, I found that interrupts for the device INT33C2:00 were eating a huge amount of power, often 4 or 5 watts.

Blacklisting i2c_hid helped performance and got rid of the INT33C2 interrupts, but performance was still not great. Trying the linux-surface kernel instead of the Ubuntu generic or lowlatency ones, things were even worse - when i2c_hid wasn't blacklisted, INT33C2 interrupts ate up 16 watts, keeping the SP3's fans constantly on blast.

Looking at dmesg I noted that INT33C2 was actually powered by the (built in) i2c_designware driver, and disabled that as detailed here:

https://unix.stackexchange.com/questions/423797/how-do-i-disable-i2c-designware-support-when-its-not-built-as-a-module

On reboot, the touchscreen did not work (as expected, and as with blacklisting i2c_hid), but performance was much better, and estimated battery life increased by almost an hour.

qzed · 2021-06-05T19:12:54Z

Is it possible that the touchscreen is misbehaving and causing those interrupts? This kinda looks like a device misbehaving to me, I'm not entirely sure how to debug this.

jkatzmewing · 2021-06-05T19:29:21Z

Yes, it could definitely be. I'll try some further investigation and see if I can narrow this down a bit.

RussH · 2021-06-14T11:00:46Z

I can confirm this is also occurring on my SP3 - looks like the ambient light sensor INT33C2:00 is at fault.

Top 10 Power Consumers

| Usage | Events/s | Category | Description | PW Estimate |
|---|---|---|---|---|
| 0.2% | 168.0 | kWork | dbs_work_handler | 667 mW |
| 0.2% | 149.3 | Timer | tick_sched_timer | 593 mW |
| 0.2% | 104.5 | Interrupt | [7] INT33C2:00 | 416 mW |
| 0.3% | 77.1 | Process | [PID 6167] /opt/brave.com/brave/brave --high-dpi-support=1 --force-device-scale-factor=1.6 | 348 mW |
| 0.8% | 28.2 | Process | [PID 12463] baloo_file_extr | 133 mW |
| 0.3% | 24.4 | Timer | hrtimer_wakeup | 101 mW |
| 0.1% | 24.2 | Process | [PID 12465] QDBusConnection | 97.4 mW |
| 0.0% | 23.8 | Interrupt | [4] block(softirq) | 94.9 mW |
| 0.6% | 20.4 | Process | [PID 1979] /usr/bin/latte-dock -session 101751bb1ba17d000162324977800000017790013_1623406482_240294 | 89.4 mW |
| 0.3% | 20.4 | kWork | mwifiex_main_work_queue | 84.8 mW |

qzed · 2021-06-14T12:02:47Z

INT33C2 is an I2C controller that's used for multiple I2C clients. According to the DSDT, those seem to be the following (you can check that in `/sys/bus/i2c/devices/...` as well):

- MSHW0028 (VGBI): volume and power buttons (?)
- INT33CA (ACD0): Intel SPB Peripheral (something related to audio)
- INT33C9 (ACD1): Wolfson Microelectronics Audio WM5102 (something else related to audio)
- INT33CB (ACD2): Intel Smart Sound Technology Audio Codec (yet another audio thing)
- INT33D1 (SHUB): Intel GPIO Buttons (more buttons?)
- INT33D7 (DFUD): no clue
- MSFT1111 (TPD4): some HID-over-I2C device (touchscreen maybe?)
- MSHW0030 (SAM): SAM v1 as HID-over-I2C device

So it's either the controller that's at fault or some of those client devices constantly want to talk to that controller and don't shut up (which is why I suspected the touchscreen).

jkatzmewing · 2021-06-29T15:23:16Z

So, maybe interesting update - unlike with Ubuntu, Fedora 34's stock kernels seem much less affected by this. INT33C2 interrupts are still numerous, but take much less CPU time; the tablet runs cooler and the fans do not run on full blast when the touchscreen is enabled. Also less mouse lag, and wakeups/second in powertop stays under 1000 during normal desktop use.

Unfortunately, the same can't be said for the linux-surface kernel for Fedora, which shows the same issue as both the linux-surface kernel and the stock kernel on Ubuntu: the mouse lags visibly, wakeups/second are high, and INT33C2 consistently uses more like 40 ms/s instead of 2.5.

So I'm guessing this is down to some kernel config option(s). Not sure what, though.
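One way to narrow down a config difference is to diff only the relevant options of the two kernel configs. The following is a minimal sketch, not a verified procedure: `kconfig_diff` is a hypothetical helper name, the default pattern (`I2C|HID|DESIGNWARE`) is just a guess at which options matter, and it assumes both kernels ship their config as a plain-text file (commonly `/boot/config-<release>`).

```shell
#!/bin/sh
# Compare two kernel config files, restricted to options whose name matches
# a pattern. Lines such as "# CONFIG_FOO is not set" are kept on purpose,
# because an option that is disabled in one kernel is a difference too.
kconfig_diff() {
    pattern=${3:-'I2C|HID|DESIGNWARE'}
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    grep -E "CONFIG_[A-Z0-9_]*($pattern)" "$1" | sort > "$tmp_a"
    grep -E "CONFIG_[A-Z0-9_]*($pattern)" "$2" | sort > "$tmp_b"
    # diff exits with 1 when the inputs differ; pass that on to the caller.
    status=0
    diff "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

Usage would be something like `kconfig_diff /boot/config-<stock-release> /boot/config-<surface-release>`, where lines starting with `<` come from the first file and lines starting with `>` from the second.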

Edit: also to be clear we're still not talking "completely unaffected". Powertop still shows INT33C2 interrupts spiking at times, usually when launching Electron applications - sometimes spiking up to 8000/s or so, vs. 180/s or so normally.
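For comparing kernels without relying on powertop's estimates, the interrupt rate can also be read directly from `/proc/interrupts`. This is a rough sketch under a few assumptions: `irq_total` and `irq_rate` are hypothetical helper names, `INTERRUPTS_FILE` is an illustrative override that is not part of any tool, and the line is matched by the device name as it appears in the last column (e.g. `INT33C2`). If the IRQ line is shared, the count covers every device on that line.

```shell
#!/bin/sh
# Sum the per-CPU counters of every /proc/interrupts line that mentions NAME.
# INTERRUPTS_FILE is a hypothetical override for trying this on a saved copy.
irq_total() {
    awk -v name="$1" '
        index($0, name) {
            # Field 1 is the IRQ number; the per-CPU counts follow until the
            # first non-numeric field (the interrupt controller name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
        }
        END { printf "%.0f\n", sum }
    ' "${INTERRUPTS_FILE:-/proc/interrupts}"
}

# Print the average number of interrupts per second for NAME, measured
# over SECONDS (default 5).
irq_rate() {
    name=$1
    secs=${2:-5}
    before=$(irq_total "$name")
    sleep "$secs"
    after=$(irq_total "$name")
    echo $(( (after - before) / secs ))
}
```

Running `irq_rate INT33C2 10` once on each kernel, under a similar workload, should give numbers that can be compared directly.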

qzed · 2021-06-30T21:10:05Z

Interesting, most config options should be the same as on Fedora. Can you try unbinding the drivers for the devices I mentioned above (specifically HID ones since you mentioned that blacklisting i2c_hid influences the behavior) and check if that makes a difference?

That should work e.g. via `echo <device-name> | sudo tee /sys/bus/i2c/devices/<device-name>/driver/unbind`, where `<device-name>` is the name of the device in `/sys/bus/i2c/devices/`. You might need to read the HID of the device to match it to the table above (if the name doesn't give that away), which you can do via `cat /sys/bus/i2c/devices/<device-name>/firmware_node/hid`.
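The two steps above (matching a HID to a device name, then unbinding) can be sketched as a pair of shell functions. This is an illustration only: `i2c_find_by_hid` and `i2c_unbind` are hypothetical names, and `I2C_DEVICES` is an assumed override that exists purely so the functions can be tried against a copy of the tree. The sysfs paths are the ones given above.

```shell
#!/bin/sh
# I2C_DEVICES is a hypothetical override; by default the real sysfs
# directory is used.

# Print the name of every I2C device whose ACPI hardware ID equals HID.
i2c_find_by_hid() {
    for dev in "${I2C_DEVICES:-/sys/bus/i2c/devices}"/*; do
        hid_file=$dev/firmware_node/hid
        [ -r "$hid_file" ] || continue
        [ "$(cat "$hid_file")" = "$1" ] && basename "$dev"
    done
    return 0
}

# Detach the driver from device NAME. Fails without touching anything if
# the device has no driver/unbind file (i.e. no driver is bound to it).
i2c_unbind() {
    dev="${I2C_DEVICES:-/sys/bus/i2c/devices}/$1"
    if [ ! -e "$dev/driver/unbind" ]; then
        echo "$1: no driver/unbind file, nothing to unbind" >&2
        return 1
    fi
    echo "$1" | sudo tee "$dev/driver/unbind" > /dev/null
}
```

For example, `i2c_find_by_hid MSHW0030` prints the matching device name (if any), which can then be passed to `i2c_unbind`. The unbind does not persist across reboots.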

jkatzmewing · 2021-06-30T21:52:59Z

@qzed I'll give that a try this evening thanks!

jkatzmewing · 2021-06-30T22:10:57Z

@qzed

`/sys/bus/i2c/devices/<device-name>/firmware_node/hid` is LNXVIDEO for i2c-2 through i2c-7. i2c-8 and i2c-9 have no `hid` file. i2c-1 is INT33C3, and i2c-0 is INT33C2.

i2c-MSHW0028:00 doesn't have the necessary file for unbinding. Unbinding i2c-MSHW0030:00 definitely does not have any effect on the interrupts.

However, unbinding INT33C2 via /sys/bus/i2c/devices/i2c-0 gets rid of the interrupt spam and allows my touchscreen to work! So congrats, if nothing else you've at least helped me find a workaround. :)

qzed · 2021-06-30T22:35:50Z

Interesting, IIRC INT33C2 is an I2C controller. So this means that one controller constantly sends interrupts whereas the others work fine. It might be possible that a device connected to this specific controller causes the interrupts. If the controller has any client devices, they should be specified in the directory of the controller, e.g. something like `/sys/bus/i2c/devices/i2c-0/i2c-INT33BE:00` for an INT33BE client.

You could try unbinding drivers for those client devices individually next (if there are any). Keep in mind though that you first have to re-bind the controller driver or reboot (that's probably easier). After boot the directory name might be different, so you might have to search for the HID again (this should be unique according to the SP3 ACPI).
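To see which clients sit below a given controller after a reboot, the controller's directory can be listed together with each client's HID and bound driver. A minimal sketch, assuming clients show up as `i2c-<HID>:<n>` subdirectories as in the example above; `i2c_clients` is a hypothetical helper and `I2C_DEVICES` an assumed override for trying it on a copy of the tree.

```shell
#!/bin/sh
# List the client devices below one I2C adapter (for example i2c-0), with
# the ACPI HID of each client and the driver bound to it ("-" if none).
# I2C_DEVICES is a hypothetical override; the default is the real sysfs path.
i2c_clients() {
    adapter="${I2C_DEVICES:-/sys/bus/i2c/devices}/$1"
    # Only entries named like i2c-<HID>:<n> are clients; this skips other
    # subdirectories of the adapter such as i2c-dev.
    for client in "$adapter"/i2c-*:*; do
        [ -d "$client" ] || continue
        hid=-
        [ -r "$client/firmware_node/hid" ] && hid=$(cat "$client/firmware_node/hid")
        drv=-
        [ -e "$client/driver" ] && drv=$(basename "$(readlink -f "$client/driver")")
        printf '%s\t%s\t%s\n' "$(basename "$client")" "$hid" "$drv"
    done
}
```

`i2c_clients i2c-0` prints one tab-separated line per client. A `-` in the last column means no driver is bound, so there is nothing to unbind for that device.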

jkatzmewing · 2021-06-30T22:51:41Z

@qzed MSHW0028 and MSHW0030 were the only devices attached to that controller (both disappeared after it was unbound). The former couldn't be unbound, and the latter being unbound did nothing, so perhaps the controller itself is the issue? IDK why that particular controller and no other though.

qzed · 2021-06-30T23:06:58Z

> (both disappeared after it was unbound)

Yeah, that's the expected behavior. The client devices are essentially the children of the controller. So if that goes, the clients go as well.

The two devices are volume/power buttons and SAM. IIRC the volume/power button driver isn't actually an i2c driver, so makes sense that there's nothing to unbind (and that then shouldn't cause the issues, hopefully).

SAM is another thing though. That's the integrated EC. I think it might be possible that SAM tries to send something to the host or somehow messes with interrupts in other ways even when the HID-over-I2C driver normally attached to it has been unbound. If it's truly caused by the EC, I'm afraid that we very likely won't be able to fix it without a SAM-over-HID/SAM-gen4 driver.

The controller misbehaving might be another possibility, but I think that's less likely (although we probably can't be sure, no idea how to really test that). So I kinda think that SAM/the EC is at fault here.


qzed added the "D: Surface Pro 3" label on Apr 28, 2022
