Add ioctl to register OA configs at runtime #1

rib · 2015-12-02T18:04:35Z

Overview

The initial idea is to have a DRM_IOCTL_I915_PERF_ADD_CONFIG ioctl that would only be usable with root privileges, that can be given arrays of registers for a NOA/MUX config + Boolean logic config + possibly FLEX EU config (for Broadwell+) which would also be associated with a GUID for listing the config under /sys/class/drm/card0/metrics in the same way as built in metrics.

This will allow us to be able to work with experimental/unstable OA configurations or allow tools to distribute new configs without requiring developers to build a custom kernel.

Details

Add the ioctl

Referring to the patch that adds DRM_IOCTL_I915_PERF_OPEN that should provide an overview of how to add a new ioctl.

i915_perf.c:i915_perf_open_ioctl() is the start of the implementation for DRM_IOCTL_I915_PERF_OPEN and this will need another function with the same prototype something like i915_perf_add_config_ioctl()

Maybe the struct to pass the to ioctl will be something like:

{
    u8 uuid[16]; /* XXX: userspace should give a unique ID that will be used to advertise the config under /sys/class/drm/card0/metrics */

    u32 n_mux_regs;
    u64 mux_regs; /* XXX: note this is using a uint64 as a pointer to have a deterministic size, compare with the DRM_IOCTL_I915_PERF_OPEN ioctl doing a similar thing for properties and cast with to_user_ptr() macro */

    u32 n_boolean_regs;
    u64 boolean_regs;

    u32 n_flex_regs;
    u64 flex_regs;
}

mux_regs + boolean_regs would then probably be an array of (address, value) pairs (though probably just use a 1D array)

Validate user state

Early on the implementation should check that the user has root privileges (search for CAP_SYS_ADMIN in i915_perf.c to find an example)

To read the arrays of registers from userspace refer to read_properties_unlocked() as one example and look into get_user() as the kernel api for safely reading from the userspace.

The implementation will need to check for trying to add a config with a conflicting uuid. We'll have to think more careful about the best behaviour here (maybe we would trust userspace knows what it's doing and allow it to replace/override a config) but the simplest way to start will be to return an error like -EINVAL for a conflict.

Even though this interface will require root privileges, I think it will still be appropriate to white list what address ranges for these register configurations will be accepted, so that's one thing to validate early on in the implementation of this ioctl.

Track the new configs

So in the end we want to add a directory to /sys/class/drm/card0/metrics/ for a new config registered with this ioctl

After validating what userspace is giving the kernel, then we will need to store a record of this new config and that will probably go under dev_priv->perf

Take a look at struct drm_i915_private in i915-drv.h which contains a nested 'perf' structure.

I expect we can add a struct list_head dynamic_configs; here with a structure similar to the one passed to the ioctl, but also an embedded struct list_head link; member so that it can be linked into the dynamic_configs list.

For each config added to this list it will also need an ID.

Similar to how context IDs are allocated in i915-gem-context.c it could make sense to use the idr_alloc() utility api to allocate IDs. For that we will need to add a struct idr metrics_idr; member to dev_priv->perf too.

Currently it's the {hsw,bdw,chv,skl}_enable_metric_set() functions that take an ID from userspace to lookup the corresponding noa + boolean + flex registers that were pre-defined at build time. These functions will need to recognise a partitioning of the IDs userspace can give between the pre-defined IDs < I915_OA_METRIC_SET_MAX (ref: include/uapi/drm/i915_perf.h) and some dynamically allocated IDs for configs registered via this ioctl.

When initializing the struct idr, you'll want to make sure it doesn't allocate in the range of these pre-defined IDs.

Sysfs

Look at the sysfs related code that's e.g. generated in i915_oa_bdw.c to see how to add a new metrics/ directory.

One fiddly detail is going to be that unlike the autogenerated sysfs code the callbacks can't just use snprintf with a literal guid string or refer to the predefined IDs from include/uapi/drm/i915_drm.h.

E.g. considering the state to add the 3D config to sysfs:

static ssize_t
show_3d_id(struct device *kdev, struct device_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", I915_OA_METRICS_SET_3D);
}

static struct device_attribute dev_attr_3d_id = {
        .attr = { .name = "id", .mode = S_IRUGO },
        .show = show_3d_id,
        .store = NULL,
};

static struct attribute *attrs_3d[] = {
        &dev_attr_3d_id.attr,
        NULL,
};

static struct attribute_group group_3d = {
        .name = "b541bd57-0e0f-4154-b4c0-5858010a2bf7",
        .attrs =  attrs_3d,
};

And see i915_perf_init_sysfs_bdw() where a new group is added to the metrics kobject that represents the 'metrics' directory in sysfs where it calls:

ret = sysfs_create_group(dev_priv->perf.metrics_kobj, &group_3d);

I imagine that it would work to embed a struct attribute_group and struct device_attribute into the same structure that will be linked into the dev_priv->perf.dynamic_configs list.

Lets say we add a struct i915_oa_config something like:

struct i915_oa_config {

    struct list_head link;

    char guid[40]; /* enough to fit a name like "b541bd57-0e0f-4154-b4c0-5858010a2bf7" that can be pointed to by attribute_group::name */
    u64 id; /* allocated via idr_alloc() */

    const struct i915_oa_reg *mux_regs;
    int mux_regs_len;
    const struct i915_oa_reg *b_counter_regs;
    int b_counter_regs_len;
    const struct i915_oa_reg *flex_regs;
    int flex_regs_len;

    struct attribute_group sysfs_metric;
    struct device_attribute sysfs_metric_id;
};

Then the sysfs members would be initialized similar to how they are initialized in the autogenerated code and added to sysfs via sysfs_create_group()

For the device attribute callback (like show_3d_id()) since it's passed a pointer to the struct device_attribute which we know is embedded within a struct i915_oa_config we can use the container_of() macro to map the attr pointer to an i915_oa_config pointer where we'll find the corresponding ID that should be written to buf.

Footnote

In general to start with you'll pretty much be able to ignore worrying about locking, and there will only be some minor cause for locking in the end considering the possibility that another process might try and open a stream at the same time that you are registering a stream.

The text was updated successfully, but these errors were encountered:

When a passive TCP is created, we eventually call tcp_md5_do_add() with sk pointing to the child. It is not owner by the user yet (we will add this socket into listener accept queue a bit later anyway) But we do own the spinlock, so amend the lockdep annotation to avoid following splat : [ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage! [ 8451.090932] [ 8451.090932] other info that might help us debug this: [ 8451.090932] [ 8451.090934] [ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1 [ 8451.090936] 3 locks held by socket_sockopt_/214795: [ 8451.090936] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90 [ 8451.090947] rib#1: (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0 [ 8451.090952] rib#2: (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500 [ 8451.090958] [ 8451.090958] stack backtrace: [ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_ [ 8451.091215] Call Trace: [ 8451.091216] <IRQ> [<ffffffff856fb29c>] dump_stack+0x55/0x76 [ 8451.091229] [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110 [ 8451.091235] [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0 [ 8451.091239] [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0 [ 8451.091242] [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190 [ 8451.091246] [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500 [ 8451.091249] [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190 [ 8451.091253] [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0 [ 8451.091256] [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0 [ 8451.091260] [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0 [ 8451.091263] [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0 [ 8451.091267] [<ffffffff85618d38>] ip_local_deliver+0x48/0x80 [ 8451.091270] [<ffffffff85618510>] ip_rcv_finish+0x160/0x700 [ 8451.091273] [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0 [ 8451.091277] [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90 Fixes: a8afca0 ("tcp: md5: protects md5sig_info with RCU") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>

On some host errors storvsc module tries to remove sdev by scheduling a job which does the following: sdev = scsi_device_lookup(wrk->host, 0, 0, wrk->lun); if (sdev) { scsi_remove_device(sdev); scsi_device_put(sdev); } While this code seems correct the following crash is observed: general protection fault: 0000 [rib#1] SMP DEBUG_PAGEALLOC RIP: 0010:[<ffffffff81169979>] [<ffffffff81169979>] bdi_destroy+0x39/0x220 ... [<ffffffff814aecdc>] ? _raw_spin_unlock_irq+0x2c/0x40 [<ffffffff8127b7db>] blk_cleanup_queue+0x17b/0x270 [<ffffffffa00b54c4>] __scsi_remove_device+0x54/0xd0 [scsi_mod] [<ffffffffa00b556b>] scsi_remove_device+0x2b/0x40 [scsi_mod] [<ffffffffa00ec47d>] storvsc_remove_lun+0x3d/0x60 [hv_storvsc] [<ffffffff81080791>] process_one_work+0x1b1/0x530 ... The problem comes with the fact that many such jobs (for the same device) are being scheduled simultaneously. While scsi_remove_device() uses shost->scan_mutex and scsi_device_lookup() will fail for a device in SDEV_DEL state there is no protection against someone who did scsi_device_lookup() before we actually entered __scsi_remove_device(). So the whole scenario looks like that: two callers do simultaneous (or preemption happens) calls to scsi_device_lookup() ant these calls succeed for both of them, after that they try doing scsi_remove_device(). shost->scan_mutex only serializes their calls to __scsi_remove_device() and we end up doing the cleanup path twice. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>

Fix issue from two patches overlapping causing a kernel oops [ 3569.297449] Unable to handle kernel NULL pointer dereference at virtual address 00000088 [ 3569.306272] pgd = dc894000 [ 3569.309287] [00000088] *pgd=00000000 [ 3569.313104] Internal error: Oops: 5 [rib#1] SMP ARM [ 3569.317986] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_filter ebtable_nat ebtable_broute bridge stp llc ebtables ip6table_security ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_filter ip6_tables iptable_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle musb_dsps cppi41 musb_hdrc phy_am335x udc_core phy_generic phy_am335x_control omap_sham omap_aes omap_rng omap_hwspinlock omap_mailbox hwspinlock_core musb_am335x omap_wdt at24 8250_omap leds_gpio cpufreq_dt smsc davinci_mdio mmc_block ti_cpsw cpsw_common ptp pps_core cpsw_ale davinci_cpdma omap_hsmmc omap_dma mmc_core i2c_dev [ 3569.386293] CPU: 0 PID: 1429 Comm: wdctl Not tainted 4.3.0-0.rc7.git0.1.fc24.armv7hl rib#1 [ 3569.394740] Hardware name: Generic AM33XX (Flattened Device Tree) [ 3569.401179] task: dbd11a00 ti: dbaac000 task.ti: dbaac000 [ 3569.406917] PC is at omap_wdt_get_timeleft+0xc/0x20 [omap_wdt] [ 3569.413106] LR is at watchdog_ioctl+0x3cc/0x42c [ 3569.417902] pc : [<bf0ab138>] lr : [<c0739c54>] psr: 600f0013 [ 3569.417902] sp : dbaadf18 ip : 00000003 fp : 7f5d3bbe [ 3569.430014] r10: 00000000 r9 : 00000003 r8 : bef21ab8 [ 3569.435535] r7 : dbbc0f7c r6 : dbbc0f18 r5 : bef21ab8 r4 : 00000000 [ 3569.442427] r3 : 00000000 r2 : 00000000 r1 : 8004570a r0 : dbbc0f18 [ 3569.449323] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 3569.456858] Control: 10c5387d Table: 9c894019 DAC: 00000051 [ 3569.462927] Process wdctl (pid: 1429, stack limit = 0xdbaac220) [ 3569.469179] Stack: (0xdbaadf18 to 0xdbaae000) [ 3569.473790] df00: bef21ab8 dbf60e38 [ 3569.482441] df20: dc91b840 8004570a bef21ab8 c03988a4 dbaadf48 dc854000 00000000 dd313850 [ 3569.491092] df40: ddf033b8 0000570a dc91b80b dbaadf3c dbf60e38 00000020 c0df9250 c0df6c48 [ 3569.499741] df60: dc91b840 8004570a 00000000 dc91b840 dc91b840 8004570a bef21ab8 00000003 [ 3569.508389] df80: 00000000 c03989d4 bef21b74 7f5d3bad 00000003 00000036 c020fcc4 dbaac000 [ 3569.517037] dfa0: 00000000 c020fb00 bef21b74 7f5d3bad 00000003 8004570a bef21ab8 00000001 [ 3569.525685] dfc0: bef21b74 7f5d3bad 00000003 00000036 00000001 00000000 7f5e4eb0 7f5d3bbe [ 3569.534334] dfe0: 7f5e4f10 bef21a3c 7f5d0a54 b6e97e0c a00f0010 00000003 00000000 00000000 [ 3569.543038] [<bf0ab138>] (omap_wdt_get_timeleft [omap_wdt]) from [<c0739c54>] (watchdog_ioctl+0x3cc/0x42c) [ 3569.553266] [<c0739c54>] (watchdog_ioctl) from [<c03988a4>] (do_vfs_ioctl+0x5bc/0x698) [ 3569.561648] [<c03988a4>] (do_vfs_ioctl) from [<c03989d4>] (SyS_ioctl+0x54/0x7c) [ 3569.569400] [<c03989d4>] (SyS_ioctl) from [<c020fb00>] (ret_fast_syscall+0x0/0x3c) [ 3569.577413] Code: e12fff1e e52de004 e8bd4000 e5903060 (e5933088) [ 3569.584089] ---[ end trace cec3039bd3ae610a ]--- Cc: <[email protected]> # v4.2+ Signed-off-by: Peter Robinson <[email protected]> Acked-by: Lars Poeschel <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Signed-off-by: Wim Van Sebroeck <[email protected]>

When vrf's ->newlink is called, if register_netdevice() fails then it does free_netdev(), but that's also done by rtnl_newlink() so a second free happens and memory gets corrupted, to reproduce execute the following line a couple of times (1 - 5 usually is enough): $ for i in `seq 1 5`; do ip link add vrf: type vrf table 1; done; This works because we fail in register_netdevice() because of the wrong name "vrf:". And here's a trace of one crash: [ 28.792157] ------------[ cut here ]------------ [ 28.792407] kernel BUG at fs/namei.c:246! [ 28.792608] invalid opcode: 0000 [rib#1] SMP [ 28.793240] Modules linked in: vrf nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel qxl drm_kms_helper ttm drm aesni_intel aes_x86_64 psmouse glue_helper lrw evdev gf128mul i2c_piix4 ablk_helper cryptd ppdev parport_pc parport serio_raw pcspkr virtio_balloon virtio_console i2c_core acpi_cpufreq button 9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 virtio_blk virtio_net sg sr_mod cdrom ata_generic ehci_pci uhci_hcd ehci_hcd e1000 usbcore usb_common ata_piix libata virtio_pci virtio_ring virtio scsi_mod floppy [ 28.796016] CPU: 0 PID: 1148 Comm: ld-linux-x86-64 Not tainted 4.4.0-rc1+ rib#24 [ 28.796016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 [ 28.796016] task: ffff8800352561c0 ti: ffff88003592c000 task.ti: ffff88003592c000 [ 28.796016] RIP: 0010:[<ffffffff812187b3>] [<ffffffff812187b3>] putname+0x43/0x60 [ 28.796016] RSP: 0018:ffff88003592fe88 EFLAGS: 00010246 [ 28.796016] RAX: 0000000000000000 RBX: ffff8800352561c0 RCX: 0000000000000001 [ 28.796016] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003784f000 [ 28.796016] RBP: ffff88003592ff08 R08: 0000000000000001 R09: 0000000000000000 [ 28.796016] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [ 28.796016] R13: 000000000000047c R14: ffff88003784f000 R15: ffff8800358c4a00 [ 28.796016] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000 [ 28.796016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 28.796016] CR2: 00007ffd583bc2d9 CR3: 0000000035a99000 CR4: 00000000000406f0 [ 28.796016] Stack: [ 28.796016] ffffffff8121045d ffffffff812102d3 ffff8800352561c0 ffff880035a91660 [ 28.796016] ffff8800008a9880 0000000000000000 ffffffff81a49940 00ffffff81218684 [ 28.796016] ffff8800352561c0 000000000000047c 0000000000000000 ffff880035b36d80 [ 28.796016] Call Trace: [ 28.796016] [<ffffffff8121045d>] ? do_execveat_common.isra.34+0x74d/0x930 [ 28.796016] [<ffffffff812102d3>] ? do_execveat_common.isra.34+0x5c3/0x930 [ 28.796016] [<ffffffff8121066c>] do_execve+0x2c/0x30 [ 28.796016] [<ffffffff810939a0>] call_usermodehelper_exec_async+0xf0/0x140 [ 28.796016] [<ffffffff810938b0>] ? umh_complete+0x40/0x40 [ 28.796016] [<ffffffff815cb1af>] ret_from_fork+0x3f/0x70 [ 28.796016] Code: 48 8d 47 1c 48 89 e5 53 48 8b 37 48 89 fb 48 39 c6 74 1a 48 8b 3d 7e e9 8f 00 e8 49 fa fc ff 48 89 df e8 f1 01 fd ff 5b 5d f3 c3 <0f> 0b 48 89 fe 48 8b 3d 61 e9 8f 00 e8 2c fa fc ff 5b 5d eb e9 [ 28.796016] RIP [<ffffffff812187b3>] putname+0x43/0x60 [ 28.796016] RSP <ffff88003592fe88> Fixes: 193125d ("net: Introduce VRF device driver") Signed-off-by: Nikolay Aleksandrov <[email protected]> Acked-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>

If a user key gets negatively instantiated, an error code is cached in the payload area. A negatively instantiated key may be then be positively instantiated by updating it with valid data. However, the ->update key type method must be aware that the error code may be there. The following may be used to trigger the bug in the user key type: keyctl request2 user user "" @U keyctl add user user "a" @U which manifests itself as: BUG: unable to handle kernel paging request at 00000000ffffff8a IP: [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046 PGD 7cc30067 PUD 0 Oops: 0002 [rib#1] SMP Modules linked in: CPU: 3 PID: 2644 Comm: a.out Not tainted 4.3.0+ torvalds#49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88003ddea700 ti: ffff88003dd88000 task.ti: ffff88003dd88000 RIP: 0010:[<ffffffff810a376f>] [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046 RSP: 0018:ffff88003dd8bdb0 EFLAGS: 00010246 RAX: 00000000ffffff82 RBX: 0000000000000000 RCX: 0000000000000001 RDX: ffffffff81e3fe40 RSI: 0000000000000000 RDI: 00000000ffffff82 RBP: ffff88003dd8bde0 R08: ffff88007d2d2da0 R09: 0000000000000000 R10: 0000000000000000 R11: ffff88003e8073c0 R12: 00000000ffffff82 R13: ffff88003dd8be68 R14: ffff88007d027600 R15: ffff88003ddea700 FS: 0000000000b92880(0063) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000ffffff8a CR3: 000000007cc5f000 CR4: 00000000000006e0 Stack: ffff88003dd8bdf0 ffffffff81160a8a 0000000000000000 00000000ffffff82 ffff88003dd8be68 ffff88007d027600 ffff88003dd8bdf0 ffffffff810a39e5 ffff88003dd8be20 ffffffff812a31ab ffff88007d027600 ffff88007d027620 Call Trace: [<ffffffff810a39e5>] kfree_call_rcu+0x15/0x20 kernel/rcu/tree.c:3136 [<ffffffff812a31ab>] user_update+0x8b/0xb0 security/keys/user_defined.c:129 [< inline >] __key_update security/keys/key.c:730 [<ffffffff8129e5c1>] key_create_or_update+0x291/0x440 security/keys/key.c:908 [< inline >] SYSC_add_key security/keys/keyctl.c:125 [<ffffffff8129fc21>] SyS_add_key+0x101/0x1e0 security/keys/keyctl.c:60 [<ffffffff8185f617>] entry_SYSCALL_64_fastpath+0x12/0x6a arch/x86/entry/entry_64.S:185 Note the error code (-ENOKEY) in EDX. A similar bug can be tripped by: keyctl request2 trusted user "" @U keyctl add trusted user "a" @U This should also affect encrypted keys - but that has to be correctly parameterised or it will fail with EINVAL before getting to the bit that will crashes. Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: David Howells <[email protected]> Acked-by: Mimi Zohar <[email protected]> Signed-off-by: James Morris <[email protected]>

OMAP CPU hotplug uses cpu1's clocks and power domains for CPU1 wake up from low power states (or turn on CPU1). This part of code is also part of system suspend (disable_nonboot_cpus()). >From other side, cpu1's clocks and power domains are used by CPUIdle. All above functionality is mutually exclusive and, therefore, lockless clkdm/pwrdm api can be used in omap4_boot_secondary(). This fixes below back-trace on -RT which is triggered by pwrdm_lock/unlock(): BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917 in_atomic(): 1, irqs_disabled(): 0, pid: 118, name: sh 9 locks held by sh/118: #0: (sb_writers#4){.+.+.+}, at: [<c0144a6c>] vfs_write+0x13c/0x164 rib#1: (&of->mutex){+.+.+.}, at: [<c01b4c70>] kernfs_fop_write+0x48/0x19c rib#2: (s_active#24){.+.+.+}, at: [<c01b4c78>] kernfs_fop_write+0x50/0x19c rib#3: (device_hotplug_lock){+.+.+.}, at: [<c03cbff0>] lock_device_hotplug_sysfs+0xc/0x4c rib#4: (&dev->mutex){......}, at: [<c03cd284>] device_online+0x14/0x88 rib#5: (cpu_add_remove_lock){+.+.+.}, at: [<c003af90>] cpu_up+0x50/0x1a0 rib#6: (cpu_hotplug.lock){++++++}, at: [<c003ae48>] cpu_hotplug_begin+0x0/0xc4 rib#7: (cpu_hotplug.lock#2){+.+.+.}, at: [<c003aec0>] cpu_hotplug_begin+0x78/0xc4 rib#8: (boot_lock){+.+...}, at: [<c002b254>] omap4_boot_secondary+0x1c/0x178 Preemption disabled at:[< (null)>] (null) CPU: 0 PID: 118 Comm: sh Not tainted 4.1.12-rt11-01998-gb4a62c3-dirty torvalds#137 Hardware name: Generic DRA74X (Flattened Device Tree) [<c0017574>] (unwind_backtrace) from [<c0013be8>] (show_stack+0x10/0x14) [<c0013be8>] (show_stack) from [<c05a8670>] (dump_stack+0x80/0x94) [<c05a8670>] (dump_stack) from [<c05ad158>] (rt_spin_lock+0x24/0x54) [<c05ad158>] (rt_spin_lock) from [<c0030dac>] (clkdm_wakeup+0x10/0x2c) [<c0030dac>] (clkdm_wakeup) from [<c002b2c0>] (omap4_boot_secondary+0x88/0x178) [<c002b2c0>] (omap4_boot_secondary) from [<c0015d00>] (__cpu_up+0xc4/0x164) [<c0015d00>] (__cpu_up) from [<c003b09c>] (cpu_up+0x15c/0x1a0) [<c003b09c>] (cpu_up) from [<c03cd2d4>] (device_online+0x64/0x88) [<c03cd2d4>] (device_online) from [<c03cd360>] (online_store+0x68/0x74) [<c03cd360>] (online_store) from [<c01b4ce0>] (kernfs_fop_write+0xb8/0x19c) [<c01b4ce0>] (kernfs_fop_write) from [<c0144124>] (__vfs_write+0x20/0xd8) [<c0144124>] (__vfs_write) from [<c01449c0>] (vfs_write+0x90/0x164) [<c01449c0>] (vfs_write) from [<c01451e4>] (SyS_write+0x44/0x9c) [<c01451e4>] (SyS_write) from [<c0010240>] (ret_fast_syscall+0x0/0x54) CPU1: smp_ops.cpu_die() returned, trying to resuscitate Cc: Tero Kristo <[email protected]> Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Tony Lindgren <[email protected]>

Originally OMAP MPUIO GPIO irqchip was implemented using Generic irq chip, but after set of reworks Generic irq chip code was replaced by common OMAP GPIO implementation and finally removed by commit d2d05c6 ("gpio: omap: Fix regression for MPUIO interrupts"). Unfortunately, above commit left .irq_mask/unmask callbacks assigned as below for MPUIO GPIO case: irqc->irq_mask = irq_gc_mask_set_bit; irqc->irq_unmask = irq_gc_mask_clr_bit; This now causes boot failure on OMAP1 platforms, after commit 450fa54 ("gpio: omap: convert to use generic irq handler") which forces these callbacks to be called during GPIO IRQs mapping from gpiochip_irq_map: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c0004000 [00000000] *pgd=00000000 Internal error: Oops: 75 [rib#1] ARM Modules linked in: CPU: 0 PID: 1 Comm: swapper Not tainted 4.4.0-rc1-e3-los_afe0c+-00002-g25379c0-dirty rib#1 Hardware name: Amstrad E3 (Delta) task: c1836000 ti: c1838000 task.ti: c1838000 PC is at irq_gc_mask_set_bit+0x1c/0x60 LR is at __irq_do_set_handler+0x118/0x15c pc : [<c004848c>] lr : [<c0047d4c>] psr: 600000d3 sp : c1839c90 ip : c1862c64 fp : c1839c9c r10: 00000000 r9 : c041195 r8 : c0411bbc r7 : 00000000 r6 : c185c310 r5 : c00444e8 r4 : c185c300 r3 : c1854b50 r2 : 00000000 r1 : 00000000 r0 : c185c310 Flags: nZCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel Control: 0000317f Table: 10004000 DAC: 00000057 Process swapper (pid: 1, stack limit = 0xc1838190) Stack: (0xc1839c90 to 0xc183a000) [...] Backtrace: [<c0048470>] (irq_gc_mask_set_bit) from [<c0047d4c>] (__irq_do_set_handler+0x118/0x15c) [<c0047c34>] (__irq_do_set_handler) from [<c0047dd4>] (__irq_set_handler+0x44/0x5c) r6:00000000 r5:c00444e8 r4:c185c300 [<c0047d90>] (__irq_set_handler) from [<c0047e1c>] (irq_set_chip_and_handler_name+0x30/0x34) r7:00000050 r6:00000000 r5:c00444e8 r4:00000050 [<c0047dec>] (irq_set_chip_and_handler_name) from [<c01b345c>] (gpiochip_irq_map+0x3c/0x8c) r7:00000050 r6:00000000 r5:00000050 r4:c1862c64 [<c01b3420>] (gpiochip_irq_map) from [<c0049670>] (irq_domain_associate+0x7c/0x1c4) r5:c185c310 r4:c185cb00 [<c00495f4>] (irq_domain_associate) from [<c0049894>] (irq_domain_add_simple+0x98/0xc0) r8:c0411bbc r7:c185cb00 r6:00000050 r5:00000010 r4:00000001 [<c00497fc>] (irq_domain_add_simple) from [<c01b3328>] (_gpiochip_irqchip_add+0x64/0x10c) r7:c1862c64 r6:c0419280 r5:c1862c64 r4:c1854b50 [<c01b32c4>] (_gpiochip_irqchip_add) from [<c01b79f4>] (omap_gpio_probe+0x2fc/0x63c) r5:c1854b50 r4:c1862c10 [<c01b76f8>] (omap_gpio_probe) from [<c01fcf58>] (platform_drv_probe+0x2c/0x64) r10:00000000 r9:c03e45e8 r8:00000000 r7:c0419294 r6:c0411984 r5:c0419294 r4:c0411950 [<c01fcf2c>] (platform_drv_probe) from [<c01fb668>] (really_probe+0x160/0x29c) Hence, fix it by remove obsolete callbacks assignment. After this change omap_gpio_mask_irq()/omap_gpio_unmask_irq() will be used for MPUIO IRQs masking, but this now happens anyway from omap_gpio_irq_startup/shutdown(). Cc: Tony Lindgren <[email protected]> Fixes: commit d2d05c6 ("gpio: omap: Fix regression for MPUIO interrupts") Reported-by: Aaro Koskinen <[email protected]> Signed-off-by: Grygorii Strashko <[email protected]> Acked-by: Santosh Shilimkar <[email protected]> Tested-by: Aaro Koskinen <[email protected]> Signed-off-by: Linus Walleij <[email protected]>

A css_set represents the relationship between a set of tasks and css's. css_set never pinned the associated css's. This was okay because tasks used to always disassociate immediately (in RCU sense) - either a task is moved to a different css_set or exits and never accesses css_set again. Unfortunately, afcf6c8 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller") and patches leading up to it made a zombie hold onto its css_set and deref the associated css's on its release. Nothing pins the css's after exit and it might have already been freed leading to use-after-free. general protection fault: 0000 [rib#1] PREEMPT SMP task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000 RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 ... Call Trace: <IRQ> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0 [<ffffffff810f8893>] cgroup_free+0x53/0xe0 [<ffffffff8104ed62>] __put_task_struct+0x42/0x130 [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130 [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820 [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820 [<ffffffff81056e54>] __do_softirq+0xd4/0x460 [<ffffffff81057369>] irq_exit+0x89/0xa0 [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90 <EOI> ... Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02 RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 RSP <ffff88001fc03e20> ---[ end trace 89a4a4b916b90c49 ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by making css_set pin the associate css's until its release. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Dave Jones <[email protected]> Reported-by: Daniel Wagner <[email protected]> Link: http://lkml.kernel.org/g/[email protected] Link: http://lkml.kernel.org/g/[email protected] Fixes: afcf6c8 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")

qdisc_tree_decrease_qlen() suffers from two problems on multiqueue devices. One problem is that it updates sch->q.qlen and sch->qstats.drops on the mq/mqprio root qdisc, while it should not : Daniele reported underflows errors : [ 681.774821] PAX: sch->q.qlen: 0 n: 1 [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head; [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec rib#1 [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015 [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001 [ 681.774962] Call Trace: [ 681.774967] [<ffffffffa95d2810>] dump_stack+0x4c/0x7f [ 681.774970] [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50 [ 681.774972] [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160 [ 681.774976] [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel] [ 681.774978] [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel] [ 681.774980] [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0 [ 681.774983] [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160 [ 681.774985] [<ffffffffa90664c1>] __do_softirq+0xf1/0x200 [ 681.774987] [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30 [ 681.774989] [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260 [ 681.774991] [<ffffffffa9089560>] ? sort_range+0x40/0x40 [ 681.774992] [<ffffffffa9085fe4>] kthread+0xe4/0x100 [ 681.774994] [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170 [ 681.774995] [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70 mq/mqprio have their own ways to report qlen/drops by folding stats on all their queues, with appropriate locking. A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup() without proper locking : concurrent qdisc updates could corrupt the list that qdisc_match_from_root() parses to find a qdisc given its handle. Fix first problem adding a TCQ_F_NOPARENT qdisc flag that qdisc_tree_decrease_qlen() can use to abort its tree traversal, as soon as it meets a mq/mqprio qdisc children. Second problem can be fixed by RCU protection. Qdisc are already freed after RCU grace period, so qdisc_list_add() and qdisc_list_del() simply have to use appropriate rcu list variants. A future patch will add a per struct netdev_queue list anchor, so that qdisc_tree_decrease_qlen() can have more efficient lookups. Reported-by: Daniele Fucini <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Cong Wang <[email protected]> Cc: Jamal Hadi Salim <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Recent PAT patchset has caused issue on 32-bit PAE machines: page:eea45000 count:0 mapcount:-128 mapping: (null) index:0x0 flags: 0x40000000() page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) < 0) ------------[ cut here ]------------ kernel BUG at /home/build/linux-boris/mm/huge_memory.c:1485! invalid opcode: 0000 [rib#1] SMP [...] Call Trace: unmap_single_vma ? __wake_up unmap_vmas unmap_region do_munmap vm_munmap SyS_munmap do_fast_syscall_32 ? __do_page_fault sysenter_past_esp Code: ... EIP: [<c11bde80>] zap_huge_pmd+0x240/0x260 SS:ESP 0068:f6459d98 The problem is in pmd_pfn_mask() and pmd_flags_mask(). These helpers use PMD_PAGE_MASK to calculate resulting mask. PMD_PAGE_MASK is 'unsigned long', not 'unsigned long long' as phys_addr_t is on 32-bit PAE (ARCH_PHYS_ADDR_T_64BIT). As a result, the upper bits of resulting mask get truncated. pud_pfn_mask() and pud_flags_mask() aren't problematic since we don't have PUD page table level on 32-bit systems, but it's reasonable to keep them consistent with PMD counterpart. Introduce PHYSICAL_PMD_PAGE_MASK and PHYSICAL_PUD_PAGE_MASK in addition to existing PHYSICAL_PAGE_MASK and reworks helpers to use them. Reported-and-Tested-by: Boris Ostrovsky <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> [ Fix -Woverflow warnings from the realmode code. ] Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Toshi Kani <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jürgen Gross <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: linux-mm <[email protected]> Fixes: f70abb0 ("x86/asm: Fix pud/pmd interfaces to handle large PAT bit") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>

introduced jbd2_write_access_granted() to improve write|undo_access speed, but missed to check the status of b_committed_data which caused a kernel panic on ocfs2. [ 6538.405938] ------------[ cut here ]------------ [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400! [ 6538.406686] invalid opcode: 0000 [rib#1] SMP [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 rib#1 [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000 [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>] [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2] [ 6538.406686] RSP: 0018:ffff880075b7b7f8 EFLAGS: 00010246 [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0 [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430 [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001 [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001 [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0 [ 6538.406686] FS: 00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000 [ 6538.406686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0 [ 6538.406686] Stack: [ 6538.406686] ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800 [ 6538.406686] ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d [ 6538.406686] ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000 [ 6538.406686] Call Trace: [ 6538.406686] [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2] [ 6538.406686] [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2] [ 6538.406686] [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2] [ 6538.406686] [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2] [ 6538.406686] [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2] [ 6538.406686] [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2] [ 6538.406686] [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2] [ 6538.406686] [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2] [ 6538.406686] [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2] [ 6538.406686] [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2] [ 6538.406686] [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2] [ 6538.406686] [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2] [ 6538.406686] [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2] [ 6538.406686] [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2] [ 6538.406686] [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2] [ 6538.406686] [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2] [ 6538.406686] [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2] [ 6538.406686] [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50 [ 6538.406686] [<ffffffff812124b7>] notify_change+0x1d7/0x340 [ 6538.406686] [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80 [ 6538.406686] [<ffffffff811f5876>] do_truncate+0x66/0x90 [ 6538.406686] [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110 [ 6538.406686] [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120 [ 6538.406686] [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10 [ 6538.406686] [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71 [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 [ 6538.406686] RIP [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2] [ 6538.406686] RSP <ffff880075b7b7f8> [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]--- [ 6538.694492] Kernel panic - not syncing: Fatal exception [ 6538.695484] Kernel Offset: disabled Fixes: de92c8c("jbd2: speedup jbd2_journal_get_[write|undo]_access()") Cc: <[email protected]> Signed-off-by: Junxiao Bi <[email protected]> Signed-off-by: Theodore Ts'o <[email protected]>

Fixes segmentation fault using, for instance: (gdb) run record -I -e intel_pt/tsc=1,noretcomp=1/u /bin/ls Starting program: /home/acme/bin/perf record -I -e intel_pt/tsc=1,noretcomp=1/u /bin/ls Missing separate debuginfos, use: dnf debuginfo-install glibc-2.22-7.fc23.x86_64 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Program received signal SIGSEGV, Segmentation fault. 0 x00000000004b9ea5 in tracepoint_error (e=0x0, err=13, sys=0x19b1370 "sched", name=0x19a5d00 "sched_switch") at util/parse-events.c:410 (gdb) bt #0 0x00000000004b9ea5 in tracepoint_error (e=0x0, err=13, sys=0x19b1370 "sched", name=0x19a5d00 "sched_switch") at util/parse-events.c:410 #1 0x00000000004b9fc5 in add_tracepoint (list=0x19a5d20, idx=0x7fffffffb8c0, sys_name=0x19b1370 "sched", evt_name=0x19a5d00 "sched_switch", err=0x0, head_config=0x0) at util/parse-events.c:433 #2 0x00000000004ba334 in add_tracepoint_event (list=0x19a5d20, idx=0x7fffffffb8c0, sys_name=0x19b1370 "sched", evt_name=0x19a5d00 "sched_switch", err=0x0, head_config=0x0) at util/parse-events.c:498 #3 0x00000000004bb699 in parse_events_add_tracepoint (list=0x19a5d20, idx=0x7fffffffb8c0, sys=0x19b1370 "sched", event=0x19a5d00 "sched_switch", err=0x0, head_config=0x0) at util/parse-events.c:936 #4 0x00000000004f6eda in parse_events_parse (_data=0x7fffffffb8b0, scanner=0x19a49d0) at util/parse-events.y:391 #5 0x00000000004bc8e5 in parse_events__scanner (str=0x663ff2 "sched:sched_switch", data=0x7fffffffb8b0, start_token=258) at util/parse-events.c:1361 #6 0x00000000004bca57 in parse_events (evlist=0x19a5220, str=0x663ff2 "sched:sched_switch", err=0x0) at util/parse-events.c:1401 #7 0x0000000000518d5f in perf_evlist__can_select_event (evlist=0x19a3b90, str=0x663ff2 "sched:sched_switch") at util/record.c:253 #8 0x0000000000553c42 in intel_pt_track_switches (evlist=0x19a3b90) at arch/x86/util/intel-pt.c:364 #9 0x00000000005549d1 in intel_pt_recording_options (itr=0x19a2c40, evlist=0x19a3b90, opts=0x8edf68 <record+232>) at arch/x86/util/intel-pt.c:664 #10 0x000000000051e076 in auxtrace_record__options (itr=0x19a2c40, evlist=0x19a3b90, opts=0x8edf68 <record+232>) at util/auxtrace.c:539 #11 0x0000000000433368 in cmd_record (argc=1, argv=0x7fffffffde60, prefix=0x0) at builtin-record.c:1264 #12 0x000000000049bec2 in run_builtin (p=0x8fa2a8 <commands+168>, argc=5, argv=0x7fffffffde60) at perf.c:390 #13 0x000000000049c12a in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:451 #14 0x000000000049c278 in run_argv (argcp=0x7fffffffdcbc, argv=0x7fffffffdcb0) at perf.c:495 #15 0x000000000049c60a in main (argc=5, argv=0x7fffffffde60) at perf.c:618 (gdb) Intel PT attempts to find the sched:sched_switch tracepoint but that seg faults if tracefs is not readable, because the error reporting structure is null, as errors are not reported when automatically adding tracepoints. Fix by checking before using. Committer note: This doesn't take place in a kernel that supports perf_event_attr.context_switch, that is the default way that will be used for tracking context switches, only in older kernels, like 4.2, in a machine with Intel PT (e.g. Broadwell) for non-priviledged users. Further info from a similar patch by Wang: The error is in tracepoint_error: it assumes the 'e' parameter is valid. However, there are many situation a parse_event() can be called without parse_events_error. See result of $ grep 'parse_events(.*NULL)' ./tools/perf/ -r' Signed-off-by: Adrian Hunter <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Tong Zhang <[email protected]> Cc: Wang Nan <[email protected]> Cc: [email protected] # v4.4+ Fixes: 1965817 ("perf tools: Enhance parsing events tracepoint error output") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

The bulk of ATA host state machine is implemented by ata_sff_hsm_move(). The function is called from either the interrupt handler or, if polling, a work item. Unlike from the interrupt path, the polling path calls the function without holding the host lock and ata_sff_hsm_move() selectively grabs the lock. This is completely broken. If an IRQ triggers while polling is in progress, the two can easily race and end up accessing the hardware and updating state machine state at the same time. This can put the state machine in an illegal state and lead to a crash like the following. kernel BUG at drivers/ata/libata-sff.c:1302! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Modules linked in: CPU: 1 PID: 10679 Comm: syz-executor Not tainted 4.5.0-rc1+ torvalds#300 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88002bd00000 ti: ffff88002e048000 task.ti: ffff88002e048000 RIP: 0010:[<ffffffff83a83409>] [<ffffffff83a83409>] ata_sff_hsm_move+0x619/0x1c60 ... Call Trace: <IRQ> [<ffffffff83a84c31>] __ata_sff_port_intr+0x1e1/0x3a0 drivers/ata/libata-sff.c:1584 [<ffffffff83a85611>] ata_bmdma_port_intr+0x71/0x400 drivers/ata/libata-sff.c:2877 [< inline >] __ata_sff_interrupt drivers/ata/libata-sff.c:1629 [<ffffffff83a85bf3>] ata_bmdma_interrupt+0x253/0x580 drivers/ata/libata-sff.c:2902 [<ffffffff81479f98>] handle_irq_event_percpu+0x108/0x7e0 kernel/irq/handle.c:157 [<ffffffff8147a717>] handle_irq_event+0xa7/0x140 kernel/irq/handle.c:205 [<ffffffff81484573>] handle_edge_irq+0x1e3/0x8d0 kernel/irq/chip.c:623 [< inline >] generic_handle_irq_desc include/linux/irqdesc.h:146 [<ffffffff811a92bc>] handle_irq+0x10c/0x2a0 arch/x86/kernel/irq_64.c:78 [<ffffffff811a7e4d>] do_IRQ+0x7d/0x1a0 arch/x86/kernel/irq.c:240 [<ffffffff86653d4c>] common_interrupt+0x8c/0x8c arch/x86/entry/entry_64.S:520 <EOI> [< inline >] rcu_lock_acquire include/linux/rcupdate.h:490 [< inline >] rcu_read_lock include/linux/rcupdate.h:874 [<ffffffff8164b4a1>] filemap_map_pages+0x131/0xba0 mm/filemap.c:2145 [< inline >] do_fault_around mm/memory.c:2943 [< inline >] do_read_fault mm/memory.c:2962 [< inline >] do_fault mm/memory.c:3133 [< inline >] handle_pte_fault mm/memory.c:3308 [< inline >] __handle_mm_fault mm/memory.c:3418 [<ffffffff816efb16>] handle_mm_fault+0x2516/0x49a0 mm/memory.c:3447 [<ffffffff8127dc16>] __do_page_fault+0x376/0x960 arch/x86/mm/fault.c:1238 [<ffffffff8127e358>] trace_do_page_fault+0xe8/0x420 arch/x86/mm/fault.c:1331 [<ffffffff8126f514>] do_async_page_fault+0x14/0xd0 arch/x86/kernel/kvm.c:264 [<ffffffff86655578>] async_page_fault+0x28/0x30 arch/x86/entry/entry_64.S:986 Fix it by ensuring that the polling path is holding the host lock before entering ata_sff_hsm_move() so that all hardware accesses and state updates are performed under the host lock. Signed-off-by: Tejun Heo <[email protected]> Reported-and-tested-by: Dmitry Vyukov <[email protected]> Link: http://lkml.kernel.org/g/CACT4Y+b_JsOxJu2EZyEf+mOXORc_zid5V1-pLZSroJVxyWdSpw@mail.gmail.com Cc: [email protected]

In the extent_same ioctl we are getting the pages for the source and target ranges and unlocking them immediately after, which is incorrect because later we attempt to map them (with kmap_atomic) and access their contents at btrfs_cmp_data(). When we do such access the pages might have been relocated or removed from memory, which leads to an invalid memory access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y which produces a trace like the following: 186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64 parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G W 4.4.0-rc6-btrfs-next-18+ #1 [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000 [186736.681319] RIP: 0010:[<ffffffff81264d00>] [<ffffffff81264d00>] memcmp+0xb/0x22 [186736.681319] RSP: 0018:ffff880362287d70 EFLAGS: 00010287 [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000 [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000 [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000 [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000 [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40 [186736.681319] FS: 00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000 [186736.681319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0 [186736.681319] Stack: [186736.681319] ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000 [186736.681319] 0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400 [186736.681319] 00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038 [186736.681319] Call Trace: [186736.681319] [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs] [186736.681319] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [186736.681319] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [186736.681319] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [186736.681319] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [186736.681319] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [186736.681319] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0 (gdb) list *(btrfs_ioctl+0x24cb) 0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972). 2967 dst_addr = kmap_atomic(dst_page); 2968 2969 flush_dcache_page(src_page); 2970 flush_dcache_page(dst_page); 2971 2972 if (memcmp(addr, dst_addr, cmp_len)) 2973 ret = BTRFS_SAME_DATA_DIFFERS; 2974 2975 kunmap_atomic(addr); 2976 kunmap_atomic(dst_addr); So fix this by making sure we keep the pages locked and respect the same locking order as everywhere else: get and lock the pages first and then lock the range in the inode's io tree (like for example at __btrfs_buffered_write() and extent_readpages()). If an ordered extent is found after locking the range in the io tree, unlock the range, unlock the pages, wait for the ordered extent to complete and repeat the entire locking process until no overlapping ordered extents are found. Cc: [email protected] # 4.2+ Signed-off-by: Filipe Manana <[email protected]>

While doing some tests I ran into an hang on an extent buffer's rwlock that produced the following trace: [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166] [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165] [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800016] irq event stamp: 0 [39389.800016] hardirqs last enabled at (0): [< (null)>] (null) [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last disabled at (0): [< (null)>] (null) [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000 [39389.800016] RIP: 0010:[<ffffffff810902af>] [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158 [39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202 [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101 [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001 [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000 [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98 [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40 [39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000 [39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800016] Stack: [39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0 [39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895 [39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c [39389.800016] Call Trace: [39389.800016] [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60 [39389.800016] [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41 [39389.800016] [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44 [39389.800016] [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs] [39389.800016] [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs] [39389.800016] [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs] [39389.800016] [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs] [39389.800016] [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs] [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15 [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800016] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800016] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800016] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800016] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8 [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800012] irq event stamp: 0 [39389.800012] hardirqs last enabled at (0): [< (null)>] (null) [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last disabled at (0): [< (null)>] (null) [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1 [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000 [39389.800012] RIP: 0010:[<ffffffff81091e8d>] [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72 [39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206 [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000 [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000 [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98 [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00 [39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000 [39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800012] Stack: [39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98 [39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00 [39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58 [39389.800012] Call Trace: [39389.800012] [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c [39389.800012] [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41 [39389.800012] [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs] [39389.800012] [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs] [39389.800012] [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs] [39389.800012] [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs] [39389.800012] [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28 [39389.800012] [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs] [39389.800012] [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf [39389.800012] [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc [39389.800012] [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs] [39389.800012] [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs] [39389.800012] [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44 [39389.800012] [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs] [39389.800012] [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs] [39389.800012] [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs] [39389.800012] [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs] [39389.800012] [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs] [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800012] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800012] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800012] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800012] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00 This happens because in the code path executed by the inode_paths ioctl we end up nesting two calls to read lock a leaf's rwlock when after the first call to read_lock() and before the second call to read_lock(), another task (running the delayed items as part of a transaction commit) has already called write_lock() against the leaf's rwlock. This situation is illustrated by the following diagram: Task A Task B btrfs_ref_to_path() btrfs_commit_transaction() read_lock(&eb->lock); btrfs_run_delayed_items() __btrfs_commit_inode_delayed_items() __btrfs_update_delayed_inode() btrfs_lookup_inode() write_lock(&eb->lock); --> task waits for lock read_lock(&eb->lock); --> makes this task hang forever (and task B too of course) So fix this by avoiding doing the nested read lock, which is easily avoidable. This issue does not happen if task B calls write_lock() after task A does the second call to read_lock(), however there does not seem to exist anything in the documentation that mentions what is the expected behaviour for recursive locking of rwlocks (leaving the idea that doing so is not a good usage of rwlocks). Also, as a side effect necessary for this fix, make sure we do not needlessly read lock extent buffers when the input path has skip_locking set (used when called from send). Cc: [email protected] Signed-off-by: Filipe Manana <[email protected]>

We miss to take the crypto_alg_sem semaphore when traversing the crypto_alg_list for CRYPTO_MSG_GETALG dumps. This allows a race with crypto_unregister_alg() removing algorithms from the list while we're still traversing it, thereby leading to a use-after-free as show below: [ 3482.071639] general protection fault: 0000 [#1] SMP [ 3482.075639] Modules linked in: aes_x86_64 glue_helper lrw ablk_helper cryptd gf128mul ipv6 pcspkr serio_raw virtio_net microcode virtio_pci virtio_ring virtio sr_mod cdrom [last unloaded: aesni_intel] [ 3482.075639] CPU: 1 PID: 11065 Comm: crconf Not tainted 4.3.4-grsec+ torvalds#126 [ 3482.075639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 3482.075639] task: ffff88001cd41a40 ti: ffff88001cd422c8 task.ti: ffff88001cd422c8 [ 3482.075639] RIP: 0010:[<ffffffff93722bd3>] [<ffffffff93722bd3>] strncpy+0x13/0x30 [ 3482.075639] RSP: 0018:ffff88001f713b60 EFLAGS: 00010202 [ 3482.075639] RAX: ffff88001f6c4430 RBX: ffff88001f6c43a0 RCX: ffff88001f6c4430 [ 3482.075639] RDX: 0000000000000040 RSI: fefefefefefeff16 RDI: ffff88001f6c4430 [ 3482.075639] RBP: ffff88001f713b60 R08: ffff88001f6c4470 R09: ffff88001f6c4480 [ 3482.075639] R10: 0000000000000002 R11: 0000000000000246 R12: ffff88001ce2aa28 [ 3482.075639] R13: ffff880000093700 R14: ffff88001f5e4bf8 R15: 0000000000003b20 [ 3482.075639] FS: 0000033826fa2700(0000) GS:ffff88001e900000(0000) knlGS:0000000000000000 [ 3482.075639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3482.075639] CR2: ffffffffff600400 CR3: 00000000139ec000 CR4: 00000000001606f0 [ 3482.075639] Stack: [ 3482.075639] ffff88001f713bd8 ffffffff936ccd00 ffff88001e5c4200 ffff880000093700 [ 3482.075639] ffff88001f713bd0 ffffffff938ef4bf 0000000000000000 0000000000003b20 [ 3482.075639] ffff88001f5e4bf8 ffff88001f5e4848 0000000000000000 0000000000003b20 [ 3482.075639] Call Trace: [ 3482.075639] [<ffffffff936ccd00>] crypto_report_alg+0xc0/0x3e0 [ 3482.075639] [<ffffffff938ef4bf>] ? __alloc_skb+0x16f/0x300 [ 3482.075639] [<ffffffff936cd08a>] crypto_dump_report+0x6a/0x90 [ 3482.075639] [<ffffffff93935707>] netlink_dump+0x147/0x2e0 [ 3482.075639] [<ffffffff93935f99>] __netlink_dump_start+0x159/0x190 [ 3482.075639] [<ffffffff936ccb13>] crypto_user_rcv_msg+0xc3/0x130 [ 3482.075639] [<ffffffff936cd020>] ? crypto_report_alg+0x3e0/0x3e0 [ 3482.075639] [<ffffffff936cc4b0>] ? alg_test_crc32c+0x120/0x120 [ 3482.075639] [<ffffffff93933145>] ? __netlink_lookup+0xd5/0x120 [ 3482.075639] [<ffffffff936cca50>] ? crypto_add_alg+0x1d0/0x1d0 [ 3482.075639] [<ffffffff93938141>] netlink_rcv_skb+0xe1/0x130 [ 3482.075639] [<ffffffff936cc4f8>] crypto_netlink_rcv+0x28/0x40 [ 3482.075639] [<ffffffff939375a8>] netlink_unicast+0x108/0x180 [ 3482.075639] [<ffffffff93937c21>] netlink_sendmsg+0x541/0x770 [ 3482.075639] [<ffffffff938e31e1>] sock_sendmsg+0x21/0x40 [ 3482.075639] [<ffffffff938e4763>] SyS_sendto+0xf3/0x130 [ 3482.075639] [<ffffffff93444203>] ? bad_area_nosemaphore+0x13/0x20 [ 3482.075639] [<ffffffff93444470>] ? __do_page_fault+0x80/0x3a0 [ 3482.075639] [<ffffffff939d80cb>] entry_SYSCALL_64_fastpath+0x12/0x6e [ 3482.075639] Code: 88 4a ff 75 ed 5d 48 0f ba 2c 24 3f c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 d2 48 89 f8 48 89 f9 4c 8d 04 17 48 89 e5 74 15 <0f> b6 16 80 fa 01 88 11 48 83 de ff 48 83 c1 01 4c 39 c1 75 eb [ 3482.075639] RIP [<ffffffff93722bd3>] strncpy+0x13/0x30 To trigger the race run the following loops simultaneously for a while: $ while : ; do modprobe aesni-intel; rmmod aesni-intel; done $ while : ; do crconf show all > /dev/null; done Fix the race by taking the crypto_alg_sem read lock, thereby preventing crypto_unregister_alg() from modifying the algorithm list during the dump. This bug has been detected by the PaX memory sanitize feature. Cc: [email protected] Signed-off-by: Mathias Krause <[email protected]> Cc: Steffen Klassert <[email protected]> Cc: PaX Team <[email protected]> Signed-off-by: Herbert Xu <[email protected]>

The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <[email protected]> CC: [email protected] Signed-off-by: David Sterba <[email protected]> Tested-by: Holger Hoffstätte <[email protected]> Signed-off-by: Chris Mason <[email protected]>

A narrow window for race condition still exist between multicast join thread and *dev_flush workers. A kernel crash caused by prolong erratic link state changes was observed (most likely a faulty cabling): [167275.656270] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 [167275.665973] IP: [<ffffffffa05f8f2e>] ipoib_mcast_join+0xae/0x1d0 [ib_ipoib] [167275.674443] PGD 0 [167275.677373] Oops: 0000 [#1] SMP ... [167275.977530] Call Trace: [167275.982225] [<ffffffffa05f92f0>] ? ipoib_mcast_free+0x200/0x200 [ib_ipoib] [167275.992024] [<ffffffffa05fa1b7>] ipoib_mcast_join_task+0x2a7/0x490 [ib_ipoib] [167276.002149] [<ffffffff8109d5fb>] process_one_work+0x17b/0x470 [167276.010754] [<ffffffff8109e3cb>] worker_thread+0x11b/0x400 [167276.019088] [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400 [167276.027737] [<ffffffff810a5aef>] kthread+0xcf/0xe0 Here was a hit spot: ipoib_mcast_join() { .............. rec.qkey = priv->broadcast->mcmember.qkey; ^^^^^^^ ..... } Proposed patch should prevent multicast join task to continue if link state change is detected. Signed-off-by: Alex Estrin <[email protected]> Changes from v4: - as suggested by Doug Ledford, optimized spinlock usage, i.e. ipoib_mcast_join() is called with lock held. Changes from v3: - sync with priv->lock before flag check. Chages from v2: - Move check for OPER_UP flag state to mcast_join() to ensure no event worker is in progress. - minor style fixes. Changes from v1: - No need to lock again if error detected. Signed-off-by: Doug Ledford <[email protected]>

Unfortunately i915 is still not fully atomic, and expects mode_config.mutex to be held during modeset until we finally fix it. This fixes the following WARN when resuming: [ 425.208983] ------------[ cut here ]------------ [ 425.208990] WARNING: CPU: 0 PID: 6828 at drivers/gpu/drm/drm_edid.c:3555 drm_select_eld+0xa5/0xd0() [ 425.209015] Modules linked in: pl2303 usbserial snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_powerclamp coretemp i915 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep lpc_ich snd_hda_core snd_pcm i2c_hid i2c_designware_platform i2c_designware_core r8169 mii sdhci_acpi sdhci mmc_core [ 425.209018] CPU: 0 PID: 6828 Comm: kworker/u4:5 Tainted: G U W 4.5.0-rc4-gfxbench+ #1 [ 425.209020] Hardware name: \xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff \xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff/DN2820FYK, BIOS FYBYT10H.86A.0038.2014.0717.1455 07/17/2014 [ 425.209027] Workqueue: events_unbound async_run_entry_fn [ 425.209032] 0000000000000000 ffff880072433958 ffffffff813f6b05 0000000000000000 [ 425.209036] ffffffff81aaef2d ffff880072433990 ffffffff81078291 ffff880036b933d8 [ 425.209039] ffff88006d528000 ffff88006d52b3d8 ffff88006d52b3d8 ffff88007315b6f8 [ 425.209040] Call Trace: [ 425.209045] [<ffffffff813f6b05>] dump_stack+0x67/0x92 [ 425.209049] [<ffffffff81078291>] warn_slowpath_common+0x81/0xc0 [ 425.209052] [<ffffffff81078385>] warn_slowpath_null+0x15/0x20 [ 425.209054] [<ffffffff8151e195>] drm_select_eld+0xa5/0xd0 [ 425.209101] [<ffffffffa01f34f4>] intel_audio_codec_enable+0x44/0x160 [i915] [ 425.209135] [<ffffffffa023eac7>] intel_enable_hdmi_audio+0x87/0x90 [i915] [ 425.209169] [<ffffffffa023eb5a>] g4x_enable_hdmi+0x8a/0xa0 [i915] [ 425.209202] [<ffffffffa023f41b>] vlv_hdmi_pre_enable+0x1cb/0x240 [i915] [ 425.209236] [<ffffffffa020edcf>] valleyview_crtc_enable+0x10f/0x290 [i915] [ 425.209270] [<ffffffffa020ba49>] intel_atomic_commit+0x769/0x17a0 [i915] [ 425.209274] [<ffffffff81526ad5>] ? drm_atomic_check_only+0x145/0x660 [ 425.209276] [<ffffffff81527022>] drm_atomic_commit+0x32/0x50 [ 425.209310] [<ffffffffa0215fa0>] intel_display_resume+0xa0/0x130 [i915] [ 425.209338] [<ffffffffa018c1bb>] i915_drm_resume+0xcb/0x160 [i915] [ 425.209366] [<ffffffffa018c272>] i915_pm_resume+0x22/0x30 [i915] [ 425.209370] [<ffffffff8143d91e>] pci_pm_resume+0x6e/0xe0 [ 425.209373] [<ffffffff8143d8b0>] ? pci_pm_resume_noirq+0xa0/0xa0 [ 425.209375] [<ffffffff815409ae>] dpm_run_callback+0x6e/0x280 [ 425.209378] [<ffffffff815410b2>] device_resume+0x92/0x250 [ 425.209380] [<ffffffff81541288>] async_resume+0x18/0x40 [ 425.209382] [<ffffffff8109c7a5>] async_run_entry_fn+0x45/0x140 [ 425.209386] [<ffffffff81093293>] process_one_work+0x1e3/0x620 [ 425.209388] [<ffffffff810931f7>] ? process_one_work+0x147/0x620 [ 425.209391] [<ffffffff81093719>] worker_thread+0x49/0x490 [ 425.209393] [<ffffffff810936d0>] ? process_one_work+0x620/0x620 [ 425.209396] [<ffffffff81099e0a>] kthread+0xea/0x100 [ 425.209400] [<ffffffff81099d20>] ? kthread_create_on_node+0x1f0/0x1f0 [ 425.209404] [<ffffffff817ba03f>] ret_from_fork+0x3f/0x70 [ 425.209407] [<ffffffff81099d20>] ? kthread_create_on_node+0x1f0/0x1f0 [ 425.209409] ---[ end trace d1b247107f34a8b2 ]--- Fixes: e2c8b87 ("drm/i915: Use atomic helpers for suspend, v2.") Signed-off-by: Maarten Lankhorst <[email protected]> Link: http://patchwork.freedesktop.org/patch/msgid/1455632862-18557-1-git-send-email-maarten.lankhorst@linux.intel.com Reviewed-by: Daniel Vetter <[email protected]>

[ 1841.243670] divide error: 0000 [rib#1] SMP KASAN [ 1841.243994] Modules linked in: em28xx_rc rc_core tda18271 drxk em28xx_dvb dvb_core em28xx_alsa mt9v011 em28xx_v4l videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core em28xx tveeprom v4l2_common videodev media cpufreq_powersave cpufreq_conservative cpufreq_userspace cpufreq_stats parport_pc ppdev lp parport snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha256_ssse3 sha256_generic hmac drbg i915 snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd btusb i2c_algo_bit snd_hda_intel btrtl drm_kms_helper btbcm evdev snd_hda_codec btintel psmouse bluetooth pcspkr snd_hwdep sg drm serio_raw [ 1841.244845] snd_hda_core snd_pcm mei_me rfkill snd_timer mei snd lpc_ich soundcore shpchp i2c_i801 mfd_core battery dw_dmac i2c_designware_platform i2c_designware_core dw_dmac_core video acpi_pad button tpm_tis tpm ext4 crc16 mbcache jbd2 dm_mod hid_generic usbhid sd_mod ahci libahci libata ehci_pci e1000e xhci_pci ptp scsi_mod ehci_hcd xhci_hcd pps_core fan thermal sdhci_acpi sdhci mmc_core i2c_hid hid [last unloaded: tveeprom] [ 1841.245342] CPU: 2 PID: 38 Comm: kworker/2:1 Tainted: G W 4.5.0-rc1+ torvalds#43 [ 1841.245413] Hardware name: /NUC5i7RYB, BIOS RYBDWi35.86A.0350.2015.0812.1722 08/12/2015 [ 1841.245503] Workqueue: events request_module_async [em28xx] [ 1841.245557] task: ffff88009df10000 ti: ffff88009df18000 task.ti: ffff88009df18000 [ 1841.245626] RIP: 0010:[<ffffffffa135a0ad>] [<ffffffffa135a0ad>] size_to_scale+0xed/0x2c0 [em28xx_v4l] [ 1841.245714] RSP: 0018:ffff88009df1faa8 EFLAGS: 00010246 [ 1841.245756] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8803bb933b38 [ 1841.245815] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8803bb933b00 [ 1841.245879] RBP: ffff88009df1fad8 R08: ffff8803bb933b3c R09: 1ffff10077726760 [ 1841.245944] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1841.246006] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff8803b391a130 [ 1841.246071] FS: 0000000000000000(0000) GS:ffff8803c6900000(0000) knlGS:0000000000000000 [ 1841.246141] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1841.246194] CR2: 0000000001d97008 CR3: 00000003bdd85000 CR4: 00000000003406e0 [ 1841.246256] Stack: [ 1841.246278] 0000000000000246 ffff8803bb9321f0 ffff8803bb932270 ffffffffa136f7a0 [ 1841.246359] 0000000000000000 ffff8803bb932130 ffff88009df1fb20 ffffffffa13646a0 [ 1841.246439] ffffffffa127f206 ffff8803bb932130 ffff8803bb932130 ffff8803b391a130 [ 1841.246517] Call Trace: [ 1841.246548] [<ffffffffa13646a0>] em28xx_set_video_format+0x140/0x1e0 [em28xx_v4l] Signed-off-by: Mauro Carvalho Chehab <[email protected]>

The Camera Adaptation Layer (CAL) is a block which consists of a dual port CSI2/MIPI camera capture engine. Port #0 can handle CSI2 camera connected to up to 4 data lanes. Port rib#1 can handle CSI2 camera connected to up to 2 data lanes. The driver implements the required API/ioctls to be V4L2 compliant. Driver supports the following: - V4L2 API using DMABUF/MMAP buffer access based on videobuf2 api - Asynchronous sensor sub device registration - DT support Signed-off-by: Benoit Parrot <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

If the pll_u is not configured by the bootloader, then on kernel boot the following warning is seen: clk_pll_wait_for_lock: Timed out waiting for pll pll_u_vco lock tegra_init_from_table: Failed to enable pll_u_out1 ------------[ cut here ]------------ WARNING: at drivers/clk/tegra/clk.c:269 Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-rc4-next-20151214+ rib#1 Hardware name: NVIDIA Tegra210 P2371 reference board (E.1) (DT) task: ffffffc0bc0a0000 ti: ffffffc0bc0a8000 task.ti: ffffffc0bc0a8000 PC is at tegra_init_from_table+0x140/0x164 LR is at tegra_init_from_table+0x140/0x164 pc : [<ffffffc0008fee78>] lr : [<ffffffc0008fee78>] pstate: 80000045 sp : ffffffc0bc0abd50 x29: ffffffc0bc0abd50 x28: ffffffc00090b8a8 x27: ffffffc000a06000 x26: ffffffc0bc019780 x25: ffffffc00086a708 x24: ffffffc00086a790 x23: ffffffc0006d7188 x22: ffffffc0bc010000 x21: 000000000000016e x20: ffffffc0bc00d100 x19: ffffffc000944178 x18: 0000000000000007 x17: 000000000000000e x16: 0000000000000001 x15: 0000000000000007 x14: 000000000000000e x13: 0000000000000013 x12: 000000000000001a x11: 000000000000004d x10: 0000000000000750 x9 : ffffffc0bc0a8000 x8 : ffffffc0bc0a07b0 x7 : 0000000000000001 x6 : 0000000002d5f0f8 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000002 x2 : ffffffc000996724 x1 : 0000000000000000 x0 : 0000000000000032 ---[ end trace cbd20ae519e92ced ]--- Call trace: [<ffffffc0008fee78>] tegra_init_from_table+0x140/0x164 [<ffffffc000900ac8>] tegra210_clock_apply_init_table+0x20/0x28 [<ffffffc0008fec40>] tegra_clocks_apply_init_table+0x18/0x24 [<ffffffc00008291c>] do_one_initcall+0x90/0x194 [<ffffffc0008cfab0>] kernel_init_freeable+0x148/0x1e8 [<ffffffc000636bb0>] kernel_init+0x10/0xdc [<ffffffc000085cd0>] ret_from_fork+0x10/0x40 clk_pll_wait_for_lock: Timed out waiting for pll pll_u_vco lock tegra_init_from_table: Failed to enable pll_u_out2 ------------[ cut here ]------------ pll_u can be either controlled by software or hardware and this is selected via the OVERRIDE bit in the pll_u base register. In the function tegra210_pll_init(), the OVERRIDE bit for pll_u is cleared, which selects hardware control of the pll. However, at the same time the pll_u clocks are populated in the init_table for tegra210 and so software will try to configure the pll_u if it is not already configured and hence, the above warning is seen when the pll fails to lock. Remove the pll_u clocks from the init_table so that software does not try to configure this pll on boot. Signed-off-by: Jon Hunter <[email protected]> Acked-by: Rhyland Klein <[email protected]> Signed-off-by: Thierry Reding <[email protected]>

This softlockup is currently happening: [ 444.088002] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:29] [ 444.088002] Modules linked in: lpfc(-) qla2x00tgt(O) qla2xxx_scst(O) scst_vdisk(O) scsi_transport_fc libcrc32c scst(O) dlm configfs nfsd lockd grace nfs_acl auth_rpcgss sunrpc ed d snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device dm_mod iTCO_wdt snd_hda_codec_realtek snd_hda_codec_generic gpio_ich iTCO_vendor_support ppdev snd_hda_intel snd_hda_codec snd_hda _core snd_hwdep tg3 snd_pcm snd_timer libphy lpc_ich parport_pc ptp acpi_cpufreq snd pps_core fjes parport i2c_i801 ehci_pci tpm_tis tpm sr_mod cdrom soundcore floppy hwmon sg 8250_ fintek pcspkr i915 drm_kms_helper uhci_hcd ehci_hcd drm fb_sys_fops sysimgblt sysfillrect syscopyarea i2c_algo_bit usbcore button video usb_common fan ata_generic ata_piix libata th ermal [ 444.088002] CPU: 1 PID: 29 Comm: kworker/1:1 Tainted: G O 4.4.0-rc5-2.g1e923a3-default rib#1 [ 444.088002] Hardware name: FUJITSU SIEMENS ESPRIMO E /D2164-A1, BIOS 5.00 R1.10.2164.A1 05/08/2006 [ 444.088002] Workqueue: fc_wq_4 fc_rport_final_delete [scsi_transport_fc] [ 444.088002] task: f6266ec0 ti: f6268000 task.ti: f6268000 [ 444.088002] EIP: 0060:[<c07e7044>] EFLAGS: 00000286 CPU: 1 [ 444.088002] EIP is at _raw_spin_unlock_irqrestore+0x14/0x20 [ 444.088002] EAX: 00000286 EBX: f20d3800 ECX: 00000002 EDX: 00000286 [ 444.088002] ESI: f50ba800 EDI: f2146848 EBP: f6269ec8 ESP: f6269ec8 [ 444.088002] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 444.088002] CR0: 8005003b CR2: 08f96600 CR3: 363ae000 CR4: 000006d0 [ 444.088002] Stack: [ 444.088002] f6269eec c066b0f7 00000286 f2146848 f50ba808 f50ba800 f50ba800 f2146a90 [ 444.088002] f2146848 f6269f08 f8f0a4ed f3141000 f2146800 f2146a90 f619fa00 00000040 [ 444.088002] f6269f40 c026cb25 00000001 166c6392 00000061 f6757140 f6136340 00000004 [ 444.088002] Call Trace: [ 444.088002] [<c066b0f7>] scsi_remove_target+0x167/0x1c0 [ 444.088002] [<f8f0a4ed>] fc_rport_final_delete+0x9d/0x1e0 [scsi_transport_fc] [ 444.088002] [<c026cb25>] process_one_work+0x155/0x3e0 [ 444.088002] [<c026cde7>] worker_thread+0x37/0x490 [ 444.088002] [<c027214b>] kthread+0x9b/0xb0 [ 444.088002] [<c07e72c1>] ret_from_kernel_thread+0x21/0x40 What appears to be happening is that something has pinned the target so it can't go into STARGET_DEL via final release and the loop in scsi_remove_target spins endlessly until that happens. The fix for this soft lockup is to not keep looping over a device that we've called remove on but which hasn't gone into DEL state. This patch will retain a simplistic memory of the last target and not keep looping over it. Reported-by: Sebastian Herbszt <[email protected]> Tested-by: Sebastian Herbszt <[email protected]> Fixes: 4099819 Cc: [email protected] Signed-off-by: James Bottomley <[email protected]>

We receoved a bug report from someone using vmware: WARNING: CPU: 3 PID: 660 at kernel/sched/core.c:7389 __might_sleep+0x7d/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810fa68d>] prepare_to_wait+0x2d/0x90 Modules linked in: vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event snd_ens1371 iosf_mbi gameport snd_rawmidi snd_ac97_codec ac97_bus snd_seq coretemp snd_seq_device snd_pcm snd_timer snd soundcore ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_vmci vmw_balloon i2c_piix4 shpchp parport_pc parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc btrfs xor raid6_pq 8021q garp stp llc mrp crc32c_intel serio_raw mptspi vmwgfx drm_kms_helper ttm drm scsi_transport_spi mptscsih e1000 ata_generic mptbase pata_acpi CPU: 3 PID: 660 Comm: vmtoolsd Not tainted 4.2.0-0.rc1.git3.1.fc23.x86_64 rib#1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 0000000000000000 0000000049e617f3 ffff88006ac37ac8 ffffffff818641f5 0000000000000000 ffff88006ac37b20 ffff88006ac37b08 ffffffff810ab446 ffff880068009f40 ffffffff81c63bc0 0000000000000061 0000000000000000 Call Trace: [<ffffffff818641f5>] dump_stack+0x4c/0x65 [<ffffffff810ab446>] warn_slowpath_common+0x86/0xc0 [<ffffffff810ab4d5>] warn_slowpath_fmt+0x55/0x70 [<ffffffff8112551d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [<ffffffff810fa68d>] ? prepare_to_wait+0x2d/0x90 [<ffffffff810fa68d>] ? prepare_to_wait+0x2d/0x90 [<ffffffff810da2bd>] __might_sleep+0x7d/0x90 [<ffffffff812163b3>] __might_fault+0x43/0xa0 [<ffffffff81430477>] copy_from_iter+0x87/0x2a0 [<ffffffffa039460a>] __qp_memcpy_to_queue+0x9a/0x1b0 [vmw_vmci] [<ffffffffa0394740>] ? qp_memcpy_to_queue+0x20/0x20 [vmw_vmci] [<ffffffffa0394757>] qp_memcpy_to_queue_iov+0x17/0x20 [vmw_vmci] [<ffffffffa0394d50>] qp_enqueue_locked+0xa0/0x140 [vmw_vmci] [<ffffffffa039593f>] vmci_qpair_enquev+0x4f/0xd0 [vmw_vmci] [<ffffffffa04847bb>] vmci_transport_stream_enqueue+0x1b/0x20 [vmw_vsock_vmci_transport] [<ffffffffa047ae05>] vsock_stream_sendmsg+0x2c5/0x320 [vsock] [<ffffffff810fabd0>] ? wake_atomic_t_function+0x70/0x70 [<ffffffff81702af8>] sock_sendmsg+0x38/0x50 [<ffffffff81702ff4>] SYSC_sendto+0x104/0x190 [<ffffffff8126e25a>] ? vfs_read+0x8a/0x140 [<ffffffff817042ee>] SyS_sendto+0xe/0x10 [<ffffffff8186d9ae>] entry_SYSCALL_64_fastpath+0x12/0x76 transport->stream_enqueue may call copy_to_user so it should not be called inside a prepare_to_wait. Narrow the scope of the prepare_to_wait to avoid the bad call. This also applies to vsock_stream_recvmsg as well. Reported-by: Vinson Lee <[email protected]> Tested-by: Vinson Lee <[email protected]> Signed-off-by: Laura Abbott <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Commit c0eb454 ("hv_netvsc: Don't ask for additional head room in the skb") got rid of needed_headroom setting for the driver. With the change I hit the following issue trying to use ptkgen module: [ 57.522021] kernel BUG at net/core/skbuff.c:1128! [ 57.522021] invalid opcode: 0000 [rib#1] SMP DEBUG_PAGEALLOC ... [ 58.721068] Call Trace: [ 58.721068] [<ffffffffa0144e86>] netvsc_start_xmit+0x4c6/0x8e0 [hv_netvsc] ... [ 58.721068] [<ffffffffa02f87fc>] ? pktgen_finalize_skb+0x25c/0x2a0 [pktgen] [ 58.721068] [<ffffffff814f5760>] ? __netdev_alloc_skb+0xc0/0x100 [ 58.721068] [<ffffffffa02f9907>] pktgen_thread_worker+0x257/0x1920 [pktgen] Basically, we're calling skb_cow_head(skb, RNDIS_AND_PPI_SIZE) and crash on if (skb_shared(skb)) BUG(); We probably need to restore needed_headroom setting (but shrunk to RNDIS_AND_PPI_SIZE as we don't need more) to request the required headroom space. In theory, it should not give us performance penalty. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: David S. Miller <[email protected]>

The mutex lock at rc_register_device() was added by commit 08aeb7c ("[media] rc: add locking to fix register/show race"). It is meant to avoid race issues when trying to open a sysfs file while the RC register didn't complete. Adding a lock there causes troubles, as detected by the Kernel lock debug instrumentation at the Kernel: ====================================================== [ INFO: possible circular locking dependency detected ] 4.5.0-rc3+ torvalds#46 Not tainted ------------------------------------------------------- systemd-udevd/2681 is trying to acquire lock: (s_active#171){++++.+}, at: [<ffffffff8171a115>] kernfs_remove_by_name_ns+0x45/0xa0 but task is already holding lock: (&dev->lock){+.+.+.}, at: [<ffffffffa0724def>] rc_register_device+0xb2f/0x1450 [rc_core] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> rib#1 (&dev->lock){+.+.+.}: [<ffffffff8124817d>] lock_acquire+0x13d/0x320 [<ffffffff822de966>] mutex_lock_nested+0xb6/0x860 [<ffffffffa0721f2b>] show_protocols+0x3b/0x3f0 [rc_core] [<ffffffff81cdaba5>] dev_attr_show+0x45/0xc0 [<ffffffff8171f1b3>] sysfs_kf_seq_show+0x203/0x3c0 [<ffffffff8171a6a1>] kernfs_seq_show+0x121/0x1b0 [<ffffffff81617c71>] seq_read+0x2f1/0x1160 [<ffffffff8171c911>] kernfs_fop_read+0x321/0x460 [<ffffffff815abc20>] __vfs_read+0xe0/0x3d0 [<ffffffff815ae90e>] vfs_read+0xde/0x2d0 [<ffffffff815b1d01>] SyS_read+0x111/0x230 [<ffffffff822e8636>] entry_SYSCALL_64_fastpath+0x16/0x76 -> #0 (s_active#171){++++.+}: [<ffffffff81244f24>] __lock_acquire+0x4304/0x5990 [<ffffffff8124817d>] lock_acquire+0x13d/0x320 [<ffffffff81717d3a>] __kernfs_remove+0x58a/0x810 [<ffffffff8171a115>] kernfs_remove_by_name_ns+0x45/0xa0 [<ffffffff81721592>] remove_files.isra.0+0x72/0x190 [<ffffffff8172174b>] sysfs_remove_group+0x9b/0x150 [<ffffffff81721854>] sysfs_remove_groups+0x54/0xa0 [<ffffffff81cd97d0>] device_remove_attrs+0xb0/0x140 [<ffffffff81cdb27c>] device_del+0x38c/0x6b0 [<ffffffffa0724b8b>] rc_register_device+0x8cb/0x1450 [rc_core] [<ffffffffa1326a7b>] dvb_usb_remote_init+0x66b/0x14d0 [dvb_usb] [<ffffffffa1321c81>] dvb_usb_device_init+0xf21/0x1860 [dvb_usb] [<ffffffffa13517dc>] dib0700_probe+0x14c/0x410 [dvb_usb_dib0700] [<ffffffff81dbb1dd>] usb_probe_interface+0x45d/0x940 [<ffffffff81ce7e7a>] driver_probe_device+0x21a/0xc30 [<ffffffff81ce89b1>] __driver_attach+0x121/0x160 [<ffffffff81ce21bf>] bus_for_each_dev+0x11f/0x1a0 [<ffffffff81ce6cdd>] driver_attach+0x3d/0x50 [<ffffffff81ce5df9>] bus_add_driver+0x4c9/0x770 [<ffffffff81cea39c>] driver_register+0x18c/0x3b0 [<ffffffff81db6e98>] usb_register_driver+0x1f8/0x440 [<ffffffffa074001e>] dib0700_driver_init+0x1e/0x1000 [dvb_usb_dib0700] [<ffffffff810021b1>] do_one_initcall+0x141/0x300 [<ffffffff8144d8eb>] do_init_module+0x1d0/0x5ad [<ffffffff812f27b6>] load_module+0x6666/0x9ba0 [<ffffffff812f5fe8>] SyS_finit_module+0x108/0x130 [<ffffffff822e8636>] entry_SYSCALL_64_fastpath+0x16/0x76 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&dev->lock); lock(s_active#171); lock(&dev->lock); lock(s_active#171); *** DEADLOCK *** 3 locks held by systemd-udevd/2681: #0: (&dev->mutex){......}, at: [<ffffffff81ce8933>] __driver_attach+0xa3/0x160 rib#1: (&dev->mutex){......}, at: [<ffffffff81ce8941>] __driver_attach+0xb1/0x160 rib#2: (&dev->lock){+.+.+.}, at: [<ffffffffa0724def>] rc_register_device+0xb2f/0x1450 [rc_core] In this specific case, some error happened during device init, causing IR to be disabled. Let's fix it by adding a var that will tell when the device is initialized. Any calls before that will return a -EINVAL. That should prevent the race issues. Signed-off-by: Mauro Carvalho Chehab <[email protected]>

… switches If cgroup writeback is in use, an inode is associated with a cgroup for writeback. If the inode's main dirtier changes to another cgroup, the association gets updated asynchronously. Nothing was pinning the superblock while such switches are in progress and superblock could go away while async switching is pending or in progress leading to crashes like the following. kernel BUG at fs/jbd2/transaction.c:319! invalid opcode: 0000 [rib#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 torvalds#51 Hardware name: Google Google, BIOS Google 01/01/2011 Workqueue: events inode_switch_wbs_work_fn task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000 RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0 RSP: 0018:ffff880209267c30 EFLAGS: 00010202 ... Call Trace: [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190 [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70 [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0 [<ffffffff8035338b>] evict+0xbb/0x190 [<ffffffff80354190>] iput+0x130/0x190 [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0 [<ffffffff80279819>] process_one_work+0x129/0x300 [<ffffffff80279b16>] worker_thread+0x126/0x480 [<ffffffff8027ed14>] kthread+0xc4/0xe0 [<ffffffff809771df>] ret_from_fork+0x3f/0x70 Fix it by bumping s_active while cgroup association switching is in flight. Signed-off-by: Tejun Heo <[email protected]> Reported-and-tested-by: Tahsin Erdogan <[email protected]> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com Fixes: d10c809 ("writeback: implement foreign cgroup inode bdi_writeback switching") Cc: [email protected] #v4.5+ Signed-off-by: Jens Axboe <[email protected]>

There is a race between arc_emac_tx() and arc_emac_tx_clean(). sk_buff got freed by arc_emac_tx_clean() while arc_emac_tx() submitting sk_buff. In order to free sk_buff arc_emac_tx_clean() checks: if ((info & FOR_EMAC) || !txbd->data) break; ... dev_kfree_skb_irq(skb); If condition false, arc_emac_tx_clean() free sk_buff. In order to submit txbd, arc_emac_tx() do: priv->tx_buff[*txbd_curr].skb = skb; ... priv->txbd[*txbd_curr].data = cpu_to_le32(addr); ... ... <== arc_emac_tx_clean() check condition here ... <== (info & FOR_EMAC) is false ... <== !txbd->data is false ... *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len); In order to reproduce the situation, run device: # iperf -s run on host: # iperf -t 600 -c <device-ip-addr> [ 28.396284] ------------[ cut here ]------------ [ 28.400912] kernel BUG at .../net/core/skbuff.c:1355! [ 28.414019] Internal error: Oops - BUG: 0 [rib#1] SMP ARM [ 28.419150] Modules linked in: [ 28.422219] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 4.4.0+ torvalds#120 [ 28.429516] Hardware name: Rockchip (Device Tree) [ 28.434216] task: c0665070 ti: c0660000 task.ti: c0660000 [ 28.439622] PC is at skb_put+0x10/0x54 [ 28.443381] LR is at arc_emac_poll+0x260/0x474 [ 28.447821] pc : [<c03af580>] lr : [<c028fec4>] psr: a0070113 [ 28.447821] sp : c0661e58 ip : eea68502 fp : ef377000 [ 28.459280] r10: 0000012c r9 : f08b2000 r8 : eeb57100 [ 28.464498] r7 : 00000000 r6 : ef376594 r5 : 00000077 r4 : ef376000 [ 28.471015] r3 : 0030488b r2 : ef13e880 r1 : 000005ee r0 : eeb57100 [ 28.477534] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 28.484658] Control: 10c5387d Table: 8eaf004a DAC: 00000051 [ 28.490396] Process swapper/0 (pid: 0, stack limit = 0xc0660210) [ 28.496393] Stack: (0xc0661e58 to 0xc0662000) [ 28.500745] 1e40: 00000002 00000000 [ 28.508913] 1e60: 00000000 ef376520 00000028 f08b23b8 00000000 ef376520 ef7b6900 c028fc64 [ 28.517082] 1e80: 2f158000 c0661ea8 c0661eb0 0000012c c065e900 c03bdeac ffff95e9 c0662100 [ 28.525250] 1ea0: c0663924 00000028 c0661ea8 c0661ea8 c0661eb0 c0661eb0 0000001e c0660000 [ 28.533417] 1ec0: 40000003 00000008 c0695a00 0000000a c066208c 00000100 c0661ee0 c0027410 [ 28.541584] 1ee0: ef0fb700 2f158000 00200000 ffff95e8 00000004 c0662100 c0662080 00000003 [ 28.549751] 1f00: 00000000 00000000 00000000 c065b45c 0000001e ef005000 c0647a30 00000000 [ 28.557919] 1f20: 00000000 c0027798 00000000 c005cf40 f0802100 c0662ffc c0661f60 f0803100 [ 28.566088] 1f40: c0661fb8 c00093bc c000ffb4 60070013 ffffffff c0661f94 c0661fb8 c00137d4 [ 28.574267] 1f60: 00000001 00000000 00000000 c001ffa0 00000000 c0660000 00000000 c065a364 [ 28.582441] 1f80: c0661fb8 c0647a30 00000000 00000000 00000000 c0661fb0 c000ffb0 c000ffb4 [ 28.590608] 1fa0: 60070013 ffffffff 00000051 00000000 00000000 c005496c c0662400 c061bc40 [ 28.598776] 1fc0: ffffffff ffffffff 00000000 c061b680 00000000 c0647a30 00000000 c0695294 [ 28.606943] 1fe0: c0662488 c0647a2c c066619c 6000406a 413fc090 6000807c 00000000 00000000 [ 28.615127] [<c03af580>] (skb_put) from [<ef376520>] (0xef376520) [ 28.621218] Code: e5902054 e590c090 e3520000 0a000000 (e7f001f2) [ 28.627307] ---[ end trace 4824734e2243fdb6 ]--- [ 34.377068] Internal error: Oops: 17 [rib#1] SMP ARM [ 34.382854] Modules linked in: [ 34.385947] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.4.0+ torvalds#120 [ 34.392219] Hardware name: Rockchip (Device Tree) [ 34.396937] task: ef02d040 ti: ef05c000 task.ti: ef05c000 [ 34.402376] PC is at __dev_kfree_skb_irq+0x4/0x80 [ 34.407121] LR is at arc_emac_poll+0x130/0x474 [ 34.411583] pc : [<c03bb640>] lr : [<c028fd94>] psr: 60030013 [ 34.411583] sp : ef05de68 ip : 0008e83c fp : ef377000 [ 34.423062] r10: c001bec4 r9 : 00000000 r8 : f08b24c8 [ 34.428296] r7 : f08b2400 r6 : 00000075 r5 : 00000019 r4 : ef376000 [ 34.434827] r3 : 00060000 r2 : 00000042 r1 : 00000001 r0 : 00000000 [ 34.441365] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 34.448507] Control: 10c5387d Table: 8f25c04a DAC: 00000051 [ 34.454262] Process ksoftirqd/0 (pid: 3, stack limit = 0xef05c210) [ 34.460449] Stack: (0xef05de68 to 0xef05e000) [ 34.464827] de60: ef376000 c028fd94 00000000 c0669480 c0669480 ef376520 [ 34.473022] de80: 00000028 00000001 00002ae ef376520 ef7b6900 c028fc64 2f158000 ef05dec0 [ 34.481215] dea0: ef05dec8 0000012c c065e900 c03bdeac ffff983f c0662100 c0663924 00000028 [ 34.489409] dec0: ef05dec0 ef05dec0 ef05dec8 ef05dec8 ef7b6000 ef05c000 40000003 00000008 [ 34.497600] dee0: c0695a00 0000000a c066208c 00000100 ef05def8 c0027410 ef7b6000 40000000 [ 34.505795] df00: 04208040 ffff983e 00000004 c0662100 c0662080 00000003 ef05c000 ef027340 [ 34.513985] df20: ef05c000 c0666c2c 00000000 00000001 00000002 00000000 00000000 c0027568 [ 34.522176] df40: ef027340 c003ef48 ef027300 00000000 ef027340 c003edd4 00000000 00000000 [ 34.530367] df60: 00000000 c003c37c ffffff7f 00000001 00000000 ef027340 00000000 00030003 [ 34.538559] df80: ef05df80 ef05df80 00000000 00000000 ef05df90 ef05df90 ef05dfac ef027300 [ 34.546750] dfa0: c003c2a4 00000000 00000000 c000f578 00000000 00000000 00000000 00000000 [ 34.554939] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 34.563129] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff dfff7fff [ 34.571360] [<c03bb640>] (__dev_kfree_skb_irq) from [<c028fd94>] (arc_emac_poll+0x130/0x474) [ 34.579840] [<c028fd94>] (arc_emac_poll) from [<c03bdeac>] (net_rx_action+0xdc/0x28c) [ 34.587712] [<c03bdeac>] (net_rx_action) from [<c0027410>] (__do_softirq+0xcc/0x1f8) [ 34.595482] [<c0027410>] (__do_softirq) from [<c0027568>] (run_ksoftirqd+0x2c/0x50) [ 34.603168] [<c0027568>] (run_ksoftirqd) from [<c003ef48>] (smpboot_thread_fn+0x174/0x18c) [ 34.611466] [<c003ef48>] (smpboot_thread_fn) from [<c003c37c>] (kthread+0xd8/0xec) [ 34.619075] [<c003c37c>] (kthread) from [<c000f578>] (ret_from_fork+0x14/0x3c) [ 34.626317] Code: e8bd8010 e3a00000 e12fff1e e92d4010 (e59030a4) [ 34.632572] ---[ end trace cca5a3d86a82249a ]--- Signed-off-by: Alexander Kochetkov <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Or Gerlitz says: ==================== Mellanox 10/40G mlx4 driver fixes for 4.5-rc Bunch of fixes from the team to the mlx4 Eth and core drivers. Series generated against net commit aac8d3c "qmi_wwan: add "4G LTE usb-modem U901"" Please push patches 1,2 and 6 to -stable as well changes from v0: - handled another wrongly accounted HW counter in patch rib#1 (Rick) - fixed coding style issues in patch rib#4 (Sergei) ==================== Signed-off-by: David S. Miller <[email protected]>

A kernel page fault oops with the callstack below was observed when a read syscall was made to a pmem device after a huge amount (>512GB) of vmalloc ranges was allocated by ioremap() on a x86_64 system: BUG: unable to handle kernel paging request at ffff880840000ff8 IP: vmalloc_fault+0x1be/0x300 PGD c7f03a067 PUD 0 Oops: 0000 [rib#1] SM Call Trace: __do_page_fault+0x285/0x3e0 do_page_fault+0x2f/0x80 ? put_prev_entity+0x35/0x7a0 page_fault+0x28/0x30 ? memcpy_erms+0x6/0x10 ? schedule+0x35/0x80 ? pmem_rw_bytes+0x6a/0x190 [nd_pmem] ? schedule_timeout+0x183/0x240 btt_log_read+0x63/0x140 [nd_btt] : ? __symbol_put+0x60/0x60 ? kernel_read+0x50/0x80 SyS_finit_module+0xb9/0xf0 entry_SYSCALL_64_fastpath+0x1a/0xa4 Since v4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc range is limited to pte mappings. vmalloc faults do not normally happen in ioremap'd ranges since ioremap() sets up the kernel page tables, which are shared by user processes. pgd_ctor() sets the kernel's PGD entries to user's during fork(). When allocation of the vmalloc ranges crosses a 512GB boundary, ioremap() allocates a new pud table and updates the kernel PGD entry to point it. If user process's PGD entry does not have this update yet, a read/write syscall to the range will cause a vmalloc fault, which hits the Oops above as it does not handle a large page properly. Following changes are made to vmalloc_fault(). 64-bit: - No change for the PGD sync operation as it handles large pages already. - Add pud_huge() and pmd_huge() to the validation code to handle large pages. - Change pud_page_vaddr() to pud_pfn() since an ioremap range is not directly mapped (while the if-statement still works with a bogus addr). - Change pmd_page() to pmd_pfn() since an ioremap range is not backed by struct page (while the if-statement still works with a bogus addr). 32-bit: - No change for the sync operation since the index3 PGD entry covers the entire vmalloc range, which is always valid. (A separate change to sync PGD entry is necessary if this memory layout is changed regardless of the page size.) - Add pmd_huge() to the validation code to handle large pages. This is for completeness since vmalloc_fault() won't happen in ioremap'd ranges as its PGD entry is always valid. Reported-by: Henning Schild <[email protected]> Signed-off-by: Toshi Kani <[email protected]> Acked-by: Borislav Petkov <[email protected]> Cc: <[email protected]> # 4.1+ Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>

Otherwise, when compiled as module, a WARN_ON is triggered: WARNING: CPU: 0 PID: 5 at sound/core/init.c:208 snd_card_new+0x310/0x39c [snd] [...] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.10.39 rib#1 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Workqueue: events deferred_probe_work_func [<c0111988>] (unwind_backtrace) from [<c010c8ac>] (show_stack+0x10/0x14) [<c010c8ac>] (show_stack) from [<c092784c>] (dump_stack+0xdc/0x104) [<c092784c>] (dump_stack) from [<c0129710>] (__warn+0xd8/0x114) [<c0129710>] (__warn) from [<c0922a48>] (warn_slowpath_fmt+0x5c/0xc4) [<c0922a48>] (warn_slowpath_fmt) from [<bf0496f8>] (snd_card_new+0x310/0x39c [snd]) [<bf0496f8>] (snd_card_new [snd]) from [<bf1d7df8>] (snd_soc_bind_card+0x334/0x9c4 [snd_soc_core]) [<bf1d7df8>] (snd_soc_bind_card [snd_soc_core]) from [<bf1e9cd8>] (devm_snd_soc_register_card+0x30/0x6c [snd_soc_core]) [<bf1e9cd8>] (devm_snd_soc_register_card [snd_soc_core]) from [<bf22d964>] (fsl_asoc_card_probe+0x550/0xcc8 [snd_soc_fsl_asoc_card]) [<bf22d964>] (fsl_asoc_card_probe [snd_soc_fsl_asoc_card]) from [<c060c930>] (platform_drv_probe+0x48/0x98) [...] Signed-off-by: Nicolas Cavallari <[email protected]> Acked-by: Shengjiu Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>

After the commit 5ce2dce ("RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces"), if the IPoIB device is moved to non-initial netns, destroying that netns lets the device vanish instead of moving it back to the initial netns, This is happening because default_device_exit() skips the interfaces due to having rtnl_link_ops set. Steps to reporoduce: ip netns add foo ip link set mlx5_ib0 netns foo ip netns delete foo WARNING: CPU: 1 PID: 704 at net/core/dev.c:11435 netdev_exit+0x3f/0x50 Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun d fuse CPU: 1 PID: 704 Comm: kworker/u64:3 Tainted: G S W 5.13.0-rc1+ rib#1 Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.1.5 04/11/2016 Workqueue: netns cleanup_net RIP: 0010:netdev_exit+0x3f/0x50 Code: 48 8b bb 30 01 00 00 e8 ef 81 b1 ff 48 81 fb c0 3a 54 a1 74 13 48 8b 83 90 00 00 00 48 81 c3 90 00 00 00 48 39 d8 75 02 5b c3 <0f> 0b 5b c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 RSP: 0018:ffffb297079d7e08 EFLAGS: 00010206 RAX: ffff8eb542c00040 RBX: ffff8eb541333150 RCX: 000000008010000d RDX: 000000008010000e RSI: 000000008010000d RDI: ffff8eb440042c00 RBP: ffffb297079d7e48 R08: 0000000000000001 R09: ffffffff9fdeac00 R10: ffff8eb5003be000 R11: 0000000000000001 R12: ffffffffa1545620 R13: ffffffffa1545628 R14: 0000000000000000 R15: ffffffffa1543b20 FS: 0000000000000000(0000) GS:ffff8ed37fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005601b5f4c2e8 CR3: 0000001fc8c10002 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ops_exit_list.isra.9+0x36/0x70 cleanup_net+0x234/0x390 process_one_work+0x1cb/0x360 ? process_one_work+0x360/0x360 worker_thread+0x30/0x370 ? process_one_work+0x360/0x360 kthread+0x116/0x130 ? kthread_park+0x80/0x80 ret_from_fork+0x22/0x30 To avoid the above warning and later on the kernel panic that could happen on shutdown due to a NULL pointer dereference, make sure to set the netns_refund flag that was introduced by commit 3a5ca85 ("can: dev: Move device back to init netns on owning netns delete") to properly restore the IPoIB interfaces to the initial netns. Fixes: 5ce2dce ("RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kamal Heib <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>

This reverts commit 568262b. The commit causes the following panic when shutting down a rockpro64-v2 board: [..] [ 41.684569] xhci-hcd xhci-hcd.2.auto: USB bus 1 deregistered [ 41.686301] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0 [ 41.687096] Mem abort info: [ 41.687345] ESR = 0x96000004 [ 41.687615] EC = 0x25: DABT (current EL), IL = 32 bits [ 41.688082] SET = 0, FnV = 0 [ 41.688352] EA = 0, S1PTW = 0 [ 41.688628] Data abort info: [ 41.688882] ISV = 0, ISS = 0x00000004 [ 41.689219] CM = 0, WnR = 0 [ 41.689481] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000073b2000 [ 41.690046] [00000000000000a0] pgd=0000000000000000, p4d=0000000000000000 [ 41.690654] Internal error: Oops: 96000004 [rib#1] PREEMPT SMP [ 41.691143] Modules linked in: [ 41.691416] CPU: 5 PID: 1 Comm: shutdown Not tainted 5.13.0-rc4 torvalds#43 [ 41.691966] Hardware name: Pine64 RockPro64 v2.0 (DT) [ 41.692409] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) [ 41.692937] pc : down_read_interruptible+0xec/0x200 [ 41.693373] lr : simple_recursive_removal+0x48/0x280 [ 41.693815] sp : ffff800011fab910 [ 41.694107] x29: ffff800011fab910 x28: ffff0000008fe480 x27: ffff0000008fe4d8 [ 41.694736] x26: ffff800011529a90 x25: 00000000000000a0 x24: ffff800011edd030 [ 41.695364] x23: 0000000000000080 x22: 0000000000000000 x21: ffff800011f23994 [ 41.695992] x20: ffff800011f23998 x19: ffff0000008fe480 x18: ffffffffffffffff [ 41.696620] x17: 000c0400bb44ffff x16: 0000000000000009 x15: ffff800091faba3d [ 41.697248] x14: 0000000000000004 x13: 0000000000000000 x12: 0000000000000020 [ 41.697875] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f x9 : 6f6c746364716e62 [ 41.698502] x8 : 7f7f7f7f7f7f7f7f x7 : fefefeff6364626d x6 : 0000000000000440 [ 41.699130] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000000000a0 [ 41.699758] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 00000000000000a0 [ 41.700386] Call trace: [ 41.700602] down_read_interruptible+0xec/0x200 [ 41.701003] debugfs_remove+0x5c/0x80 [ 41.701328] dwc3_debugfs_exit+0x1c/0x6c [ 41.701676] dwc3_remove+0x34/0x1a0 [ 41.701988] platform_remove+0x28/0x60 [ 41.702322] __device_release_driver+0x188/0x22c [ 41.702730] device_release_driver+0x2c/0x44 [ 41.703106] bus_remove_device+0x124/0x130 [ 41.703468] device_del+0x16c/0x424 [ 41.703777] platform_device_del.part.0+0x1c/0x90 [ 41.704193] platform_device_unregister+0x28/0x44 [ 41.704608] of_platform_device_destroy+0xe8/0x100 [ 41.705031] device_for_each_child_reverse+0x64/0xb4 [ 41.705470] of_platform_depopulate+0x40/0x84 [ 41.705853] __dwc3_of_simple_teardown+0x20/0xd4 [ 41.706260] dwc3_of_simple_shutdown+0x14/0x20 [ 41.706652] platform_shutdown+0x28/0x40 [ 41.706998] device_shutdown+0x158/0x330 [ 41.707344] kernel_power_off+0x38/0x7c [ 41.707684] __do_sys_reboot+0x16c/0x2a0 [ 41.708029] __arm64_sys_reboot+0x28/0x34 [ 41.708383] invoke_syscall+0x48/0x114 [ 41.708716] el0_svc_common.constprop.0+0x44/0xdc [ 41.709131] do_el0_svc+0x28/0x90 [ 41.709426] el0_svc+0x2c/0x54 [ 41.709698] el0_sync_handler+0xa4/0x130 [ 41.710045] el0_sync+0x198/0x1c0 [ 41.710342] Code: c8047c62 35ffff84 17fffe5f f9800071 (c85ffc60) [ 41.710881] ---[ end trace 406377df5178f75c ]--- [ 41.711299] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 41.712084] Kernel Offset: disabled [ 41.712391] CPU features: 0x10001031,20000846 [ 41.712775] Memory Limit: none [ 41.713049] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- As Felipe explained: "dwc3_shutdown() is just called dwc3_remove() directly, then we end up calling debugfs_remove_recursive() twice." Reverting the commit fixes the panic. Fixes: 568262b ("usb: dwc3: core: Add shutdown callback for dwc3") Acked-by: Felipe Balbi <[email protected]> Signed-off-by: Alexandru Elisei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

…r tcpm port A pending hrtimer may expire after the kthread_worker of tcpm port is destroyed, see below kernel dump when do module unload, fix it by cancel the 2 hrtimers. [ 111.517018] Unable to handle kernel paging request at virtual address ffff8000118cb880 [ 111.518786] blk_update_request: I/O error, dev sda, sector 60061185 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 111.526594] Mem abort info: [ 111.526597] ESR = 0x96000047 [ 111.526600] EC = 0x25: DABT (current EL), IL = 32 bits [ 111.526604] SET = 0, FnV = 0 [ 111.526607] EA = 0, S1PTW = 0 [ 111.526610] Data abort info: [ 111.526612] ISV = 0, ISS = 0x00000047 [ 111.526615] CM = 0, WnR = 1 [ 111.526619] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041d75000 [ 111.526623] [ffff8000118cb880] pgd=10000001bffff003, p4d=10000001bffff003, pud=10000001bfffe003, pmd=10000001bfffa003, pte=0000000000000000 [ 111.526642] Internal error: Oops: 96000047 [rib#1] PREEMPT SMP [ 111.526647] Modules linked in: dwc3_imx8mp dwc3 phy_fsl_imx8mq_usb [last unloaded: tcpci] [ 111.526663] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc4-00927-gebbe9dbd802c-dirty torvalds#36 [ 111.526670] Hardware name: NXP i.MX8MPlus EVK board (DT) [ 111.526674] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO BTYPE=--) [ 111.526681] pc : queued_spin_lock_slowpath+0x1a0/0x390 [ 111.526695] lr : _raw_spin_lock_irqsave+0x88/0xb4 [ 111.526703] sp : ffff800010003e20 [ 111.526706] x29: ffff800010003e20 x28: ffff00017f380180 [ 111.537156] buffer_io_error: 6 callbacks suppressed [ 111.537162] Buffer I/O error on dev sda1, logical block 60040704, async page read [ 111.539932] x27: ffff00017f3801c0 [ 111.539938] x26: ffff800010ba2490 x25: 0000000000000000 x24: 0000000000000001 [ 111.543025] blk_update_request: I/O error, dev sda, sector 60061186 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0 [ 111.548304] [ 111.548306] x23: 00000000000000c0 x22: ffff0000c2a9f184 x21: ffff00017f380180 [ 111.551374] Buffer I/O error on dev sda1, logical block 60040705, async page read [ 111.554499] [ 111.554503] x20: ffff0000c5f14210 x19: 00000000000000c0 x18: 0000000000000000 [ 111.557391] Buffer I/O error on dev sda1, logical block 60040706, async page read [ 111.561218] [ 111.561222] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 111.564205] Buffer I/O error on dev sda1, logical block 60040707, async page read [ 111.570887] x14: 00000000000000f5 x13: 0000000000000001 x12: 0000000000000040 [ 111.570902] x11: ffff0000c05ac6d8 [ 111.583420] Buffer I/O error on dev sda1, logical block 60040708, async page read [ 111.588978] x10: 0000000000000000 x9 : 0000000000040000 [ 111.588988] x8 : 0000000000000000 [ 111.597173] Buffer I/O error on dev sda1, logical block 60040709, async page read [ 111.605766] x7 : ffff00017f384880 x6 : ffff8000118cb880 [ 111.605777] x5 : ffff00017f384880 [ 111.611094] Buffer I/O error on dev sda1, logical block 60040710, async page read [ 111.617086] x4 : 0000000000000000 x3 : ffff0000c2a9f184 [ 111.617096] x2 : ffff8000118cb880 [ 111.622242] Buffer I/O error on dev sda1, logical block 60040711, async page read [ 111.626927] x1 : ffff8000118cb880 x0 : ffff00017f384888 [ 111.626938] Call trace: [ 111.626942] queued_spin_lock_slowpath+0x1a0/0x390 [ 111.795809] kthread_queue_work+0x30/0xc0 [ 111.799828] state_machine_timer_handler+0x20/0x30 [ 111.804624] __hrtimer_run_queues+0x140/0x1e0 [ 111.808990] hrtimer_interrupt+0xec/0x2c0 [ 111.813004] arch_timer_handler_phys+0x38/0x50 [ 111.817456] handle_percpu_devid_irq+0x88/0x150 [ 111.821991] __handle_domain_irq+0x80/0xe0 [ 111.826093] gic_handle_irq+0xc0/0x140 [ 111.829848] el1_irq+0xbc/0x154 [ 111.832991] arch_cpu_idle+0x1c/0x2c [ 111.836572] default_idle_call+0x24/0x6c [ 111.840497] do_idle+0x238/0x2ac [ 111.843729] cpu_startup_entry+0x2c/0x70 [ 111.847657] rest_init+0xdc/0xec [ 111.850890] arch_call_rest_init+0x14/0x20 [ 111.854988] start_kernel+0x508/0x540 [ 111.858659] Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827) [ 111.864760] ---[ end trace 308b9a4a3dcb73ac ]--- [ 111.869381] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 111.876258] SMP: stopping secondary CPUs [ 111.880185] Kernel Offset: disabled [ 111.883673] CPU features: 0x00001001,20000846 [ 111.888031] Memory Limit: none [ 111.891090] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]--- Fixes: 3ed8e1c ("usb: typec: tcpm: Migrate workqueue to RT priority for processing events") Cc: stable <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Signed-off-by: Li Jun <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

We've suffered from severe kernel crashes due to memory corruption on our production environment, like, Call Trace: [1640542.554277] general protection fault: 0000 [rib#1] SMP PTI [1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G [1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190 [1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286 [1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX: 0000000006e931bf [1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI: ffff9a45ff004300 [1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09: 0000000000000000 [1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9a20608d [1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15: 696c662f65636976 [1640542.563128] FS: 00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000) knlGS:0000000000000000 [1640542.563937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4: 00000000003606e0 [1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1640542.566742] Call Trace: [1640542.567009] anon_vma_clone+0x5d/0x170 [1640542.567417] __split_vma+0x91/0x1a0 [1640542.567777] do_munmap+0x2c6/0x320 [1640542.568128] vm_munmap+0x54/0x70 [1640542.569990] __x64_sys_munmap+0x22/0x30 [1640542.572005] do_syscall_64+0x5b/0x1b0 [1640542.573724] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [1640542.575642] RIP: 0033:0x7f45d6e61e27 James Wang has reproduced it stably on the latest 4.19 LTS. After some debugging, we finally proved that it's due to ftrace buffer out-of-bound access using a debug tool as follows: [ 86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000 [ 86.780806] no_context+0xdf/0x3c0 [ 86.784327] __do_page_fault+0x252/0x470 [ 86.788367] do_page_fault+0x32/0x140 [ 86.792145] page_fault+0x1e/0x30 [ 86.795576] strncpy_from_unsafe+0x66/0xb0 [ 86.799789] fetch_memory_string+0x25/0x40 [ 86.804002] fetch_deref_string+0x51/0x60 [ 86.808134] kprobe_trace_func+0x32d/0x3a0 [ 86.812347] kprobe_dispatcher+0x45/0x50 [ 86.816385] kprobe_ftrace_handler+0x90/0xf0 [ 86.820779] ftrace_ops_assist_func+0xa1/0x140 [ 86.825340] 0xffffffffc00750bf [ 86.828603] do_sys_open+0x5/0x1f0 [ 86.832124] do_syscall_64+0x5b/0x1b0 [ 86.835900] entry_SYSCALL_64_after_hwframe+0x44/0xa9 commit b220c04 ("tracing: Check length before giving out the filter buffer") adds length check to protect trace data overflow introduced in 0fc1b09, seems that this fix can't prevent overflow entirely, the length check should also take the sizeof entry->array[0] into account, since this array[0] is filled the length of trace data and occupy addtional space and risk overflow. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: Ingo Molnar <[email protected]> Cc: Xunlei Pang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Fixes: b220c04 ("tracing: Check length before giving out the filter buffer") Reviewed-by: Xunlei Pang <[email protected]> Reviewed-by: yinbinbin <[email protected]> Reviewed-by: Wetp Zhang <[email protected]> Tested-by: James Wang <[email protected]> Signed-off-by: Liangyan <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>

There is no validation of the index from dwc3_wIndex_to_dep() and we might be referring a non-existing ep and trigger a NULL pointer exception. In certain configurations we might use fewer eps and the index might wrongly indicate a larger ep index than existing. By adding this validation from the patch we can actually report a wrong index back to the caller. In our usecase we are using a composite device on an older kernel, but upstream might use this fix also. Unfortunately, I cannot describe the hardware for others to reproduce the issue as it is a proprietary implementation. [ 82.958261] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a4 [ 82.966891] Mem abort info: [ 82.969663] ESR = 0x96000006 [ 82.972703] Exception class = DABT (current EL), IL = 32 bits [ 82.978603] SET = 0, FnV = 0 [ 82.981642] EA = 0, S1PTW = 0 [ 82.984765] Data abort info: [ 82.987631] ISV = 0, ISS = 0x00000006 [ 82.991449] CM = 0, WnR = 0 [ 82.994409] user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c6210ccc [ 83.000999] [00000000000000a4] pgd=0000000053aa5003, pud=0000000053aa5003, pmd=0000000000000000 [ 83.009685] Internal error: Oops: 96000006 [rib#1] PREEMPT SMP [ 83.026433] Process irq/62-dwc3 (pid: 303, stack limit = 0x000000003985154c) [ 83.033470] CPU: 0 PID: 303 Comm: irq/62-dwc3 Not tainted 4.19.124 rib#1 [ 83.044836] pstate: 60000085 (nZCv daIf -PAN -UAO) [ 83.049628] pc : dwc3_ep0_handle_feature+0x414/0x43c [ 83.054558] lr : dwc3_ep0_interrupt+0x3b4/0xc94 ... [ 83.141788] Call trace: [ 83.144227] dwc3_ep0_handle_feature+0x414/0x43c [ 83.148823] dwc3_ep0_interrupt+0x3b4/0xc94 [ 83.181546] ---[ end trace aac6b5267d84c32f ]--- Signed-off-by: Marian-Cristian Rotariu <[email protected]> Cc: stable <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

The sinfo.pertid and sinfo.generation variables are not initialized and it causes a crash when we use this as a wireless access point. [ 456.873025] ------------[ cut here ]------------ [ 456.878198] kernel BUG at mm/slub.c:3968! [ 456.882680] Internal error: Oops - BUG: 0 [rib#1] PREEMPT SMP ARM [ snip ] [ 457.271004] Backtrace: [ 457.273733] [<c02b7ee4>] (kfree) from [<c0e2a470>] (nl80211_send_station+0x954/0xfc4) [ 457.282481] r9:eccca0c0 r8:e8edfec0 r7:00000000 r6:00000011 r5:e80a9480 r4:e8edfe00 [ 457.291132] [<c0e29b1c>] (nl80211_send_station) from [<c0e2b18c>] (cfg80211_new_sta+0x90/0x1cc) [ 457.300850] r10:e80a9480 r9:e8edfe00 r8:ea678cca r7:00000a20 r6:00000000 r5:ec46d000 [ 457.309586] r4:ec46d9e0 [ 457.312433] [<c0e2b0fc>] (cfg80211_new_sta) from [<bf086684>] (rtw_cfg80211_indicate_sta_assoc+0x80/0x9c [r8723bs]) [ 457.324095] r10:00009930 r9:e85b9d80 r8:bf091050 r7:00000000 r6:00000000 r5:0000001c [ 457.332831] r4:c1606788 [ 457.335692] [<bf086604>] (rtw_cfg80211_indicate_sta_assoc [r8723bs]) from [<bf03df38>] (rtw_stassoc_event_callback+0x1c8/0x1d4 [r8723bs]) [ 457.349489] r7:ea678cc0 r6:000000a1 r5:f1225f84 r4:f086b000 [ 457.355845] [<bf03dd70>] (rtw_stassoc_event_callback [r8723bs]) from [<bf048e4c>] (mlme_evt_hdl+0x8c/0xb4 [r8723bs]) [ 457.367601] r7:c1604900 r6:f086c4b8 r5:00000000 r4:f086c000 [ 457.373959] [<bf048dc0>] (mlme_evt_hdl [r8723bs]) from [<bf03693c>] (rtw_cmd_thread+0x198/0x3d8 [r8723bs]) [ 457.384744] r5:f086e000 r4:f086c000 [ 457.388754] [<bf0367a4>] (rtw_cmd_thread [r8723bs]) from [<c014a214>] (kthread+0x170/0x174) [ 457.398083] r10:ed7a57e8 r9:bf0367a4 r8:f086b000 r7:e8ede000 r6:00000000 r5:e9975200 [ 457.406828] r4:e8369900 [ 457.409653] [<c014a0a4>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c) [ 457.417718] Exception stack(0xe8edffb0 to 0xe8edfff8) [ 457.423356] ffa0: 00000000 00000000 00000000 00000000 [ 457.432492] ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 457.441618] ffe0: 00000000 00000000 00000000 00000000 00000013 00000000 [ 457.449006] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c014a0a4 [ 457.457750] r4:e9975200 [ 457.460574] Code: 1a000003 e5953004 e3130001 1a000000 (e7f001f2) [ 457.467381] ---[ end trace 4acbc8c15e9e6aa7 ]--- Link: https://forum.armbian.com/topic/14727-wifi-ap-kernel-bug-in-kernel-5444/ Fixes: 8689c05 ("cfg80211: dynamically allocate per-tid stats for station info") Fixes: f5ea912 ("nl80211: add generation number to all dumps") Signed-off-by: Wenli Looi <[email protected]> Reviewed-by: Dan Carpenter <[email protected]> Cc: stable <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>

The value mr->sig is stored in the entry upon mr allocation, however, ibmr is wrongly entered here as "old", therefore, xa_cmpxchg() does not replace the entry with NULL, which leads to the following trace: WARNING: CPU: 28 PID: 2078 at drivers/infiniband/hw/mlx5/main.c:3643 mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib] Modules linked in: nvme_rdma nvme_fabrics nvme_core 8021q garp mrp bonding bridge stp llc rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_tad CPU: 28 PID: 2078 Comm: reboot Tainted: G X --------- --- 5.13.0-0.rc2.19.el9.x86_64 rib#1 Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 2.9.1 12/07/2018 RIP: 0010:mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib] Code: 8d bb 70 1f 00 00 be 00 01 00 00 e8 9d 94 ce da 48 3d 00 01 00 00 75 02 5b c3 0f 0b 5b c3 0f 0b 48 83 bb b0 20 00 00 00 74 d5 <0f> 0b eb d1 4 RSP: 0018:ffffa8db06d33c90 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff97f890a44000 RCX: ffff97f900ec0160 RDX: 0000000000000000 RSI: 0000000080080001 RDI: ffff97f890a44000 RBP: ffffffffc0c189b8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000300 R12: ffff97f890a44000 R13: ffffffffc0c36030 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f0d5a8a3b40(0000) GS:ffff98077fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555acbf4f450 CR3: 00000002a6f56002 CR4: 00000000001706e0 Call Trace: mlx5r_remove+0x39/0x60 [mlx5_ib] auxiliary_bus_remove+0x1b/0x30 __device_release_driver+0x17a/0x230 device_release_driver+0x24/0x30 bus_remove_device+0xdb/0x140 device_del+0x18b/0x3e0 mlx5_detach_device+0x59/0x90 [mlx5_core] mlx5_unload_one+0x22/0x60 [mlx5_core] shutdown+0x31/0x3a [mlx5_core] pci_device_shutdown+0x34/0x60 device_shutdown+0x15b/0x1c0 __do_sys_reboot.cold+0x2f/0x5b ? vfs_writev+0xc7/0x140 ? handle_mm_fault+0xc5/0x290 ? do_writev+0x6b/0x110 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e6fb246 ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()") Link: https://lore.kernel.org/r/f3f585ea0db59c2a78f94f65eedeafc5a2374993.1623309971.git.leonro@nvidia.com Signed-off-by: Aharon Landau <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>

On its own, this change looks a little strange and doesn't do too much useful. To understand why we're doing this we need to look forward to future patches where we're going to probe our panel using the new DP AUX bus. See the patch ("drm/bridge: ti-sn65dsi86: Add support for the DP AUX bus"). Let's think about the set of steps we'll want to happen when we have the DP AUX bus: 1. We'll create the DP AUX bus. 2. We'll populate the devices on the DP AUX bus (AKA our panel). 3. For setting up the bridge-related functions of ti-sn65dsi86 we'll need to get a reference to the panel. If we do rib#1 - rib#3 in a single probe call things _mostly_ will work, but it won't be massively robust. Let's explore. First let's think of the easy case of no -EPROBE_DEFER. In that case in step rib#2 when we populate the devices on the DP AUX bus it will actually try probing the panel right away. Since the panel probe doesn't defer then in step rib#3 we'll get a reference to the panel and we're golden. Second, let's think of the case when the panel returns -EPROBE_DEFER. In that case step rib#2 won't synchronously create the panel (it'll just add the device to the defer list to do it later). Step rib#3 will fail to get the panel and the bridge sub-device will return -EPROBE_DEFER. We'll depopulate the DP AUX bus. Later we'll try the whole sequence again. Presumably the panel will eventually stop returning -EPROBE_DEFER and we'll go back to the first case where things were golden. So this case is OK too even if it's a bit ugly that we have to keep creating / deleting the AUX bus over and over. So where is the problem? As I said, it's mostly about robustness. I don't believe that step rib#2 (creating the sub-devices) is really guaranteed to be synchronous. This is evidenced by the fact that it's allowed to "succeed" by just sticking the device on the deferred list. If anything about the process changes in Linux as a whole and step rib#2 just kicks off the probe of the DP AUX endpoints (our panel) in the background then we'd be in trouble because we might never get the panel in step rib#3. Adding an extra sub-device means we just don't need to worry about it. We'll create the sub-device for the DP AUX bus and it won't go away until the whole ti-sn65dsi86 driver goes away. If the bridge sub-device defers (maybe because it can't find the panel) that won't depopulate the DP AUX bus and so we don't need to worry about it. NOTE: there's a little bit of a trick here. Though the AUX channel can run without the MIPI-to-eDP bits of the code, the MIPI-to-eDP bits can't run without the AUX channel. We could come up a complicated signaling scheme (have the MIPI-to-eDP bits return EPROBE_DEFER for a while or wait on some sort of completion), but it seems simple enough to just not even bother creating the bridge device until the AUX channel probes. That's what we'll do. Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Lyude Paul <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20210611101711.v10.7.If89144992cb9d900f8c91a8d1817dbe00f543720@changeid

Due to a change in requirements that disallows tasklet_disable() being called from atomic context, rearrange the selftest to avoid doing so. <3> [324.942939] BUG: sleeping function called from invalid context at kernel/softirq.c:888 <3> [324.942952] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 5601, name: i915_selftest <4> [324.942960] 1 lock held by i915_selftest/5601: <4> [324.942963] #0: ffff888101d19240 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x18/0x50 <3> [324.942987] Preemption disabled at: <3> [324.942990] [<ffffffffa026fbd2>] live_hold_reset.part.65+0xc2/0x2f0 [i915] <4> [324.943255] CPU: 0 PID: 5601 Comm: i915_selftest Tainted: G U 5.13.0-rc5-CI-CI_DRM_10197+ rib#1 <4> [324.943259] Hardware name: Intel Corp. Geminilake/GLK RVP2 LP4SD (07), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017 <4> [324.943263] Call Trace: <4> [324.943267] dump_stack+0x7f/0xad <4> [324.943276] ___might_sleep.cold.123+0xf2/0x106 <4> [324.943286] tasklet_unlock_wait+0x2e/0xb0 <4> [324.943291] ? ktime_get_raw+0x81/0x120 <4> [324.943305] live_hold_reset.part.65+0x1ab/0x2f0 [i915] <4> [324.943500] __i915_subtests.cold.7+0x42/0x92 [i915] <4> [324.943723] ? __i915_live_teardown+0x50/0x50 [i915] <4> [324.943922] ? __intel_gt_live_setup+0x30/0x30 [i915] Fixes: da04474 ("tasklets: Replace spin wait in tasklet_unlock_wait()") Signed-off-by: Chris Wilson <[email protected]> Reviewed-by: Thomas Hellström <[email protected]> Signed-off-by: Matthew Auld <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

We can't allow spam in CI. Update 26th June 2018: This is still an issue: Update 23rd May 2019: You guessed it, still ocurring. [ 224.739686] ------------[ cut here ]------------ [ 224.739712] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210 [ 224.739714] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcm i915 asix usbnet mii mei_me mei prime_numbers i2c_hid pinctrl_sunrisepoint pinctrl_intel btusb btrtl btbcm btintel bluetooth ecdh_generic [ 224.739775] CPU: 3 PID: 2982 Comm: gem_exec_suspen Tainted: G U W 4.18.0-rc2-CI-Patchwork_9414+ rib#1 [ 224.739777] Hardware name: Dell Inc. XPS 13 9350/, BIOS 1.4.12 11/30/2016 [ 224.739780] RIP: 0010:dev_watchdog+0x1fd/0x210 [ 224.739781] Code: 49 63 4c 24 f0 eb 92 4c 89 ef c6 05 21 46 ad 00 01 e8 77 ee fc ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 88 4c 14 82 e8 a3 fe 84 ff <0f> 0b eb be 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 [ 224.739866] RSP: 0018:ffff88027dd83e40 EFLAGS: 00010286 [ 224.739869] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000102 [ 224.739871] RDX: 0000000080000102 RSI: ffffffff820c8c6c RDI: 00000000ffffffff [ 224.739873] RBP: ffff8802644c1540 R08: 0000000071be9b33 R09: 0000000000000000 [ 224.739874] R10: ffff88027dd83dc0 R11: 0000000000000000 R12: ffff8802644c1588 [ 224.739876] R13: ffff8802644c1160 R14: 0000000000000001 R15: ffff88026a5dc728 [ 224.739878] FS: 00007f18f4887980(0000) GS:ffff88027dd80000(0000) knlGS:0000000000000000 [ 224.739880] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 224.739881] CR2: 00007f4c627ae548 CR3: 000000022ca1a002 CR4: 00000000003606e0 [ 224.739883] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 224.739885] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 224.739886] Call Trace: [ 224.739888] <IRQ> [ 224.739892] ? qdisc_reset+0xe0/0xe0 [ 224.739894] ? qdisc_reset+0xe0/0xe0 [ 224.739897] call_timer_fn+0x93/0x360 [ 224.739903] expire_timers+0xc1/0x1d0 [ 224.739908] run_timer_softirq+0xc7/0x170 [ 224.739916] __do_softirq+0xd9/0x505 [ 224.739923] irq_exit+0xa9/0xc0 [ 224.739926] smp_apic_timer_interrupt+0x9c/0x2d0 [ 224.739929] apic_timer_interrupt+0xf/0x20 [ 224.739931] </IRQ> [ 224.739934] RIP: 0010:delay_tsc+0x2e/0xb0 [ 224.739936] Code: 49 89 fc 55 53 bf 01 00 00 00 e8 6d 2c 78 ff e8 88 9d b6 ff 41 89 c5 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d5 eb 16 f3 90 <bf> 01 00 00 00 e8 48 2c 78 ff e8 63 9d b6 ff 44 39 e8 75 36 0f ae [ 224.740021] RSP: 0018:ffffc900002f7d48 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13 [ 224.740024] RAX: 0000000080000000 RBX: 0000000649565ca9 RCX: 0000000000000001 [ 224.740026] RDX: 0000000080000001 RSI: ffffffff820c8c6c RDI: 00000000ffffffff [ 224.740027] RBP: 00000006493ea9ce R08: 000000005e81e2ee R09: 0000000000000000 [ 224.740029] R10: 0000000000000120 R11: 0000000000000000 R12: 00000000002ad8d6 [ 224.740030] R13: 0000000000000003 R14: 0000000000000004 R15: ffff88025caf5408 [ 224.740040] ? delay_tsc+0x66/0xb0 [ 224.740045] hibernation_debug_sleep+0x1c/0x30 [ 224.740048] hibernation_snapshot+0x2c1/0x690 [ 224.740053] hibernate+0x142/0x2a4 [ 224.740057] state_store+0xd0/0xe0 [ 224.740063] kernfs_fop_write+0x104/0x190 [ 224.740068] __vfs_write+0x31/0x180 [ 224.740072] ? rcu_read_lock_sched_held+0x6f/0x80 [ 224.740075] ? rcu_sync_lockdep_assert+0x29/0x50 [ 224.740078] ? __sb_start_write+0x152/0x1f0 [ 224.740080] ? __sb_start_write+0x168/0x1f0 [ 224.740084] vfs_write+0xbd/0x1a0 [ 224.740088] ksys_write+0x50/0xc0 [ 224.740094] do_syscall_64+0x55/0x190 [ 224.740097] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 224.740099] RIP: 0033:0x7f18f400a281 [ 224.740100] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53 [ 224.740186] RSP: 002b:00007fffd1f4fec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 224.740189] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f18f400a281 [ 224.740190] RDX: 0000000000000004 RSI: 00007f18f448069a RDI: 0000000000000006 [ 224.740192] RBP: 00007fffd1f4fef0 R08: 0000000000000000 R09: 0000000000000000 [ 224.740194] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e795d03400 [ 224.740195] R13: 00007fffd1f50500 R14: 0000000000000000 R15: 0000000000000000 [ 224.740205] irq event stamp: 1582591 [ 224.740207] hardirqs last enabled at (1582590): [<ffffffff810f9f9c>] vprintk_emit+0x4bc/0x4d0 [ 224.740210] hardirqs last disabled at (1582591): [<ffffffff81a0111c>] error_entry+0x7c/0x100 [ 224.740212] softirqs last enabled at (1582568): [<ffffffff81c0034f>] __do_softirq+0x34f/0x505 [ 224.740215] softirqs last disabled at (1582571): [<ffffffff8108c959>] irq_exit+0xa9/0xc0 [ 224.740218] WARNING: CPU: 3 PID: 2982 at net/sched/sch_generic.c:461 dev_watchdog+0x1fd/0x210 [ 224.740219] ---[ end trace 6e41d690e611c338 ]--- References: https://bugzilla.kernel.org/show_bug.cgi?id=196399 Acked-by: Martin Peres <[email protected]> Cc: Martin Peres <[email protected]> Signed-off-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]>

If the MSI is already enabled, trying to enable it again results in an -EINVAL and on the first attempt a WARN. That WARN causes our CI to abort the run [on each first attempt to suspend]: <4> [463.142025] WARNING: CPU: 0 PID: 2225 at drivers/pci/msi.c:1074 __pci_enable_msi_range+0x3cb/0x420 <4> [463.142026] Modules linked in: snd_hda_intel i915 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec btusb btrtl btbcm btintel e1000e bluetooth snd_hwdep snd_hda_core ptp ecdh_generic snd_pcm ecc pps_core mei_me mei prime_numbers [last unloaded: i915] <4> [463.142045] CPU: 0 PID: 2225 Comm: kworker/u8:14 Tainted: G U 5.7.0-rc2-CI-CI_DRM_8350+ rib#1 <4> [463.142046] Hardware name: Intel Corporation NUC7i5BNH/NUC7i5BNB, BIOS BNKBL357.86A.0060.2017.1214.2013 12/14/2017 <4> [463.142049] Workqueue: events_unbound async_run_entry_fn <4> [463.142051] RIP: 0010:__pci_enable_msi_range+0x3cb/0x420 <4> [463.142053] Code: 76 58 49 8d 56 48 48 89 df e8 31 73 fd ff e9 20 fe ff ff 31 f6 48 89 df e8 c2 e9 fd ff e9 d6 fe ff ff 45 89 fc e9 1a ff ff ff <0f> 0b 41 bc ea ff ff ff e9 0d ff ff ff 41 bc ea ff ff ff e9 02 ff <4> [463.142054] RSP: 0018:ffffc90000593cd0 EFLAGS: 00010202 <4> [463.142056] RAX: 0000000000000010 RBX: ffff888274051000 RCX: 0000000000000000 <4> [463.142057] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888274051000 <4> [463.142058] RBP: ffff888238aa1018 R08: 0000000000000001 R09: 0000000000000001 <4> [463.142060] R10: ffffc90000593d90 R11: 00000000c79cdfd5 R12: ffff8882740510b0 <4> [463.142061] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 <4> [463.142062] FS: 0000000000000000(0000) GS:ffff888276c00000(0000) knlGS:0000000000000000 <4> [463.142064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [463.142065] CR2: 000055706f347d80 CR3: 0000000005610003 CR4: 00000000003606f0 <4> [463.142066] Call Trace: <4> [463.142073] pci_enable_msi+0x11/0x20 <4> [463.142077] azx_resume+0x1ab/0x200 [snd_hda_intel] <4> [463.142080] ? pci_pm_thaw+0x80/0x80 <4> [463.142084] dpm_run_callback+0x64/0x280 <4> [463.142089] device_resume+0xd4/0x1c0 <4> [463.142093] ? dpm_watchdog_set+0x60/0 While this would appear to be a bug in snd-hda, it does appear inconsequential, at least for gfx-ci. Downgrade the warning to an info, like the other already-enabled error for MSI-X. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1687 Signed-off-by: Chris Wilson <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]>

We inadvertently create a dependency on mmap_sem with a whole chain. This breaks any user who wants to take a lock and call rcu_barrier(), while also taking that lock inside mmap_sem: <4> [604.892532] ====================================================== <4> [604.892534] WARNING: possible circular locking dependency detected <4> [604.892536] 5.6.0-rc7-CI-Patchwork_17096+ rib#1 Tainted: G U <4> [604.892537] ------------------------------------------------------ <4> [604.892538] kms_frontbuffer/2595 is trying to acquire lock: <4> [604.892540] ffffffff8264a558 (rcu_state.barrier_mutex){+.+.}, at: rcu_barrier+0x23/0x190 <4> [604.892547] but task is already holding lock: <4> [604.892547] ffff888484716050 (reservation_ww_class_mutex){+.+.}, at: i915_gem_object_pin_to_display_plane+0x89/0x270 [i915] <4> [604.892592] which lock already depends on the new lock. <4> [604.892593] the existing dependency chain (in reverse order) is: <4> [604.892594] -> rib#6 (reservation_ww_class_mutex){+.+.}: <4> [604.892597] __ww_mutex_lock.constprop.15+0xc3/0x1090 <4> [604.892598] ww_mutex_lock+0x39/0x70 <4> [604.892600] dma_resv_lockdep+0x10e/0x1f5 <4> [604.892602] do_one_initcall+0x58/0x300 <4> [604.892604] kernel_init_freeable+0x17b/0x1dc <4> [604.892605] kernel_init+0x5/0x100 <4> [604.892606] ret_from_fork+0x24/0x50 <4> [604.892607] -> rib#5 (reservation_ww_class_acquire){+.+.}: <4> [604.892609] dma_resv_lockdep+0xec/0x1f5 <4> [604.892610] do_one_initcall+0x58/0x300 <4> [604.892610] kernel_init_freeable+0x17b/0x1dc <4> [604.892611] kernel_init+0x5/0x100 <4> [604.892612] ret_from_fork+0x24/0x50 <4> [604.892613] -> rib#4 (&mm->mmap_sem#2){++++}: <4> [604.892615] __might_fault+0x63/0x90 <4> [604.892617] _copy_to_user+0x1e/0x80 <4> [604.892619] perf_read+0x200/0x2b0 <4> [604.892621] vfs_read+0x96/0x160 <4> [604.892622] ksys_read+0x9f/0xe0 <4> [604.892623] do_syscall_64+0x4f/0x220 <4> [604.892624] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [604.892625] -> rib#3 (&cpuctx_mutex){+.+.}: <4> [604.892626] __mutex_lock+0x9a/0x9c0 <4> [604.892627] perf_event_init_cpu+0xa4/0x140 <4> [604.892629] perf_event_init+0x19d/0x1cd <4> [604.892630] start_kernel+0x362/0x4e4 <4> [604.892631] secondary_startup_64+0xa4/0xb0 <4> [604.892631] -> rib#2 (pmus_lock){+.+.}: <4> [604.892633] __mutex_lock+0x9a/0x9c0 <4> [604.892633] perf_event_init_cpu+0x6b/0x140 <4> [604.892635] cpuhp_invoke_callback+0x9b/0x9d0 <4> [604.892636] _cpu_up+0xa2/0x140 <4> [604.892637] do_cpu_up+0x61/0xa0 <4> [604.892639] smp_init+0x57/0x96 <4> [604.892639] kernel_init_freeable+0x87/0x1dc <4> [604.892640] kernel_init+0x5/0x100 <4> [604.892642] ret_from_fork+0x24/0x50 <4> [604.892642] -> rib#1 (cpu_hotplug_lock.rw_sem){++++}: <4> [604.892643] cpus_read_lock+0x34/0xd0 <4> [604.892644] rcu_barrier+0xaa/0x190 <4> [604.892645] kernel_init+0x21/0x100 <4> [604.892647] ret_from_fork+0x24/0x50 <4> [604.892647] -> #0 (rcu_state.barrier_mutex){+.+.}: <4> [604.892649] __lock_acquire+0x1328/0x15d0 <4> [604.892650] lock_acquire+0xa7/0x1c0 <4> [604.892651] __mutex_lock+0x9a/0x9c0 <4> [604.892652] rcu_barrier+0x23/0x190 <4> [604.892680] i915_gem_object_unbind+0x29d/0x3f0 [i915] <4> [604.892707] i915_gem_object_pin_to_display_plane+0x141/0x270 [i915] <4> [604.892737] intel_pin_and_fence_fb_obj+0xec/0x1f0 [i915] <4> [604.892767] intel_plane_pin_fb+0x3f/0xd0 [i915] <4> [604.892797] intel_prepare_plane_fb+0x13b/0x5c0 [i915] <4> [604.892798] drm_atomic_helper_prepare_planes+0x85/0x110 <4> [604.892827] intel_atomic_commit+0xda/0x390 [i915] <4> [604.892828] drm_atomic_helper_set_config+0x57/0xa0 <4> [604.892830] drm_mode_setcrtc+0x1c4/0x720 <4> [604.892830] drm_ioctl_kernel+0xb0/0xf0 <4> [604.892831] drm_ioctl+0x2e1/0x390 <4> [604.892833] ksys_ioctl+0x7b/0x90 <4> [604.892835] __x64_sys_ioctl+0x11/0x20 <4> [604.892835] do_syscall_64+0x4f/0x220 <4> [604.892836] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [604.892837] Changes since v1: - Use (*values)[n++] in perf_read_one(). Changes since v2: - Centrally allocate values. Signed-off-by: Maarten Lankhorst <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]>

As previously noted in commit 66e4f4a ("rtc: cmos: Use spin_lock_irqsave() in cmos_interrupt()"): <4>[ 254.192378] WARNING: inconsistent lock state <4>[ 254.192384] 5.12.0-rc1-CI-CI_DRM_9834+ rib#1 Not tainted <4>[ 254.192396] -------------------------------- <4>[ 254.192400] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. <4>[ 254.192409] rtcwake/5309 [HC0[0]:SC0[0]:HE1:SE1] takes: <4>[ 254.192429] ffffffff8263c5f8 (rtc_lock){?...}-{2:2}, at: cmos_interrupt+0x18/0x100 <4>[ 254.192481] {IN-HARDIRQ-W} state was registered at: <4>[ 254.192488] lock_acquire+0xd1/0x3d0 <4>[ 254.192504] _raw_spin_lock+0x2a/0x40 <4>[ 254.192519] cmos_interrupt+0x18/0x100 <4>[ 254.192536] rtc_handler+0x1f/0xc0 <4>[ 254.192553] acpi_ev_fixed_event_detect+0x109/0x13c <4>[ 254.192574] acpi_ev_sci_xrupt_handler+0xb/0x28 <4>[ 254.192596] acpi_irq+0x13/0x30 <4>[ 254.192620] __handle_irq_event_percpu+0x43/0x2c0 <4>[ 254.192641] handle_irq_event_percpu+0x2b/0x70 <4>[ 254.192661] handle_irq_event+0x2f/0x50 <4>[ 254.192680] handle_fasteoi_irq+0x9e/0x150 <4>[ 254.192693] __common_interrupt+0x76/0x140 <4>[ 254.192715] common_interrupt+0x96/0xc0 <4>[ 254.192732] asm_common_interrupt+0x1e/0x40 <4>[ 254.192750] _raw_spin_unlock_irqrestore+0x38/0x60 <4>[ 254.192767] resume_irqs+0xba/0xf0 <4>[ 254.192786] dpm_resume_noirq+0x245/0x3d0 <4>[ 254.192811] suspend_devices_and_enter+0x230/0xaa0 <4>[ 254.192835] pm_suspend.cold.8+0x301/0x34a <4>[ 254.192859] state_store+0x7b/0xe0 <4>[ 254.192879] kernfs_fop_write_iter+0x11d/0x1c0 <4>[ 254.192899] new_sync_write+0x11d/0x1b0 <4>[ 254.192916] vfs_write+0x265/0x390 <4>[ 254.192933] ksys_write+0x5a/0xd0 <4>[ 254.192949] do_syscall_64+0x33/0x80 <4>[ 254.192965] entry_SYSCALL_64_after_hwframe+0x44/0xae <4>[ 254.192986] irq event stamp: 43775 <4>[ 254.192994] hardirqs last enabled at (43775): [<ffffffff81c00c42>] asm_sysvec_apic_timer_interrupt+0x12/0x20 <4>[ 254.193023] hardirqs last disabled at (43774): [<ffffffff81aa691a>] sysvec_apic_timer_interrupt+0xa/0xb0 <4>[ 254.193049] softirqs last enabled at (42548): [<ffffffff81e00342>] __do_softirq+0x342/0x48e <4>[ 254.193074] softirqs last disabled at (42543): [<ffffffff810b45fd>] irq_exit_rcu+0xad/0xd0 <4>[ 254.193101] other info that might help us debug this: <4>[ 254.193107] Possible unsafe locking scenario: <4>[ 254.193112] CPU0 <4>[ 254.193117] ---- <4>[ 254.193121] lock(rtc_lock); <4>[ 254.193137] <Interrupt> <4>[ 254.193142] lock(rtc_lock); <4>[ 254.193156] *** DEADLOCK *** <4>[ 254.193161] 6 locks held by rtcwake/5309: <4>[ 254.193174] #0: ffff888104861430 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x5a/0xd0 <4>[ 254.193232] rib#1: ffff88810f823288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xe7/0x1c0 <4>[ 254.193282] rib#2: ffff888100cef3c0 (kn->active#285 <7>[ 254.192706] i915 0000:00:02.0: [drm:intel_modeset_setup_hw_state [i915]] [CRTC:51:pipe A] hw state readout: disabled <4>[ 254.193307] ){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xf0/0x1c0 <4>[ 254.193333] rib#3: ffffffff82649fa8 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend.cold.8+0xce/0x34a <4>[ 254.193387] rib#4: ffffffff827a2108 (acpi_scan_lock){+.+.}-{3:3}, at: acpi_suspend_begin+0x47/0x70 <4>[ 254.193433] rib#5: ffff8881019ea178 (&dev->mutex){....}-{3:3}, at: device_resume+0x68/0x1e0 <4>[ 254.193485] stack backtrace: <4>[ 254.193492] CPU: 1 PID: 5309 Comm: rtcwake Not tainted 5.12.0-rc1-CI-CI_DRM_9834+ rib#1 <4>[ 254.193514] Hardware name: Google Soraka/Soraka, BIOS MrChromebox-4.10 08/25/2019 <4>[ 254.193524] Call Trace: <4>[ 254.193536] dump_stack+0x7f/0xad <4>[ 254.193567] mark_lock.part.47+0x8ca/0xce0 <4>[ 254.193604] __lock_acquire+0x39b/0x2590 <4>[ 254.193626] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 <4>[ 254.193660] lock_acquire+0xd1/0x3d0 <4>[ 254.193677] ? cmos_interrupt+0x18/0x100 <4>[ 254.193716] _raw_spin_lock+0x2a/0x40 <4>[ 254.193735] ? cmos_interrupt+0x18/0x100 <4>[ 254.193758] cmos_interrupt+0x18/0x100 <4>[ 254.193785] cmos_resume+0x2ac/0x2d0 <4>[ 254.193813] ? acpi_pm_set_device_wakeup+0x1f/0x110 <4>[ 254.193842] ? pnp_bus_suspend+0x10/0x10 <4>[ 254.193864] pnp_bus_resume+0x5e/0x90 <4>[ 254.193885] dpm_run_callback+0x5f/0x240 <4>[ 254.193914] device_resume+0xb2/0x1e0 <4>[ 254.193942] ? pm_dev_err+0x25/0x25 <4>[ 254.193974] dpm_resume+0xea/0x3f0 <4>[ 254.194005] dpm_resume_end+0x8/0x10 <4>[ 254.194030] suspend_devices_and_enter+0x29b/0xaa0 <4>[ 254.194066] pm_suspend.cold.8+0x301/0x34a <4>[ 254.194094] state_store+0x7b/0xe0 <4>[ 254.194124] kernfs_fop_write_iter+0x11d/0x1c0 <4>[ 254.194151] new_sync_write+0x11d/0x1b0 <4>[ 254.194183] vfs_write+0x265/0x390 <4>[ 254.194207] ksys_write+0x5a/0xd0 <4>[ 254.194232] do_syscall_64+0x33/0x80 <4>[ 254.194251] entry_SYSCALL_64_after_hwframe+0x44/0xae <4>[ 254.194274] RIP: 0033:0x7f07d79691e7 <4>[ 254.194293] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 <4>[ 254.194312] RSP: 002b:00007ffd9cc2c768 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 <4>[ 254.194337] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f07d79691e7 <4>[ 254.194352] RDX: 0000000000000004 RSI: 0000556ebfc63590 RDI: 000000000000000b <4>[ 254.194366] RBP: 0000556ebfc63590 R08: 0000000000000000 R09: 0000000000000004 <4>[ 254.194379] R10: 0000556ebf0ec2a6 R11: 0000000000000246 R12: 0000000000000004 which breaks S3-resume on fi-kbl-soraka presumably as that's slow enough to trigger the alarm during the suspend. Fixes: 6950d04 ("rtc: cmos: Replace spin_lock_irqsave with spin_lock in hard IRQ") References: 66e4f4a ("rtc: cmos: Use spin_lock_irqsave() in cmos_interrupt()"): Signed-off-by: Chris Wilson <[email protected]> Cc: Xiaofei Tan <[email protected]> Cc: Alexandre Belloni <[email protected]> Cc: Alessandro Zummo <[email protected]> Cc: Ville Syrjälä <[email protected]> Signed-off-by: Rodrigo Vivi <[email protected]>

We are accessing "desc->ops" in sof_pci_probe without checking "desc" pointer. This results in NULL pointer exception if pci_id->driver_data i.e desc pointer isn't defined in sof device probe: BUG: kernel NULL pointer dereference, address: 0000000000000060 PGD 0 P4D 0 Oops: 0000 [rib#1] PREEMPT SMP NOPTI RIP: 0010:sof_pci_probe+0x1e/0x17f [snd_sof_pci] Code: Unable to access opcode bytes at RIP 0xffffffffc043dff4. RSP: 0018:ffffac4b03b9b8d8 EFLAGS: 00010246 Add NULL pointer check for sof_dev_desc pointer to avoid such exception. Reviewed-by: Ranjani Sridharan <[email protected]> Signed-off-by: Ajit Kumar Pandey <[email protected]> Signed-off-by: Pierre-Louis Bossart <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>

Current DP driver implementation has adding safe mode done at dp_hpd_plug_handle() which is expected to be executed under event thread context. However there is possible circular locking happen (see blow stack trace) after edp driver call dp_hpd_plug_handle() from dp_bridge_enable() which is executed under drm_thread context. After review all possibilities methods and as discussed on https://patchwork.freedesktop.org/patch/483155/, supporting EDID compliance tests in the driver is quite hacky. As seen with other vendor drivers, supporting these will be much easier with IGT. Hence removing all the related fail safe code for it so that no possibility of circular lock will happen. Reviewed-by: Stephen Boyd <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Reviewed-by: Dmitry Baryshkov <[email protected]> ====================================================== WARNING: possible circular locking dependency detected 5.15.35-lockdep rib#6 Tainted: G W ------------------------------------------------------ frecon/429 is trying to acquire lock: ffffff808dc3c4e8 (&dev->mode_config.mutex){+.+.}-{3:3}, at: dp_panel_add_fail_safe_mode+0x4c/0xa0 but task is already holding lock: ffffff808dc441e0 (&kms->commit_lock[i]){+.+.}-{3:3}, at: lock_crtcs+0xb4/0x124 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> rib#3 (&kms->commit_lock[i]){+.+.}-{3:3}: __mutex_lock_common+0x174/0x1a64 mutex_lock_nested+0x98/0xac lock_crtcs+0xb4/0x124 msm_atomic_commit_tail+0x330/0x748 commit_tail+0x19c/0x278 drm_atomic_helper_commit+0x1dc/0x1f0 drm_atomic_commit+0xc0/0xd8 drm_atomic_helper_set_config+0xb4/0x134 drm_mode_setcrtc+0x688/0x1248 drm_ioctl_kernel+0x1e4/0x338 drm_ioctl+0x3a4/0x684 __arm64_sys_ioctl+0x118/0x154 invoke_syscall+0x78/0x224 el0_svc_common+0x178/0x200 do_el0_svc+0x94/0x13c el0_svc+0x5c/0xec el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 -> rib#2 (crtc_ww_class_mutex){+.+.}-{3:3}: __mutex_lock_common+0x174/0x1a64 ww_mutex_lock+0xb8/0x278 modeset_lock+0x304/0x4ac drm_modeset_lock+0x4c/0x7c drmm_mode_config_init+0x4a8/0xc50 msm_drm_init+0x274/0xac0 msm_drm_bind+0x20/0x2c try_to_bring_up_master+0x3dc/0x470 __component_add+0x18c/0x3c0 component_add+0x1c/0x28 dp_display_probe+0x954/0xa98 platform_probe+0x124/0x15c really_probe+0x1b0/0x5f8 __driver_probe_device+0x174/0x20c driver_probe_device+0x70/0x134 __device_attach_driver+0x130/0x1d0 bus_for_each_drv+0xfc/0x14c __device_attach+0x1bc/0x2bc device_initial_probe+0x1c/0x28 bus_probe_device+0x94/0x178 deferred_probe_work_func+0x1a4/0x1f0 process_one_work+0x5d4/0x9dc worker_thread+0x898/0xccc kthread+0x2d4/0x3d4 ret_from_fork+0x10/0x20 -> rib#1 (crtc_ww_class_acquire){+.+.}-{0:0}: ww_acquire_init+0x1c4/0x2c8 drm_modeset_acquire_init+0x44/0xc8 drm_helper_probe_single_connector_modes+0xb0/0x12dc drm_mode_getconnector+0x5dc/0xfe8 drm_ioctl_kernel+0x1e4/0x338 drm_ioctl+0x3a4/0x684 __arm64_sys_ioctl+0x118/0x154 invoke_syscall+0x78/0x224 el0_svc_common+0x178/0x200 do_el0_svc+0x94/0x13c el0_svc+0x5c/0xec el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 -> #0 (&dev->mode_config.mutex){+.+.}-{3:3}: __lock_acquire+0x2650/0x672c lock_acquire+0x1b4/0x4ac __mutex_lock_common+0x174/0x1a64 mutex_lock_nested+0x98/0xac dp_panel_add_fail_safe_mode+0x4c/0xa0 dp_hpd_plug_handle+0x1f0/0x280 dp_bridge_enable+0x94/0x2b8 drm_atomic_bridge_chain_enable+0x11c/0x168 drm_atomic_helper_commit_modeset_enables+0x500/0x740 msm_atomic_commit_tail+0x3e4/0x748 commit_tail+0x19c/0x278 drm_atomic_helper_commit+0x1dc/0x1f0 drm_atomic_commit+0xc0/0xd8 drm_atomic_helper_set_config+0xb4/0x134 drm_mode_setcrtc+0x688/0x1248 drm_ioctl_kernel+0x1e4/0x338 drm_ioctl+0x3a4/0x684 __arm64_sys_ioctl+0x118/0x154 invoke_syscall+0x78/0x224 el0_svc_common+0x178/0x200 do_el0_svc+0x94/0x13c el0_svc+0x5c/0xec el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 Changes in v2: -- re text commit title -- remove all fail safe mode Changes in v3: -- remove dp_panel_add_fail_safe_mode() from dp_panel.h -- add Fixes Changes in v5: -- [email protected] Changes in v6: -- fix Fixes commit ID Fixes: 8b2c181 ("drm/msm/dp: add fail safe mode outside of event_mutex context") Reported-by: Douglas Anderson <[email protected]> Signed-off-by: Kuogee Hsieh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Rob Clark <[email protected]>

[ 168.544078] ====================================================== [ 168.550309] WARNING: possible circular locking dependency detected [ 168.556523] 5.16.0-kfd-fkuehlin torvalds#148 Tainted: G E [ 168.562558] ------------------------------------------------------ [ 168.568764] kfdtest/3479 is trying to acquire lock: [ 168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at: kfd_topology_device_by_id+0x16/0x60 [amdgpu] [ 168.583663] but task is already holding lock: [ 168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0xa9/0x180 [ 168.597755] which lock already depends on the new lock. [ 168.605970] the existing dependency chain (in reverse order) is: [ 168.613487] -> rib#3 (&mm->mmap_lock#2){++++}-{3:3}: [ 168.619700] lock_acquire+0xca/0x2e0 [ 168.623814] down_read+0x3e/0x140 [ 168.627676] do_user_addr_fault+0x40d/0x690 [ 168.632399] exc_page_fault+0x6f/0x270 [ 168.636692] asm_exc_page_fault+0x1e/0x30 [ 168.641249] filldir64+0xc8/0x1e0 [ 168.645115] call_filldir+0x7c/0x110 [ 168.649238] ext4_readdir+0x58e/0x940 [ 168.653442] iterate_dir+0x16a/0x1b0 [ 168.657558] __x64_sys_getdents64+0x83/0x140 [ 168.662375] do_syscall_64+0x35/0x80 [ 168.666492] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 168.672095] -> rib#2 (&type->i_mutex_dir_key#6){++++}-{3:3}: [ 168.679008] lock_acquire+0xca/0x2e0 [ 168.683122] down_read+0x3e/0x140 [ 168.686982] path_openat+0x5b2/0xa50 [ 168.691095] do_file_open_root+0xfc/0x190 [ 168.695652] file_open_root+0xd8/0x1b0 [ 168.702010] kernel_read_file_from_path_initns+0xc4/0x140 [ 168.709542] _request_firmware+0x2e9/0x5e0 [ 168.715741] request_firmware+0x32/0x50 [ 168.721667] amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu] [ 168.730060] smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu] [ 168.738414] fiji_start_smu+0xcf/0x4e0 [amdgpu] [ 168.745539] pp_dpm_load_fw+0x21/0x30 [amdgpu] [ 168.752503] amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu] [ 168.760698] amdgpu_device_fw_loading+0xb8/0x140 [amdgpu] [ 168.768412] amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu] [ 168.776285] amdgpu_driver_load_kms+0x15/0x120 [amdgpu] [ 168.784034] amdgpu_pci_probe+0x19b/0x3a0 [amdgpu] [ 168.791161] local_pci_probe+0x40/0x80 [ 168.797027] work_for_cpu_fn+0x10/0x20 [ 168.802839] process_one_work+0x273/0x5b0 [ 168.808903] worker_thread+0x20f/0x3d0 [ 168.814700] kthread+0x176/0x1a0 [ 168.819968] ret_from_fork+0x1f/0x30 [ 168.825563] -> rib#1 (&adev->pm.mutex){+.+.}-{3:3}: [ 168.834721] lock_acquire+0xca/0x2e0 [ 168.840364] __mutex_lock+0xa2/0x930 [ 168.846020] amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu] [ 168.853257] amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu] [ 168.861547] kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu] [ 168.869478] kfd_create_crat_image_virtual+0x447/0x510 [amdgpu] [ 168.877884] kfd_topology_add_device+0x5c8/0x6f0 [amdgpu] [ 168.885556] kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu] [ 168.893347] amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu] [ 168.901177] amdgpu_device_init.cold+0x141b/0x1716 [amdgpu] [ 168.909025] amdgpu_driver_load_kms+0x15/0x120 [amdgpu] [ 168.916458] amdgpu_pci_probe+0x19b/0x3a0 [amdgpu] [ 168.923442] local_pci_probe+0x40/0x80 [ 168.929249] work_for_cpu_fn+0x10/0x20 [ 168.935008] process_one_work+0x273/0x5b0 [ 168.940944] worker_thread+0x20f/0x3d0 [ 168.946623] kthread+0x176/0x1a0 [ 168.951765] ret_from_fork+0x1f/0x30 [ 168.957277] -> #0 (&topology_lock){++++}-{3:3}: [ 168.965993] check_prev_add+0x8f/0xbf0 [ 168.971613] __lock_acquire+0x1299/0x1ca0 [ 168.977485] lock_acquire+0xca/0x2e0 [ 168.982877] down_read+0x3e/0x140 [ 168.987975] kfd_topology_device_by_id+0x16/0x60 [amdgpu] [ 168.995583] kfd_device_by_id+0xa/0x20 [amdgpu] [ 169.002180] kfd_mmap+0x95/0x200 [amdgpu] [ 169.008293] mmap_region+0x337/0x5a0 [ 169.013679] do_mmap+0x3aa/0x540 [ 169.018678] vm_mmap_pgoff+0xdc/0x180 [ 169.024095] ksys_mmap_pgoff+0x186/0x1f0 [ 169.029734] do_syscall_64+0x35/0x80 [ 169.035005] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 169.041754] other info that might help us debug this: [ 169.053276] Chain exists of: &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2 [ 169.068389] Possible unsafe locking scenario: [ 169.076661] CPU0 CPU1 [ 169.082383] ---- ---- [ 169.088087] lock(&mm->mmap_lock#2); [ 169.092922] lock(&type->i_mutex_dir_key#6); [ 169.100975] lock(&mm->mmap_lock#2); [ 169.108320] lock(&topology_lock); [ 169.112957] *** DEADLOCK *** This commit fixes the deadlock warning by ensuring pm.mutex is not held while holding the topology lock. For this, kfd_local_mem_info is moved into the KFD dev struct and filled during device init. This cached value can then be used instead of querying the value again and again. Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

Kernel panic when injecting memory_failure for the global huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows. Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:2499! invalid opcode: 0000 [rib#1] PREEMPT SMP PTI CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ rib#11 Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 RIP: 0010:split_huge_page_to_list+0x66a/0x880 Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: try_to_split_thp_page+0x3a/0x130 memory_failure+0x128/0x800 madvise_inject_error.cold+0x8b/0xa1 __x64_sys_madvise+0x54/0x60 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc3754f8bf9 Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 We think that raising BUG is overkilling for splitting huge_zero_page, the huge_zero_page can't be met from normal paths other than memory failure, but memory failure is a valid caller. So we tend to replace the BUG to WARN + returning -EBUSY, and thus the panic above won't happen again. Link: https://lkml.kernel.org/r/f35f8b97377d5d3ede1bc5ac3114da888c57cbce.1651052574.git.xuyu@linux.alibaba.com Fixes: d173d54 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()") Fixes: 6a46079 ("HWPOISON: The high level memory error handler in the VM v7") Signed-off-by: Xu Yu <[email protected]> Suggested-by: Yang Shi <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

The following VM_BUG_ON_FOLIO() is triggered when memory error event happens on the (thp/folio) pages which are about to be freed: [ 1160.232771] page:00000000b36a8a0f refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000 [ 1160.236916] page:00000000b36a8a0f refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x16a000 [ 1160.240684] flags: 0x57ffffc0800000(hwpoison|node=1|zone=2|lastcpupid=0x1fffff) [ 1160.243458] raw: 0057ffffc0800000 dead000000000100 dead000000000122 0000000000000000 [ 1160.246268] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 [ 1160.249197] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio)) [ 1160.251815] ------------[ cut here ]------------ [ 1160.253438] kernel BUG at include/linux/mm.h:788! [ 1160.256162] invalid opcode: 0000 [rib#1] PREEMPT SMP PTI [ 1160.258172] CPU: 2 PID: 115368 Comm: mceinj.sh Tainted: G E 5.18.0-rc1-v5.18-rc1-220404-2353-005-g83111+ rib#3 [ 1160.262049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [ 1160.265103] RIP: 0010:dump_page.cold+0x27e/0x2bd [ 1160.266757] Code: fe ff ff 48 c7 c6 81 f1 5a 98 e9 4c fe ff ff 48 c7 c6 a1 95 59 98 e9 40 fe ff ff 48 c7 c6 50 bf 5a 98 48 89 ef e8 9d 04 6d ff <0f> 0b 41 f7 c4 ff 0f 00 00 0f 85 9f fd ff ff 49 8b 04 24 a9 00 00 [ 1160.273180] RSP: 0018:ffffaa2c4d59fd18 EFLAGS: 00010292 [ 1160.274969] RAX: 000000000000003e RBX: 0000000000000001 RCX: 0000000000000000 [ 1160.277263] RDX: 0000000000000001 RSI: ffffffff985995a1 RDI: 00000000ffffffff [ 1160.279571] RBP: ffffdc9c45a80000 R08: 0000000000000000 R09: 00000000ffffdfff [ 1160.281794] R10: ffffaa2c4d59fb08 R11: ffffffff98940d08 R12: ffffdc9c45a80000 [ 1160.283920] R13: ffffffff985b6f94 R14: 0000000000000000 R15: ffffdc9c45a80000 [ 1160.286641] FS: 00007eff54ce1740(0000) GS:ffff99c67bd00000(0000) knlGS:0000000000000000 [ 1160.289498] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1160.291106] CR2: 00005628381a5f68 CR3: 0000000104712003 CR4: 0000000000170ee0 [ 1160.293031] Call Trace: [ 1160.293724] <TASK> [ 1160.294334] get_hwpoison_page+0x47d/0x570 [ 1160.295474] memory_failure+0x106/0xaa0 [ 1160.296474] ? security_capable+0x36/0x50 [ 1160.297524] hard_offline_page_store+0x43/0x80 [ 1160.298684] kernfs_fop_write_iter+0x11c/0x1b0 [ 1160.299829] new_sync_write+0xf9/0x160 [ 1160.300810] vfs_write+0x209/0x290 [ 1160.301835] ksys_write+0x4f/0xc0 [ 1160.302718] do_syscall_64+0x3b/0x90 [ 1160.303664] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1160.304981] RIP: 0033:0x7eff54b018b7 As shown in the RIP address, this VM_BUG_ON in folio_entire_mapcount() is called from dump_page("hwpoison: unhandlable page") in get_any_page(). The below explains the mechanism of the race: CPU 0 CPU 1 memory_failure get_hwpoison_page get_any_page dump_page compound = PageCompound free_pages_prepare page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP folio_entire_mapcount VM_BUG_ON_FOLIO(!folio_test_large(folio)) So replace dump_page() with safer one, pr_err(). Link: https://lkml.kernel.org/r/[email protected] Fixes: 74e8ee4 ("mm: Turn head_compound_mapcount() into folio_entire_mapcount()") Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: William Kucharski <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Resource dump menu may span over more than a single page, support it. Otherwise, menu read may result in a memory access violation: reading outside of the allocated page. Note that page format of the first menu page contains menu headers while the proceeding menu pages contain only records. The KASAN logs are as follows: BUG: KASAN: slab-out-of-bounds in strcmp+0x9b/0xb0 Read of size 1 at addr ffff88812b2e1fd0 by task systemd-udevd/496 CPU: 5 PID: 496 Comm: systemd-udevd Tainted: G B 5.16.0_for_upstream_debug_2022_01_10_23_12 rib#1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x7d print_address_description.constprop.0+0x1f/0x140 ? strcmp+0x9b/0xb0 ? strcmp+0x9b/0xb0 kasan_report.cold+0x83/0xdf ? strcmp+0x9b/0xb0 strcmp+0x9b/0xb0 mlx5_rsc_dump_init+0x4ab/0x780 [mlx5_core] ? mlx5_rsc_dump_destroy+0x80/0x80 [mlx5_core] ? lockdep_hardirqs_on_prepare+0x286/0x400 ? raw_spin_unlock_irqrestore+0x47/0x50 ? aomic_notifier_chain_register+0x32/0x40 mlx5_load+0x104/0x2e0 [mlx5_core] mlx5_init_one+0x41b/0x610 [mlx5_core] .... The buggy address belongs to the object at ffff88812b2e0000 which belongs to the cache kmalloc-4k of size 4096 The buggy address is located 4048 bytes to the right of 4096-byte region [ffff88812b2e0000, ffff88812b2e1000) The buggy address belongs to the page: page:000000009d69807a refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812b2e6000 pfn:0x12b2e0 head:000000009d69807a order:3 compound_mapcount:0 compound_pincount:0 flags: 0x8000000000010200(slab|head|zone=2) raw: 8000000000010200 0000000000000000 dead000000000001 ffff888100043040 raw: ffff88812b2e6000 0000000080040000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88812b2e1e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88812b2e1f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88812b2e1f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88812b2e2000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812b2e2080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 12206b1 ("net/mlx5: Add support for resource dump") Signed-off-by: Aya Levin <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>

As reported by Alan, the CFI (Call Frame Information) in the VDSO time routines is incorrect since commit ce7d805 ("powerpc/vdso: Prepare for switching VDSO to generic C implementation."). DWARF has a concept called the CFA (Canonical Frame Address), which on powerpc is calculated as an offset from the stack pointer (r1). That means when the stack pointer is changed there must be a corresponding CFI directive to update the calculation of the CFA. The current code is missing those directives for the changes to r1, which prevents gdb from being able to generate a backtrace from inside VDSO functions, eg: Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime () (gdb) bt #0 0x00007ffff7f804dc in __kernel_clock_gettime () rib#1 0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6 rib#2 0x00007fffffffd960 in ?? () rib#3 0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6 Backtrace stopped: frame did not save the PC Alan helpfully describes some rules for correctly maintaining the CFI information: 1) Every adjustment to the current frame address reg (ie. r1) must be described, and exactly at the instruction where r1 changes. Why? Because stack unwinding might want to access previous frames. 2) If a function changes LR or any non-volatile register, the save location for those regs must be given. The CFI can be at any instruction after the saves up to the point that the reg is changed. (Exception: LR save should be described before a bl. not after) 3) If asychronous unwind info is needed then restores of LR and non-volatile regs must also be described. The CFI can be at any instruction after the reg is restored up to the point where the save location is (potentially) trashed. Fix the inability to backtrace by adding CFI directives describing the changes to r1, ie. satisfying rule 1. Also change the information for LR to point to the copy saved on the stack, not the value in r0 that will be overwritten by the function call. Finally, add CFI directives describing the save/restore of r2. With the fix gdb can correctly back trace and navigate up and down the stack: Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime () (gdb) bt #0 0x00007ffff7f804dc in __kernel_clock_gettime () rib#1 0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6 rib#2 0x0000000100015b60 in gettime () rib#3 0x000000010000c8bc in print_long_format () rib#4 0x000000010000d180 in print_current_files () rib#5 0x00000001000054ac in main () (gdb) up rib#1 0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6 (gdb) rib#2 0x0000000100015b60 in gettime () (gdb) rib#3 0x000000010000c8bc in print_long_format () (gdb) rib#4 0x000000010000d180 in print_current_files () (gdb) rib#5 0x00000001000054ac in main () (gdb) Initial frame selected; you cannot go up. (gdb) down rib#4 0x000000010000d180 in print_current_files () (gdb) rib#3 0x000000010000c8bc in print_long_format () (gdb) rib#2 0x0000000100015b60 in gettime () (gdb) rib#1 0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6 (gdb) #0 0x00007ffff7f804dc in __kernel_clock_gettime () (gdb) Fixes: ce7d805 ("powerpc/vdso: Prepare for switching VDSO to generic C implementation.") Cc: [email protected] # v5.11+ Reported-by: Alan Modra <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Segher Boessenkool <[email protected]> Link: https://lore.kernel.org/r/[email protected]

The calling of siw_cm_upcall and detaching new_cep with its listen_cep should be atomistic semantics. Otherwise siw_reject may be called in a temporary state, e,g, siw_cm_upcall is called but the new_cep->listen_cep has not being cleared. This fixes a WARN: WARNING: CPU: 7 PID: 201 at drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 [siw] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded Tainted: G E 5.17.0-rc7 rib#1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Workqueue: iw_cm_wq cm_work_handler [iw_cm] RIP: 0010:siw_cep_put+0x125/0x130 [siw] Call Trace: <TASK> siw_reject+0xac/0x180 [siw] iw_cm_reject+0x68/0xc0 [iw_cm] cm_work_handler+0x59d/0xe20 [iw_cm] process_one_work+0x1e2/0x3b0 worker_thread+0x50/0x3a0 ? rescuer_thread+0x390/0x390 kthread+0xe5/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Fixes: 6c52fdc ("rdma/siw: connection management") Link: https://lore.kernel.org/r/d528d83466c44687f3872eadcb8c184528b2e2d4.1650526554.git.chengyou@linux.alibaba.com Reported-by: Luis Chamberlain <[email protected]> Reviewed-by: Bernard Metzler <[email protected]> Signed-off-by: Cheng Xu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>

When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging a directory, we don't expect the insertion to fail with -EEXIST, because we are holding the directory's log_mutex and we have dropped all existing BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log the directory. However it's possible that during the logging we attempt to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to happen we need to race with insertions of items from other inodes in the subvolume's tree while we are logging a directory. Here's how this can happen: 1) We are logging a directory with inode number 1000 that has its items spread across 3 leaves in the subvolume's tree: leaf A - has index keys from the range 2 to 20 for example. The last item in the leaf corresponds to a dir item for index number 20. All these dir items were created in a past transaction. leaf B - has index keys from the range 22 to 100 for example. It has no keys from other inodes, all its keys are dir index keys for our directory inode number 1000. Its first key is for the dir item with a sequence number of 22. All these dir items were also created in a past transaction. leaf C - has index keys for our directory for the range 101 to 120 for example. This leaf also has items from other inodes, and its first item corresponds to the dir item for index number 101 for our directory with inode number 1000; 2) When we finish processing the items from leaf A at log_dir_items(), we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last offset of 21, meaning the log is authoritative for the index range from 21 to 21 (a single sequence number). At this point leaf B was not yet modified in the current transaction; 3) When we return from log_dir_items() we have released our read lock on leaf B, and have set *last_offset_ret to 21 (index number of the first item on leaf B minus 1); 4) Some other task inserts an item for other inode (inode number 1001 for example) into leaf C. That resulted in pushing some items from leaf C into leaf B, in order to make room for the new item, so now leaf B has dir index keys for the sequence number range from 22 to 102 and leaf C has the dir items for the sequence number range 103 to 120; 5) At log_directory_changes() we call log_dir_items() again, passing it a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3 plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0 of leaf B, since leaf B was modified in the current transaction. We have also initialized 'last_old_dentry_offset' to 20 after calling btrfs_previous_item() at log_dir_items(), as it left us at the last item of leaf A, which refers to the dir item with sequence number 20; 6) We then call process_dir_items_leaf() to process the dir items of leaf B, and when we process the first item, corresponding to slot 0, sequence number 22, we notice the dir item was created in a past transaction and its sequence number is greater than the value of *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST error from insert_dir_log_key(), as we have already inserted that key at step 2, triggering the assertion at process_dir_items_leaf(). The trace produced in dmesg is like the following: assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857 [198255.980839][ T7460] ------------[ cut here ]------------ [198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617! [198255.983141][ T7460] invalid opcode: 0000 [rib#1] PREEMPT SMP KASAN PTI [198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+ [198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e [198255.989465][ T7460] Code: 8b 4c 89 (...) [198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282 [198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000 [198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24 [198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507 [198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef [198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358 [198256.007082][ T7460] FS: 00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000 [198256.009939][ T7460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0 [198256.015239][ T7460] Call Trace: [198256.015674][ T7460] <TASK> [198256.016313][ T7460] log_dir_items.cold+0x16/0x2c [198256.018858][ T7460] ? replay_one_extent+0xbf0/0xbf0 [198256.025932][ T7460] ? release_extent_buffer+0x1d2/0x270 [198256.029658][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.031114][ T7460] ? lock_acquired+0xbe/0x660 [198256.032633][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.034386][ T7460] ? lock_release+0xcf/0x8a0 [198256.036152][ T7460] log_directory_changes+0xf9/0x170 [198256.036993][ T7460] ? log_dir_items+0xba0/0xba0 [198256.037661][ T7460] ? do_raw_write_unlock+0x7d/0xe0 [198256.038680][ T7460] btrfs_log_inode+0x233b/0x26d0 [198256.041294][ T7460] ? log_directory_changes+0x170/0x170 [198256.042864][ T7460] ? btrfs_attach_transaction_barrier+0x60/0x60 [198256.045130][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.046568][ T7460] ? lock_release+0xcf/0x8a0 [198256.047504][ T7460] ? lock_downgrade+0x420/0x420 [198256.048712][ T7460] ? ilookup5_nowait+0x81/0xa0 [198256.049747][ T7460] ? lock_downgrade+0x420/0x420 [198256.050652][ T7460] ? do_raw_spin_unlock+0xa9/0x100 [198256.051618][ T7460] ? __might_resched+0x128/0x1c0 [198256.052511][ T7460] ? __might_sleep+0x66/0xc0 [198256.053442][ T7460] ? __kasan_check_read+0x11/0x20 [198256.054251][ T7460] ? iget5_locked+0xbd/0x150 [198256.054986][ T7460] ? run_delayed_iput_locked+0x110/0x110 [198256.055929][ T7460] ? btrfs_iget+0xc7/0x150 [198256.056630][ T7460] ? btrfs_orphan_cleanup+0x4a0/0x4a0 [198256.057502][ T7460] ? free_extent_buffer+0x13/0x20 [198256.058322][ T7460] btrfs_log_inode+0x2654/0x26d0 [198256.059137][ T7460] ? log_directory_changes+0x170/0x170 [198256.060020][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.060930][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.061905][ T7460] ? lock_contended+0x770/0x770 [198256.062682][ T7460] ? btrfs_log_inode_parent+0xd04/0x1750 [198256.063582][ T7460] ? lock_downgrade+0x420/0x420 [198256.064432][ T7460] ? preempt_count_sub+0x18/0xc0 [198256.065550][ T7460] ? __mutex_lock+0x580/0xdc0 [198256.066654][ T7460] ? stack_trace_save+0x94/0xc0 [198256.068008][ T7460] ? __kasan_check_write+0x14/0x20 [198256.072149][ T7460] ? __mutex_unlock_slowpath+0x12a/0x430 [198256.073145][ T7460] ? mutex_lock_io_nested+0xcd0/0xcd0 [198256.074341][ T7460] ? wait_for_completion_io_timeout+0x20/0x20 [198256.075345][ T7460] ? lock_downgrade+0x420/0x420 [198256.076142][ T7460] ? lock_contended+0x770/0x770 [198256.076939][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0 [198256.078401][ T7460] ? btrfs_sync_file+0x5e6/0xa40 [198256.080598][ T7460] btrfs_log_inode_parent+0x523/0x1750 [198256.081991][ T7460] ? wait_current_trans+0xc8/0x240 [198256.083320][ T7460] ? lock_downgrade+0x420/0x420 [198256.085450][ T7460] ? btrfs_end_log_trans+0x70/0x70 [198256.086362][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.087544][ T7460] ? lock_release+0xcf/0x8a0 [198256.088305][ T7460] ? lock_downgrade+0x420/0x420 [198256.090375][ T7460] ? dget_parent+0x8e/0x300 [198256.093538][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0 [198256.094918][ T7460] ? lock_downgrade+0x420/0x420 [198256.097815][ T7460] ? do_raw_spin_unlock+0xa9/0x100 [198256.101822][ T7460] ? dget_parent+0xb7/0x300 [198256.103345][ T7460] btrfs_log_dentry_safe+0x48/0x60 [198256.105052][ T7460] btrfs_sync_file+0x629/0xa40 [198256.106829][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120 [198256.109655][ T7460] ? __fget_files+0x161/0x230 [198256.110760][ T7460] vfs_fsync_range+0x6d/0x110 [198256.111923][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120 [198256.113556][ T7460] __x64_sys_fsync+0x45/0x70 [198256.114323][ T7460] do_syscall_64+0x5c/0xc0 [198256.115084][ T7460] ? syscall_exit_to_user_mode+0x3b/0x50 [198256.116030][ T7460] ? do_syscall_64+0x69/0xc0 [198256.116768][ T7460] ? do_syscall_64+0x69/0xc0 [198256.117555][ T7460] ? do_syscall_64+0x69/0xc0 [198256.118324][ T7460] ? sysvec_call_function_single+0x57/0xc0 [198256.119308][ T7460] ? asm_sysvec_call_function_single+0xa/0x20 [198256.120363][ T7460] entry_SYSCALL_64_after_hwframe+0x44/0xae [198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab [198256.122067][ T7460] Code: 0f 05 48 (...) [198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab [198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004 [198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8 [198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001 [198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8 [198256.133403][ T7460] </TASK> Fix this by treating -EEXIST as expected at insert_dir_log_key() and have it update the item with an end offset corresponding to the maximum between the previously logged end offset and the new requested end offset. The end offsets may be different due to dir index key deletions that happened as part of unlink operations while we are logging a directory (triggered when fsyncing some other inode parented by the directory) or during renames which always attempt to log a single dir index deletion. Reported-by: Zygo Blaxell <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 732d591 ("btrfs: stop copying old dir items when logging a directory") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>

Since commit f1131b9 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices") the following NULL pointer dereference is observed on a board with KSZ8061: # udhcpc -i eth0 udhcpc: started, v1.35.0 8<--- cut here --- Unable to handle kernel NULL pointer dereference at virtual address 00000008 pgd = f73cef4e [00000008] *pgd=00000000 Internal error: Oops: 5 [rib#1] SMP ARM Modules linked in: CPU: 0 PID: 196 Comm: ifconfig Not tainted 5.15.37-dirty torvalds#94 Hardware name: Freescale i.MX6 SoloX (Device Tree) PC is at kszphy_config_reset+0x10/0x114 LR is at kszphy_resume+0x24/0x64 ... The KSZ8061 phy_driver structure does not have the .probe/..driver_data fields, which means that priv is not allocated. This causes the NULL pointer dereference inside kszphy_config_reset(). Fix the problem by using the generic suspend/resume functions as before. Another alternative would be to provide the .probe and .driver_data information into the structure, but to be on the safe side, let's just restore Ethernet functionality by using the generic suspend/resume. Cc: [email protected] Fixes: f1131b9 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices") Signed-off-by: Fabio Estevam <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

With CONFIG_FORTIFY_SOURCE enabled, string functions will also perform dynamic checks for string size which can panic the kernel, like incase of overflow detection. In papr_scm, papr_scm_pmu_check_events function uses stat->stat_id with string operations, to populate the nvdimm_events_map array. Since stat_id variable is not NULL terminated, the kernel panics with CONFIG_FORTIFY_SOURCE enabled at boot time. Below are the logs of kernel panic: detected buffer overflow in __fortify_strlen ------------[ cut here ]------------ kernel BUG at lib/string_helpers.c:980! Oops: Exception in kernel mode, sig: 5 [rib#1] NIP [c00000000077dad0] fortify_panic+0x28/0x38 LR [c00000000077dacc] fortify_panic+0x24/0x38 Call Trace: [c0000022d77836e0] [c00000000077dacc] fortify_panic+0x24/0x38 (unreliable) [c00800000deb2660] papr_scm_pmu_check_events.constprop.0+0x118/0x220 [papr_scm] [c00800000deb2cb0] papr_scm_probe+0x288/0x62c [papr_scm] [c0000000009b46a8] platform_probe+0x98/0x150 Fix this issue by using kmemdup_nul() to copy the content of stat->stat_id directly to the nvdimm_events_map array. mpe: stat->stat_id comes from the hypervisor, not userspace, so there is no security exposure. Fixes: 4c08d4b ("powerpc/papr_scm: Add perf interface support") Signed-off-by: Kajol Jain <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]

'rmmod pmt_telemetry' panics with: BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [rib#1] PREEMPT SMP NOPTI CPU: 4 PID: 1697 Comm: rmmod Tainted: G S W -------- --- 5.18.0-rc4 rib#3 Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR5 RVP, BIOS ADLPFWI1.R00.3056.B00.2201310233 01/31/2022 RIP: 0010:device_del+0x1b/0x3d0 Code: e8 1a d9 e9 ff e9 58 ff ff ff 48 8b 08 eb dc 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d af 80 00 00 00 53 48 89 fb 48 83 ec 18 <4c> 8b 67 40 48 89 ef 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 RSP: 0018:ffffb520415cfd60 EFLAGS: 00010286 RAX: 0000000000000070 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000080 R08: ffffffffffffffff R09: ffffb520415cfd78 R10: 0000000000000002 R11: ffffb520415cfd78 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f7e198e5740(0000) GS:ffff905c9f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 000000010782a005 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> ? __xa_erase+0x53/0xb0 device_unregister+0x13/0x50 intel_pmt_dev_destroy+0x34/0x60 [pmt_class] pmt_telem_remove+0x40/0x50 [pmt_telemetry] auxiliary_bus_remove+0x18/0x30 device_release_driver_internal+0xc1/0x150 driver_detach+0x44/0x90 bus_remove_driver+0x74/0xd0 auxiliary_driver_unregister+0x12/0x20 pmt_telem_exit+0xc/0xe4a [pmt_telemetry] __x64_sys_delete_module+0x13a/0x250 ? syscall_trace_enter.isra.19+0x11e/0x1a0 do_syscall_64+0x58/0x80 ? syscall_exit_to_user_mode+0x12/0x30 ? do_syscall_64+0x67/0x80 ? syscall_exit_to_user_mode+0x12/0x30 ? do_syscall_64+0x67/0x80 ? syscall_exit_to_user_mode+0x12/0x30 ? do_syscall_64+0x67/0x80 ? exc_page_fault+0x64/0x140 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f7e1803a05b Code: 73 01 c3 48 8b 0d 2d 4e 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd 4d 38 00 f7 d8 64 89 01 48 The probe function, pmt_telem_probe(), adds an entry for devices even if they have not been initialized. This results in the array of initialized devices containing both initialized and uninitialized entries. This causes a panic in the remove function, pmt_telem_remove() which expects the array to only contain initialized entries. Only use an entry when a device is initialized. Cc: "David E. Box" <[email protected]> Cc: Hans de Goede <[email protected]> Cc: Mark Gross <[email protected]> Cc: [email protected] Signed-off-by: David Arcari <[email protected]> Signed-off-by: Prarit Bhargava <[email protected]> Reviewed-by: David E. Box <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>

Function ice_plug_aux_dev() assigns pf->adev field too early prior aux device initialization and on other side ice_unplug_aux_dev() starts aux device deinit and at the end assigns NULL to pf->adev. This is wrong because pf->adev should always be non-NULL only when aux device is fully initialized and ready. This wrong order causes a crash when ice_send_event_to_aux() call occurs because that function depends on non-NULL value of pf->adev and does not assume that aux device is half-initialized or half-destroyed. After order correction the race window is tiny but it is still there, as Leon mentioned and manipulation with pf->adev needs to be protected by mutex. Fix (un-)plugging functions so pf->adev field is set after aux device init and prior aux device destroy and protect pf->adev assignment by new mutex. This mutex is also held during ice_send_event_to_aux() call to ensure that aux device is valid during that call. Note that device lock used ice_send_event_to_aux() needs to be kept to avoid race with aux drv unload. Reproducer: cycle=1 while :;do echo "#### Cycle: $cycle" ip link set ens7f0 mtu 9000 ip link add bond0 type bond mode 1 miimon 100 ip link set bond0 up ifenslave bond0 ens7f0 ip link set bond0 mtu 9000 ethtool -L ens7f0 combined 1 ip link del bond0 ip link set ens7f0 mtu 1500 sleep 1 let cycle++ done In short when the device is added/removed to/from bond the aux device is unplugged/plugged. When MTU of the device is changed an event is sent to aux device asynchronously. This can race with (un)plugging operation and because pf->adev is set too early (plug) or too late (unplug) the function ice_send_event_to_aux() can touch uninitialized or destroyed fields. In the case of crash below pf->adev->dev.mutex. Crash: [ 53.372066] bond0: (slave ens7f0): making interface the new active one [ 53.378622] bond0: (slave ens7f0): Enslaving as an active interface with an u p link [ 53.386294] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready [ 53.549104] bond0: (slave ens7f1): Enslaving as a backup interface with an up link [ 54.118906] ice 0000:ca:00.0 ens7f0: Number of in use tx queues changed inval idating tc mappings. Priority traffic classification disabled! [ 54.233374] ice 0000:ca:00.1 ens7f1: Number of in use tx queues changed inval idating tc mappings. Priority traffic classification disabled! [ 54.248204] bond0: (slave ens7f0): Releasing backup interface [ 54.253955] bond0: (slave ens7f1): making interface the new active one [ 54.274875] bond0: (slave ens7f1): Releasing backup interface [ 54.289153] bond0 (unregistering): Released all slaves [ 55.383179] MII link monitoring set to 100 ms [ 55.398696] bond0: (slave ens7f0): making interface the new active one [ 55.405241] BUG: kernel NULL pointer dereference, address: 0000000000000080 [ 55.405289] bond0: (slave ens7f0): Enslaving as an active interface with an u p link [ 55.412198] #PF: supervisor write access in kernel mode [ 55.412200] #PF: error_code(0x0002) - not-present page [ 55.412201] PGD 25d2ad067 P4D 0 [ 55.412204] Oops: 0002 [rib#1] PREEMPT SMP NOPTI [ 55.412207] CPU: 0 PID: 403 Comm: kworker/0:2 Kdump: loaded Tainted: G S 5.17.0-13579-g57f2d6540f03 rib#1 [ 55.429094] bond0: (slave ens7f1): Enslaving as a backup interface with an up link [ 55.430224] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/ 2021 [ 55.430226] Workqueue: ice ice_service_task [ice] [ 55.468169] RIP: 0010:mutex_unlock+0x10/0x20 [ 55.472439] Code: 0f b1 13 74 96 eb e0 4c 89 ee eb d8 e8 79 54 ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 40 ef 01 00 31 d2 <f0> 48 0f b1 17 75 01 c3 e9 e3 fe ff ff 0f 1f 00 0f 1f 44 00 00 48 [ 55.491186] RSP: 0018:ff4454230d7d7e28 EFLAGS: 00010246 [ 55.496413] RAX: ff1a79b208b08000 RBX: ff1a79b2182e8880 RCX: 0000000000000001 [ 55.503545] RDX: 0000000000000000 RSI: ff4454230d7d7db0 RDI: 0000000000000080 [ 55.510678] RBP: ff1a79d1c7e48b68 R08: ff4454230d7d7db0 R09: 0000000000000041 [ 55.517812] R10: 00000000000000a5 R11: 00000000000006e6 R12: ff1a79d1c7e48bc0 [ 55.524945] R13: 0000000000000000 R14: ff1a79d0ffc305c0 R15: 0000000000000000 [ 55.532076] FS: 0000000000000000(0000) GS:ff1a79d0ffc00000(0000) knlGS:0000000000000000 [ 55.540163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.545908] CR2: 0000000000000080 CR3: 00000003487ae003 CR4: 0000000000771ef0 [ 55.553041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 55.560173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 55.567305] PKRU: 55555554 [ 55.570018] Call Trace: [ 55.572474] <TASK> [ 55.574579] ice_service_task+0xaab/0xef0 [ice] [ 55.579130] process_one_work+0x1c5/0x390 [ 55.583141] ? process_one_work+0x390/0x390 [ 55.587326] worker_thread+0x30/0x360 [ 55.590994] ? process_one_work+0x390/0x390 [ 55.595180] kthread+0xe6/0x110 [ 55.598325] ? kthread_complete_and_exit+0x20/0x20 [ 55.603116] ret_from_fork+0x1f/0x30 [ 55.606698] </TASK> Fixes: f9f5301 ("ice: Register auxiliary device to provide RDMA") Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Dave Ertman <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>

Will reported the following splat when running with Protected KVM enabled: [ 2.427181] ------------[ cut here ]------------ [ 2.427668] WARNING: CPU: 3 PID: 1 at arch/arm64/kvm/mmu.c:489 __create_hyp_private_mapping+0x118/0x1ac [ 2.428424] Modules linked in: [ 2.429040] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc2-00084-g8635adc4efc7 rib#1 [ 2.429589] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2.430286] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2.430734] pc : __create_hyp_private_mapping+0x118/0x1ac [ 2.431091] lr : create_hyp_exec_mappings+0x40/0x80 [ 2.431377] sp : ffff80000803baf0 [ 2.431597] x29: ffff80000803bb00 x28: 0000000000000000 x27: 0000000000000000 [ 2.432156] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 2.432561] x23: ffffcd96c343b000 x22: 0000000000000000 x21: ffff80000803bb40 [ 2.433004] x20: 0000000000000004 x19: 0000000000001800 x18: 0000000000000000 [ 2.433343] x17: 0003e68cf7efdd70 x16: 0000000000000004 x15: fffffc81f602a2c8 [ 2.434053] x14: ffffdf8380000000 x13: ffffcd9573200000 x12: ffffcd96c343b000 [ 2.434401] x11: 0000000000000004 x10: ffffcd96c1738000 x9 : 0000000000000004 [ 2.434812] x8 : ffff80000803bb40 x7 : 7f7f7f7f7f7f7f7f x6 : 544f422effff306b [ 2.435136] x5 : 000000008020001e x4 : ffff207d80a88c00 x3 : 0000000000000005 [ 2.435480] x2 : 0000000000001800 x1 : 000000014f4ab800 x0 : 000000000badca11 [ 2.436149] Call trace: [ 2.436600] __create_hyp_private_mapping+0x118/0x1ac [ 2.437576] create_hyp_exec_mappings+0x40/0x80 [ 2.438180] kvm_init_vector_slots+0x180/0x194 [ 2.458941] kvm_arch_init+0x80/0x274 [ 2.459220] kvm_init+0x48/0x354 [ 2.459416] arm_init+0x20/0x2c [ 2.459601] do_one_initcall+0xbc/0x238 [ 2.459809] do_initcall_level+0x94/0xb4 [ 2.460043] do_initcalls+0x54/0x94 [ 2.460228] do_basic_setup+0x1c/0x28 [ 2.460407] kernel_init_freeable+0x110/0x178 [ 2.460610] kernel_init+0x20/0x1a0 [ 2.460817] ret_from_fork+0x10/0x20 [ 2.461274] ---[ end trace 0000000000000000 ]--- Indeed, the Protected KVM mode promotes __create_hyp_private_mapping() to a hypercall as EL1 no longer has access to the hypervisor's stage-1 page-table. However, the call from kvm_init_vector_slots() happens after pKVM has been initialized on the primary CPU, but before it has been initialized on secondaries. As such, if the KVM initcall procedure is migrated from one CPU to another in this window, the hypercall may end up running on a CPU for which EL2 has not been initialized. Fortunately, the pKVM hypervisor doesn't rely on the host to re-map the vectors in the private range, so the hypercall in question is in fact superfluous. Skip it when pKVM is enabled. Reported-by: Will Deacon <[email protected]> Signed-off-by: Quentin Perret <[email protected]> [maz: simplified the checks slightly] Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]

Do not allow to write timestamps on RX rings if PF is being configured. When PF is being configured RX rings can be freed or rebuilt. If at the same time timestamps are updated, the kernel will crash by dereferencing null RX ring pointer. PID: 1449 TASK: ff187d28ed658040 CPU: 34 COMMAND: "ice-ptp-0000:51" #0 [ff1966a94a713bb0] machine_kexec at ffffffff9d05a0be rib#1 [ff1966a94a713c08] __crash_kexec at ffffffff9d192e9d rib#2 [ff1966a94a713cd0] crash_kexec at ffffffff9d1941bd rib#3 [ff1966a94a713ce8] oops_end at ffffffff9d01bd54 rib#4 [ff1966a94a713d08] no_context at ffffffff9d06bda4 rib#5 [ff1966a94a713d60] __bad_area_nosemaphore at ffffffff9d06c10c rib#6 [ff1966a94a713da8] do_page_fault at ffffffff9d06cae4 rib#7 [ff1966a94a713de0] page_fault at ffffffff9da0107e [exception RIP: ice_ptp_update_cached_phctime+91] RIP: ffffffffc076db8b RSP: ff1966a94a713e98 RFLAGS: 00010246 RAX: 16e3db9c6b7ccae4 RBX: ff187d269dd3c180 RCX: ff187d269cd4d018 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ff187d269cfcc644 R8: ff187d339b9641b0 R9: 0000000000000000 R10: 0000000000000002 R11: 0000000000000000 R12: ff187d269cfcc648 R13: ffffffff9f128784 R14: ffffffff9d101b70 R15: ff187d269cfcc640 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 rib#8 [ff1966a94a713ea0] ice_ptp_periodic_work at ffffffffc076dbef [ice] rib#9 [ff1966a94a713ee0] kthread_worker_fn at ffffffff9d101c1b rib#10 [ff1966a94a713f10] kthread at ffffffff9d101b4d rib#11 [ff1966a94a713f50] ret_from_fork at ffffffff9da0023f Fixes: 77a7811 ("ice: enable receive hardware timestamping") Signed-off-by: Arkadiusz Kubalewski <[email protected]> Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Dave Cain <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>

The splat below can be seen when running kvm-unit-test: ============================= WARNING: suspicious RCU usage 5.18.0-rc7 rib#5 Tainted: G IOE ----------------------------- /home/kernel/linux/arch/x86/kvm/../../../virt/kvm/eventfd.c:80 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 4 locks held by qemu-system-x86/35124: #0: ffff9725391d80b8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x710 [kvm] rib#1: ffffbd25cfb2a0b8 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0xdeb/0x1900 [kvm] rib#2: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: kvm_hv_notify_acked_sint+0x79/0x1e0 [kvm] rib#3: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: irqfd_resampler_ack+0x5/0x110 [kvm] stack backtrace: CPU: 2 PID: 35124 Comm: qemu-system-x86 Tainted: G IOE 5.18.0-rc7 rib#5 Call Trace: <TASK> dump_stack_lvl+0x6c/0x9b irqfd_resampler_ack+0xfd/0x110 [kvm] kvm_notify_acked_gsi+0x32/0x90 [kvm] kvm_hv_notify_acked_sint+0xc5/0x1e0 [kvm] kvm_hv_set_msr_common+0xec1/0x1160 [kvm] kvm_set_msr_common+0x7c3/0xf60 [kvm] vmx_set_msr+0x394/0x1240 [kvm_intel] kvm_set_msr_ignored_check+0x86/0x200 [kvm] kvm_emulate_wrmsr+0x4f/0x1f0 [kvm] vmx_handle_exit+0x6fb/0x7e0 [kvm_intel] vcpu_enter_guest+0xe5a/0x1900 [kvm] kvm_arch_vcpu_ioctl_run+0x16e/0xac0 [kvm] kvm_vcpu_ioctl+0x279/0x710 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae resampler-list is protected by irq_srcu (see kvm_irqfd_assign), so fix the false positive by using list_for_each_entry_srcu(). Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ioctl to register OA configs at runtime #1

Add ioctl to register OA configs at runtime #1

rib commented Dec 2, 2015

Add ioctl to register OA configs at runtime #1

Add ioctl to register OA configs at runtime #1

Comments

rib commented Dec 2, 2015

Overview

Details

Add the ioctl

Validate user state

Track the new configs

Sysfs

Footnote