Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panthor driver backport #7

Closed
wants to merge 101 commits into from
Closed

Panthor driver backport #7

wants to merge 101 commits into from

Conversation

hbiyik
Copy link

@hbiyik hbiyik commented Mar 9, 2024

WIP, please do not merge

dliviu and others added 30 commits March 9, 2024 17:38
Arm has introduced a new v10 GPU architecture that replaces the Job Manager
interface with a new Command Stream Frontend. It adds firmware driven
command stream queues that can be used by kernel and user space to submit
jobs to the GPU.

Add the initial schema for the device tree that is based on support for
RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip
platforms they will tend to expose the semi-independent clocks for better
power management.

v3:
- Cleanup commit message to remove redundant text
- Added opp-table property and re-ordered entries
- Clarified power-domains and power-domain-names requirements for RK3588.
- Cleaned up example

Note: power-domains and power-domain-names requirements for other platforms
are still work in progress, hence the bindings are left incomplete here.

v2:
- New commit

Signed-off-by: Liviu Dudau <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Conor Dooley <[email protected]>
Cc: [email protected]
This adds the infrastructure for an execution context for GEM buffers
which is similar to the existing TTMs execbuf util and intended to replace
it in the long term.

The basic functionality is that we abstracts the necessary loop to lock
many different GEM buffers with automated deadlock and duplicate handling.

v2: drop xarray and use dynamic resized array instead, the locking
    overhead is unnecessary and measurable.
v3: drop duplicate tracking, radeon is really the only one needing that.
v4: fixes issues pointed out by Danilo, some typos in comments and a
    helper for lock arrays of GEM objects.
v5: some suggestions by Boris Brezillon, especially just use one retry
    macro, drop loop in prepare_array, use flags instead of bool
v6: minor changes suggested by Thomas, Boris and Danilo
v7: minor typos pointed out by checkpatch.pl fixed

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Danilo Krummrich <[email protected]>
Tested-by: Danilo Krummrich <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Add infrastructure to keep track of GPU virtual address (VA) mappings
with a decicated VA space manager implementation.

New UAPIs, motivated by Vulkan sparse memory bindings graphics drivers
start implementing, allow userspace applications to request multiple and
arbitrary GPU VA mappings of buffer objects. The DRM GPU VA manager is
intended to serve the following purposes in this context.

1) Provide infrastructure to track GPU VA allocations and mappings,
   using an interval tree (RB-tree).

2) Generically connect GPU VA mappings to their backing buffers, in
   particular DRM GEM objects.

3) Provide a common implementation to perform more complex mapping
   operations on the GPU VA space. In particular splitting and merging
   of GPU VA mappings, e.g. for intersecting mapping requests or partial
   unmap requests.

Acked-by: Thomas Hellström <[email protected]>
Acked-by: Matthew Brost <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Tested-by: Matthew Brost <[email protected]>
Tested-by: Donald Robson <[email protected]>
Suggested-by: Dave Airlie <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
vm_flags are among VMA attributes which affect decisions like VMA merging
and splitting.  Therefore all vm_flags modifications are performed after
taking exclusive mmap_lock to prevent vm_flags updates racing with such
operations.  Introduce modifier functions for vm_flags to be used whenever
flags are updated.  This way we can better check and control correct
locking behavior during these updates.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjun Roy <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Howells <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Oskolkov <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Punit Agrawal <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Soheil Hassas Yeganeh <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Rename struct drm_gpuva_manager to struct drm_gpuvm including
corresponding functions. This way the GPUVA manager's structures align
very well with the documentation of VM_BIND [1] and VM_BIND locking [2].

It also provides a better foundation for the naming of data structures
and functions introduced for implementing a common dma-resv per GPU-VM
including tracking of external and evicted objects in subsequent
patches.

[1] Documentation/gpu/drm-vm-bind-async.rst
[2] Documentation/gpu/drm-vm-bind-locking.rst

Cc: Thomas Hellström <[email protected]>
Cc: Matthew Brost <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Currently, the DRM GPUVM does not have any core dependencies preventing
a module build.

Also, new features from subsequent patches require helpers (namely
drm_exec) which can be built as module.

Reviewed-by: Christian König <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Use drm_WARN() and drm_WARN_ON() variants to indicate drivers the
context the failing VM resides in.

Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Don't always WARN in drm_gpuvm_check_overflow() and separate it into a
drm_gpuvm_check_overflow() and a dedicated
drm_gpuvm_warn_check_overflow() variant.

This avoids printing warnings due to invalid userspace requests.

Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Drivers may use this function to validate userspace requests in advance,
hence export it.

Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Provide a common dma-resv for GEM objects not being used outside of this
GPU-VM. This is used in a subsequent patch to generalize dma-resv,
external and evicted object handling and GEM validation.

Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Introduce flags for struct drm_gpuvm, this required by subsequent
commits.

Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Implement reference counting for struct drm_gpuvm.

Signed-off-by: Danilo Krummrich <[email protected]>
Add an abstraction layer between the drm_gpuva mappings of a particular
drm_gem_object and this GEM object itself. The abstraction represents a
combination of a drm_gem_object and drm_gpuvm. The drm_gem_object holds
a list of drm_gpuvm_bo structures (the structure representing this
abstraction), while each drm_gpuvm_bo contains list of mappings of this
GEM object.

This has multiple advantages:

1) We can use the drm_gpuvm_bo structure to attach it to various lists
   of the drm_gpuvm. This is useful for tracking external and evicted
   objects per VM, which is introduced in subsequent patches.

2) Finding mappings of a certain drm_gem_object mapped in a certain
   drm_gpuvm becomes much cheaper.

3) Drivers can derive and extend the structure to easily represent
   driver specific states of a BO for a certain GPUVM.

The idea of this abstraction was taken from amdgpu, hence the credit for
this idea goes to the developers of amdgpu.

Cc: Christian König <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Currently the DRM GPUVM offers common infrastructure to track GPU VA
allocations and mappings, generically connect GPU VA mappings to their
backing buffers and perform more complex mapping operations on the GPU VA
space.

However, there are more design patterns commonly used by drivers, which
can potentially be generalized in order to make the DRM GPUVM represent
a basis for GPU-VM implementations. In this context, this patch aims
at generalizing the following elements.

1) Provide a common dma-resv for GEM objects not being used outside of
   this GPU-VM.

2) Provide tracking of external GEM objects (GEM objects which are
   shared with other GPU-VMs).

3) Provide functions to efficiently lock all GEM objects dma-resv the
   GPU-VM contains mappings of.

4) Provide tracking of evicted GEM objects the GPU-VM contains mappings
   of, such that validation of evicted GEM objects is accelerated.

5) Provide some convinience functions for common patterns.

Big thanks to Boris Brezillon for his help to figure out locking for
drivers updating the GPU VA space within the fence signalling path.

Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Suggested-by: Matthew Brost <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
This will be useful for GPU drivers who want to keep page tables in a
pool so they can:

- keep freed page tables in a free pool and speed-up upcoming page
  table allocations
- batch page table allocation instead of allocating one page at a time
- pre-reserve pages for page tables needed for map/unmap operations,
  to ensure map/unmap operations don't try to allocate memory in paths
  they're allowed to block or fail

It might also be valuable for other aspects of GPU and similar
use-cases, like fine-grained memory accounting and resource limiting.

We will extend the Arm LPAE format to support custom allocators in a
separate commit.

Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
We need that in order to implement the VM_BIND ioctl in the GPU driver
targeting new Mali GPUs.

VM_BIND is about executing MMU map/unmap requests asynchronously,
possibly after waiting for external dependencies encoded as dma_fences.
We intend to use the drm_sched framework to automate the dependency
tracking and VM job dequeuing logic, but this comes with its own set
of constraints, one of them being the fact we are not allowed to
allocate memory in the drm_gpu_scheduler_ops::run_job() to avoid this
sort of deadlocks:

- VM_BIND map job needs to allocate a page table to map some memory
  to the VM. No memory available, so kswapd is kicked
- GPU driver shrinker backend ends up waiting on the fence attached to
  the VM map job or any other job fence depending on this VM operation.

With custom allocators, we will be able to pre-reserve enough pages to
guarantee the map/unmap operations we queued will take place without
going through the system allocator. But we can also optimize
allocation/reservation by not free-ing pages immediately, so any
upcoming page table allocation requests can be serviced by some free
page table pool kept at the driver level.

I might also be valuable for other aspects of GPU and similar
use-cases, like fine-grained memory accounting and resource limiting.

Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Ease debugging of a multi-GPU system by using drm_WARN_*() and
drm_dbg_kms() helpers that print out DRM device name corresponding
to shmem GEM.

Reviewed-by: Thomas Zimmermann <[email protected]>
Suggested-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
DMA-buf core has its own refcounting of vmaps, use it instead of drm-shmem
counting. This change prepares drm-shmem for addition of memory shrinker
support where drm-shmem will use a single dma-buf reservation lock for
all operations performed over dma-bufs.

Reviewed-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Replace all drm-shmem locks with a GEM reservation lock. This makes locks
consistent with dma-buf locking convention where importers are responsible
for holding reservation lock for all operations performed over dma-bufs,
preventing deadlock between dma-buf importers and exporters.

Suggested-by: Daniel Vetter <[email protected]>
Acked-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
The dma-buf backend is supposed to provide its own vm_ops, but some
implementation just have nothing special to do and leave vm_ops
untouched, probably expecting this field to be zero initialized (this
is the case with the system_heap implementation for instance).
Let's reset vma->vm_ops to NULL to keep things working with these
implementations.

Fixes: 26d3ac3 ("drm/shmem-helpers: Redirect mmap for imported dma-buf")
Cc: <[email protected]>
Cc: Daniel Vetter <[email protected]>
Reported-by: Roman Stratiienko <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Tested-by: Roman Stratiienko <[email protected]>
Reviewed-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Replace all drm-shmem locks with a GEM reservation lock. This makes locks
consistent with dma-buf locking convention where importers are responsible
for holding reservation lock for all operations performed over dma-bufs,
preventing deadlock between dma-buf importers and exporters.

Suggested-by: Daniel Vetter <[email protected]>
Acked-by: Thomas Zimmermann <[email protected]>
Reviewed-by: Emil Velikov <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
This way we can grab a pages ref without acquiring the resv lock when
pages_use_count > 0. This is needed to implement asynchronous map using
the drm_gpuva_mgr when the map/unmap operation triggers a mapping split,
requiring the new left/right regions to grab an additional page ref
to guarantee that the pages stay pinned when the middle section is
unmapped.

Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Steven Price <[email protected]>
When many entities are competing for the same run queue
on the same scheduler, we observe an unusually long wait
times and some jobs get starved. This has been observed on GPUVis.

The issue is due to the Round Robin policy used by schedulers
to pick up the next entity's job queue for execution. Under stress
of many entities and long job queues within entity some
jobs could be stuck for very long time in it's entity's
queue before being popped from the queue and executed
while for other entities with smaller job queues a job
might execute earlier even though that job arrived later
then the job in the long queue.

Fix:
Add FIFO selection policy to entities in run queue, chose next entity
on run queue in such order that if job on one entity arrived
earlier then job on another entity the first job will start
executing earlier regardless of the length of the entity's job
queue.

v2:
Switch to rb tree structure for entities based on TS of
oldest job waiting in the job queue of an entity. Improves next
entity extraction to O(1). Entity TS update
O(log N) where N is the number of entities in the run-queue

Drop default option in module control parameter.

v3:
Various cosmetical fixes and minor refactoring of fifo update function. (Luben)

v4:
Switch drm_sched_rq_select_entity_fifo to in order search (Luben)

v5: Fix up drm_sched_rq_select_entity_fifo loop (Luben)

v6: Add missing drm_sched_rq_remove_fifo_locked

v7: Fix ts sampling bug and more cosmetic stuff (Luben)

v8: Fix module parameter string (Luben)

Cc: Luben Tuikov <[email protected]>
Cc: Christian König <[email protected]>
Cc: Direct Rendering Infrastructure - Development <[email protected]>
Cc: AMD Graphics <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Tested-by: Yunxiang Li (Teddy) <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Otherwise we would crash if the job is not resubmitted.

v2: fix second usage of s_fence->parent as well.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
The currently default Round-Robin GPU scheduling can result in starvation
of entities which have a large number of jobs, over entities which have
a very small number of jobs (single digit).

This can be illustrated in the following diagram, where jobs are
alphabetized to show their chronological order of arrival, where job A is
the oldest, B is the second oldest, and so on, to J, the most recent job to
arrive.

    ---> entities
j | H-F-----A--E--I--
o | --G-----B-----J--
b | --------C--------
s\/ --------D--------

WLOG, assuming all jobs are "ready", then a R-R scheduling will execute them
in the following order (a slice off of the top of the entities' list),

H, F, A, E, I, G, B, J, C, D.

However, to mitigate job starvation, we'd rather execute C and D before E,
and so on, given, of course, that they're all ready to be executed.

So, if all jobs are ready at this instant, the order of execution for this
and the next 9 instances of picking the next job to execute, should really
be,

A, B, C, D, E, F, G, H, I, J,

which is their chronological order. The only reason for this order to be
broken, is if an older job is not yet ready, but a younger job is ready, at
an instant of picking a new job to execute. For instance if job C wasn't
ready at time 2, but job D was ready, then we'd pick job D, like this:

0 +1 +2  ...
A, B, D, ...

And from then on, C would be preferred before all other jobs, if it is ready
at the time when a new job for execution is picked. So, if C became ready
two steps later, the execution order would look like this:

......0 +1 +2  ...
A, B, D, E, C, F, G, H, I, J

This is what the FIFO GPU scheduling algorithm achieves. It uses a
Red-Black tree to keep jobs sorted in chronological order, where picking
the oldest job is O(1) (we use the "cached" structure), and balancing the
tree is O(log n). IOW, it picks the *oldest ready* job to execute now.

The implementation is already in the kernel, and this commit only changes
the default GPU scheduling algorithm to use.

This was tested and achieves about 1% faster performance over the Round
Robin algorithm.

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Direct Rendering Infrastructure - Development <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>
Add a new function to update job dependencies from a resv obj.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
This was buggy because when we had to wait for entities which were
killed as well we would just deadlock.

Instead move all the dependency handling into the callbacks so that
will all happen asynchronously.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
This now matches much better what this is doing.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
This reverts commit e6c6338.

This feature basically re-submits one job after another to
figure out which one was the one causing a hang.

This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.

For testing and debugging environments we should rather disable
recovery alltogether to be able to inspect the state with a hw
debugger.

Additional to that the sw implementation is clearly buggy and causes
reference count issues for the hardware fence.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Thomas Zimmermann and others added 24 commits March 9, 2024 17:45
Call drm_gem_prime_handle_to_fd() and drm_gem_prime_fd_to_handle() by
default if no PRIME import/export helpers have been set. Both functions
are the default for almost all drivers.

DRM drivers implement struct drm_driver.gem_prime_import_sg_table
to import dma-buf objects from other drivers. Having the function
drm_gem_prime_fd_to_handle() functions set by default allows each
driver to import dma-buf objects to itself, even without support for
other drivers.

For drm_gem_prime_handle_to_fd() it is similar: using it by default
allows each driver to export to itself, even without support for other
drivers.

This functionality enables userspace to share per-driver buffers
across process boundaries via PRIME (e.g., wlroots requires this
functionality). The patch generalizes a pattern that has previously
been implemented by GEM VRAM helpers [1] to work with any driver.
For example, gma500 can now run the wlroots-based sway compositor.

v2:
	* clean up docs and TODO comments (Simon, Zack)
	* clean up style in drm_getcap()

Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://lore.kernel.org/dri-devel/[email protected]/ # 1
Reviewed-by: Simon Ser <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Jeffrey Hugo <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Clear all assignments of struct drm_driver's fd/handle callbacks to
drm_gem_prime_fd_to_handle() and drm_gem_prime_handle_to_fd(). These
functions are called by default. Add a TODO item to convert vmwgfx
to the defaults as well.

v2:
	* remove TODO item (Zack)
	* also update amdgpu's amdgpu_partition_driver

Signed-off-by: Thomas Zimmermann <[email protected]>
Reviewed-by: Simon Ser <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Acked-by: Jeffrey Hugo <[email protected]> # qaic
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
G610 Mali normally takes 2 regulators, but the devfreq implementation
can only deal with one. Let's add a regulator coupler as done for
mtk8183.
On architectures that support the preservation of memblock metadata
after __init, allow drivers to call memblock_free() to free a
reservation made by early arch code. This is a hack to support the
freeing of bootsplash reservations passed to Linux by the bootloader.

(This should be reworked in future versions of Android; do not
cherry-pick this patch forward.)

Bug: 139653858
Bug: 174620135
Change-Id: I32c0ee70c33c94deff70aa548896caa9978396fb
Signed-off-by: Alistair Delva <[email protected]>
DKMS modules use the same makefile, and random warnings on version jumps
or toolchain causes dkms modules fail to build
…troyed

Some users need to release resources attached to the vm_bo object when
it's destroyed. In Panthor's case, we need to release the pin ref so
BO pages can be returned to the system when all GPU mappings are gone.

This could be done through a custom drm_gpuvm::vm_bo_free() hook, but
this has all sort of locking implications that would force us to expose
a drm_gem_shmem_unpin_locked() helper, not to mention the fact that
having a ::vm_bo_free() implementation without a ::vm_bo_alloc() one
seems odd. So let's keep things simple, and extend drm_gpuvm_bo_put()
to report when the object is destroyed.

Signed-off-by: Boris Brezillon <[email protected]>
Panthor follows the lead of other recently submitted drivers with
ioctls allowing us to support modern Vulkan features, like sparse memory
binding:

- Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
  with the 'exclusive-VM' bit to speed-up BO reservation on job
submission
- VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
  ioctl is loosely based on the Xe model, and can handle both
  asynchronous and synchronous requests
- GPU execution context creation/destruction, tiler heap context
creation
  and job submission. Those ioctls reflect how the hardware/scheduler
  works and are thus driver specific.

We also have a way to expose IO regions, such that the usermode driver
can directly access specific/well-isolate registers, like the
LATEST_FLUSH register used to implement cache-flush reduction.

This uAPI intentionally keeps usermode queues out of the scope, which
explains why doorbell registers and command stream ring-buffers are not
directly exposed to userspace.

v5:
- Fix typo
- Add Liviu's R-b

v4:
- Add a VM_GET_STATE ioctl
- Fix doc
- Expose the CORE_FEATURES register so we can deal with variants in the
  UMD
- Add Steve's R-b

v3:
- Add the concept of sync-only VM operation
- Fix support for 32-bit userspace
- Rework drm_panthor_vm_create to pass the user VA size instead of
  the kernel VA size (suggested by Robin Murphy)
- Typo fixes
- Explicitly cast enums with top bit set to avoid compiler warnings in
  -pedantic mode.
- Drop property core_group_count as it can be easily calculated by the
  number of bits set in l2_present.

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Liviu Dudau <[email protected]>
Those are the registers directly accessible through the MMIO range.

FW registers are exposed in panthor_fw.h.

v4:
- Add the CORE_FEATURES register (needed for GPU variants)
- Add Steve's R-b

v3:
- Add macros to extract GPU ID info
- Formatting changes
- Remove AS_TRANSCFG_ADRMODE_LEGACY - it doesn't exist post-CSF
- Remove CSF_GPU_LATEST_FLUSH_ID_DEFAULT
- Add GPU_L2_FEATURES_LINE_SIZE for extracting the GPU cache line size

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Reviewed-by: Steven Price <[email protected]>
The panthor driver is designed in a modular way, where each logical
block is dealing with a specific HW-block or software feature. In order
for those blocks to communicate with each other, we need a central
panthor_device collecting all the blocks, and exposing some common
features, like interrupt handling, power management, reset, ...

This what this panthor_device logical block is about.

v5:
- Suspend the MMU/GPU blocks if panthor_fw_resume() fails in
  panthor_device_resume()
- Move the pm_runtime_use_autosuspend() call before drm_dev_register()
- Add Liviu's R-b

v4:
- Check drmm_mutex_init() return code
- Fix panthor_device_reset_work() out path
- Fix the race in the unplug logic
- Fix typos
- Unplug blocks when something fails in panthor_device_init()
- Add Steve's R-b

v3:
- Add acks for the MIT+GPL2 relicensing
- Fix 32-bit support
- Shorten the sections protected by panthor_device::pm::mmio_lock to fix
  lock ordering issues.
- Rename panthor_device::pm::lock into panthor_device::pm::mmio_lock to
  better reflect what this lock is protecting
- Use dev_err_probe()
- Make sure we call drm_dev_exit() when something fails half-way in
  panthor_device_reset_work()
- Replace CSF_GPU_LATEST_FLUSH_ID_DEFAULT with a constant '1' and a
  comment to explain. Also remove setting the dummy flush ID on suspend.
- Remove drm_WARN_ON() in panthor_exception_name()
- Check pirq->suspended in panthor_xxx_irq_raw_handler()

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Liviu Dudau <[email protected]>
Handles everything that's not related to the FW, the MMU or the
scheduler. This is the block dealing with the GPU property retrieval,
the GPU block power on/off logic, and some global operations, like
global cache flushing.

v5:
- Fix GPU_MODEL() kernel doc
- Fix test in panthor_gpu_block_power_off()
- Add Steve's R-b

v4:
- Expose CORE_FEATURES through DEV_QUERY

v3:
- Add acks for the MIT/GPL2 relicensing
- Use macros to extract GPU ID info
- Make sure we reset clear pending_reqs bits when wait_event_timeout()
  times out but the corresponding bit is cleared in GPU_INT_RAWSTAT
  (can happen if the IRQ is masked or HW takes to long to call the IRQ
  handler)
- GPU_MODEL now takes separate arch and product majors to be more
  readable.
- Drop GPU_IRQ_MCU_STATUS_CHANGED from interrupt mask.
- Handle GPU_IRQ_PROTM_FAULT correctly (don't output registers that are
  not updated for protected interrupts).
- Minor code tidy ups

Cc: Alexey Sheplyakov <[email protected]> # MIT+GPL2 relicensing
Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Reviewed-by: Steven Price <[email protected]>
Anything relating to GEM object management is placed here. Nothing
particularly interesting here, given the implementation is based on
drm_gem_shmem_object, which is doing most of the work.

v5:
- Add Liviu's and Steve's R-b

v4:
- Force kernel BOs to be GPU mapped
- Make panthor_kernel_bo_destroy() robust against ERR/NULL BO pointers
  to simplify the call sites

v3:
- Add acks for the MIT/GPL2 relicensing
- Provide a panthor_kernel_bo abstraction for buffer objects managed by
  the kernel (will replace panthor_fw_mem and be used everywhere we were
  using panthor_gem_create_and_map() before)
- Adjust things to match drm_gpuvm changes
- Change return of panthor_gem_create_with_handle() to int

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Reviewed-by: Liviu Dudau <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Every thing related to devfreq in placed in panthor_devfreq.c, and
helpers that can be called by other logical blocks are exposed through
panthor_devfreq.h.

This implementation is loosely based on the panfrost implementation,
the only difference being that we don't count device users, because
the idle/active state will be managed by the scheduler logic.

v4:
- Add Clément's A-b for the relicensing

v3:
- Add acks for the MIT/GPL2 relicensing

v2:
- Added in v2

Cc: Clément Péron <[email protected]> # MIT+GPL2 relicensing
Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Acked-by: Clément Péron <[email protected]> # MIT+GPL2 relicensing
MMU and VM management is related and placed in the same source file.

Page table updates are delegated to the io-pgtable-arm driver that's in
the iommu subsystem.

The VM management logic is based on drm_gpuva_mgr, and is assuming the
VA space is mostly managed by the usermode driver, except for a reserved
portion of this VA-space that's used for kernel objects (like the heap
contexts/chunks).

Both asynchronous and synchronous VM operations are supported, and
internal helpers are exposed to allow other logical blocks to map their
buffers in the GPU VA space.

There's one VM_BIND queue per-VM (meaning the Vulkan driver can only
expose one sparse-binding queue), and this bind queue is managed with
a 1:1 drm_sched_entity:drm_gpu_scheduler, such that each VM gets its own
independent execution queue, avoiding VM operation serialization at the
device level (things are still serialized at the VM level).

The rest is just implementation details that are hopefully well explained
in the documentation.

v5:
- Fix a double panthor_vm_cleanup_op_ctx() call
- Fix a race between panthor_vm_prepare_map_op_ctx() and
  panthor_vm_bo_put()
- Fix panthor_vm_pool_destroy_vm() kernel doc
- Fix paddr adjustment in panthor_vm_map_pages()
- Fix bo_offset calculation in panthor_vm_get_bo_for_va()

v4:
- Add an helper to return the VM state
- Check drmm_mutex_init() return code
- Remove the VM from the AS reclaim list when panthor_vm_active() is
  called
- Count the number of active VM users instead of considering there's
  at most one user (several scheduling groups can point to the same
  vM)
- Pre-allocate a VMA object for unmap operations (unmaps can trigger
  a sm_step_remap() call)
- Check vm->root_page_table instead of vm->pgtbl_ops to detect if
  the io-pgtable is trying to allocate the root page table
- Don't memset() the va_node in panthor_vm_alloc_va(), make it a
  caller requirement
- Fix the kernel doc in a few places
- Drop the panthor_vm::base offset constraint and modify
  panthor_vm_put() to explicitly check for a NULL value
- Fix unbalanced vm_bo refcount in panthor_gpuva_sm_step_remap()
- Drop stale comments about the shared_bos list
- Patch mmu_features::va_bits on 32-bit builds to reflect the
  io_pgtable limitation and let the UMD know about it

v3:
- Add acks for the MIT/GPL2 relicensing
- Propagate MMU faults to the scheduler
- Move pages pinning/unpinning out of the dma_signalling path
- Fix 32-bit support
- Rework the user/kernel VA range calculation
- Make the auto-VA range explicit (auto-VA range doesn't cover the full
  kernel-VA range on the MCU VM)
- Let callers of panthor_vm_alloc_va() allocate the drm_mm_node
  (embedded in panthor_kernel_bo now)
- Adjust things to match the latest drm_gpuvm changes (extobj tracking,
  resv prep and more)
- Drop the per-AS lock and use slots_lock (fixes a race on vm->as.id)
- Set as.id to -1 when reusing an address space from the LRU list
- Drop misleading comment about page faults
- Remove check for irq being assigned in panthor_mmu_unplug()

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Contains everything that's FW related, that includes the code dealing
with the microcontroller unit (MCU) that's running the FW, and anything
related to allocating memory shared between the FW and the CPU.

A few global FW events are processed in the IRQ handler, the rest is
forwarded to the scheduler, since scheduling is the primary reason for
the FW existence, and also the main source of FW <-> kernel
interactions.

v5:
- Fix typo in GLB_PERFCNT_SAMPLE definition
- Fix unbalanced panthor_vm_idle/active() calls
- Fallback to a slow reset when the fast reset fails
- Add extra information when reporting a FW boot failure

v4:
- Add a MODULE_FIRMWARE() entry for gen 10.8
- Fix a wrong return ERR_PTR() in panthor_fw_load_section_entry()
- Fix typos
- Add Steve's R-b

v3:
- Make the FW path more future-proof (Liviu)
- Use one waitqueue for all FW events
- Simplify propagation of FW events to the scheduler logic
- Drop the panthor_fw_mem abstraction and use panthor_kernel_bo instead
- Account for the panthor_vm changes
- Replace magic number with 0x7fffffff with ~0 to better signify that
  it's the maximum permitted value.
- More accurate rounding when computing the firmware timeout.
- Add a 'sub iterator' helper function. This also adds a check that a
  firmware entry doesn't overflow the firmware image.
- Drop __packed from FW structures, natural alignment is good enough.
- Other minor code improvements.

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Tiler heap growing requires some kernel driver involvement: when the
tiler runs out of heap memory, it will raise an exception which is
either directly handled by the firmware if some free heap chunks are
available in the heap context, or passed back to the kernel otherwise.
The heap helpers will be used by the scheduler logic to allocate more
heap chunks to a heap context, when such a situation happens.

Heap context creation is explicitly requested by userspace (using
the TILER_HEAP_CREATE ioctl), and the returned context is attached to a
queue through some command stream instruction.

All the kernel does is keep the list of heap chunks allocated to a
context, so they can be freed when TILER_HEAP_DESTROY is called, or
extended when the FW requests a new chunk.

v5:
- Fix FIXME comment
- Add Steve's R-b

v4:
- Rework locking to allow concurrent calls to panthor_heap_grow()
- Add a helper to return a heap chunk if we couldn't pass it to the
  FW because the group was scheduled out

v3:
- Add a FIXME for the heap OOM deadlock
- Use the panthor_kernel_bo abstraction for the heap context and heap
  chunks
- Drop the panthor_heap_gpu_ctx struct as it is opaque to the driver
- Ensure that the heap context is aligned to the GPU cache line size
- Minor code tidy ups

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
This is the piece of software interacting with the FW scheduler, and
taking care of some scheduling aspects when the FW comes short of slots
scheduling slots. Indeed, the FW only expose a few slots, and the kernel
has to give all submission contexts, a chance to execute their jobs.

The kernel-side scheduler is timeslice-based, with a round-robin queue
per priority level.

Job submission is handled with a 1:1 drm_sched_entity:drm_gpu_scheduler,
allowing us to delegate the dependency tracking to the core.

All the gory details should be documented inline.

v5:
- Fix typos
- Call panthor_kernel_bo_destroy(group->syncobjs) unconditionally
- Don't move the group to the waiting list tail when it was already
  waiting for a different syncobj
- Fix fatal_queues flagging in the tiler OOM path
- Don't warn when more than one job timesout on a group
- Add a warning message when we fail to allocate a heap chunk
- Add Steve's R-b

v4:
- Check drmm_mutex_init() return code
- s/drm_gem_vmap_unlocked/drm_gem_vunmap_unlocked/ in
  panthor_queue_put_syncwait_obj()
- Drop unneeded WARN_ON() in cs_slot_sync_queue_state_locked()
- Use atomic_xchg() instead of atomic_fetch_and(0)
- Fix typos
- Let panthor_kernel_bo_destroy() check for IS_ERR_OR_NULL() BOs
- Defer TILER_OOM event handling to a separate workqueue to prevent
  deadlocks when the heap chunk allocation is blocked on mem-reclaim.
  This is just a temporary solution, until we add support for
  non-blocking/failable allocations
- Pass the scheduler workqueue to drm_sched instead of instantiating
  a separate one (no longer needed now that heap chunk allocation
  happens on a dedicated wq)
- Set WQ_MEM_RECLAIM on the scheduler workqueue, so we can handle
  job timeouts when the system is under mem pressure, and hopefully
  free up some memory retained by these jobs

v3:
- Rework the FW event handling logic to avoid races
- Make sure MMU faults kill the group immediately
- Use the panthor_kernel_bo abstraction for group/queue buffers
- Make in_progress an atomic_t, so we can check it without the reset lock
  held
- Don't limit the number of groups per context to the FW scheduler
  capacity. Fix the limit to 128 for now.
- Add a panthor_job_vm() helper
- Account for panthor_vm changes
- Add our job fence as DMA_RESV_USAGE_WRITE to all external objects
  (was previously DMA_RESV_USAGE_BOOKKEEP). I don't get why, given
  we're supposed to be fully-explicit, but other drivers do that, so
  there must be a good reason
- Account for drm_sched changes
- Provide a panthor_queue_put_syncwait_obj()
- Unconditionally return groups to their idle list in
  panthor_sched_suspend()
- Condition of sched_queue_{,delayed_}work fixed to be only when a reset
  isn't pending or in progress.
- Several typos in comments fixed.

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
This is the last piece missing to expose the driver to the outside
world.

This is basically a wrapper between the ioctls and the other logical
blocks.

v5:
- Account for the drm_exec_init() prototype change
- Include platform_device.h

v4:
- Add an ioctl to let the UMD query the VM state
- Fix kernel doc
- Let panthor_device_init() call panthor_device_init()
- Fix cleanup ordering in the panthor_init() error path
- Add Steve's and Liviu's R-b

v3:
- Add acks for the MIT/GPL2 relicensing
- Fix 32-bit support
- Account for panthor_vm and panthor_sched changes
- Simplify the resv preparation/update logic
- Use a linked list rather than xarray for list of signals.
- Simplify panthor_get_uobj_array by returning the newly allocated
  array.
- Drop the "DOC" for job submission helpers and move the relevant
  comments to panthor_ioctl_group_submit().
- Add helpers sync_op_is_signal()/sync_op_is_wait().
- Simplify return type of panthor_submit_ctx_add_sync_signal() and
  panthor_submit_ctx_get_sync_signal().
- Drop WARN_ON from panthor_submit_ctx_add_job().
- Fix typos in comments.

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Liviu Dudau <[email protected]>
Now that all blocks are available, we can add/update Kconfig/Makefile
files to allow compilation.

v4:
- Add Steve's R-b

v3:
- Add a dep on DRM_GPUVM
- Fix dependencies in Kconfig
- Expand help text to (hopefully) describe which GPUs are to be
  supported by this driver and which are for panfrost.

Co-developed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Steven Price <[email protected]> # MIT+GPL2 relicensing,Arm
Acked-by: Grant Likely <[email protected]> # MIT+GPL2 relicensing,Linaro
Acked-by: Boris Brezillon <[email protected]> # MIT+GPL2 relicensing,Collabora
Reviewed-by: Steven Price <[email protected]>
In cases where the # is known ahead of time, it is silly to do the table
resize dance.

Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Christian König <[email protected]>
Patchwork: https://patchwork.freedesktop.org/patch/568338/
@hbiyik hbiyik closed this Mar 10, 2024
@hbiyik hbiyik deleted the rk-6.1-rkr1-panthor-v5 branch March 10, 2024 23:43
Joshua-Riek pushed a commit that referenced this pull request Jul 9, 2024
commit 1a5352a upstream.

The ath11k active pdevs are protected by RCU but the temperature event
handling code calling ath11k_mac_get_ar_by_pdev_id() was not marked as a
read-side critical section as reported by RCU lockdep:

	=============================
	WARNING: suspicious RCU usage
	6.6.0-rc6 #7 Not tainted
	-----------------------------
	drivers/net/wireless/ath/ath11k/mac.c:638 suspicious rcu_dereference_check() usage!

	other info that might help us debug this:

	rcu_scheduler_active = 2, debug_locks = 1
	no locks held by swapper/0/0.
	...
	Call trace:
	...
	 lockdep_rcu_suspicious+0x16c/0x22c
	 ath11k_mac_get_ar_by_pdev_id+0x194/0x1b0 [ath11k]
	 ath11k_wmi_tlv_op_rx+0xa84/0x2c1c [ath11k]
	 ath11k_htc_rx_completion_handler+0x388/0x510 [ath11k]

Mark the code in question as an RCU read-side critical section to avoid
any potential use-after-free issues.

Tested-on: WCN6855 hw2.1 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.23

Fixes: a41d103 ("ath11k: add thermal sensor device support")
Cc: [email protected]      # 5.7
Signed-off-by: Johan Hovold <[email protected]>
Acked-by: Jeff Johnson <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Joshua-Riek pushed a commit that referenced this pull request Jul 9, 2024
commit b684c09 upstream.

ppc_save_regs() skips one stack frame while saving the CPU register states.
Instead of saving current R1, it pulls the previous stack frame pointer.

When vmcores caused by direct panic call (such as `echo c >
/proc/sysrq-trigger`), are debugged with gdb, gdb fails to show the
backtrace correctly. On further analysis, it was found that it was because
of mismatch between r1 and NIP.

GDB uses NIP to get current function symbol and uses corresponding debug
info of that function to unwind previous frames, but due to the
mismatching r1 and NIP, the unwinding does not work, and it fails to
unwind to the 2nd frame and hence does not show the backtrace.

GDB backtrace with vmcore of kernel without this patch:

---------
(gdb) bt
 #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>,
    newregs=0xc000000004f8f8d8) at ./arch/powerpc/include/asm/kexec.h:69
 #1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974
 #2  0x0000000000000063 in ?? ()
 #3  0xc000000003579320 in ?? ()
---------

Further analysis revealed that the mismatch occurred because
"ppc_save_regs" was saving the previous stack's SP instead of the current
r1. This patch fixes this by storing current r1 in the saved pt_regs.

GDB backtrace with vmcore of patched kernel:

--------
(gdb) bt
 #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=0x0, newregs=0xc00000000670b8d8)
    at ./arch/powerpc/include/asm/kexec.h:69
 #1  __crash_kexec (regs=regs@entry=0x0) at kernel/kexec_core.c:974
 #2  0xc000000000168918 in panic (fmt=fmt@entry=0xc000000001654a60 "sysrq triggered crash\n")
    at kernel/panic.c:358
 #3  0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155
 #4  0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false)
    at drivers/tty/sysrq.c:602
 #5  0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>,
    count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163
 #6  0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>,
    buf=<optimized out>, file=<optimized out>, pde=0xc00000000362cb40) at fs/proc/inode.c:340
 #7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>,
    ppos=<optimized out>) at fs/proc/inode.c:352
 #8  0xc0000000005b3bbc in vfs_write (file=file@entry=0xc000000006aa6b00,
    buf=buf@entry=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>,
    count=count@entry=2, pos=pos@entry=0xc00000000670bda0) at fs/read_write.c:582
 #9  0xc0000000005b4264 in ksys_write (fd=<optimized out>,
    buf=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>, count=2)
    at fs/read_write.c:637
 #10 0xc00000000002ea2c in system_call_exception (regs=0xc00000000670be80, r0=<optimized out>)
    at arch/powerpc/kernel/syscall.c:171
 #11 0xc00000000000c270 in system_call_vectored_common ()
    at arch/powerpc/kernel/interrupt_64.S:192
--------

Nick adds:
  So this now saves regs as though it was an interrupt taken in the
  caller, at the instruction after the call to ppc_save_regs, whereas
  previously the NIP was there, but R1 came from the caller's caller and
  that mismatch is what causes gdb's dwarf unwinder to go haywire.

Signed-off-by: Aditya Gupta <[email protected]>
Fixes: d16a58f ("powerpc: Improve ppc_save_regs()")
Reivewed-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
Cc: [email protected]
Signed-off-by: Aditya Gupta <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Joshua-Riek pushed a commit that referenced this pull request Sep 1, 2024
When l2tp tunnels use a socket provided by userspace, we can hit
lockdep splats like the below when data is transmitted through another
(unrelated) userspace socket which then gets routed over l2tp.

This issue was previously discussed here:
https://lore.kernel.org/netdev/[email protected]/

The solution is to have lockdep treat socket locks of l2tp tunnel
sockets separately than those of standard INET sockets. To do so, use
a different lockdep subclass where lock nesting is possible.

  ============================================
  WARNING: possible recursive locking detected
  6.10.0+ #34 Not tainted
  --------------------------------------------
  iperf3/771 is trying to acquire lock:
  ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0

  but task is already holding lock:
  ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(slock-AF_INET/1);
    lock(slock-AF_INET/1);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  10 locks held by iperf3/771:
   #0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40
   #1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   #2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   #3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0
   #4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260
   #5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
   #6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   #7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   #8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450
   #9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450

  stack backtrace:
  CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ #34
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x69/0xa0
   dump_stack+0xc/0x20
   __lock_acquire+0x135d/0x2600
   ? srso_alias_return_thunk+0x5/0xfbef5
   lock_acquire+0xc4/0x2a0
   ? l2tp_xmit_skb+0x243/0x9d0
   ? __skb_checksum+0xa3/0x540
   _raw_spin_lock_nested+0x35/0x50
   ? l2tp_xmit_skb+0x243/0x9d0
   l2tp_xmit_skb+0x243/0x9d0
   l2tp_eth_dev_xmit+0x3c/0xc0
   dev_hard_start_xmit+0x11e/0x420
   sch_direct_xmit+0xc3/0x640
   __dev_queue_xmit+0x61c/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   __tcp_send_ack+0x1b8/0x340
   tcp_send_ack+0x23/0x30
   __tcp_ack_snd_check+0xa8/0x530
   ? srso_alias_return_thunk+0x5/0xfbef5
   tcp_rcv_established+0x412/0xd70
   tcp_v4_do_rcv+0x299/0x420
   tcp_v4_rcv+0x1991/0x1e10
   ip_protocol_deliver_rcu+0x50/0x220
   ip_local_deliver_finish+0x158/0x260
   ip_local_deliver+0xc8/0xe0
   ip_rcv+0xe5/0x1d0
   ? __pfx_ip_rcv+0x10/0x10
   __netif_receive_skb_one_core+0xce/0xe0
   ? process_backlog+0x28b/0x9f0
   __netif_receive_skb+0x34/0xd0
   ? process_backlog+0x28b/0x9f0
   process_backlog+0x2cb/0x9f0
   __napi_poll.constprop.0+0x61/0x280
   net_rx_action+0x332/0x670
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? find_held_lock+0x2b/0x80
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   handle_softirqs+0xda/0x480
   ? __dev_queue_xmit+0xa2c/0x1450
   do_softirq+0xa1/0xd0
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0xc8/0xe0
   ? __dev_queue_xmit+0xa2c/0x1450
   __dev_queue_xmit+0xa48/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   tcp_write_xmit+0x766/0x2fb0
   ? __entry_text_end+0x102ba9/0x102bad
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __might_fault+0x74/0xc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   __tcp_push_pending_frames+0x56/0x190
   tcp_push+0x117/0x310
   tcp_sendmsg_locked+0x14c1/0x1740
   tcp_sendmsg+0x28/0x40
   inet_sendmsg+0x5d/0x90
   sock_write_iter+0x242/0x2b0
   vfs_write+0x68d/0x800
   ? __pfx_sock_write_iter+0x10/0x10
   ksys_write+0xc8/0xf0
   __x64_sys_write+0x3d/0x50
   x64_sys_call+0xfaf/0x1f50
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4d143af992
  Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0
  RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992
  RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005
  RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc
  R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0
   </TASK>

Fixes: 0b2c597 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Suggested-by: Eric Dumazet <[email protected]>
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4
CC: [email protected]
CC: [email protected]
Signed-off-by: James Chapman <[email protected]>
Signed-off-by: Tom Parkin <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Joshua-Riek pushed a commit that referenced this pull request Sep 16, 2024
Ethtool callbacks can be executed while reset is in progress and try to
access deleted resources, e.g. getting coalesce settings can result in a
NULL pointer dereference seen below.

Reproduction steps:
Once the driver is fully initialized, trigger reset:
	# echo 1 > /sys/class/net/<interface>/device/reset
when reset is in progress try to get coalesce settings using ethtool:
	# ethtool -c <interface>

BUG: kernel NULL pointer dereference, address: 0000000000000020
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 11 PID: 19713 Comm: ethtool Tainted: G S                 6.10.0-rc7+ #7
RIP: 0010:ice_get_q_coalesce+0x2e/0xa0 [ice]
RSP: 0018:ffffbab1e9bcf6a8 EFLAGS: 00010206
RAX: 000000000000000c RBX: ffff94512305b028 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9451c3f2e588 RDI: ffff9451c3f2e588
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff9451c3f2e580 R11: 000000000000001f R12: ffff945121fa9000
R13: ffffbab1e9bcf760 R14: 0000000000000013 R15: ffffffff9e65dd40
FS:  00007faee5fbe740(0000) GS:ffff94546fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 0000000106c2e005 CR4: 00000000001706f0
Call Trace:
<TASK>
ice_get_coalesce+0x17/0x30 [ice]
coalesce_prepare_data+0x61/0x80
ethnl_default_doit+0xde/0x340
genl_family_rcv_msg_doit+0xf2/0x150
genl_rcv_msg+0x1b3/0x2c0
netlink_rcv_skb+0x5b/0x110
genl_rcv+0x28/0x40
netlink_unicast+0x19c/0x290
netlink_sendmsg+0x222/0x490
__sys_sendto+0x1df/0x1f0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7faee60d8e27

Calling netif_device_detach() before reset makes the net core not call
the driver when ethtool command is issued, the attempt to execute an
ethtool command during reset will result in the following message:

    netlink error: No such device

instead of NULL pointer dereference. Once reset is done and
ice_rebuild() is executing, the netif_device_attach() is called to allow
for ethtool operations to occur again in a safe manner.

Fixes: fcea6f3 ("ice: Add stats and ethtool support")
Suggested-by: Jakub Kicinski <[email protected]>
Reviewed-by: Igor Bagnucki <[email protected]>
Signed-off-by: Dawid Osuchowski <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Reviewed-by: Michal Schmidt <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.