Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vc4-drm : Are we losing hardware-accelerated blitting? #38

Open
vanfanel opened this issue Jun 11, 2016 · 6 comments
Open

vc4-drm : Are we losing hardware-accelerated blitting? #38

vanfanel opened this issue Jun 11, 2016 · 6 comments

Comments

@vanfanel
Copy link

vanfanel commented Jun 11, 2016

Hi,

I have been adapting other people's pupular programs (RetroArch, Scummvm, and a long etc) to use the dispmanx API for years now.
However, dispmanx should, sooner or later, be superseeded by KMS/DRM, so I am adapting RetroArch to it.

With dispmanx, things were easy for blitting. Let's say I want to copy a 256x256 rect for a 298x256 buffer to a dismanx "resource" (buffer). Well, I have this GREAT dispmanx function for that:

vc_dispmanx_resource_write_data( DISPMANX_RESOURCE_HANDLE_T res, VC_IMAGE_TYPE_T src_type, int src_pitch, void * src_address, const VC_RECT_T * rect );

You see, I can pass a pitch, let's say 256*4 if I am blitting a pixel array with 4 bytes per pixel, and then it will be uploaded to the GPU without using the CPU for the transfer. Very fast and good solution!

But in KMS, using a dumb buffer, without an specific function to do it, I would have to copy a pitch of pixels each line, so in 256 lines I would be calling memcpy() 256 times to archieve the same. And that's for small rect...

So, I have seen that /usr/include/drm contains some hardware-specific implementations of blitting functions. Is there something similar on the Pi? I don't mind using IOCTLs or whatever is needed...

@vanfanel vanfanel changed the title KMS/DRM: Are we losing hardware-accelerated blitting? vc4-drm : Are we losing hardware-accelerated blitting? Jun 11, 2016
@anholt
Copy link
Owner

anholt commented Jun 14, 2016

If you're using KMS directly, why not just map your dumb buffer and write the data directly into it in the first place, going from 1 or 2 copies with your current implementation, down to 0 copies?

@vanfanel
Copy link
Author

vanfanel commented Jun 14, 2016

@anholt : I fear that is not possible due to how RetroArch works. This is the prototype of the function on which RetroArch does an screen refresh:

static bool kms_gfx_frame(void *data, const void *frame, unsigned width,
      unsigned height, uint64_t frame_count, unsigned pitch, const char *msg)  

The function has the same parameters always. I mean, it's how it works in RetroArch internally: it passes me (I am on the KMS side of things) a pointer to the pixel array. Not exactly optimal for my interests because it forces me to make at leas ONE copy: from the internal Retroarch pixel array to my dumb buffers. These buffers are mapped already so I simply memcpy() to them for now.

But that's not the problem now. The problem is blitting. As I said, in many cases RetroArch will send me a pixel array on which there are extra, not-meant-to-be-renderer pixels between scanlines.
So I have to blit a rect extracted from another rect with a different pitch.
That's what vc_dispmanx_resource_write_data() allowed me to do since it accepted a pitch, and copied only that pitch per line.
Now, without vc_dispmanx_resource_write_data(), I would have to iterate over each line in the source pixels array, and copy only part of the line. That's a 256-iteration FOR loop per frame with the corresponding memcpy() in each iteration, and that when RetroArch is rendering a very low resolution system or game. In a 640x480 game/system (like scummvm running on libretro) I would have to do 480 memcpy() calls per frame, and so on. I am sure the hardware can do blitting, as it does on the dispmanx API.
I really need a way to support that blitting (with a custom pitch) by the KMS/DRM system, to make for the vc_dispmanx_resource_write_data() loss.

@anholt
Copy link
Owner

anholt commented Jun 15, 2016

Have you actually measured and found that the memcpy call overhead is a problem compared to a single memcpy? Because I bet you'll have a difficult time measuring a difference.

This should also be faster for the overall system than the dispmanx version. For shipping your pixels to dispmanx with that call, the kernel would need to pin the pages covering your area, look up their addresses and hand them across to the firmware, and then the firmware would do the loop of memcpys on the VPU. That's a lot of setup and communication overhead to get the same memcpy loop done on the same memory bus.

@vanfanel
Copy link
Author

vanfanel commented Jun 19, 2016

@anholt : Ok, made measurements and it's not that much really, at least on a Pi3 where memcpy() calls are faster. On a Pi1, there is a considerable performance penalty using this.
However, I have seen these in my lsmod:

syscopyarea 2945 1 drm_kms_helper
sysfillrect 3443 1 drm_kms_helper
sysimgblt 2069 1 drm_kms_helper

syscopyarea? sysimgblt? Are you sure we don't have methods to do hardware blitting here? Maybe these are totally unrelated, but still I'd like to ask.

@anholt
Copy link
Owner

anholt commented Aug 10, 2016

Those sys* are just for fbdev. We could probably accelerate fbdev using the dma engine, but nobody's built those helpers yet.

@anholt
Copy link
Owner

anholt commented Oct 13, 2016

Actually, if you care about the number of memcpy loops, you could just make your drm_framebuffer have the stride that you want, and copy your whole buffer. vc4 doesn't have any restrictions on pitch alignment that I know of.

anholt pushed a commit that referenced this issue Jul 19, 2017
Correct these checkpatch.pl errors:

|ERROR: space required before that '-' (ctx:OxO)
|#37: FILE: include/linux/bug.h:37:
|+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))

|ERROR: space required before that '-' (ctx:OxO)
|#38: FILE: include/linux/bug.h:38:
|+#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); }))

I decided to wrap the bitfield expressions that begin with minus signs
in parentheses rather than insert spaces before the minus signs.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ian Abbott <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
stschake pushed a commit to stschake/linux that referenced this issue Mar 16, 2018
pppol2tp_release uses call_rcu to put the final ref on its socket. But
the session object doesn't hold a ref on the session socket so may be
freed while the pppol2tp_put_sk RCU callback is scheduled. Fix this by
having the session hold a ref on its socket until the session is
destroyed. It is this ref that is dropped via call_rcu.

Sessions are also deleted via l2tp_tunnel_closeall. This must now also put
the final ref via call_rcu. So move the call_rcu call site into
pppol2tp_session_close so that this happens in both destroy paths. A
common destroy path should really be implemented, perhaps with
l2tp_tunnel_closeall calling l2tp_session_delete like pppol2tp_release
does, but this will be looked at later.

ODEBUG: activate active (active state 1) object type: rcu_head hint:           (null)
WARNING: CPU: 3 PID: 13407 at lib/debugobjects.c:291 debug_print_object+0x166/0x220
Modules linked in:
CPU: 3 PID: 13407 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ anholt#38
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:debug_print_object+0x166/0x220
RSP: 0018:ffff880013647a00 EFLAGS: 00010082
RAX: dffffc0000000008 RBX: 0000000000000003 RCX: ffffffff814d3333
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88001a59f6d0
RBP: ffff880013647a40 R08: 0000000000000000 R09: 0000000000000001
R10: ffff8800136479a8 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff86161420 R14: ffffffff85648b60 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88001a580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020e77000 CR3: 0000000006022000 CR4: 00000000000006e0
Call Trace:
 debug_object_activate+0x38b/0x530
 ? debug_object_assert_init+0x3b0/0x3b0
 ? __mutex_unlock_slowpath+0x85/0x8b0
 ? pppol2tp_session_destruct+0x110/0x110
 __call_rcu.constprop.66+0x39/0x890
 ? __call_rcu.constprop.66+0x39/0x890
 call_rcu_sched+0x17/0x20
 pppol2tp_release+0x2c7/0x440
 ? fcntl_setlk+0xca0/0xca0
 ? sock_alloc_file+0x340/0x340
 sock_release+0x92/0x1e0
 sock_close+0x1b/0x20
 __fput+0x296/0x6e0
 ____fput+0x1a/0x20
 task_work_run+0x127/0x1a0
 do_exit+0x7f9/0x2ce0
 ? SYSC_connect+0x212/0x310
 ? mm_update_next_owner+0x690/0x690
 ? up_read+0x1f/0x40
 ? __do_page_fault+0x3c8/0xca0
 do_group_exit+0x10d/0x330
 ? do_group_exit+0x330/0x330
 SyS_exit_group+0x22/0x30
 do_syscall_64+0x1e0/0x730
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f362e471259
RSP: 002b:00007ffe389abe08 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f362e471259
RDX: 00007f362e471259 RSI: 000000000000002e RDI: 0000000000000000
RBP: 00007ffe389abe30 R08: 0000000000000000 R09: 00007f362e944270
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000400b60
R13: 00007ffe389abf50 R14: 0000000000000000 R15: 0000000000000000
Code: 8d 3c dd a0 8f 64 85 48 89 fa 48 c1 ea 03 80 3c 02 00 75 7b 48 8b 14 dd a0 8f 64 85 4c 89 f6 48 c7 c7 20 85 64 85 e
8 2a 55 14 ff <0f> 0b 83 05 ad 2a 68 04 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41

Fixes: ee40fb2 ("l2tp: protect sock pointer of struct pppol2tp_session with RCU")
Signed-off-by: James Chapman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants