-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vc4-drm : Are we losing hardware-accelerated blitting? #38
Comments
If you're using KMS directly, why not just map your dumb buffer and write the data directly into it in the first place, going from 1 or 2 copies with your current implementation, down to 0 copies? |
@anholt : I fear that is not possible due to how RetroArch works. This is the prototype of the function on which RetroArch does an screen refresh:
The function has the same parameters always. I mean, it's how it works in RetroArch internally: it passes me (I am on the KMS side of things) a pointer to the pixel array. Not exactly optimal for my interests because it forces me to make at leas ONE copy: from the internal Retroarch pixel array to my dumb buffers. These buffers are mapped already so I simply memcpy() to them for now. But that's not the problem now. The problem is blitting. As I said, in many cases RetroArch will send me a pixel array on which there are extra, not-meant-to-be-renderer pixels between scanlines. |
Have you actually measured and found that the memcpy call overhead is a problem compared to a single memcpy? Because I bet you'll have a difficult time measuring a difference. This should also be faster for the overall system than the dispmanx version. For shipping your pixels to dispmanx with that call, the kernel would need to pin the pages covering your area, look up their addresses and hand them across to the firmware, and then the firmware would do the loop of memcpys on the VPU. That's a lot of setup and communication overhead to get the same memcpy loop done on the same memory bus. |
@anholt : Ok, made measurements and it's not that much really, at least on a Pi3 where memcpy() calls are faster. On a Pi1, there is a considerable performance penalty using this.
syscopyarea? sysimgblt? Are you sure we don't have methods to do hardware blitting here? Maybe these are totally unrelated, but still I'd like to ask. |
Those sys* are just for fbdev. We could probably accelerate fbdev using the dma engine, but nobody's built those helpers yet. |
Actually, if you care about the number of memcpy loops, you could just make your drm_framebuffer have the stride that you want, and copy your whole buffer. vc4 doesn't have any restrictions on pitch alignment that I know of. |
Correct these checkpatch.pl errors: |ERROR: space required before that '-' (ctx:OxO) |#37: FILE: include/linux/bug.h:37: |+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); })) |ERROR: space required before that '-' (ctx:OxO) |#38: FILE: include/linux/bug.h:38: |+#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); })) I decided to wrap the bitfield expressions that begin with minus signs in parentheses rather than insert spaces before the minus signs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ian Abbott <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Kees Cook <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
pppol2tp_release uses call_rcu to put the final ref on its socket. But the session object doesn't hold a ref on the session socket so may be freed while the pppol2tp_put_sk RCU callback is scheduled. Fix this by having the session hold a ref on its socket until the session is destroyed. It is this ref that is dropped via call_rcu. Sessions are also deleted via l2tp_tunnel_closeall. This must now also put the final ref via call_rcu. So move the call_rcu call site into pppol2tp_session_close so that this happens in both destroy paths. A common destroy path should really be implemented, perhaps with l2tp_tunnel_closeall calling l2tp_session_delete like pppol2tp_release does, but this will be looked at later. ODEBUG: activate active (active state 1) object type: rcu_head hint: (null) WARNING: CPU: 3 PID: 13407 at lib/debugobjects.c:291 debug_print_object+0x166/0x220 Modules linked in: CPU: 3 PID: 13407 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ anholt#38 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:debug_print_object+0x166/0x220 RSP: 0018:ffff880013647a00 EFLAGS: 00010082 RAX: dffffc0000000008 RBX: 0000000000000003 RCX: ffffffff814d3333 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88001a59f6d0 RBP: ffff880013647a40 R08: 0000000000000000 R09: 0000000000000001 R10: ffff8800136479a8 R11: 0000000000000000 R12: 0000000000000001 R13: ffffffff86161420 R14: ffffffff85648b60 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88001a580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020e77000 CR3: 0000000006022000 CR4: 00000000000006e0 Call Trace: debug_object_activate+0x38b/0x530 ? debug_object_assert_init+0x3b0/0x3b0 ? __mutex_unlock_slowpath+0x85/0x8b0 ? pppol2tp_session_destruct+0x110/0x110 __call_rcu.constprop.66+0x39/0x890 ? __call_rcu.constprop.66+0x39/0x890 call_rcu_sched+0x17/0x20 pppol2tp_release+0x2c7/0x440 ? fcntl_setlk+0xca0/0xca0 ? sock_alloc_file+0x340/0x340 sock_release+0x92/0x1e0 sock_close+0x1b/0x20 __fput+0x296/0x6e0 ____fput+0x1a/0x20 task_work_run+0x127/0x1a0 do_exit+0x7f9/0x2ce0 ? SYSC_connect+0x212/0x310 ? mm_update_next_owner+0x690/0x690 ? up_read+0x1f/0x40 ? __do_page_fault+0x3c8/0xca0 do_group_exit+0x10d/0x330 ? do_group_exit+0x330/0x330 SyS_exit_group+0x22/0x30 do_syscall_64+0x1e0/0x730 ? trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7f362e471259 RSP: 002b:00007ffe389abe08 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f362e471259 RDX: 00007f362e471259 RSI: 000000000000002e RDI: 0000000000000000 RBP: 00007ffe389abe30 R08: 0000000000000000 R09: 00007f362e944270 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000400b60 R13: 00007ffe389abf50 R14: 0000000000000000 R15: 0000000000000000 Code: 8d 3c dd a0 8f 64 85 48 89 fa 48 c1 ea 03 80 3c 02 00 75 7b 48 8b 14 dd a0 8f 64 85 4c 89 f6 48 c7 c7 20 85 64 85 e 8 2a 55 14 ff <0f> 0b 83 05 ad 2a 68 04 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 Fixes: ee40fb2 ("l2tp: protect sock pointer of struct pppol2tp_session with RCU") Signed-off-by: James Chapman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Hi,
I have been adapting other people's pupular programs (RetroArch, Scummvm, and a long etc) to use the dispmanx API for years now.
However, dispmanx should, sooner or later, be superseeded by KMS/DRM, so I am adapting RetroArch to it.
With dispmanx, things were easy for blitting. Let's say I want to copy a 256x256 rect for a 298x256 buffer to a dismanx "resource" (buffer). Well, I have this GREAT dispmanx function for that:
vc_dispmanx_resource_write_data( DISPMANX_RESOURCE_HANDLE_T res, VC_IMAGE_TYPE_T src_type, int src_pitch, void * src_address, const VC_RECT_T * rect );
You see, I can pass a pitch, let's say 256*4 if I am blitting a pixel array with 4 bytes per pixel, and then it will be uploaded to the GPU without using the CPU for the transfer. Very fast and good solution!
But in KMS, using a dumb buffer, without an specific function to do it, I would have to copy a pitch of pixels each line, so in 256 lines I would be calling memcpy() 256 times to archieve the same. And that's for small rect...
So, I have seen that /usr/include/drm contains some hardware-specific implementations of blitting functions. Is there something similar on the Pi? I don't mind using IOCTLs or whatever is needed...
The text was updated successfully, but these errors were encountered: