vc4-drm : Are we losing hardware-accelerated blitting? #38

vanfanel · 2016-06-11T17:16:37Z

Hi,

I have been adapting other people's pupular programs (RetroArch, Scummvm, and a long etc) to use the dispmanx API for years now.
However, dispmanx should, sooner or later, be superseeded by KMS/DRM, so I am adapting RetroArch to it.

With dispmanx, things were easy for blitting. Let's say I want to copy a 256x256 rect for a 298x256 buffer to a dismanx "resource" (buffer). Well, I have this GREAT dispmanx function for that:

vc_dispmanx_resource_write_data( DISPMANX_RESOURCE_HANDLE_T res, VC_IMAGE_TYPE_T src_type, int src_pitch, void * src_address, const VC_RECT_T * rect );

You see, I can pass a pitch, let's say 256*4 if I am blitting a pixel array with 4 bytes per pixel, and then it will be uploaded to the GPU without using the CPU for the transfer. Very fast and good solution!

But in KMS, using a dumb buffer, without an specific function to do it, I would have to copy a pitch of pixels each line, so in 256 lines I would be calling memcpy() 256 times to archieve the same. And that's for small rect...

So, I have seen that /usr/include/drm contains some hardware-specific implementations of blitting functions. Is there something similar on the Pi? I don't mind using IOCTLs or whatever is needed...

anholt · 2016-06-14T17:22:03Z

If you're using KMS directly, why not just map your dumb buffer and write the data directly into it in the first place, going from 1 or 2 copies with your current implementation, down to 0 copies?

vanfanel · 2016-06-14T20:50:09Z

@anholt : I fear that is not possible due to how RetroArch works. This is the prototype of the function on which RetroArch does an screen refresh:

static bool kms_gfx_frame(void *data, const void *frame, unsigned width,
      unsigned height, uint64_t frame_count, unsigned pitch, const char *msg)

The function has the same parameters always. I mean, it's how it works in RetroArch internally: it passes me (I am on the KMS side of things) a pointer to the pixel array. Not exactly optimal for my interests because it forces me to make at leas ONE copy: from the internal Retroarch pixel array to my dumb buffers. These buffers are mapped already so I simply memcpy() to them for now.

But that's not the problem now. The problem is blitting. As I said, in many cases RetroArch will send me a pixel array on which there are extra, not-meant-to-be-renderer pixels between scanlines.
So I have to blit a rect extracted from another rect with a different pitch.
That's what vc_dispmanx_resource_write_data() allowed me to do since it accepted a pitch, and copied only that pitch per line.
Now, without vc_dispmanx_resource_write_data(), I would have to iterate over each line in the source pixels array, and copy only part of the line. That's a 256-iteration FOR loop per frame with the corresponding memcpy() in each iteration, and that when RetroArch is rendering a very low resolution system or game. In a 640x480 game/system (like scummvm running on libretro) I would have to do 480 memcpy() calls per frame, and so on. I am sure the hardware can do blitting, as it does on the dispmanx API.
I really need a way to support that blitting (with a custom pitch) by the KMS/DRM system, to make for the vc_dispmanx_resource_write_data() loss.

anholt · 2016-06-15T21:01:45Z

Have you actually measured and found that the memcpy call overhead is a problem compared to a single memcpy? Because I bet you'll have a difficult time measuring a difference.

This should also be faster for the overall system than the dispmanx version. For shipping your pixels to dispmanx with that call, the kernel would need to pin the pages covering your area, look up their addresses and hand them across to the firmware, and then the firmware would do the loop of memcpys on the VPU. That's a lot of setup and communication overhead to get the same memcpy loop done on the same memory bus.

vanfanel · 2016-06-19T17:19:55Z

@anholt : Ok, made measurements and it's not that much really, at least on a Pi3 where memcpy() calls are faster. On a Pi1, there is a considerable performance penalty using this.
However, I have seen these in my lsmod:

syscopyarea 2945 1 drm_kms_helper
sysfillrect 3443 1 drm_kms_helper
sysimgblt 2069 1 drm_kms_helper

syscopyarea? sysimgblt? Are you sure we don't have methods to do hardware blitting here? Maybe these are totally unrelated, but still I'd like to ask.

anholt · 2016-08-10T18:46:28Z

Those sys* are just for fbdev. We could probably accelerate fbdev using the dma engine, but nobody's built those helpers yet.

anholt · 2016-10-13T02:10:50Z

Actually, if you care about the number of memcpy loops, you could just make your drm_framebuffer have the stride that you want, and copy your whole buffer. vc4 doesn't have any restrictions on pitch alignment that I know of.

Correct these checkpatch.pl errors: |ERROR: space required before that '-' (ctx:OxO) |#37: FILE: include/linux/bug.h:37: |+#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); })) |ERROR: space required before that '-' (ctx:OxO) |#38: FILE: include/linux/bug.h:38: |+#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); })) I decided to wrap the bitfield expressions that begin with minus signs in parentheses rather than insert spaces before the minus signs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ian Abbott <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Kees Cook <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

pppol2tp_release uses call_rcu to put the final ref on its socket. But the session object doesn't hold a ref on the session socket so may be freed while the pppol2tp_put_sk RCU callback is scheduled. Fix this by having the session hold a ref on its socket until the session is destroyed. It is this ref that is dropped via call_rcu. Sessions are also deleted via l2tp_tunnel_closeall. This must now also put the final ref via call_rcu. So move the call_rcu call site into pppol2tp_session_close so that this happens in both destroy paths. A common destroy path should really be implemented, perhaps with l2tp_tunnel_closeall calling l2tp_session_delete like pppol2tp_release does, but this will be looked at later. ODEBUG: activate active (active state 1) object type: rcu_head hint: (null) WARNING: CPU: 3 PID: 13407 at lib/debugobjects.c:291 debug_print_object+0x166/0x220 Modules linked in: CPU: 3 PID: 13407 Comm: syzbot_19c09769 Not tainted 4.16.0-rc2+ anholt#38 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:debug_print_object+0x166/0x220 RSP: 0018:ffff880013647a00 EFLAGS: 00010082 RAX: dffffc0000000008 RBX: 0000000000000003 RCX: ffffffff814d3333 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88001a59f6d0 RBP: ffff880013647a40 R08: 0000000000000000 R09: 0000000000000001 R10: ffff8800136479a8 R11: 0000000000000000 R12: 0000000000000001 R13: ffffffff86161420 R14: ffffffff85648b60 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88001a580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020e77000 CR3: 0000000006022000 CR4: 00000000000006e0 Call Trace: debug_object_activate+0x38b/0x530 ? debug_object_assert_init+0x3b0/0x3b0 ? __mutex_unlock_slowpath+0x85/0x8b0 ? pppol2tp_session_destruct+0x110/0x110 __call_rcu.constprop.66+0x39/0x890 ? __call_rcu.constprop.66+0x39/0x890 call_rcu_sched+0x17/0x20 pppol2tp_release+0x2c7/0x440 ? fcntl_setlk+0xca0/0xca0 ? sock_alloc_file+0x340/0x340 sock_release+0x92/0x1e0 sock_close+0x1b/0x20 __fput+0x296/0x6e0 ____fput+0x1a/0x20 task_work_run+0x127/0x1a0 do_exit+0x7f9/0x2ce0 ? SYSC_connect+0x212/0x310 ? mm_update_next_owner+0x690/0x690 ? up_read+0x1f/0x40 ? __do_page_fault+0x3c8/0xca0 do_group_exit+0x10d/0x330 ? do_group_exit+0x330/0x330 SyS_exit_group+0x22/0x30 do_syscall_64+0x1e0/0x730 ? trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7f362e471259 RSP: 002b:00007ffe389abe08 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f362e471259 RDX: 00007f362e471259 RSI: 000000000000002e RDI: 0000000000000000 RBP: 00007ffe389abe30 R08: 0000000000000000 R09: 00007f362e944270 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000400b60 R13: 00007ffe389abf50 R14: 0000000000000000 R15: 0000000000000000 Code: 8d 3c dd a0 8f 64 85 48 89 fa 48 c1 ea 03 80 3c 02 00 75 7b 48 8b 14 dd a0 8f 64 85 4c 89 f6 48 c7 c7 20 85 64 85 e 8 2a 55 14 ff <0f> 0b 83 05 ad 2a 68 04 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 Fixes: ee40fb2 ("l2tp: protect sock pointer of struct pppol2tp_session with RCU") Signed-off-by: James Chapman <[email protected]> Signed-off-by: David S. Miller <[email protected]>

vanfanel changed the title ~~KMS/DRM: Are we losing hardware-accelerated blitting?~~ vc4-drm : Are we losing hardware-accelerated blitting? Jun 11, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vc4-drm : Are we losing hardware-accelerated blitting? #38

vc4-drm : Are we losing hardware-accelerated blitting? #38

vanfanel commented Jun 11, 2016 •

edited

Loading

anholt commented Jun 14, 2016

vanfanel commented Jun 14, 2016 •

edited

Loading

anholt commented Jun 15, 2016

vanfanel commented Jun 19, 2016 •

edited

Loading

anholt commented Aug 10, 2016

anholt commented Oct 13, 2016

vc4-drm : Are we losing hardware-accelerated blitting? #38

vc4-drm : Are we losing hardware-accelerated blitting? #38

Comments

vanfanel commented Jun 11, 2016 • edited Loading

anholt commented Jun 14, 2016

vanfanel commented Jun 14, 2016 • edited Loading

anholt commented Jun 15, 2016

vanfanel commented Jun 19, 2016 • edited Loading

anholt commented Aug 10, 2016

anholt commented Oct 13, 2016

vanfanel commented Jun 11, 2016 •

edited

Loading

vanfanel commented Jun 14, 2016 •

edited

Loading

vanfanel commented Jun 19, 2016 •

edited

Loading