
nvenc hardware accelerated encoding #370

Closed
totaam opened this issue Jul 6, 2013 · 44 comments

Comments


totaam commented Jul 6, 2013

Issue migrated from trac ticket #370

component: server | priority: major | resolution: fixed

2013-07-06 09:22:33: antoine created the issue


See also #451 (libva accelerated encoding)

Pointers:

Worth mentioning:


totaam commented Jul 6, 2013

2013-07-06 09:23:00: antoine uploaded file add-libva-stub.patch (10.9 KiB)

stub libva files to get going


totaam commented Jul 16, 2013

2013-07-16 06:43:47: antoine changed status from new to assigned


totaam commented Jul 16, 2013

2013-07-16 06:43:47: antoine commented


More pointers (for libva - not nvenc):


totaam commented Jul 18, 2013

2013-07-18 06:32:31: antoine edited the issue description


totaam commented Sep 9, 2013

2013-09-09 16:53:28: antoine uploaded file nvenc-stub.patch (57.2 KiB)

stubs for implementing an nvenc encoder


totaam commented Sep 13, 2013

2013-09-13 10:02:13: ahuillet commented


Attached is a gdb command file for tracing NVENC calls made by the sample app from the SDK.

Use:

$ gdb --args ./nvEncoder -config=config.txt -outFile=bba.h264 inFile=../YUV/1080p/HeavyHandIdiot.3sec.yuv
(gdb) b NvEncodeAPICreateInstance
Function "NvEncodeAPICreateInstance" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (NvEncodeAPICreateInstance) pending.
(gdb) r
Starting program: /home/ahuillet/nvenc_3.0_sdk/Samples/nvEncodeApp/nvEncoder -config=config.txt -outFile=bba.h264 inFile=../YUV/1080p/HeavyHandIdiot.3sec.yuv
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

>> GetNumberEncoders() has detected 8 CUDA capable GPU device(s) <<
  [ GPU #0 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #1 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #2 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #3 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #4 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #5 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #6 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
  [ GPU #7 - < GRID K1 > has Compute SM 3.0, NVENC Available ]
Breakpoint 1, 0x00007ffff6d179a0 in NvEncodeAPICreateInstance () from /lib64/libnvidia-encode.so.1
(gdb) n
(gdb) n
(gdb) source ~/trace_nvenc 
Breakpoint 2 at 0x7ffff6d175f0
...

Then continue the execution.
Results found in [/attachment/ticket/370/nvenc-trace.txt] (crazy long ticket comment cleaned up by totaam)


totaam commented Sep 13, 2013

2013-09-13 10:02:36: ahuillet uploaded file trace_nvenc (3.5 KiB)


totaam commented Sep 13, 2013

2013-09-13 10:20:59: ahuillet commented


After cleanup, this is the sequence of calls for setup:

nvEncOpenEncodeSessionEx
nvEncGetEncodeGUIDCount
nvEncGetEncodeGUIDs
nvEncGetEncodePresetCount
nvEncGetEncodePresetGUIDs
nvEncGetEncodePresetConfig
nvEncGetEncodeProfileGUIDCount
nvEncGetEncodeProfileGUIDs
nvEncGetInputFormatCount
nvEncGetInputFormats
nvEncInitializeEncoder

Then, create the input and output buffers:

nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
[repeat N times]
nvEncRegisterAsyncEvent

Then:

> NVENC Encoder[0] configuration parameters for configuration #0
> GPU Device ID             = 0
> Frames                    = 0 frames 
> ConfigFile                = (null) 
> Frame at which 0th configuration will happen = 0 
> maxWidth,maxHeight        = [1920,1080]
> Width,Height              = [1920,1080]
> Video Output Codec        = 4 - H.264 Codec
> Average Bitrate           = 6000000 (bps/sec)
> Peak Bitrate              = 0 (bps/sec)
> Rate Control Mode         = 1 - VBR (Variable Bitrate)
> Frame Rate (Num/Denom)    = (30000/1001) 29.9700 fps
> GOP Length                = 30
> Set Initial RC      QP    = 0
> Initial RC QP (I,P,B)     = I(0), P(0), B(0)
> Number of B Frames        = 2
> Display Aspect Ratio X    = 1920
> Display Aspect Ratio Y    = 1080
> Video codec profile       = 100
> Video codec Level         = 0
> FieldEncoding             = 0
> Number slices per Frame   = 1
> Encoder Preset            = 3 - High Quality (HQ) Preset
> NVENC API Interface       = 2 - CUDA
Input Filesize: 230227968 bytes
[ Source Input File ] = "../YUV/1080p/HeavyHandIdiot.3sec.yuv
[ # of Input Frames ] = 74
 -* Start Encode <../YUV/1080p/HeavyHandIdiot.3sec.yuv>, Frames [0,74] ** 
Loading Frames [0,73] into system memory queue (74 frames)
nvEncReconfigureEncoder
Encoding Frames [0,73]

and here is the actual encoding process, probably with the async trace:

nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
nvEncCreateInputBuffer
nvEncCreateBitstreamBuffer
nvEncRegisterAsyncEvent
nvEncCreateInputBuffer
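
For illustration, the expected setup order can be captured in a few lines of Python and used to sanity-check a trace like the one above (a minimal sketch - check_trace is not part of xpra, and it only verifies relative order, ignoring interleaved calls):

SETUP_SEQUENCE = [
    "nvEncOpenEncodeSessionEx",
    "nvEncGetEncodeGUIDCount", "nvEncGetEncodeGUIDs",
    "nvEncGetEncodePresetCount", "nvEncGetEncodePresetGUIDs",
    "nvEncGetEncodePresetConfig",
    "nvEncGetEncodeProfileGUIDCount", "nvEncGetEncodeProfileGUIDs",
    "nvEncGetInputFormatCount", "nvEncGetInputFormats",
    "nvEncInitializeEncoder",
]

def check_trace(calls):
    # True if the setup calls appear in the documented order
    # (other calls may be interleaved between them)
    it = iter(calls)
    return all(any(c == want for c in it) for want in SETUP_SEQUENCE)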


totaam commented Sep 13, 2013

2013-09-13 14:30:51: totaam commented


r4328 adds the stub encoder which successfully initializes nvenc (and nothing else yet..) and more: r4329 + r4330


totaam commented Sep 18, 2013

2013-09-18 08:41:31: totaam commented


Lots more (too many changesets to list, see r4349 and earlier)

We now get valid h264 data out of it and have tests in and out.

What still needs to be done:

  • see how many concurrent contexts we can have per GPU, at what resolution
  • load-balance when we have multiple GPUs
  • fix the TLS issue (same as 422#comment:7), without moving the full init cost too much later because:
  • workaround long initialization time (5 seconds on dual K1 system)
  • find a way to get the input data in the right format as the K1 I test on only supports 'NV12_TILED64x16' and 'YUV444_TILED64x16'
    Preferably on the GPU to save CPU (see cuda csc #384)
  • split the code into a support class (with all the cython bits and annoying header duplication.. wrapping the functionList and nvenc encoder context) and the encoder proper (may not even need to be cython?)
  • handle resize without a full encoder re-init (supported on nvenc >= 2.0) - will require a new codec capability and support code in the video source layer


totaam commented Sep 27, 2013

2013-09-27 10:41:09: totaam commented


With correct padding, r4375 seems to work - though I have seen some of those:

[h264 @ 0x7fa4687d4e20] non-existing PPS 0 referenced
[h264 @ 0x7fa4687d4e20] decode_slice_header error

On the client side...

We instantiate the client-side decoder with the window size (rounded down to an even size), whereas the data we get from nvenc has dimensions rounded up to a multiple of 32... Not sure if this is a problem, or if we can/should send the actual encoder size to the client.
Thinking about it some more: the vertical size must be rounded up to 32, otherwise nvenc does not find the U and V planes where it expects them... but I am not so sure about the horizontal size (as it seemed to work before without padding).

The padding could be useful when resizing a window: we don't need (at least not server side..) a full encoder re-init unless the new size crosses one of the 32-padded boundaries.

Note: it looks like the buffer formats advertised as being supported come in lists of bitmasks (the docs claim it is a plain list) - we use NV12_PL
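
For illustration, here is what the 32-pixel padding and the resize check it enables could look like (a minimal sketch, not xpra's actual code; assumes the alignment is a power of two):

def padded_size(w, h, align=32):
    # round up to the next multiple of `align`: nvenc needs the vertical
    # size padded so it finds the U and V planes where it expects them
    def pad(v):
        return (v + align - 1) & ~(align - 1)
    return pad(w), pad(h)

def needs_reinit(old_w, old_h, new_w, new_h):
    # a resize only requires a full encoder re-init when it crosses
    # one of the 32-padded boundaries
    return padded_size(old_w, old_h) != padded_size(new_w, new_h)

For example, padded_size(1918, 1058) gives (1920, 1088), so resizing from 1918x1058 to 1920x1080 would not need a re-init.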


totaam commented Sep 27, 2013

2013-09-27 15:41:30: totaam uploaded file csc-with-stride.patch (19.1 KiB)

abandoned work on adding stride attributes to csc so nvenc can specify the padding to 32 generically


totaam commented Sep 27, 2013

2013-09-27 15:43:09: totaam uploaded file csc-with-stride.2.patch (20.1 KiB)

abandoned work on adding stride attributes to csc so nvenc can specify the padding to 32 generically (again with missing server file)


totaam commented Sep 28, 2013

2013-09-28 11:26:46: totaam uploaded file nvenc-dualbuffers.patch (7.6 KiB)

use two buffers CUDA side so we can use a kernel to copy (and convert) from one to the other


totaam commented Sep 28, 2013

2013-09-28 11:27:22: totaam uploaded file nvenc-pycuda.patch (23.6 KiB)

use pycuda to remove lots of code... except this does not work because we need a context pointer for nvenc :(


totaam commented Sep 28, 2013

2013-09-28 12:59:00: totaam uploaded file nvenc-pycuda-with-kernel.patch (27.1 KiB)

"working" pycuda version with an empty kernel


totaam commented Sep 28, 2013

2013-09-28 14:32:50: totaam uploaded file nvenc-pycuda-with-kernel2.patch (27.6 KiB)

with kernel doing something - causes crashes..


totaam commented Sep 28, 2013

2013-09-28 14:49:47: totaam commented


With [/attachment/ticket/370/nvenc-pycuda-with-kernel2.patch], not too much work is left:

  • fix the kernel so it doesn't crash... obviously! using cuda-gdb with PyCUDA can be useful; at first glance our use of buffers seems correct: it matches this Multiple Threads example
  • converting the kernel to take RGB input (the opencl ones can be used as examples) and adjusting the buffer sizes accordingly
  • use the input "pixels" buffer directly for upload to the GPU rather than copying to a host buffer
  • remove the width padding? (not necessary?)
  • figure out the relationship between threads and nvenc contexts: can we have many contexts used from the same thread without problems?
  • loading takes too long (still): we must delay it until after we start the client or server, but we should try to load it before we need it: waiting 8 seconds for the first window update is not friendly!
  • "xpra info" is being slowed down for no good reason! (does a CUDA init to get the list of encodings..)
  • figure out the optimal grid and block sizes (see the sketch after this list)

The TLS issues still need resolving:

  • a client built with nvenc will not run (fails to load gtk2)
  • on server exit, we fail to run the exit hooks (something is intercepting SIGINT?)

Other things we may want to do:

  • avoid encoder re-init if the new window size does not cross the padded size boundaries
  • choose the CUDA device using gpuGetMaxGflopsDeviceId?
  • use prepared_call to speed up kernel invocation? (meh - not much to save there)
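
As a rough illustration of the grid/block sizing question (a sketch assuming one thread per output pixel; the kernel invocation is commented out and its name is hypothetical):

import numpy

def launch_dims(w, h, block_w=32, block_h=8):
    # one thread per pixel, with the grid rounded up to cover the whole area
    grid = ((w + block_w - 1) // block_w, (h + block_h - 1) // block_h)
    block = (block_w, block_h, 1)
    return grid, block

# grid, block = launch_dims(1920, 1088)
# BGRA_to_NV12(in_buf, out_buf, numpy.int32(1920), numpy.int32(1088),
#              grid=grid, block=block)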


totaam commented Oct 6, 2013

2013-10-06 12:05:50: totaam uploaded file nvenc-pycuda-with-kernel3.patch (26.8 KiB)

use (py)cuda to copy from input buffer (already in NV12 format) to output buffer (nvenc buffer)


totaam commented Oct 6, 2013

2013-10-06 17:33:34: totaam commented


r4414 adds the pycuda code (simplifies things) and does the BGRA to NV12 CSC using a custom CUDA kernel.
Total encoding time is way down.

New issues:

  • we probably need to always use the same CUDA context in the encoding thread (and not create a new context for each encoder) - don't try to use more than one encoding context at present: it seems to hang
  • we may want to zero out the padding (currently contains whatever random data was in the GPU's memory before we upload the pixels)
  • ensure we upload the pixels with a stride that allows CUDA to run with optimized memory accesses
  • easy to add support for RGBA by parameterizing the kernel code
  • handle scaling
  • optimize the kernel (use 32-bit memory accesses for grabbing RGB pixels)
  • honour max_block_sizes, max_grid_sizes and max_threads_per_block
  • allocate memory when needed rather than keeping it allocated for the duration of the encoder (fit more encoders on one card)
  • try mem_alloc instead of mem_alloc_pitch - and maybe use that for smaller areas? (where the padding becomes expensive)
  • move common code (kernel, cuda init, ..) to a support module which can be used by both csc_nvcuda and nvenc
  • compile the kernels at build time and load them with mod = driver.module_from_file(filename) - sketched below
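
A minimal sketch of that build-time approach (assumes nvcc is on $PATH; the file names and the "BGRA_to_NV12" kernel name are illustrative):

import subprocess
import pycuda.autoinit          # creates a CUDA context on the default device
from pycuda import driver

# build time: compile the kernel source for the target architecture
subprocess.check_call(["nvcc", "--cubin", "-arch=sm_30", "-o", "csc.cubin", "csc.cu"])

# run time: loading a pre-built cubin avoids paying the nvcc cost on startup
mod = driver.module_from_file("csc.cubin")
BGRA_to_NV12 = mod.get_function("BGRA_to_NV12")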

Here's how I run the server when testing:

PATH=$PATH:/usr/local/cuda-5.5/bin/ \
LD_LIBRARY_PATH=/usr/local/cuda-5.5/lib64 \
XPRA_NVENC_DEBUG=1 \
XPRA_DAMAGE_DEBUG=1 \
XPRA_VIDEOPIPELINE_DEBUG=1 \
XPRA_ENCODER_TYPE=nvenc \
xpra start :10


totaam commented Oct 25, 2013

2013-10-25 04:47:15: totaam commented


For building, here is the nvenc.pc pkgconfig file I've used on Fedora 19:

prefix=/opt/nvenc_3.0_sdk
exec_prefix=${prefix}
core_includedir=${prefix}/Samples/core/include
api_includedir=${prefix}/Samples/nvEncodeApp/inc
libdir=/usr/lib64/nvidia

Name: nvenc
Description: NVENC
Version: 1.0
Requires: 
Conflicts:
Libs: -L${libdir} -lnvidia-encode
Cflags: -I${core_includedir} -I${api_includedir}

Note: this refers to unversioned libraries, which you may need to create, here for a 64-bit build:

cd /usr/lib64/nvidia/
ln -sf libnvidia-encode.so.1 libnvidia-encode.so
ln -sf libcuda.so.1 libcuda.so
#etc..
cd /usr/lib64
ln -sf nvidia/libcuda.so ./

(or you can add the version to the pkgconfig file)


totaam commented Oct 25, 2013

2013-10-25 05:00:12: totaam uploaded file nvenc-trace.txt (42.1 KiB)

trace from comment:3


totaam commented Oct 25, 2013

2013-10-25 05:31:42: totaam commented


Instructions for installing NVENC support from scratch on Fedora 19:

  • make sure Fedora is up to date, reboot with the latest kernel
  • Download CUDA 5.5 for Fedora 18. Do not use the RPM one. Yes, Fedora 19 has been out for many months and still no packages...
  • Download NVENC SDK for Linux
  • use the rpmfusion akmod as nvidia's own driver fails to build (..). Verify that it loads with "modprobe nvidia" (and check dmesg for warnings/errors)
  • run the cuda*.run installer as root, but also add "-override-compiler":
sudo sh cuda_*.run -override-compiler

Do install CUDA; you can skip the rest, but you must tell it not to install the broken drivers it wants to install.

  • configure your environment (must be done every time CUDA is used):
export PATH=/usr/bin:/bin:/usr/local/cuda/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

Be very careful not to place cuda ahead of the regular LD_LIBRARY_PATH as this can cause big problems with some libraries (ie: libopencl)

  • unzip the NVENC SDK into /opt/
  • install the nvenc.pc and cuda.pc (see [cuda csc #384#comment:3]) pkgconfig files and add unversioned cuda and nvidia-encode libraries as per comment:10
  • install pycuda from source. It expects to find cuda in /opt, so I chose to symlink it:
ln -sf /usr/local/cuda /opt
  • on headless systems (no X11), you may need to create the missing /dev/nvidia* devices:
rm -f /dev/nvidia*
# Count the number of NVIDIA controllers found.
N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
N=`expr $N3D + $NVGA - 1`
for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i;
done
mknod -m 666 /dev/nvidiactl c 195 255

Finally, you can test that xpra builds with cuda/nvenc support:

./setup.py --with-nvenc --with-csc_nvcuda build

And that you can run the cuda/nvenc tests:

mkdir tmp && cd tmp
cp -apr ../tests ./
PYTHONPATH=. ./tests/xpra/codecs/test_csc_nvcuda.py
PYTHONPATH=. ./tests/xpra/codecs/test_nvenc.py


totaam commented Oct 25, 2013

2013-10-25 07:49:08: totaam commented


Strangely enough, the test encoder fails on a GTX 760 - and not with a graceful error:

$ gdb ./nvEncoder
(..)
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/nvenc_3.0_sdk/Samples/nvEncodeApp/nvEncoder...(no debugging symbols found)...done.
(gdb) break OpenEncodeSession
Function "OpenEncodeSession" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (OpenEncodeSession) pending.
(gdb) run -configFile=HeavyHand_1080p.txt -outfile=HeavyHandIdiot.3sec.264
Starting program: /opt/nvenc_3.0_sdk/Samples/nvEncodeApp/./nvEncoder -configFile=HeavyHand_1080p.txt -outfile=HeavyHandIdiot.3sec.264
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

>> GetNumberEncoders() has detected 1 CUDA capable GPU device(s) <<
  [ GPU #0 - < GeForce GTX 760 > has Compute SM 3.0, NVENC Available ]

>> InitCUDA() has detected 1 CUDA capable GPU device(s)<<
  [ GPU #0 - < GeForce GTX 760 > has Compute SM 3.0, Available NVENC ]

>> Select GPU #0 - < GeForce GTX 760 > supports SM 3.0 and NVENC
[New Thread 0x7ffff5bce700 (LWP 16417)]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000040b12f in CNvEncoder::OpenEncodeSession(int, char const**, unsigned int) ()
#2  0x000000000040dcb2 in CNvEncoderH264::EncoderMain(EncoderGPUInfo, EncoderAppParams, int, char const**) ()
#3  0x0000000000401f7b in main ()

What's even more strange is that our test code fails even earlier:

Traceback (most recent call last):
  File "./tests/xpra/codecs/test_nvenc.py", line 136, in <module>
    main()
  File "./tests/xpra/codecs/test_nvenc.py", line 128, in main
    test_encode_one()
  File "./tests/xpra/codecs/test_nvenc.py", line 17, in test_encode_one
    test_encoder(encoder_module)
  File "/home/spikesdev/src/tmp/tests/xpra/codecs/test_encoder.py", line 62, in test_encoder
    e.init_context(actual_w, actual_h, src_format, encoding, 20, 0, options)
  File "encoder.pyx", line 1179, in xpra.codecs.nvenc.encoder.Encoder.init_context (xpra/codecs/nvenc/encoder.c:5739)
  File "encoder.pyx", line 1217, in xpra.codecs.nvenc.encoder.Encoder.init_cuda (xpra/codecs/nvenc/encoder.c:6686)
  File "encoder.pyx", line 1232, in xpra.codecs.nvenc.encoder.Encoder.init_nvenc (xpra/codecs/nvenc/encoder.c:6813)
  File "encoder.pyx", line 1649, in xpra.codecs.nvenc.encoder.Encoder.open_encode_session (xpra/codecs/nvenc/encoder.c:12790)
  File "encoder.pyx", line 1102, in xpra.codecs.nvenc.encoder.raiseNVENC (xpra/codecs/nvenc/encoder.c:4935)
Exception: getting API function list - returned 15: This indicates that an invalid struct version was used by the client.

I am pretty sure that when I tested on a GTX 450, I got past this point and it failed when creating the context instead (since that card does not support nvenc) - that's why there is the XPRA_NVENC_FORCE flag in the code.

Edit: this is a problem with the newer drivers, which are incompatible with NVENC SDK v3. (SDK v2 works though!)
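
For what it's worth, the struct version handshake can be probed outside xpra with a few lines of ctypes (a rough sketch: the real NV_ENCODE_API_FUNCTION_LIST names every function pointer individually, simplified here to an opaque array, and the correct version value must come from the matching nvEncodeAPI.h):

import ctypes

lib = ctypes.CDLL("libnvidia-encode.so.1")

class NV_ENCODE_API_FUNCTION_LIST(ctypes.Structure):
    # simplified layout: version, reserved, then the function pointers
    _fields_ = [("version",  ctypes.c_uint32),
                ("reserved", ctypes.c_uint32),
                ("funcs",    ctypes.c_void_p * 64)]

flist = NV_ENCODE_API_FUNCTION_LIST()
flist.version = 0   # substitute NV_ENCODE_API_FUNCTION_LIST_VER from the SDK header
status = lib.NvEncodeAPICreateInstance(ctypes.byref(flist))
print("NvEncodeAPICreateInstance ->", status)   # 0 = success, 15 = invalid struct version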


totaam commented Oct 28, 2013

2013-10-28 08:40:39: totaam uploaded file nvenc-sdkv2.patch (22.3 KiB)

changes needed to build against the NVENC SDK version 2


totaam commented Oct 28, 2013

2013-10-28 08:41:42: totaam uploaded file nvenc2.pc (0.3 KiB)

example pkgconfig file for NVENC SDK version 2


totaam commented Oct 28, 2013

2013-10-28 10:27:19: totaam uploaded file nvenc-sdkv2-v2.patch (11.5 KiB)

updated (smaller) patch to apply on top of r4620


totaam commented Oct 28, 2013

2013-10-28 13:19:36: totaam commented


As of r4621 the code supports both SDK v2 and v3, use whichever works with your current driver version.


totaam commented Nov 1, 2013

2013-11-01 09:34:43: totaam commented


Took me a while to figure this out:

  • NVENC v3 works with the 319.xx driver series only
  • NVENC v2 works with driver versions 319.xx and above...
    Looks like nvidia forgot to test backwards compatibility with their "upgrade".

r4651 makes V3 the default again - "newer is better", right?

Anyway, installing the 319.49 driver on a system with kernel 3.10 or newer (ie: Fedora 19) is a PITA:

  • if you have them installed, remove the previous drivers, ie: for Fedora 19/rpmfusion:
sudo yum remove xorg-x11-drv-nvidia xorg-x11-drv-nvidia-libs akmod-nvidia kmod-nvidia
  • run the installer:
sudo sh NVIDIA-Linux-x86_64-319.49.run

which will fail at the DKMS stage if building against a kernel version 3.11 or newer..

  • apply this patch to nv-linux.h in /var/lib/dkms/nvidia/319.49/source/. The quick and dirty way:
sed -i -e 's/#define NV_NUM_PHYSPAGES num_physpages/#define NV_NUM_PHYSPAGES get_num_physpages/g' nv-linux.h
  • run the DKMS build again (adjust the driver version as needed):
sudo dkms install -m nvidia -v 319.49
  • create a new X11 config:
sudo nvidia-xconfig
  • reboot or restart gdm:
sudo service gdm restart

@totaam
Copy link
Collaborator Author

totaam commented Nov 1, 2013

2013-11-01 11:25:41: totaam commented


Important fix in r4652 which will need to be backported to v0.10.x


totaam commented Nov 5, 2013

2013-11-05 08:01:12: totaam edited the issue description


totaam commented Nov 5, 2013

2013-11-05 08:01:12: totaam changed title from hardware accelerated encoding: libva and/or nvenc to nvenc hardware accelerated encoding


totaam commented Nov 5, 2013

2013-11-05 08:01:12: totaam commented


Remaining tasks for nvenc:

  • dealing with context limits (32 per device at present)
  • workaround slow encoder initialization (could wait one second before trying to avoid wasting cycles, then initialize it in a new thread which becomes the encoding thread once complete)
  • handle scaling in cuda kernel
  • honour max_block_sizes, max_grid_sizes and max_threads_per_block
  • handle other RGB modes in kernel?
  • the cuda buffers are bigger than the picture we upload, we should pad the edges with zeroes (rather than the random garbage currently in there?)
  • handle YUV444P mode (see uint32_t separateColourPlaneFlag #[in]: Set to 1 to enable 4:4:4 separate colour planes)
  • multi-threaded issues? re-use the same cuda context from the same encoding thread?
  • we should build the cuda kernel at build time and load the "cubin" file and load them with mod = driver.module_from_file(filename)
  • handle resize without re-init when size changes fit in the padded size (and maybe make the padded size a little bit bigger too)
  • choose the cuda device using gpuGetMaxGflopsDeviceId: max_gflops = device_properties.multiProcessorCount * device_properties.clockRate;

At the moment, running out of contexts does this:

2013-11-05 14:40:58,590 setup_pipeline failed for (65, None, 'BGRX', codec_spec(nvenc))
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/window_video_source.py", line 605, in setup_pipeline
    self._video_encoder.init_context(enc_width, enc_height, enc_in_format, encoder_spec.encoding, quality, speed, self.encoding_options)
  File "encoder.pyx", line 1291, in xpra.codecs.nvenc.encoder.Encoder.init_context (xpra/codecs/nvenc/encoder.c:5883)
  File "encoder.pyx", line 1329, in xpra.codecs.nvenc.encoder.Encoder.init_cuda (xpra/codecs/nvenc/encoder.c:6830)
  File "encoder.pyx", line 1344, in xpra.codecs.nvenc.encoder.Encoder.init_nvenc (xpra/codecs/nvenc/encoder.c:6957)
  File "encoder.pyx", line 1828, in xpra.codecs.nvenc.encoder.Encoder.open_encode_session (xpra/codecs/nvenc/encoder.c:13775)
  File "encoder.pyx", line 1203, in xpra.codecs.nvenc.encoder.raiseNVENC (xpra/codecs/nvenc/encoder.c:5070)
Exception: opening session - returned 2: This indicates that devices pass by the client is not supported.
2013-11-05 14:40:58,593 error processing damage data: failed to setup a video pipeline for h264 encoding with source format BGRX
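
Since running out of contexts only surfaces as the hard failure above, a simple guard along these lines could avoid hitting the error path in the first place (illustrative only, not xpra's actual code):

MAX_CONTEXTS_PER_DEVICE = 32
context_count = {}              # device id -> number of live nvenc contexts

def may_create_context(device_id):
    return context_count.get(device_id, 0) < MAX_CONTEXTS_PER_DEVICE

def context_created(device_id):
    context_count[device_id] = context_count.get(device_id, 0) + 1

def context_destroyed(device_id):
    context_count[device_id] = max(0, context_count.get(device_id, 0) - 1)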


totaam commented Nov 6, 2013

2013-11-06 02:14:28: totaam commented


Everything about nvenc is now on the wiki here


totaam commented Nov 10, 2013

2013-11-10 06:52:49: totaam commented


Updates:

  • "dealing with context limits" turned out to be quite complicated to implement (and is unfinished): there is no way of knowing in advance if we can create a context, and by the time we fail we have already destroyed the previous encoder, so failures are expensive. So we add some heuristics to try to prevent them: keep track of how many contexts we use and avoid logging errors in those cases (see r4706), and when we fail, lower the score further so nvenc becomes less likely to be chosen (see r4715). Ideally, we should keep the current encoder alive until the new one succeeds, so we can fall back to the old one if the new one fails...
  • "workaround slow encoder initialization": r4704: pre-initialize nvenc in a background thread on startup so that the first client does not get a long delay
  • cleanup: use a background worker thread (r4708) so that the encoding thread does not get long delays doing encoder cleanup (a single thread for now - though it should be easy to add more if needed)
  • "we should build the cuda kernel at build time": this is not possible as the compilation targets specific CUDA device(s) (...), so we pre-compile the CUDA CSC kernel in memory at runtime (r4703) from the new background init thread (see the sketch after this list):
compilation took 4124.1ms
  • client-side fix (needs backporting): r4717 and a sub-optimal workaround for it: r4722
  • try to honour speed and quality: r4716
  • refactorings: r4721, r4713, r4705, r4726, r4727
  • attempts at supporting YUV444P (not yet functional): r4728, r4731
  • keep track of device memory usage and use it as heuristic to decide if it is safe to create new contexts: r4729 + r4730
  • r4741 adds nvenc scaling via the cuda kernel (using ugly subsampling for now)
  • r4743 now allows us to handle windows bigger than 4096x4096 with video encoders (via automatic downscaling)
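
That runtime pre-compilation amounts to something like this pycuda sketch (the kernel body is omitted and the "BGRA_to_NV12" name is illustrative):

import pycuda.autoinit              # creates a CUDA context on the default device
from pycuda.compiler import SourceModule

KERNEL_SRC = """
__global__ void BGRA_to_NV12(const uchar4 *src, unsigned char *dst,
                             int w, int h, int dst_pitch)
{
    /* csc body omitted */
}
"""

# SourceModule shells out to nvcc, which is where the multi-second hit comes
# from - hence compiling once, from the background init thread, before any
# client connects
mod = SourceModule(KERNEL_SRC)
BGRA_to_NV12 = mod.get_function("BGRA_to_NV12")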


totaam commented Nov 11, 2013

2013-11-11 09:54:42: totaam changed status from assigned to new


totaam commented Nov 11, 2013

2013-11-11 09:54:42: totaam changed owner from antoine to afarr


totaam commented Nov 11, 2013

2013-11-11 09:54:42: totaam commented


More details were edited into comment:18; this is good enough for testing.

At this point the encoder should work and give us decent quality (we need YUV444P for best quality support) with much lower latency; it also supports efficient scaling.

Please test it and try to break it (please read [/wiki/Encodings/nvenc#UsingNVENC Using NVENC] first): try different resolutions, types of clients, etc.; measure fps, server load and bandwidth, with and without nvenc...
Be aware that only newer clients can take advantage of nvenc at present (r4722 needs backporting). It may also take a few seconds for nvenc to beat x264 in our internal scoring system, which decides the combination of encoder and csc modules to use.


Things that will probably be addressed in a follow-up ticket for the next milestone:

  • zero out the image padding since it does get encoded!
  • honouring max_block_sizes, max_grid_sizes and max_threads_per_block - doesn't seem to be causing problems yet
  • handle YUV444P mode - needs docs (apparently not supported by the hardware??)
  • handle resize without re-init
  • handle quality changes by swapping the kernel we use (NV12 / YUV444P)
  • handle speed/quality changes with nvEncReconfigureEncoder (with edge resistance if it causes a new IDR frame)
  • allocate memory when needed rather than keeping it allocated for the duration of the encoder (fit more encoders on one card)
  • upload pixels in place? (skip inputBuffer)

Lower priority still:

  • choose the cuda device using gpuGetMaxGflopsDeviceId: max_gflops = device_properties.multiProcessorCount * device_properties.clockRate;
  • handle other RGB modes in kernel (easy - allows us to run in big endian servers)
  • access nvenc encoder statistics info?
  • try using nvenc on win32 for shadow servers
  • when downscaling automatically (one of the dimensions is >4k), we don't need to downscale both dimensions by the same ratio: a very wide window could be downscaled horizontally only

Those have been moved to #466


totaam commented Nov 11, 2013

2013-11-11 10:18:45: totaam uploaded file nvcrash.txt (45.7 KiB)

this crash occurred as I killed xpra with SIGINT.. hopefully rare and due to SIGINT


totaam commented Feb 12, 2014

2014-02-12 19:21:00: smo changed status from new to closed


totaam commented Feb 12, 2014

2014-02-12 19:21:00: smo changed resolution from ** to fixed


totaam commented Feb 12, 2014

2014-02-12 19:21:00: smo commented


Tested and working with a Fedora 20 server, started with this command line:

LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/cuda/lib64:/usr/lib64/nvidia \
    xpra --bind-tcp=0.0.0.0:1400 --start-child="xterm -fg white -bg black" \
         --no-daemon --encryption=AES --password-file=./passtest start :14

totaam closed this as completed Feb 12, 2014

totaam commented Feb 13, 2014

2014-02-13 12:20:55: antoine commented


see #517


totaam commented Mar 29, 2015

2015-03-29 08:17:21: antoine commented


Note for those landing here: NVENC is not safe to use in versions older than 0.15 because of a context leak due to threading.
