- Kai Ninomiya (Arch Linux/Windows 8, Intel i5-4670, GTX 750)
- Escape: save image and exit program
- Space: save image early and continue rendering
- Arrow keys: rotate camera
- WASDRF keys: fly through space laterally (WASD) and vertically (RF) relative to the camera orientation
- Live display of render progress using OpenGL interop
- Configuration file reading
- Hemisphere sampling function (for diffuse)
- Objects: sphere
- Pathtracing algorithms
- Materials: diffuse, reflective, refractive with Fresnel reflection
- Camera: movement (controls above), antialiasing, depth of field
- Objects: cube, sphere with correct normals when scaled
- Performance: ray-level stream compaction
(Extras in bold.)
Combined test render (rendering time: 9 min on GTX 750):
In order to perform ray-level stream compaction, the rendering kernel had to be refactored into a single-ray step along the path. This adds significant overhead, primarily from performing stream compaction between every step. At low path depths (e.g. 4), performance is lower with stream compaction. However, stream compaction allows extremely high path depths (tested up to 1000) without significant performance degradation, because by then the vast majority of paths have terminated and dead paths no longer occupy kernel threads.
Path Depth | Before | After |
---|---|---|
2 | 33.20 ms | 49.86 ms |
4 | 66.54 ms | 83.20 ms |
8 | 116.55 ms | 116.56 ms |
16 | 216.54 ms | 133.20 ms |
64 | 327.1x ms | 149.90 ms |
256 | 418.xx ms | 183.22 ms |
1024 | 660.xx ms | 300.23 ms |
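Roughly, the refactored loop looks like the sketch below: trace one bounce, then compact terminated paths away before launching the next bounce. The `PathSegment` fields, the `traceOneBounce` stub, and the use of `thrust::partition` are assumptions for illustration, not the project's exact code.

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <thrust/partition.h>
#include <thrust/execution_policy.h>

// Hypothetical per-path state; the real struct's fields differ in detail.
struct PathSegment {
    glm::vec3 origin, direction, color;
    int pixelIndex;
    int remainingBounces;   // reaches 0 once the path has terminated
};

// Stand-in for the real kernel, which intersects the scene and shades
// one bounce of each live path.
__global__ void traceOneBounce(PathSegment* paths, int numPaths, int depth) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPaths && paths[i].remainingBounces > 0) {
        paths[i].remainingBounces--;
    }
}

struct IsAlive {
    __host__ __device__ bool operator()(const PathSegment& p) const {
        return p.remainingBounces > 0;
    }
};

// One iteration: trace one bounce at a time, compacting dead paths away
// so later bounces only launch as many threads as there are live paths.
void traceIteration(PathSegment* dev_paths, int numPaths, int maxDepth) {
    for (int depth = 0; depth < maxDepth && numPaths > 0; ++depth) {
        const int blockSize = 128;
        const int numBlocks = (numPaths + blockSize - 1) / blockSize;
        traceOneBounce<<<numBlocks, blockSize>>>(dev_paths, numPaths, depth);

        // Move live paths to the front; the dead tail no longer gets threads.
        PathSegment* liveEnd = thrust::partition(
            thrust::device, dev_paths, dev_paths + numPaths, IsAlive());
        numPaths = static_cast<int>(liveEnd - dev_paths);
    }
}
```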
This plot is remarkably uninteresting (beyond the expected low performance of block sizes that are not multiples of the warp size of 32). This is probably because shared memory was not used at all.
With a block size of 512, performance becomes marginally worse. This may be due to increased register pressure, since the number of registers per thread is high (111). It may also be due to caching effects: at this block size, a lot of data is loaded for each block, so the cache may not be able to hold everything effectively.
With a block size of 1024, the kernel does not run due to insufficient resources.
Block size | Time |
---|---|
16 | 399.9 ms |
32 | 266.5 ms |
33 | 416.5 ms |
48 | 316.5 ms |
64 | 266.5 ms |
65 | 349.3 ms |
80 | 299.9 ms |
96 | 266.5 ms |
128 | 266.5 ms |
256 | 266.5 ms |
512 | 271.3 ms |
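These limits can be sanity-checked against the kernel's resource usage: with 111 registers per thread, a 1024-thread block would need over 110K registers, which exceeds the roughly 64K registers available per block on this GPU and is consistent with the failed launch. The sketch below shows one way to query this at runtime; `shadeKernel` is just a stand-in for the actual render kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the project's render kernel.
__global__ void shadeKernel() {}

int main() {
    // Report the compiled kernel's per-thread register usage and the
    // largest block size its resource usage allows.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, shadeKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);

    // Ask the occupancy API which block size it would suggest.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, shadeKernel);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}
```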
Initially I implemented cube intersection naively, but this used many GPU registers and performed poorly. Rewriting it based on Kay and Kajiya's slab method improved performance significantly:
Naive | Slabs |
---|---|
102 reg | 95 reg |
83.21 ms | 66.54 ms |
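A minimal sketch of the slab test for a unit cube in object space follows; the function name and conventions are illustrative, not the project's exact code.

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <cfloat>
#include <cmath>

// Slab method (Kay/Kajiya): intersect the ray with the three pairs of
// axis-aligned planes bounding a unit cube centered at the origin
// (object space, extents [-0.5, 0.5] on each axis).
__host__ __device__ bool intersectUnitCube(
    glm::vec3 ro, glm::vec3 rd, float& tNear, float& tFar)
{
    tNear = -FLT_MAX;
    tFar  =  FLT_MAX;
    for (int axis = 0; axis < 3; ++axis) {
        // Entry/exit distances for this pair of slabs; rd[axis] == 0 gives
        // +/-inf, which the min/max comparisons handle for typical inputs.
        float invD = 1.0f / rd[axis];
        float t0 = (-0.5f - ro[axis]) * invD;
        float t1 = ( 0.5f - ro[axis]) * invD;
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        tNear = fmaxf(tNear, t0);
        tFar  = fminf(tFar, t1);
        if (tNear > tFar) return false;   // slab intervals don't overlap: miss
    }
    return tFar > 0.0f;  // reject boxes entirely behind the ray
}
```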
Rays are bounced randomly according to the provided hemisphere sampling function.
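The provided sampling function isn't reproduced here; the sketch below shows one common cosine-weighted formulation of hemisphere sampling, with illustrative names (the base code's version may differ in weighting and details).

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <cmath>

// Cosine-weighted sample of the hemisphere around a unit surface normal.
// u1, u2 are uniform random numbers in [0, 1).
__host__ __device__ glm::vec3 randomDirectionInHemisphere(
    glm::vec3 normal, float u1, float u2)
{
    float up = sqrtf(u1);            // cos(theta)
    float over = sqrtf(1.0f - u1);   // sin(theta)
    float around = u2 * 2.0f * 3.14159265358979f;

    // Build an arbitrary tangent frame around the normal, picking a helper
    // axis that is guaranteed not to be parallel to it.
    glm::vec3 axis = (fabsf(normal.x) < 0.577f)
        ? glm::vec3(1, 0, 0) : glm::vec3(0, 1, 0);
    glm::vec3 t1 = glm::normalize(glm::cross(normal, axis));
    glm::vec3 t2 = glm::cross(normal, t1);

    return up * normal + cosf(around) * over * t1 + sinf(around) * over * t2;
}
```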
Performance:
Before | After |
---|---|
55 reg | 103 reg |
16.55 ms | 66.55 ms |
Rays are always reflected perfectly.
Performance:
Before | After |
---|---|
103 reg | 102 reg |
66.55 ms | 66.55 ms |
Rays are always refracted perfectly. This required some additions to the intersection code.
Performance:
Before | After |
---|---|
102 reg | 101 reg |
66.55 ms | 99.88 ms |
Samples are taken randomly from within each pixel. See the Earlier Renders section for comparison.
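A minimal sketch of the per-pixel jitter, with illustrative names (the RNG plumbing here is an assumption; the project generates its random numbers per pixel and per iteration):

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <thrust/random.h>

// Jitter the sample position within the pixel instead of always shooting
// the camera ray through the pixel center.
__host__ __device__ glm::vec2 jitteredPixelSample(
    int x, int y, thrust::default_random_engine& rng)
{
    thrust::uniform_real_distribution<float> u01(0, 1);
    return glm::vec2((float)x + u01(rng), (float)y + u01(rng));
}
```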
Performance: Negligible impact.
Mean rendering time per iteration for an arbitrary example scene
Before | After |
---|---|
95 reg | 95 reg |
33.19 ms | 33.19 ms |
The origin and direction of camera rays are varied randomly (in a uniform circular distribution) to emulate a physical aperture.
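A sketch of the aperture sampling under these assumptions: a thin-lens-style model where the camera basis vectors, `apertureRadius`, and `focalDistance` are illustrative parameters, not the project's actual ones.

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <thrust/random.h>
#include <cmath>

// Pick a point on a circular aperture and re-aim the ray so it still
// passes through the focal plane.
__device__ void applyDepthOfField(
    glm::vec3& rayOrigin, glm::vec3& rayDir,
    glm::vec3 camRight, glm::vec3 camUp,
    float apertureRadius, float focalDistance,
    thrust::default_random_engine& rng)
{
    thrust::uniform_real_distribution<float> u01(0, 1);

    // Uniform sample on a disk (the aperture).
    float r = apertureRadius * sqrtf(u01(rng));
    float theta = 2.0f * 3.14159265358979f * u01(rng);
    glm::vec3 offset = r * (cosf(theta) * camRight + sinf(theta) * camUp);

    // Point the original ray would hit on the focal plane.
    glm::vec3 focalPoint = rayOrigin + focalDistance * rayDir;

    rayOrigin += offset;
    rayDir = glm::normalize(focalPoint - rayOrigin);
}
```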
Performance: Some per-sample impact from this feature, due to the additional random number generation and vector math for computing the perturbed rays. Depth of field also increases the number of samples needed for visual smoothness, due to the extreme variation between samples. Implementation-wise, this is identical to analogous CPU code.
Mean rendering time per iteration for an arbitrary example scene
Before | After |
---|---|
95 reg | 95 reg |
49.87 ms | 61.6x ms |
I used Schlick's approximation to compute the fraction of light that is reflected versus refracted, then used that fraction as the probability of reflecting the next path ray.
Reflection is implemented using glm::reflect. Refraction uses glm::refract and handles total internal reflection. Intersection code was modified to report whether the intersection was inside or outside the object, which allows correct handling of indices of refraction at interfaces. (This technically could have been done by adopting a different normal direction convention and checking dot products with that, but this is more readable.)
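A minimal sketch of this reflect/refract decision, assuming the intersection code supplies an `outside` flag and the material an index of refraction `ior`; the project's actual function and field names may differ.

```cuda
#include <cuda_runtime.h>
#include <glm/glm.hpp>
#include <thrust/random.h>
#include <cmath>

// Schlick's approximation of the Fresnel reflectance (evaluated with the
// incident cosine), followed by a random choice between reflecting and
// refracting with that probability. Returns the next ray direction.
__device__ glm::vec3 fresnelBounce(
    glm::vec3 dir, glm::vec3 normal, bool outside, float ior,
    thrust::default_random_engine& rng)
{
    // Index-of-refraction ratio depends on whether we hit the outside
    // or the inside of the object (surrounding medium assumed to be 1.0).
    float eta = outside ? 1.0f / ior : ior;
    glm::vec3 n = outside ? normal : -normal;

    float cosTheta = -glm::dot(dir, n);
    float r0 = (1.0f - eta) / (1.0f + eta);
    r0 = r0 * r0;
    float reflectProb = r0 + (1.0f - r0) * powf(1.0f - cosTheta, 5.0f);

    // glm::refract returns a zero vector on total internal reflection.
    glm::vec3 refracted = glm::refract(dir, n, eta);
    bool totalInternalReflection = glm::length(refracted) < 1e-6f;

    thrust::uniform_real_distribution<float> u01(0, 1);
    if (totalInternalReflection || u01(rng) < reflectProb) {
        return glm::reflect(dir, n);
    }
    return refracted;
}
```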
(See debug render in the Debug Renders section below.)
Performance: Some per-sample impact. This is probably due to the additional Fresnel factor computation and the random branch taken based on that factor.
Mean rendering time per iteration for an arbitrary example scene
Before | After |
---|---|
101 reg | 101 reg |
233.2 ms | 249.9 ms |
Keys for this are listed in the Keybindings section. Movement is implemented by simply modifying the camera's position and orientation, clearing the accumulated render, and starting again.
This is a minor thing, but I fixed the provided code to use the inverse transpose of the transformation matrix when calculating sphere normals, so the normals stay correct when spheres are scaled.
(Error image in bloopers.)
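For reference, a sketch of the normal transform described above (not the project's exact code):

```cuda
#include <glm/glm.hpp>
#include <glm/gtc/matrix_inverse.hpp>

// Transform an object-space normal to world space. Using the inverse
// transpose of the model matrix keeps normals perpendicular to the surface
// even under non-uniform scaling; multiplying by the model matrix directly
// does not. (invTranspose can be precomputed once per object.)
glm::vec3 transformNormal(const glm::mat4& model, glm::vec3 objectNormal) {
    glm::mat3 invTranspose = glm::mat3(glm::inverseTranspose(model));
    return glm::normalize(invTranspose * objectNormal);
}
```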
Before | After |
---|---|
101 reg | 111 reg |
250.2 ms | 266.5 ms |
Higher iteration counts always improved image smoothness, since more samples are averaged over time. Higher path depths seem to cause fireflies, which are slow to average out. This is probably because some pixels happen to randomly sample the light directly more often than their neighbors (despite the very low probability of doing so; e.g. 1 sample instead of 0, or 2 instead of 1).
With Direct Lighting, depth=8 (not included in final version):
With Direct Lighting, depth=1 (not included in final version):
Direct lighting lit areas (not included in final version):
Fresnel reflected-light factor (shown here for all reflective surfaces, but intended to be applied only to refractive surfaces):