workaround for an instruction possibly unsupported by ZLUDA

In theory this can affect performance, but no measurable difference was observed after this change. This is perhaps because the bottleneck is the transfers to and from global memory, so the computation time is masked by the transfers.

Closes GH-90
CESNET · Feb 15, 2024 · bdbe869
1 parent 698c1a1
Showing 1 changed file with 8 additions and 2 deletions.
diff --git a/src/gpujpeg_dct_gpu.cu b/src/gpujpeg_dct_gpu.cu
@@ -602,8 +602,14 @@ gpujpeg_idct_gpu_kernel(int16_t* source, uint8_t* result, int output_stride, uin
 		//cast float to uint8_t with saturation (.sat) which cuts values higher than 
 		//255 to 255 and smaller than 0 to 0; cuda can't use a reg smaller than 32b 
 		//(though it can convert to 8b for the saturation purposes and save to 32b reg)
-		uint32_t save;
-		asm("cvt.rni.u8.f32.sat	%0, %1;" : "=r"(save) : "f"(x[i] + ((float) 128.0)));
+		// uint32_t save;
+		// asm("cvt.rni.u8.f32.sat	%0, %1;" : "=r"(save) : "f"(x[i] + ((float) 128.0)));
+
+		// The following workaround enables GPUJPEG with ZLUDA (see GH-90). It may be
+		// slower, but the difference is not measurable, perhaps because the computation
+		// time is masked by global memory transfers.
+		int save = rintf(x[i] + 128.0F);
+		save = save < 0 ? 0 : save > 255 ? 255 : save;
 		((uint8_t*) tempResultP)[i] = save;
 	}
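
The replacement logic in this hunk (round to nearest, then clamp to 0..255) can be sketched as a standalone helper to make the intended behavior easier to check. This is only an illustration, not code from the commit: the helper name `saturate_to_u8` and the macro `SKETCH_HD` are hypothetical, and the sketch assumes the input is finite and well within the range of `int`, which should hold for IDCT output. It also assumes the default round-to-nearest-even mode, which matches the `.rni` modifier of the original PTX instruction.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical macro: lets the same helper compile as plain C++ and as CUDA C++.
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Hypothetical helper mirroring the two added lines of the diff:
// round to the nearest integer (ties to even), then saturate to [0, 255].
// Assumes v is finite and fits in an int after rounding.
SKETCH_HD inline uint8_t saturate_to_u8(float v)
{
    int save = static_cast<int>(rintf(v));
    save = save < 0 ? 0 : (save > 255 ? 255 : save);
    return static_cast<uint8_t>(save);
}
```

In the kernel, the call would correspond to something like `saturate_to_u8(x[i] + 128.0F)`. Values below 0 map to 0, values above 255 map to 255, and ties such as 2.5 round to the even neighbor, as with the `cvt.rni.u8.f32.sat` instruction it replaces.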