workaround for possibly unsupported instruction by ZLUDA
In theory this can have an impact on performance, but there was no measurable
difference after this change. This is perhaps because the bottleneck is the
transfers from/to global memory, so the computation time is masked by the
transfers.

closes GH-90
MartinPulec committed Feb 15, 2024
1 parent 698c1a1 commit bdbe869
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions src/gpujpeg_dct_gpu.cu
@@ -602,8 +602,14 @@ gpujpeg_idct_gpu_kernel(int16_t* source, uint8_t* result, int output_stride, uin
//cast float to uint8_t with saturation (.sat) which cuts values higher than
//255 to 255 and smaller than 0 to 0; cuda can't use a reg smaller than 32b
//(though it can convert to 8b for the saturation purposes and save to 32b reg)
uint32_t save;
asm("cvt.rni.u8.f32.sat %0, %1;" : "=r"(save) : "f"(x[i] + ((float) 128.0)));
// uint32_t save;
// asm("cvt.rni.u8.f32.sat %0, %1;" : "=r"(save) : "f"(x[i] + ((float) 128.0)));
// The following workaround enables GPUJPEG with ZLUDA (see GH-90). It may be
// slower, but no difference was measurable, perhaps because the computation
// time is masked by global memory transfers.
int save = rintf(x[i] + 128.0F);
save = save < 0 ? 0 : save > 255 ? 255 : save;
((uint8_t*) tempResultP)[i] = save;
}
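
The replacement above can be sketched in isolation as a small helper, to make the intended behaviour easier to check. This is only an illustration, not code from the commit: the names `idct_sample_to_u8`, `example_checks` and the `HOST_DEVICE` macro are hypothetical, and the sketch assumes the default round-to-nearest-even mode, which is what `cvt.rni` requests in the original inline PTX.

```cpp
#include <assert.h>
#include <math.h>
#include <stdint.h>

// Hypothetical macro: lets the same helper compile with nvcc (device + host)
// or with a plain C++ compiler (host only).
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper mirroring the workaround in the diff: add the +128
// level shift, round to the nearest integer (ties to even under the default
// rounding mode), then clamp to the uint8_t range [0, 255].
HOST_DEVICE inline uint8_t idct_sample_to_u8(float x)
{
    int save = (int) rintf(x + 128.0F);
    save = save < 0 ? 0 : save > 255 ? 255 : save;
    return (uint8_t) save;
}

// Host-side sanity checks for the rounding and saturation behaviour.
inline void example_checks()
{
    assert(idct_sample_to_u8(0.0F) == 128);    // level shift only
    assert(idct_sample_to_u8(-200.0F) == 0);   // saturates at the low end
    assert(idct_sample_to_u8(200.0F) == 255);  // saturates at the high end
    assert(idct_sample_to_u8(0.5F) == 128);    // 128.5 ties to even -> 128
    assert(idct_sample_to_u8(1.5F) == 130);    // 129.5 ties to even -> 130
}
```

Calling `example_checks()` from a host test is one way to confirm that the portable path rounds and saturates as the comments describe. It does not by itself verify bit-exact agreement with the PTX instruction for special inputs such as NaN.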
