is it possible c-ocl_*_win64 #363

kh-abd-kh · 2022-11-27T22:20:56Z

Hi, thank you for your nice work.

For me and many others, OpenCL doesn't work on WSL2.

So I am wondering whether your c-rocm_win64 works for you but not for me, as in
#269
#284

I am wondering whether it is possible to extend antares with opencl_*_win64,
using OpenCL from Windows, which definitely works fine for my AMD and NVIDIA GPUs.

If possible, it would be great and would help many people work around the OpenCL/WSL2 issues.

ghostplant · 2022-11-28T07:21:14Z

I'm afraid I don't currently have such an OCL-based Windows platform. Can you explain more about your OCL environment? E.g., for NVIDIA/AMD GPUs, do they link to the same libocl.dll and use shared OCL header files?

ghostplant · 2022-11-28T07:22:28Z

BTW, we recommend you try hlsl_win64, since it is DirectX-based and works for AMD GPUs, NVIDIA GPUs, and even Intel GPUs.

kh-abd-kh · 2022-11-28T09:17:14Z

BACKEND=c-hlsl_win64 antares
didn't work for me. It pops up a message (since I don't know how to include a png/jpg, I will copy it):
title: evaluator.c-hlsl_win64 - Bad Image
Error1:
\wsl.localhost\Ubuntu\home\mabd.cahe\antares\dxompiler.dll is either not designed to run on Windows or
it contains an error. Try installing the program again using the original installation media or contact your system administrator
or the software vendor for support. Error status 0x000007b.
Error2:
[EvalAgent] Evaluating Modules .. (for backend = c-hlsl_win64)
[CheckFail] Failed to load .\antares_hlsl_v0.3.2_x64.dll, please download these libraries first!
[EvalAgent] Results = {}
[Antares] Incorrect compute kernel from evaluator.

To be honest, I know only a little about DirectX; I used it a long time ago.
As for OpenCL, people now use PyOpenCL or PyCUDA.
But I understand that is not what you need.

You need the "old" OpenCL SDK from NVIDIA (~2012) and/or the last AMD OpenCL SDK (~2013-2014), as far as I remember.

Here is the one from AMD:

https://community.amd.com/t5/opencl/where-can-i-get-an-older-version-of-amd-app-sdk/m-p/167540#M14714

It should work for NVIDIA and AMD GPUs. You will find the headers and the libraries inside. I haven't touched
this for years.

Here is the latest one from AMD, with just some release info:

https://github.com/GPUOpen-LibrariesAndSDKs/OCL-SDK/releases

ghostplant · 2022-11-28T10:05:41Z

Wow, that is definitely not expected. First, you need a fully updated Windows 10 or Windows 11 (64-bit). Then, you may be suffering from a broken download of antares_hlsl_v0.3.2_x64.dll (blocked by a firewall?), so you need to run antares clean in WSL to reset all settings, then try executing the command BACKEND=c-hlsl_win64 antares to see if it works now.

ghostplant · 2022-11-28T10:08:01Z

BTW, if you installed the latest AMDGPU drivers, why doesn't your system have amdhip64.dll installed, which would enable c-rocm_win64?

kh-abd-kh · 2022-11-28T10:37:13Z

OK, we will be distracted by two different parallel problems. Maybe the DirectX one can be handled in another thread.
For the AMDGPU, Windows (Win 11) is not the problem. I have a Ryzen 7 4800H; using clinfo from Windows, I can see that my "built-in" AMDGPU/Renoir is called gfx902.
And actually, the only way, so far, that I can "program" it is using pyopencl and running the old
OpenCL/AMD. It works fine, and it even works fine with my second GPU, an NVIDIA 1660Ti/Turing "stripped" with 6GB RAM.
That is amazing from Windows 11.

On WSL, with Ubuntu 18 or the newer Ubuntu 20, I installed OpenCL but it doesn't work.
I installed different ROCm versions, but they are not working, and it is never said explicitly on the AMD site
that they support Renoir/Ryzen 7 on Ubuntu, not to mention Win11/ROCm (for programming). But it works on Windows.

Now, I am a scientist/mathematician, and I am calculating trillions of terms. I can offload to my CPU and NVIDIA GPU
together over WSL (this time nvc++/HPC/NVIDIA is only on Ubuntu/WSL, not Windows).

So, I have the AMDGPU working in Win 11 and I am offloading to my CPU/GPU (NVIDIA) over WSL. I want to let them talk.

PS: if you convert to nvc++/NVIDIA, you can offload to the CPU and NVIDIA cards simultaneously with the same C code;
when compiling, you just add -acc=gpu or -acc=multicore, then build a .so and fork it from C code. But over WSL.

Thank you really for antares; making it work for OpenCL will solve a lot of problems for many different
people.

PS: it was a nice trick to use mingw/WSL/Ubuntu to call .dll libraries for amdhip64.dll. I like it; I didn't know it.

kh-abd-kh · 2022-11-30T01:53:20Z

For hlsl:

I applied your reg file antares_hlsl_tdr_v0.1.reg,
then it showed
Download Microsoft DirectX Shader Compiler 6 ...
and now it works.
Interestingly, it is using the AMD/Renoir, not the NVIDIA/1660Ti.
So I should now learn what HLSL is.

kh-abd-kh · 2022-12-01T10:34:46Z

I did it.
Now I can offload to my Renoir/Ryzen 7 from Win11 using opencl/pyopencl,
or from WSL using cross-platform opencl/HLSL.
That's great.

By the way, you have assumed that there is only one WSL distribution. There can be many, so for antares to work one should
choose the correct one as the "default".

ghostplant · 2022-12-01T16:34:03Z

Congratulations. Many of our previous investigations show that HLSL can be as efficient as OpenCL, and it has a standard interface defined by Windows that covers all graphics GPUs, as long as you install the graphics drivers.

If your machine has both the AMD/Renoir and the nVidia/1660TI, and hlsl works for AMD, a feasible way to make it use the NVIDIA GPU instead is to disable the AMD graphics device in "Windows Device Manager", although this may not be what you want if you want to use them simultaneously.

ghostplant · 2022-12-01T17:54:31Z

Can you give this a try?

$ pip3 install --upgrade antares==0.3.20.12
$ antares clean
$ DEVICE_ID=0 STEP=100 antares     # this should use one of the GPUs, maybe AMD
$ DEVICE_ID=1 STEP=100 antares     # this should use the other GPU, maybe NVIDIA

kh-abd-kh · 2022-12-01T23:59:06Z

Very good.

OK, I am learning HLSL now.
Now I have upgraded.
DEVICE_ID=0 STEP=100 antares # it uses the NVIDIA 1660Ti
$ DEVICE_ID=0 STEP=100 antares

Backend = c-hlsl_win64, Python PID = 1757, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})

[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', tpr = 0.000931, digest = 1.504583e+09, mem_occupy = -1 %
....
[] Param_entity on sid = 100: config = '{"Foutput0:D0": [-1, 2, 8, 8], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [0, 1], "Foutput0:S": 2, "Foutput0:R": 1}', tpr = 0.000035, digest = 1.504583e+09, mem_occupy = -1 %

========================================================================================================================

STEP[100 / 100] Current Best Config = {"Foutput0:D0": [-1, 2, 4, 4], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}, Perf = 3.01407e-05 sec / op (17.3947 Gflops), MemRatio = -1 %, Occur Step = 29;

========================================================================================================================

[Best Config] CONFIG='{"Foutput0:D0": [-1, 2, 4, 4], "Foutput0:D1": [-1, 1, 16, 1], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}' ==> Performance is up to 17.394686 Gflops, occurred at step 29 / 100; time per run = 3.01407e-05 sec.

DEVICE_ID=1 STEP=100 antares # it uses the AMD
$ DEVICE_ID=1 STEP=100 antares

Backend = c-hlsl_win64, Python PID = 25143, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})

[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s
[] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', tpr = 0.004155, digest = 1.504583e+09, mem_occupy = -1 %
.....
[] Param_entity on sid = 100: config = '{"Foutput0:D0": [-1, 4, 2, 8], "Foutput0:D1": [-1, 2, 8, 4], "Foutput0:O": [1, 0], "Foutput0:S": 2, "Foutput0:R": 0}', tpr = 0.000224, digest = 1.504583e+09, mem_occupy = -1 %

========================================================================================================================

STEP[100 / 100] Current Best Config = {"Foutput0:D0": [-1, 2, 4, 32], "Foutput0:D1": [-1, 1, 16, 2], "Foutput0:O": [1, 0], "Foutput0:S": 4, "Foutput0:R": 1}, Perf = 0.000184122 sec / op (2.8475 Gflops), MemRatio = -1 %, Occur Step = 90;

========================================================================================================================

[Best Config] CONFIG='{"Foutput0:D0": [-1, 2, 4, 32], "Foutput0:D1": [-1, 1, 16, 2], "Foutput0:O": [1, 0], "Foutput0:S": 4, "Foutput0:R": 1}' ==> Performance is up to 2.847503 Gflops, occurred at step 90 / 100; time per run = 0.000184122 sec.

Just out of curiosity, I tried:

$ DEVICE_ID=2 STEP=2 antares

Backend = c-hlsl_win64, Python PID = 32252, Task = lang.generic;
MAKE_PARA = 1/16, EXEC_PARA = 1, TUNER = OpEvo
COMPUTE_V1 = - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})

[ ] Param_entity on sid = 1: config = '{"Foutput0:D0": [-1, 1, 1, 1], "Foutput0:D1": [-1, 1, 1, 1], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 0}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s

[Antares] Incorrect compute kernel from evaluator.

[ ] Param_entity on sid = 2: config = '{"Foutput0:D0": [-1, 8, 16, 1], "Foutput0:D1": [-1, 4, 2, 4], "Foutput0:O": [0, 1], "Foutput0:S": 0, "Foutput0:R": 1}', dev_id = 0, upper_bound_tpr = 3.000000e+01 s

[Antares] Incorrect compute kernel from evaluator.

Maybe it would be better to report "Incorrect DEVICE_ID" in this case.

kh-abd-kh · 2022-12-02T00:01:37Z

But I am afraid this is very low in GFlops; usually I can get some TFlops.
How can I try something more complicated (** = power, as in Python), using complex numbers (I is the imaginary unit):
sum_{n=1}^{n=1000} sum_{m=1}^{m=1000} ( 1/ ( (m + I*n) ** 4 ) )
or at least its real part
( (m ** 4 - 6. * m ** 2 * n ** 2 + n ** 4) / ( (m ** 2 + n ** 2) ** 4 ) )
This is called the lemniscate zeta. Even for up to 1000000 it can be done in less than 1 min (c+omp, cython+omp, numba ...).
Gauss calculated it ~1795 by hand, but he used some tricks (the AGM = arithmetic-geometric mean).
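
As a CPU cross-check of the two expressions above, the following NumPy sketch evaluates both the complex form and the real-part form over the same grid and compares them. It is only a reference computation, assuming double precision and NumPy broadcasting are acceptable here; the function names are illustrative, and this is not an antares expression.

```python
import numpy as np


def lemniscate_sum_complex(N):
    """Sum of 1 / (m + I*n)**4 over 1 <= m, n <= N, using complex arithmetic."""
    m = np.arange(1, N + 1, dtype=np.float64)[None, :]  # row vector of m values
    n = np.arange(1, N + 1, dtype=np.float64)[:, None]  # column vector of n values
    z = m + 1j * n                                      # broadcasts to an N x N grid
    return (1.0 / z**4).sum()


def lemniscate_sum_real(N):
    """Sum of (m^4 - 6 m^2 n^2 + n^4) / (m^2 + n^2)^4 over 1 <= m, n <= N."""
    m = np.arange(1, N + 1, dtype=np.float64)[None, :]
    n = np.arange(1, N + 1, dtype=np.float64)[:, None]
    m2, n2 = m * m, n * n
    numerator = m2 * m2 - 6.0 * m2 * n2 + n2 * n2
    denominator = (m2 + n2) ** 4
    return (numerator / denominator).sum()


if __name__ == "__main__":
    N = 1000
    zc = lemniscate_sum_complex(N)
    zr = lemniscate_sum_real(N)
    print("complex sum  =", zc)
    print("real formula =", zr)
    print("difference   =", abs(zc.real - zr))
```

Because the grid is symmetric under swapping m and n, the imaginary parts cancel in pairs, so the complex sum should come out essentially real.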

ghostplant · 2022-12-02T05:14:40Z

That is because the default computation is elementwise, which is a memory-bound operation. If you want to test how high it can reach in TFlops, you'd better try a large GEMM.
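
A rough back-of-the-envelope check of this point, using only the shape and timings printed in the logs earlier in this thread: an elementwise add performs one floating-point operation per output element but has to move three float32 values, so the reported Gflops are bounded by memory traffic. The sketch below recomputes the reported Gflops figures and the implied memory throughput, then compares the arithmetic intensity with that of a square GEMM. The GEMM size of 4096 is an arbitrary illustration, and the byte counts assume each array is read or written exactly once.

```python
def elementwise_add_stats(rows, cols, seconds, bytes_per_elem=4):
    """Throughput figures for output0 = input0 + input1 on a rows x cols array."""
    elems = rows * cols
    flops = elems                               # one add per output element
    bytes_moved = 3 * elems * bytes_per_elem    # read input0, read input1, write output0
    return {
        "gflops": flops / seconds / 1e9,
        "gbytes_per_s": bytes_moved / seconds / 1e9,
        "flops_per_byte": flops / bytes_moved,
    }


def gemm_flops_per_byte(n, bytes_per_elem=4):
    """Arithmetic intensity of an n x n x n GEMM, assuming each matrix moves once."""
    flops = 2 * n**3                            # one multiply and one add per inner step
    bytes_moved = 3 * n * n * bytes_per_elem    # read A, read B, write C
    return flops / bytes_moved


if __name__ == "__main__":
    # Timings taken from the "time per run" lines in the two logs above.
    for label, seconds in (("DEVICE_ID=0", 3.01407e-05), ("DEVICE_ID=1", 0.000184122)):
        stats = elementwise_add_stats(1024, 512, seconds)
        print(label, stats)
    print("GEMM 4096 flops/byte =", gemm_flops_per_byte(4096))
```

The elementwise add has an intensity of 1/12 flop per byte, while the GEMM intensity grows linearly with the matrix size, which is why a large GEMM is the better probe of peak compute.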

kh-abd-kh · 2022-12-02T12:31:16Z

OK, the main power of the GPU is parallel arrays. This is how I discovered it. Something that can take hours in Mathematica (Parallel) can be done in milliseconds on the GPU.
Here is a very simple example in pycuda.

I initialize an empty array and copy it to the GPU, compute my zeta on the GPU while timing it,
then copy it back to the CPU and sum it on the CPU, just to prove that the sum is not problematic.

The main point is that generating the whole zeta at once on the GPU takes milliseconds, actually sometimes less
(0.0009 sec), while trying to do this on the CPU is meaningless; it takes hours. Of course, the standard way is an
accumulative sum, which is fast, and in parallel (16 threads) it is even faster than an accumulative sum on the GPU.
But generating the whole zeta can be useful, for example for investigating hidden symmetry action and "connical"
distribution and so on.

Here we go:

=============================

import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (imported for its side effect: creates the CUDA context)
from pycuda.compiler import SourceModule
from time import time as _time
import numpy

nn = 32        # block edge: each block has nn x nn threads
gd = 512       # the grid has gd * gd blocks
dd = nn * gd   # array edge: 32 * 512 = 16384

# Allocate the host array and a device buffer of the same size in bytes.
a = numpy.empty((dd, dd), dtype=numpy.float64)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)

mod = SourceModule("""
__global__ void zetac(double *a)
{
    // Flat index of this thread; each block covers 32 * 32 consecutive elements.
    int id = blockIdx.x * 32 * 32 + threadIdx.y * 32 + threadIdx.x;
    // Note: x and y are derived with a divisor of 32 * 496, while the host array is 32 * 512 wide.
    int y = id / (32 * 496) + 1;
    int x = id % (32 * 496) + 1;
    double kk1 = double(x) * double(x);
    double kk2 = double(y) * double(y);
    // Real part of 1 / (x + I*y)^4.
    a[id] = ((kk1 - kk2) * (kk1 - kk2) - 4. * kk1 * kk2) / (double(x * x + y * y) * (x * x + y * y) * (x * x + y * y) * (x * x + y * y));
}
""")

tt1 = _time()
func = mod.get_function("zetac")
func(a_gpu, block=(nn, nn, 1), grid=(gd * gd, 1, 1), shared=0)
tt2 = _time()
# Kernel launches are asynchronous, so this interval may not include the full kernel run.
print("func gpu time is =", tt2 - tt1)
cuda.memcpy_dtoh(a, a_gpu)
tt3 = _time()
print("memcpy time is =", tt3 - tt2)

print("last element =", a[nn * gd - 1, nn * gd - 1])
print("dim =", a.shape)

tt4 = _time()
print("sum =", a.sum())
tt5 = _time()
print("sum time over the cpu = ", tt5 - tt4)

=================================

$ python3 ztc2.py
func gpu time is = 0.002030611038208008
memcpy time is = 0.4167020320892334
last element = -1.854725953404076e-18
dim = (16384, 16384)
sum = -0.29452023348172157
sum time over the cpu = 0.09859681129455566

Of course I can do better, if it is only for the sum, in many different ways (numba, cython, C, ...),
but this is just to make the point: generating a parallel array computationally on the GPU is really fantastic.

Can you compare this with the different antares backends?!
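
For reference, the kernel's index arithmetic can be replayed on the CPU for a single element. The sketch below mirrors the expressions in zetac for one flat index, taking the divisor 32 * 496 exactly as it appears in the kernel source; the helper name and its `stride` parameter are illustrative additions, not part of the posted script. Given the flat index of the last element of the 16384 x 16384 array, it should produce a value close to the "last element" printed above, and it also shows which (x, y) pair that element corresponds to.

```python
def zetac_element(flat_id, stride=32 * 496):
    """Replay the zetac kernel's arithmetic for one flat array index."""
    y = flat_id // stride + 1
    x = flat_id % stride + 1
    kk1 = float(x) * float(x)
    kk2 = float(y) * float(y)
    s = float(x * x + y * y)
    # Same expression as the kernel: real part of 1 / (x + I*y)^4.
    value = ((kk1 - kk2) * (kk1 - kk2) - 4.0 * kk1 * kk2) / (s * s * s * s)
    return x, y, value


if __name__ == "__main__":
    dd = 32 * 512          # array edge used by the host code
    last = dd * dd - 1     # flat index of a[dd - 1, dd - 1]
    print("divisor 32 * 496:", zetac_element(last))
    print("divisor 32 * 512:", zetac_element(last, stride=dd))
```

Passing the array width (32 * 512) as the stride shows what the last element would be if the divisor matched the array shape; whether 496 is intended is not stated in the post.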

kh-abd-kh · 2022-12-02T12:35:58Z

I hope the editor didn't mess up my code. I corrected it 3 times.
