samples of the Gpu Speed, #15

Closed
dextronomous opened this issue Apr 26, 2021 · 5 comments

@dextronomous

Guys, I would really like to know what GPU speed you are getting:
who is using it, and what is your average speed? Maybe post a screenshot.
I would like to get more juice out of this one. Thanks a lot.

@trextrader

trextrader commented Apr 29, 2021

Interesting request. I am running this on Ubuntu 20.04.

Curiously, I am getting better compute performance from my Ryzen CPU than from an Nvidia RTX 2080 Ti using OpenCL.

Something is wrong here: in any other case, whether it is a CUDA kernel, OpenCL on the GPU, or OpenCL on the CPU, the accelerated path typically smokes the plain (non-OpenCL) CPU in compute-intensive tasks. As you will see, that is not the case here.

I am not sure, but it would seem the dev needs to dive in and take a closer look. Below are some copy/pastes showing the performance of ./oclvan... vs. ./van... One possibility I considered is that a setting is below the compute max for my GPU, which is 75, but that applies only to CUDA kernels, not OpenCL. I am also using the pocl OpenCL ICD for the AMD Ryzen, yet the non-OpenCL build on the Ryzen CPU is still crushing the OpenCL build.

If I saw this behavior in other applications like bitcrack or btcrecover, I would suspect an issue on my end, but that is not the case. So we can postulate that the OpenCL performance issue is on the dev side, unless this source needs some specific configuration, some compiler flags, or perhaps a different version of g++. I have noticed in other sources that revisions of openssl, among other things, can significantly affect performance, but I have not had the time to investigate this further, and my current revisions of openssl, g++, nvcc (for CUDA), etc. show a dramatic increase in performance for other sources.

Any help from the developer on this would be appreciated, as the performance difference between ./oclvan and ./van is substantial, in favor of ./van by almost 30X (it should be the opposite).


For OpenCL using BOTH GPU && CPU: ./oclvan...

./oclvanitygen++ -F compressed -k -i 19y73CRa -o 19y73CRa.txt
Difficulty: 54586762088
Device: pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor
Vendor: AuthenticAMD (1022)
Driver: 1.6
Profile: FULL_PROFILE
Version: OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-znver1
Max compute units: 48
Max workgroup size: 4096
Global memory: 132937482240
Max allocation: 34359738368
OpenCL compiler flags: -DDEEP_PREPROC_UNROLL -DCOMPRESSED_ADDRESS
Loading kernel binary f14c319a9c88570c9c5466c73a8d6c03.oclbin
Grid size: 192x128
Modular inverse: 48 threads, 512 ops each
Available OpenCL platforms:
0: [NVIDIA Corporation] NVIDIA CUDA
0: [NVIDIA Corporation] GeForce RTX 2080 Ti, endian little: true
1: [The pocl project] Portable Computing Language
0: [AuthenticAMD] pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor, endian little: true
Using OpenCL prefix matcher
[508.41 Kkey/s][total 65470464][Prob 0.1%][50% in 20.6h]
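
A quick way to check which device a run like the one above actually used is to compare the `Device:` line with the platform list printed in the same log. The snippet below is only a sketch in Python: the helper names `selected_device` and `is_gpu_run` are hypothetical, and the log text is an excerpt of the output pasted above. For this log, the reported device is the pocl CPU device rather than the GeForce RTX 2080 Ti.

```python
# Sketch: find the device an oclvanitygen++ run reports in its log.
# The log excerpt is copied from the output above; helper names are hypothetical.
LOG = """\
Device: pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor
Vendor: AuthenticAMD (1022)
Max compute units: 48
Available OpenCL platforms:
0: [NVIDIA Corporation] NVIDIA CUDA
0: [NVIDIA Corporation] GeForce RTX 2080 Ti, endian little: true
1: [The pocl project] Portable Computing Language
0: [AuthenticAMD] pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor, endian little: true
"""


def selected_device(log: str) -> str:
    """Return the text after 'Device:' on the first such line."""
    for line in log.splitlines():
        if line.startswith("Device:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no 'Device:' line found")


def is_gpu_run(log: str, gpu_name: str) -> bool:
    """True if the reported device name contains gpu_name."""
    return gpu_name in selected_device(log)


device = selected_device(LOG)
print(device)
print(is_gpu_run(LOG, "GeForce RTX 2080 Ti"))
```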

For AMD Ryzen CPU with threadcount set to 44 of 48: ./van...

./vanitygen++ -F compressed -k -i 19y73CRa -o 19y73CRa.txt -t 44
Difficulty: 54586762088
Using 44 worker thread(s)
[15.22 Mkey/s][total 648931798][Prob 1.2%][50% in 40.7min]


Some simple math:

15,220,000 / 508,410 = 29.94, so approximately 30X faster than the OpenCL implementation.
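
The ratio above, and the "50% in ..." figures in the two logs, can be reproduced with a few lines of Python. This is a sketch under an assumed model, not the tool's own code: each tested key is treated as an independent trial with success chance 1/difficulty, so the 50% point falls at roughly difficulty × ln 2 keys. The function names are hypothetical; the rates and difficulty are taken from the logs above.

```python
import math


def speedup(fast_keys_per_s: float, slow_keys_per_s: float) -> float:
    """How many times faster the first rate is than the second."""
    return fast_keys_per_s / slow_keys_per_s


def seconds_to_half_chance(difficulty: int, keys_per_s: float) -> float:
    """Seconds until a 50% chance of a match.

    Assumed model: P(hit within n keys) is about 1 - exp(-n / difficulty),
    which reaches 0.5 at n = difficulty * ln(2).
    """
    return difficulty * math.log(2) / keys_per_s


DIFFICULTY = 54586762088   # from both logs
CPU_RATE = 15.22e6         # ./vanitygen++ with -t 44
OCL_RATE = 508.41e3        # ./oclvanitygen++

ratio = speedup(CPU_RATE, OCL_RATE)
ocl_hours = seconds_to_half_chance(DIFFICULTY, OCL_RATE) / 3600
cpu_minutes = seconds_to_half_chance(DIFFICULTY, CPU_RATE) / 60

print(round(ratio, 2))  # 29.94
print(round(ocl_hours, 1), "hours")
print(round(cpu_minutes, 1), "minutes")
```

Under this model the estimates land within a few percent of the 20.6h and 40.7min shown in the logs; the small gap is plausible given that the reported rate fluctuates during a run.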

Typically it is the other way around, so the expected combined speed of the GPU and the CPU using 48 threads should be somewhere between 500 MKeys/sec and 1 BKeys/sec. This is in line with the performance I get from bitcrack and other sources using OpenCL. CUDA smokes OpenCL; however, this source does not include a make target for a CUDA kernel, only OpenCL. Perhaps a CUDA port of the source would be beneficial for the developer to offer.


Output of clinfo:

┌──(Luv㉿To-Phuk-Hard)-[~]
└─$ clinfo
Number of platforms 2
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 11.2.162
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid
Platform Extensions function suffix NV

Platform Name Portable Computing Language
Platform Vendor The pocl project
Platform Version OpenCL 1.2 pocl 1.6, None+Asserts, LLVM 9.0.1, RELOC, SLEEF, DISTRO, POCL_DEBUG
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix POCL

Platform Name NVIDIA CUDA
Number of devices 1
Device Name GeForce RTX 2080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Device UUID 93a045ae-970c-69e2-0f24-e47aa21ab072
Driver UUID 93a045ae-970c-69e2-0f24-e47aa21ab072
Valid Device LUID No
Device LUID 0000-000000000000
Device Node Mask 0
Driver Version 460.73.01
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 0000:42:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 68
Max clock frequency 1755MHz
Compute Capability (NV) 7.5
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple (kernel) 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11553341440 (10.76GiB)
Error Correction support No
Max memory allocation 2888335360 (2.69GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 2228224 (2.125MiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 268435456 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 32768x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 32
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 3
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid

Platform Name Portable Computing Language
Number of devices 1
Device Name pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor
Device Vendor AuthenticAMD
Device Vendor ID 0x1022
Device Version OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-znver1
Driver Version 1.6
Device OpenCL C Version OpenCL C 1.2 pocl
Device Type CPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 48
Max clock frequency 3000MHz
Device Partition (core)
Max number of sub-devices 48
Supported partition types equally, by counts
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 4096x4096x4096
Max work group size 4096
Preferred work group size multiple (kernel) 8
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 8 / 8
long 4 / 4
half 0 / 0 (n/a)
float 8 / 8
double 4 / 4 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 132937482240 (123.8GiB)
Error Correction support No
Max memory allocation 34359738368 (32GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8388608 (8MiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 2147483648 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 32768x32768 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 128
Local memory type Global
Local memory size 524288 (512KiB)
Max number of constant args 8
Max constant buffer size 524288 (512KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
printf() buffer size 16777216 (16MiB)
Built-in kernels (n/a)
Device Extensions cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64

NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) NVIDIA CUDA
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [NV]
clCreateContext(NULL, ...) [default] Success [NV]
clCreateContext(NULL, ...) [other] Success [POCL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform

ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.14
ICD loader Profile OpenCL 3.0


Hope this helps with the benchmarking analysis request you have posted...

/Zap

@dextronomous
Author

dextronomous commented May 4, 2021

Difficulty: 873388193410
[139.00 Mkey/s][total 1056964608][Prob 0.1%][50% in 1.2h]
1660ti 6gb, 470.25

@dextronomous
Author

Hi there 10gic, I need you here. In JeanLucPons/VanitySearch#52 (comment) you said it does 256-64. What is that doing? Isn't that 192 when subtracted, or am I reading it wrong? How can I use a keyspace here? With -Z and all 0000s I get lost: with -l 64 it needs -Z, but how many 0s do I need? Is that a 64-bit range only? Thanks again. "$((256-20))" is equal to 236 in bash; I got stuck with this one.

@10gic
Owner

10gic commented May 6, 2021

-Z <prefix> specifies the private key prefix in hex. Here is an example:

$ ./oclvanitygen++ -v -Z AAAA 1
Pattern: 1
Pubkey (hex): 046e2b135f2324b89f555db33f9d3da6ec40bf8ca32f845528872b28409f8d5ace4dd5c6536447a885403dd4a05026e329e5281fcaf79e69e407755c2810ec5fb3
Privkey (hex): AAAA44B33B39E388F0F8075087D9A3F0F11EDC66B3EEE15F8873D5295E8D52E9
Privkey (ASN1): 30740201010420aaaa44b33b39e388f0f8075087d9a3f0f11edc66b3eee15f8873d5295e8d52e9a00706052b8104000aa144034200046e2b135f2324b89f555db33f9d3da6ec40bf8ca32f845528872b28409f8d5ace4dd5c6536447a885403dd4a05026e329e5281fcaf79e69e407755c2810ec5fb3
Address: 1FUR5pK2aBeyD4WxK1rAHZUKX65E5zRYC8
Privkey: 5K7SzXimbKtFy4hYyGdZZsFq46wXHNCgcKzytMt7Lz9csnMt4B4

Note that the private key starts with AAAA (hex), or 1010 1010 1010 1010 (binary). If we want the private key to start with 1010 1010 1010 10, we can add -l 14, which means only the first 14 bits of AAAA are considered. Here is an example:

$ ./oclvanitygen++ -v -Z AAAA -l 14 1
Pattern: 1
Pubkey (hex): 04b7e45e88c0b658e722aa0d587cc2edfa8e35661865bbd18e0225f25481d4ad55411ef1c1fdf9b11641c3c436f7f78e5b6fb1b59283571f290cba7a50c25f938f
Privkey (hex): AAABE881463DE30517B0DD1C332A02F10101B5C07243B668613751B2F47748A4
Privkey (ASN1): 30740201010420aaabe881463de30517b0dd1c332a02f10101b5c07243b668613751b2f47748a4a00706052b8104000aa14403420004b7e45e88c0b658e722aa0d587cc2edfa8e35661865bbd18e0225f25481d4ad55411ef1c1fdf9b11641c3c436f7f78e5b6fb1b59283571f290cba7a50c25f938f
Address: 1PkNwvqKGZeJzspBXUVCQgKxHQGTFhQQWh
Privkey: 5K7TA2A3KqB9meivcTqGmi3GoeE7rRvRwrjXNcoWAp79ASE9mK6

If we run ./oclvanitygen++ -v -Z AAAA -l 14 1 multiple times, we can get private keys that start with:
AAA8
AAA9
AAAA
AAAB
AAAC
AAAD
AAAE
AAAF

As the private key is up to 64 hex characters (256 bits), if you want to solve the puzzle, you can just set -Z 0000000000000000000000000000000000000000000000000000000000000000, 64 zeroes (hex format).
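
The hex-to-bits bookkeeping described above can be sketched in Python. This is plain bit arithmetic and not the tool's own code: `fixed_prefix_bits` and `free_bits` are hypothetical helper names, with `hex_prefix` standing in for the -Z argument and `bits` for the -l argument.

```python
from typing import Optional

KEY_BITS = 256  # a private key is 64 hex characters, i.e. 256 bits


def fixed_prefix_bits(hex_prefix: str, bits: Optional[int] = None) -> str:
    """Binary string of the leading bits pinned by a hex prefix.

    With bits=None the whole prefix is used (4 bits per hex character);
    otherwise only the first `bits` bits of it are kept.
    """
    full = "".join(f"{int(ch, 16):04b}" for ch in hex_prefix)
    if bits is None:
        return full
    if not 0 <= bits <= len(full):
        raise ValueError("bits must fit inside the prefix")
    return full[:bits]


def free_bits(fixed: int) -> int:
    """Number of key bits left unconstrained when `fixed` bits are pinned."""
    return KEY_BITS - fixed


print(fixed_prefix_bits("AAAA"))          # 1010101010101010
print(fixed_prefix_bits("AAAA", 14))      # 10101010101010
print(len(fixed_prefix_bits("0" * 64)))   # 256
print(free_bits(64), free_bits(20))       # 192 236
```

The last line mirrors the subtraction from the question: pinning 64 bits leaves 192 free, and 256 - 20 is 236, the same value bash gives for $((256-20)).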

@dextronomous
Author

OK, great to get this part all cleared up now. I will close this one too. Thanks, guys.
