/r/OpenCL
OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. (c) wikipedia
I’m trying to figure out how to optimize my OpenCL project; I’m currently heavily bottlenecked by buffer I/O. My data is about 80 MB at most. I’ve preallocated the buffers, which helped a lot, but reading out the result still takes over 100 ms, which really throttles the throughput of the whole pipeline. Any tips on where to look to improve this, hardware- or software-wise?
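One thing worth trying (a minimal sketch, not from the original post; the buffer and queue names are placeholders): read the result back through a host-allocated buffer and map it instead of calling clEnqueueReadBuffer. On many drivers CL_MEM_ALLOC_HOST_PTR buffers are pinned, and on iGPUs the map can be close to zero-copy:
```
#include <CL/cl.h>

/* Hypothetical read-back path: the output buffer is created once with
 * CL_MEM_ALLOC_HOST_PTR, then mapped instead of doing a separate read copy. */
static void read_results(cl_command_queue q, cl_mem out, size_t nbytes)
{
    cl_int err;
    /* Blocking map: gives a host pointer to (often pinned) memory. */
    void *host = clEnqueueMapBuffer(q, out, CL_TRUE, CL_MAP_READ,
                                    0, nbytes, 0, NULL, NULL, &err);

    /* ... consume `host` here (feed the next pipeline stage, etc.) ... */

    clEnqueueUnmapMemObject(q, out, host, 0, NULL, NULL);
}

/* The buffer itself would be created once, up front, e.g.:
 *   out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
 *                        nbytes, NULL, &err);
 */
```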
Is there a guide for how CUDA/GLSL functions map to equivalent OpenCL functions?
I am particularly interested in synchronization (__syncthreads(), __syncwarp(), __threadfence()) and subgroup functions (__ballot(), __shfl(), __shfl_xor()).
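Not a full guide, but here is a rough correspondence sketch; the subgroup built-ins come from the cl_khr_subgroup_* extensions (or OpenCL 2.x/3.0 subgroup support), and the names are taken from the extension specs, so double-check them against your implementation:
```
// Rough mapping (sketch; may need -cl-std=CL2.0/CL3.0 and the corresponding
// cl_khr_subgroup_* extensions on your implementation):
//   __syncthreads()   -> barrier(CLK_LOCAL_MEM_FENCE)
//   __syncwarp()      -> sub_group_barrier(CLK_LOCAL_MEM_FENCE)
//   __threadfence()   -> atomic_work_item_fence(CLK_GLOBAL_MEM_FENCE,
//                            memory_order_acq_rel, memory_scope_device)
//   __ballot(p)       -> sub_group_ballot(p)        (cl_khr_subgroup_ballot)
//   __shfl(v, lane)   -> sub_group_shuffle(v, lane) (cl_khr_subgroup_shuffle)
//   __shfl_xor(v, m)  -> sub_group_shuffle_xor(v, m)
__kernel void subgroup_demo(__global int *data)
{
    int v = data[get_global_id(0)];
    uint4 mask = sub_group_ballot(v > 0);          // like __ballot()
    int partner = sub_group_shuffle_xor(v, 1u);    // like __shfl_xor()
    barrier(CLK_LOCAL_MEM_FENCE);                  // like __syncthreads()
    data[get_global_id(0)] = partner + (int)mask.x;
}
```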
Hi, I'm trying to get this DeepFaceLive program running but I get some kind of OpenCL error. I have the OpenCL runtime installer, the SDK, and the NVIDIA drivers. I don't get what's going on; if somebody knows how to fix this:
Running DeepFaceLive.
Traceback (most recent call last):
File "_internal\DeepFaceLive\main.py", line 104, in <module>
main()
File "_internal\DeepFaceLive\main.py", line 97, in main
args.func(args)
File "_internal\DeepFaceLive\main.py", line 39, in run_DeepFaceLive
from apps.DeepFaceLive.DeepFaceLiveApp import DeepFaceLiveApp
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\apps\DeepFaceLive\DeepFaceLiveApp.py", line 11, in <module>
from . import backend
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\apps\DeepFaceLive\backend\__init__.py", line 9, in <module>
from .FaceMerger import FaceMerger
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\apps\DeepFaceLive\backend\FaceMerger.py", line 6, in <module>
from xlib import avecl as lib_cl
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\__init__.py", line 16, in <module>
from xlib.avecl._internal.initializer.InitConst import InitConst
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\initializer\__init__.py", line 1, in <module>
from .InitConst import InitConst
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\initializer\InitConst.py", line 1, in <module>
from ..backend import Kernel
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\__init__.py", line 1, in <module>
from .Buffer import Buffer
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\Buffer.py", line 5, in <module>
from . import OpenCL as CL
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\OpenCL\__init__.py", line 4, in <module>
from .OpenCL import (CL_DEVICE_EXTENSIONS, CL_DEVICE_GLOBAL_MEM_SIZE,
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\OpenCL\OpenCL.py", line 202, in <module>
@dll_import('OpenCL')
File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\OpenCL\OpenCL.py", line 18, in dll_import
raise RuntimeError(f'Unable to load {dll_name} library.')
RuntimeError: Unable to load OpenCL library.
Press any key to continue . . .
1. Check whether your installed copy works normally. 2. Write an example program using OpenCL which can calculate matrix multiplication in parallel. Can anyone help me with this? I'm using Arch Linux with KDE Plasma; my laptop configuration is: Ryzen 5 3500U, Vega 8 iGPU.
I installed opencl-amd, and when I run the C code my laptop's display starts blinking; sometimes it says the GPU hung and sometimes it just blinks (goes black and becomes visible again).
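For the matrix-multiplication part of the question, a minimal naive kernel sketch, one work-item per output element of an N×N result; the host-side setup and the 2D NDRange launch (global size at least N×N) are assumed, not shown:
```
__kernel void matmul(__global const float *A, __global const float *B,
                     __global float *C, const int N)
{
    // One work-item computes one element C[row][col] of the N x N result.
    const int row = get_global_id(1);
    const int col = get_global_id(0);
    if (row >= N || col >= N)
        return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}
```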
This is just an appreciation post for OpenCL. It's great. The only other performance portable API that comes close is KernelAbstractions.jl.
OpenCL is just so good. Other frameworks want you to express your computation as some kind of parallel_for (or something like that), and I've just never found that to feel natural in practice. When you are writing code for GPUs, kernels are just so much easier to me. There's just so much to love.
I do 100% understand that there's some jank, but to be honest, it's been way easier for me to use OpenCL than other GPU solutions for my specific problems. It's even easier than CUDA, which is a big accomplishment. KernelAbstractions.jl is also really nice and offers many similar advantages, but for my specific use case I found OpenCL to be better.
I mean, it's 2024. To me, the only things I need my programming language to do are GPU Computing and Metaprogramming. OpenCL does both really well.
I have seen so many people hating on OpenCL over the years and I don't fully understand why. It's great.
I'm working on a Bitcoin puzzle mining project, but my code throws a runtime error: OUT_OF_RESOURCES.
Everything looks fine to me:
pyopencl:
import pyopencl as cl
from hashlib import sha256
import base58
import os
import numpy as np
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='pyopencl')

def get_kernel(cl):
    return open(cl, "r").read()

def privToWIF(private_key):
    # Step 1: Add version byte (0x80 for mainnet)
    versioned_key = '80' + private_key
    # Step 3: Double SHA-256 hash
    first_sha = sha256(bytes.fromhex(versioned_key)).hexdigest()
    second_sha = sha256(bytes.fromhex(first_sha)).hexdigest()
    # Step 4: Add first 4 bytes of the second SHA-256 hash as checksum
    checksum = second_sha[:8]
    final_key = versioned_key + checksum
    # Step 5: Encode in Base58Check
    compressed_private_key = base58.b58encode(bytes.fromhex(final_key)).decode('utf-8')
    return compressed_private_key

def compress_private_key(private_key_hex):
    # Step 1: Add version byte (0x80 for mainnet)
    versioned_key = '80' + private_key_hex
    # Step 2: Add compression byte (0x01)
    compressed_key = versioned_key + '01'
    # Step 3: Double SHA-256 hash
    try:
        sha = sha256(bytes.fromhex(sha256(bytes.fromhex(compressed_key)).hexdigest())).hexdigest()
    except Exception as e:
        print(private_key_hex)
        with open(os.path.join(os.path.expanduser("~"), "Desktop\\privkey.txt"), "a") as f:
            f.write("\n" + f"Bozuk hex: {private_key_hex} \n Compressli Bozuk Hex: {compress_private_key(private_key_hex)}\n")
        return compress_private_key(private_key_hex)
    # Step 4: Add first 4 bytes of the second SHA-256 hash as checksum
    checksum = sha[:8]
    final_key = compressed_key + checksum
    # Step 5: Encode in Base58Check
    compressed_private_key = base58.b58encode(bytes.fromhex(final_key)).decode('utf-8')
    return compressed_private_key

def main():
    global result_string
    for platform in cl.get_platforms():
        devices = platform.get_devices(cl.device_type.GPU)
        for device in devices:
            context = cl.Context([device])
            queue = cl.CommandQueue(context)
            sizeof_output = 4096  # 256 bit hex int length
            result = np.zeros(sizeof_output, dtype=np.uint8)
            result_buffer = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, result.nbytes)
            program = cl.Program(context, get_kernel("kernel.cl")).build()
            kernel = program.main_src
            kernel.set_args(result_buffer)
            global_work_size = (sizeof_output,)
            cl.enqueue_nd_range_kernel(queue, kernel, global_work_size, None)
            cl.enqueue_copy(queue, result, result_buffer)
            queue.finish()
            result_string = ''.join(chr(x) for x in result)
            print(result_string)
            wif = privToWIF(result_string)
            print("\n" + f"Private Key: {result_string}" + "\n" +
                  f"WIF Private Key: {wif}" + "\n" +
                  f"WIF COMPRESSED Private Key:{compress_private_key(result_string)}" + "\n")
            if True:
                with open(os.path.join(os.path.expanduser("~"), "Desktop\\privkey.txt"), "a") as f:
                    f.write("\n" + f"Private Key: {result_string}" + "\n" +
                            f"WIF Private Key: {wif}" + "\n" +
                            f"WIF COMPRESSED Private Key:{compress_private_key(result_string)}" + "\n")

if __name__ == "__main__":
    os.system("cls")
    main()
kernel.cl (it includes my bigint library, bigint.clh):
#include "bigint.clh"
int len(__global char *str)
{
int i = 0;
while (str[i] != '\0')
{
i++;
}
return i;
}
void int_to_bytes(__global uint *num, __global uchar *byte_array)
{
uint value = num[0];
for (int i = 0; i < 4; i++)
{
byte_array[3 - i] = (uchar)(value & 0xFF);
value >>= 8;
}
}
void bytes_to_int(__global uchar *byte_array, __global uint *result)
{
uint value = 0;
for (int i = 0; i < 4; i++)
{
value = (value << 8) | byte_array[i];
}
result[0] = value;
}
void concate_str(__global char *a, char *b, __global char *result)
{
int gid = get_global_id(0);
// Calculate the lengths of the input strings
int lena = 0;
while (a[lena] != '\0')
{
lena++;
}
int lenb = 0;
while (b[lenb] != '\0')
{
lenb++;
}
// Concatenate the strings
for (int i = 0; i < lena; i++)
{
result[i] = a[i];
}
for (int i = 0; i < lenb; i++)
{
result[lena + i] = b[i];
}
// Null-terminate the result string
result[lena + lenb] = '\0';
}
void concate_str_ng(__constant char *a, char *b, __global char *result)
{
int gid = get_global_id(0);
// Calculate the lengths of the input strings
int lena = 0;
while (a[lena] != '\0')
{
lena++;
}
int lenb = 0;
while (b[lenb] != '\0')
{
lenb++;
}
// Concatenate the strings
for (int i = 0; i < lena; i++)
{
result[i] = a[i];
}
for (int i = 0; i < lenb; i++)
{
result[lena + i] = b[i];
}
// Null-terminate the result string
result[lena + lenb] = '\0';
}
__constant uint k[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};
__constant uint initial_hash[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372,
0xa54ff53a, 0x510e527f, 0x9b05688c,
0x1f83d9ab, 0x5be0cd19};
uint rotate_sha(uint x, uint n) { return (x >> n) | (x << (32 - n)); }
void sha256(__global uchar *in, __global uchar *out, uint length)
{
uint h[8];
for (int i = 0; i < 8; i++)
{
h[i] = initial_hash[i];
}
// Process each 512-bit chunk
for (int chunk = 0; chunk < length; chunk += 64)
{
uint w[64];
for (int i = 0; i < 16; i++)
{
w[i] = (in[chunk + 4 * i] << 24) | (in[chunk + 4 * i + 1] << 16) |
(in[chunk + 4 * i + 2] << 8) | in[chunk + 4 * i + 3];
}
for (int i = 16; i < 64; i++)
{
uint s0 = (rotate_sha(w[i - 15], 7) ^ rotate_sha(w[i - 15], 18) ^
(w[i - 15] >> 3));
uint s1 = (rotate_sha(w[i - 2], 17) ^ rotate_sha(w[i - 2], 19) ^
(w[i - 2] >> 10));
w[i] = w[i - 16] + s0 + w[i - 7] + s1;
}
uint a = h[0];
uint b = h[1];
uint c = h[2];
uint d = h[3];
uint e = h[4];
uint f = h[5];
uint g = h[6];
uint hh = h[7];
for (int i = 0; i < 64; i++)
{
uint S1 = (rotate_sha(e, 6) ^ rotate_sha(e, 11) ^ rotate_sha(e, 25));
uint ch = ((e & f) ^ (~e & g));
uint temp1 = hh + S1 + ch + k[i] + w[i];
uint S0 = (rotate_sha(a, 2) ^ rotate_sha(a, 13) ^ rotate_sha(a, 22));
uint maj = ((a & b) ^ (a & c) ^ (b & c));
uint temp2 = S0 + maj;
hh = g;
g = f;
f = e;
e = d + temp1;
d = c;
c = b;
b = a;
a = temp1 + temp2;
}
h[0] += a;
h[1] += b;
h[2] += c;
h[3] += d;
h[4] += e;
h[5] += f;
h[6] += g;
h[7] += hh;
}
for (int i = 0; i < 8; i++)
{
out[4 * i] = (h[i] >> 24) & 0xff;
out[4 * i + 1] = (h[i] >> 16) & 0xff;
out[4 * i + 2] = (h[i] >> 8) & 0xff;
out[4 * i + 3] = h[i] & 0xff;
}
}
__constant char base58_alphabet[] =
"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";
void base58_encode(__global uchar *input, uint input_len,
__global char *output)
{
uint i, j;
uint carry;
const uint max_output_len = 256; // Adjust this size based on your needs
uchar temp[max_output_len];
// Initialize the temp array
for (i = 0; i < max_output_len; i++)
{
temp[i] = 0;
}
for (i = 0; i < input_len; i++)
{
carry = input[i];
for (j = max_output_len; j > 0; j--)
{
carry += (uint)(temp[j - 1]) << 8;
temp[j - 1] = carry % 58;
carry /= 58;
}
}
// Skip leading zeros in temp
i = 0;
while (i < max_output_len && temp[i] == 0)
{
i++;
}
// Translate the digits to the base58 alphabet
j = 0;
while (i < max_output_len)
{
output[j++] = base58_alphabet[temp[i++]];
}
for (i = 0; i < input_len && input[i] == 0; i++)
{
output[j++] = '1';
}
// Null-terminate the output string
output[j] = '\0';
}
void create_versioned_key(__global const char *privkey,
__global char *versioned)
{
// Define the prefix and suffix
const char prefix[] = "80";
const char suffix[] = "01";
// Calculate the lengths
uint prefix_len = 2; // Length of "80"
uint privkey_len = 0;
uint suffix_len = 2; // Length of "01"
// Calculate the length of the private key
while (privkey[privkey_len] != '\0')
{
privkey_len++;
}
// Construct the versioned key
uint i = 0;
// Add the prefix
for (i = 0; i < prefix_len; i++)
{
versioned[i] = prefix[i];
}
for (uint j = 0; j < privkey_len; j++, i++)
{
versioned[i] = privkey[j];
}
for (uint j = 0; j < suffix_len; j++, i++)
{
versioned[i] = suffix[j];
}
// Null-terminate the versioned key
versioned[i] = '\0';
}
void compress_private_key(__global char *privkey, __global char *out)
{
int id =
get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id
// __global const char *versioned = "80" + privkey + "01";
__global char *versioned;
create_versioned_key(privkey, versioned);
__global uchar *versioned_b;
stringToByteArray(versioned, versioned_b, len(versioned));
__global uchar *firsh_sha;
sha256(versioned_b, firsh_sha, len(versioned));
__global uchar *second_sha;
sha256(firsh_sha, second_sha, sizeof(firsh_sha) / sizeof(firsh_sha[0]));
char checksum[8];
checksum[id % 8] = second_sha[id % 8];
__global char *final_key;
concate_str(versioned, checksum, final_key);
__global uchar *final_key_b;
stringToByteArray(final_key, final_key_b, len(final_key));
__global char *compressed_key;
base58_encode(final_key_b, len(final_key), compressed_key);
*out = *compressed_key;
}
void random_generator(__global const BigInt *start, __global const BigInt *end,
                      __global BigInt *output)
{
int gid = get_global_id(0);
// Seed based on global ID and current time
uint seed = gid + (uint)get_global_id(0) + (uint)get_global_size(0);
// Linear Congruential Generator (LCG) parameters
uint a = 1664525;
uint c = 1013904223;
uint m = 0xFFFFFFFF;
// Initialize state with seed
uint state = seed;
// Generate random number
state = (a * state + c) % m;
// Map the random number to the range [start, end]
BigInt range;
init_bigint_ng(&range);
// Calculate range = end - start
for (int i = 0; i < MAX_DIGITS; i++)
{
range.digits[i] = end[gid].digits[i] - start[gid].digits[i];
}
range.length = end[gid].length;
// Calculate output = start + (state % range)
for (int i = 0; i < MAX_DIGITS; i++)
{
output[gid].digits[i] = start[gid].digits[i] + (state % range.digits[i]);
}
output[gid].length = start[gid].length;
}
__constant uint K[5] = {0x00000000, 0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xA953FD4E};
__constant uint KK[5] = {0x50A28BE6, 0x5C4DD124, 0x6D703EF3, 0x7A6D76E9, 0x00000000};
void ripemd160_hash(__global const uchar *input, __global uint *output, uint length)
{
uint h0 = 0x67452301;
uint h1 = 0xEFCDAB89;
uint h2 = 0x98BADCFE;
uint h3 = 0x10325476;
uint h4 = 0xC3D2E1F0;
uint A, B, C, D, E, AA, BB, CC, DD, EE, T;
uint X[16];
for (uint i = 0; i < length; i += 64)
{
for (uint j = 0; j < 16; j++)
{
X[j] = ((uint)input[i + 4 * j]) | (((uint)input[i + 4 * j + 1]) << 8) |
(((uint)input[i + 4 * j + 2]) << 16) |
(((uint)input[i + 4 * j + 3]) << 24);
}
A = AA = h0;
B = BB = h1;
C = CC = h2;
D = DD = h3;
E = EE = h4;
// Main loop
for (uint j = 0; j < 80; j++)
{
uint F = (B ^ C ^ D);
uint Kt = K[j / 16];
uint s = (j % 16);
T = A + F + X[s] + Kt;
A = E;
E = D;
D = (C << 10) | (C >> (32 - 10));
C = B;
B = (T << s) | (T >> (32 - s));
uint tempAA = AA;
uint tempBB = BB;
uint tempCC = CC;
uint tempDD = DD;
uint tempEE = EE;
F = (tempBB ^ (tempCC | ~tempDD));
Kt = KK[j / 16];
s = (j % 16);
T = tempAA + F + X[s] + Kt;
AA = tempEE;
EE = tempDD;
DD = (tempCC << 10) | (tempCC >> (32 - 10));
CC = tempBB;
BB = (T << s) | (T >> (32 - s));
}
T = h1 + C + DD;
h1 = h2 + D + EE;
h2 = h3 + E + AA;
h3 = h4 + A + BB;
h4 = h0 + B + CC;
h0 = T;
}
output[0] = h0;
output[1] = h1;
output[2] = h2;
output[3] = h3;
output[4] = h4;
}
const char *privkeyToAddr(__global char *privkey_c)
{
int id =
get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id
__global BigInt *a;
string_to_bigint(privkey_c, a);
__global uchar *byte_array_a;
bigint_to_byte_array(a, byte_array_a);
int len_byte_array_a = sizeof(byte_array_a) / sizeof(byte_array_a[0]);
__global uchar *_x;
__global uchar *_y;
if (id < (len_byte_array_a / 2))
{
_x[id] = byte_array_a[id];
}
else if (id < len_byte_array_a)
{
_y[id] = byte_array_a[id];
}
__global BigInt *x;
__global BigInt *y;
byteArrayToBigInt(_x, len_byte_array_a / 2, x);
byteArrayToBigInt(_y, len_byte_array_a / 2, y);
__global BigInt *x_cube;
__global BigInt *seven;
__global BigInt *y_square;
__global BigInt *x_cube_p_seven;
__global BigInt *x_cube_p_seven_m_y_square;
__global BigInt *p;
__global BigInt *public_k;
power_bigint(x, 3, x_cube);
int_to_bigint(7, seven);
add_bigint(x_cube, seven, x_cube_p_seven);
power_bigint(y, 2, y_square);
subtract_bigint(x_cube_p_seven, y_square, x_cube_p_seven_m_y_square);
init_bigint_from_string("115792089237316195423570985008687907853269984665640564039457584007908834671663",
p);
bigint_mod(x_cube_p_seven_m_y_square, p, public_k);
__global uchar *public_k_b;
__global uchar *public_k_h;
__global uint *public_k_rh;
__global uchar *public_k_rh_b;
__global char *public_k_rs;
bigint_to_byte_array(public_k, public_k_b);
sha256(public_k_b, public_k_h, sizeof(public_k_b) / sizeof(public_k_b[0]));
ripemd160_hash(public_k_h, public_k_rh,
sizeof(public_k_h) / sizeof(public_k_h[0]));
int_to_bytes(public_k_rh, public_k_rh_b);
byteArrayToString(public_k_rh_b, public_k_rs,
sizeof(public_k_rh_b) / sizeof(public_k_rh_b[0]));
__global char *network_byte;
concate_str_ng("00", public_k_rs, network_byte);
__global uchar *network_byte_b;
stringToByteArray(network_byte, network_byte_b, len(network_byte));
__global uchar *sha1;
__global uchar *sha2;
sha256(network_byte_b, sha1,
sizeof(network_byte_b) / sizeof(network_byte_b[0]));
sha256(sha1, sha2, sizeof(sha1) / sizeof(sha1[0]));
char checksum[9];
__global char *sha2_hex;
byteArrayToHexString(sha2, sha2_hex, sizeof(sha2) / sizeof(sha2[0]));
if (id < 8)
checksum[id] = sha2_hex[id];
__global char *net_checksum;
concate_str(network_byte, checksum, net_checksum);
__global BigInt *net_checksum_B;
string_to_bigint(net_checksum, net_checksum_B);
__global uchar *net_checksum_b;
bigint_to_byte_array(net_checksum_B, net_checksum_b);
__global char *addr;
base58_encode(net_checksum_b,
sizeof(net_checksum_b) / sizeof(net_checksum_b[0]), addr);
return addr;
}
__global char *randomGenPrivKey(__global BigInt *start, __global BigInt *end,
__global char *wallet_addr)
{
__global BigInt *a;
__global char *b;
while (1)
{
random_generator(start, end, a);
bigint_to_hex_string(a, b);
while (len(b) < 64)
{
if (len(b) < 64)
{
*b = '0' + *b;
}
}
__global char *compressed_privkey;
compress_private_key(b, compressed_privkey);
if (wallet_addr == privkeyToAddr(compressed_privkey))
{
// return (*b);
return b; // return to host
}
}
}
// support biginteger chatgpt: would you write a big integer struct in opencl C
// after ask what you want
__kernel void main_src(__global uchar *result) // strlen : 64
{
__global BigInt *start;
__global BigInt *end;
__global char *Addr;
__global BigInt *base;
init_bigint_from_string("2", base);
power_bigint(base, 65, start);
power_bigint(base, 66, end);
__global char *result_real = randomGenPrivKey(start, end, Addr);
// int id = get_global_id(0);
int id =
get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id
result[id] = result_real[id]; // change char by char for errors.
}
// DEFINITION OF: ALGORITHM
/*
OBJECTIVE: generate Random Big number and generate BTC addr from it
NEEDED: Big integer API, secp256k1, ripemd160, SHA256, True Random API
CORE NEEDED: Big Integer, True Random
*/
This looks fine to me, I think; I probably haven't managed the cores and memory well and it just needs optimization. Please let me know if you manage to figure out how to run this code without any errors.
Thanks.
Hi,
I am trying to run my PyOpenCL scripts on my desktop. It has an AMD Radeon RX 7900 XT.
Small test scripts work without problems, but when I run something more demanding the program never finishes, and after about 20 seconds my PC freezes and I have to force a restart.
The same program works flawlessly on my laptop.
I have the latest version of my GPU driver installed and I'm running Win11.
Has anyone had this issue or does anyone know the cause?
Cheers
I'm currently working on a project adding GPU functionality to the GNUAstro library (C99). However, one of the problems I've run into recently is that OpenCL does not have a simple way to use external libraries within a kernel.
Ideally, I'd like to be able to use certain parts of the library (written in C99) within the kernel, but OpenCL C has a ton of restrictions (no malloc/free, no standard header files, etc.).
Therefore, simply #include'ing the source code isn't enough, so I was wondering how feasible it is to either:
a) Use preprocessor macros to remove anything not compliant with OpenCL C while preserving functionality, or replacing it with other code.
For example, if I have a function on the host CPU (C99) such as
int
gal_units_extract_decimal(char *convert, const char *delimiter,
double *args, size_t n)
{
size_t i = 0;
char *copy, *token, *end;
/* Create a copy of the string to be parsed and parse it. This is because
it will be modified during the parsing. */
copy=strdup(convert);
do
{
/* Check if the required number of arguments are passed. */
if(i==n+1)
{
free(copy);
error(0, 0, "%s: input '%s' exceeds maximum number of arguments "
"(%zu)", __func__, convert, n);
return 0;
}
/* Extract the substring till the next delimiter. */
token=strtok(i==0?copy:NULL, delimiter);
if(token)
{
/* Parse extracted string as a number, and check if it worked. */
args[i++] = strtod (token, &end);
if (*end && *end != *delimiter)
{
/* In case a warning is necessary
error(0, 0, "%s: unable to parse element %zu in '%s'\n",
__func__, i, convert);
*/
free(copy);
return 0;
}
}
}
while(token && *token);
free (copy);
/* Check if the number of elements parsed. */
if (i != n)
{
/* In case a warning is necessary
error(0, 0, "%s: input '%s' must contain %lu numbers, but has "
"%lu numbers\n", __func__, convert, n, i);
*/
return 0;
}
/* Numbers are written, return successfully. */
return 1;
}
then I would use it on the device by including it in a .cl file and applying macros like
#define free(x)
#define error(...)
to make it OpenCL C by stripping out those function calls (error() takes a variable number of arguments, hence the variadic macro). In this way I keep only one major source file; see the shim sketch after this list for the kind of thing I mean.
or
b) Maintain a separate .cl file with a separate implementation of each function, thereby keeping two copies of the source code: one in C99 and one in OpenCL C.
Thoughts?
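For what it's worth, a minimal sketch of the option (a) shim; every name here is illustrative, not GNUAstro's actual API:
```
/* gal_cl_shim.h -- hypothetical shim, included ahead of the shared C99 source
 * when that source is compiled as OpenCL C. Everything here is illustrative. */
#ifdef __OPENCL_VERSION__
  /* Host-only calls become no-ops on the device. */
  #define free(p)     ((void)0)
  #define error(...)  ((void)0)
  /* Note: libc string routines (strdup, strtok, strtod) have no OpenCL C
   * equivalents, so functions that depend on them still need a device-side
   * rewrite; the macro approach only covers the simpler cases. */
#else
  #include <stdlib.h>
  #include <string.h>
  #include <error.h>
#endif
```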
I've got an application where I compute the Hamming distance between every combination of elements from two arrays of 64-bit integers, and return those that fall below a threshold.
Each array represents a video of arbitrary length, and each element within it is a hash generated from a frame at a given point within the video.
This process returns an array of triples, each being: 1) index of frame in ref, 2) index of frame in comp, 3) similarity between the two hashes.
The code downstream of this can then identify sequences of similar images within two videos. It can be quite effective.
Here's the code I'm currently using (apologies for the quality, I'm a C novice).
unsigned long long * ref;
unsigned long long * comp;
unsigned long long x;
int i, j, c;
for ( i = 0; i < len_ref; i++ ) {
for ( j = 0; j < len_comp; j++ ) {
c = 0;
x = ref[i] ^ comp[j];
while ( x > 0 ) {
c += x & 1;
x >>= 1;
}
if ( c <= threshold ) {
// push i, j & c to output array
}
}
}
It's relatively fast, but obviously, the more video you throw at it, the more burden it is to the CPU.
I was considering offloading this task to the GPU built into my fairly modern Intel processor.
I thought I'd ask here whether this task would be practical enough for me to learn enough OpenCL (from scratch) to be able to implement it?
I've found offloading some tasks to the GPU (using libavcodec) can take longer transferring to and from the GPU memory than just getting the job done in the CPU in the first place.
I'm currently uploading the first array (ref), then upload each second comparison array (comp) in turn.
If this all sounds a bit half-baked, it probably is. I'm just playing around with a hobby project. Thanks for indulging me.
Edit: just discovered __builtin_popcountll
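(OpenCL has a built-in popcount as well.) As a rough sketch of what the kernel side could look like, assuming one work-item per (i, j) pair, a host-allocated output buffer, and an atomic counter to compact the sparse matches; all names here are placeholders:
```
__kernel void hamming(__global const ulong *ref, __global const ulong *comp,
                      const int len_comp, const int threshold,
                      __global int *out, __global int *out_count)
{
    const int i = get_global_id(1);   // index into ref
    const int j = get_global_id(0);   // index into comp (may be padded)
    if (j >= len_comp)
        return;

    // popcount() is the OpenCL counterpart of __builtin_popcountll.
    const int c = (int)popcount(ref[i] ^ comp[j]);
    if (c <= threshold) {
        // Compact the matches into a flat (i, j, c) triple list.
        const int slot = atomic_inc(out_count);
        out[3 * slot + 0] = i;
        out[3 * slot + 1] = j;
        out[3 * slot + 2] = c;
    }
}
```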
Disclaimer: I'm trying to learn OpenCL by doing, so there may be concepts I haven't studied yet.
I have the following piece of code:
```
typedef struct{
int id;
int value;
} item;
typedef struct {
item items[MAX_N];
} collection;
```
Now, I want to initialize a collection with some default items for all the ids but, in regular C, I would need a malloc to do that.
How can I do something similar (inside a device kernel) in OpenCL?
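A sketch of one way to do it without malloc, assuming MAX_N is a compile-time constant (which the struct above already requires): fixed-size private storage needs no allocation, and persistent storage comes from a buffer the host allocates.
```
#define MAX_N 128   /* assumed compile-time constant, must match the host */

typedef struct { int id; int value; } item;
typedef struct { item items[MAX_N]; } collection;

__kernel void init_collections(__global collection *cols)
{
    const int gid = get_global_id(0);

    /* No malloc needed: the size is fixed at compile time, so a private
       copy lives in the work-item's own storage. */
    collection c;
    for (int i = 0; i < MAX_N; ++i) {
        c.items[i].id = i;       /* default id    */
        c.items[i].value = 0;    /* default value */
    }

    /* Persistent storage comes from a buffer the host allocated with
       clCreateBuffer (num_work_items * sizeof(collection) bytes). */
    cols[gid] = c;
}
```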
Hello OpenCL Sub,
I’d like some feedback on a recent project: openclc. As the name suggests, it’s an AOT compiler for OpenCL C/C++ code targeting SPIR-V, to be consumed by clCreateProgramWithIL.
Coming from CUDA, I liked using the OpenCL language on a school project. That said, I found the compile-at-runtime, put-the-kernels-in-a-C-string flow janky and off-putting. Thankfully, Khronos created an LLVM backend that converts LLVM IR to SPIR-V. Despite the good code in the SPIRV-LLVM-Translator, it leaves much to be desired when it comes to packaging: it requires a build from source against a system LLVM installation, doesn’t do spirv-opt performance passes, and leaves you to figure out how to include the SPIR-V in your program.
Openclc bundles clang/LLVM, the SPIRV-LLVM-Translator, and spirv-opt performance passes into one static executable. It can output the SPIR-V directly or as a C array so you can easily embed it into a binary.
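For context, consuming the emitted C array on the host looks roughly like this (a sketch; kernels_spv / kernels_spv_len stand in for whatever symbol names openclc actually generates):
```
#include <stddef.h>
#include <CL/cl.h>

/* Hypothetical symbols emitted by the AOT step (names are made up). */
extern const unsigned char kernels_spv[];
extern const size_t kernels_spv_len;

static cl_program load_program(cl_context ctx, cl_device_id dev)
{
    cl_int err;
    cl_program prog = clCreateProgramWithIL(ctx, kernels_spv,
                                            kernels_spv_len, &err);
    /* The SPIR-V still goes through the driver's finalizer. */
    err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    return prog;
}
```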
Future Idea: OpenCLC Runtime
The biggest problem with OpenCL is the arduous and error-prone device discovery and kernel scheduling. It would be a huge boost to OpenCL usability to offload device discovery and scheduling to a runtime library, like CUDA does with the CUDA Runtime. Instead of just compiling CL sources to SPIR-V, it could offer a regular C symbol for each kernel, with the clEnqueueNDRangeKernel ugliness handled under the hood. With sufficient abstraction, the OpenCL backend could be swapped for Vulkan, Level Zero, and maybe even Metal through SPIRV-Cross.
I'd love to answer any questions.
Does this mean there's a problem? Which NVIDIA or OpenCL things should I install? Thanks.
I'm new to OpenCL and GPU programming, so I tried to make a particle gravity simulation, and after reading some tutorials and guides I got stuck with a -5 (CL_OUT_OF_RESOURCES) error.
I wasn't able to identify why it happens, so I took the boilerplate code from this guide to reproduce the issue on a smaller scale and ended up with this:
for(int i = 0; i < 10; i++){
ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0,
LIST_SIZE * sizeof(int), B, 0, NULL, NULL);
size_t global_item_size = LIST_SIZE;
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size, NULL, 0, NULL, NULL);
PRINT_ERROR(ret);
ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
LIST_SIZE * sizeof(int), C, 0, NULL, NULL);
clFinish(command_queue);
printf("loop\n");
}
I get the same -5 (CL_OUT_OF_RESOURCES) after 2 successful loops. Am I not allowed to do it like that? My original plan was to recalculate the forces between particles each frame.
I'm not allocating any new memory on the GPU, so what resource can I possibly be running out of? My old laptop's willpower? It has Intel(R) HD Graphics 505.
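Not an answer, but a small checking sketch that usually helps narrow this kind of thing down: verify every call, not just the NDRange one, so you can see which call actually starts failing. The macro is illustrative and drops into the loop above:
```
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Report which call fails, and with which error code. */
#define CHECK(call)                                                   \
    do {                                                              \
        cl_int _e = (call);                                           \
        if (_e != CL_SUCCESS) {                                       \
            fprintf(stderr, "%s -> error %d\n", #call, (int)_e);      \
            exit(1);                                                  \
        }                                                             \
    } while (0)

/* Usage inside the loop above, e.g.:
 *   CHECK(clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
 *                              LIST_SIZE * sizeof(int), A, 0, NULL, NULL));
 *   CHECK(clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
 *                                &global_item_size, NULL, 0, NULL, NULL));
 *   CHECK(clFinish(command_queue));
 */
```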
Hello, for some reason my debugger is breaking on an invalid parameter whenever I try to run this specific bit of code. It seems to be an issue with std::vectors? Not sure, but for some reason this code doesn't cause an issue: '''queue.enqueueWriteBuffer(buffer, sizeof(somestruct) * std::size(vector), vector.data())'''
I've also tried copying the vector into an array I made using '''somestruct* array = new somestruct[std::size(vector)]; memcpy(array, sizeof(somestruct)*std::size(vector), vector.data())''' and that seems to work... sometimes. It still throws an error for some data types, maybe? I was reformatting my code and ran into this issue after I removed an if statement that all this code was in. All that if statement did was test whether some variable was a nullptr or not, and I got rid of the reason for that variable existing, so I don't know. If I step through the code, everything seems to copy without any issues, but it still breaks.
Read buffers don't have this issue. I think there's something weird with the const void*, because the read function uses a normal void*.
Hi all,
I'm trying to make a 3D visualisation program, and for this I need to write images (stored as integer arrays on the device) to the screen in real-time. The images are calculated using OpenCL kernels. All that's left to do is to display them. Unfortunately, the OpenCL implementation I'm using doesn't support OpenGL interoperability.
What's my best chance of rendering the images to the screen at a high refresh rate without copying them to the host?
Thanks.
P.S.: Without OpenGL, and thus also GLFW, there is no window creation/destruction. So this will obviously have to be done with something else (like SDL).
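If interop really isn't available, the usual fallback is mapping the buffer each frame and handing the pointer to SDL; it's not strictly "without copying to the host", but with a CL_MEM_ALLOC_HOST_PTR buffer on a unified-memory GPU the map can be close to zero-copy. A sketch (names, sizes, and the RGBA8 format are assumptions):
```
#include <CL/cl.h>
#include <SDL2/SDL.h>

/* Assumes `pixels` is an RGBA8 image buffer of width*height*4 bytes,
 * ideally created with CL_MEM_ALLOC_HOST_PTR. */
static void present_frame(cl_command_queue q, cl_mem pixels,
                          SDL_Renderer *ren, SDL_Texture *tex,
                          int width, int height)
{
    cl_int err;
    const size_t bytes = (size_t)width * height * 4;
    void *p = clEnqueueMapBuffer(q, pixels, CL_TRUE, CL_MAP_READ,
                                 0, bytes, 0, NULL, NULL, &err);
    SDL_UpdateTexture(tex, NULL, p, width * 4);   /* pitch in bytes */
    clEnqueueUnmapMemObject(q, pixels, p, 0, NULL, NULL);

    SDL_RenderClear(ren);
    SDL_RenderCopy(ren, tex, NULL, NULL);
    SDL_RenderPresent(ren);
}
```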
https://us.download.nvidia.com/Windows/551.52/551.52-win11-win10-release-notes.pdf
Hi! I tried to post this back then but couldn't. I found it interesting that, after so many years, it seems NVIDIA GPUs now support cl_khr_fp16 according to the driver docs above.
Has anyone tried this yet? On my 2070 I see no such extension; is this limited to newer cards?
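For anyone checking: the extension string is queried on the host, and the kernel needs the pragma once the extension is actually reported; a small sketch:
```
// Host side: look for "cl_khr_fp16" in the device's extension string, e.g.
//   clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(buf), buf, NULL);

// Kernel side (only valid once the extension is reported):
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void scale_half(__global half *data, const float s)
{
    const size_t i = get_global_id(0);
    data[i] = (half)((float)data[i] * s);   // half storage + arithmetic
}
```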
Hello, I am an MS student and I am interested in parallel computing using GPGPUs. Is OpenCL still relevant in 2024 or should I focus more on SYCL? My aim is to program my AMD graphics card for various purposes (cfd and ml). Thanks.
The device enqueue feature is, I think, similar to CUDA dynamic parallelism, but the NVIDIA OpenCL implementation does not provide it: clinfo shows "Device enqueue capabilities (n/a)". The software version is CUDA 12.2 and the card is an A10. I also tried libamdocl.so on a W6800 card, with the same result. I don't have any other devices at the moment, and I am very curious: which devices do support this feature? Is it only supported on CPUs/FPGAs, or was it never really supported by any GPU?
All of my Intel GPUs (Arc 750, Arc 770 and HD 530) report, with the latest corresponding drivers, that they support the cl_khr_integer_dot_product extension, but I am unable to get it working. Kernel code using dot on uchar4 produces compilation errors, and a simple printf test does not print anything:
#pragma OPENCL EXTENSION cl_khr_integer_dot_product : enable
if (get_global_id(0) == 0) {
#if defined(cl_khr_integer_dot_product) && defined(__opencl_c_integer_dot_product_input_4x8bit)
printf("\ninteger_dot_product with uchar4 supported in kernel\n\n");
#endif
#if defined(cl_khr_integer_dot_product) && defined(__opencl_c_integer_dot_product_input_4x8bit_packed)
printf("\ninteger_dot_product with uint supported in kernel\n\n");
#endif
}
When querying the cl_khr_integer_dot_product extension capabilities with OpenCLCapsViewer, it reports that both the packed and unpacked versions are supported.
But how do I actually use it on Intel in kernel code?
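One likely culprit (an assumption, not a confirmed diagnosis): the __opencl_c_* feature macros are only defined when the program is built as OpenCL C 3.0, so without -cl-std=CL3.0 the #if blocks above never fire even though the device supports the extension. A sketch, with the built-in names as I read them from the cl_khr_integer_dot_product spec (double-check against your headers):
```
// Build with: clBuildProgram(..., "-cl-std=CL3.0", ...)
#pragma OPENCL EXTENSION cl_khr_integer_dot_product : enable

__kernel void dot_test(__global const uchar4 *a, __global const uchar4 *b,
                       __global uint *out)
{
    const size_t i = get_global_id(0);
#if defined(__opencl_c_integer_dot_product_input_4x8bit)
    out[i] = dot(a[i], b[i]);                 // 4x8-bit unsigned dot product
#else
    const uint4 av = convert_uint4(a[i]), bv = convert_uint4(b[i]);
    out[i] = av.x * bv.x + av.y * bv.y + av.z * bv.z + av.w * bv.w;  // fallback
#endif
}
```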
I am going through an old version of the Khronos OpenCL CTS. Around 2021, a commit removed the clcpp directory from the CTS file tree. I am curious about it: many materials on the web referencing C++ for OpenCL also mention they are for OpenCL of 2021, which, as far as I know, was long after OpenCL 3.0 had been released, with no major version update due. Is there anything special about that year? Was C++ support removed from the OpenCL kernel language at that point? By the way, what are the headers <opencl_memory> and <opencl_spec_constant> in the old version of the CTS? Were they once standard libraries for OpenCL C++ that are now deprecated?
Hi, I want to get started with OpenCL programming; I'm a total noob right now. I was attempting to set up OpenCL on my machine inside WSL2, but I just can't seem to get it to work. It's an Intel machine with an integrated graphics card (i5-8250 with UHD 620). Could someone please guide me through the setup?
TLDR: title, but also: would it be possible to run a test to figure out whether it is supported on the host machine? It's for a game, and it's meant to be distributed.
I redid my post because I included a random image by mistake.
Anyway, I have an idea for a long-term project, a game I would like to develop where there will be a lot of calculations in the background but little to no graphics. So I figured I might as well ship some of those calculations to the unused GPU.
I have very little experience with OpenCL outside of some things I've read, so I figured y'all might know more than me / have advice for a starting developer.
Does anyone know if there's a way to step through a kernel in Visual Studio?
Or better yet, does anyone have a kernel that can compare two triangles to see if they intersect?
I found some old code on the Internet Archive after hours of searching and digging through old Stack Overflow posts about exactly this, and that code is giving me weird results. I know for a fact that the information I'm putting in isn't garbage, because I check it manually every time I get a weird result and it just doesn't make sense. I'm away from my PC at the moment, so it'll take me a while to upload the code.
Edit: I solved it, lol. I had a typo in my XMVector3Cross function that replaced some * with +, which caused the weird results. Fixing those typos made my code detect collisions perfectly.
I've made a version that uses 2 dimensions instead of a for loop if anyone wants it: typedef struct XMFLOAT4{ float x; float y; float z; float - Pastebin.com
Hello, I'm trying to use the atomic_compare_exchange_strong() function in my OpenCL kernel, but I'm getting a CL_BUILD_PROGRAM_FAILURE error, and a CL_INVALID_PROGRAM_EXECUTABLE error unless I comment out the atomic function. According to https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/atomic_compare_exchange.html I need three features to use that function: __opencl_c_generic_address_space, __opencl_c_atomic_order_seq_cst, and __opencl_c_atomic_scope_device. I have been unable to figure out how to enable these features, or to find any instructions on doing so. Any help will be greatly appreciated.
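For what it's worth: those three are OpenCL C 3.0 optional features that the device either reports or doesn't (see CL_DEVICE_OPENCL_C_FEATURES); you can't turn them on from the kernel, but you do need to build with -cl-std=CL3.0 for the feature macros to be defined at all. A sketch of the usage, assuming the device reports them:
```
// Requires building with "-cl-std=CL3.0"; the features below are reported by
// the device and cannot be forced on from kernel code.
#if defined(__opencl_c_generic_address_space) && \
    defined(__opencl_c_atomic_order_seq_cst) && \
    defined(__opencl_c_atomic_scope_device)
__kernel void cas_max(__global atomic_int *best, __global const int *vals)
{
    const int v = vals[get_global_id(0)];
    int expected = atomic_load(best);
    // Classic CAS loop: on failure, `expected` is refreshed with the current
    // value, so we keep retrying only while v is still larger.
    while (v > expected &&
           !atomic_compare_exchange_strong(best, &expected, v))
        ;
}
#endif
```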
I want to use OpenCL with Microsoft Visual Studio 2022, but when I opened an OpenCL package there was nothing that let me open an OpenCL file in Visual Studio. Is there a recommended way to get OpenCL working with Microsoft Visual Studio without going through the madness?
Hi All,
I upgraded to Fedora 39 (from 38) and my OpenCL performance on my 6900 XT dropped by 75%!
I have reinstalled Fedora 38 and have the performance back. Has anyone else encountered this or know what is going on?
I am using the rocm-* dnf packages from the standard Fedora repos.
I am assuming the issue is with rocm-opencl... Fedora 38 ships 5.5.1 and Fedora 39 ships 5.7.1. Thoughts/experiences?
Thanks,
Ant
Hello everyone,
How has your experience been with using C++ as the main language for writing OpenCL kernels?
I like OpenCL C, and I've been using it to develop my CFD solvers.
But I also need to support CUDA, and that requires me to convert my CUDA code to OpenCL C.
As you might guess, that doubles my work.
I was reading this small writeup from Khronos, and C++ for OpenCL seems extremely promising: https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/cpp_for_opencl.md
I definitely need my code to run on both OpenCL and CUDA, so I was thinking of writing a unified kernel launcher and configuring my build system so that the same C++ code would be compiled for both OpenCL and CUDA, and the user can simply choose which one she wants to use at runtime.
Thanks
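For what it's worth, one common trick for sharing a single kernel source between CUDA and OpenCL C (independent of C++ for OpenCL) is a small macro shim; a sketch, with all names invented here:
```
/* portable_kernel.h -- illustrative shim, not an existing library. */
#if defined(__OPENCL_VERSION__)
  #define KERNEL        __kernel
  #define GLOBAL        __global
  #define GLOBAL_ID(d)  ((int)get_global_id(d))
#elif defined(__CUDACC__)
  #define KERNEL        extern "C" __global__
  #define GLOBAL        /* CUDA pointers are global by default */
  #define GLOBAL_ID(d)  (blockIdx.x * blockDim.x + threadIdx.x)  /* d == 0 only */
#endif

/* The same source then builds under both toolchains: */
KERNEL void axpy(GLOBAL const float *x, GLOBAL float *y, const float a, const int n)
{
    const int i = GLOBAL_ID(0);
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```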