/r/q?req.query.q -- Subreddit Search

2,458 Subscribers

Using GPU Parallelization for a Goal Oritented Action Planning Agent[X-Post]

0 Comments

2025/01/20
15:07 UTC

Low-Level optimizations - what do I need to know? OS? Compilers?

Hello,

I'm an EE major, so I did not take courses on OS, compilers, etc. I'm working on gaining expertise in parallel programming on GPUs (CUDA and OpenCL) and have written kernels to optimize various algorithms. (CNN, Flash Attention are a few examples)

I wanted to understand what knowledge someone who is an expert in this field would ideally have. I understand the principles of parallel programming and some things about GPU architecture. Would understanding OS, compilers help me at all in any way?

My goal is to work on efficient implementation of AI models.

I would appreciate some direction to improve myself in this area and gain more confidence to be able to say "I know how to make your algorithm run the fastest it can on this device." This is an exaggeration, but something along this line.

2 Comments

2024/12/31
02:49 UTC

Can I run OpenCL on AMD® Ryzen™ 5 5625U with integrated Radeon graphics?

I am a CSE undergraduate student and I want to explore high performance computing, GPU programming, etc. I have learned about OpenCL recently and the idea of having an open standard which is supported (at least theoretically) across different architectures seems interesting, unlike CUDA. I have some questions regarding getting started with OpenCL -

I have read that OpenCL is an abstraction for parallel computing across different architectures, I am presently running AMD® Ryzen™ 5 5625U with integrated Radeon graphics, is it possible to install necessary drivers for the same on my device. I have read from some other posts that AMD has dropped its support for OpenCL, and I'll have to use the Intel drivers for the same. Is it true? And if yes, is it practically possible to run OpenCL on AMD prcocessors?

If it is not possible to run OpenCL locally, is there some option to run it on some cloud, specifically for learning purposes.

Also, I was wondering what kind of parallel computation does OpenCL support for CPUs, since traditionally CPUs do not provide as highly parallel computation as GPUs. So is it vector operations, etc which are utilized while working with OpenCL on CPU to carry out parallel operations or is it something else?

3 Comments

2024/12/30
10:31 UTC

Setup OpenCL | Windows on arm

Hi. I've been trying to setup OpenCL on my windows 11 system - arm based.

However I am unable to find a resource that would help me do this. I checked out the OpenCL-SDK repository and executed steps for the build.

https://github.com/KhronosGroup/OpenCL-SDK

But I have no clue what to do to start opencl development. I included bin path so I can now execute clinfo from terminal. Also included OpenCL-SDK/install/include folder containing CL/* files. I tried to compile a simple test.cpp file:

#include<CL/opencl.h>
#include<stdio.h>
int main(void){
        printf("Hello world!\n");
}

It could not recognize the CL folder, so I manually included it.
But I get the following error:

g++ -I ..\OpenCL-SDK\install\include\ .\test.cpp -o a
In file included from ..\OpenCL-SDK\install\include/CL/cl.h:20:0,
                 from ..\OpenCL-SDK\install\include/CL/opencl.h:24,
                 from .\test.cpp:1:
..\OpenCL-SDK\install\include/CL/cl_version.h:22:104: note: #pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 300 (OpenCL 3.0)
 #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 300 (OpenCL 3.0)")

Can someone please help me understand how to deal with this ecosystem?

NOTE: I am new to cmake, vcpkg, and other c/c++ dev tools.

0 Comments

2024/12/23
06:28 UTC

Issues compiling OpenCL with clang (stdio.h compiling error)

Hi everyone. I need to compile some benchmark from rodinia to obtain the .hsaco to simulate them in mgpusim. I know I have to use the -mcpu=gfx803 to obtain the hsaco for the architecture that is simulated otherwise this wouldn't work.

I use a docker rocm with version 4.0.1 and ubuntu 20.04. I am having compiling errors with stdio.h and cl.h

I don't know what to do. here is some helpful information about the errors and version of the package that contains stdio.h and cl.h

/usr/include/stdio.h:137:14: error: extern variable must reside in constant address space

extern FILE *stdin; /* Standard input stream. */

^

/usr/include/stdio.h:138:14: error: extern variable must reside in constant address space

extern FILE *stdout; /* Standard output stream. *

/usr/include/CL/cl.h:919:35: error: pointers to functions are not allowed

void (CL_CALLBACK * pfn_notify)(const char * errinfo,

there are more errors but i post this to get context. The packages are

root@df8345cad8fe:/home/gpu-rodinia/opencl/streamcluster# dpkg -S /usr/include/stdio.h

libc6-dev:amd64: /usr/include/stdio.h

root@df8345cad8fe:/home/gpu-rodinia/opencl/streamcluster# dpkg -S /usr/include/CL/cl.h

opencl-c-headers: /usr/include/CL/cl.h

If anyone can help me or provide me further information, I have never worked with openCL

0 Comments

2024/12/18
13:04 UTC

OpenCL 2.0 vs Vulkan 1.3.260 for GPGPU programming?

Hello everyone! I am building a neural network from scratch in C++ and was wondering which of the two would best tackle the task?

My computer is far from being considered a beast in computing/graphics power, so I would like to get the highest performance out of it. I have some experience with writing a 3D graphics renderer with Vulkan, so I am aware that the coding overhead sucks, but that is not a problem. I am shooting to get the most performance out of my program, so that is not playing a factor in my decision.

Some additional information about my driver specs:

OpenCL API version 2.0
OpenCL Driver version 31.0.21921.1000
Vulkan API version 1.3.260
Vulkan Driver version 2.0.279

2 Comments

2024/11/28
20:40 UTC

IWOCL Announces Keynote Speaker

IWOCL has announced their first keynote speaker for IWOCL 2025 will be Dr. Moritz Lehmann from Intel who will discuss his experience scaling up FluidX3D CFD beyond 100 Billion cells - a story about the true cross-compatibility of OpenCL. Learn more at https://iwocl.org.

0 Comments

2024/11/25
17:06 UTC

How to get OpenCL on AMD

I tried using Intel OpenCL Runtime and https://github.com/ptrumpis/OpenCL-AMD-GPU, no success.

Windows 11 Pro, 23H2, 22631.4391, Windows Feature Experience Pack 1000.22700.1047.0

AMD Ryzen Threadripper 7960X 24-Cores

AMD Radeon RX 7800 XT Driver Version: 32.0.12019.1028

https://preview.redd.it/pvw9aadm4i2e1.jpg?width=610&format=pjpg&auto=webp&s=6561decb20cb47a2c6a72507d758dd4293918e87

11 Comments

2024/11/22
18:46 UTC

Tips for troubleshooting memory copy speed?

I’m trying to figure out how to optimize my opencl project; I’m currently heavily bottlenecked by buffer I/O. My data is about 80MB at max. I’ve preallocated the buffers which helped a lot, but reading out the result is taking over 100ms, which is really throttling the throughput of the whole pipeline. Any tips on where to look to improve this, either hw or sw wise?

5 Comments

2024/11/09
20:12 UTC

CUDA/GLSL functions for OpenCL

Is there a guide of how some CUDA/GLSL functions map to equivalent OpenCL functions?

I am in particular interested in synchornization (__syncthreads(), __syncwarp(), __threadfence()) and subgroup functions (__ballot(), __shfl(), __shfl_xor()).

2 Comments

2024/10/15
11:28 UTC

can somebody help me fix this error

hi, im trying to get this deepfacelive program running but i get some kind of opencl error. i have the opencl runtime installer, the sdk, i have the nividia drivers i dont get whats going on if somebody know how to fix this

Running DeepFaceLive.

Traceback (most recent call last):

File "_internal\DeepFaceLive\main.py", line 104, in <module>

main()

File "_internal\DeepFaceLive\main.py", line 97, in main

args.func(args)

File "_internal\DeepFaceLive\main.py", line 39, in run_DeepFaceLive

from apps.DeepFaceLive.DeepFaceLiveApp import DeepFaceLiveApp

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\apps\DeepFaceLive\DeepFaceLiveApp.py", line 11, in <module>

from . import backend

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\apps\DeepFaceLive\backend\__init__.py", line 9, in <module>

from .FaceMerger import FaceMerger

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\apps\DeepFaceLive\backend\FaceMerger.py", line 6, in <module>

from xlib import avecl as lib_cl

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\__init__.py", line 16, in <module>

from xlib.avecl._internal.initializer.InitConst import InitConst

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\initializer\__init__.py", line 1, in <module>

from .InitConst import InitConst

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\initializer\InitConst.py", line 1, in <module>

from ..backend import Kernel

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\__init__.py", line 1, in <module>

from .Buffer import Buffer

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\Buffer.py", line 5, in <module>

from . import OpenCL as CL

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\OpenCL\__init__.py", line 4, in <module>

from .OpenCL import (CL_DEVICE_EXTENSIONS, CL_DEVICE_GLOBAL_MEM_SIZE,

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\OpenCL\OpenCL.py", line 202, in <module>

u/dll_import('OpenCL')

File "C:\Users\maxim\Downloads\dfl\DeepFaceLive_NVIDIA\_internal\DeepFaceLive\xlib\avecl\_internal\backend\OpenCL\OpenCL.py", line 18, in dll_import

raise RuntimeError(f'Unable to load {dll_name} library.')

RuntimeError: Unable to load OpenCL library.

Press any key to continue . . .

5 Comments

2024/10/04
18:11 UTC

Using OpenCL which can calculate matrix multiplication in parallel.

Install OpenCL in your computer and test

whether your installed copy works normally. 2. Write an example program using OpenCL which can calculate matrix multiplication in parallel. can anypne help me in this case?I'm using arch linux kde plasma my laptop configuration is: Ryzen 5 3500U Vega 8 iGPU

I install opencl-amd And when i run the C(language) code my laptops display start blinking, and sometimes showing gpu hanged and sometimes just blinking (black and again visible)

2 Comments

2024/09/24
17:44 UTC

OpenCL is great!

This is just an appreciation post for OpenCL. It's great. The only other performance portable API that comes close is KernelAbstractions.jl.

OpenCL is just so good:

Kernels are compiled at runtime, which means you can do whatever "metaprogramming" you want to the kernel strings before compilation. I understand this feature is a double-edged sword because error checking is sometimes a pain, but it genuinely makes certain workflows possible where they otherwise would not be (or would otherwise be a huge hassle in CUDA).
The JIT compiler is blazingly fast, at least from my personal tests. So much faster than GLSLangValidator, which is the only other tool I can use to compile my kernels at runtime. I actually have an OpenCL game engine mostly working and the benchmarks are really promising especially because the users never feel the Vulkan precompile times before the game starts.
Performance is great. I've seem benchmarks showing that OpenCL gets within 90% of CUDA performance, but from my own use-cases, the performance is near identical.
It works on my CPU. This is actually a great feature. I can do all my debugging on multiple devices to make sure my issues are not GPU-specific problems.
OpenCL lets users write actual kernels. A lot of performance portable solutions try to take serial code and transform it into GPU kernels (with some sort of parallel_for or something). I've just never found that to feel natural in practice. When you are writing code for GPUs, kernels are just so much easier to me.

There's just so much to love.

I do 100% understand that there's some jank, but to be honest, it's been way easier for me to use OpenCL than other GPU solutions for my specific problems. It's even easier than CUDA, which is a big accomplishment. KernelAbstractions.jl is also really nice and offers many similar advantages, but for my specific work-case, I found OpenCL to be better.

I mean, it's 2024. To me, the only things I need my programming language to do are GPU Computing and Metaprogramming. OpenCL does both really well.

I have seen so many people hating on OpenCL over the years and don't fully understand why. It's great.

16 Comments

2024/08/29
16:30 UTC

Need Optimization For Code

im working on bitcoin puzzle mining project but my code throwing runtime error: OUT_OF_RESOURCES
everything looks fine:

pyopencl:

import 
pyopencl
 as 
cl
from 
hashlib
 import sha256
import 
base58
import 
os
import 
numpy
 as 
np
import 
warnings

warnings
.filterwarnings("ignore", 
category
=
UserWarning
, 
module
='pyopencl')

def
 get_kernel(
cl
):
    return open(
cl
,"r").read()

def
 privToWIF(
private_key
):
    # Step 1: Add version byte (0x80 for mainnet)
    versioned_key = '80' + 
private_key
    
    # Step 3: Double SHA-256 hash
    first_sha = sha256(
bytes
.fromhex(versioned_key)).hexdigest()
    second_sha = sha256(
bytes
.fromhex(first_sha)).hexdigest()
    
    # Step 4: Add first 4 bytes of the second SHA-256 hash as checksum
    checksum = second_sha[:8]
    final_key = versioned_key + checksum
    
    # Step 5: Encode in Base58Check
    compressed_private_key = 
base58
.b58encode(
bytes
.fromhex(final_key)).decode('utf-8')
    
    return compressed_private_key

def
 compress_private_key(
private_key_hex
):
    # Step 1: Add version byte (0x80 for mainnet)
    versioned_key = '80' + 
private_key_hex
    
    # Step 2: Add compression byte (0x01)
    compressed_key = versioned_key + '01'
    
    # Step 3: Double SHA-256 hash
    try:
        sha = sha256(
bytes
.fromhex(sha256(
bytes
.fromhex(compressed_key)).hexdigest())).hexdigest()
    except 
Exception
 as e:
        print(
private_key_hex
)
        with open(
os
.path.join(
os
.path.expanduser("~"), "Desktop\\privkey.txt"),"a") as f:
            f.write("\n" + 
f
"Bozuk hex: {
private_key_hex
} \n Compressli Bozuk Hex: {compress_private_key(
private_key_hex
)}\n")
        return compress_private_key(
private_key_hex
)
    # Step 4: Add first 4 bytes of the second SHA-256 hash as checksum
    checksum = sha[:8]
    final_key = compressed_key + checksum
    
    # Step 5: Encode in Base58Check
    compressed_private_key = 
base58
.b58encode(
bytes
.fromhex(final_key)).decode('utf-8')
    
    return compressed_private_key


def
 main():
    global result_string

    for platform in 
cl
.get_platforms():
        devices = platform.get_devices(
cl
.device_type.GPU)
        for device in devices:
            context = 
cl
.Context([device])
            queue = 
cl
.CommandQueue(context)

            sizeof_output = 4096 # 256 bit hex int length
            result = 
np
.zeros(sizeof_output, 
dtype
=
np
.uint8)
            result_buffer = 
cl
.Buffer(context, 
cl
.mem_flags.WRITE_ONLY, result.nbytes)

            program = 
cl
.
Program
(context, get_kernel("kernel.cl")).build()
            kernel = program.main_src

            kernel.set_args(result_buffer)

            global_work_size = (sizeof_output,)
            
cl
.enqueue_nd_range_kernel(queue, kernel, global_work_size, None)

            
cl
.enqueue_copy(queue, result, result_buffer)
            queue.finish()

            result_string = ''.join(chr(x) for x in result)
            print(result_string)

    wif = privToWIF(result_string)
    print("\n" + 
f
"Private Key: {result_string}" + "\n" + 
f
"WIF Private Key: {wif}"+"\n" + 
f
"WIF COMPRESSED Private Key:{compress_private_key(result_string)}" + "\n")
    if True:
        with open(
os
.path.join(
os
.path.expanduser("~"), "Desktop\\privkey.txt"),"a") as f:
            f.write("\n" + 
f
"Private Key: {result_string}" + "\n" + 
f
"WIF Private Key: {wif}"+"\n" + 
f
"WIF COMPRESSED Private Key:{compress_private_key(result_string)}" + "\n")



if __name__ == "__main__":
    
os
.system("cls")
    main()

bigint library:

#include "bigint.clh"

int len(__global char *str)
{
    int i = 0;
    while (str[i] != '\0')
    {
        i++;
    }
    return i;
}

void int_to_bytes(__global uint *num, __global uchar *byte_array)
{
    uint value = num[0];
    for (int i = 0; i < 4; i++)
    {
        byte_array[3 - i] = (uchar)(value & 0xFF);
        value >>= 8;
    }
}

void bytes_to_int(__global uchar *byte_array, __global uint *result)
{
    uint value = 0;
    for (int i = 0; i < 4; i++)
    {
        value = (value << 8) | byte_array[i];
    }
    result[0] = value;
}

void concate_str(__global char *a, char *b, __global char *result)
{
    int gid = get_global_id(0);

    // Calculate the lengths of the input strings
    int lena = 0;
    while (a[lena] != '\0')
    {
        lena++;
    }

    int lenb = 0;
    while (b[lenb] != '\0')
    {
        lenb++;
    }

    // Concatenate the strings
    for (int i = 0; i < lena; i++)
    {
        result[i] = a[i];
    }

    for (int i = 0; i < lenb; i++)
    {
        result[lena + i] = b[i];
    }

    // Null-terminate the result string
    result[lena + lenb] = '\0';
}
void concate_str_ng(__constant char *a, char *b, __global char *result)
{
    int gid = get_global_id(0);

    // Calculate the lengths of the input strings
    int lena = 0;
    while (a[lena] != '\0')
    {
        lena++;
    }

    int lenb = 0;
    while (b[lenb] != '\0')
    {
        lenb++;
    }

    // Concatenate the strings
    for (int i = 0; i < lena; i++)
    {
        result[i] = a[i];
    }

    for (int i = 0; i < lenb; i++)
    {
        result[lena + i] = b[i];
    }

    // Null-terminate the result string
    result[lena + lenb] = '\0';
}

__constant uint k[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
    0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
    0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
    0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
    0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
    0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

__constant uint initial_hash[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372,
                                   0xa54ff53a, 0x510e527f, 0x9b05688c,
                                   0x1f83d9ab, 0x5be0cd19};

uint rotate_sha(uint x, uint n) { return (x >> n) | (x << (32 - n)); }

void sha256(__global uchar *in, __global uchar *out, uint length)
{
    uint h[8];
    for (int i = 0; i < 8; i++)
    {
        h[i] = initial_hash[i];
    }

    // Process each 512-bit chunk
    for (int chunk = 0; chunk < length; chunk += 64)
    {
        uint w[64];
        for (int i = 0; i < 16; i++)
        {
            w[i] = (in[chunk + 4 * i] << 24) | (in[chunk + 4 * i + 1] << 16) |
                   (in[chunk + 4 * i + 2] << 8) | in[chunk + 4 * i + 3];
        }
        for (int i = 16; i < 64; i++)
        {
            uint s0 = (rotate_sha(w[i - 15], 7) ^ rotate_sha(w[i - 15], 18) ^
                       (w[i - 15] >> 3));
            uint s1 = (rotate_sha(w[i - 2], 17) ^ rotate_sha(w[i - 2], 19) ^
                       (w[i - 2] >> 10));
            w[i] = w[i - 16] + s0 + w[i - 7] + s1;
        }

        uint a = h[0];
        uint b = h[1];
        uint c = h[2];
        uint d = h[3];
        uint e = h[4];
        uint f = h[5];
        uint g = h[6];
        uint hh = h[7];

        for (int i = 0; i < 64; i++)
        {
            uint S1 = (rotate_sha(e, 6) ^ rotate_sha(e, 11) ^ rotate_sha(e, 25));
            uint ch = ((e & f) ^ (~e & g));
            uint temp1 = hh + S1 + ch + k[i] + w[i];
            uint S0 = (rotate_sha(a, 2) ^ rotate_sha(a, 13) ^ rotate_sha(a, 22));
            uint maj = ((a & b) ^ (a & c) ^ (b & c));
            uint temp2 = S0 + maj;

            hh = g;
            g = f;
            f = e;
            e = d + temp1;
            d = c;
            c = b;
            b = a;
            a = temp1 + temp2;
        }

        h[0] += a;
        h[1] += b;
        h[2] += c;
        h[3] += d;
        h[4] += e;
        h[5] += f;
        h[6] += g;
        h[7] += hh;
    }

    for (int i = 0; i < 8; i++)
    {
        out[4 * i] = (h[i] >> 24) & 0xff;
        out[4 * i + 1] = (h[i] >> 16) & 0xff;
        out[4 * i + 2] = (h[i] >> 8) & 0xff;
        out[4 * i + 3] = h[i] & 0xff;
    }
}

__constant char base58_alphabet[] =
    "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

void base58_encode(__global uchar *input, uint input_len,
                   __global char *output)
{
    uint i, j;
    uint carry;
    const uint max_output_len = 256; // Adjust this size based on your needs
    uchar temp[max_output_len];

    // Initialize the temp array
    for (i = 0; i < max_output_len; i++)
    {
        temp[i] = 0;
    }

    for (i = 0; i < input_len; i++)
    {
        carry = input[i];
        for (j = max_output_len; j > 0; j--)
        {
            carry += (uint)(temp[j - 1]) << 8;
            temp[j - 1] = carry % 58;
            carry /= 58;
        }
    }

    // Skip leading zeros in temp
    i = 0;
    while (i < max_output_len && temp[i] == 0)
    {
        i++;
    }

    // Translate the digits to the base58 alphabet
    j = 0;
    while (i < max_output_len)
    {
        output[j++] = base58_alphabet[temp[i++]];
    }

    for (i = 0; i < input_len && input[i] == 0; i++)
    {
        output[j++] = '1';
    }

    // Null-terminate the output string
    output[j] = '\0';
}

void create_versioned_key(__global const char *privkey,
                          __global char *versioned)
{
    // Define the prefix and suffix
    const char prefix[] = "80";
    const char suffix[] = "01";

    // Calculate the lengths
    uint prefix_len = 2; // Length of "80"
    uint privkey_len = 0;
    uint suffix_len = 2; // Length of "01"

    // Calculate the length of the private key
    while (privkey[privkey_len] != '\0')
    {
        privkey_len++;
    }

    // Construct the versioned key
    uint i = 0;

    // Add the prefix
    for (i = 0; i < prefix_len; i++)
    {
        versioned[i] = prefix[i];
    }

    for (uint j = 0; j < privkey_len; j++, i++)
    {
        versioned[i] = privkey[j];
    }

    for (uint j = 0; j < suffix_len; j++, i++)
    {
        versioned[i] = suffix[j];
    }

    // Null-terminate the versioned key
    versioned[i] = '\0';
}
void compress_private_key(__global char *privkey, __global char *out)
{
    int id =
        get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id

    // __global const char *versioned = "80" + privkey + "01";
    __global char *versioned;
    create_versioned_key(privkey, versioned);

    __global uchar *versioned_b;
    stringToByteArray(versioned, versioned_b, len(versioned));

    __global uchar *firsh_sha;
    sha256(versioned_b, firsh_sha, len(versioned));

    __global uchar *second_sha;
    sha256(firsh_sha, second_sha, sizeof(firsh_sha) / sizeof(firsh_sha[0]));

    char checksum[8];
    checksum[id % 8] = second_sha[id % 8];

    __global char *final_key;
    concate_str(versioned, checksum, final_key);

    __global uchar *final_key_b;
    stringToByteArray(final_key, final_key_b, len(final_key));

    __global char *compressed_key;
    base58_encode(final_key_b, len(final_key), compressed_key);

    *out = *compressed_key;
}

void random_generator(__global const 
BigInt
 *start, __global const 
BigInt
 *end,
                      __global 
BigInt
 *output)
{
    int gid = get_global_id(0);

    // Seed based on global ID and current time
uint seed = gid + (uint)get_global_id(0) + (uint)get_global_size(0);

    // Linear Congruential Generator (LCG) parameters
    uint a = 1664525;
    uint c = 1013904223;
    uint m = 0xFFFFFFFF;

    // Initialize state with seed
    uint state = seed;

    // Generate random number
    state = (a * state + c) % m;

    // Map the random number to the range [start, end]
    BigInt range;
    init_bigint_ng(&range);

    // Calculate range = end - start
    for (int i = 0; i < MAX_DIGITS; i++)
    {
        range.digits[i] = end[gid].digits[i] - start[gid].digits[i];
    }
    range.length = end[gid].length;

    // Calculate output = start + (state % range)
    for (int i = 0; i < MAX_DIGITS; i++)
    {
        output[gid].digits[i] = start[gid].digits[i] + (state % range.digits[i]);
    }
    output[gid].length = start[gid].length;
}

__constant uint K[5] = {0x00000000, 0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xA953FD4E};
__constant uint KK[5] = {0x50A28BE6, 0x5C4DD124, 0x6D703EF3, 0x7A6D76E9, 0x00000000};

void ripemd160_hash(__global const uchar *input, __global uint *output, uint length)
{
    uint h0 = 0x67452301;
    uint h1 = 0xEFCDAB89;
    uint h2 = 0x98BADCFE;
    uint h3 = 0x10325476;
    uint h4 = 0xC3D2E1F0;

    uint A, B, C, D, E, AA, BB, CC, DD, EE, T;
    uint X[16];

    for (uint i = 0; i < length; i += 64)
    {
        for (uint j = 0; j < 16; j++)
        {
            X[j] = ((uint)input[i + 4 * j]) | (((uint)input[i + 4 * j + 1]) << 8) |
                   (((uint)input[i + 4 * j + 2]) << 16) |
                   (((uint)input[i + 4 * j + 3]) << 24);
        }

        A = AA = h0;
        B = BB = h1;
        C = CC = h2;
        D = DD = h3;
        E = EE = h4;

        // Main loop
        for (uint j = 0; j < 80; j++)
        {
            uint F = (B ^ C ^ D);
            uint Kt = K[j / 16];
            uint s = (j % 16);
            T = A + F + X[s] + Kt;
            A = E;
            E = D;
            D = (C << 10) | (C >> (32 - 10));
            C = B;
            B = (T << s) | (T >> (32 - s));

            uint tempAA = AA;
            uint tempBB = BB;
            uint tempCC = CC;
            uint tempDD = DD;
            uint tempEE = EE;

            F = (tempBB ^ (tempCC | ~tempDD));
            Kt = KK[j / 16];
            s = (j % 16);
            T = tempAA + F + X[s] + Kt;
            AA = tempEE;
            EE = tempDD;
            DD = (tempCC << 10) | (tempCC >> (32 - 10));
            CC = tempBB;
            BB = (T << s) | (T >> (32 - s));
        }

        T = h1 + C + DD;
        h1 = h2 + D + EE;
        h2 = h3 + E + AA;
        h3 = h4 + A + BB;
        h4 = h0 + B + CC;
        h0 = T;
    }

    output[0] = h0;
    output[1] = h1;
    output[2] = h2;
    output[3] = h3;
    output[4] = h4;
}

const char *privkeyToAddr(__global char *privkey_c)
{
    int id =
        get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id

    __global BigInt *a;
    string_to_bigint(privkey_c, a);

    __global uchar *byte_array_a;
    bigint_to_byte_array(a, byte_array_a);

    int len_byte_array_a = sizeof(byte_array_a) / sizeof(byte_array_a[0]);
    __global uchar *_x;
    __global uchar *_y;

    if (id < (len_byte_array_a / 2))
    {
        _x[id] = byte_array_a[id];
    }
    else if (id < len_byte_array_a)
    {
        _y[id] = byte_array_a[id];
    }

    __global BigInt *x;
    __global BigInt *y;

    byteArrayToBigInt(_x, len_byte_array_a / 2, x);
    byteArrayToBigInt(_y, len_byte_array_a / 2, y);

    __global BigInt *x_cube;
    __global BigInt *seven;
    __global BigInt *y_square;
    __global BigInt *x_cube_p_seven;
    __global BigInt *x_cube_p_seven_m_y_square;
    __global BigInt *p;
    __global BigInt *public_k;

    power_bigint(x, 3, x_cube);
    int_to_bigint(7, seven);
    add_bigint(x_cube, seven, x_cube_p_seven);
    power_bigint(y, 2, y_square);
    subtract_bigint(x_cube_p_seven, y_square, x_cube_p_seven_m_y_square);
    init_bigint_from_string("115792089237316195423570985008687907853269984665640564039457584007908834671663",
                            p);
    bigint_mod(x_cube_p_seven_m_y_square, p, public_k);

    __global uchar *public_k_b;
    __global uchar *public_k_h;
    __global uint *public_k_rh;
    __global uchar *public_k_rh_b;
    __global char *public_k_rs;
    bigint_to_byte_array(public_k, public_k_b);
    sha256(public_k_b, public_k_h, sizeof(public_k_b) / sizeof(public_k_b[0]));
    ripemd160_hash(public_k_h, public_k_rh,
                   sizeof(public_k_h) / sizeof(public_k_h[0]));

    int_to_bytes(public_k_rh, public_k_rh_b);

    byteArrayToString(public_k_rh_b, public_k_rs,
                      sizeof(public_k_rh_b) / sizeof(public_k_rh_b[0]));

    __global char *network_byte;
    concate_str_ng("00", public_k_rs, network_byte);
    __global uchar *network_byte_b;
    stringToByteArray(network_byte, network_byte_b, len(network_byte));
    __global uchar *sha1;
    __global uchar *sha2;
    sha256(network_byte_b, sha1,
           sizeof(network_byte_b) / sizeof(network_byte_b[0]));
    sha256(sha1, sha2, sizeof(sha1) / sizeof(sha1[0]));
    char checksum[9];
    __global char *sha2_hex;
    byteArrayToHexString(sha2, sha2_hex, sizeof(sha2) / sizeof(sha2[0]));
    if (id < 8)
        checksum[id] = sha2_hex[id];
    __global char *net_checksum;
    concate_str(network_byte, checksum, net_checksum);
    __global BigInt *net_checksum_B;
    string_to_bigint(net_checksum, net_checksum_B);
    __global uchar *net_checksum_b;
    bigint_to_byte_array(net_checksum_B, net_checksum_b);
    __global char *addr;
    base58_encode(net_checksum_b,
                  sizeof(net_checksum_b) / sizeof(net_checksum_b[0]), addr);
    return addr;
}

__global char *randomGenPrivKey(__global BigInt *start, __global BigInt *end,
                                __global char *wallet_addr)
{
    __global BigInt *a;
    __global char *b;
    while (1)
    {
        random_generator(start, end, a);
        bigint_to_hex_string(a, b);

        while (len(b) < 64)
        {
            if (len(b) < 64)
            {
                *b = '0' + *b;
            }
        }
        __global char *compressed_privkey;
        compress_private_key(b, compressed_privkey);

        if (wallet_addr == privkeyToAddr(compressed_privkey))
        {
            // return (*b);
            return b; // return to host
        }
    }
}

// support biginteger chatgpt: would you write a big integer struct in opencl C
// after ask what you want

__kernel void main_src(__global uchar *result) // strlen : 64
{
    __global BigInt *start;
    __global BigInt *end;
    __global char *Addr;
    __global BigInt *base;
    init_bigint_from_string("2", base);
    power_bigint(base, 65, start);
    power_bigint(base, 66, end);

    __global char *result_real = randomGenPrivKey(start, end, Addr);

    // int id = get_global_id(0);
    int id =
        get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id
    result[id] = result_real[id];                                 // change char by char for errors.
}

// DEFINITION OF: ALGORITHM

/*
    OBJECTIVE: generate Random Big number and generate BTC addr from it

    NEEDED: Big integer API, secp256k1, ripemd160, SHA256, True Random API

    CORE NEEDED: Big Integer, True Random

*/

finally, kernel file:

#include "bigint.clh"

int len(__global char *str)
{
    int i = 0;
    while (str[i] != '\0')
    {
        i++;
    }
    return i;
}

void int_to_bytes(__global uint *num, __global uchar *byte_array)
{
    uint value = num[0];
    for (int i = 0; i < 4; i++)
    {
        byte_array[3 - i] = (uchar)(value & 0xFF);
        value >>= 8;
    }
}

void bytes_to_int(__global uchar *byte_array, __global uint *result)
{
    uint value = 0;
    for (int i = 0; i < 4; i++)
    {
        value = (value << 8) | byte_array[i];
    }
    result[0] = value;
}

void concate_str(__global char *a, char *b, __global char *result)
{
    int gid = get_global_id(0);

    // Calculate the lengths of the input strings
    int lena = 0;
    while (a[lena] != '\0')
    {
        lena++;
    }

    int lenb = 0;
    while (b[lenb] != '\0')
    {
        lenb++;
    }

    // Concatenate the strings
    for (int i = 0; i < lena; i++)
    {
        result[i] = a[i];
    }

    for (int i = 0; i < lenb; i++)
    {
        result[lena + i] = b[i];
    }

    // Null-terminate the result string
    result[lena + lenb] = '\0';
}
void concate_str_ng(__constant char *a, char *b, __global char *result)
{
    int gid = get_global_id(0);

    // Calculate the lengths of the input strings
    int lena = 0;
    while (a[lena] != '\0')
    {
        lena++;
    }

    int lenb = 0;
    while (b[lenb] != '\0')
    {
        lenb++;
    }

    // Concatenate the strings
    for (int i = 0; i < lena; i++)
    {
        result[i] = a[i];
    }

    for (int i = 0; i < lenb; i++)
    {
        result[lena + i] = b[i];
    }

    // Null-terminate the result string
    result[lena + lenb] = '\0';
}

__constant uint k[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
    0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
    0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
    0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
    0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
    0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

__constant uint initial_hash[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372,
                                   0xa54ff53a, 0x510e527f, 0x9b05688c,
                                   0x1f83d9ab, 0x5be0cd19};

uint rotate_sha(uint x, uint n) { return (x >> n) | (x << (32 - n)); }

void sha256(__global uchar *in, __global uchar *out, uint length)
{
    uint h[8];
    for (int i = 0; i < 8; i++)
    {
        h[i] = initial_hash[i];
    }

    // Process each 512-bit chunk
    for (int chunk = 0; chunk < length; chunk += 64)
    {
        uint w[64];
        for (int i = 0; i < 16; i++)
        {
            w[i] = (in[chunk + 4 * i] << 24) | (in[chunk + 4 * i + 1] << 16) |
                   (in[chunk + 4 * i + 2] << 8) | in[chunk + 4 * i + 3];
        }
        for (int i = 16; i < 64; i++)
        {
            uint s0 = (rotate_sha(w[i - 15], 7) ^ rotate_sha(w[i - 15], 18) ^
                       (w[i - 15] >> 3));
            uint s1 = (rotate_sha(w[i - 2], 17) ^ rotate_sha(w[i - 2], 19) ^
                       (w[i - 2] >> 10));
            w[i] = w[i - 16] + s0 + w[i - 7] + s1;
        }

        uint a = h[0];
        uint b = h[1];
        uint c = h[2];
        uint d = h[3];
        uint e = h[4];
        uint f = h[5];
        uint g = h[6];
        uint hh = h[7];

        for (int i = 0; i < 64; i++)
        {
            uint S1 = (rotate_sha(e, 6) ^ rotate_sha(e, 11) ^ rotate_sha(e, 25));
            uint ch = ((e & f) ^ (~e & g));
            uint temp1 = hh + S1 + ch + k[i] + w[i];
            uint S0 = (rotate_sha(a, 2) ^ rotate_sha(a, 13) ^ rotate_sha(a, 22));
            uint maj = ((a & b) ^ (a & c) ^ (b & c));
            uint temp2 = S0 + maj;

            hh = g;
            g = f;
            f = e;
            e = d + temp1;
            d = c;
            c = b;
            b = a;
            a = temp1 + temp2;
        }

        h[0] += a;
        h[1] += b;
        h[2] += c;
        h[3] += d;
        h[4] += e;
        h[5] += f;
        h[6] += g;
        h[7] += hh;
    }

    for (int i = 0; i < 8; i++)
    {
        out[4 * i] = (h[i] >> 24) & 0xff;
        out[4 * i + 1] = (h[i] >> 16) & 0xff;
        out[4 * i + 2] = (h[i] >> 8) & 0xff;
        out[4 * i + 3] = h[i] & 0xff;
    }
}

__constant char base58_alphabet[] =
    "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

void base58_encode(__global uchar *input, uint input_len,
                   __global char *output)
{
    uint i, j;
    uint carry;
    const uint max_output_len = 256; // Adjust this size based on your needs
    uchar temp[max_output_len];

    // Initialize the temp array
    for (i = 0; i < max_output_len; i++)
    {
        temp[i] = 0;
    }

    for (i = 0; i < input_len; i++)
    {
        carry = input[i];
        for (j = max_output_len; j > 0; j--)
        {
            carry += (uint)(temp[j - 1]) << 8;
            temp[j - 1] = carry % 58;
            carry /= 58;
        }
    }

    // Skip leading zeros in temp
    i = 0;
    while (i < max_output_len && temp[i] == 0)
    {
        i++;
    }

    // Translate the digits to the base58 alphabet
    j = 0;
    while (i < max_output_len)
    {
        output[j++] = base58_alphabet[temp[i++]];
    }

    for (i = 0; i < input_len && input[i] == 0; i++)
    {
        output[j++] = '1';
    }

    // Null-terminate the output string
    output[j] = '\0';
}

void create_versioned_key(__global const char *privkey,
                          __global char *versioned)
{
    // Define the prefix and suffix
    const char prefix[] = "80";
    const char suffix[] = "01";

    // Calculate the lengths
    uint prefix_len = 2; // Length of "80"
    uint privkey_len = 0;
    uint suffix_len = 2; // Length of "01"

    // Calculate the length of the private key
    while (privkey[privkey_len] != '\0')
    {
        privkey_len++;
    }

    // Construct the versioned key
    uint i = 0;

    // Add the prefix
    for (i = 0; i < prefix_len; i++)
    {
        versioned[i] = prefix[i];
    }

    for (uint j = 0; j < privkey_len; j++, i++)
    {
        versioned[i] = privkey[j];
    }

    for (uint j = 0; j < suffix_len; j++, i++)
    {
        versioned[i] = suffix[j];
    }

    // Null-terminate the versioned key
    versioned[i] = '\0';
}
void compress_private_key(__global char *privkey, __global char *out)
{
    int id =
        get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id

    // __global const char *versioned = "80" + privkey + "01";
    __global char *versioned;
    create_versioned_key(privkey, versioned);

    __global uchar *versioned_b;
    stringToByteArray(versioned, versioned_b, len(versioned));

    __global uchar *firsh_sha;
    sha256(versioned_b, firsh_sha, len(versioned));

    __global uchar *second_sha;
    sha256(firsh_sha, second_sha, sizeof(firsh_sha) / sizeof(firsh_sha[0]));

    char checksum[8];
    checksum[id % 8] = second_sha[id % 8];

    __global char *final_key;
    concate_str(versioned, checksum, final_key);

    __global uchar *final_key_b;
    stringToByteArray(final_key, final_key_b, len(final_key));

    __global char *compressed_key;
    base58_encode(final_key_b, len(final_key), compressed_key);

    *out = *compressed_key;
}

void random_generator(__global const 
BigInt
 *start, __global const 
BigInt
 *end,
                      __global 
BigInt
 *output)
{
    int gid = get_global_id(0);

    // Seed based on global ID and current time
uint seed = gid + (uint)get_global_id(0) + (uint)get_global_size(0);

    // Linear Congruential Generator (LCG) parameters
    uint a = 1664525;
    uint c = 1013904223;
    uint m = 0xFFFFFFFF;

    // Initialize state with seed
    uint state = seed;

    // Generate random number
    state = (a * state + c) % m;

    // Map the random number to the range [start, end]
    BigInt range;
    init_bigint_ng(&range);

    // Calculate range = end - start
    for (int i = 0; i < MAX_DIGITS; i++)
    {
        range.digits[i] = end[gid].digits[i] - start[gid].digits[i];
    }
    range.length = end[gid].length;

    // Calculate output = start + (state % range)
    for (int i = 0; i < MAX_DIGITS; i++)
    {
        output[gid].digits[i] = start[gid].digits[i] + (state % range.digits[i]);
    }
    output[gid].length = start[gid].length;
}

__constant uint K[5] = {0x00000000, 0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xA953FD4E};
__constant uint KK[5] = {0x50A28BE6, 0x5C4DD124, 0x6D703EF3, 0x7A6D76E9, 0x00000000};

void ripemd160_hash(__global const uchar *input, __global uint *output, uint length)
{
    uint h0 = 0x67452301;
    uint h1 = 0xEFCDAB89;
    uint h2 = 0x98BADCFE;
    uint h3 = 0x10325476;
    uint h4 = 0xC3D2E1F0;

    uint A, B, C, D, E, AA, BB, CC, DD, EE, T;
    uint X[16];

    for (uint i = 0; i < length; i += 64)
    {
        for (uint j = 0; j < 16; j++)
        {
            X[j] = ((uint)input[i + 4 * j]) | (((uint)input[i + 4 * j + 1]) << 8) |
                   (((uint)input[i + 4 * j + 2]) << 16) |
                   (((uint)input[i + 4 * j + 3]) << 24);
        }

        A = AA = h0;
        B = BB = h1;
        C = CC = h2;
        D = DD = h3;
        E = EE = h4;

        // Main loop
        for (uint j = 0; j < 80; j++)
        {
            uint F = (B ^ C ^ D);
            uint Kt = K[j / 16];
            uint s = (j % 16);
            T = A + F + X[s] + Kt;
            A = E;
            E = D;
            D = (C << 10) | (C >> (32 - 10));
            C = B;
            B = (T << s) | (T >> (32 - s));

            uint tempAA = AA;
            uint tempBB = BB;
            uint tempCC = CC;
            uint tempDD = DD;
            uint tempEE = EE;

            F = (tempBB ^ (tempCC | ~tempDD));
            Kt = KK[j / 16];
            s = (j % 16);
            T = tempAA + F + X[s] + Kt;
            AA = tempEE;
            EE = tempDD;
            DD = (tempCC << 10) | (tempCC >> (32 - 10));
            CC = tempBB;
            BB = (T << s) | (T >> (32 - s));
        }

        T = h1 + C + DD;
        h1 = h2 + D + EE;
        h2 = h3 + E + AA;
        h3 = h4 + A + BB;
        h4 = h0 + B + CC;
        h0 = T;
    }

    output[0] = h0;
    output[1] = h1;
    output[2] = h2;
    output[3] = h3;
    output[4] = h4;
}

const char *privkeyToAddr(__global char *privkey_c)
{
    int id =
        get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id

    __global BigInt *a;
    string_to_bigint(privkey_c, a);

    __global uchar *byte_array_a;
    bigint_to_byte_array(a, byte_array_a);

    int len_byte_array_a = sizeof(byte_array_a) / sizeof(byte_array_a[0]);
    __global uchar *_x;
    __global uchar *_y;

    if (id < (len_byte_array_a / 2))
    {
        _x[id] = byte_array_a[id];
    }
    else if (id < len_byte_array_a)
    {
        _y[id] = byte_array_a[id];
    }

    __global BigInt *x;
    __global BigInt *y;

    byteArrayToBigInt(_x, len_byte_array_a / 2, x);
    byteArrayToBigInt(_y, len_byte_array_a / 2, y);

    __global BigInt *x_cube;
    __global BigInt *seven;
    __global BigInt *y_square;
    __global BigInt *x_cube_p_seven;
    __global BigInt *x_cube_p_seven_m_y_square;
    __global BigInt *p;
    __global BigInt *public_k;

    power_bigint(x, 3, x_cube);
    int_to_bigint(7, seven);
    add_bigint(x_cube, seven, x_cube_p_seven);
    power_bigint(y, 2, y_square);
    subtract_bigint(x_cube_p_seven, y_square, x_cube_p_seven_m_y_square);
    init_bigint_from_string("115792089237316195423570985008687907853269984665640564039457584007908834671663",
                            p);
    bigint_mod(x_cube_p_seven_m_y_square, p, public_k);

    __global uchar *public_k_b;
    __global uchar *public_k_h;
    __global uint *public_k_rh;
    __global uchar *public_k_rh_b;
    __global char *public_k_rs;
    bigint_to_byte_array(public_k, public_k_b);
    sha256(public_k_b, public_k_h, sizeof(public_k_b) / sizeof(public_k_b[0]));
    ripemd160_hash(public_k_h, public_k_rh,
                   sizeof(public_k_h) / sizeof(public_k_h[0]));

    int_to_bytes(public_k_rh, public_k_rh_b);

    byteArrayToString(public_k_rh_b, public_k_rs,
                      sizeof(public_k_rh_b) / sizeof(public_k_rh_b[0]));

    __global char *network_byte;
    concate_str_ng("00", public_k_rs, network_byte);
    __global uchar *network_byte_b;
    stringToByteArray(network_byte, network_byte_b, len(network_byte));
    __global uchar *sha1;
    __global uchar *sha2;
    sha256(network_byte_b, sha1,
           sizeof(network_byte_b) / sizeof(network_byte_b[0]));
    sha256(sha1, sha2, sizeof(sha1) / sizeof(sha1[0]));
    char checksum[9];
    __global char *sha2_hex;
    byteArrayToHexString(sha2, sha2_hex, sizeof(sha2) / sizeof(sha2[0]));
    if (id < 8)
        checksum[id] = sha2_hex[id];
    __global char *net_checksum;
    concate_str(network_byte, checksum, net_checksum);
    __global BigInt *net_checksum_B;
    string_to_bigint(net_checksum, net_checksum_B);
    __global uchar *net_checksum_b;
    bigint_to_byte_array(net_checksum_B, net_checksum_b);
    __global char *addr;
    base58_encode(net_checksum_b,
                  sizeof(net_checksum_b) / sizeof(net_checksum_b[0]), addr);
    return addr;
}

__global char *randomGenPrivKey(__global BigInt *start, __global BigInt *end,
                                __global char *wallet_addr)
{
    __global BigInt *a;
    __global char *b;
    while (1)
    {
        random_generator(start, end, a);
        bigint_to_hex_string(a, b);

        while (len(b) < 64)
        {
            if (len(b) < 64)
            {
                *b = '0' + *b;
            }
        }
        __global char *compressed_privkey;
        compress_private_key(b, compressed_privkey);

        if (wallet_addr == privkeyToAddr(compressed_privkey))
        {
            // return (*b);
            return b; // return to host
        }
    }
}

// support biginteger chatgpt: would you write a big integer struct in opencl C
// after ask what you want

__kernel void main_src(__global uchar *result) // strlen : 64
{
    __global BigInt *start;
    __global BigInt *end;
    __global char *Addr;
    __global BigInt *base;
    init_bigint_from_string("2", base);
    power_bigint(base, 65, start);
    power_bigint(base, 66, end);

    __global char *result_real = randomGenPrivKey(start, end, Addr);

    // int id = get_global_id(0);
    int id =
        get_global_id(1) * get_global_size(0) + get_global_id(0); // 1D global id
    result[id] = result_real[id];                                 // change char by char for errors.
}

// DEFINITION OF: ALGORITHM

/*
    OBJECTIVE: generate Random Big number and generate BTC addr from it

    NEEDED: Big integer API, secp256k1, ripemd160, SHA256, True Random API

    CORE NEEDED: Big Integer, True Random

*/

this looks fine i think i couldn't manage the cores and memory fine but it just needs optimization i think please let me know if you manage to how to run this code without any errors

Thanks.

0 Comments

2024/08/26
10:30 UTC

pyopencl crash/timeouts on AMD Radeon RX 7900 XT

Hi,

I am trying to run my pyopencl scripts on my desktop. It has a AMD Radeon RX 7900 XT.
Small test scripts work without problem, however when i run something more demanding, the progam never finishes and after 20sec my pc freezes and i have to force restart.

Same program works flawlessly on my Laptop.

I have the lastet version of my GPU driver installed and running Win11.

Anyone has had this issue or knows the cause?
Cheers

0 Comments

2024/08/22
12:31 UTC

Converting C to OpenCL C

I'm currently working on a project adding GPU functionality to the GNUAstro library(C99). However, one of the problems i've run into recently is that OpenCL does not have a simple way to utilise external libraries within the kernel.

Ideally, i'd like to be able to use certain parts of the library(written in C99) within the kernel, but OpenCL C has a ton of restrictions(no malloc/free, no standard header files, etc).

Therefore, simply #include 'ing the source code isn't enough, so i was wondering how feasible is it to either
a) Use preprocessor macro's to remove anything not compliant with OpenCL C while preserving functionality/replacing with other code
For example, if i have a function on host CPU(C99) as

int
gal_units_extract_decimal(char *convert, const char *delimiter,
                          double *args, size_t n)
{
  size_t i = 0;
  char *copy, *token, *end;

  /* Create a copy of the string to be parsed and parse it. This is because
     it will be modified during the parsing. */
  copy=strdup(convert);
  do
    {
      /* Check if the required number of arguments are passed. */
      if(i==n+1)
        {
          free(copy);
          error(0, 0, "%s: input '%s' exceeds maximum number of arguments "
                "(%zu)", __func__, convert, n);
          return 0;
        }

      /* Extract the substring till the next delimiter. */
      token=strtok(i==0?copy:NULL, delimiter);
      if(token)
        {
          /* Parse extracted string as a number, and check if it worked. */
          args[i++] = strtod (token, &end);
          if (*end && *end != *delimiter)
            {
              /* In case a warning is necessary
              error(0, 0, "%s: unable to parse element %zu in '%s'\n",
                    __func__, i, convert);
              */
              free(copy);
              return 0;
            }
        }
    }
  while(token && *token);
  free (copy);

  /* Check if the number of elements parsed. */
  if (i != n)
    {
      /* In case a warning is necessary
      error(0, 0, "%s: input '%s' must contain %lu numbers, but has "
            "%lu numbers\n", __func__, convert, n, i);
      */
      return 0;
    }

  /* Numbers are written, return successfully. */
  return 1;
}

then i would use it on the device by including it in a .cl file and applying macros like

#define free(x)

#define error(x)

to make it OpenCL C by removing the function calls

In this way, keeping only one major source file

b) Maintain a separate .cl file consisting of separate implementations for each function. Thereby keeping two files of source code, one in C99 and one in OpenCL C

Thoughts?

4 Comments

2024/08/19
15:40 UTC

Parallelisation of batch Hamming distance calculations (video frames).

I've got an application where I compute the Hamming distance between every combination of the elements of two arrays of 64 bit integers, and return those that fall below a threshold.

Each array represents a video of arbitrary length, and each element within it is a hash generated from a frame at given point within the video.

This process returns an array of truples, each being 1) index of frame in ref, 2) index of frame in comp, 3) similarity between the two hashes.

The code downstream of this can then identify sequences of similar images within two videos. It can be quite effective.

Here's the code I'm currently using (apologies for the quality, I'm a C novice).

unsigned long long * ref;
unsigned long long * comp;
unsigned long long x;
int i, j, c;
for ( i = 0; i < len_ref; i++ ) {
    for ( j = 0; j < len_comp; j++ ) {
        c = 0;
        x = ref[i] ^ comp[j];
        while ( x > 0 ) {
            c += x & 1;
            x >>= 1;
        }
        if ( c <= threshold ) {
            // push i, j & c to output array
        }
    }
}

It's relatively fast, but obviously, the more video you throw at it, the more burden it is to the CPU.

I was considering offloading this task to the GPU built into my fairly modern Intel processor.

I thought I'd ask here whether this task would be practical enough for me to learn enough OpenCL (from scratch) to be able to implement it?

I've found offloading some tasks to the GPU (using libavcodec) can take longer transferring to and from the GPU memory than just getting the job done in the CPU in the first place.

I'm currently uploading the first array (ref), then upload each second comparison array (comp) in turn.

If this all sounds a bit half-baked, it probably is. I'm just playing around with a hobby project. Thanks for indulging me.

Edit: just discovered __builtin_popcountll

2 Comments

2024/08/04
22:45 UTC

Initializing an array of structs in OpenCL

Disclaimer: I'm trying to learn OpenCL by doing, so there may be concepts that I did not study yet.

I have the following piece of code:

```

typedef struct{

int id;

int value;

} item;

typedef struct {

item items[MAX_N];

} collection;
```

Now, I want to initialize a collection with some default items for all the ids but, in regular C, I would need a malloc to do that.
How can I do something similar (inside a device kernel) in OpenCL?

3 Comments

2024/08/03
23:08 UTC

Standalone OpenCL --> SPIR-V Compiler

Hello OpenCL Sub,

I’d like some feedback on a recent project: openclc. As the name suggests it’s an AOT compiler for OpenCL-C/C++ code targeting SPIR-V to be consumed by clCreateProgramWithIL.

Coming from CUDA, I liked using the OpenCL language on a school project. That being said, I found the compile at runtime, put the kernels in a c string flow to be janky and off-putting. Thankfully, Khronos created an LLVM backend that converts LLVM IR to SPIR-V. Despite the good code in the SPIRV-LLVM-Translator, it leaves much to be desired when it comes to packaging. It requires a build from source against a system LLVM installation, doesn’t do SPIRV-Opt performance passes, and leaves you to figure out the inclusion of the SPIR-V into your program.

Openclc bundles clang/llvm, the LLVM-SPIRV translator, and spirv-opt performance passes into one static executable. It can output the SPIR-V directly or as a C array so you can easily embed it into a binary.

I also included builds of Spirv-Tools for windows, linux, and macos.

Future Idea: OpenCLC Runtime

The biggest problem with OpenCL is the ardous and error prone device discovery and kernel scheduling. It would be a huge boost to OpenCL usability to offload device discovery and scheduling to a runtime library like CUDA does with the CUDA Runtime. Instead of just compiling cl sources to SPIR-V, it could offer a regular c symbol for each kernel where the clEnqueueNDRangeKernel ugliness is handled underneath the hood. With sufficient abstraction the OpenCL backend could be swapped for Vulkan, Level Zero, and maybe even Metal through SPIR-V cross.

I'd love to answer any questions.

5 Comments

2024/08/02
04:55 UTC

[Help] Getting CL_OUT_OF_RESOURCES when running clEnqueueNDRangeKernel in a loop

I'm new to OpenCL and gpu programming so i tried to make particle gravity simulation and after reading some tutorials and guides i got stuck with -5 (CL_OUT_OF_RESOURCES) error.

I wasn't able to identify why it happens, so i got boilerplate code from this guide to reproduce an issue on a smaller scale and ended up with this.

    for(int i = 0; i < 10; i++){
        ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
                LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
        ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0, 
                LIST_SIZE * sizeof(int), B, 0, NULL, NULL);

        size_t global_item_size = LIST_SIZE;
        ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, 
                &global_item_size, NULL, 0, NULL, NULL);

        PRINT_ERROR(ret);

        ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0, 
                LIST_SIZE * sizeof(int), C, 0, NULL, NULL);

        clFinish(command_queue);
        printf("loop\n");
    }

i get the same -5 (CL_OUT_OF_RESOURCES) after 2 successful loops. Am i not allowed to do it like that? My original plan was to calculate forces between particles each frame.

I'm not allocating any new memory on a gpu so what resources can i possibly run out of? My old laptop's willpower? It has Intel(R) HD Graphics 505.

15 Comments

2024/07/26
14:02 UTC

I hate whole AI industry is going with one single company nvidia CUDA, what is stopping openCL to kick the butts of CUDA?

13 Comments

2024/07/19
17:16 UTC

How is openCL going along with Rust?

0 Comments

2024/07/19
17:14 UTC

Weird error with write buffer

Hello for some reason my debugger is breaking for an invalid parimeter whenever I try to do this specific code. It seems to be an issue with std::vectors? Not sure. but for some reason this code doesnt causes an issue. '''queue.enqueueWriteBuffer(buffer, sizeof(somestruct) * std::size(vector), vector.data())'''

Ive also tried copying the vector into an array I made using '''somestruct* array = new somestruct[std::size(vector)]; memcpy(array, sizeof(somestruct)*std::size(vector), vector.data())''' and that seems to work... sometimes. It still throws an error for some data types maybe? I was reformatting my code and ran into this issue after I removed an if statement that all this code was in. All that if statement did was test if some variable was a nullptr or not. Which I got rid of the reason for that variable existing so idk. If I step through the code everything seems to copy and not have any issues but it still breaks.

read buffers dont have this issue. I think theres something weird with the const void* cause the read function uses a normal void *

1 Comment

2024/06/17
20:38 UTC

Drawing to screen from OpenCL without OpenGL

Hi all,
I'm trying to make a 3D visualisation program, and for this I need to write images (stored as integer arrays on the device) to the screen in real-time. The images are calculated using OpenCL kernels. All that's left to do is to display them. Unfortunately, the OpenCL implementation I'm using doesn't support OpenGL interoperability.

What's my best chance of rendering the images to the screen at a high refresh rate without copying them to the host?

Thanks.

P.S.: Without OpenGL, and thus also GLFW, there is no window creation/destruction. So this will obviously have to be done with something else (like SDL).

6 Comments

2024/06/12
18:37 UTC

2023 november nvidia driver added float16 support

https://us.download.nvidia.com/Windows/551.52/551.52-win11-win10-release-notes.pdf

Hi! I tried to post this back then but couldn post. I found it interesting that after so many years it seems that nvidia gous now support cl_khr_float16 according to the driver docs above.

Has someone tried this yet? On my 2070 i see no such extension, is this limited to newer cards?

1 Comment

2024/06/12
16:51 UTC

Is OpenCl still relevant?

Hello, I am an MS student and I am interested in parallel computing using GPGPUs. Is OpenCL still relevant in 2024 or should I focus more on SYCL? My aim is to program my AMD graphics card for various purposes (cfd and ml). Thanks.

25 Comments

2024/06/12
13:28 UTC

What are the devices that support device enqueue?

The device enqueue feature, I think is similar to CUDA dynamic parallelism, but the NVIDIA OpenCL implementation does not provide such feature, clinfo shows "Device enqueue capabilities (n/a)". The software version is cuda 12.2 and the card is a A10. And I also tried the libamdocl.so on a W6800 card, it is also the same result. I don't have any other devices at the moment, and I am very curious, what devices do support such feature? Is this feature only supported on CPU/FPGA or what, but never really supported by a GPU?

5 Comments

2024/06/04
01:50 UTC

cl_khr_integer_dot_product on Intel GPUs

All of mine Intel GPU's Arc 750, Arc 770 and HD 530 reports that they are supporting cl_khr_integer_dot_product extension with latest corresponding drivers but I am unable to get that working. Kernel code compilation using dot on uchar4 produces errors, and simple printf test does not print anything:

#pragma OPENCL EXTENSION cl_khr_integer_dot_product : enable
if (get_global_id(0) == 0) {
#if defined(cl_khr_integer_dot_product) && defined(__opencl_c_integer_dot_product_input_4x8bit)
  printf("\ninteger_dot_product with uchar4 supported in kernel\n\n");
#endif
#if defined(cl_khr_integer_dot_product) && defined(__opencl_c_integer_dot_product_input_4x8bit_packed)
  printf("\ninteger_dot_product with uint supported in kernel\n\n");
#endif
}

When trying to get cl_khr_integer_dot_product extension capabilities with OpenCLCapsViewer - it reports both packed and unpacked version are supported.

But how to actually use it on Intel in kernel code?

6 Comments

2024/05/30
08:06 UTC

Why are clcpp tests removed from OpenCL CTS at 2021?

I am going through the Khronos OpenCL CTS of a old version. In about year 2021, a commit removes the clcpp directory from the CTS file tree. I am curious about it, as many materials on the web referencing C++ for OpenCL also mentioned they are for OpenCL of 2021, which as of my knowledge, is a time OpenCL 3.0 has already been released for a long time, and no major version update should there be. Is there anything special about that year? Is the C++ support removed from OpenCL kernel language since then? BTW, what are the headers <opencl_memory>, <opencl_spec_constant> in the old version CTS? Are they once standard libraries for OpenCL C++ and now deprecated?

2 Comments

2024/05/18
05:32 UTC

Could someone please guide me through installation?

Hi, I want to get started in openCL programming, I'm a total noob right now. I was attempting to setup openCL on my machine inside of WSL2, however I just can't seem to be able to get it to work. It's an intel machine with an integrated graphics card (i5-8250 with UHD620). Could someone please guide me through the setup?

18 Comments

2024/05/14
06:07 UTC