/r/OpenCL

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. (c) wikipedia

Related Subreddits:

  • /r/ROCm -- AMD-only equivalent

  • /r/sycl -- higher-level C++ layer on top of OpenCL

  • /r/gpgpu -- most OpenCL posts are here right now; feel free to 'port' some over :)

  • /r/cuda -- NVidia-only equivalent

2,264 Subscribers

6

How widespread is OpenCL support?

TL;DR: the title, but also: would it be possible to run a test to figure out whether OpenCL is supported on the host machine? It's for a game that is meant to be distributed.

Redid my post because I included a random image by mistake.

Anyway, I have an idea for a long-term game project I would like to develop, where there will be a lot of calculations in the background but little to no graphics. So I figured I might as well ship some of the calculations to the otherwise unused GPU.

I have very little experience with OpenCL outside of some things I have read, so I figured y'all might know more than me or have advice for a starting developer.
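One way to run such a test without making the game depend on OpenCL at link time is to load the OpenCL library dynamically at startup and ask it how many platforms it exposes. The following is a minimal sketch for Linux only, under stated assumptions: the function name `probe_opencl_platforms` is hypothetical, the library names are the usual ICD loader names on Linux, and the function pointer type is written by hand to match the documented `clGetPlatformIDs` signature. Windows and macOS would need their own loading calls.

```c
#include <dlfcn.h>   /* dlopen, dlsym (POSIX) */
#include <stddef.h>
#include <stdint.h>

/* Hand-written type matching the OpenCL entry point
   cl_int clGetPlatformIDs(cl_uint, cl_platform_id *, cl_uint *). */
typedef int32_t (*get_platform_ids_fn)(uint32_t, void **, uint32_t *);

/* Returns the number of OpenCL platforms reported by the runtime,
   0 if a runtime is installed but reports an error or no platforms,
   or -1 if no OpenCL library could be loaded at all. */
int probe_opencl_platforms(void)
{
    /* Assumption: typical names of the OpenCL ICD loader on Linux. */
    const char *names[] = { "libOpenCL.so.1", "libOpenCL.so" };
    void *lib = NULL;
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0] && lib == NULL; ++i)
        lib = dlopen(names[i], RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL)
        return -1;

    /* POSIX-recommended way to store a dlsym result in a function pointer. */
    get_platform_ids_fn get_ids = NULL;
    *(void **)(&get_ids) = dlsym(lib, "clGetPlatformIDs");
    if (get_ids == NULL)
        return -1;

    /* Passing 0 entries and a NULL array only queries the platform count. */
    uint32_t count = 0;
    int32_t err = get_ids(0, NULL, &count);

    /* The library handle is intentionally left open: a real program would
       go on to resolve and call further OpenCL functions through it. */
    return (err == 0) ? (int)count : 0;
}
```

A result of -1 or 0 would tell the game to stay on its CPU code path. On older glibc versions the program may need to be linked with `-ldl`.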

15 Comments
2024/04/29
20:57 UTC

7

Debugging Kernel

Does anyone know if there's a way to step through a kernel in Visual Studio?

Or better yet does anyone have a kernel that can compare two triangles to see if they intersect?

After hours of searching through old Stack Overflow posts, I found some very old code for this on the Internet Archive, and that code is giving me weird results. I know for a fact that the data I'm putting in isn't garbage, because I check it manually every time I get a weird result, and it just doesn't make sense. I'm away from my PC at the moment, so it'll take me a while to upload the code.

Edit: I solved it lol. I had a typo in my XMVector3Cross function that replaced some * with + and caused weird results. Fixing those typos made my code detect collision perfectly.
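For reference, the kind of typo described in the edit (a `*` turned into a `+`) is easy to catch with a property check: a correct cross product is perpendicular to both of its inputs. The sketch below is not the poster's XMVector3Cross; `vec3`, `cross3`, and `dot3` are hypothetical names used only for illustration.

```c
typedef struct { float x, y, z; } vec3;

/* Cross product a x b: every component is a difference of two products. */
vec3 cross3(vec3 a, vec3 b)
{
    vec3 r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}

/* Dot product, used here to check perpendicularity. */
float dot3(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

Checking that `dot3(cross3(a, b), a)` and `dot3(cross3(a, b), b)` are both zero (or very close to zero for non-integer inputs) flags a mistyped operator quickly.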

I've made a version that uses 2 dimensions instead of a for loop, if anyone wants it: typedef struct XMFLOAT4{ float x; float y; float z; float - Pastebin.com

19 Comments
2024/04/27
21:16 UTC

2

Unable To Use "atomic_compare_exchange_strong()" In Kernel

Hello, I'm trying to use the atomic_compare_exchange_strong() function in my OpenCL kernel, but I'm getting a CL_BUILD_PROGRAM_FAILURE error and a CL_INVALID_PROGRAM_EXECUTABLE error unless I comment out the atomic function. According to https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/atomic_compare_exchange.html I need three features to use that function: __opencl_c_generic_address_space, __opencl_c_atomic_order_seq_cst, and __opencl_c_atomic_scope_device. I have been unable to figure out how to enable these features or to find any instructions for doing so. Any help will be greatly appreciated.
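As background on what the call does: the OpenCL C 2.0 and later atomics are modelled on the C11 atomics, so the semantics can be sketched on the host with `<stdatomic.h>`. The example below is plain host C, not kernel code, and `atomic_max_int` is a hypothetical helper name. It does not show how to enable the three features, which depends on the device, the driver, and the options passed when the program is built.

```c
#include <stdatomic.h>

/* Atomically raise *target to at least `value` with a compare-exchange loop.
   Returns the value that was observed in *target before any update. */
int atomic_max_int(atomic_int *target, int value)
{
    int seen = atomic_load(target);

    /* atomic_compare_exchange_strong stores `value` only if *target still
       equals `seen`. On failure it writes the current contents of *target
       back into `seen`, so the loop condition is re-evaluated on fresh data. */
    while (seen < value &&
           !atomic_compare_exchange_strong(target, &seen, value)) {
        /* retry */
    }
    return seen;
}
```

The same loop shape is what a kernel would use with the OpenCL function of the same name, once the build accepts it.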

1 Comment
2024/04/25
18:44 UTC

4

OpenCL install approach?

I want to use OpenCL in Microsoft Visual Studio 2022. But when I opened an OpenCL package, there was nothing in it that let me open an OpenCL file in Visual Studio. Is there a recommended way to get OpenCL working with Microsoft Visual Studio without going through the madness?

4 Comments
2023/11/27
04:05 UTC

9

Fedora39 AMD OpenCL performance crushed - ? rocm-opencl issue

Hi All,

I upgraded to Fedora39 (from 38) and my OpenCL performance on my 6900XT was reduced by 75%!

I have reinstalled Fedora38 and have the performance back. Has anyone else encountered this or know what is up?

I am using rocm-* dnf packages from the standard fedora repos.

I am making the assumption that the issue is with rocm-opencl... Fedora38 is 5.5.1 and Fedora39 is 5.7.1. Thoughts/experiences???

Thanks,

Ant

2 Comments
2023/11/16
21:50 UTC

9

C++ for writing OpenCL kernels

Hello everyone,

How has your experience been with using C++ as the main language for writing OpenCL kernels?

I like OpenCL C, and I've been using it to develop my CFD solvers.

But I also need to support CUDA, which requires me to convert my CUDA code to OpenCL C.

As you might guess, that doubles my work.

I was reading this small writeup from Khronos, and C++ for OpenCL seems extremely promising: https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/cpp_for_opencl.md

I definitely need my code to run on both OpenCL and CUDA, so I was thinking of writing a unified kernel launcher and configuring my build system such that the same C++ code would be compiled to both OpenCL and CUDA, and the user can simply choose which one she wants to use at runtime.
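A different and older technique than C++ for OpenCL, shown here only to illustrate the "one source, two back ends" idea, is a small preprocessor shim that maps a few macros to either CUDA or OpenCL C keywords. This is a minimal sketch: the macro names (`KERNEL`, `GLOBAL_PTR`, `GLOBAL_ID`) and the `launch_saxpy_host` fallback are hypothetical. It relies on `__CUDACC__` being defined by nvcc and `__OPENCL_VERSION__` being defined by OpenCL C compilers. The third branch is a plain C fallback so the kernel body can be tested on the host.

```c
#if defined(__CUDACC__)              /* compiled by nvcc */
  #define KERNEL      extern "C" __global__
  #define GLOBAL_PTR
  #define GLOBAL_ID() ((size_t)blockIdx.x * blockDim.x + threadIdx.x)
#elif defined(__OPENCL_VERSION__)    /* compiled by an OpenCL C compiler */
  #define KERNEL      __kernel
  #define GLOBAL_PTR  __global
  #define GLOBAL_ID() get_global_id(0)
#else                                /* plain C host fallback, for testing */
  #include <stddef.h>
  static size_t host_global_id;      /* index of the emulated work-item */
  #define KERNEL
  #define GLOBAL_PTR
  #define GLOBAL_ID() host_global_id
#endif

/* y[i] = a * x[i] + y[i], written once for all three back ends. */
KERNEL void saxpy(GLOBAL_PTR float *y, GLOBAL_PTR const float *x,
                  float a, unsigned int n)
{
    size_t i = GLOBAL_ID();
    if (i < n)
        y[i] = a * x[i] + y[i];
}

#if !defined(__CUDACC__) && !defined(__OPENCL_VERSION__)
/* Host-only launcher: runs the kernel body once per emulated work-item. */
static void launch_saxpy_host(float *y, const float *x, float a, unsigned int n)
{
    for (host_global_id = 0; host_global_id < n; ++host_global_id)
        saxpy(y, x, a, n);
}
#endif
```

A shim like this tends to work for simple kernels. Features that differ more deeply between the two models (local memory declarations, barriers, atomics) would each need their own macro or wrapper.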

Thanks

4 Comments
2023/11/06
11:02 UTC

3

PTX kernel in OpenCL?

If I have a kernel in PTX (eg, generated with nvidia's compiler), is there a way to load that kernel and execute it in OpenCL?

1 Comment
2023/11/03
09:14 UTC

3

OpenCL to HIP transpiler?

Wondering if something like this exists or would be useful? It would help with interoperability between OpenCL and CUDA.

0 Comments
2023/10/30
13:22 UTC

7

Tensor cores in OpenCL

Are there any examples of using Nvidia (or AMD) tensor cores in OpenCL?

I know that for Nvidia you have to use inline assembly. I am wondering if anybody has written a small header that exposes this capability in OpenCL.

3 Comments
2023/08/12
22:01 UTC

7

Use GPU from VBA

I have developed a C# library that enables you to perform calculations on a GPU/CPU from VBA. The library detects the current configuration of your GPU/CPU devices, compiles OpenCL sources, and runs them on the GPU/CPU (it can also run in asynchronous mode).

You can find the project (ClooWrapperVba) on GitHub or download and install it from SourceForge. The library is available for both x86 and x64 bit versions of Excel.

Requirements:

  • Excel/Windows
  • .Net 3.5

The example table ("OpenCl example.xlsm") contains four sheets:

  • "Hello world!" - A short example that prints the configuration of found devices and multiplies two matrices on the first found device.
  • "Configuration" - Lists all found platforms and devices corresponding to each platform.
  • "Performance" - Compares the performance of matrix multiplication code in VBA and OpenCL code executed on CPU/GPU.
  • "Asynchronous" - Executes matrix multiplications 20 times on CPU and GPU asynchronously.
2 Comments
2023/08/11
04:46 UTC

4

[Help] How to install OpenCL drivers for ARM Mali?

I'm pretty stumped here. I've spent about an hour trying to find out how I can download OpenCL drivers for an Orange Pi 5. I found lots of references to ARM's Mali OpenCL drivers but no instructions on how to download them. I am super new to this so I'm not very surprised that I'm lost here. 😂

I would appreciate any help, pointers, and tips for installing OpenCL! How can I do it?

Btw, I'm running ML models on the OPI 5 (Llama.cpp and Whisper.cpp). Whisper.cpp can get a boost from having OpenCL. Let me know if you see anything wrong with my logic here...

Thank you!

2 Comments
2023/07/18
22:34 UTC

4

OpenCL GPU Programming for HPC Applications - ChEESE Center of Excellence Webinar Talk

0 Comments
2023/07/03
16:37 UTC

13

IWOCL & SYCLcon 2023 Video and Presentations

Videos and presentations from the talks and panels presented at last month's IWOCL & SYCLcon 2023 are now available!

https://www.iwocl.org/iwocl-2023/conference-program/

0 Comments
2023/05/02
23:54 UTC

21

I have open-sourced my OpenCL-Benchmark utility

A lot of people have requested it, so I have finally open-sourced my OpenCL-Benchmark utility. This tool measures the peak performance/bandwidth of any GPU. Have fun!

GitHub link: https://github.com/ProjectPhysX/OpenCL-Benchmark

Example:

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA A100-PCIE-40GB                                      |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 525.89.02                                                  |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 108 at 1410 MHz (6912 cores, 19.492 TFLOPs/s)              |
| Memory, Cache  | 40513 MB, 3024 KB global / 48 KB local                     |
| Buffer Limits  | 10128 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         9.512 TFLOPs/s (1/2 ) |
| FP32  compute                                        19.283 TFLOPs/s ( 1x ) |
| FP16  compute                                          not supported        |
| INT64 compute                                         2.664  TIOPs/s (1/8 ) |
| INT32 compute                                        19.245  TIOPs/s ( 1x ) |
| INT16 compute                                        15.397  TIOPs/s (2/3 ) |
| INT8  compute                                        18.052  TIOPs/s ( 1x ) |
| Memory Bandwidth ( coalesced read      )                       1350.39 GB/s |
| Memory Bandwidth ( coalesced      write)                       1503.39 GB/s |
| Memory Bandwidth (misaligned read      )                       1226.41 GB/s |
| Memory Bandwidth (misaligned      write)                        210.83 GB/s |
| PCIe   Bandwidth (send                 )                         22.06 GB/s |
| PCIe   Bandwidth (   receive           )                         21.16 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    8.77 GB/s |
|-----------------------------------------------------------------------------|
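The theoretical FP32 figure in the "Compute Units" row can be reproduced from the core count and clock in the same row. The sketch below assumes each core completes one fused multiply-add per clock, counted as 2 floating-point operations; that factor is an assumption, not something stated in the table, and `peak_tflops` is a hypothetical helper name.

```c
/* Theoretical peak in TFLOP/s for `cores` cores running at `clock_mhz`,
   assuming one fused multiply-add (2 FLOPs) per core per clock. */
double peak_tflops(unsigned int cores, double clock_mhz)
{
    const double flops_per_core_per_clock = 2.0;  /* assumption: 1 FMA = 2 FLOPs */
    return cores * clock_mhz * 1e6 * flops_per_core_per_clock / 1e12;
}
```

With the values from the table, `peak_tflops(6912, 1410.0)` gives about 19.49, in line with the 19.492 shown, and the measured FP32 result of 19.283 is roughly 99% of that.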
7 Comments
2023/04/30
18:27 UTC

11

In the next 5 years, what do you think can push OpenCL adoption?

To me it seems pretty obvious that CUDA (and Nvidia chips) dominates the compute domain and Vulkan is the go-to for graphics (bear in mind this is a fairly generalised statement). OpenCL still struggles to find larger adoption, particularly for compute tasks.

In your opinion, what could push adoption for it?

To me, the main one is going to be larger adoption of ML applications even on low-power devices (mobile phones, autonomous cars, etc.). Low-power GPUs are the only segment where other manufacturers (ARM, Qualcomm, Imagination, etc.) can compete with the Nvidia alternative. Another obvious one is larger investment from large hardware companies, but I doubt this will happen in the foreseeable future.

24 Comments
2023/04/26
08:38 UTC

17

Khronos Group releases OpenCL 3.0.14 update

Khronos has today released the OpenCL 3.0.14 maintenance update that introduces a new cl_khr_command_buffer_multi_device provisional extension that enables execution of heterogeneous command-buffers across multiple devices. This release also includes significant improvements to the OpenCL C++ Bindings, a new code generation framework for the OpenCL extension headers, and the usual clarifications and bug fixes. The new specifications can be downloaded from the OpenCL Registry.

https://registry.khronos.org/OpenCL/

0 Comments
2023/04/18
16:02 UTC

9

Can OpenCL support direct data transfer between GPUs or between MPI nodes, similar to "CUDA aware MPI"?

Hello everyone,

CUDA has an amazing feature to send data in Device memory to another MPI node without first copying it to Host memory: https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/

This is useful, as we don't need to do the slow copy from Device memory to Host memory first.

Luckily, since OpenCL 2.0 we have support for Shared Virtual Memory: https://developer.arm.com/documentation/101574/0400/OpenCL-2-0/Shared-virtual-memory and https://www.intel.com/content/www/us/en/developer/articles/technical/opencl-20-shared-virtual-memory-overview.html

So in theory, OpenCL should be able to transfer data similar to "CUDA aware MPI"

But unfortunately I haven't been able to find a definitive answer on whether it is possible, or how to do it.

I'm going to ask in the MPI developer forum, but thought I would ask here first whether it's possible in OpenCL.

Thanks

6 Comments
2023/04/16
02:48 UTC

5

An example for OpenCL 3.0?

I've never used OpenCL, and I want to start using it. As the most recent version is 3.0, I tried to search for examples written for version 3.0. However, what I could find on the internet was either not written for OpenCL 3.0 or used deprecated features. So I ask here: could you provide an example of printing the OpenCL conformant devices and of adding vectors or multiplying matrices using OpenCL 3.0? A C example would be fine, but if there's also a wrapper and an example for C++, I'd like that too.

4 Comments
2023/03/11
05:35 UTC

3

How fast can OpenCL code run on GPU?

Hello, everyone

While I was trying to learn OpenCL, I noticed that my code takes about 10 ms, which seems really slow.

I guess the reason for this is the fact that I use the integrated GPU Intel HD Graphics 4600.

So, how fast can OpenCL code run on a better GPU? Or is the problem in the code and not in the GPU?

5 Comments
2023/03/07
10:11 UTC

3

Which is better: 1 work item working with a float4, or 4 work items working with a plain float?

I am sure I am just burdening myself with premature optimization here, but I've been wondering about this for some time now. Which would be faster?

Something like this:

__kernel void add(__global float4 *A,
                  __global float4 *B,
                  __global float4 *result) {
    size_t id = get_global_id(0);
    result[id] = A[id] + B[id];
}

working on 1 work item or

__kernel void add(__global float *A,
                  __global float *B,
                  __global float *result) {
    size_t id = get_global_id(0);
    result[id] = A[id] + B[id];
}

working on 4 work items

I'm wondering because it might seem obvious that the second is more parallelized and so should be faster, but maybe the device can add 4 numbers to 4 other numbers in a single operation (like with SIMD). Plus there might be some other hidden costs, like buffering.

3 Comments
2023/03/03
18:37 UTC

6

Using integrated AMD-GPU for OpenCL?

Hey there, one question. I am using an old RX570 for KataGo with OpenCL. Now I switched to a new Ryzen 5700G with integrated GPU, and I thought I could use that as well for speeding up calculation. KataGo does support more than 1 OpenCL-device, but when I check with "clinfo", I only see the RX570. I did enable the integrated GPU in BIOS, but it doesn't show up... any ideas?

w@w-mint:~$ clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3380.4)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     Ellesmere
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 AMD-APP (3380.4)
  Driver Version                                  3380.4 (PAL,HSAIL)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         Radeon RX 570 Series
  Device Topology (AMD)                           PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
...
18 Comments
2023/03/01
08:14 UTC

10

Khronos releases open-source, OpenCL Tensor & Tiling Library

Developed by Mobileye, the open-source OpenCL Tensor & Tiling Library provides easy-to-use, portable, modular functionality to tile multi-dimensional tensors for optimized performance across diverse heterogeneous architectures. Tiling is particularly critical to devices with limited local memory that can partition data for asynchronously pipelining overlapped data import/export and processing.

Go to the OpenCL-TTL GitHub repository

0 Comments
2023/02/22
18:28 UTC

3

Unable to dynamically unbind a GPU without rendering opencl platform unusable

Okay, so I have two GPUs in my system (5700 XT / 6950 XT). I'm using one of the GPUs for passthrough to a Windows VM most of the time. I am able to bind the GPU back to the host, and clinfo tells me there are two devices. However, when I unbind one of the GPUs to give it back to the VM, clinfo tells me there are 0 devices on the OpenCL platform.

I feel like OpenCL is unable to recover from one GPU disappearing. Is there a way I can reset OpenCL or something on Linux?

0 Comments
2023/02/19
23:50 UTC

6

Trying to learn OpenCL. I only have IntelHD GPU available. Is it possible to gain some performance improvements?

Hello everyone,

I'm trying to learn OpenCL coding and GPU parallelize a double precision Krylov Linear Solver (GMRES(M)) for use in my hobby CFD/FEM solvers. I don't have a Nvidia CUDA GPU available right now.

Would my Intel(R) Gen9 HD Graphics NEO integrated GPU be enough for this?

I'm limited by my hardware right now, yes, but I chose OpenCL so in future, the users of my code could also run them on cheaper hardware. So I would like to make this work.

My aim is to see at least 3x-4x performance improvements compared to the single threaded CPU code.

Is that possible?

Some information about my hardware I got from clinfo:

Number of platforms                               1
Platform Name                                   Intel(R) OpenCL HD Graphics
Platform Vendor                                 Intel(R) Corporation
Device Name                                     Intel(R) Gen9 HD Graphics NEO
Platform Version                                OpenCL 2.1 
Platform Profile                                FULL_PROFILE
Platform Host timer resolution                  1ns
Device Version                                  OpenCL 2.1 NEO 
Driver Version                                  1.0.0
Device OpenCL C Version                         OpenCL C 2.0 
Device Type                                     GPU
Max compute units                               23
Max clock frequency                             1000MHz
Max work item dimensions                        3
Max work item sizes                             256x256x256
Max work group size                             256
Preferred work group size multiple              32
Max sub-groups per work group                   32
Sub-group sizes (Intel)                         8, 16, 32
Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 1 / 1       
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
Global memory size                              3230683136 (3.009GiB)
Error Correction support                        No
Max memory allocation                           1615341568 (1.504GiB)
Unified memory for Host and Device              Yes
Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
Minimum alignment for any data type             128 bytes
Alignment of base address                       1024 bits (128 bytes)
Max size for global variable                    65536 (64KiB)
Preferred total size of global vars             1615341568 (1.504GiB)
Global Memory cache type                        Read/Write
Global Memory cache size                        524288 (512KiB)
Global Memory cache line size                   64 bytes
3 Comments
2023/02/11
10:58 UTC

4

How to install OpenCL for AMD CPU?

I want to program with OpenCL in C. I was able to install CUDA and get my program to recognize the Nvidia CUDA platform. Now I want to set up OpenCL to recognize my AMD CPU. I downloaded the AMD SDK here and put the opencl.lib and associated headers in my project. When I run it, it still only recognizes the Nvidia CUDA platform. My guess is that OpenCL itself needs to be installed on my computer somehow, like how I had to run an installer to install CUDA. Am I missing something? Does AMD have a way to install OpenCL so I can get it to recognize my AMD CPU?

9 Comments
2023/02/04
06:21 UTC

2

Branch divergence

Hello. I know that branch divergence causes a significant performance decrease, but what if I have a code structure inside the kernel like this:

__kernel void ker(...)
{
    if(condition)
    {
        // do something
    }
}

In this situation, in my opinion, the flow doesn't diverge. A work-item either ends its computation immediately or computes the 'if' body. Would this run slowly or not? Why?
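One way to reason about this is a simplified lockstep model, assuming the hardware runs work-items in fixed-size SIMD groups (warps, wavefronts, or sub-groups) that share one instruction stream. Under that assumption, a group pays for the 'if' body whenever at least one of its lanes takes the branch, and lanes that skipped the body sit idle until it finishes. The sketch below only models that cost. `group_cost_if` and `lane_utilization` are hypothetical names, and real hardware differs in the details.

```c
#include <stddef.h>

/* Cost, in arbitrary units, for one SIMD group to execute
       if (cond) { body }
   The group pays for the body once if any lane takes the branch,
   and pays nothing if every lane skips it. */
unsigned int group_cost_if(const int *cond, size_t lanes, unsigned int body_cost)
{
    size_t i;
    for (i = 0; i < lanes; ++i)
        if (cond[i])
            return body_cost;  /* at least one active lane: whole group waits */
    return 0;                  /* no active lane: the group skips the body */
}

/* Fraction of lanes doing useful work while the body executes. */
double lane_utilization(const int *cond, size_t lanes)
{
    size_t i, active = 0;
    for (i = 0; i < lanes; ++i)
        active += (cond[i] != 0);
    return lanes ? (double)active / (double)lanes : 0.0;
}
```

In this model an 'if' without an 'else' costs no more than the body itself, but a group where only a few lanes pass the condition runs at low utilization. Conditions that are uniform within each group avoid that loss.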

Thank you in advance!

7 Comments
2023/01/25
03:54 UTC
