/r/sycl
SYCL makes it easier for developers to write software using a C++ single-source parallel programming model. This sub is for sharing news and tutorials and for discussing SYCL. http://sycl.tech
What is SYCL?
SYCL is a specification that defines a single-source C++ programming layer built on top of OpenCL. It allows developers to parallelize standard C++ code on OpenCL-compatible devices such as GPUs.
Specifications
V1.2 :
https://www.khronos.org/registry/SYCL/specs/sycl-1.2.pdf
V2.2 :
https://www.khronos.org/registry/SYCL/specs/sycl-2.2.pdf
Implementations
V1.2 :
https://www.codeplay.com/products/computesuite/computecpp
V2.2 :
https://github.com/triSYCL/triSYCL
Hi, I currently have this SYCL code (on pastebin so the post isn't filled with code: https://pastebin.com/Tcs6nLE9) working fine on a GPU device, but as soon as I switch to a CPU device I get:
warning: <unknown>:0:0: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: <unknown>:0:0: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
I need to solve this, but I can't find which loop isn't being vectorized...
I am also interested in dividing the while-loop kernel between my CPU and GPU. Would it be enough to split the range in half (for a 50-50 workload)?
while (converge > epsilon)
{
    for (size_t i = 1; i < m; i++)
    {
        for (size_t j = 0; j < i; j++)
        {
            RotationParams rp = get_rotation_params_parallel(cpu_queue, U, m, n, i, j, converge);
            size_t half_n = n / 2;
            // Apply rotations on U and V
            cpu_queue.submit([&](sycl::handler &h)
            {
                h.parallel_for(sycl::range<1>{half_n}, [=](sycl::id<1> idx)
                {
                    double tan_val = U[idx * n + i];
                    U[idx * n + i] = rp.cos_val * tan_val - rp.sin_val * U[idx * n + j];
                    U[idx * n + j] = rp.sin_val * tan_val + rp.cos_val * U[idx * n + j];
                    tan_val = V[idx * n + i];
                    V[idx * n + i] = rp.cos_val * tan_val - rp.sin_val * V[idx * n + j];
                    V[idx * n + j] = rp.sin_val * tan_val + rp.cos_val * V[idx * n + j];
                });
            });
            gpu_queue.submit([&](sycl::handler &h)
            {
                h.parallel_for(sycl::range<1>{n - half_n}, [=](sycl::id<1> idx)
                {
                    double tan_val = U[(idx + half_n) * n + i];
                    U[(idx + half_n) * n + i] = rp.cos_val * tan_val - rp.sin_val * U[(idx + half_n) * n + j];
                    U[(idx + half_n) * n + j] = rp.sin_val * tan_val + rp.cos_val * U[(idx + half_n) * n + j];
                    tan_val = V[(idx + half_n) * n + i];
                    V[(idx + half_n) * n + i] = rp.cos_val * tan_val - rp.sin_val * V[(idx + half_n) * n + j];
                    V[(idx + half_n) * n + j] = rp.sin_val * tan_val + rp.cos_val * V[(idx + half_n) * n + j];
                });
            });
        }
        cpu_queue.wait();
        gpu_queue.wait();
    }
}
Thanks, and sorry for the code, but I am completely lost.
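On the 50-50 question: halving the index range only balances the work if both devices process rows at the same speed, which is rarely the case for a CPU and a GPU. Below is a minimal sketch in plain C++ that turns a tunable CPU share into the two offset/count pairs the kernels above would use. The helper `split_rows` and the `Slice` type are hypothetical names introduced for illustration and are not part of SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper type (not part of SYCL): one contiguous slice of rows.
struct Slice {
    std::size_t offset; // first row handled by this device
    std::size_t count;  // number of rows handled by this device
};

// Split n rows into a CPU slice (first) and a GPU slice (second).
// cpu_share is the fraction of rows given to the CPU, in [0, 1].
inline std::pair<Slice, Slice> split_rows(std::size_t n, double cpu_share) {
    assert(cpu_share >= 0.0 && cpu_share <= 1.0);
    std::size_t cpu_rows =
        static_cast<std::size_t>(static_cast<double>(n) * cpu_share);
    if (cpu_rows > n) cpu_rows = n; // guard against floating-point rounding
    return {Slice{0, cpu_rows}, Slice{cpu_rows, n - cpu_rows}};
}
```

With `cpu_share = 0.5` this reproduces the `half_n` split from the post; the share could then be adjusted from measured kernel times. The sketch also assumes that `U` and `V` are allocated so that both devices can access them.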
There is a virtual event coming up where I'll be speaking. It is hosted by the UXL Foundation, the Linux Foundation's new open governance body for the oneAPI specification and open source implementations.
It runs over two days, with timings that suit different parts of the world.
There will be a good variety of presentations, in particular I will highlight:
Dave Airlie from Red Hat who is a major Mesa project contributor talking about what is needed for successful open source projects
Bongjun Kim from Samsung is presenting how they are standardising APIs through SYCL and oneAPI for new memory technology known as Processing in Memory.
Evgeny Drapkin from GE HealthCare will talk about their progress, success and challenges using SYCL and oneAPI.
Yu-Hsiang Tsai works on the Ginkgo project and will talk about implementing their SYCL backend.
Alongside this there will also be some panels exploring open source and automotive topics.
Register here and take a look at the agenda https://linuxfoundation.regfox.com/oneapiuxldevsummit2024?t=uxlds2024reddit
https://oneapi.io/events/oneapi-devsummit-hosted-by-uxl-foundation/#agenda
I've downloaded the SYCL version of the llama.cpp (LLM / AI runtime) binaries for Windows, and my 11th gen Intel CPU with Iris Xe isn't recognized. OpenCL is installed and apparently working.
Do I also need to install oneAPI, and if so, what is the minimum installation I need to have apps working on SYCL? I'm not interested in building apps.
I'm using the open source intel/LLVM SYCL compiler on Linux, and I have successfully worked with a SYCL buffer of std::variants in device code, but I have not been successful in using std::visit on a variant object in device code. In particular, if I try std::visit(visitor, vars); in kernel code, I get an error: SYCL kernel cannot use exceptions. I suppose this is because std::visit can throw a bad_variant_access, but what alternative do I have?
MWE-ish
#include <sycl/sycl.hpp>
#include <variant>
#include <vector>
struct A{double a;};
struct B{double b;};
double funk(A a){return a.a;}
double funk(B b){return b.b;}
using Mix = std::variant<A,B>;
int main()
{
    std::vector<Mix> mix = {A{0.0}, B{1.0}, A{2.0}};
    {
        sycl::buffer mixB(mix);
        sycl::queue q;
        q.submit([&](sycl::handler& h){
            sycl::accessor mix_acc(mixB, h);
            h.single_task([=](){
                // This is the call the compiler rejects in device code.
                std::visit([](auto x){return funk(x);}, mix_acc[0]);
            });
        });
    }
}
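One possible workaround is to dispatch on the active alternative by hand with `std::get_if`, which returns a null pointer instead of throwing. The sketch below is plain C++ and has not been verified inside a SYCL kernel with the intel/LLVM compiler; `funk_of` is a hypothetical helper name, and the `0.0` fallback for a valueless variant is an assumption.

```cpp
#include <cassert>
#include <variant>

struct A { double a; };
struct B { double b; };

inline double funk(A a) { return a.a; }
inline double funk(B b) { return b.b; }

using Mix = std::variant<A, B>;

// Hypothetical helper: try each alternative in turn with std::get_if,
// which is noexcept and yields nullptr when the alternative is not active.
inline double funk_of(const Mix& m) {
    if (const A* pa = std::get_if<A>(&m)) return funk(*pa);
    if (const B* pb = std::get_if<B>(&m)) return funk(*pb);
    // No alternative matched: only possible if the variant is valueless.
    assert(m.valueless_by_exception());
    return 0.0; // assumed fallback value
}
```

In the kernel from the post, the `std::visit` call would then become `funk_of(mix_acc[0])`. The cost is that every alternative has to be listed explicitly.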
I am trying to move from the buffer/accessor model to USM pointers. I already see performance benefits of this approach in some cases, such as dispatching a lot of small kernels. However, how can I use local work-group memory when using USM pointers?
Hey everyone, I switched distros to Fedora, but I can't seem to install the proper drivers for my GPU.
When running sycl-ls I get:
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) HD Graphics 520 OpenCL 3.0 NEO [24.09.28717.17]
But when running code using gpu_selector_v for my queue device I get the following error:
The program was built for 1 devices
Build program log for 'Intel(R) HD Graphics 520':
IGC: Internal Compiler Error: Segmentation violation -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
Can anybody help me?
Hey everyone, sorry if this is not the right place to ask.
I want to find out whether an SVD for sparse matrices in Compressed Sparse Row format is already implemented somewhere.
Thanks.
I’m working in a lab right now that works with some HPC software. We are trying to adapt the software so it can run in parallel on some GPUs. Is this skill very transferable? Does it help with getting jobs working with other languages like CUDA? I am an undergraduate student, so I don’t know much about industry standards.
Hello, I’ve been trying to figure out how to get started with SYCL, but I can’t find any resources. I’m not sure if there is an SDK I can download or something. I was hoping I could just include SYCL in my C++ project and start writing kernels for the GPU. Any help would be appreciated.
I want to compare SYCL to other parallel programming systems, and for now I'm doing host runs. So I want to do a scaling study where the number of cores is 1, 2, 5, 10, 20, 50.
I have not found a mechanism (probably specific to Intel oneAPI) to limit the number of cores. That should be possible, right? Something with TBB or OpenCL or whatever.
Hello,
so I was wondering if I could use the C++ library PcapPlusPlus and its header files in my SYCL code. I am using CentOS Stream 8 and oneAPI Base Toolkit 2023.1. So I downloaded the GitHub repository and built the files. After placing the header files in the necessary folders, I tried to compile the PcapPlusPlus code example with the icpx command but got a lot of "undefined reference" errors. After some research, I can’t find anything that explicitly rules out using third-party libraries. Does anybody have an idea what I could be missing, or is this straight up not possible?
I’m in the process of developing a prototype for a database engine that targets multiple architectures and accelerators. Maintaining a codebase for x86_64, ARM, various GPUs, and different accelerators is quite challenging, so I’m exploring ways to execute queries on different accelerators using a unified codebase.
I’ve experimented with LLVM MLIR and attempted to lower the affine dialect to various architectures. However, the experience was less than satisfactory, as it seemed that either I was not using it correctly, or there were missing compiler passes when I was lowering it to a code targeting a specific architecture.
I’m considering whether SYCL could be a solution to this problem. Is it feasible to generate SYCL or LLVM IR from SYCL at runtime? This capability would allow me to optimize the execution workflow in my database prototype.
Finally, given the context I’ve provided, would you recommend using SYCL, or am I perhaps using the wrong tool to address this problem?
For clarity, I'd like to build it for both Windows and Linux.
Sorry to spam this subreddit; if there are other places to discuss or ask for help, please say so.
I found this CUDA code in a paper and, with the help of this table, tried to convert it to SYCL. The conversion compiles and runs, but it gives me the wrong answer.
The code is SpMV in CSR format.
__global__ void spmv_csr_vector_kernel(const int num_rows, const int *ptr,
                                       const int *indices, const float *data,
                                       const float *x, float *y) {
    __shared__ float vals[];
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index
    int warp_id = thread_id / 32;                          // global warp index
    int lane = thread_id & (32 - 1);                       // thread index within the warp
    // one warp per row
    int row = warp_id;
    if (row < num_rows) {
        int row_start = ptr[row];
        int row_end = ptr[row + 1];
        // compute running sum per thread
        vals[threadIdx.x] = 0;
        for (int jj = row_start + lane; jj < row_end; jj += 32)
            vals[threadIdx.x] += data[jj] * x[indices[jj]];
        // parallel reduction in shared memory
        if (lane < 16)
            vals[threadIdx.x] += vals[threadIdx.x + 16];
        if (lane < 8)
            vals[threadIdx.x] += vals[threadIdx.x + 8];
        if (lane < 4)
            vals[threadIdx.x] += vals[threadIdx.x + 4];
        if (lane < 2)
            vals[threadIdx.x] += vals[threadIdx.x + 2];
        if (lane < 1)
            vals[threadIdx.x] += vals[threadIdx.x + 1];
        // first thread writes the result
        if (lane == 0)
            y[row] += vals[threadIdx.x];
    }
}
And here is my SYCL implementation:
void SPMV_Parallel(sycl::queue q, int compute_units, int work_group_size,
                   int num_rows, int *ptr, int *indices, float *data, float *x,
                   float *y) {
    float *vals = sycl::malloc_shared<float>(work_group_size, q);
    q.fill(y, 0, n).wait();
    q.fill(vals, 0, work_group_size).wait();
    q.submit([&](sycl::handler &cgh) {
        const int WARP_SIZE = 32;
        assert(work_group_size % WARP_SIZE == 0);
        cgh.parallel_for(
            sycl::nd_range<1>(compute_units * work_group_size, work_group_size),
            [=](sycl::nd_item<1> item) {
                int thread_id = item.get_local_range(0) * item.get_group(0) *
                                item.get_local_id(0);
                int warp_id = thread_id / WARP_SIZE;
                int lane = thread_id & (WARP_SIZE - 1);
                int row = warp_id;
                if (row < num_rows) {
                    int row_start = ptr[row];
                    int row_end = ptr[row + 1];
                    vals[item.get_local_id(0)] = 0;
                    for (int jj = row_start + lane; jj < row_end; jj += WARP_SIZE) {
                        vals[item.get_local_id(0)] += data[jj] * x[indices[jj]];
                    }
                    if (lane < 16)
                        vals[item.get_local_id(0)] += vals[item.get_local_id(0) + 16];
                    if (lane < 8)
                        vals[item.get_local_id(0)] += vals[item.get_local_id(0) + 8];
                    if (lane < 4)
                        vals[item.get_local_id(0)] += vals[item.get_local_id(0) + 4];
                    if (lane < 2)
                        vals[item.get_local_id(0)] += vals[item.get_local_id(0) + 2];
                    if (lane < 1)
                        vals[item.get_local_id(0)] += vals[item.get_local_id(0) + 1];
                    if (lane == 0)
                        y[row] += vals[item.get_local_id(0)];
                }
            });
    }).wait();
    sycl::free(vals, q);
}
Any guidance would be greatly appreciated!
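Two differences between the listings stand out and may explain the wrong result. The CUDA kernel adds `threadIdx.x` when forming the global index, while the SYCL port multiplies by `item.get_local_id(0)`. And `vals` is `__shared__` (one copy per block) in CUDA, but a single USM allocation of `work_group_size` floats used by every work-group in the port. To check a port against known-good values, the scheme can be emulated on the host. The following is a sketch in plain C++, not device code; the function name `spmv_csr_vector_host` and the width constant `kWarp` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the "one warp per row" CSR SpMV scheme.
// Lanes run one after another here; on a device, each phase boundary
// marked below would need a work-group or sub-group barrier.
inline std::vector<float> spmv_csr_vector_host(
        const std::vector<int>& ptr, const std::vector<int>& indices,
        const std::vector<float>& data, const std::vector<float>& x) {
    constexpr int kWarp = 32; // assumed warp / sub-group width
    assert(!ptr.empty());
    const std::size_t num_rows = ptr.size() - 1;
    std::vector<float> y(num_rows, 0.0f);
    for (std::size_t row = 0; row < num_rows; ++row) {
        float vals[kWarp] = {}; // per-warp scratch (local memory on a device)
        // Phase 1: each lane accumulates a strided partial sum.
        for (int lane = 0; lane < kWarp; ++lane)
            for (int jj = ptr[row] + lane; jj < ptr[row + 1]; jj += kWarp)
                vals[lane] += data[jj] * x[indices[jj]];
        // Phase 2: tree reduction; a barrier separates each step on a device.
        for (int offset = kWarp / 2; offset > 0; offset /= 2)
            for (int lane = 0; lane < offset; ++lane)
                vals[lane] += vals[lane + offset];
        y[row] = vals[0]; // lane 0 writes the row result
    }
    return y;
}
```

Running this on a small matrix gives reference values for `y` that the device output can be compared against row by row.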
Hi everyone,
I am doing a master's thesis in heterogeneous computing and am expected to program in SYCL; the thing is, I am having a hard time finding online materials to learn it.
I am aware of sycl-academy, one workshop given by EUROCC Sweden and a book (`Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL`), but the examples and the classes seem too simple.
I have some experience in parallel programming (OpenMP and OpenMPI), but all at CPU level; working with GPUs is something completely new.
I am mostly missing harder, more complex exercises and examples, and I am having a hard time understanding `nd_range`.
Do you guys recommend anything? How did you learn SYCL, and do you use SYCL for any project?
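On `nd_range`: a 1-D `nd_range` pairs a global size with a local (work-group) size, and every work-item gets a group id, a local id and a global id related by global = group * local_size + local. Below is a sketch in plain C++ that lists those ids on the host. `ItemIds` and `enumerate_nd_range` are hypothetical names, and the real runtime makes no promise about the order in which work-items execute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record mirroring what sycl::nd_item<1> reports for one work-item.
struct ItemIds {
    std::size_t group;  // like item.get_group(0)
    std::size_t local;  // like item.get_local_id(0)
    std::size_t global; // like item.get_global_id(0)
};

// List all work-items of a 1-D nd_range{global_size, local_size}.
inline std::vector<ItemIds> enumerate_nd_range(std::size_t global_size,
                                               std::size_t local_size) {
    // SYCL requires the global size to be a multiple of the local size.
    assert(local_size > 0 && global_size % local_size == 0);
    std::vector<ItemIds> items;
    const std::size_t num_groups = global_size / local_size;
    for (std::size_t g = 0; g < num_groups; ++g)
        for (std::size_t l = 0; l < local_size; ++l)
            items.push_back({g, l, g * local_size + l});
    return items;
}
```

For example, `enumerate_nd_range(8, 4)` describes two work-groups of four work-items each. Work-items that share a group id are the ones that can share local memory and synchronize with a group barrier.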
Hi, I need help converting the following CUDA code to SYCL. I am using unified shared memory, but the array y always returns 0 at every index.
I am genuinely lost. Any help is greatly appreciated.
__global__ void
spmv_csr_scalar_kernel(
    const int num_rows,
    const int *matrix->row_offsets,
    const int *matrix->column_indices,
    const float *matrix->values,
    const float *x,
    float *y)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < num_rows)
    {
        float dot = 0;
        int row_start = matrix->row_offsets[row];
        int row_end = matrix->row_offsets[row + 1];
        for (int jj = row_start; jj < row_end; jj++)
            dot += matrix->values[jj] * x[matrix->column_indices[jj]];
        y[row] += dot;
    }
}
I have tried the following:
void SPMVV_Parallel(sycl::queue q, const CompressedSparseRow matrix, const float *x, float *y)
{
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> gid)
    {
        int row = gid[0];
        if (row < n) {
            float dot = 0;
            int row_start = matrix->row_offsets[row];
            int row_end = matrix->row_offsets[row+1];
            for (size_t i = row_start; i < row_end; i++)
            {
                dot += matrix->values[i] * x[matrix->column_indices[i]];
            }
            y[row] += dot;
        }
    });
}
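Two things may be worth checking in the attempt above. Nothing in the function waits on the kernel, so unless the caller waits on the queue, `y` may be read before the kernel has run. And `matrix` is dereferenced inside the kernel, so the struct and the arrays it points to all have to be reachable from the device. Below is a sketch in plain C++ of a flat view passed by value, together with a host reference to compare against. `CsrView` and `spmv_csr_scalar_host` are hypothetical names, not the poster's `CompressedSparseRow` type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical flat view of a CSR matrix: plain pointers held by value,
// so a lambda capturing it with [=] copies the pointers themselves
// rather than a pointer to a host-side struct.
struct CsrView {
    std::size_t num_rows;
    const int* row_offsets;    // num_rows + 1 entries
    const int* column_indices; // one entry per nonzero
    const float* values;       // one entry per nonzero
};

// Host reference: y[row] += dot(row of A, x), one row per loop iteration.
inline void spmv_csr_scalar_host(CsrView a, const float* x, float* y) {
    assert(a.row_offsets && a.column_indices && a.values && x && y);
    for (std::size_t row = 0; row < a.num_rows; ++row) {
        float dot = 0.0f;
        for (int jj = a.row_offsets[row]; jj < a.row_offsets[row + 1]; ++jj)
            dot += a.values[jj] * x[a.column_indices[jj]];
        y[row] += dot;
    }
}
```

In a device version, the three pointers in the view would be USM allocations, and the body of the outer loop would become the kernel body with `row` taken from the work-item id.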
Biagio Cosenza from the University of Salerno / CINECA Supercomputing Center pens this blog on the SYnergy research project that enables efficient C++ based heterogeneous parallel programming with the Khronos SYCL API.
I'm looking to offload some data processing in a large existing application to the GPU. This project has a base library that does all the math, a Qt app on top of the library, and a separate gRPC app that acts as a web API to that library. The build system uses CMake and vcpkg to pull in dependencies.
Is there a way to integrate any of the SYCL implementations into a project like this? Writing a SYCL project from scratch is easy, but I can't find any good information on how to add it, or on whether it's even possible to use SYCL with a pretty standard CMake/vcpkg project. It's definitely not as easy as changing the compiler and rebuilding.
In the past, I've compiled OpenCL down to SPIR or used CUDA. Both of those are the easy way to go, but I'm trying to look towards the future where I can.
I'm a beginner and I need to debug a SYCL program running on a GPU (NVIDIA). How should I move forward, and what tools should I use? Do I need PoCL for this?
Hello everyone. I was looking into the library-only compilation flow of Open SYCL. From what I read, it seems it tries to support every compiler and every OS, but it actually doesn't support many backends.
Is there a project, or any hope, that using SYCL may become as portable as graphics APIs (e.g. include and link the lib, build using any library, run anywhere by lowering at runtime)?
Or would this require new language tooling such as reflection?
Is there actually any out-of-the-box SYCL implementation, or a plugin for any existing SYCL implementation, for Windows that supports NVIDIA and AMD GPUs as compute devices?
There are a lot of discussions on the internet, including posts in this sub such as "Learn SYCL or CUDA?", where one of the popular answers was: CUDA is NVIDIA-only, and SYCL is universal.
But the thing is that I can't compute on my NVIDIA GPU using SYCL on Windows. I installed DPCPP and really liked the concept of SYCL, but all I can get is mediocre CPU code (ISPC-based solutions are up to twice as fast in my tests) and GPU code for Intel GPUs, which runs on my integrated Intel GPU even slower than the CPU variant (and the default device selector prefers the integrated GPU, hm). I googled other implementations, and some of them provide NVIDIA/AMD support, but only for Linux.
Am I missing something?
Hiya, I'm pretty new to SYCL. I want to allocate a struct and all its members on a SYCL device, but I keep getting errors about illegal memory accesses in CUDA. Can I have some help please, or an alternative suggestion?
This is my code. I create a struct, allocate it and an int array on the device, populate the int array and then print it out.
#include <sycl/sycl.hpp>
struct test_struct {
    int* data = nullptr;
};
int test(test_struct **t){
    try {
        sycl::queue q;
        *t = sycl::malloc_shared<test_struct>(1, *q);
        int* host_res = (int*) malloc(20 * sizeof(int));
        size_t size = 20;
        (*t)->data = sycl::malloc_device<int>(size, q);
        q.parallel_for(sycl::range<1>(size), [=](sycl::id<1> i) {
            (*t)->data[i] = i;
        }).wait();
        q.memcpy(host_res,(*t)->data,size * sizeof(int)).wait();
        for (size_t i = 0; i < 20; i++)
        {
            std::cout << host_res[i] << std::endl;
        }
        sycl::free((*t)->data, q);
    }
    catch (sycl::exception &e) {
        std::cout << "SYCL exception caught: " << e.what() << std::endl;
    }
    return 0;
}
int main() {
    test_struct *t;
    test(&t);
    return 0;
};
Hello SYCLers - wanted to let you all know that there is a oneAPI DevSummit on June 13th! We have a great State of the Union talk where you can find out the latest that is happening in the ecosystem. We have all the chat on discord. It'll be a fun way to hang out with fellow SYCLers and oneAPI enthusiasts.
Looking forward to seeing you there!
https://www.oneapi.io/events/oneapi-devsummit-2023/
Feedback of course is welcome. :-)
Hi,
I hope you're doing well.
I am searching for signal processing and linear algebra libraries for SYCL. In addition to oneMKL, I am looking for other libraries that can run with DPC++ (or hipSYCL or triSYCL).
Cheers,
The Khronos Group has issued an RFP for a SYCL 2020 Reference Guide. The project aims to improve the SYCL developer ecosystem by providing a more usable version of the SYCL specification. An online searchable reference is needed, along the lines of cppreference.com, through which developers can rapidly find relevant material via top-ranked web searches or browsing.
Submit your bid by Monday, June 12, 2023!
Videos and presentations from the talks and panels presented at last month's IWOCL & SYCLcon 2023 are now available!