/r/CUDA
/r/CUDA
So recently I had an interview for a CUDA kernel dev related position and was talking about how I implemented tiled GEMM from scratch for one of my projects. I was talking about how I implemented GEMM the following way, and the interviewer seemed to have been surprised by how I was able achieve coalesced memory access without transposing the second matrix. Maybe I may have misread his reaction too, but either way, I wanted to verify my logic.
A little bit of info about my implementation, my main focus was to obviously coalesce my memory access so that all threads within a single warp can get their indices of data in 1 query instead of having to sequentially send out memory read requests separately.
What I realized was when doing GEMM, you obviously need to transpose the second matrix (this is for deep learning application, if it gives any better context). But that of course adds an additional cost because now you need to do a separate kernel for read and write to HBM. What I decided to do was to keep both tensors in row major order, and coalesce memory access for tiles in both tensors, but I would then transpose the indices when loading into shared memory.
Considering that memory access to shared memory is like accessing L1 cache, it’s better to compromise non coalesce access when interacting with shared memory than with HBM.
So in total, there’s a net performance benefit because you don’t need to pre transpose the matrix which is in total 4 HBM accesses (2 reads and 2 writes) and also, the GEMM kernel still coalesces memory access to HBM during reads, but is not coalesced when loading the data to shared memory.
Is my thought process consistent and logical?
Should I use cudamemcpy in different cpu threads with different memory address and data, or cudamemcpyasync, or should I use cudamemcpyasync
This article explores how CUDA C++ is leveraged to accelerate an AI for the game 2048. The techniques discussed can be widely applied.
https://trokebillard.com/blog/2048-ai/
Feel free to share your thoughts.
I'm looking to meet fellow CUDA developers. Please DM me.
Hi I am new to CUDA programming.
I wanted to know at maximum how many warps can be issued instructions in a single SM at the same time instance, considering SM has 2048 threads and there are 64 warps per SM.
When warp switching happens, do we have physically new threads running? or physically the same but logically new threads running?
If its physically new threads running, does it mean that we never utilize all the physical threads (CUDA cores) of an SM?
I am having difficulty in understanding these basic questions, it would be really helpful if anyone can help me here.
Thanks
Hi guys, I'm reading this code and confused about how the process of loading a matrix tile from global memory to shared memory works. As I understand it, the author performs matrix multiplication on 2 matrices of size 4096-by-4096 laid out in a 1D array, and he declares his kernel to be
Regarding the loading process of matrix A alone (which can be accessed by global_ptr
in the code), here's what I'm able to grasp from the code:
Each block in the grid will load (in a vectorized manner) a 128-by-128 tile of matrix A into its shared memory. However, since there are only 512 threads per block, each block can only load 1/4 of the tile (referred to as sub-tile from now on) at a time. This means that each thread will have access to 8 consecutive elements of the matrix, so 512 threads should be able to cover 128x32 elements. The local position of an element inside this sub-tile is represented by offset_.row
and offset_.col
in the code.
To assign different sub-tiles (row-wise) to different thread blocks, the author defines a variable called blockOffset=blockIdx.y * Threadblock::kM * K
, where Threadblock::kM=128
refers to the number of rows of a tile, and K=4096
is the number of columns of matrix A. So for different blockIdx.y
, global_ptr + blockOffset
will give us the first elements of the first sub-tiles of each row in matrix A (see the small red square in the figure below).
Next, The author converts the local positions (offset_.row, offset_.col)
within a sub-tile to the linear global positions with respect to the 4096-by-4096 matrix A: global_idx = offset_.row * K + offset_.col
. So elements with the same (offset\_.row, offset_.col)
across different sub-tiles will have the same global_idx
in the 4096x4096 1D array.
Then, to distinguish these orange positions, the author computes src = global_ptr + row * K + global_idx
, which results in the figure below.
However, as can be seen, the element across sub-tiles on the same row will access the same position (same color) in the 4096x4096 1D array.
Can someone provide an explanation for how this indexing scheme can cover the whole 4096x4096 elements of matrix A? I'll be thankful for any help or guidance!! 🙏🙏🙏
Link to the code: https://forums.developer.nvidia.com/t/cuda-kernel-slower-when-using-cuda-pipelines-despite-avoiding-bank-conflicts/280643
I tried a lot but was unsuccessful in installing these libs. Does anyone know of any solutions or guides for this?
Hi there, I used to work as an intern in making drones autonomous, there a problem stuck me which is to run orbslam3 on jetson nano. But the most cpu computing power is consumed by slam alone.So, that navigation and motion planning would be really difficult to execute on the embedded device alone. So, I had a plan that to parallelize the slam as much as possible since the nano has a lot of gpu cores which are under utilised.
Can anyone suggest me textbooks to learn gpu programming with C++ and Cuda.
I am reading matrix-multiplication background user guide by nvidia.
I am confused by the statement as follows:
A is a M x K matrix, B is a K X N matrix, and C is M x N matrix.
If I understand tiled matrix correctly, C is tiled into multiple submatrices, and the submatrix will be calculated by certain row and col of A and B, respectively.
The problem is, since M = 6912, N = 2048, C will be tiled into (6912 x 2048) / (256 x 128) = 432 submatrix, while an A100-SXM-80GB only has 108 SMs.
That means it needs one SM to handle four tiles.
What's more, in the Wave Quantization chapter, it says that:
An NVIDIA A100 GPU has 108 SMs; in the particular case of 256x128 thread block tiles, it can execute one thread block per SM, leading to a wave size of 108 tiles that can execute simultaneously.
But A100 only has 2048 maximum threads per SM, which is far more smaller than 256 x 128 ?
These two questions may be quite dumb, but I wish someone can help to enlight me.
Here are my information sources:
Hello, I am a student new to cuda.
I have an assignment of making flash attention in cuda with shared memory.
I have read some material but I just don't know how to apply it.
For example, this is a 1D kernel launch.
__global__ void RowMaxKernel(float *out, float *in, int br, int bc) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < br) {
float max_val = in[i * bc];
for (int j = 1; j < bc; j++) {
max_val = fmaxf(max_val, in[i * bc + j]);
}
out[i] = max_val;
}
}
this is 2D kernel launch
__global__ void QKDotAndScalarKernel(float *out, float *q, float *k, int br, int bc, int d, float scalar) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < br && j < bc) {
float sum = 0.0F;
for (int t = 0; t < d; t++) {
sum += q[i * d + t] * k[j * d + t];
}
out[i * bc + j] = sum * scalar;
}
}
Non of the TA or student are providing help. Please somebody so kind to demonstrate how to use shared-memory with these 2 example codes, please.
So, I have this program where I count the number of times a string appears in a given text file. So, I've defined an upper limit to the length of the string to be compared and which can be analyzed. My code finds all the substrings possible of the length of that upper limit and lesser and converts them into a Hash value using a hash function. The code is running smoothly in C++ but when I rewrote the code for CUDA C++ it's just not counting anything, it runs and every time gives "Substring not found!". Also, the CUDA program takes the same time for all cases, which means it's not doing things properly and is stuck in some particular area.
So, if someone can please look at the excerpt of the program and let me know of any possible flaws, it would be beneficial. Here is the CUDA kernel for my program:
Please let me know if more details are needed, I'm happy to discuss.
__global__ void countSubstringsKernel(const char* content, int* substringCount, int contentLength, int maxSubstringLength) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= contentLength) return;
// printf("Block ID: %d, Block Dim: %d, Thread ID: %d\n", blockIdx.x, blockDim.x, threadIdx.x);
// std::cout<<blockIdx.x<<"and"<<blockDim.x<<"and"<<threadIdx.x;
for (int len = 1; len <= maxSubstringLength; ++len) {
int hashValue = 0;
int power = 1;
// compute the hash for the current substring
for (int j = i; j < i + len && j < contentLength; ++j) {
hashValue = (hashValue + (content[j] - 'a' + 1) * power) % MOD;
power = (power * PRIME) % MOD;
}
// atomically increment the hash count
atomicAdd(&substringCount[hashValue], 1);
}
}
Newbie to CUDA here (Undergrad CS/math background), currently optimizing cuda kernel(s). I use Nsight compute and systems.
My target device is unfortunately not the current device and details regarding its architecture/specs is unknown atm.
With the currant kernel, I’m able to obtain max warp occupancy but overall would like to write good code that can support reducing register usage as end device most likely does not support enough registers per thread (for max warp occupancy)
I have a couple of questions, any help would be appreciated :)
I’m considering using 16 bit __halfs but I know CUDA registers are 32 bits. Does NVCC/PTX compiler know to pack 2 __halfs into 1 register? How? Is it better to explicitly use __half2 instead? Does reading/writing to a __half become (equivalent or) more expensive than to a 32 bit float?
Warp shuffling is also used for multiple registers, but I believe shuffling is limited to 32 bits. So shuffling __halfs is a no-go? Is it necessary that we shuffle __half2 and unpack them? Potential costs of this?
I currently use shared memory but with hard coded sizes. Ideally if our device can’t get max warp occupancy with 32 bit variables, I’d like to switch over to 16 bit halfs. And also, if device doesn’t have enough shared mem, I’d like to reduce shared memory into smaller “chunks” where we load smaller portions from global to shared, use it and do tons of computations, then load second batch again, etc (i.e., reuse shared mem). Is this potentially a bad idea? If bad, it’s probably better to just divide the problem into smaller pieces and just load into shared mem once? Or could it be good due to one block having multiple cases of altering between 2 states: high read/write memory and high computation good (Allowing warps waiting on memory operation to be put aside)?
For writing highly optimized yet general CUDA kernels targeting different devices, do you guys have any suggestions? Are launch bounds parameters necessary? Will I have to write separate kernels for devices that can’t reach max occupancy unless I use __halfs? I assume there is no NVCC/PTX compiler flag to automatically convert all 32 bits register variables into 16 bits for a specific kernel? I’ve tried maxrregcount but degrades performance a ton since my 32 bit usage is near max register usage already.
UPDATE: Turns out the issue was with RNG seeding, I didn't realise that time(null) only gave time to the nearest second! Now using randutils to create separate seeds for each thread and its working fine.
I have CUDA simulations I am executing in rapid succession (using python subprocess to run them). In my simulations I have random processes occurring. If I have a one second gap between each run my results are as expected. However, if I do not, then the rate at which random processes occur is incorrect... photos below
I've checked for memory leaks and fixed them, I'm not using more VRAM than my device has. I do have the number of threads set to the number of CUDA cores my device has.
So far I know that normal functioning require between a 0.3-0.7 s gap.
I am running the simulations sequentially for different values of dirTheta (oops forgot to label as radians).
With a one second wait:
With 1 second wait: What I would expect, some random noise
Without a one second wait:
Hi experts, I am looking for advice to move a mutable hash map from host DRAM to GPU HBM.
Currently, we maintain a large lookup hash map in host memory. The hash map will be read during user request servintg time and updated in a background cron job concurrently. The usage of the hash map is as follows. In each user request, it will have a list of ids of some sort. The ids are used to look up tensor data against the hash map. The lookup results are copied to GPU memory for computation for each user request. In this usage pattern, the GPU memory util percentage is not very high.
The optimization we are looking into is to increase the HBM utilization rate and hopefully increase overall performance as well. The main motivation is that the hash map is increasing over time and the host DRAM size might become a bottleneck. Conceptually, we will need to mirror the current operations of the current hash map into a new counterpart that sits in HBM. Specifically, we need something like below (in psuedo code; very high-level):
// request serving time
vector<MyTensor> vec;
for (auto id : ids):
auto tensor_ptr = gpu_lookupMap.get(id)
vec.push_back(tensor_ptr)
gpu.run(vec)
// background update
// step 1: in host memory
Buffer buffer
for (auto record : newUpdates):
buffer.add(record)
// step 2: in gpu memory
gpu_lookupMap.update(hostBuffer)
In this way, host DRAM doesn't need to be big enough to contain the entire hash map but rather big enough to accommodate the temporary buffer during update. We will also increase the ROI on the GPU HBM. So, here are my questions.
Is our intended new flow feasible with CUDA?
What caveats are there for having the hash map (mutated concurrently) in GPU memory?
Thank you in advance for your kind assistance.
Link: https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51119/
The information I have in my head is inconsistent. I thought that block clusters could only group blocks within a single SM, but in this video he implies at past the 12m mark, that they can group up to 16 SMs which'd allow the blocks in a cluster to access up to 3648 kb of shared memory. Nevermind that 224 * 16 is 3584.
Could you set me straight on this?
I am 40 years old and have been working in C,C++ and golang. Recently, got interest in parallel computing. Is that feasible to learn and do I hold chance to getting jobs in the parallel computing field?
Edited: This may be not a CUDA related problem.Running the same multiplication on CPU also results in same excecution time with float and double.
I'm a beginner in CUDA programming, and I'm working on a simple matrix multiplication program. What I found is when I change the input and output variable type from double to float, the time spent on moving data between host and device is halved, but the time spent on kernel execution is almost the same (even with large matrix size). I've already tried using Nsight Compute to profile the program, it confirmed that the two excecution is in float and double respectively, and the excecution time is the almost the same. Does anyone have an idea about this? Thank you.
Hi everyone,
I'm currently in the 2nd year of my master's program. Before starting my graduate studies, I worked for 3 years as a backend web developer, mainly focusing on building and maintaining web services. Recently, I got an exciting opportunity to work as a research assistant under a professor on a GPU-related project. The work involves using CUDA and Kokkos, and it has sparked a genuine interest in GPU programming, low-level development, and parallel computing.
I've been thinking about pivoting my career in this direction, as I feel the web development field has become highly saturated, making it tough to stand out in the current job market (especially as an international student). Even though I'm completely new to this field, I find it incredibly interesting and believe I can learn and grow in it.
My question is:
I’d appreciate any advice, insights, or resources you can share to help me make an informed decision and succeed in this area.
Thank you in advance!
Hi, I'm new to GPU programming and doing research on GEMM optimization. I came across a few online posts ( this and this) that mentions the performance of cuBLASS GEMM is roughly 50TFLOPS. I went on Google Colab to confirm this number using this code (generated by ChatGPT):
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <iostream>
#include <chrono>
void checkCudaError(cudaError_t status, const char* msg) {
if (status != cudaSuccess) {
std::cerr << msg << " Error: " << cudaGetErrorString(status) << std::endl;
exit(EXIT_FAILURE);
}
}
void checkCublasError(cublasStatus_t status, const char* msg) {
if (status != CUBLAS_STATUS_SUCCESS) {
std::cerr << msg << " Error: " << status << std::endl;
exit(EXIT_FAILURE);
}
}
int main() {
const int N = 8192; // Matrix size (N x N)
const float alpha = 1.0f, beta = 0.0f;
// Allocate host memory
float *h_A, *h_B, *h_C;
h_A = new float[N * N];
h_B = new float[N * N];
h_C = new float[N * N];
// Initialize matrices
for (int i = 0; i < N * N; ++i) {
h_A[i] = 1.0f;
h_B[i] = 2.0f;
h_C[i] = 0.0f;
}
// Allocate device memory
float *d_A, *d_B, *d_C;
checkCudaError(cudaMalloc(&d_A, N * N * sizeof(float)), "CUDA malloc failed for d_A");
checkCudaError(cudaMalloc(&d_B, N * N * sizeof(float)), "CUDA malloc failed for d_B");
checkCudaError(cudaMalloc(&d_C, N * N * sizeof(float)), "CUDA malloc failed for d_C");
// Copy data to device
checkCudaError(cudaMemcpy(d_A, h_A, N * N * sizeof(float), cudaMemcpyHostToDevice), "Memcpy to d_A failed");
checkCudaError(cudaMemcpy(d_B, h_B, N * N * sizeof(float), cudaMemcpyHostToDevice), "Memcpy to d_B failed");
// Create cuBLAS handle
cublasHandle_t handle;
checkCublasError(cublasCreate(&handle), "cuBLAS initialization failed");
// Warm-up GEMM to stabilize performance
checkCublasError(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N),
"cuBLAS Sgemm warm-up failed");
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
// Perform GEMM
checkCublasError(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N),
"cuBLAS Sgemm failed");
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &time, start, stop );
printf("Time taken for GEMM: %f ms\n", time);
cudaEventDestroy( start );
cudaEventDestroy( stop );
// Cleanup
delete[] h_A;
delete[] h_B;
delete[] h_C;
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cublasDestroy(handle);
return 0;
}
which output about 209ms for running cublasSgemm kernel. I then calculate the throughput = (2 * M * N * K) / (elapsed_time * 1e12) = (2 * 8192^3) / (0.209 * 1e12) = 5.26 TFLOPS.
Can someone please help clarify this phenomenon? Thank you in advance!
Hi everyone,
My friends and I are working on a project: we have access to a GPU, and we want to ensure that each of us can use the GPU when needed. Do you know of any app that allows us to book time slots? Essentially, we’re looking for a shared calendar that’s convenient and easy to use.
Thanks, everyone!
I am a backend software engineer and a comp science grad . I am interested in learning Cuda but see that the intro books are having obsolete topics as per reviews. Should that matter ? Can I get any suggestions on which book or website to start with for fundamentals?
Hello everyone! I'm a new researcher working on Vulkan Compute Shader issues. I'm trying to reproduce a branch divergence issue on a Vulkan Compute Shader, but confusingly, the versions with and without divergence have the same average runtime. Through my investigation, I found an interface in NVAPI called NvReorderThread, and I'm wondering if this might be the reason why the issue can't be reproduced.
My questions are:
I would greatly appreciate any responses!
I initialized an array as
_FTYPE_ c_arr[64] = {0.0};
When I try to call c_arr[8] to write it to global memory, I get Cuda error: Error in matrixMul kernel: an illegal memory access was encountered. However, if I just write c_arr[0] to memory, it works. Does anyone know why this might be?
I am trying to export my code as an exe with static libraries, so I can use it on any system with a GPU in the office. Unfortunately, I can’t find the static libraries in my install. When I try to reinstall CUDA, there is no option to install static libraries. Have I completely misunderstood static libraries in CUDA? Do I need to get them elsewhere? Can the dynamic libraries be used as static libraries? I’d appreciate any help.
I am trying to implement this CUTLASS version of linear algebra matrix multiplication found here: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
I was wondering if anyone understood what BlockItemsK would be in this picture where the tile from A is 128x8 and the tile from B is 8x128:
This is the incomplete sample code found on the site:
// Device function to compute a thread block’s accumulated matrix product
__device__ void block_matrix_product(int K_dim) {
// Fragments used to store data fetched from SMEM
value_t frag_a[ThreadItemsY];
value_t frag_b[ThreadItemsX];
// Accumulator storage
accum_t accumulator[ThreadItemsX][ThreadItemsY];
// GEMM Mainloop - iterates over the entire K dimension - not unrolled
for (int kblock = 0; kblock < K_dim; kblock += BlockItemsK) {
// Load A and B tiles from global memory and store to SMEM
//
// (not shown for brevity - see the CUTLASS source for more detail)
...
__syncthreads();
// Warp tile structure - iterates over the Thread Block tile
#pragma unroll
for (int warp_k = 0; warp_k < BlockItemsK; warp_k += WarpItemsK) {
// Fetch frag_a and frag_b from SMEM corresponding to k-index
//
// (not shown for brevity - see CUTLASS source for more detail)
...
// Thread tile structure - accumulate an outer product
#pragma unroll
for (int thread_x = 0; thread_x < ThreadItemsX; ++thread_x) {
#pragma unroll
for (int thread_y=0; thread_y < ThreadItemsY; ++thread_y) {
accumulator[thread_x][thread_y] += frag_a[y]*frag_b[x];
}
}
}
__syncthreads();
}
}
Hello everyone!
I'm a university student and I write a FEM code as research. First I have writed an Octave code for it, but because of the performance I have rewrote it to C++. The code itself has a lot of matrix operations so I started using Cuda for the matrices. I have a pc with an RTX 2060(12GB), however I need a laptop. I have to do some of the coding in the university. There are ocasions, where I have to run a quick test for my code, to show it to my professors. Internet is not always available in the university. That's why I need a cuda capable laptop. I would like to ask for some advice, what kind of laptop should I buy? My budget is 1000USD at max, but preferebly less than that. Would a used, but not so old workstation with a T-series(with about 4GB) GPU be enough or should I choose a 5 years old workstation with an RTX4000? Or maybe a new gaming laptop with like an RTX 4050 or 4060 would be better? I have some future plans/project ideas for honing my cuda skills, so I want it to be a long-time investment.
Hi! Do you know of any updated resources or examples that use CUDA with inheritance and polymorphism? I've searched, and most sources say that virtual functions are not supported, but the information is quite old. I tried migrating from inheritance to AoS + tagged union, but I have a very large structure that isn't used often, so this approach isn't ideal for my use case.
I have deep interest in High Performance Computing and Reinforcement Learning. Should I learn CUDA programming to kickstart my journey. Currently, I am a python developer and have worked with CPP before. Please advise.
Hi all, I have a few GPUs left over from mining, and I’m interested in starting a small-scale GPU-as-a-service. My goal is to set up a simple, side income that could help pay off my credit cards, as I already have a primary job.
What steps are needed for getting started with a small-scale GPU-as-a-service business focused on machine learning or AI? Any insights would be greatly appreciated!
Thanks in advance for any advice you can share!