/r/CUDA
Hi, does anyone know of a full list of all the errors/warnings that the compute-sanitizer program can give you and explanations for each? Searches around the documentation didn't yield anything.
I'm getting a warning that just says "Empty malloc", and I'm hoping there's some documentation somewhere to go along with this warning because I'm at a total loss.
Edit: I didn't find any explanation for that message, but I solved the bug. I was launching too many threads and I was running out of registers. I assume "empty malloc" means it tried to malloc but didn't have any space.
Hi! I'm interested in learning more about GPU programming. I know enough CUDA C++ to do memory copies to/from host and device, but not much more. I'm also not awesome with C++, but I do want to find something with hands-on practice or sample code, since that's usually how I learn coding best.
I'm curious to know if anyone has done either of these two and has any thoughts on them? Money won't be an issue since I have around $200 in a small grant, so that can cover the $90 for the NVIDIA course or a Coursera Plus subscription. I'd just love to know which one is better and/or more helpful for someone with a non-programming background who's picked up programming for their STEM degree.
(I'm also in the tech job market right now and not getting very favorable responses, so any way to make me stand out as an applicant is a plus, which is why I thought being good-ish at CUDA or GPGPU would be useful.)
Heyy guys,
I am currently learning deep learning and wanted to explore CUDA. Can you guys suggest a good roadmap with resources?
Hi everyone,
I’m currently working on a project based on Auto1111SDK, and I’m aiming to modify it to work with Zluda, a solution that supports AMD GPUs.
I found another project where this setup works: stable-diffusion-webui-amdgpu. This shows it should be possible to get Auto1111SDK running with Zluda, but I’m currently missing the know-how to adjust my project accordingly.
Does anyone have experience with this or know the steps necessary to adapt the Auto1111SDK structure for Zluda? Are there specific settings or dependencies I should be aware of?
Thanks a lot in advance for any help!
I am in need of a platform where I can host a Docker image / container and benchmark some CUDA operations with an Nvidia GPU.
I am looking for free for students or relatively cheap solutions.
Hello! I’ve been exploring the possibility of rewriting some C/C++ functionalities (large vectors +,*,/,^) using CUDA for a while. However, I’m also considering the option of using multithreading. So a natural question arises… how do I calculate or determine whether CUDA or multithreading is more advantageous? At what computational threshold can we say that CUDA is worth bringing into play? Okay, I understand it’s about a “very large number of calculations,” but how do I determine that exact number? I’d prefer not to test both options for all functions/methods and make comparisons—I’d like an exact way to determine it or at least a logical approach. I say this because, at a small scale (what is that level?), there’s no real difference in terms of timing. I want to allocate resources correctly, avoiding their use where problems can be solved differently. Essentially, I aim to develop robust applications that involve both GPU CUDA and CPU multithreading. Thanks!
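There is no universal threshold you can compute in advance; a common approach is to measure a representative operation at a few sizes and see where the GPU (including any transfer cost) pulls ahead. Below is a minimal, hedged timing sketch for a vector add, single CPU thread versus one kernel launch, with a hypothetical size `n` that you would sweep:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 24;                      // hypothetical size; sweep this
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // CPU baseline (single thread)
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("CPU: %.3f ms (check %.1f)\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(), c[n / 2]);

    // GPU version, timed with events. This times only the kernel; include the
    // host<->device copies if your real workload has to pay for them every call.
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    addKernel<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU kernel: %.3f ms\n", ms);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

In practice, element-wise operations like +, *, /, ^ are memory-bound, so the crossover depends mostly on whether the data already lives on the GPU; if every call has to copy the vectors over PCIe first, the break-even size is much larger.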
I'm currently working through this book. However, I couldn't find any solutions for it, so I decided to write the solutions myself. I hope you guys can come and help correct my answers to make them perfect!
I will keep making solutions :)
The cuSparse function I use to solve the forward-backward substitution problem (triangular matrices), cusparseSpSM_solve(), doesn't work for large matrices: it sets the first value in the resulting vector to INF. Curiously, this only happens with the very first value in the resulting vector. I created a function to generate random, large SPD matrices and determined that any matrix with values outside of the main diagonal and with a dimension of 641x641 or larger has the same problem. Any matrix of 640x640 or smaller, or which consists only of values on the main diagonal, works just fine. The cuSparse function in question is opaque; I can't see what's happening in the background, only the input and output.
I have confirmed that all inputs are correct and that it is not a memory issue. Finally, the function does not return an error, it simply sets the one value to INF and continues.
I can find no reason that the size of the matrix should influence the result, why the dimensions of 641x641 are relevant, why none of the cuSparse functions are throwing errors, or why this only happens to the very first value in the resulting vector. The Nvidia memcheck tool/CUDA sanitizer runs my code without returning any errors as well.
Writing this post just to share an interesting blog post I found while watching the freeCodeCamp CUDA course.
The blog post explains "How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance."
Even though trying to mimic cuBLAS is pointless (just go ahead and use cuBLAS), the content of the post is very educational. I'm learning new concepts about GPU optimization and thought it would be a good share for this subreddit. Bye!
Hello, I am working on an OpenGL engine that I want to extend with CUDA for a particle-based physics system. Today I spent a few hours trying to get everything set up, but every time I try to compile any .cu file, I get hundreds of errors inside "cuda_fp16.hpp", which is part of the CUDA SDK.
The errors mostly look like missing ")" symbols or unknown symbols such as "__half".
Has anyone maybe got similar problems?
I am using Visual Studio 2022, an RTX 4070 with the latest NVidia driver and the CUDA Toolkit 12.6 installed.
I can provide more information, if needed.
Edit #2: I was able to solve the issue. I followed @shexaholas's suggestion and included the faulty file myself. After also including 4 more CUDA files from the toolkit, the application now compiles successfully!
Edit: I am not including the cuda_fp16.hpp header by myself. I am only including:
<cuda_runtime.h>
<thrust/version.h>
<thrust/detail/config.h>
<thrust/detail/config/host_system.h>
<thrust/detail/config/device_system.h>
<thrust/device_vector.h>
<thrust/host_vector.h>
Hi, I am super new to CUDA and C++. While applying for ML and related jobs I noticed that several of these jobs require C++ these days. I wonder why? As CUDA is C-based, why don't they ask for C instead? Any leads would be appreciated, as I am a beginner deciding whether to learn CUDA with C or C++. I have learnt Python, C, and Java in the past, but I am not familiar with C++. So before diving in, I want to ask your opinion.
Also, do you have any GitHub resources to learn from that you recommend? I am right now going through https://github.com/CisMine/Parallel-Computing-Cuda-C and plan to study the book "Programming Massively Parallel Processors: A Hands-on Approach" with the https://www.youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6SrgdeSiuf9Tb videos. Any other alternatives you would suggest?
PS: I am currently unemployed, trying to become employable with more skills and better projects. So any help is appreciated. Thank you.
Edit: Thank you very much to all you kind people. I was hoping that C would do, but reading your comments motivates me towards C++. I will try my best to learn it by Christmas this year. You all have been very kind. Thank you so much.
I was going through the freeCodeCamp YouTube video on CUDA, and I don't understand why we aren't using cudaStreamSynchronize for stream1 & stream2 after line 50 (before the kernel launch). How did not synchronizing the streams here still give the correct output?
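For context, here is a minimal sketch of the pattern in question, assuming (as in a default nvcc build) that stream1 and stream2 are ordinary blocking streams and the kernel is launched on the legacy default stream, which implicitly waits for prior work in those streams:

```cuda
#include <cuda_runtime.h>

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    float *h_a, *h_b, *d_a, *d_b, *d_c;
    cudaMallocHost(&h_a, N * sizeof(float));   // pinned, so the async copies can overlap
    cudaMallocHost(&h_b, N * sizeof(float));
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_c, N * sizeof(float));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    // No cudaStreamSynchronize here: a kernel launched on the legacy default
    // stream does not begin until prior work in other blocking streams
    // (stream1 and stream2 here) has finished, so it still reads complete data.
    addVectors<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
    cudaDeviceSynchronize();

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

That implicit synchronization goes away if the streams are created with cudaStreamNonBlocking or the code is compiled with --default-stream per-thread; in that case an explicit cudaStreamSynchronize (or event wait) before the kernel launch would be needed.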
Hi, I'm a physicist and I'm working with numerical integration. So far I have managed to run N parallel simulations using a kernel launch like Integration<<<1, N>>>, i.e. one block of N simulations (in this case N = 1024), and this is working fine.
But now I'm parallelizing over the parameters. There is a 2D parameter space, and for each point of this parameter space I want to run 1024 simulations. In this case the kernel would run something like:
dim3 gridDim(A2_cols, p_rows);
get_msd<<<gridDim, N>>>(d_X0S, d_Y0S, d_AS, d_PS, d_MSD);
// the arguments are the initial conditions and the parameters on the device
// d_MSD is an A2_cols x p_rows x T 3D array, where some value is added at each step of the simulation
but something is not working right with the allocation of blocks and threads. How many blocks can I allocate in the grid while keeping the 1024 simulations per parameter point? (See the sketch below.)
thanks
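A minimal indexing sketch of one way to lay this out (all names are illustrative, not the original code): keep the 1024 simulations in threadIdx.x, which is the per-block thread limit, and spread the parameter space over blockIdx.x and blockIdx.y, so the grid needs A2_cols x p_rows blocks:

```cuda
__global__ void get_msd_sketch(float *d_MSD, int A2_cols, int p_rows, int T) {
    int px = blockIdx.x;   // parameter index along x (0 .. A2_cols - 1)
    int py = blockIdx.y;   // parameter index along y (0 .. p_rows - 1)
    // threadIdx.x (0 .. 1023) is the simulation index for this parameter point.

    for (int t = 0; t < T; t++) {
        float contribution = 1.0f;  // placeholder for one simulation's value at step t
        // All 1024 simulations of a parameter point write to the same (px, py, t)
        // cell of the A2_cols x p_rows x T array, so accumulate atomically.
        atomicAdd(&d_MSD[(py * A2_cols + px) * T + t], contribution);
    }
}

// Launch configuration:
//   dim3 gridDim(A2_cols, p_rows);
//   get_msd_sketch<<<gridDim, 1024>>>(d_MSD, A2_cols, p_rows, T);
```

With this layout the grid is bounded only by the hardware grid-dimension limits (2^31-1 in x, 65535 in y and z), so the number of blocks itself is rarely the constraint.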
Hey All, I am struggling to understand optimizations made to naive matrix multiplication.
My kernel looks like this
// Assuming square matrices for simplicity
__global__ void matrixMult(int* A, int* B, int* C, int dimension)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * dimension + col;
if (idx < dimension * dimension) {
int temp = 0;
for (int i = 0; i < dimension; i++) {
temp = temp + A[row * dimension + i] * B[i * dimension + col];
}
C[idx] = temp;
}
}
// Kernel Launch Configs
dim3 block(32, 32);
dim3 grid(CEIL_DIV(dimension, 32), CEIL_DIV(dimension, 32));
matrixMult <<<grid, block >>> (dev_A, dev_B, dev_C, dimension);
A lot of tutorials online say this suffers from un-coalesced memory access to matrix A, and then proceed to change it using different indexing or shared memory. But here, consecutive threads that are calculating a row of C will all access the same row of A (which will get broadcast?), and they access consecutive columns of B, which will be coalesced. Also, a block dimension of 32 ensures adjacent threads along x end up in the same warp. I am sure there's something wrong with my understanding, so let me know. Thanks.
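For reference, this is the shared-memory tiling step most of those tutorials move to next; a minimal sketch, assuming dimension is a multiple of 32:

```cuda
#define TILE 32

// Each block loads a TILE x TILE tile of A and B into shared memory, so every
// global load is coalesced and each loaded element is reused TILE times.
__global__ void matrixMultTiled(const int *A, const int *B, int *C, int dimension)
{
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int temp = 0;

    for (int t = 0; t < dimension / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * dimension + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * dimension + col];
        __syncthreads();
        for (int k = 0; k < TILE; k++)
            temp += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * dimension + col] = temp;
}
```

Whether the naive version is "uncoalesced" depends on the indexing convention a given tutorial uses; with row mapped to y and col to x as above, the warp's loads of B are coalesced and its loads of A hit the same address (a broadcast), so the tiled version mainly buys data reuse rather than fixing broken coalescing.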
Hello!
I am facing issues while installing and using PyTorch with CUDA support on my computer. Here are some details about my system and the steps I have taken:
### System Information:
- **Graphics Card:** NVIDIA GeForce GTX 1050 Ti
- **NVIDIA Driver Version:** 566.03
- **CUDA Version (from nvidia-smi):** 12.7
- **CUDA Version (from nvcc):** 11.7
### Steps Taken:
I installed Anaconda and created an environment named `pytorch_env`.
I installed PyTorch, torchvision, and torchaudio using the command:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
```
I checked the installation by running Python and executing the following commands:
```python
import torch
print(torch.__version__) # PyTorch Version: 2.4.1
print(torch.cuda.is_available()) # CUDA Availability: False
```
### Problem:
Even though PyTorch is installed, CUDA availability returns `False`. I have checked the NVIDIA drivers and the installation of the CUDA Toolkit, but the issue persists.
### Questions:
How can I properly configure PyTorch to work with CUDA?
Do I need to install a different version of PyTorch or NVIDIA drivers to resolve this issue?
Are there any additional steps I could take to troubleshoot this problem?
I would appreciate any help or advice!
Let's say you're traversing a tree. For recursion, you'll have to run the same function n times, and for iteration, you'll have to run the same loop n times. The threads will still end at different times, so where is the increased divergence?
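For concreteness, a sketch of the iterative form with an explicit per-thread stack (the array-based tree layout and all names here are hypothetical): in both the recursive and the iterative version, the divergence comes from threads leaving the loop, or the recursion, at different depths, so the practical difference is mainly the cost of real call-stack frames in local memory rather than extra divergence per se.

```cuda
// Hypothetical layout: each tree is stored as left[]/right[] child indices,
// with -1 meaning "no child"; thread t traverses the tree rooted at roots[t].
__global__ void countNodes(const int *left, const int *right,
                           const int *roots, int *counts, int numTrees) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTrees) return;

    int stack[64];              // explicit stack in local memory instead of call frames
    int top = 0;
    stack[top++] = roots[t];

    int count = 0;
    while (top > 0) {           // threads with deeper/larger trees iterate longer
        int node = stack[--top];
        if (node < 0) continue; // skip missing children
        count++;
        stack[top++] = left[node];
        stack[top++] = right[node];
    }
    counts[t] = count;
}
```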
Hi All,
I'm hoping to get some feedback on a Monte Carlo simulation I've set up in CUDA. I'm an experienced Python developer but new to C/C++ & CUDA. I'm running this locally on a 4060. I'm relatively comfortable that the code is working and it's completing ~2.5b simulations in a little over a second.
I'm not at all sure I'm doing the right thing with respect to memory, and I'm interested in any feedback on other optimizations I can implement here, on both the C++ and the CUDA side. My next step will be to figure out how to use Nsight Compute and profile it further there.
I'm simulating legs of the board game "Camel Up". In this game, the camels move around a track and can "stack" on top of each other. If a camel at the bottom of the stack moves, it carries all camels on top of it forward. Each camel is selected to roll & move once per leg and the dice are uniformly distributed between 1 and 3. When all camels have moved, the leg is over. I want to recover the probabilities of each camel winning the leg based upon the current board state.
Any help you can give would be much appreciated! Thanks in advance:
#include <curand.h>
#include <curand_kernel.h>
#include <iostream>
#define DICE_MIN 1
#define DICE_MAX 3
#define NUM_CAMELS 5
#define FULL_MASK 0xffffffff
__global__ void setup_kernel(curandState *state) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
curand_init((unsigned long long)clock() + idx, idx, 0, &state[idx]);
}
template <typename T>
__global__ void camel_up_sim(curandState *state, const int *positions,
const bool *remaining_dice, const int *stack,
T *results, const T local_runs) {
int thread_idx = threadIdx.x;
int idx = blockIdx.x * blockDim.x + thread_idx;
__shared__ T shared_results[NUM_CAMELS];
if (thread_idx < NUM_CAMELS) { // guard on the in-block index so every block zeroes its shared results, not just block 0
shared_results[thread_idx] = 0;
}
__syncthreads();
T thread_results[NUM_CAMELS] = {0};
// Save the global variables in the local thread
// so we can reuse them without having to re-read globally.
int saved_local_positions[NUM_CAMELS];
bool saved_local_dice[NUM_CAMELS];
int saved_local_stack[NUM_CAMELS];
for (int i = 0; i < NUM_CAMELS; i++) {
saved_local_positions[i] = positions[i];
saved_local_dice[i] = remaining_dice[i];
saved_local_stack[i] = stack[i];
}
// Instantiate versions of this that can be used within the
// simulation.
int local_positions[NUM_CAMELS];
bool local_dice[NUM_CAMELS];
int local_stack[NUM_CAMELS];
int dice_remaining;
int camel_to_move;
int roll;
int camel_on_top;
int winner;
for (int r = 0; r < local_runs; r++) {
// Begin one simulation
dice_remaining = 0;
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
// reset local arrays back to saved initial state.
local_positions[i] = saved_local_positions[i];
local_dice[i] = saved_local_dice[i];
local_stack[i] = saved_local_stack[i];
if (local_dice[i] == 1) {
dice_remaining++;
}
}
while (dice_remaining > 0) {
// Figure out which camel should be moved.
do {
camel_to_move = curand(&state[idx]) % NUM_CAMELS;
} while (!local_dice[camel_to_move]);
// Roll that camel's dice to see how far it moves.
roll = curand(&state[idx]) % DICE_MAX + 1;
// move that camel and set its dice as rolled.
local_positions[camel_to_move] += roll;
local_dice[camel_to_move] = 0;
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
// If anyone was on the space the stack moved to, make that camel point
// to the bottom of the new stack
if ((i != camel_to_move) &&
(local_positions[i] == local_positions[camel_to_move]) &&
(local_stack[i] == -1)) {
local_stack[i] = camel_to_move;
} else if ((local_stack[i] == camel_to_move) &&
(local_positions[i] < local_positions[camel_to_move])) {
// If anyone pointed to camel_to_move and is on a previous space
// then make them uncovered.
local_stack[i] = -1;
}
}
camel_on_top = local_stack[camel_to_move];
// Move anyone who is on top of the camel that's moving
while (camel_on_top != -1) {
local_positions[camel_on_top] += roll;
// moved_camels[camel_on_top] = 1;
camel_on_top = local_stack[camel_on_top];
}
dice_remaining--;
}
winner = 0;
#pragma unroll
for (int i = 1; i < NUM_CAMELS; i++) {
if (local_positions[i] > local_positions[winner]) {
winner = i;
}
}
while (local_stack[winner] != -1) {
winner = local_stack[winner];
}
thread_results[winner] += 1;
}
// Start collecting the results from all the threads.
// Start by shuffling down on a warp basis.
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
for (int offset = 16; offset > 0; offset /= 2) {
thread_results[i] +=
__shfl_down_sync(FULL_MASK, thread_results[i], offset);
}
// If it's the first thread in a warp - report the result to shared memory.
if (thread_idx % 32 == 0) {
atomicAdd(&shared_results[i], thread_results[i]);
}
}
__syncthreads();
// Report block totals back to the global results variable.
if (thread_idx == 0) {
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
atomicAdd(&results[i], shared_results[i]);
}
}
}
template <typename T> void printArray(T arr[], int size) {
std::cout << "[";
for (int i = 0; i < size; i++) {
std::cout << arr[i];
if (i < size - 1) {
std::cout << (", ");
}
}
std::cout << "]\n";
}
int main() {
using T = unsigned long long int;
std::cout << "Starting program..." << std::endl;
constexpr int BLOCKS = 24 * 4; // Four per SM on the 4060
constexpr int THREADS = 256;
constexpr int RUNS_PER_THREAD = 100000;
// Without casting one of these to unsigned long long int then this can
// overflow integer multiplication and return something nonsensical.
constexpr unsigned long long int N =
static_cast<unsigned long long int>(BLOCKS) * THREADS * RUNS_PER_THREAD;
std::cout << "N: " << std::to_string(N) << std::endl;
std::cout << "Creating host variables..." << std::endl;
int positions[NUM_CAMELS] = {0, 0, 0, 0, 0};
bool remainingDice[NUM_CAMELS] = {1, 1, 1, 1, 1};
int stack[NUM_CAMELS] = {1, 2, 3, 4, -1};
T *results;
results = (T *)malloc(NUM_CAMELS * sizeof(T));
std::cout << "Creating device pointers..." << std::endl;
int *d_positions;
bool *d_remainingDice;
int *d_stack;
T *d_results;
curandState *d_state;
cudaMalloc((void **)&d_state, BLOCKS * THREADS * sizeof(curandState));
std::cout << "Setting up curand states..." << std::endl;
setup_kernel<<<BLOCKS, THREADS>>>(d_state);
std::cout << "Allocating memory on device..." << std::endl;
cudaMalloc((void **)&d_positions, NUM_CAMELS * sizeof(int));
cudaMalloc((void **)&d_results, NUM_CAMELS * sizeof(T));
cudaMalloc((void **)&d_remainingDice, NUM_CAMELS * sizeof(bool));
cudaMalloc((void **)&d_stack, NUM_CAMELS * sizeof(int));
cudaMemset(d_results, 0, NUM_CAMELS * sizeof(T));
std::cout << "Copying to device..." << std::endl;
cudaMemcpy(d_positions, positions, NUM_CAMELS * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_remainingDice, remainingDice, NUM_CAMELS * sizeof(bool),
cudaMemcpyHostToDevice);
cudaMemcpy(d_stack, stack, NUM_CAMELS * sizeof(int), cudaMemcpyHostToDevice);
std::cout << "Starting sim..." << std::endl;
camel_up_sim<T><<<BLOCKS, THREADS>>>(d_state, d_positions, d_remainingDice,
d_stack, d_results, RUNS_PER_THREAD);
cudaDeviceSynchronize();
std::cout << "Copying results back..." << std::endl;
cudaMemcpy(results, d_results, NUM_CAMELS * sizeof(T),
cudaMemcpyDeviceToHost);
std::cout << "Results are:" << std::endl;
printArray(results, NUM_CAMELS);
float probs[NUM_CAMELS];
constexpr float N_float = static_cast<float>(N);
for (int i = 0; i < NUM_CAMELS; i++) {
probs[i] = static_cast<float>(results[i]) / N_float;
}
std::cout << "Probabilities are..." << std::endl;
printArray(probs, NUM_CAMELS);
cudaFree(d_positions);
cudaFree(d_results);
cudaFree(d_remainingDice);
cudaFree(d_state);
cudaFree(d_stack);
free(results);
}
I heard that in a newer version of CUDA you can allocate dynamic memory inside a kernel, for example:
__global__ void foo(int x) {
float* myarray = new float[x];
delete[] myarray;
}
So you can basically use both new (keyword) and malloc (function) within a kernel. But my question is: if we can allocate dynamic memory within a kernel, why can't I call cudaMalloc within a kernel too? Also, is the allocated memory in shared memory or global memory? And is it efficient to do this?
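A minimal sketch of what the runtime allows here, as far as I know: in-kernel new/delete and malloc/free both draw from a device heap that lives in global memory (not shared memory), and its size can be raised from the host with cudaDeviceSetLimit. cudaMalloc itself is a host-side API, which is why it cannot simply be called from device code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Both forms below allocate from the device heap, which is carved out of
// global memory; shared memory is never used for these allocations.
__global__ void foo(int x) {
    float *a = new float[x];                        // per-thread allocation
    float *b = (float *)malloc(x * sizeof(float));  // same heap, C-style
    if (a != nullptr && b != nullptr) {
        a[0] = 1.0f;
        b[0] = 2.0f;
    }
    delete[] a;
    free(b);
}

int main() {
    // The device heap defaults to a small size (about 8 MB); enlarge it before
    // the first launch if kernels will allocate a lot.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024 * 1024);
    foo<<<1, 32>>>(16);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

As for efficiency: per-thread heap allocation serializes on the device allocator, so for performance-sensitive code it is usually better to cudaMalloc one large buffer from the host and have each thread index into its own slice.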
I followed the following steps to set up a conda environment with Python 3.8, CUDA 11.8 and PyTorch 2.4.1:
$ conda create -n py38_torch241_CUDA118 python=3.8
$ conda activate py38_torch241_CUDA118
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Python and pytorch seem to have installed correctly:
$ python --version
Python 3.8.20
$ pip list | grep torch
torch 2.4.1
torchaudio 2.4.1
torchvision 0.20.0
But when I try to check the CUDA version, I realise that nvcc is not installed:
$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
This also caused issues in the further setup of some git repositories which require nvcc. Do I need to run sudo apt install nvidia-cuda-toolkit as suggested above? Shouldn't the above conda install command install nvcc? I tried these steps again after completely deleting all conda packages and environments, but no help.
Below is some relevant information that might help debug this issue:
$ conda --version
conda 24.5.0
$ nvidia-smi
Sat Oct 19 02:12:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name User-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 2000 Ada Gene... Off | 00000000:01:00.0 Off | N/A |
| N/A 48C P0 588W / 35W | 8MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1859 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
$ which nvidia-smi
/usr/bin/nvidia-smi
Note that my machine runs an NVIDIA RTX 2000 Ada Generation GPU. Also, the above nvidia-smi command says I am running CUDA 12.4. I installed this driver manually long back, when I did not have conda on the machine.
I tried setting the CUDA_HOME path to my conda environment, but no help:
$ export CUDA_HOME=$CONDA_PREFIX
$ echo $CUDA_HOME
/home/User-M/miniconda3/envs/FairMOT_py38_torch241_CUDA118
$ which nvidia-smi
/usr/bin/nvidia-smi
$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
Hi everyone,
I’m working on simulations that iterate 10,000,000 times and want to optimize these calculations using CUDA on my GPU. Here are my details:
Questions:
Any advice or insights would be greatly appreciated!
Thanks!
I'm trying to compute a low-pass filter of a 50M-point transform using cuFFTDx. The problem is that it seems to limit me to input sizes of 1 << 14. There's no documentation or usage example with large inputs, and I'm trying to understand how people approach this problem. Sure, I can compute a bunch of FFT blocks over the 50M-point space, but am I then supposed to somehow combine the blocks into a single FFT to get the correct values? There's something I'm not understanding.
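cuFFTDx is designed for block-scoped FFTs that fit in shared memory, which is where a size ceiling around 1 << 14 comes from; combining block FFTs into one large transform by hand amounts to re-implementing a multi-stage (e.g. four-step Cooley-Tukey) decomposition. If the end goal is simply one 50M-point low-pass filter, a hedged alternative is the regular host-side cuFFT library, which handles that length in a single plan; a minimal sketch, with the actual filtering kernel left as a comment:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 50 * 1000 * 1000;            // full signal length (~400 MB of cufftComplex)
    cufftComplex *d_data;
    cudaMalloc(&d_data, n * sizeof(cufftComplex));
    // ... fill d_data with the signal ...

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);       // one 50M-point complex-to-complex transform

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    // ... launch a small kernel here that zeroes the bins above the cutoff ...
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    // cuFFT is unnormalized: scale the result by 1.0f / n afterwards.

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```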
Hello everyone. I’ve been working on implementing a parallelizable cipher using CUDA. I’ve got it working with small inputs, but larger inputs cause the kernel to exit early (with seemingly only a few threads even able to start work).
It’s a block cipher (AES-ECB) so each block of 16 bytes can be encrypted in parallel. An input of size 40288 bytes completes just fine, but an input of size 40304 bytes (so just one more block) exits with this error code. The program outputs that an illegal memory access was encountered, but running an nsys profile on it shows the aforementioned error code, which as per some googling seems to mean anything from stack overflow to running out of memory on the GPU (or perhaps these are the same thing said differently).
I’m quite sure I’m not stepping out of bounds in my code, because the smaller inputs work, even one that is only 16 bytes smaller. There’s no recursion in my code. I pass the 40304-byte input into a kernel which uses a grid-stride pattern to assign 16-byte blocks to each thread block. I suppose my main question is: is there anything I can do about this? I’m only using inputs of this size for the sake of performance testing and nothing more, so it’s not a big deal. I’d just like to be able to see for myself (and not just in concept) how scalable the parallel processing is compared to a pure serial approach.
All the best. Thanks for your time.
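For reference, a minimal sketch of the grid-stride pattern being described, with one cipher block handled per thread here and a stand-in encryptBlock in place of the real AES rounds (both names are illustrative, not the original code):

```cuda
// Hypothetical per-block transform standing in for the real AES-128-ECB rounds.
__device__ void encryptBlock(unsigned char *block, const unsigned char *roundKeys) {
    for (int i = 0; i < 16; i++) block[i] ^= roundKeys[i];
}

__global__ void ecbEncrypt(unsigned char *data, const unsigned char *roundKeys,
                           size_t numBlocks) {
    // Grid-stride loop: any grid size covers any input size, so a larger
    // input never requires launching more blocks than the device allows.
    for (size_t b = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         b < numBlocks;
         b += (size_t)gridDim.x * blockDim.x) {
        encryptBlock(data + b * 16, roundKeys);
    }
}
```

A layout like this keeps the launch configuration independent of the input size, so growing the input by one block should not change how much memory any single thread touches; running compute-sanitizer on the exact failing size is usually the quickest way to pin down the offending access.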
I have installed the CUDA Toolkit and Visual Studio with Nsight, but I can't get IntelliSense to stop giving me a tonne of errors (only stdio.h is required to run this code; I am using these to mitigate other errors). This is the example from https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/. What do I do to get this to stop showing errors?
From tutorials for the Mandelbrot set, I can see only simple shapes, with minimal divergence between pixels on average. For an experiment, I need a really chaotic region of the map where any two adjacent pixels have a large difference in iteration count.
Thanks in advance.
I am from Brazil, and in my country there's rarely any position for a C++ dev, and the situation is even worse for C++ GPGPU devs. I come from a Python + deep learning background and, despite having 4 years on the market, I have no work experience with C++ or CUDA, which is a prerequisite for all of the positions I've encountered so far.
How can I get this experience? How can I get myself into C++/CUDA situations that will count as work experience while being unemployed? I thought of personal projects, but it is hard to come up with ideas with so little experience.
PS.: it's been about 2 months since I started to code with CUDA.
Hello everyone! I am a beginner with CUDA, and I was tasked with using it to run a Monte Carlo simulation to find the probability of N dice rolls adding up to 3*N. This is the code I've written for it; however, it keeps returning a chance of 0. Does anyone know where the issue is?
I have used each thread to simulate a dice roll and then added up each set of N dice roll results to check if they add up to 3*N.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand.h>
#include <curand_kernel.h>
#include "thrust/device_vector.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define MIN 1
#define MAX 6
int N = 3; //Number of dice
int runs = 1000; //Number of runs
int num = N * runs;
__global__ void estimator(int* gpudiceresults, int num, int N, float* chance_d) {
//Calculating number of runs
int runs = N * num;
//indexing
int i = blockIdx.x * blockDim.x + threadIdx.x;
//Setting up cuRAND
curandState state;
curand_init((unsigned long long)clock() + i, i, 0, &state);
//Dice rolls, N dice times number of runs
if (i < num) {
gpudiceresults[i] = int(((curand_uniform(&state))*(MAX-MIN+ 0.999999))+MIN);
}
//Summing up every N dice rolls to check if they add up to 3N
int count = 0;
for (int j = 0; j < num; j+=N) {
int temp_sum = 0;
for (int k = j; k < N; k++) {
temp_sum += gpudiceresults[k];
}
if (temp_sum == 3 * N) {
count++;
}
}
//Calculating the chance of it being 3N
*chance_d = float(count) / float(runs);
return;
}
int main() {
//Blocks and threads
int THREADS = 256;
int BLOCKS = (N*runs + THREADS - 1) / THREADS;
//Initializing variables and copying them to the device
float chance_h = 0; //Chance variable on host
float* chance_d; //Pointer to chance variable on device
cudaMalloc(&chance_d, sizeof(chance_h));
cudaMemcpy(chance_d, &chance_h, sizeof(chance_h), cudaMemcpyHostToDevice);
int* gpudiceresults = 0;
cudaMalloc(&gpudiceresults, num * sizeof(int));
estimator <<<BLOCKS, THREADS >>> (gpudiceresults, num, N, chance_d);
cudaMemcpy(&chance_h, chance_d, sizeof(chance_h), cudaMemcpyDeviceToHost);
//cudaMemcpy(count_h, count_d, sizeof(count_d), cudaMemcpyDeviceToHost);
//count_h = *count_d;
//cudaFree(&gpudiceresults);
//float chance = float(*count_h) / float(runs);
std::cout << "the chance is " << chance_h << std::endl;
return 0;
}
I am pretty new to CUDA programming and even C++ (learnt it last week), so any criticism is accepted. I know my code isn't the best and there might be many dumb mistakes, so I'm looking forward to any suggestions on how to make it better.
Thank you.
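Not an answer to where the specific bug is, but for comparison, here is a minimal sketch of the structure this kind of estimator usually takes: one thread per complete run of N dice, with successful runs counted through a single atomic counter, so no thread ever has to loop over another thread's results (a fixed seed and a modulo draw are used below just to keep the sketch short):

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

#define N_DICE 3

// One thread = one complete run of N_DICE rolls; hits are counted atomically
// in global memory.
__global__ void estimator(unsigned long long *hits, int runs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= runs) return;

    curandState state;
    curand_init(1234ULL, i, 0, &state);

    int sum = 0;
    for (int d = 0; d < N_DICE; d++) {
        sum += 1 + (curand(&state) % 6);   // one die roll in [1, 6] (tiny modulo bias, fine for a sketch)
    }
    if (sum == 3 * N_DICE) {
        atomicAdd(hits, 1ULL);
    }
}

int main() {
    const int runs = 1 << 20;
    unsigned long long *d_hits, h_hits = 0;
    cudaMalloc(&d_hits, sizeof(unsigned long long));
    cudaMemset(d_hits, 0, sizeof(unsigned long long));

    estimator<<<(runs + 255) / 256, 256>>>(d_hits, runs);
    cudaMemcpy(&h_hits, d_hits, sizeof(unsigned long long), cudaMemcpyDeviceToHost);

    printf("chance = %f\n", (double)h_hits / runs);
    cudaFree(d_hits);
    return 0;
}
```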
🚀 Exciting news from Hugging Face! 🎉 Check out the featured paper "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration." 🧠💡