/r/CUDA
I have a deep interest in High Performance Computing and Reinforcement Learning. Should I learn CUDA programming to kickstart my journey? Currently I am a Python developer and have worked with C++ before. Please advise.
Hi all, I have a few GPUs left over from mining, and I’m interested in starting a small-scale GPU-as-a-service. My goal is to set up a simple, side income that could help pay off my credit cards, as I already have a primary job.
What steps are needed for getting started with a small-scale GPU-as-a-service business focused on machine learning or AI? Any insights would be greatly appreciated!
Thanks in advance for any advice you can share!
I just bought a 4060 for my desktop, specifically to be able to use CUDA for machine learning tasks. The CUDA compatibility website does not list the desktop 4060 as CUDA capable. Does that mean that I will not be able to use CUDA on my 4060?
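Not an official answer, but a quick way to check locally is to ask the CUDA runtime itself. A minimal sketch (compiled with nvcc); if the card enumerates with a compute capability, CUDA works on it, and Ada-generation GeForce cards such as the 4060 should report 8.9:

```cpp
// Minimal device query: lists every CUDA device the runtime can see along with
// its compute capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("CUDA devices found: %d\n", n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```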
float4 a,b,c;
// element-wise multiplication
a = b * c; // does not compile
// element-wise square root
a = sqrtf(a); // does not compile
Why? Is it because nobody uses float4 in computations? Is it only for vectorized load operations?
It gets quite repetitive this way:
// duplicated code x4
a.x = b.x * c.x;
a.y = b.y * c.y;
a.z = b.z * c.z;
a.w = b.w * c.w;
a.x = sqrtf(a.x);
a.y = sqrtf(a.y);
a.z = sqrtf(a.z);
a.w = sqrtf(a.w);
// one-liner no problem but still longer than dot(a)
float dot = a.x*a.x + a.y*a.y + a.z*a.z + a.w*a.w;
// have to write this to calculate cross-product
a.x = b.y*c.z - b.z*c.y;
a.y = b.z*c.x - b.x*c.z; // human error in a hand-written cross product? very easy
a.z = b.x*c.y - b.y*c.x;
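The repetition can at least be written once as user-defined overloads. A minimal sketch (the CUDA samples ship a similar header, helper_math.h, but nothing like it is pulled in by the core toolkit headers, which is presumably why the snippets above don't compile out of the box):

```cpp
// Element-wise helpers for float4; float4 is a plain struct, so any operators
// have to be supplied by the programmer.
#include <cuda_runtime.h>
#include <math.h>

__host__ __device__ inline float4 operator*(float4 b, float4 c) {
    return make_float4(b.x * c.x, b.y * c.y, b.z * c.z, b.w * c.w);
}

__host__ __device__ inline float4 sqrtf4(float4 a) {
    return make_float4(sqrtf(a.x), sqrtf(a.y), sqrtf(a.z), sqrtf(a.w));
}

__host__ __device__ inline float dot(float4 a, float4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// cross product on the .xyz components, ignoring .w
__host__ __device__ inline float4 cross3(float4 b, float4 c) {
    return make_float4(b.y * c.z - b.z * c.y,
                       b.z * c.x - b.x * c.z,
                       b.x * c.y - b.y * c.x,
                       0.0f);
}
```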
cudaDeviceSynchronize() is deprecated (and removed in CUDA 12) for device (GPU) level synchronization, which was possible in older versions of CUDA (since v5.0, ugh........).
I want to launch a child kernel from a parent kernel and wait for all the child kernel threads to complete before it proceeds to the next operation in parent kernel.
Any workaround for device level synchronization? I am trying dynamic parallelism for differential rasterization and ray tracing.
PLEASE HELP!
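One workaround sketch, based on my reading of the CUDA 12 dynamic-parallelism (CDP2) documentation rather than on the rasterizer itself (child_kernel and continuation_kernel below are placeholder names): move the work that must run after the children into its own kernel and put it in the named tail-launch stream, which starts only after the launching grid and the work it launched have completed. This needs relocatable device code (-rdc=true):

```cpp
// Sketch of the tail-launch pattern that replaces device-side cudaDeviceSynchronize().
#include <cuda_runtime.h>

__global__ void child_kernel(float *data) {
    data[threadIdx.x] *= 2.0f;                       // placeholder child work
}

__global__ void continuation_kernel(float *data) {
    data[0] += 1.0f;                                 // runs after parent + children finish
}

__global__ void parent_kernel(float *data) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // launch the child grid without waiting on it in the parent
        child_kernel<<<1, 256, 0, cudaStreamFireAndForget>>>(data);
        // tail launch: scheduled to start once this grid and its launched work are done
        continuation_kernel<<<1, 1, 0, cudaStreamTailLaunch>>>(data);
    }
}
```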
I'm a postgrad who is currently in academic limbo. I work for an HPC centre and write PyTorch CUDA/C++ extensions, so in theory I should be having a blast in this AI bull market. Except when I search for "CUDA" + "PyTorch" jobs, the open positions are not very numerous, and most of them are "senior" positions which I probably don't qualify for yet with my 1-2 years of job experience. And the real bummer: I'm not American, and it seems like most jobs of that nature are in the US. Before I got into writing AI stuff, I was doing numerical simulations, and I ran into the same problem: job postings were rare, mostly senior, and mostly in the US.
Now I'm kind of questioning my career choices. What am I missing here?
Hi, does anyone know of a full list of all the errors/warnings that the compute-sanitizer program can give you and explanations for each? Searches around the documentation didn't yield anything.
I'm getting a warning that just says "Empty malloc", and I'm hoping there's some documentation somewhere to go along with this warning because I'm at a total loss.
Edit: I didn't find any explanation for that message, but I solved the bug. I was launching too many threads and I was running out of registers. I assume "empty malloc" means it tried to malloc but didn't have any space.
Hi! I'm interested in learning more about GPU programming and I know enough CUDA C++ to do memory copy to host/device but not much more. I'm also not awesome with C++, but yeah I do want to find something that has hands on practice or sample codes since that's how I learn coding stuff better usually.
I'm curious to know if anyone has done either of these two and has any thoughts on them? Money won't be an issue since I have around $200 from a small grant, which can cover the $90 for the NVIDIA course or a Coursera Plus subscription. I'd love to know which one is better and/or more helpful for someone with a non-programming background who's picked up programming for their STEM degree.
(I'm also in the tech job market rn and not getting very favorable responses, so any way to make myself stand out as an applicant is a plus, which is why I thought being good-ish at CUDA or GPGPU would be useful.)
Heyy guys,
I am currently learning deep learning and wanted to explore CUDA. Can you guys suggest a good roadmap with resources?
Hi everyone,
I’m currently working on a project based on Auto1111SDK, and I’m aiming to modify it to work with Zluda, a solution that supports AMD GPUs.
I found another project where this setup works: stable-diffusion-webui-amdgpu. This shows it should be possible to get Auto1111SDK running with Zluda, but I’m currently missing the know-how to adjust my project accordingly.
Does anyone have experience with this or know the steps necessary to adapt the Auto1111SDK structure for Zluda? Are there specific settings or dependencies I should be aware of?
Thanks a lot in advance for any help!
I am in need of a platform where I can host a Docker image / container and benchmark some CUDA operations with an Nvidia GPU.
I am looking for free for students or relatively cheap solutions.
Hello! I’ve been exploring the possibility of rewriting some C/C++ functionalities (large vectors +,*,/,^) using CUDA for a while. However, I’m also considering the option of using multithreading. So a natural question arises… how do I calculate or determine whether CUDA or multithreading is more advantageous? At what computational threshold can we say that CUDA is worth bringing into play? Okay, I understand it’s about a “very large number of calculations,” but how do I determine that exact number? I’d prefer not to test both options for all functions/methods and make comparisons—I’d like an exact way to determine it or at least a logical approach. I say this because, at a small scale (what is that level?), there’s no real difference in terms of timing. I want to allocate resources correctly, avoiding their use where problems can be solved differently. Essentially, I aim to develop robust applications that involve both GPU CUDA and CPU multithreading. Thanks!
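One hedged, back-of-the-envelope approach rather than an exact rule: for simple element-wise operations on large vectors both sides are memory-bound, so you can estimate the break-even size from a few hardware numbers before writing any kernels. A sketch (every constant below is a placeholder to replace with figures measured or looked up for your own machine):

```cpp
// Rough break-even estimate for element-wise c = a op b on N floats.
#include <cstdio>

int main() {
    // assumed hardware characteristics (placeholders; edit for your system)
    const double cpu_bw   = 40e9;    // CPU effective memory bandwidth, bytes/s
    const double gpu_bw   = 400e9;   // GPU memory bandwidth, bytes/s
    const double pcie_bw  = 12e9;    // host<->device transfer bandwidth, bytes/s
    const double launch_s = 10e-6;   // kernel launch + synchronization overhead, s

    for (long long n = 1 << 10; n <= (1LL << 30); n <<= 2) {
        double bytes = 12.0 * n;     // read a and b, write c: 12 bytes per element
        double t_cpu = bytes / cpu_bw;
        // worst case: copy a and b over and copy c back on every single call
        double t_gpu_with_copies = launch_s + bytes / pcie_bw + bytes / gpu_bw;
        // best case: the vectors already live on the device across many calls
        double t_gpu_resident = launch_s + bytes / gpu_bw;
        printf("N=%-11lld cpu=%.2es gpu+copies=%.2es gpu resident=%.2es\n",
               n, t_cpu, t_gpu_with_copies, t_gpu_resident);
    }
    return 0;
}
```

The usual conclusion from estimates like this: if the data has to cross PCIe on every call, the CPU often stays competitive no matter how large the vectors are, while a GPU that keeps the data resident wins once N is large enough to amortize the launch overhead.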
I'm currently working through this book. However, I couldn't find any solutions for it, so I decided to write solutions myself. I hope you guys can come and help correct my answers to make them perfect!
I will keep making solutions :)
The cuSPARSE function which I use to solve the forward-backward substitution problem (triangular matrices), cusparseSpSM_solve(), doesn't work for large matrices: it sets the first value in the resulting vector to INF. Curiously, this only happens to the very first value in the resulting vector. I created a function to generate random, large SPD matrices and determined that any matrix with values outside of the main diagonal and with a dimension of 641x641 or larger has the same problem. Any matrix of 640x640 or smaller, or which consists only of values on the main diagonal, works just fine. The cuSPARSE function in question is opaque; I can't see what's happening in the background, I can only see the input and output.
I have confirmed that all inputs are correct and that it is not a memory issue. Finally, the function does not return an error, it simply sets the one value to INF and continues.
I can find no reason that the size of the matrix should influence the result, why the dimensions of 641x641 are relevant, why none of the cuSparse functions are throwing errors, or why this only happens to the very first value in the resulting vector. The Nvidia memcheck tool/CUDA sanitizer runs my code without returning any errors as well.
Writing this post just to share an interesting blog post I found while watching the freecodecamp cuda course.
The blog post explains How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance.
Even though trying to mimic cuBLAS is pointless (just go ahead and use cuBLAS), the content of the post is very educational. I'm learning new concepts about GPU optimization and thought it would be a good share for this subreddit. Bye!
Hello, I am working on an OpenGL engine that I want to extend with CUDA for a particle-based physics system. Today I spent a few hours trying to get everything set up, but every time I try to compile any .cu file, I get hundreds of errors inside "cuda_fp16.hpp", which is part of the CUDA SDK.
The errors mostly look like missing ")" symbols or the unknown symbol "__half".
Has anyone maybe got similar problems?
I am using Visual Studio 2022, an RTX 4070 with the latest NVidia driver and the CUDA Toolkit 12.6 installed.
I can provide more information, if needed.
Edit #2: I was able to solve the issue. I followed @shexaholas's suggestion and included the faulty file myself. After also including 4 more CUDA files from the toolkit, the application now compiles successfully!
Edit: I am not including the cuda_fp16.hpp header by myself. I am only including:
<cuda_runtime.h>
<thrust/version.h>
<thrust/detail/config.h>
<thrust/detail/config/host_system.h>
<thrust/detail/config/device_system.h>
<thrust/device_vector.h>
<thrust/host_vector.h>
Hi, I am super new to CUDA and C++. While applying for ML and related jobs I noticed that several of these jobs require C++ these days. I wonder why? As CUDA is C-based, why don't they ask for C instead? Any leads would be appreciated, as I am a beginner and deciding whether to learn CUDA with C or C++. I have learnt Python, C, and Java in the past, but I am not familiar with C++. So before diving in, I want to ask your opinion.
Also, do you have any GitHub resources to learn from that you recommend? I am right now going through https://github.com/CisMine/Parallel-Computing-Cuda-C and plan to study the book "Programming Massively Parallel Processors: A Hands-on Approach" with the https://www.youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6SrgdeSiuf9Tb videos. Any other alternatives you would suggest?
PS: I am currently unemployed trying to become employable with more skills and better projects. So any help is appreciated. Thank you.
Edit: Thank you very much to all you kind people. I was hoping that C will do but reading your comments motivates me towards C++. I will try my best to learn by Christmas this year. You all have been very kind. Thank you so much.
I was going through the freecodecamp YouTube video on CUDA, and I don't understand why we aren't using cudaStreamSynchronize for stream1 & stream2 after line 50 (before the kernel launch). How did not synchronizing the streams here still give the correct output?
Hi, I'm a physicist and I'm working with numerical integration. So far I have managed to run N parallel simulations using a kernel launch like Integration<<<1,N>>>, one block with N simulations (in this case N = 1024), and this is working fine.
But now I'm parallelizing over the parameters. There is a 2D parameter space, and for each point of this parameter space I want to run 1024 simulations. In this case the kernel would run something like:
dim3 gridDim(A2_cols, p_rows);
get_msd<<<gridDim, N>>>(d_X0S, d_Y0S, d_AS, d_PS, d_MSD);
// the arguments relate to the initial conditions and the parameters on the device
// d_MSD is an A2_cols x p_rows x T 3D matrix, where for each step of the simulation some value is added
but something is not working right with the allocation of blocks and threads. How many blocks can I allocate in the grid while keeping the 1024 simulations per parameter point?
thanks
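For what it's worth, a minimal indexing sketch under some assumptions (the float types, the flattened d_MSD layout and the empty body are guesses; the names come from the question): a block can have at most 1024 threads, but the grid itself can be very large (x up to 2^31-1, y and z up to 65535), so one block per parameter point with 1024 threads per block fits this case:

```cpp
// One block per (i, j) parameter point; its 1024 threads are the 1024 simulations.
#include <cuda_runtime.h>

__global__ void get_msd(const float *d_X0S, const float *d_Y0S,
                        const float *d_AS, const float *d_PS, float *d_MSD) {
    int param_i = blockIdx.x;                     // index into A2_cols
    int param_j = blockIdx.y;                     // index into p_rows
    int sim     = threadIdx.x;                    // simulation 0..1023
    int point   = param_j * gridDim.x + param_i;  // flattened parameter point

    // ... run simulation `sim` for this parameter point and accumulate into
    // d_MSD[point * T + t] with atomicAdd, since all 1024 threads share that slot ...
    (void)sim; (void)d_X0S; (void)d_Y0S; (void)d_AS; (void)d_PS;
    atomicAdd(&d_MSD[point], 0.0f);               // placeholder accumulation
}

// launch: dim3 grid(A2_cols, p_rows); get_msd<<<grid, 1024>>>(...);
```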
Hey All, I am struggling to understand optimizations made to naive matrix multiplication.
My kernel looks like this
// Assuming square matrices for simplicity
__global__ void matrixMult(int* A, int* B, int* C, int dimension)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * dimension + col;
if (idx < dimension * dimension) {
int temp = 0;
for (int i = 0; i < dimension; i++) {
temp = temp + A[row * dimension + i] * B[i * dimension + col];
}
C[idx] = temp;
}
}
// Kernel Launch Configs
dim3 block(32, 32);
dim3 grid(CEIL_DIV(dimension, 32), CEIL_DIV(dimension, 32));
matrixMult <<<grid, block >>> (dev_A, dev_B, dev_C, dimension);
A lot of tutorials online say this suffers from un-coalesced memory access on matrix A, and then proceed to change it using different indexing or shared memory. But here, consecutive threads that are calculating a row of C will all access the same row of A (which will get broadcast?), and access consecutive columns of B, which will be coalesced. Also, a block x-dimension of 32 ensures adjacent threads on x end up in the same warp. I am sure there's something wrong with my understanding, so let me know. Thanks.
Hello!
I am facing issues while installing and using PyTorch with CUDA support on my computer. Here are some details about my system and the steps I have taken:
### System Information:
- **Graphics Card:** NVIDIA GeForce GTX 1050 Ti
- **NVIDIA Driver Version:** 566.03
- **CUDA Version (from nvidia-smi):** 12.7
- **CUDA Version (from nvcc):** 11.7
### Steps Taken:
I installed Anaconda and created an environment named `pytorch_env`.
I installed PyTorch, torchvision, and torchaudio using the command:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
```
I checked the installation by running Python and executing the following commands:
```python
import torch
print(torch.__version__) # PyTorch Version: 2.4.1
print(torch.cuda.is_available()) # CUDA Availability: False
```
### Problem:
Even though PyTorch is installed, CUDA availability returns `False`. I have checked the NVIDIA drivers and the installation of the CUDA Toolkit, but the issue persists.
### Questions:
How can I properly configure PyTorch to work with CUDA?
Do I need to install a different version of PyTorch or NVIDIA drivers to resolve this issue?
Are there any additional steps I could take to troubleshoot this problem?
I would appreciate any help or advice!
Let's say you're traversing a tree. For recursion, you'll have to run the same function n times, and for iteration, you'll have to run the same loop n times. The threads will still end at different times, so where is the increased divergence?
Hi All,
I'm hoping to get some feedback on a Monte Carlo simulation I've set up in CUDA. I'm an experienced Python developer but new to C/C++ & CUDA. I'm running this locally on a 4060. I'm relatively comfortable that the code is working and it's completing ~2.5b simulations in a little over a second.
I'm not at all sure I'm doing the right thing with respect to memory, and I'm interested in any feedback on other optimizations I can implement here both on the C & CUDA side. My next steps will be to figure out how to use Nsight-compute and profile it further there.
I'm simulating legs of the board game "Camel Up". In this game, the camels move around a track and can "stack" on top of each other. If a camel at the bottom of the stack moves, it carries all camels on top of it forward. Each camel is selected to roll & move once per leg and the dice are uniformly distributed between 1 and 3. When all camels have moved, the leg is over. I want to recover the probabilities of each camel winning the leg based upon the current board state.
Any help you can give would be much appreciated! Thanks in advance:
#include <curand.h>
#include <curand_kernel.h>
#include <iostream>
#define DICE_MIN 1
#define DICE_MAX 3
#define NUM_CAMELS 5
#define FULL_MASK 0xffffffff
__global__ void setup_kernel(curandState *state) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
curand_init((unsigned long long)clock() + idx, idx, 0, &state[idx]);
}
template <typename T>
__global__ void camel_up_sim(curandState *state, const int *positions,
const bool *remaining_dice, const int *stack,
T *results, const T local_runs) {
int thread_idx = threadIdx.x;
int idx = blockIdx.x * blockDim.x + thread_idx;
__shared__ T shared_results[NUM_CAMELS];
// each block must zero its own copy of the shared array, so use the
// block-local thread index here (not the global idx)
if (thread_idx < NUM_CAMELS) {
shared_results[thread_idx] = 0;
}
__syncthreads();
T thread_results[NUM_CAMELS] = {0};
// Save the global variables in the local thread
// so we can reuse them without having to re-read globally.
int saved_local_positions[NUM_CAMELS];
bool saved_local_dice[NUM_CAMELS];
int saved_local_stack[NUM_CAMELS];
for (int i = 0; i < NUM_CAMELS; i++) {
saved_local_positions[i] = positions[i];
saved_local_dice[i] = remaining_dice[i];
saved_local_stack[i] = stack[i];
}
// Instantiate versions of this that can be used within the
// simulation.
int local_positions[NUM_CAMELS];
bool local_dice[NUM_CAMELS];
int local_stack[NUM_CAMELS];
int dice_remaining;
int camel_to_move;
int roll;
int camel_on_top;
int winner;
for (int r = 0; r < local_runs; r++) {
// Begin one simulation
dice_remaining = 0;
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
// reset local arrays back to saved initial state.
local_positions[i] = saved_local_positions[i];
local_dice[i] = saved_local_dice[i];
local_stack[i] = saved_local_stack[i];
if (local_dice[i] == 1) {
dice_remaining++;
}
}
while (dice_remaining > 0) {
// Figure out which camel should be moved.
do {
camel_to_move = curand(&state[idx]) % NUM_CAMELS;
} while (!local_dice[camel_to_move]);
// Roll that camel's dice to see how far it moves.
roll = curand(&state[idx]) % DICE_MAX + 1;
// move that camel and set its dice as rolled.
local_positions[camel_to_move] += roll;
local_dice[camel_to_move] = 0;
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
// If anyone was on the space the stack moved to, make that camel point
// to the bottom of the new stack
if ((i != camel_to_move) &&
(local_positions[i] == local_positions[camel_to_move]) &&
(local_stack[i] == -1)) {
local_stack[i] = camel_to_move;
} else if ((local_stack[i] == camel_to_move) &&
(local_positions[i] < local_positions[camel_to_move])) {
// If anyone pointed to camel_to_move and is on a previous space
// then make them uncovered.
local_stack[i] = -1;
}
}
camel_on_top = local_stack[camel_to_move];
// Move anyone who is on top of the camel that's moving
while (camel_on_top != -1) {
local_positions[camel_on_top] += roll;
// moved_camels[camel_on_top] = 1;
camel_on_top = local_stack[camel_on_top];
}
dice_remaining--;
}
winner = 0;
#pragma unroll
for (int i = 1; i < NUM_CAMELS; i++) {
if (local_positions[i] > local_positions[winner]) {
winner = i;
}
}
while (local_stack[winner] != -1) {
winner = local_stack[winner];
}
thread_results[winner] += 1;
}
// Start collecting the results from all the threads.
// Start by shuffling down on a warp basis.
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
for (int offset = 16; offset > 0; offset /= 2) {
thread_results[i] +=
__shfl_down_sync(FULL_MASK, thread_results[i], offset);
}
// If it's the first thread in a warp - report the result to shared memory.
if (thread_idx % 32 == 0) {
atomicAdd(&shared_results[i], thread_results[i]);
}
}
__syncthreads();
// Report block totals back to the global results variable.
if (thread_idx == 0) {
#pragma unroll
for (int i = 0; i < NUM_CAMELS; i++) {
atomicAdd(&results[i], shared_results[i]);
}
}
}
template <typename T> void printArray(T arr[], int size) {
std::cout << "[";
for (int i = 0; i < size; i++) {
std::cout << arr[i];
if (i < size - 1) {
std::cout << (", ");
}
}
std::cout << "]\n";
}
int main() {
using T = unsigned long long int;
std::cout << "Starting program..." << std::endl;
constexpr int BLOCKS = 24 * 4; // Four per SM on the 4060
constexpr int THREADS = 256;
constexpr int RUNS_PER_THREAD = 100000;
// Without casting one of these to unsigned long long int then this can
// overflow integer multiplication and return something nonsensical.
constexpr unsigned long long int N =
static_cast<unsigned long long int>(BLOCKS) * THREADS * RUNS_PER_THREAD;
std::cout << "N: " << std::to_string(N) << std::endl;
std::cout << "Creating host variables..." << std::endl;
int positions[NUM_CAMELS] = {0, 0, 0, 0, 0};
bool remainingDice[NUM_CAMELS] = {1, 1, 1, 1, 1};
int stack[NUM_CAMELS] = {1, 2, 3, 4, -1};
T *results;
results = (T *)malloc(NUM_CAMELS * sizeof(T));
std::cout << "Creating device pointers..." << std::endl;
int *d_positions;
bool *d_remainingDice;
int *d_stack;
T *d_results;
curandState *d_state;
cudaMalloc((void **)&d_state, BLOCKS * THREADS * sizeof(curandState));
std::cout << "Setting up curand states..." << std::endl;
setup_kernel<<<BLOCKS, THREADS>>>(d_state);
std::cout << "Allocating memory on device..." << std::endl;
cudaMalloc((void **)&d_positions, NUM_CAMELS * sizeof(int));
cudaMalloc((void **)&d_results, NUM_CAMELS * sizeof(T));
cudaMalloc((void **)&d_remainingDice, NUM_CAMELS * sizeof(bool));
cudaMalloc((void **)&d_stack, NUM_CAMELS * sizeof(int));
cudaMemset(d_results, 0, NUM_CAMELS * sizeof(T));
std::cout << "Copying to device..." << std::endl;
cudaMemcpy(d_positions, positions, NUM_CAMELS * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_remainingDice, remainingDice, NUM_CAMELS * sizeof(bool),
cudaMemcpyHostToDevice);
cudaMemcpy(d_stack, stack, NUM_CAMELS * sizeof(int), cudaMemcpyHostToDevice);
std::cout << "Starting sim..." << std::endl;
camel_up_sim<T><<<BLOCKS, THREADS>>>(d_state, d_positions, d_remainingDice,
d_stack, d_results, RUNS_PER_THREAD);
cudaDeviceSynchronize();
std::cout << "Copying results back..." << std::endl;
cudaMemcpy(results, d_results, NUM_CAMELS * sizeof(T),
cudaMemcpyDeviceToHost);
std::cout << "Results are:" << std::endl;
printArray(results, NUM_CAMELS);
float probs[NUM_CAMELS];
constexpr float N_float = static_cast<float>(N);
for (int i = 0; i < NUM_CAMELS; i++) {
probs[i] = static_cast<float>(results[i]) / N_float;
}
std::cout << "Probabilities are..." << std::endl;
printArray(probs, NUM_CAMELS);
cudaFree(d_positions);
cudaFree(d_results);
cudaFree(d_remainingDice);
cudaFree(d_state);
cudaFree(d_stack);
free(results);
}
I heard that in newer versions of CUDA you can allocate dynamic memory inside a kernel, for example:
__global__ void foo(int x) {
float* myarray = new float[x];
delete[] myarray;
}
So you can basically use both new (keyword) and malloc (function) within a kernel. But my question is: if we can allocate dynamic memory within a kernel, why can't I call cudaMalloc within a kernel too? Also, is the allocated memory in shared memory or global memory? And is it efficient to do this?
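My understanding, as a sketch rather than a definitive answer (the sizes and kernel below are illustrative): device-side new and malloc both draw from the device heap, which lives in global memory (not shared memory), defaults to only 8 MB, and has to be enlarged from the host before launch if you need more. Per-thread heap allocation is also much slower than indexing into a buffer allocated up front with cudaMalloc, so it is usually reserved for cases where sizes genuinely aren't known at launch time:

```cpp
// Device-side new/delete drawing from the device heap (global memory).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo(int x) {
    float *myarray = new float[x];      // allocated from the device heap
    if (myarray == nullptr) return;     // new returns nullptr if the heap is exhausted
    for (int i = 0; i < x; ++i) myarray[i] = static_cast<float>(i);
    if (threadIdx.x == 0) printf("myarray[%d] = %f\n", x - 1, myarray[x - 1]);
    delete[] myarray;                   // device-heap memory must be freed on the device
}

int main() {
    // raise the device malloc/new heap before launching (default is 8 MB)
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64ull << 20);
    foo<<<1, 32>>>(1024);
    cudaDeviceSynchronize();
    return 0;
}
```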
I followed the following steps to set up a conda environment with Python 3.8, CUDA 11.8 and PyTorch 2.4.1:
$ conda create -n py38_torch241_CUDA118 python=3.8
$ conda activate py38_torch241_CUDA118
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Python and pytorch seem to have installed correctly:
$ python --version
Python 3.8.20
$ pip list | grep torch
torch 2.4.1
torchaudio 2.4.1
torchvision 0.20.0
But when I try to check the CUDA version, I realise that nvcc is not installed:
$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
This also caused issues in the further setup of some git repositories which require nvcc. Do I need to run sudo apt install nvidia-cuda-toolkit as suggested above? Shouldn't the conda install command above install nvcc? I tried these steps again after completely deleting all conda packages and environments, but no help.
Below is some relevant information that might help debug this issue:
$ conda --version
conda 24.5.0
$ nvidia-smi
Sat Oct 19 02:12:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name User-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 2000 Ada Gene... Off | 00000000:01:00.0 Off | N/A |
| N/A 48C P0 588W / 35W | 8MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1859 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
$ which nvidia-smi
/usr/bin/nvidia-smi
Note that my machine runs an NVIDIA RTX 2000 Ada Generation GPU. Also, the nvidia-smi output above says I am running CUDA 12.4. I installed this driver manually a long time back, before I had conda installed on the machine.
I tried setting the CUDA_HOME path to my conda environment, but that didn't help either:
$ export CUDA_HOME=$CONDA_PREFIX
$ echo $CUDA_HOME
/home/User-M/miniconda3/envs/FairMOT_py38_torch241_CUDA118
$ which nvidia-smi
/usr/bin/nvidia-smi
$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
Hi everyone,
I’m working on simulations that iterate 10,000,000 times and want to optimize these calculations using CUDA on my GPU. Here are my details:
Questions:
Any advice or insights would be greatly appreciated!
Thanks!
I'm trying to compute the low pass filter of a 50M point transform using cufftdx. The problem is that it seems to limit me to input sizes of 1 << 14. There's no documentation or usage with large inputs and I'm trying to understand how people approach this problem. Sure I can compute a bunch of fft blocks over the 50M point space... but am I supposed to then somehow combine the blocks into a single FFT to get the correct values? There's something I'm not understanding.
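For context, my understanding is that cuFFTDx targets FFTs small enough to live inside a single thread block's shared memory, which is where the size limit comes from. If the goal is simply one big transform plus a low-pass filter, a hedged alternative sketch is to let the ordinary host-side cuFFT library plan the whole 50M-point FFT (memory permitting); the size and the filtering step are placeholders:

```cpp
// Single large 1D complex-to-complex FFT with cuFFT, filter in the frequency
// domain, then inverse transform.
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 50000000;                    // signal length (placeholder)
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N);
    // ... copy or generate the 50M-point signal in d_data ...

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);       // one transform of length N
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    // ... zero the bins above the cutoff here (the low-pass step) ...
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```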