/r/CUDA


8,534 Subscribers

27

Should I learn CUDA programming?

I have a deep interest in High Performance Computing and Reinforcement Learning. Should I learn CUDA programming to kickstart my journey? Currently, I am a Python developer and have worked with C++ before. Please advise.

22 Comments
2024/11/08
20:12 UTC

5

GPU as a service

Hi all, I have a few GPUs left over from mining, and I’m interested in starting a small-scale GPU-as-a-service. My goal is to set up a simple, side income that could help pay off my credit cards, as I already have a primary job.

What steps are needed for getting started with a small-scale GPU-as-a-service business focused on machine learning or AI? Any insights would be greatly appreciated!

Thanks in advance for any advice you can share!

18 Comments
2024/11/08
17:47 UTC

0

Is the 4060 CUDA capable?

I just bought a 4060 for my desktop specifically to be able to use CUDA for machine learning tasks. The CUDA compatibility website does not list the desktop 4060 as CUDA capable. Does that mean I will not be able to use CUDA on my 4060?
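(For what it's worth, recent GeForce cards, the 4060 included, do support CUDA; the compatibility page is likely just not exhaustive.) A quick way to settle it empirically is to query the device from a tiny program; any GPU that reports a compute capability here is CUDA capable. A minimal sketch, compiled with nvcc:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess || count == 0) {
    printf("No CUDA device found: %s\n", cudaGetErrorString(err));
    return 1;
  }
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  // Prints something like "NVIDIA GeForce RTX 4060, compute capability 8.9"
  printf("%s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
  return 0;
}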

12 Comments
2024/11/07
19:26 UTC

13

Why doesn't CUDA have built-in math operator overloading / functions for float4?

float4 a,b,c;
// element-wise multiplication
a = b * c; // does not compile
// element-wise square root
a = sqrtf(a); // does not compile

Why? Is it because nobody uses float4 in computations? Is it only there for vectorized-load operations?

The code repeats itself too much this way:

// duplicated code x4
a.x = b.x * c.x;
a.y = b.y * c.y;
a.z = b.z * c.z;
a.w = b.w * c.w;
a.x = sqrtf(a.x);
a.y = sqrtf(a.y);
a.z = sqrtf(a.z);
a.w = sqrtf(a.w);

// one-liner no problem but still longer than dot(a)
float dot = a.x*a.x + a.y*a.y + a.z*a.z + a.w*a.w;

// have to write this to calculate cross-product
a.x = b.y*c.z - b.z*c.y;
a.y = b.x*c.z - b.x*c.y; // human error in cross product implementation? yes, probably
a.z = b.y*c.x - b.z*c.x;
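The core CUDA headers indeed define no arithmetic for the built-in vector types, but the CUDA samples ship a helper_math.h header that adds exactly these overloads (operators, dot(), cross(), and friends). A minimal sketch of rolling your own; the name sqrtf4 is made up here:

#include <cuda_runtime.h>
#include <math.h>

// Element-wise multiply: a = b * c now compiles.
__host__ __device__ inline float4 operator*(float4 b, float4 c) {
  return make_float4(b.x * c.x, b.y * c.y, b.z * c.z, b.w * c.w);
}

// Element-wise square root (hypothetical name).
__host__ __device__ inline float4 sqrtf4(float4 a) {
  return make_float4(sqrtf(a.x), sqrtf(a.y), sqrtf(a.z), sqrtf(a.w));
}

// Dot product of a vector with itself.
__host__ __device__ inline float dot(float4 a) {
  return a.x * a.x + a.y * a.y + a.z * a.z + a.w * a.w;
}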
13 Comments
2024/11/05
08:54 UTC

3

Dynamic Parallelism in newer versions of CUDA

cudaDeviceSynchronize() is deprecated (and removed in CUDA 12) for device-side (GPU-level) synchronization, which was possible in older versions of CUDA (dynamic parallelism dates back to CUDA 5.0 in 2012, ugh........)

I want to launch a child kernel from a parent kernel and wait for all the child kernel's threads to complete before proceeding to the next operation in the parent kernel.

Any workaround for device-level synchronization? I am trying dynamic parallelism for differential rasterization and ray tracing.

PLEASE HELP!
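One documented pattern in the CUDA 12 dynamic-parallelism model (CDP2) is to split the parent into two kernels and use a tail launch: a grid launched into cudaStreamTailLaunch does not start until the parent grid and all work the parent launched have completed, which substitutes for the old device-side cudaDeviceSynchronize(). A hedged sketch (compile with -rdc=true; kernel bodies assumed):

__global__ void child() { /* per-child rasterization work */ }

// Everything that must run after all children finish goes here.
__global__ void next_step() { /* next operation of the old parent */ }

__global__ void parent() {
  // Launch children without waiting on them in this grid.
  child<<<64, 256, 0, cudaStreamFireAndForget>>>();

  // The tail launch is deferred until this grid and everything it
  // launched have completed.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    next_step<<<1, 1, 0, cudaStreamTailLaunch>>>();
  }
}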

4 Comments
2024/11/03
17:59 UTC

33

Is having CUDA as your career plan a risky move?

I'm a postgrad currently in academic limbo. I work for an HPC centre and write PyTorch CUDA/C++ extensions, so in theory I should be having a blast in this AI bull market. Except when I search for "CUDA" + "PyTorch" jobs, the open positions are not numerous, and most of them are senior positions that I probably don't qualify for yet with my 1-2 years of job experience. And the real bummer: I'm not American, and it seems like most jobs of that nature are in the US. Before I got into writing AI stuff, I was doing numerical simulations, and I ran into the same problem: job postings were rare, mostly senior, and mostly in the US.

Now I'm kind of questioning my career choices. What am I missing here?

26 Comments
2024/11/03
15:13 UTC

11

I made an animated video explaining how DRAM works and why you should care as a CUDA programmer

4 Comments
2024/11/02
09:48 UTC

1

Does anyone know of a list of compute-sanitizer warnings and explanations?

Hi, does anyone know of a full list of all the errors/warnings that the compute-sanitizer program can give you and explanations for each? Searches around the documentation didn't yield anything.

I'm getting a warning that just says Empty malloc, and I'm hoping there's some documentation somewhere to go along with this warning because I'm at a total loss.

Edit: I didn't find any explanation for that message, but I solved the bug. I was launching too many threads and I was running out of registers. I assume "empty malloc" means it tried to malloc but didn't have any space.
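For what it's worth, device-side malloc() draws from a fixed runtime heap (8 MB by default) and returns NULL when it is exhausted, which fits the "didn't have any space" reading of the warning. A minimal sketch for turning that into an explicit, checkable failure; sizes are arbitrary:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void alloc_check(int n) {
  int *buf = (int *)malloc(n * sizeof(int));
  if (buf == NULL) {
    // The heap can be raised from the host with
    // cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes).
    printf("thread %d: device malloc failed\n", threadIdx.x);
    return;
  }
  buf[0] = threadIdx.x;
  free(buf);
}

int main() {
  alloc_check<<<1, 32>>>(1 << 16);
  cudaDeviceSynchronize();
  return 0;
}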

2 Comments
2024/11/01
02:39 UTC

17

NVIDIA Accelerated Programming course vs Coursera GPU Programming Specialization

Hi! I'm interested in learning more about GPU programming. I know enough CUDA C++ to do memory copies to host/device, but not much more. I'm also not great with C++, but I do want to find something with hands-on practice or sample code, since that's usually how I learn coding best.

I'm curious whether anyone has done either of these two and has any thoughts on them? Money won't be an issue, since I have around $200 from a small grant, which can cover the $90 NVIDIA course or a Coursera Plus subscription. I'd love to know which one is better and/or more helpful for someone from a non-programming background who picked up programming for their STEM degree.

(I'm also in the tech job market right now and not getting very favorable responses, so any way to make myself stand out as an applicant is a plus, which is why I thought being good-ish at CUDA or GPGPU would be useful.)

12 Comments
2024/10/30
23:47 UTC

5

How to start with cuda?

Hey guys,

I am currently learning deep learning and wanted to explore CUDA. Can you suggest a good roadmap with resources?

11 Comments
2024/10/30
16:25 UTC

0

Help Needed: Using Auto1111SDK with Zluda

Hi everyone,

I’m currently working on a project based on Auto1111SDK, and I’m aiming to modify it to work with ZLUDA, a drop-in CUDA implementation that runs on AMD GPUs.

I found another project where this setup works: stable-diffusion-webui-amdgpu. This shows it should be possible to get Auto1111SDK running with Zluda, but I’m currently missing the know-how to adjust my project accordingly.

Does anyone have experience with this or know the steps necessary to adapt the Auto1111SDK structure for Zluda? Are there specific settings or dependencies I should be aware of?

Thanks a lot in advance for any help!

0 Comments
2024/10/29
22:31 UTC

2

Need Docker / CUDA cloud hosting

I am in need of a platform where I can host a Docker image / container and benchmark some CUDA operations with an Nvidia GPU.

I am looking for solutions that are free for students or relatively cheap.

4 Comments
2024/10/29
16:29 UTC

23

CUDA vs. Multithreading

Hello! I’ve been exploring the possibility of rewriting some C/C++ functionality (large vector +, *, /, ^ operations) using CUDA. However, I’m also considering plain CPU multithreading. So a natural question arises: how do I calculate or determine whether CUDA or multithreading is more advantageous? At what computational threshold can we say that CUDA is worth bringing into play? I understand it’s about a “very large number of calculations,” but how do I determine that number? I’d prefer not to benchmark both options for every function/method and compare; I’d like an exact way to determine it, or at least a logical approach. I ask because at small scales (and what is that level?) there’s no real difference in timing. I want to allocate resources correctly, avoiding the GPU where problems can be solved differently. Essentially, I aim to develop robust applications that use both CUDA on the GPU and multithreading on the CPU. Thanks!
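There is no universal threshold: the break-even point depends on the arithmetic intensity of the operation, the PCIe transfer cost, and the specific CPU and GPU, so in practice a small empirical harness answers this faster than any formula. A sketch that times the full GPU path (transfers included, since they usually decide the outcome for element-wise vector ops) against a single-threaded CPU loop; divide the CPU time by your effective core count as a rough proxy for a threaded version:

#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vecMul(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] * b[i];
}

int main() {
  // Sweep sizes from 1K to 64M elements.
  for (int n = 1 << 10; n <= 1 << 26; n <<= 4) {
    std::vector<float> a(n, 1.f), b(n, 2.f), c(n);
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    // Charge the GPU for its transfers too; for low-intensity ops the
    // copies, not the arithmetic, decide the break-even point.
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    vecMul<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float gpu_ms;
    cudaEventElapsedTime(&gpu_ms, t0, t1);

    auto s = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++) c[i] = a[i] * b[i];
    auto e = std::chrono::steady_clock::now();
    double cpu_ms = std::chrono::duration<double, std::milli>(e - s).count();

    // Note: the first iteration also pays the one-time CUDA context setup.
    printf("n=%9d  gpu=%8.3f ms  cpu(1 thread)=%8.3f ms\n", n, gpu_ms, cpu_ms);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
  }
  return 0;
}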

11 Comments
2024/10/28
06:47 UTC

16

sharing answer of Programming-Massively-Parallel-Processors-A-Hands-on-Approach-4th

I'm currently working through this book. However, I couldn't find any solutions for it, so I decided to write my own. I hope you can come and help correct my answers to make them perfect!

my GitHub repo

I will keep making solutions :)

4 Comments
2024/10/27
12:15 UTC

2

cusparseSpSM_solve function returns INF value, only with large matrices

The cuSPARSE function I use to solve the forward-backward substitution problem (triangular matrices), cusparseSpSM_solve(), doesn't work for large matrices: it sets the first value in the resulting vector to INF. Curiously, this only happens with the very first value. I created a function to generate random, large SPD matrices and determined that any matrix with values outside of the main diagonal and a dimension of 641x641 or larger has the same problem. Any matrix of 640x640 or smaller, or one consisting only of values on the main diagonal, works just fine. The cuSPARSE function in question is opaque; I can't see what's happening in the background, only the input and output.

I have confirmed that all inputs are correct and that it is not a memory issue. Finally, the function does not return an error, it simply sets the one value to INF and continues.

I can find no reason why the size of the matrix should influence the result, why 641x641 is the relevant dimension, why none of the cuSPARSE functions throw errors, or why this only happens to the very first value in the resulting vector. The NVIDIA memcheck tool / compute-sanitizer also runs my code without returning any errors.

11 Comments
2024/10/26
22:46 UTC

12

Tutorial for Beginners: Matmul Optimization

Writing this post just to share an interesting blog post I found while watching the freeCodeCamp CUDA course. The blog post explains how to optimize a CUDA matmul kernel for cuBLAS-like performance. Even though trying to mimic cuBLAS is pointless (just go ahead and use cuBLAS), the content of the post is very educational; I'm learning new concepts about GPU optimization and thought it would be a good share for this subreddit. Bye!

4 Comments
2024/10/25
12:28 UTC

1

Problems with cuda_fp16.hpp

Hello, I am working on an OpenGL engine that I want to extend with CUDA for a particle-based physics system. Today I spent a few hours trying to get everything set up, but every time I try to compile any .cu file, I get hundreds of errors inside "cuda_fp16.hpp", which is part of the CUDA SDK.

The errors mostly look like missing ")" symbols or unknown symbols like "__half".

Has anyone maybe got similar problems?

I am using Visual Studio 2022, an RTX 4070 with the latest NVidia driver and the CUDA Toolkit 12.6 installed.

I can provide more information, if needed.

Edit #2: I was able to solve the issue. I followed @shexaholas's suggestion and included the faulty file myself. After also including 4 more CUDA files from the toolkit, the application is now being compiled successfully!

Edit: I am not including the cuda_fp16.hpp header by myself. I am only including:

<cuda_runtime.h>

<thrust/version.h>

<thrust/detail/config.h>

<thrust/detail/config/host_system.h>

<thrust/detail/config/device_system.h>

<thrust/device_vector.h>

<thrust/host_vector.h>

11 Comments
2024/10/24
19:38 UTC

27

CUDA with C or C++ for ML jobs

Hi, I am super new to CUDA and C++. While applying for ML and related jobs, I noticed that several of these jobs require C++ these days. I wonder why? As CUDA is C-based, why don't they ask for C instead? Any leads would be appreciated, as I am a beginner deciding whether to learn CUDA with C or with C++. I have learnt Python, C, and Java in the past, but I am not familiar with C++. So before diving in, I want to ask your opinion.

Also, do you have any GitHub resources to learn from that you recommend? I am right now going through https://github.com/CisMine/Parallel-Computing-Cuda-C and plan to study the book "Programming Massively Parallel Processors: A Hands-on Approach" with the https://www.youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6SrgdeSiuf9Tb videos. Any other alternatives you would suggest?

PS: I am currently unemployed trying to become employable with more skills and better projects. So any help is appreciated. Thank you.

Edit: Thank you very much to all you kind people. I was hoping that C would do, but reading your comments motivates me towards C++. I will try my best to learn it by Christmas this year. You have all been very kind. Thank you so much.

21 Comments
2024/10/24
13:58 UTC

3

CUDA question from freecodecamp yt video

https://github.com/Infatoshi/cuda-course/blob/master/05_Writing_your_First_Kernels/05%20Streams/01_stream_basics.cu

I was going through the freeCodeCamp YouTube video on CUDA, and I don't understand why we aren't using cudaStreamSynchronize for stream1 & stream2 after line 50 (before the kernel launch). How did not synchronizing the streams here still give the correct output?
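A hedged guess at the answer: streams created with cudaStreamCreate are blocking streams, and the legacy default stream does not start work until previously issued work in blocking streams has finished, so a kernel launched with no stream argument still sees the async copies complete. A minimal sketch of making that ordering explicit instead, with made-up names rather than the linked file's:

#include <cuda_runtime.h>

__global__ void addOne(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] += 1.0f;
}

int main() {
  const int n = 1 << 20;
  float *h_a = new float[n](), *d_a;
  cudaMalloc(&d_a, n * sizeof(float));

  cudaStream_t stream1;
  cudaStreamCreate(&stream1); // a blocking stream
  cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, stream1);

  // Explicit ordering; without this line the legacy default stream
  // still waits on stream1 implicitly, which is why the example works.
  cudaStreamSynchronize(stream1);
  addOne<<<(n + 255) / 256, 256>>>(d_a, n);

  cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaStreamDestroy(stream1);
  cudaFree(d_a);
  delete[] h_a;
  return 0;
}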

7 Comments
2024/10/23
20:54 UTC

5

Parallel integration with CUDA

Hi, I'm a physicist and I'm working with numerical integration. So far I've managed to run N parallel simulations using a kernel launch like Integration<<<1,N>>>, i.e. one block of N simulations (in this case N = 1024), and this works fine.

But now I'm parallelizing over the parameters. There is a 2D parameter space, and for each point of this parameter space I want to run 1024 simulations. In this case the kernel launch would look something like:

dim3 gridDim(A2_cols, p_rows);
get_msd<<<gridDim, N>>>(d_X0S, d_Y0S, d_AS, d_PS, d_MSD);
// the arguments are the initial conditions and the parameters on the device
// d_MSD is an A2_cols x p_rows x T 3D matrix; at each simulation step some value is added

but something is not working right with the allocation of blocks and threads. How many blocks can I allocate in the grid while keeping the 1024 simulations per point?

thanks
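For what it's worth, a 2D grid with one block per parameter point and one thread per simulation fits well within CUDA's limits (up to 2^31-1 blocks in x and 65535 in y). A sketch of the indexing, reusing the post's names but with assumed types and sizes:

#include <cuda_runtime.h>

__global__ void get_msd(float *d_MSD, int A2_cols, int T) {
  int pa  = blockIdx.x;  // point on the first parameter axis
  int pb  = blockIdx.y;  // point on the second parameter axis
  int sim = threadIdx.x; // which of the 1024 simulations at this point
  // For the flattened A2_cols x p_rows x T output, a write like
  //   atomicAdd(&d_MSD[(pb * A2_cols + pa) * T + t], value);
  // is needed if all 1024 threads accumulate into the same (pa, pb, t) cell.
  (void)sim;
}

int main() {
  int A2_cols = 32, p_rows = 32, T = 100; // arbitrary example sizes
  float *d_MSD;
  cudaMalloc(&d_MSD, sizeof(float) * A2_cols * p_rows * T);
  cudaMemset(d_MSD, 0, sizeof(float) * A2_cols * p_rows * T);

  dim3 grid(A2_cols, p_rows); // one block per parameter point
  get_msd<<<grid, 1024>>>(d_MSD, A2_cols, T); // 1024 simulations per block
  cudaDeviceSynchronize();
  cudaFree(d_MSD);
  return 0;
}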

7 Comments
2024/10/23
19:19 UTC

3

Uncoalesced memory access in Matrix Multiplication

Hey all, I am struggling to understand the optimizations made to naive matrix multiplication.
My kernel looks like this:

// Assuming square matrices for simplicity
__global__ void matrixMult(int* A, int* B, int* C, int dimension)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = row * dimension + col;

    if (idx < dimension * dimension) {
        int temp = 0;
        for (int i = 0; i < dimension; i++) {
            temp = temp + A[row * dimension + i] * B[i * dimension + col];
        }
        C[idx] = temp;
    }
}  

// Kernel Launch Configs
dim3 block(32, 32);
dim3 grid(CEIL_DIV(dimension, 32), CEIL_DIV(dimension, 32));
matrixMult <<<grid, block >>> (dev_A, dev_B, dev_C, dimension);

A lot of tutorials online say this suffers from uncoalesced memory access on matrix A, and then proceed to change it using different indexing or shared memory. But here, the consecutive threads that are calculating a row of C all access the same element of A (which should be a broadcast?), and they access consecutive columns of B, which should be coalesced. Also, a block x-dimension of 32 ensures adjacent threads on x end up in the same warp. I am sure there's something wrong with my understanding, so let me know. Thanks.
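For the record, the post's reading looks right for this mapping: with col derived from threadIdx.x, the B and C accesses are coalesced and the A access is a uniform broadcast within the warp; the uncoalesced variant those tutorials show usually derives the row from threadIdx.x instead. The standard next optimization is shared-memory tiling, which cuts global traffic on both matrices. A sketch, assuming dimension is a multiple of 32:

#define TILE 32

__global__ void matrixMultTiled(const int *A, const int *B, int *C, int dimension)
{
    __shared__ int tileA[TILE][TILE];
    __shared__ int tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int temp = 0;

    for (int t = 0; t < dimension / TILE; t++) {
        // Stage one tile of A and B each; both loads are coalesced
        // because threadIdx.x varies fastest along a row.
        tileA[threadIdx.y][threadIdx.x] = A[row * dimension + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * dimension + col];
        __syncthreads();

        for (int i = 0; i < TILE; i++)
            temp += tileA[threadIdx.y][i] * tileB[i][threadIdx.x];
        __syncthreads();
    }
    C[row * dimension + col] = temp;
}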

6 Comments
2024/10/23
16:58 UTC

3

CUDA Availability False in PyTorch: Seeking Solutions for GTX 1050 Ti

Hello!

I am facing issues while installing and using PyTorch with CUDA support on my computer. Here are some details about my system and the steps I have taken:

### System Information:

- **Graphics Card:** NVIDIA GeForce GTX 1050 Ti

- **NVIDIA Driver Version:** 566.03

- **CUDA Version (from nvidia-smi):** 12.7

- **CUDA Version (from nvcc):** 11.7

### Steps Taken:

  1. I installed Anaconda and created an environment named `pytorch_env`.

  2. I installed PyTorch, torchvision, and torchaudio using the command:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
```

  3. I checked the installation by running Python and executing the following commands:

```python
import torch
print(torch.__version__)         # PyTorch version: 2.4.1
print(torch.cuda.is_available()) # CUDA availability: False
```

### Problem:

Even though PyTorch is installed, CUDA availability returns `False`. I have checked the NVIDIA drivers and the installation of the CUDA Toolkit, but the issue persists.

### Questions:

  1. How can I properly configure PyTorch to work with CUDA?

  2. Do I need to install a different version of PyTorch or NVIDIA drivers to resolve this issue?

  3. Are there any additional steps I could take to troubleshoot this problem?

I would appreciate any help or advice!

5 Comments
2024/10/23
12:35 UTC

2

How does recursion cause more divergence than iteration?

Let's say you're traversing a tree. With recursion, you call the same function n times, and with iteration, you run the same loop body n times. The threads will still finish at different times either way, so where does the increased divergence come from?
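However the divergence question shakes out, the common GPU pattern for tree traversal is to replace recursion with an explicit per-thread stack, which avoids call-frame overhead and deep device stacks. A sketch with an assumed node layout (one tree per thread):

struct Node {      // hypothetical layout
  int left, right; // child indices, -1 if absent
  float value;
};

__global__ void sumTrees(const Node *nodes, const int *roots, float *out,
                         int numTrees) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t >= numTrees) return;

  int stack[64]; // explicit stack in local memory; 64 bounds the depth
  int top = 0;
  stack[top++] = roots[t];
  float sum = 0.f;

  while (top > 0) { // iteration: same control flow, no call frames
    int n = stack[--top];
    sum += nodes[n].value;
    if (nodes[n].left != -1) stack[top++] = nodes[n].left;
    if (nodes[n].right != -1) stack[top++] = nodes[n].right;
  }
  out[t] = sum;
}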

3 Comments
2024/10/21
01:56 UTC

5

Review Request - Monte Carlo simulations with CUDA

Hi All,

I'm hoping to get some feedback on a Monte Carlo simulation I've set up in CUDA. I'm an experienced Python developer but new to C/C++ & CUDA. I'm running this locally on a 4060. I'm relatively comfortable that the code is working and it's completing ~2.5b simulations in a little over a second.

I'm not at all sure I'm doing the right thing with respect to memory, and I'm interested in any feedback on other optimizations I can implement here, both on the C and the CUDA side. My next step will be to figure out how to use Nsight Compute and profile it further there.

I'm simulating legs of the board game "Camel Up". In this game, the camels move around a track and can "stack" on top of each other. If a camel at the bottom of the stack moves, it carries all camels on top of it forward. Each camel is selected to roll & move once per leg and the dice are uniformly distributed between 1 and 3. When all camels have moved, the leg is over. I want to recover the probabilities of each camel winning the leg based upon the current board state.

Any help you can give would be much appreciated! Thanks in advance:

#include <curand.h>
#include <curand_kernel.h>
#include <iostream>

#define DICE_MIN 1
#define DICE_MAX 3
#define NUM_CAMELS 5
#define FULL_MASK 0xffffffff

__global__ void setup_kernel(curandState *state) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  curand_init((unsigned long long)clock() + idx, idx, 0, &state[idx]);
}

template <typename T>
__global__ void camel_up_sim(curandState *state, const int *positions,
                             const bool *remaining_dice, const int *stack,
                             T *results, const T local_runs) {
  int thread_idx = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + thread_idx;

  __shared__ T shared_results[NUM_CAMELS];

  // Use the block-local thread index here: with the global idx, only
  // block 0 would ever zero its copy of shared_results.
  if (thread_idx < NUM_CAMELS) {
    shared_results[thread_idx] = 0;
  }
  __syncthreads();

  T thread_results[NUM_CAMELS] = {0};

  // Save the global variables in the local thread
  // so we can reuse them without having to re-read globally.
  int saved_local_positions[NUM_CAMELS];
  bool saved_local_dice[NUM_CAMELS];
  int saved_local_stack[NUM_CAMELS];

  for (int i = 0; i < NUM_CAMELS; i++) {
    saved_local_positions[i] = positions[i];
    saved_local_dice[i] = remaining_dice[i];
    saved_local_stack[i] = stack[i];
  }

  // Instantiate versions of this that can be used within the
  // simulation.
  int local_positions[NUM_CAMELS];
  bool local_dice[NUM_CAMELS];
  int local_stack[NUM_CAMELS];
  int dice_remaining;

  int camel_to_move;
  int roll;
  int camel_on_top;
  int winner;

  for (int r = 0; r < local_runs; r++) {
    // Begin one simulation
    dice_remaining = 0;

#pragma unroll
    for (int i = 0; i < NUM_CAMELS; i++) {
      // reset local arrays back to saved initial state.
      local_positions[i] = saved_local_positions[i];
      local_dice[i] = saved_local_dice[i];
      local_stack[i] = saved_local_stack[i];

      if (local_dice[i] == 1) {
        dice_remaining++;
      }
    }

    while (dice_remaining > 0) {
      // Figure out which camel should be moved.
      do {
        camel_to_move = curand(&state[idx]) % NUM_CAMELS;
      } while (!local_dice[camel_to_move]);

      // Roll that camel's dice to see how far it moves.
      roll = curand(&state[idx]) % DICE_MAX + 1;

      // move that camel and set its dice as rolled.
      local_positions[camel_to_move] += roll;
      local_dice[camel_to_move] = 0;

#pragma unroll
      for (int i = 0; i < NUM_CAMELS; i++) {
        // If anyone was on the space the stack moved to, make that camel point
        // to the bottom of the new stack
        if ((i != camel_to_move) &&
            (local_positions[i] == local_positions[camel_to_move]) &&
            (local_stack[i] == -1)) {
          local_stack[i] = camel_to_move;
        } else if ((local_stack[i] == camel_to_move) &&
                   (local_positions[i] < local_positions[camel_to_move])) {
          // If anyone pointed to camel_to_move and is on a previous space
          // then make them uncovered.
          local_stack[i] = -1;
        }
      }

      camel_on_top = local_stack[camel_to_move];

      // Move anyone who is on top of the camel that's moving
      while (camel_on_top != -1) {
        local_positions[camel_on_top] += roll;
        // moved_camels[camel_on_top] = 1;
        camel_on_top = local_stack[camel_on_top];
      }

      dice_remaining--;
    }

    winner = 0;
#pragma unroll
    for (int i = 1; i < NUM_CAMELS; i++) {
      if (local_positions[i] > local_positions[winner]) {
        winner = i;
      }
    }

    while (local_stack[winner] != -1) {
      winner = local_stack[winner];
    }

    thread_results[winner] += 1;
  }

// Start collecting the results from all the threads.
// Start by shuffling down on a warp basis.
#pragma unroll
  for (int i = 0; i < NUM_CAMELS; i++) {
    for (int offset = 16; offset > 0; offset /= 2) {
      thread_results[i] +=
          __shfl_down_sync(FULL_MASK, thread_results[i], offset);
    }

    // If it's the first thread in a warp - report the result to shared memory.
    if (thread_idx % 32 == 0) {
      atomicAdd(&shared_results[i], thread_results[i]);
    }
  }

  __syncthreads();

  // Report block totals back to the global results variable.
  if (thread_idx == 0) {
#pragma unroll
    for (int i = 0; i < NUM_CAMELS; i++) {
      atomicAdd(&results[i], shared_results[i]);
    }
  }
}

template <typename T> void printArray(T arr[], int size) {
  std::cout << "[";
  for (int i = 0; i < size; i++) {
    std::cout << arr[i];
    if (i < size - 1) {
      std::cout << (", ");
    }
  }
  std::cout << "]\n";
}

int main() {

  using T = unsigned long long int;

  std::cout << "Starting program..." << std::endl;
  constexpr int BLOCKS = 24 * 4; // Four per SM on the 4060
  constexpr int THREADS = 256;
  constexpr int RUNS_PER_THREAD = 100000;
  // Without casting one of these to unsigned long long int then this can
  // overflow integer multiplication and return something nonsensical.
  constexpr unsigned long long int N =
      static_cast<unsigned long long int>(BLOCKS) * THREADS * RUNS_PER_THREAD;

  std::cout << "N: " << std::to_string(N) << std::endl;

  std::cout << "Creating host variables..." << std::endl;
  int positions[NUM_CAMELS] = {0, 0, 0, 0, 0};
  bool remainingDice[NUM_CAMELS] = {1, 1, 1, 1, 1};
  int stack[NUM_CAMELS] = {1, 2, 3, 4, -1};
  T *results;
  results = (T *)malloc(NUM_CAMELS * sizeof(T));

  std::cout << "Creating device pointers..." << std::endl;
  int *d_positions;
  bool *d_remainingDice;
  int *d_stack;
  T *d_results;

  curandState *d_state;
  cudaMalloc((void **)&d_state, BLOCKS * THREADS * sizeof(curandState));

  std::cout << "Setting up curand states..." << std::endl;
  setup_kernel<<<BLOCKS, THREADS>>>(d_state);

  std::cout << "Allocating memory on device..." << std::endl;
  cudaMalloc((void **)&d_positions, NUM_CAMELS * sizeof(int));
  cudaMalloc((void **)&d_results, NUM_CAMELS * sizeof(T));
  cudaMalloc((void **)&d_remainingDice, NUM_CAMELS * sizeof(bool));
  cudaMalloc((void **)&d_stack, NUM_CAMELS * sizeof(int));

  cudaMemset(d_results, 0, NUM_CAMELS * sizeof(T));

  std::cout << "Copying to device..." << std::endl;
  cudaMemcpy(d_positions, positions, NUM_CAMELS * sizeof(int),
             cudaMemcpyHostToDevice);
  cudaMemcpy(d_remainingDice, remainingDice, NUM_CAMELS * sizeof(bool),
             cudaMemcpyHostToDevice);
  cudaMemcpy(d_stack, stack, NUM_CAMELS * sizeof(int), cudaMemcpyHostToDevice);

  std::cout << "Starting sim..." << std::endl;
  camel_up_sim<T><<<BLOCKS, THREADS>>>(d_state, d_positions, d_remainingDice,
                                       d_stack, d_results, RUNS_PER_THREAD);

  cudaDeviceSynchronize();

  std::cout << "Copying results back..." << std::endl;
  cudaMemcpy(results, d_results, NUM_CAMELS * sizeof(T),
             cudaMemcpyDeviceToHost);

  std::cout << "Results are:" << std::endl;
  printArray(results, NUM_CAMELS);

  float probs[NUM_CAMELS];
  constexpr float N_float = static_cast<float>(N);
  for (int i = 0; i < NUM_CAMELS; i++) {
    probs[i] = static_cast<float>(results[i]) / N_float;
  }

  std::cout << "Probabilities are..." << std::endl;
  printArray(probs, NUM_CAMELS);

  cudaFree(d_positions);
  cudaFree(d_results);
  cudaFree(d_remainingDice);
  cudaFree(d_state);
  cudaFree(d_stack);

  free(results);
}
2 Comments
2024/10/19
05:49 UTC

2

Allocating dynamic memory in kernel???

I heard that in newer versions of CUDA you can allocate dynamic memory inside a kernel, for example:

__global__ void foo(int x) {
  float *myarray = new float[x];
  // ...
  delete[] myarray;
}

So you can basically use both new (the keyword) and malloc() (the function) within a kernel. But my question is: if we can allocate dynamic memory within a kernel, why can't I call cudaMalloc within a kernel too? Also, is the allocated memory in shared memory or global memory? And is it efficient to do this?
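A few notes, hedged against the dynamic-parallelism docs: device-side new and malloc() both draw from the same runtime heap, and that heap lives in global memory, not shared memory. cudaMalloc() can in fact be called from device code when compiling with relocatable device code, but there it just maps onto the same device-heap malloc(), so it buys nothing. Per-thread heap allocation is also slow, so it's best kept out of hot loops. A minimal sketch; the heap defaults to 8 MB and can be raised from the host:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo(int x) {
  float *myarray = new float[x]; // allocated from the device heap (global memory)
  if (myarray == nullptr) {      // the heap can run out
    printf("thread %d: device heap exhausted\n", threadIdx.x);
    return;
  }
  myarray[0] = 1.0f;
  delete[] myarray;
}

int main() {
  // Raise the device heap (default 8 MB) before heavy in-kernel allocation.
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20); // 64 MB
  foo<<<2, 64>>>(1024);
  cudaDeviceSynchronize();
  return 0;
}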

10 Comments
2024/10/19
03:07 UTC

0

nvcc is not installed despite successfully running conda install command

I followed following steps to setup conda environment with python 3.8, CUDA 11.8 and pytorch 2.4.1:

$ conda create -n py38_torch241_CUDA118 python=3.8
$ conda activate py38_torch241_CUDA118
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Python and pytorch seem to have installed correctly:

$ python --version
Python 3.8.20

$ pip list | grep torch
torch               2.4.1
torchaudio          2.4.1
torchvision         0.20.0

But when I try to check CUDA version, I realise that nvcc is not installed:

$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

This also caused issues in the further setup of some git repositories which require nvcc. Do I need to run sudo apt install nvidia-cuda-toolkit as suggested above? Shouldn't the conda install command above install nvcc? I tried these steps again after completely deleting all conda packages and environments, but no luck.

Below is some relevant information that might help debug this issue:

$ conda --version
conda 24.5.0

$ nvidia-smi
Sat Oct 19 02:12:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                        User-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 2000 Ada Gene...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   48C    P0            588W /   35W |       8MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1859      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

$ which nvidia-smi
/usr/bin/nvidia-smi

Note that my machine runs an NVIDIA RTX 2000 Ada Generation. Also, the nvidia-smi output above says I am running CUDA 12.4. I installed this driver manually a while back, before I had conda on the machine.

I tried setting CUDA_HOME to my conda environment, but it didn't help:

$ export CUDA_HOME=$CONDA_PREFIX

$ echo $CUDA_HOME
/home/User-M/miniconda3/envs/FairMOT_py38_torch241_CUDA118

$ which nvidia-smi
/usr/bin/nvidia-smi

$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
2 Comments
2024/10/18
20:48 UTC

6

Can I Use CUDA with NVIDIA GeForce GT 730 on Windows 11 for Large-Scale Simulations?

Hi everyone,

I’m working on simulations that iterate 10,000,000 times and want to optimize these calculations using CUDA on my GPU. Here are my details:

  • GPU Model: NVIDIA GeForce GT 730
  • Operating System: Windows 11

Questions:

  1. Is the NVIDIA GeForce GT 730 compatible with CUDA for performing large-scale simulations?
  2. Are there any limitations or considerations I should be aware of when using CUDA with this GPU?
  3. What steps can I take to optimize my simulations using CUDA on this hardware?

Any advice or insights would be greatly appreciated!

Thanks!

9 Comments
2024/10/17
08:54 UTC

2

Using large inputs in cufftdx - ~ 50M points

I'm trying to compute a low-pass filter over a 50M-point transform using cufftdx. The problem is that it seems to limit me to input sizes of 1 << 14. There's no documentation or usage example with large inputs, and I'm trying to understand how people approach this problem. Sure, I can compute a bunch of FFT blocks over the 50M-point space... but am I then supposed to somehow combine the blocks into a single FFT to get the correct values? There's something I'm not understanding.
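As background: cufftdx targets FFTs that fit within a single block's shared memory and registers, which is why it caps out at small sizes, and independent sub-FFTs can only be combined into one large transform via an explicit Cooley-Tukey style decomposition (twiddle multiplications plus transposes between passes). For a single 50M-point transform, the cuFFT host library does that decomposition internally. A hedged sketch of the cuFFT route (error checking omitted):

#include <cufft.h>
#include <cuda_runtime.h>

int main() {
  const int N = 50000000; // one large 1D C2C transform (~400 MB of data)
  cufftComplex *d_data;
  cudaMalloc(&d_data, (size_t)N * sizeof(cufftComplex));
  // ... fill d_data with the signal ...

  cufftHandle plan;
  cufftPlan1d(&plan, N, CUFFT_C2C, 1);
  cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
  // ... zero the bins above the cutoff frequency for the low-pass filter ...
  cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE); // unnormalized: scale by 1/N

  cufftDestroy(plan);
  cudaFree(d_data);
  return 0;
}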

7 Comments
2024/10/17
06:01 UTC
