/r/CUDA

7,067 Subscribers

0

Cuda

Do you have experience in CUDA based real-time Ultrasound Imaging?

0 Comments
2024/05/05
12:42 UTC

1

GPU is not recognised : Ubuntu 22.04.4 LTS

Hello Beautiful Humans,
I am trying to get an LLM to work on my local GPU. I have tried downloading the CUDA toolkit and other packages, but unfortunately nothing works and I am lost in the web of drivers and compatible packages. Could any of you be so kind as to help me out? Any ideas, anything at all?
I appreciate any response and wish all of you the best in this stupid, stupid job market.

Best Regards

OS : Ubuntu 22.04.4 LTS

NVIDIA-SMI 545.29.06

Driver Version: 545.29.06

CUDA Version: 12.3

7 Comments
2024/05/02
14:05 UTC

13

Best Practices for Designing Complex GPU Applications with CUDA with Minimal Kernel Calls

Hey everyone,

I've been delving into GPU programming with CUDA and have been exploring various tutorials and resources. However, most of the material I've found focuses on basic steps involving simple data structures and operations.

I'm interested in designing a medium to large-scale application for GPUs, but the data I need to transfer between the CPU and GPU is significantly more complex than just a few arrays. Think nested data structures, arrays of structs, etc.

My goal is to minimize the number of kernel calls for efficiency reasons, aiming for each kernel call to be high-level and handle a significant portion of the computation.

Could anyone provide insights or resources on best practices for designing and implementing such complex GPU applications with CUDA while minimizing the number of kernel calls? Specifically, I'm looking for guidance on:

  1. Efficient memory management strategies for complex data structures.
  2. Design patterns for breaking down complex computations into fewer, more high-level kernels.
  3. Optimization techniques for minimizing data transfer between CPU and GPU.
  4. Any other tips or resources for optimizing performance and scalability in large-scale GPU applications.

I appreciate any advice or pointers you can offer!
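
For item 1, one common pattern (a sketch, not the only option) is to flatten nested host data into a few contiguous arrays before transfer, so that each array needs a single allocation and a single copy instead of one per inner object. The sketch below is host-side C++ only; `FlatRagged` and `flatten` are hypothetical names, and the actual `cudaMalloc`/`cudaMemcpy` calls are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: a ragged 2D structure stored as two flat arrays.
// Row r occupies values[offsets[r]] .. values[offsets[r + 1] - 1].
struct FlatRagged {
    std::vector<std::size_t> offsets;  // size = number of rows + 1
    std::vector<float> values;         // all rows stored back to back
};

// Flatten a vector of vectors into the layout above.
FlatRagged flatten(const std::vector<std::vector<float>>& nested) {
    FlatRagged flat;
    flat.offsets.reserve(nested.size() + 1);
    flat.offsets.push_back(0);
    for (const auto& row : nested) {
        flat.values.insert(flat.values.end(), row.begin(), row.end());
        flat.offsets.push_back(flat.values.size());
    }
    assert(flat.offsets.size() == nested.size() + 1);
    return flat;
}
```

A kernel would then receive two device pointers (offsets and values) and could process every row in one launch, which also fits the goal of fewer, larger kernel calls.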

4 Comments
2024/05/01
12:11 UTC

1

noob question - do i need CUDA 12.4 with R550 - i have a fresh CPU - Ubuntu 22.04

As per https://docs.nvidia.com/deploy/cuda-compatibility/index.html

CUDA 12.4 is "Not required" for 550, as 12.4 was paired with 550 and therefore no extra packages are needed.

However, will having CUDA 12.4 improve performance?

I have an NVIDIA T4.

4 Comments
2024/04/30
21:22 UTC

4

Home Lab CUDA?

I'm used to running CUDA (for LLM training) on GPUs through Google Colab, and I understand a lot of folks use AWS or GCP. Is there a decent, cheaper way to do this at home that people find useful? I wonder whether a setup with some NUCs or mini PCs running Linux would be useful for this.

I realize this gets posted periodically. Thanks for your patience.

3 Comments
2024/04/30
03:42 UTC

3

CUDA newbie CNN project help

I am working on parallelizing a CNN in CUDA, but I am not reaching high speedups. When I launch each kernel independently in another program, I reach the expected high speedup, but in this project only the first kernel, "fp_c1", is fast. Is having this many kernels causing a large overhead that makes it slower? What would you recommend to fix this?

// Forward propagation of a single row in dataset
static double forward_pass(double data[28][28])
{
    float input[28][28];
    for (int i = 0; i < 28; ++i) {
        for (int j = 0; j < 28; ++j) {
            input[i][j] = data[i][j];
        }
    }
    l_input.clear();
    l_c1.clear();
    l_s1.clear();
    l_f.clear();

    // Convolution layer
    fp_c1<<<>>>((float (*)[28])l_input.output, (float (*)[24][24])l_c1.preact, (float (*)[5][5])l_c1.weight, l_c1.bias);
    apply_step_function<<<>>>(l_c1.preact, l_c1.output, l_c1.O);
    // Pooling layer
    fp_s1<<<>>>((float (*)[24][24])l_c1.output, (float (*)[6][6])l_s1.preact, (float (*)[4][4])l_s1.weight, l_s1.bias);
    apply_step_function<<<>>>(l_s1.preact, l_s1.output, l_s1.O);
    // Fully connected layer
    fp_f<<<>>>((float (*)[6][6])l_s1.output, l_f.preact, (float (*)[6][6][6])l_f.weight, l_f.bias);
    apply_step_function<<<>>>(l_f.preact, l_f.output, l_f.O);
}

4 Comments
2024/04/28
11:15 UTC

1

Cuda environment not detected

I am not able to run or even fine-tune any models because the CUDA environment is not detected. I do have CUDA, the cuDNN library, and NVIDIA GPU drivers installed, and the paths are set in the environment variables. Any solutions?

5 Comments
2024/04/27
23:57 UTC

3

Trying to install the CUDA toolkit on Fedora 40

It seems only to have a repo for F39. I was wondering if I could use the local RPM or the .run file as an alternative, but I'm not entirely sure since they're probably both for F39 as well. Would appreciate any insights. Thanks!

1 Comment
2024/04/25
20:39 UTC

6

Need help in optimisation

Hello!!
I am trying to implement an algorithm that requires the row sums of a 2D matrix,
for example:

0 13 21 22 = 56
13 0 12 13 = 38
21 12 0 13 = 46
22 13 13 0 = 48

I am currently using atomicAdd, which takes a lot of time to compute:

__global__ void rowsum(int *d_matrix, int *d_sums, int n)
{
    long block_Idx = blockIdx.x + (gridDim.x) * blockIdx.y + (gridDim.y * gridDim.x) * blockIdx.z;
    long thread_Idx = threadIdx.x + (blockDim.x) * threadIdx.y + (blockDim.y * blockDim.x) * threadIdx.z;
    long block_Capacity = blockDim.x * blockDim.y * blockDim.z;
    long i = block_Idx * block_Capacity + thread_Idx;

    if (i < n)
    {
        d_sums[i] = 0; // Initialize the sum to 0
        for (int j = 0; j < n; ++j)
        {
            atomicAdd(&d_sums[i], d_matrix[i * n + j]);
        }
    }
}

Any help reducing the run time would be much appreciated.
Thanks
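
One likely reason the kernel above is slow: each thread owns its own row, so no other thread writes `d_sums[i]`, and the atomics buy nothing while forcing a global-memory read-modify-write per element. A minimal sketch of the alternative is to accumulate in a local variable and store once. It is written as plain host-side C++ so it can be checked without a GPU; `row_sum_for_index` is a hypothetical name, and in a kernel the index `i` would come from the thread and block indices as in the post.

```cpp
#include <cassert>

// Work for one row index i (one thread in the CUDA version):
// accumulate in a local variable, then write the result once.
void row_sum_for_index(const int* matrix, int* sums, int n, long i) {
    assert(matrix != nullptr && sums != nullptr);
    if (i >= 0 && i < n) {
        int acc = 0;  // would live in a register on the GPU
        for (int j = 0; j < n; ++j) {
            acc += matrix[i * n + j];
        }
        sums[i] = acc;  // single store, no atomicAdd needed
    }
}
```

A further step, not shown here, would be to look at memory coalescing: with this layout, adjacent threads read addresses that are `n` elements apart.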

14 Comments
2024/04/25
17:54 UTC

4

Need a recommendation for a low profile NVIDIA GPU

Hi All,

I'm looking for recommendations for a low profile GPU to be used for parallel computing applications with CUDA. This GPU is to be installed in a Dell R540 server, which is a 2U rack-mounted server with no support for external power supplies to the GPU. I have been using an old NVIDIA Quadro NVS 295 and am ready to upgrade to something newer with more CUDA capability. I appreciate everyone's insight!

8 Comments
2024/04/24
16:50 UTC

3

CUDA Setup failed despite GPU being available

I need to use the bitsandbytes package to run code that uses the Falcon7B model. I have installed CUDA, and my system has an NVIDIA RTX A6000 GPU and runs Windows 11.

Here is the code, it is just the importing section:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings
warnings.filterwarnings("ignore")

Here is the error:

RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues



RuntimeError: Failed to import transformers.training_args because of the following error (look up to see its traceback):

        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

This error sometimes doesn't appear and the code works, but most of the time I get it, and I have been unable to find a reliable fix. The error first appeared when CUDA wasn't installed on the system. It went away after installation, but when I ran the code again the next day, the same error appeared. Next I tried downgrading Python to a version below 3.11.1, and the code ran again after that. But today I am facing the same error again.

Here is my CUDA version:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

4 Comments
2024/04/24
07:34 UTC

14

WSL + CUDA + Tensorflow + PyTorch in 10 minutes

https://blog.tteles.dev/posts/gpu-tensorflow-pytorch-cuda-wsl/

I spent 2 days attempting to configure GPU acceleration for TF and PyTorch and condensed it into a 10 minute guide, where most of the time is spent on downloads. None of the guides I found online worked for me.

I'd be very happy to receive feedback.

10 Comments
2024/04/23
19:08 UTC

1

Non-VOLTA requirement version?

I am currently using Dask and wanted to experiment with cuDF. I successfully installed everything in Ubuntu, but when I ran <conda create -n rapids-24.04 -c rapidsai -c conda-forge -c nvidia rapids=24.04 python=3.11 cuda-version=12.2> I realized my GTX 1080 Ti does not meet the Compute Capability requirement.

What is my best path forward? Give up and wait till I upgrade GPU - or is it stable to work with an older version?

2 Comments
2024/04/23
15:27 UTC

48

I had my first CUDA related job interview and the interviewer confused CUDA with Quantum Computing

The woman conducting the interview was talking about quantum computing. After saying that I had no idea about quantum computing at all, I pointed out that it was not in the job description, to which she replied that it was a requirement for the job. She got nervous instantly.

She couldn't explain whether the job required OpenAI's Triton or NVIDIA's Triton Inference Server.

Sorry, I just wanted to vent.

12 Comments
2024/04/23
13:19 UTC

4

how to see cuBLAS data layout?

The NVIDIA docs say the cuBLAS library uses column-major storage.

But I have a matrix:

1 2 3 4 5
6 7 8 9 10
...
21 22 23 24 25

and this kernel function:

//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}

It should print 1, 6, ... if the data is column-major, but it still prints 1 2 3 4 5 ...

The complete code is here:

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <iostream>
#include <algorithm>
#include <numeric>

//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}
int main()
{
    //test for cublas matrix memory allocation.
    const int n = 5*5;
    // host matrix a and its device copy d_a
    int *a ;
    int *d_a;
    a=new int[n];
    std::iota(a, a + n, 1);
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            std::cout << a[(r)*5+(c)] << " ";
        }
        std::cout << std::endl;
    }
    cudaMalloc(&d_a, n*sizeof(int));
    cublasSetMatrix(5, 5, sizeof(int), a, 5, d_a, 5);
    printMatrixWithIndex<<<1, 1>>>(d_a, n);
    // Wait for the kernel so its device-side printf output is flushed.
    cudaDeviceSynchronize();

    //free resource
    cudaFree(d_a);
    delete[] a;
    return 0;
}
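
A possible explanation, offered as a sketch: with matching leading dimensions, `cublasSetMatrix` copies the elements in the order they are laid out on the host and does not transpose them; "column-major" only describes how cuBLAS routines interpret that buffer. Reading the buffer with a row-major index such as `a[r*5+c]` therefore reproduces the host order. The host-only C++ below reads one buffer both ways; `at_row_major` and `at_col_major` are hypothetical helpers.

```cpp
#include <cassert>

// Element (r, c) if the buffer is interpreted as row-major with `cols` columns.
int at_row_major(const int* a, int r, int c, int cols) {
    assert(r >= 0 && c >= 0 && c < cols);
    return a[r * cols + c];
}

// Element (r, c) if the same buffer is interpreted as column-major with
// leading dimension `ld` (the number of rows when the matrix is not padded).
int at_col_major(const int* a, int r, int c, int ld) {
    assert(r >= 0 && c >= 0 && r < ld);
    return a[c * ld + r];
}
```

With the buffer 1..25, the column-major view of row 0 is 1, 6, 11, 16, 21, which is the output the post expected; getting it requires indexing with `a[c*5+r]` rather than `a[r*5+c]`.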

1 Comment
2024/04/22
02:32 UTC

5

Ideas for parallel programming project

This semester I am taking a parallel computing course and have to propose a project with a one-month deadline.
I am a backend engineer and have been working with servers since 2018, so I currently have no idea what to implement as my project. What are your ideas (ideally ones with the potential to become an academic paper)?

2 Comments
2024/04/21
21:35 UTC

0

Tensorflow not detecting gpu

I have the proper Windows GPU-supported TensorFlow version (2.10) installed and verified with pip.

I have CUDA 11.2 installed. System path variable is set for "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin"

CUDNN installed with system path set as "C:\Program Files\NVIDIA\CUDNN\v8.1\bin".

I get

C:\Users\Anonymous>python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-04-21 15:27:31.033958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found

The cudart64_110.dll is located in the path variable set -- C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin.

What gives? I'm about to move to PyTorch, but the example I coded uses TensorFlow, and I figured it wouldn't be this ridiculous. The GPU is an RTX 4070.

2 Comments
2024/04/21
20:07 UTC

1

How to install Nsight DL Designer

I seek guidance regarding the utilization of Nsight DL Designer on Linux. Despite successfully downloading the application, I encountered difficulties in executing it. Upon downloading the provided .run file, I performed the requisite steps of granting executable permissions using 'chmod +x' and subsequently executing it with './'. However, upon completion of this process, the application did not manifest itself, and subsequent attempts to execute './' merely resulted in the extraction process recurring.

I would appreciate assistance in resolving this matter. Thank you

0 Comments
2024/04/21
10:53 UTC

4

Profiling energy usage for a PID

I am trying to profile the energy usage of a process (by PID) running on the GPU, but I'm not sure how to do it. The process is a roslaunch executable.

1 Comment
2024/04/19
15:26 UTC

2

CUDA-Vulkan interoperation, image alignment

Hi. I'm trying to update the Vulkan texture from the CUDA kernel.

I found the simpleVulkan example, which does the same with a buffer. I adapted that approach for a texture image because I need to update a height map. The pitfall is image memory alignment (tiling was one too, but I changed it to linear). My question is: how do I take alignment into account when calculating pixel coordinates in the kernel? How do I know how many padding bytes Vulkan added, and where: after each row, or at the end of the whole image data? VkMemoryRequirements provides only the actual size of the image data and the alignment value, without any details.

In the case of my NVIDIA RTX A4500 it is added at the end of each row, but this was detected experimentally and I worry it is vendor specific.
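
For linear-tiling images, Vulkan reports the layout through `vkGetImageSubresourceLayout`, which fills a `VkSubresourceLayout` with `offset` and `rowPitch` (among other fields), so the padding does not have to be guessed. Below is a minimal sketch of the address arithmetic, written as plain C++ so it has no Vulkan dependency; `texel_byte_offset` and its parameter names are hypothetical, and the values would be passed to the CUDA kernel as arguments.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset of texel (x, y) in a linearly tiled 2D image.
// base_offset and row_pitch are assumed to come from VkSubresourceLayout
// (fields `offset` and `rowPitch`); bytes_per_texel depends on the format.
std::uint64_t texel_byte_offset(std::uint64_t base_offset,
                                std::uint64_t row_pitch,
                                std::uint32_t x, std::uint32_t y,
                                std::uint32_t bytes_per_texel) {
    // A row can never be shorter than the texels it holds.
    assert((static_cast<std::uint64_t>(x) + 1) * bytes_per_texel <= row_pitch);
    return base_offset
         + static_cast<std::uint64_t>(y) * row_pitch
         + static_cast<std::uint64_t>(x) * bytes_per_texel;
}
```

Using the queried row pitch instead of `width * bytes_per_texel` keeps the kernel independent of how a particular driver pads its rows.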

0 Comments
2024/04/19
05:07 UTC

3

Is there a way to do the whole installation process of Cuda and cudnn on a virtual environment

Hello, I’m a student doing a deep learning project, and due to hardware limitations I’m working on a computer in one of my university’s labs. Thus, I can’t do the usual CUDA installation, and I’ve been trying to install it directly in my virtual environment, but nothing I’ve tried seems to work. Does anyone know a way to do this? The computer has an NVIDIA Quadro P6000. Thanks.

7 Comments
2024/04/18
20:26 UTC

2

Read data (CSV/Parquet) in CUDA C++.

Hi folks. I want to read a considerably large amount of data, in either CSV or Parquet, in my CUDA C++ code. So far I haven't been able to figure it out or find a straightforward solution. Any suggestions are highly appreciated.

9 Comments
2024/04/17
12:41 UTC

0

Parallelising physics equations for project/ research topic.

I want to do a project and write a paper on parallelising physics equations, such as the wave equation, using CUDA. Can anyone give me a head start? Many thanks.
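
As one possible starting point (an assumption about what such a project might look like, not a recommendation of a specific scheme), the 1D wave equation u_tt = c^2 u_xx can be advanced with an explicit central-difference update in which every interior grid point is computed independently from the two previous time levels. That independence is what maps naturally to one CUDA thread per point. The host-side C++ sketch below shows one step; `wave_step` and `courant2` (meaning (c * dt / dx)^2) are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit time step of the 1D wave equation with fixed (zero) ends.
// courant2 = (c * dt / dx)^2; the scheme is stable only for courant2 <= 1.
std::vector<double> wave_step(const std::vector<double>& prev,
                              const std::vector<double>& curr,
                              double courant2) {
    assert(prev.size() == curr.size() && curr.size() >= 2);
    std::vector<double> next(curr.size(), 0.0);  // boundaries stay at 0
    // Each i is independent of the others: one GPU thread per i.
    for (std::size_t i = 1; i + 1 < curr.size(); ++i) {
        next[i] = 2.0 * curr[i] - prev[i]
                + courant2 * (curr[i + 1] - 2.0 * curr[i] + curr[i - 1]);
    }
    return next;
}
```

In a CUDA version, the loop body would become the kernel, and the three buffers would be rotated between steps rather than reallocated.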

3 Comments
2024/04/17
11:56 UTC

0

CUDA 12.4 and pytorch

I have been trying to install the CUDA toolkit and PyTorch, but I face errors every time I try to download them. The latest version of PyTorch supports CUDA 12.1, but if I download CUDA 12.1, the NVIDIA 530 driver gets installed automatically and messes up my system (Ubuntu 22.04 LTS). PyTorch for CUDA 11.8 uses NVIDIA driver 525, which is not available for my system (even on the NVIDIA driver PPA sites). Is there a way to make CUDA 12.4, cuDNN 8.9, and PyTorch 2.2.2 work together?

2 Comments
2024/04/14
19:18 UTC

34

What’s the career path for people with CUDA C++ skills?

Hi CUDA folks,

I would like to know what positions people with CUDA C++ skills can hold.

For example, as a fresh graduate I learned CUDA over a couple of months to accelerate some mathematical equations, although I have an ECE background.

So what positions or jobs could I pursue that have good potential in the future?

25 Comments
2024/04/12
17:27 UTC
