/r/CUDA

7,023 Subscribers

3

Trying to install the CUDA toolkit on Fedora 40

The toolkit seems to have a repo only for F39. I was wondering whether I could use the local RPM or the .run file as an alternative, but I'm not sure, since both are probably built for F39 as well. I would appreciate any insights. Thanks!

0 Comments
2024/04/25
20:39 UTC

5

Need help in optimisation

Hello!!
I am trying to implement an algorithm that requires the row sums of a 2D matrix,
for example:

0 13 21 22 = 56
13 0 12 13 = 38
21 12 0 13 = 46
22 13 13 0 = 48

I am currently using atomicAdd, which takes a lot of time to compute:

__global__ void rowsum(int *d_matrix, int *d_sums, int n)
{
    long block_Idx = blockIdx.x + (gridDim.x) * blockIdx.y + (gridDim.y * gridDim.x) * blockIdx.z;
    long thread_Idx = threadIdx.x + (blockDim.x) * threadIdx.y + (blockDim.y * blockDim.x) * threadIdx.z;
    long block_Capacity = blockDim.x * blockDim.y * blockDim.z;
    long i = block_Idx * block_Capacity + thread_Idx;

    if (i < n)
    {
        d_sums[i] = 0; // Initialize the sum to 0
        for (int j = 0; j < n; ++j)
        {
            atomicAdd(&d_sums[i], d_matrix[i * n + j]);
        }
    }
}

Any help reducing the run time would be much appreciated.
Thanks
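
A note on the kernel above: each thread is the only writer to its own d_sums[i], so the atomicAdd does not appear to be needed for correctness. Accumulating in a local variable and storing once would avoid n atomic read-modify-write operations per row. Below is a minimal host-side C++ sketch of that per-row work; the function name `row_sums` and the use of std::vector are assumptions for illustration, and in the kernel the outer loop would be replaced by the thread index i.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of the per-row work. In the CUDA kernel, the body of the
// outer loop is what a single thread would execute for its row index i.
std::vector<int> row_sums(const std::vector<int>& matrix, std::size_t n) {
    assert(matrix.size() == n * n);  // matrix is n x n, stored row-major
    std::vector<int> sums(n);
    for (std::size_t i = 0; i < n; ++i) {
        int acc = 0;  // private accumulator (a register in the kernel)
        for (std::size_t j = 0; j < n; ++j) {
            acc += matrix[i * n + j];
        }
        sums[i] = acc;  // single plain store, no atomicAdd
    }
    return sums;
}
```

For the 4 x 4 example in the post this yields 56, 38, 46 and 48. With one thread per row, neighbouring threads still read addresses n elements apart, so a layout where a warp cooperates on one row may help further; that is a guess that would need profiling.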

13 Comments
2024/04/25
17:54 UTC

5

Need a recommendation for a low profile NVIDIA GPU

Hi All,

I'm looking for recommendations for a low-profile GPU to be used for parallel computing applications with CUDA. The GPU will be installed in a Dell R540, a 2U rack-mounted server with no support for external power supplies to the GPU. I have been using an old NVIDIA Quadro NVS 295 and am ready to upgrade to something newer with more CUDA capabilities. I appreciate everyone's insight!

8 Comments
2024/04/24
16:50 UTC

3

CUDA Setup failed despite GPU being available

I need to use the bitsandbytes package to run code that uses the Falcon-7B model. I have installed CUDA, and my system has an NVIDIA RTX A6000 GPU and runs Windows 11.

Here is the code; it is just the import section:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings
warnings.filterwarnings("ignore")

Here is the error:

RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues



RuntimeError: Failed to import transformers.training_args because of the following error (look up to see its traceback):

        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

This error sometimes doesn't appear and the code works, but most of the time I get it and I am unable to find a reliable fix. The error first appeared when CUDA wasn't installed on the system. It went away after installation, but when I ran the code again the next day, the same error appeared. Next I tried downgrading Python to a version below 3.11.1, and the code ran again after that. But today I am facing the same error again.

Here is my CUDA version:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

4 Comments
2024/04/24
07:34 UTC

12

WSL + CUDA + Tensorflow + PyTorch in 10 minutes

https://blog.tteles.dev/posts/gpu-tensorflow-pytorch-cuda-wsl/

I spent 2 days attempting to configure GPU acceleration for TF and PyTorch and condensed it into a 10-minute guide, where most of the time is spent on downloads. None of the guides I found online worked for me.

I'd be very happy to receive feedback.

10 Comments
2024/04/23
19:08 UTC

1

Non-VOLTA requirement version?

I am currently using Dask and wanted to experiment with cuDF. I successfully installed everything in Ubuntu, but when I ran <conda create -n rapids-24.04 -c rapidsai -c conda-forge -c nvidia rapids=24.04 python=3.11 cuda-version=12.2> I realized that my GTX 1080 Ti does not meet the required compute capability.

What is my best path forward? Should I give up and wait until I upgrade my GPU, or is it stable to work with an older version?

0 Comments
2024/04/23
15:27 UTC

46

I had my first CUDA related job interview and the interviewer confused CUDA with Quantum Computing

The woman conducting the interview kept talking about quantum computing. I said that I had no idea about quantum computing at all and pointed out that it was not in the job description, to which she replied that it was a requirement for the job. She got nervous instantly.

She couldn't explain whether the job required OpenAI's Triton or NVIDIA's Triton Inference Server.

Sorry, I just wanted to vent.

12 Comments
2024/04/23
13:19 UTC

4

how to see cuBLAS data layout?

The NVIDIA docs say the cuBLAS library uses column-major storage.

But I have this matrix:
1 2 3 4 5
6 7 8 9 10
...
21 22 23 24 25

and this kernel function:

//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}

It should print 1, 6, ... if the storage is column-major, but it still prints 1 2 3 4 5 ...

The complete code is here:

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cstdio>   // printf used inside the kernel
#include <iostream>
#include <algorithm>
#include <numeric>

//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}
int main()
{
    // test of cuBLAS matrix memory allocation
    const int n = 5*5;
    // matrix on the host (a) and on the device (d_a)
    int *a ;
    int *d_a;
    a=new int[n];
    std::iota(a, a + n, 1);
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            std::cout << a[(r)*5+(c)] << " ";
        }
        std::cout << std::endl;
    }
    cudaMalloc(&d_a, n*sizeof(int));
    cublasSetMatrix(5, 5, sizeof(int), a, 5, d_a, 5);
    printMatrixWithIndex<<<1, 1>>>(d_a, n);
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed

    //free resource
    cudaFree(d_a);
    delete[] a;
    return 0;
}
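
A note on why the output looks unchanged: as far as the documentation describes it, cublasSetMatrix copies the elements without reordering them, and "column-major" only describes how cuBLAS routines interpret the buffer, i.e. element (r, c) is read from a[c * ld + r]. The kernel above indexes with a[r * 5 + c], so it prints the buffer in the order it was filled. Below is a host-side C++ sketch of the two interpretations; the function names `at_row_major`, `at_col_major` and `make_buffer` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Element (r, c) of a matrix stored row-major with `cols` columns.
int at_row_major(const std::vector<int>& a, std::size_t cols,
                 std::size_t r, std::size_t c) {
    assert(c < cols && r * cols + c < a.size());
    return a[r * cols + c];
}

// Element (r, c) of a matrix stored column-major with leading dimension ld
// (ld equals the number of rows when there is no padding).
int at_col_major(const std::vector<int>& a, std::size_t ld,
                 std::size_t r, std::size_t c) {
    assert(r < ld && c * ld + r < a.size());
    return a[c * ld + r];
}

// Same buffer as in the post: the values 1..25 written consecutively.
std::vector<int> make_buffer() {
    std::vector<int> a(25);
    std::iota(a.begin(), a.end(), 1);
    return a;
}
```

With this buffer, at_row_major(a, 5, 0, 1) is 2 while at_col_major(a, 5, 0, 1) is 6, so cuBLAS would treat the same memory as the transpose of the matrix the kernel prints.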

1 Comment
2024/04/22
02:32 UTC

4

Ideas for parallel programming project

This semester I am taking a parallel computing course, and I have to propose a project with a deadline of one month.
I am a backend engineer and have been working with servers since 2018, so I currently have no idea what to implement as my project. What are your ideas (ideally ones with the potential to become an academic paper)?

2 Comments
2024/04/21
21:35 UTC

0

TensorFlow not detecting GPU

I have TensorFlow 2.10, the proper version with GPU support on Windows, installed and verified with pip.

I have CUDA 11.2 installed. The system path variable is set to "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin".

cuDNN is installed, with the system path set to "C:\Program Files\NVIDIA\CUDNN\v8.1\bin".

I get

C:\Users\Anonymous>python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-04-21 15:27:31.033958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found

cudart64_110.dll is located in the directory listed in the path variable: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin.

What gives? I'm about to move to PyTorch, but the example I coded uses TensorFlow, and I figured it wouldn't be this ridiculous. The GPU is an RTX 4070.

2 Comments
2024/04/21
20:07 UTC

1

How to install Nsight DL Designer

I seek guidance regarding the utilization of Nsight DL Designer on Linux. Despite successfully downloading the application, I encountered difficulties in executing it. Upon downloading the provided .run file, I performed the requisite steps of granting executable permissions using 'chmod +x' and subsequently executing it with './'. However, upon completion of this process, the application did not manifest itself, and subsequent attempts to execute './' merely resulted in the extraction process recurring.

I would appreciate assistance in resolving this matter. Thank you

0 Comments
2024/04/21
10:53 UTC

5

Profiling energy usage for a PID

I am trying to profile the energy usage of a PID running on the GPU, but I am not sure how to do it. I need this for a roslaunch executable.

1 Comment
2024/04/19
15:26 UTC

2

CUDA-Vulkan interoperation, image alignment

Hi. I'm trying to update a Vulkan texture from a CUDA kernel.

I found the simpleVulkan example, which does the same thing but with a buffer. I adapted that approach for a texture image because I need to update a height map. The pitfall is image memory alignment (tiling was a problem too, but I changed it to linear). My question is: how do I take alignment into account when calculating pixel coordinates in the kernel? How can I find out how many padding bytes Vulkan added, and where: after each row, or at the end of the whole image data? VkMemoryRequirements provides only the actual size of the image data and the alignment value, without any details.

On my NVIDIA RTX A4500 the padding is added at the end of each row, but I determined this experimentally and I worry that it is vendor-specific.
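
For a linear-tiled image, the per-row padding may not have to be guessed: vkGetImageSubresourceLayout fills a VkSubresourceLayout whose offset and rowPitch fields give the byte offset of the first texel and the byte distance between consecutive rows. Below is a sketch of the address calculation, written as plain C++ so it can be checked on the host. The struct `LinearLayout` and the function `texel_offset` are hypothetical stand-ins; in practice the two values would be queried on the Vulkan side and passed to the kernel as arguments.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two VkSubresourceLayout fields needed here.
struct LinearLayout {
    std::uint64_t offset;     // byte offset of the first texel in the memory
    std::uint64_t row_pitch;  // bytes between the starts of consecutive rows
};

// Byte offset of texel (x, y) in a 2D linear image.
// bytes_per_texel would be 4 for a format such as VK_FORMAT_R32_SFLOAT.
std::uint64_t texel_offset(const LinearLayout& layout,
                           std::uint32_t x, std::uint32_t y,
                           std::uint32_t bytes_per_texel) {
    // The texel must fit inside one row, padding included.
    assert((static_cast<std::uint64_t>(x) + 1) * bytes_per_texel <= layout.row_pitch);
    return layout.offset
         + static_cast<std::uint64_t>(y) * layout.row_pitch
         + static_cast<std::uint64_t>(x) * bytes_per_texel;
}
```

In the kernel, the same arithmetic would be applied to a byte pointer to the imported memory before casting to the texel type. Using the queried row pitch instead of width * bytes_per_texel should keep the code independent of how a particular vendor pads rows.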

0 Comments
2024/04/19
05:07 UTC

3

Is there a way to do the whole installation process of CUDA and cuDNN in a virtual environment?

Hello, I’m a student doing a deep learning project, and due to hardware limitations I’m working on a computer in one of my university’s labs. Thus, I can’t do the usual CUDA installation, and I’ve been trying to install it directly in my virtual environment, but nothing I’ve tried seems to work. Does anyone know a way to do this? The computer has an NVIDIA Quadro P6000. Thanks.

7 Comments
2024/04/18
20:26 UTC

2

Read data (CSV/Parquet) in CUDA C++.

Hi folks. I want to read a considerably large amount of data, in either CSV or Parquet format, in my CUDA C++ code. So far I haven't been able to figure it out or find a straightforward solution. Any suggestions are highly appreciated.

9 Comments
2024/04/17
12:41 UTC

0

Parallelising physics equations for project/research topic.

I want to do a project and write a paper on parallelising physics equations, such as the wave equation, using CUDA. Can anyone give me a head start? Many thanks.

3 Comments
2024/04/17
11:56 UTC

0

CUDA 12.4 and PyTorch

I have been trying to install the CUDA toolkit and PyTorch, but I have been facing errors every time I try to download them. The latest version of PyTorch supports CUDA 12.1, but if I download CUDA 12.1, the NVIDIA 530 driver automatically gets installed and messes up my system (Ubuntu 22.04 LTS). The PyTorch build for CUDA 11.8 uses NVIDIA driver 525, which is not available for my system (even on the NVIDIA driver PPA websites). Is there a way to make CUDA 12.4, cuDNN 8.9 and PyTorch 2.2.2 work together?

2 Comments
2024/04/14
19:18 UTC

34

What’s the career path for people skilled in CUDA C++?

Hi CUDA folks,

I would like to know what positions people with CUDA C++ skills can hold.

For example, as a fresh graduate I learned CUDA over a couple of months to accelerate some mathematical equations, although I have an ECE background.

So what positions or jobs could I pursue that have good potential in the future?

22 Comments
2024/04/12
17:27 UTC

1

Simple PTX parser

Is there a simple PTX parser that extracts a kernel name and kernel parameter types?
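
For simple cases, a regular expression over the PTX text may be enough. Below is a minimal C++ sketch, assuming entries are declared in the usual `.entry name( .param <type> <ident>, ... )` form. The struct `KernelSignature` and the function `parse_ptx_entries` are hypothetical names, and only parameters of the form `.param [.align N] <type> <name>` are recognised; anything more elaborate is silently skipped, and comments in the PTX are not stripped.

```cpp
#include <regex>
#include <string>
#include <vector>

struct KernelSignature {
    std::string name;                      // entry name as written (usually mangled)
    std::vector<std::string> param_types;  // e.g. ".u64", ".u32", ".b8"
};

// Scans PTX text for ".entry name( ... )" declarations and collects the
// type token of each ".param" inside the parentheses.
std::vector<KernelSignature> parse_ptx_entries(const std::string& ptx) {
    static const std::regex entry_re(
        R"re(\.entry\s+([A-Za-z_$%][A-Za-z0-9_$]*)\s*\(([^)]*)\))re");
    static const std::regex param_re(
        R"re(\.param\s+(?:\.align\s+\d+\s+)?(\.[a-z0-9]+)\s+[A-Za-z_$%][A-Za-z0-9_$]*)re");

    std::vector<KernelSignature> kernels;
    const std::sregex_iterator end;  // default-constructed iterator marks the end
    for (std::sregex_iterator it(ptx.begin(), ptx.end(), entry_re); it != end; ++it) {
        KernelSignature sig;
        sig.name = (*it)[1].str();
        const std::string params = (*it)[2].str();  // text between the parentheses
        for (std::sregex_iterator p(params.begin(), params.end(), param_re); p != end; ++p) {
            sig.param_types.push_back((*p)[1].str());
        }
        kernels.push_back(sig);
    }
    return kernels;
}
```

The returned names are whatever appears in the PTX, so C++ kernels show up with their mangled names unless they were declared extern "C".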

1 Comment
2024/04/12
10:38 UTC

10

Fully Fused Map, Reduce And Scan Cuda Kernels In Spiral

7 Comments
2024/04/12
09:53 UTC

1

How to use CUDA installed by conda in CMake?

I need to compile a project that uses libtorch. Can I use the cudatoolkit (version 11.8) that was installed by conda? If so, how should I configure CMakeLists.txt? It seems that CMake looks for CUDA in the /usr directory even though the conda environment has been activated.

4 Comments
2024/04/11
16:11 UTC

4

8bit gemm

Hello,

I'm interested in learning how to implement an int8 matmul in CUDA. Could someone point me to a good implementation that I could study?

5 Comments
2024/04/10
20:43 UTC

1

NVCC gives no output when trying to compile

I have a project that uses CUDA to perform matrix-vector operations. The project had been working fine, but since I updated Visual Studio 2022 to 17.9.6 (I don't know what version I updated from), my build fails and MSVC reports that the command "(long command)" exited with code 2. I have read other threads and tried changing the verbosity of MSVC and nvcc, but no errors are reported before this command is run, and there seems to be no output. I tried running the command myself from the command prompt, but it gives no output, no exit code, no error, just nothing, though there is a small delay as if it is doing something when the command is run. I can run nvcc --version, and I have tried reinstalling CUDA.

I have tried to compile the project in the command prompt and in Visual Studio with no success. I downloaded a sample project, and it has the same issue.

11 Comments
2024/04/10
19:15 UTC

6

Efficiently implementing a broadcast in CUDA

Hi all,

I am trying to implement a broadcast operation in CUDA that, given a tensor and an output shape, creates a new tensor of that shape by broadcasting the dimensions of the original tensor.

E.g. the input shape could be [4096, 1] and the output shape could be [4096, 4096].

I currently have the following implementation. The issue with this approach is that, for my example, I am doing 4096 * 4096 loads and 4096 * 4096 stores when theoretically I should only be doing 4096 stores.

Is there a way to solve this with just 4096 stores?

I think the shfl (warp shuffle) instruction might help, but I am not sure how to generalize it to arbitrary dimensions and strides.

Are there any other approaches or code pointers to existing implementations? Thanks

constexpr int MAX_DIMS = 8; // assumed upper bound on the tensor rank

// Shapes and strides are passed by value in a plain struct,
// because std::vector cannot be used inside a kernel.
struct TensorDesc {
    int ndim;
    int dims[MAX_DIMS];
    int strides[MAX_DIMS];
};

// Assumes input and output have the same rank and the output is contiguous.
__global__ void broadcast(const float *input_array,
                          float *output_array,
                          TensorDesc input,
                          TensorDesc output,
                          int num_output_elems) {
    int elem = blockIdx.x * blockDim.x + threadIdx.x;
    if (elem >= num_output_elems) return; // guard the last, partially filled block

    int input_offset = 0;
    int output_offset = 0;

    // calculate the output coordinate to write to
    // and the input coordinate to read from
    for (int i = 0; i < output.ndim; i++) {
        int output_coord = (elem / output.strides[i]) % output.dims[i];

        // if input.dims[i] is 1, map to coordinate 0
        int input_coord = (input.dims[i] == 1) ? 0 : output_coord;

        input_offset += input_coord * input.strides[i];
        output_offset += output_coord * output.strides[i];
    }

    // load data, then store it
    float data = input_array[input_offset];
    output_array[output_offset] = data;
}
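
On the load/store count: if the consumer can index through strides, the broadcast may not need to be materialised at all. Giving every broadcast dimension a stride of 0 makes all coordinates along that dimension map to the same input element, which is how NumPy and PyTorch represent expanded tensors. Below is a host-side C++ sketch of that index calculation; the names `broadcast_strides` and `offset_of` are assumptions for illustration, and equal rank for input and output is assumed, as in the kernel above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strides to use when reading the input as if it had the output shape:
// dimensions of size 1 that are being expanded get stride 0.
std::vector<std::size_t> broadcast_strides(const std::vector<std::size_t>& in_dims,
                                           const std::vector<std::size_t>& in_strides,
                                           const std::vector<std::size_t>& out_dims) {
    assert(in_dims.size() == out_dims.size());
    assert(in_dims.size() == in_strides.size());
    std::vector<std::size_t> strides(in_dims.size());
    for (std::size_t i = 0; i < in_dims.size(); ++i) {
        // A dimension must either match the output or have size 1.
        assert(in_dims[i] == out_dims[i] || in_dims[i] == 1);
        strides[i] = (in_dims[i] == 1 && out_dims[i] != 1) ? 0 : in_strides[i];
    }
    return strides;
}

// Linear offset of a coordinate under the given strides.
std::size_t offset_of(const std::vector<std::size_t>& coords,
                      const std::vector<std::size_t>& strides) {
    assert(coords.size() == strides.size());
    std::size_t offset = 0;
    for (std::size_t i = 0; i < coords.size(); ++i) {
        offset += coords[i] * strides[i];
    }
    return offset;
}
```

With input shape [4096, 1] and output shape [4096, 4096], the strides become {1, 0}, so every column of row r reads input element r. If the full output really has to be written to memory, each of its 4096 * 4096 elements still needs a store, but a thread could load its input value once and write several columns, which would cut the loads.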

5 Comments
2024/04/10
05:34 UTC

0

How do you rate CUDA?

Hello, I am just doing some independent research. I was just curious how you, as CUDA developers/ enthusiasts, find CUDA overall in terms of usefulness? Thanks in advance.

7 Comments
2024/04/10
03:02 UTC
