/r/CUDA
Do you have experience in CUDA-based real-time ultrasound imaging?
Hello Beautiful Humans,
I am trying to get an LLM to work on my local GPU. I have tried downloading the CUDA Toolkit and other packages, but unfortunately nothing works and I am lost in the web of drivers and compatible packages. Could any of you be so kind as to help me out? Any ideas, anything at all?
I appreciate any response and wish all of you the best in this stupid, stupid job market.
Best Regards
OS : Ubuntu 22.04.4 LTS
NVIDIA-SMI 545.29.06
Driver Version: 545.29.06
CUDA Version: 12.3
Hey everyone,
I've been delving into GPU programming with CUDA and have been exploring various tutorials and resources. However, most of the material I've found focuses on basic steps involving simple data structures and operations.
I'm interested in designing a medium to large-scale application for GPUs, but the data I need to transfer between the CPU and GPU is significantly more complex than just a few arrays. Think nested data structures, arrays of structs, etc.
My goal is to minimize the number of kernel calls for efficiency reasons, aiming for each kernel call to be high-level and handle a significant portion of the computation.
Could anyone provide insights or resources on best practices for designing and implementing such complex GPU applications with CUDA while minimizing the number of kernel calls? Specifically, I'm looking for guidance on:
I appreciate any advice or pointers you can offer!
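One common way to move nested, variable-length data to a GPU is to flatten it into a few contiguous arrays before the transfer. The sketch below is only an illustration in plain Python, with no GPU involved. `flatten_ragged` and `row_slice` are hypothetical helper names, and the values-plus-offsets layout is an assumption about the shape of the data, not something stated in the post.

```python
from typing import List, Sequence, Tuple


def flatten_ragged(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[int]]:
    """Flatten variable-length rows into (values, offsets).

    offsets has len(rows) + 1 entries, and row i occupies
    values[offsets[i]:offsets[i + 1]].
    """
    values: List[float] = []
    offsets: List[int] = [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def row_slice(values: List[float], offsets: List[int], i: int) -> List[float]:
    """Recover row i from the flat arrays (what a kernel would index into)."""
    return values[offsets[i]:offsets[i + 1]]


# Hypothetical nested input: three rows of different lengths.
nested = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
values, offsets = flatten_ragged(nested)
print(values)   # [1.0, 2.0, 3.0, 4.0, 5.0]
print(offsets)  # [0, 2, 2, 5]
```

With this layout, each flat array could be copied to the device in a single transfer, and a kernel would need only `offsets` to locate the data of row `i`.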
As per https://docs.nvidia.com/deploy/cuda-compatibility/index.html
CUDA 12.4 is "Not required" for 550, as 12.4 was paired with 550 and therefore no extra packages are needed.
However, will having CUDA 12.4 improve performance?
I have an NVIDIA T4.
I'm used to running CUDA workloads (for LLM training) on Google Colab to access GPUs, and I understand a lot of folks use AWS or GCP. Is there a decent, cheaper way to do this at home that people find useful? I wonder whether a setup with some NUCs or mini PCs running Linux would be useful for this.
I realize this gets posted periodically. Thanks for your patience.
I am working on parallelizing a CNN in CUDA, but I am not reaching high speedups. When I launch each kernel independently in another program, I reach the expected high speedup, but in this project only the first kernel, "fp_c1", shows a high speedup. Is having this many kernels causing a large overhead that makes it slower? What would you recommend to fix this?
// Forward propagation of a single row in dataset
static double forward_pass(double data[28][28])
{
    float input[28][28];
    for (int i = 0; i < 28; ++i) {
        for (int j = 0; j < 28; ++j) {
            input[i][j] = data[i][j];
        }
    }
    l_input.clear();
    l_c1.clear();
    l_s1.clear();
    l_f.clear();
    //Convolution Layer
    fp_c1<<<>>>((float (*)[28])l_input.output, (float (*)[24][24])l_c1.preact, (float (*)[5][5])l_c1.weight,l_c1.bias);
    apply_step_function<<<>>>(l_c1.preact, l_c1.output, l_c1.O);
    // Pooling layer
    fp_s1<<<>>>((float (*)[24][24])l_c1.output, (float (*)[6][6])l_s1.preact, (float (*)[4][4])l_s1.weight,l_s1.bias);
    apply_step_function<<<>>>(l_s1.preact, l_s1.output, l_s1.O);
    // Fully connected layer
    fp_f<<<>>>((float (*)[6][6])l_s1.output, l_f.preact, (float (*)[6][6][6])l_f.weight,l_f.bias);
    apply_step_function<<<>>>(l_f.preact, l_f.output, l_f.O);
}
I'm not able to run any models, or even fine-tune them, because the CUDA environment is not detected. But I do have CUDA, the cuDNN library, and the NVIDIA GPU drivers installed, and the paths are also set in the environment variables. Any solution?
It seems only to have a repo for F39. I was wondering if I could use the local RPM or the .run file as an alternative, but I'm not entirely sure since they're probably both for F39 as well. Would appreciate any insights. Thanks!
Hello!!
I am trying to implement an algorithm that requires finding the row sums of a 2D matrix,
for example:
0 13 21 22 = 56
13 0 12 13 = 38
21 12 0 13 = 46
22 13 13 0 = 48
I am currently using atomicAdd, which takes a lot of time to compute:
__global__ void rowsum(int *d_matrix, int *d_sums, int n)
{
    long block_Idx = blockIdx.x + (gridDim.x) * blockIdx.y + (gridDim.y * gridDim.x) * blockIdx.z;
    long thread_Idx = threadIdx.x + (blockDim.x) * threadIdx.y + (blockDim.y * blockDim.x) * threadIdx.z;
    long block_Capacity = blockDim.x * blockDim.y * blockDim.z;
    long i = block_Idx * block_Capacity + thread_Idx;
    if (i < n)
    {
        d_sums[i] = 0; // Initialize the sum to 0
        for (int j = 0; j < n; ++j)
        {
            atomicAdd(&d_sums[i], d_matrix[i * n + j]);
        }
    }
}
Any help to reduce time usage would help a lot.
thanks
Hi All,
I'm looking for recommendations for a low-profile GPU to be used for parallel computing applications with CUDA. This GPU is to be installed in a Dell R540 server, which is a 2U rack-mounted server with no support for external power supplies to the GPU. I have been using an old NVIDIA Quadro NVS 295 and am ready to upgrade to something new with more CUDA capabilities. I appreciate everyone's insight!
I need to use the bitsandbytes package to run code that uses the Falcon7B model. I have installed CUDA, and my system has an NVIDIA RTX A6000 GPU and runs Windows 11.
Here is the code, it is just the importing section:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings
warnings.filterwarnings("ignore")
Here is the error:
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
RuntimeError: Failed to import transformers.training_args because of the following error (look up to see its traceback):
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
This error sometimes doesn't appear and the code works, but most of the time I get this error and I am unable to find an accurate fix. The error first appeared when CUDA wasn't installed on the system. It didn't occur right after installation, but when I ran the code again the next day, the same error appeared. Next I tried downgrading Python to a version below 3.11.1, and the code ran again after that. But today I am facing the same error again.
Here is my CUDA version:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
https://blog.tteles.dev/posts/gpu-tensorflow-pytorch-cuda-wsl/
I spent 2 days attempting to configure GPU acceleration for TF and PyTorch and condensed it into a 10-minute guide, where most of the time is spent on downloads. None of the guides I found online worked for me.
I'd be very happy to receive feedback.
I am currently using Dask and wanted to experiment with cuDF. I successfully installed everything on Ubuntu, but when I ran <conda create -n rapids-24.04 -c rapidsai -c conda-forge -c nvidia rapids=24.04 python=3.11 cuda-version=12.2> I realized my GTX 1080 Ti does not meet the compute capability requirement.
What is my best path forward? Should I give up and wait until I upgrade my GPU, or is it stable to work with an older version?
The woman conducting the interview was talking about quantum computing. After saying that I had no idea about quantum computing at all, I pointed out that it was not in the job description, to which she replied that it was a requirement for the job. She instantly got nervous.
She couldn't explain whether the job required OpenAI's Triton or NVIDIA's Triton Inference Server.
Sorry, I just wanted to vent.
The NVIDIA documentation says the cuBLAS library uses column-major storage.
But I have a matrix:
1 2 3 4 5
6 7 8 9 10
...
21 22 23 24 25
in this kernel function:
//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}
It should print 1, 6, ... if the storage is column-major, but it still prints 1 2 3 4 5 ...
The complete code is here:
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <iostream>
#include <algorithm>
#include <numeric>
//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}
int main()
{
    //test for cublas matrix memory allocation.
    const int n = 5*5;
    // matrix on host A and B
    int *a ;
    int *d_a;
    a=new int[n];
    std::iota(a, a + n, 1);
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            std::cout << a[(r)*5+(c)] << " ";
        }
        std::cout << std::endl;
    }
    cudaMalloc(&d_a, n*sizeof(int));
    cublasSetMatrix(5, 5, sizeof(int), a, 5, d_a, 5);
    printMatrixWithIndex<<<1, 1>>>(d_a, n);
    //free resource
    cudaFree(d_a);
    delete[] a;
    return 0;
}
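For what it is worth, "column-major" describes how an index pair (r, c) maps to a linear offset; it is not a reordering of the bytes. As far as I understand it, `cublasSetMatrix` with both leading dimensions equal to the row count copies the elements in the order they already have, so a kernel that reads `a[r*5+c]` sees the same sequence as the host loop. Here is a small plain-Python sketch of the two index formulas (the helper names are made up for illustration):

```python
ROWS, COLS = 5, 5

# Same contents as std::iota(a, a + n, 1): 1, 2, ..., 25 in linear memory.
flat = list(range(1, ROWS * COLS + 1))


def read_row_major(buf, r, c, cols=COLS):
    """Element (r, c) when the buffer is interpreted as row-major."""
    return buf[r * cols + c]


def read_col_major(buf, r, c, ld=ROWS):
    """Element (r, c) when the buffer is interpreted as column-major
    with leading dimension ld (the cuBLAS convention)."""
    return buf[c * ld + r]


as_row_major = [[read_row_major(flat, r, c) for c in range(COLS)] for r in range(ROWS)]
as_col_major = [[read_col_major(flat, r, c) for c in range(COLS)] for r in range(ROWS)]

print(as_row_major[0])  # [1, 2, 3, 4, 5]
print(as_col_major[0])  # [1, 6, 11, 16, 21]
```

The buffer is identical in both cases; only the formula used to read it changes. A cuBLAS routine would treat this buffer as the transpose of the matrix printed by the host loop.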
This semester I have a parallel computing course, and I have to propose a project with a deadline of one month.
I am a backend engineer and have been working with servers since 2018, so currently I have no idea what to do or implement as my project. What are your ideas (ideally ones that also have the potential to become an academic paper)?
I have the proper GPU-enabled, Windows-supported TensorFlow 2.10 version installed and verified with pip.
I have CUDA 11.2 installed. The system path variable is set to "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin".
cuDNN is installed with the system path set to "C:\Program Files\NVIDIA\CUDNN\v8.1\bin".
I get
C:\Users\Anonymous>python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2024-04-21 15:27:31.033958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
The cudart64_110.dll is located in the path variable set -- C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin.
What gives? I'm about to move to PyTorch, but the example I coded uses TensorFlow, and I figured it wouldn't be this ridiculous. The GPU is an RTX 4070.
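One thing that may be worth checking is whether the Python process itself sees the directory, since a terminal opened before PATH was edited keeps the old value. The sketch below uses only the standard library. `dirs_containing` is a hypothetical helper name, and it only reports which PATH entries contain a given file; it makes no claim about how TensorFlow resolves the DLL internally.

```python
import os
from typing import List, Optional


def dirs_containing(filename: str, path_value: Optional[str] = None) -> List[str]:
    """Return the PATH directories that contain `filename`, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        directory = directory.strip().strip('"')  # tolerate quoted entries
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# File name taken from the warning message above.
found = dirs_containing("cudart64_110.dll")
print(found if found else "not found on this process's PATH")
```

If the list comes back empty in the same console where the TensorFlow command fails, the PATH change has probably not reached that process.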
I seek guidance regarding the utilization of Nsight DL Designer on Linux. Despite successfully downloading the application, I encountered difficulties in executing it. Upon downloading the provided .run file, I performed the requisite steps of granting executable permissions using 'chmod +x' and subsequently executing it with './'. However, upon completion of this process, the application did not manifest itself, and subsequent attempts to execute './' merely resulted in the extraction process recurring.
I would appreciate assistance in resolving this matter. Thank you
I am trying to profile a process (by PID) running on the GPU, but I am not sure how to do it. I need it for a roslaunch executable.
Hi. I'm trying to update a Vulkan texture from a CUDA kernel.
I have found the simpleVulkan example, which does the same but with a buffer. I adapted that approach to a texture image because I need to update a height map. The pitfall is image memory alignment (tiling was a problem too, but I changed it to linear). My question is how to take alignment into account when calculating pixel coordinates in the kernel. How can I know how many padding bytes Vulkan added, and where: after each row, or at the end of the whole image data? VkMemoryRequirements provides only the actual size of the image data and the alignment value, without any details.
In the case of my NVIDIA RTX A4500 it is added at the end of each row, but this was detected experimentally and I worry it is vendor specific.
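For linear-tiling images, Vulkan reports the layout through `vkGetImageSubresourceLayout`, which fills a `VkSubresourceLayout` with `offset` and `rowPitch` (among other fields), so the padding should not need to be guessed. The addressing arithmetic is sketched below in plain Python. The function name and the sample numbers are invented for illustration, and the same expression would be used inside the CUDA kernel with the values queried on the host.

```python
def texel_byte_offset(x: int, y: int, layout_offset: int,
                      row_pitch: int, bytes_per_texel: int) -> int:
    """Byte offset of texel (x, y) in a 2D linear-tiling image.

    layout_offset and row_pitch stand for VkSubresourceLayout::offset
    and VkSubresourceLayout::rowPitch.
    """
    return layout_offset + y * row_pitch + x * bytes_per_texel


# Hypothetical numbers: 100 texels per row, 4 bytes each, rows padded to 512 bytes.
width, bytes_per_texel, row_pitch = 100, 4, 512
padding_per_row = row_pitch - width * bytes_per_texel

print(padding_per_row)                                         # 112
print(texel_byte_offset(3, 2, 0, row_pitch, bytes_per_texel))  # 1036
```

Because the pitch is queried rather than assumed, the same kernel code should keep working if another driver or vendor chooses a different amount of row padding.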
Hello, I’m a student doing a deep learning project, and due to hardware limitations I’m working on a computer in one of my university’s labs. Thus, I can’t do the usual CUDA installation, and I’ve been trying to install it directly in my virtual environment, but nothing I’ve tried seems to work. Does anyone know a way to do this? The computer has an NVIDIA Quadro P6000. Thanks.
Hi folks. I want to read a considerably large amount of data, in either CSV or Parquet format, in my CUDA C++ code. So far I haven't been able to figure it out or find a straightforward solution. Any suggestion is highly appreciated.
I want to do a project and write a paper on parallelising physics equations, such as the wave equation, using CUDA. Can anyone give me a head start? Many thanks.
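As a possible starting point, the 1D wave equation u_tt = c^2 u_xx can be advanced with an explicit finite-difference scheme in which every grid point is updated independently from the two previous time levels. That independence is what makes it map naturally onto one GPU thread per point. The plain-Python sketch below shows only the update rule. The function name, the fixed zero boundaries, and the grid values are assumptions chosen for illustration.

```python
from typing import List


def wave_step(u_prev: List[float], u_curr: List[float], courant: float) -> List[float]:
    """One explicit time step of the 1D wave equation.

    u_next[i] = 2*u[i] - u_prev[i] + C^2 * (u[i+1] - 2*u[i] + u[i-1]),
    where C = c*dt/dx is the Courant number (C <= 1 for stability).
    Both end points are held at zero.
    """
    n = len(u_curr)
    c2 = courant * courant
    u_next = [0.0] * n
    # Each i depends only on old values, so on a GPU this loop body
    # would become the work of a single thread.
    for i in range(1, n - 1):
        laplacian = u_curr[i + 1] - 2.0 * u_curr[i] + u_curr[i - 1]
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + c2 * laplacian
    return u_next


# Hypothetical initial state: a single bump in the middle, initially at rest.
u_prev = [0.0, 0.0, 1.0, 0.0, 0.0]
u_curr = [0.0, 0.0, 1.0, 0.0, 0.0]
u_next = wave_step(u_prev, u_curr, courant=1.0)
print(u_next)  # [0.0, 1.0, -1.0, 1.0, 0.0]
```

A CUDA version would typically keep three device arrays and rotate them between steps, launching one kernel per time step.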
I have been trying to install the CUDA Toolkit and PyTorch, but I run into errors every time I try to download them. The latest version of PyTorch supports CUDA 12.1, but if I download CUDA 12.1, the NVIDIA 530 driver gets installed automatically and messes up my system (Ubuntu 22.04 LTS). The CUDA 11.8 build of PyTorch uses NVIDIA driver 525, which is not available for my system (even on those NVIDIA driver PPA websites). Is there a way to make CUDA 12.4, cuDNN 8.9, and PyTorch 2.2.2 work together?
Hi CUDA folks,
I would like to know what positions people with CUDA C++ skills can hold.
For example, as a fresh graduate I learned CUDA to accelerate some mathematical equations over a couple of months, although I have an ECE background.
So what positions or jobs could I pursue that have good potential in the future?