/r/CUDA
It seems only to have a repo for F39. I was wondering if I could use the local RPM or the .run file as an alternative, but I'm not entirely sure since they're probably both for F39 as well. Would appreciate any insights. Thanks!
Hello!!
I am trying to implement an algorithm that requires finding the row sums of a 2D matrix,
for example:
0 13 21 22 = 56
13 0 12 13 = 38
21 12 0 13 = 46
22 13 13 0 = 48
I am currently using atomicAdd, which takes a long time to compute:
__global__ void rowsum(int *d_matrix, int *d_sums, int n)
{
    long block_Idx = blockIdx.x + (gridDim.x) * blockIdx.y + (gridDim.y * gridDim.x) * blockIdx.z;
    long thread_Idx = threadIdx.x + (blockDim.x) * threadIdx.y + (blockDim.y * blockDim.x) * threadIdx.z;
    long block_Capacity = blockDim.x * blockDim.y * blockDim.z;
    long i = block_Idx * block_Capacity + thread_Idx;
    if (i < n)
    {
        d_sums[i] = 0; // Initialize the sum to 0
        for (int j = 0; j < n; ++j)
        {
            atomicAdd(&d_sums[i], d_matrix[i * n + j]);
        }
    }
}
Any help reducing the run time would be much appreciated.
Thanks
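One observation on the kernel above: thread i is the only writer of d_sums[i], so the atomic is not needed for correctness. Each atomicAdd is a read-modify-write on global memory, and the loop issues n of them per row. A minimal sketch of the alternative, written as host-side C++ so it compiles without a GPU: accumulate in a local variable (a register on the device) and store once. The function names here are hypothetical; in a kernel, the body of row_sum_one_thread would be the work of one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work done by the "thread" that owns row i: accumulate in a local variable
// and return the total, which is then written to the output exactly once.
int row_sum_one_thread(const std::vector<int>& matrix, std::size_t n, std::size_t i) {
    int sum = 0;  // local accumulator, no atomics involved
    for (std::size_t j = 0; j < n; ++j) {
        sum += matrix[i * n + j];
    }
    return sum;
}

// Host-side stand-in for the kernel launch: one call per row.
std::vector<int> row_sums(const std::vector<int>& matrix, std::size_t n) {
    assert(matrix.size() == n * n);
    std::vector<int> sums(n);
    for (std::size_t i = 0; i < n; ++i) {
        sums[i] = row_sum_one_thread(matrix, n, i);  // single store per row
    }
    return sums;
}
```

With one thread per row, neighbouring threads still read addresses n elements apart, so the loads are not coalesced. If the single-store version is still too slow, a common next step is to assign one warp or block per row and combine partial sums with a reduction.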
Hi All,
I'm looking for recommendations for a low-profile GPU to be used for parallel computing applications with CUDA. This GPU is to be installed in a Dell R540 server, which is a 2U rack-mounted server with no support for external power supplies to the GPU. I have been using an old NVIDIA Quadro NVS 295 and am ready to upgrade to something new with more CUDA capabilities. Appreciate everyone's insight!
I need to use the bitsandbytes package to run code that uses the Falcon7B model. I have installed CUDA, and my system has an NVIDIA RTX A6000 GPU and runs Windows 11.
Here is the code, it is just the importing section:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, GenerationConfig
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings
warnings.filterwarnings("ignore")
Here is the error:
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
RuntimeError: Failed to import transformers.training_args because of the following error (look up to see its traceback):
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
This error sometimes doesn't appear and the code works, but most of the time I get this error and I am unable to find a reliable fix. The error first appeared when CUDA wasn't installed on the system. It went away after installation, but when I ran the code again the next day, the same error appeared. Next I tried downgrading Python to a version below 3.11.1, and the code ran again after that. But today I am facing the same error again.
Here is my CUDA version:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
https://blog.tteles.dev/posts/gpu-tensorflow-pytorch-cuda-wsl/
I spent 2 days attempting to configure GPU acceleration for TF and PyTorch and condensed it into a 10 minute guide, where most of the time is spent on downloads. None of the guides I found online worked for me.
I'd be very happy to receive feedback.
I am currently using Dask and wanted to experiment with cuDF. I successfully installed everything in Ubuntu, but when I ran <conda create -n rapids-24.04 -c rapidsai -c conda-forge -c nvidia rapids=24.04 python=3.11 cuda-version=12.2> I realized my GTX 1080 Ti does not meet the required Compute Capability.
What is my best path forward? Give up and wait until I upgrade my GPU, or is it stable to work with an older version?
The woman conducting the interview was talking about quantum computing. After saying that I had no idea about quantum computing at all, I pointed out that it was not in the job description, to which she said that it was a requirement for the job. She got nervous instantly.
She couldn't explain whether the job required OpenAI's Triton or NVIDIA's Triton Inference Server.
Sorry, I just wanted to vent.
The NVIDIA documentation says the cuBLAS library uses column-major storage,
but I have a matrix:
1 2 3 4 5
6 7 8 9 10
...
21 22 23 24 25
in this kernel function:
// single thread prints the matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for (auto r = 0; r != 5; ++r)
    {
        for (auto c = 0; c != 5; ++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}
it should print 1 6 ... if the storage is column-major, but it still prints 1 2 3 4 5 ...
The complete code is here:
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cstdio>
#include <iostream>
#include <algorithm>
#include <numeric>

// single thread prints the matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for (auto r = 0; r != 5; ++r)
    {
        for (auto c = 0; c != 5; ++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}

int main()
{
    // test for cuBLAS matrix memory allocation
    const int n = 5*5;
    // matrix on host (a) and on device (d_a)
    int *a;
    int *d_a;
    a = new int[n];
    std::iota(a, a + n, 1);
    for (auto r = 0; r != 5; ++r)
    {
        for (auto c = 0; c != 5; ++c)
        {
            std::cout << a[(r)*5+(c)] << " ";
        }
        std::cout << std::endl;
    }
    cudaMalloc(&d_a, n*sizeof(int));
    cublasSetMatrix(5, 5, sizeof(int), a, 5, d_a, 5);
    printMatrixWithIndex<<<1, 1>>>(d_a, n);
    // wait for the kernel so its printf output is flushed before exit
    cudaDeviceSynchronize();
    // free resources
    cudaFree(d_a);
    delete[] a;
    return 0;
}
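A note on why the output does not change: cublasSetMatrix copies a rows x cols tile from host to device using the given leading dimensions, and with lda and ldb both equal to the row count the device buffer ends up in the same element order as the host buffer. "Column-major" describes how cuBLAS routines interpret that buffer, not a rearrangement performed during the copy. The kernel indexes with r*5+c, so it walks memory in the same order the host wrote it. A minimal host-side C++ sketch of the two interpretations of one buffer (the helper names are hypothetical, not part of cuBLAS):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major interpretation: ld is the number of columns.
inline std::size_t idx_row_major(std::size_t r, std::size_t c, std::size_t ld) {
    return r * ld + c;
}

// Column-major interpretation (the cuBLAS convention): ld is the number of rows.
inline std::size_t idx_col_major(std::size_t r, std::size_t c, std::size_t ld) {
    return c * ld + r;
}

// Read logical row r of a rows x cols matrix, treating buf as column-major.
std::vector<int> logical_row_col_major(const std::vector<int>& buf,
                                       std::size_t rows, std::size_t cols,
                                       std::size_t r) {
    assert(buf.size() == rows * cols && r < rows);
    std::vector<int> row(cols);
    for (std::size_t c = 0; c < cols; ++c) {
        row[c] = buf[idx_col_major(r, c, rows)];
    }
    return row;
}
```

For a buffer holding 1..25, the first logical row under the column-major reading is 1 6 11 16 21, which is what changing the kernel's index expression to a[c*5+r] would print. Put differently, a row-major matrix handed to cuBLAS unchanged is seen by cuBLAS as its transpose.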
This semester I have a parallel computing course, and I have to propose a project with a deadline of one month.
I am a backend engineer and have been working with servers since 2018, so currently I have no idea what to do or implement as my project. What are your ideas (ideally ones with the potential to become an academic paper)?
I have the correct GPU-enabled TensorFlow version for Windows (2.10) installed and verified with pip.
I have CUDA 11.2 installed. System path variable is set for "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin"
CUDNN installed with system path set as "C:\Program Files\NVIDIA\CUDNN\v8.1\bin".
I get
C:\Users\Anonymous>python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2024-04-21 15:27:31.033958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
The cudart64_110.dll is located in the path variable set -- C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin.
What gives? I'm about to move to PyTorch, but the example I coded uses TensorFlow, and I figured it wouldn't be this ridiculous. The GPU is an RTX 4070.
I seek guidance regarding the utilization of Nsight DL Designer on Linux. Despite successfully downloading the application, I encountered difficulties in executing it. Upon downloading the provided .run file, I performed the requisite steps of granting executable permissions using 'chmod +x' and subsequently executing it with './'. However, upon completion of this process, the application did not manifest itself, and subsequent attempts to execute './' merely resulted in the extraction process recurring.
I would appreciate assistance in resolving this matter. Thank you
I am trying to profile a process (by PID) that is running on the GPU, but I am not sure how to do it. The process is a roslaunch executable.
Hi. I'm trying to update the Vulkan texture from the CUDA kernel.
I have found the simpleVulkan example, which does the same but with a buffer. I adapted that approach for a texture image because I need to update a height map. The pitfall is image memory alignment (tiling was one too, but I changed it to linear). My question is how to take alignment into account when calculating pixel coordinates in the kernel. How can I know how many padding bytes were added by Vulkan? After each row? At the end of the whole image data? VkMemoryRequirements provides only the actual size of the image data and the alignment value, without any details.
In the case of my NVIDIA RTX A4500 it is added at the end of each row, but this was detected experimentally and I worry it is vendor specific.
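For linear-tiled images, Vulkan reports the row layout through vkGetImageSubresourceLayout, which fills a VkSubresourceLayout with offset and rowPitch (among other fields), so the padding does not have to be guessed per vendor. A minimal sketch of the address calculation, assuming those two values are queried on the host and passed to the kernel as parameters. The struct below is a hypothetical stand-in that mirrors only the fields used, so the example compiles without the Vulkan headers:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the VkSubresourceLayout fields used here.
struct LinearImageLayout {
    std::uint64_t offset;    // byte offset of the subresource in the bound memory
    std::uint64_t rowPitch;  // bytes between the starts of consecutive rows
};

// Byte offset of texel (x, y) in a linear-tiled 2D image.
// Any padding at the end of a row is absorbed by rowPitch.
std::uint64_t texel_byte_offset(const LinearImageLayout& layout,
                                std::uint64_t x, std::uint64_t y,
                                std::uint64_t bytes_per_texel) {
    // The texel must fit inside one row of the image.
    assert((x + 1) * bytes_per_texel <= layout.rowPitch);
    return layout.offset + y * layout.rowPitch + x * bytes_per_texel;
}
```

In the kernel, the same expression would replace y * width * bytes_per_texel + x * bytes_per_texel, applied to the base pointer of the imported memory.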
Hello, I’m a student doing a deep learning project, and due to hardware limitations I’m working on a computer in one of my university’s labs. Thus, I can’t do the usual CUDA installation. I’ve been trying to install it directly in my virtual environment, but nothing I’ve tried seems to work. Does anyone know a way to do this? The computer has an NVIDIA Quadro P6000. Thanks.
Hi folks. I want to read a large amount of data, in either CSV or Parquet format, in my CUDA C++ code. So far I haven't been able to figure it out or find a straightforward solution. Any suggestions are highly appreciated.
I want to do a project and write a paper on parallelising physics equations, such as the wave equation, using CUDA. Can anyone give me a head start? Many thanks.
I have been trying to install the CUDA toolkit and PyTorch, but I have been facing errors every time I try to download them. The latest version of PyTorch supports CUDA 12.1, but if I download CUDA 12.1, the NVIDIA 530 driver automatically gets installed and messes up my system (Ubuntu 22.04 LTS). PyTorch for CUDA 11.8 uses NVIDIA driver 525, which is not available for my system (even on those NVIDIA driver PPA websites). Is there a way that I can make CUDA 12.4, cuDNN 8.9, and PyTorch 2.2.2 work together?
Hi CUDA folks,
I would like to know what positions are open to people with CUDA C++ skills.
For example, as a fresh graduate I learned CUDA over a couple of months to accelerate some mathematical equations, although I have an ECE background.
So what positions or jobs could I pursue that have good potential in the future?
Is there a simple PTX parser that extracts a kernel name and kernel parameter types?
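For the simple case, a couple of regular expressions may be enough. Below is a rough sketch, not a real PTX parser: it assumes the text is shaped like typical nvcc output, where each kernel appears as `.entry name( .param .type name, ... )`. It ignores kernels declared without a parameter list, and it reports only the fundamental type of each parameter (for example `.u64` or `.b8`), dropping alignment and array sizes. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct KernelSignature {
    std::string name;
    std::vector<std::string> param_types;  // e.g. ".u64", ".f32", ".b8"
};

// Extract kernel names and parameter types from PTX text (sketch only).
std::vector<KernelSignature> parse_ptx_entries(const std::string& ptx) {
    // ".entry <name> ( <params> )"; the parameter list cannot contain ')'.
    static const std::regex entry_re(R"(\.entry\s+([\w$]+)\s*\(([^)]*)\))");
    // ".param [.align N] .<type>"; the optional alignment is skipped.
    static const std::regex param_re(R"(\.param\s+(?:\.align\s+\d+\s+)?(\.\w+))");

    std::vector<KernelSignature> kernels;
    for (std::sregex_iterator it(ptx.begin(), ptx.end(), entry_re), end; it != end; ++it) {
        assert(it->size() == 3);  // whole match plus two capture groups
        KernelSignature sig;
        sig.name = (*it)[1].str();
        const std::string params = (*it)[2].str();
        for (std::sregex_iterator p(params.begin(), params.end(), param_re); p != end; ++p) {
            sig.param_types.push_back((*p)[1].str());
        }
        kernels.push_back(sig);
    }
    return kernels;
}
```

The kernel name comes back mangled (for instance `_Z6rowsumPiS_i`), so recovering C++ parameter types would need a demangler on top of this; the PTX types only give the size and basic class of each argument.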
I need to compile a project that uses libtorch. Can I use the cudatoolkit (version 11.8) that was installed by conda? If so, how should I configure CMakeLists.txt? It seems that CMake looks for CUDA in the /usr directory even though the conda environment has been activated.
Hello,
I'm interested in learning how to implement an int8 matmul in CUDA. Could someone point me to a good implementation that I could study?
I have a project that uses CUDA to perform matrix-vector operations. It had been working fine, but since I updated Visual Studio 2022 to 17.9.6 (I don't know which version I updated from), my build fails and MSVC reports: the command "(long command)" exited with code 2. I have read other threads and tried changing the verbosity of MSVC and nvcc, but no errors are reported before this command runs and there seems to be no output. I tried running the command myself from the command prompt, but it produces nothing: no output, no exit code, no error, although there is a small delay as if it is doing something. I can run nvcc --version, and I have tried reinstalling CUDA.
I have tried to compile the project both from the command prompt and in Visual Studio, with no success. I downloaded a sample project and it has the same issue.
Hi all,
I am trying to implement a broadcast operation in CUDA which, given a tensor and an output shape, creates a new tensor with the output shape whose contents are a broadcast version of the original tensor.
E.g. input shape could be [4096, 1] and output shape could be [4096, 4096].
I have the following implementation currently. The issue with this approach is that I am doing 4096 * 4096 loads and 4096 * 4096 stores for my example when theoretically I should be only doing 4096 stores.
Is there a way to solve this with just 4096 stores?
I think the shufl instruction might help but I am not sure how to generalize it to arbitrary dimensions and strides.
Any other approaches or code pointers to existing implementations? Thanks
// dims/strides are passed as device arrays of length ndim, since
// std::vector cannot be used inside a kernel. Input and output are
// assumed to have the same rank, and the output to be contiguous.
__global__ void broadcast(const float *input_array,
                          float *output_array,
                          const int *input_dims,
                          const int *input_strides,
                          const int *output_dims,
                          const int *output_strides,
                          int ndim,
                          int num_output_elems) {
    int elem = blockIdx.x * blockDim.x + threadIdx.x;
    if (elem >= num_output_elems) return;

    // calculate the output coordinates to write to
    // and the input coordinates to read from
    int input_offset = 0;
    int output_offset = 0;
    for (int i = 0; i < ndim; i++) {
        int output_coord = (elem / output_strides[i]) % output_dims[i];
        // if input_dims[i] is 1, map to coordinate 0
        int input_coord = (input_dims[i] == 1) ? 0 : output_coord;
        input_offset += input_coord * input_strides[i];
        output_offset += output_coord * output_strides[i];
    }

    // load data
    float data = input_array[input_offset];
    // store data
    output_array[output_offset] = data;
}
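On the store count: a materialized [4096, 4096] output has 4096 * 4096 distinct locations, so writing it takes that many stores however the kernel is organized. The way to avoid them is not to materialize the output at all, and instead describe it as a view over the original buffer with stride 0 along every broadcast dimension. As far as I know this is how PyTorch's expand represents the result. A minimal host-side C++ sketch of the idea, with a hypothetical StridedView type and equal input and output rank assumed:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-owning strided view over a float buffer (hypothetical helper type).
struct StridedView {
    const float* data;
    std::vector<int> dims;
    std::vector<int> strides;  // in elements; 0 means "broadcast along this dim"

    // Element at the given coordinates.
    float at(const std::vector<int>& coords) const {
        assert(coords.size() == dims.size());
        std::ptrdiff_t offset = 0;
        for (std::size_t i = 0; i < dims.size(); ++i) {
            assert(coords[i] >= 0 && coords[i] < dims[i]);
            offset += static_cast<std::ptrdiff_t>(coords[i]) * strides[i];
        }
        return data[offset];
    }
};

// Build a broadcast view: same buffer, stride 0 wherever the input dim is 1
// and the output dim is larger. No elements are loaded or stored.
StridedView broadcast_view(const StridedView& in, const std::vector<int>& out_dims) {
    assert(in.dims.size() == out_dims.size());
    StridedView out{in.data, out_dims, in.strides};
    for (std::size_t i = 0; i < out_dims.size(); ++i) {
        if (in.dims[i] == 1 && out_dims[i] != 1) {
            out.strides[i] = 0;  // every coordinate maps to the single input element
        } else {
            assert(in.dims[i] == out_dims[i]);
        }
    }
    return out;
}
```

Downstream kernels then index through the view's strides, much like the offset loop in the kernel above does for the input. If a dense copy really is required, the stores cannot be reduced, but the loads can: each thread can read one input element and write a run of output elements from it.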
Hello, I am just doing some independent research. I was just curious how you, as CUDA developers/ enthusiasts, find CUDA overall in terms of usefulness? Thanks in advance.