Templates for CUBLAS

While rewriting a double-precision function as a template for __half, float and double (and asking ChatGPT along the way), I noticed that I could wrap hgemm, sgemm and dgemm in a generic gemm interface that selects the correct function at compile time.


#include <cublas_v2.h>

// General template: declared only, so unsupported types fail at link time
template <typename T>
void gemm(cublasHandle_t handle, int m, int n, int k, 
          const T* A, const T* B, T* C, 
          T alpha = 1.0, T beta = 0.0);

// Specialization for float (sgemm)
template <>
void gemm<float>(cublasHandle_t handle, int m, int n, int k, 
                 const float* A, const float* B, float* C, 
                 float alpha, float beta) {
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 
                m, n, k, 
                &alpha, A, m, B, k, &beta, C, m);
}

// Specialization for double (dgemm)
template <>
void gemm<double>(cublasHandle_t handle, int m, int n, int k, 
                  const double* A, const double* B, double* C, 
                  double alpha, double beta) {
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 
                m, n, k, 
                &alpha, A, m, B, k, &beta, C, m);
}

Is there an open-source collection of templates for the cuBLAS API? CUTLASS is a separate implementation rather than a wrapper around cuBLAS. Note that the wrapper above moves the alpha and beta parameters to the end of the argument list; a more direct approach like the following, which keeps the cuBLAS argument order, would be appreciated too:

// Untested ChatGPT code
#include <cublas_v2.h>

template <typename T>
struct CUBLASGEMM;

template <>
struct CUBLASGEMM<float> {
    static constexpr auto gemm = cublasSgemm;
};

template <>
struct CUBLASGEMM<double> {
    static constexpr auto gemm = cublasDgemm;
};

template <>
struct CUBLASGEMM<__half> {
    static constexpr auto gemm = cublasHgemm;
};

template <typename T>
void gemm(cublasHandle_t handle, 
          cublasOperation_t transA, cublasOperation_t transB,
          int m, int n, int k, 
          const T* alpha, const T* A, int lda,
          const T* B, int ldb, 
          const T* beta, T* C, int ldc) {
    CUBLASGEMM<T>::gemm(handle, transA, transB, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}

0 Comments

2025/02/03
13:53 UTC

Cuda strange behaviour on colab

(This is cross-posted from here)

Hello, while testing the most elementary kernel on Colab, I got a surprise:

First, after choosing the T4 GPU runtime,

!nvcc --version

returns

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

Then after

!pip install nvcc4jupyter

and

%load_ext nvcc4jupyter

the following

%%cuda
#include <stdio.h>
#include <stdlib.h>

__global__ void hello() {
    printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main() {
    cudaError_t err = cudaSuccess;
    hello<<<2, 2>>>();
    // Check for launch errors before waiting for the kernel to finish
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    cudaDeviceSynchronize();
}

returns

Failed to launch kernel (error code the provided PTX was compiled with an unsupported toolchain.)!

I might well have missed something elementary, but I can't see what.

I'd be grateful for any hint ...

(Note: googling the error message, I found some threads here and there claiming that the problem comes from an incompatibility between the CUDA toolkit version and the GPU driver, but I would not expect Colab to be in such an inconsistent state.)

0 Comments

2025/02/03
09:10 UTC

Installing older CUDA version on newer version of linux?

I have an NVIDIA GeForce GTX 1050 Ti (laptop) and I'm using Linux Mint 22, which is based on Ubuntu 24.04. Apparently the maximum version of CUDA my driver can handle is 11.8, which doesn't have an Ubuntu 24.04 version. Is it still possible to install the CUDA toolkit in these circumstances? How would I go about it?

8 Comments

2025/02/02
19:54 UTC

Does anyone know how to force my gpu to use fp16

I'm trying to use an AI voice cloning program, and my GPU is giving me this error:

CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

I can't get my GPU to use fp32 for some reason. It's an overclocked EVGA GeForce GTX 970 SC ACX 2.0 GAMING 4GB. Also, ignore the title: I meant to ask how to get it to use fp32. That's my bad.

7 Comments

2025/02/02
14:55 UTC

CUDA + multithreading

I am working on a C++ framework for neural network computation as a university project, specifically for MNIST. I implemented every matrix operation I need (matmul, convolution, etc.) as a CUDA kernel, which, according to my benchmarks, significantly improved performance. In each benchmark I process 128 images sequentially (batch size 128). Now I am wondering: is it possible to process the images in parallel on multiple CPU threads, in combination with my functions that launch the CUDA kernels?

So I want to start e.g. 16 CPU threads, each computing one image at a time by calling the different matrix operations; once a thread is done, it starts computing its next image. With my batch size of 128, each thread would process 8 images.

Can I simply launch CPU threads that call the different CUDA functions, or will I run into problems with the CUDA runtime or with memory management?

8 Comments

2025/02/01
20:50 UTC

Pipelines and Buffers

Hi!
What is the best method to organize multiple layers of pipelines and buffers on the device?
Inside each pipeline are some graph or kernel calls; the buffers are memory allocated on the device.
As I see it, I should create a cudaStream_t for each pipeline and somehow make them wait for each other.

How would you organize the objects for this task?

Is there any well-known method to solve this problem?

Thank you for answers!

4 Comments

2025/02/01
11:22 UTC

How is synchronization implemented between the host and device in CUDA code?

Although I am new to GPU programming, I am quite familiar with multithreading on the CPU. I am curious how CUDA informs the waiting CPU thread that a kernel has completed.

For example in a program to compute the sum of two vectors, the CUDA code is expressed as:

void vecAdd(float* A, float* B, float* C, int n) {

// Copy the operands A and B to the CUDA device

// Launch the kernel function on the device to compute the vector sum

// ------ HOW DOES THE CPU KNOW WHEN TO EXECUTE THE NEXT INSTRUCTION -------

// Copy the result C from device to the host

// Free device memory for A, B, C

}

If I were to think of concurrent CPU code to achieve this, I would launch a number of threads from my main program and perform the independent operations on each of them. They would then signal completion through some sort of synchronization primitive - possibly through a shared counter variable and a condition variable shared between the worker threads and the main thread. There are of course downsides to this approach (sharing a variable across multiple cores causes cache invalidations and throttles progress).
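
For reference, the CPU-side scheme described above (worker threads, a shared counter, and a condition variable) might be sketched roughly as follows. This is only the baseline the question compares against; it says nothing about how CUDA itself signals completion. The name vecAddThreads and the chunking strategy are illustrative assumptions, not something from the post.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// CPU-only vector add (hypothetical helper): each worker sums one contiguous
// chunk, then decrements a shared counter; the last worker to finish wakes
// the main thread, which is blocked on a condition variable.
std::vector<float> vecAddThreads(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 unsigned numWorkers) {
    assert(a.size() == b.size() && numWorkers > 0);
    std::vector<float> c(a.size());
    std::mutex m;
    std::condition_variable done;
    unsigned remaining = numWorkers;  // shared completion counter, guarded by m

    const std::size_t chunk = (a.size() + numWorkers - 1) / numWorkers;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&, w] {
            const std::size_t begin = std::min(a.size(), w * chunk);
            const std::size_t end = std::min(a.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) c[i] = a[i] + b[i];
            std::lock_guard<std::mutex> lock(m);
            if (--remaining == 0) done.notify_one();  // signal completion
        });
    }

    {
        // Main thread sleeps here until every worker has reported in.
        std::unique_lock<std::mutex> lock(m);
        done.wait(lock, [&] { return remaining == 0; });
    }
    for (auto& t : workers) t.join();
    return c;
}
```

Calling vecAddThreads(a, b, 4) returns the element-wise sum once all four workers have signalled. Depending on the toolchain, the program may need to be built with thread support enabled (for example -pthread with GCC).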

I assume that there should be little to no inter core communication between the GPU cores. How is this synchronization achieved efficiently?

6 Comments

2025/02/01
02:37 UTC

CUDA ran out of memory when using cuDF

I am new to cuDF. When I load a CSV file using read_csv it works fine, but when I try to call df.corr() I get

 Call to cuMemcpyDtoH results in CUDA_ERROR_OUT_OF_MEMORY

I'm running it locally on my laptop with 6 GB of VRAM. Is there any workaround, such as processing the data in smaller pieces or using the CPU and system memory as well?

5 Comments

2025/01/31
19:39 UTC

NVIDIA's paid CUDA courses for FREE (limited period)

NVIDIA has announced free access (for a limited time) to its premium courses, each typically valued between $30-$90, covering advanced topics in Generative AI and related areas.

The major courses made free for now are:

Retrieval-Augmented Generation (RAG) for Production: Learn how to deploy scalable RAG pipelines for enterprise applications.
Techniques to Improve RAG Systems: Optimize RAG systems for practical, real-world use cases.
CUDA Programming: Gain expertise in parallel computing for AI and machine learning applications.
Understanding Transformers: Deepen your understanding of the architecture behind large language models.
Diffusion Models: Explore generative models powering image synthesis and other applications.
LLM Deployment: Learn how to scale and deploy large language models for production effectively.

Note: There are redemption limits on these courses. Each user can enroll in only one course.

Platform Link: NVIDIA TRAININGS

23 Comments

2025/01/29
06:49 UTC

DeepSeek's multi-head latent attention and other KV cache tricks explained

https://preview.redd.it/f2rkjiicatfe1.png?width=2076&format=png&auto=webp&s=dd2bab3ecf002b8616e06905b2fc1e3de82fb299

We wrote a blog post on MLA (used in DeepSeek) and other KV cache tricks. Hope it's useful for others!

4 Comments

2025/01/28
22:34 UTC

In your opinion, what is the hardest part about writing CUDA code?

For example, avoiding race conditions, picking the best block/grid size, etc.
As a follow up, what changes would you make to the CUDA language to make it easier?

5 Comments

2025/01/27
18:34 UTC

DeepSeek Inter-GPU communication with warp specialization

I'm particularly interested in this paragraph from the DeepSeek-V3 paper:

In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs

I didn't even realize that NVIDIA offers primitives for handling NVLink/IB sending within kernels in a warp-specialized manner. I always thought it was an API call you make on the host. How do they accomplish this/is there NVIDIA documentation on how to do things like this?

9 Comments

2025/01/25
16:18 UTC

How to check algorithmic correctness | Unit tests

Hi,

I usually use CPU computations to check whether the corresponding CUDA kernel is correct. I'm writing a bunch of parallel algorithms that seem to work correctly for small test inputs, but they fail for larger inputs. This happens even for a very simple GEMM kernel. After some analysis I realized that the cause is that floating-point arithmetic is carried out slightly differently on the two devices, which results in significant error propagation for larger inputs.

How are unit tests written and algorithmic correctness verified in standard practice?

P.S. I use PyCUDA for host programming and Python for CPU output generation.

Edit: For GEMM kernels, I found that integer matrices cast to float32 work well as inputs, since there will be no error between the CPU and GPU outputs. But for kernels that involve some sort of division, this is no longer effective, as intermediate floating-point values cause the outputs to diverge.
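
One common way to deal with this kind of divergence is to compare against the CPU reference with a tolerance instead of expecting exact equality. Below is a minimal C++ sketch of such a check (the post itself uses Python and PyCUDA, where the same rule can be written just as easily). The names nearlyEqual and allClose and the default tolerances are illustrative assumptions; suitable tolerances depend on the kernel and on the input size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when |got - ref| <= atol + rtol * |ref|, the same form of test that
// numpy.allclose applies. The default tolerances here are illustrative only.
bool nearlyEqual(float got, float ref, float rtol = 1e-4f, float atol = 1e-6f) {
    return std::fabs(got - ref) <= atol + rtol * std::fabs(ref);
}

// Compare a result computed on the GPU against a CPU reference,
// element by element; fails on the first element outside the tolerance.
bool allClose(const std::vector<float>& got, const std::vector<float>& ref,
              float rtol = 1e-4f, float atol = 1e-6f) {
    assert(got.size() == ref.size());
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (!nearlyEqual(got[i], ref[i], rtol, atol)) return false;
    }
    return true;
}
```

The relative term lets the allowed error grow with the magnitude of the reference value, while the absolute term keeps the comparison meaningful when the reference is close to zero.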

1 Comment

2025/01/25
03:48 UTC

Is anyone else having issues with NVIDIA CUDA repository mirrors being temporarily out of sync?

I guess this would be specific to Singularity/Docker, but I assume other people here would have noticed if they were trying to build something.

0 Comments

2025/01/24
04:59 UTC

Really basic CUDA Python script doesn't work properly.

Basically, I just learned about NVIDIA CUDA and wanted to try creating a fast pixel-search Python script (I have a lot of use cases for this), so I created the script below with a little help from GitHub Copilot. The script works great, with a detection time under 1 ms, but for some reason every time I toggle the script the detection time increases, going from under 1 ms to 5 ms. I tried looking through this subreddit for a similar issue and couldn't find anything, so I'm wondering if anyone else knows why this is happening. I'm on an RTX 2060 notebook edition (laptop).

import cv2
import numpy as np
import keyboard
import mss
from timeit import default_timer as timer
import win32api, win32con
import time
from threading import Thread, Lock

# Constants
TARGET_COLOR = (0, 161, 253)  # BGR format
COLOR_THRESHOLD = 1
MIN_CONTOUR_AREA = 100
TOGGLE_DELAY = 0.3
MAX_CPS = 10

class GPUProcessor:
    def __init__(self):
        cv2.cuda.setDevice(0)
        self.stream = cv2.cuda_Stream()
        
        # Pre-allocate GPU matrices
        self.gpu_frame = cv2.cuda_GpuMat()
        self.gpu_hsv = cv2.cuda_GpuMat()
        
        # Pre-calculate color bounds
        self.target_bgr = np.uint8([[TARGET_COLOR]])
        self.target_hsv = cv2.cvtColor(self.target_bgr, cv2.COLOR_BGR2HSV)[0][0]
        self.lower_bound = np.array([max(0, self.target_hsv[0] - COLOR_THRESHOLD), 50, 50], dtype=np.uint8)
        self.upper_bound = np.array([min(179, self.target_hsv[0] + COLOR_THRESHOLD), 255, 255], dtype=np.uint8)

    def process_frame(self, frame):
        try:
            start_time = timer()
            
            self.gpu_frame.upload(frame)
            self.gpu_hsv = cv2.cuda.cvtColor(self.gpu_frame, cv2.COLOR_BGR2HSV)
            hsv = self.gpu_hsv.download()
            mask = cv2.inRange(hsv, self.lower_bound, self.upper_bound)
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            
            return contours, (timer() - start_time) * 1000
            
        except cv2.error as e:
            print(f"GPU Error: {e}")
            return [], 0

class State:
    def __init__(self):
        self.toggle = False
        self.running = True
        self.lock = Lock()
        self.last_toggle_time = 0
        self.last_click_time = 0

def click(x, y):
    win32api.SetCursorPos((x, y))
    win32api.mouse_event(win32con.MOUSEEVENTF_LEFTDOWN, x, y, 0, 0)
    win32api.mouse_event(win32con.MOUSEEVENTF_LEFTUP, x, y, 0, 0)

def keyboard_handler(state):
    while state.running:
        if keyboard.is_pressed('right shift'):
            with state.lock:
                current_time = time.time()
                if current_time - state.last_toggle_time > TOGGLE_DELAY:
                    state.toggle = not state.toggle
                    state.last_toggle_time = current_time
                    print(f"Detection {'ON' if state.toggle else 'OFF'}")
        elif keyboard.is_pressed('esc'):
            state.running = False
            break
        time.sleep(0.1)

def main():
    state = State()
    gpu_processor = GPUProcessor()
    
    sct = mss.mss()  # screen-capture object; grab() is a method of this object
    monitor_region = {"top": 314, "left": 222, "width": 986, "height": 99}
    
    keyboard_thread = Thread(target=keyboard_handler, args=(state,), daemon=True)
    keyboard_thread.start()
    
    print("Press Right Shift to toggle detection ON/OFF")
    print("Press ESC to exit")
    
    while state.running:
        with state.lock:
            if not state.toggle:
                time.sleep(0.01)
                continue
        
        screenshot = sct.grab(monitor_region)
        frame = np.array(screenshot)[:, :, :3]
        
        contours, process_time = gpu_processor.process_frame(frame)
        
        current_time = time.time()
        with state.lock:
            if contours and (current_time - state.last_click_time) > (1.0 / MAX_CPS):
                largest_contour = max(contours, key=cv2.contourArea)
                if cv2.contourArea(largest_contour) > MIN_CONTOUR_AREA:
                    M = cv2.moments(largest_contour)
                    if M["m00"] != 0:
                        cx = int(M["m10"] / M["m00"])
                        cy = int(M["m01"] / M["m00"])
                        screen_x = monitor_region["left"] + cx
                        screen_y = monitor_region["top"] + cy
                        
                        click(screen_x, screen_y)
                        state.last_click_time = current_time
                        print(f"Detection time: {process_time:.2f}ms | FPS: {1000/process_time:.1f}")

    keyboard.unhook_all()

if __name__ == "__main__":
    main()

2 Comments

2025/01/22
23:34 UTC

Using my laptop, without a NVIDIA GPU, what options do I have for compiling and running CUDA code?

I'm running Linux Ubuntu, but don't have a GPU that can run CUDA code. I have read somewhere that I can still compile CUDA programs, but won't be able to run them. What options do I have for running CUDA programs? I'm learning it for a university class, and want to practice CUDA programming. Cheap or free options are preferred. I want to know what my options are.

11 Comments

2025/01/22
17:52 UTC

Uninstall previous versions of CUDA

I recently downloaded CUDA 11.1 without updating my display drivers. CUDA 11.1 wasn't compatible with my Python project, so it is currently useless. Now I will upgrade to a newer version of the driver (from 457.34 to 566.36). This will definitely allow higher versions of CUDA. So how can I uninstall the previous version? My OS is Windows 11. I know we can have multiple CUDA versions installed, but they may cause path conflicts, so I'd prefer to uninstall the old version.

4 Comments

2025/01/22
14:40 UTC

Complex project ideas in HPC/CUDA

I am learning OpenMPI and CUDA in C++. My aim is to make a complex project in HPC, it can go on for about 6-7 months.

Can you suggest some fields in which there is still work to do or something that needs optimization?

Can you also suggest some resources to start the project?

We are a team of 5, so we can divide the workload also. Thanks!

9 Comments

2025/01/22
08:23 UTC

PCIe version and lanes used by RTX 3090 on my PC

I'm trying to figure out what PCIe version (3.0 vs 2.0 vs 1.0) and how many lanes (x16 vs x8 vs x4 vs x1) are actually used by my RTX3090 on my PC.

I have a Gigabyte Z490 motherboard with Intel i7-10700K.

I believe that my test commands are misreporting (false) results.

Here are the tests I did (on Debian 12).

1. Running sudo lspci -vvv | grep -i "LnkSta:" gave this output:

LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)

LnkSta: Speed 8GT/s (downgraded), Width x16 (downgraded)

LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)

LnkSta: Speed 8GT/s (downgraded), Width x16 (downgraded)

LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)

LnkSta: Speed 16GT/s (ok), Width x32 (ok)

LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)

LnkSta: Speed 8GT/s (ok), Width x16 (ok)

LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)

LnkSta: Speed 8GT/s (ok), Width x16 (ok)

LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)

2. Running sudo nvidia-smi -q | grep -i pcie -A 5 gives the following output:

PCIe Generation

Max : 3

Current : 1

Device Current : 1

Device Max : 4

Host Max : 3

3. Running the bandwidthTest from CUDA samples, I get:

./bandwidthTest/bandwidthTest

[CUDA Bandwidth Test] - Starting...

Running on...

Device 0: NVIDIA GeForce RTX 3090

Quick Mode

Host to Device Bandwidth, 1 Device(s)

PINNED Memory Transfers

Transfer Size (Bytes) Bandwidth(GB/s)

32000000 12.6

Device to Host Bandwidth, 1 Device(s)

PINNED Memory Transfers

Transfer Size (Bytes) Bandwidth(GB/s)

32000000 12.2

Device to Device Bandwidth, 1 Device(s)

PINNED Memory Transfers

Transfer Size (Bytes) Bandwidth(GB/s)

32000000 770.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

I will now try to install the card to the x8 socket to see if the bandwidthTest changes.

The 3rd test (bandwidth test) suggests that I get close to the PCIe 3.0 x16 maximum bandwidth (according to this), and definitely above the 8 GB/s maximum of PCIe 2.0 x16.

So, which is it? Any help is greatly appreciated!
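
As a rough sanity check on the numbers above, the theoretical per-direction bandwidth of a link can be computed from the per-lane transfer rate and the line encoding (8b/10b for PCIe 1.x and 2.0, 128b/130b from PCIe 3.0 onward). The sketch below does that arithmetic; laneGBs and linkGBs are hypothetical helper names, and 1 GB is taken as 10^9 bytes. Real transfers come in below these figures because of protocol overhead.

```cpp
#include <cassert>

// Usable bandwidth of one PCIe lane in GB/s (1 GB = 1e9 bytes), one direction.
// rawGTs is the transfer rate in GT/s; payloadBits/encodedBits is the
// fraction of transferred bits that carry data under the line encoding.
double laneGBs(double rawGTs, double payloadBits, double encodedBits) {
    return rawGTs * (payloadBits / encodedBits) / 8.0;
}

// Theoretical bandwidth of a whole link for PCIe generations 1 to 4.
double linkGBs(int generation, int lanes) {
    assert(generation >= 1 && generation <= 4 && lanes > 0);
    switch (generation) {
        case 1:  return lanes * laneGBs(2.5, 8.0, 10.0);      // 8b/10b
        case 2:  return lanes * laneGBs(5.0, 8.0, 10.0);      // 8b/10b
        case 3:  return lanes * laneGBs(8.0, 128.0, 130.0);   // 128b/130b
        default: return lanes * laneGBs(16.0, 128.0, 130.0);  // 128b/130b
    }
}
```

With these formulas, linkGBs(1, 16) is 4.0 GB/s, linkGBs(2, 16) is 8.0 GB/s, and linkGBs(3, 16) is about 15.75 GB/s. The measured 12.6 GB/s is above the Gen 1 and Gen 2 x16 ceilings and also above the roughly 7.9 GB/s ceiling of Gen 3 x8, so the transfer itself appears to have run at Gen 3 x16. One possible explanation for the "Current : 1" reading, which the post does not confirm, is that the link speed is lowered by power management while the GPU is idle.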

1 Comment

2025/01/21
16:04 UTC

Recommended "entry level" GPUDirect RDMA-compatible GPU?

I'm looking to buy a GPU to experiment with the GPUDirect RDMA framework with a ConnectX-5 NIC I have.

I'm looking to buy a used card because I don't want to drop thousands of dollars on a learning exercise. However, I've read on the internet that getting older cards with old versions of CUDA to work is painful. I was considering the RTX Quadro 4000, but are there better cards in terms of price and/or version compatibility?

7 Comments

2025/01/19
05:34 UTC

Parallel execution of tensorrt engine on jetson orin

I have two engines for two different DL models. I have created two contexts and run them on two different streams, but profiling shows no parallelism in kernel execution. How can I make these executions run in parallel, or overlap them with other CUDA operations?

2 Comments

2025/01/18
12:31 UTC

PyTorch not detecting GPU after installing CUDA 11.1 with GTX 1650, despite successful installation

My GPU is a GTX 1650, the OS is Windows 11, the Python version is 3.11, and the CUDA version is 11.1. I have installed the CUDA toolkit. When I execute the command nvcc --version, it shows the toolkit version as well. However, when I try to install the Torch version using the following command:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/cuda/11.1/torch_stable.html

I receive an error stating that it cannot find the specified Torch version (it suggests versions >2.0). While I can install the latest versions of Torch (2.x), when I run the following code:

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

It shows "cpu" instead of "cuda." Should I install a higher version of the CUDA toolkit? If so, how high can I go? I would really appreciate any help.

3 Comments

2025/01/18
06:29 UTC

LeetGPU – Write and execute CUDA on the web, no GPU required, for free

We found that there was a significant hardware barrier for anyone trying to learn CUDA programming. Renting and buying NVIDIA GPUs can be expensive, installing drivers can be a pain, submitting jobs can cause you to wait in long queues, etc.

That's why we built LeetGPU.com, an online CUDA playground for anyone to write and execute CUDA code without needing a GPU and for free.

We emulate GPUs on CPUs using two modes: functional and cycle accurate. Functional mode executes your code fast and provides you with the output of your CUDA program. Cycle accurate mode models the GPU architecture and provides you also with the time your program would have taken on actual hardware. We have used open-source simulators and stood on the shoulders of giants. See the help page on leetgpu.com/playground for more info.

Currently we support most core CUDA Runtime API features and a range of NVIDIA GPUs to simulate on. We're also working on supporting more features and adding more GPU options.

Please try it out and let us know what you think!

23 Comments

2025/01/17
20:56 UTC

Learn Nvidia tools for newbie

I've written a guide on how to use NVIDIA tools from scratch. Here are the contents:

Fix-Bug

Chapter01: Introduction to Nsight Systems - Nsight Compute

Chapter02: Cuda toolkit - Cuda driver

Chapter03: NVIDIA Compute Sanitizer Part 1

Chapter04: NVIDIA Compute Sanitizer Part 2

Chapter05: Global Memory Coalescing

Chapter06: Warp Scheduler

Chapter07: Occupancy Part 1

Chapter08: Occupancy Part 2

Chapter09: Bandwidth - Throughput - Latency

Chapter10: Compute Bound - Memory Bound

2 Comments

2025/01/16
04:38 UTC

Reset my PC, tried to download CUDA again, but it didn't work?

https://preview.redd.it/y9ry1d6ts6de1.png?width=1377&format=png&auto=webp&s=7c7be9cd4fb4cc3cf4df2a879d8c1d2e018e8ad2

https://preview.redd.it/8k0kh32xs6de1.png?width=973&format=png&auto=webp&s=e827c7946e33e40d677bac741e03e3d401b40120

I don't get it. I followed the same tutorial I followed back then, when it worked, but this time it's not working: it shows CUDA version 12.7, but I downloaded CUDA version 12.4.

4 Comments

2025/01/15
16:50 UTC

Switched over from A100 GPU environment to H100 vGPU environment and performance is unusable

Clearly, something is wrong with my environment, but I have no idea what it is. I am using a docker container with cuda 11.8 and pytorch 2.5.1.

Setting my device to cuda renders my models unusable. It is extremely slow. It runs faster using the cpu. Running the exact docker image something that took 15 seconds in the A100 environment takes multiple hours in the new H100 environment. I've confirmed the Nvidia driver version on the host (550) and that cuda is available via torch and that torch sees the correct available device. I've reinstalled all libraries many times. I've tried different images (latest one I tried is the official pytorch 2.5.1 image with cudnn9 runtime). I will reinstall the nvidia driver and the nvidia container toolkit next to see if that fixes things, but if it doesn't I am at a loss of what to try next.

Does anyone have any pointers for this? If this is the wrong place to ask for assistance I apologize and would love to know a good place to ask. Thanks!

13 Comments

2025/01/15
02:53 UTC

Which Cuda version to use 😭😭

I have a 4060 and I want to use CUDA for my neural network. Can anyone tell me which CUDA version to use, and which cuDNN and TensorFlow versions go with it?

9 Comments

2025/01/10
15:19 UTC