/r/ROCm

The ROCm Platform brings a rich foundation to advanced computing by seamlessly integrating the CPU and GPU with the goal of solving real-world problems. This software enables the high-performance operation of AMD GPUs for computationally-oriented tasks in the Linux operating system.

1,878 Subscribers

3

ROCm help

I want to start making a personal AI using ROCm, but every time I try, either it never detects my GPU or, when it does, it just doesn't work. I have tried YouTube videos and following the AMD website and wiki, but still no luck. Mainly I wanted to know if someone knows of a better guide or solution to get it working.

20 Comments
2024/12/01
12:54 UTC

4

Can i use my RX6600 GPU for machine learning?

Any help would be great.

Please suggest a better RDNA3 GPU that is compatible with ROCm 6.3.

7 Comments
2024/11/27
03:43 UTC

0

Has ROCm 6.3 deprecated 7900 GPUs?

I saw some news about ROCm 6.3 recently and decided to check the support matrix, such as it is. From what I can see here: https://rocm.docs.amd.com/en/docs-5.3.3/release/gpu_os_support.html under the "GPU Support Table", it appears that the 7900-series GPUs are no longer supported. It's really rather surprising that they only appear to support gfx900, gfx906, gfx908, gfx90a, and gfx1030. Supported architectures are GCN5.0, GCN5.1, CDNA, CDNA2, and RDNA2. Is this a snapshot in time with RDNA3 / gfx1100 support still to come, or are they already deprecated?

Am I sending the 7900 GRE that I asked Santa for back unopened and buying Nvidia instead? I much prefer the open source approach, and that was guiding where I spend my money. Plus, in the long term ROCm looks to be more versatile, but if this is really their hardware support strategy, at the very least it's not for people like me.

9 Comments
2024/11/27
00:32 UTC

6

Can I use my 6800xt for machine learning in any way?

Hello, I'm trying to start a new machine learning/deep learning project for my resume, but I need to know if it's possible with my GPU.

20 Comments
2024/11/25
19:56 UTC

6

Can I use the Radeon 780M iGPU with PyTorch? I have a Ryzen 7 8845 laptop

It would be amazing if it's possible.

6 Comments
2024/11/24
15:17 UTC

5

PyTorch Model on Ryzen 7 7840U integrated graphics (780m)

Hello, is there any way I can run a YOLO model on my Ryzen 7840U integrated graphics? I think official support is limited to nonexistent, but I wonder if any of you know a way to make it work. I want to run YOLOv10 on it, and the iGPU seems really powerful, so it's a waste that I can't use it.

Thanks in advance!

6 Comments
2024/11/20
14:40 UTC

8

cheapest AMD GPU with ROCm support?

I am looking to swap my GTX 1060 for a cheap ROCm-compatible (for both Windows and Linux) AMD GPU. But according to this https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html , it doesn't seem that there is any cheap AMD GPU that is ROCm compatible.

44 Comments
2024/11/18
15:07 UTC

1

Tensorflow with Radeon 6700XT

Hello. I am trying to run some software that uses libtensorflow.so. It works fine with the CPU option. Someone managed to build this library with ROCm support, and it is working with a Radeon 7900XT. At first it printed an error saying that it was ignoring gfx1031, so I set HSA_OVERRIDE_GFX_VERSION=10.3.0, and then I got this error:

2024-11-17 18:40:59.363383: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-17 18:40:59.388308: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:920] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-17 18:40:59.426336: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:920] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-17 18:40:59.426398: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:920] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-17 18:40:59.426474: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:920] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-17 18:40:59.426527: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:920] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-17 18:40:59.426584: I external/local_xla/xla/stream_executor/rocm/rocm_executor.cc:920] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-11-17 18:40:59.426611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11220 MB memory:  -> device: 0, name: AMD Radeon RX 6700 XT, pci bus id:         0000:0a:00.0
2024-11-17 18:41:00.232319: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-11-17 18:41:02.060300: F ./tensorflow/core/kernels/conv_2d_gpu.h:708] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, NumThreads, TileLongSide, TileShortSide, conjugate>, total_tiles_count,     NumThreads, 0, d.stream(), input, input_dims, output)
Status: INTERNAL: Cuda call failed with 98
Received signal 6

Any idea what is missing? I am running latest rocm 6.2.4 on ubuntu 24.04

These are the steps that I followed: https://sadrastro.com/pixinsight-gpu-acceleration-for-amd/
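For anyone who wants to script the override rather than export it by hand, here is a minimal sketch. The gfx1031 to 10.3.0 pairing comes from the post above; the gfx110x to 11.0.0 entry is a commonly reported community workaround and should be treated as an assumption, not an official AMD table. The function names are hypothetical. The variable has to be set before the ROCm runtime is loaded, so this would run before importing TensorFlow or PyTorch.

```python
import os

# Community-reported fallbacks (assumption, not an official support table).
OVERRIDE_BY_PREFIX = {
    "gfx103": "10.3.0",  # e.g. gfx1031 presented to the runtime as gfx1030
    "gfx110": "11.0.0",  # e.g. gfx1103 presented to the runtime as gfx1100
}


def suggest_override(gfx_target):
    """Return a candidate HSA_OVERRIDE_GFX_VERSION value, or None if unknown."""
    for prefix, version in OVERRIDE_BY_PREFIX.items():
        if gfx_target.startswith(prefix):
            return version
    return None


def apply_override(gfx_target, env=os.environ):
    """Set the override in `env` unless the user already chose a value."""
    version = suggest_override(gfx_target)
    if version is not None:
        env.setdefault("HSA_OVERRIDE_GFX_VERSION", version)
    return version


if __name__ == "__main__":
    # Call this before importing tensorflow or torch.
    print(apply_override("gfx1031"))
```

Note that the override only changes which kernels the runtime selects; as the log above shows, a kernel launch can still fail afterwards.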

1 Comment
2024/11/18
14:06 UTC

4

12 years ago

5 Comments
2024/11/13
04:51 UTC

10

ROCm is very slow in WSL2

I have a 7900XT and, after struggling a lot, I managed to get PyTorch working in WSL2 so I could run Whisper. But it makes my computer very slow, and the performance is as bad as if I just ran it in Docker and let it use the CPU. Could this be related to amdsmi being incompatible with WSL2? The funny thing is that my computer's resources seem to be fine (except for 17 out of 20 GB of VRAM being consumed), so I don't really get why it is lagging.

14 Comments
2024/11/12
21:14 UTC

1

RVC/sovits in win10 in rocm?

Title mostly sums it up, I'd prefer sovits but I'm open to any decent alternatives

0 Comments
2024/11/09
17:27 UTC

8

rocm 6.2 tensorflow on gfx1010 (5700XT)

Doesn't ROCm 6.2.1/6.2.4 support gfx1010 hardware?

I get this error when running ROCm TensorFlow 2.16.1/2.16.2 installed via wheels from the official ROCm repo:

2024-11-09 13:34:45.872509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2306] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 5700 XT, pci bus id: 0000:0b:00.0) with AMDGPU version : gfx1010. The supported AMDGPU versions are gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942, gfx1030, gfx1100
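To check quickly whether a given card is in the list that a TensorFlow build was compiled for, the log line above can be parsed. This is a minimal sketch that assumes the message keeps exactly this wording; the pattern and function name are illustrative only.

```python
import re

# The log message from above, without the timestamp prefix.
LOG_LINE = (
    "Ignoring visible gpu device (device: 0, name: AMD Radeon RX 5700 XT, "
    "pci bus id: 0000:0b:00.0) with AMDGPU version : gfx1010. "
    "The supported AMDGPU versions are gfx900, gfx906, gfx908, gfx90a, "
    "gfx940, gfx941, gfx942, gfx1030, gfx1100"
)

PATTERN = re.compile(
    r"with AMDGPU version : (?P<arch>gfx\w+)\. "
    r"The supported AMDGPU versions are (?P<supported>.+)$"
)


def parse_ignored_device(line):
    """Return (detected_arch, supported_archs), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    supported = [item.strip() for item in match["supported"].split(",")]
    return match["arch"], supported


if __name__ == "__main__":
    arch, supported = parse_ignored_device(LOG_LINE)
    print(f"{arch} in supported list: {arch in supported}")
```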

I have tried these repos so far:
https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2/
https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2.3/

I'm running on Ubuntu 22.04.

Any ideas?

edit:
This is a real bummer. I've mostly supported AMD for the last 20 years, even though Nvidia is faster and has much better support in the AI field. After hearing that the gfx1010 would finally be supported (unofficially), I decided to give it another try. I set up a dedicated Ubuntu partition to minimize the influence of other dependencies... nope.

Okay, it's not the latest hardware, but I searched for some used professional AI cards to get better official support over a longer period while still staying in the budget zone. At work, I use Nvidia, but at home for my personal projects, I want to use AMD. I stumbled across the Instinct MI50... oh, nice, no support anymore.

Nvidia CUDA supports every single shitty consumer gaming card, and they even support them for more than 5 years.

Seriously, how is AMD trying to gain ground in this space? I have a one-to-one comparison. My laptop at work has some 5-year-old Nvidia professional gear, and I have no issues at all: no dedicated Ubuntu installation, just the latest Pop!_OS and that's it. It works.

If this is read by an AMD engineer: you've just lost a professional customer (I'm a physicist doing AI-driven science) to Nvidia. I will buy Nvidia also for my home project - and I even hate them.

22 Comments
2024/11/09
13:13 UTC

22

Liger Kernel v0.4.0 Unleashes the Power of AMD GPUs for LLMs (Benchmark included)

TL;DR:

- Faster training: Up to 26% faster multi-GPU training throughput!

- Reduced memory usage: Train larger models and use bigger batch sizes with up to 60% memory reduction.

- Longer context lengths: Explore new possibilities with support for up to 8x longer context lengths.

This is a game-changer for anyone training LLMs on AMD hardware. Liger Kernels, built on Triton, are really pushing the boundaries of what's possible.

Check out the benchmarks and release notes here:

- Benchmark blog post: https://embeddedllm.com/blog/cuda-to-rocm-portability-case-study-liger-kernel
- v0.4.0 release: https://github.com/linkedin/Liger-Kernel/releases/tag/v0.4.0

https://preview.redd.it/5u6j5554dpzd1.png?width=4031&format=png&auto=webp&s=cca2c55bb3fc585084ebab2aa5c3af2db1b7fe95

1 Comment
2024/11/08
16:16 UTC

2

What's the best way to learn how to use ROCm and its tools/features?

I'm a noob at AI but want to learn the ROCm tool set. Any advice?

3 Comments
2024/11/05
05:48 UTC

4

Use ROCm for machine learning projects on a mobile RX 6700S?

Hello, I'm currently using an AMD G14 with a RX 6700S GPU and I am interested in running some machine learning projects. I am currently using Windows.

Is there any way for me to use the RX 6700S GPU to run machine learning projects that use TensorFlow and PyTorch on Windows? If not, can I run them using WSL?

I am not that familiar with installations yet so if you can give me some detailed answers or instructions I would really appreciate it.

Thank you!

6 Comments
2024/11/03
10:55 UTC

11

Improving Poor vLLM Benchmarks (w/o reproducibility, grr)

This article popped up in my feed https://valohai.com/blog/amd-gpu-performance-for-llm-inference/ and besides having poorly labeled charts and generally being low effort, the lack of reproducibility is a bit grating (not to mention that they title their article a "Deep Dive" but publish... basically no details). They have an "Appendix: Benchmark Details" in the article, but it lacks any of the software versions or settings they used to test. Would it kill them to include a few lines of additional details?

Anyway, one thing that's interesting about a lot of these random benchmarks is that they're pretty underoptimized:

Metric                            My MI300X Run        MI300X          H100
Successful requests                        1000          1000          1000
Benchmark duration (s)                    17.35         64.07        126.71
Total input tokens                      213,652       217,393       217,393
Total generated tokens                  185,960       185,616       185,142
Request throughput (req/s)                57.64         15.61          7.89
Output token throughput (tok/s)       10,719.13      2,896.94      1,461.09
Total Token throughput (tok/s)        23,034.49      6,289.83      3,176.70
Time to First Token (TTFT)
Mean TTFT (ms)                         3,632.19      8,422.88     22,586.57
Median TTFT (ms)                       3,771.90      6,116.67     16,504.55
P99 TTFT (ms)                          5,215.77     23,657.62     63,382.86
Time per Output Token (TPOT)
Mean TPOT (ms)                            72.35         80.35        160.50
Median TPOT (ms)                          71.23         72.41        146.94
P99 TPOT (ms)                             86.85        232.86        496.63
Inter-token Latency (ITL)
Mean ITL (ms)                             71.88         66.83        134.89
Median ITL (ms)                           41.36         45.95         90.53
P99 ITL (ms)                             267.67        341.85        450.19

On a single HotAisle MI300X I ran a similar benchmark_serving.py benchmark on the same Qwen/Qwen1.5-MoE-A2.7B-Chat model they use and improved request and token throughput by 3.7X and lowered mean TTFT by 2.3X, while keeping TPOT and ITL about the same, without any additional tuning.
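As a sanity check on those ratios, here is a minimal sketch that recomputes them from the figures in the table above (first value is my MI300X run, second is the article's MI300X run). The dictionary keys and helper name are just for illustration.

```python
# Figures copied from the table: (my MI300X run, article's MI300X run).
RESULTS = {
    "request_throughput_req_s": (57.64, 15.61),
    "output_token_throughput_tok_s": (10719.13, 2896.94),
    "total_token_throughput_tok_s": (23034.49, 6289.83),
    "mean_ttft_ms": (3632.19, 8422.88),
}


def improvement(metric):
    """Return how many times better my run is than the article's run."""
    mine, theirs = RESULTS[metric]
    # Latencies (keys ending in _ms) improve when they shrink;
    # throughputs improve when they grow.
    if metric.endswith("_ms"):
        return theirs / mine
    return mine / theirs


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: {improvement(name):.2f}x")
```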

This was using a recent HEAD build of ROCm/vLLM (0.6.3.post2.dev1+g1ef171e0) and using the best practices from the recent vLLM Blog article and my own vLLM Tuning Guide.

So that anyone can replicate my results, here are my serving settings:

VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Qwen/Qwen1.5-MoE-A2.7B-Chat --num-scheduler-steps 20 --max-num-seqs 4096

And here's how I approximated their input/output tokens (such weird numbers to test):

python benchmark_serving.py --backend vllm --model Qwen/Qwen1.5-MoE-A2.7B-Chat  --dataset-name sonnet  --num-prompt=1000 --dataset-path="sonnet.txt" --sonnet-input-len 219 --sonnet-output-len 188

(that wasn't so hard to include was it?)

2 Comments
2024/11/02
18:01 UTC

1

Fedora 41 + ROCm (dkms) compatibility

Hey folks, do you know whether amdgpu dkms will work on the latest Fedora 41?

I guess it will not, because Fedora 41 has kernel 6.11, but I just want to make sure. I have an AMD MI100 and unfortunately it requires amdgpu dkms to work. So maybe someone has already tried to install it?

I saw this issue https://github.com/ROCm/ROCm/issues/3870

but maybe you have more information.

8 Comments
2024/11/01
23:05 UTC

19

ROCm 6.2 for Radeon gpus

https://community.amd.com/t5/ai/new-amd-rocm-6-2-for-radeon-gpus-delivers-performance-amp/ba-p/715854

Triton beta support. Official support for stable diffusion 2.1

Flash attention 2

6 Comments
2024/11/01
14:37 UTC

3

Trying to install SD webui with zluda but having issues with webui-user.bat

Used this guide as a basis for installing SD webui: https://youtu.be/n8RhNoAenvM?si=nEXr1st0I33TR3wW

Yet when I open webui-user.bat and it attempts to open the webui in my browser, it craps out after the ONNX check, giving me an Exception Code: 0xC0000005. I currently don't have the full error strings, beyond seeing a zluda DLL, a bunch of ROCm 6.1 DLLs, and some Python DLLs being listed.

Currently using a 6700xt, python 3.10.06, ROCm 6.1, and the latest zluda release.

4 Comments
2024/11/01
11:27 UTC

3

Is there a working version of flash attention 2 for AMD MI50/MI60 (gfx906, Vega 20 chip)?

Hi everyone,

I have been trying to install flash attention 2 to work with my 2x MI60 GPUs. However, I was not successful in finding a correctly working version. Here is what I tried.

I compiled https://github.com/ROCm/flash-attention.git (v2.6.3) successfully on my Ubuntu 22.04.5 LTS (x86_64). By default, gfx906 is not officially supported. I changed file setup.py line 126 - added "gfx906" to allowed_archs. It took 2 hours to compile successfully. But it failed all the tests: pytest -q -s tests/test_flash_attn.py

Still, I tried to benchmark a single MI60. Benchmark worked fine: python benchmarks/benchmark_flash_attention.py

### causal=False, headdim=128, batch_size=16, seqlen=1024 ###
Flash2 fwd: 70.61 TFLOPs/s, bwd: 17.20 TFLOPs/s, fwd + bwd: 21.95 TFLOPs/s
Pytorch fwd: 5.07 TFLOPs/s, bwd: 6.51 TFLOPs/s, fwd + bwd: 6.02 TFLOPs/s
Triton fwd: 0.00 TFLOPs/s, bwd: 0.00 TFLOPs/s, fwd + bwd: 0.00 TFLOPs/s

If FA2 worked correctly, the above numbers would mean an almost 14x improvement in the fwd pass and about a 2.6x speedup in the bwd pass.
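For reference, here is a minimal sketch that derives those speedups from the benchmark lines printed above. It assumes the output keeps this exact "fwd / bwd / fwd + bwd" layout; the regex and function names are illustrative.

```python
import re

# The two relevant lines from the benchmark output above.
BENCH_OUTPUT = """\
Flash2 fwd: 70.61 TFLOPs/s, bwd: 17.20 TFLOPs/s, fwd + bwd: 21.95 TFLOPs/s
Pytorch fwd: 5.07 TFLOPs/s, bwd: 6.51 TFLOPs/s, fwd + bwd: 6.02 TFLOPs/s
"""

LINE_RE = re.compile(
    r"^(?P<impl>\w+) fwd: (?P<fwd>[\d.]+) TFLOPs/s, "
    r"bwd: (?P<bwd>[\d.]+) TFLOPs/s, "
    r"fwd \+ bwd: (?P<both>[\d.]+) TFLOPs/s$"
)
PASSES = ("fwd", "bwd", "both")


def parse_bench(text):
    """Map implementation name -> {pass name: TFLOPs/s}."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results[match["impl"]] = {key: float(match[key]) for key in PASSES}
    return results


def speedups(results, new="Flash2", base="Pytorch"):
    """Ratio of `new` over `base` throughput for each pass."""
    return {key: results[new][key] / results[base][key] for key in PASSES}


if __name__ == "__main__":
    for key, ratio in speedups(parse_bench(BENCH_OUTPUT)).items():
        print(f"{key}: {ratio:.2f}x")
```

Since the tests fail on this build, these ratios only describe throughput and say nothing about whether the results are numerically correct.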

Additionally, Triton does not work either, which is why the Triton numbers above are 0 (I have pytorch-triton-rocm 3.1.0).

I was curious and installed exllamav2 that can use FA2 for faster inference. Unfortunately, with FA2 enabled, exllamav2 for llama3 8b was outputting gibberish text. When I disabled FA2, the model was outputting text correctly but 2 times slower.

I also compiled aphrodite-engine (commit) and it worked fine without FA2 using gptq models. However, when I enabled FA2, it also outputted garbage text.

I also compiled the official FA2 repo (https://github.com/Dao-AILab/flash-attention.git) but it did not even run due to gfx906 not being in their support list (I could not find the code to bypass this requirement).

I have PyTorch version 2.6.0, ROCm version 6.2.4, Python 3.10.12, transformers 4.44.1.

Here is how I installed pytorch with ROCm:

python3 -m venv myenv && source myenv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2/
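After an install like this, a quick way to confirm that PyTorch actually sees the GPU is a check along these lines. This is a minimal sketch: the function name is hypothetical, and it relies on ROCm builds of PyTorch exposing the GPU through the regular torch.cuda API and reporting a HIP version in torch.version.hip.

```python
def describe_torch_backend():
    """Summarize whether this environment's PyTorch can see a GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip = getattr(torch.version, "hip", None)
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} (hip={hip}): no GPU visible"
    name = torch.cuda.get_device_name(0)
    return f"torch {torch.__version__} (hip={hip}): {name}"


if __name__ == "__main__":
    print(describe_torch_backend())
```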

My question is: has anyone been able to correctly compile FA2, or has there ever been a working version of FA2 for MI50/60? Since AMD manufactured these cards as server cards, I imagine they were used for training and inference of models at some point, but what was their use case if they did not support PyTorch libraries earlier?

Side note: I have working Python experience and am happy to look into modifying the ROCm FA2 repo if you could share some pointers on how to get started (which parts should I focus on for gfx906 architecture support?).

Thank you!

4 Comments
2024/10/31
14:45 UTC

1

7600S for Windows HIP SDK?

I have a CUDA application which I want to eventually run on an MI300X. It's being developed on Windows but also runs on Linux.

The easiest path for porting would be a laptop that's compatible with the Windows HIP SDK. The HIP SDK doesn't mention any Radeon mobile GPUs, but I'm wondering if anyone knows if they'd work. The 7600S is easiest for me to get. The 7600 (desktop) is supported.

0 Comments
2024/10/31
14:10 UTC

14

Llama 3.2 Vision on AMD MI300X with vLLM

Check out this post: https://embeddedllm.com/blog/see-the-power-of-llama-32-vision-on-amd-mi300x

https://reddit.com/link/1ggb4a0/video/s8j3n06sh2yd1/player

The ROCm/vLLM fork now includes experimental cross-attention kernel support, essential for running Llama 3.2 Vision on MI300X.

This post shows you how to run Meta's Llama 3.2-90B-Vision-Instruct model on an AMD MI300X GPU using vLLM. We provide Docker commands, code, and a video demo to get you started with image-based prompts.

1 Comment
2024/10/31
10:18 UTC

21

Llama 3.1 Inference on AMD MI300X GPUs: A Technical Guide with vLLM (With benchmark)

Check this out on vLLM Blog:
https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html

This post provides a deep dive into optimizing vLLM for inference of Llama 3.1 models on AMD's MI300X GPUs. We explore key parameters and techniques to maximize throughput and minimize latency.

Key Results:

  • vLLM: 1.5x higher throughput and 1.7x faster TTFT than Text Generation Inference (TGI) for Llama 3.1 405B; 1.8x higher throughput and 5.1x faster TTFT for Llama 3.1 70B.

https://preview.redd.it/p0mw0b8kf2yd1.jpg?width=2486&format=pjpg&auto=webp&s=d0527f134bd24dd2129397e43d6f22d0802d3e1b

The post studies 9 parameters:

  1. Chunked Prefill: Disable this on MI300X in most cases for better performance.
  2. Multi-Step Scheduling: Set --num-scheduler-steps between 10 and 15 to optimize GPU utilization.
  3. Prefix Caching: Combine with chunked prefill cautiously, considering the caching hit rate.
  4. Graph Capture: For long context models, set --max-seq-len-to-capture to 16384, but monitor for potential performance degradation.
  5. AMD-Specific Optimizations: Disable NUMA balancing and tune NCCL_MIN_NCHANNELS.
  6. KV Cache Data Type: Use the default setting to match the model's data type.
  7. Tensor Parallelism: Adjust based on your throughput vs. latency requirements.
  8. Maximum Number of Sequences: Increase --max-num-seqs (e.g., to 512 or higher) to improve resource utilization.
  9. Use CK Flash Attention: Prioritize the CK Flash Attention implementation for significant speed gains.
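
To make the flag-based items above concrete, here is a minimal sketch that assembles a `vllm serve` command line from them. It only uses the three flags named in the list; the default values follow the ranges given there, and the model id `org/model-name` and the function name are placeholders, not anything from the blog post.

```python
import shlex


def build_serve_command(
    model,
    num_scheduler_steps=10,       # item 2: between 10 and 15
    max_num_seqs=512,             # item 8: 512 or higher
    max_seq_len_to_capture=None,  # item 4: 16384 for long-context models
):
    """Return the argv list for a `vllm serve` invocation."""
    args = [
        "vllm", "serve", model,
        "--num-scheduler-steps", str(num_scheduler_steps),
        "--max-num-seqs", str(max_num_seqs),
    ]
    if max_seq_len_to_capture is not None:
        args += ["--max-seq-len-to-capture", str(max_seq_len_to_capture)]
    return args


if __name__ == "__main__":
    # "org/model-name" is a placeholder model id.
    cmd = build_serve_command("org/model-name", max_seq_len_to_capture=16384)
    print(shlex.join(cmd))
```

The remaining items (chunked prefill, prefix caching, NUMA balancing, NCCL_MIN_NCHANNELS, tensor parallelism, CK Flash Attention) are left out because the list does not name a flag or value for them.
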
8 Comments
2024/10/31
10:05 UTC

8

Any improvements after OpenAI started using AMD?

Recently stumbled upon this article https://www.amd.com/en/newsroom/press-releases/2024-5-21-amd-instinct-mi300x-accelerators-power-microsoft-a.html and started wondering if anyone can see any improvements using AMD cards for deep learning, any sizeable improvements in ROCm stability for example, new features, performance etc.

Currently thinking to buy a bunch of 3090s, but wanted to understand if a couple AMD cards will be a potentially better investment for the next year/two.

3 Comments
2024/10/30
11:32 UTC

4

Help: I want to Use Stable Diffusion CLI with Zluda…

Hi everyone,

I’m currently working on a project based on Auto1111SDK, and I’m aiming to modify it to work with Zluda, a solution that supports AMD GPUs.

I found another project where this setup works: stable-diffusion-webui-amdgpu. This shows it should be possible to get Auto1111SDK running with Zluda, but I’m currently missing the know-how to adjust my project accordingly.

Does anyone have experience with this or know the steps necessary to adapt the Auto1111SDK structure for Zluda? Are there specific settings or dependencies I should be aware of?

Thanks a lot in advance for any help!

28 Comments
2024/10/29
22:16 UTC

3

installing ROCm on iGPU Ryzen 5 4500U (gfx909) (arch)?

Is it safe and possible to install ROCm on a Ryzen 5 4500U's iGPU (gfx909), and if so, with which versions?
There is no dedicated GPU in the laptop.
In the past, while it wasn't officially supported, it was possible to install it without even needing to compile it yourself. I do not remember which version of ROCm that was, however.

But now I wanted to reinstall it (I added a new SSD and switched from Mint to Arch, just because I like to try many options once in a while to see how far they have all come),
and when I visited the install page on the ROCm website, it showed a warning about installing it on systems with an iGPU, specifically saying that the iGPU needs to be disabled in the BIOS since it could otherwise cause instability and crashes.
The Arch wiki, however, doesn't seem to mention such a warning.

So this left me with multiple questions. Many of them are phrased as exact questions, but rough estimates are fine.

  1. Can ROCm be installed on that GPU, and specifically the newest version?
    1. Does it need to be manually compiled now?
    2. Up to what version can it be directly installed?
    3. Up to what version can it be installed with manual compilation?
  2. Do those mentioned instabilities still happen?
  3. Do those mentioned instabilities also happen on a system without a dedicated GPU?
  4. Are there custom versions aimed specifically at iGPUs, or recommended build/compile arguments to optimize it for an iGPU?
  5. What are those instabilities and crashes like?

I would be okay with it occasionally crashing, as long as it doesn't actually destroy my system or other projects I am working on, and a reboot at most is enough to get rid of the effects of a crash, but I prefer no crashes or very unlikely ones.
As for installing or compiling it, I am okay with either, though if there are specific arguments that make it work better or properly, those would be nice.
I just do not want to install it only to find that it makes the entire system unstable (kind of like NVIDIA drivers do, or at least did before I knew well enough to avoid NVIDIA like hell, since they are so closed source that you just can't use them properly). I know things can be fixed, but I prefer not to go looking for problems that might be easy to avoid.

6 Comments
2024/10/28
19:13 UTC

3

ROCm on RX 5700 XT / gfx1010 with pytorch ?

I'm new to using ROCm. I've been trying to get it working on RDNA1. However, the docs say there is no official support for gfx1010, even though I've come across ROCm/Tensile#1897 going with ROCm 6.2 on this thread. Does it really work, or do I have to use rocm_sdk_builder to build for a custom target such as gfx1010 and then build PyTorch from source for that custom ROCm?

Many Thanks.

12 Comments
2024/10/26
06:44 UTC

3

is it worth buying rx7800xt for rocm?

Is the 7800 officially supported by ROCm (Windows or Linux)? I want to try TensorFlow and AI art (Stable Diffusion etc.).

20 Comments
2024/10/25
09:23 UTC

2

Trying to gain an understanding on how to install it properly

I have a 6950 XT and Arch Linux. I already have it set up quite well for gaming and I don't want to botch what I already have. Would Docker be the appropriate solution to isolate any ROCm configurations from my gaming setup? Do I have that right?

5 Comments
2024/10/23
20:16 UTC

2

I unfortunately cannot for the life of me get the ROCm fork of koboldcpp working on Fedora 40. Can someone help?

I've been able to download ComfyUI and have it work with my 7900xtx, but for some reason koboldcpp keeps giving me the "ROCm error: no kernel image is available for execution on the device" error. I've tried messing with setting HIP devices and even turning off the integrated graphics of my 7800x3d in the BIOS, but nothing I do seems to work. From what I've gathered, it's not supposed to be that hard to get it up and running on Fedora, but I'm stumped. Can anyone give some guidance? I can provide any necessary terminal outputs and the like.

6 Comments
2024/10/23
19:35 UTC
