/r/HPC

Multicore, cluster, and high-performance computing news, articles and tools.

"Anyone can build a fast CPU. The trick is to build a fast system." - Seymour Cray

13,469 Subscribers

1

Slurm 22 GPU Sharding Issues [Help Required]

Hi,
I have a Slurm 22 setup where I am trying to shard an L40S node.
For this I add the lines:
AccountingStorageTRES=gres/gpu,gres/shard
GresTypes=gpu,shard
NodeName=gpu1 NodeAddr=x.x.x.x Gres=gpu:L40S:4,shard:8 Feature="bookworm,intel,avx2,L40S" RealMemory=1000000 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 State=UNKNOWN

in my slurm.conf, and in the gres.conf of the node I have:

AutoDetect=nvml
Name=gpu Type=L40S File=/dev/nvidia0
Name=gpu Type=L40S File=/dev/nvidia1
Name=gpu Type=L40S File=/dev/nvidia2
Name=gpu Type=L40S File=/dev/nvidia3

Name=shard Count=2 File=/dev/nvidia0
Name=shard Count=2 File=/dev/nvidia1
Name=shard Count=2 File=/dev/nvidia2
Name=shard Count=2 File=/dev/nvidia3

This seems to work and I can get a job if I ask for 2 shards, or a gpu. However, the issue is after my job finishes, the next job is just stuck on pending (resources) until I do a scontrol reconfigure.

This happens every time I ask for more than 1 GPU. Secondly, I can't seem to book a job with 3 shards. That goes through the same pending (resources) issue but does not resolve itself even if I do scontrol reconfigure. I am a bit lost as to what I may be doing wrong or if it is a Slurm 22 bug. Any help will be appreciated.
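
One thing that can be checked offline is whether the per-device shard counts in gres.conf add up to the shard total declared on the NodeName line. The sketch below does only that arithmetic on the lines quoted above; the helper names are hypothetical, and it does not diagnose the pending (resources) behaviour itself. It also prints the largest per-GPU shard count, because if a job's shards have to come from a single GPU (an assumption, not verified here for Slurm 22), a 3-shard request could never fit on GPUs that expose 2 shards each.

```python
import re

# Config text copied from the post; the helper functions are hypothetical.
GRES_CONF = """\
Name=shard Count=2 File=/dev/nvidia0
Name=shard Count=2 File=/dev/nvidia1
Name=shard Count=2 File=/dev/nvidia2
Name=shard Count=2 File=/dev/nvidia3
"""
NODE_LINE = "NodeName=gpu1 Gres=gpu:L40S:4,shard:8"


def shard_counts(gres_conf: str) -> dict:
    """Map each device file to its shard Count from gres.conf text."""
    counts = {}
    for line in gres_conf.splitlines():
        fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        if fields.get("Name") == "shard":
            counts[fields["File"]] = int(fields["Count"])
    return counts


def declared_shards(node_line: str) -> int:
    """Shard total declared in the Gres= field of a NodeName line (0 if absent)."""
    match = re.search(r"shard:(\d+)", node_line)
    return int(match.group(1)) if match else 0


per_gpu = shard_counts(GRES_CONF)
total = sum(per_gpu.values())
print(f"per-GPU shards: {per_gpu}")
print(f"gres.conf total = {total}, slurm.conf declares {declared_shards(NODE_LINE)}")
print(f"largest per-GPU shard count = {max(per_gpu.values())}")
```

With the configuration quoted in the post the two totals agree, so a count mismatch is unlikely to be the cause.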

7 Comments
2024/12/02
12:35 UTC

2

SLURM Node stuck in Reboot-State

Hey,

I got a problem with two of our compute nodes.
I ran some updates and rebooted all Nodes as usual with:
scontrol reboot nextstate=RESUME reason="Maintenance" <NodeName>

Two of our nodes, however, are now stuck in a weird state.
sinfo shows them as
compute* up infinite 2 boot^ m09-[14,19]
even though they finished the reboot and are reachable from the controller.

They even accept jobs and can be allocated. At one point I saw this state:
compute* up infinite 1 alloc^ m09-19

scontrol show node m09-19 gives:
State=IDLE+REBOOT_ISSUED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A NextState=RESUME

scontrol update NodeName=m09-14,m09-19 State=RESUME
or
scontrol update NodeName=m09-14,m09-19 State=CANCEL_REBOOT
both result in
slurm_update error: Invalid node state specified

All slurmd are up and running. Another restart did nothing.
Do you have any ideas?

EDIT:
I resolved my problem by removing the stuck nodes from the slurm.conf and restarting slurmctld.
This removed the nodes from sinfo. I then re-added them as before and restarted again.
Their STATE went to unknown. After restarting the affected slurmd, they reappeared as IDLE.
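
For anyone who wants to spot this condition automatically, here is a minimal Python sketch that parses the State= field from scontrol show node output and flags nodes that report a usable base state while still carrying a reboot flag. The function names and the set of "usable" base states are assumptions made for illustration; the sample string is the one quoted above.

```python
def split_node_state(state: str):
    """Split a State= value such as 'IDLE+REBOOT_ISSUED' into (base, [flags])."""
    base, *flags = state.split("+")
    return base, flags


def reboot_looks_stuck(scontrol_output: str) -> bool:
    """True if the base state is usable but a REBOOT_* flag is still set."""
    fields = dict(tok.split("=", 1) for tok in scontrol_output.split() if "=" in tok)
    base, flags = split_node_state(fields.get("State", ""))
    usable = base in {"IDLE", "MIXED", "ALLOCATED"}  # assumed list, adjust as needed
    return usable and any(flag.startswith("REBOOT") for flag in flags)


sample = (
    "State=IDLE+REBOOT_ISSUED ThreadsPerCore=2 TmpDisk=0 Weight=1 "
    "Owner=N/A MCS_label=N/A NextState=RESUME"
)
print(reboot_looks_stuck(sample))
```

This only detects the symptom; it does not clear the flag.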

7 Comments
2024/12/02
12:06 UTC

4

IBM Cell processor vs Vector processor vs GPU

Where does the Cell processor fit in comparison to vector processors and GPUs?

1 Comment
2024/12/01
11:27 UTC

23

LCI Introductory HPC Workshop (OPEN)

Hello Everyone,

I hope each of you is having a great weekend. I wanted to share this since I haven't seen anyone make a post about it yet; the Linux Cluster Institute (LCI) is hosting an introductory workshop on HPC and registrations are now open.

  • Event: Linux Cluster Institute (LCI) Introductory Workshop on HPC
  • Dates: February 10th to 14th, 2025
  • Location: Mississippi State University, Starkville, MS

I think this is a great opportunity for those who are new or interested in learning HPC administration/engineering. Also, they have Powerpoints/Slides from previous workshops available in their Archive page if you want to learn at your own pace.

Thank you for your time and have a great day!

9 Comments
2024/11/30
21:09 UTC

0

Intel A580 Battlemage 11% Slower Than A770 Alchemist in Blender Benchmark! :)

0 Comments
2024/11/29
13:52 UTC

6

Can anyone share guidance on enabling NFS over RDMA on a CentOS 7.9 cluster

I installed it using the command ./mlnxofedinstall --add-kernel-support --with-nfsrdma and configured NFS over RDMA to use port 20049. However, when running jobs with Slurm, I encountered an issue where the RDMA module keeps unloading unexpectedly. This causes compute nodes to lose connection, making even ssh inaccessible until the nodes are restarted.

Any insights or troubleshooting tips would be greatly appreciated!

17 Comments
2024/11/29
07:08 UTC

1

Seeking Advice on Masters in HPC

Hello!

For some context, I've been looking into possibly pursuing a Masters Degree in HPC at the University of Edinburgh for the 2025-2026 school year. I recently graduated this May with a Bachelors in CS and really liked the topic, as some HPC concepts were taught and I want to dive into that field more. I've been working as an ML Engineer in the U.S. for a year and am a citizen here, so there's no concern about going out of the country to study for a year and coming back.

The program seems really good and it specifically covers topics only related to HPC, I've looked at some programs in the U.S. and the MSc programs are really general and broad (and basically undergrad courses for masters credit) with like 2 or 3 additional HPC focused classes. I also think it would be a great life experience to study abroad for a year as I've always been here in the U.S. which is something I'm grateful for.

I'm posting to seek any advice on this topic. With the degree, I hope to work at a company that does a lot of work at the application level, applying what I've learned to large clusters and things like that, as opposed to the HE side of things. I might be misguided in thinking that this specialization is highly valuable at companies. I'm wondering if people in the industry think this would be a good investment to make, whether it would be too hard to get a job back in the U.S., and whether there are any other considerations.

Here is also the program link for any interested: MSc HPC Edinburgh

1 Comment
2024/11/29
00:12 UTC

40

Slurm-web v4 is now available, discover the new features.

Rackslab is delighted to announce the release of Slurm-web v4.0.0, the new major version of the open source web interface for Slurm workload manager.

https://preview.redd.it/4ns8nuib8n3e1.png?width=750&format=png&auto=webp&s=60bc7102f36c8850c367385103a754591c5822a4

This release includes many new features:

  • Interactive charts of resources status and jobs queue in the dashboard
  • Add /metrics endpoint for integration with Prometheus (or any other OpenMetrics compatible solution)
  • Jobs status badges to visualize status of the job queue at glance and instantly spot possible jobs failures
  • Custom service messages on login form to communicate effectively with end users (ex: planned maintenances, ongoing issues, links to docs, etc…)
  • Get list of current jobs allocated on a specific node
  • Official support of Slurm 24.11

Many other minor features and bug fixes are also included, see the release notes for reference.

Popularity of Slurm-web is growing fast in the HPC & AI community, we are thrilled to see downloads are constantly increasing! We look forward to reading your feedback on these new features.

If you already used it, we also feel curious about the features you most expect from Slurm-web, please tell us in comments!

6 Comments
2024/11/28
13:24 UTC

20

SC24 post mortem

Ok, now that all the hoopla has died down, how was everyone’s show? Highlights? Lowlights? We had a few first timers post here before the show and I’d love to hear how things went for them.

26 Comments
2024/11/25
02:12 UTC

2

Inconsistent SSH Login Outputs Between Warewulf Nodes

I’m pretty new to HPC and not sure if this is the right place to ask, but I figured it wouldn’t hurt to try. I’m running into an issue with two Warewulf nodes on my cluster, cnode01 and cnode02. They’re both CPU nodes, and I’m accessing them from a head node.

Both nodes are assigned the same profile and container, but their SSH login outputs don’t match:

[root@ctl2 ~]# ssh cnode01

Last login: Thu Nov 21 20:03:25 2024 from x.x.x.x

[root@ctl2 ~]# ssh cnode02

warewulf Node: cnode02

Container: rockylinux-9-kernel

Kernelargs: quiet crashkernel=no net.ifnames=1

Last login: Thu Nov 21 20:07:18 2024 from x.x.x.x

I’ve rebuilt and reapplied overlays, rebooted the nodes, and checked their configurations using —everything seems identical. But for some reason, cnode01 doesn’t show container or kernel info during login. It’s not affecting functionality, but it’s bugging me :/

Any ideas on what might be causing this or what to check next?

Thanks!

3 Comments
2024/11/25
22:02 UTC

0

Review my Statement of Purpose!

I am applying to graduate school, and I am currently thinking I want to specialize in HPC. I will have 3 YOE by the time I join, I've worked in two major companies (one a very reputed American brand), and I wanted to get my Statement of Purpose reviewed by some professionals in the field. Please leave a comment if you can extend a helping hand for an honest review and I'll DM the document. Thanks!

3 Comments
2024/11/25
12:28 UTC

18

Job titles to look for in HPC/ Cluster Computing

This is a pretty dumb question, I am pretty lost when it comes to understanding how the industry works. So I apologize for that.

What job titles should I look for when applying for HPC jobs? I am a senior CS student with 2 years of HPC experience (student HPC Engineer) at my university's research supercomputer. I have an internship lined up for this coming summer as “Linux System Admin” at a decently sized company. It just seems like every company has the role titled differently even if they’re more or less the same thing, and I don’t know which positions I should be looking for. Also, from what I heard (I don’t know how credible it is), if I want to work in HPC my only real options are universities or a handful of larger companies.

Any help is greatly appreciated, thank you

Edit: I just wanted to again say thank you to everyone who replied. I truly enjoy working in HPC and up until making this post I thought I would probably have to leave the field once I graduated and left my student position. You all have given me new opportunities that I didn’t know existed. I will be applying for all of them in my spare time.

14 Comments
2024/11/24
23:34 UTC

12

Learning CUDA or any other parallel computing and getting into the field

I am 40 years old and have been working in C, C++ and Go. Recently, I got interested in parallel computing. Is it feasible to learn, and do I stand a chance of getting a job in the parallel computing field?

6 Comments
2024/11/23
14:59 UTC

2

Minimal head node setup on small cpu-only ubuntu cluster

So long story short, the team thought we were good to go with getting an easy8 license of BCM10... lo and behold, nvidia declined to maintain that program and Bright now only officially exists as part of their huge AI Enterprise Infra thing... Basically if you aren't buying armloads of Nvidia GPUs you don't exist to them anymore. Anyway, our trial period expired (sidenote, it turns out if that happens and you don't have a license, instead of just ceasing to function it nukes the whole cm directory on your head node).

BCM was nice but it was rather bloated for us. The main functionality I used was the software image system for managing node installation (all nodes were tftp booting bare metal ubuntu from the head node). I suppose it also kept the nodes in sync with the head node and we liked having a central place to manage category-level configs for filesystem mounting, networking, etc.

Would trying to stay with BCM even be a good idea for our use case? If not or if it's prohibitively expensive to do so, what's another route? OpenHPC isn't supported on ubuntu but if it's the only other option we can fork out for RHEL I suppose.

2 Comments
2024/11/23
02:34 UTC

6

Nvidia B200 overheating

5 Comments
2024/11/23
10:51 UTC

3

Accelerating: For Hardware Engineer's Perspective

I'm a first-year CPE student with a burning desire to accelerate AI. I'm fascinated by the intersection of hardware and software, and I'm keen to learn more about the specific skills and knowledge needed to succeed in this field.

What are some of the biggest challenges and opportunities in hardware acceleration today? What kind of projects or experiences would be beneficial for someone starting out? Any insights from experienced hardware engineers would be invaluable.

2 Comments
2024/11/22
05:18 UTC

8

Apple Silicon in the HPC world?

Do folks have thoughts or papers they can point me to that talk about HPC applications on Apple Silicon chips? The lower power profile and high memory bandwidth on the new M4 chips seem ripe for HPC environments. I've never done any HPC outside of academia and algorithmic applications, but I could imagine building a small cluster of Mac minis is probably pretty affordable for a lot of CPU-based use cases.

One huge caveat to this is GPGPU workloads. I don't think Macs have a great story for GPU programming yet, and I'm not sure what the cost/performance/energy tradeoffs for Apple Silicon chips vs something like an L40S would be.

10 Comments
2024/11/20
16:57 UTC

1

Panasas Active store support for RDMA (RoCE v2)

Hello, we are planning to upgrade the existing 10 Gb Ethernet network in our data center to utilize RDMA (RoCE v2) in order to reduce latency in the network. We have Panasas Active Store 16 storage systems, but these systems are no longer covered by VDURA (formerly Panasas) support, so we don't have contacts at VDURA to ask whether Panasas Active Store 16 systems support RoCE. If you have experience with Panasas storage, could you please confirm whether Panasas Active Store supports RoCE v2?

1 Comment
2024/11/19
06:12 UTC

2

Hpc computing of Fourier transform (FFT). Yay or nah project

Hey,

I've found some cool videos about the FFT, and being an HPC newbie, I was wondering about following these tutorials and applying some of my very limited knowledge of HPC and Python HPC techniques. This would actually be my first mathy and HPC project, and I was wondering if this could be a nice project to do? Like, resume worthy.

Thanks!

0 Comments
2024/11/19
03:01 UTC

14

Flux Framework - Tutorial Series 🚀

We are kicking off #SC24 with a Flux Tutorial series - Dinosaur Edition! 🥑 We didn't get an "official" tutorial, but guess what? This presented an opportunity - one to create a series of tutorials open to *everyone* across time and space. 🚀

Instead of re-posting all the content (and images) I'll provide a link to all the details here: 👉 https://bsky.app/profile/vsoch.bsky.social/post/3lbam473mtk2b

3 Comments
2024/11/18
19:34 UTC

5

What all skillset is expected from a fresher who is interested in HPC ? Any study path ?

12 Comments
2024/11/17
22:52 UTC

13

SCC @SC24 Betting Odds!

T-3 days to the start of the Student Cluster Competition. Let's do this, it's betting odds time.

... wait, where are the posters?

UNM HPC (University of New Mexico) 9-1

Newbies no longer, the University of New Mexico is returning for their second season in a row with all new faces other than who I can only imagine is the team leader. The team is prioritizing GPU optimizations: a tried-and-true strategy that many teams in the past have run. Let's see what kind of spin they can put on this plan to stand out. Also congrats on having an S-Tier state flag.

Gig-em Bytes (Texas A&M University) 10-1

Everything is bigger in Texas, and Texas is back in the big leagues. Represented this year by team Gig-em Bytes, who are flipping the script by utilizing LinkedIn Learning courses to become familiar with Linux. Wow this is really making me wish I had the team poster. 'grats on your promotion.

Clemson Cybertigers (Clemson University) 9-1

The Clemson Cybertigers are blowing UC San Diego out of the water with access to not just one, but an incredible four Raspberry Pi's. Sounds like someone read the betting odds last year :) Have team members not been undertaking specific benchmarks in the past? That's SCC 101!

Friedrich-Alexander-Universitat (Friedrich-Alexander University) 6-1

A team that comes with a rich history of SCC competition, Friedrich-Alexander University definitely sports the coolest team name. Can I get one of those umlauts? We've seen them place on the podium in the past, winning the (now defunct) HPCG category as recently as SC22. This is the underdog team to keep an eye on, so no need to be so camera-shy.

NTHU (National Tsing Hua University) 2-1

You can't get much more HPC than blue polos, and the National Tsing Hua University team members have one each. Loving the color coordination. Hao-Tien Yu shows us that he's not only got a GPU, but he knows how to use it. This team is a force to be reckoned with, sweeping the SC22 competition in Dallas. Betting on NTHU is like hitting on a soft 17: you hate doing it, but the casino does it so it's probably a good idea.

Team Diablo (Tsinghua University) 2-1

Hunh? Two Tsinghua teams this year? There must be some mistake, I need to get Stephen Leake on the phone. Correct me if I'm wrong, but this looks to be the first time both National Tsing Hua University (from Taiwan) and Tsinghua University (from China) are competing. Inside sources tell me that the SCC committee couldn't justify leaving one of them out this year. Bring a water bottle, because this is gonna get heated. One more thing, apparently Team Diablo is bringing a new compute-optimized, omnisciently-sentient, totally-not-proprietary LLM called DadFS to the competition this year!

NTU (Nanyang Technological University) 4-1

Look, NTU team, hear me out. If you're gonna name your server "Coffeepot", you might as well do the same for your team name. Maybe "Team Roasted" or something. Looking at Tsinghua, they have a cool team name and they win something every year. Nanya, I'm gonna call y'all Nanya, have put up solid results in the past. A sweep at SC17, Linpack at 18, tack on an HPCG in 19. What happened to the hot streak? Also, sorry, you have NVIDIA, AMD, and Super Micro as your hardware vendors? Two of those are redundant and I'm not gonna say which.

University of Helsinki/Aalto University 10-1

Finland is taking a cue from the notably absent Boston area team by combining multiple universities into one team. An exclusive interview with the Boston team captains a few years back revealed that this was done for practical purposes. I would love to hear why the Finnish teams decided to do the same (call me!). This is the first competition for all of the members, who come from a wide range of academic disciplines. Three cheers for the team to get to the Finnish line.

Team Triton LLC (Last Level Cache) (University of California, San Diego) 4-1

Fan favorite Team Triton are back again for the fourth year in a row, making it the most recent team to hit the record four years of back-to-back SCC appearances. During SC23, they were expected to place on the podium, but unfortunately it did not work out for them! Word on the street is that Team Triton hosted the Single Board Cluster Competition this past year in their home stadium, which was a smash hit. Will their knowledge of hosting competitions also translate to points while competing?

Team RACKlette (ETH Zurich) 2-1

Last year's overall winner and fan favorite Team RACKlette has cemented itself in the SCC Hall of Fame by obtaining 2-1 betting odds, making it the only non-Asian team to have achieved this feat. The team apparently has detailed internal Wiki documents about past competition applications. If there are any whistleblowers on the team we might have a scandal larger than the one Julian Assange was a part of.

Peking University 3-1

If you thought Squid Game was cool, you're gonna wish you went to Peking University, who I've been told held an HPC game to attract top talent to its team. But is SCC more talent or experience? The Peking team is entirely new, which may have been a strategic move to ensure the team's inclusion in the competition this year. Either way, all we really care about is what type of keyswitch is in their gaming keyboards.

0 Comments
2024/11/15
21:15 UTC

5

Persistent Hostnames Warewulf4 IPA

Hello Everyone, I set up WW4 and am wondering how to persist the compute nodes' hostnames as well as have them enrolled in my FreeIPA server. Do I have to set the full FQDN in /etc/hosts on the management server and move it to the overlay? Any guidance would be greatly appreciated.

5 Comments
2024/11/15
18:30 UTC

2

Z array performance issue on HPC cluster

Hi everyone, I'm new to working with z arrays in our lab, and one of our existing workflows uses them. I'm hoping someone here could provide some insight and/or suggestions.

We are working on a multi-node HPC cluster that runs SLURM, with a network file storage system that supposedly supports RAID.

The file in question that we are using (a zarray) contains a large number of data chunks, and we've observed some performance issues. Specifically, concurrent reads (multiple jobs accessing the same zarray) slow down the process. Additionally, even with a single job running, the reading speed seems inconsistent. We suspect this may be due to other users accessing files stored on the same disk.

Has anyone experienced issues like these before when working with Z-arrays?
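
To put numbers on "inconsistent", one option is to time raw reads of the chunk files, independent of any array library. The sketch below uses only the Python standard library and assumes the array is stored as a directory of chunk files on the shared filesystem (that layout is an assumption, not something stated above). The function names are hypothetical.

```python
import statistics
import time
from pathlib import Path


def time_chunk_reads(store_dir: str, limit: int = 200) -> list:
    """Read up to `limit` files under store_dir; return per-file read times in seconds."""
    timings = []
    files = sorted(p for p in Path(store_dir).rglob("*") if p.is_file())
    for path in files[:limit]:
        start = time.perf_counter()
        path.read_bytes()  # whole-file read, roughly what fetching one chunk costs
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: list) -> dict:
    """Count, median and worst-case read time for a list of timings."""
    if not timings:
        return {"count": 0, "median_s": 0.0, "max_s": 0.0}
    return {
        "count": len(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }
```

Calling summarize(time_chunk_reads(path)) on the array directory at different times of day, and from one job versus several, might show whether the worst-case read time grows with concurrency. Repeated runs on the same node can hit the page cache, so the first run is usually the most informative.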

5 Comments
2024/11/15
15:17 UTC

2

8x64GB vs 16x32GB in a HPC node with 16 DIMMs: Which will be a better choice?

I am trying to purchase a Tyrone compute node for work and I am wondering if I should go for 8x64GB vs 16x32GB.

- 16x32GB would use up all the DIMM slots and result in a balanced configuration. Will limit my ability for future upgrades.

- 8x64GB, half of the DIMMs slots are unused. Will this lead to performance issues while doing memory intensive tasks?

Which is better? Can you point me to some study that has investigated the performance issue with such unbalanced DIMM configs? Thanks.
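
The bandwidth side of the question comes down to how many memory channels end up populated. Here is a back-of-the-envelope sketch of theoretical peak bandwidth (channels x transfers per second x 8 bytes per transfer). The channel count and memory speed are assumptions chosen for illustration; neither is given above, so substitute the values from the board's manual.

```python
def peak_bandwidth_gbs(populated_channels: int, mega_transfers_per_s: int,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak in GB/s: channels * MT/s * bytes per transfer / 1000."""
    return populated_channels * mega_transfers_per_s * bus_bytes / 1000


# ASSUMPTIONS (not from the post): 16 memory channels in total,
# one DIMM slot per channel, DDR4-3200 modules.
TOTAL_CHANNELS = 16
SPEED_MT_S = 3200

for dimms in (16, 8):
    channels = min(dimms, TOTAL_CHANNELS)
    bandwidth = peak_bandwidth_gbs(channels, SPEED_MT_S)
    print(f"{dimms} DIMMs -> {channels} channels -> {bandwidth:.1f} GB/s peak")
```

Under these assumptions, 8 DIMMs populate half the channels and halve the theoretical peak. If the board instead has 8 channels with two slots each, 8 DIMMs placed one per channel still populate every channel and the peak figure does not change.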

11 Comments
2024/11/15
06:46 UTC

0

Student Researcher. Academic Paper Request.

Hi, I'm reaching out with an unusual request for assistance. I am a student researcher in need of a paper from IEEE Computer Society:

Title: Performance Characterization of Large Language Models on High-Speed Interconnects

DOI: 10.1109/HOTI59126.2023.00022

Link: https://www.computer.org/csdl/proceedings-article/hoti/2023/047500a053/1RoJ4lNvAXK

Would anyone with an active IEEE Computer Society subscription be willing to share or download the paper for me? Your help would greatly support my research.

4 Comments
2024/11/15
02:14 UTC

13

Developer Stories Podcast - Dan Reed "HPC Dan" on the Future of High Performance Computing

In case you need a good listen for your SC24 travel, the Developer Stories Podcast is featuring Dan Reed - "HPC Dan" - a prominent, humble, and insightful voice in our community. I've really enjoyed talking to Dan (and reading his blog "Reed's Ruminations") because it covers everything from the technology space, to policy, humor, and literary references, to stories of his family and how he feels about fruit cake! Here are several ways to listen - I hope you enjoy!

8 Comments
2024/11/14
17:43 UTC

1

Strategies for parallel jobs spanning nodes

Hello fellow nerds,

I've got a cluster working for my (small) team, and so far their workloads consist of R scripts with 'almost embarrassingly parallel' subroutines using the built-in R parallel libraries. I've been able to allow their scripts to scale to use all available CPUs of a single node for their parallelized loops in pbapply() and such using something like

srun --nodelist=compute01 --tasks=1 --cpus-per-task=64 --pty bash

and manually passing a number of cores to use as a parameter to a function in the R script. Not ideal, but it works. (Should I have them use 2x the CPU cores for hyperthreading? AMD EPYC CPUs)

However, there will come a time soon that they would like to use several nodes at once for a job, and tackling this is entirely new territory for me.

Where do I start looking to learn how to adapt their scripts for this if necessary, and what strategy should I use? MVAPICH2?

Or... is it possible to spin up a container that consumes CPU and memory from multiple nodes, then just run an rstudio-server and let them run wild?

Is it impossible to avoid breaking it up into altogether separate R script invocations?
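
One possible starting point, if separate invocations turn out to be acceptable, is a Slurm job array in which each task works out its own share of the loop from environment variables that Slurm sets. The sketch below is in Python rather than R purely for illustration (R can read the same variables with Sys.getenv). The helper name and the workload size of 1000 are placeholders, and it assumes the array indices start at 0 (for example --array=0-3). It also reads SLURM_CPUS_PER_TASK, which is set when --cpus-per-task is given, so the core count need not be passed by hand.

```python
import os


def task_slice(n_items: int, task_id: int, n_tasks: int) -> range:
    """Contiguous, near-equal share of range(n_items) for task `task_id` of `n_tasks`."""
    base, extra = divmod(n_items, n_tasks)
    start = task_id * base + min(task_id, extra)
    stop = start + base + (1 if task_id < extra else 0)
    return range(start, stop)


# Slurm sets these inside a job; the defaults let the script run outside Slurm too.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

my_items = task_slice(1000, task_id, n_tasks)  # 1000 is a placeholder workload size
print(f"task {task_id} of {n_tasks}: {len(my_items)} items, {cpus} cores per task")
```

Each array task can land on a different node, so this spreads an embarrassingly parallel loop across nodes without MPI. Loops whose iterations need to communicate would need a different approach.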

3 Comments
2024/11/14
00:46 UTC
