/r/HPC
Multicore, cluster, and high-performance computing news, articles and tools.
"Anyone can build a fast CPU. The trick is to build a fast system." - Seymour Cray
I've been trying to figure out the best solution for this and ended up here, so apologies if this isn't particularly on brand; if so, I'm open to other pages or links for info.
I have a hobby of having hobbies, and I end up lacking compute power for things like 3D modeling, physics simulations, game development, and running local LLMs, to name a few daily tasks. I'm not doing these at insane scale or for any business; however, I don't have thousands to shell out on super-high-end parts that can accommodate everything in one system.
My ideal goal is one host with a decent CPU and GPU for fast processing, which also manages and schedules between maybe 2-4 nodes. The nodes would be used for the "heavy lifting" with more cores, RAM, and VRAM, preferably in a smart-ish system that prioritizes currently focused applications.
From what I've read so far, I may need to settle for a system that schedules everything equally all the time, which isn't bad as long as I can still accomplish the main goal of having access to large compute strength. OpenPBS sounds like a possible option paired with Ubuntu; I was also looking into something like DragonFly BSD or CentOS, but I have zero experience with either.
Something that may or may not cause issues is that almost all of my programs and software only run on windows. A vm running windows on the host or access node is more than doable, but I'm not sure if there are any issues using a vm to access the full range of a cluster.
Edit: forgot to actually state what I'm asking. I'm looking for advice, tips, or just criticism of my idea; I don't have any solid requirements for hardware or OS yet, so recommendations are more than welcome. I was looking at possibly using a small 4-node blade server if I can find one with slots for some workstation GPUs, like a Tesla K80, for high VRAM at a cheap price.
Again sorry if this isn't the place for this or if some of these questions are basic knowledge, it took many hours of searching to even get to this point so please point me along if I'm not at the end of my search.
GPFS Optimizations
We are using GPFS (I am a user, not an admin) and we have a specific use case of reading read-only files over and over again. I was wondering whether using the C API directly (gpfs_read, etc.) can optimize this specific use case? I can't seem to find performance numbers for reading data with or without the C API.
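For what it's worth, much of the benefit for repeated read-only access comes from prefetch/caching hints rather than the read call itself. Here is a minimal POSIX sketch of that idea; posix_fadvise is standard POSIX and only an analogue of GPFS's own access-range hints (gpfs_fcntl), and the function name below is my own, not a GPFS call:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read the whole file 'passes' times; returns total bytes read, -1 on error.
 * read_repeatedly() is an illustrative name, not part of any GPFS API. */
long read_repeatedly(const char *path, int passes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    /* Hint the kernel that we read sequentially and will want the data
     * again, so it prefetches aggressively and keeps pages cached. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    char buf[4096];
    long total = 0;
    for (int p = 0; p < passes; p++) {
        lseek(fd, 0, SEEK_SET);       /* rewind for each pass */
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            total += n;
    }
    close(fd);
    return total;
}
```

After the first pass the data should come from the page cache, which is the same effect the GPFS hints aim for on the cluster side.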
Hello all,
I am currently writing a C++ program using MPI and OpenCV, and I am having trouble executing the program.
When I build and run it from CLion, it seems to work fine.
However, when I compile it with cmake . && make and run it with mpirun, I am unable to execute the code. It gives me no output at all, or a path error.
Any advice would be much appreciated.
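In case the difference between the CLion build and the manual one comes down to how MPI and OpenCV get linked, here is a minimal CMakeLists.txt sketch for an MPI + OpenCV program; the target and source names are placeholders for your own:

```cmake
cmake_minimum_required(VERSION 3.10)
project(mpi_opencv_demo CXX)

# Locate both dependencies; fail loudly if either is missing.
find_package(MPI REQUIRED)
find_package(OpenCV REQUIRED)

add_executable(demo main.cpp)
target_link_libraries(demo PRIVATE MPI::MPI_CXX ${OpenCV_LIBS})
target_include_directories(demo PRIVATE ${OpenCV_INCLUDE_DIRS})
```

If this builds but mpirun still fails with a path error, compare the working directory and library paths CLion injects against your shell environment.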
I am newly administering this platform, and I am now going to use AD as the authentication source.
For SSH, I can use the SSSD + LDAP combination to let users log in, and everything is smooth.
For JupyterHub, it seems BrightCM customized the environment so that it can only authenticate against CMDaemon, which is an internal LDAP.
I would like to ask whether anybody has had experience making JupyterHub in BrightCM authenticate against AD. Thank you.
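For anyone landing here: the usual non-Bright route is to point JupyterHub's own authenticator at AD via the ldapauthenticator plugin, bypassing CMDaemon. A jupyterhub_config.py sketch, where the hostname and DN template are placeholders (whether Bright lets you override its generated config is a separate question):

```python
# jupyterhub_config.py (sketch) - authenticate against AD over LDAPS.
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = 'ad.example.com'   # placeholder
c.LDAPAuthenticator.use_ssl = True
c.LDAPAuthenticator.bind_dn_template = [
    'CN={username},OU=Users,DC=example,DC=com',          # placeholder DN
]
```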
I’m hoping to find project ideas to build skills and show what I know to apply myself to a future HPC role.
TL;DR about the role: mainly troubleshooting clusters, bash, using SLURM, K8s administration, and other kinds of automation to help with daily tasks.
Sorry to be vague, but I cannot find much online other than the listed job for the information I would like, as each “HPC Engineer” role is HIGHLY varied, haha.
An HPC service provider requires changing users' home directories from /home/{user} to /home/{primarygroup}/{user} if we want to upgrade the admin platform.
It seems very rare to me to see user homes in such a pattern. What are the pros and cons of managing home directories this way?
I have both DMTCP and SLURM installed on Ubuntu 18.04 on a small 2-node cluster. I'm planning on running some MPI applications and checkpointing them, but I don't know how to run DMTCP via SLURM.
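The usual pattern is to start one coordinator and then wrap the MPI launch in dmtcp_launch. A sketch of an sbatch script; the application name, checkpoint interval, and node counts are placeholders, and the coordinator flags should be double-checked against your DMTCP version:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4

# Start one coordinator; it brokers checkpoint requests for all ranks.
dmtcp_coordinator --daemon --port 0 --port-file coord.port
PORT=$(cat coord.port)

# Launch each MPI rank under DMTCP, checkpointing every 300 seconds.
srun dmtcp_launch --coord-port "$PORT" -i 300 ./my_mpi_app

# A dmtcp_restart_script.sh is written at checkpoint time for resuming.
```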
Hi, I'm looking to get into HPC, but I have no idea what the interview process looks like. Is it like SWE interviews where they ask leetcode problems? Or is it mostly on domain knowledge?
Clarification:
I want to be an HPC software engineer (Not sure if this is the correct term). (Accelerating/Optimizing scientific computing or AI/ML training)
I'm in the process of buying 3 r760 dual CPU machines.
I want to connect them together with InfiniBand in a switchless configuration and need some guidance.
Based on poking around, it seems easiest to have a dual-port adapter and connect each host to the other two, then set up a subnet with static routing. Someone else will be helping with this part.
I guess my main question is affordable hardware (<$5k) to accomplish this that will provide good performance for distributed memory computations.
I cannot buy used/older gear. Adapters/cables must be available for purchase brand new from reputable vendors.
The R760 has OCP 3.0, but Dell does not appear to offer an InfiniBand card for it. Is the OCP 3.0 socket beneficial over using PCIe?
Since these systems are dual-socket, is there a performance hit from using a single card to communicate with both CPUs? (The PCIe slot belongs to a particular socket?)
It looks like Nvidia had some newer options for host chaining when I was poking around.
Is getting a single port card with a splitter cable a better option than a dual port?
What would you all suggest?
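One operational detail worth planning for: with no switch there is no embedded subnet manager, and each directly cabled host-to-host link is its own IB subnet, so each link needs its own opensm instance on one of its endpoints. A sketch; the GUIDs below are made-up examples, read the real ones from ibstat:

```shell
# List the Port GUIDs of each connected HCA port on this host.
ibstat

# Start one subnet manager per link, bound to the local port's GUID
# (-B daemonizes, -g selects the port; GUIDs here are examples only).
opensm -B -g 0xdeadbeef00000001
opensm -B -g 0xdeadbeef00000002
```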
Hi folks,
I am very new to the HPC environment and all server-related subjects.
Now I am trying to set up a SLURM cluster on my machines, along with some file systems.
I am trying to run multiple jobs from multiple clients, and each job does a lot of read/write operations.
I've read several articles from the communities and heard about BeeGFS, but when tested with fio randwrite it is way slower than the NFS mount point.
Hence I am now looking for something else for the FS. Can you recommend any others?
(PS: I am trying to run Synopsys VCS regression tests on this cluster.)
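For anyone reproducing the comparison, it's worth pinning down the fio parameters, since buffered vs. direct I/O and the number of concurrent jobs can change the ranking a lot. A sketch of a small-random-write test; point --directory at the BeeGFS or NFS mount under test:

```shell
# Many concurrent small random writes, bypassing the client page cache.
fio --name=randwrite --directory=/mnt/beegfs \
    --rw=randwrite --bs=4k --size=1g --numjobs=8 \
    --ioengine=libaio --direct=1 --iodepth=16 --group_reporting
```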
I'm pleased to announce that our work on the Flux Framework operator is published in F1000Research! This is an example of converged computing and was (continues to be) a joy to collaborate with Aldo and Antonio (Google batch/networking teams, respectively). https://doi.org/10.12688/f1000research.147989.1. I hope to do (and inspire others to do) work like this more often! <3
Except for the repository, I can't find anything about it.
https://github.com/NVIDIA/aistore/tree/main
https://aiatscale.org/
Skimming through the docs, it seems rather feature-complete, more flexible than MinIO, with more potential for performance; it's backed by a big corp and is open source with no strings attached.
So it seems like a very good candidate, and I am surprised I can't find any feedback on it on Google.
I thought I'd share in case this is a helpful resource for someone interested in learning about high performance computing for quantitative finance applications. It includes an introduction to high performance computing, a reference to a guide I co-wrote on configuring a small Slurm cluster, and a Python script template with tested examples for implementing Monte Carlo option pricing programs on Slurm clusters.
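To give a flavor of the kind of program the template covers, here is a minimal Monte Carlo European call pricer of my own (not taken from the guide); on a cluster, each Slurm array task would run a slice of the paths with its own seed and the results would be averaged:

```python
import math
import random

def mc_call_price(S0, K, r, sigma, T, n_paths, seed=0):
    """Price a European call by simulating terminal prices under
    geometric Brownian motion and discounting the mean payoff."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)              # standard normal draw
        ST = S0 * math.exp(drift + vol * z)  # terminal asset price
        payoff_sum += max(ST - K, 0.0)       # call payoff
    return math.exp(-r * T) * payoff_sum / n_paths
```

In a Slurm array job the seed would typically come from SLURM_ARRAY_TASK_ID so that tasks draw independent paths.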
We are looking for a PRO build for some calculations and model training using the GPU. We are aiming for a budget build with 1TB of RAM. Can you give your thoughts on the setup below? Thanks.
Asus Pro WS WRX80E-SAGE SE WIFI II
Seasonic Prime Tx-1600 80+ Titanium Psu
Ryzen ThreadRipper PRO 5975WX
Crucial Micron 128GB 3200MHz CL22 DDR4 SDRAM DIMM 288-pin, x8
Fractal Design Define 7 XL Light Glass
Kingston FURY Renegade SSD 1000GB M.2 2280 PCI Express 4.0 x4 (NVMe)
ASUS GeForce RTX 4090 TUF Gaming OC 24GB
Seagate Exos X20 20TB 3.5" 7200rpm SATA-600
Kingston DC600M SSD 7680GB 2.5" SATA-600
Corsair ICUE H150I Elite Capellix XT
What do I need in order to install 4x T4s into my R740xd? I don't need power cables since they are 70 W each, right? Would I only need the risers, and if so, which of these risers do I need? Dell keeps redirecting me to their installation kit, which is a pain in the ass to buy and still comes with too many extras. Are those extras needed?
I'm creating a SLURM cluster with an MPICH/DMTCP configuration. What should the installation order be?
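There isn't a hard dependency forcing one order, but a sequence that avoids rebuilds is MPICH, then DMTCP, then SLURM. A sketch; the versions and prefixes are illustrative, and the configure flags should be checked against each project's INSTALL notes:

```shell
# 1. MPICH first, so later builds and your apps can find mpicc/mpicxx.
cd mpich-4.1 && ./configure --prefix=/opt/mpich && make -j && make install
# 2. DMTCP next, built with the same compilers on every node.
cd ../dmtcp-3.0 && ./configure --prefix=/opt/dmtcp && make -j && make install
# 3. SLURM last; it is independent of the other two, but doing it last lets
#    you bake the MPICH/DMTCP paths into your prolog and job templates.
```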
I know it's nice for educational purposes, but is there ever a practical reason to build it for performance? Or is going a bit bigger on the CPU always worth it?
Given a bash script named test.sh
module load cuda/11.6
env
If I run it on the host system with bash test.sh, everything is fine.
But if I run it in a Singularity container:
singularity exec rocky8.sif bash -l test.sh
then it reports that module is not found.
But the output shows that the function exists:
BASH_FUNC_module()=() { local _mlredir=1;
if [ -n "${MODULES_REDIRECT_OUTPUT+x}" ]; then
if [ "$MODULES_REDIRECT_OUTPUT" = '0' ]; then
_mlredir=0;
else
if [ "$MODULES_REDIRECT_OUTPUT" = '1' ]; then
_mlredir=1;
fi;
fi;
fi;
case " $@ " in
*' --no-redirect '*)
_mlredir=0
;;
*' --redirect '*)
_mlredir=1
;;
esac;
if [ $_mlredir -eq 0 ]; then
_module_raw "$@";
else
_module_raw "$@" 2>&1;
fi
}
How to fix this?
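A likely cause: bash -l carries the exported module function over from the host, but the helper it calls (_module_raw, visible in the function body above) is not defined inside the container. Re-initializing Environment Modules inside the container usually fixes it; the init script path below is typical but may differ in your image:

```shell
# Initialize modules inside the container before running the script.
singularity exec rocky8.sif bash -c \
  'source /etc/profile.d/modules.sh && bash test.sh'
```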
Hi, there was a deadline for IHPCSS applications on 31st January. I applied for the first time ever; does anyone know if they send rejection emails? On the application they said it would take a month or so, and it's been a month and a half, so I don't know if I've been rejected or am just impatient.
Thanks in advance!
Hi Experts,
I am new to the HPC world and I want to learn more about it.
Is there a training course or some content that can help me understand, visualize and practice HPC?
Tried searching Udemy but that didn't help much.
Hi.
Our current cluster has multiple partitions, mainly to separate between long and short jobs.
I'm starting to see more and more clusters that have only one partition and manage their nodes via QOS only. Often I see a "long" and a "short" QOS which restrict jobs to specific nodes.
What is the benefit of using QOS here?
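For context, the QOS-based version of a long/short split looks roughly like this. The limits and priorities are made-up examples; note that QOS definitions live in the Slurm accounting database rather than slurm.conf, which is part of the appeal (no daemon reconfigure to change them):

```shell
# Create two QOSes with different wall-time caps and priorities.
sacctmgr add qos short
sacctmgr modify qos short set MaxWall=04:00:00 Priority=100
sacctmgr add qos long
sacctmgr modify qos long set MaxWall=7-00:00:00 Priority=10

# Users then pick one at submit time:
sbatch --qos=short job.sh
```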
I am reading through GitHub repos of CUDA code, just whatever comes first or some common tools I use.
I am noticing there are 2 distinct dialects (I think, IDK, I'm no expert). The AI people do a lot of metaprogramming and use common libraries; this makes their code, even inside kernels, very C++-ish.
In contrast, the physics simulations look like plain C with some fancy syntax for kernel launching, and most of the surrounding code is C or C-like C++.
Is this something you have noticed? Is this a thing that transcends CUDA, or is it specific to that language?
Right now I am stuck not being able to compile on my machine (not the question here); I will probably find a solution, but I would never know whether this is an issue on other platforms.
Hello,
I’m currently studying computer science and mathematics. Next year I’ll have to choose a master’s degree, and I heard about HPC. What I really enjoy is developing performant software using pretty low-level programming languages like C or Rust, and optimizing algorithms. I would also really like to fight against the environmental crisis we’re facing nowadays, and I’ve found that maybe with HPC I could combine the two: developing performant software for researchers in meteorology, climatology, ecosystem simulations, and so on. I would also like to work in public research. Do you think HPC is what I’m looking for? Are HPC engineers in demand in European public research? Does anybody here do this? Do you know what the best HPC master’s degrees in Europe are?
Thanks in advance for your answers
In our environment, we have a large number of queues and it's difficult to manage them all. This includes queues that are no longer used.
So, we need to do some housekeeping and remove the queues that are no longer in use. Is there any way I can find out when a job last ran on each queue in LSF?
I've tried fetching data from RTM, but it's tedious to go through each queue and manually scroll/sort for them. It would be much easier to fetch this through a script.
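One scriptable approach is to walk the queue list and ask the job-history tools for each queue's most recent job. A rough sketch; verify the bhist flags against your LSF version, and note that -n 0 asks it to search all lsb.events files, which can be slow on a busy cluster:

```shell
# For each queue, print the last line of its job history (or a marker).
for q in $(bqueues -w | awk 'NR > 1 {print $1}'); do
    last=$(bhist -a -q "$q" -n 0 2>/dev/null | tail -n 1)
    echo "$q : ${last:-no jobs found}"
done
```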
I have built my Discrete Element Method (DEM) code for simulation of granular systems in C++. As the simulation of particle dynamics is fully resolved, I want to run it on our cluster. I would skip the OpenMP implementation even though it might be easier than using MPI.
In terms of the APIs, which one is more user-friendly, or do they have the same APIs? Supposing I already know the basic algorithm for parallel simulation of many-particle systems, is the implementation doable in 6 months?
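Six months is plausible if the decomposition is kept simple. The bookkeeping core of an MPI port is deciding which rank owns which particles; here is a sketch of a 1-D block decomposition (my own helper, not from any DEM library; in the full code each rank would exchange boundary "ghost" particles with neighbours via MPI_Sendrecv and combine contact forces with MPI_Allreduce):

```cpp
#include <cassert>
#include <utility>

// Half-open index range [first, last) of particles owned by 'rank'.
// The first (n_particles % n_ranks) ranks each take one extra particle,
// so the load is balanced to within a single particle.
std::pair<long, long> owned_range(long n_particles, int n_ranks, int rank) {
    long base = n_particles / n_ranks;   // minimum particles per rank
    long extra = n_particles % n_ranks;  // first 'extra' ranks get one more
    long first = rank * base + (rank < extra ? rank : extra);
    long count = base + (rank < extra ? 1 : 0);
    return {first, first + count};
}
```

With this in place, each rank integrates only its own range and the MPI calls reduce to a handful of well-understood exchange patterns, which is what makes the 6-month estimate realistic.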
All of my compute nodes can run at a maximum network speed of 1gbps, given the networking in the building. My SLURM cluster is configured so that there is an NFS node that the compute nodes draw their stuff from, but when someone is using a very large dataset or model it takes forever to load. In fact, sometimes it takes longer to load the data or model than it does to run the inference.
I'm thinking of re-configuring the whole damn thing anyway. Given that I am currently limited by the building's networking but my compute nodes have a preposterous amount of hard drive space, I'm thinking about the following solution:
Each compute node is connected to the NFS for new things, but common things (such as models or datasets) are mirrored on every compute node. The compute node SSDs are practically unused, so storage isn't an issue. This way, a client can request that their dataset be stored locally rather than on the NFS, so loading should be much faster.
Is that kludgy? Note that each compute node has a 10gbps NIC on board, but building networking throttles us. The real solution is to set up a LAN for all of the compute nodes to take advantage of the faster NIC, but that's a project for a few months from now when we finally tear the cluster down and rebuild it with all of the lessons we have learned.
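For what it's worth, mirroring common data onto node-local disk is basically stage-in, which plenty of sites do deliberately. A sketch of the per-job version; the paths and names are placeholders, and since rsync only copies what changed, repeat jobs hit the local SSD instead of the 1 Gbps link:

```shell
#!/bin/bash
#SBATCH --nodes=1
# Stage the dataset from the NFS export onto node-local SSD once,
# then point the job at the local copy.
LOCAL=/local-ssd/datasets/mymodel
mkdir -p "$LOCAL"
rsync -a /nfs/datasets/mymodel/ "$LOCAL"/
python infer.py --model-dir "$LOCAL"
```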