/r/HPC


Multicore, cluster, and high-performance computing news, articles and tools.


"Anyone can build a fast CPU. The trick is to build a fast system." - Seymour Cray



11,365 Subscribers

1

Splitting workloads into a small cluster

I've been trying to figure out the best solution for this and ended up here, so apologies if this isn't particularly on brand; if so, I'm open to other pages or links for info.

I have a hobby of having hobbies and end up lacking compute power for things like 3D modeling, physics simulations, game development, and running local LLMs, to name a few daily tasks. I'm not doing these at insane scale or for any business, but I don't have thousands to shell out on super high-end parts that can accommodate everything in one system.

My ideal goal is one host with a decent CPU and GPU for fast processing, which also manages and schedules between maybe 2-4 nodes. The nodes would do the heavy lifting with more cores, RAM, and VRAM, preferably in a smart-ish system that prioritizes the currently focused application.

From what I've read so far, I may need to settle for a system that schedules everything equally all the time, which isn't bad so long as I can still accomplish the main goal of having access to a lot of compute. OpenPBS paired with Ubuntu sounds like a possible option; I was also looking into DragonFly BSD and CentOS, but I have zero experience with either.

Something that may or may not cause issues is that almost all of my programs and software only run on Windows. A VM running Windows on the host or access node is more than doable, but I'm not sure whether there are any issues using a VM to access the full range of a cluster.

Edit: I forgot to actually state what I'm asking. I'm looking for advice, tips, or just criticism of my idea. I don't have any solid requirements for hardware or OS yet, so recommendations are more than welcome. I was looking at possibly using a small 4-node blade server if I can find one with slots for some workstation GPUs, like a Tesla K80, for high VRAM at cheap prices.

Again sorry if this isn't the place for this or if some of these questions are basic knowledge, it took many hours of searching to even get to this point so please point me along if I'm not at the end of my search.
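For what it's worth, the OpenPBS route boils down to writing job scripts like the hedged sketch below. The queue name, resource names, and the workload command are placeholders for illustration, not from the post; GPU resource names in particular vary by site configuration.

```shell
#!/bin/bash
# Hypothetical OpenPBS job script: request 1 node with 16 cores, 64 GB RAM,
# and 1 GPU, then run a batch workload there. "workq" is OpenPBS's default
# queue; "ngpus" must be defined by the admin on your cluster.
#PBS -N render-job
#PBS -l select=1:ncpus=16:mem=64gb:ngpus=1
#PBS -l walltime=04:00:00
#PBS -q workq

cd "$PBS_O_WORKDIR"
./my_heavy_workload   # placeholder for the actual 3D render / sim / LLM run
```

You would submit this from the head node with `qsub render.sh`, and the scheduler picks a compute node that satisfies the resource request.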

8 Comments
2024/03/29
22:30 UTC

1

GPFS C API

GPFS optimizations

We are using GPFS (I am a user, not an admin). We have a specific use case of reading read-only files over and over again. I was wondering whether using the C API directly (gpfs_read, etc.) can optimize this specific use case? I can't seem to find performance numbers for reading data with or without the C API.

2 Comments
2024/03/29
16:47 UTC

0

Compiling a CMake project with MPI and OpenCV in CLion

Hello all,

I am currently writing a C++ program using MPI and OpenCV, and I am having trouble executing the program.

When I build it with CLion and run it, it seems to work fine.

However, when I compile it with cmake . && make and run it with mpirun, I am unable to execute the code. It gives no output at all, or gives a path error.

Link to Stack Overflow

Any advice would be much appreciated.
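One common culprit with `cmake . && make` plus `mpirun` is relative paths resolving against each rank's working directory rather than the project root. A hedged sketch of an out-of-source build and run; `my_app` and the image path are placeholder names, not from the post:

```shell
# Configure into ./build instead of polluting the source tree in-source.
cmake -S . -B build
cmake --build build

# OpenCV calls like imread("img.png") resolve against the working directory
# of each MPI rank, so pass absolute paths (or cd to a known directory first).
mpirun -np 4 ./build/my_app /abs/path/to/img.png
```

If CLion runs the binary fine but mpirun does not, comparing the working directory and environment between the two launch methods is usually the fastest diagnostic.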

7 Comments
2024/03/29
12:52 UTC

1

BrightCM + jupyterhub + Active Directory

I am newly administrating this platform and am now going to use AD as the authentication source.

For SSH, I can use the SSSD + LDAP combination to let users log in, and everything is smooth.

For JupyterHub, it seems BrightCM customized the environment so that it can only authenticate against CMDaemon, which is an internal LDAP.

I'd like to ask whether anybody has experience making JupyterHub in BrightCM authenticate against AD. Thank you.

0 Comments
2024/03/28
04:56 UTC

12

Relevant skill building projects with HPC help

I’m hoping to find project ideas to build skills and show what I know to apply myself to a future HPC role.

TL;DR about the role: mainly troubleshooting clusters, bash, using SLURM, K8s admin, and other automation to help with daily tasks.

Sorry to be vague, but I can't find much online other than the listed job for the information I'd like, as each "HPC Engineer" role is HIGHLY varied haha

12 Comments
2024/03/27
14:48 UTC

3

Is it a good idea to create user home directories under their primary group (/home/{primarygroup}/{user})?

An HPC service provider requires changing users' home directories from /home/{user} to /home/{primarygroup}/{user} if we want to upgrade the admin platform.

It seems very rare to me to see user homes in such a pattern. What are the pros and cons of managing home directories this way?

13 Comments
2024/03/27
09:00 UTC

4

How to run DMTCP with SLURM?

I have both DMTCP and SLURM installed on Ubuntu 18.04 on a small 2-node cluster. I'm planning on running some MPI applications and checkpointing them, but I don't know how to run DMTCP via SLURM.
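A hedged sketch of the usual pattern: start a DMTCP coordinator, then launch each rank under `dmtcp_launch` inside the Slurm allocation. The port number, application name, and exact flags are illustrative; DMTCP's options vary between versions, so check your installed docs.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Start a coordinator on the first node of the allocation (illustrative port).
dmtcp_coordinator --daemon --port 7779
export DMTCP_COORD_HOST=$(hostname)
export DMTCP_COORD_PORT=7779

# Wrap the MPI application so every rank registers with the coordinator.
srun dmtcp_launch ./my_mpi_app

# To resume later from the checkpoint images (written as ckpt_*.dmtcp):
# srun dmtcp_restart ckpt_*.dmtcp
```

Checkpoints can then be triggered from `dmtcp_command --checkpoint` against the same coordinator.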

2 Comments
2024/03/26
01:49 UTC

6

Where do Research Papers Get Training Times for ML HPC Research

0 Comments
2024/03/25
16:01 UTC

13

What does the interview process for HPC jobs look like?

Hi, I'm looking to get into HPC, but I have no idea what the interview process looks like. Is it like SWE interviews where they ask leetcode problems? Or is it mostly on domain knowledge?

Clarification:

I want to be an HPC software engineer (Not sure if this is the correct term). (Accelerating/Optimizing scientific computing or AI/ML training)

9 Comments
2024/03/24
16:19 UTC

6

3 node mini cluster

I'm in the process of buying 3 R760 dual-CPU machines.

I want to connect them together with InfiniBand in a switchless configuration and need some guidance.

Based on poking around, it seems easiest to have a dual-port adapter and connect each host to the other two, then set up a subnet with static routing. Someone else will be helping with this part.

I guess my main question is about affordable hardware (<$5k) to accomplish this that will provide good performance for distributed-memory computations.

I cannot buy used/older gear. Adapters/cables must be available for purchase brand new from reputable vendors.

The R760 has OCP 3.0, but Dell does not appear to offer an InfiniBand card for it. Is the OCP 3.0 slot beneficial over using PCIe?

Since these systems are dual-socket, is there a performance hit from using a single card to communicate with both CPUs? (Does the PCIe slot belong to a particular socket?)

It looks like Nvidia had some newer options for host chaining when I was poking around.

Is getting a single-port card with a splitter cable a better option than a dual-port card?

What would you all suggest?

27 Comments
2024/03/23
01:11 UTC

1

File System Recommendation

Hi folks,

I am very new to HPC environment and all the server related subjects.

Now I am trying to set up a SLURM cluster on my machines, along with some file systems.

I am trying to run multiple jobs from multiple clients, and each job does a lot of read/write operations.

I've read several articles from the community and heard about BeeGFS, but when tested with fio randwrite it was way slower than the NFS mount point.

Hence I am now looking for something else for the FS. Can you recommend any others?

(PS: I am trying to run Synopsys VCS regression tests on this cluster.)

2 Comments
2024/03/21
10:55 UTC

18

The Flux Operator - an HPC workload manager in Kubernetes

I'm pleased to announce that our work on the Flux Framework operator is published in F1000Research! This is an example of converged computing and was (continues to be) a joy to collaborate with Aldo and Antonio (Google batch/networking teams, respectively). https://doi.org/10.12688/f1000research.147989.1. I hope to do (and inspire others to do) work like this more often! <3

12 Comments
2024/03/21
16:04 UTC

8

Anyone tried NVIDIA AIStore?

Except for the repository, I can't find anything about it.

https://github.com/NVIDIA/aistore/tree/main
https://aiatscale.org/

Skimming through the docs, it seems rather feature-complete and more flexible than MinIO, with more potential for performance; it's backed by a big corp and is open source with no strings attached.

So it seems like a very good candidate, and I am surprised I can't find any feedback on it on Google.

6 Comments
2024/03/20
08:33 UTC

18

I wrote a paper on pricing derivatives with Monte Carlo simulation on Slurm computer clusters in Python

I thought I'd share in case this is a helpful resource for someone interested in learning about high performance computing for quantitative finance applications. It includes an introduction to high performance computing, a reference to a guide I co-wrote on configuring a small Slurm cluster, and a Python script template with tested examples for implementing Monte Carlo option pricing programs on Slurm clusters.

Paper: https://github.com/scottgriffinm/Monte-Carlo-Option-Pricing-on-a-SLURM-Cluster/blob/main/Monte_Carlo_Option_Pricing_with_SLURM.pdf

0 Comments
2024/03/19
02:53 UTC

0

Can you evaluate this build as a budget local HPC?

We are looking at a PRO build system for some calculations and model training using a GPU. We are aiming for a budget build with 1TB RAM. Can you give your thoughts on the setup below? Thanks.

Asus Pro WS WRX80E-SAGE SE WIFI II

Seasonic Prime Tx-1600 80+ Titanium Psu

Ryzen ThreadRipper PRO 5975WX

Crucial Micron 128GB 3200MHz CL22 DDR4 SDRAM DIMM 288-pin x8

Fractal Design Define 7 XL Light Glass

Kingston FURY Renegade SSD 1000GB M.2 2280 PCI Express 4.0 x4 (NVMe)

ASUS GeForce RTX 4090 TUF Gaming OC 24GB

Seagate Exos X20 20TB 3.5" 7200rpm SATA-600

Kingston DC600M SSD 7680GB 2.5" SATA-600

Corsair ICUE H150I Elite Capellix XT

5 Comments
2024/03/18
23:16 UTC

1

Tesla T4s in an R740xd

What do I need in order to install 4x T4s into my R740xd? I don't need power cables since they are 70W each, right? Would I only need the risers, and if so, which risers do I need? Dell keeps redirecting me to their installation kit, which is a pain to buy and still comes with too many extras. Are those extras needed?

1 Comment
2024/03/18
20:14 UTC

2

Should I install SLURM before or after DMTCP?

I'm creating a SLURM cluster with an MPICH/DMTCP configuration. What should the installation order be?

4 Comments
2024/03/16
14:06 UTC

20

Is there ever a reason to build a raspberry pi cluster?

I know it's nice for educational purposes, but is there ever a practical reason to build one for performance? Or is going a bit bigger on the CPU always worth it?

36 Comments
2024/03/16
05:30 UTC

3

Bash function `module` does not work in a Singularity container

Given a bash script named test.sh:

module load cuda/11.6
env

If I run it on the host system with bash test.sh, everything is fine.

But if I run it in a singularity container:

singularity exec rocky8.sif bash -l test.sh

Then it reports that module is not found.

But the env output shows that the function exists:

BASH_FUNC_module()=() {  local _mlredir=1;
 if [ -n "${MODULES_REDIRECT_OUTPUT+x}" ]; then
 if [ "$MODULES_REDIRECT_OUTPUT" = '0' ]; then
 _mlredir=0;
 else
 if [ "$MODULES_REDIRECT_OUTPUT" = '1' ]; then
 _mlredir=1;
 fi;
 fi;
 fi;
 case " $@ " in
 *' --no-redirect '*)
 _mlredir=0
 ;;
 *' --redirect '*)
 _mlredir=1
 ;;
 esac;
 if [ $_mlredir -eq 0 ]; then
 _module_raw "$@";
 else
 _module_raw "$@" 2>&1;
 fi
}

How to fix this?
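One likely cause, reading the exported function above: the `module` wrapper calls `_module_raw`, which the host's Environment Modules init script defines but which is not exported into the container, so the wrapper fails there. A hedged fix is to source the Modules init script inside the container; the path below is a common Rocky/RHEL location and may differ on your system.

```shell
# Re-initialize Environment Modules inside the container so that both
# "module" and its helper "_module_raw" are defined (path is an assumption;
# it varies by installation, e.g. /usr/share/lmod/... for Lmod).
singularity exec rocky8.sif bash -c \
  'source /usr/share/Modules/init/bash && module load cuda/11.6 && env'
```

This only works if the modules installation and the modulefiles are visible inside the container (bind-mounted or installed in the image).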

7 Comments
2024/03/15
05:12 UTC

4

IHPCSS 2024 Summer School Application

Hi, there was a deadline for IHPCSS applications on the 31st of January. I applied for the first time ever; does anyone know if they send rejection emails? On the application they said it would take a month or so, and it's been a month and a half, so I don't know if I'm rejected or just impatient.

Thanks in advance!

5 Comments
2024/03/14
19:48 UTC

17

Training / Courses for HPC

Hi Experts,
I am new to the HPC world and I want to learn more about it.
Is there a training course or some content that can help me understand, visualize, and practice HPC?

Tried searching Udemy but that didn't help much.

7 Comments
2024/03/12
08:04 UTC

8

Benefit of running a Slurm cluster with QOS only instead of partitions

Hi.

Our current cluster has multiple partitions, mainly to separate between long and short jobs.

I'm starting to see more and more clusters that have only 1 partition and manage their nodes via QOS only. Often I see "long" and "short" QOSes that restrict jobs to specific nodes.

What is the benefit of using QOS here?
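For reference, a hedged sketch of how the QOS-only approach is typically configured with `sacctmgr`; the QOS names, limits, and priorities below are illustrative, not taken from the post.

```shell
# Create two QOSes in the Slurm accounting database (illustrative limits).
sacctmgr add qos short
sacctmgr modify qos short set MaxWall=04:00:00 Priority=100
sacctmgr add qos long
sacctmgr modify qos long set MaxWall=7-00:00:00 Priority=10

# Requires QOS enforcement in slurm.conf, e.g.:
#   AccountingStorageEnforce=qos
# Users then choose limits at submit time within a single partition:
sbatch --qos=short job.sh
```

The often-cited advantage is that one partition with multiple QOSes avoids splitting the node pool, so idle "long" nodes can still serve "short" jobs, while QOS preemption and per-QOS limits do the policy work.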

3 Comments
2024/03/11
19:37 UTC


5

CUDA "dialects"?

I am reading through GitHub repos of CUDA code, just whatever comes up first or some common tools I use.

I am noticing there are 2 distinct dialects (I think; idk, I'm no expert). The AI people do a lot of metaprogramming and use common libraries; this makes their code, even inside kernels, very C++-ish.

In contrast, the physics simulations look like plain C with some fancy syntax for kernel launching, and most of the surrounding code is C or C-like C++.

Is this something you have noticed? Is this a thing that transcends CUDA, or is it specific to that language?

0 Comments
2024/03/11
17:54 UTC

3

Any tips for making good OpenMP GPU code that's cross-platform?

Right now I am stuck, not being able to compile on my machine (not the question here); I will probably find a solution. But I would never know whether this is an issue on other platforms.

23 Comments
2024/03/10
18:24 UTC

9

Advice on HPC Master's Degrees in Europe

Hello,

I'm currently studying computer science and mathematics. Next year I'll have to choose a master's degree, and I heard about HPC. What I really enjoy is developing performant software using pretty low-level programming languages like C or Rust, and optimizing algorithms. I would also really like to fight against the environmental crisis we're facing nowadays, and I've found that maybe with HPC I could combine the two: developing performant software for researchers in meteorology, climatology, ecosystem simulations, and so on. I would also like to work in public research. Do you think HPC is what I'm looking for? Are HPC engineers in demand in European public research? Does anybody here do this? Do you know what the best HPC master's degrees in Europe are?

Thanks in advance for your answers

7 Comments
2024/03/09
12:26 UTC

1

How to find the last submit time for a job in LSF?

In our environment, we have a large number of queues and it's difficult to manage them all. This includes queues that are no longer used.

So we need to do some housekeeping and remove queues that are no longer in use. Is there any way I can find the last time a job ran on each queue in LSF?

I've tried fetching data from RTM, but it's tedious to go through each queue and manually scroll/sort for them. It would be much easier to fetch through a script.
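A hedged sketch of scripting this with standard LSF query commands: list the queues, then ask the accounting history (`bacct`, which reads lsb.acct) for jobs completed in a window, per queue. Flags vary between LSF versions and the date window below is illustrative, so verify against your site's docs.

```shell
# For each queue, print the bacct summary of jobs completed in the window.
# Queues whose summary reports zero done/exited jobs are housekeeping
# candidates. (First bqueues column is the queue name; skip the header.)
for q in $(bqueues | awk 'NR>1 {print $1}'); do
    echo "== $q =="
    bacct -q "$q" -C 2024/01/01,2024/03/29 2>/dev/null | head -n 12
done
```

If accounting retention is short, widening the `-C` window (or pointing `bacct -f` at archived lsb.acct files, where supported) covers longer-idle queues.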

1 Comment
2024/03/09
04:24 UTC

4

which one is easier to master, OpenMPI or MPICH?

I have built my Discrete Element Method (DEM) code for simulating granular systems in C++. As the simulation of particle dynamics is fully resolved, I want it to run on our cluster. I would skip an OpenMP implementation even though it might be easier than using MPI.

In terms of the APIs, which one is more user-friendly, or do they have the same APIs? Suppose I already know the basic algorithm for parallel simulation of a system of many particles; is the implementation doable in 6 months?

25 Comments
2024/03/08
12:31 UTC

3

Getting around networking bottlenecks with a SLURM cluster

All of my compute nodes can run at a maximum network speed of 1gbps, given the networking in the building. My SLURM cluster is configured so that there is an NFS node that the compute nodes draw their stuff from, but when someone is using a very large dataset or model it takes forever to load. In fact, sometimes it takes longer to load the data or model than it does to run the inference.

I'm thinking of re-configuring the whole damn thing anyway. Given that I am currently limited by the building's networking but my compute nodes have a preposterous amount of hard drive space, I'm thinking about the following solution:

Each compute node is connected to the NFS for new things, but common things (such as models or datasets) are mirrored on every compute node. The compute node SSDs are practically unused, so storage isn't an issue. This way, a client can request that their dataset be stored locally rather than on the NFS, so loading should be much faster.

Is that kludgy? Note that each compute node has a 10gbps NIC on board, but building networking throttles us. The real solution is to set up a LAN for all of the compute nodes to take advantage of the faster NIC, but that's a project for a few months from now when we finally tear the cluster down and rebuild it with all of the lessons we have learned.
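The mirroring idea can also be expressed as a pre-staging step inside the job itself, so nothing stays stale on the nodes between runs. A hedged sketch with placeholder paths (assuming `rsync` is available and local scratch exists on every node):

```shell
#!/bin/bash
#SBATCH --nodes=4

# One task per node copies (or refreshes) the dataset from the NFS share to
# local SSD; rsync only transfers changed files, so repeat runs are cheap.
srun --ntasks-per-node=1 \
  rsync -a /nfs/datasets/big_model/ /local/scratch/big_model/

# The actual job then reads the local mirror instead of the 1 Gbps NFS path.
srun ./run_inference --data /local/scratch/big_model
```

The trade-off is the first cold copy still crosses the slow network once per node; after that, repeated jobs pay only for deltas.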

13 Comments
2024/03/08
11:55 UTC
