/r/HPC

Multicore, cluster, and high-performance computing news, articles and tools.

"Anyone can build a fast CPU. The trick is to build a fast system." - Seymour Cray

13,223 Subscribers

0

Why is my CPU load at 60 when the machine only has 48 CPUs? (running Fluent)

I am running the fluentbench.pl script to benchmark a large model on various machines. I am using this command:

/shared_data/apps/ansys_inc/v242/fluent/bin/fluentbench.pl -path=/shared_data/apps/ansys_inc/v242/fluent/ -noloadchk -norm -nosyslog Roller_Zone_M -t48

Some machines only have 28 CPUs, so I replace 48 with that number. On those machines the load via "top" never exceeds 28. But on the 48-CPU machine it stays at 60, and the job runs very slowly compared to the 28-CPU machines (which actually have older and slower CPUs)! Hyperthreading is off on all my machines.

The CPU usage of each core seems to fluctuate between 50-150%. The machine originally had 256 GB of memory, but one stick failed a few months ago, so I pulled two sticks out; now each CPU has three 32 GB sticks. Perhaps the slowdown is related to that, but I doubt it. The CPU specs are below, and a few quick load checks are sketched after them.

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit  
Byte Order:          Little Endian  
CPU(s):              48  
On-line CPU(s) list: 0-47  
Thread(s) per core:  1  
Core(s) per socket:  24  
Socket(s):           2  
NUMA node(s):        2  
Vendor ID:           GenuineIntel  
CPU family:          6  
Model:               85  
Model name:          Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz  
Stepping:            7  
CPU MHz:             3583.197  
CPU max MHz:         4000.0000  
CPU min MHz:         1200.0000  
BogoMIPS:            6000.00  
Virtualization:      VT-x  
L1d cache:           32K  
L1i cache:           32K  
L2 cache:            1024K  
L3 cache:            36608K  
NUMA node0 CPU(s):   0-23  
NUMA node1 CPU(s):   24-47
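
For what it's worth, a load of 60 on 48 cores usually just means more runnable (or I/O-blocked) threads than cores, so a first step is counting what is actually running. A minimal sketch of some quick checks (process name and expectations assumed from the post):

ps -eLf | grep -c '[f]luent'    # total Fluent threads; compare against the 48 requested
top -H -b -n 1 | head -60       # snapshot of per-thread CPU usage
numactl --hardware              # confirm both NUMA nodes still report memory

If the thread count is well above 48, Fluent is spawning helper threads on top of the solver ranks; and if numactl shows one node short on memory after the stick removal, that NUMA imbalance is a plausible cause of the slowdown.
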
5 Comments
2024/11/01
03:19 UTC

5

HPC Nodes Interface name change

Hi everyone, just a little paranoia setting in. Does anyone change the interface names like enp1s0 and so on to eth0 or eth1? Or do you just rename the connection names, since the new interface naming seems a bit too long to remember?
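
One middle ground, sketched below, is to pin a short name with a systemd .link file instead of renaming by hand (the MAC address here is a placeholder):

cat <<'EOF' | sudo tee /etc/systemd/network/10-eth0.link
[Match]
MACAddress=52:54:00:aa:bb:cc

[Link]
Name=eth0
EOF

The blunter alternative is booting with net.ifnames=0 biosdevname=0 on the kernel command line, which restores the old eth0/eth1 naming everywhere.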

6 Comments
2024/10/31
21:36 UTC

1

Image Streaming with Snapshotters (containerd plugins) in Kubernetes

This is relevant to the HPC community as we both consider moving our workloads to the cloud (and want to minimize time and thus cost) and consider running Kubernetes on-premises alongside our workload managers.

https://youtu.be/ZXM1gP4goP8?si=ZVlJm0SGzQuDq52E

The basic idea is that the kubelet (the service running on a node to manage pods) uses plugins to help manage containers. One of them is called a snapshotter, and it's in charge of preparing container root filesystems. The default snapshotter, overlayfs, prepares snapshots for all layers, meaning you wait for the pull and extraction of every layer in the image before you get the final thing to start your container. This doesn't make sense given that work has shown less than 7% of actual image contents are needed at startup. Thus, "lazy loading" snapshotters have been developed, namely eStargz and then SOCI (Seekable OCI), which pre-load prioritized files (based on recorded file access) so the container can start as soon as this essential content is ready. The rest of the content is loaded on demand via a custom FUSE filesystem, which uses the index to find content of interest and then does a range request to the registry to retrieve it, returning an inode!
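
As a rough illustration, wiring containerd to one of these lazy-loading snapshotters amounts to registering it as a proxy plugin and pointing CRI at it. A sketch for stargz, following the stargz-snapshotter README (socket path and service name are assumptions; it also assumes the containerd-stargz-grpc daemon is installed and the images are in eStargz format):

cat <<'EOF' | sudo tee -a /etc/containerd/config.toml
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "stargz"
EOF
sudo systemctl restart containerd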

This talk goes through that process in technical detail (on the level of function calls) after doing an HPC performance study on three clouds, and there are timestamps in the description to make it easy to jump to spots of interest. As a community, I think we should be thinking more about cost effective strategies for using cloud (this being just one) along with what other creative things we might do with these plugin interfaces afforded by containerd, and specifically for our HPC workloads.

0 Comments
2024/10/31
00:13 UTC

5

Update slurm controller for a cluster using OpenHPC tools

Dear All,

I have tried to update the Slurm controller for a rebooted cluster. sinfo shows all the nodes in the "down" state. The Slurm version is 18.08.8 and the operating system is CentOS 7. However, when I use the Slurm update command:

scontrol: update NodeName=cn01 State=DOWN Reason="undraining"

Unfortunately, I get the error below:

Error: A valid LosF config directory was not detected. You must provide a valid config path for your local cluster. This can be accomplished via one of two methods: (1) Add your desired config path to the file -> /opt/ohpc/admin/losf/config/config_dir (2) Set the LOSF_CONFIG_DIR environment variable Example configuration files are availabe at -> /opt/ohpc/admin/losf/config/config_example Note: for new systems, you can also run "initconfig <YourClusterName>" to create a starting LosF configuration template.

Which means OpenHPC's LosF is in play. Any comments on updating Slurm in this case are highly appreciated.
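
A hedged sketch of the way out the error message itself suggests (the config path is an assumption; State=RESUME is the standard way to return a DOWN node to service in Slurm):

export LOSF_CONFIG_DIR=/opt/ohpc/admin/losf/config/<YourClusterName>
scontrol update NodeName=cn01 State=RESUME
sinfo -N -l    # verify the node has left the DOWN state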

2 Comments
2024/10/30
14:12 UTC

21

Nightmare of getting infiniband to work on older Mellanox cards

I've spent several days trying to get InfiniBand working on an older enclosure. The blades have 40 Gbps Mellanox ConnectX-3 cards. There is some confusion about whether ConnectX-3 is still supported, so I was worried the cards might be e-waste.

I first installed AlmaLinux 9.4 on the blades and then ran:

dnf -y groupinstall "Infiniband Support"

That worked, and I was able to run ibstatus and check performance using ib_read_lat and ib_read_bw. See below:

[~]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:4a0f:cfff:fef5:c6d0
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      Ethernet    

Latency was around 3 us, which is what I expected. Next I installed openmpi, per "dnf install -y openmpi". I then ran the Ohio State mpi/pt2pt benchmarks, specifically osu_latency and osu_bw, and got 20 us latency. It seems openmpi was only using TCP; it couldn't find any openib/verbs to use. After hours of googling I found out I needed to do:

dnf install libibverbs-devel # rdma-core-devel

Then I reinstalled openmpi and it seemed to pick up the openib/verbs BTL. But then it gave a new error:

[me:160913] rdmacm CPC only supported when the first QP is a PP QP; skipped
[me:160913] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

More hours of googling seemed to conclude that this is because verbs is obsolete and no longer supported, and that I should switch to UCX. So I did that with:

dnf install ucx.x86_64 ucx-devel.x86_64 ucx-ib.x86_64 ucx-rdmacm.x86_64

Then I reinstalled openmpi, and now the osu_latency benchmark gives 2-3 us. Kind of a miracle it worked, since I was ready to give up on this old hardware :-) Annoying how they make this so complicated...
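
For anyone repeating this, the PML can be forced so that a silent fallback to TCP becomes a hard error instead. A minimal sketch (hostnames and benchmark path assumed):

mpirun -np 2 -H node1,node2 --mca pml ucx ./osu_latency
UCX_LOG_LEVEL=info mpirun -np 2 -H node1,node2 --mca pml ucx ./osu_latency    # also logs which transport UCX picked

With --mca pml ucx, mpirun aborts if UCX cannot initialize rather than quietly running over TCP, which would have flagged the missing packages immediately.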

25 Comments
2024/10/29
18:09 UTC

7

Tips for benchmarking?

Hey guys, I'm working on a project that basically simulates wave propagation with different tools and compares them, and I need the dimensions/parameters of my simulation to be big enough for a meaningful comparison.

Do you guys have any tips? Are there other communities beyond r/HPC to consult about these (seismic-type) simulations? I'm probably going to work with 4 or 8 RTX 2080 Super GPUs.

2 Comments
2024/10/28
23:10 UTC

0

How to run a parallelized R script?

Hey all, I'm quite desperate for my master's thesis. I have an R script with several library dependencies and a few custom functions. The script performs a simulation on multiple cores using the parallel package. What would be the steps to run this script on an HPC?

So far I've only managed to log in to Waldur and generate SSH keys. With those I managed to log in to the HPC using PuTTY. I'm completely lost here and my faculty doesn't have any instructions on how to run such scripts.
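
As a rough starting point, running the parallel package under Slurm usually means one task with many CPUs on a single node. A minimal sketch (module and file names are placeholders; check "module avail" on your system):

#!/bin/bash
#SBATCH --job-name=r_sim
#SBATCH --nodes=1            ## the 'parallel' package forks on one node only
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16   ## cores handed to mclapply/makeCluster
#SBATCH --time=12:00:00
#SBATCH --output=sim_%j.out

module load R
Rscript my_simulation.R

Submit it with sbatch, and inside the R script size the worker pool from the allocation, e.g. ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK")).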

13 Comments
2024/10/28
15:06 UTC

3

Need help with InfiniBand virtualization - unique LIDs for vHCAs

I am trying to virtualize my ConnectX-4 with SR-IOV and assign it to VMs, to create a GPU and IB lab for building automation tools and scripts for testing and deployment.

I have successfully created 8 vHCAs and I am able to assign them to the VMs. But the problem is that when I run the SM I get the same LID for the parent function and the virtual HCAs. I know this is how it should be, but for my use case I need a unique LID for each vHCA.

I saw a video from 7 years back indicating this is possible. If anyone knows how to assign unique LIDs to vHCAs, could you please help me out? I would really appreciate it.
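
If memory serves, this is OpenSM's virtualization support. A hedged sketch (the option name is from Mellanox/NVIDIA OpenSM docs; verify it against your opensm version):

grep virt_enabled /etc/opensm/opensm.conf
# set: virt_enabled 2    (0 = ignore vPorts, 1 = disabled, 2 = enabled)
sudo systemctl restart opensm

With virtualization enabled, the SM should assign each active vPort its own LID instead of sharing the parent function's.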

1 Comment
2024/10/28
04:14 UTC

0

Basics of setting up an HPC cluster cloud

As the title says, I want to learn the basics of setting up an HPC cluster cloud, step by step: networking, storage, virtualization, etc. All suggestions are welcome, thanks in advance.

8 Comments
2024/10/27
16:46 UTC

29

HPC communities beyond r/HPC

I'm looking for networking and knowledge sources in the HPC space. While r/HPC is great, I'd love to know what other active communities and forums you frequent for technical discussions and staying up-to-date with HPC developments.

Any other forums, Slack/Discord channels, mailing lists, or any other platforms where you share experiences and insights?

Thanks in advance for your suggestions!

11 Comments
2024/10/27
10:51 UTC

4

DDN not in Gartner’s magic quadrant

Anyone know why?

14 Comments
2024/10/27
07:35 UTC

0

Need help with SLURM JOB code

Hello,

I am a complete beginner with Slurm jobs and Docker.

Basically, I am creating a Docker container in which I am installing packages and software as needed. The supercomputer at our institute requires software to be installed via Slurm jobs from inside the container, so I need some help setting up my code.

I am running the container from inside /raid/cedsan/nvidia_cuda_docker, where nvidia_cuda_docker is the name of the container, using the command docker run -it nvidia_cuda /bin/bash, with nvidia_cuda being the image I am mounting. Inside the container, my final use case is to compile VASP, but initially I want to test something simple, e.g. installing pymatgen, and finally committing the changes inside the container, all via a Slurm job.

Following is the sample slurm job code provided by my institute:

#!/bin/sh
#SBATCH --job-name=serial_job_test    ## Job name
#SBATCH --ntasks=1                    ## Run on a single CPU; can take up to 10
#SBATCH --time=24:00:00               ## Time limit hrs:min:sec, specific to the queue being used
#SBATCH --output=serial_test_job.out  ## Standard output
#SBATCH --error=serial_test_job.err   ## Error log
#SBATCH --gres=gpu:1                  ## GPUs needed; should match the selected queue's GPUs
#SBATCH --partition=q_1day-1G         ## Specific to the queue being used; select from the available queues
#SBATCH --mem=20GB                    ## Memory for the computation; can go up to 100GB

pwd; hostname; date | tee result

docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v <uid>_vol:/workspace/raid/<uid> <preferred_docker_image_name>:<tag> bash -c 'cd /workspace/raid/<uid>/<path to desired folder>/ && python <script to be run.py>' | tee -a log_out.txt

Can someone please help me setup the code for my use case?

Thanks
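
Not an authoritative answer, but a starting sketch based on the institute template and the names in the post (image nvidia_cuda, queue q_1day-1G): run the container non-interactively, install the package, then commit the stopped container so the change persists:

#!/bin/sh
#SBATCH --job-name=pymatgen_test
#SBATCH --ntasks=1
#SBATCH --time=24:00:00
#SBATCH --output=pymatgen_test.out
#SBATCH --error=pymatgen_test.err
#SBATCH --gres=gpu:1
#SBATCH --partition=q_1day-1G
#SBATCH --mem=20GB

docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID nvidia_cuda /bin/bash -c 'pip install pymatgen'
docker commit $SLURM_JOB_ID nvidia_cuda:pymatgen    ## persist the change as a new image tag
docker rm $SLURM_JOB_ID                             ## clean up the stopped container

The same pattern should extend to the VASP build, with the compile commands taking the place of the pip install.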

3 Comments
2024/10/26
18:57 UTC

17

VAST vs. Weka: Experience & Pain points

I'm aware of previous discussions in this community about VAST and Weka, but I'd like to get current, hands-on feedback from users. Looking for real-world experiences, both positive and negative.

Specifically interested in:

VAST users:

  • How's the performance meeting your use cases?
  • What workloads are you running?
  • Any unexpected challenges or pleasant surprises?

Weka users:

  • Are you running with data reduction and encryption enabled? How's the experience?
  • Experience with S3 tiering (either on-prem or cloud)? How smooth is the tiering process in practice?

For all users:

  • What's working particularly well?
  • How satisfied are you with the documentation? Any gaps?
  • How's the vendor support experience? Response times, issue resolution, etc.?
  • What are your main pain points?
  • Any deployment or maintenance challenges?

Context about your environment and workloads would be greatly appreciated.

Thanks a lot in advance!

7 Comments
2024/10/26
13:59 UTC

4

Do you manually update the kernel or stick to the default version?

I'm curious after a discussion with colleagues.

How many of you manually update the kernel for better hardware support?

5 Comments
2024/10/25
16:38 UTC

5

Developer Stories Podcast: Michela Taufer 🎉

Today on the Developer Stories podcast we talk to Michela Taufer - Dongarra Professor of HPC at the University of Tennessee, head of The Global Computing Laboratory, and prominent voice for #ISC25. We hope you enjoy! There are several ways to listen:

0 Comments
2024/10/24
14:29 UTC

3

HPC engineer internship interview as a relative noob?

Hello, I got invited to an interview for an HPC engineer internship as a sophomore in the data science/AI field. (One of Ansys, Altair, Dassault, Siemens; a non-US branch.)

I really didn't expect my resume to land an interview given my background. My somewhat HPC-related experience is handling network equipment in the military, having a decent homelab (imo), and server/network/support admin Coursera courses, all of which were included on the resume. (However, I was always interested in big fat computing muscles and thought DS/AI was not really for me as a job.)

Some notable requirements were (rough translation):

  • accustomed to UNIX/LINUX systems
  • network, FS knowledge
  • server hardware architecture knowledge
  • accustomed to scripting languages such as Python, Bash...

The requirements didn't seem that demanding (I guess also since it's an internship), so I presume the position itself is pretty niche or they're going to filter a lot in the interview.

My question is: as a person who has never actually used HPC, how would I prepare for this, and what would you expect from such interns? This is also my first time doing an interview 🫣. I'd like to hear some perspective from people in the field. Thank you!

6 Comments
2024/10/24
13:36 UTC

14

OpenHPC alternative for Ubuntu

We have an OpenHPC cluster on an old version of CentOS. All the packages are now too out of date and we need to upgrade. Although I set up the old cluster, I'm not an HPC expert and just followed the OpenHPC recipe.

We have a strong preference for Ubuntu. It's unfortunate that no OpenHPC binaries are available for Ubuntu, and compiling from source would be too big a task. Ultimately, we'll stay with a RHEL variant if needed.

How does Qlustar compare to OpenHPC, and what else could you recommend that runs on Ubuntu?

For provisioning, we currently use Warewulf, but can easily change if needed.

For job scheduling, we use SLURM and have a strong preference not to change that.

We also use MPICH and do not want to change that either.

We will also install BeeGFS & Infiniband drivers.

Any recommendations on how to go about building our new replacement cluster?

If the recommendation is to stay with OpenHPC and a RHEL variant, my next question is whether to use AlmaLinux or Rocky.

15 Comments
2024/10/24
12:50 UTC

4

Are the CPUs on a seven-year-old Dell PowerEdge VRTX worth upgrading? (Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz)

It has four blades, each with 24 cores across dual-socket Intel Xeon E5-2670 v3 CPUs @ 2.30GHz.

Can I throw a couple hundred dollars at it from eBay parts to get some "oomph" back into it?

Workload is mainly CFD (Fluent). We only need it to run for a couple more years before retiring it.

18 Comments
2024/10/21
18:08 UTC

8

DDN vs Pure Storage

Which is more established in the industry? Which is more suitable for inference/training needs?

13 Comments
2024/10/19
04:50 UTC

7

Research HPC for $15000

Let me preface this by saying that I haven't built or used an HPC system before. I work mainly with seismological data and my lab is considering getting an HPC system to help speed up the data processing. We currently work on workstations with an i9-14900K paired with 64GB RAM. For example, one of our current calculations takes 36 hrs with the CPU maxed out (constant 100% utilization) and approximately 60GB RAM utilization. The problem is that similar calculations have to be run a few hundred times, rendering our systems useless for other work during this. We have around $15,000 in funding that we can use.

  1. Is it logical to get an HPC for this type of work or price?
  2. How difficult are the setup, running, and management? The software, the OS, power management, etc.? Since I'll probably end up having to take care of it alone.
  3. How do I start on getting one setup?

Thank you for any and all help.

Edit 1: The process I've mentioned is core-intensive. More cores should finish the processing faster since more chains can run in parallel. That should also allow me to process multiple sets of data.

I would like to try running the code on a GPU, but the thing is I don't know how; I'm a self-taught coder. Also, the code is not mine: it was provided by someone else and uses a Python package developed by yet another person. The package has little to no documentation.

Edit 2 : https://github.com/jenndrei/BayHunter?tab=readme-ov-file This is the package in use. We use a modified version.

Edit 3 : The supervisor has decided to go for a high end workstation.

47 Comments
2024/10/18
20:20 UTC

5

AI computing server suggestion

I have been given a loose budget of 15k-20k€ to build an AI server as an internship task. Below is some info needed to target specific hardware:

  • Main jobs are going to be Computer Vision based AI tasks; object detection/segmentation/tracking in a mixture of inference and training.
  • On average, medium to large models will be run on the hardware (a very rough estimate of 25 million parameters)
  • There is no need for containerization or VMs to be run on the server
  • Physical casing should not be rack mountable, but standard standalone case (like Corsair Obsidian 1000D)
  • There will be a few CPU-intensive tasks related to robotics and ROS2 software that may not be able to utilize GPUs
  • There should be enough storage to load the full dataset into NVMe for faster data loading and also enough long-term storage for all the datasets and images/videos in general.

With those constraints in mind, I have gathered a list of compatible components that seem suitable for this setup:
GPUs: 2 x RTX A6000 [11000€]
CPU: AMD Ryzen™ Threadripper™ PRO 7955WX [1700€]
MOTHERBOARD: ASROCK WRX90 WS EVO [1200€]
RAM: 4 x 32GB DDR5 RDIMM 5600MT/s [800€]
CASE: Fractal Meshify 2 XL [250€]
COOLING: To my knowledge sTR4=sTR5 for mounting bracket, so any sTR4 360 or 420 AIO cooler [200€]
STORAGE: 1 x 4TB Samsung 990PRO [300€] + 16TB HDD WD RED PRO [450€]
PSU: Corsair Platinum AX1600i [600€]

Total cost: 16200€

Note that the power consumption/electricity cost is not a concern.
Based on these components, do you see room for improvement or any compatibility issues?

Does it make more sense to have 3x RTX 4090 GPUs, or to switch up any components to result in a more effective server?

Is there anything worth adding to improve the performance or robustness of the server?

18 Comments
2024/10/18
09:23 UTC

1

Double-precision emulation with single-single

Is it advised? I theoretically should be able to get 16 times more TFLOPS, given that double precision is nerfed on RTX cards.

Is there an easy, straightforward method to do it? I want my program to offer it as an option.

Are there any straightforward libraries that are just pip install, or alternatively somewhere I could add this functionality?
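
For context, what the post describes is usually called double-single (or float-float) arithmetic, built out of error-free transforms. The core identity (Fast2Sum), assuming round-to-nearest and |a| >= |b|:

s = fl(a + b)
e = fl(b - fl(s - a))    # then a + b = s + e exactly

Each emulated double is carried as an unevaluated pair (hi, lo) of singles, and addition/multiplication are composed from a handful of such steps, so every emulated operation costs roughly 10-20 single-precision operations; whether the ~16x FP32:FP64 throughput gap on consumer RTX cards turns into a real win depends on that overhead.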

3 Comments
2024/10/17
14:34 UTC

9

Understanding User Needs: HPC vs. Standard Server Setup

Hello everyone,

I’m currently working in the IT department of a university research laboratory. We're facing a challenge with our aging HPC system, where most machines are now retired. The team is considering a new setup, leaning towards one storage server and one compute server instead of an HPC solution, with a budget of around €100,000.

From a recent user survey, we gathered that they are interested in features typically associated with HPC setups, including:

  • GPU
  • Large memory nodes
  • High-speed interconnects (e.g., InfiniBand)
  • Larger local SSDs on nodes

Given these responses, I’m trying to determine whether users genuinely need HPC capabilities or if a standard server would suffice.

What specific questions should I ask the users to clarify their needs? How can I assess whether an HPC setup is necessary for their workloads?

Thank you for your insights!

6 Comments
2024/10/17
14:32 UTC

11

GPU server for 20 000 (maybe more) Euros

Basically there are 20,000 euros (maybe more) to be spent, and this would (possibly) be an actually useful way to spend them. Could you point me to a starting point for learning what to buy, or, if you want, even make a suggestion? E.g. I know 4090s are more cost-effective but don't work for shared-memory computations(?), and there's mixed precision, but how relevant is that now/in the future?

11 Comments
2024/10/16
08:58 UTC

5

Very Basic Storage Advice

Hi all, I'm used to the different filesystems on an HPC system from a user perspective, but I'm less certain of my understanding of them from the hardware side of things. Do the following structure, storage numbers, and RAID configurations make sense (assuming 2-3 compute nodes, 1-3 users max, and datasets which would normally be < 100 GB but could, for one or two, reach up to 5 TB)?

Head/Login Node (1 TB SSD for OS, 2x 2 TB SSDs in a RAID 1 for storage) - Filesystem for user home directories (for light data viz and, assuming the same architecture, compilation). Don’t want to go too much higher for head storage unless I have to, and am even willing to go lower.

Compute Nodes (1 TB SSD for OS, 2x 4 TB SSDs and 2x 4 TB HDDs in a RAID 01 for storage) - Parallel filesystem made up of individual compute node storage for scratch space. Willing to go higher per compute node here.

Storage Node (2x 1 TB SSDs in RAID 1 for OS, 2x 2 TB SSDs in RAID 1 for Metadata Offload, up to 12x 24 TB HDDs in RAID 10 for storage) - Filesystem for long-term storage/ data archival. Configuration is the vendor’s. The 12x 3.5s is about my max for one node, but I may be able to grab two of these.

All nodes will be interconnected through a 10 G switch.

13 Comments
2024/10/15
21:37 UTC

5

Entry level jobs in HPC

Hi everyone,

I just graduated from undergrad and am looking for full time work. I worked at my school's HPC center for four years, did summer research at a national lab, and had internships in HPC-related work at other companies.

From what I've learned, the three options seem to be academia, national lab, and private industry. Right now I would prefer to go the industry route if it's possible.

When I look at job boards, it seems like most positions mentioning HPC are looking for senior level people. Is this inherent to how these companies operate, or am I simply looking at a bad time?

Would appreciate any tips or suggestions. I have lots of HPC experience and would love to work in the field, but I'm unsure if I should just pursue regular SWE positions. Thanks!

1 Comment
2024/10/15
12:46 UTC

1

MPI - CUDA Aware MPI and C++ Best resources or courses

Greetings all! I'm starting out in HPC. I have a bit of background in MPI, CUDA, OpenMP, and C++, but I want to go deeper in my understanding. What would you recommend, including projects, e.g. implementing something like the Game of Life with SYCL? It's just an idea I was thinking of.

Thanks to all in advance!

0 Comments
2024/10/15
10:11 UTC

1

CPU cluster marketplace like Vast.ai?

The Vast.ai marketplace is really impressive--some really dirt-cheap prices, seemingly much cheaper than AWS in many cases. https://cloud.vast.ai/create/.

*But* I can't seem to find the equivalent type of marketplaces for high-end *CPU* clusters. Does anyone know of a CPU equivalent to vast.ai?

I can of course rent CPU clusters on AWS. But I'm looking for these kinds of markets, which may be cheaper.

Use case: I'm creating an enormous amount of "synthetic data" for code that is not easily ported to GPUs. I would ideally be running servers constantly. No idle time on the project. This is why price point is even more important than usual for my use case.

0 Comments
2024/10/14
23:38 UTC

14

Comparison of WEKA, VAST and Pure storage

Does anyone have practical differences / considerations to share when choosing between these storage options?

26 Comments
2024/10/14
13:39 UTC

2

How do user environments work in HPC

Hi r/HPC,

I am fairly new to HPC and recently started a job working with HPCM. I would like to better understand how user environments are isolated from the base OS. I come from a background in Solaris with zones, and in Linux VMs; that isolation is fairly clear to me, but I don't quite understand how user environments are isolated in HPC. I get that modules are loaded to change the programming environment, but not how each user's environment is separate from the others. Is everything just "available" to any user, with PATH changed depending on what is loaded? Thanks in advance.
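
On the modules point, that is essentially what happens: nothing is virtualized per user; a module load just edits environment variables in that user's own shell session. A minimal illustration (module name and version are placeholders):

echo $PATH                 # system default
module load gcc/12.2.0     # prepends the toolchain's bin/lib dirs to PATH, LD_LIBRARY_PATH, etc.
which gcc                  # now resolves inside the module tree
module unload gcc/12.2.0   # the variables are restored

Separation between users comes from ordinary POSIX permissions, quotas, and the scheduler (e.g. cgroups on allocated nodes), not from containers or zones.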

7 Comments
2024/10/14
13:06 UTC
