/r/q?req.query.q -- Subreddit Search

152,166 Subscribers

How do you handle pre-deployment jobs with GitOps?

We're moving to Kubernetes and want to use ArgoCD to deploy our applications. This mostly works great, but I'm yet to find a decent solution for pre-deployment jobs such as database migrations, or running Terraform to provision application-required infrastructure (mostly storage accounts, user managed identities, basically anything that can't run on AKS - not the K8s platform).

I've looked into Argo Sync phases and waves, and whilst database migrations are the canonical example, I'm finding them clunky as they run every time the app is synced, not just when a new version is deployed. (`hook-delete-policy: never` would work great here)

I'm assuming the answers here are make sure the database migrations are idempotent and split out terraform from the gitops deployment process? Am I missing any other options?

7 Comments

2024/12/21
19:20 UTC

Knative/KServe + cert-manager: HTTP-01 Challenge Fails (‘connection reset by peer’) for One Service Only

Hey folks! I’m running a Kubernetes cluster with Knative and KServe to serve machine-learning models, and I use cert-manager (ACME/Let’s Encrypt) to handle TLS certificates for these inference endpoints. Everything works smoothly for most of my Inference Services—except for one specific service that stubbornly refuses to get a valid cert.

Here’s the breakdown:

Inference Service “A” spins up fine, but the certificate never goes Ready.
The associated Certificate object shows status.reason = “DoesNotExist,” and says “Secret does not exist”. There exists a temporary secret of type Opaque not kubernetes.io/tls.
Digging into the Order and Challenge reveals an HTTP-01 self-check error:connection reset by peer cert-manager is trying to reach http://my-service-A.default.my-domain.sslip.io/.well-known/acme-challenge/..., but the request fails.

I’ve successfully deployed other Inference Services using the same domain format (.sslip.io), and they get certificates without any trouble. I even tried using Let’s Encrypt’s staging environment—same result. Knative autoTLS was earlier enabled and I disabled it to no positive change.

This also happened earlier when I tried deploying the same service multiple times. I am not sure but it can be a similar scenario here.

What I’ve Tried So Far:

Deleted the “opaque” secret, re-deployed the service. It still recreates an Opaque secret.
Compared logs and resources from a successful Inference Service vs. this failing one. Nothing obvious stands out.
Confirmed no immediate Let’s Encrypt rate-limiting (no 429 errors).

Has anyone else encountered a scenario where Knative autoTLS + cert-manager leads to just one domain failing an HTTP-01 challenge (it can be due to deploying and deleting the same service over a set period of time), while others pass?

I’d love any insights on how to debug deeper—maybe tips on dealing with leftover secrets, or best practices for letting KServe manage certificates. Thanks in advance for your help!

0 Comments

2024/12/21
19:19 UTC

HA or fault tolerant edge clusters with only 3-4 nodes

I've been trying to determine the best way to handle fault tolerance in a 3-4 node cluster. I'm doing more work involving edge computing these days and have run into issues where we need a decent level of resilience in a cluster with 3, max 4 nodes probably.

Most of the reading I've done seems to imply that running 3x master/worker hybrids might be the best way to go without doing anything too unusual (external datastores, changing architecture to something like Hashicorp Nomad etc.). This way I can lose 1 master on a 3-4 node cluster without it commuting seppuku.

I'm also worried about resource consumption being that I'm constrained to a maximum of 4 nodes (granted each can have up to 128 GB RAM) since the powers that be want to squeeze as much vendor software onto our edge solutions as possible.

Anyone have any thoughts on some potential ways to handle this? I appreciate any ideas or experiences other have had!

7 Comments

2024/12/21
18:32 UTC

Air gapped Kubernetes with Talos

We recently shipped a highly requested feature in Talos 1.9. Lets you cache arbitrary container images as part of your installation media. Helps with air gapped environments and preseeding applications for faster scaling

0 Comments

2024/12/21
18:30 UTC

Kubernetes' repository summary

3 Comments

2024/12/21
13:49 UTC

Gitlab install help

Hello, I would like to deploy Gitlab in a k8s cluster but I see in the doc that we can't prod for stateful components. Is there a way to install Gitlab on the entire cluster?

6 Comments

2024/12/21
13:18 UTC

CoreDNS help

I have an issue atm where I need to add some host files to CoreDNS.

If I add like below, the host files do work however this breaks forwarding. (From the pod: Can ping host entries, can't ping google.co.uk for example) nslookup seems to work correctly just not ping

Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
log . {
class error
}
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . 8.8.8.8 {
force_tcp
max_concurrent 1000
}
cache 30
loop
reload
loadbalance
hosts custom.hosts dummy.dummy.net {
172.25.212.11 server1.dummy.dummy.net
172.25.212.10 server2.dummy.dummy.net
fallthrough
}
}

Could someone point me in the right direction for formatting? Host entries are configured in /etc/hosts. If I could point CoreDNS towards this that would be preferable

Thanks!

1 Comment

2024/12/21
12:14 UTC

Different Pods with different RAMs in a Statefulsets

Hi. Is it by any means possible to set up a statefulset in such a way that some Pods have more RAM or cpu assigned to them?

Many thanks

5 Comments

2024/12/21
11:53 UTC

HomeLab with 2 old laptop

So currently I'm interested in kubernetes and want to have experiences with it so i want to start building my homelab. But I wonder my case: I have a dell lattitude 6430 which has an i5 and 2 core with 16gb of ram, and a dell inspirion 3420(no screen, i made an external monitor using this LoL) which has an i3 2328M and 2 core with 6gb of ram. My main laptop is an thinkbook which has 8 cores and 32gb of ram. What do you suggestions that i can most take advantage of my homelab? (I'm newbies and know nothing, pls be nice (●'◡'●))

4 Comments

2024/12/21
10:34 UTC

Guide to Kubernetes RBAC

https://www.cerbos.dev/blog/guide-to-kubernetes-rbac

0 Comments

2024/12/21
06:13 UTC

Karpenter disruption.budgets not working as expected

Hi, everyone. I’m having issues with my node pool’s disruption budgets. The goal is for it to block node scaling down during weekdays (Monday to Friday) between 11:00 AM and 11:00 PM UTC and only allow scaling down in the following scenarios:

Outside of this time frame.
When a node pool is empty.
When the node pool has been modified.

Here’s the configuration I’m using, but it’s not working as expected:

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 1m
  budgets:
   - nodes: '0'
     reasons:
      - Underutilized
     schedule: '0 11 * * mon-fri'  # Starts at 11:00 AM UTC, Monday to Friday
     duration: 12h                # Duration is 12 hours (ends at 11:00 PM UTC)
   - nodes: '1'
     reasons:
      - Empty
      - Drifted

The scaling behavior doesn’t match the intended restrictions. What’s wrong with this configuration, and how can I fix it to achieve the desired functionality?

5 Comments

2024/12/21
02:38 UTC

Modifying contairned configuration file tu use a ca certificate

Hello guys,

I created a kind cluster with this configuration file :

apiVersion: kind.x-k8s.io/v1alpha4

kind: Cluster

nodes:

- role: control-plane

...

extraMounts:

- hostPath: "/usr/local/share/ca-certificates/ca.domain.local.crt"

containerPath: "/etc/containerd/certs.d/ca.crt"

readOnly: true

...

The problem I have is containers on this cluster accountered a TLS error when they try to load images on a private gitlab registry.

Gitlab registry certificate is issued from a self-signed ca. I don't succeed to modify containerd v1.7.18 to use the ca certificate.

This is what i added in /etc/containerd/config.toml in the kind cluster before restarted containerd :

[plugins."io.containerd.grpc.v1.cri".registry.configs."gitlab.domain.local:5050".tls]

ca_file = "/etc/containerd/certs.d/ca.crt"

When I am in the container, i succeed to load an image if I use --tlscacert option

ctr --namespace=k8s.io image pull --tlscacert etc/containerd/certs.d/ca.crt --user argocd:** gitlab.domain.local:5050/....:tag

without specifying tlscacert, its fails with this error "tls: failed to verify certificate: x509: certificate signed by unknown authority"

Please help me

Marie

1 Comment

2024/12/20
20:31 UTC

Seeking Advice on Building a Self Managed Multi Cloud Kubernetes (multi) Cluster

hi everyone,

I’m currently working on a project that involves building a self managed multi cloud cluster. My plan is to host the master node on AWS (or any cloud) and run the worker nodes on a GPU IaaS provider. The primary goal is to efficiently run ML workloads across this setup.

Here’s my current understanding and situation:

Cluster Architecture:
- I want to create a single unified cluster that spans multiple clouds, rather than setting up separate clusters for each cloud.
- I’m considering using Cilium as my CNI for networking, leveraging its Cluster Mesh feature for connectivity across clouds.
Workload Orchestration:
- I plan to have a management cluster that will run deployments with selectors for distributing ML workloads based on resource availability and other criteria.
Centralized Management:
- I need advice on how to set up a centralized management cluster that can effectively orchestrate workloads across these multiple environments.

My Questions:

Single Cluster vs Multi-Cluster: Is it feasible to have a single Kubernetes cluster spanning multiple clouds? What are the pros and cons? Or Just have multiple cluster connected to the management cluster?

Centralized Management: What tools or strategies would you recommend for centralized management of this multi-cloud (and maybe multi cluster setup)? Are there specific tools for workload orchestration that work well in such environments?

ETCD Management: How should ETCD be managed in this multi-cloud, multi-cluster context?

Best Practices: Any best practices or lessons learned from your own experiences in similar setups would be greatly appreciated!

Thank you for helping your bro out!

2 Comments

2024/12/20
12:40 UTC

K8s not working on RHEL 9

Has anyone been able to run K8s on RHEL with Calico? I've setup a minimal cluster with two workers and one master. When I expose any deployment via service I'm unable to get a response from the the exposed service IPs. I've read a few forums where they say that calico and firewalld don't work together on RHEL 9. Has anyone been able to get it to work?

7 Comments

2024/12/20
12:29 UTC

Running GenAI on Supercomputers with Virtual Kubelet: Bridging HPC and Modern AI Infrastructure

Thank you to Diego Ciangottini, the Italian National Institute for Nuclear Physics, the InterLink project, and the Vega Supercomputer – all for doing the heavy lifting getting HelixML GPU runners running on Kubernetes bridged to Slurm HPC infra to take advantage of hundreds of thousands of GPUs running on Slurm infrastructure and transform them into multi-tenant GenAI systems.

Read about what we did and see the live demo here: https://blog.helix.ml/p/running-genai-on-supercomputers-bridging

0 Comments

2024/12/20
11:04 UTC

Weekly: Share your victories thread

Got something working? Figure something out? Make progress that you are excited about? Share here!

4 Comments

2024/12/20
11:00 UTC

Vagrant and VirtualBox: IP addressing problem on a Kubernetes cluster

Hello,

I'm working on a project to enhance application security within a Kubernetes cluster. I'm attempting to create a cluster with one master node and two worker nodes on three separate VMs using Vagrant and VirtualBox.

However, I'm encountering an issue: all three VMs are assigned the same IP address, preventing them from communicating with each other.

I've provided my Vagrantfile and the output of the ifconfig command below:

Vagrantfile :

Vagrant.configure("2") do |config|

#config.gui = true

config.vm.boot_timeout = 60000 # 60 seconds timeout

config.vm.define "k8s-master" do |master|

master.vm.box = "ubuntu/bionic64"

master.vm.network "private_network", type: "dhcp"

#master.vm.provider "virtualbox"

master.vm.provider "virtualbox" do |vb|

vb.memory = "2048"

vb.cpus = 2

end

master.vm.hostname = "k8s-master"

master.vm.provision "shell", inline: "echo 'This is the Kubernetes Master Node'"

end

(1..2).each do |i|

#config.gui = true

config.vm.boot_timeout = 60000 # 60 seconds timeout

config.vm.define "k8s-worker-#{i}" do |worker|

worker.vm.box = "ubuntu/bionic64"

worker.vm.network "private_network", ip: "10.0.2.#{i+15}"

worker.vm.provider "virtualbox" do |vb|

vb.memory = "2048"

vb.cpus = 2

end

worker.vm.hostname = "k8s-worker-#{i}"

worker.vm.provision "shell", inline: "echo 'This is Kubernetes Worker Node #{i}'"

end

ifconfig :

enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500

inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255

enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500

inet 192.168.56.3 netmask 255.255.255.0 broadcast 192.168.56.255

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

inet 127.0.0.1 netmask 255.0.0.0

I've tried several tutorials, but I'm still facing the same problem.

I'm working on Windows with VirtualBox and Vagrant.

Any help or suggestions would be greatly appreciated!

1 Comment

2024/12/20
08:14 UTC

How to config alertmanager send alert to slack with prometheus-community/kube-prometheus-stack chart?

I'm using https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack chart.

I created this values.yaml:

alertmanager:
  config:
    route:
      receiver: 'slack'
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url:
              valueFrom:
                secretKeyRef:
                  name: slack-secret
                  key: webhook-notification
            channel: "#alert-channel"
            title: "Test"
            text: "Test"

But after install it

helm install my-prometheus prometheus-community/prometheus -f values.yaml

I can't see the new config. But I can see the real alert happened in alertmanager UI.

It didn't send to slack channel. How to config?

4 Comments

2024/12/20
07:47 UTC

DeepGram Application not working on Kubernetes

I’ve build an real-time transcription Socket API for internal use which uses DeepGram SDK for doing the job.

Its working on my local system and provide me transcript as intended but when I host this on kubernetes under service and setup ingress for this and try to send the audio bytes, it receives them but they does not transmitted to the deepgram as I never receive any transcript.

I also try to work with DeepGram’s Socket API which do the same job but I was getting connection closed error with deepgram as soon I tried to send audio bytes to my api

I don’t understand now why this is happening, if anyone can provide any clues, I’ll get going with that.

0 Comments

2024/12/20
05:29 UTC

Does the master node have to be always online?

I want to learn kubernetes by deploying a simple project. For cost reasons, it'll be limited to one node, so I can't use a firewall to close off or hide the master node. The most straightforward way to do this is to put the master node online on the same machine the worker node is on, but I'd rather not have my k8s controller online.

As I'm not doing any real load-balancing, can I keep the master node mostly offline (say, on my local computer)?

10 Comments

2024/12/20
01:12 UTC

Picking the right scrape intervals?

I switched recently to VictoriaMetrics and I was wondering if you could share your experiences how to properly set the metrics scrape intervals, for your monitored services. Thank you.

2 Comments

2024/12/20
00:26 UTC

Kubernetes Pod Can't Connect to Deepgram (Works Locally)

I'm banging my head against a wall with a network issue and hoping someone here can offer some insight. I've got an application that uses the Deepgram SDK for real-time transcription. It works flawlessly on my local machine, but when I deploy it as a pod in my Kubernetes cluster, it just… stops working.

Here's the breakdown:

Local: Everything's perfect. Audio streams in, transcripts come out.
Kubernetes: The pod receives the audio bytes input. No errors in my application logs related to authentication or API keys, after connection its also printing log for successful connection. Issue come when sending the bytes chunks to the DeepGram, it never happened.

I've tried a few things already:

Basic connectivity checks: Inside the pod, nslookup api.deepgram.com resolves correctly. I can even curl -v https://api.deepgram.com, and get a 200 OK, which tells me basic HTTP traffic is going out.
Deepgram SDK/Socket API: Initially, I was using the Deepgram SDK. Thinking it might be an SDK issue, I tried working directly with WebSockets, but I get the same behavior: the connection gets closed immediately after the request.
Resource limits: I've checked CPU and memory limits on the pod; they're nowhere near being maxed out.

I'm clueless now about the issue. I'm attaching the code as well, if that makes sense.

Code: https://pastebin.com/WQkW7sYL

Thanks in advance for your help!

0 Comments

2024/12/19
19:05 UTC

K8 Control Plane HA with 2 Servers

I am looking at an Edge application that would have 2 physical Servers in a small cluster. I would like to leverage CP HA should one of the servers go down. The servers would both be worker nodes and one would also run the CP at the same time. In order to get CP HA 3 Control Nodes are required. Am I able to run two CP Nodes on one physical Server and the third on the other? The only idea I have is to run VMs so they are virtually separate Nodes on one Physical server. I'd like to avoid the hypervisor if possible however.

Another option is to make the Control node reachable across a WAN link to the site but that is likely troublesome.

Thanks

https://preview.redd.it/6l2ts0lubu7e1.png?width=592&format=png&auto=webp&s=9eb61c24bab9113b08f32ba05243c2d80d7337e1

16 Comments

2024/12/19
17:30 UTC

Help with Istio VirtualService and DestinationRule ignored when calling a service from inside the mesh

I am trying to implement blue/green deployment using Istio virtual services and destination rules. Seems like I can make it work when calling from outside the cluster, but when a pod calls another pod the requests are balanced on both blue and green instances. All the pods involved have the Envoy sidecar.

Istio ingress gateway has been installed with the default istioctl install.

The other relevant manifests are the following:

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: bg-gateway
  namespace: bluegreen-playground
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 8080
      name: http
      protocol: HTTP
    hosts:
    - bluegreen.myexample.com

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: bg-route
  namespace: bluegreen-playground
spec:
  gateways:
    - bg-gateway
  hosts:
  - bluegreen.myexample.com
  - core-svc.bluegreen-playground.svc.cluster.local
  - front-svc.bluegreen-playground.svc.cluster.local
  http:
  - name: core
    match: 
    - uri:
        prefix: "/core"
    route:
    - destination:
        host: core-svc.bluegreen-playground.svc.cluster.local
        subset: blue
  - name: front
    match: 
    - uri: 
        prefix: "/"
    route:
    - destination:
        host: front-svc.bluegreen-playground.svc.cluster.local
        subset: blue

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: core-destination
  namespace: bluegreen-playground
spec:
  host: core-svc.bluegreen-playground.svc.cluster.local
  subsets:
  - name: blue
    labels:
      version: v1
  - name: green
    labels:
      version: v2

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: front-destination
  namespace: bluegreen-playground
spec:
  host: front-svc.bluegreen-playground.svc.cluster.local
  subsets:
  - name: blue
    labels:
      version: v1
  - name: green
    labels:
      version: v2

All the pods are correctly labeled with version: v1 or version: v2. The microservices call each other by using the standard Kubernetes service FQDN (e.g. front-svc.bluegreen-playground.svc.cluster.local).

Am I missing anything?

5 Comments

2024/12/19
16:53 UTC

Using Kubernetes to scale load testing

Generating the load required for load tests can not realistically be done from your local machine - considering the needs for high and consistent compute/network throughput over prolonged periods of time. We show how Kubernetes can be used as a foundation for a scalable load-testing strategy, using CNCF tools to orchestrate, execute and analyze results accordingly.

https://testkube.io/learn/using-kubernetes-to-scale-your-load-testing

0 Comments

2024/12/19
14:33 UTC

Add manual gate to argocd before prod deployment

My team wants to migrate from GH Actions to Argocd. I just have one problem, can I configure manual approval gates before my changes get deployed to staging/ prod. The auto sync is great for dev, but before anything gets deployed to prod our stakeholders want to manually approve the release.

17 Comments

2024/12/19
14:03 UTC