/r/kubernetes
Kubernetes discussion, news, support, and link sharing.
We're moving to Kubernetes and want to use Argo CD to deploy our applications. This mostly works great, but I've yet to find a decent solution for pre-deployment jobs such as database migrations, or running Terraform to provision application-required infrastructure (mostly storage accounts and user-assigned managed identities, basically anything that lives outside the cluster rather than on AKS itself).
I've looked into Argo CD sync phases and waves, and whilst database migrations are the canonical example, I'm finding them clunky as hooks run every time the app is synced, not just when a new version is deployed (a `hook-delete-policy: never` would work great here).
I'm assuming the answers here are to make sure the database migrations are idempotent and to split Terraform out of the GitOps deployment process? Am I missing any other options?
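For what it's worth, the sync-hook route I've been testing looks roughly like the sketch below: a PreSync Job that Argo CD recreates on each sync (the image, args, and names here are hypothetical placeholders, not anything from our actual setup):

apiVersion: batch/v1
kind: Job
metadata:
  generateName: db-migrate-
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: registry.example.com/myapp-migrate:1.2.3   # placeholder image
        args: ["migrate", "up"]                            # placeholder migration command

Since the hook runs on every sync, the migration itself still has to be idempotent, which is exactly the clunkiness described above.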
Hey folks! I’m running a Kubernetes cluster with Knative and KServe to serve machine-learning models, and I use cert-manager (ACME/Let’s Encrypt) to handle TLS certificates for these inference endpoints. Everything works smoothly for most of my Inference Services—except for one specific service that stubbornly refuses to get a valid cert.
Here’s the breakdown:
- The Inference Service never becomes Ready.
- The Certificate object shows status.reason = "DoesNotExist" and says "Secret does not exist". There is a temporary secret, but it is of type Opaque, not kubernetes.io/tls.
- The Order and Challenge reveal an HTTP-01 self-check error: connection reset by peer. cert-manager is trying to reach http://my-service-A.default.my-domain.sslip.io/.well-known/acme-challenge/..., but the request fails.
- I've successfully deployed other Inference Services using the same domain format (.sslip.io), and they get certificates without any trouble. I even tried using Let's Encrypt's staging environment, with the same result. Knative autoTLS was enabled earlier and I disabled it, with no improvement.
- This also happened earlier when I tried deploying the same service multiple times, so I suspect a similar scenario here.
What I've Tried So Far:
- Deleting the leftover Opaque secret.
Has anyone else encountered a scenario where Knative autoTLS + cert-manager leads to just one domain failing an HTTP-01 challenge (possibly due to deploying and deleting the same service repeatedly), while others pass?
I’d love any insights on how to debug deeper—maybe tips on dealing with leftover secrets, or best practices for letting KServe manage certificates. Thanks in advance for your help!
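For anyone debugging along, the chain I've been walking is roughly the commands below; the resource and secret names are placeholders for whatever exists in your namespace:

kubectl describe certificate <cert-name> -n default
kubectl get orders,challenges -n default
kubectl describe challenge <challenge-name> -n default
kubectl delete secret <stale-opaque-secret> -n default   # let cert-manager recreate it as kubernetes.io/tls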
I've been trying to determine the best way to handle fault tolerance in a 3-4 node cluster. I'm doing more work involving edge computing these days and have run into issues where we need a decent level of resilience in a cluster with 3, max 4 nodes probably.
Most of the reading I've done seems to imply that running 3x master/worker hybrids might be the best way to go without doing anything too unusual (external datastores, changing architecture to something like HashiCorp Nomad, etc.). This way I can lose one master on a 3-4 node cluster without it committing seppuku.
I'm also worried about resource consumption being that I'm constrained to a maximum of 4 nodes (granted each can have up to 128 GB RAM) since the powers that be want to squeeze as much vendor software onto our edge solutions as possible.
Anyone have any thoughts on some potential ways to handle this? I appreciate any ideas or experiences others have had!
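For context on the hybrid option: on a kubeadm-style cluster the control-plane nodes carry a node-role.kubernetes.io/control-plane:NoSchedule taint, so running workloads on them means either removing that taint cluster-wide or tolerating it per workload. A minimal sketch of the toleration (the surrounding pod spec is omitted):

tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule

Resource requests and limits on those workloads matter more than usual here, since they share the nodes with etcd and the API server.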
We recently shipped a highly requested feature in Talos 1.9: it lets you cache arbitrary container images as part of your installation media, which helps with air-gapped environments and with preseeding applications for faster scaling.
Hello, I would like to deploy GitLab in a k8s cluster, but I see in the docs that the stateful components aren't recommended for production use in-cluster. Is there a way to install GitLab entirely on the cluster?
I have an issue atm where I need to add some host files to CoreDNS.
If I add the block below, the host entries do work, but it breaks forwarding (from a pod I can ping the host entries but can't ping google.co.uk, for example). nslookup seems to work correctly, just not ping.
Corefile: |
  .:53 {
      errors
      health {
          lameduck 5s
      }
      ready
      log . {
          class error
      }
      kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
      }
      prometheus :9153
      forward . 8.8.8.8 {
          force_tcp
          max_concurrent 1000
      }
      cache 30
      loop
      reload
      loadbalance
      hosts custom.hosts dummy.dummy.net {
          172.25.212.11 server1.dummy.dummy.net
          172.25.212.10 server2.dummy.dummy.net
          fallthrough
      }
  }
Could someone point me in the right direction for formatting? Host entries are configured in /etc/hosts. If I could point CoreDNS towards this that would be preferable
Thanks!
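In case it helps anyone answering: my understanding is that the hosts plugin takes a hosts-format file as its first argument, so a stanza like the sketch below could replace the inline entries. The path is a placeholder and the file has to exist inside the CoreDNS pod, e.g. shipped as another key in the coredns ConfigMap and mounted, or the node's /etc/hosts mounted into the CoreDNS deployment via hostPath:

hosts /etc/coredns/custom.hosts dummy.dummy.net {
    ttl 60
    fallthrough
}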
Hi. Is it by any means possible to set up a statefulset in such a way that some Pods have more RAM or cpu assigned to them?
Many thanks
So currently I'm interested in kubernetes and want to have experiences with it so i want to start building my homelab. But I wonder my case: I have a dell lattitude 6430 which has an i5 and 2 core with 16gb of ram, and a dell inspirion 3420(no screen, i made an external monitor using this LoL) which has an i3 2328M and 2 core with 6gb of ram. My main laptop is an thinkbook which has 8 cores and 32gb of ram. What do you suggestions that i can most take advantage of my homelab? (I'm newbies and know nothing, pls be nice (●'◡'●))
Hi, everyone. I’m having issues with my node pool’s disruption budgets. The goal is for it to block node scaling down during weekdays (Monday to Friday) between 11:00 AM and 11:00 PM UTC and only allow scaling down in the following scenarios:
Here’s the configuration I’m using, but it’s not working as expected:
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 1m
  budgets:
  - nodes: '0'
    reasons:
    - Underutilized
    schedule: '0 11 * * mon-fri' # Starts at 11:00 AM UTC, Monday to Friday
    duration: 12h # Duration is 12 hours (ends at 11:00 PM UTC)
  - nodes: '1'
    reasons:
    - Empty
    - Drifted
The scaling behavior doesn’t match the intended restrictions. What’s wrong with this configuration, and how can I fix it to achieve the desired functionality?
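One detail I'm unsure about and would like confirmed: as I read the docs, a budget that lists reasons only restricts those reasons, so the scheduled budget above only blocks Underutilized consolidation during the window. If the intent is to block every kind of scale-down on weekdays between 11:00 and 23:00 UTC, a sketch would be a scheduled budget with no reasons at all, which then applies to all disruption reasons:

budgets:
- nodes: '0'
  schedule: '0 11 * * mon-fri'
  duration: 12h
- nodes: '1'
  reasons:
  - Empty
  - Drifted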
Hello guys,
I created a kind cluster with this configuration file :
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  ...
  extraMounts:
  - hostPath: "/usr/local/share/ca-certificates/ca.domain.local.crt"
    containerPath: "/etc/containerd/certs.d/ca.crt"
    readOnly: true
...
The problem I have is that containers on this cluster encounter a TLS error when they try to pull images from a private GitLab registry.
The GitLab registry certificate is issued by a self-signed CA, and I haven't managed to get containerd v1.7.18 to use that CA certificate.
This is what I added in /etc/containerd/config.toml on the kind node before restarting containerd:
[plugins."io.containerd.grpc.v1.cri".registry.configs."gitlab.domain.local:5050".tls]
ca_file = "/etc/containerd/certs.d/ca.crt"
When I am inside the node container, I can pull an image if I use the --tlscacert option:
ctr --namespace=k8s.io image pull --tlscacert etc/containerd/certs.d/ca.crt --user argocd:** gitlab.domain.local:5050/....:tag
Without specifying --tlscacert, it fails with this error: "tls: failed to verify certificate: x509: certificate signed by unknown authority"
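In case it gives anyone a starting point, I also read that containerd 1.7 supports per-registry hosts.toml files selected via config_path, which kind can set through containerdConfigPatches; this is only a sketch of that approach, not something I have verified:

containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"

and then, inside the node, a /etc/containerd/certs.d/gitlab.domain.local:5050/hosts.toml (added with another extraMount) that reuses the ca.crt path already mounted above:

server = "https://gitlab.domain.local:5050"

[host."https://gitlab.domain.local:5050"]
  ca = "/etc/containerd/certs.d/ca.crt"

Also worth noting: ctr pulls don't go through the CRI registry settings, so a successful ctr --tlscacert pull doesn't exercise the same path the kubelet uses.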
Please help me
Marie
hi everyone,
I'm currently working on a project that involves building a self-managed multi-cloud cluster. My plan is to host the master node on AWS (or any cloud) and run the worker nodes on a GPU IaaS provider. The primary goal is to efficiently run ML workloads across this setup.
Here’s my current understanding and situation:
Cluster Architecture:
- I want to create a single unified cluster that spans multiple clouds, rather than setting up separate clusters for each cloud.
- I’m considering using Cilium as my CNI for networking, leveraging its Cluster Mesh feature for connectivity across clouds.
Workload Orchestration:
- I plan to have a management cluster that will run deployments with selectors for distributing ML workloads based on resource availability and other criteria.
Centralized Management:
- I need advice on how to set up a centralized management cluster that can effectively orchestrate workloads across these multiple environments.
My Questions:
Single Cluster vs Multi-Cluster: Is it feasible to have a single Kubernetes cluster spanning multiple clouds? What are the pros and cons? Or should I just have multiple clusters connected to the management cluster?
Centralized Management: What tools or strategies would you recommend for centralized management of this multi-cloud (and maybe multi-cluster) setup? Are there specific tools for workload orchestration that work well in such environments?
ETCD Management: How should ETCD be managed in this multi-cloud, multi-cluster context?
Best Practices: Any best practices or lessons learned from your own experiences in similar setups would be greatly appreciated!
Thank you for helping your bro out!
Has anyone been able to run K8s on RHEL with Calico? I've set up a minimal cluster with two workers and one master. When I expose any deployment via a Service, I'm unable to get a response from the exposed service IPs. I've read a few forums saying that Calico and firewalld don't work together on RHEL 9. Has anyone been able to get it to work?
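If it's useful context for answers: the workaround I've seen suggested (but haven't validated myself) is opening the ports Calico's requirements list names instead of disabling firewalld outright; which ones apply depends on the dataplane and encapsulation mode:

firewall-cmd --permanent --add-port=179/tcp      # BGP, if BGP peering is used
firewall-cmd --permanent --add-port=4789/udp     # VXLAN, if VXLAN encapsulation is used
firewall-cmd --permanent --add-port=5473/tcp     # Typha, if Typha is deployed
firewall-cmd --permanent --add-port=10250/tcp    # kubelet API
firewall-cmd --permanent --add-masquerade
firewall-cmd --reload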
Thank you to Diego Ciangottini, the Italian National Institute for Nuclear Physics, the InterLink project, and the Vega Supercomputer, all for doing the heavy lifting: getting HelixML GPU runners running on Kubernetes, bridged to Slurm HPC infrastructure, so that hundreds of thousands of GPUs running on Slurm can be transformed into multi-tenant GenAI systems.
Read about what we did and see the live demo here: https://blog.helix.ml/p/running-genai-on-supercomputers-bridging
Got something working? Figure something out? Make progress that you are excited about? Share here!
Hello,
I'm working on a project to enhance application security within a Kubernetes cluster. I'm attempting to create a cluster with one master node and two worker nodes on three separate VMs using Vagrant and VirtualBox.
However, I'm encountering an issue: all three VMs are assigned the same IP address, preventing them from communicating with each other.
I've provided my Vagrantfile and the output of the ifconfig command below:
Vagrantfile :
Vagrant.configure("2") do |config|
  #config.gui = true
  config.vm.boot_timeout = 60000 # 60 seconds timeout

  config.vm.define "k8s-master" do |master|
    master.vm.box = "ubuntu/bionic64"
    master.vm.network "private_network", type: "dhcp"
    #master.vm.provider "virtualbox"
    master.vm.provider "virtualbox" do |vb|
      vb.memory = "2048"
      vb.cpus = 2
    end
    master.vm.hostname = "k8s-master"
    master.vm.provision "shell", inline: "echo 'This is the Kubernetes Master Node'"
  end

  (1..2).each do |i|
    #config.gui = true
    config.vm.boot_timeout = 60000 # 60 seconds timeout
    config.vm.define "k8s-worker-#{i}" do |worker|
      worker.vm.box = "ubuntu/bionic64"
      worker.vm.network "private_network", ip: "10.0.2.#{i+15}"
      worker.vm.provider "virtualbox" do |vb|
        vb.memory = "2048"
        vb.cpus = 2
      end
      worker.vm.hostname = "k8s-worker-#{i}"
      worker.vm.provision "shell", inline: "echo 'This is Kubernetes Worker Node #{i}'"
    end
  end
end
ifconfig:
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255
enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.56.3 netmask 255.255.255.0 broadcast 192.168.56.255
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
I've tried several tutorials, but I'm still facing the same problem.
I'm working on Windows with VirtualBox and Vagrant.
Any help or suggestions would be greatly appreciated!
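One detail from the output above, mentioned in case it narrows things down: 10.0.2.15 on enp0s3 is the per-VM VirtualBox NAT address, which looks identical on every VM, while the private_network address is the one on enp0s8 (192.168.56.3 here). If that turns out to be the issue, a sketch of a fix would be to give every node a static private_network IP in the host-only range and use those addresses for the cluster, assuming the default 192.168.56.0/24 host-only network:

master.vm.network "private_network", ip: "192.168.56.10"
# and inside the worker loop:
worker.vm.network "private_network", ip: "192.168.56.#{i + 10}"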
I'm using https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack chart.
I created this values.yaml:
alertmanager:
  config:
    route:
      receiver: 'slack'
    receivers:
    - name: 'slack'
      slack_configs:
      - api_url:
          valueFrom:
            secretKeyRef:
              name: slack-secret
              key: webhook-notification
        channel: "#alert-channel"
        title: "Test"
        text: "Test"
But after installing it with
helm install my-prometheus prometheus-community/prometheus -f values.yaml
I can't see the new config. I can see the alert firing in the Alertmanager UI, but it never gets sent to the Slack channel. How should I configure this?
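For comparison, here is a sketch of what I understand the chart expects; I haven't confirmed it, and it assumes the values are installed with the kube-prometheus-stack chart itself (the helm command above installs prometheus-community/prometheus instead) and that the Alertmanager version supports api_url_file. The Operator mounts secrets listed under alertmanagerSpec.secrets at /etc/alertmanager/secrets/<name>/:

alertmanager:
  alertmanagerSpec:
    secrets:
    - slack-secret                 # mounted at /etc/alertmanager/secrets/slack-secret/
  config:
    route:
      receiver: 'slack'
    receivers:
    - name: 'slack'
      slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack-secret/webhook-notification
        channel: "#alert-channel"
        title: "Test"
        text: "Test"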
I've built a real-time transcription socket API for internal use, which uses the Deepgram SDK to do the actual work.
It works on my local system and gives me transcripts as intended, but when I host it on Kubernetes behind a Service with an Ingress and try to send the audio bytes, it receives them but they never seem to reach Deepgram, since I never get any transcript back.
I also tried working with Deepgram's own socket API, which does the same job, but I got a connection-closed error from Deepgram as soon as I tried to send audio bytes to my API.
I don't understand why this is happening; if anyone can provide any clues, I'll take it from there.
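One thing I still plan to check, mentioned here in case it's relevant to your answers: if the ingress controller is ingress-nginx, WebSocket connections are proxied but closed after the default 60s proxy timeouts, so long-lived streams usually need annotations along these lines (names, host, and values below are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: transcription-api              # placeholder name
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
  - host: transcribe.example.com       # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: transcription-svc    # placeholder Service name
            port:
              number: 8080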
I want to learn kubernetes by deploying a simple project. For cost reasons, it'll be limited to one node, so I can't use a firewall to close off or hide the master node. The most straightforward way to do this is to put the master node online on the same machine the worker node is on, but I'd rather not have my k8s controller online.
As I'm not doing any real load-balancing, can I keep the master node mostly offline (say, on my local computer)?
I switched recently to VictoriaMetrics and I was wondering if you could share your experiences how to properly set the metrics scrape intervals, for your monitored services. Thank you.
I'm banging my head against a wall with a network issue and hoping someone here can offer some insight. I've got an application that uses the Deepgram SDK for real-time transcription. It works flawlessly on my local machine, but when I deploy it as a pod in my Kubernetes cluster, it just… stops working.
Here's the breakdown:
I've tried a few things already:
nslookup
api.deepgram.com
resolves correctly. I can even curl -v
https://api.deepgram.com
, and get a 200 OK, which tells me basic HTTP traffic is going out.I'm clueless now about the issue. I'm attaching the code as well, if that makes sense.
Code: https://pastebin.com/WQkW7sYL
Thanks in advance for your help!
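The next check on my list, in case it helps frame answers, is whether the WebSocket upgrade itself makes it out of the pod, roughly like the command below (the pod name is a placeholder, curl is assumed to be available in the image, and the path is just my guess at what the SDK calls; without credentials I'd expect a 401 rather than a dropped connection if the network path is fine):

kubectl exec -it <pod-name> -- curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  https://api.deepgram.com/v1/listen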
I am looking at an edge application that would have 2 physical servers in a small cluster. I would like to leverage CP HA should one of the servers go down. The servers would both be worker nodes, and one would also run the CP at the same time. In order to get CP HA, 3 control-plane nodes are required. Am I able to run two CP nodes on one physical server and the third on the other? The only idea I have is to run VMs so they are virtually separate nodes on one physical server. I'd like to avoid the hypervisor if possible, however.
Another option is to make the Control node reachable across a WAN link to the site but that is likely troublesome.
Thanks
I am trying to implement blue/green deployment using Istio virtual services and destination rules. Seems like I can make it work when calling from outside the cluster, but when a pod calls another pod the requests are balanced on both blue and green instances. All the pods involved have the Envoy sidecar.
Istio ingress gateway has been installed with the default istioctl install.
The other relevant manifests are the following:
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: bg-gateway
  namespace: bluegreen-playground
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 8080
      name: http
      protocol: HTTP
    hosts:
    - bluegreen.myexample.com

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: bg-route
  namespace: bluegreen-playground
spec:
  gateways:
  - bg-gateway
  hosts:
  - bluegreen.myexample.com
  - core-svc.bluegreen-playground.svc.cluster.local
  - front-svc.bluegreen-playground.svc.cluster.local
  http:
  - name: core
    match:
    - uri:
        prefix: "/core"
    route:
    - destination:
        host: core-svc.bluegreen-playground.svc.cluster.local
        subset: blue
  - name: front
    match:
    - uri:
        prefix: "/"
    route:
    - destination:
        host: front-svc.bluegreen-playground.svc.cluster.local
        subset: blue

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: core-destination
  namespace: bluegreen-playground
spec:
  host: core-svc.bluegreen-playground.svc.cluster.local
  subsets:
  - name: blue
    labels:
      version: v1
  - name: green
    labels:
      version: v2

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: front-destination
  namespace: bluegreen-playground
spec:
  host: front-svc.bluegreen-playground.svc.cluster.local
  subsets:
  - name: blue
    labels:
      version: v1
  - name: green
    labels:
      version: v2
All the pods are correctly labeled with version: v1 or version: v2. The microservices call each other by using the standard Kubernetes service FQDN (e.g. front-svc.bluegreen-playground.svc.cluster.local).
Am I missing anything?
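One suspicion I have, and I'd appreciate confirmation: since the VirtualService lists gateways explicitly, my understanding is that it no longer applies to sidecar-to-sidecar traffic unless the reserved gateway name mesh is also listed, so in-mesh calls fall back to plain service load balancing across both versions. The relevant fragment of bg-route would then look like this:

spec:
  gateways:
  - mesh          # reserved name: apply these routes to in-mesh (sidecar) traffic too
  - bg-gateway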
Generating the load required for load tests cannot realistically be done from your local machine, considering the need for high and consistent compute and network throughput over prolonged periods of time. We show how Kubernetes can be used as the foundation for a scalable load-testing strategy, using CNCF tools to orchestrate, execute, and analyze the results.
https://testkube.io/learn/using-kubernetes-to-scale-your-load-testing
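Stripped of any particular tool, the underlying pattern is a parallel Job fanning out load generators; a rough sketch (the name, parallelism, script, and ConfigMap below are placeholders, and the test script is assumed to be mounted from a ConfigMap):

apiVersion: batch/v1
kind: Job
metadata:
  name: load-test              # placeholder name
spec:
  parallelism: 10              # number of concurrent load-generator pods
  completions: 10
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: k6
        image: grafana/k6:latest
        args: ["run", "/scripts/test.js"]   # script mounted below
        volumeMounts:
        - name: scripts
          mountPath: /scripts
      volumes:
      - name: scripts
        configMap:
          name: k6-scripts     # placeholder ConfigMap holding test.js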
My team wants to migrate from GH Actions to Argo CD. I just have one problem: can I configure manual approval gates before my changes get deployed to staging/prod? The auto-sync is great for dev, but before anything gets deployed to prod our stakeholders want to manually approve the release.
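The simplest built-in gate I'm aware of is to leave automated sync off for the staging/prod Applications, so someone has to press Sync (or run argocd app sync) to promote a change; a minimal sketch with placeholder repo, path, and namespace values:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-prod                  # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/deploy-config.git   # placeholder repo
    targetRevision: main
    path: overlays/prod                                      # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp                                         # placeholder namespace
  syncPolicy: {}    # no `automated` block: changes are only applied on a manual Sync

For something closer to a formal approval step, the gate usually moves to the Git side instead, e.g. promotion via a pull request into the prod overlay, optionally combined with Argo CD sync windows.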