/r/kubernetes
Kubernetes discussion, news, support, and link sharing.
Red Hat's container registry has been unavailable for many hours. Since our images rely on the Red Hat Universal Base Image (UBI), our users are experiencing issues with installing or upgrading our tool. I’m wondering if there are ways to ensure fault tolerance in this scenario. To be honest, I hadn’t considered this type of risk before… How do you handle situations like this? Any suggestions?
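Not something from the thread, but one common mitigation is to keep a mirror of the UBI base image in a registry you control and build/pull from that mirror, so an upstream outage only blocks refreshing the mirror rather than blocking installs and upgrades. A rough sketch using skopeo in a CronJob (the schedule, destination registry, and omitted push credentials are all assumptions):
```
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ubi-mirror                 # hypothetical name
  namespace: build-infra           # hypothetical namespace
spec:
  schedule: "0 3 * * *"            # refresh the mirror nightly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mirror-ubi
              image: quay.io/skopeo/stable:latest
              command: ["skopeo"]
              args:
                - copy
                - --all                                                     # copy all architectures in the manifest list
                - docker://registry.access.redhat.com/ubi9/ubi:latest
                - docker://registry.example.com/mirror/ubi9/ubi:latest      # your own registry (credentials omitted here)
```
Your builds would then reference the mirrored image instead of registry.access.redhat.com directly.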
Hi all,
This is a test cluster, so while testing I decided to run `kubectl delete pv --all`. The result is below.
Since the PVCs are intact, there is no data loss and the PVs are just stuck in the Terminating state.
How do I bring these PVs back to the Bound state as before?
Edit: the tool suggested in the comments works. Get the tool and run it from the path shown below.
root@a-master1:/etc/kubernetes/pki/etcd$ ./resetpv-linux-x86-64 --etcd-cert server.crt --etcd-key server.key --etcd-host <IPADDRESSOFETCDPOD> pvc-XX
Did you pass a cert? Congratulations, tell us about it!
Did you bomb a cert exam and want help? This is the thread for you.
Do you just hate the process? Complain here.
(Note: other certification related posts will be removed)
This monthly post can be used to share Kubernetes-related job openings within your company. Please include:
If you are interested in a job, please contact the poster directly.
Common reasons for comment removal:
Hi,
I have a Talos cluster for learning. I was wondering: what's the best way to deploy Kubernetes manifests to it, and why?
ArgoCD/Codefresh looks good, I like GitOps.
Should I combine this with Crossplane and if so, why?
Thanks!
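For a learning cluster, plain Argo CD Applications pointed at a git repository of manifests (or Helm/Kustomize) is usually plenty; Crossplane is aimed more at provisioning external/cloud infrastructure than at deploying manifests, so it's optional here. A minimal sketch of an Application, with placeholder repo URL and paths:
```
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: homelab-apps                        # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab-gitops.git   # placeholder repo
    targetRevision: main
    path: clusters/talos                    # placeholder path to the manifests
  destination:
    server: https://kubernetes.default.svc  # the cluster Argo CD itself runs in
    namespace: default
  syncPolicy:
    automated:
      prune: true        # delete resources removed from git
      selfHeal: true     # revert manual drift
    syncOptions:
      - CreateNamespace=true
```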
Hi, I am stuck... maybe one of you can help me.
I have a StatefulSet that I want to manage with a KEDA ScaledObject.
I want it to be scaled to 0 if a Prometheus value is 0 for at least 5 minutes.
I got this working already without issues.
But my problem now is that I want to be able to manually scale the StatefulSet to 1. KEDA should not scale it down to 0 in the first 5 minutes after it has been up.
Does anyone know how I can do this?
Right now, when I scale up the StatefulSet, KEDA says that the activation target is not met and scales it down again immediately...
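One direction that's often suggested (treat as a sketch to verify): `cooldownPeriod` only delays scale-to-zero after the trigger has been active, so for a purely manual scale-up the `autoscaling.keda.sh/paused-replicas` annotation on the ScaledObject is the usual escape hatch; set it to pin the replica count while you work, then remove it to hand control back to KEDA. Names and the Prometheus address below are assumptions:
```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-statefulset-scaler          # hypothetical name
  # annotations:
  #   autoscaling.keda.sh/paused-replicas: "1"   # uncomment to pin replicas during manual work
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-statefulset               # hypothetical target
  minReplicaCount: 0
  maxReplicaCount: 1
  cooldownPeriod: 300                  # wait 5 minutes after the trigger was last active before scaling to 0
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus address
        query: sum(my_metric)                                  # hypothetical query
        threshold: "1"
        activationThreshold: "0"
```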
Longhorn documentation states that writes are performed synchronously to replicas. I understand that to mean multiple replicas will hurt write performance, as all replicas theoretically must acknowledge the write before Longhorn considers the operation successful. However, is it really the case that multiple replicas impact write performance, or are writes performed against one volume and then replicated by the engine to the rest? I assume the former, not the latter; just seeking clarification.
KRO (pronounced “crow”), or Kube Resource Orchestrator, is an open-source tool built in collaboration between Google Cloud, AWS, and Azure.
It's a new open-source project that simplifies Kubernetes deployments: it allows you to group applications and their dependencies as a single, easily consumable resource, and it's compatible with ACK, ASO, and KCC (AWS Controllers for Kubernetes, Azure Service Operator, and GCP's Config Connector).
GitHub - https://github.com/kro-run/kro
Google Cloud - https://cloud.google.com/blog/products/containers-kubernetes/introducing-kube-resource-orchestrator…
AWS - https://aws.amazon.com/blogs/opensource/kube-resource-orchestrator-from-experiment-to-community-project/…
Azure - https://azure.github.io/AKS/2025/01/30/kube-resource-orchestrator…
Hi sub!
I have a service which I need to expose inside my cluster with TLS. I have cert-manager installed and a self-signed CA available as a ClusterIssuer.
I’m deploying my service with Kustomize to several environments (dev, staging, prod). Basically, what I’d like to do is configure Kustomize so that I don’t have to patch the `dnsNames` of the cert-manager Certificate object in each overlay.
Plus, currently I have to hardcode the namespace name, which is not very modular…
Here is the tree view:
.
├── base
│ ├── deployment.yaml
│ ├── certificate.yaml
│ ├── kustomization.yaml
│ └── service.yaml
└── overlays
├── production
│ ├── certificate.patch.yaml
│ └── kustomization.yaml
└── staging
├── certificate.patch.yaml
└── kustomization.yaml
5 directories, 8 files
And the relevant file contents:
base/kustomization.yaml
```
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - certificate.yaml
  - service.yaml
```
base/certificate.yaml
```
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-tls
  annotations:
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: internal-tls
  issuerRef:
    name: my-internal-ca
    kind: ClusterIssuer
  isCA: false
  dnsNames:
    - localhost
    - myapp.myapp-dev
    - myapp.myapp-dev.svc
    - myapp.myapp-dev.svc.cluster.local
  usages:
    - server auth
    - client auth
```
staging/kustomization.yaml
```
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: myapp-staging
resources:
  - ../../base
patches:
  - path: certificate.patch.yaml
    target:
      kind: Certificate
      name: internal-tls
```
staging/certificate.patch.yaml
```
- op: replace
  path: /spec/dnsNames/1
  value: myapp.myapp-staging
- op: replace
  path: /spec/dnsNames/2
  value: myapp.myapp-staging.svc
- op: replace
  path: /spec/dnsNames/3
  value: myapp.myapp-staging.svc.cluster.local
```
I looked at the `replacements` stanza, but it doesn’t seem to match my needs, since I would have to perform something like string interpolation from the Service `metadata.name`…
Of course, the current setup is working fine, but if I want to change the namespace name I will have to update it both in kustomization.yaml and certificate.patch.yaml. The same goes for the service name: if I want to change it I will have to update it both in service.yaml and certificate.patch.yaml.
Am I right in assuming that what I want to do is not possible at all with Kustomize? Or am I missing something?
Thanks!
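On the replacements question above, one possible direction, sketched and untested: Kustomize `replacements` can rewrite just one dot-separated segment of a string via `options.delimiter` and `options.index`, which gets close to the string interpolation described. Because the overlay's `namespace:` transformer runs before replacements within the same kustomization, the block would live in each overlay's kustomization.yaml (or a shared Component) rather than in the base; the Service name `myapp` is an assumption:
```
# staging/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: myapp-staging
resources:
  - ../../base
replacements:
  - source:
      kind: Service
      name: myapp                    # assumed Service name from base/service.yaml
      fieldPath: metadata.namespace
    targets:
      - select:
          kind: Certificate
          name: internal-tls
        fieldPaths:
          - spec.dnsNames.1
          - spec.dnsNames.2
          - spec.dnsNames.3
        options:
          delimiter: "."
          index: 1                   # replace only the namespace segment, e.g. myapp.<namespace>.svc
```
The same pattern with `fieldPath: metadata.name` and `index: 0` would cover the service-name segment, which might remove the need for certificate.patch.yaml entirely.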
Hey everyone,
We have a Kubernetes cluster running on Rancher with 3 master nodes and 4 worker nodes. Occasionally, one of our worker nodes crashes due to high memory usage (RAM gets full). When this happens, the node goes into a "NotReady" state, and we have to manually restart it to bring it back.
My questions:
Thanks in advance! 🚀
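The usual first steps for RAM-exhausted nodes flapping to NotReady are to give workloads realistic memory requests/limits and to give the kubelet enough reserved headroom that it evicts pods before the node itself starves. A sketch of the relevant kubelet settings (the values are assumptions to tune, and how you pass them depends on how Rancher provisions your nodes, e.g. kubelet extra args):
```
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Keep headroom for the OS and kubelet, and evict pods before the node runs out of RAM.
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
evictionHard:
  memory.available: 500Mi
evictionSoft:
  memory.available: 1Gi
evictionSoftGracePeriod:
  memory.available: 1m30s
```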
Hi all, I was wondering what everyone is doing when it comes to persisting data in PV's in cases where you fully need to redeploy a cluster.
In our current setup, we have a combination of Terraform and Ansible that can automatically build and rebuild all our clusters, with ArgoCD and a Bootstrap yaml included in our management cluster. Then ArgoCD takes over and provisions everything else that runs in the clusters using the AppofApps pattern and Application Sets. This works very nicely and gives us the capability to very quickly recover from any kind of disaster scenario; our datacenters could burn down and we'd be back up and running the moment the Infra team gets the network back up.
The one thing that annoys me is how we handle Persistent Volumes and Persistent Volume Claims. Our Infra team maintains a Dell PowerScale (Isilon) storage cluster that we can use to provision storage. We've integrated that with our clusters using the official Dell CSI drivers (https://github.com/dell/csi-powerscale), and it mostly works: you make a Persistent Volume Claim with the PowerScale Storage Class, and the CSI driver automatically creates a Persistent Volume and the underlying storage in the backend. But if you include that PVC in your application deployment and need to redeploy the app for any reason (like disaster recovery), it'll just make a new PV and provision new storage in PowerScale instead of binding to the existing one.
The way we've "solved" it now is by creating the initial PVC manually and setting the reclaimPolicy in the Storage Class to Retain. Every time we want to onboard a new application that needs persistent storage, one of our admins goes into the cluster, creates a PVC with the PowerScale Storage Class, and waits for the CSI driver to create the PV and the associated backend filesystem. Then we copy everything in the PV spec into a PV YAML that gets deployed by ArgoCD, and we immediately delete the manually created PVC and PV; the volume keeps existing in the backend thanks to our Storage Class. ArgoCD then deploys the PV with the existing spec, which allows it to bind to the existing storage in the backend, so if we fully redeploy the cluster from scratch, all of the data in those PVs persists without us needing to do data migrations. The PVC deployment of the app is then made without a Storage Class parameter, but with the name of the pre-configured PV.
It works, but it does bring some manual work with it. Are we looking at this backwards, and is there a better way to do this? I'm curious how others are handling this.
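For reference, a sketch of the pre-created PV / pre-bound PVC pattern described above, with hypothetical names and sizes; the driver name is an assumption and the volumeHandle should be copied from a PV the csi-powerscale driver actually provisioned:
```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: myapp-data                              # hypothetical
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain         # keep the backend volume even if the PV object goes away
  storageClassName: ""                          # keep dynamic provisioning out of the picture
  csi:
    driver: csi-isilon.dellemc.com              # assumed driver name; copy from an existing PV
    volumeHandle: "<copied from the originally provisioned PV>"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""                          # empty string disables dynamic provisioning
  volumeName: myapp-data                        # bind explicitly to the pre-created PV
  resources:
    requests:
      storage: 50Gi
```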
Got something working? Figure something out? Make progress that you are excited about? Share here!
I am using a Kind cluster on my home computer.
I have TLS set up on my ingress controller for a specific backend. I also have redirects from HTTP to HTTPS.
The HTTP/HTTPS ports are also exposed as NodePorts.
If I go to <nodeIP>:<NodePort> for either HTTP or HTTPS, my ingress controller works fine and takes me to my service.
But what I want is to not have to enter the NodePort every time.
My idea was to put an Nginx reverse proxy on my computer and forward requests on ports 80/443 to the respective NodePorts.
However, I can't seem to get it to work; it seems to have issues with the TLS termination.
On Cloudflare, if I set up my domain to point at my node IP and then enter <domain name>:<HTTPS NodePort>, it takes me to my service.
But if I point Cloudflare to my Nginx, which is forwarding requests on to my ingress controller, it tells me that there were TLS issues.
My nginx configuration:
virtualHosts."my-domain.com" = {
# Listen on port 80 (HTTP) and 443 (HTTPS)
listen = [
{
addr = "my-ip";
port = 80;
}
{
addr = "my-ip";
port = 443;
}
];
# Forward requests to the Kubernetes Ingress Controller NodePort over HTTP
locations."/" = {
proxyPass = "http://172.20.0.6:31413"; # Forward to the Ingress Controller NodePort
proxyWebsockets = true; # Enable WebSocket support if needed
extraConfig = ''
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
'';
172.20.0.6:31413 is the node IP and the NodePort for HTTPS (443).
For a self-hosted AI application deployed on customer Kubernetes clusters, what robust methods exist to protect my code from reverse engineering or unauthorized copying? I'm particularly interested in solutions beyond simple obfuscation, considering the customer has root access to their environment. Are there techniques like code sealing, homomorphic encryption (if applicable), or specialized container runtime security measures that are practical in this scenario? What are the performance implications of these approaches?
This is a tool I spent around 1.5 years building, so any suggestions would be helpful. Thanks.
Hey y'all. I'm trying to put together a solid monitoring system for our Kubernetes cluster for the long term, and I'm trying to figure out if I'm making a mistake and need to back up.
For setting up Prometheus, the common answer seemed pretty clear: "just use the kube-prometheus stack with Helm". My issue with that at first was that it seemed like way overkill for my specific use case. We already have an external Grafana instance, so there's no reason to install that, and the same goes for Alertmanager; we alert through Grafana -> PagerDuty.
With that in mind, I got through the vast majority of just setting things up with vanilla Prometheus, configured the scrape jobs myself, etc. I got it working so that I'm actually using the kube-prometheus dashboards in my own Grafana instance, just not with the stack.
Now that I'm looking at it again, though, I'm realizing I can just configure the kube-prometheus stack to not install most of the components I don't need, and the Prometheus Operator can automatically handle most of the scrape jobs I wrote myself.
Basically my question is: am I going to regret using vanilla Prometheus instead of the kube-prometheus stack? Are there any benefits to NOT using the full stack and just trimming it to what I need?
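If you do go back to the chart, trimming it down is just a few values; a minimal sketch of a values file for prometheus-community/kube-prometheus-stack (the retention figure is an arbitrary example):
```
grafana:
  enabled: false          # external Grafana already exists
alertmanager:
  enabled: false          # alerting goes Grafana -> PagerDuty instead
prometheus:
  prometheusSpec:
    retention: 15d        # arbitrary example, adjust to taste
```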
Hi all,
We are installing Argo CD using Helm and at some point we get the error below. This is a new AKS cluster. Been troubleshooting for a while; any pointers appreciated.
Objects listed" error:Get "https://172.xx.xx.xx:443/api/v1/namespaces/argocd/secrets?limit=500&resourceVersion=0": EOF 10086ms
My thought was that it's HTTPS/TLS-related, given the IP. Not sure why it's an IP and not a hostname.
Thanks.
Hey K8s folks! I've been working with EKS for a while now, and something that keeps coming up is how tricky the initial cluster setup can be. A few friends and I started building a tool to help make this easier, but before we go further, we really want to understand everyone else's experience with it.
I'd love to hear your EKS stories - whether you're working solo, part of a team, or just tinkering with it. Doesn't matter if you're a developer, DevOps engineer, or any other technical role. What was your experience like? What made you bang your head against the wall? What worked well?
If you're up for a casual chat about your EKS journey (the good, the bad, and the ugly), I'd be super grateful. Happy to share what we've learned so far and get you early access to what we're building in return. Thanks for reading!
Hey all, I'm running into a weird issue. After upgrading to EKS 1.32 (Doing incremental upgrades between control plane and nodes), I am experiencing a lot of weird networking issues.
I can only intermittently resolve google.com, and when I do, the traceroute doesn't make it past the first hop.
```
traceroute to google.com (142.251.179.139), 30 hops max, 60 byte packets
1 10.10.81.114 (10.10.81.114) 0.408 ms 0.368 ms 0.336 ms
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
```
EKS add-ons are up to date. No other changes were made. Doing things like `apt update` or anything else network-related either times out or takes a significantly long time.
I have been stuck at this for hours, so any help is really appreciated.
My cluster is currently running RKE2, with Multus + Cilium as the CNI.
The goal is to add a secondary macvlan network interface to some pods to give them a persistent, directly routable IP address assigned by the main network's DHCP server, a.k.a. my normal router.
I got it mostly working: each pod successfully requests an IP via the rke2-multus-dhcp pods from the main router, all the routing works, I can ping the pods directly from my PC, and they show up under DHCP leases in my router.
The only issue: each time a pod is restarted, a new MAC address is used for the DHCP request, resulting in a new IP address assigned by the router and making it impossible to give that pod / MAC address a static IP / DHCP reservation in the router.
I prefer to do all the IP address assignment in one central place (my router), so I usually set all devices to DHCP and then do the static leases in OPNsense.
Changing the type from dhcp to static and hardcoding the IPs / subnet info into each pod's config would get them a persistent IP, but that will get very hard to track / avoid duplicates, so I really want to avoid it.
Is there any way to define a "static" MAC address to be used for the DHCP request in the pod / deployment configuration, so it will get the same IP assigned by my router every time?
My current Multus NetworkAttachmentDefinition:
```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: # string
  annotations: {}
    # key: string
  labels: {}
    # key: string
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "enp6s18",
      "mode": "bridge",
      "ipam": {
        "type": "dhcp"
      }
    }
```
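Not from the post, and worth verifying on an RKE2/Multus/Cilium setup: Multus supports pinning the MAC of the extra interface via the network-selection annotation, provided the attachment's CNI config is a plugin chain that includes the `tuning` plugin with the `mac` capability (newer macvlan builds may also accept a `mac` field directly). Whether the DHCP request actually goes out with the pinned MAC depends on plugin ordering, so treat this as a sketch with placeholder names and MAC:
```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-dhcp
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-dhcp",
      "plugins": [
        {
          "type": "macvlan",
          "master": "enp6s18",
          "mode": "bridge",
          "ipam": { "type": "dhcp" }
        },
        {
          "type": "tuning",
          "capabilities": { "mac": true }
        }
      ]
    }
---
# Pod (or pod template in a Deployment/StatefulSet) pinning the MAC used on that attachment
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [{ "name": "macvlan-dhcp", "mac": "c2:b0:57:49:47:f1" }]
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
```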
Hi guys,
For every kubectl command I'm trying to run, I'm getting:
zsh: killed kubectl cluster-info
Looking online, people are suggesting a number of reasons: not enough memory, architecture-related issues (since I'm on an ARM chip, but I have Rosetta enabled), etc.
What could be the issue?
Edit: I just found out Docker Desktop also can't open. Must be an architecture issue.
Thanks
A recent project required me to quickly get to grips with Kubernetes, and the first thing I realised was just how much I don’t know! (Disclaimer: I’m a data scientist, but keen to learn.)
The most notable challenge was understanding the distributed nature of containers and resource allocation - unfortunately, paired with the temperamental attitude my pods have towards falling over all the time.
My biggest problem was how long it took to identify why a service wasn’t working and then get it back up again. Sometimes, a pod would simply need more CPU - but how would I know that if it had never happened before?! Usually, this is time sensitive work, and things need to be back in service ASAP.
Anyway, I got bored (and stressed) having to remember all the kubectl commands to check logs, take action, and ensure things were healthy every morning. So, I built a tool that brings all the relevant information to me and tells me exactly what I need to do.
Under the hood, I have a bunch of pipelines that run various kubectl commands to gather logs and system data. It then filters out only the important bits (i.e. issues in my Kubernetes system) and sends them to me on demand.
As the requirements kept changing - and for fun (I’m a data scientist, don’t forget!) - I wrapped GPT-4o around it to make it more user friendly and dynamic based on what I want to know.
So, my question is - would anyone be interested in also keeping their pods up? Do you even have this problem, or am I special?
I’d love to open source it and get contributions from others. It’s still a bit rough, but it does a really good job keeping me and my pods happy :)
Hi,
There’s a use case where I need to copy a huge amount of data from an IBM COS bucket or Amazon S3 bucket to an internal PVC which is mounted on an init container.
Once the contents are copied onto the PVC, we mount that PVC onto a different runtime container for further use, but right now I’m wondering if there are any open-source, MIT-licensed applications that could help me achieve that?
I’m currently running a Python script in the init container which copies the contents using a regular cp command, with parallel copy enabled.
Any help would be much appreciated.
Thanks
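One MIT-licensed option that fits this shape is rclone, which speaks both Amazon S3 and IBM COS (S3-compatible) and parallelises transfers out of the box. A sketch of an init container using it; the bucket, secret, and PVC names are placeholders:
```
apiVersion: v1
kind: Pod
metadata:
  name: data-loader
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-data-pvc
  initContainers:
    - name: fetch-dataset
      image: rclone/rclone:latest
      args:
        - copy
        - src:my-bucket/dataset        # the "src" remote is defined entirely via the env vars below
        - /data
        - --transfers=16               # parallel transfers
        - --checkers=16
      env:
        - name: RCLONE_CONFIG_SRC_TYPE
          value: s3
        - name: RCLONE_CONFIG_SRC_PROVIDER
          value: AWS                   # use IBMCOS for IBM Cloud Object Storage
        - name: RCLONE_CONFIG_SRC_ENDPOINT
          value: https://s3.amazonaws.com
        - name: RCLONE_CONFIG_SRC_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef: {name: s3-creds, key: access-key}
        - name: RCLONE_CONFIG_SRC_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef: {name: s3-creds, key: secret-key}
      volumeMounts:
        - name: data
          mountPath: /data
  containers:
    - name: runtime
      image: busybox                   # stand-in for the real runtime container
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
```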
I believe my project evolved like this: we originally had an on-prem Jenkins server where the jobs were scheduled to run overnight using the cron-like capability of Jenkins. We then migrated to an OpenShift cluster, but we kept the Jenkins scheduling. On Jenkins we have a script that kicks off the OpenShift job, monitors execution, and gathers the logs at the end.
Jenkins doesn't have any idea what load OpenShift is under, so sometimes jobs fail because we're out of resources. We'd like to move to a strategy where OpenShift runs at full capacity until the work is done.
I can't believe that we're using these tools correctly. What's the usual way to run all of the jobs at full cluster utilization until they're done, collect the logs, and display success/failure?
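The usual shape for this is to let the cluster itself do the queueing: give each Job honest resource requests so pods simply sit Pending until capacity frees up instead of failing, schedule them with CronJobs (or another in-cluster trigger) rather than Jenkins cron, and collect logs from the Job afterwards; tools like Kueue add proper batch queueing on top if you need it. A sketch of a single Job, with a placeholder image and arbitrary sizes:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                 # hypothetical job name
spec:
  backoffLimit: 3                      # retry a few times instead of giving up immediately
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/batch/nightly-report:latest   # placeholder image
          resources:
            requests:
              cpu: "2"                 # honest requests make unschedulable pods wait (Pending) rather than fail
              memory: 4Gi
            limits:
              memory: 6Gi
```
Logs can then be collected with `kubectl logs job/nightly-report` (or `oc logs` on OpenShift), and success/failure read from the Job status instead of scraping a Jenkins console.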