/r/devops
/r/DevOps is a subreddit dedicated to the DevOps movement where we discuss upcoming technologies, meetups, conferences and everything that brings us together to build the future of IT systems
All articles will require a short submission statement of 3-5 sentences.
Use the article title as the submission title. Do not editorialize the title or add your own commentary to the article title.
No vendor spam. Buy an ad from reddit instead.
I have a Flask REST API that handles POST requests and responds with image HTML templates. What would be a beginner-friendly way to deploy it for light use, like showing it off to some friends or to a potential employer who comes across my portfolio? Nothing fancy, no frills, just serve the HTML on request. I should mention that I'm running Windows 10 Enterprise and I don't think I can port the whole thing to Linux: setting up all the dependencies took days and was quite a headache. I have been trying to work with WSL lately, albeit unsuccessfully ...
I'm about to graduate from an engineering school in the French system. I'm also in an end-of-studies internship (mostly automation tasks and CI/CD pipelines: Ansible, GitLab, ...). My internship ends in mid-July, so I think this is the time to start looking for a job, but I'm confused about how to build a portfolio to apply for jobs on LinkedIn. I'm mainly going to apply for jobs in Germany, Austria and France. These are the tools I know: Docker, K8s, Ansible, Jenkins, GitLab, and a little bit of Terraform. As for cloud providers, I only know how to work with AWS, as most of my tests and practice were on it.
Any advice would be appreciated!
There really is too much to know these days, what areas are there to specialize in?
My thoughts:
Kubernetes - I can see why some engineers love it. An awesome paradigm at the base layer and so much interesting built on top of it.
Observability - almost a science in itself and plenty to get into (or related to) be it monitoring, alerting, analytics, service management.
Platform management - building out a consumable platform, kinda like being a developer for developers.
Architect - the problem I have with this is that developers are going to have their own software architects doing system design that may already overlap with the infra side. Also, many expect engineers to have software architect skills anyway. So where does that leave the cloud/DevOps architect? I feel there is not much mileage in this path.
Any others? With each year that passes, I'm more convinced that it is not a good idea to stay in the middle as a generalist and that it's time to pick a path.
At Doppler, we use comment prefixes when we review code. We've found that this helps streamline our review process and improve our team's communication. Does your team do something similar? If so, how do you do it differently (or why not)?
I wrote a short blog on what exactly we do: https://www.doppler.com/blog/code-review-comment-prefixes-for-clearer-feedback
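For readers who have not seen the idea before, here is a minimal sketch of how prefixed review comments could be parsed and triaged. The prefix names (`nit`, `question`, `suggestion`, `blocker`) and the rule that only `blocker` holds up a merge are assumptions made for illustration; they are not taken from the Doppler post.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical prefix set for illustration; a real team would agree on its own.
KNOWN_PREFIXES = {"nit", "question", "suggestion", "blocker"}


class ReviewComment(NamedTuple):
    prefix: Optional[str]  # None when the comment has no recognized prefix
    body: str


# A leading word followed by a colon, e.g. "nit: rename this".
_PREFIX_RE = re.compile(r"^\s*([A-Za-z]+)\s*:\s*(.*)$", re.DOTALL)


def parse_comment(text: str) -> ReviewComment:
    """Split 'prefix: body'; unrecognized prefixes stay part of the body."""
    match = _PREFIX_RE.match(text)
    if match and match.group(1).lower() in KNOWN_PREFIXES:
        return ReviewComment(match.group(1).lower(), match.group(2).strip())
    return ReviewComment(None, text.strip())


def is_blocking(comment: ReviewComment) -> bool:
    """In this sketch only 'blocker' comments must be resolved before merge."""
    return comment.prefix == "blocker"
```

With a convention like this, the author can tell at a glance which comments need action before merging and which are optional.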
Hello colleagues,
Does anyone have experience with migrating Alertmanager alerts to Webex Teams? We are currently in transition from Slack to Webex (don't ask me why) and we are migrating all of the Slack alerts/notifications to Webex. This is the relevant part of the current Alertmanager configuration:
....
receivers:
  - name: default
  - name: alerts_webex
    webex_configs:
      - api_url: 'https://webexapis.com/v1/messages'
        room_id: '..............'
        send_resolved: false
        http_config:
          proxy_url: ..............
          authorization:
            type: 'Bearer'
            credentials: '..............'
        message: |-
          {{ if .Alerts }}
          {{ range .Alerts }}
          "**[{{ .Status | upper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Event Notification**\n\n**Severity:** {{ .Labels.severity }}\n**Alert:** {{ .Annotations.summary }}\n**Message:** {{ .Annotations.message }}\n**Graph:** [Graph URL]({{ .GeneratorURL }})\n**Dashboard:** [Dashboard URL]({{ .Annotations.dashboardurl }})\n**Details:**\n{{ range .Labels.SortedPairs }} • **{{ .Name }}:** {{ .Value }}\n{{ end }}"
          {{ end }}
          {{ end }}
....
The bad part is that we receive a 400 error from Alertmanager:
msg="Notify for alerts failed" num_alerts=2 err="alerts_webex/webex[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 400: {\"message\":\"One of the following must be non-empty: text, file, or meetingId\",\"errors\":[{\"description\":\"One of the following must be non-empty: text, file, or meetingId\"}],\"trackingId\":\"ROUTERGW_......\"}"
The connection works, since simple messages are sent, but these "real" messages are dropped. We also thought about using webhook_configs, but the payload can't be modified (without a proxy in the middle).
Does anyone have experience with this issue? Thanks
I have just 10 months of experience as a DevOps Engineer. I know that's not much, but I'm enthusiastic and passionate about Linux and the DevOps world. I've been using Linux for more than 5 years and I'm quite good with it.
Not only that: in my short professional career (10 months), I've delivered a production-grade OpenStack setup (private cloud) for my current company, created an Ansible project for their complex clustered environment setup, automated AWS infra using Terraform, improved their existing Jenkins pipeline and Dockerfile, and done plenty of other work.
Many times, I've quickly figured out and solved production issues that were affecting their business and that their developers and team leads, with 10+ more years of experience, were not able to find. I'm not exaggerating, just telling the truth.
Their entire DevOps dependency is on me, and I've delivered more than they expected. But the appraisal I got was not even peanuts. Even peanuts are costly.
I really need a DevOps, Cloud, or Linux related job. I have just 10 months of experience, but please judge me on my skills. Preferred location is Delhi NCR, but I'm open to all of India.
Thanks for reading.
I wanted to share an open-source project I’ve been working on called k8sAI. It’s a personal AI Kubernetes expert that can answer questions about your cluster, suggest commands, and even execute relevant kubectl commands to help diagnose problems and suggest fixes for your cluster, all in the CLI!
As a relative newcomer to k8s, I've found this tool has really streamlined my workflow. I can ask questions about my cluster, k8sAI will run kubectl commands to gather info, and then answer those questions. It’s also found several issues in my cluster for me; all I’ve had to do is point it in the right direction. I’ve really enjoyed making and using this, so I thought it could be useful for others. An added bonus is that you don’t need to copy and paste into ChatGPT anymore!
k8sAI operates with read-only kubectl commands to make sure your cluster stays safe.
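To illustrate what a read-only guard can look like, here is a minimal sketch of an allowlist check on kubectl commands. This is not k8sAI's actual implementation: the verb list and the function name `is_read_only` are assumptions, and the check is deliberately conservative (it rejects anything containing shell metacharacters and anything where the verb is not the first argument).

```python
import shlex

# Verbs assumed to be read-only for this sketch; the real tool's list may differ.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "version", "api-resources"}

# Characters that could chain or redirect commands if run through a shell.
SHELL_METACHARACTERS = set(";|&$`><\n")


def is_read_only(command: str) -> bool:
    """Return True only for 'kubectl <read-only verb> ...' commands."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    # Conservative: the verb must come directly after 'kubectl'.
    return tokens[1] in READ_ONLY_VERBS
```

A caller would run only commands for which `is_read_only` returns True, ideally as an argument list rather than through a shell.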
All you need is an OpenAI API key and a valid kubectl config. Start chatting with k8sAI using:
$ pip install k8sAI
$ k8sAI chat
or to fix an issue:
$ k8sAI fix -p="take a look at the failing pod in the test namespace"
Would love to get any feedback you guys have!
Here's the repo for anyone who wants to take a look
As suggested by someone from this community, I've moved the pipelines on Automate All The Things from Azure Devops to GitHub Actions.
I didn’t know GitHub Actions was free when the repo is public. This makes it SOO much easier to get started with the project, so thank you stranger who suggested this!
https://github.com/tferrari92/automate-all-the-things
The Azure DevOps version is still there in its own branch.
Any other feedback or suggestions are always welcome! There's always room for improvement.
Also.. Nirvana Edition with Backstage.io is coming soon!
Have you ever deployed a pod in Kubernetes, only to watch in horror as it gets stuck in an endless restart loop?
You're not alone. The "CrashLoopBackOff" error strikes fear into the hearts of Kubernetes users everywhere. When a pod enters this state, it's trapped in a cycle of crashing, restarting, and crashing again.
But what exactly causes this dreaded CrashLoopBackOff? And more importantly, how can you troubleshoot and resolve it to get your pods running smoothly again?
In this post, we'll walk through the key insights you need to break free from Kubernetes restart hell.
https://www.perfectscale.io/blog/kubernetes-crashloopbackoff-an-ultimate-guide
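As a small illustration of the "BackOff" part of the name: the Kubernetes documentation describes the kubelet restarting a crashing container with an exponentially growing delay (10s, 20s, 40s, ...) capped at five minutes. The sketch below models only that documented default; the function name and parameters are illustrative, and actual timing can differ by Kubernetes version and configuration.

```python
def restart_delay_seconds(restart_count: int, base: int = 10, cap: int = 300) -> int:
    """Approximate delay before the Nth restart (N >= 1) under default backoff.

    The delay doubles with each restart, starting at `base` seconds,
    and never exceeds `cap` seconds.
    """
    if restart_count < 1:
        raise ValueError("restart_count must be at least 1")
    return min(base * 2 ** (restart_count - 1), cap)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"restart {n}: wait {restart_delay_seconds(n)}s")
```

This is why a pod in CrashLoopBackOff appears to sit idle for minutes between attempts once it has crashed several times.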
As the title says anyone here?
Hello everyone, a couple of months ago, I wrote here to ask for your opinion on the tool I've developed, which allows you to generate IaC from a drag-and-drop interface. I've implemented several suggestions I received, including extending the number of components (the tool now covers all AWS RDS offerings except Oracle), adding VPC endpoint support, and improving architecture validation.
It would be great if you could check it out and maybe suggest some more features it's missing: https://app.archformation.com/
I've been trying to find a way around this, but I'm not really coming up with much. We run Grafana Agent as a DaemonSet in flow mode in our EKS cluster. All our apps send their OpenTelemetry traces to it, and they are then forwarded into Tempo.
We have a step in the pipeline which I think should be adding k8s attributes to the spans, so they can be more easily searched in Tempo.
otelcol.processor.k8sattributes "default" {
  extract {
    label {
      from      = "pod"
      key_regex = "(.*)/(.*)"
      tag_name  = "$1.$2"
    }
    metadata = [
      "k8s.namespace.name",
      "k8s.deployment.name",
      "k8s.statefulset.name",
      "k8s.daemonset.name",
      "k8s.cronjob.name",
      "k8s.job.name",
      "k8s.node.name",
      "k8s.pod.name",
      "k8s.pod.uid",
      "k8s.pod.start_time",
    ]
  }
  output {
    metrics = [otelcol.processor.memory_limiter.default.input]
    logs    = [otelcol.processor.memory_limiter.default.input]
    traces  = [otelcol.processor.memory_limiter.default.input]
  }
}
The problem is that this just ends up with all of the attributes reflecting the Grafana Agent, not the actual source of the span.
E.g. an app in namespace qa1 sends a trace, and it ends up in Tempo with "k8s.namespace.name=grafana-agent".
Here is an example. This is an ebs-csi pod, in the namespace ebs-csi, sending a trace to OpenTelemetry.
app.kubernetes.io.instance: "grafana-agent"
app.kubernetes.io.name: "grafana-agent"
container.id: "230dcd1933a94746a2b75f73fe22e9f92772c15ec014eb04287ff0fab5ee4caf"
host.name: "ebs-csi-node-s48hp"
k8s.daemonset.name: "grafana-agent"
k8s.namespace.name: "grafana-agent"
k8s.node.name: "i-018bb590272435fe2.us-gov-west-1.compute.internal"
k8s.pod.ip: "10.2.30.17"
k8s.pod.name: "grafana-agent-2dwwl"
k8s.pod.start_time: "2024-05-01 21:44:10 +0000 UTC"
k8s.pod.uid: "9854add6-4f12-4f42-80bd-0567d8934a01"
linkerd.io.control-plane-ns: "linkerd"
linkerd.io.proxy-daemonset: "grafana-agent"
linkerd.io.workload-ns: "grafana-agent"
os.description: "Amazon Linux 2023 (Linux ebs-csi-node-s48hp 6.1.82 #1 SMP PREEMPT_DYNAMIC Fri Apr 5 22:26:15 UTC 2024 x86_64)"
os.type: "linux"
process.command_args: ["/bin/aws-ebs-csi-driver", "node", "--endpoint=unix:/csi/csi.sock", "--logging-format=text", "--v=2", "--enable-otel-tracing=true"]
process.executable.name: "aws-ebs-csi-driver"
process.executable.path: "/usr/bin/aws-ebs-csi-driver"
process.owner: "root"
process.pid: 1
process.runtime.description: "go version go1.22.2 linux/amd64"
process.runtime.name: "go"
process.runtime.version: "go1.22.2"
service.name: "ebs-csi-controller"
Most of the relevant details you would use to search show up as the grafana-agent, which makes looking this up difficult, especially if you have multiple deployments of the same app(s). Does anyone have any idea where I'm going wrong?
Hello! I was wondering if anyone had any experience using either of these. Right now I have a project with a company to essentially improve the log management they use. It's a large enterprise-level company, but the team itself and the application they use are for internal staff, and it creates around 80-100GB of logs per week. It's hosted on a Kubernetes cluster.
They're currently using Google Cloud Operations Suite with FluentBit as the log shipper, where logs are sent to Cloud Logging. Metrics are monitored with Prometheus and there's no tracing. Alerts are also dealt with through Google Alerts.
I essentially wanted to implement the LGTM stack considering this has very good integration with Kubernetes running in microservices mode - I can configure tracing through Tempo and OpenTelemetry and also set up metrics through Prometheus for an observability stack showing logs, metrics and traces in Grafana.
However, after a lot of research I still can't quite figure out whether this implementation would actually improve anything on their end. There's no real information on Loki/LGTM stack vs Google Cloud Operations Suite, and I don't know if there would be any big differences in cost, speed, resources, performance, etc. Is Loki better than Google Cloud Logging at what it does? Are Grafana Alerts better than Google Alerts? Are there alternatives I can use instead? It's a big company, so the actual costs of the additional resources really don't matter as long as the solution works.
Thank you for any advice you can give me on this!
Hey everyone👋
If you're in London, UK next week and interested in observability and OpenTelemetry, I think you'll enjoy this edition of the Observability Engineering Meetup.
Who: Dan is the Observability lead at Skyscanner, a member of the OpenTelemetry Governance Committee, and the author of "Practical OpenTelemetry: Adopting Open Observability Standards Across Your Organization."
What: Dan will share some of his experiences leading an observability transformation at Skyscanner, from custom solutions to telemetry standards and from a root cause analysis based on intuition and past experience to one based on context and evidence.
If you can't make it we'll record the talk and post it on this YouTube channel.
Please roast my resume. Be detailed about which skills I'm missing and, if you can, suggest learning sources, so I can jump my package from 14 LPA to 20 LPA or more in India.
The US and European Union have set ambitious goals to reduce emissions by 40% and 55% by 2030. Yet, many companies lack solid strategies for making their tech stack more sustainable.
Discover innovative methods to create a greener future with Kubernetes clusters without sacrificing application performance. https://www.perfectscale.io/blog/what-is-the-carbon-impact-of-kubernetes
Hello, how much system design is needed for a DevOps engineer?
Please share your experience.
I am currently reading Alex Xu's system design interview prep book. Is it good?
I want to switch jobs soon, so I need your input on this topic.
Dear highly paid DevOps engineers, please post your experience on how to become a highly paid DevOps engineer, and with which skill set.
How do you manage the changes in Infrastructure as code, with respect to testing before putting into production? Production infra might differ a lot from the lower environments. Sometimes the infra component we are making a change to, may not even exist on a non-prod environment.
Have any of you implemented AI in infrastructure provisioning? If so, how beneficial has it been for your operations? #shareit
Good day.
Does anyone know of a step-by-step guide or link for integrating Robot Framework, housed in Bitbucket, with Xray + Jira? The Xray documentation doesn't really help much. My experience is more in test scripting, so this kind of setup is new to me. Thank you in advance.
Hi everyone, I currently work remotely at a small business that does physical labor contracting, and I maintain their website and everything considered IT within the business. I have basically automated everything to the point where I have nothing to do all day other than study, so I want to try to get into the DevOps field and need help with a peer review of my resume.
Also, please let me know if any changes are necessary and why, so I can learn and improve on making/editing my resume. Thanks in advance for all your feedback!
(🤯)
How do you ensure that your monitoring stack is working as expected and that you didn't mess up the config?
If you're using a SaaS (Grafana Cloud, Datadog, whatever), do you have another solution that will alert you in case of an outage?
Maybe it's just that there's no simple solution that's worth the effort. ¯\_(ツ)_/¯
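One pattern that comes up for this question is a heartbeat (sometimes called a "dead man's switch"): the monitoring stack sends a signal on a fixed schedule, and a separate, independent checker raises the alarm when the signal stops arriving. Below is a minimal sketch of the checker's core logic; the function name, the five-minute threshold, and the idea of storing a `last_seen` timestamp are assumptions made for illustration, not a description of any particular product.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(
    last_seen: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """Return True when the last heartbeat is older than max_age.

    A stale heartbeat means the monitoring pipeline itself may be down,
    so the checker should notify through a channel that does not depend on it.
    """
    return now - last_seen > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_seen = now - timedelta(minutes=7)  # hypothetical stored timestamp
    if heartbeat_is_stale(last_seen, now):
        print("No heartbeat from the monitoring stack: alert via an independent channel")
```

The important design choice is that the checker runs outside the system it watches, so a single outage cannot silence both.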
Hello, I am new to DevOps. I have been watching some videos on YouTube on how to get started (mostly videos on GitLab). So far I have watched the TechWorld with Nana 1-hour video and the Automation Step by Step playlist on GitLab. My question is: how should I practice CI/CD? Any other resources, preferably free, would also be welcome.
I want to test and train my AI model on a GPU VM, but when I tried Azure and Google Cloud, they didn't allow free credits to be used on GPU instances (at least not the ones I'm interested in). Is there any provider I could use, or will I need to pay for this kind of machine out of my own pocket?
What APM do you use for React applications?
We already have Grafana Tempo, and we don't want to install Elasticsearch just for React metrics.
I'm learning CI/CD with ArgoCD and Argo Workflows, and I want to set up a system for devs where they can deploy new apps just by providing a link to a repository.
The workflow I have in mind is:
What I'm not sure about is how to proceed with steps 2 and 3. I've searched a lot and found many examples for updating an image tag in an existing manifest with things like ArgoCD Image Updater, but nothing about adding new manifests.
What would be the best way to do this? Am I supposed to have the workflow commit the manifests to git? Something like this?
- name: Prepare Kubernetes manifests
  run: |
    # Example: Dynamically creating a Helm values file if it doesn't exist
    if [ ! -f "chart/${{ matrix.app }}/values.yaml" ]; then
      echo "image: myregistry/${{ matrix.app }}:${{ github.sha }}" > chart/${{ matrix.app }}/values.yaml
    fi
- name: Commit and push changes
  run: |
    git config user.name 'GitHub Actions'
    git config user.email 'actions@github.com'
    git add .
    git commit -m "Deploying new version of ${{ matrix.app }} - ${{ github.sha }}"
    git push
Hello, there is a DevOps school in Chicagoland that encourages its students to fake their experience. I mean they literally tell you to pick a company and add 5-8 years of fake experience. They have a fake staffing company with backup phone numbers in case companies try to verify the experience. Is there any way to stop this nonsense?
Don't get me wrong, but it creates very unfair conditions for people who are just starting in DevOps or who don't want to lie on their resumes.
I thought some users looking for an alternative would appreciate seeing this from a former Hashicorp Solutions Engineer.
In full transparency, I work for a competitor r/akeyless
Hello everyone, I wanted to learn k8s, so I visited the documentation and was overwhelmed. Can you please point me to some resources that would help? Maybe a course on Udemy or something?
Thank you in advance
Hi everyone, we are currently using MacinCloud services to run a dedicated Mac server, as we have crucial software that is available only on iOS.
We have multiple users logging in via VNC in order to work simultaneously, and we are currently experiencing some issues with MacinCloud.
I am looking for a new Mac server cloud provider and would like recommendations.
I saw MacStadium and macweb. Does anyone have any previous experience with them?