/r/dataengineering
News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.
Read our wiki: https://dataengineering.wiki/
Rules:
Don't be a jerk
Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.
Keep it related to data engineering: Posts that are unrelated to data engineering may be better for other communities.
Limit self-promotion posts/comments to once a month: Self-promotion is any form of content designed to further an individual's or organization's goals. If you work for an organization, this rule applies to all accounts associated with that organization. See also rule #5.
No shill/opaque marketing: If you work for a company or have a monetary interest in the entity you are promoting, you must clearly state your relationship. For posts, you must distinguish the post with the Brand Affiliate flag. See more here: https://www.ftc.gov/influencers
No job posts: Please use r/dataengineeringjobs instead.
No resume reviews/interview posts: We no longer allow resume reviews or interview questions because they are a separate topic from Data Engineering. For resume reviews, please use r/resumes or search our subreddit history for previous resume review advice. For interview questions, use sites like Glassdoor and Blind, or search our subreddit history for previous interview advice.
No technical error/bug questions: Please post any error/bug question on StackOverflow.
With NodeJS I need to insert an array of JSON objects into a BigQuery table in a way that bypasses the streaming buffer. I don't care if the records don't show up for 5, 10 or even 15 minutes. When they are INSERTED I want them to be partitioned and able to be UPDATED or DELETED. We will be inserting 100,000s of records a day.
So the best I could come up with is to write the data I want inserted to a temporary JSONL file in a storage bucket, then use the following to load the data into the table, then delete the file after.
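(A sketch of what I mean by that load step, shown with the Python BigQuery client for brevity; the Node.js client offers the same load-from-GCS functionality, and the bucket, dataset, and table names below are placeholders.)

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# load the temporary JSONL file from the bucket into the partitioned table
load_job = client.load_table_from_uri(
    "gs://my-bucket/tmp/batch.jsonl",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job; rows loaded this way skip the streaming buffer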
Hey everyone. I’m diving into research about enterprise data platforms and curious if anyone in the group has worked for (or with) a company that uses Palantir Foundry. If so:
Trying to cut through the marketing hype and understand real-world impact. Happy to hear success stories, pain points, or even “meh” experiences.
I am currently actively interviewing for DE/BIE roles. I have strong SQL, C#, and visualization skills, moderate Python, Spark, and data modeling, and an Azure certification. But when it comes to interviews, it seems impossible to crack the coding rounds. I seem to forget how to code, struggle to talk, and even freeze up. It feels like an impossible task.
What helps you get past this? How do you all crack the screening to even get to next level and prove you are in fact a good hire?
Appreciate any advice.
Hey all,
We’re doing a quick research study on database costs & infrastructure—figuring out how developers & companies use PostgreSQL, InfluxDB, ClickHouse, and managed DBaaS.
Common problems we hear:
🔥 If you run databases, we’d love your insights!
👉 Survey Link (2 mins, no email required): https://app.formbricks.com/s/cm6r296dm0007l203s8953ph4
(Results will be shared back with the community!)
Hello!
I've been a data analyst for about 5 years and know Python/SQL/Tableau, the basics. But recently my job responsibilities have expanded, and I'm beginning to feel like data engineering is important to advance and do my job better. I've had to figure out what CDC/Flink/Hive are, and it's been a real uphill battle.
Are there any affordable options for boot camps or maybe even a masters that might be helpful?
Hey, I recently read another post in this subreddit about joining a startup where the data warehouse was quite a mess, discussing strategies on how to tackle that. I completely relate to that story and would like to hear more opinions and perspectives on how to approach it. We currently use Postgres, and we don't have that much data, maybe 15 GB. We do ELT with the medallion architecture, and we use a star schema as well; nothing fancy. Some issues with our current implementation of the star schema: keys being generated several times, and using hash keys instead of simply sequential keys. Other than that, I don't really know where to start.
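(Just to make the sequential-key point concrete, here is a rough sketch of what I mean by a simple surrogate key for a dimension; the column names are made up, and in Postgres itself this would typically be an identity column or ROW_NUMBER() rather than pandas.)

import pandas as pd

# toy source data standing in for a staging table
orders = pd.DataFrame({
    "customer_code": ["C1", "C2", "C1", "C3"],
    "customer_name": ["Acme", "Beta", "Acme", "Gamma"],
})

# dimension: one row per natural key, with a sequential surrogate key
dim_customer = (
    orders[["customer_code", "customer_name"]]
    .drop_duplicates(subset="customer_code")
    .reset_index(drop=True)
)
dim_customer["customer_sk"] = dim_customer.index + 1  # 1, 2, 3, ... instead of a hash

# the fact table joins on the natural key to pick up the surrogate key
fact_orders = orders.merge(dim_customer[["customer_code", "customer_sk"]], on="customer_code")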
I've been watching some videos on data modeling, which are very useful, but they never get into the nitty-gritty, which is writing the actual SQL and making it work. Does anyone have any resources (videos, courses or books) on that?
Or something on data warehouse cleanup? Kind of like the step before the optimisation?
A fresh-out-of-the-box ADF has the "Managed private endpoints" menu item disabled until you go through configuring one, as shown below.
I'm working in an environment where several data factories, all alike, were created via Terraform. In one data factory, someone was experimenting with private endpoints, thus enabling them. However, that experiment is complete, and we don't have current plans to use them. Because the Terraform does not include managed_virtual_network_enabled, it wants to destroy and recreate that data factory whenever a plan is run:

managed_virtual_network_enabled = true -> null # forces replacement
As I see it, my options are:
Allow Terraform to destroy and recreate, and reconnect to the ADF repo to reload the configuration. This is a development environment, so it can be down for a while without consequences. However, I've never attempted this before, so it does give me pause.
Set managed_virtual_network_enabled = true in Terraform. Running this as a plan appears to work as expected. This seems to be the easier path. Is there any harm in doing this without actually making use of it?
Halfway between a meme and a "real" blog post, I wrote https://smallbigdata.substack.com/p/what-if-dbt-and-sqlmesh-were-weapons
A light take on the ongoing transformation-layer battle between the two major contenders: dbt and SQLMesh! Which transformation tool are you going for this year?
It feels overwhelming. I have learnt most of the technologies, but my goodness, they keep escaping from my memory. Maybe we can help each other study better and retain more knowledge. Anybody here, please? DM me if interested.
I am pursuing a data science masters and am stuck between a few schools. I'm of course looking at curriculum, but also cost, reputation, and connections. The end goal is getting a good-paying data science job.
UCB, Columbia, Georgia Tech, UCLA and University of Michigan. Thinking about applying to UVa as well.
For those of you who have worked with Dataproc using PySpark, do you have any feedback on its use, particularly when using PySpark in local mode?
Whether you use Dataproc in cluster mode or Serverless mode, how do you test your scripts? Do you just throw everything at Dataproc? The idea for me is to avoid the extra cost of deploying jobs on Dataproc...
Or do you:
- spark-submit on your local environment with a local cluster? (=> install PySpark + Spark + Java etc.)
- use a Docker container that contains the entire Spark installation so as not to pollute your env?
Basically, I want to work locally as much as possible (i.e. run a local script that reads either local data or my bucket files) in order to save time and money.
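For what it's worth, the kind of minimal local setup I have in mind is something like this (a sketch, assuming PySpark is pip-installed with a local JVM available; reading gs:// paths locally would additionally require the GCS connector jar and credentials, and the file/column names are placeholders):

from pyspark.sql import SparkSession

# local[*] runs Spark inside this process using all local cores, no cluster needed
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-dev")
    .getOrCreate()
)

# develop against a small local sample of the real data
df = spark.read.parquet("data/sample.parquet")
df.groupBy("some_column").count().show()

spark.stop()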
So my company is still using the old tool Pentaho, specifically their community edition. We have hosted our reports on their UI server, and now the reports are not getting downloaded consistently.
I have debugged the issue: it creates a CSV/XLS file in the temp directory in the Tomcat folder, and the logs show a NullPointerException, but we do not manage the codebase as it is a third-party tool.
Please help if anyone has faced the same issue, as I will be out of a job if it doesn't get resolved 🥲.
Curious to know how Unit Testing of ELT pipelines is being done at everyone’s work.
At my work, we do manual testing. I’m looking to streamline and automate the process if possible. Looking for inspirations :-)
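For context, the direction I'm imagining is to pull transformation logic into plain functions and test them with pytest against tiny hand-made inputs; a sketch with made-up column names, not our actual pipeline:

import pandas as pd

def add_net_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Example transformation step: net revenue = gross minus refunds."""
    out = df.copy()
    out["net_revenue"] = out["gross_revenue"] - out["refunds"]
    return out

def test_add_net_revenue():
    raw = pd.DataFrame({"gross_revenue": [100.0, 50.0], "refunds": [10.0, 0.0]})
    result = add_net_revenue(raw)
    assert result["net_revenue"].tolist() == [90.0, 50.0]

SQL-heavy steps could presumably get similar treatment by running them against a throwaway schema (or DuckDB) seeded with small fixture tables, though that's just where my thinking is at.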
In your opinion, is data engineering as a job going to grow or decline in demand over the next 5-10 years, given the AI hype that exists nowadays? If you also have opinions about SWE and MLE jobs, please share them.
4 years ago I made a career shift from civil engineering to software engineering, and then to data engineering about a year ago. Tbh, the huge changes happening in the scene nowadays are a bit concerning, so I figured I'd get some opinions.
Thanks.
Hey everyone,
I’m looking to dive into the world of data engineering, and I’d love some guidance from those who’ve been there before. I want to start building a project that will help me get hands-on experience with creating and managing data pipelines, but I’m feeling a bit overwhelmed by the sheer amount of resources and content available online. It’s hard to know where to begin!
A bit about me:
But when it comes to designing and building actual pipelines, I’m not sure where to start. Should I focus on learning specific tools like Apache Airflow, Kafka, or dbt? Or should I start with basic concepts of data modeling, ETL processes, etc., and then gradually expand into more advanced tools?
If anyone could share their journey or point me to some good resources, I’d really appreciate it. I’m eager to learn and get my hands dirty with a real project!
I am working on some practice problems on DataCamp and I seem to have run into an issue that I am not sure how to fix. I am tasked with cleaning some data and outputting a formatted database as a result; the following images outline my parameters. The Python code I created is as follows:
"# Use as many python cells as you wish to write your code
import pandas as pd
def map_activity(activity):
if not isinstance(activity, str):
return "Unknown"
activity = activity.lower().strip()
if 'walk' in activity:
return 'Walking'
elif 'play' in activity:
return 'Playing'
elif 'rest' in activity:
return 'Resting'
elif 'health' in activity:
return 'Health'
else:
print(f"unmapped activity: {activity}")
return 'unknown'
def all_pet_data():
pet_activities = pd.read_csv('pet_activities.csv', encoding='utf-8')
pet_health = pd.read_csv('pet_health.csv', encoding='utf-8')
users = pd.read_csv('users.csv', encoding='utf-8')
pet_health.rename(columns={'visit_date': 'date'}, inplace=True)
pet_health['activity_type'] = None
pet_health['duration_minutes'] = None
pet_activities['issue'] = None
pet_activities['resolution'] = None
pet_data = pd.concat([pet_activities, pet_health], ignore_index=True)
merged_data = pet_data.merge(users, on='pet_id', how='left')
final_columns = [
'pet_id', 'date', 'activity_type', 'duration_minutes', 'issue', 'resolution','owner_id', 'owner_age_group', 'pet_type'
]
merged_data = merged_data[final_columns]
merged_data['pet_id'] = merged_data['pet_id'].astype(int)
merged_data['date'] = pd.to_datetime(merged_data['date'])
merged_data['activity_type'].fillna('health', inplace=True)
merged_data['duration_minutes'] = pd.to_numeric(merged_data['duration_minutes'], errors='coerce').fillna(0).astype(int)
merged_data['issue'].fillna('', inplace=True)
merged_data['resolution'].fillna('', inplace=True)
merged_data['owner_id'] = merged_data['owner_id'].astype(int)
merged_data['owner_age_group'] = merged_data['owner_age_group'].astype(str)
merged_data['owner_age_group'].fillna('', inplace=True)
merged_data['pet_type'] = merged_data['pet_type'].astype(str)
merged_data['pet_type'].fillna('', inplace=True)
return merged_data
df = all_pet_data()
df['activity_type'] = df['activity_type'].apply(map_activity).str.capitalize()
print(df.head(50)) "
I produce a formatted database with the required information, yet the system indicates that my result is incorrect. Can anyone tell me where I am going wrong and why this code is not meeting the criteria outlined?
I typically enjoy my job and get satisfaction out of figuring things out. However, sometimes I feel overloaded with all the pieces of data and relationships. I think this is typically when requirements aren't pinned down to something specific and I'm just going through data no one has worked with.
I still feel grateful for my job, but I'm wondering how you all deal with these setbacks if you experience them. My mind is basically telling me this isn't sustainable; I just want to retire, maybe go into a job with less deep thinking, or figure out how much I'd need to buy a cabin in the woods.
I know it's a mental setback so I'm trying to build documentation and meet with my manager to try to pin down what the end results are expected to look like for analytics users.
Any advice?
TLDR; it can be challenging at times. How do you cope?
Edit 2:
The sharp drop in engagement is, at minimum, suspicious. This post wasn't removed, but it completely dropped in engagement after only a few minutes. Avoiding any tinfoil-hat statements, I'll be creating a BlueSky handle and leading some discussion over there.
Edit:
It's clear that this isn't the right community for this kind of outreach. Not because there is a misalignment in scope; this is a data engineering question after all, aiming to find ways to better hone engineering for the practice of peaceful protest. The misalignment actually stems from a culture difference. The community here is not on the same page as these efforts, as is clear from the comments I've amassed.
This is fine. The first step in protest is to find your people. To the engineers who wander over to my post: if you find my mission aligns with your own, know that you aren't alone.
I’m going to reach out to groups in my local community and see what they need.
—-
Original post:
50501 is planning protests in all 50 states tomorrow, to take place at each state capitol building; times vary but are publicly posted in various articles and sites.
I realized today that my state capitol building is 6 hours away, and there’s no way I can manage that trip with the protest while also managing my kids. I have no childcare except for school, and I wouldn’t be able to get them there plus pick them up on time. Logistically, I just can not go to the protest.
Now, I want to contribute. If I can’t show up physically, so be it. I need to do my part too. So the question becomes: what can I do?
It feels like I'm back at square one, honestly. That question has been going around in my head for a long time now, but it's been difficult to answer because there has yet to be a spearhead that I can latch onto and contribute toward. Where is the fight?
Well, that’s changed now that there are protests. For starters, I can help by aiding the protestors. Maybe I can try finding a police radio feed for the city my capitol building resides in, then transcribe it to text and make automated police chatter updates on a BlueSky account. That could help the protestors keep their edge against any unjust attempts to shut them down, right?
What else though? How can we contribute? How can we engineers show that we might be small, but we can pack a lot of fucking heat? What's the best way to guide these efforts?
Have any of you given this thought? What have been your considerations? What do you already know of? What have you already done? What would you like to see others doing?
Hey r/dataengineering, check out Duck-UI - a browser-based UI for DuckDB! 🦆
I'm excited to share Duck-UI, a project I've been working on to make DuckDB (yet) more accessible and user-friendly. It's a web-based interface that runs directly in your browser using WebAssembly, so you can query your data on the go without any complex setup.
Features include a SQL editor, data import (CSV, JSON, Parquet, Arrow), a data explorer, and query history.
This project really opened my eyes to how simple, robust, and straightforward the future of data can be!
Would love to get your feedback and contributions! Check it out on GitHub: [GitHub Repository Link](https://github.com/caioricciuti/duck-ui) and if you can, please star us; it boosts motivation a LOT!
You can also see the demo on https://demo.duckui.com
or simply run yours:
docker run -p 5522:5522 ghcr.io/caioricciuti/duck-ui:latest
Thank you all, have a great day!
I've put together the most commonly asked questions for Azure Synapse, along with examples from my own experiences, in a Medium story.
If you’re preparing for data engineering interviews, this might be helpful!
🔗 Cracking Azure Synapse Interviews: Most Frequently Asked Questions
It’s free to read and ad-free. Hope you find it useful! 😊.
I will also be sharing SQL and Python shortly.
This is how the response body in the Swagger UI should look. I tried multiple things, but everything turns out to be time-consuming. Here are the things I tried:
json_loaded = [row.asDict() for row in data_df.collect()]  # collect() pulls every row to the driver at once
json_loaded = [row.asDict() for row in data_df.toLocalIterator()]  # streams rows to the driver partition by partition
json_loaded = data_df.toPandas().to_dict(orient="records")  # converts the whole DataFrame to pandas first, then to dicts
As the title says, what do people in hiring positions look for in a person's CV/application? What are things that put you off, or things that make you go "okay, he might be worth a look", or things you actually find impressive?
Also, another question: at my company we have no automation to screen CVs, so our lead actually checks all of them quickly by hand. How does it work at your company? Is there automation that looks for keywords, and if you don't have them, bye bye, you get a rejection?
I searched the sub but didn't find anything similar to what I was looking for. If I missed it, feel free to send the post.
Hi all, I am stuck waiting for my flight back home(12h+) and I thought I'd kill some time with a small project.
I have a Python script that can gather data and insert it into a database. I have everything set up and working on my home computer, so I am sure the code is fine. If there is no way to run the script, I have a recent export of the database (SQL) backed up on my phone.
I also have exports of some Grafana dashboards of that data, if that's even useful.
What I am looking to do is find a way to visualize said data, no laptops or anything, just an S22 Ultra. This is more of an "is this possible" kind of thing. Ideally we are going with zero cost, but if that's impossible, anything less than 10 USD or EUR per month is fine.
Curious to see what you suggest!
I’m new to AWS and need to export tables from Amazon Redshift to S3 in Iceberg format. Since Redshift’s UNLOAD command only supports Parquet, CSV, and JSON, I’m unsure of the best way to achieve this.
Would it be better to:
Unload as Parquet first, then use an AWS service like Glue or EMR to convert and store it in Iceberg format?
Directly write to Iceberg format using AWS Glue or another tool?
If either of these approaches works, I’d really appreciate a step-by-step guide on how to set it up. My priority is a cost-effective and scalable solution, so I’d love to know the best tools and best practices to use.
Any insights or recommendations would be greatly appreciated! Thanks in advance!
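In case it helps frame answers, here is roughly what I picture for the first approach: a Spark job (on Glue or EMR) that reads the unloaded Parquet and writes an Iceberg table. This is only a sketch under assumptions, namely Spark 3 with the Iceberg runtime on the classpath and a catalog configured under the name glue_catalog; the bucket, database, and table names are placeholders:

from pyspark.sql import SparkSession

# assumes the Iceberg runtime jar is available and a catalog named
# "glue_catalog" (e.g. backed by the AWS Glue Data Catalog) is configured
spark = SparkSession.builder.appName("redshift-parquet-to-iceberg").getOrCreate()

# data previously written by Redshift UNLOAD ... FORMAT AS PARQUET
df = spark.read.parquet("s3://my-bucket/unload/orders/")

# create (or replace) an Iceberg table registered in the catalog
df.writeTo("glue_catalog.analytics.orders").using("iceberg").createOrReplace()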
- We have a table of 19 billion rows with 2 million rows adding each day
- The FE sends a GET request to the Rails BE, which in turn sends the query to Snowflake; Snowflake returns the result to Rails and we send it to the FE.
- This approach works well enough for smaller data sets, but for a customer with around 2 billion rows it takes more than 1 minute.
- Regarding the query, what it does is calculate the metrics for a given time range. There are multiple columns in the table; for some metrics it only involves summing columns within the date range, but for other metrics we are partitioning on the fly.
- One more thing is if the date range is of 1 year, we are also calculating the metrics of the previous year from the given date range and showing them as comparison metrics.
- We need a solution, either to optimize the query or to adopt new tech, to make the API response faster.
Any suggestions?
Thanks
Hi everyone, I have been in my current organisation for six months as a data engineer, where my work is to maintain and look after pipelines hosted both on premises and on GCP VMs. The jobs are scheduled on Control-M and the scripts are in shell and Python (for transformation). I have just started to explore GCP and will be appearing for the Associate Cloud Engineer and then the Professional Data Engineer certification.
It being a large organisation, I am finding it hard to understand the architecture and the logic behind it, and my learning is also being slowed down. The TL is not helpful, as he focuses more on the work than on teaching the whys behind the approach. I feel I am just running and tweaking scripts without knowing the reason behind them.
So it would be helpful if you guys could share your experiences and the resources you learnt from regarding different system designs and the scenarios where they would be applicable.
Basically, title. In general, we're working on implementing self-service to move data using YAMLs (with a single entrypoint that parses them and generates DAGs). Now the business has tasked us with developing an interface so they can generate these DAGs (YAMLs, in fact) programmatically. As far as I understand, Airflow tends to work with real files on the OS, and that's the ground we build our solution on. But we haven't yet figured out how to work with the JSONs our future service will receive. Of course there should also be monitoring, logging, progress tracking, etc. (maybe not in the MVP, but it shouldn't be very complicated to implement).
I already designed a possible solution, but it seems a bit over-engineered to me. The idea is that there is our future service plus a separate database to store the JSONs and track DAG states (a queue). There would be files on S3, each one belonging to one DAG. When the RPC is triggered, the config changes. Each DAG has a sensor on its config, and if the config was changed, the DAG starts. There are more details obviously, but you get the idea.
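For reference, the single-entrypoint pattern I mean is roughly this (a simplified sketch assuming Airflow 2.4+; file paths, operators, and config fields are illustrative, not our actual code):

import glob
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator

# every YAML dropped into this folder (e.g. by the future service) becomes a DAG
for path in glob.glob("/opt/airflow/dags/configs/*.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)

    dag = DAG(
        dag_id=cfg["dag_id"],
        start_date=datetime(2024, 1, 1),
        schedule=cfg.get("schedule"),
        catchup=False,
    )

    with dag:
        for task in cfg["tasks"]:
            BashOperator(task_id=task["id"], bash_command=task["command"])

    # expose the DAG at module level so the scheduler picks it up
    globals()[cfg["dag_id"]] = dag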
WDYT??
What companies are you using for your synthetic data?