/r/dataengineering
News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.
Read our wiki: https://dataengineering.wiki/
Rules:
Don't be a jerk
Limit Self-Promotion: Remember the reddit self-promotion rule of thumb: "For every 1 time you post self-promotional content, 9 other posts (submissions or comments) should not contain self-promotional content."
Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.
No job posts (posts or comments)
No technical error/bug questions: Any error/bug question belongs on StackOverflow.
Keep it related to data engineering
Hey everyone! I wanted to share a tutorial created by a member of the Pathway community that explores using NATS and Pathway as an alternative to a Kafka + Flink setup.
The tutorial includes step-by-step instructions, sample code, and a real-world fleet monitoring example. It walks through setting up basic publishers and subscribers in Python with NATS, then integrates Pathway for real-time stream processing and alerting on anomalies.
App template (with code and details):
https://pathway.com/blog/build-real-time-systems-nats-pathway-alternative-kafka-flink
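For anyone who hasn't touched NATS from Python, the basic publish/subscribe pattern the tutorial builds on looks roughly like this. This is a minimal sketch, not the tutorial's code: it assumes the nats-py client and a NATS server on localhost, and the `fleet.telemetry` subject name is made up for illustration.

```python
import asyncio
import nats  # nats-py client


async def main():
    # Connect to a locally running NATS server (adjust the URL as needed).
    nc = await nats.connect("nats://localhost:4222")

    # Subscriber: print every telemetry message received on the subject.
    async def handle(msg):
        print(f"[{msg.subject}] {msg.data.decode()}")

    await nc.subscribe("fleet.telemetry", cb=handle)

    # Publisher: send a couple of sample readings on the same subject.
    for reading in (b'{"truck": 1, "speed": 92}', b'{"truck": 2, "speed": 55}'):
        await nc.publish("fleet.telemetry", reading)

    await nc.flush()          # make sure the messages hit the server
    await asyncio.sleep(0.5)  # give the subscriber a moment to process
    await nc.drain()          # clean shutdown


if __name__ == "__main__":
    asyncio.run(main())
```

Pathway then sits on top of a stream like this for the windowing and anomaly alerting; the linked template shows that part end to end.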
Key Takeaways:
Would love to know what you think. Any feedback or suggestions are welcome!
It's a multiple linear regression with around 7 regressors. When we tried taking the log of the regressors to get a more linear plot, it kept throwing an error, apparently because there were NA or Inf values; even after using na.omit() we still didn't get an output. We tried other transformations like sqrt(), but to no avail. Any ideas on how to deal with this? Our class mainly taught regression. Is it possible to do clustering on this data (given we have multiple fields, including ones with qualitative values)? Any links to sources on how to do this would be great.
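For what it's worth, the usual culprit here is that taking the log of zero or negative values produces -Inf/NaN, and na.omit() drops NA but not Inf, so the fitting function still errors out. Here is the same cleanup expressed in Python/pandas terms (a rough sketch only; the column names x1..x7 and y are hypothetical, and the class's workflow is in R):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data with 7 regressors x1..x7 and a response y.
df = pd.read_csv("data.csv")
regressors = [f"x{i}" for i in range(1, 8)]

# log of zero/negative values yields -inf or NaN; log1p on values clipped
# at zero (or shifting by a small constant) sidesteps that.
logged = np.log1p(df[regressors].clip(lower=0))

# Replace any remaining +/-inf with NaN, then drop incomplete rows.
# This is the step na.omit() alone doesn't cover, since it ignores Inf.
clean = pd.concat([df["y"], logged], axis=1)
clean = clean.replace([np.inf, -np.inf], np.nan).dropna()

model = sm.OLS(clean["y"], sm.add_constant(clean[regressors])).fit()
print(model.summary())
```

The R equivalent is filtering with is.finite() (or using log1p()) before the lm() call. Clustering is possible too, but mixed qualitative/quantitative fields usually need encoding first or a mixed-data method such as k-prototypes.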
My current job is a Data Admin, and I already have experience as a Data Analyst. I also have a degree in Computer Science.
What roles should I go for, and what certifications should I try getting?
I'm frustrated because I'm not that great at communicating with words 🤣 I always have to show something visually alongside my explanation. What tools are you using? Curious to hear :)
Patient safety relies heavily on accurate and reliable data. In healthcare, data verification ensures that critical information—like medical records, diagnoses, and prescriptions—is accurate and up-to-date.
Without proper verification, errors can compromise patient care and safety. This blog highlights why data verification is vital for maintaining data integrity in healthcare systems.
Check it out here: Ensuring Patient Safety and Data Integrity
How does your organization handle data verification?
Basically the title. I really don't want to deep-dive into it, get lost in the process, and end up becoming a DevOps engineer. Do you have any recommended materials?
Thanks!
My main requirement is to run orchestration workflows on a Windows system.
Are there any disadvantages to using Apache Airflow on Windows with Docker, or should I consider Prefect instead, since it runs natively on Windows?
I feel that Airflow's UI and features are better than Prefect's, though.
Hi there,
My current company is small, and little did I know, their DB size is only 10GB and isn't expected to grow much over the next couple of years. Their entire process is just:
Application ------> Azure OLTP Db
No pipelines, no reporting database—nothing fancy. I’d love to suggest improvements, but honestly, anything beyond what they have now would feel like overkill.
Before I joined, I was told Fabric, Spark, and a DW were in the future. However, I have seen their future plans and it's no good at all; they aren't planning to change anything.
I have another job offer that uses Spark, GCP, and other tools I used to work with, and I'd rather work with newer tech than what I'm doing right now.
Am I crazy for switching after 3 months?
I graduated with 1 year of internship experience in May 2023 and have worked at my current company since August 2023. I make around 72k after the yearly salary increase. My boss told me about 6 months ago that I would be receiving a promotion to senior data engineer due to my work and mentoring our new hire, but has since told me HR will not allow me to be promoted to senior until 2026. So I'll likely be getting a small raise this year (probably to about 80k after negotiating) and be promoted to senior in 2026, which would put me around 100k. However, I may receive another offer for a data engineer position at around 95k plus bonus. Would it be worth it to leave my current job, or stay for the almost-guaranteed senior position? Wondering which is more valuable long term.
It is also noteworthy that my current job is in healthcare industry and the new job offer would be in the financial services industry. The new job would also be using a more modern stack.
I am also doing my MSCS at Georgia Tech right now and know that will probably help with career prospects in 2026.
I guess I know the new job offer is better, but I'm wondering if it will look too bad for me to swap after only 1.3 years. I'm also wondering whether the senior title is worth staying at a lower-paying job for an extra year. I'd also like to get out of healthcare eventually since it's lower paying, but I'm not sure whether I should do that now or will have opportunities later.
Like ofc, Python is a big one. And data warehousing, I'm assuming, plus database foundations.
What are some others?
More or less, the question is in the title. I have some contracts coming up soon and will need some additional hands. I'd be interested in talking to some people; experience in Airflow / BigQuery is a plus, but I know there are a lot of different flavors of the same thing out there.
I'd also be interested in just hearing about some general common issues or problems you've run into working in education. The most common thing I see so far is having too many SaaS platforms that are all redundant, or are being used by some schools but not all.
Greetings,
I'm building a data dashboard that needs to handle:
My background:
Intermediate Python, basic SQL, learning JavaScript. Looking to minimize complexity while building something scalable.
Stack options I'm considering:
Planning to deploy on Digital Ocean, but welcome other hosting suggestions.
Main priorities:
Would appreciate input from those who've built similar platforms. Are these good options? Any alternatives worth considering?
How are folks securing backend resources in Trino? Currently we're using file-based access control. I'm not even sure I'm approaching this correctly, but we want to use Azure users and groups, plus policies based on catalog data, to determine access.
Is anyone using catalog data and groups to manage access like that? What does your stack look like?
Thx
Curious about how to maximize Snowflake query performance using Cluster Keys. Check out this podcast.
If you were to measure in terms of number of jobs and tables? 24 hour SLA, daily batches
What tasks are you performing in your current ETL job and which tool are you using? How much data are you processing/moving? Complexity?
How is the automation being done?
Hello,
I host a dbt service in an ECS container. When I call it from Airflow, it works perfectly, but it launches the service each time I call it.
For example, if I have this in my DAG:
dbt seed --model XXX >> dbt test --selector data_silver >> dbt compile --selector data_silver ... etc
it will call my ECS task X times.
My question is pretty simple: how can I set a 5-minute timeout on my ECS task and reuse the same one? At the same time, I'd get one CloudWatch log for all the dbt calls, which is a time saver when I'm looking at CloudWatch.
I don't want to use an ECS service, as you pay for a service that's always on.
I did the same thing with Cloud Run and a timeout.
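One pattern that avoids launching a container per dbt command is to pass the whole chain as a single command override to one ECS task, so Airflow starts the container once, the 5-minute cap is enforced by the task timeout, and everything lands in one CloudWatch log stream. A rough sketch assuming the Amazon provider's EcsRunTaskOperator; the cluster, task definition, container name, and log group are placeholders, and the dbt flags should match your dbt version:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Run every dbt step sequentially inside ONE container invocation instead
# of one ECS task per dbt command.
DBT_CMD = (
    "dbt seed --select XXX && "
    "dbt test --selector data_silver && "
    "dbt compile --selector data_silver"
)

with DAG(
    dag_id="dbt_silver",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_dbt = EcsRunTaskOperator(
        task_id="run_dbt_silver",
        cluster="my-ecs-cluster",        # placeholder
        task_definition="dbt-task-def",  # placeholder
        launch_type="FARGATE",           # Fargate also needs network_configuration
        overrides={
            "containerOverrides": [
                {"name": "dbt", "command": ["sh", "-c", DBT_CMD]}
            ]
        },
        execution_timeout=timedelta(minutes=5),  # hard 5-minute cap on the task
        awslogs_group="/ecs/dbt",                # placeholder log group
        awslogs_stream_prefix="dbt",
    )
```

The trade-off is coarser retries (one failed step fails the whole command), so splitting into two or three grouped invocations is a middle ground if that matters.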
I followed the docs for Airflow 2.0.2 to expose metrics via StatsD. I used statsd_exporter so that Prometheus can scrape the metrics. However, I see that scheduler-related metrics and some other useful metrics are not being exposed, even though they are mentioned in the docs. Has anyone faced this issue?
The primary tool my team has is Azure Synapse Analytics. We also have Azure Functions apps and Logic Apps. We may be able to get additional Azure resources, but we are basically limited to Azure/Microsoft products (as well as GitHub). Given this limitation, are there any recommendations for pipelines/workflows? The basic process now is to use Azure Synapse pipelines and dataflows or notebooks. GitHub is what we want to use for source control, but that has proven problematic (users can't publish straight from the Synapse workspace, and we really aren't sure where changes are supposed to be approved).
I am able to formulate a query given a situation, but sometimes it takes me a lot of time to come up with even a simple query. I am practicing my SQL with DataLemur SQL problems and sometimes LeetCode. What would you recommend as the right approach?
Currently, I have a PostgreSQL database with some raw_data tables that I process using DBT. Basically, my workflow for processing these data looks like this:
I’ve managed to make all the models in step 1 incremental in DBT, using an upsert logic based on a unique ID and only processing what’s been exported to the raw tables in the last X minutes (e.g., 30 minutes).
However, in step 2, I end up having to recalculate all the data in all the models as full tables in DBT. Since these are cross-sourced data, I can’t filter them simply based on what’s been exported in the last X minutes. This makes step 2 extremely slow in my processing pipeline.
To optimize, I've been wondering whether dbt is actually the best tool for this reprocessing step that generates the analytical models I consume in my visualization tools. Or should I look into distributed processing tools (e.g., Apache Spark) for step 2 to generate these metric-dimension tables?
Have you ever faced this kind of issue? Did you go with a similar approach? Do you recommend using DBT for this or some other solution? These are some of the questions I’ve been grappling with.
EDIT: Just one thing I forgot to mention. I'm working with a medium volume of data—there’s about 100GB stored in the database already. However, besides this data volume, the processing queries in step 2 are quite complex and involve a lot of JOINs, which makes it the slowest step in the pipeline.
Hey, I'm a DBA whose ETL experience is limited to SSIS. The shop I work at is migrating our infrastructure to Fabric. We have a consultant setting up replication from our AS400 to a Fabric lakehouse, but we're running into these issues:
Latency is more than 15 minutes
Since we have a lakehouse instead of a warehouse, the SQL endpoint cannot be used to write data. This led to:
The target is manually created Parquet files and Delta logs, which the lakehouse does not recognize as a table. To work around this, we have table-valued functions and views that create a simulated table we can then use.
This seems like an unnecessary workaround, but I'm not familiar enough with modern data engineering to know what a better solution might look like. What would be an option for us to stream data from our Java-based AS400 CDC tool into Fabric? I've suggested ADF and Spark, but both have been rejected for being too inefficient to keep latency below 15 minutes. Since we built the CDC tool, we can modify it as needed.
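For what it's worth (and acknowledging Spark has already been rejected here on latency grounds), the reason the lakehouse won't treat the files as a table is usually that the Parquet plus _delta_log isn't being produced by a Delta writer and registered under the lakehouse's Tables area. Below is a minimal sketch of what a recognized write looks like from Spark Structured Streaming; the source, topic, checkpoint path, and table name are all placeholders, since your CDC tool is custom:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("as400_cdc_to_lakehouse").getOrCreate()

# Read the change feed from whatever streaming source the homegrown AS400
# CDC tool can publish to (Kafka is shown purely for illustration).
changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "as400_cdc")                   # placeholder topic
    .load()
)

# Writing through the Delta writer produces a _delta_log the lakehouse
# understands, and toTable() registers it as a queryable lakehouse table.
query = (
    changes.writeStream.format("delta")
    .option("checkpointLocation", "Files/checkpoints/as400_cdc")  # placeholder
    .outputMode("append")
    .toTable("as400_orders_cdc")                                  # placeholder name
)
query.awaitTermination()
```

If Spark stays off the table, the same principle applies to the custom tool: writing valid Delta commits (e.g. via a Delta library for Java) rather than raw Parquet files is what removes the need for the TVF/view workaround.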
I have built quite a big data access control system on our Redshift with the help of RBAC. I implemented it with Liquibase. Each time, all external tables, roles, and user permissions are recreated. The problem is that rerunning everything every time is extremely slow, and I don't know how to create dependencies between changesets.
I would need something that builds a graph like dbt does, so I could run downstream/upstream changes for all modified changesets. Do you know another tool for building graph relationships, or how to implement this in dbt / Liquibase?
I know I could use Airflow / Dagster to build graph relationships from scratch, but I love how dbt's ref("") automatically creates the graph.
I would basically need dbt, except instead of creating views/models it would grant permissions.
Hi everyone! I’m working with an SQL database containing hundreds of construction products from a supplier. Each product has a specific name (e.g., Adesilex G19 Beige, Additix PE), and I need to assign a general product category (e.g., Adhesives, Concrete Additives).
The challenge is that the product names are not standardized, and I don’t have a pre-existing mapping or dictionary. To identify the correct category, I would typically need to look up each product's technical datasheet, which is impractical given the large volume of data.
| product_code | product_name |
|---|---|
| 2419926 | Additix P bucket 0.9 kg (box of 6) |
| 410311 | Adesilex G19 Beige unit 10 kg |
I need to add a column like this:
| general_product_category |
|---|
| Concrete Additives |
| Adhesives |
How can I automate this categorization without manually checking every product's technical datasheet? Are there tools, Python libraries, or SQL methods that could help with text analysis, pattern matching, or even online lookups?
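A lightweight first pass (before reaching for datasheet scraping or an LLM) is keyword plus fuzzy matching in Python; product lines like "Adesilex" or "Additix" usually map cleanly to a category. This is only a sketch: the keyword-to-category mapping and the DataFrame columns are made up and would need to be grown from your own catalogue.

```python
import re
from difflib import get_close_matches

import pandas as pd

# Hypothetical seed mapping: product-line keyword -> general category.
KEYWORD_CATEGORIES = {
    "adesilex": "Adhesives",
    "additix": "Concrete Additives",
}
KEYWORDS = list(KEYWORD_CATEGORIES)


def categorize(product_name: str) -> str:
    tokens = re.findall(r"[a-z]+", product_name.lower())
    for token in tokens:
        # Exact keyword hit first...
        if token in KEYWORD_CATEGORIES:
            return KEYWORD_CATEGORIES[token]
        # ...then a fuzzy match to absorb typos and variant spellings.
        close = get_close_matches(token, KEYWORDS, n=1, cutoff=0.85)
        if close:
            return KEYWORD_CATEGORIES[close[0]]
    return "Uncategorized"  # leave for manual review


df = pd.DataFrame(
    {
        "product_code": [2419926, 410311],
        "product_name": [
            "Additix P bucket 0.9 kg (box of 6)",
            "Adesilex G19 Beige unit 10 kg",
        ],
    }
)
df["general_product_category"] = df["product_name"].map(categorize)
print(df)
```

Whatever lands in "Uncategorized" can then be handled with a heavier tool (embedding similarity, an LLM pass, or the supplier's own catalogue hierarchy), and the result written back to the SQL table as the new column.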
Any help or pointers would be greatly appreciated! Thanks in advance 😊
We have a data product (SQL tables), and new migrations are coming that might be a breaking change for downstream teams. The data product is stored in Databricks and also Snowflake (exactly the same, but duplicated for different stakeholders' needs), and we have staging and production environments. The problem is that whenever we have a breaking change, we push it to staging and wait a couple of days for the stakeholders; if they give us the green light, we then proceed. But this is a bottleneck if something else needs to be deployed to production in the meantime, and we then have to revert the changes. The process of moving to staging and reverting is cumbersome, and the team doesn't agree on having a feature flag (because staging and production would then differ, and they don't like if conditions). Curious to know how you do reviews and get approval from downstream teams?
IMO, once we've agreed on the plans and changes and communicated them to downstream teams, we shouldn't be dependent on extra table verification from their side, but the team doesn't agree.
Hello,
I'm currently working on upserting to a 100M row table in SQL server. The process is this:
* Put data into staging table. I only stage the deltas which need upserting into the table.
* Run stored procedure which calculates updates and does updates followed by inserts into a `dbo` table.
* This is done by matching on `PKHash` (composite key hashed) and `RowHash` (the changes we're measuring hashed). These are both `varchar(256)`
The problem:
* Performance on this isn't great and I'd really like to improve this. It's taking over an hour to do a row comparison of ~1M rows against ~10M rows. I have an index on `PKHash` and `RowHash` on the `dbo` table but not on the staging table as this is dynamically created from Spark in SQL server. I can change that though.
* I would love to insert 1,000 rows at a time into a temp table and then process only 1,000 at a time batchwise, although there's a business requirement that either the whole thing succeeds or it all fails. I also have to capture the number of records updated or inserted into the table and log it elsewhere.
I'm not massively familiar with working with large data, so it'd be helpful to get some advice. Is there any way I can boost the performance on this and/or batch it up while still being able to roll back, as well as get row counts for updates and inserts?
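On the atomicity plus row-count requirement specifically: the whole comparison can run inside one explicit transaction while still capturing counts via @@ROWCOUNT, for example from Python with pyodbc. A rough sketch only; the connection string, staging/target table names, and the non-key columns in the SET/INSERT lists are placeholders, with just `PKHash`/`RowHash` taken from the description above:

```python
import pyodbc

# Placeholder connection string.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;DATABASE=mydb;"
    "Trusted_Connection=yes;",
    autocommit=False,  # everything below commits or rolls back as one unit
)
cur = conn.cursor()

try:
    # Index the staging table on the hash key after Spark creates it; the
    # join against the large dbo table is the usual bottleneck without it.
    cur.execute("CREATE CLUSTERED INDEX IX_stg_pk ON stg.Deltas (PKHash);")

    # UPDATE changed rows, then capture the affected-row count.
    cur.execute(
        """
        SET NOCOUNT ON;
        UPDATE t
        SET    t.ColA = s.ColA, t.RowHash = s.RowHash  -- placeholder columns
        FROM   dbo.Target AS t
        JOIN   stg.Deltas AS s ON s.PKHash = t.PKHash
        WHERE  t.RowHash <> s.RowHash;
        SELECT @@ROWCOUNT;
        """
    )
    updated = cur.fetchone()[0]

    # INSERT rows that do not exist yet, and capture that count too.
    cur.execute(
        """
        SET NOCOUNT ON;
        INSERT INTO dbo.Target (PKHash, RowHash, ColA)  -- placeholder columns
        SELECT s.PKHash, s.RowHash, s.ColA
        FROM   stg.Deltas AS s
        WHERE  NOT EXISTS (
            SELECT 1 FROM dbo.Target AS t WHERE t.PKHash = s.PKHash
        );
        SELECT @@ROWCOUNT;
        """
    )
    inserted = cur.fetchone()[0]

    conn.commit()  # the whole thing succeeds together...
    print(f"updated={updated}, inserted={inserted}")
except Exception:
    conn.rollback()  # ...or fails together, per the business requirement
    raise
```

Batching inside that same transaction (e.g. looping over ranges of `PKHash`) keeps atomicity while shrinking each individual join, though a long-running open transaction on a 100M-row table brings its own locking trade-offs.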
Cheers