/r/dataengineering

News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.

Read our wiki: https://dataengineering.wiki/

Rules:

  1. Don't be a jerk

  2. Limit Self-Promotion: Remember the reddit self-promotion rule of thumb: "For every 1 time you post self-promotional content, 9 other posts (submissions or comments) should not contain self-promotional content."

  3. Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.

  4. No job posts (posts or comments)

  5. No technical error/bug questions: Any error/bug question belongs on StackOverflow.

  6. Keep it related to data engineering

238,282 Subscribers

9

Build Scalable Real-Time ETL Pipelines with NATS and Pathway — Alternatives to Kafka & Flink

Hey everyone! I wanted to share a tutorial created by a member of the Pathway community that explores using NATS and Pathway as an alternative to a Kafka + Flink setup.

The tutorial includes step-by-step instructions, sample code, and a real-world fleet monitoring example. It walks through setting up basic publishers and subscribers in Python with NATS, then integrates Pathway for real-time stream processing and alerting on anomalies.

App template (with code and details):
https://pathway.com/blog/build-real-time-systems-nats-pathway-alternative-kafka-flink
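
For a rough idea of what the tutorial's NATS pub/sub step looks like, here is a minimal sketch using the nats-py client; this is not the tutorial's exact code, and the subject name and payload fields are made up:

```python
import asyncio
import json

import nats  # nats-py client; `pip install nats-py`

# Hypothetical subject and payload shape for the fleet-monitoring example;
# the tutorial's actual names may differ.
SUBJECT = "fleet.telemetry"

async def main():
    nc = await nats.connect("nats://localhost:4222")

    # Subscriber: print every telemetry message as it arrives.
    async def handler(msg):
        event = json.loads(msg.data.decode())
        print(f"received on {msg.subject}: {event}")

    await nc.subscribe(SUBJECT, cb=handler)

    # Publisher: emit one sample vehicle reading.
    reading = {"vehicle_id": "truck-42", "speed_kmh": 87.5, "fuel_pct": 31.2}
    await nc.publish(SUBJECT, json.dumps(reading).encode())

    await nc.flush()
    await asyncio.sleep(1)   # give the subscriber a moment to process
    await nc.drain()

asyncio.run(main())
```

Per the tutorial, Pathway then consumes the same subject through its NATS connector for the stream-processing and anomaly-alerting part.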

Key Takeaways:

  • Seamless Integration: Pathway’s NATS connectors simplify data ingestion.
  • High Performance & Low Latency: NATS handles rapid messaging; Pathway processes data on-the-fly.
  • Scalability & Reliability: NATS clustering and Pathway’s distributed workloads help with scaling and fault-tolerance.
  • Flexible Data Formats: JSON, plaintext, and raw bytes are supported.
  • Lightweight & Efficient: The NATS pub/sub model is less complex than a full Kafka deployment.
  • Advanced Analytics: Pathway supports real-time ML, graph processing, and complex transformations.

Would love to know what you think—any feedback or suggestions.

0 Comments
2024/12/19
13:35 UTC

0

Help please: we have to do a predictive analysis in R on the Spotify 2024 tracks dataset from Kaggle, but our residuals vs. fitted plot isn't coming out right.

It's a multiple linear regression with around 7 regressors. When we tried taking the log of the regressors to get a more linear plot, it kept throwing errors, apparently because there were NA or Inf values; even after using na.omit() we still didn't get an output. We tried other transformations like sqrt(), but to no avail. Any ideas on how to deal with this? Our class mainly covered regression. Is it possible to do clustering on this instead (given we have multiple fields, including qualitative ones)? Any links to sources on how to do that would be great.
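
The post is about R, but the underlying issue is language-agnostic: log() of zero or negative values produces -Inf/NaN after na.omit() has already run, so the bad values have to be removed (or avoided with log1p) after the transform. A rough sketch of that idea in Python/pandas, with made-up column names, purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names; the Kaggle Spotify 2024 dataset will differ.
df = pd.read_csv("spotify_2024_tracks.csv")
regressors = ["streams", "playlist_count", "danceability"]  # placeholder regressors
target = "popularity"

# log() of zero or negative values produces -inf/NaN *after* the transform,
# so dropping NAs beforehand won't catch them. log1p tolerates zeros; the
# explicit inf -> NaN replacement catches anything that slips through.
for col in regressors:
    df[col] = np.log1p(df[col].clip(lower=0))

df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=regressors + [target])

X = sm.add_constant(df[regressors])
model = sm.OLS(df[target], X).fit()
print(model.summary())
```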

0 Comments
2024/12/19
13:34 UTC

3

Want to get into Data Engineering

My current job is a Data Admin, and I already have experience as a Data Analyst. I also have a degree in Computer Science.

What roles should I go for, or what certifications should I try getting?

9 Comments
2024/12/19
11:53 UTC

5

Which tools are you using to communicate data architecture to non-techies?

I’m frustrated because I’m not that great at communicating with words 🤣 I always have to show something visually alongside my explanation. What tools are you using? Curious to hear :)

12 Comments
2024/12/19
11:52 UTC

2

The Essential Role of Data Verification in Healthcare

Patient safety relies heavily on accurate and reliable data. In healthcare, data verification ensures that critical information—like medical records, diagnoses, and prescriptions—is accurate and up-to-date.

Without proper verification, errors can compromise patient care and safety. This blog highlights why data verification is vital for maintaining data integrity in healthcare systems.

Check it out here: Ensuring Patient Safety and Data Integrity

How does your organization handle data verification?

1 Comment
2024/12/19
10:45 UTC

27

How much Github Actions should I know as a data engineer?

Basically the title. I really don't want to deep-dive into it, get lost in the process, and end up becoming a DevOps engineer. Do you have any recommended materials?

Thanks!

14 Comments
2024/12/19
09:44 UTC

13

Airflow on Windows

Are there any disadvantages to using Apache Airflow on Windows with Docker, or should I consider Prefect instead since it runs natively on Windows?

I feel that Airflow’s UI and features are better than Prefect’s, though.

My main requirement is to run orchestration workflows on a Windows system.

11 Comments
2024/12/19
06:27 UTC

19

Time to move after 3 months at a new company?

Hi there,

My current company is small, but little did I know their DB is only 10GB and isn't expected to grow much in the next couple of years. Their entire process is just:

Application ------> Azure OLTP Db

No pipelines, no reporting database—nothing fancy. I’d love to suggest improvements, but honestly, anything beyond what they have now would feel like overkill.

Before I joined, I was told Fabric, Spark, and a DW were in the future. However, I have seen their future plans and it's not good at all; they are not planning to change anything.

I have another job offer that uses Spark, GCP, and other tools I used to work with, and I'd rather work with newer tech than what I am doing right now.

Am I crazy for switching after 3 months?

16 Comments
2024/12/19
04:24 UTC

2

Should I Swap Companies?

I graduated with 1 year of internship experience in May 2023 and have worked at my current company since August 2023. I make around 72k after the yearly salary increase. My boss told me about 6 months ago that I would be receiving a promotion to senior data engineer due to my work and mentoring our new hire, but has since told me HR will not allow me to be promoted to senior until 2026. So I'll likely get a small raise this year (probably to about 80k after negotiating) and be promoted to senior in 2026, which would be around 100k. However, I may receive another offer for a data engineer position at around 95k plus bonus. Would it be worth it to leave my current job, or should I stay for the almost-guaranteed senior position? I'm wondering which is more valuable long term.

It is also noteworthy that my current job is in healthcare industry and the new job offer would be in the financial services industry. The new job would also be using a more modern stack.

I am also doing my MSCS at Georgia Tech right now and know that will probably help with career prospects in 2026.

I guess I know the new job offer is better, but I'm wondering if it will look bad for me to swap after only 1.3 years. I'm also wondering if the senior title is worth staying at a lower-paying job for an extra year. I'd also like to get out of healthcare eventually since it's lower paying, but I'm not sure if I should do that now or whether I'll have opportunities later.

29 Comments
2024/12/19
03:37 UTC

5

For students interested in DE, what classes are must have in university?

Like, obviously Python is a big one. And I'm assuming data warehousing and database foundations.

What are some others?

8 Comments
2024/12/19
01:42 UTC

2

Any Data Engineers w/ K12 Education Experience?

More or less the question is in the title. Have some contracts coming up soon and will need some additional hands. Would be interested in talking to some people, experience in Airflow / Big Query is a plus - but I know there's a lot of different flavors of the same thing out there.

I'd also be interested in just hearing about general common issues or problems you've run into working in education. The most common thing I see so far is too many SaaS platforms that are redundant with one another or are used by some schools but not all.

3 Comments
2024/12/19
00:10 UTC

8

Selecting stack for time-series data dashboard with future IoT integration

Greetings,

I'm building a data dashboard that needs to handle: 

  • Time-series performance metrics (~500KB initially)
  • Near-future IoT sensor integration 
  • Small group of technical users (<10) 
  • Interactive visualizations and basic analytics
  • Future ML integration planned 

My background:

Intermediate Python, basic SQL, learning JavaScript. Looking to minimize complexity while building something scalable. 

Stack options I'm considering: 

  1. Streamlit + PostgreSQL 
  2. Plotly Dash + PostgreSQL 
  3. FastAPI + React + PostgreSQL 

Planning to deploy on Digital Ocean, but welcome other hosting suggestions.

Main priorities: 

  •  Quick MVP deployment 
  • Robust time-series data handling 
  • Multiple data source integration 
  • Room for feature growth 

Would appreciate input from those who've built similar platforms. Are these good options? Any alternatives worth considering?
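
If it helps, here is roughly what option 1 (Streamlit + PostgreSQL) looks like for a time-series view; this is a minimal sketch with a placeholder connection string, table, and column names:

```python
# streamlit_app.py -- run with: streamlit run streamlit_app.py
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Hypothetical connection string; adjust to your PostgreSQL setup.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/metrics")

st.title("Performance metrics")

@st.cache_data(ttl=60)  # re-query at most once a minute
def load_metrics() -> pd.DataFrame:
    query = """
        SELECT recorded_at, metric_name, value
        FROM performance_metrics            -- placeholder table/columns
        WHERE recorded_at > now() - interval '7 days'
        ORDER BY recorded_at
    """
    return pd.read_sql(query, engine)

df = load_metrics()
metric = st.selectbox("Metric", sorted(df["metric_name"].unique()))
series = df[df["metric_name"] == metric].set_index("recorded_at")["value"]
st.line_chart(series)
```

Dash or FastAPI + React buy more layout control at the cost of more code; for fewer than 10 technical users and a quick MVP, the Streamlit route is usually the least work.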

10 Comments
2024/12/18
23:34 UTC

1

Securing trino backends

How are folks securing backend resources in Trino? Currently we're using file-based access control. I'm not even sure I'm wording this correctly, but we want to use Azure users and groups, plus policies based on catalog data, to determine access.

Is anyone using catalog data and groups to manage access like that? What does your stack look like?

Thx

2 Comments
2024/12/18
23:26 UTC

2

New Podcast - Snowflake Cluster Keys to Maximise Query Performance

Curious how to maximize Snowflake query performance using cluster keys? Check out this podcast.

https://youtu.be/rh4j4nJY8pU?si=UduG4iik_aRPB3LE

2 Comments
2024/12/18
18:21 UTC

15

How big a pipeline can one person manage?

If you were to measure it in terms of the number of jobs and tables? Assume a 24-hour SLA and daily batches.

25 Comments
2024/12/18
17:22 UTC

41

Which tasks are you performing in your current ETL job and which tool are you using?

What tasks are you performing in your current ETL job and which tool are you using? How much data are you processing/moving? Complexity?

How is the automation being done?

23 Comments
2024/12/18
17:04 UTC

2

dbt calls into an ECS task run from Airflow DAGs

Hello,

I host a dbt service in an ECS container. When I call it from Airflow it works perfectly, but it launches the service each time I call it.

For example, if I have this in my DAG:
dbt seed --model XXX >> dbt test --selector data_silver >> dbt compile --selector data_silver ... etc.

it will launch my ECS task X times.

My question is pretty simple: how can I put a 5-minute timeout on my ECS task and reuse the same one? That way I'd also have a single CloudWatch log for all the dbt calls, which is a time saver when I look at CloudWatch.

I don't want to use an ECS service, as you pay for a service that is always on.

I did the same thing with Cloud Run and a timeout.
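
Not a full answer, but one way to end up with a single ECS task (and a single CloudWatch log stream) is to chain the dbt commands inside one task run. A sketch assuming the Amazon provider's EcsRunTaskOperator, with cluster, subnet, task definition, and selectors as placeholders:

```python
from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Chain the dbt commands with '&&' so one ECS task (and one CloudWatch log
# stream) covers the whole run instead of one task per command.
DBT_CMD = (
    "dbt seed --select XXX "
    "&& dbt test --selector data_silver "
    "&& dbt compile --selector data_silver"
)

with DAG(
    dag_id="dbt_single_ecs_task",
    start_date=pendulum.datetime(2024, 12, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    run_dbt = EcsRunTaskOperator(
        task_id="run_dbt",
        cluster="my-ecs-cluster",                 # placeholder
        task_definition="dbt",                    # placeholder
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "dbt", "command": ["sh", "-c", DBT_CMD]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {"subnets": ["subnet-xxxxxxxx"]}
        },
        awslogs_group="/ecs/dbt",                 # stream task logs back into Airflow
        awslogs_stream_prefix="ecs/dbt",
        execution_timeout=timedelta(minutes=30),  # Airflow-side timeout on the run
    )
```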

2 Comments
2024/12/18
16:27 UTC

3

Airflow monitoring is broken?

I followed the docs for Airflow 2.0.2 to expose metrics via StatsD, and I used the statsd exporter so that Prometheus can scrape them. However, I see that scheduler-related metrics and some other useful metrics are not being exposed, even though they are mentioned in the docs. Has anyone faced this issue?

1 Comment
2024/12/18
15:55 UTC

10

Data Engineering in Azure Synapse Analytics

The primary tool my team has is Azure Synapse Analytics. We also have Azure Function Apps and Logic Apps. We may be able to get additional Azure resources, but we are basically limited to Azure/Microsoft products (as well as GitHub). Given this limitation, are there any recommendations for pipelines/workflows? The basic process now is to use Azure Synapse pipelines and dataflows or notebooks. GitHub is what we want to use for source control, but that has proven problematic (users can’t publish straight from the Synapse workspace, and we really aren’t sure where the changes are supposed to be approved).

4 Comments
2024/12/18
14:57 UTC

38

How do you practice and hone your SQL skills?

I am able to formulate a query given a situation, but sometimes it takes me a lot of time to come up with even a simple query. I am practising SQL with DataLemur problems and sometimes LeetCode. What would you recommend as the right approach?

33 Comments
2024/12/18
14:16 UTC

3

DBT or Apache Spark for creating analytics models?

Currently, I have a PostgreSQL database with some raw_data tables that I process using DBT. Basically, my workflow for processing these data looks like this:

  1. Group raw data from various tables into concept tables (for example, cost data coming from different sources and stored in separate tables are standardized and stored in a single "costs" table).
  2. Perform the actual analytical processing by joining these concept tables to create pre-visualization tables (in a metric-dimension model).

I’ve managed to make all the models in step 1 incremental in DBT, using an upsert logic based on a unique ID and only processing what’s been exported to the raw tables in the last X minutes (e.g., 30 minutes).

However, in step 2, I end up having to recalculate all the data in all the models as full tables in DBT. Since these are cross-sourced data, I can’t filter them simply based on what’s been exported in the last X minutes. This makes step 2 extremely slow in my processing pipeline.

To optimize this, I've been wondering whether dbt is actually the best tool for the reprocessing step that generates the analytical models I consume in my visualization tools. Or should I look into distributed processing tools (e.g., Apache Spark) for step 2 to generate these metric-dimension tables?

Have you ever faced this kind of issue? Did you go with a similar approach? Do you recommend using DBT for this or some other solution? These are some of the questions I’ve been grappling with.

EDIT: Just one thing I forgot to mention. I'm working with a medium volume of data—there’s about 100GB stored in the database already. However, besides this data volume, the processing queries in step 2 are quite complex and involve a lot of JOINs, which makes it the slowest step in the pipeline.
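
For reference, if step 2 did move to Spark, it would look roughly like this; a sketch with hypothetical table and column names, reading the concept tables from PostgreSQL over JDBC:

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("metric_dimension_build")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

JDBC_URL = "jdbc:postgresql://localhost:5432/analytics"   # placeholder
JDBC_PROPS = {"user": "analytics", "password": "secret", "driver": "org.postgresql.Driver"}

def read_table(name):
    # Pull a concept table from PostgreSQL into a Spark DataFrame.
    return spark.read.jdbc(JDBC_URL, name, properties=JDBC_PROPS)

costs = read_table("costs")
orders = read_table("orders")        # hypothetical second concept table

# Join the concept tables and aggregate into a metric-dimension table.
metrics = (
    orders.join(costs, ["account_id", "month"], "left")
    .groupBy("account_id", "month")
    .agg(
        F.sum("order_value").alias("revenue"),
        F.sum("cost_value").alias("cost"),
    )
    .withColumn("margin", F.col("revenue") - F.col("cost"))
)

metrics.write.jdbc(JDBC_URL, "metric_dimension_monthly", mode="overwrite", properties=JDBC_PROPS)
```

That said, at roughly 100GB a well-indexed Postgres with tuned incremental dbt models may still be competitive; the JDBC round trip to and from Spark is often the bottleneck.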

4 Comments
2024/12/18
14:07 UTC

3

Options for replication from AS400 Db2 to Fabric lakehouse

Hey, I'm a DBA whose ETL experience is limited to SSIS. The shop I work at is migrating our infrastructure to Fabric. We have a consultant setting up replication from our AS400 to a Fabric lakehouse, but we're running into these issues:

  • Latency is more than 15 minutes

  • Since we have a lakehouse instead of a warehouse, the SQL endpoint cannot be used to write data. This led to:

  • The target is manually created Parquet files and Delta logs, which the lakehouse does not recognize as a table. To work around this, we use table-valued functions and views to create a simulated table to query.

This seems like an unnecessary workaround, but I'm not familiar enough with modern data engineering to know what a better solution might look like. What would be an option for us to stream data from our Java-based AS400 CDC tool into Fabric? I've suggested ADF and Spark, but both have been rejected for being too inefficient to keep latency below 15 minutes. Since we built the CDC tool, we can modify it as needed.
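
Not an endorsement of any particular tool, but the "lakehouse doesn't see it as a table" symptom usually comes from writing Parquet files plus hand-rolled logs instead of going through a Delta writer. A sketch of a proper Delta append from Python with the delta-rs package (the OneLake path is a placeholder, and authentication via storage_options is omitted); Delta writer libraries also exist for the JVM, so the same idea could live inside your Java CDC tool:

```python
import pandas as pd
from deltalake import write_deltalake  # delta-rs; `pip install deltalake`

# A tiny hypothetical CDC micro-batch.
changes = pd.DataFrame(
    {
        "order_id": [1001, 1002],
        "status": ["SHIPPED", "CANCELLED"],
        "changed_at": pd.to_datetime(["2024-12-18 13:00", "2024-12-18 13:02"]),
    }
)

# Appending through a Delta writer maintains the _delta_log transaction log,
# so the lakehouse sees a real table rather than loose Parquet files.
# Placeholder OneLake path; pass credentials via storage_options in practice.
write_deltalake(
    "abfss://workspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/orders_cdc",
    changes,
    mode="append",
)
```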

3 Comments
2024/12/18
13:11 UTC

7

More generic DBT for access management

I have built a fairly big data access control system on our Redshift using RBAC. I implemented it with Liquibase. On each run, all external tables, roles, and user permissions are recreated. The problem is that rerunning everything every time is extremely slow, and I don't know how to create dependencies between changesets.

I need something that builds a graph like dbt does, so I could run the downstream/upstream changes for all modified changesets. Do you know another tool for building graph relationships, or how to implement this in dbt / Liquibase?

I know I could use Airflow / Dagster to build the graph relationships from scratch, but I love how dbt's ref() automatically creates the graph.

Essentially I need dbt, except instead of creating views/models it would grant permissions.
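
Not dbt, but the graph part on its own is small; a sketch using only the Python standard library's graphlib to declare changesets with explicit dependencies and apply a modified changeset plus everything downstream of it, in order (the changeset names and SQL here are invented):

```python
from graphlib import TopologicalSorter

# Hypothetical changesets: name -> (SQL to run, names it depends on)
CHANGESETS = {
    "external_schema": ("CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum ...;", set()),
    "role_analyst": ("CREATE ROLE analyst;", set()),
    "grant_spectrum_to_analyst": (
        "GRANT USAGE ON SCHEMA spectrum TO ROLE analyst;",
        {"external_schema", "role_analyst"},
    ),
}

def downstream_of(modified: set[str]) -> list[str]:
    """Return the modified changesets plus everything that depends on them, ordered."""
    # Invert the dependency map so we can walk children (dependents).
    children: dict[str, set[str]] = {name: set() for name in CHANGESETS}
    for name, (_, deps) in CHANGESETS.items():
        for dep in deps:
            children[dep].add(name)

    affected, stack = set(modified), list(modified)
    while stack:
        for child in children[stack.pop()]:
            if child not in affected:
                affected.add(child)
                stack.append(child)

    # Topological order: dependencies first, dependents after.
    order = TopologicalSorter({n: deps for n, (_, deps) in CHANGESETS.items()}).static_order()
    return [n for n in order if n in affected]

for name in downstream_of({"role_analyst"}):
    sql, _ = CHANGESETS[name]
    print(f"-- applying {name}\n{sql}")   # replace print with an execute against Redshift
```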

0 Comments
2024/12/18
12:38 UTC

3

How to Automatically Categorize Construction Products in an SQL Database?

Hi everyone! I’m working with an SQL database containing hundreds of construction products from a supplier. Each product has a specific name (e.g., Adesilex G19 Beige, Additix PE), and I need to assign a general product category (e.g., Adhesives, Concrete Additives).

The challenge is that the product names are not standardized, and I don’t have a pre-existing mapping or dictionary. To identify the correct category, I would typically need to look up each product's technical datasheet, which is impractical given the large volume of data.

Example:

product_code | product_name
2419926      | Additix P bucket 0.9 kg (box of 6)
410311       | Adesilex G19 Beige unit 10 kg

I need to add a column like this:

general_product_category
Concrete Additives
Adhesives

How can I automate this categorization without manually checking every product's technical datasheet? Are there tools, Python libraries, or SQL methods that could help with text analysis, pattern matching, or even online lookups?

Any help or pointers would be greatly appreciated! Thanks in advance 😊
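
One low-tech starting point is rule-based matching on the product-family token at the start of the name, with fuzzy matching against already-labelled names as a fallback. A sketch where the keyword-to-category table is invented and would have to be built up from the supplier's catalog:

```python
import re
from difflib import get_close_matches

import pandas as pd

# Hypothetical keyword -> category mapping; extend as families are identified.
KEYWORD_TO_CATEGORY = {
    "adesilex": "Adhesives",
    "additix": "Concrete Additives",
    "mapelastic": "Waterproofing",
}

def categorize(name: str, known: dict[str, str]) -> str | None:
    lowered = name.lower()
    # 1) Direct keyword hit anywhere in the product name.
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in lowered:
            return category
    # 2) Fuzzy match on the first token (the product family) against
    #    names that were already categorized manually.
    family = re.split(r"\s+", lowered)[0]
    match = get_close_matches(family, list(known), n=1, cutoff=0.85)
    return known[match[0]] if match else None

products = pd.DataFrame(
    {
        "product_code": [2419926, 410311],
        "product_name": ["Additix P bucket 0.9 kg (box of 6)", "Adesilex G19 Beige unit 10 kg"],
    }
)
already_labelled: dict[str, str] = {}  # e.g. {"adesilex": "Adhesives"}, grown over time
products["general_product_category"] = products["product_name"].apply(
    categorize, known=already_labelled
)
print(products)
```

Anything the rules leave as None can go to a manual (or LLM-assisted) review queue rather than being looked up datasheet by datasheet.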

12 Comments
2024/12/18
11:38 UTC

4

How do downstream teams and stakeholders review your changes?

We have a data product (SQL tables), and new migrations are coming that might be a breaking change for downstream teams. The data product is stored in Databricks and also Snowflake (exactly the same data, duplicated for different stakeholders' needs), and we have staging and production environments. The problem is that whenever we have a breaking change, we push it to staging and wait a couple of days for the stakeholders; if they give us the green light, we proceed. But this becomes a bottleneck if something else needs to be deployed to production in the meantime, and we then have to revert the changes. The process of moving to staging and reverting is cumbersome, and the team doesn't agree on having a feature flag (because staging and production would then differ, and they don't like if-conditions). Curious to know how you do reviews and get approval from downstream teams?

IMO, once we have agreed on the plans and changes and communicated them downstream, we should not be dependent on extra table verification from their side, but the team does not agree.

2 Comments
2024/12/18
11:14 UTC

28

SQL - Working with large data (10M rows) efficiently but with a lot of restrictions?

Hello,

I'm currently working on upserting into a 100M-row table in SQL Server. The process is this:

* Put data into staging table. I only stage the deltas which need upserting into the table.

* Run stored procedure which calculates updates and does updates followed by inserts into a `dbo` table.

* This is done by matching on `PKHash` (composite key hashed) and `RowHash` (the changes we're measuring hashed). These are both `varchar(256)`

The problem:

* Performance on this isn't great and I'd really like to improve this. It's taking over an hour to do a row comparison of ~1M rows against ~10M rows. I have an index on `PKHash` and `RowHash` on the `dbo` table but not on the staging table as this is dynamically created from Spark in SQL server. I can change that though.

* I would love to insert 1000 rows at a time into a temp table and then process them in batches of 1000, although there's a business requirement that either the whole thing succeeds or it fails. I also have to capture the number of records updated or inserted and log it elsewhere.

Not massively familiar with working with large data, so it'd be helpful to get some advice. Is there any way I can boost the performance on this and/or batch it up while still being able to roll back, as well as get row counts for updates and inserts?

Cheers
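
For what it's worth, a sketch of the usual shape of this (not tuned for your data): a single MERGE matched on PKHash with the RowHash comparison in the WHEN MATCHED clause, OUTPUT $action for the update/insert counts, and the whole thing in one transaction so it's all-or-nothing. Table and column names (payload etc.) are placeholders.

```python
import pyodbc

MERGE_BATCH = """
SET NOCOUNT ON;

DECLARE @actions TABLE (act nvarchar(10));

MERGE dbo.target WITH (HOLDLOCK) AS t          -- placeholder target table
USING stg.deltas AS s                          -- placeholder staging table
      ON t.PKHash = s.PKHash
WHEN MATCHED AND t.RowHash <> s.RowHash THEN
    UPDATE SET t.RowHash = s.RowHash,
               t.payload = s.payload           -- placeholder columns
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PKHash, RowHash, payload)
    VALUES (s.PKHash, s.RowHash, s.payload)
OUTPUT $action INTO @actions;

SELECT ISNULL(SUM(CASE WHEN act = 'UPDATE' THEN 1 ELSE 0 END), 0) AS updated,
       ISNULL(SUM(CASE WHEN act = 'INSERT' THEN 1 ELSE 0 END), 0) AS inserted
FROM @actions;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;DATABASE=mydb;"
    "Trusted_Connection=yes;TrustServerCertificate=yes;",
    autocommit=False,       # everything below runs in one transaction
)
try:
    cur = conn.cursor()
    cur.execute(MERGE_BATCH)
    updated, inserted = cur.fetchone()   # counts to log elsewhere
    conn.commit()                        # all-or-nothing
    print(f"updated={updated}, inserted={inserted}")
except Exception:
    conn.rollback()
    raise
```

An index on PKHash (and ideally RowHash) on the staging side generally matters as much as the indexes on the target; since the staging table is created dynamically from Spark, adding the index right after the load is usually worth the extra minute.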

25 Comments
2024/12/18
10:45 UTC
