/r/dataengineering
News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.
Read our wiki: https://dataengineering.wiki/
Rules:
Don't be a jerk
Limit Self-Promotion: Remember the reddit self-promotion rule of thumb: "For every 1 time you post self-promotional content, 9 other posts (submissions or comments) should not contain self-promotional content."
Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.
No job posts (posts or comments)
No technical error/bug questions: Any error/bug question belongs on StackOverflow.
Keep it related to data engineering
Hey everyone! I wanted to share a tutorial created by a member of the Pathway community that explores using NATS and Pathway as an alternative to a Kafka + Flink setup.
The tutorial includes step-by-step instructions, sample code, and a real-world fleet monitoring example. It walks through setting up basic publishers and subscribers in Python with NATS, then integrates Pathway for real-time stream processing and alerting on anomalies.
App template (with code and details):
https://pathway.com/blog/build-real-time-systems-nats-pathway-alternative-kafka-flink
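For anyone who hasn't touched NATS from Python, the basic publish/subscribe pattern the tutorial builds on looks roughly like this. This is a minimal sketch, not the tutorial's code: it assumes the nats-py client and a NATS server on localhost, and the `fleet.telemetry` subject name is made up for illustration.

```python
import asyncio
import nats  # nats-py client


async def main():
    # Connect to a locally running NATS server (adjust the URL as needed).
    nc = await nats.connect("nats://localhost:4222")

    # Subscriber: print every telemetry message received on the subject.
    async def handle(msg):
        print(f"[{msg.subject}] {msg.data.decode()}")

    await nc.subscribe("fleet.telemetry", cb=handle)

    # Publisher: send a couple of sample readings on the same subject.
    for reading in (b'{"truck": 1, "speed": 92}', b'{"truck": 2, "speed": 55}'):
        await nc.publish("fleet.telemetry", reading)

    await nc.flush()          # make sure the messages hit the server
    await asyncio.sleep(0.5)  # give the subscriber a moment to process
    await nc.drain()          # clean shutdown


if __name__ == "__main__":
    asyncio.run(main())
```

Pathway then sits on top of a stream like this for the windowing and anomaly alerting; the linked template shows that part end to end.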
Key Takeaways:
Would love to know what you think. Any feedback or suggestions are welcome!
It's a multiple linear regression with around 7 regressors. When we tried taking the log of the regressors to get a more linear plot, it kept throwing an error, apparently because there were NA or Inf values; even after using na.omit() we still didn't get an output. We tried other transformations like sqrt(), but to no avail. Any ideas on how to deal with this? Our class mainly taught regression. Is it possible to do clustering on this data (given we have multiple fields, including ones with qualitative values)? Any links to sources on how to do this would be great.
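For what it's worth, the usual culprit here is that taking the log of zero or negative values produces -Inf/NaN, and na.omit() drops NA but not Inf, so the fitting function still errors out. Here is the same cleanup expressed in Python/pandas terms (a rough sketch only; the column names x1..x7 and y are hypothetical, and the class's workflow is in R):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data with 7 regressors x1..x7 and a response y.
df = pd.read_csv("data.csv")
regressors = [f"x{i}" for i in range(1, 8)]

# log of zero/negative values yields -inf or NaN; log1p on values clipped
# at zero (or shifting by a small constant) sidesteps that.
logged = np.log1p(df[regressors].clip(lower=0))

# Replace any remaining +/-inf with NaN, then drop incomplete rows.
# This is the step na.omit() alone doesn't cover, since it ignores Inf.
clean = pd.concat([df["y"], logged], axis=1)
clean = clean.replace([np.inf, -np.inf], np.nan).dropna()

model = sm.OLS(clean["y"], sm.add_constant(clean[regressors])).fit()
print(model.summary())
```

The R equivalent is filtering with is.finite() (or using log1p()) before the lm() call. Clustering is possible too, but mixed qualitative/quantitative fields usually need encoding first or a mixed-data method such as k-prototypes.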
My current job is a Data Admin, and I already have experience as a Data Analyst. I also have a degree in Computer Science.
What roles should I go for, and what certifications should I try getting?
I'm frustrated because I'm not that great at communicating with words 🤣 I always have to show something visually alongside my explanation. What tools are you using? Curious to hear :)
Patient safety relies heavily on accurate and reliable data. In healthcare, data verification ensures that critical information—like medical records, diagnoses, and prescriptions—is accurate and up-to-date.
Without proper verification, errors can compromise patient care and safety. This blog highlights why data verification is vital for maintaining data integrity in healthcare systems.
Check it out here: Ensuring Patient Safety and Data Integrity
How does your organization handle data verification?
Basically the title. I really don't want to deep-dive into it, get lost in the process, and end up becoming a DevOps engineer. Do you have any recommended materials?
Thanks!
My main requirement is to run orchestration workflows on a Windows system.
Are there any disadvantages to using Apache Airflow on Windows with Docker, or should I consider Prefect instead, since it runs natively on Windows?
I feel that Airflow's UI and features are better than Prefect's, though.
Hi there,
My current company is small, and little did I know, their DB size is only 10GB and isn't expected to grow much over the next couple of years. Their entire process is just:
Application ------> Azure OLTP Db
No pipelines, no reporting database—nothing fancy. I’d love to suggest improvements, but honestly, anything beyond what they have now would feel like overkill.
Before I joined, I was told Fabric, Spark, and a DW were in the future. However, I have seen their future plans and it's no good at all; they aren't planning to change anything.
I have another job offer that uses Spark, GCP, and other tools I used to work with, and I'd rather work with newer tech than what I'm doing right now.
Am I crazy for switching after 3 months?
I graduated with 1 year of internship experience in May 2023 and have worked at my current company since August 2023. I make around 72k after the yearly salary increase. My boss told me about 6 months ago that I would be receiving a promotion to senior data engineer due to my work and mentoring our new hire, but has since told me HR will not allow me to be promoted to senior until 2026. So I'll likely be getting a small raise this year (probably to about 80k after negotiating) and be promoted to senior in 2026, which would put me around 100k. However, I may receive another offer for a data engineer position at around 95k plus bonus. Would it be worth it to leave my current job, or stay for the almost-guaranteed senior position? Wondering which is more valuable long term.
It is also noteworthy that my current job is in healthcare industry and the new job offer would be in the financial services industry. The new job would also be using a more modern stack.
I am also doing my MSCS at Georgia Tech right now and know that will probably help with career prospects in 2026.
I guess I know the new job offer is better, but I'm wondering if it will look too bad for me to swap after only 1.3 years. I'm also wondering whether the senior title is worth staying at a lower-paying job for an extra year. I'd also like to get out of healthcare eventually since it's lower paying, but I'm not sure whether I should do that now or will have opportunities later.
Like ofc, Python is a big one. And data warehousing, I'm assuming, plus database foundations.
What are some others?
More or less, the question is in the title. I have some contracts coming up soon and will need some additional hands. I'd be interested in talking to some people; experience in Airflow / BigQuery is a plus, but I know there are a lot of different flavors of the same thing out there.
I'd also be interested in just hearing about some general common issues or problems you've run into working in education. The most common thing I see so far is having too many SaaS platforms that are all redundant, or are being used by some schools but not all.
Greetings,
I'm building a data dashboard that needs to handle:
My background:
Intermediate Python, basic SQL, learning JavaScript. Looking to minimize complexity while building something scalable.
Stack options I'm considering:
Planning to deploy on Digital Ocean, but welcome other hosting suggestions.
Main priorities:
Would appreciate input from those who've built similar platforms. Are these good options? Any alternatives worth considering?
How are folks securing backend resources in Trino? Currently we're using file-based access control. I'm not even sure I'm approaching this correctly, but we want to use Azure users and groups, plus policies based on catalog data, to determine access.
Is anyone using catalog data and groups to manage access like that? What does your stack look like?
Thx
Curious about how to maximize Snowflake query performance using Cluster Keys. Check out this podcast.
If you were to measure in terms of number of jobs and tables? 24 hour SLA, daily batches
What tasks are you performing in your current ETL job and which tool are you using? How much data are you processing/moving? Complexity?
How is the automation being done?
Hello,
I host a dbt service in an ECS container. When I call it from Airflow, it works perfectly, but it launches the service each time I call it.
For example, if I have this in my DAG:
dbt seed --model XXX >> dbt test --selector data_silver >> dbt compile --selector data_silver ... etc
it will call my ECS task X times.
My question is pretty simple: how can I set a 5-minute timeout on my ECS task and reuse the same one? At the same time, I'd get one CloudWatch log for all the dbt calls, which is a time saver when I'm looking at CloudWatch.
I don't want to use an ECS service, as you pay for a service that's always on.
I did the same thing with Cloud Run and a timeout.
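One pattern that avoids launching a container per dbt command is to pass the whole chain as a single command override to one ECS task, so Airflow starts the container once, the 5-minute cap is enforced by the task timeout, and everything lands in one CloudWatch log stream. A rough sketch assuming the Amazon provider's EcsRunTaskOperator; the cluster, task definition, container name, and log group are placeholders, and the dbt flags should match your dbt version:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Run every dbt step sequentially inside ONE container invocation instead
# of one ECS task per dbt command.
DBT_CMD = (
    "dbt seed --select XXX && "
    "dbt test --selector data_silver && "
    "dbt compile --selector data_silver"
)

with DAG(
    dag_id="dbt_silver",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_dbt = EcsRunTaskOperator(
        task_id="run_dbt_silver",
        cluster="my-ecs-cluster",        # placeholder
        task_definition="dbt-task-def",  # placeholder
        launch_type="FARGATE",           # Fargate also needs network_configuration
        overrides={
            "containerOverrides": [
                {"name": "dbt", "command": ["sh", "-c", DBT_CMD]}
            ]
        },
        execution_timeout=timedelta(minutes=5),  # hard 5-minute cap on the task
        awslogs_group="/ecs/dbt",                # placeholder log group
        awslogs_stream_prefix="dbt",
    )
```

The trade-off is coarser retries (one failed step fails the whole command), so splitting into two or three grouped invocations is a middle ground if that matters.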
I followed the docs for Airflow 2.0.2 to expose metrics via StatsD. I used statsd_exporter so that Prometheus can scrape the metrics. However, I see that scheduler-related metrics and some other useful metrics are not being exposed, even though they are mentioned in the docs. Has anyone faced this issue?
The primary tool my team has is Azure Synapse Analytics. We also have Azure Functions apps and Logic Apps. We may be able to get additional Azure resources, but we are basically limited to Azure/Microsoft products (as well as GitHub). Given this limitation, are there any recommendations for pipelines/workflows? The basic process now is to use Azure Synapse pipelines and dataflows or notebooks. GitHub is what we want to use for source control, but that has proven problematic (users can't publish straight from the Synapse workspace, and we really aren't sure where changes are supposed to be approved).
I am able to formulate a query given a situation, but sometimes it takes me a lot of time to come up with even a simple query. I am practicing my SQL with DataLemur SQL problems and sometimes LeetCode. What would you recommend as the right approach?
Currently, I have a PostgreSQL database with some raw_data tables that I process using DBT. Basically, my workflow for processing these data looks like this:
I’ve managed to make all the models in step 1 incremental in DBT, using an upsert logic based on a unique ID and only processing what’s been exported to the raw tables in the last X minutes (e.g., 30 minutes).
However, in step 2, I end up having to recalculate all the data in all the models as full tables in DBT. Since these are cross-sourced data, I can’t filter them simply based on what’s been exported in the last X minutes. This makes step 2 extremely slow in my processing pipeline.
To optimize, I've been wondering whether dbt is actually the best tool for this reprocessing step that generates the analytical models I consume in my visualization tools. Or should I look into distributed processing tools (e.g., Apache Spark) for step 2 to generate these metric-dimension tables?
Have you ever faced this kind of issue? Did you go with a similar approach? Do you recommend using DBT for this or some other solution? These are some of the questions I’ve been grappling with.
EDIT: Just one thing I forgot to mention. I'm working with a medium volume of data—there’s about 100GB stored in the database already. However, besides this data volume, the processing queries in step 2 are quite complex and involve a lot of JOINs, which makes it the slowest step in the pipeline.
Hey, I'm a DBA whose ETL experience is limited to SSIS. The shop I work at is migrating our infrastructure to Fabric. We have a consultant setting up replication from our AS400 to a Fabric lakehouse, but we're running into these issues:
Latency is more than 15 minutes
Since we have a lakehouse instead of a warehouse, the SQL endpoint cannot be used to write data. This led to:
The target is manually created Parquet files and Delta logs, which the lakehouse does not recognize as a table. To work around this, we have table-valued functions and views that create a simulated table we can then use.
This seems like an unnecessary workaround, but I'm not familiar enough with modern data engineering to know what a better solution might look like. What would be an option for us to stream data from our Java-based AS400 CDC tool into Fabric? I've suggested ADF and Spark, but both have been rejected for being too inefficient to keep latency below 15 minutes. Since we built the CDC tool, we can modify it as needed.
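For what it's worth (and acknowledging Spark has already been rejected here on latency grounds), the reason the lakehouse won't treat the files as a table is usually that the Parquet plus _delta_log isn't being produced by a Delta writer and registered under the lakehouse's Tables area. Below is a minimal sketch of what a recognized write looks like from Spark Structured Streaming; the source, topic, checkpoint path, and table name are all placeholders, since your CDC tool is custom:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("as400_cdc_to_lakehouse").getOrCreate()

# Read the change feed from whatever streaming source the homegrown AS400
# CDC tool can publish to (Kafka is shown purely for illustration).
changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "as400_cdc")                   # placeholder topic
    .load()
)

# Writing through the Delta writer produces a _delta_log the lakehouse
# understands, and toTable() registers it as a queryable lakehouse table.
query = (
    changes.writeStream.format("delta")
    .option("checkpointLocation", "Files/checkpoints/as400_cdc")  # placeholder
    .outputMode("append")
    .toTable("as400_orders_cdc")                                  # placeholder name
)
query.awaitTermination()
```

If Spark stays off the table, the same principle applies to the custom tool: writing valid Delta commits (e.g. via a Delta library for Java) rather than raw Parquet files is what removes the need for the TVF/view workaround.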
I have built quite a big data access control system on our Redshift with the help of RBAC. I implemented it with Liquibase. Each time, all external tables, roles, and user permissions are recreated. The problem is that rerunning everything every time is extremely slow, and I don't know how to create dependencies between changesets.
I would need something that builds a graph like dbt does, so I could run downstream/upstream changes for all modified changesets. Do you know another tool for building graph relationships, or how to implement this in dbt / Liquibase?
I know I could use Airflow / Dagster to build graph relationships from scratch, but I love how dbt's ref("") automatically creates the graph.
I would basically need dbt, except instead of creating views/models it would grant permissions.
Hi everyone! I’m working with an SQL database containing hundreds of construction products from a supplier. Each product has a specific name (e.g., Adesilex G19 Beige, Additix PE), and I need to assign a general product category (e.g., Adhesives, Concrete Additives).
The challenge is that the product names are not standardized, and I don’t have a pre-existing mapping or dictionary. To identify the correct category, I would typically need to look up each product's technical datasheet, which is impractical given the large volume of data.
| product_code | product_name |
|---|---|
| 2419926 | Additix P bucket 0.9 kg (box of 6) |
| 410311 | Adesilex G19 Beige unit 10 kg |
I need to add a column like this:
| general_product_category |
|---|
| Concrete Additives |
| Adhesives |
How can I automate this categorization without manually checking every product's technical datasheet? Are there tools, Python libraries, or SQL methods that could help with text analysis, pattern matching, or even online lookups?
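A lightweight first pass (before reaching for datasheet scraping or an LLM) is keyword plus fuzzy matching in Python; product lines like "Adesilex" or "Additix" usually map cleanly to a category. This is only a sketch: the keyword-to-category mapping and the DataFrame columns are made up and would need to be grown from your own catalogue.

```python
import re
from difflib import get_close_matches

import pandas as pd

# Hypothetical seed mapping: product-line keyword -> general category.
KEYWORD_CATEGORIES = {
    "adesilex": "Adhesives",
    "additix": "Concrete Additives",
}
KEYWORDS = list(KEYWORD_CATEGORIES)


def categorize(product_name: str) -> str:
    tokens = re.findall(r"[a-z]+", product_name.lower())
    for token in tokens:
        # Exact keyword hit first...
        if token in KEYWORD_CATEGORIES:
            return KEYWORD_CATEGORIES[token]
        # ...then a fuzzy match to absorb typos and variant spellings.
        close = get_close_matches(token, KEYWORDS, n=1, cutoff=0.85)
        if close:
            return KEYWORD_CATEGORIES[close[0]]
    return "Uncategorized"  # leave for manual review


df = pd.DataFrame(
    {
        "product_code": [2419926, 410311],
        "product_name": [
            "Additix P bucket 0.9 kg (box of 6)",
            "Adesilex G19 Beige unit 10 kg",
        ],
    }
)
df["general_product_category"] = df["product_name"].map(categorize)
print(df)
```

Whatever lands in "Uncategorized" can then be handled with a heavier tool (embedding similarity, an LLM pass, or the supplier's own catalogue hierarchy), and the result written back to the SQL table as the new column.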
Any help or pointers would be greatly appreciated! Thanks in advance 😊
We have a data product (SQL tables), and new migrations are coming that might be a breaking change for downstream teams. The data product is stored in Databricks and also Snowflake (exactly the same, but duplicated for different stakeholders' needs), and we have staging and production environments. The problem is that whenever we have a breaking change, we push it to staging and wait a couple of days for the stakeholders; if they give us the green light, we then proceed. But this is a bottleneck if something else needs to be deployed to production in the meantime, and we then have to revert the changes. The process of moving to staging and reverting is cumbersome, and the team doesn't agree on having a feature flag (because staging and production would then differ, and they don't like if conditions). Curious to know how you do reviews and get approval from downstream teams?
IMO, once we've agreed on the plans and changes and communicated them to downstream teams, we shouldn't be dependent on extra table verification from their side, but the team doesn't agree.
Hello,
I'm currently working on upserting to a 100M row table in SQL server. The process is this:
* Put data into staging table. I only stage the deltas which need upserting into the table.
* Run stored procedure which calculates updates and does updates followed by inserts into a `dbo` table.
* This is done by matching on `PKHash` (composite key hashed) and `RowHash` (the changes we're measuring hashed). These are both `varchar(256)`
The problem:
* Performance on this isn't great and I'd really like to improve this. It's taking over an hour to do a row comparison of ~1M rows against ~10M rows. I have an index on `PKHash` and `RowHash` on the `dbo` table but not on the staging table as this is dynamically created from Spark in SQL server. I can change that though.
* I would love to insert 1,000 rows at a time into a temp table and then process only 1,000 at a time batchwise, although there's a business requirement that either the whole thing succeeds or it all fails. I also have to capture the number of records updated or inserted into the table and log it elsewhere.
I'm not massively familiar with working with large data, so it'd be helpful to get some advice. Is there any way I can boost the performance on this and/or batch it up while still being able to roll back, as well as get row counts for updates and inserts?
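On the atomicity plus row-count requirement specifically: the whole comparison can run inside one explicit transaction while still capturing counts via @@ROWCOUNT, for example from Python with pyodbc. A rough sketch only; the connection string, staging/target table names, and the non-key columns in the SET/INSERT lists are placeholders, with just `PKHash`/`RowHash` taken from the description above:

```python
import pyodbc

# Placeholder connection string.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myserver;DATABASE=mydb;"
    "Trusted_Connection=yes;",
    autocommit=False,  # everything below commits or rolls back as one unit
)
cur = conn.cursor()

try:
    # Index the staging table on the hash key after Spark creates it; the
    # join against the large dbo table is the usual bottleneck without it.
    cur.execute("CREATE CLUSTERED INDEX IX_stg_pk ON stg.Deltas (PKHash);")

    # UPDATE changed rows, then capture the affected-row count.
    cur.execute(
        """
        SET NOCOUNT ON;
        UPDATE t
        SET    t.ColA = s.ColA, t.RowHash = s.RowHash  -- placeholder columns
        FROM   dbo.Target AS t
        JOIN   stg.Deltas AS s ON s.PKHash = t.PKHash
        WHERE  t.RowHash <> s.RowHash;
        SELECT @@ROWCOUNT;
        """
    )
    updated = cur.fetchone()[0]

    # INSERT rows that do not exist yet, and capture that count too.
    cur.execute(
        """
        SET NOCOUNT ON;
        INSERT INTO dbo.Target (PKHash, RowHash, ColA)  -- placeholder columns
        SELECT s.PKHash, s.RowHash, s.ColA
        FROM   stg.Deltas AS s
        WHERE  NOT EXISTS (
            SELECT 1 FROM dbo.Target AS t WHERE t.PKHash = s.PKHash
        );
        SELECT @@ROWCOUNT;
        """
    )
    inserted = cur.fetchone()[0]

    conn.commit()  # the whole thing succeeds together...
    print(f"updated={updated}, inserted={inserted}")
except Exception:
    conn.rollback()  # ...or fails together, per the business requirement
    raise
```

Batching inside that same transaction (e.g. looping over ranges of `PKHash`) keeps atomicity while shrinking each individual join, though a long-running open transaction on a 100M-row table brings its own locking trade-offs.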
Cheers