/r/ETL

This subreddit is for discussions about ETL / pipelines / workflow systems / etc...

8,976 Subscribers

2

Interview help

Hello! I am wondering if I could get this community’s help as a non-technical manager hiring for an ETL and integrations developer. I recently inherited a team that includes this position and need to fill it. While I am familiar with overall ETL processes and tools, neither I nor my wider team has the expertise to conduct a technical interview. Behavioral questions are all I have at this point. The candidate needs experience with ADF, SQL (T-SQL), and Power BI.

Does anyone have suggestions for how I can ensure quality work from a candidate?

Thank you so much in advance.

2 Comments
2025/01/31
15:13 UTC

1

File format conversion from QVD to Parquet

Hi fellow tech savvies,

I am looking for a way to convert QVD files to Parquet files, because Parquet is an efficient columnar file format. If anyone knows a solution, please post your suggestions. Thank you.

11 Comments
2025/01/30
06:56 UTC

3

Integrating LLMs into ETL pipelines using langchain-beam

Hi everyone, I've been working on an Apache Beam and LangChain integration that lets you use LangChain components, such as the LLM interface, in Beam ETL pipelines to leverage a model's capabilities for data processing.

Would like to know your thoughts.

Repository link - https://github.com/Ganeshsivakumar/langchain-beam

Demo video - https://youtu.be/SXE1O-SlxZo?si=jzH4Cs0Tcl0AxE_5

0 Comments
2025/01/27
16:47 UTC

2

Data Migration Overview (ETL)

0 Comments
2025/01/21
11:30 UTC

0

ETL suggestion

Hi everyone, I want to build an ETL tool. I have 3+ years of experience building and managing ETL tools at work. I would like some suggestions on what to build next. I am also open to collaboration.

2 Comments
2025/01/14
19:57 UTC

2

data migration tools?

I've been looking for tools that can help me transfer data from databases (such as MySQL, PostgreSQL, etc.) to data warehouses in particular. Are there any tools for this? Which tools were trending in the past year?

10 Comments
2025/01/03
11:51 UTC

2

BI flowchart?

I'm trying to draw a flowchart describing data and control flows in the company's BI system. I would greatly appreciate your suggestions on what notation I should use.

https://preview.redd.it/isz550eqjnae1.png?width=969&format=png&auto=webp&s=f6fa6f02d656fc8c9c37dafbcccd56b27d6ca046

2 Comments
2025/01/02
21:57 UTC

7

Data Engineering Wrap up 2024

Hey folks! 👋 I came across this cool blog that wraps up the key data engineering trends from 2024. It covers a lot of what went down this year and what’s next. Would love to hear your thoughts—what trends in data engineering stood out to you in 2024? Check it out if you're into data!

0 Comments
2024/12/27
11:50 UTC

6

Help: how to move from traditional business intelligence to data engineering

Hello everyone, I am looking for advice on moving towards the cloud. I have more than ten years of experience in ETL, BI, SQL and data modeling (data warehouse), so I would like to train while taking advantage of my existing business intelligence expertise.

It is hard for me to find a freelance assignment (I have been watching the market for 6 months without finding a suitable opportunity). I would like to transfer my skills to the cloud (ELT) while focusing on solutions that require less code, with a more visual or simplified approach. What tools or platforms would you recommend for moving in this direction? And do you have any training recommendations (online or face-to-face) suited to this need?

Today I see that Python is the language to know, but it does not appeal to me: I have significant experience in business intelligence, where the tools are based more on objects and design and are pleasant to work with. If possible, I want to use the same approach in the cloud. I have more than 6 years of experience with DataStage, a tool in very little demand today, and I no longer plan to take assignments whose main tool is DataStage.

Ideally I want to work on hybrid projects, such as migrations that combine older ETL tools like DataStage and SSIS with newer ones like Snowflake and BigQuery, in order to move on. The problem is that when you have no experience with these newer technologies, the client is not interested.

Is certification enough?

Thank you in advance for your suggestions

1 Comment
2024/12/23
17:27 UTC

10

Build Scalable Real-Time ETL Pipelines with NATS and Pathway — Alternatives to Kafka & Flink

Hey everyone! I wanted to share a tutorial created by a member of the Pathway community that demonstrates how to build a real-time ETL pipeline using NATS and Pathway —offering a more streamlined alternative to a traditional Kafka + Flink setup.

The tutorial includes step-by-step instructions, sample code, and a real-world fleet monitoring example. It walks through setting up basic publishers and subscribers in Python with NATS, then integrates Pathway for real-time stream processing and alerting on anomalies.

App template link (with code and details):
https://pathway.com/blog/build-real-time-systems-nats-pathway-alternative-kafka-flink

Key Takeaways:

  • Seamless Integration: Pathway’s native NATS connectors allow direct ingestion from NATS subjects, reducing integration overhead.
  • High Performance & Low Latency: NATS delivers messages quickly, while Pathway processes and analyzes data in real time, enabling near-instant alerts.
  • Scalability & Reliability: With NATS clustering and Pathway’s distributed workloads, scaling is straightforward. Message acknowledgment and state recovery help maintain reliability.
  • Flexible Data Formats: Pathway handles JSON, plaintext, and raw bytes, so you can choose the data format that suits your needs.
  • Lightweight & Efficient: NATS’s simple pub/sub model is well-suited for asynchronous, cloud-native systems—without the added complexity of a Kafka cluster.
  • Advanced Analytics: Pathway supports real-time machine learning, dynamic graph processing, and complex transformations, enabling a wide range of analytical use cases.

Would love to know what you think—any feedback or suggestions on this real-time ETL.

0 Comments
2024/12/19
13:29 UTC

2

Experiences with Installing and Managing Airbyte on Raspberry Pi?

Hi everyone, I'm considering installing Airbyte on a Raspberry Pi and would like to know if anyone here has experience with this setup. Specifically, how well does Airbyte run on a Raspberry Pi, and are there any limitations to be aware of?

Any insights or advice would be greatly appreciated! Thanks in advance.

0 Comments
2024/12/18
14:49 UTC

6

How to Automate an SSIS ETL Process? Need Guidance

Hi everyone,

I’m trying to automate an SSIS ETL process that runs every day. Here’s the situation:

  • The ETL reads two Excel files:
    • One is manually downloaded from an email (this I can automate using Power Automate).
    • The other is downloaded via an API (this part seems automatable).

The challenge is getting the SSIS package to run automatically without using the GUI. I tried using dtexec, but I’ve run into problems I don’t know how to solve.

A bit about me: I’m new to this. I used to work in a call center but recently transitioned into a data engineering role. Now, I’ve been tasked with automating this process, and I’m unsure where to start or what best practices to follow.

Could anyone point me in the right direction? Any advice or resources would be greatly appreciated!

Thanks in advance for your help!
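For the dtexec part of the question above, here is a minimal sketch of running a package without the GUI. It assumes the package is saved as a .dtsx file on disk and that dtexec is on the PATH; the path C:/etl/DailyLoad.dtsx and the function names are hypothetical. The /F option (run a package from the file system) and the exit codes are the ones documented for dtexec.

```python
import subprocess

# Exit codes documented for the dtexec utility.
DTEXEC_EXIT_CODES = {
    0: "package executed successfully",
    1: "package failed",
    3: "package was canceled by the user",
    4: "package could not be found",
    5: "package could not be loaded",
    6: "internal error (syntax or semantic error in the command line)",
}


def build_dtexec_command(package_path, dtexec="dtexec"):
    """Build the argument list for running a .dtsx file from the file system."""
    return [dtexec, "/F", str(package_path)]


def describe_exit_code(code):
    """Translate a dtexec exit code into a readable message."""
    return DTEXEC_EXIT_CODES.get(code, f"unknown exit code {code}")


def run_package(package_path, dtexec="dtexec"):
    """Run the package and return (exit_code, description)."""
    result = subprocess.run(
        build_dtexec_command(package_path, dtexec),
        capture_output=True,
        text=True,
    )
    return result.returncode, describe_exit_code(result.returncode)


# Hypothetical usage on the machine where SSIS is installed:
# code, message = run_package("C:/etl/DailyLoad.dtsx")
```

A script like this can be triggered daily by Windows Task Scheduler. If the package is deployed to SQL Server, a SQL Server Agent job with an SSIS package step is another common way to schedule it without any wrapper script.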

3 Comments
2024/12/15
15:50 UTC

19

What should the ETL Developer roadmap look like?

In my area there are a lot of jobs on ETL Developer and Data Integration/Migration projects. The salaries are not bad as well. What could be the right roadmap for this kind of role? Which tools should I learn and how long can it take to become ready for it?

2 Comments
2024/12/09
15:13 UTC

2

Pipeline design help needed!

Hi! I'm trying to build a pipeline that monitors a folder of invoices (.xml format) generated by a restaurant's POS (point of sale). Whenever a new invoice is added to the folder, I want to extract it, process it, and load it into a cloud database. I'm currently doing so with a simple Python script using watchdog. Is this good enough, or should I be using a more robust tool like Kafka or something? The ultimate goal is to load this invoice data into the database so that I can feed a dashboard.

Any guidance is welcome. Thank you!!! :)
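To make the extract, process, and load step concrete, here is a minimal sketch of the part that would run each time a new file appears. The invoice layout (id, issued_at, total elements) and the table schema are assumptions, since the post does not show the real XML, and sqlite3 stands in for the cloud database. The load is written as an upsert so that re-processing the same file does not create duplicates.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_db(path=":memory:"):
    """Create the (hypothetical) invoices table; sqlite3 stands in for the cloud DB."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "invoice_id TEXT PRIMARY KEY, issued_at TEXT, total REAL)"
    )
    return conn


def parse_invoice(xml_text):
    """Extract the fields we care about; element names are assumed, not real."""
    root = ET.fromstring(xml_text)
    return {
        "invoice_id": root.findtext("id"),
        "issued_at": root.findtext("issued_at"),
        "total": float(root.findtext("total")),
    }


def load_invoice(conn, invoice):
    """Upsert keyed on invoice_id, so loading the same invoice twice is harmless."""
    conn.execute(
        """INSERT INTO invoices (invoice_id, issued_at, total)
           VALUES (:invoice_id, :issued_at, :total)
           ON CONFLICT(invoice_id) DO UPDATE SET
               issued_at = excluded.issued_at,
               total = excluded.total""",
        invoice,
    )
    conn.commit()
```

In a watchdog-based script, the on_created handler would read the new file and pass its contents to parse_invoice and load_invoice. For a single folder on a single machine, a setup like this is usually much simpler to operate than a message broker.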

6 Comments
2024/12/08
21:14 UTC

8

Achieving Sub-Second Latency with S3 Storage—Using Pathway, a Kafka Alternative

Hey everyone,

I've been working on simplifying streaming architectures and wanted to share an approach that serves as a Kafka alternative, especially if you're already using S3-compatible storage.

You can skip description and jump to the code here: https://pathway.com/developers/templates/kafka-alternative#building-your-streaming-pipeline-without-kafka

The Identified Gap Addressed Here

While Apache Kafka is a go-to for real-time data streaming, it comes with complexities and costs—setting up and managing clusters, incurring high costs in Confluent cloud (~2k monthly for the use case here), and so on. 

Getting Streaming Performance with your Existing S3 Storage without Kafka

Instead of Kafka, you can leverage Pathway alongside Delta Tables on S3-compatible storage like MinIO. Pathway is a Pythonic stream processing engine with an underlying Rust engine.

Why Consider This Setup?

  • Sub-Second Latency: Benchmarks show that you can get stable sub-second latency for workloads up to 60,000 messages per second.
  • Cost-Effective: Eliminates the need for Kafka clusters, reducing both complexity and operational costs.
  • Simplified Architecture: Fewer components to manage, leveraging your existing S3 storage.
  • Scalable Performance: Handles up to 250,000 messages per second with near-real-time latency (~3-4 seconds).

Building the Pipeline

For the technical details, including code walkthrough and benchmarks, check out this article: Python Kafka Alternative: Achieve Sub-Second Latency with Your S3 Storage Without Kafka Using Pathway

Use Cases

This setup is suitable for various applications:

  • IoT and Logistics: Collecting data from numerous sensors or devices.
  • Financial Services: Real-time transaction processing and fraud detection.
  • Web and Mobile Analytics: Monitoring user interactions and ad impressions.
0 Comments
2024/11/27
17:07 UTC

4

Any recommendations for open-source ETL solutions to call HTTP APIs and save data in BigQuery and a DB (PostgreSQL)?

I need to call an HTTP API to fetch JSON data, then transform and load it into either BigQuery or a database. Every day there will be more than 2M calls to the API and roughly 6M records upserted.

The current solution, a separate API built with Ruby on Rails, is struggling to scale.

Our infrastructure is built on Google Cloud and we want to use it for all of our ETL processes.

I am looking for an open-source, on-premises solution, as we are just a self-funded startup.
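As a rough sizing aid, the daily volumes quoted above can be turned into sustained per-second rates. This is only arithmetic on the numbers in the post; the batch size of 5,000 is an assumption for illustration, and real traffic is rarely spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_count):
    """Average sustained rate if the load were spread evenly over 24 hours."""
    return daily_count / SECONDS_PER_DAY


def batches_per_day(daily_records, batch_size):
    """Number of bulk upserts needed per day (ceiling division)."""
    return -(-daily_records // batch_size)


api_calls_per_sec = per_second(2_000_000)   # about 23.1 calls per second
upserts_per_sec = per_second(6_000_000)     # about 69.4 records per second
bulk_upserts = batches_per_day(6_000_000, 5_000)  # 1200 batches of 5,000 rows

print(f"{api_calls_per_sec:.1f} API calls/s, {upserts_per_sec:.1f} upserts/s, "
      f"{bulk_upserts} batches/day")
```

Seen this way, the average rates are modest; peak rates and the API's own rate limits are likely to matter more than the daily totals when comparing tools.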

6 Comments
2024/11/25
13:30 UTC

0

Reviews on Snowflake Pricing Calculator

Hi everyone. Recently I had the opportunity to work on deploying a Snowflake Pricing Calculator. It gives a rough estimate of costs, which can vary from region to region. If any of you are interested, you can check it out and share your reviews.

2 Comments
2024/11/25
08:53 UTC

0

Sort string lines by parsed multiple date formats

#devs #dotnet #Analytics Sort string lines by parsed multiple date formats added to new version Clipboard Plaintext Power Tool https://clipboardplaintextpowertool.blogspot.com/

https://preview.redd.it/e596rvwrv41e1.jpg?width=427&format=pjpg&auto=webp&s=c8613510b138811d8cf777057dea12cc91a7a42c

5 Comments
2024/11/15
21:31 UTC

9

Looking for ETL tools to scale data pipelines

Hey folks, I’m in the process of scaling up my data pipelines and looking for some solid ETL tools that can handle large data volumes smoothly. What tools have worked well for you when it comes to efficiency and scalability? Any tips or suggestions would be awesome!

8 Comments
2024/11/15
12:17 UTC

3

XML API connector

Does anyone have any good resources or pipelines on GitHub that query an API and then incrementally load data to a database?

Our use case is querying the NetSuite OpenAir XML API and writing the data to a Databricks Metastore every day.

Airbyte doesn’t have a low-code connector builder for XML.

I’m a one man band at my company so ideally not looking to custom build something huge with the potential for technical debt, but still need the pipeline to be idempotent.
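One small pattern that keeps an incremental load idempotent is a stored watermark plus an upsert. The sketch below is a generic illustration, not the OpenAir API: fetch_since is a hypothetical callable that returns rows changed at or after a timestamp, the table layout is assumed, and sqlite3 stands in for the real target. Timestamps are assumed to be ISO-8601 strings, which sort correctly as text.

```python
import sqlite3


def init_db(conn):
    """Create the target table and a one-row-per-source watermark table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id TEXT PRIMARY KEY, updated_at TEXT, payload TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermark (source TEXT PRIMARY KEY, value TEXT)"
    )


def get_watermark(conn, source):
    """Return the last loaded timestamp, or '' if nothing was loaded yet."""
    row = conn.execute(
        "SELECT value FROM watermark WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else ""


def incremental_load(conn, source, fetch_since):
    """Fetch rows changed since the watermark and upsert them in one transaction."""
    rows = fetch_since(get_watermark(conn, source))
    if not rows:
        return 0
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            """INSERT INTO records (id, updated_at, payload)
               VALUES (:id, :updated_at, :payload)
               ON CONFLICT(id) DO UPDATE SET
                   updated_at = excluded.updated_at,
                   payload = excluded.payload""",
            rows,
        )
        conn.execute(
            """INSERT INTO watermark (source, value) VALUES (?, ?)
               ON CONFLICT(source) DO UPDATE SET value = excluded.value""",
            (source, max(r["updated_at"] for r in rows)),
        )
    return len(rows)
```

Because the rows and the watermark are written in the same transaction and the insert is an upsert, a failed or repeated run leaves the target in the same state as a single successful run. On Databricks the same idea is usually expressed with a MERGE statement.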

11 Comments
2024/11/12
23:36 UTC

2

Version Control for ETL Scripts: What Works for You?

How do you manage version control for ETL scripts? Any tools or workflows that have worked well?

8 Comments
2024/11/12
10:08 UTC

3

CTO of iceDQ, Sandesh Gawande, joined Eric Kavanagh on DM Radio to discuss Data Testing Automation for ETL Pipelines and Production Monitoring.

0 Comments
2024/10/31
16:12 UTC

2

Discussion

Hi everyone, a little background about me: I have been working with ETL tools like Alteryx and KNIME for the past 6 months, so I might not know the full potential of these tools, hence my question here.

I was recently asked to build a client solution that automatically stores addresses provided in customer information (the current process on the client's end is to manually read each address and enter it in the database). The information isn’t clearly structured; for example, there is no rule that country, state, city, and building name must appear in a particular order. Sometimes parts of the address are missing. Sometimes a building name is entered at the start that could just as well be a country or state name. Some people have even gone above and beyond and given directions right up to their door (this is junk for me).

Is it possible to build an automated solution that can dissect this information accurately? If it can’t be fully automated, I was thinking of setting criteria so that records missing certain levels of information are raised as exceptions, which can then be resolved manually. Thank you, and let me know whether you think it’s possible. If so, what tools should I be using (data privacy is also a concern), and what approach should I take?
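The exception-capture idea described above can be sketched as a simple triage step. This assumes some upstream parser (not shown, and the hard part) has already split each address into a dictionary; the field names and the list of required fields are hypothetical and would come from the client's rules.

```python
# Hypothetical minimum set of fields a record needs to be stored automatically.
REQUIRED_FIELDS = ("street", "city", "country")


def triage(parsed_addresses):
    """Split parsed addresses into auto-accepted records and manual-review exceptions."""
    accepted, exceptions = [], []
    for record in parsed_addresses:
        # A field counts as missing if it is absent, None, or only whitespace.
        missing = [f for f in REQUIRED_FIELDS if not (record.get(f) or "").strip()]
        if missing:
            exceptions.append({"record": record, "missing": missing})
        else:
            accepted.append(record)
    return accepted, exceptions


sample = [
    {"street": "12 High St", "city": "Leeds", "country": "UK"},
    {"street": "Sunrise Towers", "city": "", "country": "India"},
]
accepted, exceptions = triage(sample)
print(len(accepted), "accepted;", len(exceptions), "sent for manual review")
```

Keeping the rule list in one place makes it easy to tighten or relax the criteria as the client sees which exceptions are worth a human's time.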

6 Comments
2024/10/10
21:12 UTC

0

Looking for Change Data Capture Tool? This blog helped me!

Hi everyone,

Recently, I ran into some challenges with Change Data Capture (CDC) on a project I am working on, and I needed to find a reliable CDC tool. I stumbled upon this blog that lists the seven best CDC tools, and it really helped me. The article does a very good job of breaking down the pros and cons of each tool, which made it much easier for me to pick the right one for what I need.

If anyone else is looking for a CDC solution, I'd recommend checking this out.
Blog: 7 Best CDC Tools for Change Data Capture in 2024
Hope it helps!

3 Comments
2024/10/08
12:33 UTC

0

Need a Reliable ETL Tool for GCP? Here Are the Best Options!

Hey everyone,

If you're working with the Google Cloud environment and looking for the right ETL tools to streamline your data integration process, you know how tricky it can be to choose the right one.

I recently found a guide that breaks down the top GCP ETL tools to help you avoid those headaches. Whether you need simplicity, speed, or flexibility, this guide covers the pros and cons of each tool so you can choose what works best for your setup. If you’re looking to save time and keep your pipelines running smoothly, it’s worth a read!

0 Comments
2024/10/07
12:28 UTC

0

Found a Great Resource for Choosing ETL Tools!

Hello everyone,

Recently, I found myself in a crunch: deciding on the best ETL tool for my data integration needs. After much research I finally found this amazing blog, which went a long way in helping me make a final choice.

The articles not only outline different ETL tools, their features, and use cases, but also make it easier to decide which one best suits your requirements. If you are in a similar situation or are just curious about your options when it comes to ETL, you should certainly have a look!

Best ETL Tools

Hope you find it as helpful as I did!

8 Comments
2024/10/07
12:26 UTC

0

Stuck with Choosing the Best Cloud ETL Tool? Here's What Helped Me

I was trying to figure out the best cloud ETL tools for our data needs. The choices were overwhelming, and my team didn't have the time or expertise to dig into all the technical details for each tool. We needed something that was powerful yet easy to use.

That’s when I discovered this list of the 8 Best Cloud ETL Tools. It was a game-changer! The article breaks down each ETL tool, highlighting their features, strengths, and use cases in a way that's easy to understand. It helped me quickly narrow down my options to find the best fit for our needs.

If you're struggling to find the right ETL tool for your cloud data integration, I highly recommend checking out that guide. It gives a comprehensive overview of the best tools out there and will save you a lot of time in making your decision.

4 Comments
2024/10/07
12:25 UTC

1

Stuck with Oracle Redo Logs? This Blog Helped Me Out!

Hey everyone! 👋

I recently ran into an issue while working with Oracle Redo Logs, and I had no clue how to extract or use them for analysis. 😩 I was searching for a way to make sense of it when I stumbled upon this blog: Working With Oracle Redo Logs. It really broke down the concept and gave step-by-step guidance on handling Redo Logs efficiently.

If you’re also struggling with managing Oracle Redo Logs, I’d highly recommend giving it a read! 💡

Hope this helps someone else too. 😊

0 Comments
2024/10/07
12:19 UTC
