/r/ETL


This subreddit is for discussions about ETL / pipelines / workflow systems / etc...

/r/ETL

8,232 Subscribers

1

Accounting & Ecommerce Focused ETL Tool Recommendation

Hi. I was using Fivetran for a while to extract data from Quickbooks to Postgres, but the source connector seems to have been broken for a while. I tried Airbyte but the community connector only pulled a small amount of data, and I have no idea why. Others like Rivery, Stitch, and Portable didn't fit the bill due to a lack of functionality and price. (I'd prefer something with a free tier like Fivetran so I can test it out.)

At the moment I'm trying out Meltano. Even though it's easy to set up, at the end of the day I'd rather have a point-and-click interface.

Ultimately I want to build out a few ecommerce connections in addition to Quickbooks. Are there any recommendations for ETL software that supports extraction from Quickbooks, Walmart, Shopify, and Tiktok? Thanks in advance.

0 Comments
2024/04/16
05:25 UTC

8

Why is ETL still a thing

I see there are no posts here, so let me be the first.

When I first got into Data, Fivetran had barely done a Series A, but I kinda already felt like ELT was solved (I know this subreddit is ETL, but whatever).

That's because I pressed a button and data (in this case, Salesforce) simply landed in my destination. Schema updates were handled, stuff didn't really break, life was good.

Years on, there are a million vendors building cloud SaaS ELT. There are open-source servers like Airbyte. There are open-source frameworks for ingesting data that you run yourself.

The ELT market also suffers from intense competition, and (rightly) a scornful eye from many data engineers. People don't want to be paying hundreds of thousands of dollars for connectors they could run cheaply, but no one can be bothered to build them (fair), so we buy them anyway. There's lots of demand and also a race to the bottom in terms of price.

So the question is: why hasn't the ELT market reached a perfect equilibrium? Why is Salesforce buying Informatica? Why are GCP and Snowflake investing millions in this area of Data? Why are there smart people still thinking about novel ways to move data if we know what good looks like? Prices are going down, competition is heating up, everything should be converging, but it's never looked more different. Why?

1 Comment
2024/04/15
23:13 UTC

9

Example Data Pipeline with Prefect, Delta Lake, and Dask

I'm an OSS developer (primarily working on Dask), and lately I've been talking to users about how they're using Dask for ETL-style production workflows, which inspired me to make something myself. I wanted a simple example that met the following criteria:

  • Runs locally (optionally). It should be easy to try out locally and easy to scale.
  • Scalable to cloud. I didn't want to think hard about cloud deployment.
  • Python-forward. I wanted to use tools familiar to Python users, not ETL experts.

The resulting data pipeline uses Prefect for workflow orchestration, Dask to scale the data processing across a cluster, Delta Lake for storage, and Coiled to deploy Dask on the cloud.
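For readers who haven't seen these tools combined, here is a minimal sketch of how the pieces fit together. This is not the repo code (that's linked below); the table path and dataframe contents are placeholders, and it assumes `prefect`, `prefect-dask`, `deltalake`, and `pandas` are installed.

```python
# Minimal sketch of the stack described above; not the actual repo code.
import pandas as pd
from prefect import flow, task
from prefect_dask import DaskTaskRunner
from deltalake import write_deltalake

@task
def extract() -> pd.DataFrame:
    # Stand-in for reading raw data; the real example processes TPC-H data.
    return pd.DataFrame({"id": range(100), "amount": [i * 1.5 for i in range(100)]})

@task
def transform(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["amount"] > 10]

@task
def load(df: pd.DataFrame) -> None:
    # Append to a local Delta table; point this at s3://... in the cloud.
    write_deltalake("./lake/orders", df, mode="append")

@flow(task_runner=DaskTaskRunner())  # defaults to a temporary local Dask cluster
def etl():
    raw = extract.submit()           # .submit() routes tasks to the Dask cluster
    load.submit(transform.submit(raw))

if __name__ == "__main__":
    etl()
```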

I really like the outcome, but wanted to get more balanced feedback since lately I’ve been more on the side of building these tools rather than using them heavily for data engineering. Some questions I’ve had include:

  • Prefect vs. Airflow vs. Dagster? For the users I’ve been working with at Coiled, Prefect is the most commonly used tool. I also know Dagster is quite popular and could easily be swapped into this example.
  • DeltaLake or something else? To be honest, I mostly see vanilla Parquet in the wild, but I've been curious about Delta for a while and mostly wanted an excuse to try it out (pandas and Dask support improved a lot with delta-rs; see the short read example below).
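For anyone curious about the delta-rs mention, a tiny sketch of reading a Delta table back into pandas, assuming the table written in the sketch above exists locally:

```python
# Reading back with delta-rs (the `deltalake` Python package).
from deltalake import DeltaTable

dt = DeltaTable("./lake/orders")
print(dt.version())                  # Delta tables are versioned
df = dt.to_pandas()                  # load the latest snapshot into pandas
v0 = DeltaTable("./lake/orders", version=0).to_pandas()  # time travel
```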

Anyway, if people have a chance to read things over and give feedback I’d welcome constructive critique.

Code: https://github.com/coiled/etl-tpch
Blog post: https://docs.coiled.io/blog/easy-scalable-production-etl.html

0 Comments
2024/04/11
13:47 UTC

3

Why Matillion with snowflake?

I recently joined a project that had been running for the past three to three and a half years with Snowflake as the data warehouse and as the engine for ETL and ELT transformations. My client then switched to Matillion for ETL and ELT. The reasons they give are that Snowflake was costing them a lot of money for running TASKS, and that the JavaScript stored procedures they used to load from source to fact or dim tables were tough to maintain in terms of code quality. So we are switching to Matillion for cost cuts and better code structure using components. But in Matillion they are still mostly using many SQL queries in individual SQL components, and my thinking is: if Snowflake tasks cost that much, Matillion still runs on Snowflake as the warehouse and compute engine anyway. What do you think about this?

Why do ETL or ELT in Matillion and not in Snowflake?

4 Comments
2024/04/10
16:34 UTC

0

Mastering the basics of Talend Open Studio for ETL – any tips?

Hey!

I'm diving into the world of ETL (Extract, Transform, Load) and have decided to start with Talend Open Studio. For those of you who don't know it, Talend is a powerful tool for managing ETL processes, letting you integrate, transform, and load data between different systems.

I found a free training course that seems ideal for someone getting started with Talend, promising to teach the fundamentals needed to start ETL projects effectively.

I'm looking to understand:

  • The basic concepts and best practices of ETL with Talend.
  • How to set up and use Talend Open Studio for my first projects.
  • Tips for optimizing my ETL workflows and avoiding common mistakes.

Do you have any advice or experiences to share?

  • Resources or tutorials that were particularly helpful when you were learning Talend.
  • Challenges you ran into while using Talend and how you overcame them.
  • Talend features you find invaluable for ETL projects.

If you have questions about the training I mentioned, or want to share your own tips and experiences with Talend, I'm all ears. Feel free to reply or message me privately.

Thanks in advance for your help and contributions!

2 Comments
2024/04/09
12:12 UTC

6

Open Source Data Quality Tools

I wrote a blog post about open source data quality tools. After vetting some, I found 5 noteworthy options. I am open to additions so if you have any open source tools that you have tried and would like to share with the community, please let me know.

https://www.datacoves.com/post/data-quality-tools

1 Comment
2024/04/04
18:12 UTC

2

Exploring versions of the Postgres logical replication protocol

https://blog.peerdb.io/exploring-versions-of-the-postgres-logical-replication-protocol

🚀 Did you know that the Postgres logical replication protocol has evolved over the past few years? Did you know that Postgres logical replication has "versions" that make it more efficient and feature-rich?

This blog dives into that evolution and its impact on performance, and presents some useful benchmarks. It is useful for anyone who uses Postgres logical replication in practice!

🔍 Version 1 set the stage by allowing the streaming of committed transactions, laying the groundwork for what was to come.

🌊 Version 2 introduced a game-changer: streaming of in-progress transactions. This dramatically improved decoding speeds and reduced the time spent at peak slot size, addressing critical performance bottlenecks.

📊 The blog provides a detailed benchmark of Version 2's impact compared to Version 1. TL;DR: faster decoding and less time spent at peak slot size.

🔄 Versions 3 and 4 brought in support for two-phase commits and parallel apply of in-flight transactions, further enhancing the flexibility and efficiency of logical replication.

For a detailed analysis of all the above topics on Postgres logical replication, check out the blog.
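For readers wondering where these protocol versions surface in code: the version is a pgoutput plugin option passed when starting replication. Below is a minimal sketch using psycopg2; the slot and publication names are made up, and streaming of in-progress transactions requires Postgres 14+.

```python
# Sketch: requesting pgoutput protocol version 2 (streaming of in-progress
# transactions) from Python via psycopg2.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=mydb user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.create_replication_slot("my_slot", output_plugin="pgoutput")
cur.start_replication(
    slot_name="my_slot",
    decode=False,  # pgoutput emits a binary protocol, not text
    options={
        "proto_version": "2",
        "publication_names": "my_publication",
        "streaming": "on",
    },
)

def on_message(msg):
    print(msg.data_start, len(msg.payload))
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge progress

cur.consume_stream(on_message)
```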

1 Comment
2024/04/01
18:17 UTC

1

Accounting (General Ledger) Data Mapping

Would appreciate any feedback on this desired project, and recommended tools to handle.

I would like to create a common data model for a specific industry (trucking) for summary financial and operational data. I previously built an Excel-based add-in to facilitate mapping disparate GL information to the common template, but the workload associated with this method is becoming untenable. We use Matillion for ETL and other data transformation processes, but have never thought about using this tool to replace the Excel add-in.

The essential steps (currently):

  1. Create the common data model to map to.

  2. Import trial balances (Account ID, Account Description, and Net Change values) for a given month/year for a unique company.

  3. Map the accounts to the common data model (see the sketch after this list):

    • Direct Mapping: Creating 1:1 relationships between source account IDs and the common model accounts.
    • Percentage Mapping: Distributing values across multiple accounts based on predefined percentages.
    • Ratio-Based Mapping: Using operational metrics (e.g., miles, hours) to dynamically allocate values.

  4. Once the mapping relationships have been established and confirmed/reviewed, all subsequent imports of trial balances (we can use Azure Blob Storage for the trial balances in CSV format, with the file naming convention identifying the company and month/year) would be transformed based on the established mapping relationships.

  5. Any new accounts identified would trigger an exception so a mapping relationship can be established.

The transformed data would then reside in Snowflake.
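To make the mapping types concrete, here is a rough pandas sketch of steps 3–5. Every account ID, target name, and column name is hypothetical, and the ratio-based case is omitted (it works like percentage mapping, with the percentages computed from operational metrics such as miles or hours):

```python
# Toy illustration of direct and percentage mapping plus the new-account
# exception; every identifier here is made up for the example.
import pandas as pd

# One month's trial balance for one company.
tb = pd.DataFrame({"account_id": ["4000", "5000"],
                   "net_change": [1000.0, 600.0]})

# Direct mapping: 1:1 source account -> common-model account.
direct = {"4000": "REVENUE_LINEHAUL"}

# Percentage mapping: split one source account across several targets.
pct = pd.DataFrame({"account_id": ["5000", "5000"],
                    "target_account": ["FUEL", "MAINTENANCE"],
                    "pct": [0.7, 0.3]})

rows = []
for r in tb.itertuples():
    if r.account_id in direct:
        rows.append({"target_account": direct[r.account_id],
                     "amount": r.net_change})
    elif (pct["account_id"] == r.account_id).any():
        for p in pct[pct["account_id"] == r.account_id].itertuples():
            rows.append({"target_account": p.target_account,
                         "amount": r.net_change * p.pct})
    else:
        # Step 5: unmapped accounts raise an exception for human review.
        raise ValueError(f"No mapping for account {r.account_id}")

mapped = pd.DataFrame(rows).groupby("target_account", as_index=False)["amount"].sum()
print(mapped)
```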

Is this doable with an open-source tool or Matillion? Am I overthinking this?

Thanks

2 Comments
2024/03/29
23:51 UTC

1

DataStage hands-on tutorials

Hi all,

I am trying to learn DataStage. It is an old-fashioned tool, so I can't find enough documentation/videos. I found the playlist below, but some videos are missing:
https://www.youtube.com/playlist?list=PLeF_eTIR-7UpGbIOhBqXOgiqOqXffMDWj

Could you please share resources for learning DataStage?

Thanks

2 Comments
2024/03/18
20:03 UTC

2

How Harmonic Saved $80,000 by Switching from Fivetran to PeerDB

0 Comments
2024/03/14
21:01 UTC

2

GitOps for Data - the Write-Audit-Publish (WAP) pattern

Link to blog post here - feedback welcome!

Do you test all your changes in prod? 🤦‍♂️ Let's borrow some concepts from software engineering and make sure that bad data never enters production. One such way is the Write-Audit-Publish (WAP) pattern.

Just released a blog post explaining it and showing how to make sure you're:

  • Always working on production data in an isolated environment (dev/staging/prod environments)
  • Collaborating securely with custom approval flows (GitOps)
  • Preventing faulty builds from going into production (CI/CD)
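As a concrete toy example of the three phases, here's a self-contained sketch using sqlite3 as a stand-in for the warehouse (the linked post covers warehouse-native environments and CI/CD):

```python
# Minimal, self-contained illustration of Write-Audit-Publish; in a real
# warehouse the same idea applies with staging schemas or branches.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")  # "production"

# WRITE: land new data in a staging table, never directly in production.
con.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders_staging VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# AUDIT: run checks against staging; abort before anything reaches prod.
nulls = con.execute(
    "SELECT COUNT(*) FROM orders_staging WHERE amount IS NULL").fetchone()[0]
negatives = con.execute(
    "SELECT COUNT(*) FROM orders_staging WHERE amount < 0").fetchone()[0]
if nulls or negatives:
    raise ValueError("audit failed: bad rows in staging, not publishing")

# PUBLISH: promote the audited data atomically (one transaction).
with con:
    con.execute("INSERT INTO orders SELECT * FROM orders_staging")
    con.execute("DELETE FROM orders_staging")
```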

Check it out and share your thoughts :)

0 Comments
2024/03/14
15:03 UTC

7

skyffel - prototype for generating Airbyte connectors

1 Comment
2024/03/07
11:29 UTC

3

Using an Airflow DAG for tm1-job-dag monitoring. Need help with the DagBag class to get all DAG IDs for a specific tag. Problems with broken DAGs.
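For anyone with the same question, a minimal sketch of the usual approach; the "tm1" tag name is an assumption:

```python
# Collect DAG IDs by tag with DagBag, and surface broken DAGs via
# import_errors (DAGs that fail to import never appear in bag.dags).
from airflow.models import DagBag

bag = DagBag()  # parses the configured dags folder

tm1_dag_ids = [
    dag_id for dag_id, dag in bag.dags.items() if "tm1" in (dag.tags or [])
]
print("tm1 DAGs:", tm1_dag_ids)

for path, error in bag.import_errors.items():
    print(f"BROKEN: {path}\n{error}")
```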

0 Comments
2024/03/06
13:43 UTC

1

[Video] Custom Python ETL connector demo - feedback welcome

Off-the-shelf data ingestion works great about 80% of the time. The other 20% is where good data engineers make all the difference.

At Y42, we've released Python ETL connectors to cater to the "other 20%", alongside our existing Airbyte, Fivetran, and proprietary ingestion capabilities. The goal of this new feature (sketched generically below) is to:

  • implement custom ingestion logic,
  • remove boilerplate code to load data into your data warehouse,
  • get standardized metadata, lineage, and documentation out of the box.
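Not the Y42 interface (see the announcement post below for that), but for readers new to the idea, custom ingestion logic usually boils down to something like this paginated extract; the endpoint and field names here are invented:

```python
# Generic shape of a custom ingestion connector: yield records one at a time
# from a paginated REST API, and let the platform handle loading.
from typing import Any, Dict, Iterator
import requests

def extract(base_url: str, token: str) -> Iterator[Dict[str, Any]]:
    """Yield one record at a time from a hypothetical paginated endpoint."""
    url = f"{base_url}/v1/items"
    headers = {"Authorization": f"Bearer {token}"}
    while url:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        body = resp.json()
        yield from body["items"]
        url = body.get("next_page")  # None terminates the loop
```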

Check out the demo video; I'm very curious about your feedback: https://www.youtube.com/watch?v=L252iaNylbo.

To those who want to read more about it, check out the announcement post: https://www.y42.com/blog/announcing-python-ingest.

Thanks!

0 Comments
2024/03/02
18:32 UTC

0

Scott Hanselman Interviewing Sai Srirampur from PeerDB on Postgres Replication

The podcast touches on so many interesting topics, including Postgres, open source, migrations, replication, data movement, building fault-tolerant enterprise-grade systems, PeerDB, and so on. Loved the way Scott navigated through each of these topics and created a story. Totally worth a watch!

https://open.spotify.com/episode/3jZu78eH79aat9UozoHWIQ?si=Ow2mF2h9TB2d6EeH4UmfIQ&nd=1&dlsi=317bc349bf314f1f

0 Comments
2024/03/01
18:34 UTC

2

Data Driven Culture Discussion

Hey Everyone,

This is an insightful article discussing what it takes to become data-driven, and how it is not just about adopting new technologies but also about nurturing trust and alignment within the organization.

Article 👉🏼 https://www.datacoves.com/post/data-driven-culture

Here are some focal points from the article, paired with questions I believe could spark valuable discussions:

  1. Alignment with Business Objectives: The article emphasizes the importance of getting everyone on the same page from the beginning and ensuring that data analytics strategies are directly aligned with business goals. Have any of you faced challenges where data projects fell short because they weren't aligned with broader business objectives? How did you navigate these challenges?
  2. User-Centric Data Solutions: It's pointed out that solutions should be tailored to solve actual user problems rather than coming up with an overly technical solution. Can you share experiences where focusing on user needs led to successful data projects? Or perhaps a time when overlooking this led to failure?
  3. Data Management and Governance: According to the article, robust data management and governance are crucial for sustaining trust in data analytics. What strategies, practices or tools have you found effective in maintaining data quality and governance in your work?

Looking forward to your experiences and thoughts!

0 Comments
2024/02/27
17:47 UTC

4

DataStage as an Orchestration Tool? (Best practices for non-ETL data loading with DataStage?)

0 Comments
2024/02/27
16:29 UTC

19

Talend is no longer free

Now that Talend is no longer free, what other ETL tool would you recommend that has data transformation capabilities as powerful as the tMap component?

https://www.talend.com/products/talend-open-studio/

Thanks!

9 Comments
2024/02/22
02:58 UTC

5

Multiwoven - Open-source reverse ETL

Hello folks!

https://github.com/Multiwoven/multiwoven

I'm Subin, co-founder at Multiwoven. Multiwoven is an OSS reverse ETL platform that helps dev and data teams sync data from databases to business tools. It is built using Ruby on Rails, and our data sync orchestration is built on top of Temporal using the temporal-ruby SDK. I would greatly appreciate any feedback. Our codebase is available on GitHub; please star us to get updates.

7 Comments
2024/02/20
10:53 UTC

3

Looking for ETL server load-balancing tips

Hello! I'm hoping to pick the brains of you fine ETL experts, as I suspect this is not a problem specific to my company or team. For context, I am the manager of a data conversion team. I tried to keep my situation as high-level as possible, but please let me know if I can clarify anything. I'm admittedly not the technical SME on our tools, but I'll try to answer as much as I can if anyone has any questions!

I'm wondering if there are any suggested methods or strategies to proactively load-balance our servers and keep queues from stacking up as the engineers wait to run data out. Ideally, I'd like to be smarter about how we assign the initial server for each project so our only worry afterwards is balancing a week in advance for final migrations. A standard round-robin method almost seems too easy, but it has not been explored in the past. Right now, a different team creates the initial projects and assigns the server; they typically try to fill one server before moving to the next.

The volume of projects has more than doubled in the past 12 months, and while we've definitely got the human resources to handle it, we're encountering daily problems with the limitations our technology imposes on us. We are planning to switch to a new platform (Databricks) in the future, which should eliminate some of these problems, but that is more than a year away, and all of the forecasting models I've been given from upper leadership indicate our project volume is only going to continue to increase between now and then.

Right now the engineers use dedicated MS Teams channels for each server to coordinate who's using it and build queues for who needs to use it next. It would be relatively easy to implement an actual hourly schedule for each server that they could use to reserve time and block for scheduled Finals. But my worry with that is the common problem of things taking longer than predicted due to large data sets or failures, which would throw the schedule off for everyone else. It'd be better than nothing, but I have this burning feeling there must be a smarter way to do this based on predictive models or SOMETHING!
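One idea that's a step up from round-robin without needing a full predictive model: least-loaded assignment. A rough sketch, where the per-project hour estimates are a made-up input (you could substitute data set size or historical run times):

```python
# Least-loaded assignment: place each new project on the server with the
# smallest sum of estimated remaining hours. A heap keeps selection cheap.
import heapq

servers = [(0.0, name) for name in ["etl-01", "etl-02", "etl-03"]]
heapq.heapify(servers)  # entries are (estimated_load_hours, server_name)

def assign(project: str, est_hours: float) -> str:
    load, name = heapq.heappop(servers)        # least-loaded server
    heapq.heappush(servers, (load + est_hours, name))
    print(f"{project} ({est_hours}h) -> {name}")
    return name

for proj, hrs in [("acme", 12), ("globex", 3), ("initech", 8), ("umbrella", 2)]:
    assign(proj, hrs)
```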

Here's as many details as I could think of:

https://preview.redd.it/evpwwtg3jnjc1.png?width=1110&format=png&auto=webp&s=c168326cc9321055e95e223cc77d1c47b0b4fdb4

0 Comments
2024/02/20
02:40 UTC

6

Compiling a List of Essential Terms in Data Analytics/Engineering

I'm currently working on compiling a comprehensive list of important terms and definitions in the data engineering/analytics space. I think it is important, especially for newcomers to this field, to have something like this.

Here's what I've got so far: https://www.datacoves.com/post/data-analytics-glossary-terms

This is where I need your help:

  • Adding More Terms: What are some other terms that you think are crucial for someone to understand? I want this list to be as inclusive and informative as possible.
  • Refining Definitions: If you see a definition that could use more clarity or you have a better way to explain it, please share your suggestions! I'm all for making this as accurate and helpful as possible.

I am open to discussion, as I want to find definitions that are accurate and widely accepted.

Thank you for your help and insights!

1 Comment
2024/02/13
18:19 UTC
