/r/dataengineering

News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.

Read our wiki: https://dataengineering.wiki/

Rules:

  1. Don't be a jerk

  2. Limit Self-Promotion: Remember the reddit self-promotion rule of thumb: "For every 1 time you post self-promotional content, 9 other posts (submissions or comments) should not contain self-promotional content."

  3. Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.

  4. No job posts (posts or comments)

  5. No technical error/bug questions: Any error/bug question belongs on StackOverflow.

  6. Keep it related to data engineering

/r/dataengineering

182,299 Subscribers

3

AWS Redshift and dbt best practices

Hey

I'm starting a new gig as a data engineer and the company I'm joining is on AWS using Redshift RA3 and Spectrum. From what I know they are not using dbt, but want to start.

I've mainly used dbt with Snowflake and BigQuery, and would like to gather best practices on what should be the ideal setup for Redshift.

I found this article on Redshift dbt best practices from March 2023.

It looks good, and there are some great tips on:

  • how to configure a single Redshift cluster with multiple databases (source, dev, production) vs. multiple Redshift clusters
  • performance tuning using sort keys, dist keys and WLM
  • choosing the appropriate model materialisation
  • limiting the data used in dev
  • (... and much more)

I'm wondering whether anyone has come across anything else (or knows from experience) that could help me make sure performance won't be a problem, and in general any best practices/tips for working with Redshift in particular.

ThanksπŸ™‡πŸ»

0 Comments
2024/05/09
12:39 UTC

2

Connecting to Google Cloud tools through Spark is tough

I'm trying to test the pipeline I'm building locally before uploading it to the cloud, but I run into errors at every small step:

  • When I try to spark-read a CSV from Cloud Storage -> no filesystem for scheme "gs" (none of the Hadoop config solutions on Stack Overflow worked for me, sadly)
  • When I try to write to BigQuery -> [DATA_SOURCE_NOT_FOUND] Failed to find the data source: bigquery.

How can I fix these issues and similar ones? It has been a week of me constantly trying to get my local environment good enough, and I just keep getting errors from Docker, Airflow, Spark, etc.
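
Both errors usually mean the GCS and BigQuery connector JARs aren't on the local Spark classpath. A minimal sketch of a local SparkSession with both connectors wired in might look like this; the Maven coordinates, bucket, table and key-file path are placeholders and the versions need to match your own Spark/Scala build:

    # Minimal local SparkSession with the GCS and BigQuery connectors on the classpath.
    # Coordinates, bucket, table and key-file path are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("local-gcp-test")
        .config(
            "spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1,"
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.21",
        )
        # Register the gs:// filesystem (addresses: no filesystem for scheme "gs")
        .config("spark.hadoop.fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        # One common way to authenticate the GCS connector locally (path is a placeholder);
        # the BigQuery connector can pick up GOOGLE_APPLICATION_CREDENTIALS instead.
        .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                "/path/to/service-account.json")
        .getOrCreate()
    )

    df = spark.read.csv("gs://my-bucket/my-file.csv", header=True)

    # The "bigquery" format only exists once the connector package above is loaded
    # (addresses: DATA_SOURCE_NOT_FOUND). An indirect write needs a staging bucket.
    (df.write.format("bigquery")
       .option("table", "my_project.my_dataset.my_table")
       .option("temporaryGcsBucket", "my-temp-bucket")
       .mode("append")
       .save())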

0 Comments
2024/05/09
12:34 UTC

3

Dealing with Multiple Dates in Fact/Dimension Model

I have a model that is order-transaction centric, with many different KPIs based on various dates in the system. The source system has a table containing work orders and a table containing work order statuses; for each work order, the status table can have multiple rows, depending on the journey of the order.

I want to simplify things in my model. Some measures will look at the work order creation date; others will look at the received, in-progress or completed dates. I have around 15 different status dates that relate to different KPI measures. How best can I structure the model? My thoughts are to either replicate the work order table as the fact and the work order status table as a dimension, or add each relevant date as a column to the fact table, bearing in mind that in the latter case there will be null values (e.g. a recently created order will not yet have an in-progress, started or completion date).

4 Comments
2024/05/09
12:23 UTC

3

Data migration to Data lake on s3...

I am working on building a data platform for my company. I have chosen S3 to be my data lake. There are multiple sources the data will be coming from, but for now I have to work with the operational MySQL DB. As there is no warehouse already, the DB is all there is. I was thinking of splitting this ingestion process into two phases:

Phase one: take the tables I need from the DB (or all the tables) and migrate the data into the lake using AWS Glue. This will be a one-time thing.

Phase two: build a CDC pipeline, using open-source tools (a company requirement), for the newer records that are generated and need to be ingested into the lake.

This is just a high-level, theoretical approach I came up with. Does this work? What roadblocks could I hit here, and is this a normal approach to building a data platform? How much would Glue cost, as the DB size is 850 GB? Is it normal practice to migrate all the tables into the lake, even if there are hundreds of them?

I could really use some suggestions and advice from seniors; this is my first time building something like this. I am excited and also terrified...
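
For phase one, a minimal PySpark sketch of what a Glue (or plain Spark) job ends up doing, a parallel JDBC read from MySQL written out as Parquet on S3, might look like the following; the connection details, table list and partition bounds are placeholders:

    # Rough sketch of a one-time full load: MySQL tables -> Parquet on S3.
    # Connection details, table names and partition bounds are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mysql-full-load")
        # MySQL JDBC driver; the version is illustrative
        .config("spark.jars.packages", "com.mysql:mysql-connector-j:8.3.0")
        .getOrCreate()
    )

    for table in ["orders", "customers"]:  # hypothetical table list
        df = (
            spark.read.format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/appdb")
            .option("dbtable", table)
            .option("user", "readonly_user")
            .option("password", "********")
            # Parallelise the read so 850 GB doesn't go through one connection
            .option("partitionColumn", "id")
            .option("lowerBound", "1")
            .option("upperBound", "100000000")
            .option("numPartitions", "16")
            .load()
        )
        df.write.mode("overwrite").parquet(f"s3a://my-data-lake/raw/{table}/")

For the CDC phase, Debezium is the usual open-source choice for MySQL binlog capture, which fits the open-source requirement.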

0 Comments
2024/05/09
12:10 UTC

3

Check if an API has new data?

Basically, I'm building out a pipeline to query data from an API. The data typically updates on Monday but can be later depending on holidays, etc.

Between Monday and Tuesday I want to check whether there is new data in the API since the last call (there is a date in the data that could help if there's no metadata) and then, once there is new data, run the extraction process.

I guess it's bad practice and pointless to run the full process every day, and it gets more compute-expensive if I start chaining processes off it.

There was a Stack Overflow post that suggested running a HEAD request and checking the metadata, but I'd never heard of this. Still quite new to APIs, so any advice would help.

Using ADF but happy to use Python too.

Cheers.
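
For the Python route, a minimal sketch of both checks, a HEAD request for a Last-Modified header with a fallback to comparing the max date inside the payload against a stored watermark, could look like this; the endpoint, date field and watermark file are placeholders, and not every API returns Last-Modified:

    # Sketch: decide whether new data has landed before running the full extract.
    # The URL, date field and watermark file are hypothetical placeholders.
    import json
    from datetime import datetime
    from email.utils import parsedate_to_datetime
    from pathlib import Path

    import requests

    API_URL = "https://example.com/api/weekly-data"
    WATERMARK_FILE = Path("last_load.json")  # remembers what was loaded last time


    def read_watermark() -> dict:
        if WATERMARK_FILE.exists():
            return json.loads(WATERMARK_FILE.read_text())
        return {"loaded_at": "1970-01-01T00:00:00+00:00", "max_record_date": "1970-01-01"}


    def has_new_data() -> bool:
        wm = read_watermark()

        # Cheap check first: some (not all) APIs expose Last-Modified on a HEAD request.
        head = requests.head(API_URL, timeout=30)
        if "Last-Modified" in head.headers:
            modified = parsedate_to_datetime(head.headers["Last-Modified"])
            return modified > datetime.fromisoformat(wm["loaded_at"])

        # Fallback: fetch the payload and compare the date field inside the data.
        records = requests.get(API_URL, timeout=60).json()
        latest = max(r["report_date"] for r in records)  # field name and ISO dates assumed
        return latest > wm["max_record_date"]


    if has_new_data():
        print("new data available - trigger the extraction pipeline")
    else:
        print("no new data yet - skip this run")

In ADF the same idea maps to a Web activity doing the lookup, feeding an If Condition that decides whether the copy/extract pipeline runs.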

3 Comments
2024/05/09
11:37 UTC

3

Migrating from Liquibase to SchemaChange for Snowflake

We have been using Liquibase with Stash for a long time now, and recently there has been an org-level change to move to GitHub. Liquibase has been clunky and complex to work with, and I've been trying out schemachange.

Based on a proof of concept that I did, schemachange seems to check all the boxes we require. The only problem is: how do I move from Liquibase to schemachange? I have a couple of questions about that.

  1. If I add the whole repo into schemachange, what will happen during the first run? Will all objects be recreated?
  2. How do I go about making sure only the modified objects are deployed?
  3. Do I need to move my existing Liquibase history table into the schemachange history table so it understands the historical activity so far?

Please do let me know if any of you have done this in your experience.

Thanks!

0 Comments
2024/05/09
10:55 UTC

15

Apache Spark with Java or Python?

What is the best way to learn Spark: through Java or Python? My org uses Java with Spark and I could not find any good tutorials for that. Is it better to learn it through PySpark, since it's more widely used than the Java API?

21 Comments
2024/05/09
07:09 UTC

4

Why do columnar DBs perform badly on joins?

7 Comments
2024/05/09
06:05 UTC

8

"The problem HAS to be the source"

12 Comments
2024/05/09
05:51 UTC

17

Trying to understand unit tests

Hi everyone! I am a Data Analyst who is trying to learn Data Engineering in my spare time. I have been hearing a lot about unit tests and many other kinds of tests. Can somebody give me an example of what that is? I don't fully understand what a data engineer unit tests. Sorry for the dumb question :D just trying to learn more about the field.
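
To make it concrete, here is a small, hypothetical example: a pure transformation function from a pipeline plus a pytest test that checks its behaviour on a tiny hand-made input (the function and column names are made up for illustration):

    # A small, pure transformation that is worth unit testing, and its test.
    # Both live in one file here for brevity; normally the function sits in the
    # pipeline code and the test in a tests/ folder, run with `pytest`.
    def deduplicate_latest(rows: list[dict]) -> list[dict]:
        """Keep only the most recent record per customer_id, based on updated_at."""
        latest: dict[str, dict] = {}
        for row in rows:
            cid = row["customer_id"]
            if cid not in latest or row["updated_at"] > latest[cid]["updated_at"]:
                latest[cid] = row
        return list(latest.values())


    def test_deduplicate_latest_keeps_newest_record_per_customer():
        rows = [
            {"customer_id": "a", "updated_at": "2024-05-01", "plan": "free"},
            {"customer_id": "a", "updated_at": "2024-05-08", "plan": "paid"},
            {"customer_id": "b", "updated_at": "2024-05-02", "plan": "free"},
        ]
        result = deduplicate_latest(rows)
        assert len(result) == 2
        plans = {r["customer_id"]: r["plan"] for r in result}
        assert plans == {"a": "paid", "b": "free"}

The idea is the same as in software engineering: the logic is isolated from the database so it can be checked against known inputs, catching regressions before they reach production data.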

9 Comments
2024/05/09
02:47 UTC

105

What's a data engineering related book or article that you read recently that "connected the dots" for you?

I've recently been diving into stream processing and was getting overwhelmed with all the various nuances. I recently read Making Sense of Stream Processing by Martin Kleppmann, and it just made everything else I read before "click" for me.

I'm curious what resources have been similarly helpful for you.

27 Comments
2024/05/09
01:35 UTC

3

Mass CSV import tips and recommendations? (e.g. Flatfile, One Schema, csvbox, Droom)

Hi, I'm working with an agency team that needs to collect and consolidate data from spreadsheets.

Note that each spreadsheet originates from a different client. They may have the same information, but they're never going to be consistent. Plus, we have to stick to file uploads since a direct API integration is not realistic.

Ideally, we would upload the files into an app which we could program to rename, reformat, and reorder the columns. Once the data is transformed, it would get written to a table in our data warehouse. Bonus points if we could also overwrite or delete data from a flawed file upload.

This might be an oversimplified description, but I'm wondering if anyone has suggestions for this use case. I'm familiar with apps like Flatfile, One Schema, CSVImporter, and Droom, but they seem like embeddable SDKs. Since internal users will be making these uploads, we don't need anything white-labeled.

Any thoughts are appreciated -- thx!

0 Comments
2024/05/08
23:37 UTC

4

Can I set up a dbt project without a database/output?

For testing purposes, I essentially want to hardcode some SQL models, i.e. just:

select 'Bob' as First_Name, 'Smith' as Last_Name

and execute dbt test to run some tests on them.

However, I haven't managed to do this. Without a valid output in the profile it errors, and if I point it at the actual database, it errors because the table doesn't exist.

6 Comments
2024/05/08
21:36 UTC

0

Question on leading data teams (non-tech manager)

I'm currently a Group Product Manager with an MBA and over 12 years of experience in growth marketing and marketing automation tools. There's a new opportunity on the horizon for me to expand my role to lead both data engineering and analytics teams. Given my background is not heavily focused on data science, I'm looking for advice on how to effectively manage and lead these teams. What are the key skills I should develop? Are there particular challenges I should be aware of?

6 Comments
2024/05/08
21:03 UTC

4

What are some good PDF context reading tools/OCR tools?

I need to find a way to take lots of financial statement PDFs from 600+ clients and extract data from them.

I tried building a GPT chatbot that reads the PDFs, but that didn't work out. There's just too much variance between the documents, so I couldn't nudge it in the right direction since the directions are different for each document.

So far I've tried the Adobe tool, which is shit, and a tool called Liner, which is just a prettier version of what I built, so that was useless. I'm currently in a trial with a company called Super.ai, which looks promising, but so far I haven't gotten the results I wanted.

Any suggestions for good tools for this use case?
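
If you end up rolling your own preprocessing before handing documents to a model, a minimal sketch with the open-source pdfplumber library (plain text and table extraction only, digital PDFs only, no OCR for scanned pages) could look roughly like this; the folder path is a placeholder:

    # Rough sketch: pull raw text and detected tables out of a batch of PDFs.
    # Only works on digital (text-based) PDFs; scanned statements need OCR first.
    from pathlib import Path

    import pdfplumber

    PDF_DIR = Path("statements/")  # placeholder folder of client PDFs

    for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            for page_number, page in enumerate(pdf.pages, start=1):
                text = page.extract_text() or ""
                tables = page.extract_tables()  # list of row-lists per detected table
                print(pdf_path.name, page_number, len(text), len(tables))
                # From here, route each document to client-specific parsing rules,
                # since the layouts differ between clients.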

7 Comments
2024/05/08
20:30 UTC

239

I dislike Azure and 'low-code' software, is all DE like this?

I hate my workflow as a Data Engineer at my current company. Everything we use is Microsoft/Azure. Everything is super locked down. ADF is a nightmare... I wish I could just write and deploy code in containers, but instead I'm stuck trying to shove cubes into triangle holes. I have to use Azure Databricks in a locked-down VM in a browser. THE LAG. I am used to Vim keybindings and it's torture to have such a slow workflow, no modern features, and we don't even have Git integration on our notebooks.

Are all data engineering jobs like this? I have been thinking lately that I must move to SWE so I don't lose my mind. I have been teaching myself Java and studying algorithms. But should I close myself off to all data engineering roles? Is AWS this bad? I have some experience with GCP, which I enjoyed significantly more. I also have experience with Linux, which could be an asset for the right job.

I spend half my workday either fighting with Teams, fighting security measures that prevent me from doing my job, searching for things in our nonexistent version-managed codebase, or wrestling with shitty Azure software with no decent documentation that changes every 3 months. I am at my wits' end... is DE just not for me?

118 Comments
2024/05/08
20:25 UTC

0

Guide me to the right Path.

I have 9 YOE as a Data Engineer (not sure about the title, though).

I worked with SQL, PL/SQL and Postgres (and bash, God, I love bash) for about 5-6 years. I consider myself good with these tools. Then I started on Python, PySpark and cloud (AWS) data solutions (alongside my previous knowledge). I consider myself not as good at Python/PySpark, mostly because I usually just google and code the requirements.

I do not have imposter syndrome (though a year back I thought I did). I am confident in my skill set of data management/storage/analytics.

Whenever I interview or search and apply for new DE jobs, in most cases I do not "feel" matched with the requirements. Wherever I have applied so far, the interviewer mostly asked programming questions; however, I feel it should be more about how I would build or turn requirements into code: how I make data management cheap, how I make ETL faster with a low failure rate, or how I would use AWS services to build scalable data pipelines. Instead they always ask what the difference is between coalesce and repartition. 🤦‍♂️

  1. Looking for help in getting a remote job (desperately) which matches the skill set mentioned above, especially where to look, and the how part; any real experience would help a lot.

  2. Considering my skills above, what more should I learn (to get the job) and from where? What should the learning goals be, and is any practice needed? If yes, how (or maybe some resources)?

  3. One of my DE buddies says "Bro, go deeper, not wider" (basically asking me not to nibble at multiple tools/languages every day after reading a Medium blog post). My question is: in this market, how good is that advice?

Should I follow my buddy's advice considering the tech world right now?

4 Comments
2024/05/08
19:57 UTC

2

What actual methodologies and frameworks do you use for data modeling and design? (Discussion)

I’m currently reading a book call β€œAgile Data Warehouse Design” to prep for an engagement I have coming up. They have a methodology called β€œBEAM*” which they describe as a process to use for collaboratively working with stakeholders to identify data and document business processes.

Reading this has gotten me thinking: how do others go about this work? I'm talking about starting from meeting with stakeholders and business analysts, finding out what questions they're interested in asking of the data, documenting this in a way that's understandable and useful to both technical and non-technical folks, and then ultimately building a star schema or something akin to it. Do you just wing it, or do you follow a specific methodology that you've found useful? I feel like there's quite a bit of overlap with DDD in the sense of modeling business events, for example. And I know Kimball talked about things like the enterprise bus matrix (I think that's what it was called), among other frameworks.

I'm also curious how far you go in discussing these more abstract questions before looking at the actual data available and its quality. For example, a business can talk all about how it wants to understand the gas-mileage efficiency of its company vehicles, but if it doesn't collect data related to that (or the data is of bad quality), then it probably doesn't make sense to spend a ton of time discussing it.

11 Comments
2024/05/08
19:08 UTC

1

XML Metadata Interchange in R

Hi,
for a research project I need to analyze a large data set of XMI files using R. Can anyone help directly or point me to a website with suitable guidance? Thanks in advance.
Best

0 Comments
2024/05/08
18:58 UTC

7

Advice on next steps in data career?

Hello all. Looking for advice on next career steps.

Going on three years now I have worked at a startup, where I started as an "analyst" and now my official title is "Business Intelligence Engineer". My day-to-day has me primarily using SQL, where I mainly keep the database up to date, whether that's updating, inserting, or deleting data, with some reporting on the side. At times I have also advised on table design, and every once in a while I get to use a little Python to script some ETL. I use psql, Unix, and Git fairly regularly, and I can slowly make my way around a Vim editor.

As of late the job has stagnated in duties, the company still isn't profitable and I don't see anything changing anytime soon. I don't know what kinds of jobs I am qualified to move to though.

I don't have a CS degree. I did take some data analysis classes for my undergrad though.

I don't feel I have enough Python experience to be a true data engineer. If I had more free time, I would code on the side to gain experience, and I have considered quitting and taking a month or two off to up my skills. I have spent lots of time working with a data engineer in my company. He has taught me a lot about everything related to data, and I envy the projects he gets to work on.

I don't really enjoy the analytics side of data. I much prefer just writing code. I know I can't totally escape from it, but I don't want to be a true data analyst, writing reports and building dashboards.

I have some experience working with AWS. Mainly the console.

I really enjoy coding. I feel very confident in my SQL. I would love to be a data engineer. I also enjoy discussing data architecture.

Given my history, what might be a next good step for me? If I can provide more context or information please ask. I am open to all feedback. I am US based and 29 years old if it makes a difference.

5 Comments
2024/05/08
18:53 UTC

9

Choosing Between DevOps and Data Engineering

I've worked with SAP for four years and just completed my MS in Data Science. While studying, I found out that I prefer programming over the math-heavy parts of data science. This got me interested in data engineering. But now that I'm 30 and feeling the pressure of time, I'm not sure what to choose next. Should I stick with DevOps, since I have experience with SAP and even earned an AWS certification (which is about to expire)? Or should I go for data engineering instead? I want to make the right choice based on my past experience and what's best for my career.

12 Comments
2024/05/08
18:42 UTC

5

Evaluate approach for Kafka, Spark, Iceberg pipeline & help for schema management

Hi all, I'm a relatively new (2 years) data engineer. I am working with the following architecture:

Kafka (Cloudera on-prem) + Spark (On-prem cluster) + Nessie(k8s) with Iceberg on Minio (on-prem s3 storage)

My use case is a team of 50 engineers that run scripts a few times a day, generating GBs of data. Consecutive runs may share the same schema, but it is possible that a new run adds a column. The data needs to land in the data lake in near real time (a few seconds) and then be queried using Dremio or visualized for forecasting. I decided the flow should be:

  1. A Python Kafka producer checks the Nessie catalog/Iceberg to see whether 'TableName' exists; if yes, it retrieves the schema and validates the current record against it, and if not, it registers the schema of the current record as a new table in Nessie.
  2. The producer appends TableName to the record and writes the record as a JSON string to Kafka (a rough sketch of this producer step is shown right after this list).
  3. A Spark Structured Streaming consumer
    1. reads the record from Kafka and parses the table name (maybe from the key),
    2. gets the schema for that table name from Nessie and uses it to create a Spark dataframe from the JSON string, and
    3. writes the Spark dataframe as a row to the Iceberg table on S3.
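
A minimal sketch of that producer step with kafka-python; the catalog check is left as a stub (since, as noted in the questions below, there is no pure-Python Nessie interaction in place yet), and the broker, topic and field names are placeholders:

    # Sketch of the producer step: ensure the table's schema is known, then write
    # the record to Kafka with the table name in the message key.
    # ensure_table_registered() is a stub; broker, topic and fields are placeholders.
    import json

    from kafka import KafkaProducer  # pip install kafka-python


    def ensure_table_registered(table_name: str, record: dict) -> None:
        # Check the Nessie/Iceberg catalog for table_name; if it's missing,
        # register a schema inferred from this record. Left unimplemented here,
        # since the catalog interaction is the open question in this post.
        pass


    producer = KafkaProducer(
        bootstrap_servers=["kafka-broker:9092"],
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )


    def publish(table_name: str, record: dict) -> None:
        ensure_table_registered(table_name, record)
        # Putting the table name in the key lets the Spark consumer route records
        # without parsing the value, and keeps each table's records ordered
        # within a partition.
        producer.send("engineering-runs", key=table_name, value=record)


    publish("experiment_results", {"run_id": 42, "metric": 0.93})
    producer.flush()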

I'm seeking feedback on whether this approach is sound and have a few questions:

  1. Should there be a unique Kafka topic for each schema, considering there will be multiple schemas from 50+ engineers? (I've decided against it for now but would appreciate thoughts.)
  2. How can I ensure a single source of truth for the schema? I've considered Schema Registry for Kafka but opted against it due to lack of expertise and time. I'm using Nessie as a 'schema registry', but there's no pure-Python way to interact with it apart from using its Iceberg-Nessie Spark plugin and running the producer on Spark.
  3. Do the steps of the producer registering a schema for every new table / getting the schema for every record, and then doing the same thing again in Spark, add too much overhead and unnecessary latency for a streaming solution? Sub-second latency isn't crucial; data quality matters more to me.

https://preview.redd.it/2vceao8vx8zc1.jpg?width=2932&format=pjpg&auto=webp&s=2f426ea0e1eda43799a13ee53998ceb6e6e78af8

1 Comment
2024/05/08
18:33 UTC

1

Query engine / user-facing data layer over delta lake tables in S3

Hi,
I'm creating a data lakehouse for online reporting. I am using S3 with Iceberg tables in the storage layer, and in the processing layer I'm using Spark Streaming on Kubernetes (K8s). However, I need a query engine to read data from the optimized zone of the lakehouse, and I'm unsure which one to use. Could you provide some suggestions? (open source)

2 Comments
2024/05/08
17:22 UTC

3

Need advice to improve hands-on pyspark skills

I have been interviewing for DE positions for the last 2 months. I have done a total of 4 interviews; I could answer the theoretical questions very comfortably, but whenever a simple PySpark problem is asked I am not able to solve it.

To prepare for this I collected some questions from LinkedIn posts, but when the questions asked don't follow a similar pattern, I cannot solve them.

I need some advice on how to improve my hands-on PySpark and Java Spark coding skills.
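
For a sense of the kind of exercise that tends to come up, here is one common interview-style pattern (top-N per group with a window function), with made-up data and column names:

    # Practice problem: top 2 highest-paid employees per department.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("practice").getOrCreate()

    df = spark.createDataFrame(
        [("eng", "ana", 120), ("eng", "bo", 110), ("eng", "cy", 95),
         ("hr", "dee", 80), ("hr", "eli", 90)],
        ["dept", "name", "salary"],
    )

    w = Window.partitionBy("dept").orderBy(F.col("salary").desc())

    top2 = (
        df.withColumn("rnk", F.dense_rank().over(w))
          .filter(F.col("rnk") <= 2)
          .drop("rnk")
    )
    top2.show()

Drilling a handful of these patterns (window functions, joins plus deduplication, explode/pivot, groupBy aggregations) against small hand-made dataframes covers most of what gets asked.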

3 Comments
2024/05/08
17:38 UTC

0

Is AI becoming a real threat? Your backup plans

Hey all, I hope you're having a nice and happy Wednesday!

I've started to hear from people close to me about how layoffs are starting to be attributed to AI. 5% of layoffs seem to be due to GenAI replacement policies in corporations, says the FT here (alternative link).

I'm seeing friends struggling to find their next jobs (these are quite experienced, senior guys, btw) in the data field. While I'm bullish about the future of DE jobs for at least 5 more years, I've started questioning my current status: do I have any backups, and what possible scenario could I proceed with?

One thing I heard from a colleague at work today is agriculture, but obviously backed by data solutions: integrated IoT services, sensors to gather climate, moisture and temperature readings from the land, etc. He seemed so excited about how he could leverage data for agriculture, but I am not sure it will really create value in practice. He is seriously planning to rent a plot of land and train himself at least one day a week 😀

I found it interesting and possibly a solid fallback; maybe I can join him too 🤔 If it works, it can be a path to becoming a self-sufficient farmer and can also be sold as a tool. One solution he mentioned has already started to find customers here.

What do you think? Do you have any backup plans ?

4 Comments
2024/05/08
16:28 UTC

2

Simple row-level transformations in Postgres Change Data Capture (CDC)

Today we are excited to announce Lua-based row-level transformations as part of Postgres Change Data Capture (CDC) to Kafka and other message brokers: https://blog.peerdb.io/row-level-transformations-in-postgres-cdc-using-lua. This feature unlocks many powerful use cases, such as:

  • 🔒 Masking PII data: obfuscate sensitive PII with tokens for enhanced privacy in Kafka.
  • 🔄 Changing data format: convert data into formats like Protobuf or JSON for optimized downstream handling.
  • 📊 Generated columns: derive new values from existing data for real-time analysis.
  • 🗂️ Unnesting JSON: flatten JSON elements into separate Kafka fields for improved usability.
  • 🔀 Topic routing: direct CDC events to designated Kafka topics for targeted streaming.
  • 🔐 Data encryption: encrypt sensitive data before it enters Kafka to secure it in transit.

Give it a try; we would love to hear your feedback! 😊

1 Comment
2024/05/08
15:58 UTC

13

I'm building a tool that automatically writes dbt staging code using LLMs

4 Comments
2024/05/08
14:38 UTC

0

Exploring Azure Cosmos DB: A Guide to Scalable NoSQL Database Solutions

πŸš€ Dive into the future of databases with our latest blog on Azure Cosmos DB! 🌐 Discover how this fully managed NoSQL and relational database service can revolutionize your applications through global scalability, massive throughput, and efficient data partitioning. 🌟

πŸ”— Learn about the key features:

  • Scalable partitioning (Logical & Physical)
  • Horizontal scaling for high availability
  • Global distribution and multi-master replication

πŸ› οΈ Plus, get a step-by-step guide on setting up your own Cosmos DB instance!

Perfect for developers looking to elevate their applications to the next level. Check it out now!
https://erwinschleier.medium.com/exploring-azure-cosmos-db-a-guide-to-scalable-nosql-database-solutions-24c5474f74ca

#AzureCosmosDB #NoSQL #DataScalability #CloudComputing #MicrosoftAzure

1 Comment
2024/05/08
14:19 UTC
