/r/dataengineering
News & discussion on Data Engineering topics, including but not limited to: data pipelines, databases, data formats, storage, data modeling, data governance, cleansing, NoSQL, distributed systems, streaming, batch, Big Data, and workflow engines.
Read our wiki: https://dataengineering.wiki/
Rules:
Don't be a jerk
Limit Self-Promotion: Remember the reddit self-promotion rule of thumb: "For every 1 time you post self-promotional content, 9 other posts (submissions or comments) should not contain self-promotional content."
Search the sub & wiki before asking a question: Your question has likely been asked and answered before so do a quick search before posting.
No job posts (posts or comments)
No technical error/bug questions: Any error/bug question belongs on StackOverflow.
Keep it related to data engineering
Hey all,
Sorry if this question has been asked before, but I specifically want to know what kind of skills I need to land a job in DE as a fresher in Europe.
My background: I am a data science student in my final year of studies, and I've landed a data analytics internship at a company. I'm building projects with real use cases (not copy-paste projects from YouTube). I've listed out a few general skills like SQL, big data technologies, Spark, and AWS.
Real question: how many of these skills are actually required to get the job? As I've started to dig deeper, I've realized that it's a lot to catch up on and each technology is vast in itself. Companies basically list out every technology. Do they expect us to know everything? I had to dig deep just to get this internship; now it's about getting a full-time job. To make things worse, I have to learn the local language as well (I'm currently at an intermediate level and actively taking speaking classes).
Any help and suggestions on how to go about it would really help. Thank you in advance.
I use Python and SQL at my DE job and am good at both. But I have always done DSA in C++; since DE primarily uses Python, can I solve DSA questions in C++ during interviews? I am starting my interview preparation and will be applying soon.
Hi fellow engineers, in Databricks, how can I get the jobs data across all our workspaces and monitor it, ideally with a dashboard (maybe a dashboard from Databricks itself) that shows the job patterns? Any help would be greatly appreciated.
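One way to approach this, sketched below with a placeholder workspace URL and token: pull recent runs from the Jobs 2.1 REST API (once per workspace) and aggregate outcomes per job; the aggregated result is what you would point a dashboard at. Pagination and error handling are omitted, and the endpoint and field names are from the public Jobs API docs as I remember them, so treat this as a starting point rather than a drop-in script.

```python
# Rough sketch (not production code): pull recent job runs from one workspace
# via the Jobs 2.1 REST API and summarise outcomes per job. The workspace URL
# and token below are placeholders; pagination is omitted.
import collections
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"limit": 25, "expand_tasks": "false"},
    timeout=30,
)
resp.raise_for_status()
runs = resp.json().get("runs", [])

# Count result states per job_id so you can chart success/failure patterns.
summary = collections.defaultdict(collections.Counter)
for run in runs:
    job_id = run.get("job_id")
    result = run.get("state", {}).get("result_state", "RUNNING")
    summary[job_id][result] += 1

for job_id, counts in summary.items():
    print(job_id, dict(counts))
```

If the workspaces are on Unity Catalog, job-run history is also exposed as system tables (under the system.lakeflow schema, if I recall the name correctly), which a Databricks SQL dashboard can query directly without touching the REST API.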
Hi everyone,
I have an interview scheduled for a DE role, and the details (date and time) are visible on the Workday portal. However, I haven't received the invite link or any additional instructions via email.
I’ve checked my spam/junk folders and the Workday notifications, but nothing seems to be there and I don't have any direct contact with HR. Has anyone faced a similar issue?
Thanks in advance for your help!
Hi! I'm trying to get a good grasp of the field of Data Preparation and would love to hear about your journey of building ML/AI applications or pipelines.
Here are some questions I have, though I'd also appreciate elaboration on anything else:
Thanks.
Hello fellow engineers!
A while back, I asked a similar question regarding a data store for IoT data (which I have since implemented, and it works pretty well).
Today, I am exploring the possibility of ingesting IoT data from a different source, where the data is of finer granularity than what I have been ingesting. I am thinking of ingesting this data at 15-minute intervals, but I realised that doing so would generate a lot of rows.
I did a simple calculation with some assumptions (worst case):
400 devices * 144 data points * 96 (15 minutes interval in 24 hours) * 365 days = 2,018,304,000 rows/year
And assuming each row size is 30 bytes:
2,018,304,000 * 30 bytes = approx. 57 GB/year
My intent is to feed this data into PostgreSQL. The data will end up in a dashboard for analysis.
I've read up quite a bit online, and I understand that PostgreSQL can handle billion-row tables well as long as the proper optimisation techniques are used.
However, I can't really find anyone with literally billions (like 100 billion+?) of rows who says that PostgreSQL is still performant.
My question here is: what is the best approach to handle this data volume, with the end goal of serving it for analytics? Even if I solve the data store problem, I would imagine that pulling this sort of data into my visualisation dashboard would kill its performance.
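For scale like this, one common pattern is native range partitioning on the timestamp plus pre-aggregated rollup tables that the dashboard queries instead of the raw data. A rough sketch under assumed names (the readings table, its columns, and the connection string are all hypothetical), using psycopg2:

```python
# Rough sketch only: monthly range partitioning for raw readings, plus a daily
# rollup table the dashboard queries instead of the raw data. Table and column
# names are made up for illustration.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS readings (
    device_id  integer      NOT NULL,
    metric_id  integer      NOT NULL,
    ts         timestamptz  NOT NULL,
    value      real         NOT NULL
) PARTITION BY RANGE (ts);

-- One partition per month; create these ahead of time (or automate it).
CREATE TABLE IF NOT EXISTS readings_2025_01
    PARTITION OF readings
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE INDEX IF NOT EXISTS readings_2025_01_device_ts
    ON readings_2025_01 (device_id, ts);

-- Daily rollup the dashboard reads; refresh it incrementally from a job.
CREATE TABLE IF NOT EXISTS readings_daily (
    device_id  integer,
    metric_id  integer,
    day        date,
    avg_value  real,
    max_value  real,
    PRIMARY KEY (device_id, metric_id, day)
);
"""

with psycopg2.connect("dbname=iot") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```

The dashboard would then read readings_daily (about 400 × 144 × 365 ≈ 21 million rows per year, using the figures above) rather than the ~2-billion-row raw table; extensions like TimescaleDB automate much of the partition management and rollups if you'd rather not hand-roll it.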
Note that historical data is important, as the stakeholders need to analyse degradation trends over the years.
Thanks!
We're currently using hive-partitioned Parquet to store a very modest amount of data (17 GB, but about a billion rows, and we're just getting started, so this is fine in the medium term, but I'm looking ahead).
The simplest next step looks to be Iceberg. I love the Hive/Parquet setup because it's very simple: it doesn't require an active database, it just uses a filesystem, and it's easy to build small models on local machines for testing and development. For the foreseeable future, our writes are very infrequent and easily coordinated (we're storing data for offline machine learning, so data import is rare but queries are fairly frequent), so this checks our boxes, but I'd like to look toward the next step up.
I see how to do that with Iceberg/pyiceberg, but the simplest filesystem-based catalog (SQLite) doesn't even work reliably over locally networked file systems like CIFS (I get locking errors immediately). I'd like a very simple way to build and modify a catalog over just a local/networked/S3 filesystem and have the whole thing queryable by DuckDB. We're likely moving to AWS more, so I could start using e.g. Glue, but I'd rather keep things as simple, local, and nonproprietary as I can, particularly while we're in the flailing stage.
What's the best way to go about that?
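One low-ceremony option, sketched below, is to let DuckDB's iceberg extension read the table straight off the filesystem or S3 with no catalog service at all. The path is a placeholder, and depending on how the table was written you may need to point at a specific metadata JSON rather than the table root:

```python
# Sketch: query an Iceberg table directly from its directory with DuckDB's
# iceberg extension, with no catalog server involved. Path is a placeholder;
# if the table lacks a metadata/version-hint.text file, point iceberg_scan at
# a specific metadata/vN.metadata.json instead.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")

result = con.execute(
    """
    SELECT count(*) AS n
    FROM iceberg_scan('warehouse/db/events', allow_moved_paths = true)
    """
).fetchall()
print(result)
```

Writes would still go through pyiceberg or Spark with whatever catalog they can reach; only the read path needs to stay this simple, which fits the rare-import, frequent-query pattern described above.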
^title - Found this comprehensive guide, but I'm wondering if it's still worth following as a learning resource. Or would you recommend others? Which?
Edit: Link to cookbook: andkret/Cookbook on GitHub
Does anyone have a concrete answer as to the value returned from the above? MS Learn states it is a combination of ISO 8601 formats and provides the example 2017-06-01T22:20:00.4061448Z.
In the majority of my testing the precision appears to be consistent; however, we had a single failure where the fractional-second precision was shorter, e.g. ending in 12345Z.
It's simple enough to handle the variability; I was just wondering what the definitive expectation should be, especially as, after countless unit tests, it has only occurred the once.
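For the "handle the variability" part, one defensive approach is to normalise the fractional seconds to a fixed width before parsing, so whatever precision comes back is accepted. A standard-library sketch (the 12345Z case is padded, the documented seven-digit case is truncated to microseconds):

```python
# Sketch: tolerate ISO 8601 UTC timestamps with anywhere from 0 to 7+ fractional
# digits (e.g. 2017-06-01T22:20:00.4061448Z or ...:00.12345Z) by normalising
# the fraction to exactly 6 digits before parsing.
import re
from datetime import datetime, timezone

def parse_utc(ts: str) -> datetime:
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", ts)
    if not m:
        raise ValueError(f"unexpected timestamp format: {ts!r}")
    base, frac = m.group(1), (m.group(2) or "")
    frac = (frac + "000000")[:6]   # pad or truncate to microsecond precision
    return datetime.strptime(f"{base}.{frac}", "%Y-%m-%dT%H:%M:%S.%f").replace(
        tzinfo=timezone.utc
    )

print(parse_utc("2017-06-01T22:20:00.4061448Z"))   # 7 digits, truncated
print(parse_utc("2017-06-01T22:20:00.12345Z"))     # 5 digits, padded
```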
My work primarily involves extracting raw data, performing extensive data modeling using dbt and Google BigQuery, and delivering data to analysts for their insights.
I rarely use Python in my current role, except for occasional orchestration tasks with Airflow. However, most Data Engineering interviews require Python skills.
Although I have Python knowledge, I struggle with live coding questions during interviews.
Plus, with ChatGPT, anything I need can be done using it; there is never a reason for me to write Python.
Do you have any suggestions on how I can improve?
Hi folks, recently I was transferred to a new team which is in charge of data pipelines. Everything is pretty exciting, and I feel it's great to pick up new transferable skills.
However, these pipelines are written in Scala Spark, and I have no experience with Spark internals or the Spark APIs. Does anyone know some good resources I could use to get up to speed quickly?
I'm interested in what Spark can do, and in some example data pipelines so I can see how everything is put together.
TLDR: I'm in a bit of a time crunch to perform well in the new team I'm posted to. I need to get up to speed on the Scala Spark APIs and how to put them together in a typical pipeline.
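Since the post asks for an example of how a pipeline is put together, here is a deliberately tiny one. It's PySpark rather than Scala, but the read/transform/write structure and the DataFrame method names carry over almost one-to-one; the paths and column names are made up.

```python
# Minimal batch pipeline sketch: read raw events, clean and aggregate, write a
# partitioned Parquet output. In Scala the same calls exist on DataFrame with
# nearly identical names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_daily").getOrCreate()

raw = spark.read.option("header", "true").csv("s3://raw-bucket/orders/")  # placeholder path

cleaned = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_ts"))
       .dropna(subset=["order_id", "amount"])
       .dropDuplicates(["order_id"])
)

daily = (
    cleaned.groupBy("order_date", "country")
           .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://curated-bucket/orders_daily/"   # placeholder path
)

spark.stop()
```

Most batch pipelines are variations on this shape: read from a source, apply typing and cleaning transformations, aggregate, and write to partitioned storage, with an orchestrator (Airflow, Databricks Jobs, etc.) scheduling the job.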
I am a C++/C# developer who wants to transition into Data Engineering. I'm looking for a good course whose certification would also help me get a job.
I have been in Canada for a few years, having moved here from the west coast. I work as an HPC and AI engineer with about a decade of experience, and the job search in Canada seems much harder than in the US. Even two years ago, it didn't seem this difficult in the US compared to here. Getting companies to simply look at your profile has been extremely difficult.
What has Reddit's experience been?
I have to choose one of the two, and I have until the end of this calendar month to sit the preferred exam. DP-203 seems more stable and more widely used, as most orgs have not adopted Microsoft Fabric; however, with the push Microsoft is making to move everyone onto the platform, DP-700 seems a bit more future-proof.
Do I stick with the tried and tested and revisit the DP-700 in a year or two, or do I try to beat the curve and take it now instead of the DP-203?
I have a bunch of tables I need synced to a different database on a regular basis. Are there tools for that in SQLAlchemy or psycopg that I don't know of, or any other standard replication method?
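As far as I know, neither SQLAlchemy nor psycopg ships a built-in sync command; if both ends are Postgres, logical replication or postgres_fdw are the standard answers. For modest tables, though, a plain copy script is often enough. A sketch with SQLAlchemy plus pandas, where the connection strings and table names are placeholders:

```python
# Sketch: naive full-refresh copy of a few tables between two databases.
# Fine for modest volumes; for large tables or low latency, prefer native
# replication (logical replication, postgres_fdw, or a CDC tool).
import pandas as pd
from sqlalchemy import create_engine

SOURCE = create_engine("postgresql+psycopg2://user:pass@src-host/src_db")   # placeholder
TARGET = create_engine("postgresql+psycopg2://user:pass@dst-host/dst_db")   # placeholder

TABLES = ["customers", "orders", "order_items"]   # placeholder table names

for table in TABLES:
    df = pd.read_sql_table(table, SOURCE)
    df.to_sql(table, TARGET, if_exists="replace", index=False, chunksize=10_000)
    print(f"copied {len(df):,} rows into {table}")
```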
Hello, I am a fresh graduate (I graduated this year in June), and currently I am working as an Associate Software Engineer in ROR. However, I've always been interested in Data, and even took data related courses in University (Data Science, ML, DL etc.). Our University did not offer courses related to Data Engineering. I'd say I'm pretty decent in Python and SQL, and I do understand Data Engineering on a high level, but I do not have a portfolio for Data Engineering at the moment.
1 - How do I level up my knowledge about DE?
2 - How do I start building a DE portfolio?
3 - How should I go about pursuing a career in DE with hardly any local opportunities?
4 - Any other advice you would like to give would be lovely as well.
I do not have any direction in respect to this career path at the moment, and need quite some guidance. Thanks in advance!
If I want to start learning infrastructure, what is the first step and what comes next? Start with Docker?
What are all the fundamentals I should know for how to set things up?
So… my company just acquired Foundry, and somehow management is convinced that Foundry can magically cure all data quality issues. Many of them think data quality isn't the issue because they've got Foundry.
We’ve been fighting to make management care about data quality but no luck!
What are your stories? What's your strategy to convince them? Long term? Short term?
What are some red/green flags for good teams? How do you identify good places to learn and grow? For example, I am interviewing right now, and one of the things that worries me is that there doesn't seem to be a strong culture of best practices: most of the team members are newer to the data engineering role, or come from a non-data-engineering, more analyst-oriented background. Most of them do not have a CS degree either, and for some of them I saw their previous roles go from Analyst (1 yr) -> Data Engineer (1-2 yr) -> Senior Data Engineer (1 yr). Is it just me, or does that sound weird? Am I being too pessimistic? How do you identify good teams?
Hi legends! A newbie from India here :) I'm looking for some advice on what to do next in my career. I graduated in 2023 and started working as a data engineer, and I'm a bit confused about what to do next. I'm interested in learning machine learning methods and maybe building a career in that area.
Should I go for a master's in ML (or another related domain), or stick with jobs and switch to a different one now? (Pros and cons?)
For a master's, I was looking at German universities; it's a bit ambitious because I feel I don't have a strong academic background (I graduated from IIT Gandhinagar with a 7/10: mechanical major and a minor in CSE). I'm confident I can get into some above-average master's program in the US.
Would love to know your thoughts, advice..
I have a database where I snapshot external source data using dbt, because the source was impossible to extract data from efficiently. I would like to pass this snapshotted data on to another database entirely using CDC. At times I would prefer to avoid extracting the entire snapshot table and just extract the most up-to-date values.
Given that dbt rematerializes tables entirely, what happens when you use CDC on these types of tables? Views won't work, since you can't use CDC on them, at least in Postgres.
Any suggestions?
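One Postgres-specific detail that may matter here, offered as a hedged sketch rather than a definitive answer: logical-replication CDC is driven by publications, and if dbt really does drop and recreate the table on each run, the new relation is no longer part of the publication and has to be re-added (an incremental or snapshot materialization avoids the full rebuild entirely). Placeholder names throughout:

```python
# Sketch: check whether the snapshot table is still part of the publication
# feeding CDC after a dbt rebuild, and re-add it if it fell out.
# Publication, schema, and table names are placeholders.
import psycopg2

PUB = "cdc_pub"
SCHEMA, TABLE = "snapshots", "source_orders_snapshot"

with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
    cur.execute(
        "SELECT 1 FROM pg_publication_tables "
        "WHERE pubname = %s AND schemaname = %s AND tablename = %s",
        (PUB, SCHEMA, TABLE),
    )
    if cur.fetchone() is None:
        # Identifiers can't be bound as query parameters; these names come
        # from our own config above, not from user input.
        cur.execute(f"ALTER PUBLICATION {PUB} ADD TABLE {SCHEMA}.{TABLE}")
        print("table re-added to publication")
```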
I've been asked to create a dashboard or visual that shows the usage of all our tables in Redshift. We have 12,000+ tables and want to understand which are the most used, so we can better craft SLAs around data incidents, should they arise, for particular data sources. I'd love to hear from anyone who has done this before, and any ideas on how to implement it. Thanks so much in advance!
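In case it helps, Redshift's own system views can get most of the way there: STL_SCAN records which table each query step scanned, and SVV_TABLE_INFO maps table IDs to names. A rough sketch that ranks tables by the number of distinct queries touching them (connection details are placeholders, and STL_* views only keep a few days of history, so you'd persist this on a schedule):

```python
# Sketch: rank tables by how many distinct queries scanned them, using
# Redshift system views. Connection string is a placeholder; run on a
# schedule and store the results, since STL_* retention is only a few days.
import psycopg2

QUERY = """
SELECT ti."schema"              AS schema_name,
       ti."table"               AS table_name,
       COUNT(DISTINCT s.query)  AS query_count
FROM stl_scan s
JOIN svv_table_info ti ON ti.table_id = s.tbl
WHERE s.userid > 1              -- skip internal/system queries
GROUP BY 1, 2
ORDER BY query_count DESC
LIMIT 100;
"""

conn_str = "host=my-cluster.example.redshift.amazonaws.com port=5439 dbname=analytics user=admin password=..."  # placeholder
with psycopg2.connect(conn_str) as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for schema_name, table_name, query_count in cur.fetchall():
        print(f"{schema_name}.{table_name}: {query_count} queries")
```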
I've been working in my current DE role for a few months, having previously worked more on the data science/analytics side for the past several years. Like many of you, my motivation to switch to DE was that I like the programming side of things more than analyzing data. I guess I feel more satisfied developing data products than delivering insights.
I went into the job hoping I could use Python more as part of my day-to-day work and do more programming, but most of my job currently feels like 40% SQL, 10% trying to align source data into a data model, 1% AWS and Python, and 49% trying to figure out what end users are even asking for. As a result, I've been feeling kind of overwhelmed; writing SQL or doing anything technical feels far easier than keeping up with people who aren't remotely clear about what they want: saying they want one thing one day and another thing the next, saying they want something but not clearly defining it, using confusing acronyms, or not properly explaining definitions or parameters.
Is this typical in everybody else's DE job? Don't get me wrong, there are things I like about this job, but I feel like if I don't proactively upskill on the side, the job itself won't give me the technical experience I'm looking for. I've been wanting to spend time upskilling to fill that gap, but by the time I'm done with work, I feel kinda tired lol.
I’ve been with my company for two years and was recently promoted to Data Engineer. Our data infrastructure was established several years ago and has never been reevaluated. Now that I’m in this position, I’m looking to simplify the process and reduce costs.
Currently, our setup works with .parquet files on AWS S3 and an MSSQL server.
Can I optimize and reduce the costs of this infrastructure? I believe eliminating the MSSQL server and keeping my final project tables in AWS RDS would be a better alternative.
Does anyone have any suggestions or criticisms? Will this transition be very challenging?