/r/datascience
A space for data science professionals to engage in discussions and debates on the subject of data science.
I’m planning to start a new data analysis project and want to ensure it aligns with what interviewers look for. Can anyone share insights on what really stands out to hiring managers in data analysis projects? Any advice on key elements or focus areas that would make a project more impressive during interviews would be much appreciated!
I work at a company where the data engineering team is new and quite junior - mostly focused on simple ingestion and pushing through whatever logic our (also often junior) data scientists give them. The data scientists also write the orchestration themselves, e.g. how to process a real-time streaming pipeline for their metric construction and models. So we have a lot of messy, often inefficient code that the data scientists put together.
As the most senior person on my team, I've been tasked with taking more of a lead in teaching the team best practices related to data engineering - simple things like good approaches for backfilling, modularizing queries and query efficiency, DAG construction and monitoring, etc. While I've picked up a lot from experience, I'm curious to learn more "proper" ways to approach some of these problems.
What are some good and practical data/analytics engineering resources you've used? I saw dbt has interesting documentation on best practices for analytics engineering in the context of their product, but I'm looking for other sources as well.
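To make the backfilling point concrete, here's a minimal sketch of the idempotent, partition-per-run pattern. It assumes Airflow purely for illustration (the stack may well differ), and every name in it is made up:

```python
# Minimal sketch of a backfill-friendly DAG (assumes Airflow 2.x; names are illustrative).
# Each run processes only its own logical-date partition, so re-running old dates is safe.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_partition(partition_date: str) -> None:
    # partition_date is the templated logical date (YYYY-MM-DD).
    # Idempotent pattern: delete-and-reload (or MERGE) just this date's partition.
    print(f"Reloading partition for {partition_date}")


with DAG(
    dag_id="daily_rides_ingest",           # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,                          # lets catchup / `airflow dags backfill` fill history
    max_active_runs=4,                     # throttle so backfills don't overwhelm the warehouse
) as dag:
    PythonOperator(
        task_id="load_partition",
        python_callable=load_partition,
        op_kwargs={"partition_date": "{{ ds }}"},
    )
```

Because each run only rewrites its own date partition, replaying any historical range is safe and boring, which is the whole point of the pattern.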
I think I have a fairly solid grasp now of what a random forest is and how it works in practice, but I am still unsure as to exactly how a random forest makes predictions on data it hasn’t seen before. Let me explain what I mean.
When you fit something like a logistic regression model, you train/fit it (i.e. find the model coefficients which minimise prediction error) on some data, and evaluate how that model performs using those coefficients on unseen data.
When you do this for a decision tree, a similar logic applies, except instead of finding coefficients, you’re finding “splits” which likewise minimise some error. You could then evaluate the performance of this tree “using” those splits on unseen data.
Now, a random forest is a collection of decision trees, and each tree is trained on a bootstrapped sample of the data with a random subset of predictors considered at each split. Say you want to train 1000 trees for your forest. Because of the bootstrap sampling, a single datapoint (row of data) might appear in, say, 300 of the 1000 trees. If 297 of those 300 trees predict (1) and the other 3 predict (0), the overall prediction would be 1. The same logic follows for a regression problem, except you'd take the arithmetic mean.
But what I can't grasp is how you'd then use this to predict on unseen data. What are the values I obtain from fitting the random forest model, i.e. what splits is the random forest using? Is it some sort of average of the splits across all the trees trained in the model?
Or am I missing the point, i.e. is a new data point actually put through all 1000 trees of the forest?
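Here's a minimal scikit-learn sketch (toy data, arbitrary parameters) of what happens at prediction time - each fitted tree keeps its own splits, the new row is run through every one of them, and the outputs are aggregated:

```python
# Sketch: a random forest stores every fitted tree; prediction runs the new row
# through each tree and aggregates the per-tree predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)

new_point = X[:1]  # one row standing in for unseen data, shape (1, n_features)

# Every individual tree (with its own splits) predicts the new point...
per_tree_votes = np.array([tree.predict(new_point)[0] for tree in forest.estimators_])

# ...and the forest's answer is the aggregate of those per-tree outputs.
print("fraction of trees voting 1:", per_tree_votes.mean())
print("forest prediction:         ", forest.predict(new_point)[0])
```

So nothing about the splits themselves gets averaged; the averaging happens over the trees' outputs. (Strictly, scikit-learn averages the trees' predicted class probabilities rather than hard votes, but the idea is the same.)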
I just joined a company that is premature when it comes to DS. All code is in notebooks. No version control on data, code, or models. There is some model documentation, but these documents are quite sparse and more of a business memo than a technical document.
The good part is that my manager wants to improve the DS workflow. The devops team also wants to improve the implementations, and they are quite collaborative. The issue is the product owners.
I recently did a task for one of them, which was to update one of their models. The model lacked a proper technical document, had half-assed code on git (really just a model dump on S3), the labels were partly incorrect, there was no performance monitoring... you can guess the rest.
After working on it for a few weeks, I realized that the product owner only expects a document as the outcome of my efforts - a document exactly like their business memos, where half of the numbers are cherry-picked and the other half are either meaningless data tables or information that is downright wrong. He also insists that the graphs be preserved: graphs that lack basic attributes such as axis labels, a readable font size, or overlaying to allow easy comparison. His single argument is that "customers like it this way", which, if true, is a fairly valid argument.
I've caved in for now. Any suggestions on how to navigate the situation?
Hi guys, recently I've been collaborating on automated pricing. We built a model based on demand elasticity for e-commerce products, all statistical methods; even the price-elasticity-of-demand function assumes a linear relationship between demand and margins.
So far, it's doing "good" but not "good enough".
The thing is, the elasticity only considers the trendline of the product alone, which isn't realistic: what if that product is a complement to another product (or products), or a substitute for one?
I managed to tackle that partially with cross-elasticity, and it did give good estimations, but still...
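For readers who haven't seen it, the textbook way to get own- and cross-price elasticities in one model is a log-log regression, where each coefficient reads directly as an elasticity. A sketch with made-up data and column names (not the actual pricing pipeline):

```python
# Sketch: own- and cross-price elasticities via a log-log demand regression.
# Columns (qty_A, price_A, price_B) are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "qty_A":   [120, 95, 140, 80, 110, 130],
    "price_A": [9.9, 11.5, 8.9, 12.9, 10.5, 9.5],
    "price_B": [4.9, 5.5, 4.5, 6.0, 5.0, 4.7],   # candidate substitute/complement
})

X = sm.add_constant(np.log(df[["price_A", "price_B"]]))
model = sm.OLS(np.log(df["qty_A"]), X).fit()

# Coefficient on log(price_A) = own-price elasticity of A;
# coefficient on log(price_B) = cross-price elasticity
# (positive => substitute, negative => complement).
print(model.params)
```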
There's still too much room for improvement. My manager is planning to convert much of the model's structure to RL; in the current model we're actually mimicking the behaviour of RL, in that there's an objective function we're trying to optimize and it works with the estimated PED. But I find RL really hard to make practical. I've read in many articles and discussions that many companies tackle this problem with deep learning, because it can absorb many complex patterns in the data, so I decided to give it a shot and started learning deep learning from a hands-on deep learning book and a YouTube course.
But is it worth it?
If anyone who has worked on pricing could share some wisdom, that'd be awesome.
Thanks.
Project goal: create a 'reasonable' 30 year forecast with some core component generating variation which resembles reality.
Input data: annual US macroeconomic features such as inflation, GDP, wage growth, M2, imports, exports, etc. Features have varying ranges of availability (some going back to 1900 and others starting in the 90s).
Problem statement: Which method(s) are SOTA for this type of prediction? The recent papers I've read mention BNNs, MAGAN, and LightGBM for smaller data like this, and TFT, Prophet, and NeuralProphet for big data. I'm mainly curious whether others out there have done something similar and have special insights. My current method of extracting temporal features and using a Trend + Level blend with LightGBM works, but I don't want to miss out on better ideas - especially ones that fit into a Monte Carlo framework and include something like labeling years into probabilistic 'regimes' of boom/recession.
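For what it's worth, one simple way to bolt a Monte Carlo layer onto a LightGBM setup like this is to keep the model's point forecast and simulate many 30-year paths by recursively feeding back bootstrapped residuals. A rough sketch - the lag-1 feature set, toy data, and residual bootstrap are all illustrative, not a recommendation of a particular spec:

```python
# Sketch: recursive 30-year Monte Carlo paths around a LightGBM one-step model.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)

# Toy annual series standing in for a macro feature (e.g. cumulative GDP growth).
y = rng.normal(0.02, 0.02, size=100).cumsum()

# One-step-ahead model on a single lag; a real setup would add trend/level features.
X, target = y[:-1].reshape(-1, 1), y[1:]
model = LGBMRegressor(n_estimators=200).fit(X, target)
residuals = target - model.predict(X)

n_paths, horizon = 1000, 30
paths = np.empty((n_paths, horizon))
for i in range(n_paths):
    last = y[-1]
    for h in range(horizon):
        point = model.predict(np.array([[last]]))[0]
        last = point + rng.choice(residuals)   # bootstrapped shock
        paths[i, h] = last

# Central forecast plus uncertainty bands from the simulated paths.
print("median path (first 5 years):", np.median(paths, axis=0)[:5])
print("80% band at year 30:", np.quantile(paths[:, -1], [0.1, 0.9]))
```

Regime labels could then be layered on by switching which residual pool (boom vs. recession years) the bootstrap draws from, but that's beyond this sketch.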
I am currently two years into my first role as a data scientist in risk in finance post grad. I have been applying for my next opportunity and I have two choices as of now. I could make a lateral internal move into more of a project manager role in credit risk data analytics. This job is very safe, everything is pretty built out, and in banking things evolve a lot slower. My other option is to become a data scientist in a niche department at a new company. This is a lot riskier/more exciting, as I would be the first data scientist in the role and would essentially be building things in my own vision. Has anyone ever been the first data scientist in a department? What should I expect in a role like this? I have two years of experience - would that be enough to take on a challenge like this?
I'm thinking of doing more of these, not necessarily on politics, but on other topics. Would love to hear feedback/critique on communication, visuals, content etc.
Hey there! Does anyone here know if those sequential models like LSTMs and Transformers work for real trading? I know that stock price data usually has low autocorrelation, but I’ve seen DL courses that use that kind of data and get good results.
I am new to time series forecasting and trading, so please forgive my ignorance
I have extensive experience in experimentation, but the experiments generally came from those higher up - all I had to do was analyse them and give a thumbs up if they were worth pursuing.
However, I want to know how you conduct analysis and come up with potential experiments from existing data itself. A big part of it would be looking at pain points in customer journeys, but because the sequence of events is unique for everyone, I'm wondering how you go about this analysis and visualise it.
I am well on my way and have built a few small models, and now I am looking toward deployment but can't find any resources that help me do so.
So if anyone here can recommend any straightforward resources for model deployment, that would be great.
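In the meantime, the most common minimal path is: serialize the model, wrap it in a small web API, and containerize that. A sketch of the API step using FastAPI - the file name, feature names, and model path are placeholders, not a prescription:

```python
# serve.py - minimal model-serving sketch (FastAPI + a pickled scikit-learn model).
# "model.joblib" and the feature names are placeholders for whatever you trained.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # load once at startup, not per request


class Features(BaseModel):
    age: float
    income: float


@app.post("/predict")
def predict(features: Features):
    X = pd.DataFrame([features.dict()])
    return {"prediction": float(model.predict(X)[0])}

# Run locally with:  uvicorn serve:app --reload
# then POST JSON like {"age": 35, "income": 52000} to http://127.0.0.1:8000/predict
```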
I have been looking for months for this formula and an explanation for it and I can’t wrap my head around the math. Basically my problem is
https://matheusfacure.github.io/python-causality-handbook/02-Randomised-Experiments.html (the source I'm trying to figure it out from - I also checked the GitHub issues and still don't get it) and https://clas.ucdenver.edu/marcelo-perraillon/sites/default/files/attached-files/week_3_causal_0.pdf (professor lectures)
I don't get what's going on.
This is a blocker for me before I can understand anything further. I am genuinely trying to understand it and apply it in my job, but I can't seem to get the whole estimation part.
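For reference, and at the risk of guessing which formula is meant, the identity that chapter builds everything on is the decomposition of the naive difference in means (sketched here in potential-outcomes notation):

```latex
% Naive comparison of means, decomposed (Y_1, Y_0 are potential outcomes, T is treatment):
E[Y \mid T=1] - E[Y \mid T=0]
    = \underbrace{E[Y_1 - Y_0 \mid T=1]}_{\text{ATT}}
    + \underbrace{E[Y_0 \mid T=1] - E[Y_0 \mid T=0]}_{\text{bias}}

% Under randomisation, (Y_0, Y_1) \perp T, so the bias term vanishes and
E[Y \mid T=1] - E[Y \mid T=0] = E[Y_1 - Y_0] = \text{ATE}
```

Under random assignment the treated and untreated groups are exchangeable, so the bias term is zero and the simple difference in group means estimates the ATE; everything harder (PSM, meta-learners, etc.) exists to kill that bias term when you don't have randomisation.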
I have seen cases where a data scientist would say that causal inference problems are basically predictive modeling problems: you think of DAGs for feature selection, and the feature importance/contribution is basically the causal estimate of the outcome - nothing mentioned about experimental design or methods like PSM or meta-learners. So from the looks of it, everyone has their own understanding of this, some of which is objectively wrong, and for the rest I'm not sure exactly why it's so inconsistent.
How can the insight be ethical and properly validated? Predictive modeling is very well established, but I am struggling to see that level of maturity in the causal inference sphere. I am specifically talking about model fairness and racial bias, as well as things like sensitivity and error analysis.
Can someone with experience help clear this up? Maybe I'm overthinking this, but there is typically a level of scrutiny in our work in a regulated field, so how do people actually work under high levels of scrutiny?
Create unlimited AI wallpapers using a single prompt with Stable Diffusion on Google Colab. The wallpaper generator:
Check the demo here: https://youtu.be/1i_vciE8Pug?si=NwXMM372pTo7LgIA
I'm a data analyst. I had a business idea that is pretty much a tool to help students study better: an LLM trained on the past exams of specific schools. The idea is to have a tool that aids students by giving them questions and helping them solve each one if necessary. If the student gives a wrong answer, the tool would point out what was wrong and teach them the right way to solve that question.
However, I have no idea where to start - there's just so much info out there about the matter. None of the data scientists I know work with LLMs, so they couldn't help me with this.
What should I study to make the idea mentioned above come to life?
Edit: I expressed myself poorly in the text. I meant I wanted to develop a tool instead of a whole LLM from scratch. Sorry for that :)
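One common shape for such a tool (a sketch, not a recommendation) is retrieval plus prompting rather than any model training: store the past exam material, retrieve the relevant bits, and have an existing model grade and explain the student's answer. The `call_llm` helper below is a hypothetical placeholder for whichever API or local model ends up being used:

```python
# Sketch of the retrieval + prompting loop for an exam-tutoring tool.
# `call_llm` is a hypothetical placeholder for the chosen LLM API or local model.
from typing import List


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM provider or local model here")


def retrieve_similar_questions(student_answer: str, exam_bank: List[str], k: int = 3) -> List[str]:
    # Placeholder retrieval: a real version would use embeddings + a vector index.
    return exam_bank[:k]


def grade_answer(question: str, student_answer: str, exam_bank: List[str]) -> str:
    context = "\n".join(retrieve_similar_questions(student_answer, exam_bank))
    prompt = (
        "You are a tutor. Using these past exam questions and solutions as reference:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        f"Student answer: {student_answer}\n"
        "Say whether the answer is correct; if not, explain the mistake and walk "
        "through the correct solution step by step."
    )
    return call_llm(prompt)
```

The pieces worth studying for this kind of build are prompt design, embeddings/vector search for the retrieval step, and evaluation - not training an LLM from scratch.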
I was having an argument with my colleague regarding this. We know that data leakage becomes a problem when the training data has a peek into the test data before testing phase. But is it really a problem if the reverse happens?
I'll change our exact use case for privacy reasons, but basically, let's say I am predicting whether a cab driver will accept a ride request. Some of the features we are using for this are based on the driver's historical data across all of his rides (like his overall acceptance rate). Now, for the training dataset, I am obviously calculating the driver's history over the training data only. However, for the test dataset, I have computed the driver-history features over the entire dataset. The reason is that each driver's historical data would also be available at inference time in prod. Also, a lot of drivers won't have any historical data if we calculate it just on the test set. Note that my train/test split is time-based: the entire test set lies in the future relative to the train set.
My colleague argues that this is wrong and is still data leakage, but I don't agree.
What would be your views on this?
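For readers trying to picture the setup, here is roughly what a per-driver historical acceptance-rate feature looks like when computed point-in-time, i.e. each row only sees rides strictly before its own timestamp (all column names are made up):

```python
# Sketch: point-in-time historical acceptance rate per driver (illustrative columns).
# Each row only sees rides that happened before it, so the same code applies to
# train rows, test rows, and production inference alike.
import pandas as pd

rides = pd.DataFrame({
    "driver_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-07",
                          "2024-01-02", "2024-01-05"]),
    "accepted": [1, 0, 1, 1, 1],
}).sort_values(["driver_id", "ts"])

# shift(1) drops the current ride, expanding().mean() averages everything before it.
rides["hist_accept_rate"] = (
    rides.groupby("driver_id")["accepted"]
         .transform(lambda s: s.shift(1).expanding().mean())
)
print(rides)
```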
With experimentation being a major focus at a lot of tech companies, there is a demand for understanding the causal effect of interventions.
Traditional causal inference techniques have been used quite a bit - propensity score matching, diff-in-diff, instrumental variables, etc. - but these are generally harder to implement in practice with modern datasets.
A lot of the traditional causal inference techniques are grounded in regression, and while regression is great, in modern datasets the functional forms are more complicated than a linear model, or even a linear model with interactions.
Failing to capture the true functional form can result in bias in causal effect estimates. Hence, one would be interested in finding a way to accurately do this with more complicated machine learning algorithms which can capture the complex functional forms in large datasets.
This is the exact goal of double/debiased ML:
https://economics.mit.edu/sites/default/files/2022-08/2017.01%20Double%20DeBiased.pdf
We consider the average treatment effect estimation problem as a two-step prediction problem. Using very flexible machine learning methods in those steps can help identify the target parameters more accurately.
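In code, the partialling-out version of this (for the partially linear model, with cross-fitting done via out-of-fold predictions) looks roughly like the sketch below; the simulated data and the choice of learners are illustrative only:

```python
# Sketch: double/debiased ML for a partially linear model.
# Step 1: predict Y from X and T from X with flexible learners (out-of-fold).
# Step 2: regress the Y-residuals on the T-residuals; that slope is the causal effect.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
T = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)             # confounded treatment
Y = 2.0 * T + np.cos(X[:, 0]) + rng.normal(scale=0.5, size=n)   # true effect = 2.0

# Cross-fitted nuisance predictions (each row is predicted by models that never saw it).
y_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
t_hat = cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)

# Final stage: residual-on-residual regression recovers the treatment effect.
theta = LinearRegression().fit((T - t_hat).reshape(-1, 1), Y - y_hat)
print("estimated effect:", theta.coef_[0])   # should come out close to 2.0
```

Libraries like DoubleML and econml wrap this up properly (standard errors, other estimands), but the two prediction steps plus the residual-on-residual regression are the core of the idea.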
This idea has been extended to biostatistics, where there is the idea of finding causal effects of drugs. This is done using targeted maximum likelihood estimation.
My question is: how much has double ML gotten adoption in data science? How often are you guys using it?
It's really interesting to see how DS is becoming more and more valuable in our world - and here is Cooley with a huge offer for a DS Director.
Job offer: Director of Data Science (in US) - Salary Range: $325,000 - $400,000
Found it here: https://jobs-in-data.com/
I'm currently managing a churn prediction model (XGBoost) that I retrain every month using Jupyter Notebook. I've gotten by with my own naming routines:
model_20241001_data_20240901.pkl
As the project grows, it's becoming increasingly difficult to track different permutations of columns, model types, and data ranges, and to know which version I'm iterating on.
I've found options on the internet like MLflow, W&B, ClearML, and other vendor solutions, but as a one-person team, paid solutions seem like overkill. I'm struggling to find good discussions or a general consensus on this. How do you all handle this?
Edit:
I'm seeing a consensus around MLflow for logging and tracking. But to trigger experiments or run through a model grid with different features/configurations, do I need to combine it with other orchestration tools like Kubeflow / Prefect / Metaflow?
Just adding some more context:
My data is currently sitting in GCP BigQuery tables. I'm currently just training on Vertex AI jupyter lab. I know GCP will recommend us to use Vertex AI Model Registry, Vertex Experiments, but they seem overkill and expensive for my use.
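On the MLflow-is-overkill worry specifically: it can run entirely locally with a file-backed store, no server and nothing paid, so a monthly retrain just gains a few logging calls. A sketch of what that looks like - a plain scikit-learn model and fake data are swapped in to keep it self-contained, and the parameter/metric names are placeholders:

```python
# Sketch: local, no-server MLflow tracking for a monthly retrain (names illustrative).
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_tracking_uri("file:./mlruns")          # plain local folder, no server needed
mlflow.set_experiment("churn-monthly-retrain")

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the BigQuery pull

with mlflow.start_run(run_name="data_20240901"):
    params = {"n_estimators": 200, "learning_rate": 0.05}
    mlflow.log_params({**params, "data_cutoff": "2024-09-01", "feature_set": "v3"})

    model = GradientBoostingClassifier(**params).fit(X, y)
    mlflow.log_metric("train_auc", roc_auc_score(y, model.predict_proba(X)[:, 1]))

    # Replaces the dated .pkl files: the artifact, params, and metrics live together.
    mlflow.sklearn.log_model(model, "model")

# `mlflow ui` then gives a browsable history of every run, which answers the
# "which version am I iterating on" problem without any paid service.
```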
As a data scientist with a software engineering background, I sometimes struggle to connect my technical skills with business needs. I find myself questioning my statistical knowledge and my ability to truly solve problems from a business perspective. It seems I lack the intuition to work smart and achieve business outcomes, especially when it comes to the customer churn analysis/prediction project I'm working on now.
Any advice on how to overcome this imposter syndrome and bridge the gap?
Right now, a lot of the buzz in generative AI is around AI agents, and Claude 3.5 Sonnet was recently said to be trained on agentic flows. This video explains what agents are, how they differ from LLMs, how agents access tools and execute tasks, and the potential threats: https://youtu.be/LzAKjKe6Dp0?si=dPVJSenGJwO8M9W6
We just hired on a new lead DS who has mentioned Miller's 2014 or 2015 text several times. I know I can get a copy cheaply, but it's a decade old. What would you recommend as an updated version of it?
Been a DS for 5+ years, working on some ideas around improving how insights get delivered/consumed across orgs. Would love to hear your war stories:
Feel free to comment or DM to chat more in-depth.
For context: I'm a former Meta/FB DS - I worked on FAIR language, Instagram, Reality Labs, and Election Integrity teams. Now I'm exploring solutions to the problems I kept seeing.
I've never used it myself, but from what I understand about it I can't think of a situation where it would realistically be useful. It's a feature engineering technique that reduces many features down into a smaller space with supposedly much less covariance. But in ML models this doesn't seem very useful to me, because:
What are others’ thoughts on this? Maybe it could be useful for real time or edge models if it needs super fast inference and therefore a small feature space?
Hello, please let me know the best way to learn LLMs - preferably fast, but if that's not possible it doesn't matter. I already have some experience in ML and DL but don't know how or where to start with LLMs. I don't consider myself an expert in the subject, but I'm not a beginner per se either.
Please let me know if you recommend some courses, tutorials or info regarding the subject and thanks in advance. Any good resource would help as well.
Hi!
I'm starting to look for a job in the UK, and LinkedIn is a mess. Searching for ‘Data Science’ or ‘Data Scientist’ shows maybe 5% related jobs; the rest are analyst, engineer, etc. roles.
Any advice on a better job platform for IT jobs? If it's UK-focused, even better.
Thanks in advance!
Cheers!
I have a Data Science supervisory position that just opened on my growing team. You would manage 5-7 people who do a variety of analytic projects, from a machine learning model to data wrangling to descriptive statistics work that involves a heavy amount of policy research/understanding. This is a federal government job in the anti-fraud arena.
The position can be located in various parts of the country (specifics are in the posting). Due to agency policy, if you're located in Woodlawn, MD or DC, you would be required to report to the office 3 days a week. Other locations are currently at 100% telework.
If interested, you apply through this USAJOBS link: https://www.usajobs.gov/job/816105500
Unless I'm missing something obvious, I see lots of template repos for python packages, but not much out there for the more typical data science grunt work.
My ideal template has all the nice poetry/conda/pre-commit setup etc., but isn't broken into src/ and tests/.
Rather, because I work in consulting, my ideal template would be structured along the lines of:
Here are a couple of examples of the kinds of python package repos I'm talking about:
What do you guys use? TIA!
OpenAI recently released Swarm, a framework for multi-AI-agent systems. The following playlist covers:
Playlist : https://youtube.com/playlist?list=PLnH2pfPCPZsIVveU2YeC-Z8la7l4AwRhC&si=DZ1TrrEnp6Xir971