/r/datascience


A space for data science professionals to engage in discussions and debates on the subject of data science.

/r/datascience

2,295,197 Subscribers

1

Learning Power BI practically

I know Python, I know SQL, but every time I dive into Power BI to learn it I'm overwhelmed with options and varying levels of things to learn.

I've done more than a few workshops and even mocked up visuals via Python, Excel, and other tools, which a data engineer then translated into Power BI dashboards. But I'm not sure how to practically learn it myself.

I suspect that I need to get a firm understanding of the modeling piece prior to visualization, but looking for practical tips or primary concepts to focus on in an applied manner.

0 Comments
2024/11/03
00:18 UTC

17

Which licenses should I get?

I'm heading into the last months of my master's degree and I still have access to thousands of educational licenses. I was wondering which tools you'd recommend I grab access to, at least for a while.

Most of my job is done using open-source stuff like Python, Streamlit, Airflow... so that's why I'm asking for help.

Btw, I already have the JetBrains IDEs but I honestly prefer VS Code.

Any thoughts?

Thanks!

9 Comments
2024/11/02
18:19 UTC

18

Need to make a dashboard using Python for the team, but no means to deploy it. What are my options?

I want to create a dashboard for my team but I don’t have any means to deploy my dashboard within the team’s infrastructure. I use Python daily so have been looking into libraries that support easy sharing of the dashboard.

So far Dash seems promising, and I did create a demo app that renders well, but the problem is it's a localhost link and I don't know how I'll share it with my team. Another option is to make a bunch of Plotly plots and turn them into HTML using Jupyter notebooks, but I think that will lack some of the interactivity I'm seeking.

What other options do I have? I tried Panel, but it's not installed in the Jupyter environment and I'm not allowed to install new libraries.

Edit: It’s very ad hoc. Only needs to be refreshed once a quarter.

37 Comments
2024/11/02
15:01 UTC

150

Dumb question, but confused

Dumb question, but the relationship between x and y (not including the additional data points at y == 850) shows no correlation, right? Even though they are both Gaussian?

Thanks, feel very dumb rn
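For what it's worth, being Gaussian says nothing about correlation: two independently drawn Gaussian variables have correlation near zero, which is easy to check numerically:

```python
# Sketch: two independent Gaussians are uncorrelated -- the marginal
# distributions say nothing about the relationship between x and y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)
y = rng.normal(0, 1, 100_000)  # drawn independently of x
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 0
```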

78 Comments
2024/11/02
13:44 UTC

1

Oasis : open-sourced model to generate playable video games

Oasis, by Decart and Etched, has been released. It can output playable video games in which the user can perform actions like move, jump, check inventory, etc. This is unlike GameNGen by Google, which can only output gameplay videos (that can't be played). Check the demo and other details here: https://youtu.be/INsEs1sve9k

0 Comments
2024/11/02
09:52 UTC

9

Neural Network Learning - Inner Layer Visualization

4 Comments
2024/11/02
09:40 UTC

22

How often do you make mistakes at work?

How often do stakeholders point out issues in your calculations? Honestly, such cases make me question my proficiency

19 Comments
2024/11/02
07:59 UTC

66

Is there any industry you would never want to work in? If so, which one?

I haven't worked in the advertising industry, but I have read about some not-so-good experiences there.

140 Comments
2024/11/02
07:40 UTC

76

What Do Interviewers Look for in Data Analysis Projects for Aspiring Data Scientists?

I’m planning to start a new data analysis project and want to ensure it aligns with what interviewers look for. Can anyone share insights on what really stands out to hiring managers in data analysis projects? Any advice on key elements or focus areas that would make a project more impressive during interviews would be much appreciated!

20 Comments
2024/11/02
02:35 UTC

4

Data / analytics engineering resources (online courses ideally) for data scientists to learn good practices?

I work at a company where the data engineering team is new and quite junior - mostly focused on simple ingestion and implementing whatever logic our (also often junior) data scientists give them. Data scientists also write up the orchestration, like how to process a real-time streaming pipeline for their metric construction and models. So, we have a lot of messy code the data scientists put together that can be inefficient.

As the most senior person on my team, I've been tasked with taking on more of a lead in teaching the team best practices related to data engineering - simple things like good approaches for backfilling, modularizing queries and query efficiency, DAG construction and monitoring, etc. While I've picked up a lot from experience, I'm curious to learn more "proper" ways to approach some of these problems.

What are some good and practical data/analytics engineering resources you've used? I saw dbt has interesting documentation on best practices for analytics engineering in the context of their product, but I'm looking for other sources.

4 Comments
2024/11/01
23:55 UTC

50

How does a random forest make predictions on “unseen” data

I think I have a fairly solid grasp now of what a random forest is and how it works in practice, but I am still unsure as to exactly how a random forest makes predictions on data it hasn’t seen before. Let me explain what I mean.

When you fit something like a logistic regression model, you train/fit it (I.e. find the model coefficients which minimise prediction error) on some data, and evaluate how that model performs using those coefficients on unseen data.

When you do this for a decision tree, a similar logic applies, except instead of finding coefficients, you’re finding “splits” which likewise minimise some error. You could then evaluate the performance of this tree “using” those splits on unseen data.

Now, a random forest is a collection of decision trees, and each tree is trained on a bootstrapped sample of the data with a random set of predictors considered at the splits. Say you want to train 1000 trees for your forest. Sampling dictates a scenario where for a single datapoint (row of data), you could have it appear in 300/1000 trees. And for 297/300 of those trees, it predicts (1), and for the other 3/300 it predicts (0). So the overall prediction would be a 1. Same logic follows for a regression problem except it’d be taking the arithmetic mean.

But what I can't grasp is how you'd then use this to predict on unseen data. What values did I obtain from fitting the random forest model, i.e. what splits is the random forest using? Is it some sort of average of the splits across all the trees trained in the model?

Or am I missing the point, i.e. is a new data point actually put through all 1000 trees of the forest?
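It's the latter: a new point really is run through every tree, and there is no averaged set of splits. This can be verified with scikit-learn, where the forest's predicted probability is exactly the mean of the per-tree probabilities (for regression it would be the mean of per-tree predictions):

```python
# Sketch: the forest stores all the trees; prediction routes the new point
# through each tree and averages the per-tree class probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

x_new = X[:1]  # stand-in for an unseen data point
per_tree = np.stack([t.predict_proba(x_new) for t in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.predict_proba(x_new)))  # True
```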

17 Comments
2024/11/01
18:24 UTC

30

A delicate situation

I just joined a company that is immature when it comes to DS. All code is in notebooks. No version control on data, code, or models. There is some model documentation, but it's quite sparse and more of a business memo than a technical document.

The good part is that my manager wants to improve the DS workflow. The devops team also wants to improve the implementations and they are quite collaborative. The issue is the product owners.

I recently did a task for one of them: updating one of their models. The model lacked a proper technical document, the code on git was half-assed (just a model dump on S3), even the labels were partly incorrect, there was no performance monitoring... you can guess the rest.

After working on it for a few weeks, I realized that the product owner only expects a document as the outcome of my efforts. A document exactly like their business memos, where half the numbers are cherry-picked and the other half are either meaningless data tables or other info that is downright wrong. He also insists that the graphs be preserved: graphs that lack basic attributes such as axis labels, a proper font size, or overlaying to allow easy comparison. His single argument is that "customers like it this way," which, if true, is quite a valid argument.

I caved in for now. Any suggestions on how to navigate the situation?

22 Comments
2024/11/01
03:42 UTC

41

You do pricing using....

Hi guys, recently I've been collaborating on automated pricing. We built a model based on demand elasticity for e-commerce products, all statistical methods; even the price elasticity of demand function assumes a linear relationship between demand and margins.

So far, it's doing "good" but not "good enough".

The thing is, the elasticity considers the trendline of the product alone, which is not the whole picture: what if that product is a complement to another product, or a substitute?

I managed to tackle that partially with cross-elasticity, it did give good estimations, but still..

There's too much room for improvement. My manager is planning to convert a lot of the model's structure to RL; in the current model we're actually mimicking the behaviour of RL, in that there's an achievement function we're trying to optimize and it works with the estimated PED. But I find RL really hard to make practical. I've also read in many articles and discussions that many companies tackle this problem with deep learning, because it can absorb many complex patterns in the data, so I decided to give it a shot and started learning deep learning from a hands-on deep learning book and a YouTube course.
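For readers unfamiliar with the setup, the elasticity estimation described above is usually done as a log-log regression, where the slope of ln(Q) on ln(P) is the price elasticity of demand. A minimal sketch on synthetic data with a true elasticity of -1.5 (all numbers illustrative):

```python
# Sketch of the standard log-log elasticity regression:
# ln(Q) = a + b * ln(P), where the slope b is the price elasticity.
import numpy as np

rng = np.random.default_rng(1)
price = rng.uniform(5, 50, 1_000)
# demand generated with a true elasticity of -1.5 plus multiplicative noise
demand = 1000 * price ** -1.5 * rng.lognormal(0, 0.1, 1_000)

b, a = np.polyfit(np.log(price), np.log(demand), 1)
print(round(b, 2))  # recovers ~ -1.5
```

Cross-elasticity extends the same regression by adding the log prices of complement/substitute products as extra regressors.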

But is it worth it?

If anyone worked on pricing would share wisdom that'd be awesome.

Thanks.

14 Comments
2024/11/01
00:44 UTC

9

Multi-step multivariate time-series macroeconomic forecasting - What's SOTA for 30 year forecasts?

Project goal: create a 'reasonable' 30 year forecast with some core component generating variation which resembles reality.

Input data: annual US macroeconomic features such as inflation, GDP, wage growth, M2, imports, exports, etc. Features have varying ranges of availability (some going back to 1900 and others starting in the 90s).

Problem statement: Which method(s) is SOTA for this type of prediction? The recent papers I've read mention BNNs, MAGAN, and LightGBM for smaller data like this and TFT, Prophet, and NeuralProphet for big data. I'm mainly curious if others out there have done something similar and have special insights. My current method of extracting temporal features and using a Trend + Level blend with LightGBM works, but I don't want to be missing out on better ideas--especially ones that fit into a Monte Carlo framework and include something like labeling years into probabilistic 'regimes' of boom/recession.
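The Monte Carlo framework with boom/recession regimes mentioned at the end could be sketched as a two-state Markov chain over annual growth draws. The transition probabilities and growth parameters below are illustrative guesses, not calibrated values:

```python
# Sketch: simulate many 30-year growth paths under a two-state
# boom/recession regime-switching process.
import numpy as np

rng = np.random.default_rng(42)
n_paths, horizon = 1_000, 30
# regime 0 = boom, 1 = recession; rows are transition probabilities
trans = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
mu = np.array([0.03, -0.01])    # mean annual growth per regime
sigma = np.array([0.01, 0.02])  # growth volatility per regime

growth = np.empty((n_paths, horizon))
state = np.zeros(n_paths, dtype=int)  # every path starts in a boom
for t in range(horizon):
    growth[:, t] = rng.normal(mu[state], sigma[state])
    # draw next year's regime from the current row of the transition matrix
    state = (rng.random(n_paths) < trans[state, 1]).astype(int)

print(growth.mean())  # long-run average growth across all simulated paths
```

A point forecast (e.g. the LightGBM trend + level blend) can then be perturbed by these simulated regime paths to get a distribution of 30-year outcomes.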

23 Comments
2024/10/31
16:11 UTC

11

first person in this role

I am currently two years into my first role as a data scientist in risk in finance, post grad. I have been applying for my next opportunity and I have two choices as of now. I could make a lateral internal move into more of a project manager role in credit risk data analytics. This job is very safe, everything is pretty built out, and in banking things evolve a lot slower. My other option is to become a data scientist in a niche department at a new company. This is a lot riskier/more exciting, as I would be the first data scientist in this role and I would be building things in my vision, essentially. Has anyone ever been the first data scientist in a department? What should I expect in a role like this? I have two years of experience; would that be enough to take on a challenge like this?

14 Comments
2024/10/31
15:51 UTC

0

I put together an analysis and forecast for the 2024 election. Let me know what you think.

I'm thinking of doing more of these, not necessarily on politics, but on other topics. Would love to hear feedback/critique on communication, visuals, content etc.

https://youtu.be/kFDkvrICM48?si=FbXzoepCNSCUz0wi

51 Comments
2024/10/31
12:47 UTC

18

Do sequential models actually work for trading?

Hey there! Does anyone here know if sequential models like LSTMs and Transformers work for real trading? I know that stock price data usually has low autocorrelation, but I've seen DL courses that use that kind of data and get good results.

I am new to time series forecasting and trading, so please forgive my ignorance
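The low-autocorrelation point is easy to check yourself: measure the lag-1 autocorrelation of log returns. The sketch below uses a simulated random-walk price series as a stand-in; real price data would slot in the same way:

```python
# Sketch: returns on a random-walk price series have near-zero lag-1
# autocorrelation -- the property that makes sequence models hard to
# apply profitably to raw prices.
import numpy as np

rng = np.random.default_rng(7)
# geometric random walk: exponentiated cumulative sum of small shocks
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 5_000)))
returns = np.diff(np.log(prices))
r1 = np.corrcoef(returns[:-1], returns[1:])[0, 1]
print(round(r1, 3))  # near 0 for a random walk
```

Courses often show good-looking results by predicting the (highly autocorrelated) price level rather than returns, which is usually not tradable.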

8 Comments
2024/10/31
07:15 UTC

18

Coming up with Potential Experiments and Customer Journey Mapping

I have extensive experience in experimentation, but the experiments generally came from higher up - all I had to do was analyse them and give a thumbs up if they were worth pursuing.

However, I want to know how you conduct analysis and come up with potential experiments from existing data itself. A big part of it would be looking at pain points in customer journeys, but because the sequence of events is unique for everyone, I'm wondering how you go about this analysis and visualise it.

13 Comments
2024/10/30
18:06 UTC

14

I need some help on how to deploy my models

I have built a few small models, and now I am looking to deploy them, but I can't find any resources that show me how.

So if anyone here can recommend straightforward resources for model deployment, I'd appreciate it.

29 Comments
2024/10/30
16:50 UTC

24

How can one explain the ATE formula for causal inference?

I have been looking for months for this formula and an explanation for it and I can’t wrap my head around the math. Basically my problem is

  1. Every person uses different terminology; it's actually confusing.
  2. I saw a professor's lectures out there where the formula is not the same as the ATE formula from

https://matheusfacure.github.io/python-causality-handbook/02-Randomised-Experiments.html (the source I'm using to figure it out; I also checked the GitHub issues and still don't get it) & https://clas.ucdenver.edu/marcelo-perraillon/sites/default/files/attached-files/week_3_causal_0.pdf (the professor's lectures)

I don't get what's going on.

This is like a blocker for me before i understand anything further. I am trying to genuinely understand it and try to apply it in my job but I can’t seem to get the whole estimation part.

  1. I have seen cases where a data scientist says causal inference problems are basically predictive modeling problems: they think of DAGs for feature selection, and feature importance/contribution becomes the causal estimate of the outcome. Nothing is mentioned about experimental design or methods like PSM or meta-learners. So from the looks of it, everyone has their own understanding of this, some of which is objectively wrong and some of which I can't explain why it's inconsistent.

  2. How can the insight be ethical and properly validated? Predictive modeling is very well established, but I'm struggling to see that level of maturity in the causal inference sphere. I'm specifically talking about model fairness and racial bias, as well as things like sensitivity and error analysis.

Can someone with experience help clear this up? Maybe I'm overthinking this, but there is typically a level of scrutiny in our work in a regulated field, so how do people actually work under high levels of scrutiny?
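For what it's worth, in the randomized-experiment setting the linked handbook chapter covers, the ATE estimator reduces to a simple difference in group means: ATE = E[Y|T=1] - E[Y|T=0], which is unbiased because randomization makes the groups comparable. A minimal sketch on synthetic data with a known true effect of 2.0 (all names illustrative):

```python
# Sketch: under random assignment, the difference-in-means estimator
# recovers the average treatment effect.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
t = rng.integers(0, 2, n)              # randomized 0/1 treatment
y = 5 + 2.0 * t + rng.normal(0, 1, n)  # outcome with true ATE = 2.0

ate = y[t == 1].mean() - y[t == 0].mean()
print(round(ate, 2))  # ~ 2.0
```

The confusion across sources usually comes from the non-randomized case, where this simple difference is biased and the extra machinery (PSM, meta-learners, etc.) exists to correct it.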

12 Comments
2024/10/30
11:19 UTC

0

I created an unlimited AI wallpaper generator using Stable Diffusion

Create unlimited AI wallpapers using a single prompt with Stable Diffusion on Google Colab. The wallpaper generator :

  1. Can generate both desktop and mobile wallpapers
  2. Uses the free tier of Google Colab
  3. Generates about 100 wallpapers per hour
  4. Can generate on any theme
  5. Creates a zip for downloading

Check the demo here : https://youtu.be/1i_vciE8Pug?si=NwXMM372pTo7LgIA

6 Comments
2024/10/30
04:32 UTC

0

Studying how to develop an LLM. Where/How to start?

I'm a data analyst. I had a business idea: essentially a tool to help students study better, an LLM trained on the past exams of specific schools. The idea is a tool that would help students by giving them questions and helping them solve each one if necessary. If a student gives a wrong answer, the tool would point out what was wrong and teach them the right way to solve that question.

However, I have no idea where to start. There's just so much info out there on the topic that I really don't know. None of the data scientists I know work with LLMs, so they couldn't help me with this.

What should I study to make the idea mentioned above come to life?

Edit: I expressed myself poorly in the text. I meant I wanted to develop a tool instead of a whole LLM from scratch. Sorry for that :)

30 Comments
2024/10/29
22:08 UTC

0

Can data leak from the training set to the test set?

I was having an argument with my colleague about this. We know that data leakage becomes a problem when the training data gets a peek at the test data before the testing phase. But is it really a problem if the reverse happens?

I'll change our exact use case for privacy reasons, but basically let's say I am predicting whether a cab driver will accept a ride request. Some of the features we are using for this are the driver's historical data over all of his rides (like his overall acceptance rate). Now, for the training dataset, I am obviously calculating the driver's history over the training data only. However, for the test dataset, I have computed the driver history features over the entire dataset. The reason is that each driver's historical data would also be available at inference time in prod. Also, a lot of drivers won't have any historical data if we calculate it just on the test set. Note that my train/test split is time based: the entire test set lies in the future relative to the train set.

My colleague argues that this is still data leakage, but I don't agree.

What would be your views on this?

43 Comments
2024/10/29
18:48 UTC

48

Double Machine Learning in Data Science

With experimentation being a major focus at a lot of tech companies, there is a demand for understanding the causal effect of interventions.

Traditional causal inference techniques (propensity score matching, difference-in-differences, instrumental variables, etc.) have been used quite a bit, but these are generally harder to implement in practice with modern datasets.

A lot of the traditional causal inference techniques are grounded in regression, and while regression is great, in modern datasets the functional forms are more complicated than a linear model, or even a linear model with interactions.

Failing to capture the true functional form can result in bias in causal effect estimates. Hence, one would be interested in finding a way to accurately do this with more complicated machine learning algorithms which can capture the complex functional forms in large datasets.

This is the exact goal of double/debiased ML

https://economics.mit.edu/sites/default/files/2022-08/2017.01%20Double%20DeBiased.pdf

We consider the average treatment effect estimation problem as a two-step prediction problem. Using very flexible machine learning methods can help identify target parameters with more accuracy.

This idea has been extended to biostatistics, where there is the idea of finding causal effects of drugs. This is done using targeted maximum likelihood estimation.

My question is: how much has double ML gotten adoption in data science? How often are you guys using it?
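For readers who want the mechanics, here is a minimal partialling-out sketch of double ML on synthetic data with a known effect (theta = 2.0). This is a simplified illustration of the paper's idea, not its full cross-fitting estimator; the nuisance functions are fit with random forests via `cross_val_predict`, then theta is recovered by regressing outcome residuals on treatment residuals:

```python
# Sketch of double/debiased ML via partialling out:
# 1) predict T from X and Y from X with flexible ML (cross-fitted),
# 2) regress the Y residuals on the T residuals to estimate theta.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 3_000
X = rng.normal(size=(n, 3))
treat = X[:, 0] ** 2 + rng.normal(size=n)          # nonlinear confounding
y = 2.0 * treat + np.sin(X[:, 1]) + rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
t_res = treat - cross_val_predict(rf, X, treat, cv=2)  # residualize T on X
y_res = y - cross_val_predict(rf, X, y, cv=2)          # residualize Y on X
theta = (t_res @ y_res) / (t_res @ t_res)              # final-stage OLS
print(round(theta, 2))  # ~ 2.0
```

A plain linear regression of y on treat and X would be biased here because the confounding runs through X[:, 0] squared, which is exactly the kind of functional form the post says flexible learners can capture.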

97 Comments
2024/10/29
17:11 UTC

0

"Where Innovation Meets the Law" - Cooley is offering up to $400k for a new Director of Data Science

It's really interesting to see how DS is becoming more and more valuable in our world, and here is Cooley with a huge offer for a DS Director.

Job offer: Director of Data Science (in US) - Salary Range: $325,000 - $400,000

Found it here: https://jobs-in-data.com/

6 Comments
2024/10/29
10:48 UTC

30

How do you log and iterate on your experiments / models?

I'm currently managing a churn prediction model (XGBoost) that I retrain every month using Jupyter Notebook. I've gotten by with my own routines:

  • Naming conventions (e.g., model_20241001_data_20240901.pkl)
  • Logging experiment info in CSV/Excel (e.g., columns used, column crosses, data window, sample method)
  • Storing validation/test metrics in another CSV

As the project grows, it's becoming increasingly difficult to track different permutations of columns, model types, and data ranges, and to know which version I'm iterating on.

I've found options online like MLflow, W&B, ClearML, and other vendor solutions, but as a one-person team, paid solutions seem like overkill. I'm struggling to find good discussions or a general consensus on this. How do you all handle it?

Edit:
I'm seeing a consensus around MLflow for logging and tracking. But to trigger experiments or run through a model grid with different features/configurations, do I need to combine it with orchestration tools like Kubeflow / Prefect / Metaflow?

Just adding some more context:

My data currently sits in GCP BigQuery tables, and I'm training in a Vertex AI JupyterLab instance. I know GCP would recommend Vertex AI Model Registry and Vertex AI Experiments, but they seem overkill and expensive for my use.
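For a one-person setup, even a stdlib-only JSON-lines log can be a middle ground between hand-named CSVs and a full MLflow deployment. A minimal sketch (file name and field names illustrative):

```python
# Sketch: append one JSON line per training run -- greppable, diffable,
# and trivially loaded back into pandas with pd.read_json(lines=True).
import json
from datetime import datetime, timezone

def log_run(params, metrics, path="runs.jsonl"):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "params": params,   # e.g. columns used, data window, sample method
        "metrics": metrics, # e.g. validation/test scores
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_run({"model": "xgboost", "data_window": "2024-09"},
        {"auc": 0.81, "logloss": 0.43})
```

This replaces the separate params-CSV and metrics-CSV with a single append-only file per project, while keeping the pickle naming convention for the artifacts themselves.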

34 Comments
2024/10/29
08:37 UTC

30

Dealing with Imposter Syndrome

As a data scientist with a software engineering background, I sometimes struggle to connect my technical skills with business needs. I find myself questioning my statistical knowledge and my ability to truly solve problems from a business perspective. It feels like I lack the intuition to work smart and achieve business outcomes, especially on the customer churn analysis/prediction project I'm working on now.

Any advice on how to overcome this imposter syndrome and bridge the gap?

42 Comments
2024/10/29
04:39 UTC

0

What are AI Agents ? explained in detail

Right now there's a lot of buzz around AI Agents in generative AI; recently Claude 3.5 Sonnet was said to be trained on agentic flows. This video explains what agents are, how they differ from LLMs, how agents access tools and execute tasks, and potential threats: https://youtu.be/LzAKjKe6Dp0?si=dPVJSenGJwO8M9W6

1 Comment
2024/10/29
04:09 UTC

23

Updated book to follow Miller's "Modeling for Predictive Analytics"

We just hired a new lead DS who has mentioned Miller's 2014 or 2015 text several times. I know I can get a copy cheaply, but it's a decade old. What would you recommend as an updated version of it?

7 Comments
2024/10/28
22:14 UTC
