
A space for data science professionals to engage in discussions and debates on the subject of data science.


1,861,962 Subscribers


Handling 1-10 scale survey questions in regression

I am currently analyzing surveys to predict product launch success. We track several products in the same industry for different clients. The survey question responses are coded between 1-10. For example: "On a scale from 1 - 10..."

  • "... how familiar are you with the product?"
  • "... how accessible is the product in your local market?"
  • "... how advanced is the product relative to alternatives?"

'Product launch success' is defined as a ratio of current market share relative to estimated peak market share expected once the product is fully deployed to market.

I would like to build a regression model using these survey scores as IVs and 'product launch success' ratio as my target variable.

  1. Should the survey metrics be coded as ordinal variables since they are range-bound between 1 and 10? If so, I am concerned about the impact on degrees of freedom if I have to one-hot encode 10 levels (9 dummy variables) for each survey metric, not to mention the difficulty of interpreting 9 separate coefficients. Furthermore, we rarely (if ever) see extremes on this scale--i.e., most respondents answer between 4 and 9. So far, I have treated these variables simply as continuous, which causes the regression model to return a negative intercept. Would normalizing or standardizing be a valid approach then?
  2. There is a temporal aspect here as well because we ask respondents these questions each month during the launch phase. Therefore, there is value in understanding how the responses change over time. It also means that a simple linear regression across all months makes no sense--the survey scores need to be framed as relative to each other within each month.
  3. Because the target variable is a ratio bounded between 0 and 1, I was also wondering if beta regression would be the best approach.
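For points 1 and 2, one option is to keep the scores continuous but standardize them within each month, so each product's score is relative to the others surveyed that month. A minimal pandas sketch with made-up data (all column names and values are invented):

```python
import pandas as pd

# Hypothetical survey data: one row per product per month
df = pd.DataFrame({
    "month": ["2024-01"] * 3 + ["2024-02"] * 3,
    "familiarity": [6, 8, 5, 7, 9, 6],
    "accessibility": [4, 7, 6, 5, 8, 7],
})

score_cols = ["familiarity", "accessibility"]

# Z-score each metric within its month, so a product's score is
# expressed relative to the other products surveyed that month
z = df.groupby("month")[score_cols].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
for c in score_cols:
    df[c + "_z"] = z[c]
```

The z-scored columns then have mean 0 within every month, which also addresses the negative-intercept worry, since the intercept becomes the expected outcome at average scores.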
06:06 UTC


Take home problems during interviews still a thing?

I conducted my first interview for a data scientist position, and at the end of it the candidate asked if there are any take-home problems before the next round. We're looking for a mid- to senior-level person. Are take-home problems still common at that level?

In the age of ChatGPT, are take-home problems even effective, or do you feel they are too easy to game?

If you've recently done take-home problems during your interviews, how complex were the problems, and did you feel they were a necessary evil or a waste of time?

01:53 UTC


“Stat Sig” for metric that matures over time

Hello everyone, I got asked at what point the sample size for a cohort's LTV would be considered stat sig. I'm pretty sure that's not really possible, at least not in the sense of attaching a p-value to it.

What I did do was run some simulations.


  1. Capture the 30, 60, …, 365-day LTV for all users.

  2. Compute the multiplier from each interval to the 365-day value.

  3. Randomly sample 100 users at each interval, apply the corresponding multiplier, and see how far the estimate is from the population LTV.

Repeat step 3 1,000 times to get some sort of range.

Then repeat all of the above with 100, 200, …, 1,000 samples.
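The steps above can be sketched as a bootstrap in NumPy. Everything here is made up for illustration (the LTV distribution, the noise model, the horizons), but the mechanics match the procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up cohort: final 365-day LTV per user (gamma-distributed)
n_users = 10_000
ltv_365 = rng.gamma(shape=2.0, scale=50.0, size=n_users)

def ltv_at(day):
    # Step 1: early LTV as a noisy fraction of the final value,
    # with noise shrinking as the cohort matures
    w = 0.4 * (1 - day / 365)  # noise half-width (invented)
    return ltv_365 * (day / 365) * rng.uniform(1 - w, 1 + w, n_users)

pop_mean = ltv_365.mean()

def error_range(day, sample_size=100, n_sims=1_000):
    # Steps 2-3, repeated n_sims times: project day-`day` LTV to day 365
    # with a multiplier, and record how far each estimate lands from truth
    early = ltv_at(day)
    multiplier = pop_mean / early.mean()            # step 2
    errs = np.empty(n_sims)
    for i in range(n_sims):                         # step 3, 1,000 times
        idx = rng.integers(0, n_users, sample_size)
        errs[i] = early[idx].mean() * multiplier - pop_mean
    return np.percentile(errs, [2.5, 97.5])         # central 95% error band
```

This is essentially a parametric bootstrap of the projection error, so "the range stabilizes after day 180" translates to the 95% band no longer narrowing past that horizon.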

Findings: the ranges were very wide at the earliest measurements but narrowed steadily until day 180, then remained consistent.

The ranges were narrower with larger sample sizes, but not dramatically so.

I don’t know if I over-engineered this; would love some feedback.

23:11 UTC


Image embedding

Is it possible to use image embeddings from LLMs for color classification? I approached this by calculating the cosine similarity scores between the embeddings of an image and reference images. But the similarity scores were consistently below 0.1. How can I improve the effectiveness of this method or what alternative approaches should I consider?
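Consistently tiny cosine similarities can mean general-purpose embeddings simply aren't separating colors well. For pure color classification, a pixel-space baseline may outperform embeddings entirely; a rough sketch (the palette values are invented):

```python
import numpy as np

# Made-up reference palette: mean RGB per color class
palette = {
    "red":   np.array([200.0,  40.0,  40.0]),
    "green": np.array([ 40.0, 180.0,  60.0]),
    "blue":  np.array([ 50.0,  60.0, 200.0]),
}

def classify_color(image: np.ndarray) -> str:
    # image is an (H, W, 3) RGB array; compare its mean color
    # to each palette entry and return the nearest class
    mean_rgb = image.reshape(-1, 3).mean(axis=0)
    return min(palette, key=lambda c: np.linalg.norm(mean_rgb - palette[c]))
```

Comparing in a perceptual space such as HSV or Lab instead of raw RGB would likely improve the nearest-color matching further.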

19:42 UTC


Not able to detect growth in time series

Hi all,

I am working on a model to predict the stock depletion of various products to determine when to reorder. The goal is to identify when the stock will reach zero so that we can place an order 14 days in advance.

I am aware of the EOQ model, but it isn’t suitable in this context, since this is a rapidly growing startup.

To smooth out stock fluctuations, I aggregate sales over the past months since daily sales figures are less relevant and average out over time.

I used Facebook Prophet for the forecast. The forecast of the aggregated sales looks good as it is almost linear. However, the issue arises when reconstructing daily sales: the forecasted sales for each week are the same. Prophet doesn’t seem to capture the increasing growth rate of the cumulative sales.

I’m quite lost at this point and unsure how to proceed. Has anyone faced similar issues or can recommend alternative algorithms? Any help would be greatly appreciated.
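One alternative worth trying is to model the growth explicitly, e.g. fit an exponential trend to the aggregated sales and project stock depletion from that, rather than relying on Prophet's piecewise-linear trend. A sketch on made-up numbers (the 15% growth rate and figures below are invented):

```python
import numpy as np

# Made-up monthly sales for a growing product (~15% MoM growth)
months = np.arange(12)
sales = 100 * 1.15 ** months

# Fit exponential growth: log(sales) = a + b * t
b, a = np.polyfit(months, np.log(sales), 1)

def projected_sales(t):
    # Projected sales for month index t under the fitted trend
    return np.exp(a + b * t)

def depletion_month(stock, start=12):
    # First future month whose projected sales push cumulative
    # demand past the available stock
    total, t = 0.0, start
    while True:
        total += projected_sales(t)
        if total >= stock:
            return t
        t += 1
```

Once you have the depletion month, placing the order 14 days before that date is a calendar calculation; the modeling question is only the growth curve.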


15:59 UTC


Does it make sense to make probabilistic ML my thing?

I'm a master's DS student, I understand topics under ML quite broadly, but I have not done a deep dive into any specific topic like CV, NLP, PGMs, etc.

Now I have gotten interested in probabilistic graphical models and probabilistic methods as a whole, as they seem to hold potential for a lot of real-world scenarios thanks to their explainability and the reduced data requirements that come from setting priors.

I also think results like VAE are fascinating. However, I am not that familiar with the associated literature.

So my question is: do you think I should really spend the time to master probabilistic ML? Will it give me an edge in the industry, as in landing a good job? Or is it a branch that has lost its edge to deep learning (even if the two are compatible to a degree)? Should I rather focus on NLP, CV, RL, or some other discipline?

My end-game is to be a skilled applied machine learning engineer. I am also considering a phd.

Thank you so much, please don't eat me alive.

09:43 UTC


Favorite writing on designing product metrics?

In product analytics DS roles, developing metrics to measure and guide product development and growth is a core responsibility (and frequent interview subject). What are your favorite writings on how to design good product metrics?

17:32 UTC


Post-Bachelor Degree Learning, Growing, and Development

Hello everyone, I just completed my bachelor's degree this past summer and am wondering what advanced material you all have found most useful for either your work or general knowledge, whether it be a certain regression model, a widespread approach, or general advice. I don't want to go to grad school yet either. Right now I am proficient in Python and Pandas, don't remember much of my R (but I don't need it for my job), and have basic SQL knowledge. I'm not looking for a new job but want to improve my work. Thank you!

16:15 UTC


Exporting Ad Data From Meta

I have a client who wants to analyze the performance of their ads on Facebook and Instagram. They offered to extract the data themselves and send it over, but they are having a really hard time. I gather Facebook limits the size of the reports they can generate, so they must run multiple reports. The whole thing sounds tedious but also like something that could be automated. I've never worked with Meta’s ad data before, so I'm not sure how easy the extraction process would be to automate. I don’t want my first interaction with this client to be a failed promise to retrieve this data.

I’ve read about third-party applications (such as Supermetrics) that do this for you, but many of them are prohibitively expensive.

Any thoughts on how I can quickly extract this data?

15:49 UTC


How do you stay up to date?

If you're like me, you don't enjoy reading countless Medium articles, worthless newsletters, and niche papers which may or may not add 0.001% value 10 years from now. Our field is huge and fast-evolving, everybody has their niche, and jumping from one niche to another while learning is a very inefficient way to make an impact with our work.

What I enjoy is having a good wide picture of what tools/methodologies are out there, what their pros and cons are, and what they can do for me and my team. If something is interesting or promising, I have no problem researching or experimenting further, but doing that every single time just to know what's out there is exhausting.

So what do you do? Are there knowledge aggregators that can be quickly consulted to know what's up at a general level?

13:17 UTC


Weekly Entering & Transitioning - Thread 15 Jul, 2024 - 22 Jul, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

04:01 UTC


New Data science jobs in the NBA, Formula1 and sports analytics companies

Hey guys,

I'm constantly checking for jobs in the sports and gaming analytics industry. I've posted recently in this community and had some good comments.

I run www.sportsjobs.online, a job board in that niche.

In the last month I added around 200 jobs.

I'm celebrating having automated scraping for all the NBA teams; in doing so I've found a few interesting data science jobs.


There are multiple more jobs related to data science, engineering and analytics in the job board.

I've also created a reddit community where I regularly post the openings, if that's easier for you to check.

I hope this helps someone!

00:16 UTC


How to better embed words to extract aspects in a text using LLMs

Hi! I'm currently trying to do Aspect-Based Sentiment Analysis (ABSA) using multiple models like BERT, RoBERTa, etc. The extracted aspects and sentiments have to be mapped to the 3 categories that I defined, using word embeddings to measure the similarity between words and the categories. I know I could fine-tune the model so it works better and I'd have more control over its performance. But say I need it fast and training the pre-trained model is not an option--is there any other way to do this?
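If training is off the table, one zero-shot option is to embed each extracted aspect term and each category label (or a short description of it) with the same encoder, then assign by highest cosine similarity. A sketch with tiny made-up vectors standing in for real embeddings (which would be ~768-dimensional for BERT-family models, but the logic is identical):

```python
import numpy as np

# Stand-in 3-d vectors for illustration; in practice these would come
# from the same encoder used for the aspect terms
category_emb = {
    "price":   np.array([0.9, 0.1, 0.0]),
    "service": np.array([0.1, 0.9, 0.1]),
    "quality": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_category(term_emb):
    # Zero-shot assignment: pick the category whose embedding is
    # most similar to the aspect term's embedding
    return max(category_emb, key=lambda c: cosine(term_emb, category_emb[c]))
```

Embedding a short descriptive phrase per category (rather than the bare label) usually gives more separable category vectors.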

15:46 UTC


What would you say the most important concept in langchain is?

I'd like to think it's chains, because if you want to tailor an LLM to your own data, we already have RAG for that.

09:02 UTC


Whatever happened to blockchain?

Did your company or clients get super hyped about Blockchain a few years ago? Did you do anything with blockchain tech to make the hype worthwhile (outside of cryptocurrency)? I had a few clients when I was consulting who were all hyped about their blockchains, but then I switched companies/industries and I don't think I've heard the word again ever since.

00:53 UTC


On the vision of /r/datascience

I enjoy this sub. I’ve had some great discussions, been exposed to different viewpoints, and learned a ton from some very smart people. However, lately I’ve realized I have no clue what the vision for this sub actually is. Posts like this one routinely get deleted with a note that the discussion is better suited for the weekly entering/transitioning megathread, which to me makes no sense. Frankly, I’m not sure what type of posts are even allowed here anymore, and it feels like the life is being slowly moderated out of the sub.

Could someone shed some light on what type of community we’re trying to build?

22:18 UTC


How I lost 1000€ betting on CS:GO with Machine Learning

I wrote two blog posts based on my experience betting on CS:GO in 2019.

The first post covers the following topics:

  • What is your edge?
  • Financial decision-making with ML
  • One bet: Expected profits and decision rule
  • Multiple bets: The Kelly criterion
  • Probability calibration
  • Winner’s curse
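For reference, the Kelly criterion from the list above reduces, for a single bet at decimal odds, to a one-line formula (a sketch of the standard result, not betting advice):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll: f* = (p * odds - 1) / (odds - 1),
    where p is the win probability and odds are decimal odds.
    Returns 0 when there is no positive edge."""
    edge = p * decimal_odds - 1.0
    return max(edge / (decimal_odds - 1.0), 0.0)

# e.g. a 60% win probability at even-money odds (2.0) -> stake 20% of bankroll
```

This is exactly why probability calibration (the next bullet) matters so much: the stake is driven directly by p, so a miscalibrated model over- or under-bets systematically.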

The second post covers the following topics:

  • CS:GO basics
  • Data scraping
  • Feature engineering
  • TrueSkill
    • Side note on inferential vs predictive models
  • Dataset
  • Modelling
  • Evaluation
  • Backtesting
  • Why I lost 1000 euros

I hope they can be useful. All the code and dataset are freely available on Github. Let me know if you have any feedback!

17:15 UTC


Focusing on classical Statistics and econometrics in a Data Science career after a decade in the Industry

Hello everyone,

I've been a data scientist for the past 10 years, with a background in computer science. In recent years, I've found myself spending more time studying, learning, and applying concepts from classical statistics and econometrics, such as synthetic control, multi-level mixed models, experimental design methodologies, and so on. On the other hand, I probably haven't opened a machine learning book in years.

Do any of you have a similar experience? I think that unless you are working at an LLM or computer vision startup, this might be an expected career path. Can you share your experiences?

At the end of the day, I think that most business and research questions fall on the "why" side of things, which a straightforward prediction framework can't answer.

09:28 UTC


Isn't cross-validation pointless for larger datasets?

As the dataset grows larger, our sample statistics approach the true values. So the big advantage of CV--averaging out the variance across random train/val splits--diminishes, while the cost of running it grows, since with a 5-fold split you're effectively building your model 5 times over in the most extreme case.
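That intuition is easy to check empirically. For instance (synthetic data, assuming scikit-learn is available), the fold-to-fold spread of CV scores shrinks sharply as n grows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare the fold-to-fold spread of 5-fold CV at two dataset sizes
spread = {}
for n in (200, 20_000):
    X, y = make_classification(n_samples=n, n_features=10, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    spread[n] = scores.std()  # std of the 5 fold accuracies
# spread[20_000] comes out much smaller than spread[200]
```

That said, a single held-out split on a large dataset still estimates only one partition; CV's other benefit--every observation being used for validation once--can still matter when the positive class or key segments are rare.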

Is this accurate? Am I missing something?

16:46 UTC


Sagemaker makes me hate my job

I'm a Data Scientist at a startup, meaning my roles are data scientist, data engineer, data analyst, or any possible job that has "data" in its name. I really like my job, but EVERY TIME I have to do something in SageMaker (especially creating endpoints) I want to cry.

The documentation is comprehensive if you're following some well-established procedure; if you need to do something more custom, it becomes a nightmare very fast. I'm currently trying to deploy a custom vision transformer model that works perfectly locally... As soon as I publish the endpoint, I get an error, and nowhere does it state why. It feels like everything is an excuse to make you pay for their support.

13:50 UTC


Most data is underutilized

The quote goes “90% of data is unstructured,” meaning most of it is probably not being used.

I'm curious: what percentage of your organization's data do you think goes unused?

13:34 UTC


How is model selection done in companies ?


So I just finished an ML project where I was iterating over different models to find the right one for my data. I used metrics like MSE, MAE, and R^2 to compare them, and chose the one that gave the best results on those metrics.

Is this how it's done at companies too? or are there other factors that need to be considered?

I'm assuming that convincing stakeholders to go with the model, the hardware requirements to train/re-train it, and the amount of data needed to get good results for a particular model will be some factors, but is there anything else? I just want to understand how this process works in companies.
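As a baseline, the metric-driven part of that process often looks something like the sketch below (synthetic data; the candidate list and scoring choice are illustrative, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge":  Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold CV score per model (higher is better for neg MAE)
scores = {
    name: cross_val_score(model, X, y, cv=5,
                          scoring="neg_mean_absolute_error").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
```

In practice the CV winner is just the starting point; inference latency, retraining cost, explainability, and monitoring burden often overrule a small metric gain.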

08:34 UTC


Divergence between academia and practice

In 2018 I combined a master's in Econ with a full-time job in the research department of a top government office. Back then I was in constant clashes with my boss about the research level. I was very enthusiastic about the methods we learned in school for fine-grained estimation of causal effects, or the combination of theory with statistical modeling, what economists call structural models. By comparison, the methods we employed were 20 years behind. What I didn't understand then was that those results were enough, and even though they had hired me for my enthusiasm, they weren't going to let me spend a couple of years on a research project just because I wanted to use new tools (that nobody really understands).

After a long period of back and forth, with me complaining about the lack of standards and my boss telling me my desires were impractical (they weren't), I finally had enough and decided to look for a new home to conduct research. I knew I didn't want to go into academia, and for me economic research for the government was a sweet spot between a circlejerk and working in corporate finance.

Long story short, I went back to academia, got myself a master's in Stats, and landed a job a few years ago. As time passes, I am starting to notice a similar phenomenon in ML. I work at a company that does fraud detection; we use mostly tabular data and face classic issues like selection, imputation, etc. When working with text, it is mostly for feature engineering, where we might use BERT, and even then I have a hard time explaining to my 50-year-old manager how transformers work. Our system has to deliver things fast, and we sometimes leave predictive power on the table just to meet our customers' requirements.

I try to stay updated with the field. However, looking at all the new research coming out every day, I am starting to see the same thing I saw with economic research. Most of my friends at other companies don't even use transformers directly in their products, especially when a CNN does the job. Let alone diffusion models, LLMs, or now the new thing called agents. Those tools aren't relevant for most use cases, and I find that my task as a team lead (previously a DS/senior DS) doesn't revolve around using the top tools available, but around making sure data is used correctly in our models, better processing, and data-centric feature engineering (we would rather spend more time finding orthogonal signal than swap a good-enough model for a slightly better one).

Of course there are other issues like computing power and using 3rd party models like with ChatGPT, but how long do you think it would take businesses to adopt the technologies that are being developed today?

00:21 UTC


How do you go about planning out an analysis before starting to type away?

Too many times have I sat down and then not known what to do after being assigned a task, especially when it's an analysis I have never tried before and have no framework to work around.

Like when SpongeBob tried writing his paper and got stuck after "The". Except for me it's SELECT or def.

And I think I just suck at planning an analysis. I'm also tired of using ChatGPT for that.

How do you do that at your work?

21:40 UTC


Did I just fail a Turing test?

I had the most bizarre thing happen to me. I get an email from a recruiter for a Sr. DS position. I ask for some more info and I manage to get the name of the website. The recruiter scheduled an interview with me for the next day which I was fine with.

Here's where things start getting weird. They say they are partnered with OpenAI and have an AI that matches my resume with their job description; they said my match was >90% and that I was among the top candidates. I have to admit my resume was a strong fit for the job posting, and there are unique items in my work history that make me a pretty good candidate for this position.

Because they have this AI matching software, there would only be one 45 minute interview and they would give me an answer within 24hrs. On top of that, the AI determines the salary based on my fit for the job.

During the interview, the interviewer did not turn their camera on, so I left my camera off. The interviewer told me about the company and explained why I was a good candidate. They are actually partnered with a company I had previously worked at, and he used terms specific to that job, making me feel like a good fit for the position.

During the interview, the interviewer spoke 80-90% of the time, giving me some opportunities to speak in between. He asked if I can do hybrid and we debated on 2-3 days a week, though I was pretty set on 2 and he seemed pretty set on 3. It would be a 2 hour commute but the comp made it worth it (190k).

This whole process took <24 hrs from the time I was contacted by the recruiter to an offer. I'm almost certain it's a scam, but what are they getting out of it? The only thing that comes to mind is that I interviewed with a fcking AI and they were conducting some sort of Turing test. Otherwise, why would they waste their time getting nothing out of me?

Edit: I appreciate all of the feedback. Don't worry, I have no intention of falling for any scams and will not be providing any PII data or payment to this company.

20:36 UTC


scikit-learn: PLS or SIMPLS?

Hello all. I’m studying “Applied Predictive Modeling” by Kuhn, and there the SIMPLS algorithm is described as a more efficient form of PLS (according to my very limited understanding, which may be totally wrong). I’m trying to implement a practical example with scikit-learn, but I’m unable to find out whether scikit-learn uses PLS or SIMPLS as the underlying method in PLSRegression(). Is there a way to find out? Does this question make sense at all? Sorry if not: I’m a total beginner.

15:32 UTC


Toronto machine learning summit

How to get free tickets?

Is it worth going ?

Is it a networking event?

15:30 UTC


Does anyone work with people with an actuarial background, and how is it going for you?

I know this might not be everyone’s experience, but I’m just curious how you all work with people who have an actuarial background, especially if they’re the leaders of the analytics and data science sections of your company.

I personally have run into discussions about rates. It's especially been an issue explaining some of the logical problems with taking rates and trying to apply them to individual data points, and the lack of statistical rigor in looking at information that way. I can't tell if it's just my organization or if this is a common occurrence. Any thoughts?

04:55 UTC


For the Professors

Quit teaching your students R. Nobody uses it outside of university. You’re doing your students a great disservice by not making Python a core component of your curriculum. If you don’t know Python and refuse to learn, please quit teaching data science. That’s all.

EDIT: I should’ve said “Quit teaching your students ONLY R”. Also, of course people use R, I was obviously being hyperbolic. Daddy chill

19:57 UTC
