
A space for data science professionals to engage in discussions and debates on the subject of data science.


1,509,739 Subscribers


Help learning linear programming for real-world cases

I got hired as a DS for the supply chain team and am pretty much a lone wolf here. I have zero knowledge of LPP and optimization and have somehow scraped by over the last couple of months.

I have gone through many articles, but they only use predefined equations. Most of my problems come from my inability to convert Excel Solver models into PuLP-format equations.

Can I please get some advice on learning about LP optimization?
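For concreteness, this is the kind of Excel-Solver-to-PuLP translation I mean (a made-up product-mix problem, assuming pulp is installed): the Solver "changing cells" become LpVariables, the objective cell becomes the first expression added to the problem, and each Solver constraint row becomes one `+=` constraint.

```python
import pulp

# Hypothetical product-mix problem: maximize profit 3x + 2y
# subject to machine-hour and labor-hour limits.
prob = pulp.LpProblem("product_mix", pulp.LpMaximize)

# "Changing cells" -> decision variables
x = pulp.LpVariable("units_A", lowBound=0)
y = pulp.LpVariable("units_B", lowBound=0)

# "Objective cell" -> the first expression added to the problem
prob += 3 * x + 2 * y, "profit"

# Each Solver constraint row -> one += constraint
prob += 2 * x + y <= 100, "machine_hours"
prob += x + y <= 80, "labor_hours"

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print(pulp.value(x), pulp.value(y), pulp.value(prob.objective))
```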

11:52 UTC


Weekly Entering & Transitioning - Thread 08 Apr, 2024 - 15 Apr, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

04:01 UTC


Any marketing graduates who have switched to DA/DS?

History about myself 😅 I'm 27 and studied a bachelor's degree in marketing with honours (from South Africa). Then I did another honours degree in financial planning and have been a Paraplanner/Digital Marketer for the past 3 years. I got frustrated about a year ago as the job was really boring me; I end up working about 3 hours a day. I enjoy the free time, but after dabbling in some minor Excel data analysis for my company I decided to teach myself Python and SQL, having made the decision to start a Masters in Applied Data Science (MADS) in 2024 at one of the top 5 universities in South Africa, a 2-year program.

In my class of about 90 students I am the only one coming from a marketing degree; the rest are from engineering and economics. I'm guessing the Python entrance exam phased out a lot of people. I've been enjoying the course so far and have learnt more about Python in the last 3 months than I did last year self-learning 😅 I am curious whether there are others with my kind of background who have made it into the data industry, and what advice they can give.

17:37 UTC


With two competing models in a team, how do I bring up data leakage in the other one?

For this project I am working on, we have been developing two competing models. Having access to the codebase, I noticed that the other model, which has been accepted for production on the strength of its seemingly better results, has data leakage (it uses information from the test data during training): synthetic data generation was done on the entire dataset, as was other feature engineering such as standardising values over the entire dataset.

I brought this up in the group chat once, but it didn't get much attention. How do I assert myself and raise this again? My model is unfairly being put in second place.
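To be concrete about what I mean by leakage: preprocessing statistics have to be fit on the training split only. A made-up, stdlib-only sketch of the correct order:

```python
from statistics import mean, pstdev

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leakage-free: statistics come from the training split only...
mu, sigma = mean(train), pstdev(train)

# ...and are merely *applied* to the test split.
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

# The leaky version would compute mean/std over train + test,
# letting test-set information shape the training features.
```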

09:38 UTC


I made my very first Python library! It converts Reddit posts to text format for feeding to LLMs!

Hello everyone, I've been programming for about 4 years now and this is my first ever library that I created!

What My Project Does

It's called Reddit2Text, and it converts a reddit post (and all its comments) into a single, clean, easy to copy/paste string.

I often like to ask ChatGPT about Reddit posts, but copying all the relevant information from a large number of comments is difficult/impossible. I searched for a tool or library that would help me do this and was astonished to find no such thing! So I took it into my own hands and decided to make it myself.

Target Audience

This project is usable in its current state, and I'm always looking for more feedback/features from the community!


There are no other similar alternatives AFAIK

Here is the GitHub repo: https://github.com/NFeruch/reddit2text

It's also available to download through pip/pypi :D

Some basic features:

  1. Gathers the authors, upvotes, and text for the OP and every single comment
  2. Specify the max depth for how many comments you want
  3. Change the delimiter for the comment nesting

Here is an example truncated output: https://pastebin.com/mmHFJtcc

Under the hood, I relied heavily on the PRAW library (Python Reddit API Wrapper) to do the actual interfacing with the Reddit API. I took it a step further, though, by combining all these moving parts and raw outputs into something that's easily usable and very simple.

Could you see yourself using something like this?
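If you're curious about the core idea, flattening a comment tree into delimited text is just a small recursion; a simplified sketch (this toy dict shape is illustrative, not the library's actual internals):

```python
def flatten(comment, depth=0, delim="|", max_depth=3):
    """Render one comment and its replies as delimiter-indented lines."""
    if depth > max_depth:
        return []
    prefix = delim * depth + " " if depth else ""
    lines = [f"{prefix}{comment['author']} ({comment['upvotes']} upvotes): {comment['text']}"]
    for child in comment.get("replies", []):
        lines.extend(flatten(child, depth + 1, delim, max_depth))
    return lines

thread = {
    "author": "alice", "upvotes": 10, "text": "Nice tool!",
    "replies": [{"author": "bob", "upvotes": 3, "text": "Agreed.", "replies": []}],
}
print("\n".join(flatten(thread)))
```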

21:35 UTC


Looking for a Kaggle team

Looking for teammates to take part in Kaggle competitions with me. I have knowledge in computer vision, artificial neural networks, CNNs, and recommender systems.

19:05 UTC


LLM APIs vs Hosting OSS/Fine-tuned models

Hi guys, just want to check my line of thinking.

I'm managing a DS/ML team at my company, and we've been picking up a couple of projects that use LLMs.

To date, I see that for the applications happening inside the company, using LLM APIs (OpenAI, Google, etc) and building systems around it (RAG, guardrails, prompting, you name it) is still the way to go because of:

  • Speed to iterate

  • Fine-tuning data not readily available

  • The current + foreseeable future use cases seem solvable using the "general knowledge" contained in the big techs' pretraining + instruct-tuning

I still see fine-tuning being thrown around by either the big tech sales people (I get it, they're sales function at the end of the day) or by senior leadership that knows a bit into the details behind these LLMs, but personally I don't see a specific value yet of doing fine-tuning at my company's scale.

The reasons I can think of for why someone in my position would resort to fine-tuning are:

  • If there is an available infrastructure + team managing it already, and serving our own fine-tuned model is cheaper (economy of scale).

  • Compliance issues (eg. maybe Banks really don't want to risk their data being sent to other company's server)

  • Risk of the model's response stability being at the hand of the provider

  • If the task is proven to be too specific, and even GPT-4/Opus/Gemini-1.5 with RAG, etc can't solve (or the modifications around it becoming too expensive)

Based on your experience, is there a major reason that I miss? Another recent data point is Cognition labs. If people at their caliber wrap their system around GPT-4, why should I bother with fine-tuning? (other than the reasons stated above)

16:17 UTC


Philly Data & AI - April Happy Hour

If anyone is interested in meeting other data and AI folks in the Philly area, I run a monthly connect to make friends and build local industry connections. Our next connect is April 16th. See here for details: Philly Data & AI - April Happy Hour

11:58 UTC


What's your way of upskilling and continuous learning in this field?

As the title suggests. How do you think and go about long term learning and growth?

08:14 UTC


I just can't fine-tune BERT past 40% accuracy on a text-classification task

Hi everyone, this is the first time I'm fine-tuning an LLM and I just can't get over 40% accuracy on the text-classification task.

I'm using BERT from the transformers library to load and train the model, and peft for the LoRA implementation. My dataset contains English-language summaries of news articles, each with a label such as Economics, Politics, Science, Entertainment, etc. (14 unique labels). Summaries can run up to 250-300 tokens. My training set has 800 examples and my validation set has 200.

At first the training loss got very low but the validation loss didn't follow, with validation accuracy topping out around 45%. Since it was overfitting, I changed the dropout rate from 0.1 to 0.5. The model is no longer overfitting, but now it's underfitting: validation and training loss are almost the same, and validation accuracy still maxes out at 45%.

I tried removing the LoRA implementation but nothing changed except the training time. At this point I'm confused about what I should do. I've tried tuning hyperparameters but nothing changes.

Can anyone help me understand what I could be missing here? I can share stats and the code implementation, or even get on a call if that's possible. Any help will be very much appreciated.
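One sanity check I'm considering before more tuning: with only 800 examples and 14 classes, a simple linear baseline would at least tell me whether the labels are learnable at all, and it often beats an under-trained fine-tune. A scikit-learn sketch (toy data stands in for my real summaries and labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the news summaries; swap in the real train/validation split.
texts = [
    "stocks rallied as inflation cooled", "central bank raises interest rates",
    "parliament passes new election law", "president vetoes the senate bill",
    "telescope spots distant exoplanet", "researchers sequence ancient genome",
]
labels = ["Economics", "Economics", "Politics", "Politics", "Science", "Science"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
pred = baseline.predict(["central bank raises rates"])
```

If this baseline also stalls around 45%, the problem is more likely the data (label noise, too few examples per class) than the fine-tuning setup.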

07:31 UTC


ECS Task for ML Inference

Hey All

I'm trying to build an ML inference application. I would like to be able to run inference on many uploaded files and return those results to the user. I'm having some trouble formulating what this would look like and am looking for some advice. Here's what I have:

  • A Dash-based web application hosted as a task on ECS where users can upload files to run inference on. This application pulls the model down from S3 with the viz app, loads the model, and then runs the inference methods, and returns the results in a Dash callback. This works fine so far, but only for a single file, one at a time.

I can build the components easily as two tasks on ECS as a Dash viz app and a FastAPI application for running inference. I would like the user to be able to upload multiple files, process them in parallel (or close to it) by the FastAPI task, and then send the results back to the web app for visualization.

I'm confused about how to scale this up -- will sending multiple files in a Python `requests` body to the FastAPI application automatically scale up the ECS Task? Or will it come in as a "batch inference" problem and be run as a single task?

I looked briefly into a Lambda w/ S3 trigger -> SQS -> ECS, but this seems like a roundabout solution -- how can I get multiple inference results back in a single response body?
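(To answer my own first question partly: ECS won't fan out a single HTTP request into multiple tasks on its own; autoscaling reacts to metrics like CPU or request count, not request bodies.) My fallback plan is to parallelize within the single FastAPI task and return everything in one response; a stdlib sketch where `run_inference` is a stand-in for the real model call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_inference(file_name):
    """Stand-in for loading an uploaded file and running the model on it."""
    return {"file": file_name, "score": len(file_name) / 10.0}

def batch_inference(file_names, max_workers=4):
    # For CPU-bound Python models, swap in ProcessPoolExecutor; for runtimes
    # that release the GIL (most native inference engines), threads are fine.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_inference, file_names))

results = batch_inference(["a.csv", "bb.csv", "ccc.csv"])
```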

Any help is appreciated.

23:40 UTC


Recommend good books/ courses

Hi all.

I'm really free these days, unemployed and looking for work, but the way the market is right now, I guess it'll take some time. So can anyone recommend good data science books/courses?

What I'm looking for:

  • mlops,
  • docker, kubernetes in data science
  • tackling data science problems without business context
  • how to modularize code (not just Jupyter notebooks, but how to create entire pipelines in VS Code/PyCharm)
  • create web dashboards

Looking forward to the recommendations


20:36 UTC


How can I address small journey completions/conversions in experimentation

I'm running into issues with sample sizing and wondering how folks experiment with low conversion rates. Let's say my conversion rate is 0.5%; depending on traffic (my denominator), a power analysis may suggest I need to run an experiment for months to achieve a statistically significant detectable lift, which is outside an acceptable timeline.

How does everyone deal with low conversion rate experiments and length of experiments?
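For concreteness, here's the back-of-envelope math that's worrying me (standard two-proportion approximation, z-values hardcoded for alpha = 0.05 two-sided and 80% power; made-up baseline and lift):

```python
import math

def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per arm for a two-proportion z-test."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# 0.5% baseline, 20% relative lift (0.5% -> 0.6%):
print(n_per_arm(0.005, 0.006))  # on the order of tens of thousands per arm
```

Which is why the usual levers seem to be: accept a bigger MDE, use a higher-level surrogate metric with more events, or apply variance reduction like CUPED.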

19:56 UTC


upskilling for ex-academic with skill gaps

Hey folks, I’m looking for advice on filling in some skill gaps. I’m a social science academic with a highly quantitative background, left academia a couple years ago for a nonprofit role, and am now looking for my next thing.

My job search revealed that I have some noticeable skill gaps that affect interviewing and hiring. But typical data science training options are pitched too low — I’m qualified/have been recruited to teach subjects like causal inference, experiment design, surveys, data viz, and R programming at the grad level. I’d like to upskill on at least the following topics:

  • Python, but the intro stuff is just unbearably boring. Is there a Python transition course for R experts?

  • SQL, ditto. I fully understand most concepts around data manipulation... in R.

  • Forecasting and predictive analytics. Would be happy to read a book or take a class on this.

  • Product-oriented analytics. I'm solid on working with non-technical stakeholders, but there seem to be some common problem areas (churn, pricing, auctions, marketing/attribution, risk, search) where specific knowledge of how people typically approach them would be helpful.

  • AI/ML basics and assessment. Again, looking for material for someone with minimal ML experience but a strong stats/quant background.

Also interested in anything you think would be a good direction to pursue. I’m not currently in a hurry, plus the market is miserable, so I’d like to set myself up for a big push next year. I have a substantial amount of PD money I can use as long as it’s started in the next 6 months, so, happy to pay for courses if they’re useful.

17:21 UTC


Deduplication with Splink

I'm trying to figure out a way to deduplicate a large-ish dataset (tens of millions of records), and Splink was recommended. It looks very solid as an approach, and some comparisons are already well defined. For example, I have a categorical variable that is unlikely to be wrong (e.g., sex); dates, for which there are some built-in date comparisons; and I could define a comparison myself as something like abs(date_l - date_r) <= 5 to require the left and right dates to be within 5 days of each other. This will help with blocking the data into more manageable chunks, but the real comparisons I want are on some multi-classification fields.

These have large dictionaries behind them. An example would be a list of ingredients. There might be 3000 ingredients in the dictionary, and any entry could have 1 or more ingredients. I want to design a comparator that looks at the intersection of the sets of ingredients listed, but I'm having trouble with how to define this in SQL and what format to use. If I can block by "must have at least one ingredient in common" and use a Jaccard-like measure of similarity I would be pretty happy, I'm just struggling with how to define it. Anyone have any experience with that kind of task?
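The measure itself is easy to prototype outside SQL; what I have in mind is roughly this (toy ingredient lists), which would then need to be expressed with the backend's array functions (DuckDB has list-intersection helpers, though I'd need to check the exact function names):

```python
def jaccard(a, b):
    """Jaccard similarity between two ingredient sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

left = ["flour", "sugar", "butter"]
right = ["flour", "sugar", "eggs", "milk"]

share_one = bool(set(left) & set(right))  # "at least one in common" blocking rule
sim = jaccard(left, right)                # 2 shared / 5 total = 0.4
```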

16:32 UTC


Need guidance for (lack of) career path

I'm at a loss of where I stand in the Data Analyst career path. I did an econ MA in 2019 immediately after finishing my BA, which was a terrible idea because I was playing catchup on the maths and couldn't really properly learn any of econ models or causal inference/statistics.

After graduating I struggled to find an "econ" job while my peers got positions months before graduation. Thanks to Twitter hobby-posting during the start of COVID, though, I got my first gig as a Data Analyst in late 2020 with the Dept of Health. That's when I started self-teaching Python alongside PowerBI and Tableau. More recently I've picked up SQL and R...

Fast forward to now: I've been through about a job per year and am once again not too happy with the position I'm in. I'm a glorified data wrangler at my mental health research lab, which has a small 3-person data analyst team (4 if we count the boss/director). I get barraged with so much ad-hoc stuff I can't say no to that I don't have time to revisit all the modelling/causal inference material I didn't fully grasp during my MA... nor does anyone really care about my opinion on that topic. I've had countless instances where, despite not knowing how to fix an issue, I call out something egregious in an analysis (e.g., operating on a dataset for which, due to issues with my peer's R code, only 30% of observations had an IPTW and the rest were NULL, when none should be NULL). No one ever cares; they are in the well-known social sciences loop of "shit out as many papers as possible, or perish due to lack of grants".

Whenever I do get the chance to go beyond data wrangling, I'm basically sent on fishing expeditions that we use to show some silly model in a silly one-time presentation never to be revisited.

I have insisted at times that my name not be included on papers we submit to journals, but they always get me included, because I can't bring myself to say "the reason is you have a lot of issues in there, which I pointed out and you chose to ignore. I don't wanna be the victim of a replication-crisis blogpost." It's demoralizing and I can't continue this way much longer.

It seems all academic-ish jobs in the social sciences are like this, from what I've read on forums. But I just don't have the skillset to make it as a "Data Scientist" in industry either... and I don't have the time to fill the gaps while I'm working, because I'm always data-monkeying away, often reading a shitton of documentation that wears me out too much to get into my statistics bookmarks after work. Right now I've been tasked with figuring out our data warehouse, which is prepared in fucking SAS-SQL and has dozens of SAS programs, each with copies like code_v1 thru code_v16_final_FINAL. The person who did all that work for years, and was my mentor when I joined the lab, abruptly quit recently.

What should I do? I have savings... My partner is OK with me quitting to figure things out. But I'm not sure I am. I need a plan, at the very least, before doing that... I've considered proposing they keep me as a part-time employee, or just returning to my previous job, where I had similar issues, but not of this magnitude...

If it matters below is my "career path" thus far. I've an Econ/IR double BA and an Applied Econ MA...

  1. COVID contact tracing team - ended after 1 year because of politics
  2. Development NGO - quit after accepting job #3 because my pay would be doubled, plus I was covering something like 3 additional unpaid roles there on top of DA
  3. Govt. transparency, civic participation, econ development think tank - quit after being told I couldn't work remotely from the state I wanted to move to so I could move in with my long-distance partner. They did ask me to rejoin a month later and I said no... still on good terms
  4. Mental health research lab - current job...pays well enough but dreading it hence this post
16:11 UTC


Data challenge take-homes. How are you setting up your repos?

Howdy folks.

I've been on the job market for a while now (7 YoE, laid off from one of the larger tech companies last year), and written a bunch of data challenges in that time. Mostly I've just been sending over a requirements file and a jupyter notebook with extensive discussion of what I did and why I did it, but this is not how I would actually code on the job -- just how I code given the constraints of having to do things quickly to turn around data challenges. I don't think it's a great way to showcase my actual coding habits which would typically involve being more thoughtful and including unit/integration tests, but when you're faced with a tight deadline to turn something around, you do what you have to.

That said -- some of you must have boilerplate that you copy from one project to the next to make the process less painful. Show me your repo structures! So that I can steal them and not have to think next time.
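For reference, the kind of boilerplate I have in mind (names are just placeholders): keep the notebook as the narrative, but push logic into an importable package so tests are cheap to add.

```
take-home/
├── README.md          # problem framing, decisions, how to run
├── requirements.txt
├── notebooks/
│   └── analysis.ipynb # narrative + figures, imports from src/
├── src/
│   ├── __init__.py
│   ├── data.py        # loading/cleaning
│   ├── features.py
│   └── model.py       # train/evaluate entry points
└── tests/
    └── test_features.py
```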

14:54 UTC


Opinions on a side project for a recent grad?

Hey everyone 👋

I’m graduating in 4 weeks. Got an analyst role, but I eventually want to land a DS role. Was thinking of taking a gap year and getting an MS degree in CS with an emphasis on ML from Georgia Tech. In that year, I wanted to work on a side project.

I was honestly thinking of teaching myself object oriented programming and making a video game without an engine, just using hard coding. I know that’s not DS related, but I’ll be doing plenty of analytical stuff with SQL/Python/Tableau at my day job. And this felt like a project that would teach me more about the programming side of things, less of the basic scripting side that I do at work.

I am wondering if anyone sees value in a side project like this, in regard to landing an actual DS role in the future? I really want to learn outside of work, but also want it to be something I’m interested in. Thanks for the feedback!

10:45 UTC


Why is there no hope for data science juniors?

Since last year, I have never seen anyone from a different field (not a Computer Science, Statistics, or DS grad) get an entry-level job, even after completing many projects, courses, bootcamps, a GitHub portfolio, etc.

Do you think the market is dead for outsiders? Has anybody here actually gotten an entry-level job without a related academic degree in the last 6 months? Please prove me wrong; I want to see real examples so I don't lose hope completely.

-- Btw, I am a 3+ year Python developer with industry experience deploying DS models. I have applied to more than 100 jobs and gotten no interviews. I am in Turkey, applying mostly for foreign jobs.

10:07 UTC


What would you consider "advanced" seaborn plotting?

Hey guys, I'm doing a little project on the side exploring how to do things with seaborn that are usually not covered in courses and tutorials. So far it has basically involved figuring things out by tinkering, adapting code I found online for my purposes, and then finding more elegant ways to go about the same code. So I decided to put together a little "advanced seaborn" project to collate these things, share them, and keep for future personal reference.

So far I'm covering:

  • Labeling some/all points/bars on a plot with data that is not part of the plot (i.e. that is not passed to the containers), especially when using hue, order and hue_order
  • Plotting over background images
  • Centering long axis labels using custom functions
  • Custom legends when the defaults don't work well
  • Styling/modifying xtick and ytick labels on the fly
  • Using LaTeX

What would you add to the list or, alternatively, what do you think would make you say "that's cool/handy" when plotting with seaborn?
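As a taste of the first item, the minimal version I use relies on matplotlib's `ax.bar_label` with an explicit `labels=` list looked up in tick order (it gets fiddlier with `hue`, since each hue level is its own container):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import seaborn as sns

ax = sns.barplot(x=["a", "b", "c"], y=[3, 5, 2])

# External data, keyed by category -- never passed to seaborn at all.
sample_sizes = {"a": "n=30", "b": "n=50", "c": "n=20"}

# Look labels up in tick order so they line up with the bars.
labels = [sample_sizes[t.get_text()] for t in ax.get_xticklabels()]
annotations = ax.bar_label(ax.containers[0], labels=labels)
```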


04:27 UTC


DeepMLeet is like LeetCode but for aspiring AI researchers. I just added some new questions to the Machine Learning section; let me know what you think!

01:52 UTC


You are under arrest; you can be released if and only if you explain p-values the right way.

So, I asked ChatGPT this:
"I know, and you know, that you have been trained on a lot of data from Reddit. Pretend you are a tough critic on the datascience sub and validate my statements: I am studying p-values, in A/B testing and in hypothesis testing in general. Here's my interpretation of it: Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results. How wrong am I?"

And here's its reply:

Your interpretation of the p-value is essentially correct and aligns well with standard statistical definitions. Let's break it down a bit more for clarity and completeness, especially from the perspective of a critical view one might encounter in discussions, say, on the data science subreddit:

  1. Definition Accuracy: The p-value measures the probability of observing results as extreme as, or more extreme than, what was actually observed, assuming that the null hypothesis is true. Your description captures this essence correctly. It's important to emphasize that this is under the assumption that the null hypothesis is true.
  2. Context of 'Extreme': The term "as unusual or extreme" is accurately used here. In statistical testing, "extreme" is defined in the context of the null hypothesis. What constitutes "extreme" is determined by the test statistic used and its distribution under the null hypothesis. It's not just about being unusual; it's about being so far from what the null hypothesis predicts that it prompts a second look.
  3. Not the Probability of the Null Hypothesis Being True: A common misconception about p-values that's important to avoid is thinking that the p-value tells you the probability that the null hypothesis is true or false. Your statement avoids this pitfall by correctly not making claims about the truth of the null hypothesis itself. The p-value doesn't provide direct evidence about the truth of the null or alternative hypotheses; it's about the evidence against the null hypothesis given the data.
  4. Implications for Decision Making: While not directly stated in your interpretation, it's essential to recognize that a small p-value indicates that, if the null hypothesis were true, the observed result would be surprising. However, deciding on a threshold for "small" (like p < 0.05) is somewhat arbitrary and should be considered in the context of the field of study, the stakes of the decision, and the study design.
  5. Misinterpretations and Limitations: It's also worth noting, especially in a critical discussion, that p-values are widely misunderstood and misused. They should not be used in isolation for making decisions. Other factors, including the effect size, the study design, the data quality, and the pre-study probabilities of the hypotheses, are crucial for a comprehensive understanding of the study's results.

Your interpretation of the p-value as the probability of obtaining results as unusual or more extreme than the observed results, under the null hypothesis, is a concise and correct starting point. Expanding upon this with an awareness of common misunderstandings and the context of its application will deepen the understanding and communication of statistical results.
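A nice way to make that definition concrete (and a good exercise for the jail cell) is to compute a p-value by brute force: simulate the chance model and count how often it produces something at least as extreme as the observed result. A small permutation-test sketch with made-up A/B data:

```python
import random

random.seed(42)

a = [12.1, 11.8, 12.5, 12.3, 11.9]  # metric under variant A
b = [12.8, 13.1, 12.6, 13.0, 12.7]  # metric under variant B
observed = abs(sum(b) / len(b) - sum(a) / len(a))

# Chance model embodying the null hypothesis: labels don't matter,
# so shuffle them and see how extreme the difference gets by chance.
pooled = a + b
n_iter, hits = 10_000, 0
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = abs(sum(pooled[5:]) / 5 - sum(pooled[:5]) / 5)
    if diff >= observed:  # "as unusual or extreme as" the observed result
        hits += 1

p_value = hits / n_iter  # fraction of null worlds at least this extreme
```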


23:34 UTC


How much money is there in packaging and selling economic data?

I know this is a vague question, but I'm just looking for a ballpark. My friend's business is sitting on a bunch of real-time economic data on traffic volumes at everything from ports to raw materials suppliers and construction companies across a single EU country. How would they even go about sizing this, and is it likely to have a market?

20:46 UTC


Turning down a job but offering to work as a consultant instead?

I got a job offer for a position that I'm going to turn down (pay is lower than I make currently and even if that wasn't a factor, the benefits aren't great). I've definitely already decided not to take it. But I wonder if it would be possible to ask to work part time for them if they wanted. The work is infinitely more interesting than what I'm doing now, so it would be nice to be involved.

Has anyone done this before? I don't want to insult them or anything by offering such a thing.

19:41 UTC


Small Company vs. Larger Company for a Data Scientist: A Discussion on Generalist vs. Specialist

I have experience in machine learning engineering, data science, data engineering, and MLOps, which aligns my profile more with that of a generalist. I'm uncertain if this is beneficial, but I believe being a generalist could offer more opportunities in the future. Am I mistaken? Additionally, due to my ADHD/Autism, I often find myself quickly bored with repetitive tasks.

Recently, I've been participating in the hiring process for a data scientist position and am now a finalist at two companies: a small startup (which I'll refer to as "S company") and a large corporation ("B company").

S company is just beginning to implement AI, and the vacancy is for a senior-level position. As a data scientist, I would need to identify opportunities to apply AI. They lack an MLOps platform but promise freedom to deploy and use technologies as needed, with minimal bureaucracy. This seems appealing for rapid growth within the company. However, the downside is the limited number of experienced data scientists in the team (only one, actually). S company, focusing on package delivery, has many opportunities to apply optimization algorithms for routing. Yet, I've noticed the company seems somewhat disorganized.

B company is a large bank known for being data-driven and frequently hosts insightful conferences on ML and DS on YouTube. The vacancy relates to credit limits, and I was told that specializing in credit and loans is crucial for advancement within this company. The position is mid-level and offers a higher salary than S company. This could be advantageous, as it allows for learning without excessive pressure. However, I wonder if becoming too specialized in this area might limit my future career options.

What would you do in my position?

19:28 UTC


Does anyone recommend a clustering algorithm that can also update existing clusters?

For instance, say I have 1000 data points that I cluster with algorithm A. When I obtain another 500 points, I would like to use the existing cluster information without reclustering everything from scratch.

Is there a clustering algorithm (ideally in sklearn and not k-means) that can handle this type of usage?

In one use case, the distance metric I plan on using will be Jaccard, since my data will be binary.
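One option I've been looking at is scikit-learn's `Birch`, which supports `partial_fit`, so new batches update the existing CF-tree instead of reclustering from scratch. The catch for my use case: it's Euclidean-only, so it won't honor a Jaccard metric directly (for binary data that might mean clustering on a reduced embedding, or looking outside sklearn). A sketch:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)

model = Birch(n_clusters=3, threshold=0.5)
model.partial_fit(rng.random((1000, 8)))  # initial batch
model.partial_fit(rng.random((500, 8)))   # later batch updates the same tree

labels = model.predict(rng.random((500, 8)))
```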

18:42 UTC


Feels like I’m in a grey area education wise.

I'll be graduating in a month with a BSc in Applied Statistics. The 3 most important classes I've taken are Regression Models (Poisson, NB, Beta), Multivariate Analysis (PCA, Discriminant Analysis, Factor Analysis), and Machine Learning (SVM, Decision Trees, SMOTE). I have a course in data visualization using the tidyverse packages in R, a course dedicated to the SAS programming specialist certification, and 2 courses preparing for the actuarial FM and P exams, amongst other electives.

I don't know if an undergraduate course is enough for competency in the first 3 classes mentioned, given each one has a graduate-level variation (e.g., I'm taking 410 while grad students take 510). I feel like my degree gave me breadth but not depth, stats-wise. Math-wise, I got through the Real Analysis sequence, but I don't think I'm cut out for a pure stats approach.

Is this enough for an entry-level job, or is that gatekept behind postgraduate education?

17:08 UTC


Does anyone know how to scrape posts from a Reddit thread into Python for data analysis?

Hi, does anyone know how to scrape posts from a Reddit thread into Python for data analysis? I tried to connect Python to the Reddit API and this is what I got. Does anyone know how to solve this issue?

After the user authorizes the app and Reddit redirects to the specified redirect URI with a code parameter, you need to extract that code from the URL.

For example, if the redirect URI is http://localhost:65010/authorize_callback, and Reddit redirects to a URL like http://localhost:65010/authorize_callback?code=example_code&state=unique_state, you would need to parse the code parameter from the URL, which in this case is 'example_code'.

Once you have extracted the code, you need to use it to obtain the access token by making a POST request to Reddit's API token endpoint. This endpoint is usually something like https://www.reddit.com/api/v1/access_token.

Here's a general outline of how you can do it:

  1. Extract the code parameter from the redirect URI.
  2. Make a POST request to Reddit's API token endpoint with the code, along with your app's client ID, client secret, redirect URI, and grant type (typically 'authorization_code').
  3. Reddit's API will respond with an access token.
  4. You can then use this access token to authenticate requests to the Reddit API.

The specific details of making the POST request, handling the response, and using the access token will depend on the programming language and libraries you are using. You'll need to refer to Reddit's API documentation for the exact endpoints, parameters, and response formats.
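Step 1 at least is pure standard library, for example:

```python
from urllib.parse import urlparse, parse_qs

# The URL Reddit redirected the user to (example values from above).
redirect_url = (
    "http://localhost:65010/authorize_callback"
    "?code=example_code&state=unique_state"
)

query = parse_qs(urlparse(redirect_url).query)
code = query["code"][0]
state = query["state"][0]  # verify this matches the state you generated
```

The token exchange in step 2 is then an HTTP POST with basic auth (client ID/secret) and the form fields from Reddit's OAuth docs. That said, if the goal is just pulling posts and comments for analysis, the PRAW library with a "script"-type app handles all of this OAuth plumbing for you.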

08:33 UTC


Any learning resource recommendations on space usage optimization?

Does anyone here work on optimizing space usage and have any good resources for learning?

I'm working on a problem similar to retail space optimization, where the goal is to find the best combination of products/promotions to place in a store to maximize profit. There are some research papers on store space optimization, but most of them seem quite theoretical. Any leads on applied resources would be much appreciated 🙏
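For intuition, the simplest version of the problem (ignoring demand effects and cross-elasticities, which the real models layer on top) is a 0/1 knapsack, which is a short dynamic program; numbers here are made up:

```python
def best_profit(items, capacity):
    """0/1 knapsack: items are (profit, space) pairs, capacity in space units."""
    best = [0] * (capacity + 1)
    for profit, space in items:
        # Iterate capacity downward so each item is placed at most once.
        for c in range(capacity, space - 1, -1):
            best[c] = max(best[c], best[c - space] + profit)
    return best[capacity]

# (profit, shelf-space) for three candidate promotions, 5 units of shelf space
print(best_profit([(60, 1), (100, 2), (120, 3)], 5))  # -> 220
```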

03:51 UTC
