/r/datascience


A space for data science professionals to engage in discussions and debates on the subject of data science.

2,071,565 Subscribers

4

Detecting Marathon Cheaters: Using Python to Find Race Anomalies

Driven by curiosity, I scraped some marathon data to look for potential fraud and found some interesting results: https://medium.com/p/4e7433803604

Although I'm active in the field, I must admit this project is actually more data analysis than science. But it was still fun nonetheless.

0 Comments
2024/09/09
11:48 UTC

1

Optapy does not respect any constraints in VRPs

I am trying to migrate from vroom to OptaPlanner, but the library has no proper documentation and few people have experience with it; there is only a quick start guide on their GitHub. I ran into some problems, described in this Stack Overflow question, and really need some help: https://stackoverflow.com/questions/78964911/optapy-hard-constraint-is-not-respected-in-a-vrp

If you have any recommendations for another tool for VRPs, please share them with us.

Thanks.

0 Comments
2024/09/09
10:47 UTC

3

Weekly Entering & Transitioning - Thread 09 Sep, 2024 - 16 Sep, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

11 Comments
2024/09/09
04:01 UTC

292

What's your Data Analyst/Scientist/Engineer Salary?

I'll start.

2020 (Data Analyst ish?)

  • $20/hr
  • Remote
  • Living at Home (Covid)

2021 (Data Analyst)

  • 71K Salary
  • Remote
  • Living at Home (Covid)

2022 (Data Analyst)

  • 86k Salary
  • Remote
  • Living at Home (Covid)

2023 (Data Scientist)

  • 105K Salary
  • Hybrid
  • MCOL

2024 (Data Scientist)

  • 105K Salary
  • Hybrid
  • MCOL

Education: Bachelor's in Computer Science from an average college.
First job took about 270 applications.

192 Comments
2024/09/08
22:48 UTC

15

In practice, is it fine to sometimes make decisions based on descriptive stats, if no models/tests are working or the deadline is tight?

Is it fine to use descriptive stats to make decisions if statistical tests and ML models are not working well enough? How common is it in the tech industry, let's say for tech products?

Also, what if the deadline is tight?

23 Comments
2024/09/08
19:31 UTC

32

Advanced Analytics for Stock Analysis? (AR-ARCH, Monte Carlo Simulation, etc.)

Many people say stock prices cannot be predicted, and I tend to agree. Still, I wonder if systematic strategies using advanced analytics methods can outperform a global portfolio (like an MSCI All-World ETF). Personally, I use a set of statistical methods to estimate stock returns, including Monte Carlo simulations, spline regression, and autoregressive models like AR-ARCH. I haven't become rich yet, though! 🙃

I'd love to hear about any promising, potentially non-mainstream approaches within the realm of Data Science.

** Update **

Thanks for all the inputs!

I think the Monte Carlo part needs a bit more clarification. I sample from the last few years of returns and apply them to a starting value of 100 across 100 iterations. I repeat this process many times to generate a distribution of returns. From that, I calculate metrics like the win-loss ratio and the expected value, which I use to compare stocks. In that sense, I do consider it a form of prediction, though it's more focused on assessing potential outcomes.
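The resampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the function names, the number of simulated paths, and the `history` returns are all hypothetical, and it assumes returns are resampled independently with replacement (a plain bootstrap, which ignores any autocorrelation or volatility clustering).

```python
import random
import statistics


def simulate_paths(returns, n_paths=1000, n_steps=100, start=100.0, seed=0):
    """Resample historical returns with replacement and compound them.

    Each path starts at `start` and applies `n_steps` randomly drawn returns.
    Returns the list of final values, one per simulated path.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    finals = []
    for _ in range(n_paths):
        value = start
        for _ in range(n_steps):
            value *= 1.0 + rng.choice(returns)
        finals.append(value)
    return finals


def summarize(finals, start=100.0):
    """Expected final value and win-loss ratio relative to the start value."""
    wins = sum(1 for v in finals if v > start)
    losses = sum(1 for v in finals if v < start)
    return {
        "expected_value": statistics.fmean(finals),
        "win_loss_ratio": wins / losses if losses else float("inf"),
    }


# Hypothetical periodic returns, for illustration only (not real market data).
history = [0.012, -0.008, 0.004, -0.015, 0.021, 0.003, -0.002, 0.009]
finals = simulate_paths(history)
stats = summarize(finals)
print(stats)
```

Running the same function on several stocks' return histories gives comparable summaries; whether those summaries have any predictive value is a separate question.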

I also wanted to share that I don't believe there's one dominant method for predicting stock prices. Instead, I strongly believe that a combination of approaches, both data-driven and others, is essential for performing well in the market.

32 Comments
2024/09/08
14:55 UTC

8

Seeking Advice for My First Co-op in Data Science

Hi everyone,

I'm about to start my first co-op in data science/analytics, and I'm feeling pretty nervous. I see many students with strong personal projects, and I'm worried they might have an edge over me. I would greatly appreciate any advice or recommendations you can offer, especially from DS/DA professionals.

  1. Resume Help: Could anyone review my resume or provide suggestions on how to improve it? I'd love to know what stands out to recruiters and what might be missing.
  2. Cover Letter Tips: Should I focus on how my experiences and skills from past projects align with the company or the specific position I'm applying for? Or is there a different approach I should consider to make my cover letter stand out?
  3. Skills and Projects Focus: Are there any specific skills, certifications, or types of projects that I should prioritize? I'm aiming for positions in Data Science, Data Analytics, or Machine Learning.

Thanks in advance for your help!

https://preview.redd.it/2u0spazyxend1.png?width=1030&format=png&auto=webp&s=e256a5f064c15258ff793de6b053b546b5de9af1

20 Comments
2024/09/07
16:19 UTC

41

Feeling like I'm not exploiting data enough - is this normal?

I'm a research fellow in my first year of doing practical work after graduating in Computer Science. I'm on a project where I work with water quality data; however, I don't feel like I'm fully exploiting the data. I feel like I'm just plotting variables and doing easy statistical analysis. I have to say that I'm basically left alone in the work. What I learnt during university is not very useful for what I'm doing, mostly for the data analysis part. The analyses I'm doing all come from continuous research on the internet, but I often feel that I'm not doing enough in the analysis (in my case, of time series) other than trend estimation, correlation analysis and some statistical tests. After the analyses the ML part comes in, and I feel "satisfied" about that part. Could it be because I had some expectations of doing complex analyses? Is it a common feeling in this field?

30 Comments
2024/09/07
14:20 UTC

38

How do leetcode problems relate to in-role Data Science performance?

Genuine question: what do these show other than how much time someone has spent on leetcode?

78 Comments
2024/09/06
22:51 UTC

24

Using Machine Learning to Identify top 5 Key Features for NFL Players to Get Drafted

Hello! I'd like to get some feedback on my latest project, where I use an XGBoost model to identify the key features that determine whether an NFL player will get drafted, specific to each position. This project includes comprehensive data cleaning, exploratory data analysis (EDA), the creation of relative performance metrics for skills, and the model's implementation to uncover the top 5 athletic traits by position. Here is the link to the project

33 Comments
2024/09/06
17:42 UTC

40

Sales Forecasting for Thousands of MSKUs

I have to create a solution for forecasting for thousands of different MSKUs at location level.

Data: After a final cross join, I have 36 monthly data points for each MSKU. (Not all of them will necessarily be populated; many monthly sales values could be 0.)

The following is what I have attempted:

  • For each category of MSKUs I created XGB and RF regression models.
  • I used extensive feature engineering but finally settled on ~15 features (including lag and rolling averages)
  • At the end of this, for 5 different categories I have 2 .pkl files each, i.e. 10 .pkl files in total.
  • I did not attempt Time Series, as number of data points for each MSKU was very low.
  • None of the MSKUs have consistent sales patterns - out of 36 monthly data points, nearly 50% is always 0.
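The lag and rolling-average features mentioned in the list above could be built along these lines. This is only a sketch under assumptions: the function name, the chosen lags, the window size, and the sample series are hypothetical, and the post does not say how its ~15 features were actually computed. The one point it tries to illustrate is that every feature for month t uses only months before t, so the target does not leak into its own features.

```python
def lag_and_rolling_features(sales, lags=(1, 2, 3), window=3):
    """Build one training row per month from a single item's sales history.

    Features for month t use only values from months before t:
    - lag_k: sales k months earlier
    - rolling_mean_<window>: mean of the previous `window` months
    """
    rows = []
    start = max(max(lags), window)  # first month with a full feature set
    for t in range(start, len(sales)):
        row = {f"lag_{k}": sales[t - k] for k in lags}
        row[f"rolling_mean_{window}"] = sum(sales[t - window:t]) / window
        row["target"] = sales[t]
        rows.append(row)
    return rows


# Hypothetical intermittent series (many zero months), for illustration only.
monthly_sales = [0, 0, 5, 0, 2, 0, 0, 7]
rows = lag_and_rolling_features(monthly_sales)
for row in rows:
    print(row)
```

Rows from many items can be stacked into one table, with the item or category as an extra column, so that a single model per category is trained rather than one per item.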

However, the final report gives absurdly high predictions, even for MSKUs with nearly no sales.

This is where business has a problem. They want me to redo everything to get meaningful predictions.

My problem with this approach is that I might have to create models for each item, i.e. thousands of .pkl files.
Constraints:

  1. No access or permissions for - Cloud/Git/CI-CD/Docker.
  2. All the data and models will have to be retrained and refreshed manually every month (my biggest concern).
  3. All business applications are loaded on on-premise server (with a laughable 8 GB RAM)
  4. I am the only person - DS/DE everything in one.

I am outta my depth here! Can you please help?

EDIT:
Wow, I was definitely not expecting so many helpful responses!! I am insanely grateful. It seems I need to peruse some of the TS literature. It's midnight as I am writing this here. I will definitely try and answer and thank the comments here!

31 Comments
2024/09/06
16:57 UTC

37

Resources for A/B test in practice

Hello smart people! I'm looking to get well educated in practical A/B tests, including coding them up in Python. I do have some stats knowledge, so I would like the materials to go over different kinds of tests and when to use which. Here's my end goal: when presented with a business problem to test, I want to be able to: define the right data to query, select the right test, know how many samples I need, interpret the results and understand pitfalls.

What's your recommendation? Thank you!
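On the "know how many samples I need" part of the question, one standard calculation is the per-group sample size for comparing two conversion rates with a two-sided z-test. The sketch below uses the usual normal-approximation formula; the function name and the example rates (10% baseline, 12% target) are hypothetical, and real tests may need adjustments (unequal group sizes, multiple metrics, sequential looks) that this does not cover.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical example: detect a lift from a 10% to a 12% conversion rate.
n = sample_size_per_group(0.10, 0.12)
print(n)
```

Halving the detectable lift roughly quadruples the required sample, which is usually the first practical constraint to check against available traffic.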

22 Comments
2024/09/06
16:36 UTC

0

It's crazy how much India dominates the DS job market

I did an analysis of job offers in DS among Fortune 1000 companies (data: https://jobs-in-data.com/).

Here is a comparison of the manager/director positions vs. junior/mid/senior positions.

https://preview.redd.it/2cbwe95uu6nd1.png?width=1920&format=png&auto=webp&s=1c08c9d1c6c4d19ab71733f6130780f563e95230

https://preview.redd.it/r9xb89euu6nd1.png?width=1920&format=png&auto=webp&s=629e631b97f2f23fa1bd37d4d1586b17a69230f6

NGL huge stomp from India :)

51 Comments
2024/09/06
13:15 UTC

8

Invitation: GDPR/HIPAA Compliance webinar; Python ELT workshop

Hey folks,

dlt cofounder here.

Previously: We recently ran our first 4-hour workshop, "Python ELT zero to hero", for a first cohort of 600 data folks. Overall, both we and the community were happy with the outcomes. The cohort is now working on their homework for certification. You can watch it here: https://www.youtube.com/playlist?list=PLoHF48qMMG_SO7s-R7P4uHwEZT_l5bufP We are applying the feedback from the first run, and will do another one this month in a US timezone. If you are interested, sign up here: https://dlthub.com/events

Next: Besides ELT, we heard from a large chunk of our community that you hate governance but it's an obstacle to data usage, so you want to learn how to do it right. Well, it's no rocket/data science, so we arranged to have a professional lawyer/data protection officer give a webinar for data engineers, to help them achieve compliance. Specifically, we will do one run for GDPR and one for HIPAA. There will be space for Q&A and if you need further consulting from the lawyer, she comes highly recommended by other data teams.

If you are interested, sign up here: https://dlthub.com/events Of course, there will also be a completion certificate that you can present to your current or future employer.

This learning content is free :)

Do you have other learning interests? I would love to hear about it. Please let me know and I will do my best to make them happen.

3 Comments
2024/09/06
07:40 UTC

0

How would you evaluate this?

I have a curated list of items and a collection of conversations. I'm extracting a list of items from the conversations using an LLM and trying various techniques to map the extracted items to my curated list. I need to evaluate whether the correct number of items is extracted, and whether each one was mapped to the correct item.

Comparing the length of the two item lists is easy enough, and using a similarity score for the extracted items has been OK. I'm wondering if anyone has any better evaluation techniques they recommend I look into.
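One common way to score this kind of extract-then-map pipeline is set-based precision, recall and F1 per conversation. The sketch below assumes something the post does not state: that a hand-labeled "gold" set of curated item IDs exists for at least a sample of conversations. The function name and item IDs are hypothetical.

```python
def evaluate(predicted, gold):
    """Compare the mapped item IDs for one conversation against a gold set.

    precision: share of predicted items that are correct
    recall:    share of gold items that were found
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical curated item IDs for a single conversation.
scores = evaluate(predicted={"item_a", "item_b", "item_c"},
                  gold={"item_a", "item_b", "item_d"})
print(scores)
```

Unlike comparing list lengths, this penalizes both missed items and wrongly mapped ones, and the per-conversation scores can be averaged across a labeled sample.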

4 Comments
2024/09/06
01:07 UTC

191

What is your go-to math question for entry-level candidates that troubles them the most and sets a candidate apart from others?

What math/stats/probability questions do you ask candidates that they always struggle to answer, or that only a few can answer, setting them apart from others?

207 Comments
2024/09/05
22:07 UTC

11

Tools for visualizing table relationships

What tools do you use to visualize relationships between tables like primary keys, foreign keys and other connections?

Especially when working with many tables in a complex relational data structure, a tool offering some sort of entity-relationship diagram could come in handy.

11 Comments
2024/09/05
20:41 UTC

13

Books to add to case

Curious what books y'all might recommend to fill out a bookcase. There's the standard ISLR, ESLR, but what about domain specific analytics / science, or any other books that have very good information in them? For example, I really enjoy this one for getting up to speed on Bayesian inference: Bayesian Analysis with Python - Third Edition: A practical guide to probabilistic modeling https://a.co/d/0Y2pyEg

I use this one to really hone the craft in Bayesian inference: Bayesian Data Analysis (Chapman & Hall/CRC Texts in Statistical Science) https://a.co/d/c7VF3w9

This is a great book for pricing: Pricing and Revenue Optimization: Second Edition https://a.co/d/04pa3fg

Like in factorio, the bookcase must grow!

11 Comments
2024/09/05
17:00 UTC

69

Will Learning LLMs Be Worth It?

I'm thinking of spending the next 6-12 months to intensely study LLMs (how they're structured, training, fine-tuning, deployment, RAG, etc).

I'm certainly interested in them personally, but my decision is focused on how valuable this knowledge will be.

I have a lot of data science and ML experience already, and I'm trying to figure out what to study next. I could spend the next several months studying LLMs, *but* I could also spend the next several months studying other data related things or some business topics.

Although I think LLMs are fascinating, I'm somewhat uncertain that applied LLMs will drive real business value (LTV, revenue, profit, ROI) in the same way that more traditional tools do.

What do you think?

Will LLM knowledge be valuable or is it overhyped with limited real value over the next few years?

What am I missing and how should I be thinking about this?

(note: edit to clarify tradeoffs for what I could study.)

115 Comments
2024/09/05
15:09 UTC

58

Do health tech companies not need clinical personnel???

Just kind of a vent post here. I'm a registered nurse and I work on a data science team for a large, well-known hospital system. I absolutely love my job working as a clinical SME and liaison between teams. I have a master's in health informatics and am very up to date on ML models, data science principles and all that. However, the way my organization is going, I can tell it's time for me to look for new opportunities.

But gosh dang it's hard! I'm in the Bay Area, and there's an abundance of health tech AI companies, yet opportunities for my skill set are somehow few and far between. Project manager seems like a potentially good fit, but those positions want years and years of PM experience. Other analyst positions want someone skilled in SQL querying. Clinical informatics would be an okay area, but I prefer the process of making new and cool things over working at a hospital with Epic stuff (it's boring).

I thought I had it made once I got into the data science field as a nurse, but it seems none of these health tech companies I'm finding that are doing things I'm interested in need clinical personnel. Am I doing something wrong?

64 Comments
2024/09/04
18:27 UTC

0

Treatment to control ratio for generalized synthetic control.

Hi everyone,

I am trying to find a rule of thumb for treatment to control group ratio when implementing generalized synthetic control (using gsynth in R). All the examples have a bigger donor group (control), but mine is about balanced. Does that matter? Should the control donor pool be X times bigger than the treatment? Any resources that can help me?

Thanks in advance!

4 Comments
2024/09/04
15:19 UTC

52

When a company gives you a 'take home' how much time are you expected to spend on it?

So for context, I'm interviewing at a big company that you probably know or use their product.

This was their process up until now:

  1. Recruiter Screen
  2. Technical/Coding Screen
  3. Hiring Manager Interview
  4. Stakeholder Interview

Then I was given a take home on Thursday and asked to turn it in by Sunday.

I was only able to spend 2 days on it. I have my presentation coming up next week and now I feel like I made a crappy presentation. I wish I had put in more days.

123 Comments
2024/09/04
14:15 UTC

2

Experience using Red Hat OpenShift AI?

Our company is strictly on-premise for all matters of data. No cloud services allowed for any sort of ML training. We're looking into adopting Red Hat OpenShift AI as an all-inclusive data platform. Does anyone here have any experience with OpenShift AI? How does it compare to the most common cloud tools and which cloud tools would one actually compare it to? Currently I'm in an ML engineer/data engineer position but will soon shift to data science. I would like to hear some opinions that don't come from RedHat consultants.

4 Comments
2024/09/03
19:34 UTC

145

Do you all ever see the job market getting better?

I've been looking to move from data analytics to a data scientist role for a while now, and I've been applying but it seems like there's incredibly few jobs. Certain DS roles I'm just not qualified for depending on the specific domain, so that lowers it further. But I'm getting more and more miserable in my current role and I'm just not sure what to do.

189 Comments
2024/09/02
23:58 UTC

5

How to extract a large amount of unstructured text from executable PDF documents?

Hello,

Need some guidance on extracting large compliance items from raw PDF documents. I have a CSV with these compliance items and I want to fine-tune a model so that, when it reads any new PDF document, it can first identify the compliance items and then extract them.

Examples of these compliance items:

  • "Servicer Action : Determine whether Fannie Mae allows the servicer to approve the release of security on its behalf. If the request requires a non-delegated review, submit Form 236 and all documents as specified on Form 236, along with a recommendation, to Fannie Mae's SF CPM division at partial_releases@fanniemae.com for approval."
  • The servicer must document details on the damages or cause of loss to the property.
  • The servicer must discuss with the borrower any plans for repairing the property.
  • If the servicer is unable to establish contact with the borrower or the property is abandoned, the servicer must ensure the property is maintained and secured by complying with the requirements in D2-2-10, Requirements for Performing Property Inspections and the Property Preservation Matrix and Reference Guide.
  • The servicer must immediately issue the borrower a check for any amount of insurance loss proceeds designated for contents (for example, personal property) or living expenses.
  • The servicer must deposit any insurance loss proceeds not disbursed into an interest-bearing account (see Depositing the Insurance Loss Proceeds Not Disbursed for additional information).
  • The servicer must ensure any property inspection report accurately assesses the condition of the property, is dated, and identifies the mortgagor(s) and the property address.
  • The servicer must obtain the proper lien releases, if applicable.
  • "The servicer must prohibit payment of fees out of the insurance loss proceeds to any public adjusters or other third parties retained by the borrower to assist with the recovery of those proceeds, unless otherwise agreed to by Fannie Mae in writing."

The PDFs are servicing guides. Examples of them can be seen on the following site: https://servicing-guide.fanniemae.com/

Regex is not usable here, since these items are varied in nature, are sometimes inside tables, and have no fixed positions in the document.

The goal is to train an LLM on these compliance documents so that, if I provide a new PDF, it can predict/extract compliance items from it.

Can anyone advise which model would be better suited for this task, which tokenizers to use, etc., and whether there are similar scenarios/methodologies for extracting the items?

Attaching image of a compliance item in a pdf for reference:

https://preview.redd.it/9z9hpkjs1hmd1.png?width=1180&format=png&auto=webp&s=c57b3f6d327547780b4e504fce5b8210787669a8

26 Comments
2024/09/02
17:35 UTC

17

What statistical test should I use in this situation?

I am trying to find associations between how much the sales rep smiles and the outcome of an online sales meeting. The sales rep smiling is a percentile (based on our own database of how much people smiled in previous sales meetings) and the outcome of a sales meeting is just "Win" or "Loss", so a binary variable.

I will generate bar plots to get an intuition for the data, but I wonder what statistical test I should use to see if there is even a non-random association between how much the sales rep smiles and the outcome of a sales meeting. In my opinion I could bin the data (0-10%, 10-20%, etc.) and run a chi-square test, but that does seem to lose information since I'm binning the data. I could try logistic regression or point-biserial correlation, but I am not completely sure the association is linear (smiling too much and too little should both have negative impacts on the outcome, if any). Long story short - what test should I run to check if there is even any relationship between a feature like smiling (which is continuous) and the outcome (which is binary)?

Second, say I want to answer the question "Does smiling in the top 5% improve online sales meeting outcomes?". Can I simply run a one-tailed t-test where I have two groups, top 5% of smiles and the rest, and compare the win rate for each group?
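For the second question, a commonly used alternative to a t-test when comparing two win rates is a one-sided two-proportion z-test. The sketch below names that swapped-in technique explicitly; it is not what the post proposes. The function name and the win/meeting counts are made up for illustration, and the normal approximation assumes reasonably large counts in both groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(wins_a, n_a, wins_b, n_b):
    """One-sided z-test of H1: win rate of group A > win rate of group B.

    Uses the pooled win rate under H0 (equal rates) for the standard error.
    Returns (z statistic, one-sided p-value).
    """
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts: top 5% of smilers vs. everyone else.
z, p_value = two_proportion_ztest(wins_a=30, n_a=50, wins_b=400, n_b=950)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

This only addresses the top-5%-versus-rest comparison. It says nothing about a possibly non-monotonic relationship across the whole smiling range, and because the data are observational, a significant difference would not by itself show that smiling causes the wins.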

14 Comments
2024/09/02
11:00 UTC

201

Senior Data Analyst at a tech company, having serious anxiety and imposter syndrome issues

I got in as a senior tech analyst at one of the very big e-commerce companies in my country (Asia). I worked really hard for the job. I gave so many interviews and could finally clear them. I am a self-taught analyst with no coding background. 5 months into the job, I am so underconfident and overwhelmed. I was fairly confident in my life before I started this job. Everyone on my team is either developing an app or winning hackathons or whatever. This gives me serious anxiety, and I get anxious about the year-end ratings that I would get. I am just hardworking and I persevere. I am not quick, neither with Excel nor Python, but I get the job done after putting in long hours. I catch edge cases and don't make silly mistakes. But other than that I don't know much about data. I think I should leave the industry as I cry every day and this anxiety is killing me. :( Please help. What should I do? How do I get out of my head? How do I get my confidence back? I don't have the guts to do a 1-1 with my manager (what if he says you need to buck up?). :(

82 Comments
2024/09/02
10:57 UTC

3,078

How to avoid 1/2-assed data analysis

48 Comments
2024/09/02
06:18 UTC

8

Weekly Entering & Transitioning - Thread 02 Sep, 2024 - 09 Sep, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

89 Comments
2024/09/02
04:01 UTC

13

Announcing Plotlars 0.3.0: Enhanced Visualization with New Features and Improvements! 🦀📊

Hello Data Scientists!

I'm thrilled to announce the release of Plotlars 0.3.0! 🚀

This new version brings a host of exciting features and improvements designed to make your data visualization experience in Rust even smoother and more powerful. If you've been following the progress of Plotlars, you'll know that it's all about bridging the gap between the Polars data analysis library and various plotting libraries. With this release, we're taking things to the next level!

What's New in Plotlars 0.3.0?

🚀 New Features:

  • From Trait for Text: We've implemented the `From` trait for `Text`, allowing seamless conversion from `&str`, `&String`, and `String`. This makes handling text elements in your plots more intuitive and less error-prone.
  • Plot Title Position: Now, you have more control over your plot's aesthetics with the ability to customize the title position. Whether you want it centered, aligned left, or right, the choice is yours.
  • Axis Customization: We've added an axis module that gives you greater flexibility in customizing your plot axes. Tailor your axes to match the precise look and feel you need for your data visualization.
  • Write HTML Method: Need to export your plots? The new `write_html` method makes it easy to save your visualizations as interactive HTML files, perfect for sharing or embedding in reports.

Check It Out!

Head over to the crate, explore the updated documentation, and dive into the GitHub repository to see all the new changes in action. If you find Plotlars useful, consider leaving a star ⭐️ on GitHub; it helps others discover the project and motivates further development.

Thank you for your continued support and interest in Plotlars. Happy plotting! 🎉

https://preview.redd.it/wfe60vwnf8md1.png?width=3378&format=png&auto=webp&s=7822df23459f7d62de848d409344a6d5f6419739

6 Comments
2024/09/01
17:43 UTC
