/r/datascience

A space for data science professionals to engage in discussions and debates on the subject of data science.

2,440,739 Subscribers

0

Need help standard deviation

Hey guys I really need help I love statistics but I don’t know what the standard deviation is. I know I could probably google or chatgpt or open a basic book but I was hoping someone here could spoon feed me a series of statistics videos that are entertaining like Cocomelon or Bluey, something I can relate to.

Also I don’t really understand mean and how it is different from average, and I’m nervous because I am in my first year of my masters in data science.

Thanks guys 🙏

6 Comments
2024/12/12
20:19 UTC

13

How to Best Prepare for DS Python Interviews at FAANG/Big Companies?

Have an interview coming up where the focus will be on Stats, ML, and Modeling with Python at FAANG. I'm expecting that I need to know Pandas from front to back and the basics of Python (Leetcode Easy).

For those that have gone through interviews like this, what was the structure and what types of questions do they usually ask in a live coding round for DS? What is the best way to prepare? What are we expected to know besides the fundamentals of Python and Stats?

9 Comments
2024/12/12
18:21 UTC

5

A New Approach To Training With Perforated AI

Hello, I wanted to share this in case someone is interested in knowing a little bit about a company I saw at a conference a few months ago.

https://medium.com/@luismarcelobp/a-new-approach-to-training-with-perforated-ai-339e29cabd54

6 Comments
2024/12/12
17:11 UTC

0

Error rates / dirty data can cause sickness?

I do remember reading a long time ago that in production lines with high error-rates the motivation of labourers went down and the stress affected the workforce.

I wonder if dirty-data can have the same effect and has been researched as such. I know there are studies into error-rates in software, but that mixes software with data.

I wonder if specifically the stress caused by the unpredictability of the amount of work and the constant pressure dirty data causes has been studied as a health concern/risk.

Thanks.

Y.

edit: added the source Unraveling Software Engineering Failures: Reasons and Fixes https://growprogramming.com/engineering-excellence-unraveling-the-reasons-behind-failures-in-software-engineering/

13 Comments
2024/12/12
02:33 UTC

0

CodeSignal companies

Does anyone have a list of companies that use the CodeSignal data science assessment?

Let's list the companies that did CodeSignal interviews so we can compile a list.

6 Comments
2024/12/12
00:58 UTC

0

Get message in markdown: execution KO or OK

I am working with non-developers. I want them to enter parameters in markdown, execute a script, then get a message at the end (execution OK or KO) in the knitted HTML (they'll do it from the command line). I set error=T in the markdown so we'll always get the document open. If I want to report whether the execution was KO or OK, do I have to detect whether there's at least one warning or error in my script? How do I do that?

8 Comments
2024/12/11
21:13 UTC

0

Love this.

0 Comments
2024/12/11
18:37 UTC

100

The Solitude of Data Science: Looking for Kindred Spirits

Hello!

I’ll try to keep this short because I’m terrible at being concise.

I came from a different world—operations and sales. It didn’t take long for me to realize that I wanted to move away from... well, salespeople. I applied for a dev job at my company and got rejected, but they saw potential in my knowledge and experience with machine learning, deep learning, and some other rogue projects I had been working on.

They asked if I could develop a proof of concept (POC) to present to our board of directors. The company had previously attempted to work with three external teams, but none of those efforts were successful. I presented the POC, and it went exceptionally well. We secured funding and created a junior data science position specifically for me. Previously, the company had no such role or anything similar. While the IT team is very strong, they haven’t had the capacity to handle initiatives like this.

Since then, I’ve been obsessed—reading everything I can and taking stats classes for a certificate program at MIT (with plans to continue my education). I’m pretty sure I’ve been driving my wife and friends crazy because I love talking about this stuff. I’m genuinely passionate about it!

That said, I still have so much to learn and need to overcome my imposter syndrome. On top of this fast-moving environment, I’ve never worked in IT before, never used Jira, or been involved in their overall processes, so I’m navigating that learning curve too. I’d love to connect with others here, hear your stories, and get more involved in this r/datascience community!

35 Comments
2024/12/11
13:51 UTC

65

Best cross-validation for imbalanced data?

I'm working on a predictive model in the healthcare field for a relatively rare medical condition, about 5,000 cases in a dataset of 750,000 records, with 660 predictive features.

Given how imbalanced the outcome is, and the large number of variables, I was planning on doing a simple 50/50 train/test data split instead of 5 or 10-fold CV in order to compare the performance of different machine learning models.

Is that the best plan or are there better approaches? Thanks
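One commonly suggested alternative to a single 50/50 split is stratified k-fold cross-validation, where every fold keeps the same share of the rare class. Below is a minimal hand-rolled sketch of the fold assignment using only the standard library. The toy labels and the `stratified_folds` helper are made up for illustration; scikit-learn's `StratifiedKFold` is the usual library route.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign row indices to folds so every fold keeps the overall class mix."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so the rare class is spread
        # evenly instead of landing in one fold by chance.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels with the same ~0.67% positive rate as 5,000 cases in 750,000 records.
labels = [1] * 50 + [0] * 7450
folds = stratified_folds(labels, n_folds=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

Each fold then serves once as the test set while the other four are used for training. At the real data size that would be roughly 1,000 positive cases per test fold, and every model being compared sees the same folds.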

46 Comments
2024/12/10
19:39 UTC

159

I'm burnt out from constantly being on call where everything is on fire. Are there any good "research" or "data collection" or "data interpretation" roles that offer a more relaxed environment?

As a quick summary, I work as a Site Reliability Engineer and get paid pretty well (especially since I live in rural South Carolina and work entirely remotely). I juggle tasks like automating deployments, managing Kubernetes clusters in AWS, scripting in Python and Bash, managing and analyzing SQL databases, working with APIs, etc.

What I like

  • I get paid well & have skillsets that make it more difficult for companies to replace me

  • I need to learn and stay up to date on a variety of technologies (I consider this a plus since you're never really 'out of date' on your role)

  • I enjoy making graphs and gathering statistics/data to help our team

  • I enjoy interpreting that data to determine the root cause of an issue

  • In terms of scripting, I like making quick and dirty scripts that help my team automate something for us (this doesn't include writing large complicated scripts for other teams)

Why I hate it and want to leave

  • The job, by its very nature, means everything is always urgent

  • On call, so a consistent 9-5 is not possible. You're often staying past your shift

  • Have to constantly work with devs and other parties to ensure their services or code gets fixed

  • Rarely any slow days, you're either automating a new large project or jumping on an urgent issue

So based on the above, I'm curious if transitioning to a Data Science type role would offer a more laid-back environment, but I don't know which role that would be. Has anyone made this switch or have insights? If not, can you recommend some jobs that I can look into? Preferably jobs that can utilize at least some of what I know.

57 Comments
2024/12/10
17:49 UTC

54

Hierarchical Time Series Forecasting

Has anyone here done forecasting work for grouped time series? I checked out the Hyndman book but am looking for papers or other more technical resources to guide methodology. I’m curious about how you decided on the top-down vs bottom-up approach to reconciliation. I was originally building out a hierarchical model in Stan but am wondering what others use in terms of software as well.
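For anyone comparing the two reconciliation approaches, here is a small NumPy sketch of bottom-up vs top-down using a summing matrix. The hierarchy, the forecast numbers, and the historical proportions are all invented for illustration, so treat it as a toy rather than a recommended method.

```python
import numpy as np

# Hypothetical hierarchy: total = region_a + region_b,
# region_a = a1 + a2, region_b = b1 + b2.
# Rows: total, region_a, region_b, a1, a2, b1, b2.
# Columns: the four bottom-level series a1, a2, b1, b2.
S = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

# Bottom-up: forecast each bottom series, then aggregate upwards.
bottom_forecasts = np.array([10.0, 12.0, 7.0, 5.0])
bottom_up = S @ bottom_forecasts

# Top-down: forecast only the total, then split it using
# (made-up) historical shares of each bottom series.
total_forecast = 36.0
historical_shares = np.array([0.30, 0.35, 0.20, 0.15])
top_down = S @ (total_forecast * historical_shares)

print(bottom_up)
print(top_down)
```

Both results are coherent by construction (every parent equals the sum of its children). The difference is which level's forecast is trusted: bottom-up keeps the detailed series as forecast, top-down keeps the total.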

20 Comments
2024/12/10
14:16 UTC

6

Master Data science vs Quantitative Finance

Major data science vs Quantitative Finance

Hi, I am currently studying for a bachelor's in Econometrics in The Netherlands and next year I will need to choose a master's to pursue. My main doubt is, as you can see from the title, between data science (which is a bit outside my bachelor's) and quantitative finance.

On the one hand I may be a bit more interested in data science, but on the other hand I have the feeling that I will ‘throw away’ my Econometrics bachelor that is quite unique. From my point of view data science is followed by many people, also people from lower wage countries, while quantitative finance is a master that not many people follow.

That’s why I’m curious what other people think about this, will I be going the wrong path if I choose data science which is pursued by many students overall, should I stick to the specific field of quantitative finance or will it not matter?

28 Comments
2024/12/10
13:42 UTC

16

The pandas MemoryError

I’ve been programming for data analysis for about 5 years, but I’ve never found an easy way to handle this.

With my old beat-up Dell Latitude, anything over ~100,000 rows of a sparse df tends to throw the dreaded MemoryError, specifically with functions like get_dummies, indexing, merging, etc.

My questions are:

  1. Will a better laptop help with this?
  2. Are there any modules or helper functions for this out there?
  3. How much does using colab help with this problem? Trying to avoid paying more.

TIA!

Edit: seems like most parallelizing options do not store the df in memory, and so can’t be used to visualize. That’s my main use case. So… 4. Anyone know of any visualization tools that work with large data? Currently using Plotly/Dash.
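On question 2, two things that often help before changing hardware are smaller dtypes and sparse dummy columns. The sketch below uses synthetic data (the `city` and `amount` columns are invented) and only measures memory; how much it saves on a real frame depends on the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000

# Synthetic stand-in for a frame with one high-cardinality text column.
df = pd.DataFrame({
    "city": rng.choice([f"city_{i}" for i in range(200)], size=n_rows),
    "amount": rng.random(n_rows),
})


def megabytes(frame):
    """Memory footprint in MB, counting the contents of string columns."""
    return frame.memory_usage(deep=True).sum() / 1e6


# 1. Repeated strings as category codes, floats as float32
#    (float32 loses precision, so check that is acceptable first).
compact = df.assign(
    city=df["city"].astype("category"),
    amount=df["amount"].astype("float32"),
)

# 2. One-hot encoding: dense stores every zero, sparse stores only the ones.
dense_dummies = pd.get_dummies(df["city"], dtype="uint8")
sparse_dummies = pd.get_dummies(df["city"], dtype="uint8", sparse=True)

print(f"original frame: {megabytes(df):.1f} MB")
print(f"compact frame:  {megabytes(compact):.1f} MB")
print(f"dense dummies:  {megabytes(dense_dummies):.1f} MB")
print(f"sparse dummies: {megabytes(sparse_dummies):.1f} MB")
```

For the plotting use case, aggregating or sampling down to what the chart actually needs before handing data to Plotly usually matters more than where the full frame lives.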

39 Comments
2024/12/10
12:24 UTC

255

Thoughts on the ethics of health insurance companies using Data Science to increase profits based on selective coverage

I want to have a good discussion on this topic, since no one is talking about it outside the context of a CEO making decisions. As a lot of us know, company decisions and strategy are often driven by the suits (board) and the higher-ups, and that strategy trickles down to the analysts and other groups forming projects to support the strategic initiative. I think not talking about this from a data science perspective is an ethics violation, because we as practitioners can decide not to engage in or pursue a project instead of going along with “I have a boss and they told me I need to because it aligns with our strategy.” I personally have quit a job in the past because the ethics of the CV models we were creating dawned on me and didn’t feel right. Sure, I could justify it by saying I was only creating a small part of the software system, but the reality is I knew the end goal and was actively participating in the development of a system that could be used for an ethically questionable use case.

The possibility of UHC’s actuaries, analysts, and Data Scientists developing models to contribute to the strategy of increased profits and increased denials should be questioned. And I know “denial rates” aren’t apples to apples, as back office rev cycle management people could wrongfully code a claim, which can cause it to be denied. I’m talking more from a targeted perspective. Actuaries that work in insurance are very smart, but I want to get some insight into the specifics of what goes on from a health insurance perspective when they are denying a claim.

I would love to hear perspectives from both sides, especially those who may have worked in the industry.

117 Comments
2024/12/10
01:37 UTC

10

Real time predictions of custom models & aws

I am someone who is trying to learn how to deploy machine learning models in real time. As of now the current pain point is that my team uses PMMLs and Java code to deploy models in production. The problem is that the team develops the code in Python and then rewrites it in Java. I think it's a lot of extra work and can get out of hand very quickly.

My proposal is to try to make a Docker container and then try to figure out how to deploy the scoring model with the Python code for feature engineering.

We do have a Java application that actually makes decisions based on the models, and we want our solutions to be fast.

Where can I learn more about how to deploy this, and what format do I need to deploy my models in? I heard that JSON is better for security reasons, but I am not sure how flexible it is, as PMMLs are pretty hard to work with when it comes to converting from a Python pickle to PMML for very niche modules/custom transformers.

If someone can help explain the exact workflow, that would be very helpful. This is all going to use AWS at the end to make the decisions.
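To make the proposal concrete, here is a rough standard-library sketch of keeping feature engineering and scoring in one Python code path behind a JSON-in/JSON-out function. Everything in it is hypothetical: the coefficients stand in for a trained model, and the field names (`income`, `previous_orders`) are invented. In a container, a function like this would sit behind whatever HTTP server the Java application calls.

```python
import json
import math

# Hypothetical coefficients standing in for a trained model artifact.
COEFFICIENTS = {"intercept": -1.0, "log_income": 0.5, "is_returning": 0.8}


def engineer_features(raw):
    """Same feature code in training and serving, so nothing is rewritten in Java."""
    return {
        "log_income": math.log1p(raw["income"]),
        "is_returning": 1.0 if raw["previous_orders"] > 0 else 0.0,
    }


def score(request_body):
    """Take a JSON request string and return a JSON response string."""
    raw = json.loads(request_body)
    features = engineer_features(raw)
    linear = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )
    probability = 1.0 / (1.0 + math.exp(-linear))
    return json.dumps({"score": probability})


response = score('{"income": 0, "previous_orders": 0}')
print(response)
```

The Java side then only needs to build the JSON request and read the `score` field, and the Python transformations never have to be ported.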

1 Comment
2024/12/09
23:22 UTC

94

Is LeetCode or HackerRank actually worth it for ML/DS jobs?

I’m an undergrad trying to break into Data Science/ML roles, and I’m not sure if spending time on LeetCode or HackerRank is really worth it. A lot of the problems feel more geared toward software dev interviews, and I’m wondering if that’s the best use of time for DS/ML jobs.

Wouldn’t working on projects or learning tools like TensorFlow or PyTorch be more valuable? Has anyone here actually benefited from doing LeetCode/HackerRank for DS/ML roles, or is it overhyped for this field?

48 Comments
2024/12/09
23:15 UTC

3

SUMO/VISSIM for traffic condition simulation

Hi team!

As I have no experience with AI and predictive models for traffic management, I’m not sure how to simulate current traffic conditions in an urban city (or portion of it) without vs. with the implementation of IoT and AI.

Any good resources or advice?

Also, if anyone with first hand experience is interested, I would love to have a quick interview discussion, 15-20mins max, for qualitative analysis :)

1 Comment
2024/12/09
21:09 UTC

4

How can a webdev help DS?

Hello y'all. My expertise is between DS and full stack dev, but usually it's been one or the other.

What would your ideas be on how I can leverage my webdev skills to collaborate with other DSs in my team?

Context is supply chain, and there's some reasonable freedom to initiate projects

4 Comments
2024/12/09
19:45 UTC

27

Customer Life Time Value Applications

At work I’m developing models to estimate customer lifetime value for a subscription or one-off product. It actually works pretty well. Now, I have found plenty of information on the modeling itself, but not much on how businesses apply these insights.

The models essentially say, “If nothing changes, here’s what your customers are worth.” I’d love to find examples or resources showing how companies actually use LTV predictions in production and how they turn the results into actionable value. Do you target different deciles of LTV with different campaigns? Do you just use it for analytics purposes?

18 Comments
2024/12/09
17:24 UTC

33

How do you keep up with all the tools?

Plenty of tools are popping up on a regular basis. How do you keep up with them? Do you test them all the time? Do you have a specific team/person/part of your time dedicated to this? Do you listen to podcasts or watch specific YouTube channels?

14 Comments
2024/12/09
17:05 UTC

5

entering parameters+executing R without accessing R

I am preparing a script for my team (shiny or rmarkdown) where they have to enter some parameters then execute it (and maybe have execution steps shown). I don't want them to open R or access the script.

  1. How can I do that?
  2. Is it dangerous security-wise with a markdown knit to HTML? And with shiny, is it safe? I don't know exactly what happens with the online server thing.
  3. Is it okay to have a password passed in the parameters? I know about the Rprofile, but what are the risks?

thanks

3 Comments
2024/12/09
15:45 UTC

876

Thoughts? Please enlighten us with your thoughts on what this guy is saying.

197 Comments
2024/12/09
11:00 UTC

3

Weekly Entering & Transitioning - Thread 09 Dec, 2024 - 16 Dec, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

34 Comments
2024/12/09
05:01 UTC

80

Is your org treating the rollout of LLMs as an IT or data science problem?

Our org has given all LLM resources (and restricted all API access) to a dedicated team in the IT department, which has no prior data experience. So far no data scientist has been engaged for feedback on the design or practicality of use cases. I'm wondering, is this standard in other orgs?

32 Comments
2024/12/08
22:54 UTC

13

Timeseries pattern detection problem

I've never dealt with any time series data - please help me understand if I'm reinventing the wheel or on the right track.

I'm building a little hobby app, which is a habit tracker of sorts. The idea is that it lets the user record things they've done, on a daily basis, like "brush teeth", "walk the dog", "go for a run", "meet with friends" etc, and then tracks the frequency of those and helps do certain things more or less often.

Now I want to add a feature that would suggest some cadence for each individual habit based on past data - e.g. "2 times a day", "once a week", "every Tuesday and Thursday", "once a month", etc.

My first thought here is to create some number of parametrized "templates" and then infer parameters and rank them via MLE, and suggest the top one(s).

Is this how that's commonly done? Is there a standard name for this, or even some standard method/implementation I could use?
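The template idea can be sketched in a few lines. This version compares two made-up templates, a constant daily rate and a per-weekday pattern, fits each by maximum likelihood, and ranks them with BIC. The BIC penalty is an addition not mentioned in the post; it is there so the template with more parameters does not win automatically.

```python
import math
from datetime import date, timedelta


def bernoulli_loglik(successes, trials):
    """Maximised Bernoulli log-likelihood; the MLE is p = successes / trials."""
    if trials == 0 or successes in (0, trials):
        return 0.0
    p = successes / trials
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def rank_templates(done_dates, start, end):
    """Return (template name, BIC) pairs, best (lowest BIC) first."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    done = set(done_dates)
    n_days = len(days)

    # Template A: the same probability of doing the habit every day.
    loglik_flat = bernoulli_loglik(sum(d in done for d in days), n_days)

    # Template B: a separate probability for each day of the week.
    loglik_weekday = 0.0
    for weekday in range(7):
        matching = [d for d in days if d.weekday() == weekday]
        loglik_weekday += bernoulli_loglik(
            sum(d in done for d in matching), len(matching)
        )

    bic = {
        "constant daily rate": 1 * math.log(n_days) - 2 * loglik_flat,
        "weekday pattern": 7 * math.log(n_days) - 2 * loglik_weekday,
    }
    return sorted(bic.items(), key=lambda item: item[1])


# Toy history: eight weeks of a habit done every Tuesday and Thursday.
start, end = date(2024, 1, 1), date(2024, 2, 25)
all_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
history = [d for d in all_days if d.weekday() in (1, 3)]  # Monday is 0

ranking = rank_templates(history, start, end)
print(ranking)
```

More templates ("once a week on any day", "N times a day") would slot in the same way: write down the likelihood, maximise it, and add the template's BIC to the ranking.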

4 Comments
2024/12/08
17:36 UTC

143

Are certifications even worth it these days?

So, I’m a CS major, stats minor undergrad, and I’ve done a couple of certifications—AWS Cloud Practitioner and IBM Data Science. Honestly, I’m not sure if they added much value. In one interview, I mentioned my certifications right at the end, and they didn’t even seem to notice.

From what I’ve seen, well-defined projects seem to carry more weight than a cert. Projects show real skills, while certs sometimes feel like just ticking a box.

What’s your take? Are there any certs you’ve done that actually helped you stand out, or do you think the focus should shift more toward solid project work?

Also, which one is more valuable or more worth it, AWS, Azure, GCP or Databricks for Data Science/ML??

57 Comments
2024/12/08
16:10 UTC

27

How to find freelance opportunities - what is the most typical type of project you do as a freelancer

Hi all,

I have 5+ years of experience. I’m based in Europe.

Lately I’m thinking of switching from full-time employee to contractor, freelancing and working for different companies at the same time.

I think that freelancing for data scientists is harder than freelancing for software developers. I imagine a front end developer can easily get a project to build a website from scratch, or add functionality to an existing one. Data scientists instead need existing data and infrastructure to perform their job.

  • How do data scientists find freelance jobs? I’m based in Europe; which platform/website do you use?
  • What is the most typical project you worked on?
  • How is the market now, is there a good demand?

11 Comments
2024/12/08
09:27 UTC

265

Is the data job market as badly affected as software engineering?

Everyone knows the market is bad right now for software engineers. Probably as bad as it's ever been. What is the consensus on the job market for data professionals right now?

108 Comments
2024/12/07
17:38 UTC

9

Llama3.3 free API

3 Comments
2024/12/07
03:09 UTC

28

Classification threshold cost optimisation

Say you’ve selected the best classifier for a particular problem, using threshold-invariant metrics such as AUROC, Brier score, or log loss.

It’s now time to choose the classification threshold. This will clearly depend on the use case and the cost/ benefits associated with true positives, false positives, etc.

Often I see people advising to choose a threshold by looking at metrics such as precision and recall.

What I don’t see very often is people explicitly defining relative (or absolute, if possible) costs/ benefits of each cell in the confusion matrix (or more precisely the action that will be taken as a result). For example a true positive is worth $1000, a false positive -$500 and the other cells $0.

You then optimise the threshold based on maximum benefit using a cost-threshold curve. The precision and recall can also be reported, but they are secondary to the benefit optimisation and not used directly in the choice. I find this much more intuitive and it is my go-to.

Does anyone else regularly use this approach? In what situations might this approach not make sense?
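A minimal sketch of the approach described above, using the example payoffs from the post ($1000 per true positive, -$500 per false positive, $0 otherwise). The labels and scores are toy values, and the helper names are made up for illustration.

```python
import math


def total_benefit(y_true, y_prob, threshold, benefit):
    """Sum the benefit of each decision when flagging scores >= threshold."""
    total = 0.0
    for actual, prob in zip(y_true, y_prob):
        flagged = prob >= threshold
        if flagged:
            total += benefit["tp"] if actual else benefit["fp"]
        else:
            total += benefit["fn"] if actual else benefit["tn"]
    return total


def benefit_curve(y_true, y_prob, benefit):
    """Evaluate every distinct score as a threshold, plus 'flag nothing'."""
    thresholds = sorted(set(y_prob)) + [math.inf]
    return [(t, total_benefit(y_true, y_prob, t, benefit)) for t in thresholds]


benefit = {"tp": 1000.0, "fp": -500.0, "fn": 0.0, "tn": 0.0}
y_true = [1, 0, 1, 1, 0, 0, 0, 0]                  # toy validation labels
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # toy predicted scores

curve = benefit_curve(y_true, y_prob, benefit)
best_threshold, best_value = max(curve, key=lambda point: point[1])
print(best_threshold, best_value)
```

The curve should be computed on held-out data, not the training set. If the scores are well calibrated, the break-even point can also be read off the payoffs directly: with these numbers, flagging pays whenever p × 1000 − (1 − p) × 500 > 0, i.e. p > 1/3, which gives a useful sanity check against the empirical optimum.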

25 Comments
2024/12/06
22:29 UTC
