/r/datascience
A space for data science professionals to engage in discussions and debates on the subject of data science.
Hey guys, I really need help. I love statistics, but I don't know what the standard deviation is. I know I could probably Google it, ask ChatGPT, or open a basic book, but I was hoping someone here could spoon-feed me a series of statistics videos that are entertaining like Cocomelon or Bluey, something I can relate to.
Also, I don't really understand the mean and how it is different from the average, and I'm nervous because I am in my first year of my master's in data science.
Thanks guys 🙏
I have an interview coming up at a FAANG company where the focus will be on stats, ML, and modeling with Python. I'm expecting that I need to know Pandas front to back and the basics of Python (LeetCode Easy).
For those who have gone through interviews like this, what was the structure, and what types of questions do they usually ask in a live coding round for DS? What is the best way to prepare? What are we expected to know besides the fundamentals of Python and stats?
Hello, I wanted to share this in case someone is interested in learning a little bit about a company I saw at a conference a few months ago.
https://medium.com/@luismarcelobp/a-new-approach-to-training-with-perforated-ai-339e29cabd54
I do remember reading a long time ago that in production lines with high error rates, the motivation of labourers went down and the stress affected the workforce.
I wonder if dirty data can have the same effect, and whether it has been researched as such. I know there are studies on error rates in software, but that mixes software with data.
I wonder specifically whether the stress caused by an unpredictable workload and the constant pressure dirty data creates has been studied as a health concern/risk.
Thanks.
Y.
edit: added the source Unraveling Software Engineering Failures: Reasons and Fixes https://growprogramming.com/engineering-excellence-unraveling-the-reasons-behind-failures-in-software-engineering/
Does anyone have a list of companies that use the CodeSignal data science assessment?
Let's name the companies that did CodeSignal interviews so we can compile a list.
I am working with non-developers. I want them to enter parameters in a markdown file and execute a script, then see a final message, "execution OK" or "execution KO", in the knitted HTML (they'll run it from the command line). I set error=TRUE in the markdown so the document always opens. To report whether the execution was OK or KO, I need to detect whether there was at least one warning or error in my script. How do I do that?
Hello!
I’ll try to keep this short because I’m terrible at being concise.
I came from a different world—operations and sales. It didn’t take long for me to realize that I wanted to move away from... well, salespeople. I applied for a dev job at my company and got rejected, but they saw potential in my knowledge and experience with machine learning, deep learning, and some other rogue projects I had been working on.
They asked if I could develop a proof of concept (POC) to present to our board of directors. The company had previously attempted to work with three external teams, but none of those efforts were successful. I presented the POC, and it went exceptionally well. We secured funding and created a junior data science position specifically for me. Previously, the company had no such role or anything similar. While the IT team is very strong, they haven’t had the capacity to handle initiatives like this.
Since then, I’ve been obsessed—reading everything I can and taking stats classes for a certificate program at MIT (with plans to continue my education). I’m pretty sure I’ve been driving my wife and friends crazy because I love talking about this stuff. I’m genuinely passionate about it!
That said, I still have so much to learn and need to overcome my imposter syndrome. On top of this fast-moving environment, I’ve never worked in IT before, never used Jira, or been involved in their overall processes, so I’m navigating that learning curve too. I’d love to connect with others here, hear your stories, and get more involved in this r/datascience community!
I'm working on a predictive model in the healthcare field for a relatively rare medical condition, about 5,000 cases in a dataset of 750,000 records, with 660 predictive features.
Given how imbalanced the outcome is, and the large number of variables, I was planning on doing a simple 50/50 train/test split instead of 5- or 10-fold CV in order to compare the performance of different machine learning models.
Is that the best plan or are there better approaches? Thanks
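For reference, the alternative I'm weighing the 50/50 split against looks roughly like this (a minimal sketch assuming scikit-learn; X, y, and the two candidate models are placeholders):

```python
# Stratified folds keep the ~0.7% positive rate in every split, so the
# model comparison isn't distorted by an unlucky partition.
# X, y stand for the 750k x 660 feature matrix and binary outcome.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "hgb": HistGradientBoostingClassifier(),
}

for name, model in candidates.items():
    # AUROC is threshold-invariant, so it behaves sensibly under imbalance.
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc", n_jobs=-1)
    print(f"{name}: AUROC = {scores.mean():.3f} +/- {scores.std():.3f}")
```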
As a quick summary, I work as a Site Reliability Engineer and get paid pretty well (especially since I live in rural South Carolina and work entirely remotely). I juggle tasks like automating deployments, managing Kubernetes clusters in AWS, scripting in Python and Bash, managing and analyzing SQL databases, working with APIs, etc.
What I like
I get paid well & have skill sets that make it harder for companies to replace me
I need to learn and stay up to date on a variety of technologies (I consider this a plus since you're never really 'out of date' in your role)
I enjoy making graphs and gathering statistics/data to help our team
I enjoy interpreting that data to determine the root cause of an issue
In terms of scripting, I like making quick and dirty scripts that help my team automate something for us (this doesn't include writing large, complicated scripts for other teams)
Why I hate it and want to leave
The job, by its very nature, means everything is always urgent
On call, so a consistent 9-5 is not possible. You're often staying past your shift
Have to constantly work with devs and other parties to ensure their services or code gets fixed
Rarely any slow days, you're either automating a new large project or jumping on an urgent issue
So based on the above, I'm curious whether transitioning to a data science type role would offer a more laid-back environment; the problem is I don't know which one. Has anyone made this switch or have insights? If not, can you recommend some jobs I can look into? Preferably jobs that can utilize at least some of what I know.
Has anyone here done work on forecasting grouped time series? I checked out the Hyndman book, but I'm looking for papers or other more technical resources to guide methodology. I'm curious how you decided between the top-down and bottom-up approaches to reconciliation. I was originally building out a hierarchical model in Stan, but I'm wondering what others use in terms of software as well.
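For concreteness, my mental model of the two simplest strategies is something like this toy sketch (all numbers and names are made up; Hyndman also covers trace-minimization (MinT) reconciliation, which goes beyond both):

```python
# Toy sketch of bottom-up vs. top-down reconciliation.
# bottom_fcst: independent forecasts for each bottom-level series
# total_fcst:  a forecast made directly at the aggregate level
# hist:        historical observations used for top-down proportions
import numpy as np

bottom_fcst = np.array([120.0, 80.0, 50.0])    # e.g. per-store forecasts
total_fcst = 260.0                             # aggregate-level forecast
hist = np.array([[100, 70, 40],
                 [110, 75, 45],
                 [115, 80, 48]], dtype=float)  # rows = periods, cols = series

# Bottom-up: the aggregate forecast is simply the sum of the bottom forecasts.
bu_total = bottom_fcst.sum()

# Top-down (average historical proportions): split the aggregate forecast
# by each series' mean share of the historical total.
shares = (hist / hist.sum(axis=1, keepdims=True)).mean(axis=0)
td_bottom = total_fcst * shares

print(bu_total)                    # 250.0
print(td_bottom, td_bottom.sum())  # reconciles to total_fcst by construction
```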
Master's choice: Data Science vs. Quantitative Finance
Hi, I am currently doing a bachelor's in Econometrics in the Netherlands, and next year I will need to choose a master's to pursue. My main doubt, as you can see from the title, is between data science (which is a bit outside my bachelor's) and quantitative finance.
On the one hand, I may be a bit more interested in data science, but on the other hand, I have the feeling that I would be 'throwing away' my Econometrics bachelor's, which is quite unique. From my point of view, data science is pursued by many people, including people from lower-wage countries, while quantitative finance is a master's that not many people follow.
That's why I'm curious what other people think about this: will I be going down the wrong path if I choose data science, which is pursued by many students overall? Should I stick to the specific field of quantitative finance, or will it not matter?
I’ve been programming for data analysis for about 5 years, but I’ve never found an easy way to handle this.
With my old beat-up Dell Latitude, anything over ~100,000 rows of a sparse df tends to throw the dreaded MemoryError, specifically with functions like get_dummies, indexing, merging, etc.
My questions are:
TIA!
Edit: seems like most parallelizing options do not store the df in memory, and so can’t be used to visualize. That’s my main use case. So… 4. Anyone know of any visualization tools that work with large data? Currently using Plotly/Dash.
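For what it's worth, the sparse-dtype route looks like this in pandas (a minimal sketch; df and "category_col" are placeholders), though it doesn't solve the visualization side, where I still end up densifying a slice:

```python
# Keep one-hot output sparse instead of dense to avoid the MemoryError.
import pandas as pd

dummies = pd.get_dummies(df["category_col"], sparse=True)  # SparseDtype columns
print(dummies.memory_usage(deep=True).sum())

# A mostly-zero numeric frame can also be converted column-wise:
sparse_df = df.astype(pd.SparseDtype("float64", fill_value=0.0))

# Densify only the slice that actually gets plotted:
dense_slice = sparse_df.iloc[:10_000].sparse.to_dense()
```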
I want to have a good discussion on this topic, since no one is talking about it outside the context of a CEO making decisions. As a lot of us know, company decisions and strategy are often driven by the suits (the board) and the higher-ups, and that strategy trickles down to the analysts and other groups, forming projects to support the strategic initiative. I think not talking about this from a data science perspective is an ethics violation, because we as practitioners can make the decision not to engage with or pursue a project, rather than hiding behind "I have a boss and they told me I need to because it aligns with our strategy." I personally quit a job in the past because the ethics of the CV models we were creating dawned on me and didn't sit right. Sure, I could rationalize it by saying I was only creating a small part of the software system, but the reality is I knew the end goal and was actively participating in the development of a system that could be used for an ethically questionable purpose.
The possibility of UHC's actuaries, analysts, and data scientists developing models that contribute to a strategy of increased profits and increased denials should be questioned. And I know "denial rates" aren't apples to apples, since back-office revenue cycle management staff can wrongly code a claim, which can cause it to be denied; I'm talking more about targeted denials. Actuaries who work in insurance are very smart, but I want some insight into what specifically goes on, from a health insurance perspective, when a claim is denied.
I would love to hear perspectives from both sides, especially those who may have worked in the industry.
I am trying to learn how to deploy machine learning models in real time. As of now, the main pain point is that my team uses PMML files and Java code to deploy models in production. The problem is that the team develops the code in Python and then rewrites it in Java. I think it's a lot of extra work and can get out of hand very quickly.
My proposal is to build a Docker container and then figure out how to deploy the scoring model along with the Python code for feature engineering.
We do have a Java application that actually makes decisions based on the model scores, and we want our solution to be fast.
Where can I learn more about how to deploy this, and what format do I need to deploy my models in? I've heard that JSON is better for security reasons, but I'm not sure how flexible it is; PMML is pretty hard to work with when it comes to converting a Python pickle to PMML for niche modules/custom transformers.
If someone can explain the exact workflow, that would be very helpful. This is all going to run on AWS in the end to make the decisions.
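To make the proposal concrete, the shape I'm imagining is: feature engineering and model serialized together as one scikit-learn Pipeline, served from a small Python HTTP service inside the container, which the Java application calls. A minimal sketch (FastAPI and joblib are my assumptions; "model.joblib" and the field names are placeholders):

```python
# Minimal Python scoring service the Java app could call over HTTP.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
pipeline = joblib.load("model.joblib")  # loaded once at container start-up

class ScoringRequest(BaseModel):
    records: list[dict]  # raw feature values, before feature engineering

@app.post("/score")
def score(req: ScoringRequest):
    X = pd.DataFrame(req.records)
    # The Pipeline applies the Python feature engineering itself, so nothing
    # has to be rewritten in Java.
    probs = pipeline.predict_proba(X)[:, 1]
    return {"scores": probs.tolist()}
```

If the extra HTTP hop turns out to be a latency problem, exporting the model to ONNX is another route I've seen mentioned, since ONNX Runtime has a Java API and the Java app could then score directly.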
I’m an undergrad trying to break into Data Science/ML roles, and I’m not sure if spending time on LeetCode or HackerRank is really worth it. A lot of the problems feel more geared toward software dev interviews, and I’m wondering if that’s the best use of time for DS/ML jobs.
Wouldn’t working on projects or learning tools like TensorFlow or PyTorch be more valuable? Has anyone here actually benefited from doing LeetCode/HackerRank for DS/ML roles, or is it overhyped for this field?
Hi team!
As I have no experience with AI and predictive models for traffic management, I'm not sure how to simulate current traffic conditions in an urban city (or a portion of it) without vs. with the implementation of IoT and AI.
Any good resources or advice?
Also, if anyone with first-hand experience is interested, I would love to have a quick interview discussion, 15-20 mins max, for qualitative analysis :)
Hello y'all. My expertise sits between DS and full-stack dev, but usually it's been one or the other.
What would your ideas be on how I can leverage my webdev skills to collaborate with other DSs in my team?
Context is supply chain, and there's some reasonable freedom to initiate projects.
At work I’m developing models to estimate customer lifetime value for a subscription or one-off product. It actually works pretty well. Now, I have found plenty of information on the modeling itself, but not much on how businesses apply these insights.
The models essentially say, "If nothing changes, here's what your customers are worth." I'd love to find examples or resources showing how companies actually use LTV predictions in production and how they turn the results into actionable value. Do you target different deciles of LTV with different campaigns? Do you just use it for analytics purposes?
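For example, the simplest production pattern I can picture is decile routing, something like this sketch (df, the column names, and the campaign labels are all made up):

```python
# Bucket customers by predicted LTV and attach a campaign per bucket.
import pandas as pd

df["ltv_decile"] = pd.qcut(df["predicted_ltv"], q=10, labels=False) + 1  # 1..10

campaign = {10: "vip_retention", 9: "vip_retention",
            1: "win_back_discount", 2: "win_back_discount"}
df["campaign"] = df["ltv_decile"].map(campaign).fillna("standard_newsletter")
```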
Plenty of tools are popping up on a regular basis. How do you keep up with them? Do you test them all the time? Do you have a specific team/person/part of your time dedicated to this? Do you listen to podcasts or watch specific YouTube channels?
I am preparing a script for my team (Shiny or R Markdown) where they have to enter some parameters and then execute it (and maybe have the execution steps shown). I don't want them to open R or access the script itself.
thanks
Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:
While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.
Our org has handed all resources for LLMs (and restricted all API access to them) to a dedicated team in the IT department, a team with no prior data experience. So far, no data scientist has been engaged for feedback on design or the practicality of use cases. I'm wondering: is this standard in other orgs?
I've never dealt with any time series data - please help me understand if I'm reinventing the wheel or on the right track.
I'm building a little hobby app, which is a habit tracker of sorts. The idea is that it lets the user record things they've done, on a daily basis, like "brush teeth", "walk the dog", "go for a run", "meet with friends" etc, and then tracks the frequency of those and helps do certain things more or less often.
Now I want to add a feature that would suggest some cadence for each individual habit based on past data - e.g. "2 times a day", "once a week", "every Tuesday and Thursday", "once a month", etc.
My first thought here is to create some number of parametrized "templates" and then infer parameters and rank them via MLE, and suggest the top one(s).
Is this how that's commonly done? Is there a standard name for this, or even some standard method/implementation I could use?
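To make the template idea concrete, the minimal version I have in mind treats each template as a fixed daily Poisson rate and scores it by log-likelihood against the observed daily counts (the data below is fake; day-of-week templates like "every Tuesday and Thursday" would become per-weekday rates in the same framework):

```python
# Rank cadence "templates" by Poisson log-likelihood of the daily counts.
import numpy as np
from scipy.stats import poisson

daily_counts = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0])  # fake

templates = {  # events per day implied by each cadence
    "twice a day": 2.0,
    "once a day": 1.0,
    "twice a week": 2.0 / 7.0,
    "once a week": 1.0 / 7.0,
    "once a month": 1.0 / 30.0,
}

scores = {name: poisson.logpmf(daily_counts, rate).sum()
          for name, rate in templates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # "twice a week" wins on this fake data
```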
So, I'm a CS major, stats minor undergrad, and I've done a couple of certifications: AWS Cloud Practitioner and IBM Data Science. Honestly, I'm not sure if they added much value. In one interview, I mentioned my certifications right at the end, and they didn't even seem to notice.
From what I’ve seen, well-defined projects seem to carry more weight than a cert. Projects show real skills, while certs sometimes feel like just ticking a box.
What’s your take? Are there any certs you’ve done that actually helped you stand out, or do you think the focus should shift more toward solid project work?
Also, which one is more valuable or more worth it for Data Science/ML: AWS, Azure, GCP, or Databricks?
Hi all,
I have 5+ years of experience, and I'm based in Europe.
Lately I've been thinking of switching from full-time employee to contractor, freelancing and working for different companies at the same time.
I think that freelancing is harder for data scientists than for software developers. I imagine a front-end developer can easily get a project to build a website from scratch, or add functionality to an existing one. Data scientists, on the other hand, need existing data and infrastructure to do their job.
Everyone knows the market is bad right now for software engineers. Probably as bad as it's ever been. What is the consensus on the job market for data professionals right now?
Say you’ve selected the best classifier for a particular problem, using threshold invariant metrics such as AUROC, Brier score, or log loss.
It's now time to choose the classification threshold. This will clearly depend on the use case and the costs/benefits associated with true positives, false positives, etc.
Often I see people advise choosing a threshold by looking at metrics such as precision and recall.
What I don't see very often is people explicitly defining the relative (or absolute, if possible) costs/benefits of each cell in the confusion matrix (or, more precisely, of the action that will be taken as a result). For example, a true positive is worth $1000, a false positive -$500, and the other cells $0.
You then optimise the threshold for maximum benefit using a cost-threshold curve. Precision and recall can also be reported, but they are secondary to the benefit optimisation and are not used directly in the choice. I find this much more intuitive, and it's my go-to.
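For concreteness, the sweep I mean, using the example payoffs above (y_true and y_prob stand in for your labels and predicted probabilities):

```python
# Pick the threshold that maximises expected benefit, given per-cell payoffs.
import numpy as np

def expected_benefit(y_true, y_prob, threshold, v_tp=1000.0, v_fp=-500.0):
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return v_tp * tp + v_fp * fp  # TN and FN are worth $0 in this example

thresholds = np.linspace(0.01, 0.99, 99)
benefits = [expected_benefit(y_true, y_prob, t) for t in thresholds]
best_t = thresholds[int(np.argmax(benefits))]
print(f"best threshold: {best_t:.2f}, benefit: ${max(benefits):,.0f}")
```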
Does anyone else regularly use this approach? In what situations might this approach not make sense?