/r/datascience
A space for data science professionals to engage in discussions and debates on the subject of data science.
I'm thinking about a one-click solution for my non-coder team. We have one PC where they execute the code (a Shiny app). I can run it from the command line, but the .bat file didn't work: we would need admin privileges for every execution. So I'm considering building them a standalone R app (.exe), or a Plumber API. Which one is the better choice?
I recently graduated with an MSc in Artificial Intelligence in the UK and am currently looking for job opportunities. However, I often feel unsure about whether I’m approaching the job search process effectively. The journey can feel overwhelming and confusing at times, and I wonder if I’m targeting and applying for roles in the right way.
I am specifically targeting roles as a Machine Learning Engineer or Data Scientist. Could you share any proven strategies for job searching in the UK, particularly for these fields? Additionally, I’d like to know which months are crucial for job applications and when companies are most likely to hire graduates.
Our company is quite small and we don't have a robust experimentation platform. Campaign measurement tasks are scattered all around the business with no unified set of standards. Six different data scientists will bring you six different numbers for a lift measurement because nobody has a set way of doing things.
A few of us are thinking of building out an experimentation platform to be a one-stop shop for all things measurement. For those of you at places with a mature experimentation culture, what kinds of things should we consider? I'm a data scientist who's never worked closely with engineers, but taking on this project is going to force me to do that, so I want to learn more about experimentation platform setup from that side as well. What has worked for you, and what would you recommend in building an experimentation platform?
Hey all,
I’m in Germany and was let go at the end of my probation period.
I was assured I would make it, and I demonstrably made money for the company.
My stated reasons for termination were unclear and actually not in line with my responsibilities as a data scientist.
Essentially, I was given peace of mind and assured I needn't worry.
Whatever it may be, I’m now out of a job. That’s the way it goes sometimes.
What are your tips for grabbing that next position fast? I’m not picky, I just want a job in my field, and with a team I enjoy - easier said than done.
Any tips would be amazing!
Happy holidays :)
I'm dealing with a clustering-over-time issue. Our company is a sort of PayPal. We are trying to implement an anti-fraud process that triggers alerts when a client makes excessive payments compared to their historical behavior. To do so, I've come up with seven clustering features, all 365-day moving averages of different KPIs (payment frequency, payment amount, etc.). So it goes without saying that, from one day to the next, these indicators evolve very slowly.

I have about 15k clients and several years of data. I get rid of outliers (above the 99th percentile on each date, basically) and put them in a cluster 0 by default. Then, the idea is, for each date, to come up with 8 clusters. I've used Gaussian Mixture Model (GMM) clustering but, weirdly enough, my clients' cluster assignments vary wildly from one day to the next. I have tried seeding each day's clustering with the previous day's centroids, but the results still vary a lot. I've read a bit about DynamicC and it seemed like the way to address the issue, but it doesn't help.
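For what it's worth, here is a minimal sketch of the warm-start idea on synthetic data (all shapes and feature values are made up, not the poster's actual pipeline): fit day t's GMM initialized from day t-1's weights, means, and precisions, then relabel components by nearest previous centroid so cluster IDs remain comparable across days.

```python
# Hedged sketch on synthetic data: warm-start each day's GMM with the
# previous day's fitted parameters, then align component labels.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_prev = rng.normal(size=(500, 7))                       # 7 moving-average features
X_next = X_prev + rng.normal(scale=0.01, size=(500, 7))  # slow day-to-day drift

gmm_prev = GaussianMixture(n_components=8, random_state=0).fit(X_prev)

# Seed today's fit with yesterday's parameters instead of a random init.
gmm_next = GaussianMixture(
    n_components=8,
    weights_init=gmm_prev.weights_,
    means_init=gmm_prev.means_,
    precisions_init=np.linalg.inv(gmm_prev.covariances_),
    random_state=0,
).fit(X_next)

# Map each new component to its nearest previous centroid (greedy; a
# Hungarian assignment would give a proper one-to-one matching).
mapping = cdist(gmm_next.means_, gmm_prev.means_).argmin(axis=1)
labels_next = mapping[gmm_next.predict(X_next)]
```

Even with warm starts, EM can reshuffle mass between components, so the relabeling step matters as much as the initialization.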
Hello everyone! 👋
In my work as a data scientist, I've often found it challenging to compare models and track them over time. This led me to contribute to a recent open-source library called Skore, an initiative led by Probabl, a startup whose team comprises many of the core scikit-learn maintainers.
Our goal is to help data scientists use scikit-learn more effectively, provide the necessary tooling to track metrics and models, and visualize them effectively. Right now, it mostly includes support for model validation. We plan to extend the features to more phases of the ML workflow, such as model analysis and selection.
I'm curious: how do you currently manage your workflow? More specifically, how do you track the evolution of metrics? Have you found something that works well, or is something missing?
If you’ve faced challenges like these, check out the repo on GitHub and give it a try. Also, please star our repo ⭐️ it really helps!
Looking forward to hearing your experiences and ideas—thanks for reading!
I’m considering getting a master’s and would love to know what type of opportunities it would open up. I’ve been in the workforce for 12 years, including 5-7 years in growth marketing.
Somewhere along the line, growth marketing became analyzing growth marketing and being the data/marketing-tech guy at a Series C company. I did the bootcamp thing. And now I'm a senior data analyst at a Fortune 100 company. So: I successfully went from marketing to analytics, but not data science.
I’m an expert in SQL, know tableau in and out, okay at Python, solid business presentation skills, and occasionally shoehorn a predictive model into a project. But yeah, it’s analytics.
But I’d like to work on harder, more interesting problems and, frankly, make more money as an IC.
The master's would go in depth on a lot of data science topics (multivariable regression, NLP, time series) and I could take computer science classes as well. Possibly more in depth than I need.
Anyway, thoughts on what could arise from this?
I can't tell you the number of times I've been asked "what random number seed should I use for my model" and later discover that the questioner has grid searched it like a hyperparameter.
Or worse: grid searched the seed for the train/test split or CV folds that "gives the best result".
At best, the results are fragile and optimistically biased. At worst, they know what they're doing and it's intentional fraud. Especially when the project has real stakes/stakeholders.
I was chatting to a colleague about this last week and shared a few examples of "random seed hacking" and related ideas of test-set pruning, p-hacking, leader board hacking, train/test split ratio gaming, and so on.
He said I should write a tutorial or something, e.g. to educate managers/stakeholders/reviewers, etc.
I put a few examples in a GitHub repository (I called it "Machine Learning Mischief", because it feels naughty/playful), but now I'm thinking it reads more like a how-to-cheat instruction guide for students rather than a how-to-spot-garbage-results guide for teachers/managers/etc.
What's the right answer here?
Do I delete (make private) the repo or push it for wider consideration (e.g. expand as a handbook on how to spot rubbish ml/ds results)? Or perhaps no one cares because it's common knowledge and super obvious?
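For context, the honest counterpart to seed hacking is easy to show in a few lines: report the score distribution across seeds rather than the maximum. This is an illustrative sketch with a placeholder dataset and model, not anything from the repo mentioned above.

```python
# Instead of searching for the seed that "gives the best result",
# report mean and spread of scores across seeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

scores = np.array([
    cross_val_score(
        RandomForestClassifier(n_estimators=30, random_state=seed), X, y, cv=5
    ).mean()
    for seed in range(8)
])

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# The cherry-picked "best seed" number is optimistically biased:
print(f"best-seed accuracy: {scores.max():.3f}")
```

If the spread across seeds is large, that itself is the finding: the model is unstable, and no single seed's number should be reported.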
Hey guys, I really need help. I love statistics but I don't know what the standard deviation is. I know I could probably Google it, ask ChatGPT, or open a basic book, but I was hoping someone here could spoon-feed me a series of statistics videos that are entertaining like Cocomelon or Bluey, something I can relate to.
Also, I don't really understand the mean and how it is different from the average, and I'm nervous because I am in my first year of my master's in data science.
Thanks guys 🙏
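For anyone skimming: the arithmetic mean is the most common kind of "average" (the median and mode are averages too), and the standard deviation measures spread around that mean. A tiny worked example with made-up numbers:

```python
# mean = 40 / 8 = 5; squared deviations sum to 32, 32 / 8 = 4, sqrt(4) = 2
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(data))    # 5
print(statistics.pstdev(data))  # 2.0 -- population standard deviation
```

(`pstdev` divides by n; `stdev` divides by n - 1 for a sample estimate.)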
I have an interview coming up at a FAANG company where the focus will be on stats, ML, and modeling with Python. I'm expecting that I need to know pandas front to back and the basics of Python (LeetCode Easy).
For those who have gone through interviews like this, what was the structure and what types of questions do they usually ask in a live coding round for DS? What is the best way to prepare? What are we expected to know besides the fundamentals of Python and stats?
Hello, I wanted to share this in case anyone is interested in learning a little bit about a company I saw at a conference a few months ago.
https://medium.com/@luismarcelobp/a-new-approach-to-training-with-perforated-ai-339e29cabd54
I do remember reading a long time ago that in production lines with high error rates, the motivation of labourers went down and the stress affected the workforce.
I wonder if dirty data can have the same effect and whether it has been researched as such. I know there are studies into error rates in software, but that mixes software with data.
I wonder specifically whether the stress caused by the unpredictable amount of work and the constant pressure dirty data creates has been studied as a health concern/risk.
Thanks.
Y.
edit: added the source Unraveling Software Engineering Failures: Reasons and Fixes https://growprogramming.com/engineering-excellence-unraveling-the-reasons-behind-failures-in-software-engineering/
Does anyone have a list of companies that use the CodeSignal data science assessment?
Let's list the companies that did CodeSignal interviews so we can compile a list.
I am working with non-developers. I want them to enter parameters in an R Markdown document, execute a script, and then see a message at the end of the execution (OK or KO) in the knitted HTML (they'll run it from the command line). I set error=TRUE in the R Markdown options so we'll always get the rendered document. If I want to report whether the execution was OK or KO, do I have to detect whether there was at least one warning or error in my script? How do I do that?
Hello!
I’ll try to keep this short because I’m terrible at being concise.
I came from a different world—operations and sales. It didn’t take long for me to realize that I wanted to move away from... well, salespeople. I applied for a dev job at my company and got rejected, but they saw potential in my knowledge and experience with machine learning, deep learning, and some other rogue projects I had been working on.
They asked if I could develop a proof of concept (POC) to present to our board of directors. The company had previously attempted to work with three external teams, but none of those efforts were successful. I presented the POC, and it went exceptionally well. We secured funding and created a junior data science position specifically for me. Previously, the company had no such role or anything similar. While the IT team is very strong, they haven’t had the capacity to handle initiatives like this.
Since then, I’ve been obsessed—reading everything I can and taking stats classes for a certificate program at MIT (with plans to continue my education). I’m pretty sure I’ve been driving my wife and friends crazy because I love talking about this stuff. I’m genuinely passionate about it!
That said, I still have so much to learn and need to overcome my imposter syndrome. On top of this fast-moving environment, I’ve never worked in IT before, never used Jira, or been involved in their overall processes, so I’m navigating that learning curve too. I’d love to connect with others here, hear your stories, and get more involved in this r/datascience community!
I'm working on a predictive model in the healthcare field for a relatively rare medical condition, about 5,000 cases in a dataset of 750,000 records, with 660 predictive features.
Given how imbalanced the outcome is, and the large number of variables, I was planning on doing a simple 50/50 train/test data split instead of 5 or 10-fold CV in order to compare the performance of different machine learning models.
Is that the best plan or are there better approaches? Thanks
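A commonly recommended alternative to a single 50/50 holdout here is stratified k-fold CV, which keeps the ~0.7% positive rate in every fold and uses all the data for evaluation. A hedged sketch on synthetic data at roughly that class balance (the model and metric are placeholders, not recommendations for the poster's clinical problem):

```python
# Stratified 5-fold CV on an imbalanced synthetic outcome (~0.7% positives).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=20_000, n_features=30, weights=[0.993], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv,
    scoring="average_precision",  # rank-based metric suited to rare outcomes
)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

With ~5,000 cases, each of 5 folds still holds around 1,000 positives, so fold-level estimates stay reasonably stable while giving a variance estimate a single split cannot.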
As a quick summary, I work as a Site Reliability Engineer and get paid pretty well (especially since I live in rural South Carolina and am fully remote). I juggle tasks like automating deployments, managing Kubernetes clusters in AWS, scripting in Python and Bash, managing and analyzing SQL databases, working with APIs, etc.
What I like
I get paid well and have a skill set that makes it harder for companies to replace me
I need to learn and stay up to date on a variety of technologies (I consider this a plus since you're never really 'out of date' in your role)
I enjoy making graphs and gathering statistics/data to help our team
I enjoy interpreting that data to determine the root cause of an issue
In terms of scripting, I like making quick-and-dirty scripts that help my team automate something (this doesn't include writing large, complicated scripts for other teams)
Why I hate it and want to leave
The job, by its very nature, means everything is always urgent
On call, so a consistent 9-5 is not possible. You're often staying past your shift
Have to constantly work with devs and other parties to ensure their services or code gets fixed
Rarely any slow days, you're either automating a new large project or jumping on an urgent issue
So based on the above, I'm curious whether transitioning to a data science type role would offer a more laid-back environment; the trouble is I don't know which role. Has anyone made this switch or have insights? If not, can you recommend some jobs I could look into? Preferably jobs that would use at least some of what I know.
Has anyone here done work on forecasting grouped time series? I checked out the Hyndman book but am looking for papers or other more technical resources to guide methodology. I'm curious how you decided between the top-down and bottom-up approaches to reconciliation. I was originally building out a hierarchical model in Stan, but I'm wondering what others use in terms of software as well.
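For readers unfamiliar with the reconciliation terminology: bottom-up reconciliation forecasts the leaf series and sums them with a summing matrix S so all levels of the hierarchy are coherent. A toy sketch with invented numbers (two stores under one total):

```python
# Bottom-up reconciliation: leaves are forecast independently, then S maps
# them to the full hierarchy [total, store_a, store_b].
import numpy as np
import pandas as pd

# Hypothetical bottom-level forecasts for three future periods
bottom = pd.DataFrame({
    "store_a": [10.0, 12.0, 11.0],
    "store_b": [5.0, 6.0, 7.0],
})

S = np.array([[1, 1],   # total = store_a + store_b
              [1, 0],   # store_a
              [0, 1]])  # store_b

coherent = bottom.values @ S.T  # columns: total, store_a, store_b
print(coherent[:, 0])           # totals: [15. 18. 18.]
```

Top-down instead forecasts the total and disaggregates by historical proportions; the trade-off is usually bias (top-down can't capture leaf-level dynamics) versus noise (bottom-up leaf forecasts can be very volatile).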
Major data science vs Quantitative Finance
Hi, I am currently studying the bachelor Econometrics in The Netherlands and next year I will need to choose a master to pursue. My main doubt is, as you can see from the title, between data science (which is a bit outside my bachelor) and quantitative finance.
On the one hand I may be a bit more interested in data science, but on the other hand I have the feeling that I will ‘throw away’ my Econometrics bachelor that is quite unique. From my point of view data science is followed by many people, also people from lower wage countries, while quantitative finance is a master that not many people follow.
That’s why I’m curious what other people think about this, will I be going the wrong path if I choose data science which is pursued by many students overall, should I stick to the specific field of quantitative finance or will it not matter?
I’ve been programming for data analysis for about 5 years, but I’ve never found an easy way to handle this.
With my old beat-up Dell Latitude, anything over ~100,000 rows of a sparse df tends to throw the dreaded MemoryError, specifically with functions like get_dummies, indexing, merging, etc.
My questions are:
TIA!
Edit: it seems like most parallelizing options don't store the df in memory, and so can't be used for visualization. That's my main use case. So… 4. Anyone know of any visualization tools that work with large data? Currently using Plotly/Dash.
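On the get_dummies point specifically: one thing worth trying before new hardware is keeping the one-hot output sparse, so pandas stores only the nonzero entries instead of materializing a dense block. A sketch with an invented column and synthetic data:

```python
# Compare memory of dense vs. sparse one-hot encoding of a categorical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"merchant": rng.integers(0, 500, size=10_000).astype(str)})

dense = pd.get_dummies(df["merchant"])
sparse = pd.get_dummies(df["merchant"], sparse=True)

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)  # sparse is far smaller here
```

The win grows with cardinality: each row has exactly one nonzero per encoded column group, so dense memory scales with rows × categories while sparse scales with rows only. For plotting, the usual workaround is to aggregate or sample down to what the figure can actually show before handing data to Plotly.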
I want to have a good discussion on this topic, since no one is talking about it outside the context of a CEO making decisions. As a lot of us know, company decisions and strategy are driven by the suits (the board) and the higher-ups much of the time, and that strategy trickles down to the analysts and other groups, who form projects to support the strategic initiative. I think not talking about this from a data science perspective is an ethics failure, because we as practitioners can make the decision not to engage in or pursue a project, rather than hiding behind "I have a boss and they told me I need to because it aligns with our strategy." I personally have quit a job in the past because the ethics of the CV models we were creating dawned on me and didn't sit right. Sure, I could rationalize it by saying I was only creating a small part of the software system, but the reality is I knew the end goal and was actively participating in the development of a system that could be used for an ethically questionable purpose.
The possibility of UHC's actuaries, analysts, and data scientists developing models that contribute to a strategy of increased profits and increased denials should be questioned. And I know denial rates aren't apples to apples, as back-office revenue cycle management staff could wrongly code a claim, which can cause it to be denied. I'm talking more from a targeted perspective. Actuaries who work in insurance are very smart, but I want to get some insight into the specifics of what goes on from a health insurance perspective when a claim is denied.
I would love to hear perspectives from both sides, especially those who may have worked in the industry.
I am someone who is trying to learn how to deploy machine learning models in real time. As of now, the pain point is that my team uses PMML files and Java code to deploy models in production. The problem is that the team develops the code in Python and then rewrites it in Java. I think it's a lot of extra work and can get out of hand very quickly.
My proposal is to try to make a docker container and then try to figure out how to deploy the scoring model with the python code for feature engineering.
We do have a java application that actually decisions on the models and want our solutions to be fast.
Where can I learn more about how to deploy this, and what format do I need to deploy my models in? I heard that JSON is better for security reasons, but I am not sure how flexible it is, as PMML is pretty hard to work with when it comes to converting from a Python pickle to PMML for very niche modules/custom transformers.
If someone can help explain the exact workflow, that would be very helpful. This is all going to run on AWS at the end, with the Java application making the decisions.
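One pattern that avoids the Java rewrite entirely (sketched here with an invented toy model, not the poster's actual system): bundle the feature engineering and the model into a single scikit-learn Pipeline and serialize that one object, then serve it from a small Python container that the Java application calls over HTTP.

```python
# One artifact = preprocessing + model, so serving runs the exact same
# transformations as training.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipeline = Pipeline([
    ("features", StandardScaler()),   # stand-in for real feature engineering
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

blob = pickle.dumps(pipeline)         # the single artifact shipped in the image
restored = pickle.loads(blob)
scores = restored.predict_proba(X[:5])[:, 1]
print(scores.round(3))
```

The container would load this artifact at startup and expose a scoring endpoint (e.g. via Flask or FastAPI); the endpoint details are omitted here since they depend on your stack. One caveat worth knowing: pickle is not a safe format to load from untrusted sources, which is likely what the "JSON is better for security" comment was getting at, so only load artifacts your own pipeline produced.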
I’m an undergrad trying to break into Data Science/ML roles, and I’m not sure if spending time on LeetCode or HackerRank is really worth it. A lot of the problems feel more geared toward software dev interviews, and I’m wondering if that’s the best use of time for DS/ML jobs.
Wouldn’t working on projects or learning tools like TensorFlow or PyTorch be more valuable? Has anyone here actually benefited from doing LeetCode/HackerRank for DS/ML roles, or is it overhyped for this field?
Hi team!
As I have no experience with AI and predictive models for traffic management, I'm not sure how to simulate current traffic conditions in an urban city (or a portion of it) without vs. with an implementation of IoT and AI.
Any good resources or advice?
Also, if anyone with first hand experience is interested, I would love to have a quick interview discussion, 15-20mins max, for qualitative analysis :)
Hello y'all. My expertise is split between DS and full-stack dev, but usually it's been one or the other.
What would your ideas be on how I can leverage my webdev skills to collaborate with other DSs in my team?
Context is supply chain, and there's some reasonable freedom to initiate projects
At work I’m developing models to estimate customer lifetime value for a subscription or one-off product. It actually works pretty well. Now, I have found plenty of information on the modeling itself, but not much on how businesses apply these insights.
The models essentially say, "If nothing changes, here's what your customers are worth." I'd love to find examples or resources showing how companies actually use LTV predictions in production and how they turn the results into actionable value. Do you target different deciles of LTV with different campaigns? Do you just use it for analytics purposes?
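The decile targeting mentioned above is mechanically simple; here is a toy sketch with random placeholder LTVs (not real customer data) showing the bucketing step that campaign targeting would then key off:

```python
# Bucket customers into predicted-LTV deciles for differentiated campaigns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "customer_id": range(1_000),
    "predicted_ltv": rng.gamma(shape=2.0, scale=50.0, size=1_000),
})

customers["ltv_decile"] = pd.qcut(
    customers["predicted_ltv"], q=10, labels=range(1, 11)
).astype(int)

# e.g. retention offers for decile 10, cheap win-back tests for decile 1
top_decile = customers[customers["ltv_decile"] == 10]
print(len(top_decile))
```

Typical uses built on this include setting acquisition bid caps per decile, routing high-decile churn risks to retention teams, and capping discount spend on low-decile segments.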
Plenty of tools are popping up on a regular basis. How do you keep up with them? Do you test them all the time? Do you have a specific team, person, or part of your time dedicated to this? Do you listen to podcasts or watch specific YouTube channels?
I am preparing a script for my team (Shiny or R Markdown) where they have to enter some parameters and then execute it (maybe with the execution steps shown). I don't want them to open R or access the script itself.
thanks