/r/datascience

A space for data science professionals to engage in discussions and debates on the subject of data science.

2,267,474 Subscribers

29

Everyone’s building new models but who is actually monitoring the old ones?

I’m currently in the process of searching for a new job and have started engaging with LinkedIn recruiters. While I haven’t spoken with many yet, the ones I have talked to seem to focus heavily on model development experience. My background, however, is in model monitoring and maintenance, where I’ve spent several years building tools that deliver real value to my team.

That said, these recent interactions have shaken my confidence, leaving me wondering if I’ve wasted the last few years in this role.

Do you think the demand for model monitoring roles will grow? I’m feeling a bit lost right now and would really appreciate any advice.

23 Comments
2024/10/26
20:44 UTC

30

Is it worth it to leave over paperwork?

I’m in a pretty cushy job with honestly not much work, but lots of internal angst since the only work is documentation and model maintenance.

I have job security for the next few years because of the messy paperwork, but I don’t feel like I’m learning anything on the job. I’ve been upskilling myself, but I was told to wait it out until the market is over because of all the layoffs.

I really like my team and I’ve been learning a lot from the lead, but I’m incredibly bored right now and am over the company. I feel like I’m staying out of a fear of layoffs in the current market/not getting another job.

Has anyone had this issue? If I have to be in another compliance meeting over Zoom in-office, I’ll scream inwardly from boredom.

31 Comments
2024/10/26
02:37 UTC

82

Senior DS laid off and trying to get out of product analytics. How can I pivot to a more quantitative position?

EDIT: I’m ignoring all messages and chat requests not directly related to my question. If you have a separate question about getting into industry, interview prep, etc., please post it in its own thread or in the appropriate master topic.

(I figured this is specific enough to warrant its own post instead of posting in the weekly Entering and Transition thread, as I already have a lot of industry experience.)

TL;DR: How can an unemployed, experienced analytics-focused data scientist get out of analytics and pivot to a more quantitative position?

I'm a data scientist with a Master's in Statistics and nine years of experience in a tech city. I've had the title Senior Data Scientist for two of them. I was laid off from my job of four years in June and have been dealing with what some would call a "first world problem" in the current market.

I get callbacks from many recruiters, but almost all of them are for analytics positions. This makes sense because (as I'll explain below) I've been repeatedly pushed into analytics roles at my past jobs. I have roughly 8 years of analytics experience, and was promoted to a senior position because I did well on a few analytics projects. My resume shows that most of my work is analytics, as most of my accomplishments are along the lines of "designed a big metric" or "was the main DS who drove X internal initiative". I've been blowing away every A/B testing interview and get feedback indicating that I clearly have a lot of experience in that area. I've also been told in performance reviews and in interview loops that I write very good code in Python, R, and SQL.

However, I don't like analytics. I don't like that it's almost all very basic A/B testing on product changes. More importantly, I've found that most companies have a terrible experimentation culture. When I prod in interviews, they often indicate that their A/B testing platform is underdeveloped to the point where many tests are analyzed offline, or that they only test things that are likely to be a certain win. They ignore network effects, don't use holdout groups or meta-analysis, and insist that tests designed to answer a very specific question should also be used to answer a ton of other things. It is - more often than not - Potemkin Data Science. I'm also frustrated because I have a graduate degree in statistics and enjoy heavily quantitative work a lot, but rarely get to do interesting quantitative work in product analytics.

Additionally, I have mild autism, so I would prefer to do something that requires less communication with stakeholders. While I'm aware that every job is going to require stakeholder communication to some degree, the amount of time that I spent politicking to convince stakeholders to do experimentation correctly led to a ton of stress.

I've been trying to find a job more focused on at least one of causal inference, explanatory statistical modeling, Bayesian statistics, and ML on tabular data (i.e. not LLMs, but like fraud prediction). I've never once gotten a callback for an ML Engineer position, which makes sense because I have minimal ML experience and don't have a CS degree. I've had a few HR calls for companies doing ML in areas like identity validation and fraud prediction, but the initial recruiting call is always followed up with "we're sorry, but we decided to go with someone with more ML experience."

My experience with the above areas is as follows. These were approaches that I tried but ended up having no impact, except for the first one, which I didn't get to finish. Additionally, note that I currently do not have experience working with traditional CS data structures and algorithms, but have worked with scipy sparse matrices and other DS-specific data structures:

  • Designed requirements for a regression ML model. Did a ton of internal research, then learned SparkSQL and wrote code to pull and extract the features. However, after this, I was told to design experiments for the model rather than writing the actual code to train it. Another data scientist on my team did the model training with people on another team that claimed ownership. My manager heavily implied this was due to upper management and had nothing to do with my skills.

  • Used a causal inference approach to match treatment group users to control group users for an experiment where we were expecting the two groups to be very different due to selection bias. However, the selection bias ended up being a non-issue.

  • Did clustering on time-dependent data in order to identify potential subgroups of users to target. Despite it taking about two days to do, I was criticized for not doing something simpler and less statistical. (Also, in hindsight, the results didn't replicate when I slightly changed the data.)

  • Discussed an internal fraud model with stakeholders. Recognized that a dead simple feature wasn't in it, learned a bit of the internal ML platform, and added it myself. The feature boosted recall at 99% precision by like 40%. However, even after my repeated prodding, the production model was never updated due to lack of engineering support and because the author of the proprietary ML framework quit.

  • During a particularly dead month, I spent time building a Bayesian model for an internal calculation in Stan. Unfortunately I wasn't able to get it to scale, and ran into major computational issues that - in hindsight - likely indicated an issue with the model formulation in the paper I tried to implement.

  • Rewrote a teammate's prototype recommendation model and built a front end explorer for it. In a nutshell, I took a bunch of spaghetti code and turned it into a maintainable Python library that used Scipy sparse matrices for calculations, which sped it up considerably. This model was never productionized because it was tested in prod and didn't do well.

At the time I was laid off I had about six months of expenses saved up, plus fairly generous severance and unemployment. I can go about another four months without running out of savings. How should I proceed to get one of these more technical positions? Some ideas I have:

  • List the above projects on my resume even though they failed. However, that's inevitably going to come up in an interview.

  • I could work on a personal project focused on Bayesian statistics or causal inference. However, I've noticed that the longer I'm unemployed, the fewer callbacks and LinkedIn messages I get, so I'm worried about being unemployed even longer.

  • Take an analytics job and wait for a more quantitative opening at a different company to occur. Someone fairly big in my city's DS community that knows I can handle more technical work said he'd refer me and probably be able to skip most of the interview process, but his company currently has no open DS positions and he said he doesn't know when more will open up.

  • Take a 3 or 6-month contract position focused on my interests from one of the random third party recruiters on LinkedIn. It'll probably suck, but give me experience I can use for a new job.

  • Drill Leetcode and try to get an entry-level software engineer position. However this would obviously be a huge downgrade in responsibility and pay, preparation would drain my savings, and there’s no guarantee I could pivot back to DS if it doesn’t work out.

Additionally, here's a summary of my work experience:

  • Company 1 (roughly 200 employees). First job out of grad school. I was there for a year and was laid off because there "wasn't a lot of DS work". I had a great manager who constantly advocated for me, but couldn't convince upper management to do anything beyond basic summary statistics. For example, he pitched a cluster analysis and they said it sounded hard.

  • Company 2 (roughly 200 employees). I was there for two years.

Shortly after joining I started an ML project, but was moved to analytics due to organizational priorities. Got a phenomenal performance review, asked if I could take on some ML work, and was given an unambiguous no. Did various analytics tasks (mostly dashboarding and making demos) and mini-projects on public data sources due to lack of internal data (long story). Spent a full year searching for a more modeling-focused position because a lot of the DS was smoke and mirrors and we weren't getting any new data. After that year, I quit and ended up at Company 3.

  • Company 3 (roughly 30000 employees). I was there for six years. I joined because my future manager (Manager #1) told me I'd get to pick my team and would get to do modeling. Instead, after I did a trial run on two teams over three months, I was told that a reorg meant I would no longer get to pick my team and ended up on a team that needed drastic help with experimentation. Although my manager (Manager #2) had some modeling work in mind for me, she eventually quit. Manager #3 repeatedly threw me to the wolves and had me constantly working on analyzing experiments for big initiatives while excluding me from planning said experiments, which led to obvious implementation issues. He also gave me no support when I tried to push back against unrealistic stakeholder demands, and insisted I work on projects that I didn't think would have long-term impact due to organizational factors. However, I gained a lot of experience with messy data. I told his skip during a 1:1 that I wanted to do more modeling, and he insisted I keep pushing him for those opportunities.

    Manager #3 drove me to transfer to another team, which was a much better experience. Manager #4 was the best manager I ever had and got me promoted, but also didn't help me find modeling opportunities. Manager #5 was generally great and found me a modeling project to work on after I explained that lack of modeling work was causing burnout. It was a great project at first, but he eventually pushed me to work only on the experimental aspects of that modeling project. I never got to do any actual modeling for this project even though I did all the preparation for it (e.g. feature extraction, gathering requirements), and another team took it over. Shortly after this project completed, I was laid off.

57 Comments
2024/10/25
21:19 UTC

10

Conducting a study: I have questions (and gift cards) for data scientists

I've been following the data science profession since 2015, back when many data scientists were still employed as data analysts or statisticians.

A lot changed since then, to say the least. What changed, exactly? That's what I'm trying to find out.

I'm doing a small study on what data scientists work on these days and how they approach their work. Especially interested in predictive modeling work, but not strictly.

If you're interested in sharing your point of view on a 60-minute zoom call, add your name here: https://forms.gle/W9q44JjpH1JerKFp6

I have a limited number of $100 Amazon gift cards to give as a small thanks. All conversations are private and will only inform my eventual analysis - no personal or sensitive information will make it into the study.

5 Comments
2024/10/25
04:07 UTC

8

Manim : python package for animation for maths

0 Comments
2024/10/25
03:41 UTC

179

Why Did Java Dominate Over Python in Enterprise Before the AI Boom?

Python was released in 1991, while Java and R both came out in 1995. Despite Python’s earlier launch and its reputation for being succinct & powerful, Java managed to gain significant traction in enterprise environments for many years until the recent AI boom reignited interest in Python for machine learning and AI applications.

  1. If Python is simple and powerful, then what factors contributed to Java’s dominance over Python in enterprise settings until recently?
  2. If Java has such a level of performance and scalability, then why are many now returning to Python, especially with the rise of AI and machine learning?

While Java is still widely used, the gap in popularity has narrowed significantly in the enterprise space, with many large enterprises now developing comprehensive packages in Python for a wide range of applications.

158 Comments
2024/10/24
18:44 UTC

48

How can I help low income students learn databricks?

I'm from South America and I'm a data teacher in a school that teaches technology skills to people from minority groups to help them get better jobs. It's a free course for the students; our income comes from sponsor companies that support our cause and have an interest in hiring some of our students. One of the skills they asked us to teach the students was Databricks. Long story short, we couldn't find someone to teach our students on the matter so I'm the only one left to help them. I'm not proficient with Databricks so I'm struggling to create something cohesive for them.

Any public databases I could use to gather data from? Even YouTube channels I could inspire myself on? It may sound weird but I haven't found anything updated on YT on how to start with databricks lol. Any ideas or tips would help. Thanks guys!

15 Comments
2024/10/24
17:18 UTC

1

Handling behavioral scores with mixed scales: best practices for encoding and ordering ranks

Problem Description:

I’m working on a data processing pipeline that involves handling behavioral survey data from multiple scales (e.g., Likert scales, frequency scores, and categorical data). My goal is to encode these mixed scales properly while maintaining the correct rank/order (for instance, ensuring that higher Likert scores indicate stronger agreement).

However, I’ve run into several issues with encoding and rank preservation.

Question:

What are some robust methods or best practices for:

  • Encoding behavioral scales with mixed types (e.g., Likert, categorical, frequency scores) while maintaining the order and rank.
  • Handling inconsistent answer sets across different surveys (e.g., 5-point vs. 7-point scales).
  • Dynamically encoding ordinal and categorical variables in a way that respects the natural order.
  • Dealing with missing values and inconsistent responses within the encoding process.

Tools I’m Using:

  • Python (Pandas, Scikit-learn)
  • Streamlit (for visualization and reporting)

Any suggestions for tools, workflows, or algorithms to dynamically and effectively encode behavioral data will be greatly appreciated. I’d also love to know if anyone has encountered similar challenges and found solutions that work across varied datasets. I'm relatively new to this data pipeline stuff. Thank you in advance!

Below are the approaches I’ve tried so far, but none have provided a robust, generalizable solution:

Hard-coding mappings for categories and ordinal features, like:

{'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Often': 3, 'Always': 4}

This became unmanageable across multiple datasets with slightly different answer sets (e.g., some surveys use 5-point scales, others use 7-point).

LightGBM encoding: I used LightGBM to encode categorical features dynamically. While it works well for feature importance, it didn’t seem to capture or maintain the ordinal nature of all scales.

Clustering methods to find patterns within responses – but this approach failed to respect the natural ordering of some ordinal scales.

One-hot encoding: This lost the rank structure entirely, making it unsuitable for certain analyses.

Ordinal encoding: I also tried OrdinalEncoder from sklearn, but it didn’t encode the columns properly (in some cases, the results didn’t align with the expected order or meaning).
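One likely cause of the OrdinalEncoder issue described above: when no explicit order is supplied, scikit-learn's `OrdinalEncoder` sorts string categories alphabetically, so "Always" ends up ranked below "Never". Passing the intended order through its `categories` parameter (or using an ordered pandas `Categorical`) avoids that. The sketch below shows the same idea in plain Python so the logic is visible. The answer sets, the function name `encode_ordinal`, and the choice to rescale ranks to [0, 1] so that 5-point and 7-point scales share a range are all assumptions for illustration, not something from the post.

```python
from typing import Optional, Sequence

# Hypothetical answer sets, listed from lowest to highest rank.
FREQ_5 = ["Never", "Rarely", "Sometimes", "Often", "Always"]
AGREE_7 = [
    "Strongly disagree", "Disagree", "Somewhat disagree", "Neutral",
    "Somewhat agree", "Agree", "Strongly agree",
]


def encode_ordinal(responses: Sequence[Optional[str]],
                   ordered_levels: Sequence[str]) -> list:
    """Map each response to its rank, rescaled to [0, 1].

    Missing (None) or unrecognised answers become None instead of
    silently receiving a rank.
    """
    # Case-insensitive lookup from label to position in the declared order.
    rank = {label.strip().lower(): i for i, label in enumerate(ordered_levels)}
    top = len(ordered_levels) - 1
    encoded = []
    for answer in responses:
        key = answer.strip().lower() if isinstance(answer, str) else None
        encoded.append(rank[key] / top if key in rank else None)
    return encoded


freq_scores = encode_ordinal(["Never", "often", None, "Always"], FREQ_5)
agree_scores = encode_ordinal(["Neutral", "Strongly agree", "N/A"], AGREE_7)
print(freq_scores)   # [0.0, 0.75, None, 1.0]
print(agree_scores)  # [0.5, 1.0, None]
```

Keeping one ordered list per scale, rather than one hard-coded dictionary per dataset, means a new survey only needs its answer set declared once. Whether a rescaled 5-point score is truly comparable to a 7-point one is a measurement question that the code does not settle.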

1 Comment
2024/10/24
17:17 UTC

9

AI infrastructure & data versioning

Hi all, this goes out especially to those of you who work in a mid-sized to large company and have implemented a proper ML Ops setup. How do you deal with versioning of large image datasets and similar unstructured data? Which tools are you using, if any, and what is the infrastructure behind it?

7 Comments
2024/10/24
17:10 UTC

13

Best practices for visualization of business org charts/social networks? Still just flow chart trees?

Has there been any innovation in org chart visualization? Specifically human readable and curiosity exploration?

Traditionally an organization chart is a pyramid shaped tree of lines and nodes with a name and job title of the boss and their subordinates.

And maybe hyperlinks that let you travel around different business units.

Very local with a small number of records displayed.

Zero proportional visualization of scale, such as number of client accounts or budget/revenue.

Zero cross-matrix geo location, like management layers and adjacent business units at that layer, structure, or region on the map.

Zero motion or animation.

Has there been any innovation in org chart visualization?

Ideal state in first person: "I can click a name, and see its information analogous to the dimensions of a Rand McNally road map. Different road sizes and population sizes have different symbology to denote relationship information and population size. Borders of different layers indicate context and edges. There may even be iconography for airports, parks, etc."

It seems like there is a VAST gap for org charts to just ape other visualization techniques. So I assume someone's doing it. Like a mid tier college professor could crack the case and publish a taxonomy/symbology/methodology. EDIT: To say nothing of LinkedIn, Facebook, or commercial entities.

0 Comments
2024/10/24
15:54 UTC

47

Math topics for DS and MLE interviews

What are the most important topics in probability, statistics, and linear algebra (add more fields if required, like information theory) that are required for DS and MLE interviews? There can be many topics, but I want the most important ones that one can't miss and that are common across such interviews. Asking as a working professional who needs to balance work and interview preparation.

Probability, statistics, linear algebra etc. are vast so I can't just cover everything for an interview. So, practically useful topics are what I am looking for. Watching lectures of Gilbert Strang for linear algebra can be a huge learning experience but I might optimise on time and effort by learning those topics which are expected in an interview and with depth according to the interview (I may not require to know these topics just as a PhD in math would need to).

28 Comments
2024/10/24
14:53 UTC

13

Noob Question: How do contractors typically build/deploy on customers network/machine?

Is it standard for contractors to use Docker or something similar? Or do they usually get access to their customers network?

9 Comments
2024/10/23
16:58 UTC

4

How to: Automate RFP responses using a local LLM

I need some help figuring out the overall design and tools for this project. I did some data engineering and ML work a few years ago. I have a client I do Excel and VBA work for, and I'm excited to work on this project but slightly out of my depth.

I need to build a system that allows a user to generate answers to an RFP using a local LLM. The company cannot use any cloud services.

Is this something I can build on my machine and then install on their network, or should I ask for access to their network while building it?

Will I be able to complete this project using only Python and SQL?

What tools, platforms, libraries, structures ... etc will I need to use/implement?

Is this a data pipeline or orchestrator?

What LLM should I use? I'm thinking Llama since it's open source, but do I need something so large? Should I use a small language model? Then, is this a case for fine-tuning or RAG?

Any highly relevant blog posts I can study?
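On the fine-tuning vs. RAG question above, the retrieval half of a RAG setup can be prototyped in plain Python before any model is chosen. The sketch below is only an illustration under stated assumptions: the `PAST_ANSWERS` library, the function names, and the prompt wording are hypothetical, and a simple bag-of-words cosine similarity stands in for the embedding model a real system would more likely use. The resulting prompt would then be passed to whatever locally hosted model is selected; that call is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical library of previously answered RFP questions.
PAST_ANSWERS = [
    ("Describe your data backup policy.",
     "Backups run nightly and are stored on an encrypted on-site server."),
    ("What support hours do you offer?",
     "Support is available on weekdays from 8am to 6pm local time."),
    ("How do you handle user access control?",
     "Access is role-based and reviewed every quarter."),
]


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, k: int = 1):
    """Return the k past (question, answer) pairs most similar to the new question."""
    q = tokenize(question)
    ranked = sorted(PAST_ANSWERS,
                    key=lambda qa: cosine(q, tokenize(qa[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Assemble the text that would be sent to a locally hosted model."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in retrieve(question))
    return (f"Use these past answers as reference:\n{context}\n\n"
            f"New question: {question}\nAnswer:")


prompt = build_prompt("What is your backup policy for customer data?")
print(prompt)
```

Everything here runs offline with Python alone, which fits the no-cloud constraint. The store of past answers could just as well live in a SQL table.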

10 Comments
2024/10/23
15:44 UTC

28

Data Science at Deloitte

36 Comments
2024/10/23
03:27 UTC

18

Is Plotly bad for mobile devices? If so, is there another library I should be using for charts for my website?

Hey everyone, am creating a fun little website with a bunch of interactive graphs for people to gawk at

I used plotly because that's what I'm familiar with. Specifically I used the export to HTML feature to save the chart as HTML every time I get new data and then stick it into my webpage

This is working fine on desktop and I think the plots look really snazzy. But it looks pretty horrific on mobile websites

My question is, can I fix this with plotly or is it simply not built for this sort of work task? If so, is there a Python viz library that's better suited for showing graphs to 'regular people' that's also mobile friendly? Or should I just suck it up and finally learn Javascript lol

19 Comments
2024/10/23
02:54 UTC

260

How do I tell someone that there is nothing new under the sun?

I have been working with a guy and he has some data that he asked me to analyze. His sole interest is in uncovering interesting insights that sound punchy. Something that goes against the general common sense understanding. The data is about three different aspects of a business and their interaction. After joining the three datasets, it comes down to some 2000 rows of aggregated customer data. Not all customer transactions are recorded. The guy keeps using the word 'outcome' every time we talk and doesn't give any value to work that doesn't look punchy or just tells more about the status of the business. I have approached the data in every way possible, there is nothing special about the data. How do I tell him that what he is looking for isn't there? and that the data isn't very good to create good prediction models. I don't want to bend and stretch the data to make it cough up something flashy, I am not comfortable doing that.

Ps, if I am being wrong here, please feel free to enlighten me.

Edit: grammar

68 Comments
2024/10/23
01:02 UTC

0

Is www.mentoring-club.com legit?

I'm looking to do a career pivot and was looking for people in my pivot career to talk with. I just came across this website and wondered if anyone has tried it. Is it legit? https://mentoring-club.com/

6 Comments
2024/10/22
18:31 UTC

107

I'm doing Data Architect work, but my title is Data Analyst. Should I ask for a change of title if I'm happy with my current pay?

A year ago, I interviewed for a Data Engineer position and was hired as Data Analyst III. I asked my then manager why I was hired as an analyst and not as an engineer, and she said it was solely to meet my salary expectation.

She left the company, and now I'm in charge of a data modernization project, in which I designed, architected, and implemented a modern data warehousing solution using Snowflake and Airflow. I'll be in charge of re-creating the whole data ingestion pipeline, which the company has been struggling with for a long while, and many of the ETLs that will be created with the new architecture.

I don't mind my current pay ($140K in Las Vegas, USA) but I feel weird about having the Data Analyst title while doing Data Architect/Engineer work. Should I ask for a change in title? The median salary of a data architect and data engineer in Las Vegas is $101K and $113K, respectively, so I don't think I'm compensated unfairly.

53 Comments
2024/10/22
18:18 UTC

13

We built a multi-cloud GPU container runtime

Wanted to share our open source container runtime -- it's designed for running GPU workloads across clouds.

https://github.com/beam-cloud/beta9

Unlike Kubernetes which is primarily designed for running one cluster in one cloud, Beta9 is designed for running workloads on many clusters in many different clouds. Want to run GPU workloads between AWS, GCP, and a 4090 rig in your home? Just run a simple shell script on each VM to connect it to a centralized control plane, and you’re ready to run workloads between all three environments.

It also handles distributed storage, so files, model weights, and container images are all cached on VMs close to your users to minimize latency.

We’ve been building ML infrastructure for awhile, but recently decided to launch this as an open source project. If you have any thoughts or feedback, I’d be grateful to hear what you think 🙏

4 Comments
2024/10/22
16:46 UTC

10

Stable Diffusion 3.5 is out !

Stable Diffusion 3.5 is released in 2 versions, large and large-turbo (open-sourced), and can be accessed for free on HuggingFace. Honestly, the image quality is alright (I feel flux is still better). You can check the demo here : https://youtu.be/3hFAJie6Ttc

0 Comments
2024/10/22
16:03 UTC

109

what's your biggest pet peeve about this job?

Mine is ambiguous language from stakeholders. I get that people who don't have a background in data might not know the proper technical terms for certain concepts, but surely they can articulate what they want me to do better than "oh just wrangle it up" or "I just want an apples to apples comparison". Use examples and analogies, and be as specific as you possibly can be.

Edit also scope creep. Y'all probably saw my rant about it yesterday LMAO

What's yours?

Also, if this thread is popular, I know I'm gonna get a bunch of people hijacking it to ask for advice on getting into the field. See my comment here: https://www.reddit.com/r/datascience/comments/1e951vk/comment/lfcvrof/ Please don't ask me how to get into this field unless you've read this comment and have a question on something that I specifically didn't address in it.

92 Comments
2024/10/22
15:55 UTC

23

Large Scale Geoscience Benchmarks

Last month my colleagues and I asked the Python geo community for terabyte-scale geo workloads to form a benchmark suite for tools like Xarray, Zarr, Dask, etc. That call is here:

Large Scale Geospatial Benchmarks: Solicitation

We got a good response. Thanks everyone! Since then we've built out these into a public test suite. This post goes over what's implemented and early results

Large Scale Geospatial Benchmarks: First Pass

3 Comments
2024/10/22
15:13 UTC

35

is there a book that can help me figure out which ML algorithm fits what problem ?

I am on my path to building my graduation project, and as I am learning and figuring my way through, I can't help but realize that I can't match the problems I face with the algorithms I studied

I need a book that explains the use of Machine learning algorithms through real problems, not just from the coding-math perspective

if any of you can recommend me such a book I will be thankful

42 Comments
2024/10/22
14:21 UTC

3

deleted data in corrupted/ repaired excel files?

My team has an R script that deletes an .xlsx file and writes into it again (they want to keep some color formatting). This file sometimes gets corrupted and repaired, and I am concerned that some data may get lost. How do I find that out? The .xml files I get from the repair are complicated.

for now I write the R table as a .csv and a .xlsx and copy the .xlsx in the csv to do the comparison between columns manually. Is there a better way? thanks
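The manual column comparison described above could be automated with a cell-by-cell diff. Below is a minimal sketch in Python (chosen only so the example needs no packages; the same idea carries over to R). It assumes the repaired workbook has been re-exported to CSV so both sides are plain text; the sample contents and function names are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV into a list of rows, stripping stray whitespace from each cell."""
    return [[cell.strip() for cell in row] for row in csv.reader(handle)]


def diff_tables(reference, candidate):
    """Return (row, column, reference_value, candidate_value) for every mismatch.

    Rows and columns are 1-based; a cell missing from one side is reported as None.
    """
    mismatches = []
    for r in range(max(len(reference), len(candidate))):
        ref_row = reference[r] if r < len(reference) else []
        cand_row = candidate[r] if r < len(candidate) else []
        for c in range(max(len(ref_row), len(cand_row))):
            ref = ref_row[c] if c < len(ref_row) else None
            cand = cand_row[c] if c < len(cand_row) else None
            if ref != cand:
                mismatches.append((r + 1, c + 1, ref, cand))
    return mismatches


# Hypothetical contents: the CSV written by the script, and a CSV re-exported
# from the repaired workbook. With real files, pass open(path, newline="").
written = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,30.25\n")
repaired = io.StringIO("id,amount\n1,10.5\n2,\n")

differences = diff_tables(read_rows(written), read_rows(repaired))
for row, col, ref, cand in differences:
    print(f"row {row}, column {col}: expected {ref!r}, found {cand!r}")
```

Cells are compared as text, so a value that Excel reformats (for example 20.0 saved back as 20) would show up as a mismatch. Normalising numbers before comparing may be needed in practice.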

6 Comments
2024/10/22
12:12 UTC

2

OpenAI Swarm : Ecom Multi AI Agent system demo using triage agent

0 Comments
2024/10/22
10:52 UTC

268

Confessions of an R engineer

I left my first corporate home of seven years just over three months ago and so far, this job market has been less than ideal. My experience is something of a quagmire. I had been working in fintech for seven years within the realm of data science. I cut my teeth on R. I managed a decision engine in R and refactored it in an OOP style. It was a thing of beauty (still runs today, but they're finally refactoring it to Python). I've managed small data teams of analysts, engineers, and scientists. I, along with said teams, have built bespoke ETL pipelines and data models without any enterprise tooling. Took it one step away from making a deployable package with configurations.

Despite all of that, I cannot find a company willing to take me in. I admit that part of it is lack of the enterprise tooling. I recently became intermediate with Python, Databricks, Pyspark, dbt, and Airflow. Another area I lack in (and in my eyes it's critical) is machine learning. I know how to use and integrate models, but not build them. I'm going back to school for stats and calc to shore that up.

I've applied to over 500 positions up and down the ladder and across industries with no luck. I'm just not sure what to do. I hear some folks tell me it'll get better after the new year. I'm not so sure. I didn't want to put this out on my LinkedIn as it wouldn't look good to prospective new corporate homes in my mind. Any advice or shared experiences would be appreciated.

128 Comments
2024/10/21
20:45 UTC

4

How does your team structure DS files?

Currently we have a workspace for dev/test/prod. Then individual repos for each business unit (as well as a shared), and then it's a total crapshoot. How does your team structure project files?

7 Comments
2024/10/21
19:29 UTC

3

Certification or Portfolio Projects

Hi there.

My certification in DataCamp is about to expire and I don't know if I should re-certify or use my time to create more personal/collaborative projects in my portfolio.
I'm searching for a job in UK right now (if this is relevant).

I don't know if I have the time to do both at the same time.

Opinions?

31 Comments
2024/10/21
18:41 UTC

3

Flux.1 Dev can now be used with Google Colab (free tier) for image generation

Flux.1 Dev is one of the best models for text-to-image generation but is very large. HuggingFace today released an update for Diffusers and BitsandBytes that enables running a quantized version of Flux.1 Dev on a Google Colab T4 GPU (free). Check the demo here : https://youtu.be/-LIGvvYn398

3 Comments
2024/10/21
13:59 UTC