/r/datascience

A space for data science professionals to engage in discussions and debates on the subject of data science.

1,603,307 Subscribers

0

Found this on LinkedIn. Is this legit, or some elaborate scam/data farming I'm unaware of?

3 Comments
2024/05/05
04:24 UTC

17

Moving to eBay as a Data Science Analyst?

Hey all, firstly, I don't want to sound disingenuous so I really hope this doesn't come off that way.

I have a pretty non-traditional path to Data Science, I did a Bachelor of Commerce, and through a rotation program at a big Canadian bank got into a Data Science team, that was supportive and took me on despite me lacking technicals.

I've been in the position now for 2 years, mostly working with NLP and unstructured data. Doing usual Power BI dashboard, KPI antics, all the way up to using transformer embeddings for email classification models. It has been a cool role, sadly plagued with bad management, and all at a slow bank.

In recent times, the bank has gone through a reorg, and we (the data and analytics team) do not have the support from senior management, nor the funding, that we had even a year ago. Layoffs are a small possibility, but I already feel like we have been brushed to the side, with few expectations and no net-new projects.

Furthermore, my boss who had hired me might also be leaving, meaning I would be stuck with completely non-technical management, and I would learn nothing.

Perhaps the one good thing is the pay, but that is making it hard for me to find a new role. My current pay is around $95K CAD plus, last year, a 15% bonus. With new leadership that doesn't support us, though, I doubt we will get a bonus as fat as that anymore.

I have been interviewing with eBay for a Data Science Analyst position, working on their item buying page, and got the offer yesterday. I would report to a team in SF, but would be based in Canada. The role would be less model-building, a lot more A/B testing, from what I gathered.

Pros of eBay:

  • Data Science at Tech company gives validity to my otherwise non-technical resume (academically, at least)
  • 1 week in office and flexibility to work from abroad (according to HR); so I can travel home internationally and not burn PTO, compared to 3 in office and no flexibility currently
  • Hopefully newer tech stack than the bank (fr we don't even have a SQL server set up here lmfao)
  • Been with the bank for 3 years in various roles, really tired and fed up, so would be a good change and refresh
  • 20K sign on for first year, 10K sign on second year, conditional to me staying for a year after receiving bonuses, in addition to equity discount purchase options, and $30K USD equity package (25%/year vest) in compensation

Cons:

  • Only a 10% base-pay bump. Tech homies (though in SF) tell me to not settle for anything less than 20% bump between jobs.
  • Not building models as much, which makes it more statistics-focused and less applied
  • 80% focused on A/B testing from what I gathered
  • Unsure of eBay market reputation these days, what are some companies that people end up working at after eBay?
  • High rate of layoffs

In addition to all of this, I am interviewing with Intuit and Robinhood, both for more technical Data Science positions. Those would likely pay in the 120K-140K CAD range from what I understand, eBay would be at 105k.

The next step for these interviews would be the technicals, which I do not feel confident for at all (Python, SQL LeetCode Mediums + statistics and ML design questions). I haven't studied for this stuff that much, and would really be cramming.

I have told eBay that I can let them know on Monday if the compensation is okay. I am tempted to go back to them and beg for $110K CAD, which would be roughly a 15% bump, and mention that I am interested in the role but the pay is just not working for me, especially since I am interviewing for two other positions that may pay better. So I want to ask whether I can continue the interview process with the other two and get back to eBay with a final confirmation. Not sure how to phrase this, or if I should even show my hand like that.
eBay HR was really nice, really felt like they were gunning for me, and the offer felt like the max they could squeeze out between Payroll and the Hiring team.

I'm just confused. In light of all of this, where can/should I go?

I was 65% in favour to take, 35% to reject, but given my friends/family advice that the pay is low and not worth moving, I am unsure.

I'm also not sure how to buy myself time for the interviews with RH and Intuit, and God knows if I can even do them.

Is it really that bad to move for 10% base bump? Ignoring sign-on bonus, equity, quality of life?
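
To put the numbers from the post side by side, here is a rough sketch of the base-pay bump and first-year cash. It only uses figures stated above; treating the sign-on bonus as CAD is an assumption, and the equity is left in USD rather than converted at a guessed exchange rate.

```python
# Figures quoted in the post. CAD unless noted.
CURRENT_BASE = 95_000
CURRENT_BONUS_PCT = 15        # last year's bonus; not guaranteed going forward
OFFER_BASE = 105_000
COUNTER_BASE = 110_000
SIGN_ON_YEAR_1 = 20_000       # assumption: paid in CAD
EQUITY_USD = 30_000
VEST_PCT_PER_YEAR = 25


def pct_bump(old: int, new: int) -> float:
    """Percentage increase from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)


# Integer arithmetic keeps the totals exact.
current_cash = CURRENT_BASE + CURRENT_BASE * CURRENT_BONUS_PCT // 100
offer_cash_year_1 = OFFER_BASE + SIGN_ON_YEAR_1
equity_vest_usd_per_year = EQUITY_USD * VEST_PCT_PER_YEAR // 100

print("Offer base bump (%):", pct_bump(CURRENT_BASE, OFFER_BASE))
print("Counter base bump (%):", pct_bump(CURRENT_BASE, COUNTER_BASE))
print("Current cash incl. last year's bonus:", current_cash)
print("Offer year-1 cash incl. sign-on:", offer_cash_year_1)
print("Equity vesting per year (USD):", equity_vest_usd_per_year)
```

The comparison ignores any eBay bonus, the clawback condition on the sign-on, and the employee stock purchase discount, none of which are quantified in the post.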

Lastly, I'm not sure if eBay is a boost to my career or a step down; is it a good place to work? Is it frowned upon by other companies as a "legacy tech company" or something?

Thank you!

10 Comments
2024/05/04
17:19 UTC

0

Actual Product vs Portfolio of Demos

In your opinion, I was wondering which is better when searching for a data job-- a portfolio of small demos or an actual product that fills a void?

For example, if my community has an information need such as analysis of schools, their suspension rate and other related features, would that be better than a bunch of small projects posted to github?

I'm thinking an actual product is more beneficial in showcasing one's skills, because it's an end-to-end project (e.g., data collection, data cleaning, analysis, infrastructure, integrating data updates, etc).

4 Comments
2024/05/04
16:05 UTC

16

Impact of different tool use on future job prospects

I'm in a senior DS role right now. This is my first data job after being a professor for a few years post PhD. I'm a modeler, that's my main focus on the job, which I absolutely love.

However, the client (I'm a consultant) uses SAS miner and guide, and does not use Python at all, partly because they always have and partly because of security concerns. As I build my models, realistically the biggest issue is making sure I do things that our (imo outdated) tech stack can handle. I'd love to do a sexy GNN-based model, for example, but right now we struggle to execute a random forest.

The experience I'm getting is great, I'll be able to make some solid quantifiable improvements, and I'm not looking to move jobs in the next <3 years. However, I worry that if I go on the market in the future, my lack of experience putting Python into prod will be an issue.

Hopefully at that point I'll have some promotions under my belt and will be moreso managing a team than running code. If I'm in the future applying for more senior positions, will they care so much about what tools I've been using versus my experience leading a team/communicating with the business, etc?

15 Comments
2024/05/04
13:01 UTC

9

How do you prepare for performance reviews?

Hi,

Currently I have a OneNote notebook where I track different company-desired goals/targets through the year. Some of the things they care about:

  1. certs / continuing education
  2. speaking events
  3. individual contributions (projects etc)

What are some of the ways you track your progress?

And if you don’t…why? Any way you can resell yourself every review is great ammunition imo.

18 Comments
2024/05/04
12:44 UTC

44

What’s the deal with the minimum 3 YOE on most job postings?

Hello, I’m coming with a question for more experienced professionals, or even people who are recruiting. In most job postings I see for DS, MLE, MLOps, etc., there is a requirement of at least 3 YOE. In my personal experience I have seen a lot of devs with 1 YOE who have much more knowledge and a wider range of skills than devs with 6 YOE who write code in PHP and use only Excel. I assume most people with less experience on their resume would be dropped immediately at an early stage of candidate review because of this factor. What’s the deal with this time boundary, and is it really that important?

60 Comments
2024/05/04
10:14 UTC

4

Looking for a dataset that contains the average prices of foods in the U.S.

Looking to be able to search for the price of any given food you’d find in the grocery store. Aware of any datasets that have this information?

6 Comments
2024/05/03
21:29 UTC

26

How much did your grad program help you get a job?

I did a Masters years ago, and we had an internship requirement as part of it. The university had a bunch of big-name employers come in over the course of a week, and we could sign up to interview with them.

I'm curious how other grad programs did or did not help you get a job.

35 Comments
2024/05/03
20:58 UTC

36

What’s the DS job market like for people who have a decent amount of experience?

I know the market is fucked for less experienced candidates.

What’s the market like for people with 5+ years experience at a FAANG as well as a graduate + undergraduate degree from a top tier institution?

Is it just the new grad market that’s fucked or the L5/L6 market too?

39 Comments
2024/05/03
17:56 UTC

760

Put my foot down and refused to go ahead with what would amount to almost 8 hours of interviews for a senior data scientist position.

I was initially going to have a quick call (20 minutes) with a recruiter, which ended up taking almost 45 minutes, and I feel I was grilled enough on my background. It wasn't just "do you know x, y and z?"; they delved much deeper, which is fine. I suppose it helps to figure out right away whether the candidate has at least the specific knowledge before they try to test it. But after that, the recruiter stated that the interview process was spread over several days, as they like to go quick:

  • 1.5 hours long interview with the HM
  • 1.5 hours long interview focusing on coding + general data science.
  • 1.5 hours long interview focusing on machine learning.
  • 1.5 hour long interview with the entire team, general aspect questions.
  • 1 hour long interview with the VP of data science.

So between the 7 hours and the initial 45 minutes, I am expected to miss the equivalent of an entire day of work, so they can ask me unclear questions or questions on issues unrelated to the work.

I told the recruiter that I need to bow out and that this is too much. You would think I had insulted the entire lineage of the company after I said that. They started talking about how that's their process, and how it is the same for all companies to require this sort of vetting. To be clear, the role involves no people management; I would still be an individual contributor. I just told them that's unreasonable, and good luck finding a candidate.

The recruiter wasn't unprofessional, but they were definitely surprised that someone said no to this hiring process.

155 Comments
2024/05/03
16:52 UTC

14

How would you model this problem?

Suppose I’m trying to predict churn based on previous purchase information. What I do today is come up with features like average spend, count of transactions and so on. I want to instead treat the problem as a sequence problem, modeling the sequence of transactions with a neural network.

The problem is that some users have 5 purchases, while others have 15. How do I handle this change in input size from user to user and, more importantly, which architecture should I use?

Thanks!!
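
One common way to handle the varying sequence lengths described above is to pad every user's purchase history to the same length and carry a mask that marks which steps are real. The sketch below is framework-free and purely illustrative: `pad_and_mask` and `masked_mean` are hypothetical helper names, and the masked mean stands in for whatever a sequence model (an RNN, or a transformer with an attention mask) would do with the mask. Deep learning frameworks ship their own utilities for this, for example PyTorch's `pack_padded_sequence`.

```python
from typing import List, Tuple

# One user's history: a list of per-transaction feature vectors.
History = List[List[float]]


def pad_and_mask(batch: List[History], pad_value: float = 0.0) -> Tuple[List[History], List[List[int]]]:
    """Pad each history to the longest one and return a 1/0 mask of real steps.

    Assumes every user has at least one purchase.
    """
    max_len = max(len(seq) for seq in batch)
    n_features = len(batch[0][0])
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [[pad_value] * n_features for _ in range(n_pad)])
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


def masked_mean(padded: List[History], mask: List[List[int]]) -> List[List[float]]:
    """Average only the real steps, so padding does not dilute the summary."""
    pooled = []
    for seq, seq_mask in zip(padded, mask):
        n_real = sum(seq_mask)
        n_features = len(seq[0])
        pooled.append([
            sum(step[f] for step, keep in zip(seq, seq_mask) if keep) / n_real
            for f in range(n_features)
        ])
    return pooled


# Hypothetical data: one feature (purchase amount) per transaction.
batch = [
    [[10.0], [30.0]],          # user with 2 purchases
    [[5.0], [5.0], [5.0]],     # user with 3 purchases
]
padded, mask = pad_and_mask(batch)
print(mask)
print(masked_mean(padded, mask))
```

The key design point is that the model must be told to ignore the padded steps; otherwise users with short histories look like users with many zero-value purchases.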

30 Comments
2024/05/03
14:27 UTC

26

What makes a good or bad product manager?

Realised I’ve only ever worked with two product managers and would love your thoughts as to what makes a product manager good to work with or not so good to work with

27 Comments
2024/05/03
11:13 UTC

6

Apple silicon users: how do you make LLMs run faster?

Just as the title says.

I’m trying to build a RAG pipeline using Ollama but it’s taking so, so long. I’m using an Apple M1 with 8 GB of RAM (yes, I know, I brought a butter knife to a gun fight) but I’m broke and cannot afford a new one.

Any suggestions?

Thanks

10 Comments
2024/05/03
10:19 UTC

56

What are you excited about based on the career you've built so far and where you predict it's gonna take you?

What have you accomplished and how does it position you to grow further? What has this career given you that you're thankful for; be it money, prestige, knowledge or even a bit of fun?

I'm asking this to learn from the folks who have done good for themselves in this career and to receive inspiration. We could all use some inspiration.

39 Comments
2024/05/02
19:07 UTC

70

If you are a data scientist and do not work on the machine learning part, how does your work differ from a data analyst's?

Hi, I am curious: if you are a data scientist but do not work on the ML part, how does your work differ from a data analyst's? What statistical methods do you use for insights if not modelling? I ask because I recently had an interview where they mentioned that they do not work on the ML part and that the job is usually about getting insight from data. Wouldn’t that be the same as a data analyst?

Also, how do you gather insights from data if you are not building ML models? Do you use just basic descriptive stats or basic regressions and think about questions, or is cleaning and feature engineering what you do?

I always thought that to get insights you need to do PCA or some regression models, something more than just basic stats. I would love to know more if you are a “data scientist” who doesn’t work on ML.

67 Comments
2024/05/02
18:32 UTC

112

What do you think of graduate student applicants?

I am a graduate student working on Data Science. The weird thing I have noticed recently is that graduate schools don't teach SQL or BI tools (which makes sense, because those are areas you can pretty much self-learn), so a lot of graduate students lack those skills (me included) when applying for DS or DA jobs.

But they have all the cool-looking machine learning projects in their portfolios. So their resumes might fit DS roles better, but their lack of experience and the far smaller number of DS jobs stop them. Then, when applying for DA roles, their inadequate SQL or BI skills stop them.

I noticed this weirdness because my friend, who has several cool ML projects, just failed a SQL interview for a DA role. I know there are many data professionals here, so I wanted to ask if you have noticed this: there are more graduate student applicants recently, but bootcamp self-learners are a better fit? Now I think 6 months of a bootcamp heavily focused on SQL, with relevant projects, could have given me a higher chance of getting a DA role.

99 Comments
2024/05/02
03:14 UTC

33

Anyone have experience working in a healthcare start-up?

I (27) was recently contacted by a healthcare start-up about a Senior DS role and will be starting the interview process. I've only worked in large healthcare companies, one of them being a hospital system, in analytical work. While it's early in the interview process, the lowest side of their payband is ~$60K more than I make now. But in my current position, I have a very hands-off manager who is a big advocate for good work-life balance, I work at my own pace 99% of the time, and most weeks I work 25-30 hours, which lets me pursue my own hobbies throughout the week. The one thing that is lacking, though, is that the projects I work on aren't very difficult (read: mostly ad-hoc reporting via Excel and SQL) and I don't feel like I'm growing my skillset.

I figure that a start-up space is going to be much faster paced with a lot less work-life balance, but on the flip side I'll (presumably) be working on more exciting projects that will actually teach me new things.

Just wanted some perspective from people who are in the start-up space. Thanks!

33 Comments
2024/05/01
22:58 UTC

8

Survival Analysis Question (For Attrition Prediction)

Used a Cox PH Fitter for an initial run at a survival analysis at work. Got a concordance index of .75. I know what this metric means, mechanically. But I'm wondering how this value compares to any tests you've done in your job. I'm a junior analyst and just trying to come up with a model that is as accurate as possible. Any tips or requests for further info are certainly welcome!

Also, variables I'm currently using are:

Pay grade, business unit, how many promotions an employee has received, gender, diversity status, whether or not the employee is a people leader, and age.

My company has about 1500 YoY employees, so I have to avoid a lot of categorical variables with a high cardinality. Anyway, let me know if there's further info you want/need. Thanks!
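
For readers less familiar with the metric, the concordance index mentioned above can be sketched in a few lines. This is a simplified version of Harrell's C that ignores tied event times, not the exact routine any particular library uses; the function name and the toy data are hypothetical. A pair of employees is comparable when the one with the shorter tenure actually left (an event rather than a censored record), and it is concordant when the model gave that employee the higher risk score.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    times: observed durations; events: 1 if attrition was observed, 0 if censored;
    risk_scores: higher means the model expects attrition sooner.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable: i left, and did so strictly before j's last observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: tenure in years, event flag, predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.5, 0.1]))
print(concordance_index(times, events, [0.9, 0.1, 0.5, 0.7]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so 0.75 means roughly three out of four comparable pairs are ranked correctly.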

10 Comments
2024/05/01
20:13 UTC

13

How to transition to machine learning engineering?

I’m currently at a small tech consulting company. I have a master’s in data science but not much hard engineering experience.

I’ve built 1 production system but it was still ‘low tech’. I was using excel files and then an AutoML tool and running time series forecasting offline at a regular cadence. But that project is done and it looks like clients I work with are all low tech and having to deploy anything with them seems like a pain. I work on POCs for ML modeling nowadays

I want to transition to a company where I can be on a better path and eventually try to be a software engineer in ML or an MLE. Finding opportunities to advance my skills is hard. I am currently interviewing at a company, but the role seems more client-focused and POC-focused, with maybe some opportunities to deploy / monitor ML systems. I am a little nervous that switching into a role that is not advertised as engineering-heavy could be the wrong move.

However, any company that works at large scale is probably better than what I do now. Any proper tech company where I can use proper tools like PySpark, Databricks, etc. seems like it would put me on the path to do more engineering or ML at scale.

I am curious what people think. What is the best way to break into MLE if you don’t have large-scale software experience, and if your current best new role opportunities are not exactly engineering-heavy but could offer chances to build internal tools and deploy things sometimes?

Personally I think I’ll try to do as much engineering work as possible in any new tech company that operates at sufficient scale. And maybe even gunning for an internal transfer to SWE / MLE if that ever shows up could be a move (and this has a chance of happening at new company not current one). And I’ll build some ML apps for personal projects as well. It seems like staying at a small consulting company will continue to hurt my long term skillsets since I don’t have exposure to proper tools and large scaled problems

I have 1.25 YOE plus I moonlit and did some NLP work on the side for many months last year. I effectively have 2.5 YOE including internships. Would love opinions. Even opinions that would argue against wanting to be an MLE

20 Comments
2024/05/01
18:14 UTC

2

Disclaimer/license for take-home assessment

As an applicant, does it make sense or is it common to add a disclaimer to ensure your solution is not used for any purpose other than the intended one?

10 Comments
2024/05/01
16:39 UTC

6

Classification - using both euclidean distance and cosine similarity for inference

Context

  • existing label split into 3 sub labels
  • labeled historical data for existing label exists, but none for the 3 new sub labels
  • idea is to use sub label description to get labeled data for the 3 sub labels
  • but won't be able to get a lot of labeled data (because the taggers' actual role isn't to do tagging)

Given a query document, inference can be done

  • option 1: using binary classifier for each sub label
  • option 2: use label of the document nearest to the query document to predict label of query document
  • option 3: use similarity between sub label description and query document to predict label of query document
  • option 4: combination of options 2 and 3

Option 1 may not be ideal as the amount of labeled data for sub labels will likely not yield performant binary classifiers. If they don't fulfil the metrics set by stakeholders, they shouldn't be deployed which is where options 2, 3 and 4 came about.

Option 2 is one shot classification using euclidean distance. It's opportunistic because it'll predict the sub labels if taggers had labeled the nearest document with the sub labels.

Option 3 is zero shot classification using cosine similarity, because it doesn't rely on any historical documents to make prediction.

Option 4 mitigates the opportunistic factor of option 2 by allowing another similarity measure (option 3) to influence the prediction of the query document. This is the option I'm leaning towards should option 1 not meet stakeholders' expectations.
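
As a rough illustration of option 4, the sketch below blends the two signals into one score per sub label. Everything here is an assumption made for the example: the function names, the `alpha` weight, the toy embeddings, and the choice to turn Euclidean distance into a similarity with `1 / (1 + d)`. A sub label with no tagged documents yet simply gets no nearest-document contribution and relies on its description alone.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_sub_label(query, labeled_docs, label_descriptions, alpha=0.5):
    """Blend nearest-document evidence (option 2) with description similarity (option 3).

    labeled_docs: list of (embedding, sub_label) pairs tagged so far.
    label_descriptions: dict of sub_label -> embedding of its description.
    alpha: hypothetical weight on the nearest-document signal.
    """
    scores = {}
    for label, description_vec in label_descriptions.items():
        distances = [euclidean(query, vec) for vec, lab in labeled_docs if lab == label]
        # Map the closest distance into (0, 1]; no tagged documents means no evidence.
        neighbour_sim = 1.0 / (1.0 + min(distances)) if distances else 0.0
        scores[label] = alpha * neighbour_sim + (1 - alpha) * cosine(query, description_vec)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical 2-d embeddings for illustration only.
labeled_docs = [([0.9, 0.1], "a"), ([0.0, 1.0], "b")]
label_descriptions = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
best, scores = predict_sub_label([1.0, 0.0], labeled_docs, label_descriptions)
print(best, scores)
```

The weight `alpha` would need to be tuned on whatever small labeled set exists, since the two similarity scales are not directly comparable.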

Wanted to ask whether this is a sound approach to handling classification of documents with new labels. Or, if you have tackled such a problem in a different way, feel free to suggest!

17 Comments
2024/05/01
13:47 UTC

91

Offer from an org that is mostly operating in excel

I am a data analyst / scientist. Basically, I am an end user of data that is at least sitting on a platform before I interrogate it, clean it, etc. I have been interviewing for a role that sounds great, but they just mentioned that they are in a transition phase and a lot of their important data is still in Excel. I would be in a management position here but have not been in this situation before. How dire is it likely to be? What are the options, generally? I have been an observer of migrations from on-prem to cloud, but not this, so I'm not sure what to expect. Any advice?

70 Comments
2024/05/01
08:14 UTC

41

What are some good resources to learn about missing values and different approaches to deal with them?

I'm pretty new to my professional job, but from school I know missingness is something we'll never escape. I have survey data my project lead wants me to use, and we can't impute if someone didn't fill out the entire survey. They want me to decide between dropping the survey for the sample entirely vs keeping it in, and I want to learn a bit more about what the options are here, especially since a concern is that the group with the survey will differ from the group without it. I'm just not super familiar with this specific concern in missing data, or with what to do when an entire set of features is missing, as with a whole survey.
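
One simple way to probe the concern that survey completers differ from non-completers is to compare the two groups on covariates that are observed for everyone. The sketch below computes a standardized mean difference for one such covariate; the variable names and values are hypothetical, and the threshold people often quote (an absolute value around 0.1) is a rule of thumb rather than a formal test.

```python
import math
from statistics import mean, stdev


def standardized_mean_difference(completers, non_completers):
    """Difference in group means, scaled by the pooled standard deviation."""
    pooled_sd = math.sqrt((stdev(completers) ** 2 + stdev(non_completers) ** 2) / 2)
    if pooled_sd == 0:
        raise ValueError("both groups have zero variance")
    return (mean(completers) - mean(non_completers)) / pooled_sd


# Hypothetical values of one covariate (say, age) that is observed for everyone.
age_with_survey = [30, 40, 50]
age_without_survey = [20, 30, 40]

smd = standardized_mean_difference(age_with_survey, age_without_survey)
print(f"SMD for age: {smd:.2f}")
```

Large differences on observed covariates suggest the survey data is not missing completely at random, which argues against simply dropping the incomplete rows without further adjustment.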

19 Comments
2024/04/30
18:23 UTC

15

What would you call a model that by nature is a feedback loop?

So, I'm hoping someone could help me find some reading on a situation I'm dealing with. Even if that's just by providing a name for what the system is called it would be very helpful. My colleagues and I have identified a general concept at work but we're having a hard time figuring out what it's called so that we can research the implications.

tl;dr - what is this called?

  1. Daily updated model with a static variable in it creates predictions of rent
  2. Predictions of rent are disseminated to managers in field to target as goal
  3. When units are rented the rate is fed back into the system and used as an outcome variable
  4. During this time a static predictor variable is in the model, and because it continuously contributes to the predictions, it becomes a proxy for the outcome variable

I'm working on a model of rents for my employer. I've been calling the model incestuous as what happens is the model creates predictions for rents, those predictions are sent to the field where managers attempt to get said rents for a given unit. When a unit is filled the rent they captured goes back into the database where it becomes the outcome variable for the model that predicted the target rent in the first place. I'm not sure how closely the managers adhere to the predictions but my understanding is it's definitely something they take seriously.

If that situation is not sticky enough, in the model I'm updating, the single family residence variables are from 2022 and have been in the model since then. The reason is that extracting them is like trying to take out a bad tooth in the 1860s. When we try to replace them with more recent data, it hammers goodness-of-fit metrics, enough so that my boss questions why we would update if we're only getting accuracy that's about as good as before. So I decided just to try every combination of every year of Zillow data from 2020 forward. Basically just throw everything at the wall, and surely out of 44 combinations something will be better. That stupid 2022 variable and its cousin, 21-22 growth, were at the top as measured by R-squared and AIC.

So a few days ago my colleagues and I had an idea. This variable has informed every price prediction for the past two years. Since it was introduced it has been creating our rent variable. And that's what we're predicting. The reason why it's so good at predicting is that it is a proxy for the outcome variable. So I split the data up by moveins in 22, 23, 24 (rent doesn't move much for in place tenants in our communities) and checked the correlation between the home values 22 variable and rent in each of those subsets. If it's a proxy for quality of neighborhoods, wealth, etc then it should be strongest in 22 and then decrease from there. Of course... it did the exact opposite.
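
The cohort check described above can be sketched as follows: group records by move-in year and compute the correlation between the static 2022 home-value variable and achieved rent within each group. The numbers are invented to mimic the pattern reported in the post (correlation rising over time), and the function names are hypothetical.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_cohort(rows):
    """rows: iterable of (move_in_year, home_value_2022, rent)."""
    cohorts = defaultdict(lambda: ([], []))
    for year, home_value, rent in rows:
        cohorts[year][0].append(home_value)
        cohorts[year][1].append(rent)
    return {year: pearson(xs, ys) for year, (xs, ys) in sorted(cohorts.items())}


# Hypothetical records, not real data.
rows = [
    (2022, 300_000, 1500), (2022, 400_000, 1700), (2022, 500_000, 1600),
    (2024, 300_000, 1500), (2024, 400_000, 1700), (2024, 500_000, 1900),
]
result = correlation_by_cohort(rows)
for year, r in result.items():
    print(year, round(r, 2))
```

If the variable were only a proxy for neighborhood quality as of 2022, one would expect the correlation to decay in later cohorts; a rising correlation is consistent with the predictions feeding back into the outcome.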

So at this point I'm convinced this variable is, mildly put, quite wonky. I think we have to rip the bandaid off even if the model is technically worse off and instead have this thing draw from a SQL table that's updated as new data is released. Based on how much that correlation was increasing from 22 to 24, eventually this variable will become so powerful it's going to join Skynet and target us with our own weapons. But the only way to ensure buy in from my boss is to make myself a mini-expert on what's going on so I can make the strongest case possible. And unfortunately I don't even know what to call this system we believe we've identified. So I can't do my homework here.

We've alternately been calling it self-referential, recursive, feedback loop, etc. but none of those are yielding information. If any of the wise minds here have any information or thoughts on this issue it would be greatly appreciated!

15 Comments
2024/04/30
15:36 UTC

1

Partial Dependence Plot

So I was reading up on PDPs and tried to plot them for my dataset, but the values on the Y-axis are coming out negative. It is a binary classification problem with a Gradient Boosting Classifier, and none of the examples I have seen have negative values. As I understand it, partial dependence values are the average effect that the feature has on the prediction of the model.

Am I doing something wrong, or is it okay to have negative values?
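
One possible explanation, assuming the plots come from scikit-learn: for gradient boosting classifiers, the fast "recursion" method reports partial dependence on the decision-function (log-odds) scale rather than as probabilities, so negative values are expected there. The sketch below only shows the conversion from log-odds to probability; the PDP values are hypothetical. Note that applying the sigmoid to already-averaged log-odds is a convenience for reading the plot and is not identical to averaging predicted probabilities.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical partial dependence values on the log-odds scale.
pdp_log_odds = [-2.0, -0.5, 0.0, 1.5]
pdp_probability = [sigmoid(v) for v in pdp_log_odds]

for raw, prob in zip(pdp_log_odds, pdp_probability):
    print(f"log-odds {raw:+.1f} -> probability {prob:.3f}")
```

A negative log-odds value simply means the predicted probability of the positive class is below 0.5 at that feature value.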

8 Comments
2024/04/30
13:12 UTC

11

Career networking question

I was reading a post about networking on Reddit, and someone mentioned going to networking events where hiring managers are. Now I'm really curious!

I'm a hiring manager and I don't attend any networking events. Honestly, outside of conferences, if I had an interest in doing that, I'm not even sure where I'd start (to meet other analytic/DS professionals).

Does anyone actually attend networking events? Are they focused on analytics, AI, data science, or more of an industry in general? Are they online or in person?

11 Comments
2024/04/30
11:39 UTC

8

Estimating value and impact on business in data science

I am working on a data science project at a Fortune 500 company. I need to perform opportunity sizing to estimate 'size of the prize'. This would be some dollar figure that helps business gauge value/impact of the initiative and get buy in. How do you perform such analysis? Can someone share examples of how they have done this exercise as part of their work?
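
A minimal sketch of one way such an estimate is often framed: multiply the eligible population by a baseline rate, an assumed relative lift, and a value per conversion, then report a range across scenarios rather than a single number. All inputs below are hypothetical placeholders, and `size_of_prize` is an illustrative name, not an established method.

```python
def size_of_prize(eligible_customers, baseline_rate, relative_lift, value_per_conversion):
    """Annual incremental value under a simple conversion-lift framing."""
    incremental_conversions = eligible_customers * baseline_rate * relative_lift
    return incremental_conversions * value_per_conversion


# Hypothetical inputs: 1M eligible customers, 5% baseline conversion, $40 per conversion.
scenarios = {"low": 0.05, "base": 0.10, "high": 0.20}  # assumed relative lifts
estimates = {
    name: size_of_prize(1_000_000, 0.05, lift, 40.0)
    for name, lift in scenarios.items()
}
for name, value in estimates.items():
    print(f"{name}: ${value:,.0f}")
```

Presenting low, base, and high cases makes the dependence on the assumed lift explicit, which tends to matter more for stakeholder buy-in than the point estimate itself.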

7 Comments
2024/04/30
05:42 UTC

14

Impact of LLMs on Nature of and Demand for Data Science

Hi everyone!

I assume questions similar to this have appeared on this sub since ChatGPT burst onto the scene. For my part, I'd like to ask how LLMs have affected data science in your experience, both the day-to-day work of data scientists and the demand for them. Additionally, how do you think this will unfold going forward?

As a DE (and former DS), I certainly know they have benefitted my work, especially with tools like GitHub Copilot essentially becoming a more targeted, interactive version of Stack Overflow (can’t beat getting a direct answer to your question, rather than having to scour for a similar question someone has posted!).

I ask because from time to time I have thought about moving back into DS, leveraging the engineering skills I’ve learned as a DE. Both roles have their advantages in my experience.

Thank you for your input!

9 Comments
2024/04/29
20:56 UTC

113

For R users: upgrading to R 4.4.0 is recommended due to a recently discovered vulnerability.

13 Comments
2024/04/29
19:29 UTC
