/r/datascience
A space for data science professionals to engage in discussions and debates on the subject of data science.
Driven by curiosity, I scraped some marathon data to find potential frauds, and found some interesting results: https://medium.com/p/4e7433803604
Although I'm active in the field, I must admit this project is more data analysis than data science. But it was fun nonetheless.
I am trying to migrate from VROOM to OptaPlanner, but the library has no proper documentation and hardly anyone with experience working with it, only a quick-start guide on their GitHub. I ran into some problems and really need some help: https://stackoverflow.com/questions/78964911/optapy-hard-constraint-is-not-respected-in-a-vrp
If you have any recommendations for another tool for VRPs, please share.
Thanks.
Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:
While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.
I'll start.
2020 (Data Analyst ish?)
2021 (Data Analyst)
2022 (Data Analyst)
2023 (Data Scientist)
2024 (Data Scientist)
Education: Bachelor's in Computer Science from an average college.
First job took about 270 applications.
Is it fine to use descriptive stats to make decisions if statistical tests and ML models are not working well enough? How common is that in the tech industry, say for tech products?
Also, what if the deadline is tight?
Many people say stock prices cannot be predicted, and I tend to agree. Still, I wonder if systematic strategies using advanced analytics methods can outperform a global portfolio (like an MSCI All-World ETF). Personally, I use a set of statistical methods to estimate stock returns, including Monte Carlo simulations, spline regression, and autoregressive models like AR-ARCH. I haven't become rich yet, though!
I'd love to hear about any promising, potentially non-mainstream approaches within the realm of Data Science.
**Update**
Thanks for all the inputs!
I think the Monte Carlo part needs a bit more clarification. I sample from the last few years of returns and apply them to a starting value of 100 across 100 iterations. I repeat this process many times to generate a distribution of returns. From that, I calculate metrics like the win-loss ratio and the expected value, which I use to compare stocks. In that sense, I do consider it a form of prediction, though it's more focused on assessing potential outcomes.
I also wanted to share that I don't believe there's one dominant method for predicting stock prices. Instead, I strongly believe that a combination of approaches, both data-driven and others, is essential for performing well in the market.
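For anyone curious, here is the resampling procedure described above as a rough sketch (numpy; the normal returns are made up and stand in for my real historical data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily returns standing in for the last few years of real data
historical_returns = rng.normal(0.0004, 0.01, size=750)

def simulate_paths(returns, start=100.0, horizon=100, n_sims=1000):
    """Resample past returns with replacement and compound from `start`."""
    draws = rng.choice(returns, size=(n_sims, horizon), replace=True)
    return start * np.prod(1 + draws, axis=1)

finals = simulate_paths(historical_returns)

# Metrics used to compare stocks: win-loss ratio and expected final value
win_loss_ratio = (finals > 100).mean() / max((finals <= 100).mean(), 1e-9)
expected_value = finals.mean()
```

The distribution of `finals` is what I compare across stocks, rather than a single point prediction.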
Hi everyone,
I'm about to start my first co-op in data science/analytics, and I'm feeling pretty nervous. I see many students with strong personal projects, and I'm worried they might have an edge over me. I would greatly appreciate any advice or recommendations you can offer, especially from DS/DA professionals.
Thanks in advance for your help!
I'm a research fellow in my first year of doing practical work after graduating in Computer Science. I'm on a project where I work with water quality data; however, I don't feel like I'm exploiting the data to its full potential. I feel like I'm just plotting variables and doing easy statistical analysis. I have to say that I'm basically left alone in the work. What I learnt during university is not very useful in what I'm doing, mostly for the data analysis part. The analyses I'm doing all come from continuous research on the internet, but I often feel that I'm not doing enough in the analysis (in my case, of time series) beyond trend estimation, correlation analysis, and some statistical tests. After the analyses the ML part comes in, and I feel "satisfied" about that part. Could it be because I had some expectations of doing complex analyses? Is it a common feeling in this field?
Genuine question: what do these show other than how much time someone has spent on Leetcode?
Hello! I'd like to get some feedback on my latest project, where I use an XGBoost model to identify the key features that determine whether an NFL player will get drafted, specific to each position. This project includes comprehensive data cleaning, exploratory data analysis (EDA), the creation of relative performance metrics for skills, and the model's implementation to uncover the top 5 athletic traits by position. Here is the link to the project
I have to create a forecasting solution for thousands of different MSKUs at the location level.
Data: after a final cross join, each MSKU has 36 monthly data points. (Not all will necessarily be populated; many monthly sales values could be 0.)
The following is what I have attempted:
However, the final report gives absurdly high values of predictions, even in case of MSKUs with nearly no sales.
This is where the business has a problem. They want me to redo everything to get meaningful predictions.
My problem with this approach is that I might have to create models for each item, i.e., thousands of .pkl files.
Constraints:
I am outta my depth here! Can you please help?
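For reference, here is the shape of the data and the kind of simple vectorized baseline I could fall back on instead of thousands of .pkl files: a trailing-mean forecast per MSKU/location, so near-zero-sales items forecast near zero (column names are hypothetical):

```python
import pandas as pd

# Hypothetical long-format monthly sales: one row per MSKU x location x month
df = pd.DataFrame({
    "msku": ["A"] * 3 + ["B"] * 3,
    "location": ["L1"] * 6,
    "month": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"] * 2),
    "sales": [10, 12, 11, 0, 0, 0],
})

window = 3  # trailing months to average
forecasts = (
    df.sort_values("month")
      .groupby(["msku", "location"])["sales"]
      .apply(lambda s: s.tail(window).mean())
)
# MSKUs with no recent sales naturally forecast 0 instead of absurd values
```

This is just a sanity-check baseline, not a replacement for a proper time series model, but any model the business accepts should at least beat it.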
EDIT:
Wow, I was definitely not expecting so many helpful responses! I am insanely grateful. It seems I need to peruse some of the time series literature. It's midnight as I am writing this. I will definitely try to answer and thank the comments here!
Hello smart people! I'm looking to get well educated in practical A/B tests, including coding them up in Python. I do have some stats knowledge, so I would like the materials to go over different kinds of tests and when to use which. Here's my end goal: when presented with a business problem to test, I want to be able to: define the right data to query, select the right test, know how many samples I need, interpret the results and understand pitfalls.
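To make the end goal concrete, here is a sketch of the kind of workflow I mean, assuming statsmodels and entirely made-up numbers:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Sample size: how many users per arm to detect a 10% -> 12% conversion lift?
effect = proportion_effectsize(0.12, 0.10)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)

# Analysis of a finished test: conversions and visitors per arm (made up)
conversions = np.array([120, 150])
visitors = np.array([1000, 1000])
z_stat, p_value = proportions_ztest(conversions, visitors)
```

I would want the materials to cover when this simple two-proportion setup applies versus when t-tests, nonparametric tests, or sequential methods are needed.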
What's your recommendation? Thank you!
I did an analysis of DS job offers among Fortune 1000 companies (data: https://jobs-in-data.com/).
Here is a comparison of the manager/director positions vs. junior/mid/senior positions.
NGL huge stomp from India :)
Hey folks,
dlt cofounder here.
Previously: We recently ran our first 4-hour workshop, "Python ELT zero to hero", on a first cohort of 600 data folks. Overall, both we and the community were happy with the outcomes. The cohort is now working on their homework for certification. You can watch it here: https://www.youtube.com/playlist?list=PLoHF48qMMG_SO7s-R7P4uHwEZT_l5bufP We are applying the feedback from the first run and will do another one this month in a US timezone. If you are interested, sign up here: https://dlthub.com/events
Next: Besides ELT, we heard from a large chunk of our community that you hate governance, but it's an obstacle to data usage, so you want to learn how to do it right. Well, it's no rocket (or data) science, so we arranged for a professional lawyer/data protection officer to give a webinar for data engineers to help them achieve compliance. Specifically, we will do one run for GDPR and one for HIPAA. There will be space for Q&A, and if you need further consulting from the lawyer, she comes highly recommended by other data teams.
If you are interested, sign up here: https://dlthub.com/events Of course, there will also be a completion certificate that you can present to your current or future employer.
This learning content is free :)
Do you have other learning interests? I would love to hear about it. Please let me know and I will do my best to make them happen.
I have a curated list of items and a collection of conversations. I'm extracting a list of items from the conversations using an LLM and trying various techniques to map the extracted items to my curated list. I need to evaluate whether the correct number of items is extracted, and whether each one was mapped to the correct item.
Comparing the lengths of the two item lists is easy enough, and using a similarity score for the extracted items has been OK. I'm wondering if anyone has better evaluation techniques they'd recommend I look into.
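For reference, here is roughly what I'm doing now, sketched with difflib similarity as a stand-in for my actual matcher (the item names are made up):

```python
from difflib import SequenceMatcher

curated = ["refund policy", "late fee disclosure", "privacy notice"]

def map_to_curated(extracted_item, curated, threshold=0.6):
    """Map an extracted string to the most similar curated item, or None."""
    best = max(curated,
               key=lambda c: SequenceMatcher(None, extracted_item.lower(), c).ratio())
    score = SequenceMatcher(None, extracted_item.lower(), best).ratio()
    return best if score >= threshold else None

def evaluate(extracted, gold):
    """Precision/recall of mapped items vs a gold list for one conversation."""
    mapped = {map_to_curated(e, curated) for e in extracted} - {None}
    gold_set = set(gold)
    tp = len(mapped & gold_set)
    precision = tp / len(mapped) if mapped else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall

p, r = evaluate(["the refund policy", "privacy notices"],
                ["refund policy", "privacy notice"])
```

Precision/recall per conversation against a small hand-labeled gold set is the best I've come up with so far.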
What math/stats/probability questions do you ask candidates that they always struggle to answer, or that only a few can answer, to set them apart from the others?
What tools do you use to visualize relationships between tables, like primary keys, foreign keys, and other connections?
Especially when working with many tables in a complex relational data structure, a tool offering some sort of entity-relationship diagram would come in handy.
Curious what books y'all might recommend to fill out a bookcase. There are the standard ISLR and ESLR, but what about domain-specific analytics/science, or any other books with very good information in them? For example, I really enjoy this one for getting up to speed on Bayesian inference: Bayesian Analysis with Python - Third Edition: A practical guide to probabilistic modeling https://a.co/d/0Y2pyEg
I use this one to really hone the craft in Bayesian inference: Bayesian Data Analysis (Chapman & Hall/CRC Texts in Statistical Science) https://a.co/d/c7VF3w9
This is a great book for pricing: Pricing and Revenue Optimization: Second Edition https://a.co/d/04pa3fg
Like in Factorio, the bookcase must grow!
I'm thinking of spending the next 6-12 months to intensely study LLMs (how they're structured, training, fine-tuning, deployment, RAG, etc).
I'm certainly interested in them personally, but my decision is focused on how valuable this knowledge will be.
I have a lot of data science and ML experience already, and I'm trying to figure out what to study next. I could spend the next several months studying LLMs, *but* I could also spend the next several months studying other data-related things or some business topics.
Although I think LLMs are fascinating, I'm somewhat uncertain that applied LLMs will drive real business value (LTV, revenue, profit, ROI) in the same way that more traditional tools do.
What do you think?
Will LLM knowledge be valuable or is it overhyped with limited real value over the next few years?
What am I missing and how should I be thinking about this?
(note: edit to clarify tradeoffs for what I could study.)
Just kind of a vent post here. I'm a registered nurse and I work on a data science team for a large, well-known hospital system. I absolutely love my job working as a clinical SME and liaison between teams. I have a master's in health informatics and am very up to date on ML models, data science principles, and all that. However, the way my organization is going, I can tell it's time for me to look for new opportunities.
But gosh dang it's hard! I'm in the Bay Area; there's an abundance of health tech AI companies, yet somehow opportunities for my skill set are few and far between. Project manager seems like a potentially good fit, but those positions want years and years of PM experience. Other analyst positions want someone skilled in SQL querying. Clinical informatics would be an okay area, but I prefer the process of making new and cool things over working at a hospital with Epic stuff (it's boring).
I thought I had it made once I got into the data science field as a nurse, but it seems none of these health tech companies I'm finding that are doing things I'm interested in need clinical personnel. Am I doing something wrong?
Hi everyone,
I am trying to find a rule of thumb for treatment to control group ratio when implementing generalized synthetic control (using gsynth in R). All the examples have a bigger donor group (control), but mine is about balanced. Does that matter? Should the control donor pool be X times bigger than the treatment? Any resources that can help me?
Thanks in advance!
So for context, I'm interviewing at a big company whose product you probably know or use.
This was their process up until now:
Then I was given a take home on Thursday and asked to turn it in by Sunday.
I was only able to spend two days on it. I have my presentation coming up next week, and now I feel like I made a crappy presentation. I wish I had put in more days.
Our company is strictly on-premise for all matters of data. No cloud services allowed for any sort of ML training. We're looking into adopting Red Hat OpenShift AI as an all-inclusive data platform. Does anyone here have any experience with OpenShift AI? How does it compare to the most common cloud tools and which cloud tools would one actually compare it to? Currently I'm in an ML engineer/data engineer position but will soon shift to data science. I would like to hear some opinions that don't come from RedHat consultants.
I've been looking to move from data analytics to a data scientist role for a while now, and I've been applying, but there seem to be incredibly few jobs. Certain DS roles I'm just not qualified for, depending on the specific domain, so that lowers the count further. But I'm getting more and more miserable in my current role, and I'm just not sure what to do.
Hello,
Need some guidance on extracting a large set of compliance items from raw PDF documents. I have a CSV with these compliance items, and I want to fine-tune a model so that when it reads any new PDF document, it can identify the compliance items and extract them.
Examples of these compliance items:
The PDFs are servicing guides. Examples of them can be seen on the following site: https://servicing-guide.fanniemae.com/
Regex is not usable, since these items are varied in nature, sometimes sit inside tables, and have no fixed positions in the document.
The goal is to train an LLM on these compliance documents so that if I provide a new PDF, it can predict/extract compliance items from it.
Can anyone advise which model would be better suited for this task, what tokenizers to use, etc., and whether there are similar scenarios/methodologies for extracting the items?
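One preprocessing step I assume I'll need regardless of model choice is chunking the extracted PDF text, since these guides exceed typical context windows; a minimal overlapping-chunk sketch (plain strings, the PDF-to-text step is omitted):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character chunks so a compliance item
    spanning a chunk boundary still appears whole in at least one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# 2500 characters of dummy text -> chunks starting at 0, 800, 1600, 2400
doc = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(doc)
```

Each chunk would then be fed to whatever model ends up being chosen, with the overlap guarding against items cut in half at boundaries.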
Attaching an image of a compliance item in a PDF for reference:
I am trying to find associations between how much the sales rep smiles and the outcome of an online sales meeting. The sales rep's smiling is a percentile (based on our own database of how much people smiled in previous sales meetings), and the outcome of a sales meeting is just "Win" or "Loss", so a binary variable.
I will generate bar plots to get an intuition for the data, but I wonder what statistical test I should use to see if there is even a non-random association between how much the sales rep smiles and the outcome of a sales meeting. I could bin the data (0-10%, 10-20%, etc.) and run a chi-square test, but that loses information since I'm binning the data. I could try logistic regression or point-biserial correlation, but I am not completely sure the association is linear (smiling too much and too little should both have negative impacts on the outcome, if any). Long story short: what test should I run to check whether there is any relationship between a feature like smiling (which is continuous) and the outcome (which is binary)?
Second, say I want to answer the question "Does smiling in the top 5% improve online sales meeting outcomes?". Can I simply run a one-tailed t-test with two groups, top 5% of smiles and the rest, comparing the win rate of each group?
I got in as a senior tech analyst at one of the very big e-commerce companies in my country (in Asia). I worked really hard for the job: gave so many interviews and could finally clear them. I am a self-taught analyst with no coding background. Five months into the job, I am so underconfident and overwhelmed. I was fairly confident in my life before I started this job. Everyone on my team is either developing an app or winning hackathons or whatever. This gives me serious anxiety, and I get anxious about the year-end rating I would get. I am just hardworking and I persevere. I am not quick, neither with Excel nor Python, but I get the job done after putting in long hours. I catch edge cases and don't make silly mistakes. But other than that I don't know much about data. I think I should leave the industry, as I cry every day and this anxiety is killing me. :( Please help. What should I do? How do I get out of my head? How do I get my confidence back? I don't have the guts to do a 1:1 with my manager (what if he says I need to buck up?). :(
Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:
While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.
Hello Data Scientists!
I'm thrilled to announce the release of Plotlars 0.3.0!
This new version brings a host of exciting features and improvements designed to make your data visualization experience in Rust even smoother and more powerful. If you've been following the progress of Plotlars, you'll know that it's all about bridging the gap between the Polars data analysis library and various plotting libraries. With this release, we're taking things to the next level!
New Features:
Head over to the crate, explore the updated documentation, and dive into the GitHub repository to see all the new changes in action. If you find Plotlars useful, consider leaving a star on GitHub; it helps others discover the project and motivates further development.
Thank you for your continued support and interest in Plotlars. Happy plotting!