/r/datascience

A space for data science professionals to engage in discussions and debates on the subject of data science.

2,409,663 Subscribers

0

NEED A MENTOR ASAP!!!!!!

I'm seeking a mentor to guide me on my journey to securing a Junior Data Analyst or Data Scientist role within the next 8-9 months.

Educational Background:

  • Bachelor in Computer Information Systems (BCIS), Apex College
  • Currently pursuing MSc in Data Science and Computational Intelligence

Professional Background:

  • Experience as a content writer, editor, and technical writer
  • Completed a Software Engineering internship at Leapfrog
  • Completed a Data Science internship at JobAxle, focusing on LLMs and LangChain projects

I’m eager to leverage my skills and background to transition into data analytics or data science and would greatly appreciate guidance on career strategies, skill development, and portfolio-building.

4 Comments
2024/12/02
10:26 UTC

22

Are any of you doing actual ML work here?

I'm really passionate about machine learning and I love its mathematics, especially that of deep learning. I have experience with training DL models, genetic-algorithm hyperparameter tuning, distribution-based models/clustering (KL divergence, EM), combining models or building them from scratch, implementing complex ones in C from zero, signal analysis, visualizations, and other things.

I work at a FAANG company, but most of the work is actually data engineering and statistics. At first I was given the chance to work on a bit of ML, but that was just to motivate me to learn the already existing systems, because no one in the entire department does any ML, and now I'm only getting engineering/statistics projects.

I had jobs in the past at startups where the CEO would tell me to hard code IFs instead of training a decision tree for different tasks.

They all just want "the simplest solution", and I fully agree with the approach, except that the simplest possible approach is not an actual solution some of the time. We may need to add in some complexity to solve different tasks, but most managers/bosses I've encountered have been terrified by any actual ML/mathematics. I agree that explainable, low-risk, high-reward approaches are the best, but not if your "low risk" solution is hardcoding hundreds of if statements instead of a decision tree, man.

Is it because I'm from Europe and not the US? I've been told by HR before that we're inferior, that ideas only come from the US, and that I should keep my head down more instead of proposing projects.

I'm a very tryhard and hard-working person, but I just can't perform in a job where the task is to put together two pieces of SQL software built 10 years ago in a rush and with zero documentation... And my bosses refuse to understand that. Sure, I can do some of it, the job does not need to be perfect. But not if that is 100% of the job.

Are labs like OpenAI/Anthropic/DeepMind the only places on earth that do actual ML and not API calls + statistics/engineering + if statements?

25 Comments
2024/12/02
08:53 UTC

2

Weekly Entering & Transitioning - Thread 02 Dec, 2024 - 09 Dec, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

0 Comments
2024/12/02
05:01 UTC

1

F5-TTS is highly underrated for Audio Cloning!

1 Comment
2024/12/02
04:40 UTC

0

Need help gathering data

Hello!

I'm currently analysing data on politicians across the world, and I would like to know if there's a database with data like years in office, education, age, gender, and other relevant attributes.

If you have any links, I'll be glad to check them all.

*Need help, no new help...

6 Comments
2024/12/01
22:12 UTC

2

Feature creation out of two features.

I have been working on a project that tries to identify interactions between variables. What is a good way to capture these interactions by creating features?

What are good mathematical expressions to capture interactions beyond multiplication and division? Note that I have nulls and cannot change that.
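
For illustration, here is a minimal sketch of a few pairwise features beyond product and ratio (difference, absolute difference, minimum, maximum), plus explicit missingness indicators so the null pattern itself becomes a signal. Everything in it is an assumption for the example: the function names, the feature names, and the choice to propagate nulls as NaN rather than impute them. Whether NaN values are acceptable downstream depends on the model; some gradient-boosting libraries handle missing values natively.

```python
import math

def is_null(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def interaction_features(a, b):
    """Illustrative pairwise features for one row; nulls are propagated, not imputed."""
    a_null, b_null = is_null(a), is_null(b)
    feats = {
        "a_is_null": int(a_null),
        "b_is_null": int(b_null),
        # 0 = both present, 1 = only b missing, 2 = only a missing, 3 = both missing
        "null_pattern": 2 * int(a_null) + int(b_null),
    }
    names = ("product", "ratio", "difference", "abs_difference", "minimum", "maximum")
    if a_null or b_null:
        # Keep the value features missing; the indicators above carry the information.
        feats.update({name: math.nan for name in names})
        return feats
    feats.update({
        "product": a * b,
        "ratio": a / b if b != 0 else math.nan,  # avoid division by zero
        "difference": a - b,
        "abs_difference": abs(a - b),
        "minimum": min(a, b),
        "maximum": max(a, b),
    })
    return feats

# Hypothetical rows: (feature_a, feature_b)
rows = [(2.0, 4.0), (None, 3.0), (5.0, math.nan)]
features = [interaction_features(a, b) for a, b in rows]
```

The indicator columns let a model learn from "b is missing" separately from the numeric interaction, which is often the only option when the nulls cannot be changed.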

16 Comments
2024/12/01
14:09 UTC

10

Daily averaged time series comparison - Linking plankton and aerosol emissions?

Hi everyone, so we have this dataset of daily averaged phytoplankton time series over a full year: coccolithophores, chlorophytes, cyanobacteria, diatoms, dinoflagellates, Phaeocystis, zooplankton.
Then we have atmospheric measurements, at the same time intervals, of a few aerosol species: methanesulphonic acid, carboxylic acids, aliphatics, sulphates, ammonium, nitrates, etc.
Our goal is to establish all the possible links between plankton types and aerosols; we want to find out which plankton types matter the most for a given aerosol species.

So here is my question: which mathematical tools would you use to build a model with these (nonlinear) time series? Random Forest, cross-wavelets, transfer entropy, fractal analysis, chaos theory, Bayesian statistics? The thing that puzzles me most is that we know there is a lag between the plankton bloom and aerosols eventually forming in the atmosphere; it can take weeks for a bloom to trigger aerosol formation. So far many studies have just used lagged Pearson's correlation, which I am not too happy with, as correlation really isn't reliable. Would you know of any advanced methods to find the optimal lag? What would be the best approach in your opinion?
I would really appreciate any ideas, so please don't hesitate to write down yours and I'd be happy to debate it. Have a nice Sunday, cheers :)
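
As one possible starting point for the lag question, here is a minimal sketch of a lag scan that swaps Pearson for Spearman rank correlation, which picks up monotonic but nonlinear relationships. It is only an illustration under stated assumptions: the series are synthetic, the 14-day shift is invented for the example, and the function names are hypothetical. It does not address autocorrelation or significance testing, and it is a much simpler tool than the transfer-entropy or wavelet methods mentioned above.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Plain Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else math.nan

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def lag_scan(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    scores = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]
        y = response[lag:]
        scores[lag] = spearman(x, y)
    return scores

# Synthetic data (assumption): a bloom peaking on day 40, and an aerosol signal
# that is a nonlinear (square-root) response delayed by 14 days.
days = range(120)
plankton = [math.exp(-((d - 40) / 8.0) ** 2) for d in days]
aerosol = [math.exp(-((d - 54) / 8.0) ** 2) ** 0.5 for d in days]

scores = lag_scan(plankton, aerosol, max_lag=30)
best_lag = max(scores, key=lambda lag: scores[lag])
```

On real data the peak will be far less clean, and a broad plateau of similar scores across lags is itself informative about how sharply the lag can be identified.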

4 Comments
2024/12/01
10:08 UTC

43

TIME-MOE: Billion-Scale Time Series Forecasting with Mixture-of-Experts

Time-MOE is a 2.4B parameter open-source time-series foundation model using Mixture-of-Experts (MOE) for zero-shot forecasting.

You can find an analysis of the model here

13 Comments
2024/11/30
17:43 UTC

5

Large scale video processing help

I want to extract CLIP embeddings from 40k videos at a certain frame rate. To do this there are three main things I need to do: first read the video to extract frames, then preprocess the frames using the CLIP image processor, and finally use CLIP itself to extract the embeddings. The first two operations are CPU heavy and the last one is GPU heavy.

One option to do this would be to use Spark with a cluster of T4 machines, with more cores and RAM, that reads a chunk of the video, preprocesses it and encodes it using CLIP. But if I were to do that, sometimes the GPU would be idle and sometimes the CPU would not be used to its full potential.

What would be the best way to solve this issue? Note that if I were to split this into two tasks I would need to store the preprocessed video frames, and that seems overkill because it would be around 100 TB of storage (yeah, mp4 really compresses videos well). Is there a way to do this processing using two different kinds of machines on the same cluster? One that is CPU and RAM heavy and one that has a GPU?

I'm sure this could be achieved with Kubernetes, but that seems overkill for this task. Is there an easy way to do this with Spark? Should this even be done with Spark? For context, I am doing this in GCP and I really only have basic knowledge of Spark.
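
One common pattern for this kind of CPU/GPU imbalance is a producer-consumer pipeline: several CPU workers decode and preprocess frames into a bounded in-memory queue, and a single GPU worker drains it in batches, so nothing is written to disk. The sketch below shows only the shape of that pattern with the standard library. `decode_and_preprocess` and `embed_batch` are hypothetical stand-ins for the real decoding and CLIP calls, and the worker counts and sizes are arbitrary assumptions.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream

def decode_and_preprocess(video_id, frames_per_video=4):
    """Hypothetical stand-in for CPU-bound decoding + CLIP preprocessing."""
    return [(video_id, frame_idx) for frame_idx in range(frames_per_video)]

def embed_batch(batch):
    """Hypothetical stand-in for the GPU-bound CLIP forward pass."""
    return [f"emb-{video_id}-{frame_idx}" for video_id, frame_idx in batch]

def producer(video_ids, frame_queue):
    for video_id in video_ids:
        for frame in decode_and_preprocess(video_id):
            # put() blocks when the queue is full, which throttles the CPU side
            # and bounds memory instead of materialising all frames.
            frame_queue.put(frame)

def consumer(frame_queue, results, batch_size):
    batch = []
    while True:
        item = frame_queue.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) == batch_size:
            results.extend(embed_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        results.extend(embed_batch(batch))

def run_pipeline(video_ids, n_producers=3, batch_size=8, queue_size=32):
    frame_queue = queue.Queue(maxsize=queue_size)
    results = []
    shards = [video_ids[i::n_producers] for i in range(n_producers)]
    producers = [threading.Thread(target=producer, args=(shard, frame_queue))
                 for shard in shards]
    gpu_worker = threading.Thread(target=consumer,
                                  args=(frame_queue, results, batch_size))
    gpu_worker.start()
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    frame_queue.put(SENTINEL)  # all producers are done; stop the consumer
    gpu_worker.join()
    return results

embeddings = run_pipeline(list(range(10)))
```

In real Python code the CPU-bound side would typically run in separate processes rather than threads, for example via a PyTorch `DataLoader` with `num_workers` set, or on separate machines feeding a message queue. The ratio of CPU workers to GPU workers is then the knob for keeping both sides busy.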

3 Comments
2024/11/30
15:31 UTC

108

Recommendations for self-studying time series and forecasting models?

This is becoming relevant for my job but is not something I have experience with. I know they're a pretty complex set of models though. Those of you with strong backgrounds in this topic, what are some good resources for a noob to start with?

23 Comments
2024/11/30
12:25 UTC

2

AWS released new Multi-AI Agent framework

3 Comments
2024/11/30
07:19 UTC

8

Ideas for local networking?

I’ve joined local DS/ML meetup groups in the past and didn’t see much benefit. Any advice for networking locally and in person?

6 Comments
2024/11/30
02:45 UTC

52

Interview Query in 2024?

Hi, I’m currently the manager of an ML team at a mid-sized startup, and I'm looking to prepare for my next steps. I stumbled upon InterviewQuery, and it seems like a good platform for familiarizing myself with the technical questions asked for ML roles across companies (and it's Black Friday right now).

I’ll be very grateful if you are willing to share your experience using them (number of questions, whether they ended up helping you with interviews, etc.), or if you think that it’s better to learn from some other resource like books or YouTube. It’s been a while since I had my last interview, so I’m looking to gauge and plan my preparation.

Thanks!

24 Comments
2024/11/29
17:37 UTC

45

Is Azure ML good today?

Hi, to give a bit of context, I work in a medium-sized company that wants to start some ML projects. We are already in the Azure ecosystem with some data, web apps, Power BI, and so on, and we are now looking for an ML cloud provider to handle all our MLOps. From what I can see, Azure ML can be a bit frustrating; what are your thoughts on it nowadays?

I am more of a coding guy and don't much like drag-and-drop tools. Can we build an AI model from scratch with VS Code integration or something similar (preprocessing/training/evaluation)?

18 Comments
2024/11/29
10:39 UTC

13

Andrew Ng releases new GenAI package: aisuite

0 Comments
2024/11/29
05:44 UTC

60

Black Friday, which online course to buy?

With Black Friday deals in full swing, I’m looking to make the most of the discounts on learning platforms. Many courses are being offered at great prices, and I’d love your recommendations on what to explore next.

So far, two courses have had a significant impact on my career:

Both of these helped me take a big step forward in my career, and I’d love to hear your thoughts on other courses that might offer similar value.

27 Comments
2024/11/28
19:42 UTC

26

Is it reasonable to put technical challenges on GitHub?

Hey, I have been solving lots of technical challenges lately. What do you think about putting each one in a repo after completing it and saving the changes? I think a little bit later those could maybe serve as a portfolio. Or maybe I could go deeper into one particular challenge, improve it, and make that a portfolio piece?

I'm thinking that in a couple of years I could have a big directory with lots of challenge solutions, and maybe then it could be interesting for a hiring manager or a technical manager to see?

34 Comments
2024/11/28
13:37 UTC

104

Plotly 6.0 Release Candidate is out!

Plotly have a release candidate of version 6.0 out, which you can install with `pip install -U --pre plotly`

The most exciting part for me is improved dataframe support:

- previously, if Plotly received non-pandas input, it would convert it to pandas and then continue

- now, you can also pass in a Polars DataFrame / PyArrow Table / cuDF DataFrame and computation will happen natively on the input object without conversion to pandas. If you pass in a DuckDBPyRelation, then after some pruning, it'll convert it to a PyArrow Table. This cross-dataframe support is achieved via Narwhals

For plots which involve grouping by columns (e.g. `color='symbol', size='market'`), performance is often 2-3x faster when starting with non-pandas inputs. For pandas inputs, performance is about the same as before (it should be backwards-compatible)

If you try it out and report any issues before the final 6.0 release, then you're a star!

4 Comments
2024/11/28
10:02 UTC

0

Is Freelancing as a Data Scientist Even Possible for Beginners?

Hi everyone,

I’m new to data science and considering freelancing. I’m fine working for as low as $15/hour, so earnings aren’t a big concern for me. I’ve gone through past Reddit posts, but they mostly discuss freelancing from the perspective of income. My main concern is whether freelancing in data science is practical for someone like me, given its unique challenges.

A bit about my background: I’ve completed 3-4 real-world data science projects, not on toy datasets, but actual data (involving data scraping, cleaning, visualization, modeling, deployment, and documentation). I’ve also worked as an intern in the NLP domain.

Some issues I’ve been thinking about:

  1. Domain Knowledge and Context: How hard is it to deliver results without deep understanding of a client’s business?

  2. Resource Limitations: Do freelancers struggle with accessing data, computing power, or other tools required for advanced projects?

  3. Collaboration Needs: Data science often requires working with teams. Can freelancers integrate effectively with cross-functional groups?

  4. Iterative and Long-Term Nature: Many projects require ongoing updates and monitoring. Is this feasible for freelancers?

  5. Trust and Accountability: How do freelancers convince clients to trust them with sensitive or business-critical work?

  6. Client Expectations: Do clients expect too much for too little, especially at low wages?

I’m also open to any tips, advice, or additional concerns beyond these points. Are these challenges solvable for a beginner? Have any of you faced and overcome similar issues? I’d love to hear your thoughts.

Thanks in advance!

21 Comments
2024/11/28
06:46 UTC

45

Senior Data Scientist Interview at Capital One

Hey everyone, I've got an upcoming interview for a Senior Data Scientist position at Capital One and I'm looking for some insights. I'd really appreciate it if anyone could share their experiences or advice on the following:

  1. What does the interview process typically look like? I've heard about a "Power Day" - what should I expect?
  2. How can I best prepare for the technical rounds, especially the ML Technical and Stats Roleplay portions?
  3. Are there any specific resources or prep materials that have been particularly helpful for Capital One interviews?

20 Comments
2024/11/28
06:10 UTC

0

Alibaba QwQ-32B: Outperforms OpenAI o1-mini and o1-preview for reasoning on multiple benchmarks

Alibaba's latest reasoning model, QwQ, has beaten o1-mini, o1-preview, GPT-4o and Claude 3.5 Sonnet on many benchmarks. The model is just 32B and is completely open-sourced as well. Check out how to use it: https://youtu.be/yy6cLPZrE9k?si=wKAPXuhKibSsC810

0 Comments
2024/11/28
04:09 UTC

185

Data Scientist Struggling with Programming Logic

Hello! It is well known that many data scientists come from non-programming backgrounds, such as math, statistics, engineering, or economics. As a result, their programming skills often fall short compared to those of CS professionals (at least in theory). I personally belong to this group.

So my question is: how can I improve? I know practice is key, but how should I practice? I’ve been considering platforms like LeetCode.

Let me know your best strategies! I appreciate all of them.

76 Comments
2024/11/28
02:23 UTC

17

Math Question on logistic regression and boundary classification from Andrew Ng's Coursera course

I'm following Andrew Ng's Machine Learning specialisation on Coursera, FYI.

If the value of the sigmoid function is greater than 0.5, the classification model would predict y_hat = 1 or "true".

However, when using more complex functions inside of the sigmoid function, e.g. an ellipse:

g(z) = 1 / (1 + e^(-z)), where z = x1^2/a^2 + x2^2/b^2 - 1

in order to define the classification boundary, Andrew says that the model would predict y_hat = 1 for points inside of the boundary. However, based on my understanding of the lecture, as long as the threshold is 0.5, and you're predicting y_hat = 1 for any points where the sigmoid function evaluates to >= 0.5, then it should be points outside the boundary.

More specifically, it's proven that g(z) >= 0.5 when z >= 0; therefore, if z is the ellipse expression, g(z) >= 0.5 would imply that x1^2/a^2 + x2^2/b^2 >= 1, i.e. outside the boundary.

... At least by my understanding. Can anybody shed some light on what I may have missed, or if this is just a mistake in the lecture? Thank you
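
The reasoning above can be checked numerically. The sketch below evaluates the sigmoid for one point inside and one point outside the ellipse, using z exactly as written in the post, and then again with the sign of z flipped. The values a = 2 and b = 1 and the two test points are arbitrary assumptions. Whether the lecture's learned parameters carry the opposite sign is not something this example can settle; it only shows that the sign of z decides which side gets y_hat = 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def z_ellipse(x1, x2, a=2.0, b=1.0):
    """z as written in the post: negative inside the ellipse, positive outside."""
    return x1**2 / a**2 + x2**2 / b**2 - 1.0

def predict(z, threshold=0.5):
    """y_hat = 1 when the sigmoid is at or above the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

inside = (0.0, 0.0)   # centre of the ellipse, z = -1
outside = (3.0, 0.0)  # beyond the semi-axis a = 2, z = 1.25

# With z as written, y_hat = 1 lands outside the boundary.
pred_inside = predict(z_ellipse(*inside))
pred_outside = predict(z_ellipse(*outside))

# With the sign flipped (z' = 1 - x1^2/a^2 - x2^2/b^2), y_hat = 1 lands inside.
pred_inside_flipped = predict(-z_ellipse(*inside))
pred_outside_flipped = predict(-z_ellipse(*outside))
```

So both statements can be true of the same boundary curve: the ellipse is where z = 0 either way, and the sign of the weights determines which region is classified as 1.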

10 Comments
2024/11/27
22:12 UTC

26

Marco-o1: Open-sourced alternative to OpenAI-o1

Alibaba recently launched the Marco-o1 reasoning model, which specialises not just in topics like maths or physics but also aims at open-ended reasoning questions like "What happens if the world ends?". The model size is just 7B and it is open-sourced as well. Check out more about it and how to use it here: https://youtu.be/R1w145jU9f8?si=Z0I5pNw2t8Tkq7a4

3 Comments
2024/11/27
03:39 UTC

0

OGI - An Open Source Framework for General Intelligence

Dan and I often found ourselves deep in conversation about the future of artificial intelligence, particularly how we could create a system that mimics human cognition. Our discussions revolved around the limitations of current AI models, which often operate in silos and lack the flexibility of human thought.

From these chats, we conceptualized the Open General Intelligence (OGI) framework, which aims to integrate various processing modules that can dynamically adjust based on the task at hand. We drew inspiration from how the human brain processes information—using interconnected modules that specialize in different functions while still working together seamlessly.

Our brainstorming sessions were filled with ideas about creating a more adaptable AI that could handle multiple data types and switch between cognitive processes effortlessly. This collaborative effort not only sparked innovative concepts but also solidified our vision for a more intelligent and reliable AI system. It is open source; look for the GitHub community link soon.

https://arxiv.org/abs/2411.15832

2 Comments
2024/11/26
22:59 UTC

2

Looking for food menu related data.

13 Comments
2024/11/26
22:50 UTC

97

I Wrote a Guide to Simulation in Python with SimPy

Hi folks,

I wrote a guide on discrete-event simulation with SimPy, designed to help you learn how to build simulations using Python. Kind of like the official documentation but on steroids.

I have used SimPy personally in my own career for over a decade; it was central in helping me build a pretty successful engineering career. Discrete-event simulation is useful for modelling real-world industrial systems such as factories, mines, railways, etc.
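
For readers new to the idea, here is a minimal sketch of what "discrete-event" means, written with only the standard library's `heapq` rather than SimPy's own API (SimPy models processes as Python generators, which this example does not attempt to reproduce). It simulates a single machine with a FIFO queue; the arrival times and service time are made-up values, and the function name is hypothetical.

```python
import heapq
from collections import deque

def simulate(arrival_times, service_time):
    """Single machine with a FIFO queue; returns {part_id: finish_time}."""
    # The event list is a priority queue ordered by time: the simulation clock
    # jumps from one event to the next instead of ticking in fixed steps.
    events = [(t, part_id, "arrive") for part_id, t in enumerate(arrival_times)]
    heapq.heapify(events)
    waiting = deque()
    busy = False
    finished = {}
    while events:
        now, part_id, kind = heapq.heappop(events)
        if kind == "arrive":
            waiting.append(part_id)
        else:  # "finish"
            finished[part_id] = now
            busy = False
        if not busy and waiting:
            next_part = waiting.popleft()
            busy = True
            # Schedule the completion of this part as a future event.
            heapq.heappush(events, (now + service_time, next_part, "finish"))
    return finished

# Three parts arrive one time unit apart; each takes three units to process.
finish_times = simulate(arrival_times=[0.0, 1.0, 2.0], service_time=3.0)
```

Here the parts finish at times 3, 6 and 9, because each one has to wait for the machine to free up. A library such as SimPy wraps this event-list bookkeeping behind resources and processes.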

My latest venture is teaching others all about this.

If you do get the guide, I’d really appreciate any feedback you have. Feel free to drop your thoughts here in the thread or DM me directly!

Here’s the link to get the guide: https://simulation.teachem.digital/free-simulation-in-python-guide

For full transparency, why do I ask for your email?

Well, I’m working on a full course following on from my previous Udemy course on Python. This new course will be all about real-world modelling and simulation with SimPy, and I’d love to keep you in the loop via email. If you find the guide helpful, you might be interested in the course. That said, you’re completely free to hit “unsubscribe” after the guide arrives if you prefer.

16 Comments
2024/11/26
18:32 UTC

277

Just spent the afternoon chatting with ChatGPT about a work problem. Now I am a convert.

I have to build an optimization algorithm on a domain I have not worked in before (price sensitivity based, revenue optimization)

Well, instead of googling around, I asked ChatGPT which we do have available at work. And it was eye opening.

I am sure tomorrow when I review all my notes I’ll find errors. However, I have key concepts and definitions outlined with formulas. I have SQL/Jinja/dbt and Python code examples to get me started on writing my solution - one that fits my data structure and the complexities of my use case.

Again. Tomorrow is about cross-checking the output against more reliable sources. But I got so much knowledge transferred to me. Within a single day I have gotten this far in defining the problem.

Unless every single thing in that output is completely wrong, I am definitely a convert. This is probably very old news to many but I really struggled to see how to use the new AI tools for anything useful. Until today.

103 Comments
2024/11/26
18:18 UTC

9

Good audiobook for DS/ML?

Is there a good audiobook that goes through topics in DS or ML that I can listen to on my commute to work? I’m looking for something technical, not a statistics driven non-fiction book.

24 Comments
2024/11/26
17:47 UTC
