/r/datascience

A space for data science professionals to engage in discussions and debates on the subject of data science.

1,890,446 Subscribers

26

Rio: WebApps in pure Python – Technical Description

Hey everyone!

Last month we received a lot of encouraging feedback from you and used it to improve our framework.

First up, we've completely rewritten how components are laid out internally. This was a large undertaking and has been in the works for several weeks now, and the results are looking great! We're seeing much faster layout times, especially for larger (1000+ component) apps. This was an entirely internal change that not only makes Rio faster but also paves the way for custom components, something we've been wanting to add for a while.

Of all the feedback, the most common question we've encountered is, "How does Rio actually work?"

The following topics have already been detailed in our wiki for the technical description:

  • What are components?
  • How does observing attributes work?
  • How do diffing and reconciliation work?

We are working on technical descriptions for:

  • How does the Client-Server Communication work?
  • How does our Layouting work?

Thanks and we are looking forward to your feedback! :)

GitHub

4 Comments
2024/07/24
14:04 UTC

73

Why do people want access to llama 3.1 400B?

I've been in analytics engineering for several years now and am just starting to learn the basics of LLMs and machine learning, so NLP and stuff like that. I got Llama 3 running locally on my Windows PC. There's a community of people who are also getting access to Llama 3.1 with 400 billion parameters. The download alone is about 800 GB, at least. I'm not joking when I say that: an 800 GB download. I don't even know how much RAM would be required to run this, but I've heard people say it requires about 256 GB of RAM or VRAM...
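
The sizes quoted above follow from simple arithmetic: weights stored at 16-bit precision take 2 bytes per parameter. The sketch below is a rough estimate only. It assumes the post's round figure of 400 billion parameters, counts weights alone (no activations or KV cache), and uses a hypothetical helper name.

```python
# Rough memory needed just to hold model weights, by numeric precision.
# Assumptions: 400e9 parameters (the post's round figure); activations and
# KV cache are ignored, so real requirements are higher than these numbers.

BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name:>10}: {weight_memory_gb(400e9, size):,.0f} GB")
```

At 16-bit precision this gives the 800 GB mentioned above. At 4-bit precision the weights alone come to roughly 200 GB, which is in line with the 256 GB figure people quote.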

Question is, why do people want this? Can anyone explain that to me? Why would anyone want something so insanely massive on their local PC?

36 Comments
2024/07/24
10:46 UTC

11

Drinking From a Fire Hose (First 100 Days & Beyond Question)

How do you ensure success for employees that you are bringing into the organization?

As more and more organizations realize the potential of data science and big data in general, new departments and roles are created to start adding value utilizing the mass amounts of data available. I have been working with data for years, I am well versed in databases, analytics, programming, and architecture. The one thing I haven't been able to excel in is how to set up an organization for growth by providing a way to easily onboard an analyst.

There was a thread a few days ago asking why on-boarding is so disorganized and it made me realize that it is because data is disorganized. Or at least, it has been at the majority of companies I have worked and consulted for.

Real-world example: A global client of mine utilizes 10 different SaaS services in North America, and an additional 10 globally. All of these SaaS services have APIs and they are accessed through those. Now, they have built automations and have other flows/processes in place utilizing legacy software. These all reside either in the cloud, through admin portals, on virtual machines, or running off SQL stored procedures/SSIS flows. When I walked in, process-mapping the current state was an exercise in patience:
"How does this process run?"
"On the server"
"OK... what server? what is it called? what sources does it hit?"

  • In your organizations: How do you catalog all of the automated processes and flows?
  • How do you document where these flows and processes are stored, saved, run from?
  • How do you provide documentation on all of your data sources and what APIs you are using? - I don't want a data catalog, I want a Reader's Digest Family Handyman Guide to Data.

Does anyone have a recommendation that a new hire could sit in front of for a week or less and be able to access/understand/follow/reference from there with success?

How can I compile all processes that are running and publish a guide for someone coming aboard?

I feel like this subject is so often overlooked, but I imagine that it can create an organization that will thrive and grow faster than one where you have to chase senior employees down and hope they remember (or hope that they didn't lose the source code after re-imaging their pc).

And no, a saved Word Document in a Shared Folder on a Shared Drive (or the File tab in Microsoft Teams) is not what I think is best practices.

8 Comments
2024/07/23
23:03 UTC

1

Having problem with langchain loader

I have the data in JSON format and I’m trying to use the JSONLoader, but apparently I need to download and import the jq module, and that’s where my problem is. I have pip-installed jq, but when it’s time to import it I get a "no module" error, and yes, it’s installed in the venv that I am working in. Has anyone had this problem before?
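
One common cause of this symptom (an assumption here, since the post does not show the environment) is that the script runs under a different interpreter than the one pip installed into. A minimal diagnostic sketch using only the standard library; `describe_module` is a hypothetical helper name:

```python
import importlib.util
import sys


def describe_module(name: str) -> dict:
    """Report which interpreter is running and whether `name` is importable from it."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # Run this with the same command used to launch the failing script.
    for key, value in describe_module("jq").items():
        print(f"{key}: {value}")
```

If the interpreter path printed is not the venv's Python, installing with `python -m pip install jq` from that same interpreter puts the package where the script can see it.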

1 Comment
2024/07/23
20:37 UTC

83

New Data science jobs in the NFL, Formula1 and sports analytics companies

Hey guys,

I'm constantly checking for jobs in the sports and gaming analytics industry. I've posted recently in this community and had some good comments.

I run www.sportsjobs.online, a job board in that niche.

In the last month I added around 200 jobs.

With this post I'm celebrating that I automated all the NFL teams, and in doing so I've found a few interesting data science and analytics jobs.

F1

There are many more jobs related to data science, engineering and analytics on the job board.

I've also created a reddit community where I regularly post the openings, if that's easier for you to check.

I hope this helps someone!

14 Comments
2024/07/23
20:14 UTC

2

Text classification using LLMs

I want to use LLMs to classify color descriptions into 8 different colors. I tried using text embeddings from these color descriptions and applied cosine similarity to classify them, but the performance was not satisfactory. When I use prompts specifying that it’s a color classifier, it gives correct responses. Is there a way to effectively use embeddings for this use case? The dataset is large, so prompt engineering is not a viable option.
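
One way to keep embeddings in the loop, offered as a sketch rather than a known fix: label a small sample with the prompt that already works, then classify the rest by comparing each embedding to per-class centroids built from that sample, instead of comparing to the embedding of the color name itself. The toy 3-d vectors below stand in for real embeddings, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fit_centroids(vectors, labels):
    """Average the embeddings of each class into one centroid per label."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is most similar to `vector`."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Toy data: in practice these would be embeddings of descriptions that the
# prompt-based classifier has already labelled.
train_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.1], [0.0, 0.8, 0.2]]
train_labels = ["red", "red", "green", "green"]
centroids = fit_centroids(train_vectors, train_labels)

print(predict(centroids, [0.7, 0.3, 0.0]))
```

The prompt is then only needed for the labelled sample, not for the whole dataset.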

13 Comments
2024/07/23
19:36 UTC

227

Is there a place to learn where people aren't petty and condescending?

I see people posting in this subreddit frequently trying to learn things, asking for recommendations and tips, trying to discuss data science, and about 50% of the replies here are from people who think they are so much smarter than they are, being petty, mocking, denigrating, aggressive and toxic for no reason at all. They just act like they think they are one of the smartest people ever to exist. The other 50% are pretty nice: they talk, provide recommendations, support, words of encouragement, advice, technical information.

So I'm wondering if there is another place where people go to discuss data science as they are learning it. I'm not talking about doing a boot camp, or doing a udemy course or anything like that. I'm talking about a place where people who are devoted to learning data science and machine learning fundamentals can go to discuss freely.

115 Comments
2024/07/23
16:19 UTC

67

If you peek in your AB tests, you're setting yourself up for disappointment

Peeking (looking for significance in an AB test before the experiment has enough samples to reach desired power) is a “no no”. Rationales for not peeking typically mention inflated type 1 error rate.

Unless you’re just randomizing into two groups and not changing anything, the null is unlikely to be true. So an inflated type 1 error rate is really not the primary concern. Rather, if we peek then we are setting ourselves up for disappointment. Detected effects from peeking will typically not generalize, and we will be overstating our impact. The reason why is fairly clear when considering the Winner’s Curse.

I wrote a short little blog post to demonstrate just how exaggerated the effects detected from peeking can be here. If you need to tell your stakeholders not to peek, it's probably best to come at it from this angle as opposed to a statistical angle, which they neither understand nor care about.
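
The Winner's Curse argument can be illustrated with a small simulation. This is a minimal sketch and not the code from the blog post. It assumes normally distributed outcomes with unit variance, a true effect of 0.1, a z-test with known variance, and a peek at 100 users per arm; all of those numbers are illustrative.

```python
import random
import statistics


def peeked_estimates(true_effect, n_peek, n_sims, seed=0):
    """Effect estimates from simulated tests that look significant at an early peek."""
    rng = random.Random(seed)
    std_err = (2 / n_peek) ** 0.5  # SE of a difference in means, unit variance per arm
    winners = []
    for _ in range(n_sims):
        control = statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n_peek)])
        treatment = statistics.fmean([rng.gauss(true_effect, 1.0) for _ in range(n_peek)])
        diff = treatment - control
        if diff / std_err > 1.96:  # "significant" in the treatment's favour: stop and ship
            winners.append(diff)
    return winners


if __name__ == "__main__":
    true_effect = 0.1
    n_sims = 2000
    winners = peeked_estimates(true_effect, n_peek=100, n_sims=n_sims)
    print(f"share of tests significant at the peek: {len(winners) / n_sims:.1%}")
    print(f"true effect: {true_effect}")
    print(f"mean effect among peeked 'winners': {statistics.fmean(winners):.2f}")
```

With 100 users per arm, a difference only clears the 1.96 threshold when it exceeds about 0.28, so every "winner" found at the peek overstates a true effect of 0.1 by construction.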

38 Comments
2024/07/23
13:37 UTC

5

Suggested literature/techniques to model forward moving averages

I want to start a personal project, but I'm failing to formulate my business problem into a model. I would love input on how to better look into this issue and what type of models/techniques I should be researching to tackle it.

I want to model the nth day forward moving average of a metric on a given date based on previous days and on the latest available forward moving average for that given day.

For example: Consider today is day 30 and I want to predict up to the 360d forward moving average of a metric.

I will only have the actual value of the 360d forward moving average on day 360. Currently I have the actual average values for 1d to 30d. I also have all of these forward moving averages for the past 5 years.

The goal is to define ranges for all forward moving averages from the latest date (31d) to 360d.

I am failing to think of the type of model I'd be looking for or how I should structure the problem, given that the goal here is not to predict a single value, but all the values in the 31d to 360d range.
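
One possible way to frame the setup described above, stated as an assumption rather than a recommended model: every longer forward average is a weighted blend of the already-known 30d average and the mean of the days not yet observed, so a range for the unobserved mean translates directly into a range for each horizon. The function names and the toy numbers are hypothetical.

```python
def forward_average(daily_values, horizon):
    """Mean of the first `horizon` daily values after the anchor date."""
    return sum(daily_values[:horizon]) / horizon


def combine(known_avg, known_days, future_mean, horizon):
    """Forward average at `horizon` implied by a known average over `known_days`
    and an assumed mean for the remaining, unobserved days."""
    remaining = horizon - known_days
    return (known_days * known_avg + remaining * future_mean) / horizon


def horizon_ranges(known_avg, known_days, low_mean, high_mean, horizons):
    """Map each horizon to a (low, high) range, given bounds on the unobserved mean."""
    return {
        h: (combine(known_avg, known_days, low_mean, h),
            combine(known_avg, known_days, high_mean, h))
        for h in horizons
    }


if __name__ == "__main__":
    # Toy scenario: the 30d average is 100 and the unobserved daily mean is
    # assumed (for illustration) to lie between 90 and 120.
    for h, (low, high) in horizon_ranges(100.0, 30, 90.0, 120.0, [31, 90, 360]).items():
        print(f"{h:>3}d forward average range: {low:.1f} to {high:.1f}")
```

The bounds on the unobserved mean could come from the five years of history, for example as empirical quantiles, but that choice is left open here.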

1 Comment
2024/07/22
14:10 UTC

168

Easiest way to calculate required sample size for A/B tests

I am a data scientist who monitors ~5-10 A/B experiments in a given month. I've used numerous online sample size calculators, but had minor grievances with each of them... so I did a completely sane and normal thing, and built my own!

Screenshot of A/B Test calculator at www.samplesizecalc.com/proportion-metric

Unlike other calculators, mine can handle different split ratios (e.g. 20/80 tests), more than 2 testing groups beyond "Control" and "Treatment", and you can choose between a one-sided or two-sided statistical test. Most importantly, it outputs the required sample size and estimated duration for multiple Minimum Detectable Effects so you can make the most informed estimate (and of course you can input your own custom MDE value!).
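
For readers who want to see what such a calculation involves, here is a minimal sketch of the textbook two-proportion z-test sample size with an unequal split. It is not claimed to be the formula this calculator uses; the unpooled-variance form, the function name, and the parameter names are assumptions.

```python
import math
from statistics import NormalDist


def sample_sizes(p_control, relative_mde, alpha=0.05, power=0.8,
                 ratio=1.0, two_sided=True):
    """Return (n_control, n_treatment) for a two-proportion z-test.

    ratio = n_treatment / n_control, so a 20/80 split is ratio=4.
    relative_mde is the relative lift to detect, e.g. 0.20 for +20%.
    """
    p_treat = p_control * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    # Unpooled variance of the difference, expressed per control-group user
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat) / ratio
    n_control = (z_alpha + z_beta) ** 2 * variance / (p_treat - p_control) ** 2
    return math.ceil(n_control), math.ceil(n_control * ratio)


if __name__ == "__main__":
    # Several MDEs side by side for a 10% baseline rate
    for mde in (0.05, 0.10, 0.20):
        print(f"MDE {mde:.0%}: {sample_sizes(0.10, mde)}")
```

Passing `ratio=4` covers a 20/80 split and `two_sided=False` a one-sided test; tests with more than two groups need a multiple-comparison adjustment that this sketch leaves out.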

Here is the calculator: https://www.samplesizecalc.com/proportion-metric

And here is an article explaining the methodology, inputs and the calculator's underlying formula: https://www.samplesizecalc.com/blog/how-sample-size-calculator-works

Please let me know what you think! I'm looking for feedback from those who design and run A/B tests in their day-to-day. I've built this to tailor my own needs, but now I want to make sure it's helpful to the general audience as well :)

Note: You all were very receptive to the first version of this calculator I posted, so I wanted to re-share now that it's been updated in some key ways. Cheers!

20 Comments
2024/07/22
14:03 UTC

36

Perpetual: a gradient boosting machine which doesn't need hyperparameter tuning

Repo: https://github.com/perpetual-ml/perpetual

PerpetualBooster is a gradient boosting machine (GBM) algorithm that doesn't need hyperparameter tuning so that you can use it without hyperparameter optimization libraries unlike other GBM algorithms. Similar to AutoML libraries, it has a budget parameter. Increasing the budget parameter increases the predictive power of the algorithm and gives better results on unseen data.

The following table summarizes the results for the California Housing dataset (regression):

Perpetual budget | LightGBM n_estimators | Perpetual mse | LightGBM mse | Perpetual cpu time | LightGBM cpu time | Speed-up
1.0              | 100                   | 0.192         | 0.192        | 7.6                | 978               | 129x
1.5              | 300                   | 0.188         | 0.188        | 21.8               | 3066              | 141x
2.1              | 1000                  | 0.185         | 0.186        | 86.0               | 8720              | 101x

PerpetualBooster prevents overfitting with a generalization algorithm. A paper explaining how the algorithm works is in progress. Check our blog post for a high-level introduction to the algorithm.

25 Comments
2024/07/22
08:30 UTC

11

Weekly Entering & Transitioning - Thread 22 Jul, 2024 - 29 Jul, 2024

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.

52 Comments
2024/07/22
04:01 UTC

1

Does anyone have any information and/or example code for Parametric Matrix Models?

There's a paper on arXiv about Parametric Matrix Models: https://arxiv.org/abs/2401.11694 . I'm finding it interesting but struggling to understand the details. Has anyone heard about it, tried it, or have any information about it? Ideally someone would have example code of using Parametric Matrix Models to solve some small problem.

6 Comments
2024/07/21
06:04 UTC

18

Why do the suggestions for misspelled words seem to miss the obvious word that a typo would cause?

Is it because of the amount of CPU/RAM it would cost to use a smarter algorithm?

For me, a perfect algorithm for single word (so no context) spell-checking would look for words that differ by a certain amount of key shift.

Just now I typed the word "diagusting" trying to type "disgusting". The suggestions chrome provides are:

  • degusting

  • digesting

  • degaussing

"Disgusting" is obviously the best guess due to how much more common it is than the others, combined with the fact that it would be very difficult to type "diagusting" while trying to type any of the other words. It seems like the spellchecker was more interested in words that are phonetically similar.

What's going on under the hood that prevents spotting typos? It should be very easy to just parse a ton of natural language to see which types of typos are most common. That narrows down the search space significantly rather than assuming every possible letter could have been every other possible letter, on top of flips, insertions, and removals.
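
The key-shift idea described above can be sketched as an edit distance in which substituting a neighbouring key costs less than any other substitution. This is an illustration only and says nothing about how Chrome's spellchecker works. The QWERTY adjacency rule, the 0.5 cost, and the function names are assumptions.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def adjacent(a, b):
    """True if two letters sit next to each other on a staggered QWERTY layout."""
    if a not in POS or b not in POS:
        return False
    (r1, c1), (r2, c2) = POS[a], POS[b]
    if r1 == r2:
        return abs(c1 - c2) == 1
    if abs(r1 - r2) != 1:
        return False
    upper_col, lower_col = (c1, c2) if r1 < r2 else (c2, c1)
    # A key in the lower row touches the keys above it at the same column and one to the right
    return upper_col - lower_col in (0, 1)


def typo_distance(typed, candidate, near_cost=0.5):
    """Levenshtein distance where swapping neighbouring keys costs `near_cost`."""
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, a in enumerate(typed, start=1):
        curr = [float(i)]
        for j, b in enumerate(candidate, start=1):
            if a == b:
                sub = 0.0
            elif adjacent(a, b):
                sub = near_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1,        # delete a typed letter
                            curr[j - 1] + 1,    # insert a missing letter
                            prev[j - 1] + sub)) # substitute (or match)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    words = ["degusting", "digesting", "degaussing", "disgusting"]
    for word in sorted(words, key=lambda w: typo_distance("diagusting", w)):
        print(word, typo_distance("diagusting", word))
```

Under this scoring "disgusting" ranks first, because "a" and "s" are neighbouring keys and no other edit is needed. Word frequency, which the post also mentions, could be added as a tie-breaker.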

11 Comments
2024/07/21
03:30 UTC

153

The Rise of Foundation Time-Series Forecasting Models

In the past few months, every major tech company has released time-series foundation models, such as:

  • TimesFM (Google)
  • MOIRAI (Salesforce)
  • Tiny Time Mixers (IBM)

There's a detailed analysis of these models here.

88 Comments
2024/07/20
21:30 UTC

5

Top enhancements to try once you have a vanilla RAG set-up with a text vector database?

2 Comments
2024/07/20
20:26 UTC

0

How is this streamlit data science app tutorial?

So this is our first video. Please tell us anything that you can. I will really appreciate it.

2 Comments
2024/07/20
19:22 UTC

23

Hey guys I know we all hate langchain but I have a question

I’m building a chatbot. Since the data is in a data warehouse, it’s in a table. Do you think retrieval methods perform better when the data is in a text format, like a docs file, or do they work just the same on tabular data in CSV or Parquet format? I’m planning on using LlamaIndex or LangChain. Thank you

47 Comments
2024/07/20
15:11 UTC

3

Recommendation models for User-Role Pairings

I have been working with Matrix Factorization ALS to develop a recommendation model that recommends new roles a user might want to request in order to speed up onboarding.

I have at best been able to achieve a 45-55% error rate when testing the model based off of roles it suggests and roles a user actually has. We have no ratings of user role recommendations yet, so we are just using an implicit rating of 1.

I think a recommendation model that is content-based (factoring in a user's job profile, seniority level, related projects, other applications they have access to, etc.) would perform better.

However, everywhere I look online for similar model implementations everyone is using collaborative ALS models and discussing these damn movie recommendation models.

A kNN model has scored about 66% accuracy but takes hours to run for the user base.

TL; DR: I am looking for recommendations for a recommendation model that uses the attributes of a user in order to recommend roles a user may need/want to request.
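
A minimal sketch of what an attribute-based recommender could look like, under stated assumptions: each role gets a profile made of the share of its current holders that have each attribute, and a user is scored against every role profile rather than against every other user. The data, attribute strings, and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def build_role_profiles(user_attrs, user_roles):
    """For each role, the share of its holders that have each attribute."""
    counts = defaultdict(Counter)
    holders = Counter()
    for user, roles in user_roles.items():
        for role in roles:
            holders[role] += 1
            counts[role].update(user_attrs[user])
    return {
        role: {attr: n / holders[role] for attr, n in attr_counts.items()}
        for role, attr_counts in counts.items()
    }


def recommend(user, user_attrs, user_roles, profiles, top_n=3):
    """Rank roles the user does not hold by average attribute match."""
    attrs = user_attrs[user]
    held = user_roles.get(user, set())
    scores = {
        role: sum(profile.get(a, 0.0) for a in attrs) / max(len(attrs), 1)
        for role, profile in profiles.items()
        if role not in held
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy data for illustration only
user_attrs = {
    "ana": {"dept:finance", "level:senior"},
    "ben": {"dept:finance", "level:junior"},
    "cho": {"dept:eng", "level:senior"},
    "dee": {"dept:finance", "level:junior"},
}
user_roles = {
    "ana": {"ledger_read", "ledger_write"},
    "ben": {"ledger_read"},
    "cho": {"repo_admin"},
    "dee": set(),
}
profiles = build_role_profiles(user_attrs, user_roles)

print(recommend("dee", user_attrs, user_roles, profiles, top_n=2))
```

Scoring against role profiles is linear in the number of roles per user, which avoids the all-pairs user comparison that makes kNN slow on a large user base.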

1 Comment
2024/07/19
19:57 UTC

74

How to improve a churn model that sucks?

Bottom line:

  1. Churn model sucks hard
  2. People who churn are over-represented (most customers churn)
  3. Lack of demographic data
  4. Only transactions, newsletter behavior and surveys

Any idea what to try to make it work?

95 Comments
2024/07/19
17:07 UTC

0

Which domain pays the most?

Currently working as a data analyst in the aviation field, but it’s kinda boring. I was hoping to find a better paying job out in California, but which domain pays the most?

25 Comments
2024/07/19
16:30 UTC

130

Will the market ever get better?

I came across multiple career-based posts where the pain point was no job offers even after extensive applications. While mistakes or bad luck could play a part, many blamed the market getting worse.

I understand the problem of work getting outsourced to Asia (I am from one such country), which creates problems in NA/EU. However, things aren't rosy here either. Due to population and tech-fluencers, people are flocking to data science positions like crazy.

To me, nothing short of a Thanos moment will fix this issue. What do you guys think: how can the market ever get even slightly better?

98 Comments
2024/07/19
11:53 UTC

6

Has anyone transitioned from data science to AI engineer position?

Basically the title. I am considering it since I do enjoy applying machine learning models at scale and can deal with the maths like linear algebra and calculus. I am not super keen on statistics, causality, and similar aspects of analysis, hence I am thinking of transitioning. Any input would be greatly appreciated!

33 Comments
2024/07/19
03:34 UTC

144

Why is the on-boarding process so disorganized in many companies?

Going into gripe mode.

At my current employer, and at many past ones, getting permissions to access data and applications has been a headache, often taking weeks for IT to set up. I have to ask around and the whole process is disorganized.

Why don't companies set this up before the new hire's first day, so they can hit the ground running? Especially if you're on a one-year contract, you can't waste time.

80 Comments
2024/07/18
19:55 UTC

7

Is m2cgen still alive?

It hasn't been updated for more than two years, so I guess it is abandoned? What a shame.

https://github.com/BayesWitnesses/m2cgen

2 Comments
2024/07/18
18:26 UTC

37

Has anyone ever left a company then successfully pitched themselves back as a consultant?

I’ve been building a really big project that my team and I are super proud of, and it is getting demoed to the board and CEO at a very large, top US company. People from across the company and our sister company keep asking us for tips, meet and greets, etc. Despite all the success of the project, the base business model isn’t hitting projections right now, and leadership is betting on cheaping out on raises, promotions, and properly staffing our team (reminder: multi-billion-dollar company lol). My teammates and I have practically had it, and I’m wondering if anyone has ever actually pulled something like this off with an LLC or consulting. At least at present, we have proven success with our AI use case, and we’re ahead of the competitors in the industry.

19 Comments
2024/07/18
17:13 UTC

2

Tools and methods for collecting user interaction data

Suppose I want to gather data on how users interact with a website, like their clicks and time spent on various pages, to train a discriminative model. I'm particularly interested in using these behaviors to predict whether the user will subscribe to a newsletter.

Do you have any recommended tools or methods for this task?

6 Comments
2024/07/18
17:07 UTC

10

Anyone experienced the Canada vs. UK job market?

Just wondering if anyone in this field has pursued work in both countries and can comment on which country is better for data professionals, machine learning engineers, etc.?

8 Comments
2024/07/18
16:56 UTC

107

How much does hyperparameter tuning actually matter

I say this as in: yes, obviously if you set ridiculous values for your learning rate and batch sizes and penalties or whatever else, your model will be ass.

But once you arrive at a set of "reasonable" hyperparameters, as in they're probably not globally optimal or even close, but they produce OK results and are pretty close to what you normally see in papers, how much gain is there to be had from tuning hyperparameters extensively?

41 Comments
2024/07/18
16:34 UTC

23

Anyone in cybersecurity willing to help a brother out?

So I have an interview coming up for a DS role in a cyber security team. I was told that I would be asked about security basics and security problem framing.

I have little to no idea what that is. I nailed the DS/ML and coding parts of the interview, and this is the last step. If anyone in the field can point me to the right resources, or give me an idea of what some of these problem framings might look like, I’d greatly appreciate it!

PS: the team knows that I have no security background at all.

18 Comments
2024/07/18
14:46 UTC
