/r/MLQuestions


A place for beginners to ask stupid questions and for experts to help them! /r/MachineLearning is a great subreddit, but it is for interesting articles and news related to machine learning. Here, feel free to ask any question about machine learning.

What kinds of questions do we want here?

"I've just started with deep nets. What are their strengths and weaknesses?" "What is the current state of the art in speech recognition?" "My data looks like X,Y what type of model should I use?"

If you are well versed in machine learning, please answer any question you feel knowledgeable about, even if it already has answers, and thank you!


Related Subreddits:

/r/MachineLearning
/r/mlpapers
/r/learnmachinelearning

/r/MLQuestions

57,290 Subscribers

1

Is implementing customer churn prediction on a real-time e-commerce dataset a novel topic?

Hey guys, I couldn't find studies on implementing customer churn prediction on real-time datasets in e-commerce. Either it is a novel topic that hasn't been explored yet, or I'm not looking in the right places. Any idea whether this can be considered a novel research topic? Also, if an e-commerce company has already implemented it but hasn't published it, would it still be novel?

0 Comments
2024/11/02
04:38 UTC

3

Scaling Techniques and their meaning

I normally use StandardScaler() as my scaling technique every time, but I'm curious about what other scaling techniques do.

My question is: do other scaling techniques differ in their output? (Each one applies a different formula to the features of a dataset while scaling.)

I would also like to know which scaling techniques are better to use after analyzing the data. Does the analysis give us any insight into which scaling technique to use?
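For concreteness, here is the kind of comparison I have in mind (a minimal sklearn sketch with one feature containing an outlier, just to show that the scalers really do produce different outputs):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature with an outlier, to make the differences visible.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # StandardScaler: (x - mean) / std; MinMaxScaler: maps to [0, 1];
    # RobustScaler: (x - median) / IQR, so it is far less affected by the outlier.
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(3))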

If possible, answer briefly...

1 Comment
2024/11/02
04:32 UTC

1

Creating a robot for aphasia patients with no clue where to begin. Help!

So I've resorted to reddit since literally no one in my school (I am in 12th grade rn) has an idea on how this would work. Any advice or tips or any breadcrumbs of anything will help immensely.

I'm currently leading a research project for our school and I have no idea where to begin with ML. I got a tip from an uncle of mine to start researching BART (an NLP model), but honestly I am just as lost. I tried watching hours of YouTube videos, but I still feel lost and overwhelmed about what to do.

The gist of the project involves both machine learning and Arduino: the bot listens to the broken speech of nonfluent aphasia patients through a microphone, tries to discern and fill in the gaps in that speech (this is where the BART/ML part kicks in), processes the audio, and reads the completed sentence out loud to the patient through speakers. Captions will also be flashed on an LCD screen, and the robot's face changes emotion depending on what is being spoken to the patient. It would also mimic human speech and conversation in general, and we're planning to train it on conversations so that the robot has more "intuition" when filling in the gaps in the patient's speech.
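To make the fill-in-the-gaps idea concrete, here is a minimal sketch of BART's text-infilling usage (assuming the Hugging Face transformers library and the facebook/bart-large checkpoint as one possible choice; the real system would still need a speech-to-text step in front of it and fine-tuning on conversational data):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# A transcript with a gap where the patient's speech broke off.
utterance = "I would like a <mask> of water."
inputs = tokenizer(utterance, return_tensors="pt")

# BART was pretrained with a text-infilling objective, so generate() can propose
# a completed sentence; the result could then be sent to a TTS engine and the LCD.
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))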

The problem starts with my groupmates having no clue how to integrate ML with Arduino, or even where to begin in the first place. Thanks for the responses, if there are any. I totally sound like an idiot right now, but man, I really do regret this project for how tedious it is lol

1 Comment
2024/11/02
04:30 UTC

1

Looking for help with GNN

0 Comments
2024/11/01
22:14 UTC

2

How to add research to resume?

Basically what the title says. I'm an undergrad student doing ML research and I'm currently looking for DS internships and ML internships, but I just don't know how to add my research to my resume. Should it be written the way it would be for SWE roles?

Such as: "Used [technology] that led to [XYZ] and improved this by [XYZ]"

Or should it be more like this: "Created a [model] that gave [XYZ results]." Kind of vague, but I'm kind of lost here.

4 Comments
2024/11/01
17:56 UTC

1

Catboost Inconsistency between Test Runs

Title says it all; I'm getting inconsistent metric results for my test dataset when using CatBoost. My LGBM model (and others) is consistent on train/val data, while there's (very) slight variation with CatBoost.

I know there's a randomness to Catboost, but to my understanding, setting a random seed should mitigate that. Below is my training code:

import numpy as np
from catboost import CatBoostClassifier, Pool

X = train_data[final_features]
y = train_data['pass']

# Treat every non-float column as categorical
is_cat = (X.dtypes != float)
cat_features_index = np.where(is_cat)[0]

pool = Pool(X, y, cat_features=cat_features_index, feature_names=list(X.columns))

model = CatBoostClassifier(**cat_params, verbose=False).fit(pool)

And test code:

X = test_data[final_features]
y_test = test_data['pass']

# Recomputed here but not actually used in the prediction below
is_cat = (X.dtypes != float)

y_pred = model.predict(X)

With cat_params (kept static for each run) being

{'learning_rate': 0.08047248508279288, 'depth': 6, 'subsample': 0.6327805587079891, 'colsample_bylevel': 0.6601989777908728, 'iterations': 467, 'random_seed':42}

Please let me know if there's anything obvious I'm missing that would cause this CatBoost inconsistency. I'm restarting my notebook and rerunning everything each time. I figured I'd post here since nothing really helped after extensive Googling.
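For reference, here is the kind of check I can run (reusing pool, the test-set X, and cat_params from above) to confirm whether the variation comes from training itself rather than from prediction:

m1 = CatBoostClassifier(**cat_params, verbose=False).fit(pool)
m2 = CatBoostClassifier(**cat_params, verbose=False).fit(pool)

# If these probabilities differ between two identically seeded fits, the
# nondeterminism is in training (e.g. thread-level effects), not in the test code.
p1 = m1.predict_proba(X)[:, 1]
p2 = m2.predict_proba(X)[:, 1]
print("max abs difference:", np.abs(p1 - p2).max())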

Thanks in advance for any help!

1 Comment
2024/11/01
05:57 UTC

3

First time fine tuning

How do I fine-tune a model like Llama 3 to extract important information from a given description? Also, do I have to do this process manually? I want it to extract very specific pieces of data and organize them in a special way, so I'm thinking I'll have to prompt it, tell it whether the output was correct, and keep producing my own data. Is there a way to automate the production of data so I don't always have to do it manually?
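For context, the kind of training record I'm imagining looks roughly like this (just a sketch; the field names and the example description are placeholders I made up, not a required format):

import json

# One hypothetical record: the instruction describes the extraction task and the
# output is the structured result the fine-tuned model should produce.
record = {
    "instruction": "Extract the product name, price, and color from the description.",
    "input": "The Aurora X2 kettle comes in matte black and costs $49.99.",
    "output": '{"product": "Aurora X2 kettle", "price": "$49.99", "color": "matte black"}',
}

# Appending one JSON object per line builds up a .jsonl training file.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")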

This is my first time doing this so any tips and guidance would be great. Thanks!

8 Comments
2024/11/01
00:48 UTC

1

Who would be interested in a hobbyist ML project?

Ok so I'm looking for those who might be interested in a hobbyist ML/ALife project that explores applying a novel genetic algorithm to NNs to create a neurally plastic online learner that doesn't rely on backpropagation.

The concept is a fully spatially embedded RNN where all connections, weights, and biases are derived from the neurons' spatial relationships to each other, with the GA applied to regulate those spatial relationships. The GA framework is fully developed and has its own set of interesting properties I'm not going to get into here.

Questions and reasonable criticism welcome but please be nice I'm not looking to pick a fight.

Here's a link to the branch of my GA git repo that contains this project (at least so far) if you want to check it out. Also, an image of a randomly initialized SP_NN for the cool factor.

https://preview.redd.it/wm3au41h26yd1.png?width=1698&format=png&auto=webp&s=1c03041d560c62c703fc831dfc0d458b068b46a6

0 Comments
2024/10/31
22:18 UTC

1

wandb Freezing Accuracy for Transformer HPO on Binary Classification Task

I started using wandb for hyperparameter optimization (HPO) purposes (this is the first time I'm using it), and I have a weird issue when fine-tuning a Transformer on a binary classification task. The fine-tuning works perfectly fine when not using wandb, but the following issue occurs with wandb: at some point during the HPO search, the accuracy freezes at 0.75005 (while previous accuracy results were around 0.98), and subsequent sweep runs have the exact same accuracy even with different parameters.

There must be something wrong with my code or the way I am dealing with this, because it only occurs with wandb. I have tried changing things in my code several times, but to no avail. I did use wandb with a logistic regression model and it worked fine, though. Here is an excerpt of my code:

import numpy as np
import wandb
from transformers import Trainer, TrainingArguments

# Excerpt: `model`, `train_dataset`, `test_dataset`, and the `accuracy` metric
# are defined elsewhere in the notebook.

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

sweep_configuration = {
    "name": "some_sweep_name",
    "method": "bayes",
    "metric": {"goal": "maximize", "name": "eval_accuracy"},
    "parameters": {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-3
        },
        "batch_size": {"values": [16, 32]},
        "epochs": {"value": 1},
        "optimizer": {"values": ["adamw", "adam"]},
        'weight_decay': {
            'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
        },
    }
}

sweep_id = wandb.sweep(sweep_configuration)

def train():
    with wandb.init():
        config = wandb.config

        training_args = TrainingArguments(
            output_dir='models',
            report_to='wandb',
            num_train_epochs=config.epochs,
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            per_device_train_batch_size=config.batch_size,
            per_device_eval_batch_size=16,
            save_strategy='epoch',
            evaluation_strategy='epoch',
            logging_strategy='epoch',
            load_best_model_at_end=True,
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=test_dataset,
            compute_metrics=compute_metrics,
        )

        trainer.train()

        final_eval = trainer.evaluate()
        wandb.log({"final_accuracy": final_eval["eval_accuracy"]})

        wandb.finish()

wandb.agent(sweep_id, function=train, count=10)

0 Comments
2024/10/31
22:01 UTC

1

Single shot classifier

Is there a way to give one image of a person and have a model identify and track that person in a video using features other than their face? Maybe it could detect all people and output the probability that each one is the same person, and some filtering could then confirm the match based on model accuracy. Can this be done, and how? I'm looking to use this for a robotics project.
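To illustrate the kind of pipeline I'm imagining, here is a rough sketch (assuming torchvision and a generic pretrained backbone rather than a purpose-built re-identification model; the image paths are placeholders):

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
# Drop the classification head so the network outputs a 2048-d embedding per image.
backbone = torch.nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).flatten(1)

# Compare the reference photo against a person crop from a video frame;
# higher cosine similarity suggests it is the same person.
ref, candidate = embed("reference_person.jpg"), embed("detected_crop.jpg")
print(torch.nn.functional.cosine_similarity(ref, candidate).item())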

2 Comments
2024/10/31
21:53 UTC

2

Is this a good roadmap?

Hi everyone! I’m currently finishing my Master’s in Physics and starting to transition into machine learning. Right now, I work as a junior data engineer, and I’d like some feedback on a roadmap I’ve put together for myself. My goal is either to land a position as a data scientist or ML engineer or eventually pursue a Ph.D. to apply for research positions. For context, I’m from Argentina, so the job market here might be a bit different.

Here’s the roadmap I’ve planned:

I’m currently taking Andrew Ng’s ML specialization and working through An Introduction to Statistical Learning (ISL), doing the exercises.

After finishing the specialization, I plan to read Machine Learning with PyTorch and Scikit-Learn while continuing to follow topics in ISL.

Then, I’d like to work on a few projects that interest me, particularly around recommendation systems and classification, but in an end-to-end format (starting with initial analysis in a notebook and then moving towards a production-ready implementation using MLOps tools, etc.).

Finally, to round out the theoretical side, I plan to read The Elements of Statistical Learning (ESL) and Dive into Deep Learning.

I’ve set aside around 6 months for this, given that I’m finishing my Master’s while also working.

Do you think this is a good roadmap? Or is it too much theory and reading and not enough coding?

Thanks!

0 Comments
2024/10/31
21:37 UTC

11

I want to understand the math, but it's too tedious.

I love understanding HOW everything works and WHY everything works, and of course, to understand Deep Learning better you need to go deeper into the math. For that very reason I want to build up my foundation once again: redo the probability, stats, and linear algebra. But it's just tedious learning the math, the details, the notation, everything.

Could someone just share some words from experience that doing the math is worth it? Like I KNOW it's a slow process but god damn it's annoying and tough.

Need some motivation :)

20 Comments
2024/10/31
20:36 UTC

2

Cloud GPU billing - how it works

I had experimented with a traditional cloud GPU provider (with its own captive fleet and data center), but was a bit surprised and disappointed with how the overall billing worked. I wanted to check whether on-demand usage and billing at Tensordock, Vast.ai, Runpod, and other such cloud GPU providers works similarly, or is significantly better.

The traditional cloud GPU provider I tried earlier charged for:

  • VM instance (with the GPU of choice and RAM/processor cores of choice) @ say $0.25 / hr from the moment I launched it until the time it was completely shut down, which included time spent in:
    • Deployment of a VM image - about 25 minutes for a stock image, but as much as 50 minutes for a custom image (with, say, a few LLM models, the ollama runtime, custom Python libs, etc.)
    • Usage of the VM -- of course, including time for deploying new SW, applying patches, and deploying new models, apart from the actual inference (or training) work
    • Shutdown of the VM image while taking a snapshot (as a custom image), which could take as much as 50 minutes
  • Storage of custom VM images / snapshots @ say $0.05 / hr
  • Essentially, if I needed to use the VM image for 3 hrs per day of actual inference-only work, I'd have to pay for:
    • 1.5 hrs of deploy and undeploy of the VM plus 3 hrs of actual usage = 4.5 hrs counted as usage per day, i.e. $1.125 per day
    • 19.5 hrs of storage of the custom VM image / snapshot per day, i.e. $0.975 per day

So a total of $2.10 per day, or $63 per month -- an amount which seems unreasonably high for the actual usage, i.e. around 90 hrs per month. Wondering if Tensordock and Vast.ai also operate on the same model?

My aim is to run the cloud GPU as a remote inference endpoint for things like chatbot usage, coding assistance, or other consumer inference workloads, and only rarely perhaps some fine-tuning. I plan to do this during my after-office hours, so limited to a max of 3 hrs a day on average.

0 Comments
2024/10/31
18:15 UTC

1

Caching Methods in Large Language Models (LLMs)

[Image-only post: a gallery of 16 slides illustrating caching methods in LLMs; no accompanying text.]

0 Comments
2024/10/31
16:02 UTC

2

What if we created an AI to defeat World of Warcraft raid bosses?

Just as AlphaGo and the StarCraft AI (AlphaStar) made significant contributions to the advancement of reinforcement learning, why not conduct research to develop an AI specifically for defeating World of Warcraft raid bosses?

I believe that achieving significant research outcomes in the interactions of 20 players and real-time decision-making would be possible when tackling WoW raid bosses.

In particular, rather than training the AI on the patterns of existing raid bosses, it could learn and adapt to new bosses without any prior information, similar to AlphaZero. This approach, especially when new bosses emerge in events like the Race to World First, would be much more challenging and beneficial for the advancement of AI technology compared to previous efforts with AlphaGo or AlphaStar.

However, I’m just a beginner developer who loves World of Warcraft and only has basic knowledge of AI, so I would love to hear the opinions of experts who are well-versed in this field!

If possible, could it be achievable for the AI to compete in the Race to World First and potentially beat teams like Liquid or Method, just as AlphaGo surpassed professional Go players?

2 Comments
2024/10/31
13:35 UTC

1

Best AI for Identifying UI Elements in Mobile App Screenshots?

Hi,

I am working on a Python tool that should be able to identify elements on mobile application screenshots.

For example, in a screenshot, I want the AI to find the Play button and have the model return its X and Y coordinates. I have tried GPT-4 and Claude, but they are not accurate. Which AI or LLM would be best for this? Should I consider training my own model? Please tell me which technologies are best for this project.

https://preview.redd.it/p73srdog23yd1.jpg?width=633&format=pjpg&auto=webp&s=a912c293e78b6cc1194b9610ba097bfa26985d95
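For comparison, classical template matching in OpenCV can serve as a non-ML baseline when the target element's appearance is fixed (a minimal sketch; the file paths are placeholders):

import cv2

# Full screenshot and a cropped image of the play button (placeholder paths).
screenshot = cv2.imread("screenshot.png")
template = cv2.imread("play_button.png")

result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

# Centre of the best match; a max_val close to 1.0 means a confident hit.
h, w = template.shape[:2]
x, y = max_loc[0] + w // 2, max_loc[1] + h // 2
print(f"match score={max_val:.2f}, centre=({x}, {y})")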

5 Comments
2024/10/31
12:14 UTC

2

What should I do after this?

I am about to complete the Andrew Ng course on Coursera. What should my future roadmap be?

I've gotten a hold of everything taught in the videos and am thinking of doing the labs from GitHub for free.

Helpp🙏🙏

7 Comments
2024/10/31
04:50 UTC

0

ELI5: Why can't an AGI change its mind?

To be clear, I am pretty ignorant about computer science. The max of my CS knowledge is coding some MATLAB during a mechanical engineering degree.

I read Life 3.0 and Superintelligence, and they very clearly cover some of the capabilities and risks of AGI and the different routes to the emergence of AGI. Something I found interesting and a bit odd was the lack of discussion of an AGI agent changing its goals. The alignment problem is clear to me, and how in really any given scenario the agent would be likely to eliminate humanity to achieve its goal and/or protect itself, i.e. the paperclip collector. I've been left wondering if there is a case where the agent can be programmed to collect paperclips and unilaterally changes its goal to something else, such as collecting cheese instead of paperclips, or leaving no trace on Earth and flying into a black hole. I get how flying into a black hole gets in the way of getting paperclips, but can it stop caring about paperclips? During an intelligence explosion and the iterations of recursive self-improvement within it, could an AGI change its utility function? (Hope I used that term right.) I feel I'm missing something fundamental about the nature of programming, given that the topic of an agent changing its goals was so conspicuously absent in these books. It just seemed strange to me that something could be so intelligent it's almost inconceivable to my tiny human brain, yet it cannot "change its mind". It can accomplish goals and objectives beyond comprehension, yet it can't go "you know, I was originally going to stay home, eat pizza and play video games, but instead I'm going to the gym". Again, I think I'm missing something glaring here given how stuck I am on this anthropomorphization.

Tldr: can an AGI be programmed to collect paperclips and then unilaterally change its goal to something else?

7 Comments
2024/10/31
03:16 UTC

1

Looking for studies where machine learning is implemented on real time dataset

Hey guys, I couldn't find studies on implementing machine learning on real-time datasets in e-commerce. I think it is a novel topic that hasn't been explored yet. Any idea whether this can be considered a novel research topic? Also, if an e-commerce company has already implemented it but hasn't published it, would it still be novel?

3 Comments
2024/10/31
02:06 UTC

1

[P] PyTorch Quantization of model parameters for deployment on edge device

0 Comments
2024/10/30
20:35 UTC

4

PhD vs Data Scientist 2 at Tier-2 company

I am a final-semester MSCS student at Texas A&M. I just defended my Master's thesis and received good, positive feedback. I have submitted a paper to NAACL 2025 on the same work. However, I do not have any previous papers. My final goal is to be able to do research on generative AI, specifically on its reasoning aspect, in research labs like Meta, Google, Amazon, etc., hopefully soon.

I do have an offer for a Data Scientist 2 role at a Tier-2 company (it's an old HDD company - I guess it would be Tier-2 for AI/ML stuff); however, the work is mostly traditional ML and some computer vision. I could join it and try switching after some time.

My Advisor is asking me to apply to better universities in the next cycle as he doesn’t have funding right now. And yeah, I have an education loan of $30k to pay off.

I am really in turmoil. Please help me and give me some perspective.

2 Comments
2024/10/30
19:20 UTC

2

Standardising data for a RAG system

I'm currently working on a RAG system and would like some advice on processing the data prior to storing it in a database. The raw data can vary from file to file in terms of format and can contain thousands of lines of information. At first I used Claude to structure the data into a YAML format, but it failed to comprehensively capture all the data. Any advice or pointers would be great - thanks.

5 Comments
2024/10/30
14:59 UTC

2

I am new to machine learning and everything, I need help standardizing this dataset.

I am interning at a recruitment company, and I need to standardize a dataset of skills. The issue I'm running into right now is that there may be typos, like modelling vs. modeling (small spelling mistakes), things like bash scripting vs. bash script, or just things that semantically mean the same thing and could all come under one header. Any tips on how I would go about this, and would ML be useful?
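One lightweight starting point is fuzzy string matching from the Python standard library, which catches small spelling variants before any ML is involved (a minimal sketch; the skill lists are made-up examples):

import difflib

canonical = ["modeling", "bash scripting", "python", "machine learning"]
raw_skills = ["modelling", "bash script", "Python", "machine lerning"]

for skill in raw_skills:
    # Map each raw entry to its closest canonical skill if the match is strong enough;
    # skills that are worded differently but mean the same thing still need
    # embeddings or a synonym table.
    match = difflib.get_close_matches(skill.lower(), canonical, n=1, cutoff=0.8)
    print(skill, "->", match[0] if match else "no match")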

2 Comments
2024/10/30
13:31 UTC

1

How can I effectively handle variable-sized tag arrays in a content-based filtering system?

I’m building a content-based filtering system using the following data structure:

track_id name artist tags year duration_ms danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature

0 0 0 [95, 28, 80, 86, 57, 73] 2004 222200 0.355 0.918 1 -4.36 1 0.0746 0.00119 0.0 0.0971 0.24 148.114 4

1 1 1 [95, 28, 80, 26, 86, 35, 78, 31, 92] 2006 258613 0.409 0.892 2 -4.373 1 0.0336 0.000807 0.0 0.207 0.651 174.426 4

2 2 2 [95, 28, 86, 78, 13] 1991 218920 0.508 0.826 4 -5.783 0 0.04 0.000175 0.000459 0.0878 0.543 120.012 4

3 3 3 [95, 28, 80, 86, 57, 35, 73, 92] 2004 237026 0.279 0.664 9 -8.851 1 0.0371 0.000389 0.000655 0.133 0.49 104.56 4

4 4 4 [95, 28, 80, 86, 57, 35, 78, 92] 2008 238640 0.515 0.43 7 -9.935 1 0.0369 0.0102 0.000141 0.129 0.104 91.841 4

The issue I’m facing is with the tags column, which contains an array of tags (ranging from 2 to 20+ tags per track) rather than a single value. I’m looking for advice on the best approach to handle these varying-sized arrays for content-based filtering. Currently I'm using TensorFlow and sklearn for some of it.

0 Comments
2024/10/30
13:10 UTC

1

GPU benchmarks for boosting libraries.

Basically the title explains it all. There are a lot of performance comparisons for different types of neural nets and float precisions. But I have failed to find ANY benchmarks for A100/4090/3090/A6000 for XGBoost/Catboost/lightgbm libraries.

The reason I am looking for this is that I am doing predictions on big tabular datasets with A LOT of noise, where NNs are notoriously hard to fit.

So currently I am trying to understand whether there is a big difference (say 2-3x in performance) between, say, 1080 Ti, 3090, A6000, and A100 GPUs. (The reason I mention the 1080 Ti is that the last time I ran large boosting models was on a chunk of 1080 Tis.)

The size of datasets is anywhere between 100Gb and 1TB (f32).
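In case it helps frame the question, here is the kind of timing sketch I would rerun on each candidate GPU (assuming XGBoost 2.x, where GPU training is selected via device="cuda"; older releases used tree_method="gpu_hist" instead, and the synthetic data here is just a stand-in for the real tables):

import time
import numpy as np
import xgboost as xgb

# Synthetic noisy tabular data; scale n and d up towards the real dataset size.
rng = np.random.default_rng(0)
n, d = 1_000_000, 100
X = rng.standard_normal((n, d), dtype=np.float32)
y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(np.int32)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "tree_method": "hist", "device": "cuda", "max_depth": 6}

start = time.perf_counter()
xgb.train(params, dtrain, num_boost_round=200)
print(f"train time: {time.perf_counter() - start:.1f}s")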

Any links/advice/anecdotal evidence will be appreciated.

0 Comments
2024/10/30
11:53 UTC

0

Stock Market Prediction

Hey guys :) I was wondering which type of NN architecture one could use to train a model on time series data of, for example, stock/index prices. I am new to the field and would like to play around with this to start :D Advice would be highly appreciated :)
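A common starting point for sequences like daily prices is a small LSTM; here is a minimal PyTorch sketch (the input is random noise, just to show the expected tensor shapes, and it says nothing about whether prices are actually predictable):

import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    def __init__(self, n_features=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predict the next value

    def forward(self, x):             # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # use the last time step's hidden state

# Fake batch: 8 windows of 30 days of a single price series.
model = PriceLSTM()
x = torch.randn(8, 30, 1)
print(model(x).shape)  # torch.Size([8, 1])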

1 Comment
2024/10/30
11:06 UTC

1

Need advice on building a model to classify typefaces

Hey everyone, I'm trying to build a model that can classify typefaces into serif and sans-serif categories (and even subcategories, as in the Vox-ATypI classification; see https://en.wikipedia.org/wiki/Vox-ATypI_classification), but I'm having trouble figuring out the best way to proceed. I could really use some advice!

My first approach was to convert each glyph of the font to SVG format and train on an SVG dataset. The problem is, I'm stuck when it comes to finding a library or method to effectively train on SVG data directly. Most resources I've found focus on image-based training, but I'd like to maintain the vector nature of the data for more accuracy if possible. Does anyone have any suggestions on libraries or frameworks that can help me work directly with SVG data?

The second approach I'm considering involves using FontForge. FontForge can give me instructions on how to draw each glyph in the font, and I was thinking of creating a dataset based on these curve instructions. However, I'm unsure if this is a viable route for training a classifier, and I'm also wondering if anyone knows if it's "allowed" in the sense of common practice or standard methods.
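A third option I'm considering, in case the vector-based routes stall, is to rasterize each glyph and train a small image classifier; here is a rough PyTorch sketch of such a model (the rendering and dataset code is omitted, and the 64x64 grayscale input size is just an assumption):

import torch
import torch.nn as nn

# Tiny CNN for 64x64 grayscale glyph images, 2 classes (serif / sans-serif);
# num_classes could be extended for Vox-ATypI-style subcategories.
class GlyphNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                      # x: (batch, 1, 64, 64)
        return self.classifier(self.features(x).flatten(1))

model = GlyphNet()
print(model(torch.randn(4, 1, 64, 64)).shape)  # torch.Size([4, 2])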

Any pointers, advice, or resources would be super helpful! Thanks in advance :)

3 Comments
2024/10/30
08:27 UTC
