/r/MLQuestions

Photograph via snooOG

A place for beginners to ask stupid questions and for experts to help them! /r/Machine learning is a great subreddit, but it is for interesting articles and news related to machine learning. Here, you can feel free to ask any question regarding machine learning.

What kinds of questions do we want here?

"I've just started with deep nets. What are their strengths and weaknesses?" "What is the current state of the art in speech recognition?" "My data looks like X,Y what type of model should I use?"

If you are well versed in machine learning, please answer any question you feel knowledgeable about, even if they already have answers, and thank you!


Related Subreddits:

/r/MachineLearning
/r/mlpapers
/r/learnmachinelearning

/r/MLQuestions

59,077 Subscribers

1

How Data Annotation Can Supercharge Your AI Model?

Struggling to get your AI models to perform better? Data annotation might be the game-changer you need! It ensures accurate, labeled datasets while saving time and resources. Plus, it scales with your project needs. 

Check out this blog for insights: How Data Annotation Can Supercharge Your AI Model? 

What are your thoughts on outsourcing annotation tasks? Let’s discuss! 

0 Comments
2024/12/02
12:01 UTC

1

Tech stack for an ML program based on a prediction logic.

Is this the right tech stack?

  1. Data Acquisition and Processing:
  • Sensor Integration:
    • Hardware:
      • Cameras (RGB, depth, thermal)
      • Microphones
      • LiDAR sensors
      • Accelerometers
      • Gyroscopes
    • Software:
      • Sensor drivers and libraries (e.g., OpenCV, ROS)
      • Data acquisition frameworks (e.g., LabVIEW, DAQmx)
      • Signal processing libraries (e.g., NumPy, SciPy)
  • Data Preprocessing and Feature Extraction:
    • Image/Video Processing:
      • OpenCV
      • TensorFlow/Keras
      • PyTorch
    • Audio Processing:
      • LibROSA
      • TensorFlow/Keras
      • PyTorch
    • Sensor Fusion:
      • Kalman filters
      • Particle filters
      • Deep learning techniques (e.g., attention mechanisms)
  1. Model Development and Training:
  • Deep Learning Frameworks:
    • TensorFlow
    • PyTorch
    • JAX
  • Multimodal Fusion Techniques:
    • Early fusion (concatenate features)
    • Late fusion (combine predictions)
    • Feature-level fusion (combine features at intermediate layers)
  • Predictive Modeling:
    • Recurrent Neural Networks (RNNs)
    • Long Short-Term Memory (LSTM) networks
    • Gated Recurrent Units (GRUs)
    • Transformer models
    • Convolutional Neural Networks (CNNs)
    • Graph Neural Networks (GNNs)
  1. Deployment and Inference:
  • Cloud Platforms:
    • AWS
    • GCP
    • Azure
  • Edge Computing:
    • TensorFlow Lite
    • PyTorch Mobile
    • Edge TPU
  • Real-time Processing:
    • C++
    • CUDA
    • OpenCL

Additional Considerations:

  • Data Storage and Management:
    • Databases (e.g., PostgreSQL, MongoDB)
    • Data lakes (e.g., Hadoop, Databricks)
  • Model Optimization and Deployment:
    • TensorFlow Serving
    • TorchServe
    • MLflow
  • Ethical Considerations:
    • Bias and fairness in AI
    • Privacy and security of sensitive data

Example Use Case: Autonomous Vehicle For an autonomous vehicle, the tech stack might involve:

  • Sensor Integration: Cameras, LiDAR, radar, and ultrasonic sensors.
  • Data Processing: Image and point cloud processing, sensor fusion.
  • Model Development: Deep learning models for object detection, semantic segmentation, and motion prediction.
  • Deployment: Cloud-based training and edge-device inference.
0 Comments
2024/12/02
09:30 UTC

2

question about supervised vs unsupervised learning

I'm trying to understand AI concepts (as a layperson, for fun). Could somebody tell me if the following statements are roughly accurate:

-Supervised learning trains an AI to make increasingly valid predictions by giving it feedback

-Unsupervised learning creates a detailed representation of a dataset and all the relationships within it, which can then yield insights into the data

(Just looking for super general "wide-angle" understanding at this point.)

3 Comments
2024/12/02
01:50 UTC

3

How Can I Start Learning About Al?

Hi all

I'm a recent medical school graduate with 3-5 months of free time before I dive into my next steps. l've always been curious about artificial intelligence, especially how it's used in healthcare, but I don't have any background in programming or computer science.

The thing is, I'm not sure coding is for me-l'm more interested in understanding Al conceptually and how I can apply it to real-world scenarios.

I'd love to hear your advice on learning resources or how to approach this vast topic. Thanks!!

3 Comments
2024/12/01
22:09 UTC

1

[Request] Arxiv Endorsement?

Hi everyone,

I am from the University of Munich and have previously published in the Computer Vision category on arXiv and am now only endorsed for the cs.CV category.  My new paper, “Enhancing Deep Learning Model Robustness through Metamorphic Re-Training”, is more focused on general AI, and therefore I would like to submit it to the cs.AI category :)

I need an endorsement to proceed. If you’re qualified, I’d appreciate your help: https://arxiv.org/auth/endorse?x=ZKK9HC

You can also find me on LinkedIn if you'd like to connect. Thanks in advance!

0 Comments
2024/12/01
21:21 UTC

2

I want to fine tune a model with my style, any beginner friendly roadmap?

Is there such a thing as a "fill here with your own data " jsonl file?

1 Comment
2024/12/01
20:30 UTC

1

Doubt on how to use the RFE

I have a dataset with very few samples (less than 2000) but a large number of features (nearly 50000). I’ve used RFE and other feature selection methods but I’m unsure on how to find the optimal number of features for the function. Is there any way to think about this rather than just using a GridSearch with several values for it?

0 Comments
2024/12/01
18:45 UTC

1

[D] Needed help with a basic suspicious url detection with ML

I am trying this whole ML thing, pretty new to it.

I have been trying to predict with some degree the possibility of an url being malicious. I understand that without looking at the contents of the page, but WHOI takes a lot of time. I looked at 2 datasets.

What i did was, create a set of 24 features (The whois detection was taking time, so skipped that) . So like, count of www, sub-domains, path splits, count of query params etc. The two datasets are a bit different, one of them are tagged with benign, phishing, malware. The other one has status (1, 0) .

I trained it with keras as such.

def model_binaryclass(input_dim):
    model = Sequential(
        [
            Input(shape=(input_dim,)),
            Dense(128, activation="relu"),
            Dropout(0.2),
            Dense(64, activation="relu"),
            Dropout(0.2),
            Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", "Recall", "Precision"],
    )
    return model

In my last try, I used only first dataset, But when I try to verify, it against some urls, all of them have the same probability.

Verification code:

special_chars = ["@", "?", "-", "=", "#", "%", "+", ".", "$", "!", "*", ",", "//"]


def preprocess_url(url):
    url_length = len(url)
    tld = get_top_level_domain(url)
    tldLen = 0 if tld is None else len(tld)

    is_https = 1 if url.startswith("https") else 0
    n_www = url.count("www")

    n_count_specials = []
    for ch in special_chars:
        n_count_specials.append(url.count(ch))

    n_embeds = no_of_embed(url)
    n_path = no_of_dir(url)
    has_ip = having_ip_address(url)
    n_digits = digit_count(url)
    n_letters = letter_count(url)
    hostname_len = len(urlparse(url).netloc)
    n_qs = total_query_params(url)

    features = [
        url_length,
        tldLen,
        is_https,
        n_www,
        n_embeds,
        n_path,
        n_digits,
        n_letters,
    ]
    features.extend(n_count_specials)
    features.extend([hostname_len, has_ip, n_qs])

    print(len(features), "n_features")

    return np.array(features, dtype=np.float32)


def predict(url, n_features=24):
    input_value = preprocess_url(url)
    input_value = np.reshape(input_value, (1, n_features))

    interpreter.set_tensor(input_details[0]["index"], input_value)
    interpreter.invoke()

    output_data = interpreter.get_tensor(output_details[0]["index"])
    print(f"Prediction probability: {output_data}")

    # Interpret the result
    predicted_class = np.argmax(output_data)
    print("predicted class", predicted_class, output_data)


uus = [
    "https://google.com",
    "https://www.google.com",
    "http://www.marketingbyinternet.com/mo/e56508df639f6ce7d55c81ee3fcd5ba8/",
    "000011accesswebform.godaddysites.com",
]

[predict(u) for u in uus]

The code to train is on github .

Can someone please point me in the right direction? The answers like this.

24 n_features
Prediction probability: [[0.99999964]]
predicted class 0 [[0.99999964]]
24 n_features
Prediction probability: [[0.99999946]]
predicted class 0 [[0.99999946]]
24 n_features
Prediction probability: [[1.]]
predicted class 0 [[1.]]
24 n_features
Prediction probability: [[0.963157]]
predicted class 0 [[0.963157]]
2 Comments
2024/12/01
16:54 UTC

1

[D] How to compare two generative model's representation capabilities?

Assume that 2 models:

fully connected neural network model 1 = n number of parameters with 3 layers

fcnn model 2 = n number of parameters with 12 layers

the seecond model has higher representation power than model 1. but these two models have the same architecture so it is easy to compare. How to do it with generative models such as diffusion, gan, vae, nowmalizing flows, energy based etc?

0 Comments
2024/12/01
14:19 UTC

2

When should I use GNNs?

I'm finding it difficult to build an intuition around when to use GNNs? I'm specifically interested in using GNNs to solve predictive tasks on relational data. Are there any surveys, papers or benchmarks that I can refer to?

Thanks!

3 Comments
2024/12/01
14:15 UTC

1

How can TransformerXL be used for text classification?

For a normal encoder-only Transformer like BERT, I know we can add a CLS token to the input that "aggregates" information from all other tokens. We can then attach a MLP to this token at the final layer to produce the class predictions.

My question is, how would this work for TransformerXL, which processes a (long) input in small chunks? It must output a CLS token every chunk, right? Do we then only use the last of these CLS tokens (which is produced when TrXL consumes the final chunk of the input) to make the class prediction, and compute the loss from this? Or is there a totally different way to do this?

2 Comments
2024/12/01
05:00 UTC

1

Working with vectors of data, building intuition on disparate data sources

If I have two vectors:

  • exam scores [100, 80, 90]
  • exam weights [.4, .3, .3]

I can take the dot product of these two and find *i'*s total score. Now across many rows, I can find "the class" total scores by i. Another example, vector embeddings of tokens can help me find similarity between them (i.e., projection).

Now, imagine I have two other random vectors:

  • happiness score [10, 8. 9]
  • sadness score [10, 8, 8]

Mathematically, nothing is stopping me from taking the dot product of happiness score and sadness score for person i.

This is where my intuition isn't strong. What "possibly" could the dot product of these two mood scores tell me? I am just looking for any random ideas or ways to take the dot product and "make it make sense".

Overall, This will help me take different vectors of data sets and infer insights. So, if you are given two vectors, how do you approach "combining them" to product an output that "makes sense"?

2 Comments
2024/11/30
22:35 UTC

1

Help needed in StackGAN text to img project

I was training a model on 2d floor images dataset but it’s accuracy is getting worsen while training and it is generating horrible grey images

This is code link

https://colab.research.google.com/drive/12OZyv9rj4UBRG0k0Yl2rA1X8whZpmllL?usp=drive_link

Dataset link:

https://www.kaggle.com/datasets/harshratna/2d-floor-plan-dataset-with-text-descriptions-new

0 Comments
2024/11/30
21:56 UTC

9

Beginner project: Is my shape recognition method close to machine learning?

Hi everyone,

I’m a beginner in ML, and these days I explored convolutions and kernels in image processing. To understand how ML works, I tried to create a simple project to mimic the process. Here’s what I did:

  1. Feature Extraction with Kernels:
    • I made custom kernels to detect features like horizontal and vertical lines (for rectangles) and slanted edges (for triangles).
    • From these, I created a vector space (e.g., span(feature1, feature2, ...)).
  2. Classification with KNN:
    • I calculated feature vectors for different rectangles and triangles to train the algorithm.
    • Then, for a new image, I checked if its vector was closer to rectangle or triangle examples.

The Outcome:

It worked! it can recognize a majority of rectangles and triangles using this method.

My Questions:

  1. Is this approach close to machine learning?
  2. if no what’s the next step to make it more like real ML?

https://reddit.com/link/1h3koyv/video/n61ptj9iz74e1/player

4 Comments
2024/11/30
19:48 UTC

3

Which parameters within a UMAP model are affected as a result of a single training process?

UMAP fit passes the training data to the model. In which parameters are the learned information stored in the model object? The ones I found are: _raw_data, graph_, _sigmas, _rhos, graph_dists_

1 Comment
2024/11/30
11:29 UTC

2

Project Idea suggestions for Machine Learning

Please give me some suggestions on project ideas to stand out in my résumé. I did few projects like Ecommerce(frontend), object detection models and a chat app but seeing my friend's résumé. I realised that these are very generic. Please give me any recommendations or suggestions.

My key skills are Reactjs, Django, NodeJs, Python, C++, PyTorch, Sklearn, MySQL and mongo

1 Comment
2024/11/30
08:21 UTC

0

Is getting a job in this field worth it as a 24 year old

I never had any interest in the corporate field, but perhaps if I funneled all of my energy into ai, which I’m already fond of- would it be a dumb idea or should I pursue this course, I don’t know a whole lot about ai/ml other than I’d like to learn how to make a robot, and I heard it could potentially be good money

3 Comments
2024/11/30
04:08 UTC

3

Struggling to Find Internships a a 3rd Year PhD Student

A little background: I am a 3rd year (technically 4th year but advisor transferred universities after 1 year and asked me to follow him, so bye bye 1 year of course work) PhD student in the ECE department studying machine learning/computer vision at a lesser known school. It is by no means a CMU, Berkley, Stanford, etc. university, but it is still the flagship university for the state.

I've had what I would call a moderately successful first 3 (4) years of the PhD program as I feel like my math skills, particularly probability, are on point and I have a publication record to more or less back that up. Only 3 publications at some lesser known venues (along the lines of some B+ papers or workshop papers at A*), but I remain optimistic that ICML will workout in my favor this review cycle.

With that, I have been really struggling to find meaning/impactful PhD internships both last year and this application year. I was able to land a DoD internship, but those DoD positions seem to be such wastes as you are hardly allowed to publish external to DoD, and even if one is ok with that and continues that career track, DoD positions tend to prioritize managerial, acquisition focused type positions where you don't get to do real research. I've tried applying to what I think those impactful positions might be (Adobe, and any FAANG-type place), but I have virtually 0 success with that. I try to stay hopeful in all this, but every time I think about this hyper-competitive job market I just get down and worry that upon graduation, finding a meaningful position will be rough. Hoping someone can tell me it's really not so bad; maybe I just needed a place to rant. If you share similar feelings at all, or have been successful in navigating a similar situation, please let me know.

1 Comment
2024/11/30
02:57 UTC

7

What does it take to become a senior machine learning engineer?

Hello,

I was wondering how a entry level machine learning engineer becomes a senior machine learning engineer. Is the skills required to become a Sr ML engineer learned on the job, or do I have to self study? If self studying is the appropriate way to advance, how many hours per week should I dedicate to go from entry level to Sr level in 3 years, and how exactly should I self study? Advice is greatly appreciated!

3 Comments
2024/11/30
00:56 UTC

0

from interoir image to 3D i interactive model

hello guys , hope you are well , is their anyone who know or has idea on how to convert an image of interior (panorama) into 3D model using AI .

2 Comments
2024/11/29
21:31 UTC

1

Is there any tool that can create questions based on text i provide

i want a offline ai model, that can create questions based on text I provide

1 Comment
2024/11/29
18:01 UTC

3

OCR with self-trained model from scratch

Hello ladies and gentlemen,

I found that in my company there're a lot of manual effort is required to manually transcribe the client info forms filled by clients and input them into our system. (Using digital input form for client is not a feasible option)

During the past couple years, there are already thousands of transcribed information into our system as well as the scanned copies of them.

Ideally, I'd like to train my own model to recognize the hand writing with a supervised model.

with the scanned copies as the input, and the already transcribed details as the output

In this scenario, do I need to have a powerful GPU/ can it be done with a m4 Mac mini (that I was currently using)? I just did some proof of concept with easyocr today with the Mac and would love to see how far I can go with it.

Thanks heaps.

6 Comments
2024/11/29
13:23 UTC

5

Looking for Advice on Optimizing K-Means Clustering Algorithms

Hello everyone,

I’m currently diving deeper into machine learning and have just learned the basics of K-means clustering. I'm particularly interested in understanding more about how to optimize the algorithm and explore alternative clustering techniques.

So far, I’ve heard about K-means++ for better initialization of centroids, but I’d love to learn about other strategies to improve performance, such as speeding up the algorithm for larger datasets, enhancing cluster quality evaluation (e.g., silhouette scores), or any other variations and optimizations like mini-batch K-means.

I’m also curious about how K-means compares to other clustering algorithms like DBSCAN or hierarchical clustering, especially for handling non-spherical or more complex data distributions.

I’d really appreciate any recommendations, insights, or resources from the community, particularly practical examples and experiences in optimizing K-means or applying clustering algorithms in real-world scenarios.

4 Comments
2024/11/29
10:33 UTC

2

Can a model be trained on data records from database to then answer on said data? Is new data going to involve incremental training or full retraining?

The machine learning and training layer of AI is like a black box to me. I read some articles that take basic concepts and example that are then taken to a more real example. 70% of that is still over my head. Theoretically let’s say I have sql database backend for car dealer that sells cars, services cars, takes in and resells used cars, and maybe on side does collision repair. Most of data is structured and has proper relations in tables. Some data is in PDF that can be OCRed. Now I wear a hat of a CEO that wants AI chatbot that he can ask questions like “what are top 3 car brands that we took in as used for trade in for sales that yielded most gross revenue?” Data analyst will probably get this done just fine, but CEO wants a chatbot to ask questions like this. The idea in everyone’s head is that we can just take all this data, take a model, and train the model on the data from database. When there is new data, the model will just be trained on top. A vendor came in and vaguely, not explicitly, suggested that it is exactly how it works. Does it tho? I am curious because idk and my gut is telling now. Approach that does makes sense to me and in theory seems most plausible to me is one or multiple agents that have access to some tools and maybe read-only database. These agents work together to deconstruct the question and plan out steps that a data analyst might take when given DB schema. In the end the answer has some backing and work showing how it was done with steps laid out. Kind of, or exactly what AutoGen is doing.

But can a model just be trained on data from a sql database and then be able to answer analytical questions while also doing the math?

4 Comments
2024/11/28
23:58 UTC

1

Tablet vs laptop

I am currently in a master's program for data science. I have a higher end PC for most of my work but I would like to get a small portable option when I need to travel. Is it work it to get a tablet or would I be better of going with a similarly priced laptop?

2 Comments
2024/11/28
20:31 UTC

1

RAG System

I’m building an AI chatbot that helps financial professionals with domain specific related enquiries. I’ve been working on this for the last few months and the responses from the system aren’t sounding great. I’ve pulled the data from relevant websites. Standardised into YAML format, broken down granularly. These entries are then embedded and stored on a vector database. The user ask a question which is then embedded and relevant data entries are pulled from the vector database. An OpenAI LLM then summarises what has been pulled from the vector database. Another OpenAI LLM then generates a response based on the summarised information. It’s hard to explain what’s wrong with the system but it doesn’t feel great to talk with. It doesn’t really seem to understand the data and it’s just presenting it. Ideally I want users to be able to input very complex user enquiries and for the model to respond coherently, currently it’s not doing that.

My initial thoughts are instead of a RAG system, to maybe fine tune a model. It would be good to get opinions on what might be the best way to proceed. Do I continue tweaking the RAG system or go in another direction with actually trying to feed an AI model the data?

I have no formal education in ML but just a deep interest so please bear that in mind when answering!

Thank you in advance.

5 Comments
2024/11/28
18:26 UTC

5

How do you gather data for image recognition?

I am very new to ML. I am asking out of curiousity, how do companies tend to collect data regarding image recognition? Do they just hire people to label certain items in a picture? I watched a video of a guy (who led the project and probably is well educated) labeling images manually and was genuinely curious to know if that is always the case?

6 Comments
2024/11/28
16:48 UTC

1

What Evaluation Metrics does Clustering Have?

I'm currently stuck in my final project where I need to accomplish a step for model evaluation. For evaluating my clustering model, I was tasked to use the evaluation metrics: accuracy score, confusion matrix, F1-score, MSE.

Can I just ask if those are valid evaluation metrics or should I consult my professor?

5 Comments
2024/11/28
16:28 UTC

1

Thesis Question

My masters thesis is a group project about a dataset regarding news articles. I have to predict and say what drives engagement of news in this df and don’t have access to the article itself, only the headline. I have several features like:

  • category
  • click through rate

-headline -date -sentiment score

I must also decide on an individual data science/ ML topic that i should further explore within the dataset and topic. My idea was to do a content/user-based reccomendation system that based on the headline, sentiment and category to give similar article suggestions.

I have to deliver the individual theme idea tomorrow and can’t find a good way to evaluate this item-based offline system. How should i do it? Is it even possible? If not, what other topics could I do?

2 Comments
2024/11/28
15:45 UTC

1

Trying to calculate distance between tensor and embeddings, confused about dimensions

Hi, I'm trying to implement VQ-VAE from scratch. I got to the point of calculating euclidean distance between a vector z of shape (b c h w) and embedding space of shape (size, embedding_dim).

For instance, the tensor z is given as flat tensor: torch.Size([2, 16384]) - which means there are two batches of z, and z can be re-shaped to torch.Size([2, 256, 8, 8]) - where batch=2, embedding dimension=256, and height, width are 8.

Now the embedding space shape is: torch.Size([512, 256]) - which means there are 512 vectors of dimension 256.

So to calculate euclidean distance between vector z and the codebook (the embedding space), we do distance calculation like so:

  1. For each width

  2. For each height

  3. Get z[h][w] - this is the vector that we compare to the codebook - this vector size is 256

  4. Calculate distance between z[h][w] and ALL the embedding space (512 vectors) - so we should get 512 distances

  5. Do this for all batches - so we should get distances tensor of shape [2, 512]

After that I check the minimum distance and do VQ-VAE stuff.

But I don't understand how to calculate distances without using for-loops? I want to use pytorch's tensor operations or einops but I don't yet have experience with this complex dimension operations.

2 Comments
2024/11/28
13:31 UTC

Back To Top