/r/MLQuestions
A place for beginners to ask stupid questions and for experts to help them! /r/MachineLearning is a great subreddit, but it is for interesting articles and news related to machine learning. Here, feel free to ask any question regarding machine learning.
What kinds of questions do we want here?
"I've just started with deep nets. What are their strengths and weaknesses?" "What is the current state of the art in speech recognition?" "My data looks like X,Y what type of model should I use?"
If you are well versed in machine learning, please answer any question you feel knowledgeable about, even if they already have answers, and thank you!
Struggling to get your AI models to perform better? Data annotation might be the game-changer you need! It ensures accurate, labeled datasets while saving time and resources. Plus, it scales with your project needs.
Check out this blog for insights: How Data Annotation Can Supercharge Your AI Model?
What are your thoughts on outsourcing annotation tasks? Let’s discuss!
Is this the right tech stack?
Additional Considerations:
Example Use Case: Autonomous Vehicle. For an autonomous vehicle, the tech stack might involve:
I'm trying to understand AI concepts (as a layperson, for fun). Could somebody tell me if the following statements are roughly accurate:
-Supervised learning trains an AI to make increasingly valid predictions by giving it feedback
-Unsupervised learning creates a detailed representation of a dataset and all the relationships within it, which can then yield insights into the data
(Just looking for super general "wide-angle" understanding at this point.)
Hi all
I'm a recent medical school graduate with 3-5 months of free time before I dive into my next steps. I've always been curious about artificial intelligence, especially how it's used in healthcare, but I don't have any background in programming or computer science.
The thing is, I'm not sure coding is for me; I'm more interested in understanding AI conceptually and how I can apply it to real-world scenarios.
I'd love to hear your advice on learning resources or how to approach this vast topic. Thanks!!
Hi everyone,
I am from the University of Munich and have previously published in the Computer Vision category on arXiv and am now only endorsed for the cs.CV category. My new paper, “Enhancing Deep Learning Model Robustness through Metamorphic Re-Training”, is more focused on general AI, and therefore I would like to submit it to the cs.AI category :)
I need an endorsement to proceed. If you’re qualified, I’d appreciate your help: https://arxiv.org/auth/endorse?x=ZKK9HC
You can also find me on LinkedIn if you'd like to connect. Thanks in advance!
Is there such a thing as a "fill here with your own data" JSONL file?
I have a dataset with very few samples (fewer than 2,000) but a large number of features (nearly 50,000). I've used RFE and other feature selection methods, but I'm unsure how to find the optimal number of features to select. Is there any principled way to think about this rather than just running a grid search over several values?
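For concreteness, here is a minimal sketch of one common answer, assuming scikit-learn: RFECV chooses the number of features by cross-validation instead of a manual grid search. The dataset below is a stand-in, not the OP's data.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Stand-in data with the same flavour: few samples, many features.
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=0.1,                    # drop 10% of the remaining features per round
    cv=5,
    scoring="f1",
    min_features_to_select=10,
)
selector.fit(X, y)
print("optimal number of features:", selector.n_features_)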
I am trying this whole ML thing, pretty new to it.
I have been trying to predict, with some degree of confidence, whether a URL is malicious. I'm doing that without looking at the contents of the page, and WHOIS lookups take a lot of time, so those are out too. I looked at 2 datasets.
What I did was create a set of 24 features (the WHOIS lookup was taking time, so I skipped that): things like the count of "www", sub-domains, path splits, the count of query params, etc. The two datasets are a bit different: one is tagged with benign/phishing/malware, the other has a status field (1, 0).
I trained it with Keras like this:
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Sequential

def model_binaryclass(input_dim):
    model = Sequential(
        [
            Input(shape=(input_dim,)),
            Dense(128, activation="relu"),
            Dropout(0.2),
            Dense(64, activation="relu"),
            Dropout(0.2),
            # single sigmoid unit: the model outputs one probability per URL
            Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", "Recall", "Precision"],
    )
    return model
In my last try, I used only the first dataset. But when I verify it against some URLs, all of them get roughly the same probability.
Verification code:
import numpy as np
from urllib.parse import urlparse

# Helpers (get_top_level_domain, no_of_embed, no_of_dir, having_ip_address,
# digit_count, letter_count, total_query_params) and the TFLite `interpreter`,
# `input_details`, and `output_details` are defined elsewhere in the repo.

special_chars = ["@", "?", "-", "=", "#", "%", "+", ".", "$", "!", "*", ",", "//"]

def preprocess_url(url):
    url_length = len(url)
    tld = get_top_level_domain(url)
    tldLen = 0 if tld is None else len(tld)
    is_https = 1 if url.startswith("https") else 0
    n_www = url.count("www")
    n_count_specials = []
    for ch in special_chars:
        n_count_specials.append(url.count(ch))
    n_embeds = no_of_embed(url)
    n_path = no_of_dir(url)
    has_ip = having_ip_address(url)
    n_digits = digit_count(url)
    n_letters = letter_count(url)
    hostname_len = len(urlparse(url).netloc)
    n_qs = total_query_params(url)
    features = [
        url_length,
        tldLen,
        is_https,
        n_www,
        n_embeds,
        n_path,
        n_digits,
        n_letters,
    ]
    features.extend(n_count_specials)
    features.extend([hostname_len, has_ip, n_qs])
    print(len(features), "n_features")
    return np.array(features, dtype=np.float32)

def predict(url, n_features=24):
    input_value = preprocess_url(url)
    input_value = np.reshape(input_value, (1, n_features))
    interpreter.set_tensor(input_details[0]["index"], input_value)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]["index"])
    print(f"Prediction probability: {output_data}")
    # Interpret the result.
    # NB: output_data has shape (1, 1) because the model ends in a single
    # sigmoid unit, so np.argmax over it always returns 0; thresholding
    # (output_data[0][0] > 0.5) is the usual way to get a binary class.
    predicted_class = np.argmax(output_data)
    print("predicted class", predicted_class, output_data)

uus = [
    "https://google.com",
    "https://www.google.com",
    "http://www.marketingbyinternet.com/mo/e56508df639f6ce7d55c81ee3fcd5ba8/",
    "000011accesswebform.godaddysites.com",
]
[predict(u) for u in uus]
The code to train is on GitHub.
Can someone please point me in the right direction? The output looks like this:
24 n_features
Prediction probability: [[0.99999964]]
predicted class 0 [[0.99999964]]
24 n_features
Prediction probability: [[0.99999946]]
predicted class 0 [[0.99999946]]
24 n_features
Prediction probability: [[1.]]
predicted class 0 [[1.]]
24 n_features
Prediction probability: [[0.963157]]
predicted class 0 [[0.963157]]
Assume two models:
fully connected neural network model 1 = n parameters with 3 layers
FCNN model 2 = n parameters with 12 layers
The second model has higher representation power than model 1, but these two models have the same architecture, so it is easy to compare them. How do you do this with generative models such as diffusion, GANs, VAEs, normalizing flows, energy-based models, etc.?
I'm finding it difficult to build an intuition for when to use GNNs. I'm specifically interested in using GNNs to solve predictive tasks on relational data. Are there any surveys, papers, or benchmarks that I can refer to?
Thanks!
For a normal encoder-only Transformer like BERT, I know we can add a CLS token to the input that "aggregates" information from all other tokens. We can then attach an MLP to this token at the final layer to produce the class predictions.
My question is, how would this work for TransformerXL, which processes a (long) input in small chunks? It must output a CLS token every chunk, right? Do we then only use the last of these CLS tokens (which is produced when TrXL consumes the final chunk of the input) to make the class prediction, and compute the loss from this? Or is there a totally different way to do this?
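For concreteness, one common pattern (a hedged sketch, not TransformerXL's actual API) is to carry a memory across chunks and classify from the CLS hidden state of the final chunk only; the ToyChunkEncoder below is a hypothetical stand-in for a TrXL-style encoder:

import torch
import torch.nn as nn

class ToyChunkEncoder(nn.Module):
    # Stand-in for a TrXL-style encoder: consumes a chunk plus the previous
    # memory, returns hidden states for the chunk and the updated memory.
    def __init__(self, d_model):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, chunk, mems=None):
        x = chunk if mems is None else torch.cat([mems, chunk], dim=1)
        hidden = self.layer(x)[:, -chunk.size(1):]   # states for the current chunk
        return hidden, hidden.detach()               # new memory = this chunk's states

class ChunkedClassifier(nn.Module):
    def __init__(self, d_model=64, n_classes=2):
        super().__init__()
        self.encoder = ToyChunkEncoder(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, chunks):                       # list of (batch, seq, d_model) chunks
        mems = None
        for chunk in chunks:                         # memory carries context forward
            hidden, mems = self.encoder(chunk, mems)
        return self.head(hidden[:, 0])               # CLS of the *final* chunk only

# Usage: a CLS embedding is assumed to be prepended to every chunk at position 0,
# and the loss is computed from the final chunk's logits.
chunks = [torch.randn(2, 16, 64) for _ in range(3)]
logits = ChunkedClassifier()(chunks)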
If I have two vectors:
I can take the dot product of these two and find i's total score. Now, across many rows, I can find "the class" total scores by i. As another example, vector embeddings of tokens can help me find the similarity between them (i.e., projection).
Now, imagine I have two other random vectors:
Mathematically, nothing is stopping me from taking the dot product of the happiness score and the sadness score for person i.
This is where my intuition isn't strong. What could the dot product of these two mood scores possibly tell me? I am just looking for any random ideas or ways to take the dot product and "make it make sense".
Overall, this will help me take different vectors of datasets and infer insights. So, if you are given two vectors, how do you approach "combining them" to produce an output that "makes sense"?
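One way to ground this: a dot product is a sum of per-person products, so it is large only when the same people score high on both moods, and centering the vectors first turns it into an (unscaled) covariance. A tiny numpy illustration (the scores are made up):

import numpy as np

happiness = np.array([0.9, 0.1, 0.8, 0.2])
sadness = np.array([0.1, 0.9, 0.2, 0.8])

print(happiness @ sadness)   # raw co-activation: large if the same people score high on both

# Centering then normalizing gives the correlation between the two moods, in [-1, 1].
h_c, s_c = happiness - happiness.mean(), sadness - sadness.mean()
print(h_c @ s_c / (np.linalg.norm(h_c) * np.linalg.norm(s_c)))   # -1.0 here: perfectly opposed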
I was training a model on a 2D floor-plan image dataset, but its accuracy is getting worse during training and it is generating horrible grey images.
Here is the code link:
https://colab.research.google.com/drive/12OZyv9rj4UBRG0k0Yl2rA1X8whZpmllL?usp=drive_link
Dataset link:
https://www.kaggle.com/datasets/harshratna/2d-floor-plan-dataset-with-text-descriptions-new
Hi everyone,
I'm a beginner in ML, and these days I have been exploring convolutions and kernels in image processing. To understand how ML works, I tried to create a simple project that mimics the process. Here's what I did:
span(feature1, feature2, ...)
It worked! It can recognize a majority of rectangles and triangles using this method.
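For readers who want to try something similar, here is a minimal sketch of the kernel-convolution step described above (not the original project's code), assuming numpy and scipy:

import numpy as np
from scipy.signal import convolve2d

image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0               # a small white rectangle on a black background

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])    # classic horizontal-gradient kernel

edges = convolve2d(image, sobel_x, mode="same")
print(np.abs(edges).sum(axis=0))    # strong responses at the rectangle's vertical edges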
UMAP fit passes the training data to the model. In which parameters is the learned information stored in the model object? The ones I found are: _raw_data, graph_, _sigmas, _rhos, graph_dists_.
Please give me some suggestions on project ideas to stand out on my résumé. I did a few projects, like an e-commerce frontend, object detection models, and a chat app, but after seeing my friend's résumé I realised that these are very generic. Please give me any recommendations or suggestions.
My key skills are React.js, Django, Node.js, Python, C++, PyTorch, scikit-learn, MySQL, and MongoDB.
I never had any interest in the corporate field, but perhaps I could funnel all of my energy into AI, which I'm already fond of. Would it be a dumb idea, or should I pursue this course? I don't know a whole lot about AI/ML other than that I'd like to learn how to make a robot, and I heard it could potentially be good money.
A little background: I am a 3rd-year (technically 4th-year, but my advisor transferred universities after 1 year and asked me to follow him, so bye-bye 1 year of coursework) PhD student in the ECE department studying machine learning/computer vision at a lesser-known school. It is by no means a CMU, Berkeley, Stanford, etc. university, but it is still the flagship university for the state.
I've had what I would call a moderately successful first 3 (4) years of the PhD program, as I feel like my math skills, particularly probability, are on point, and I have a publication record to more or less back that up. Only 3 publications at some lesser-known venues (along the lines of B+ papers, or workshop papers at A* venues), but I remain optimistic that ICML will work out in my favor this review cycle.
With that said, I have been really struggling to find meaningful/impactful PhD internships, both last year and this application year. I was able to land a DoD internship, but those DoD positions seem like such a waste, as you are hardly allowed to publish outside of the DoD; and even if one is OK with that and continues on that career track, DoD positions tend to prioritize managerial, acquisition-focused roles where you don't get to do real research. I've tried applying to what I think the impactful positions might be (Adobe, and any FAANG-type place), but I have had virtually zero success with that. I try to stay hopeful in all this, but every time I think about this hyper-competitive job market I just get down and worry that, upon graduation, finding a meaningful position will be rough. Hoping someone can tell me it's really not so bad; maybe I just needed a place to rant. If you share similar feelings at all, or have been successful in navigating a similar situation, please let me know.
Hello,
I was wondering how an entry-level machine learning engineer becomes a senior machine learning engineer. Are the skills required to become a Sr. ML engineer learned on the job, or do I have to self-study? If self-studying is the appropriate way to advance, how many hours per week should I dedicate to go from entry level to Sr. level in 3 years, and how exactly should I self-study? Advice is greatly appreciated!
Hello guys, hope you are well. Is there anyone who knows, or has an idea of, how to convert an image of an interior (a panorama) into a 3D model using AI?
I want an offline AI model that can create questions based on text I provide.
Hello ladies and gentlemen,
I found that in my company, a lot of manual effort is required to transcribe the client info forms filled in by clients and input them into our system. (Using a digital input form for clients is not a feasible option.)
Over the past couple of years, thousands of forms have already been transcribed into our system, and we have the scanned copies of them as well.
Ideally, I'd like to train my own supervised model to recognize the handwriting, with the scanned copies as the input and the already-transcribed details as the output.
In this scenario, do I need a powerful GPU, or can it be done with the M4 Mac mini I'm currently using? I did a quick proof of concept with easyocr on the Mac today and would love to see how far I can go with it.
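For reference, the proof of concept was roughly this (easyocr's standard Reader/readtext API; "form.png" is a placeholder path):

import easyocr

reader = easyocr.Reader(["en"])        # downloads the detection/recognition models on first run
results = reader.readtext("form.png")  # list of (bounding box, text, confidence) tuples
for _bbox, text, conf in results:
    print(f"{conf:.2f}  {text}")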
Thanks heaps.
Hello everyone,
I’m currently diving deeper into machine learning and have just learned the basics of K-means clustering. I'm particularly interested in understanding more about how to optimize the algorithm and explore alternative clustering techniques.
So far, I’ve heard about K-means++ for better initialization of centroids, but I’d love to learn about other strategies to improve performance, such as speeding up the algorithm for larger datasets, enhancing cluster quality evaluation (e.g., silhouette scores), or any other variations and optimizations like mini-batch K-means.
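For concreteness, a minimal sketch of those two pieces, assuming scikit-learn (the data is a stand-in, and n_init="auto" needs a recent version):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(10_000, 16))   # stand-in data
for k in (2, 4, 8):
    labels = MiniBatchKMeans(n_clusters=k, batch_size=1024,
                             n_init="auto", random_state=0).fit_predict(X)
    # Subsampling keeps the O(n^2) silhouette computation cheap on large datasets.
    print(k, silhouette_score(X, labels, sample_size=2000, random_state=0))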
I’m also curious about how K-means compares to other clustering algorithms like DBSCAN or hierarchical clustering, especially for handling non-spherical or more complex data distributions.
I’d really appreciate any recommendations, insights, or resources from the community, particularly practical examples and experiences in optimizing K-means or applying clustering algorithms in real-world scenarios.
The machine learning and training layer of AI is like a black box to me. I have read some articles that take basic concepts and examples and then build up to more realistic ones; 70% of that is still over my head.

Theoretically, let's say I have a SQL database backend for a car dealer that sells cars, services cars, takes in and resells used cars, and maybe does collision repair on the side. Most of the data is structured and has proper relations in tables. Some data is in PDFs that can be OCRed. Now I wear the hat of a CEO who wants an AI chatbot he can ask questions like "what are the top 3 car brands that we took in as used trade-ins for sales that yielded the most gross revenue?" A data analyst would probably get this done just fine, but the CEO wants a chatbot to ask questions like this.

The idea in everyone's head is that we can just take all this data, take a model, and train the model on the data from the database; when there is new data, the model will just be trained on top. A vendor came in and vaguely, not explicitly, suggested that this is exactly how it works. Does it, though? I am curious because I don't know, and my gut is telling me no.

The approach that does make sense to me, and in theory seems most plausible, is one or multiple agents that have access to some tools and maybe a read-only database. These agents work together to deconstruct the question and plan out the steps a data analyst might take when given the DB schema. In the end, the answer has some backing, with the work showing how it was done and the steps laid out. Kind of, or exactly, what AutoGen does.
But can a model just be trained on data from a SQL database and then be able to answer analytical questions while also doing the math?
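For concreteness, the agent-style approach described above might look something like the minimal sketch below. This is not a real product: llm() is a hypothetical "prompt in, text out" helper, and the model-written SQL runs against a read-only connection.

import sqlite3

def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for an actual LLM API call")

def answer(question: str, schema: str, db_path: str) -> str:
    # Step 1: the model plans/writes a query from the schema, like an analyst would.
    sql = llm(f"Schema:\n{schema}\n\nWrite one SQLite query answering:\n{question}")
    # Step 2: the query runs against a read-only connection, so the agent cannot write.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    rows = conn.execute(sql).fetchall()
    conn.close()
    # Step 3: the model turns the rows into an answer, showing its work.
    return llm(f"Question: {question}\nSQL used: {sql}\nRows: {rows}\n"
               "Answer the question from these rows and lay out the steps taken.")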
I am currently in a master's program for data science. I have a higher-end PC for most of my work, but I would like to get a small portable option for when I need to travel. Is it worth it to get a tablet, or would I be better off going with a similarly priced laptop?
I'm building an AI chatbot that helps financial professionals with domain-specific enquiries. I've been working on this for the last few months, and the responses from the system aren't sounding great. I've pulled the data from relevant websites, standardised it into YAML format, and broken it down granularly. These entries are then embedded and stored in a vector database. The user asks a question, which is then embedded, and the relevant data entries are pulled from the vector database. An OpenAI LLM then summarises what has been pulled from the vector database, and another OpenAI LLM generates a response based on the summarised information. It's hard to explain what's wrong with the system, but it doesn't feel great to talk with. It doesn't really seem to understand the data; it's just presenting it. Ideally I want users to be able to input very complex enquiries and have the model respond coherently; currently it's not doing that.
My initial thought is that instead of a RAG system, I could maybe fine-tune a model. It would be good to get opinions on the best way to proceed. Do I continue tweaking the RAG system, or go in another direction and actually try to feed an AI model the data?
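For reference, the retrieval step described above boils down to something like this sketch, assuming precomputed entry embeddings (numpy only; embed() stands in for the real embedding model):

import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("stand-in for the real embedding model")

def top_k(question, entries, entry_vecs, k=5):
    # Cosine similarity between the question and every stored entry.
    q = embed(question)
    sims = entry_vecs @ q / (np.linalg.norm(entry_vecs, axis=1) * np.linalg.norm(q))
    return [entries[i] for i in np.argsort(sims)[::-1][:k]]   # most similar first

If responses feel like the system is just presenting the retrieved text, the usual levers to try before fine-tuning are the chunking granularity, the retrieval quality, and the final prompt.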
I have no formal education in ML but just a deep interest so please bear that in mind when answering!
Thank you in advance.
I am very new to ML. I am asking out of curiosity: how do companies tend to collect data for image recognition? Do they just hire people to label certain items in a picture? I watched a video of a guy (who led the project and is probably well educated) labeling images manually and was genuinely curious to know whether that is always the case.
Can I just ask if those are valid evaluation metrics, or should I consult my professor?
My master's thesis is a group project about a dataset of news articles. I have to predict and explain what drives engagement of the news in this df, and I don't have access to the articles themselves, only the headlines. I have several features like:
- headline
- date
- sentiment score
I must also decide on an individual data science/ML topic that I should further explore within the dataset and topic. My idea was to do a content/user-based recommendation system that, based on the headline, sentiment, and category, gives similar article suggestions.
I have to deliver the individual theme idea tomorrow and can't find a good way to evaluate this item-based offline system. How should I do it? Is it even possible? If not, what other topics could I do?
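For concreteness, the headline-based similarity idea might start from something like this sketch, assuming scikit-learn (the headlines are stand-in data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headlines = ["Stocks rally on rate cut hopes",      # stand-in data
             "Central bank signals rate cut",
             "New stadium opens downtown"]
X = TfidfVectorizer().fit_transform(headlines)      # one TF-IDF vector per headline
sims = cosine_similarity(X)                         # pairwise similarity matrix
print(sims[0].argsort()[::-1][1:3])                 # nearest headlines to the first one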
Hi, I'm trying to implement VQ-VAE from scratch. I got to the point of calculating the Euclidean distance between a vector z of shape (b, c, h, w) and an embedding space of shape (size, embedding_dim).
For instance, the tensor z is given as a flat tensor of torch.Size([2, 16384]), which means there are two batches of z, and z can be reshaped to torch.Size([2, 256, 8, 8]), where batch=2, embedding dimension=256, and height and width are 8.
The embedding space shape is torch.Size([512, 256]), which means there are 512 vectors of dimension 256.
So to calculate the Euclidean distance between the vector z and the codebook (the embedding space), we do the distance calculation like so:
For each width
    For each height
        Get z[h][w] - this is the vector that we compare to the codebook - this vector's size is 256
        Calculate the distance between z[h][w] and ALL of the embedding space (512 vectors) - so we should get 512 distances
Do this for all batches - so we should get a distances tensor of shape [2, 512]
After that I check the minimum distance and do VQ-VAE stuff.
But I don't understand how to calculate the distances without using for-loops. I want to use PyTorch tensor operations or einops, but I don't yet have experience with these complex dimension operations.
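For reference, one vectorized way to produce exactly the distances the loops above describe is reshape plus torch.cdist (plain PyTorch; the shapes match the post):

import torch

b, c, h, w, K = 2, 256, 8, 8, 512
z = torch.randn(b, c, h, w)                        # the encoder output
codebook = torch.randn(K, c)                       # the embedding space

z_flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (b*h*w, 256): one row per spatial position
dists = torch.cdist(z_flat, codebook)              # (b*h*w, 512): all Euclidean distances at once
indices = dists.argmin(dim=1).view(b, h, w)        # nearest codebook entry per position
quantized = codebook[indices].permute(0, 3, 1, 2)  # back to (b, c, h, w) for the decoder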