LinkedIn just dropped some intriguing research on using large language models (LLMs) for ranking and recommendation tasks. You can dive into the details in this paper (https://arxiv.org/abs/2501.16450).
Traditionally, recommendation systems have leaned on big, sparse tables (think massive ID embedding tables) to map users to content. But this new approach flips the script: it “verbalizes” all the features, turning them into text that an LLM can chew on (LLMs get by with comparatively small embedding tables). The idea is that since recommendations are essentially about matching users with content, an LLM’s knack for pattern recognition and reasoning might uncover hidden insights in user behavior that old-school methods miss.
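To make the “verbalization” idea concrete, here's roughly what I imagine the input construction looks like (my own illustration, not code from the paper; the feature names and prompt format are made up):

# Hypothetical illustration of "verbalizing" tabular recsys features into text
# that an LLM can score. Feature names and prompt wording are made up.
def verbalize(user, item):
    user_txt = (
        f"User profile: works as {user['title']} in {user['industry']}, "
        f"follows topics {', '.join(user['followed_topics'])}, "
        f"recently engaged with: {', '.join(user['recent_clicks'])}."
    )
    item_txt = f"Candidate post: '{item['headline']}' about {item['topic']}."
    return (
        f"{user_txt}\n{item_txt}\n"
        "Question: how likely is this user to engage with the candidate post? "
        "Answer with a score from 1 to 10 and a one-sentence reason."
    )

prompt = verbalize(
    {"title": "data engineer", "industry": "fintech",
     "followed_topics": ["Spark", "LLMs"], "recent_clicks": ["a post on vector DBs"]},
    {"headline": "Why feature stores are dying", "topic": "ML infrastructure"},
)
print(prompt)  # this string would be scored by the LLM instead of doing an ID-embedding lookup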
Here’s the cool part: if this works, we could be looking at recommendation systems that aren’t just smarter but also capable of explaining why they made a certain suggestion. It also creates a zero-shot capability: you could stand up a recommendation model from just a few examples, with no need for a new team of ML engineers for every ranking model.
Of course, there’s a catch. Converting everything into text and then processing it with a massive model sounds like it could be super inefficient. We're talking potential issues with latency and scaling, especially when you need to serve recommendations in real time. It’s a classic case of “smarter but slower” unless some clever optimizations come into play.
So, while this research direction is undeniably exciting and could totally shake up the recommendation game, the big question is: can it be made practical? Will the benefits of better reasoning and explainability outweigh the extra computational cost? Only time (and further research) will tell.
What do you all think?
This paper introduces a novel methodology for analyzing "underthinking" patterns in large language models by tracking reasoning consistency through token-level output analysis. The researchers developed metrics to identify when models switch between different cognitive approaches during tasks.
Key technical points:
The methodology combines:
I think this research reveals important limitations in current LLM architectures that need addressing before these systems can be reliably used for tasks requiring sustained reasoning. The metrics and analysis methods could be valuable tools for evaluating and improving model training approaches.
I think the most interesting technical finding is that simpler tasks actually suffer more from thought switching than complex ones - this suggests our assumptions about how these models handle different cognitive loads may need revision.
TLDR: New method quantifies how often LLMs switch reasoning patterns mid-task, showing 15-30% performance drops from inconsistent thinking. Simple tasks surprisingly more affected than complex ones.
Full summary is here. Paper here.
Hi everyone,
I am currently working on my Master's thesis on drum-track synthesis with an Extended Long Short-Term Memory (xLSTM) model, and I am thinking about introducing attention into the architecture, since it seems to be quite effective in music generation tasks, as some studies with Bi-LSTMs have shown. As I haven't really found any papers combining xLSTMs and attention, I am unsure whether I have missed something or whether it simply hasn't been tested yet (it is still a fairly new architecture). What is your opinion?
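For concreteness, this is roughly the kind of combination I have in mind (a minimal PyTorch sketch; the LSTM is just a stand-in for an xLSTM block, since I don't want to tie this to any particular xLSTM implementation):

import torch
import torch.nn as nn

class RecurrentWithAttention(nn.Module):
    """Self-attention applied on top of a recurrent block's outputs.
    The nn.LSTM here is only a placeholder for an xLSTM block."""
    def __init__(self, input_dim, hidden_dim, num_heads=4):
        super().__init__()
        self.recurrent = nn.LSTM(input_dim, hidden_dim, batch_first=True)  # stand-in for xLSTM
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        h, _ = self.recurrent(x)   # (batch, time, hidden)
        a, _ = self.attn(h, h, h)  # self-attention over time steps
        return self.norm(h + a)    # residual connection

out = RecurrentWithAttention(input_dim=16, hidden_dim=64)(torch.randn(2, 128, 16))
print(out.shape)  # torch.Size([2, 128, 64])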
Thanks in advance!
Hey folks, over the holidays I read Meta's papers introducing Large Concept Models and thought it could be powerful approach to compress the KV Cache. I implemented and trained an LCM architecture in Jax on TPU v4-32s to explore its potential for KV cache compression. Full implementation and detailed results are available here.
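For context, this is the back-of-envelope arithmetic that made concept-level caching look attractive to me (illustrative numbers only, not measurements from my runs):

# Rough KV cache size: 2 (K and V) * layers * seq_len * d_model * bytes per value.
def kv_cache_gb(seq_len, n_layers=32, d_model=4096, bytes_per_val=2):
    return 2 * n_layers * seq_len * d_model * bytes_per_val / 1e9

seq_len, concept_size = 32_768, 8  # e.g. 8 tokens pooled into one "concept"
print(kv_cache_gb(seq_len))                  # ~17 GB per sequence at token level
print(kv_cache_gb(seq_len // concept_size))  # ~2 GB if only concept states are cached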
Key findings: While promising in theory, the base LCM architecture showed significant performance degradation. I suspect the following causes for this degradation:
• the seq_len/concept_size ratio
• the number of training examples vs seq_len in standard transformers

Potential improvements worth exploring:
However, given the fundamental data efficiency issues, alternative KV cache compression approaches may be more promising.
Implementation details and full analysis in the links above. Open to discussion and feedback.
Ok, so I have learned and have some idea about ML algorithms like decision trees, random forests, etc. But I still don't have any practical idea about hypothesis testing in ML; I don't even know how many tests there are or which one to use when. I was working with someone and he said he was going to train models based on different distributions, perform hypothesis testing and so on, and I was dumbstruck. I know Kaggle, but when I go through notebooks they are sometimes too confusing (which I do want to learn from) and sometimes just basic EDA. I want to know how you even get these ideas, like which test to use or how to build distributions of model results. I may be describing this wrong, but I am just confused and scared.
Please help me, I want to learn these things, but I only understand the easy stuff (HOML 2 and 3). Are there any resources for learning this?
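For example, I think the kind of thing he meant is comparing two models' cross-validation scores with a paired t-test, something like this (a minimal sklearn/scipy sketch on toy data, which may or may not be what he actually does):

# What I think he meant: compare two models' CV scores with a paired t-test.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rf = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# Null hypothesis: both models have the same mean accuracy across these folds.
t_stat, p_value = ttest_rel(rf, dt)
print(t_stat, p_value)  # small p-value -> the accuracy gap is unlikely to be chance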
I am working on a project where I want the user to see what the model "sees" when predicting each token. I am looking for a way to extract attention maps from the vision encoder during inference. Any idea how this can be achieved or if there is any code available for this?
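The only generic route I've found so far is asking Hugging Face transformers to return the attention weights from the vision tower, e.g. for a plain ViT (sketch below; for per-generated-token maps I assume I'd also need the decoder's attention over the image tokens, which is the part I'm stuck on):

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.new("RGB", (224, 224))  # dummy image, replace with a real frame
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One tensor per layer, each of shape (batch, heads, tokens, tokens); token 0 is [CLS].
attn = outputs.attentions[-1]
# Attention from [CLS] to the 14x14 grid of image patches, averaged over heads.
cls_to_patches = attn[0, :, 0, 1:].mean(dim=0).reshape(14, 14)
print(cls_to_patches.shape)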
Hey everyone, I want to share VGSLify, a Python package that simplifies defining, training, and interpreting neural networks using VGSL (Variable-size Graph Specification Language). Inspired by Tesseract's VGSL, VGSLify extends this concept for both TensorFlow and PyTorch. 🚀
VGSL is a compact way to define deep learning models using a simple string format:
None,None,64,1 Cr3,3,32 Mp2,2 Cr3,3,64 Mp2,2 Rc3 Fr64 D20 Lfs128 D20 Lf64 D20 Fs10
Each token represents a layer:
Cr3,3,32 → Convolution (3x3 kernel, 32 filters, ReLU activation)
Mp2,2 → MaxPooling (2x2)
Rc3 → Reshape to (sequence, features)
Lfs128 → Forward LSTM with 128 units that returns sequences
D20 → Dropout layer with rate 0.2
Lf64 → Forward LSTM with 64 units that does not return sequences
Fs10 → Fully connected layer with 10 outputs and softmax activation

With VGSLify, you can easily generate TensorFlow or PyTorch models from a VGSL string:
from vgslify import VGSLModelGenerator
vgsl_spec = "None,None,64,1 Cr3,3,32 Mp2,2 Fs92"
vgsl_gen = VGSLModelGenerator(backend="tensorflow") # Or "torch"
model = vgsl_gen.generate_model(vgsl_spec)
model.summary()
Want to get the VGSL representation of your model? Use:
from vgslify import model_to_spec
import tensorflow as tf
model = tf.keras.models.load_model("your_model.keras")
vgsl_spec = model_to_spec(model)
print(vgsl_spec)
Perfect for exporting models in a compact format.
I've just released VGSLify v0.14.0, which adds some highly requested features! 🎉
Now you can extend VGSL with your own layers:
from vgslify.tensorflow import register_custom_layer
import tensorflow as tf

@register_custom_layer("Xsw")
def build_custom_layer(factory, spec):
    return tf.keras.layers.Dense(10)  # Example custom layer
This means you can add any layer you need while still using VGSL's simplicity.
Need to convert a model with custom layers back to VGSL? Just register a parser:
from vgslify.model_parsers.tensorflow import register_custom_parser
@register_custom_parser(MyCustomLayer)
def parse_my_custom_layer(layer):
    return f"Xsw({layer.units})"
Now, VGSLify will automatically recognize your custom layers when converting models.
I've reorganized modules for easier usage:
from vgslify import VGSLModelGenerator, model_to_spec
No need for deep imports anymore!
pip install vgslify[tensorflow] # For TensorFlow
pip install vgslify[torch] # For PyTorch
Or, install just the core library without any deep learning backend:
pip install vgslify
Would love to hear your feedback! Let me know what you think. 😊
AI has made huge strides in mimicking human behavior, but it still lacks true thought processes behind decision-making and problem-solving. Instead of replicating neural activity, what if we trained AI on the outcomes of human thinking—decisions, solutions, and actions—using text, voice, multimodal data, and EEG signals?
Our approach aims to teach AI how we think, not just what we do, bridging the gap between pattern recognition and true cognitive emulation. This could revolutionize problem-solving in AI.
📄 Read the paper: github.com/abhijayhm/ThoughtMimickingModel
What are your thoughts on AI learning from human decision-making instead of just data patterns?
#AI #MachineLearning #CognitiveAI #Neuroscience #EEG
Posting this here because I haven't seen this announced anywhere. Great news for ML researchers/PhDs in Europe and South-America where many universities only recognize Scopus indexed papers.
Please post your personal projects, startups, product placements, collaboration needs, blogs etc.
Please mention the payment and pricing requirements for products and services.
Please do not post link shorteners, link aggregator websites, or auto-subscribe links.
--
Any abuse of trust will lead to bans.
Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
--
Meta: This is an experiment. If the community doesn't like this, we will cancel it. The goal is to encourage those in the community to promote their work here without spamming the main threads.
Hey everyone,
I’m trying to correctly implement the NF4 (NormalFloat4) quantization levels described in the QLoRA paper, but I’m running into discrepancies between my computed values and the expected ones.
The paper states:
The information theoretically optimal data type for zero-mean normal distributions with arbitrary standard deviations 𝜎 in the range [−1,1] is computed as follows:
(1) estimate the 2^𝑘+1 quantiles of a theoretical N(0,1) distribution to obtain a k-bit quantile quantization data type for normal distributions,
(2) take this data type and normalize its values into the [−1,1] range,
(3) quantize an input weight tensor by normalizing it into the [−1,1] range through absolute maximum rescaling.
My first doubt: the 2^k + 1 quantiles of a theoretical N(0,1) include infinities at either end, so how do I normalize them into [-1, 1]? Also, regarding the quantization levels/values of the NF4 data type: are they the midpoints of adjacent quantiles, or a point between adjacent quantiles chosen so that both sides contain the same number of weights?
Once I understand these, maybe my other doubts will be resolved.
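For concreteness, here's a minimal sketch of my current interpretation of steps (1) and (2), assuming the levels are midpoints of adjacent quantiles and using a small probability offset to avoid the infinite tails (this may well be where I'm going wrong):

import numpy as np
from scipy.stats import norm

def nf4_levels(k=4, offset=1e-2):
    # Step (1): 2^k + 1 quantiles of N(0, 1).
    # A small offset keeps the outermost quantiles finite
    # (norm.ppf(0) and norm.ppf(1) are -inf / +inf).
    probs = np.linspace(offset, 1 - offset, 2**k + 1)
    quantiles = norm.ppf(probs)
    # My assumption: each level is the midpoint of two adjacent quantiles.
    levels = (quantiles[:-1] + quantiles[1:]) / 2
    # Step (2): normalize the levels into [-1, 1] by the absolute maximum.
    return levels / np.abs(levels).max()

print(nf4_levels())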
Sharing the best NLP research papers from 2024, covering the 15 papers I found most interesting.
Has anyone used this model released by the Allen Institute for AI on Thursday? It seems to outperform 4o and DeepSeek in a lot of places, but for some reason there's been little to no coverage. Thoughts?
tl;dr Use Paper2Audio.com to listen to research papers, or DM me for access to our beta iOS app.
We’ve built a website and a beta iOS app for listening to research papers! Check out Paper2Audio.com or reach out if you’d like access to the iOS beta.
There are three listening modes:
None of the modes simulate a podcast. You just upload a PDF and you get back an audio version of a paper. For now, it is entirely free for users.
I've been using Paper2Audio to listen to papers, mostly on vision-language models and the latest LLM papers like DeepSeek R1, which has helped us improve our service. I'm also an economist, so I've been catching up on economics papers with Paper2Audio as well.
Questions and feedback are most welcome!
OpenAI’s Whisper was released more than two years ago, and it seems that no other model has seriously challenged its position since then. While Whisper has received updates over time, its performance in languages other than English—such as Chinese—is not ideal for me. I’m looking for an alternative model to generate subtitles for videos and real-time subtitles for live streams.
I have also tried Alibaba's FunASR, but it was likewise released more than a year ago and does not seem to offer satisfactory performance.
I am aware of some LLM-based speech models, but their hardware requirements are too high for my use case.
In other AI fields, new models are released almost every month, but there seems to be less attention on advances in speech recognition. Are there any recent models worth looking into?
Hey ML folks,
I'm Khanh, a software engineer who just started learning ML.
I threw together a web-based linear regression tool that lets you plot data, fit a regression line, and check out key stats (R², MSE, p-values, etc.)—all without writing a single line of code.
🔗 Check it out here: https://www.linear-regression.dev/
You can:
• Add your own data points or generate random ones
• See the regression line update in real-time
• Get a quick breakdown of the model stats
Not trying to reinvent the wheel, just wanted something simple and quick for basic regression analysis. If you give it a spin, let me know what you think! Anything missing? Anything annoying? Appreciate any feedback! 🙌
Hello everyone.
I am working on a project that involves multi-class time series classification. The dataset is kind of complicated, as it has a good amount of missing or inconsistent values (extreme outliers). The data is also imbalanced.
We are testing some of these architectures:
The procedure we use is given as follows:
Data cleaning - Feature Extraction (if needed, because for the Deep Learning architectures the feature extraction is done automatically, the input is the raw time series) - Normalization (Standard Scaler) - Classification.
The dataset is instance-based, that is, there are lots of instances (CSV files) for each class. The dataset also contains more than 30 variables; however, the majority of them are NaN or inconsistent values, so only four variables are considered for the classification task.
Considering the four variables, the cleaning is done as follows:
In the cleaning step, the interpolation is always done within the same instance. I do the train-test-validation split separating different instances in different folders (training, testing and validation folders). The ratio is kept the same for all the classes in all three folders. Hence as far as my knowledge goes no data leakage is happening here.
Then, in the feature extraction step, I use a sliding window with no overlap, because the dataset is large. The following features are extracted: mean, std dev, kurtosis, skewness, min, Q1, median, Q3 and max. Again, the values are calculated only within each window, without considering other windows, hence I don't see data leakage happening here.
For the normalization step, I apply the fit_transform() method to the data in X_train, then the transform() method for the data in X_test and X_val, which to me is standard. Finally, the classification method is applied.
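To be explicit, the normalization step is essentially this (a sketch with scikit-learn and dummy arrays; my real feature matrices contain the nine window statistics listed above):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Dummy window-level feature matrices standing in for the real ones (9 features per window).
X_train, X_val, X_test = (np.random.randn(n, 9) for n in (800, 100, 100))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics come from training windows only
X_val_s = scaler.transform(X_val)          # validation reuses the training statistics
X_test_s = scaler.transform(X_test)        # test data never influences the scaler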
From my point of view, I see no data leakage. However, analyzing the results, Random Forest had a better average F1-score (I use F1-score due to the imbalanced data) than the other methods (not by a large difference), hence I want to check here whether I missed any step needed to ensure the absence of data leakage.
Thanks a lot everyone.
TLDR: Did I miss anything in my time series classification pipeline that could cause data leakage, especially in the cleaning and feature extraction steps? Random Forest performed a bit better than more sophisticated methods.
I'm working on extracting financial entities (e.g., EPS, revenue) from HTML documents that don't follow a consistent template, and I don't want to go with an LLM (RAG) approach.
I’m considering the following approach:
The goal is to achieve maximum accuracy with low latency. Does this approach seem viable? Are there any optimizations or alternative methods I should consider?
TL;DR we show that molecular fingerprints give SOTA results for peptide classification, and Long Range Graph Benchmark (LRGB) does not really have long-range dependencies
ArXiv: https://arxiv.org/abs/2501.17901
Abstract:
We study the effectiveness of molecular fingerprints for peptide property prediction and demonstrate that domain-specific feature extraction from molecular graphs can outperform complex and computationally expensive models such as GNNs, pretrained sequence-based transformers and multimodal ensembles, even without hyperparameter tuning. To this end, we perform a thorough evaluation on 126 datasets, achieving state-of-the-art results on LRGB and 5 other peptide function prediction benchmarks. We show that models based on count variants of ECFP, Topological Torsion, and RDKit molecular fingerprints and LightGBM as classification head are remarkably robust. The strong performance of molecular fingerprints, which are intrinsically very short-range feature encoders, challenges the presumed importance of long-range interactions in peptides. Our conclusion is that the use of molecular fingerprints for larger molecules, such as peptides, can be a computationally feasible, low-parameter, and versatile alternative to sophisticated deep learning models.
Key contributions:
1. Molecular fingerprints, a simple feature extraction method on molecular graphs, work great for peptides.
2. They get SOTA results on LRGB while being very short-range descriptors, contradicting claims that LRGB really requires long-range dependencies.
First one is more bioinformatics-oriented, but second is very relevant for GNNs evaluation methodology. Most papers that design GNNs capable of learning long-range relations between nodes evaluate on LRGB. But it seems not to really have that, so any conclusions here may be either a) spurious correlation b) they are learning something interesting, but not really long-range relations. Interestingly, the original reviewers of LRGB had the same doubts (https://openreview.net/forum?id=in7XC5RcjEn).
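For anyone who wants to see what such a pipeline looks like in practice, here is a simplified sketch of the general approach (not the exact code from our experiments; toy SMILES and made-up labels, just to show the shape of it):

# Minimal ECFP-count + LightGBM pipeline (simplified sketch).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from lightgbm import LGBMClassifier

def ecfp_counts(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetHashedMorganFingerprint(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.float32)
    for idx, count in fp.GetNonzeroElements().items():
        arr[idx] = count  # count fingerprint, not just bits
    return arr

# Toy data: a few SMILES strings with made-up binary labels, repeated for volume.
smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "c1ccccc1O", "CCOC(=O)C"] * 10
labels = [1, 0, 1, 0] * 10

X = np.stack([ecfp_counts(s) for s in smiles])
clf = LGBMClassifier(n_estimators=50).fit(X, labels)
print(clf.predict(X))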
I am working on a project and someone suggested me to try out activation steering over fine tuning, but I fail to understand why anyone would do that, on paper the idea looks elegant but what are the real benefits for doing it?
More context about activation steering (from chatgpt):
Activation steering is a technique to control language model behavior by modifying neuron activations in specific layers. Instead of retraining or fine-tuning, it applies learned direction vectors—often derived from contrastive examples—to nudge model outputs in a desired direction (e.g. reducing bias or aligning with specific instructions). This method is efficient, interpretable, and allows real-time intervention without modifying the underlying model weights. Great for fine-grained control over model behavior!
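To make sure I understand the mechanics, this is how I picture it being applied (a generic PyTorch sketch on a toy model; the steering vector would normally be derived from contrastive examples, and nothing here is specific to any particular paper):

import torch
import torch.nn as nn

# Toy stack of layers standing in for a real language model.
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 16))

hidden_dim = 16
steering_vector = torch.randn(hidden_dim)  # would normally come from contrastive activations
alpha = 4.0                                # steering strength

def steer(module, inputs, output):
    # Add the direction vector to this layer's activations at inference time.
    return output + alpha * steering_vector

handle = model[1].register_forward_hook(steer)  # pick a middle layer
with torch.no_grad():
    steered = model(torch.randn(2, hidden_dim))
handle.remove()  # weights were never modified; removing the hook restores the base model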
Hi Community,
I worked on an interactive tutorial on the ROC curve, AUC score and the confusion matrix.
https://maitbayev.github.io/posts/roc-auc/
Any feedback appreciated!
Thank you!
I'm reading through the Deepseek R1 paper's distillation section and did not find any reference to soft labels (probability distributions) in the SFT dataset.
Is it implied that distillation always uses soft labels? Because the SFT data creation using rejection sampling sounded more like hard labels. Thoughts?
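For reference, my mental model of the two options in code form (just a sketch to pin down the terminology, not anything from the paper):

import torch
import torch.nn.functional as F

student_logits = torch.randn(4, 32000)       # (batch, vocab)
teacher_logits = torch.randn(4, 32000)
hard_labels = teacher_logits.argmax(dim=-1)  # rejection-sampled text reduces to token ids

# Hard-label SFT: plain cross-entropy against the teacher's sampled tokens.
sft_loss = F.cross_entropy(student_logits, hard_labels)

# Soft-label distillation: KL between full output distributions (temperature T).
T = 2.0
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

print(sft_loss.item(), kd_loss.item())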
I have been reading ML papers for about a year now. Coming from a background in physics, I see that papers do not account for reproducibility at all. The paper often does not reveal all the details they used, such as the model architecture parameters or other hyperparameters.
This also brings me to the question: I almost never see error bars!
I know pre-training is difficult and requires a lot of computing power. However, I imagine that evaluation can be done several times. In fact, many researchers run the evaluation several times but only report their best results instead of reporting an average with confidence intervals, especially when comparing their model against baselines.
What do you guys think about this? Do you think this might be a reason for the inflation of mediocre research being done in AI/ML?
TLDR is the title.
I'm working on writing custom pytorch code to improve training throughput, primarily through asynchrony, concurrency and parallelism on both the GPU and CPU.
Today I finally set up Nsight Systems locally and it's really improved my understanding of things.
While I got it working on my RTX3060, that is hardly representative of true large ML training environments.
... so I tried to get it going on Runpod and fell flat on my face: something about the kernel paranoid level (which I can't reduce), a --privileged arg (which I can't add, because Runpod only gives you the Docker RUN command), and everything in 'nsys status -e' showing 'fail'.
Any ideas?
Hi! I'm Andi from multimodal team at Hugging Face.
Today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s
Inspired by our team's effort to open-source DeepSeek's R1 training, we are releasing the training and evaluation code on top of the weights
Now you can train any of our SmolVLMs—or create your own custom VLMs!
Go check it out: