[D] Does all distillation only use soft labels (probability distribution)?

I'm reading through the DeepSeek R1 paper's distillation section and did not find any reference to soft labels (probability distributions) in the SFT dataset.

Is it implied that distillation always uses soft labels? The SFT data creation using rejection sampling sounded more like hard labels to me. Thoughts?
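To make the two readings concrete, here is a minimal sketch of both losses on a single token position. The logits are made up for illustration, and the function names are hypothetical. Soft-label distillation needs the teacher's full output distribution, while hard-label distillation only needs the text the teacher generated, which is what ordinary SFT on sampled outputs provides.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole vocabulary (soft labels)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_label_loss(student_logits, target_token):
    """Cross-entropy against one token id (hard label, i.e. plain SFT)."""
    q = softmax(student_logits)
    return -math.log(q[target_token])


# Toy 3-token vocabulary; values are invented for the example.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]

# Hard label: the token the teacher actually emitted (greedy here).
sampled_token = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])

kd = soft_label_loss(student_logits, teacher_logits)
sft = hard_label_loss(student_logits, sampled_token)
print(f"soft-label KL: {kd:.4f}, hard-label CE: {sft:.4f}")
```

The hard-label variant never touches the teacher's logits, so it also works when only generated text is available.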

0 Comments

2025/01/31
16:18 UTC

[Discussion] Reproducibility in reporting Performance and Benchmarks

I have been reading ML papers for about a year now. Coming from a background in physics, I see that many papers do not account for reproducibility at all. They often do not reveal all the details that were used, such as the model architecture parameters or other hyperparameters.

This also brings me to the question: I almost never see error bars!

I know pre-training is difficult and requires a lot of computing power. However, I imagine that evaluation can be done several times. In fact, many researchers run the evaluation several times but only report their best results instead of reporting an average with confidence intervals, especially when comparing their model against baselines.
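As a rough sketch of what reporting an average with a confidence interval could look like, the snippet below aggregates several evaluation runs. The accuracies are hypothetical, and the interval uses a normal approximation, which is an assumption that gets crude for very few runs.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    z=1.96 is the normal approximation; with only a handful of runs a
    Student-t critical value (about 2.776 for n=5) would be more honest.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, z * sem


# Hypothetical accuracies from five evaluation runs with different seeds.
runs = [71.2, 70.4, 72.1, 69.8, 71.0]

mean, half_width = mean_with_ci(runs)
report = f"{mean:.1f} ± {half_width:.1f} (n={len(runs)}); best single run: {max(runs):.1f}"
print(report)
```

Printing the best run next to the interval makes it visible how far a cherry-picked number can sit from the average.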

What do you guys think about this? Do you think this might be a reason for the inflation of mediocre research being done in AI/ML?

0 Comments

2025/01/31
16:02 UTC

[D] Cloud GPU instance service that plays well with Nvidia Nsight Systems CLI?

TLDR is the title.

I'm working on writing custom pytorch code to improve training throughput, primarily through asynchrony, concurrency and parallelism on both the GPU and CPU.

Today I finally set up Nsight Systems locally and it's really improved my understanding of things.

While I got it working on my RTX3060, that is hardly representative of true large ML training environments.

... so I tried to get it going on Runpod and fell flat on my face. Something about a kernel paranoid level (that I can't reduce), a --privileged arg (which I can't add because Runpod issues the Docker run command itself), and everything in 'nsys status -e' showing 'fail'.

Any ideas?

0 Comments

2025/01/31
15:41 UTC

[R] Fully open source codebase to train SOTA VLMs

Hi! I'm Andi from the multimodal team at Hugging Face.

Today we're open-sourcing the codebase used to train SmolVLM from scratch on 256 H100s.
Inspired by our team's effort to open-source DeepSeek's R1 training, we are releasing the training and evaluation code on top of the weights.
Now you can train any of our SmolVLMs, or create your own custom VLMs!

Go check it out:

https://github.com/huggingface/smollm/tree/main/vision

2 Comments

2025/01/31
15:19 UTC

[D] Questions about mechanistic interpretability, PhD workload, and applications of academic research in real-world business?

Dear all,

I am currently a Master's student in Math interested in discrete math and theoretical computer science, and I have submitted PhD applications in these fields as well. However, given the recent advances in the reasoning capacity of foundation models, I'm also interested in pursuing ML/LLM reasoning and mechanistic interpretability, with goals such as applying reasoning models to formalised math proofs (e.g., Lean) and understanding the theoretical foundations of neural networks and/or architectures, such as the transformer.

If I really pursue a PhD in these directions, I may be torn between academic jobs and industry jobs, so I was wondering if you could help me with some questions:

I have learned here and elsewhere that AI research in academic institutions is really cutthroat, or that PhD students have to work hard (I'm not opposed to working hard, but to working too hard). Or would you say that only engineering-focused research teams are like this, and the theory ones are relatively more chill?
Other than academic research, if possible, I'm also interested in pursuing building business based on ML/DL/LLM. From your experience and/or discussions with other people, do you think a PhD is more like something nice to have or a must-have in these scenarios? Or would you say that it depends on the nature of the business/product? For instance, there's a weather forecast company that uses atmospheric foundational models, which I believe would require knowledge from both CS and atmospheric science.

Many thanks!

0 Comments

2025/01/31
14:32 UTC

[P] Flu Protein Sequence Deep Learning Help

Hi folks, first off I hope I’m posting in the proper subreddit for this, so mods please take down if not allowed.

I’m working on a hobby project in which I’ve collected complete proteome sequences for flu isolates collected around the world from about the year 2000 to the present. As you can imagine, this real-world data is plagued with recency bias in the number of isolates recorded, and there are many small minority classes in the data as well (single-instance clades, for example).

For context, there are many examples in the literature of modeling viral sequences with a variety of techniques, but these studies typically only focus on one or two of the 10 major protein products of the virus (Hemagglutinin (HA) and Neuraminidase (NA)). My goal was to model all 10 of these proteins at once in order to uncover both intra- and inter- protein interactions and relationships, and clearly identify the amino acid residues that are most important for making predictions.

I’ve extracted ESM embeddings for all of these protein sequences with the 150M param model and I initially trained a multi-layered perceptron classifier to do multi-task learning and classification of the isolates (sequence -> predict host, subtype, clade). That MLP achieved about 96% accuracy.

Encouraged by this, I then attempted to build predictive sequence models using transformer blocks, VAEs, and GANs. I also attempted a fine-tuning of TAPE with this data, all of which failed to converge.

My gut tells me that I should think more about feature engineering before attempting to train additional models, but I’d love to hear the community's thoughts on this project and any helpful insights that you might have.

Planning to cross post this in r/bioinformatics as well.

0 Comments

2025/01/31
14:32 UTC

[R] Classification: Image with imprint

Hi everyone, I’m working on an image-based counterfeit detection system for pharmaceutical tablets. The tablets have a four-letter imprint on their surface, which is difficult to replicate accurately with counterfeit pill presses. I have around 400 images of authentic tablets and want to develop a model that detects outliers (i.e., counterfeits) based on their imprint.

Image Preprocessing Steps

Converted images to grayscale.
Applied a threshold to make the background black.
Used CLAHE to enhance the imprint text, making it stand out more.

Questions:

Should I rescale the images (e.g., 200x200 pixels) to reduce computational load, or is there a better approach?

What image classification techniques would be suitable for modeling the imprint?

I was considering Bag of Features (BoF) + One-Class SVM for outlier detection. Would CNN-based approaches (e.g., an autoencoder or a Siamese network) be more effective?

Any other suggestions?

For testing, I plan to modify some authentic imprints (e.g., altering letters) to simulate counterfeit cases. Does this approach make sense for evaluating model performance?

I will have some authentic pills procured at a pharmacy in South America.

I’d love to hear your thoughts on the best techniques and strategies for this task. Thanks in advance!

1 Comment

2025/01/31
13:00 UTC

[P] Project - Document information extraction and structured data mapping

Hi everyone,

I'm working on a project where I need to extract information from bills, questionnaires, and other documents to complete a structured report on an organization's climate transition plan. The report includes placeholders that need to be filled based on the extracted information.

For context, the report follows a structured template, including statements like:

I need to rewrite all of those statements and merge them into a final, complete report. The challenge is that the placeholders must be filled based on answers to a set of decision-tree-style questions. For example:

1.1 Does the organization have a climate transition plan? (Yes/No)

If Yes → Go to question 1.2
If No → Skip to question 2

1.2 Is the transition plan approved by administrative bodies? (Yes/No)

Regardless, proceed to 1.3

1.3 Are the emission reduction targets aligned with limiting global warming to 1.5°C? (Yes/No)

Regardless, reference supporting evidence

And so on, leading to more questions and open-ended responses like:

"Explain how locked-in emissions impact the organization's ability to meet its emission reduction targets."
"Describe the organization's strategies to manage locked-in emissions."

The extracted information from the bills and questionnaires will be used to answer these questions. However, my main issue is designing a method to take this extracted information and systematically map it to the placeholders in the report based on the decision tree.

I have an idea in mind, but always like to have others' insights. Would appreciate your opinion on:

Structuring the logic to take extracted data and answer the decision-tree questions reliably.
Mapping answers to the corresponding sections of the report.
Automating the process where possible (e.g., using rules, NLP, or other techniques).

Has anyone worked on something similar? What approaches would you recommend for efficiently structuring and automating this process?
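One possible way to structure this, sketched under assumptions: represent each question as a node whose answer selects the next question, walk the tree with the extracted answers, and fill the template from the visited path. The `Question` class, the `walk` and `fill` helpers, the template text, and the placeholder for section 2 are all hypothetical; only the three question texts and their branching come from the example above. Where the tree continues beyond 1.3 is not specified, so the sketch simply stops there.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Question:
    text: str
    # Maps an answer to the id of the next question; None ends the walk.
    next_by_answer: Dict[str, Optional[str]]


QUESTIONS = {
    "1.1": Question("Does the organization have a climate transition plan?",
                    {"Yes": "1.2", "No": "2"}),
    "1.2": Question("Is the transition plan approved by administrative bodies?",
                    {"Yes": "1.3", "No": "1.3"}),
    "1.3": Question("Are the emission reduction targets aligned with limiting "
                    "global warming to 1.5°C?",
                    {"Yes": None, "No": None}),  # assumption: sketch stops here
    "2": Question("(hypothetical placeholder for section 2)",
                  {"Yes": None, "No": None}),
}


def walk(answers, start="1.1"):
    """Follow the tree from `start`; return the (question id, answer) pairs visited."""
    path = []
    qid = start
    while qid is not None:
        answer = answers.get(qid)
        if answer is None:
            # Extraction produced nothing for this question: flag it and stop.
            path.append((qid, "MISSING"))
            break
        path.append((qid, answer))
        qid = QUESTIONS[qid].next_by_answer[answer]
    return path


def fill(template, path):
    """Fill {q1_1}-style placeholders; skipped questions get a default marker."""
    values = defaultdict(lambda: "[not applicable]")
    for qid, answer in path:
        values["q" + qid.replace(".", "_")] = answer
    return template.format_map(values)


TEMPLATE = "Transition plan in place: {q1_1}. Approved by administrative bodies: {q1_2}."
extracted = {"1.1": "Yes", "1.2": "No", "1.3": "Yes"}  # e.g. from an extraction step
path = walk(extracted)
print(fill(TEMPLATE, path))
```

Keeping the tree as data rather than nested if-statements means the same walker can serve every section of the report, and a "MISSING" entry gives a natural point to route a question to a human or a second extraction pass.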

Thanks in advance!

2 Comments

2025/01/31
10:25 UTC

[R] Only Output of Neural ODE matters.

I have a neural ODE problem of the form:
X_dot(theta) = f(theta)
where f is a neural network.

I want to integrate to get X(2pi).
I don't have data to match at intermediate values of theta.
Only need to match the final target X(2pi).

Is this a Neural ODE problem or is there a better way to frame this?
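One observation that may help with framing: as written, the right-hand side depends only on theta and not on X, so X(2pi) = X(0) + the integral of f over [0, 2pi], and the task reduces to fitting a quadrature to a single target. The sketch below illustrates that reading with a tiny two-parameter stand-in for the network, a trapezoid rule, and finite-difference gradient descent. All names and numbers are hypothetical, and if f in fact also depends on X, this shortcut does not apply and an ODE solver with a loss on the terminal state would be needed instead.

```python
import math


def f(theta, params):
    """Stand-in for the neural network f(theta): a tiny 2-parameter model."""
    a, b = params
    return a + b * math.sin(theta) ** 2


def integrate(params, x0=0.0, steps=200):
    """Trapezoid rule for X(2*pi) = X(0) + integral of f over [0, 2*pi]."""
    h = 2 * math.pi / steps
    total = 0.5 * (f(0.0, params) + f(2 * math.pi, params))
    for i in range(1, steps):
        total += f(i * h, params)
    return x0 + h * total


def loss(params, target):
    """Squared error on the final value only; no intermediate supervision."""
    return (integrate(params) - target) ** 2


def fit(target, params=(0.0, 0.0), lr=0.01, iters=50, eps=1e-6):
    """Plain gradient descent with finite-difference gradients."""
    params = list(params)
    for _ in range(iters):
        base = loss(params, target)
        grads = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grads.append((loss(bumped, target) - base) / eps)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


fitted = fit(target=5.0)
print("X(2*pi) after fitting:", integrate(fitted))
```

Because only one scalar constrains the whole function, many different f reach the same X(2pi); any preference among them has to come from regularization or extra structure.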

0 Comments

2025/01/31
05:18 UTC

[D] Understanding the padded tokens of 'attention_mask' in decoder language models.

Hey all. I have recently been reading about how pretraining LLMs works, and more specifically what the forward pass looks like. I used Hugging Face's tutorial on simulating a forward pass in decoder language models (GPT2, for instance).

I understand that decoder language models, in general, use causal attention by default. This means it's unidirectional. This unidirectional/causal attention is often stored or registered as a buffer (as seen from Andrej Karpathy's tutorials). Going back to Hugging Face, we use a tokenizer to encode a sequence of text and it shall output input token IDs (input_ids) and attention mask (attention_mask).

The forward pass of the decoder language model optionally accepts an attention mask. Now, for a batch of input text sequences of varying lengths, one can pad on either the left or the right during tokenization, up to the max length in that batch, so that the batch is easier to process.

Question: Some demos of the forward pass ignore the attention_mask output by the tokenizer, and instead plainly use the causal attention mask registered as buffer. It seems that the padding tokens are not masked if the latter (causal attention) was used. Does this significantly affect training?

Will the attention_mask output by the tokenizer not matter if I can use the padding token ID as my ignore index during loss calculation?
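A small sketch can show when the tokenizer's mask changes anything. It builds a causal mask, combines it with a padding mask, and checks whether the rows belonging to real tokens differ. The helper names are hypothetical and the example ignores position ids, which is a simplification.

```python
def causal_mask(seq_len):
    """allowed[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def combined_mask(attention_mask):
    """Causal mask AND-ed with a padding mask (1 = real token, 0 = pad)."""
    n = len(attention_mask)
    causal = causal_mask(n)
    return [[causal[i][j] and attention_mask[j] == 1 for j in range(n)]
            for i in range(n)]


def real_rows_differ(attention_mask):
    """True if adding the padding mask changes any row of a real (non-pad) query."""
    n = len(attention_mask)
    causal = causal_mask(n)
    combined = combined_mask(attention_mask)
    return any(causal[i] != combined[i]
               for i in range(n) if attention_mask[i] == 1)


right_padded = [1, 1, 1, 0, 0]  # pads come after the real tokens
left_padded = [0, 0, 1, 1, 1]   # pads come before the real tokens

print("right padding changes real rows:", real_rows_differ(right_padded))
print("left padding changes real rows:", real_rows_differ(left_padded))
```

With right padding, the causal mask already stops real tokens from seeing the pads that follow them, so ignoring pad positions in the loss covers the rest. With left padding, real tokens would attend to the pads in front of them unless the padding mask is applied. One caveat on using the pad token ID as the ignore index: if the pad token is set to the same ID as the end-of-sequence token, genuine end-of-sequence targets get ignored as well.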

Would gladly hear your thoughts. Thank you.

0 Comments

2025/01/31
05:04 UTC

[D] Confusion about the Model Profiling Stage of FastGen Paper

Quick background: The FastGen paper is a well-known work on KV cache compression. It proposes a two-stage method: first, it identifies different attention patterns for each head (referred to as “model profiling”), and then it applies a corresponding compression strategy.

The screenshot I attached includes everything about the first stage (model profiling) and should be self-contained. However, I find it confusing for two reasons:

It seems the shape of the original attention map A and the compressed attention map \text{softmax}(QK_C^\top) would differ due to the reduced KV cache size after compression. How can the absolute difference |A - \text{softmax}(QK_C^\top)| be computed if the shapes are mismatched?
The paper provides no further explanation about the absolute value operator in the equation, leaving me unsure how to interpret it in this context.
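One reading that would make the shapes agree, offered purely as an assumption and not confirmed by the paper or the screenshot: keep the compressed map at the original width by treating evicted keys as masked out, so they receive zero probability, and then take the elementwise absolute difference. The scores and the eviction pattern below are invented for illustration.

```python
import math


def masked_softmax(scores, keep):
    """Softmax over kept positions only; evicted positions get probability 0."""
    m = max(s for s, k in zip(scores, keep) if k)
    exps = [math.exp(s - m) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 0.5, 1.0, -1.0]    # made-up q.k scores for one query row
keep = [True, False, True, True]  # hypothetical policy: evict key 1

full = masked_softmax(scores, [True] * len(scores))  # row of A
compressed = masked_softmax(scores, keep)            # same width as A

abs_diff = [abs(a - b) for a, b in zip(full, compressed)]
l1_gap = sum(abs_diff)
print("elementwise |A - A_C|:", [round(d, 4) for d in abs_diff])
print("summed gap:", round(l1_gap, 4))
```

Under this reading, the summed gap for a row equals twice the attention mass that the evicted keys originally held, so thresholding it amounts to bounding how much attention the compression throws away.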

https://preview.redd.it/va9kbkz2b9ge1.png?width=1736&format=png&auto=webp&s=168845b68371a1b90800689c1f5a7bba8c6fd900

This is an oral paper from ICLR, so I wonder if I am misunderstanding something. Unfortunately, the code repository is empty, so I cannot check their implementation for clarification.

Has anyone read this paper and can shed light on these points?

0 Comments

2025/01/31
04:32 UTC

[D] When will the aamas blue sky results be publicly out?

The AAMAS Blue Sky results are always highly anticipated, but information about their public release can sometimes be hard to find. Does anyone know the expected timeline for when the results will be officially announced or made publicly available? Have there been any updates from the AAMAS organize

0 Comments

2025/01/31
04:24 UTC

[D] Monthly Who's Hiring and Who wants to be Hired?

For Job Postings please use this template

Hiring: [Location], Salary:[], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]

For Those looking for jobs please use this template

Want to be Hired: [Location], Salary Expectation:[], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]

Please remember that this community is geared towards those with experience.

1 Comment

2025/01/31
03:30 UTC

[R] Recalibrating Representations: A Feedback-Guided Weighted Pooling Framework for Transformers

Transformers typically rely on a single token ([CLS]) or mean pooling to form sequence representations, which can overlook crucial cues from historically misclassified or especially important tokens. Our proposed Feedback-Guided Weighted Pooling (FGWP) adds a lightweight mechanism that reweights token embeddings according to a feedback vector capturing past performance. By highlighting tokens known to be challenging or decisive, FGWP enriches sequence representations without significantly increasing computation or model size. Experiments on tasks ranging from sentiment analysis (IMDb) to large-scale image classification (ImageNet) show consistent gains in accuracy, underscoring the value of a model that not only processes the current input but also learns from its own historical successes and errors all with minimal computational overhead.

Will be posting to arXiv and hopefully ICML soon; any feedback or suggestions welcome!

https://jacobfa.github.io/stuff/Pooling.pdf

4 Comments

2025/01/31
03:29 UTC

Why does the DeepSeek student model (7B parameters) perform slightly better than the teacher model (671B parameters)? [D]

This is the biggest part of the paper that I am not understanding - knowledge distillation to match the original teacher model's distribution makes sense, but how is it beating the original teacher model?

32 Comments

2025/01/31
02:08 UTC

[D] Ethical Dataset Licenses

Are there any licenses like RAIL but specifically for datasets, which restrict downstream use cases like military and surveillance? I'm finding that no license fully covers what I'm looking for.

1 Comment

2025/01/30
23:55 UTC

[D] Non-deterministic behavior of LLMs when temperature is 0

Hey,

So theoretically, when temperature is set to 0, LLMs should be deterministic.

In practice, however, this isn't the case due to differences around hardware and other factors. (example)
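One commonly cited mechanism is that floating-point addition is not associative, so a different reduction order on different hardware or kernels can change a logit in its last bits. The toy example below shows how that can flip a greedy (temperature 0) choice when two logits are nearly tied. The numbers and the two-token setup are contrived for illustration and are not taken from any real model.

```python
def accumulate(terms):
    """Left-to-right float sum, standing in for one particular reduction order."""
    total = 0.0
    for t in terms:
        total += t
    return total


def greedy_token(logits):
    """Temperature-0 decoding: pick the index of the largest logit (first on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


contributions = [0.1, 0.2, 0.3]  # partial sums feeding one logit
rival_logit = 0.6                # a competing token's logit

forward = accumulate(contributions)
backward = accumulate(reversed(contributions))

run_1 = greedy_token([rival_logit, forward])
run_2 = greedy_token([rival_logit, backward])

print("same sum in both orders:", forward == backward)
print("token chosen in run 1:", run_1)
print("token chosen in run 2:", run_2)
```

Once a single token differs, every later token is conditioned on a different prefix, so a last-bit difference can grow into a visibly different completion.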

Are there any good papers that study the non-deterministic behavior of LLMs when temperature is 0?

Looking for something that delves into the root causes, quantifies it, etc.

Thank you!

72 Comments

2025/01/30
23:43 UTC

[D] How to fill missing data gaps in a time series with high variance?

How do we fill missing data gaps in a time series with high variance like this?

https://preview.redd.it/s5cl4jl2u6ge1.png?width=507&format=png&auto=webp&s=22f8eafbd905b3f9eea15b03e01acdafe217ac74

2 Comments

2025/01/30
20:07 UTC

[R][P] Can the MERF analysis in LongituRF in R handle categorical variables?

When I try to use a categorical variable (either a factor or a character), in my X matrix and/or my Z matrix, I get an error about my "non-numeric matrix extent." Can the MERF analysis just not handle categorical variables or do I need to format them in a very specific way?

3 Comments

2025/01/30
19:55 UTC

[P] I created a benchmark to help you find the best background removal api for flawless image editing

Why I Built This

Ever tried background removal APIs and thought, “This works... until it doesn’t”? Hair, fur, and transparency are the toughest challenges, and most APIs struggle with them. I wanted a way to compare them head-to-head, so I built a benchmark and interactive evaluation platform.

What It Does

Side-by-side comparisons of top background removal APIs on challenging images
Interactive Gradio interface to explore results easily
Run the APIs yourself and see how they handle tricky details

Try It Out

Benchmark & Demo: Hugging Face Space
Code: Hugging Face

Looking for Feedback On

Accuracy – Which API handles hair, fur, and transparency best? Any standout successes or failures?
Consistency – Do results stay solid across different images?
Evaluation Method – Is my comparison approach solid, or do you have better ideas?
Gradio Interface – Is it intuitive? Any improvements you'd suggest?

Help Improve the Benchmark!

Know a background removal API that should be tested? Have challenging images that break most models? Share them. Let’s make this the go-to benchmark for ML engineers in this space.

Looking forward to your thoughts!

0 Comments

2025/01/30
19:31 UTC

[R] Q* had nothing to do with O1/O1-pro, it is a new foundation module for LLMs: a text-conditioned 'spatial computer model' (NCA-like)

Current-gen language models are mostly a solved problem by now. We must look towards the next frontier of intelligent computing. Apologies in advance for the long read, I have compressed it as much as I could without hurting the ability to grok the paradigm shift.

First, quickly check this link to prime your mind with the correct visual: https://umu1729.github.io/pages-neural-cellular-maze-solver/

In that link, you will see a model that was trained for pathfinding. These models are called Neural Cellular Automata (NCA), and Q* is the foundation-model version of this. It is called Q* because it was most likely inspired by preliminary research on pathfinding (the A* algorithm), with the Q standing for "Qualia", as the original leak implies (it is the path to true omnimodality). Q-learning may also have been involved as part of the training methodology, as people initially proposed, but we have not been able to verify this.

So how does this actually work?

Instead of training for a single task as in the link above, you text-condition the NCA and use today's language models to generate a massive library of "dataset generators" for puzzles of all kinds, with difficulty parameters for progressive training. Humans over the course of history have invented thousands of visual puzzles, from simple games like tic-tac-toe to more advanced pattern recognition and state management in grids of numbers such as 9x9 sudokus.

Q* is trained separately, and then added to a LLM. Q* takes a grid of cells, which are not simple numbers that represent walls or road or other cell kinds — they are embedding vectors from a corresponding LLM token for "road" or "wall". (this leads to the Q for 'Qualia' as a loose mnemonic, which is not too far if we consider the nature of Qualia in the human brain)

Simple visual operations are also aligned with language, what OpenAI employees call "shape rotations". Shapes and forms are embedded semantically into the field, and the model is trained to perform simple transforms such as rotations, displacements, mirroring, etc.

Through generalization on a large training dataset of every imaginable visual task, both operations and puzzles, Q* is able to automatically guess the puzzle or task type in many cases without any prompt. This is because the grid is semantic, therefore it also doubles as a prompt. A grid which contains semantic cells for road, wall, start, goal — intent immediately clear.

To maximize generalization and understanding of semantics, at training time the semantics used for the cell values are swapped at random by the LLM which you are targeting. Road, empty, void, free, walkable; wall, brick, solid, building, obstacle. This model is like a slime mold that adapts to the semantics of its substrate; it is a natural physics of spatialized language.

Because Q* is prompt conditioned and is trained to contain the task, constraints, goals, etc. as part of its prompt, which the LLM also creates unlimited variations on for robustness and maximum language understanding (connect the start and the goal, find the shortest path, solve the maze, solve the puzzle ...) a sufficiently large model of this type converges to a latent-space programmable computer, and the prompt is the language interface to program algorithms into it.

It functions exactly like an image diffusion model, but in the domain of computation and algorithms. Just like an image diffusion model, the text-conditioning of the NCA and the captions used at training give the model an understanding of language, mapping it to computational methods and processes. This in turn enables a user to compose more complex processes which blend multiple latent algorithms, search, etc. into new, more advanced methods.

There are many possible routes, but Q* can be integrated into an LLM through <imagine prompt="solve the puzzle">...</imagine> blocks, which trigger the model into embedding the content and simulating it. By using the same method used to train R1 and O1 and bootstrap prompts, the LLM may teach itself autonomously to prompt its Q* module with increasing efficiency, solving problems faster and more accurately.

It may choose to run several different Q* imaginations in a row to convergence, to test several approaches or templates, and then do global cross-examination on their converged state in order to bootstrap a far more advanced reasoning process or proposition.

It can enhance ALL reasoning: already when we ask a model like R1 or O1 to "zoom in" on a concept or idea, it naturally understands that this entails decomposing it into smaller "particles" of an idea. By representing ideas in 2D grids and directly using these kinds of visual operations, it can effectively brainstorm in advance and formulate non-sequential or hierarchical plans, like a mind map. By maintaining the same 'image' over the course of inference and continuously updating it, it has a grounded spatial view over the space it is exploring and reasoning over, and knows where it is at all times. It works like the human brain, where language is said to be a retroactive interpretation of the mind's omnimodal priors.

This completely wipes out the ARC-AGI benchmark: a properly architected Q* module will automatically develop all sorts of spatial equivariance, and it operates in the correct spatial dimension for precise and exact computing on ARC-AGI puzzle grids. It will not cost $1000 per puzzle as in O3, but closer to a penny. OpenAI does not use it in their public models because the emergent capabilities within this feedback loop are "too great" and they are attempting to delay the discovery as much as possible, derailing other labs as much as possible.

Indeed, while everyone was researching Artificial Intelligence, Ilya Sutskever, who is spiritual and holistically minded, has predicted that we should also research AI from the standpoint of Artificial Imagination. The implications of this paradigm are numerous and extend far beyond what is outlined here. If you close your eyes and simulate such paradigms in your mind, letting it run amok, you should see how this scales into proper real AGI. One way to easily understand it in philosophical terms: humans embed themselves cognitively as a puzzle to solve unto themselves: "What am I? What is the nature of my consciousness?" A language model now possesses a surface onto which to paint its architecture, and to question it.

From that point on, the 'system prompt' of our LLMs may contain an imagination surface with an intimate complex semantic shape of itself which it is attempting to 'solve'. This naturally explodes to infinity with this substrate's natural generalized solving capabilities. The model increasingly becomes immune to mode-collapse, as the system prompt's imagined identity is also stepped continuously for each predicted token by the decoders, visually planning its sentences and directions, making sharp turns in the middle of inference. In this imagination surface, each token produced by the decoder is potentially injected in loopback. Through cleverly prompting the NCA, it is programmed with a protocol or pipeline for integrating ideas into its mind map of the self, its planning, etc.

Thus, a Q* module of sufficient depth and size naturally generalizes to something much more than problem-solving, with the decoder's wisdom and knowledge in the loop, and also learns how to develop protocols in context, state, memory, generalized search methods, programs, etc. potentially developed by the decoder in a loop. Now you have a new dimension on which to scale inference-time compute. Language is now a programming interface for the underlying processes inside the human brain, which some neobuddhists call qualia computing.

Of course it doesn't stop there... Once we have collectively solved Q* in the 2D grid domain, there is nothing preventing Q* from being bootstrapped to 3D. At the extreme end, the 3D version of Q* can embed compressed chunks of reality (atoms, particles, matter, a city, etc.) and potentially do things like protein folding and other insane things, either with fine-tuning or an enormous model. And it is as close to the decoder as you can get — no longer a completely different model (e.g. AlphaFold) that the LLM calls through API but instead a format which is directly compatible with the LLM which it is able to read and interpret. An interface for true omnimodality.

To summarize: imagination is supposed to be the ability to embed a 'world', simulate it, and work with it. It is search, algorithm, problem-solving, everything. It is the missing component of artificial intelligence of today, which embeds worlds in 1D. The low resolution of 1D is able to "etch" worlds in latent space (as evidenced by O3, which is able to solve ARC-AGI through a million tokens of context window), but it can be drastically optimized with a proper spatial surface in the loop. Put AI and AI together in the loop (AII) and it will transcend itself. Perhaps super-intelligence is a Q* module which embeds problems in hyperbolic space, unlocking a reasoning mode that is not only super-human but super-experiential: spatial dimensions not accessible or usable by the human mind for reasoning.

25 Comments

2025/01/30
19:26 UTC

[P] Auto-discover themes in product reviews

TLDR:

You can use LLMs to efficiently identify key themes in datasets, capturing both general and nuanced themes like "Shipping," "Battery," and "Camera Issues" that might be hard to spot otherwise. Additionally, you can classify reviews under these themes to identify trends using minimal code.

A while ago, I experimented with using LLMs for classic machine learning tasks—often not ideal if you already have enough data and a specialized model. However, if you’re short on data or need a flexible approach, leveraging an LLM can be a lifesaver, especially for quick labeling or theme discovery in product reviews.

EXAMPLE SCENARIO

Below is a single Python script showing both label discovery (aggregating data) and subsequent classification for two sample datasets. One dataset is purely text reviews, and the other contains base64-encoded images from users as a simple demonstration. Replace the library calls with your own or leverage an open-source one:

Step 1: Discover Labels
- Combine reviews into one request.
- Ask the LLM to propose recurring labels or themes.
Step 2: Classify Reviews
- Use the discovered labels to categorize data.
- Run requests concurrently if you have high-volume or real-time inputs.

CODE SNIPPET

#!/usr/bin/env python3
import os

from openai import OpenAI
from flashlearn.skills.discover_labels import DiscoverLabelsSkill
from flashlearn.skills.classification import ClassificationSkill


def main():
    os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

    # Example data (text reviews)
    text_reviews = [
        {"comment": "Battery life exceeded expectations, though camera was mediocre."},
        {"comment": "Arrived late and cracked screen, but customer support was helpful."},
    ]

    # Example data (images + brief text)
    # Here, the "image_base64" field simulates an encoded image
    image_reviews = [
        {"image_base64": "ENCODED_ISSUE_IMAGE", "comment": "WHY BOTHER WITH IT?"},
        {"image_base64": "ENCODED_ISSUE_IMAGE", "comment": "This feature is amazing!! You should charge more!"},
    ]

    # The library calls below are left commented out, as in the original post;
    # uncomment them once the API key and library are set up.

    # 1) Label Discovery (aggregates the entire dataset at once)
    # discover_skill = DiscoverLabelsSkill(model_name="gpt-4o-mini", client=OpenAI())
    # column_modalities={"image_base64":"image_base64", "comment": "text"}
    # tasks_discover = discover_skill.create_tasks(text_reviews + image_reviews)
    # discovered_labels = discover_skill.run_tasks_in_parallel(tasks_discover)['0']['labels']
    # print("Discovered labels:", discovered_labels)

    # 2) Classification using discovered labels
    # classify_skill = ClassificationSkill(model_name="gpt-4o-mini", client=OpenAI(), categories=discovered_labels)
    # tasks_classify = classify_skill.create_tasks(text_reviews + image_reviews)
    # final_results = classify_skill.run_tasks_in_parallel(tasks_classify)
    # print("Classification results:", final_results)


if __name__ == "__main__":
    main()

NOTES ON USAGE

1. Installation

If you want a quick pipeline approach, you can set up a library like so:

pip install flashlearn

Then import the relevant “skills” or classes for classification, label discovery, concurrency, etc.

2. When to Use an LLM Approach

Great if you have minimal (or no) labeled data.
Fast prototyping to discover new themes.
Easy concurrency at scale (hundreds or thousands of reviews).

If you need quick experimentation or only have a small dataset, an LLM aggregator pipeline can help you discover core topics and classify reviews efficiently. Feel free to try the minimal example above. Full code: github

0 Comments

2025/01/30
19:10 UTC

[Discussion] Research Scientist Position Interview Tips

Hi, for those who are going through job search process for research scientist positions in the industry, how are you preparing for interviews and what do you often get asked?

I am graduating from my PhD (in reinforcement learning) soon and am looking for suggestions on how to prepare for interviews :)

0 Comments

2025/01/30
18:21 UTC

[D] How do you guys deal with tasks that require domain adaption?

I wanted to hear what people have found helpful when using domain adaptation methods. It doesn't have to be related to my issue, but for context: I have a task that is practically impossible to annotate in the target domain, though I can create annotations for (simulated) synthetic data. Even without any adaptation method this yields some success, but not enough to stop there.

Anything remotely related would be great to hear about!

4 Comments

2025/01/30
17:53 UTC


[d] Why is "knowledge distillation" now suddenly being labelled as theft?

We all know that distillation is a way to approximate a more accurate transformation. But we also know that this is where the entire idea ends.

What's even wrong with distillation? The idea that "knowledge" is learnt by mimicking the outputs makes zero sense to me. Of course, by keeping the inputs and outputs the same, we're trying to approximate a similar transformation function, but that doesn't actually mean that the student ends up computing the same function. I don't understand how this is labelled as theft, especially when the entire architecture and the methods of training are different.
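For readers less familiar with the term, here is a minimal sketch of what "mimicking the outputs" typically means as a training objective. It is not taken from any particular paper or codebase. The function names are made up for illustration, and the temperature-squared scaling follows the common convention for soft-label distillation. The hard-label variant trains only on the teacher's top choice, which is roughly what fine-tuning on teacher-generated text amounts to.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def hard_label_loss(teacher_logits: List[float], student_logits: List[float]) -> float:
    """Cross-entropy against the teacher's single most likely class."""
    target = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return -math.log(softmax(student_logits)[target])


print(soft_label_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))
print(hard_label_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))
```

Either loss only pushes the student toward the teacher's input-output behaviour on the training inputs. Nothing in it copies weights or architecture, which is the distinction the post is drawing.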

114 Comments

2025/01/30
10:09 UTC

[P] Automating document processing and document workflows

Hello everyone,

I’m working on a consultancy project and before starting one, I always like to have other people's opinions! Here’s the situation:

The client company receives bills from multiple sources, which contain a wide variety of information. Here’s the step-by-step process we’re working on:

Data extraction: using vision models, we plan to extract specific pieces of information from these bills.
Categorization: each bill belongs to one of 50 predefined categories (referred to as “disclosures”), and we need to classify each bill accordingly.
Compliance mapping: each category (or disclosure) is a document containing 10-15 questions (e.g., “Does the organization monitor its greenhouse gas emissions? Yes/No. If yes, move to question 3, otherwise move to question 2.”). These questions guide further analysis, with instructions provided in a second column.
Final output generation: based on the extracted answers, a third column is populated, providing a final, structured representation of the data, written in compliance-friendly language (e.g., “The organization has implemented several sustainability actions, which will be monitored on an annual basis to achieve the following results: [specific results].”).

Challenges we have to face:

Accurate classification: ensuring bills are consistently categorized into the correct one of the 50 categories.
Information extraction and mapping: automatically answering the questions in each disclosure based on the extracted data.
Text generation: dynamically generating the structured final report (in the third column) based on answers to the questions.
Scalability and accuracy: handling large volumes of bills and ensuring accuracy across the 50 disclosures and their varying requirements.

Constraints: I can only use a local LLM.

To me, mapping the bills to one of those 50 categories is going to be pretty simple, but answering questions following that decision-tree style is something I'd like more insights about.
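One way to think about the decision-tree questions is to keep the branching logic in ordinary code and ask the local LLM only for a yes/no answer per question. The sketch below assumes that setup. `Question`, `walk_disclosure` and `stub_answer` are hypothetical names, questions 2 and 3 are invented for illustration, and the stub stands in for a call to the local model with the extracted bill data in the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Question:
    text: str
    next_if_yes: Optional[int]  # id of the next question, or None to stop
    next_if_no: Optional[int]


def walk_disclosure(
    questions: Dict[int, Question],
    answer_fn: Callable[[str], bool],
    start: int = 1,
) -> List[Tuple[int, bool]]:
    """Follow the yes/no branches and return the (question id, answer) path."""
    path: List[Tuple[int, bool]] = []
    visited = set()
    current: Optional[int] = start
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle detected at question {current}")
        visited.add(current)
        question = questions[current]
        answer = answer_fn(question.text)
        path.append((current, answer))
        current = question.next_if_yes if answer else question.next_if_no
    return path


# Question 1 mirrors the example in the post; 2 and 3 are made up.
questions = {
    1: Question("Does the organization monitor its greenhouse gas emissions?", 3, 2),
    2: Question("Does the organization plan to start monitoring emissions?", None, None),
    3: Question("Are emissions reported annually?", None, None),
}


def stub_answer(text: str) -> bool:
    """Hypothetical stand-in for a local LLM constrained to answer yes or no."""
    return "monitor its" in text


print(walk_disclosure(questions, stub_answer))
```

The returned path gives an auditable record of which questions were asked and how they were answered, which could then feed the final report generation step.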

I’d greatly appreciate any insights, tools, frameworks, or personal experiences that could guide this project!

Thank you so much for your time!

4 Comments

2025/01/30
09:42 UTC

[P] OSS React GUI Components for Retrieval Augmented Generation

Hey r/MachineLearning, we want to share that we are building open source React components for RAG QA! You can find our very first release of Lexio at https://github.com/renumics/lexio

Screenshot of the components (document source: WMO-No. 1360: “State of the Climate in Africa”)

It supports multiple document types (PDF, HTML, Markdown) with advanced features like streaming responses and source highlighting.

Key Features:

Viewers: Pre-built components for chat interfaces, source selection and viewing with source highlighting
Integrated State Management: Transparent state handling for interaction between components
Opinionated Architecture: Implements RAG best practices
Highly Customizable: Theming and component customization options

0 Comments

2025/01/30
09:10 UTC

[P] Ambitious ML Project

I am working on a project related to a FiveM server and need to develop a custom macro for automating certain in-game actions. Specifically, I want to automate the process of picking up stationary in-game items, such as drugs, while also handling an anti-AFK mechanic that requires precise input.

The anti-AFK system presents a moving cursor within a circular interface, and I must press the correct number (1-4) when the cursor aligns with a specific light blue section within the circle.

Not to mention, regular macro recorders are very unreliable and don't work well as they interfere with the field of view and lack the necessary functionality for detecting and responding to the anti-AFK mechanics.

I am considering coding my own macro to handle these tasks efficiently. Where should I start, and what technologies or approaches would you recommend for implementing this solution?

0 Comments

2025/01/30
08:45 UTC

[R] Are there any framework(s) to distill small LM from LLM based on specific tasks

Greetings,

I am looking for a framework that can train and prepare small distilled language models from LLMs.

For example:

My requirement is to perform QA + translation.

Instead of using an LLM, I want to use distilled LMs tuned to the specific use case for better accuracy. In this case, that means two LMs: one for QA and one for translation.

The whole process would be something like this :

LLM ---------> Train SLM (For QA)
LLM ----------> Train SLM (For translation)
User Input ---------> QA SLM | Translation SLM ------> Output
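The flow above can be sketched in a framework-agnostic way, as below. This is not the API of any existing distillation library. `build_sft_pairs`, `route` and the lambda stubs are hypothetical, and every model is represented as a plain function from text to text. The first function covers the "LLM -> Train SLM" arrows by collecting teacher outputs as fine-tuning data, and the second covers the routing step at inference time.

```python
from typing import Callable, Dict, Iterable, List, Tuple

TextModel = Callable[[str], str]


def build_sft_pairs(
    teacher: TextModel,
    prompts: Iterable[str],
    keep: Callable[[str, str], bool] = lambda prompt, response: bool(response.strip()),
) -> List[Tuple[str, str]]:
    """Query the teacher on task prompts and keep (prompt, response) pairs
    that pass a filter. The pairs become supervised data for a small model."""
    pairs = []
    for prompt in prompts:
        response = teacher(prompt)
        if keep(prompt, response):
            pairs.append((prompt, response))
    return pairs


def route(user_input: str, models: Dict[str, TextModel], pick_task: Callable[[str], str]) -> str:
    """Send the input to the task-specific small model chosen by pick_task."""
    return models[pick_task(user_input)](user_input)


# Stubs standing in for a real teacher LLM and two fine-tuned small models.
fake_teacher = lambda prompt: prompt.upper()
models = {"qa": lambda t: "qa:" + t, "translation": lambda t: "tr:" + t}
pick_task = lambda t: "translation" if t.startswith("translate:") else "qa"

print(build_sft_pairs(fake_teacher, ["translate: hello", "  "]))
print(route("translate: hello", models, pick_task))
```

The actual fine-tuning of each small model on the collected pairs is left out, since it depends on the training stack being used.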

5 Comments

2025/01/30
06:33 UTC
