/r/MachineLearning
Hello, I am working on an e-commerce project and I need a text-to-image model. I want to deploy this model on Google Cloud Platform (GCP), but this process seems quite new and complicated to me. Since I have limited time, I would like to know which of the following scenarios is more suitable:
Using ready-made GitHub models: For example, pre-trained models like Stable Diffusion. Can I import and use these models on GCP? If possible, can you share the recommended steps for this?
Google Cloud Marketplace: Would it be easier to buy a ready-made solution from GCP Marketplace? If so, what are the recommended APIs or services?
My goal:
To take inputs from user data (e.g. a string array) in the backend and return output via a text-to-image API.
Since I have an e-commerce project, I need a scalable solution for high traffic.
Information:
Backend: Requests will come via REST API.
My project allows users to create customized visuals (e.g. product designs).
Instead of training a model from scratch, I prefer ready-made solutions that will save time.
My questions:
Which way is more practical and faster? A ready-made model from GitHub or a solution from Google Cloud Marketplace?
If I prefer a model from GitHub, what steps should I follow to import these models to GCP?
How can I optimize a scalable text-to-image solution on GCP for a high-traffic application?
Who I'd like to hear from:
If you have experience with Stable Diffusion or similar models, could you share it?
I would especially like to get suggestions from those who have started such a project on Google Cloud.
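For reference on the GitHub-model route, here is a rough sketch of serving a pre-trained checkpoint behind a REST endpoint with diffusers and FastAPI (the model id and request schema are placeholders, and a GPU is assumed); a container built around something like this could be deployed to Cloud Run or a GPU-backed Vertex AI endpoint:

```python
# Minimal sketch (not production code): wrap a pre-trained diffusers checkpoint
# in a REST endpoint. Model id and request fields are placeholders; assumes a GPU.
import base64
import io

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # swap for whichever checkpoint you choose
    torch_dtype=torch.float16,
).to("cuda")

class GenerateRequest(BaseModel):
    prompts: list[str]                  # e.g. the user's string array from the backend

@app.post("/generate")
def generate(req: GenerateRequest):
    images = pipe(req.prompts, num_inference_steps=30).images
    encoded = []
    for img in images:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        encoded.append(base64.b64encode(buf.getvalue()).decode())
    return {"images": encoded}          # base64 PNGs back to the caller
```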
A year ago I started trying to use PPO to play the original Legend of Zelda, and I was able to train a model to beat the first boss after a few months of work. I wanted to share the project just for show and tell. I'd love to hear feedback and suggestions, as this is just a hobby project and I don't do this for a living. The code for that lives in the original-design branch of my Triforce repo. I'm currently tinkering with new designs, so the main branch is much less stable.
Here's a video of the agent beating the first dungeon, which was trained with 5,000,000+ steps. At 38 seconds, it has learned that it's invulnerable at the screen edge, so it exploits that to avoid damage from a projectile. At 53 seconds it steps up to avoid damage, even though it takes a -0.06 penalty for moving the wrong way (taking damage would be a larger penalty). At 55 seconds it walks towards the rock projectile to block it. And so on; lots of the little things the model does are easy to miss if you don't know the game inside and out.
As a TLDR, here's an early version of my new (single) model. This doesn't make it quite as far, but if you watch closely its combat is already far better, and it was only trained on 320,000 steps (roughly 6% of the first model's training steps).
This is pretty far along from my very first model.
I got the original project working using stable-baselines' PPO and default neural network (the shared NatureCNN, I believe). SB was great to get started with but ultimately stifling. In the new version of the project I've implemented PPO from scratch in PyTorch with my own simple neural network, similar to stable-baselines' default. I'm playing with all kinds of changes and designs now that I have more flexibility and control. Here is my rough original design:
My first pass through this project was basically "imagine playing Zelda with your older sibling telling you where to go and what to do". I give the model an objective vector which points to where I want it to go on the screen (as the bird flies; the agent still had to learn pathfinding to avoid damage and navigate around the map). This vector either points at the nearest enemy I want it to kill or is a N/S/E/W direction if it's supposed to move to the next room.
Due to a few limitations with stable-baselines (especially around action masking), I ended up training unique models for traversing the overworld vs the dungeon (since they have entirely different tilesets). I also trained a different model for when we have sword beams vs not. In the video above you can see which model is being used onscreen.
In my current project I've removed this objective vector as it felt too much like cheating. Instead I give it a one-hot encoded objective (move north to the next room, pickup items, kill enemies, etc). So far it's working quite well without that crutch. The new project also does a much better job of combat even without multiple models to handle beams vs not.
Image - The standard neural network had a really tough time being fed the entire screen. No amount of training seemed to help. I solved this by creating a viewport around Link that keeps him centered. This REALLY helped the model learn.
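A sketch of the kind of egocentric crop described here (window size and padding mode are assumptions, not the repo's exact values):

```python
import numpy as np

# Crop a fixed window centered on the agent so the network always sees Link
# in the middle. Assumes an HWC frame; size/padding are my guesses.
def centered_viewport(frame: np.ndarray, link_x: int, link_y: int, size: int = 64) -> np.ndarray:
    half = size // 2
    # Pad so crops near the screen edge keep the same shape.
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)), mode="edge")
    # Padding shifts coordinates by `half`, so the original (link_y, link_x)
    # is the top-left corner of the centered window in padded coordinates.
    return padded[link_y:link_y + size, link_x:link_x + size]
```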
I also had absolutely zero success with stacking frames to give Link a way to see enemy/projectile movement. The model simply never trained with stable-baselines when I implemented frame stacking and I never figured out why. I just added it to my current neural network and it seems to be working...
Though my early experiments show that giving it 3 frames (skipping two in between, so frames curr, curr-3, curr-6) doesn't really give us that much better performance. It might if I took away some of the vectors. We'll see.
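A sketch of that skipped-frame stacking scheme (buffer length and stacking axis are assumptions):

```python
from collections import deque

import numpy as np

# Keep a short history and stack frames t, t-3, t-6 along the channel axis,
# mirroring the "curr, curr-3, curr-6" scheme described above.
class SkippedFrameStack:
    def __init__(self, skip: int = 3, count: int = 3):
        self.skip, self.count = skip, count
        self.frames = deque(maxlen=skip * (count - 1) + 1)

    def add(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)
        picks = []
        for i in range(self.count):
            idx = len(self.frames) - 1 - i * self.skip
            picks.append(self.frames[max(idx, 0)])   # repeat oldest frame early on
        return np.concatenate(picks[::-1], axis=-1)  # oldest first, channels stacked
```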
Vectors - Since the model cannot see beyond its little viewport, I gave the model a vector to the closest item, enemy, and projectile onscreen. This made it so the model can shoot enemies across the room outside of its viewport. My new model gives it multiple enemies/items/projectiles and I plan to try to use an attention mechanism as part of the network to see if I can just feed it all of that data.
Information - It also gets a couple of one-off datapoints like whether it currently has sword beams. The new model also gives it a "source" room (to help better understand dungeons where we have to backtrack), and a one-hot encoded objective.
Action Space
My original project just had a few actions: 4 for moving in the cardinal directions and 4 for attacking in each direction (I also added bombs but never spent any time training with them). I had an idea to use masking to help speed up training, i.e. if Link bumps into a wall, don't let him move in that direction again until he moves elsewhere, as the model would often spend an entire memory buffer running headlong straight into a wall before an update... better to do it once and get a huge negative penalty, which is essentially the same result but faster.
Unfortunately SB made it really annoying architecturally to pass that info down to the policy layer. I could have hacked it together, but eventually I just reimplemented PPO and my own neural network so I could properly mask actions in the new version. For example, when we start training a fresh model, it cannot attack when there aren't enemies on screen and I can disallow it from leaving certain areas.
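A common way to implement this kind of invalid-action masking in a hand-rolled PPO is to push the masked logits to -inf before building the action distribution; a generic sketch (not the repo's code):

```python
import torch
from torch.distributions import Categorical

# Set the logits of disallowed actions to -inf before building the
# distribution, so they get zero probability and contribute no gradient.
def masked_action_dist(logits: torch.Tensor, action_mask: torch.Tensor) -> Categorical:
    # logits: (batch, n_actions); action_mask: same shape, True = allowed
    masked_logits = logits.masked_fill(~action_mask, float("-inf"))
    return Categorical(logits=masked_logits)

# usage inside the rollout loop (sketch):
# dist = masked_action_dist(policy_net(obs), mask_from_game_state(obs))
# action = dist.sample(); logprob = dist.log_prob(action)
```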
The new model actually splits swinging the sword at short range vs firing sword beams into two different actions, though I haven't had a chance to fully train with the split yet.
Frameskip/Cooldowns - In the game I don't use a fixed frame skip for actions. Instead I use the internal RAM state of the game to know when Link is animation-locked or not, and only allow the agent to take actions when it's actually possible to give meaningful input to the game. This greatly sped up training. We also force movement to be between tiles on the game map. This means that when the agent decides to move it loses control for longer than a player would... a player can make more split-second decisions. This made it easier to implement movement rewards, though, and might be something to clean up in the future.
Pathfinding - To facilitate rewards, the original version of this project used A* to pathfind from Link to what he should be doing. Here's a video of it in action. This information wasn't given to the model directly; instead, the agent would only be given the rewards if it exactly followed that path or the transposed version of it. It would also pathfind around enemies and not walk through them.
This was a nightmare though. The corner cases were significant, and pushing Link towards enemies but not into them was really tricky. The new version just uses a wavefront algorithm: I calculate a wave from the tiles we want to get to outwards, then make sure we are following the gradient. Also, calculating the A* path around enemies every frame (even with caching) was super slow. Wavefront was faster, especially because I give the new model no special rewards for walking around enemies... faster to compute, and it has to learn from taking damage or not.
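A sketch of the wavefront idea as a BFS distance map (the grid encoding is an assumption; the dense reward then just checks that each move steps down the gradient):

```python
from collections import deque

import numpy as np

# BFS outward from the goal tiles over a walkability grid, producing a
# distance map. Reward ~ dist[old_tile] - dist[new_tile] for each move.
def wavefront(walkable: np.ndarray, goal_tiles: list[tuple[int, int]]) -> np.ndarray:
    dist = np.full(walkable.shape, np.inf)
    queue = deque()
    for gy, gx in goal_tiles:
        dist[gy, gx] = 0
        queue.append((gy, gx))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < walkable.shape[0] and 0 <= nx < walkable.shape[1]
                    and walkable[ny, nx] and dist[ny, nx] == np.inf):
                dist[ny, nx] = dist[y, x] + 1
                queue.append((ny, nx))
    return dist
```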
Either way, both the old and new models successfully learned how to pathfind around danger and obstacles, with or without the cheaty objective vector.
Rewards - I programmed very dense rewards in both the old and new model. At basically every step, the model is getting rewarded or punished for something. I actually have some ideas I can't wait to try out to make the rewards more sparse. Or maybe we start with dense rewards for the first training, then fine-tune the model with sparser rewards. We'll see.
Predicting the Future - Speaking of rewards, one interesting wrinkle is that the agent can do a lot of things that will eventually deal damage, but not on that frame. For example, when Link sets a bomb it takes several seconds before it explodes, killing things. This can be a massive reward or penalty, since he spent an extremely valuable resource but may have done massive damage. PPO and other RL algorithms propagate rewards backwards, of course, but that spike in reward could land on a weird frame where we took damage or moved in the wrong direction.
I probably could have just not solved that problem and let it shake out over time, but instead I used the fact that we are in an emulator to just see what the outcome of every decision is. When planting a bomb, shooting sword beams, etc, we let the game run forward until impact, then rewind time and reward the agent appropriately, continuing on from when we first paused. This greatly speeds up training, even if it's expensive to do this savestate, play forward, restore state.
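Roughly, the rewind trick looks like this (the em.get_state()/set_state() calls reflect my understanding of stable-retro's savestate API, and NOOP_ACTION, bomb_resolved, and damage_dealt are placeholders for game-specific details):

```python
# Sketch only: peek at the outcome of a delayed-effect action, then rewind.
def delayed_action_reward(env, action, max_lookahead=120):
    saved = env.em.get_state()                  # snapshot emulator state
    obs, reward, done, info = env.step(action)  # the frame that plants the bomb
    total_damage = 0
    for _ in range(max_lookahead):              # roll forward until it resolves
        obs, _, done, info = env.step(NOOP_ACTION)      # placeholder no-op action
        total_damage += damage_dealt(info)              # placeholder RAM check
        if bomb_resolved(info) or done:                 # placeholder RAM check
            break
    env.em.set_state(saved)                     # rewind; training continues from here
    env.step(action)                            # re-take the real action for real
    return reward + bomb_reward(total_damage)   # credit lands on the planting frame
```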
Neural Networks - When I first started this project (knowing very little about ML and RL), I thought most of my time would be spent tuning the shape of the neural network we are using. In reality, the default provided by stable-baselines and my eventual reimplementation has been enough to make massive progress. Now that I have a solid codebase, though, I really want to revisit this. I'd like to see if trying CoordConvs and similar networks might make the viewport unnecessary.
Hyperparameters - Setting the entropy coefficient way lower helped a TON in training stable models. My new PPO implementation is way less stable than stable-baselines (ha, imagine that), but still converges most of the time.
Infinite Rewards - As with all reinforcement learning, if you give the model some way to get infinite rewards, it will do just that and nothing else. I spent days, maybe weeks, tweaking reward functions just to get it to train and not find a spot on the wall it could hump for infinite rewards. Even just neutral rewards, like +0.5 for moving forward and -0.5 for moving backwards, would often result in a model that just stepped left, then right, infinitely. There has to be a real reward or punishment (non-neutral) for forward progress.
Debugging Rewards - In fact, building a rewards debugger was the only way I made progress in this project. If you are tackling something this big, do that very early.
Stable-Retro is pretty great - Couldn't be happier with the clean design for implementing emulation for AI.
Torch is Awesome - My early versions heavily used numpy and relied on stable-baselines with its multiproc parallelization support. It worked great. Moving the project over to torch was night and day, though. It gave me so much more flexibility and instant multithreading for matrix operations. I have a pretty beefy computer and I'm almost at the same steps per second as the 20-process stable-retro/numpy setup.
This has already gone on too long. I have some ideas for future projects, but maybe I'll just make them another post when I actually do them.
A special thanks to Brad Flaugher for help with the early version of this, Fiskbit from the Zelda1 speedrunning community for help pulling apart the raw assembly to build this thing, and MatPoliquin for maintaining Stable-Retro.
Happy to answer any questions, really I just love nerding out about this stuff.
Hi all!
I'm working on a project about Multitouch Attribution Modeling using TensorFlow to predict conversion over different channels.
In the project, we are using this dataset (https://www.kaggle.com/code/hughhuyton/multitouch-attribution-modelling). However, we cannot find any formal reference (published paper or something similar) to make a proper citation. I have searched on Google a lot… really, a lot.
Does anyone know the origin of the data, or whether it is referenced somewhere?
Thanks for the help.
For the past few years I've had a job with the official title "machine learning engineer", but as I hunt for other jobs online, I wonder if that's actually accurate. Based on the experience requirements and responsibilities listed, it doesn't seem to match up with what I do.
I have a master's with a focus in ML (though that was pre-LLM-boom, so things have changed a lot) but struggled to find work in my area pertaining to that out of college. Post-COVID, when everyone went remote, I got my current job. In it, I work on a team building and deploying software that utilizes machine learning to accomplish tasks. However, I'm never the one actually building the learning models (there's a researcher on our team who does that); I just create the systems around them. I'm actually pretty happy in my "machine learning adjacent" role, but should I be searching for different job titles to find something similar?
I’ve been doing web development for a year now and want to explore a new domain. My plan is to work for a company for a year or two and then aim for remote jobs in a niche field. However, I’m really confused about what to learn next—AI/ML or Web3.0.
Some people say Web3.0 is dead, has no future, and there are no jobs in that space. On the other hand, AI/ML is the hot topic right now, and I’d love to explore it too.
That said, I’m more inclined towards Web3.0 but don’t really mind learning AI/ML either. I just need some guidance on which direction to take. For context, I’m a second-year undergraduate student in India.
Hi, if anyone who got the invitation mail for the information session wants to prepare for the interview rounds of the Apple AI Residency Program, please DM me. Only serious candidates, please.
Join the upcoming Open Science AI & Data Challenge Virtual Orientation session on January 22nd 2025. Let's work together to cool down our cities and create healthier, more sustainable urban environments. Learn how the 2025 EY Open Science AI & Data Challenge will help tackle the problem of urban heat islands through the application of AI and technology-based solutions. Winners are eligible for cash prizes and attendance at an exciting awards ceremony. Register today!
So I have been working on a deep learning project whose aim is to detect objects. My main goal is to detect plastic in water and pick it up using a conveyor belt attached to a boat. I took code from GitHub, made sufficient changes, and now the model is working, but one problem remains: I have to manually add a photo and rename it to test.jpeg (the name I hard-coded). In my setup the boat has a camera, so how do I make the system take a photo automatically when it detects an object and load it into my already-made model? And for all of this, which development board would be sufficient? I hope someone answers my question 🙂
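For reference, a rough sketch of the automatic-capture loop with OpenCV (detect_objects is a placeholder for the detector you already have):

```python
import cv2

# Replace the manual "rename to test.jpeg" step: grab frames from the boat
# camera in a loop and hand each one straight to the existing detector.
cap = cv2.VideoCapture(0)                  # 0 = first attached camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    detections = detect_objects(frame)     # placeholder: your existing inference function
    if detections:
        cv2.imwrite("capture.jpg", frame)  # keep a copy whenever something is found
cap.release()
```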
I submitted a paper to TPAMI on June 25, 2024. It was a significant extension of our work that was accepted as an oral presentation at AAAI 2023. I know the reviews at TPAMI are rigorous and can take months, but I was just wondering what the longest time it has taken in your experience, since it has been 6 months and 3 days with no news. Also, would the reviewers take into account works that were published after the submission date? I am just worried that with the (understandably) slow reviews, I will be asked by the reviewer why I am not comparing against method XYZ, and asked to compare against said method, which could potentially outperform mine due to how fast the field progresses, and make revision and acceptance complicated.
Hi, I'm putting together a list of papers to recommend to students just starting out in compsci.
What are some must-read papers to give them that are not too deep?
These days all the statistical learning theory is within reach via online courses, but I want them to grow into reading academic papers.
I'm starting off with Ilya Sutskever's reading list.
A brief explanation of why you’re recommending the paper would be welcome too!
Hey all,
I’m curious about the most common embarrassingly parallel tasks you encounter in the wild. In the ML and DS world, I’ve noticed many workflows tend to follow this general pattern:
What workloads do you have that follow this process or something similar? I’ve been tinkering with a cloud abstraction to make large-scale parallel processing easier, and I’m trying to identify common use cases to build tutorials around.
Any ideas, advice, or feedback would be super helpful
The FlashAttention paper shows that "most operations in Transformers are bottlenecked by memory accesses".
The Cut Cross-Entropy paper shows that "The cross-entropy loss is responsible for up to 90% of the memory footprint of modern LLM training".
How do I get this kind of data? Is there a tool or platform that can show the cost by component in an LLM, e.g. embedding, attention, layer normalization, loss computation?
Purpose: once we know that, we will know which part to accelerate first and can pay more attention to it.
Thanks for any suggestions.
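For reference, one way to get a per-component breakdown is the PyTorch profiler with labeled regions; a toy-scale sketch (swap the toy modules for your real model's embedding, blocks, and head):

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# Label each stage of a transformer forward pass and let the profiler report
# time and memory per label. The toy modules stand in for a real LLM.
vocab, d_model, seq = 32_000, 512, 256
embed = nn.Embedding(vocab, d_model)
blocks = nn.ModuleList(nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                       for _ in range(4))
lm_head = nn.Linear(d_model, vocab)

input_ids = torch.randint(0, vocab, (2, seq))
labels = torch.randint(0, vocab, (2, seq))

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("embedding"):
        h = embed(input_ids)
    with record_function("attention_blocks"):
        for block in blocks:
            h = block(h)
    with record_function("loss"):
        logits = lm_head(h)
        loss = nn.functional.cross_entropy(logits.view(-1, vocab), labels.view(-1))

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=15))
```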
Hey y'all! I'm curious: how often are you kicking off new training runs?
Once a week? Twice a week? Everyday?
Would love to hear about your experience!
Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and ⊥Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods.
Paper: https://arxiv.org/abs/2501.04697
(not my paper, just something that was recommended to me)
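For intuition (my own toy example, not code from the paper), here is how scaling logits along a fixed direction saturates float32 softmax and zeroes the cross-entropy gradient, which is my reading of what the abstract calls Softmax Collapse:

```python
import torch

# Scaling logits along their current direction never changes the argmax, but
# in float32 the softmax eventually becomes exactly one-hot, the loss underflows
# to 0.0, and the gradient becomes exactly zero, so learning halts.
logits = torch.tensor([5.0, 2.0, 1.0])
for scale in (1.0, 10.0, 100.0):
    x = (scale * logits).requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(x.unsqueeze(0), torch.tensor([0]))
    loss.backward()
    print(f"scale={scale:5.0f}  loss={loss.item():.3e}  grad={x.grad.tolist()}")
# By scale=100 the loss is 0.0 and the gradient is all zeros.
```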
Hello,
I am conducting research that I plan to submit to the AHLI Conference on Health, Inference, and Learning (CHIL) (H5-index 26, h5-median 43). However, the submission deadline is approaching quickly—February 10.
My advisor has suggested adding other professors as co-authors, but they would primarily review and provide feedback rather than directly contributing to the writing. Therefore, I am reaching out to see if anyone with expertise in time series foundation models would be interested in collaborating as a co-author.
The research involves comparing time series foundation models across different datasets. The experiments are nearly complete, but I need support in writing the theoretical foundation for each model. If you have the necessary knowledge, time, and interest in contributing meaningfully to this work, please send me a private message so we can discuss this opportunity further.
Thank you!
Here is an example which uses simpler language, to test whether it is the confusing language that causes a model to fail.
Edit: Detailed post keeps getting removed. Please ask questions, hope someone finds this tool helpful.
What do you guys use to upload a multimodal dataset?
I want it to be convenient for the people who use it. For text, a Hugging Face dataset is the most convenient solution, but I can't find any comparably convenient solution for a multimodal (image + video + audio + text) dataset.
Thanks in advance.
TL;DR: A reasoning multimodal model built from Qwen2-VL-72B. Surprisingly, beats QVQ in evals.
Paper: https://arxiv.org/pdf/2501.01904
Abstract:
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems.
To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at this https URL.
Highlights:
[W]e obtain approximately 5K long thought instruction instances distilled from two open slow-thinking reasoning systems: DeepSeek-R1-Lite-Preview [2] (abbreviated as R1) and QwQ-32B-preview [3] (abbreviated as QwQ). The statistics of the collected instruction data are categorized by domain as follows: math (3.7K), science (0.9K), code (0.2K) and puzzle (0.1K). [...]
After collecting instruction data for long-form reasoning, we fine-tune the base MLLM to emulate slow-thinking reasoning behavior. [...]
The second approach we explore is the direct distillation of multimodal long thought data from slow-thinking MLLMs (e.g., QVQ). [...]
As another alternative approach, we design a multi-stage tuning method for self-distillation. Specifically, we first fine-tune the selected MLLM (i.e., Qwen2-VL-72B-Instruct) on the textual long thought instruction set DT, obtaining model M0. Next, we use M0 to generate the visual long thought instruction set by self-distillation DSD, which can be subsequently used for fine-tuning the original MLLM.
Visual Highlights:
The key innovation here is combining large language models with image generation to create a system that can "visually think" while solving problems. The approach, called Multimodal Visualization-of-Thought (MVoT), generates relevant visualizations during its reasoning process, similar to how humans might sketch diagrams to better understand a problem.
Main technical points:
Results:
I think this approach could meaningfully improve AI systems' ability to reason about physical and spatial problems. By incorporating visual thinking into the reasoning process, we might see better performance on tasks that humans typically solve through visualization - from physics problems to architectural design. However, the computational overhead of generating images during reasoning could limit practical applications.
I think the most interesting aspect is how this mimics human cognitive processes - we often sketch or visualize to understand complex problems. This could lead to AI systems that reason in more intuitive and interpretable ways.
TLDR: New method combines language models with image generation to create AI systems that can "think visually" while reasoning, showing 12% improvement on visual reasoning tasks.
Full summary is here. Paper here.
I recently took part in a hackathon where I was tasked with achieving high accuracy without using convolutional or transformer models. Even though MLP-Mixers can be argued to be similar to convolutions, they were allowed. Even after a lot of tries I could not get the accuracy above 60 percent. Is there a way, either with MLPs or with anything else, to reach somewhere near the 90s?
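For reference, a minimal MLP-Mixer block in PyTorch looks roughly like this (dimensions are placeholders, not a tuned recipe; you'd patchify the images and stack several of these blocks before a classification head):

```python
import torch
import torch.nn as nn

# Token-mixing MLP over the patch dimension, then channel-mixing MLP over
# features, each with a residual connection.
class MixerBlock(nn.Module):
    def __init__(self, num_patches: int, dim: int, token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                        # x: (batch, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)        # mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

x = torch.randn(8, 64, 128)                      # 8 images, 64 patches, 128-dim
print(MixerBlock(num_patches=64, dim=128)(x).shape)   # torch.Size([8, 64, 128])
```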
Hi there! I've been looking around for an MIT-licensed (commercially usable) model for Text-to-Sound-Effects (Text-to-Audio) and haven't found much besides the usual Stable Audio Open (with its special license).
Do you know of any others?
How do you deal with multiple adapters created for different tasks? I understand that task-ID-based dynamic loading of the appropriate adapter is the obvious approach, but is there a better way? I am asking especially about Whisper.
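One common pattern is to load several LoRA adapters onto a single base model with peft and switch by task id; a rough sketch (API usage as I understand it, so double-check against your peft version; adapter paths and task names are placeholders):

```python
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Keep one copy of the base model in memory and attach multiple adapters to it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "adapters/transcribe_en", adapter_name="transcribe_en")
model.load_adapter("adapters/translate_de", adapter_name="translate_de")

def run(task_id: str, input_features):
    model.set_adapter(task_id)          # activate the adapter for this task
    return model.generate(input_features=input_features)
```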
Hello, everyone
I recently developed a new open-source LLM-driven research automation tool, called AutoResearch. It can automatically conduct various tasks related to machine learning research, the key function is:
Topic-to-Survey Automation - In one sentence, it converts a topic or research question into a comprehensive survey of relevant papers. It generates keywords, retrieves articles for each keyword, merges duplicate articles, ranks articles based on their impacts, summarizes the articles from the topic, method, to results, and optionally checks code availability. It also organizes and zips results for easy access.
When searching for research papers, the results from a search engine can vary significantly depending on the specific keywords used, even if those keywords are conceptually similar. For instance, searching for "LLMs" versus "Large Language Models" may yield different sets of papers. Additionally, when experimenting with new keywords, it can be challenging to remember whether a particular paper has already been checked. Furthermore, the process of downloading papers and organizing them with appropriate filenames can be tedious and time-consuming.
This tool streamlines the entire process by automating several key tasks. It suggests multiple related keywords to ensure comprehensive coverage of the topic, merges duplicate results to avoid redundancy, and automatically names downloaded files using the paper titles for easy reference. Moreover, it leverages LLMs to generate summaries of each paper, sparing researchers the repetitive process of uploading each paper to ChatGPT and conversing with it.
Additionally, there are some basic functionalities:
This tool is still under active development; I will add many more functionalities later on.
I know there are many existing tools for it. But here are the key distinctions and advantages of the tool:
------Here is a quick installation-free Google Colab demo------
Here is the official website of AutoResearch.
Here is the GitHub link to AutoResearch.
------Please star the repository and share it if you like the tool!------
Please DM me or reply in the post if you are interested in collaborating to develop this project!
I'm working on a project where I need to classify text as either NSFW or SFW. I know there are some BERT-based classifiers out there that are specifically trained for this kind of task. I've also seen people using smaller LLMs.
What's the best approach for this? Since the underlying complexity of detecting NSFW text isn't that high, I'm thinking a full-blown LLM is overkill. What are your recommendations?
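For reference, the lightweight route can be as small as this (the model id is a placeholder for whichever NSFW-tuned checkpoint, or your own fine-tune, you settle on):

```python
from transformers import pipeline

# A fine-tuned encoder classifier behind the text-classification pipeline.
# The model id below is a placeholder; substitute the checkpoint you choose.
classifier = pipeline("text-classification", model="your-org/nsfw-text-classifier")

texts = ["totally harmless product review", "something much less safe for work"]
for text, pred in zip(texts, classifier(texts)):
    print(f"{pred['label']:>6} ({pred['score']:.2f})  {text}")
```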
I have a simple dataset that I want to train a prediction model on for a pretty low stakes project (more for fun), but I have no experience training ML models. Simple linear regression didn't have great performance when I tried it and I suspect there is a more complex interaction between the variables.
Training dataset: 25K observations of 5 numerical predictor variables with one numerical outcome variable.
What is the best AutoML platform that I can run this with minimal code, just to see if ML models can perform better than simple regression can? Thanks!
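For reference, a quick pre-AutoML sanity check with scikit-learn might look like this (file and column names are placeholders for the 25K x 6 dataset described above); gradient-boosted trees pick up nonlinear interactions automatically, so they are a fast way to see whether anything beats the linear baseline:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Compare plain linear regression against boosted trees with 5-fold CV.
df = pd.read_csv("data.csv")                     # placeholder file name
X, y = df.drop(columns=["target"]), df["target"] # placeholder column name

for name, model in [("linear", LinearRegression()),
                    ("boosted trees", HistGradientBoostingRegressor())]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>13}: R^2 = {r2.mean():.3f} +/- {r2.std():.3f}")
```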
Abstract:
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Arxiv link: https://arxiv.org/pdf/2501.07542
Hey r/MachineLearning! Last week, Microsoft released Phi-4, a 14B open-source model that rivals OpenAI's GPT-4-o-mini. I managed to find & fix 4 bugs impacting its output quality. You might remember me previously from fixing 8 bugs in Google's Gemma model! :)
I'm going to walk you through how I found & fixed the bugs. Phi-4's benchmarks were amazing, however many users reported weird or just wrong outputs. Since I maintain the open-source project called 'Unsloth' (fine-tuning LLMs 2x faster with 70% less VRAM) with my brother, I firstly tested Phi-4 for inference and found many errors. Our GitHub repo: https://github.com/unslothai/unsloth
This time, the model had no implementation issues (unlike Gemma 2) but did have problems in the model card. For my first inference run, I randomly found an extra EOS token appended, which is obviously incorrect (2 EOS tokens is never a good idea). Also, during more runs, I found there was an extra assistant prompt, which is once again incorrect. And lastly, from past experience with Unsloth's bug fixes, I already knew fine-tuning was wrong when I read the code.
These bugs caused Phi-4 to have some drop in accuracy and also broke fine-tuning runs. Our fixes are now under review by Microsoft to be officially added to Hugging Face. We uploaded the fixed versions to https://huggingface.co/unsloth/phi-4-GGUF
Here’s a breakdown of the bugs and their fixes:
1. Tokenizer bug fixes
The Phi-4 tokenizer interestingly uses <|endoftext|> as the BOS (beginning of sentence), EOS (end of sentence) and PAD (padding) tokens. The main issue is the EOS token is wrong - it should be <|im_end|>. Otherwise, you will get <|im_end|><|endoftext|> in generations.
2. Fine-tuning bug fixes
The padding token should be a designated pad token like in Llama (<|finetune_right_pad_id|>) or we can use an untrained token - for example we use <|dummy_87|>, fixing infinite generations and outputs.
3. Chat template issues
The Phi-4 tokenizer always adds an assistant prompt - it should only do this if prompted by add_generation_prompt. Most LLM serving libraries expect no automatic assistant additions, and this might cause issues during serving.
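For anyone applying the tokenizer-side fixes by hand, the general shape looks like this (a rough sketch; the properly fixed configs are in the uploads linked above):

```python
from transformers import AutoTokenizer

# Point EOS at the chat end-of-turn token and use an untrained token as a
# dedicated pad, per the fixes described above.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
tokenizer.eos_token = "<|im_end|>"     # stop on the chat end-of-turn token
tokenizer.pad_token = "<|dummy_87|>"   # untrained token as a dedicated pad
print(tokenizer.eos_token_id, tokenizer.pad_token_id)
```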
We dive deeper into the bugs in our blog: https://unsloth.ai/blog/phi4
Yes! Our fixed Phi-4 uploads show clear performance gains, with even better scores than Microsoft's original uploads on the Open LLM Leaderboard.
Some redditors even tested our fixes to show greatly improved results in:
We also made a Colab notebook to fine-tune Phi-4 completely for free using Google's free Tesla T4 (16GB) GPUs: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb
Thank you for reading this long post and hope you all found this insightful! If you have any questions, please feel free to ask! :)
How I found the bugs:
I saw <|im_start|>assistant<|im_sep|> being appended at the end even with add_generation_prompt = False in Hugging Face, so I theorized there was a chat template problem. Adding assistant prompts by default can break serving libraries.
I also saw <|endoftext|> being used for the BOS, EOS and PAD tokens, which is a common issue amongst models - I ignored the BOS, since Phi-4 did not have one anyways, but changed the PAD token to <|dummy_87|>. You can select any of the tokens since they're empty and not trained. This counteracts issues of infinite generations during finetuning.
Dear All,
I am a UG student and I want to submit my manuscript to one of these two journals; the work is on the interplay of privacy and explainability in machine learning (I would be more than happy to send you the arXived version on request). I have previously published in a very reputed workshop of EMNLP and came to know that ML nowadays is mostly a conference-centric discipline. I want to know which of these two will be better to submit my work to (due to the length and scope, I am unable to submit to conferences this time). I cannot submit it to TMLR until it's Scopus-indexed, and I am not considering AIJ or the Machine Learning Journal at this moment.
If the paper gets accepted, I want the venue to be at least comparable with a borderline A* conference (in terms of the so-called prestige of the venue). Also, let me know if you have any other suggestions; I am new to journals and I appreciate your opinion.
P.S.: My guide slightly prefers PR to JAIR due to its higher IF, but nevertheless he is open to JAIR or any other Scopus-indexed journal as long as it is comparable with at least a borderline A* or very strong A conference paper, as said.
Hello folks,
Just started contributing to the writing for research; previously I just used to experiment and work on results, tables and plots.
Obviously, using AI to generate content for a paper is unethical and wrong in many respects. But what about using it to correct your grammar and readability? Technically it would also be considered AI-written, but is it okay to do this, at least in the literature review, introduction, and description of the experiment?
To be honest, I like writing, and when I ask AI (ChatGPT and others) to polish it, I see that the result is much easier to read and interpret, which I think is good for the community; on the other hand, it may be considered unethical by many.
When I run an 'AI-text detector' on many of the papers I'm using as references from the last year or so, I usually get a 50-70% score.
What do you all think?