/r/MachineLearning
ml.
Beginners please see learnmachinelearning
Please have a look at our FAQ and Link-Collection
Metacademy is a great resource which compiles lesson plans on popular machine learning topics.
For Beginner questions please try /r/LearnMachineLearning , /r/MLQuestions or http://stackoverflow.com/
For career related questions, visit /r/cscareerquestions/
AMAs:
Pluribus Poker AI Team 7/19/2019
DeepMind AlphaStar team (1/24//2019)
Libratus Poker AI Team (12/18/2017)
DeepMind AlphaGo Team (10/19/2017)
The MalariaSpot Team (2/6/2016)
OpenAI Research Team (1/9/2016)
Andrew Ng and Adam Coates (4/15/2015)
Related Subreddit :
/r/MachineLearning
Hello r/MachineLearning,
I'm excited to share a project that was initially intended to integrate automated AI maintenance features for Windows into an application I was building to sell commercially, but has now been open-sourced for community use and development. The project focuses on automated data sorting and could serve as a base for more advanced machine learning applications.
You can explore the project here: [NazTech Automated Data Sorting Tools](https://github.com/nazpins/naztech-automated-data-sorting-tools)
These tools are designed to quickly automate sorting large data dumps, employing python algorithms suitable for handling large datasets. While the project is no longer in active development from my end, the python scripts are functional and open for any adaptations or enhancements you might find interesting for your own ML projects. I started building the framework for the actual application but due to time constraints and a lot going on irl, I haven't had time to continue working on it.
I am happy however to share these tools with the community and hopefully they can be beneficial to someone else down the road.
Cheers!
Hi All,
I am brainstorming some kind of a nomenclature for our team so that theres a standard way of naming ML models like their pickle files . Any inputs will be appreciated.
thanks
Hi r/MachineLearning community,
I'm currently working on the second part of my thesis focused on Named Entity Recognition (NER) for detecting Adverse Drug Events (ADE) in clinical texts. In my first thesis project, I tried to replicate a paper but had to pivot to the n2c2 dataset, which led to challenges in model performance.
I've fine-tuned a DeBERTa model with standard practices, but I'm struggling with achieving high accuracy, particularly with precision and recall. This is my first deep dive into a thesis and the world of NLP, and any guidance would be immensely appreciated.
Also, any common pitfalls for thesis work or useful resources on this topic would be extremely helpful. I'm eager to learn from the community and improve my research.
Thank you so much for your time!
Through our experiments, we found that LLM agents have a certain ability to understand probability distributions, the LLM agent's sampling ability for probability distributions is lacking and it is difficult to give a behavior sequence that conforms to a certain probability distribution through LLMs alone.
We are looking forward to your thoughts, critiques, and discussions on this topic. Full Paper & Citation: You can access the full paper https://arxiv.org/abs/2404.09043. Please cite our work if it contributes to your research.
Main aim: Style transfer between two discrete timeseries signals.
Here are the details: Dataset: Discrete time series. 1700 rows, with 97 percent of it with zeroes. Cannot remove these zeroes as it means something. Values ranging from 0-32 for one of the features in Domain A needs to translated to another feature with same range in domain B. Another feature from 0-5000 from domain A, translated to a different domain B with same range. I can recreate the same dataset multiple times with small variations, so we can have larger datasets. I would create sequences of size 20 or 30 and batch: 32 or 64 initially.
Generator Network: A simple encoder with linear layer first hidden size:16 , relu, 2nd linear layer :8 and relu again . A symmetric Decoder .
Discriminator: 2 linear layers with hidden size 8 and leaky Relu between them. And sigmoid as final layer. Loss function : BCEloss . Also experimented BCE + MSE loss for generator.
Training: I'm using pytorch. Only trained with one feature/signal and tried to generate this feature from noise. Didn't move to cycle consistency yet. With the small dataset training, the discriminator becomes too strong, I even tried to set reduce the learning rate for discriminator as 0.0001 and generator as 0.01 , it didn't work. Tried to add/complicate the layer of generator, still didn't work. Tried to train discriminator every 10th epoch, while the generator trained more. Didn't work. Also tried to normalize the data.
I want to explore Adversarial autoencoder /cycle Gan , but the generator is unable to learn anything with vanilla GAN as well. Can someone help or give me some ideas on what I can do ? Thanks
Hi guys,
I just tried to run Llama 3 on my Colab(free version) and seems that I ran out of the memory:
OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 9.06 MiB is free. Process 8863 has 14.74 GiB memory in use. Of the allocated memory 14.60 GiB is allocated by PyTorch, and 22.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Anyone have the same experience? Has anyone managed to run Llama 3 on free version of Colab (or similar platform)?
Thanks!
Hi everyone,
I'm trying to build my own test set in order to make an initial fast evaluation of the huge number of models that pop up on huggingface.co every week, and I'm searching for a starting point or suggestions.
If someone would share some questions that they use to test LLM abilities, even as high-level concepts, or simply give me some tips or suggestions, I would really appreciate that!
Thanks in advance to everyone for any kind of reply."
https://arxiv.org/abs/2404.16767
New deep RL algorithm that works with both language models and diffusion models.
I am attempting to train CLIP from scratch. However, there is a lack of available datasets. The one dataset that seemed quite diverse and clean seems to be taken down (laion-400m
). Looking at HF datasets, these are the two datasets that are promising, but wondering if there has been anything better/ cleaner.
TIA
Everywhere I look for the answer to this question, the responses do little more than anthropomorphize the model. They invariably make claims like:
Without examples, the model must infer context and rely on its knowledge to deduce what is expected. This could lead to misunderstandings.
One-shot prompting reduces this cognitive load by offering a specific example, helping to anchor the model's interpretation and focus on a narrower task with clearer expectations.
The example serves as a reference or hint for the model, helping it understand the type of response you are seeking and triggering memories of similar instances during training.
Providing an example allows the model to identify a pattern or structure to replicate. It establishes a cue for the model to align with, reducing the guesswork inherent in zero-shot scenarios.
These are real excerpts, btw.
But these models don’t “understand” anything. They don’t “deduce”, or “interpret”, or “focus”, or “remember training”, or “make guesses”, or have literal “cognitive load”. They are just statistical token generators. Therefore pop-sci explanations like these are kind of meaningless when seeking a concrete understanding of the exact mechanism by which in-context learning improves accuracy.
Can someone offer an explanation that explains things in terms of the actual model architecture/mechanisms and how the provision of additional context leads to better output? I can “talk the talk”, so spare no technical detail please.
I could make an educated guess - Including examples in the input which use tokens that approximate the kind of output you want leads the attention mechanism and final dense layer to weight more highly tokens which are similar in some way to these examples, increasing the odds that these desired tokens will be sampled at each generation step; like fundamentally I’d guess a similarity/distance thing, where explicitly exemplifying the output I want increases the odds that the output get will be similar to it - but I’d prefer to hear it from someone else with deep knowledge of these models and mechanisms.
In a video about "A little guide to building Large Language Models in 2024" at 41:38 the author starts to talk about the limits of how big the batch size can be.
Well, if you start to have a very large batch size, the model for each optimization step makes less efficient use of each token, because the batch size is so big that each token is kind of washed out in the optimization step. And roughly, it's a little bit hard to measure this limit, which we call the critical batch size.
I thought that the bigger batch size is always better for training LLMs, because:
I found a paper that introduces the "critical batch size" concept - An Empirical Model of Large-Batch Training. It mostly talks about the speed/efficiency tradeoff for data parallelism of large batch sizes. Also another highly cited paper Scaling Laws for Neural Language Models:
Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency
So I don't really understand what author of the video meant by saying:
each token is kind of washed out in the optimization step
Are there any other issues with large batch sizes other than infrastructure, hardware or implementation limits?
Hi guys,I have a problem for which I am not sure what would be the best approach (and I cannot really find any relevant literature).I have a small dataset (~100 measurements like the one attached) of a sensor value, for which I want to predict a certain relevant event. Here t_0 is the relevant moment in time which I want to predict. The problem is, that I need to trigger something when the event is reached. If I need too long to trigger after it was reached, it will not be a positive outcome.My initial idea was to basically chunk the time series before the event, and try to predict the remaining time from that segment until the event is reached. When it is below a threshold, I can trigger my action. I wanted to have a look at e.g. XGBoost and feed it small chunks of the timeseries and run this process continuously. I am not really sure if that is the correct approach there.Is that a known problem? What would be a good name for this problem to search for literature? Do you have suggestions how to solve it?
Thanks.
"[P]"
How to calculate IoU score on yolo v8. I didn't find any inbuilt function to do that. I have trained yolov8n on my custom dataset for object detection which has 1 class.
I'm starting to do some research for my graduation and I'm looking for some papers on text to speech synthesis. I'm doing some reproductions on a paper I found to be interesting called Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Prediction. Basically it's a model that receives text, turns it into a spectogram and the spectogram is used to build the audio file. Since I'm still at the start of reproduction, are there papers that you guys would recommend looking into? Did you work with speech synthesis (TTS)? What are good refferences I should look into?
I saw this post over here https://www.reddit.com/r/MachineLearning/comments/nxkuvn/d_what_is_actually_the_state_of_the_art_in_text/
But it already has 3 years. Maybe there is something newer than FastSpeech2?
[D] Hey everyone, do you have any suggestions on what's the better option and the most effective way to dive into a hot topic these days?
I stumbled upon a repository for Federated Learning at:
But can't seem to find anything similar for Meta Learning. Any advice on how to pick my PhD topic would be greatly appreciated!
My friend implemented the method of Multihead Mixture of Experts in this arxiv paper https://arxiv.org/pdf/2404.15045 and he wanted me to share it with you!
https://github.com/lhallee/Multi_Head_Mixture_of_Experts__MH-MOE
Try it out. Let me know what you think and I will pass it on to him.
Hello guys, I've been working on a sequential labelling using DNA sequences as inputs. Lately there have been 2 foundation models released HyenaDNA (Based on Hyena operator) and Caduceus (based on mamba), I used both pretrained and from scratch models and performances are terrible even with pretrained ones.
Does anyone have experience with this type of models, and what are the potential causes for performance drop ? I am literally getting zero performance for the minority class ? Does mamba deal poorly with class imbalance ?
This is a small drug toxicity prediction GNN model I wrote/trained
repo: https://github.com/Null-byte-00/toxicity-prediction-gnn
I have a use case to use function calling within my application, I am confused whether to choose OpenAI function calling or use Bedrock Agents coupled with Lambda functions for this, which is the best approach? Or help me to choose between these two.
ML is very good at solving a niche set of problems, but most of the technical nuances are lost on tech bros and managers. What are some problems you have been told to solve which would be impossible (no data, useless data, unrealistic expectations) or a misapplication of ML (can you have this LLM do all of out accounting).
Does anyone know what datasets are out there for causal inference? I’d like to explore methods in the doubly robust ML literature, and I’d like to compensate my learning by working on some datasets and learn the econML software.
Does anyone know of any datasets, specifically in the context of marketing/pricing/advertising that would be good sources to apply causal inference techniques? I’m open to other datasets as well.
Dreambooth the MusicGen model suite on small consumer GPUs, in a matter of minutes, using this repository: https://github.com/ylacombe/musicgen-dreamboothing
The aim of this project is to provide tools to easily fine-tune and dreambooth the MusicGen model suite, with little data and to leverage a series of optimizations and tricks to reduce resource consumption, thanks to LoRA adaptors.
For example, the model can be fine-tuned on a particular music genre or artist to give a checkpoint that generates in that given style. The aim is also to easily share and build on these trained checkpoints,
Specifically, this involves:
using as few data and resources as possible. We're talking fine-tuning with as little as 15mn on an A100 and as little as 10GB to 16GB of GPU utilization.
easily share and build models thanks to the Hugging Face Hub.
optionally, generate automatic music descriptions
optionally, training MusicGen in a Dreambooth-like fashion, where one key-word triggers generation in a particular style
Wandb example of what the training run looks like here.
Paper: https://arxiv.org/abs/2402.11131
Abstract:
Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.
Paper: https://arxiv.org/abs/2404.08698
Abstract:
While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.
I just wanted to remind or introduce newcomers to this paper. I think this discussion should be re-opened since many people here actually do influence the trends of the field.
https://arxiv.org/pdf/1807.03341
On a personal note (feel free to skip):
Specifically, I want to point out the issue of "Mathiness", as it seems like this problem got way out of hand and most best papers of conferences suffer from it (one of the most important ML papers tried to be mathy and introduced a big mistake, I believe other papers have bigger issues but no one bothers to check it).
So here are my personal points to academics and researchers:
In this video about Times Series, https://www.youtube.com/watch?v=0zpg9ODE6Ww, does anyone have an idea about the Python package used to create the animated plots showed at the 34th minute of the video ?
Thank for your help.
Following the review phase(old post), creating a thread for others like me waiting for the decision.
All the best!
It seems to me that thanks to the residual path the gradient that flows to each layer is the same regardless of the transformer layer/block. Example:
ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) ...)
Since the input to ProjectionAndCost is just sum of outputs from all layers and initial embeddings then the gradient that comes to the layer L1 is the same as the gradient that comes to L2 or L3.
So we could:
We can't train first L2 then L1, because the input to L2 depends on L1, but we could train lower layers first then gradually add and train deeper layers. Is there any problem with that approach?
So there's this relatively new "BigDL" project" (https://bigdl.readthedocs.io/en/latest/ ), which is for Intel CPUs and Intel GPUs, but there's no mention anywhere of it working for NVIDIA GPUs.
Is there any equivalent library for NVIDIA GPUs on a spark cluster?
Hi. Here is the new's question.
The biggest concern is the rate of generation. I want to generate about 5 seconds of voice in about 100ms. I want to know which model performs best(SOTA) under those conditions.
Which model is best for me?
I think "styletts2" is best.
If you have any relevant experience or know any other information, I would really appreciate your help.
Thank you !