/r/MachineLearning

2,886,765 Subscribers

1

Open-Sourced: Automated Data Sorting Tools [P]

Hello r/MachineLearning,

I'm excited to share a project that was initially intended to integrate automated AI maintenance features for Windows into an application I was building to sell commercially, but has now been open-sourced for community use and development. The project focuses on automated data sorting and could serve as a base for more advanced machine learning applications.

You can explore the project here: [NazTech Automated Data Sorting Tools](https://github.com/nazpins/naztech-automated-data-sorting-tools)

These tools are designed to quickly automate sorting large data dumps, employing Python algorithms suitable for handling large datasets. While the project is no longer in active development on my end, the Python scripts are functional and open for any adaptations or enhancements you might find interesting for your own ML projects. I started building the framework for the actual application, but due to time constraints and a lot going on IRL, I haven't had time to continue working on it.

I am happy, however, to share these tools with the community, and hopefully they can be beneficial to someone else down the road.

Cheers!

0 Comments
2024/04/26
17:29 UTC

4

[D] What nomenclature do you follow for naming ML models?

Hi All,

I am brainstorming some kind of nomenclature for our team so that there's a standard way of naming ML models and their artifacts, such as pickle files. Any input will be appreciated.
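
For context, here is a minimal sketch of the kind of scheme I'm considering, where project, task, algorithm, version, and training date are encoded in the filename (all field names below are placeholders; richer metadata like metrics and hyperparameters would live in a registry rather than the filename):

```python
from datetime import date

def model_filename(project: str, task: str, algo: str, version: str, ext: str = "pkl") -> str:
    """Build an artifact name like 'churn_clf_xgboost_v2_2024-04-26.pkl'.
    Every field here is a placeholder; adapt to your team's metadata."""
    return f"{project}_{task}_{algo}_{version}_{date.today().isoformat()}.{ext}"

print(model_filename("churn", "clf", "xgboost", "v2"))
```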

thanks

1 Comment
2024/04/26
16:42 UTC

1

[D] Advice Needed: Enhancing NER for ADE Detection in Clinical Texts (Thesis Work)

Hi r/MachineLearning community,

I'm currently working on the second part of my thesis focused on Named Entity Recognition (NER) for detecting Adverse Drug Events (ADE) in clinical texts. In my first thesis project, I tried to replicate a paper but had to pivot to the n2c2 dataset, which led to challenges in model performance.

I've fine-tuned a DeBERTa model using standard practices, but I'm struggling to achieve strong performance, particularly in precision and recall. This is my first deep dive into a thesis and the world of NLP, and any guidance would be immensely appreciated.
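
For context, my setup is essentially the standard Hugging Face token-classification recipe. A minimal sketch of the model side (the checkpoint name and label set here are illustrative, not my exact configuration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "microsoft/deberta-v3-base"              # example checkpoint, not necessarily the one I used
labels = ["O", "B-ADE", "I-ADE", "B-Drug", "I-Drug"]  # illustrative BIO tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

text = "Patient developed a rash after starting amoxicillin."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, num_labels)
preds = logits.argmax(-1)[0].tolist()      # per-token label ids (untrained head here)
print([labels[i] for i in preds])
```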

Also, any common pitfalls for thesis work or useful resources on this topic would be extremely helpful. I'm eager to learn from the community and improve my research.

Thank you so much for your time!

0 Comments
2024/04/26
16:29 UTC

7

[R] Large language models may not be able to sample behavioral probability distributions

Through our experiments, we found that while LLM agents have a certain ability to understand probability distributions, their ability to sample from them is lacking: it is difficult to obtain a behavior sequence that conforms to a given probability distribution through LLMs alone.
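
As a rough illustration of the kind of check involved (a simplified sketch, not our exact experimental protocol; `llm_choose` is a placeholder for the actual LLM call, replaced here by direct sampling so the snippet runs):

```python
import random
from collections import Counter
from scipy.stats import chisquare

target = {"A": 0.7, "B": 0.2, "C": 0.1}   # distribution the agent is asked to follow

def llm_choose() -> str:
    # placeholder for an actual LLM call; here we just sample the target so the demo runs
    return random.choices(list(target), weights=target.values())[0]

n = 1000
counts = Counter(llm_choose() for _ in range(n))
observed = [counts[a] for a in target]
expected = [p * n for p in target.values()]
print(chisquare(f_obs=observed, f_exp=expected))  # low p-value => sampled behavior deviates from target
```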

We are looking forward to your thoughts, critiques, and discussions on this topic. Full paper: https://arxiv.org/abs/2404.09043. Please cite our work if it contributes to your research.

https://preview.redd.it/ai7uks7nluwc1.png?width=935&format=png&auto=webp&s=891dd57ef50d1ee99b1a8b2372b9a460397754d6

0 Comments
2024/04/26
16:11 UTC

1

[D] GAN/Adversary Autoencoder/Cycle GAN

Main aim: Style transfer between two discrete timeseries signals.

Here are the details. Dataset: discrete time series, 1700 rows, with 97 percent of the values being zeroes. I cannot remove these zeroes because they mean something. One feature in Domain A, with values ranging from 0-32, needs to be translated to another feature with the same range in Domain B. Another feature in Domain A, ranging from 0-5000, needs to be translated to a corresponding Domain B feature with the same range. I can recreate the same dataset multiple times with small variations, so we can have larger datasets. I would create sequences of size 20 or 30 and use a batch size of 32 or 64 initially.

Generator network: a simple encoder with a first linear layer of hidden size 16, ReLU, a second linear layer of hidden size 8, and ReLU again, followed by a symmetric decoder.

Discriminator: two linear layers with hidden size 8 and a LeakyReLU between them, with a sigmoid as the final layer. Loss function: BCE loss. I also experimented with BCE + MSE loss for the generator.
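
For concreteness, a minimal PyTorch sketch of the two networks described above (the input dimension is a placeholder; I use 1 for a single feature per timestep):

```python
import torch.nn as nn

class Generator(nn.Module):
    """Encoder 16 -> 8 with ReLU, symmetric decoder back to the input size."""
    def __init__(self, in_dim: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                                     nn.Linear(16, 8), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(),
                                     nn.Linear(16, in_dim))
    def forward(self, x):
        return self.decoder(self.encoder(x))

class Discriminator(nn.Module):
    """Two linear layers (hidden size 8), LeakyReLU between them, sigmoid output."""
    def __init__(self, in_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 8), nn.LeakyReLU(0.2),
                                 nn.Linear(8, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)
```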

Training: I'm using PyTorch. I only trained with one feature/signal and tried to generate this feature from noise; I didn't move to cycle consistency yet. With the small dataset, the discriminator becomes too strong during training. I even tried reducing the learning rate for the discriminator to 0.0001 while keeping the generator at 0.01; it didn't work. I tried adding layers to (complicating) the generator; still didn't work. I tried training the discriminator only every 10th epoch while the generator trained more; didn't work. I also tried normalizing the data.

I want to explore an adversarial autoencoder / CycleGAN, but the generator is unable to learn anything even with a vanilla GAN. Can someone help or give me some ideas on what I can do? Thanks

0 Comments
2024/04/26
16:09 UTC

0

GPU out of memory error message on Colab with Llama 3 [D]

Hi guys,

I just tried to run Llama 3 on Colab (free version) and it seems that I ran out of memory:

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 9.06 MiB is free. Process 8863 has 14.74 GiB memory in use. Of the allocated memory 14.60 GiB is allocated by PyTorch, and 22.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Has anyone had the same experience? Has anyone managed to run Llama 3 on the free version of Colab (or a similar platform)?
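
For reference, my understanding is that loading the 8B model in 4-bit via bitsandbytes should fit within the free T4's ~15 GB; a minimal sketch of that approach (untested on my side):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated model; requires accepting the license on HF

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # in 4-bit the 8B weights are roughly 5-6 GB instead of ~15 GB in fp16
)
```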

Thanks!

0 Comments
2024/04/26
15:43 UTC

1

[D] Overwhelming LLM release rate: Seeking suggestions for building a test set to evaluate LLMs

Hi everyone,

I'm trying to build my own test set in order to make an initial fast evaluation of the huge number of models that pop up on huggingface.co every week, and I'm searching for a starting point or suggestions.

If someone would share some questions that they use to test LLM abilities, even as high-level concepts, or simply give me some tips or suggestions, I would really appreciate that!
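
To make the question concrete, this is the kind of minimal harness I have in mind (the `generate` function is a placeholder for whatever backend serves the model, and the items are only examples):

```python
test_set = [
    {"prompt": "What is the capital of Australia?", "must_contain": ["Canberra"]},
    {"prompt": "Write a Python function that reverses a string.", "must_contain": ["def", "[::-1]"]},
    {"prompt": "Is 91 a prime number? Answer yes or no.", "must_contain": ["no"]},
]

def generate(prompt: str) -> str:
    # placeholder: swap in a call to your model or API of choice
    return "..."

def evaluate() -> float:
    hits = 0
    for item in test_set:
        answer = generate(item["prompt"]).lower()
        hits += all(kw.lower() in answer for kw in item["must_contain"])
    return hits / len(test_set)

print(evaluate())
```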

Thanks in advance to everyone for any kind of reply.

1 Comment
2024/04/26
14:43 UTC

4

[R] Reinforcement Learning via Regressing Relative Rewards

https://arxiv.org/abs/2404.16767

New deep RL algorithm that works with both language models and diffusion models.

1 Comment
2024/04/26
13:10 UTC

4

[D] Clean caption dataset

I am attempting to train CLIP from scratch. However, there is a lack of available datasets. The one dataset that seemed quite diverse and clean appears to have been taken down (LAION-400M). Looking at HF datasets, these are the two that seem promising, but I'm wondering whether there is anything better/cleaner.

  • conceptual captions: uses alt-text.
  • red_caps: Reddit threads, but these are mostly the first comment on the image rather than an actual caption.

TIA

0 Comments
2024/04/26
11:09 UTC

88

[D] LLMs: Why does in-context learning work? What exactly is happening from a technical perspective?

Everywhere I look for the answer to this question, the responses do little more than anthropomorphize the model. They invariably make claims like:

Without examples, the model must infer context and rely on its knowledge to deduce what is expected. This could lead to misunderstandings.

One-shot prompting reduces this cognitive load by offering a specific example, helping to anchor the model's interpretation and focus on a narrower task with clearer expectations.

The example serves as a reference or hint for the model, helping it understand the type of response you are seeking and triggering memories of similar instances during training.

Providing an example allows the model to identify a pattern or structure to replicate. It establishes a cue for the model to align with, reducing the guesswork inherent in zero-shot scenarios.

These are real excerpts, btw.

But these models don’t “understand” anything. They don’t “deduce”, or “interpret”, or “focus”, or “remember training”, or “make guesses”, or have literal “cognitive load”. They are just statistical token generators. Therefore pop-sci explanations like these are kind of meaningless when seeking a concrete understanding of the exact mechanism by which in-context learning improves accuracy.

Can someone offer an explanation that explains things in terms of the actual model architecture/mechanisms and how the provision of additional context leads to better output? I can “talk the talk”, so spare no technical detail please.

I could make an educated guess: including examples in the input which use tokens that approximate the kind of output you want leads the attention mechanism and final dense layer to weight more highly tokens which are similar in some way to those examples, increasing the odds that these desired tokens will be sampled at each generation step. Fundamentally, I'd guess it's a similarity/distance thing, where explicitly exemplifying the output I want increases the odds that the output I get will be similar to it. But I'd prefer to hear it from someone else with deep knowledge of these models and mechanisms.
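
To make that guess concrete, here is a purely toy illustration of the similarity intuition (random vectors and plain dot-product attention; not a claim about real transformer internals):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16
# "context" token representations: a few generic tokens plus two tokens taken
# from an in-context example that resemble the desired output
generic = rng.normal(size=(4, d))
desired = rng.normal(size=d)
example = np.stack([desired + 0.1 * rng.normal(size=d),
                    desired + 0.1 * rng.normal(size=d)])
keys = np.vstack([generic, example])

query = desired + 0.2 * rng.normal(size=d)   # current position "looking for" the target pattern
weights = softmax(keys @ query / np.sqrt(d))
print(weights.round(3))   # the last two (example) tokens dominate the attention distribution
```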

43 Comments
2024/04/26
11:01 UTC

13

[D] Critical batch size and LLMs

In the video "A little guide to building Large Language Models in 2024", at 41:38 the author starts to talk about the limits of how big the batch size can be.

Well, if you start to have a very large batch size, the model for each optimization step makes less efficient use of each token, because the batch size is so big that each token is kind of washed out in the optimization step. And roughly, it's a little bit hard to measure this limit, which we call the critical batch size.

I thought that a bigger batch size is always better for training LLMs, because:

  1. It better approximates the true gradient.
  2. We go through the dataset faster.
  3. To my knowledge the limits are only infrastructure, hardware, communication overhead etc.

I found a paper that introduces the "critical batch size" concept, An Empirical Model of Large-Batch Training. It mostly talks about the speed/efficiency tradeoff of data parallelism with large batch sizes. Another highly cited paper, Scaling Laws for Neural Language Models, states:

Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency
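
From the McCandlish et al. paper, my understanding is that the critical batch size is roughly the gradient noise scale B_simple = tr(Sigma) / |G|^2, where Sigma is the per-example gradient covariance and G the full-batch gradient. A quick sketch of how I'd estimate it on a toy model (my reading of the paper, not a verified implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)

def flat_grad(x, t):
    model.zero_grad()
    loss_fn(model(x), t).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

per_example = torch.stack([flat_grad(X[i:i+1], y[i:i+1]) for i in range(len(X))])
G = per_example.mean(dim=0)                      # full-batch gradient
trace_sigma = per_example.var(dim=0).sum()       # tr(Sigma), per-example gradient covariance trace
b_simple = trace_sigma / G.pow(2).sum()
print(f"estimated critical batch size ~ {b_simple.item():.1f}")
```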

So I don't really understand what the author of the video meant by saying:

each token is kind of washed out in the optimization step

Are there any other issues with large batch sizes other than infrastructure, hardware or implementation limits?

6 Comments
2024/04/26
09:21 UTC

0

[Discussion] Time series regression problem?

Hi guys, I have a problem for which I am not sure what the best approach would be (and I cannot really find any relevant literature). I have a small dataset (~100 measurements like the one attached) of a sensor value, for which I want to predict a certain relevant event. Here t_0 is the relevant moment in time which I want to predict. The problem is that I need to trigger something when the event is reached; if I take too long to trigger after it was reached, it will not be a positive outcome.

My initial idea was to basically chunk the time series before the event and try to predict the remaining time from that segment until the event is reached. When the prediction is below a threshold, I can trigger my action. I wanted to have a look at e.g. XGBoost, feed it small chunks of the time series, and run this process continuously. I am not sure whether that is the correct approach.

Is this a known problem? What would be a good name for this problem to search for literature? Do you have suggestions on how to solve it?
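
A sketch of the chunking idea described above, on synthetic data (XGBRegressor is just the first regressor I'd try, and the signal generator is a stand-in for my real sensor traces):

```python
import numpy as np
from xgboost import XGBRegressor

def make_windows(signal: np.ndarray, t0: int, window: int = 30):
    """Slide a window over the part of the series before the event at index t0;
    the target is the remaining time until the event."""
    X, y = [], []
    for end in range(window, t0 + 1):
        X.append(signal[end - window:end])
        y.append(t0 - end)          # remaining samples until the event
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
X_all, y_all = [], []
for _ in range(100):                              # ~100 measurements, as in my dataset
    t0 = int(rng.integers(200, 300))
    sig = np.cumsum(rng.normal(size=t0 + 50))     # placeholder for the real sensor trace
    X, y = make_windows(sig, t0)
    X_all.append(X); y_all.append(y)

X_all, y_all = np.vstack(X_all), np.concatenate(y_all)
model = XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X_all, y_all)
# at inference: trigger the action once the predicted remaining time drops below a threshold
```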

Thanks.

https://preview.redd.it/l59gsmswkswc1.png?width=1303&format=png&auto=webp&s=f9e83d7b87ce5227378de6a0805a916fd4f93314

0 Comments
2024/04/26
09:21 UTC

0

[P] IoU Accuracy on YOLOv8 object detection

"[P]"

How do I calculate the IoU score with YOLOv8? I didn't find any built-in function to do that. I have trained YOLOv8n on my custom dataset for object detection, which has one class.
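
In case it helps frame the question, this is the manual computation I'd fall back on: plain IoU between boxes in (x1, y1, x2, y2) format, e.g. comparing predictions from `results[0].boxes.xyxy` against ground-truth boxes. This is my own helper, not an Ultralytics built-in:

```python
import numpy as np

def box_iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(box_iou(np.array([10, 10, 50, 50]), np.array([20, 20, 60, 60])))  # ~0.39
```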

11 Comments
2024/04/26
07:04 UTC

2

[D] What is the state of the art for text-to-speech synthesis?

I'm starting to do some research for my graduation and I'm looking for some papers on text-to-speech synthesis. I'm working on reproducing a paper I found interesting, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Basically, it's a model that receives text and turns it into a spectrogram, and the spectrogram is used to build the audio file. Since I'm still at the start of the reproduction, are there papers you would recommend looking into? Have you worked with speech synthesis (TTS)? What are good references I should look into?

I saw this post over here https://www.reddit.com/r/MachineLearning/comments/nxkuvn/d_what_is_actually_the_state_of_the_art_in_text/

But it's already 3 years old. Maybe there is something newer than FastSpeech 2?

10 Comments
2024/04/26
04:24 UTC

5

[D] Meta-learning vs Federated Learning?

Hey everyone, do you have any suggestions on which is the better option and the most effective way to dive into a hot topic these days?
I stumbled upon a repository for Federated Learning at: 

But can't seem to find anything similar for Meta Learning. Any advice on how to pick my PhD topic would be greatly appreciated!

3 Comments
2024/04/26
03:53 UTC

12

[P] Multihead Mixture of Experts - Implementation of dense subtoken routing suggested in https://arxiv.org/pdf/2404.15045

My friend implemented the Multi-Head Mixture-of-Experts method from this arXiv paper https://arxiv.org/pdf/2404.15045 and he wanted me to share it with you!

https://github.com/lhallee/Multi_Head_Mixture_of_Experts__MH-MOE
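
Roughly, the idea as I understand it: each token is split into several sub-tokens along the hidden dimension, each sub-token is routed to an expert, and the expert outputs are merged back into the token. A simplified toy sketch of that routing (my own illustration, not the code in the repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMHMoE(nn.Module):
    """Simplified sketch of multi-head (sub-token) routing; not the repo's actual code."""
    def __init__(self, d_model=64, n_heads=4, n_experts=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_sub = n_heads, d_model // n_heads
        self.gate = nn.Linear(self.d_sub, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_sub, 2 * self.d_sub), nn.GELU(),
                          nn.Linear(2 * self.d_sub, self.d_sub))
            for _ in range(n_experts))

    def forward(self, x):                                   # x: (batch, seq, d_model)
        b, s, d = x.shape
        sub = x.view(b * s * self.h, self.d_sub)            # split tokens into sub-tokens
        probs = F.softmax(self.gate(sub), dim=-1)
        top1 = probs.argmax(dim=-1)                         # top-1 routing per sub-token
        out = torch.zeros_like(sub)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(sub[mask]) * probs[mask, i:i + 1]
        return out.view(b, s, d)                            # merge sub-tokens back into tokens

print(TinyMHMoE()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```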

Try it out. Let me know what you think and I will pass it on to him.

11 Comments
2024/04/25
22:00 UTC

18

[D] HyenaDNA and Mamba are not good at sequential labelling?

Hello guys, I've been working on a sequential labelling task using DNA sequences as inputs. Lately, two foundation models have been released: HyenaDNA (based on the Hyena operator) and Caduceus (based on Mamba). I used both pretrained and from-scratch models, and performance is terrible even with the pretrained ones.

Does anyone have experience with this type of model, and what are the potential causes for the performance drop? I am literally getting zero performance on the minority class. Does Mamba deal poorly with class imbalance?
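
One thing on my list to try is weighting the loss by inverse class frequency; a minimal sketch of that idea (generic PyTorch, not specific to HyenaDNA or Caduceus):

```python
import torch
import torch.nn as nn

# Toy label distribution: class 1 is the rare minority class.
labels = torch.tensor([0] * 950 + [1] * 50)
counts = torch.bincount(labels).float()
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency class weights
print(weights)                                    # minority class gets a much larger weight

loss_fn = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
logits = torch.randn(1000, 2)                     # (num_positions, num_classes) from any token classifier
print(loss_fn(logits, labels).item())
```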

9 Comments
2024/04/25
20:02 UTC

10

[P] Drug toxicity prediction model with graph-based neural networks

This is a small drug toxicity prediction GNN model I wrote/trained.

repo: https://github.com/Null-byte-00/toxicity-prediction-gnn
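
For anyone curious about the general shape of such a model, here is a minimal PyTorch Geometric sketch of a molecular graph classifier (a generic example, not the exact architecture in the repo):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class ToxicityGNN(torch.nn.Module):
    """Generic molecular graph classifier; not the repo's exact architecture."""
    def __init__(self, num_node_features: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)   # toxic / non-toxic logit

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)           # one embedding per molecule
        return self.head(x)
```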

1 Comment
2024/04/25
19:10 UTC

0

[D] Does anyone use Bedrock Agents for function calling?

I have a use case for function calling within my application. I am unsure whether to choose OpenAI function calling or Bedrock Agents coupled with Lambda functions. Which is the better approach, or how should I choose between the two?

3 Comments
2024/04/25
19:07 UTC

237

[D] What are your horror stories from being tasked with impossible ML problems

ML is very good at solving a niche set of problems, but most of the technical nuances are lost on tech bros and managers. What are some problems you have been told to solve which would be impossible (no data, useless data, unrealistic expectations) or a misapplication of ML ("can you have this LLM do all of our accounting?")?

155 Comments
2024/04/25
18:45 UTC

3

Datasets for Causal ML [D]

Does anyone know what datasets are out there for causal inference? I'd like to explore methods in the doubly robust ML literature, and I'd like to complement my learning by working on some datasets and learning the econML software.

Does anyone know of any datasets, specifically in the context of marketing/pricing/advertising that would be good sources to apply causal inference techniques? I’m open to other datasets as well.
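
To make the question concrete, this is roughly the workflow I have in mind with econML (a sketch on synthetic data; I haven't validated it on a real marketing dataset):

```python
import numpy as np
from econml.dml import LinearDML

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                              # customer features
T = (X[:, 0] + rng.normal(size=n) > 0).astype(float)     # e.g. received a discount
Y = 2.0 * T + X[:, 1] + rng.normal(size=n)               # e.g. spend; true treatment effect = 2

est = LinearDML(discrete_treatment=True)
est.fit(Y, T, X=X)
print(est.effect(X).mean())   # should recover something close to 2
```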

4 Comments
2024/04/25
18:24 UTC

7

[P] Dreamboothing MusicGen

Dreambooth the MusicGen model suite on small consumer GPUs, in a matter of minutes, using this repository: https://github.com/ylacombe/musicgen-dreamboothing

The aim of this project is to provide tools to easily fine-tune and dreambooth the MusicGen model suite with little data, leveraging a series of optimizations and tricks to reduce resource consumption thanks to LoRA adapters.

For example, the model can be fine-tuned on a particular music genre or artist to give a checkpoint that generates in that given style. The aim is also to make it easy to share and build on these trained checkpoints.

Specifically, this involves:

  • using as little data and as few resources as possible: we're talking fine-tuning in as little as 15 minutes on an A100 and with as little as 10 GB to 16 GB of GPU memory;

  • easily sharing and building on models thanks to the Hugging Face Hub;

  • optionally, generating automatic music descriptions;

  • optionally, training MusicGen in a Dreambooth-like fashion, where one keyword triggers generation in a particular style.

Wandb example of what the training run looks like here.

1 Comment
2024/04/25
17:43 UTC

4

[R] Speculative Streaming: Fast LLM Inference without Auxiliary Models

Paper: https://arxiv.org/abs/2402.11131

Abstract:

Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.

0 Comments
2024/04/25
16:13 UTC

6

[R] Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Paper: https://arxiv.org/abs/2404.08698

Abstract:

While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.

2 Comments
2024/04/25
16:08 UTC

59

[D] Old Paper - Troubling Trends in Machine Learning Scholarship

I just wanted to remind people of this paper, or introduce it to newcomers. I think this discussion should be reopened, since many people here actually do influence the trends of the field.

https://arxiv.org/pdf/1807.03341


On a personal note (feel free to skip):

Specifically, I want to point out the issue of "mathiness", as this problem seems to have gotten way out of hand and many best papers at conferences suffer from it (one of the most important ML papers tried to be mathy and introduced a big mistake; I believe other papers have bigger issues, but no one bothers to check).

So here are my personal points to academics and researchers:

  1. We practitioners (I think most will relate) do not need equations to know what recall is, and we clearly don't want to read difficult-to-understand versions of what linear regression is; it just makes your paper less useful. If you don't want to waste our time, please put it in the appendix or remove it entirely.
  2. Reviewers, please don't be impressed by unnecessary math; if it's complicated and does nothing useful, who cares? Also, it might be flawed anyway, and you will probably not catch it.

27 Comments
2024/04/25
15:50 UTC

4

[R] Python package for animated time series

In this video about time series, https://www.youtube.com/watch?v=0zpg9ODE6Ww, does anyone know which Python package was used to create the animated plots shown at the 34th minute of the video?
Thanks for your help.
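
For reference, the basic effect can be reproduced with matplotlib's FuncAnimation, though I don't know whether that is what the video actually used; a minimal sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

t = np.linspace(0, 10, 500)
y = np.sin(t) + 0.1 * np.random.randn(len(t))

fig, ax = plt.subplots()
line, = ax.plot([], [])
ax.set_xlim(0, 10); ax.set_ylim(-2, 2)

def update(frame):
    line.set_data(t[:frame], y[:frame])   # reveal the series one point at a time
    return line,

anim = FuncAnimation(fig, update, frames=len(t), interval=20, blit=True)
anim.save("timeseries.gif", writer="pillow")
```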

2 Comments
2024/04/25
15:48 UTC

17

[D] UAI-2024 results waiting area

Following the review phase (old post), I'm creating a thread for others like me who are waiting for the decision.

All the best!

25 Comments
2024/04/25
15:38 UTC

69

[D] Why are transformers not trained layer-wise?

It seems to me that, thanks to the residual path, the gradient that flows to each layer is the same regardless of the transformer layer/block. Example:

ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))) ...)

Since the input to ProjectionAndCost is just the sum of the outputs from all layers plus the initial embeddings, the gradient that reaches layer L1 is the same as the gradient that reaches L2 or L3.

So we could:

  • first train only L1: ProjectionAndCost(X + L1(X))
  • freeze L1, include L2 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)))
  • freeze L1 and L2, include L3 and train: ProjectionAndCost(X + L1(X) + L2(X + L1(X)) + L3(X + L1(X) + L2(X + L1(X))))
  • .. and so on

We can't train L2 first and then L1, because the input to L2 depends on L1, but we could train the lower layers first and then gradually add and train deeper layers. Is there any problem with this approach?
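
To make the procedure concrete, a simplified PyTorch sketch of the schedule I mean (toy dimensions, real training loop omitted):

```python
import torch
import torch.nn as nn

d_model, n_blocks = 64, 3
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    for _ in range(n_blocks))
projection = nn.Linear(d_model, 1000)   # stand-in for ProjectionAndCost

def forward_upto(x, k):
    # residual stream: x plus the outputs of the first k blocks
    for block in blocks[:k]:
        x = block(x)        # TransformerEncoderLayer already adds its own residual
    return projection(x)

for k in range(1, n_blocks + 1):
    for block in blocks[:k - 1]:                 # freeze everything already trained
        for p in block.parameters():
            p.requires_grad_(False)
    params = list(blocks[k - 1].parameters()) + list(projection.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    # ... run a normal training loop over the data using forward_upto(x, k) here ...
```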

18 Comments
2024/04/25
14:16 UTC

11

[D] Is there an equivalent BigDL project for NVIDIA GPUs, which allows distributing work loads across a DL cluster with spark?

So there's this relatively new "BigDL" project (https://bigdl.readthedocs.io/en/latest/), which is for Intel CPUs and Intel GPUs, but there's no mention anywhere of it working with NVIDIA GPUs.

Is there any equivalent library for NVIDIA GPUs on a spark cluster?

7 Comments
2024/04/25
10:18 UTC

8

[D] What is the best TTS model for my case?

Hi. Here is a newbie's question.

The biggest concern is the rate of generation. I want to generate about 5 seconds of voice in about 100 ms. I want to know which model performs best (SOTA) under those conditions.

Which model is best for me?

I think "styletts2" is best.

If you have any relevant experience or know any other information, I would really appreciate your help.

Thank you!

7 Comments
2024/04/25
08:07 UTC
