/r/MachineLearning

2,941,651 Subscribers

0

Need feedback, insights, and comments on the following start-up idea [D], [P]

AI Agent for Digital Adoption.

Problem

  • Companies spend a lot of time, effort, and money on user onboarding and troubleshooting.

Solution

An OpenAI Vision-enabled AI agent that provides on-screen guidance.

  • Version 1: Guidance in a floating panel shown over any app. The panel gives instructions in a chat while the application screen is recorded.
  • Version 2.0: An overlay shows visual cues, such as an arrow or a drawing, that direct the user to the button to click.
  • Final Version: A fully automated AI agent that solves the issue itself.

Customers

B2B: SaaS companies, for their employees and their customer companies.

D2C: People who are starting to learn new things or use new software.

Competitors ??

Future Market Evolution??

Differentiator??

[D], [P]

0 Comments
2024/12/15
03:40 UTC

4

[D] Last Week in Medical AI: Top LLM Research Papers/Models (December 7 - December 14, 2024)

Medical LLM & Other Models

  • PediaBench: Chinese Pediatric LLM
    • This paper introduces PediaBench, the first Chinese pediatric dataset for evaluating Large Language Model (LLM) question-answering performance, containing 4,565 objective and 1,632 subjective questions across 12 disease groups.
  • BiMediX: Bilingual Medical LLM
    • This paper introduces BiMediX, the first bilingual (English-Arabic) medical Mixture of Experts LLM, along with BiMed1.3M, a 1.3M bilingual medical instruction dataset with over 632M tokens used for training.
  • Diverse medical knowledge integration
    • This paper introduces BiMediX2, a bilingual (Arabic-English) Large Multimodal Model (LMM) based on Llama3.1 architecture, trained on 1.6M medical interaction samples.
  • BRAD: Digital Biology Language Model
    • This paper introduces BRAD (Bioinformatics Retrieval Augmented Digital assistant), an LLM-powered chatbot and agent system integrating various bioinformatics tools.
  • MMedPO: Vision-Language Medical LLM
    • This paper introduces MMedPO, a multimodal medical preference optimization approach to improve factual accuracy in Medical Large Vision-Language Models (Med-LVLMs) by addressing modality misalignment.

Frameworks & Methodologies

  • TOP-Training: Medical Q&A Framework
  • Hybrid RAG: Secure Medical Data Management
  • Zero-Shot ATC Clinical Coding
  • Chest X-Ray Diagnosis Architecture
  • Medical Imaging AI Democratization

Benchmarks & Evaluations

  • KorMedMCQA: Korean Healthcare Licensing Benchmark
  • Large Language Model Medical Tasks
  • Clinical T5 Model Performance Study
  • Radiology Report Quality Assessment
  • Genomic Analysis Benchmarking

LLM Applications

  • TCM-FTP: Herbal Prescription Prediction
  • LLaSA: Activity Analysis via Sensors
  • Emergency Department Visit Predictions
  • Neurodegenerative Disease AI Diagnosis
  • Kidney Disease Explainable AI Model

Ethical AI & Privacy

  • Privacy-Preserving LLM Mechanisms
  • AI-Driven Digital Organism Modeling
  • Biomedical Research Automation
  • Multimodality in Medical Practice

Full thread in detail: https://x.com/OpenlifesciAI/status/1867999825721242101

0 Comments
2024/12/14
21:25 UTC

0

Custom Implementation of Contrastive Loss [P]

I am trying to implement the contrastive loss function, but I am unsure whether it is correct: my loss seems to explode to infinity. Another set of eyes on this would be appreciated. Does this look correct?

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    def __init__(self, temperature=0.9):
        super(ContrastiveLoss, self).__init__()
        self.temperature = temperature

    def forward(self, projections_1, projections_2):
        z_i = projections_1
        z_j = projections_2
        z_i_norm = F.normalize(z_i, dim=1)
        z_j_norm = F.normalize(z_j, dim=1)
        cosine_num = torch.matmul(z_i, z_j.T)
        cosine_denom = torch.matmul(z_i_norm, z_j_norm.T)
        cosine_similarity = cosine_num / cosine_denom

        numerator = torch.exp(torch.diag(cosine_similarity) / self.temperature)

        denominator = cosine_similarity
        diagonal_indices = torch.arange(denominator.size(0))
        denominator[diagonal_indices, diagonal_indices] = 0
        denominator = torch.sum(torch.exp(cosine_similarity), dim=1)
        loss = -torch.log(numerator / denominator).mean()
        return loss
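
A possible source of the blow-up, offered as a sketch rather than a confirmed diagnosis: `cosine_num / cosine_denom` divides the raw dot products by the cosine similarities, which yields the product of the two vector norms instead of a value in [-1, 1], and it is unbounded wherever the cosine is close to zero. In addition, `denominator = cosine_similarity` aliases the same tensor, so the in-place write sets the diagonal to 0 (and `exp(0)` is 1, so the positive is not actually masked out), and the temperature is never applied in the denominator. The version below assumes the intended loss is the InfoNCE form in which row i of `projections_1` is the positive for row i of `projections_2`, and it keeps the positive in the denominator. `InfoNCELoss` is a name chosen here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfoNCELoss(nn.Module):
    """One-directional InfoNCE: row i of each input forms the positive pair."""

    def __init__(self, temperature=0.9):
        super().__init__()
        self.temperature = temperature

    def forward(self, projections_1, projections_2):
        # L2-normalize so that the dot product is the cosine similarity.
        z_i = F.normalize(projections_1, dim=1)
        z_j = F.normalize(projections_2, dim=1)
        # Temperature-scaled similarity of every z_i row with every z_j row.
        logits = z_i @ z_j.T / self.temperature
        # Positives sit on the diagonal; cross_entropy evaluates
        # -log(exp(l_ii) / sum_k exp(l_ik)) with a stable log-sum-exp.
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)


torch.manual_seed(0)
a = torch.randn(8, 16) * 100.0   # large norms are removed by the normalization
b = torch.randn(8, 16) * 100.0
loss_fn = InfoNCELoss()
random_loss = loss_fn(a, b)      # unrelated views
aligned_loss = loss_fn(a, a)     # identical views: every positive has cosine 1
```

Because the logits are bounded by 1/temperature in magnitude, the loss stays finite regardless of the scale of the projections. A symmetric variant would average this with the loss computed on the transposed logits.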
4 Comments
2024/12/14
20:01 UTC

16

[Project] Matrix Recurrent States, an Attention Alternative

https://github.com/mikayahlevi/mru-lm
Hi, I'm posting here to share a project I just published on GitHub. I'll start with a description, some of which will be copy/pasted from the GitHub repo.
The idea of a matrix recurrent unit is dictated by the update rule H_t = H_{t-1} X_{t-1} and H_1 = X_1, where X and H are s×n×n sequences of square matrices. The primary difference between this and a traditional RNN is that no initial vector is passed through the linear layers; instead, the first state is a matrix, so the output is also a matrix. My motivation for coming up with this idea is based on the following reasons:

  • Matrix multiplication is associative but not commutative. The associativity means I can compute the cumulative matrix product using an (inclusive) parallel scan. The lack of commutativity means that the order of tokens is automatically incorporated into the MRU.
  • When you try to do this scan on a traditional RNN, the number of operations scales cubically with the number of elements in the output state, meaning that limited information is retained compared to the amount of computation. On the other hand, if the states are matrices, the number of operations as a function of the number of elements in the output state is (n^2)^(3/2), where n^2 is the number of elements in the square n×n matrix state. Here's a paper including some information about this: https://arxiv.org/abs/1709.04057.
  • When processing the tokens sequentially, or in parallel with the (not-yet-implemented) Brent-Kung parallel scan, the network scales linearly with time, in contrast to attention, which scales quadratically with time.
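
As a sketch of the first bullet, the snippet below computes an inclusive cumulative matrix product two ways: with a sequential loop and with a log-depth Hillis-Steele style scan that relies only on associativity. It assumes the update is read as the inclusive product H_t = X_1 X_2 ... X_t; the function names are illustrative and are not taken from the mru-lm repository. The Brent-Kung scan mentioned above is a different, more work-efficient schedule that is not shown here.

```python
import numpy as np


def sequential_cumprod(X):
    """H[t] = X[0] @ X[1] @ ... @ X[t], computed one step at a time."""
    H = np.empty_like(X)
    H[0] = X[0]
    for t in range(1, len(X)):
        H[t] = H[t - 1] @ X[t]
    return H


def scan_cumprod(X):
    """Same result via a Hillis-Steele inclusive scan (about log2(s) rounds)."""
    H = X.copy()
    offset = 1
    while offset < len(H):
        nxt = H.copy()
        # Position t >= offset absorbs the partial product that ends at t - offset.
        # The earlier factor stays on the left because matmul is not commutative.
        nxt[offset:] = H[:-offset] @ H[offset:]
        H = nxt
        offset *= 2
    return H


rng = np.random.default_rng(0)
s, n = 16, 4
# Near-identity matrices keep the long products numerically tame.
X = np.eye(n) + 0.1 * rng.standard_normal((s, n, n))
H_seq = sequential_cumprod(X)
H_par = scan_cumprod(X)
```

Each round of the scan is a single batched matrix multiply, which is what makes it attractive on parallel hardware, at the cost of more total multiplications than the sequential loop.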

I tried generating matrix X by different methods in the different branches. All of the ways to generate X and fold the output hidden state back into a vector are arbitrary combinations of linear layers and reshapes, chosen simply because I found they worked well.

Loss vs Steps for a Transformer and an MRU-LM on shakespeare-char

This approach seems to work pretty well based on the toy dataset shakespeare-char, so if anyone wants to help me out, I would like to benchmark it on more informative datasets and see how it works out.

4 Comments
2024/12/14
18:19 UTC

0

[P] I have an issue with my NN: instead of predicting real words, the model returns things like "the in etc.", which are not real words. How do I fix this?

2 Comments
2024/12/14
17:58 UTC

85

[D] What are the (un)written rules of deep learning training

Disclaimer: I posted this in r/learnmachinelearning first, but the sub seems to be more concerned with very basic questions, courses and hiring, so feel free to remove it if it doesn't fit here (though I think it also fits this sub as a discussion).

I now have a few years of experience building and training different model architectures. I know most of the basic theory and am able to follow most papers, so my question goes in a more methodological direction. While I am able to successfully build models for a number of applications, a lot of the time this is to a large extent guesswork: I try out different things and see what sticks. I know there is a lot of research on interpretability going on, but that is not directly the direction I want to go with this. Instead, I want to ask you all what general advice you have on the training process: what are some practical observations, rules of thumb, and approaches you take that are not described in a paper or a theoretical ML class? For example:

  • How do you analyze gradients in your model? I know how to do some very basic plots in this regard, but I would be interested in your methods and how you read them from a practical perspective.

  • How do you visualize temporal instabilities between optimizer steps resulting from e.g. a too large learning rate?

  • How do you determine appropriate regularization?

  • What are your rules of thumb for diminishing returns during a training run?

  • How do you tune your hyperparameters? I eyeballed them more or less and also used optuna for this in the past.

  • What are some important intuitions, unwritten rules and pitfalls during training in your opinion?

  • What are your debugging steps when a model does not perform as expected?

  • What tricks do you actually use? There are lots of small tricks (EMA, obscure activation functions, ...) that promise some gains, but what do you actually use?

  • How does your approach differ when you do a transformer, CNN, diffusion model, ...

  • Some general opinions or tips that I might have missed above.

University classes and online resources mostly teach the basics or theoretical foundation, which is very important, but in practice only part of the story. Real world experience also helps, but you only get so far with trial and error and might miss something useful. I am aware of the blog posts by Karpathy on the training of neural networks and look for more resources in this direction.

I am happy to hear your replies on this arguably broad topic.

18 Comments
2024/12/14
10:29 UTC

467

[D] What happened at NeurIPS?

440 Comments
2024/12/14
07:00 UTC

0

[R] Survey on students’ motivation to learn Artificial Intelligence and Modeling

We are university students and we're conducting a quick survey on students’ motivation to learn Artificial Intelligence and Modeling. The survey will take less than 10 minutes to complete.

Here's the link to the survey: https://docs.google.com/forms/d/e/1FAIpQLSdS-xy53N9lDRlC_835A_E59VMjCPql0_HuihPYqaQ_nINSsw/viewform?usp=sf_link

Your input would mean a lot to us! Thank you so much for your support and time.

1 Comment
2024/12/14
05:25 UTC

1

[D] Help with clustering over time

I'm dealing with a clustering-over-time issue. Our company is a sort of PayPal. We are trying to implement an antifraud process that triggers alerts when a client makes excessive payments compared to its historical behavior. To do so, I've come up with seven clustering features, all of which are 365-day moving averages of different KPIs (payment frequency, payment amount, etc.). So it goes without saying that these indicators evolve very slowly from one day to the next. I have about 15k clients and several years of data. I remove outliers (basically, the 99th percentile on each date) and put them in a cluster 0 by default. Then the idea is, for each date, to come up with 8 clusters. I've used Gaussian mixture model (GMM) clustering but, weirdly enough, the clusters my clients fall into vary wildly from one day to the next. I have tried to plant the previous means as centroids, using each client's previous-day centroid to seed the next day's clustering, but the results still vary a lot. I've read a bit about DynamicC, and it seemed like the way to address the issue, but it doesn't help.
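
The feature construction described above can be sketched as follows, under stated assumptions: `daily` is a hypothetical array of shape (days, clients) for a single KPI, the moving average uses a trailing window, and a client is routed to cluster 0 when its value exceeds that date's 99th percentile. The helper names are illustrative, and the window is shortened to 3 days only to keep the example small.

```python
import numpy as np


def trailing_mean(daily, window):
    """Trailing moving average along axis 0; the first window-1 rows are NaN."""
    csum = np.cumsum(daily, axis=0)
    out = np.full(daily.shape, np.nan)
    out[window - 1] = csum[window - 1] / window
    # Difference of cumulative sums gives the sum over the last `window` rows.
    out[window:] = (csum[window:] - csum[:-window]) / window
    return out


def outlier_mask(values_on_date, pct=99.0):
    """True where a client's value exceeds that date's percentile cut-off."""
    cutoff = np.percentile(values_on_date, pct)
    return values_on_date > cutoff


daily = np.arange(10, dtype=float).reshape(5, 2)   # 5 days, 2 clients
feature = trailing_mean(daily, window=3)

one_date = np.array([1.0] * 99 + [1000.0])         # 100 clients on one date
to_cluster_zero = outlier_mask(one_date)
```

With real data the same two steps would run per KPI with `window=365`, and only the rows where the mask is False would be passed on to the mixture model.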

1 Comment
2024/12/13
19:27 UTC

36

[D] NVIDIA’s hostages: A Cyberpunk Reality of Monopolies

In AI and professional workstations, NVIDIA's dominance feels like a suffocating monopoly. Their segmented product lines widen the gap between consumer and professional GPUs, particularly in VRAM, performance, and price.

AI enthusiasts struggle with prohibitive costs for GPUs equipped with sufficient VRAM. The reliance on CUDA, a proprietary platform, further locks developers into NVIDIA’s ecosystem, stifling competition and innovation.

NVIDIA’s control extends beyond hardware, as their CUDA platform discourages adoption of open, competitive solutions. This feeds a cyberpunk dystopia where corporations consolidate power, leaving consumers and developers with few choices.

Why does the tech world remain complicit? Why aren’t we pursuing alternative hardware architectures or broader software compatibility beyond CUDA? AMD’s ROCm is a start, but more aggressive development and policy interventions are needed to challenge NVIDIA’s grip.

How long will this continue? Who will stand up for the end consumer?

27 Comments
2024/12/13
19:02 UTC

12

[R] Identifying Critical Decision Points in Neural Text Generation Through Token-Level Uncertainty Analysis

This paper introduces a framework for analyzing and visualizing the branching decisions language models make during text generation. The key methodology involves tracking probability distributions across different sampling paths to understand how early choices affect downstream generation.

Main technical points:

  • Developed metrics to quantify uncertainty at each generation step
  • Created visualization tools for mapping decision trees in generation
  • Analyzed how different sampling methods affect path divergence
  • Measured correlation between model confidence and generation quality
  • Identified clustering patterns in generation trajectories
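
The summary does not say which uncertainty metrics the paper uses. As one common choice (an assumption here, not the paper's definition), the Shannon entropy of the next-token distribution can be computed at each generation step. The values in `step_logits` are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


step_logits = [
    [5.0, 0.0, 0.0, 0.0],   # one token dominates: low uncertainty
    [1.0, 1.0, 1.0, 1.0],   # uniform over 4 tokens: maximum uncertainty
]
step_entropy = [entropy(softmax(logits)) for logits in step_logits]
```

Steps whose entropy is high relative to their neighbors are natural candidates for the "branching" points the post describes.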

Key results:

  • Found that paths tend to cluster into 2-3 distinct trajectory groups
  • Early sampling decisions have outsized impact on final outputs
  • Uncertainty patterns vary significantly between sampling methods
  • Similar prompts can lead to dramatically different generation paths
  • Model confidence doesn't consistently predict output quality

I think this work provides important insights into how we might better control text generation. The ability to map and understand generation paths could help develop more reliable sampling methods and better uncertainty estimates.

I think the clustering of generation paths is particularly interesting - it suggests there may be ways to guide generation toward desired trajectory groups. This could be valuable for applications needing more predictable outputs.

The methodology also reveals some concerning aspects about current sampling methods. The strong dependence on early decisions suggests we may need new approaches that better preserve generation flexibility throughout the sequence.

TLDR: New framework for analyzing how language models make text generation choices. Shows that generation paths cluster into distinct groups and early decisions heavily influence outcomes. Could help develop better sampling methods and uncertainty estimates.

Full summary is here. Paper here.

1 Comment
2024/12/13
14:10 UTC

16

[D] Training with synthetic data and model collapse. Is there progress?

About a year ago, research papers talked about model collapse when dealing with synthetic data. Recently I’ve been hearing about some progress in this regard. I am not an expert and would welcome your views on what’s going on. Thank you and have a fantastic day.

24 Comments
2024/12/13
10:03 UTC

0

[D] Agentic AI Design Patterns

I was looking into design patterns for Agentic AI and I could use some help grasping the concepts.

I read about ReAct and ReWOO.

From ReWOO, I really liked the idea of having a planner that creates a blueprint of the work that needs to be done. I can imagine that this works well for a lot of tasks, and it optimizes token usage compared to ReAct.

From ReAct, I like that it has a reflection/observation LLM, to decide whether the output is good enough or needs another pass through the agents.

What I don't understand: Why does ReWOO not have a reflection component?

Wouldn't it be the best of both worlds to have the planner and the reflection?

This was the first draft for my agentic AI prototype, and I think it has pretty obvious advantages.

I think I am missing something here.
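
The combination asked about above can be sketched as a plain control loop. This is a hypothetical outline, not code from the ReAct or ReWOO papers: `plan`, `execute`, and `reflect` stand in for LLM or tool calls, and every name here is invented for illustration.

```python
from typing import Callable, List


def run_agent(
    task: str,
    plan: Callable[[str, str], List[str]],
    execute: Callable[[str], str],
    reflect: Callable[[str, List[str]], str],
    max_passes: int = 3,
) -> List[str]:
    """Plan up front, execute every step, then let a reflector accept or object.

    `reflect` returns an empty string to accept the results; any other string
    is treated as feedback and handed to the planner on the next pass.
    """
    feedback = ""
    results: List[str] = []
    for _ in range(max_passes):
        steps = plan(task, feedback)
        results = [execute(step) for step in steps]
        feedback = reflect(task, results)
        if not feedback:
            break
    return results


# Stub components standing in for model calls.
def stub_plan(task, feedback):
    steps = ["search " + task]
    if feedback:
        steps.append("verify " + task)
    return steps


def stub_execute(step):
    return "done: " + step


def stub_reflect(task, results):
    return "" if len(results) >= 2 else "add a verification step"


output = run_agent("topic", stub_plan, stub_execute, stub_reflect)
```

One trade-off visible in the loop: every rejected pass re-plans and re-executes all steps, so part of the token saving from planning once up front is given back whenever the reflector asks for another round.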

0 Comments
2024/12/13
09:49 UTC

17

[D] Importance of HPO per field / model type / applications

I’ve noticed that the time spent on hyperparameter optimization varies significantly, not just between industry and academia but also across fields like NLP, computer vision, and reinforcement learning. I’m curious: what’s your experience?

  • Is tuning something you prioritize heavily, or do you often settle for “good enough” configurations to move faster?
  • Which fields / model types / applications do you think experience the biggest (or smallest) workflow bottleneck due to HPO?
  • Is there any industry dependence in the choice of HPO tools? For example, everyone in industry xx would pick Optuna as a go-to, or everyone running xx experiments would use SigOpt.

Would love to hear your experiences! Thanks

7 Comments
2024/12/13
06:58 UTC

3

[D] Help with evaluating a model

I am having an issue evaluating my model: model.evaluate() returns an okay overall accuracy, but the confusion matrix and classification report show 100% for one class and 0% for the other. I am using CIFAR-10, but only 2 classes from it. Does anyone know why this happens? Is this overfitting? I am not sure, because my training accuracy is similar to the score from model.evaluate(), and the same goes for the loss (which is almost as high as the accuracy).
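
One possible cause, which is only a guess without seeing the code: if the model ends in a single sigmoid unit, its predictions have shape (N, 1), and taking `argmax` over the last axis then returns class 0 for every sample. That would produce 100% for one class and 0% for the other in the confusion matrix even when the reported accuracy looks reasonable. The arrays below are made-up stand-ins for the model's predicted probabilities and the true labels.

```python
import numpy as np

probs = np.array([[0.9], [0.2], [0.7], [0.1]])   # sigmoid outputs, shape (N, 1)
y_true = np.array([1, 0, 1, 0])

# argmax over a single column is always 0, whatever the probabilities are.
wrong = np.argmax(probs, axis=1)
# Thresholding the single probability recovers the intended 0/1 labels.
right = (probs[:, 0] >= 0.5).astype(int)

acc_wrong = float((wrong == y_true).mean())
acc_right = float((right == y_true).mean())
```

If the model instead ends in a two-unit softmax, `argmax` is the right call, and a next thing to check would be whether the two CIFAR-10 class ids were remapped to 0 and 1 consistently for training and evaluation.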

5 Comments
2024/12/13
05:40 UTC

1

[D] Do Intel GPUs support ROCm, or do AMD cards support Intel's equivalent?

I can't find this information. If both are open source, a compatibility layer would make sense. Has either of the two already been ported to the other platform? If you can share info about NVIDIA too, that would be cool.

0 Comments
2024/12/12
19:39 UTC

1

[P] Scaling data from aggregated calculations

Hello, I have a project in which I detect anomalies in transaction data from the Ethereum blockchain. I have performed aggregated calculations for each wallet address (e.g., minimum, maximum, median, sum, and mode of transaction values) and created a separate data file with them. I have joined this data onto all the transactions. Now I have to standardize the data before machine learning (I have chosen robust scaling), but I have the following questions on this topic:

  1. Should I standardize each feature based on its own median and IQR? Or should I perform scaling on the column that the calculations come from (the value column) and then use its median and IQR to scale the calculated columns?
  2. If each feature is scaled based on its own median and IQR, should I do it before joining the calculated data or after?
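
For reference, robust scaling as it is usually defined centers each column on its median and divides by its interquartile range. Below is a minimal NumPy sketch with a made-up two-column feature matrix in which each column uses its own statistics. Splitting the work into a fit step and a transform step, so that statistics learned on training rows can be reused on held-out rows, is an assumption about the workflow and not something stated in the post.

```python
import numpy as np


def fit_robust(X):
    """Per-column median and interquartile range."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0   # leave constant columns unscaled instead of dividing by zero
    return median, iqr


def transform_robust(X, median, iqr):
    """Apply previously fitted statistics column by column."""
    return (X - median) / iqr


X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [100.0, 50.0],   # the 100.0 is an outlier in the first column
])
median, iqr = fit_robust(X)
X_scaled = transform_robust(X, median, iqr)
```

The outlier in the first column barely affects the median or the IQR, so the remaining rows keep a sensible scale, which is the main reason to prefer this over mean and standard deviation on heavy-tailed transaction values.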
0 Comments
2024/12/12
20:32 UTC

4

[D] LSTM model implementation and approximation questions

For a project I am currently trying to integrate an Autoencoder for feature extraction and an LSTM for classification of the reduced feature space. The problem I am encountering is how to train the LSTM network. The AE produces 5 values, which are fed into the LSTM network. The tricky part is the training of the LSTM network and how the LSTM works. I want the LSTM to take into account the 5 parameters from the AE at time t as well as the parameters at t-1 and t-2. As far as I understand, the LSTM does this automatically. Or should the LSTM instead take in a total of 15 parameters, with each group of 5 corresponding to one timestep of the AE?

Any advice on LSTM would be great or how such training can be done in an efficient way. The AE is processing a time-series signal.
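
As a sketch of the input layout: recurrent layers in the common frameworks take a 3-D tensor of shape (samples, timesteps, features), so the AE outputs at t-2, t-1, and t would usually be stacked as 3 timesteps of 5 features rather than flattened into 15 inputs. The windowing below uses made-up AE outputs, and `make_windows` is an illustrative helper, not part of any library.

```python
import numpy as np


def make_windows(codes, timesteps):
    """Stack consecutive rows of `codes` (T, features) into
    (T - timesteps + 1, timesteps, features) overlapping windows."""
    count = len(codes) - timesteps + 1
    return np.stack([codes[i:i + timesteps] for i in range(count)])


codes = np.arange(6 * 5, dtype=float).reshape(6, 5)   # 6 time steps of 5 AE features
windows = make_windows(codes, timesteps=3)            # window k covers steps k, k+1, k+2
```

Each window would then be paired with the label of its last time step, so the classifier sees t-2 and t-1 as context for the prediction at t.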

2 Comments
2024/12/12
21:09 UTC

7

[D] "Proper" way to upload an accepted conference paper to arXiv?

We recently had a paper accepted to a conference (AAAI). We found out that the conference does not publish appendices so they recommend we upload the full paper (with appendix) to arXiv. This is something we were considering doing anyway since the paper would be available before the conference proceedings come out.

My concern is that if someone decides to cite our work, they may either become confused or cite the arXiv rather than AAAI "version".

Is there a "correct" or common way to handle this? Do arXiv uploads with the same title get indexed as "one manuscript" on Google Scholar?

Also, are we allowed to use the conference template to upload? (This part might be conference dependent I suppose).

I know it is common these days to upload to arXiv before hearing back from a conference (usually with a different title) but I think this is a slightly different situation as the paper is accepted and the uploaded version will be identical to the conference paper (though with an Appendix).

Thanks in advance!

2 Comments
2024/12/12
20:38 UTC

634

[D] The winner of the NeurIPS 2024 Best Paper Award sabotaged the other teams

Allegedly, the winner of the NeurIPS 2024 Best Paper Award (a guy from ByteDance, the creators of TikTok) sabotaged the other teams to derail their research and redirect their resources to his own. Plus, he was at meetings debugging his colleagues' code, so he was always one step ahead. There's a call to withdraw his paper.

https://var-integrity-report.github.io/

I have not checked the facts myself, so if you can verify what is asserted, it would be good to confirm whether this is true.

81 Comments
2024/12/12
19:41 UTC

0

[D] I got the acceptance for my IEEE publication. Does that mean it will be uploaded to their Xplore page?

So I submitted a paper and it got accepted for publication around 2 months ago. Today was my conference presentation, held online. It didn't go well: I think he was in a hurry, he didn't listen much, disagreed a bit, and then closed the meeting on me. So my question is, how bad is it? Will it be published since I have the acceptance, or is it still a no?

17 Comments
2024/12/12
17:04 UTC

8

[R] Rethinking the positive pairs in contrastive learning

Hi, I am sharing my recent work, which allows arbitrary images to be positive pairs. Our finding is quite astonishing: two disparate images, e.g., a snake and a lamp, can be positive. Our work potentially broadens the applications of contrastive learning to deal with "false positives", in which the two views are not similar.

We challenge the common sense in contrastive learning, that is, the positive pair design is critical. Our results prove that the feature selection is the key!

Paper: https://arxiv.org/abs/2410.18200

10 Comments
2024/12/12
17:02 UTC

135

[D] What makes TikTok's recommendation algorithm so strong?

General Discussion - now that they are about to be banned in the US, I'm becoming fascinated by the strength of their For You recommendations. To try and put some guard rails on what I mean, TikTok has shown itself to be able to match content to relevant audience at greater frequency and scale than any other app (YouTube included). Many creators can join the platform, post a single video, and have millions of views in 24 hours. This does happen on other apps, but TikTok seems to be the most consistent at scaling audience incredibly fast.

What models might they be basing their system on? What about their models creates their competitive advantage?

41 Comments
2024/12/12
16:39 UTC

3

[D] Pet project - Style Transfer Neural Networks Implementation

Hi, I am learning ML and this is my first project. I did a simple 100 LoC implementation of the Neural Style Transfer paper by Gatys et al. See https://github.com/TAOGenna/pytorch-neural-style-transfer

https://preview.redd.it/x2udi76n2g6e1.jpg?width=939&format=pjpg&auto=webp&s=437bdda1683e9fd580a6b3d1d4dc2598b25079ff

0 Comments
2024/12/12
16:25 UTC

0

[D] What Models Are Best at Producing Ambient Sounds/ Music?

I'm working on an application that requires ambient sounds/ music. For example:

  • "A crackling fire, with chat murmuring in the background."
  • "Nighttime countryside summer sounds in the UK."
  • "Wind blowing through the mountains as you're stood on a high rock."

I've had a look at Hugging Face and found the Text-To-Audio section. However it appears the top models have very few downloads:

https://preview.redd.it/d4o7g760yf6e1.png?width=1788&format=png&auto=webp&s=2901a7678582745beb714b81519712bac37bd195

This makes me think the field is immature, and there's no clear best model. Is this a fair appraisal of the field, or are there models outside of Hugging Face that perform well for this use case?

4 Comments
2024/12/12
16:07 UTC

3

[D] Question About ResNet and Scalability of Extremely Deep Networks

I’ve been exploring the architecture of ResNet and its ability to train very deep neural networks effectively. While I understand that residual connections help mitigate issues like vanishing gradients and make training deeper networks feasible, I’m curious about the limitations of this approach when scaling to extremely deep networks, such as those with 1000 layers or more.

From my understanding, a ResNet with, say, 100 layers might effectively function like a much smaller network due to the residual connections, which essentially "skip" layers and add outputs. However, wouldn’t this also mean that if a regular MLP struggles to scale beyond 15 layers, a ResNet might just shift this limit proportionally (e.g., struggling beyond 150 layers)? In other words, does ResNet fundamentally solve the problem of training extremely deep networks, or does it merely extend the depth at which issues start to reappear?

I’d appreciate any insights you might have! TYSM!

5 Comments
2024/12/12
15:54 UTC

4

[R] A Grounded Theory Study of LLM Red Teaming: Motivations, Strategies, and Techniques

This paper presents a grounded theory study of how red-teaming is conducted on Large Language Models (LLMs), based on interviews with practitioners. The researchers systematically analyzed practitioner approaches to identify common patterns, strategies and motivations in LLM red-teaming.

Key technical points:

  • Used qualitative coding of interviews to develop taxonomy of red-teaming approaches
  • Identified 12 distinct attack strategies and 35 specific techniques
  • Found red-teaming requires manual effort rather than automation
  • Demonstrated importance of team collaboration over individual attempts
  • Established red-teaming as distinct from malicious attacks
  • Mapped common patterns in tester motivations and goals

Main results:

  • Red-teaming strategies fall into categories like prompt manipulation, psychology-based attacks, and system limit testing
  • Successful testers adopt an "alchemist" mindset of systematic experimentation
  • Most practitioners are motivated by curiosity and safety concerns
  • Testing requires deep understanding of both technical and psychological aspects
  • Manual testing currently more effective than automated approaches

I think this work provides an important foundation for developing more structured approaches to LLM safety testing. The taxonomy they've developed could help standardize how we evaluate and secure these systems. Their finding that manual testing remains superior to automation suggests we need much more work on automated testing approaches.

I think the emphasis on non-malicious intent and safety motivations is particularly relevant as these systems become more widely deployed. Understanding how and why people conduct these tests helps distinguish legitimate security research from attacks.

TLDR: First systematic study of LLM red-teaming practices, providing taxonomy of strategies and techniques based on practitioner interviews. Shows importance of manual testing and team collaboration, while establishing red-teaming as legitimate security research.

Full summary is here. Paper here.

0 Comments
2024/12/12
13:56 UTC
