/r/MachineLearning
Beginners -> /r/mlquestions, AGI -> /r/singularity, career advice -> /r/cscareerquestions, datasets -> /r/datasets
Please have a look at our FAQ and Link-Collection
Metacademy is a great resource which compiles lesson plans on popular machine learning topics.
For Beginner questions please try /r/LearnMachineLearning , /r/MLQuestions or http://stackoverflow.com/
For career related questions, visit /r/cscareerquestions/
AMAs:
Pluribus Poker AI Team 7/19/2019
DeepMind AlphaStar team (1/24/2019)
Libratus Poker AI Team (12/18/2017)
DeepMind AlphaGo Team (10/19/2017)
The MalariaSpot Team (2/6/2016)
OpenAI Research Team (1/9/2016)
Andrew Ng and Adam Coates (4/15/2015)
I was working with SAM2 and have been trying to figure out the best way to fine-tune it for my specific use case. A few considerations I was hoping to get some insights on:
Hey people, I'm searching for papers or hints for a computer vision task. I have implemented a Vision Transformer for image classification. In the next step I have to implement a predictor on top of the encoder network of the ViT, which maps enc(x_t) -> enc(x_t+1), i.e. predicts the embedding of the next frame. My first idea is an MLP head or a decoder network. If someone has tackled a similar task, I'm happy about recommendations. Ty
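Not a recommendation from the thread, just a minimal sketch (PyTorch assumed) of the MLP-head idea: a small network that maps the encoder embedding of frame t to a predicted embedding of frame t+1. Names like `EmbeddingPredictor` and `embed_dim` are placeholders, not from the post.

```python
import torch
import torch.nn as nn

class EmbeddingPredictor(nn.Module):
    """MLP head that predicts enc(x_{t+1}) from enc(x_t)."""
    def __init__(self, embed_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        # z_t: embedding of frame t, shape (batch, embed_dim)
        return self.mlp(z_t)

# Training target: the encoder's embedding of the next frame (often detached
# or produced by a frozen/EMA copy of the encoder).
predictor = EmbeddingPredictor(embed_dim=768)
z_t, z_next = torch.randn(4, 768), torch.randn(4, 768)
loss = nn.functional.mse_loss(predictor(z_t), z_next.detach())
```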
For some reason, I can't seem to find any well known benchmark datasets that have text or images as features, and real-valued targets. Any target range is fine ( (0,1), (-infinity, infinity), (0, infinity), etc.) I have found examples with ordinal classification targets (e.g. integer rating from 1-5), but that doesn't serve my purpose.
Does anyone know of any open source supervised ML data that fits this description? Preferably a benchmarked one with a performance leaderboard.
Lead ML & cryptography researcher here. Just wrapped up a study that might piss some people off, but the data doesn't lie: current LLMs (yes, even GPT-4) are just incredibly sophisticated autocomplete. Here's why that matters.
TL;DR:
* Current LLMs don't actually reason, they pattern match really well
* We identified two promising paths forward: training-time and inference-time enhancements
* PEFT + Chain-of-Thought prompting together show surprising results (rough sketch below)
* All research/code will be open-source
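This is not the study's code, just a rough sketch of what combining the two enhancement paths could look like, assuming Hugging Face transformers plus the peft library; the model name and hyperparameters are illustrative placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Training-time enhancement: wrap the base model with LoRA adapters (PEFT).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Inference-time enhancement: chain-of-thought style prompting.
prompt = ("Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
          "Let's think step by step.")
inputs = tokenizer(prompt, return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```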
Last Week in Medical AI: Top LLM Research Papers/Models (November 2 - November 9, 2024)
Medical AI Paper of the Week:
Medical LLM & Other Models:
AutoProteinEngine: Multimodal Protein LLM
GSCo: Generalist-Specialist AI Collaboration
SAM for Lung X-ray Segmentation
MEG: Knowledge-Enhanced Medical QA
This paper introduces MEG, a parameter-efficient method for augmenting Large Language Models (LLMs) with medical knowledge graphs using a lightweight mapping network. Evaluated on four medical multiple-choice datasets, MEG achieves a 10.2% accuracy improvement over the Mistral-Instruct baseline and 6.7% over specialized models like BioMistral, demonstrating the benefit of knowledge graph integration.
Frameworks and Methodologies:
Medical LLM Applications:
CataractBot: Patient Support System
CheX-GPT: X-ray Report Enhancement
CardioAI: Cancer Cardiotoxicity Monitor
HealthQ: Healthcare Conversation Chain
PRObot: Diabetic Retinopathy Assistant
Medical LLMs & Benchmarks:
MediQ: Clinical Reasoning Benchmark
Touchstone: Segmentation Evaluation
Medical LLM Adaptation Progress
Fine-Tuning Medical QA Strategies
AI in Healthcare Ethics:
Full thread in detail: https://x.com/OpenlifesciAI/status/1855207141302473090
Hello, everyone.
I wrote a small collection of boosting algorithms in Rust named MiniBoosts.
This is a hobby project, but I would like to improve more.
Any feedback is welcome.
I appreciate your cooperation.
As the title says, I was wondering if there are other ways to embed a corpus without using torch. One of the solutions I came up with was using ONNX: I created the images using the fastembed library from Qdrant and the sentence-transformers library. Using fastembed results in a significant image size reduction.
Are there other ways (for example, modifying the Dockerfile or using other libraries) to shrink the Docker image even more?
public repo: https://github.com/learning-bos/dockerize-torch-fastembed-sentence-transformer-comparison
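For illustration, a minimal sketch of the fastembed route (ONNX Runtime under the hood, no torch in the image); the model name is just an example and the exact API may vary between fastembed versions.

```python
from fastembed import TextEmbedding

documents = [
    "fastembed runs on ONNX Runtime",
    "no torch needed in the Docker image",
]

# Downloads and runs an ONNX embedding model; no torch dependency required.
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = list(model.embed(documents))  # generator of numpy arrays

print(len(embeddings), embeddings[0].shape)
```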
Seems like an obvious question, but such a "data point" would be very helpful to clear up our ignorance.
Hey guys, wanted to get your feedback on a project I'm developing. I'm building a framework to define AI agents from YAML configuration files. These files encapsulate the tasks that need to be done, how they connect, etc., while everything else is abstracted away.
Now the idea is to use LLMs themselves to create those YAML files from a user prompt. Since the config file captures all the core logic of the agent and strips out unnecessary detail, I think this is the most efficient way to build a text-to-agent framework. Wdyt?
Let me know your thoughts, and have a look at the repo https://github.com/octopus2023-inc/gensphere
Let me know if you want to contribute and make it work.
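To make the idea concrete, here is a purely hypothetical example of what an LLM-generated agent config and a loader for it could look like; the field names are made up and are not gensphere's actual schema.

```python
import yaml  # pip install pyyaml

# Hypothetical YAML an LLM might emit from a user prompt (illustrative schema only).
agent_yaml = """
agent:
  name: report_writer
  tasks:
    - id: fetch_data
      tool: web_search
      input: "{user_query}"
    - id: summarize
      tool: llm_call
      depends_on: [fetch_data]
"""

config = yaml.safe_load(agent_yaml)
for task in config["agent"]["tasks"]:
    # Print the task graph implied by the config.
    print(task["id"], "depends on", task.get("depends_on", []))
```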
I was looking into the research papers published at PAKDD 2023. From the names of the authors, I can guess that they are Chinese, Korean, or Japanese.
I know PAKDD uses double-blind review. But why don't other people submit their work? Or if they do submit, why is the acceptance rate so low?
I am also Asian, so I am not trying to be racist here. Just wondering why it is like that.
My job’s looking for a way for AI to help generate plans; I really think a simple multi-variable model should do the trick. I just need to find a reliable hosting service that can be built upon however needed. Are there well-established ML hosting services that are scalable, configurable, all that?
Dear Colleagues
Time Series Anomaly Detection (TSAD) is hot right now, with dozens of papers each year in NeurIPS, SIGKDD, ICML, PVLDB etc.
However, I claim that many of the published results are meaningless, because the uncertainty of the ground-truth labels dwarfs any claimed differences between algorithms or amounts of claimed improvement.
I have made two 90-second-long videos that make this clear in a visual and intuitive way:
1) Why Most Time Series Anomaly Detection Results are Meaningless (Dodgers)
https://www.youtube.com/watch?v=iRN5oVNvZwk&ab_channel=EamonnKeogh
2) Why Most Time Series Anomaly Detection Results are Meaningless (AnnGun)
https://www.youtube.com/watch?v=3gH-65RCBDs&ab_channel=EamonnKeogh
As always, corrections and comments welcome.
Eamonn
EDIT: To be clear, my point is simply to prevent others from wasting time working with datasets with essentially random labels. In addition, we should be cautious of any claims in the literature that are based on such data (and that includes at least dozens of highly cited papers)
For a review of most of the commonly used TSAD datasets, see this file:
I was wondering if anyone had any thoughts on how far out something like this might be, or how difficult it is. Ever since the advent of the current era of AI/LLMs, I have thought it would be great to somehow feed data from nostalgic games, in some form, into a system that can generate these worlds infinitely, while staying very true to the style, layout, and ethos of the worlds/levels from the reference game. I feel like it would just be so wonderful if there were a path to creating some type of 'never-ending' <insert nostalgic game here> instead of being limited to what the devs put out back in the day.
If anyone has any insight or thoughts on this, please let me know :). I work in the AI space, but I integrate the models and don't do any training or anything on the low-level ML side. Also, yes, I'm only thinking about the game worlds/levels atm.
Let's say we have a dataset that is much larger than our disk storage. For example:
What are the usual approaches to training on something like this? What I can think of intuitively is to do the following in parallel somehow:
- prefetch block n, train on block n-1, delete block n-2 from disk
Let's say we use PyTorch, so we have a PyTorch Dataset that holds all the paths to where the data is stored in the cloud. Do we need to write code for the prefetcher/deleter that downloads from the cloud and stores on disk, run it in a separate process, and then have a DataLoader for training that just assumes it can read from disk (because the prefetcher does its job correctly)? Having the DataLoader read from S3 would be bad for GPU utilization, right?
To take a step back, I'm assuming that this is an ordinary and often-occurring "problem" for every company that trains on large datasets, so I'm skeptical about writing all of this code myself; I feel like there should be standard out-of-the-box solutions for this, but I can't really find anything that matches perfectly.
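For what it's worth, here is a rough sketch of the prefetch/train/delete loop described above, using a background thread. `download_block` and `load_block_dataset` are placeholders for your own S3/Azure download and Dataset construction code, not an out-of-the-box solution.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from torch.utils.data import DataLoader

def download_block(block_id: int, local_dir: str) -> str:
    """Download block `block_id` from cloud storage into local_dir (placeholder)."""
    path = os.path.join(local_dir, f"block_{block_id}")
    # ... boto3 / Azure SDK download into `path` goes here ...
    return path

def load_block_dataset(path: str):
    """Build a torch Dataset from the files in `path` (placeholder)."""
    raise NotImplementedError

def train_on_blocks(num_blocks: int, local_dir: str = "/scratch/data"):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(download_block, 0, local_dir)   # prefetch block 0
    for n in range(num_blocks):
        block_path = future.result()                     # wait for block n
        if n + 1 < num_blocks:                           # start prefetching block n+1
            future = pool.submit(download_block, n + 1, local_dir)
        loader = DataLoader(load_block_dataset(block_path),
                            batch_size=32, num_workers=4)
        for batch in loader:
            pass                                          # training step here
        shutil.rmtree(block_path, ignore_errors=True)     # free disk for block n
```

Streaming libraries such as WebDataset also pull sharded datasets straight from object storage into a DataLoader, which may save you from writing this plumbing yourself.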
Hi all, I have a few GPUs left over from mining, and I’m interested in starting a small-scale GPU-as-a-service. My goal is to set up a simple, side income that could help pay off my credit cards, as I already have a primary job.
What steps are needed for getting started with a small-scale GPU-as-a-service business focused on machine learning or AI? Any insights would be greatly appreciated!
Thanks in advance for any advice you can share!
Context: I was making what was supposed to be an FP-oriented NN library/framework on top of JAX (which is itself FP-oriented) called z-zephyr on pip. However, I noticed something you can do with it that is kinda clunky, if not tedious, in other frameworks.
(please read context)
TLDR; Zephyr turns out to be a very good way (at least in my experience) to build structures that are weird. And I recently added update capabilities, so zephyr doesn't only do structures but updates too.
Disclaimer: You can do this with other frameworks; I have tried many of the things I describe below in other frameworks or libraries, and it's just painful for me, or I'm just inexperienced with them.
Here are the crazy things that are quick to do in zephyr but might not be as quick in other frameworks (if they can be done more easily in another framework, please tell me).
(These are not supposed to be useful, they're supposed to be extreme)
Here is the tree network in zephyr, and how you get the initial params and tags (a tag is the key in params[key]).
# essentially 4 lines of code
@flexible
def tree_net(params, x, n, i=0):
    if i == n - 1:
        return [x]
    return (
        tree_net(
            params["branch"]["L"] if i != n - 2 else params,
            validate(params["weight"]["L"], (1,), uniform) * x,
            n,
            i + 1) +
        tree_net(
            params["branch"]["R"] if i != n - 2 else params,
            validate(params["weight"]["R"], (1,), uniform) * x,
            n,
            i + 1)
    )

x = jnp.ones((1,))  # dummy
N = 4
params = trace(tree_net, key, x, N)
tags = get_lineage_tags(params)
Assume you have the loss function, gradients, and so on. To keep it simple, I'll just apply an update that sets the left-branch weights to 0 and keeps the right ones unchanged.
def make_left_zero(params, tags):  # I left out gradients
    if tags[-1] == "L":
        return params * 0
    return params

# update the params
params = apply_updates(make_left_zero, params, tags)
Please suggest some extreme ideas for me to try.
I think zephyr could be the tooling that makes those easy to do. I would like to hear your extreme ideas so I can try to code them in zephyr, and if I can't do it without struggling, and it's something I think is generic enough, I will evolve zephyr to handle it more easily.
PS: The readme doesn't include these yet, since it started as a (normal) NN library.
The link of the repo will be in the comments if you want to check it out.
Hi everyone. I built Grada to learn how things work under the hood. It’s an interactive browser tool that lets you observe real-time changes while training a multilayer perceptron, all built from scratch with a custom tensor-based engine.
You can easily construct neural networks with drag and drop and watch how training affects parameters and outputs visually in real time. Grada also includes a handwritten digit recognition feature, letting you interactively test your model by drawing digits and visualizing predictions. It might be a useful educational tool.
You can find the source code and a quick demo gif on GitHub at https://github.com/saliherdemk/Grada, and the live demo is available at https://saliherdemk.github.io/Grada/.
Hope this helps and looking forward to hearing some feedback.
Benchmarking Large Language Models with Integer Sequence Generation Tasks
Daniel O'Malley, Manish Bhattarai, Javier Santos - Los Alamos National Laboratory
This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS), a widely-used resource for mathematical sequences. The benchmark is designed to evaluate both the correctness of the generated code and its computational efficiency. Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences. In order to ensure models do not exploit memorized sequence values, we introduce an automated cheating detection mechanism that flags the use of lookup tables and validated this automation against human cheating evaluations. This benchmark provides a meaningful challenge for current LLMs, offering insights into their mathematical reasoning and code writing capabilities, which can guide future research directions and model development in mathematical reasoning and code synthesis.
arXiv:2411.04372 [cs.LG]: https://arxiv.org/abs/2411.04372
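This is not the paper's code, just a hedged sketch of the two checks the abstract describes: scoring a generated sequence function against known terms, and crudely flagging code that hardcodes a lookup table of values.

```python
import re

def matches_known_terms(generated_fn, known_terms):
    """Return True if generated_fn(n) reproduces the known terms (1-indexed)."""
    return all(generated_fn(i + 1) == t for i, t in enumerate(known_terms))

def looks_like_lookup_table(source_code: str, threshold: int = 20) -> bool:
    """Crude heuristic: flag code containing a long run of integer literals."""
    return len(re.findall(r"-?\d+", source_code)) >= threshold

# Example with the Fibonacci sequence, first terms 1, 1, 2, 3, 5, 8 (1-indexed).
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(matches_known_terms(fib, [1, 1, 2, 3, 5, 8]))                      # True
print(looks_like_lookup_table("return [1,1,2,3,5,8][n]", threshold=5))   # True
```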
Hi. I'm trying to use an ML/DL model to predict variability statistics, like min, max, avg, and var, from several features of the same form as the target.
For example,
I found several papers related to interval or range prediction in various areas like wind power, stock prices, or solar energy, but I think those papers differ from my purpose. Almost every paper first predicts a specific constant value from time series data and then uses a statistical method to estimate a prediction interval.
I'm trying to find a way to predict the variability of the target from the variability of the features. My best idea is to train one model per statistic: one model for the minimum, another for the average, and so on. But I think there is a better way to do this. Is there any ML/DL model or other technique/methodology for this purpose?
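One alternative to one-model-per-statistic is a single multi-output regressor (quantile regression is another option if you want intervals rather than point statistics). A minimal sketch with placeholder data, assuming scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# X: features; Y: columns = [min, max, avg, var] of the target (placeholder data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
Y = np.column_stack([X.min(axis=1), X.max(axis=1), X.mean(axis=1), X.var(axis=1)])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)  # natively multi-output
model.fit(X_tr, Y_tr)
print(model.predict(X_te[:3]))  # each row: predicted [min, max, avg, var]
```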
Hey everyone! I’m working on my final-year project for my Bachelor’s, where I’m trying to predict Autism Spectrum Disorder (ASD) using voice cues. I’ve worked on some basic ML projects and CNNs before, but this is my first time dealing with audio data, and I’ll be collecting samples from young kids with ASD, from toddlers up to age 12.
I could really use some help finding resources to get a solid grasp on signal processing and how to train classification models specifically on audio. Also, if anyone knows of any open datasets in this area (I haven’t had much luck there) or has any advice or resources, I’d be super grateful. Thanks a ton in advance!
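As a starting point, here is a minimal sketch of a classical audio-classification baseline: MFCC features via librosa fed into a simple classifier. The file paths and labels are placeholders for the recordings to be collected.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Load an audio clip and return a fixed-length MFCC feature vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over time frames

# Placeholder paths and binary labels (1 = ASD, 0 = non-ASD).
paths, labels = ["clip_001.wav", "clip_002.wav"], [1, 0]
X = np.stack([mfcc_features(p) for p in paths])

clf = SVC(kernel="rbf")
clf.fit(X, labels)
print(clf.predict(X))
```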
Hello, our company recently decided to expand our ML team from a very small two-person team into a more serious effort.
When we were small, we really didn't have a way to manage datasets or evaluations. They were just files checked into a GitHub repo.
But increasingly we find that, with multiple ML models (some LLM-based and some not) and many iterations of datasets (some experimental and some not), it's really hard to version them in a meaningful way and be able to compare and analyze them.
We are a large company, so cost is not really an issue, and all our infrastructure is hosted in Azure. If anything, they fear lock-in. What is the best platform/tooling for this kind of usage?
Almost all the papers I have read on DTI do something like this.
How to do things differently? Can we use something like docking scores as cross modal attention bias?
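One way to read "docking scores as cross-modal attention bias" is as an additive, ALiBi-style bias on the attention logits between drug tokens and protein tokens. A hedged sketch (not from any specific paper), assuming PyTorch:

```python
import math
import torch

def biased_cross_attention(q, k, v, docking_bias):
    """
    q: (B, Lq, d) drug-side queries; k, v: (B, Lk, d) protein-side keys/values.
    docking_bias: (B, Lq, Lk) precomputed bias, e.g. a scaled docking score
    broadcast over token pairs (an assumption, not an established recipe).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (B, Lq, Lk)
    scores = scores + docking_bias                   # additive bias on the logits
    attn = torch.softmax(scores, dim=-1)
    return attn @ v

B, Lq, Lk, d = 2, 10, 20, 64
q, k, v = torch.randn(B, Lq, d), torch.randn(B, Lk, d), torch.randn(B, Lk, d)
docking_score = torch.randn(B, 1, 1)                 # one score per drug-target complex
out = biased_cross_attention(q, k, v, docking_score.expand(B, Lq, Lk))
print(out.shape)  # torch.Size([2, 10, 64])
```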
I'm trying a new cute architecture on a bunch of the default datasets out there, using JAX since I'm doing live brain surgery; that part works well.
What I'm having a hell of a time with is actually loading the data. I went for tfds since it's 1) old, 2) used in production, and 3) has a million datasets already prepared. I haven't used TF since the 2.0 days, and everything seems broken? I'm getting warnings and errors whenever I try loading and running through any dataset. Even their documentation has the errors [0] in the tutorial notebooks.
I can't just ignore a whole bunch of errors and warnings when I'm trying to benchmark a new architecture. Is tfds just that bad or am I missing something obvious?
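For concreteness, a minimal tfds-to-numpy loop of the standard form that JAX-side training code typically consumes; the dataset name and batch size are just examples.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load("mnist", split="train", as_supervised=True, shuffle_files=True)
ds = ds.shuffle(10_000).batch(128).prefetch(tf.data.AUTOTUNE)

for images, labels in tfds.as_numpy(ds):  # yields numpy arrays usable from JAX
    pass  # train step here: images (128, 28, 28, 1) uint8, labels (128,) int64
```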
For general prototyping purposes, I don't want to have to train or deploy a model, I just want it behind a service already and to provide it with necessary inputs in the request.... what do you guys think?
EDIT: I suppose for more classical ML tasks, there's no real concept of "pre-trained" in the first place, so you can't just get inference for free... does that sound roughly true?
https://arxiv.org/abs/2411.02780
Shows that corrupted images can be almost as useful as clean images for training generative models, assuming that a small initial set of clean images is available.
This could be useful for dataset design/curation: some budget needs to be invested in obtaining a few high-quality samples and then for the rest of the dataset corrupted images should work fine.
Abstract:
The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire) but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than 80 models on data with different corruption levels across three datasets ranging from 30,000 to ≈1.3M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g. ~10% of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.
Paper: https://arxiv.org/abs/2411.02780
Code: https://github.com/giannisdaras/ambient-laws
Huggingface models: https://huggingface.co/giannisdaras?search_models=ambient_laws
Tokenizers are key to the successful development of image and video generative models and multimodal LLMs. Compared to generative models, they are underrated. This work presents several causal tokenizers supporting both images and videos, in both continuous (relevant to diffusion) and discrete (relevant to autoregressive/transformer models) spaces.
I was hired originally as an ML engineer/scientist a few years ago, and for the most part my day-to-day reflected that. But with the boom of LLMs, my team seems to focus solely on using a lot of this tech "out of the box", including agentic wrappers. My work has been dumbed down to prompt engineering to force a huge general-purpose model into our domain-specific use case. The results are acceptable for the most part, not going to lie, but there's still a small proportion of cases where a fine-tuned model would have won. Leadership does not seem interested in fine-tuning or coming up with something original. A lot of the wrappers especially are very raw and force you into specific patterns and models. But because they are considered "out of the box", that's what's pushed on us. I feel like we are trying to fit a cube into a round hole.
Hey everyone! Wanted to share the link to the database of 500 ML use cases from 100+ companies that detail ML and LLM system design. The list also includes over 80 use cases on LLMs and generative AI. You can filter by industry or ML use case.
If anyone here is designing an ML system, I hope you'll find it useful!
Link to the database: https://www.evidentlyai.com/ml-system-design
Disclaimer: I'm on the team behind Evidently, an open-source ML and LLM observability framework. We put together this database.