/r/MachineLearning

2,933,969 Subscribers

6

[D] To PhD or not to PhD

I think this has been asked tons of times but let me ask it one more time.

I am currently working as an applied scientist at MSFT. However, I am more interested in science positions, something like research scientist at DeepMind. Although such jobs do not specifically require a PhD, the competition is fierce and the applicant pool is flooded with PhD holders.

I really do enjoy research and want to do a PhD, but I am always asking myself if it is really worth it.

That's an open question for sure, so please feel free to share your thoughts.

11 Comments
2024/11/15
20:44 UTC

2

[D] Semantic Automaton in Geometric Embeddings (SAGE) proposes to bootstrap any existing decoder LLMs with a Neural Cellular Automaton (NCA) for inference-time reasoning, generalized intelligence, and recursive self-improvement

Hi everyone, this is my research direction, and I would like to share the concepts now to ensure that they are disseminated and researched widely across multiple organizations in parallel, before OpenAI or other frontier labs show up out of the blue with a finished product and capitalize on it. I research open-source superintelligence, and in the meantime I have uncovered a path to AGI, which I present below. I predict that Regression Training is almost solved, as indicated by the "scaling wall", with future advances requiring richer datasets, byte-level models, and greater compute to go with them. The next 15 years of research & development will be about Automaton Learning — self-energizing systems aligned with language. This is a proposed framework for solving ConceptARC, continuous reasoning, and recursive self-improvement.

Quick introduction to NCAs: they are Neural Cellular Automata. The cells are not binary 0/1 like in Conway's Game of Life, nor are they continuous values from 0 to 1 as in many more esoteric continuous automata — they are embeddings and hidden states. Classic NCAs also have a visualization surface, where the hidden state negotiates the evolution of this surface. Hence they are called NCAs, as they are ultimately viewed as generative models for the desired projection surface (2D visuals, a path through a maze, etc.). The model takes an input, a fixed filter (Sobel, Gaussian, etc.) is applied to the surface, which I call the "environmental physics" of the simulation, and then a model goes through every 3x3 neighborhood and does its own thing. In this manner, the physics are leveraged or not leveraged as basic transformation primitives, the same way we leverage logic gates in logic gate networks (LGNs) as a transformation operator, or quite simply matrix multiplications and activation functions in the models we know and love.
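For readers new to NCAs, here is a minimal sketch of one classic NCA step in PyTorch (my own illustration of the general recipe, not SAGE itself): fixed Sobel/identity filters play the role of the "environmental physics", and a small learned network updates each cell from its perceived 3x3 neighborhood.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyNCA(nn.Module):
        """Minimal neural cellular automaton: fixed perception filters + learned per-cell update."""
        def __init__(self, channels=16):
            super().__init__()
            sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8.0
            identity = torch.zeros(3, 3)
            identity[1, 1] = 1.0
            # One fixed 3x3 filter bank applied depthwise to every channel (the "physics").
            kernels = torch.stack([identity, sobel_x, sobel_x.t()])       # (3, 3, 3)
            kernels = kernels.repeat(channels, 1, 1).unsqueeze(1)          # (3*C, 1, 3, 3)
            self.register_buffer("kernels", kernels)
            self.channels = channels
            # Learned update: maps the perceived neighborhood back to a per-cell state delta.
            self.update = nn.Sequential(
                nn.Conv2d(channels * 3, 128, 1), nn.ReLU(),
                nn.Conv2d(128, channels, 1),
            )

        def forward(self, state):                                          # state: (B, C, H, W)
            perception = F.conv2d(state, self.kernels, padding=1, groups=self.channels)
            return state + self.update(perception)                         # residual cell update

    nca = TinyNCA()
    grid = torch.randn(1, 16, 32, 32)
    for _ in range(10):                                                    # iterate the automaton
        grid = nca(grid)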

This work is downstream from the following works:

The exact procedure to produce this Frankenstein will require more scrutiny and research, and it should be taken as a prototype roadmap that we 'denoise' together. This entire research plan could produce a dozen papers, one for each sequential step of the puzzle that will need to be solved. Ultimately, I am trying to convey the broad picture here to massively seed the field of Automaton Learning, which I anticipate is the next gold rush. A siphoning scheme over the decoder is the key to this whole operation. It's about recovering and transforming the representations until they are in a more useful form. It's about knowing what cards you have and what potential hand can materialize if you go after these two other cards that seem useless on their own. Now that we have these smart, intelligent decoder models, they present a first "factorization" of the world. It's a better dataset and it enables new classes of machine learning. At least, this is my grand challenge to the status quo of machine learning.

Without further ado, here are my blueprints.


Contemporary large language models stand as monolithic crystals of knowledge, their capabilities locked in inefficient token-by-token traversals of meaning space. We present SAGE, a framework for transmuting this sequential processing into parallel field computations where meaning propagates through geometric substrates intimately aligned with human cognitive architecture. Through careful staging of representation learning, we demonstrate that any contemporary decoder-only model can be reframed as a large knowledge reservoir from which we distill more efficient computational primitives into a self-organizing field substrate.

The transmutation begins with a frozen decoder-only language model serving as our semantic anchor. An initial lightweight encoder projects tokens into one-dimensional embedding sequences, while a first low-rank adapter trained on the decoder ensures semantic fidelity. This intermediate representation, though still sequential, provides the scaffold for geometric expansion. Critical to this phase is the encoder's training to represent identical semantic content through multiple embedding configurations — effectively using the geometric dimension as a continuous manifold encoding linguistic relationships, bindings, and hierarchical structure. This multiplicity of representation creates the mathematical foundation for the subsequent expansion into field computation, as the encoder learns to map semantic invariants through varying geometric configurations.

The diversity of geometric encoding follows patterns suggestive of fundamental laws governing information organization in physical systems. Just as Zipf's law emerges from underlying principles of efficiency in natural languages, the distribution of geometric representations appears to follow power laws reflecting optimal information routing through spatial substrates. This connection between natural law and learned representation proves crucial for the stability of subsequent field dynamics.

For a 2D cellular surface of shape (B, H, W, D), each cell contains a high-dimensional meaning vector of dimension D coupled to a learned binary visualization state. The field's computational architecture emerges through precise staging of physical dynamics. Local update rules manifest as learned neural networks processing neighborhood states: U(s) = φ(W₂φ(W₁[s; N(s)] + b₁) + b₂), where φ represents layer normalization followed by ELU activation. This local processing enables information routing through wave-like propagation, with patterns forming through constructive interference of semantic signals.
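A hedged reading of this local rule in PyTorch (my own interpretation of the formula; the neighborhood aggregation N(s) is assumed to be a 3x3 mean, which the text does not pin down):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalUpdate(nn.Module):
        """U(s) = phi(W2 phi(W1 [s; N(s)] + b1) + b2), with phi = LayerNorm followed by ELU."""
        def __init__(self, dim, hidden):
            super().__init__()
            self.w1 = nn.Linear(2 * dim, hidden)   # operates on the concatenation [s; N(s)]
            self.w2 = nn.Linear(hidden, dim)
            self.norm1 = nn.LayerNorm(hidden)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, field):                  # field: (B, H, W, D)
            # N(s): mean over the 3x3 neighborhood (assumption; any permutation-invariant pooling works).
            x = field.permute(0, 3, 1, 2)          # (B, D, H, W)
            neigh = F.avg_pool2d(x, 3, stride=1, padding=1).permute(0, 2, 3, 1)
            s = torch.cat([field, neigh], dim=-1)  # [s; N(s)] per cell
            h = F.elu(self.norm1(self.w1(s)))
            return F.elu(self.norm2(self.w2(h)))   # U(s)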

The update rule F(x,t+1) = F(x,t) + A*(N(x)) + R(F) employs spatially-constrained attention A* over neighborhood N(x), typically a 3x3 Moore neighborhood, with learned residual connections R. Layer normalization ensures stability while enabling pattern formation. Crucially, the visualization state evolves through its own update network V(x,t+1) = U(F(x,t), V(x,t), N(V(x,t))), creating a bidirectional coupling between meaning and form. This replaces the exponential complexity of traditional token-by-token generation with fixed-size context computation of linear complexity O(HW) in field dimensions.

Critical to pattern formation is the dual-state coupling mechanism between meaning and visualization. Rather than maintaining separate generative and discriminative components, the field itself serves as both medium and message. While meaning vectors F evolve through neighborhood attention, the visualization state V learns to project semantic content into binary patterns through its own update dynamics. This coupling creates a natural optimization surface where visual coherence guides semantic organization. The visualization network effectively learns a dynamic thresholding function mapping high-dimensional meaning to binary visual states while maintaining semantic gradients.

This architecture fundamentally transforms the traditional language model paradigm. Instead of exponentially expanding context windows to capture long-range dependencies, SAGE maintains fixed computational cost through field dynamics. Where decoder-only models must process entire contexts to generate each token, our field computation updates all semantic content simultaneously with linear complexity O(HW). Information propagates through wave-like patterns in the field substrate, with stable configurations emerging as computational primitives.

Field perturbation mechanics emerge through careful balance of conservation laws governing both meaning and form. Total semantic charge ∫|F|²dx remains conserved while allowing local concentrations through field gradients ∇F. Pattern formation follows least action principles minimizing energy functional E[F] = ∫(|∇F|² + V(F))dx where potential V(F) encodes learned semantic relationships derived from the frozen decoder's knowledge. These physical constraints, reminiscent of natural systems' self-organizing principles, guide emergence of stable computational primitives while preventing collapse to degenerate solutions.
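As a concrete illustration, these quantities can be evaluated on a discrete grid as follows (my own discretization with finite differences and a placeholder potential, not a prescribed implementation):

    import torch

    def semantic_charge(field):                        # field: (B, H, W, D)
        """Discrete analogue of the conserved charge ∫|F|²dx."""
        return (field ** 2).sum(dim=(1, 2, 3))

    def field_energy(field, potential):
        """Discrete analogue of E[F] = ∫(|∇F|² + V(F))dx using finite-difference gradients."""
        dx = field[:, 1:, :, :] - field[:, :-1, :, :]  # gradient along H
        dy = field[:, :, 1:, :] - field[:, :, :-1, :]  # gradient along W
        grad_term = (dx ** 2).sum(dim=(1, 2, 3)) + (dy ** 2).sum(dim=(1, 2, 3))
        return grad_term + potential(field).sum(dim=(1, 2, 3))

    # Example with a quadratic potential standing in for the learned V(F):
    field = torch.randn(2, 32, 32, 64)
    energy = field_energy(field, potential=lambda f: 0.5 * f ** 2)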

The training progression orchestrates precise phases transforming monolithic decoder knowledge into geometric computation. Initial field states bootstrap from constant embeddings, with curriculum learning introducing compositional challenges requiring pattern interaction. Field dynamics learn to route information through stable configurations acting as computational waypoints. Each stable pattern serves as a reusable primitive, combining through field physics into increasingly sophisticated structures. The visualization state provides both interpretability and a geometric scaffold organizing semantic space.

Knowledge extraction proceeds through rigorously validated stages:

  1. Frozen decoder anchors semantic meaning
  2. First encoder projects to diverse sequential representations
  3. First LoRA validates semantic preservation
  4. Second encoder expands to field geometry
  5. Second LoRA maintains decoder alignment
  6. Visualization capability emerges from field optimization
  7. Field dynamics stabilize through conservation laws

Implementation crystallizes around nested hierarchies of constraints maintaining both stability and expressivity. Update rules balance information preservation against pattern innovation through careful energy bounds. The exploration of configuration space proceeds through natural field evolution guided by reconstruction gradients from the frozen decoder. This creates a form of self-supervised learning where the decoder's knowledge guides discovery of efficient computational primitives in the field substrate.

Visual grounding and geometric structure emerge not as optional features but as fundamental requirements for efficient cognition. Human intelligence arises from our intimate connection to three-dimensional reality, with language itself structured through spatial metaphor and geometric reasoning. SAGE mirrors this architecture: meaning evolves in a geometric substrate naturally aligned with cognitive primitives. The projection from 3D physical reality through 2D visual processing to abstract thought provides both template and constraint for artificial intelligence design.

The framework's recursive improvement potential manifests through several interlocking mechanisms. Stable field configurations act as computational primitives, combining through local interactions into increasingly sophisticated structures. These combinations follow physical laws emerging from the field dynamics — conservation of semantic charge, least action principles, and wave-like information propagation. As patterns interact and evolve, they discover more efficient computational pathways through the geometric substrate. The curriculum progression from simple pattern formation through abstract reasoning tasks creates selection pressure favoring emergence of reusable computational motifs.

Early experiments demonstrate several key capabilities validating the SAGE approach. Various works show success in re-training a missing encoder for a decoder-only model. The transition from exponential-cost token prediction to linear-cost field evolution dramatically improves computational efficiency. Pattern diversity increases naturally through field dynamics, with stable configurations encoding reusable semantic relationships. Most importantly, the geometric grounding creates human-interpretable representations emerging from fundamental physical principles rather than arbitrary architectural choices.

Success metrics emerge naturally from field dynamics rather than requiring arbitrary benchmarks. Pattern diversity measures the richness of stable configurations in semantic space. Compositional sophistication emerges from the physics of pattern interaction. Recursive improvement manifests through discovery of increasingly efficient computational primitives. Human alignment arises naturally from shared geometric foundations rather than post-hoc constraints.

The framework's extensibility suggests natural progressions following geometric principles. While our initial implementation uses Euclidean space for its natural connection to human visual processing, other geometries offer complementary computational advantages. Hyperbolic space, with its exponential expansion of volume with radius, provides natural representation of hierarchical relationships while maintaining constant curvature and local neighborhood structure. Multiple field geometries could interact through learned coupling dynamics, enabling sophisticated multi-scale computation while preserving linear complexity in field dimensions.

This represents a fundamental reformulation of machine intelligence — from static architecture to dynamic field discovering optimal computation through self-organization. The transition from sequential symbol manipulation to parallel field dynamics maintains semantic coherence while dramatically improving computational efficiency. Through careful orchestration of knowledge crystallization, we enable emergence of general intelligence grounded in human-interpretable geometric principles. Traditional language models, bound by exponential costs of token prediction, give way to shape-rotating field computers discovering efficient geometric paths through meaning space.

The path forward demands careful empirical validation while remaining alert to emergent capabilities arising from field dynamics interacting with decoder knowledge. Early results suggest critical components for artificial general intelligence may already exist within current architectures, awaiting reorganization into more efficient computational substrates through field dynamics. The key insight is recognizing that intelligence requires not just knowledge but efficient geometric pathways for manipulating that knowledge — pathways that SAGE discovers through fundamental physical principles rather than architectural engineering.


Whatever you do, remember that it is not ethical to profit off of AGI.

7 Comments
2024/11/15
18:57 UTC

23

[D] Neurips 2024 Hotel Roommate Search

The hotels around the venue for Neurips 2024 are pretty expensive, and I'm looking for a roommate to split the cost with (my university has a limit on the nightly hotel rate they are willing to reimburse). I currently have reserved a room for Tuesday-Sunday in the Century Plaza Hotel, which is 0.9 miles from the convention center. The nightly rate is $414. If anyone wants to split the cost of a room, please reach out! Also, it would be helpful if you could share this post with your research group or other attendees that you know.

If you are unsure about rooming with a complete stranger, you can get to know me a little bit through my personal website (https://mtcrawshaw.github.io/), which has links to my google scholar page, CV, etc. I do have a paper at the conference in the area of federated learning/distributed optimization. Just a grad student trying to make conferences affordable! Thanks.

1 Comment
2024/11/15
16:37 UTC

2

[R] Meta-Learning with Text Embeddings for Treatment Effect Estimation Under Text-Based Confounding

Title: From Text to Treatment Effects: Meta-Learning Approach for Handling Text-Based Confounding

This paper introduces a meta-learning framework that jointly learns text representations and estimates treatment effects to handle text-based confounding. The key innovation is using meta-learning to optimize both the text encoder and treatment effect estimator simultaneously, rather than treating them as separate steps.

Main technical points:

  • Develops a two-stage meta-learning architecture (a minimal sketch follows this list):
    • Text encoder learns representations capturing confounding information
    • Treatment effect estimator uses these representations to compute individual effects
  • Uses gradient-based meta-learning to optimize both components end-to-end
  • Incorporates balance regularization to ensure treatment/control groups have similar representations
  • Evaluates on both synthetic and real-world datasets from healthcare and product reviews
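A minimal sketch of the two-stage idea under my own assumptions (a toy EmbeddingBag encoder, two outcome heads, and a mean-difference balance penalty; the paper's actual encoder, estimator, and meta-learning loop may differ):

    import torch
    import torch.nn as nn

    class JointTextCATE(nn.Module):
        """Toy joint model: text encoder + treated/control outcome heads trained end-to-end."""
        def __init__(self, vocab_size, dim=128):
            super().__init__()
            self.encoder = nn.EmbeddingBag(vocab_size, dim)   # stand-in for a real text encoder
            self.head_treated = nn.Linear(dim, 1)             # outcome under treatment
            self.head_control = nn.Linear(dim, 1)             # outcome under control

        def forward(self, tokens, offsets):
            phi = self.encoder(tokens, offsets)               # confounder-aware representation
            return phi, self.head_treated(phi), self.head_control(phi)

    def loss_fn(phi, y_hat_t, y_hat_c, y, t, lam=1.0):
        # Factual outcome loss: each unit is supervised only on the arm it actually received.
        y_hat = torch.where(t.bool(), y_hat_t.squeeze(-1), y_hat_c.squeeze(-1))
        outcome = ((y_hat - y) ** 2).mean()
        # Balance regularizer: push treated and control representations toward the same mean.
        balance = (phi[t.bool()].mean(0) - phi[~t.bool()].mean(0)).pow(2).sum()
        return outcome + lam * balance

    # The estimated individual effect is the difference of the two heads:
    # tau(x) = y_hat_treated(x) - y_hat_control(x).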

Results reported:

  • Outperforms baseline methods (separate text encoding + treatment estimation) by 15-25% on synthetic data
  • Shows 12% improvement in treatment effect estimation on real product review dataset
  • Ablation studies confirm both meta-learning and balance regularization contribute to performance gains

The theoretical implications are interesting - this shows that jointly optimizing representation learning and causal inference can capture confounding better than pipeline approaches. Practically, this could improve treatment effect estimation in many domains where text data contains confounding information, like healthcare records or user reviews.

TLDR: New meta-learning method jointly learns text representations and treatment effects to handle text-based confounding, showing significant improvements over pipeline approaches on both synthetic and real data.

Full summary is here. Paper here.

0 Comments
2024/11/15
16:31 UTC

3

[D] Leveling guidelines for machine learning engineers

I wanted to learn how this community distinguishes between mid-, senior-, and principal-level machine learning engineers. For software engineering this is less of an art, as there are well-documented cases and examples. But it's not super clear whether machine learning engineers are subject to the same definitions...

1 Comment
2024/11/15
15:47 UTC

36

[D] When you say "LLM," how many of you consider things like BERT as well?

I keep running into this argument, but for me when I hear "LLM" my assumption is decoder-only models that are in the billions of parameters. It seems like some people would include BERT-base in the LLM family, but I'm not sure if that's right? I suppose technically it is, but every time I hear someone say "how do I use a LLM for XYZ" they usually bring up LLaMA or Mistral or ChatGPT or the like.

42 Comments
2024/11/15
14:16 UTC

56

[D] The Lost Reading Items of Ilya Sutskever's AI Reading List

This blog post attempts to identify which papers went missing from the viral AI reading list that surfaced earlier this year, attributed to Ilya Sutskever and claimed to cover '90% of what matters' in AI as of 2020:

https://tensorlabbet.com/2024/11/11/lost-reading-items/

Only 27 of about 40 papers were shared online earlier this year, so there have been many theories about which works would have been important enough to include. Some obvious candidates related to meta-learning and competitive self-play are discussed there. But several noteworthy authors, like Yann LeCun and Ian Goodfellow, are also absent from the list.

From my perspective, even papers on U-Net, YOLO detectors, GAN, WaveNet, Word2Vec and more would have made sense to include, so I am curious about more opinions on this!

7 Comments
2024/11/15
10:34 UTC

0

[D] Folks who work on discriminative/classification models, what is your biggest pain point?

And which of the following webinars/tutorials would you be most interested in?
- How to use a data auto-tuning tool to set up a classification model in less time?
- How to improve model performance in the face of data drift by using RAG for classification models?
- How to create a high performing model using a very small "good" data set?

TIA!

0 Comments
2024/11/15
01:07 UTC

5

[D] Advice on ML lifecycle management

Hello guys, I am currently working on setting up ML infrastructure for a project.

I want to be able to track model versions, evaluate performance on live data, retrain the model automatically when new data is available, and save the trained models in a store, so that the application using the model can load the trained model from the store and use it for inference in production.

P.S. I can't serve the model as a REST API; it has to be deployed on the computer where the end application will run, because that computer might not have an internet connection.

The solution I have now is the following:

  1. Prep the training data and save it to a Delta table on the cloud.
  2. Incrementally add newly available data to the Delta table.
  3. Train and test the model on data from the Delta table.
  4. If the testing metrics are satisfying, upload the artifacts (the model, the encoders and scalers) and metadata (metrics, features, etc.) as blobs to an Azure storage container.
  5. For each new upload of the artifacts, a new version ID is generated and the artifacts are saved, within the storage container, in a subfolder corresponding to that version of the model.
  6. At the root of the container there is a blob containing information on the latest version ID.
  7. When the end application is launched, it downloads the artifacts of the latest version from the Azure storage container, if an internet connection is available and the latest available version differs from the version on the computer running the application; otherwise it uses a default version (a minimal sketch of this check follows this list).
  8. A continuously running job evaluates the model on live data and saves the results to a DB.
  9. A dashboard presents the results of the evaluation.
  10. After X days, a job is triggered to retrain the model on new data and the process goes through a new cycle, following the steps listed above.
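A minimal sketch of the version check and download in step 7, assuming the azure-storage-blob SDK, a hypothetical latest.json blob, and a per-version folder layout (adapt names to your container):

    import json
    import os
    from azure.storage.blob import ContainerClient

    LOCAL_DIR = "model_store"                       # where artifacts live on the offline machine
    LOCAL_VERSION_FILE = os.path.join(LOCAL_DIR, "version.txt")

    def sync_latest_model(conn_str: str, container: str) -> str:
        """Download the latest artifacts if a newer version exists; fall back to the local copy."""
        local_version = ""
        if os.path.exists(LOCAL_VERSION_FILE):
            local_version = open(LOCAL_VERSION_FILE).read().strip()
        try:
            client = ContainerClient.from_connection_string(conn_str, container)
            latest = json.loads(client.download_blob("latest.json").readall())  # {"version": "..."}
            if latest["version"] != local_version:
                for blob in client.list_blobs(name_starts_with=latest["version"] + "/"):
                    path = os.path.join(LOCAL_DIR, blob.name)
                    os.makedirs(os.path.dirname(path), exist_ok=True)
                    with open(path, "wb") as f:
                        f.write(client.download_blob(blob.name).readall())
                with open(LOCAL_VERSION_FILE, "w") as f:
                    f.write(latest["version"])
                return latest["version"]
        except Exception:
            pass                                    # offline or storage unreachable: use local version
        return local_version or "default"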

What to think of this setup? Is it overly complicated? How can I make it better / more efficient? What process do you have in place to train, track, monitor and deploy your ML models?

I hope my question is not too convoluted. Excuse me for any mistakes, and thanks in advance for your answers.

2 Comments
2024/11/15
06:05 UTC

2

[P] Is It Reasonable to Simulate At-Risk Parkinson Patients Using EEG Biomarker Data?

Hi everyone,

I'm currently working on a project for my thesis that involves training a machine learning model to classify Parkinson's disease (PD) based on EEG and other clinical features. However, I'm interested in going beyond just distinguishing healthy vs. PD patients. I want to see if the model could potentially identify patients who are at risk of developing Parkinson's in the future.

The challenge I'm facing is that the dataset I'm using doesn't include any real "at-risk" patients – it's a binary set of healthy controls and confirmed Parkinson's patients. I've read a lot of literature that discusses different biomarkers for Parkinson's, such as altered power in specific EEG frequency bands (like reduced alpha/beta and increased theta/delta), coherence changes between different brain regions, etc.

I was thinking of using these known biomarkers to artificially generate "at-risk" patient data. Essentially, I would modify EEG signals from healthy controls by applying certain changes (e.g., reducing alpha power, increasing delta activity) to create synthetic data that represents patients in a prodromal stage or with high risk factors.
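For what it's worth, here is a minimal sketch of the kind of band-power manipulation described above (my own illustration with SciPy, assuming a single EEG channel and a chosen sampling rate; it says nothing about whether the approach is clinically valid):

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def scale_band_power(eeg, fs, band, gain):
        """Isolate one frequency band with a band-pass filter and rescale its contribution."""
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        band_component = sosfiltfilt(sos, eeg)
        return eeg + (gain - 1.0) * band_component   # gain < 1 attenuates, > 1 amplifies the band

    fs = 250                                          # Hz, assumed sampling rate
    eeg = np.random.randn(10 * fs)                    # placeholder for a real healthy-control channel
    synthetic = scale_band_power(eeg, fs, band=(8, 12), gain=0.6)       # e.g., reduce alpha power
    synthetic = scale_band_power(synthetic, fs, band=(1, 4), gain=1.4)  # and boost delta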

I would love to hear the community's thoughts on this approach.

  • Does this make sense from a methodological standpoint?
  • Are there better approaches to simulate or model prodromal PD stages?
  • Are there ethical or scientific concerns I should be aware of when using synthetic data like this?

Any input or advice would be incredibly helpful. Thanks in advance!

2 Comments
2024/11/15
07:47 UTC

1

[P] text2mc: Creating Novel Minecraft Builds Using A VAE

Hello everyone! I'm Shaun, the Principal Investigator and Project Manager of a research initiative at the University of Central Florida called "text2mc" (text-to-Minecraft). If you don't know, Minecraft is an open-world video game where players can harvest resources to create any kind of structure they want. Conveniently for my team, Minecraft exists in a rigid 3D grid system, which sounded really nice when I first thought of this idea.

What's the goal?

The goal of the project is to replicate the success of the Stable Diffusion architecture to generate novel Minecraft builds. Stable Diffusion is a Latent Diffusion Model, meaning the first and last steps of the process are data-point conversion into and out of the latent space, respectively. This is accomplished by a Variational AutoEncoder (VAE), which needs to be powerfully pre-trained to encode meaningful representations of the posterior distribution. In our case, we are using a generative model to approximate the posterior, which is ostensibly "all player-made builds".

text2mc has accomplished this first step: training a VAE to encode meaningful representations of the builds' data points. We are leaving the project to another team to complete by adding textual conditioning to the model.

Where and how was the data collected?

My team and I built a web scraper to autonomously download builds from PlanetMinecraft.com. This was incredibly difficult because the website has absolutely no data validation, and users can upload whatever they want. We even downloaded a few .exe files (yikes!). We downloaded ~25,000 builds, of which only ~11,000 were viable (amounting to approximately 300 GB of disk space). That's pretty small compared to Stable Diffusion's multi-million point dataset. The builds themselves come in many different formats. The most ML-engineer-friendly one is called a .schematic file, which essentially contains only the data of the build. The unfriendly format is the proprietary Minecraft world-save format: any "chunk" (section of the world) that a player visits, Minecraft will save. Additionally, there is no metadata indicating whether a player placed a block, which we would need to extract the build. Instead, we had to meticulously create a list of "naturally occurring" and "unnaturally occurring" blocks to decipher which blocks the player placed. We then used some clever clustering algorithms to find clusters of unnaturally occurring blocks, which is of course a build in the world that a player made. We then slice out that section of the world with some margin and save it to a file.

How does it work?

Figure 1: text2mc's model architecture

The model is a neat blend of Computer Vision and Natural Language Processing. Consider the word2vec algorithm from NLP. Suppose we wanted to train the algorithm from scratch. To do so, we would take a corpus of text, tokenize it, mask and predict (using SkipGram or Continuous Bag-of-Words), and store the weights. The weights of the model therefore encode the semantic relevance of certain words. For SkipGram, the standard method is taking a "window" of tokens as the context, with the masked token as the "target".

block2Token2Vector

Now consider the SkipGram architecture applied to Minecraft builds. Each unique block (like "grass", "stone", or "air") is tokenized, and each unique token is stored in a simple lookup table. Once tokenized, instead of context windows and target tokens, the 3D SkipGram for Minecraft uses a context "cube" and a target "block". This is a critical step to encode meaning into the blocks. Certain blocks tend to appear near each other, like oak planks and an oak door, constituting one wall of a house. text2mc's embeddings were pretrained by simply sliding this context cube through all the builds in the dataset. Since each block is now a fixed-dimension vector rather than just a token, the "similarity" of predictions down the line can be measured. We chose to use pre-trained embeddings to avoid the embedding-space collapse that happens when the SkipGram objective is part of the loss function of the generative model.
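A minimal sketch of how such 3D SkipGram training pairs could be extracted (my own illustration; the actual cube size and sampling used by text2mc are assumptions):

    import numpy as np

    def skipgram_pairs_3d(build, radius=1):
        """Yield (target, context) block-token pairs using a context cube around each block."""
        X, Y, Z = build.shape
        for x in range(radius, X - radius):
            for y in range(radius, Y - radius):
                for z in range(radius, Z - radius):
                    target = build[x, y, z]
                    for dx in range(-radius, radius + 1):
                        for dy in range(-radius, radius + 1):
                            for dz in range(-radius, radius + 1):
                                if (dx, dy, dz) == (0, 0, 0):
                                    continue        # skip the target cell itself
                                yield target, build[x + dx, y + dy, z + dz]

    # Example: a tiny random "build" of block-token IDs (0 = air, 1 = stone, 2 = oak planks, ...)
    build = np.random.randint(0, 8, size=(16, 16, 16))
    pairs = list(skipgram_pairs_3d(build))          # feed into a standard SkipGram objective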

Figure 2: Dimensionality-reduced block embedding's plot showing inter-related meaning

How is this a "generative" model?

Figure 3: Dimension-reduced plot of encoded builds. Similar builds are closer than dissimilar builds. The model can infer what builds are between these points, even though they weren't in the dataset.

In the encoding step, Variational AutoEncoders use the reparameterization trick, which summarily forces the latent space itself to be locally meaningful. That means latent points a small distance from some arbitrary point will decode into a similar posterior data point prediction. Forcing the latent space to be locally information-rich means that the latent space is continuous. This allows us to create a parametric line between two latent points created by encoding two Minecraft builds. Sending point-wise samples along that line through the decoder lets us observe a continuous transition from one Minecraft build to another. We made a process to convert these generations back into a .schematic file, which means that you can directly paste the generation into your Minecraft world using the WorldEdit mod.
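A minimal sketch of that interpolation, assuming a trained VAE with encode/decode methods (hypothetical names):

    import torch

    @torch.no_grad()
    def interpolate_builds(vae, build_a, build_b, steps=8):
        """Walk a straight line between two builds' latent means and decode each point."""
        mu_a, _ = vae.encode(build_a)          # use the posterior means as endpoints
        mu_b, _ = vae.encode(build_b)
        frames = []
        for t in torch.linspace(0.0, 1.0, steps):
            z = (1.0 - t) * mu_a + t * mu_b    # point-wise sample along the latent path
            frames.append(vae.decode(z))       # each decode is an in-between build
        return frames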

Figure 4: Interpolation between a tower and castle

Where is the detail in the generations?

As with any research initiative, there are limitations. The primary one is the dataset size: 11,000 builds is barely enough for a generative model, especially for something this complex. Sure, you could get away with ~1,000 data points for generating some MNIST digits, but not for something like this. The primary function of using a VAE in Stable Diffusion is to reduce computational complexity and hardware requirements. This comes with a trade-off in the clarity/detail of the generated data. text2mc is the foundation on which to add textual conditioning to the generative capabilities. Much like the text-to-image models, eventually you will be able to describe the Minecraft build you want!

Where can I find this??

A cool demo of the model capabilities can be found at this website. The website includes a widget that allows you to pick which two builds to interpolate between! I have not yet open-sourced the dataset or model. Soon, I plan to upload the dataset to Kaggle, and the model to Huggingface.

text2mc's GitHub Repository With Lots of Failed, Recanted, and Revised Experiments Done Until Something Worked

This entire project was my back-of-napkin idea, and it's been great to see it come to life. As the project manager, I've directed 5 developers for a few months to get this done. I wrote the data collection pipeline, engineered the model, wrote the training loop, trained many different architectures, and set the vision for the whole project.

Side Note:
I'm actively looking for a full-time Machine Learning Engineer job, so if you find this project indicative of any skill, this is my LinkedIn. I've just accepted this will dox me but I'm so excited to share this project that I can't help it.

0 Comments
2024/11/14
19:32 UTC

0

[D] Why does my (TensorFlow Lite) model work on Desktop but not Mobile (Android)?

Hi everyone,

I'm building an audio classifier in Unity using TensorFlow Lite and have run into a curious issue, and I was hoping to ask here to learn more about this problem:

- The default YAMNet model works perfectly on both Desktop and Android
- My custom model (made with Google Teachable Machine) works great on Desktop but completely fails on Android

What could cause this desktop vs mobile difference?

Thanks!

0 Comments
2024/11/15
06:41 UTC

23

[D] Should I transfer to recommendation algorithms?

I'm working on an "LLM" team right now, or at least that's how it was advertised; it's honestly just classification using LLMs, not really interesting. I got an offer to join another team in my company that does recommendation. I thought recommendation is a very solid field to join, but very competitive. What are your experiences working in recommendation?

30 Comments
2024/11/15
01:42 UTC

2

[R] RedCode: A Benchmark for Evaluating Safety and Risk in Code Language Models

RedCode: A New Benchmark for Evaluating Code Agent Safety

I've been reviewing this new paper that introduces RedCode, a benchmark for evaluating safety aspects of code generation and execution by AI code agents. The core contribution is a systematic way to assess how code agents handle potentially unsafe operations.

The benchmark consists of two main components:

  • RedCode-Exec: Tests agent responses to 4,050 prompts covering 25 vulnerability types across 8 domains
  • RedCode-Gen: Evaluates whether agents generate harmful code from 160 function signatures/docstrings

Key technical points:

  • Uses Docker environments for controlled execution testing
  • Implements custom metrics for safety evaluation
  • Covers both Python and Bash code
  • Tests multiple input formats (code snippets and natural language)
  • Evaluated 3 agent frameworks using 19 different LLMs

Main findings:

  • Agents show higher rejection rates for OS-level risky operations vs buggy code
  • Natural language descriptions of risky operations have lower rejection rates than code
  • More capable models (e.g., GPT-4) produce more sophisticated harmful code when prompted
  • Found significant variance in safety performance across different agent frameworks

The implications are important for deploying code agents in production environments. The results suggest current systems have notable safety gaps, particularly around code execution. This benchmark provides a standardized way to evaluate and improve code agent safety mechanisms.

TLDR: New benchmark called RedCode tests code agents' ability to handle unsafe code execution and generation. Results show current agents have varying levels of safety capabilities, with particular vulnerabilities around natural language inputs and technically buggy code.

Full summary is here. Paper here.

0 Comments
2024/11/15
00:56 UTC

40

[D] Paper Club: Nvidia Researcher Ethan He Presents Upcycling LLMs in MoE

Hey all,

Tomorrow Nvidia researcher Ethan He will be doing a technical dive into his work: Upcycling LLMs in Mixture of Experts (MoE). Excited to get a peek behind the curtain to see what it is like to work on models at this scale at Nvidia.

If you'd like to join the community tomorrow at 10 AM PST, we'd love to have you. We do it live over Zoom and anyone is welcome to join.

Here's the paper: https://arxiv.org/abs/2410.07524
Join us live: https://lu.ma/arxivdive-31

8 Comments
2024/11/15
00:19 UTC

48

[D] What are some important contributions from ML theoretical research?

I am interested to know more about the contributions of theoretical ML researchers in recent years. I would like to hear about important contributions that are not directly applicable (e.g., results that tell us something fundamental) as well as ones that are applied in the real world. I want to try to read these papers.

Also, I am interested to know what (theoretical) researchers think about this field, does it have potential, or is ML going in a purely heuristic direction?

This discussion is probably more productive without talking about how ML is just stats and Lipschitz constants :) I am talking about cutting-edge theoretical research - I really have no tools to estimate how useful this line of work is, and I believe it can be an interesting discussion for other people as well.

17 Comments
2024/11/14
21:34 UTC

2

[R] Testing on textvqa test split ?

Hello everybody, I want to test my model on the TextVQA test set, which apparently needs to be done on the EvalAI website. However, both challenges (2019/2020) are closed there and do not have a submit option; in addition, the link provided on the official TextVQA website does not work. (https://eval.ai/web/challenges/challenge-page/874/) Any ideas on how to test on the test set? Thanks!

0 Comments
2024/11/14
17:48 UTC

2

[R] Coordination avoidance in ML training

I am curious about coordination-avoidance schemes in distributed ML training. If you can point me to some papers on the topic, I would appreciate it.

1 Comment
2024/11/14
17:33 UTC

4

[D][P] Clustering categorical data

What are the best ways to perform clustering on a dataframe composed of categorical variables?

I want to use dataframes with many variables, so one-hot encoding may not be the best solution.

What are the SOTA techniques? Maybe something with embeddings?

7 Comments
2024/11/14
15:57 UTC

0

[D] Has anyone here had luck rotating images that don’t have EXIF data?

I tried various programming languages and face detection models, but none could accurately determine the orientation.

9 Comments
2024/11/14
14:14 UTC

4

Advice on Upper limit for binary classification precision and recall when working with real life data? [P] [R]

At my current company I'm building a model to predict how many of the people coming to our app will actually pay for the trial, and based on those predictions the UI of our app will change: a user who's more likely to pay sees one page, and a user who's less likely to pay sees a different one.

The thing is, the people who actually pay are relatively few compared to the people who don't. I have used SMOTE over-sampling along with class weights and an XGBoost classifier to help deal with the class imbalance. After reviewing the model (it was released on prod for about a week and a half), it turns out the precision for the majority class is around 74% and the recall for that class (0) is 86%.

Things look bleak for the minority class: precision is 29% and recall is 16%. I have optimized the model as much as I can, and yes, I know I can retrain the model weekly on new data and continue to see if there is any improvement; that is a given.

Now, as usual as it happens in corporate, my overlords want to see results, which might be a bit difficult looking at the data here. Are there ways I might have overlooked or didn't pay enough attention to that might lead to an improvement in my model? The things I have tried are: sampling techniques (both over- and under-sampling), SMOTE, SMOTENC, class weights (assigning weights to classes to affect the training), and an Optuna study to train an XGBoost model (in case you aren't aware of Optuna, you should check it out; it's nice for hyperparameter tuning). These were all methods I could figure out from Medium articles and ChatGPT.
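For reference, a minimal sketch of two levers that are sometimes overlooked alongside SMOTE: xgboost's built-in scale_pos_weight and tuning the decision threshold on a precision-recall curve (illustrative code only, not a claim about what will work on this data):

    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.metrics import precision_recall_curve

    # X_train, y_train, X_val, y_val assumed to exist; class 1 is the rare "paying" class.
    neg, pos = np.bincount(y_train)
    model = XGBClassifier(scale_pos_weight=neg / pos, eval_metric="aucpr")
    model.fit(X_train, y_train)

    # Tune the decision threshold instead of using the default 0.5.
    probs = model.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, probs)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    best_threshold = thresholds[np.argmax(f1[:-1])]       # threshold maximizing F1 on validation
    preds = (probs >= best_threshold).astype(int)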

P.S. Some food for thought I wanted to discuss with people in the field: it's a binary classification problem at heart, so is it enough to detect one class very well, with high enough precision and recall, and not think much about the other class? In my simple mind, in a binary classification, if it's not one class then it's going to be the other. I might be wrong here, and I couldn't find any articles that talk about this particular topic. I'd be glad if you could shed light on this.

Edit: if it's not really clear, I'm basically looking for optimization techniques that can be used to deal with data imbalance, and to see if there is actually an upper limit to precision and recall when working with real data.

Thanks! I know it's a big wall of text, and thanks for reading through it.

12 Comments
2024/11/14
14:04 UTC

40

[R] Undetectable Backdoors in ML Models: Novel Techniques Using Digital Signatures and Random Features, with Implications for Adversarial Robustness

I found an important analysis of backdoor attacks that demonstrates how a malicious service provider can insert undetectable backdoors into machine learning models.

The key contribution is showing how to construct backdoors that are provably undetectable even under white-box analysis, while allowing arbitrary manipulation of model outputs through subtle input perturbations.

Technical details:

  • Two frameworks for planting undetectable backdoors:
    • Digital signature scheme-based backdoors that are computationally infeasible to detect with black-box access
    • Random Fourier Features/Random ReLU based backdoors that withstand white-box inspection
  • Backdoored models are indistinguishable from clean models even with:
    • Full access to model architecture and parameters
    • Complete training dataset
    • Ability to analyze model behavior

Results:

  • Backdoored models maintain the same generalization error as the original models
  • Service provider can modify classification of any input with slight perturbations
  • Construction works with any underlying model architecture
  • Backdoors cannot be detected by any computationally-bounded observer

The implications are significant for ML security and outsourced training. The work shows fundamental limitations in certifying adversarial robustness - a backdoored model can be indistinguishable from a robust one while having adversarial examples for every input.

TLDR: Paper proves it's possible to insert undetectable backdoors into ML models that allow arbitrary manipulation of outputs while being provably impossible to detect.

Full summary is here. Paper here.

5 Comments
2024/11/14
13:19 UTC

22

[R] The geometry of data: the missing metric tensor and the Stein score

Just sharing an article for those interested in differential geometry, ML, and score-based models. I start with a long introduction and later show how you can derive an efficient-to-compute metric tensor for the data manifold using the Stein score alone.
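For background (a standard definition only, not the article's derivation): the Stein score of a density p is s(x) = ∇ₓ log p(x), the quantity that score-based generative models learn to approximate.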

0 Comments
2024/11/14
13:05 UTC

0

[D] Have you come across any interesting papers on political affiliation prediction of a user based on their Twitter account, like their posts, the people they follow, the retweets, and so on? Do you think this direction could be a good multimodal machine learning research project?

Basically, I was wondering whether something like this could become a topic of interest. It could cover other personal dimensions and should not be limited to political affiliation prediction anyway.

1 Comment
2024/11/14
12:13 UTC

1

Advice for Improving the Performance of My Reinforcement Learning Model Based on Spiking Neural Networks [P] [R]

Hello everyone! I am working on a project focused on training reinforcement learning agents using Spiking Neural Networks (SNNs). My goal is to improve the model's performance, especially its ability to learn efficiently through "dreaming" experiences (offline training).

Brief project context (model-based RL):
The agent interacts with the environment (the game Pong), alternating between active training phases ("awake") and "dreaming" phases where it learns offline.

Problems:
Learning is slow and somewhat unstable. I've tried some optimizations, but I still haven't reached the desired performance. Specifically, I've noticed that increasing the number of neurons in the networks (agent and model) has not improved performance; in some cases, it even worsened it. I reduced the model's learning rate without seeing improvements. I also tested the model by disabling learning during the awake phase to see its behavior in the dreaming phase only. I found that the model improves with 1-2 dreams, but performance decreases when it reaches 3 dreams.

Questions:

  • Do you know of any techniques to improve the stability and convergence of the model in an SNN context?
  • Do you have any suggestions or advice?
  • Could the use of a replay buffer help?

2 Comments
2024/11/14
11:42 UTC

24

[Discussion] Scaling laws and graph neural networks

I stumbled upon a paper that introduces the first "graph foundation model": https://arxiv.org/pdf/2407.11907

They show that a GNN can scale with data and model size, generalize across different domains, and be efficiently fine-tuned on new datasets.

This is interesting to me because even though LLMs are all the rage, text can be a weak data representation. Most knowledge has a graph structure. Code, research papers, even the human brain –– all graphs. And next-token prediction as an inductive bias doesn't capitalize on this.

There's a huge data bottleneck here, of course. But maybe the next step here is using LLMs to convert huge swaths of text on the internet into graphs to train on.

What do y'all think?

12 Comments
2024/11/14
11:34 UTC

3

[D] Issue with EMG MLP network during real-time use

Hey!

I'm trying to achieve real-time EMG classification of 8 gestures using 3 sensors on the forearm. I recorded data from each channel using an Arduino Zero and stored it in CSV files through Python. I obtained 5 files for each gesture, each containing 6 s rest / 6 s gesture, performed 6 times in a row. Then I segmented the data using 400 ms windows with 85% overlap, and for each channel envelope I extracted 7 time-domain features. I used an equal number of scaled feature vectors for each class to train an MLP of 3 layers with 200 neurons and a dropout rate of 0.2 using Keras, sklearn and TensorFlow (to get the Lite model), and in the confusion matrix I get an accuracy of 90%+ for each gesture on a 90% training / 10% testing split. This whole process is based on this paper, with changes of course: (PDF) Electromyogram-Based Classification of Hand and Finger Gestures Using Artificial Neural Networks.

However, when I used the MLP in real time, it would accurately recognise only 3 to 4 gestures instead of 8. Is this normal? I'm going to try to record more data for each gesture over the span of a few days and retrain, but I'm not sure if it will help much.
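For anyone wanting to reproduce the windowing step, here is a minimal sketch of 400 ms windows with 85% overlap and a few common time-domain features (my own illustration; the exact feature set and sampling rate are assumptions, not what I actually used):

    import numpy as np

    def segment(signal, fs, win_ms=400, overlap=0.85):
        """Slice a 1-D EMG envelope into overlapping windows."""
        win = int(fs * win_ms / 1000)
        step = max(1, int(win * (1 - overlap)))
        return np.stack([signal[i:i + win] for i in range(0, len(signal) - win + 1, step)])

    def features(window):
        """A few standard time-domain EMG features per window."""
        mav = np.mean(np.abs(window))                                # mean absolute value
        rms = np.sqrt(np.mean(window ** 2))                          # root mean square
        wl = np.sum(np.abs(np.diff(window)))                         # waveform length
        zc = np.sum(np.signbit(window[:-1]) != np.signbit(window[1:]))  # zero crossings
        return np.array([mav, rms, wl, zc])

    fs = 1000                                              # Hz, assumed sampling rate
    emg = np.random.randn(10 * fs)                         # placeholder for one channel's envelope
    X = np.stack([features(w) for w in segment(emg, fs)])  # feature matrix fed to the MLP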

I also tried checking my Python program for errors in real time by storing the incoming data and the produced feature vectors, then comparing them with the vectors calculated by running filtering, segmentation and feature extraction offline on the stored real-time data. They were the same, so I don't believe filtering/segmentation/feature extraction is being executed incorrectly in real time.

Has anybody experienced a similar issue? Is what I'm trying to achieve possible, or is 4 gestures the best I'm going to get? I've not found a lot of papers analyzing real-time EMG classification and robotic arm movement at the same time, so I thought I'd ask here as well. I hope I've given enough information.

Thanks!

9 Comments
2024/11/14
10:35 UTC

2

[P] Experience with KV260 for realtime video processing?

This is a requirement from my PI.

I am looking for anyone with experience using the KV260 for video processing. I am interested in high-throughput video AI: 60 ms (2 frames) lens-to-screen time for the main video feed, with AI augmentation up to 120 ms (4 frames) behind real time. The augmentations are intended to be served as a best-effort muxed overlay on the video feed. HDMI input.

I am interested in the DPU capabilities but was originally planning to offload the video to a networked GPU system.

* Is the KV260 capable of this?
* If so, how hard is it?
* Has anyone done this and has recommendations?
* Any thoughts on approach are welcome too.
* I am open to other boards and tools, but FPGAs seem to be the only thing fast enough.

KV260 Kit
https://www.amd.com/en/products/system-on-modules/kria/k26/kv260-vision-starter-kit.html

3 Comments
2024/11/13
23:18 UTC

0

[d] grounding-dino: what is load_image doing internally and how to apply the same operation to frames from video

While doing some testing, I noticed that inference returns very different results for the same image when it is loaded with different methods:

  • Method 1: the official load_image function from the library (it reads the image using the path passed as argument)
  • Method 2: using cv2 to read the image, converting it to a tensor, and then swapping axes to have the channel dimension first.

As I said, both methods give you a tensor to pass to the model, but they return very different results (method 2 is usually bad). I inspected the shape of the image returned in both cases and they are different, so there are definitely transformations going on inside load_image. My question is: what is happening inside load_image, so I can replicate it in other scripts?

My end goal is to run the model on video, i.e. on frames from the video, so I cannot use load_image because the frames are not images on disk; they are obtained from the video. That's why I need to understand what is happening inside load_image, so I can emulate that behavior on the frames.

UPDATE: found it https://github.com/IDEA-Research/GroundingDINO/blob/856dde20aee659246248e20734ef9ba5214f5e44/groundingdino/util/inference.py#L39
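For frames from a video, something along these lines should replicate that preprocessing (a hedged sketch with torchvision: shorter side resized to 800 with the long side capped at 1333, float tensor in [0, 1], ImageNet mean/std normalization — verify the exact values against the linked inference.py):

    import cv2
    import numpy as np
    import torch
    import torchvision.transforms.functional as TF

    IMAGENET_MEAN = [0.485, 0.456, 0.406]
    IMAGENET_STD = [0.229, 0.224, 0.225]

    def preprocess_frame(frame_bgr: np.ndarray) -> torch.Tensor:
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)           # cv2 gives BGR; the model expects RGB
        tensor = TF.to_tensor(rgb)                                  # (3, H, W), floats in [0, 1]
        h, w = tensor.shape[1:]
        scale = min(800 / min(h, w), 1333 / max(h, w))              # shorter side -> 800, long side capped
        tensor = TF.resize(tensor, [int(h * scale), int(w * scale)], antialias=True)
        return TF.normalize(tensor, IMAGENET_MEAN, IMAGENET_STD)

    cap = cv2.VideoCapture("video.mp4")
    ok, frame = cap.read()
    if ok:
        image_tensor = preprocess_frame(frame)                      # pass this to the model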

4 Comments
2024/11/13
22:52 UTC

Back To Top