[D] Automatic data segment optimisation

I thought of this after a few beers, so please allow me some incompleteness. Say you work for a property investment fund and are tasked with predicting house prices. Assume the business can only invest $10m a year. You have a dataset of 100k houses from all across the country, with many features and the respective house prices. You train some fancy model optimising for RMSE against the entire set of 100k houses. Great, you now have some cool RMSE that describes how "good" your model is. But don't forget, your business can only invest a certain amount a year, so if your model is way, way better at predicting prices for single-storey houses than for two-storey houses, why waste computational resources and model learning power on two-storey houses? Is there not an argument to retrain on only single-storey houses? Put another way, whilst a total RMSE is nice, if the RMSE for single-storey houses is way lower than for two-storey houses (within the same set of predictions), you'd certainly be better off investing in the former. There should be >$10m worth of single-storey houses to invest in.

Now, in this scenario, you don't know this outcome ahead of time. Additionally, there may be feature spaces that are too large or complex to practically train on individual suspected "fruitful" data subsets. So, in the way that hyperparameter search optimises hyperparameters, it would be cool if there was a way to optimise the data subset used for training. Consider that the search space for data segment optimisation is enormous (maybe even after dimensionality reduction), so "grid search" is probably not the way to go... Is anyone aware of some cool techniques for optimising the data segment used for training? Thanks for reading my Ted Talk...
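
A natural first step before any search over subsets is simply to break the overall RMSE down by segment on a held-out set. The sketch below does that and ranks the segments; the segment labels, prices, and the function name `rmse_by_segment` are made up for illustration, and it assumes a segment column already exists.

```python
import math
from collections import defaultdict

def rmse_by_segment(segments, y_true, y_pred):
    """Return {segment: RMSE} for predictions grouped by a segment label."""
    squared_errors = defaultdict(list)
    for seg, t, p in zip(segments, y_true, y_pred):
        squared_errors[seg].append((t - p) ** 2)
    return {seg: math.sqrt(sum(errs) / len(errs)) for seg, errs in squared_errors.items()}

# Hypothetical held-out predictions (prices in $k); the segment labels are made up.
segments = ["single", "single", "double", "double", "single", "double"]
y_true = [300.0, 320.0, 500.0, 450.0, 280.0, 610.0]
y_pred = [305.0, 310.0, 560.0, 400.0, 285.0, 540.0]

per_segment = rmse_by_segment(segments, y_true, y_pred)
ranked = sorted(per_segment.items(), key=lambda kv: kv[1])  # lowest error first
for seg, err in ranked:
    print(f"{seg}: RMSE = {err:.1f}")
```

This only evaluates segments that are already defined; searching over how to define them is the harder problem the post asks about.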

11:34 UTC


About missing data points [D]

I found a stock price dataset, but it doesn't contain weekends, so there are gaps between the dates. Is that a problem, and if so, what can I do to solve it?
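
Weekend gaps are usually expected in price data because exchanges are closed on those days. A quick check, sketched below with made-up dates, is to list the calendar days that have no row and confirm they all fall on Saturday or Sunday; the helper name `missing_days` is hypothetical.

```python
from datetime import date, timedelta

def missing_days(trading_dates):
    """Return calendar days between the first and last date that have no row."""
    present = set(trading_dates)
    day, last = min(trading_dates), max(trading_dates)
    gaps = []
    while day <= last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

# Hypothetical trading dates: Mon 1 Apr 2024 to Mon 8 Apr 2024, weekdays only.
dates = [date(2024, 4, d) for d in (1, 2, 3, 4, 5, 8)]
gaps = missing_days(dates)
only_weekends = all(d.weekday() >= 5 for d in gaps)  # 5 = Saturday, 6 = Sunday
print(gaps, only_weekends)
```

If every gap is a weekend (or a market holiday), one common choice is to treat the series as indexed by trading day and leave it as is rather than inventing prices for closed days.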

1 Comment
11:14 UTC


[D] How do you parse Wikipedia / Wikidata ?

As the title says, what are your methods for harvesting knowledge from Wikipedia and/or Wikidata? For instance, do you preprocess dumps? If so, what kind of processing do you perform to obtain a clean version for your needs?
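
One possible starting point for dump preprocessing is to stream the XML export page by page instead of loading it whole. The sketch below uses only the standard library and assumes the usual `<page>`, `<title>`, and `<text>` elements of a MediaWiki export; the sample document and the name `iter_pages` are illustrative, and cleaning the wikitext itself is out of scope here.

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_file):
    """Yield (title, wikitext) pairs from a MediaWiki XML export, one page at a time."""
    title, text = None, None
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the page's children; dumps are too large to keep in memory

# Tiny hand-written stand-in for a real dump file.
sample = io.BytesIO(b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alan Turing</title><revision><text>Alan Turing was a mathematician.</text></revision></page>
  <page><title>Ada Lovelace</title><revision><text>Ada Lovelace was a mathematician.</text></revision></page>
</mediawiki>""")
pages = list(iter_pages(sample))
print([title for title, _ in pages])
```

For a real compressed dump, the same function can be given a file object from `bz2.open(path, "rb")`, so the archive never has to be fully decompressed on disk.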

11:03 UTC


[D] Using VGGish slim version on Keras

I am using VGGish as the backbone of a sound-event detection project I am working on for my PhD project.

I am new to deep learning and, having started with Andrew Ng's course, I have mostly used Keras with TensorFlow. When I first started meddling with VGGish I had trouble with the slim notation of the official VGGish repo and instead used this Keras implementation I found.

I just now realized that the structure of the model from the Keras implementation repo is missing one of the 256 convolution layers of the original model.

Does anyone know of any material I can read to help build directly from the official VGGish repo? Can I load the model using the slim notation and then bridge it to the Keras ecosystem to add my layers and fine-tune on my data?

Thank you.

09:35 UTC


[D] Looking for transformers with emoji embeddings

I am working on multilingual texts, with typical pretrained models (e.g. mDistilBERT). The problem is, those models typically do not have emojis in their tokens and token embeddings. But emojis often constitute a major and important part of my data.

Adding new tokens with emojis was indeed helpful, but they start with random embeddings as new tokens. I have an idea to instead initialize the input token embeddings from another model, as a way of transfer learning.

The problem is, I have trouble finding any models with emojis in their tokenizer. Do you know any? The more emojis, the better.

1 Comment
07:06 UTC


[D] Dabbling in hand-written text recognition problem using self-supervised learning

I've become interested in experimenting with a hand-written text recognition problem as practice. I found this awesome repository from Facebook Research as an application for their GTN (Graph Transformer Network) library:


Unfortunately, the repository has been archived (don't ask me why). But still, it holds amazing code. I faced some challenges setting it up (I had to compile GTN myself since the pip package would not install, and even compiling it ran into issues, but I finally managed).

This repo holds different model architectures for solving different tasks and datasets. You can find tasks like hand-written text recognition (let's call it HWTR) in there, as well as speech to text. As I said, I am interested in the HWTR task, and for that a go-to dataset is IAM, which the aforementioned repo supports perfectly.

So far, I've spent more than a month understanding and fiddling with the code, and I can say that I know my way around it relatively well. I have also tried to add my own spin to it, but let's not get ahead of ourselves. The model architectures that I could find in this repo and test with are: RNN, TDS, and TDS2d. The RNN is no stranger and I'm sure most of you know what it is. TDS was new to me; I had not heard of it before. It's based on ConvNet layers and is composed of multiple (deep) layers. TDS2d is similar to TDS but, as the name suggests, while TDS uses 1D convolutions, TDS2d uses 2D convolutions. Having trained these three architectures, I can say that RNN has the lowest performance while TDS2d has the highest (WER: 26%, CER: 7%).

Now, we get to the part where I tried to add my own spin to the task. I am interested in SSL (self-supervised learning) and wanted to see it work in action. I've already experimented with smaller datasets like MNIST and it worked. But of course, it's not easy to get results from large real-world datasets.

I wanted to see if I could get any output by applying InfoNCE (SimCLR) to this problem. This is the approach that I've taken: I used the same TDS and TDS2d architectures from the original code base (which I knew were capable of learning the data patterns when labels are present) and reformed them into a joint embedding architecture. I used the same IAM dataset but threw away the labels since this is an SSL task.

Since this architecture is joint embedding, each sample is augmented twice to form two similar (positive) samples, and the rest of the samples in the batch play the role of negative samples (just like in the SimCLR paper). Both samples are fed into the model and the two outputs go through the loss function to calculate the loss value.

This is my InfoNCE implementation (this is not the whole loss function but it is called by it):

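import torch

def info_nce_loss(v1, v2):
    # v1, v2: (batch, dim) embeddings of the two augmented views, already L2-normalized
    # Pairwise similarity scores; entry (i, j) compares v1[i] with v2[j]
    scores = torch.mm(v1, v2.T)
    # Labels hold the diagonal indices: the positive for row i is column i
    labels = torch.arange(scores.size(0), device=scores.device)
    return torch.nn.functional.cross_entropy(scores, labels), scores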

Very simple, and I feel confident that it works perfectly fine. Of course, this loss function requires v1 and v2 to be normalized before it is called (I'm using cosine similarity, which is closely related to angular distance). Normalization can easily be achieved like this:

v1 = torch.nn.functional.normalize(v1, dim=-1)

Just to be clear, this normalization is not part of the model itself, it is applied to the model's output prior to calling the info_nce_loss function above. Training a joint embedding model like this shows promising metrics (9% error rate) but once I finish the training and take the model, freeze the layers and add a linear layer on top of it and then train this model using the same IAM dataset (but this time it's a supervised learning task), this leads to 100% WER and 100% CER. Basically, the model is incapable of learning anything.
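
One property of this loss that may be worth checking, offered as an assumption and not a diagnosis: with unit-norm embeddings every score lies in [-1, 1], so the cross-entropy cannot fall below log(1 + (N - 1) * exp(-2)) for a batch of N. SimCLR divides the scores by a temperature before the softmax, which lowers that floor. The sketch below just evaluates the bound; the batch size of 256 and the function name are hypothetical.

```python
import math

def info_nce_lower_bound(batch_size, temperature=1.0):
    """Smallest cross-entropy reachable when every score lies in [-1, 1] / temperature.

    Best case per row: the positive scores +1 and every negative scores -1.
    """
    return math.log(1.0 + (batch_size - 1) * math.exp(-2.0 / temperature))

# Hypothetical batch size; the temperature values are only examples.
for tau in (1.0, 0.5, 0.1):
    bound = info_nce_lower_bound(256, tau)
    print(f"batch=256, temperature={tau}: loss >= {bound:.4f}")
```

If the training loss sits close to the bound for temperature 1.0, the softmax has little room left to separate positives from negatives, and dividing `scores` by a temperature below 1 is one thing to try.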

Of course, there are lots of details that I did not mention here since this is already too long for a post. Perhaps if a discussion opens up, I'll provide more details. And to be clear, I'm not expecting the new model to learn as well as the original supervised one, but to at least learn something.

So I was wondering, what do you think I might be missing? What would you do if you faced this situation? Personally, I've tested all the scenarios I could think of, but so far I have failed to get any results.

1 Comment
05:43 UTC


[Research] MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models?

Paper: https://arxiv.org/abs/2403.20330

Evaluation Code: https://github.com/open-compass/VLMEvalKit


Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLM. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline, human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

1 Comment
05:08 UTC


[Research] Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Paper: https://arxiv.org/abs/2404.06480

Code / Dataset: https://github.com/open-compass/Ada-LEval


Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings.

05:05 UTC


[News] NeurIPS 2024 Adds a New Paper Track for High School Students


The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024) is an interdisciplinary conference that brings together researchers in machine learning, neuroscience, statistics, optimization, computer vision, natural language processing, life sciences, natural sciences, social sciences, and other adjacent fields.

This year, we invite high school students to submit research papers on the topic of machine learning for social impact. A subset of finalists will be selected to present their projects virtually and will have their work spotlighted on the NeurIPS homepage. In addition, the leading authors of up to five winning projects will be invited to attend an award ceremony at NeurIPS 2024 in Vancouver.

Each submission must describe independent work wholly performed by the high school student authors. We expect each submission to highlight either demonstrated positive social impact or the potential for positive social impact using machine learning.

03:47 UTC


[Research] Face (picture) to 3D model, where and how to start?

I am an intermediate dev doing graphics development and low-level (native) work, so I am not new to math or low-level coding. I am venturing into ML right now. With that said, I want to learn which steps and topics/subjects I need to study so that I can achieve face to 3D object detection. Or can somebody give me some background on what to look at and study to achieve this?

I do want to use a third-party library, but I want to learn the nitty-gritty as well.

03:27 UTC


[Project] CLIP for efficient knowledge distillation without using teacher model, only using teacher embeddings

Can pre-computed embeddings obtained from the teacher model be used to train the student model in knowledge distillation?

This project extends CLIP for efficient knowledge distillation, by utilizing embeddings as teachers. Typical knowledge distillation frameworks require running forward passes through a teacher model, which is often prohibitive in the case of billion or trillion parameter teachers. Using only the embeddings of the teacher models to guide the distillation can yield significant computational savings.
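
To make the idea concrete, here is a minimal sketch of a distillation loss that reads teacher embeddings from a cache instead of running the teacher. The cosine-distance objective, the cache layout, and all names are assumptions for illustration and are not taken from the linked repository.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distillation_loss(student_embeddings, teacher_cache, sample_ids):
    """Mean cosine distance between student outputs and cached teacher embeddings."""
    distances = [
        cosine_distance(student, teacher_cache[sample_id])
        for student, sample_id in zip(student_embeddings, sample_ids)
    ]
    return sum(distances) / len(distances)

# Hypothetical cache: teacher embeddings computed once, offline, keyed by sample id.
teacher_cache = {"img_0": [1.0, 0.0], "img_1": [0.0, 1.0]}
student_batch = [[2.0, 0.0], [1.0, 0.0]]  # first matches its teacher, second is orthogonal
loss = distillation_loss(student_batch, teacher_cache, ["img_0", "img_1"])
print(f"loss = {loss:.2f}")
```

The point of the sketch is only that the teacher never appears in the training loop: each step needs a dictionary lookup, not a forward pass through a very large model.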

GitHub: https://github.com/lnairGT/CLIP-Distillation

03:03 UTC


[D] About Validation set

I heard that I can only use the testing set once, but if I do not get a very good result on the testing set, can I merge the testing set into the validation set, re-tune hyperparameters, and collect another testing set?

Additionally, if I find that models are not working well on some data because of bad lighting or out-of-domain samples, could I remove those samples from the validation set if I document this in the paper?

I feel like these two approaches might be "changing the data into what we want", and I feel a bit uneasy about that.


00:34 UTC


[R] The attention mechanism of Transformers resembles a modern iteration of associative memory models from neuroscience. I show auto- & hetero-associative mixtures can perform a range of tasks + suggest new neuro-inspired Transformer interp approaches.

Mechanically, attention appears to perform hetero-association. Actually, it can in principle mix auto- & hetero-association together.

Question: What abilities does this permit?
Answer: A lot!

Finite automata

By assigning neural activities for image or text data and converting their combinations into auto-associating attractors (states) or hetero-associating quasi-attractors (transitions), we can simulate finite automata.

(see section 3.4 and appendix A12 of the paper linked below)

An example of mapping a finite automaton to a 'memory graph'.

Multi-scale graph representations

By adjusting the strength of auto-association (a) and hetero-association (h), we can choose the scale or coarseness of graphical relationships we wish to identify.

(see section 3.2 and appendix A2 of the paper linked below)

The Tutte graph with different scales of activity spread across its vertices (memory patterns), as detected using an associative memory network.

Stabilizing recall of videos

Natural videos typically have a lot of time-dependency between frames. Striking the right balance of auto- and hetero-association helps prevent the video getting 'stuck' on a frame or skipping ahead.

(see section 3.3 and appendix A11 of the paper linked below)

Correlation of video frames (memory patterns) over time, showing how values of a and h affect the smoothness of recall.

Graceful trade-offs between memory fidelity & capacity, by hetero-associating similar pairs of memories & 'retrieving' those w/ highest overlap. (see appendix A3)

Replication of neuroscience data showing hetero-association in monkey temporal cortex. (see appendix A7)

Left: Non-traditional auto-association performance over memory load. Right: Settings of a and h which match well to data from monkey temporal cortex.

This leads me to suggest neuroscience-inspired interpretability analyses for Transformers and a hypothesis for why superposition should be related to ‘context switching’ & imply ‘data-dependent geometry’.

For more info --

Paper: https://arxiv.org/abs/2404.07123
GitHub: https://github.com/tfburns/CDAM

20:30 UTC


[R] Does anyone have access to an Attention visualization tool for generating Attention visualizations like the ones in the appendix of "Attention is All You Need"?

Since many copies of the paper don't have the appendix, here's one so you know what I'm talking about: https://arxiv.org/pdf/1706.03762.pdf

In the appendix there are visualizations of connections generated by the attention mechanism, which "appear to exhibit behavior related to the syntactic and semantic structure of the sentences".

Does anyone know how these were generated, or have a tool or method that generates similar?
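
This is not the authors' tool, but the quantity those figures show is the per-head attention weight between token pairs, softmax(Q K^T / sqrt(d)). The sketch below computes it for toy vectors and renders a crude text heatmap; the tokens and vectors are made up, and in practice the weights would come from a trained model (for example, Hugging Face Transformers models can return them with `output_attentions=True`).

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products: softmax(Q K^T / sqrt(d))."""
    d = len(keys[0])
    weights = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def render(tokens, weights, shades=" .:*#"):
    """Return one text row per query token; denser characters mean larger weights."""
    rows = []
    for tok, row in zip(tokens, weights):
        cells = "".join(shades[min(int(w * len(shades)), len(shades) - 1)] for w in row)
        rows.append(f"{tok:>6} |{cells}|")
    return rows

# Toy inputs, not real model activations.
tokens = ["the", "law", "is", "new"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
weights = attention_weights(vectors, vectors)
print("\n".join(render(tokens, weights)))
```

Drawing lines between tokens with opacity proportional to each weight, one colour per head, gives a picture in the same spirit as the appendix figures.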

Thanks in advance!

20:01 UTC


[R] ReFT: Representation Finetuning for Language Models

Paper: https://arxiv.org/abs/2404.03592

Code: https://github.com/stanfordnlp/pyreft


Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. Here, we pursue this hypothesis by developing a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). LoReFT is a drop-in replacement for existing PEFTs and learns interventions that are 10x-50x more parameter-efficient than prior state-of-the-art PEFTs. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, Alpaca-Eval v1.0, and GLUE. In all these evaluations, LoReFT delivers the best balance of efficiency and performance, and almost always outperforms state-of-the-art PEFTs. We release a generic ReFT training library publicly (see the code link above).

19:30 UTC


[R] ReALM: Reference Resolution As Language Modeling

Paper: https://arxiv.org/abs/2403.20329


Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.

19:21 UTC


[D] What Exactly Is AGI? Introducing a Unique and Rigorous Standard


I'm curious about what people here think about this:

What Exactly Is AGI? Introducing a Unique and Rigorous Standard

Best regards!

18:12 UTC


[R] Infinite context Transformers

I took a look and didn't see any discussion thread here on this paper which looks perhaps promising.


What are your thoughts? Could it be one of the techniques behind the Gemini 1.5 reported 10m token context length?

17:35 UTC


[D] Are there no proper explainability approaches for semantic segmentation models?

Hey there, I'm currently looking more into explainability in AI and would like to get more insight into the results of my segmentation models (U-Net and DeepLabV3).

Looking for possibilities to explain my outputs (or intermediate layers of my networks), I couldn't really find solid results w.r.t. segmentation tasks.

I could see that in the SHAP examples, there is one showing how to make some explanations on an intermediate layer of VGG16 on ImageNet with PyTorch (here). This, however, still shows a classification task and trying to apply it to my own problems didn't work out as expected. I've also tried to use their DeepExplainer which didn't really lead me to any results.

Are there no explainability approaches for semantic segmentation problems? I could find one or two research papers that, however, classify their end results and then apply e.g. Grad-CAM, which only partly explains the segmentation itself, doesn't it?

15:10 UTC


[P] use GAN for generating structured text

I’m in the first phase of a personal project in which I want to generate new data from a dataset that I will use later in the project. The data is structured text (it is like a log file) and I want to generate new files based on the dataset. I’ve thought of using GANs to do it, but I’m not sure if it’s the correct path, since I’ve seen that they’re used mainly for images. Do you have any suggestions or implementations of text-generating GANs?

14:39 UTC


[D] In the paper "More Agents Is All You Need", why did they use BLEU score to calculate similarity for ensemble voting instead of something like cosine similarity? Has there been any follow-up research comparing methods?

Paper: https://arxiv.org/pdf/2402.05120.pdf

In the paper, they have ensembles of LLMs answer questions. For discrete answers, like multiple choice questions, they just pick the most frequent answer. For "continuous" answers like code, they use BLEU score to find the answer that is the most similar to others.

Does anyone have insight on why that was chosen over something like cosine similarity? It doesn't look like they explain the choice, but their results are good so I guess it worked!
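
For reference, the selection rule itself does not depend on which similarity is used: score each answer by its summed similarity to all the others and keep the best. The sketch below uses a simplified n-gram overlap as a stand-in for BLEU (it is not the full metric and not the paper's code), and the candidate answers are made up. One practical difference it makes visible is that an n-gram measure needs no embedding model, whereas cosine similarity would.

```python
from collections import Counter

def ngram_overlap(a, b, n=2):
    """Simplified stand-in for BLEU: share of a's n-grams also found in b (clipped counts)."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    grams_a, grams_b = grams(a), grams(b)
    total = sum(grams_a.values())
    if total == 0:
        return 0.0
    return sum(min(count, grams_b[g]) for g, count in grams_a.items()) / total

def pick_by_similarity(answers, similarity=ngram_overlap):
    """Return the answer with the highest summed similarity to all other answers."""
    def score(i):
        return sum(similarity(answers[i], other) for j, other in enumerate(answers) if j != i)
    return answers[max(range(len(answers)), key=score)]

# Hypothetical, pre-tokenized candidate answers from several samples.
answers = [
    "return sorted ( xs ) [ 0 ]",
    "return sorted ( xs ) [ 0 ]",
    "return min ( xs )",
    "return xs [ 0 ]",
]
print(pick_by_similarity(answers))
```

Passing a different `similarity` callable, such as cosine similarity over embeddings, is all it would take to compare the two choices on the same set of answers.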

12:39 UTC


[D] Pixel perfect OS GUI navigation?

Is it possible to train a model to be nearly pixel perfect, if not pixel perfect, when “navigating” an operating system's GUI? I know there are many limitations to this, but there are some projects trying to tackle this visionary idea of an AI being able to accurately navigate a computer's GUI. Some of the methods used to tackle this problem seem to revolve around the idea of a GPT-4 vision model seeing screenshots of the GUI, so essentially processing navigation frame by frame.

There was a GitHub repo with a mock-up version of this, as well as a company by the name of openinterpreter with a version of this. Is there a better method than frame-by-frame navigation? Some of the limitations it brings are speed, accuracy of mouse/cursor navigation, etc.

If there isn’t a better way for now, could a better way in the future be a whole language/framework/library made to tackle this problem? Is there a way to access the physical screen's pixel information so that the model could train on that? Any thoughts?

1 Comment
10:14 UTC


Are LLMs good at NL-to-Code & NL-to-SQL tasks? [Discussion]

Hey everyone,

For the last few days, I have been researching how Large Language Models perform on NL-to-Code and, mainly, NL-to-SQL tasks. I want to hear more on this from our community of practitioners.

This interest primarily stemmed from curiosity and from the efficiency of using LLMs for coding. What have you felt about their performance in terms of accuracy, efficiency, etc.? Which models have you tried for this task, and what worked best in your opinion?

10:12 UTC


[N] Proving humanness in the age of AI: a technical deep dive into World ID

In this event, the speakers will present a technical deep dive of World ID, a "human passport for the internet" built as an open protocol. Part of the Worldcoin project, World ID lets you prove you’re a unique human online while keeping your identity private.

Below are the speakers:
• Christian Brendel: Head of AI at Tools for Humanity
• Massimiliano Patacchiola: Senior AI Research Engineer at Tools for Humanity

The event will consist of a 40 minute presentation followed by 20 minutes of networking and open discussion.


09:01 UTC


[D] Regression model metrics - best scenario?

So I’ve been trying to build GB models for my work, crop related data.

For this I’ve been mostly selecting + further tuning the best error metric models (usually RMSE) on the validation set of my data, post grid/random search.

However, on the test data that I’m using for prediction, the error goes up significantly, about 4-5x. The thing is, 5x the RMSE of the validation set is still a good enough value when considering the range of my data. Does this mean the model is failing to generalise as well as it should, or overfitting, leading to such a huge increase in RMSE? Or is it a good enough model, as it’s reasonably capturing the trends in my data?

Would it be better to have a model that gives a good + similar RMSE for training, validation and testing data, or one that has a great RMSE for validation but just a reasonably good one in testing?

04:50 UTC


[D] Hessian of deep learning model and its eigenvector

I'm trying to understand more about the Hessian matrix, its eigenvectors, and how to optimize them. Can anyone provide some insight into this (or recommend papers that have great insight into this aspect)? For example, what the eigenvectors of a Hessian matrix represent, and how they should behave in a good, well-generalized model.
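
As a small worked example: the eigenvectors of the loss Hessian are the directions of principal curvature, and each eigenvalue is the curvature along its direction, so the top eigenpair is the "sharpest" direction of the loss surface. The sketch below estimates it with power iteration using only Hessian-vector products, on a toy quadratic loss chosen for illustration; real networks would typically get the product from automatic differentiation, not finite differences.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = 1.5*w0^2 + 0.5*w1^2 + w0*w1 (Hessian [[3, 1], [1, 1]])."""
    return [3.0 * w[0] + w[1], w[0] + w[1]]

def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Central finite difference of the gradient along v; never forms the full Hessian."""
    plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def top_eigenpair(grad_fn, w, steps=100):
    """Power iteration: returns the largest-magnitude Hessian eigenvalue and its eigenvector."""
    v = [1.0 / math.sqrt(len(w))] * len(w)
    eigenvalue = 0.0
    for _ in range(steps):
        hv = hessian_vector_product(grad_fn, w, v)
        eigenvalue = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return eigenvalue, v

eigenvalue, direction = top_eigenpair(grad, w=[0.2, -0.1])
print(f"sharpest curvature ~ {eigenvalue:.4f} along {direction}")
```

For this toy Hessian the exact top eigenvalue is 2 + sqrt(2), which the iteration should recover closely; the same routine scales to large models because it only ever needs products with the Hessian.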

1 Comment
04:08 UTC


[R] Mamba Enters Remote Sensing! RS-Mamba: The First Application of SSM for Semantic Segmentation and Change Detection in Large Remote Sensing Images

Content summary

Mamba shines in the world of large language models with its linear complexity and performance comparable to that of the transformer, making it a strong contender as a replacement for the transformer. Recent works, Vim and VMamba, have introduced Mamba into the visual imaging domain, sparking numerous breakthroughs in various fields of vision and generating a wealth of research using Mamba for visual tasks.

This paper marks the first introduction of Mamba into remote sensing, developing RS-Mamba for dense prediction tasks in very-high-resolution remote sensing images. It leverages its linear complexity and global modeling capability to handle large remote sensing images.

Previous remote sensing models were mainly divided into CNN-based and transformer-based. CNN-based models, due to local convolution operations, cannot globally model remote sensing images. Transformer-based models, due to their quadratic complexity, cannot process large ultra-high resolution remote sensing images without losing a significant amount of contextual information by cropping the images into smaller blocks.

Although recent works, Vim and VMamba, have brought Mamba into the visual imaging domain, they only perform selective scanning horizontally or vertically, which suits natural images with primary spatial features distributed in these directions but not remote sensing images with features spread in arbitrary directions.

Addressing these issues, RS-Mamba innovatively proposes an omnidirectional selective scanning module, scanning remote sensing images in multiple directions to extract large-scale spatial features from various directions. Thanks to its linear complexity, RS-Mamba can process large remote sensing images that transformer models cannot, boasting global modeling capabilities. Experiments in semantic segmentation and change detection tasks of various land cover types show that RS-Mamba achieves state-of-the-art performance with a simple model architecture and training method.
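
To illustrate what scanning a patch grid in several directions can mean, the sketch below builds flattened index orders for horizontal, vertical, diagonal, and anti-diagonal traversals plus their reverses. The exact set of directions and the function name are assumptions for illustration, not a reproduction of the RS-Mamba module.

```python
def scan_orders(height, width):
    """Return flattened index orders for several scan directions over an H x W patch grid."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    orders = {
        "horizontal": sorted(cells, key=lambda rc: (rc[0], rc[1])),
        "vertical": sorted(cells, key=lambda rc: (rc[1], rc[0])),
        "diagonal": sorted(cells, key=lambda rc: (rc[0] + rc[1], rc[0])),
        "anti_diagonal": sorted(cells, key=lambda rc: (rc[0] - rc[1], rc[0])),
    }
    # Add the reverse of each order so every direction is also scanned backwards.
    for name in list(orders):
        orders[name + "_reversed"] = orders[name][::-1]
    # Convert (row, col) pairs to positions in the row-major flattened sequence.
    return {name: [r * width + c for r, c in order] for name, order in orders.items()}

orders = scan_orders(2, 3)
for name, order in orders.items():
    print(f"{name:>22}: {order}")
```

Each order is a permutation of the same patch sequence, so a sequence model can be run once per order and the outputs mapped back to the grid and merged.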

The overall structure of RSM-SS and RSM-CD

Code: RS-Mamba

PDF: RS-Mamba

Arxiv: RS-Mamba

The code has been made open source, and if you find it helpful, we would appreciate it if you could give us a star on GitHub. For more information, you can refer to the above Paper and Code.

We welcome further exploration based on this paper into the potential of SSM-based methods in dense prediction tasks for remote sensing. The architecture used in RSM is the simplest possible, suggesting significant untapped potential.

The popularity of Mamba across various fields is quickly reaching the realm of remote sensing, and its potential in this area is expected to spark a new wave of research interest.

03:40 UTC


[D] What interesting developments in the MLOps space have you been following recently?

Feeling stagnant lately. Looking to work some cutting edge tech into my MLOps stack. But not sure where to start or even look.

The pace of development is just absolutely overwhelming, especially with the wild west nature of foundational LLMs right now. So I'm hoping this post will generate some useful threads to explore.

Bonus points for links to blogs or papers. And no need to limit yourself to just one thing. TIA!

03:35 UTC


[R] An Auto-Regression Model for Object Recognition

Hello everyone!

I would like to share our recent CVPR work, hoping to spread our simple ideas and gather insights from the community.

[TL;DR] The auto-regression model can predict labels from just an input image, without a predefined query gallery (e.g., CLIP-like models) or predefined class concepts (e.g., VGG/ResNet-like models). The model predicts top-K labels, e.g., top-100, from the entire textual space (any label).

For more details, please visit our paper and project: https://github.com/kaiyuyue/nxtp.

Your thoughts and feedback are appreciated. Thank you very much!


02:37 UTC
