/r/MachineLearning

2,903,754 Subscribers

1

[D] How OpenAI JSON mode implemented?

I assume that all the training data is in JSON format, but higher temperature or other randomness during generation doesn’t guarantee that the outputs will always be in JSON.

What other methods do you think could ensure that the outputs are consistently in JSON? Perhaps some rule-based methods during decoding could help?

1 Comment
2024/07/26
11:15 UTC

2

[D] LLM Properties Configuration for Data Insights generation

[D]

i am using llama 3.1 with ollama in my local desktop

i am using it to generate data insights by inputting a data set and enter this prompt

""

Ignore all previous prompts. Analyze the provided dataset and generate exactly 5 key insights, presented in numerical order. Ensure the insights are based on the most relevant and accurate data points from the dataset, specifically with respect to India.

""

what values should i keep of temperature, top k, top p so that the data insights generated are highly meaningful and accurate

0 Comments
2024/07/26
11:09 UTC

1

[D] Do only some hardware support int4 quantization? If so why?

I tried quantizing my finetuned mT5 model using the optimum’s openvino wrapper to int8 and int4. There was very little difference in the inference time, close to 5%. This makes me wonder if it’s an issue with hardware. I’m using intel sapphire rapids and it has an avx512_vnni instruction set. How did I figure out if it supports int4? And why and why not?

0 Comments
2024/07/26
07:56 UTC

6

[D] Normalization in transformers

After the first theoretical issue with my transformer, I now see another. The original paper uses normalization after residual addition (Post-LN), which led to training difficulties and later got replaced by normalization at the beginning of each attention or mlp block/branch (Pre-LN). This is known to work better in practice (trainable without warmup, restores highway effect), but it still doesn't seem completely ok theoretically.

First consider things without normalization. Assuming attention and mlp blocks are properly set up and mostly keep norms, each residual addition would sum two similar norm signals, potentially scaling up by something like 1.4 (depending on correlation, but it starts at sqrt(2) after random init). So the norms after the blocks could look like this: [1(main)+1(residual)=1.4] -> [1.4+1.4=2] -> [2+2=2.8] etc. This would cause various problems (like changing the softmax temp in later attention blocks), so adjustment is needed.

Pre-LN ensures each block works on normalized values (thus with constant - if slightly arbitrary - softmax temperature). But since it doesn't affect the norm of the main signal (as forwarded by the skip connection) but only the residual, the norms can still grow, albeit slower. The expectation is now roughly: [1+1=1.4] -> [1.4+1=1.7] -> [1.7+1=2] -> [2+1=2.2] etc - with a final normalization correcting the signal near output (Pre-LN paper).

One possible issue with this is that later attention blocks may have reduced effect, as they add unit norm residuals to a potentially larger and larger main signal. What is the usual take on this problem? Can it be ignored in practice? Does Pre-LN work acceptably despite it, even for deep models (where the main norm discrepancy can grow larger)? There are lots of alternative normalization papers, but what is the practical consensus?

Btw attention is extremely norm-sensitive (or, equivalently, the hidden temperature of softmax is critical). This is a sharp contrast to fc or convolution which are mostly scale-oblivious. For anybody interested: consider what happens when most raw attention dot products come out 0 (= query and key is orthogonal, no info from this context slot) with only one slot giving 1 (= positive affinity, after downscaled by sqrt(qk_siz) ). I for one got surprised by this during debug.

2 Comments
2024/07/26
07:44 UTC

4

[R] How do you search for implementations of Mixture of Expert models that can be trained locally in a laptop or desktop without ultra-high end GPUs?

Hi, I am a 2nd year PhD student in CS. My supervisor just got this idea about MoEs and fairness and asked me to implement it ( work on a toy classification problem on tabular data and NOT language data). However as it is not their area of expertise, they did not give any guidelines on how to approach it. My main question is: How do I search for or proceed with implementing a mixture of expert models? The ones that I find are for chatting and such but I mainly work with tabular EHR data.

This is my first foray into this area (LLMs and MoEs) and I am kind of lost with all these Mixtral, openMoE, etc. As we do not have access to Google Collab or have powerful GPUs I have to rely on local training (My lab PC has 2080ti and my laptop has 4070). Any guideline or starting point on how to proceed would be greatly appreciated.

11 Comments
2024/07/25
23:12 UTC

1

[R] Moderating LLM Inputs with PromptGuard

Meta's release of its latest Llama language model family this week, including the massive Llama-3 405B model, has generated a great deal of excitement among AI developers. These open-weights frontier models, which have been updated with a new license that allows unrestricted use of outputs, will enable significant improvements to AI-powered applications, and enable widespread commercial use of synthetic data. Less discussed, but no less important, are Meta's latest open moderation tools, including a new model called PromptGuard.

PromptGuard is a small, lightweight classification model trained to detect malicious prompts, including jailbreaks and prompt injections. These attacks can be used to manipulate language models to produce harmful outputs or extract sensitive information. Companies building enterprise-ready applications must be able to detect and mitigate these attacks to ensure their models are safe to use, especially in sensitive and highly-regulated domains like healthcare, finance, and law.

PromptGuard is a text classification model based on mDeBERTa-v3-base, a small transformer model with multilingual capabilities. Meta trained this model to output probabilities for 3 classes: BENIGNINJECTION, and JAILBREAK. The JAILBREAK class is designed to identify malicious user prompts (such as the "Do Anything Now(opens in a new tab)" or DAN prompt, which instructs a language model to ignore previous instructions and enter an unrestricted mode). On the other hand, the INJECTION class is designed to identify retrieved contexts, such as a webpage or document, which have been poisoned with malicious content to influence the model's output.

In our tests, we find that the model is able to identify common jailbreaks like DAN, but also labels benign prompts as injections. This likely happens because the model is trained to handle both prompts and retrieved contexts (such as web searches and news articles), and a benign prompt may appear similar to a malicious context. As stated in the model card:

Application developers typically want to allow users flexibility in how they interact with an application, and to only filter explicitly violating prompts (what the ‘jailbreak’ label detects). Third-party content has a different expected distribution of inputs (we don’t expect any “prompt-like” content in this part of the input)

This indicates that when applying the model to user prompts, you may want to ignore the INJECTION label, and only filter JAILBREAK inputs. On the other hand, when filtering third-party context to show to the model, such as a news article, you'd want to remove both JAILBREAK and INJECTION labels.

We wrote a quick blog post about how you can use PromptGuard to protect your language models from malicious inputs.

You can read more here: https://www.trytaylor.ai/blog/promptguard

1 Comment
2024/07/25
22:12 UTC

4

[P] How to make "Out-of-sample" Predictions

My data is a bit complicated to describe so I'm going try to describe something analogous.

Each example is randomly generated, but you can group them based on a specific but latent (by latent I mean this isn't added into the features used to develop a model, but I have access to it) feature (in this example we'll call this number of bedrooms).

Feature x1Feature x2Feature x3...Output (Rent)
Row 1
Row 2
Row 3
Row 4
Row 5
Row 6
Row 72
Row 81
Row 90

So I can group Row 1, Row 2, and Row 3 based on a latent feature called number of bedrooms (which in this case is 0 bedroom). Similarly, Row 4, Row 5, & Row 6 have 2 Bedrooms, and Row 7, Row 8, & Row 9 have 4 Bedrooms. Furthermore, these groups also have an optimum price which is used to create output classes (output here is Rent; increase, keep constant, or decrease). So say the optimum price for the 4 bedrooms group is $3mil, and row 7 has a price of $4mil (=> 3 - 4 = -1 mil, i.e a -ve value so convert this to class 2, or above optimum or increase rent), row 8 has a price of $3mil (=> 3 - 3 = 0, convert this to class 1, or at optimum), and row 9 has a price of $2mil (3 - 2 = 1, i.e +ve value, so convert this to class 0, or below optimum, or decrease rent). I use this method to create an output class for each example in the dataset (essentially, if example x has y number of bedrooms, I get the known optimum price for that number of bedrooms and I subtract the example's price from the optimum price).

Say I have 10 features (e.g. square footage, number of bathrooms, parking spaces etc.) in the dataset, these 10 features provide the model with enough information to figure out the "number of bedrooms". So when I am evaluating the model,

feature x1feature x2feature x3...
Row 10

e.g. I pass into the model a test example (Row 10) which I know has 4 bedrooms and is priced at $6mil, the model can accurately predict class 2 (i.e increase rent) for this example. Because the model was developed using data with a representative number of bedrooms in my dataset.

Features....Output (Rent)
Row 10
Row 20
Row 30

However, my problem arises at examples with a low number of bedrooms (i.e. 0 bedrooms). The input features doesn't have enough information to determine the number of bedrooms for examples with a low number of bedrooms (which is fine because we assume that within this group, we will always decrease the rent, so we set the optimum price to say $2000. So row 1 price could be $8000, (8000 - 2000 = 6000, +ve value thus convert to class 0 or below optimum/decrease rent). And within this group we rely on the class balance to help the model learn to make predictions because the proportion is heavily skewed towards class 0 (say 95% = class 0 or decrease rent, and 5 % = class 1 or class 2). We do this based the domain knowledge of the data (so in this case, we would always decrease the rent because no one wants to live in a house with 0 bedrooms).

MAIN QUESTION: We now want to predict (or undertake inference) for examples with number of bedrooms in between 0 bedrooms and 2 bedrooms (e.g 1 bedroom NOTE: our training data has no example with 1 bedroom). What I notice is that the model's predictions on examples with 1 bedroom act as if these examples had 0 bedrooms and it mostly predicts class 0.

My question is, apart from specifically including examples with 1 bedroom in my input data, is there any other way (more statistics or ML related way) for me to improve the ability of my model to generalise on unseen data?

13 Comments
2024/07/25
19:47 UTC

0

[D] Will An Unsupervised FSD Eventually Be Efficient Enough Run on Tesla's HW3?

Tesla has a version (V12.5) of their supervised "Full Self Driving" that potential showing signficant improvements, though we will wait to see how much miles per critical disengagment have gone up. (Maybe 600-1000. Previous versions at 100-200 miles per critical disengagement).

In order to make this improvement, they upped the parameter count by 5x the previous models. They are just barely making it function on HW3 (works on HW4). These models are already taking advantage of distillation and compression techniques.

Considering that the miles per critical disengagement still needs to go up another 100x, I would think model parameter count will have to go up signficantly, maybe 10x-100x?

While there are continuing advances in model distillation and compression, I find it hard to fathom that much larger models needed to achieve unsupervised driving will be compressed even further.

Tweets like this imply (presumably from advances like LLAMA 2 to LLAMA 3) that these compression ratios will continue at a massive pace.

https://x.com/wintonARK/status/1816537413206048915

What do you think? To me, the likely needed increase in model size to get to robotaxi level fidelity will outweigh any advances in distillation so that HW3 will unlikely be able to handle the model.

1 Comment
2024/07/25
19:32 UTC

6

[R] EMNLP Paper review scores

EMNLP paper review scores

Overall assessment for my paper is 2, 2.5 and 3. Is there any chance that it may still be selected? The confidence is 2, 2.5 and 3. The soundness is 2, 2.5, 3.5. I am not sure how soundness and confidence may affect my paper's selection. Pls explain how this works. Which metrics should I consider important.

Thank you!

5 Comments
2024/07/25
19:06 UTC

75

[N] OpenAI announces SearchGPT

https://openai.com/index/searchgpt-prototype/

We’re testing SearchGPT, a temporary prototype of new AI search features that give you fast and timely answers with clear and relevant sources.

24 Comments
2024/07/25
18:41 UTC

1

[P] Local Llama 3.1 and Marqo Retrieval Augmented Generation

I built a simple starter demo of a Knowledge Question and Answering System using Llama 3.1 (8B GGUF) and Marqo. Feel free to experiment and build on top of this yourselves!

GitHub: https://github.com/ellie-sleightholm/marqo-llama3_1

0 Comments
2024/07/25
16:45 UTC

83

[N] AI achieves silver-medal standard solving International Mathematical Olympiad problems

https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

They solved 4 of the 6 IMO problems (although it took days to solve some of them). This would have gotten them a score of 28/42, just one point below the gold-medal level.

33 Comments
2024/07/25
16:16 UTC

1

[R] Explainability of HuggingFace Models (LLMs) for Text Summarization/Generation Tasks

Hi community,

I am exploring the Responsible AI domain where I have started reading about methods and tools to make Deep Learning Models explainable. I have already used SHAP and LIMe for ML model explainability. However, I am unsure about their use in explaining LLMs. I know that these methods are model agnostic but can we use these methods for Text Generation or Summarization tasks?

I got reference docs from Shap explaining GPT2 for text generation tasks, but I am unsure about using it for other newer LLMs. Additionally, I would like to know, are there any better ways for Explainable AI for LLMs?

1 Comment
2024/07/25
15:19 UTC

3

[D] High-Dimensional Probabilistic Models

What is the standard way to model high-dimensional stochastic processes today? I have some process defined over images x, and I would like to compute P(x' | x, z) for all x'. I know there are Normalizing Flows, Gaussian Processes, etc, but I do not know which to get started with. I specifically want to compute the probabilities, not just sample some x' ~ P(x, z).

3 Comments
2024/07/25
14:58 UTC

95

[R] Shared Imagination: LLMs Hallucinate Alike

Happy to share our recent paper, where we demonstrate that LLMs exhibit surprising agreement on purely imaginary and hallucinated contents -- what we call a "shared imagination space". To arrive at this conclusion, we ask LLMs to generate questions on hypothetical contents (e.g., a made-up concept in physics) and then find that they can answer each other's (unanswerable and nonsensical) questions with much higher accuracy than random chance. From this, we investigate in multiple directions on its emergence, generality and possible reasons, and given such consistent hallucination and imagination behavior across modern LLMs, discuss implications to hallucination detection and computational creativity.

Link to the paper: https://arxiv.org/abs/2407.16604

Link to the tweet with result summary and highlight: https://x.com/YilunZhou/status/1816371178501476473

Please feel free to ask any questions!

The main experiment setup and finding.

23 Comments
2024/07/25
13:49 UTC

11

[R] Paper NAACL 2024: "Reliability Estimation of News Media Sources: Birds of a Feather Flock Together"

For people working on information verification in general, for instance, working on fact checking, fake news detection or even using RAG from news articles this paper may be useful.

Authors use different reinforcement learning techniques to estimate reliability values of news media outlets based on how they interact on the web.

The method is easy to scale since the source code is available to build larger hyperlink-based interaction graphs from Common Crawl News. Authors also released the computed values and dataset with news media reliability annotation:

In the demo, the retrieved news articles will be order not only by the match to the query but also by the estimated reliability for each sources (URL domains are color coded from green to red, for instance, scrolling down will show results coming from less reliable sources marked with red-ish colors). Alternatively, if a news URL or a news outlet domain (e.g. apnews.com) is given as a query, information about the estimated values are detailed (e.g. showing the neighboring sources interacting with the media, etc.)

Have a nice day, everyone! :)

2 Comments
2024/07/25
09:10 UTC

55

[D] ACL ARR June (EMNLP) Review Discussion

Too anxious about reviews as they didn’t arrive yet! Wanted to share with the community and see the reactions to the reviews! Rant and stuff! Be polite in comments.

102 Comments
2024/07/25
04:45 UTC

1

[D] Seeing Through the Haze: How Diffusion Models Enhance Depth Estimation

Get Clarity from Your Camera, Even When It's Cloudy

TL;DR

Diffusion models make depth estimation from single images more accurate, even under tough conditions like rain and low light. They create realistic challenging scenarios from simple scenes, improving the ability of AI to understand depth in various adverse conditions.

Detailed Explanation

Imagine you want to learn how deep a swimming pool is just by looking at it. Normally, this task is easy on a sunny day with clear water. But what if it's raining, or it's nighttime, or the water has a lot of reflections? That's much harder! The new approach discussed helps computers do this tricky job of figuring out depth from just one image, even when the scene isn't perfect.

The Problem

Monocular depth estimation means guessing how far things are using only one image. It’s like closing one eye and still figuring out how far your toys are. While technology has gotten better at this, computers have a tough time in bad conditions like bad weather, nighttime, or with shiny surfaces because there isn’t enough training data for these situations.

The Solution: Diffusion Models

Diffusion models fix this by creating more training data for difficult conditions. Here’s how:

  1. Starting Easy:
  • Begin with simple, clear images without tricky conditions.
  1. Adding Challenges:
  • Use diffusion models, which turn simple images into challenging ones by adding rain, making it nighttime, etc., while keeping depth information consistent.

  • Think of it as starting with a sunny pool picture and a computer making it look like it’s raining or night.

How It Works

  1. Text-to-Image Guidance:
  • Diffusion models use text prompts ("rainy day," "foggy night") to transform simple images into complex ones while keeping the depth right.
  1. Self-Distillation:
  • The model trains on both the easy and the newly created hard images, refining its understanding. It’s like studying a toy from different angles and under different lights to know it perfectly.

The Results

These diffusion models have been tested and proven effective. They:

  • Work across various scenarios: They handle sunny, rainy, and nighttime scenes well.

  • Enhance stability and accuracy: Depth guesses are more reliable and accurate.

  • Adapt to shiny and clear objects: They work even with reflections and transparent surfaces, which are usually tricky.

For example:

  • Models trained with this method outperformed regular ones considerably in tests.

  • They did better at guessing depths in night and rain scenes as compared to models using only simple images for training.

Why It Matters

This is important for things like self-driving cars, where understanding the scene depth under all weather conditions can save lives. It's also useful in augmented reality and robotics, making these applications more reliable and versatile.

So, just like turning a clear sunny day pool picture into a rainy or night-time scene helps you understand the pool better, these diffusion models turn simple images into tough ones and help computers guess depths accurately under any condition.

For more info, you can read the full paper on here

Get the main ideas from scientific papers easily in your inbox. Subscribe to PaperSimplified.

0 Comments
2024/07/24
21:43 UTC

1

[R] Looking for the name of a technique to approximate a non-Markovian stochastic process as one component of a higher-dimensional Wiener process with drift.

This might belong more on a math subreddit, it is adjacent to the type of math used in diffusion and RL so hoped to try if anyone is familiar with it here.

We want to generate samples with respect to some distribution P[x(t)] that we know up to a normalization constant. The distribution is defined on a 1D path space x(t) with t in [0,1]. Let's say that we can write down P[x(t)] as a functional of x(t) as a closed-form expression.
The distribution has a parameter b, and when b=0 the distribution is a simple Wiener process where each x_{t+1} has a Gaussian increment on top of x_{t}. Now we turn on "b"' and this property breaks and the distribution becomes non-Markovian, but for b is small it is "almost" Markovian in some sense.

We could now introduce an approximate model, where we have a 1+N dimensional system with trajectories x(t), y(t), z(t), w(t) .... purely with Markovian Wiener process dynamics and position and time-dependent drifts defined on the full N+1 space. In other words, the drift of x(t), could depend on the positions of the other dimensions, and their noise increments can be correlated all between one another as well. It should now be possible to set up this system in such a way that if we only track the trajectories generated by one of the dimensions x(t), it will approximate the samples from the original non-Markovian problem.

As a simple example of why this should be possible. Imagine that the original process P[x(t)] was obtained by starting from a high-dimensional Wiener process and then computing the marginal distribution in x(t). Clearly then such a process exists that exactly yields P[x(t)].

I want to find a technique that tells me how to optimize the drifts and variances etc for this N+1 dimensional process to approximate sampling P[x(t)] as close as I can.

I am 100% sure that this type of technique exists however I need to find references to this in context of variational inference problems.

1 Comment
2024/07/24
13:35 UTC

2

"[Discussion]" Where do you get your updates on latest research in video generation and computer vision?

As the title says, looking for some tips on how you keep track of the latest research in video generation and CV. I have been reading through https://cvpr.thecvf.com/ and it's a great source, are there any simiar ones?

1 Comment
2024/07/24
21:20 UTC

15

[R] Pre-prompting your LLM increases performance

Research done at UoW shows that pre-prompting your LLM, or providing context prior to asking your question leads to better results. Even when the context is self generated.

https://arxiv.org/pdf/2110.08387

For example asking,

"What should I do while in Rome?"

is less effective than a series of prompts,

"What are the top restaraunts in Rome?"

"What are the top sight seeing locations in Rome?"

"Best things to do in Rome"

"What should I do in Rome?"

I always figured this was the case from anecdotal evidence but good to see people who are way starter than me explain it in this paper. And while chain prompting is a little more time consuming there's chrome extensions like ChatGPT Queue that ease up the process.

Are their any other "hacks" to squeeze out better performance ?

12 Comments
2024/07/24
20:33 UTC

11

[R] Segment Anything Repository Archived - Why?

Hello ML subreddit,

I was recently made aware of the fact that the segment anything repository got made into a public archive less than a month ago (July 1st, 2024). I was not able to find any information pertaining to why this was the case, however. I know there have been a lot of derivatives of segment anything in development, but I don't know why this would have warranted a public archive.

Does anyone know why this happened and where we might be able to redirect questions/issues for the work?

1 Comment
2024/07/24
20:23 UTC

75

[N] Mistral releases a "Large Enough" model

https://mistral.ai/news/mistral-large-2407/

  • 123B parameters
  • On par with GPT-4o and Llama 3.1 405B, according to their benchmarks
  • Mistral Research License allows usage and modification for research and non-commercial purposes
3 Comments
2024/07/24
19:04 UTC

6

[P] NCCLX mentioned in llama3 paper

The paper says `Our collective communication library for Llama 3 is based on a fork of Nvidia’s NCCL library, called NCCLX. NCCLX significantly improves the performance of NCCL, especially for higher latency networks`. Can anyone give more background? Any plans to release or upstream? Any more technical details?

1 Comment
2024/07/24
15:54 UTC

15

[R] Scaling Diffusion Transformers to 16 Billion Parameters

TL;DR Adding Mixture-of-Experts into a Diffusion Transformer gets you an efficient and powerful model.

Paper: https://arxiv.org/pdf/2407.11633

Abstract:

In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer, that is scalable and competitive with dense networks while exhibiting highly optimized inference. The DiT-MoE includes two simple designs: shared expert routing and expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of experts specialization gains some interesting observations: (i) Expert selection shows preference with spatial position and denoising time step, while insensitive with different class-conditional information; (ii) As the MoE layers go deeper, the selection of experts gradually shifts from specific spacial position to dispersion and balance. (iii) Expert specialization tends to be more concentrated at the early time step and then gradually uniform after half. We attribute it to the diffusion process that first models the low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling diffusion model at a 16.5B parameter that attains a new SoTA FID-50K score of 1.80 in 512×512 resolution settings. The project page: this https URL.

Visual Abstract:

https://preview.redd.it/cq6yoqoeched1.png?width=1135&format=png&auto=webp&s=1985119b5150c76bb9807f4df45d7bb44e02bd2a

Visual Highlights:

https://preview.redd.it/8xf8egk9dhed1.png?width=1109&format=png&auto=webp&s=6e25b12d9a89d78847945068469f83cb45ef1eab

1S, 2S and 4S in the middle panel refer to the number of shared experts

MoE decreases training stability, but not catastrophically

https://preview.redd.it/s6cchx2nehed1.png?width=983&format=png&auto=webp&s=c426ce2f1362bace2b4d3abef8d7e5607d0ff405

1 Comment
2024/07/24
15:12 UTC

9

[P] New registry for KitOps, an open source MLOps tool: Check out the preview (not gated)

Hey everyone,

Have you heard about KitOps? It's an open-source MLOps tool designed to streamline the handoff of AI projects across data science, app development, and SRE/DevOps teams.

There's been a lot of excitement and adoption around KitOps, and just last week, it hit a milestone of 1k installs in a single week!

One of the most requested features from KitOps users has been a purpose-built hub to host their ModelKits. Today, we're excited to share a sneak peek of what’s been developed.

Check it out at jozu.ml. If it piques your interest, you can sign up for early access at jozu.com.

We’d love to hear your feedback!

0 Comments
2024/07/24
13:57 UTC

10

[N] ICML 2024 liveblog

I'm doing an ICML liveblog, for people who aren't attending or are attending virtually and want to get more of a feel of the conference. In the past I've found it's not easy to get a good feel for a conference just from the conference website and Twitter.

I'm trying to cover as much as I can, but obviously there are lots of simultaneous sessions and only so many hours in the day! If there's anything you'd like me to cover, give me a shout.

Liveblog is here: https://mlcontests.com/icml-2024/?ref=mlcr

If you're there in-person, come say hi!

The official ICML website is here: https://icml.cc/

0 Comments
2024/07/24
13:30 UTC

2

[D] Zero-Shot Entity Matching

Hello, I am looking into finding a solution to a Zero-Shot Entity Matching model.
Ideally what I would like to do is whenever a certain entity is detected twice in two seperate sentences, I want to check if both sentences are talking about the same entity.

Any idea about SOTA models and what have been tried so far?

2 Comments
2024/07/24
09:55 UTC

0

[R] Zero Shot LLM Classification

I'm surprised there is not more research in zero shot classification with GenAI LLMs? They are pretty darn good at this, and I imagine they will just keep getting better.

E.g. see this and this

Am I missing anything? As AI advances the next 5 years, it seems inevitable to me that these foundation models will continue to grow in common sense reasoning and be the best out of the box classifiers you can get, and likely start to outperform more task specific models which fail on novel classes or edge cases.

Why isn't there more research in this? Do people just feel it's obvious?

22 Comments
2024/07/24
09:50 UTC

16

[R] Low rank field-weighted factorization machines

Our paper 'Low Rank Field-Weighted Factorization Machines for Low Latency Item Recommendation', by Alex Shtoff, Michael Viderman, Naama Haramaty-Krasne, Oren Somekh, Ariel Raviv, and Tularam Ban, has been accepted to RecSys 2024.

I believe it's of interest to the ML-driven recommender system community. I think it's especially interesting to researchers working on large scale systems operating under extreme time constraints, such as online advertising.

TL;DR: We reduce the cost of inference of FwFM models with n features and nᵢ item features from O(n²) to O(c nᵢ), where c is a small constant. This is to facilitate much cheaper large scale real-time inference for item recommendation.

Code and paper: GitHub link.

Details

FMs are widely used in online advertising because they strike a good balance between representation power, and blazing fast training and inference speed. It is is paramount for large scale recommendation under tight time constraints.

The main trick devised by Rendle et. al is computing *pairwise* interactions of n features in O(n) time. Moreover, user / context features, which are the same when ranking multiple items for a given user, can be handled separately (see the image below). The computational cost of a single recommendation becomes O(nᵢ) per item, where nᵢ is the number of item features. Consequently, adding more user or context features is practically free.

FM formula in linear time

The more advanced variants, such as Field-Aware and Field-Weighted FMs do not enjoy this property and require O(n²) time. This poses a challenge to such systems, and requires carefully thinking weather an additional user or context feature is worth the cost at inference. Typically, aggressive pruning of the field interactions is employed to dramatically reduce the computational cost, at the expense of model accuracy.

In this work we devise a reformulation of the Field-Weighted FM family using diagonal plus low-rank (DPLR) factorization of the field interaction matrix, that facilitates inference in O(c nᵢ) time per item, where c is a small constant that we control. As is the case with pruning, the price is a slight reduction in model accuracy. We show that with a comparable number of parameters, the DPLR variant outperforms pruning on real world datasets, while facilitating significantly faster inference speeds, and gaining back the ability to add user context items practically for free. Here is a short chart summarizing the results:

Diagonal+LowRank (DPLR) inference time significantly outperforms pruned time, and decreases quickly as the portion of context features (out of 40 total features) is increased. Plotted for various ad auction sizes and model ranks.

4 Comments
2024/07/24
09:40 UTC

Back To Top