/r/mlscaling


ML/AI/DL research on approaches using large models, datasets, and compute: "more is different"

Subreddit for discussing AI, machine learning, or deep learning approaches involving big numbers: billions of parameters, millions of n, petaflops, etc. (e.g., GPT-3). Most research is conducted at much smaller scale; this subreddit is for research analogous to 'high-energy physics', requiring specialized approaches, large investments, consortia, etc.

Topics: How? Who? Why do they work? What are they good for? What resources are available? Who will pay & how? What is the future of such approaches? What global consequences will there be?


0

Keep an eye on Japan

0 Comments
2024/11/29
21:05 UTC

2

fascinating first-hand anecdotes from "digital sweatshop workers"

I'm not endorsing the politics here, but I think everyone in this subreddit would find this quite interesting!

https://youtu.be/qZS50KXjAX0?si=X9Rt6hVmULDbJHu2

5 Comments
2024/11/29
20:57 UTC

27

Number of announced LLM models over time - the downward trend is now clearly visible

6 Comments
2024/11/27
10:33 UTC

12

Scattered Forest Search: Smarter Code Space Exploration with LLMs, Light et al. 2024

Paper: https://arxiv.org/pdf/2411.05010

Highlights:

Drawing from optimization theory, we develop SCATTERED FOREST SEARCH (SFS) to efficiently search for code solutions that successfully pass the maximum number of validation tests. [...]

Specifically, SFS contains three key techniques. SCATTERING is a novel technique that dynamically varies input prompts when sampling from LLMs, driving more diverse and exploratory outputs. In SCATTERING, the LLM suggests different textual optimization directions and steps, analogous to gradients in numerical optimization, before advancing towards a new solution. During tree search refinement, SCATTERING effectively perturbs or mutates previous solutions, resulting in an evolutionary search process. We further propose FORESTING, the tree search equivalent of multi-start optimization, where SCATTERING plays a crucial role in diversifying initialization seeds, ensuring they are well-distributed throughout the search space. This enhances the breadth of exploration while effectively mitigating clearly incorrect solutions, such as those containing syntax errors.

Additionally, drawing inspiration from ant colony optimization and particle swarm optimization, we introduce SCOUTING to enhance SCATTERING by sharing feedback and experiences across search branches. When one branch discovers positive (or negative) results by following a specific textual direction, this information is relayed to guide future search steps, encouraging or discouraging exploration in that direction. Consequently, SCOUTING improves exploitation by intensifying the search around promising textual directions.
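To make the structure concrete, here is a toy, self-contained Python sketch of the three mechanisms as described above: scattering as diversified perturbation, foresting as multi-start, and scouting as reward sharing across branches. Everything here is illustrative; a real implementation would query an LLM for directions and revisions where this toy mutates a string, and none of these names come from the paper's code.

```python
import random

TARGET = "print('hello world')"            # stand-in for "passes all the tests"
DIRECTIONS = ["insert", "delete", "replace"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz'() _"

def score(code: str) -> float:
    # Toy fitness: fraction of characters matching the target program.
    return sum(a == b for a, b in zip(code, TARGET)) / max(len(code), len(TARGET))

def perturb(code: str, direction: str) -> str:
    # Stand-in for "the LLM revises the code along a textual direction".
    i = random.randrange(max(len(code), 1))
    if direction == "insert":
        return code[:i] + random.choice(ALPHABET) + code[i:]
    if direction == "delete" and len(code) > 1:
        return code[:i] + code[i + 1:]
    return code[:i] + random.choice(ALPHABET) + code[i + 1:]

def sfs(n_trees: int = 20, depth: int = 300) -> str:
    shared = {d: 0.0 for d in DIRECTIONS}    # SCOUTING: rewards shared by all trees
    best = ""
    for _ in range(n_trees):                 # FORESTING: multi-start initialization
        code = "".join(random.choices(ALPHABET, k=len(TARGET)))  # diverse seed
        for _ in range(depth):
            # SCATTERING, biased by scouting feedback on each direction
            weights = [1.0 + max(shared[d], 0.0) for d in DIRECTIONS]
            d = random.choices(DIRECTIONS, weights)[0]
            candidate = perturb(code, d)
            reward = score(candidate) - score(code)
            shared[d] += reward              # relay this outcome to future branches
            if reward > 0:
                code = candidate
        if score(code) > score(best):
            best = code
    return best

print(sfs())
```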

Highlights, visually:

https://preview.redd.it/9vzpjtfqq13e1.png?width=1045&format=png&auto=webp&s=435f7e2d13e4e903b1c2b364a59d9082a27a24e7

On the benefits of poetry and chaos:

We varied the types of seed instructions used to generate seed code during BoN sampling to validate the effects of increasing solution diversity with SCATTERING. In the Jabberwocky setting, the model was prompted with different lines from the humorous nonsense poem “Jabberwocky” before generating seed code...

Jabberwocky handily beats the baseline (no prompt variations), but the more on-topic variations, denoted 'Role' and 'Style', are still better.

Evaluations:

https://preview.redd.it/4c04lb4us13e1.png?width=1125&format=png&auto=webp&s=275b3519c34eae6c7f00e4139f1463283f067026

https://preview.redd.it/p5cy08g5t13e1.png?width=1103&format=png&auto=webp&s=cd1dce4591f0522384d13574a20ac0ed83270798

The results for SFS differ from Table 2 because Table 2 is evaluated under a 10-generation budget and Table 4 under a 40-generation budget.

https://preview.redd.it/w0wngjllt13e1.png?width=1131&format=png&auto=webp&s=5b0b9aa319dffc1fc6b4c3c6646df387e2355806

https://preview.redd.it/qsz05kyut13e1.png?width=563&format=png&auto=webp&s=527b16a4bd57dd04832fceb6e66baf97ad440c9d

Discussion:

Unfortunately, the authors don't report token use in comparisons with other methods, which makes the gains less clear.

The choice of the main LLM for evaluations is a bit surprising: gpt-3.5-turbo is proprietary and woefully behind the state of the art (say, behind Llama 3 8B). The authors run their method with GPT-4o and 4o-mini as backbones but choose HumanEval as the benchmark, where their score is already saturated even before applying fancy techniques.

Finally, I have seen suggestions that instruction-tuned models lack solution diversity compared to their non-IT counterparts. Since the method relies on output diversity, it'd be interesting to see whether using a non-IT model for some steps would yield better results.

[Edit: fixed duplicate image, typos]

0 Comments
2024/11/25
13:56 UTC

0

How to make LLMs capable of higher levels of achievement in the arts and humanities?

All new ideas are ultimately recombinations of existing ideas and experiences; Hume was right about that much, I think. LLMs recombine existing material, but this does not, of itself, pose a qualitative barrier to creativity. The rub is that they're just not that good at it.

I've seen LLMs propose original ideas that have never been seen before. I know this because I gave one a question I am 99% sure no one had ever asked before ("Consider an LLM contemplating the problem of skepticism: what questions would arise for an LLM that wouldn't arise for a human?"). It had a reasonable go at it, at about the level one would expect from a smart grad student in philosophy. But outside extraordinary circumstances, they don't say much that's new.

I'm not talking about earth-shattering stuff here, just plain new and good. In Chris Fleming's Sick Jan, the narrator describes the titular Sick Jan as wearing "enough turquoise to get into Stevie Nicks's house". Can you imagine an LLM saying that?

The problems are multiple:

  1. The very way LLMs are trained encourages them towards a kind of sameness.

  2. Creating new ideas takes time, free play, stewing, and randomness. It requires something like o1's chain of thought but more... aimless? No one has done this yet.

  3. There are no "worked example datasets" of creating new ideas in the humanities. To a degree, things like this do exist in maths, but not in, e.g., philosophy or historiography.

  4. Below the top levels, this stuff isn't that popular, which encourages the companies not to care about hitting these goals and to focus on mediocre trash instead. This fake Sylvia Plath poem was preferred to the real thing in one study:

The air is thick with tension,
My mind a tangled mess.
The weight of my emotions
Is heavy on my chest.
The darkness creeps upon me,
A suffocating cloak.
The world outside is cruel and cold,
And I'm a fragile, broken yolk.
My thoughts are spinning wildly,
A cyclone in my brain.
I try to grasp at something solid,
But all is lost in vain.
The voices in my head,
They never cease to scream.
And though I try to shut them out,
They haunt me like a dream.
So here I am, alone and lost,
A ship without a sail.
In this world of pain and sorrow,
I am but a mere wail

I suspect much the same would be true of, e.g., essays in philosophy, with many people preferring what is less good.

  5. It is hard to quantitatively measure progress towards genuine creativity.

Frankly, I'm grateful this barrier exists to replacing me, but I am morbidly curious about how one would go about cracking it.

1 Comment
2024/11/24
02:42 UTC

40

the fate of gpt-4o

This post is speculation + crystal balling.

OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.

gpt-4o-2024-11-20 boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20's output roughly 60% of the time (see the quick check below).
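For reference, a 75-point gap implies roughly a 60/40 preference split under the standard logistic Elo formula. Chatbot Arena's actual Bradley-Terry fitting differs in details, so treat this as a rough sanity check, not their exact number:

```python
# Expected head-to-head win rate implied by an Elo gap (standard logistic
# Elo formula; approximate for Chatbot Arena's Bradley-Terry rankings).
def elo_win_prob(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

print(elo_win_prob(1360, 1285))  # ~0.61: closer to a 60/40 split than 70/30
```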

I believe this is the result of aggressive human preference-hacking on OpenAI's part, not any real advances.

Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.

Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.

Optimizing for human preference is not a wrong thing to do, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.

But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities - the top of the chart is mainly determined by style and presentation.

Benchmarks tell a different story: gpt-4o's abilities are declining.

https://github.com/openai/simple-evals

In six months, GPT-4o's 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.

(To be clear, "GPT-4" doesn't mean "an older GPT-4o" or "GPT-4 Turbo", but "the original broke-ass GPT-4 from March 2023, with 8k context, no tools/search/vision, and a Sept 2021 training cutoff".)

I am more concerned about the collapse of GPT-4o's score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly in light of the tendency for scores to rise as data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)

Even this may be optimistic:

https://x.com/ArtificialAnlys/status/1859614633654616310

An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They've downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI's free model) in capabilities.

Further benching here:

https://artificialanalysis.ai/providers/openai

Some of their findings complicate the picture I've just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI's internal evals do), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except for token generation speed.

Livebench

https://livebench.ai/

GPT-4o's scores appear to be either stagnant or regressing.

gpt-4o-2024-05-13 -> 53.98

gpt-4o-2024-08-06 -> 56.03

chatgpt-4o-latest-0903 -> 54.25

gpt-4o-2024-11-20 -> 52.83

Aider Bench

https://github.com/Aider-AI/aider-swe-bench

Stagnant or regressing.

gpt-4o-2024-05-13 -> 72.9%

gpt-4o-2024-08-06 -> 71.4%

chatgpt-4o-latest-0903 -> 72.2%

gpt-4o-2024-11-20 -> 71.4%

Personal benchmarks

It doesn't hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you'll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)

I like to ask LLMs to list the levels in the 1997 PC game Claw (an obscure videogame).

Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw's levels correct.

GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.

(once, it listed "Wreckage" as a level in the game. That's actually a custom level I helped make when I was 14-15. I found that weirdly moving: I'd found a shard of myself in the corpus.)

GPT-4o scores like ass: typically in the sub-50% range. It doesn't even consistently nail how many levels are in the game. It correctly lists some levels, but these are mostly out of order. It has strange fixed hallucinations. Over and over, it insists there's a level called "Tawara Seaport"—which is a real-world port in the island nation of Kiribati. Not even a sensible hallucination given the context of the game.
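For what it's worth, a harness for this kind of personal benchmark can be a few lines. In this sketch the ground-truth names are placeholders rather than Claw's actual level list, and ask_model() stands in for whatever chat-completion client you use:

```python
# Minimal sketch of the personal-benchmark idea: ask a model to list the
# game's levels, then score recall against a ground-truth list. The names
# below are placeholders, NOT Claw's actual level list, and ask_model() is
# a stand-in for a real chat-completion client.
GROUND_TRUTH = {"la roca", "the battlements", "the dark woods"}  # placeholders

def ask_model(prompt: str) -> str:
    # Replace with a real API call; canned answer here for illustration only.
    return "Claw's levels include La Roca, The Dark Woods, and Tawara Seaport."

def recall(answer: str, truth: set[str]) -> float:
    # Fraction of real level names the answer mentions (ignores ordering).
    answer = answer.lower()
    return sum(name in answer for name in truth) / len(truth)

answer = ask_model("List the levels in the 1997 PC game Claw.")
print(f"{recall(answer, GROUND_TRUTH):.0%} of real levels mentioned")
```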

Another prompt is "What is Ulio, in the context of Age of Empires II?"

GPT-4-0314 tells me it's a piece of fan-made content, created by Ingo van Thiel. When I ask what year Ulio was made, it says "2002". This is correct.

GPT-4o-2024-11-20 has no idea what I'm talking about.

To me, it looks like a lot of "deep knowledge" has vanished from the GPT-4 model. It's now smaller and shallower and lighter, its mighty roots chipped away, its "old man strength" replaced with a cheap scaffold of (likely crappy) synthetic data.

What about creative writing? Is it better on creative writing?

Who the fuck knows. I don't know how to measure that. Do you?

A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.

https://eqbench.com/creative_writing.html

...but you'll note that it's tied with a 9B model, which makes me wonder about Claude 3.5 Sonnet's judging.

https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-11-20.txt

Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious "fine writing".

The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity's indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship's AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.

A cacophony refers to sound: lights cannot form a cacophony. How can there be an "unceasing hum" in a "silent abyss"? How does a light gasp a final breath? WTF is this drizzling horseshit?

This is what people who don't read imagine good writing to be. It's exactly what you'd expect from a model preference-hacked on the taste of people who do not have taste.

ChatGPTese is creeping back in (a problem I thought they'd fixed). "Elara"... "once a proud envoy of humanity's indomitable spirit"... "a testament to..." At least it doesn't say "delve".

Claude Sonnet 3.5's own efforts feel considerably more "alive", thoughtful, and humanlike.

https://eqbench.com/results/creative-writing-v2/claude-3-5-sonnet-20241022.txt

(Note the small details of the thermal blanket and the origami bird in "The Last Transmission". There's nothing really like that in GPT-4o's stories.)

So if GPT-4o is getting worse, what would that mean?

There are two options:

  1. It's unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.
  2. It's intentional. In this world, a new, better model is coming, and GPT-4o is being "right-sized" for a new position in the OA product line.

Evidence for the latter is that token-generation speed has increased, which suggests they've actively made the model smaller.

If this is the path we're on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.

16 Comments
2024/11/23
22:03 UTC

20

"Manhattan Project-like program dedicated to racing to and acquiring AGI": U.S.-China Economic and Security Review Commission recommends

https://www.uscc.gov/annual-report/2024-annual-report-congress

https://www.uscc.gov/sites/default/files/2024-11/Chapter_3--U.S.-China_Competition_in_Emerging_Technologies.pdf#page=3

COMPREHENSIVE LIST OF THE COMMISSION’S 2024 RECOMMENDATIONS

Part II: Technology and Consumer Product Opportunities and Risks

Chapter 3: U.S.-China Competition in Emerging Technologies

The United States is locked in a long-term strategic competition with China to shape the rapidly evolving global technological landscape.

...

Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence (AGI) capability. AGI is generally defined as systems that are as good as or better than human capabilities across all cognitive domains and would surpass the sharpest human minds at every task. Among the specific actions the Commission recommends for Congress:

• Provide broad multiyear contracting authority to the executive branch and associated funding for leading artificial intelligence, cloud, and data center companies and others to advance the stated policy at a pace and scale consistent with the goal of U.S. AGI leadership; and

• Direct the U.S. secretary of defense to provide a Defense Priorities and Allocations System “DX Rating” to items in the artificial intelligence ecosystem to ensure this project receives national priority.

It seems similar to this, but with more details: https://www.reddit.com/r/mlscaling/comments/1e8o4dj/trump_allies_draft_ai_executive_order_includes/

https://www.reuters.com/technology/artificial-intelligence/us-government-commission-pushes-manhattan-project-style-ai-initiative-2024-11-19/

The USCC, established by Congress in 2000, provides annual recommendations on U.S.-China relations. Known for its hawkish policy proposals, the commission aims to guide lawmakers on issues of economic and strategic competition with China.

Other recommendations in this year's USCC report include repealing the de minimis trade exemption that allows Chinese goods under $800 to bypass tariffs with minimal paperwork and inspection, ending preferential capital gains treatment linked to Chinese companies on government watchlists, and requiring approval of Chinese involvement in biotechnology companies operating in the U.S.

13 Comments
2024/11/21
05:41 UTC

17

I noticed that the sub has a "Meme" flair with 0 posts, so...

12 Comments
2024/11/20
23:05 UTC

16

DeepSeek-R1-lite-preview surpasses o1-preview on math benchmarks

https://x.com/deepseek_ai/status/1859200141355536422

The CoT/reasoning tokens are not hidden, unlike OpenAI's o1 models.

There's an online demo available now on their website. They claim a full OSS model and a technical report will be coming soon.

1 Comment
2024/11/20
20:13 UTC
