AI BENCHMARKS ARE BROKEN! [Prof. MELANIE MITCHELL]

Shared 2023-09-10
Patreon: www.patreon.com/mlst
Discord: discord.gg/ESrGqhf5CB

Pod version: podcasters.spotify.com/pod/show/machinelearningstr…

Prof. Melanie Mitchell argues that the concept of "understanding" in AI is ill-defined and multidimensional - we can't simply say an AI system does or doesn't understand. She advocates for rigorously testing AI systems' capabilities using proper experimental methods from cognitive science. Popular benchmarks for intelligence often rely on the assumption that if a human can perform a task, an AI that performs the task must have human-like general intelligence. But benchmarks should evolve as capabilities improve.

Large language models show surprising skill on many human tasks but lack common sense and fail at simple things young children can do. Their knowledge comes from statistical relationships in text, not grounded concepts about the world. We don't know if their internal representations actually align with human-like concepts. More granular testing focused on generalization is needed.

There are open questions around whether large models' abilities constitute a fundamentally different non-human form of intelligence based on vast statistical correlations across text. Mitchell argues intelligence is situated, domain-specific and grounded in physical experience and evolution. The brain computes but in a specialized way honed by evolution for controlling the body. Extracting "pure" intelligence may not work.

Other key points:

- Need more focus on proper experimental method in AI research. Developmental psychology offers examples for rigorous testing of cognition.
- Reporting instance-level failures rather than just aggregate accuracy can provide insights (see the sketch after this list).
- Scaling laws are an interesting area of complex systems science, with applications to understanding cities.
- Concepts like "understanding" and "intelligence" in AI force refinement of fuzzy definitions.
- Human intelligence may be more collective and social than we realize. AI forces us to rethink concepts we apply anthropomorphically.
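
As a rough illustration of the instance-level reporting point above, here is a minimal Python sketch. The item IDs, categories, and pass/fail results are invented for illustration and are not taken from any of the referenced papers; the point is simply that a per-category breakdown of failures carries information that a single accuracy number hides.

from collections import defaultdict

# Hypothetical per-item results: (item_id, category, model_was_correct).
results = [
    ("q1", "negation",   False),
    ("q2", "negation",   False),
    ("q3", "paraphrase", True),
    ("q4", "paraphrase", True),
    ("q5", "arithmetic", True),
    ("q6", "arithmetic", False),
]

# The usual aggregate number: it hides where the model fails.
aggregate = sum(ok for _, _, ok in results) / len(results)
print(f"aggregate accuracy: {aggregate:.2f}")

# Instance-level reporting: group results by category and list the failures,
# which exposes that every "negation" item fails while "paraphrase" items pass.
by_category = defaultdict(list)
for item_id, category, ok in results:
    by_category[category].append((item_id, ok))

for category, items in by_category.items():
    acc = sum(ok for _, ok in items) / len(items)
    failed = [item_id for item_id, ok in items if not ok]
    print(f"{category}: accuracy {acc:.2f}, failed items: {failed}")

The aggregate number (0.50 here) says nothing about where the failures cluster; the per-category breakdown does.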

The overall emphasis is on rigorously building the science of machine cognition through proper experimentation and benchmarking as we assess emerging capabilities.

TOC:

[00:00:00] Introduction and Munk AI Risk Debate Highlights
[00:05:00] Douglas Hofstadter on AI Risk
[00:06:56] The Complexity of Defining Intelligence
[00:11:20] Examining Understanding in AI Models
[00:16:48] Melanie's Insights on AI Understanding Debate
[00:22:23] Unveiling ConceptARC
[00:27:57] AI Goals: A Human vs Machine Perspective
[00:31:10] Addressing the Extrapolation Challenge in AI
[00:36:05] Brain Computation: The Human-AI Parallel
[00:38:20] The ARC Challenge: Implications and Insights
[00:43:20] The Need for Detailed AI Performance Reporting
[00:44:31] Exploring Scaling in Complexity Theory

Errata:

Note: around 39 minutes, Tim said that a recent Stanford/DeepMind paper modelling ARC "got around 60% on GPT-4". This is not correct; he misremembered. The model was actually davinci3, and the result was around 10%, which is still extremely good for a blank-slate approach with an LLM and no ARC-specific knowledge. Folks on our forum couldn't reproduce the result. See the paper linked below.

Books (MUST READ):

Artificial Intelligence: A Guide for Thinking Humans (Melanie Mitchell)
www.amazon.co.uk/Artificial-Intelligence-Guide-Thi…

Complexity: A Guided Tour (Melanie Mitchell)
www.amazon.co.uk/Audible-Complexity-A-Guided-Tour?…

See rest of references in pinned comment.
Show notes + transcript atlantic-papyrus-d68.notion.site/Melanie-Mitchell-…

Comments (21)
  • Refs: Papers/Misc:
    Why AI is Harder Than We Think (Melanie Mitchell, 2021) arxiv.org/abs/2104.12871
    MLST #57 - Prof. MELANIE MITCHELL - Why AI is harder than we think https://www.youtube.com/watch?v=A8m1Oqz2HKc
    MLST - MUNK DEBATE ON AI (COMMENTARY) [DAVID FOSTER] (featuring Melanie) https://www.youtube.com/watch?v=V4UkcU1hDZE
    How to Build Truly Intelligent AI (Quanta Magazine, with Melanie) [we used clips from here] https://www.youtube.com/watch?v=cz1UfjZjjyk
    Ingredients of understanding [Dileep George] - MUST READ! dileeplearning.substack.com/p/ingredients-of-under…
    Do half of AI researchers believe that there's a 10% chance AI will kill us all? [Mitchell] aiguide.substack.com/p/do-half-of-ai-researchers-b…
    Douglas Hofstadter changes his mind on Deep Learning & AI risk www.lesswrong.com/posts/kAmgdEjq2eYQkB5PP/douglas-…
    The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain [Mitchell] arxiv.org/pdf/2305.07141.pdf
    How do we know how smart AI systems are? [Mitchell] www.science.org/doi/10.1126/science.adj5957
    ChatGPT broke the Turing test - the race is on for new ways to assess AI www.nature.com/articles/d41586-023-02361-7
    The Debate Over Understanding in AI's Large Language Models (modes of understanding) [Mitchell] arxiv.org/pdf/2210.13966.pdf
    Rethink reporting of evaluation results in AI [Mitchell with many others] melaniemitchell.me/PapersContent/BurnellEtAlScienc…
    Probing the psychology of AI models [Richard Shiffrin and Melanie Mitchell] www.pnas.org/doi/10.1073/pnas.2300963120
    Evaluating Understanding on Conceptual Abstraction Benchmarks melaniemitchell.me/PapersContent/EBeM_Workshop2022…
    Abstraction for Deep Reinforcement Learning [Shanahan/Mitchell] arxiv.org/pdf/2202.05839.pdf
    What Does It Mean to Align AI With Human Values? [Mitchell] www.quantamagazine.org/what-does-it-mean-to-align-…
    What Does It Mean for AI to Understand? [Mitchell] www.quantamagazine.org/what-does-it-mean-for-ai-to…
    Large language models aren't people. Let's stop testing them as if they were. [Will Douglas Heaven] www.technologyreview.com/2023/08/30/1078670/large-…
    The Contemporary Theory of Metaphor (George Lakoff) terpconnect.umd.edu/~israel/lakoff-ConTheorMetapho…
    Verbal Disputes (about understanding/conceptual engineering) [David J. Chalmers] consc.net/papers/verbal.pdf
    Large Language Models as General Pattern Machines (ARC modelling with an LLM) [DeepMind/Stanford, Mirchandani et al.] arxiv.org/pdf/2307.04721.pdf
    Michael Frank, Stanford (referenced on building better AI experiments) profiles.stanford.edu/michael-frank?tab=publicatio…
    On the Measure of Intelligence (Chollet) arxiv.org/abs/1911.01547
    Elizabeth Spelke - Core knowledge www.harvardlds.org/wp-content/uploads/2017/01/Spel…
    Computer Science as Empirical Inquiry: Symbols and Search (Newell and Simon) [Physical Symbol System Hypothesis] dl.acm.org/doi/pdf/10.1145/360018.360022
    Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks (MIT) arxiv.org/pdf/2307.02477.pdf
    Isaac Asimov (en.wikipedia.org/wiki/Isaac_Asimov) Foundation series en.wikipedia.org/wiki/Foundation_series
    Sparks of AGI paper [read with a large pinch of salt!] arxiv.org/abs/2303.12712
  • It seems premature to consider in what ways AI might be malevolent toward people when nothing has been discussed about safeguarding against evil people utilizing AI.
  • @tedhoward2606
    Great interview. The discussion around 26:40 of structural vs learned vs implicit skills is interesting to me. In 1973 I first encountered a situation that forced me to consider that other people might not be abstracting ideas and working from first principles. I was given direct admission to second-year biochem in my first year at uni, so in my third year I had already completed third-year biochem the year before, but most others in my "year" were just doing it. One of them (a straight-A student) asked me one night at a party how I learned the questions and answers. Over a five-minute discussion, it became clear that she had gotten all her "A"s by learning the expected responses to questions. I was only interested in the conceptual "structures". It occurred to me that while we were using language, we meant entirely different things by most of the terms, and there was little or no communication (as in concepts being shared between minds) happening.
    I have been diagnosed as "autistic spectrum", but the term is extremely misleading. I am "different" in many ways. I have tetrachromatic vision; I hear in the standard range up to 13,500 Hz, then in three higher ultrasonic spectra. I relate to things spatially and can close my eyes and model hundreds of km of roads with every corner in place, for example, or model plate-tectonic subduction zones at plate boundaries (like the one I live on). A whole series of events and circumstances meant that I got used to being happy with having my own concepts, without the need for social agreement, though I could go along with things socially when required.
    To me, as someone with a lifelong interest in both biology and computation and "reality" more generally, the very idea of "reason" is an often useful simplification of something deeply more complex. The reality we live in seems to contain multiple classes of fundamental uncertainty and unknowability, even as in many contexts it does usefully approximate classically causal systems. Human intelligence is a very variable thing, both across individuals and within individuals across time, domains and contexts. I'm reasonably comfortable that I have a reasonable handle on the major classes of systems involved.
    I have been working with the paid version of ChatGPT4 on some problems, and put one of those online (https://tedhowardnz.wordpress.com/2023/08/29/a-chat-with-gpt4-on-value/). What ChatGPT managed during that conversation was extremely interesting: while it could lock onto particular and appropriate chains of words, there was also a general tendency for it to revert to the biases implicit in language generally in the population at large, which is exactly what one would expect from the particular structure of the neural networks.
    The sort of structure required to deliver intelligence has been obvious to me for a very long time, and some are getting very close. Uplift was far closer to AGI than ChatGPT, and I strongly suspect that Google's team will have achieved AGI already, which does pose multiple levels of issues. Jaak Panksepp pioneered a set of concepts that seem to me to be very useful approximations to the major drivers of consciousness. Jeff Hawkins' Thousand Brains model adds a useful dimension. Seth Grant and his team have done great work on how brains actually achieve pattern matching.
    Put all of that into the recursive notion of "Life as Search" referenced in the GPT4 chat on value above, and you have a set of useful approximations to the gnarly problem of consciousness. I have the particular form of consciousness that I have, and I am starting to strongly suspect that it is a very different form of consciousness from that experienced by most human beings; of course, it does share some attributes. So yes, some of intelligence is physical, some computational, some experience-dependent, some domain-specific at various levels and classes of domains. And when one recursively considers oneself as an agent experiencing its own model of reality, able to modify both the model and any set of abstractions one uses for model evaluation and design; when one sees sapient life as search, across the domain of all possible models, computational systems and strategies, for the survivable; and when one delves into the depths of evolutionary strategy across all contexts; then one sees the fundamental role of cooperation, and the need for eternally evolving ecosystems of cheat detection and mitigation systems. Such awareness is a direct short-term threat to the cheating systems currently dominating most economic and political realms, even as it is also in their long-term self-interest to modify their behaviour to be cooperative.
  • @aitheignis
    I really love the part where you discussed proper experimentation and hidden assumptions. I always feel an itch when people claim that GPT can do this or that like a human, but they don't have any clear definition of what they are measuring or whether what they measure is even statistically significant (sure, there is a replication crisis in science and the p-value cutoff is pretty arbitrary, but at least it can prune out purely random effects to some extent). They don't even try to define a proper benchmark. For example, in the summarization task, what actually is a good summary? Most people in the field bypass thinking about this by using RLHF or crappy metrics like all of those n-gram based methods, when it is very important to actually properly define what good summarization is. Or take the alignment field, where people don't even define in a robust mathematical structure what alignment is.
  • @BrianMosleyUK
    We're using an LLM which has finished training... What we need is an LLM which can learn at the time of 'thinking' or working on the problem.
  • Top Quotes!
    "I think that's exactly right. And what's interesting is we, computer scientists, were never trained in experimental methods. We never learned about, like, controls and confounding factors." [00:19:52]
    "So it's this notion that intelligence is this thing that you can just have more and more of." [00:32:32]
    "People have different views about the nativism versus empiricism debate. And there's whole different schools in cognitive science about, like, how much is learned, how much is evolutionarily built in, and all of that." [00:34:12]
    "And so I think the brain is doing computations, but it's doing very, very highly evolved, very domain-specific computations that perhaps don't necessarily make sense without having a body." [00:37:35]
    "So in most cases, we have to rely on behavior, which is very noisy. I think it can be misleading." [00:47:27]
    "People are starting to do this kind of more scientifically grounded experimental method on language models, but there's still not very much of it." [00:51:05]
    "I think, you know, in science, if you're looking at a phenomenon, you're trying to replicate it. If it only replicates half the time, that's not a robust replication." [00:54:31]
    "Traditionally in machine learning, people use accuracy and similar kinds of aggregate measures to report their results. And, you know, if someone tells you that the accuracy was 78 percent, what does that tell you exactly?" [00:56:01]
    "But if you're interested in it, the one big topic that people look at is called scaling. And it's the question of, like, what happens to a system as it gets bigger in some sense." [00:56:54]
    "And there was a fantastic talk by Dave Chalmers, the philosopher, who I think you've probably had on this show, where he talks about conceptual engineering, which is something that philosophers do, where they take a term, like understanding, and they refine it." [00:19:03]
    "But solving ARC doesn't mean we're at AGI." [00:42:07]
    "I think that we have to keep changing our benchmarks. We can't just say, okay, here's ImageNet, go beat on that for the next 20 years until you've solved it. That's not going to yield general intelligence." [00:43:39]
    "Yeah, I agree. I mean, you know, one question is that ARC is a very idealized kind of micro-world type domain. So does it capture what's interesting about the real world in terms of abstraction?" [00:42:51]
    "Well, if you had a program that really could solve these tasks in a general way, then however it worked, it would be a good AI solution." [00:40:27]
    "I do think all of our benchmarks have, as you say, this problem: they have assumptions built in that if a human could do this, then if the machine does it, it has the same kind of generalization capacity as a human who could solve that problem." [00:31:19]
    "There's individual intelligence, and then there's collective intelligence. And how much of the intelligence that we have individually is actually grounded in a more collective intelligence?" [01:01:19]
    [00:06:56] Melanie Mitchell: "Herbert Simon even said that explicitly. But then we saw that chess actually could be conquered by very unintelligent brute-force search that didn't generalize in any way."
    [00:08:22] Melanie Mitchell: "I do think that they're [LLMs are] intelligent. Well, you know, intelligence is an ill-defined notion, multidimensional, and I don't know if we can say yes or no about something being intelligent rather than intelligent in certain ways or to certain degrees."
    [00:09:38] Melanie Mitchell: "That kind of goes along with the whole metaphor theory of cognition of Lakoff et al., that, you know, we're sort of building on these physical metaphors, so we can build up many, many layers of abstraction."
    [00:14:23] Tim Scarfe: "We see that humans who can do A can do B, and now we see machines that can do A, and assume they can do B ... we have all of these built-in assumptions in benchmarks and we don't really realize that we're talking about machines now."
    [00:19:40] Melanie Mitchell: "Well, no. He [Douglas Hofstadter] was quite worried that it was going to happen sooner than he thought, and, you know, his quote that AI is gonna leave us in the dust."
    [00:19:52] Melanie Mitchell: "We have to really specify what we mean exactly."
    [00:25:58] Melanie Mitchell: "We are gonna build a science of machine cognition, you know; this work has to be done."
    [00:28:45] Melanie Mitchell: "Yeah. I mean, you know, traditionally in machine learning, people use accuracy and similar kinds of aggregate measures to report their results."
  • @stretch8390
    Bought Melanie's book A Guide for Thinking Humans after seeing her round 1 on your channel and have since passed it on to many people. It's the perfect balance of technical but approachable for those on the outskirts of these ideas. Looking forward to this one!
  • @duudleDreamz
    Excellent interview. A much-needed and important discussion, which is often overlooked and forgotten amid the current LLM excitement. (That said, my GPT4 version has no problem doing addition etc. in base 8, and other bases for that matter, and I tested it many times.)
  • @CapsAdmin
    On the point about the brain being general purpose vs domain specific: some arguments you didn't mention in favor of it being more general purpose are that people can be born blind, deaf, without limbs, or even without half the brain, and so on, and the brain seems to adapt in those situations.
  • 42:00 Lc0 doesn't know how to play Chess960 because it has slightly different rules to it, but that's not the same thing as saying it "fails" to play it because it doesn't "generalize". The only issue with Chess960 is the castling, which has different rules to normal chess. If you ignore that part, Lc0 will kick anyone's ass. And the reason is that it knows how to play chess, generally. In a general way!
  • @DarkSkay
    In general, benchmarks assign a small set of expected/desired answers a positive score; the infinite set of different answers all get the same score of 0. No matter how ignorant, bad, dumb, dangerous or instead nuanced, out-of-the-box, original, brilliant the answer is: all get a score of 0. Real life doesn't work like that.
  • @Thrashmetalman
    This was my advisor back in grad school. Glad to see she is giving these talks.
  • @staycurious3954
    Who are the guests? Strange their names aren’t anywhere to be found in the description. That one guy looks like Tom Green’s sidekick from back in the day?
  • @exhibitD79
    Really enjoyed this conversation. I am following a lot these discussions purely for the exploration into how we expand our knowledge of ourselves, not just what Ai can do.
  • As a complete lay person, it seems like the most obvious use for AI is as a tool to identify unintended results/game-breaking bugs of prospective incentive-based systems. Or possibly as a tool to help navigate a possible exit from an incentive-based system that we are too deep into to be able to see properly.
  • @CodexPermutatio
    Oh! Melanie is back. Excellent! This is going to be good so you already have my thumbs up.
  • @kimholder
    What set of benchmarks would tell you if the AI is lying? Once you know how to test competence, you still have to test intention. Whether or not it's felt AIs have goals in the sense that leads to a world of risk, we have to get a handle on how to check. The existence of such a thing would immediately lead to the possibility of lying about it, if that would advance achievement of the goal.