LMArena cofounders Anastasios N. Angelopoulos, Wei-Lin Chiang, and Ion Stoica sit down with a16z general partner Anjney Midha to talk about the future of AI evaluation. As benchmarks struggle to keep up with the pace of real-world deployment, LMArena is reframing the problem: what if the best way to test AI models is to put them in front of millions of users and let them vote? The team discusses how Arena evolved from a research side project into a key part of the AI stack, why fresh and subjective data is crucial for reliability, and what it means to build a CI/CD pipeline for large models.

They also explore:
Why expert-only benchmarks are no longer enough.
How user preferences reveal model capabilities, and their limits.
What it takes to build personalized leaderboards and evaluation SDKs.
Why real-time testing is foundational for mission-critical AI.

Follow everyone on X:
Anastasios N. Angelopoulos
Wei-Lin Chiang
Ion Stoica
Anjney Midha

Timestamps
0:04 - LLM evaluation: From consumer chatbots to mission-critical systems
6:04 - Style and substance: Crowdsourcing expertise
18:51 - Building immunity to overfitting and gaming the system
29:49 - The roots of LMArena
41:29 - Proving the value of academic AI research
48:28 - Scaling LMArena and starting a company
59:59 - Benchmarks, evaluations, and the value of ranking LLMs
1:12:13 - The challenges of measuring AI reliability
1:17:57 - Expanding beyond binary rankings as models evolve
1:28:07 - A leaderboard for each prompt
1:31:28 - The LMArena roadmap
1:34:29 - The importance of open source and openness
1:43:10 - Adapting to agents (and other AI evolutions)

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund.
a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.