Two Voice Devs

Episode 238 - LLM Benchmarking: What, Why, Who, and How

09 May 2025

Description

How do you know if a Large Language Model is good for your specific task? You benchmark it! In this episode, Allen speaks with Amy Russ about her fascinating career path from international affairs to data, and how that unique perspective now informs her work in LLM benchmarking.

Amy explains what benchmarking is, why it's crucial for both model builders and app developers, and how it goes far beyond simple technical tests to include societal, cultural, and ethical considerations like preventing harms.

Learn about the complex process involving diverse teams, defining fuzzy criteria, and the technical tools used, including data versioning and prompt template engines. Amy also shares insights on how to get involved in open benchmarking efforts and where to find benchmarks relevant to your own LLM projects.

Whether you're building models or using them in your applications, understanding benchmarking is key to finding and evaluating the best AI for your needs.

Learn More:
* ML Commons - https://mlcommons.org/

Timestamps:
00:18 Amy's Career Path (From Diplomacy to Data)
02:46 What Amy Does Now (Benchmarking & Policy)
03:38 Defining LLM Benchmarking
05:08 Policy & Societal Benchmarking (Preventing Harms)
07:55 The Need for Diverse Benchmarking Teams
09:55 Technical Aspects & Tooling (Data Integrity, Versioning)
10:50 Prompt Engineering & Versioning for Benchmarking
12:48 Preventing Models from Tuning to Benchmarks
15:30 Prompt Template Engines & Generating Prompts
17:10 Other Benchmarking Tools & Testing Nuances
19:10 Benchmarking Compared to Traditional QA
21:45 Evaluating Benchmark Results (Human & Metrics)
23:05 The Challenge of Establishing an Evaluation Scale
23:58 How to Get Started in Benchmarking (Volunteering, Organizations)
25:20 Open Benchmarks & Where to Find Them
26:35 Benchmarking Your Own Model or App
28:55 Why Benchmarking Matters for App Builders
29:55 Where to Learn More & Follow Amy

Hashtags:
#LLM #Benchmarking #AI #MachineLearning #GenAI #DataScience #DataEngineering #PromptEngineering #ModelEvaluation #TechPodcast #Developer #TwoVoiceDevs #MLCommons #QA



