Dev and Doc: AI For Healthcare Podcast
Everything you need to know about LLM benchmarks: Turing Test, OpenAI's HealthBench, ARC Prize, LM Arena
22 Aug 2025
Wherever there has been AI, there have been benchmarks. From the Turing test, to society-changing benchmarks like MNIST and ImageNet, to modern problems like the ARC Prize, benchmarks have served a vital purpose: measuring the performance of AI models. But something has shifted. In the LLM era, have benchmarks lost their utility and become mere advertising for big tech? Even seemingly more sophisticated benchmarks like LM Arena can be gamed by tech giants. We also take a deep dive into healthcare benchmarks like OpenAI's HealthBench (deeply problematic) and Microsoft's AI-DXO orchestrator agent for diagnosis. Where is this all going? How do we make the perfect benchmark? Or is the real work to be done afterwards, in the real world?

👋 Hey! If you are enjoying our conversations, reach out, share your thoughts and journey with us. Don't forget to subscribe whilst you're here :)

---

Timestamps
00:00 Intro - The OG benchmarks - Turing test, MNIST, ImageNet
06:40 Are large language model benchmarks similar to humans taking tests?
10:05 Are we testing model capability or production readiness?
12:00 LLM era - data contamination
15:30 LM Arena - The Leaderboard Illusion paper - how big tech games benchmarks
28:35 Goodhart's law - When a measure becomes a target, it ceases to be a good measure
32:05 Some good benchmarks - games - Pokemon, ARC Prize, Minecraft
34:35 Medical benchmarks - OpenAI's HealthBench has some big problems
46:50 Microsoft AI-DXO orchestrator for case reports

---

Connect with Us

Your Hosts:
👨🏻⚕️ Doc - Dr. Joshua Au Yeung - LinkedIn
🤖 Dev - Zeljko Kraljevic - Twitter

Follow & Subscribe:
YT: https://youtube.com/@DevAndDoc
Spotify: Follow us on Spotify
Apple Podcasts: Listen on Apple Podcasts
Substack: https://aiforhealthcare.substack.com/

For enquiries:
📧 [email protected]

Credits
🎞️ Editor: Dragan Kraljević - Instagram
🎨 Brand & Art: Ana Grigorovici - Behance