
The Daily AI Show

Why Patchwork AGI Is Gaining Traction

13 Jan 2026

Transcription

Chapter 1: What is the main topic discussed in this episode?

0.622 - 3.226 Brian Maucere

Hey, what's going on, everybody? Welcome to the Daily AI Show.

Chapter 2: What improvements does ElevenLabs Scribe V2 offer for transcription?

4.208 - 13.482 Brian Maucere

It is January 12th, 2026. I had to look at the calendar there for a second. It's Monday. We appreciate you guys being here. I'm here with Andy.

Chapter 3: What challenges does timestamp drift present in multimodal models?

13.722 - 31.629 Brian Maucere

I'm Brian. And we will be your co-hosts for the day. We'll see if anybody else pops in the door, but it might just be Andy and I. But that's okay. We're ready to have a great conversation along the way and talk about all things AI. And as we typically do, we want to start off the show by talking about some of the news items.

Chapter 4: How does DeepMind's Patchwork AGI concept differ from traditional AGI models?

31.649 - 44.668 Brian Maucere

Some of them come out over the weekends. You just never really know with AI. It doesn't typically take Saturdays and Sundays off, but also things just slip through the cracks that maybe we didn't have a chance to talk about earlier. And Andy, I'll tell you one of the ones I would love to kick off with.

44.648 - 70.112 Brian Maucere

because I used it this weekend and I was just really impressed with it, is that ElevenLabs came out with Scribe V2 Realtime and also just Scribe V2. And what this is, is what they're calling their most accurate transcription model ever. It just so happened, you know, I'm working on this other project. We usually call it Project Bruno when it's on the show here.

70.492 - 85.489 Brian Maucere

But I was literally working with transcription and running into some issues, trying to figure out if I was going to use one of OpenAI's models. And lo and behold, ElevenLabs showed up at the right time, well, the 11th hour.

85.469 - 108.321 Brian Maucere

and said, hey, we have this new model. I see what you did there. I didn't even mean to. They said, hey, we have this new model, you should come check it out. And while I have not run a ton through it yet, immediately I was like, yep, this is what we need to be using. So I'm all connected to the API and pushing stuff through it.

108.301 - 118.616 Brian Maucere

But they're saying the real-time version is optimized for ultra-low latency and agent use cases. Hence its name, Scribe V2 Realtime.

Chapter 5: What impact does Claude Code have on autonomous work processes?

118.696 - 139.877 Brian Maucere

It's meant for real time, right, and Scribe V2 is built for batch transcription, subtitling, and captioning at scale. Guess what I need? So that second one, Scribe V2, is, well, like I said, already proving its worth. Much more testing to do, but I just wanted to throw that out there because I literally used it over the weekend.
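For anyone who wants to wire this up the way Brian describes, here is a minimal sketch of pushing a file through ElevenLabs' batch speech-to-text endpoint in Python. The endpoint path and the xi-api-key header follow the published v1 API, but the scribe_v2 model id, the diarize flag, and the response fields are assumptions to check against the current docs.

```python
# A minimal sketch of batch transcription against ElevenLabs' speech-to-text
# REST endpoint. "scribe_v2" and the diarize flag are assumptions based on
# the episode, not confirmed documentation.
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]

with open("episode.mp3", "rb") as audio:
    resp = requests.post(
        "https://api.elevenlabs.io/v1/speech-to-text",
        headers={"xi-api-key": API_KEY},
        data={
            "model_id": "scribe_v2",  # assumed id for the new batch model
            "diarize": "true",        # speaker labels, if the model supports it
        },
        files={"file": audio},
        timeout=600,
    )
resp.raise_for_status()
result = resp.json()
print(result["text"])                 # full transcript text
for word in result.get("words", []):  # word-level timestamps, per the v1 schema
    print(word["start"], word["end"], word["text"])
```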

140.6 - 168.697 Andy Halliday

Yeah, good. So a couple of comments. One is that transcription of audio is important, not only for using it against media that's generated and doesn't have captioning, et cetera. But right now, you know, getting a speech-to-text transcription out of something is pretty well established. You get stuff, but ElevenLabs has always been at the forefront. Yeah.

168.717 - 176.496 Andy Halliday

And in terms of accuracy and, let's call it, fluency. The question is whether, in the long run,

Chapter 6: What are the implications of ChatGPT Health for accessing medical records?

177.134 - 200.745 Andy Halliday

You know, we don't even think about that anymore, except in the narrow context of where you want to place captioning or text somehow in association with audio. Because, you know, one of the uses that we have for getting the transcript out is in order to allow large language models to then process the transcript text. Correct. As opposed to the audio.

200.725 - 227.362 Andy Halliday

But multimodal LLMs are going to be able to work in straight audio, you know, with better and better facility. And so eventually you may not ever need to do that intermediate step of getting a transcription. You'll just feed the audio and or the video with audio directly to the model and then have it processed that way. If it needs to capture a full transcript, great.

Chapter 7: How is AI transforming healthcare and predictive analysis?

228.403 - 240.722 Andy Halliday

The model can decide to do that because it understands what's being said. But we may bypass that step, especially as voice interaction with AI becomes more and more prevalent among users.
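As a sketch of the direct-audio flow Andy is predicting, with no intermediate transcript step, you could hand the file straight to a multimodal model. This uses the google-generativeai Python SDK; the model name and the prompt are illustrative assumptions, and large files may need time to finish processing after upload.

```python
# A sketch of the "no intermediate transcript" flow: hand the audio straight
# to a multimodal model and let it decide whether a transcript is needed.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

audio_file = genai.upload_file("episode.mp3")    # File API upload
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name

response = model.generate_content([
    "Summarize the main points of this conversation. "
    "Produce a full transcript only if one is needed.",
    audio_file,
])
print(response.text)
```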

241.394 - 269.354 Brian Maucere

Listen, you're a hundred percent correct. And I know where we are currently, because I've literally been, you know, stress testing these models. And so if I give Gemini Pro a 30-plus-minute MP4, so certainly not giving it a stripped-out audio track, my request is, give me what we would call a VTT-level transcript.

269.454 - 274.186 Brian Maucere

Meaning that it's not like what you would get from Glasp if you just went to YouTube and YouTube did it.

Chapter 8: What are the advantages and drawbacks of using X for AI news distribution?

274.426 - 300.858 Brian Maucere

You're going to get a timestamp every... I don't know, like every 10 seconds, you're going to get a timestamp. That's not nearly tight enough of a transcript, right? So a VTT, now if we download a .VTT from this show, which I do every day, that has almost line-item cues, right? And it doesn't necessarily do diarization, where it pulls out your name or mine.
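For context, a VTT-level transcript means near line-level cues with tight start and end times, optionally tagged with a speaker via WebVTT voice tags. A small Python sketch of writing one, with made-up segments for illustration:

```python
# Write a minimal WebVTT file: one cue per line of speech, with millisecond
# timestamps and a speaker label. The segments below are illustrative.
segments = [
    (0.622, 3.226, "Brian", "Hey, what's going on, everybody?"),
    (4.208, 7.100, "Brian", "It is January 12th, 2026."),
]

def fmt(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

with open("show.vtt", "w") as f:
    f.write("WEBVTT\n\n")
    for start, end, speaker, text in segments:
        f.write(f"{fmt(start)} --> {fmt(end)}\n")
        f.write(f"<v {speaker}>{text}\n\n")  # <v> is WebVTT's speaker tag
```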

301.178 - 329.558 Brian Maucere

Anyway, if I throw this into Google at 30 minutes, what I found, at least, is that unless you use some advanced maneuvers like chunking and some other things, it has creep in it, and the timestamps slip. It will try to do it, but it can't really keep up. Even though it has a huge context window and the video might be under 200 megabytes, even in AI Studio I was running into issues.
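A rough sketch of the chunking maneuver Brian alludes to: transcribe a long recording in fixed windows and shift each window's timestamps by its offset, so per-chunk drift cannot accumulate across the full runtime. Here transcribe_chunk is a hypothetical stand-in for whatever STT call you use, and the pydub dependency is an assumption.

```python
# Transcribe a long file in 10-minute windows and re-anchor each window's
# cue times to its absolute offset in the show.
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000  # 10-minute windows

def transcribe_long(path: str) -> list[dict]:
    audio = AudioSegment.from_file(path)
    cues = []
    for offset_ms in range(0, len(audio), CHUNK_MS):
        chunk = audio[offset_ms:offset_ms + CHUNK_MS]
        chunk.export("chunk.mp3", format="mp3")
        for cue in transcribe_chunk("chunk.mp3"):  # hypothetical STT call
            cues.append({
                # shift chunk-local times back to absolute show time
                "start": cue["start"] + offset_ms / 1000.0,
                "end":   cue["end"]   + offset_ms / 1000.0,
                "text":  cue["text"],
            })
    return cues
```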

329.779 - 353.734 Brian Maucere

So the solution for me right now, which, you're right, is not going to be the long-term solution, is I rip the transcript out first, and then I do processes after that that are both transcript-based and audiovisual-based, on that MP4. And I found that that one-two punch is sufficient.

353.95 - 376.465 Andy Halliday

Yeah. Now, question. Clearly, timestamp accuracy is really important for retrieval and presentation, or extraction of a segment from a long video. And you're saying that when you do this sort of multistage process, that gives you more accurate timestamps. Is that the main benefit?

376.968 - 406.587 Brian Maucere

That is the main benefit. And also, by doing that, you're asking the multimodal model to focus on just the transcript first, as opposed to asking it to do two things at a time. Which is to say, an example would be: give me the top five transcript timestamps where people were running. I don't know what your video is going to be about.

406.607 - 434.667 Brian Maucere

Whatever, right? Whatever it is. And so as it stands, Gemini happily will try to do that by looking at clips of it and trying to literally watch the video, for lack of a better word, and also listen to it. But, at least in my case, when you layer that on top of also trying to get accurate timestamps, that part slips. And so an example over the weekend would be that

434.647 - 451.553 Brian Maucere

I was sitting down watching college football and also vibe coding at the same time. And what I would run into is that a 37-minute video would show me timestamps at 45 minutes, 47 minutes, which obviously can't be true, right? So then I dug into, like, well, why is this happening?

451.573 - 474.9 Brian Maucere

And ultimately what I got back to was, at least in terms of Gemini, which is the most advanced model for this type of multimodal work, that it was just having a hard time. So my workaround is pretty easy. It's just a two-step, which is: go get me a VTT-like, best-in-class-level transcript, although V2 from ElevenLabs is better. So that already handles that.

475.38 - 494.488 Brian Maucere

And then once that's set in stone, you're not looking at something that's sort of mid, you know, mid-level, coming from YouTube. It's a really good "Andy said this, Brian said that," or at least "Speaker 1, Speaker 2," even if it's not picking up our voices. And then that's like your, I don't know, your rock.
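Putting the two-step workaround together, a sketch might pass the locked VTT alongside the video and tell the model to treat the transcript as the source of truth for time. The model name, file names, and prompt wording here are assumptions, not Brian's actual setup.

```python
# Step 1: load the locked, trusted VTT transcript.
# Step 2: run the audiovisual pass, grounding all timestamps in the VTT.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

vtt = open("show.vtt").read()

video = genai.upload_file("show.mp4")
while video.state.name == "PROCESSING":  # wait for the File API to finish
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    "Here is the authoritative VTT transcript of this video:\n\n" + vtt,
    "Using only the transcript's timestamps, list the top five moments where "
    "the topic I care about comes up, as start --> end plus a one-line note.",
    video,
])
print(response.text)
```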
