Eno Reyes
But you quickly realize that that does not work.
Yeah.
And so then we started to try to figure out, you know, what are the individual dimensions that matter for a compression of a situation?
Like, you know, it's going to lose some information.
So what matters to preserve?
How can we design a system that doesn't just summarize, but provides a high-quality, well-balanced block of information that the agent can then use to resume its task with little to no issues?
And so, you know, you can look at all these things, like accuracy: whether or not the actual compressed artifact is accurate.
You can look at its context awareness, like what information is or isn't included about what is currently happening, what has previously happened.
You can look at the artifact trail: are the files, the logs, and the critical, singular pieces of information present in the summarization? You can look at completeness, and continuity, and instruction following.
Anyway, you can look at all these different dimensions, and you can start to evaluate different strategies for context compression.
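The dimensions listed above could be captured in a simple scoring structure. This is a hypothetical sketch, not anything from an actual product: the field names come from the discussion, but the 0-to-1 scale and the unweighted mean are illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class CompressionRubric:
    """Hypothetical rubric for scoring a context-compression artifact."""
    accuracy: float               # is the compressed artifact factually accurate?
    context_awareness: float      # does it cover what is / was happening?
    artifact_trail: float         # are files, logs, key identifiers preserved?
    completeness: float           # is anything essential missing?
    continuity: float             # can the agent resume without re-deriving state?
    instruction_following: float  # are the original instructions still respected?

    def overall(self) -> float:
        # Unweighted mean across dimensions (a simplifying assumption).
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)


score = CompressionRubric(
    accuracy=1.0,
    context_awareness=0.8,
    artifact_trail=0.6,
    completeness=0.9,
    continuity=0.7,
    instruction_following=1.0,
)
```

In practice each dimension's score would come from an evaluator rather than being assigned by hand; the dataclass just makes the dimensions explicit so different compression strategies can be compared on the same axes.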
And so we used this method called probe-based evaluation, which is a fairly popular way to evaluate LLMs. The idea is that you ask very focused questions, with rubric-based evaluations, using another LLM to extract information from some state or from some answer that was previously produced.
And so, you know, an example might be: I have an agent session, and then I compact it. And then a probe would ask, what was the file that had the bug inside of it, right?
And if the compacted context can answer that question, you know the information survived. LLMs are now good enough that it's basically binary: yes or no, it either has the information or it doesn't, right?
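The probe idea described above can be sketched in a few lines. This is a minimal illustration, not the speaker's actual implementation: `run_probe`, `keyword_judge`, and the sample strings are all made up, and the judge here is a trivial substring check standing in for "ask another LLM and grade yes/no", so the example runs without any API.

```python
def run_probe(compacted_context: str, probe: str, expected: str, judge) -> bool:
    """Return True if the judge finds the expected fact in the compacted context."""
    return judge(compacted_context, probe, expected)


def keyword_judge(context: str, probe: str, expected: str) -> bool:
    # Stand-in for an LLM judge: a real setup would prompt another model
    # with the probe question and grade its answer against a rubric.
    return expected.lower() in context.lower()


# Hypothetical compacted session summary and probes.
compacted = "Fixed a null check in src/parser.py; tests now pass."
probes = [
    ("Which file had the bug?", "src/parser.py"),
    ("What command reproduced the failure?", "pytest -k parser"),
]

results = {q: run_probe(compacted, q, exp, keyword_judge) for q, exp in probes}
# The first probe survives compaction; the second fact was lost.
```

The binary yes/no framing from the transcript is what makes this tractable: each probe is a pass/fail check on whether one specific piece of information survived compression, and a compression strategy is scored by how many probes it passes.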