Mark Williams-Cook
Podcast Appearances
When people talk about AI, LLMs and the training data that goes in, this is related as well to a lot of people saying, oh, you should use structured data.
You should use schema.
So there's structured markup that traditionally we put on web pages for search engines to explicitly label connections between things.
So you would say, this is this website.
This website is part of this company.
This is the author.
He or she works for this company.
And you map out those connections, right?
The idea is it's explicit and it removes ambiguity.
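As a rough illustration of the kind of markup being described, here is a minimal Python sketch that builds a schema.org JSON-LD graph linking a website, a company and an author. The names, URLs and the `to_jsonld_script` helper are placeholders made up for this example, not anything mentioned in the conversation; `publisher` and `worksFor` are existing schema.org properties.

```python
import json

# Hypothetical entities: every name and URL here is a placeholder.
organization = {
    "@type": "Organization",
    "@id": "https://example.com/#organization",
    "name": "Example Company",
    "url": "https://example.com/",
}

# "This is this website. This website is part of this company."
website = {
    "@type": "WebSite",
    "@id": "https://example.com/#website",
    "name": "Example Site",
    "url": "https://example.com/",
    "publisher": {"@id": organization["@id"]},
}

# "This is the author. He or she works for this company."
author = {
    "@type": "Person",
    "@id": "https://example.com/#author",
    "name": "Jane Doe",
    "worksFor": {"@id": organization["@id"]},
}

graph = {
    "@context": "https://schema.org",
    "@graph": [organization, website, author],
}


def to_jsonld_script(data):
    """Wrap a JSON-LD document in the script tag used to embed it in a page."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


markup = to_jsonld_script(graph)
print(markup)
```

Each connection is stated through an explicit `@id` reference, which is what removes the ambiguity for a parser that reads the markup directly.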
Now, the way the large language models work with their training is, you know, they get given all this data, or they take whatever they like, it seems.
But when that goes through the process of tokenization and such, the actual training data is not saved within the model, right? The text isn't there. All that happens is it's broken down into these tokens, these components, and the model is the relationship between those, all the different combinations of tokens. That's how it produces text.
So the schema side of things, the structured data, can't survive that process.
The model can kind of generate structured data because it sees those patterns together.
But the fact that it once saw that the organization was Candour for this website, because that's statistically not even a drop in the bucket, that's forever gone.
So that's not encapsulated in, I guess, what you could call the language graph, which is the way they're storing knowledge.
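To make the "drop in the bucket" point concrete, here is a toy Python sketch. It is a loose analogy only: real LLMs use subword tokenizers and learn neural network weights, not bigram counts, and the corpus, the `tokenize` helper and the names `ExampleCo` and `RareCo` are all invented for this example. It counts adjacent token pairs across 1,000 made-up snippets of markup, where only one snippet names a particular organization.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude stand-in for a real tokenizer: runs of letters/@, or single symbols."""
    return re.findall(r"[A-Za-z@]+|[^\sA-Za-z@]", text)


# Hypothetical corpus: 999 generic snippets and one that names a specific company.
generic = '{"@type": "Organization", "name": "ExampleCo"}'
specific = '{"@type": "Organization", "name": "RareCo"}'
corpus = [generic] * 999 + [specific]

# Count adjacent token pairs as a rough proxy for "relationships between tokens".
bigrams = Counter()
for doc in corpus:
    tokens = tokenize(doc)
    bigrams.update(zip(tokens, tokens[1:]))

total = sum(bigrams.values())
pattern_count = bigrams[('"', "@type")]  # the general shape of the markup
fact_count = bigrams[('"', "RareCo")]    # the one specific labelled fact

print("markup pattern seen:", pattern_count, "times")
print("specific fact seen:", fact_count, "time(s)")
print("share of all pairs:", fact_count / total)
```

In this toy setup the general shape of the markup shows up in every snippet, which is why the pattern can be reproduced, while the one specific labelled connection is a single count among thousands of pairs.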
So that's one of the reasons why you get those issues.
And I've had this with clients as well, where it thought that two different websites that were named similarly were the same entity, if you want to call it that.