Jeff Kao
So in a lot of ways, the FST is almost like a cache.
You can almost think of it as a hash map, or maybe even a B-tree map, from a string to a u64, just a compressed version of that.
But it's really more than that: it's a way to cache high-cardinality text in a very compressed form.
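As a rough sketch of that idea, here's how the standalone fst crate in Rust (Tantivy uses a closely related implementation) maps sorted strings to u64 values; the specific keys below are made up for illustration.

```rust
use fst::Map;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must arrive in lexicographic order; the FST then stores them with
    // shared prefixes/suffixes, so high-cardinality text compresses well.
    let map = Map::from_iter(vec![
        ("broadway", 1u64),
        ("prince", 2u64),
        ("street", 3u64),
    ])?;

    // Lookups behave like a map: string key in, u64 out.
    assert_eq!(map.get("prince"), Some(2));
    assert_eq!(map.get("bowery"), None);
    Ok(())
}
```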
So I'll give a quick rundown of these.
So for any sort of search system, the most fundamental data structure is this thing called an inverted index.
That name implies there's also a forward index, so maybe I'll explain what that is first.
A forward index is more like a traditional database.
So record one maps to Broadway, record two maps to Prince Street.
The inverted index sort of switches that around.
So you first tokenize, and that's a whole topic of its own; people research how to tokenize text, especially with the AI and machine learning trend now.
But you can then say: the token "Broadway" maps to document 1, "Prince" maps to document 2, and "Street" maps to document 2.
So when you type in "Prince", it's not literally a hash map in these implementations, but essentially you just look up the word "Prince" and you get back all the documents that contain it.
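To make that concrete, here's a toy sketch of the idea (my own illustration, not Tantivy's actual code): tokenize the two records from above and build the token-to-document-IDs mapping with a plain hash map.

```rust
use std::collections::HashMap;

fn main() {
    // Forward index: document ID -> text, like rows in a traditional database.
    let docs: Vec<(u32, &str)> = vec![(1, "Broadway"), (2, "Prince Street")];

    // Inverted index: token -> list of document IDs containing that token.
    let mut inverted: HashMap<String, Vec<u32>> = HashMap::new();
    for (doc_id, text) in &docs {
        // Trivial lowercase + whitespace tokenizer; real tokenizers are far more involved.
        for token in text.split_whitespace() {
            inverted.entry(token.to_lowercase()).or_default().push(*doc_id);
        }
    }

    // Typing "prince" amounts to looking up that token and getting its documents back.
    assert_eq!(inverted["prince"], vec![2]);
    assert_eq!(inverted["broadway"], vec![1]);
}
```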
And once you have those document lists, you can perform set operations to narrow down which documents are relevant.
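For example, an AND query over two terms boils down to intersecting their sorted posting lists; the lists below are made up, and this two-pointer walk is just a sketch of the general technique, not Tantivy's internals.

```rust
fn main() {
    // Hypothetical sorted posting lists: document IDs for each query term.
    let prince: Vec<u32> = vec![2, 5, 9, 14];
    let street: Vec<u32> = vec![2, 3, 9, 21];

    // Classic two-pointer intersection over sorted lists: only documents
    // containing both terms survive.
    let (mut i, mut j) = (0, 0);
    let mut both = Vec::new();
    while i < prince.len() && j < street.len() {
        if prince[i] == street[j] {
            both.push(prince[i]);
            i += 1;
            j += 1;
        } else if prince[i] < street[j] {
            i += 1;
        } else {
            j += 1;
        }
    }

    assert_eq!(both, vec![2, 9]);
}
```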
And then Tantivy offers this thing called BM25, which you can think of as something like TF-IDF.
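As a rough illustration of what that scoring looks like, here is the textbook Okapi BM25 contribution of a single term, with the usual default constants; this is a generic formula sketch, not Tantivy's exact implementation.

```rust
/// Textbook Okapi BM25 score contribution of one query term for one document.
/// tf: term frequency in the doc; doc_len / avg_doc_len: length normalization;
/// docs_total / docs_with_term: corpus statistics feeding the IDF part.
fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, docs_total: f64, docs_with_term: f64) -> f64 {
    let k1 = 1.2; // controls how quickly repeated terms stop adding score
    let b = 0.75; // controls how strongly long documents are penalized

    // IDF: rare terms count for more than common ones, much like TF-IDF.
    let idf = ((docs_total - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0).ln();

    // Term-frequency part: saturates as tf grows, dampened for long documents.
    let tf_part = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len));

    idf * tf_part
}

fn main() {
    // A match on a rare term outscores a match on a very common term.
    let rare = bm25_term_score(1.0, 10.0, 12.0, 1000.0, 5.0);
    let common = bm25_term_score(1.0, 10.0, 12.0, 1000.0, 700.0);
    assert!(rare > common);
}
```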
But essentially, once you have these documents,