Chris Lattner
I can see I'm doing one element at a time.
Then I can get to doing four or eight or 16 at a time with vectors.
That's called vectorization.
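To make that concrete, here is a minimal C++ sketch of the idea: the same loop, first one element per iteration, then reshaped into fixed blocks of eight so a compiler can map the inner block onto SIMD registers. The function names and the width of eight are illustrative choices, not anything named in the conversation.

```cpp
#include <cstddef>

// Scalar version: one element per iteration.
void scale_scalar(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}

// SIMD-friendly shape: process eight elements per outer iteration so the
// compiler can map the fixed-size inner block onto vector registers.
void scale_vectorized(float* data, std::size_t n, float factor) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (std::size_t j = 0; j < 8; ++j)
            data[i + j] *= factor;
    for (; i < n; ++i)   // tail: leftover elements that don't fill a vector
        data[i] *= factor;
}
```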
Then you can think about what a multi-core computer is: it's basically a bunch of computers, right?
So they're all independent computers that can talk to each other and they share memory.
And so what parallelize does is say, okay, run multiple instances on those different computers, and now they can all work together on a problem, right?
And so what you're doing is saying, keep going out to the next level, and as you do that, ask: how do I take advantage of this?
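A rough sketch of that parallelization step, again in plain C++ rather than any specific library being discussed: split the index range across the available cores, and let each worker operate on its own chunk of the shared array. The function name and chunking scheme are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split the index range across the available cores; each worker runs on its
// own "computer" (core), but they all share the same memory.
void scale_parallel(float* data, std::size_t n, float factor) {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (n + cores - 1) / cores;
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c) {
        std::size_t begin = c * chunk;
        std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= factor;
        });
    }
    for (auto& t : workers)
        t.join();
}
```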
So tiling is a memory optimization, right?
It says, okay, let's keep the data close to the compute part of the problem instead of sending it all back and forth through main memory every time we load a block.
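The classic illustration of tiling is a blocked matrix multiply. The sketch below is a generic C++ version under assumed conventions (row-major N x N matrices, an illustrative tile size of 64), not the specific tiling construct being discussed.

```cpp
#include <algorithm>
#include <cstddef>

// Blocked (tiled) matrix multiply: C += A * B, all N x N, row-major.
// Working one T x T tile at a time keeps the active blocks of A, B, and C
// in cache instead of re-streaming whole rows from main memory.
constexpr std::size_t T = 64;  // illustrative tile size, tuned per machine

void matmul_tiled(const float* A, const float* B, float* C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t kk = 0; kk < N; kk += T)
            for (std::size_t jj = 0; jj < N; jj += T)
                // Finish all the work within the current tiles before moving on.
                for (std::size_t i = ii; i < std::min(N, ii + T); ++i)
                    for (std::size_t k = kk; k < std::min(N, kk + T); ++k) {
                        float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(N, jj + T); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```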
Yeah, well, for all of these, the details matter so much to get good performance.
This is another funny thing about machine learning and high-performance computing that is very different from the C compilers we all grew up with, where if you get a new version of GCC or a new version of Clang or something like that, maybe something will go 1% faster.
And so compiler engineers will work really, really, really hard to get half a percent out of your C code, something like that.
But when you're talking about an accelerator or an AI application, or these kinds of algorithms that people used to write in Fortran, for example, if you get it wrong, it's not 5% or 1%. It could be 2x or 10x, right?
If you think about it, you really want to make full use of the memory you have, the cache, for example.
But if you use too much space, it doesn't fit in the cache.
Now you're going to be thrashing all the way back out to main memory.
And these can be 2x, 10x, major performance differences.
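A quick back-of-the-envelope check shows where that cliff comes from. Assuming the tiled multiply sketched above and a hypothetical 256 KiB cache (the sizes here are made up for illustration; real caches vary by chip):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Hypothetical 256 KiB cache; real sizes vary by chip.
    constexpr std::size_t cache_bytes = 256 * 1024;
    // Three T x T float tiles (blocks of A, B, and C) must stay resident at once.
    for (std::size_t tile : {32, 64, 128, 256}) {
        std::size_t bytes = 3 * tile * tile * sizeof(float);
        std::printf("T=%3zu -> %7zu bytes: %s\n", tile, bytes,
                    bytes <= cache_bytes ? "fits in cache"
                                         : "spills, thrashes main memory");
    }
    return 0;
}
```

With these assumed numbers, a 128-wide tile fits comfortably while a 256-wide tile spills, which is exactly the kind of small tuning decision that turns into a 2x or 10x difference.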