
Anita Zhang

👤 Person
146 total appearances

Appearances Over Time

Podcast Appearances

Yeah, but if you're on the infrastructure side, sometimes you need to maintain those widely distributed binaries on the bare metal hosts. So, like us running systemd, or the team at Siamat that does the load balancing: they also run a widely distributed binary across the fleet on bare metal.

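For a binary that runs directly on bare metal, the natural supervisor is the init system itself. As a rough sketch (the unit name, binary path, and settings here are hypothetical, not anything described in the conversation), a fleet-wide agent managed by systemd might look like:

```ini
# /etc/systemd/system/fleet-agent.service -- hypothetical unit name
[Unit]
Description=Fleet-wide agent running directly on the bare metal host
After=network-online.target
Wants=network-online.target

[Service]
# Hypothetical binary path; restart automatically on failure
ExecStart=/usr/local/bin/fleet-agent
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

The point is that the host's own service manager, rather than a container runtime, handles the lifecycle of these binaries.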
There's also another service that specifically handles fetching packages and shipping out configuration files, things like that. But yeah, most of the services people write are running in containers. Databases have kind of their own separate thing going on as well.

Most of them are moving more into TWShared as well, but they have more specific requirements related to draining the host and making sure there's no data loss.

Yeah, but they're one of those teams that just want their own set of bare metal hosts to do their own thing with. They don't care about running things in a container if they don't have to.

The AI fleet's always a challenge, I guess, making sure jobs stay running for that long. Every site event is kind of an opportunity to see where we can make our infrastructure more stable: adding more validation in places, things like that. Just removing some of the clowniness that people who have been here a long time have kind of gotten used to.

Yeah, making things more deterministic, shifting teams that don't need their own hosts into TWShared so that they're on more common infrastructure, adding more safeguards in place so that we can't roll things out live, and stuff like that.

Yeah, that's probably not true anymore. But yeah, the majority of our compute fleet looks like that. Yeah.

Yeah. And we're trying to shift to a model now where we have bigger compute hosts so that we can run more jobs side by side, stacking, because realistically one service isn't going to be able to scale to all the resources on the host forever. So yeah, we're getting into stacking now.

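Stacking jobs side by side generally depends on per-service resource limits so that one job can't starve its neighbors. A minimal sketch using systemd's resource-control directives (the unit name and values are illustrative assumptions, not anything the speaker described):

```ini
# Hypothetical drop-in for one stacked job, e.g.
# /etc/systemd/system/job-a.service.d/limits.conf
[Service]
CPUWeight=100        # relative CPU share vs. other stacked jobs
MemoryMax=16G        # hard cap; the kernel OOM-kills the job past this
IOWeight=100         # relative disk I/O share
TasksMax=4096        # bound on threads/processes the job may spawn
```

Weights divide contended resources proportionally, while hard caps like MemoryMax keep a single runaway job from taking down the whole host.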
Oh, I mean, even in the past year we've made a few notable infrastructure shifts to support the AI fleet. It's not even just the different resources on the host, but all of the different components: a lot of them have additional network cards, and there's managing how the accelerators work and making sure they're healthy, things like that.
