Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Pricing
Podcast Image

Base by Base

215: Protein Set Transformer for high-diversity viromics

01 Dec 2025

Description

️ Episode 215: Protein Set Transformer for high-diversity viromics In this episode of PaperCast Base by Base, we explore Protein Set Transformer (PST) is a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets Study Highlights:PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes proteins with a multi-head attention encoder, and produces genome embeddings via a learnable weighted decoder pooling. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation and evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote relationships, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering and boosted viral host-species prediction when used in a graph link-prediction framework. Conclusion:PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications Music:Enjoy the music based on this article at the end of the episode. Reference:Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4 License:This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/ Support:Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00 Official website https://basebybase.com Castos player https://basebybase.castos.com On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics. Episode link: https://basebybase.castos.com/episodes/protein-set-transformer Chapters (00:00:00) - Deep Learning in Viral Biology(00:02:31) - Preliminary insights into viral biology(00:08:01) - PSTTL: The Hidden Genome of Viruses(00:11:29) - PSTTL: The Virality Model(00:14:22) - Preston 2, Context-aware viral evolution(00:16:19) - Signs and Numbers in the Code

Audio
Featured in this Episode

No persons identified in this episode.

Transcription

This episode hasn't been transcribed yet

Help us prioritize this episode for transcription by upvoting it.

0 upvotes
🗳️ Sign in to Upvote

Popular episodes get transcribed faster

Comments

There are no comments yet.

Please log in to write the first comment.