AI: post transformers
Episodes
FP8 Quantization
08 Aug 2025
Contributed by Lukas
Three sources are reviewed to understand the value of FP8 quantization: https://www.baseten.co/blog/33-faster-llm-inference-with-fp8-quantization/ https...
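As a rough sketch of what per-tensor FP8 quantization looks like in practice (the function names here are ours, not from the sources; assumes a PyTorch recent enough to ship the float8_e4m3fn dtype, whose largest finite value is 448):

import torch

# Per-tensor absmax scaling into FP8 (e4m3): map the tensor's dynamic range
# onto [-448, 448], cast, and keep the scale around for dequantization.
def quantize_fp8(x):
    scale = x.abs().max() / 448.0
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8, scale):
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4, 4)
x_q, scale = quantize_fp8(x)
print((x - dequantize_fp8(x_q, scale)).abs().max())  # small rounding error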
Mistral 7B: Superior Performance in a Smaller Package
08 Aug 2025
Contributed by Lukas
This paper introduces Mistral 7B, a new 7-billion-parameter language model designed for both superior performance and efficiency. The paper highli...
Diffusion Probabilistic Models: Deep Unsupervised Learning Through Diffusion Processes
08 Aug 2025
Contributed by Lukas
This paper introduces a novel deep unsupervised learning algorithm that leverages non-equilibrium thermodynamics to model complex datasets. The co...
Teraio: Cost-Efficient LLM Training via Lifetime-Aware Tensor Offloading
08 Aug 2025
Contributed by Lukas
The research introduces Teraio, a novel framework designed to enhance the cost-efficiency and performance of large language model (LLM) training. This...
Shared Virtual Memory: Design and Performance Implications
08 Aug 2025
Contributed by Lukas
The provided academic paper investigates Shared Virtual Memory (SVM), a technology that integrates GPU memory into host virtual memory systems to impr...
vLLM & PagedAttention: Efficient LLM Serving with Virtual Memory
08 Aug 2025
Contributed by Lukas
This document introduces PagedAttention, an innovative attention algorithm, and vLLM, a high-throughput serving system for large language models (LLMs...
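For a feel of the core idea, a minimal sketch of the virtual-memory analogy (the names, block size, and table contents are illustrative, not vLLM's actual data structures):

# A sequence's KV cache is stored in fixed-size physical blocks; a per-sequence
# block table maps logical block indices to physical block ids, like page tables.
BLOCK_SIZE = 16
block_table = [7, 2, 9]  # logical block i -> physical block id

def physical_slot(token_pos):
    block_id = block_table[token_pos // BLOCK_SIZE]  # which physical block
    return block_id, token_pos % BLOCK_SIZE          # offset within that block

print(physical_slot(37))  # token 37 lives at (block 9, offset 5)

Because blocks are allocated on demand and can be shared between sequences, memory waste and duplication drop sharply compared with preallocating one contiguous cache per sequence.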
ZeRO-Offload: Democratizing Billion-Scale Model Training
08 Aug 2025
Contributed by Lukas
This reviews the paper which introduced ZeRO-Offload, a novel technology designed to democratize large-scale deep learning model training by making ...
AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms
08 Aug 2025
Contributed by Lukas
The paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter (2017) introduced what we know in Pytorch now as AdamW.This acad...
Adam: A Method for Stochastic Optimization
08 Aug 2025
Contributed by Lukas
This reviews the 2015 paper which introduced Adam, an algorithm for first-order gradient-based optimization in scenarios involving stochastic objec...
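For reference, the update rule the paper proposes, with g_t the gradient at step t, b1 and b2 the decay rates, a the learning rate, and eps a small constant:

m_t = b1 * m_{t-1} + (1 - b1) * g_t            (first-moment estimate)
v_t = b2 * v_{t-1} + (1 - b2) * g_t^2          (second-moment estimate)
m̂_t = m_t / (1 - b1^t)                         (bias correction)
v̂_t = v_t / (1 - b2^t)
θ_t = θ_{t-1} - a * m̂_t / (sqrt(v̂_t) + eps)   (parameter update)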
FlashAttention-2: Faster Attention with Better Parallelism
08 Aug 2025
Contributed by Lukas
This reviews the paper that introduces FlashAttention-2, an optimized attention algorithm designed to significantly improve the speed and efficiency o...
PartitionedVC: Optimizing External Memory Graph Analytics
08 Aug 2025
Contributed by Lukas
This paper introduces PartitionedVC, an innovative external memory graph analytics framework designed to enhance the processing of large graphs that ...
Fast Learning for Deep Belief Networks
08 Aug 2025
Contributed by Lukas
We review a letter communicated by Yann LeCun and published in Neural Computation in 2006, which details a fast learning algorithm for deep belief net...
Mu: The Windows Settings Agent Language Model
08 Aug 2025
Contributed by Lukas
We review a post from the Windows Experience Blog that introduces a new era of Windows experiences, heavily integrated with Artificial Intelligence...
Lost in the Middle: How Language Models Use Long Contexts
08 Aug 2025
Contributed by Lukas
This academic paper explores how language models utilize long input contexts, focusing on their ability to identify and retrieve relevant information....
MUVERA: Efficient Multi-Vector Information Retrieval
08 Aug 2025
Contributed by Lukas
We review MUVERA, a novel algorithm designed to significantly improve the efficiency of multi-vector information retrieval. Traditional information ...
ColBERT and ColBERT v2
08 Aug 2025
Contributed by Lukas
We review and expand upon two papers on ColBERT, a neural information retrieval model that utilizes contextualized late interac...
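A minimal sketch of the late-interaction ("MaxSim") scoring at the heart of ColBERT (shapes and names are illustrative): every query token is matched against its best document token, and the per-token maxima are summed into the relevance score:

import torch
import torch.nn.functional as F

def maxsim_score(Q, D):
    # Q: (query_tokens, dim), D: (doc_tokens, dim), both L2-normalized
    sim = Q @ D.T                        # token-level cosine similarities
    return sim.max(dim=1).values.sum()   # best doc match per query token, summed

Q = F.normalize(torch.randn(8, 128), dim=-1)
D = F.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(Q, D))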
Microscaling Quantization for Large Language Models
08 Aug 2025
Contributed by Lukas
Double paper review for modern quantization techniques. These two academic papers address the crucial challenge of quantizing Large Language Models (LL...
Chamfer Matching: Image Registration and Medical Applications
07 Aug 2025
Contributed by Lukas
This paper provides an overview of Chamfer Matching, a classical image registration method primarily used for segmented features. The authors explain its...
LeNet-5: Convolutional Networks for Character Recognition
07 Aug 2025
Contributed by Lukas
This paper outlines the advancements in Optical Character Recognition (OCR), particularly focusing on handwritten character and word recognition us...
Constitutional AI: Harmlessness Through Self-Improvement
07 Aug 2025
Contributed by Lukas
This paper details "Constitutional AI," a novel method for training AI assistants to be harmless without extensive human-labeled data for harmful outp...
GQA: Grouped Query Attention
07 Aug 2025
Contributed by Lukas
This paper introduces Grouped-Query Attention (GQA), a novel approach designed to enhance the inference efficiency of large language models. It addr...
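The core trick in one function (shapes and names are illustrative, not the paper's code): several query heads share a single key/value head, so the cached KV tensors are simply repeated to line up with the query heads:

import torch

def repeat_kv(kv, num_query_heads):
    # kv: (batch, num_kv_heads, seq, head_dim)
    num_kv_heads = kv.shape[1]
    groups = num_query_heads // num_kv_heads    # query heads per KV head
    return kv.repeat_interleave(groups, dim=1)  # (batch, num_query_heads, seq, head_dim)

k = torch.randn(1, 2, 16, 64)   # only 2 KV heads are cached...
print(repeat_kv(k, 8).shape)    # ...but they serve 8 query heads: [1, 8, 16, 64]

The KV cache shrinks by the grouping factor, which is where the inference savings come from.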
Linformer: Efficient Self-Attention with Linear Complexity
07 Aug 2025
Contributed by Lukas
This academic paper introduces Linformer, a novel approach to address the computational bottleneck of Transformer models in natural language processin...
Longformer: A Transformer for Long Documents
07 Aug 2025
Contributed by Lukas
This paper introduces Longformer, a novel Transformer-based model designed to overcome the limitations of traditional Transformers in processing exc...
RoPE
07 Aug 2025
Contributed by Lukas
This paper introduces RoFormer, an enhanced Transformer model that leverages Rotary Position Embedding (RoPE) to improve natural language processin...
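A minimal sketch of the rotation itself, using the interleaved-pair convention (names are ours): each pair of channels is rotated by an angle proportional to the token's position, so query-key dot products depend only on relative position:

import torch

def rope(x, base=10000.0):
    # x: (seq, dim), dim even
    seq, dim = x.shape
    pos = torch.arange(seq).float().unsqueeze(1)              # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    angles = pos * freqs                                      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                           # channel pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                        # 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(16, 64))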
Layer Normalization and Dual Patch Normalization
07 Aug 2025
Contributed by Lukas
The first source introduces Layer Normalization (LN), a technique designed to accelerate and stabilize the training of deep neural networks, particula...
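A minimal sketch, assuming a 2-D activation (the paper is more general): statistics are computed per example over the feature dimension, which is what makes LN independent of the batch size:

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); one mean/variance per example
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

y = layer_norm(torch.randn(4, 512), torch.ones(512), torch.zeros(512))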
Batch Normalization
07 Aug 2025
Contributed by Lukas
This academic paper introduces Batch Normalization (BN), a novel technique designed to accelerate the training of Deep Neural Networks (DNNs) by addre...
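Contrast with the layer-norm sketch above: here the statistics run across the batch for each feature, which is what ties BN's behavior to the batch size (training-time only; at inference, running averages replace the batch statistics):

import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); one mean/variance per feature
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

y = batch_norm(torch.randn(32, 512), torch.ones(512), torch.zeros(512))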
Chinchilla: Optimal Language Model Scaling
07 Aug 2025
Contributed by Lukas
The Chinchilla research by DeepMind investigates the optimal model size and training tokens for large language models, aiming to maximize performance ...
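The headline finding reduces to a back-of-the-envelope ratio, using Chinchilla's own training run:

params, tokens = 70e9, 1.4e12   # 70B parameters, 1.4T training tokens
print(tokens / params)          # 20.0: roughly twenty tokens per parameter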
Transformer Scaling
07 Aug 2025
Contributed by Lukas
This research paper explores the scaling behavior of Transformer architectures, offering insights into pre-training and fine-tuning efficiency. It cha...
Learning from repeated data
07 Aug 2025
Contributed by Lukas
This 2022 paper explores the significant negative impact of repeated data on the performance of large language models, even when such repetitions c...
Scaling Laws
07 Aug 2025
Contributed by Lukas
This 2020 paper, titled "Scaling Laws for Neural Language Models," explores the empirical relationships between the performance of neural language ...
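The paper's headline result is a power law in parameter count N; a sketch with the paper's approximate fitted constants (quoted from memory, so treat them as indicative):

def loss_from_params(n, n_c=8.8e13, alpha=0.076):
    return (n_c / n) ** alpha   # L(N) = (N_c / N)^alpha

print(loss_from_params(1e8))    # ~2.8
print(loss_from_params(1e10))   # ~2.0: bigger model, lower predicted loss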
LSTM: the forget gate
07 Aug 2025
Contributed by Lukas
This 2000 paper introduces a novel solution to a weakness found in Long Short-Term Memory (LSTM) networks, specifically when processing continuous da...
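In modern notation, the fix is the forget gate f_t, a learned multiplier on the previous cell state that lets the network reset its own memory:

f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)   (forget gate, between 0 and 1)
c_t = f_t * c_{t-1} + i_t * c̃_t              (old memory can now be scaled down or erased)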
GPT4 Technical Report
07 Aug 2025
Contributed by Lukas
This 2023 paper, the GPT-4 Technical Report from OpenAI, introduces GPT-4, a multimodal AI model capable of processing both image and text inputs to pro...
GPT3
07 Aug 2025
Contributed by Lukas
This 2020 paper outlines the development and evaluation of GPT-3, a large language model, exploring its performance across various natural language p...
GPT2
07 Aug 2025
Contributed by Lukas
This 2019 paper, "Language Models are Unsupervised Multitask Learners," introduces GPT-2, a large language model designed for zero-shot learning, mean...
GELU
07 Aug 2025
Contributed by Lukas
This 2023 paper introduces Gaussian Error Linear Units (GELUs), a novel activation function for neural networks that outperforms traditional activations like R...
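The exact form is x * Φ(x), with Φ the standard normal CDF; a minimal sketch (the paper also gives a tanh approximation):

import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(1.0))   # ~0.841; unlike ReLU, inputs are weighted by their Gaussian percentile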
Dropout
07 Aug 2025
Contributed by Lukas
This 2014 journal article introduces "Dropout", a novel technique designed to combat overfitting in deep neural networks, which are powerful but prone...
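A minimal sketch of the training-time behavior, in the "inverted" form most frameworks use (the original paper instead rescales weights at test time; the expectation works out the same):

import torch

def dropout(x, p=0.5, training=True):
    if not training:
        return x                                # identity at test time
    mask = (torch.rand_like(x) >= p).float()    # keep each unit with prob 1-p
    return x * mask / (1.0 - p)                 # rescale so E[output] == x

y = dropout(torch.ones(10), p=0.5)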
ResNets - residual block
07 Aug 2025
Contributed by Lukas
What ResNet introduced is adding the input of a block directly to its output: Output = F(x) + x. This academic paper introduces Deep R...
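A minimal residual block sketch (batch norm omitted for brevity; the paper's blocks include it):

import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(out + x)                   # F(x) + x: the skip connection

y = ResidualBlock(16)(torch.randn(1, 16, 32, 32))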
BERT
07 Aug 2025
Contributed by Lukas
Review of the 2018 paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", which leverages the Transformer architecture, ...
BART
07 Aug 2025
Contributed by Lukas
Review of the punningly titled 2019 paper "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension...
Attention is all you need
07 Aug 2025
Contributed by Lukas
Review of the seminal 2017 paper "Attention is all you need". This paper introduces the Transformer architecture, a dominant model in natural language p...