AI: post transformers
Episodes
FP8 Quantization
08 Aug 2025
Contributed by Lukas
Three sources are reviewed to understand the value of FP8 quantization: https://www.baseten.co/blog/33-faster-llm-inference-with-fp8-quantization/ https...
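As a rough sketch of what per-tensor FP8 quantization looks like in practice (the function names here are ours, not from the sources; assumes a PyTorch recent enough to ship the float8_e4m3fn dtype, whose largest finite value is 448):

import torch

# Per-tensor absmax scaling into FP8 (e4m3): map the tensor's dynamic range
# onto [-448, 448], cast, and keep the scale around for dequantization.
def quantize_fp8(x):
    scale = x.abs().max() / 448.0
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8, scale):
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4, 4)
x_q, scale = quantize_fp8(x)
print((x - dequantize_fp8(x_q, scale)).abs().max())  # small rounding error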
Mistral 7B: Superior Performance in a Smaller Package
08 Aug 2025
Contributed by Lukas
This paper introduces Mistral 7B, a new 7-billion-parameter language model designed for both superior performance and efficiency. The paper highli...
Diffusion Probabilistic Models: Deep Unsupervised Learning Through Diffusion Processes
08 Aug 2025
Contributed by Lukas
This paper introduces a novel deep unsupervised learning algorithm that leverages non-equilibrium thermodynamics to model complex datasets. The co...
Teraio: Cost-Efficient LLM Training via Lifetime-Aware Tensor Offloading
08 Aug 2025
Contributed by Lukas
The research introduces Teraio, a novel framework designed to enhance the cost-efficiency and performance of large language model (LLM) training. This...
Shared Virtual Memory: Design and Performance Implications
08 Aug 2025
Contributed by Lukas
The provided academic paper investigates Shared Virtual Memory (SVM), a technology that integrates GPU memory into host virtual memory systems to impr...
vLLM & PagedAttention: Efficient LLM Serving with Virtual Memory
08 Aug 2025
Contributed by Lukas
This document introduces PagedAttention, an innovative attention algorithm, and vLLM, a high-throughput serving system for large language models (LLMs...
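For a feel of the core idea, a minimal sketch of the virtual-memory analogy (the names, block size, and table contents are illustrative, not vLLM's actual data structures):

# A sequence's KV cache is stored in fixed-size physical blocks; a per-sequence
# block table maps logical block indices to physical block ids, like page tables.
BLOCK_SIZE = 16
block_table = [7, 2, 9]  # logical block i -> physical block id

def physical_slot(token_pos):
    block_id = block_table[token_pos // BLOCK_SIZE]  # which physical block
    return block_id, token_pos % BLOCK_SIZE          # offset within that block

print(physical_slot(37))  # token 37 lives at (block 9, offset 5)

Because blocks are allocated on demand and can be shared between sequences, memory waste and duplication drop sharply compared with preallocating one contiguous cache per sequence.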
ZeRO-Offload: Democratizing Billion-Scale Model Training
08 Aug 2025
Contributed by Lukas
This reviews the paper which introduced ZeRO-Offload, a novel technology designed to democratize large-scale deep learning model training by making ...
AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms
08 Aug 2025
Contributed by Lukas
The paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter (2017) introduced what we know in Pytorch now as AdamW.This acad...
Adam: A Method for Stochastic Optimization
08 Aug 2025
Contributed by Lukas
This reviews the 2015 paper which introduced Adam, an algorithm for first-order gradient-based optimization in scenarios involving stochastic objec...
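For reference, the update rule the paper proposes, with g_t the gradient at step t, b1 and b2 the decay rates, a the learning rate, and eps a small constant:

m_t = b1 * m_{t-1} + (1 - b1) * g_t            (first-moment estimate)
v_t = b2 * v_{t-1} + (1 - b2) * g_t^2          (second-moment estimate)
m̂_t = m_t / (1 - b1^t)                         (bias correction)
v̂_t = v_t / (1 - b2^t)
θ_t = θ_{t-1} - a * m̂_t / (sqrt(v̂_t) + eps)   (parameter update)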
FlashAttention-2: Faster Attention with Better Parallelism
08 Aug 2025
Contributed by Lukas
This reviews the paper that introduces FlashAttention-2, an optimized attention algorithm designed to significantly improve the speed and efficiency o...
PartitionedVC: Optimizing External Memory Graph Analytics
08 Aug 2025
Contributed by Lukas
This paper introduces PartitionedVC, an innovative external memory graph analytics framework designed to enhance the processing of large graphs that ...
Fast Learning for Deep Belief Networks
08 Aug 2025
Contributed by Lukas
We review a letter communicated by Yann LeCun and published in Neural Computation in 2006, which details a fast learning algorithm for deep belief net...
Mu: The Windows Settings Agent Language Model
08 Aug 2025
Contributed by Lukas
We review a post from the Windows Experience Blog that introduces a new era of Windows experiences, heavily integrated with Artificial Intelligence...
Lost in the Middle: How Language Models Use Long Contexts
08 Aug 2025
Contributed by Lukas
This academic paper explores how language models utilize long input contexts, focusing on their ability to identify and retrieve relevant information....
MUVERA: Efficient Multi-Vector Information Retrieval
08 Aug 2025
Contributed by Lukas
We review MUVERA, a novel algorithm designed to significantly improve the efficiency of multi-vector information retrieval. Traditional information ...
ColBERT and ColBERT v2
08 Aug 2025
Contributed by Lukas
We review and expand upon two papers on ColBERT, a neural information retrieval model that utilizes contextualized late interac...
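A minimal sketch of the late-interaction ("MaxSim") scoring at the heart of ColBERT (shapes and names are illustrative): every query token is matched against its best document token, and the per-token maxima are summed into the relevance score:

import torch
import torch.nn.functional as F

def maxsim_score(Q, D):
    # Q: (query_tokens, dim), D: (doc_tokens, dim), both L2-normalized
    sim = Q @ D.T                        # token-level cosine similarities
    return sim.max(dim=1).values.sum()   # best doc match per query token, summed

Q = F.normalize(torch.randn(8, 128), dim=-1)
D = F.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(Q, D))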
Microscaling Quantization for Large Language Models
08 Aug 2025
Contributed by Lukas
Double paper review for modern quantization techniques. These two academic papers address the crucial challenge of quantizing Large Language Models (LL...
Chamfer Matching: Image Registration and Medical Applications
07 Aug 2025
Contributed by Lukas
This paper provides an overview of Chamfer Matching, a classical image registration method primarily used for segmented features. The authors explain its...
LeNet-5: Convolutional Networks for Character Recognition
07 Aug 2025
Contributed by Lukas
This paper outlines the advancements in Optical Character Recognition (OCR), particularly focusing on handwritten character and word recognition us...
Constitutional AI: Harmlessness Through Self-Improvement
07 Aug 2025
Contributed by Lukas
This paper details "Constitutional AI," a novel method for training AI assistants to be harmless without extensive human-labeled data for harmful outp...
GQA: Grouped Query Attention
07 Aug 2025
Contributed by Lukas
This paper introduces Grouped-Query Attention (GQA), a novel approach designed to enhance the inference efficiency of large language models. It addr...
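The core trick in one function (shapes and names are illustrative, not the paper's code): several query heads share a single key/value head, so the cached KV tensors are simply repeated to line up with the query heads:

import torch

def repeat_kv(kv, num_query_heads):
    # kv: (batch, num_kv_heads, seq, head_dim)
    num_kv_heads = kv.shape[1]
    groups = num_query_heads // num_kv_heads    # query heads per KV head
    return kv.repeat_interleave(groups, dim=1)  # (batch, num_query_heads, seq, head_dim)

k = torch.randn(1, 2, 16, 64)   # only 2 KV heads are cached...
print(repeat_kv(k, 8).shape)    # ...but they serve 8 query heads: [1, 8, 16, 64]

The KV cache shrinks by the grouping factor, which is where the inference savings come from.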
Linformer: Efficient Self-Attention with Linear Complexity
07 Aug 2025
Contributed by Lukas
This academic paper introduces Linformer, a novel approach to address the computational bottleneck of Transformer models in natural language processin...
Longformer: A Transformer for Long Documents
07 Aug 2025
Contributed by Lukas
This paper introduces Longformer, a novel Transformer-based model designed to overcome the limitations of traditional Transformers in processing exc...
RoPE
07 Aug 2025
Contributed by Lukas
This paper introduces RoFormer, an enhanced Transformer model that leverages Rotary Position Embedding (RoPE) to improve natural language processin...
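A minimal sketch of the rotation itself, using the interleaved-pair convention (names are ours): each pair of channels is rotated by an angle proportional to the token's position, so query-key dot products depend only on relative position:

import torch

def rope(x, base=10000.0):
    # x: (seq, dim), dim even
    seq, dim = x.shape
    pos = torch.arange(seq).float().unsqueeze(1)              # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    angles = pos * freqs                                      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                           # channel pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                        # 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(16, 64))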
Layer Normalization and Dual Patch Normalization
07 Aug 2025
Contributed by Lukas
The first source introduces Layer Normalization (LN), a technique designed to accelerate and stabilize the training of deep neural networks, particula...
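A minimal sketch, assuming a 2-D activation (the paper is more general): statistics are computed per example over the feature dimension, which is what makes LN independent of the batch size:

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); one mean/variance per example
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

y = layer_norm(torch.randn(4, 512), torch.ones(512), torch.zeros(512))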
Batch Normalization
07 Aug 2025
Contributed by Lukas
This academic paper introduces Batch Normalization (BN), a novel technique designed to accelerate the training of Deep Neural Networks (DNNs) by addre...
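Contrast with the layer-norm sketch above: here the statistics run across the batch for each feature, which is what ties BN's behavior to the batch size (training-time only; at inference, running averages replace the batch statistics):

import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); one mean/variance per feature
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

y = batch_norm(torch.randn(32, 512), torch.ones(512), torch.zeros(512))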
Chinchilla: Optimal Language Model Scaling
07 Aug 2025
Contributed by Lukas
The Chinchilla research by DeepMind investigates the optimal model size and training tokens for large language models, aiming to maximize performance ...
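The headline finding reduces to a back-of-the-envelope ratio, using Chinchilla's own training run:

params, tokens = 70e9, 1.4e12   # 70B parameters, 1.4T training tokens
print(tokens / params)          # 20.0: roughly twenty tokens per parameter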
Transformer Scaling
07 Aug 2025
Contributed by Lukas
This research paper explores the scaling behavior of Transformer architectures, offering insights into pre-training and fine-tuning efficiency. It cha...
Learning from repeated data
07 Aug 2025
Contributed by Lukas
This 2022 paper explores the significant negative impact of repeated data on the performance of large language models, even when such repetitions c...
Scaling Laws
07 Aug 2025
Contributed by Lukas
This 2020 paper, titled "Scaling Laws for Neural Language Models," explores the empirical relationships between the performance of neural language ...
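The paper's headline result is a power law in parameter count N; a sketch with the paper's approximate fitted constants (quoted from memory, so treat them as indicative):

def loss_from_params(n, n_c=8.8e13, alpha=0.076):
    return (n_c / n) ** alpha   # L(N) = (N_c / N)^alpha

print(loss_from_params(1e8))    # ~2.8
print(loss_from_params(1e10))   # ~2.0: bigger model, lower predicted loss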
LSTM: the forget gate
07 Aug 2025
Contributed by Lukas
This 2000 paper introduces a novel solution to a weakness found in Long Short-Term Memory (LSTM) networks, specifically when processing continuous da...
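In modern notation, the fix is the forget gate f_t, a learned multiplier on the previous cell state that lets the network reset its own memory:

f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)   (forget gate, between 0 and 1)
c_t = f_t * c_{t-1} + i_t * c̃_t              (old memory can now be scaled down or erased)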
GPT4 Technical Report
07 Aug 2025
Contributed by Lukas
This 2023 paper, the GPT-4 Technical Report from OpenAI, introduces GPT-4, a multimodal AI model capable of processing both image and text inputs to pro...
GPT3
07 Aug 2025
Contributed by Lukas
This 2020 paper outlines the development and evaluation of GPT-3, a large language model, exploring its performance across various natural language p...
GPT2
07 Aug 2025
Contributed by Lukas
This 2019 paper, "Language Models are Unsupervised Multitask Learners," introduces GPT-2, a large language model designed for zero-shot learning, mean...
GELU
07 Aug 2025
Contributed by Lukas
This 2023 paper introduces Gaussian Error Linear Units (GELUs), a novel activation function for neural networks that outperforms traditional activations like R...
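The exact form is x * Φ(x), with Φ the standard normal CDF; a minimal sketch (the paper also gives a tanh approximation):

import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(1.0))   # ~0.841; unlike ReLU, inputs are weighted by their Gaussian percentile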
Dropout
07 Aug 2025
Contributed by Lukas
This 2014 journal article introduces "Dropout", a novel technique designed to combat overfitting in deep neural networks, which are powerful but prone...
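A minimal sketch of the training-time behavior, in the "inverted" form most frameworks use (the original paper instead rescales weights at test time; the expectation works out the same):

import torch

def dropout(x, p=0.5, training=True):
    if not training:
        return x                                # identity at test time
    mask = (torch.rand_like(x) >= p).float()    # keep each unit with prob 1-p
    return x * mask / (1.0 - p)                 # rescale so E[output] == x

y = dropout(torch.ones(10), p=0.5)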
ResNets - residual block
07 Aug 2025
Contributed by Lukas
What ResNet introduced is adding the input of a block directly to its output: Output = F(x) + x. This academic paper introduces Deep R...
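A minimal residual block sketch (batch norm omitted for brevity; the paper's blocks include it):

import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(out + x)                   # F(x) + x: the skip connection

y = ResidualBlock(16)(torch.randn(1, 16, 32, 32))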
BERT
07 Aug 2025
Contributed by Lukas
Review of the 2018 paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", which leverages the Transformer architecture, ...
BART
07 Aug 2025
Contributed by Lukas
Review of the punningly titled 2019 paper "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension...
Attention is all you need
07 Aug 2025
Contributed by Lukas
Review of the seminal 2017 paper "Attention is all you need". This paper introduces the Transformer architecture, a dominant model in natural language p...