This article introduces a new study, CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs, whose core goal is to make Transformer training more efficient by targeting the "memory-intensive" operations that look fragmented but add up to significant time.

On May 22, Tri Dao retweeted a post from Han Guo on social media and added: "After some mathematical rewriting, it turns out that everything in Transformer is a series of GEMMs + epilogues. Given some optimized primitives, LLMs (and beginners) can write light-speed kernels for all Transformer operations!"

Tri Dao is one of the core authors of the FlashAttention series, and the tweet refers to a paper released that day: CODA.

Paper title: CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Paper address: https://arxiv.org/abs/2605.19269
Code address: https://github.com/HanGuo97/coda-kernels

The name evokes a musical "finale" and is pronounced like "CUDA." Researchers from MIT, Princeton, Together AI, and Meta are using a new programming abstraction to systematically absorb the rarely noticed but consistently time-consuming "fragmented computations" in Transformer training.

Background: The "laziness tax" of training large models

To understand what problem CODA solves, we must first understand where the training time for large models goes. When training a LLaMA-3-style 1B-parameter model on an NVIDIA H100, most people intuitively assume the time is spent on matrix multiplication and attention, since that is the "real computation." This intuition is largely correct: matrix multiplication (GEMM) and attention do account for the majority of the computation.
However, a close look at the profiler reveals a group of "small operators" quietly consuming time: normalization (RMSNorm), activation functions (SwiGLU), position encoding (RoPE), residual addition, cross-layer reduction, and more. Each of these operations is cheap to compute on its own, but they repeatedly move large intermediate tensors in and out of GPU memory. This is the so-called "memory bandwidth bottleneck": it is like a master chef who, for every dish, has to fetch ingredients from a distant warehouse and return them after use, instead of keeping them on the counter at hand. No matter how fast the chef's hands are, the time spent waiting on those trips is wasted.

Worse still, as low-precision formats such as FP8 and FP4 on NVIDIA GPUs make matrix computation faster, the relative cost of these "moving" operations is actually increasing: matrix multiplication has accelerated, but the cost of moving tensors in and out has not decreased proportionally.

One data point in the paper makes this concrete: when training a 1B-parameter model on the H100 with TorchTitan, non-matrix-multiplication operations account for a significant portion of the end-to-end runtime, and this proportion becomes even more pronounced once FP8 precision is introduced.

Existing programming frameworks are almost powerless to address this. PyTorch expresses Transformer computation as a sequence of operators with clear boundaries between them. These boundaries are very friendly to automatic differentiation (autograd), but they also prevent cross-operator fusion: each operator boundary often results in an unnecessary write-back to memory.

CODA: A treasure lies hidden in the "epilogue"

CODA's starting point is a simple observation.
On a GPU, a high-performance matrix multiplication (GEMM) kernel is structurally divided into two parts: the main loop performs the core block-wise multiply-accumulate computation, and the epilogue handles finishing work before the results are written back to memory, such as adding a bias, type conversion, and simple scaling.

The significance of the epilogue is that the output of the matrix multiplication is still "alive" in on-chip registers and has not yet been written to global memory. This is a brief golden window: if more computation can be performed at this moment, one round trip of writing to and reading from global memory can be avoided entirely.

CODA's core insight is that many memory-intensive operations in the Transformer can be algebraically reparameterized and squeezed into this epilogue window. This takes some mathematical care.

Take the most common GEMM-RMSNorm-GEMM pattern as an example: the result of a matrix multiplication goes through residual addition and RMS normalization, and then another matrix multiplication is performed. The traditional approach executes three independent operators serially, writing intermediate results to global memory twice.

The CODA team observed that the row scaling factor r in RMS normalization, because it is a single scalar per row, commutes with the subsequent matrix multiplication: applying r can be postponed from "before the second GEMM" to "the epilogue of the second GEMM." After this deferral, the epilogue of the first GEMM only needs to compute a local "partial RMS," which is merged by a very lightweight auxiliary reduction kernel, and the standalone RMSNorm computation disappears.

Similar reparameterizations apply to operations such as SwiGLU, RoPE (rotary position embedding), and cross-entropy loss, and even to backpropagation.
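The GEMM-RMSNorm-GEMM rewrite described above can be checked numerically with a small sketch. The code below is not taken from the CODA repository: it is plain Python with hypothetical helper names (`rmsnorm_then_gemm`, `deferred_scale_gemm`), it starts from `h`, standing in for the first GEMM's output after residual addition, and it assumes the RMSNorm weight `gamma` can be folded into the second weight matrix, which may differ from how the paper handles it. It only illustrates why a per-row scale can be moved to after the second matrix multiplication.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def rmsnorm_then_gemm(h, gamma, w, eps=1e-6):
    """Reference path: normalize each row of h, then multiply by w."""
    normed = []
    for row in h:
        r = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v * r * g for v, g in zip(row, gamma)])
    return matmul(normed, w)

def deferred_scale_gemm(h, gamma, w, tile=4, eps=1e-6):
    """Rewritten path: per-tile partial sums, a small reduction, row scale applied last."""
    d = len(h[0])
    # Stand-in for the first GEMM's epilogue: one partial sum of squares per column tile.
    partial = [[sum(v * v for v in row[s:s + tile]) for s in range(0, d, tile)]
               for row in h]
    # Stand-in for the auxiliary reduction kernel: merge partials into one scale per row.
    r = [1.0 / math.sqrt(sum(p) / d + eps) for p in partial]
    # Assumption of this sketch: fold gamma into w by scaling row k of w by gamma[k].
    w_folded = [[g * v for v in w_row] for g, w_row in zip(gamma, w)]
    # The second GEMM consumes the un-normalized h; its "epilogue" applies r per row.
    out = matmul(h, w_folded)
    return [[r_i * v for v in row] for r_i, row in zip(r, out)]

random.seed(0)
m, d, n = 3, 8, 5  # rows, hidden size, output size (arbitrary small values)
h = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d)]
w = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(d)]

ref = rmsnorm_then_gemm(h, gamma, w)
fused = deferred_scale_gemm(h, gamma, w)
max_err = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
print("paths agree:", max_err < 1e-9)
```

The two paths agree up to floating-point rounding because row i of the reference output is r_i times row i of `h` scaled by `gamma` and multiplied by `w`, and a scalar factor can be pulled out of a matrix product. On a real GPU the benefit comes from never writing the normalized tensor to global memory, which this CPU sketch does not model.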
The paper includes a theorem proving that as long as the forward epilogue is "partially local," backpropagation automatically inherits the same structure. See the original paper for details.

Five "building blocks" and a "LEGO language"

CODA is not a specific fused kernel, but a set of programming abstractions. It keeps the expert-optimized GEMM main loop fixed and exposes five types of composable primitives in the epilogue:
- element-wise transformations (residual addition, activation functions, RoPE)
- vector loads and stores (broadcasting RMSNorm weights)
- matrix-block loads and stores (saving intermediate activations for backpropagation)
- block reductions (local root mean square, block log-sum-exp)
- stateful transformations (the max and sum-exp statistics required for online normalization)

With these five types of building blocks, almost all operations in the forward and backward passes of a standard Transformer, except for attention, can be covered.

Even more interesting is how little this abstraction cares about "who writes the code." The paper evaluated two implementation modes: one written by human programmers, and one generated with Claude Code. In the second mode, the AI is given the CODA primitive specifications, several examples, and implementation logs, and completes most of the kernel code under light human supervision. Both modes achieved high performance. Tri Dao's remark that "LLMs (and beginners) can write light-speed kernels" is a direct reflection of the paper's experimental results.

CODA was evaluated against demanding baselines: cuBLAS with torch.compile, plus Liger Kernel and FlashInfer, which are optimized specifically for LLMs.
The paper evaluated two implementations of each kernel. CODA (LLM) was generated by Claude Code: researchers provided primitive descriptions, examples, and a continuously updated log of implementation tips, and the AI wrote the main code under light human supervision. CODA (Human) was written independently by human programmers, using the same high-level reparameterization approach but not relying on the CODA primitive set itself. Both sets of results were compared with optimized libraries such as cuBLAS + torch.compile, Liger Kernel, and FlashInfer.

At the single-operator level, taking the typical GEMM-RMSNorm-GEMM pattern as an example, CODA outperforms the cuBLAS + PyTorch baseline at all three hidden dimensions, corresponding to model sizes of 1B, 7B, and 70B. The SwiGLU, RoPE, and cross-entropy combinations show similar gains. The LLM-generated kernels are comparable to the hand-written versions on most benchmarks and even slightly surpass them on some configurations. That is a rare result in GPU kernel optimization, a field with a historically high barrier to entry.

The gains in backpropagation are particularly large: the backward kernel of GEMM-Residual-PartialRMS-GEMM achieves a 1.6x to 1.8x speedup over the baseline, and the SwiGLU backward pass improves by approximately 1.4x to 1.6x. Here too, the gap between the LLM and manual implementations is minimal. This is not surprising: backpropagation naturally touches more intermediate tensors, which makes epilogue fusion more beneficial, and CODA's primitive design is clear enough for AI models to compose the primitives correctly.

In end-to-end benchmarks of a complete Transformer layer, CODA's forward speedup ranges from approximately 5% to 20% across scales, with larger effects at larger model sizes (the 70B hidden dimension).
Regarding numerical accuracy, CODA's reparameterization changes when the RMSNorm scaling factor is applied, but experiments show that its numerical error is comparable to the PyTorch reference implementation, and even smaller in some configurations, thanks to the higher-precision accumulator in the GEMM main loop itself.

What CODA can do: A quick reference guide

Before stepping back to the bigger picture, let's clarify the boundaries of CODA's capabilities.
- Coverage: Almost all computation in the forward and backward passes of a standard Transformer (such as the LLaMA architecture) other than attention and word embeddings, including RMSNorm, residual addition, SwiGLU activation, RoPE rotary position encoding, cross-entropy loss, and the backward gradient computations for these operations.
- Speedup: For hidden dimensions corresponding to 1B to 70B models, single operators show varying degrees of improvement over the cuBLAS + torch.compile baseline, with the largest benefit in backpropagation (some kernels exceed a 1.6x speedup). The end-to-end forward speedup of a complete Transformer layer is approximately 5% to 20%, with larger effects at larger model sizes.
- Usability: CODA is implemented on CuTeDSL (NVIDIA CUTLASS's Python DSL) and supports kernels written by either human programmers or AI models, both achieving high performance.
- Current limitations: Only single-GPU scenarios are supported, with no distributed training. The reparameterization primarily targets standard Transformer architectures, and its applicability to other architectures needs further verification.

Conclusion

CODA is not an isolated project. It is a concrete instance of a class of ideas: on GPUs, the real optimization space often lies not in "what to compute," but in "how to implement it."
FlashAttention lets attention computation "live" in on-chip memory, and CODA attempts to let normalization and activation functions "live" there as well. Triton lowers the barrier to writing custom kernels, while ThunderKittens, TileLang, and others explore this space further at different levels. These works all point in the same direction: unifying the expressive convenience of PyTorch operator graphs with the near-hand-written execution efficiency of CUDA within a programmable framework.

The last sentence of Tri Dao's tweet is worth revisiting: "Given some optimized primitives, LLMs (and beginners) can write light-speed kernels for all Transformer operations!"

There is a deeper logic behind this: when the programming abstraction is well designed, the AI model itself can participate in optimizing its own training infrastructure. This cycle is what makes CODA so intriguing.

From this perspective, the name "CODA" might have a deeper meaning. In classical music, the coda is the concluding section of a piece. Here, it is the "epilogue" of the GEMM kernel, and crafting this epilogue well might be the next important chapter in improving the efficiency of Transformer training systems.

[Machine Heart]
CODA: The GPU Kernel Revolution Transforming AI and Its Implications for Crypto
The introduction of CODA (Rewriting Transformer Blocks as GEMM-Epilogue Programs) represents a paradigm shift in how we approach GPU computation for large language models, with significant potential ripple effects across the blockchain and cryptocurrency ecosystem. This breakthrough, led by researchers from MIT, Princeton, Together AI, and Meta including FlashAttention architect Tri Dao, directly addresses the “laziness tax” in model training—the cumulative performance drain from seemingly small memory-intensive operations that collectively consume substantial computing resources.
Technical Breakthrough and Market Impact
CODA’s core innovation lies in its mathematical reparameterization approach that transforms fragmented operations into optimized GEMM-Epilogue programs. By leveraging the brief golden window when matrix multiplication results remain in on-chip registers, CODA eliminates unnecessary memory round trips, achieving roughly 5-20% forward speedups in end-to-end benchmarks of a full Transformer layer and 1.6x to 1.8x speedups on some backward kernels. This isn’t merely an incremental improvement; it fundamentally rethinks how computational primitives interact on GPU architecture.
For crypto investors, this development carries several implications:
- AI-Blockchain Convergence Acceleration: Projects at the intersection of AI and blockchain (such as decentralized AI infrastructure, analytics platforms, and oracles) could benefit tremendously from these efficiency gains. The ability to train more powerful models at lower operational costs directly improves the economic viability of AI-driven blockchain applications.
- Competitive Landscape in GPU-Optimized Projects: The CODA approach intensifies the competitive pressure on existing GPU optimization frameworks. Projects like ThunderKittens, TileLang, and those built on Triton will need to either adapt or differentiate as the bar for performance has been raised significantly. This could lead to market consolidation or strategic pivots.
- Democratization of High-Performance Computing: Perhaps most intriguing is CODA’s dual-mode implementation: both human-written and LLM-generated code achieve comparable performance levels. This lowers the barrier to entry for developing high-performance kernels, potentially accelerating innovation across the stack. For blockchain projects, this means access to cutting-edge optimization techniques no longer requires elite GPU programming expertise.
Token Price Implications and Investment Opportunities
The market implications for specific tokens and sectors are multifaceted:
- Infrastructure Tokens: Projects providing GPU infrastructure for AI/ML workloads could see enhanced value propositions as CODA-like optimizations reduce operational costs while increasing throughput. This could positively impact tokens from decentralized GPU providers and compute marketplaces.
- AI-Blockchain Hybrid Projects: Tokens from projects that successfully integrate CODA-like optimizations into their AI components could experience significant upside as they demonstrate clear performance advantages over competitors. The reduced training costs could translate to more sustainable token economics.
- Oracles and Analytics Platforms: The efficiency gains in Transformer models directly benefit projects requiring sophisticated data analysis on-chain. Enhanced analytics capabilities could attract more users and data providers to these platforms, potentially driving token value.
However, investors should exercise caution regarding:
- Adoption Timeline: While CODA demonstrates impressive performance benchmarks, widespread implementation will take time. The current limitations (single-GPU support, standard Transformer architectures only) restrict immediate applicability in many blockchain contexts.
- Competitive Response: Established players like NVIDIA and cloud providers may rapidly incorporate similar optimizations into their proprietary stacks, potentially narrowing the competitive advantage for early adopters.
- Security Considerations: LLM-generated kernel code introduces security risks and must be thoroughly vetted before deployment in critical blockchain infrastructure.
Strategic Outlook
CODA represents more than just a technical optimization; it embodies a broader trend toward computational efficiency that will increasingly influence blockchain development. As the crypto industry continues its pursuit of scalability, interoperability, and real-world utility, the ability to run complex computations efficiently will become a critical differentiator.
For investors, this signals a shift toward valuing projects not just for their tokenomics or consensus mechanisms, but for their underlying technical efficiency and ability to leverage cutting-edge computational advances. The blockchain projects that successfully integrate CODA-like optimizations will be better positioned to deliver superior performance at lower costs, potentially capturing greater market share.
The most significant opportunity lies in projects that can bridge the gap between AI optimization breakthroughs like CODA and blockchain infrastructure, creating a new class of high-performance decentralized applications that were previously impractical due to computational constraints.