VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag
Key claim
MLA reduces per-token KV-cache memory by 92.7% in autoregressive video diffusion.
In plain English
This paper applies Multi-Head Latent Attention (MLA) to autoregressive video diffusion and reports a 92.7% reduction in per-token KV-cache memory while maintaining quality. It also reports that the MLA-based model outperforms existing methods in long-horizon streaming video diffusion and improves throughput. If the results hold more broadly, minute-scale video generation could become cheaper to serve.
The introduction of Multi-Head Latent Attention in video diffusion represents a significant advancement in the field.
The paper provides strong experimental results and comparisons to existing methods, supporting its claims.
Deep reliability assessment
The methodology supports the claim that an MLA-style latent KV layout can substantially reduce cached KV memory in a Wan-style causal video diffusion setup while preserving reported VBench quality and improving single-GPU throughput. The broader claim that the MLA bottleneck, rather than pretrained low-rank structure, explains success is plausible from their spectral analyses, but may be overclaimed without broader model scales, datasets, hardware, and long-rollout failure-mode evaluations.
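To make the memory claim concrete, the arithmetic behind a per-token KV-cache reduction can be sketched as follows. This is a minimal sketch, not the paper's accounting: the function name and the widths used in the example (model width 1536, latent width 192, positional key width 32) are assumptions chosen so the arithmetic lands near the reported figure. The source confirms only the 192-dimensional latent mentioned under the key figure, and the paper's actual cache layout may differ.

```python
def kv_cache_reduction(d_model: int, d_latent: int, d_pos: int = 0) -> float:
    """Fraction of per-token, per-layer cache entries saved by a latent layout.

    Dense attention caches one key vector and one value vector, each of
    width d_model, for every token. An MLA-style layout instead caches one
    shared latent of width d_latent, plus an optional separately cached
    positional key of width d_pos.
    """
    dense_entries = 2 * d_model          # keys + values
    latent_entries = d_latent + d_pos    # shared latent + positional key
    return 1.0 - latent_entries / dense_entries


# Hypothetical widths (assumptions, not confirmed by the source).
reduction = kv_cache_reduction(d_model=1536, d_latent=192, d_pos=32)
print(f"per-token cache reduction: {reduction:.1%}")
```

The ratio is independent of sequence length and layer count, so it describes the cache only. Whether it translates into end-to-end savings depends on how much of total memory and compute the cache accounts for.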
Reproducibility
Partial: the paper mentions a project page at https://videomla.github.io, but the provided text does not explicitly mention open-source code, released checkpoints, or training datasets; evaluation uses VBench and a Wan2.1-T2V-1.3B-style setup.
Discussion questions
- 1. If pretrained video diffusion attention is not low-rank, is MLA succeeding because the model relearns a compressed attention mechanism during finetuning rather than because the original KV structure was redundant?
- 2. For builders serving video generation in SEA with constrained GPU budgets, does the 92.7% KV-cache reduction translate into larger batch sizes and lower cost in real deployments, or is throughput still dominated by non-cache compute and memory movement?
- 3. What result would falsify the paper's core claim: quality collapse at longer rollouts, failure on larger video models, inability to preserve prompt consistency, or a benchmark where dense KV significantly outperforms VideoMLA at the same training budget?
Key figure
Figure 1 shows that the pretrained Wan2.1-T2V-1.3B attention projections have high effective rank, with a 192-dimensional latent capturing only about 45.8% median spectral energy and 99%-energy rank exceeding 1300 in every layer.
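The two quantities in this figure, the spectral energy captured by a fixed-rank latent and the rank needed to reach 99% energy, can be computed from the singular values of a projection matrix. The sketch below is one plausible way to do that, not the paper's analysis code: the function names are hypothetical, "energy" is assumed to mean squared singular values, and the matrices are random stand-ins rather than Wan2.1 weights.

```python
import numpy as np


def spectral_energy_captured(weight: np.ndarray, rank: int) -> float:
    """Share of squared singular-value mass in the top `rank` directions."""
    s = np.linalg.svd(weight, compute_uv=False)  # sorted in descending order
    energy = s ** 2
    return float(energy[:rank].sum() / energy.sum())


def energy_rank(weight: np.ndarray, threshold: float = 0.99) -> int:
    """Smallest rank whose cumulative spectral energy reaches `threshold`."""
    s = np.linalg.svd(weight, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index where the cumulative share is >= threshold, as a 1-based rank.
    idx = int(np.searchsorted(cumulative, threshold))
    return min(idx + 1, len(s))


# Random stand-ins for illustration only (not real model weights).
rng = np.random.default_rng(0)
full_rank = rng.standard_normal((64, 64))
low_rank = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))

for name, w in [("full-rank", full_rank), ("rank-8", low_rank)]:
    print(name, spectral_energy_captured(w, 8), energy_rank(w))
```

A matrix that is truly low-rank reaches nearly all of its energy within a few directions, while a high effective rank, as the figure reports for the pretrained projections, leaves a fixed-width latent with only a fraction of the energy.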
