2026-05-28 | vision, infra, code

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

Key claim

MLA reduces per-token KV-cache memory by 92.7% in autoregressive video diffusion.

In plain English

This paper adapts Multi-Head Latent Attention (MLA) to autoregressive video diffusion, cutting per-token KV-cache memory by 92.7% while maintaining quality. It reports that MLA can outperform existing methods in long-horizon streaming video diffusion and improve single-GPU throughput. The smaller cache could make long video generation cheaper to run.
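As a rough sanity check on the 92.7% figure, the sketch below compares per-token, per-layer cache entries for a dense key/value cache against an MLA-style latent cache. The 192-dimensional latent is the size quoted in the key figure of this summary; the model width of 1536 and the 32-dimensional shared positional key are assumptions that this summary does not state, chosen only as a plausible configuration for a Wan2.1-T2V-1.3B-style model.

```python
def kv_reduction(d_model: int, d_latent: int, d_rope: int) -> float:
    """Fraction of per-token, per-layer cache entries saved by a latent cache."""
    dense_entries = 2 * d_model          # one key and one value vector per token
    latent_entries = d_latent + d_rope   # one shared latent plus a positional key
    return 1.0 - latent_entries / dense_entries


# ASSUMPTION: d_model=1536 and d_rope=32 are illustrative values,
# not taken from the paper. Only d_latent=192 appears in this summary.
reduction = kv_reduction(d_model=1536, d_latent=192, d_rope=32)
print(f"{reduction:.1%}")  # prints 92.7%
```

Under these assumed values the arithmetic lands on the reported figure, which is consistent with such a layout but does not confirm it.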

Novelty

8.5/10

Bringing Multi-Head Latent Attention to the KV cache of autoregressive video diffusion is a significant advance for long-horizon generation.

Reliability

8.0/10

The paper provides strong experimental results and comparisons to existing methods, supporting its claims.

Deep reliability assessment

The methodology supports the claim that an MLA-style latent KV layout can substantially reduce cached KV memory in a Wan-style causal video diffusion setup while preserving reported VBench quality and improving single-GPU throughput. The broader claim that the MLA bottleneck, rather than pretrained low-rank structure, explains success is plausible from their spectral analyses, but may be overclaimed without broader model scales, datasets, hardware, and long-rollout failure-mode evaluations.
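To make "MLA-style latent KV layout" concrete, here is a minimal single-head sketch of the general idea: each token is projected down to a small latent, only that latent is cached, and keys and values are rebuilt from the cache when attention is computed. It is a generic illustration, not the paper's architecture. The toy sizes, the random weights, and the names W_down, W_up_k, W_up_v, and step are all hypothetical, and positional encoding and multiple heads are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_tokens = 64, 8, 5  # toy sizes, hypothetical

# Hypothetical projection weights; in a real model these are learned.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)


def step(x, cache):
    """Cache one token's latent, then attend over every cached token."""
    cache = np.vstack([cache, x @ W_down])  # only the latent is stored
    k = cache @ W_up_k                       # keys rebuilt from latents
    v = cache @ W_up_v                       # values rebuilt from latents
    q = x @ W_q
    scores = (k @ q) / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v, cache


latent_cache = np.empty((0, d_latent))
for _ in range(n_tokens):
    out, latent_cache = step(rng.standard_normal(d_model), latent_cache)

# The cache holds d_latent numbers per token instead of 2 * d_model.
print(latent_cache.shape, out.shape)
```

The trade-off this exposes is the one the assessment points at: memory falls because the cache is narrow, but keys and values must be reconstructed at attention time, so throughput gains depend on how that extra compute compares with the memory traffic saved.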

Reproducibility

Partial: the paper mentions a project page at https://videomla.github.io, but the provided text does not explicitly mention open-source code, released checkpoints, or training datasets; evaluation uses VBench and a Wan2.1-T2V-1.3B-style setup.

Discussion questions

1. If pretrained video diffusion attention is not low-rank, is MLA succeeding because the model relearns a compressed attention mechanism during finetuning rather than because the original KV structure was redundant?
2. For builders serving video generation in SEA with constrained GPU budgets, does the 92.7% KV-cache reduction translate into larger batch sizes and lower cost in real deployments, or is throughput still dominated by non-cache compute and memory movement?
3. What result would falsify the paper's core claim: quality collapse at longer rollouts, failure on larger video models, inability to preserve prompt consistency, or a benchmark where dense KV significantly outperforms VideoMLA at the same training budget?

Key figure

Figure 1 shows that the pretrained Wan2.1-T2V-1.3B attention projections have high effective rank, with a 192-dimensional latent capturing only about 45.8% median spectral energy and 99%-energy rank exceeding 1300 in every layer.
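The two quantities in that figure can be computed from the singular values of a projection matrix. The sketch below is one common way to do it, assuming "spectral energy" means the sum of squared singular values; the paper may define it differently. The function names are hypothetical, and the demo uses small synthetic matrices rather than real Wan2.1 weights.

```python
import numpy as np


def energy_captured(W, r):
    """Share of spectral energy held by the top-r singular values of W."""
    energy = np.linalg.svd(W, compute_uv=False) ** 2  # sorted, largest first
    return float(energy[:r].sum() / energy.sum())


def energy_rank(W, frac=0.99):
    """Smallest rank whose top singular values hold at least `frac` of the energy."""
    energy = np.linalg.svd(W, compute_uv=False) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, frac) + 1)


rng = np.random.default_rng(0)
full_rank = rng.standard_normal((64, 64))                              # no low-rank structure
low_rank = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))  # rank 4 by construction

for name, W in [("full-rank", full_rank), ("low-rank", low_rank)]:
    print(name, round(energy_captured(W, 8), 3), energy_rank(W))
```

A matrix with genuine low-rank structure reaches 99% energy at a small rank, while a dense random matrix needs most of its dimensions. The figure places the pretrained projections in the second regime.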

Code link

https://videomla.github.io (Official)