STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models
Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao
Key claim
STORM improves video reasoning accuracy and reduces latency.
The STORM framework internalizes the reasoning process in latent trajectories instead of relying on external tools or textual reasoning chains. The authors report that this improves accuracy while reducing inference time, so that STORM outperforms existing methods in both efficiency and effectiveness.
In plain English
The proposed STORM framework introduces an approach to visual reasoning that reduces reliance on explicit textual representations.
The experiments report improved accuracy and reduced inference overhead across multiple datasets, which supports the paper's claims.
Deep reliability assessment
The methodology supports the claim that STORM can improve video reasoning accuracy while reducing inference overhead by internalizing reasoning into latent states. However, training relies on generated thought-video supervision, so results may vary with the quality of the generated content.
Reproducibility
Yes, the paper mentions that the code is available at https://github.com/aiming-lab/storm.
Discussion questions
1. How does the quality of the generated thought videos impact the effectiveness of the latent reasoning process?
2. What are the practical implications of reducing inference-time complexity for real-time video processing applications?
3. What specific scenarios or datasets could demonstrate the limitations of STORM's internalized reasoning approach?
Key figure
Figure 1 illustrates the STORM training sequence, showing how generated thought videos provide dynamic supervision for latent tokens, which encode temporal evidence before the model generates the final answer.
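To make the idea of supervising latent tokens more concrete, here is a minimal sketch of one way such a training objective could be combined. This summary does not state the paper's actual loss, so everything here is an assumption for illustration: the pairing of one latent state per thought-video frame, the use of mean squared error for alignment, and the names `training_loss`, `latent_weight`, and `thought_frame_features` are all hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy for one answer choice."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]

def training_loss(latent_states, thought_frame_features,
                  answer_logits, answer_index, latent_weight=1.0):
    """Hypothetical combined objective (not taken from the paper).

    latent_states: one vector per latent token.
    thought_frame_features: one feature vector per generated thought-video
        frame, assumed here to pair one-to-one with the latent tokens.
    """
    assert len(latent_states) == len(thought_frame_features)
    # Alignment term: pull each latent state toward its paired frame feature.
    latent_term = sum(
        mse(h, f) for h, f in zip(latent_states, thought_frame_features)
    ) / len(latent_states)
    # Answer term: standard cross-entropy on the final answer.
    answer_term = cross_entropy(answer_logits, answer_index)
    return answer_term + latent_weight * latent_term
```

Under these assumptions, the thought videos are needed only to compute the alignment term during training. At inference time the model would produce its latent states and answer directly, which is consistent with the reduced inference overhead the summary describes.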