Note: TimeZero is the original version of this work.
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang*, Ziheng Wang*, Boshen Xu*‡, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin†
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM
Ye Wang*, Boshen Xu*, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin†
- Base model support for MiMo-VL and InternVL3
- Time-R1: RL-based framework for temporal video grounding. We introduce a reasoning-guided post-training framework via RL with verifiable rewards to enhance the capabilities of LVLMs on the TVG task (a minimal reward sketch follows this list).
- TimeRFT: Time-aware reinforcement fine-tuning. We explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization.
- TVGBench: Comprehensive benchmark for LVLMs on TVG. We carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries.
- State-of-the-art results and generalization. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training samples, while improving its general video understanding capabilities.
- A codebase that supports training LVLMs with RL.
- Faster inference for temporal video grounding and video QA via the vLLM library.

- Experiment toolkits: support training on our TimeRFT data, Charades, and ActivityNet; support vLLM inference on TVGBench, Charades, ActivityNet, MVBench, TempCompass, VideoMME, and EgoSchema.
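The verifiable reward for TVG can be checked directly against the annotated segment. Below is a minimal, illustrative sketch of an IoU-based reward of this kind; the exact reward (and any additional format terms) used by Time-R1 is defined in the training scripts, so treat the helper names here as hypothetical.

```python
# Illustrative sketch of an IoU-based verifiable reward for temporal video grounding.
# NOT the repo's implementation; helper names are hypothetical.

def temporal_iou(pred, gt):
    """IoU between two segments given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt):
    """Verifiable reward: the predicted segment is scored by its temporal IoU with the ground truth."""
    return temporal_iou(pred, gt)

# Example: prediction (12.0s, 20.5s) vs. ground truth (10.0s, 18.0s) -> ~0.571
print(grounding_reward((12.0, 20.5), (10.0, 18.0)))
```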
For installation, see docs/INSTALL.md.
Demo I/O:

```bash
CUDA_VISIBLE_DEVICES=0 python demo.py --model_base ./ckpts/Time-R1-7B --video_path ./assets/OHOFG.mp4 --query "person sitting down in a chair."
```
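As an alternative to demo.py, the snippet below sketches direct inference, assuming the released Time-R1-7B checkpoint loads as a standard Qwen2.5-VL model via Hugging Face transformers and qwen_vl_utils; that assumption, the sampling settings, and the timestamp parsing are not confirmed by this README, and the repo's actual prompt template and answer format live in demo.py.

```python
# Hypothetical direct-inference sketch; assumes the checkpoint is Qwen2.5-VL-compatible.
import re
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

ckpt = "./ckpts/Time-R1-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "./assets/OHOFG.mp4", "fps": 1.0},  # example sampling rate
        # demo.py wraps the query in its own grounding prompt; the raw query alone
        # may not elicit the trained answer format.
        {"type": "text", "text": "person sitting down in a chair."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt", **video_kwargs).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)

# Crude timestamp extraction: grab the first two numbers as (start, end) in seconds.
# The actual output format is determined by the prompt template in demo.py.
nums = re.findall(r"\d+(?:\.\d+)?", answer)
if len(nums) >= 2:
    print("predicted segment:", float(nums[0]), "->", float(nums[1]))
```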
For data preparation, see docs/DATA.md.

For the TimeRFT post-training process:

```bash
# w/ sample filtering per epoch
bash scripts/posttrain/run_rl_SF.sh
# w/o sample filtering per epoch
bash scripts/posttrain/run_rl.sh
```
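The "sample filtering per epoch" in run_rl_SF.sh re-selects training samples as RL proceeds. The exact criterion is defined by the training script; the sketch below only illustrates one common way such filtering can work (keeping samples that are neither already solved nor hopeless under the IoU reward), and every threshold here is a hypothetical assumption rather than TimeRFT's actual rule.

```python
# Hypothetical per-epoch sample filtering sketch; NOT TimeRFT's actual rule.

def filter_samples(samples, reward_fn, keep_range=(0.05, 0.95)):
    """Keep samples whose latest rollout reward is neither ~0 (too hard yet) nor ~1 (already solved).

    samples: iterable of dicts with 'pred' and 'gt' (start_sec, end_sec) segments.
    reward_fn: e.g., the temporal-IoU reward sketched above.
    keep_range: hypothetical thresholds; the real criterion lives in scripts/posttrain/run_rl_SF.sh.
    """
    lo, hi = keep_range
    return [s for s in samples if lo <= reward_fn(s["pred"], s["gt"]) <= hi]
```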
For fine-tuning on downstream benchmarks like Charades and ActivityNet:

```bash
# first preprocess the dataset
bash scripts/finetune/preprocess_videos_ch.sh
# then fine-tune
bash scripts/finetune/run_charades.sh
```

After training, evaluate your model's performance on TVGBench/Charades/ActivityNet/MVBench/TempCompass/VideoMME/EgoSchema:
```bash
# remember to change BASE_PATH, EVAL_DATASET, and MODEL_NAME in test.sh
bash scripts/test.sh
```

We mainly compare with 7B open-source LVLMs trained by SFT.
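The tables below report R1@m: the percentage of queries whose top-1 predicted segment reaches a temporal IoU of at least m with the ground-truth segment. A minimal illustration of the computation (not the repo's evaluation script):

```python
# Illustrative R1@m computation for temporal video grounding (not the repo's eval code).

def temporal_iou(pred, gt):
    """IoU between two (start_sec, end_sec) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, thresh):
    """preds/gts: parallel lists of segments, one top-1 prediction per query; returns a percentage."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# Toy example with two queries.
preds = [(1.0, 5.0), (10.0, 14.0)]
gts = [(0.0, 6.0), (20.0, 30.0)]
for m in (0.3, 0.5, 0.7):
    print(f"R1@{m}: {recall_at_1(preds, gts, m):.1f}")
```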
- TVGBench (ZeroShot)
| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
|---|---|---|---|---|
| Gemini-2.5-Pro | - | 39.1 | 24.4 | 12.8 |
| VideoChat-Flash | SFT | 32.8 | 19.8 | 10.4 |
| TimeSuite | SFT | 31.1 | 18.0 | 8.9 |
| TRACE | SFT | 37.0 | 25.5 | 14.6 |
| Time-R1 (ours) | RL | 41.8 | 29.4 | 16.4 |
- Charades-STA (ZeroShot)
| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
|---|---|---|---|---|
| VideoChat-Flash | SFT | 74.5 | 53.1 | 27.6 |
| TimeSuite | SFT | 69.9 | 48.7 | 24.0 |
| TRACE | SFT | - | 40.3 | 19.4 |
| Time-R1 (ours) | RL | 78.1 | 60.8 | 35.3 |
- ActivityNet (ZeroShot)
| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
|---|---|---|---|---|
| HawkEye | SFT | 49.1 | 29.3 | 10.7 |
| VTimeLLM | SFT | 44.0 | 27.8 | 14.3 |
| Time-R1 (ours) | RL | 58.6 | 39.0 | 21.4 |
- Charades-STA (FineTune)
| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
|---|---|---|---|---|
| EaTR | VLP | - | 68.4 | 44.9 |
| HawkEye | SFT | 72.5 | 58.3 | 28.8 |
| TimeSuite | SFT | 79.4 | 67.1 | 43.0 |
| Time-R1 (ours) | RL | 82.8 | 72.2 | 50.1 |
- ActivityNet (FineTune)
| Method | Type | R1@0.3 | R1@0.5 | R1@0.7 |
|---|---|---|---|---|
| SSRN | VLP | - | 54.5 | 33.2 |
| SnAG | VLP | - | 48.6 | 30.6 |
| EaTR | VLP | - | 58.2 | 37.6 |
| HawkEye | SFT | - | 37.7 | 24.0 |
| TRACE | SFT | - | 37.7 | 24.0 |
| Time-R1 (ours) | RL | 73.3 | 55.6 | 34.0 |
Comparison between post-training paradigms across various tasks, including temporal video grounding, short video QA, and long video QA. Both “SFT” and “RL” fully fine-tune the LLM, while “SFT-LoRA” denotes fine-tuning the LLM with LoRA. “Base” is Qwen2.5-VL-7B.
We thank the following projects: TRACE, R1-V, Qwen2.5-VL, TRL, vLLM
If you find our work useful, please consider citing our paper :)
```bibtex
@article{wang2025timer1,
  title={Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding},
  author={Wang, Ye and Wang, Ziheng and Xu, Boshen and Du, Yang and Lin, Kejun and Xiao, Zihan and Yue, Zihao and Ju, Jianzhong and Zhang, Liang and Yang, Dingyi and Fang, Xiangnan and He, Zewen and Luo, Zhenbo and Wang, Wenxuan and Lin, Junqi and Luan, Jian and Jin, Qin},
  journal={arXiv preprint arXiv:2503.13377},
  year={2025},
}

@article{wang2025timezero,
  title={TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM},
  author={Wang, Ye and Xu, Boshen and Yue, Zihao and Xiao, Zihan and Wang, Ziheng and Zhang, Liang and Yang, Dingyi and Wang, Wenxuan and Jin, Qin},
  journal={arXiv preprint arXiv:2503.13377},
  year={2025}
}
```
