Video Generation Models are Good Latent Reward Models

Xiaoyue Mi1,†, Wenqing Yu2,*, Jiesong Lian3, Shibo Jie4, Ruizhe Zhong5, Zijun Liu6, Guozhen Zhang7, Zixiang Zhou2, Zhiyong Xu2, Yuan Zhou2,‡, Qinglin Lu2, Fan Tang1,§
1University of Chinese Academy of Sciences    2Tencent Hunyuan    3Huazhong University of Science and Technology
4Peking University    5Shanghai Jiao Tong University    6Tsinghua University    7Nanjing University
†Work done during internship at Tencent Hunyuan    *Equal contribution    ‡Project leader    §Corresponding author
mxysdu@gmail.com, tfan.108@gmail.com

📝 Overview

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, so ReFL optimization is confined to the near-complete denoising steps and requires computationally expensive VAE decoding.

In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding.

Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
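
The idea of scoring noisy latents directly can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "denoiser" `theta`, the stand-in `latent_reward`, the `TARGET` value, the progress variable `t` (0 for pure noise, 1 for clean), and the choice to stop at a random step and differentiate only through the last step are all assumptions made for illustration. It only shows that, when a reward can be evaluated on a partially denoised latent, reward feedback can be applied at any denoising step with no decoding to pixels.

```python
import random

TARGET = 2.0  # hypothetical "preferred" latent mean at the end of denoising


def denoise_step(x, theta, dt):
    """One Euler step of a toy denoiser whose velocity is the scalar theta."""
    return [xi + dt * theta for xi in x]


def latent_reward(x, t):
    """Stand-in latent reward for a partially denoised latent at progress t.

    It expects the latent mean to have moved a fraction t of the way to TARGET.
    """
    mean = sum(x) / len(x)
    return -(mean - t * TARGET) ** 2


def reward_feedback_update(theta, x_start, num_steps, lr, rng):
    """One update using reward feedback at a randomly chosen denoising step."""
    dt = 1.0 / num_steps
    stop = rng.randint(1, num_steps)  # may be an intermediate or the final step
    x = list(x_start)
    # Roll out the earlier steps; they are treated as constants (no gradient).
    for _ in range(stop - 1):
        x = denoise_step(x, theta, dt)
    # Differentiate only through the last step: x_next = x + dt * theta.
    x_next = denoise_step(x, theta, dt)
    t = stop * dt
    mean_next = sum(x_next) / len(x_next)
    # Chain rule: d reward / d mean = -2 (mean - t * TARGET); d mean / d theta = dt.
    grad = -2.0 * (mean_next - t * TARGET) * dt
    return theta + lr * grad  # gradient ascent on the reward


def train(num_updates=400, num_steps=4, lr=0.5, seed=0):
    """Repeatedly apply reward feedback, starting from theta = 0."""
    rng = random.Random(seed)
    x_start = [-1.0, 0.0, 1.0]  # fixed zero-mean "noise" latent
    theta = 0.0
    for _ in range(num_updates):
        theta = reward_feedback_update(theta, x_start, num_steps, lr, rng)
    return theta
```

In this toy setting, `train()` drives `theta` toward `TARGET` even though most updates only see an intermediate latent. A pixel-space reward would instead need the rollout to reach the final step and pass through a decoder before any score could be computed.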

PRFL Pipeline

📊 Method Comparison

Prompt: A woman in a flowing white dress is dancing gracefully in a modern dance studio. Her movements are fluid and expressive, with arms sweeping widely and legs moving in elegant, rhythmic patterns. She has long wavy hair that flows freely with each movement, catching the soft lighting from above. The background is a minimalist setup with black walls and a few abstract paintings hanging on them. The camera follows her from a medium shot, capturing her full body as she dances, then moves to a close-up of her face, highlighting her joyful expression and the sparkle in her eyes. The video has smooth transitions and dynamic camera movements, including tracking shots and slow-motion sequences to emphasize her graceful movements.
Methods compared: PRFL (Ours), Pretrain, SFT, RWR, RGB ReFL

🎨 Additional Results

T2V • 480P
Two shirtless men with short dark hair are sparring in a dimly lit room. They are both wearing boxing gloves, one red and one black. One man is wearing white shorts while the other is wearing black shorts. There are several screens on the wall displaying images of buildings and people.
T2V • 480P
The woman has dark eyes and is holding a black smartphone to her ear with her right hand. She is typing on the keyboard of an open silver laptop computer with her left hand. Her fingers have blue nail polish. She is sitting in front of a window covered by sheer white curtains.
T2V • 720P
A woman with fair skin, dark hair tied back, and wearing a light green t-shirt is visible against a gray background. She uses both hands to apply a white substance from below her eyes upward onto her face. Her mouth is slightly open as she spreads the cream.
T2V • 720P
A light-skinned man with short hair wearing a yellow baseball cap, plaid shirt, and blue overalls stands in a field of sunflowers. He holds a cut sunflower head in his left hand and touches it with his right index finger. Several other sunflowers are visible in the background, some facing away from the camera.
Reference frame
I2V • 480P
a monochromatic video capturing a cat's gaze into the camera
Reference frame
I2V • 480P
A family of four eats fast food at a table.
Reference frame
I2V • 720P
a young boy is jumping in the mud
Reference frame
I2V • 720P
Normal speed, Medium shot shot, Eye level angle, Third person viewpoint, Static camera movement, Frame-within-frame composition, Shallow depth of field, Natural light light, Cinematic style, Desaturated palette with slate blue, dusty rose, and dark wood tones color palette, Dramatic atmosphere, The scene is set on a patio or veranda, framed by a stone archway. In the back, there is a large, weathered wooden gate set into a stone wall. background, IP. In a static outdoor shot, six people are gathered on a stone patio in front of a large wooden gate. On the right, two men are seated at a dark wooden table. An older man in a grey traditional jacket holds a cane and gestures with his right hand while speaking. A younger man in a light grey suit sits beside him, listening. On the left side of the frame, a man in a dark suit stands with his back to the camera. Next to him, a woman in a pink patterned cheongsam and a woman in a grey skirt suit are standing close together, whispering. The women then turn and smile towards the men at the table. The man in the dark suit turns to face the group, revealing a newborn baby cradled in his arms, wrapped in a pink blanket. He takes a few steps forward, holding the baby. The women look at him and the infant. The older man at the table continues to talk, now gesturing towards the man with the baby. The man holding the baby looks down at the infant as he continues to walk slowly. The table is set with white cups, plates, fruit, and a dark wooden box.
T2V • 720P
A woman with short gray hair wearing glasses and headphones plays a small keyboard.
T2V • 720P
Four people hold sparklers by a body of water at sunset.
Reference frame
I2V • 720P
a man standing on top of a sand dune in the desert
Reference frame
I2V • 720P
A Caucasian woman with short, curly pink hair wearing a pink striped robe over pink athletic wear dances in a kitchen. She is wearing headphones and holding a smartphone connected by a white cable. The kitchen has white cabinets, a wooden countertop, and white tile backsplash. There are apples on the counter near an electric kettle.

📚 Cite Our Work

@article{mi2025video,
  title={Video Generation Models are Good Latent Reward Models},
  author={Mi, Xiaoyue and Yu, Wenqing and Lian, Jiesong and Jie, Shibo and Zhong, Ruizhe and Liu, Zijun and Zhang, Guozhen and Zhou, Zixiang and Xu, Zhiyong and Zhou, Yuan and Lu, Qinglin and Tang, Fan},
  journal={arXiv preprint},
  year={2025}
}