Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet it remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from an image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step \(t\) on the model’s prediction at \(t-1\). To exploit the temporal redundancy of videos, we propose a new I2I diffusion forward-process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this the Residual Flow Diffusion Model (RFDM); it focuses the denoising process on the changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods on editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length.
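For intuition, one possible instantiation of this residual formulation (a hedged sketch in our own notation, not necessarily the exact objective used by RFDM) is a rectified-flow-style interpolation over the frame residual. Writing \(y_t\) for the target edited frame, \(\hat{y}_{t-1}\) for the previous prediction, \(c\) for the text prompt, and \(\tau\) for the diffusion time,

\[
r_t = y_t - \hat{y}_{t-1}, \qquad x_\tau = (1-\tau)\, r_t + \tau\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\]

and the network \(v_\theta(x_\tau, \tau, \hat{y}_{t-1}, c)\) regresses the velocity \(\epsilon - r_t\). When consecutive frames are similar, \(r_t\) is small, so most of the denoising effort is spent on what actually changes between frames.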
The RFDM training and inference procedures are illustrated in Figures 1 and 2, respectively. We use the inference algorithm from Figure 2 to generate all visualizations. For RFDM, we set the number of denoising steps to \(S = 8\) and, for classifier-free guidance (CFG), use an image guidance scale of \(3.5\) and a text guidance scale of \(1.5\). For other methods, we use the configurations recommended in their repositories or the settings that perform best. RFDM can generate sequences of up to 60 frames with a key-frame update interval of \(\Delta = 3\), as shown in Table 2d of the ablation studies. For longer generations, such as 120 frames, \(\Delta\) is increased to 6.
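To make the per-frame procedure concrete, the sketch below runs frame-by-frame sampling with two-scale classifier-free guidance and a key-frame update interval \(\Delta\). The model interface, the channel-stacked image condition, the Euler integration of a sampling ODE, and the InstructPix2Pix-style CFG combination are all assumptions for illustration rather than the exact RFDM implementation.

```python
import torch

S, IMG_CFG, TXT_CFG, DELTA = 8, 3.5, 1.5, 3  # settings reported above


def guided_velocity(model, x, tau, cond_img, text_emb, null_text, null_img):
    # Two-scale CFG in the InstructPix2Pix convention (an assumption here):
    # unconditional, image-conditioned, and image+text-conditioned branches.
    v_uncond = model(x, tau, null_img, null_text)
    v_img = model(x, tau, cond_img, null_text)
    v_full = model(x, tau, cond_img, text_emb)
    return v_uncond + IMG_CFG * (v_img - v_uncond) + TXT_CFG * (v_full - v_img)


@torch.no_grad()
def edit_video(model, src_frames, text_emb, null_text):
    # model(x, tau, cond_img, text_emb) is a hypothetical denoiser returning a
    # velocity; cond_img stacks the current source frame with the key frame
    # (the most recent retained prediction) along the channel axis.
    c, h, w = src_frames[0].shape
    null_img = torch.zeros(2 * c, h, w)
    key_frame = src_frames[0]       # before any prediction exists, condition on the input
    edited = []
    for i, src in enumerate(src_frames):
        x = torch.randn_like(src)                  # start each frame from noise
        for s in range(S):                         # S Euler steps of the sampling ODE
            tau = 1.0 - s / S
            cond_img = torch.cat([src, key_frame], dim=0)
            v = guided_velocity(model, x, tau, cond_img, text_emb, null_text, null_img)
            x = x - v / S
        edited.append(x)
        if (i + 1) % DELTA == 0:                   # refresh the key frame every Δ frames
            key_frame = x
    return edited
```

Refreshing the conditioning key frame only every \(\Delta\) frames is the knob that trades temporal consistency against error accumulation in the ablation discussed below.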
Below we show qualitative results for global style transfer, object removal, and local style transfer. We train our model on the Señorita train split and evaluate on the Señorita test split or DAVIS videos. The prompts are taken from the Señorita test split.
"Turn it into Anime style"
"Turn it into Chinese Ink style"
"Turn it into Pixel art style"
"Turn it into Noir style"
"Turn it into Anime style"
"Turn it into Noir style"
"Turn it into Chinese Ink style"
"Turn it into Noir style"
"Turn it into Doodle style"
"Turn it into Van Gogh style"
"Turn it into Minimalist style"
"Turn it into Rivera style"
"Turn it into Fine art style"
"Turn it into Abstract art style"
"Turn it into Fine art style"
"Design it interior"
"Turn it into Minimalist-warm style"
"Turn it into Ominous style"
"Turn it into Chinese Ink style"
"Turn it into Anime style"
"Turn it into Chinese Ink style"
Removing objects while preserving background structure and temporal coherence.
"Remove the person"
"Remove the boat"
"Remove the person"
"Remove the person"
"Remove the bear"
"Remove the car"
"Remove the surfer"
"Remove the dog"
"Amputate the hand"
"Eliminate the ocean"
"Eliminate the person"
"Eliminate the river"
"Eliminate the woman"
"Delete the artwork"
"Remove the roundabout"
"Eliminate the person"
"Eliminate the sea"
"Eliminate the swan"
Editing only specific regions while keeping the rest of the video unchanged.
"Make the car blue"
"Make the road snowy"
"Make the bear white"
"Make the hat blue"
"Turn water a vibrant yellow"
"Make the grass a soft green"
"Paint the tv remote control indigo blue"
"Paint the person deep purple"
"Paint the pool a deep blue and render it smooth"
"Paint the tray Rubey red"
"Paint the fishing rod dark"
"Paint the sunset a vibrant_orange"
"Make the cake delightfully moist and crumbly"
"Turn the beer golden"
"Make the goose a soft white and downy"
"Color the christmas cake a dark green"
Here we illustrate what the error-accumulation and temporal-consistency metrics in our ablation studies measure. As shown in the first row of the figure below, when \(\Delta = 0\) error accumulation stays low, meaning that the distribution of the final frames remains close to that of the initial ones. However, temporal consistency starts to degrade (indicated by higher scores), since the distance between the frame used as the input condition and the current frame grows. When \(\Delta = 1\), in the second row, temporal consistency remains consistently good; however, error accumulation rises such that the final generated frames have roughly 10× larger error than the initial ones (0.20 vs. 0.02). The sweet spot is \(\Delta = 3\), shown in the third row, where both temporal consistency and error accumulation stay in a good range, as also reflected in the videos. Please also note the magnitude of the numbers when reading the plots.
Error Accumulation for all key-frame update intervals \(\Delta\) in one plot
Temporal Consistency for all key-frame update intervals \(\Delta\) in one plot
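For readers who want to reproduce curves of this kind, the snippet below computes simple per-frame proxies: drift of each edited frame relative to a reference (error accumulation) and the distance between consecutive edited frames (temporal consistency). These are illustrative proxies only; the features, metrics, and normalization behind the plots in the paper may differ.

```python
import torch

def ablation_curves(edited_frames, reference_frames):
    """Illustrative per-frame proxies, not the paper's exact metrics.

    edited_frames:    list of edited frames, each a (C, H, W) tensor in [0, 1]
    reference_frames: frames to measure drift against (e.g. ground-truth edits),
                      same length and shape as edited_frames
    Lower is better for both curves.
    """
    error_accum = [torch.mean((e - r) ** 2).item()
                   for e, r in zip(edited_frames, reference_frames)]
    temporal_consistency = [torch.mean((edited_frames[i] - edited_frames[i - 1]) ** 2).item()
                            for i in range(1, len(edited_frames))]
    return error_accum, temporal_consistency
```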
"Color the pancake stack a lavender"
"Make the chandelier a striking crystal and bronze combination"
"Make the ocean a dull_green"
"Paint the gas station a rusty metal and faded neon"
"Turn the egg Aqua Blue"
"Turn trees dark and gnarled"
"Discard the flower"
"Eliminate the sunset"
"Remove the pine tree"
"Eliminate the yellow flower"
In this section, we present the failure cases of our method.
In the candles painting example, when a new object part suddenly appears, RFDM may fail to color that region consistently. This happens because the model must infer that the newly revealed part belongs to the same object that is only partially colored, a highly challenging task.
In the truck removal example, the truck and the building are removed together in the first frame and remain absent in subsequent frames. This results from SD1.5’s limited prompt understanding: the model struggles to distinguish the truck from the background.
Similarly, in the plant removal and woman removal cases, the model fails to distinguish the objects properly, leading to incorrect or incomplete removal. Note that in these examples, the issue is primarily spatial understanding rather than temporal.
The same issue appears in the bridge painting and sink painting cases, where the model either applies the wrong color or fails to localize the target object accurately.
In the jellyfish painting example, the object deforms rapidly, causing the model to lose track of it early. Since RFDM relies only on the previous frame (short-term memory), the object cannot be recovered once it is missed.
"Dye candles indigo"
"Eliminate the beach"
"Eliminate the truck"
"Eliminate the woman"
"git rid of the plant"
"Paint the bridge a salmon color"
"Paint the sink Forest Green"
"Turn the jellyfish a gelatinous purple"