Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet it remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from an image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step \(t\) on the model’s prediction at \(t-1\). To exploit the temporal redundancy of videos, we propose a new I2I diffusion forward-process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this the Residual Flow Diffusion Model (RFDM); it focuses the denoising process on the changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods on editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length.
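For intuition, one possible instantiation of this residual formulation (a hedged sketch in our own notation, not necessarily the exact objective used by RFDM) is a rectified-flow-style interpolation over the frame residual. Writing \(y_t\) for the target edited frame, \(\hat{y}_{t-1}\) for the previous prediction, \(c\) for the text prompt, and \(\tau\) for the diffusion time,

\[
r_t = y_t - \hat{y}_{t-1}, \qquad x_\tau = (1-\tau)\, r_t + \tau\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\]

and the network \(v_\theta(x_\tau, \tau, \hat{y}_{t-1}, c)\) regresses the velocity \(\epsilon - r_t\). When consecutive frames are similar, \(r_t\) is small, so most of the denoising effort is spent on what actually changes between frames.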
The RFDM training and inference procedures are illustrated in Figures 1 and 2, respectively. We use the inference algorithm from Figure 2 to generate all visualizations. For RFDM, we set the number of denoising steps to \(S = 8\) and, for classifier-free guidance (CFG), use an image guidance scale of \(3.5\) and a text guidance scale of \(1.5\). For other methods, we use the configurations recommended in their repositories or the settings that perform best. RFDM can generate sequences of up to 60 frames with a key-frame update interval of \(\Delta = 3\), as shown in Table 2d of the ablation studies. For longer generations, such as 120 frames, \(\Delta\) is increased to 6.
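To make the per-frame procedure concrete, the sketch below runs frame-by-frame sampling with two-scale classifier-free guidance and a key-frame update interval \(\Delta\). The model interface, the channel-stacked image condition, the Euler integration of a sampling ODE, and the InstructPix2Pix-style CFG combination are all assumptions for illustration rather than the exact RFDM implementation.

```python
import torch

S, IMG_CFG, TXT_CFG, DELTA = 8, 3.5, 1.5, 3  # settings reported above


def guided_velocity(model, x, tau, cond_img, text_emb, null_text, null_img):
    # Two-scale CFG in the InstructPix2Pix convention (an assumption here):
    # unconditional, image-conditioned, and image+text-conditioned branches.
    v_uncond = model(x, tau, null_img, null_text)
    v_img = model(x, tau, cond_img, null_text)
    v_full = model(x, tau, cond_img, text_emb)
    return v_uncond + IMG_CFG * (v_img - v_uncond) + TXT_CFG * (v_full - v_img)


@torch.no_grad()
def edit_video(model, src_frames, text_emb, null_text):
    # model(x, tau, cond_img, text_emb) is a hypothetical denoiser returning a
    # velocity; cond_img stacks the current source frame with the key frame
    # (the most recent retained prediction) along the channel axis.
    c, h, w = src_frames[0].shape
    null_img = torch.zeros(2 * c, h, w)
    key_frame = src_frames[0]       # before any prediction exists, condition on the input
    edited = []
    for i, src in enumerate(src_frames):
        x = torch.randn_like(src)                  # start each frame from noise
        for s in range(S):                         # S Euler steps of the sampling ODE
            tau = 1.0 - s / S
            cond_img = torch.cat([src, key_frame], dim=0)
            v = guided_velocity(model, x, tau, cond_img, text_emb, null_text, null_img)
            x = x - v / S
        edited.append(x)
        if (i + 1) % DELTA == 0:                   # refresh the key frame every Δ frames
            key_frame = x
    return edited
```

Refreshing the conditioning key frame only every \(\Delta\) frames is the knob that trades temporal consistency against error accumulation in the ablation discussed below.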
Below we show qualitative results for global style transfer, object removal, and local style transfer. We train our model on the Señorita train split and evaluate on the Señorita test split or DAVIS videos. The prompts are taken from the Señorita test split.
"Turn it into Anime style"
"Turn it into Chinese Ink style"
"Turn it into Pixel art style"
"Turn it into Noir style"
"Turn it into Anime style"
"Turn it into Noir style"
"Turn it into Chinese Ink style"
"Turn it into Noir style"
"Turn it into Doodle style"
"Turn it into Van Gogh style"
"Turn it into Minimalist style"
"Turn it into Rivera style"
"Turn it into Fine art style"
"Turn it into Abstract art style"
"Turn it into Fine art style"
"Design it interior"
"Turn it into Minimalist-warm style"
"Turn it into Ominous style"
"Turn it into Chinese Ink style"
"Turn it into Anime style"
"Turn it into Chinese Ink style"
Removing objects while preserving background structure and temporal coherence.
"Remove the person"
"Remove the boat"
"Remove the person"
"Remove the person"
"Remove the bear"
"Remove the car"
"Remove the surfer"
"Remove the dog"
"Amputate the hand"
"Eliminate the ocean"
"Eliminate the person"
"Eliminate the river"
"Eliminate the woman"
"Delete the artwork"
"Remove the roundabout"
"Eliminate the person"
"Eliminate the sea"
"Eliminate the swan"
Editing only specific regions while keeping the rest of the video unchanged.
"Make the car blue"
"Make the road snowy"
"Make the bear white"
"Make the hat blue"
"Turn water a vibrant yellow"
"Make the grass a soft green"
"Paint the tv remote control indigo blue"
"Paint the person deep purple"
"Paint the pool a deep blue and render it smooth"
"Paint the tray Rubey red"
"Paint the fishing rod dark"
"Paint the sunset a vibrant_orange"
"Make the cake delightfully moist and crumbly"
"Turn the beer golden"
"Make the goose a soft white and downy"
"Color the christmas cake a dark green"
Here we illustrate what the error-accumulation and temporal-consistency metrics in our ablation studies measure. As shown in the first row of the figure below, when \(\Delta = 0\) error accumulation stays low, meaning that the distribution of the final frames remains close to that of the initial ones. However, temporal consistency starts to degrade (indicated by higher scores), since the distance between the frame used as the input condition and the current frame grows. When \(\Delta = 1\), in the second row, temporal consistency remains consistently good; however, error accumulation rises such that the final generated frames have roughly 10× larger error than the initial ones (0.20 vs. 0.02). The sweet spot is \(\Delta = 3\), shown in the third row, where both temporal consistency and error accumulation stay in a good range, as also reflected in the videos. Please also note the magnitude of the numbers when reading the plots.
Error Accumulation for all key-frame update intervals \(\Delta\) in one plot
Temporal Consistency for all key-frame update intervals \(\Delta\) in one plot
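For readers who want to reproduce curves of this kind, the snippet below computes simple per-frame proxies: drift of each edited frame relative to a reference (error accumulation) and the distance between consecutive edited frames (temporal consistency). These are illustrative proxies only; the features, metrics, and normalization behind the plots in the paper may differ.

```python
import torch

def ablation_curves(edited_frames, reference_frames):
    """Illustrative per-frame proxies, not the paper's exact metrics.

    edited_frames:    list of edited frames, each a (C, H, W) tensor in [0, 1]
    reference_frames: frames to measure drift against (e.g. ground-truth edits),
                      same length and shape as edited_frames
    Lower is better for both curves.
    """
    error_accum = [torch.mean((e - r) ** 2).item()
                   for e, r in zip(edited_frames, reference_frames)]
    temporal_consistency = [torch.mean((edited_frames[i] - edited_frames[i - 1]) ** 2).item()
                            for i in range(1, len(edited_frames))]
    return error_accum, temporal_consistency
```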
"Color the pancake stack a lavender"
"Make the chandelier a striking crystal and bronze combination"
"Make the ocean a dull_green"
"Paint the gas station a rusty metal and faded neon"
"Turn the egg Aqua Blue"
"Turn trees dark and gnarled"
"Discard the flower"
"Eliminate the sunset"
"Remove the pine tree"
"Eliminate the yellow flower"
In this section, we present the failure cases of our method.
In the candles painting example, when a new object part suddenly appears, RFDM may fail to color that region consistently. This happens because the model must infer that the newly revealed part belongs to the same object that is only partially colored, a highly challenging task.
In the truck removal example, the truck and the building are removed together in the first frame and remain absent in subsequent frames. This results from SD1.5’s limited prompt understanding: the model struggles to distinguish the truck from the background.
Similarly, in the plant removal and woman removal cases, the model fails to distinguish the objects properly, leading to incorrect or incomplete removal. Note that in these examples, the issue is primarily spatial understanding rather than temporal.
The same issue appears in the bridge painting and sink painting cases, where the model either applies the wrong color or fails to localize the target object accurately.
In the jellyfish painting example, the object deforms rapidly, causing the model to lose track of it early. Since RFDM relies only on the previous frame (short-term memory), the object cannot be recovered once it is missed.
"Dye candles indigo"
"Eliminate the beach"
"Eliminate the truck"
"Eliminate the woman"
"git rid of the plant"
"Paint the bridge a salmon color"
"Paint the sink Forest Green"
"Turn the jellyfish a gelatinous purple"