Ryan Burgert1,3
Yuancheng Xu1,4
Wenqi Xian1
Oliver Pilarski1
Pascal Clausen1
Mingming He1
Li Ma1
Yitong Deng2,5
Lingxiao Li2
Mohsen Mousavi1
Michael Ryoo3
Paul Debevec1
Ning Yu1

1Netflix Eyeline Studios
2Netflix
3Stony Brook University
4University of Maryland
5Stanford University
Project Lead


Abstract

Go-with-the-Flow is an easy and efficient way to control the motion patterns of video diffusion models. It lets you decide how the camera and objects in a scene will move, and can even transfer motion patterns from one video to another.

We simply fine-tune a base model, with no changes to the original pipeline or architecture except one: instead of pure i.i.d. Gaussian noise, we use warped noise. Diffusion inference has exactly the same computational cost as running the base model, down to the last byte.

The strength of our motion prior can be modulated at inference time via a process we call "noise degradation", allowing varying degrees of control. In addition to image-to-video models, we show this prior is strong enough to work with text-to-video as well, deriving 3D scenes from motion information alone!

Our code and models are open source! If you create something cool with our model and want to share it on our website, email rburgert@cs.stonybrook.edu. We will be creating a user-generated content section, starting with whoever submits the first video! Your name could be on our webpage.

Table of Contents

There are many applications for Go-with-the-Flow, for both image-to-video (I2V) and text-to-video (T2V). Check them out via the links below!

Cut-and-drag Animation (I2V)

Go-with-the-Flow allows for several types of motion control, including cut-and-drag animations, shown below. Here, the user provides a crude segmentation of the object as a motion signal and uses the initial frame as a control. The goal is to generate a coherent video that aligns with the motion indicated by the user's drag.
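To give a sense of how a drag on a segmented region can be turned into a dense motion signal, here is a minimal sketch (our illustration, not the exact pipeline behind these results): the object mask is translated a little further each frame, producing a per-frame flow field that stays zero on the background.

```python
import numpy as np

def cut_and_drag_flow(mask, drag, num_frames):
    """Dense per-frame flow that drags a masked region across the clip.

    mask:       (H, W) boolean segmentation of the object to drag.
    drag:       (dx, dy) total displacement in pixels over the whole clip.
    num_frames: number of frames to spread the drag over.
    Returns flow of shape (num_frames, H, W, 2); background flow stays zero.
    """
    h, w = mask.shape
    flow = np.zeros((num_frames, h, w, 2), dtype=np.float32)
    for t in range(num_frames):
        frac = t / max(num_frames - 1, 1)   # 0 at the first frame, 1 at the last
        flow[t, mask, 0] = drag[0] * frac   # x displacement inside the mask
        flow[t, mask, 1] = drag[1] * frac   # y displacement inside the mask
    return flow

# Example: drag a 64x64 cutout 40 px right and 10 px down over 49 frames.
mask = np.zeros((480, 720), dtype=bool)
mask[200:264, 100:164] = True
flow = cut_and_drag_flow(mask, drag=(40, 10), num_frames=49)
```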

View More Video Results

Warped Noise Visualization

We introduce a way to generate Gaussian warped noise very quickly, which is important because we need to do it for millions of videos.
A visualization comparing the original video, the optical flow, and the warped noise generated by various methods.
Can you see Rick dancing in the noise? Try pausing the video at any frame: he will disappear!
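For intuition only, the snippet below pushes a noise image along a flow field with plain bilinear resampling. This is not the distribution-preserving warping algorithm we actually use: bilinear interpolation correlates and shrinks the noise, and the crude per-frame rescaling at the end only partially hides that.

```python
import torch
import torch.nn.functional as F

def naively_warp_noise(noise, flow):
    """Crudely advect a noise image along an optical-flow field.

    noise: (C, H, W) Gaussian noise from the previous frame.
    flow:  (H, W, 2) backward flow in pixels (where each pixel samples from).
    """
    _, h, w = noise.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Build a sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    grid_x = (xs + flow[..., 0]) / (w - 1) * 2 - 1
    grid_y = (ys + flow[..., 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)[None].float()

    warped = F.grid_sample(noise[None], grid, mode="bilinear",
                           padding_mode="reflection", align_corners=True)[0]
    # Bilinear resampling shrinks the variance, so rescale back to unit std.
    return warped / warped.std()
```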

Degradation Levels

Go-with-the-Flow can exert different motion control strengths by degrading the warped noise by different amounts. (Videos might take a while to load.)
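One simple way to realize such degradation is to blend the warped noise with fresh i.i.d. Gaussian noise; the square-root weights below keep the result at unit variance when the two sources are independent. The parameter name is ours for illustration; see the released code for the exact formulation.

```python
import torch

def degrade_noise(warped, strength):
    """Blend warped noise with fresh i.i.d. noise.

    strength = 0.0 keeps the warped noise unchanged (strongest motion prior);
    strength = 1.0 returns pure i.i.d. noise (no motion prior at all).
    The sqrt weights preserve unit variance for independent Gaussian inputs.
    """
    fresh = torch.randn_like(warped)
    return (1.0 - strength) ** 0.5 * warped + strength ** 0.5 * fresh
```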

First Frame Editing (I2V)

A video-editing task where the user starts with an original video and an edited version of its initial frame. The goal is to propagate the edits made to the first frame seamlessly throughout the entire video by copying the original video's motion. For instance, a user might add an object to the initial frame of the original video and expect the model to generate a coherent video that consistently incorporates the added object.
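Conceptually the recipe is: estimate optical flow on the original video, warp noise along that flow, then run the I2V model from the edited first frame with that noise. The sketch below uses hypothetical helper names (estimate_flow, warp_noise, i2v_sample) as stand-ins for the actual entry points in our released code.

```python
# Hypothetical helpers standing in for the real entry points in the released code.
flow = estimate_flow(original_video)        # per-frame optical flow of the source clip
noise = warp_noise(flow, seed=0)            # flow-following Gaussian noise
edited_video = i2v_sample(
    first_frame=edited_first_frame,         # user-edited initial frame
    init_noise=noise,                       # carries the original video's motion
    prompt="same scene, with the added object",
)
```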

Turntable Camera Motion Transfer

A 3D-rendered turntable camera motion is used as the motion signal for the T2V model. Compared to the baseline MotionClone, our model generates scenes with significantly better 3D consistency and faithfully adheres to the provided turntable camera motion.
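For readers who want to reproduce the motion signal: a turntable path is just a camera orbiting the scene center at a fixed radius and height while looking inward. Below is a minimal sketch of generating such world-to-camera matrices (our illustration; the turntable renders on this page come from a 3D pipeline).

```python
import numpy as np

def turntable_extrinsics(num_frames, radius=3.0, height=0.5):
    """World-to-camera look-at matrices for one full orbit around the origin."""
    poses, up = [], np.array([0.0, 1.0, 0.0])
    for t in range(num_frames):
        theta = 2 * np.pi * t / num_frames
        eye = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = -eye / np.linalg.norm(eye)          # look at the origin
        right = np.cross(forward, up)
        right /= np.linalg.norm(right)
        cam_up = np.cross(right, forward)
        rot = np.stack([right, cam_up, -forward])     # rows = camera axes in world space
        pose = np.eye(4)
        pose[:3, :3] = rot
        pose[:3, 3] = -rot @ eye                      # world-to-camera translation
        poses.append(pose)
    return poses
```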

View More Video Results

DAVIS Motion Transfer (T2V)

Below we show results on DAVIS for T2V models, where the original video is used as the motion signal, and a different target prompt is provided. The task requires the video model to generate a video that aligns with the target prompt while preserving the motion from the original video.

Motion Control: WonderJourney (I2V)

We apply our Image-to-Video model to a sequence of frames warped using monocular depth estimation, enabling consistent 3D scene generation from a single image. We take input videos from WonderJourney's website and remake them with Go-with-the-Flow into smoother, more coherent videos.

Camera Control: Depth Warping (I2V)

Results on novel camera motion control, featuring synthetic camera movements specified by the user. We estimate a depth map from the single input image using a monocular depth estimator and use it to warp the input image along a camera path. Our method effectively transforms the depth-warped input animation into a 3D-consistent video with plausible illumination and view-dependent effects.
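For intuition, depth-based warping amounts to unprojecting every pixel with its depth, applying the relative camera motion, and reprojecting. Below is a minimal pinhole-camera sketch; this is an assumed form of the operation, not our exact implementation (which also has to handle occlusions and holes).

```python
import numpy as np

def depth_warp_coords(depth, K, R, t):
    """Target-view pixel coordinates of every source pixel after a camera move.

    depth: (H, W) depth of the source view.
    K:     (3, 3) pinhole intrinsics.
    R, t:  rotation (3, 3) and translation (3,) from source to target camera.
    Returns (H, W, 2) coordinates, usable for forward-warping the source image.
    """
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x HW
    rays = np.linalg.inv(K) @ pix                  # unproject pixels to camera rays
    pts = rays * depth.reshape(1, -1)              # 3D points in the source camera frame
    pts = R @ pts + t[:, None]                     # move into the target camera frame
    proj = K @ pts
    uv = (proj[:2] / proj[2:]).T.reshape(h, w, 2)  # perspective divide
    return uv
```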

Image-to-Image Based Video Generation Applications: Relighting and Super-Resolution

The original How I Warped Your Noise paper showed that image diffusion models can yield temporally consistent results when the sampling noise is warped to follow the optical flow of the input video.
Here, we show various noise warping and interpolation techniques side-by-side, for both relighting using DiffRelight and super-resolution using DeepFloyd Stage II.
What's cool about using noise warping on image-to-image translation: it's a training-free way to get better temporal consistency from image diffusion models that were never trained on videos!
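In pseudocode, the recipe is: warp one noise tensor along the input video's optical flow so that consecutive frames receive correlated noise, then run the image model independently per frame with that frame's noise. The loop below reuses the naive warping sketch from above, with a hypothetical img2img call standing in for whichever image diffusion backend (relighting, super-resolution, etc.) you plug in.

```python
import torch

# `img2img` is a hypothetical stand-in for an image diffusion backend that
# accepts its initial noise explicitly; `video_frames` and `optical_flows`
# come from the input video and an optical-flow estimator, with the flow
# resized to the same resolution as the noise.
noise = torch.randn(channels, height, width)
outputs = []
for frame, flow in zip(video_frames, optical_flows):
    noise = naively_warp_noise(noise, flow)   # keep the noise aligned with the motion
    outputs.append(img2img(frame, prompt, init_noise=noise))
```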