Go-with-the-Flow is an easy and efficient way to control the motion patterns of video diffusion models. It lets you decide how the camera and objects in a scene will move, and can even transfer motion patterns from one video to another.
We simply fine-tune a base model - requiring no changes to the original pipeline or architecture - with one exception: instead of pure i.i.d. Gaussian noise, we use warped noise. Diffusion inference has exactly the same computational cost as running the base model, down to the last byte.
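To make this concrete, here is a minimal sketch of the inference-time difference. The latent shape, the `latents=` keyword, and the `pipe` handle are assumptions about a generic diffusers-style video pipeline, not the exact interface of our released code:

```python
import torch

# Conceptual sketch: the initial latent is the ONLY thing that changes,
# so per-step compute and memory are identical to the base model.
latent_shape = (1, 49, 16, 60, 90)        # (batch, frames, channels, height, width), assumed layout

iid_noise = torch.randn(latent_shape)     # what the base model normally starts from
warped_noise = torch.randn(latent_shape)  # stand-in: in practice this is i.i.d. noise
                                          # warped along the optical flow of the motion signal

# Base model:        video = pipe(image=img, prompt=txt, latents=iid_noise)
# Go-with-the-Flow:  video = pipe(image=img, prompt=txt, latents=warped_noise)
```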
The strength of our motion prior can be modulated at inference time via a process we call "noise degradation", allowing varying degrees of control. In addition to image-to-video models, we show this prior is strong enough to work with text-to-video models as well, deriving 3D scenes from motion information alone!
Our code and models are open source! If you create something cool with our model - and want to share it on our website - email rburgert@cs.stonybrook.edu. We will be creating a user-generated content section, starting with whoever submits the first video! Your name could be on our webpage.
Go-with-the-Flow allows for several types of motion control, including cut-and-drag animations, shown below. Here, the user provides a crude segmentation of the object as a motion signal and uses the initial frame as a control. The goal is to generate a coherent video that aligns with the motion indicated by the user’s drag.
We introduce a way to generate warped Gaussian noise very quickly, which is important because we need to do this for millions of videos.
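For intuition, here is a deliberately naive noise-warping sketch in PyTorch: it advects the previous frame's noise along the optical flow and then re-whitens it. This is only an illustrative baseline, not our algorithm, which preserves the per-frame Gaussian distribution exactly and is fast enough to process millions of videos; all names and tensor conventions below are our own.

```python
import torch
import torch.nn.functional as F

def naive_warped_noise(flow: torch.Tensor, generator=None) -> torch.Tensor:
    """Toy illustration of flow-following noise, NOT the paper's algorithm.

    flow: (T, 2, H, W) forward optical flow in pixels, frame t -> t+1.
    Returns (T+1, H, W) noise where each frame roughly follows the flow:
    we simply advect the previous frame's noise and re-whiten it.
    """
    flow = flow.float()
    T, _, H, W = flow.shape
    noise = [torch.randn(H, W, generator=generator)]
    # Base sampling grid in normalized [-1, 1] coordinates for grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base_grid = torch.stack([xs, ys], dim=-1)              # (H, W, 2), (x, y) order
    for t in range(T):
        # Backward-warp: sample frame t's noise at locations displaced by -flow.
        disp = flow[t]                                      # (2, H, W), (dx, dy) in pixels
        grid = base_grid.clone()
        grid[..., 0] -= disp[0] * (2.0 / max(W - 1, 1))     # pixels -> normalized coords
        grid[..., 1] -= disp[1] * (2.0 / max(H - 1, 1))
        prev = noise[-1][None, None]                        # (1, 1, H, W)
        warped = F.grid_sample(
            prev, grid[None], mode="nearest",
            padding_mode="reflection", align_corners=True,
        )[0, 0]
        # Re-whiten so the frame is again (approximately) unit Gaussian.
        warped = (warped - warped.mean()) / warped.std().clamp_min(1e-6)
        noise.append(warped)
    return torch.stack(noise)
```

In a real pipeline this would run per latent channel (and per batch element), with the flow resized to the latent resolution.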
A visualization comparing the original video, the optical flow, and the warped noise
generated by various methods.
Can you see Rick dancing in the noise? Try pausing the video at any frame - he will disappear!
Go-with-the-Flow can exert different motion control strengths by degrading the warped noise by different amounts. (Videos might take a while to load.)
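A minimal sketch of what "degrading" warped noise can look like: blend it with fresh i.i.d. noise using variance-preserving weights, so the result stays unit-variance Gaussian. The function name and the exact weighting schedule here are our illustration and may differ from the schedule used in the paper.

```python
import torch

def degrade_noise(warped: torch.Tensor, gamma: float, generator=None) -> torch.Tensor:
    """Blend warped noise with fresh i.i.d. noise (illustrative parameterization).

    gamma = 0.0 -> keep the warped noise (strongest motion prior)
    gamma = 1.0 -> pure i.i.d. noise (no motion prior; base-model behaviour)
    The sqrt weights keep the result unit-variance Gaussian, since the two
    sources are independent.
    """
    fresh = torch.randn(warped.shape, generator=generator,
                        dtype=warped.dtype, device=warped.device)
    return (1.0 - gamma) ** 0.5 * warped + gamma ** 0.5 * fresh
```

Under this parameterization, a gamma near 0 copies the source motion almost exactly, while larger gamma values give the model progressively more freedom to deviate from it.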
A video-editing task where the user starts with an original video and an edited version of its initial frame. The goal is to propagate the edits made to the first frame seamlessly throughout the entire video by copying the original video's motion. For instance, a user might add an object to the initial frame of the original video and expect the model to generate a coherent video that consistently incorporates the added object.
A 3D-rendered turntable camera motion is used as the motion signal for the T2V model. Compared to the baseline MotionClone, our model generates scenes with significantly better 3D consistency and faithfully adheres to the provided turntable camera motion.
Below we show results on DAVIS for T2V models, where the original video is used as the motion signal, and a different target prompt is provided. The task requires the video model to generate a video that aligns with the target prompt while preserving the motion from the original video.
We apply our Image-to-Video model to a sequence of frames warped using monocular depth estimation, enabling consistent 3D scene generation from a single image. Using input videos from WonderJourney's website, we remake them with Go-with-the-Flow into smoother, more coherent videos.
Results on novel camera motion control, featuring synthetic camera movements specified by the user. We estimate a depth map from the single input image using a monocular depth estimator, and use that depth map to warp the input image along a camera path. Our method effectively transforms the depth-warped input animation into a 3D-consistent video with plausible illumination and view-dependent effects.
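For readers who want to see the geometry behind the depth-warping step, here is a hypothetical sketch: unproject pixels with the estimated depth, move them with a rigid camera transform, and re-project them into the new view, leaving disocclusion holes for the video model to fill. The function name, intrinsics handling, and lack of z-buffering are simplifications of ours, not taken from our released code.

```python
import numpy as np

def reproject_with_depth(image: np.ndarray, depth: np.ndarray,
                         K: np.ndarray, T_tgt_from_src: np.ndarray) -> np.ndarray:
    """Forward-warp `image` (H, W, 3) into a new camera using per-pixel `depth` (H, W).

    K: 3x3 intrinsics shared by both views; T_tgt_from_src: 4x4 rigid transform
    mapping source-camera coordinates to target-camera coordinates. Holes from
    disocclusions are left black; the video model is what fills them in.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to 3D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)

    # Move the points into the target camera frame.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_tgt = (T_tgt_from_src @ pts_h)[:3]

    # Project into the target image and splat nearest-pixel colours.
    proj = K @ pts_tgt
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    valid = (proj[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    out = np.zeros_like(image)
    colors = image.reshape(-1, 3)
    # Note: no z-buffering here; a fuller implementation should keep the nearest point.
    out[v[valid], u[valid]] = colors[valid]
    return out
```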
The original How-I-Warped-Your-Noise paper showed that image diffusion models can yield temporally consistent results when their input noise is warped to follow the optical flow of the input video.
Here, we show various noise warping and interpolation techniques side-by-side,
for both relighting using DiffRelight and super-resolution using DeepFloyd Stage II.
What's cool about using noise warping on image-to-image translation: it's a training-free way to get better temporal consistency from image diffusion models that were never trained on videos!