Go-with-the-Flow is an easy and efficient way to control the motion patterns of video diffusion models. It lets you decide how the camera and objects in a scene will move, and can even transfer motion patterns from one video to another.
We simply fine-tune a base model - requiring no changes to the original pipeline or architecture - with one exception: instead of pure i.i.d. Gaussian noise, we use warped noise. Diffusion inference has exactly the same computational cost as running the base model, down to the last byte.
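To make this concrete, here is a minimal sketch of the inference-time difference. The latent shape, the `latents=` keyword, and the `pipe` handle are assumptions about a generic diffusers-style video pipeline, not the exact interface of our released code:

```python
import torch

# Conceptual sketch: the initial latent is the ONLY thing that changes,
# so per-step compute and memory are identical to the base model.
latent_shape = (1, 49, 16, 60, 90)        # (batch, frames, channels, height, width), assumed layout

iid_noise = torch.randn(latent_shape)     # what the base model normally starts from
warped_noise = torch.randn(latent_shape)  # stand-in: in practice this is i.i.d. noise
                                          # warped along the optical flow of the motion signal

# Base model:        video = pipe(image=img, prompt=txt, latents=iid_noise)
# Go-with-the-Flow:  video = pipe(image=img, prompt=txt, latents=warped_noise)
```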
The strength of our motion prior can be modulated at inference time via a process we call "noise degradation", allowing varying degrees of control. In addition to image-to-video models, we show this prior is strong enough to work with text-to-video models as well, deriving 3D scenes from motion information alone!
Our code and models are open source! If you create something cool with our model - and want to share it on our website - email rburgert@cs.stonybrook.edu. We will be creating a user-generated content section, starting with whoever submits the first video! Your name could be on our webpage.
Go-with-the-Flow allows for several types of motion control, including cut-and-drag animations, shown below. Here, the user provides a crude segmentation of the object as a motion signal and uses the initial frame as a control. The goal is to generate a coherent video that aligns with the motion indicated by the user’s drag.
We introduce a way to generate warped Gaussian noise very quickly, which is important because we need to do this for millions of videos.
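For intuition, here is a deliberately naive noise-warping sketch in PyTorch: it advects the previous frame's noise along the optical flow and then re-whitens it. This is only an illustrative baseline, not our algorithm, which preserves the per-frame Gaussian distribution exactly and is fast enough to process millions of videos; all names and tensor conventions below are our own.

```python
import torch
import torch.nn.functional as F

def naive_warped_noise(flow: torch.Tensor, generator=None) -> torch.Tensor:
    """Toy illustration of flow-following noise, NOT the paper's algorithm.

    flow: (T, 2, H, W) forward optical flow in pixels, frame t -> t+1.
    Returns (T+1, H, W) noise where each frame roughly follows the flow:
    we simply advect the previous frame's noise and re-whiten it.
    """
    flow = flow.float()
    T, _, H, W = flow.shape
    noise = [torch.randn(H, W, generator=generator)]
    # Base sampling grid in normalized [-1, 1] coordinates for grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base_grid = torch.stack([xs, ys], dim=-1)              # (H, W, 2), (x, y) order
    for t in range(T):
        # Backward-warp: sample frame t's noise at locations displaced by -flow.
        disp = flow[t]                                      # (2, H, W), (dx, dy) in pixels
        grid = base_grid.clone()
        grid[..., 0] -= disp[0] * (2.0 / max(W - 1, 1))     # pixels -> normalized coords
        grid[..., 1] -= disp[1] * (2.0 / max(H - 1, 1))
        prev = noise[-1][None, None]                        # (1, 1, H, W)
        warped = F.grid_sample(
            prev, grid[None], mode="nearest",
            padding_mode="reflection", align_corners=True,
        )[0, 0]
        # Re-whiten so the frame is again (approximately) unit Gaussian.
        warped = (warped - warped.mean()) / warped.std().clamp_min(1e-6)
        noise.append(warped)
    return torch.stack(noise)
```

In a real pipeline this would run per latent channel (and per batch element), with the flow resized to the latent resolution.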
A visualization comparing the original video, the optical flow, and the warped noise
generated by various methods.
Can you see Rick dancing in the noise? Try pausing the video at any frame - he will disappear!
Go-with-the-Flow can exert different motion control strengths by degrading the warped noise by different amounts. (Videos might take a while to load.)
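A minimal sketch of what "degrading" warped noise can look like: blend it with fresh i.i.d. noise using variance-preserving weights, so the result stays unit-variance Gaussian. The function name and the exact weighting schedule here are our illustration and may differ from the schedule used in the paper.

```python
import torch

def degrade_noise(warped: torch.Tensor, gamma: float, generator=None) -> torch.Tensor:
    """Blend warped noise with fresh i.i.d. noise (illustrative parameterization).

    gamma = 0.0 -> keep the warped noise (strongest motion prior)
    gamma = 1.0 -> pure i.i.d. noise (no motion prior; base-model behaviour)
    The sqrt weights keep the result unit-variance Gaussian, since the two
    sources are independent.
    """
    fresh = torch.randn(warped.shape, generator=generator,
                        dtype=warped.dtype, device=warped.device)
    return (1.0 - gamma) ** 0.5 * warped + gamma ** 0.5 * fresh
```

Under this parameterization, a gamma near 0 copies the source motion almost exactly, while larger gamma values give the model progressively more freedom to deviate from it.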
A video-editing task where the user starts with an original video and an edited version of its initial frame. The goal is to propagate the edits made to the first frame seamlessly throughout the entire video by copying the original video's motion. For instance, a user might add an object to the initial frame of the original video and expect the model to generate a coherent video that consistently incorporates the added object.
A 3D-rendered turntable camera motion is used as the motion signal for the T2V model. Compared to the baseline MotionClone, our model generates scenes with significantly better 3D consistency and faithfully adheres to the provided turntable camera motion.
Below we show results on DAVIS for T2V models, where the original video is used as the motion signal, and a different target prompt is provided. The task requires the video model to generate a video that aligns with the target prompt while preserving the motion from the original video.
We apply our Image-to-Video model to a sequence of frames warped using monocular depth estimation, enabling consistent 3D scene generation from a single image. Using input videos from WonderJourney's website, we remake them with Go-with-the-Flow into smoother, more coherent videos.
Results on novel camera motion control, featuring synthetic camera movements specified by the user. We estimate a depth map from the single input image using a monocular depth estimator, and use that depth map to warp the input image along a camera path. Our method effectively transforms the depth-warped input animation into a 3D-consistent video with plausible illumination and view-dependent effects.
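For readers who want to see the geometry behind the depth-warping step, here is a hypothetical sketch: unproject pixels with the estimated depth, move them with a rigid camera transform, and re-project them into the new view, leaving disocclusion holes for the video model to fill. The function name, intrinsics handling, and lack of z-buffering are simplifications of ours, not taken from our released code.

```python
import numpy as np

def reproject_with_depth(image: np.ndarray, depth: np.ndarray,
                         K: np.ndarray, T_tgt_from_src: np.ndarray) -> np.ndarray:
    """Forward-warp `image` (H, W, 3) into a new camera using per-pixel `depth` (H, W).

    K: 3x3 intrinsics shared by both views; T_tgt_from_src: 4x4 rigid transform
    mapping source-camera coordinates to target-camera coordinates. Holes from
    disocclusions are left black; the video model is what fills them in.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to 3D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)

    # Move the points into the target camera frame.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_tgt = (T_tgt_from_src @ pts_h)[:3]

    # Project into the target image and splat nearest-pixel colours.
    proj = K @ pts_tgt
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    valid = (proj[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    out = np.zeros_like(image)
    colors = image.reshape(-1, 3)
    # Note: no z-buffering here; a fuller implementation should keep the nearest point.
    out[v[valid], u[valid]] = colors[valid]
    return out
```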
The original How-I-Warped-Your-Noise paper showed that image diffusion models can yield temporally consistent results when their input noise is warped to follow the optical flow of the input video.
Here, we show various noise warping and interpolation techniques side-by-side,
for both relighting using DiffRelight and super-resolution using DeepFloyd Stage II.
What's cool about using noise warping on image-to-image translation: it's a training-free way to get better temporal consistency from image diffusion models that were never trained on videos!