The Mathematics of Flow and Diffusion Models

Understanding the elegant math behind modern generative models

June 21st, 2025
Tags: ai, machine learning, mathematics, generative models, flow models, diffusion models

Flow and Diffusion Models: The Mathematical Perspective

Modern generative AI relies on sophisticated mathematical frameworks to transform random noise into complex, structured data like images, text, and music. Two of the most powerful approaches in this domain are flow models and diffusion models. In this post, we'll explore the elegant mathematics that makes these models work, starting with flow models.

Flow Models: Continuous Transformations in Probability Space

Flow models represent a fascinating approach to generative modeling where we define vector fields that transform simple distributions (like Gaussian noise) into complex data distributions (like images or music). Unlike their more famous cousins, diffusion models, flow models are fundamentally deterministic - given the same starting point, they'll always produce exactly the same result.

The Intuition: Rivers of Probability

Someone gave me this analogy for flow models, which isn't perfect but is a good starting point: Imagine pouring a bucket of water (representing a simple probability distribution) onto a carefully sculpted landscape. The water flows along paths determined by the terrain (that we will later define as a vector field), eventually forming streams that lead towards our target probability distribution. Flow models work in a similar way - they define "probability currents" that guide simple distributions into complex ones through continuous transformations.

Just like how the slope of terrain determines which way water flows at any point, the vector field in a flow model tells each point in our probability distribution exactly which way to move.

The Mathematical Foundations

Now let's translate this intuition into precise mathematics. At first, you might think of flow models as simple updates where \(u_t(x_t) = x_{t+1}\), meaning the vector field directly gives the next position. However, this isn't quite right because flow models operate in continuous time, not discrete steps.

Vector Fields

We can define a vector field as a function:

\(u: \mathbb{R}^d \times [0,1] \rightarrow \mathbb{R}^d, \quad (x,t) \mapsto u_t(x)\)

Let's break down what this means:

  • The vector field \(u\) takes two inputs: a position \(x\) in \(d\)-dimensional space and a time \(t\)
  • It outputs a vector that tells us which direction to move and how fast at that specific point and time
  • The time parameter \(t\) ranges from 0 to 1, representing the complete transformation from start to finish

The notation \(u_t(x)\) emphasizes that our vector field can change over time - the directions might be different at the beginning (t=0) compared to the end (t=1) of our transformation.
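
To make this concrete, here's a minimal sketch in Python (with NumPy) of what such a time-dependent vector field could look like. The specific field \(u_t(x) = -x/(1+t)\) is an arbitrary toy choice for illustration, not something a real model would use:

```python
import numpy as np

def vector_field(x: np.ndarray, t: float) -> np.ndarray:
    """A toy time-dependent vector field u_t(x).

    Takes a position x in R^d and a time t in [0, 1] and returns the
    velocity at that point and time. The specific form (a pull toward
    the origin that weakens over time) is purely illustrative.
    """
    return -x / (1.0 + t)

# Evaluate the field at the same point at two different times:
x = np.array([1.0, 3.0])
print(vector_field(x, t=0.0))  # velocity at the start
print(vector_field(x, t=1.0))  # the field has changed over time
```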

Trajectories

A trajectory is the path that a single point follows through our vector field over time. Mathematically, we define it as:

\(X: [0,1] \rightarrow \mathbb{R}^d, \quad t \mapsto X_t\)

It's simply saying:

  • \(X\) is a function that maps each time \(t\) between 0 and 1 (where \(t=0\) corresponds to our initial noise distribution and \(t=1\) to the fully transformed distribution, such as an image) to a position \(X_t\) in our space
  • It tells us where a particular point is at any time during the transformation
  • Every point follows its own unique trajectory through the space

Note that \(X_0 = x_0\) specifies the initial position of our point sampled from the noise distribution, which serves as the starting condition.

For example, in a 2D vector field, we can think of our starting point as a single point in 2D space, and our trajectory as the path that point follows through time. A point starting at (1,3) at t=0 might move to (2,4) at t=0.5, then to (3,5) at t=1. However, a different point starting at (1,4) at t=0 would follow a completely different trajectory because it has a different initial condition. This is why specifying \(X_0 = x_0\) is crucial - it determines which unique trajectory we follow through the time-varying vector field.

The Ordinary Differential Equation (ODE)

The relationship between our vector field and trajectories is formalized through an ordinary differential equation:

\(\frac{d}{dt} X_t = u_t(X_t)\)

with the initial condition \(X_0 = x_0\).

This equation is the heart of flow models. Breaking it down:

  • \(\frac{d}{dt} X_t\) is the rate of change of position with respect to time (the velocity)
  • \(u_t(X_t)\) is the vector field evaluated at the current position and time
  • The equation says "the velocity of a point equals the vector field at that point" - in other words, the vector field directly controls how each point moves
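
To see the ODE in action, here's a tiny worked example (purely illustrative): take the one-dimensional, time-independent vector field \(u_t(x) = -x\). The ODE becomes

\(\frac{d}{dt} X_t = -X_t, \quad X_0 = x_0\)

and its unique solution is

\(X_t = x_0 e^{-t}\)

Every point decays exponentially toward the origin, and the initial condition \(x_0\) determines which exponential curve the trajectory follows.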

The Flow Function: The Complete Transformation

While trajectories describe the path of individual points, the flow function captures the entire transformation all at once:

\(\psi: \mathbb{R}^d \times [0,1] \rightarrow \mathbb{R}^d, \quad (x,t) \mapsto \psi_t(x)\)

The flow function \(\psi_t(x)\) tells us where an initial point \(x\) will be at time \(t\). It satisfies two key properties:

  1. Identity at time zero: \(\psi_0(x_0) = x_0\) (at the start, points are at their initial positions)
  2. Governed by the vector field: \(\frac{d}{dt} \psi_t(x_0) = u_t(\psi_t(x_0))\) (the flow evolves according to our ODE)

The flow function is powerful because it encapsulates the complete transformation from our simple distribution to our complex target distribution. If we know \(\psi_1(x)\) for all \(x\), we know exactly how our initial distribution transforms into our final distribution.
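
Continuing the toy example with \(u_t(x) = -x\), the flow function has a closed form:

\(\psi_t(x) = x e^{-t}\)

Both properties are easy to check: \(\psi_0(x) = x\), and \(\frac{d}{dt} \psi_t(x) = -x e^{-t} = u_t(\psi_t(x))\). Its inverse also exists in closed form, \(\psi_t^{-1}(y) = y e^{t}\) - a point we'll return to when discussing invertibility below.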

Key Properties That Make Flow Models Special

Flow models possess several crucial mathematical properties that make them uniquely powerful for generative modeling:

Deterministic Nature

Given the same starting point \(x_0\), the trajectory will always follow the exact same path through the vector field. This comes from a fundamental property of ODEs: under mild smoothness conditions on the vector field, they have a unique solution for each initial condition.

This determinism contrasts with diffusion models, which involve stochastic noise and don't guarantee the same output given the same input.

Invertibility

Flow models use invertible transformations, meaning we can trace the path backward from the target distribution to the source distribution. This bidirectional property is essential for both training and sampling.

Mathematically, this means we can define an inverse flow \(\psi^{-1}_t\) that takes us from time \(t\) back to time 0.

Continuous Transformation

The transformation occurs continuously over the time interval \([0,1]\), providing smooth interpolation between the source and target distributions.

This continuity allows us to "watch" the transformation process at any intermediate time step, which can be helpful for understanding and debugging these models.

From Theory to Practice: Numerical Implementation

While the mathematical theory of flow models operates in continuous time, practical implementation requires discrete numerical methods.

The dx-notation and Numerical Approximation

The key insight comes from the definition of a derivative. Recall that:

\(\frac{d}{dt} X_t = \lim_{h \rightarrow 0} \frac{X_{t+h} - X_t}{h}\)

This is the formal definition of a derivative - the limit of the difference quotient as the time step \(h\) approaches zero. For small but non-zero values of \(h\), we can write the difference quotient as the vector field plus an error term:

\(\frac{X_{t+h} - X_t}{h} = u_t(X_t) + R_t(h)\)

Here, \(R_t(h)\) is a remainder term that captures the approximation error when \(h\) is not exactly zero. This term is crucial for understanding the accuracy of our numerical methods, but is often overlooked in simplified presentations.

The Discrete Formulation

Rearranging the approximation by multiplying both sides by \(h\), we obtain:

\(X_{t+h} = X_t + hu_t(X_t) + hR_t(h)\)

This equation provides the discrete equivalent of our continuous ODE. It tells us how to compute the next position \(X_{t+h}\) given the current position \(X_t\) and the vector field \(u_t\).

The Euler Method: Simple but Effective

If we ignore the remainder term \(hR_t(h)\), we obtain the Euler method for numerically solving ODEs:

\(X_{t+h} \approx X_t + hu_t(X_t)\)

This approximation method works by:

  1. Taking the current position \(X_t\)
  2. Evaluating the vector field to get the direction \(u_t(X_t)\)
  3. Scaling this direction by the step size \(h\)
  4. Adding it to the current position to estimate the next position

Despite its simplicity, the Euler method works remarkably well for many applications, especially when the step size is sufficiently small.
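
Here's a short sketch of these four steps in code, reusing the toy vector field from earlier; the step count is an arbitrary choice:

```python
import numpy as np

def vector_field(x: np.ndarray, t: float) -> np.ndarray:
    """Toy vector field u_t(x); same illustrative choice as before."""
    return -x / (1.0 + t)

def euler_integrate(x0: np.ndarray, n_steps: int = 100) -> np.ndarray:
    """Integrate dX/dt = u_t(X) from t=0 to t=1 with the Euler method.

    Returns the trajectory as an array of shape (n_steps + 1, d).
    """
    h = 1.0 / n_steps
    trajectory = [x0]
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = x + h * vector_field(x, t)  # Euler update: X_{t+h} = X_t + h * u_t(X_t)
        t += h
        trajectory.append(x)
    return np.stack(trajectory)

traj = euler_integrate(np.array([1.0, 3.0]))
print(traj[0], traj[-1])  # starting point and approximate endpoint at t=1
```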

Sidefact: Connection to Taylor Series

If you've ever heard of Taylor series: The numerical approximation methods used in flow models have a deep connection to Taylor series expansion. The Taylor series provides a way to approximate functions using polynomial terms:

\(f(x+h) = f(x) + h \cdot f'(x) + \frac{h^2}{2!} \cdot f''(x) + \frac{h^3}{3!} \cdot f'''(x) + \ldots\)

We basically calculate the rate of change of a given function at a specific point and use it to approximate the function at a nearby point. For well-behaved (analytic) functions, the Taylor series with infinitely many polynomial terms represents the function exactly.

In our discrete formulation:

\(X_{t+h} = X_t + h \cdot u_t(X_t) + h \cdot R_t(h)\)

This corresponds exactly to a first-order Taylor approximation where:

  • \(X_t\) represents the current value (analogous to \(f(x)\))
  • \(h \cdot u_t(X_t)\) represents the first derivative term (analogous to \(h \cdot f'(x)\))
  • \(h \cdot R_t(h)\) contains all the higher-order terms we're neglecting

This connection explains why numerical methods for solving ODEs work: they're essentially using truncated Taylor series to approximate the continuous evolution of the system. There are actually more sophisticated methods that use more terms of the Taylor series to get a more accurate approximation, but the Euler method is the simplest one and works surprisingly well.
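
We can check this numerically with the toy field \(u_t(x) = -x\), whose exact solution \(X_t = x_0 e^{-t}\) we worked out earlier. The following sketch (assuming nothing beyond NumPy) shows the Euler method's global error shrinking roughly in proportion to the step size \(h\):

```python
import numpy as np

# For u_t(x) = -x the ODE has the exact solution X_t = x0 * exp(-t),
# so we can measure the Euler method's error at t = 1 directly.
x0 = 1.0
exact = x0 * np.exp(-1.0)

for n_steps in [10, 100, 1000]:
    h = 1.0 / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * (-x)  # Euler step with u_t(x) = -x
    # Ten times more steps gives roughly ten times less error,
    # consistent with a first-order (O(h)) method.
    print(n_steps, abs(x - exact))
```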

The Complete Picture: Flow Models in Action

Let's connect all these mathematical pieces to understand how flow models work in practice:

  1. Initialization: We start by sampling points from a simple distribution (typically Gaussian noise) at time \(t=0\).

  2. Vector Field Definition: We define (or more commonly, learn) a vector field \(u_t(x)\) that will guide our points from the simple distribution to the complex distribution. This vector field will basically be our model that we train.

  3. Numerical Integration: We use numerical methods like Euler's method to approximate the ODE and push our samples forward through time:

    \(X_{t+h} \approx X_t + h \cdot u_t(X_t)\)
  4. Final Distribution: By time \(t=1\), our initially simple distribution has been transformed into the target data distribution through this continuous, deterministic flow.

For training, we typically use the inverse direction: starting with real data samples, we learn a vector field that pushes them back to a simple distribution (like a Gaussian). Once trained, we can sample from the simple distribution and push forward to generate new data samples. But this will be a topic for another post.
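
Putting the four steps together, sampling from a flow model could look like the sketch below. Here, model is a hypothetical stand-in for a trained network approximating \(u_t(x)\); in this sketch it's just our toy field again, not a real trained model:

```python
import numpy as np

def model(x: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical stand-in for a trained network approximating u_t(x).

    A real implementation would be a neural network taking (x, t) and
    returning a velocity vector; here it's just the toy field from before.
    """
    return -x / (1.0 + t)

def sample(d: int = 2, n_steps: int = 100, seed: int = 0) -> np.ndarray:
    """Draw one sample: Gaussian noise at t=0, integrated forward to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)   # 1. initialize with noise at t = 0
    h = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x = x + h * model(x, t)  # 3. Euler step through the learned field
        t += h
    return x                     # 4. approximate sample from the target at t = 1

print(sample())
```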




Diffusion Models: Adding Randomness

Now that we understand the deterministic approach of flow models, let's explore their more chaotic cousins: diffusion models. While flow models give us the same output every time for the same input, diffusion models embrace randomness to create more flexible and diverse generative processes.

The key insight behind diffusion models is simple yet powerful: real-world processes aren't perfectly predictable. By adding controlled randomness to our transformations, we can model more complex and varied data distributions.

From Deterministic to Stochastic

Remember our flow model ODE:

\(\frac{d}{dt} X_t = u_t(X_t)\)

Diffusion models extend this by adding a stochastic component, transforming our ordinary differential equation (ODE) into a stochastic differential equation (SDE):

\(dX_t = u_t(X_t)dt + \sigma_t dW_t\)

This single equation captures the essence of diffusion models. Let's break it down:

  • \(u_t(X_t)dt\): The deterministic drift - the predictable part (same as flow models)
  • \(\sigma_t dW_t\): The stochastic diffusion - the random part that makes diffusion models special

The \(dt\) term represents an infinitesimal time step, while \(dW_t\) represents an increment of something called Brownian motion, which we'll explain shortly. \(\sigma_t\) is a scalar that controls the amount of noise added at each time step - note that if it is 0, we get back the original ODE of the flow model.

Understanding the Components

The first term \(u_t(X_t)dt\) works exactly like in flow models. It provides a consistent direction based on the current position and time. If we removed the stochastic part entirely (set \(\sigma_t = 0\)), we'd get back our original flow model.

The second term \(\sigma_t dW_t\) introduces controlled randomness:

  • \(\sigma_t\): The diffusion coefficient - controls HOW MUCH randomness to add at time t
  • \(dW_t\): A Brownian motion increment - provides the random direction

Think of it like walking: the deterministic part tells you the general direction to head (toward your destination), while the stochastic part adds random stumbles and course corrections that make your path more varied.

Brownian Motion

Brownian motion is the mathematical foundation for the randomness in diffusion models. It's named after the botanist Robert Brown, who observed the random movement of particles suspended in fluid.

The Key Property

For any two time points \(s < t\), Brownian motion satisfies:

\(W_t - W_s \sim \mathcal{N}(0, (t-s)I_d)\)

This means the change in Brownian motion between times s and t follows a normal distribution with:

  • Mean of 0 (no bias in any direction)
  • Variance proportional to the time difference \((t-s)\)
  • Randomness in every increment, so each sampled path of \(W_t\) looks different

Why This Matters

The crucial insight is that variance scales with time:

  • Small time steps → small random movements
  • Large time steps → large random movements
  • The standard deviation scales as \(\sqrt{t-s}\), not \((t-s)\)

This scaling property is what makes Brownian motion perfect for diffusion models - it ensures that our random process behaves consistently regardless of how we discretize time.
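
A quick numerical sanity check of this scaling (a sketch; the time gaps 0.01 and 0.04 are arbitrary choices): quadrupling the time gap should only double the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Simulate Brownian increments W_t - W_s over two different time gaps
# and confirm that the standard deviation scales like sqrt(t - s).
for dt in [0.01, 0.04]:
    increments = rng.normal(loc=0.0, scale=np.sqrt(dt), size=n_samples)
    print(dt, increments.std())  # ~0.1 for dt=0.01, ~0.2 for dt=0.04
```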

Numerical Implementation

Just like flow models need numerical methods to implement continuous ODEs, diffusion models need the Euler-Maruyama method to handle SDEs.

The Discrete SDE

For a small time step \(h\), we discretize our SDE as:

\(X_{t+h} = X_t + hu_t(X_t) + \sigma_t(W_{t+h} - W_t)\)

The Brownian increment \(W_{t+h} - W_t\) has the property:

\(W_{t+h} - W_t \sim \mathcal{N}(0, hI_d)\)

The Practical Algorithm

In practice, we implement this by sampling from a standard normal distribution and scaling appropriately:

\(X_{t+h} = X_t + hu_t(X_t) + \sigma_t\sqrt{h}\epsilon\)

where \(\epsilon \sim \mathcal{N}(0, I_d)\) is sampled fresh at each step.

Why the square root of h?

You might wonder: why \(\sqrt{h}\) and not just \(h\)?

This comes from a fundamental property of variance scaling. When you multiply a random variable by a constant \(c\), its variance gets multiplied by \(c^2\).

Since we want our Brownian increment to have variance \(h\), and we're starting with a standard normal (variance 1), we need to multiply by \(\sqrt{h}\) because \((\sqrt{h})^2 = h\).
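
Written out as a one-line derivation, using the variance rule above:

\(\mathrm{Var}(\sqrt{h}\,\epsilon) = (\sqrt{h})^2 \cdot \mathrm{Var}(\epsilon) = h \cdot I_d = \mathrm{Var}(W_{t+h} - W_t)\)

so the scaled standard normal has exactly the distribution of a true Brownian increment.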

Key Differences from Flow Models


Non-Deterministic Output

Unlike flow models, running a diffusion model twice with the same input will produce different results due to the random components. This makes them excellent for generating diverse outputs.

Controlled Randomness

The diffusion coefficient \(\sigma_t\) allows fine-grained control over randomness:

  • \(\sigma_t = 0\): Reduces to a flow model approach (pure determinism)
  • Large \(\sigma_t\): High randomness and diversity
  • Small \(\sigma_t\): Low randomness, more predictable

Exploration vs. Exploitation

The stochastic component allows diffusion models to "explore" the probability space more thoroughly than deterministic flow models. This exploration capability often leads to better coverage of complex data distributions.

The Complete Algorithm: Euler-Maruyama in Action

Here's the complete algorithm for sampling from a diffusion model:

Require: Vector field \(u_t\), number of steps \(n\), diffusion coefficient \(\sigma_t\)

  1. Set \(t = 0\)
  2. Set step size \(h = \frac{1}{n}\)
  3. Set \(X_0 = x_0\) (initial condition)
  4. For \(i = 1, \ldots, n\) do:
    • Draw \(\epsilon \sim \mathcal{N}(0, I_d)\)
    • \(X_{t+h} = X_t + hu_t(X_t) + \sigma_t\sqrt{h}\epsilon\)
    • Update \(t \leftarrow t + h\)
  5. Return \(X_0, X_h, X_{2h}, \ldots, X_1\)

The key insight is that we're taking the deterministic Euler step from flow models and adding a carefully scaled random component at each iteration.
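
Here's the full algorithm as a Python sketch. Both model (a stand-in for a learned vector field) and the constant diffusion coefficient are illustrative assumptions, not parts of any real trained model:

```python
import numpy as np

def model(x: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical learned vector field u_t(x); toy field for illustration."""
    return -x / (1.0 + t)

def sigma(t: float) -> float:
    """Diffusion coefficient sigma_t; a constant here, purely for illustration."""
    return 0.5

def euler_maruyama(x0: np.ndarray, n_steps: int = 100, seed: int = 0) -> np.ndarray:
    """Sample one trajectory of the SDE dX_t = u_t(X_t) dt + sigma_t dW_t."""
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    xs = [x0]
    x, t = x0, 0.0
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)                     # fresh noise each step
        x = x + h * model(x, t) + sigma(t) * np.sqrt(h) * eps  # Euler-Maruyama step
        t += h
        xs.append(x)
    return np.stack(xs)  # the trajectory X_0, X_h, ..., X_1

traj = euler_maruyama(np.zeros(2), seed=0)
print(traj[-1])  # a different seed would give a different endpoint
```

Running the sampler twice with different seeds produces different trajectories from the same starting point - exactly the non-determinism discussed above.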

Why Diffusion Models?

Diffusion models have become incredibly successful in generative AI because they strike a perfect balance between structure and randomness. The deterministic component ensures the generated samples follow the learned data distribution, while the stochastic component allows for exploration and diversity.

This mathematical framework explains why diffusion models can generate such diverse and high-quality samples - they're not just following a single deterministic path (trajectory) like flow models, but exploring a whole family of possible paths guided by both learned structure and controlled randomness.

The State of the Art and What's Next

So which approach reigns supreme? The answer isn't straightforward. Diffusion models currently dominate the commercial landscape, powering everything from Stable Diffusion to DALL-E, and have proven their reliability in real-world applications. However, recent advances in flow matching show that flow models can achieve superior performance in controlled studies, offering faster training and more efficient sampling paths. The most exciting development is that cutting-edge models like Stable Diffusion 3 and OpenAI's Sora are combining both approaches, using diffusion transformers with flow matching techniques. Rather than one approach replacing the other, the future of generative AI lies in hybrid methods that make use of the deterministic approach of flow models alongside the proven diversity and robustness of diffusion models.