Ordinary Differential Equations and Neural ODEs
From the mathematics of change to neural networks that learn dynamics
June 3rd, 2026
A cup of coffee cooling and the temperature outside changing throughout the day are picture-perfect examples of processes that ODEs describe. An Ordinary Differential Equation is simply a formal way of describing how something changes, without necessarily knowing what that something looks like at every point in time.
What is an ODE?
Let's say you have some unknown function \(z(t)\) that describes how a quantity — say, temperature — evolves over time. You don't know what \(z(t)\) looks like. But what if you knew, at every single point in time, exactly how fast it was changing and in which direction? That's the core idea of an ODE.
\(\frac{dz}{dt} = f(z, t)\) (1)
where \(z\) denotes the unknown quantity we want to find, \(t\) denotes time, \(\frac{dz}{dt}\) denotes the rate of change of \(z\) with respect to time, and \(f(z, t)\) denotes a known expression that tells us what that rate of change is at any given state and time.
Notice something important: the rate of change \(f(z, t)\) can depend on both the current time \(t\) and the current value of \(z\) itself. In other words, the derivative of the unknown function depends on the value of that same function.
Take bacteria growing in a dish. The more bacteria there are right now, the faster the colony grows. This is captured by one of the simplest ODEs:
\(\frac{dz}{dt} = k \cdot z\) (2)
where \(z\) denotes the current population size and \(k\) denotes a growth rate constant. The rate of change of the population depends on the population itself, so you can't find \(z(t)\) just by integrating the right-hand side with respect to \(t\): the right-hand side contains the very function you are trying to find.
Does Time Always Matter?
Not always. Sometimes the rate of change of a system depends only on its current state, not on what time it is. These are called autonomous systems, and they simplify to:
\(\frac{dz}{dt} = f(z)\) (3)
where \(f(z)\) denotes a function of the current state alone, with no explicit dependence on \(t\).
Think of a cup of coffee cooling in a room. Assuming the room stays at a constant temperature, the rate at which the coffee cools depends entirely on how hot it currently is, not on whether it's 2pm or 3pm.
Contrast that with outdoor temperature. Even if the outdoor temperature happened to be identical at two different times of day, the rate of change would be different — because at 6am the sun is rising and heating things up, while at 8pm it's cooling down. The time of day itself carries information that the current temperature alone doesn't capture. That requires the full \(f(z, t)\).
The question to ask yourself is: does knowing the current state tell you everything you need to predict what happens next, or does the time itself add extra information? If time adds information, you need \(f(z, t)\). If the current state alone is enough, \(f(z)\) is sufficient.
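To make the distinction concrete, here is a minimal Python sketch of the two kinds of rate function. The constants (`ROOM_TEMP`, `COOLING_RATE`) and the sinusoidal "sun" term are illustrative assumptions, not measured values or a real weather model.

```python
import math

ROOM_TEMP = 20.0     # assumed constant room temperature (degrees C)
COOLING_RATE = 0.1   # assumed cooling constant (per minute)


def coffee_rate(z):
    """Autonomous case: the rate depends only on the current temperature z."""
    return -COOLING_RATE * (z - ROOM_TEMP)


def outdoor_rate(z, t_hours):
    """Non-autonomous case: the rate depends on z AND on the time of day.

    The sine term is a made-up stand-in for solar heating: positive in the
    morning, negative in the evening.
    """
    solar_forcing = 1.5 * math.sin(2.0 * math.pi * (t_hours - 3.0) / 24.0)
    return solar_forcing - 0.05 * (z - 15.0)


# Same state, different times: the autonomous rate has no time input at all,
# while the non-autonomous rate changes sign between 6am and 8pm.
print(coffee_rate(80.0))
print(outdoor_rate(15.0, 6.0), outdoor_rate(15.0, 20.0))
```

Calling `outdoor_rate` with the same temperature at 6am and 8pm gives a positive and a negative rate respectively, which is exactly the extra information that \(f(z)\) alone cannot carry.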
Initial Conditions: Where Do You Start?
This is a critical point. An ODE like \(\frac{dz}{dt} = k \cdot z\) has infinitely many valid solutions. For \(k = 1\), all of the following satisfy the equation:
\(z(t) = e^t, \quad z(t) = 2e^t, \quad z(t) = 100e^t\) (4)
Each one has a derivative equal to itself. The ODE alone can't tell you which solution describes your actual system: it only tells you the direction of change at each point, not where you start.
To pin down a unique solution, you need an initial condition: the value of \(z\) at some specific starting time \(t_0\).
\(z(t_0) = z_0\) (5)
where \(t_0\) denotes the starting time and \(z_0\) denotes the known value of the function at that starting time.
Think of it like navigation. The ODE is a map that tells you "go left here, go right there." But a map without a starting location is useless — you need to know where you are before the map can guide you anywhere. The initial condition is your GPS pin.
A solution to an ODE is therefore a function that satisfies two things simultaneously: it passes through the initial condition \(z(t_0) = z_0\), and it satisfies the ODE at every point in time. For well-behaved \(f\) (for example, one that is Lipschitz continuous in \(z\)), these two conditions together uniquely define the solution.
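The two conditions can be checked numerically for the growth equation \(\frac{dz}{dt} = k \cdot z\). The sketch below assumes the closed-form solution \(z(t) = z_0 e^{k(t - t_0)}\) and uses a finite-difference derivative as a rough check. The helper names are illustrative.

```python
import math

K = 1.0  # growth rate constant, matching the k = 1 example


def exponential_solution(t, t0, z0, k=K):
    """Solution of dz/dt = k*z that passes through z(t0) = z0."""
    return z0 * math.exp(k * (t - t0))


def derivative_estimate(func, t, h=1e-6):
    """Central finite-difference estimate of d(func)/dt at time t."""
    return (func(t + h) - func(t - h)) / (2.0 * h)


# Each initial value z0 selects a different member of the solution family.
for z0 in (1.0, 2.0, 100.0):
    z = lambda t, z0=z0: exponential_solution(t, 0.0, z0)
    passes_through_start = (z(0.0) == z0)            # condition 1
    residual = derivative_estimate(z, 0.7) - K * z(0.7)  # condition 2, ~0
    print(z0, passes_through_start, residual)
```

All three curves satisfy the ODE (the residual is close to zero), but only one of them passes through any given starting value.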
If you've studied calculus, you'll recognise this intuitively: when you differentiate \(f(x) = mx + b\), the constant \(b\) vanishes. The derivative \(f'(x) = m\) has infinitely many antiderivatives — all differing by some constant. The initial condition is exactly what pins down that constant.
Solving ODEs Numerically
In theory, you solve an ODE by integrating. In practice, most ODEs that appear in the real world have no clean analytical solution — you can't write down a nice closed-form formula for \(z(t)\). Computers solve them numerically instead, by taking small steps through time and approximating the true solution one step at a time.
The Euler Method
The most straightforward approach is the Euler method. The idea is simple: if you know where you are right now and you know the direction the function is heading, take a small step in that direction. Then repeat from your new position.
\(z(t + \Delta t) = z(t) + \Delta t \cdot f(z(t),\, t)\) (6)
where \(z(t)\) denotes your current state, \(f(z(t), t)\) denotes the rate of change at that state and time, and \(\Delta t\) denotes the size of the time step you take.
You can think of \(f(z(t), t)\) as a compass direction and \(\Delta t\) as how far you walk before rechecking the compass. The smaller \(\Delta t\) is, the more accurate your path — but the more steps you need to take. As \(\Delta t \rightarrow 0\), you are essentially integrating the ODE exactly.
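A minimal sketch of the Euler update in Python, applied to the growth equation with \(k = 1\) and \(z(0) = 1\), whose exact value at \(t = 1\) is \(e\). The function names and the choice of test problem are illustrative assumptions.

```python
import math


def euler_solve(f, z0, t0, t_end, num_steps):
    """Integrate dz/dt = f(z, t) from t0 to t_end with fixed Euler steps."""
    dt = (t_end - t0) / num_steps
    z = z0
    for i in range(num_steps):
        t = t0 + i * dt
        z = z + dt * f(z, t)   # step in the direction the "compass" points
    return z


def growth(z, t):
    """Right-hand side of dz/dt = k*z with k = 1 (t is unused here)."""
    return 1.0 * z


exact = math.exp(1.0)
for num_steps in (10, 100, 1000):
    approx = euler_solve(growth, 1.0, 0.0, 1.0, num_steps)
    print(num_steps, approx, abs(approx - exact))
```

On this problem the error should shrink roughly tenfold each time the number of steps grows tenfold, which is the trade-off between accuracy and work described above.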
Runge-Kutta: A More Accurate Alternative
The Euler method has a known weakness: it only looks at the direction at the very start of each step. If the direction changes significantly within that step, you accumulate error quickly. A more accurate alternative is the classical fourth-order Runge-Kutta method (RK4), which evaluates the direction four times per step (once at the start, twice at the midpoint, and once at the end) and takes a weighted average before committing to a move.
\(z(t + \Delta t) = z(t) + \frac{\Delta t}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)\) (7)
where \(k_1 = f(z(t),\, t)\) denotes the slope at the beginning of the step, \(k_2 = f(z(t) + \frac{\Delta t}{2} k_1,\, t + \frac{\Delta t}{2})\) denotes the slope estimated at the midpoint using \(k_1\), \(k_3 = f(z(t) + \frac{\Delta t}{2} k_2,\, t + \frac{\Delta t}{2})\) denotes the slope at the midpoint refined using \(k_2\), and \(k_4 = f(z(t) + \Delta t\, k_3,\, t + \Delta t)\) denotes the slope at the end of the step. The key insight is that RK4 gets a far better estimate of the true path by sampling the direction multiple times within each step. Its accumulated error shrinks in proportion to \(\Delta t^4\), compared with \(\Delta t\) for Euler, so it is significantly more accurate for the same step size, at the cost of four evaluations of \(f\) per step.
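The RK4 update can be sketched in Python and compared against Euler on the same problem. The test equation (\(\frac{dz}{dt} = z\), \(z(0) = 1\), integrated to \(t = 1\)) and the helper names are assumptions chosen for illustration.

```python
import math


def euler_step(f, z, t, dt):
    """One Euler step: use only the slope at the start of the step."""
    return z + dt * f(z, t)


def rk4_step(f, z, t, dt):
    """One classical RK4 step: weighted average of four slope samples."""
    k1 = f(z, t)                                  # slope at the start
    k2 = f(z + 0.5 * dt * k1, t + 0.5 * dt)       # midpoint, using k1
    k3 = f(z + 0.5 * dt * k2, t + 0.5 * dt)       # midpoint, refined with k2
    k4 = f(z + dt * k3, t + dt)                   # slope at the end
    return z + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, f, z0, t0, t_end, num_steps):
    """Apply a one-step method repeatedly from t0 to t_end."""
    dt = (t_end - t0) / num_steps
    z = z0
    for i in range(num_steps):
        z = step(f, z, t0 + i * dt, dt)
    return z


def growth(z, t):
    return z  # dz/dt = z, exact solution z(t) = e^t


exact = math.exp(1.0)
euler_error = abs(integrate(euler_step, growth, 1.0, 0.0, 1.0, 10) - exact)
rk4_error = abs(integrate(rk4_step, growth, 1.0, 0.0, 1.0, 10) - exact)
print("Euler error:", euler_error)
print("RK4 error:  ", rk4_error)
```

With only 10 steps, the RK4 error on this problem is several orders of magnitude smaller than the Euler error, though each RK4 step calls `f` four times instead of once.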
Neural ODEs: Learning the Dynamics from Data
So far we've assumed that \(f(z, t)\) is a known function. But what if it isn't? What if you have a real-world system — say, weather data collected over many years — and you want to learn the underlying dynamics from data alone, without knowing the governing equations?
This is exactly what Neural ODEs do. The idea is conceptually simple: replace the known function \(f(z, t)\) with a neural network \(F_\theta(z, t)\) that we train to approximate it.
\(\frac{dz}{dt} = F_\theta(z, t)\) (8)
where \(F_\theta\) denotes a neural network with learnable parameters \(\theta\), \(z\) denotes the current state of the system, and \(t\) denotes the current time.
Crucially, the network does not predict the next state directly. It predicts the rate of change at the current state and time. You then use a numerical integrator — Euler, RK4, or any other — to step forward through time and reconstruct the full trajectory.
Training and Inference
Given a dataset of time series — say, temperature measurements recorded every hour for a year — you train the network by asking it to produce trajectories that match the observed data. You provide an initial condition (the temperature at some starting time \(t_0\)), integrate forward using a numerical method, and penalise the network for deviating from the true observed values.
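The training loop can be illustrated with a deliberately tiny toy in plain Python. Everything here is an assumption made for the sketch: the "network" is a single weight (`rate_model` returns `theta * z`), the data is synthetic exponential decay, and the gradient comes from finite differences. Real Neural ODE implementations use proper networks and automatic differentiation or adjoint methods instead.

```python
import math

TRUE_K = -0.5    # hidden decay rate, used only to generate synthetic data
OBS_DT = 0.1     # spacing between observations
SUBSTEPS = 10    # Euler steps taken between two observations

obs_times = [OBS_DT * j for j in range(11)]
observations = [math.exp(TRUE_K * t) for t in obs_times]


def rate_model(theta, z, t):
    """Stand-in for F_theta(z, t): a one-weight 'network' predicting dz/dt."""
    return theta * z


def rollout(theta, z0, num_obs):
    """Euler-integrate from z0; return the state at each observation time."""
    dt = OBS_DT / SUBSTEPS
    z, t = z0, 0.0
    trajectory = []
    for _obs in range(num_obs):
        for _sub in range(SUBSTEPS):
            z = z + dt * rate_model(theta, z, t)
            t += dt
        trajectory.append(z)
    return trajectory


def loss(theta):
    """Mean squared error between the integrated and observed trajectory."""
    predicted = rollout(theta, observations[0], len(observations) - 1)
    targets = observations[1:]
    return sum((p - y) ** 2 for p, y in zip(predicted, targets)) / len(targets)


def train(theta=0.0, learning_rate=0.5, num_iters=500, eps=1e-5):
    """Gradient descent with a finite-difference gradient (toy only)."""
    for _ in range(num_iters):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)
        theta -= learning_rate * grad
    return theta


theta = train()
print("learned theta:", theta, "final loss:", loss(theta))
```

The learned weight should land close to the hidden rate of -0.5. It will not match exactly, because the model is fitted through the Euler integrator and so absorbs a little of the integrator's discretisation error.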
At inference time, the network is called many times during a single prediction. Each call takes the current state \(z(t)\) and current time \(t\) as input, outputs a rate of change, and the integrator nudges the state forward by a small \(\Delta t\). This repeats until you reach your desired target time. The target time is not an input to the network — it's simply the stopping condition for the integrator.
In practice, \(z\) is rarely a single number. For weather forecasting, it might be a vector containing temperature, pressure, humidity, and wind speed — all evolving together. The network learns to predict the joint rate of change of this entire state vector simultaneously.
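The inference loop for a vector-valued state might look like the sketch below. The network is a small untrained multilayer perceptron with random fixed weights, so its output is meaningless as a forecast. The point is only the control flow: the network sees \((z, t)\), returns a rate vector, and the target time appears only as the integrator's stopping condition. All names and the four-component state are illustrative assumptions.

```python
import math
import random


def make_rate_network(state_dim, hidden_dim, seed=0):
    """Build an untrained one-hidden-layer network mapping (z, t) -> dz/dt."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(state_dim + 1)]
          for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
          for _ in range(state_dim)]

    def rate(z, t):
        inputs = list(z) + [t]  # current state and current time only
        hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)))
                  for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return rate


def predict(rate, z0, t0, t_target, dt):
    """Euler rollout; t_target only decides when the loop stops."""
    num_steps = round((t_target - t0) / dt)
    z, calls = list(z0), 0
    for i in range(num_steps):
        dz = rate(z, t0 + i * dt)   # one network call per integrator step
        calls += 1
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
    return z, calls


# Hypothetical normalised state: temperature, pressure, humidity, wind speed.
rate = make_rate_network(state_dim=4, hidden_dim=8)
z_final, calls = predict(rate, [0.2, -0.1, 0.5, 0.3], 0.0, 1.0, 0.01)
print(calls, z_final)
```

A single prediction here triggers one network call per Euler step. A higher-order integrator such as RK4 would call the network several times per step.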
Diffusion and Flow Models
In a diffusion or flow model, a neural network also learns a direction field over time, and you also integrate through time. But "time" here doesn't represent physical time moving forward in the real world. Instead, it represents a denoising process: starting from pure Gaussian noise and gradually flowing toward a clean data sample. The network learns, at every point in this process, which direction to nudge the current noisy sample to make it slightly more data-like.
The key difference from a classical Neural ODE is what the state \(z\) represents. In a physical system, \(z\) is something concrete like temperature or population size. In a generative model, \(z\) is a point in a high-dimensional data space — for example, the raw pixel values of an image. The network learns to flow that point from a noisy, random region of that space toward a region where real data lives.
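As a rough illustration of sampling by integration, the sketch below starts from Gaussian noise and follows a direction field toward a data point. The field is written by hand and pulls toward a single fixed target along a straight line. In a real generative model it would be a trained network covering a whole dataset. The convention that time runs from 0 (noise) to 1 (data) is also an assumption of this sketch.

```python
import random

DATA_POINT = [0.8, 0.2, 0.5]  # stand-in for the pixel values of one clean sample


def velocity(x, t):
    """Hand-written direction field standing in for a trained network.

    For one target and straight-line paths from noise to data, the velocity
    at position x and time t is (target - x) / (1 - t).
    """
    return [(target - xi) / (1.0 - t) for xi, target in zip(x, DATA_POINT)]


def sample(num_steps=50, seed=0):
    """Integrate from Gaussian noise (t = 0) toward the data point (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in DATA_POINT]  # pure noise start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # last step uses t = 1 - dt, so 1 - t is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
    return x


print(sample())
```

The loop has the same shape as any Euler integration of a Neural ODE. What differs is the meaning of the state and of "time".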
Where ODEs Fall Short
ODEs are powerful, but they carry a fundamental limitation: the unknown function \(z(t)\) varies only with respect to one independent variable — time. This works well for systems that are spatially uniform, like a perfectly mixed liquid, a single-point measurement, or an abstract latent state in a generative model.
But most real physical systems are not spatially uniform. Temperature doesn't just vary over time — it varies across space too. A room isn't the same temperature everywhere. Sound doesn't arrive everywhere at once — it spreads outward from a source as a wave, with some regions affected before others. To describe systems that vary across both space and time simultaneously, ODEs are no longer sufficient, which is where Partial Differential Equations (PDEs) come into play.