Neural Networks as Wave Equation Approximators

Reformulating the wave equation so a neural network can step it forward

July 2nd, 2026

mathematics, deep learning, neural networks, wave equation, physics simulation

In the post on Neural PDEs, once we had discretised the heat equation onto a grid, we replaced the hand-derived spatial operator with a neural network and wrote the whole update as:

\(u(t + \Delta t) = u(t) + \Delta t \cdot F_\theta(u(t))\)(1)

where \(u(t)\) denotes the full spatial grid of values at the current time step — a vector holding the field value at every single grid point at once — \(u(t + \Delta t)\) denotes that same full grid one time step later, \(\Delta t\) denotes the size of the time step, \(F_\theta\) denotes a neural network, and \(\theta\) denotes that network's learnable parameters.

Stare at the right-hand side for a second. The network \(F_\theta\) is handed only the current field \(u(t)\), and from that alone it is asked to produce the rate of change. The assumption is that the current snapshot of the field is enough to predict its own future. For the heat equation this is completely true, and it is true for a precise reason: the heat equation is first-order in time. Its left-hand side is \(\frac{\partial u}{\partial t}\), the rate of change of the value itself, and the equation hands you that rate directly from the current shape.
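As a minimal sketch of update (1), the snippet below steps a five-point field forward once. The names `euler_step` and `heat_rate` are hypothetical, and `heat_rate` is only a hand-derived stand-in for \(F_\theta\) (diffusivity times discrete curvature, with the two end points held fixed as an assumption of this sketch), not a trained network.

```python
from typing import Callable, List

Field = List[float]


def euler_step(u: Field, dt: float, rate_fn: Callable[[Field], Field]) -> Field:
    """One step of update (1): u(t + dt) = u(t) + dt * F(u(t))."""
    rates = rate_fn(u)  # F sees only the current snapshot of the field
    return [ui + dt * ri for ui, ri in zip(u, rates)]


def heat_rate(u: Field, alpha: float = 1.0, dx: float = 1.0) -> Field:
    """Stand-in for F_theta: alpha times the discrete curvature.

    The two end points get a rate of 0.0 (assumed fixed boundaries).
    """
    rates = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        rates[i] = alpha * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2
    return rates


u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
u1 = euler_step(u0, dt=0.25, rate_fn=heat_rate)
print(u1)  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Nothing but `u0` goes in, and the next field comes out: that is the first-order property the rest of the post has to work around.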

The wave equation is not first-order in time. It is second-order. Its left-hand side is \(\frac{\partial^2 u}{\partial t^2}\) — the acceleration of the value, not its rate of change. Before I show you why in the language of fields, I want to leave PDEs completely and think about something far more everyday, because the whole difficulty is visible there with no math at all.

A Ball That Two Photographs Can't Tell Apart

Imagine a single photograph of a ball in the air. In the photo, the ball is exactly three metres off the ground. Nothing is blurred; you cannot see motion. My question is simple: where is the ball one second from now?

Here is the catch: two completely different balls could produce that identical photograph. One was thrown upward and is still rising through the three-metre mark; a second later it will be higher, and only after that will it come back down. The other was already falling and is passing three metres on its way to the ground. The two balls have opposite futures despite being indistinguishable in the photo.

What separates them is not visible in a single still image: it is velocity — whether the ball is currently rising or falling, and how fast.

This is true of any system whose governing law is written in terms of acceleration rather than rate of change. Newton's law tells a ball its acceleration (gravity pulls it down at a fixed rate), and acceleration is two steps removed from position. I leaned on that fact at the very end of the last post, when I argued that a second time derivative is exactly what forces information to travel at a finite speed.

A Single Point of Air Has the Same Problem

Picture a loudspeaker in a room, and zoom in on one tiny point in the air. As the wave passes, the air there doesn't fly across the room — it stays put, and the sound wave shows up as a tiny pressure change instead. We track this with the pressure perturbation field \(u(x, y, z, t)\), where \(u\) denotes how far the pressure at that point deviates from the calm, undisturbed baseline, \(x, y, z\) denote the point's position in space, and \(t\) denotes time.

Now suppose I freeze time and tell you \(u = 0\) at that point. Where is it headed? It is the ball's situation all over again. The current value alone tells you nothing about the next instant; you also need to know how fast it is currently changing.

The Missing Half: A Velocity Field

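The fix is to carry the velocity along as a second quantity. From here on I'll call \(u\) the displacement field, borrowing the vocabulary of the vibrating string, and write it with a single spatial coordinate as \(u(x, t)\). Alongside it we now introduce a second field, the velocity field: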

\(v(x, t) = \frac{\partial u}{\partial t}(x, t)\)(2)

where \(v(x, t)\) denotes the velocity of the air parcel at position \(x\) at time \(t\) — how fast, and in which direction, that parcel is currently moving away from or back toward its rest position — \(u\) denotes the displacement field, and \(\frac{\partial u}{\partial t}(x, t)\) denotes the first partial derivative of displacement with respect to time at that point.

Up to now, the state of our system was a single field. From here on, the state of the system is a pair of fields: displacement and velocity, side by side, one value of each at every point in space.
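One way to picture the two-field state is as a small container holding both grids. This is only an illustrative sketch: the class name `WaveState` and the three-point grids are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WaveState:
    """The full state: one displacement and one velocity per grid point."""
    u: List[float]  # displacement at every grid point
    v: List[float]  # velocity at every grid point


# Two states that look identical in a displacement-only "photograph".
rising = WaveState(u=[0.0, 0.0, 0.0], v=[0.0, 1.0, 0.0])
falling = WaveState(u=[0.0, 0.0, 0.0], v=[0.0, -1.0, 0.0])

same_snapshot = rising.u == falling.u  # the displacement grids match
same_state = rising == falling         # the full states do not
print(same_snapshot, same_state)  # True False
```

The two objects share a displacement grid and still count as different states, which is the ball-in-the-photograph problem written as data.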

Closing the Loop With the Last Post's Derivation

Acceleration is the rate of change of velocity. But velocity was itself the rate of change of displacement. So acceleration is the rate of change of the rate of change of displacement — the second time derivative:

\(a = \frac{\partial v}{\partial t} = \frac{\partial^2 u}{\partial t^2}\)(4)

where \(a\) denotes the parcel's acceleration, \(\frac{\partial v}{\partial t}\) denotes the first time derivative of the velocity field (how fast the velocity is changing), and \(\frac{\partial^2 u}{\partial t^2}\) denotes the second time derivative of the displacement field. The second equality is just the statement that "rate of change of velocity" and "rate of change of the rate of change of displacement" are the same thing, since \(v = \frac{\partial u}{\partial t}\).

And \(\frac{\partial^2 u}{\partial t^2}\) is precisely the term that sat on the left-hand side of the wave equation we spent the whole last post deriving from Newton's law on a vibrating string. Which means we already know what that acceleration equals — the derivation handed it to us:

\(a = \frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2}\)(5)

where \(a\) denotes the parcel's acceleration, \(\frac{\partial^2 u}{\partial t^2}\) denotes the second time derivative of displacement, \(c\) denotes the wave speed, and \(\frac{\partial^2 u}{\partial x^2}\) denotes the spatial curvature of the displacement field — the same "how much does this point differ from the average of its neighbours" quantity that has followed us since the heat equation.

So the two loose ends tie together perfectly. Newton's law said acceleration is force over mass, computed fresh each instant. The wave equation is Newton's law, already worked out for this system, and it tells us that the force per unit mass on each parcel is the curvature term \(c^2 \frac{\partial^2 u}{\partial x^2}\). The parcel's acceleration is dictated entirely by how bent the displacement field is right at that parcel, right now.

And "how bent the field is at a point" is something we already know how to compute from a grid — we derived it from scratch in the Neural PDE post. The discrete curvature at grid point \(i\) was:

\(\frac{\partial^2 u}{\partial x^2}(x_i, t) \approx \frac{u_{i+1}(t) - 2u_i(t) + u_{i-1}(t)}{\Delta x^2}\)(6)

where \(\frac{\partial^2 u}{\partial x^2}(x_i, t)\) denotes the spatial curvature of displacement at grid point \(x_i\), \(u_{i+1}(t)\) denotes the displacement of the right-hand neighbour, \(u_i(t)\) denotes the displacement of the point itself, \(u_{i-1}(t)\) denotes the displacement of the left-hand neighbour, and \(\Delta x^2\) denotes the grid spacing squared. The concrete payoff of writing it out is this: a single parcel's acceleration depends only on the displacements of its two immediate neighbours and itself. Not on their velocities, not on anything far away — just three displacement values, exactly as in the heat equation. The velocity of the parcel never enters the acceleration calculation at all; velocity only matters when we later use it to move the displacement.
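Equation (6) translates almost directly into code. The sketch below uses a hypothetical helper, `discrete_curvature`, and leaves the two end points at zero, which is a boundary-handling assumption of the example rather than something the derivation prescribes. A parabola makes a convenient check, because its curvature is 2 everywhere.

```python
from typing import List


def discrete_curvature(u: List[float], dx: float) -> List[float]:
    """Equation (6) at every interior grid point.

    Each entry uses only three displacements: left neighbour, the point
    itself, and right neighbour. End points are left at 0.0 (assumption).
    """
    curv = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        curv[i] = (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2
    return curv


dx = 0.5
xs = [i * dx for i in range(7)]
parabola = [x * x for x in xs]  # u(x) = x^2, whose second derivative is 2
parabola_curvature = discrete_curvature(parabola, dx)
print(parabola_curvature[1:-1])  # [2.0, 2.0, 2.0, 2.0, 2.0]
```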

One Second-Order Equation as Two First-Order Equations

We stop treating the wave equation as one equation with a second time derivative and instead split it into two equations, each with only a first time derivative: one for each of the two fields we have to carry forward.

The first equation is nothing more than the definition of velocity, read as an instruction for how displacement changes:

\(\frac{\partial u}{\partial t} = v\)(7)

where \(\frac{\partial u}{\partial t}\) denotes how fast the displacement is changing at a point with respect to time, and \(v\) denotes the velocity field at that point.

The second equation is Newton's law from above, read as an instruction for how velocity changes:

\(\frac{\partial v}{\partial t} = c^2 \frac{\partial^2 u}{\partial x^2}\)(8)

where \(\frac{\partial v}{\partial t}\) denotes how fast the velocity is changing at a point (which is the acceleration), \(c\) denotes the wave speed, and \(\frac{\partial^2 u}{\partial x^2}\) denotes the spatial curvature of displacement.

Notice how these two lock into each other. The first equation says displacement is driven by velocity. The second says velocity is driven by the shape of displacement. They are coupled — you cannot advance one without the other.

This becomes useful because a first time derivative is something we already know how to step forward using the Euler method. By tracking two fields instead of one, we traded a single hard equation for two easy ones. That is the whole reason for the reformulation.
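The coupled pair (7) and (8) can be sketched as one function that returns both rates of change. The name `wave_rhs` is hypothetical, and the end points are held fixed (both rates zero there) as an assumption of this example.

```python
from typing import List, Tuple


def wave_rhs(u: List[float], v: List[float], c: float, dx: float) -> Tuple[List[float], List[float]]:
    """Return (du/dt, dv/dt) at every grid point, per equations (7) and (8)."""
    n = len(u)
    du_dt = [0.0] * n
    dv_dt = [0.0] * n
    for i in range(1, n - 1):
        du_dt[i] = v[i]  # (7): displacement is driven by velocity
        dv_dt[i] = c**2 * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2  # (8)
    return du_dt, dv_dt


# A bent but motionless field: displacement is not changing yet, velocity is.
bent_du, bent_dv = wave_rhs(u=[0.0, 0.0, 1.0, 0.0, 0.0], v=[0.0] * 5, c=2.0, dx=1.0)
print(bent_du)  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(bent_dv)  # [0.0, 4.0, -8.0, 4.0, 0.0]

# A flat but moving field: velocity is not changing yet, displacement is.
flat_du, flat_dv = wave_rhs(u=[0.0] * 5, v=[0.0, 1.0, 1.0, 1.0, 0.0], c=2.0, dx=1.0)
print(flat_du)  # [0.0, 1.0, 1.0, 1.0, 0.0]
print(flat_dv)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The two calls show the coupling from both sides: shape alone drives velocity, and velocity alone drives displacement.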

Stepping a Single Point Forward

Let me apply Euler to a single grid point \(i\). Two quantities update each time step — velocity, then displacement. In the version below, order does not matter (I'll explain why once you see both formulas) — but it's worth walking through explicitly because a different, very common variant does care about order, and I want the contrast to be clear rather than assumed.

First, velocity, using the curvature of the current displacement field as acceleration:

\(v_i(t + \Delta t) = v_i(t) + \Delta t \cdot c^2 \cdot \frac{u_{i+1}(t) - 2u_i(t) + u_{i-1}(t)}{\Delta x^2}\)(9)

where \(v_i(t + \Delta t)\) is the new velocity at point \(i\), \(v_i(t)\) the current velocity, \(\Delta t\) the time-step size, \(c\) the wave speed, and the fraction the discrete curvature at \(i\) — built from \(u_i(t)\) and neighbours \(u_{i+1}(t)\), \(u_{i-1}(t)\), divided by \(\Delta x^2\). In short: new velocity = old velocity + step size × acceleration.

Second, displacement — using the old velocity, from before the update above:

\(u_i(t + \Delta t) = u_i(t) + \Delta t \cdot v_i(t)\)(10)

where \(u_i(t + \Delta t)\) is the new displacement, \(u_i(t)\) the current displacement, and \(v_i(t)\) is the old velocity — not the freshly computed \(v_i(t + \Delta t)\). Both formulas read from the same old snapshot; neither uses the other's new result. Because of that, computing them in either order gives the identical answer — this is the plain (fully explicit) Euler scheme.

One structural point: this is one time loop, with two formulas run once each per iteration — velocity, then displacement, then advance. Not a nested loop.
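Here is a small sketch of that single time loop, using formulas (9) and (10). The function names, the five-point grid and the parameter values are illustrative choices, and the end points are simply never updated (an assumed fixed boundary).

```python
from typing import List


def new_velocity(u: List[float], v: List[float], c: float, dx: float, dt: float) -> List[float]:
    """Equation (9) at every interior point, read from the old u and old v."""
    v_new = list(v)
    for i in range(1, len(u) - 1):
        accel = c**2 * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2
        v_new[i] = v[i] + dt * accel
    return v_new


def new_displacement(u: List[float], v: List[float], dt: float) -> List[float]:
    """Equation (10) at every interior point, using the old velocity."""
    u_new = list(u)
    for i in range(1, len(u) - 1):
        u_new[i] = u[i] + dt * v[i]
    return u_new


c, dx, dt = 1.0, 1.0, 0.1
u = [0.0, 0.0, 1.0, 0.0, 0.0]
v = [0.0, 0.0, 0.0, 0.0, 0.0]
history = [(u, v)]
for _ in range(3):  # one time loop, two formulas per iteration
    v_next = new_velocity(u, v, c, dx, dt)  # reads old u and old v
    u_next = new_displacement(u, v, dt)     # also reads old u and old v
    u, v = u_next, v_next                   # advance only after both exist
    history.append((u, v))

# Swapping the two calls changes nothing: neither reads the other's output.
u0, v0 = history[0]
velocity_first = (new_velocity(u0, v0, c, dx, dt), new_displacement(u0, v0, dt))
displacement_first = (new_displacement(u0, v0, dt), new_velocity(u0, v0, c, dx, dt))
order_independent = velocity_first == displacement_first[::-1]
print(order_independent)  # True
```

Because the field starts at rest, the first step changes only the velocity; the displacement starts to move on the second step.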

From One Molecule to the Whole Room

Scaling up, the individual values \(u_i\) and \(v_i\) become entries of the full vectors \(u(t)\) and \(v(t)\), each of which holds one entry for every grid point in the room at once.

Also, when you update the whole grid, you might imagine sweeping across it in the direction the wave is travelling — update the leftmost parcel, then use its brand-new value to update the next one over, and so on rightward, chasing the wavefront. That would be wrong. Every point's update reads from the same frozen old snapshot of displacement and velocity. In effect all the points step forward together, in parallel.
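To see why the sweep is wrong, the sketch below runs both versions on the same symmetric starting state. Both function names are hypothetical, and `step_in_place_sweep` is deliberately incorrect: it overwrites values as it goes, so each point sees its left neighbour's new displacement.

```python
from typing import List, Tuple


def step_from_snapshot(u: List[float], v: List[float], c: float, dx: float, dt: float) -> Tuple[List[float], List[float]]:
    """Correct update: every point reads the frozen old u and old v."""
    u_new, v_new = list(u), list(v)
    for i in range(1, len(u) - 1):
        accel = c**2 * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2
        v_new[i] = v[i] + dt * accel
        u_new[i] = u[i] + dt * v[i]
    return u_new, v_new


def step_in_place_sweep(u: List[float], v: List[float], c: float, dx: float, dt: float) -> Tuple[List[float], List[float]]:
    """Deliberately wrong: sweeps left to right, overwriting as it goes."""
    u, v = list(u), list(v)  # copies, then mutated in place during the sweep
    for i in range(1, len(u) - 1):
        accel = c**2 * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2  # u[i-1] is already new
        old_v = v[i]
        v[i] = old_v + dt * accel
        u[i] = u[i] + dt * old_v
    return u, v


u_start = [0.0, 0.0, 1.0, 0.0, 0.0]
v_start = [0.0, 1.0, 1.0, 1.0, 0.0]
u_snap, v_snap = step_from_snapshot(u_start, v_start, c=1.0, dx=1.0, dt=0.5)
u_sweep, v_sweep = step_in_place_sweep(u_start, v_start, c=1.0, dx=1.0, dt=0.5)
print(v_snap)   # [0.0, 1.5, 0.0, 1.5, 0.0]
print(v_sweep)  # [0.0, 1.5, 0.25, 1.75, 0.0]
```

The starting state is mirror-symmetric, and the snapshot update keeps it that way. The sweep breaks the symmetry, which shows it has injected a direction that the physics never had.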

Handing the Two-Field State to a Neural Network

We are finally ready to repeat the trick from the Neural PDE post, but for the two-field state. Back then we replaced the hand-derived spatial operator with a network that took the single field \(u(t)\) and produced its rate of change. Now the state is the pair of fields, so the network takes both and produces the rate of change of both:

\(\begin{bmatrix} u(t+\Delta t) \\ v(t+\Delta t) \end{bmatrix} = \begin{bmatrix} u(t) \\ v(t) \end{bmatrix} + \Delta t \cdot F_\theta\!\left(\begin{bmatrix} u(t) \\ v(t) \end{bmatrix}\right)\)(11)

where the stacked object \(\begin{bmatrix} u(t) \\ v(t) \end{bmatrix}\) denotes the full current state (the displacement grid \(u(t)\) and the velocity grid \(v(t)\) stacked on top of one another), \(\begin{bmatrix} u(t+\Delta t) \\ v(t+\Delta t) \end{bmatrix}\) denotes that same stacked state one step later, \(\Delta t\) denotes the time-step size, \(F_\theta\) denotes the neural network, and \(\theta\) denotes its learnable parameters. Everything the network did with a single field for the heat equation, it now has to do for a doubled state.
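The shape of update (11) can be sketched with a toy stand-in for \(F_\theta\). Everything here is hypothetical: `ToyTheta` holds just four weights (a three-point stencil on displacement and one gain on velocity), and `f_theta` is a linear map, not the architecture of any real network. Its one virtue is that the hand-derived operator is a special case, which gives something to check against.

```python
from dataclasses import dataclass
from typing import List, Tuple

Field = List[float]
State = Tuple[Field, Field]  # (u, v): displacement grid and velocity grid


@dataclass
class ToyTheta:
    """Hypothetical learnable parameters of the toy model."""
    stencil: Tuple[float, float, float]  # weights on u[i-1], u[i], u[i+1]
    v_gain: float                        # weight on v[i]


def f_theta(state: State, theta: ToyTheta) -> State:
    """Toy F_theta: takes both fields, returns the rate of change of both."""
    u, v = state
    n = len(u)
    du_dt, dv_dt = [0.0] * n, [0.0] * n
    w_left, w_centre, w_right = theta.stencil
    for i in range(1, n - 1):  # end points held fixed (assumption)
        du_dt[i] = theta.v_gain * v[i]
        dv_dt[i] = w_left * u[i - 1] + w_centre * u[i] + w_right * u[i + 1]
    return du_dt, dv_dt


def neural_step(state: State, dt: float, theta: ToyTheta) -> State:
    """Update (11): new state = old state + dt * F_theta(old state)."""
    u, v = state
    du_dt, dv_dt = f_theta(state, theta)
    u_next = [ui + dt * r for ui, r in zip(u, du_dt)]
    v_next = [vi + dt * r for vi, r in zip(v, dv_dt)]
    return u_next, v_next


# With c = 1 and dx = 1, these weights reproduce equations (9) and (10).
analytic_theta = ToyTheta(stencil=(1.0, -2.0, 1.0), v_gain=1.0)
start = ([0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0, 0.0])
u_next, v_next = neural_step(start, dt=0.5, theta=analytic_theta)
print(u_next)  # [0.0, 0.5, 1.5, 0.5, 0.0]
print(v_next)  # [0.0, 1.5, 0.0, 1.5, 0.0]
```

A real network would have to discover weights like these from data, and it has no guarantee of landing on them exactly.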

What This Does to the Training Data

A two-channel state means two-channel data: every training snapshot needs ground-truth displacement and velocity, not just displacement. A single displacement snapshot can't tell you how fast the field was moving at that instant, which is the same independence the ball example established. A sequence of displacement snapshots can approximate velocity by finite differencing between consecutive frames, \(v_i(t) \approx \frac{u_i(t+\Delta t) - u_i(t)}{\Delta t}\), but that adds discretization error on top of whatever error the network already makes.

That approximation error is avoidable, but only if the data source actually has velocity available to give you. If the training data comes from a numerical solver, that solver is typically already tracking \(v_i\) internally at every step, so it can output \(v_i\) directly alongside \(u_i\), taken straight from the solver's state rather than reconstructed. If instead the data is displacement-only recordings with no such internal state (e.g. a filmed wave), the finite-difference approximation above is the only option, error and all.
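To get a feel for the size of that finite-difference error, the sketch below reconstructs velocity from two displacement frames of a field whose true velocity is known. The test field \(u(x, t) = \sin(x)\cos(t)\) is an assumption chosen for the example (it solves the wave equation with \(c = 1\)), and all function names are hypothetical.

```python
import math
from typing import List


def finite_difference_velocity(u_now: List[float], u_next: List[float], dt: float) -> List[float]:
    """Forward-difference estimate: v_i(t) ~ (u_i(t + dt) - u_i(t)) / dt."""
    return [(b - a) / dt for a, b in zip(u_now, u_next)]


def standing_wave(xs: List[float], t: float) -> List[float]:
    """Assumed test field u(x, t) = sin(x) * cos(t)."""
    return [math.sin(x) * math.cos(t) for x in xs]


def standing_wave_velocity(xs: List[float], t: float) -> List[float]:
    """Its exact time derivative, -sin(x) * sin(t)."""
    return [-math.sin(x) * math.sin(t) for x in xs]


def max_velocity_error(dt: float, t: float = 0.5) -> float:
    """Largest gap between reconstructed and exact velocity on a small grid."""
    xs = [i * math.pi / 8 for i in range(9)]
    estimate = finite_difference_velocity(standing_wave(xs, t), standing_wave(xs, t + dt), dt)
    exact = standing_wave_velocity(xs, t)
    return max(abs(e - x) for e, x in zip(estimate, exact))


err_coarse = max_velocity_error(dt=0.1)
err_fine = max_velocity_error(dt=0.01)
print(err_coarse, err_fine)  # the error shrinks roughly in proportion to dt
```

With frames 0.1 apart the reconstructed velocity is off by a few percent of the wave's amplitude, and closer frames shrink the error but never remove it.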

A two-part output also means a two-part loss:

\(\mathcal{L} = \sum_i \left( \hat{u}_i - u_i^{\text{true}} \right)^2 + \sum_i \left( \hat{v}_i - v_i^{\text{true}} \right)^2\)(12)

where \(\mathcal{L}\) is the total loss for one predicted step, \(i\) indexes grid points, \(\hat{u}_i\) and \(\hat{v}_i\) are the network's predicted displacement and velocity at point \(i\), and \(u_i^{\text{true}}\), \(v_i^{\text{true}}\) are the corresponding ground-truth values. Predict two fields, score two fields — not a separate design choice, it falls directly out of the two-channel output.
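Loss (12) written out as a sketch, with a hypothetical function name and made-up numbers:

```python
from typing import List


def two_field_loss(u_pred: List[float], v_pred: List[float],
                   u_true: List[float], v_true: List[float]) -> float:
    """Equation (12): summed squared error on displacement plus on velocity."""
    loss_u = sum((p - t) ** 2 for p, t in zip(u_pred, u_true))
    loss_v = sum((p - t) ** 2 for p, t in zip(v_pred, v_true))
    return loss_u + loss_v


loss = two_field_loss(
    u_pred=[0.0, 1.0, 2.0], v_pred=[1.0, 0.0, -1.0],
    u_true=[0.0, 1.5, 2.0], v_true=[1.0, 0.0, 0.0],
)
print(loss)  # 1.25

# A perfect displacement prediction is still penalised if velocity is off.
velocity_only_loss = two_field_loss([0.0, 1.5, 2.0], [1.0, 0.0, -1.0],
                                    [0.0, 1.5, 2.0], [1.0, 0.0, 0.0])
print(velocity_only_loss)  # 1.0
```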

Inference, Rollout, and the Drift Problem

Training feeds the network the true state at every step, since we have the full ground-truth trajectory. At inference time we don't — we only know the starting state, then we're on our own.

Predicting further out than one step is called rollout: feed the network the starting state, take its prediction, feed that prediction back in as the next input, repeat. Because each step's input is the previous prediction rather than ground truth, small errors compound over the rollout — a general risk for any autoregressive learned model.
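The rollout loop itself is short. In the sketch below, a plain Euler wave step plays the role of the reference model, and the same step with its acceleration scaled by 1.01 is a hypothetical stand-in for a slightly wrong learned model. All names and numbers are illustrative.

```python
from typing import Callable, List, Tuple

Field = List[float]
State = Tuple[Field, Field]  # (u, v)


def make_euler_step(c: float, dx: float, dt: float, accel_scale: float = 1.0) -> Callable[[State], State]:
    """Build a one-step predictor; accel_scale != 1 mimics a model error."""
    def step(state: State) -> State:
        u, v = state
        u_new, v_new = list(u), list(v)
        for i in range(1, len(u) - 1):
            accel = accel_scale * c**2 * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2
            v_new[i] = v[i] + dt * accel
            u_new[i] = u[i] + dt * v[i]
        return u_new, v_new
    return step


def rollout(step_fn: Callable[[State], State], state: State, n_steps: int) -> List[State]:
    """Feed each prediction back in as the next input."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step_fn(state)  # the input is the previous prediction
        trajectory.append(state)
    return trajectory


def state_gap(a: State, b: State) -> float:
    """Largest absolute difference over both fields."""
    return max(abs(p - q) for fa, fb in zip(a, b) for p, q in zip(fa, fb))


start = ([0.0, 0.0, 1.0, 0.0, 0.0], [0.0] * 5)
reference = rollout(make_euler_step(1.0, 1.0, 0.1), start, 50)
imperfect = rollout(make_euler_step(1.0, 1.0, 0.1, accel_scale=1.01), start, 50)
gaps = [state_gap(a, b) for a, b in zip(reference, imperfect)]
print(gaps[0], gaps[1], gaps[-1])
```

Both rollouts start from the same state, so the gap is zero at step 0. The 1% error in a single step is small, and the gap after 50 fed-back steps is considerably larger.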

[Interactive demo: 3D wave rollout on a 48×48×48 grid. It shows a slice of the displacement \(u\) at an adjustable depth \(z\), with a walls toggle (shown as reflecting) and an inspect mode that locks a cell so its \(u\) and \(v\) stay visible while stepping. A damping setting (shown at 0.015) applies friction to each parcel's velocity everywhere in the room, independently of the walls toggle, which only controls the boundary.]

This is where the wave equation is meaningfully harder than the heat equation. Heat is diffusive: it smooths sharp features over time, so a spurious error introduced by a bad prediction gets partially ironed out by the dynamics themselves at the next step. It is a built-in, self-correcting mechanism.

The wave equation has no such mechanism. As a hyperbolic equation, it preserves features rather than smoothing them: sharp fronts propagate without decaying. That's true of real signal and of errors alike. A spurious bump from a bad prediction gets carried and bounced around the room exactly like a real disturbance, with nothing damping it out.
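The contrast can be checked numerically with a rough sketch: drop the same one-cell "spurious bump" into a heat step and into a wave step, and see what is left after 40 steps. The helper names, the 21-point grid, the fixed end points and the discrete energy measure are all assumptions of this example. One caveat: the plain Euler wave step used here lets the discrete energy creep upward slightly, which is a property of that time-stepper and not of the wave equation. What matters for the argument is that nothing removes the energy.

```python
from typing import List, Tuple


def curvature(u: List[float], dx: float) -> List[float]:
    """Discrete curvature at interior points; end points stay 0.0."""
    curv = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        curv[i] = (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx**2
    return curv


def heat_step(u: List[float], alpha: float, dx: float, dt: float) -> List[float]:
    return [ui + dt * alpha * ci for ui, ci in zip(u, curvature(u, dx))]


def wave_step(u: List[float], v: List[float], c: float, dx: float, dt: float) -> Tuple[List[float], List[float]]:
    curv = curvature(u, dx)  # computed from the old displacement
    u_new = [ui + dt * vi for ui, vi in zip(u, v)]
    v_new = [vi + dt * c**2 * ci for vi, ci in zip(v, curv)]
    return u_new, v_new


def wave_energy(u: List[float], v: List[float], c: float, dx: float) -> float:
    """Discrete kinetic plus strain energy (an assumed measure for this sketch)."""
    kinetic = 0.5 * sum(vi * vi for vi in v)
    strain = 0.5 * c**2 * sum(((u[i + 1] - u[i]) / dx) ** 2 for i in range(len(u) - 1))
    return kinetic + strain


n = 21
bump = [0.0] * n
bump[n // 2] = 1.0  # a one-cell spurious error

heat_u = list(bump)
for _ in range(40):
    heat_u = heat_step(heat_u, alpha=1.0, dx=1.0, dt=0.25)
heat_peak = max(abs(x) for x in heat_u)

wave_u, wave_v = list(bump), [0.0] * n
energy_before = wave_energy(wave_u, wave_v, c=1.0, dx=1.0)
for _ in range(40):
    wave_u, wave_v = wave_step(wave_u, wave_v, c=1.0, dx=1.0, dt=0.1)
energy_after = wave_energy(wave_u, wave_v, c=1.0, dx=1.0)

print(heat_peak)                    # well below the original height of 1.0
print(energy_before, energy_after)  # the wave has not lost any energy
```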