
Stochastic Integration for Poets

November 2024.

When you start reading diffusion papers, you’ll quickly come across an equation like this:

$$dx = f(x,t)\,dt + g(x,t)\,dw$$

where $w$ is Brownian motion. When I first saw these, I admit that I just thought about the discretized version

$$\Delta x = f(x,t)\,\Delta t + g(x,t)\,\Delta w$$

and moved on. You can generate approximate samples in code easily enough:

import numpy as np

# Drift and diffusion from the sample plot below: f(x,t) = x, g(x,t) = t^2
f = lambda x, t: x
g = lambda x, t: t**2

dt = 0.001
x_last = 0
t = 0

while t < 1:
    # Brownian increment over [t, t + dt]: mean 0, variance dt
    dw = np.random.normal(0, np.sqrt(dt))
    x_next = x_last + f(x_last, t)*dt + g(x_last, t)*dw

    t += dt
    x_last = x_next

Here’s a sample plot with $f(x,t) = x$ and $g(x,t) = t^2$:

[Figure: a sample path of the SDE with $f(x,t) = x$ and $g(x,t) = t^2$.]

But discretization feels bad. Sir Isaac Newton didn’t brave plague, sizardom, and falling apples just so that we could discretize. Let’s have some respect. Because you do lose something when you discretize. Here’s Brownian motion at a few different resolutions, becoming more and more jagged as $dt$ becomes finer:

[Figure: Brownian motion sampled at $dt = 0.1$, $dt = 0.01$, and $dt = 0.001$.]
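Here’s a minimal sketch of how paths like these can be generated (the resolutions match the figure; the plotting is left out):

import numpy as np

# Brownian motion on [0, 1]: a cumulative sum of independent N(0, dt)
# increments, sampled at a few resolutions.
for dt in [0.1, 0.01, 0.001]:
    n = int(1 / dt)
    dB = np.random.normal(0, np.sqrt(dt), size=n)
    B = np.concatenate([[0.0], np.cumsum(dB)])  # B_0 = 0
    # plot B against np.linspace(0, 1, n + 1) to reproduce the figure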

Look at that smooth-ish curve when $dt = 0.1$. That’s not honest Brownian motion. And if we zoomed in, we’d be similarly disgusted with $dt = 0.01$ and $dt = 0.001$. I mean, these curves are differentiable almost everywhere. We need something better; we need an infinitesimal notion of “noise”.

So what does the equation

$$dx = f(x,t)\,dt + g(x,t)\,dw$$

actually mean? What’s the real, stochastic calculus interpretation?

Stochastic Integration

One answer — and this is actually the formal answer in one school of thought — is that the “differential” equation $dx = f(x,t)\,dt + g(x,t)\,dw$ is just shorthand for an integral equation:

$$x = \int_0^T f(x,t)\,dt + \int_0^T g(x,t)\,dw.$$

Keep in mind here that $x$ and $w$ are both stochastic processes depending on $t$, so it’s probably clearer to write

$$x_t = \int_0^t f(x,s)\,ds + \int_0^t g(x,s)\,dw_s.$$

We understand the first integral on the right-hand side; it’s an ordinary integral in $ds$. Now we have to make sense of the second one, the integral against $dw_s$.

It’s going to be confusing to use $w$ to refer to Brownian motion given some notation that’s coming later, so let’s use $B_t$ instead. In general, we might want to integrate not just some deterministic function $g(x,t)$ but some other stochastic process $H_t$. So, in the end, we’re trying to define

$$I(t) = \int_0^t H_s\,dB_s$$

The rest of this post is going to be about defining & making sense of this integral.

Intuitively, what should the integral be? If we chop $[0,t]$ into a bunch of sub-intervals $[t_i, t_{i+1}]$, then by analogy to Riemann integration, we hope that the integral above will be well-approximated by

$$\sum_{i=1}^{n} H_{t_{i-1}}\left(B_{t_i} - B_{t_{i-1}}\right)$$

if we make all the $[t_{i-1}, t_i]$ really small. (We’ll say that the “mesh size” goes to zero if $t_i - t_{i-1} \to 0$ uniformly in $i$.)

And this does in fact work. Choose some sequence $\pi_n$ of partitions of $[0,t]$ with mesh size going to zero. Define

$$\int_0^t H_s\,dB_s = \lim_{n \to \infty} \sum_{[t_{i-1}, t_i] \in \pi_n} H_{t_{i-1}}\left(B_{t_i} - B_{t_{i-1}}\right).$$

We have to put some regularity conditions on $H_s$ to guarantee convergence: it’s enough that $\int_0^t H_s^2\,ds < \infty$ for all $t$.
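In code, the defining sum looks something like this. A sketch (`ito_sum` is my own name for it; `H` stands for any vectorized function of $(s, B_s)$):

import numpy as np

def ito_sum(H, t=1.0, n=10_000):
    # Left-endpoint approximation of \int_0^t H(s, B_s) dB_s
    # over a uniform partition with n sub-intervals.
    dt = t / n
    dB = np.random.normal(0, np.sqrt(dt), size=n)
    B = np.concatenate([[0.0], np.cumsum(dB)])
    s = np.linspace(0, t, n + 1)
    # H is evaluated at the *left* endpoint of each sub-interval;
    # that choice is what defines the Ito integral.
    return np.sum(H(s[:-1], B[:-1]) * dB)

For instance, `ito_sum(lambda s, B: B)` approximates the integral $\int_0^1 B_s\,dB_s$ that we’re about to work out by hand.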

Done. But…

Some Surprises

Now let’s try to actually integrate something. Let’s set $H_t = B_t$, so we’re integrating

$$\int_0^t B_s\,dB_s.$$

For a partition $\pi_n$, we look at the sum

$$\sum_{[t_{i-1}, t_i] \in \pi_n} B_{t_{i-1}}\left(B_{t_i} - B_{t_{i-1}}\right) = \sum \tfrac{1}{2}\left(B_{t_i}^2 - B_{t_{i-1}}^2\right) - \sum \tfrac{1}{2}\left(B_{t_i} - B_{t_{i-1}}\right)^2.$$

The first sum $\sum \tfrac{1}{2}(B_{t_i}^2 - B_{t_{i-1}}^2)$ telescopes, and goes to $\tfrac{1}{2}B_t^2$. Let’s look at the second sum. $B_{t_i} - B_{t_{i-1}}$ is a normal random variable with mean zero and variance $t_i - t_{i-1}$. So the expected value of $(B_{t_i} - B_{t_{i-1}})^2$ is $t_i - t_{i-1}$, and the expected value of the whole sum is just $t$. By a similar argument, the variance of $(B_{t_i} - B_{t_{i-1}})^2$ is $2(t_i - t_{i-1})^2$, and since the increments are independent, the variance of the whole sum is $\sum 2(t_i - t_{i-1})^2$. This goes to zero as the mesh size goes to zero: if each $t_i - t_{i-1} \le t/n$, then $(t_i - t_{i-1})^2 \le t^2/n^2$ and $\sum (t_i - t_{i-1})^2 \le t^2/n \to 0$. So the second sum $\sum \tfrac{1}{2}(B_{t_i} - B_{t_{i-1}})^2$ converges to $t/2$ in probability. That leaves us with the potentially surprising result that

$$\int_0^t B_s\,dB_s = \tfrac{1}{2}B_t^2 - \tfrac{1}{2}t.$$
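This is easy to check numerically. A sketch (the resolution is arbitrary):

import numpy as np

# Check that \int_0^1 B_s dB_s is close to (1/2) B_1^2 - 1/2.
n, t = 100_000, 1.0
dt = t / n
dB = np.random.normal(0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])

lhs = np.sum(B[:-1] * dB)        # left-endpoint Ito sum
rhs = 0.5 * B[-1]**2 - 0.5 * t   # the closed form above
print(lhs, rhs)                  # the two should be close for large n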

What’s going on? In the “classic”, deterministic case, where we let $b$ be a function with continuous first derivative and $b(0) = 0$ (to match $B_0 = 0$), we’d expect that

$$\int_0^t b(s)\,b'(s)\,ds = \tfrac{1}{2}b(t)^2.$$

But in the stochastic case, we have this extra $t/2$ term. Why? If we retrace our argument, we’re led to compare the sums

$$\sum \left(B_{t_i} - B_{t_{i-1}}\right)^2 \quad \text{vs.} \quad \sum \left(b(t_i) - b(t_{i-1})\right)^2.$$

As we said above, the left-hand sum converges to $t$. But the right-hand sum converges to $0$. Remember, $b$ is differentiable, so (cutting some corners here)

$$\sum \left(b(t_i) - b(t_{i-1})\right)^2 \approx \sum (t_i - t_{i-1})^2\, b'(t_{i-1})^2.$$

$b'$ is bounded (it’s continuous on a compact interval), so as the mesh size goes to zero, $\sum (t_i - t_{i-1})^2\, b'(t_{i-1})^2 \to 0$. So in the “classic” case, this sum vanishes, but in the stochastic case, it doesn’t.

Here’s what things look like if we set $b(t) = \sin(t)$ and compare that against $B_t$ as above. We’ll integrate both, and look at both “derivatives”:

[Figure: $b(t) = \sin(t)$ and $B_t$, their integrals, and their discretized “derivatives”.]

See what’s going on here? The “derivative” of $B_t$ is this super jagged green line. This line is so noisy that its samples are discontinuous everywhere, and (as we’ll see later) it’s even unbounded on any interval. The only reason it “looks” bounded above is that we discretized it. (Another L for discretization.) No wonder that $\sum (B_{t_i} - B_{t_{i-1}})^2$ doesn’t vanish as the mesh size goes to zero.

On the other hand, the derivative of $b(t)$, namely $\cos(t)$, is smooth. The sum $\sum (b(t_i) - b(t_{i-1}))^2 \approx \sum (t_i - t_{i-1})^2 \cos(t_{i-1})^2 \le \sum (t_i - t_{i-1})^2 \to 0$ as the mesh size goes to zero.

Just to beat this horse to death, let’s compare the sums $\sum (B_{t_i} - B_{t_{i-1}})^2$ and $\sum (b(t_i) - b(t_{i-1}))^2$ over $[0,1]$ numerically. From our arithmetic above, we know that $\sum (B_{t_i} - B_{t_{i-1}})^2$ should converge to $1$ and $\sum (b(t_i) - b(t_{i-1}))^2$ should converge to $0$ as the mesh size goes to zero. And in fact, here’s what things look like as the number of sub-intervals (x-axis) increases:

[Figure: both sums vs. the number of sub-intervals; $\sum (\Delta B)^2$ settles near $1$ while $\sum (\Delta b)^2$ decays to $0$.]
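And here’s a sketch of the computation behind that plot:

import numpy as np

# Quadratic variation sums for B_t and b(t) = sin(t) on [0, 1],
# over finer and finer uniform partitions.
for n in [10, 100, 1_000, 10_000]:
    dt = 1.0 / n
    dB = np.random.normal(0, np.sqrt(dt), size=n)
    db = np.diff(np.sin(np.linspace(0, 1, n + 1)))
    print(n, np.sum(dB**2), np.sum(db**2))  # first ~1, second -> 0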

Our specific example with $\int B_s\,dB_s$ is an instance of Itô’s formula, the stochastic calculus analog of the Fundamental Theorem of Calculus. In the simplest case, if $f$ is a function with continuous second derivative, then Itô tells us that:

$$f(B_t) = f(B_0) + \int_0^t f'(B_s)\,dB_s + \tfrac{1}{2}\int_0^t f''(B_s)\,ds.$$
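As a quick numerical sanity check, here’s a sketch that tests the formula with $f(x) = x^3$, so $f'(x) = 3x^2$ and $f''(x) = 6x$:

import numpy as np

# Ito's formula for f(x) = x**3 on [0, 1]: compare f(B_1) against
# f(B_0) + \int_0^1 3 B_s^2 dB_s + (1/2) \int_0^1 6 B_s ds.
n, t = 100_000, 1.0
dt = t / n
dB = np.random.normal(0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])

lhs = B[-1]**3                                         # f(B_t); f(B_0) = 0
rhs = np.sum(3 * B[:-1]**2 * dB) + 0.5 * np.sum(6 * B[:-1]) * dt
print(lhs, rhs)                                        # close for large n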

The proof comes from the Taylor series for $f$. Write

$$f(B_{t_i}) - f(B_{t_{i-1}}) = f'(B_{t_{i-1}})\left(B_{t_i} - B_{t_{i-1}}\right) + \tfrac{1}{2}f''(B_{t_{i-1}})\left(B_{t_i} - B_{t_{i-1}}\right)^2 + h(B_{t_{i-1}}, B_{t_i})\left(B_{t_i} - B_{t_{i-1}}\right)^2$$

where $h$ is uniformly continuous, bounded, and $h(x,x) = 0$ for all $x$. For brevity, we’ll write $\Delta B_{t_i} = B_{t_i} - B_{t_{i-1}}$ and $\Delta^2 B_{t_i} = (\Delta B_{t_i})^2$, so that

$$f(B_{t_i}) - f(B_{t_{i-1}}) = f'(B_{t_{i-1}})\,\Delta B_{t_i} + \tfrac{1}{2}f''(B_{t_{i-1}})\,\Delta^2 B_{t_i} + h(B_{t_{i-1}}, B_{t_i})\,\Delta^2 B_{t_i}$$

and when we sum things up

$$f(B_t) - f(B_0) = \sum_{i=1}^n \left(f(B_{t_i}) - f(B_{t_{i-1}})\right) = \sum_{i=1}^n f'(B_{t_{i-1}})\,\Delta B_{t_i} + \tfrac{1}{2}\sum_{i=1}^n f''(B_{t_{i-1}})\,\Delta^2 B_{t_i} + \sum_{i=1}^n h(B_{t_{i-1}}, B_{t_i})\,\Delta^2 B_{t_i}.$$

Following Chapter 8 of Steele’s Stochastic Calculus, we’ll write

$$A_n = \sum_{i=1}^n f'(B_{t_{i-1}})\,\Delta B_{t_i}, \qquad B_n = \tfrac{1}{2}\sum_{i=1}^n f''(B_{t_{i-1}})\,\Delta^2 B_{t_i}, \qquad C_n = \sum_{i=1}^n h(B_{t_{i-1}}, B_{t_i})\,\Delta^2 B_{t_i}.$$

I’m not going to work it through, but $A_n \to \int_0^t f'(B_s)\,dB_s$, $B_n \to \tfrac{1}{2}\int_0^t f''(B_s)\,ds$, and $C_n \to 0$, all in probability. Again, if $B_t$ (Brownian motion, not the sum $B_n$ above) were replaced with a differentiable function $b(t)$, then $B_n \to 0$ and we’d recover the Fundamental Theorem of Calculus.

Trying to build more intuition for what’s different in the stochastic case, let’s go back to a sample plot from earlier:

[Figure: a sample path of $B_t$ with its discretized “derivative” in green.]

The green line is an approximate/discretized version of $dB_t$. In code, we’re getting this by writing

import numpy as np

dt = 0.001
dB_t = np.random.normal(0, np.sqrt(dt))  # one N(0, dt) increment

In some sense, $dB_t$ is the “derivative” of $B_t$: by definition, $B_t = \int_0^t dB_s$. But what is $dB_t$? Earlier, we just used it as shorthand for an integral equation. But is there a more direct way to think about it?

Yes

Do you remember the Dirac delta function? It’s a “function” $\delta_x(t)$ such that

$$I_x(t) = \int_{-\infty}^t \delta_x(s)\,ds$$

is $0$ if $t < x$ and $1$ if $t \ge x$. The delta function isn’t a function in the classical sense; there’s no function $\delta_x(t)$ with the desired property. But clearly $I_x(t)$ is well-defined, and in some sense $\delta_x(t)$ is its “derivative”. We might suggestively write $dI_x(t) = \delta_x(t)\,dt$.

It’s in this sense that we look for a “white noise” process $W_t$ such that $dB_t = W_t\,dt$. And we already know that $W_t$ should be infinitesimal Gaussian noise. Indeed, fixing a partition $\pi_n$ of $[0,t]$,

$$B_t - B_0 = \sum_i \frac{B_{t_i} - B_{t_{i-1}}}{t_i - t_{i-1}}\,(t_i - t_{i-1}).$$

Letting the mesh size go to zero, we want to define

$$W_t = \lim_{s \to t} \frac{B_s - B_t}{s - t}.$$

As with Dirac, this limit doesn’t exist in the classic sense. Here are some sample paths for the limit, up to $t = 1$:

[Figure: discretized “sample paths” of the would-be limit $W_t$.]

We can do something similar to what we did with Dirac, though. We’ll basically just define $W_t$ as the “generalized process” for which

$$B_t = \int_0^t W_s\,ds.$$

It’s instructive to see what this looks like at different resolutions. Here’s $W_t$ and the resulting $B_t$ for discretizations $dt = 0.1$, $dt = 0.01$, and $dt = 0.001$:

[Figure: discretized $W_t$ and the resulting $B_t$ at $dt = 0.1$, $dt = 0.01$, and $dt = 0.001$.]
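Here’s a sketch of how these can be generated; the `1/dt` variance in the code is worth noticing, since it’s the next point:

import numpy as np

# Discretized "white noise" W_t and the Brownian path recovered from it,
# at a few resolutions.
for dt in [0.1, 0.01, 0.001]:
    n = int(1 / dt)
    dB = np.random.normal(0, np.sqrt(dt), size=n)
    W = dB / dt             # variance 1/dt: it explodes as dt -> 0
    B = np.cumsum(W * dt)   # integrating W back gives B_t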

As $dt$ gets finer, $B_t$ becomes more “jagged”, which is what we want. To accomplish this, the variance of $W_t$ has to explode. In fact, $W_t$ is pretty chaotic in general: its samples are discontinuous everywhere, and they’re unbounded on every interval. For all these reasons, it’s hard to define $W_t$ as a “classic” process. Is there any alternative?

You may know that you can more formally define the Dirac delta as a “generalized function”. The idea is that a generalized function $f$ defines a linear functional from some space of test functions $\phi$ to the base field $\mathbb{C}$ by

$$\phi \mapsto \int_{\mathbb{R}} \phi(x)\,f(x)\,dx.$$

Someone, probably Elias Stein, once explained this Biblically by saying “by their deeds ye shall know them”. The point is that if you know what f does to every ϕ, you know what f is.

For the Dirac delta $\delta_x$,

$$\phi \mapsto \phi(x).$$

This is a well-defined functional that uniquely specifies $\delta_x$. What more is there to know?

And you can do an entirely similar thing for $W_t$. Define $W_t$ as the functional on a space of (square-integrable) stochastic processes $H_t$ by

$$H_t \mapsto \int_0^t H_s W_s\,ds = \int_0^t H_s\,dB_s.$$

This is a well-defined functional that uniquely specifies $W_t$. What more is there to know?
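To make that last pairing concrete: in discretized form, pairing $H$ against $W$ via $\sum H_s W_s\,\Delta t$ is exactly the Itô sum $\sum H_s\,\Delta B_s$, since the discretized $W$ is just $\Delta B / \Delta t$. A sketch:

import numpy as np

# Pairing H against discretized white noise W = dB/dt recovers the Ito sum
# exactly: the dt's cancel.
n = 10_000
dt = 1.0 / n
dB = np.random.normal(0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])
H = B[:-1]  # e.g. H_s = B_s, at left endpoints

print(np.sum(H * (dB / dt) * dt), np.sum(H * dB))  # identical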