When ML meets Game Theory

I have always had a loose question in my mind that I am eager, but not yet able, to answer systematically: which problems could or couldn't be well solved by machine learning/deep learning methodologies. In particular, I want to claim (loosely) that even if one could build the most powerful machine learning model, it would not be useful for predicting financial or economic behavior in the world, because that behavior is essentially governed by game theory/mean field game theory, i.e., the outcome really depends on the other players in the game, so that no model could 'predict' it.

Admittedly, one can always propose some framework to argue for the effectiveness of ML in the financial world; however, I am eager to see very simple, naive models that formalize this problem (either supporting or refuting my claim) in an elegant way.

Invariant of Models

I have always been intrigued and interested by the concept of an invariant. Perhaps my first impression of this concept came from the mathematician Paul Gordan, who was known as the king of invariant theory. Later on, I was introduced to more examples of invariants; one funny example I remember was from Prof Xiangjun Wang: proving that there must be a swirl on any head covered with hair (an application of the hairy ball theorem, due to Brouwer).

However, I hold a belief/intuition that the essence of any model in the world is to understand and find the invariants of the problem. The idea here is very abstract and broad, but let me keep this post here and finish it gradually going forward.

First, let's focus on predictive regression models. For any ML model to be deployed, people tend to split the data randomly into a training set and a validation set. Why is it reasonable to do this? The underlying principle is that the data distribution (i.e., y\sim f(X)) is invariant under the random splitting (this can be proved). Next, when new test data is introduced, the reason we are able to use such data is the assumption that the data distribution will not change.

Another simple example is linear regression y=a x + b +\sigma \epsilon: as one can see, the essence of this model's predictive power comes from the invariant quantity \sigma; it has predictive power only if \sigma is unchanged. For a more general model, the essence of its predictive power comes from the invariance assumption on the model parameters.
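To make this concrete, here is a minimal MATLAB sketch (all numbers are hypothetical): fit y=ax+b on training data, and watch the out-of-sample error stay near \sigma only while the generating parameters (a, b, \sigma) remain invariant, then blow up once the slope drifts.

a = 2; b = 1; sigma = 0.5; n = 500;             % hypothetical "invariant" parameters
x_train = randn(n,1);
y_train = a*x_train + b + sigma*randn(n,1);
coef = polyfit(x_train, y_train, 1);            % fitted [a_hat, b_hat]
x_test  = randn(n,1);
y_test  = a*x_test + b + sigma*randn(n,1);      % same distribution: invariance holds
y_drift = (a+1)*x_test + b + sigma*randn(n,1);  % slope drifts: invariance broken
rmse_same  = sqrt(mean((polyval(coef,x_test) - y_test ).^2));
rmse_drift = sqrt(mean((polyval(coef,x_test) - y_drift).^2));
fprintf('out-of-sample RMSE: %.2f (invariant) vs %.2f (after drift)\n', rmse_same, rmse_drift);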

However, should people worry about the legitimacy of this assumption in any of these scenarios?

to be continued…….

Linear Algebra exercises.

  1. What are the eigenvalues of the n\times n matrix A=\begin{pmatrix} 1 & c & \dots & c\\ c & 1 & \dots & c\\ \vdots & \vdots & \ddots & \vdots\\ c & c & \dots & 1\end{pmatrix}?

Solution 1: Notice that A=c J+(1-c)I, where I,J are the identity matrix and the all-ones matrix, respectively. Assume v is an eigenvector associated with an eigenvalue \lambda; then

\begin{aligned}  cJv+(1-c)Iv&=\lambda v\\  \implies\quad c\Big(\sum_i v_i\Big)\overset{\rightarrow}{1}&=(\lambda+c-1) v.  \end{aligned}

Therefore, either v=\overset{\rightarrow}{1}, in which case c\sum_i v_i=cn=\lambda+c-1, i.e., \lambda_1=1+(n-1)c.

Otherwise, \sum_i v_i=0 and \lambda_{2,\dots,n}=1-c; the corresponding eigenspace is the (n-1)-dimensional subspace \text{span}\left\{\begin{pmatrix} 1\\ -1\\ 0\\ \vdots\\ 0\end{pmatrix}, \begin{pmatrix} 1\\ 0\\ -1\\ \vdots\\ 0\end{pmatrix}, \dots, \begin{pmatrix} 1\\ 0\\ 0\\ \vdots\\ -1\end{pmatrix}\right\}.

 

Solution 2: Matrix Determinant lemma (wikipedia: https://en.wikipedia.org/wiki/Matrix_determinant_lemma#Proof)

Statement: \text{det} (A+uv^T)=(1+v^TA^{-1}u)\text{det}(A), where A is an invertible matrix, u,v\in \mathbb{R}^{n,1}.

Note that A-\lambda I=(1-c-\lambda)I_{n\times n}+uv^T with u=v=\sqrt{c}\,\overset{\rightarrow}{1}_{n\times 1}, so the lemma (applied with (1-c-\lambda)I in place of the invertible matrix) gives the characteristic polynomial

\begin{aligned}  \det(A-\lambda I)=\Big(1+\frac{nc}{1-c-\lambda}\Big)(1-c-\lambda)^n=\big(1-\lambda+(n-1)c\big)(1-c-\lambda)^{n-1}.  \end{aligned}

Setting this to zero, the eigenvalues are \lambda_1=1+(n-1)c and \lambda_{2,\dots,n}=1-c, matching Solution 1.
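A quick numerical sanity check of this spectrum in MATLAB (with hypothetical values of n and c):

n = 5; c = 0.3;
A = (1-c)*eye(n) + c*ones(n);                   % A = cJ + (1-c)I
disp(sort(eig(A))');                            % numerical eigenvalues
disp(sort([1+(n-1)*c, (1-c)*ones(1,n-1)]));     % claimed eigenvalues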

Monty Hall problem revisited

This post writes down a Bayesian proof of the Monty Hall goat problem. Below is the statement of the problem from Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

Analysis: the answer to this question depends on whether you are a Bayesian or a frequentist, although the former is the approach through which the answer is widely known (with hidden assumptions, such as the host choosing uniformly at random between the remaining doors if your first guess is correct).

Bayesian:

Denote by \{C=i\} the event that the car is behind door i, by \{I=i\} the event that your first guess is door i, and by \{H=i\} the event that the host opens door i to reveal a goat. Hence,

\begin{aligned}  P(C=2|I=1,H=3)  &=\frac{P(I=1,H=3|C=2)\cdot P(C=2)}{P(I=1,H=3|C=1)\cdot P(C=1)+P(I=1,H=3|C=2)\cdot P(C=2)+P(I=1,H=3|C=3)\cdot P(C=3)}\\  &=\frac{1\cdot\frac{1}{3}}{\frac{1}{2}\cdot\frac{1}{3}+1\cdot\frac{1}{3}+0\cdot\frac{1}{3}}\\  &=\frac{2}{3}.  \end{aligned}

Meanwhile, P(C=1|I=1,H=3)=\frac{1}{3}. Therefore, the correct choice is to switch from door 1 to door 2.
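As a sanity check on the 2/3 (not part of the Bayesian argument itself), a small Monte Carlo simulation in MATLAB, using the same hidden assumption that the host picks uniformly among the doors he is allowed to open:

trials = 1e5; wins_switch = 0;
for t = 1:trials
    car   = randi(3);                            % door hiding the car
    guess = 1;                                   % you always pick door No. 1
    allowed = setdiff(1:3, [guess car]);         % goat doors the host may open
    host  = allowed(randi(numel(allowed)));
    switched = setdiff(1:3, [guess host]);       % the door you switch to
    wins_switch = wins_switch + (switched == car);
end
fprintf('P(win by switching) ~ %.3f\n', wins_switch/trials);   % close to 2/3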

Frequentist:

However, a frequentist has no prior distribution. This approach would draw the inference as below:

Assume H_0: C=1 and H_a: C=2; then

P_{H_0}(H=3)=\frac{1}{2} is large (as a p-value), therefore we are not able to reject H_0; and vice versa for H_a.

Unfortunately, the frequentist approach cannot speak about the posterior probability, so it cannot answer the question of whether to switch or not.

Reproducing Kernel Hilbert Space

Since I have written about the simulation of fractional Brownian motion, I'll spend an hour writing about the RKHS.

Definition: A Hilbert space \mathcal{H} of functions on a set X is a RKHS if every evaluation functional \mathcal{F}_t, t\in X, is bounded: |\mathcal{F}_t[f]|=|f(t)|\leq M_t\|f\|_{\mathcal{H}}, \forall f\in \mathcal{H}

Theorem: If \mathcal{H} is a RKHS, then for each t \in X there exists a function K_t \in \mathcal{H} (called the representer of t) with the reproducing property

\mathcal{F}_t[f]=<K_t,f>_{\mathcal{H}}=f(t), \forall f\in\mathcal{H}

Therefore, K_t(t^{'})=<K_t,K_{t^{'}}>_{\mathcal{H}}.

Definition: K: X\times X\rightarrow\mathbb{R} is a reproducing kernel if it’s symmetric and positive definite.

Theorem: A RKHS defines a corresponding reproducing kernel. Conversely, a reproducing kernel defines a unique RKHS.

Once we have the kernel, if f(\cdot)=\sum_i \alpha_i K(t_i,\cdot) and g(\cdot)=\sum_j \beta_j K(t^{'}_j,\cdot),

then <f,g>_{\mathcal{H}}=\sum_i\sum_j \alpha_i \beta_j K(t_i, t^{'}_j)
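This inner product is just a quadratic form in the kernel Gram matrix, so it is easy to compute for finite expansions. A small MATLAB sketch, using a Gaussian kernel and hypothetical coefficients purely for illustration (any symmetric positive definite K works the same way):

K  = @(s,t) exp(-(s-t).^2/2);                   % illustrative reproducing kernel
t  = [0.1 0.5 0.9];  alpha = [1 -2 0.5];        % f = sum_i alpha_i K(t_i, .)
tp = [0.2 0.7];      beta  = [0.3 1.0];         % g = sum_j beta_j  K(t'_j, .)
G  = K(t', tp);                                 % Gram block K(t_i, t'_j)
ip = alpha * G * beta';                         % <f,g>_H = sum_ij alpha_i beta_j K(t_i, t'_j)
disp(ip)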

When it comes to fractional Brownian motion:

Theorem: For the fBm RKHS, the kernel K(x,x^{'})=R(x,x^{'}) is symmetric and positive definite, and there exists K^H(x,\cdot) s.t. K(x,x^{'})=\int K^H(x,y)K^H(x^{'},y)\,dy

Take the Wiener integral as an example:

\begin{array}{cc|ccccc}    \textit{dual space}&\textit{inner product}& E &\leftrightarrow& <\cdot,\cdot>_{\mathcal{H}} &\leftrightarrow& <\cdot,\cdot>_{L^{2}}\\    \delta_{t}(\cdot)& & B^H_t &\leftrightarrow& R(\cdot,t) &\leftrightarrow& K^H(t,\cdot)\\    f(\cdot)& & \textit{"}\int_0^T f(t)B^H_t\,dt\textit{"} &\leftrightarrow& K^H\circ(K^H)^{*}f &\leftrightarrow& (K^H)^{*}f    \end{array}

where K^H\circ f(t)=\int_0^T K^H(t,s)f(s)ds,

(K^H)^{*}\circ f(t)=\int_0^T K^H(s,t)f(s)ds and

K^H\circ(K^H)^{*}f(t)=\int_0^T R(t,s)f(s)ds

For example, when H=\frac{1}{2}, K^H(t,s)=1_{[0,t]}(s), and we can check

E(B_tB_s)=<t\wedge\cdot, s\wedge\cdot>_{\mathcal{H}}=<1_{[0,t]}(\cdot),1_{[0,s]}(\cdot)>_{L^2}=t\wedge s
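A tiny MATLAB check of the L^2 expression, with hypothetical t and s:

t = 0.7; s = 0.4;
y  = linspace(0, 1, 1e5+1);
ip = trapz(y, double(y<=t & y<=s));             % integral of 1_[0,t](y) * 1_[0,s](y)
fprintf('L2 inner product = %.4f, min(t,s) = %.4f\n', ip, min(t,s));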

Claim: By the previous theorem, the reproducing kernel R(t,s) uniquely defines a RKHS, the (completed) linear span of \{R(t,\cdot)\}, with the inner product <\cdot,\cdot>_{\mathcal{H}} above.

How to simulate a fractional Brownian motion?

A fractional Brownian motion B^H_t is a Gaussian process satisfying

a) B^H_0=0

b) B^H_t is Gaussian

c) E(B^H_tB^H_s)=R(t,s)=\frac{1}{2}(t^{2H}+s^{2H}-|t-s|^{2H})

As for simulation, we just need to simulate the correlated Gaussian (\Delta B^H_{t_1},\dots,\Delta B^H_{t_n})\sim N(0,\Sigma), where

t_i=i*T/n=i \Delta t,

\Delta B^H_{t_i}=B^H_{t_i}-B^H_{t_{i-1}}, i=1,\dots,n,

\Sigma_{i,j}=E(\Delta B^H_{t_i}\Delta B^H_{t_j})=\frac{(\Delta t)^{2H}}{2}\big(|i-j+1|^{2H}+|i-j-1|^{2H}-2|i-j|^{2H}\big)

Now the question is just simulating L'\cdot N(0,I_{n\times n}), where L is the Cholesky factor of \Sigma, i.e., \Sigma=L'L.

Method 1: compute \Sigma first and then use the Cholesky decomposition to obtain L. This works in theory, but not in practice, because it is too costly in time and space to store \Sigma, not to mention the O(n^3) decomposition step.
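Still, for modest n Method 1 is a usable baseline; a minimal MATLAB sketch (hypothetical n and H):

n = 500; H = 0.7; T = 1; dt = T/n;              % hypothetical grid and Hurst index
k = abs((1:n)' - (1:n));                        % |i-j|
Sigma = dt^(2*H)/2 * ((k+1).^(2*H) + abs(k-1).^(2*H) - 2*k.^(2*H));
L = chol(Sigma, 'lower');                       % Sigma = L*L'
B = [0; cumsum(L*randn(n,1))];                  % fBm path from correlated increments
plot((0:n)*dt, B);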

Method 2:

We denote \lambda(k)=\frac{(\Delta t)^{2H}}{2}\big((k+1)^{2H}+|k-1|^{2H}-2k^{2H}\big)=\Sigma_{i,i+k}, \quad k=0,\dots,n-1.

Define the circulant matrix whose first row is \big(\lambda(0),\lambda(1),\dots,\lambda(n-1),\lambda(n-2),\dots,\lambda(1)\big):

C=\begin{pmatrix}\lambda(0)&\lambda(1)&\lambda(2)&\dots&\lambda(n-1)&\lambda(n-2)&\dots&\lambda(1)\\ \lambda(1)&\lambda(0)&\lambda(1)&\dots&\lambda(n-2)&\lambda(n-1)&\dots&\lambda(2)\\ \lambda(2)&\lambda(1)&\lambda(0)&\dots&\lambda(n-3)&\lambda(n-2)&\dots&\lambda(3)\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots&&\vdots\\ \lambda(n-1)&\lambda(n-2)&\lambda(n-3)&\dots&\lambda(0)&\lambda(1)&\dots&\lambda(n-2)\\ \vdots&\vdots&\vdots&&\vdots&\vdots&\ddots&\vdots\\ \lambda(1)&\lambda(2)&\lambda(3)&\dots&\lambda(n-2)&\lambda(n-3)&\dots&\lambda(0)\end{pmatrix}_{(2n-2)\times (2n-2)},

of which \Sigma is the upper-left n\times n block.

The idea is the following: suppose C=DD^*=(D_1+iD_2)(D_1^T-i D_2^T), where D_1, D_2 \in \mathbb{R}^{(2n-2)\times (2n-2)}. Now generate i.i.d. \epsilon^1,\epsilon^2\sim N(0,I_{2n-2}) and denote

X_1+iX_2=D(\epsilon^1+i\epsilon^2)=D_1\epsilon^1-D_2\epsilon^2+i(D_2\epsilon^1+D_1\epsilon^2), \quad X_1,X_2\in \mathbb{R}^{2n-2}.

We can then conclude that X_1[1:n]\sim N(0,\Sigma) and X_2[1:n]\sim N(0,\Sigma).

Let Q be the unitary matrix defined by

Q_{j,k}=(2n-2)^{-1/2}\exp\big(-2\pi i\,\tfrac{jk}{2n-2}\big), \quad j,k=0,\dots,2n-3, and thus we can take D=Q\,\Lambda^{\frac{1}{2}}.

So how do we obtain \Lambda? The property of circulant matrices shows that

\lambda_k=\sum_{j=0}^{2n-3}c_j \exp\big(-2\pi i\,\tfrac{jk}{2n-2}\big), where (c_0,\dots,c_{2n-3}) is the first row of C, i.e.,

(\lambda_0,\dots,\lambda_{2n-3})^{'}=\sqrt{2n-2}\;Q\begin{pmatrix}\lambda(0)&\lambda(1)&\lambda(2)&\dots&\lambda(n-1)&\lambda(n-2)&\dots&\lambda(1)\end{pmatrix}^{'}, which is exactly the FFT of the first row of C, and \Lambda=\text{diag}(\lambda_0,\dots,\lambda_{2n-3}).
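A quick MATLAB check of this circulant/FFT fact, with a hypothetical small first row:

c = [2 0.5 0.1 0.5];                            % symmetric first row, m = 4
C = toeplitz(c);                                % equals the circulant with this first row
disp(sort(eig(C))');                            % eigenvalues of C
disp(sort(real(fft(c))));                       % FFT of the first row gives the same values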

Next, according to the idea stated before, X_1+iX_2=Q\,\Lambda^{\frac{1}{2}}(\epsilon^1+i\epsilon^2)=(2n-2)^{-1/2}\,\text{FFT}\big(\Lambda^{\frac{1}{2}}(\epsilon^1+i\epsilon^2)\big),

and the first n components of X_1 (and of X_2) are N(0,\Sigma).

Reference: Coeurjolly, J.-F. (2000), "Simulation and identification of the fractional Brownian motion: a bibliographical and comparative study", Journal of Statistical Software.

Matlab code:

function []=fBM(n,H)
% Simulate a fractional Brownian motion path on [0,1] with Hurst index H
% via circulant embedding and the FFT (Method 2 above).
r = nan(n+1,1); r(1) = 1;
for k=1:n
    r(k+1) = 0.5*((k+1)^(2*H) - 2*k^(2*H) + (k-1)^(2*H)); % autocovariance of the increments
end
r = [r; r(end-1:-1:2)];            % first row of the circulant matrix
lambda = real(fft(r))/(2*n);       % eigenvalues of the circulant matrix
W = fft(sqrt(lambda).*complex(randn(2*n,1),randn(2*n,1)));
W = n^(-H)*cumsum(real(W(1:n+1))); % rescale and sum the increments
plot((0:n)/n,W);                   % plot the fBm path
%Terminal=W(n+1);                  % terminal value of the fBm
end