# Principal Component Analysis (PCA)
> These are my notes that cover the mathematical foundations of a dimensionality reduction method called Principal Component Analysis (PCA), taken while attending *CSCI-UA 9473 - Foundations of Machine Learning* at NYU Paris. They make use of linear algebra and statistics to formalize the concept of PCA.
## Multivariate Statistics & Notation
Let $X = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \in \R^d$ be a random vector. We will use the superscript notation to denote the $d$ components of $X$.
The expectation of $X$ is defined as:
$$
\mathbb{E}[X] = \begin{pmatrix} \mathbb{E}[X^1] \\ \mathbb{E}[X^2] \\ \vdots \\ \mathbb{E}[X^d] \end{pmatrix} \in \R^d
$$
Similarly, the covariance matrix of $X$, denoted by $\Sigma$, is a $d \times d$ matrix defined such that:
$$
\Sigma_{ij} = \sigma_{ij} = \text{Cov}(X^i, X^j) = \mathbb{E}[X^iX^j] - \mathbb{E}[X^i]\mathbb{E}[X^j]
$$
We can write the whole covariance matrix in the following vectorized form:
$$
\begin{equation}
\Sigma = \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal
\end{equation}
$$
> **Note**: This is because $(\mathbb{E}[XX^\intercal])_{ij} = \mathbb{E}[(XX^\intercal)_{ij}] = \mathbb{E}[X^iX^j]$. Recall that:
>$$
>XX^\intercal = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \begin{pmatrix} X^1 & X^2 & \dots & X^d \end{pmatrix} = \begin{pmatrix} X^1X^1 & X^1X^2 & \dots & X^1X^d \\ X^2X^1 & X^2X^2 & \dots & X^2X^d \\ \vdots & \vdots & \ddots & \vdots \\ X^dX^1 & X^dX^2 & \dots & X^dX^d \end{pmatrix}
>$$
The covariance matrix can also be written as:
$$
\begin{equation}
\Sigma = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\intercal]
\end{equation}
$$
> **Note**: This is because $(X - \mathbb{E}[X])^{i} = X^i - \mathbb{E}[X^i] = X^i - \mathbb{E}[X]^i$.
>
> Just to verify we will expand the right hand side of the equation:
> $$
> \begin{align*}
> \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\intercal] &= \mathbb{E}[(X - \mathbb{E}[X])(X^\intercal - \mathbb{E}[X]^\intercal)] \\
> &= \mathbb{E}[XX^\intercal - X\mathbb{E}[X]^\intercal - \mathbb{E}[X]X^\intercal + \mathbb{E}[X]\mathbb{E}[X]^\intercal] \\
> &= \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal - \mathbb{E}[X]\mathbb{E}[X]^\intercal + \mathbb{E}[X]\mathbb{E}[X]^\intercal \\
> &= \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal
> \end{align*}
> $$
### Empirical Estimation & Reviewing Linear Algebra
Let $\mathbb{X} = \begin{pmatrix} \dots & {X_1}^\intercal & \dots \\ \dots & {X_2}^\intercal & \dots \\ \dots & \vdots & \dots \\ \dots & {X_n}^\intercal & \dots \end{pmatrix} \in \R^{n \times d} $ be a matrix that contains $n$ realizations of $X$.
We will use the subscript notation to denote the $n$ observations of $X$. These $X_1, X_2, \dots, X_n$ are assumed to be independent and identically distributed (i.i.d.) random vectors with the same distribution as the random variable $X$.
Since we don't have access to the true distribution of $X$, we will use the empirical distribution of $\mathbb{X}$ to estimate the expectation and covariance matrix of $X$.
The empirical expectation of $X$ is denoted by $\bar{X}$ and is defined as:
$$
\newcommand\identity{1\kern-0.25em\text{l}}
\begin{align}
\bar{X} &= \frac{1}{n} \sum_{i=1}^n X_i \\
&= \frac{1}{n} \mathbb{X}^\intercal \identity_n
\end{align}
$$
where $\newcommand\identity{1\kern-0.25em\text{l}} \identity_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \R^n$ is a vector of ones.
> **Note**: This can be explained by:
> $$
> \mathbb{X}^\intercal \newcommand\identity{1\kern-0.25em\text{l}} \identity_n = \begin{pmatrix} \text{\textbar} & \text{\textbar} & \text{\textbar} & \text{\textbar} \\ X^i_1 & X^i_2 & \dots & X^i_n \\ \text{\textbar} & \text{\textbar} & \text{\textbar} & \text{\textbar} \end{pmatrix} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} \text{\textbar} \\ \sum_{j=1}^n X^i_j \\ \text{\textbar} \end{pmatrix} = \begin{pmatrix} \text{\textbar} \\ n\bar{X}^i \\ \text{\textbar}\end{pmatrix} = n\bar{X}
> $$
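As a quick numerical sanity check (a NumPy sketch with a small random matrix standing in for $\mathbb{X}$; the variable names are mine), the vectorized form $\frac{1}{n} \mathbb{X}^\intercal \mathbb{1}_n$ agrees with the column-wise mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))        # n realizations of a d-dimensional random vector

ones = np.ones(n)                  # the all-ones vector 1_n
xbar_vectorized = X.T @ ones / n   # (1/n) X^T 1_n
xbar_direct = X.mean(axis=0)       # column-wise empirical mean

assert np.allclose(xbar_vectorized, xbar_direct)
```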
The empirical covariance matrix of $X$ is denoted by $S$ and is defined as:
$$
\begin{align}
S &= \frac{1}{n} \sum_{i=1}^n X_i{X_i}^\intercal - \bar{X}\bar{X}^\intercal \\
&= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\intercal \\
\end{align}
$$
Again, performing further vectorization of equation $(5)$, we get:
$$
\begin{equation}
S = \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \bar{X}\bar{X}^\intercal
\end{equation}
$$
Replacing value of $\bar{X}$ from equation $(4)$, we get:
$$
\newcommand\identity{1\kern-0.25em\text{l}}
\begin{align}
S &= \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \frac{1}{n^2} \mathbb{X}^\intercal \identity_n (\mathbb{X}^\intercal \identity_n)^\intercal \\
&= \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \frac{1}{n^2} \mathbb{X}^\intercal \identity_n \identity_n^\intercal \mathbb{X} \\
&= \frac{1}{n} \mathbb{X}^\intercal \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \mathbb{X} \\
&= \frac{1}{n} \mathbb{X}^\intercal H \mathbb{X}
\end{align}
$$
where $H$ is the centering matrix, with $H_{ii} = 1 - \frac{1}{n}$ and $H_{ij} = -\frac{1}{n}$ for $i \neq j$.
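To see this identity in action, here is a small NumPy sketch (the toy data matrix is mine) comparing $\frac{1}{n} \mathbb{X}^\intercal H \mathbb{X}$ against the empirical covariance computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 2
X = rng.normal(size=(n, d))

H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - (1/n) 1 1^T
S_centered = X.T @ H @ X / n                   # (1/n) X^T H X
S_direct = np.cov(X, rowvar=False, bias=True)  # (1/n)-normalized empirical covariance

assert np.allclose(S_centered, S_direct)
```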
### Properties of $H$
#### Orthogonal Projector
Matrix $H$ can be shown to be an orthogonal projector. Since it is symmetric ($H^\intercal = H$), it suffices to show that $H^2 = H$.
$$
\newcommand\identity{1\kern-0.25em\text{l}}
\begin{align*}
H^2 &= \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \\
&= \mathbb{I}_n - \frac{2}{n} \identity_n \identity_n^\intercal + \frac{1}{n^2} \identity_n \left( \identity_n^\intercal \identity_n \right) \identity_n^\intercal \\
&= \mathbb{I}_n - \frac{2}{n} \identity_n \identity_n^\intercal + \frac{n}{n^2} \identity_n \identity_n^\intercal ~~~[~\because \identity_n^\intercal \identity_n = n~]\\
&= \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \\
&= H
\end{align*}
$$
#### Projection Space
Let's take $v \in \R^n$. We have:
$$
\newcommand\identity{1\kern-0.25em\text{l}}
\begin{align*}
Hv &= \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) v \\
&= v - \frac{1}{n} \identity_n \identity_n^\intercal v \\
&= v - \left(\frac{1}{n} v^\intercal \identity_n \right) \identity_n \\
&= v - \bar{v} \identity_n
\end{align*}
$$
where $\bar{v} = \frac{1}{n} \sum_{i=1}^n v_i$. Therefore, this projector removes the average of $v$ from each of its coordinates.
> **Note**: It projects onto the subspace of vectors having zero mean. In other words:
> $$
> \newcommand\identity{1\kern-0.25em\text{l}}
> \begin{align*}
> Hv \perp \text{span} \left\{ \identity_n \right\}
> \end{align*}
> $$
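These properties of $H$ can be verified numerically (a small sketch; the sample vector is arbitrary):

```python
import numpy as np

n = 5
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
v = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

Hv = H @ v
assert np.isclose(Hv.mean(), 0.0)      # Hv has zero mean, i.e. Hv is orthogonal to 1_n
assert np.allclose(Hv, v - v.mean())   # H subtracts the mean from each coordinate
assert np.allclose(H @ Hv, Hv)         # idempotent: projecting twice changes nothing
```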
### Switching to Statistics
Let $u \in \R^d$, then we can show that $u^\intercal \Sigma u$ is the variance of $u^\intercal X$.
$$
\begin{align*}
u^\intercal \Sigma u &= u^\intercal \left( \mathbb{E}[XX^\intercal] - \mathbb{E}[X] \mathbb{E}[X]^\intercal \right) u \\
&= u^\intercal \mathbb{E}[XX^\intercal] u - u^\intercal \mathbb{E}[X] \mathbb{E}[X]^\intercal u \\
&= \mathbb{E}[u^\intercal XX^\intercal u] - (\mathbb{E}[u^\intercal X])^2 \\
&= \mathbb{E}[(u^\intercal X)^2] - (\mathbb{E}[u^\intercal X])^2 \\
&= \text{Var}(u^\intercal X)
\end{align*}
$$
With a similar argument, we can show that $u^\intercal S u$ is the sample variance of the projections $u^\intercal X_1, \ldots, u^\intercal X_n \in \R$. This gives us the variance of the data along the direction of $u$.
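This identity between $u^\intercal S u$ and the sample variance of the projections can be checked directly (a sketch with arbitrary data and direction):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(size=(n, d))
u = np.array([1.0, 2.0, -1.0])

S = np.cov(X, rowvar=False, bias=True)  # (1/n)-normalized empirical covariance
proj = X @ u                            # the n scalars u^T X_i

assert np.isclose(u @ S @ u, proj.var())  # u^T S u = sample variance along u
```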
---
## Principal Component Analysis
PCA is an unsupervised linear transformation technique that allows us to reduce the dimensionality of a dataset while retaining as much information as possible. The core idea behind PCA is to use variance as a measure of spread in the data.
PCA identifies the directions of maximum variance in the data and projects it onto a new orthogonal basis with same or lesser dimensions, which is identified by the principal components.
Let's write down the maximization problem more formally:
$$
\begin{align}
\max_{u \in \R^d} \quad & u^\intercal S u \\
\text{s.t.} \quad & u^\intercal u = 1
\end{align}
$$
The constraint $u^\intercal u = 1$ is added to ensure that the solution is not affected by the magnitude of $u$.
### Spectral Theorem
If $S$ is a symmetric matrix with real entries, then there exists an orthogonal matrix $P$ and a diagonal matrix $\Lambda$ such that:
$$
S = P \Lambda P^\intercal
$$
where the columns of $P$ are the eigenvectors of $S$, satisfying ${v_i}^\intercal v_j = 0$ for $i \neq j$ and ${v_i}^\intercal v_i = 1$, and the diagonal entries of $\Lambda$ are the corresponding eigenvalues.
In particular, ${v_i}^\intercal S v_i = \lambda_i {v_i}^\intercal v_i = \lambda_i$, since $S v_i = \lambda_i v_i$ and the eigenvectors are normalized. Thus, the variance of $X$ along the eigenvector $v_i$ is exactly the associated eigenvalue.
### Solution
Equation $(12)$ is a constrained optimization problem. We can solve it using Lagrange multipliers. Let's define the Lagrangian:
$$
\begin{align*}
\mathcal{L}(u, \lambda) &= u^\intercal S u - \lambda (u^\intercal u - 1) \\
&= u^\intercal S u - \lambda u^\intercal u + \lambda
\end{align*}
$$
Now, we can take the derivative of $\mathcal{L}$ with respect to $u$ and set it to zero:
$$
\begin{align*}
\frac{\partial \mathcal{L}}{\partial u} &= 2Su - 2\lambda u \\
&= 0 \\
\implies Su &= \lambda u
\end{align*}
$$
This is an eigenvalue problem. The eigenvector $u$ that corresponds to the largest eigenvalue of $S$ is the solution to the optimization problem. This eigenvector is called the first principal component of $X$.
## PCA Algorithm
Let us assume we are given input data $X_1, \dots, X_n$, a cloud of $n$ points in dimension $d$. We want to reduce the dimension to $k$ such that $k \leq d$.
The algorithm is as follows:
1. Compute the empirical covariance matrix $S$.
2. Compute the spectral decomposition of $S$:
$$
S = PDP^\intercal \quad \text{with} \quad D = \text{diag}(\lambda_1, \dots, \lambda_d) ~\text{and}~ \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d
$$
3. Choose $k \leq d$ and set $P_k = [v_1, \dots , v_k] \in \R^{d \times k}$.
4. We have $Z_1, \dots , Z_n$ where:
$$
Z_i = P_k^\intercal X_i \in \R^k
$$
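The four steps above can be sketched in NumPy as follows (a minimal illustration; the function name `pca` and the toy data are mine, not part of the notes):

```python
import numpy as np

def pca(X, k):
    """Reduce the n x d data matrix X to n x k, following the steps above."""
    Xbar = X.mean(axis=0)
    Xc = X - Xbar
    S = Xc.T @ Xc / len(X)                 # step 1: empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # step 2: spectral decomposition of S
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
    P_k = eigvecs[:, order[:k]]            # step 3: top-k eigenvectors as columns
    return X @ P_k                         # step 4: Z_i = P_k^T X_i for each row

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
Z = pca(X, 2)
assert Z.shape == (50, 2)
```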
# Support Vector Machines
Support Vector Machines are one of the most widely used machine learning techniques for classification and regression analysis. Before the rise of deep learning with neural networks around 2010, they were among the top-performing classifiers.
These are my notes that cover the mathematical foundations of Support Vector Machines, taken while attending *CSCI-UA 9473 - Foundations of Machine Learning* at NYU Paris. Instead of using statistics and probability theory, they make use of geometry and the idea of drawing lines to separate points.
---
Suppose we are given a number of points $(\bold{x}_i, y_i)$, where $\bold{x}_i \in \R^d$ and $y_i \in \{+1, -1\}$:
$$
\begin{align}
\bold{x}_1 &= (x_{11}, x_{12}, \dots, x_{1d}) \\
\bold{x}_2 &= (x_{21}, x_{22}, \dots, x_{2d}) \\
\vdots \\
\end{align}
$$

# Linear Regression
> These are my notes that cover the mathematical foundations of Linear Regression, taken while attending *CSCI-UA 9473 - Foundations of Machine Learning* at NYU Paris. They make use of probability theory, statistics, calculus, optimization, and linear algebra to formalize the concept of linear regression.
## Introduction
Linear regression is one of the most widely used statistical techniques in data analysis and machine learning. It provides a simple and intuitive linear model for modeling the relationship between a response variable and a set of explanatory variables.
## Simple Linear Regression
## Least Squares Criterion
Given $N$ observations $(x_i, y_i)$, we fit the line $y = w_1 x + w_0$ by minimizing the mean squared error:
$$
\begin{equation}
J_{\tiny N}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)^2
\end{equation}
$$
where $\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}$. We want to find $\mathbf{w}$ such that $J_{\tiny N}(\mathbf{w})$ is minimized.
Therefore, let's compute $\nabla J_{\tiny N}(\mathbf{w})$:
$$
\begin{equation}
\nabla J_{\tiny N}(\mathbf{w}) = \left[\begin{array}{c}
\dfrac{\partial J_{\tiny N}}{\partial w_0}(\left.\mathbf{w}\right)\\ \\
\dfrac{\partial J_{\tiny N}}{\partial w_1}(\left.\mathbf{w}\right)\\
\end{array}\right]
\end{equation}
$$
where:
$$
\begin{align}
\dfrac{\partial J_{\tiny N}}{\partial w_0}(\left.\mathbf{w}\right) &= -\frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)\\
\dfrac{\partial J_{\tiny N}}{\partial w_1}(\left.\mathbf{w}\right) &= -\frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)x_i
\end{align}
$$
Setting them equal to zero, we get the critical values $(w_0, w_1)$ that minimize $J_{\tiny N}(\mathbf{w})$:
$$
\begin{equation*}
\dfrac{\partial J_{\tiny N}}{\partial w_0}(\left.\mathbf{w}\right) = 0
\end{equation*}
$$
$$
\begin{align*}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right) = 0\\
&\implies w_0 = \frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i \right)\\
&\implies w_0 = \left(\frac{1}{N} \sum_{i=1}^N y_i \right) - w_1 \left(\frac{1}{N} \sum_{i=1}^N x_i\right)\\
\end{align*}
$$
$$
\begin{equation}
\therefore w_0 = \overline{y} - w_1 \overline{x}
\end{equation}
$$
Similarly:
$$
\begin{equation*}
\dfrac{\partial J_{\tiny N}}{\partial w_1}(\left.\mathbf{w}\right) = 0
\end{equation*}
$$
$$
\begin{align*}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)x_i = 0\\
\end{align*}
$$
Replacing the value of $w_0$, we get:
$$
\begin{align}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right)x_i = 0\\
&\implies \sum_{i=1}^N \left( y_i - \overline{y} \right)x_i - w_1 \sum_{i=1}^N \left( x_i - \overline{x} \right)x_i = 0\\
&\implies w_1 = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right)x_i}{\displaystyle\sum_{i=1}^N \left( x_i - \overline{x} \right)x_i}\\
\end{align}
$$
However, it is commonly written in terms of Pearson's sample correlation coefficient $r_{xy}$.
For that, a slight modification is required in equation $(6)$:
$$
\begin{align}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \textcolor{#01B636}{(}x_i \textcolor{#01B636}{- \overline{x}) + \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \overline{x}}= 0\\
\end{align}
$$
Even though we have added the part in green, we can show that the second term is zero as:
$$
\begin{align*}
\sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) = \textcolor{#ca0047}{\sum_{i=1}^N y_i} \textcolor{#8047cd}{- w_1 \sum_{i=1}^N x_i} \textcolor{#ca0047}{- N \overline{y}} \textcolor{#8047cd}{+ N w_1 \overline{x}} = 0
\end{align*}
$$
Therefore, from equation $(9)$, we get:
$$
\begin{equation*}
\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \left(x_i - \overline{x}\right) = 0
\end{equation*}
$$
$$
\begin{align}
\therefore w_1 = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right) \left(x_i - \overline{x}\right)}{\displaystyle\sum_{i=1}^N \left( x_i - \overline{x} \right)^2} = r_{xy} \cdot \frac{\sigma_y}{\sigma_x}
\end{align}
$$
Here, $\sigma_y$ and $\sigma_x$ are the standard deviations of $y$ and $x$, respectively.
> $r_{xy}$ is the Pearson's sample correlation coefficient between $x$ and $y$ defined as:
>$$
>\begin{equation}
>r_{xy} = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right) \left(x_i - \overline{x}\right)}{\displaystyle\sqrt{\sum_{i=1}^N \left( y_i - \overline{y} \right)^2} \displaystyle\sqrt{\sum_{i=1}^N \left( x_i - \overline{x} \right)^2}}
>\end{equation}
>$$
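The closed-form expressions for $w_0$ and $w_1$ can be verified numerically against a library least-squares fit (a sketch; the sample data are made up):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# w_1 from the derived formula, w_0 = ybar - w_1 * xbar
w1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit as a cross-check
assert np.isclose(w1, slope)
assert np.isclose(w0, intercept)
```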
## Multiple Linear Regression
In multiple linear regression, there are multiple independent variables in the equation rather than just one. The mathematics behind multiple linear regression involves using matrix operations to solve for the coefficients of the regression equation.
The regression equation is now represented as:
$$
\begin{equation}
y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_dx_d
\end{equation}
$$
where $x_1, x_2, \cdots, x_d$ are the independent variables and $w_0, w_1, \cdots, w_d$ are the coefficients of the equation.
We define $\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix}$ and augment each data point as $\mathbf{x} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix}$, so that the equation can be written as:
$$
\begin{equation}
y = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}^T \mathbf{x}
\end{equation}
$$
## Least Squares Criterion
$$
\begin{equation}
J_{\tiny N}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2
\end{equation}
$$
where:
$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{\tiny N} \end{pmatrix}$ is the vector of target values.
$\Phi \mathbf{w} = \begin{pmatrix} \langle \mathbf{x_1}, \mathbf{w} \rangle \\ \langle \mathbf{x_2}, \mathbf{w} \rangle \\ \vdots \\ \langle \mathbf{x_{\tiny N}}, \mathbf{w} \rangle \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T \mathbf{w} \\ \mathbf{x_2}^T \mathbf{w} \\ \vdots \\ \mathbf{x_{\tiny N}}^T \mathbf{w} \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T \\ \mathbf{x_2}^T \\ \vdots \\ \mathbf{x_{\tiny N}}^T \end{pmatrix}\mathbf{w} $ is the vector of predictions.
$\Phi = \begin{pmatrix} \mathbf{x_1}^T \\ \mathbf{x_2}^T \\ \vdots \\ \mathbf{x_{\tiny N}}^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{\tiny{N}1} & x_{\tiny{N}2} & \cdots & x_{\tiny{N}d} \end{pmatrix}$ is called the design matrix. It is of size $N \times (d+1)$.
$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix}$ is the vector of weights.
We want to find the minimizer for $J_{\tiny N}(\mathbf{w})$, so we solve:
$$
\begin{equation*}
\nabla J_{\tiny N}(\mathbf{w}) = 0
\end{equation*}
$$
$$
\begin{align*}
\implies &\nabla_{\mathbf{w}} \left( \| \mathbf{y} - \Phi \mathbf{w}\|^2 \right) = 0\\
\implies &-2\Phi^T \left( \mathbf{y} - \Phi \mathbf{w} \right) = 0\\
\implies &\Phi^T \mathbf{y} = \Phi^T \Phi \mathbf{w}
\end{align*}
$$
$$
\begin{equation}
\therefore \mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{y}
\end{equation}
$$
> Note that we are implicitly assuming that $\Phi^T \Phi$ is an invertible matrix. If either the number of linearly independent examples is less than the number of features, or if the features are not linearly independent, then $\Phi^T \Phi$ is not invertible.
This least squared solution is an estimator for $\mathbf{w}$. It represents the vector normal to the hyperplane that minimizes the sum of squared errors between the target values and the predictions.
However, this is not enough; we also need to quantify how confident we are in our predictions.
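The normal-equation solution can be sketched in NumPy as follows (the synthetic data and true weights are mine; in practice one solves the linear system rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 2
X = rng.normal(size=(N, d))
Phi = np.column_stack([np.ones(N), X])  # design matrix with a leading column of ones
y = Phi @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=N)

# Solve Phi^T Phi w = Phi^T y instead of inverting Phi^T Phi
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # library solution as a cross-check

assert np.allclose(w_hat, w_lstsq)
```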
---
## Homoscedastic Model
Assume that target values $y_i$ are represented by random variables $Y_1, Y_2, \cdots, Y_{\tiny N}$ that are independent and identically distributed (i.i.d.). Then,
$$
\begin{equation}
Y_i = \langle \mathbf{w^{*}}, \mathbf{x_i} \rangle + \epsilon_i
\end{equation}
$$
where $\mathbf{w^{*}}$ is the vector of true weights that generates the data and $\mathbf{\epsilon}_i \sim \mathcal{N}(0, \sigma^2)$ is an independent gaussian noise term.
We can write $\mathbf{Y}$ as:
$$
\begin{align}
&\mathbf{Y} = \Phi \mathbf{w^{*}} + \Large{\mathbf{\epsilon}}\\
\implies & \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{\tiny N} \end{pmatrix} = \begin{pmatrix} \langle \mathbf{x_1}, \mathbf{w^{*}} \rangle \\ \langle \mathbf{x_2}, \mathbf{w^{*}} \rangle \\ \vdots \\ \langle \mathbf{x_{\tiny N}}, \mathbf{w^{*}} \rangle \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{\tiny N} \end{pmatrix}\\
\end{align}
$$
> Here, $\Large \mathbf{\epsilon}$ $\sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{\tiny N})$ where $I_{\tiny N}$ is the $N \times N$ identity matrix.
>
> We can show this as:
>
> $\mathbb{E} \left[\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{\tiny N} \end{pmatrix} \right] = \left[\begin{array}{c} \mathbb{E} \left( \epsilon_1 \right) \\ \mathbb{E} \left( \epsilon_2 \right) \\ \vdots \\ \mathbb{E} \left(\epsilon_{\tiny N}\right) \end{array}\right] = \mathbf{0}$
>
> $\mathrm{Cov}(\mathbf{\Large{\epsilon}}) = \begin{pmatrix} \mathrm{Cov}(\epsilon_i, \epsilon_j) \end{pmatrix}_{1 \leq~i,~j~\leq N} = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I_{\tiny N}$
>
> $\because \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$, and $\mathrm{Cov}(\epsilon_i, \epsilon_i) = \mathrm{Var}(\epsilon_i) = \sigma^2$
$\mathbf{\hat{w}}$ is our estimator for $\mathbf{w^{*}}$. We can write:
$$
\begin{equation}
\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{Y}
\end{equation}
$$
Substituting value of $\mathbf{Y}$ from equation $(17)$, we can write:
$$
\begin{align}
&\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \left( \Phi \mathbf{w^{*}} + \Large{\mathbf{\epsilon}} \right)\\
\implies &\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}}\\
\implies &\mathbf{\hat{w}} = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}}
\end{align}
$$
Then taking expectation of both sides of the equation $(22)$, we get:
$$
\begin{align*}
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbb{E} \left[ \mathbf{w^{*}} \right] + \mathbb{E} \left[ \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}} \right]\\
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \mathbb{E} \left[ \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}} \right]\\
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbb{E} \left[ \Large{\mathbf{\epsilon}} \right]\\
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{0}\\
\end{align*}
$$
$$
\begin{equation}
\therefore \mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}}
\end{equation}
$$
We can see that our least-squares estimator $\mathbf{\hat{w}}$ is an unbiased estimator of the true $\mathbf{w^{*}}$.
---
#### Lemma
Any linear transformation of a Gaussian vector $\mathbf{z} \sim \mathcal{N} (\mathbf{0}, \Sigma)$ is also a Gaussian vector. If $A$ is a matrix (representing a linear transformation), then:
$$
\begin{equation}
A\mathbf{z} \sim \mathcal{N} (\mathbf{0}, A \Sigma A^T)
\end{equation}
$$
Then, if $A = \left( \Phi^T \Phi \right)^{-1} \Phi^T$:
$$
\begin{align*}
\mathbf{\hat{w}} &\sim \mathcal{N} ( \mathbf{w^{*}}, A \sigma^2 \mathbb{I}_N A^T )\\
&= \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T ( \left( \Phi^T \Phi \right)^{-1} \Phi^T)^T)\\
&= \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \left( \Phi^T \Phi \right)^{-T})\\
&= \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \left( \Phi^T \Phi \right)^{-1}) ~[~\because\left(\Phi^T \Phi \right)^{T} = \left( \Phi^T \Phi \right)~]
\end{align*}
$$
$$
\begin{equation}
\therefore \mathbf{\hat{w}} \sim \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} )
\end{equation}
$$
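The sampling distribution of $\mathbf{\hat{w}}$ can be checked by simulation (a Monte Carlo sketch under the homoscedastic model above; the design matrix, noise level, and trial count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 200, 0.5
Phi = np.column_stack([np.ones(N), rng.normal(size=N)])
w_star = np.array([1.0, 2.0])

# Repeatedly draw Y = Phi w* + eps and re-estimate w_hat
estimates = []
for _ in range(5000):
    y = Phi @ w_star + sigma * rng.normal(size=N)
    estimates.append(np.linalg.solve(Phi.T @ Phi, Phi.T @ y))
estimates = np.array(estimates)

cov_theory = sigma**2 * np.linalg.inv(Phi.T @ Phi)
assert np.allclose(estimates.mean(axis=0), w_star, atol=0.01)              # unbiased
assert np.allclose(np.cov(estimates, rowvar=False), cov_theory, atol=0.01) # covariance
```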
---
## Confidence Interval
To compute confidence interval on the $i^{th}$ component of $\mathbf{\hat{w}} = \begin{pmatrix} \hat{w_0} \\ \hat{w_1} \\ \vdots \\ \hat{w_d} \end{pmatrix} \sim \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} )$ , we define the following:
$S = \left( \Phi^T \Phi \right)^{-1} = \begin{pmatrix} s_{00} & s_{01} & \cdots & s_{0d} \\ s_{10} & s_{11} & \cdots & s_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ s_{d0} & s_{d1} & \cdots & s_{dd} \end{pmatrix}$
# Mastering Nepali Board Game of Bagh Chal with self-learning AI
> _This blog was originally posted on the Programiz Blog on May 19, 2020, and is intended for the general audience. It can be accessed [here](https://www.programiz.com/blog/mastering-bagh-chal-with-self-learning-ai)._
People have long dreamed of creating machines that can think and decide for themselves. There are countless Sci-Fi books and movies that exploit (and sometimes over-hype) the term *Artificial Intelligence*. This peculiar trait of the human mind—to imagine something well before it comes into existence—has led to many inventions and discoveries. Just a decade ago, the research on Artificial Intelligence was limited to only a few candidates pursuing higher degrees in Universities or big companies. However, the increase in computational power and data availability over the past few years has made it possible for anyone with a decent computer to get started with Machine Learning and Artificial Intelligence. This blog discusses one such personal project that I started working on almost a year ago. It uses state-of-the-art methods in the deep reinforcement learning paradigm to master the traditional Nepali board game of Bagh Chal through self-play.
Instead of diving straight into the project, I want to introduce Artificial Intelligence and some of the relevant concepts first. If you would like to skip these upcoming sections, directly refer to the <a href="#baghchal_ai">Bagh Chal AI Project</a> section.
---
## Renaissance of Artificial Intelligence
When programmable computers were first created, they rapidly overtook humans in solving problems that could be described by a list of formal mathematical rules, such as mathematical computations. The main obstacle to computers and artificial intelligence proved to be the tasks that are easy for human beings but difficult to formalize as a set of mathematical rules. The tasks such as recognizing spoken words or differentiating objects in images require intuition and do not translate to simple mathematical rules.
We generally do not give our brain enough credit and are unaware of the extent to which our intuition plays a role in our everyday thinking process. To that, I want to start the blog with a perfect example that Andrej Karpathy gave in his <a target="_blank" href="https://karpathy.github.io/2012/10/22/state-of-computer-vision/">blog</a> back in 2012 that holds to this day.
<figure>
![Obama pranking a person](./obama_funny.jpg)
<figcaption>A funny picture</figcaption>
</figure>
The above picture is funny.
What does our brain go through within fractions of seconds to comprehend this image? What would it take for a computer to understand this image as you do?
- We recognize it is an image of a number of people in a hallway.
- We recognize that there are 3 mirrors, so some are "fake" replicas of people from different viewpoints.
- We recognize Obama from the few pixels that make up his face.
- We recognize from a few pixels and the posture of the man that he is standing on a scale.
- We recognize that Obama has his foot on top of the scale (3D viewpoint) from a 2D image.
- We know how physics works; pressing on the scale applies force to it and will hence over-estimate the person's weight.
- We deduce from the person's pose that he is unaware of this and further infer how the scene is about to unfold. He might be confused after seeing the reading exceeding his expectation.
- We perceive the state of mind of people in the back and their view of the person's state of mind. We understand why they find the person's imminent confusion funny.
- The fact that the perpetrator here was the president maybe makes it even funnier. We understand that people in his position aren't usually expected to undertake these actions.
This list could go on and on. The mind-boggling fact is that we make all these inferences just by a simple glance at this 2D array of RGB values. Meanwhile, even the strongest supercomputers would not even come close to achieving this feat using today's state-of-the-art techniques in Computer Vision.
For the sake of this blog, let's start with something more straightforward. Imagine that we are given a task to identify handwritten digits in a 28x28 image. How would we go about solving this problem? It might sound ridiculously easy at the start considering that even a small child introduced to numbers can get this correct almost every time. Even though this example is used as the typical "Hello, World!" program equivalent for people learning Artificial Intelligence, the solution to this problem is not as trivial as it first seems.
One obvious classical approach would be to use handcrafted rules and heuristics on the shape of strokes to distinguish the digits. However, due to the variability of handwriting, it leads to a proliferation of rules and exceptions giving poor results. Some sample variants of the handwritten digits are shown in the following image.
<figure>
![Sample Handwritten Digits From MNIST Dataset](./handwritten_digits.png)
<figcaption>Sample Handwritten Digits From MNIST Dataset</figcaption>
</figure>
So, how would we tackle this problem using Machine Learning?
Before we start talking about how to solve this problem, let's first understand how machine learning differs from the traditional programming or algorithmic methods. Machine learning refers to the concept that allows computers to learn from examples and experiences rather than being explicitly programmed.
Basically, we trade off the hard-coded rules in the program for massive amounts of data. Mathematical tools in linear algebra, calculus, and statistics are cleverly used to find patterns in the data and construct a model that is then used for prediction. The model is trained through an iterative process where its predictive accuracy is evaluated and improved. This is done by using an optimizer to minimize a loss function that tells us how badly the model is doing. As a result of this training process, the model becomes proficient in accurately predicting outcomes for data it has not encountered before.
For instance, instead of using handcrafted rules to identify handwritten digits, we can show the computer lots of examples of what each digit looks like. It can then use information from the example data and try to fit them to a model. Over time, it learns to generalize over the shape of each digit.
The subject of creating a model, evaluating its performance, and improving it is a topic for another blog. However, let's have a glance at the sub-fields of machine learning.
---
## The Sub-Fields of Machine Learning
### Supervised Learning
Supervised learning is a type of machine learning in which the learning is done from the data having input and output pairs. The goal is to create a model that learns to map from the input values to the output.
It is called supervised learning because we know beforehand what the correct answers are. The goal of the machine learning algorithm is to learn the relationship between each input value to the output value by training on a given dataset. In doing so, the model should neither prioritize specific training data nor generalize too much. To avoid this, a larger training dataset is preferred and the model is tested using input values that the model has never seen before (test dataset).
This is better understood with an example. Suppose you're given the following dataset:
<div class="table-responsive">
<table>
<tr>
<th>x</th>
<td>0</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<th>y</th>
<td>0</td>
<td>4</td>
<td>9</td>
<td>16</td>
</tr>
</table>
</div>
Let's put your admirable brain to use, shall we? Can you construct a mental model to find the function that maps <strong>x</strong> to <strong>y</strong>?
After a quick inspection, you might have stumbled upon the function <strong>y = x<sup>2</sup></strong>, which fits the dataset perfectly.
Let's see how our intuition could have been altered if we had access to only a subset of the dataset:
<div class="table-responsive">
<table>
<tr>
<th>x</th>
<td>0</td>
<td>2</td>
</tr>
<tr>
<th>y</th>
<td>0</td>
<td>4</td>
</tr>
</table>
</div>
Here, both <strong>y = 2x</strong> and <strong>y = x<sup>2</sup></strong> are equally plausible answers. But we know that the actual answer is <strong>y = x<sup>2</sup></strong>; <strong>y = 2x</strong> is an underfit model resulting from too few training samples. This model would predict <strong>y = 6</strong> for <strong>x = 3</strong>, while the actual answer is <strong>y = 9</strong>.
Let's look at some visualizations to clearly understand these concepts.
<figure>
![Polynomial Curve Fitting](./polynomial-fitting.png)
<figcaption>Polynomial Curve Fitting</figcaption>
</figure>
Here, the blue line shows the actual function and the purple line shows the prediction of the model. The blue circles represent the training set.
We can see that the last model correctly predicts all the training data (the purple line passes through every blue point). However, this model is said to be overfit (too specific to the training set) and it performs badly on the test set. Similarly, the first two models are said to be underfit (too much generalization).
The third model is the best among these even though it has lower training accuracy than the overfit model. The model can be improved further by using more training data, as shown below.
<figure>
![Less Training Data vs More Training Data](./more_training_data.png)
<figcaption>Less Training Data vs More Training Data</figcaption>
</figure>
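The underfitting and overfitting behavior described above can be reproduced in a few lines. Below is a minimal sketch using NumPy's `polyfit` with the table data from earlier; the variable names are my own, chosen for illustration:

```python
import numpy as np

# Training data from the table above: y = x^2
x = np.array([0.0, 2.0, 3.0, 4.0])
y = x ** 2

underfit = np.polyfit(x, y, 1)   # degree-1: a straight line (underfits)
good_fit = np.polyfit(x, y, 2)   # degree-2: matches the true function

# Predict for the unseen input x = 5 (true answer: 25)
print(np.polyval(underfit, 5.0))
print(np.polyval(good_fit, 5.0))
```

The straight line minimizes error on the four training points but misses badly on the unseen input, while the quadratic generalizes correctly.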
Supervised learning can further be divided into classification and regression.
Classification problems are related to making probabilistic estimates on classifying the input data into one of many categories. Identifying handwritten digits falls under this category.
Regression problems are related to predicting real-valued outputs in a continuous output space. The above problem of finding the best-fit polynomial to predict outputs for new input values falls under this category.
---
### Unsupervised Learning
Unsupervised learning is a type of machine learning in which the learning algorithm does not have any labels. Instead, the goal of unsupervised learning is to find the hidden structure in the input data itself and learn its features.
Some types of unsupervised learning include clustering, dimensionality reduction, and generative models.
Clustering is the method by which the input data is organized into clusters based on the similarity of some of their features and their dissimilarity from other clusters, despite having no labels.
<figure>
![Clustering in Unsupervised Learning](./clustering.png)
<figcaption>Clustering in Unsupervised Learning</figcaption>
</figure>
Dimensionality reduction is used to convert a set of data in higher dimensions to lower dimensions. It can remove redundant data and preserve only the most important features. This pre-processing technique can save a lot of computational expense and make the model run much faster.
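As a sketch of what dimensionality reduction looks like in practice, here is a minimal principal-component projection on synthetic data (plain NumPy; the data and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data that varies mostly along one direction
t = rng.normal(size=200)
data = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])

# Center the data, then take the top eigenvector of the covariance matrix
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
top_direction = eigvecs[:, -1]      # direction of maximum variance

# Project the 2-D points down to 1-D along that direction
reduced = centered @ top_direction
print(reduced.shape)
```

Almost all the variance lives along one axis here, so dropping the other dimension loses very little information.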
The new field of unsupervised deep learning has given rise to autoencoders. Autoencoders use deep neural networks to map input data back to themselves. The twist is that the model has a bottleneck as a hidden layer, so it learns to represent the input with a smaller amount of data (a compressed form).
<figure>
![Neural Network Architecture of Autoencoders](./autoencoder.png)
<figcaption>Neural Network Architecture of Autoencoders</figcaption>
</figure>
Generative modeling is a task that involves learning the regularity or patterns in the input data so that the model can generate output samples similar to the input dataset.
Since my project does not use unsupervised learning, we won't go into its details in this blog. Check out the following blog from OpenAI to learn more about unsupervised learning:
<ul><li><a target="_blank" href="https://openai.com/blog/generative-models/">Generative Models</a></li></ul>
---
### Reinforcement Learning
Reinforcement Learning (RL) is the type of Machine Learning where an agent learns how to map situations to actions so as to maximize a numerical reward signal from the environment.
The agent interacts with the environment by performing certain actions and receiving feedback in the form of rewards or penalties. The objective of reinforcement learning is to develop a policy that enables the agent to take actions that result in the maximum possible long-term cumulative reward.
The typical examples where RL is used are:
<ul><li>Defeat the world champion at Go</li><li>Make a humanoid robot walk</li><li>Play different Atari games better than humans</li><li>Fly stunt maneuvers in a helicopter</li></ul>
It is also the main component of my AI project, which we will discuss later.
#### So what makes reinforcement learning different?
<ul><li>There is no supervisor, only a reward signal</li><li>Feedback may be delayed and not instantaneous</li><li>Agent's actions affect the subsequent data it receives</li></ul>
At any time step, the agent in state <strong>S<sub>1</sub></strong> takes an action <strong>A<sub>1</sub></strong>. Based on this action, the environment provides the agent with reward <strong>R<sub>1</sub></strong> and a new state <strong>S<sub>2</sub></strong>.
<figure>
![The typical framing of a Reinforcement Learning (RL) scenario](./reinforcement-learning.png)
<figcaption>The typical framing of a Reinforcement Learning (RL) scenario</figcaption>
</figure>
A reward is a scalar feedback signal which indicates how well an agent is doing. The agent's goal is to maximize the reward signal. For instance, in the example of flying stunt maneuvers in a helicopter,
<ol><li>A positive reward may be given for following the desired trajectory.</li><li>A negative reward may be given for crashing.</li></ol>
#### Major Components of a Reinforcement Learning Agent
<ul><li><strong>Policy</strong>: A function that defines the behavior of the agent.</li><li><strong>Value function</strong>: The agent's understanding of how good each state and/or action is.</li><li><strong>Model</strong>: The agent's representation of the environment.</li></ul>
#### Exploration vs Exploitation
What makes the reinforcement learning problem so much harder is that the agent is initially clueless about how good or bad its actions are. Sometimes, the environment might only be partially observable. The agent has to proceed by trial and error until it starts discovering patterns and strategies.
Moreover, the agent cannot act greedily on the reward signal. The agent has to learn to <strong>maximize the reward signal in the long term</strong>. So, sometimes the agent must be willing to give up some reward so as to gain more rewards in the long run. One such example would be to sacrifice a piece in chess to gain a positional or tactical advantage.
The <strong>exploration vs exploitation</strong> trade-off is the central problem in RL, where the agent, with incomplete knowledge about the environment, has to decide whether to use strategies that have worked well so far (exploitation) or to make uncertain novel decisions (exploration) in the hope of gaining more reward.
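A toy way to see this trade-off in action is the epsilon-greedy strategy on a two-armed bandit. This is a hypothetical sketch, unrelated to the Bagh Chal agent itself:

```python
import random

random.seed(42)

# Two slot machines ("arms") with win rates unknown to the agent
true_means = [0.3, 0.7]

counts = [0, 0]      # pulls per arm
values = [0.0, 0.0]  # running estimate of each arm's mean reward
epsilon = 0.1        # fraction of the time we explore

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)        # explore: try a random arm
    else:
        arm = values.index(max(values))  # exploit: best arm so far
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts, values)  # the better arm ends up pulled far more often
```

With pure exploitation the agent could lock onto the worse arm forever; the small exploration budget is what lets it discover the better one.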
Some classical solutions to the reinforcement learning problem are Dynamic Programming, Monte Carlo Methods, and Temporal-difference learning. You can visit <a target="_blank" href="https://youtu.be/2pWv7GOvuf0">RL Course by David Silver</a> to learn more about reinforcement learning problems and solutions. David Silver was the lead researcher on <a target="_blank" href="https://deepmind.com/research/case-studies/alphago-the-story-so-far">AlphaGo</a> and <a target="_blank" href="https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go">AlphaZero</a>.
AlphaGo was the first computer program to beat a professional human Go player and the first to defeat a Go world champion. It was first trained on professional human games and then learned to improve by itself. To learn more, read the <a target="_blank" href="https://vk.com/doc-44016343_437229031?dl=56ce06e325d42fbc72">AlphaGo Research Paper</a>.
AlphaZero was an improved and more general version of AlphaGo that learned to play Go, Chess and Shogi without any human knowledge. It surpassed its predecessor and defeated AlphaGo <strong>100-0</strong> in 100 games of Go. To learn more, read the <a target="_blank" href="https://arxiv.org/pdf/1712.01815.pdf">AlphaZero Research Paper</a>. My self-learning AI project is also inspired very closely by AlphaZero.
---
### Deep Learning
Numerous artificial intelligence projects have tried to hard-code knowledge about the world in formal languages. This approach is known as the <strong>knowledge base approach</strong> to AI. However, none of these projects have led to major breakthroughs.
Then, machine learning was introduced so that the AI systems could acquire their own knowledge from the data. The performance of these simple machine learning algorithms depends heavily on the representation of the data and the use of important features.
Imagine that we have developed a <strong>logistic regression model</strong> (regression for binary data such as True or False) to detect Diabetic Retinopathy (diabetes complication that affects eyes). To use this model, a doctor has to <strong>manually</strong> observe the retina image and put relevant pieces of information into the model, such as the number and type of retinal lesions (damaged regions) and where they appear in the image.
<figure>
![Retina Image of a Person with Diabetic Retinopathy](./diabetic-retinopathy.jpg)
<figcaption>Retina Image of a Person with Diabetic Retinopathy</figcaption>
</figure>
If the model was directly given the retina image as shown above, rather than the formalized report from the doctor, it would not be able to make predictions. It is because the individual pixels of the retina image have a negligible correlation with the presence or absence of Diabetic Retinopathy.
Let's look at one more example where the representation of the data plays an important role in the performance of the ML model.
<figure>
![Representation of data in Cartesian vs Polar coordinates](./cartesian-vs-polar.png)
<figcaption>Representation of data in Cartesian vs Polar coordinates</figcaption>
</figure>
Here, it is impossible to separate the two sets of data in Cartesian coordinates with a linear model. However, simply changing the representation of the data to polar coordinates makes this task an easy one.
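The idea can be verified with a tiny sketch: two concentric circles that no straight line can separate in Cartesian coordinates become separable by a single threshold on the polar radius (synthetic data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes: an inner circle (radius 1) and an outer circle (radius 3)
angles = rng.uniform(0.0, 2.0 * np.pi, 100)
inner = np.column_stack([np.cos(angles), np.sin(angles)])           # r = 1
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])   # r = 3

# In Cartesian coordinates no straight line separates them, but the
# polar radius r = sqrt(x^2 + y^2) separates them with one threshold
def radius(points):
    return np.sqrt((points ** 2).sum(axis=1))

threshold = 2.0
print(np.all(radius(inner) < threshold))  # True
print(np.all(radius(outer) > threshold))  # True
```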
For many tasks, it is actually very difficult to know what features should be extracted. Suppose we want to write a program that detects cars in images. Since cars have wheels, we might like to use their presence as a feature. However, it is embarrassingly difficult to describe wheels in terms of pixel values. Even though wheels have simple geometric shapes, the real-life images of wheels are complicated by shadows, glaring sunlight, masking of the wheels by other objects, and so on.
One solution to the problem of finding the right feature is <strong>representation learning</strong>. In this approach, the human intervention is further reduced by replacing the hand-designed features with learned representations by the model itself.
In other words, the model not only learns the mapping from features to the output but also <strong>learns to choose the right features</strong> from the raw data.
Whenever we go from one technique to another, we substitute one problem for another one. Now, the major challenge in representation learning is to find a way for the model to learn the features by itself. However, it is very difficult to extract high level, abstract features from raw data directly. This is where deep learning comes to the rescue.
Deep Learning is a type of representation learning where the <strong>representations are expressed in terms of other simpler representations</strong>. This allows the computer to build complex features from simpler features.
<figure>
![Difference between Rule-based, Classical Machine Learning, and Representation Learning Systems](./representation-learning.png)
<figcaption>Difference between Rule-based, Classical ML, and Representation Learning Systems</figcaption>
</figure>
The quintessential example of a deep learning model is the multilayer perceptron that maps the input to the output. Let's look at an illustration of how a deep learning model learns to recognize complex patterns by building upon simpler concepts in the data.
<figure>
![Illustration of how deep neural networks build complex representations from simpler ones](./deep-learning-neural-net.png)
<figcaption>Illustration of how deep neural networks build complex representations from simpler ones</figcaption>
</figure>
Even though each individual pixel value of the image has no correlation with the object in the image, the deep learning model builds a hierarchical structure to learn representations. It first learns to detect edges that make up corners and contours, which in turn give rise to object parts. These object parts finally allow the model to detect the object in the image.
<p id="baghchal_ai">The examples above are inspired by the ones in the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Visit <a target="_blank" href="https://www.deeplearningbook.org/">Deep Learning Book Website</a> to read a free online version of the book. It is an excellent material to get started with Deep Learning.</p>
---
## Project Alpha BaghChal?
### The motivation behind the project
During the summer of 2017, I had a few months off before beginning my junior year in high school. While I had some experience with programming, I was largely unfamiliar with artificial intelligence (AI) at the time. My initial goal was to learn about how chess engines such as Stockfish operated, but my interest in AI grew after learning about _AlphaZero_. This prompted me to delve into the field and spend nearly two years starting from the basics and working on various smaller projects.
Inspired by _AlphaZero_, I thought of making a chess engine but the complexity of the game and the amount of training it would require set me back even before I got started. During my senior year, I tried making a similar engine for a school project but for a much simpler game of Bagh Chal. Even then, I had underestimated the difficulty of the project and the time scramble led to the project failure without yielding any results. I stopped working on it for a while until I finished high school and gave it another shot.
Before discussing how reinforcement learning and deep learning was used in the agent's architecture, let's first acquaint ourselves with the game of Bagh Chal. In the upcoming sections, I will also discuss how I built a game library and introduced game notations to record the moves for this traditional and somewhat obsolete game.
---
## Bagh Chal
Bagh Chal is one of the many variants of the tiger hunting board games played locally in South East Asia and the Indian subcontinent. This ancient Nepali game shares many resemblances to other traditional games like Komikan, Rimau, and Adugo in terms of board structure and player objectives.
The strategic, two-player board game is played on a 5x5 grid. The pieces are positioned at the intersection of the lines where adjacent lines from the intersection define the legal moves for a piece.
<figure>
![The initial board configuration for Bagh Chal where the tigers are placed at the four vertices of a 5x5 grid and 20 goats are outside the board.](./bagh_chal-1.png)
<figcaption>The initial board configuration for Bagh Chal where the tigers are placed at the four vertices of a 5x5 grid and 20 goats are outside the board.</figcaption>
</figure>
---
### Game Rules
The game completes in two phases:
<ol><li><strong>Goat Placement phase</strong><br/>During the placement phase, twenty goats are placed one after the other in one of the empty slots on the board while tigers move around. Goats are not allowed to move until all goats have been placed.</li><li><strong>Goat Movement phase</strong><br/>The movement phase continues with both players moving their corresponding pieces.</li></ol>
Pieces can move from their corresponding slot to other empty slots along the grid lines.
<figure>
![The Moveable Goats can jump along the grid-lines to an empty slot.](./bagh_chal_goat_move.png)
<figcaption>The Moveable Goats can jump along the grid-lines to an empty slot.</figcaption>
</figure>
Throughout the game, tigers also have a special <strong>Capture</strong> move, where they can jump over a goat along the grid lines to an empty slot, thereby removing the goat from the board.
<figure>
![The Tigers can jump over a Goat to capture it.](./baghchal_capture_move.png)
<figcaption>The Tigers can jump over a Goat to capture it.</figcaption>
</figure>
This asymmetric game proceeds with tigers trying to capture goats and goats trying to trap tigers (without any legal moves). The game is over when either the tigers capture five goats or the goats have blocked the legal moves for all tigers. In some rare cases, tigers can also win by blocking all the legal moves for goats.
<figure>
![Moveable Tigers and Trapped Tigers in Bagh Chal](./moveable_tiger_trapped_tiger.png)
<figcaption>Moveable Tigers and Trapped Tigers in Bagh Chal</figcaption>
</figure>
---
### Ambiguous Rules
The game can fall into a cycle of repeating board positions during the gameplay. To deal with these perpetual move orders that goats can employ to defend themselves from being captured, some communities have introduced constraints that do not allow moves that revert the game position to one that has already occurred in the game.
However, this restriction can sometimes prevent moves that are forced for bagh players or cause goats to make absurd sacrifices. The board positions are bound to reoccur in lengthy games where there have been no capture moves for a long time. Thus, declaring a winner on that basis is highly ambiguous.
For this project, I have replaced this rule with the standard **"draw by threefold repetition"** rule from chess, where the recurrence of the same board position three times automatically leads to a draw. The rule is rational because the recurrence of a board position implies that no progress is being made in the game.
---
### Creating Bagh Chal library (Game Environment)
Before working on the AI project, I had to prepare a Python Bagh Chal library to use the logic of the game and keep records of the board game states.
<strong>baghchal</strong> is a pure Python Bagh Chal library that supports game import, move generation, move validation, and board image rendering. It also comes with a simple engine based on the minimax algorithm and alpha-beta pruning.
Visit the <a target="_blank" href="https://github.com/basnetsoyuj/baghchal">GitHub baghchal Repository</a> to learn more about this library.
---
### Introducing Game Notation
Since Bagh Chal is a traditional Nepali board game, there was no recorded game dataset nor was there any way to keep track of the game moves.
So, I used two notations to record Bagh Chal games: <strong>PGN</strong> and <strong>FEN</strong>.
Portable Game Notation (PGN) is inspired by the notation of the same name used in chess. This notation consists of the full move history of the game, along with other information. The algebraic notation used makes PGN easy for humans to read and write, and for computer programs to parse.
The history of games is tracked by movetexts, which define the actual moves in the game. Each goat move and tiger move constitutes a movetext pair, where the goat piece and the tiger (Bagh) piece are represented by "<strong>G</strong>" and "<strong>B</strong>" respectively. Moves are defined in the following ways:
<ol><li>Placement move: <strong>G<new[row][column]></strong><br/>For example: <strong>G22</strong></li><li>Normal move: <strong><Piece><old[row][column]><new[row][column]></strong> <br/>For example: <strong>B1122</strong></li><li>Capture move: <strong>Bx<old[row][column]><new[row][column]></strong><br/>For example: <strong>Bx1133</strong></li></ol>
>***Note:** Both the row and column positions use numbers rather than a letter and a number as in chess, because Bagh Chal has reflection and rotational symmetry, so the numbers can be counted from any direction.*
At the end of the game, <strong>#</strong> is added along with:
<ul><li><strong>1-0</strong> for Goat as the winner.</li><li><strong>0-1</strong> for Tiger as the winner.</li><li><strong>1/2-1/2</strong> for a draw.</li></ul>
The following PGN represents one entire Bagh Chal game:
```
1. G53 B5545 2. G54 B4555 3. G31 B5545 4. G55 B1524 5. G15 B2414
6. G21 B1413 7. G12 B1322 8. G13 B2223 9. G14 B4544 10. G45 B4435
11. G44 B5152 12. G43 B5251 13. G52 B3534 14. G35 B1122
15. G11 B3433 16. G25 B2324 17. G23 B3334 18. G41 B5142
19. G51 B2232 20. G33 Bx2422 21. G1524 B2223 22. G1122# 1-0
```
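The movetext grammar above is regular, so a short regular expression can parse it. The following helper is hypothetical, written for illustration (it is not the baghchal library's API):

```python
import re

# <piece> <optional 'x' for capture> <optional from row,col> <to row,col>
MOVE_RE = re.compile(
    r"^(?P<piece>[GB])(?P<capture>x?)"
    r"(?:(?P<f>[1-5]{2}))?(?P<t>[1-5]{2})$"
)

def parse_move(text):
    m = MOVE_RE.match(text)
    if m is None:
        raise ValueError(f"invalid move: {text}")
    src = tuple(int(d) for d in m["f"]) if m["f"] else None
    dst = tuple(int(d) for d in m["t"])
    return {"piece": m["piece"], "capture": m["capture"] == "x",
            "from": src, "to": dst}

print(parse_move("G22"))     # placement: goat to row 2, column 2
print(parse_move("B1122"))   # tiger moves from (1,1) to (2,2)
print(parse_move("Bx1133"))  # tiger capture, jumping (1,1) -> (3,3)
```

The optional from-square group lets the same pattern cover placement moves (two digits) and movement or capture moves (four digits).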
FEN (Forsyth-Edwards Notation) for Bagh Chal tracks only the current board position, current player, and the number of moves made. Even though it does not contain the full history of moves, it encapsulates enough information to continue the game from that point onwards and is helpful for shorthand representation of the board state. It consists of 3 fields:
<ol><li><strong>Piece location</strong><br/>The piece location is given for each row, separated by "<strong>/</strong>". Like PGN, "<strong>B</strong>" and "<strong>G</strong>" represent Tigers and Goats. The integers <strong>[1-5]</strong> represent empty spaces between the pieces.</li><li><strong>Player to move</strong><br/>The player with the next move, given by "<strong>G</strong>" or "<strong>B</strong>".</li><li><strong>Number of moves made by Goats</strong><br/>This integer represents the number of half moves in the game. It is required to track the number of remaining and captured goats.</li></ol>
The following FEN represents the starting Bagh Chal board state:
<pre>
B3B/5/5/5/B3B G 0
</pre>
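Expanding such a FEN string back into a board is straightforward. Here is a hypothetical parser sketch (not the library's actual API):

```python
def parse_fen(fen):
    """Expand a Bagh Chal FEN (as described above) into a 5x5 board,
    the player to move, and the number of moves made by goats."""
    placement, player, moves = fen.split()
    board = []
    for row in placement.split("/"):
        cells = []
        for ch in row:
            if ch.isdigit():
                cells.extend(["."] * int(ch))  # '.' marks an empty slot
            else:
                cells.append(ch)               # 'B' or 'G'
        board.append(cells)
    return board, player, int(moves)

board, player, moves = parse_fen("B3B/5/5/5/B3B G 0")
print(board[0])        # ['B', '.', '.', '.', 'B']
print(player, moves)   # G 0
```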
---
### Prior Work on Bagh Chal
Bagh Chal is a relatively simple board game in terms of game tree complexity, compared to other board games like Go, Shogi, or Chess. The Bagh Chal programs found online use search algorithms based on variants of the Minimax algorithm such as <a target="_blank" href="https://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning">Alpha-beta pruning</a> to traverse the game tree.
Prior works have been done to evaluate the game under optimal play and even exhaustively analyze the endgame phase of the game using retrograde analysis. In their book called Games of No Chance 3, authors Lim Yew Jin and Jurg Nievergelt even prove that <a target="_blank" href="http://library.msri.org/books/Book56/files/22jin.pdf">Tigers and Goats is a draw</a> under optimal play.
My project, on the other hand, is inspired by AlphaZero, a general reinforcement learning agent by <em>Google DeepMind</em>. Instead of creating an agent that uses brute-force methods to play the game, the project takes a different route where the agent learns to improve its performance by continually playing against itself. It uses a single deep residual convolutional neural network which takes in a multilayered binary board state and outputs both the game policy and value, along with Monte Carlo Tree Search.
Let's look at what each of these terms means and how they fit into the design of AI architecture.
---
### Design of the AI agent
Before proceeding, why not first look at the end performance of the Bagh Chal AI agent that we are going to design next?
<div className="gatsby-resp-iframe-wrapper" style={{ paddingBottom: `56.25%`, position: `relative`, height: `0px`, overflow: `hidden`, marginBottom: `1rem` }}>
<figure>
<iframe style={{ position: `absolute`, top: `0px`, left: `0px`, width: `100%`, height: `100%` }} src="https://www.youtube.com/embed/7piuvfkX17o?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</figure>
</div>
The sections below assume that you know the following topics. If not, visit the corresponding links to read a brief overview of each.
<ol>
<li><strong>Convolutional Neural Network (CNN)</strong><br/>CNN is a type of neural network especially used in image recognition as it considerably reduces the parameters and makes the network more efficient. To learn more, visit <a target="_blank" href="https://cs231n.github.io/convolutional-networks/">Convolutional Neural Networks for Visual Recognition</a>.</li>
<li><strong>ResNets (Residual Blocks)</strong><br/>It is difficult to train deep neural networks due to exploding or vanishing gradients. Residual blocks optimize the training of these deep networks by introducing skip connections in the network. To learn more, visit <a target="_blank" href="https://arxiv.org/pdf/1512.03385v1.pdf">Deep Residual Learning for Image Recognition</a>.</li>
<li><strong>Monte-Carlo Tree Search (MCTS)</strong><br/>MCTS is a probabilistic, heuristic-driven game tree search algorithm that combines tree search with reinforcement learning. After each simulation, it learns to selectively explore moves rather than applying brute-force methods. To learn more, visit <a target="_blank" href="https://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes14.pdf">Stanford Monte-Carlo Tree Search Lecture Notes</a>.</li>
<li><strong>Batch Normalization</strong><br/>Batch normalization reduces the covariate shift (the amount by which the values shift) in the units of the hidden layer. To learn more, visit <a target="_blank" href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization Paper</a>.</li>
</ol>
In the past, deep convolutional neural networks served as board evaluation functions for searching game trees. AlphaZero—the quintessential deep reinforcement learning agent for board games—takes this approach one step further by using a Policy-Value network together with the Monte Carlo Tree Search algorithm.
Similar to AlphaZero, the BaghChal AI agent's neural network <strong>f<sub>Θ</sub>(s) = (p, v)</strong> takes in the current board state and gives out two outputs, a policy vector <strong>p<sub>Θ</sub>(s)</strong> and a scalar value estimate <strong>v<sub>Θ</sub>(s)</strong>.
<ol>
<li>The scalar value <strong>v</strong> estimates the expected outcome of the game from the given position (which player is most likely to win from that point on). It represents the agent's positional understanding of the game.</li>
<li>The policy vector outputs a vector of move probabilities <strong>p</strong> with components <strong>p<sub>a</sub></strong> <strong>= Pr(a|s)</strong> for each action <strong>a</strong>. They serve as 0-ply strategies, since they do not perform any look-ahead and thus correspond to intuitive play, previously thought to be exclusive only to human players.</li>
</ol>
<figure>
![The AI agent uses a Policy Value Network](./policy-value-net.png)
<figcaption>The AI agent uses a Policy Value Network</figcaption>
</figure>
A number of simulations are played out from the root (the current position) to a leaf node, employing the Monte Carlo Tree Search (<strong>MCTS</strong>) algorithm, which utilizes raw output from the neural network to selectively explore the most promising moves. Evaluations from the MCTS are then used to improve the parameters of the neural network. This iterative relation between the neural network and the tree search allows the agent to start from the ground up and continually get stronger from self-play after every iteration.
The network is initially set to random weights, so its predictions at the beginning are inherently poor. As the search is conducted and the parameters are improved, the network's poor initial judgment is overridden, making the agent stronger.
Note that the Monte Carlo Tree Search used in the Bagh Chal AI is not a pure MCTS where random rollouts are played once the search tree hits a leaf node. Instead, this random rollout is replaced by the value estimate from the neural network. Watch the <a target="_blank" href="https://youtu.be/NjeYgIbPMmg">AlphaZero MCTS Video Tutorial</a> to learn about the complete Monte Carlo Tree Search Algorithm used in AlphaZero.
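To make the selection step concrete, AlphaZero-style searches typically pick a child node using a PUCT-style score that mixes the value estimate with the policy network's prior. The statistics, moves, and constant below are made up for illustration, not taken from the actual agent:

```python
import math

def puct_score(q, prior, parent_visits, visits, c_puct=1.0):
    """Selection score: value estimate plus an exploration bonus
    weighted by the policy network's prior for the move."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

# Made-up child statistics at some node after a few simulations
children = {
    "G22": {"q": 0.10, "prior": 0.50, "visits": 10},
    "G33": {"q": 0.30, "prior": 0.20, "visits": 2},
    "G44": {"q": -0.05, "prior": 0.30, "visits": 1},
}
parent_visits = sum(c["visits"] for c in children.values())

best = max(children, key=lambda m: puct_score(
    children[m]["q"], children[m]["prior"], parent_visits, children[m]["visits"]))
print(best)  # a promising but lightly visited move wins the selection
```

Note how the `1 + visits` denominator decays the exploration bonus for heavily visited moves, pushing the search toward less explored alternatives.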
---
### Network Architecture
As previously mentioned, the AI agent has a single neural network that has two heads: a policy and a value network.
The input is a 5x5x5 stack, which differs from the AlphaZero architecture:
<ol>
<li>The 1<sup>st</sup> layer represents the layer for Goat pieces (5x5 grid filled with 1 for presence of goat & 0 for absence of goat)</li>
<li>The 2<sup>nd</sup> layer represents the layer for Tiger pieces (5x5 grid filled with 1 for presence of tiger & 0 for absence of tiger)</li>
<li>The 3<sup>rd</sup> layer is a 5x5 grid filled with the number of goats captured.</li>
<li>The 4<sup>th</sup> layer is a 5x5 grid filled with the number of tigers trapped.</li>
<li>The 5<sup>th</sup> layer represents whose turn it is to play (filled with 1 for Goat to play and 0 for Tiger to play)</li>
</ol>
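The five layers described above can be packed into an array with a small helper. This is an illustrative sketch (the function and its signature are my own, not the project's actual code):

```python
import numpy as np

def encode_state(goats, tigers, goats_captured, tigers_trapped, goat_to_play):
    """goats and tigers are sets of (row, col) positions, 1-indexed."""
    stack = np.zeros((5, 5, 5), dtype=np.float32)
    for r, c in goats:
        stack[0, r - 1, c - 1] = 1.0               # layer 1: goat pieces
    for r, c in tigers:
        stack[1, r - 1, c - 1] = 1.0               # layer 2: tiger pieces
    stack[2, :, :] = goats_captured                # layer 3: goats captured
    stack[3, :, :] = tigers_trapped                # layer 4: tigers trapped
    stack[4, :, :] = 1.0 if goat_to_play else 0.0  # layer 5: side to move
    return stack

# Starting position: tigers on the four corners, goat to play
start = encode_state(set(), {(1, 1), (1, 5), (5, 1), (5, 5)}, 0, 0, True)
print(start[1].sum())  # 4 tigers on the board
```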
This input then goes through a number of residual blocks consisting of Convolutional, Batch Normalization, and Activation layers. The neural network then branches into two heads: one for the policy network and the other for the value network. Both go through some fully-connected Dense layers.
The policy network outputs a 217-dimensional vector that represents all the moves that are possible in Bagh Chal at any point. It is a probability distribution over all moves and thus adds up to 1. It uses Categorical Cross-Entropy as its loss function.
The value network outputs a single scalar value between <strong>-1</strong> (the Tiger player is winning) and <strong>1</strong> (the Goat player is winning). It uses Mean Squared Error as its loss function.
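The two loss functions can be written out explicitly. Below is a plain-NumPy sketch with made-up targets, not the actual training code:

```python
import numpy as np

def categorical_cross_entropy(target, pred, eps=1e-9):
    # target and pred are 217-dimensional move distributions
    return -np.sum(target * np.log(pred + eps))

def mean_squared_error(target, pred):
    return (target - pred) ** 2

# Made-up example: an untrained (uniform) policy vs. a search result
# where the MCTS visited move index 5 almost exclusively
policy_pred = np.full(217, 1 / 217)
policy_target = np.zeros(217)
policy_target[5] = 1.0

ce = categorical_cross_entropy(policy_target, policy_pred)
mse = mean_squared_error(1.0, 0.2)  # Goat won (1), network guessed 0.2
print(ce, mse)
```

The cross-entropy here is about log(217), its maximum for a uniform prediction; training drives both quantities down.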
All in all, the policy network combined with the MCTS helps in reducing the breadth of the tree search.
<figure>
![The policy network reduces the breadth of tree search by exploring only few promising moves.](./policy_net.png)
<figcaption>The policy network reduces the breadth of tree search by exploring only few promising moves.</figcaption>
</figure>
Similarly, the value network helps in reducing the depth of the tree search.
<figure>
![The value network reduces the depth of tree search by returning the evaluation of a certain position.](./value_net.png)
<figcaption>The value network reduces the depth of tree search by returning the evaluation of a certain position.</figcaption>
</figure>
---
### Training
For training, a number of games were played using the neural network with MCTS, running about 50 simulations for each move.
At the end of each game, <strong>1</strong> was awarded for Goat as the winner, <strong>-1</strong> for Tiger as the winner, and <strong>0</strong> for a draw. The value network was then trained to predict this number for every position of that game.
In MCTS, the most promising move is the one that was visited the most. So, the policy network was improved by training it to predict moves in proportion to the number of times each move was made in the simulations.
Similarly, the symmetry of Bagh Chal was used explicitly to populate the training set for each game.
<figure>
![The Bagh Chal game has a rotational and a reflection symmetry.](./bagh_chal_symmetry.png)
<figcaption>The Bagh Chal game has a rotational and a reflection symmetry.</figcaption>
</figure>
Here, we can see that one game can produce eight times as many training samples.
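As a sketch of this augmentation step (assuming a plain 5×5 grid representation of the board; the real encoding also has to remap the move labels accordingly), the eight symmetric variants can be generated like this:

```python
def rotate(board):
    """Rotate a 5x5 board 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]

def reflect(board):
    """Mirror a 5x5 board horizontally."""
    return [row[::-1] for row in board]

def symmetries(board):
    """All 8 symmetric variants: 4 rotations, each with an optional reflection."""
    variants = []
    b = board
    for _ in range(4):
        variants.append(b)
        variants.append(reflect(b))
        b = rotate(b)
    return variants

# A dummy asymmetric board, so all 8 variants are distinct.
board = [[r * 5 + c for c in range(5)] for r in range(5)]
all_eight = symmetries(board)
```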
---
### Compromising the Tabula Rasa Learning
Although all of this seems to work in theory, I did not experience any improvements in the agent's play even with some training. My speculation is that it might have been caused by the asymmetry of the game. Bagh Chal is an asymmetrical game as each player has different objectives.
If we were to make random moves (similar to the performance of an untrained neural network), the probability of Tiger winning is higher than that of the game ending in a draw, which in turn is higher than that of Goat winning. When the tree search only rarely reaches a terminal state favoring a given player, there may not be enough actual game-environment rewards for the search to correct poor network predictions. Thus, the agent gets stuck in a local optimum.
Even though this problem is solved as the model gets trained more and more, I had to find other solutions due to hardware limitations.
I eventually decided to introduce a greedy heuristic into the training pipeline by initially training the agent on games generated by the minimax algorithm with alpha-beta pruning at depth 4.
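For reference, a generic depth-limited minimax with alpha-beta pruning looks roughly like the sketch below. The `moves`, `apply`, and `evaluate` callbacks are placeholders for Bagh Chal's actual move generator and heuristic evaluation, which are not shown here.

```python
def alphabeta(state, depth, alpha, beta, maximizing, moves, apply, evaluate):
    """Plain minimax with alpha-beta pruning, cut off at a fixed depth."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    if maximizing:
        best = float("-inf")
        for m in legal:
            best = max(best, alphabeta(apply(state, m), depth - 1,
                                       alpha, beta, False, moves, apply, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:  # prune: the minimizer will avoid this branch
                break
        return best
    best = float("inf")
    for m in legal:
        best = min(best, alphabeta(apply(state, m), depth - 1,
                                   alpha, beta, True, moves, apply, evaluate))
        beta = min(beta, best)
        if alpha >= beta:
            break
    return best
```

On a toy game where each move adds 1 or 2 to a counter and the evaluation is the counter itself, a depth-2 search for the maximizer returns 3 (the maximizer adds 2, then the minimizer adds 1).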
---
### Results
After training the agent for some time, first greedily on minimax-generated games and then by making it play against itself, it learned to pick up the strategies that humans use in their own games. One such strategy is to always occupy the board corners as the Goat player and to maintain distance between the Tiger pieces as the Tiger player.
This trained agent was then made to play games against itself, and most of those games ended in a draw. This result is consistent with the prior study that proves the game to be a draw under optimal play. About 10 percent of the time, however, the Goat player managed to snatch a victory.
Though this Bagh Chal AI agent punishes even slight inaccuracies by human players, it is by no means perfect. Occasionally, when playing as Goat against humans, it made absurd sacrifices over and over again. This reveals gaps in the agent's learning: it reached positions it had never explored during training and hence had no clue how to handle them.
The AI agent could have been further improved by:
<ol><li>More experimentation with the hyperparameters of the neural network (network structure)</li><li>Using two separate networks for Goats and Tigers</li><li>Tabula rasa learning</li><li>More training</li></ol>
If you want to learn more about the project and play around with its code, you can find one early prototype of this project in the <a target="_blank" href="https://github.com/basnetsoyuj/AlphaBaghChal">GitHub AlphaBaghChal Repository</a>.
---
## Final Words
Even though my agent is not at the level of _AlphaZero_ by any means, the ultimate joy of being crushed by your own creation is second to none.
One key takeaway from this blog is that the performance of an Artificial Intelligence project boils down to finding the right data to train on and choosing the correct hyperparameters for an optimal learning process.
As the potential of Artificial Intelligence continues to grow by the day, its scope has also broadened to encompass many other fields. Even if you are new to Artificial Intelligence or programming in general, you definitely won't regret keeping up with the field of AI.
Below are some online resources to play around with interesting AI models:
<ul>
<li><a target="_blank" href="http://gaugan.org/gaugan2">NVIDIA GAUGAN</a></li>
<li><a target="_blank" href="https://playground.tensorflow.org/">Tensorflow Neural Network Playground</a></li>
<li><a target="_blank" href="https://teachablemachine.withgoogle.com/">Teachable Machine</a></li>
<li><a target="_blank" href="https://poloclub.github.io/cnn-explainer/">CNN Explainer</a></li>
</ul>
> _**Update**: Some more resources to play around with AI models:_
- [Dalle 2](https://openai.com/dall-e-2/)
- [ChatGPT](https://openai.com/blog/chatgpt/) (of course)

---

# Developing the Content Workflow System for Programiz

*Originally published at https://www.soyuj.com/blog/content-workflow-system on Tue, 19 May 2020.*
>*As of July 2020, I have been working as a Python Developer and Senior Content Editor at <a href="https://www.programiz.com/" target="_blank">Programiz</a> for the past 10 months.
>This blog was originally posted on the Programiz Blog on May 19, 2020, and can be accessed [here](https://www.programiz.com/blog/developing-content-workflow-system/).*
At Programiz, one of our daily tasks is to create beginner-friendly programming tutorials that eventually reach out to millions of users all over the world. Behind the scenes, we are also constantly experimenting with different tools and techniques to furnish our products and enhance the user experience.
One such tool I recently developed for Programiz—called the *Content Workflow System*—allows the content writers to effectively write, edit, manage, review, and publish content—that includes programming tutorials, quizzes, and challenges for the Programiz website as well as the mobile app—**directly using Google Docs**.
What could possibly have gone wrong that made us abandon conventional writing tools to develop an entire Content Workflow system? As a matter of fact, this was not our first time trying to do so.
---
## Problems with Conventional Writing Tools
If you are developing a blogging site or any site with similar static web pages, you know that it is a lot of hassle to manually type in HTML for each article post. In today's context, most content writers probably end up using some sort of visual (<a target="_blank" href="https://en.wikipedia.org/wiki/WYSIWYG">WYSIWYG</a>) text editor or markdown language to speed up the writing process.
When Programiz first started as a small company, we also used one such rich text editor called *CKeditor* to write our programming tutorials. This editor would then translate our writings to HTML, and we would publish them directly on our Programiz website.
<figure>
![CKEditor](./CKeditor.png)
<figcaption>CKeditor - A Smart WYSIWYG HTML Editor</figcaption>
</figure>
Our articles usually consist of plain and stylized texts, images, lists, tables, and preformatted code blocks. All of these items are non-fancy HTML elements and are supported by the CKeditor.
Despite this, we soon became aware of CKeditor's many flaws. These *smart* WYSIWYG editors have a tendency to act overly intelligent, causing problems such as incorrectly formatted special symbols inside preformatted code blocks and HTML bloated with empty paragraph tags. Additionally, we had to manually write any item that required Programiz-specific HTML structure, which was quite inconvenient. These difficulties compelled us to search for better alternatives.
We wanted a way to automate the Programiz-specific HTML generation process without having the content writers explicitly follow this semantic every time they wrote an article.
Considering that we required a broader level of customization, the next stop at our venture was to start from the ground up and build our very own text editor.
We used <a target="_blank" href="https://www.slatejs.org/">Slate.js</a> (a customizable framework for building rich text editors) to create our custom text editor. We shaped this tool to fit our needs, and it worked perfectly for some time.
As we began to scale up our team and programming tutorials over the years, however, we realized that this might not have been the best solution moving forward. One of the main problems with this tool was that it did not provide us with a layer for moderation and article review.
So, most of the articles that our content writers previously wrote were only proofread by the writers themselves before publishing. This left room for mistakes and compromised the quality of our tutorials.
If you were to review some of our old programming articles, you would find numerous grammatical and linguistic errors even though they are technically correct. This went against our foundational belief in delivering the highest quality content and prioritizing quality over quantity.
As a result, we decided to uphold these principles and invested significant time in developing a robust Content Workflow System—time we could otherwise have used to write plenty of other tutorials. We were back to square one: we now faced the dilemma of choosing a medium not only to write but also to review and manage our tutorial articles. This was only possible with a medium that supported seamless, real-time collaboration and sharing.
The only option that came to mind at that time was Google Docs—a burgeoning web-based word processor by Google. I began experimenting with <a target="_blank" href="https://developers.google.com/docs/api/">Google Docs API</a> to convert simple Google Documents into HTML, and it produced decent results.
Besides, Google Docs also provided us with a reliable state-of-the-art interface for real-time collaboration and article review (commenting). Subsequently, we decided to develop this idea into a full-fledged content-writing tool.
Most of the tutorial articles that you see today on the Programiz website are derived from Google Docs!
>***Update**: Today, the Content Workflow System is not only used for Web articles, but also for the Mobile App & Programiz Pro content including Articles, Quizzes, and Challenges—all of which are internally written in Google Docs.*
Google Docs does not provide us with these customizable features right out of the box. So how was I able to exploit the features of Google Docs in our favor?
---
## Our Solution: Content Workflow System
### docsToHTML: Generating HTML Content from Google Documents
A Google Document comes with all the typical elements that are available in HTML, such as headers, paragraphs, lists, tables, and basic stylizing options (bold, italics, hyperlinks, subscript, superscript).
However, as previously mentioned, there was no built-in way to convert Google Documents into HTML. There were also no third-party libraries for this task, so I had to build one using Python: <strong>docsToHTML</strong> is a Python module that converts Google Documents into HTML using the Google Docs API.
In addition to basic HTML elements, our tutorial articles use other HTML tags for preformatted code blocks, other inline stylings, and note-tips.
<figure>
![Programiz-specific elements](./programiz-elements.png)
<figcaption>Programiz-specific elements not native to Google Docs</figcaption>
</figure>
These custom elements are not natively available in Google Docs. So I also had to find a way to incorporate our Programiz HTML semantics into the Google Docs interface. This involved finding methods to represent these elements as Google Docs components, which could then be detected and parsed by a Python script.
We decided to use a combination of different styling options—like fonts and foreground/background colors—to distinguish these elements. The following image shows one such option for the preformatted code that I mentioned earlier.
<figure>
![Differentiating preformatted code from plain text using stylizing options in Google Docs](./style-difference.png)
<figcaption>Differentiating preformatted code from plain text using stylizing options in Google Docs</figcaption>
</figure>
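A rough sketch of the detection side, assuming each text run's style has already been extracted from the API response. The exact font and colour combination we used is internal, so the values below are placeholders:

```python
# Placeholder style: "Courier New" on a light-grey background stands in for
# the actual (internal) combination Programiz uses to mark preformatted code.
CODE_STYLE = {"font": "Courier New", "background": "#f0f0f0"}

def is_preformatted(run_style):
    """Decide whether a Docs text run should be emitted as <pre> content."""
    return (run_style.get("font") == CODE_STYLE["font"]
            and run_style.get("background") == CODE_STYLE["background"])

code_run = is_preformatted({"font": "Courier New", "background": "#f0f0f0"})
prose_run = is_preformatted({"font": "Arial", "background": "#ffffff"})
```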
While this approach effectively resolved our challenges, it would have been overwhelming for our content writers to remember all the distinct styling options for each element.
After some research, I found out that <a target="_blank" href="https://developers.google.com/apps-script">Google Apps Script</a> (a Google scripting language for G suite) could be embedded with Google Documents. These App Scripts could then be used to modify the Google Document and even alter its User Interface.
We utilized Google Apps Script to modify the User Interface of Google Docs, which was particularly useful for generating custom buttons in the Menu Bar to execute the styling tasks mentioned above.
<figure>
![Customized Google Docs Menu Bar](./custom-menu-docs.png)
<figcaption>Customized Google Docs Menu Bar using Google Apps Script</figcaption>
</figure>
Now the content writers could simply highlight the required text and perform styling options like changing the font and color, or inserting predefined tables into the Google Document using these custom buttons.
The feature to insert predefined table templates in the Google Document allows the content writers to add metadata information about articles or images. These special tables are parsed differently by our Python module. The *Page Info* Table, for instance, is used at the beginning of a Google Document for Web Page and Article Information:
<figure>
![Predefined Page Info Table](./header-table-docs.png)
<figcaption>Predefined Page Info Table</figcaption>
</figure>
After we complete writing a properly formatted Google Document, we can use our Python module to send a request to the Google Docs API using the Document's unique ***DOC_ID***. Google Docs API sends back a *JSON* response corresponding to the contents of the Google Document.
You can visit <a target="_blank" href="https://developers.google.com/docs/api/samples/output-json">Google Docs API Response JSON</a> to learn more about Google Docs API and its JSON response.
Our Python module in the backend then converts this somewhat disordered JSON into a structured HTML.
<figure>
![Process to convert Google Docs To HTML](./dth-flowchart.png)
<figcaption>Google Docs To HTML process</figcaption>
</figure>
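As an illustration of that conversion, here is a heavily simplified sketch that walks the paragraph text runs of a Docs-API-style response and emits `<p>` tags. The real module additionally handles styles, tables, images, and our custom elements:

```python
def docs_json_to_html(document):
    """Walk a (simplified) Docs API response and emit paragraph HTML.

    Real responses carry much more: text styles, tables, inline objects.
    This handles only plain paragraphs, to show the overall shape.
    """
    html = []
    for element in document.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if not paragraph:
            continue
        text = "".join(pe.get("textRun", {}).get("content", "")
                       for pe in paragraph.get("elements", []))
        text = text.rstrip("\n")
        if text:
            html.append(f"<p>{text}</p>")
    return "\n".join(html)

# A minimal, hand-written stand-in for an API response.
sample = {"body": {"content": [
    {"paragraph": {"elements": [{"textRun": {"content": "Hello, Programiz!\n"}}]}},
]}}
```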
Let's look at how a preformatted code block is parsed by our Python module to generate the HTML content:
<figure>
![pre with docsToHTML](./pre-docs-html.png)
<figcaption>Preformatted text from Google Docs To HTML</figcaption>
</figure>
The Image Info Table mentioned earlier is parsed in the following way:
<figure>
![Image Info from Google Docs To HTML](./img-doc-html.png)
<figcaption>Image Info from Google Docs To HTML</figcaption>
</figure>
After completing these basic features, I further polished the docsToHTML module to perform a sanity check on the HTML produced from the Google Document. To date, the module does the following:
1. Detects the programming language described in the article. (Some HTML semantics are language-specific)
2. Checks if any image names clash with other pre-existing images.
3. Replaces Smart Quotes and different Unicode whitespaces with normal ones. (can cause errors if they occur in code blocks)
4. Checks if all Comments and Suggestions in the Google Document have been resolved. (warning if moderator reviews aren't addressed)
<figure>
![Various Warnings and messages from docsToHTML](./docsToHTML_terminal.png)
<figcaption>Various Warnings and messages from docsToHTML</figcaption>
</figure>
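The smart-quote and whitespace replacement in check 3 can be sketched as a simple character mapping; the set of characters handled by the real module is longer than shown here:

```python
# Map smart quotes and exotic whitespace back to their ASCII forms,
# since these characters break code if they end up inside code blocks.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",                # curly single quotes
    "\u201c": '"', "\u201d": '"',                # curly double quotes
    "\u00a0": " ", "\u2009": " ", "\u200b": "",  # odd Unicode whitespace
}

def normalize(text):
    """Replace problematic Unicode characters with plain ASCII equivalents."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text
```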
The Table of Contents bar that you see on the Programiz website is also automatically generated by `docsToHTML`:
<figure>
![Table of Contents generated by docsToHTML](./toc-docs-html.png)
<figcaption>Table of Contents generated by docsToHTML</figcaption>
</figure>
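A naive sketch of how such a table of contents could be collected from the generated HTML; the real module works on the parsed document structure rather than running regexes over raw HTML:

```python
import re

def table_of_contents(html):
    """Collect <h2> headings into (anchor, title) pairs for a TOC sidebar."""
    toc = []
    for match in re.finditer(r"<h2[^>]*>(.*?)</h2>", html, re.S):
        title = match.group(1).strip()
        # Slugify the title into an anchor id, e.g. "Python Lists" -> "python-lists".
        anchor = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        toc.append((anchor, title))
    return toc

page = "<h2>Python Lists</h2><p>...</p><h2>List Methods</h2>"
```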
### HTMLToDocs: Converting HTML Contents To Google Document
After developing docsToHTML, we started writing our new tutorials in Google Docs and using it to convert Google Docs into HTML. Despite this, there were still a lot of old articles of which we only retained the HTML copies.
Since we strive for perfection, we constantly update our old articles by fixing technical and grammatical errors wherever necessary. We also rewrite obsolete sections of old articles with up-to-date developments in that certain topic.
These existing articles had to be edited manually by changing their HTML, as we did not have their corresponding Google Docs (so docsToHTML could not be used). Rewriting the entire article in Google Docs was not a feasible solution either.
To solve this problem, I developed **HTMLToDocs**. `HTMLToDocs` is a similar Python module that now converts raw HTML Content into a Google Document. It performs the exact reverse operation of the `docsToHTML` module.
`HTMLToDocs` takes in a Programiz Article Web Page URL and converts the tutorial content into a Google Document (what we would have originally written had we used Google Docs). It is programmed to parse both our old and new HTML semantics.
<figure>
![Converting Note Tip from HTML To Google Docs](./note-html-docs.png)
<figcaption>Converting Note Tip from HTML To Google Docs</figcaption>
</figure>
This Python module also uses <a target="_blank" href="https://developers.google.com/drive/api/">Google Drive API</a> to cluster HTMLToDocs generated Google Docs into a proper hierarchical structure in the user's Google Drive.
Now that we had completed the full cycle of converting Google Docs to HTML and HTML back to Google Docs, we were ready to take this project one step further.
---
## Content Workflow System: Bringing Everything Together
Initially, every content writer had to set up a Python environment and install various dependencies on their local machine to run `docsToHTML` and `HTMLToDocs`. They also had to keep track of the Google Document ID to run these scripts.
The Content Workflow System was developed to create an interface that connects both `docsToHTML` and `HTMLToDocs` while running everything in the cloud. It would also keep records of every entry made by the user. We could then use this tool to write, edit, review, and publish content on the Programiz website even more easily.
<figure>
![Working of the Content Workflow System](./cws-flowchart.png)
<figcaption>Working of the Content Workflow System</figcaption>
</figure>
For this, I first converted the `docsToHTML` and `HTMLToDocs` modules into an API and hosted them in the cloud. Next, we built a User Interface to send requests and retrieve responses from this API.
<figure>
![Content Workflow System Detailed View Interface](./cws-editor-detailed.png)
<figcaption>Detailed View Interface of Content Workflow System</figcaption>
</figure>
Users can log in to the Content Workflow System (CWS) with their Programiz credentials. They have the option of using either `docsToHTML` or `HTMLToDocs`. In either case, the `docsToHTML` endpoint is what ultimately generates the HTML content.
>***Update**: The users now also have the option to write content, quizzes, and challenges for the Mobile App and Programiz Pro. For this, there is a `docsToJSON` module that can convert Google Doc to a specific JSON format.*
When the user submits an article, it goes to the review section. Reviewers can review the article and approve it or send it back to the user for further editing (suggestions and comments are handled in the Google Document itself) and the user is notified accordingly.
<figure>
![List View Interface of Content Workflow System](./content-workflow-system-list.png)
<figcaption>List View Interface of Content Workflow System</figcaption>
</figure>
The reviewed article, along with its images, can finally be uploaded to the website with our tool. The article then goes through a final revision and can be published by an admin or moderator.
The Content Workflow System has improved considerably since its first release. However, it is far from perfect, and occasional errors still go unhandled.
---
## Challenges
Using Google Docs came with its own shortcomings. One of the major problems I faced while developing `HTMLToDocs` was that inserting and editing styling elements in Google Documents with Python (or, in fact, with any of the supported programming languages) was far less intuitive than doing so with Google's native Apps Script. The Google Docs documentation was also quite ambiguous.
For instance, we can add a table in Google Docs using Apps Script with a structure similar to:
```javascript
var cells = [
['CELL11', 'CELL12'],
['CELL21', 'CELL22']
];
var table = element.insertTable(index, cells);
```
If instead, we were to use other programming languages, we would have to send the following JSON request to the Google Docs API.
```javascript
[
{ insertTable: { rows: 2, columns: 2, location: { index: 2 } } },
{ insertText: { location: { index: 13 }, text: "CELL22" } },
{ insertText: { location: { index: 11 }, text: "CELL21" } },
{ insertText: { location: { index: 8 }, text: "CELL12" } },
{ insertText: { location: { index: 6 }, text: "CELL11" } },
];
```
>***Note:** For efficiency, the Document is written backwards so that the text's length in each cell doesn't affect the indices of the subsequent elements.*
Here, the index of any cell is given by the formula:
```javascript
4 + TABLE_INDEX + (1 + NO_OF_COLUMNS * 2) * CURRENT_ROW + 2 * CURRENT_COLUMN
```
This formula is not mentioned anywhere in the Google Docs API documentation. Moreover, some basic elements like a **Horizontal Rule** could not even be added—at least at the time of writing this article.
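As a sanity check, the formula and the backwards insertion order can be exercised together. This sketch reproduces the exact request list shown above for a 2×2 table inserted at index 2:

```python
def table_requests(table_index, cells):
    """Build insertTable/insertText requests for a batchUpdate call.

    Cells are written backwards so that earlier insertions don't shift
    the indices of the cells still to be filled (see the note above).
    """
    rows, cols = len(cells), len(cells[0])
    requests = [{"insertTable": {"rows": rows, "columns": cols,
                                 "location": {"index": table_index}}}]
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            # The undocumented cell-index formula from the blog post.
            index = 4 + table_index + (1 + cols * 2) * r + 2 * c
            requests.append({"insertText": {"location": {"index": index},
                                            "text": cells[r][c]}})
    return requests

reqs = table_requests(2, [["CELL11", "CELL12"], ["CELL21", "CELL22"]])
```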
Nonetheless, we were able to find workarounds for most of the problems we faced. The Content Workflow System has met most of our requirements, and it has definitely made the content writing process smoother than ever.
---
## Final Words
Creating your own Content Writing Tool can seem like a daunting task at the beginning. You are sure to encounter loads of unintended errors along the way. It will take time to get used to the quirks of how different services like Google Docs handle information and how we can use them to our advantage.
However, I believe it is well worth the effort if you really want to add custom features to your Content Writing Project (or any other project for that matter) and save yourself some nasty inconvenience in the future. Moreover, you come out learning more about what actually happens under the hood of various services & frameworks and why things work the way they do.
>***P.S**. HTML Content for this blog post was also generated via docsToHTML.*