Standard Principal Component Analysis
Basic understanding of the algorithm
Let \(X\) be a column-centered data matrix of size \(n \times d\), where each row is a data point with \(d\) features. The \(d \times d\) covariance matrix of \(X\) is \(C = X^T X / (n-1)\). Its eigenvectors are the principal axes (PAs) of \(X\), and its eigenvalues are the PC variances of \(X\). The projections of \(X\) onto the principal axes are called the principal components (PCs).
Let \(X=U_{n\times r}D_{r\times r}V_{d\times r}^T\) be the SVD of \(X\), where \(r\le d\) is the rank of \(X\). Then \(C = VDU^TUDV^T/(n-1) = VD^2V^T/(n-1)\). Therefore the columns of \(V\) are eigenvectors of \(C\) (the PAs of \(X\)), and the diagonal entries of \(D^2/(n-1)\) are the non-zero eigenvalues of \(C\) (the PC variances of \(X\)). Consider the transformation
\[\begin{equation} Z = X V \sqrt{\frac{n-1}{D^2}}\frac{1}{\sqrt{n-1}}=X V \Lambda^{-1/2}\frac{1}{\sqrt{n-1}} \label{eq:standardproj} \end{equation}\]
where \(\Lambda\) is the diagonal matrix whose diagonal entries are the non-zero eigenvalues of \(C\).
Since
\[ \begin{align} Z^T Z &= \sqrt{\frac{1}{D^2}} V^T X^T X V \sqrt{\frac{1}{D^2}} \\ &= \sqrt{\frac{1}{D^2}} V^T V D U^T U D V^T V \sqrt{\frac{1}{D^2}} = I, \end{align} \] the columns of \(Z\) have unit norm: \(Z\) contains the principal components of \(X\) in the principal axes \(V\), normalized to unit length.
Remark 1. Since
\[ \begin{align} XX^T Z &= XX^T X V \Lambda^{-1/2}\frac{1}{\sqrt{n-1}} \\ &= X VD^2V^T V \Lambda^{-1/2}\frac{1}{\sqrt{n-1}} \\ & = (XV\Lambda^{-1/2}\frac{1}{\sqrt{n-1}})D^2 = ZD^2, \end{align} \] the columns of \(Z\) are eigenvectors of \(XX^T/(n-1)\), and the diagonal entries of \(D^2 / (n-1)\) are the corresponding eigenvalues.
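Before turning to the toy data, here is a minimal numerical sanity check of the relations above, using a small random column-centered matrix (the names X0, n0, d0, lambda0, Z0 are introduced only for this sketch):
set.seed(1)
n0 <- 10; d0 <- 2
X0 <- scale(matrix(rnorm(n0 * d0), n0, d0), center = TRUE, scale = FALSE)  # column-centered
sv <- svd(X0)                              # X0 = U D V^T
lambda0 <- sv$d^2 / (n0 - 1)               # PC variances = eigenvalues of C
all.equal(lambda0, eigen(cov(X0))$values)  # should be TRUE
Z0 <- X0 %*% sv$v %*% diag(1 / sv$d)       # the transformation Z = X V Lambda^{-1/2} / sqrt(n-1)
round(crossprod(Z0), 10)                   # identity matrix: the columns of Z0 have unit norm
# Remark 1: the columns of Z0 are eigenvectors of X0 X0^T / (n0 - 1)
all.equal(X0 %*% t(X0) %*% Z0 / (n0 - 1), Z0 %*% diag(lambda0))  # should be TRUE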
Numerical Implementation
- Get data
For illustration, we use a toy data matrix \(X\).
X <- matrix(data = c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1,
2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9), ncol = 2)
plot(X[,1], X[,2],xlab=expression("x"[1]), ylab=expression("x"[2]))
- Convert \(X\) to a column-centered matrix
Centering is not needed to compute the covariance matrix (cov() centers internally), but it is needed for the projection step.
# column mean-centering
Xc <- apply(X, 2, function(y) y - mean(y))
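Equivalently, base R's scale() performs the same column centering; a quick check against Xc:
Xc2 <- scale(X, center = TRUE, scale = FALSE)   # column mean-centering via scale()
all.equal(as.vector(Xc2), as.vector(Xc))        # should be TRUE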
- Calculate the covariance matrix
C <- cov(Xc)
- Get the eigenvalues and eigenvectors of the covariance matrix
C.eigen <- eigen(C)
- Choose principal axes
The eigenvalues of \(C\) are used to find the proportion of the total variance explained by the components.
totalVar <- sum(C.eigen$values)
# check that it equals the total variance sum(diag(C))
totalVar == sum(diag(C))
## [1] TRUE
for (s in C.eigen$values) {
print(s/totalVar)
}
## [1] 0.9631813
## [1] 0.03681869
The first eigenvalue accounts for \(96.3\%\) of the total variance, so we choose the first eigenvector as the principal axis.
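The loop above can be replaced by a vectorized one-liner, which also makes the cumulative proportion easy to read off:
# proportion and cumulative proportion of variance explained
C.eigen$values / totalVar
cumsum(C.eigen$values) / totalVar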
- Project on the principal axis
# Set up figure layout
par(oma=c(0, 0, 0, 0), mar = c(4, 4.1, 2, 1), las=1, xpd=TRUE)
layout.matrix <- matrix(c(1, 2), nrow = 1, ncol = 2)
layout(mat = layout.matrix, heights = c(3.25), widths = c(3.25, 3.25))
# Plot principal component
plot(Xc, xlab=expression("x"[1]), ylab=expression("x"[2]))
x <- seq(-1.0, 1.0, 0.1)
lines(x, x * C.eigen$vectors[2,1]/C.eigen$vectors[1,1], col='red')
# Principal components scaled to unit norm
r <- 2
V <- C.eigen$vectors[,1:r]
n <- dim(Xc)[1]
for (i in 1:r) {
V[,i] <- V[,i]/sqrt((n-1)*C.eigen$values[i])
}
PXc <- Xc %*% V
plot(PXc, xlab="PC1", ylab="PC2")
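As a quick check of the transformation (\ref{eq:standardproj}), each column of PXc should have unit norm:
# squared norms of the normalized principal components; both should be (numerically) 1
colSums(PXc^2)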
- Verify Remark 1
gram <- Xc %*% t(Xc)/(n-1)
gram.eign <- eigen(gram)
round(gram.eign$values[1:2], 6) == round(C.eigen$values, 6)
## [1] TRUE TRUE
round(gram.eign$vectors[, 1], 6) == round(PXc[, 1], 6)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# differs by a sign factor of -1
round(gram.eign$vectors[, 2], 6) == round(-PXc[, 2], 6)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Validation
Compare with R function prcomp
pilots.pca <- prcomp(X)
pilots.pca
## Standard deviations (1, .., p=2):
## [1] 1.1331495 0.2215477
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## [1,] -0.6778734 0.7351787
## [2,] -0.7351787 -0.6778734
Remark 2. The squares of the standard deviations are equal to the eigenvalues of \(C\).
Remark 3. The rotation matrix (the principal axes) is the same as the eigenvectors of \(C\) up to a sign factor of -1.
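Remarks 2 and 3 can be verified directly from the prcomp object via its sdev and rotation components:
# Remark 2: squared standard deviations equal the eigenvalues of C
round(pilots.pca$sdev^2, 6) == round(C.eigen$values, 6)
# Remark 3: rotation equals the eigenvectors of C up to a sign per column
round(abs(pilots.pca$rotation), 6) == round(abs(C.eigen$vectors), 6)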
Kernel Principal Component Analysis
Basic understanding of the algorithm
Non-linear structure (e.g. a curve in 2-D) may become linear in a higher-dimensional space, so that the dimension of the data can then be reduced. Kernel principal component analysis maps the input space to a new, higher-dimensional feature space and carries out PCA in that space.
Consider a feature map \(\phi: \mathbb{R}^2 \rightarrow \mathbb{R}^3\),
\[ \mathbf{x}= \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \phi(\mathbf{x})=\begin{bmatrix} \phi_1 (\mathbf{x})\\ \phi_2 (\mathbf{x})\\ \phi_3 (\mathbf{x}) \end{bmatrix} \] The data matrix in \(\mathbb{R}^3\) is
\[ \Phi = \begin{bmatrix} \phi_1(\mathbf{x}_{1}) & \phi_2(\mathbf{x}_{1}) & \phi_3(\mathbf{x}_{1})\\ \phi_1(\mathbf{x}_{2}) & \phi_2(\mathbf{x}_{2}) & \phi_3(\mathbf{x}_{2})\\ \phi_1(\mathbf{x}_{3}) & \phi_2(\mathbf{x}_{3}) & \phi_3(\mathbf{x}_{3}) \\ \phi_1(\mathbf{x}_{4}) & \phi_2(\mathbf{x}_{4}) & \phi_3(\mathbf{x}_{4}) \end{bmatrix} = \begin{bmatrix} \phi(\mathbf{x}_{1})^T \\ \phi(\mathbf{x}_{2})^T \\ \phi(\mathbf{x}_{3})^T \\ \phi(\mathbf{x}_{4})^T \end{bmatrix} \]
In the feature space \(\mathbb{R}^3\), after centering the data, we can write the covariance matrix as
\[ \mathcal{C} = \frac{1}{n-1}\mathbf{\Phi}^T\mathbf{\Phi}=\frac{1}{n-1}\sum_{i=1}^n \phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T \] Remark 4. If \(\phi\) is the identity map
\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \]
then \(\mathcal{C}\) reduces to the standard covariance matrix \(C\).
Consider the eigenvalue equation
\[\begin{equation} \mathcal{C} \mathbf{v} = \lambda \mathbf{v} \label{eq:feateigen} \end{equation}\]
i.e.
\[ \frac{1}{n-1} \sum_{i=1}^n \phi(\mathbf{x}_i)(\phi(\mathbf{x}_i)^T\mathbf{v}) = \lambda \mathbf{v} \]
So for \(\lambda > 0\), \(\mathbf{v}\) is a linear combination of \(\phi(\mathbf{x}_i)\), \(i=1,\cdots, n\) and thus can be written as
\[ \mathbf{v} = \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i) \]
Since, for every \(p \in \{1, \cdots, n\}\),
\[ \begin{align} \phi(\mathbf{x}_p)^T \left(\frac{1}{n-1} \sum_{i=1}^n \phi(\mathbf{x}_i)\left(\phi(\mathbf{x}_i)^T\mathbf{v}\right) \right) & = \phi(\mathbf{x}_p)^T \left(\frac{1}{n-1} \sum_{i=1}^n \phi(\mathbf{x}_i)\left(\phi(\mathbf{x}_i)^T\sum_{j=1}^n \alpha_j \phi(\mathbf{x}_j)\right) \right) \\ &= \frac{1}{n-1} \sum_{i=1}^n k(\mathbf{x}_p, \mathbf{x}_i)\left(\sum_{j=1}^n k(\mathbf{x}_i, \mathbf{x}_j) \alpha_j \right), \end{align} \] and
\[ \phi(\mathbf{x}_p)^T (\lambda \mathbf{v}) = \lambda \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_p)^T \phi(\mathbf{x}_i) = \lambda\sum_{i=1}^n k(\mathbf{x}_p, \mathbf{x}_i) \alpha_i, \] we get
\[ K^2 \mathbf{\alpha} = \lambda (n-1) K \mathbf{\alpha}. \]
or equivalently for \(\lambda > 0\),
\[\begin{equation} K \mathbf{\alpha} = \lambda (n-1) \mathbf{\alpha} \label{eq:kPCAeigen} \end{equation}\]
where \(K=(k(\mathbf{x}_i, \mathbf{x}_j))_{n\times n}\) is the kernel/Gram matrix. For the identity map in Remark 4, \(K=XX^T\). So the eigenproblem (\ref{eq:feateigen}) is transformed into a new eigenproblem (\ref{eq:kPCAeigen}) in which only the kernel matrix is required.
Since
\[ \begin{align} \|\mathbf{v}\|_{\mathcal{H}}^2 &= \left\langle \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i), \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i) \right\rangle \\ & = \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j\left\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \right\rangle \\ & = \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j k(\mathbf{x}_i, \mathbf{x}_j)\\ & = \alpha^T K \alpha = (n-1)\lambda \alpha^T \alpha, \end{align} \]
to ensure that \(\|\mathbf{v}\|_{\mathcal{H}}=1\), \(\alpha\) needs to be normalized by
\[ \alpha \leftarrow \frac{1}{\sqrt{(n-1)\lambda}}\alpha \]
The principal axis is \(\mathbf{v} = \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i)\). The principal components are the projection
\[ \begin{align} z(\mathbf{x}) &= \sum_{i=1}^n\alpha_i \phi(\mathbf{x})^T \phi(\mathbf{x}_i) \\ &= \sum_{i=1}^n\alpha_i k(\mathbf{x}, \mathbf{x}_i) \label{eq:kPCAproj} \end{align} \]
Remark 5. Equations (\ref{eq:kPCAeigen}) and (\ref{eq:kPCAproj}) only depend on the kernel matrix.
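The derivation can be packaged into a short function. The following is only a minimal sketch: the name kpcaSketch and its arguments are introduced here for illustration, it assumes a user-supplied kernel function kern, and r should not exceed the rank of the centered kernel matrix. The step-by-step implementation in the next section performs the same computation for the linear kernel.
kpcaSketch <- function(X, kern, r = 2) {
  n <- nrow(X)
  # kernel/Gram matrix K_ij = k(x_i, x_j)
  K <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) kern(X[i, ], X[j, ])))
  s  <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  Kc <- s %*% K %*% s                   # centered kernel matrix
  e  <- eigen(Kc)
  lambda <- e$values[1:r] / (n - 1)     # eigenvalues of the problem K alpha = lambda (n-1) alpha
  # normalize alpha so that the principal axes v have unit norm in feature space
  alpha <- sweep(e$vectors[, 1:r, drop = FALSE], 2, sqrt((n - 1) * lambda), "/")
  # projections z(x_p) = sum_i alpha_i k(x_p, x_i)
  list(lambda = lambda, alpha = alpha, scores = Kc %*% alpha)
}
With the linear kernel, kern = function(x, y) sum(x * y), and the centered toy data, lambda and alpha should agree (up to sign flips per column and floating-point error) with the first two entries of Kc.eign$values and with PCalpha computed step by step below.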
Numerical Implementation
- Get data
For illustration and comparison, we use the same toy data.
- Center data
We reuse the column-centered matrix Xc from the standard PCA section.
- Construct the kernel/Gram matrix
We choose the linear kernel to see whether we can obtain the same results as standard PCA.
linearKern <- function(x, y) {
return(sum(x * y))
}
dataSize <- dim(Xc)
n <- dataSize[1]
d <- dataSize[2]
K <- matrix(NA, nrow = n, ncol = n)
for (i in 1:n) {
for (j in 1:n) {
K[i,j] = linearKern(Xc[i,], Xc[j, ])
}
}
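For the linear kernel, the double loop is simply the matrix product of the centered data with its transpose; a quick equivalence check:
# the linear-kernel matrix is the Gram matrix of the centered data
all.equal(K, Xc %*% t(Xc))  # should be TRUE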
- Center the \(K\) matrix
In general the kernel matrix must be double-centered, \(K \mapsto sKs\) with the centering matrix \(s = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\). Here simple column-centering gives the same result, because \(K\) computed from the centered data is already centered (see the check after the output below).
Kc <- apply(K, 2, function(y) y - mean(y))
# check if round(Kc, digits = 4) == round(sKs, digits = 4)
s <- diag(1, n, n) - matrix(1/n, nrow = n, ncol = n)
sKs <- s %*% K %*% s
round(Kc, digits = 4) == round(sKs, digits = 4)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
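The check passes trivially here: because Xc is column-centered, the row and column sums of K are already (numerically) zero, so centering leaves K essentially unchanged:
# K is already centered; Kc differs from K only by floating-point noise
all.equal(Kc, K)  # should be TRUE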
- Find the eigenvalues and eigenvectors of \(Kc\)
Kc.eign <- eigen(Kc/(n-1))
lambda <- Kc.eign$values
# principal components
alpha <- Kc.eign$vectors
So we have obtained the same principal components: alpha coincides with PXc from standard PCA, up to a sign per column, as the plot below shows.
par(oma=c(0, 0, 0, 0), mar = c(4, 4.1, 2, 4), las=1, xpd=TRUE)
plot(PXc, pch=3, xlab = "PC1", ylab = "PC2", main = "Principal Components")
# up to a factor -1
points(alpha[,1], -alpha[,2], col='red')
legend("topright", inset = c(-0.5, 0), legend = c("Standard","Kernel"), pch = c(3,1),
col = c(1,2), title = "Method")
- Choose principal axes
The eigenvalues of \(Kc\) are used to find the proportion of the total variance explained by the components. Only the first two eigenvalues are non-negligible, because the linear-kernel Gram matrix has rank at most \(d = 2\); the remaining eigenvalues are numerically zero.
totalVar <- sum(Kc.eign$values)
for (t in Kc.eign$values) {
print(t/totalVar)
}
## [1] 0.9631813
## [1] 0.03681869
## [1] 4.289189e-17
## [1] 1.03107e-17
## [1] 4.674769e-18
## [1] 2.715163e-19
## [1] -3.685304e-18
## [1] -7.481969e-18
## [1] -8.83757e-18
## [1] -1.623554e-17
- Principal axes
# scale alpha
r <- 2
PCalpha <- Kc.eign$vectors[,1:r]
n <- dim(X)[1]
for (i in 1:r) {
PCalpha[,i] <- PCalpha[,i] / sqrt((n-1) * Kc.eign$values[i])
}
# Principal axis
PCAxis1 <- t(Xc) %*% PCalpha[,1]
PCAxis2 <- t(Xc) %*% PCalpha[,2]
So we have obtained the same principal axes, PCAxis1 and PCAxis2, as C.eigen$vectors, up to a sign factor, as the comparison below shows.
par(oma=c(0, 0, 0, 0), mar = c(4, 4.1, 2, 4), las=1, xpd=TRUE)
PCAStandard <- matrix(NA, nrow = 2, ncol = 2)
PCAStandard[1,] <- C.eigen$vectors[,1]
PCAStandard[2,] <- C.eigen$vectors[,2]
plot(PCAStandard, pch=3, xlab = "PC1", ylab = "PC2", main = "Principal Axes")
# up to a factor -1
PCAKernel <- matrix(NA, nrow = 2, ncol = 2)
PCAKernel[1,] <- PCAxis1
PCAKernel[2,] <- -PCAxis2
points(PCAKernel, col='red')
legend("topright", inset = c(-0.5, 0), legend = c("Standard","Kernel"), pch = c(3,1),
col = c(1,2), title = "Method")
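The visual comparison can also be made numerically, taking absolute values to ignore the sign ambiguity of eigenvectors:
# kernel-PCA principal axes vs. eigenvectors of C, up to a sign per column
round(abs(cbind(PCAxis1, PCAxis2)), 6) == round(abs(C.eigen$vectors), 6)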
Validation
library(kernlab)
# initialize kernel function
linK <- vanilladot()
kpc <- kpca(Xc, kernel=linK)
# return principal component vectors
pcv(kpc)
## [,1] [,2]
## [1,] 0.22656759 1.2535653
## [2,] -0.48642101 -1.0226454
## [3,] 0.27150712 -2.7515536
## [4,] 0.07503555 -0.9335934
## [5,] 0.45857001 1.4996976
## [6,] 0.24982141 -1.2547618
## [7,] -0.02712053 2.5042249
## [8,] -0.31320326 -0.3322786
## [9,] -0.11986791 -0.1271683
## [10,] -0.33488896 1.1645133
# return eigenvalues
eig(kpc)
## Comp.1 Comp.2
## 1.15562494 0.04417506
# return the data projected on kernel pc space
rotated(kpc)
## [,1] [,2]
## [1,] 2.6182716 0.55376322
## [2,] -5.6212026 -0.45175422
## [3,] 3.1376040 -1.21550044
## [4,] 0.8671295 -0.41241542
## [5,] 5.2993494 0.66249230
## [6,] 2.8869986 -0.55429176
## [7,] -0.3134116 1.10624283
## [8,] -3.6194550 -0.14678426
## [9,] -1.3852235 -0.05617669
## [10,] -3.8700604 0.51442443
# returns the kernel used
kernelf(kpc)
## Linear (vanilla) kernel function.
Due to the small size of the toy data matrix, the choice between \(n\) and \(n-1\) makes a visible difference: in standard PCA with prcomp the scaling factor is \(1/(n-1)\), while in kpca from kernlab it is \(1/n\).
S <- (t(Xc) %*% Xc)/n
S.eign <- eigen(S)
V <- S.eign$vectors
m <- dim(V)[2]
for (i in 1:m) {
V[,i] = V[,i] / sqrt(n*S.eign$values[i])
}
# normalized principal components; rotated() returns the unnormalized PCs
Z <- Xc %*% V
S.eign$values
## [1] 1.15562494 0.04417506
eig(kpc)
## Comp.1 Comp.2
## 1.15562494 0.04417506
Z
## [,1] [,2]
## [1,] 0.24356016 -0.26347265
## [2,] -0.52290258 0.21493822
## [3,] 0.29187014 0.57831779
## [4,] 0.08066321 0.19622138
## [5,] 0.49296275 -0.31520440
## [6,] 0.26855801 0.26372412
## [7,] -0.02915456 -0.52633457
## [8,] -0.33669350 0.06983786
## [9,] -0.12885800 0.02672807
## [10,] -0.36000563 -0.24475581
# rotated(): principal components not scaled to unit norm
z <- rotated(kpc)[, 1]
# recover principal components
z/sqrt(sum(z^2))
## [1] 0.24356016 -0.52290258 0.29187014 0.08066321 0.49296275 0.26855801
## [7] -0.02915456 -0.33669350 -0.12885800 -0.36000563
# recover principal axis
pc1 <- t(Xc) %*% pcv(kpc)[, 1]
pc1 <- pc1 / sqrt(sum(pc1^2))
pc1
## [,1]
## [1,] 0.6778734
## [2,] 0.7351787
S.eign$vectors[,1]
## [1] 0.6778734 0.7351787
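Finally, the \(1/n\) versus \(1/(n-1)\) relationship can be checked directly by rescaling the eigenvalues of \(C\) from the standard PCA section:
# rescaling by (n-1)/n should reproduce kernlab's eigenvalues
round(C.eigen$values * (n - 1) / n, 6) == round(as.numeric(eig(kpc)), 6)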