
Kernel Principal Component Analysis from scratch with R

Standard Principal Component Analysis

Basic understanding of the algorithm

Let \mathbf{X} be a column-centered data matrix of size n\times d. Each row is a data point with d features. Then the d\times d covariance matrix of \mathbf{X} is given by \mathbf{C}=\mathbf{X}^T\mathbf{X}/(n-1). Its eigenvectors are the principal axes (PAs) of \mathbf{X} and its eigenvalues are the PC variances of \mathbf{X}. The components of \mathbf{X} on the principal axes are called the principal components.

Let \mathbf{X}=\mathbf{U}_{n\times r}\mathbf{D}_{r\times r}\mathbf{V}_{d\times r}^T be the SVD decomposition of \mathbf{X}, where r\le d is the rank of \mathbf{X}. Then \mathbf{C}=\mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T/(n-1)=\mathbf{V}\mathbf{D}^2\mathbf{V}^T/(n-1). Therefore the columns of \mathbf{V} are eigenvectors of \mathbf{C} (the PAs of \mathbf{X}), and the diagonal entries of \mathbf{D}^2/(n-1) are the non-zero eigenvalues of \mathbf{C} (the PC variances of \mathbf{X}). Consider the transformation

\mathbf{Z}=\mathbf{X}\mathbf{V}\mathbf{D}^{-1}=\mathbf{X}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}\frac{1}{\sqrt{n-1}}

where \boldsymbol{\Lambda}=\mathbf{D}^2/(n-1) is the diagonal matrix whose diagonal entries are the non-zero eigenvalues of \mathbf{C}.

Since

\mathbf{Z}^T\mathbf{Z}=\mathbf{D}^{-1}\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V}\mathbf{D}^{-1}=\mathbf{D}^{-1}\mathbf{V}^T\mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T\mathbf{V}\mathbf{D}^{-1}=\mathbf{I},

\mathbf{Z} is normalized. The columns of \mathbf{Z} are the normalized principal components along the principal axes \mathbf{V}.

Remark 1. Since

\mathbf{X}\mathbf{X}^T\mathbf{Z}=\mathbf{X}\mathbf{X}^T\mathbf{X}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}\frac{1}{\sqrt{n-1}}=\mathbf{X}\mathbf{V}\mathbf{D}^2\mathbf{V}^T\mathbf{V}\boldsymbol{\Lambda}^{-1/2}\frac{1}{\sqrt{n-1}}=\left(\mathbf{X}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}\frac{1}{\sqrt{n-1}}\right)\mathbf{D}^2=\mathbf{Z}\mathbf{D}^2,

the columns of \mathbf{Z} are eigenvectors of \mathbf{X}\mathbf{X}^T/(n-1), and the diagonal entries of \mathbf{D}^2/(n-1) are the eigenvalues of \mathbf{X}\mathbf{X}^T/(n-1).
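These relations can be verified numerically with base R's svd() and eigen(). The snippet below is a minimal sketch on an arbitrary small matrix (A, Ac, nA and Zsvd are illustrative names introduced here, not objects used elsewhere in this tutorial).

# minimal sketch: SVD of a centered data matrix vs eigen-decomposition of its covariance
set.seed(1)
A    <- matrix(rnorm(12), nrow = 6, ncol = 2)    # arbitrary 6 x 2 data, for illustration only
Ac   <- scale(A, center = TRUE, scale = FALSE)   # column centering
nA   <- nrow(Ac)
sv   <- svd(Ac)                                  # Ac = U D V^T
ev   <- eigen(cov(Ac))                           # cov(Ac) = Ac^T Ac / (nA - 1)
# PC variances: the diagonal of D^2/(n-1) equals the eigenvalues of the covariance matrix
all.equal(sv$d^2 / (nA - 1), ev$values)
# principal axes: the columns of V equal the eigenvectors of C up to sign
round(abs(sv$v), 6) == round(abs(ev$vectors), 6)
# normalized principal components: Z = Ac V D^{-1} recovers U
Zsvd <- Ac %*% sv$v %*% diag(1 / sv$d)
all.equal(Zsvd, sv$u)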

Numerical Implementation

  1. Get data

For illustration, we use a toy data matrix X.

X <- matrix(data = c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1,
                     2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9), ncol = 2)
plot(X[,1], X[,2],xlab=expression("x"[1]), ylab=expression("x"[2]))

  2. Convert X to a column-centered matrix

Centering is not necessary for calculating the covariance matrix (cov() centers the data internally), but it is necessary for performing the projection later.

# column mean-centering 
Xc <- apply(X, 2, function(y) y - mean(y))
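As a quick sanity check (a minimal sketch assuming X and Xc from the steps above), the column means of Xc are zero and cov() is unaffected by the centering, since it centers internally:

# column means of Xc are (numerically) zero after centering
round(colMeans(Xc), 12)
# cov() centers internally, so the covariance matrix is unchanged by pre-centering
all.equal(cov(X), cov(Xc))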
  3. Calculate the covariance matrix
C <- cov(Xc)
  4. Get the eigenvalues and eigenvectors of the covariance matrix
C.eigen <- eigen(C)
  5. Choose principal axes

The eigenvalues of C are used to find the proportion of the total variance explained by the components.

totalVar <- sum(C.eigen$values)
# one can check that it equals the total variance sum(diag(C))
totalVar == sum(diag(C))
## [1] TRUE
for (s in C.eigen$values) {
  print(s/totalVar)
}
## [1] 0.9631813
## [1] 0.03681869

The first eigenvalue accounts for 96.3% of the total variance, so we choose the first eigenvector as the principal axis.
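The same proportions can also be computed without a loop, together with the cumulative proportion of variance explained (a minimal sketch using totalVar and C.eigen from above; propVar is a name introduced here):

# proportion and cumulative proportion of variance explained, vectorized
propVar <- C.eigen$values / totalVar
propVar
cumsum(propVar)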

  6. Project on the principal axis
# Set up figure layout
par(oma=c(0, 0, 0, 0), mar = c(4, 4.1, 2, 1), las="1", xpd=TRUE)
layout.matrix <- matrix(c(1, 2), nrow = 1, ncol = 2)
layout(mat = layout.matrix, heights = c(3.25), widths = c(3.25, 3.25))
# Plot centered data and the line along the first principal axis
plot(Xc, xlab=expression("x"[1]), ylab=expression("x"[2]))
x <- seq(-1.0, 1.0, 0.1)
lines(x, x * C.eigen$vectors[2,1]/C.eigen$vectors[1,1], col='red')
# Scale the principal axes by 1/sqrt((n-1)*lambda_i) so the projected
# principal components have unit norm
r <- 2
V <- C.eigen$vectors[,1:r]
n <- dim(Xc)[1]
for (i in 1:r) {
  V[,i] <- V[,i]/sqrt((n-1)*C.eigen$values[i])
}
PXc <- Xc %*% V
plot(PXc,  xlab="PC1", ylab="PC2")

  7. Verify Remark 1
gram <- Xc %*% t(Xc)/(n-1)   # n = 10, so this divides by 9
gram.eign <- eigen(gram)
round(gram.eign$values[1:2], 6) == round(C.eigen$values, 6)
## [1] TRUE TRUE
round(gram.eign$vectors[, 1], 6) == round(PXc[, 1], 6)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# the second eigenvector differs by a sign factor of -1
round(gram.eign$vectors[, 2], 6) == round(-PXc[, 2], 6)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Validation

Compare with R function prcomp

pilots.pca <- prcomp(X)
pilots.pca
## Standard deviations (1, .., p=2):
## [1] 1.1331495 0.2215477
## 
## Rotation (n x k) = (2 x 2):
##             PC1        PC2
## [1,] -0.6778734  0.7351787
## [2,] -0.7351787 -0.6778734

Remark 2. The squares of the standard deviations equal the eigenvalues of C.

Remark 3. The rotation matrix (whose columns are the principal axes) is the same as the eigenvectors of C except for a sign factor of -1.
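Both remarks can be checked numerically. The sketch below uses pilots.pca and C.eigen from above and compares the eigenvectors in absolute value, since each axis is only determined up to sign:

# Remark 2: squared standard deviations equal the eigenvalues of C
all.equal(pilots.pca$sdev^2, C.eigen$values)
# Remark 3: the rotation matrix matches the eigenvectors of C up to sign
round(abs(pilots.pca$rotation), 6) == round(abs(C.eigen$vectors), 6)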

Kernel Principal Component Analysis

Basic understanding of the algorithm

Non-linear data (e.g. a curve in 2-D) may become linear in a higher-dimensional space, so that the dimension of the data can then be reduced. Kernel principal component analysis maps the input space to a new, higher-dimensional feature space and carries out PCA in that space.

Consider a feature map \phi:\mathbb{R}^2\to\mathbb{R}^3,

\mathbf{x}=\begin{bmatrix}x_1\\x_2\end{bmatrix}\mapsto\phi(\mathbf{x})=\begin{bmatrix}\phi_1(\mathbf{x})\\\phi_2(\mathbf{x})\\\phi_3(\mathbf{x})\end{bmatrix}

The data matrix in \mathbb{R}^3 (here shown for n=4 data points) is

\boldsymbol{\Phi}=\begin{bmatrix}\phi_1(\mathbf{x}_1)&\phi_2(\mathbf{x}_1)&\phi_3(\mathbf{x}_1)\\\phi_1(\mathbf{x}_2)&\phi_2(\mathbf{x}_2)&\phi_3(\mathbf{x}_2)\\\phi_1(\mathbf{x}_3)&\phi_2(\mathbf{x}_3)&\phi_3(\mathbf{x}_3)\\\phi_1(\mathbf{x}_4)&\phi_2(\mathbf{x}_4)&\phi_3(\mathbf{x}_4)\end{bmatrix}=\begin{bmatrix}\phi(\mathbf{x}_1)^T\\\phi(\mathbf{x}_2)^T\\\phi(\mathbf{x}_3)^T\\\phi(\mathbf{x}_4)^T\end{bmatrix}

In the feature space \mathbb{R}^3, after centering the data, we can write the covariance matrix as

\mathbf{C}=\frac{1}{n-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}=\frac{1}{n-1}\sum_{i=1}^n\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T

Remark 4. If \phi is a linear transformation:

\begin{bmatrix}x_1\\x_2\end{bmatrix}\mapsto\begin{bmatrix}x_1\\x_2\end{bmatrix},

then \mathbf{C} reduces to the standard covariance matrix.
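To make the feature map concrete, here is a minimal illustrative sketch (my own example, not part of the derivation above) of the classic quadratic map \phi(\mathbf{x})=(x_1^2,\sqrt{2}x_1x_2,x_2^2)^T: its inner product in \mathbb{R}^3 equals the polynomial kernel k(\mathbf{x},\mathbf{y})=(\mathbf{x}^T\mathbf{y})^2, so a kernel evaluates feature-space inner products without forming \phi explicitly.

# explicit quadratic feature map from R^2 to R^3 (illustrative example)
phi   <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
# the corresponding kernel: k(x, y) = (x . y)^2
kPoly <- function(x, y) sum(x * y)^2
u <- c(2.5, 2.4)   # two arbitrary points
w <- c(0.5, 0.7)
# inner product in the feature space equals the kernel evaluated in the input space
all.equal(sum(phi(u) * phi(w)), kPoly(u, w))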

Consider the eigenvalue equation

\mathbf{C}\mathbf{v}=\lambda\mathbf{v}

i.e. 

\frac{1}{n-1}\sum_{i=1}^n\phi(\mathbf{x}_i)\left(\phi(\mathbf{x}_i)^T\mathbf{v}\right)=\lambda\mathbf{v}

So for \lambda>0, \mathbf{v} is a linear combination of \phi(\mathbf{x}_i), i=1,\dots,n, and thus can be written as

\mathbf{v}=\sum_{i=1}^n\alpha_i\phi(\mathbf{x}_i)

Since for p\in\{1,\dots,n\},

\phi(\mathbf{x}_p)^T\left(\frac{1}{n-1}\sum_{i=1}^n\phi(\mathbf{x}_i)\left(\phi(\mathbf{x}_i)^T\mathbf{v}\right)\right)=\phi(\mathbf{x}_p)^T\left(\frac{1}{n-1}\sum_{i=1}^n\phi(\mathbf{x}_i)\Big(\phi(\mathbf{x}_i)^T\sum_{j=1}^n\alpha_j\phi(\mathbf{x}_j)\Big)\right)=\frac{1}{n-1}\sum_{i=1}^n k(\mathbf{x}_p,\mathbf{x}_i)\Big(\sum_{j=1}^n k(\mathbf{x}_i,\mathbf{x}_j)\alpha_j\Big), and

\phi(\mathbf{x}_p)^T(\lambda\mathbf{v})=\lambda\sum_{i=1}^n\alpha_i\,\phi(\mathbf{x}_p)^T\phi(\mathbf{x}_i)=\lambda\sum_{i=1}^n k(\mathbf{x}_p,\mathbf{x}_i)\alpha_i, we get

\mathbf{K}^2\boldsymbol{\alpha}=\lambda(n-1)\mathbf{K}\boldsymbol{\alpha},

or equivalently, for \lambda>0,

\begin{align} \mathbf{K}\boldsymbol{\alpha}=\lambda(n-1)\boldsymbol{\alpha} \label{eq:kPCAeigen} \end{align}

where \mathbf{K}=\left(k(\mathbf{x}_i,\mathbf{x}_j)\right)_{n\times n} is the kernel/Gram matrix. For the linear transformation in Remark 4, \mathbf{K}=\mathbf{X}\mathbf{X}^T. So the original eigenproblem \mathbf{C}\mathbf{v}=\lambda\mathbf{v} is transformed into the new eigenproblem (\ref{eq:kPCAeigen}), where only information about the kernel matrix is required.

Since

\|\mathbf{v}\|_{\mathcal{H}}^2=\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j\,\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)=\boldsymbol{\alpha}^T\mathbf{K}\boldsymbol{\alpha}=(n-1)\lambda\,\boldsymbol{\alpha}^T\boldsymbol{\alpha},

to ensure that \|\mathbf{v}\|_{\mathcal{H}}=1, \alpha (of unit norm from the eigen-decomposition) needs to be normalized by

\alpha \leftarrow \frac{1}{\sqrt{(n-1)\lambda}}\alpha

The principal axis is \mathbf{v} = \sum_{i=1}^n \alpha_i \phi(\mathbf{x}_i). The principal components are the projection

\begin{align} z(\mathbf{x}) &= \sum_{i=1}^n\alpha_i \phi(\mathbf{x})^T \phi(\mathbf{x}_i) \\ &= \sum_{i=1}^n\alpha_i k(\mathbf{x}, \mathbf{x}_i) \label{eq:kPCAproj} \end{align}

Remark 5. Equations (\ref{eq:kPCAeigen}) and (\ref{eq:kPCAproj}) only depend on the kernel matrix.
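Because of Remark 5, any positive semi-definite kernel can be plugged into the same machinery without ever computing \phi. As a minimal sketch (not used in the rest of this tutorial; rbfKern, kpcaProject and the bandwidth sigma are illustrative names introduced here, and centering of the kernel values for a new point is omitted for simplicity), a Gaussian/RBF kernel and the projection z(\mathbf{x})=\sum_i\alpha_i k(\mathbf{x},\mathbf{x}_i) of a new point could look like this:

# Gaussian (RBF) kernel; sigma is a hypothetical bandwidth parameter
rbfKern <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))
# project a new point x onto a kernel principal axis given the normalized
# coefficients alpha and the training points stored in the rows of X
kpcaProject <- function(x, X, alpha, kern = rbfKern) {
  kvec <- apply(X, 1, function(xi) kern(x, xi))  # k(x, x_i), i = 1, ..., n
  sum(alpha * kvec)                              # z(x) = sum_i alpha_i k(x, x_i)
}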

Numerical Implementation

  1. Get data

For illustration and comparison, we use the same toy data.

  2. Center data

We reuse the column-centered matrix Xc computed above.

  3. Construct the kernel/Gram matrix

We choose the linear kernel to see whether we can obtain the same results as standard PCA.

linearKern <- function(x, y) {
  return(sum(x * y))
} 
dataSize <- dim(Xc)
n <- dataSize[1]
d <- dataSize[2]
K <- matrix(NA, nrow = n, ncol = n)
for (i in 1:n) {
  for (j in 1:n) {
    K[i,j] = linearKern(Xc[i,], Xc[j, ])
  }
}
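For the linear kernel the double loop is equivalent to a single matrix product; a quick check (a minimal sketch using the K and Xc defined above):

# for the linear kernel, K is just the Gram matrix Xc Xc^T
all.equal(K, Xc %*% t(Xc))   # equivalently: tcrossprod(Xc)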
  4. Center the K matrix
Kc <- apply(K, 2, function(y) y - mean(y))
# check that column-centering of K agrees with double-centering s %*% K %*% s, where s = I - J/n
s <- diag(1, n, n) - matrix(1/n, nrow = n, ncol = n)
sKs <- s %*% K %*% s
round(Kc, digits = 4) == round(sKs, digits = 4)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
##  [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
## [10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
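Column-centering K works here only because Xc is already column-centered, so K has zero row and column sums. For a general kernel matrix, the usual double-centering K - \mathbf{1}_nK - K\mathbf{1}_n + \mathbf{1}_nK\mathbf{1}_n (with \mathbf{1}_n the n\times n matrix with entries 1/n) can also be written out directly; a minimal sketch equivalent to the s %*% K %*% s check above (ones and Kc2 are names introduced here):

# general double-centering of a kernel matrix, algebraically equal to s %*% K %*% s
ones <- matrix(1/n, nrow = n, ncol = n)
Kc2  <- K - ones %*% K - K %*% ones + ones %*% K %*% ones
all.equal(Kc2, sKs)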
  5. Find the eigenvalues and eigenvectors of Kc
Kc.eign <- eigen(Kc/(n-1))
lambda <- Kc.eign$values
# principal components
alpha <- Kc.eign$vectors

So we have obtained the same principal components alpha as PXc, up to a sign factor, as the plot below shows.

par(oma=c(0, 0, 0, 0), mar = c(4, 4.1, 2, 4), las="1", xpd=TRUE)
plot(PXc, pch=3, xlab = "PC1", ylab = "PC2", main = "Principal Components")
# up to a factor -1
points(alpha[,1], -alpha[,2], col='red')
legend("topright", inset = c(-0.5, 0), legend = c("Standard","Kernel"), pch = c(3,1), 
       col = c(1,2), title = "Method")

  6. Choose principal axes

The eigenvalues of Kc are used to find the proportion of the total variance explained by the components.

totalVar <- sum(Kc.eign$values)
for (t in Kc.eign$values) {
  print(t/totalVar)
}
## [1] 0.9631813
## [1] 0.03681869
## [1] 4.289189e-17
## [1] 1.03107e-17
## [1] 4.674769e-18
## [1] 2.715163e-19
## [1] -3.685304e-18
## [1] -7.481969e-18
## [1] -8.83757e-18
## [1] -1.623554e-17
  7. Principal axes
# scale alpha
r <- 2
PCalpha <- Kc.eign$vectors[,1:r]
n <- dim(X)[1]
for (i in 1:r) {
  PCalpha[,i] <- PCalpha[,i] / sqrt((n-1) * Kc.eign$values[i])
}
# Principal axes recovered via v = sum_i alpha_i * phi(x_i) = t(Xc) %*% alpha
PCAxis1 <- t(Xc) %*% PCalpha[,1]
PCAxis2 <- t(Xc) %*% PCalpha[,2]

So we have obtained the same principal axes PCAxis1 and PCAxis2 as C.eigen$vectors, again up to a sign factor.

par(oma=c(0, 0, 0, 0), mar = c(4, 4.1, 2, 4), las="1", xpd=TRUE)
PCAStandard <- matrix(NA, nrow = 2,  ncol = 2) 
PCAStandard[1,] <- C.eigen$vectors[,1]
PCAStandard[2,] <- C.eigen$vectors[,2]
plot(PCAStandard, pch=3, xlab=expression("x"[1]), ylab=expression("x"[2]), main = "Principal Axes")
# up to a factor -1
PCAKernel <- matrix(NA, nrow = 2,  ncol = 2)
PCAKernel[1,] <- PCAxis1
PCAKernel[2,] <- -PCAxis2
points(PCAKernel, col='red')
legend("topright", inset = c(-0.5, 0), legend = c("Standard","Kernel"), pch = c(3,1), 
       col = c(1,2), title = "Method")
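The agreement can also be checked numerically (a minimal sketch using the objects defined above; absolute values are compared because each axis is only determined up to sign):

# kernel PCA axes vs the eigenvectors of the covariance matrix, up to sign
round(abs(cbind(PCAxis1, PCAxis2)), 6) == round(abs(C.eigen$vectors), 6)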

Validation

library(kernlab)
# initialize kernel function
linK <- vanilladot()
kpc <- kpca(Xc, kernel=linK)
# return principal component vectors
pcv(kpc)
##              [,1]       [,2]
##  [1,]  0.22656759  1.2535653
##  [2,] -0.48642101 -1.0226454
##  [3,]  0.27150712 -2.7515536
##  [4,]  0.07503555 -0.9335934
##  [5,]  0.45857001  1.4996976
##  [6,]  0.24982141 -1.2547618
##  [7,] -0.02712053  2.5042249
##  [8,] -0.31320326 -0.3322786
##  [9,] -0.11986791 -0.1271683
## [10,] -0.33488896  1.1645133
# return eigenvalues
eig(kpc)
##     Comp.1     Comp.2 
## 1.15562494 0.04417506
# return the data projected on kernel pc space
rotated(kpc)
##             [,1]        [,2]
##  [1,]  2.6182716  0.55376322
##  [2,] -5.6212026 -0.45175422
##  [3,]  3.1376040 -1.21550044
##  [4,]  0.8671295 -0.41241542
##  [5,]  5.2993494  0.66249230
##  [6,]  2.8869986 -0.55429176
##  [7,] -0.3134116  1.10624283
##  [8,] -3.6194550 -0.14678426
##  [9,] -1.3852235 -0.05617669
## [10,] -3.8700604  0.51442443
# returns the kernel used
kernelf(kpc)
## Linear (vanilla) kernel function.

Due to the small size of the toy data matrix, the difference between n and n-1 is noticeable. In standard PCA (prcomp) the covariance factor is 1/(n-1); in kpca from kernlab the factor is 1/n.
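Consequently the two sets of eigenvalues differ exactly by the factor (n-1)/n; a quick check (a minimal sketch assuming C.eigen, n and kpc from above):

# eigenvalues under the 1/n convention are (n-1)/n times those under 1/(n-1)
all.equal(C.eigen$values * (n - 1) / n, unname(eig(kpc)))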

S <- (t(Xc) %*% Xc)/n
S.eign <- eigen(S)
V <- S.eign$vectors
m <- dim(V)[2]
for (i in 1:m) {
  V[,i] = V[,i] / sqrt(n*S.eign$values[i])
}
# normalized principal components; rotated(kpc) returns the unnormalized PCs
Z <- Xc %*% V
S.eign$values
## [1] 1.15562494 0.04417506
eig(kpc)
##     Comp.1     Comp.2 
## 1.15562494 0.04417506
Z
##              [,1]        [,2]
##  [1,]  0.24356016 -0.26347265
##  [2,] -0.52290258  0.21493822
##  [3,]  0.29187014  0.57831779
##  [4,]  0.08066321  0.19622138
##  [5,]  0.49296275 -0.31520440
##  [6,]  0.26855801  0.26372412
##  [7,] -0.02915456 -0.52633457
##  [8,] -0.33669350  0.06983786
##  [9,] -0.12885800  0.02672807
## [10,] -0.36000563 -0.24475581
# rotated(kpc) gives principal components not scaled to unit norm
z <- rotated(kpc)[, 1]  
# recover principal components
z/sqrt(sum(z^2))
##  [1]  0.24356016 -0.52290258  0.29187014  0.08066321  0.49296275  0.26855801
##  [7] -0.02915456 -0.33669350 -0.12885800 -0.36000563
# recover principal axis
pc1 <- t(Xc) %*% pcv(kpc)[, 1]
pc1 <- pc1 / sqrt(sum(pc1^2))
pc1
##           [,1]
## [1,] 0.6778734
## [2,] 0.7351787
S.eign$vectors[,1]
## [1] 0.6778734 0.7351787

References

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel Principal Component Analysis (1999).