C Solutions ch. 4 - Clustering
Solutions to the exercises in the clustering chapter.
C.1 Exercise 1
First we need to read the image data and transform it into a suitable format for analysis:
library(EBImage)
library(ggplot2)
img <- readImage("data/histology/Emphysema_H_and_E.jpg")
imgDim <- dim(img)
imgDF <- data.frame(
  x = rep(1:imgDim[1], imgDim[2]),
  y = rep(imgDim[2]:1, each=imgDim[1]),
  r = as.vector(img[,,1]),
  g = as.vector(img[,,2]),
  b = as.vector(img[,,3])
)
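The reshaping step can be checked on a tiny array. The following is a minimal sketch in base R, assuming a made-up 3 x 2 "image" (the names `toy`, `toyDF`, `nx` and `ny` are hypothetical and not part of the exercise); it shows why the `x` and `y` columns line up with the values returned by `as.vector()`.

```r
# Toy "image": 3 pixels wide, 2 pixels high, 3 colour channels (made-up values)
nx <- 3
ny <- 2
toy <- array(seq_len(nx * ny * 3) / (nx * ny * 3), dim = c(nx, ny, 3))

# as.vector() unrolls each channel with the first index (x) varying fastest,
# so x cycles through 1..nx while y stays constant for nx consecutive values.
# y is reversed so that the first image row ends up at the top of a plot.
toyDF <- data.frame(
  x = rep(1:nx, ny),
  y = rep(ny:1, each = nx),
  r = as.vector(toy[, , 1]),
  g = as.vector(toy[, , 2]),
  b = as.vector(toy[, , 3])
)

# Each row of toyDF should match the pixel it claims to describe
idx <- cbind(toyDF$x, ny + 1 - toyDF$y)
stopifnot(all(toyDF$r == toy[, , 1][idx]))
```

The same check should carry over to the real image by substituting `img` and `imgDim` for the toy objects.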
Next we will perform k-means clustering for each value of k in the range 1:9. This is computationally quite intensive, so we’ll use parallel processing:
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(detectCores())
k=1:9
set.seed(42)
res <- suppressWarnings(
  foreach(i=k, .options.multicore=list(set.seed=FALSE)) %dopar%
    kmeans(imgDF[,c("r", "g", "b")], i, nstart=50)
)
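The doMC package relies on forking and is not available on Windows. As a hedged alternative, the same list of fits can be built sequentially with `lapply()`. The sketch below runs on simulated data rather than the image, so `toy`, `kToy` and `resToy` are hypothetical names and the two simulated colour groups are an assumption made purely for illustration.

```r
set.seed(42)

# Simulated "pixels": two well-separated colour groups (made-up data)
toy <- rbind(
  matrix(rnorm(300, mean = 0.2, sd = 0.05), ncol = 3),
  matrix(rnorm(300, mean = 0.8, sd = 0.05), ncol = 3)
)
colnames(toy) <- c("r", "g", "b")

# Sequential equivalent of the foreach loop: one kmeans fit per value of k,
# returned as a list indexed in the same way as res above
kToy <- 1:4
resToy <- lapply(kToy, function(i) kmeans(toy, centers = i, nstart = 10))

# Total within-cluster sum of squares for each k
sapply(resToy, function(fit) fit$tot.withinss)
```

Because `lapply()` returns a list in the order of `kToy`, the element `resToy[[2]]` holds the fit for k=2, mirroring how `res[[2]]` is used in the solution.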
We can now plot total within-cluster sum of squares against k:
plot_tot_withinss <- function(kmeans_output){
  tot_withinss <- sapply(k, function(i){kmeans_output[[i]]$tot.withinss})
  qplot(k, tot_withinss, geom=c("point", "line"),
        ylab="Total within-cluster sum of squares") + theme_bw()
}
plot_tot_withinss(res)
The plot of total within-cluster sum of squares against k shows an elbow at k=2, indicating that most of the variance in the image can be described by just two clusters. Let’s plot the clusters for k=2.
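One way to put a number on the elbow is to express each total within-cluster sum of squares as a fraction of the value at k=1, which is the total sum of squares of the data. The sketch below does this on simulated two-group data, not on the image; `sim`, `kSim`, `totSim`, `explained` and `gain` are hypothetical names introduced for illustration.

```r
set.seed(1)

# Simulated data with two well-separated groups (made-up values)
sim <- rbind(
  matrix(rnorm(300, mean = 0.2, sd = 0.05), ncol = 3),
  matrix(rnorm(300, mean = 0.8, sd = 0.05), ncol = 3)
)

kSim <- 1:5
totSim <- sapply(kSim, function(i) {
  kmeans(sim, centers = i, nstart = 10)$tot.withinss
})

# Fraction of the total sum of squares accounted for by the clustering
explained <- 1 - totSim / totSim[1]

# Gain from adding one more cluster; the elbow is where this drops sharply
gain <- diff(explained)
round(explained, 3)
```

For data like these, almost all of the gain comes from the step from k=1 to k=2, which is the pattern described above for the image.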
clusterColours <- rgb(res[[2]]$centers)
ggplot(data = imgDF, aes(x = x, y = y)) +
  geom_point(colour = clusterColours[res[[2]]$cluster]) +
  xlab("x") +
  ylab("y") +
  theme_minimal()
Segmentation of the image with k=2 separates the air spaces from all other objects. The difference in pixel colour between the air spaces and everything else therefore accounts for most of the variance in the data set (the image).
Let’s now take a look at a segmentation of the image using k=4.
clusterColours <- rgb(res[[4]]$centers)
ggplot(data = imgDF, aes(x = x, y = y)) +
  geom_point(colour = clusterColours[res[[4]]$cluster]) +
  xlab("x") +
  ylab("y") +
  theme_minimal()
K-means clustering with k=4 rapidly and effectively segments the image of the histological section into the biological objects we can see by eye. A manual segmentation of the same image would be very laborious. This exercise highlights the importance of using biological insight to choose a sensible value of k.
N.B. the cluster centres provide the mean pixel intensities for the red, green and blue channels and we have used this information to colour the pixels belonging to each cluster.
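The colouring step can be illustrated in isolation. This is a minimal sketch with made-up cluster centres and assignments (`centres`, `cols`, `cluster` and `pixelCols` are hypothetical names); it relies on `rgb()` accepting a three-column matrix with values between 0 and 1.

```r
# Made-up cluster centres: one row per cluster, columns are r, g, b in [0, 1]
centres <- matrix(
  c(1, 1, 1,   # cluster 1: white
    1, 0, 1),  # cluster 2: magenta
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("r", "g", "b"))
)

# rgb() converts each row of the matrix into a hex colour string
cols <- rgb(centres)

# Made-up cluster assignment for four pixels
cluster <- c(1, 2, 2, 1)

# Indexing the colour vector by cluster gives one colour per pixel
pixelCols <- cols[cluster]
pixelCols
```

In the solution, `res[[k]]$centers` and `res[[k]]$cluster` play the roles of `centres` and `cluster`.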