Commit ffe47f44 authored by Jung Alex

clustering HA

parent 783fbc6d
\usepackage{bm, amssymb}
\title{CS-E3210 - Machine Learning Basic Principles \\ Home Assignment 5 - ``Clustering''}
Submit your solutions to the following problems as a single PDF that does not contain
any personal information (student ID or name). The only layout rule for your submission is that, for each
problem, the answer must appear on exactly one separate page. You are welcome to use the \LaTeX{} file underlying this PDF,
available under \url{}, and fill in your solutions there.
\section{Hard Clustering}
Consider $\samplesize=20$ snapshots, available at \url{},
which are named according to the season in which they were taken, i.e., either ``winter??.jpeg'' or ``summer??.jpeg''.
We represent the $i$th snapshot, with $i=1,\ldots,\samplesize$, by the feature vector $\vx^{(\sampleidx)}=(x^{(\sampleidx)}_{\rm r},x^{(\sampleidx)}_{\rm g})^{T} \in \mathbb{R}^{2}$ with the total image
redness $x_{\rm r}$ and greenness $x_{\rm g}$. Thus, the overall dataset is given by the feature vectors $ \{ \vx^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize}$,
which are divided into the two subsets $\dataset^{(\rm summer)}$ and $\dataset^{(\rm winter)}$ containing only the feature vectors of summer and winter snapshots, respectively.
Apply the k-means algorithm, using a fixed number of $M$ iterations, for clustering the dataset $\dataset= \{ \vx^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize}$
into two non-overlapping clusters $\mathcal{C}_{0}$, $\mathcal{C}_{1}$ such that each snapshot belongs to exactly one of them.
Let us characterize the ``quality'' of the clusters by how well they separate winter from summer images. To this end, we define the ``purity'' measures
$P_{\rm w}= h \big( \frac{| \cluster_{1} \cap \dataset^{(\rm winter)}|}{| \dataset^{(\rm winter)} |}\big)$ and $P_{\rm s}= h \big( \frac{| \cluster_{1} \cap \dataset^{(\rm summer)}|}{| \dataset^{(\rm summer)} |}\big)$
with the function $h(p) = 1 + p \log_{2} p + (1-p) \log_{2} (1-p)$ (using the convention $0 \log_{2} 0 = 0$). The average purity obtained from the k-means output is then $\bar{P} = (1/2) ( P_{\rm w} + P_{\rm s} )$.
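
As a quick sanity check of the purity function (a worked illustration that is not part of the original problem statement), note that $h(p)$ equals one minus the binary entropy of $p$ measured in bits. With the convention $0 \log_{2} 0 = 0$, the extreme cases are
\[
h(0) = h(1) = 1, \qquad h(1/2) = 1 + \frac{1}{2} \log_{2} \frac{1}{2} + \frac{1}{2} \log_{2} \frac{1}{2} = 0 .
\]
For a hypothetical outcome in which $\cluster_{1}$ contains $9$ out of $10$ winter images (the numbers are chosen for illustration only), we would obtain
\[
P_{\rm w} = h(0.9) = 1 + 0.9 \log_{2} 0.9 + 0.1 \log_{2} 0.1 \approx 1 - 0.137 - 0.332 = 0.531 .
\]
Since $h(p) = h(1-p)$, the purity does not depend on which of the two clusters is labeled $\cluster_{1}$.
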
Implement the k-means algorithm using different numbers $M$ of iterations and plot the average purity $\bar{P}$ obtained for different values of $M$.
For each choice of $M$, run k-means several (say 10) times, and initialize each run with two (independently) randomly selected feature vectors $\vx^{(i)},\vx^{(j)} \in \dataset$ as
the initial cluster means $\vm_{0}$ and $\vm_{1}$. Average the results of the different runs to obtain a single estimate of $\bar{P}$ for each $M$.
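
The procedure above can be sketched in formulas as follows. This is the standard form of k-means, so the notation used in the lecture may differ; the assignment index $c^{(\sampleidx)}$, the number of runs $R$ and the per-run purity $\bar{P}^{(r)}(M)$ are symbols introduced here for illustration only. Given the current means $\vm_{0}$ and $\vm_{1}$, each iteration first assigns every feature vector to its nearest mean,
\[
c^{(\sampleidx)} = \mathop{\rm argmin}_{c \in \{0,1\}} \big\| \vx^{(\sampleidx)} - \vm_{c} \big\|^{2}, \qquad
\cluster_{c} = \big\{ \vx^{(\sampleidx)} : c^{(\sampleidx)} = c \big\},
\]
and then recomputes each mean as the average of the feature vectors in its cluster,
\[
\vm_{c} = \frac{1}{|\cluster_{c}|} \sum_{\vx^{(\sampleidx)} \in \cluster_{c}} \vx^{(\sampleidx)}, \qquad c \in \{0,1\} .
\]
If a cluster happens to be empty, one common choice is to leave its mean unchanged. Writing $\bar{P}^{(r)}(M)$ for the average purity obtained in run $r$ after $M$ iterations, the value to plot for each $M$ would be
\[
\frac{1}{R} \sum_{r=1}^{R} \bar{P}^{(r)}(M) .
\]
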
\noindent {\bf Answer.}
\section{Soft Clustering}
Redo Problem 1 using, instead of the hard clustering algorithm k-means,
the soft clustering algorithm discussed in Lecture 9 (cf.\ slide 35 in \url{}).
We run this soft clustering algorithm for a fixed number $M$ of iterations to obtain, for each
snapshot, the degree $y^{(\sampleidx)}$ to which the $i$th snapshot belongs to $\cluster_{1}$.
A reasonable adaptation of the purity measure of Problem 1 to the soft clustering setting is to use\footnote{For an index $i \in \{1,\ldots,\samplesize\}$, with a slight abuse of notation,
we write $i \in \dataset^{(\rm winter)}$ if the $i$th feature vector $\vx^{(i)}$ represents
a winter image, i.e., $\vx^{(i)} \in \dataset^{(\rm winter)}$.} $P_{\rm w}= h \big( (2/\samplesize) \sum_{\sampleidx \in \dataset^{(\rm winter)}} y^{(\sampleidx)} \big)$ and
$P_{\rm s}= h \big( (2/\samplesize) \sum_{\sampleidx \in \dataset^{(\rm summer)}} y^{(\sampleidx)} \big)$ for computing the average purity $\bar{P} = (1/2) ( P_{\rm w} + P_{\rm s} )$. Implement the soft clustering algorithm
using different numbers $M$ of iterations and plot the average purity $\bar{P}(M)$ as a function of the number of iterations $M$.
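
To see how this soft purity behaves, consider two extreme cases. This is an illustration only, and it assumes that exactly $\samplesize/2$ of the snapshots are winter images, as the normalization $2/\samplesize$ suggests. If the algorithm is fully confident and correct, i.e., $y^{(\sampleidx)} = 1$ for every winter image and $y^{(\sampleidx)} = 0$ for every summer image, then
\[
P_{\rm w} = h \Big( \frac{2}{\samplesize} \cdot \frac{\samplesize}{2} \cdot 1 \Big) = h(1) = 1, \qquad
P_{\rm s} = h(0) = 1, \qquad \bar{P} = 1 .
\]
If the algorithm is completely undecided, i.e., $y^{(\sampleidx)} = 1/2$ for all $\sampleidx$, then
\[
P_{\rm w} = P_{\rm s} = h \Big( \frac{2}{\samplesize} \cdot \frac{\samplesize}{2} \cdot \frac{1}{2} \Big) = h(1/2) = 0, \qquad \bar{P} = 0 .
\]
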
%Since the output of the soft clustering algorithm depends on the initial choice for the cluster means $\vm_{0}$, $\vm_{1}$
%and cluster covariance matrices $\mathbf{C}_{0}$ and $\mathbf{C}_{1}$ (cf. slides of Lecture 9), it is a good idea to run,
For each choice of $M$, repeat the soft clustering algorithm several (say 10) times (runs), as in Problem 1, and use for each run two (independently) randomly selected feature vectors $\vx^{(i)}, \vx^{(j)} \in \dataset$
as the initial cluster means $\vm_{0}$ and $\vm_{1}$. Initialize the covariance matrices with the identity matrix, i.e., $\mathbf{C}_{0} = \mathbf{C}_{1} = \mathbf{I}$.
Average the results obtained from these runs to get one single estimate of $\bar{P}(M)$ for each $M$.
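
As a rough sketch only: assuming that the soft clustering algorithm of Lecture 9 is the usual expectation-maximization iteration for a two-component Gaussian mixture (the lecture slides remain the authoritative description), one iteration could look as follows. The mixing weights $p_{0}, p_{1}$ and the effective cluster sizes $S_{0}, S_{1}$ are symbols introduced here for illustration; a natural but assumed initialization is $p_{0} = p_{1} = 1/2$. With the Gaussian density on $\mathbb{R}^{2}$,
\[
\mathcal{N}(\vx; \vm, \mathbf{C}) = \frac{1}{2 \pi \sqrt{\det \mathbf{C}}} \exp \Big( - \frac{1}{2} (\vx - \vm)^{T} \mathbf{C}^{-1} (\vx - \vm) \Big),
\]
the degrees of belonging are first updated as
\[
y^{(\sampleidx)} = \frac{p_{1} \, \mathcal{N}(\vx^{(\sampleidx)}; \vm_{1}, \mathbf{C}_{1})}{p_{0} \, \mathcal{N}(\vx^{(\sampleidx)}; \vm_{0}, \mathbf{C}_{0}) + p_{1} \, \mathcal{N}(\vx^{(\sampleidx)}; \vm_{1}, \mathbf{C}_{1})} .
\]
Then, with $S_{1} = \sum_{\sampleidx=1}^{\samplesize} y^{(\sampleidx)}$ and $S_{0} = \samplesize - S_{1}$, the cluster parameters are updated as
\[
\vm_{1} = \frac{1}{S_{1}} \sum_{\sampleidx=1}^{\samplesize} y^{(\sampleidx)} \vx^{(\sampleidx)}, \qquad
\mathbf{C}_{1} = \frac{1}{S_{1}} \sum_{\sampleidx=1}^{\samplesize} y^{(\sampleidx)} (\vx^{(\sampleidx)} - \vm_{1}) (\vx^{(\sampleidx)} - \vm_{1})^{T}, \qquad
p_{1} = \frac{S_{1}}{\samplesize},
\]
and analogously for $\vm_{0}$, $\mathbf{C}_{0}$ and $p_{0}$ with $y^{(\sampleidx)}$ replaced by $1 - y^{(\sampleidx)}$ and $S_{1}$ replaced by $S_{0}$.
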
\noindent {\bf Answer.}