Monday, 6 February 2012

K-Means Algorithm



There are two types of learning algorithms: supervised and unsupervised. In supervised learning we have a collection of labelled target data, which lets us provide the right answer to the algorithm for each sample. This isn't always possible: it can be tedious to list all the right answers, or we may not know the right answers in advance. Unsupervised learning algorithms work in situations where no information about the correct outputs is available.

Since we don't have the right answers, we can't derive a function as we do in regression. But if we find a way to exploit similarities between inputs and cluster them on that basis, then classification is done automatically. So in unsupervised learning, the algorithm needs to find clusters of similar inputs, and it needs to figure out the similarity by itself.

The aim of every learning algorithm is to minimize some error function. In supervised learning we use measures like the sum-of-squares error, but these can't be used in unsupervised learning: with no target data, there is no way to measure the error. We can't use any external criteria and need to find something internal to the algorithm. This measure should also be task-independent; otherwise we would need a different measure for each task.

For a general error criterion, it helps to remember that if two inputs are close together then the Euclidean distance between them is small (since their vectors are similar). So inputs that are close to each other can be clustered as being similar. If the elements of a vector are close to the weight values of a node, then that node is a good match for the vector.
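As a quick illustration, the Euclidean distance between two input vectors can be computed like this (a small sketch using NumPy; the vector values are made up):

```python
import numpy as np

# Two hypothetical input vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.5, 2.5, 3.5])

# Euclidean distance: square root of the sum of squared differences
d = np.sqrt(np.sum((a - b) ** 2))
print(d)
```

The smaller this value, the more similar the two inputs are considered to be.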

Suppose the input data needs to be clustered into k categories. An example would be rainfall over the months of January, February and March, in which case k = 3. We need to partition our input space into k clusters and allocate cluster centers such that there is one center at the middle of each cluster. So now we need to think about how to define the cluster centers and the error criterion.

Assuming a Euclidean space, the distance between points is the Euclidean distance and the center of a cluster is its mean. This only holds for Euclidean space, so we will always assume that to be the case. To position the cluster centers, we calculate the mean of the points in each cluster and assign that mean as the cluster center.
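Computing a cluster center is then a per-dimension average of the member points (a small sketch; the points are made-up values):

```python
import numpy as np

# Hypothetical 2-D points assigned to one cluster
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

# The cluster center is the mean of its member points, taken per dimension
center = points.mean(axis=0)
print(center)  # [3. 2.]
```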

We start by placing the cluster centers randomly in the input space. In each iteration of the algorithm, their positions are updated according to the data. For each data point we calculate the Euclidean distance to all the cluster centers and assign the point to the cluster it is closest to. Now that the members of the clusters have changed, we recalculate each mean (cluster center). Then each data point is re-evaluated and assigned the same or a different cluster depending on the Euclidean distance. This process is repeated until the iterations stop producing changes.

Algorithm:
  • Initialization:
    1) Choose a value of k
    2) Choose k random positions in the input space
    3) Assign the cluster centers μj to these positions
  • Learning:
    - Repeat

    1) For each data point xi, calculate the distance to each cluster center and assign the point to the nearest one, i.e. the center μj for which d(xi, μj) is minimal (with j ranging from 1 to k)

    2) For each cluster center, move the center to the mean of all the points in the cluster.
    μj = (1/Nj) Σi xi   (summing over the Nj points xi assigned to cluster j)

    - Until the cluster centers stop moving.
  • Usage: For each test point, compute the distance to each cluster center and assign the point to the nearest one, i.e. the center μj for which d(x, μj) is minimal (with j ranging from 1 to k).
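The steps above can be sketched in Python as follows. This is a minimal NumPy sketch, not the book's reference implementation; the function name and parameter defaults are my own:

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Minimal k-means sketch: data is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct random data points as the starting centers
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Learning step 1: assign each point to its nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Learning step 2: move each center to the mean of its assigned points
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, calling `kmeans` with two well-separated groups of 2-D points and k = 2 should recover one cluster per group.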
That is it for now. I will paste the code for this in Python in some time. I will also post a numerical example. This tutorial is based on the book Machine Learning: An Algorithmic Perspective. Hope you found this post informative.
