
K-means Clustering – the hard way…

December 3, 2007 Posted by Emre S. Tasci

Clustering (or grouping) items using a defined similarity of selected properties is an effective way to classify data, but another (and probably more important) aspect of clustering is picking out the abnormal pieces that differ from the rest of the clusterable "herd". These exceptions may be the very ones we are looking (or not looking) for!

There is a simple clustering method called K-means Clustering. Suppose that you have N particles (atoms/people/books/shapes) defined by 3 properties (coordinates/{height, weight, eye color}/{title, number of sold copies, price}/{number of sides, color, size}). You want to classify them into K groups. One way of doing this involves the K-means Clustering algorithm. I have listed some examples of various items & their properties, but from now on, for clarity and easy visualizing, I will stick to the first one (atoms with 3 coordinates, and while we’re at it, let’s take the coordinates as Cartesian coordinates).

The algorithm is simple: you introduce K means into the midst of the N items, then apply the distance criterion for each mean upon the data and assign each item to the mean nearest to it. Store these relations between the means and the items using a responsibility list (array, vector) for each mean, of size (dimension) N (the number of items).

Let’s define some notation at this point:

  • Designate the position of the kth mean by mk.
  • Designate the position of the jth item by xj.
  • Designate the responsibility list of the kth mean by rk, such that the responsibility between the kth mean and the jth item is stored in the jth element of rk. If rk[j] is equal to 1, then this mean is the closest mean to the item; if it is 0, some other mean is closer, so the item is not included in this mean’s responsibility. For example, with N = 5 items of which the kth mean is nearest to the 2nd and the 4th, rk = (0, 1, 0, 1, 0).

The algorithm can be built in 3 repeating steps (after an initialization step):

0. Initialize the means (randomly or well calculated).

1. Analyze: Check the distances between the means and the items. Assign the items to their nearest mean’s responsibility list.

2. Move: Move the means to the center of mass of the items each is responsible for. (If a mean has no item it is responsible for, then just leave it where it is.)

Formula:

\[
{\mathbf{m}}_k = \frac{{\sum\limits_{i = 1}^N {r_k \left[ i \right] \cdot {\mathbf{x}}_i } }}
{{\sum\limits_{i = 1}^N {r_k \left[ i \right]} }}
\]

3. Check: If one or more means’ positions have changed, go to 1; if not, exit.
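Before moving on to the GSL-based implementation below, here is a minimal, self-contained sketch of these three steps in plain C++ (hypothetical names, fixed 3D points, no GSL), just to fix the ideas:

#include <vector>
#include <cstddef>

struct Point { double x, y, z; };

// Squared Euclidean distance -- enough for nearest-mean comparisons.
static double dist2(const Point &a, const Point &b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx*dx + dy*dy + dz*dz;
}

// Steps 1-3 of the algorithm above; 'items' are the N data points,
// 'means' holds the K current mean positions (already initialized).
void kmeans_iterate(const std::vector<Point> &items,
                    std::vector<Point> &means, int maxIter)
{
    std::vector<size_t> owner(items.size()); // nearest mean per item
    for (int iter = 0; iter < maxIter; iter++)
    {
        // 1. Analyze: assign each item to its nearest mean.
        for (size_t j = 0; j < items.size(); j++)
        {
            size_t best = 0;
            for (size_t k = 1; k < means.size(); k++)
                if (dist2(items[j], means[k]) < dist2(items[j], means[best]))
                    best = k;
            owner[j] = best;
        }
        // 2. Move: each mean goes to the center of mass of its items.
        bool moved = false;
        for (size_t k = 0; k < means.size(); k++)
        {
            Point com = {0.0, 0.0, 0.0};
            size_t cnt = 0;
            for (size_t j = 0; j < items.size(); j++)
                if (owner[j] == k)
                {
                    com.x += items[j].x; com.y += items[j].y;
                    com.z += items[j].z; cnt++;
                }
            if (cnt == 0) continue;          // no items: leave the mean in place
            com.x /= cnt; com.y /= cnt; com.z /= cnt;
            if (dist2(com, means[k]) > 0.0) moved = true;
            means[k] = com;
        }
        // 3. Check: stop as soon as no mean has moved.
        if (!moved) break;
    }
}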

That’s all there is to it. I have written a C++ code for this algorithm. It employs an object called kmean (can you guess what it is for?) which has the following properties and methods:

class kmean
{
 private:
  unsigned int id,n;//"id" is for the unique identifier and "n" is the total number of items in the world.
  double x,y,z;//coordinates of the mean
  gsl_vector * r;//responsibility list of the mean
  char buffer [150];//some dummy variable to use in the "whereis" method
  unsigned int r_tott;//total number of data points this mean is responsible for

 public:
  kmean(int ID,double X, double Y, double Z,unsigned int N);
  ~kmean();
  //kmean self(void);
  int move(double X, double Y, double Z);//moves the mean to the point(X,Y,Z)
  int calculate_distances(gsl_matrix *MatrixIN, int N, gsl_matrix *distances);//calculates the distances of the "N" items with their positions defined in "MatrixIN" and writes these values to the "distances" matrix
  char * whereis(void);//returns the position of the mean in a formatted string
  double XX(void) {return x;}//returns the x-component of the mean’s position
  double YY(void) {return y;}//returns the y-component of the mean’s position
  double ZZ(void) {return z;}//returns the z-component of the mean’s position
  int II(void) {return id;}//returns the mean’s id
  int NN(void) {return n;}//returns the number of items in the world (as the mean knows it)
  int RR(void);//prints out the responsibility list (used for debugging purposes actually)
  gsl_vector * RRV(void) {return r;}//returns the responsibility list
  int set_r(int index, bool included);//sets the "index"th element of the responsibility list to the bool "included" (ie, registers whether the "index"th item belongs to this mean or not)
  int moveCoM(gsl_matrix *MatrixIN);//moves the mean to the center of mass of the items it is responsible for
  int RTOT(void) {return r_tott;}//returns the total number of items the mean is responsible for
  };
int check_distances(gsl_matrix* distances, int numberOfWorldElements, kmean **means);//checks the "distances" matrix elements for the means in the "means" array and registers each item in the responsibility list of the nearest mean.
double distance(double a1, double a2,double a3,double b1,double b2,double b3);//distance criteria is defined with this function
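The bodies of these helpers are in the source files linked below; as an illustration, the distance criterion, if taken as the usual Euclidean one, would be nothing more than (my sketch, not necessarily the original body):

#include <cmath>

// Euclidean distance between points (a1,a2,a3) and (b1,b2,b3) --
// one natural choice for the distance criterion declared above.
double distance(double a1, double a2, double a3,
                double b1, double b2, double b3)
{
    return sqrt((a1-b1)*(a1-b1) + (a2-b2)*(a2-b2) + (a3-b3)*(a3-b3));
}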

Now that we have the essential kmean object, the rest of the code is actually just details. Additionally, I’ve included a random world generator (and by world, I mean N items and their 3 coordinates, don’t be misled). So essentially, in the main code, we have the following lines for the analyze section:

  for (i=0;i<dK;i++)
   means[i]->calculate_distances(m,dN,distances);
  
  check_distances(distances,dN,means);

which first calculates the distances between each mean (defined as elements of the means[] array) and each item (defined as the rows of the m matrix of dimension dNx4; 1st col: id, cols 2, 3, 4: coordinates) and writes the distances into the distances matrix (of dimensions dNxdK) via the calculate_distances(…) method. dN is the total number of items and dK is the total number of means. It then analyzes the distances matrix, finds the minimum of each row (because rows represent the items and columns represent the means) and registers the item as 1 in the nearest mean’s responsibility list, and as 0 in the other means’ lists, via the check_distances(…) method.
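A possible shape for check_distances(…), consistent with that description (a sketch; the actual source linked below may differ):

#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

int check_distances(gsl_matrix *distances, int numberOfWorldElements, kmean **means)
{
    size_t dK = distances->size2;            // one column per mean
    for (int j = 0; j < numberOfWorldElements; j++)
    {
        // Find the column (mean) with the minimum distance in row j...
        gsl_vector_view row = gsl_matrix_row(distances, j);
        size_t nearest = gsl_vector_min_index(&row.vector);
        // ...and register item j accordingly in every responsibility list.
        for (size_t k = 0; k < dK; k++)
            means[k]->set_r(j, k == nearest);
    }
    return 0;
}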

Then, the code executes its move section:

  for(i=0; i<dK; i++)
  {
   means[i]->moveCoM(m);
   […]
  }

which simply moves the means to the center of mass of the items they’re responsible for. This section is followed by the checkSufficiency section which simply decides if we should go for one more iteration or just call it quits.
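moveCoM(…) could be realized along these lines, averaging the coordinates of the items flagged in the responsibility list (again my sketch, inferred from the description):

// Move the mean to the center of mass of the items it is responsible
// for; columns 1..3 of MatrixIN hold the coordinates, column 0 the id.
int kmean::moveCoM(gsl_matrix *MatrixIN)
{
    double cx = 0.0, cy = 0.0, cz = 0.0;
    r_tott = 0;
    for (unsigned int j = 0; j < n; j++)
    {
        if (gsl_vector_get(r, j) != 1.0) continue; // not ours, skip
        cx += gsl_matrix_get(MatrixIN, j, 1);
        cy += gsl_matrix_get(MatrixIN, j, 2);
        cz += gsl_matrix_get(MatrixIN, j, 3);
        r_tott++;
    }
    if (r_tott > 0)                          // a mean with no items stays put
        move(cx / r_tott, cy / r_tott, cz / r_tott);
    return 0;
}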

  counter++;
  if(counter>=dMaxCounter) {f_maxed = true;goto finalize;}

  gsl_matrix_sub(prevMeanPositions,currMeanPositions);
  if(gsl_matrix_isnull(prevMeanPositions)) goto finalize;
  […]
  goto analyze;

As you can see, it checks for two things: whether a predefined iteration limit has been reached, or whether at least one (two 😉) of the means moved in the last iteration. finalize is the section where some tidying up is done (freeing the memory, closing the open files, etc…).
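The two position matrices are presumably refreshed once per iteration; one way to do that bookkeeping, using the accessor methods above (a sketch, not the verbatim source):

// Remember where the means were, then record where they are now,
// so that the subtraction above yields a null matrix iff nothing moved.
gsl_matrix_memcpy(prevMeanPositions, currMeanPositions);
for (i = 0; i < dK; i++)
{
    gsl_matrix_set(currMeanPositions, i, 0, means[i]->XX());
    gsl_matrix_set(currMeanPositions, i, 1, means[i]->YY());
    gsl_matrix_set(currMeanPositions, i, 2, means[i]->ZZ());
}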

The code uses two external libraries: the GNU Scientific Library (GSL) and the Templatized C++ Command Line Parser Library (TCLAP). While GSL is essential for the matrix operations and the structs I have used intensely, TCLAP provides a practical and efficient solution for argument parsing. If you have enough reasons to be reading this blog and this entry up to this line and (a tiny probability, but anyway) have not heard about GSL before, you MUST check it out. As for TCLAP, it has become a habit for me since it is very easy to use: it lets you accept external variables defined on the command line even if you know nothing about argument parsing. I strongly recommend it (and thanks to Michael E. Smoot again 8).
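To give its flavor, here is roughly how a couple of this program’s arguments would be declared with TCLAP (a sketch based on the usage text below, not the verbatim source):

#include <tclap/CmdLine.h>
#include <iostream>

int main(int argc, char **argv)
{
    try
    {
        TCLAP::CmdLine cmd("K-Means", ' ', "1.0");
        // Each ValueArg: flag, long name, description, required?, default, type label.
        TCLAP::ValueArg<int> nArg("n", "numberofelements",
                                  "total number of elements", true, 0, "int");
        TCLAP::ValueArg<int> kArg("k", "numberofmeans",
                                  "total number of means", true, 0, "int");
        cmd.add(nArg);
        cmd.add(kArg);
        cmd.parse(argc, argv);
        int dN = nArg.getValue();            // values arrive ready to use
        int dK = kArg.getValue();
        std::cout << dN << " items, " << dK << " means\n";
    }
    catch (TCLAP::ArgException &e)
    {
        std::cerr << "error: " << e.error() << " for arg " << e.argId() << std::endl;
    }
    return 0;
}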

The program can be invoked with the following arguments:

D:\source\cpp\Kmeans>kmeans --help

USAGE:

   kmeans  -n <int> -k <int> [-s <unsigned long int>] [-o <output file>]
           [-m <int>] [-q] [--] [--version] [-h]

Where:

   -n <int>,  --numberofelements <int>
     (required)  total number of elements (required)

   -k <int>,  --numberofmeans <int>
     (required)  total number of means (required)

   -s <unsigned long int>,  --seed <unsigned long int>
     seed for random number generator (default: timer)

   -o <output file>,  --output <output file>
     output file basename (default: "temp")

   -m <int>,  --maxcounter <int>
     maximum number of steps allowed (default: 30)

   -q,  --quiet
     no output to screen is generated

   --,  --ignore_rest
     Ignores the rest of the labeled arguments following this flag.

   --version
     Displays version information and exits.

   -h,  --help
     Displays usage information and exits.

   K-Means by Emre S. Tasci, <…@tudelft.nl> 2007

The code outputs various files:

<project-name>.means.txt : Evolution of the means’ coordinates over the iterations
<project-name>.meansfinal.txt : Final coordinates of the means
<project-name>.report : A summary of the process
<project-name>.world.dat : The "world" matrix in Mathematica Write format
<project-name>.world.txt : Coordinates of the items (ie, the "world" matrix in XYZ format)
<project-name>-mean-#-resp.txt : Coordinates of the items belonging to the #th mean

And one last remark for those who are not that much accustomed to coding in C++: do not forget to include the "include" directories!

Check the following command and modify it according to your directory locations (I do it in the Windows way 🙂):

g++ -o kmeans.exe kmeans.cpp -I c:\GnuWin32\include  -L c:\GnuWin32\lib -lgsl -lgslcblas -lm
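Once compiled, a run like the 4000-item, 20-mean example below could be launched with an invocation along these lines (the flag values here are just an illustration):

kmeans -n 4000 -k 20 -m 50 -o big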

(If you are not keen on modifying the code and your operating system is Windows, you also have the option to download the compiled binary from here)

The source files can be obtained from here.

And, the output of an example with 4000 items and 20 means can be obtained here, while the output of a tinier example with 100 items and 2 means can be obtained here.

I’m planning to continue with the soft version of the K-Means algorithm for the next time…

References: David J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press (2003)