## Some “trivial” derivations

### December 4, 2007 Posted by Emre S. Tasci

Information Theory, Inference, and Learning Algorithms by David MacKay, Exercise 22.5:

A random variable *x* is assumed to have a probability distribution that is a *mixture of two Gaussians*,

$$P(x \mid \mu_1, \mu_2, \sigma) = \sum_{k=1}^{2} p_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu_k)^2}{2\sigma^2}\right),$$

where the two Gaussians are given the labels *k* = 1 and *k* = 2; the prior probability of the class label *k* is {p_1 = 1/2, p_2 = 1/2}; {μ_1, μ_2} are the means of the two Gaussians; and both have standard deviation σ. For brevity, we denote these parameters by θ ≡ {μ_1, μ_2}.

A data set consists of *N* points which are assumed to be independent samples from the distribution. Let *k_n* denote the unknown class label of the *n*th point.

Assuming that μ_1, μ_2, and σ are known, show that the posterior probability of the class label *k_n* of the *n*th point can be written as

$$P(k_n = 1 \mid x_n, \theta) = \frac{1}{1 + e^{-(w_1 x_n + w_0)}},$$

(with P(k_n = 2 | x_n, θ) = 1 − P(k_n = 1 | x_n, θ)), and give expressions for w_1 and w_0.
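The posterior of the class label can be checked numerically. Below is a minimal Python sketch comparing direct Bayes' rule with equal priors against the equivalent logistic (sigmoid) form 1/(1 + exp(−(w_1 x + w_0))) with w_1 = (μ_1 − μ_2)/σ² and w_0 = (μ_2² − μ_1²)/(2σ²); the example values σ = 1, μ_1 = 1, μ_2 = 3 are my own choices, not from the exercise:

```python
import math

def gauss(x, mu, sigma=1.0):
    # Normal density N(x; mu, sigma)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior_direct(x, mu1, mu2, sigma=1.0):
    # P(k=1 | x) by Bayes' rule with equal priors p1 = p2 = 1/2
    a, b = 0.5 * gauss(x, mu1, sigma), 0.5 * gauss(x, mu2, sigma)
    return a / (a + b)

def posterior_logistic(x, mu1, mu2, sigma=1.0):
    # The same posterior rewritten in sigmoid form: 1 / (1 + exp(-(w1*x + w0)))
    w1 = (mu1 - mu2) / sigma ** 2
    w0 = (mu2 ** 2 - mu1 ** 2) / (2 * sigma ** 2)
    return 1.0 / (1.0 + math.exp(-(w1 * x + w0)))

# the two forms agree for any test point
for x in [-1.0, 0.0, 2.0, 3.5]:
    assert abs(posterior_direct(x, 1.0, 3.0) - posterior_logistic(x, 1.0, 3.0)) < 1e-12
```

Points closer to μ_1 get posterior above 1/2 for class 1, as expected.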

**Derivation:**

By Bayes' rule, with p_1 = p_2 = 1/2,

$$P(k_n = 1 \mid x_n, \theta) = \frac{e^{-(x_n-\mu_1)^2/2\sigma^2}}{e^{-(x_n-\mu_1)^2/2\sigma^2} + e^{-(x_n-\mu_2)^2/2\sigma^2}} = \frac{1}{1 + e^{-(w_1 x_n + w_0)}},$$

with

$$w_1 = \frac{\mu_1 - \mu_2}{\sigma^2}, \qquad w_0 = \frac{\mu_2^2 - \mu_1^2}{2\sigma^2}.$$

Assume now that the means μ_1 and μ_2 are *not* known, and that we wish to infer them from the data {x_n}. (The standard deviation σ is known.) In the remainder of this question we will derive an iterative algorithm for finding values for θ = {μ_1, μ_2} that maximize the likelihood,

$$P(\{x_n\} \mid \theta, \sigma) = \prod_n P(x_n \mid \theta, \sigma).$$

Let *L* denote the natural log of the likelihood. Show that the derivative of the log likelihood with respect to μ_k is given by

$$\frac{\partial L}{\partial \mu_k} = \sum_n P(k_n = k \mid x_n, \theta)\, \frac{x_n - \mu_k}{\sigma^2},$$

where P(k_n = k | x_n, θ) appeared above.
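This gradient formula can be verified against a finite difference of the log likelihood; here is a Python sketch (the sample points and the values μ_1 = 1, μ_2 = 5 are illustrative assumptions of mine):

```python
import math

def log_likelihood(xs, mu1, mu2, sigma=1.0):
    # L = sum_n ln( 1/2 N(x_n; mu1) + 1/2 N(x_n; mu2) )
    total = 0.0
    for x in xs:
        g1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
        g2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
        total += math.log(0.5 * (g1 + g2) / math.sqrt(2 * math.pi * sigma ** 2))
    return total

def dL_dmu1(xs, mu1, mu2, sigma=1.0):
    # Analytic form: sum_n P(k_n=1 | x_n) * (x_n - mu1) / sigma^2
    total = 0.0
    for x in xs:
        g1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
        g2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
        total += (g1 / (g1 + g2)) * (x - mu1) / sigma ** 2
    return total

xs = [0.2, 0.9, 1.4, 4.1, 5.0, 5.6]   # made-up sample points
h = 1e-6
# central finite difference of L with respect to mu1
numeric = (log_likelihood(xs, 1.0 + h, 5.0) - log_likelihood(xs, 1.0 - h, 5.0)) / (2 * h)
assert abs(numeric - dL_dmu1(xs, 1.0, 5.0)) < 1e-5
```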

**Derivation:**

$$L = \sum_n \ln \left[ \sum_k p_k \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x_n-\mu_k)^2/2\sigma^2} \right], \qquad \frac{\partial L}{\partial \mu_k} = \sum_n \frac{p_k\, e^{-(x_n-\mu_k)^2/2\sigma^2}}{\sum_{k'} p_{k'}\, e^{-(x_n-\mu_{k'})^2/2\sigma^2}}\, \frac{x_n - \mu_k}{\sigma^2} = \sum_n P(k_n = k \mid x_n, \theta)\, \frac{x_n - \mu_k}{\sigma^2}.$$

Assuming that σ = 1, sketch a contour plot of the likelihood function as a function of μ_1 and μ_2 for the data set shown above. The data set consists of 32 points. Describe the peaks in your sketch and indicate their widths.

**Solution:**

We will be trying to plot the function

$$P(\{x_n\} \mid \mu_1, \mu_2) = \prod_{n=1}^{32} \frac{1}{2} \left[ p(x_n, \mu_1) + p(x_n, \mu_2) \right].$$

If we designate the function

$$p(x, \mu) = \frac{1}{\sqrt{2\pi}}\, e^{-(\mu - x)^2/2}$$

as **p[x,mu]** (remember that σ = 1 and p_1 = p_2 = 1/2), then we have the likelihood as the product above.

And in Mathematica, these mean:

mx=Join[N[Range[0,2,2/15]],N[Range[4,6,2/15]]]

Length[mx]

ListPlot[Table[{mx[[i]],1},{i,1,32}]]

p[x_,mu_]:=0.3989422804014327` * Exp[-(mu-x)^2/2];

pp[x_,mu1_,mu2_]:=.5 (p[x,mu1]+p[x,mu2]);

ppp[xx_,mu1_,mu2_]:=Module[

{ptot=1},

For[i=1,i<=Length[xx],i++,

ppar = pp[xx[[i]],mu1,mu2];

ptot *= ppar;

(*Print[xx[[i]],"\t",ppar];*)

];

Return[ptot];

];

Plot3D[ppp[mx,mu1,mu2],{mu1,0,6},{mu2,0,6},PlotRange->{0,10^-25}];

ContourPlot[ppp[mx,mu1,mu2],{mu1,0,6},{mu2,0,6},{PlotRange->{0,10^-25},ContourLines->False,PlotPoints->250}];(*It may take a while with PlotPoints->250, so just begin with PlotPoints->25 *)

That’s all folks! (for today I guess 😎) (and also, I know that I said the next entry would be about the soft K-means two entries ago, but believe me, we’re coming to that, eventually 😉)

Attachments: Mathematica notebook for this entry, MSWord Document (actually this one is intended for me, because in the future I may need them again)

## Likelihood of Gaussian(s) – Scrap Notes

### December 3, 2007 Posted by Emre S. Tasci

Given a set of N data points {x_n}, the optimal parameters for a Gaussian Probability Distribution Function defined as:

$$P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

are:

$$\mu = \bar{x}, \qquad \sigma = \sqrt{S/N},$$

with the definitions

$$\bar{x} = \frac{1}{N}\sum_n x_n, \qquad S = \sum_n (x_n - \bar{x})^2.$$

Let’s see this in an example:

**Define the data set mx:**

mx={1,7,9,10,15}

**Calculate the optimal mu and sigma:**

dN=Length[mx];

mu=Sum[mx[[i]]/dN,{i,1,dN}];

sig =Sqrt[Sum[(mx[[i]]-mu)^2,{i,1,dN}]/dN];

Print["mu = ",N[mu]];

Print["sigma = ",N[sig]];
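For comparison, the same maximum-likelihood estimates can be computed in plain Python (a minimal sketch of the same calculation, nothing Mathematica-specific):

```python
import math

mx = [1, 7, 9, 10, 15]                # the data set from the example above
dN = len(mx)
mu = sum(mx) / dN                     # optimal mean: the sample average
# ML estimate of sigma divides by N (not N-1)
sig = math.sqrt(sum((x - mu) ** 2 for x in mx) / dN)
print("mu =", mu)                     # 8.4
print("sigma =", sig)                 # ~4.5431
```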

**Now, let’s see this Gaussian Distribution Function:**

<<Statistics`NormalDistribution`

ndist=NormalDistribution[mu,sig];

MultipleListPlot[Table[{x,PDF[NormalDistribution[mu,sig],x]}, {x,0,20,.04}],Table[{mx[[i]], PDF[NormalDistribution[mu,sig],mx[[i]]]},{i,5}], {PlotRange->{Automatic,{0,.1}},PlotJoined->{False,False}, SymbolStyle->{GrayLevel[.8],GrayLevel[0]}}]

## K-means Clustering – the hard way…

### Posted by Emre S. Tasci

Clustering (or grouping) items using a defined similarity of selected properties is an effective way to classify the data, but another (and probably more important) aspect of clustering is to pick out the abnormal pieces that are different from the rest of the clusterable "herd". These exceptions may be the very ones we’re looking (or not looking) for!

There is this simple clustering method called *K-means Clustering*. Suppose that you have N particles (atoms/people/books/shapes) defined by 3 properties (coordinates/{height, weight, eye color}/{title, number of sold copies, price}/{number of sides, color, size}). You want to classify them into K groups. A way of doing this involves the K-means Clustering algorithm. I have listed some examples of various items & their properties, but from now on, for clarity and easy visualizing, I will stick to the first one (atoms with 3 coordinates, and while we’re at it, let’s take the coordinates as Cartesian coordinates).

The algorithm is simple: You introduce K means into the midst of the N items. Then apply the distance criterion for each mean upon the data and relate each item to the mean nearest to it. Store these relations between the means and the items using a *responsibility* list (array, vector) for each mean, of size (dimension) *N* (the number of items).

Let’s define some notation at this point:

- Designate the position of the *k*th mean by **m**_k.
- Designate the position of the *j*th item by **x**_j.
- Designate the responsibility list of the *k*th mean by **r**_k, such that the responsibility between the *k*th mean and the *j*th item is stored in the *j*th element of **r**_k. If **r**_k[j] is equal to 1, then this mean is the closest mean to the item; 0 means that some other mean is closer to this item, so it is not included in this mean’s responsibility.

The algorithm can be built in 3 steps:

0. Initialize the means (randomly or well calculated).

1. Analyze: Check the distances between the means and the items. Assign the items to their nearest mean’s responsibility list.

2. Move: Move the means to the center of mass of the items each is responsible for. (If a mean has no item it is responsible for, then just leave it where it is.)

3. Check: If one or more means’ positions have changed, then go to 1; if not, exit.
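Before going into the C++ version, the steps above can be sketched very compactly in Python (a minimal illustration; the two-blob test data and the function shape are my own, and the "check" step here tests whether any assignment changed, which is equivalent to testing whether any mean will move):

```python
import random

def kmeans(items, k, max_steps=30, seed=0):
    """Hard K-means on a list of (x, y, z) points; returns (means, assignment)."""
    rnd = random.Random(seed)
    means = [list(p) for p in rnd.sample(items, k)]      # step 0: initialize
    assignment = [None] * len(items)
    for _ in range(max_steps):
        # step 1 (analyze): assign each item to its nearest mean
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            for p in items
        ]
        if new_assignment == assignment:                 # step 3 (check): nothing changed
            break
        assignment = new_assignment
        # step 2 (move): move each mean to the center of mass of its items
        for j in range(k):
            mine = [p for p, a in zip(items, assignment) if a == j]
            if mine:                                     # leave empty means where they are
                means[j] = [sum(c) / len(mine) for c in zip(*mine)]
    return means, assignment

# two well-separated blobs as a test world
pts = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
means, assignment = kmeans(pts, 2)
```

With the two blobs above, the algorithm ends with one mean per blob regardless of initialization.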

That’s all about it. I have written a C++ code for this algorithm. It employs an object called **kmean** (can you guess what it is for?) which has the properties and methods as:

class kmean

{

private:

unsigned int id,n;//"id" is for the unique identifier and "n" is the total number of items in the world.

double x,y,z;//coordinates of the mean

gsl_vector * r;//responsibility list of the mean

char buffer [150];//some dummy variable to use in the "whereis" method

unsigned int r_tott;//total number of data points this mean is responsible for

public:

kmean(int ID,double X, double Y, double Z,unsigned int N);

~kmean();

//kmean self(void);

int move(double X, double Y, double Z);//moves the mean to the point(X,Y,Z)

int calculate_distances(gsl_matrix *MatrixIN, int N, gsl_matrix *distances);//calculates the distances of the "N" items with their positions defined in "MatrixIN" and writes these values to the "distances" matrix

char * whereis(void);//returns the position of the mean in a formatted string

double XX(void) {return x;}//returns the x-component of the mean’s position

double YY(void) {return y;}//returns the y-component of the mean’s position

double ZZ(void) {return z;}//returns the z-component of the mean’s position

int II(void) {return id;}//returns the mean’s id

int NN(void) {return n;}//returns the number of items in the world (as the mean knows it)

int RR(void);//prints out the responsibility list (used for debugging purposes actually)

gsl_vector * RRV(void) {return r;}//returns the responsibility list

int set_r(int index, bool included);//sets the "index"th element of the responsibility list as the bool "included". (ie, registers the "index". item as whether it belongs or not belongs to the mean)

int moveCoM(gsl_matrix *MatrixIN);//moves the mean to the center of mass of the items it is responsible for

int RTOT(void) {return r_tott;}//returns the total number of items the mean is responsible for

};

int check_distances(gsl_matrix* distances, int numberOfWorldElements, kmean **means);//checks the "distances" matrix elements for the means in the "means" array and registers each item in the responsibility list of the nearest mean.

double distance(double a1, double a2,double a3,double b1,double b2,double b3);//distance criteria is defined with this function

Now that we have the essential kmean object, the rest of the code is actually just details. Additionally, I’ve included a random world generator (and by world, I mean N items and their 3 coordinates, don’t be misled). So essentially, in the main code, we have the following lines for the **analyze** section:

for (i=0;i<dK;i++)

means[i]->calculate_distances(m,dN,distances);

check_distances(distances,dN,means);

which first calculates the distances between each mean (defined as elements of the *means*[] array) and each item (defined as the rows of the *m* matrix of dimensions *dN*x4; 1st col: id, cols 2,3,4: coordinates) and writes the distances into the *distances* matrix (of dimensions *dN*x*dK*) via the *calculate_distances(…)* method. *dN* is the total number of items and *dK* is the total number of means. It then analyzes the *distances* matrix, finds the minimum of each row (because rows represent the items and columns represent the means) and registers each item as 1 in the nearest mean’s responsibility list, and as 0 in the other means’ lists, via the *check_distances(…)* method.

Then, the code executes its **move** section:

for(i=0; i<dK; i++)

{

means[i]->moveCoM(m);

* […]*

}

which simply moves the means to the center of mass of the items they’re responsible for. This section is followed by the **checkSufficiency** section which simply decides if we should go for one more iteration or just call it quits.

counter++;

if(counter>=dMaxCounter) {f_maxed = true;goto finalize;}

gsl_matrix_sub(prevMeanPositions,currMeanPositions);

if(gsl_matrix_isnull(prevMeanPositions)) goto finalize;

* […]*

goto analyze;

as you can see, it checks for two things: whether a predefined iteration limit has been reached, or whether at least one (two 😉) of the means moved in the last iteration. **finalize** is the section where some tidying up is done (freeing up the memory, closing of the open files, etc…)

The code uses two external libraries: the GNU Scientific Library (GSL) and the Templatized C++ Command Line Parser Library (TCLAP). While GSL is essential for the matrix operations and the structs I have used intensely, TCLAP provides a practical and efficient solution for argument parsing. If you have enough reasons for reading this blog and this entry up to this line and (a tiny probability but, anyway) have not heard about GSL before, you MUST check it out. For TCLAP, I can say that it has become a habit for me since it is very easy to use. As for what it does, it lets you handle external variables that may be defined at command invocation even if you know nothing about argument parsing. I strongly recommend it (and thanks to Michael E. Smoot again 😎).

The program can be invoked with the following arguments:

D:\source\cpp\Kmeans>kmeans --help

USAGE:

kmeans -n <int> -k <int> [-s <unsigned long int>] [-o <output file>]

[-m <int>] [-q] [--] [--version] [-h]

Where:

-n <int>, --numberofelements <int>

(required) total number of elements

-k <int>, --numberofmeans <int>

(required) total number of means

-s <unsigned long int>, --seed <unsigned long int>

seed for random number generator (default: timer)

-o <output file>, --output <output file>

output file basename (default: "temp")

-m <int>, --maxcounter <int>

maximum number of steps allowed (default: 30)

-q, --quiet

no output to screen is generated

--, --ignore_rest

Ignores the rest of the labeled arguments following this flag.

--version

Displays version information and exits.

-h, --help

Displays usage information and exits.

K-Means by Emre S. Tasci, <…@tudelft.nl> 2007

The code outputs various files:

<project-name>.means.txt : Coordinates of the means evolution in the iterations

<project-name>.meansfinal.txt : Final coordinates of the means

<project-name>.report : A summary of the process

<project-name>.world.dat : The "world" matrix in Mathematica Write format

<project-name>.world.txt : Coordinates of the items (ie, the "world" matrix in XYZ format)

<project-name>-mean-#-resp.txt : Coordinates of the items belonging to the #. mean

and one last remark to those who are *not that much* accustomed to coding in C++: Do not forget to include the "include" directories!

Check the following command and modify it according to your directory locations (I *do it* in the Windows way 🙂):

**g++ -o kmeans.exe kmeans.cpp -I c:\GnuWin32\include -L c:\GnuWin32\lib -lgsl -lgslcblas -lm**

(If you are not keen on modifying the code and your operating system is Windows, you also have the option to download the compiled binary from here)

The source files can be obtained from here.

And, the output of an example with 4000 items and 20 means can be obtained here, while the output of a tinier example with 100 items and 2 means can be obtained here.

I’m planning to continue with the *soft* version of the K-Means algorithm for the next time…

References: Information Theory, Inference, and Learning Algorithms by David MacKay

## Bayesian Probabilities & Inference – Another example

### November 27, 2007 Posted by Emre S. Tasci

Yesterday, I told my *battalion-buddy*, Andy about the pregnancy example and, he recalled another example, with the shuffling of the three cups and a ball in one of them. The situation is proposed like this:

The dealer puts the ball in one of the cups and shuffles them so fast that at some point you lose track and have absolutely no idea which cup has the ball. So, you pick a cup randomly (let’s say the first cup) but, for the moment, don’t look inside. At this point, the dealer says he will remove one of the remaining two cups, and he guarantees you that the one he removes does not have the ball. He even bestows on you the opportunity to change your mind and pick the other one if you like. The question is this: Should you

a) Stick to the one you’ve selected or,

b) Switch to the other one or,

c) (a) or (b) – what does it matter?

Today, while I was studying MacKay’s book *Information Theory, Inference, and Learning Algorithms*, I came upon the same example and couldn’t refrain from including it on this blog.

First, let’s solve the problem with simple notation. Say that we have initially picked the first cup; the probability that this cup has the ball is 1/3. There are two possibilities: either the 1st cup really has the ball, or it does not.

i) 1st cup has the ball (1/3 of all the times): In this situation, it doesn’t matter which of the two remaining cups is removed.

ii) 1st cup doesn’t have the ball (2/3 of all the times): This time, the dealer removes the one without the ball, and the remaining cup has the ball.

So, to summarize: left with the two cups, there’s a 1/3 probability that the ball is within the cup we chose in the beginning, while there’s a 2/3 probability that the ball is within the other cup. So, switching is advantageous (did I spell it correctly? Don’t think so..) / To put it in other words: option (b) must be preferred to option (a) and, actually, option (c) is wrong.

Now, let’s compute the same thing with the formal notations:

Let H_i denote the situation that the ball is in cup *i*. D denotes the removed cup (either 2 or 3). Now, initially the ball can be in any of the three cups with equal probabilities, so

$$P(H_1) = P(H_2) = P(H_3) = \frac{1}{3}.$$

Now, since we have selected the 1st cup, either the 2nd or the 3rd cup will be removed. And we have three cases for the ball position:

| | H_1 | H_2 | H_3 |
|---|---|---|---|
| P(D=2 \| H_i) | 1/2 | 0 | 1 |
| P(D=3 \| H_i) | 1/2 | 1 | 0 |

The first column represents the case where the ball is in our selected cup, so it doesn’t matter which of the remaining cups the dealer will remove: each of the remaining cups has the same probability of being removed, which is 0.5. The second column is the situation when the ball is in the second cup. Since it is in the second cup, the dealer is forced to remove the third cup (ie, P(D=3|H_2)=1). The third column is similar to the second column.

Let’s suppose that the dealer removed the third cup. With the 3rd cup removed, the probability that the ball is in the *i*th cup is given by (remember that H_i represents the situation that the ball is in the *i*th cup):

$$P(H_i \mid D=3) = \frac{P(D=3 \mid H_i)\, P(H_i)}{P(D=3)}.$$

for our purposes, P(D=3) is not relevant, because we are looking for the comparison of the first two probabilities (P(D=3) = 0.5, by the way since it is either D=2 or D=3).

$$\frac{P(H_2 \mid D=3)}{P(H_1 \mid D=3)} = \frac{P(D=3 \mid H_2)}{P(D=3 \mid H_1)} = \frac{1}{1/2} = 2,$$

meaning that it is 2 times more likely that the ball is in the other cup than in the one we initially selected.
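The 1/3 vs 2/3 split can also be checked with a quick Monte Carlo simulation; here is a Python sketch (the trial count and seed are arbitrary choices of mine):

```python
import random

def play(switch, rnd):
    # one round of the three-cup game; returns True if we end up with the ball
    cups = [1, 2, 3]
    ball = rnd.choice(cups)
    pick = 1                                   # we always start by picking cup 1
    # the dealer removes a cup that is neither our pick nor the ball
    removable = [c for c in cups if c != pick and c != ball]
    removed = rnd.choice(removable)
    if switch:
        pick = next(c for c in cups if c != pick and c != removed)
    return pick == ball

rnd = random.Random(42)
trials = 100_000
stick = sum(play(False, rnd) for _ in range(trials)) / trials   # ~1/3
swap = sum(play(True, rnd) for _ in range(trials)) / trials     # ~2/3
```

Sticking wins about one third of the time and switching about two thirds, matching the calculation above.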

As with John, Jo’s funny boyfriend in the previous Bayesian example, who was used to show the unlikeliness of the "presumed" logical derivation, MacKay introduces the concept of one million cups instead of three. Suppose you have selected one cup among the one million cups. Then the dealer removes 999,998 cups and you’re left with the one you initially chose and the 432,238th cup. Would you *now* switch, or stick to the one you had chosen?

## Bayesian Inference – An introduction with an example

### November 26, 2007 Posted by Emre S. Tasci

Suppose that Jo (of Example 2.3 in MacKay’s previously mentioned book) decides to take a test to see whether she’s pregnant or not. She applies a test that is 95% reliable; that is, if she’s indeed pregnant, then there is a 5% chance that the test will say otherwise, and if she’s indeed NOT pregnant, the test will tell her that she’s pregnant, again with a 5% chance. (The other two options are the test concluding positive on pregnancy when she’s indeed pregnant, 95% of the time, and the test reporting negative on pregnancy when she’s actually not pregnant, again 95% of the time.) Suppose also that she is using a contraceptive drug with a 1% failure rate.

**Now comes the question: Jo takes the test and the test says that she’s pregnant. Now what is the probability that she’s indeed pregnant?**

I would definitely not write down this example if the answer were 95%, as you may or may not have guessed, but it really is tempting to guess the probability as 95% the first time.

The solution (as given in the aforementioned book) is:

$$P(\text{preg}{:}1 \mid \text{test}{:}1) = \frac{P(\text{test}{:}1 \mid \text{preg}{:}1)\, P(\text{preg}{:}1)}{P(\text{test}{:}1 \mid \text{preg}{:}1)\, P(\text{preg}{:}1) + P(\text{test}{:}1 \mid \text{preg}{:}0)\, P(\text{preg}{:}0)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 0.16,$$

where P(b:b_j | a:a_i) represents the probability of **b** having the value b_j given that **a** = a_i. So Jo has P(preg:1|test:1) ≈ 16%, meaning that given that the test came out positive, Jo is actually pregnant with a probability of only 16%. So we took into account both the test’s and the contraceptive’s reliabilities. If this doesn’t make sense and you still want to stick with the *somehow more logical looking* 95%, think of the same example but this time starring John, Jo’s humorous boyfriend, who as a joke applied the test and came out with a positive result on pregnancy. Now, do you still say that John is 95% pregnant? I guess not. Just plug in 0 for P(preg:1) in the equation above and enjoy the resulting likelihood of John being pregnant equaling 0…
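Plugging in the numbers takes a couple of lines of Python (a minimal sketch; the function name is mine):

```python
def posterior_pregnant(p_prior=0.01, reliability=0.95):
    # P(preg | test+) = P(test+ | preg) P(preg) / P(test+)
    # where P(test+) = P(test+ | preg) P(preg) + P(test+ | not preg) P(not preg)
    p_pos = reliability * p_prior + (1 - reliability) * (1 - p_prior)
    return reliability * p_prior / p_pos

p = posterior_pregnant()          # ~0.161, i.e. about 16%
john = posterior_pregnant(p_prior=0.0)   # John's prior is 0, so the posterior is 0
```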

The thing to keep in mind is that the probability of *a* being some value *a_i* given that *b* is *b_j* is **not** equal to the probability of *b* being *b_j* given that *a* is *a_i*: P(a:a_i | b:b_j) ≠ P(b:b_j | a:a_i).