Hex, Bugs and More Physics | Emre S. Tasci » Blog Archive » A Crash Course on Information Gain & some other DataMining terms and How To Get There

A Crash Course on Information Gain & some other DataMining terms and How To Get There

November 21, 2007 Posted by Emre S. Tasci

Suppose that you have -let’s say- 5 pipelines outputting Boolean values seemingly randomly such as:

Now, we will be looking for a possible relation between these variables -let’s name that {a,b,c,d,e}ï¿½ (Suppose that we do not know that the first 3 variables (a,b,c)ï¿½ are randomly tossed, while the 4th (d)ï¿½ is just “a AND b” and the 5th one (e) is actually “NOT a”).

We look for different possible values of each variable and store them in an array called posVals (as obvius, from “Possible Values”). In our example, each of the variables can have only two values, so our posVals array will look something like this:
Now that we know the possible outcomes for each of the variables, we calculate the probability of each of these values by simply counting them and then dividing to the total number of output. As an example, for the 3rd variable,ï¿½ there are four 0’s and 6 1’s totaling 10 values. So, the probability of a chosen value to be 0 would be 4 out of 10 and 6/10 for 1. We store the probabilities in an array called probs this time which is like
ï¿½
In theï¿½ 3rd row, the probabilities for the c variable are stored in the order presented in the posVals array (check the 3rd row of posVals to see that the first value is 1 and the second is 0). You can object the unnecessity of storing the second row of probs because, the second row is always equal to 1-probs[[1]] (I will be using the Mathematica notation for array elements, with a[[1]] meaning the 1st row if a is a matrix, or 1st element if a is a 1-D List; a[[1]][[3]] meaning the element on the first row and third column of the matrix a), but it is only a specific case that we are dealing with Boolean outputs. Yes,ï¿½ a variable’s probability always equals to the sum of the probabilities of all the other variables subtracted from 1 but we will keep record of all the probabilities for practical reasons.
Now we will calculate the entropy for each variable. Entropy is a measure of order in a distribution and a higher entropy corresponds to a more disordered system. For our means, we will calculate the entropy as
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamisaiaacI % cacaWGybGaaiykaiabg2da9iabgkHiTiaadchadaWgaaWcbaGaaGym % aaqabaGcciGGSbGaai4BaiaacEgadaWgaaWcbaGaaGOmaaqabaGcca % WGWbWaaSbaaSqaaiaaigdaaeqaaOGaeyOeI0IaamiCamaaBaaaleaa % caaIYaaabeaakiGacYgacaGGVbGaai4zamaaBaaaleaacaaIYaaabe % aakiaadchadaWgaaWcbaGaaGOmaaqabaGccqGHsislcqWIMaYscqGH % sislcaWGWbWaaSbaaSqaaiaad2gaaeqaaOGaciiBaiaac+gacaGGNb % WaaSbaaSqaaiaaikdaaeqaaOGaamiCamaaBaaaleaacaWGTbaabeaa % aaa!55DC! $$ H(X) =ï¿½ - p_1 \log _2 p_1ï¿½ - p_2 \log _2 p_2ï¿½ -ï¿½ \ldotsï¿½ - p_m \log _2 p_m $$ $
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaeyypa0Jaey % OeI0YaaabCaeaacaWGWbWaaSbaaSqaaiaadQgaaeqaaOGaciiBaiaa % c+gacaGGNbWaaSbaaSqaaiaaikdaaeqaaOGaamiCamaaBaaaleaaca % WGQbaabeaaaeaacaWGQbGaeyypa0JaaGymaaqaaiaad2gaa0Gaeyye % Iuoaaaa!45A5! $$ ï¿½ =ï¿½ - \sum\limits_{j = 1}^m {p_j \log _2 p_j } $$ $
In these formulas, H represents the entropy, $Formula: $$ {p_j } $$ $ represents the probability of the jth value for variable X. Let’s store this values in the entropies array.
Special Conditional Entropy, $Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamisamaabm % aabaGaamywaiaacYhacaWGybGaeyypa0JaamiEamaaBaaaleaacaWG % PbaabeaaaOGaayjkaiaawMcaaaaa!3E25! $$ H\left( {Y|X = x_i } \right) $$ $ ï¿½ is defined as the entropy of the Y variable values for which the corresponding X variable values equal to .ï¿½ Forï¿½ our given example, let’s calculate H(d|a=1): Searching through the data, we see that, a=1 for rows: {1,3,4,7,10} and the values of d in these rows correspond to: {1,0,0,0,1}. Let’s assume this set ({1,0,0,0,1}) as a brand new set and go on from here.ï¿½ Applying the definition of the entropy given in the 3rd step, we have:
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamiCamaabm % aabaWaaiWaaeaacaaIWaGaaiilaiaaigdaaiaawUhacaGL9baaaiaa % wIcacaGLPaaacqGH9aqpdaGadaqaamaaleaaleaacaaIZaaabaGaaG % ynaaaakiaacYcadaWcbaWcbaGaaGOmaaqaaiaaiwdaaaaakiaawUha % caGL9baaaaa!43EB! $$ p\left( {\left\{ {0,1} \right\}} \right) = \left\{ {{\textstyle{3 \over 5}},{\textstyle{2 \over 5}}} \right\} $$ $
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamisamaabm % aabaGaamizaiaacYhacaWGHbGaeyypa0JaaGymaaGaayjkaiaawMca % aiabg2da9iabgkHiTmaaqahabaGaamiCamaaBaaaleaacaWGQbaabe % aakiGacYgacaGGVbGaai4zamaaBaaaleaacaaIYaaabeaakiaadcha % daWgaaWcbaGaamOAaaqabaaabaGaamOAaiabg2da9iaaigdaaeaaca % aIYaaaniabggHiLdGccqGH9aqpcqGHsisldaqadaqaamaalaaabaGa % aG4maaqaaiaaiwdaaaGaciiBaiaac+gacaGGNbWaaSbaaSqaaiaaik % daaeqaaOWaaSaaaeaacaaIZaaabaGaaGynaaaaaiaawIcacaGLPaaa % cqGHsisldaqadaqaamaalaaabaGaaGOmaaqaaiaaiwdaaaGaciiBai % aac+gacaGGNbWaaSbaaSqaaiaaikdaaeqaaOWaaSaaaeaacaaIYaaa % baGaaGynaaaaaiaawIcacaGLPaaaaaa!6003! $$ H\left( {d|a = 1} \right) =ï¿½ - \sum\limits_{j = 1}^2 {p_j \log _2 p_j }ï¿½ =ï¿½ - \left( {{3 \over 5}\log _2 {3 \over 5}} \right) - \left( {{2 \over 5}\log _2 {2 \over 5}} \right) $$ $
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamisamaabm % aabaGaamizaiaacYhacaWGHbGaeyypa0JaaGymaaGaayjkaiaawMca % aiabg2da9iaaicdacaGGUaGaaGyoaiaaiEdacaaIWaGaaGyoaiaaiw % dacaaIXaaaaa!43C0! $$ H\left( {d|a = 1} \right) = 0.970951 $$ $
Conditional Entropy, $Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamisamaabm % aabaGaamywaiaacYhacaWGybaacaGLOaGaayzkaaaaaa!3AFE! $$ H\left( {Y|X} \right) $$ $ is the entropy including all the possible values of the variable being tested (X)ï¿½ as having a relation to theï¿½ targetedï¿½ variable (Y). It is calculated as:
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamisamaabm % aabaGaamywaiaacYhacaWGybaacaGLOaGaayzkaaGaeyypa0Zaaabu % aeaacaWGWbWaaeWaaeaacaWGybGaeyypa0JaamiEamaaBaaaleaaca % WGQbaabeaaaOGaayjkaiaawMcaaiaadIeadaqadaqaaiaadMfacaGG % 8bGaamiwaiabg2da9iaadIhadaWgaaWcbaGaamOAaaqabaaakiaawI % cacaGLPaaaaSqaaiaadQgaaeqaniabggHiLdaaaa!4DD2! $$ H\left( {Y|X} \right) = \sum\limits_j {p\left( {X = x_j } \right)H\left( {Y|X = x_j } \right)} $$ $
Again, if we look at our example, since we have H(d|a=1) =ï¿½ 0.970951, p(a=1) = 0.5 ; and if we calculate H(d|a=0) in a similar way as in step 4, we find H(d|a=0) = 0 and taking
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaWaaCbeaeaaci % GGSbGaaiyAaiaac2gaaSqaaiaadIhacqGHsgIRcaaIWaaabeaakiaa % dIhaciGGSbGaai4BaiaacEgadaWgaaWcbaGaaGOmaaqabaGcdaqada % qaaiaadIhaaiaawIcacaGLPaaaaaa!43E9! $$ \mathop {\lim }\limits_{x \to 0} x\log _2 \left( x \right) = 0 $$ $
we find that H(d|a) = 0.5 x 0.970951 + 0.5 x 0 = 0.4854755
Information Gain, $Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamysaiaadE % eadaqadaqaaiaadMfacaGG8bGaamiwaaGaayjkaiaawMcaaaaa!3BCB! $$ IG\left( {Y|X} \right) $$ $ – is a measure of the effect a variable (X)ï¿½ is having againstï¿½ a targetted variable (Y) with the information on the entropy of the targetted variable itself (H(Y))ï¿½ taken into account. It is calculated as:
$Formula: % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamysaiaadE % eadaqadaqaaiaadMfacaGG8bGaamiwaaGaayjkaiaawMcaaiabg2da % 9iaadIeadaqadaqaaiaadMfaaiaawIcacaGLPaaacqGHsislcaWGib % WaaeWaaeaacaWGzbGaaiiFaiaadIfaaiaawIcacaGLPaaaaaa!4603! $$ IG\left( {Y|X} \right) = H\left( Y \right) - H\left( {Y|X} \right) $$ $

For our example , H(d) = 0.721928 and we calculated H(d|a=0) = 0.4854755 so, IG(d|a) = 0.236453. Summing up, for the data (of 10 for each variable)ï¿½ presented in step 1, we have the Information Gain Table as:

X	IG(e\|X)	X	IG(d\|X)	X	IG(c\|X)
a	1.0	a	0.236453	a	0.124511
b	0.0290494	b	0.236453	b	0.0
c	0.124511	c	0.00740339	d	0.00740339
d	0.236453	e	0.236453	e	0.124511

X	IG(b\|X)	X	IG(a\|X)
a	0.0290494	b	0.0290494
c	0.0	c	0.124511
d	0.236453	d	0.236453
e	0.0290494	e	1.0

And here are the results for 1000 values per variable instead of 10:

X	IG(e\|X)	X	IG(d\|X)	X	IG(c\|X)
a	0.996257	a	0.327922	a	0.000500799
b	0.000716169	b	0.29628	b	0.00224924
c	0.000500799	c	0.000982332	d	0.000982332
d	0.327922	e	0.327922	e	0.000500799

X	IG(b\|X)	X	IG(a\|X)
a	0.000716169	b	0.000716169
c	0.00224924	c	0.000500799
d	0.29628	d	0.327922
e	0.000716169	e	0.996257

What can we say about the relationships from these Information Gain results?

H(Y|X) = H(X|Y) – Is this valid for this specific case or is it valid forï¿½ general?
Variable a: We can most probably guess a if we had known e, and it is somehow correlated with d.
Variable b: It is correlated with d.
Variable c: There doesn’t seem to be any correlation.
Variable d: It is equally correlated with a&e and also correlated with b very closely to that correlation with a&e.
Variable e: It is VERY strongly correlated with a andï¿½ it is somehow correlated with d.

Let’s take it on from here, with the variable d:

This is the IG table for 10000 data per each variable:

X	IG(d\|X)
a	0.30458
b	0.297394
c	0.00014876
e	0.30458

This is the IG table for 100000 data per each variable:

X	IG(d\|X)
a	0.309947
b	0.309937
c	1.97347×10^-6
e	0.309947

You can download the Mathematica sheet for this example : Information Gain Example

You will also need the two libraries (Matrix & DataMining) to run the example.

Filed in Definition & Method 2 Comments »

2 Responses to “A Crash Course on Information Gain & some other DataMining terms and How To Get There”

admin Says:
November 22, 2007 at 2:01 pm
If you run the example code, you will see that you are receving different values from the results presented here, even if you use the same 10 data. This is because of the definition of the entropy. In the entry, I’m using the binary entropy with Log base 2, but in the Mathematica functions I’ve posted, I switched the Log base to e for general purposes.

You can easily switch back to the former version by simply changing the line in the definition of the function DMEntropyEST from

Return[-Sum[problist[[i]] Log[problist[[i]]], {i,
Length[problist]}]];

to

Return[-Sum[problist[[i]] Log[2,problist[[i]]], {i, Length[problist]}]];

in the DataMining Functions Library file “estrefdm.nb”
Hex, Bugs and More Physics | Emre S. Tasci » Blog Archive » Octave Functions for Information Gain (Mutual Information) Says:
April 7, 2008 at 4:22 pm
[…] theoretical background and mathematica version of the functions refer to : A Crash Course on Information Gain & some other DataMining terms and How To Get There (my 21/11/2007dated […]

Hex, Bugs and More Physics | Emre S. Tasci

A Crash Course on Information Gain & some other DataMining terms and How To Get There

November 21, 2007 Posted by Emre S. Tasci

2 Responses to “A Crash Course on Information Gain & some other DataMining terms and How To Get There”

Leave a Reply

Categories

Archive

Blogroll

Meta

Design by