Octave Functions for Information Gain (Mutual Information)
April 7, 2008 Posted by Emre S. Tasci
Here is some set of functions freshly coded in GNU Octave that deal with Information Gain and related quantities.
For theoretical background and mathematica correspondence of the functions refer to : A Crash Course on Information Gain & some other DataMining terms and How To Get There (my 21/11/2007dated post)
function [t r p] = est_members(x)
Finds the distinct members of given vector x returns 3 vectors r, t and p where :
r : ordered unique member list
t : number of occurences of each element in r
p : probability of each element in r
Example :
octave:234> x
x =
1 2 3 3 2 1 1
octave:235> [r t p] = est_members(x)
r =
1 2 3
t =
3 2 2
p =
0.42857 0.28571 0.28571
function C = est_scent(z,x,y,y_n)
Calculates the specific conditional entropy H(X|Y=y_n)
It is assumed that:
X is stored in the x. row of z
Y is stored in the y. row of z
Example :
octave:236> y
y =
1 2 1 1 2 3 2
octave:237> z=[x;y]
z =
1 2 3 3 2 1 1
1 2 1 1 2 3 2
octave:238> est_scent(z,1,2,2)
ans = 0.63651
function cent = est_cent(z,x,y)
Calculates the conditional entropy H(X|Y)
It is assumed that:
X is stored in the x. row of z
Y is stored in the y. row of z
Example :
octave:239> est_cent(z,1,2)
ans = 0.54558
function ig = est_ig(z,x,y)
Calculates the Information Gain (Mutual Information) IG(X|Y) = H(X) – H(X|Y)
It is assumed that:
X is stored in the x. row of z
Y is stored in the y. row of z
Example :
octave:240> est_ig(z,1,2)
ans = 0.53341
function ig = est_ig2(x,y)
Calculates the Information Gain IG(X|Y) = H(X) – H(X|Y)
Example :
octave:186> est_ig2(x,y)
ans = 0.53341
function ig = est_ig2n(x,y)
Calculates the Normalized Information Gain
IG(X|Y) = [H(X) – H(X|Y)]/min(H(X),H(Y))
Example :
octave:187> est_ig2n(x,y)
ans = 0.53116
function ent = est_entropy(p)
Calculates the entropy for the given probabilities vector p :
H(P) = – SUM(p_i * Log(p_i))
Example :
octave:241> p
p =
0.42857 0.28571 0.28571
octave:242> est_entropy(p)
ans = 1.0790
function ent = est_entropy_from_values(x)
Calculates the entropy for the given values vector x:
H(P) = -Sum(p(a_i) * Log(p(a_i))
where {a} is the set of possible values for x.
Example :
octave:243> x
x =
1 2 3 3 2 1 1
octave:244> est_entropy_from_values(x)
ans = 1.0790
Supplementary function files:
function [t r p] = est_members(x)
% Finds the distinct members of x
% r : ordered unique member list
% t : number of occurences of each element in r
% p : probability of each element in r
% Emre S. Tasci 7/4/2008
t = unique(x);
l = 0;
for i = t
l++;
r(l) = length(find(x==i));
endfor
N = length(x);
p = r/N;
endfunction
function C = est_scent(z,x,y,y_n)
% Calculates the specific conditional entropy H(X|Y=y_n)
% It is assumed that:
% X is stored in the x. row of z
% Y is stored in the y. row of z
% y_n is located in the list of possible values of y
% (i.e. [r t p] = est_members(z(y,:))
% y_n = r(n)
% Emre S. Tasci 7/4/2008
[r t p] = est_members(z(x,:)(z(y,:)==y_n));
C = est_entropy(p);
endfunction
function cent = est_cent(z,x,y)
% Calculates the conditional entropy H(X|Y)
% It is assumed that:
% X is stored in the x. row of z
% Y is stored in the y. row of z
% Emre S. Tasci 7/4/2008
cent = 0;
j = 0;
[r t p] = est_members(z(y,:));
for i=r
j++;
cent += p(j)*est_scent(z,x,y,i);
endfor
endfunction
function ig = est_ig(z,x,y)
% Calculates the Information Gain IG(X|Y) = H(X) – H(X|Y)
% X is stored in the x. row of z
% Y is stored in the y. row of z
% Emre S. Tasci 7/4/2008
[r t p] = est_members(z(x,:));
ig = est_entropy(p) – est_cent(z,x,y);
endfunction
function ig = est_ig2(x,y)
% Calculates the Information Gain IG(X|Y) = H(X) – H(X|Y)
% Emre S. Tasci <e.tasci@tudelft.nl> 8/4/2008
z = [x;y];
[r t p] = est_members(z(1,:));
ig = est_entropy(p) – est_cent(z,1,2);
endfunction
function ig = est_ig2n(x,y)
% Calculates the Normalized Information Gain
% IG(X|Y) = [H(X) – H(X|Y)]/min(H(X),H(Y))
% Emre S. Tasci <e.tasci@tudelft.nl> 8/4/2008
z = [x;y];
[r t p] = est_members(z(1,:));
entx = est_entropy(p);
enty = est_entropy_from_values(y);
minent = min(entx,enty);
ig = (entx – est_cent(z,1,2))/minent;
endfunction
function ent = est_entropy(p)
% Calculates the Entropy of the given probability vector X:
% H(X) = – Sigma(X*Log(x))
% If you want to directly calculate the entropy from values array
% use est_entropy_from_values(X) function
%
% Emre S. Tasci 7/4/2008
ent = 0;
for i = p
ent += -i*log(i);
endfor
endfunction
function ent = est_entropy_from_values(x)
% Calculates the Entropy of the given set of values X:
% H(X) = – Sigma(p(X_i)*Log(p(X_i)))
% If you want to calculate the entropy from probabilities array
% use est_entropy(X) function
%
% Emre S. Tasci 7/4/2008
ent = 0;
[r t p] = est_members(x);
for i = p
ent += -i*log(i);
endfor
endfunction
Less is More MySQL
January 9, 2008 Posted by Emre S. Tasci
(-Right now, I’m going to define a new category: the MySQL category!-)
Suppose that, you have a database which contains a huge number of entries about the materials (like, for instance, the Pauling Database). Let it have 163 different properties we can query about. It is optimized for queries, so, the values are enumerated and the labels for these are kept in "pauling.dblxxx" tables, which may be something like this for the "Chemical System" property:
The values are kept in "pauling.valxxx" tables:
(Where the first one is the EntryCode, the PRIMARY key that relates all the tables and the second value is the enumeration for the value. For example, 1422 for the 13. property is actually Ho-Ir 🙂
and there is one more set of tables, the paulingv2.valxxx tables which are stored and keyed in val order, so:
- if we want to find the EntryCodes corresponding to a given property, we query the paulingv2.valxxx tables
- if we want to find the property that is corresponding to a given EntryCode we query the pauling.valxxx tables
- if we want to "translate" the property’s enumeration, we query the pauling.dblxxx tables.
Here is the thing: Let’s say that we want the Structure Types that have an Atomic Enviroment Type (AET) of a rhombic dodecahedron , with a/b ratio 1, alpha=beta=90o. The property numbers for these are:
Structure Type : 32
AET : 86
a/b ratio : 44
alpha : 41
beta : 42
Since the last three properties are numeric, we don’t enumerate them and the enumeration corresponding to the rhombic dodecahedron for AET is 6. So, fasten your seat belts, we are about to lift off! :
SELECT id,pauling.dbl032.val FROM pauling.dbl032
INNER JOIN
(
SELECT DISTINCT val FROM val032
INNER JOIN
(
SELECT v AS EntryCode FROM
(
(
SELECT val001.val AS v FROM val001
INNER JOIN paulingv2.val086
USING (EntryCode)
WHERE paulingv2.val086.val=6
) AS A
INNER JOIN
(
(
SELECT val001.val AS v FROM val001
INNER JOIN paulingv2.val044
USING (EntryCode)
WHERE paulingv2.val044.val=1
) AS B
INNER JOIN
(
# 41=90 && 42= 90
(
SELECT val001.val AS v FROM val001
INNER JOIN paulingv2.val041
USING (EntryCode)
WHERE paulingv2.val041.val=90
) AS C
INNER JOIN
(
SELECT val001.val AS v FROM val001
INNER JOIN paulingv2.val042
USING (EntryCode)
WHERE paulingv2.val042.val=90
) AS D
USING (v)
)
USING (v)
)
USING (v)
)
) AS G
USING (EntryCode)
)
AS Q ON (id = Q.val) ORDER BY val;
Later addition: Assuming 86=6; we don’t really need the other constraints 44=1, 42=90, 41=90 – do we?… So it’s just the boring:
SELECT id FROM pauling.dbl032
INNER JOIN
(
SELECT DISTINCT val FROM val032
INNER JOIN
(
SELECT v AS EntryCode FROM
(
(
SELECT val001.val AS v FROM val001
INNER JOIN paulingv2.val086
USING (EntryCode)
WHERE paulingv2.val086.val=6
) AS A
)
) AS G
USING (EntryCode)
)
AS Q ON (id = Q.val) ORDER BY Q.val;
The result of this is something like this:
To be honest, it is actually something like this :
The strange symbols are the price we pay for using non-standard charsets! 😉
So, to tidy up, I import this to a table with the following structure:
CREATE TABLE IF NOT EXISTS `bcc` (
`id` smallint(5) unsigned NOT NULL default ‘0’,
`val` varchar(254) NOT NULL default ”,
`usagecount` smallint(5) unsigned NOT NULL default ‘0’,
`val1` varchar(30) NOT NULL default ‘0’,
`val2` varchar(6) default NULL,
`val3` varchar(3) default NULL,
`val1t` varchar(30) NOT NULL,
`val2sp` smallint(5) unsigned NOT NULL,
`SG` tinyint(3) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `val` (`val`),
KEY `val1` (`val1`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
You have already met the id and val columns. "usagecount" will be imported from the pauling.dbl013 table; val1, val2, val3 are the seperated structure type information (ie, for "CuTi,tp4,129", val1="CuTi", val2="tp4" and val3="129"); val1t is the "translated" version of val1 which is the readeable one (ie, "(Ag¡•¦§Zn¡•¥¥)©Zn" is translated as "(Ag0.56Zn0.44)8Zn"). The translation is done via the following simpe php function:
function translate_symbols($string,$f_tr2symb=true)
{
$symbol_array = array("•","¡","¢","£","¤","¥","¦","§","¨","©","ª");//subscript values
$transl_array = array(".","0","1","2","3","4","5","6","7","8","9");//subscript values
if(!$f_tr2symb)return str_replace($transl_array,$symbol_array,$string);
else return str_replace($symbol_array,$transl_array,$string);
}
"val2sp" is the sliced numeric value from the Pearson Symbol stored in the val2 column.
To tidy up:
$query = "SELECT id, val1, val2 FROM bcc";
$qresult = mysql_query($query);
while($result = mysql_fetch_array($qresult))
{
$query2 = "UPDATE bcc SET val1t=\"".translate_symbols($result["val1"])."\", val2sp=SUBSTRING(val2,3,20) WHERE id = ".$result["id"]." LIMIT 1";
//echo $query2."\n";
$q2result = mysql_query($query2);
echo mysql_error();
}
You can refer to my previous entry for slicing up the "val" column. I had already done this while constructing the pauling.dblxxx tables, so in fact the actual view of the pauling.dbl032 table is the following one
meaning, I can just import them using the id’s of my new table:
UPDATE bcc, pauling.dbl032 SET
bcc.val = pauling.dbl032.val,
bcc.usagecount = pauling.dbl032.usagecount,
bcc.val1 = pauling.dbl032.val1,
bcc.val2 = pauling.dbl032.val2,
bcc.val3 = pauling.dbl032.val3
WHERE bcc.id = pauling.dbl032.id
Now we have something like:
We still have some work to do. Let’s take the two structures Ni2Al and Ni2In. Say that we are looking for superstructures, which have the property that the "atoms occupy atomic positions according to the parent crystal structure". The AET for Ni2Al is given as 14-b;14-b;14-b; whereas the AET for Ni2In is given as 11-a;11-a;14-b; . We want every atom to have the parent crystal structure (14-b – the rhombic dodecahedron for bcc), so we will eliminate those ones that have other AET.
mysql_query("USE pauling");
$ids_q = mysql_query("SELECT id FROM s07pt.bcc");
while($ids = mysql_fetch_row($ids_q))
{
$id= $ids[0];
$query = "
SELECT COUNT(DISTINCT val) FROM pauling.val032
INNER JOIN
(
SELECT v AS EntryCode FROM
(
SELECT pauling.val001.val AS v FROM pauling.val001
INNER JOIN paulingv2.val032
USING (EntryCode)
WHERE paulingv2.val032.val = ".$id."
) AS A
INNER JOIN
(
SELECT pauling.val001.val AS v FROM pauling.val001
INNER JOIN paulingv2.val086
USING (EntryCode)
WHERE paulingv2.val086.val != 6
) AS B
USING (v)
) as C USING (EntryCode)";
$query = mysql_query($query);
$result = mysql_fetch_row($query);
$result = mysql_result($query,0);
$result = ($result+1)%2;
$query = "UPDATE s07pt.bcc SET incl_theo = $result where id = $id LIMIT 1";
$j++;
echo $j.".\t".$query."\n";
mysql_query($query);
}
"incl_theo" is the column which is equal to 1 if the structure in question contains no AET other than 14-b, 0 otherwise. This gives us smt. like:
Boasting? I guess so… 8)
December 21, 2007 Posted by Emre S. Tasci
Suppose that you’ve collected some data from the output of a program. Let’s say that some part of this data consists of Author names something similar to:
You want to split the initials from the surnames. This is piece of cake with PHP but I don’t want to go parsing each row of which there are many… So, take a look at this ugly beauty:
aaaaand here is what you get:
if you are thinking something similar to
UPDATE dbl004 SET val1 = LEFT(val,LOCATE(" ",val)-1), val2 = RIGHT(val,LENGTH(val)-LOCATE(" ",val));
or
UPDATE dbl004 set val1 = TRIM(SUBSTRING(SUBSTRING_INDEX(val,".",1),1,LENGTH(SUBSTRING_INDEX(val,".",1)) – LENGTH(SUBSTRING_INDEX(SUBSTRING_INDEX(val,".",1)," ",-1)))), val2 = TRIM(SUBSTRING(val, LENGTH(SUBSTRING_INDEX(val,".",1)) – LENGTH(SUBSTRING_INDEX(SUBSTRING_INDEX(val,".",1)," ",-1))));
Try to process these 3 values: "van der Graaf K.L. Jr.", "Not Available" and "Editor".
About this entry: I couldn’t refrain myself from boasting after I managed to come up with that beautiful MySQL query… sorry for that. (Yes, I know, superbia, the 7th and the most deadly…) So let me try to balance this arrogant entry of mine:
With my best regards,
Your humble blogger…
SAGE: Open Source Mathematics Software
December 9, 2007 Posted by Emre S. Tasci
Some “trivial” derivations
December 4, 2007 Posted by Emre S. Tasci
Information Theory, Inference, and Learning Algorithms by David MacKay, Exercise 22.5:
A random variable x is assumed to have a probability distribution that is a mixture of two Gaussians,
where the two Gaussians are given the labels k = 1 and k = 2; the prior probability of the class label k is {p1 = 1/2, p2 = 1/2}; are the means of the two Gaussians; and both have standard deviation sigma. For brevity, we denote these parameters by
A data set consists of N points which are assumed to be independent samples from the distribution. Let kn denote the unknown class label of the nth point.
Assuming that and
are known, show that the posterior probability of the class label kn of the nth point can be written as
and give expressions for and
.
Derivation:
Assume now that the means are not known, and that we wish to infer them from the data
. (The standard deviation
is known.) In the remainder of this question we will derive an iterative algorithm for finding values for
that maximize the likelihood,
Let L denote the natural log of the likelihood. Show that the derivative of the log likelihood with respect to is given by
where appeared above.
Derivation:
Assuming that =1, sketch a contour plot of the likelihood function as a function of mu1 and mu2 for the data set shown above. The data set consists of 32 points. Describe the peaks in your sketch and indicate their widths.
Solution:
We will be trying to plot the function
if we designate the function
as p[x,mu] (remember that =1 and
),
then we have:
And in Mathematica, these mean:
mx=Join[N[Range[0,2,2/15]],N[Range[4,6,2/15]]]
Length[mx]
ListPlot[Table[{mx[[i]],1},{i,1,32}]]
p[x_,mu_]:=0.3989422804014327` * Exp[-(mu-x)^2/2];
pp[x_,mu1_,mu2_]:=.5 (p[x,mu1]+p[x,mu2]);
ppp[xx_,mu1_,mu2_]:=Module[
{ptot=1},
For[i=1,i<=Length[xx],i++,
ppar = pp[xx[[i]],mu1,mu2];
ptot *= ppar;
(*Print[xx[[i]],"\t",ppar];*)
];
Return[ptot];
];
Plot3D[ppp[mx,mu1,mu2],{mu1,0,6},{mu2,0,6},PlotRange->{0,10^-25}];
ContourPlot[ppp[mx,mu1,mu2],{mu1,0,6},{mu2,0,6},{PlotRange->{0,10^-25},ContourLines->False,PlotPoints->250}];(*It may take a while with PlotPoints->250, so just begin with PlotPoints->25 *)
That’s all folks! (for today I guess 8) (and also, I know that I said next entry would be about the soft K-means two entries ago, but believe me, we’re coming to that, eventually 😉
Attachments: Mathematica notebook for this entry, MSWord Document (actually this one is intended for me, because in the future I may need them again)