Mutual information

Mutual information is a statistic that quantifies the correlation between two random variables. It is defined by the formula

\[I(A;B)=\sum_{i}\sum_{j}P(a_i,b_j)\log\left[\frac{P(a_i,b_j)}{P(a_i)P(b_j)}\right]\]

where

\(I(A; B)\) is the mutual information between variables A and B,
\(i\) and \(j\) index the possible states of these variables,
\(P(a_i, b_j)\) is the probability of observing \(a_i\) in A and \(b_j\) in B at the same time,
\(P(a_i)\) and \(P(b_j)\) are the probabilities of observing \(a_i\) in A and \(b_j\) in B respectively.
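
The formula can be evaluated directly from a table of joint probabilities. The following is a minimal sketch, not PBTools code: `mutual_information` and its `joint` and `base` parameters are hypothetical names chosen for illustration, and terms with a joint probability of 0 are assumed to contribute nothing to the sum.

```python
from math import log

def mutual_information(joint, base=2):
    """Mutual information of a joint distribution given as {(a, b): probability}."""
    # Marginals P(a) and P(b) are obtained by summing the joint distribution.
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    # Terms with P(a, b) = 0 are skipped: they contribute nothing to the sum.
    return sum(
        p * log(p / (p_a[a] * p_b[b]), base)
        for (a, b), p in joint.items()
        if p > 0
    )

# Two perfectly correlated binary variables.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
# Two independent binary variables.
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

mi_correlated = mutual_information(correlated)
mi_independent = mutual_information(independent)
print(mi_correlated, mi_independent)
```

With a base-2 logarithm, the correlated pair shares one bit of information and the independent pair shares none.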

To find correlations between positions in sequences of Protein Blocks (PB), we compute the Mutual Information (MI) for each pair of positions, so that

A and B each represent a position in the sequence (A and B cannot be the same position)
i and j run over the Protein Block letters (“a” to “p”)
\(P(a_i)\) and \(P(b_j)\) are the probabilities of having a given protein block at a given position (obtained from the frequency of the PB at this position)
The base of the logarithm is 16 (the number of PBs), which normalizes the values so that the maximum is 1
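
The normalization follows from a standard bound on mutual information, sketched here with \(H\) denoting the Shannon entropy computed in base 16 (a symbol introduced only for this remark):

\[I(A;B) \le \min\big(H(A), H(B)\big) \le \log_{16}(16) = 1\]

The second inequality holds because a variable with 16 possible states has an entropy of at most \(\log_{16}(16)\), reached when all 16 PBs are equally frequent at that position.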

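The number of position pairs, and therefore the number of MI values to compute, grows quadratically with the sequence length: a sequence of \(n\) positions has \(n(n-1)/2\) pairs.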
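
As a rough gauge of the cost, the sketch below counts the position pairs that need an MI value for a given sequence length, assuming each unordered pair of distinct positions is evaluated once. `count_position_pairs` is a hypothetical helper, not part of PBTools.

```python
from itertools import combinations

def count_position_pairs(length):
    """Number of distinct position pairs (A, B) with A != B, order ignored."""
    return sum(1 for _ in combinations(range(length), 2))

for length in (3, 10, 100):
    # The count matches the closed form length * (length - 1) / 2.
    print(length, count_position_pairs(length))
```

For lengths 3, 10 and 100 this gives 3, 45 and 4950 pairs.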

Using PBTools to compute the MI

[1]:

import pbtools as pbt
print(pbt.__version__)

0.1.0

The two simple sequences “aaa” and “cab” can be represented as the matrix

	0	1	2
seq1	a	a	a
seq2	c	a	b

So

\[\begin{split}I(pos0; pos1) = \\ P(a_{pos0}, a_{pos1}) \times \log \left[\frac{P(a_{pos0}, a_{pos1})}{P(a_{pos0})\times P(a_{pos1})}\right]\\ + P(c_{pos0}, a_{pos1}) \times \log \left[\frac{P(c_{pos0}, a_{pos1})}{P(c_{pos0})\times P(a_{pos1})}\right] \\ = 0.5 \times \log_{16}\left(\frac{0.5}{0.5 \times 1}\right) \times 2 \\ = 0 = I(pos2; pos1)\end{split}\]

And, because the pairs (a, b) and (c, a) are never observed at positions 0 and 2, their terms are taken as 0:

\[\begin{split}I(pos0; pos2) = \\ P(a_{pos0}, a_{pos2}) \times \log \left[\frac{P(a_{pos0}, a_{pos2})}{P(a_{pos0})\times P(a_{pos2})}\right]\\ + P(a_{pos0}, b_{pos2}) \times \log \left[\frac{P(a_{pos0}, b_{pos2})}{P(a_{pos0})\times P(b_{pos2})}\right]\\ + P(c_{pos0}, a_{pos2}) \times \log \left[\frac{P(c_{pos0}, a_{pos2})}{P(c_{pos0})\times P(a_{pos2})}\right] \\ + P(c_{pos0}, b_{pos2}) \times \log \left[\frac{P(c_{pos0}, b_{pos2})}{P(c_{pos0})\times P(b_{pos2})}\right] \\ = 0.5 \times \log_{16}\left(\frac{0.5}{0.5 \times 0.5}\right) \times 2 \\ = 0.5 \times 0.25 \times 2 \\ = 0.25\end{split}\]

[2]:

pbt.mutual_information_matrix(["aaa", "cab"])

[2]:

array([[0.  , 0.  , 0.25],
       [0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  ]])

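At position (0, 2) the MI is 0.25, just as calculated earlier, so the matrix computed by PBTools agrees with the hand calculation. Note that only the entry above the diagonal is filled: the symmetric entry (2, 0) is left at 0 in this output.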
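
As an independent cross-check, the same values can be reproduced without PBTools. The following is a minimal pure-Python sketch under stated assumptions: `mi_matrix` is a hypothetical function, not the PBTools implementation; it assumes all sequences have the same length, estimates probabilities from raw frequencies, and fills only the upper triangle to mirror the output above.

```python
from collections import Counter
from itertools import combinations
from math import log

def mi_matrix(sequences, base=16):
    """Upper-triangular MI matrix for equal-length sequences (illustrative sketch)."""
    n_seq = len(sequences)
    length = len(sequences[0])
    # columns[k] holds the letter found at position k in every sequence.
    columns = list(zip(*sequences))
    matrix = [[0.0] * length for _ in range(length)]
    for i, j in combinations(range(length), 2):
        count_i = Counter(columns[i])
        count_j = Counter(columns[j])
        # Only observed pairs appear here, so zero-probability terms are skipped.
        count_ij = Counter(zip(columns[i], columns[j]))
        total = 0.0
        for (a, b), n_ab in count_ij.items():
            p_ab = n_ab / n_seq
            p_a = count_i[a] / n_seq
            p_b = count_j[b] / n_seq
            total += p_ab * log(p_ab / (p_a * p_b), base)
        matrix[i][j] = total
    return matrix

result = mi_matrix(["aaa", "cab"])
for row in result:
    print(row)
```

Up to floating-point rounding, the entry at (0, 2) is 0.25 and every other entry is 0.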