Mutual information

Mutual information is a measure of the statistical dependence between two random variables. It is defined by the formula

\[I(A;B)=\sum_{i}\sum_{j}P(a_i,b_j)\log\left[\frac{P(a_i,b_j)}{P(a_i)P(b_j)}\right]\]

where

  • \(I(A; B)\) is the mutual information between variables A and B,

  • \(i\) and \(j\) index the possible states of these variables,

  • \(P(a_i, b_j)\) is the joint probability of observing \(a_i\) for A and \(b_j\) for B at the same time,

  • \(P(a_i)\) and \(P(b_j)\) are the probabilities of observing \(a_i\) for A and \(b_j\) for B respectively.

To find correlations between positions in sequences of Protein Blocks (PB), we compute the Mutual Information (MI) for each pair of positions, so that

  • A and B each represent a position in the sequence (A and B cannot be the same position)

  • i and j are the values of Protein Blocks (“a” to “p”)

  • \(P(a_i)\) and \(P(b_j)\) are the probabilities of having a given protein block at a given position (obtained from the frequency of the PB at this position)

  • The base of the logarithm is 16 (the number of PB), which normalizes the values so that the maximum is 1

The number of operations grows quadratically with the sequence length: a sequence of \(n\) positions has \(n(n-1)/2\) pairs of positions to evaluate.
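The procedure described above can be sketched in plain Python. This is only an illustration and not the PBTools implementation: the function names `mutual_information` and `mi_matrix` are hypothetical, and the sketch assumes that all sequences have the same length and that joint probabilities are estimated from the frequency of each pair of PBs across the sequences. It visits every unordered pair of positions once and stores the result in the upper triangle of the matrix.

```python
from collections import Counter
from itertools import combinations
import math


def mutual_information(column_a, column_b, base=16):
    """MI between two aligned columns of symbols, with the logarithm in `base`."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    # Only pairs that actually occur are visited, so terms with a zero
    # joint probability (which contribute 0 by convention) are skipped.
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b), base)
    return mi


def mi_matrix(sequences, base=16):
    """Upper-triangular matrix of MI values for every pair of positions."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    # Transpose: columns[k] holds the PB found at position k in each sequence.
    columns = list(zip(*sequences))
    matrix = [[0.0] * length for _ in range(length)]
    for i, j in combinations(range(length), 2):
        matrix[i][j] = mutual_information(columns[i], columns[j], base)
    return matrix
```

Calling `mi_matrix(["aaa", "cab"])` should give a 3 x 3 nested list whose entry `[0][2]` is 0.25 (up to floating-point rounding) and whose other entries are 0.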

Using PBTools to compute the MI

[1]:
import pbtools as pbt
print(pbt.__version__)
0.1.0

For the simple sequences “aaa” and “cab”, we can represent them as the matrix

          0    1    2
  seq1    a    a    a
  seq2    c    a    b

So

\[\begin{split}I(pos0; pos1) &= P(a_{pos0}, a_{pos1}) \times \log\left[\frac{P(a_{pos0}, a_{pos1})}{P(a_{pos0}) \times P(a_{pos1})}\right] \\ &+ P(c_{pos0}, a_{pos1}) \times \log\left[\frac{P(c_{pos0}, a_{pos1})}{P(c_{pos0}) \times P(a_{pos1})}\right] \\ &= 0.5 \times \log_{16}\left(\frac{0.5}{0.5 \times 1}\right) \times 2 \\ &= 0 = I(pos2; pos1)\end{split}\]

And, since the pairs \((a_{pos0}, b_{pos2})\) and \((c_{pos0}, a_{pos2})\) never occur and therefore contribute nothing to the sum,

\[\begin{split}I(pos0; pos2) &= P(a_{pos0}, a_{pos2}) \times \log\left[\frac{P(a_{pos0}, a_{pos2})}{P(a_{pos0}) \times P(a_{pos2})}\right] \\ &+ P(a_{pos0}, b_{pos2}) \times \log\left[\frac{P(a_{pos0}, b_{pos2})}{P(a_{pos0}) \times P(b_{pos2})}\right] \\ &+ P(c_{pos0}, a_{pos2}) \times \log\left[\frac{P(c_{pos0}, a_{pos2})}{P(c_{pos0}) \times P(a_{pos2})}\right] \\ &+ P(c_{pos0}, b_{pos2}) \times \log\left[\frac{P(c_{pos0}, b_{pos2})}{P(c_{pos0}) \times P(b_{pos2})}\right] \\ &= 0.5 \times \log_{16}\left(\frac{0.5}{0.5 \times 0.5}\right) \times 2 \\ &= 0.5 \times 0.25 \times 2 \\ &= 0.25\end{split}\]
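The arithmetic of the two hand computations can be checked numerically. The short sketch below only reproduces the non-zero terms with the base-16 logarithm. The variable names are illustrative and the probabilities are the ones read off the sequences “aaa” and “cab”.

```python
import math

BASE = 16  # number of Protein Blocks, used as the base of the logarithm

# I(pos0; pos1): position 1 is always "a", so P(a_pos1) = 1.
# Two observed pairs, (a, a) and (c, a), each with joint probability 0.5.
i_01 = 2 * 0.5 * math.log(0.5 / (0.5 * 1.0), BASE)

# I(pos0; pos2): the two observed pairs are (a, a) and (c, b),
# each with joint probability 0.5 and marginals of 0.5.
i_02 = 2 * 0.5 * math.log(0.5 / (0.5 * 0.5), BASE)

print(i_01, i_02)
```

The second value relies on \(\log_{16}(2) = 0.25\), which holds because \(16^{0.25} = 2\).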
[2]:
pbt.mutual_information_matrix(["aaa", "cab"])
[2]:
array([[0.  , 0.  , 0.25],
       [0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  ]])

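At position (0, 2) the MI is 0.25, just as calculated earlier, and the other pairs of positions give 0, so the matrix computed by PBTools agrees with the hand computation for this example. Note that only the upper triangle of the matrix is filled: although MI is symmetric, the entry at (2, 0) is left at 0.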