R: Normalized Mutual Information for a Protein Sequence Alignment

aaMIn {aaMI}

R Documentation

Normalized Mutual Information for a Protein Sequence Alignment

Description

Calculate a matrix of pairwise normalized mutual information values for a protein sequence alignment

Usage

aaMIn(file)

Arguments

file a connection or character string giving the name of the file to load.

Details

This script calculates the normalized mutual information between pairs of sites for a protein sequence alignment. The normalization constant used is the joint entropy, as described by Gloor, et al. (2005). The alignment must be in the form of a data frame with the sequence IDs as the row names and each site as a column. The program begins by calculating the amino acid frequencies at each site. These frequencies are then used to calculate a vector containing the Shannon entropy H for each site. Shannon entropy is calculated using the equation

H_i = sum[i] (P(X_i)log2(P(X_i)))

where P(X_i) = frequency of amino acid X at site i of the alignment. Next the program calculates the joint probabilities P(X_i,Z_j) of pairs of amino acids X and Z at sites i and j. The joint probabilities are used to calculate the joint entropy with the formula

sum_{i,j}(P(X_i,Z_j)log2(P(X_i,Z_j))

Shannon entropy and joint entropy are used to calculate the normalized mutual information MI with the formula

MIn_ij = (H_i + H_j - JH_ij) / JH_ij

Value

For the analysis of a protein sequence alignment data frame "file", the output is an NxN upper-triangular matrix, where N is the number of sites in the alignment. Values along the diagonal of the matrix are the entropy values (H) for each site.

Author(s)

Kurt Wollenberg

References

Gloor, G. B, L. C. Martin, L. M. Wahl, and S. D. Dunn. (2005) Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry 44 7156-7165.

Shannon, C. E. and W. Weaver. (1949) The Mathematical Theory of Communication, University of Illinois Press.

Wollenberg, K. R. and W. R. Atchley. (2000) Separation of phylogenetic from functional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Science 97 3288-3291.

Examples

## Read in a protein sequence alignment file, FastA format
## Not run: SeqDataFA <- read.FASTA("ProteinSeqFastA.txt")
## Read in a protein sequence alignment file, ClustalX .aln format
## Not run: SeqDataCX <- read.CX("ProteinSeq.aln")
## Read in a protein sequence alignment file, GeneDoc .msf format
## Not run: SeqDataGD <- read.Gdoc("ProteinSeq.msf")

## Calculate the mutual information matrix for one of these alignments.
## Not run: ProteinSeqminorm <- aaMIn(SeqDataGD)

[Package aaMI version 1.0-1 Index]