aaMI {aaMI}R Documentation

Mutual Information for a Protein Sequence Alignment

Description

Calculate a matrix of pairwise mutual information values for a protein sequence alignment

Usage

aaMI(file)

Arguments

file a connection or character string giving the name of the file to load.

Details

This script calculates the mutual information between pairs of sites for a protein sequence alignment. The alignment must be in the form of a data frame with the sequence IDs as the row names and each site as a column. The program begins by calculating the amino acid frequencies at each site. These frequencies are then used to calculate a vector containing the Shannon entropy H for each site. Shannon entropy is calculated using the equation

H_i = sum[i] (P(X_i)log2(P(X_i)))

where P(X_i) = frequency of amino acid X at site i of the alignment. Next the program calculates the joint probabilities P(X_i,Z_j) of pairs of amino acids X and Z at sites i and j. The Shannon entropy and joint probabilities are used to calculate the mutual information MI with the formula

MI_ij = H_i + H_j -sum[i,j](P(X_i,Z_j)log2(P(X_i,Z_j))

Value

For the analysis of a protein sequence alignment data frame "file", the output is an NxN upper-triangular matrix, where N is the number of sites in the alignment. Values along the diagonal of the matrix are the entropy values (H) for each site.

Author(s)

Kurt Wollenberg

References

Shannon, C. E. and W. Weaver. (1949) The Mathematical Theory of Communication, University of Illinois Press.

Wollenberg, K. R. and W. R. Atchley. (2000) Separation of phylogenetic from functional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Science 97 3288-3291.

Examples

## Read in a protein sequence alignment file, FastA format
## Not run: SeqDataFA <- read.FASTA("ProteinSeqFastA.txt")
## Read in a protein sequence alignment file, ClustalX .aln format
## Not run: SeqDataCX <- read.CX("ProteinSeq.aln")
## Read in a protein sequence alignment file, GeneDoc .msf format
## Not run: SeqDataGD <- read.Gdoc("ProteinSeq.msf")

## Calculate the mutual information matrix for one of these alignments.
## Not run: ProteinSeqmi <- aaMI(SeqDataGD)

[Package aaMI version 1.0-1 Index]