Detour: PAM Scoring Matrices

Get introduced to the PAM scoring matrices.

What is the PAM scoring matrix?

Mutations of a gene’s nucleotide sequence often change the amino acid sequence of the translated protein. Some of these mutations impair the protein’s ability to function, making them rare events in molecular evolution. Asn, Asp, Glu, and Ser are the most “mutable” amino acids, whereas Cys and Trp are the least mutable. Knowledge of the likelihood of each possible mutation allows biologists to construct amino acid scoring matrices for biologically sound sequence alignments in which different substitutions are penalized differently. The (i, j)-th entry of the amino acid scoring matrix Score usually reflects how often the i-th amino acid substitutes for the j-th amino acid in alignments of related protein sequences. As a result, optimal alignments of amino acid sequences may have very few matches but still represent biologically adequate alignments.

How do biologists know which mutations are more likely than others? If we know a large set of pairwise alignments of related sequences (e.g., sharing at least 90% of amino acids), then computing Score(i, j) is based on counting how many times the corresponding amino acids are aligned. However, we need to know the scoring matrix in advance in order to build this set of starter alignments — a catch-22!

Fortunately, the correct alignment of very similar sequences is so obvious that it can be constructed even with a primitive scoring scheme that doesn’t account for varying mutation propensities (such as +1 for matches and -1 for mismatches and indels), thus resolving the conundrum. After constructing these obvious alignments, we can use them to compute a new scoring matrix that we can use iteratively to form less and less obvious alignments.
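As a rough illustration of the primitive scoring scheme mentioned above, here is a minimal sketch of global alignment with +1 for matches and -1 for mismatches and indels. The function name `global_align` and the toy sequences are illustrative assumptions, not part of the original lesson.

```python
def global_align(s, t, match=1, mismatch=-1, indel=-1):
    """Global alignment with a flat scoring scheme (illustrative sketch)."""
    n, m = len(s), len(t)
    # score[i][j] holds the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)
    # Backtrack from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append('-')
            i -= 1
        else:
            top.append('-'); bottom.append(t[j - 1])
            j -= 1
    return score[n][m], ''.join(reversed(top)), ''.join(reversed(bottom))


# Toy example (made-up sequences): a single deleted residue.
best, row1, row2 = global_align("ACDE", "ACE")
print(best)
print(row1)
print(row2)
```

For nearly identical sequences, an alignment like this is usually unambiguous, which is why a flat scheme is enough to bootstrap the first set of alignments.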

This simplified description hides some details. For example, the probability of Ser mutating into Phe in species that diverged 1 million years ago is smaller than the probability of the same mutation in species that diverged 100 million years ago. This observation implies that scoring matrices for protein comparison should depend on the similarity of the organisms and the speed of evolution of the proteins of interest. In practice, the proteins that biologists use to create an initial alignment are extremely similar, having 99% of their amino acids conserved (e.g., most proteins shared by humans and chimpanzees). Sequences that are 99% similar are said to be 1 PAM unit diverged (“PAM” stands for “point accepted mutation”). You can think of a PAM unit as the amount of time in which an “average” protein mutates 1% of its amino acids.

PAM$_1$ scoring matrix

The PAM$_1$ scoring matrix is defined from many pairwise alignments of 99% similar proteins as follows. Given a set of pairwise alignments, let M(i, j) be the number of times that the i-th and j-th amino acids appear in the same column, divided by the total number of times that the i-th amino acid appears in all sequences. Let f(j) be the frequency of the j-th amino acid in the sequences, that is, the number of times it appears across all sequences divided by the combined length of all sequences. The (i, j)-th entry of the PAM$_1$ matrix is defined as:

$$\log\left(\frac{M(i,j)}{f(j)}\right)$$
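The definition above can be sketched in a few lines of Python. This is a minimal sketch under several assumptions that the text does not spell out: each aligned column is counted in both directions (so that each row of M sums to 1), columns containing a gap are skipped, the logarithm is the natural log, and only pairs that were actually observed receive a score. The function name `pam1_scores` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def pam1_scores(alignments):
    """Compute log(M(i, j) / f(j)) from gapped pairwise alignments (sketch)."""
    pair_counts = Counter()      # how often i and j share a column
    residue_counts = Counter()   # how often each amino acid appears
    for top, bottom in alignments:
        for x, y in zip(top, bottom):
            if x == '-' or y == '-':
                continue  # assumption: ignore columns with indels
            # Assumption: count the column once from each residue's viewpoint.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            residue_counts[x] += 1
            residue_counts[y] += 1
    total = sum(residue_counts.values())
    scores = {}
    for (i, j), count in pair_counts.items():
        m_ij = count / residue_counts[i]   # M(i, j)
        f_j = residue_counts[j] / total    # f(j)
        scores[(i, j)] = math.log(m_ij / f_j)
    return scores


# Toy data over a made-up two-letter alphabet, not real protein sequences.
scores = pam1_scores([("AABB", "AABA")])
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

With these counting conventions the ratio M(i, j) / f(j) is symmetric in i and j, so the resulting score table is symmetric as well.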

For a larger number of PAM units n, the PAM$_n$ matrix is computed based on the observation that the matrix $M^n$ (the n-th power of M, i.e., the product of n copies of M) holds the empirical probabilities that one amino acid mutates to another during n PAM units. The (i, j)-th entry of the PAM$_n$ scoring matrix is thus given by:

$$\log\left(\frac{M^{n}(i,j)}{f(j)}\right)$$
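To make the matrix-power step concrete, here is a minimal sketch that raises a mutation matrix M to the n-th power and applies the log-ratio formula. The helper names (`mat_mul`, `mat_pow`, `pam_n_scores`) and the two-letter toy matrix are illustrative assumptions; a real PAM matrix is 20 x 20 and estimated from data. The sketch also assumes the natural logarithm and that every entry of the n-th power is positive.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Return the n-th power of m using repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def pam_n_scores(m, f, n):
    """Entry (i, j) is log(M^n(i, j) / f(j)); assumes all entries are positive."""
    mn = mat_pow(m, n)
    size = len(m)
    return [[math.log(mn[i][j] / f[j]) for j in range(size)] for i in range(size)]


# Hypothetical two-letter alphabet in which each residue mutates 1% of the time.
M = [[0.99, 0.01],
     [0.01, 0.99]]
f = [0.5, 0.5]

for n in (1, 250):
    print(n, [[round(x, 3) for x in row] for row in pam_n_scores(M, f, n)])
```

In this toy case the diagonal scores shrink toward zero as n grows, reflecting that a match carries less evidence of relatedness once many mutations have accumulated.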

This approach assumes that the frequencies of the amino acids f(j) remain constant over time and that the mutational processes in an interval of 1 PAM unit operate consistently over long periods. For large n, the resulting PAM$_n$ matrices often allow us to find related proteins, even when the alignment has few matches.

The PAM$_{250}$ scoring matrix is shown in the figure below.
