Proteins are constructed from the set of twenty naturally occurring amino acids. Amino acids, and hence proteins, are organic compounds formed from carbon, hydrogen, nitrogen, oxygen, and sulfur. Amino acids are present in much of the food we eat, and many amino acids can also be synthesized by our bodies as needed. These amino acids then form proteins as they are strung together in our cells by molecular machines called ribosomes.
The structures of all twenty amino acids are similar (see
Appendix A for a list of amino acids and
Figures A.1 to A.5 for their residues and their three- and
one-letter abbreviations). We will refer to all amino acids by their
three-letter abbreviations. All amino acids have a backbone
and a residue. The backbone is the same for every type of
amino acid and contains a nitrogen, denoted by $N$, followed by two
carbons labeled the $\alpha$-carbon, $C_\alpha$, and the prime-carbon,
$C'$, so the backbone atoms are ordered from left to right
$N$-$C_\alpha$-$C'$. In addition, the $N$ has a hydrogen, $H$,
attached to it, the $C'$ has an oxygen, $O$, attached to it (known
as the carbonyl $O$), and the $C_\alpha$ has an $H$ and the
residue attached to it. To form the protein, the amino acid backbones
come together in a long chain by forming peptide bonds between the
$C'$ of an amino acid and the $N$ of the following amino acid
(see Figure 2.1). Hence, this end-to-end bonding of amino
acids results in the protein being a long chain.
The other part of the amino acid is the residue (also known as the
sidechain). The residue is a set of atoms attached to the
central $C_\alpha$. Each type of amino acid is distinguished by its
residue. The residues vary from the simple (Gly, with a single
hydrogen atom) to the complicated (Trp, with a double carbon ring).
The atoms of the residue are labeled using the Greek alphabet,
beginning with the backbone's $\alpha$-carbon and followed by the
residue's $\beta$-carbon, $\gamma$-carbon, $\delta$-carbon, etc.
Covalent bonding between atoms is the primary determinant of
protein structure. Among the parameters of a bond are the
bond length, the bond angle, and the torsion angle. See
Figure 2.2 for examples of how these three parameters
are calculated. The values of the parameters for a covalent bond
depend upon the atoms forming the bond and the processes that went
into creating the bond. We will mention the characteristics of four
types of covalent bonding prevalent in proteins.
,
possible bonding configurations forming four disulfide bridges, and
many more if we consider configurations with 0 to 4 disulfide bridges
formed.
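The counting behind this kind of statement can be sketched in a few lines. The example below is only an illustration under an assumption made here: a protein with eight cysteines, the smallest number that allows four disulfide bridges. The function name `disulfide_configurations` is hypothetical. It counts the ways of choosing a given number of disjoint cysteine pairs.

```python
from math import factorial


def disulfide_configurations(n_cys, n_bridges):
    """Count the ways to form n_bridges disjoint pairs among n_cys cysteines."""
    if n_bridges < 0 or 2 * n_bridges > n_cys:
        return 0
    unpaired = n_cys - 2 * n_bridges
    # n! orderings, divided by the order inside each pair (2^k),
    # the order of the pairs (k!), and the order of the unpaired atoms.
    return factorial(n_cys) // (
        2 ** n_bridges * factorial(n_bridges) * factorial(unpaired)
    )


N_CYS = 8  # assumed number of cysteines, not taken from the text
exactly_four = disulfide_configurations(N_CYS, 4)
zero_to_four = sum(disulfide_configurations(N_CYS, k) for k in range(5))
```

Under the eight-cysteine assumption, `exactly_four` counts the fully bridged configurations and `zero_to_four` adds the partially bridged ones.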
In addition to the bond lengths, the backbone bond angles are
also relatively fixed. A bond angle is the planar angle formed by the
bonds of three sequential atoms. For instance, the bond angles formed
by $N$-$C_\alpha$-$C'$, by $C_\alpha$-$C'$-$N$, and by
$C'$-$N$-$C_\alpha$ each stay close to a characteristic value [20].
As mentioned before, the main degree of freedom within the bonds is
the torsion angle. Canonical names have been given to the three
torsion angles related to the backbone bonds. The rotational angle
between the $N$ and $C_\alpha$ atoms is called $\phi$, the rotational
angle between the $C_\alpha$ and $C'$ atoms is called $\psi$, and the
rotational angle between the $C'$ and the following amino acid's $N$
is called $\omega$ (see Figure 2.1). As we
noted before, $\omega$ is the torsion angle of the peptide bond and is
essentially restricted to the angles $0^\circ$ and $180^\circ$. The
$\phi$ and $\psi$ angles, however, are allowed to rotate much more freely.
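The three bond parameters (length, angle, and torsion) can all be computed from Cartesian atom coordinates. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the torsion formula is the common `atan2` form, which may differ in presentation from the construction in Figure 2.2.

```python
import math


def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _norm(u):
    return math.sqrt(_dot(u, u))


def bond_length(a, b):
    """Distance between two bonded atoms."""
    return _norm(_sub(b, a))


def bond_angle(a, b, c):
    """Planar angle (degrees) at atom b formed by the bonds b-a and b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def torsion_angle(a, b, c, d):
    """Signed rotation (degrees, -180 to 180) about the b-c bond."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of planes abc and bcd
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For a backbone, $\phi$ would be obtained by passing the preceding amino acid's $C'$ and then $N$, $C_\alpha$, $C'$ of the current one, and $\psi$ by passing $N$, $C_\alpha$, $C'$ and the following amino acid's $N$.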
In addition to the three backbone torsion angles, there are
rotamer angles associated with the torsion angles found in
the single bonds of the amino acid sidechains. These angles begin
with $\chi_1$, the torsion angle of the $C_\alpha$-$C_\beta$ bond,
and continue on with $\chi_2$, etc. Changes in rotamer and
backbone torsion angles have different effects on the protein's
conformation. A change in a rotamer angle will only affect the
location of a single sidechain's atoms, but a change in
$\phi$, $\psi$, or $\omega$ will affect the location of every backbone and
sidechain atom following it.
A protein is, therefore, a long chain that varies in shape primarily
because of the rotation of each amino acid's $\phi$ and $\psi$ angles,
and their choice of two possible $\omega$ values.
Each atom has an electron shell which delineates the perimeter of its atomic sphere. This shell prevents atoms (not engaged in a covalent bond) from occupying the same volume. A molecular conformation that places two atoms into the same volume is said to have a steric clash. Steric clashes are not observed in real structures because overlapping electron shells repel each other strongly, so a conformation with a clash is far from an energy minimum. Our program does not allow steric clashes to occur and disqualifies any potential conformations with a steric clash.
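A steric clash test of this kind can be sketched as a pairwise distance check. The code below is an illustration, not the criterion used by the program described here: the radii are approximate van der Waals values, and the `tolerance` parameter and the function name `has_steric_clash` are assumptions introduced for the example.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (assumed values).
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def has_steric_clash(atoms, bonded=frozenset(), tolerance=0.4):
    """Return True if any two non-bonded atoms overlap.

    atoms:     list of (element, (x, y, z)) tuples
    bonded:    set of frozenset({i, j}) index pairs joined by a covalent bond
    tolerance: overlap (angstroms) permitted before a clash is reported
    """
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(
        enumerate(atoms), 2
    ):
        if frozenset((i, j)) in bonded:
            continue  # covalently bonded atoms are allowed to be close
        min_dist = VDW_RADIUS[elem_i] + VDW_RADIUS[elem_j] - tolerance
        if math.dist(pos_i, pos_j) < min_dist:
            return True
    return False
```

The check is quadratic in the number of atoms. A spatial grid would be the usual refinement for large proteins, but it is omitted here to keep the sketch short.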
As mentioned before, the rotation of the bonds is the primary
determinant of a protein's conformation. When the $\phi$-$\psi$ angle pairs
for known protein structures are studied, one finds that these
angle pairs are not distributed uniformly among all possible
angle choices. Specific $\phi$ angles tend to occur with specific
$\psi$ angles and vice versa.
Why does this happen? Certain $\phi$-$\psi$ pairs will swing two
atoms into the same volume, causing a steric clash and making that
pair very unlikely. The most common $\phi$-$\psi$ rotation angles
are those that conveniently space the atoms a safe distance away from
each other.
Ramachandran plots [64] are two-dimensional
plots of $\phi$-$\psi$ angle pairs with the $\phi$ angle along the $x$-axis
and the $\psi$ angle along the $y$-axis. The angle combinations
typically plotted come from $\phi$-$\psi$ angle data found in the Protein
Data Bank. The patterns of angles mentioned earlier are very easy to
see when looking at Ramachandran plots of various amino acids. Note
that the space being graphed is periodic in both angles. This is, in
fact, a torus.
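Because the plotted space is a torus, angle arithmetic has to wrap around at the edges of the plot. The sketch below shows one way to wrap angles, measure distance between two $\phi$-$\psi$ pairs on the torus, and bin pairs into a coarse Ramachandran histogram. The function names and the 30-degree bin size are assumptions made for illustration.

```python
def wrap_angle(deg):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0


def torus_distance(p, q):
    """Distance between two (phi, psi) pairs, respecting periodicity."""
    d_phi = abs(wrap_angle(p[0] - q[0]))
    d_psi = abs(wrap_angle(p[1] - q[1]))
    return (d_phi ** 2 + d_psi ** 2) ** 0.5


def ramachandran_histogram(pairs, bin_size=30):
    """Count (phi, psi) pairs per bin; bin_size must divide 360.

    counts[row][col] has psi increasing with row and phi with col.
    """
    n = 360 // bin_size
    counts = [[0] * n for _ in range(n)]
    for phi, psi in pairs:
        col = min(n - 1, int((wrap_angle(phi) + 180.0) // bin_size))
        row = min(n - 1, int((wrap_angle(psi) + 180.0) // bin_size))
        counts[row][col] += 1
    return counts
```

With this wrapping, the pairs (179, 0) and (-179, 0) are two degrees apart rather than 358, which is what the torus picture implies.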
The densely populated areas on the Ramachandran plot (see
Appendix C) represent angle pairings
that are most likely to occur. As one can see from Figures
C.1 to C.4, the sidechain conformation has a
significant impact on the distribution of $\phi$-$\psi$ angle pairs. Larger
sidechains tend to be more restrictive, while Gly, which has only a
solitary $H$ atom for a sidechain, has a much wider variety of observed
$\phi$-$\psi$ combinations. Interestingly, these common angle conformations
also correspond to regular patterns within the backbone known as
secondary structure.
A secondary structure is a repeating three-dimensional structure with a fixed bonding pattern. The most common structures are helices and sheets. These structures are not formed by strong covalent bonding, but by weaker hydrogen bonding between atoms on different amino acids.
$N$ atoms (both backbone and otherwise) often have attached
$H$ atoms which they are willing to share with other atoms. These
$N$ atoms are known as donors. Acceptors, mostly $O$, are
attracted to these donated $H$ atoms because of their
respective opposite charges. This interaction can occur between atoms
in any part of the protein, but we are largely concerned with
interactions where the donor is a backbone $N$, and the acceptor is
the carbonyl $O$ (the $O$ bonded to $C'$) from a different amino
acid. These atoms are typically on amino acids four or more amino
acids apart in the protein's sequence. Hence, these interactions are
long range, rather than local interactions as with covalent
bonding. All of the common secondary structure is caused by patterned
formation of hydrogen bonds between the backbone atoms.
The most common protein secondary structures are $\alpha$-helices. As
can be seen in Figure 2.3, the hydrogen bonds in an
$\alpha$-helix occur between the $N$ of the $(i+4)$th
amino acid and the carbonyl $O$ of the $i$th
amino acid. Since this forms an extremely regular
helical pattern, the $\phi$ and $\psi$ torsion angles usually are near
a characteristic set of values (Table 2.1).
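| Type of Helix | $\phi$ (degrees) | $\psi$ (degrees) | Frequency | Bond Interval |
|---|---|---|---|---|
| $\alpha$-helix | -57 | -47 | 98% | 4 amino acids |
| $3_{10}$-helix | -49 | -26 | 1% | 3 amino acids |
| $\pi$-helix | -57 | -80 | 1% | 5 amino acids |

| Type of Strand | $\phi$ (degrees) | $\psi$ (degrees) |
|---|---|---|
| parallel | -119 | 113 |
| anti-parallel | -139 | 135 |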
Certainly, one can imagine that winding the helix more tightly or more
loosely would likewise produce a regular pattern of hydrogen bonds,
and indeed, two other helical structures do occur. However, the
$\phi$-$\psi$ angles required to produce these alternate helices orient the
protein in such a way that steric clashes between amino acids within
the helix occur more often. The
$\alpha$-helix configuration avoids these steric
clashes between amino acids. The two other types of naturally
occurring helices are called $3_{10}$-helices and $\pi$-helices.
Rather than producing hydrogen bonds between the $i$th and $(i+4)$th amino
acids as with $\alpha$-helices, $3_{10}$-helices form hydrogen bonds
between the $i$th and $(i+3)$th amino acids. The $\pi$-helical
structure is formed by a series of hydrogen bonds between the $i$th
and $(i+5)$th amino acids.
Left-handed helices, which rotate in the opposite direction of the
previous three helices, are another rare secondary structure.
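The three right-handed helix types differ only in the interval between hydrogen-bonded amino acids, so their bonding patterns can be enumerated with one small function. This is an illustrative sketch. The dictionary keys and the function name `helix_hydrogen_bonds` are hypothetical, and each pair is listed as (amino acid supplying the carbonyl $O$, amino acid supplying the backbone $N$).

```python
# Interval between hydrogen-bonded amino acids for each helix type.
HELIX_INTERVAL = {"3-10": 3, "alpha": 4, "pi": 5}


def helix_hydrogen_bonds(start, end, helix_type="alpha"):
    """List backbone hydrogen bonds in a helix spanning start..end (inclusive).

    Each entry is (acceptor_index, donor_index): the carbonyl O of amino
    acid i is bonded to the backbone N of amino acid i + interval.
    """
    k = HELIX_INTERVAL[helix_type]
    return [(i, i + k) for i in range(start, end - k + 1)]
```

For example, `helix_hydrogen_bonds(1, 8)` pairs amino acids 1 to 4 with amino acids 5 to 8, while the same span treated as a $\pi$-helix yields only three bonds.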
The other common secondary structures are $\beta$-sheets.
$\beta$-sheets are
formed when two relatively straight protein backbone segments lie
parallel to each other and hydrogen bonds form between the two
segments. The individual segments are referred to as
$\beta$-strands. There
are two types of $\beta$-sheets, anti-parallel and parallel (see
Table 2.2 for standard torsion angles).
$\beta$-sheets are typically anti-parallel. In this form, the directions of the
two $\beta$-strands run opposite to each other (as illustrated in
Figure 2.4). Just as with $\alpha$-helices, hydrogen bonds form
in a regular manner between the two $\beta$-strands. Hydrogen bonds form between
the $i$th amino acid's carbonyl $O$ and the $j$th amino acid's
backbone $N$, and between the $i$th amino acid's $N$ and the
$j$th amino acid's carbonyl $O$. Unlike the $\alpha$-helices, the hydrogen
bonding pattern between the two $\beta$-strands skips the $(i+1)$th
and $(j-1)$th amino
acids and continues with the $(i+2)$th and $(j-2)$th amino acids.
A less common conformation is the parallel
$\beta$-sheet. In this case, the two
strands are running in the same direction (as illustrated in
Figure 2.4). This orientation is less common
because a much longer length of intermediary protein must occur for
these two strands to align in this way. Rather than the carbonyl $O$
and backbone $N$ of one amino acid lining up with their
counterparts on the other amino acid, the bonding pattern is
staggered. The backbone $N$ of the $i$th amino acid forms a hydrogen
bond with the carbonyl $O$ of the $j$th amino acid, but the
$i$th carbonyl $O$ forms a hydrogen bond with the $(j+2)$th
backbone $N$.
Then the $(i+2)$th backbone $N$ will form a hydrogen bond with the
$(j+2)$th carbonyl $O$ and so on.
While $\alpha$-helix formation is largely a local phenomenon, with bonds forming
between amino acids three to five amino acids apart in the amino acid
sequence, $\beta$-sheets can form across a large number of amino
acids. This makes prediction of sheet
structure quite different from that of helix structure. Helix
prediction can be thought of as a local optimization problem, while
sheet prediction or formation is a global optimization problem. In
optimization, global solutions are more difficult to find than
local ones, so the prediction of $\beta$-sheets is more difficult than the
prediction of $\alpha$-helices.
Often, the secondary structures are organized in larger groups or
motifs. Common motifs include helix-helix, helix-loop-helix,
and the Greek key motif, which is four adjacent anti-parallel
$\beta$-strands (see Figure 2.6).
In day-to-day experience, we may think of water ($H_2O$) as a
neutral, non-interactive medium, but it actually has
a small charge distributed on either side of the water molecule. Part
of the molecule has a slightly positive charge, and part of it is
slightly negative. Molecules with this characteristic are called
polar. This means that when a charged molecule is placed in
water it will cause the water molecules to align around it and
interact with it. Non-polar molecules placed in water
will attempt to minimize their interaction with the polar
water.
In the case of proteins, only some of the amino acids are polar and therefore interact with water when exposed to it. These amino acids are called hydrophilic (having an affinity for water) and include charged and polar amino acids like Ser, Thr, Asp, Tyr and Trp. Being exposed to water allows their polar atoms to hydrogen bond with the surrounding water molecules and is energetically beneficial to the protein. Other amino acids are called hydrophobic (repelled by water) and are generally found in the interior of the protein, away from the surface and the surrounding solvent. These include Ala, Val, Pro, and Met [12]. The hydrophobic and hydrophilic effect determines, to a large extent, how the protein will fold.
The hydrophobic and hydrophilic effect is one aspect of the protein's energetics. By energetics, we are referring to the protein's folding process as an attempt to minimize its thermodynamic energy. In addition to the hydrophobic and hydrophilic effect, interactions like hydrogen bonding (see Section 2.1.3) and sidechain entropy also contribute to the protein's energy.
Sidechain entropy relates to the conformation of each amino acid's sidechain. Sidechains have a variety of configurations that they can appear in. As they become buried, this freedom is reduced and entropy decreases. Energetically, it is preferable if these amino acid sidechains are free to interact with the solvent (as measured by accessible surface area) and are not buried within the protein.
The proper combination of these effects is still an open question and one that complicates the effort to effectively model protein folding.
Proteins are constructed of atoms constrained by bonds and affected by forces exerted on them by surrounding atoms, both in the protein and in the surrounding solvent. This naturally leads to the formulation of protein folding as an optimization problem. The energy function contains terms regarding the covalent bonds between atoms, the hydrophobic effect, hydrogen bonding, and other effects. This leads to a very large and complicated function that, in theory, can be minimized.
Due to the large number of atoms and forces involved in the protein, one may face over a thousand degrees of freedom. Additionally, the equations describing the inter-atomic forces are not linear or even quadratic, though they are polynomial. This leads to a vast number of local minima, some of which can be quite deep. Therefore, any solution technique seeking to minimize this energy function may take a long time, if the minimum can be computed at all.
As a sidebar, the fact that proteins are able to minimize this energy function so quickly and with such ease in nature raises a number of intriguing possibilities. First, it could be (though it is unlikely) that proteins can solve tremendously hard NP-complete problems in a trivial amount of time. This could lead to a new class of biological problem solvers. One could code a traveling salesman problem as an amino acid sequence and then study the completed structure in order to find a solution. Another possibility is that easily folded amino acid sequences are chosen by natural selection, and researchers have not been able to determine just what makes these proteins easier to fold.
A wide variety of approaches from computer science and mathematics have been considered as potential solutions to the protein folding problem. This section is just a brief summary of current techniques used in protein folding. For gentle introductions, see Richards [66] and Hayes [35]. For detailed surveys of the field, see Neumaier [55], Creighton [20,21], Duan [26], or Dill [24]. To frame this issue, we will concentrate on techniques entered into blind protein prediction contests. Every other year since 1994, the Protein Structure Prediction Center at Lawrence Livermore Laboratories (http://predictioncenter.llnl.gov/) has conducted a Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP). December 2000 saw the completion of the fourth such assessment (CASP4) [47]. These results are published in a special issue of the journal Proteins: Structure, Function, and Genetics. The review of the 2000 conference has yet to be published at this writing, so we will limit our comments to the third assessment (CASP3) from December of 1998.
CASP takes the form of blind predictions of protein structure. The target proteins have known amino acid sequences, but their three-dimensional structures have only recently been determined and are not yet published. Since the target structures are not publicly known when the predictions are made, the predicted structures can later be compared with the actual structures, and the techniques can be evaluated for their accuracy. These targets range in function, length and type. Each research group submits five predictions along with their "best" prediction for each target.
Each iteration of CASP has utilized a number of different techniques for evaluating the predictions. While the emphasis has recently been placed on whole atom predictions in the form of 3-D atomic coordinate predictions, the predictions can take a number of other forms as well: alignments to publicly known structures, secondary structure assignments, and residue-residue distances [83].
There are several difficulties in evaluating a prediction. The predictions must be superimposed on the target protein, so that the two models can be compared. This superposition is not always easy to find and often iterative approximations must be performed to find the ideal superposition. Suboptimal superpositions could unduly penalize a predicted fold that has some portion of the protein poorly predicted, but another portion predicted well.
Once the superposition has been settled upon, the prediction can be evaluated in a number of ways. There are two primary measures. The first is the root-mean-square deviation (RMSD) of the model's atoms from the target's atoms, which depends on the superposition used. The second is the RMSD of the torsion and rotamer angles for each amino acid. Additionally, measures like accessible surface area, buried residues, and secondary structure found can also be calculated. For comparative models, further measures related to the alignment to known structures are utilized.
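The two RMSD measures can be sketched as follows. This is a minimal illustration that assumes the model has already been superimposed on the target and that atoms (or angles) are listed in matching order. The function names are hypothetical, and the angular version wraps each difference into the range -180 to 180 degrees before squaring.

```python
import math


def coordinate_rmsd(model, target):
    """RMSD between matched atom positions of two superimposed structures."""
    if len(model) != len(target) or not model:
        raise ValueError("structures must be non-empty and the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model, target))
    return math.sqrt(squared / len(model))


def angular_rmsd(model_angles, target_angles):
    """RMSD between matched torsion or rotamer angles, in degrees."""
    if len(model_angles) != len(target_angles) or not model_angles:
        raise ValueError("angle lists must be non-empty and the same length")
    diffs = [
        (a - b + 180.0) % 360.0 - 180.0  # shortest signed difference
        for a, b in zip(model_angles, target_angles)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Unlike the coordinate RMSD, the angular RMSD needs no superposition, since torsion angles do not change when the whole structure is rotated or translated.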
For the purpose of evaluating the different techniques, CASP3 divides the predictions into three different types: comparative modeling, fold recognition and ab initio predictions.
Comparative models attempt to match the target's amino acid sequence with the amino acid sequences of proteins whose structure is known. Then the target protein's structure is assumed to be similar to that of the matched protein's known structure. This technique performs best when one can find a family of similar proteins. The basic structure of these systems is similar with a significant amount of variation at each step.
There are several difficulties at each step. Protein families can be difficult to discern. Due to evolutionary effects within different species, proteins that at one point (millions and millions of years ago) were very similar in amino acid sequence may have dramatically divergent amino acid sequences, but retained a similar conformation. Likewise, there are many examples of proteins with a high degree of sequence similarity, but different folds. In addition, since unknown structures far outnumber known structures there is a large class of proteins for which no family can be found, and, hence, no model created.
For an overview of all comparative modeling techniques attempted at CASP3, see Jones and Kleywegt's analysis of the comparative models submitted [41]. Some examples of techniques that were successful at CASP3 include those submitted by Burke [16] and Yang [82].
Fold recognition (or threading) is a cousin to sequence homology. Instead of searching for significantly similar sequences and deducing the structure of the protein, a fold recognition package will assume that the unknown structure is similar to a fold that we have seen before and then search for the existence of that similar fold. The problem is then recast as determining the correct similar fold and not the correct similar sequence. Fold recognition techniques do not require similar sequences in the protein databank, just similar folds.
Proteins of decidedly different amino acid sequence within a single protein family are a byproduct of evolution. Proteins within a single family are likely to have evolved from a single protein. Along the way to their present state, there were a large number of insertions and deletions in their amino acid sequence. While the sequence similarity may now be distant, the structure similarity will remain as the evolutionary process maintains the protein's function, which depends on the protein's structure. This is seen most clearly in hemoglobin, which has nearly the same shape for thousands of species [70].
Fold recognition assumes that there are a limited number of core folds from which all proteins draw their structures. One can think of these core folds as templates or patterns, that the amino acid sequence is molded or fitted to. Many of these templates have not yet been determined, but the expectation of a small number of core folds is implicit in the construction of these algorithms. Because of this, the efficacy of fold recognition, like sequence homology, is limited by the size of the Protein Databank. Assuming the proper template can be determined for a given sequence, one must then align the sequence to the fold. Due to phenomena such as deletions, insertions, varying sequence length and others, there are thousands of possible ways to match a sequence to a template.
To reconcile these two areas, threading approaches generally follow the same general pattern [46]:
Murzin [52] provides an overview of the fold recognition techniques attempted at CASP3. Bryant [60] and Jones [40] are two examples of techniques that were above average performers at CASP3.
Ab initio models attempt to discern the structure of a protein without any direct structural data from proteins in the same evolutionary family. Instead, they construct the protein using general principles, many of which are thermodynamic in nature. This is a far more difficult task than the other two techniques undertake, so the standards of success are often less stringent. In order to take this into account, CASP3 used some different measurements to evaluate the various entries.
RMSD results are often very poor for ab initio techniques. Instead, the CASP3 evaluators used measurements like proper recognition of protein class, proper prediction of protein fragments, and proper fold architecture. Fragments are defined by taking the RMSD over 25 or 40 amino acid Lesk Window Plots. These plots match portions of the prediction to portions of the target protein. The percentage of running window RMSDs below some minimum threshold is computed, and these percentages are used to evaluate the models. See Orengo for a full description [58] and a discussion of other evaluation techniques.
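The running-window percentage can be sketched as below. This is only an approximation of the evaluation described above, under assumptions made here: one representative position per amino acid, a single global superposition already applied (a fuller treatment would likely superimpose each window separately), and a hypothetical default threshold of 3.0 angstroms. The function names are also hypothetical.

```python
import math


def _rmsd(a, b):
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def window_rmsd_fraction(model, target, window=25, threshold=3.0):
    """Fraction of running windows whose RMSD falls below threshold.

    model, target: equal-length lists of (x, y, z) positions, one per
    amino acid, assumed to be superimposed already.
    """
    if len(model) != len(target):
        raise ValueError("model and target must have the same length")
    n_windows = len(model) - window + 1
    if n_windows < 1:
        raise ValueError("window is longer than the protein")
    below = sum(
        1
        for s in range(n_windows)
        if _rmsd(model[s:s + window], target[s:s + window]) < threshold
    )
    return below / n_windows
```

A model that predicts one region well and another poorly scores a non-zero fraction here even when its whole-protein RMSD is large, which is the motivation for fragment-based measures.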
The ab initio field can be further differentiated into
knowledge-based ab initio techniques and classical ab
initio techniques. Knowledge-based techniques employ constraints
which are developed using multiple sequence alignments or fragments
from known structures. These fragments can then be combined, often
utilizing an optimization technique, to produce a full protein
representation. The most successful technique in this class was that
of Baker's group [69].
Baker's group utilized three to nine amino acid structure fragments
and then combined these fragments using simulated annealing while
optimizing a scoring function. Their scoring function depended on
things like hydrophobic burial, disulfide bonding, and
$\alpha$-helix and $\beta$-sheet packing
and formation.
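Simulated annealing itself is a generic optimization technique, and a toy version is sketched below. It is not Baker's method: instead of inserting structure fragments, it perturbs one torsion angle at a time, and `toy_score` is a hypothetical scoring function with several local minima, invented here purely for illustration. The cooling schedule and all parameter values are likewise assumptions.

```python
import math
import random


def toy_score(angles):
    """Hypothetical score: global minimum of 0 when every angle is -60."""
    total = 0.0
    for a in angles:
        x = math.radians(a + 60.0)
        total += (1.0 - math.cos(x)) + 0.3 * (1.0 - math.cos(3.0 * x))
    return total


def simulated_annealing(score, state, steps=5000, t_start=2.0, t_end=0.01,
                        step_size=30.0, seed=0):
    """Minimize score(state) over a list of angles; return (best, best_score)."""
    rng = random.Random(seed)
    current, current_score = list(state), score(state)
    best, best_score = list(current), current_score
    for k in range(steps):
        # Geometric cooling from t_start down to t_end (requires steps >= 2).
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        candidate = list(current)
        idx = rng.randrange(len(candidate))
        moved = candidate[idx] + rng.uniform(-step_size, step_size)
        candidate[idx] = (moved + 180.0) % 360.0 - 180.0
        delta = score(candidate) - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t), which shrinks as t falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, current_score + delta
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score
```

Accepting occasional uphill moves at high temperature is what lets the search escape local minima, at the cost of many more score evaluations than a purely downhill method.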
Classical ab initio techniques are the most ambitious structure prediction techniques. They typically rely on basic thermodynamic equations to define the relationships between atoms and amino acids in order to build the protein that minimizes some global energy function. These models will make simplifying assumptions about the protein's structure in order to make the enumeration of the protein's conformations manageable [24]. These methods tend to mix and match from a variety of simplifying assumptions about the geometry or energetics of a protein in order to make the problem tractable.
These techniques often rely on very powerful computing facilities to solve their problems to completion. Also, because the number of possible conformations grows exponentially with the number of amino acids, these techniques do not scale well to large proteins.
CHARMM [15] is a widely used package for energy minimization, and Dill [24], Levitt [50] and others produced much of the early work on classical ab initio structures.
See Orengo's summary for a listing of ab initio techniques entered at CASP3 [58]. Scheraga's work [49] was deemed to be among the best of the classical ab initio techniques. This technique was most successful on proteins that were shorter than 150 amino acids and primarily helical in nature. They first use an off-lattice simplified model which treats residues as points. These models are then solved multiple times using a simulated annealing technique. This produces a distinct set of potential fold families. Each of these distinct families is then replaced by an all-atom backbone, which is again solved to minimize a potential function. Finally, sidechains are added to the backbone, and the energy is again minimized.
Comparative modeling currently produces the most accurate models of protein structure. However, these techniques rely on the existence of families of similar proteins, whose structures have already been found. While this class is continually growing, it is not a legitimate option for a large number of proteins. In these cases, ab initio techniques are the best bet, but are currently limited by both a lack of understanding of the thermodynamics involved in the protein folding process and a lack of computational power.