... acids.1
A quick study of the Protein Data Bank on October 25, 2001 shows that the average sequence length of known structures, is approximately 220 amino acids, while the range is 8 to 1733 amino acids [37]. For unknown structures the lengths can be considerably higher.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... possible.2
There is some question just how foldable all amino acid sequences are. Some amino acid sequences are not stable as a practical matter and, therefore, will not produce a usable protein.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... structures.3
On July of 2001, the Protein Information Resource [7] (found at http://www-nbrf.georgetown.edu/pir/) contained 232,624 protein sequences, and the Protein Data Bank [9,10] (found at http://www.rcsb.org/pdb/), in comparison, contained only 15,531 protein structures.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...anfinsen73:-princ.4
For this work, Christian Anfinsen won the 1972 Nobel Prize in Chemistry. We will discuss amino acid chemistry in detail later (Chapter 2).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...berger98:-protein-hp-np-compl.5
NP-complete problems are known for explosive exponential growth in the time required for solution. This class of problems includes: Traveling Salesman Problems, $ \alpha-\beta$ minimax problems, 3-SAT, and many other familiar problems from combinatorial optimization. NP-complete means that all the problems in this class are solvable in polynomial time ($ P$ is the class of problems solvable in polynomial time) or none of them are. At this time it is not known whether $ P = NP$ or $ P
\ne NP$ as no polynomial solution technique for an NP-complete problem has been found, but it also has not been proven that none exists.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... initio6
A common definition of ab initio solvers is that they do not require direct information from the Protein Data Bank (or any other structure database) in order to deduce a fold.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...,7
The bond length is measured as the distance between two atom's centers and is in angstroms, $  $Å.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...,8
The bond angle is measured as the angle created by three atoms with two bonds between them and is in degrees.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....9
The torsion angle is measured as the rotation of the bond about some axis and is in degrees.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... time.10
This is true of every amino acid but Pro (see Figure A.4) where it accounts for 88% of the observed amino acids[20].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... other,11
While the atoms in a disulfide bridge are in close physical proximity, the amino acids they belong to are usually some distance apart from each other on the amino acid sequence. Therefore, disulfide bridges are a long-range covalent bonding pattern, unlike the single and double bonds mentioned above which are local interactions between atoms on the same amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... acids.12
For instance in protein G [30] (see Figure 2.5), a parallel $ \beta$-sheet forms when amino acids 3 to 8 align with amino acids 50 to 55.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... water.13
This explains why oil, which is not polar, and water do not mix.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...neumaier00:-molec-model-protein,14
Neumaier's paper deals primarily with mathematical optimization issues.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...duan01:-comput,15
Duan and Kollman's survey is the most recent cited here.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... models16
Comparative modeling is also known as sequence homology and sequence alignment.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... family17
By family, we mean proteins of very similar structure, but not necessarily similar sequences.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... techniques.18
In CASP3, 90% of all models were more than $ 10 $Å from the target protein while many few models in the other two areas were this poor.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... structures.19
Neumaier's survey [54] provides a good introduction and bibliography for the work in this field.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... structures.20
HOPS is written is ANSI C, and currently runs on networks of machines running Linux (in particular RedHat and Debian Linux distributions). It can be made available from the author (sforman@sju.edu) or Alberto Segre (alberto-segre@uiowa.edu).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... data.21
The number of angle choices depends of the amino acid and the data set used to construct the angles. This is discussed in detail in Chapters 4 and 6.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... modeled.22
This and the section on energetics assume that we are provided a complete amino acid sequence, a set of possible torsion angles for each type of amino acid, and definitions of backbone bond lengths and angles and coordinates for the sidechains of each type of amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...orientation23
By orientation, we refer to the relative position of one atom's coordinate system to that of its neighbor's coordinate systems. We will use the term position to signify both location and orientation.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... positions,24
This update usually occurs only at the very end of the protein, so we are often just comparing the position of one amino acid to all the other amino acids currently in our search tree.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... manner.25
The criteria for tagging a hydrogen bond or disulfide bridge is based on experimental data [42]. For hydrogen bonding, we require a distance of less than $ \ensuremath{3.0 + 1.5 \cos \theta \text{\AA}}$ between the donor and acceptor atom, where $ \theta$ is the angle formed by the donor, the acceptor and the atom adjacent to the donor. A disulfide bridge is recognized when two Cys sulfurs are close enough to form a covalent bond $ 1.63 $Å to $ 2.75 $Å. If we should decide to change these parameters, these criteria are easily recast within HOPS.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... entropy.26
An estimate of sidechain entropy can be expressed as a function of the sidechain's accessible surface area and a parameter dependent on the type of amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... portion27
It is an open question for us as to just how much of a contribution the remainder of the protein can make, so we strive to safely overestimate this value.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... one.28
We typically increment using steps of size 0.25.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... processors29
The fastest, least loaded processors percolate to the top, and therefore work the hardest.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... twiddling.30
The native fold for 1SHG scores better than 95% of its twiddled counterparts, 1YMB better than 92%, and 2ST1 better than 94%.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... metric,31
For a general case, this need not be a metric.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... mind.32
In other words, they are not searching through a set of partitions for the points to be clustered.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... step,33
The Minimum Spanning Tree, in Section 4.3, is an example of a common divisive algorithm.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... points.34
If $ N$ is the number of partitions and $ n$ the number of points the number of possible partitions is given by $ S(n,N) = \frac{1}{N!} \sum_{j=1}^{N} (-1)^{N-j}
\binom {N}{j} j^n$ As an example, $ S(15,3) = 2,375,101$ and $ S(100,5) \approx 10^{68}$ [71, page 59].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...-means,35
Using our notation thus far this would be $ N$-means.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... set36
Note that the discrete set of torsion angles provides a rough skeleton for the search. Using techniques like tweaking (see Chapter 5) and sidechain packing (see Section 3.4) the search space is expanded into other interesting, continuous areas.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... area.37
Gly displays this characteristic especially well (see Figure C.2).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... locations,38
Here we are assuming we know how large $ m$ should be, but in fact this is often unknown. In standard Multi-Facility Location problems the cost of each additional facility must be weighed against the cost savings of shorter travel to serve demand points. This total cost is then minimized by adding locations until the cost no longer improves. In our case, an additional cluster point will mean an enlarged search space but a more representative selection of torsion angles. These tradeoffs are much more difficult to quantify, so we work from heuristics to set $ m$.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... point.39
We have typically required 90% of all torsion angles to be within $ 30^\circ$ of a cluster point. This usually results in four to ten cluster points for each amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... points40
A cluster point will be perturbed if it is too close to another point or it has too few torsion angles assigned to it.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... position,41
Think of the entire protein being translated or rotated through three-space.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... protein.42
Though it should be noted that the positions of the atoms following the altered torsion angle do not change relative to each other; they change only relative to the atoms prior to the changed amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... function.43
In Chapter 7, we present results verifying this is true for our scoring function.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....44
This does not change the value that minimizes the function as both transformations involved (division by two and the square of a non-negative number) are one-to-one.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... array.45
In order to view it this way, we perform the matrix subtraction on only the six entries defined for $ \mathbf{G}_\zeta$.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... angles.46
To build a random data set, we added each torsion angle pair from the original data set with a probability of 50%. This means that the expected number of torsion angles in our random sets is 968.5, but the actual numbers for our ten data sets ranged from 921-998.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... other.47
This problem is similar to that encountered when you try to place your elbow in your ear.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.