- ...
acids.1
- A quick study of the Protein Data Bank on October 25,
2001 shows that the average sequence length of known structures, is
approximately 220 amino acids, while the range is 8 to 1733 amino
acids [37]. For unknown structures the lengths can
be considerably higher.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... possible.2
- There is some question
just how foldable all amino acid sequences are. Some amino acid
sequences are not stable as a practical matter and, therefore, will
not produce a usable protein.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
structures.3
- On July of 2001, the Protein Information
Resource [7] (found at
http://www-nbrf.georgetown.edu/pir/) contained 232,624 protein
sequences, and the Protein Data Bank [9,10] (found
at http://www.rcsb.org/pdb/), in comparison, contained only
15,531 protein structures.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...anfinsen73:-princ.4
- For this work, Christian
Anfinsen won the 1972 Nobel Prize in Chemistry. We will discuss amino
acid chemistry in detail later (Chapter 2).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...berger98:-protein-hp-np-compl.5
- NP-complete
problems are known for explosive exponential growth in the time
required for solution. This class of problems includes: Traveling
Salesman Problems,
minimax problems, 3-SAT, and many
other familiar problems from combinatorial optimization. NP-complete
means that all the problems in this class are solvable in polynomial
time (
is the class of problems solvable in polynomial time) or
none of them are. At this time it is not known whether
or
as no polynomial solution technique for an NP-complete problem
has been found, but it also has not been proven that none exists.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
initio6
- A common definition of ab initio solvers is
that they do not require direct information from the Protein Data Bank
(or any other structure database) in order to deduce a fold.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...,7
- The bond length is measured as the
distance between two atom's centers and is in angstroms,
Å.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...,8
- The bond
angle is measured as the angle created by three atoms with two bonds
between them and is in degrees.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ....9
- The torsion angle is measured as the rotation of
the bond about some axis and is in degrees.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... time.10
- This is true of every amino acid but
Pro (see Figure A.4) where it accounts for 88% of the
observed amino acids[20].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... other,11
- While the atoms in a
disulfide bridge are in close physical proximity, the amino acids they
belong to are usually some distance apart from each other on the amino
acid sequence. Therefore, disulfide bridges are a long-range covalent
bonding pattern, unlike the single and double bonds mentioned above
which are local interactions between atoms on the same amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
acids.12
- For instance in protein G [30] (see
Figure 2.5), a parallel
-sheet forms when amino acids 3
to 8 align with amino acids 50 to 55.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
water.13
- This explains why oil, which is not polar, and water
do not mix.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...neumaier00:-molec-model-protein,14
- Neumaier's
paper deals primarily with mathematical optimization issues.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...duan01:-comput,15
- Duan and Kollman's survey is the
most recent cited here.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... models16
- Comparative modeling is also known as
sequence homology and sequence alignment.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... family17
- By family, we mean proteins of very similar
structure, but not necessarily similar sequences.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
techniques.18
- In CASP3, 90% of all models were more than
Å from the target protein while many few models in the
other two areas were this poor.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
structures.19
- Neumaier's survey [54]
provides a good introduction and bibliography for the work in this
field.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
structures.20
- HOPS is written is ANSI C, and currently runs on
networks of machines running Linux (in particular RedHat and Debian
Linux distributions). It can be made available from the author
(sforman@sju.edu) or Alberto Segre (alberto-segre@uiowa.edu).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... data.21
- The number of
angle choices depends of the amino acid and the data set used to
construct the angles. This is discussed in detail in Chapters 4 and
6.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
modeled.22
- This and the section on energetics assume that we
are provided a complete amino acid sequence, a set of possible torsion
angles for each type of amino acid, and definitions of backbone bond
lengths and angles and coordinates for the sidechains of each type of
amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...orientation23
- By orientation, we refer to the relative
position of one atom's coordinate system to that of its neighbor's
coordinate systems. We will use the term position to signify
both location and orientation.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... positions,24
- This
update usually occurs only at the very end of the protein, so we are
often just comparing the position of one amino acid to all the other
amino acids currently in our search tree.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... manner.25
- The
criteria for tagging a hydrogen bond or disulfide bridge is based on
experimental data [42]. For hydrogen bonding, we
require a distance of less than
between the donor and acceptor atom, where
is the
angle formed by the donor, the acceptor and the atom adjacent to the
donor. A disulfide bridge is recognized when two Cys sulfurs are
close enough to form a covalent bond
Å to
Å. If we should decide to change these parameters,
these criteria are easily recast within HOPS.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... entropy.26
- An
estimate of sidechain entropy can be expressed as a function of the
sidechain's accessible surface area and a parameter dependent on the
type of amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
portion27
- It is an open question for us as to just how much of
a contribution the remainder of the protein can make, so we strive to
safely overestimate this value.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
one.28
- We typically increment using steps of size 0.25.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... processors29
- The fastest,
least loaded processors percolate to the top, and therefore work the
hardest.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... twiddling.30
- The
native fold for 1SHG scores better than 95% of its twiddled
counterparts, 1YMB better than 92%, and 2ST1 better than 94%.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... metric,31
- For a general
case, this need not be a metric.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... mind.32
- In other
words, they are not searching through a set of partitions for the
points to be clustered.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... step,33
- The Minimum Spanning Tree, in
Section 4.3, is an example of a common divisive
algorithm.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
points.34
- If
is the number of partitions and
the number
of points the number of possible partitions is given by
As an example,
and
[71, page 59].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...-means,35
- Using our notation thus
far this would be
-means.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
set36
- Note that the discrete set of torsion angles provides a
rough skeleton for the search. Using techniques like tweaking (see
Chapter 5) and sidechain packing (see
Section 3.4) the search space is expanded into other
interesting, continuous areas.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... area.37
- Gly displays
this characteristic especially well (see Figure C.2).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... locations,38
- Here we are assuming we know how large
should be, but in fact this is often unknown. In standard
Multi-Facility Location problems the cost of each additional facility
must be weighed against the cost savings of shorter travel to serve
demand points. This total cost is then minimized by adding locations
until the cost no longer improves. In our case, an additional cluster
point will mean an enlarged search space but a more representative
selection of torsion angles. These tradeoffs are much more difficult
to quantify, so we work from heuristics to set
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... point.39
- We have
typically required 90% of all torsion angles to be within
of a cluster point. This usually results in four to ten
cluster points for each amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... points40
- A cluster point will be perturbed
if it is too close to another point or it has too few torsion angles
assigned to it.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... position,41
- Think of
the entire protein being translated or rotated through three-space.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... protein.42
- Though it should be noted that the positions
of the atoms following the altered torsion angle do not change
relative to each other; they change only relative to the atoms prior
to the changed amino acid.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... function.43
- In
Chapter 7, we present results verifying this
is true for our scoring function.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ....44
- This does not change the value
that minimizes the function as both transformations involved (division
by two and the square of a non-negative number) are one-to-one.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... array.45
- In order to view it this way, we perform the
matrix subtraction on only the six entries defined for
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... angles.46
- To build a random
data set, we added each torsion angle pair from the original data set
with a probability of 50%. This means that the expected number of
torsion angles in our random sets is 968.5, but the actual numbers for
our ten data sets ranged from 921-998.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... other.47
- This problem is similar to that encountered when
you try to place your elbow in your ear.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.