Thesis (Index) <- Sean Forman <- You Are Here
HOPS (Hybrid Optimization of Protein Structure) is a tool for the ab initio prediction of unknown protein structures.20 It implements a wide variety of different ideas from statistics, robotics, discrete optimization and computer science in an effort to efficiently predict the three-dimensional structure of a protein when given only its amino acid sequence.
HOPS moves from node-to-node through a large search tree (see Figure 3.1) building, altering and evaluating partial and complete protein folds at each node. The depth of the search tree is determined by the number of amino acids in our protein. The number of branches at each node is defined by a discrete set of possible torsion angles. This discrete set of angles is chosen by a clustering algorithm using experimental data.21 Within this discrete search, we utilize continuous secondary search techniques (tweaking and sidechain packing) to expand our search into nearby advantageous regions.
A protein is folded by HOPS from its
-terminus to its
-terminus (left to right). We begin with the first amino acid,
set the fixed parameters of its backbone bonds (the bond angles and
bond lengths) and then add a sidechain suitable to that type of amino
acid. Using our results from clustering, we then use the first
torsion angle combination available to complete the amino acid's
configuration. Once we've set a torsion angle pair for this amino
acid, we move down to the next node adding a new amino acid. When we
return to the original level of the search tree perhaps after
searching many, many conformations, we then consider another torsion
angle combination for that amino acid. When all angles have been
attempted, we backtrack to the previous level. Our search is
inherently recursive.
At each node in our search tree, we check for steric clashes in our partial (or at the leaves the tree, full) protein conformation and evaluate the score of the protein. In certain cases, we will be able to prune the search tree and reduce the size of our search. If there is a steric clash, the protein is deemed untenable, and we prune that portion of the search tree. If the conformation is legal, we compute the score of the partial fold along with a contribution from the unfolded remainder of the protein. We are able to prune if this score is worse than our current best solution (see Figure 3.4), otherwise we continue searching with the next amino acid.
Sections 3.2 (Geometry) and 3.3 (Energetics) deal with how we build, represent, and evaluate the partial and complete proteins. Section 3.4 (Search) describes how the search space is constructed and searched.
HOPS uses a full protein representation which means that the position of every non-hydrogen atom in three dimensions is modeled.22 Since we are searching through a vast number of potential conformations, HOPS builds data structures fully representing the three-dimensional location and orientation23 of each non-hydrogen atom in a protein. These positions are generated using any backbone bond length and bond angle chosen at compile-time and a variable set of torsion angles chosen at run-time.
To quickly change these positions based on what node of the search
tree we are at, we use a fast positional update technique which
utilizes homogeneous coordinate systems. This representation relies
on a sequence of
matrix multiplications in order to update the
position of the atoms in the protein (see Section 5.1 for a
full explanation).
Following an update to an amino acid's atomic positions,24 HOPS recognizes steric clashes between amino acids using skip lists [63] and bounding spheres. Skip lists are a fast, probalistic method for quickly accessing entries in an ordered linked list. Once we have identified a potential steric clash, we then search, atom-by-atom, between the two amino acids looking for evidence of a steric clash. For each pair of atoms from each amino acid, we compare their locations and the size of their atomic spheres. The intersection of two atomic spheres is recognized as a steric clash and that conformation is rejected.
Steric clashes between nearby amino acids (within two to five amino acids) are likely to occur repeatedly in the search trees as each short torsion angle series is part of thousands of potential folds. Rather than rebuild each of these potential folds many times and repeatedly determine they contain a steric clash, HOPS maintains a clash cache. This cache uses an efficient hash function to index these steric clashes which allows for a quick check before any new amino acid is added to the end of our search tree. This check of the clash cache will determine if this sequence of torsion angles has previously caused steric clashes.
As it is evaluating the possibility of a steric clash, HOPS is also recognizing atoms which are taking part in hydrogen bonds (see Section 2.1.3). HOPS uses the skip list structure as a rough first cut as to which amino acids are close enough to take part in hydrogen bonds. When a donor and an acceptor are close enough, the bond angle formed by the donor and acceptor is checked. If relative position of the two atoms is a suitable, the atoms are tagged as participants in a hydrogen bond. This technique does not preclude multiple atoms from taking part in hydrogen bonding and can recognize disulfide bridges in a similar manner.25
Sidechain atoms are set according to a user-supplied library of sidechain atoms. Though we typically use a single sidechain conformation, multiple sidechain conformations can be searched in a single run through HOPS. Since the sidechain libraries are static, this can sometimes lead to conformations with a large number of voids in their conformations. Using non-linear optimization code, sidechain configurations can be perturbed slightly in order to allow tighter packing of the predicted protein structures. Sidechain packing allows us to more tightly fit amino acids together and more accurately mimic the conformations that actual proteins are able to take on. It can also be used to avoid sidechain-to-sidechain steric clashes.
HOPS uses a different, incremental technique to estimate an atom's accessible surface area. Richards and Lee's technique works well for complete protein conformations, but we are continually adding and removing amino acids from the end of the partial fold. HOPS needs a technique where the majority of the information from previous ASA calculations is carried forward into future ASA calculations, and only the changed portion of the protein must be recomputed.
HOPS borrows a pair of techniques from computer graphics to compute ASA. In a nutshell, we sample points from each atomic shell and determine the percentage which are accessible to the surrounding solvent. The percentage of points accessible to the solvent is then multiplied by the total atomic surface area to calculate each atom's contribution to the protein's accessible surface area. These points can be chosen either randomly or in a deterministic way. A random point set involves selecting a pair of spherical coordinates and then converting those spherical coordinates to Cartesian coordinates on the atomic shell centered at the atom's center. For a deterministic set of points, Hammersley or Halton point sets [32,81] produce uniform, quasi-stochastic point sets on the sphere. They have a low computational cost and are incremental. By incremental, we mean that computing the 101-member point set requires the 100-member point set and one additional point. Therefore, if we wish to compute more finely-tuned estimates of accessible surface area, we do not have to throw away the work used to produce the coarser estimate.
HOPS can evaluate the current energy of a partial fold given a user-defined scoring function. The main scoring function we use includes hydrogen bonds formed, accessible surface area of hydrophilic and hydrophobic amino acids, and sidechain entropy.26 It is also possible for a user to write their own scoring function and drop it into the program. For instance, when testing the tweaking aspect of the program, a scoring function that rewarded a large number of tweaking events was incorporated.
In addition to computing the current score of the partial fold, HOPS will estimate the maximum contribution from the as yet unfolded portion27 of the protein can make to our scoring function. This estimate, coupled with the current known score, is used to inform the search function. The search function uses this information to determine whether following its current branch deeper on the tree will possibly yield a fold better than our current best.
picture(4926,4535)(376,-3733)
(1779,579)(0,0)[lb]AAcid |
In order to delineate the outline of its search space, HOPS can
utilize any of a variety of clustering algorithms to select
prototypical
angle combinations for each type of amino acid,
. The algorithms search for patterns in user-defined data sets of
experimentally determined angle pairs (see
Chapter 4). In addition, the user has the
option of defining their own sets of angles and asking HOPS to search
through those torsion angles. These angle sets define the outline of
our search space.
Given a discrete set of torsion angles for each amino acid, HOPS will
perform a complete
search of all possible torsion angle
combinations and will exit with the best result using any given
scoring function. Since the scoring function also takes into account
the potential score of the as yet unfolded portion of the protein, the
search approaches admissibility and this allows for considerable
pruning along the branches of the search tree. A scoring heuristic is
admissible if the search using the heuristic is guaranteed to
never prune an optimal solution. To define admissibility in our
problem, the estimate of the unfolded portion's contribution should
give us a lower bound (since we are minimizing the scoring function)
on the best fold in this branch on the tree. If this lower bound is
higher than the score of our current best result, the continued
searching of this branch will be fruitless. Due to the nature of the
program's scoring function, we are unable to prove that the scoring
heuristic we use is in fact strictly admissible, but we search the
space confident that an optimal solution will be pruned with a very
low probability.
In incorporating this energy function, we found that beginning with the fully admissible energy function would lead to very broad and deep search trees with a large number of total nodes searched. To alleviate this problem, we borrowed an idea from iterative-broadening search [31] and incorporated an optimism factor which is incremented across multiple complete searches of the search tree. Optimism begins at zero and increments to one.28Optimism refers to what percentage of the unfolded portion's potential contribution we actually expect to occur. So an optimism of zero implies that the unfolded portion will contribute nothing to the partial fold.
With an optimism of zero, the search algorithm will find an initial solution and then will prune the entirety of the remaining search tree. No partial fold will be able to improve on the current scoring value as the unfolded portion will not contribute to the protein's score. This fast strike initial solution is then carried onto the next re-search of the search space. With an initial solution already in hand and slightly more optimism, we can search more of the tree, but our previously found greedy solution allows us to prune to a greater degree. This continues as we complete searches and increment the optimism factor to one, where all of the potential contribution will be added to the scoring function and our search is once again admissible. Searching the same space multiple times with increasing optimism is generally faster than searching the space thoroughly once.
To foster the creation of
-sheets, HOPS uses a non-linear optimization
technique which attempts to align two
-strands into a
-sheet.
Tweaking adjusts the torsion angles between the two strands
until a proper turn region is found which brings two
-strands into
alignment. Tweaking is fully compatible with the other portions of
the search function and will be discussed in
Chapter 5.
All of these modules are tied together and can be utilized across a large number of processors using a fault-tolerant parallel programming technique called nagging [68]. Nagging has been shown to scale to 64 or more processors on difficult combinatorial problems such as the traveling salesman problem and 3-satisfiability. The basic nagging paradigm is that of a tree of processors with a single processor as the root processor and other processors acting as naggers, masters, or both. Naggers ask their parents for attention and are passed a snapshot of the master's current state. In the case of HOPS, this is the current torsion angle choices, and any successful, currently incorporated tweaking or sidechain packing events. The nagger then permutes the choices of torsion angles in the remaining protein. Therefore, the master and nagger are searching the same space, but they are searching the space in a different order, so that a good solution found more quickly by a nagger, will inform the search of the parent and allow for greater pruning. This technique has been shown to provide speedups in cases where the speed to solution depends greatly on the order in which the trees are searched. The nagging infrastructure handles details such as optimization of the hierarchy of processors29 and message passing between nagger and master processors.
In order to develop new features for HOPS, we often found the need to directly access the geometric modeling engine underpinning HOPS. The protein explorer, PEX, provides a user interface for this engine. PEX reads in a standard PDB file (appendix B) and allows the user to explore the protein in a variety of ways. Many of these features are particular to HOPS, but a number of them could have use outside of our work as well.
As stated above, PEX is able to evaluate the score of a protein
conformation using a programmer-defined scoring function. Here, we
will use this ability to comment on the reasonableness of the scoring
function used in HOPS (Section 3.3). An ideal
scoring function would have the protein's native fold as its
minimizer. In this case, any change to the backbone torsion angles of
the protein should cause a worsening of the scoring functions value.
To check the scoring function used in HOPS for this characteristic, we
will twiddle the turn regions of a set of proteins a large
number of times and then compare the score of these randomly perturbed
conformations to that of the native conformation. Twiddling is not a
technical term, but we use it to describe the perturbation of a
backbone torsion angle. Twiddling a protein merely injects noise into
the backbone torsion angle conformations by randomly changing their
conformation by some amount. The change in the protein
conformation is generally a very small one (
-
). Additionally, since we do not twiddle amino acids found
in secondary structures, we will not indiscriminately destroy
secondary structure. This makes the test a stiffer one as we will not
destroy energetically favorable hydrogen bonding patterns found in
helices and sheets.
To study this issue we have chose a small (alpha spectrin, 1SHG, 57 amino acids, with mostly sheets [53]), medium (metmyoglobin, 1YMB, 153 amino acids and helical [27]), and large (subtilisin BPN, 2ST1, 275 amino acids with both sheets and helices [11]) protein. Twiddling often creates steric clashes, so not all proteins that are twiddled can be scored. To gather a large enough set of points, we attempted 1600 twiddles for each of the three proteins and generated 1473 valid twiddles for 1SHG, 413 for 1YMB and 295 for 2ST1. We then generated histograms for the distributions of the score for each of the three proteins (see Figure 3.8). While our scoring function does not identify the native fold as its minimizer, it does view it as a very good solution. In each of the cases, the native fold is in the 90th percentile of all folds considered by our twiddling.30Certainly continued effort needs to be put into refining our scoring function, but it is encouraging that our scoring function is able discern that the native fold is superior to solutions nearby.