Thesis (Index)   <-  Sean Forman   <-  You Are Here



Next: Algorithms for Torsion Angle Up: TORSION ANGLE SELECTION AND Previous: Protein and Protein Folding Subsections


Overview of HOPS

HOPS (Hybrid Optimization of Protein Structure) is a tool for the ab initio prediction of unknown protein structures.20 It implements a wide variety of different ideas from statistics, robotics, discrete optimization and computer science in an effort to efficiently predict the three-dimensional structure of a protein when given only its amino acid sequence.


General Structure of HOPS

HOPS moves from node-to-node through a large search tree (see Figure 3.1) building, altering and evaluating partial and complete protein folds at each node. The depth of the search tree is determined by the number of amino acids in our protein. The number of branches at each node is defined by a discrete set of possible torsion angles. This discrete set of angles is chosen by a clustering algorithm using experimental data.21 Within this discrete search, we utilize continuous secondary search techniques (tweaking and sidechain packing) to expand our search into nearby advantageous regions.

A protein is folded by HOPS from its $ \mathbf{N}$-terminus to its $ \mathbf{C}$-terminus (left to right). We begin with the first amino acid, set the fixed parameters of its backbone bonds (the bond angles and bond lengths) and then add a sidechain suitable to that type of amino acid. Using our results from clustering, we then use the first torsion angle combination available to complete the amino acid's configuration. Once we've set a torsion angle pair for this amino acid, we move down to the next node adding a new amino acid. When we return to the original level of the search tree perhaps after searching many, many conformations, we then consider another torsion angle combination for that amino acid. When all angles have been attempted, we backtrack to the previous level. Our search is inherently recursive.

At each node in our search tree, we check for steric clashes in our partial (or at the leaves the tree, full) protein conformation and evaluate the score of the protein. In certain cases, we will be able to prune the search tree and reduce the size of our search. If there is a steric clash, the protein is deemed untenable, and we prune that portion of the search tree. If the conformation is legal, we compute the score of the partial fold along with a contribution from the unfolded remainder of the protein. We are able to prune if this score is worse than our current best solution (see Figure 3.4), otherwise we continue searching with the next amino acid.

Sections 3.2 (Geometry) and 3.3 (Energetics) deal with how we build, represent, and evaluate the partial and complete proteins. Section 3.4 (Search) describes how the search space is constructed and searched.


\begin{picture}(0,0)%
\includegraphics{GRAPHICS/treeInit.pstex}%
\end{picture}
@font
picture(6316,4765)(301,-3962) (281,-3415)(0,0)[lb]AAcid $ L - 1$% (281,-3939)(0,0)[lb]AAcid $ L$% (5026,-3736)(0,0)[lb]at each level% (5026,-3436)(0,0)[lb]$ n_k$ angle choices% (1051,-286)(0,0)[lb]AAcid $ 2$% (1051, 89)(0,0)[lb]AAcid $ 1$% (2851,464)(0,0)[lb] $ (\phi_1,\psi_1)$% (4876,464)(0,0)[lb] $ (\phi_n,\psi_n)$%
[A search tree of potential protein structures.] Using the set of torsion angles found in clustering, we create a search tree of potential protein structures.

\begin{figure}{\bf program} $HOPS(proteinLength)$\begin{algorithmic}
\par\STATE ...
...hildren are similar, but may
begin at some depth greater than zero.}\end{figure}


Geometry

HOPS uses a full protein representation which means that the position of every non-hydrogen atom in three dimensions is modeled.22 Since we are searching through a vast number of potential conformations, HOPS builds data structures fully representing the three-dimensional location and orientation23 of each non-hydrogen atom in a protein. These positions are generated using any backbone bond length and bond angle chosen at compile-time and a variable set of torsion angles chosen at run-time.

To quickly change these positions based on what node of the search tree we are at, we use a fast positional update technique which utilizes homogeneous coordinate systems. This representation relies on a sequence of  $ 4 \times 4$ matrix multiplications in order to update the position of the atoms in the protein (see Section 5.1 for a full explanation).

Following an update to an amino acid's atomic positions,24 HOPS recognizes steric clashes between amino acids using skip lists [63] and bounding spheres. Skip lists are a fast, probalistic method for quickly accessing entries in an ordered linked list. Once we have identified a potential steric clash, we then search, atom-by-atom, between the two amino acids looking for evidence of a steric clash. For each pair of atoms from each amino acid, we compare their locations and the size of their atomic spheres. The intersection of two atomic spheres is recognized as a steric clash and that conformation is rejected.

Steric clashes between nearby amino acids (within two to five amino acids) are likely to occur repeatedly in the search trees as each short torsion angle series is part of thousands of potential folds. Rather than rebuild each of these potential folds many times and repeatedly determine they contain a steric clash, HOPS maintains a clash cache. This cache uses an efficient hash function to index these steric clashes which allows for a quick check before any new amino acid is added to the end of our search tree. This check of the clash cache will determine if this sequence of torsion angles has previously caused steric clashes.


\begin{picture}(0,0)%
\includegraphics{GRAPHICS/treeClash.pstex}%
\end{picture}
picture(6030,3939)(1051,-3136) (1051,-286)(0,0)[lb]AAcid $ 2$% (1051, 89)(0,0)[lb]AAcid $ 1$% (1051,-1111)(0,0)[lb]AAcid $ 4$%
[A steric clash between two amino acids.] The clash between amino acids two and four and is stored in the clash cache. Any future instances of this torsion angle combination is pruned automatically.

As it is evaluating the possibility of a steric clash, HOPS is also recognizing atoms which are taking part in hydrogen bonds (see Section 2.1.3). HOPS uses the skip list structure as a rough first cut as to which amino acids are close enough to take part in hydrogen bonds. When a donor and an acceptor are close enough, the bond angle formed by the donor and acceptor is checked. If relative position of the two atoms is a suitable, the atoms are tagged as participants in a hydrogen bond. This technique does not preclude multiple atoms from taking part in hydrogen bonding and can recognize disulfide bridges in a similar manner.25

Sidechain atoms are set according to a user-supplied library of sidechain atoms. Though we typically use a single sidechain conformation, multiple sidechain conformations can be searched in a single run through HOPS. Since the sidechain libraries are static, this can sometimes lead to conformations with a large number of voids in their conformations. Using non-linear optimization code, sidechain configurations can be perturbed slightly in order to allow tighter packing of the predicted protein structures. Sidechain packing allows us to more tightly fit amino acids together and more accurately mimic the conformations that actual proteins are able to take on. It can also be used to avoid sidechain-to-sidechain steric clashes.


Energetics

Most biochemists believe the hydrophobic-hydrophilic effect (see Section 2.1.4) is one of the primary driving forces in folding. The magnitude of this effect can be approximated by the sum of the surface areas for both hydrophobic and hydrophilic atoms accessible to the surrounding solvent and also by appropriate thermodynamic constants. In 1971, Lee and Richards [48,65] introduced a means to compute the accessible surface area (ASA) for atoms in a folded protein. They describe a spherical probe the size of a water molecule ( $ 2.5 $Å) which is rolled over the surface of the protein. This probing is done in successive, parallel two-dimensional slices of the protein. Taken together these slices approximate an integral of the accessible surface area of the protein, along with accessible surface areas for both hydrophilic and hydrophobic atoms.

HOPS uses a different, incremental technique to estimate an atom's accessible surface area. Richards and Lee's technique works well for complete protein conformations, but we are continually adding and removing amino acids from the end of the partial fold. HOPS needs a technique where the majority of the information from previous ASA calculations is carried forward into future ASA calculations, and only the changed portion of the protein must be recomputed.

HOPS borrows a pair of techniques from computer graphics to compute ASA. In a nutshell, we sample points from each atomic shell and determine the percentage which are accessible to the surrounding solvent. The percentage of points accessible to the solvent is then multiplied by the total atomic surface area to calculate each atom's contribution to the protein's accessible surface area. These points can be chosen either randomly or in a deterministic way. A random point set involves selecting a pair of spherical coordinates and then converting those spherical coordinates to Cartesian coordinates on the atomic shell centered at the atom's center. For a deterministic set of points, Hammersley or Halton point sets [32,81] produce uniform, quasi-stochastic point sets on the sphere. They have a low computational cost and are incremental. By incremental, we mean that computing the 101-member point set requires the 100-member point set and one additional point. Therefore, if we wish to compute more finely-tuned estimates of accessible surface area, we do not have to throw away the work used to produce the coarser estimate.

HOPS can evaluate the current energy of a partial fold given a user-defined scoring function. The main scoring function we use includes hydrogen bonds formed, accessible surface area of hydrophilic and hydrophobic amino acids, and sidechain entropy.26 It is also possible for a user to write their own scoring function and drop it into the program. For instance, when testing the tweaking aspect of the program, a scoring function that rewarded a large number of tweaking events was incorporated.

In addition to computing the current score of the partial fold, HOPS will estimate the maximum contribution from the as yet unfolded portion27 of the protein can make to our scoring function. This estimate, coupled with the current known score, is used to inform the search function. The search function uses this information to determine whether following its current branch deeper on the tree will possibly yield a fold better than our current best.

Figure 3.4: Example of a poorly scoring partial fold.

\begin{picture}(0,0)%
\includegraphics{GRAPHICS/treeScore.pstex}%
\end{picture}
picture(4926,4535)(376,-3733) (1779,579)(0,0)[lb]AAcid $ 1$% (1615,292)(0,0)[lb]AAcid $ 2$ (1717,-2852)(0,0)[lb]Current best solution, $ \delta_{min}$ (4951,-1320)(0,0)[lb]best possible (4951,-1542)(0,0)[lb]competion, $ \delta_f$ (4941,-633)(0,0)[lb]Evaluate this node,$ \delta_c$ (4951,-1098)(0,0)[lb]Overestimate the (411,-1861)(0,0)[lb]AAcid $ L - 1$% (411,-2536)(0,0)[lb]AAcid $ L$% (451,-3661)(0,0)[lb]this branch, else we can prune.% (376,-3361)(0,0)[lb]If $ \delta_f + \delta_c < \delta_{min}$ continue searching %

Figure 3.5: Algorithm: procedure $ buildProtein(protein,depth, torsionAngle)$.
\begin{figure}{\bf procedure} $buildProtein(protein,depth, torsionAngle)$\begin{...
...par\ENDIF
\par\STATE $return$ \textbf{FALSE};
\par\end{algorithmic}\end{figure}

Figure 3.6: Algorithm: procedure $ addAAcid(protein, depth, torsionAngle)$.
\begin{figure}{\bf procedure} $addAAcid(protein, depth, torsionAngle)$\begin{alg...
...par\ENDIF
\par\STATE $return$ \textbf{FALSE};
\par\end{algorithmic}\end{figure}


Search

In order to delineate the outline of its search space, HOPS can utilize any of a variety of clustering algorithms to select $ N_a$ prototypical $ \phi - \psi$ angle combinations for each type of amino acid, $ a$. The algorithms search for patterns in user-defined data sets of experimentally determined angle pairs (see Chapter 4). In addition, the user has the option of defining their own sets of angles and asking HOPS to search through those torsion angles. These angle sets define the outline of our search space.

Given a discrete set of torsion angles for each amino acid, HOPS will perform a complete $ A^\star$ search of all possible torsion angle combinations and will exit with the best result using any given scoring function. Since the scoring function also takes into account the potential score of the as yet unfolded portion of the protein, the search approaches admissibility and this allows for considerable pruning along the branches of the search tree. A scoring heuristic is admissible if the search using the heuristic is guaranteed to never prune an optimal solution. To define admissibility in our problem, the estimate of the unfolded portion's contribution should give us a lower bound (since we are minimizing the scoring function) on the best fold in this branch on the tree. If this lower bound is higher than the score of our current best result, the continued searching of this branch will be fruitless. Due to the nature of the program's scoring function, we are unable to prove that the scoring heuristic we use is in fact strictly admissible, but we search the space confident that an optimal solution will be pruned with a very low probability.

In incorporating this energy function, we found that beginning with the fully admissible energy function would lead to very broad and deep search trees with a large number of total nodes searched. To alleviate this problem, we borrowed an idea from iterative-broadening search [31] and incorporated an optimism factor which is incremented across multiple complete searches of the search tree. Optimism begins at zero and increments to one.28Optimism refers to what percentage of the unfolded portion's potential contribution we actually expect to occur. So an optimism of zero implies that the unfolded portion will contribute nothing to the partial fold.

With an optimism of zero, the search algorithm will find an initial solution and then will prune the entirety of the remaining search tree. No partial fold will be able to improve on the current scoring value as the unfolded portion will not contribute to the protein's score. This fast strike initial solution is then carried onto the next re-search of the search space. With an initial solution already in hand and slightly more optimism, we can search more of the tree, but our previously found greedy solution allows us to prune to a greater degree. This continues as we complete searches and increment the optimism factor to one, where all of the potential contribution will be added to the scoring function and our search is once again admissible. Searching the same space multiple times with increasing optimism is generally faster than searching the space thoroughly once.

To foster the creation of $ \beta$-sheets, HOPS uses a non-linear optimization technique which attempts to align two $ \beta$-strands into a $ \beta$-sheet. Tweaking adjusts the torsion angles between the two strands until a proper turn region is found which brings two $ \beta$-strands into alignment. Tweaking is fully compatible with the other portions of the search function and will be discussed in Chapter 5.

All of these modules are tied together and can be utilized across a large number of processors using a fault-tolerant parallel programming technique called nagging [68]. Nagging has been shown to scale to 64 or more processors on difficult combinatorial problems such as the traveling salesman problem and 3-satisfiability. The basic nagging paradigm is that of a tree of processors with a single processor as the root processor and other processors acting as naggers, masters, or both. Naggers ask their parents for attention and are passed a snapshot of the master's current state. In the case of HOPS, this is the current torsion angle choices, and any successful, currently incorporated tweaking or sidechain packing events. The nagger then permutes the choices of torsion angles in the remaining protein. Therefore, the master and nagger are searching the same space, but they are searching the space in a different order, so that a good solution found more quickly by a nagger, will inform the search of the parent and allow for greater pruning. This technique has been shown to provide speedups in cases where the speed to solution depends greatly on the order in which the trees are searched. The nagging infrastructure handles details such as optimization of the hierarchy of processors29 and message passing between nagger and master processors.

\begin{figure}{\bf procedure} $search(protein,depth)$\begin{algorithmic}
\par\ST...
...e tweaking event, add a new amino acid
and step to the next depth. }\end{figure}


PEX

In order to develop new features for HOPS, we often found the need to directly access the geometric modeling engine underpinning HOPS. The protein explorer, PEX, provides a user interface for this engine. PEX reads in a standard PDB file (appendix B) and allows the user to explore the protein in a variety of ways. Many of these features are particular to HOPS, but a number of them could have use outside of our work as well.

PEX extracts the torsion angles, bond lengths and bond angles from every backbone bond in a protein.
PEX extracts the positions of the residue atoms relative to their  $ \mathbf{C_{\alpha}}$ atom with the  $ \mathbf{C_{\alpha}}$- $ \mathbf{C_{\beta}}$ bond being aligned with the $ z$-axis of the  $ \mathbf{C_{\alpha}}$'s coordinate system. This allows you to build a library of existing sidechain templates for use in HOPS or in other projects.
PEX computes the accessible surface area of each atom in a protein using any of a variety of sampling methods and parameters.
Using an energy function of your choosing, PEX finds the energy value of a protein.
Given a discrete set of torsion angles, PEX will take an existing protein and change its torsion angles to the ones closest in the given discrete set.
Also by default, it checks the format of a PDB file for errors and determine if a set of atom locations has a steric clash.

As stated above, PEX is able to evaluate the score of a protein conformation using a programmer-defined scoring function. Here, we will use this ability to comment on the reasonableness of the scoring function used in HOPS (Section 3.3). An ideal scoring function would have the protein's native fold as its minimizer. In this case, any change to the backbone torsion angles of the protein should cause a worsening of the scoring functions value. To check the scoring function used in HOPS for this characteristic, we will twiddle the turn regions of a set of proteins a large number of times and then compare the score of these randomly perturbed conformations to that of the native conformation. Twiddling is not a technical term, but we use it to describe the perturbation of a backbone torsion angle. Twiddling a protein merely injects noise into the backbone torsion angle conformations by randomly changing their conformation by some amount. The change in the protein conformation is generally a very small one ($ 1^\circ$ - $ 2^\circ$). Additionally, since we do not twiddle amino acids found in secondary structures, we will not indiscriminately destroy secondary structure. This makes the test a stiffer one as we will not destroy energetically favorable hydrogen bonding patterns found in helices and sheets.

To study this issue we have chose a small (alpha spectrin, 1SHG, 57 amino acids, with mostly sheets [53]), medium (metmyoglobin, 1YMB, 153 amino acids and helical [27]), and large (subtilisin BPN, 2ST1, 275 amino acids with both sheets and helices [11]) protein. Twiddling often creates steric clashes, so not all proteins that are twiddled can be scored. To gather a large enough set of points, we attempted 1600 twiddles for each of the three proteins and generated 1473 valid twiddles for 1SHG, 413 for 1YMB and 295 for 2ST1. We then generated histograms for the distributions of the score for each of the three proteins (see Figure 3.8). While our scoring function does not identify the native fold as its minimizer, it does view it as a very good solution. In each of the cases, the native fold is in the 90th percentile of all folds considered by our twiddling.30Certainly continued effort needs to be put into refining our scoring function, but it is encouraging that our scoring function is able discern that the native fold is superior to solutions nearby.

\begin{figure}\begin{center}\leavevmode
\epsfig{file=GRAPHICS/1SHG.eps, height=...
...core of the
native fold is denoted by a ball and line.}
\end{center}\end{figure}


next up previous
Next: Algorithms for Torsion Angle Up: TORSION ANGLE SELECTION AND Previous: Protein and Protein Folding
sforman@sju.edu