Thesis (Index) <- Sean Forman <- You Are Here
In the course of folding a protein with a discrete set of backbone
torsion angles, HOPS often creates
-strands which do not take part in
hydrogen bonding with other nearby
-strands to form
-sheets. These
unformed
-sheets are poorly scoring conformations. However, with a small
change or ``tweak'' to a handful of amino acids, the
-strands could be
brought into alignment and the resulting conformation would have a
significantly better score.
Discretization of a protein's
torsion angles and the
generation of a finite number of folds converts an overwhelming
continuous problem into a form slightly easier, but still NP-hard and
quite difficult to solve. This simplification, however, must come
with trade-offs. What we find using our technique is that isolated
``local'' motifs occur with great regularity:
-helices and
-strands. However,
these sheet structures do not routinely come together to form more
common super-secondary structures like
-sheets, or any sort of aligned
helical structures. These structures are energetically advantageous,
but they involve complicated long-range interactions which are
unlikely to occur in a discrete search space. This isn't just a
shortcoming of our technique; most ab initio techniques
perform much better on helical proteins than on those with mostly
-sheets.
This stands to reason. These structures are largely global in nature.
The alignment of the beginning of a
-strand, A, to the end of
-strand, B, can be achieved in a vast number of ways. In nature,
you can think of the two strands arriving close together and then
pulling each other into alignment. When pulled into alignment, the
turn regions between the two
-strands are changed slightly to bring them
together, so hydrogen bonding can occur. This is a relaxation of the
discrete
angle pairs that we are using.
Due to this, the alignment of the
-strands defies discretization.
Certainly it would be possible to identify, describe, and implement
some canonical set of turn angles. This would aid in the creation of
anti-parallel
-sheets, but this is in some sense is the least interesting
behavior. A far more interesting case is where two strands from
distant regions of the protein come together to form either a parallel
or anti-parallel
-sheet (see Figure 5.1).
In order to capture this global nature of secondary structure, we need
to shift from discrete torsion angles to continuous torsion angles.
While the discrete angles may bring two structures near each other,
they will not close the last few angstroms of gap and properly align
the two protein segments. Ideally, we would take the discrete angles
and alter them slightly in a continuous manner in order to align two
-strands. This means that we will not be making large changes in the
protein structure-only small tweaks.
This chapter first gives sufficient background to understand the
geometric and mathematical techniques HOPS uses to build protein
conformations. This is followed by a description of an optimization
technique which relaxes the turn region between two
-strands in order to
create a
-sheet.
Our representation of the protein is based on homogeneous coordinate systems [19]. In robotics, homogeneous coordinate systems are used to model the position of a robot arm. The robot arm (or more generally a rigid link open kinematic chain) typically has several rigid pieces connected by linkages. These linkages have some number of degrees of freedom that allow them to change the arm's configuration. Beginning with the position of the base, one can progressively compute the position of the robot's effector (the business end of the arm) by computing the positions of the intermediate pieces using the position of the linkages and the lengths of the arms.
Similarly, our protein backbone is a set of fixed bond lengths and fixed
bond angles but variable torsion angles. Using the same idea as with
robotics, we are able to find the global position of all the backbone
atoms beginning with the position of the first amino acid's
atom through to the final amino acid's
atom (see
Figure 5.2).
Three-dimensional geometry has six degrees of freedom: the atom's 3-D
location,
, and also the atom's orientation (its yaw, pitch
and roll, or more precisely the angle of rotation around the
,
and
axes). To simplify the manipulation of these positions and
the calculation of subsequent atomic positions, these six degrees of
freedom are expanded into twelve entries in a
matrix. While
this takes twice as much storage space, it reduces the complexity of
updating the atomic positions substantially.
We need two types of
matrices in order to compute the position
of each backbone atom in this manner. Each backbone atom has its own
global position matrix, and each covalent bond connecting two
backbone atoms has a related transform matrix. As we travel
from backbone atom A to adjoining backbone atom B,
we compute atom B's position matrix by multiplying atom
A's position matrix by the transform matrix of the bond
connecting them. Using this technique, we can find the global
location of every atom in the protein using only the torsion angle and
the angle and length of the bonds between the backbone atoms along
with the position of the first backbone atom. This first backbone
atom can be set to any arbitrary global position,41but we typically set the first backbone atom at the origin and the
first backbone bond oriented along the global
-axis.
Equation 5.1 shows how we can convert
, the three location coordinates, and the
rotations about the three axes,
, into a
position matrix. The order in which the rotations are preformed is
important, and in the equation given, the first rotation is about the
-axis, followed by
and
. The atom's location can always be
read in the first three rows of the fourth column.
As we build the protein, this equation is not actually used other than
when initializing the very first backbone atom. This position matrix
is set to the
. Note that
our subscript for
tells us that this is the position
matrix for the first amino acid's
atom.
As stated before, following backbone atom position matrices will be
the product of this initial position matrix and all intervening
transform matrices. Just as with the position matrices the transform
matrix is a
matrix. In fact, there is a transform matrix for
each of the six basic transformations we can make to an atom's
position. We can rotate the atom's position about any of the three
axes, or we can translate it along any of its three axes. If we have
a series of these basic transforms to perform, we can do that by
multiplying the basic matrices in the order in which the
transformations occur. In our case, we translate the length of the
bond along the
-axis, followed by a rotation about the
-axis,
which is the torsion angle for that bond, and then rotate about the
-axis, which is the bond angle. Here
is the transform
matrix that represents a bond with a length,
, a bond angle of
and a torsion angle of
(see Equation 5.3).
Just as we have position matrices for each angle, we can define
transform matrices for each bond,
, where
this is the transform matrix for amino acid
's
bond
(
). This converts the bond's three
degrees of freedom to nine entries in a
matrix.
As stated, the computation of a backbone atom's position matrix
requires the previous atom's position matrix and the transformation
matrix representing the bond between them. The position matrices for
amino acid
can be found if the position matrix for
of
amino acid
is known and the three transformation matrices
corresponding to the
and
torsion angles can be
found. Figure 5.3 shows how the coordinate
system changes as we progress along the protein backbone, and
equations 5.4 through 5.6 show how we compute the
positions of the backbone atoms as we move along the protein.
As Equation 5.7 shows, the effect of the transform matrices is
cumulative. Knowing only the bond angle and length and the torsion
angle of each bond, makes it possible for us to find the position of
the final atom in the protein backbone. Later on when we align two
-strands, we will be altering the
matrices
slightly in order to change the position of a set of atoms.
In Equation 5.3, we saw that the transform matrix from one bond to another is determined by the bond angle and length and the bond's torsion angle. Since the first two parameters are essentially constant, we find that the transform matrices are matrices of one variable, the torsion angle for the bond. This also points out that altering a single torsion angle propogates the change throughout the entire protein.42
This means that, at least in our algorithm, the set of torsion angles chosen to build the protein uniquely define its structure. In the case of real proteins, the bond lengths and angles do vary, but not to a great degree. As stated earlier (see Section 2.1.1), these variations can be determined using experimental methods and are known to be small.
Let's for a moment assume we have a
-strand A and another
-strand B with a set of amino acids
between them. Using our discrete angles we may have something like
Figure 5.4.
A and B are too far apart to bond with each other as the current configuration stands. This is likely to be a poor partial fold using any common energy function.43 In HOPS, such a partial fold when found would be pruned. However, some small changes to the backbone torsion angles can lead to a conformation like Figure 5.5. This conformation will be much better energetically and would be more likely to be part of a final prediction.
This points out that the difference between a bad conformation and a
good one is often small; the ability to ``relax'' the backbone torsion
angles of a turn region can cause sizeable improvements in the energy
of a protein. To do this, we need a way to align a pair of amino
acids and, then be certain that a
-sheet will then propogate as new amino
acids are added.
Structures like
-helices and
-strands are largely local in nature. Given the
proper pair of
torsion angles to go along with an
of
we can set up the case where the
-helix or
-strand propogates
infinitely. Using this fact for
-strands, we can define the position of
A's
relative to B's
in some way such
that the two strands will align no matter how long of a
-sheet we decide
to build. (It may be helpful to review the bonding pattern of
-sheets from Section 2.1.3.)
Given this ability, our problem is no longer to align two strands, but now to align two atoms.
If you recall in Section 5.1, we described the manner in
which the global position of each amino acid was determined. More
specifically, the position of our backbone atoms can be found by
multiplying the
position matrix,
, of the
previous backbone atom with the
transform matrix,
, representing the bond between the two atoms.
depends solely upon the bond length, the bond angle
and torsion angle. The bond angle and length are constants, and we
only vary the torsion angles,
, so this becomes a
matrix of only one variable (see Equation 5.3).
Since we only need to set up one interaction between the two
-strands, we
can simplify the problem. We only look at aligning amino acid
's
with amino acid
's
. The tweakable torsion angles are
then assumed to be the backbone torsion angles for the amino acids
to
. However, we can create cases where only some subset
of the intervening amino acids will be tweakable. This might occur if
a
-helix occurs between the two
-strands we are attempting to align. In
this case, we don't want to alter the torsion angles within the
-helix as
this may destroy the helical structure. These intervening structures
are considered rigid bodies unchangeable by our tweaking algorithm.
Our problem is then: given a position matrix
and a desired (or goal) position matrix of
find a set of transform matrices,
such that
Each of these transform matrices is a function of the
torsion angle and any changes made to it. We signify this as,
. This means that if we find
that equate the left and right hand sides of
Equation 5.8 then the two atoms (
and
)
will be in proper alignment for the creation of a
-sheet.
As mentioned earlier (Section 5.1) each
position
matrix is an expansion of only six variables (three location and three
orientation). Expecting each entry of the two matrices in
Equation 5.8 to match exactly would overdefine the problem.
Instead, we can solve this problem by requiring the two matrices to
agree on just six entries, the rightmost column of locations,
, and three entries from the upper left corner
sub-matrix. We found that using the
,
, and
entries was sufficient to force the three orientation values
to be equal. In other words, if we know the indicated entries from
the atom's
position matrix we know the atom's position,
We can further simplify the equation by removing
and
from the picture.
If we left-multiply both sides by
, we are left with
We have one more constraint to make. We want as much continuity as possible from the initial set of torsion angles to the new tweaked set. A wildly different set of torsion angles is more likely to lead to a steric clash within the turn region, and to be true to our discretization we wish to choose angles that are relatively close to the discrete angles that are the basis of our search space.
Our problem can now be formulated as a least-squares equality constrained problem.
Defined formally, we want to find a solution to the following problem.
where
is the product of the transform matrices
for the amino acids between the two that we are attempting to align,
and
and
are vectors
representing the original torsion angles and the amount of change in
the torsion angles, respectively. We are holding the torsion angle
for the
transform matrix
constant. This is done so that we do not unnecessarily increase the
chance of a steric clash between the amino acids we are attempting to
align. We could invert this matrix and multiply it with
. Since we are going to convert the main
constraint into a linear function anyway, it will only affect the
coefficient matrix for our system of linear equations, so we will fold
it into the main constraint function.
This problem as stated is difficult to solve optimally, as
is non-linear. Therefore, we must settle for an
approximate solution which we improve iteratively. While we will make
certain any solution satisfies our equality constraint, we will not be
certain that the objective function is fully minimized. For our
purposes, this is sufficient as our overarching goal is to satisfy the
equality constraint which will align the two
-strands.
To solve this problem, we need to convert the constraint into a linear function. This is done easily using a Taylor polynomial approximation of the main constraint.
To incorporate this iterative technique, we define
. This is slightly different than techniques where each solution
is a function of the previous iteration's solution. In our case, we
use the previous iteration to set up the problem to solve. Our
problem now becomes.
| (5.18) | ||||
![]() |
In our main constraint (5.21), we will see that
is built as a
matrix,
where
is the number of tweakable backbone torsion angles between
the two amino acids we wish to align,
is an
array, and the right-hand side can be written as a
array.45 This will allow us to replace
with
an equal sign (
).
Prior to building this equality constraint, we will need to define a
transform derivative matrix (
). If you recall the matrix introduced in
Equation 5.3 which computes our transformation matrix, we see
that the matrix is a function of
and
. However, we treat the bond angle,
, and the bond
length,
as constants for every backbone bond, so this matrix has
just one variable the torsion angle,
. If we take the
derivative of the matrix with respect to our lone variable we get the
derivative matrix,
.
, each column corresponds to the
partial derivative of
by computing
In order to solve each instance of our program, we apply Lagrangian multipliers to the problem, which produces a system of linear equations whose solution will solve the given program (5.20).
and
Solving this system utilizes some standard techniques from numerical
linear algebra. For reasons that will become apparent we are going to
operate on
, which is
and is a
real matrix. For any real matrix, we are able to produce a
factorization,
where
, which is
, is an orthogonal matrix and
, which is
, is upper triangular. A matrix is
orthogonal if
and
, and a matrix is
upper triangular if it is square and has only zero entries
below the diagonal. A system of linear equations with an upper
triangular matrix,
can be solved much more
quickly than a full matrix can be
solved [3]. This difference is not
insignificant. Using Gaussian Elimination on the full matrix, we would
take many more operations to form an upper triangular matrix
(
) and then solve the system of equations. The factorization
allows us to concentrate on an
matrix rather than a
matrix. Using our factorization,
and
Equation (5.26) becomes,
| (38) | |||||||
| (39) | |||||||
| (40) | |||||||
| (41) | |||||||
| (42) |
As mentioned earlier, we need to decide how to terminate an effort to
align two
-strands. It is not aways possible to align two
-strands. The
most common problem is that the two atoms we wish align are too far
apart and no matter how we alter the backbone bonds between them they
can never be parallel to each other. Less common, but still a problem
is that they could be too close, and no assignment of torsion angles
could cause the intervening turn region to bend sharply enough for the
two atoms to interact.
To account for these situations where no solution can be found, we
have two simple requirements as we iteratively attempt to set
. We require
, and we set some
maximum number of iterations, usually ten. If either of these
requirements is violated, we do not tweak the protein and continue
with our normal search techniques.
There are a number of implementation details that one must consider
before the above techniques can actively be used to align
-strands into
-sheets. Here, we will highlight some of the issues that we need to deal
with.
A critical step in each iteration of solving our program (5.20)
is the construction of the
-factorization of our
derivative matrix
. We have chosen to use a freely
available matrix solver called Meschach [72].
Using Meschach provides numerous advantages. First, the program is in
ANSI C, so it is easily folded into our existing code. It
was written by experts in numerical analysis, so the issues and
considerations inherent in finding numerical solutions using computers
have been considered. And lastly, one of the authors is a colleague
at the University of Iowa and is accessible to answer questions and
provide support.
Using Meschach makes our solution of these problems fairly simple.
First, we load our matrices and vectors into the appropriate data
structures. Next, we find the
-factorization using
library functions, utilize the upper and lower triangular solver
library functions, and finally use matrix multiply to return our
solution to the iteration. All memory is dynamically allocated and
de-allocated, so we can solve matrices of varying size without having
to worry about a maximum size.
Tweaking is not attempted at every node of the search tree. When we enter a new node of the search tree, the torsion angles for the new amino acid are noted and we enter the subroutine attemptTweak (see Figure 5.6). This subroutine checks for potential tweaking events and alignments before actually calling strandsAlign which implements the algorithm and techniques discussed in this chapter.
AttemptTweak first determines if the current amino acid is
the possible starting point for a
-strand. To do this, it will check the
torsion angles of the newly added target amino acid to see if they
fall within a certain range of standard
-strand torsion angles. If not,
we will return to the main search tree and use the discrete torsion
angles. Assuming the torsion angles are amenable, we then run through
the previously folded protein portion to analyze the current secondary
structure. When doing this, we are recognizing
-strands,
-helices,
and turn regions.
We then work backward from our target amino acid looking for a partner
amino acid that is part of a
-strand. We require at least three
intervening turn amino acids before this partner, so that enough
flexibility exists in order to align the target and its partner. For
our implementation, an amino acid is tweakable so long as it is not
currently involved in a
-sheet or an
-helix, and it has not been tweaked in
a previously successful effort to build a
-sheet. This means that
tweaking will not destroy existing secondary structure, but will work
around it to create additional secondary structure.
As we search backward to the beginning of the protein for potential
partner amino acids, once a potential partner is found, we attempt to
create both a parallel and anti-parallel
-sheet. If an alignment is
found, we increment our depth and move onto the next level in our
search tree. If none is found, or we return after searching a branch
of the tree, we continue to search for partners until we reach the
first amino acid. Then we leave tweaking and continue on with the
regular search algorithm.
As we mentioned earlier, we search the search space in a depth-first manner using a parameter called optimism (see Section 3.4). We also use optimism in our tweaking algorithm. When optimism is low we restrict how often we will attempt to perform a tweak. For instance, if amino acids are far apart or if the convergence is slow, we may terminate the tweak attempt before being certain if a tweak is possible. This meshes with our efforts in restricting the scope of the search until we have a higher level of optimism.
Tweaking must also work within our nagging infrastructure. If a child
processor nags a parent and the parent's current state includes a
successful tweak, the child must receive that information and rebuild
the tweak as well. In the nagging case, the child will never
backtrack over the tweak sent it by its parent, so it just sets all
the angles the parent sends it, notes the angles which are the result
of tweaks, and continues searching deeper in the search tree. This is
possible because tweaking is a deterministic process. Both the child
and parent processor will build the
-sheet in the same way.
Similarly, we have implemented a feature that allows us to rewind the search to a previous state if the root processor were to fail or is interrupted in its search. This is accomplished via a checkpoint file. The checkpoint file contains an output of all solutions found and an occasional output of the search state after some number of nodes. This output can then be read back in and the search state is updated to the latest search state in the checkpoint file. Tweak information must be stored and rebuilt as well. Unlike in the message passing for nagging, we have to rewind the search state when performing this type of restart. Therefore, we take great pains to properly travel through the subroutines setting all necessary counters and markers. This allows the search to properly backtrack and restart just as if it had never been halted in the first place.
Results from tweaking and examples of tweaked proteins will be given in Chapter 7.