Thesis (Index)   <-  Sean Forman   <-  You Are Here



Next: Tweaking Algorithms for -Strand Up: TORSION ANGLE SELECTION AND Previous: Overview of HOPS Subsections


Algorithms for Torsion Angle Selection

Clustering problems come in several different variations. We might wish to partition a set of inputs into disjoint subsets based on some relationship, find a set of representative points from the overall set, or match the distribution of the points in our space to a group of continuous functions [28]. Everitt further defines clustering problems as

Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes. Both the number of classes and the properties of the classes are to be determined. [28]

Everitt lets the reader define just what ``useful'' means. Intuitively, we are often able to observe a set of data and make some decisions as to what is and is not an appropriate way to organize it. Looking at the Ramachandran plots given in Appendix C, one can quickly pick out three or four densely populated areas, but selecting eight representative points for Gly, for instance, could prove to be more difficult. This problem can be difficult both to define formally and to solve in a mathematical sense.

To give us some guidance, Späth [71] defines four vital choices to make when finding clusters:

select an appropriate distance metric,31
select an algorithm to choose the clusters,
determine an appropriate number of clusters, and
choose variables to cluster on.

Taking these guidelines into account, we describe our clustering problem. Given a set of experimentally known $ \phi - \psi$ angle pairs and a desired number of cluster points, $ N$, select $ N$ representative $ \phi - \psi$ angle pairs to represent the patterns found in the set of all $ \phi - \psi$ angle pairs. In addition, we should be able to incrementally add new points to the set of $ N$ found in previous steps.


Introduction

Clustering was studied extensively in the 1970's and 1980's and, to some degree, was left alone for many years. Recently a great deal of work has begun on a variety of problems that relate to clustering. These problems come from a range of disciplines: mining of genomic databases, sequence homology techniques for protein folding, and search engine design for the world wide web, among others. As the size and type of data sets being studied has grown, so have the range of techniques used to solve them (see Fasulo [29] for a summary of recent work). We will focus on more common clustering techniques of which there are two primary families: hierarchical techniques and those based on optimization algorithms.

Our problem is smaller in comparison to some of the new problems being solved, but has its own unique challenges. We are clustering on only two numeric variables ( $ \phi - \psi$), we will likely want only a small number of clusters (less than ten), and our distance metric is standard euclidean distance. We have other constraints which are more difficult to define as well. Since we are choosing cluster points to summarize the information in a larger set, we would like the cluster points to represent a significant number of real points and also would like those points to be a reasonable distance apart from each other.


Hierarchical Techniques

Hierarchical techniques are also sometimes known as agglomerative or divisive techniques [28]. These techniques do not begin with a number of clusters in mind.32 Rather, they begin one of two ways, either the algorithm is initialized with all the points as members of a single cluster (divisive) or all the points as clusters unto themselves (agglomerative).

In the divisive algorithms, we begin with a single cluster and then divide the cluster into finer and finer gradations adding one new cluster in every step,33 until we terminate with every point as a separate cluster. So divisive algorithms end in the same state as agglomerative algorithms begin. And similarly, the agglomerative algorithms combine two clusters into one at each iteration until only a single cluster remains.

This construction technique creates a dendrogram which allows us to easily construct systems with any number of clusters from 1 to $ N$. Much of the art in using hierarchical techniques comes in correctly identifying the appropriate number of clusters to accurately distinguish between points that are and are not alike.

The addition and subtraction of clusters can be determined by a number of different methods. In a divisive algorithm, we might partition the cluster so that the resulting two new clusters have the maximum minimum distance between themselves, or we could chose a separation so that the resulting clusters have the furthest possible distance between their centroids, or we could choose a number of different techniques [28].

One characteristic of these techniques is that they tend to form elongated (rather than circular) clusters. Points that are close together, but in a straight line will be more likely to be clustered together than a circular set of points where the points are slightly further apart than those in the line. For our clustering problem, this is a serious concern as we would prefer to use spherical clusters which imply that their representative $ \phi - \psi$ pair is within some minimum distance of the point representing it. This suggests that an optimization technique is likely to yield superior results.


Optimization Techniques

The other major class of techniques uses optimization. In this class of solutions, we begin with a set of points and search for a partition which will minimize some objective function. The most common objective function (where $ N$ is the number of clusters and $ n$ is the number of data points) is

$\displaystyle Z = \operatorname*{min} \sum_{k=1}^{N} \sum_{ x_i \in \mathcal{C}_k} \Vert x_i - \overline{x_k} \Vert _2$ (1)

which attempts to minimize the square of the distance from every point to its nearest centroid. The point, $ x_i$, is part of a partition, $ \mathcal{C}_k$, and $ \overline{x_k}$ is the centroid of all the points in $ \mathcal{C}_k$

Unfortunately, the number of partitions for even a small problem (fifteen points, three partitions) can number in the millions or billions and the number of partitions rises dramatically as we add points.34Therefore, an enumerative technique is certain to fail, so most of the common techniques feature some sort of hill climbing.

While there are numerous techniques that have been tried, we will focus on the most common, $ k$-means,35 and the one we settled upon, facilities location (see Section 4.4).

The $ k$-means algorithm attempts to find a solution for equation 4.1 using an iterative process to improve the objective at each step. First, the points are randomly partitioned into $ k$ clusters. For each of these clusters, a centroid is found. Next, we step through each of the points and check to see if there is a centroid closer to it than the one it is assigned to. If so, we reassign the point and re-compute the affected centroids. This continues until none of the points are reassigned. The facilities location algorithm works similarly, and we will introduce it later.


Ramachandran Plots

We will expand on the introduction to torsion angles that was given in Section 2.1.2. The selection of appropriate $ \phi - \psi$ torsion angle pairs is integral to our solution of the protein folding problem. We select a discrete set of $ \phi - \psi$ angle pairs, and HOPS searches exhaustively through the space defined by this set36 for the best solution as determined by some energy function. Our basic assumption is that a finite number of torsion angles can represent a protein's continuous range of torsion angles. Therefore, it is imperative that we choose $ \phi - \psi$ angle pairs that accurately reflect real-life torsion angles.

\includegraphics[width=5.0in]{GRAPHICS/ramanchandran_tile.eps} [Ramachandran plot for Ser tiled four times.] This displays the cyclical nature of the angle space.

As we can see in the Ramachandran plots for each amino acid (see Appendix C), the $ \phi - \psi$ angle pairs form distinct patterns when graphed on a scatter plot. Since the points are not uniformly distributed over the angle space, the angle pairs that we choose should not be uniform either. As mentioned earlier, this angle space is cyclical in nature (see Figure 4.1) and can also be visualized as a torus. This is easy to see as the left and right side of the Ramachandran plot are connected and the top and bottom of the Ramachandran plot are connected. This also means that all four corners are connected into a single point, so while some plots may appear to have four distinct densely populated areas in the four corners, this is, in fact, just a single area.37While this peculiarity does not cause any serious difficulty, it requires some care when implementing a function measuring distance between two points in the plot.

Our data points are 2-D vectors, so we can use any of a variety of metrics to measure the distance between two points, $ A_1 =
(\phi_1,\psi_1)$ and $ A_2 = (\phi_2,\psi_2)$.

$\displaystyle \Vert A_1 - A_2 \Vert _1$ $\displaystyle = \vert\phi_1 - \phi_2 \vert + \vert\psi_1 - \psi_2 \vert$ (2)
$\displaystyle \Vert A_1 - A_2 \Vert _2$ $\displaystyle = \sqrt{\vert\phi_1 - \phi_2 \vert^2 + \vert\psi_1 - \psi_2 \vert^2}$ (3)
$\displaystyle \Vert A_1 - A_2 \Vert _{\infty}$ $\displaystyle =$   max$\displaystyle \left(\vert\phi_1 - \phi_2 \vert, \vert\psi_1 - \psi_2 \vert\right)$ (4)

Unless otherwise noted, we will be using (4.3) for all of our distance calculations.


Hierarchical Clustering of Torsion Angles

We have attempted several different techniques for selecting cluster points from the Ramachandran plots. One hierarchical technique we tested builds a minimum spanning tree [80] (MST) from all the points in our data set. MST's are easy to understand and the idea of a cluster transfers easily to this technique. To build a minimum spanning tree, we need to use some graph theory [23]. Every point in the Ramachandran plot is treated as a graph vertex and weighted edges then connect every vertex with every other vertex. These edges are weighted by the distance between the points (see Figure 4.2).

A spanning tree is a set of edges in a graph for which there is a single path from every vertex to every other vertex. A minimum spanning tree on a weighted graph, selects the spanning tree with the smallest possible total edge weights (see Figure 4.3). Finding the minimum spanning tree sounds more difficult than it is. Kruskal's algorithm is a polynomial time algorithm for selecting a graph's minimum spanning tree. Once we have computed the minimum spanning tree for our graph of torsion angle pairs, we then convert this graph into two clusters by removing the longest edge. Since there are no cycles in a minimum spanning tree this will create two disjoint sets (see Figure 4.3). We can then find cluster points by finding the centroids of the two parts, by computing the median of the two smaller trees, or by some other method. We can continue finding up to $ n$ cluster points by continuing to remove edges.

This technique is a divisive technique and can to lead to elongated clusters. With a very large data set, it leads to unsatisfactory answers, so we dismissed it fairly early in our research.

\includegraphics[width=2.0in]{GRAPHICS/mst_graph.pstex} [MST - full graph.] Four torsion angle pairs (A, B, C, D) with their distances represented as edge weights ( $ \Vert A-C\Vert _2 = 5$, etc.)

\includegraphics[width=2.0in]{GRAPHICS/mst_mintree.pstex} [MST - initial graph.] The minimum spanning tree for the graph in Figure 4.2.

\includegraphics[width=2.0in]{GRAPHICS/mst_cluster.pstex}

[MST - pair of clusters found.] A pair of clusters found in Figure 4.2 by removing the largest edge from the previous Figure 4.3. X marks the two cluster points based on this set of torsion angles based on the four points given.

One other method we have implemented attempts to minimize the maximum distance of any $ \phi - \psi$ angle pair to a cluster point when given a number of clusters. This is essentially finding a set of cluster points, $ \mathcal{C}$, solving the following:

$\displaystyle \operatorname*{min} \sum_{k=1}^{N} \sum_{ x_i \in \mathcal{C}_k} \Vert x_i - \overline{x_k} \Vert _\infty$ (5)

$ x_i \in \mathbf{T}$ is our set of torsion angles. We have implemented a potential solution, but in the course of iterating to find $ \mathcal{C}$, the cluster points fail to converge to a solution.


Clustering and Facilities Location

The solution presented here treats our problem as a facilities location problem. The $ \phi - \psi$ angles represent our customers and the cluster points represent the facilities assigned to serve them. As we will see, this solution provides reasonable results and also can be implemented in such a way that incremental addition of new cluster points is possible. In his book Cluster Analysis Algorithms [71], Späth describes using this method to find cluster points on the plane. We will modify this algorithm to fit our needs here.


Single Facility Location

A specific type of facilities location problems called Fermat-Weber Location problems [78] involves a set of demand points that are serviced by a set of facilities. The locations of the demand points are fixed, and the facilities are located to minimize the sum of weighted distances between the demands and the facility that services them (usually the closest one). In the typical representation of the problem, the demand points have varying levels of demand, and there is also no limit on the proximity of the facilities to each other. For instance, Wal-Mart has a set of stores in Iowa that need to be maintained. What is the ideal location for their warehouse given there are more stores in the eastern part of the state?

We will introduce the multi-facility problem after first discussing the single facility problem. The Weber single facility location problem is defined as

$\displaystyle \operatorname*{min} C(x,y) = \sum_{j = 1}^{n} \beta_j w_j \sqrt{(x - x_j)^2 - (y - y_j)^2}$ (6)

where $ (x_j, y_j)$ are the location of our demand points ( $ j
= 1,2,...,n$), $ \beta_j$ is the cost per unit shipped per unit distance, and $ w_j$ is the amount of product to be shipped to demand point $ j$ from our facility. For our problem, we define $ w_j =
\beta_j = 1$, so it reduces to just a minimization of the sum of the distances.

$\displaystyle \operatorname*{min} C(x,y) = \sum_{j = 1}^{n} \sqrt{(x - x_j)^2 - (y - y_j)^2}$ (7)

The $ k$-means solution is not a minimizer for this function. The facilities location method is an iterative technique that will in many cases minimize Equation 4.7. Figure 4.5 shows an example where the minimizer is not the centroid.

\includegraphics[width=3.0in]{GRAPHICS/kmeansVfc.eps} [$ k$-means versus facilities location solution for a small clustering problem.] The centroid is not a minimizer of the sum of distances problem. The facilities location solution on the right does minimize the sum of distances.

For the single facility problem, we have an iterative solution [78]. We choose an initial solution, $ (X^0,Y^0)$, at the centroid of the demand points, $ X^0 =
\frac{\sum x_j}{n}$ and $ Y^0 = \frac{\sum y_j}{n}$. We then use the following iterations until we converge to a solution [13].

$\displaystyle X^{k+1} = \frac{ \displaystyle{\sum_{j=1}^{n} \frac{x_j}{d_j(X^k,Y^k) }}} { \displaystyle{\sum_{j=1}^{n} \frac{1}{d_j(X^k,Y^k) }}}$ (8)

$\displaystyle Y^{k+1} = \frac{ \displaystyle{\sum_{j=1}^{n} \frac{y_j}{d_j(X^k,Y^k)} } }{ \displaystyle{\sum_{j=1}^{n} \frac{1}{d_j(X^k,Y^k)} } }$ (9)

where$\displaystyle  d_j(X,Y) = \sqrt{(X-x_j)^2 + (Y-y_j)^2}$ (10)


Multiple Facility Location

As we have stated earlier, we don't just wish to solve the case for a single cluster point, but for multiple cluster points. The single-facility problem can be expanded to fit this situation. Given a set of demand points $ (x_j, y_j), j = 1,...,n$; find a set of $ m$ facility locations,38 and allocate the demand points to these facilities minimizing the costs of supplying the various demand points. We must find the locations for each facility $ i$, $ (X_i, Y_i)$, and the demand points they serve $ (x_j,y_j), j \in J_i$, so that the set of demand points $ J = J_1
\bigcup J_2 \bigcup ... \bigcup J_m$ and $ J_l \bigcap J_k = \emptyset$ for every $ l \neq k$.

Our problem statement now becomes, find $ (X_i,Y_i) \in \mathbf{C}, (i =
1, ... , m)$ satisfying

$\displaystyle \operatorname*{min} C(x,y) = \sum_{i = 1}^{m} \sum_{j \in J_i} \sqrt{(X_i - x_j)^2 - (Y_i - y_j)^2}$ (11)

Our problem naturally falls in this configuration as we assign each $ \phi - \psi$ angle to a cluster point and then minimize the distance from that cluster point to its assigned $ \phi - \psi$ angles. A solution to this problem gives us a set of $ m$ cluster points for any set of $ n$ $ \phi - \psi$ angle pairs (typically $ n \gg m$).

We solve this problem using a common heuristic for multiple facilities location problems. We first randomly partition the set of $ n$ $ \phi - \psi$ angles into $ m$ subsets, $ {J_i}$ (see Figure 4.6). Next, we solve a single-facility problem (see Section 4.4.1) on each of the $ i$ subsets, which sets a cluster point optimally for that subset (see Figure 4.7). Note that during this step $ \phi - \psi$ angles may not be close to the cluster point they are assigned to. In fact, there is a good chance that several $ \phi - \psi$ angles will be closer to a cluster point they aren't assigned to. The process becomes iterative when we then re-assign the $ \phi - \psi$ angle pairs to their now nearest cluster point (see Figure 4.8). We then re-solve the optimal cluster point placement problem and continue similarly until some termination criteria is met (see Figure 4.9).

\includegraphics[width=2.0in]{GRAPHICS/cluster_init.xfig.eps} [Multi-Facility Location - step 1.] A set of points on the plane. We have randomly partitioned the points into two sets.

\includegraphics[width=2.0in]{GRAPHICS/cluster_next.xfig.eps} [Multi-Facility Location - step 2.] The single-facility solution (cluster point location is the square) is found for each set of points. Some points are closer to another point than the one they were assigned to.

\includegraphics[width=2.0in]{GRAPHICS/cluster_reset.xfig.eps} [Multi-Facility Location - step 3.] Points are reassigned to the cluster point nearest them.

\includegraphics[width=2.0in]{GRAPHICS/cluster_final.xfig.eps} [Multi-Facility Location - step 4.] The single-facility solution is re-solved, and the process is continued until the cluster point locations do not change and no points are reassigned.


Implementation

Implementation of this algorithm follows the above directions fairly closely. Our input is a set of $ \phi - \psi$ torsion angles extracted from some set of proteins. In our case, these inputs are for a particular amino acid. Therefore, we must cluster twenty different data sets in order to find appropriate points for each of the amino acids.

Once the torsion angles are read, we randomly assign them to one of our $ N$ cluster points. We place all of the cluster points at the origin initially, but we could just as easily compute the centroid. The single-cluster problems are solved, the torsion angles are reassigned and the process repeats until some termination criteria is met (see Algorithm 4.10).

We'll deal with some of the implementation particulars now. As stated earlier, we are solving this problem on a torus. While solutions on a sphere [22] are fairly common (the location of global corporate facilities or global military complexes), we have been unable to find other instances of solutions for the facility location problem on a torus. The main complicating factor is that we need to account for the cyclical nature of the angle space. In spherical problems, the authors often use the great circle distance, which is a slightly more complicated metric than euclidean distance, but for our case we decided to use the euclidean distance.

This choice means we need to take some care in computing distances and locating the cluster points. The distance between two angles can be no larger than $ \pi$ radians, so we must translate the torsion angles into the proper quadrants in order to find the distance between them and their cluster point. Once this is done for both $ \phi$ and $ \psi$ we can compute the standard euclidean distance. As we iterate and the cluster points change position on the torus, the translation performed on the $ \phi - \psi$ torsion angle may change as well.

As shown in the Algorithm 4.10, we begin with an initial number of cluster points. This number is typically set quite low. However, in many cases, the torsion angles are quite spread out and only two cluster points would not adequately describe the variety of conformations the amino acid backbone can take. The number of cluster points we finish with is determined by a tuneable parameter. We require a certain percentage of $ \phi - \psi$ torsion angles to be within some distance of a cluster point.39 If evalPlan determines the current clusterPlan is not sufficiently good, we will add another cluster point and re-solve the problem.

Regardless of the number of clusters, we need to find the best clusters given our data points. We have modified the original algorithm in another way. We aren't solely interested in minimizing the distance of the torsion angles to the cluster points. We also want cluster points that represent the diversity of torsion angles. It is of little use to have two cluster points very close to each other as this will create very similar proteins, and likewise it is of little use to have a cluster point with only a handful of torsion angles assigned to it. Ideally, the cluster points will be associated with a large number of torsion angles and they will be adequately spaced throughout the angle space. This adds two new parameters, the minimum distance between two cluster points and the minimum number of torsion angles associated with a cluster point. When our algorithm converges to a solution that violates either of these two parameters, one of the cluster points40 is randomly perturbed into a new position. This is represented in Algorithm 4.10 by perturbTest, which will check if we need to perturb the cluster points and also perform the perturbation.

Regardless of whether a cluster point has been perturbed, we will always call $ apportionPoints$ (see Figure 4.10). This subroutine finds the closest cluster point for each torsion angle, and if necessary, affiliates the torsion angle with a new cluster point. If none of the torsion angles change clusters, we have found a solution and can exit our loop. We then evaluate this clustering plan and determine whether we need to increase the number of cluster points we use.

Figure 4.10: Algorithm: program $ Cluster(dataPoints)$.
\begin{figure}{\bf program} \textit{Cluster (dataPoints)}
\begin{algorithmic}
\S...
...minQuality$}
\par\STATE $outputPlan(clusterPlan);$\end{algorithmic}\end{figure}


Incremental Addition of Cluster Points

On occasion, the initial cluster points used by HOPS will not sufficiently describe the search space, and we will want to expand our search space. However, we do not want to disregard the angle space that we have already searched. Throwing out the previously determined cluster points would require us to re-search the entire space as our previous solution would no longer be valid in the current space.

However, upgrading the current, previously determined $ N$ cluster points into a system with $ N + 1$ cluster points, which maintains the first $ N$ cluster points chosen, produces a set of different solutions than what we would get if we just solved the straightforward $ N + 1$ cluster problem. So while our upgraded solution is sub-optimal, it has the advantage of adding onto the search space rather than obliterating it and starting over.

To find the placement of the new cluster points, we have an algorithm similar to the one for finding the original cluster points. The new point is randomly placed in the set of torsion angles. The angles are reapportioned and solution of the single-facility location problem occurs for each of the cluster points. Once this is done, the original cluster points are snapped back to the their original locations. If no points need to be reapportioned then we are done, otherwise we continue to reapportion, solve the single-facility problems and snap back to our original solution. If no solution is found, we perturb the problem and start again.


Results

Results using the multiple facilities location technique are presented in Chapter 6. Regardless of the technique used once our cluster points are chosen, those angles are used to create configuration files for each protein and define the space through which HOPS will search for protein folds. See Chapter 3 for an overview of how HOPS does this.


next up previous
Next: Tweaking Algorithms for -Strand Up: TORSION ANGLE SELECTION AND Previous: Overview of HOPS
sforman@sju.edu