Thesis (Index) <- Sean Forman <- You Are Here
Clustering problems come in several different variations. We might wish to partition a set of inputs into disjoint subsets based on some relationship, find a set of representative points from the overall set, or match the distribution of the points in our space to a group of continuous functions [28]. Everitt further defines clustering problems as
Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes. Both the number of classes and the properties of the classes are to be determined. [28]
Everitt lets the reader define just what ``useful'' means. Intuitively, we are often able to observe a set of data and make some decisions as to what is and is not an appropriate way to organize it. Looking at the Ramachandran plots given in Appendix C, one can quickly pick out three or four densely populated areas, but selecting eight representative points for Gly, for instance, could prove to be more difficult. This problem can be difficult both to define formally and to solve in a mathematical sense.
To give us some guidance, Späth [71] defines four vital choices to make when finding clusters:
Taking these guidelines into account, we describe our clustering
problem. Given a set of experimentally known
angle pairs and
a desired number of cluster points,
, select
representative
angle pairs to represent the patterns found in the set of all
angle pairs. In addition, we should be able to incrementally
add new points to the set of
found in previous steps.
Clustering was studied extensively in the 1970's and 1980's and, to some degree, was left alone for many years. Recently a great deal of work has begun on a variety of problems that relate to clustering. These problems come from a range of disciplines: mining of genomic databases, sequence homology techniques for protein folding, and search engine design for the world wide web, among others. As the size and type of data sets being studied has grown, so have the range of techniques used to solve them (see Fasulo [29] for a summary of recent work). We will focus on more common clustering techniques of which there are two primary families: hierarchical techniques and those based on optimization algorithms.
Our problem is smaller in comparison to some of the new problems being
solved, but has its own unique challenges. We are clustering on only
two numeric variables (
), we will likely want only a small
number of clusters (less than ten), and our distance metric is
standard euclidean distance. We have other constraints which are more
difficult to define as well. Since we are choosing cluster points to
summarize the information in a larger set, we would like the cluster
points to represent a significant number of real points and also would
like those points to be a reasonable distance apart from each other.
Hierarchical techniques are also sometimes known as agglomerative or divisive techniques [28]. These techniques do not begin with a number of clusters in mind.32 Rather, they begin one of two ways, either the algorithm is initialized with all the points as members of a single cluster (divisive) or all the points as clusters unto themselves (agglomerative).
In the divisive algorithms, we begin with a single cluster and then divide the cluster into finer and finer gradations adding one new cluster in every step,33 until we terminate with every point as a separate cluster. So divisive algorithms end in the same state as agglomerative algorithms begin. And similarly, the agglomerative algorithms combine two clusters into one at each iteration until only a single cluster remains.
This construction technique creates a dendrogram which allows us to
easily construct systems with any number of clusters from 1 to
.
Much of the art in using hierarchical techniques comes in correctly
identifying the appropriate number of clusters to accurately
distinguish between points that are and are not alike.
The addition and subtraction of clusters can be determined by a number of different methods. In a divisive algorithm, we might partition the cluster so that the resulting two new clusters have the maximum minimum distance between themselves, or we could chose a separation so that the resulting clusters have the furthest possible distance between their centroids, or we could choose a number of different techniques [28].
One characteristic of these techniques is that they tend to form
elongated (rather than circular) clusters. Points that are close
together, but in a straight line will be more likely to be clustered
together than a circular set of points where the points are slightly
further apart than those in the line. For our clustering problem,
this is a serious concern as we would prefer to use spherical clusters
which imply that their representative
pair is within some
minimum distance of the point representing it. This suggests that an
optimization technique is likely to yield superior results.
The other major class of techniques uses optimization. In this class
of solutions, we begin with a set of points and search for a partition
which will minimize some objective function. The most common
objective function (where
is the number of clusters and
is the
number of data points) is
which attempts to minimize the square of the distance from
every point to its nearest centroid. The point,
, is part of a
partition,
, and
is the centroid of all the
points in
Unfortunately, the number of partitions for even a small problem (fifteen points, three partitions) can number in the millions or billions and the number of partitions rises dramatically as we add points.34Therefore, an enumerative technique is certain to fail, so most of the common techniques feature some sort of hill climbing.
While there are numerous techniques that have been tried, we will
focus on the most common,
-means,35 and the one we settled upon, facilities
location (see Section 4.4).
The
-means algorithm attempts to find a solution for
equation 4.1 using an iterative process to improve the
objective at each step. First, the points are randomly partitioned
into
clusters. For each of these clusters, a centroid is found.
Next, we step through each of the points and check to see if there is
a centroid closer to it than the one it is assigned to. If so, we
reassign the point and re-compute the affected centroids. This
continues until none of the points are reassigned. The facilities
location algorithm works similarly, and we will introduce it later.
We will expand on the introduction to torsion angles that was given in
Section 2.1.2. The selection of appropriate
torsion angle pairs is integral to our solution of the protein
folding problem. We select a discrete set of
angle pairs, and
HOPS searches exhaustively through the space defined by this
set36 for the best solution as determined by
some energy function. Our basic assumption is that a finite number of
torsion angles can represent a protein's continuous range of torsion
angles. Therefore, it is imperative that we choose
angle
pairs that accurately reflect real-life torsion angles.
As we can see in the Ramachandran plots for each amino acid (see
Appendix C), the
angle pairs
form distinct patterns when graphed on a scatter plot. Since the
points are not uniformly distributed over the angle space, the angle
pairs that we choose should not be uniform either. As mentioned
earlier, this angle space is cyclical in nature (see
Figure 4.1) and can also be visualized as a
torus. This is easy to see as the left and right side of the
Ramachandran plot are connected and the top and bottom of the
Ramachandran plot are connected. This also means that all four
corners are connected into a single point, so while some plots may
appear to have four distinct densely populated areas in the four
corners, this is, in fact, just a single area.37While this peculiarity does not cause any serious difficulty, it
requires some care when implementing a function measuring distance
between two points in the plot.
Our data points are 2-D vectors, so we can use any of a variety of
metrics to measure the distance between two points,
and
.
Unless otherwise noted, we will be using (4.3) for all of our distance calculations.
We have attempted several different techniques for selecting cluster points from the Ramachandran plots. One hierarchical technique we tested builds a minimum spanning tree [80] (MST) from all the points in our data set. MST's are easy to understand and the idea of a cluster transfers easily to this technique. To build a minimum spanning tree, we need to use some graph theory [23]. Every point in the Ramachandran plot is treated as a graph vertex and weighted edges then connect every vertex with every other vertex. These edges are weighted by the distance between the points (see Figure 4.2).
A spanning tree is a set of edges in a graph for which there is a
single path from every vertex to every other vertex. A minimum
spanning tree on a weighted graph, selects the spanning tree with the
smallest possible total edge weights (see
Figure 4.3). Finding the minimum spanning tree
sounds more difficult than it is. Kruskal's algorithm is a polynomial
time algorithm for selecting a graph's minimum spanning tree. Once we
have computed the minimum spanning tree for our graph of torsion angle
pairs, we then convert this graph into two clusters by removing the
longest edge. Since there are no cycles in a minimum spanning tree
this will create two disjoint sets (see Figure 4.3).
We can then find cluster points by finding the centroids of the two
parts, by computing the median of the two smaller trees, or by some
other method. We can continue finding up to
cluster points by
continuing to remove edges.
This technique is a divisive technique and can to lead to elongated clusters. With a very large data set, it leads to unsatisfactory answers, so we dismissed it fairly early in our research.
[MST - full graph.] Four torsion angle pairs (A, B, C, D)
with their distances represented as edge weights (
[MST - initial graph.] The minimum spanning tree for the graph in
Figure 4.2.
One other method we have implemented attempts to minimize the maximum
distance of any
angle pair to a cluster point when given a
number of clusters. This is essentially finding a set of cluster
points,
, solving the following:
The solution presented here treats our problem as a facilities
location problem. The
angles represent our customers and the
cluster points represent the facilities assigned to serve them. As we
will see, this solution provides reasonable results and also can be
implemented in such a way that incremental addition of new cluster
points is possible. In his book Cluster Analysis
Algorithms [71], Späth describes
using this method to find cluster points on the plane. We will modify
this algorithm to fit our needs here.
A specific type of facilities location problems called Fermat-Weber Location problems [78] involves a set of demand points that are serviced by a set of facilities. The locations of the demand points are fixed, and the facilities are located to minimize the sum of weighted distances between the demands and the facility that services them (usually the closest one). In the typical representation of the problem, the demand points have varying levels of demand, and there is also no limit on the proximity of the facilities to each other. For instance, Wal-Mart has a set of stores in Iowa that need to be maintained. What is the ideal location for their warehouse given there are more stores in the eastern part of the state?
We will introduce the multi-facility problem after first discussing the single facility problem. The Weber single facility location problem is defined as
where
are the location of our demand points (
),
is the cost per unit shipped per unit
distance, and
is the amount of product to be shipped to demand
point
from our facility. For our problem, we define
, so it reduces to just a minimization of the sum of the
distances.
The
-means solution is not a minimizer for this function. The
facilities location method is an iterative technique that will in many
cases minimize Equation 4.7. Figure 4.5
shows an example where the minimizer is not the centroid.
[
For the single facility problem, we have an iterative
solution [78]. We choose an initial
solution,
, at the centroid of the demand points,
and
. We then use the
following iterations until we converge to a
solution [13].
As we have stated earlier, we don't just wish to solve the case for a
single cluster point, but for multiple cluster points. The
single-facility problem can be expanded to fit this situation. Given
a set of demand points
; find a set of
facility locations,38 and allocate the
demand points to these facilities minimizing the costs of supplying
the various demand points. We must find the locations for each
facility
,
, and the demand points they serve
, so that the set of demand points
and
for every
.
Our problem statement now becomes, find
satisfying
Our problem naturally falls in this configuration as we assign each
angle to a cluster point and then minimize the distance from
that cluster point to its assigned
angles. A solution to this
problem gives us a set of
cluster points for any set of
angle pairs (typically
).
We solve this problem using a common heuristic for multiple facilities
location problems. We first randomly partition the set of
angles into
subsets,
(see
Figure 4.6). Next, we solve a single-facility
problem (see Section 4.4.1) on each of the
subsets, which sets a cluster point optimally for that subset (see
Figure 4.7). Note that during this step
angles may not be close to the cluster point they are assigned
to. In fact, there is a good chance that several
angles will
be closer to a cluster point they aren't assigned to. The process
becomes iterative when we then re-assign the
angle pairs to
their now nearest cluster point (see Figure 4.8).
We then re-solve the optimal cluster point placement problem and
continue similarly until some termination criteria is met (see
Figure 4.9).
[Multi-Facility Location - step 1.] A set of points on the
plane. We have randomly partitioned the points into two sets.
[Multi-Facility Location - step 2.] The single-facility
solution (cluster point location is the square) is found for each set
of points. Some points are closer to another point than the one they
were assigned to.
[Multi-Facility Location - step 4.] The single-facility
solution is re-solved, and the process is continued until the cluster
point locations do not change and no points are reassigned.
Implementation of this algorithm follows the above directions fairly
closely. Our input is a set of
torsion angles extracted from
some set of proteins. In our case, these inputs are for a particular
amino acid. Therefore, we must cluster twenty different data sets in
order to find appropriate points for each of the amino acids.
Once the torsion angles are read, we randomly assign them to one of
our
cluster points. We place all of the cluster points at the
origin initially, but we could just as easily compute the centroid.
The single-cluster problems are solved, the torsion angles are
reassigned and the process repeats until some termination criteria is
met (see Algorithm 4.10).
We'll deal with some of the implementation particulars now. As stated earlier, we are solving this problem on a torus. While solutions on a sphere [22] are fairly common (the location of global corporate facilities or global military complexes), we have been unable to find other instances of solutions for the facility location problem on a torus. The main complicating factor is that we need to account for the cyclical nature of the angle space. In spherical problems, the authors often use the great circle distance, which is a slightly more complicated metric than euclidean distance, but for our case we decided to use the euclidean distance.
This choice means we need to take some care in computing distances and
locating the cluster points. The distance between two angles can be
no larger than
radians, so we must translate the torsion angles
into the proper quadrants in order to find the distance between them
and their cluster point. Once this is done for both
and
we can compute the standard euclidean distance. As we iterate and the
cluster points change position on the torus, the translation
performed on the
torsion angle may change as well.
As shown in the Algorithm 4.10, we begin
with an initial number of cluster points. This number is typically
set quite low. However, in many cases, the torsion angles are quite
spread out and only two cluster points would not adequately describe
the variety of conformations the amino acid backbone can take. The
number of cluster points we finish with is determined by a tuneable
parameter. We require a certain percentage of
torsion angles
to be within some distance of a cluster point.39 If evalPlan determines
the current clusterPlan is not sufficiently good, we will add
another cluster point and re-solve the problem.
Regardless of the number of clusters, we need to find the best clusters given our data points. We have modified the original algorithm in another way. We aren't solely interested in minimizing the distance of the torsion angles to the cluster points. We also want cluster points that represent the diversity of torsion angles. It is of little use to have two cluster points very close to each other as this will create very similar proteins, and likewise it is of little use to have a cluster point with only a handful of torsion angles assigned to it. Ideally, the cluster points will be associated with a large number of torsion angles and they will be adequately spaced throughout the angle space. This adds two new parameters, the minimum distance between two cluster points and the minimum number of torsion angles associated with a cluster point. When our algorithm converges to a solution that violates either of these two parameters, one of the cluster points40 is randomly perturbed into a new position. This is represented in Algorithm 4.10 by perturbTest, which will check if we need to perturb the cluster points and also perform the perturbation.
Regardless of whether a cluster point has been perturbed, we will
always call
(see
Figure 4.10). This subroutine finds
the closest cluster point for each torsion angle, and if necessary,
affiliates the torsion angle with a new cluster point. If none of the
torsion angles change clusters, we have found a solution and can exit
our loop. We then evaluate this clustering plan and determine whether
we need to increase the number of cluster points we use.
On occasion, the initial cluster points used by HOPS will not sufficiently describe the search space, and we will want to expand our search space. However, we do not want to disregard the angle space that we have already searched. Throwing out the previously determined cluster points would require us to re-search the entire space as our previous solution would no longer be valid in the current space.
However, upgrading the current, previously determined
cluster
points into a system with
cluster points, which maintains the
first
cluster points chosen, produces a set of different solutions
than what we would get if we just solved the straightforward
cluster problem. So while our upgraded solution is sub-optimal, it
has the advantage of adding onto the search space rather than
obliterating it and starting over.
To find the placement of the new cluster points, we have an algorithm similar to the one for finding the original cluster points. The new point is randomly placed in the set of torsion angles. The angles are reapportioned and solution of the single-facility location problem occurs for each of the cluster points. Once this is done, the original cluster points are snapped back to the their original locations. If no points need to be reapportioned then we are done, otherwise we continue to reapportion, solve the single-facility problems and snap back to our original solution. If no solution is found, we perturb the problem and start again.
Results using the multiple facilities location technique are presented in Chapter 6. Regardless of the technique used once our cluster points are chosen, those angles are used to create configuration files for each protein and define the space through which HOPS will search for protein folds. See Chapter 3 for an overview of how HOPS does this.