Discretizable Distance Geometry
The Distance Geometry Problem (DGP) asks whether a simple weighted undirected graph G=(V,E,d) can be
realized in a K-dimensional space so that the distance between each pair of realized vertices u and v
is the same as the weight d(u,v) assigned to the edge {u,v}, when available. The distance function d
generally associates an interval [lb(u,v),ub(u,v)] to every edge {u,v} of E. The range
of the interval corresponds to the uncertainty on the actual value of the distance: when the interval is degenerate,
i.e. lb(u,v) = ub(u,v), then there can be two different situations: either the distance is exact (i.e. provided
with high precision), or the level of uncertainty on the distance is not a priori known, so that only an approximated
value is provided. Notice that the embedding space is Euclidean in several real-life applications. The DGP is NP-hard,
and in the last years, together with several other colleagues, we have been working on the different facets of this problem.
The sticks are the edges of the graph G representing the DGP instance. Their length corresponds to the weight on the
edges, which is the numerical value of the distance. In this example, there are 5 edges in G connecting 4 vertices in
5 different ways. Each vertex is marked in the drawing with a different color. In the hypothesis the sticks are rigid
(i.e., they are not elastic), and hence the corresponding distances are exact, then the vertex positions shown in the small
box give one of the possible solutions to the DGP on the Euclidean plane. Other possible solutions can be obtained by translating,
rotating or (partially, in some cases) reflecting this given solution. One way to perceive the complexity of
the DGP is to look at the set of sticks when disconnected from each other, and try to figure out the geometric figures
associated to the DGP solutions.
The drawings above show two of the mentioned applications. On the left-hand side, the conformation of the protein having
code
The discretization of the DGP can be performed when the graph G satisfies some particular assumptions.
These assumptions ensure that every vertex v of the graph has at least K adjacent predecessors, which can serve
as a reference for computing candidate positions for the vertex (where K is the dimension of the embedding space).
The identification of the feasible positions for a given vertex are computed by intersecting the K Euclidean
objects related to the K adjacent predecessors of the vertex. These Euclidean objects are points in dimension 1,
circles in dimension 2, spheres in dimension 3, and hyper-spheres in any other dimension. Under the discretization
assumptions, the intersection of these Euclidean objects always gives a discrete and finite number of feasible positions.
In this small example, the vertices labeled from 3 to 5 admit more than one possible position in the 2-dimensional space.
In the drawing, the circle defined by the reference vertex with lowest rank is explicitly shown, while only small marks,
indicating the intersection points, are given for the second (having higher rank) reference vertex.
In the example given above, the graph G, with the vertex ordering specified by the numerical labels, actually
satisfies an additional assumption: for every vertex v, the reference vertices are its two immediately preceding vertices
(for example, the reference vertices for 5 are 3 and 4). This assumption is known under the name of "consecutivity assumption".
However, this assumption is not strictly necessary to perform the discretization of the DGP. The example below shows in fact
another example where the consecutivity assumption is not satisfied: both vertices 4 and 5 have as reference vertices the vertex 2
and the vertex 3.
In the figure above, the circles and the marks are drawn by following the same criteria indicated for the previous drawing. This example basically shows that, when the consecutivity assumption is not satisfied, the search domain is not necessarily a binary tree (even if it can be approximated by a binary tree, as it was previously done in many publications), and the reference vertices for different vertices can form a common symmetry axis. In this example, it is also supposed, for simplicity, that the distance between the vertex 2 and 4 coincide with the distance between 2 and 5 (the positions for both 4 and 5 lie on a common circle). paradoxical DGP and optic computing
As remarked above, the pruning distances allow us to select the branches of the search tree that contain feasible
DGP solutions. When the distances are exact and their numerical value is very precise, the presence of a pruning distance from
the vertex u to the vertex v, in this order in the search tree, makes it possible to reduce the total number of
vertex positions computed for v to the same number of positions found for u. This is a very important property of
DDGPs, because it allows us to estimate in advance the number of solutions that we can expect to find in the search tree. Uncertainty on the value of the distances
In order to deal with real-life data, where distances are generally imprecise and often represented
by intervals having relatively large ranges, it is required that at least K-1 reference distances can be considered as
"exact" (meaning by "exact" that lb(u,v) = ub(u,v) and the approximated distance value is not affected by an error that
is higher than the predefined tolerance error). As a consequence, the Euclidean objects involved in the intersections (performed to
determine the feasible points for a given vertex v) are not always (hyper-)spheres, but rather (hyper-)spherical shells may
actually be involved in the intersection in presence of an interval distance. Recent works have been focusing on the dynamical character of some DGP applications. To represent a dynamical DGP instance, we use a graph G having a special vertex set V × T, where V is a set of objects (the static vertices of the classical problem), while T represents the time as a sequence of discrete steps. The dynamical DGP (dynDGP) was introduced in the talk reported below in 2017.
Some animations obtained via dynamical DGPs can be viewed on my
dynDGP animations page. MD-jeep MD-jeep is an implementation of the BP algorithm for DDGPs, which is available on GitHub since the release 0.3.0, where the software tool was extended for solving instances containing interval distances (see section above about the coarse-grained representation). Differently from other software tools for the DGP, MD-jeep is able to enumerate the entire solution set for a given instance. This is trivially done with instances consisting of only exact distances by performing the entire exploration of the search tree (which is in practice possible when a sufficient number of pruning distances is available). In order to do the same for instances containing interval distances, the resolution parameter was recently introduced in MD-jeep. This parameter allows us to select solutions from a (local) continuous set of feasible solutions. Meta-heuristics
Meta-heuristics are widely employed in optimization. My first experience with meta-heuristics dates back to my PhD
in Italy, where I was working on a geometric model for the simulation of protein conformations. This model leads to the definition
of a global optimization problem, whose solutions represent geometrically "well-shaped" conformations for proteins, that we were
used to obtain by executing a simple Simulated Annealing (SA). The permutations implemented in our SA implementations were
particularly inspired by the geometric nature of the considered problem.
A monkey can in fact survive in a forest because it is able to focus its food researches where it has already found food.
The trees the monkey explores contain solutions to the optimization problem to be solved, and branches of the tree represent
perturbations able to transform one solution into another. At each step, the monkey chooses the branches to climb by applying
a random mechanism, which is based on the objective function values of the new corresponding solutions. Notice that, rather than
a completely new meta-heuristic approach for global optimization, the Monkey Search can be considered as a general
framework describing the monkey behavior, where other ideas and strategies for global optimization are also employed.
Data mining consists in extracting previously unknown, potentially useful and reliable patterns from a given set of data.
In collaboration with Panos Pardalos and Petraq Papajorgji, we wrote a
textbook
describing the most common data mining techniques, by paying a particular attention to those used in the agricultural field.
One interesting example is the problem of predicting, by using the data obtained from the very first stages of the fermentation,
the wine fermentations that are likely to be problematic, in order for the enologist to take actions in advance to avoid
spoiling part of the production. Why all this interest in proteins?
In the different problems introduced above, the main considered application (at least in my publications) is the one
of analyzing or identifying the three-dimensional conformation of protein conformations. Why all this interest? Finding the ground state conformations of clusters of molecules is one of the most difficult global optimization problems. Such clusters are regulated by potential energy functions, such as the Lennard Jones and the Morse potential energy (also involved in the identification of protein conformations, see above). The Lennard Jones potential is very simple but sufficiently accurate for noble gas, and it also provides approximations for structures of nickel and gold. The Morse potential has a parameter which determines the width of the potential well; Morse clusters show a wide variety of structural behaviors in function of this parameter. Both potential functions have several local minima in which methods for optimization may get stuck at. For instance, a Lennard Jones cluster formed by 13 atoms has about 1000 local minima, and the number of local minima increases exponentially with the size of the cluster. The Lennard Jones and Morse optimal clusters usually have particular geometric motifs, which can be exploited during the optimization process. The Cambridge Cluster Database contains experimentally obtained clusters. Some of them may just be local minima: the database is open to newly generated clusters having a lower energy function value. DrawingsThe drawings above that I made with pencil and drawing compass are not meant to be precise but they intend only to visually show the main ideas behind the discretization of DGPs. I had this idea to create such drawings when I was helping some time ago one of my kids with his homework ...
|