Diffusion Maps (Draft)#

A common problem that plagues high-dimensional data is that, as the dimensionality increases, the data become sparse relative to the data space; this is the so-called curse of dimensionality. However, within this data space, data points often still cluster around lower-dimensional structures, termed manifolds [Barter and Gross, 2019]. Diffusion mapping is a manifold learning method that characterizes such manifolds and provides a lower-dimensional representation of them [Coifman et al., 2005].

To characterize manifolds in high-dimensional data, diffusion maps use the notion of a diffusion process on a network in the data space. Diffusion maps treat data points as nodes in a network with weighted links defined by some measure of distance between them. Most importantly, nodes are connected to only a small number of nearest neighbors such that links represent local distances. In practice, this is achieved by calculating an appropriately chosen measure of distance between all data points and then discarding distances over some threshold. We are left with a set of shorter, trusted distances which are treated as weighted links between data points, forming a network.

Diffusion maps characterize the manifold by modeling a random walk of particles on the thresholded network. Rather than explicitly simulating a random walk, however, the structure of the network is explored using harmonic analysis [Coifman et al., 2005, Barter and Gross, 2019]. Specifically, the eigenvalues and eigenvectors of the network Laplacian are computed. The eigenvectors associated with the smallest non-zero eigenvalues encode the directions of largest variation in the data (they span the manifold) and define a new, lower-dimensional coordinate system in which the data points are embedded. This new coordinate system, or “diffusion map”, provides a lower-dimensional description of the data.

Diffusion Map Procedure#

Construction of the diffusion map is done via the following steps as described by Barter and Gross [2019]:

  1. Calculate all pairwise similarities between data points. Alternatively, calculate all pairwise distances between points and then convert to similarities.

  2. Threshold the similarity matrix, keeping some small number of nearest neighbors. The thresholded similarity matrix can be treated as an adjacency matrix which defines the structure of a network.

  3. Construct the Laplacian matrix from the thresholded similarity matrix.

  4. Calculate the eigenvalues and eigenvectors of the Laplacian.

Start with an \(x \times n\) data table \(\bf X\) where each row \(x_i\) is a sample and each of the \(n\) columns is a measured variable.

First, construct a matrix of pairwise similarities between all \(x_i\) samples. In principle, any similarity metric can be used. Alternatively, the pairwise distances between each \(x_i\) can be computed and then converted to similarities. In either case, the metric should be chosen appropriately based on the objects being compared.

One approach is to calculate the Euclidean distance between samples and then convert them to similarities. The Euclidean distance is defined as

\[ D_{ij} = \sqrt{\sum_n(x_{in} - x_{jn})^2} \]

where \(x_{in}\) is the \(n\)th variable of the \(i\)th sample.
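
As a minimal sketch of this step (assuming NumPy and a small synthetic data table; neither is prescribed by the text), the pairwise Euclidean distances can be computed by broadcasting:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data table: 50 samples (rows) by 5 measured variables (columns).
X = rng.normal(size=(50, 5))

# Pairwise Euclidean distances: D[i, j] = sqrt(sum_n (X[i, n] - X[j, n])**2).
diffs = X[:, None, :] - X[None, :, :]
D = np.sqrt((diffs**2).sum(axis=-1))
```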

The distance matrix \(\bf D\) is converted to the similarity matrix \(\bf S\) by

\[\begin{split} S_{ij} = \begin{cases} 0 & \text{if}\ i = j \\ \frac{1}{D_{ij}} & \text{if}\ i \neq j \end{cases} \end{split}\]
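
Continuing the sketch above, the distance matrix is inverted entrywise off the diagonal to obtain the similarity matrix (this assumes no two samples coincide, so that all off-diagonal distances are strictly positive):

```python
# Convert distances to similarities: S[i, j] = 1 / D[i, j] for i != j,
# and S[i, i] = 0 (no self-similarity). Assumes no duplicate samples.
S = np.zeros_like(D)
off_diag = ~np.eye(D.shape[0], dtype=bool)
S[off_diag] = 1.0 / D[off_diag]
```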

Second, threshold the similarity matrix. This is done by setting all but the \(y\) largest entries in each row to zero. This process can yield an asymmetric matrix, which can be resymmetrized by the operation

\[ S_{ij} \rightarrow \max(S_{ij}, S_{ji}) \]
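
A sketch of the thresholding step, continuing the code above. The value \(y = 10\) is an illustrative choice, not one fixed by the text:

```python
y = 10  # number of nearest neighbors kept per row (illustrative choice)

# Keep only the y largest similarities in each row, zeroing the rest.
S_thresh = np.zeros_like(S)
top = np.argsort(S, axis=1)[:, -y:]        # indices of the y largest entries per row
rows = np.arange(S.shape[0])[:, None]
S_thresh[rows, top] = S[rows, top]

# Resymmetrize: keep a link if either endpoint selected it.
S_thresh = np.maximum(S_thresh, S_thresh.T)
```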

Third, compute the row-normalized Laplacian from the thresholded similarity matrix

\[\begin{split}L_{ij}^{Rn} = \begin{cases} 1 & \text{if}\ i = j \\ -\frac{S_{ij}}{\sum_k S_{ik}} & \text{if}\ i \neq j \end{cases}\end{split}\]
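
Written in matrix form, this is \(\mathbf{L}^{Rn} = \mathbf{I} - \mathbf{P}\), where \(\mathbf{P}\) holds the transition probabilities of the random walk on the network. Continuing the NumPy sketch:

```python
# Row-normalized Laplacian: L = I - P, where P = S_thresh / row_sums is the
# transition matrix of a random walk on the thresholded network.
row_sums = S_thresh.sum(axis=1, keepdims=True)
L = np.eye(S_thresh.shape[0]) - S_thresh / row_sums
```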

Finally, calculate the eigenvalues and eigenvectors of the row-normalized Laplacian. The eigenvalue 0, whose eigenvector is constant, is trivial and discarded. The eigenvectors associated with the smallest non-zero eigenvalues encode the directions of largest variation in the data and define a new coordinate system, which is used to embed the data in a lower-dimensional space.
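
A sketch of this final step: the row normalization makes \(\mathbf{L}^{Rn}\) non-symmetric, so a general eigensolver is used (its eigenvalues are nevertheless real, since \(\mathbf{L}^{Rn}\) is similar to a symmetric matrix). The two-dimensional embedding below is an illustrative choice:

```python
# Eigendecomposition of the (non-symmetric) row-normalized Laplacian.
eigvals, eigvecs = np.linalg.eig(L)

# Sort eigenpairs by eigenvalue. The eigenvalue 0 (constant eigenvector)
# is trivial and discarded.
order = np.argsort(eigvals.real)
eigvals = eigvals.real[order]
eigvecs = eigvecs.real[:, order]

# Embed the samples with the eigenvectors of the smallest non-zero
# eigenvalues, here a two-dimensional diffusion map.
embedding = eigvecs[:, 1:3]
```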

Case Study#

See Analyzing Compositional Dissimilarity Using Diffusion Maps for the application of diffusion maps to analyze compositional dissimilarity between ecological communities.

References#

BG19

Edmund Barter and Thilo Gross. Manifold cities: social variables of urban areas in the UK. Proceedings of the Royal Society A, 475(2221):20180615, 2019.

CLL+05

Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005.

Authors#

Jordan A. Gault