Glossary

coupling

A coupling of two probability measures \(\mu\) and \(\nu\) is a probability measure on the product space of their respective supports. When the coupling is balanced, the first and second marginals of that probability measure must coincide with \(\mu\) and \(\nu\) respectively. Equivalently, given two non-negative vectors \(a\in\mathbb{R}^n\) and \(b\in\mathbb{R}^m\), a coupling in matrix form is a non-negative matrix \(P\) of size \(n\times m\). When the coupling is balanced, \(P\) lies in their transportation polytope \(U(a,b)\).
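
A minimal sanity check in plain NumPy (toy weight vectors, chosen for illustration): the independent coupling \(ab^T\) is always a balanced coupling, so its marginals must match \(a\) and \(b\).

```python
import numpy as np

a = np.array([0.5, 0.3, 0.2])           # n = 3
b = np.array([0.25, 0.25, 0.25, 0.25])  # m = 4
P = np.outer(a, b)                      # independent coupling, always in U(a, b)
assert np.allclose(P.sum(axis=1), a)    # first marginal: P 1_m = a
assert np.allclose(P.sum(axis=0), b)    # second marginal: P^T 1_n = b
```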

dual Kantorovich potentials

Real-valued functions or vectors that solve the dual Kantorovich problem.

dual Kantorovich problem

Dual formulation of the Kantorovich problem: given a ground cost matrix \(C\) of size [n, m] and two probability vectors \(a \in\mathbb{R}^n, b\in\mathbb{R}^m\), it seeks two vectors \(f, g\) (the dual Kantorovich potentials) that belong to the dual transportation polyhedron \(D(C)\) and maximize:

\[\max_{f,g \,\in D(C)} \langle f,a \rangle + \langle g,b \rangle.\]

This problem admits a continuous formulation between two probability distributions \(\mu,\nu\):

\[\max_{f\oplus g\leq c} \int f d\mu + \int g d\nu,\]

where \(f,g\) are in that case real-valued functions on the supports of \(\mu,\nu\) and \(f\oplus g\leq c\) means that for any pair \(x,y\) in the respective supports, one has \(f(x)+g(y)\leq c(x,y)\).
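
The following sketch illustrates weak duality on toy data (all values are illustrative): any feasible pair \(f, g\) in \(D(C)\) yields an objective that lower-bounds \(\langle P,C \rangle\) for any coupling \(P\) in \(U(a,b)\).

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.uniform(size=(3, 4))
a = np.full(3, 1 / 3)
b = np.full(4, 1 / 4)
f = C.min(axis=1)                    # f_i = min_j C_ij is feasible with g = 0
g = np.zeros(4)
assert np.all(f[:, None] + g[None, :] <= C)       # (f, g) lies in D(C)
P = np.outer(a, b)                                # any element of U(a, b)
assert f @ a + g @ b <= (P * C).sum() + 1e-12     # weak duality
```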

dual transportation polyhedron

Given an \(n\times m\) cost matrix \(C\), denotes the set of pairs of vectors

\[D(C):= \{(f, g) \in\mathbb{R}^n\times\mathbb{R}^m \,|\, \forall i,j,\; f_i + g_j \leq C_{ij}\}.\]

dualizing

Within the context of optimization, the process of transforming a constrained optimization problem into an unconstrained one, by turning constraints into penalty terms in the objective function.

entropy-regularized optimal transport

The data of the entropy-regularized OT (EOT) problem consists of a cost matrix \(C\) of size [n, m] and two vectors \(a,b\) of non-negative weights of respective sizes n and m. The parameters of the EOT problem consist of three numbers \(\varepsilon, \tau_a, \tau_b>0\).

The optimization variables are a pair of vectors of sizes n and m denoted as \(f\) and \(g\), akin to dual Kantorovich potentials but not constrained to belong to the dual transportation polyhedron.

Introducing \(\rho_a\) and \(\rho_b\), defined implicitly through \(\tau_a=\rho_a /(\varepsilon + \rho_a)\) and \(\tau_b=\rho_b /(\varepsilon + \rho_b)\), the EOT optimization problem reads:

\[\max_{f, g} - \langle a, \phi_a^{*}(-f) \rangle - \langle b, \phi_b^{*}(-g) \rangle - \varepsilon \left(\langle e^{f/\varepsilon}, e^{-C/\varepsilon} e^{g/\varepsilon} \rangle -|a||b|\right)\]

where \(\phi_a(z) = \rho_a z(\log z - 1)\) is a scaled entropy, and \(\phi_a^{*}(z) = \rho_a e^{z/\rho_a}\) its Legendre transform [Séjourné et al., 2019].

That problem can also be written using positive scaling vectors \(u, v\) of sizes n, m, and the kernel matrix \(K := e^{-C/\varepsilon}\), as

\[\max_{u, v >0} - \langle a,\phi_a^{*}(-\varepsilon\log u) \rangle - \langle b, \phi_b^{*}(-\varepsilon\log v) \rangle - \varepsilon\langle u, K v \rangle.\]

Both of these problems admit a primal formulation, which solves the unbalanced optimal transport problem with a variable matrix \(P\) of size [n, m] with positive entries:

\[\min_{P>0} \langle P,C \rangle +\varepsilon \text{KL}(P | ab^T) + \rho_a \text{KL}(P\mathbf{1}_m | a) + \rho_b \text{KL}(P^T \mathbf{1}_n | b)\]

where \(\text{KL}\) is the generalized Kullback-Leibler divergence.

The very same primal problem can also be written using the kernel \(K\) instead of the cost \(C\):

\[\min_{P>0}\, \varepsilon \text{KL}(P|K) + \rho_a \text{KL}(P\mathbf{1}_m | a) + \rho_b \text{KL}(P^T \mathbf{1}_n | b)\]

The original OT problem taught in linear programming courses is recovered by using the formulation above relying on the cost \(C\), and letting \(\varepsilon \rightarrow 0\) and \(\rho_a, \rho_b \rightarrow \infty\). In that case the entropy term disappears, whereas the \(\text{KL}\) penalties above become constraints on the marginals of \(P\): this results in a standard minimum-cost flow problem, also called the Kantorovich problem.

The balanced regularized OT problem is recovered for finite \(\varepsilon > 0\) but letting \(\rho_a, \rho_b \rightarrow \infty\). This problem can be shown to be equivalent to a matrix scaling problem, which can be solved using the Sinkhorn algorithm. To handle the case \(\rho_a, \rho_b \rightarrow \infty\), the Sinkhorn function uses parameters tau_a and tau_b equal respectively to \(\rho_a /(\varepsilon + \rho_a)\) and \(\rho_b / (\varepsilon + \rho_b)\) instead. Setting either of these parameters to 1 corresponds to setting the corresponding \(\rho_a, \rho_b\) to \(\infty\).
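
As a concrete reference, here is a sketch of the primal objective above in plain NumPy; gen_kl and primal_objective are hypothetical helper names, not library functions.

```python
import numpy as np

def gen_kl(p, q):
    # generalized Kullback-Leibler: sum_i p_i log(p_i / q_i) - p_i + q_i
    return np.sum(p * np.log(p / q) - p + q)

def primal_objective(P, C, a, b, eps, rho_a, rho_b):
    # <P, C> + eps KL(P | a b^T) + rho_a KL(P 1_m | a) + rho_b KL(P^T 1_n | b)
    return ((P * C).sum()
            + eps * gen_kl(P, np.outer(a, b))
            + rho_a * gen_kl(P.sum(axis=1), a)
            + rho_b * gen_kl(P.sum(axis=0), b))

a, b = np.full(3, 1 / 3), np.full(4, 1 / 4)
C = np.ones((3, 4))
# at P = a b^T all KL terms vanish, leaving <P, C> = 1
val = primal_objective(np.outer(a, b), C, a, b, eps=0.1, rho_a=1.0, rho_b=1.0)
```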

envelope theorem

The envelope theorem or Danskin's theorem is a major result about the differentiability properties of the value function of a parameterized optimization problem. Namely, for a function \(h\) defined implicitly as an optimal objective parameterized by a vector \(x\),

\[h(x):=\min_z s(x,z), \quad z^\star(x):=\arg\min_z s(x,z),\]

one has

\[\nabla h(x)=\nabla_1 s(x,z^\star(x))\]

stating in effect that the optimal \(z^\star(x)\) does not need to be differentiated w.r.t. \(x\) when computing the gradient of \(h\). Note that this result is not valid for higher-order differentiation.
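
A small JAX sketch, using a toy objective \(s(x,z)\) with a closed-form minimizer (all names and values are illustrative): the gradient of \(h\) coincides with the partial gradient of \(s\) in \(x\), evaluated at \(z^\star(x)\) held fixed.

```python
import jax
import jax.numpy as jnp

def s(x, z):
    # toy objective with minimizer z*(x) = x / 2 and value h(x) = ||x||^2 / 4
    return 0.5 * jnp.sum((z - x) ** 2) + 0.5 * jnp.sum(z ** 2)

def h(x):
    z_star = x / 2.0                  # closed-form argmin of s(x, .)
    return s(x, z_star)

x = jnp.array([1.0, -2.0])
g1 = jax.grad(h)(x)                   # gradient of the value function
# partial gradient of s in x, with z*(x) frozen via stop_gradient
g2 = jax.grad(s, argnums=0)(x, jax.lax.stop_gradient(x / 2.0))
assert jnp.allclose(g1, g2)           # both equal x / 2
```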

ground cost

A real-valued function of two variables, \(c(x,y)\) that describes the cost needed to displace a point \(x\) in a source measure to \(y\) in a target measure. Can also refer to a matrix \(C\) of evaluations of \(c\) on various pairs of points, \(C=[c(x_i, y_j)]_{ij}\).
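
For instance, with the squared Euclidean cost \(c(x,y)=\|x-y\|^2\), the matrix \(C\) can be assembled from two toy point clouds as follows (a plain NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))   # n = 5 source points in R^2
y = rng.normal(size=(7, 2))   # m = 7 target points in R^2
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # C_ij = c(x_i, y_j)
assert C.shape == (5, 7)
```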

Hungarian algorithm

Combinatorial algorithm proposed by Harold Kuhn to solve the optimal matching problem. See the Wikipedia definition.
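
SciPy ships an off-the-shelf solver for the same assignment problem (its implementation is a Jonker-Volgenant variant rather than Kuhn's original algorithm); a minimal run on a toy cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
C = rng.uniform(size=(4, 4))
rows, cols = linear_sum_assignment(C)   # cols encodes the optimal permutation
total_cost = C[rows, cols].sum()        # optimal matching objective
```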

implicit differentiation

Formula used to compute the vector-Jacobian product of the minimizer of an optimization procedure. It leverages the fact that small variations in the inputs of the optimization problem result in minimizers that still satisfy optimality conditions (KKT or first-order conditions); differentiating these identities recovers the vector-Jacobian operator by inverting a linear system.
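
A small JAX sketch on a toy objective with a closed-form minimizer (illustrative names and values): differentiating the first-order condition \(\nabla_z s(x, z^\star(x))=0\) yields the Jacobian of \(z^\star\) through a linear solve, with no unrolled iterations.

```python
import jax
import jax.numpy as jnp

def grad_z(x, z):
    # grad_z of the toy objective s(x, z) = ||z - x||^2 / 2 + ||z||^2 / 2
    return (z - x) + z                 # = 2 z - x, zero at z*(x) = x / 2

x = jnp.array([1.0, -2.0])
z_star = x / 2.0
H_zz = jax.jacobian(grad_z, argnums=1)(x, z_star)   # = 2 I
H_zx = jax.jacobian(grad_z, argnums=0)(x, z_star)   # = -I
dz_dx = -jnp.linalg.solve(H_zz, H_zx)               # implicit function theorem
assert jnp.allclose(dz_dx, 0.5 * jnp.eye(2))        # dz*/dx = I / 2
```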

input-convex neural networks

A neural network architecture for vector inputs with a few distinguishing features: some of its parameters must be non-negative, and its output is real-valued and guaranteed to be convex in the input vector.
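
A minimal sketch of the standard layer construction behind such networks (an illustration, not any specific library's API): non-negative weights on the hidden state, combined with a convex non-decreasing activation, preserve convexity in the input.

```python
import jax
import jax.numpy as jnp

def icnn(params, x):
    z = jax.nn.relu(params[0]["Wx"] @ x + params[0]["b"])
    for layer in params[1:]:
        Wz = jax.nn.softplus(layer["Wz_raw"])   # reparameterized, hence >= 0
        z = jax.nn.relu(Wz @ z + layer["Wx"] @ x + layer["b"])
    return jnp.sum(z)                           # real-valued, convex in x

key = jax.random.PRNGKey(0)
d, h = 2, 8
params = [
    {"Wx": jax.random.normal(key, (h, d)), "b": jnp.zeros(h)},
    {"Wz_raw": jax.random.normal(key, (1, h)),
     "Wx": jax.random.normal(key, (1, d)), "b": jnp.zeros(1)},
]
value = icnn(params, jnp.ones(d))
```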

Kantorovich problem

Linear program that is the original formulation of optimal transport between two point clouds, seeking an optimal coupling matrix \(P\). The problem is parameterized by a ground cost matrix \(C\) of size [n, m] and two probability vectors \(a,b\) of non-negative weights of respective sizes n and m, each summing to \(1\). The coupling must belong to the transportation polytope \(U(a,b)\) and minimize the objective

\[\min_{P \in U(a,b)} \langle P,C \rangle = \sum_{ij} P_{ij} C_{ij}.\]

This linear program can be seen as the primal problem of the dual Kantorovich problem. Alternatively, this problem admits a continuous formulation between two probability distributions \(\mu,\nu\):

\[\min_{\pi \in \Pi(\mu,\nu)} \iint c \,d\pi,\]

where \(\pi\) is a coupling with first marginal \(\mu\) and second marginal \(\nu\).
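
The discrete problem can be solved directly as a linear program; the sketch below uses SciPy on toy data, flattening \(P\) into a vector of size n m.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 3, 4
C = rng.uniform(size=(n, m))
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
# equality constraints encode P 1_m = a (row sums) and P^T 1_n = b (col sums)
A_eq = np.vstack([np.kron(np.eye(n), np.ones(m)),
                  np.kron(np.ones(n), np.eye(m))])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
P_star = res.x.reshape(n, m)          # an optimal coupling
```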

low-rank optimal transport

Variant of the Kantorovich problem whereby the search for an optimal coupling matrix \(P\) is restricted to lie in a subset of matrices of low rank. Effectively, this is parameterized by replacing \(P\) with a low-rank factorization

\[P = Q \text{diag}(g) R^T,\]

where \(Q,R\) are coupling matrices of sizes [n, r] and [m, r] and \(g\) is a vector of size [r,]. To be effective, one implicitly assumes that the rank \(r\ll n,m\). To solve this in practice, the Kantorovich problem is modified to only seek solutions with this factorization, and updates on \(Q,R,g\) are performed alternately. These updates are themselves carried out by solving entropy-regularized optimal transport problems.
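
A minimal sketch of the factorization itself, on toy placeholder factors (plain NumPy, illustrative values): the resulting coupling has rank at most \(r\).

```python
import numpy as np

n, m, r = 5, 6, 2
Q = np.full((n, r), 1 / (n * r))    # placeholder factors of sizes [n, r]
R = np.full((m, r), 1 / (m * r))    # and [m, r]
g = np.full(r, 1 / r)               # vector of size [r,]
P = Q @ np.diag(g) @ R.T            # P = Q diag(g) R^T
assert np.linalg.matrix_rank(P) <= r
```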

matching

A bijective pairing between two families of points of the same size \(N\), parameterized using a permutation of size \(N\).

multimarginal coupling

A multimarginal coupling of \(N\) probability measures \(\mu_1, \dots, \mu_N\) is a probability measure on the product space of their respective supports, such that its marginals coincide, in that order, with \(\mu_1, \dots, \mu_N\).

optimal matching problem

Instance of the Kantorovich problem where both marginal weight vectors \(a,b\) are equal, both set to the uniform weight vector \((\tfrac{1}{n},\dots,\tfrac{1}{n})\in\mathbb{R}^n\).

optimal transport

Theory that characterizes efficient transformations between probability measures. Theoretical aspects usually arise when studying such transformations between continuous probability measures (e.g. densities) whereas computational aspects become relevant when estimating such transforms from samples.

push-forward map

Given a measurable mapping \(T\) (e.g. a vector-to-vector map), the push-forward measure of \(\mu\) by \(T\), denoted \(T\#\mu\), is the measure defined such that for any measurable set \(B\), \(T\#\mu(B)=\mu(T^{-1}(B))\). Intuitively, it is the measure obtained by applying the map \(T\) to all points in the support of \(\mu\). See also the Wikipedia definition.
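
In the sampled setting this is simply function application; a toy sketch (illustrative map and data):

```python
import numpy as np

rng = np.random.default_rng(0)
samples_mu = rng.normal(size=(1000, 2))    # samples from mu
T = lambda x: 2.0 * x + 1.0                # a toy affine map
samples_T_mu = T(samples_mu)               # samples from T#mu
```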

Sinkhorn algorithm

Fixed-point iteration that solves the entropy-regularized optimal transport problem (EOT). The Sinkhorn algorithm solves the EOT problem by seeking optimal \(f\), \(g\) dual Kantorovich potentials (or alternatively their parameterization as positive scaling vectors \(u\), \(v\)), rather than seeking a coupling \(P\). This is mostly for efficiency (potentials and scalings have an n + m memory footprint, rather than the n m required to store \(P\)). Note that an optimal coupling \(P^{\star}\) can be recovered from optimal potentials \(f^{\star}\), \(g^{\star}\) or scalings \(u^{\star}\), \(v^{\star}\):

\[P^{\star} = \exp\left(\frac{f^{\star}\mathbf{1}_m^T + \mathbf{1}_n (g^{\star})^T-C}{\varepsilon}\right) \text{ or } P^{\star} = \text{diag}(u^{\star}) K \text{diag}(v^{\star}).\]

The Sinkhorn algorithm solves this dual problem using block coordinate ascent, i.e. devising updates for \(f\) and \(g\) (resp. \(u\) and \(v\)) that alternately cancel their respective gradients, one at a time.
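
A minimal sketch of the balanced case in scaling form (plain NumPy, toy data): each update cancels one block of gradients by matching one marginal exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.uniform(size=(3, 4))
a, b = np.full(3, 1 / 3), np.full(4, 1 / 4)
eps = 0.5
K = np.exp(-C / eps)                      # kernel matrix
u, v = np.ones(3), np.ones(4)
for _ in range(500):
    u = a / (K @ v)       # cancels the gradient in u: row marginals match a
    v = b / (K.T @ u)     # cancels the gradient in v: column marginals match b
P = u[:, None] * K * v[None, :]           # P* = diag(u) K diag(v)
assert np.allclose(P.sum(axis=1), a, atol=1e-6)
assert np.allclose(P.sum(axis=0), b)
```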

Sinkhorn divergence

Proxy for the Wasserstein distance between two families of samples. Rather than use the output of the Kantorovich problem, whose numerical resolution requires running a linear program, one uses instead the objective of entropy-regularized optimal transport or that of low-rank optimal transport, properly renormalized. This normalization is done by considering:

\[\text{SD}(\mu, \nu):= \Delta(\mu, \nu) - \tfrac12 \left(\Delta(\mu, \mu) + \Delta(\nu, \nu)\right)\]

where \(\Delta\) is the output of either entropy-regularized optimal transport or low-rank optimal transport.
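
The recipe is a one-liner once a regularized OT value \(\Delta\) is available; in the sketch below, delta is a hypothetical callable standing in for such a solver.

```python
def sinkhorn_divergence(delta, mu, nu):
    # debiasing: SD(mu, nu) = Delta(mu, nu) - (Delta(mu, mu) + Delta(nu, nu)) / 2
    return delta(mu, nu) - 0.5 * (delta(mu, mu) + delta(nu, nu))
```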

transport map

A function \(T\) that associates to each point \(x\) in the support of a source distribution \(\mu\) another point \(T(x)\) in the support of a target distribution \(\nu\), and which must satisfy the push-forward constraint \(T\#\mu = \nu\).

transport plan

A coupling (either in matrix or joint density form), quantifying the strength of association between any point \(x\) in the source distribution \(\mu\) and any point \(y\) in the target distribution \(\nu\).

transportation polytope

Given two probability vectors \(a,b\) of non-negative weights of respective sizes n and m, each summing to \(1\), the transportation polytope is the set of matrices

\[U(a,b):= \{P \in \mathbb{R}_{+}^{n\times m} \,|\, P\mathbf{1}_m = a,\, P^T\mathbf{1}_n=b \}.\]

twist condition

Given a ground cost function \(c(x, y)\) taking two input vectors, the twist condition refers to the requirement that at any given point \(x\), the map \(y \mapsto \nabla_1 c(x, y)\) be invertible. Although not necessary, this condition is sufficient to prove the existence of an optimal transport map from a source to a target measure, under suitable assumptions on the measures themselves.

unbalanced

A generalization of the Kantorovich problem that brings more flexibility to optimal transport computations. Such a generalization arises when dualizing the constraint that the variable coupling in the Kantorovich problem has marginals that coincide exactly with \(a\) and \(b\) (or \(\mu\) and \(\nu\) in the continuous formulation). Instead, deviations from those marginals appear as penalty terms.

unrolling

Automatic differentiation technique to compute the vector-Jacobian product of the minimizer of an optimization procedure by treating the iterations (used to converge from an initial point) as layers in a computational graph, and computing its differential using reverse-mode differentiation.
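
A small JAX sketch (toy objective and step size, chosen for illustration): differentiating through a fixed number of gradient-descent steps approximates the Jacobian of the minimizer, here \(z^\star(x)=x/2\).

```python
import jax
import jax.numpy as jnp

def z_star_unrolled(x, steps=100, lr=0.4):
    z = jnp.zeros_like(x)
    for _ in range(steps):
        z = z - lr * ((z - x) + z)   # gradient step on grad_z s = 2 z - x
    return z

x = jnp.array([1.0, -2.0])
# reverse-mode differentiation backpropagates through all unrolled steps
J = jax.jacobian(z_star_unrolled)(x)
assert jnp.allclose(J, 0.5 * jnp.eye(2), atol=1e-4)   # dz*/dx = I / 2
```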

Wasserstein distance

Distance between two probability measures, parameterized by a ground cost function, equal to the optimal objective reached when solving the Kantorovich problem. The Wasserstein distance is truly a distance (in the sense that it satisfies all 3 metric axioms) if the ground cost is itself a distance raised to a power \(p\leq 1\), or, when the ground cost is a distance raised to a power \(p\geq 1\), if the \(p\)-th root of the objective of the Kantorovich problem is used.