Glossary#

Benamou-Brenier#

A dynamic formulation of optimal transport [Benamou and Brenier, 2000] that characterizes the optimal transport problem between two probability distributions supported on a vector space as the minimum kinetic energy needed to move one distribution to the other, using a time-dependent velocity field. More precisely, given two probability distributions, the problem reads:

\[\min_{(\rho_t, v_t)} \int_0^1 \int \|v_t(x)\|^2 d\rho_t(x) dt\]

where \(\rho_t\) is a time-dependent probability distribution such that \(\rho_0=\mu\) and \(\rho_1=\nu\), and \(v_t\) is a time-dependent velocity field such that the continuity equation \(\partial_t \rho_t + \nabla \cdot (\rho_t v_t)=0\) is satisfied.

The solution to that problem results in a velocity field \(v_t\) that is constant along the integration path linking a source point to the point it is mapped to in the target distribution. Writing \(T\) for the Monge map between \(\mu\) and \(\nu\), one has in particular that \(T(x) = x + \int_0^1 v_t(x) dt\), but also writing \(x_t = x + t(T(x)-x)\) the barycenter between \(x\) and \(T(x)\), one has

\[\forall t\in[0,1], v_t(x_t) = T(x) -x\]

In other words, the velocity field is constant along the straight line connecting a source point to its image in the target.

Brenier potential#

Convex potential function whose gradient transports optimally (w.r.t. the squared-Euclidean cost) a probability measure in \(\mathcal{P}(\mathbb{R}^d)\) onto another (i.e. whose gradient is the Monge map between them). More specifically, the Brenier potential \(\varphi\) for the optimal transport problem from a measure \(\mu\) to \(\nu\) solves a variant of the dual Kantorovich problem that reads

\[\min_{\varphi: \mathbb{R}^d\rightarrow \mathbb{R}} \int \varphi d\mu + \int \varphi^* d\nu\,.\]

where \(\varphi^*\) is the Legendre transform of \(\varphi\). Note that if one picks an arbitrary source measure \(\mu\) and a convex potential \(\varphi\), and thereafter sets \(\nu:=\nabla\varphi \#\mu\), then \(\varphi\) is necessarily a Brenier potential solving the OT problem from \(\mu\) to \(\nu\).

Brenier theorem#

Fundamental result in optimal transport theory [Brenier, 1991] stating that when the ground cost is the squared Euclidean distance, and the source measure is absolutely continuous (e.g. has a density), then there exists a unique optimal transport map between the source and target measures, which is the gradient of a convex function (a so-called Brenier potential). Conversely, any transport map \(T\) that is the gradient of a convex function is optimal for the squared Euclidean cost when considering the OT problem between any source distribution \(\mu\). to that same source modified by the push-forward map \(T\), namely \(T\#\mu\). The Brenier theorem is a special case of the more general Gangbo-McCann theorem when instantiated with the squared Euclidean cost \(c(x,y)=\tfrac12\|x-y\|^2\).

c-concave function#

A function that can be written as the c-transform of another function. Optimal dual Kantorovich potentials can be found within such functions, under mild assumptions on the ground cost and the measures being compared. For c-concave functions, the c-transform is an involution, i.e. \(f^{cc} = f\).

c-transform#

The c-transform of a function \(g\) with respect to a ground cost \(c(x,y)\) is defined as the function \(g^c\) such that for any \(x\) in the source space, \(g^c(x) := \inf_y c(x,y) - g(y)\). Alternatively, the c-transform of a function \(f\) defined on the source space is the function \(f^c\) defined on the target space such that for any \(y\) in the target space, \(f^c(y) := \inf_x c(x,y) - f(x)\) if \(c\) is symmetric. One always has \(f^{ccc} = f^{c}\).

coupling#

A coupling of two probability measures \(\mu\) and \(\nu\) is a probability measure on the product space of their respective supports. When the coupling is balanced, the first and second marginals of that probability measure must coincide with \(\mu\) and \(\nu\) respectively. Equivalently, given two non-negative vectors \(a\in\mathbb{R}^n\) and \(b\in\mathbb{R}^m\), a coupling in matrix form is a non-negative matrix \(P\) of size \(n\times m\). When the coupling is balanced \(P\) is in their transportation polytope \(U(a,b)\).

dual Kantorovich potentials#

Real-valued functions or vectors that solve the dual Kantorovich problem.

dual Kantorovich problem#

Dual formulation of the Kantorovich problem, seeking two vectors \(f, g\) that are dual Kantorovich potentials such that, given a ground cost matrix \(C\) of size [n, m] and two probability vectors \(a \in\mathbb{R}^n,b\in\mathbb{R}^m\), they belong to the dual transportation polyhedron \(D(C)\) and maximize:

\[\max_{f,g \,\in D(C)} \langle f,a \rangle + \langle g,b \rangle.\]

This problem admits a continuous formulation between two probability distributions \(\mu,\nu\):

\[\max_{f\oplus g\leq c} \int f d\mu + \int g d\nu,\]

where \(f,g\) are in that case real-valued functions on the supports of \(\mu,\nu\) and \(f\oplus g\leq c\) means that for any pair \(x,y\) in the respective supports, one has \(f(x)+g(y)\leq c(x,y)\). The semidiscrete problem studies more specifically the mixed setting in which either measure is discrete and the other continuous.

dual transportation polyhedron#

Given a \(n\times m\) cost matrix \(C\), denotes the set of pairs of vectors

\[D(C):= \{f \in\mathbb{R}^n, g \in \mathbb{R}^m | f_i + g_j \leq C_{ij}\}.\]

dualizing#

Within the context of optimization, the process of simplifying a constrained optimization problem into an unconstrained one, by transforming constraints into penalty terms in the objective function.

entropic map#

Refers to an approximate transport map obtained by solving first a entropy-regularized optimal transport problem in dual form (using typically the Sinkhorn algorithm) between two point clouds, to leverage next the pair of dual potential vectors \(f,g\) returned by that algorithm to form a continuous approximation to the dual functions arising in the dual Kantorovich problem. This approximation in then plugged into the Gangbo-McCann theorem to recover an approximate Monge map.

entropy-regularized optimal transport#

The data of the entropy regularized OT (EOT) problem is parameterized by a cost matrix \(C\) of size [n, m] and two vectors \(a,b\) of non-negative weights of respective size n and m. The parameters of the EOT problem consist of three numbers \(\varepsilon, \tau_a, \tau_b>0\).

The optimization variables are a pair of vectors of sizes n and m denoted as \(f\) and \(g\), akin to dual Kantorovich potentials but not constrained to belong to the dual transportation polyhedron.

Using the reparameterization for \(\rho_a\) and \(\rho_b\) using \(\tau_a=\rho_a /(\varepsilon + \rho_a)\) and \(\tau_b=\rho_b /(\varepsilon + \rho_b)\), the EOT optimization problem reads:

\[\max_{f, g} - \langle a, \phi_a^{*}(-f) \rangle - \langle b, \phi_b^{*}(-g) \rangle - \varepsilon \left(\langle e^{f/\varepsilon}, e^{-C/\varepsilon} e^{g/\varepsilon} \rangle -|a||b|\right)\]

where \(\phi_a(z) = \rho_a z(\log z - 1)\) is a scaled entropy, and \(\phi_a^{*}(z) = \rho_a e^{z/\varepsilon}\), its Legendre transform [Séjourné et al., 2019].

That problem can also be written, instead, using positive scaling vectors u, v of size n, m, and the kernel matrix \(K := e^{-C/\varepsilon}\), as

\[\max_{u, v >0} - \langle a,\phi_a^{*}(-\varepsilon\log u) \rangle + \langle b, \phi_b^{*}(-\varepsilon\log v) \rangle - \langle u, K v \rangle\]

Both of these problems can be written with a primal formulation, that solves the unbalanced optimal transport problem with a variable matrix \(P\) of size n x m and positive entries:

\[\min_{P>0} \langle P,C \rangle +\varepsilon \text{KL}(P | ab^T) + \rho_a \text{KL}(P\mathbf{1}_m | a) + \rho_b \text{KL}(P^T \mathbf{1}_n | b)\]

where \(\text{KL}\) is the generalized Kullback-Leibler divergence.

The very same primal problem can also be written using a kernel \(K\) instead of a cost \(C\) as well:

\[\min_{P>0}\, \varepsilon \text{KL}(P|K) + \rho_a \text{KL}(P\mathbf{1}_m | a) + \rho_b \text{KL}(P^T \mathbf{1}_n | b)\]

The original OT problem taught in linear programming courses is recovered by using the formulation above relying on the cost \(C\), and letting \(\varepsilon \rightarrow 0\), and \(\rho_a, \rho_b \rightarrow \infty\). In that case the entropy disappears, whereas the \(\text{KL}\) regularization above become constraints on the marginals of \(P\): This results in a standard min cost flow problem also called the Kantorovich problem.

The balanced regularized OT problem is recovered for finite \(\varepsilon > 0\) but letting \(\rho_a, \rho_b \rightarrow \infty\). This problem can be shown to be equivalent to a matrix scaling problem, which can be solved using the Sinkhorn algorithm. To handle the case \(\rho_a, \rho_b \rightarrow \infty\), the Sinkhorn function uses parameters tau_a and tau_b equal respectively to \(\rho_a /(\varepsilon + \rho_a)\) and \(\rho_b / (\varepsilon + \rho_b)\) instead. Setting either of these parameters to 1 corresponds to setting the corresponding \(\rho_a, \rho_b\) to \(\infty\).

envelope theorem#

The envelope theorem or Danskin’s theorem is a major result about the differentiability properties of the value function of a parameterized optimization problem. Namely, that for a function \(f\) defined implicitly as an optimal objective parameterized by a vector \(x\),

\[h(x):=\min_z s(x,z), z^\star(x):=\arg\min_z s(x,z)\]

one has

\[\nabla h(x)=\nabla_1 s(x,z^\star(x))\]

stating in effect that the optimal \(z^\star(x)\) does not need to be differentiated w.r.t. \(x\) when computing the gradient of \(h\). Note that this result is not valid for higher order differentiation.

flow matching#

Family of methods [Lipman et al., 2022] designed to fit a time-dependent velocity flow model so that it maps progressively a given source to a target distribution. The estimation of that velocity flow is done by specifying a pre-defined stochastic interpolant [Albergo et al., n.d.] between said distribution, and fitting the velocity field so that it agrees with that interpolation.

Gangbo-McCann theorem#

Fundamental result in optimal transport theory [Gangbo and McCann, 1996] stating that when both source and target measures are supported on the same Euclidean vector space, source measure is absolutely continuous (has a density) and cost is differentiable and satisfies the twist condition, then there exists a unique optimal transport map between the source and target measures,

\[T(x) = \nabla_1 c(x, \cdot)^{-1}(\nabla f(x))\]

where \(f\) is a c-concave function that solves the dual Kantorovich problem. Conversely, any transport map \(T\) that can be written as above for some c-concave function \(f\) is optimal for the cost \(c\) when considering the OT problem between any source distribution \(\mu\) and the target distribution \(\nu = T\#\mu\).

Gromov-Wasserstein distance#

A generalization of the Wasserstein distance between two point clouds living in possibly different spaces and equipped each with different cost functions, obtained by solving the Gromov-Wasserstein problem.

Gromov-Wasserstein problem#

A generalization of the Kantorovich problem in which the objective function is no longer a linear function of a coupling matrix \(P\), but more generally a quadratic function of \(P\). See [Mémoli, 2011].

ground cost#

A real-valued function of two variables, \(c(x,y)\) that describes the cost needed to displace a point \(x\) in a source measure to \(y\) in a target measure. Can also refer to a matrix \(C\) of evaluations of \(c\) on various pairs of points, \(C=[c(x_i, y_j)]_{ij}\).

Hungarian algorithm#

Combinatorial algorithm proposed by Harold Kuhn to solve the optimal matching problem. See the Wikipedia definition .

implicit differentiation#

Formula used to compute the vector-Jacobian product of the minimizer of an optimization procedure that leverages the fact that small variations in the input of the optimization problem still result in minimizers that verify optimality conditions (KKT or first-order conditions). These identities can then help recover the vector-Jacobian operator by inverting a linear system.

input convex neural network#

A neural network architecture for vectors with a few distinguishing features: some parameters of this NN must be non-negative, the NN’s output is real-valued and guaranteed to be convex in the input vector. Abbreviated as ICNN, see [Amos et al., 2017] for exact definition.

Kantorovich problem#

Linear program that is the original formulation of optimal transport between two point-clouds, seeking an optimal coupling matrix \(P\). The problem is parameterized by a ground cost matrix \(C\) of size [n, m] and two probability vectors \(a,b\) of non-negative weights of respective sizes n and m, summing to \(1\). The coupling is in the transportation polytope \(U(a,b)\) and must minimize the objective

\[\min_{P \in U(a,b)} \langle P,C \rangle = \sum_{ij} P_{ij} C_{ij}.\]

This linear program can be seen as the primal problem of the dual Kantorovich problem. Alternatively, this problem admits a continuous formulation between two probability distributions \(\mu,\nu\):

\[\min_{\pi \in \Pi(\mu,\nu)} \iint cd\pi.\]

where \(\pi\) is a coupling density with first marginal \(\mu\) and second marginal \(\nu\).

Legendre transform#

The Legendre transform of a convex function \(\phi\) is the convex function \(\phi^{*}\) defined as

\[\phi^{*}(y) = \sup_x \langle x, y \rangle - \phi(x)\]

one has the identities \(\nabla\phi\circ\nabla\phi^{*} = Id\) and \(\nabla\phi^{*}\circ\nabla\phi = Id\) when \(\phi\) is strictly convex and differentiable.

low-rank optimal transport#

Variant of the Kantorovich problem whereby the search for an optimal coupling matrix \(P\) is restricted to lie in a subset of matrices of low-rank. Effectively, this is parameterized by replacing \(P\) by a low-rank factorization

\[P = Q \text{diag}(g) R^T,\]

where \(Q,R\) are coupling matrices of size [n,r] and [m,r] and \(g\) is a vector of size [r,]. To be effective, one assumes implicitly that rank \(r\ll n,m\). To solve this in practice, the Kantorovich problem is modified to only seek solutions with this factorization, and updates on \(Q,R,g\) are done alternatively. These updates are themselves carried out by solving an entropy-regularized optimal transport problem.

matching#

A bijective pairing between two families of points of the same size \(N\), parameterized using a permutation of size \(N\).

Monge map#

A transport map \(T\) that is optimal for the Kantorovich problem. in the sense that for two measures \(\mu\) and \(\nu\), it solves

\[\min_{T : T\#\mu=\nu} \int c(x,T(x)) d\mu(x).\]

The constraint \(T\#\mu=\nu\) means that the push-forward measure obtained by pushing \(\mu\) through \(T\) is equal to \(\nu\). An optimal solution to the problem above solves the Kantorovich problem between \(\mu\) and \(\nu\) in the sense that the coupling \(\pi\) defined as the push-forward of \(\mu\) through the map \((Id,T)\), where \(Id\) is the identity map, is an optimal coupling between \(\mu\) and \(\nu\).

When it exists, a Monge map is a deterministic mapping between the supports of two measures, as opposed to a transport plan that can be stochastic. The Monge map can be recovered from the solution to the dual Kantorovich problem when the ground cost is differentiable and satisfies the twist condition, and the source measure is absolutely continuous (e.g. has a density), using the Gangbo-McCann theorem.

multimarginal coupling#

A multimarginal coupling of \(N\) probability measures \(\mu_1, \dots, \mu_N\) is a probability measure on the product space of their respective supports, such that its marginals coincide, in that order, with \(\mu_1, \dots, \mu_N\).

optimal matching problem#

Instance of the Kantorovich problem where both marginal weight vectors \(a,b\) are equal, and set both to a uniform weight vector of the form \((\tfrac{1}{n},\dots,\tfrac{1}{n})\in\mathbb{R}^n\).

optimal transport#

Theory that characterizes efficient transformations between probability measures. Theoretical aspects usually arise when studying such transformations between continuous probability measures (e.g. densities) whereas computational aspects become relevant when estimating such transforms from samples.

push-forward#

Given a measurable mapping \(T\) (e.g. a vector to vector map), the push-forward measure of \(\mu\) by \(T\) denoted as \(T\#\mu\), is the measure defined to be such that for any measurable set \(B\), \(T\#\mu(B)=\mu(T^{-1}(B))\). Intuitively, it is the measure obtained by applying the map \(T\) to all points described in the support of \(\mu\). See also the Wikipedia definition. Note that the OTT-JAX logo is a stylized depiction of the push-forward operator \(\#\).

semidiscrete problem#

Refers to the optimal transport problem where one of the two measures is discrete (a weighted sum of Dirac masses) and the other is absolutely continuous, which, in the context of this toolbox, means that one can get i.i.d. samples from it. When both conditions are met, the dual Kantorovich problem can be cast as a finite-dimensional concave maximization problem [Cuturi and Peyré, 2018]. Numerical integration methods can be used to approximate the objective and its gradients in lower dimension [Mérigot, 2011], whereas simpler SGD methods can be leveraged in higher dimensions [Genevay et al., 2016].

Sinkhorn algorithm#

Fixed point iteration that solves the entropy-regularized optimal transport problem (EOT). The Sinkhorn algorithm solves the EOT problem by seeking optimal \(f\), \(g\) dual Kantorovich potentials (or alternatively their parameterization as positive scaling vectors \(u\), \(v\)), rather than seeking a coupling \(P\). This is mostly for efficiency (potentials and scalings have a n + m memory footprint, rather than n m required to store \(P\)). Note that an optimal coupling \(P^{\star}\) can be recovered from optimal potentials \(f^{\star}\), \(g^{\star}\) or scaling \(u^{\star}\), \(v^{\star}\).

\[P^{\star} = \exp\left(\frac{f^{\star}\mathbf{1}_m^T + \mathbf{1}_n g^{*T}-C}{\varepsilon}\right) \text{ or } P^{\star} = \text{diag}(u^{\star}) K \text{diag}(v^{\star})\]

The Sinkhorn algorithm solves this dual problem using block coordinate ascent, i.e. devising an update for each \(f\) and \(g\) (resp. \(u\) and \(v\)) that cancels alternatively their respective gradients, one at a time.

Sinkhorn divergence#

Proxy for the Wasserstein distance between two samples. Rather than use the output of the Kantorovich problem to compare two families of samples, whose numerical resolution requires running a linear program, use instead the objective of entropy-regularized optimal transport or that of low-rank optimal transport properly renormalized. This normalization is done by considering:

\[\text{SD}(\mu, \nu):= \Delta(\mu, \nu) - \tfrac12 \left(\Delta(\mu, \mu) + \Delta(\nu, \nu)\right)\]

where \(\Delta\) is either the output of either entropy-regularized optimal transport or low-rank optimal transport

transport map#

A function \(T\) that associates to each point \(x\) in the support of a source distribution \(\mu\) another point \(T(x)\) in the support of a target distribution \(\nu\), which must satisfy a push-forward constraint \(T\#\mu = \nu\).

transport plan#

A coupling (either in matrix or joint density form), quantifying the strength of association between any point \(x`\) in the source distribution \(\mu\) and target point \(y`\) in the \(\nu\) distribution.

transportation polytope#

Given two probability vectors \(a,b\) of non-negative weights of respective size n and m, summing each to \(1\), the transportation polytope is the set of matrices

\[U(a,b):= \{P \in \mathbb{R}^{n\times m} | , P\mathbf{1}_m = a, P^T\mathbf{1}_n=b \}.\]

twist condition#

Given a ground cost function \(c(x, y)\) taking two input vectors, the twist condition refers to the requirement that at any given point \(x\), the map \(y \mapsto \nabla_1 c(x, y)\) be invertible. Although not necessary, this condition is sufficient to prove the existence of an optimal transport map from a source to a target measure with suitable assumptions on the measures themselves.

unbalanced#

A generalization of the Kantorovich problem defined to bring more flexibility to optimal transport computations. Such a generalization arises when dualizing the constraint that the variable coupling in the Kantorovich problem has marginals that coincide exactly with those of \(a\) and \(b\) or \(\mu\) and \(\nu\) in the continuous formulation. Instead, deviations from those marginals appear as penalty terms.

unrolling#

Automatic differentiation technique to compute the vector-Jacobian product of the minimizer of an optimization procedure by treating the iterations (used to converge from an initial point) as layers in a computational graph, and computing its differential using reverse-order differentiation.

Wasserstein barycenter#

The notion of a mean of vectors generalized to the Wasserstein space of probability distributions. A Wasserstein barycenter is a measure \(\mu\) that summarizes a weighted family of measures \((\nu_1,\dots,\nu_n)\) in the sense that for a family of \(n\) probability weights \(\lambda_1,\dots,\lambda_n\),

\[\mu^\star:=\arg\min_{\mu\in\mathcal{P}(\Omega)} \sum_{i=1}^n \lambda_i W(\mu,\nu_i)\]

See for instance [Agueh and Carlier, 2011], [Cuturi and Doucet, 2014] and [Benamou et al., 2015].

Wasserstein distance#

Distance between two probability functions parameterized by a ground cost function that is equal to the optimal objective reached when solving the Kantorovich problem. The Wasserstein distance is truly a distance (in the sense that it satisfies all 3 metric axioms ) if the ground cost is itself a distance to a power \(p\geq 1\) and the \(p\)-th root of the objective of the Kantorovich problem is used. See Chapter 5 in [Santambrogio, 2015].