# Single-cell genomics

## Contents

# Single-cell genomics#

Single-cell RNA sequencing (scRNA-seq) is a technology that provides the gene expression profile of a single-cell. scRNA-seq can help describe a cell as a vector of features, where each feature represents the expression level of a gene in that cell upon measurement. Given a large population of cells observed through time, one can sample a few of these cells at different points in time, and produce at each time point as many feature vectors through scRNA-seq, and obtain a point-cloud. Each of these point-clouds provide a clear picture of the diversity of cells at each point time. Taken together, they describe the collective evolution of these cells along a developmental time-course. However, as measuring destroys the cells, one cannot follow gene expression of the exact same cell through time.

Optimal transport can help us infer individual trajectories from the dynamics of the population overall, and answer questions such as what are the descendants or ancestors of each cell.

In this notebook, we infer the ancestors of induced Pluripotent Stem Cells (iPSCs) using temporal snapshots sampled twice or four times a day for a period of 18 days. iPSCs are a type of pluripotent stem cells that can be generated from specialised cells. Identifying the ancestors of iPSCs would enable us to know which specialised cells are able to revert to iPSCs.

**Reference**

*Schiebinger, G. et al., Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming, Cell 2019*[1]. The data was downloaded from the corresponding WOT tutorial. Downloaded data is capitalised.

```
[1]:
```

```
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
from ott.core import sinkhorn
from ott.geometry import pointcloud
```

## Calculates optimal transport on gene expression point clouds#

The optimal transport between pairs of gene expression matrices, measured at consecutive time points, is calculated. As a preprocessing step, the cell profiles are projected to a 30-dimensional space with PCA, as suggested in [1].

`DAYS`

is an array of measurement days from 0 to 18.`PCA30_SERUM_DAY`

is a dictionary, which contains as keys the measurement days and as values the low-dimensional representation of gene expression matrices.`CELL_GROWTH_RATE`

is a dictionary, which contains as keys the measurement days and as values the cell growth rate vectors.The optimal transport is calculated between each consecutive pair of datasets.

```
[2]:
```

```
# Defines the optimal transport regularisation parameter
epsilon = 5
dict_results = {}
for i, day in enumerate(DAYS):
if day == 18.:
continue
print('\r' + f'Executing Sinkhorn between the days {day} and {DAYS[i+1]} / 18', end='')
# Computes the marginals
delta_days = DAYS[i+1] - day
n = PCA30_SERUM_DAY[DAYS[i+1]].shape[0]
# The original data has been transformed with PCA as described in [1]
a = np.power(CELL_GROWTH_RATE[day], delta_days) / np.mean(
np.power(CELL_GROWTH_RATE[day], delta_days)) / n
b = np.ones(n) / n
# Applies optimal transport
geom = pointcloud.PointCloud(
PCA30_SERUM_DAY[day], PCA30_SERUM_DAY[DAYS[i+1]], epsilon=epsilon)
out = sinkhorn.sinkhorn(geom, a, b, tau_a=1/(1+epsilon), tau_b=1)
# Saves the geometry and the potentials to calculate the ancestors at a later time
dict_results[day] = [geom, out.f, out.g]
```

```
Executing Sinkhorn on pair of datasets for the days 17.5 and 18.0 / 18
```

## Infers cells’ ancestors#

Cells that are identified as induced Pluripotent Stem Cells (iPSCs) at day 18 are selected, and its ancestors at each previous time points, from day 18 to day 0, are inferred. To do so, the method `apply_transport_from_potentials`

is used, which enables to perform the pull back operation without instantiating directly the transport matrix.

`CELL_DISTRIBUTION_IPSC_DAY18`

is a normalised indicator array of cells identified as iPSCs at day 18.

```
[3]:
```

```
cell_distribution_ipsc = {}
reverse_days = DAYS[::-1]
for i, day in enumerate(reverse_days):
print('\r' + f'Infering ancestor cells at day {day}', end='')
if day == 0.:
continue
if day == 18.:
cell_distribution_ipsc[day] = CELL_DISTRIBUTION_IPSC_DAY18
# Calculates cells' ancestors
geom = dict_results[reverse_days[i+1]][0]
f = dict_results[reverse_days[i+1]][1]
g = dict_results[reverse_days[i+1]][2]
cell_distribution = geom.apply_transport_from_potentials(
f, g, cell_distribution_ipsc[day], axis=1)
cell_distribution_ipsc[reverse_days[i+1]] = cell_distribution / np.sum(
cell_distribution)
```

```
Calculating ancestor cells at day 0.0
```

## Visualizes#

Once the ancestors of iPSCs are identified, they are represented on a low-dimensional space thanks to an FLE visualisation. The iPSCs and its ancestors are coloured while all the other cells remain in grey.

`COORD_DF`

is a dataframe that contains the coordinates of all cells on the FLE visualisation.`IDS_CELLS`

is a dictionary that contains as keys the measurement days and as values the IDs of the cells.All the cells can be represented in a 2D space thanks to an FLE visualisation [1]. The cells’ coordinates are binarised.

```
[4]:
```

```
nbins = 500
xrange = COORD_DF['x'].min(), COORD_DF['x'].max()
yrange = COORD_DF['y'].min(), COORD_DF['y'].max()
COORD_DF['x'] = np.floor(
np.interp(COORD_DF['x'], [xrange[0], xrange[1]], [0, nbins - 1])).astype(int)
COORD_DF['y'] = np.floor(
np.interp(COORD_DF['y'], [yrange[0], yrange[1]], [0, nbins - 1])).astype(int)
```

The coordinates of cells identified as iPSCs at day 18 and its ancestors are filtered.

```
[5]:
```

```
coord_ancestors_ipsc = dict()
for day in DAYS:
cell_ids = np.array(IDS_CELLS[day])
coord_ancestors_ipsc[day] = COORD_DF[COORD_DF.index.isin(
cell_ids)][['x', 'y']].values
```

To be able to discriminate between likely and unlikely ancestors, ancestors are binarised according to their inferred distribution at each time point.

```
[6]:
```

```
alpha_bins = [1, 0.5, 0.]
binned_cell_distribution_ipsc = {}
for day in DAYS:
tmp = cell_distribution_ipsc[day].copy()
tmp[tmp >= 1e-2] = alpha_bins[0]
tmp[np.logical_and(1e-2 > tmp, tmp >= 5e-4)] = alpha_bins[1]
tmp[5e-4 > tmp] = alpha_bins[2]
binned_cell_distribution_ipsc[day] = tmp
```

iPSCs and its ancestors are coloured according to the day the cells are measured on the FLE visualisation (see colorbar). All other cells are represented in grey.

```
[7]:
```

```
# Sets a matplotlib colormap
cm = plt.get_cmap('jet')
cNorm = mpl.colors.Normalize(vmin=0, vmax=len(DAYS))
scalarMap = mpl.cm.ScalarMappable(norm=cNorm, cmap=cm)
fig = plt.figure(figsize=(13, 10))
plt.title(f'Cell type: iPSC, medium: serum, $\epsilon$={epsilon}', fontsize=24)
plt.plot(COORD_DF['x'], COORD_DF['y'], marker='.', color='grey', ls='',
markersize=0.3, alpha=0.07)
for i, day in enumerate(DAYS):
colorVal = scalarMap.to_rgba(i)
for b in alpha_bins:
ind_alpha = np.where(binned_cell_distribution_ipsc[day] == b)[0]
colorVal = np.array(colorVal)
colorVal[3] = b
plt.plot(coord_ancestors_ipsc[day][ind_alpha, 0],
coord_ancestors_ipsc[day][ind_alpha, 1],
marker='.', color=colorVal, ls='', markersize=1)
plt.xlabel('FLE1', fontsize=24)
plt.ylabel('FLE2', fontsize=24)
ax, _ = mpl.colorbar.make_axes(plt.gca(), shrink=1)
cbar = mpl.colorbar.ColorbarBase(ax, cmap=cm,
norm=mpl.colors.Normalize(vmin=0, vmax=18))
plt.show()
```