# Chapter 5 Quality control and normalization

## 5.1 Quality control

Let’s focus now on **AML027 pre-transplant** data. The two datasets will be compared later on.

A first quality control (QC) based on the number of genes and on the number of UMIs detected per cell will give you an idea of how well your experiment worked and how the data is distributed.

A third indicator can be calculated: the percentage of mitochondrial genes per cell. It has been suggested (see Ilicic et al.) that a high proportion of mitochondrial transcripts could be a marker of cytoplasmic RNA loss, and thus of a low-quality cell. These cells will be filtered out too.

```
# add mitochondrial gene percentage to the object metadata
# this function computes, for each cell, the percentage of UMIs associated
# with genes whose name starts with "MT-" (mitochondrial genes)
# a metadata field is then added to AML027.pre object
AML027.pre[["percent.mt"]] <- PercentageFeatureSet(AML027.pre, pattern = "^MT-")
```
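What `PercentageFeatureSet` computes can be sketched language-agnostically. A minimal Python/numpy illustration on a toy matrix (the gene names and counts are hypothetical, not the AML027 data):

```python
import numpy as np

# Toy UMI count matrix: rows = genes, columns = cells (as in Seurat).
genes = np.array(["MT-CO1", "MT-ND1", "GAPDH", "CD14"])
counts = np.array([
    [5, 20],   # MT-CO1
    [5, 10],   # MT-ND1
    [80, 60],  # GAPDH
    [10, 10],  # CD14
])

# Select genes whose name starts with "MT-", mirroring pattern = "^MT-".
is_mt = np.char.startswith(genes, "MT-")

# Percentage of each cell's UMIs that come from mitochondrial genes.
percent_mt = 100 * counts[is_mt].sum(axis=0) / counts.sum(axis=0)
print(percent_mt)  # [10. 30.]
```

Each cell's percentage is simply its mitochondrial UMIs divided by its total UMIs.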

```
# violin plots of the three QC metrics, one point per cell
VlnPlot(
group.by = "orig.ident",
object = AML027.pre,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3, pt.size = 0.1
)
```

**Exercise 5.1**

- How could you explain that some *outliers* show a high number of UMIs?
- What are mitochondrial genes?
- Use the `FeatureScatter` function to plot the number of genes against the number of UMIs per cell. Does it make sense to compute the correlation between the two variables? Do you expect a linear relation between `nCount_RNA` and `nFeature_RNA`?
- Use the `FeatureScatter` function to plot the number of UMIs against the percentage of mitochondrial genes per cell. Is it consistent with Ilicic’s hypothesis?

We decide to filter out cells that have more than 10% mitochondrial UMIs, as well as cells with fewer than 200 or more than 3,000 detected genes.

`AML027.pre <- subset(AML027.pre, subset = nFeature_RNA > 200 & nFeature_RNA < 3000 & percent.mt < 10)`
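The `subset` call boils down to a boolean filter on the per-cell QC metrics. A sketch of the same logic in Python/numpy, with hypothetical values:

```python
import numpy as np

# Hypothetical per-cell QC metrics (same names as the Seurat metadata fields).
nFeature_RNA = np.array([150, 500, 2500, 4000])
percent_mt = np.array([5.0, 12.0, 3.0, 2.0])

# Same filter as the subset() call: 200 < nFeature_RNA < 3000 and percent.mt < 10.
keep = (nFeature_RNA > 200) & (nFeature_RNA < 3000) & (percent_mt < 10)
print(keep)  # [False False  True False]
```

Only cells passing all three conditions are retained.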

At this stage, the raw data is still in `AML027.pre@assays$RNA@counts`, and `AML027.pre@assays$RNA@data` contains the filtered data.

## 5.2 Data normalisation

There are various ways to normalize data, but it is a critical step to adjust for unwanted effects that can mask the signal of interest. Seurat proposes only one way to normalize data at the moment: for each cell, the UMI count of each gene is divided by the cell’s total number of UMIs, multiplied by a scale factor to keep values human-readable, and finally log-transformed.

The log-transformation helps stabilize the variance and yields a more symmetric distribution of expression values.

\(M_{norm}(i, j) = \log(1+ \frac{M_{UMI}(i, j) }{ \sum\limits_{k=1}^n M_{UMI}(k, j) } * S_f)\)

where \(M_{norm}\) is the normalized expression matrix, \(M_{UMI}\) the UMI count matrix, \(n\) the number of genes and \(S_f\) a scale factor (usually 1E4).
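The formula can be checked on a toy matrix. A minimal Python/numpy sketch (toy counts, not the AML027 data; `np.log1p` applies the same natural-log transform of \(1 + x\)):

```python
import numpy as np

# Toy UMI matrix M_umi: rows = genes (i), columns = cells (j).
M_umi = np.array([
    [10.0,  0.0],
    [90.0, 50.0],
])

S_f = 1e4  # scale factor

# Per-cell totals: sum over genes k for each cell j.
totals = M_umi.sum(axis=0)

# M_norm(i, j) = log(1 + M_umi(i, j) / totals(j) * S_f)
M_norm = np.log1p(M_umi / totals * S_f)
```

After this step, each cell’s values depend on its gene proportions rather than on its sequencing depth.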

`AML027.pre <- NormalizeData(object = AML027.pre, normalization.method = "LogNormalize", scale.factor = 10000)`

**Exercise 5.2**

- Plot a histogram of `HBB` expression in `AML027.pre` with and without normalization. What happened?
- Plot the gene variances against their mean expression, with and without normalization.
- Without normalization, what happens to the variance when the mean increases?

## 5.3 Find variable genes

Once the data is normalised, the goal is to find genes that are variable across cells. These genes will be used to find clusters (of cell types, for example) or to assign states to cells, such as the cell cycle phase. The assumption here is that only a limited number of genes actually carry information in the dataset.

```
AML027.pre <- FindVariableFeatures(AML027.pre, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(AML027.pre), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(AML027.pre)
LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
```

```
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
```

`## Warning: Transformation introduced infinite values in continuous x-axis`

## 5.4 Scaling data

The goal here is to remove unwanted sources of variation so that each cell weighs equally in the downstream analysis. Scaling can also regress out cell-cell variation driven by the number of UMIs, the percentage of mitochondrial genes, cell cycle effects, apoptosis, etc., in order to highlight the features of cell subtypes. In our case, regressing out these variables would be time-consuming; we will only standardize the matrix to prepare for dimension reduction.

`AML027.pre <- ScaleData(object = AML027.pre, verbose = FALSE, features = VariableFeatures(AML027.pre))`

## 5.5 Reducing dimension

On the scaled data, we will run a principal component analysis (PCA) to reduce the dataset dimension (from > 2,000 variables to about 10-20).

Given a centered (and, if needed, scaled) dataset \(M \in \mathbf{M}(K, N)\), the principle of PCA is to find an axis \(u\), a linear combination of the variables \(X_1 \dots X_N\), such that the variance of \(M \cdot u\) is maximal. \(u\) is in fact an eigenvector associated with the eigenvalue \(\lambda_1\) of the correlation matrix of \(M\), \(C = 1/K \cdot M^T \cdot M\). The variance explained by the first component is equal to \(\lambda_1\).

Next, we search for \(w\), a second axis of projection orthogonal to \(u\), and so on. It is an eigenvector associated with the eigenvalue (and variance) \(\lambda_2\) of \(C\).

As \(C\) is square, symmetric and real, it is diagonalisable in an orthonormal basis. The PCA problem on \(M\) can thus be interpreted as a diagonalisation problem on \(M\)’s correlation matrix.
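This diagonalisation view can be checked numerically. A minimal Python/numpy sketch on a toy random matrix (not the scRNA-seq data), assuming centered and scaled columns:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 200, 5  # K observations (cells), N variables (genes)

# Toy correlated data, then center and scale each column as assumed in the text.
M = rng.normal(size=(K, N)) @ rng.normal(size=(N, N))
M = (M - M.mean(axis=0)) / M.std(axis=0)

# C = 1/K * M^T * M, then diagonalize it (eigh handles symmetric matrices).
C = M.T @ M / K
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The variance of the projection on the first eigenvector equals lambda_1.
u = eigvecs[:, 0]
proj_var = np.var(M @ u)
```

Here `np.var` uses the same \(1/K\) convention as \(C\), so `proj_var` matches `eigvals[0]` up to floating-point error, and the eigenvalues sum to the total variance (\(N\) for scaled columns).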

By keeping a limited number of principal components (PCs), you expect your matrix to be largely reduced while keeping what makes your observations different from one another.

In practice, several criteria can be used to choose the number of dimensions to keep in the PCA analysis. One option is to fix a threshold (0.80 for example) on the cumulative proportion of variance explained (the normalized sum of the \(\lambda\)s) and keep components until it is reached. Another way to decide is called the *elbow rule*: plot the \(\lambda\) values and cut as soon as an elbow appears on the plot:

```
## PC_ 1
## Positive: HBM, IFIT1B, TMCC2, CA1, ARL4A, ARG1, IFI27, HBG2, KRT1, HBG1
## RSAD2, HIST1H1C, CA2, PRDX2, ITLN1, PCSK1N, H1F0, ARL6IP1, SFRP2, NUSAP1
## REXO2, C10orf10, HBE1, CENPF, TUBB1, RP11-382A20.4, CENPE, NEK2, CITED2, ARG2
## Negative: RPL10, GAPDH, RPL13, RPS19, RPS18, RPS4X, RPL13A, RPL19, RPL15, RPL3
## RPL18A, RPL10A, RPL11, RPL26, RPS2, RPS3A, RPL32, TMSB10, PFN1, RPS14
## RPS6, RPL9, RPS23, RPL6, EEF1A1, RPL36, RPL37, RPS24, RPL37A, RPL27A
## PC_ 2
## Positive: KIAA0101, PRDX2, TUBB, DUT, TYMS, MPC2, STMN1, TUBA1B, NME4, HMGB1
## TMEM14C, HMGA1, KCNH2, UBAC1, RAN, RANBP1, PCNA, HMGB2, HIST1H4C, FAM178B
## CCT6A, MCM7, UBE2T, ATPIF1, TMEM14B, RPA3, CKS2, METAP2, BIRC5, HSPD1
## Negative: FCN1, TYROBP, CTSS, S100A11, FCER1G, S100A9, TYMP, S100A8, CFD, LST1
## SERPINA1, COTL1, LYZ, MNDA, LGALS1, AIF1, S100A10, CST3, CFP, ANXA2
## CD14, MS4A6A, ZFP36, SPI1, CYBA, C1orf162, TMEM176B, CD68, VIM, GRN
## PC_ 3
## Positive: FCN1, S100A9, S100A8, LYZ, CFD, CST3, FTL, CPVL, MNDA, AIF1
## CD14, MS4A6A, GRN, CFP, SERPINA1, LST1, SPI1, FCGRT, PSAP, TMEM176B
## CTSS, CD68, TYMP, LY86, TIMP1, S100A12, NUP214, HCK, PLBD1, NCF2
## Negative: PTPRCAP, GZMA, CTSW, PRF1, CST7, CCL5, CD3E, NKG7, IL32, HOPX
## GNLY, GZMH, GZMM, CD7, EVL, GZMB, CD247, GZMK, IL2RG, CCL4
## CD3D, IL2RB, LCK, CD2, KLRC1, BIN1, C12orf75, KLRD1, SH2D1A, XCL2
## PC_ 4
## Positive: PRSS57, IGLL1, EGFL7, IFIT1B, CYTL1, KIAA0125, NGFRAP1, C1QTNF4, FABP5, SPINK2
## MEST, ITM2C, TMCC2, NPM3, PEBP1, EBPL, CD34, MYC, NPW, LAPTM4B
## RP11-620J15.3, ARL4A, MIF, MPO, SRM, NME1, SOX4, SERPINE2, FSCN1, APOC1
## Negative: NUSAP1, HMGB2, CA2, CDK1, PTTG1, HBM, TUBB4B, CENPF, CDKN3, TUBA1B
## TUBB, UBE2C, PRDX2, BIRC5, HIST1H4C, UBE2T, CCNB2, H1F0, CCNA2, CENPA
## TOP2A, AURKA, CKS2, RRM2, KPNA2, NUF2, CENPE, HMMR, CCNB1, PLK1
## PC_ 5
## Positive: FAM178B, EPCAM, CNRIP1, ALDH1A1, NMU, APOC1, REXO2, GAL, FKBP4, S100A6
## LMNA, MLLT3, S100A4, SMIM10, AKR1C3, ENG, NMNAT3, GSTM3, KRT13, PAK1IP1
## KCNH2, BAG2, GLRX, IFIT1B, B3GALNT1, IFI27, LYAR, CPVL, FAM89A, CBY1
## Negative: ITM2C, PLD4, C1QTNF4, SPINK2, UBE2C, CENPA, MPO, MZB1, IGJ, NUSAP1
## C1orf228, TUBA1B, CENPF, IGLL1, KIAA0125, HMGB2, CDKN3, SOX4, FABP5, CD34
## GPSM2, PRTN3, CTSG, HIST1H4C, IRF8, CD79A, NUF2, PTTG1, PLK1, CDK1
```
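The variance-threshold criterion mentioned above can be sketched with hypothetical eigenvalues (not the actual values from this PCA):

```python
import numpy as np

# Hypothetical eigenvalues (variances of successive PCs), in decreasing order.
lambdas = np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.2])

# Keep the smallest number of PCs whose cumulative share
# of the total variance reaches the 0.80 threshold.
explained = lambdas / lambdas.sum()
n_pcs = int(np.argmax(np.cumsum(explained) >= 0.80) + 1)
print(n_pcs)  # 2
```

The elbow rule is the visual counterpart: instead of a fixed threshold, one cuts where the decrease in \(\lambda\) flattens out.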

**Exercise 5.3**

- PC_3 seems to segregate the expression of **CD14** versus that of **CD3D**. What are **CD14** and **CD3D** used for? How would you interpret this result? Which immune cell types are most likely to lie on either side of the PC_3 axis?

Now that we have a reduced dataset, let’s cluster the cells and analyze the bone marrow of AML027.