Re: normalization

by Maria Lehtivaara
Number of replies: 0
Hi Zsuzsanna!

Super good questions!

Eija replied to the first part, so I try to answer to the second :)

The “scale factor” (default 10 000) is an estimate of the number of molecules present in a single cell.

In the normalization procedure described in the course (and used as the default in the Seurat package, and thus in Chipster), the counts are first divided by the total count of that cell, then multiplied by the scale factor (10 000), and then natural-log transformed.
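If it helps to see the arithmetic, here is a minimal sketch in Python/numpy of those three steps on a made-up counts matrix (Seurat itself runs in R; this just mirrors the math):

```python
import numpy as np

# Toy counts matrix: rows = genes, columns = cells (values are invented).
counts = np.array([
    [100, 10],
    [900, 40],
    [  0, 50],
], dtype=float)

scale_factor = 10_000  # Seurat's default

# 1. Divide each count by the total count of its cell (column sums).
per_cell_totals = counts.sum(axis=0)
relative = counts / per_cell_totals

# 2. Multiply by the scale factor to get back to "count-like" numbers.
scaled = relative * scale_factor

# 3. Natural-log transform; log1p(x) = log(1 + x), so zero counts stay zero.
normalized = np.log1p(scaled)

print(normalized)
```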

Why so many steps? You really hit the spot there, as normalization has been one of the most debated steps since the days of microarrays and throughout the bulk-RNA era, and now in scRNAseq as well :)

This method used in Seurat is just one way of doing the normalisation - there are many others (in fact, even inside Seurat you can perform the normalization step in 3 different ways, of which this is just one! See: https://rdrr.io/cran/Seurat/man/NormalizeData.html). Identifying the size factors in some way is a very common approach, as it is familiar from bulk RNA-seq analysis, but like Eija mentioned, setting one scaling factor for all the cells is actually a bit problematic: for example, SCTransform (which we also have in Chipster) avoids that.

In this video, Heli Pessa from the University of Helsinki nicely discusses the difference between bulk and scRNAseq normalisation: https://youtu.be/gbIks6bA8nI?t=120 as well as the differences between different normalisation methods.

In case you are interested in diving deeper into this subject, the Seurat developers discuss these issues in this article about SCTransform: https://www.biorxiv.org/content/10.1101/576827v2.full.pdf

Here’s a nice blog post discussing this very issue too:
https://towardsdatascience.com/normalizing-single-cell-rna-sequencing-data-pitfalls-and-recommendations-19d0cb4fc43d


It’s somewhat understandable that, because there’s a lot of technical (and other) variance and noise between the measurements of different cells, we need to somehow normalise between the cells. Maybe the first idea is: “OK, let’s just divide the counts by the total number of counts” - that way, the highest expressor gets the highest “score” and so on. By multiplying by the scale factor (10 000), we get back from nasty decimals (“0.0005”) to nicer, more human-readable count-equivalents (“5”), and we also make sure our log values are positive. So now the issue of different total counts is taken care of, right?
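As a tiny worked example of that (numbers invented): a gene with 5 counts in a cell with 10 000 total counts gives a fraction of 0.0005; multiplying by the scale factor brings it back to a count-like 5, and the log of the scaled value stays positive:

```python
import numpy as np

# Made-up numbers: 5 counts for one gene, 10 000 total counts in the cell.
fraction = 5 / 10_000        # 0.0005 -- the "nasty decimal"
scaled = fraction * 10_000   # back to a count-equivalent of 5

print(np.log(fraction))      # about -7.6: without scaling, the log goes negative
print(np.log1p(scaled))      # about 1.79: after scaling, the log value is positive
```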

Well, not quite! If, for example, some of the cells have just a couple of really strong expressors (= plenty of mRNAs created -> big counts for that gene, for example 1000 counts), those genes are responsible for most of the counts in that cell. If they are differently expressed in another cell (very little expression -> small counts, for example 5 counts), it will seem that there’s a big change there, and important, statistically significant changes in not-so-strong expressors (for example, from 5 to 1) might go unnoticed, as the numbers will be skewed by the high expressors. (In bulk RNA-seq, this is sometimes called “composition bias”.) This is addressed by “stabilizing the variation” (= making variation similar across different orders of magnitude), and taking the logarithm is one way (the simplest) to do this.
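To make that concrete, here is a hypothetical three-gene, two-cell example (all numbers invented) showing how one strong expressor skews everything else after per-cell scaling:

```python
import numpy as np

# Invented counts: rows = genes, columns = two cells.
counts = np.array([
    [1000,  5],   # gene A: strong expressor in cell 1, weak in cell 2
    [   5,  1],   # gene B: a real drop we would like to detect
    [  45, 44],   # gene C: essentially unchanged
], dtype=float)

totals = counts.sum(axis=0)            # 1050 vs 50 -- dominated by gene A
normalized = counts / totals * 10_000

print(normalized)
# gene B: ~48 vs 200  -- its real 5 -> 1 drop now looks like an increase
# gene C: ~429 vs 8800 -- unchanged raw counts look wildly different
```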

(It won’t be a normal distribution, but rather a Poisson or negative binomial, I think. Still, I think you have the right idea there, and that term is often (mis-?)used in this situation: these distributions make things comparable and allow you to model the data.)

Best regards,
Maria L