Q&A Forum

normalization

by Deleted user -
Number of replies: 2

Hi,

I do not fully understand why bulk RNA-seq normalization methods do not work on single-cell RNA-seq. I would argue that bulk data also has a huge proportion of genes that are not expressed - why not just filter out the genes that are not expressed at all, or that are expressed in fewer than, say, 10 cells (or less than 5% of the investigated cells)?

Second question: on slide 27 you mention that we multiply by a scale factor of 10,000 - why? And later on we take the log10... (this one is for getting a normal distribution, I guess).

Regards,

Zsuzsanna

In reply to Deleted user

Re: normalization

by Eija Korpelainen -
Hi Zsuzsanna,

Just a quick note to say that you can and should filter out genes which are not expressed in any cell, or only in a handful of cells. You can do this with the tool "Seurat v3 -Setup and QC".
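To make the filtering idea concrete, here is a minimal sketch on a toy counts matrix (plain NumPy, not the actual Chipster/Seurat implementation; the matrix and the threshold are made up for illustration):

```python
import numpy as np

# Hypothetical toy counts matrix: rows = cells, columns = genes.
counts = np.array([
    [0, 5, 0, 2],
    [0, 3, 1, 0],
    [0, 4, 0, 1],
])

min_cells = 2  # keep genes detected in at least this many cells

# A gene counts as "detected" in a cell if its count is > 0.
cells_per_gene = (counts > 0).sum(axis=0)   # [0, 3, 1, 2]
keep = cells_per_gene >= min_cells
filtered = counts[:, keep]                  # drops genes 0 and 2

print(filtered.shape)  # (3, 2): 3 cells, 2 genes kept
```

Seurat's setup step does essentially this with its `min.cells` parameter.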

Regarding normalization, bulk RNA-seq data and scRNA-seq data are very different data types. The latter is zero-inflated, because there are so many genes whose expression is not detected by the technology. Furthermore, the number of UMIs varies hugely between cells, so even global scaling normalization struggles to deal with very highly expressed genes, and we need the more sophisticated sctransform method.

Best,
Eija
In reply to Deleted user

Re: normalization

by Maria Lehtivaara -
Hi Zsuzsanna!

Super good questions!

Eija replied to the first part, so I'll try to answer the second :)

The “scale factor” (default 10 000) is an estimate of the number of molecules present in a single cell.

In the normalization procedure described in the course (and used as the default in the Seurat package, and thus in Chipster), the counts are first divided by the total count of that cell, then multiplied by the scale factor (10 000), and then natural-log transformed.
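As a sketch, the three steps can be written out in a few lines (plain NumPy rather than Seurat itself; the toy counts are made up). Notice that two cells with the same gene proportions but a 10x difference in sequencing depth end up with identical normalized values:

```python
import numpy as np

# Toy counts: two cells with the same gene proportions
# but very different sequencing depth.
counts = np.array([
    [10.0, 90.0],    # cell 1: 100 total counts
    [100.0, 900.0],  # cell 2: 1000 total counts
])
scale_factor = 10_000

totals = counts.sum(axis=1, keepdims=True)             # per-cell total counts
normalized = np.log1p(counts / totals * scale_factor)  # divide, scale, ln(1 + x)

# Both rows are now identical: the depth difference is gone,
# only the relative proportions remain.
print(normalized)
```

(Seurat uses ln(1 + x) rather than a plain ln, so that zero counts stay at zero instead of becoming minus infinity.)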

Why so many steps? You really hit the spot there: normalization has been one of the most debated steps since the days of microarrays, throughout the bulk RNA-seq era, and now in scRNA-seq as well :)

This method used in Seurat is just one way of doing the normalisation - there are many others (in fact, even inside Seurat you can perform the normalization step in 3 different ways, of which this is just one! See: https://rdrr.io/cran/Seurat/man/NormalizeData.html). Identifying the size factors in some way is a very common approach, as it is familiar from bulk RNA-seq analysis, but like Eija mentioned, using one scaling factor for all the cells is actually a bit problematic: SCTransform (which we also have in Chipster), for example, avoids that.

In this video, Heli Pessa from the University of Helsinki nicely discusses the difference between bulk and scRNAseq normalisation, as well as the differences between normalisation methods: https://youtu.be/gbIks6bA8nI?t=120

In case you are interested in diving deeper into this subject, the Seurat developers discuss these issues in this article about SCTransform: https://www.biorxiv.org/content/10.1101/576827v2.full.pdf

Here’s a nice blog post discussing this very issue too:
https://towardsdatascience.com/normalizing-single-cell-rna-sequencing-data-pitfalls-and-recommendations-19d0cb4fc43d


It’s understandable that, because there’s a lot of technical (and other) variance and noise between the measurements of different cells, we need to somehow normalise between the cells. Maybe the first idea is: “OK, let’s just divide the counts by the total number of counts” - that way, the highest expressor gets the highest “score”, and so on. By multiplying with the scale factor (10 000), we get back from nasty decimals (“0.0005”) to nicer, more human-understandable count-equivalents (“5”), and we also make sure our log values are positive. So now the issue of different total counts is taken care of, right?

Well, not quite! If, for example, some of the cells have just a couple of really strong expressors (= plenty of mRNAs created -> big counts for that gene, for example 1000 counts), those genes are then responsible for most of the counts in that cell. If they are differentially expressed in another cell (very little expression -> small counts, for example 5 counts), it will seem that there’s a big change there, while important and statistically significant changes in not-so-strong expressors (for example, from 5 to 1) might go unnoticed, as the numbers will be skewed by the high expressors. (In bulk RNA-seq, this is sometimes called “composition bias”.) This is taken care of by “stabilizing the variance” (= making the variation similar across different orders of magnitude), and taking the logarithm is one way - the simplest - to do this.
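A tiny numerical illustration of this, using the made-up counts from the paragraph above: on the raw scale the strong expressor's change (1000 -> 5) completely dominates the weak expressor's (5 -> 1), but after the log transform the two changes are of comparable size:

```python
import numpy as np

high = np.array([1000.0, 5.0])  # strong expressor: 1000 -> 5 counts
low = np.array([5.0, 1.0])      # weak expressor: 5 -> 1 counts

# Raw-scale differences: 995 vs 4 - the weak gene's change looks negligible.
raw_diff_high = high[0] - high[1]   # 995.0
raw_diff_low = low[0] - low[1]      # 4.0

# Log-scale (ln(1 + x)) differences: roughly 5.1 vs 1.1 - now comparable.
log_diff_high = np.log1p(high[0]) - np.log1p(high[1])
log_diff_low = np.log1p(low[0]) - np.log1p(low[1])
```

So after the transform, a meaningful change in a weakly expressed gene is no longer drowned out by the sheer magnitude of the strongly expressed ones.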

(Strictly speaking, the result won’t be a normal distribution - raw counts are usually modeled as Poisson or negative binomial - but I think you have the right idea, and the term is often (mis-?)used in this situation: the transformation makes values comparable across orders of magnitude and easier to model.)

Best regards,
Maria L