“Normalization” refers to the standardization of sequencing data on the basis of sequencing depth and gene length. Some versions of the Assembly Options screen allow you to specify a data normalization method, or to select None, in which case data will not be normalized. Some methods are described in the table below, while DESeq2 and edgeR are discussed below the table.
Normalization method | Description |
---|---|
zRPKM | zRPKM (Krumm et al., 2012) is available for the CNV workflow and is calculated as: $\mathrm{zRPKM} = (\mathrm{RPKM}_{\text{exon, sample}} - \mathrm{Median}_{\text{exon}}) / \mathrm{StdDev}_{\text{exon}}$. This is the optimal normalization method for CNV projects with at least three groups/projects. Otherwise, we recommend using the RPKM-CN normalization method. |
RPKM-CN | RPKM-CN (Krumm et al., 2012) is available for CNV experiments and is calculated as: RPKM-CN = RPKM / median of the exon’s RPKMs; where RPKM > 1. RPKM-CN calculates the copy number by taking the ratio of the RPKM of an exon versus the median RPKM of any exon in the experiment. The final number is a ratio (or log ratio) indicating a relative copy number with no units, since the units cancel out in the ratio. The variable M is a constant: the number of millions of mapped reads in the experiment. The meaning of the ratio therefore comes from the differing read counts (“R”) and lengths (“K”) of each exon and of the median. The constant, M, drops out of the equation and only affects scaling for the initial filtering-out of low-coverage exons. We recommend using RPKM-CN only if you don’t have enough samples to provide a good standard deviation for each exon when using the zRPKM normalization method. Otherwise, zRPKM is the preferred method for the CNV workflow. |
Quantile | Quantile is available for RNA-Seq workflows only. Quantile normalization adjusts all of the values in your project so that the distribution is the same across all of the experiments. |
RPM | RPM (reads assigned per million mapped reads) is available for RNA-Seq, ChIP-Seq, and miRNA experiments, and is the only normalization method available for ChIP-Seq and miRNA experiments. When RPM is selected, the signal values for each experiment will be divided by the total number of mapped reads divided by one million. |
RPKM | RPKM (reads assigned per kilobase of target per million mapped reads) is available for RNA-Seq data. When RPKM is selected, the signal values for each experiment will be divided by the total bases of target sequence divided by one thousand; and the resulting number divided by the total number of mapped reads divided by one million. |
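
The RPM, RPKM, and zRPKM calculations described in the table can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the code used by the software. The function names are hypothetical, and the zRPKM helper assumes that the median and standard deviation are taken over one exon's RPKM values across all samples, and that the sample standard deviation is intended.

```python
from statistics import median, stdev


def rpm(read_count, total_mapped_reads):
    """Reads assigned per million mapped reads."""
    return read_count / (total_mapped_reads / 1_000_000)


def rpkm(read_count, target_length_bp, total_mapped_reads):
    """Reads assigned per kilobase of target per million mapped reads."""
    per_kilobase = read_count / (target_length_bp / 1_000)
    return per_kilobase / (total_mapped_reads / 1_000_000)


def zrpkm(exon_rpkms):
    """z-score of one exon's RPKM in each sample.

    exon_rpkms holds the RPKM of a single exon across samples.
    Assumption: median and sample standard deviation are computed
    across samples for that exon.
    """
    exon_median = median(exon_rpkms)
    exon_sd = stdev(exon_rpkms)
    return [(value - exon_median) / exon_sd for value in exon_rpkms]


# 500 reads on a 2,000 bp target in an experiment with 20 million mapped reads
print(rpm(500, 20_000_000))         # 25.0
print(rpkm(500, 2_000, 20_000_000)) # 12.5
print(zrpkm([10.0, 12.0, 14.0]))    # [-1.0, 0.0, 1.0]
```

Because zRPKM divides by a per-exon standard deviation, it needs several samples to be meaningful, which is consistent with the recommendation above to fall back to RPKM-CN when too few samples are available.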
DESeq2 and edgeR:
DESeq2 (Love et al. 2014) and edgeR (Robinson et al. 2010) are statistical packages in Bioconductor used to assess differential expression in RNA-Seq experiments.
DESeq2 or edgeR statistics for an assembly can be analyzed by opening the assembly in ArrayStar. For information about setting up an assembly suitable for analyzing DESeq2 or edgeR statistics in ArrayStar, see Create an assembly using DESeq2 or edgeR statistics.
Both methods require a control group to be specified, and both require replicate samples for each experimental condition and for the control. Note that when multiple experimental conditions are being considered, the same control group is used for multiple tests. The original P-values from the statistical tests are then adjusted using the Benjamini-Hochberg (1995) procedure.
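
For readers unfamiliar with the Benjamini-Hochberg adjustment, the following minimal Python sketch shows the standard step-up calculation on a list of P-values. The function name is hypothetical, and the sketch illustrates the published procedure rather than the software's own implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw P-value is multiplied by the number of tests and divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw P-values increase.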
Differences between DESeq2 and edgeR are shown in the table below:
Calculation | DESeq2 | edgeR |
---|---|---|
Normalization method | Uses a median of ratios method to normalize read counts to account for sequencing depth and RNA composition. Provides two methods: regularized logarithm (rlog) and variance stabilizing transformation (VST). DESeq2 does not attempt to account for transcript length, since it compares counts between samples for the same gene and assumes the length does not change. This assumption holds true except in rare cases where the dominant transcript length changes between samples, for example due to alternative splicing. | Uses "trimmed mean of M-values" (TMM) (Robinson & Oshlack, 2010). The TMM normalized read count can be viewed in the ArrayStar tables, where counts are represented as log2(counts-per-million-reads). Normalized counts generated by a different method, RLE, are also available within ArrayStar, but these values are not used for the actual statistical tests. RLE is similar to the rlog normalization method used by DESeq2. |
Statistical tests for differential expression | DESeq2 uses raw counts, rather than normalized count data, and models the normalization to fit the counts within a Generalized Linear Model (GLM) of the negative binomial family with a logarithmic link. Statistical tests are then performed to assess differential expression, if any. | Data are normalized to account for sample size differences and variance among samples. The normalized count data are used to estimate per-gene fold changes and to perform statistical tests of whether each gene is likely to be differentially expressed. edgeR uses an exact test under a negative binomial distribution (Robinson and Smyth, 2008). The statistical test is related to Fisher’s exact test, though Fisher’s test uses a different distribution. |
Data reporting method | In ArrayStar, the rlog values are used by default in the scatter plot and for clustering. VST values are displayed as Gene Table data columns. | In ArrayStar, the log2(CPM) values calculated using TMM are used by default in the scatter plot. In the Gene Table, values for fold change compared to the control are represented as log(fold change). |
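
As a rough illustration of the "median of ratios" idea mentioned in the table, the Python sketch below computes a per-sample size factor from a small matrix of raw counts. It is a simplified sketch, not the DESeq2 implementation: the function name is hypothetical, and it assumes that genes with a zero count in any sample are skipped when forming the reference.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """counts: one list per gene, each holding raw counts per sample.

    Returns one size factor per sample.
    Assumption: genes with a zero count in any sample are excluded.
    """
    n_samples = len(counts[0])
    kept_rows = [row for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples (the reference)
    log_geo_means = [
        sum(math.log(c) for c in row) / n_samples for row in kept_rows
    ]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - ref
            for row, ref in zip(kept_rows, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Sample 2 was sequenced twice as deeply as sample 1
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = median_of_ratios_size_factors(counts)
print(factors[1] / factors[0])  # approximately 2
```

Dividing each sample's raw counts by its size factor puts the samples on a comparable scale. Using the median of the per-gene ratios, rather than the total read count, makes the factor less sensitive to a handful of very highly expressed genes.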