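Differential expression analysis is the most fundamental step in transcriptome work, and edgeR + limma is currently one of the most widely recommended ways to do it. Using example data, this post walks through the whole process so that you understand the why, what, and how of the limma package.
Example data for this post: download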

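What is limma?

First, note that every differential expression method is at its core a linear model or a generalized linear model. limma takes the linear-model route: it fits a linear equation to the expression of each gene. ANOVA and linear regression are both special cases of this framework.
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$
limma fits an equation of this form for each gene, where:
$X$ can be:

a continuous variable, such as pH, RIN, age, weight, or height
a categorical variable, such as sex, ethnicity, or high versus low expression of a gene relative to its median
$\beta$ are the coefficients that limma estimates
$\epsilon$ is the residual, assumed to be normally distributed across the data set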
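
To make the equation concrete, here is a minimal base R sketch for a single gene with one categorical variable. The data are made up for illustration, and `lm` stands in for the per-gene fit; limma itself adds precision weights and empirical Bayes moderation on top of this idea.

```r
# Hypothetical toy data: log2 expression of one gene in 6 samples
set.seed(1)
condition <- factor(rep(c("control", "treated"), each = 3))
expr <- c(rnorm(3, mean = 5, sd = 0.2), rnorm(3, mean = 7, sd = 0.2))

# Y = beta0 + beta1 * X + epsilon, with X = 0 for "control" and X = 1 for "treated"
fit <- lm(expr ~ condition)
beta <- coef(fit)

# beta0 is the control mean; beta1 is the treated-minus-control difference
mean_control <- mean(expr[condition == "control"])
mean_treated <- mean(expr[condition == "treated"])
print(beta)
print(c(mean_control, mean_treated - mean_control))
```

With a two-level factor, the slope is simply the difference between the two group means, which is the quantity reported as a log fold change when the expression values are on the log2 scale.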

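Data description

The example data have two factors, both categorical:

Cultivar: cultivar (C, I5, I8)
Time: time (6, 9)
Columns are samples and rows are genes, the same layout as a typical count matrix.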
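
The sketch below shows how the two factors can be recovered from sample names and combined into one grouping factor. The sample names here are hypothetical, inferred from the `substr` parsing used in the data preparation step (cultivar, then one digit for time, then one digit for replicate); check them against the real column names of your file.

```r
# Hypothetical sample names: cultivar + time digit + replicate digit
snames <- c("C61", "C62", "C91", "C92", "I561", "I562", "I591", "I592")

# Everything except the last two characters is the cultivar
cultivar <- substr(snames, 1, nchar(snames) - 2)
# The second-to-last character is the time point
time <- substr(snames, nchar(snames) - 1, nchar(snames) - 1)

# One combined factor with a level per cultivar-by-time cell
group <- interaction(cultivar, time)

# A no-intercept design matrix has one column per group
mm <- model.matrix(~0 + group)
print(colnames(mm))
```

The column names of this design matrix (such as `groupI5.6`) are the names that contrasts refer to later in the analysis.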

Running the analysis

1. Prepare the data

library(edgeR)  # loading edgeR also loads limma
counts <- read.delim("all_counts.txt", row.names = 1)
head(counts)

d0 <- DGEList(counts)
# Note: calcNormFactors does not normalize the data; it only computes normalization factors
d0 <- calcNormFactors(d0)
d0

# Filter low-expressed genes: keep genes whose maximum CPM reaches the cutoff
# (a logical index stays correct even when no gene falls below the cutoff)
cutoff <- 1
keep <- apply(cpm(d0), 1, max) >= cutoff
d <- d0[keep, ]
dim(d)  # number of genes left

# Sample names
snames <- colnames(counts)
snames
# The data have two factors: cultivar (C, I5, I8) and time (6, 9)
cultivar <- substr(snames, 1, nchar(snames) - 2)
time <- substr(snames, nchar(snames) - 1, nchar(snames) - 1)
cultivar
time

# Create a new variable "group" that combines cultivar and time
group <- interaction(cultivar, time)
group

# Multidimensional scaling (MDS) plot
plotMDS(d, col = as.numeric(group))

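We first build an edgeR DGEList object, which voom later converts into a limma EList object. We then compute normalization factors, filter out low-expressed genes, derive the grouping factors from the column names, and draw an initial MDS plot to get a rough view of how the samples are distributed.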
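
One detail of the low-expression filter is worth checking on a toy matrix: removing rows by negative indexing with the result of `which()` silently removes every row when no gene falls below the cutoff, whereas a logical index does not have this problem. The matrix below is made up for illustration.

```r
# Hypothetical CPM-like matrix: 3 genes x 2 samples, all well above the cutoff
expr <- matrix(c(10, 20, 30, 15, 25, 35), nrow = 3,
               dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
cutoff <- 1

# Negative indexing: which() returns integer(0) when nothing is below the cutoff,
# and x[-integer(0), ] selects zero rows instead of all rows
drop <- which(apply(expr, 1, max) < cutoff)
n_after_negative_index <- nrow(expr[-drop, , drop = FALSE])

# Logical indexing keeps every gene that passes, whether or not any gene fails
keep <- apply(expr, 1, max) >= cutoff
n_after_logical_index <- nrow(expr[keep, , drop = FALSE])

print(c(negative = n_after_negative_index, logical = n_after_logical_index))
```

For real analyses, edgeR also provides `filterByExpr`, which chooses a filter based on the design; the fixed CPM cutoff used in this post is a simpler alternative.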

[Figure: MDS plot of the samples, colored by group]

2. limma

mm <- model.matrix(~0 + group)
par(mfrow = c(1, 3))
y <- voom(d, mm, plot = TRUE)
# The voom trend should be smooth; compare it with the plot before filtering low-expressed genes:
invisible(voom(d0, mm, plot = TRUE))
# lmFit fits a linear model using weighted least squares for each gene:
fit <- lmFit(y, mm)
head(coef(fit))

# Comparisons between groups (log fold-changes) are obtained as contrasts of these fitted linear models:
# Comparison between times 6 and 9 for cultivar I5
# makeContrasts simply defines which groups are compared
contr <- makeContrasts(groupI5.9 - groupI5.6, levels = colnames(coef(fit)))
# Estimate the contrast for each gene
tmp <- contrasts.fit(fit, contr)
# Empirical Bayes smoothing of standard errors (shrinks standard errors that are much
# larger or smaller than those from other genes towards the average standard error)
# (see https://www.degruyter.com/doi/10.2202/1544-6115.1027)
tmp <- eBayes(tmp)
# plotSA plots the log2 residual standard deviation against the mean log-CPM;
# the horizontal blue line marks the average log2 residual standard deviation
plotSA(tmp, main = "Final model: Mean-variance trend")
# topTable lists the genes ranked by evidence for differential expression
top.table <- topTable(tmp, sort.by = "P", n = Inf)
# logFC: log2 fold change of I5.9/I5.6
# AveExpr: Average expression across all samples, in log2 CPM
# t: logFC divided by its standard error
# P.Value: Raw p-value (based on t) from test that logFC differs from 0
# adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
# B: log-odds that gene is DE (arguably less useful than the other columns)
head(top.table, 20)

# How many genes have an adjusted p-value < 0.05?
length(which(top.table$adj.P.Val < 0.05))

# Write top.table to a file
top.table$Gene <- rownames(top.table)
top.table <- top.table[, c("Gene", names(top.table)[1:6])]
write.table(top.table, file = "time9_v_time6_I5.txt", row.names = FALSE, sep = "\t", quote = FALSE)

The core limma steps are voom, lmFit, and eBayes; the code comments explain each of them. Finally, topTable outputs the results sorted by p-value.
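
The `adj.P.Val` column is the Benjamini-Hochberg adjustment of the raw p-values. A small base R sketch with made-up p-values shows why counting on the adjusted column usually gives fewer genes than counting on the raw one:

```r
# Hypothetical raw p-values for five genes, already sorted
p_raw <- c(0.001, 0.012, 0.04, 0.2, 0.5)

# Benjamini-Hochberg adjustment, the method named in the adj.P.Val comment
p_adj <- p.adjust(p_raw, method = "BH")

n_raw <- sum(p_raw < 0.05)  # genes passing on raw p-values
n_adj <- sum(p_adj < 0.05)  # genes passing after FDR control
print(round(p_adj, 4))
print(c(raw = n_raw, adjusted = n_adj))
```

Here three genes pass the raw threshold but only two remain after adjustment, which is why the gene count in the analysis is taken from `adj.P.Val`.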
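The figure below was drawn to illustrate this, although the panels ended up out of order: the left panel is the voom plot after filtering low-expressed genes, the middle panel is the voom plot before filtering, and the right panel is the plotSA output after the final fit. The difference is clearly visible.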

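[Figure: voom mean-variance trend after filtering (left) and before filtering (middle); plotSA of the final model (right)]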

At this point the differential expression analysis is complete. Simple, isn't it?
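
To try the voom, lmFit, contrasts.fit, eBayes, topTable sequence without the example file, here is a minimal sketch on simulated counts. Everything about the simulation (gene count, group labels, negative binomial parameters, the 4-fold spike-in) is an assumption made for illustration, and passing a plain count matrix to `voom` skips the edgeR normalization factors used in the main analysis.

```r
library(limma)

set.seed(42)
n_genes <- 200
group <- factor(rep(c("A", "B"), each = 4))

# Simulated counts: negative binomial noise around a common mean
counts <- matrix(rnbinom(n_genes * 8, mu = 100, size = 10), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:8)))
# Spike-in: make the first 10 genes about 4-fold higher in group B
counts[1:10, group == "B"] <- counts[1:10, group == "B"] * 4

# One design column per group, then the same steps as in the main analysis
design <- model.matrix(~0 + group)
y <- voom(counts, design, plot = FALSE)
fit <- lmFit(y, design)
contr <- makeContrasts(groupB - groupA, levels = colnames(design))
fit2 <- eBayes(contrasts.fit(fit, contr))
tt <- topTable(fit2, sort.by = "P", number = Inf)

head(tt, 10)
# Average estimated log2 fold change of the spiked-in genes
spiked_logfc <- mean(tt[paste0("gene", 1:10), "logFC"])
print(spiked_logfc)
```

Because the truth is known here, the top of the table can be compared against the spiked-in genes, which is a quick sanity check before moving to real data.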

Two-factor, multi-factor, and continuous-variable analysis with limma

################### Two-factor analysis (cultivar * time) ########################
mm <- model.matrix(~cultivar*time)
y <- voom(d, mm, plot = FALSE)
fit <- lmFit(y, mm)
head(coef(fit))
# (Intercept)  cultivarI5 cultivarI8      time9 cultivarI5:time9 cultivarI8:time9
# AT1G01010    4.837410  0.53644370  0.2279446 0.20580445      -0.05565729       0.09265044
# AT1G01020    3.530869 -0.03152318 -0.3180096 0.15875297       0.06289715       0.36468449
# AT1G01030    1.250817 -0.32143420  0.3084243 0.03477863      -0.48099113      -0.37842909
# AT1G01040    5.676015  0.27097286  0.1028739 0.50635951      -0.58923660      -0.46975071
# AT1G01050    6.598712 -0.09734846 -0.1347759 0.02052702       0.23139851       0.22730960
# AT1G01060    7.807988 -0.34550979 -0.4172467 1.15805850      -0.34989810      -0.17267051
# This table shows the fitted regression coefficients
# cultivarI5: the difference in mean expression between cultivar I5 and cultivar C (the reference cultivar), at time 6 (the reference level for time)
# time9: the difference in mean expression between time 9 and time 6, for cultivar C
# cultivarI5:time9: the difference between times 9 and 6 of the differences between cultivars I5 and C (interaction effect)

# Next, use the coef argument of contrasts.fit to choose which coefficient to test
# Let's estimate the difference between cultivars I5 and C at time 6
tmp <- contrasts.fit(fit, coef = 2)  # the difference in mean expression between cultivar I5 and the reference cultivar (cultivar C), for time 6 (the reference level for time)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)


tmp <- contrasts.fit(fit, coef = 5)  # Test cultivarI5:time9
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)

#################### Multi-factor analysis ########################
# To make things a little more complicated, add batch information
batch <- factor(rep(rep(1:2, each = 2), 6))
# Only the model matrix needs to be redefined; everything else stays the same
mm <- model.matrix(~0 + group + batch)
y <- voom(d, mm, plot = FALSE)
fit <- lmFit(y, mm)
contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
tmp <- contrasts.fit(fit, contr)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)


# Add a continuous variable
# Generate example RIN data
set.seed(99)
RIN <- rnorm(n = 24, mean = 7.5, sd = 1)
RIN
mm <- model.matrix(~0 + group + RIN)
y <- voom(d, mm, plot = FALSE)
fit <- lmFit(y, mm)
contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
tmp <- contrasts.fit(fit, contr)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)


# What if we want to look at the association of gene expression with a continuous variable like pH?
# Generate example pH data
set.seed(99)
pH <- rnorm(n = 24, mean = 8, sd = 1.5)
pH
mm <- model.matrix(~pH)
head(mm)
y <- voom(d, mm, plot = FALSE)
fit <- lmFit(y, mm)
tmp <- contrasts.fit(fit, coef = 2)  # test the "pH" coefficient (the slope)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)

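Above, we added an extra two-level factor (batch) and continuous variables (RIN, pH) to the limma analysis. In each case only the model matrix changes, and the remaining steps stay the same.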
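
The interpretation of the `~cultivar*time` coefficients given in the code comments can be verified on toy data with base R. The expression values below are made up, only two cultivars are used to keep the table small, and `lm` stands in for the per-gene fit.

```r
# Hypothetical log2 expression of one gene: 2 cultivars x 2 times x 2 replicates
cultivar <- factor(rep(c("C", "I5"), each = 4))
time <- factor(rep(rep(c("6", "9"), each = 2), 2))
expr <- c(5.0, 5.2, 6.0, 6.2, 5.5, 5.7, 7.5, 7.7)

fit <- lm(expr ~ cultivar * time)
b <- coef(fit)

# Mean expression in each cultivar-by-time cell
cell <- tapply(expr, list(cultivar, time), mean)

# Intercept: mean of the reference cell (cultivar C, time 6)
# cultivarI5: I5 minus C at time 6
# time9: time 9 minus time 6 for cultivar C
# cultivarI5:time9: (time effect in I5) minus (time effect in C)
interaction_by_hand <- (cell["I5", "9"] - cell["I5", "6"]) -
  (cell["C", "9"] - cell["C", "6"])
print(b)
print(interaction_by_hand)
```

The main-effect coefficients are differences relative to the reference cell, so with an interaction in the model they describe the effect at the reference level of the other factor only.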
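That covers the limma workflow. The code comments explain each step, and the same pipeline can be adapted for differential expression analysis of other transcriptome data sets.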
