The gold standard for transcriptome differential analysis: Limma-voom in practice

Written by 劉小澤 (Liu Xiaoze), 19.3.11

Limma, the "gold standard" for differential analysis, was originally applied to microarray data; voom was added to support RNA-Seq analysis. Let's explore what limma can do in detail.

The test data for this post can be obtained by replying "voom" to our official account (公眾號)

Limma-voom is powerful in three respects:

Low false discovery rate (accuracy), with little influence from outliers
Good control of false positives
Fast computation

Setup

> library(edgeR)
Loading required package: limma
> counts <- read.delim("all_counts.txt", row.names = 1)
> head(counts[1:3,1:3])
          C61 C62 C63
AT1G01010 341 371 275
AT1G01020 164  94 176
AT1G03987   0   0   0
> dim(counts)
[1] 32833    24
# Build a DGEList object that holds the counts and the sample information
> d0 <- DGEList(counts)

Preprocessing

> # Compute normalization factors
> d0 <- calcNormFactors(d0)
> d0

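Note that calcNormFactors does not normalize the data itself; it only computes scaling factors that are used for normalization downstream

# Filter out low-expressed genes [set the threshold to suit your own data]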
cutoff <- 1
cut <- which(apply(cpm(d0), 1, max) < cutoff)
d <- d0[-cut,] 
dim(d) 
[1] 21081    24
# 21081 genes remain
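As a rough illustration of what the filter above does, the following base-R sketch computes CPM by hand on a toy matrix and keeps the genes whose maximum CPM reaches the cutoff. The first three rows are copied from the head(counts) output above; the fourth gene (geneX), the library sizes, and the toy_* names are made up for illustration. edgeR's cpm() also scales the library sizes by the normalization factors, which are assumed to be 1 here.

```r
# Toy counts: the first three rows come from the head(counts) output,
# the fourth row (geneX) is hypothetical.
toy_counts <- matrix(
  c(341, 371, 275,
    164,  94, 176,
      0,   0,   0,
      1,   0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("AT1G01010", "AT1G01020", "AT1G03987", "geneX"),
                  c("C61", "C62", "C63"))
)

# Hypothetical library sizes (total reads per sample)
lib_size <- c(C61 = 2.0e7, C62 = 2.2e7, C63 = 1.8e7)

# Counts per million: divide each column by its library size, times 1e6
toy_cpm <- t(t(toy_counts) / lib_size) * 1e6

# Keep genes whose maximum CPM across samples reaches the cutoff
cutoff <- 1
keep <- apply(toy_cpm, 1, max) >= cutoff
toy_filtered <- toy_counts[keep, , drop = FALSE]
print(rownames(toy_filtered))
```

Indexing with a logical vector rather than `-which(...)` has one practical advantage: if no gene falls below the cutoff, `which()` returns an empty vector and negative indexing with it would drop every row.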

Next, extract the sample information (sample names) from the column names

> spname <- colnames(counts) 
> spname
 [1] "C61"  "C62"  "C63"  "C64"  "C91"  "C92"  "C93" 
 [8] "C94"  "I561" "I562" "I563" "I564" "I591" "I592"
[15] "I593" "I594" "I861" "I862" "I863" "I864" "I891"
[22] "I892" "I893" "I894"

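The samples are classified by two factors (strain C/I5/I8 and time 6/9), and the four biological replicates are numbered at the end of each name: C/I5/I8 | 6/9 | 1/2/3/4

> # Split out the grouping information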
> strain <- substr(spname, 1, nchar(spname) - 2)
> time <- substr(spname, nchar(spname) - 1, nchar(spname) - 1)
> strain
 [1] "C"  "C"  "C"  "C"  "C"  "C"  "C"  "C"  "I5" "I5"
[11] "I5" "I5" "I5" "I5" "I5" "I5" "I8" "I8" "I8" "I8"
[21] "I8" "I8" "I8" "I8"
> time
 [1] "6" "6" "6" "6" "9" "9" "9" "9" "6" "6" "6" "6"
[13] "9" "9" "9" "9" "6" "6" "6" "6" "9" "9" "9" "9"

Then combine the two parts into the group information

> # Combine the two parts into group [this produces a factor]
> group <- interaction(strain, time)
> group
 [1] C.6  C.6  C.6  C.6  C.9  C.9  C.9  C.9  I5.6 I5.6
[11] I5.6 I5.6 I5.9 I5.9 I5.9 I5.9 I8.6 I8.6 I8.6 I8.6
[21] I8.9 I8.9 I8.9 I8.9
Levels: C.6 I5.6 I8.6 C.9 I5.9 I8.9

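Of course, you can also type the groups in by hand or import them from another file, but note one thing: the group metadata must be in the same order as the column names of counts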
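When the metadata comes from a separate file, its row order may not match the count matrix. A minimal sketch of how the alignment could be checked and fixed in base R; the meta data frame and its column names are hypothetical, and only six of the sample names are used to keep it short:

```r
# Column names of the count matrix (a subset of the real sample names)
spname <- c("C61", "C62", "I561", "I562", "I891", "I892")

# Hypothetical metadata table, deliberately in a different order
meta <- data.frame(
  sample = c("I891", "C61", "I562", "C62", "I892", "I561"),
  group  = c("I8.9", "C.6", "I5.6", "C.6", "I8.9", "I5.6"),
  stringsAsFactors = FALSE
)

# Every sample must be described exactly once
stopifnot(setequal(meta$sample, spname), !anyDuplicated(meta$sample))

# Reorder the metadata rows to follow the count matrix columns
meta <- meta[match(spname, meta$sample), ]
stopifnot(identical(meta$sample, spname))

group <- factor(meta$group)
print(group)
```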

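When several experimental factors are present, run an MDS (multidimensional scaling) analysis. Before the formal differential analysis it helps us spot samples that may differ from the rest. The result places all samples along a few dimensions, and the first dimension captures the largest differences between samples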

> # Multidimensional scaling (MDS) plot
> suppressMessages(library(RColorBrewer))
> col.group <- group
> levels(col.group) <- brewer.pal(nlevels(col.group), "Set1") 
> col.group <- as.character(col.group)
> plotMDS(d, labels=group, col=col.group) 
> title(main="A. Sample groups")
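To give a feel for what an MDS plot shows, here is a sketch using classical MDS (cmdscale) on Euclidean distances between samples. This is a simpler stand-in, not the distance that plotMDS uses, and the simulated matrix, group labels, and effect size are all invented for illustration.

```r
set.seed(42)

# Simulated log-expression: 100 genes x 6 samples, two hypothetical groups
n_genes <- 100
grp <- rep(c("A", "B"), each = 3)
log_expr <- matrix(rnorm(n_genes * 6, mean = 5), nrow = n_genes,
                   dimnames = list(NULL, paste0(grp, 1:3)))

# Make the first 30 genes higher in group B so the groups differ
log_expr[1:30, grp == "B"] <- log_expr[1:30, grp == "B"] + 4

# Distances between samples (columns), then classical MDS in 2 dimensions
sample_dist <- dist(t(log_expr))
mds <- cmdscale(sample_dist, k = 2)

# Samples from the same group should cluster along dimension 1
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = colnames(log_expr),
     col = ifelse(grp == "A", "red", "blue"))
```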

Voom transformation and variance weight calculation

> mm <- model.matrix(~0 + group)
> y <- voom(d, mm, plot = T)

Good

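What transformation does voom actually perform?

First, the raw counts are converted to log2 CPM (counts per million reads), where "per million reads" is defined using the norm.factors computed earlier by calcNormFactors;

Then a linear model is fitted to the log2 CPM of each gene and the residuals are computed;

Then a smooth curve (the red line) is fitted to sqrt(residual standard deviation) as a function of average expression;

Finally, the smooth curve is used to obtain a weight for each gene in each sample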
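The four steps can be sketched in base R on simulated counts. This is a simplified illustration that follows the description in the voom paper linked below, not limma's actual code: the simulated data, the lowess span of 0.5, and all variable names are assumptions, and the normalization factors are taken to be 1.

```r
set.seed(1)

# Simulated counts: 200 genes x 6 samples, two hypothetical groups
n_genes <- 200
n_samples <- 6
gene_mean <- 2^seq(2, 10, length.out = n_genes)
toy_counts <- matrix(rnbinom(n_genes * n_samples, mu = gene_mean, size = 5),
                     nrow = n_genes)
toy_group <- factor(rep(c("a", "b"), each = 3))
design <- model.matrix(~0 + toy_group)

# Step 1: log2 CPM with small offsets (norm.factors assumed to be 1)
lib_size <- colSums(toy_counts)
log_cpm <- t(log2((t(toy_counts) + 0.5) / (lib_size + 1) * 1e6))

# Step 2: one linear model per gene, then the residual standard deviation
fit <- lm.fit(design, t(log_cpm))
resid_sd <- sqrt(colSums(fit$residuals^2) / fit$df.residual)

# Step 3: smooth sqrt(residual sd) against the average log2 count
offset <- log2(lib_size + 1) - log2(1e6)
avg_log_count <- rowMeans(log_cpm) + mean(offset)
trend <- lowess(avg_log_count, sqrt(resid_sd), f = 0.5)
trend_fun <- approxfun(trend, rule = 2, ties = mean)

# Step 4: predict sqrt(sd) at each fitted log2 count; weight = 1 / variance
fitted_log_count <- sweep(t(fit$fitted.values), 2, offset, "+")
weights <- 1 / trend_fun(fitted_log_count)^4
dim(weights) <- dim(fitted_log_count)

print(dim(weights))
```

Because the trend predicts the square root of the standard deviation, raising it to the fourth power gives a variance, so each weight is an inverse variance. Observations with low fitted counts, where the variance is higher, receive lower weights.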

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29

The plot above looks good. If it instead looks like the one below, the data need further filtering:

tmp <- voom(d0, mm, plot = T)

Bad

Sometimes we do not need to understand the complex theory behind it; we only need to know how to interpret the result:

https://stats.stackexchange.com/questions/160255/voom-mean-variance-trend-plot-how-to-interpret-the-plot

limma-voom method assumes that rows with zero or very low counts have been removed

If the curve rises sharply near 0 on the x-axis, there are many low-count genes

Whether your data are "good" or not cannot be determined from this plot

Fitting the limma linear model

> fit <- lmFit(y, mm)
> head(coef(fit),3)
          groupC.6 groupI5.6 groupI8.6 groupC.9
AT1G01010 4.685920 5.2477564  4.938939 4.922501
AT1G01020 3.420726 3.4147535  3.130644 3.571855
AT1G01030 1.111114 0.7316936  1.435521 1.157532
          groupI5.9 groupI8.9
AT1G01010  5.382619  5.246093
AT1G01020  3.610579  3.655254
AT1G01030  0.388736  1.222892

Comparisons between groups:

For example, compare 6 hours and 9 hours within strain I5

> contr <- makeContrasts(groupI5.9 - groupI5.6, levels = colnames(coef(fit)))
> contr
           Contrasts
Levels      groupI5.9 - groupI5.6
  groupC.6                      0
  groupI5.6                    -1
  groupI8.6                     0
  groupC.9                      0
  groupI5.9                     1
  groupI8.9                     0

Estimate the contrast for each gene:

> tmp <- contrasts.fit(fit, contr)
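For a design with one coefficient per group, the estimated contrast for a gene is the weighted sum of its coefficients, with the weights taken from the contrast column. A small check using the AT1G01010 row of coef(fit) and the contr column printed above (the variable names are only for illustration):

```r
# Coefficients of AT1G01010 copied from head(coef(fit), 3)
coef_row <- c(groupC.6 = 4.685920, groupI5.6 = 5.2477564, groupI8.6 = 4.938939,
              groupC.9 = 4.922501, groupI5.9 = 5.382619, groupI8.9 = 5.246093)

# The contrast column groupI5.9 - groupI5.6 copied from contr
contr_vec <- c(groupC.6 = 0, groupI5.6 = -1, groupI8.6 = 0,
               groupC.9 = 0, groupI5.9 = 1, groupI8.9 = 0)

# Contrast estimate = sum of coefficient * contrast weight
log_fc <- sum(coef_row * contr_vec[names(coef_row)])
print(log_fc)
```

The result is the log2 fold change of I5.9 relative to I5.6 for this gene, which is small, so AT1G01010 is not expected near the top of the table.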

Then apply Empirical Bayes (shrinks standard errors that are much larger or smaller than those from other genes towards the average standard error)

https://www.degruyter.com/doi/10.2202/1544-6115.1027

> tmp <- eBayes(tmp)

Which genes are differentially expressed?

> top.table <- topTable(tmp, sort.by = "P", n = Inf)
> DEG <- na.omit(top.table)
> head(DEG, 5)
              logFC  AveExpr         t      P.Value
AT5G37260  3.163518 6.939588  23.94081 1.437434e-16
AT3G02990  1.646438 3.190750  13.15656 1.610004e-11
AT2G29500 -5.288998 5.471250 -11.94053 9.584101e-11
AT3G24520  1.906690 5.780286  11.80461 1.179985e-10
AT5G65630  1.070550 7.455294  10.86740 5.208111e-10
             adj.P.Val        B
AT5G37260 3.030255e-12 26.41860
AT3G02990 1.697025e-07 15.50989
AT2G29500 6.218818e-07 14.45441
AT3G24520 6.218818e-07 14.52721
AT5G65630 2.195844e-06 13.12701

logFC: log2 fold change of I5.9/I5.6
AveExpr: Average expression across all samples, in log2 CPM
t: logFC divided by its standard error
P.Value: Raw p-value (based on t) from test that logFC differs from 0
adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
B: log-odds that gene is DE (arguably less useful than the other columns)
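The adj.P.Val column can be reproduced from raw p-values with base R's p.adjust. The sketch below uses a made-up vector of five p-values and compares p.adjust with a manual Benjamini-Hochberg calculation:

```r
# Hypothetical raw p-values
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Built-in Benjamini-Hochberg adjustment
p_bh <- p.adjust(p_raw, method = "BH")

# Manual version: p * n / rank, made monotone from the largest p downwards
n <- length(p_raw)
ord <- order(p_raw, decreasing = TRUE)
manual <- numeric(n)
manual[ord] <- pmin(1, cummin(p_raw[ord] * n / (n:1)))

print(p_bh)
print(all.equal(p_bh, manual))
```

With many thousands of genes the adjustment is much stronger than in this toy vector, which is why thresholds are applied to adj.P.Val rather than P.Value.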

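Among the most significant genes, AT5G37260 is strongly up-regulated at time9 (logFC 3.16, about 9 times its time6 level), while AT2G29500 is strongly down-regulated (logFC -5.29, about 1/39 of its time6 level)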
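Because logFC is on a log2 scale, the fold change is 2^logFC. A quick check with the two logFC values taken from the table above:

```r
# logFC values copied from head(DEG, 5)
log_fc <- c(AT5G37260 = 3.163518, AT2G29500 = -5.288998)

# Convert log2 fold change to a plain ratio (time9 / time6)
fold_change <- 2^log_fc
print(round(fold_change, 3))

# For a down-regulated gene the reciprocal is easier to read
print(round(1 / fold_change[["AT2G29500"]], 1))
```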

So how many differentially expressed genes are there in total?

Filtering with |logFC| > 2 and adjusted p-value < 0.05 as thresholds:

> length(which(DEG$adj.P.Val < 0.05 & abs(DEG$logFC)>2 ))
[1] 172
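The same filter can be split by direction to see how many genes go up and how many go down. A sketch on a toy table built from the five rows of head(DEG, 5) shown earlier (only the logFC and adj.P.Val columns are copied; toy_deg is a stand-in for the full DEG table):

```r
# Five rows copied from head(DEG, 5)
toy_deg <- data.frame(
  logFC     = c(3.163518, 1.646438, -5.288998, 1.906690, 1.070550),
  adj.P.Val = c(3.030255e-12, 1.697025e-07, 6.218818e-07,
                6.218818e-07, 2.195844e-06),
  row.names = c("AT5G37260", "AT3G02990", "AT2G29500",
                "AT3G24520", "AT5G65630")
)

# Same thresholds as above: adjusted p < 0.05 and |logFC| > 2
sig <- toy_deg[toy_deg$adj.P.Val < 0.05 & abs(toy_deg$logFC) > 2, ]

# Label each significant gene by direction of change
direction <- ifelse(sig$logFC > 0, "up", "down")
print(table(direction))
```

All five genes pass the adjusted p-value threshold, but only two of them also exceed the logFC threshold.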

To compare other groups, for example strain C and strain I5 at time6,

only the makeContrasts call needs to change

contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
tmp <- contrasts.fit(fit, contr)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
DEG <- na.omit(top.table)
head(DEG, 5)
length(which(DEG$adj.P.Val < 0.05 & abs(DEG$logFC)>2 ))
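# Only 8 genes pass the thresholds

The analysis above built the model matrix from a single factor, group. When several factors are involved, the model matrix can be built from those factors directly (which skips the earlier step of combining them into group)

# Build a new model matrix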
> mm <- model.matrix(~strain*time)
> colnames(mm)
[1] "(Intercept)"    "strainI5"       "strainI8"      
[4] "time9"          "strainI5:time9" "strainI8:time9"
> y <- voom(d, mm, plot = F)
> fit <- lmFit(y, mm)
> head(coef(fit),3)
          (Intercept)     strainI5   strainI8
AT1G01010    4.685920  0.561836365  0.2530188
AT1G01020    3.420726 -0.005972208 -0.2900818
AT1G01030    1.111114 -0.379420605  0.3244063
              time9 strainI5:time9 strainI8:time9
AT1G01010 0.2365808    -0.10171813     0.07057368
AT1G01020 0.1511295     0.04469623     0.37348052
AT1G01030 0.0464182    -0.38937581    -0.25904674

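The model takes strain C as the reference strain and time 6 as the reference time (R orders factor levels alphabetically or numerically by default)
strainI5 is the difference between strain I5 and the reference strain (C) at the reference time point (time6)
time9 is the difference between time9 and time6 in the reference strain (C)
strainI5:time9 is the difference between strain I5 and strain C in their time9 versus time6 change (the interaction effect)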
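These interpretations can be checked against the numbers already printed. The sketch below takes the group-mean coefficients of AT1G01010 from the earlier ~0 + group fit and rebuilds the coefficients of the ~strain*time fit from them (variable names are for illustration only):

```r
# Group-mean coefficients of AT1G01010 from the ~0 + group model
group_means <- c(C.6 = 4.685920, I5.6 = 5.2477564,
                 C.9 = 4.922501, I5.9 = 5.382619)

# (Intercept): the reference group, strain C at time6
intercept <- group_means[["C.6"]]

# strainI5: I5 versus C at time6
strainI5 <- group_means[["I5.6"]] - group_means[["C.6"]]

# time9: time9 versus time6 within strain C
time9 <- group_means[["C.9"]] - group_means[["C.6"]]

# strainI5:time9: the time effect in I5 minus the time effect in C
interaction_I5_time9 <-
  (group_means[["I5.9"]] - group_means[["I5.6"]]) - time9

print(c(intercept, strainI5, time9, interaction_I5_time9))
```

The values agree with the AT1G01010 row of the second coef(fit) output up to rounding. It also follows that the I5 versus C difference at time9 is strainI5 + strainI5:time9.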

To compare strain I5 and strain C at time6, we can do:

> tmp <- contrasts.fit(fit, coef = 2)
> tmp <- eBayes(tmp)
> top.table <- topTable(tmp, sort.by = "P", n = Inf)
> DEG <- na.omit(top.table)
> head(DEG, 5)
               logFC    AveExpr          t
AT4G12520 -10.254556  0.3581132 -11.402477
AT3G30720   5.817438  3.3950689  10.528934
AT5G26270   2.421030  4.3788335   9.654257
AT3G33528  -4.780814 -1.8612945  -7.454943
AT1G64795  -4.872595 -1.3119360  -7.079643
               P.Value    adj.P.Val          B
AT4G12520 2.206726e-10 4.651998e-06  3.6958152
AT3G30720 9.108689e-10 9.601014e-06  7.9963406
AT5G26270 4.101051e-09 2.881809e-05 10.8356224
AT3G33528 2.741289e-07 1.444728e-03  0.5677732
AT1G64795 5.985471e-07 2.523594e-03  1.8151705
> length(which(DEG$adj.P.Val < 0.05 & abs(DEG$logFC)>2 ))
[1] 8

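As you can see, the result matches the one from the single-factor group model

However, this approach shows its strength when interaction effects are analyzed:

For example, to find the genes whose time9 versus time6 change differs between strain I5 and strain C

> # strainI5:time9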
> tmp <- contrasts.fit(fit, coef = 5)
> tmp <- eBayes(tmp)
> top.table <- topTable(tmp, sort.by = "P", n = Inf)
> DEG <- na.omit(top.table)
> #head(DEG, 5)
> length(which(DEG$adj.P.Val < 0.05 & abs(DEG$logFC)>2 ))
[1] 111

More complex models

Sometimes an RNA-Seq analysis needs to account for batch effects

> batch <- factor(rep(rep(1:2, each = 2), 6))
> batch
 [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

When building the model, add batch at the end; everything else stays the same

> mm <- model.matrix(~0 + group + batch)
> y <- voom(d, mm, plot = F)
> fit <- lmFit(y, mm)
> contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
> tmp <- contrasts.fit(fit, contr)
> tmp <- eBayes(tmp)
> top.table <- topTable(tmp, sort.by = "P", n = Inf)
> DEG <- na.omit(top.table)
> #head(DEG, 5)
> length(which(DEG$adj.P.Val < 0.05 & abs(DEG$logFC)>2 ))
[1] 9
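Before fitting a model with a batch term, it can help to confirm that the design is estimable, i.e. that batch is not confounded with group. A base-R sketch that rebuilds the sample names from the pattern shown earlier and checks the rank of the model matrix:

```r
# Rebuild the 24 sample names: strain, time, replicate number
spname <- paste0(rep(c("C6", "C9", "I56", "I59", "I86", "I89"), each = 4), 1:4)
strain <- substr(spname, 1, nchar(spname) - 2)
time <- substr(spname, nchar(spname) - 1, nchar(spname) - 1)
group <- interaction(strain, time)

# Batch assignment as in the text: replicates 1-2 in batch 1, 3-4 in batch 2
batch <- factor(rep(rep(1:2, each = 2), 6))

mm <- model.matrix(~0 + group + batch)
print(colnames(mm))

# Every group appears in both batches, so the design has full column rank
print(table(group, batch))
print(qr(mm)$rank == ncol(mm))
```

If a group existed in only one batch, the batch effect could not be fully separated from that group's effect, and contrasts involving that group would absorb part of the batch difference.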

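Other influences may also need to be considered. For example, here is a continuous variable, rate, which could represent the effect of pH, light, and so on, on the study material

> # Generate example rate data [its length must equal the number of columns of the count matrix]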
> set.seed(10)
> rate <- rnorm(n = 24, mean = 5, sd = 1.7)
> rate
 [1] 5.031868 4.686771 2.668738 3.981415 5.500727
 [6] 5.662650 2.946271 4.381751 2.234656 4.563987
[11] 6.873025 6.284829 4.595003 6.678656 6.260363
[16] 5.151890 3.376595 4.668244 6.573386 5.821063
> # Specify the model matrix
> mm <- model.matrix(~rate)
> head(mm)
  (Intercept)     rate
1           1 5.031868
2           1 4.686771
3           1 2.668738
4           1 3.981415
5           1 5.500727
6           1 5.662650
> y <- voom(d, mm, plot = F)
> fit <- lmFit(y, mm)
> tmp <- contrasts.fit(fit, coef = 2) # test "rate" coefficient
> tmp <- eBayes(tmp)
> top.table <- topTable(tmp, sort.by = "P", n = Inf)
> DEG <- na.omit(top.table)
> #head(DEG, 5)
> length(which(DEG$adj.P.Val < 0.05 & abs(DEG$logFC)>2 ))
[1] 0

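So rate does not yield any differentially expressed genes at these thresholds (it was generated at random here), but the relationship between rate and an individual gene can still be explored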

> AT1G01060 <- y$E["AT1G01060",]
> plot(AT1G01060 ~ rate, ylim = c(6, 12))
> intercept <- coef(fit)["AT1G01060", "(Intercept)"]
> slope <- coef(fit)["AT1G01060", "rate"]
> abline(a = intercept, b = slope)

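The slope in the plot is the logFC, i.e. the change in the gene's log2 CPM per unit increase in rate. Here the slope is -0.096: each unit increase in rate lowers expression by 0.096 log2 CPM, which is a decrease of about 6.4% in CPM (2^-0.096 ≈ 0.936)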
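The conversion from a log2 slope to a percentage change is short enough to verify directly. Multiplying CPM by 2^slope gives the new level, so a slope of -0.096 corresponds to a ratio of about 0.936; the value 1.069 is the same ratio read in the opposite direction (2^0.096).

```r
slope <- -0.096  # log2 CPM change per unit of rate, from the fit above

# Ratio of CPM after a one-unit increase in rate
ratio <- 2^slope

# Percentage change in CPM per unit of rate
pct_change <- (ratio - 1) * 100

print(round(ratio, 3))
print(round(pct_change, 2))
```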

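Follow our official account (公眾號) ~_~
We are two master's students who moved from agronomy into bioinformatics. We are building 生信星球 (bioinfoplanet) and want it to be a jargon-free, easy-to-understand bioinformatics knowledge platform. If you need help or have suggestions, leave a message on the account or email jieandze1314@gmail.com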

Welcome to our bioinfoplanet!
