DESeq2差異基因分析和批次效應移除

差異基因鑒定

基因表達標準化

不同樣品的測序量會有差異,最簡單的標準化方式是計算
counts per million (CPM),即原始reads count除以總reads數乘以1,000,000。

這種計算方式的缺點是容易受到極高表達且在不同樣品中存在差異表達的基因的影響;這些基因的打開或關閉會影響到細胞中總的分子數目,可能導致這些基因標準化之后就不存在表達差異了,而原本沒有差異的基因標準化之后卻有差異了。

RPKM、FPKM和TPM是CPM按照基因或轉錄本長度歸一化后的表達,也會受到這一影響。

calc_cpm <- function (expr_mat, spikes = NULL){
  norm_factor <- colSums(expr_mat[-spikes,])
  return(t(t(expr_mat)/norm_factor)) * 10^6
}

為了解決這一問題,研究者提出了其它的標準化方法。

量化因子 (size factor,SF)是由DESeq提出的。其方法是首先計算每個基因在所有樣品中表達的幾何平均值。每個細胞的量化因子(size
factor)是所有基因與其在所有樣品中的表達值的幾何平均值的比值的中位數。由于幾何平均值的使用,只有在所有樣品中表達都不為0的基因才能用來計算。這一方法又被稱為RLE (relative log expression)

calc_sf <- function (expr_mat, spikes=NULL){
  geomeans <- exp(rowMeans(log(expr_mat[-spikes,])))
  SF <- function(cnts){
    median((cnts/geomeans)[(is.finite(geomeans) & geomeans >0)])
  }
  norm_factor <- apply(expr_mat[-spikes,],2,SF)
  return(t(t(expr_mat)/norm_factor))
}

上四分位數 (upperquartile,UQ)是樣品中所有基因的表達除以處于上四分位數的基因的表達值。同時為了保證表達水平的相對穩定,計算得到的上四分位數值要除以所有樣品中上四分位數值的中位數。

calc_uq <- function (expr_mat, spikes=NULL){
  UQ <- function(x) {
    quantile(x[x>0],0.75)
  }
  uq <- unlist(apply(expr_mat[-spikes,],2,UQ))
  norm_factor <- uq/median(uq)
  return(t(t(expr_mat)/norm_factor))
}

TMM是M-值的加權截尾均值。選定一個樣品為參照,其它樣品中基因的表達相對于參照樣品中對應基因表達倍數的log2值定義為M-值。隨后去除M-值中最高和最低的30%,剩下的M值計算加權平均值。每一個非參照樣品的基因表達值都乘以計算出的TMM。這個方法假設大部分基因的表達是沒有差異的。

DESeq2 差異基因鑒定一步法

為了簡化差異基因的運算,易生信做了腳本封裝,DESeq2.sh,只需提供原始基因表達矩陣、樣品分組信息表即可進行差異基因分析和鑒定。

DESeq2.sh
OPTIONS:
    -f  Data file [A gene count matrix, NECESSARY]
        CHECK ABOVE FOR DETAILS
        
    -s  Sample file [A multiple columns file with header line, 
        For <timeseries>,  one <conditions> columns is needed.
        NECESSARY]
        CHECK ABOVE FOR DETAILS
        
    -d  The design formula for DESeqDataSetFromMatrix.
        [Default <conditions>, 
        accept <cell+time+cell:time> for example 2.]
        
    -D  The reduced design formula for DESeq.
        [Only applicable to <timeseries> analysis, 
        accept <cell+time> or <time> or <cell> for example 2.]
        
    -m  Specify the comparasion mode.
        [Default <pairwise>, accept <timeseries>,
        <pairwise> comparasion will still be done in <timeseries>
        mode.
        NECESSARY]
        
    -p  A file containing the pairs needed to do comparasion. 
        CHECK ABOVE FOR DETAILS
        All samples will be compared in <pairwise> mode if not specified here.
    
    -F  Log2 Fold change for screening DE genes.
        Default 1

    -P  FDR for screening DE genes.
        Default 0.01

    -q  FDR for screening time-series DE genes.
        Default 0.1

    -e  Execute programs
        Default TRUE

    -i  Install packages of not exist. 
        Default FALSE

Eg.
    DESeq2.sh -f matirx -s sample -d conditions
    DESeq2.sh -f matirx -s sample -p compare_pair

具體使用

cd ~/data
# ehbio_trans.Count_matrix.xls和sampleFile都是前面用到的文件
DESeq2.sh -f ehbio_trans.Count_matrix.xls -s sampleFile

[1] "Perform pairwise comparasion using <design=~conditions>"
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
[1] "Output normalized counts"
[1] "Output rlog transformed normalized counts"
[1] "Performing sample clustering"
null device 
          1 
[1] "DE genes between untrt trt"
[1] "PCA analysis"
null device 

運行結束即可獲得以下結果文件

# 具體的文件內容和圖的樣式見后面的分步法文檔
# 原始輸入文件
ehbio_trans.Count_matrix.xls
sampleFile

# 所有差異基因列表
ehbio_trans.Count_matrix.xls.DESeq2.all.DE

# PCA結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pca.pdf

# 樣品相關性層級聚類結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf

# rlog轉換后的標準化后的表達結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls

# 標準化后的表達結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls

# 運行腳本
ehbio_trans.Count_matrix.xls.DESeq2.r

# 差異基因結果
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.xls

# 火山圖和火山圖輸入數據
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls.volcano.pdf

一步法腳本可在https://ke.qq.com/course/301387?tuin=20cd7788或點擊閱讀原文獲得。

DESeq2 差異基因鑒定分步法

安裝包 DESeq2安裝方法如下

source("https://bioconductor.org/biocLite.R")
biocLite('BiocInstaller')
biocLite("DESeq2")
install.packages(c("gplots", "amap", "ggplot2"))

A distributional assumption is needed because we want to estimate the
probability of extreme events (large fold change just appearing by
chance) from limited replicates. The negative binomial (a.k.a.
Gamma-Poisson) is a good choice for RNA-seq experiments because

  • The observed data at gene level is inherently counts or estimated
    counts of fragments for each feature and
  • The spread of values among biological replicates is more than given
    by a simpler, one parameter distribution, the Poisson; and it seems
    to be captured by the NB sufficiently well

加載包

library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following object is masked from 'package:base':
## 
##     apply

library("RColorBrewer")
library("gplots")

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:IRanges':
## 
##     space

## The following object is masked from 'package:S4Vectors':
## 
##     space

## The following object is masked from 'package:stats':
## 
##     lowess

library("amap")
library("ggplot2")
library("BiocParallel")

# 多線程加速, 使用4個核
# 如果你電腦性能強大,可以把這個數變大
register(MulticoreParam(4))

讀入數據

# 注意文件的位置,程序自己不知道文件在什么地方
# 如果只給文件名,不給路徑,程序只會在當前目錄查找文件
# 若找不到,則會報錯
# 可以使用下面命令設置工作目錄到文件所在目錄
# 或提供文件全路徑
# setwd("~/data")
data <- read.table("ehbio_trans.Count_matrix.xls", header=T, row.names=1, com='', quote='',
    check.names=F, sep="\t")

# 撇掉在多于兩個樣本中count<1的值,如果樣本數多,這個數值可以適當增加
# 排除極低表達基因的干擾
data <- data[rowSums(data)>2,]
head(data)

##                 untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232           13            25            23            24
## ENSG00000278267            0             5             3             4
## ENSG00000268903            0             2             0             0
## ENSG00000269981            0             3             0             1
## ENSG00000241860            3            11             1             5
## ENSG00000279457           46            90            73            49
##                 trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232         12          12          22          22
## ENSG00000278267          2           4           3           1
## ENSG00000268903          0           2           0           3
## ENSG00000269981          0           2           0           0
## ENSG00000241860          3           2           0           2
## ENSG00000279457         52          46          89          31

# 讀入分組信息
sample <- read.table("sampleFile", header=T, row.names=1, com='',
    quote='', check.names=F, sep="\t", colClasses="factor")

sample <- sample[match(colnames(data), rownames(sample)),, drop=F]
sample_rowname <- rownames(sample)

# 下面的可以忽略,如果沒遇到錯誤不需要執行
# 目的是做因子轉換
sample <- data.frame(lapply(sample, function(x) factor(x, levels=unique(x))))
rownames(sample) <- sample_rowname
sample

##               conditions
## untrt_N61311       untrt
## untrt_N052611      untrt
## untrt_N080611      untrt
## untrt_N061011      untrt
## trt_N61311           trt
## trt_N052611          trt
## trt_N080611          trt
## trt_N061011          trt

產生DESeq數據集

DESeq2包采用DESeqDataSet存儲原始的read count和中間計算的統計量。

每個DESeqDataSet對象都要有一個實驗設計formula,用于對數據進行分組,以便計算表達值的離散度和估計表達倍數差異,通常格式為~ batch + conditions
(為了方便后續計算,最為關注的分組信息放在最后一位)。

countData: 表達矩陣

colData: 樣品分組信息表

design: 實驗設計信息,conditions必須是colData中的一列

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
        colData = sample,  design= ~ conditions)

dds <- DESeq(ddsFullCountTable)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

獲取標準化后的數據

# ?counts查看此函數功能
# normalized=T, 返回標準化的數據
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)

##                 untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232    12.730963     21.179286    19.4979982     25.994727
## ENSG00000278267     0.000000      4.235857     2.5432172      4.332454
## ENSG00000268903     0.000000      1.694343     0.0000000      0.000000
## ENSG00000269981     0.000000      2.541514     0.0000000      1.083114
## ENSG00000241860     2.937915      9.318886     0.8477391      5.415568
## ENSG00000279457    45.048024     76.245430    61.8849507     53.072567
##                 trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232  13.423907   17.885811   15.750664   23.250144
## ENSG00000278267   2.237318    5.961937    2.147818    1.056825
## ENSG00000268903   0.000000    2.980969    0.000000    3.170474
## ENSG00000269981   0.000000    2.980969    0.000000    0.000000
## ENSG00000241860   3.355977    2.980969    0.000000    2.113649
## ENSG00000279457  58.170264   68.562276   63.718596   32.761567

根據基因在不同的樣本中表達變化的差異程度mad值對數據排序,差異越大的基因排位越前。

normalized_counts_mad <- apply(normalized_counts, 1, mad)
normalized_counts <- normalized_counts[order(normalized_counts_mad, decreasing=T), ]

# 標準化后的數據輸出
write.table(normalized_counts, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls",
quote=F, sep="\t", row.names=T, col.names=T)

# 只在Linux下能運行,目的是填補表格左上角內容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls"))

# log轉換后的結果
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- rlogMat[order(normalized_counts_mad, decreasing=T), ]


write.table(rlogMat, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls",
quote=F, sep="\t", row.names=T, col.names=T)

# 只在Linux下能運行,目的是填補表格左上角內容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls"))

樣品層級聚類分析,判斷樣品的相似性和組間組內差異

# 生成顏色
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

# 計算相關性pearson correlation
pearson_cor <- as.matrix(cor(rlogMat, method="pearson"))

# 層級聚類
hc <- hcluster(t(rlogMat), method="pearson")

# 熱圖繪制
## 在命令行下運行時,需要去除下面開頭的#號,把輸出的圖保存到文件中
## 輸出到文件時,dev.off()命令是關閉輸出,從而完成圖片的寫入。如果不做這一步,則圖片則不能寫入,從而不能打開
## 如果在Rstudio或其它可視化界面時,可以直接把圖輸出到屏幕
#pdf("ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf", pointsize=10)
heatmap.2(pearson_cor, Rowv=as.dendrogram(hc), symm=T, trace="none",
col=hmcol, margins=c(11,11), main="The pearson correlation of each
sample")
image
#dev.off()

樣品PCA分析

pca_data <- plotPCA(rld, intgroup=c("conditions"), returnData=T, ntop=5000)

差異基因分析

# 定義變量的好處是下面的代碼就獨立了
# 如果想比較不同的組,只需在這修改即可,后面代碼都可以不動
sampleA = "untrt"
sampleB = "trt"

results函數提取差異基因分析結果,包含log2 fold changes,
p valuesadjusted p values.
constrast用于指定比較的兩組的信息,輸出的log2FoldChange為log2(SampleA/SampleB)

contrastV <- c("conditions", sampleA, sampleB)
res <- results(dds,  contrast=contrastV)
res

## log2 fold change (MLE): conditions untrt vs trt 
## Wald test p-value: conditions untrt vs trt 
## DataFrame with 26280 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE        stat    pvalue
##                  <numeric>      <numeric> <numeric>   <numeric> <numeric>
## ENSG00000227232 18.7141876     0.17695376 0.3809406  0.46451794 0.6422767
## ENSG00000278267  2.8144283     0.02802497 1.0249422  0.02734297 0.9781862
## ENSG00000268903  0.9807232    -1.71669654 2.3186888 -0.74037385 0.4590732
## ENSG00000269981  0.8256996     0.59077960 2.4538911  0.24075217 0.8097472
## ENSG00000241860  3.3713378     1.21773503 1.0789466  1.12863330 0.2590526
## ...                    ...            ...       ...         ...       ...
## 29256            0.3507226      1.8471273  3.484683  0.53007036 0.5960631
## 36308            0.9279831      0.1394489  1.598343  0.08724589 0.9304761
## 8414             0.7459738     -0.2678522  2.053434 -0.13044109 0.8962175
## 28837            1.2626343      0.2689443  1.405958  0.19128899 0.8482992
## 35320            1.0041030      0.5493562  1.717832  0.31979625 0.7491228
##                      padj
##                 <numeric>
## ENSG00000227232 0.8538515
## ENSG00000278267        NA
## ENSG00000268903        NA
## ENSG00000269981        NA
## ENSG00000241860        NA
## ...                   ...
## 29256                  NA
## 36308                  NA
## 8414                   NA
## 28837                  NA
## 35320                  NA

DESeq2的原始輸出結果增加樣品平均表達信息,使得結果更容易理解和解析。

# 獲得第一組數據均值
baseA <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleA]
    
if (is.vector(baseA)){
    baseMeanA <- as.data.frame(baseA)
} else {
    baseMeanA <- as.data.frame(rowMeans(baseA))
}
colnames(baseMeanA) <- sampleA
head(baseMeanA)

##                      untrt
## ENSG00000227232 19.8507436
## ENSG00000278267  2.7778822
## ENSG00000268903  0.4235857
## ENSG00000269981  0.9061570
## ENSG00000241860  4.6300269
## ENSG00000279457 59.0627429

# 獲得第二組數據均值
baseB <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleB]
if (is.vector(baseB)){
        baseMeanB <- as.data.frame(baseB)
} else {
        baseMeanB <- as.data.frame(rowMeans(baseB))
}
colnames(baseMeanB) <- sampleB
head(baseMeanB)

##                        trt
## ENSG00000227232 17.5776316
## ENSG00000278267  2.8509744
## ENSG00000268903  1.5378607
## ENSG00000269981  0.7452421
## ENSG00000241860  2.1126487
## ENSG00000279457 55.8031756

# 結果組合
res <- cbind(baseMeanA, baseMeanB, as.data.frame(res))
head(res)

##                      untrt        trt   baseMean log2FoldChange     lfcSE
## ENSG00000227232 19.8507436 17.5776316 18.7141876     0.17695376 0.3809406
## ENSG00000278267  2.7778822  2.8509744  2.8144283     0.02802497 1.0249422
## ENSG00000268903  0.4235857  1.5378607  0.9807232    -1.71669654 2.3186888
## ENSG00000269981  0.9061570  0.7452421  0.8256996     0.59077960 2.4538911
## ENSG00000241860  4.6300269  2.1126487  3.3713378     1.21773503 1.0789466
## ENSG00000279457 59.0627429 55.8031756 57.4329592     0.08926886 0.2929300
##                        stat    pvalue      padj
## ENSG00000227232  0.46451794 0.6422767 0.8538515
## ENSG00000278267  0.02734297 0.9781862        NA
## ENSG00000268903 -0.74037385 0.4590732        NA
## ENSG00000269981  0.24075217 0.8097472        NA
## ENSG00000241860  1.12863330 0.2590526        NA
## ENSG00000279457  0.30474470 0.7605606 0.9130580

# 增加ID信息
res <- cbind(ID=rownames(res), as.data.frame(res))
res$baseMean <- rowMeans(cbind(baseA, baseB))

# 校正后p-value為NA的復制為1
res$padj[is.na(res$padj)] <- 1

# 按pvalue排序, 把差異大的基因放前面
res <- res[order(res$pvalue),]
head(res)

##                              ID     untrt       trt   baseMean
## ENSG00000152583 ENSG00000152583   77.8512  1900.412   989.1314
## ENSG00000148175 ENSG00000148175 7233.0010 19483.552 13358.2766
## ENSG00000179094 ENSG00000179094  151.7491  1380.881   766.3151
## ENSG00000134686 ENSG00000134686 1554.0390  4082.440  2818.2395
## ENSG00000125148 ENSG00000125148 1298.0392  5958.946  3628.4926
## ENSG00000120129 ENSG00000120129  773.9769  5975.522  3374.7494
##                 log2FoldChange      lfcSE      stat        pvalue
## ENSG00000152583      -4.608311 0.21090819 -21.84984 7.800059e-106
## ENSG00000148175      -1.429585 0.08639329 -16.54741  1.671380e-61
## ENSG00000179094      -3.184674 0.20065896 -15.87108  1.005030e-56
## ENSG00000134686      -1.392749 0.09190323 -15.15451  7.073605e-52
## ENSG00000125148      -2.199191 0.14796048 -14.86337  5.698869e-50
## ENSG00000120129      -2.948122 0.19931242 -14.79146  1.663007e-49
##                          padj
## ENSG00000152583 1.331002e-101
## ENSG00000148175  1.426021e-57
## ENSG00000179094  5.716612e-53
## ENSG00000134686  3.017600e-48
## ENSG00000125148  1.944910e-46
## ENSG00000120129  4.729591e-46

整體分析結果輸出到文件

comp314 <- paste(sampleA, "_vs_", sampleB, sep=".")

# 生成文件名
file_base <- paste("ehbio_trans.Count_matrix.xls.DESeq2", comp314, sep=".")
file_base1 <- paste(file_base, "results.xls", sep=".")
write.table(as.data.frame(res), file=file_base1, sep="\t", quote=F, row.names=F)

提取差異表達基因

# 差異基因篩選,padj<0.1
res_de <- subset(res, res$padj<0.1, select=c('ID', sampleA,
        sampleB, 'log2FoldChange', 'padj'))
# foldchang > 1
res_de_up <- subset(res_de, res_de$log2FoldChange>=1)

file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA,"_higherThan_",sampleB,
        'xls', sep=".") 
write.table(as.data.frame(res_de_up), file=file, sep="\t", quote=F, row.names=F)

res_de_dw <- subset(res_de, res_de$log2FoldChange<=(-1)*1)

file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA, "_lowerThan_", sampleB, 
        'xls', sep=".") 
write.table(as.data.frame(res_de_dw), file=file, sep="\t", quote=F, row.names=F)

# 差異基因ID
res_de_up_id = data.frame(ID=res_de_up$ID, 
        type=paste(sampleA,"_higherThan_", sampleB, sep="."))
res_de_dw_id = data.frame(ID=res_de_dw$ID, 
        type=paste(sampleA,"_lowerThan_", sampleB, sep="."))
de_id = rbind(res_de_up_id, res_de_dw_id)

file <- "ehbio_trans.Count_matrix.xls.DESeq2.all.DE"
write.table(as.data.frame(de_id), file=file, sep="\t", quote=F, row.names=F, col.names=F)

繪制火山圖

# 可以把前面生成的results.xls文件提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
logCounts <- log2(res$baseMean+1)
logFC <- res$log2FoldChange
FDR <- res$padj
#png(filename=paste(file_base, "Volcano.png", sep="."))
plot(logFC, -1*log10(FDR), col=ifelse(FDR<=0.01, "red", "black"),
 xlab="logFC", ylab="-1*log1o(FDR)", main="Volcano plot", pch=".")
#dev.off()
image

差異基因熱圖

res_de_up_top20_id <- as.vector(head(res_de_up$ID,20))
res_de_dw_top20_id <- as.vector(head(res_de_dw$ID,20))

red_de_top20 <- c(res_de_up_top20_id, res_de_dw_top20_id)
red_de_top20

##  [1] "ENSG00000178695" "ENSG00000196517" "ENSG00000116584"
##  [4] "ENSG00000144369" "ENSG00000276600" "ENSG00000108821"
##  [7] "ENSG00000162692" "ENSG00000181467" "ENSG00000145777"
## [10] "ENSG00000163491" "ENSG00000183160" "ENSG00000172986"
## [13] "ENSG00000164484" "ENSG00000131389" "ENSG00000134253"
## [16] "ENSG00000124766" "ENSG00000227051" "ENSG00000146006"
## [19] "ENSG00000112837" "ENSG00000146592" "ENSG00000152583"
## [22] "ENSG00000148175" "ENSG00000179094" "ENSG00000134686"
## [25] "ENSG00000125148" "ENSG00000120129" "ENSG00000189221"
## [28] "ENSG00000109906" "ENSG00000101347" "ENSG00000162614"
## [31] "ENSG00000096060" "ENSG00000162616" "ENSG00000166741"
## [34] "ENSG00000183044" "ENSG00000164125" "ENSG00000198624"
## [37] "ENSG00000256235" "ENSG00000077684" "ENSG00000135821"
## [40] "ENSG00000164647"

# 可以把矩陣存儲,提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
red_de_top20_expr <- normalized_counts[rownames(normalized_counts) %in% red_de_top20,]
library(pheatmap)
pheatmap(red_de_top20_expr, cluster_row=T, scale="row", annotation_col=sample)
image

關于批次效應

見論壇討論 http://www.ehbio.com/Esx/d/10-deseq2

每個DESeqDataSet對象都要有一個實驗設計formula,用于對數據進行分組,以便計算表達值的離散度和估計表達倍數差異,通常格式為~ batch + conditions
(為了方便后續計算,最為關注的分組信息放在最后一位)。

如果記錄了樣本的批次信息,或者其它需要抹除的信息可以定義在design參數中,在下游回歸的分析中會根據design formula來估計batch effect的影響,并在下游分析時減去這個影響。這是處理batch effect的推薦方式。

在模型中考慮batch effect并沒有在數據矩陣中移除bacth effect,如果下游處理時,確實有需要可以使用limma包的removeBatchEffect來處理。

countData: 表達矩陣

colData: 樣品分組信息表

design: 實驗設計信息,batchconditions必須是colData中的一列

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
        colData = sample,  design= ~ batch + conditions)

dds <- DESeq(ddsFullCountTable)

rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- limma::removeBatchEffect(rlogMat, c(sample$batch))

Just to be clear, there's an important difference between removing a
batch effect and modelling a batch effect. Including the batch in your
design formula will model the batch effect in the regression step, which
means that the raw data are not modified (so the batch effect is not
removed), but instead the regression will estimate the size of the batch
effect and subtract it out when performing all other tests. In addition,
the model's residual degrees of freedom will be reduced appropriately to
reflect the fact that some degrees of freedom were "spent" modelling the
batch effects. This is the preferred approach for any method that is
capable of using it (this includes DESeq2). You would only remove the
batch effect (e.g. using limma's removeBatchEffect function) if you were
going to do some kind of downstream analysis that can't model the batch
effects, such as training a classifier.

批次效應模擬

#Make some simulated data with a batch effect:
dds <- makeExampleDESeqDataSet(betaSD=1,interceptMean=10)
dds$batch <- factor(rep(c("A","B"),each=6))

#VST, remove batch effect, then plotPCA:

vsd <- vst(dds)
plotPCA(vsd, "batch")
image
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd, "batch")
image

SVA(批次未記錄時,尋找潛在影響因子,并矯正)

dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
##calculating the variables
n.sv <- num.sv(dat,mod,method="leek") #gives 11
### using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)
#http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
#http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects

參考資料

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯系作者
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發布,文章內容僅代表作者本人觀點,簡書系信息發布平臺,僅提供信息存儲服務。
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市,隨后出現的幾起案子,更是在濱河造成了極大的恐慌,老刑警劉巖,帶你破解...
    沈念sama閱讀 230,247評論 6 543
  • 序言:濱河連續發生了三起死亡事件,死亡現場離奇詭異,居然都是意外死亡,警方通過查閱死者的電腦和手機,發現死者居然都...
    沈念sama閱讀 99,520評論 3 429
  • 文/潘曉璐 我一進店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人,你說我怎么就攤上這事。” “怎么了?”我有些...
    開封第一講書人閱讀 178,362評論 0 383
  • 文/不壞的土叔 我叫張陵,是天一觀的道長。 經常有香客問我,道長,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 63,805評論 1 317
  • 正文 為了忘掉前任,我火速辦了婚禮,結果婚禮上,老公的妹妹穿的比我還像新娘。我一直安慰自己,他們只是感情好,可當我...
    茶點故事閱讀 72,541評論 6 412
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著,像睡著了一般。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發上,一...
    開封第一講書人閱讀 55,896評論 1 328
  • 那天,我揣著相機與錄音,去河邊找鬼。 笑死,一個胖子當著我的面吹牛,可吹牛的內容都是我干的。 我是一名探鬼主播,決...
    沈念sama閱讀 43,887評論 3 447
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了?” 一聲冷哼從身側響起,我...
    開封第一講書人閱讀 43,062評論 0 290
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后,有當地人在樹林里發現了一具尸體,經...
    沈念sama閱讀 49,608評論 1 336
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內容為張勛視角 年9月15日...
    茶點故事閱讀 41,356評論 3 358
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發現自己被綠了。 大學時的朋友給我發了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 43,555評論 1 374
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖,靈堂內的尸體忽然破棺而出,到底是詐尸還是另有隱情,我是刑警寧澤,帶...
    沈念sama閱讀 39,077評論 5 364
  • 正文 年R本政府宣布,位于F島的核電站,受9級特大地震影響,放射性物質發生泄漏。R本人自食惡果不足惜,卻給世界環境...
    茶點故事閱讀 44,769評論 3 349
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧,春花似錦、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 35,175評論 0 28
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至,卻和暖如春,著一層夾襖步出監牢的瞬間,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 36,489評論 1 295
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人。 一個月前我還...
    沈念sama閱讀 52,289評論 3 400
  • 正文 我出身青樓,卻偏偏與公主長得像,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當晚...
    茶點故事閱讀 48,516評論 2 379

推薦閱讀更多精彩內容