差異基因鑒定
基因表達標準化
不同樣品的測序量會有差異,最簡單的標準化方式是計算
counts per million (CPM)
,即原始reads count
除以總reads數乘以1,000,000。
這種計算方式的缺點是容易受到極高表達且在不同樣品中存在差異表達的基因的影響;這些基因的打開或關閉會影響到細胞中總的分子數目,可能導致這些基因標準化之后就不存在表達差異了,而原本沒有差異的基因標準化之后卻有差異了。
RPKM、FPKM和TPM是CPM按照基因或轉錄本長度歸一化后的表達,也會受到這一影響。
calc_cpm <- function (expr_mat, spikes = NULL){
norm_factor <- colSums(expr_mat[-spikes,])
return(t(t(expr_mat)/norm_factor)) * 10^6
}
為了解決這一問題,研究者提出了其它的標準化方法。
量化因子 (size factor,SF)是由DESeq
提出的。其方法是首先計算每個基因在所有樣品中表達的幾何平均值。每個細胞的量化因子(size
factor)是所有基因與其在所有樣品中的表達值的幾何平均值的比值的中位數。由于幾何平均值的使用,只有在所有樣品中表達都不為0的基因才能用來計算。這一方法又被稱為RLE (relative log expression)。
calc_sf <- function (expr_mat, spikes=NULL){
geomeans <- exp(rowMeans(log(expr_mat[-spikes,])))
SF <- function(cnts){
median((cnts/geomeans)[(is.finite(geomeans) & geomeans >0)])
}
norm_factor <- apply(expr_mat[-spikes,],2,SF)
return(t(t(expr_mat)/norm_factor))
}
上四分位數 (upperquartile,UQ)是樣品中所有基因的表達除以處于上四分位數的基因的表達值。同時為了保證表達水平的相對穩定,計算得到的上四分位數值要除以所有樣品中上四分位數值的中位數。
calc_uq <- function (expr_mat, spikes=NULL){
UQ <- function(x) {
quantile(x[x>0],0.75)
}
uq <- unlist(apply(expr_mat[-spikes,],2,UQ))
norm_factor <- uq/median(uq)
return(t(t(expr_mat)/norm_factor))
}
TMM是M-值的加權截尾均值。選定一個樣品為參照,其它樣品中基因的表達相對于參照樣品中對應基因表達倍數的log2值定義為M-值。隨后去除M-值中最高和最低的30%,剩下的M值計算加權平均值。每一個非參照樣品的基因表達值都乘以計算出的TMM。這個方法假設大部分基因的表達是沒有差異的。
DESeq2 差異基因鑒定一步法
為了簡化差異基因的運算,易生信做了腳本封裝,DESeq2.sh
,只需提供原始基因表達矩陣、樣品分組信息表即可進行差異基因分析和鑒定。
DESeq2.sh
OPTIONS:
-f Data file [A gene count matrix, NECESSARY]
CHECK ABOVE FOR DETAILS
-s Sample file [A multiple columns file with header line,
For <timeseries>, one <conditions> columns is needed.
NECESSARY]
CHECK ABOVE FOR DETAILS
-d The design formula for DESeqDataSetFromMatrix.
[Default <conditions>,
accept <cell+time+cell:time> for example 2.]
-D The reduced design formula for DESeq.
[Only applicable to <timeseries> analysis,
accept <cell+time> or <time> or <cell> for example 2.]
-m Specify the comparasion mode.
[Default <pairwise>, accept <timeseries>,
<pairwise> comparasion will still be done in <timeseries>
mode.
NECESSARY]
-p A file containing the pairs needed to do comparasion.
CHECK ABOVE FOR DETAILS
All samples will be compared in <pairwise> mode if not specified here.
-F Log2 Fold change for screening DE genes.
Default 1
-P FDR for screening DE genes.
Default 0.01
-q FDR for screening time-series DE genes.
Default 0.1
-e Execute programs
Default TRUE
-i Install packages of not exist.
Default FALSE
Eg.
DESeq2.sh -f matirx -s sample -d conditions
DESeq2.sh -f matirx -s sample -p compare_pair
具體使用
cd ~/data
# ehbio_trans.Count_matrix.xls和sampleFile都是前面用到的文件
DESeq2.sh -f ehbio_trans.Count_matrix.xls -s sampleFile
[1] "Perform pairwise comparasion using <design=~conditions>"
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
[1] "Output normalized counts"
[1] "Output rlog transformed normalized counts"
[1] "Performing sample clustering"
null device
1
[1] "DE genes between untrt trt"
[1] "PCA analysis"
null device
運行結束即可獲得以下結果文件
# 具體的文件內容和圖的樣式見后面的分步法文檔
# 原始輸入文件
ehbio_trans.Count_matrix.xls
sampleFile
# 所有差異基因列表
ehbio_trans.Count_matrix.xls.DESeq2.all.DE
# PCA結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pca.pdf
# 樣品相關性層級聚類結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf
# rlog轉換后的標準化后的表達結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls
# 標準化后的表達結果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls
# 運行腳本
ehbio_trans.Count_matrix.xls.DESeq2.r
# 差異基因結果
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.xls
# 火山圖和火山圖輸入數據
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls.volcano.pdf
一步法腳本可在https://ke.qq.com/course/301387?tuin=20cd7788或點擊閱讀原文獲得。
DESeq2 差異基因鑒定分步法
安裝包 DESeq2
安裝方法如下
source("https://bioconductor.org/biocLite.R")
biocLite('BiocInstaller')
biocLite("DESeq2")
install.packages(c("gplots", "amap", "ggplot2"))
A distributional assumption is needed because we want to estimate the
probability of extreme events (large fold change just appearing by
chance) from limited replicates. The negative binomial (a.k.a.
Gamma-Poisson) is a good choice for RNA-seq experiments because
- The observed data at gene level is inherently counts or estimated
counts of fragments for each feature and - The spread of values among biological replicates is more than given
by a simpler, one parameter distribution, the Poisson; and it seems
to be captured by the NB sufficiently well
加載包
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colMeans,
## colnames, colSums, do.call, duplicated, eval, evalq, Filter,
## Find, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, Map, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
##
## apply
library("RColorBrewer")
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:IRanges':
##
## space
## The following object is masked from 'package:S4Vectors':
##
## space
## The following object is masked from 'package:stats':
##
## lowess
library("amap")
library("ggplot2")
library("BiocParallel")
# 多線程加速, 使用4個核
# 如果你電腦性能強大,可以把這個數變大
register(MulticoreParam(4))
讀入數據
# 注意文件的位置,程序自己不知道文件在什么地方
# 如果只給文件名,不給路徑,程序只會在當前目錄查找文件
# 若找不到,則會報錯
# 可以使用下面命令設置工作目錄到文件所在目錄
# 或提供文件全路徑
# setwd("~/data")
data <- read.table("ehbio_trans.Count_matrix.xls", header=T, row.names=1, com='', quote='',
check.names=F, sep="\t")
# 撇掉在多于兩個樣本中count<1的值,如果樣本數多,這個數值可以適當增加
# 排除極低表達基因的干擾
data <- data[rowSums(data)>2,]
head(data)
## untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232 13 25 23 24
## ENSG00000278267 0 5 3 4
## ENSG00000268903 0 2 0 0
## ENSG00000269981 0 3 0 1
## ENSG00000241860 3 11 1 5
## ENSG00000279457 46 90 73 49
## trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232 12 12 22 22
## ENSG00000278267 2 4 3 1
## ENSG00000268903 0 2 0 3
## ENSG00000269981 0 2 0 0
## ENSG00000241860 3 2 0 2
## ENSG00000279457 52 46 89 31
# 讀入分組信息
sample <- read.table("sampleFile", header=T, row.names=1, com='',
quote='', check.names=F, sep="\t", colClasses="factor")
sample <- sample[match(colnames(data), rownames(sample)),, drop=F]
sample_rowname <- rownames(sample)
# 下面的可以忽略,如果沒遇到錯誤不需要執行
# 目的是做因子轉換
sample <- data.frame(lapply(sample, function(x) factor(x, levels=unique(x))))
rownames(sample) <- sample_rowname
sample
## conditions
## untrt_N61311 untrt
## untrt_N052611 untrt
## untrt_N080611 untrt
## untrt_N061011 untrt
## trt_N61311 trt
## trt_N052611 trt
## trt_N080611 trt
## trt_N061011 trt
產生DESeq數據集
DESeq2包采用DESeqDataSet
存儲原始的read count
和中間計算的統計量。
每個DESeqDataSet
對象都要有一個實驗設計formula,用于對數據進行分組,以便計算表達值的離散度和估計表達倍數差異,通常格式為~ batch + conditions
(為了方便后續計算,最為關注的分組信息放在最后一位)。
countData
: 表達矩陣
colData
: 樣品分組信息表
design
: 實驗設計信息,conditions
必須是colData
中的一列
ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
colData = sample, design= ~ conditions)
dds <- DESeq(ddsFullCountTable)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
獲取標準化后的數據
# ?counts查看此函數功能
# normalized=T, 返回標準化的數據
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
## untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232 12.730963 21.179286 19.4979982 25.994727
## ENSG00000278267 0.000000 4.235857 2.5432172 4.332454
## ENSG00000268903 0.000000 1.694343 0.0000000 0.000000
## ENSG00000269981 0.000000 2.541514 0.0000000 1.083114
## ENSG00000241860 2.937915 9.318886 0.8477391 5.415568
## ENSG00000279457 45.048024 76.245430 61.8849507 53.072567
## trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232 13.423907 17.885811 15.750664 23.250144
## ENSG00000278267 2.237318 5.961937 2.147818 1.056825
## ENSG00000268903 0.000000 2.980969 0.000000 3.170474
## ENSG00000269981 0.000000 2.980969 0.000000 0.000000
## ENSG00000241860 3.355977 2.980969 0.000000 2.113649
## ENSG00000279457 58.170264 68.562276 63.718596 32.761567
根據基因在不同的樣本中表達變化的差異程度mad值
對數據排序,差異越大的基因排位越前。
normalized_counts_mad <- apply(normalized_counts, 1, mad)
normalized_counts <- normalized_counts[order(normalized_counts_mad, decreasing=T), ]
# 標準化后的數據輸出
write.table(normalized_counts, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls",
quote=F, sep="\t", row.names=T, col.names=T)
# 只在Linux下能運行,目的是填補表格左上角內容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls"))
# log轉換后的結果
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- rlogMat[order(normalized_counts_mad, decreasing=T), ]
write.table(rlogMat, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls",
quote=F, sep="\t", row.names=T, col.names=T)
# 只在Linux下能運行,目的是填補表格左上角內容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls"))
樣品層級聚類分析,判斷樣品的相似性和組間組內差異
# 生成顏色
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
# 計算相關性pearson correlation
pearson_cor <- as.matrix(cor(rlogMat, method="pearson"))
# 層級聚類
hc <- hcluster(t(rlogMat), method="pearson")
# 熱圖繪制
## 在命令行下運行時,需要去除下面開頭的#號,把輸出的圖保存到文件中
## 輸出到文件時,dev.off()命令是關閉輸出,從而完成圖片的寫入。如果不做這一步,則圖片則不能寫入,從而不能打開
## 如果在Rstudio或其它可視化界面時,可以直接把圖輸出到屏幕
#pdf("ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf", pointsize=10)
heatmap.2(pearson_cor, Rowv=as.dendrogram(hc), symm=T, trace="none",
col=hmcol, margins=c(11,11), main="The pearson correlation of each
sample")
#dev.off()
樣品PCA分析
pca_data <- plotPCA(rld, intgroup=c("conditions"), returnData=T, ntop=5000)
差異基因分析
# 定義變量的好處是下面的代碼就獨立了
# 如果想比較不同的組,只需在這修改即可,后面代碼都可以不動
sampleA = "untrt"
sampleB = "trt"
results
函數提取差異基因分析結果,包含log2 fold changes
,
p values
和adjusted p values
.
constrast
用于指定比較的兩組的信息,輸出的log2FoldChange為log2(SampleA/SampleB)
。
contrastV <- c("conditions", sampleA, sampleB)
res <- results(dds, contrast=contrastV)
res
## log2 fold change (MLE): conditions untrt vs trt
## Wald test p-value: conditions untrt vs trt
## DataFrame with 26280 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000227232 18.7141876 0.17695376 0.3809406 0.46451794 0.6422767
## ENSG00000278267 2.8144283 0.02802497 1.0249422 0.02734297 0.9781862
## ENSG00000268903 0.9807232 -1.71669654 2.3186888 -0.74037385 0.4590732
## ENSG00000269981 0.8256996 0.59077960 2.4538911 0.24075217 0.8097472
## ENSG00000241860 3.3713378 1.21773503 1.0789466 1.12863330 0.2590526
## ... ... ... ... ... ...
## 29256 0.3507226 1.8471273 3.484683 0.53007036 0.5960631
## 36308 0.9279831 0.1394489 1.598343 0.08724589 0.9304761
## 8414 0.7459738 -0.2678522 2.053434 -0.13044109 0.8962175
## 28837 1.2626343 0.2689443 1.405958 0.19128899 0.8482992
## 35320 1.0041030 0.5493562 1.717832 0.31979625 0.7491228
## padj
## <numeric>
## ENSG00000227232 0.8538515
## ENSG00000278267 NA
## ENSG00000268903 NA
## ENSG00000269981 NA
## ENSG00000241860 NA
## ... ...
## 29256 NA
## 36308 NA
## 8414 NA
## 28837 NA
## 35320 NA
給DESeq2
的原始輸出結果增加樣品平均表達信息,使得結果更容易理解和解析。
# 獲得第一組數據均值
baseA <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleA]
if (is.vector(baseA)){
baseMeanA <- as.data.frame(baseA)
} else {
baseMeanA <- as.data.frame(rowMeans(baseA))
}
colnames(baseMeanA) <- sampleA
head(baseMeanA)
## untrt
## ENSG00000227232 19.8507436
## ENSG00000278267 2.7778822
## ENSG00000268903 0.4235857
## ENSG00000269981 0.9061570
## ENSG00000241860 4.6300269
## ENSG00000279457 59.0627429
# 獲得第二組數據均值
baseB <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleB]
if (is.vector(baseB)){
baseMeanB <- as.data.frame(baseB)
} else {
baseMeanB <- as.data.frame(rowMeans(baseB))
}
colnames(baseMeanB) <- sampleB
head(baseMeanB)
## trt
## ENSG00000227232 17.5776316
## ENSG00000278267 2.8509744
## ENSG00000268903 1.5378607
## ENSG00000269981 0.7452421
## ENSG00000241860 2.1126487
## ENSG00000279457 55.8031756
# 結果組合
res <- cbind(baseMeanA, baseMeanB, as.data.frame(res))
head(res)
## untrt trt baseMean log2FoldChange lfcSE
## ENSG00000227232 19.8507436 17.5776316 18.7141876 0.17695376 0.3809406
## ENSG00000278267 2.7778822 2.8509744 2.8144283 0.02802497 1.0249422
## ENSG00000268903 0.4235857 1.5378607 0.9807232 -1.71669654 2.3186888
## ENSG00000269981 0.9061570 0.7452421 0.8256996 0.59077960 2.4538911
## ENSG00000241860 4.6300269 2.1126487 3.3713378 1.21773503 1.0789466
## ENSG00000279457 59.0627429 55.8031756 57.4329592 0.08926886 0.2929300
## stat pvalue padj
## ENSG00000227232 0.46451794 0.6422767 0.8538515
## ENSG00000278267 0.02734297 0.9781862 NA
## ENSG00000268903 -0.74037385 0.4590732 NA
## ENSG00000269981 0.24075217 0.8097472 NA
## ENSG00000241860 1.12863330 0.2590526 NA
## ENSG00000279457 0.30474470 0.7605606 0.9130580
# 增加ID信息
res <- cbind(ID=rownames(res), as.data.frame(res))
res$baseMean <- rowMeans(cbind(baseA, baseB))
# 校正后p-value為NA的復制為1
res$padj[is.na(res$padj)] <- 1
# 按pvalue排序, 把差異大的基因放前面
res <- res[order(res$pvalue),]
head(res)
## ID untrt trt baseMean
## ENSG00000152583 ENSG00000152583 77.8512 1900.412 989.1314
## ENSG00000148175 ENSG00000148175 7233.0010 19483.552 13358.2766
## ENSG00000179094 ENSG00000179094 151.7491 1380.881 766.3151
## ENSG00000134686 ENSG00000134686 1554.0390 4082.440 2818.2395
## ENSG00000125148 ENSG00000125148 1298.0392 5958.946 3628.4926
## ENSG00000120129 ENSG00000120129 773.9769 5975.522 3374.7494
## log2FoldChange lfcSE stat pvalue
## ENSG00000152583 -4.608311 0.21090819 -21.84984 7.800059e-106
## ENSG00000148175 -1.429585 0.08639329 -16.54741 1.671380e-61
## ENSG00000179094 -3.184674 0.20065896 -15.87108 1.005030e-56
## ENSG00000134686 -1.392749 0.09190323 -15.15451 7.073605e-52
## ENSG00000125148 -2.199191 0.14796048 -14.86337 5.698869e-50
## ENSG00000120129 -2.948122 0.19931242 -14.79146 1.663007e-49
## padj
## ENSG00000152583 1.331002e-101
## ENSG00000148175 1.426021e-57
## ENSG00000179094 5.716612e-53
## ENSG00000134686 3.017600e-48
## ENSG00000125148 1.944910e-46
## ENSG00000120129 4.729591e-46
整體分析結果輸出到文件
comp314 <- paste(sampleA, "_vs_", sampleB, sep=".")
# 生成文件名
file_base <- paste("ehbio_trans.Count_matrix.xls.DESeq2", comp314, sep=".")
file_base1 <- paste(file_base, "results.xls", sep=".")
write.table(as.data.frame(res), file=file_base1, sep="\t", quote=F, row.names=F)
提取差異表達基因
# 差異基因篩選,padj<0.1
res_de <- subset(res, res$padj<0.1, select=c('ID', sampleA,
sampleB, 'log2FoldChange', 'padj'))
# foldchang > 1
res_de_up <- subset(res_de, res_de$log2FoldChange>=1)
file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA,"_higherThan_",sampleB,
'xls', sep=".")
write.table(as.data.frame(res_de_up), file=file, sep="\t", quote=F, row.names=F)
res_de_dw <- subset(res_de, res_de$log2FoldChange<=(-1)*1)
file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA, "_lowerThan_", sampleB,
'xls', sep=".")
write.table(as.data.frame(res_de_dw), file=file, sep="\t", quote=F, row.names=F)
# 差異基因ID
res_de_up_id = data.frame(ID=res_de_up$ID,
type=paste(sampleA,"_higherThan_", sampleB, sep="."))
res_de_dw_id = data.frame(ID=res_de_dw$ID,
type=paste(sampleA,"_lowerThan_", sampleB, sep="."))
de_id = rbind(res_de_up_id, res_de_dw_id)
file <- "ehbio_trans.Count_matrix.xls.DESeq2.all.DE"
write.table(as.data.frame(de_id), file=file, sep="\t", quote=F, row.names=F, col.names=F)
繪制火山圖
# 可以把前面生成的results.xls文件提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
logCounts <- log2(res$baseMean+1)
logFC <- res$log2FoldChange
FDR <- res$padj
#png(filename=paste(file_base, "Volcano.png", sep="."))
plot(logFC, -1*log10(FDR), col=ifelse(FDR<=0.01, "red", "black"),
xlab="logFC", ylab="-1*log1o(FDR)", main="Volcano plot", pch=".")
#dev.off()
差異基因熱圖
res_de_up_top20_id <- as.vector(head(res_de_up$ID,20))
res_de_dw_top20_id <- as.vector(head(res_de_dw$ID,20))
red_de_top20 <- c(res_de_up_top20_id, res_de_dw_top20_id)
red_de_top20
## [1] "ENSG00000178695" "ENSG00000196517" "ENSG00000116584"
## [4] "ENSG00000144369" "ENSG00000276600" "ENSG00000108821"
## [7] "ENSG00000162692" "ENSG00000181467" "ENSG00000145777"
## [10] "ENSG00000163491" "ENSG00000183160" "ENSG00000172986"
## [13] "ENSG00000164484" "ENSG00000131389" "ENSG00000134253"
## [16] "ENSG00000124766" "ENSG00000227051" "ENSG00000146006"
## [19] "ENSG00000112837" "ENSG00000146592" "ENSG00000152583"
## [22] "ENSG00000148175" "ENSG00000179094" "ENSG00000134686"
## [25] "ENSG00000125148" "ENSG00000120129" "ENSG00000189221"
## [28] "ENSG00000109906" "ENSG00000101347" "ENSG00000162614"
## [31] "ENSG00000096060" "ENSG00000162616" "ENSG00000166741"
## [34] "ENSG00000183044" "ENSG00000164125" "ENSG00000198624"
## [37] "ENSG00000256235" "ENSG00000077684" "ENSG00000135821"
## [40] "ENSG00000164647"
# 可以把矩陣存儲,提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
red_de_top20_expr <- normalized_counts[rownames(normalized_counts) %in% red_de_top20,]
library(pheatmap)
pheatmap(red_de_top20_expr, cluster_row=T, scale="row", annotation_col=sample)
關于批次效應
見論壇討論 http://www.ehbio.com/Esx/d/10-deseq2。
每個DESeqDataSet
對象都要有一個實驗設計formula,用于對數據進行分組,以便計算表達值的離散度和估計表達倍數差異,通常格式為~ batch + conditions
(為了方便后續計算,最為關注的分組信息放在最后一位)。
如果記錄了樣本的批次信息,或者其它需要抹除的信息可以定義在design
參數中,在下游回歸的分析中會根據design formula
來估計batch effect
的影響,并在下游分析時減去這個影響。這是處理batch effect
的推薦方式。
在模型中考慮batch effect
并沒有在數據矩陣中移除bacth effect
,如果下游處理時,確實有需要可以使用limma
包的removeBatchEffect
來處理。
countData
: 表達矩陣
colData
: 樣品分組信息表
design
: 實驗設計信息,batch
和conditions
必須是colData
中的一列
ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
colData = sample, design= ~ batch + conditions)
dds <- DESeq(ddsFullCountTable)
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- limma::removeBatchEffect(rlogMat, c(sample$batch))
Just to be clear, there's an important difference between removing a
batch effect and modelling a batch effect. Including the batch in your
design formula will model the batch effect in the regression step, which
means that the raw data are not modified (so the batch effect is not
removed), but instead the regression will estimate the size of the batch
effect and subtract it out when performing all other tests. In addition,
the model's residual degrees of freedom will be reduced appropriately to
reflect the fact that some degrees of freedom were "spent" modelling the
batch effects. This is the preferred approach for any method that is
capable of using it (this includes DESeq2). You would only remove the
batch effect (e.g. using limma's removeBatchEffect function) if you were
going to do some kind of downstream analysis that can't model the batch
effects, such as training a classifier.
批次效應模擬
#Make some simulated data with a batch effect:
dds <- makeExampleDESeqDataSet(betaSD=1,interceptMean=10)
dds$batch <- factor(rep(c("A","B"),each=6))
#VST, remove batch effect, then plotPCA:
vsd <- vst(dds)
plotPCA(vsd, "batch")
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd, "batch")
SVA(批次未記錄時,尋找潛在影響因子,并矯正)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
##calculating the variables
n.sv <- num.sv(dat,mod,method="leek") #gives 11
### using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)
#http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
#http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects