差異基因鑒定

基因表達(dá)標(biāo)準(zhǔn)化

不同樣品的測序量會(huì)有差異，最簡單的標(biāo)準(zhǔn)化方式是計(jì)算
counts per million (CPM)，即原始reads count除以總reads數(shù)乘以1,000,000。

這種計(jì)算方式的缺點(diǎn)是容易受到極高表達(dá)且在不同樣品中存在差異表達(dá)的基因的影響；這些基因的打開或關(guān)閉會(huì)影響到細(xì)胞中總的分子數(shù)目，可能導(dǎo)致這些基因標(biāo)準(zhǔn)化之后就不存在表達(dá)差異了，而原本沒有差異的基因標(biāo)準(zhǔn)化之后卻有差異了。

RPKM、FPKM和TPM是CPM按照基因或轉(zhuǎn)錄本長度歸一化后的表達(dá)，也會(huì)受到這一影響。

calc_cpm <- function (expr_mat, spikes = NULL){
  norm_factor <- colSums(expr_mat[-spikes,])
  return(t(t(expr_mat)/norm_factor)) * 10^6
}

為了解決這一問題，研究者提出了其它的標(biāo)準(zhǔn)化方法。

量化因子 (size factor,SF)是由DESeq提出的。其方法是首先計(jì)算每個(gè)基因在所有樣品中表達(dá)的幾何平均值。每個(gè)細(xì)胞的量化因子(size
factor)是所有基因與其在所有樣品中的表達(dá)值的幾何平均值的比值的中位數(shù)。由于幾何平均值的使用，只有在所有樣品中表達(dá)都不為0的基因才能用來計(jì)算。這一方法又被稱為RLE (relative log expression)。

calc_sf <- function (expr_mat, spikes=NULL){
  geomeans <- exp(rowMeans(log(expr_mat[-spikes,])))
  SF <- function(cnts){
    median((cnts/geomeans)[(is.finite(geomeans) & geomeans >0)])
  }
  norm_factor <- apply(expr_mat[-spikes,],2,SF)
  return(t(t(expr_mat)/norm_factor))
}

上四分位數(shù) (upperquartile,UQ)是樣品中所有基因的表達(dá)除以處于上四分位數(shù)的基因的表達(dá)值。同時(shí)為了保證表達(dá)水平的相對穩(wěn)定，計(jì)算得到的上四分位數(shù)值要除以所有樣品中上四分位數(shù)值的中位數(shù)。

calc_uq <- function (expr_mat, spikes=NULL){
  UQ <- function(x) {
    quantile(x[x>0],0.75)
  }
  uq <- unlist(apply(expr_mat[-spikes,],2,UQ))
  norm_factor <- uq/median(uq)
  return(t(t(expr_mat)/norm_factor))
}

TMM是M-值的加權(quán)截尾均值。選定一個(gè)樣品為參照，其它樣品中基因的表達(dá)相對于參照樣品中對應(yīng)基因表達(dá)倍數(shù)的log2值定義為M-值。隨后去除M-值中最高和最低的30%，剩下的M值計(jì)算加權(quán)平均值。每一個(gè)非參照樣品的基因表達(dá)值都乘以計(jì)算出的TMM。這個(gè)方法假設(shè)大部分基因的表達(dá)是沒有差異的。

DESeq2 差異基因鑒定一步法

為了簡化差異基因的運(yùn)算，易生信做了腳本封裝，DESeq2.sh，只需提供原始基因表達(dá)矩陣、樣品分組信息表即可進(jìn)行差異基因分析和鑒定。

DESeq2.sh
OPTIONS:
    -f  Data file [A gene count matrix, NECESSARY]
        CHECK ABOVE FOR DETAILS
        
    -s  Sample file [A multiple columns file with header line, 
        For <timeseries>,  one <conditions> columns is needed.
        NECESSARY]
        CHECK ABOVE FOR DETAILS
        
    -d  The design formula for DESeqDataSetFromMatrix.
        [Default <conditions>, 
        accept <cell+time+cell:time> for example 2.]
        
    -D  The reduced design formula for DESeq.
        [Only applicable to <timeseries> analysis, 
        accept <cell+time> or <time> or <cell> for example 2.]
        
    -m  Specify the comparasion mode.
        [Default <pairwise>, accept <timeseries>,
        <pairwise> comparasion will still be done in <timeseries>
        mode.
        NECESSARY]
        
    -p  A file containing the pairs needed to do comparasion. 
        CHECK ABOVE FOR DETAILS
        All samples will be compared in <pairwise> mode if not specified here.
    
    -F  Log2 Fold change for screening DE genes.
        Default 1

    -P  FDR for screening DE genes.
        Default 0.01

    -q  FDR for screening time-series DE genes.
        Default 0.1

    -e  Execute programs
        Default TRUE

    -i  Install packages of not exist. 
        Default FALSE

Eg.
    DESeq2.sh -f matirx -s sample -d conditions
    DESeq2.sh -f matirx -s sample -p compare_pair

具體使用

cd ~/data
# ehbio_trans.Count_matrix.xls和sampleFile都是前面用到的文件
DESeq2.sh -f ehbio_trans.Count_matrix.xls -s sampleFile

[1] "Perform pairwise comparasion using <design=~conditions>"
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
[1] "Output normalized counts"
[1] "Output rlog transformed normalized counts"
[1] "Performing sample clustering"
null device 
          1 
[1] "DE genes between untrt trt"
[1] "PCA analysis"
null device

運(yùn)行結(jié)束即可獲得以下結(jié)果文件

# 具體的文件內(nèi)容和圖的樣式見后面的分步法文檔
# 原始輸入文件
ehbio_trans.Count_matrix.xls
sampleFile

# 所有差異基因列表
ehbio_trans.Count_matrix.xls.DESeq2.all.DE

# PCA結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pca.pdf

# 樣品相關(guān)性層級聚類結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf

# rlog轉(zhuǎn)換后的標(biāo)準(zhǔn)化后的表達(dá)結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls

# 標(biāo)準(zhǔn)化后的表達(dá)結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls

# 運(yùn)行腳本
ehbio_trans.Count_matrix.xls.DESeq2.r

# 差異基因結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.xls

# 火山圖和火山圖輸入數(shù)據(jù)
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls.volcano.pdf

一步法腳本可在https://ke.qq.com/course/301387?tuin=20cd7788或點(diǎn)擊閱讀原文獲得。

DESeq2 差異基因鑒定分步法

安裝包 `DESeq2`安裝方法如下

source("https://bioconductor.org/biocLite.R")
biocLite('BiocInstaller')
biocLite("DESeq2")
install.packages(c("gplots", "amap", "ggplot2"))

A distributional assumption is needed because we want to estimate the
probability of extreme events (large fold change just appearing by
chance) from limited replicates. The negative binomial (a.k.a.
Gamma-Poisson) is a good choice for RNA-seq experiments because

The observed data at gene level is inherently counts or estimated
counts of fragments for each feature and
The spread of values among biological replicates is more than given
by a simpler, one parameter distribution, the Poisson; and it seems
to be captured by the NB sufficiently well

加載包

library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following object is masked from 'package:base':
## 
##     apply

library("RColorBrewer")
library("gplots")

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:IRanges':
## 
##     space

## The following object is masked from 'package:S4Vectors':
## 
##     space

## The following object is masked from 'package:stats':
## 
##     lowess

library("amap")
library("ggplot2")
library("BiocParallel")

# 多線程加速, 使用4個(gè)核
# 如果你電腦性能強(qiáng)大，可以把這個(gè)數(shù)變大
register(MulticoreParam(4))

讀入數(shù)據(jù)

# 注意文件的位置，程序自己不知道文件在什么地方
# 如果只給文件名，不給路徑，程序只會(huì)在當(dāng)前目錄查找文件
# 若找不到，則會(huì)報(bào)錯(cuò)
# 可以使用下面命令設(shè)置工作目錄到文件所在目錄
# 或提供文件全路徑
# setwd("~/data")
data <- read.table("ehbio_trans.Count_matrix.xls", header=T, row.names=1, com='', quote='',
    check.names=F, sep="\t")

# 撇掉在多于兩個(gè)樣本中count<1的值，如果樣本數(shù)多，這個(gè)數(shù)值可以適當(dāng)增加
# 排除極低表達(dá)基因的干擾
data <- data[rowSums(data)>2,]
head(data)

##                 untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232           13            25            23            24
## ENSG00000278267            0             5             3             4
## ENSG00000268903            0             2             0             0
## ENSG00000269981            0             3             0             1
## ENSG00000241860            3            11             1             5
## ENSG00000279457           46            90            73            49
##                 trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232         12          12          22          22
## ENSG00000278267          2           4           3           1
## ENSG00000268903          0           2           0           3
## ENSG00000269981          0           2           0           0
## ENSG00000241860          3           2           0           2
## ENSG00000279457         52          46          89          31

# 讀入分組信息
sample <- read.table("sampleFile", header=T, row.names=1, com='',
    quote='', check.names=F, sep="\t", colClasses="factor")

sample <- sample[match(colnames(data), rownames(sample)),, drop=F]
sample_rowname <- rownames(sample)

# 下面的可以忽略，如果沒遇到錯(cuò)誤不需要執(zhí)行
# 目的是做因子轉(zhuǎn)換
sample <- data.frame(lapply(sample, function(x) factor(x, levels=unique(x))))
rownames(sample) <- sample_rowname
sample

##               conditions
## untrt_N61311       untrt
## untrt_N052611      untrt
## untrt_N080611      untrt
## untrt_N061011      untrt
## trt_N61311           trt
## trt_N052611          trt
## trt_N080611          trt
## trt_N061011          trt

產(chǎn)生DESeq數(shù)據(jù)集

DESeq2包采用DESeqDataSet存儲(chǔ)原始的read count和中間計(jì)算的統(tǒng)計(jì)量。

每個(gè)DESeqDataSet對象都要有一個(gè)實(shí)驗(yàn)設(shè)計(jì)formula，用于對數(shù)據(jù)進(jìn)行分組，以便計(jì)算表達(dá)值的離散度和估計(jì)表達(dá)倍數(shù)差異，通常格式為~ batch + conditions
(為了方便后續(xù)計(jì)算，最為關(guān)注的分組信息放在最后一位)。

countData: 表達(dá)矩陣

colData: 樣品分組信息表

design: 實(shí)驗(yàn)設(shè)計(jì)信息，conditions必須是colData中的一列

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
        colData = sample,  design= ~ conditions)

dds <- DESeq(ddsFullCountTable)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

獲取標(biāo)準(zhǔn)化后的數(shù)據(jù)

# ?counts查看此函數(shù)功能
# normalized=T, 返回標(biāo)準(zhǔn)化的數(shù)據(jù)
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)

##                 untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232    12.730963     21.179286    19.4979982     25.994727
## ENSG00000278267     0.000000      4.235857     2.5432172      4.332454
## ENSG00000268903     0.000000      1.694343     0.0000000      0.000000
## ENSG00000269981     0.000000      2.541514     0.0000000      1.083114
## ENSG00000241860     2.937915      9.318886     0.8477391      5.415568
## ENSG00000279457    45.048024     76.245430    61.8849507     53.072567
##                 trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232  13.423907   17.885811   15.750664   23.250144
## ENSG00000278267   2.237318    5.961937    2.147818    1.056825
## ENSG00000268903   0.000000    2.980969    0.000000    3.170474
## ENSG00000269981   0.000000    2.980969    0.000000    0.000000
## ENSG00000241860   3.355977    2.980969    0.000000    2.113649
## ENSG00000279457  58.170264   68.562276   63.718596   32.761567

根據(jù)基因在不同的樣本中表達(dá)變化的差異程度mad值對數(shù)據(jù)排序，差異越大的基因排位越前。

normalized_counts_mad <- apply(normalized_counts, 1, mad)
normalized_counts <- normalized_counts[order(normalized_counts_mad, decreasing=T), ]

# 標(biāo)準(zhǔn)化后的數(shù)據(jù)輸出
write.table(normalized_counts, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls",
quote=F, sep="\t", row.names=T, col.names=T)

# 只在Linux下能運(yùn)行，目的是填補(bǔ)表格左上角內(nèi)容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls"))

# log轉(zhuǎn)換后的結(jié)果
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- rlogMat[order(normalized_counts_mad, decreasing=T), ]


write.table(rlogMat, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls",
quote=F, sep="\t", row.names=T, col.names=T)

# 只在Linux下能運(yùn)行，目的是填補(bǔ)表格左上角內(nèi)容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls"))

樣品層級聚類分析，判斷樣品的相似性和組間組內(nèi)差異

# 生成顏色
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

# 計(jì)算相關(guān)性pearson correlation
pearson_cor <- as.matrix(cor(rlogMat, method="pearson"))

# 層級聚類
hc <- hcluster(t(rlogMat), method="pearson")

# 熱圖繪制
## 在命令行下運(yùn)行時(shí)，需要去除下面開頭的#號(hào)，把輸出的圖保存到文件中
## 輸出到文件時(shí)，dev.off()命令是關(guān)閉輸出，從而完成圖片的寫入。如果不做這一步，則圖片則不能寫入，從而不能打開
## 如果在Rstudio或其它可視化界面時(shí)，可以直接把圖輸出到屏幕
#pdf("ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf", pointsize=10)
heatmap.2(pearson_cor, Rowv=as.dendrogram(hc), symm=T, trace="none",
col=hmcol, margins=c(11,11), main="The pearson correlation of each
sample")

image

#dev.off()

樣品PCA分析

pca_data <- plotPCA(rld, intgroup=c("conditions"), returnData=T, ntop=5000)

差異基因分析

# 定義變量的好處是下面的代碼就獨(dú)立了
# 如果想比較不同的組，只需在這修改即可，后面代碼都可以不動(dòng)
sampleA = "untrt"
sampleB = "trt"

results函數(shù)提取差異基因分析結(jié)果，包含log2 fold changes,
p values和adjusted p values.
constrast用于指定比較的兩組的信息，輸出的log2FoldChange為log2(SampleA/SampleB)。

contrastV <- c("conditions", sampleA, sampleB)
res <- results(dds,  contrast=contrastV)
res

## log2 fold change (MLE): conditions untrt vs trt 
## Wald test p-value: conditions untrt vs trt 
## DataFrame with 26280 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE        stat    pvalue
##                  <numeric>      <numeric> <numeric>   <numeric> <numeric>
## ENSG00000227232 18.7141876     0.17695376 0.3809406  0.46451794 0.6422767
## ENSG00000278267  2.8144283     0.02802497 1.0249422  0.02734297 0.9781862
## ENSG00000268903  0.9807232    -1.71669654 2.3186888 -0.74037385 0.4590732
## ENSG00000269981  0.8256996     0.59077960 2.4538911  0.24075217 0.8097472
## ENSG00000241860  3.3713378     1.21773503 1.0789466  1.12863330 0.2590526
## ...                    ...            ...       ...         ...       ...
## 29256            0.3507226      1.8471273  3.484683  0.53007036 0.5960631
## 36308            0.9279831      0.1394489  1.598343  0.08724589 0.9304761
## 8414             0.7459738     -0.2678522  2.053434 -0.13044109 0.8962175
## 28837            1.2626343      0.2689443  1.405958  0.19128899 0.8482992
## 35320            1.0041030      0.5493562  1.717832  0.31979625 0.7491228
##                      padj
##                 <numeric>
## ENSG00000227232 0.8538515
## ENSG00000278267        NA
## ENSG00000268903        NA
## ENSG00000269981        NA
## ENSG00000241860        NA
## ...                   ...
## 29256                  NA
## 36308                  NA
## 8414                   NA
## 28837                  NA
## 35320                  NA

給DESeq2的原始輸出結(jié)果增加樣品平均表達(dá)信息，使得結(jié)果更容易理解和解析。

# 獲得第一組數(shù)據(jù)均值
baseA <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleA]
    
if (is.vector(baseA)){
    baseMeanA <- as.data.frame(baseA)
} else {
    baseMeanA <- as.data.frame(rowMeans(baseA))
}
colnames(baseMeanA) <- sampleA
head(baseMeanA)

##                      untrt
## ENSG00000227232 19.8507436
## ENSG00000278267  2.7778822
## ENSG00000268903  0.4235857
## ENSG00000269981  0.9061570
## ENSG00000241860  4.6300269
## ENSG00000279457 59.0627429

# 獲得第二組數(shù)據(jù)均值
baseB <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleB]
if (is.vector(baseB)){
        baseMeanB <- as.data.frame(baseB)
} else {
        baseMeanB <- as.data.frame(rowMeans(baseB))
}
colnames(baseMeanB) <- sampleB
head(baseMeanB)

##                        trt
## ENSG00000227232 17.5776316
## ENSG00000278267  2.8509744
## ENSG00000268903  1.5378607
## ENSG00000269981  0.7452421
## ENSG00000241860  2.1126487
## ENSG00000279457 55.8031756

# 結(jié)果組合
res <- cbind(baseMeanA, baseMeanB, as.data.frame(res))
head(res)

##                      untrt        trt   baseMean log2FoldChange     lfcSE
## ENSG00000227232 19.8507436 17.5776316 18.7141876     0.17695376 0.3809406
## ENSG00000278267  2.7778822  2.8509744  2.8144283     0.02802497 1.0249422
## ENSG00000268903  0.4235857  1.5378607  0.9807232    -1.71669654 2.3186888
## ENSG00000269981  0.9061570  0.7452421  0.8256996     0.59077960 2.4538911
## ENSG00000241860  4.6300269  2.1126487  3.3713378     1.21773503 1.0789466
## ENSG00000279457 59.0627429 55.8031756 57.4329592     0.08926886 0.2929300
##                        stat    pvalue      padj
## ENSG00000227232  0.46451794 0.6422767 0.8538515
## ENSG00000278267  0.02734297 0.9781862        NA
## ENSG00000268903 -0.74037385 0.4590732        NA
## ENSG00000269981  0.24075217 0.8097472        NA
## ENSG00000241860  1.12863330 0.2590526        NA
## ENSG00000279457  0.30474470 0.7605606 0.9130580

# 增加ID信息
res <- cbind(ID=rownames(res), as.data.frame(res))
res$baseMean <- rowMeans(cbind(baseA, baseB))

# 校正后p-value為NA的復(fù)制為1
res$padj[is.na(res$padj)] <- 1

# 按pvalue排序, 把差異大的基因放前面
res <- res[order(res$pvalue),]
head(res)

##                              ID     untrt       trt   baseMean
## ENSG00000152583 ENSG00000152583   77.8512  1900.412   989.1314
## ENSG00000148175 ENSG00000148175 7233.0010 19483.552 13358.2766
## ENSG00000179094 ENSG00000179094  151.7491  1380.881   766.3151
## ENSG00000134686 ENSG00000134686 1554.0390  4082.440  2818.2395
## ENSG00000125148 ENSG00000125148 1298.0392  5958.946  3628.4926
## ENSG00000120129 ENSG00000120129  773.9769  5975.522  3374.7494
##                 log2FoldChange      lfcSE      stat        pvalue
## ENSG00000152583      -4.608311 0.21090819 -21.84984 7.800059e-106
## ENSG00000148175      -1.429585 0.08639329 -16.54741  1.671380e-61
## ENSG00000179094      -3.184674 0.20065896 -15.87108  1.005030e-56
## ENSG00000134686      -1.392749 0.09190323 -15.15451  7.073605e-52
## ENSG00000125148      -2.199191 0.14796048 -14.86337  5.698869e-50
## ENSG00000120129      -2.948122 0.19931242 -14.79146  1.663007e-49
##                          padj
## ENSG00000152583 1.331002e-101
## ENSG00000148175  1.426021e-57
## ENSG00000179094  5.716612e-53
## ENSG00000134686  3.017600e-48
## ENSG00000125148  1.944910e-46
## ENSG00000120129  4.729591e-46

整體分析結(jié)果輸出到文件

comp314 <- paste(sampleA, "_vs_", sampleB, sep=".")

# 生成文件名
file_base <- paste("ehbio_trans.Count_matrix.xls.DESeq2", comp314, sep=".")
file_base1 <- paste(file_base, "results.xls", sep=".")
write.table(as.data.frame(res), file=file_base1, sep="\t", quote=F, row.names=F)

提取差異表達(dá)基因

# 差異基因篩選，padj<0.1
res_de <- subset(res, res$padj<0.1, select=c('ID', sampleA,
        sampleB, 'log2FoldChange', 'padj'))
# foldchang > 1
res_de_up <- subset(res_de, res_de$log2FoldChange>=1)

file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA,"_higherThan_",sampleB,
        'xls', sep=".") 
write.table(as.data.frame(res_de_up), file=file, sep="\t", quote=F, row.names=F)

res_de_dw <- subset(res_de, res_de$log2FoldChange<=(-1)*1)

file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA, "_lowerThan_", sampleB, 
        'xls', sep=".") 
write.table(as.data.frame(res_de_dw), file=file, sep="\t", quote=F, row.names=F)

# 差異基因ID
res_de_up_id = data.frame(ID=res_de_up$ID, 
        type=paste(sampleA,"_higherThan_", sampleB, sep="."))
res_de_dw_id = data.frame(ID=res_de_dw$ID, 
        type=paste(sampleA,"_lowerThan_", sampleB, sep="."))
de_id = rbind(res_de_up_id, res_de_dw_id)

file <- "ehbio_trans.Count_matrix.xls.DESeq2.all.DE"
write.table(as.data.frame(de_id), file=file, sep="\t", quote=F, row.names=F, col.names=F)

繪制火山圖

# 可以把前面生成的results.xls文件提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
logCounts <- log2(res$baseMean+1)
logFC <- res$log2FoldChange
FDR <- res$padj
#png(filename=paste(file_base, "Volcano.png", sep="."))
plot(logFC, -1*log10(FDR), col=ifelse(FDR<=0.01, "red", "black"),
 xlab="logFC", ylab="-1*log1o(FDR)", main="Volcano plot", pch=".")
#dev.off()

image

差異基因熱圖

res_de_up_top20_id <- as.vector(head(res_de_up$ID,20))
res_de_dw_top20_id <- as.vector(head(res_de_dw$ID,20))

red_de_top20 <- c(res_de_up_top20_id, res_de_dw_top20_id)
red_de_top20

##  [1] "ENSG00000178695" "ENSG00000196517" "ENSG00000116584"
##  [4] "ENSG00000144369" "ENSG00000276600" "ENSG00000108821"
##  [7] "ENSG00000162692" "ENSG00000181467" "ENSG00000145777"
## [10] "ENSG00000163491" "ENSG00000183160" "ENSG00000172986"
## [13] "ENSG00000164484" "ENSG00000131389" "ENSG00000134253"
## [16] "ENSG00000124766" "ENSG00000227051" "ENSG00000146006"
## [19] "ENSG00000112837" "ENSG00000146592" "ENSG00000152583"
## [22] "ENSG00000148175" "ENSG00000179094" "ENSG00000134686"
## [25] "ENSG00000125148" "ENSG00000120129" "ENSG00000189221"
## [28] "ENSG00000109906" "ENSG00000101347" "ENSG00000162614"
## [31] "ENSG00000096060" "ENSG00000162616" "ENSG00000166741"
## [34] "ENSG00000183044" "ENSG00000164125" "ENSG00000198624"
## [37] "ENSG00000256235" "ENSG00000077684" "ENSG00000135821"
## [40] "ENSG00000164647"

# 可以把矩陣存儲(chǔ)，提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
red_de_top20_expr <- normalized_counts[rownames(normalized_counts) %in% red_de_top20,]
library(pheatmap)
pheatmap(red_de_top20_expr, cluster_row=T, scale="row", annotation_col=sample)

image

關(guān)于批次效應(yīng)

見論壇討論 http://www.ehbio.com/Esx/d/10-deseq2。

如果記錄了樣本的批次信息，或者其它需要抹除的信息可以定義在design參數(shù)中，在下游回歸的分析中會(huì)根據(jù)design formula來估計(jì)batch effect的影響，并在下游分析時(shí)減去這個(gè)影響。這是處理batch effect的推薦方式。

在模型中考慮batch effect并沒有在數(shù)據(jù)矩陣中移除bacth effect，如果下游處理時(shí)，確實(shí)有需要可以使用limma包的removeBatchEffect來處理。

countData: 表達(dá)矩陣

colData: 樣品分組信息表

design: 實(shí)驗(yàn)設(shè)計(jì)信息，batch和conditions必須是colData中的一列

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
        colData = sample,  design= ~ batch + conditions)

dds <- DESeq(ddsFullCountTable)

rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- limma::removeBatchEffect(rlogMat, c(sample$batch))

Just to be clear, there's an important difference between removing a
batch effect and modelling a batch effect. Including the batch in your
design formula will model the batch effect in the regression step, which
means that the raw data are not modified (so the batch effect is not
removed), but instead the regression will estimate the size of the batch
effect and subtract it out when performing all other tests. In addition,
the model's residual degrees of freedom will be reduced appropriately to
reflect the fact that some degrees of freedom were "spent" modelling the
batch effects. This is the preferred approach for any method that is
capable of using it (this includes DESeq2). You would only remove the
batch effect (e.g. using limma's removeBatchEffect function) if you were
going to do some kind of downstream analysis that can't model the batch
effects, such as training a classifier.

批次效應(yīng)模擬

#Make some simulated data with a batch effect:
dds <- makeExampleDESeqDataSet(betaSD=1,interceptMean=10)
dds$batch <- factor(rep(c("A","B"),each=6))

#VST, remove batch effect, then plotPCA:

vsd <- vst(dds)
plotPCA(vsd, "batch")

image

assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd, "batch")

image

SVA(批次未記錄時(shí)，尋找潛在影響因子，并矯正)

dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
##calculating the variables
n.sv <- num.sv(dat,mod,method="leek") #gives 11
### using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)
#http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
#http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects

參考資料

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
http://dx.doi.org/10.1186/s13059-014-0550-8

三个男躁一个女,国精产品一区一手机的秘密,麦子交换系列最经典十句话,欧美国产综合欧美视频

DESeq2差異基因分析和批次效應(yīng)移除

DESeq2差異基因分析和批次效應(yīng)移除

差異基因鑒定

基因表達(dá)標(biāo)準(zhǔn)化

DESeq2 差異基因鑒定一步法

DESeq2 差異基因鑒定分步法

安裝包 `DESeq2`安裝方法如下

加載包

讀入數(shù)據(jù)

產(chǎn)生DESeq數(shù)據(jù)集

獲取標(biāo)準(zhǔn)化后的數(shù)據(jù)

差異基因分析

關(guān)于批次效應(yīng)

參考資料

推薦閱讀更多精彩內(nèi)容

三个男躁一个女,国精产品一区一手机的秘密,麦子交换系列最经典十句话,欧美 国产 综合 欧美 视频

DESeq2差異基因分析和批次效應(yīng)移除

差異基因鑒定

基因表達(dá)標(biāo)準(zhǔn)化

DESeq2 差異基因鑒定一步法

DESeq2 差異基因鑒定分步法

安裝包 DESeq2安裝方法如下

加載包

讀入數(shù)據(jù)

產(chǎn)生DESeq數(shù)據(jù)集

獲取標(biāo)準(zhǔn)化后的數(shù)據(jù)

差異基因分析

關(guān)于批次效應(yīng)

參考資料

推薦閱讀更多精彩內(nèi)容

三个男躁一个女,国精产品一区一手机的秘密,麦子交换系列最经典十句话,欧美国产综合欧美视频

安裝包 `DESeq2`安裝方法如下