SO HOT!
1. 安裝一些R包
1.1 安裝
1.1.1 ALL
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("ALL")
1.1.2 pasilla
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("pasilla")
1.1.3 airway
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("airway")
1.1.4 limma
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("limma")
1.1.5 DESeq2
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
> BiocManager::install("DESeq2")
1.1.6 clusterProfiler
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("clusterProfiler")
1.1.7 reshape2
> install.packages('reshape2')
1.1.8 ggplot2
> install.packages('ggplot2')
1.2 運行
用 library()
檢查已安裝的包是否能夠運行
e.g.
> library(CLL)
# 載入需要的程輯包:affy
# 載入需要的程輯包:BiocGenerics
# 載入需要的程輯包:parallel
#
# 載入程輯包:‘BiocGenerics’
#
# The following objects are masked from ‘package:parallel’:
#
# clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
# clusterExport, clusterMap, parApply, parCapply, parLapply,
# parLapplyLB, parRapply, parSapply, parSapplyLB
#
# The following objects are masked from ‘package:stats’:
#
# IQR, mad, sd, var, xtabs
#
# The following objects are masked from ‘package:base’:
#
# anyDuplicated, append, as.data.frame, basename, cbind, colnames,
# dirname, do.call, duplicated, eval, evalq, Filter, Find, get,
# grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match,
# mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
# rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
# tapply, union, unique, unsplit, which, which.max, which.min
#
# 載入需要的程輯包:Biobase
# Welcome to Bioconductor
#
# Vignettes contain introductory material; view with
# 'browseVignettes()'. To cite Bioconductor, see
# 'citation("Biobase")', and for packages 'citation("pkgname")'.
若試圖運行一個沒有安裝的包:
> available.packages()
# Package Version
# A3 "A3" "1.0.0"
> library(A3)
# Error in library(A3) : 不存在叫‘A3’這個名字的程輯包
2. 了解ExpressionSet對象
比如CLL包里面就有data(sCLLex) ,找到它包含的元素,提取其表達矩陣(使用exprs函數),查看其大小。
提取表達矩陣:
> library('CLL')
> data("sCLLex")
> exprSet <- exprs(sCLLex)
查看大小:
> dim(exprSet)
# [1] 12625 22
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
3. 了解 str,head,help函數,作用于 第二步提取到的表達矩陣
3.1 str()
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It calls args for (non-primitive) function objects.
用于查看矩陣的大小(列數、行數),查看列名、行名的示例。
3.2 head()
> head(exprSet)
# CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL CLL16.CEL
# 1000_at 5.743132 6.219412 5.523328 5.340477 5.229904 4.920686
# 1001_at 2.285143 2.291229 2.287986 2.295313 2.662170 2.278040
# 1002_f_at 3.309294 3.318466 3.354423 3.327130 3.365113 3.568353
# 1003_s_at 1.085264 1.117288 1.084010 1.103217 1.074243 1.073097
# 1004_at 7.544884 7.671801 7.474025 7.152482 6.902932 7.368660
# 1005_at 5.083793 7.610593 7.631311 6.518594 5.059087 4.855161
# ...
Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.
用于查看對象的前幾行,默認為前6行,可通過
n = xL
調整。
3.3 help()
> help("exprs")
help is the primary interface to the help systems.
用于調出 R Documentation 對某個函數的說明頁面,?fuction
是 help()
的簡略寫法
4. 安裝并了解 hgu95av2.db 包
看看 ls(“package:hgu95av2.db”) 后顯示的哪些變量?
安裝:
> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>
> BiocManager::install("hgu95av2.db")
> library(hgu95av2.db)
> ls("package:hgu95av2.db")
# [1] "hgu95av2" "hgu95av2.db" "hgu95av2_dbconn"
# [4] "hgu95av2_dbfile" "hgu95av2_dbInfo" "hgu95av2_dbschema"
# [7] "hgu95av2ACCNUM" "hgu95av2ALIAS2PROBE" "hgu95av2CHR"
# [10] "hgu95av2CHRLENGTHS" "hgu95av2CHRLOC" "hgu95av2CHRLOCEND"
# [13] "hgu95av2ENSEMBL" "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"
# [16] "hgu95av2ENZYME" "hgu95av2ENZYME2PROBE" "hgu95av2GENENAME"
# [19] "hgu95av2GO" "hgu95av2GO2ALLPROBES" "hgu95av2GO2PROBE"
# [22] "hgu95av2MAP" "hgu95av2MAPCOUNTS" "hgu95av2OMIM"
# [25] "hgu95av2ORGANISM" "hgu95av2ORGPKG" "hgu95av2PATH"
# [28] "hgu95av2PATH2PROBE" "hgu95av2PFAM" "hgu95av2PMID"
# [31] "hgu95av2PMID2PROBE" "hgu95av2PROSITE" "hgu95av2REFSEQ"
# [34] "hgu95av2SYMBOL" "hgu95av2UNIGENE" "hgu95av2UNIPROT"
> as.list(hgu95av2GO[1]
# $`1000_at`$`GO:0097110`
# $`1000_at`$`GO:0097110`$GOID
# [1] "GO:0097110"
# $`1000_at`$`GO:0097110`$Evidence
# [1] "IEA"
# $`1000_at`$`GO:0097110`$Ontology
# [1] "MF"
1000_at
為探針,分別顯示GOID, Evidence(基因注釋證據代碼), Ontology(基因本體論中的三個分類,BP = Biological Process; CC = Cellular Component; MF = Molecular Function).
> as.list(hgu95av2UNIGENE[1])
# $`1000_at`
# [1] "Hs.861"
"Hs.861"為NCBI UniGene數據庫的ID.
提取其中的數據:
> get(featureNames(sCLLex)[1],hgu95av2GO)[1]
# $`GO:0000165`
# $`GO:0000165`$GOID
# [1] "GO:0000165"
# $`GO:0000165`$Evidence
# [1] "NAS"
# $`GO:0000165`$Ontology
# [1] "BP"
5. 理解 head(toTable(hgu95av2SYMBOL)) 的用法
找到TP53基因對應的探針ID
> head(toTable(hgu95av2SYMBOL))
# probe_id symbol
# 1 1000_at MAPK3
# 2 1001_at TIE1
# 3 1002_f_at CYP2C19
# 4 1003_s_at CXCR5
# 5 1004_at CXCR5
# 6 1005_at DUSP1
找到TP53基因對應的探針ID:
> ids <- toTable(hgu95av2SYMBOL)
> grep("TP53",ids$symbol)
# [1] 732 884 966 997 1420 2675 3322 4120 4304 5475 5526 10084
> ids[grep("TP53",ids$symbol),]
# probe_id symbol
# 732 1711_at TP53BP1
# 884 1860_at TP53BP2
# 966 1939_at TP53
# 997 1974_s_at TP53
# 1420 31618_at TP53
# 2675 33025_at TP53TG5
# 3322 33749_at TP53TG1
# 4120 34629_at TP53I11
# 4304 34822_at TP53BP2
# 5475 36079_at TP53I3
# 5526 36136_at TP53I11
# 10084 40986_s_at TP53BP1
6. 理解探針與基因的對應關系
總共多少個基因,基因最多對應多少個探針,是哪些基因,是不是因為這些基因很長,所以在其上面設計多個探針呢?
> summary(hgu95av2SYMBOL)
# SYMBOL map for chip hgu95av2 (object of class "ProbeAnnDbBimap")
# |
# | Lkeyname: probe_id (Ltablename: probes)
# | Lkeys: "1000_at", "1001_at", ... (total=12625/mapped=11460)
# |
# | Rkeyname: symbol (Rtablename: gene_info)
# | Rkeys: "A1BG", "A2M", ... (total=61468/mapped=8585)
# |
# | direction: L --> R
共有61468個基因,對應上8585個。
> table(ids$symbol) %>% sort() %>% tail()
# YME1L1 GAPDH INPP4A MYB PTGER3 STAT1
# 7 8 8 8 8 8
列出 ids 的 'symbol' 列及每個元素出現的個數,排序,顯示最后6個;
得到對應最多探針的基因。
基因長度與設計在上面的探針數量無關。
7. 第二步提取到的表達矩陣,找到那些不在 hgu95av2.db 包收錄的對應著SYMBOL的探針
> mapped_probes <- mappedkeys(hgu95av2SYMBOL)
> exprSetpb <- row.names(exprSet)
> length(grep("FALSE",(exprSetpb %in% mapped_probes)))
# [1] 1165
> grep("FALSE",(exprSetpb %in% mapped_probes))
# [1] 8 52 98 99 100 109 115 117 128 135 157 168
# [13] 169 171 174 195 197 199 203 219 225 282 283 292
# ...
> head(exprSetpb[index])
# [1] "1007_s_at" "1047_s_at" "1089_i_at" "108_g_at" "1090_f_at" "1099_s_at"
不確定對不對....
8. 過濾表達矩陣
刪除那1165個沒有對應基因名字的探針。
> e = exprSet[rownames(exprSet) %in% mapped_probes,]
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
# ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
> str(e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
# ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
9. 整合表達矩陣
多個探針對應一個基因的情況下,只保留在所有樣本里面平均表達量最大的那個探針。
> maxid = by(e,ids$symbol,function(x) rownames(x)[which.max(rowMeans(x))])
# 按照 ids$symbol 將 e 分組,取行平均值最大的行 的 行名
# 但,為什么要用 symbol 分組呢 ??
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> maxid = by(e,ids$probe_id,function(x) rownames(x)[which.max(rowMeans(x))])
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> str(uni_e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
# ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
# emmmmm 結局是一樣的,ok我懂了
10. 把過濾后的表達矩陣更改行名為基因的symbol
因為這個時候探針和基因是一對一關系了。
> rownames(uni_e) = ids[match(rownames(uni_e),ids$probe_id),2]
# 將ids的第2列(symbol)對應到 uni_e 的行名
> library(reshape2)
> m_e = melt(uni_e)
> colnames(m_e) = c('symbol','sample','value')
分組:
> pd = pData(sCLLex)
> Disease = pd[,2]
> table(Disease)
# Disease
# progres. stable
# 14 8
> m_e$group = rep(Disease,each=nrow(uni_e))
11. 對第10步得到的表達矩陣進行探索
先畫第一個樣本的所有基因的表達量的boxplot,hist,density,然后畫所有樣本的這些圖
> suppressPackageStartupMessages(library(ggplot2))
> ggplot(m_e,aes(x=sample,y=value,fill=group))+geom_boxplot()
> ggplot(m_e,aes(value,fill=group)) + geom_histogram(bins = 200)+facet_wrap(~sample, nrow = 4)
> ggplot(m_e,aes(value,col=group)) + geom_density()
[圖片上傳失敗...(image-e00840-1559916805026)]
12. 理解統計學指標mean,median,max,min,sd,var,mad
并計算出每個基因在所有樣本的這些統計學指標,最后按照mad值排序,取top 50 mad值的基因,得到列表。(注意:這個題目出的并不合規,請仔細看。)
12.1 mean
算術平均值
> meanlist=apply(uni_e, 1, mean)
> head(meanlist)
# UTF1 EVX1 ADGRB1 SYN1 CHRNB4 ARR3
# -0.14880683 -0.13705268 0.05817044 0.07589677 0.08317654 0.20826165
12.2 median
中位數
> mdlist=apply(uni_e, 1, median)
> head(mdlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 5.368993 2.333562 3.378740 7.402258 5.902370 3.320628
12.3 max
最大值
> maxlist=apply(uni_e, 1, max)
> head(maxlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 6.251678 2.662170 3.798335 8.802729 9.919125 3.574332
12.4 min
最小值
> minlist=apply(uni_e, 1, min)
> head(minlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 4.826131 2.222276 3.247276 6.456285 4.097580 3.213493
12.5 sd
標準差
> sdlist=apply(uni_e, 1, sd)
> head(sdlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 0.36652641 0.09046121 0.12801488 0.58908623 1.73368583 0.10074804
12.6 var
方差
> varlist=apply(uni_e, 1, var)
> head(varlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 0.13434161 0.00818323 0.01638781 0.34702258 3.00566656 0.01015017
12.7 mad
絕對中位差Median Absolute Deviation
> madlist=apply(uni_e, 1, mad)
> head(madlist)
# MAPK3 TIE1 CYP2C19 CXCR5 DUSP1 MMP10
# 0.22779776 0.06516670 0.08294023 0.38497146 1.43854685 0.09476130
top 50 mad值的基因:
> tail(sort(madlist),50)
# PFN2 TNFSF9 ARHGAP44 P2RY14 THEMIS2 LPL ANXA4 DUSP6
# 1.481294 1.485155 1.488032 1.505107 1.507287 1.532212 1.533327 1.551320
# DUSP5 H1FX FLNA CLEC2B TSPYL2 ZNF266 S100A9 NR4A2
# 1.553782 1.557412 1.574436 1.578055 1.582430 1.587748 1.608285 1.612875
# TGFBI ARF6 APBB2 VCAN RBM38 CAPG PLXNC1 RGS2
# 1.643149 1.654548 1.674443 1.681098 1.703638 1.713747 1.718297 1.770151
# RNASE6 VAMP5 CYBB GNLY CCL3 OAS1 TRIB2 ZNF804A
# 1.775430 1.791017 1.811973 1.814364 1.862311 1.883509 1.937294 1.986163
# IGH PCDH9 VIPR1 COBLL1 GUSBP11 S100A8 HBB LHFPL2
# 2.090866 2.144223 2.171912 2.179666 2.204212 2.220970 2.267136 2.317045
# FCN1 ZAP70 IGLC1 LGALS1 FOS SLAMF1 TCF7 DMD
# 2.371590 2.579046 2.590895 2.600604 2.938078 2.944105 2.993352 3.071541
# IGF2BP3 FAM30A
# 3.234011 3.982191
13. 根據第12步驟得到top 50 mad值的基因列表來取表達矩陣的子集
并且熱圖可視化子表達矩陣。試試看其它5種熱圖的包的不同效果。
取子集:
> top50gene=tail(sort(madlist),50)
> top50gene=as.data.frame(top50gene)
> top50genelist=rownames(top50gene)
> top50matrix=uni_e[top50genelist,]
> ct=scale(top50matrix,center=T,scale=F) ## 中心化
> nmlztop50matrix=scale(ct,center=T,scale=T) ## 標準化
> pheatmap::pheatmap(nmlztop50matrix)
> heatmap(nmlztop50matrix)
> ggplot(ml_nmtop50,aes(sample,symbol))+
+ geom_tile(aes(fill=value),colour = "white")+
+ scale_fill_gradient(low = "white",high = "steelblue")
> library(gplots)
> heatmap.2(nmlztop50matrix,key = T,symkey = FALSE,density.info="none",trace="none")
> library(lattice)
> library(latticeExtra)
> x <- t(as.matrix(scale(nmlztop50matrix)))
> dd.row <- as.dendrogram(hclust(dist(x)))
> row.ord <- order.dendrogram(dd.row)
> levelplot(t(nmlztop50matrix),aspect = "fill",xlab="sample",ylab="symbol",colorkey = list(space = "left"),legend=list(right=list(fun=dendrogramGrob,args=list(x=dd.row,rod=row.ord,side='right',size=5))))
> levelplot(t(nmlztop50matrix),aspect =
+ "fill",xlab="sample",ylab="symbol",colorkey =
+ list(space = "left"),legend=
+ list(right=list(fun=dendrogramGrob,args=
+ list(x=dd.row,rod=row.ord,side='right',size=5))))
emmm PDF示例里的代碼跑出來top50matirx里是很多基因的重復,so這是我琢磨出來的結果
14. 取不同統計學指標mean,median,max,mean,sd,var,mad的各top50基因列表
使用UpSetR包來看他們之間的overlap情況。
各列表:
> e_mean=tail(sort(meanlist),50)
> e_md=tail(sort(mdlist),50)
> e_max=tail(sort(maxlist),50)
> e_min=tail(sort(minlist),50)
> e_sd=tail(sort(sdlist),50)
> e_var=tail(sort(varlist),50)
> e_mad=tail(sort(madlist),50)
> suppressPackageStartupMessages(library("UpSetR"))
> library(magrittr)
> e_all = data.frame(all,
+ e_mean=ifelse(all %in% names(e_mean),1,0),
+ e_md=ifelse(all %in% names(e_md),1,0),
+ e_max=ifelse(all %in% names(e_max),1,0),
+ e_min=ifelse(all %in% names(e_min),1,0),
+ e_sd=ifelse(all %in% names(e_sd),1,0),
+ e_var=ifelse(all %in% names(e_var),1,0),
+ e_mad=ifelse(all %in% names(e_mad),1,0)
+ )
> upset(e_all,nsets = 7,sets.bar.color = "#BD1111")
15. 在第二步的基礎上面提取CLL包里面的data(sCLLex) 數據對象的樣本的表型數據。
> pd=pData(sCLLex)
> grouplist=as.character(pd[,2])
> table(grouplist)
# grouplist
# progres. stable
# 14 8
16. 對所有樣本的表達矩陣進行聚類并且繪圖
然后添加樣本的臨床表型數據信息(更改樣本名)
> clust = t(exprSet)
> rownames(clust) = colnames(exprSet)
> View(clust)
> View(exprSet)
> clust_dist = dist(clust,method = "euclidean")
> hc = hclust(clust_dist,"ward.D")
> suppressPackageStartupMessages(library(factoextra))
> fviz_dend(hc, k = 4 ,cex = 0.5,k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),color_labels_by_k = TRUE, rect = TRUE)
17.對所有樣本的表達矩陣進行PCA分析并且繪圖,添加表型信息
> pca_data <- prcomp(t(exprSet),scale=TRUE)
> pcx <- data.frame(pca_data$x)
> pcr <- cbind(samples=rownames(pcx),grouplist, pcx)
> ggplot(pcr, aes(PC1, PC2))+geom_point(size=5, aes(color=grouplist)) +
+ geom_text(aes(label=samples),hjust=-0.1, vjust=-0.3)
18. 根據表達矩陣及樣本分組信息進行批量T檢驗,得到檢驗結果表格
分組,求p值
> gl = as.factor(grouplist)
> group1 = which(grouplist == levels(gl)[1])
> group2 = which(grouplist == levels(gl)[2])
# e = exprSet[rownames(exprSet) %in% mapped_probes,]
> et1 = e[,group1]
> et2 = e[,group2]
> data_t = cbind(et1,et2)
> pvals = apply(e, 1, function(x){
+ t.test(as.numeric(x)~grouplist)$p.value
+ }) ## p值
> p.adj = p.adjust(pvals, method = "BH") ## 校正后的p值
求log2FC, 做批量t檢驗
> data_mean_c = rowMeans(et1) ## control
> data_mean_t = rowMeans(et2) ## treatment
> log2FC = data_mean_t-data_mean_c
> DEG_t.test = cbind(data_mean_c, data_mean_t, log2FC, pvals, p.adj)
> DEG_t.test=DEG_t.test[order(DEG_t.test[,4]),] ##排序
> DEG_t.test=as.data.frame(DEG_t.test)
> head(DEG_t.test)
# data_mean_c data_mean_t log2FC pvals p.adj
# 36129_at 7.875615 8.791753 0.9161377 1.629755e-05 0.1867699
# 37676_at 6.622749 7.965007 1.3422581 4.058944e-05 0.2211373
# 33791_at 7.616197 5.786041 -1.8301554 6.965416e-05 0.2211373
# 39967_at 4.456446 2.152471 -2.3039752 8.993339e-05 0.2211373
# 34594_at 5.988866 7.058738 1.0698718 9.648226e-05 0.2211373
# 32198_at 4.157971 3.407405 -0.7505660 2.454557e-04 0.3192169
19. 使用limma包對表達矩陣及樣本分組信息進行差異分析,得到差異分析表格
重點看logFC和P值,畫火山圖(就是logFC和-log10(P值)的散點圖)。
> suppressPackageStartupMessages(library(limma))
> design1=model.matrix(~factor(grouplist)) ## 設計矩陣
> colnames(design1)=levels(factor(grouplist))
> rownames(design1)=colnames(exprSet)
> fit1 = lmFit(exprSet,design1)
> fit1=eBayes(fit1)
> options(digits = 3)
> mtx1 = topTable(fit1,coef=2,adjust='BH',n=Inf)
> DEG_mtx1 = na.omit(mtx1) ##去除缺失值
> head(DEG_mtx1)
# logFC AveExpr t P.Value adj.P.Val B
# 39400_at 1.028 5.62 5.84 8.34e-06 0.0334 3.23
# 36131_at -0.989 9.95 -5.77 9.67e-06 0.0334 3.12
# 33791_at -1.830 6.95 -5.74 1.05e-05 0.0334 3.05
# 1303_at 1.384 4.46 5.73 1.06e-05 0.0334 3.04
# 36122_at -0.780 7.26 -5.14 4.21e-05 0.1062 1.93
# 36939_at -2.547 6.92 -5.04 5.36e-05 0.1128 1.74
畫圖:
> DEG=DEG_mtx1
> logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC)) )
> DEG$result = as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff, ## p.Value<0.05,|logFC|>1
+ ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT'))
> title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), #round保留小數位數
+
+ '\nThe number of up gene is ',nrow(DEG[DEG$result =='UP',]) ,
+
+ '\nThe number of down gene is ',nrow(DEG[DEG$result =='DOWN',])
+ )
> ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=result)) +
+ geom_point(alpha=0.4, size=1.75) +
+ theme_set(theme_set(theme_bw(base_size=20)))+
+ xlab("log2 fold change") + ylab("-log10 p-value") +
+ ggtitle( title ) + theme(plot.title = element_text(size=15,hjust = 0.5))+
+ scale_colour_manual(values = c('blue','black','red'))
20.對T檢驗結果的P值和limma包差異分析的P值畫散點圖,看看哪些基因相差很大?
> DEG_t.test = DEG_t.test[rownames(DEG_mtx1),]
> colnames(DEG_t.test)
# [1] "data_mean_c" "data_mean_t" "log2FC" "pvals"
# [5] "p.adj"
> colnames(DEG_mtx1)
# [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val"
# [6] "B"
畫圖:
> plot(DEG_t.test[,4],DEG_mtx1[,4])
> plot(-log10(DEG_t.test[,4]),-log10(DEG_mtx1[,4]))
最后,向大家隆重推薦生信技能樹的一系列干貨!
- 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小時生信工程師教學視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招學徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw