『生信技能樹』R語言20題之我的答案版本

SO HOT!


原題目在這里

1. 安裝一些R包

1.1 安裝

1.1.1 ALL

> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> 
> BiocManager::install("ALL")

1.1.2 pasilla

> if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
>
> BiocManager::install("pasilla")

1.1.3 airway

> if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
>
> BiocManager::install("airway")

1.1.4 limma

> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> 
> BiocManager::install("limma")

1.1.5 DESeq2

> if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")

> BiocManager::install("DESeq2")

1.1.6 clusterProfiler

> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> 
> BiocManager::install("clusterProfiler")

1.1.7 reshape2

> install.packages('reshape2')

1.1.8 ggplot2

> install.packages('ggplot2')

1.2 運行

library()檢查已安裝的包是否能夠運行

e.g.

> library(CLL)
# 載入需要的程輯包:affy
# 載入需要的程輯包:BiocGenerics
# 載入需要的程輯包:parallel
# 
# 載入程輯包:‘BiocGenerics’
# 
# The following objects are masked from ‘package:parallel’:
#   
#   clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
# clusterExport, clusterMap, parApply, parCapply, parLapply,
# parLapplyLB, parRapply, parSapply, parSapplyLB
# 
# The following objects are masked from ‘package:stats’:
#   
#   IQR, mad, sd, var, xtabs
# 
# The following objects are masked from ‘package:base’:
#   
#   anyDuplicated, append, as.data.frame, basename, cbind, colnames,
# dirname, do.call, duplicated, eval, evalq, Filter, Find, get,
# grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match,
# mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
# rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
# tapply, union, unique, unsplit, which, which.max, which.min
# 
# 載入需要的程輯包:Biobase
# Welcome to Bioconductor
# 
# Vignettes contain introductory material; view with
# 'browseVignettes()'. To cite Bioconductor, see
# 'citation("Biobase")', and for packages 'citation("pkgname")'.

若試圖運行一個沒有安裝的包:

> available.packages()
#                              Package                         Version         
# A3                            "A3"                            "1.0.0" 
> library(A3)
# Error in library(A3) : 不存在叫‘A3’這個名字的程輯包

2. 了解ExpressionSet對象

比如CLL包里面就有data(sCLLex) ,找到它包含的元素,提取其表達矩陣(使用exprs函數),查看其大小。

提取表達矩陣:

> library('CLL')
> data("sCLLex")
> exprSet <- exprs(sCLLex)

查看大小:

> dim(exprSet)
# [1] 12625    22
> str(exprSet)
#  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...

3. 了解 str,head,help函數,作用于 第二步提取到的表達矩陣

3.1 str()

> str(exprSet)
#  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...

Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It calls args for (non-primitive) function objects.

用于查看矩陣的大小(列數、行數),查看列名、行名的示例。

3.2 head()

> head(exprSet)
#          CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL CLL16.CEL
# 1000_at    5.743132  6.219412  5.523328  5.340477  5.229904  4.920686
# 1001_at    2.285143  2.291229  2.287986  2.295313  2.662170  2.278040
# 1002_f_at  3.309294  3.318466  3.354423  3.327130  3.365113  3.568353
# 1003_s_at  1.085264  1.117288  1.084010  1.103217  1.074243  1.073097
# 1004_at    7.544884  7.671801  7.474025  7.152482  6.902932  7.368660
# 1005_at    5.083793  7.610593  7.631311  6.518594  5.059087  4.855161
# ...

Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

用于查看對象的前幾行,默認為前6行,可通過 n = xL 調整。

3.3 help()

> help("exprs")
R Documentation

help is the primary interface to the help systems.

用于調出 R Documentation 對某個函數的說明頁面,?fuctionhelp() 的簡略寫法

4. 安裝并了解 hgu95av2.db 包

看看 ls(“package:hgu95av2.db”) 后顯示的哪些變量?

安裝:

> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>  
> BiocManager::install("hgu95av2.db")
> library(hgu95av2.db)
> ls("package:hgu95av2.db")
# [1] "hgu95av2"              "hgu95av2.db"           "hgu95av2_dbconn"     
# [4] "hgu95av2_dbfile"       "hgu95av2_dbInfo"       "hgu95av2_dbschema"   
# [7] "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"   "hgu95av2CHR"         
# [10] "hgu95av2CHRLENGTHS"    "hgu95av2CHRLOC"        "hgu95av2CHRLOCEND" 
# [13] "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"   
# [16] "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"   
# [19] "hgu95av2GO"            "hgu95av2GO2ALLPROBES"  "hgu95av2GO2PROBE"   
# [22] "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"       
# [25] "hgu95av2ORGANISM"      "hgu95av2ORGPKG"        "hgu95av2PATH"       
# [28] "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"       
# [31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"     
# [34] "hgu95av2SYMBOL"        "hgu95av2UNIGENE"       "hgu95av2UNIPROT"   
> as.list(hgu95av2GO[1]
# $`1000_at`$`GO:0097110`
# $`1000_at`$`GO:0097110`$GOID
# [1] "GO:0097110"

# $`1000_at`$`GO:0097110`$Evidence
# [1] "IEA"

# $`1000_at`$`GO:0097110`$Ontology
# [1] "MF"

1000_at為探針,分別顯示GOID, Evidence(基因注釋證據代碼), Ontology(基因本體論中的三個分類,BP = Biological Process; CC = Cellular Component; MF = Molecular Function).

> as.list(hgu95av2UNIGENE[1])
# $`1000_at`
# [1] "Hs.861"

"Hs.861"為NCBI UniGene數據庫的ID.

提取其中的數據:

> get(featureNames(sCLLex)[1],hgu95av2GO)[1]
# $`GO:0000165`
# $`GO:0000165`$GOID
# [1] "GO:0000165"

# $`GO:0000165`$Evidence
# [1] "NAS"

# $`GO:0000165`$Ontology
# [1] "BP"

5. 理解 head(toTable(hgu95av2SYMBOL)) 的用法

找到TP53基因對應的探針ID

> head(toTable(hgu95av2SYMBOL))
#    probe_id  symbol
# 1   1000_at   MAPK3
# 2   1001_at    TIE1
# 3 1002_f_at CYP2C19
# 4 1003_s_at   CXCR5
# 5   1004_at   CXCR5
# 6   1005_at   DUSP1

找到TP53基因對應的探針ID:

> ids <- toTable(hgu95av2SYMBOL)
> grep("TP53",ids$symbol)
# [1]   732   884   966   997  1420  2675  3322  4120  4304  5475  5526 10084
> ids[grep("TP53",ids$symbol),]
#         probe_id  symbol
# 732      1711_at TP53BP1
# 884      1860_at TP53BP2
# 966      1939_at    TP53
# 997    1974_s_at    TP53
# 1420    31618_at    TP53
# 2675    33025_at TP53TG5
# 3322    33749_at TP53TG1
# 4120    34629_at TP53I11
# 4304    34822_at TP53BP2
# 5475    36079_at  TP53I3
# 5526    36136_at TP53I11
# 10084 40986_s_at TP53BP1

6. 理解探針與基因的對應關系

總共多少個基因,基因最多對應多少個探針,是哪些基因,是不是因為這些基因很長,所以在其上面設計多個探針呢?

> summary(hgu95av2SYMBOL)
# SYMBOL map for chip hgu95av2 (object of class "ProbeAnnDbBimap")
# |
# | Lkeyname: probe_id (Ltablename: probes)
# |    Lkeys: "1000_at", "1001_at", ... (total=12625/mapped=11460)
# |
# | Rkeyname: symbol (Rtablename: gene_info)
# |    Rkeys: "A1BG", "A2M", ... (total=61468/mapped=8585)
# |
# | direction: L --> R

共有61468個基因,對應上8585個。

> table(ids$symbol) %>% sort() %>% tail()

# YME1L1  GAPDH INPP4A    MYB PTGER3  STAT1 
#     7      8      8      8      8      8 

列出 ids 的 'symbol' 列及每個元素出現的個數,排序,顯示最后6個;

得到對應最多探針的基因。

基因長度與設計在上面的探針數量無關。

7. 第二步提取到的表達矩陣,找到那些不在 hgu95av2.db 包收錄的對應著SYMBOL的探針

> mapped_probes <- mappedkeys(hgu95av2SYMBOL)
> exprSetpb <- row.names(exprSet)
> length(grep("FALSE",(exprSetpb %in% mapped_probes)))
# [1] 1165
> grep("FALSE",(exprSetpb %in% mapped_probes))
# [1]     8    52    98    99   100   109   115   117   128   135   157   168
# [13]   169   171   174   195   197   199   203   219   225   282   283   292
# ...
> head(exprSetpb[index])
# [1] "1007_s_at" "1047_s_at" "1089_i_at" "108_g_at"  "1090_f_at" "1099_s_at"

不確定對不對....

8. 過濾表達矩陣

刪除那1165個沒有對應基因名字的探針。

> e = exprSet[rownames(exprSet) %in% mapped_probes,]
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
#  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
> str(e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
#  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...

9. 整合表達矩陣

多個探針對應一個基因的情況下,只保留在所有樣本里面平均表達量最大的那個探針。

> maxid = by(e,ids$symbol,function(x) rownames(x)[which.max(rowMeans(x))])
# 按照 ids$symbol 將 e 分組,取行平均值最大的行 的 行名
# 但,為什么要用 symbol 分組呢 ??
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> maxid = by(e,ids$probe_id,function(x) rownames(x)[which.max(rowMeans(x))])
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> str(uni_e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
#  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
# emmmmm 結局是一樣的,ok我懂了

10. 把過濾后的表達矩陣更改行名為基因的symbol

因為這個時候探針和基因是一對一關系了。

> rownames(uni_e) = ids[match(rownames(uni_e),ids$probe_id),2]
# 將ids的第2列(symbol)對應到 uni_e 的行名
> library(reshape2)
> m_e = melt(uni_e)
> colnames(m_e) = c('symbol','sample','value')
reshape2_melt()

分組:

> pd = pData(sCLLex)
> Disease = pd[,2]
> table(Disease)
# Disease
# progres.   stable 
#      14        8 
> m_e$group = rep(Disease,each=nrow(uni_e))

11. 對第10步得到的表達矩陣進行探索

先畫第一個樣本的所有基因的表達量的boxplot,hist,density,然后畫所有樣本的這些圖

> suppressPackageStartupMessages(library(ggplot2))
> ggplot(m_e,aes(x=sample,y=value,fill=group))+geom_boxplot()
ggplot_box
> ggplot(m_e,aes(value,fill=group)) + geom_histogram(bins = 200)+facet_wrap(~sample, nrow = 4)
ggplot_facetwrap
> ggplot(m_e,aes(value,col=group)) + geom_density()

[圖片上傳失敗...(image-e00840-1559916805026)]

12. 理解統計學指標mean,median,max,min,sd,var,mad

并計算出每個基因在所有樣本的這些統計學指標,最后按照mad值排序,取top 50 mad值的基因,得到列表。(注意:這個題目出的并不合規,請仔細看。)

12.1 mean

算術平均值

> meanlist=apply(uni_e, 1, mean)
> head(meanlist)
#  UTF1        EVX1      ADGRB1        SYN1      CHRNB4        ARR3 
# -0.14880683 -0.13705268  0.05817044  0.07589677  0.08317654  0.20826165 

12.2 median

中位數

> mdlist=apply(uni_e, 1, median)
> head(mdlist)
#    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
# 5.368993 2.333562 3.378740 7.402258 5.902370 3.320628

12.3 max

最大值

> maxlist=apply(uni_e, 1, max)
> head(maxlist)
#    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
# 6.251678 2.662170 3.798335 8.802729 9.919125 3.574332 

12.4 min

最小值

> minlist=apply(uni_e, 1, min)
> head(minlist)
#    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
# 4.826131 2.222276 3.247276 6.456285 4.097580 3.213493 

12.5 sd

標準差

> sdlist=apply(uni_e, 1, sd)
> head(sdlist)
#      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
# 0.36652641 0.09046121 0.12801488 0.58908623 1.73368583 0.10074804 

12.6 var

方差

> varlist=apply(uni_e, 1, var)
> head(varlist)
#      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
# 0.13434161 0.00818323 0.01638781 0.34702258 3.00566656 0.01015017 

12.7 mad

絕對中位差Median Absolute Deviation

> madlist=apply(uni_e, 1, mad)
> head(madlist)
#      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
# 0.22779776 0.06516670 0.08294023 0.38497146 1.43854685 0.09476130 

top 50 mad值的基因:

> tail(sort(madlist),50)
#     PFN2   TNFSF9 ARHGAP44   P2RY14  THEMIS2      LPL    ANXA4    DUSP6 
# 1.481294 1.485155 1.488032 1.505107 1.507287 1.532212 1.533327 1.551320 
#    DUSP5     H1FX     FLNA   CLEC2B   TSPYL2   ZNF266   S100A9    NR4A2 
# 1.553782 1.557412 1.574436 1.578055 1.582430 1.587748 1.608285 1.612875 
#    TGFBI     ARF6    APBB2     VCAN    RBM38     CAPG   PLXNC1     RGS2 
# 1.643149 1.654548 1.674443 1.681098 1.703638 1.713747 1.718297 1.770151 
#   RNASE6    VAMP5     CYBB     GNLY     CCL3     OAS1    TRIB2  ZNF804A 
# 1.775430 1.791017 1.811973 1.814364 1.862311 1.883509 1.937294 1.986163 
#      IGH    PCDH9    VIPR1   COBLL1  GUSBP11   S100A8      HBB   LHFPL2 
# 2.090866 2.144223 2.171912 2.179666 2.204212 2.220970 2.267136 2.317045 
#     FCN1    ZAP70    IGLC1   LGALS1      FOS   SLAMF1     TCF7      DMD 
# 2.371590 2.579046 2.590895 2.600604 2.938078 2.944105 2.993352 3.071541 
#  IGF2BP3   FAM30A 
# 3.234011 3.982191 

13. 根據第12步驟得到top 50 mad值的基因列表來取表達矩陣的子集

并且熱圖可視化子表達矩陣。試試看其它5種熱圖的包的不同效果。

取子集:

> top50gene=tail(sort(madlist),50)
> top50gene=as.data.frame(top50gene)
> top50genelist=rownames(top50gene)
> top50matrix=uni_e[top50genelist,]
> ct=scale(top50matrix,center=T,scale=F)  ## 中心化
> nmlztop50matrix=scale(ct,center=T,scale=T) ## 標準化
> pheatmap::pheatmap(nmlztop50matrix)
pheatmap
> heatmap(nmlztop50matrix)
heatmap
> ggplot(ml_nmtop50,aes(sample,symbol))+
+   geom_tile(aes(fill=value),colour = "white")+
+   scale_fill_gradient(low = "white",high = "steelblue")
ggplot_heatmap
> library(gplots)
> heatmap.2(nmlztop50matrix,key = T,symkey = FALSE,density.info="none",trace="none")
gplots_heatmap.2
> library(lattice)
> library(latticeExtra)
> x  <- t(as.matrix(scale(nmlztop50matrix)))
> dd.row <- as.dendrogram(hclust(dist(x)))
> row.ord <- order.dendrogram(dd.row)
> levelplot(t(nmlztop50matrix),aspect = "fill",xlab="sample",ylab="symbol",colorkey = list(space = "left"),legend=list(right=list(fun=dendrogramGrob,args=list(x=dd.row,rod=row.ord,side='right',size=5))))
> levelplot(t(nmlztop50matrix),aspect =
+             "fill",xlab="sample",ylab="symbol",colorkey =
+             list(space = "left"),legend=
+             list(right=list(fun=dendrogramGrob,args=
+                               list(x=dd.row,rod=row.ord,side='right',size=5))))
lattice_levelplot

emmm PDF示例里的代碼跑出來top50matirx里是很多基因的重復,so這是我琢磨出來的結果

14. 取不同統計學指標mean,median,max,mean,sd,var,mad的各top50基因列表

使用UpSetR包來看他們之間的overlap情況。

各列表:

> e_mean=tail(sort(meanlist),50)
> e_md=tail(sort(mdlist),50)
> e_max=tail(sort(maxlist),50)
> e_min=tail(sort(minlist),50)
> e_sd=tail(sort(sdlist),50)
> e_var=tail(sort(varlist),50)
> e_mad=tail(sort(madlist),50)
> suppressPackageStartupMessages(library("UpSetR"))
> library(magrittr)
> e_all = data.frame(all,
+                    e_mean=ifelse(all %in% names(e_mean),1,0),
+                    e_md=ifelse(all %in% names(e_md),1,0),
+                    e_max=ifelse(all %in% names(e_max),1,0),
+                    e_min=ifelse(all %in% names(e_min),1,0),
+                    e_sd=ifelse(all %in% names(e_sd),1,0),
+                    e_var=ifelse(all %in% names(e_var),1,0),
+                    e_mad=ifelse(all %in% names(e_mad),1,0)
+ )
> upset(e_all,nsets = 7,sets.bar.color = "#BD1111")
UpSetR_magrittr

15. 在第二步的基礎上面提取CLL包里面的data(sCLLex) 數據對象的樣本的表型數據。

> pd=pData(sCLLex)
> grouplist=as.character(pd[,2])
> table(grouplist)
# grouplist
# progres.   stable 
#       14        8 

16. 對所有樣本的表達矩陣進行聚類并且繪圖

然后添加樣本的臨床表型數據信息(更改樣本名)

> clust = t(exprSet)
> rownames(clust) = colnames(exprSet)
> View(clust)
> View(exprSet)
> clust_dist = dist(clust,method = "euclidean")
> hc = hclust(clust_dist,"ward.D")
> suppressPackageStartupMessages(library(factoextra))
> fviz_dend(hc, k = 4 ,cex = 0.5,k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),color_labels_by_k = TRUE, rect = TRUE)
hclust_fivzdend

17.對所有樣本的表達矩陣進行PCA分析并且繪圖,添加表型信息

> pca_data <- prcomp(t(exprSet),scale=TRUE)
> pcx <- data.frame(pca_data$x)
> pcr <- cbind(samples=rownames(pcx),grouplist, pcx) 
> ggplot(pcr, aes(PC1, PC2))+geom_point(size=5, aes(color=grouplist)) +
+   geom_text(aes(label=samples),hjust=-0.1, vjust=-0.3)
procmp_ggplot

18. 根據表達矩陣及樣本分組信息進行批量T檢驗,得到檢驗結果表格

分組,求p值

> gl = as.factor(grouplist)
> group1 = which(grouplist == levels(gl)[1])
> group2 = which(grouplist == levels(gl)[2])
# e = exprSet[rownames(exprSet) %in% mapped_probes,]
> et1 = e[,group1]
> et2 = e[,group2]
> data_t = cbind(et1,et2)
> pvals = apply(e, 1, function(x){
+   t.test(as.numeric(x)~grouplist)$p.value
+ })    ## p值
> p.adj = p.adjust(pvals, method = "BH")  ## 校正后的p值

求log2FC, 做批量t檢驗

> data_mean_c = rowMeans(et1) ## control
> data_mean_t = rowMeans(et2) ## treatment
> log2FC = data_mean_t-data_mean_c
> DEG_t.test = cbind(data_mean_c, data_mean_t, log2FC, pvals, p.adj)
> DEG_t.test=DEG_t.test[order(DEG_t.test[,4]),] ##排序
> DEG_t.test=as.data.frame(DEG_t.test)
> head(DEG_t.test)
#         data_mean_c data_mean_t     log2FC        pvals     p.adj
# 36129_at    7.875615    8.791753  0.9161377 1.629755e-05 0.1867699
# 37676_at    6.622749    7.965007  1.3422581 4.058944e-05 0.2211373
# 33791_at    7.616197    5.786041 -1.8301554 6.965416e-05 0.2211373
# 39967_at    4.456446    2.152471 -2.3039752 8.993339e-05 0.2211373
# 34594_at    5.988866    7.058738  1.0698718 9.648226e-05 0.2211373
# 32198_at    4.157971    3.407405 -0.7505660 2.454557e-04 0.3192169

19. 使用limma包對表達矩陣及樣本分組信息進行差異分析,得到差異分析表格

重點看logFC和P值,畫火山圖(就是logFC和-log10(P值)的散點圖)。

> suppressPackageStartupMessages(library(limma))
> design1=model.matrix(~factor(grouplist))  ## 設計矩陣
> colnames(design1)=levels(factor(grouplist))
> rownames(design1)=colnames(exprSet)
> fit1 = lmFit(exprSet,design1)
> fit1=eBayes(fit1)
> options(digits = 3) 
> mtx1 = topTable(fit1,coef=2,adjust='BH',n=Inf) 
> DEG_mtx1 = na.omit(mtx1) ##去除缺失值
> head(DEG_mtx1)
#           logFC AveExpr     t  P.Value adj.P.Val    B
# 39400_at  1.028    5.62  5.84 8.34e-06    0.0334 3.23
# 36131_at -0.989    9.95 -5.77 9.67e-06    0.0334 3.12
# 33791_at -1.830    6.95 -5.74 1.05e-05    0.0334 3.05
# 1303_at   1.384    4.46  5.73 1.06e-05    0.0334 3.04
# 36122_at -0.780    7.26 -5.14 4.21e-05    0.1062 1.93
# 36939_at -2.547    6.92 -5.04 5.36e-05    0.1128 1.74

畫圖:

> DEG=DEG_mtx1
> logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC)) ) 
> DEG$result = as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff,  ## p.Value<0.05,|logFC|>1
+                               ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT'))
> title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), #round保留小數位數
+                     
+                     '\nThe number of up gene is ',nrow(DEG[DEG$result =='UP',]) ,
+                     
+                     '\nThe number of down gene is ',nrow(DEG[DEG$result =='DOWN',])
+ )
> ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=result)) +
+   geom_point(alpha=0.4, size=1.75) +
+   theme_set(theme_set(theme_bw(base_size=20)))+
+   xlab("log2 fold change") + ylab("-log10 p-value") +
+   ggtitle( title ) + theme(plot.title = element_text(size=15,hjust = 0.5))+
+   scale_colour_manual(values = c('blue','black','red'))
ggplot_volcano

20.對T檢驗結果的P值和limma包差異分析的P值畫散點圖,看看哪些基因相差很大?

> DEG_t.test = DEG_t.test[rownames(DEG_mtx1),]
> colnames(DEG_t.test)
# [1] "data_mean_c" "data_mean_t" "log2FC"      "pvals"      
# [5] "p.adj"      
> colnames(DEG_mtx1)
# [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val"
# [6] "B"        

畫圖:

> plot(DEG_t.test[,4],DEG_mtx1[,4])
> plot(-log10(DEG_t.test[,4]),-log10(DEG_mtx1[,4]))
ggplot_dot
ggplot_dot_lg

最后,向大家隆重推薦生信技能樹的一系列干貨!

  1. 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小時生信工程師教學視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招學徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
?著作權歸作者所有,轉載或內容合作請聯系作者
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發布,文章內容僅代表作者本人觀點,簡書系信息發布平臺,僅提供信息存儲服務。

推薦閱讀更多精彩內容