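1. The clusterProfiler package and its ID conversion function bitr()
clusterProfiler is a powerful R package written by Y叔 (Guangchuang Yu). It supports a range of enrichment analyses, including GO, KEGG, DO (Disease Ontology analysis), Reactome pathway analysis, and GSEA, and it also provides excellent visualization of enrichment results.
1.1 How to use bitr():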
bitr(geneID, fromType, toType, OrgDb, drop = TRUE)
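geneID: a vector of gene identifiers
OrgDb: the annotation package; org.Hs.eg.db for human, org.Mm.eg.db for mouse
fromType: the ID type of the input
toType: the ID type(s) to convert to; more than one may be given, as a character vector of uppercase key type names
drop: whether to drop input IDs that have no match (TRUE by default)
1.2 List the ID types available in org.Hs.eg.db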
keytypes(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GO" "GOALL" "IPI" "MAP" "OMIM"
[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
[26] "UNIPROT"
1.3 Example
library(clusterProfiler)
library(org.Hs.eg.db)
s2e <- bitr(deg$symbol,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
deg <- dplyr::inner_join(deg,s2e,by=c("symbol"="SYMBOL")) # append the conversion result to deg as a new column
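bitr() can return more than one row for a single input ID, and by default (drop = TRUE) IDs without a match are dropped, so the joined table may not have the same number of rows as the original deg. A minimal sketch of checking this in base R, using toy data in place of the real deg and s2e (the gene names and IDs below are made up for illustration and are not real mappings):

```r
# Toy stand-ins for deg and the bitr() result; all values are made up
deg_toy <- data.frame(symbol = c("GENE_A", "GENE_B", "GENE_C"),
                      logFC  = c(1.2, -0.8, 2.5),
                      stringsAsFactors = FALSE)
s2e_toy <- data.frame(SYMBOL   = c("GENE_A", "GENE_B", "GENE_B"),
                      ENTREZID = c("101", "202", "203"),
                      stringsAsFactors = FALSE)

# Symbols mapped to more than one ENTREZID
multi <- unique(s2e_toy$SYMBOL[duplicated(s2e_toy$SYMBOL)])
# Symbols with no mapping at all
unmapped <- setdiff(deg_toy$symbol, s2e_toy$SYMBOL)

# Base-R equivalent of the inner join above
joined <- merge(deg_toy, s2e_toy, by.x = "symbol", by.y = "SYMBOL")

# Optionally keep only the first ENTREZID per symbol
joined_unique <- joined[!duplicated(joined$symbol), ]
```

Whether to keep all mappings or only the first one depends on the downstream analysis; the sketch only makes the difference visible.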
Note that some data, such as data downloaded from TCGA, have row names that carry a version suffix, as shown below:
View(exp)
a = rownames(exp)
head(a)
# [1] "ENSG00000000003.13"
# [2] "ENSG00000000419.11"
# [3] "ENSG00000000457.12"
# [4] "ENSG00000000460.15"
# [5] "ENSG00000000938.11"
# [6] "ENSG00000000971.14"
Versioned Ensembl IDs like these are not recognized by bitr(). Split each ID at the dot and keep only the part before it.
library(clusterProfiler)
library(org.Hs.eg.db)
library(stringr)
a = str_split(a,"\\.",simplify = TRUE)[,1] # an unescaped "." is the regex wildcard for any character, so it must be escaped with \\
id = bitr(a,
fromType = "ENSEMBL",
toType = "SYMBOL",
OrgDb = "org.Hs.eg.db")
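The version suffix can also be removed with base R's sub(), which avoids the stringr dependency. A small sketch on three IDs copied from the output above; the duplicate check at the end is an added precaution and not part of the original workflow:

```r
ids_versioned <- c("ENSG00000000003.13", "ENSG00000000419.11", "ENSG00000000457.12")

# Remove everything from the first dot onward (same effect as keeping
# the first piece of the str_split() result)
ids_clean <- sub("\\..*$", "", ids_versioned)
print(ids_clean)

# Distinct versioned IDs could collapse to the same ID, so check for duplicates
any_dup <- anyDuplicated(ids_clean) > 0
```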
2. Converting probe IDs to gene symbols
2.1 Getting the mapping between probes and gene symbols
- Method 1: use the gpl_number to look up the corresponding bioc_package at http://www.bio-info-trainee.com/1399.html, then install that package.
For example, the most commonly used platform, GPL570, corresponds to hgu133plus2
if(!require(hgu133plus2.db))BiocManager::install("hgu133plus2.db") # note the .db suffix
library(hgu133plus2.db)
ls("package:hgu133plus2.db")
# [1] "hgu133plus2"
# [2] "hgu133plus2_dbconn"
# [3] "hgu133plus2_dbfile"
# [4] "hgu133plus2_dbInfo"
# [5] "hgu133plus2_dbschema"
# [6] "hgu133plus2.db"
# [7] "hgu133plus2ACCNUM"
# [8] "hgu133plus2ALIAS2PROBE"
# [9] "hgu133plus2CHR"
# [10] "hgu133plus2CHRLENGTHS"
# [11] "hgu133plus2CHRLOC"
# [12] "hgu133plus2CHRLOCEND"
# [13] "hgu133plus2ENSEMBL"
# [14] "hgu133plus2ENSEMBL2PROBE"
# [15] "hgu133plus2ENTREZID"
# [16] "hgu133plus2ENZYME"
# [17] "hgu133plus2ENZYME2PROBE"
# [18] "hgu133plus2GENENAME"
# [19] "hgu133plus2GO"
# [20] "hgu133plus2GO2ALLPROBES"
# [21] "hgu133plus2GO2PROBE"
# [22] "hgu133plus2MAP"
# [23] "hgu133plus2MAPCOUNTS"
# [24] "hgu133plus2OMIM"
# [25] "hgu133plus2ORGANISM"
# [26] "hgu133plus2ORGPKG"
# [27] "hgu133plus2PATH"
# [28] "hgu133plus2PATH2PROBE"
# [29] "hgu133plus2PFAM"
# [30] "hgu133plus2PMID"
# [31] "hgu133plus2PMID2PROBE"
# [32] "hgu133plus2PROSITE"
# [33] "hgu133plus2REFSEQ"
# [34] "hgu133plus2SYMBOL"
# [35] "hgu133plus2UNIGENE"
# [36] "hgu133plus2UNIPROT"
ids <- toTable(hgu133plus2SYMBOL)
head(ids)
# probe_id symbol
# 1 1053_at RFC2
# 2 117_at HSPA6
# 3 121_at PAX8
# 4 1255_g_at GUCA1A
# 5 1316_at THRA
# 6 1320_at PTPN21
- Method 2: read the SOFT file of the GPL platform and subset it by column
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
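library(GEOquery) # provides getGEO()
library(stringr)  # provides str_detect()
a = getGEO(gpl_number,destdir = ".") # gpl_number is the platform accession, e.g. "GPL570"
b = a@dataTable@table # the full annotation table of the platform
colnames(b)
ids2 = b[,c("ID","Gene Symbol")]
colnames(ids2) = c("probe_id","symbol")
# drop probes with no symbol and probes mapped to several genes (separated by "///")
ids2 = ids2[ids2$symbol!="" & !str_detect(ids2$symbol,"///"),]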
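In the GPL570 annotation table, a probe that maps to several genes lists them in the Gene Symbol column separated by " /// ". The following base-R sketch shows how the two filter conditions behave on a toy table; the probe IDs are hypothetical and the table is only a stand-in for ids2:

```r
# Toy stand-in for ids2; probe IDs are hypothetical
ids2_toy <- data.frame(probe_id = c("p1_at", "p2_at", "p3_at"),
                       symbol   = c("RFC2", "", "DDR1 /// MIR4640"),
                       stringsAsFactors = FALSE)

is_empty <- ids2_toy$symbol == ""                         # no gene symbol
is_multi <- grepl("///", ids2_toy$symbol, fixed = TRUE)   # several genes

# Keep only probes with exactly one gene symbol
ids2_kept <- ids2_toy[!is_empty & !is_multi, ]

# Cross-tabulate how many probes each condition removes
table(empty = is_empty, multi = is_multi)
```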
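- Method 3: download the annotation file from the vendor's website and read it in
Reference: http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus
- Method 4: annotate the probes yourself
Reference: https://mp.weixin.qq.com/s/mrtjpN8yDKUdCSvSUuUwcA
2.2 Adding probe annotation
# Add columns to the deg data frame (deg holds the differential analysis results: rows are probe_ids, columns are logFC and other statistics)
# 1. Add a probe_id column by turning the row names into a column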
library(dplyr)
deg <- mutate(deg,probe_id=rownames(deg))
head(deg)
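# 2. Add the probe annotation
# Several probes can map to the same gene, so deduplicate by gene
table(!duplicated(ids$probe_id))
table(!duplicated(ids$symbol)) # shows that many genes correspond to more than one probe
# Deduplicate on the symbol column; three criteria are common: maximum / mean / random
# Random deduplication is used here; for the other two see zz.去重方式.R
ids = ids[!duplicated(ids$symbol),] # keep the first occurrence of each duplicated symbol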
deg <- inner_join(deg,ids,by="probe_id")
head(deg)
nrow(deg)
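The comments above mention a maximum criterion as an alternative to keeping the first occurrence. One common reading of that criterion is to keep, for each gene, the probe with the highest mean expression across samples. A minimal sketch under that assumption, with a toy expression matrix exp_toy and mapping table ids_toy standing in for the real data (both hypothetical):

```r
# Toy expression matrix: 3 probes x 3 samples
exp_toy <- matrix(c(5, 6, 7,
                    9, 8, 10,
                    2, 3, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("p1_at", "p2_at", "p3_at"),
                                  c("s1", "s2", "s3")))
# Toy probe-to-symbol mapping: two probes share GENE_A
ids_toy <- data.frame(probe_id = c("p1_at", "p2_at", "p3_at"),
                      symbol   = c("GENE_A", "GENE_A", "GENE_B"),
                      stringsAsFactors = FALSE)

# Mean expression of each probe, looked up by probe_id
ids_toy$mean_exp <- rowMeans(exp_toy)[ids_toy$probe_id]
# Sort so that the most highly expressed probe of each gene comes first
ids_sorted <- ids_toy[order(ids_toy$symbol, -ids_toy$mean_exp), ]
# Keep one probe per gene
ids_max <- ids_sorted[!duplicated(ids_sorted$symbol), ]
print(ids_max)
```

The resulting table can be joined to deg by probe_id in the same way as ids.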
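3. Getting ID conversion information from a GTF file
Data downloaded from TCGA use Ensembl IDs as gene IDs. After differential analysis, each Ensembl ID in the results needs to be converted to its symbol and gene type (mRNA, lncRNA, or other). The GTF file contains exactly this mapping between gene symbols and Ensembl IDs.
Approach:
1. Find the reference genome annotation version that the TCGA data correspond to.
2. Download that version of the annotation file and extract the Ensembl ID to symbol mapping along with the gene type of each gene.
3. Add the symbol and gene type to the differential analysis results with merge, or convert the row names of the matrix before differential analysis.
3.1 Finding the reference genome annotation version
The GTF file does not separate out lncRNA directly, so you need the documentation of the biotypes used in the GTF file, which turns out to be a very long table.
Go to https://www.gencodegenes.org and click Biotypes under Documentation.
The description of lncRNA there reads: Generic long non-coding RNA biotype that replaced the following biotypes: 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic and sense_overlapping.
So the rows whose gene_type is one of these types are the lncRNAs. Matching them against the row names of the expression matrix then gives separate mRNA and lncRNA matrices.
3.2 Conversion
#step1: read and explore the GTF file ----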
#BiocManager::install("rtracklayer")
library(rtracklayer)
gtf = rtracklayer::import("gencode.v22.annotation.gtf")
class(gtf)
gtf = as.data.frame(gtf);dim(gtf)
colnames(gtf)
table(gtf$type)
#step2: keep only the rows that describe genes
gtf_gene = gtf[gtf$type=="gene",]
save(gtf_gene,file = "gtf_gene.Rdata")
table(rownames(deg) %in% gtf_gene$gene_id) # deg is the matrix whose IDs need converting
# FALSE TRUE
# 3 30345
an = gtf_gene[,c("gene_name","gene_id","gene_type")] # take these three columns from the GTF table
head(an)
# gene_name gene_id gene_type
# 1 DDX11L1 ENSG00000223972.5 transcribed_unprocessed_pseudogene
# 13 WASH7P ENSG00000227232.5 unprocessed_pseudogene
# 26 MIR6859-3 ENSG00000278267.1 miRNA
# 29 RP11-34P13.3 ENSG00000243485.3 lincRNA
# 37 MIR1302-9 ENSG00000274890.1 miRNA
# 40 FAM138A ENSG00000237613.2 lincRNA
deg = merge(deg,an,by.x = "row.names",by.y = "gene_id")
# How many mRNAs and lncRNAs are there in total?
lnc = c("3prime_overlapping_ncRNA", "antisense", "bidirectional_promoter_lncRNA", "lincRNA", "macro_lncRNA", "non_coding", "processed_transcript", "sense_intronic" , "sense_overlapping")
k1 = gtf_gene$gene_type %in% lnc;table(k1) # 14826 lncRNAs in the GTF
# k1
# FALSE TRUE
# 45657 14826
k2 = gtf_gene$gene_type == "protein_coding";table(k2) # 19814 mRNAs in the GTF
# k2
# FALSE TRUE
# 40669 19814
# How many mRNAs and lncRNAs are in deg?
k3 = deg$gene_type %in% lnc;table(k3) # 7501 lncRNAs in deg
# k3
# FALSE TRUE
# 22844 7501
k4 = deg$gene_type =="protein_coding";table(k4) # 17464 mRNAs in deg
# k4
# FALSE TRUE
# 12881 17464
# How many differentially expressed mRNAs and lncRNAs are there?
k5 = deg$change !="NOT"
table(k3&k5)
# FALSE TRUE # 396 differentially expressed lncRNAs
# 29949 396
table(k4&k5)
# FALSE TRUE # 1084 differentially expressed mRNAs
# 29261 1084
Converting the row names of the expression matrix
load("gtf_gene.Rdata")
an = gtf_gene[,c("gene_name","gene_id","gene_type")]
exp = exp[rownames(exp) %in% an$gene_id,] # drop genes from the expression matrix that are not in the GTF
an = an[match(rownames(exp),an$gene_id),] # reorder an to match the row names of the expression matrix
identical(an$gene_id,rownames(exp)) # check that the order matches
# [1] TRUE
k = !duplicated(an$gene_name);table(k) # gene symbols will become row names, and row names must be unique, so deduplicate first
# k
# FALSE TRUE
# 193 30152
an = an[k,]
exp = exp[k,]
rownames(exp) = an$gene_name
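Once the annotation table is aligned row by row with the expression matrix, separate mRNA and lncRNA matrices can be taken by indexing rows on gene_type. A minimal sketch with toy data; exp_toy and an_toy are hypothetical stand-ins for exp and an, and lnc_types is shortened from the full lnc vector defined earlier:

```r
# Toy annotation and expression matrix with matching row order
an_toy <- data.frame(gene_name = c("G1", "G2", "G3", "G4"),
                     gene_type = c("protein_coding", "lincRNA", "miRNA", "antisense"),
                     stringsAsFactors = FALSE)
exp_toy <- matrix(1:8, nrow = 4,
                  dimnames = list(an_toy$gene_name, c("s1", "s2")))
lnc_types <- c("lincRNA", "antisense")  # shortened for the toy example

# Logical indexing is only safe when both objects share the same row order
stopifnot(identical(rownames(exp_toy), an_toy$gene_name))

# drop = FALSE keeps the result a matrix even if only one row is selected
mRNA_exp <- exp_toy[an_toy$gene_type == "protein_coding", , drop = FALSE]
lnc_exp  <- exp_toy[an_toy$gene_type %in% lnc_types, , drop = FALSE]
```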
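Summary of common conversion needs:
- GO analysis: convert gene symbols to ENTREZID
- Microarray analysis: convert probe IDs to gene symbols
- TCGA data: convert the Ensembl IDs in the downloaded matrix to gene symbols (1 and 3 are essentially the same). You can convert the matrix directly, or join the gene symbol as a column onto the differential expression results after the analysis.
The code is from the 2021 生信技能樹 data mining course.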