解決GEO數(shù)據(jù)中的多個探針對應一個基因
- 當我們來到這一步時我們默認經(jīng)過前期的處理,你已經(jīng)獲得了表達矩陣的數(shù)據(jù),獲得了探針的注釋信息。
- 在進一步探索過程中我們發(fā)現(xiàn)了一個存在多個探針對應一個基因的情況,這是本文需要解決的問題
- 解決的辦法可以是取多個探針中的最大值,也可以是平均值,中位值等等,都存在一定的道理。這同時也提示了我們高通量數(shù)據(jù)的篩選性質(zhì),存在很大的不確定性,并不是基因定量的金標準,常常發(fā)生芯片或者測序的結(jié)果有些難以得到驗證的情況。
- 本文將通過四種方式來完成取多個探針的平均值作為表達的目的,當然如果想用最大值,直接將 mean換成 max即可。
載入數(shù)據(jù)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
if(T){
load("expma.Rdata")
load("probe.Rdata")
}
expma[1:5,1:5]
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 1007_s_at 15630.200 17048.800 13667.500 15138.800 10766.600
## 1053_at 3614.400 3563.220 2604.650 1945.710 3371.290
## 117_at 1032.670 1164.150 510.692 5061.200 452.166
## 121_at 5917.800 6826.670 4562.440 5870.130 3869.480
## 1255_g_at 224.525 395.025 207.087 164.835 111.609
boxplot(expma)##看下表達情況
Fig1
metadata信息
metdata[1:5,1:5]
## title
## GSM188013 DMSO treated MCF7 breast cancer cells [HG-U133A] Exp 1
## GSM188014 Dioxin treated MCF7 breast cancer cells [HG-U133A] Exp 1
## GSM188016 DMSO treated MCF7 breast cancer cells [HG-U133A] Exp 2
## GSM188018 Dioxin treated MCF7 breast cancer cells [HG-U133A] Exp 2
## GSM188020 DMSO treated MCF7 breast cancer cells [HG-U133A] Exp 3
## geo_accession status submission_date
## GSM188013 GSM188013 Public on May 12 2007 May 08 2007
## GSM188014 GSM188014 Public on May 12 2007 May 08 2007
## GSM188016 GSM188016 Public on May 12 2007 May 08 2007
## GSM188018 GSM188018 Public on May 12 2007 May 08 2007
## GSM188020 GSM188020 Public on May 12 2007 May 08 2007
## last_update_date
## GSM188013 May 11 2007
## GSM188014 May 11 2007
## GSM188016 May 11 2007
## GSM188018 May 11 2007
## GSM188020 May 11 2007
探針注釋信息
head(probe)
## ID Gene Symbol ENTREZ_GENE_ID
## 2 1053_at RFC2 5982
## 3 117_at HSPA6 3310
## 4 121_at PAX8 7849
## 5 1255_g_at GUCA1A 2978
## 7 1316_at THRA 7067
## 8 1320_at PTPN21 11099
查看Gene Symbol是否有重復
table(duplicated(probe$`Gene Symbol`))##12549 FALSE
##
## FALSE TRUE
## 12549 8329
整合注釋信息到表達矩陣
ID<-rownames(expma)
expma<-apply(expma,1,function(x){log2(x+1)})
expma<-t(expma)
eset<-expma[ID %in% probe$ID,] %>% cbind(probe)
eset[1:5,1:5]
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 1053_at 11.819940 11.799371 11.347428 10.926822 11.719513
## 117_at 10.013560 10.186300 8.999132 12.305549 8.823896
## 121_at 12.531089 12.737178 12.155906 12.519422 11.918297
## 1255_g_at 7.817144 8.629448 7.701043 7.373605 6.815178
## 1316_at 9.497459 9.868281 8.831323 9.211346 8.453592
colnames(eset)
## [1] "GSM188013" "GSM188014" "GSM188016" "GSM188018"
## [5] "GSM188020" "GSM188022" "ID" "Gene Symbol"
## [9] "ENTREZ_GENE_ID"
方法一:aggregate函數(shù)
test1<-aggregate(x=eset[,1:6],by=list(eset$`Gene Symbol`),FUN=mean,na.rm=T)
test1[1:5,1:5]##與去重結(jié)果相吻合
## Group.1 GSM188013 GSM188014 GSM188016 GSM188018
## 1 8.438846 8.368513 7.322442 7.813573
## 2 A1CF 10.979025 10.616926 9.940773 10.413311
## 3 A2M 6.565276 6.422112 8.142194 5.652593
## 4 A4GALT 7.728628 7.818966 8.679885 7.048563
## 5 A4GNT 10.243388 10.182382 9.391991 8.779887
dim(test1)##
## [1] 12549 7
colnames(test1)
## [1] "Group.1" "GSM188013" "GSM188014" "GSM188016" "GSM188018" "GSM188020"
## [7] "GSM188022"
方法二:dplyr
dplyr中的group_by聯(lián)合summarise剛好適用于解決這個問題
eset[1:5,1:5]
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 1053_at 11.819940 11.799371 11.347428 10.926822 11.719513
## 117_at 10.013560 10.186300 8.999132 12.305549 8.823896
## 121_at 12.531089 12.737178 12.155906 12.519422 11.918297
## 1255_g_at 7.817144 8.629448 7.701043 7.373605 6.815178
## 1316_at 9.497459 9.868281 8.831323 9.211346 8.453592
dim(eset)
## [1] 20878 9
colnames(eset)[8]<-"Gene"
colnames(eset)
## [1] "GSM188013" "GSM188014" "GSM188016" "GSM188018"
## [5] "GSM188020" "GSM188022" "ID" "Gene"
## [9] "ENTREZ_GENE_ID"
test2<-eset[,c(1:6,8)] %>%
arrange(Gene) %>%
group_by(Gene) %>%
summarise_all(mean)
dim(test2)##同樣得出了12457,與方法1結(jié)果相同
## [1] 12549 7
方法三:tapply函數(shù)
這里值得注意的是tapply中的X參數(shù)只能計算一個向量 因此我們這里選擇>一個for循環(huán)進行迭代,將每一個向量單獨計算后rbind起來
#Sys.setenv(LANGUAGE= "en")
eset[1:5,1:5]
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 1053_at 11.819940 11.799371 11.347428 10.926822 11.719513
## 117_at 10.013560 10.186300 8.999132 12.305549 8.823896
## 121_at 12.531089 12.737178 12.155906 12.519422 11.918297
## 1255_g_at 7.817144 8.629448 7.701043 7.373605 6.815178
## 1316_at 9.497459 9.868281 8.831323 9.211346 8.453592
dim(eset)
## [1] 20878 9
colnames(eset)[8]<-"Gene"
colnames(eset)
## [1] "GSM188013" "GSM188014" "GSM188016" "GSM188018"
## [5] "GSM188020" "GSM188022" "ID" "Gene"
## [9] "ENTREZ_GENE_ID"
##搭配for循環(huán)
output<-vector()
for (i in 1:6) {
value<-tapply(X=(eset[,i]),
INDEX = list(eset$Gene),
FUN = mean
)
output<-rbind(value,output)
}
output<-t(output)
colnames(output)<-colnames(eset)[1:6]
output[1:5,1:5]
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 8.438846 8.438846 8.438846 8.438846 8.438846
## A1CF 10.979025 10.979025 10.979025 10.979025 10.979025
## A2M 6.565276 6.565276 6.565276 6.565276 6.565276
## A4GALT 7.728628 7.728628 7.728628 7.728628 7.728628
## A4GNT 10.243388 10.243388 10.243388 10.243388 10.243388
test3<-output
dim(test3)
## [1] 12549 6
test3[1:5,1:5]##得到的結(jié)果與test2相同
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 8.438846 8.438846 8.438846 8.438846 8.438846
## A1CF 10.979025 10.979025 10.979025 10.979025 10.979025
## A2M 6.565276 6.565276 6.565276 6.565276 6.565276
## A4GALT 7.728628 7.728628 7.728628 7.728628 7.728628
## A4GNT 10.243388 10.243388 10.243388 10.243388 10.243388
方法四:by函數(shù)
by函數(shù)的用法與tapply類似,同樣需要結(jié)合for循環(huán) for循環(huán)的思路是相同的
## for循環(huán)完成迭代
output<-vector()
for (i in 1:6){
value=by(eset[,i],
INDICES = list(eset$Gene),FUN = mean)
output<-rbind(value,output)
}
output[1:5,1:5]
## A1CF A2M A4GALT A4GNT
## value 7.165374 9.713797 5.580703 8.583414 9.163199
## value 7.244615 9.743305 5.550033 5.929258 9.431585
## value 7.813573 10.413311 5.652593 7.048563 8.779887
## value 7.322442 9.940773 8.142194 8.679885 9.391991
## value 8.368513 10.616926 6.422112 7.818966 10.182382
output<-t(output)
colnames(output)<-colnames(eset)[1:6]
test4<-output
dim(test4)
## [1] 12549 6
test4[1:5,1:5]##得到的結(jié)果與test4相同
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 7.165374 7.244615 7.813573 7.322442 8.368513
## A1CF 9.713797 9.743305 10.413311 9.940773 10.616926
## A2M 5.580703 5.550033 5.652593 8.142194 6.422112
## A4GALT 8.583414 5.929258 7.048563 8.679885 7.818966
## A4GNT 9.163199 9.431585 8.779887 9.391991 10.18238
本期內(nèi)容就到這里,我是白介素2,下期再見
相關(guān)閱讀:GEO數(shù)據(jù)挖掘01-數(shù)據(jù)下載及提取表達矩陣
轉(zhuǎn)載請注明出處