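0. A little background
oncoPredict is an R package that predicts drug sensitivity from gene expression. In other words, given the expression values of your samples, it estimates an IC50 value for every drug; the lower the value, the more effective the drug is for that sample.
Another package for drug prediction is pRRophetic, but you can skip it, because oncoPredict is its upgraded successor.
There is also the CellMiner website, which I wrote about earlier; have a look at that post if you are interested.
1. Load the data
The code is adapted from https://mp.weixin.qq.com/s/QRaTd-fIsqq6sPsLmOPvIw, which is also worth reading for extra background.
The Training Data folder holds the data prepared by the package authors, which serves as the training set for drug prediction. It was downloaded from https://osf.io/c6tfx/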
rm(list = ls())
library(oncoPredict)
library(data.table)
library(gtools)
library(reshape2)
library(ggpubr)
dir='./DataFiles/DataFiles/Training Data/'
dir(dir)
## [1] "CTRP2_Expr (RPKM, not log transformed).rds"
## [2] "CTRP2_Expr (TPM, not log transformed).rds"
## [3] "CTRP2_Res.rds"
## [4] "GDSC1_Expr (RMA Normalized and Log Transformed).rds"
## [5] "GDSC1_Res.rds"
## [6] "GDSC2_Expr (RMA Normalized and Log Transformed).rds"
## [7] "GDSC2_Res.rds"
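The folder contains data from the Cancer Therapeutics Response Portal (CTRP) and from Genomics of Drug Sensitivity in Cancer (GDSC). We will go straight to GDSC v2 (the GDSC2 files).
Both databases provide a gene expression matrix and a table of drug IC50 values.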
exp = readRDS(file=file.path(dir,'GDSC2_Expr (RMA Normalized and Log Transformed).rds'))
exp[1:4,1:4]
## COSMIC_906826 COSMIC_687983 COSMIC_910927 COSMIC_1240138
## TSPAN6 7.632023 7.548671 8.712338 7.797142
## TNMD 2.964585 2.777716 2.643508 2.817923
## DPM1 10.379553 11.807341 9.880733 9.883471
## SCYL3 3.614794 4.066887 3.956230 4.063701
dim(exp)
## [1] 17419 805
drug = readRDS(file = file.path(dir,"GDSC2_Res.rds"))
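drug <- exp(drug) # the downloaded values are log-transformed; this reverses the transform (R still calls the base exp() function here, even though a matrix named exp exists)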
drug[1:4,1:4]
## Camptothecin_1003 Vinblastine_1004 Cisplatin_1005
## COSMIC_906826 0.3158373 0.208843106 1116.05899
## COSMIC_687983 0.2827342 0.013664227 26.75839
## COSMIC_910927 0.0295671 0.006684071 12.09379
## COSMIC_1240138 7.2165789 NA NA
## Cytarabine_1006
## COSMIC_906826 18.5038719
## COSMIC_687983 16.2943594
## COSMIC_910927 0.3387418
## COSMIC_1240138 NA
dim(drug)
## [1] 805 198
identical(rownames(drug),colnames(exp))
## [1] TRUE
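drug holds the drug IC50 values and exp is the gene expression matrix of the corresponding cell lines. As the check shows, the sample names of the two objects match.
2. A trial run
Build a small example data set by randomly drawing 4 samples straight from the expression matrix.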
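The two steps above (undoing the log transform and checking that samples line up) can be sketched on toy data. This is only an illustration: the values and names such as `ic50`, `expr` and `cellA` are made up, and the sketch assumes a natural-log transform, which is what reversing it with `exp()` implies.

```r
# Toy stand-ins for the real objects (hypothetical values, not GDSC data)
ic50 <- matrix(c(0.5, 2, 10, 40), nrow = 2,
               dimnames = list(c("cellA", "cellB"), c("drug1", "drug2")))
log_ic50 <- log(ic50)            # assumed form of the stored values
restored <- base::exp(log_ic50)  # base:: makes clear the function is meant, not a matrix named exp
roundtrip_ok <- isTRUE(all.equal(restored, ic50))

# Toy expression matrix whose columns are in a different order than the rows above
expr <- matrix(1:6, nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), c("cellB", "cellA")))

# Keep only shared samples and put both objects in the same order
shared   <- intersect(colnames(expr), rownames(restored))
expr     <- expr[, shared, drop = FALSE]
restored <- restored[shared, , drop = FALSE]
aligned  <- identical(colnames(expr), rownames(restored))
```

With the GDSC2 files no reordering is needed, since `identical()` already returns TRUE; the alignment step only matters when you bring your own training data.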
test<- exp[,sample(1:ncol(exp),4)]
test[1:4,1:4]
## COSMIC_1290797 COSMIC_906830 COSMIC_907314 COSMIC_907068
## TSPAN6 8.196623 5.542645 6.960978 6.896404
## TNMD 2.692706 2.736643 3.038283 2.774103
## DPM1 10.829487 9.890112 9.912911 10.757162
## SCYL3 3.840380 3.346422 3.845654 4.490674
colnames(test)=paste0('test',colnames(test))
dim(test)
## [1] 17419 4
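Because `sample()` is random, each run of the code above picks different cell lines. A minimal sketch of making the draw reproducible with `set.seed()` follows; the toy matrix and the seed value 42 are arbitrary assumptions for illustration.

```r
# Toy expression matrix: 5 genes x 10 samples (hypothetical values)
toy <- matrix(seq_len(50), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("COSMIC_", 1:10)))

set.seed(42)  # any fixed integer works; 42 is arbitrary
picked_first <- sample(seq_len(ncol(toy)), 4)

set.seed(42)  # resetting the seed reproduces the same draw
picked_again <- sample(seq_len(ncol(toy)), 4)
same_draw <- identical(picked_first, picked_again)

# Subset and rename exactly as in the tutorial code
test_toy <- toy[, picked_first, drop = FALSE]
colnames(test_toy) <- paste0("test", colnames(test_toy))
```

Calling `set.seed()` once before the `sample()` line in the tutorial would have the same effect on the real data.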
The call takes a long time to run, so it is wrapped in if(F) to skip it.
if(F){
calcPhenotype(trainingExprData = exp,
trainingPtype = drug,
testExprData = test,
batchCorrect = 'eb', # "eb" for array,standardize for rnaseq
powerTransformPhenotype = TRUE,
removeLowVaryingGenes = 0.2,
minNumSamples = 10,
printOutput = TRUE,
removeLowVaringGenesFrom = 'rawData' )
}
What the package vignette says about the batchCorrect parameter:
batchCorrect options: “eb” for ComBat, “qn” for quantiles normalization, “standardize”, or “none”
“eb” is good to use when you use microarray training data to build models on microarray testing data.
“standardize” is good to use when you use microarray training data to build models on RNA-seq testing data (this is what Paul used in the 2017 IDWAS paper that used GDSC microarray to impute in TCGA RNA-Seq data, see methods section of that paper for rationale)
What the package vignette says about the removeLowVaringGenesFrom parameter:
Determine method to remove low varying genes. Options are ‘homogenizeData’ and ‘rawData’. homogenizeData is likely better if there is ComBat batch correction; raw data was used in the 2017 IDWAS paper that used GDSC microarray to impute in TCGA RNA-Seq data.
In other words, for microarray test data use the parameters in the code above; for RNA-seq test data, change batchCorrect to 'standardize'.
As for removeLowVaringGenesFrom, the authors' guidance is vague, so either option seems acceptable.
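The rule of thumb above can be written down as a tiny helper. Note that `choose_batch_correct()` is a hypothetical function made up for this post, not part of oncoPredict; it only encodes the vignette advice and assumes the training data is microarray (as with GDSC).

```r
# Hypothetical helper: map the platform of the TEST data to a batchCorrect value,
# following the vignette text quoted above (training data assumed to be microarray)
choose_batch_correct <- function(test_platform = c("microarray", "rnaseq")) {
  test_platform <- match.arg(test_platform)
  switch(test_platform,
         microarray = "eb",           # ComBat
         rnaseq     = "standardize")
}

bc_for_rnaseq <- choose_batch_correct("rnaseq")
bc_for_array  <- choose_batch_correct("microarray")
```

The returned string would then be passed as the `batchCorrect` argument of `calcPhenotype()`, with the other arguments left as in the call shown earlier.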
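3. Look at the results
After the run, the results are written to a fixed folder, calcPhenotype_Output, under a fixed file name, DrugPredictions.csv. A working directory can therefore hold the results of only one data set at a time, so do not mix several runs in the same directory.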
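One way to avoid overwriting earlier results is to copy the fixed-name file to a run-specific name right after each run. The sketch below uses only base R; `archive_predictions()`, the `predictions_archive` folder and the `cohort1` label are hypothetical names invented for illustration, and the demo uses a dummy CSV in a temporary directory rather than real output.

```r
# Hypothetical helper: copy the fixed-name output to a run-specific file
archive_predictions <- function(run_label,
                                out_dir = "calcPhenotype_Output",
                                archive_dir = "predictions_archive") {
  src <- file.path(out_dir, "DrugPredictions.csv")
  if (!file.exists(src)) stop("No predictions found at ", src)
  dir.create(archive_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(archive_dir, paste0(run_label, "_DrugPredictions.csv"))
  file.copy(src, dest, overwrite = TRUE)
  dest
}

# Demo in a temporary directory, with a dummy file standing in for real output
old_wd <- setwd(tempdir())
dir.create("calcPhenotype_Output", showWarnings = FALSE)
write.csv(data.frame(drugA = c(1.2, 3.4), row.names = c("s1", "s2")),
          file.path("calcPhenotype_Output", "DrugPredictions.csv"))
saved_to <- archive_predictions("cohort1")
saved_ok <- file.exists(saved_to)
setwd(old_wd)  # restore the original working directory
```

Running each data set in its own working directory achieves the same thing without any extra code.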
library(data.table)
testPtype <- read.csv('./calcPhenotype_Output/DrugPredictions.csv', row.names = 1,check.names = F)
testPtype[1:4, 1:4]
## Camptothecin_1003 Vinblastine_1004 Cisplatin_1005
## testCOSMIC_688011 0.10760213 0.11167741 38.54915
## testCOSMIC_687586 0.07044805 0.01858011 20.74525
## testCOSMIC_1290795 0.10672687 0.02699725 43.80543
## testCOSMIC_909709 0.14925178 0.03303756 36.73508
## Cytarabine_1006
## testCOSMIC_688011 8.099842
## testCOSMIC_687586 4.612872
## testCOSMIC_1290795 13.370822
## testCOSMIC_909709 10.393066
dim(testPtype)
## [1] 4 198
identical(colnames(testPtype),colnames(drug))
## [1] TRUE
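The predicted IC50 values for all 198 drugs are in this table.
We can plot the predictions against the measured values. By eye, the correlation coefficient is close to 1, which confirms that the output really is IC50 values and that the predictions are fairly accurate. (Of course they are accurate: the test samples were taken straight from the training matrix.)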
library(stringr)
p = str_remove(rownames(testPtype),"test")
a = t(rbind(drug[p,],testPtype))
a = a[,c(1,5,2,6,3,7,4,8)]
par(mfrow = c(2,2))
plot(a[,1],a[,2])
plot(a[,3],a[,4])
plot(a[,5],a[,6])
plot(a[,7],a[,8])
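Instead of judging the correlation by eye, it can be computed per sample. The sketch below assumes a matrix with interleaved columns (measured, predicted, measured, predicted, and so on), which is how `a` is arranged above. `paired_cor()` is a hypothetical helper, Spearman is an arbitrary choice of method, and the toy values are made up; `use = "complete.obs"` is there because the measured IC50 table contains NA entries.

```r
# Hypothetical helper: correlation for each (measured, predicted) column pair
paired_cor <- function(a, method = "spearman") {
  n_pairs <- ncol(a) / 2
  stopifnot(n_pairs == round(n_pairs))  # need an even number of columns
  vapply(seq_len(n_pairs), function(i) {
    cor(a[, 2 * i - 1], a[, 2 * i], method = method, use = "complete.obs")
  }, numeric(1))
}

# Toy data: pair 1 agrees perfectly in rank, pair 2 is perfectly reversed
measured <- c(0.1, 1, 5, 20, NA, 80)
toy <- cbind(m1 = measured, p1 = measured * 1.1,
             m2 = measured, p2 = rev(measured))
cors <- paired_cor(toy)
```

On the real data, `paired_cor(a)` would return one coefficient per test sample, in the same order as the four plots.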