oncoPredict: an R package for drug response prediction

0. A bit of background

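oncoPredict is an R package that predicts drug sensitivity from gene expression. In other words, given the expression values of your samples, it estimates an IC50 value for each drug; the lower the value, the more sensitive the sample is predicted to be to that drug.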

On the topic of drug prediction, there is also the pRRophetic package. You can skip it, because oncoPredict is its upgraded successor.

There is also the CellMiner website, which I have written about before; feel free to look back at that post.

1. Loading the data

The code is adapted from https://mp.weixin.qq.com/s/QRaTd-fIsqq6sPsLmOPvIw, which is also worth reading for extra background.

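The Training Data folder holds the data prepared by the package authors, which serves as the training set for drug prediction. It was downloaded from https://osf.io/c6tfx/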

rm(list = ls())
library(oncoPredict)
library(data.table)
library(gtools)
library(reshape2)
library(ggpubr)
dir='./DataFiles/DataFiles/Training Data/'
dir(dir)

## [1] "CTRP2_Expr (RPKM, not log transformed).rds"         
## [2] "CTRP2_Expr (TPM, not log transformed).rds"          
## [3] "CTRP2_Res.rds"                                      
## [4] "GDSC1_Expr (RMA Normalized and Log Transformed).rds"
## [5] "GDSC1_Res.rds"                                      
## [6] "GDSC2_Expr (RMA Normalized and Log Transformed).rds"
## [7] "GDSC2_Res.rds"

As you can see, the folder covers the Cancer Therapeutics Response Portal (CTRP) and Genomics of Drug Sensitivity in Cancer (GDSC). We will go straight to GDSC v2 (the GDSC2 files).

Each database comes with a gene expression matrix and a drug response table (IC50 values in the GDSC2 table used here).

exp = readRDS(file=file.path(dir,'GDSC2_Expr (RMA Normalized and Log Transformed).rds'))
exp[1:4,1:4]

##        COSMIC_906826 COSMIC_687983 COSMIC_910927 COSMIC_1240138
## TSPAN6      7.632023      7.548671      8.712338       7.797142
## TNMD        2.964585      2.777716      2.643508       2.817923
## DPM1       10.379553     11.807341      9.880733       9.883471
## SCYL3       3.614794      4.066887      3.956230       4.063701

dim(exp)

## [1] 17419   805

drug = readRDS(file = file.path(dir,"GDSC2_Res.rds"))
drug <- exp(drug) # the downloaded values are log-transformed; exp() reverses that
drug[1:4,1:4]

##                Camptothecin_1003 Vinblastine_1004 Cisplatin_1005
## COSMIC_906826          0.3158373      0.208843106     1116.05899
## COSMIC_687983          0.2827342      0.013664227       26.75839
## COSMIC_910927          0.0295671      0.006684071       12.09379
## COSMIC_1240138         7.2165789               NA             NA
##                Cytarabine_1006
## COSMIC_906826       18.5038719
## COSMIC_687983       16.2943594
## COSMIC_910927        0.3387418
## COSMIC_1240138              NA

dim(drug)

## [1] 805 198
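
One detail in the code above can look odd: `exp` is used both as the name of the expression matrix and as the function that undoes the log transform. A minimal sketch with made-up numbers (the values and the names `ic50`, `log_ic50`, `restored` are illustrative only) suggests why this still works: when a name is called as a function, R skips non-function objects with that name and finds base::exp.

```r
# Toy stand-in: `exp` is a data object here, as in the walkthrough above
exp <- matrix(1:4, nrow = 2)

ic50     <- c(0.5, 2, 40)   # made-up IC50 values
log_ic50 <- log(ic50)       # what a log-transformed table would hold

# In call position R ignores the matrix named `exp` and uses base::exp
restored <- exp(log_ic50)
all.equal(restored, ic50)

# Spelling out the namespace removes any doubt about which `exp` is meant
restored2 <- base::exp(log_ic50)
```

Naming the matrix something like `expr` would avoid the question entirely, but that is a style preference rather than a bug fix.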

identical(rownames(drug),colnames(exp))

## [1] TRUE

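drug holds the drug IC50 values, and exp is the gene expression matrix for the same cell lines. The check above confirms that the sample names of the two objects match, in the same order.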
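
The shipped GDSC2 files already line up, but custom training data may not. As a hedged sketch on toy matrices (`expr_toy`, `resp_toy` and the sample and drug names are invented for illustration), the two tables could be brought into the same sample order like this:

```r
# Toy expression matrix (genes x samples) and response matrix (samples x drugs)
expr_toy <- matrix(rnorm(6), nrow = 2,
                   dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
resp_toy <- matrix(c(1, 2, 3, 4), nrow = 2,
                   dimnames = list(c("s3", "s1"), c("drugX", "drugY")))

# Keep only the samples present in both tables, in one shared order
shared       <- intersect(colnames(expr_toy), rownames(resp_toy))
expr_aligned <- expr_toy[, shared, drop = FALSE]
resp_aligned <- resp_toy[shared, , drop = FALSE]

# Same check as in the walkthrough, now on the aligned toy data
identical(colnames(expr_aligned), rownames(resp_aligned))
```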

2. A practice run

Build an example dataset by drawing 4 samples at random from the expression matrix.

test<- exp[,sample(1:ncol(exp),4)]
test[1:4,1:4]  

##        COSMIC_1290797 COSMIC_906830 COSMIC_907314 COSMIC_907068
## TSPAN6       8.196623      5.542645      6.960978      6.896404
## TNMD         2.692706      2.736643      3.038283      2.774103
## DPM1        10.829487      9.890112      9.912911     10.757162
## SCYL3        3.840380      3.346422      3.845654      4.490674

colnames(test)=paste0('test',colnames(test))
dim(test)

## [1] 17419     4
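
Because sample() is not seeded above, each run draws different cell lines, so sample names printed at different points of this post need not agree. A small sketch on a toy matrix (`expr_toy` and the seed value 123 are arbitrary choices for illustration) shows how set.seed() would make the draw repeatable:

```r
# Toy matrix standing in for the expression matrix (genes x samples)
expr_toy <- matrix(seq_len(40), nrow = 4,
                   dimnames = list(paste0("gene", 1:4), paste0("COSMIC_", 1:10)))

set.seed(123)                               # arbitrary seed; any integer works
idx1 <- sample(seq_len(ncol(expr_toy)), 4)
set.seed(123)                               # same seed, same draw
idx2 <- sample(seq_len(ncol(expr_toy)), 4)
identical(idx1, idx2)

# Same renaming step as in the walkthrough
test_toy <- expr_toy[, idx1]
colnames(test_toy) <- paste0("test", colnames(test_toy))
```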

The calculation takes a long time, so it is wrapped in if(F) to keep it from running.

if(F){
  calcPhenotype(trainingExprData = exp,
                trainingPtype = drug,
                testExprData = test,
                batchCorrect = 'eb',  #   "eb" for array,standardize  for rnaseq
                powerTransformPhenotype = TRUE,
                removeLowVaryingGenes = 0.2,
                minNumSamples = 10, 
                printOutput = TRUE, 
                removeLowVaringGenesFrom = 'rawData' )
}

What the package vignette says about the batchCorrect parameter:

batchCorrect options: “eb” for ComBat, “qn” for quantiles normalization, “standardize”, or “none”

“eb” is good to use when you use microarray training data to build models on microarray testing data.

“standardize” is good to use when you use microarray training data to build models on RNA-seq testing data (this is what Paul used in the 2017 IDWAS paper that used GDSC microarray to impute in TCGA RNA-Seq data, see methods section of that paper for rationale)

What the package vignette says about the removeLowVaringGenesFrom parameter:

Determine method to remove low varying genes. #Options are ‘homogenizeData’ and ‘rawData’ #homogenizeData is likely better if there is ComBat batch correction, raw data was used in the 2017 IDWAS paper that used GDSC microarray to impute in TCGA RNA-Seq data.

In other words, for microarray test data use the parameters in the code above; for RNA-seq test data, change batchCorrect to "standardize".
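
That rule of thumb can be written down as a tiny lookup. The helper below is hypothetical (`choose_batch_correct` is not part of oncoPredict) and only encodes the two cases quoted from the vignette; its return value would be passed as the batchCorrect argument.

```r
# Hypothetical helper, not an oncoPredict function: map the platform of the
# test data to a batchCorrect value, following the vignette text quoted above
choose_batch_correct <- function(platform = c("microarray", "rnaseq")) {
  platform <- match.arg(platform)
  switch(platform,
         microarray = "eb",            # ComBat, array training -> array testing
         rnaseq     = "standardize")   # array training -> RNA-seq testing
}

choose_batch_correct("rnaseq")
```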

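The vignette is vague about removeLowVaringGenesFrom as well. Going by the quoted text, 'homogenizeData' is likely the better choice with ComBat correction, while 'rawData' is what the 2017 IDWAS paper used. Pick whichever fits your case.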

3. Looking at the results

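The results are written to a fixed folder, calcPhenotype_Output, under a fixed file name, DrugPredictions.csv. A second run in the same working directory would write to the same file, so use one working directory per dataset and don't mix them.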
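
One way to keep several runs apart is to rename the output right after each run. The sketch below does not call calcPhenotype(); it writes a stand-in CSV into a temporary directory and renames it, and the label "cohortA" is a made-up example.

```r
# Work in a throwaway directory so nothing real is touched
work <- file.path(tempdir(), "oncopredict_demo")
dir.create(file.path(work, "calcPhenotype_Output"),
           recursive = TRUE, showWarnings = FALSE)

# Stand-in for the file that a prediction run would leave behind
out_file <- file.path(work, "calcPhenotype_Output", "DrugPredictions.csv")
write.csv(data.frame(drugX = c(1.2, 3.4), row.names = c("s1", "s2")), out_file)

# Give the file a dataset-specific name before starting the next run
kept <- file.path(work, "calcPhenotype_Output", "DrugPredictions_cohortA.csv")
file.rename(out_file, kept)
file.exists(kept)
```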

library(data.table)
testPtype <- read.csv('./calcPhenotype_Output/DrugPredictions.csv', row.names = 1,check.names = F)
testPtype[1:4, 1:4]

##                    Camptothecin_1003 Vinblastine_1004 Cisplatin_1005
## testCOSMIC_688011         0.10760213       0.11167741       38.54915
## testCOSMIC_687586         0.07044805       0.01858011       20.74525
## testCOSMIC_1290795        0.10672687       0.02699725       43.80543
## testCOSMIC_909709         0.14925178       0.03303756       36.73508
##                    Cytarabine_1006
## testCOSMIC_688011         8.099842
## testCOSMIC_687586         4.612872
## testCOSMIC_1290795       13.370822
## testCOSMIC_909709        10.393066

dim(testPtype)

## [1]   4 198

identical(colnames(testPtype),colnames(drug))

## [1] TRUE

The predicted IC50 values for all 198 drugs are in this table.
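
A common next step is to ask, for each drug, which sample is predicted to be most sensitive. The following is a minimal sketch on a made-up prediction matrix (`pred_toy`, the sample names and the drug names are invented); comparisons are made within a drug, since IC50 scales differ between drugs.

```r
# Made-up predictions: rows are samples, columns are drugs
pred_toy <- matrix(c(0.11, 0.07, 0.15,
                     38.5, 20.7, 36.7),
                   nrow = 3,
                   dimnames = list(c("s1", "s2", "s3"), c("drugX", "drugY")))

# For each drug, the sample with the lowest predicted IC50
most_sensitive <- rownames(pred_toy)[apply(pred_toy, 2, which.min)]
names(most_sensitive) <- colnames(pred_toy)
most_sensitive

# Rank samples within each drug (1 = lowest predicted IC50)
ranks <- apply(pred_toy, 2, rank)
```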

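We can plot the predictions against the real data. Judging by eye, the correlation is close to 1, which confirms that the output really is on the IC50 scale and that the predictions are quite accurate. (Of course they are accurate: the test samples were taken straight from the training matrix.)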

library(stringr)
p = str_remove(rownames(testPtype),"test")
a = t(rbind(drug[p,],testPtype))
a = a[,c(1,5,2,6,3,7,4,8)]
par(mfrow = c(2,2))
plot(a[,1],a[,2])
plot(a[,3],a[,4])
plot(a[,5],a[,6])
plot(a[,7],a[,8])
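
Instead of judging the scatter plots by eye, the agreement can be put into a number with cor(). The sketch below uses simulated values rather than the real tables (`observed` and `predicted` are made up), and use = "complete.obs" is included because the real drug table contains NA entries.

```r
set.seed(1)
observed  <- exp(rnorm(50))                        # made-up "true" IC50s
predicted <- observed * exp(rnorm(50, sd = 0.1))   # made-up predictions, small noise

# Spearman is rank-based, so a few very large IC50s do not dominate the result
rho <- cor(observed, predicted, method = "spearman", use = "complete.obs")
rho
```

On the real data the same call could be applied to each pair of columns of `a`, for example cor(a[,1], a[,2], method = "spearman", use = "complete.obs").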