[R <- Bioinformatics | Packages] Installing and Using the RTCGAToolbox Package

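This post is a study translation of the official documentation, originally published on WordPress on March 29. Because of poor network access, that blog has been more or less abandoned.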

Installation

In an interactive R or RStudio session, enter:

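source("http://bioconductor.org/biocLite.R")
biocLite("RTCGAToolbox")
# On current Bioconductor releases (3.8+), biocLite is replaced by:
# BiocManager::install("RTCGAToolbox")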

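On Windows, if the installation fails, check that all dependencies are installed. In my case the XML package had to be installed separately.

If you are on Linux and the XML package keeps failing to install, read the error message carefully. Your system may be missing the libxml2 and libcurl development files, which prevents the XML and RCurl packages from building (diagnose according to the actual error message). In a terminal, run: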

sudo apt-get install libxml2-dev
sudo apt-get install libcurl4-gnutls-dev

Once that is done, install the XML and RCurl packages, then install RTCGAToolbox.
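Before retrying the installation, it can help to see which dependencies are actually missing. The following is a minimal base-R sketch; the helper name missing_pkgs is hypothetical and is not part of RTCGAToolbox.

```r
# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# XML and RCurl are the two dependencies mentioned above.
deps <- c("XML", "RCurl")
todo <- missing_pkgs(deps)
if (length(todo) > 0) {
  message("Still missing: ", paste(todo, collapse = ", "))
} else {
  message("All listed dependencies are installed.")
}
```

Anything reported as missing can then be installed on its own, which usually makes the underlying error message easier to read.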

Viewing and Importing Data

First load the package:

library(RTCGAToolbox)

List the valid dataset aliases:

# Valid aliases
 > getFirehoseDatasets()

The valid standard data run dates are listed with getFirehoseRunningDates().

List the valid analysis run dates:

# Valid analysis running dates (returns the 3 most recent dates)
> gisticDate = getFirehoseAnalyzeDates(last=3)
> gisticDate
[1] "20160128" "20150821" "20150402"

The date and the dataset together determine which data getFirehoseData retrieves.

# READ mutation data and clinical data
readData = getFirehoseData(dataset="READ", runDate="20150402", forceDownload=TRUE,
    Clinic=TRUE, Mutation=TRUE)

Some of the arguments that getFirehoseData expects:

- dataset: The cohort code of the dataset to download. The list of codes is available via getFirehoseDatasets(), as explained above.
- runDate: The Firehose project provides data snapshots for each cohort at different dates. The available dates are listed by getFirehoseRunningDates().
- gistic2_Date: As with the cohort data, the Firehose project runs its analysis pipelines at set dates to process copy number data with GISTIC2 (Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G 2011). Users who want GISTIC2-processed copy number data should set this date. The available dates are listed by getFirehoseAnalyzeDates().
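Since a download fails when the requested date is not one that Firehose offers, it can be worth validating the date first. The sketch below is plain R and makes no calls to the package; the helper pick_run_date is hypothetical, and the fallback to the most recent date is an assumption about what a user would want, not package behavior.

```r
# Hypothetical helper: use the requested date if it is available,
# otherwise fall back to the most recent available date.
pick_run_date <- function(available, requested = NULL) {
  if (!is.null(requested) && requested %in% available) {
    return(requested)
  }
  # Dates are "YYYYMMDD" strings, so the lexicographic maximum is the latest.
  max(available)
}

# The three analysis dates shown earlier in this post.
gisticDate <- c("20160128", "20150821", "20150402")
pick_run_date(gisticDate, "20150402")  # requested date is available
pick_run_date(gisticDate, "20140115")  # not in the list, falls back
```

In practice the `available` vector would come from getFirehoseRunningDates() or getFirehoseAnalyzeDates().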

The following logical flags select the different data types:

- RNAseq_Gene
- Clinic
- miRNASeq_Gene
- RNAseq2_Gene_Norm
- CNA_SNP
- CNV_SNP
- CNA_Seq
- CNA_CGH
- Methylation
- Mutation
- mRNA_Array
- miRNA_Array
- RPPA

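Other arguments:

- forceDownload: force a fresh download.
- fileSizeLimit: defaults to 500 MB; set this argument to change the limit.
- getUUIDs: Firehose identifies every sample by a TCGA barcode; set this argument to obtain the corresponding UUIDs.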

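Analysis Functions

RTCGAToolbox provides functions that show basic information about the data. The analysis functions cover differential gene expression, correlation between copy number (CN) and gene expression (GE), mutation frequency tables, and reports, among others.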

The "Toy" Dataset

A biologically meaningless dataset that demonstrates the basic structure of the data.

Differential Gene Expression

# Differential gene expression analysis for gene level RNA data.
 diffGeneExprs = getDiffExpressedGenes(dataObject=RTCGASample,DrawPlots=TRUE,
 adj.method="BH",adj.pval=0.05,raw.pval=0.05,
 logFC=2,hmTopUpN=10,hmTopDownN=10)

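RTCGAToolbox uses the voom method and limma package functions for this analysis. Every sample processed by the TCGA project has a specific barcode that encodes its tissue source. RTCGAToolbox uses this barcode to split the samples into normal and tumor groups for the differential expression analysis. Because voom requires raw RNASeq counts, normalized data cannot be used for this analysis.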
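To illustrate how a barcode can separate normal from tumor samples: in a full TCGA barcode, characters 14 and 15 hold the sample-type code, where 01 to 09 denote tumor and 10 to 19 denote normal samples. The sketch below is not the package's internal code; the helper sample_group and the two barcodes are made up for illustration.

```r
# Classify TCGA barcodes by the two-digit sample-type code (characters 14-15).
# Codes 01-09 denote tumor samples and 10-19 denote normal samples.
sample_group <- function(barcodes) {
  code <- suppressWarnings(as.integer(substr(barcodes, 14, 15)))
  ifelse(is.na(code), NA_character_,
         ifelse(code <= 9, "Tumor",
                ifelse(code <= 19, "Normal", "Other")))
}

# Made-up barcodes in the TCGA format.
barcodes <- c("TCGA-00-0001-01A-01R-0000-00",
              "TCGA-00-0002-11A-01R-0000-00")
sample_group(barcodes)
```

Barcodes that are too short to carry a sample-type code (such as the toy dataset's TEST.00.0026) come back as NA in this sketch.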

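The function returns a list in which each element is a "DGEResult" object. Each object holds a top table with the log2 fold change of every gene and its adjusted p-value for significance. By default the function filters the results by raw p-value, adjusted p-value, and log fold change; the adj.pval, raw.pval, and logFC arguments customize these thresholds. P-values are adjusted with the Benjamini & Hochberg method; see ?p.adjust for more information. By default the heatmap shows only the top 100 up- and down-regulated genes, which hmTopUpN and hmTopDownN adjust.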
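The filtering step can be sketched in base R. This is only an illustration of a Benjamini & Hochberg adjustment followed by three thresholds; the exact comparison operators the package uses are an assumption here, and the table values are made up.

```r
# Illustrative top table; the gene names and numbers are made up.
top <- data.frame(
  logFC   = c(3.0, -2.5, 1.0, 4.0, 5.0),
  P.Value = c(0.001, 0.01, 0.02, 0.045, 0.5),
  row.names = c("g1", "g2", "g3", "g4", "g5")
)

# Benjamini & Hochberg correction, as selected by adj.method = "BH".
top$adj.P.Val <- p.adjust(top$P.Value, method = "BH")

# Assumed filter: keep genes that pass all three thresholds.
adj_pval  <- 0.05
raw_pval  <- 0.05
min_logFC <- 2
hits <- top[top$adj.P.Val < adj_pval &
            top$P.Value   < raw_pval &
            abs(top$logFC) > min_logFC, ]
hits
```

Note that g4 passes the raw p-value threshold but drops out after adjustment, which is the reason both thresholds are offered.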

# Show head of expression outputs
> diffGeneExprs
 [[1]]
 Dataset:RNASeq
 DGEResult object, dim: 15 6
# Dataset: RNASeq
> showResults(diffGeneExprs[[1]])
 Dataset: RNASeq
 logFC AveExpr t P.Value adj.P.Val B
 TAP2 5.288573 1.743410 76.38279 7.496531e-76 4.760297e-73 150.82307
 GRTP1 6.187648 2.193494 48.25410 2.030875e-60 6.448029e-58 123.44071
 ENPP5 7.215676 2.707012 45.79069 1.110692e-58 2.350964e-56 120.11226
 APH1B 8.118533 3.158063 37.31744 5.796089e-52 6.134194e-50 106.20724
 INSR 8.055541 3.126521 32.82711 7.967149e-48 1.945823e-46 97.36010
 MINPP1 7.097777 2.647495 30.36322 2.431258e-45 3.508747e-44 91.95723

toptableOut = showResults(diffGeneExprs[[1]])

If "DrawPlots" is set to FALSE, running the code above produces no output figure.

Voom + limma: To voom (variance modeling at the observational level) is to estimate the mean-variance relationship robustly and non-parametrically from the data. Voom works with log-counts that are normalized for sequence depth, in particular with log-counts per million (log-cpm). The mean-variance trend is fitted to the gene-wise standard deviations of the log-cpm, as a function of the average log-count. This method incorporates the mean-variance trend into a precision weight for each individual normalized observation. The normalized log-counts and associated precision weights can then be entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision-weight aware (Smyth, G. K 2004; Law, C. W. and Chen, Y. and Shi, W. and Smyth, G. K 2014). The following publications give more information about the methods:

limma : Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3.

Voom: Law, C. W., Chen, Y., Shi, W., Smyth, G. K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15, R29.
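The log-cpm transformation that voom starts from can be written in a few lines of base R. The sketch below covers only that first step (with the 0.5 and 1 offsets voom applies); it does not fit the mean-variance trend or compute precision weights, and the count matrix is made up.

```r
# Toy count matrix: rows are genes, columns are samples (made-up numbers).
counts <- matrix(c(10, 20, 30, 40), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

# Library size: total counts per sample.
lib_size <- colSums(counts)

# log2 counts per million, with small offsets to avoid log(0).
log_cpm <- t(log2(t(counts + 0.5) / (lib_size + 1) * 1e6))
log_cpm
```

The transposes are there so that each sample's counts are divided by that sample's own library size.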

Correlation Between Gene Expression and Copy Number

getCNGECorrelation returns the correlation coefficients and adjusted p-values between copy number and gene expression data. The function takes the main data object as input; it uses gene copy number estimates from the GISTIC2 algorithm (Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G 2011) and gene expression values from every platform (RNAseq and arrays) to prepare the returned list. Each element of the list is a "CorResult" object holding the results of one comparison.

> corrGECN = getCNGECorrelation(dataObject=RTCGASample,adj.method="BH",
 + adj.pval=0.9,raw.pval=0.05)
 > corrGECN
 [[1]]
 Dataset:RNASeq
 CorResult object, dim: 43 4
> showResults(corrGECN[[1]])
 Dataset: RNASeq
 GeneSymbol Cor adj.p.value p.value
 2 SEPHS1 -0.3769382 0.6287624 0.01650501
 9 MSMB -0.3404472 0.8472802 0.03158898
 39 PRMT3 -0.3806078 0.6287624 0.01540088
 85 HTR3A 0.3378158 0.8472802 0.03301479
 91 PHLDB1 -0.3260183 0.8688328 0.04007191
 101 EMP1 0.3814696 0.6287624 0.01515084
> corRes = showResults(corrGECN[[1]])
 Dataset: RNASeq
 GeneSymbol Cor adj.p.value p.value
 2 SEPHS1 -0.3769382 0.6287624 0.01650501
 9 MSMB -0.3404472 0.8472802 0.03158898
 39 PRMT3 -0.3806078 0.6287624 0.01540088
 85 HTR3A 0.3378158 0.8472802 0.03301479
 91 PHLDB1 -0.3260183 0.8688328 0.04007191
 101 EMP1 0.3814696 0.6287624 0.01515084

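For the correlation analysis, RNASeq data (if present) are normalized first. P-values are adjusted with the Benjamini & Hochberg method. The association between the paired samples is tested with Pearson's product-moment correlation coefficient. If the samples follow independent normal distributions, the test statistic follows a t distribution with length(x)-2 degrees of freedom. See ?cor.test for more details.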
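The same test can be tried directly with cor.test on simulated data. The sketch below is a base-R illustration of a per-gene Pearson test followed by a BH adjustment, not the package's implementation; the gene names and values are simulated.

```r
set.seed(1)
n <- 20  # number of paired samples

# Simulated copy number and expression for three hypothetical genes.
cn <- matrix(rnorm(n * 3), nrow = n,
             dimnames = list(NULL, c("g1", "g2", "g3")))
ge <- 0.5 * cn + matrix(rnorm(n * 3), nrow = n)

# One Pearson test per gene; `parameter` holds the degrees of freedom.
res <- do.call(rbind, lapply(colnames(cn), function(g) {
  ct <- cor.test(cn[, g], ge[, g], method = "pearson")
  data.frame(GeneSymbol = g,
             Cor = unname(ct$estimate),
             df = unname(ct$parameter),
             p.value = ct$p.value)
}))

# Adjust all raw p-values together with Benjamini & Hochberg.
res$adj.p.value <- p.adjust(res$p.value, method = "BH")
res
```

The df column equals n - 2 for every gene, matching the t distribution described above.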

Mutation Frequency

getMutationRate computes a data frame with the mutation frequency of each gene.

# Mutation frequencies
> mutFrq = getMutationRate(dataObject=RTCGASample)
> head(mutFrq[order(mutFrq[,2],decreasing=TRUE),])
 Genes MutationRatio
 FCGBP FCGBP 0.46
 NF1 NF1 0.31
 ASTN1 ASTN1 0.24
 ODZ4 ODZ4 0.22
 BRWD1 BRWD1 0.22
 SYNE2 SYNE2 0.22
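One plausible way to arrive at such a table is to count, for each gene, the fraction of samples carrying at least one mutation. The sketch below assumes that definition (the package's exact denominator is not stated here) and uses made-up mutation calls.

```r
# Made-up mutation calls: one row per (sample, gene) mutation.
mut <- data.frame(
  Sample = c("s1", "s1", "s2", "s3", "s3", "s3"),
  Gene   = c("NF1", "FCGBP", "FCGBP", "FCGBP", "NF1", "NF1"),
  stringsAsFactors = FALSE
)
n_samples <- 4  # cohort size, including an unmutated sample "s4"

# Count each gene at most once per sample, then divide by the cohort size.
per_sample <- unique(mut[, c("Sample", "Gene")])
gene_counts <- table(per_sample$Gene)
mutFrq <- data.frame(Genes = names(gene_counts),
                     MutationRatio = as.numeric(gene_counts) / n_samples,
                     stringsAsFactors = FALSE)
mutFrq[order(mutFrq$MutationRatio, decreasing = TRUE), ]
```

Deduplicating by sample first keeps a gene with several mutations in one sample from being counted more than once.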

Univariate (Survival) Analysis

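Univariate survival analysis is regarded as a way to obtain information of value for clinical diagnosis. The function creates 2 or 3 groups based on expression data. (If the dataset has RNASeq data, the data are normalized for the survival analysis.) With 2 groups, the package splits the samples at the median expression level of the individual gene; with 3 groups, they are defined as the samples in the first quartile (expression < 1st Q), the samples with higher expression (expression > 3rd Q), and the samples lying between the two.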
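The grouping itself is easy to reproduce in base R. The sketch below assumes a median split for two groups and quartile cut-offs for three; the expression values are simulated and the group labels are arbitrary.

```r
set.seed(42)
expr <- rexp(40)  # simulated expression of one gene across 40 samples

# Two groups: split at the median.
two_groups <- ifelse(expr > median(expr), "High", "Low")

# Three groups: below the 1st quartile, above the 3rd quartile, in between.
q <- quantile(expr, c(0.25, 0.75))
three_groups <- ifelse(expr < q[1], "Low",
                       ifelse(expr > q[2], "High", "Mid"))

table(two_groups)
table(three_groups)
```

With three groups, the middle half of the samples ends up in the "Mid" group, so the two extreme groups that are compared are smaller.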

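The univariate survival analysis function needs survival data, which can be derived from the clinical data frame. The first column of the survival data holds the sample barcodes, the second the time, and the last the event data. The following shows how to get the clinical data, build the survival data, and run the univariate survival analysis.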

# Creating the survival data frame and running the analysis for
# FCGBP, which is one of the most frequently mutated genes in the toy data.
# Running the following code produces the KM plot.
> clinicData <- getData(RTCGASample,"Clinical")
> head(clinicData)
 Composite.Element.REF yearstobirth vitalstatus daystodeath
 TEST.00.0026 value 53 0 NA
 TEST.00.0052 value 50 0 NA
 TEST.00.0088 value 56 0 NA
 TEST.00.0056 value 56 0 NA
 TEST.00.0023 value 56 0 NA
 TEST.00.0092 value 52 0 NA
 daystolastfollowup neoplasm.diseasestage pathology.T.stage
 TEST.00.0026 1183 stage iiia stage iiia
 TEST.00.0052 897 stage iib stage iib
 TEST.00.0088 1000 stage iib stage iib
 TEST.00.0056 1134 stage iia stage iia
 TEST.00.0023 794 stage iib stage iib
 TEST.00.0092 1104 stage iia stage iia
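# Keep vitalstatus, daystodeath, and daystolastfollowup
clinicData = clinicData[,3:5]
# Where daystolastfollowup is missing, use daystodeath as the time
clinicData[is.na(clinicData[,3]),3] = clinicData[is.na(clinicData[,3]),2]
# Survival data: sample barcode, time, and event (vital status)
survData <- data.frame(Samples=rownames(clinicData),
    Time=as.numeric(clinicData[,3]),
    Censor=as.numeric(clinicData[,1]))
getSurvival(dataObject=RTCGASample,geneSymbols=c("FCGBP"),sampleTimeCensor=survData)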

Exporting Data

Use getData() to export downloaded data from a FirehoseData object.

# Note: This function is provided for real dataset test since the toy dataset is small.
 RTCGASample

TEST FirehoseData object
 Available data types:
 Clinical: A data frame, dim: 100 7
 RNASeqGene: A matrix withraw read counts or normalized data, dim: 800 80
 GISTIC: A FirehoseGISTIC object to store copy number data
 Mutations: A data.frame, dim: 2685 30
 To export data, you may use getData() function.
RTCGASampleClinical = getData(RTCGASample,"Clinical")
 RTCGASampleRNAseqCounts = getData(RTCGASample,"RNASeqGene")
 RTCGASampleCN = getData(RTCGASample,"GISTIC")
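Once a matrix or data frame has been extracted, it can be written to disk with base R. The sketch below uses a small made-up matrix as a stand-in for the object returned by getData(); the file name is arbitrary.

```r
# Stand-in for a matrix returned by getData(); the values are made up.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Write to a CSV file in the session's temporary directory.
out_file <- file.path(tempdir(), "rnaseq_counts.csv")
write.csv(expr, out_file)

# Read it back, restoring gene names as row names.
back <- as.matrix(read.csv(out_file, row.names = 1, check.names = FALSE))
all(back == expr)
```

Setting check.names = FALSE keeps sample names intact on the way back in, which matters for TCGA barcodes that contain hyphens.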

Reproducing the BRCA Results from the Original Paper

The following code block reproduces the case study in the RTCGAToolbox paper (Samur MK. 2014).

# BRCA data with mRNA (Both array and RNASeq), GISTIC processed copy number data
# mutation data and clinical data
# (Depending on bandwidth, this process may take a long time)
 brcaData = getFirehoseData (dataset="BRCA", runDate="20140416", gistic2_Date="20140115",
 Clinic=TRUE, RNAseq_Gene=TRUE, mRNA_Array=TRUE, Mutation=TRUE)

# Differential gene expression analysis for gene level RNA data.
# Heatmaps are given below.
 diffGeneExprs = getDiffExpressedGenes(dataObject=brcaData,DrawPlots=TRUE,
 adj.method="BH",adj.pval=0.05,raw.pval=0.05,
 logFC=2,hmTopUpN=100,hmTopDownN=100)
# Show head for expression outputs
 diffGeneExprs
 showResults(diffGeneExprs[[1]])
 toptableOut = showResults(diffGeneExprs[[1]])
# Correlation between expression profiles and copy number data
 corrGECN = getCNGECorrelation(dataObject=brcaData,adj.method="BH",
 adj.pval=0.05,raw.pval=0.05)

corrGECN
 showResults(corrGECN[[1]])
 corRes = showResults(corrGECN[[1]])

# Gene mutation frequencies in the BRCA dataset
 mutFrq = getMutationRate(dataObject=brcaData)
 head(mutFrq[order(mutFrq[,2],decreasing=TRUE),])

# PIK3CA, which is one of the most frequently mutated genes in the BRCA dataset
# KM plot is given below.
 clinicData <- getData(brcaData,"Clinical")
 head(clinicData)
 clinicData = clinicData[,3:5]
 clinicData[is.na(clinicData[,3]),3] = clinicData[is.na(clinicData[,3]),2]
 survData <- data.frame(Samples=rownames(clinicData),
 Time=as.numeric(clinicData[,3]),
 Censor=as.numeric(clinicData[,1]))
 getSurvival(dataObject=brcaData,geneSymbols=c("PIK3CA"),sampleTimeCensor=survData)

Report Figure

This function uses RCircos (Zhang, H. and Meltzer, P. and Davis, S 2013) to draw an overall circular plot summarizing the input dataset. It requires differential gene expression analysis results (at most 2 different platforms), copy number estimates from GISTIC2 (Mermel, C. H. and Schumacher, S. E. and Hill, B. and Meyerson, M. L. and Beroukhim, R. and Getz, G 2011), and mutation data.

# Creating dataset analysis summary figure with getReport.
# Figure will be saved as PDF file.
 library("Homo.sapiens")
 locations = genes(Homo.sapiens,columns="SYMBOL")
 locations = as.data.frame(locations)
 locations <- locations[,c(6,1,5,2:3)]
 locations <- locations[!is.na(locations[,1]),]
 rownames(locations) <- locations[,1]
 getReport(dataObject=brcaData,DGEResult1=diffGeneExprs[[1]],
 DGEResult2=diffGeneExprs[[2]],geneLocations=locations)

Data Object

Although RTCGASample is a "toy", it is still a FirehoseData object; it stores RNAseq, copy number, mutation, and clinical data for an artificially created dataset.

data(RTCGASample)
 ## RTCGASample dataset is artificially created for function test.
## It isn't biologically meaningful and it has no relation with any cancer type.
 ## For real datasets, please use client function to get data from data portal.

Reference: the official RTCGAToolbox documentation

