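There are many ways to download TCGA data. The previous post covered batch downloads with gdc-client. Since plenty of online tutorials use the TCGAbiolinks package instead, I wanted to learn that approach too. The advantage of TCGAbiolinks is that downloading and merging happen in one workflow, so you don't need a separate, complicated step to combine the individually downloaded files. In other words, the data you get from TCGAbiolinks is already merged and needs no further assembly (see "TCGAbiolinks data download"). In the previous post I downloaded RNA-seq data for 20-odd patients, only to find that every file was separate and still had to be merged by hand. Once I saw this advantage, I decided to learn TCGAbiolinks. (OK, maybe I'm just lazy...)
The official TCGAbiolinks documentation is at: http://www.bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
(I) Install TCGAbiolinks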
> BiocManager::install("TCGAbiolinks")
> library(TCGAbiolinks)
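The install command above only works if BiocManager itself is already installed from CRAN. Below is a minimal sketch of a guard for that. The helper name `missing_pkgs()` is hypothetical (it is not part of any package), and the install calls are left commented out so the snippet has no side effects.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_pkgs(c("BiocManager", "TCGAbiolinks"))

# Install only what is missing (uncomment to actually run):
# if ("BiocManager" %in% to_install) install.packages("BiocManager")
# if ("TCGAbiolinks" %in% to_install) BiocManager::install("TCGAbiolinks")
```

Checking first avoids reinstalling a large Bioconductor package every time the script is rerun.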
(II) Choose the cancer type to download
> TCGAbiolinks::getGDCprojects()$project_id
[1] "TCGA-SARC" "TARGET-CCSK"
[3] "TARGET-NBL" "TARGET-AML"
[5] "TCGA-MESO" "TCGA-ACC"
[7] "TCGA-READ" "TCGA-LGG"
[9] "BEATAML1.0-CRENOLANIB" "TCGA-THCA"
[11] "VAREPOP-APOLLO" "HCMI-CMDC"
[13] "TCGA-CHOL" "TCGA-KIRC"
[15] "ORGANOID-PANCREATIC" "TCGA-BRCA"
[17] "TCGA-OV" "TCGA-GBM"
[19] "TCGA-SKCM" "GENIE-VICC"
[21] "TCGA-DLBC" "CGCI-BLGSP"
[23] "OHSU-CNL" "CPTAC-3"
[25] "BEATAML1.0-COHORT" "TCGA-KICH"
[27] "TCGA-UVM" "TCGA-THYM"
[29] "TCGA-TGCT" "TCGA-LUSC"
[31] "TCGA-PRAD" "FM-AD"
[33] "TCGA-UCEC" "TCGA-LAML"
[35] "TARGET-ALL-P2" "TCGA-STAD"
[37] "TARGET-ALL-P3" "GENIE-DFCI"
[39] "GENIE-NKI" "GENIE-MDA"
[41] "GENIE-JHU" "GENIE-MSK"
[43] "TCGA-ESCA" "TCGA-HNSC"
[45] "TARGET-OS" "TARGET-RT"
[47] "TCGA-LIHC" "CTSP-DLBCL1"
[49] "TCGA-COAD" "TCGA-LUAD"
[51] "TCGA-CESC" "TARGET-WT"
[53] "NCICCR-DLBCL" "TCGA-PAAD"
[55] "MMRF-COMMPASS" "TARGET-ALL-P1"
[57] "CPTAC-2" "TCGA-UCS"
[59] "TCGA-KIRP" "TCGA-PCPG"
[61] "TCGA-BLCA" "GENIE-UHN"
[63] "GENIE-GRCC"
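For the cancer type behind each abbreviation, see: "TCGA cancer abbreviations, with Chinese and English cancer names"
# I work on head and neck cancer, so I chose HNSC; this differs from the tutorial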
> cancer_type="TCGA-HNSC"
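The project list mixes TCGA with other programs (TARGET, GENIE, CPTAC and so on), so it can help to filter it before picking one. The sketch below uses a handful of IDs copied from the output above instead of calling `getGDCprojects()`, so it runs without a network connection. The variable names are my own.

```r
# A few project IDs copied from the getGDCprojects() output above
project_ids <- c("TCGA-SARC", "TARGET-CCSK", "TARGET-NBL", "TCGA-HNSC",
                 "GENIE-MSK", "TCGA-BRCA", "CPTAC-3")

# Keep only the TCGA projects and sort them alphabetically
tcga_ids <- sort(grep("^TCGA-", project_ids, value = TRUE))

# Check that the chosen project really exists before querying it
cancer_type <- "TCGA-HNSC"
stopifnot(cancer_type %in% tcga_ids)

# Strip the prefix to get the bare cancer abbreviations
cancer_abbr <- sub("^TCGA-", "", tcga_ids)
```

With the real list, `project_ids` would be `TCGAbiolinks::getGDCprojects()$project_id`.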
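(III) Choose the data type you want to download
The tutorial downloads clinical data at this point, so I'll walk through that first:
> clinical <- GDCquery_clinic(project = cancer_type, type = "clinical")
Inspect the downloaded data: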
> clinical[1:4,1:4]
  submitter_id year_of_diagnosis classification_of_tumor last_known_disease_status
1 TCGA-4P-AA8J              2013            not reported              not reported
2 TCGA-BA-4074              2003            not reported              not reported
3 TCGA-BA-4075              2004            not reported              not reported
4 TCGA-BA-4076              2003            not reported              not reported
> dim(clinical)
[1] 528  78   # a table with 528 rows and 78 columns
To see what the table contains, open it in the viewer:
> View(clinical)
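Each row of the table is one case (patient), and each column holds one item of information about that patient.
Now you can save the table you downloaded (the file names say HNSC, to match the project chosen above):
> save(clinical, file = "HNSC_clinical.Rdata")
> write.csv(clinical, file = "TCGAbiolinks-HNSC-clinical.csv")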
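To check that saved files can be read back, here is a small round-trip sketch. It uses a toy stand-in for the clinical table (the four rows printed above) and temporary files, so none of the names below come from TCGAbiolinks.

```r
# Toy stand-in for the clinical table (first rows shown above)
clinical_demo <- data.frame(
  submitter_id = c("TCGA-4P-AA8J", "TCGA-BA-4074", "TCGA-BA-4075", "TCGA-BA-4076"),
  year_of_diagnosis = c(2013L, 2003L, 2004L, 2003L),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".Rdata")

# row.names = FALSE skips the extra index column that write.csv adds by default
write.csv(clinical_demo, file = csv_path, row.names = FALSE)
save(clinical_demo, file = rdata_path)

# read.csv returns the table; load() restores the object under its original name
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
restored <- new.env()
load(rdata_path, envir = restored)

# Number of cases per year of diagnosis
cases_per_year <- table(from_csv$year_of_diagnosis)
```

The .Rdata file preserves column types exactly, while the CSV is easier to open in other tools.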
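(IV) What if I don't want the clinical data and only want the experimental data?
No problem. Download recipes for the various data types are easy to find online:
(1) RNA-seq count data
Code reference: "R package: TCGAbiolinks"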
> library(dplyr)
> library(DT)
> library(SummarizedExperiment)
> data_type <- "Gene Expression Quantification"  # data type: gene expression quantification
> data_category <- "Transcriptome Profiling"     # data category: transcriptome
> workflow_type <- "HTSeq - Counts"              # workflow offered when this was written; newer GDC releases list "STAR - Counts" instead
> query_TranscriptomeCounts <- GDCquery(project = cancer_type,
                                        data.category = data_category,
                                        data.type = data_type,
                                        workflow.type = workflow_type)
# Official documentation for the GDCquery arguments:
http://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html#useful_information
# The call prints a long stream of messages like the following...
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-HNSC
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
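# Download the files found by the query above; they are stored automatically in a folder under the working directory (GDCdata by default)
> GDCdownload(query_TranscriptomeCounts, method = "api")
# method: either "api" (the GDC API, via POST) or "client" (the gdc-client tool). The API is faster, but the data can get corrupted during the download, in which case the command has to be run again.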
Downloading data for project TCGA-HNSC
GDCdownload will download 546 files. A total of 136.906805 MB
Downloading as: Wed_Jan_08_16_55_43_2020.tar.gz
Downloading: 140 MB
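Since an API download sometimes has to be rerun, a small retry wrapper can save some manual work. This is only a sketch: `with_retry()` is a hypothetical helper, not part of TCGAbiolinks, and it only helps when the failure surfaces as an R error. The toy function `flaky()` stands in for the download so the example runs offline.

```r
# Hypothetical helper: call `fun` until it succeeds or `max_tries` is reached.
with_retry <- function(fun, max_tries = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Toy function that fails twice and then succeeds, standing in for a flaky download
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated download error")
  "done"
}

outcome <- with_retry(flaky, max_tries = 5)
```

For the real download, the call would look like `with_retry(function() GDCdownload(query_TranscriptomeCounts, method = "api"))`.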
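# Convert the downloaded files into an R object. GDCprepare returns a SummarizedExperiment or a data.frame; for a SummarizedExperiment, the assay is a matrix-like container with genes as row names and samples as column names.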
> expdat <- GDCprepare(query = query_TranscriptomeCounts)
|===================================================================================|100% Completed after 1 m
Starting to add information to samples
=> Add clinical information to samples
Add FFPE information. More information at:
=> https://cancergenome.nih.gov/cancersselected/biospeccriteria
=> http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html
=> Adding subtype information to samples
hnsc subtype information from:doi:10.1038/nature14129
Accessing www.ensembl.org to get gene information
Downloading genome information (try:0) Using: Human genes (GRCh38.p13)
From the 60483 genes we couldn't map 3971
I hit an error at this step. If you also see an error like "internal error -3", try restarting RStudio. Reference: "lazy-load database 'P' is corrupt #3"
> count_matrix=assay(expdat)
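> View(count_matrix)  # open the matrix in the viewer to see what it looks like
This table holds the familiar count values. You can now process it (and torture it) however you like...
But before doing anything else, don't forget to save it: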
> write.csv(count_matrix,file = "TCGAbiolinks_HNSC_counts.csv")
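The column names of the count matrix are TCGA sample barcodes, and a common first step is to split tumor from normal samples. In a TCGA barcode, characters 14-15 hold the sample type code (01-09 tumor, 10-19 normal) and the first 12 characters identify the patient. The sketch below uses a toy matrix: the count values and the barcode suffixes are made up for illustration.

```r
# Toy count matrix; values and barcode suffixes are invented for illustration
counts_demo <- matrix(
  c(10L, 0L, 5L, 7L, 3L, 8L),
  nrow = 2,
  dimnames = list(
    c("ENSG00000000003", "ENSG00000000005"),
    c("TCGA-BA-4074-01A-01R-1436-07",
      "TCGA-BA-4075-01A-01R-1436-07",
      "TCGA-CV-6933-11A-01R-1915-07")
  )
)

# Characters 14-15 of a TCGA barcode hold the sample type code
sample_code <- as.integer(substr(colnames(counts_demo), 14, 15))

# Codes 01-09 are tumor samples, codes 10-19 are normal samples
tumor_counts  <- counts_demo[, sample_code < 10, drop = FALSE]
normal_counts <- counts_demo[, sample_code >= 10 & sample_code < 20, drop = FALSE]

# The first 12 characters identify the patient (same format as clinical$submitter_id)
patient_id <- substr(colnames(counts_demo), 1, 12)
```

`drop = FALSE` keeps the result a matrix even when only one sample matches.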
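(2) Download RNA-seq FPKM data
> Expr_df <- GDCquery(project = cancer_type,
                      data.category = data_category,
                      data.type = data_type,
                      workflow.type = "HTSeq - FPKM")
> GDCdownload(Expr_df, method = "api", files.per.chunk = 100)
# files.per.chunk: makes the API method download only n files at a time. Large downloads are more likely to fail, and setting files.per.chunk reduces that risk. The value is an integer giving the number of files per chunk, so the download is split into several archives; for example, files.per.chunk = 6 downloads six files at a time.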
Downloading data for project TCGA-HNSC
GDCdownload will download 546 files. A total of 278.868796 MB
Downloading chunk 1 of 6 (100 files, size = 51.032322 MB) as Wed_Jan_08_19_22_27_2020_0.tar.gz
Downloading: 51 MB
Downloading chunk 2 of 6 (100 files, size = 51.063004 MB) as Wed_Jan_08_19_22_27_2020_1.tar.gz
Downloading: 51 MB
Downloading chunk 3 of 6 (100 files, size = 51.028334 MB) as Wed_Jan_08_19_22_27_2020_2.tar.gz
Downloading: 51 MB
Downloading chunk 4 of 6 (100 files, size = 51.08847 MB) as Wed_Jan_08_19_22_27_2020_3.tar.gz
Downloading: 51 MB
Downloading chunk 5 of 6 (100 files, size = 51.14113 MB) as Wed_Jan_08_19_22_27_2020_4.tar.gz
Downloading: 51 MB
Downloading chunk 6 of 6 (46 files, size = 23.515536 MB) as Wed_Jan_08_19_22_27_2020_5.tar.gz
Downloading: 24 MB
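The chunk counts in the log follow directly from the file count and files.per.chunk. The arithmetic can be sketched as below. It only illustrates the numbers and is not how TCGAbiolinks implements chunking internally.

```r
n_files <- 546          # number of files reported by GDCdownload above
files_per_chunk <- 100  # value passed as files.per.chunk

# Number of chunks needed to cover all files
n_chunks <- ceiling(n_files / files_per_chunk)

# Assign each file index to a chunk, then count the files in each chunk
chunk_id <- ceiling(seq_len(n_files) / files_per_chunk)
chunk_sizes <- as.integer(table(chunk_id))
```

This gives six chunks, five of 100 files and a last one of 46, matching the log.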
> expdat_2 <- GDCprepare(query = Expr_df)
> Expr_matrix=assay(expdat_2)
> write.csv(Expr_matrix,file = "TCGAbiolinks_HNSC_FPKM.csv")
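(3) Download other data types
I don't use the other data types often, but I looked up the code anyway in case I need it later. Reference articles:
1. Study notes on using TCGAbiolinks from TCGA data download to downstream analysis
2. R package: TCGAbiolinks
3. TCGA 3. Downloading data with the R package TCGAbiolinks
4. TCGA data download: TCGAbiolinks parameters explained
5. Downloading from the TCGA database: several methods and their pros and cons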
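# Download miRNA data
query <- GDCquery(project = cancer_type,
                  data.category = "Transcriptome Profiling",
                  data.type = "miRNA Expression Quantification",
                  workflow.type = "BCGSC miRNA Profiling")
GDCdownload(query, method = "api", files.per.chunk = 50)
mirna_data <- GDCprepare(query = query)
# GDCprepare may return a plain data.frame for this data type;
# assay() only applies when the result is a SummarizedExperiment
mirna_table <- if (is(mirna_data, "SummarizedExperiment")) assay(mirna_data) else mirna_data
write.csv(mirna_table, file = paste(cancer_type, "miRNA.csv", sep = "-"))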
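# Download Copy Number Variation data
query <- GDCquery(project = cancer_type,
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")
GDCdownload(query, method = "api", files.per.chunk = 50)
cnv_data <- GDCprepare(query = query)
# Segment data is a table of segments, so GDCprepare may return a data.frame;
# assay() only applies when the result is a SummarizedExperiment
cnv_table <- if (is(cnv_data, "SummarizedExperiment")) assay(cnv_data) else cnv_data
write.csv(cnv_table, file = paste(cancer_type, "Copy-Number-Variation.csv", sep = "-"))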
# Download methylation data
query.met <- GDCquery(project = cancer_type,
                      legacy = TRUE,  # query the GDC legacy archive
                      data.category = "DNA methylation")
GDCdownload(query.met, method = "api", files.per.chunk = 300)
# Prepare the methylation query (query.met), not the earlier `query` object
met_data <- GDCprepare(query = query.met)
met_matrix <- assay(met_data)
write.csv(met_matrix, file = paste(cancer_type, "methylation.csv", sep = "-"))