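[R packages] Downloading and processing TCGA data with the GenomicDataCommons package

This post follows the package manual. The R community already has several packages for downloading and processing TCGA data, but as far as I know only TCGAbiolinks and this package can retrieve the latest GRCh38 data. This package is also the official R package for the GDC, which hosts TCGA, so it can be trusted. The tedious part is converting between file IDs and sample IDs.

What is the GDC?

From the Genomic Data Commons (GDC) website: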

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.

The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared.

As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it is worth a quick overview. The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details. The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.

Quick start

The package is still under development, and user feedback is welcome. To report a bug or a problem, either submit a new issue or call bug.report(package='GenomicDataCommons') from R.

Installation

From Bioconductor:

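# biocLite() has been retired; current Bioconductor releases install through BiocManager
if (!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')
BiocManager::install('GenomicDataCommons')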

Load the package:

library(GenomicDataCommons)

Check basic functionality

GenomicDataCommons::status()
## $commit
## [1] "e9e20d6f97f2bf6dd3b3261e36ead57c56a4c7cc"
## 
## $data_release
## [1] "Data Release 12.0 - June 13, 2018"
## 
## $status
## [1] "OK"
## 
## $tag
## [1] "1.14.1"
## 
## $version
## [1] 1

If this statement results in an error such as SSL connect error, see the troubleshooting section below.
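A scripted workflow may want to stop early when the API cannot be reached. The sketch below is one way to wrap that check. gdc_is_up is a hypothetical helper, not part of GenomicDataCommons, and it assumes only that status() returns a list with a status element, as in the output above.

```r
# Hypothetical helper (not part of GenomicDataCommons): returns TRUE only
# when the supplied status function succeeds and reports "OK".
gdc_is_up <- function(status_fun) {
    st <- tryCatch(status_fun(), error = function(e) NULL)
    !is.null(st) && isTRUE(st$status == "OK")
}

# Offline demonstration with stand-in functions
fake_ok <- function() list(status = "OK", version = 1)
fake_down <- function() stop("SSL connect error")
gdc_is_up(fake_ok)
gdc_is_up(fake_down)
```

With the package installed, the real call would be gdc_is_up(GenomicDataCommons::status).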

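Finding data

The code below builds a manifest that drives the download of raw data. It finds the gene expression files for ovarian cancer and keeps only the raw counts produced by the HTSeq workflow.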

library(magrittr)
ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()

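Downloading data

The manifest lists 379 gene expression files; the code below downloads the first 20 of them. Downloading with several processes in parallel can speed this up considerably.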

destdir = tempdir()
fnames = lapply(ge_manifest$id[1:20],gdcdata)

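If the download includes controlled-access data, it must be accompanied by a token (proof of download permission). See the authentication section below for details.

Retrieving metadata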

expands = c("diagnoses","annotations",
             "demographic","exposures")
clinResults = cases() %>%
    GenomicDataCommons::select(NULL) %>%
    GenomicDataCommons::expand(expands) %>%
    results(size=50)
str(clinResults,list.len=10)
## List of 4
##  $ diagnoses  :List of 50
##   ..$ 3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Lung, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-19T09:21:58.285024-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD7376_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Liver"
##   .. .. [list output truncated]
##   ..$ 36a29f50-5081-4a45-ab83-b11164e6781a:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Lung, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:15:26.111532-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Small cell carcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD15003_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Lymph node, NOS"
##   .. .. [list output truncated]
##   ..$ b8ed18f9-2582-4454-a5d8-529370d9da2b:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "Unknown"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Head, face or neck, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:04:08.462038-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Squamous cell carcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD4846_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Lung, NOS"
##   .. .. [list output truncated]
##   ..$ fd9ffb79-7d96-4a14-955e-236011d283b6:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "primary"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Colon, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-19T09:25:18.484242-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD8471_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Colon, NOS"
##   .. .. [list output truncated]
##   ..$ 672fdc0b-0844-4bbf-8b3a-1bff693d17fc:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "primary"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Stomach, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:04:28.356056-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD880_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Stomach, NOS"
##   .. .. [list output truncated]
##   ..$ 3468b376-b05e-40e7-851b-25bb9251378f:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "Unknown"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Unknown"
##   .. ..$ created_datetime                 : chr "2017-06-16T16:05:07.197842-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD103_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Not Reported"
##   .. .. [list output truncated]
##   ..$ 2d29df1c-a548-4e95-8206-3c473e9ffdca:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Lung, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T15:43:06.585011-05:00"
##   .. ..$ updated_datetime                 : chr "2018-01-29T14:25:29.405142-06:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD2816_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Connective, subcutaneous and other soft tissues, NOS"
##   .. .. [list output truncated]
##   ..$ 5bd74091-bef3-49a1-b617-45167f02baa6:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "Unknown"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Unknown"
##   .. ..$ created_datetime                 : chr "2017-06-16T15:39:59.533998-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Adenocarcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD10160_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Bone, NOS"
##   .. .. [list output truncated]
##   ..$ 48758eeb-6d56-41ff-b0f5-e2e5bfa3e63a:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "metastasis"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Anus, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-19T09:33:40.540869-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Squamous cell carcinoma, NOS"
##   .. ..$ submitter_id                     : chr "AD9751_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Brain, NOS"
##   .. .. [list output truncated]
##   ..$ 2bac2a54-5843-498d-bfeb-6d795ba54665:'data.frame': 1 obs. of  19 variables:
##   .. ..$ progression_or_recurrence        : chr "not reported"
##   .. ..$ classification_of_tumor          : chr "primary"
##   .. ..$ last_known_disease_status        : chr "not reported"
##   .. ..$ tumor_grade                      : chr "not reported"
##   .. ..$ tissue_or_organ_of_origin        : chr "Kidney, NOS"
##   .. ..$ created_datetime                 : chr "2017-06-16T15:43:06.585011-05:00"
##   .. ..$ updated_datetime                 : chr "2017-10-13T13:38:40.036812-05:00"
##   .. ..$ primary_diagnosis                : chr "Papillary renal cell carcinoma"
##   .. ..$ submitter_id                     : chr "AD2814_diagnosis"
##   .. ..$ site_of_resection_or_biopsy      : chr "Kidney, NOS"
##   .. .. [list output truncated]
##   .. [list output truncated]
##  $ case_id    : chr [1:50] "3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6" "36a29f50-5081-4a45-ab83-b11164e6781a" "b8ed18f9-2582-4454-a5d8-529370d9da2b" "fd9ffb79-7d96-4a14-955e-236011d283b6" ...
##  $ demographic:'data.frame': 50 obs. of  8 variables:
##   ..$ updated_datetime: chr [1:50] "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" "2017-10-13T13:38:40.036812-05:00" ...
##   ..$ created_datetime: chr [1:50] "2017-06-19T11:44:09.223033-05:00" "2017-06-19T11:42:12.871583-05:00" "2017-06-19T11:33:07.960036-05:00" "2017-06-19T11:45:29.124685-05:00" ...
##   ..$ gender          : chr [1:50] "male" "male" "female" "female" ...
##   ..$ submitter_id    : chr [1:50] "AD7376_demographic" "AD15003_demographic" "AD4846_demographic" "AD8471_demographic" ...
##   ..$ state           : chr [1:50] "submitted" "submitted" "submitted" "submitted" ...
##   ..$ race            : chr [1:50] "not reported" "not reported" "not reported" "not reported" ...
##   ..$ demographic_id  : chr [1:50] "39e8ea37-69ca-4541-b26c-4dbaa5cd8b77" "3a456586-6198-44d7-9dfa-fb510796d3ad" "ec055724-878d-4c2f-881c-aae7fcf8e5aa" "6f286488-6b1d-432c-8014-bacf68423fd6" ...
##   ..$ ethnicity       : chr [1:50] "not reported" "not reported" "not reported" "not reported" ...
##  $ id         : chr [1:50] "3562f2d3-a4ca-4eb3-a6a0-d2e9d68364c6" "36a29f50-5081-4a45-ab83-b11164e6781a" "b8ed18f9-2582-4454-a5d8-529370d9da2b" "fd9ffb79-7d96-4a14-955e-236011d283b6" ...
##  - attr(*, "row.names")= int [1:50] 1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"

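Basic design

The design of this package is somewhat similar to the "hadleyverse" approach of dplyr. Roughly, the functions for finding and retrieving files and metadata fall into three groups:

  1. Simple query constructors based on the GDC API endpoints
  2. A set of verbs that apply or modify filters, field selection, and faceting, and that return a new query object
  3. A set of verbs that execute a query and return results from the GDC

In addition, there are functions for asking the GDC API about itself and its available fields, for slicing BAM files, and for downloading the actual data files.

Here is an overview [1]: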

  • Creating a query
    • projects()
    • cases()
    • files()
    • annotations()
  • Manipulating a query
    • filter()
    • facet()
    • select()
  • Introspection on the GDC API fields
    • mapping()
    • available_fields()
    • default_fields()
    • grep_fields()
    • field_picker()
    • available_values()
    • available_expand()
  • Executing an API call to retrieve query results
    • results()
    • count()
    • response()
  • Raw data file downloads
    • gdcdata()
    • transfer()
    • gdc_client()
  • Summarizing and aggregating field values (faceting)
    • aggregations()
  • Authentication
    • gdc_token()
  • BAM file slicing
    • slicing()

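Usage

Working with the NCI GDC involves two main classes of operations.

  1. Querying metadata and finding data files (for example, finding all gene expression quantification files for a given cancer type)
  2. Transferring raw or processed data from the GDC to another computer

Both classes of operations are described in detail below.

Querying metadata

A large amount of metadata about patients, files, projects, and so-called annotations is available through the NCI GDC API. Typically, we want to query the metadata and then either download it or perform so-called aggregations (similar in function to table()).

Querying metadata starts with creating an empty query. We then usually want to filter the query to restrict it before retrieving results. The GenomicDataCommons package provides helper functions that list fields and values to assist with filtering.

Creating a query

The following four functions create a GDCQuery object for querying metadata: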

  • projects()
  • cases()
  • files()
  • annotations()

pquery = projects()

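The pquery object is now an S3 object of class GDCQuery. It contains the following elements:

  • fields: a character vector of the fields to return when data is retrieved. If no fields are specified, the default fields are used (see default_fields()).
  • filters: holds the result of calling filter(), which is applied when results are retrieved later.
  • facets: a character vector of fields used when aggregating data with aggregations().
  • archive: either "default" or "legacy".
  • token: a token string from the GDC. See the authentication section for details. Note that retrieving data usually requires no authorization; only some data needs it for download.

Look at the actual object (get used to using str()!), and note that the query contains no results.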

str(pquery)
## List of 5
##  $ fields : chr [1:16] "awg_review" "dbgap_accession_number" "disease_type" "in_review" ...
##  $ filters: NULL
##  $ facets : NULL
##  $ legacy : logi FALSE
##  $ expand : NULL
##  - attr(*, "class")= chr [1:3] "gdc_projects" "GDCQuery" "list"

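Retrieving results

[ GDC pagination documentation ]

[ GDC sorting documentation ]

Once the query object is built, the next step is to retrieve results from the GDC. The most basic kind of result is a simple count() of the available records that match the query. Note that we have not set any filter yet, so count() here returns the number of all project records.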

pcount = count(pquery)
# or
pcount = pquery %>% count()
pcount
## [1] 40

The results() method fetches the actual results:

presults = pquery %>% results()

The GDC returns these results as JSON, and the function automatically converts them into a nested list in R.

str(presults)
## List of 8
##  $ dbgap_accession_number: chr [1:10] "phs001179" "phs000470" NA NA ...
##  $ disease_type          :List of 10
##   ..$ FM-AD    : chr [1:23] "Germ Cell Neoplasms" "Acinar Cell Neoplasms" "Miscellaneous Tumors" "Thymic Epithelial Neoplasms" ...
##   ..$ TARGET-RT: chr "Rhabdoid Tumor"
##   ..$ TCGA-UCS : chr "Uterine Carcinosarcoma"
##   ..$ TCGA-LUSC: chr "Lung Squamous Cell Carcinoma"
##   ..$ TCGA-BRCA: chr "Breast Invasive Carcinoma"
##   ..$ TCGA-SKCM: chr "Skin Cutaneous Melanoma"
##   ..$ TARGET-OS: chr "Osteosarcoma"
##   ..$ TCGA-THYM: chr "Thymoma"
##   ..$ TARGET-WT: chr "High-Risk Wilms Tumor"
##   ..$ TCGA-ESCA: chr "Esophageal Carcinoma"
##  $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ state                 : chr [1:10] "open" "open" "open" "open" ...
##  $ primary_site          :List of 10
##   ..$ FM-AD    : chr [1:42] "Kidney" "Testis" "Unknown" "Other and unspecified parts of biliary tract" ...
##   ..$ TARGET-RT: chr "Kidney"
##   ..$ TCGA-UCS : chr "Uterus"
##   ..$ TCGA-LUSC: chr "Lung"
##   ..$ TCGA-BRCA: chr "Breast"
##   ..$ TCGA-SKCM: chr "Skin"
##   ..$ TARGET-OS: chr "Bone"
##   ..$ TCGA-THYM: chr "Thymus"
##   ..$ TARGET-WT: chr "Kidney"
##   ..$ TCGA-ESCA: chr "Esophagus"
##  $ project_id            : chr [1:10] "FM-AD" "TARGET-RT" "TCGA-UCS" "TCGA-LUSC" ...
##  $ id                    : chr [1:10] "FM-AD" "TARGET-RT" "TCGA-UCS" "TCGA-LUSC" ...
##  $ name                  : chr [1:10] "Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)" "Rhabdoid Tumor" "Uterine Carcinosarcoma" "Lung Squamous Cell Carcinoma" ...
##  - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

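By default only 10 records are returned. Passing the size and from arguments to results() changes which records, and how many, come back. There is also a convenience function, results_all(), that returns all available results. Use it with care: the result set can be huge and retrieval can take a very long time.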

length(ids(presults))
## [1] 10
presults = pquery %>% results_all()
length(ids(presults))
## [1] 40
# includes all the records
length(ids(presults)) == count(pquery)
## [1] TRUE
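When results_all() would be too heavy, the size and from arguments allow paging through a query in chunks. The sketch below is one way to do that. page_offsets and collect_pages are hypothetical helpers, and the code assumes that from is a zero-based offset of the first record to return; check the GDC pagination documentation before relying on that.

```r
# Hypothetical helper: start offsets for pages of `size` records,
# assuming a zero-based `from` offset.
page_offsets <- function(total, size) {
    seq(0, max(total - 1, 0), by = size)
}

# Hypothetical helper: fetch a query page by page.
# Needs GenomicDataCommons and network access when called.
collect_pages <- function(query, size = 10) {
    total <- GenomicDataCommons::count(query)
    lapply(page_offsets(total, size), function(offset) {
        GenomicDataCommons::results(query, size = size, from = offset)
    })
}

# Offsets for the 40 project records counted above, 10 per page
page_offsets(40, 10)
```

A call such as pages = collect_pages(projects(), size = 10) would then return a list with one results object per page.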

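Extracting subsets of the results, or converting them into more conventional R data structures, is not straightforward; the purrr, rlist, and data.tree packages can help.

To browse the results interactively, use the listviewer package.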
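For simple cases, base R is enough to pull the flat fields out of a results list. The sketch below keeps only the atomic elements that have one value per record and drops nested elements such as disease_type. flatten_scalar_fields is a hypothetical helper, the input is a small stand-in shaped like the presults output above, and the code assumes that the results list carries an id element.

```r
# Hypothetical helper: keep atomic fields with one value per record
# and return them as a data frame; nested list fields are dropped.
flatten_scalar_fields <- function(res) {
    n <- length(res$id)
    keep <- vapply(res, function(x) is.atomic(x) && length(x) == n, logical(1))
    as.data.frame(unclass(res)[keep], stringsAsFactors = FALSE)
}

# Stand-in for a results object, modelled on the str(presults) output
mock_results <- list(
    project_id = c("TCGA-UCS", "TCGA-LUSC"),
    released = c(TRUE, TRUE),
    primary_site = list(`TCGA-UCS` = "Uterus", `TCGA-LUSC` = "Lung"),
    id = c("TCGA-UCS", "TCGA-LUSC")
)
flat <- flatten_scalar_fields(mock_results)
flat
```

Nested fields need their own handling, which is where purrr or rlist become useful.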

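Fields and values

[ GDC fields documentation ]

Central to querying and retrieving GDC data is specifying which fields to return, filtering by fields and values, and faceting or aggregating. The package includes two simple functions, available_fields() and default_fields(). Each accepts one of the strings "cases", "files", "annotations", or "projects", or a GDCQuery object.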

default_fields('files')
##  [1] "access"                "acl"                  
##  [3] "batch_id"              "created_datetime"     
##  [5] "data_category"         "data_format"          
##  [7] "data_type"             "error_type"           
##  [9] "experimental_strategy" "file_autocomplete"    
## [11] "file_id"               "file_name"            
## [13] "file_size"             "file_state"           
## [15] "imaging_date"          "magnification"        
## [17] "md5sum"                "origin"               
## [19] "platform"              "read_pair_number"     
## [21] "revision"              "state"                
## [23] "state_comment"         "submitter_id"         
## [25] "tags"                  "type"                 
## [27] "updated_datetime"
# The number of fields available for files endpoint
length(available_fields('files'))
## [1] 703
# The first few fields available for files endpoint
head(available_fields('files'))
## [1] "access"                    "acl"                      
## [3] "analysis.analysis_id"      "analysis.analysis_type"   
## [5] "analysis.batch_id"         "analysis.created_datetime"

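Fields can be specified in a way similar to the dplyr workflow. The select() function is a verb that resets the fields slot of a GDCQuery object; note that this differs from dplyr, where select() restricts the fields that already exist.

# these are the default fields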
qcases = cases()
qcases$fields
##  [1] "aliquot_ids"              "analyte_ids"             
##  [3] "batch_id"                 "case_autocomplete"       
##  [5] "case_id"                  "created_datetime"        
##  [7] "days_to_index"            "days_to_lost_to_followup"
##  [9] "disease_type"             "index_date"              
## [11] "lost_to_followup"         "portion_ids"             
## [13] "primary_site"             "sample_ids"              
## [15] "slide_ids"                "state"                   
## [17] "submitter_aliquot_ids"    "submitter_analyte_ids"   
## [19] "submitter_id"             "submitter_portion_ids"   
## [21] "submitter_sample_ids"     "submitter_slide_ids"     
## [23] "updated_datetime"
# use all available fields
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)
## [1] "case_id"                   "aliquot_ids"              
## [3] "analyte_ids"               "annotations.annotation_id"
## [5] "annotations.batch_id"      "annotations.case_id"

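grep_fields() and field_picker() help find fields of interest.

Facets and aggregation

[ GDC facet documentation ]

Somewhat like R's table(), the GDC API offers an operation called aggregation or faceting: given one or more fields, the GDC returns a count for every value those fields take.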

# summary of file types
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type
##                            key doc_count
## 1      simple_somatic_mutation     64015
## 2   annotated_somatic_mutation     63580
## 3                aligned_reads     45985
## 4          copy_number_segment     44752
## 5              gene_expression     34713
## 6                  slide_image     30036
## 7       biospecimen_supplement     25151
## 8             mirna_expression     22976
## 9          clinical_supplement     12496
## 10      methylation_beta_value     12359
## 11 aggregated_somatic_mutation       186
## 12     masked_somatic_mutation       132
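To make the comparison with table() concrete, the sketch below produces the same key and doc_count layout from a local vector. The file_types vector is made-up illustration data, not GDC output.

```r
# Made-up local data standing in for the type field of four files
file_types <- c("gene_expression", "aligned_reads",
                "gene_expression", "slide_image")

# table() counts each value; the column names mimic the aggregation output
local_counts <- as.data.frame(table(key = file_types),
                              responseName = "doc_count",
                              stringsAsFactors = FALSE)

# Sort by descending count
local_counts <- local_counts[order(-local_counts$doc_count), ]
local_counts
```

The difference is that aggregations() does the counting on the GDC servers, so no records have to be downloaded first.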

Filtering

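[ GDC filtering documentation ]

The GenomicDataCommons package uses a form of non-standard evaluation to specify R-like queries, which are then translated into an R list. The R expression uses the formula interface, as suggested by Hadley Wickham in his vignette on non-standard evaluation: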

It’s best to use a formula because a formula captures both the expression to evaluate and the environment where the evaluation occurs. This is important if the expression is a mixture of variables in a data frame and objects in the local environment [for example].

Users do not need to worry about most of the underlying details, except that a filter expression must begin with ~.

qfiles = files()
qfiles %>% count() # all files
## [1] 356381

Filter for files of type gene expression:

qfiles = files() %>% filter(~ type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))
## List of 2
##  $ op     : 'scalar' chr "="
##  $ content:List of 2
##   ..$ field: chr "type"
##   ..$ value: chr "gene_expression"

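What if we want to create a filter based on a project (for example "TCGA-OVCA")? There are several ways to discover the available fields.

The first relies on basic R functions and some intuition.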

grep('pro',available_fields('files'),value=TRUE)
##  [1] "cases.diagnoses.progression_free_survival"               
##  [2] "cases.diagnoses.progression_free_survival_event"         
##  [3] "cases.diagnoses.progression_or_recurrence"               
##  [4] "cases.project.awg_review"                                
##  [5] "cases.project.dbgap_accession_number"                    
##  [6] "cases.project.disease_type"                              
##  [7] "cases.project.in_review"                                 
##  [8] "cases.project.intended_release_date"                     
##  [9] "cases.project.is_legacy"                                 
## [10] "cases.project.name"                                      
## [11] "cases.project.primary_site"                              
## [12] "cases.project.program.dbgap_accession_number"            
## [13] "cases.project.program.name"                              
## [14] "cases.project.program.program_id"                        
## [15] "cases.project.project_id"                                
## [16] "cases.project.releasable"                                
## [17] "cases.project.release_requested"                         
## [18] "cases.project.released"                                  
## [19] "cases.project.request_submission"                        
## [20] "cases.project.state"                                     
## [21] "cases.project.submission_enabled"                        
## [22] "cases.samples.days_to_sample_procurement"                
## [23] "cases.samples.method_of_sample_procurement"              
## [24] "cases.samples.portions.slides.number_proliferating_cells"
## [25] "cases.tissue_source_site.project"

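Interestingly, the project information is nested inside the case. We do not need to know that detail beyond having a few guesses about where the information sits in the file records. We do need to know the exact field (project_id), because the filter is built on it.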

files() %>% facet('cases.project.project_id') %>% aggregations()
## $cases.project.project_id
##            key doc_count
## 1        FM-AD     36134
## 2    TCGA-BRCA     31511
## 3    TCGA-LUAD     17051
## 4    TCGA-UCEC     16130
## 5    TCGA-HNSC     15266
## 6      TCGA-OV     15057
## 7    TCGA-THCA     14420
## 8    TCGA-LUSC     15323
## 9     TCGA-LGG     14723
## 10   TCGA-KIRC     15082
## 11   TCGA-PRAD     14287
## 12   TCGA-COAD     14270
## 13    TCGA-GBM     11973
## 14   TCGA-SKCM     12724
## 15   TCGA-STAD     12845
## 16   TCGA-BLCA     11710
## 17   TCGA-LIHC     10814
## 18   TCGA-CESC      8593
## 19   TCGA-KIRP      8506
## 20   TCGA-SARC      7493
## 21   TCGA-PAAD      5306
## 22   TCGA-ESCA      5270
## 23   TCGA-PCPG      5032
## 24   TCGA-READ      4918
## 25   TCGA-TGCT      4217
## 26   TCGA-THYM      3444
## 27   TCGA-LAML      3960
## 28  TARGET-NBL      2795
## 29    TCGA-ACC      2546
## 30   TCGA-KICH      2324
## 31   TCGA-MESO      2330
## 32  TARGET-AML      2170
## 33    TCGA-UVM      2179
## 34    TCGA-UCS      1658
## 35   TARGET-WT      1406
## 36   TCGA-DLBC      1330
## 37   TCGA-CHOL      1348
## 38   TARGET-OS        47
## 39   TARGET-RT       174
## 40 TARGET-CCSK        15

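This is exactly what we need, and it shows that the correct project id is TCGA-OV. Also note that the filter() used here is not the same as the one in the dplyr package.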

qfiles = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))
## List of 2
##  $ op     : 'scalar' chr "and"
##  $ content:List of 2
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "cases.project.project_id"
##   .. .. ..$ value: chr "TCGA-OV"
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "type"
##   .. .. ..$ value: chr "gene_expression"
qfiles %>% count()
## [1] 1137
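The translated filters shown above are plain nested lists, so the same structure can be written by hand, which makes clear what the formula interface generates. This is a sketch only. eq_filter and and_filter are hypothetical helpers, not part of GenomicDataCommons, and the hand-built list omits the 'scalar' class that the package attaches to op.

```r
# Hypothetical helpers that mirror the structure printed by str(get_filter(...))
eq_filter <- function(field, value) {
    list(op = "=", content = list(field = field, value = value))
}
and_filter <- function(...) {
    list(op = "and", content = list(...))
}

# Hand-built counterpart of
# ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression'
by_hand <- and_filter(
    eq_filter("cases.project.project_id", "TCGA-OV"),
    eq_filter("type", "gene_expression")
)
str(by_hand)
```

In practice filter() remains the more convenient route; the list form is mainly useful for debugging a query.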

Generating a manifest for the download then takes just one simple step:

manifest_df = qfiles %>% manifest()
head(manifest_df)
## # A tibble: 6 x 5
##   id                                   filename      md5        size state
##   <chr>                                <chr>         <chr>     <int> <chr>
## 1 567ced20-00cf-46f4-8bb8-a8553eba7f4b b2552f6f-dd1~ 9af0d99~ 258324 live 
## 2 05692746-1770-47bd-8faa-f864a37e0084 701b8c71-6c0~ 8e9816f~ 526537 live 
## 3 e2d47640-8565-4383-b2dd-0d3e2762e3da b2552f6f-dd1~ e05190e~ 543367 live 
## 4 bc6dab72-dc5a-4ca6-aef3-f7242fa8e329 a1c4f19e-079~ 110d8cd~ 253059 live 
## 5 0a176c20-f3f3-4bc9-bfe2-b2469e745025 01eac123-1e2~ b40921f~ 540592 live 
## 6 2ae73487-7acf-4282-85d8-927f2ab8f18b 12c8b289-b9d~ 4d3c2b9~ 549437 live

One issue remains. Looking at the file names, many of them contain "FPKM", "FPKM-UQ", or "counts", so one more filter is needed:

qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
                            type == 'gene_expression' &
                            analysis.workflow_type == 'HTSeq - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)
## [1] 379

All the files can now be downloaded with the GDC Data Transfer Tool (through transfer() in R or from the command line). See the bulk downloads section.
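For the command-line route, the manifest has to exist as a file. The sketch below writes a manifest-shaped data frame as tab-separated text. manifest_demo is a stand-in with placeholder ids, file names, and checksums; in a real session manifest_df from above would take its place. The column layout is assumed to be the one the transfer tool expects.

```r
# Stand-in manifest with the same columns as manifest_df.
# All ids, file names, and md5 values are placeholders.
manifest_demo <- data.frame(
    id = c("00000000-0000-0000-0000-000000000001",
           "00000000-0000-0000-0000-000000000002"),
    filename = c("example_1.htseq.counts.gz", "example_2.htseq.counts.gz"),
    md5 = c("d41d8cd98f00b204e9800998ecf8427e",
            "d41d8cd98f00b204e9800998ecf8427e"),
    size = c(258324L, 526537L),
    state = c("live", "live"),
    stringsAsFactors = FALSE
)

# Tab-separated, unquoted, without row names
manifest_path <- file.path(tempdir(), "gdc_manifest.txt")
write.table(manifest_demo, manifest_path,
            sep = "\t", quote = FALSE, row.names = FALSE)

# On the command line, the transfer tool would then be run as:
#   gdc-client download -m <path to gdc_manifest.txt>
```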

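Authentication

[ GDC authentication documentation ]

The GDC offers both controlled-access and open data. Controlled-access data can only be downloaded after authorization; see the GDC documentation on the process of obtaining access.

Once access to controlled data has been granted, the next step is to obtain a GDC authentication token file, which the package then uses for downloads.

The package uses the authentication token when downloading data (see the transfer and gdcdata documentation). It includes a helper function, gdc_token, that looks for the token in the following three places, in order:

  1. As a string stored in the environment variable GDC_TOKEN
  2. As a file whose path is stored in the environment variable GDC_TOKEN_FILE
  3. As the file .gdc_token in the user's home directory

Here is an example: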

token = gdc_token()
transfer(...,token=token)
# or
transfer(...,token=gdc_token())
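The lookup order described above can be sketched in base R as follows. This illustrates the three documented steps; it is not the package's implementation. find_gdc_token_sketch is a hypothetical name, and reading only the first line of the token file is an assumption.

```r
# Hypothetical re-implementation of the documented lookup order.
find_gdc_token_sketch <- function(home = path.expand("~")) {
    # 1. Token string stored directly in GDC_TOKEN
    token <- Sys.getenv("GDC_TOKEN", unset = "")
    if (nzchar(token)) return(token)

    # 2. File named by GDC_TOKEN_FILE, then
    # 3. the file .gdc_token in the home directory
    candidates <- c(Sys.getenv("GDC_TOKEN_FILE", unset = ""),
                    file.path(home, ".gdc_token"))
    for (path in candidates) {
        if (nzchar(path) && file.exists(path)) {
            # Assumption: the token is on the first line of the file
            return(trimws(readLines(path, n = 1, warn = FALSE)))
        }
    }
    stop("No GDC token found in GDC_TOKEN, GDC_TOKEN_FILE, or ~/.gdc_token")
}
```

Keeping the token in a file rather than in a script avoids committing it to version control by accident.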

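Data file access and download

Downloading data through the GDC API

The gdcdata function takes a character vector of one or more file ids. A simple way to produce such a vector is to build a manifest data frame and pass in its first column, which holds the file ids.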

fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)

Note that controlled-access data requires a token. The BiocParallel package is useful for downloading a large number of small files in parallel.
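A parallel download along those lines might look like the sketch below. download_in_parallel is a hypothetical wrapper. It assumes that BiocParallel and GenomicDataCommons are installed, that calling gdcdata() once per file id is acceptable, and that the default registered BiocParallel backend suits the machine.

```r
# Hypothetical wrapper: download each file id in a separate parallel task.
# Needs BiocParallel, GenomicDataCommons, and network access when called.
download_in_parallel <- function(file_ids,
                                 BPPARAM = BiocParallel::bpparam()) {
    paths <- BiocParallel::bplapply(
        file_ids,
        function(id) GenomicDataCommons::gdcdata(id, progress = FALSE),
        BPPARAM = BPPARAM
    )
    # One local path per downloaded file
    unlist(paths)
}
```

A call such as fnames = download_in_parallel(manifest_df$id[1:20]) would then stand in for the lapply() shown in the quick start.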

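Bulk downloads

The bulk download facility only pays off for relatively large files, so use it for BAM or VCF files. For anything else, the method described above is the better choice.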

fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')

Use cases

Cases

How many cases are there per project_id?

res = cases() %>% facet("project.project_id") %>% aggregations()
head(res)
## $project.project_id
##            key doc_count
## 1        FM-AD     18004
## 2   TARGET-NBL      1127
## 3    TCGA-BRCA      1098
## 4   TARGET-AML       988
## 5    TARGET-WT       652
## 6     TCGA-GBM       617
## 7      TCGA-OV       608
## 8    TCGA-LUAD       585
## 9    TCGA-UCEC       560
## 10   TCGA-KIRC       537
## 11   TCGA-HNSC       528
## 12    TCGA-LGG       516
## 13   TCGA-THCA       507
## 14   TCGA-LUSC       504
## 15   TCGA-PRAD       500
## 16   TCGA-SKCM       470
## 17   TCGA-COAD       461
## 18   TCGA-STAD       443
## 19   TCGA-BLCA       412
## 20   TARGET-OS       381
## 21   TCGA-LIHC       377
## 22   TCGA-CESC       307
## 23   TCGA-KIRP       291
## 24   TCGA-SARC       261
## 25   TCGA-LAML       200
## 26   TCGA-ESCA       185
## 27   TCGA-PAAD       185
## 28   TCGA-PCPG       179
## 29   TCGA-READ       172
## 30   TCGA-TGCT       150
## 31   TCGA-THYM       124
## 32   TCGA-KICH       113
## 33    TCGA-ACC        92
## 34   TCGA-MESO        87
## 35    TCGA-UVM        80
## 36   TARGET-RT        75
## 37   TCGA-DLBC        58
## 38    TCGA-UCS        57
## 39   TCGA-CHOL        51
## 40 TARGET-CCSK        13
library(ggplot2)
ggplot(res$project.project_id,aes(x = key, y = doc_count)) +
    geom_bar(stat='identity') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Figure: bar chart of case counts (doc_count) per project_id]

How many cases are included in all TARGET projects?

cases() %>% filter(~ project.program.name=='TARGET') %>% count()
## [1] 3236

How many cases are included in all TCGA projects?

cases() %>% filter(~ project.program.name=='TCGA') %>% count()
## [1] 11315

What sample types are there in TCGA-BRCA?

# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                              project.project_id=='TCGA-BRCA' ) %>%
    facet('samples.sample_type') %>% aggregations()
resp$samples.sample_type
##                    key doc_count
## 1        Primary Tumor      1098
## 2 Blood Derived Normal      1011
## 3  Solid Tissue Normal       162
## 4           Metastatic         7

Get all TCGA-BRCA cases that have a Solid Tissue Normal sample

# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                              samples.sample_type=='Solid Tissue Normal') %>%
    GenomicDataCommons::select(c(default_fields(cases()),'samples.sample_type')) %>%
    response_all()
count(resp)
## [1] 162
res = resp %>% results()
str(res[1],list.len=6)
## List of 1
##  $ updated_datetime: chr [1:162] "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" "2018-05-21T16:07:40.645885-05:00" ...
head(ids(resp))
## [1] "6fa2a667-9c36-4526-8a58-1975e863a806"
## [2] "ef4cbd38-bc79-4d60-a715-647edd2ebe9e"
## [3] "dd3bfb26-b534-4917-9c4d-9fe7b6477762"
## [4] "9ddc3e7b-8b54-4a83-8335-8053940f56c1"
## [5] "f130f376-5801-40f9-975d-a7e2f7b5670d"
## [6] "af577366-0258-49e7-b6af-e70056c081a4"

Files

How many files of each type are available?

res = files() %>% facet('type') %>% aggregations()
res$type
##                            key doc_count
## 1      simple_somatic_mutation     64015
## 2   annotated_somatic_mutation     63580
## 3                aligned_reads     45985
## 4          copy_number_segment     44752
## 5              gene_expression     34713
## 6                  slide_image     30036
## 7       biospecimen_supplement     25151
## 8             mirna_expression     22976
## 9          clinical_supplement     12496
## 10      methylation_beta_value     12359
## 11 aggregated_somatic_mutation       186
## 12     masked_somatic_mutation       132
ggplot(res$type,aes(x = key,y = doc_count)) + geom_bar(stat='identity') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

[Figure: bar chart of file counts (doc_count) per file type]

Find gene-level RNA-seq quantification files for GBM

q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    filter(~ cases.project.project_id=='TCGA-GBM' &
               data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations()
## $analysis.workflow_type
##               key doc_count
## 1  HTSeq - Counts       174
## 2    HTSeq - FPKM       174
## 3 HTSeq - FPKM-UQ       174
# so need to add another filter
file_ids = q %>% filter(~ cases.project.project_id=='TCGA-GBM' &
                            data_type=='Gene Expression Quantification' &
                            analysis.workflow_type == 'HTSeq - Counts') %>%
    GenomicDataCommons::select('file_id') %>%
    response_all() %>%
    ids()

Slicing

Get all BAM file ids from TCGA-GBM

q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    filter(~ cases.project.project_id == 'TCGA-GBM' &
               data_type == 'Aligned Reads' &
               experimental_strategy == 'RNA-Seq' &
               data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()
bamfile = slicing(file_ids[1],regions="chr12:6534405-6538375",token=gdc_token())
library(GenomicAlignments)
aligns = readGAlignments(bamfile)
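The regions argument is a string of the form chromosome:start-end. When the coordinates live in a table, the strings can be built with sprintf(), as in the sketch below. The first row repeats the region used above; the second row is a made-up example. Whether slicing() accepts several regions in one call is an assumption to check against its help page.

```r
# Coordinates for the regions of interest; the second row is illustrative only
regions_df <- data.frame(
    chrom = c("chr12", "chr12"),
    start = c(6534405L, 6540000L),
    end = c(6538375L, 6541000L),
    stringsAsFactors = FALSE
)

# Build "chrom:start-end" strings for the regions argument
region_strings <- sprintf("%s:%d-%d",
                          regions_df$chrom, regions_df$start, regions_df$end)
region_strings
```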

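Final thoughts

From what I have learned so far, this package has its own ecosystem, and some of its verbs behave differently from the tidyverse functions of the same name.

If the goal is only to download data, I would rather recommend another package I studied recently, TCGAbiolinks. This official package can then serve as a complement for some analysis and data conversion tasks.


  1. See the help page of each function for usage details.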
