昨天在復習Linux基礎知識,沒有繼續記錄,今天繼續
參考學習資料TCGA CNV全攻略
具體數據處理流程見NIH的TCGA官網:https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/
參考文獻:http://mcr.aacrjournals.org/content/12/4/485.long
先看處理流程:
Copy Number Variation Analysis Pipeline
簡介:
This pipeline is built onto the existing TCGA level 2 data generated by Birdsuite and uses the DNAcopy R-package to perform a circular binary segmentation (CBS) analysis [1]. The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). Diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values.
The GRCh38 probe-set was produced by mapping probe sequences to the GRCh38 reference genome and can be downloaded at the GDC Reference File Website.
Copy Number Estimation
Numeric focal-level Copy Number Variation (CNV) values were generated with "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 [2], [3] on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:
- Genes with focal CNV values smaller than -0.3 are categorized as a "loss" (-1)
- Genes with focal CNV values larger than 0.3 are categorized as a "gain" (+1)
- Genes with focal CNV values between and including -0.3 and 0.3 are categorized as "neutral" (0).
GISTIC2 Command Line Parameters
gistic2 \
-b <base_directory> \
-seg <segmentation_file> \
-mk <marker_file> \
-refgene <reference_gene_file> \
-ta 0.1 \
-armpeel 1 \
-brlen 0.7 \
-cap 1.5 \
-conf 0.99 \
-td 0.1 \
-genegistic 1 \
-gcm extreme \
-js 4 \
-maxseg 2000 \
-qvt 0.25 \
-rx 0 \
-savegene 1 \
(-broad 1)
[1] Olshen, Adam B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. "Circular binary segmentation for the analysis of array-based DNA copy number data." Biostatistics 5, no. 4 (2004): 557-572.
[2] Mermel, Craig H., Steven E. Schumacher, Barbara Hill, Matthew L. Meyerson, Rameen Beroukhim, and Gad Getz. "GISTIC2. 0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome biology 12, no. 4 (2011): R41.
[3] Beroukhim, Rameen, Craig H. Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina et al. "The landscape of somatic copy-number alteration across human cancers." Nature 463, no. 7283 (2010): 899.
根據曾老師的例子下載數據
wget -c -r -np -nH -k -L --cut-dirs 6 -p -A "*snp_6*hg19*Level_3*" http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/
如果要下載其它癌癥種類,只需要改變url里面的BRCA即可。如果要下載其它類型的數據,只需要改變-A 后面的匹配規則即可,其實就是打開上面url看到的幾十個文件的文件名的規律。
打開網站可以看到
$ cd Downloads/
$ wget -c -r -np -nH -k -L --cut-dirs 6 -p -A "*snp_6*hg19*Level_3*" http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/
evel_3.2016012800.0 28%[====> ] 5.19M 7.15KB/s eta 23m 51s
下載速度限制,估計需要下載半個小時,如果用加速插件估計會快點。反正是在探索,而且時間也不是太長,就繼續看教程
下載完成總共耗時50分鐘
FINISHED --2019-11-08 10:39:06--
Total wall clock time: 50m 13s
Downloaded: 17 files, 19M in 50m 0s (6.62 KB/s)
Converted links in 0 files in 0 seconds.
(test) Cheng-MacBook-Pro:Downloads chelsea$
ls查看下載內容如下
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
可以看到,下載的文件中有一個名稱里面包含Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.
,其中minus了germline的CNV的就是需要分析的somatic CNV.
#先安裝一下bedtools
(test) Cheng-MacBook-Pro:Downloads chelsea$ cd ~
(test) Cheng-MacBook-Pro:~ chelsea$ cd biosoft/
(test) Cheng-MacBook-Pro:biosoft chelsea$ conda install -y bedtools
...
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(test) Cheng-MacBook-Pro:biosoft chelsea$
拿到CNV做什么?
首先兩個segment文本文件已經可以直接載入IGV查看所有BRCA樣本的CNV情況,如下所示:
還可以進行CNV深度分析
注釋基因
既然是對基因組片段做基因注釋,那么首先就需要拿到基因的坐標信息咯,我是在gencode數據庫里面下載,然后解析成下面的bed格式的,如下:
head ~/reference/gtf/gencode/protein_coding.hg19.position
chr1 69091 70008 OR4F5
chr1 367640 368634 OR4F29
chr1 621096 622034 OR4F16
chr1 859308 879961 SAMD11
chr1 879584 894689 NOC2L
chr1 895967 901095 KLHL17
chr1 901877 911245 PLEKHN1
chr1 910584 917473 PERM1
chr1 934342 935552 HES4
chr1 936518 949921 ISG15
然后要把下載的CNV文本文件,轉為bed格式的,就是把列的順序調換一下:
head Features.bed
chr1 3218610 95674710 TCGA-3C-AAAU-10A-01D-A41E-01 53225 0.0055
chr1 95676511 95676518 TCGA-3C-AAAU-10A-01D-A41E-01 2 -1.6636
chr1 95680124 167057183 TCGA-3C-AAAU-10A-01D-A41E-01 24886 0.0053
chr1 167057495 167059336 TCGA-3C-AAAU-10A-01D-A41E-01 3 -1.0999
chr1 167059760 181602002 TCGA-3C-AAAU-10A-01D-A41E-01 9213 -8e-04
chr1 181603120 181609567 TCGA-3C-AAAU-10A-01D-A41E-01 6 -1.2009
chr1 181610685 201473647 TCGA-3C-AAAU-10A-01D-A41E-01 12002 0.0055
chr1 201474400 201474544 TCGA-3C-AAAU-10A-01D-A41E-01 2 -1.4235
chr1 201475220 247813706 TCGA-3C-AAAU-10A-01D-A41E-01 29781 -4e-04
避免重復造輪子,曾老師用其擅長的bedtools解決這個需求,命令如下:
bedtools intersect -a Features.bed -b ~/reference/gtf/gencode/protein_coding.hg19.position -wa -wb \
| bedtools groupby -i - -g 1-4 -c 10 -o collapse
注釋結果,可以看到,每個CNV片段都注釋到了對應的基因,有些特別大的片段,會被注釋到非常多的基因。
chr8 42584924 42783715 TCGA-5T-A9QA-01A-11D-A41E-01 CHRNB3,CHRNA6,THAP1,RNF170,HOOK3
chr8 42789728 42793594 TCGA-5T-A9QA-01A-11D-A41E-01 HOOK3
chr8 42797957 42933372 TCGA-5T-A9QA-01A-11D-A41E-01 RP11-598P20.5,FNTA,HOOK3
chr8 70952673 70964372 TCGA-5T-A9QA-01A-11D-A41E-01 PRDM14
chr10 42947970 43833200 TCGA-5T-A9QA-01A-11D-A41E-01 BMS1,RET,RASGEF1A,ZNF33B,CSGALNACT2
chr10 106384615 106473355 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 106478366 107298256 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 117457285 117457859 TCGA-5T-A9QA-01A-11D-A41E-01 ATRNL1
chr11 68990173 69277078 TCGA-5T-A9QA-01A-11D-A41E-01 MYEOV
chr11 76378708 76926535 TCGA-5T-A9QA-01A-11D-A41E-01 LRRC32,B3GNT6,OMP,TSKU,MYO7A,ACER3,CAPN5
找SOMATIC CNVS
仔細看上面IGV的pattern你會發現某些染色體的某些片段經常會擴增或者缺失,這個現象就是人們想研究是recurrent CNV regions,當然不會用肉眼看咯,這時候需要用GISTIC這個軟件。找到了recurrent CNV regions同樣是需要進行基因注釋,才能進行下游分析咯。
PICNIC軟件用法:http://www.bio-info-trainee.com/1299.html
GISTIC軟件用法:http://www.bio-info-trainee.com/1648.html
TCGA : http://www.biotrainee.com/thread-1696-1-1.html
從今天的教程中知道了CNV的分析還需要用到的軟件:IGV,bedtools,picnic,gistic,又有得學了。