Downloading the reference genome
Three major websites provide reference genome downloads:
1.NCBI (https://www.ncbi.nlm.nih.gov/grc)
2.UCSC (http://hgdownload.soe.ucsc.edu/downloads.html)
3.Ensembl (http://asia.ensembl.org/index.html?redirect=no)
The most commonly used human reference genome versions correspond to each other as follows (summary by Jimmy):
| NCBI | UCSC | Ensembl |
| NCBI36 | hg18 | Ensembl release_52 |
| GRCh37 | hg19 | Ensembl release_59/61/64/68/69/75 |
| GRCh38 | hg38 | Ensembl release_76/77/78/80/81/82 |
Here I downloaded the UCSC version of the human genome (hg19):
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz # download the UCSC version of hg19
$ tar -zxvf chromFa.tar.gz # extract the archive
chr1.fa
chr10.fa
chr11.fa
chr11_gl000202_random.fa
chr12.fa
chr13.fa
[...]
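The folder should now contain all the chromosome files; each chromosome sequence is a separate file with the .fa suffix.
$ cat chr*.fa > hg19.fa
# Concatenate all the per-chromosome files into a single file; the final file is about 3.2G
$ rm chr*.fa # delete the individual chromosome files to save space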
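A quick sanity check after concatenating is to count the sequence headers in the combined file, since every FASTA record starts with a `>` line. The sketch below runs on a throwaway toy file with made-up sequences; the function name `count_fasta_records` is only illustrative.

```shell
# Count the sequence records (header lines starting with '>') in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Toy demonstration on a temporary file; the sequences are invented.
demo_dir=$(mktemp -d)
printf '>chr1\nACGT\n>chr2\nGGCC\nTTAA\n' > "$demo_dir/toy.fa"
count_fasta_records "$demo_dir/toy.fa"   # prints 2
rm -r "$demo_dir"
```

Run on hg19.fa, the count should equal the number of .fa files that were concatenated.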
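Downloading the annotation files (GTF/GFF)
Put simply, an annotation file is the instruction manual for the genome: it tells us which sequences are protein-coding genes, which are non-coding genes, and where the exons, introns, UTRs and so on are located. All three websites above provide genome annotation files.
At present the most authoritative annotation for the human and mouse genomes is still the GENCODE database (see: https://zhuanlan.zhihu.com/p/28126314). Note that annotation files usually come in GTF or GFF3 format.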
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gff3.gz
$ gzip -d gencode.v31lift37.annotation.gff3.gz # decompress
$ gzip -d gencode.v31lift37.annotation.gtf.gz # decompress
Because the data from this paper looked good in the FastQC quality check, I did not trim the reads.
For the record, here is what GTF and GFF3 files contain (http://www.lxweimin.com/p/3e545b9a3c68); the two look very similar:
GTF (Gene Transfer Format) is essentially GFF2. It is tab-delimited, with the following columns:
- seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
- source - name of the program that generated this feature, or the data source (database or project name)
- feature - feature type name, e.g. Gene, Variation, Similarity (that is, whether the record is a gene, an exon, or something else)
- start - Start position of the feature on the chromosome, with sequence numbering starting at 1.
- end - End position of the feature on the chromosome, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
The GFF3 (General Feature Format) columns are as follows:
- seqid - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly.
- source - name of the program that generated this feature, or the data source (database or project name)
- type - type of feature. Must be a term or accession from the SOFA sequence ontology
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
- attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent - see the GFF documentation for more details.
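Because both formats are plain tab-delimited text with nine columns, they are easy to inspect with awk. The sketch below prints a few columns and counts records per feature type; it runs on a toy two-record GTF whose coordinates and IDs are invented, and the function names are only illustrative.

```shell
# Print chromosome, feature type, start, end and strand of each record,
# skipping comment lines that start with '#'.
gtf_summary() {
    awk -F '\t' '!/^#/ { print $1, $3, $4, $5, $7 }' "$1"
}

# Count how many records there are of each feature type (column 3).
gtf_feature_counts() {
    awk -F '\t' '!/^#/ { n[$3]++ } END { for (f in n) print f, n[f] }' "$1" | sort
}

# Toy GTF; the coordinates and IDs are made up for illustration.
toy_gtf=$(mktemp)
printf '##description: toy example\n' > "$toy_gtf"
printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "TOY1";\n' >> "$toy_gtf"
printf 'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "TOY1";\n' >> "$toy_gtf"
gtf_summary "$toy_gtf"
gtf_feature_counts "$toy_gtf"
rm "$toy_gtf"
```

The same two functions should work on the decompressed GENCODE GTF, where the feature counts give a quick overview of how many genes, transcripts and exons the annotation holds.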
Downloading the hg19 index
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz # download the index; wget prints the progress log below
--2019-08-31 21:48:51-- ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
=> ‘hg19.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)... 128.220.233.225
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)|128.220.233.225|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE hg19.tar.gz ... 4181115011
==> PASV ... done. ==> RETR hg19.tar.gz ... done.
Length: 4181115011 (3.9G) (unauthoritative)
hg19.tar.gz 100%[============================================>] 3.89G 1.27MB/s in 58m 27s
2019-08-31 22:47:18 (1.14 MB/s) - ‘hg19.tar.gz’ saved [4181115011]
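You can also build the index yourself, but reportedly building a human genome index on a laptop without a server takes a very long time, so I downloaded the prebuilt index from the official site instead. Downloading takes a while too: it took me about an hour on a slow connection.
Why do the later alignment steps need an index? (Answer from: http://www.lxweimin.com/p/479c7b576e6f) The first problem in high-throughput sequencing is how to align thousands, or even hundreds of millions, of reads to the reference genome within a reasonable time while keeping the error rate acceptable. To speed up alignment, the reference genome sequence is transformed into an index with the BWT (Burrows-Wheeler transform) algorithm, and the sequences we align are in effect a subset of that index. Transcriptome alignment also has to account for alternative splicing, which makes it more complicated. So instead of mapping reads directly back onto the genome, we compare the reads against the index.
$ tar -zxvf hg19.tar.gz # extract; the extracted files include 8 genome.*.ht2 files and a shell script
$ rm hg19.tar.gz # delete the archive to save space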
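To make the BWT idea mentioned above a little more concrete, here is a toy Burrows-Wheeler transform in shell. It is only a sketch of the textbook definition (sort all rotations of the text, take the last column) and is not how HISAT2 actually builds its index; the function name `bwt` and the `$` end marker are conventions chosen for this example.

```shell
# Toy Burrows-Wheeler transform: append '$' as an end marker, list every
# rotation of the string, sort the rotations bytewise, and output the last
# character of each sorted rotation.
bwt() {
    awk -v s="$1\$" 'BEGIN {
        n = length(s)
        for (i = 1; i <= n; i++) print substr(s, i) substr(s, 1, i - 1)
    }' | LC_ALL=C sort | awk '{ printf "%s", substr($0, length($0)) } END { print "" }'
}

bwt banana   # prints annb$aa
```

The transformed string groups identical characters together, which is what makes the compressed, searchable FM index possible.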
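Alignment
Earlier I also practiced aligning with Bowtie2 by following a video. Bowtie2 and HISAT2 are both aligners, so how do they differ? I asked Dean Wang, who said Bowtie2 is generally used for ChIP-seq and HISAT2 for RNA-seq. I also looked up how others describe alignment software; one article puts it this way (http://www.lxweimin.com/p/681e02e7f9af):
How does aligning RNA-Seq data differ from aligning DNA-Seq data?
RNA-Seq analysis comes in many flavours, such as finding differentially expressed genes or discovering new alternative splicing events. If finding differentially expressed genes only requires read counts, you can use aligners such as bowtie or bwa, or alignment-free tools such as salmon, and the latter are faster. But if you need to find new isoforms or alternative splicing of RNA, or to look at differential exon usage, you need tools such as TopHat, HISAT2 or STAR that can find splice sites. This is because RNA-Seq differs from DNA-Seq: introns are removed when DNA is transcribed into mRNA. So when cDNA reverse-transcribed from mRNA fails to align to the reference sequence, the read is split and aligned again to determine whether an intron lies in between.
Another article puts it this way (https://zhuanlan.zhihu.com/p/26506787):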
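Alignment of transcriptome sequencing data usually takes one of two forms, genome alignment or transcriptome alignment. As the names suggest, genome alignment maps reads to the complete genome sequence, while transcriptome alignment maps reads to all known transcript sequences. Unless you are in a hurry or only want the expression levels of known transcripts, I would recommend genome alignment, for the following reasons:
① Transcriptome alignment needs accurate sequences of known transcripts; reads from unknown transcripts (for example lncRNAs not yet in any database) or from inaccurate sequences cannot be aligned correctly;
② Similarly, transcriptome alignment cannot be used to analyse alternative splicing of transcripts, because splice sites missing from the database are simply discarded;
③ Because one gene has several transcripts, many reads align perfectly to more than one transcript at once; their alignment scores are then low, and they may be dropped by the downstream expression quantification software, which affects later analysis (some software has solved this problem);
④ Because the reference sequence differs from the one used for DNA sequencing, integrated analysis of RNA and DNA data becomes harder.
Genome alignment resolves all of the problems above.
It is also worth noting that RNA sequencing data cannot simply be aligned with the aligners commonly used for DNA sequencing, such as BWA and Bowtie. Because eukaryotes have introns, the sequenced reads are not fully contiguous with the genome sequence, so software designed specifically for RNA sequencing, such as TopHat/HISAT/STAR, is needed.
HISAT2 supersedes the Bowtie/TopHat programs and aligns RNA-Seq reads to the genome quickly. HISAT uses a large number of FM indexes that together cover the whole genome. The index exists mainly to serve sequence alignment: because a species' genome sequence is long, comparing sequencing reads against the whole genome directly would be very time-consuming, so reads are instead compared against the index files of the reference genome, which saves a great deal of time. For the human genome, this takes 48,000 indexes, each representing a genomic region of ~64,000 bp. These small indexes, combined with several alignment strategies, enable efficient alignment of RNA-Seq reads, especially reads that span multiple exons. Despite the large number of indexes, HISAT needs only 4.3 GB of memory. The program supports genomes of any size, including those larger than 4 billion bases. (http://www.biotrainee.com/thread-2073-1-1.html)
So this time I tried HISAT2 for the alignment. I will not cover installing HISAT2; you can look up how to download it online, then put the program on your PATH.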
$ hisat2 -h # first look at the hisat2 usage; this prints the help text explaining each option
HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
# -x is required; for single-end data -U is required; -S is optional (the default is stdout), but you will normally want to set it
<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
[...]
Main options:
-x <hisat2-idx>: path prefix of the reference genome index files.
-1 <m1>: the first file of paired-end reads. For multiple datasets, separate the files with commas. Read lengths may differ.
-2 <m2>: the second file of paired-end reads. For multiple datasets, separate the files with commas, in the same order as the files given to -1. Read lengths may differ.
-U <r>: single-end read files. For multiple datasets, separate the files with commas. Can be used together with -1 and -2. Read lengths may differ.
-S <sam>: path of the output SAM file.
Because there are 4 fastq files to align, I wrote a small script that aligns them one by one:
#!/bin/bash
for ((i=77;i<=80;i++))
do hisat2 -t -x /media/yanfang/FYWD/RNA_seq/ref_genome/index/hg19/genome -U /media/yanfang/FYWD/RNA_seq/fastq_files/SRR9576${i}_1.fastq.gz -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam
done
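-t makes hisat2 report the time taken by each alignment run. -x is followed by the path of the index I downloaded. -U is followed by the path of each fastq file; i is a variable that selects the fastq file name. Because the four files run from SRR957677 to SRR957680, i starts at 77 and increases by 1 on each iteration.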
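Before starting a run that takes an hour, it can help to print the commands the loop would execute and check the expanded file names. The following is a dry-run sketch only: it echoes the hisat2 command lines instead of running them, and the relative paths in the demonstration call are hypothetical placeholders rather than the paths used in this post.

```shell
# Dry run: print (rather than execute) one hisat2 command per sample.
# $1 = index prefix, $2 = fastq directory, $3 = SAM output directory,
# $4 and $5 = first and last value of the two-digit accession suffix.
print_hisat2_commands() {
    i=$4
    while [ "$i" -le "$5" ]; do
        sample="SRR9576${i}"
        echo "hisat2 -t -x $1 -U $2/${sample}_1.fastq.gz -S $3/${sample}.sam"
        i=$((i + 1))
    done
}

# Hypothetical relative paths, used only for illustration.
print_hisat2_commands index/hg19/genome fastq_files sam_files 77 80
```

Only the options already used in the script above (-t, -x, -U, -S) appear in the printed commands.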
Then it is a matter of waiting; how long depends on your laptop. Mine has 8G of RAM and an i7 7500.
$ ./hisat2_map.sh # run the hisat2 alignment script
Time loading forward index: 00:00:40
Time loading reference: 00:00:07
Multiseed full-index search: 00:12:04
20803937 reads; of these:
20803937 (100.00%) were unpaired; of these:
1198535 (5.76%) aligned 0 times
17146459 (82.42%) aligned exactly 1 time
2458943 (11.82%) aligned >1 times
94.24% overall alignment rate
Time searching: 00:12:11
Overall time: 00:12:51
Time loading forward index: 00:00:47
Time loading reference: 00:00:07
Multiseed full-index search: 00:05:04
8828013 reads; of these:
8828013 (100.00%) were unpaired; of these:
572582 (6.49%) aligned 0 times
7275873 (82.42%) aligned exactly 1 time
979558 (11.10%) aligned >1 times
93.51% overall alignment rate
Time searching: 00:05:12
Overall time: 00:05:59
Time loading forward index: 00:00:43
Time loading reference: 00:00:06
Multiseed full-index search: 00:11:37
19909740 reads; of these:
19909740 (100.00%) were unpaired; of these:
1256224 (6.31%) aligned 0 times
16065546 (80.69%) aligned exactly 1 time
2587970 (13.00%) aligned >1 times
93.69% overall alignment rate
Time searching: 00:11:43
Overall time: 00:12:26
Time loading forward index: 00:00:43
Time loading reference: 00:00:07
Multiseed full-index search: 00:13:38
24231941 reads; of these:
24231941 (100.00%) were unpaired; of these:
1348062 (5.56%) aligned 0 times
20030375 (82.66%) aligned exactly 1 time
2853504 (11.78%) aligned >1 times
94.44% overall alignment rate
Time searching: 00:13:45
Overall time: 00:14:28
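The most important metric is the alignment rate against the genome or the transcriptome. For the human genome the expected alignment rate is 70-90%; reads that align to more than one place within a limited set of sequence regions are called multi-mapping reads. The alignment rate against the transcriptome is lower, because reads from unannotated transcripts are filtered out, and the number of multi-mapping reads is higher, because different isoforms of the same gene share exons.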
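The overall alignment rate in each hisat2 summary can be recomputed from the three read counts it reports: reads aligned exactly once plus reads aligned more than once, divided by the total. The sketch below redoes that arithmetic for the first sample; the function name is only illustrative.

```shell
# Percentage of reads that aligned at least once:
# (aligned exactly once + aligned more than once) / total reads * 100
overall_alignment_rate() {
    awk -v total="$1" -v once="$2" -v multi="$3" \
        'BEGIN { printf "%.2f\n", (once + multi) / total * 100 }'
}

# Numbers taken from the first summary in the hisat2 output above.
overall_alignment_rate 20803937 17146459 2458943   # prints 94.24
```

The result matches the "94.24% overall alignment rate" line printed by hisat2 for that sample.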
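Converting SAM files to BAM files
SAM (Sequence Alignment/Map) is currently the standard format for storing alignment data in high-throughput sequencing. BAM is the binary form of SAM and exists to reduce the storage that SAM files take up. Why convert? Because the binary form is easier for the computer to process. Tool: SAMtools.
The main SAMtools functions are:
view: BAM-SAM/SAM-BAM conversion and extraction of a subset of alignments
sort: sort alignments; by default they are sorted by chromosome position, -n sorts by read name, -t sorts by a TAG, and -o names the output file.
merge: merge multiple sorted alignment files
index: index a sorted alignment file
faidx: build a FASTA index and extract subsequences
tview: view alignments in a text-based viewer
pileup: produce position-based output and consensus/indel calls (mpileup in current SAMtools versions)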
#!/bin/bash
# A small script that puts the three steps in one for loop, which processes each SAM file in turn
for i in `seq 77 80`
do
samtools view -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam -b > /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam
# Step 1: convert the aligned SAM file to BAM. -S marks the input as SAM (recent SAMtools versions detect the format automatically); -b requests BAM output, which is redirected into the .bam file
samtools sort /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
# Step 2: sort each BAM file by chromosome position (the default order). -o names the output file
samtools index /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
# Step 3: index each sorted file; the index files get the .bai suffix
done
You can first inspect how the reads aligned in the resulting BAM files with samtools flagstat, for example:
$ samtools flagstat SRR957677_sorted.bam
25783414 + 0 in total (QC-passed reads + QC-failed reads)
4979477 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24584879 + 0 mapped (95.35% : N/A)
0 + 0 paired in sequencing # 0 because this is single-end sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
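The flagstat percentage (95.35%) differs from the hisat2 rate (94.24%) because flagstat counts alignment records rather than reads: each read has one primary record, and multi-mapped reads add secondary records on top. The arithmetic can be checked as follows, using only numbers that appear in the outputs above; the function name is illustrative.

```shell
# Reconcile flagstat with the hisat2 summary.
# $1 = input reads, $2 = secondary records, $3 = unaligned reads
flagstat_mapped_pct() {
    awk -v reads="$1" -v secondary="$2" -v unaligned="$3" 'BEGIN {
        total = reads + secondary      # all records in the BAM file
        mapped = total - unaligned     # records that are aligned
        printf "%d %d %.2f\n", total, mapped, mapped / total * 100
    }'
}

flagstat_mapped_pct 20803937 4979477 1198535   # prints 25783414 24584879 95.35
```

The three printed values correspond to the "in total", "mapped" and percentage fields of the flagstat output for SRR957677.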
Alignment QC
Software for QC of BAM files includes:
RSeQC (a package that depends on Python 2.7; create a new environment with conda)
Qualimap: a comprehensive QC tool for next-generation sequencing data
Picard: a comprehensive QC toolkit.
Here I used RSeQC.
Install RSeQC first:
$ sudo -H pip install RSeQC
$ bam_stat.py -i SRR957677_sorted.bam
Load BAM file ... Done
#==================================================
#All numbers are READ count
#==================================================
Total records: 25783414
QC failed: 0
Optical/PCR duplicate: 0
Non primary hits 4979477
Unmapped reads: 1198535
mapq < mapq_cut (non-unique): 2458943
mapq >= mapq_cut (unique): 17146459
Read-1: 0
Read-2: 0
Reads map to '+': 8670460
Reads map to '-': 8475999
Non-splice reads: 14136525
Splice reads: 3009934
Reads mapped in proper pairs: 0
Proper-paired reads map to different chrom:0
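Counting with htseq-count
Read counting and quantification can be done at three levels: the gene level, the transcript level, and the exon level.
At the gene level, commonly used tools include HTSeq-count, featureCounts, BEDTools, Qualimap, Rsubread, GenomicRanges and others. Taking the widely used HTSeq-count as an example, the problem these tools solve is deciding, from the overlap between a read and the gene positions, which gene the read belongs to. Note that tools treat multi-mapping reads differently: HTSeq-count simply ignores them, whereas Qualimap splits them evenly, one share per location.
At the transcript level, the usual tools are Cufflinks and its successor StringTie, and eXpress. The difficulty these tools face is that transcript isoforms usually overlap: when second-generation read lengths are shorter than the transcripts, how do you tell the isoforms apart? Most of these tools use expectation maximization (EM). Fortunately we now have third-generation sequencing. The tools above are all alignment-based; many alignment-free tools, such as kallisto, Sailfish and salmon, skip the alignment step and produce read counts directly, which makes them more efficient. However, a recent paper [1] points out that these methods show sample-specific and read-length biases when estimating abundance.
At the exon-usage level, counting works much as it does at the gene level. To count well, though, we need to provide a GTF file of non-overlapping exonic regions [2]. DEXSeq, which is used to analyse differential exon usage, provides a Python script (dexseq_prepare_annotation.py) for this task. (The above is from: http://www.lxweimin.com/p/6d4cba26bb60)
htseq-count, custom setup:
1.Data preparation:
htseq-count expects its input sorted by read name by default, whereas the earlier steps sorted the reads by chromosome position, so the files are sorted again here (the help text below notes that the order is ignored for single-end data, so for this dataset the step is a precaution):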
#!/bin/bash
for ((i=77;i<=80;i++))
do
samtools sort -n /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_nsorted.bam
done
I skip the installation of htseq; first look at how htseq-count is used:
$ htseq-count --h
usage: htseq-count [options] alignment_file gff_file
This script takes one or more alignment files in SAM/BAM format and a feature
file in GFF format and calculates for each feature the number of reads mapping
to it. See http://htseq.readthedocs.io/en/master/count.html for details.
positional arguments:
samfilenames Path to the SAM/BAM files containing the mapped reads.
If '-' is selected, read from standard input
featuresfilename Path to the file containing the features
optional arguments:
-h, --help show this help message and exit
-f {sam,bam}, --format {sam,bam}
#specifies the input file format; the default is SAM
type of <alignment_file> data, either 'sam' or 'bam'
(default: sam)
-r {pos,name}, --order {pos,name}
#sort the data by read name or by position with samtools sort; the default is name
'pos' or 'name'. Sorting order of <alignment_file>
(default: name). Paired-end sequencing data must be
sorted either by position or by read name, and the
sorting order must be specified. Ignored for single-
end data.
--max-reads-in-buffer MAX_BUFFER_SIZE
When <alignment_file> is paired end sorted by
position, allow only so many reads to stay in memory
until the mates are found (raising this number will
use more memory). Has no effect for single end or
paired end sorted by name
-s {yes,no,reverse}, --stranded {yes,no,reverse}
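#whether the data come from a strand-specific assay.
#DNA is double-stranded, so the strand of origin has to be determined.
#With no, each read is compared against both the sense and the antisense strand.
#With the default yes, for paired-end data the first read is expected on the same strand as the feature and the second read on the opposite strand.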
whether the data is from a strand-specific assay.
Specify 'yes', 'no', or 'reverse' (default: yes).
'reverse' means 'yes' with reversed strand
interpretation
-a MINAQUAL, --minaqual MINAQUAL
#minimum alignment quality; reads below the threshold are dropped
skip all reads with alignment quality lower than the
given minimum value (default: 10)
-t FEATURETYPE, --type FEATURETYPE
feature type (3rd column in GFF file) to be used, all
features of other type are ignored (default, suitable
for Ensembl GTF files: exon)
-i IDATTR, --idattr IDATTR
GFF attribute to be used as feature ID (default,
suitable for Ensembl GTF files: gene_id)
--additional-attr ADDITIONAL_ATTR
Additional feature attributes (default: none, suitable
for Ensembl GTF files: gene_name). Use multiple times
for each different attribute
-m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
mode to handle reads overlapping more than one feature
(choices: union, intersection-strict, intersection-
nonempty; default: union)
--nonunique {none,all}
Whether to score reads that are not uniquely aligned
or ambiguously assigned to features
--secondary-alignments {score,ignore}
Whether to score secondary alignments (0x100 flag)
--supplementary-alignments {score,ignore}
Whether to score supplementary alignments (0x800 flag)
-o SAMOUTS, --samout SAMOUTS
write out all SAM alignment records into SAM files
(one per input file needed), annotating each line with
its feature assignment (as an optional field with tag
'XF')
-q, --quiet suppress progress report
2.Write a script:
#!/bin/bash
for i in `seq 77 80`
do
htseq-count -r name -f bam /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_nsorted.bam /media/yanfang/FYWD/RNA_seq/ref_genome/gencode.v31lift37.annotation.gtf > /media/yanfang/FYWD/RNA_seq/matrix/SRR9576${i}.count
done
The 4 samples took about two to three hours in total.
yanfang@YF-Lenovo:/media/yanfang/FYWD/RNA_seq/matrix$ wc -l *.count
#check how many lines each count file has
62297 SRR957677.count
62297 SRR957678.count
62297 SRR957679.count
62297 SRR957680.count
249188 total
Look at the format of each count file (first 4 lines): the first column is the Ensembl gene ID and the second is the read count.
$ head -n 4 SRR9576*.count
==> SRR957677.count <==
ENSG00000000003.14_2 807
ENSG00000000005.6_3 0
ENSG00000000419.12_4 389
ENSG00000000457.14_4 288
==> SRR957678.count <==
ENSG00000000003.14_2 357
ENSG00000000005.6_3 0
ENSG00000000419.12_4 174
ENSG00000000457.14_4 108
==> SRR957679.count <==
ENSG00000000003.14_2 800
ENSG00000000005.6_3 0
ENSG00000000419.12_4 405
ENSG00000000457.14_4 218
==> SRR957680.count <==
ENSG00000000003.14_2 963
ENSG00000000005.6_3 1
ENSG00000000419.12_4 509
ENSG00000000457.14_4 283
Now the last 4 lines:
$ tail -n 4 SRR9576*.count
==> SRR957677.count <==
__ambiguous 341518
__too_low_aQual 0
__not_aligned 1198535
__alignment_not_unique 2458943
==> SRR957678.count <==
__ambiguous 138861
__too_low_aQual 0
__not_aligned 572582
__alignment_not_unique 979558
==> SRR957679.count <==
__ambiguous 360081
__too_low_aQual 0
__not_aligned 1256224
__alignment_not_unique 2587970
==> SRR957680.count <==
__ambiguous 411012
__too_low_aQual 0
__not_aligned 1348062
__alignment_not_unique 2853504
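Since the summary rows all start with two underscores, a short awk pass can report how many reads were assigned to genes versus how many ended up in the special counters. This is a sketch on a toy count file with invented numbers; the function name is illustrative.

```shell
# Sum the counts of gene rows and of the "__" summary rows separately.
count_summary() {
    awk -F '\t' '
        /^__/ { special += $2; next }   # htseq-count summary counters
        { assigned += $2 }              # ordinary gene rows
        END { printf "assigned=%d special=%d\n", assigned, special }
    ' "$1"
}

# Toy count file; the gene names and counts are made up.
toy_count=$(mktemp)
printf 'geneA\t5\ngeneB\t3\n__no_feature\t10\n__ambiguous\t2\n' > "$toy_count"
count_summary "$toy_count"   # prints assigned=8 special=12
rm "$toy_count"
```

Applied to a real .count file, a small assigned total relative to the special counters is a hint to revisit the counting options.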
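Merging the expression matrices and annotating them
The previous step produced 4 separate count files. These now need to be merged into a single matrix file with genes as rows, samples as columns, and counts as the values.
First start RStudio and run R. Load the data and give the columns names: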
> options(stringsAsFactors = FALSE)
> control1<-read.table("SRR957677.count",sep= "\t",col.names = c("gene_id","control1"))
> head(control1) # look at the first few rows
gene_id control1
1 ENSG00000000003.14_2 807
2 ENSG00000000005.6_3 0
3 ENSG00000000419.12_4 389
4 ENSG00000000457.14_4 288
5 ENSG00000000460.17_6 505
6 ENSG00000000938.13_4 0
> control2<-read.table("SRR957678.count",sep= "\t",col.names = c("gene_id","control2"))
> treat1<-read.table("SRR957679.count",sep= "\t",col.names = c("gene_id","treat1"))
> treat2<-read.table("SRR957680.count",sep= "\t",col.names = c("gene_id","treat2"))
> tail(control2) # look at the last few rows
gene_id control2
62292 ENSG00000288111.1_1 0
62293 __no_feature 3856774
62294 __ambiguous 138861
62295 __too_low_aQual 0
62296 __not_aligned 572582
62297 __alignment_not_unique 979558
> tail(treat2)
gene_id treat2
62292 ENSG00000288111.1_1 0
62293 __no_feature 10430059
62294 __ambiguous 411012
62295 __too_low_aQual 0
62296 __not_aligned 1348062
62297 __alignment_not_unique 2853504
# merge the matrices
> raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
> head(raw_count) # after merging, the row order has changed
gene_id control1 control2 treat1 treat2
1 __alignment_not_unique 2458943 979558 2587970 2853504
2 __ambiguous 341518 138861 360081 411012
3 __no_feature 9096888 3856774 8247195 10430059
4 __not_aligned 1198535 572582 1256224 1348062
5 __too_low_aQual 0 0 0 0
6 ENSG00000000003.14_2 807 357 800 963
> tail(raw_count)
gene_id control1 control2 treat1 treat2
62292 ENSG00000288106.1_1 0 3 1 2
62293 ENSG00000288107.1_1 2 0 0 0
62294 ENSG00000288108.1_1 0 0 0 0
62295 ENSG00000288109.1_1 0 0 1 0
62296 ENSG00000288110.1_1 0 0 0 0
62297 ENSG00000288111.1_1 0 0 0 0
>
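Brief notes on the code above:
(1) With stringsAsFactors = FALSE, character strings are not converted to factors: columns that look numeric are still read as numbers, and character columns stay character.
The read.table function is the most convenient way to read rectangular, grid-like data.
(2) The usage of read.table is:
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), ...)
file is the file name. header indicates whether the first line of the file holds the variable names; if it is not given, the value is determined from the file format: header is set to TRUE if and only if the first row has one fewer field than the number of columns. sep is the field separator, set here to the tab character. quote specifies the characters that enclose character data; to disable quoting, set quote="".
> raw_count_filt <- raw_count[-1:-5,] # drop the first 5 rows
> head(raw_count_filt) # look at the first rows of the filtered matrix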
gene_id control1 control2 treat1 treat2
6 ENSG00000000003.14_2 807 357 800 963
7 ENSG00000000005.6_3 0 0 0 1
8 ENSG00000000419.12_4 389 174 405 509
9 ENSG00000000457.14_4 288 108 218 283
10 ENSG00000000460.17_6 505 208 451 543
11 ENSG00000000938.13_4 0 0 0 0
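The merge and the removal of the summary rows can also be sketched in shell, as an alternative to the R steps above. Filtering on the leading `__` removes the summary rows by pattern rather than by position, so it does not depend on where the merge happens to sort them. The function names and the toy inputs (with invented counts) are assumptions of this sketch.

```shell
# Merge htseq-count outputs (gene_id <TAB> count) into one tab-separated
# matrix, one column per input file, keeping the gene order of the first file.
merge_counts() {
    awk -F '\t' '
        FNR == 1 { nfile++ }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            cnt[$1, nfile] = $2
        }
        END {
            for (k = 1; k <= n; k++) {
                line = order[k]
                for (f = 1; f <= nfile; f++) line = line "\t" cnt[order[k], f]
                print line
            }
        }
    ' "$@"
}

# Drop the htseq-count summary rows, which all start with two underscores.
drop_summary_rows() {
    grep -v '^__'
}

# Toy inputs with invented counts.
d=$(mktemp -d)
printf 'geneA\t5\ngeneB\t0\n__no_feature\t9\n' > "$d/s1.count"
printf 'geneA\t7\ngeneB\t2\n__no_feature\t4\n' > "$d/s2.count"
merge_counts "$d/s1.count" "$d/s2.count" | drop_summary_rows
rm -r "$d"
```

The output has one row per gene and one count column per input file, in the order the files were given.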
The gene_id values in the matrix above carry a version suffix after a dot. However, an ID such as ENSMUSG00000024045.5 cannot be found by searching the EBI database directly; only the unversioned form, ENSMUSG00000024045, works. So the IDs need to be converted to the unversioned form.
> ENSEMBL <- gsub("\\.\\d*\\_\\d*", "", raw_count_filt$gene_id) # strip everything from the dot onwards in the gene_id column
# Another article uses this code instead: ENSEMBL <- gsub("(.*?)\\.\\d*?_\\d", "\\1", raw_count_filt$gene_id)
> row.names(raw_count_filt) <- ENSEMBL # use ENSEMBL as the row names of raw_count_filt
> raw_count_filt1 <- cbind(ENSEMBL,raw_count_filt) # bind ENSEMBL to raw_count_filt as the first column
> colnames(raw_count_filt1) <- c("ensembl_gene_id","gene_id","control1","control2","treat1","treat2")
# name the columns of raw_count_filt1
> head(raw_count_filt1)
ensembl_gene_id gene_id control1 control2 treat1 treat2
ENSG00000000003 ENSG00000000003 ENSG00000000003.14_2 807 357 800 963
ENSG00000000005 ENSG00000000005 ENSG00000000005.6_3 0 0 0 1
ENSG00000000419 ENSG00000000419 ENSG00000000419.12_4 389 174 405 509
ENSG00000000457 ENSG00000000457 ENSG00000000457.14_4 288 108 218 283
ENSG00000000460 ENSG00000000460 ENSG00000000460.17_6 505 208 451 543
ENSG00000000938 ENSG00000000938 ENSG00000000938.13_4 0 0 0 0
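The version-stripping step can be tried out on single IDs in the shell. The sketch below uses a sed pattern that removes a dot followed by digits and, if present, an underscore with more digits at the end of the ID; it is an illustration of the idea, not the exact regular expression used in the R code above, and the function name is hypothetical.

```shell
# Remove the version suffix (".14_2", ".5", ...) from an Ensembl gene ID.
strip_version() {
    printf '%s\n' "$1" | sed -E 's/\.[0-9]+(_[0-9]+)?$//'
}

strip_version ENSG00000000003.14_2   # prints ENSG00000000003
strip_version ENSMUSG00000024045.5   # prints ENSMUSG00000024045
```

Making the underscore part optional lets the same pattern handle both the lifted GRCh37 IDs used here and plain versioned IDs.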
Annotating the genes: getting gene symbols
> library(biomaRt)
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> my_ensembl_gene_id <- row.names(raw_count_filt1)
> options(timeout = 4000000) # raise the connection timeout
> hg_symbols<- getBM(attributes=c('ensembl_gene_id','hgnc_symbol',"chromosome_name", "start_position","end_position", "band"), filters= 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)
> head(hg_symbols)
ensembl_gene_id hgnc_symbol chromosome_name start_position end_position band
1 ENSG00000000003 TSPAN6 X 100627109 100639991 q22.1
2 ENSG00000000005 TNMD X 100584936 100599885 q22.1
3 ENSG00000000419 DPM1 20 50934867 50958555 q13.13
4 ENSG00000000457 SCYL3 1 169849631 169894267 q24.2
5 ENSG00000000460 C1orf112 1 169662007 169854080 q24.2
6 ENSG00000000938 FGR 1 27612064 27635185 p35.3
>
Merge the combined expression data frame raw_count_filt1 with the hg_symbols annotation into one table:
> readcount <- merge(raw_count_filt1, hg_symbols, by="ensembl_gene_id")
> head(readcount)
ensembl_gene_id gene_id control1 control2 treat1 treat2 hgnc_symbol chromosome_name start_position end_position band
1 ENSG00000000003 ENSG00000000003.14_2 807 357 800 963 TSPAN6 X 100627109 100639991 q22.1
2 ENSG00000000005 ENSG00000000005.6_3 0 0 0 1 TNMD X 100584936 100599885 q22.1
3 ENSG00000000419 ENSG00000000419.12_4 389 174 405 509 DPM1 20 50934867 50958555 q13.13
4 ENSG00000000457 ENSG00000000457.14_4 288 108 218 283 SCYL3 1 169849631 169894267 q24.2
5 ENSG00000000460 ENSG00000000460.17_6 505 208 451 543 C1orf112 1 169662007 169854080 q24.2
6 ENSG00000000938 ENSG00000000938.13_4 0 0 0 0 FGR 1 27612064 27635185 p35.3
Write out the count matrix files:
> write.csv(readcount, file='readcount_all.csv')
> readcount<-raw_count_filt1[ ,-1:-2]
> write.csv(readcount, file='readcount.csv')
> head(readcount)
control1 control2 treat1 treat2
ENSG00000000003 807 357 800 963
ENSG00000000005 0 0 0 1
ENSG00000000419 389 174 405 509
ENSG00000000457 288 108 218 283
ENSG00000000460 505 208 451 543
ENSG00000000938 0 0 0 0