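Introduction to Transcriptomics (3): Understanding FASTQ Sequencing Data
Use the installed sratoolkit to convert the SRA files into FASTQ-format sequencing files, then check the quality of those files with FastQC.
Assignment: understand sequencing reads, GC content, quality values, adapters, indexes, and every section of the FastQC report.
Source: 生信技能樹 (Biotrainee forum): http://www.biotrainee.com/forum.php?mod=viewthread&tid=1750#lastpost
Converting SRA data to FASTQ format with fastq-dump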
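# Convert SRR3589959 through SRR3589962 to gzip-compressed FASTQ files
# (the .sra files are expected in the current directory)
for i in {59..62}
do
    echo "$i"
    fastq-dump --gzip --split-3 -O /mnt/hgfs/Labubuntu_data/GSE81916.RNAseq/RNA-Seq -A "SRR35899${i}.sra"
done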
59
Written 30468155 spots for SRR3589959.sra
Written 30468155 spots total
60
Written 52972617 spots for SRR3589960.sra
Written 52972617 spots total
61
Written 36763726 spots for SRR3589961.sra
Written 36763726 spots total
62
Written 43802631 spots for SRR3589962.sra
Written 43802631 spots total
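One way to sanity-check the conversion is to count the reads in an output file and compare the result with the "Written ... spots" line in the log. The sketch below assumes that every FASTQ record occupies exactly four lines; `count_reads` is a helper name made up for this example, and the file name in the usage comment is hypothetical, since the exact output name depends on how fastq-dump was invoked.

```bash
# Count the reads in a gzip-compressed FASTQ file.
# Assumption: every record is exactly four lines (no wrapped sequences).
count_reads() {
    # $1: path to a .fastq.gz file
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Hypothetical usage (replace the file name with one that fastq-dump actually wrote):
# count_reads /mnt/hgfs/Labubuntu_data/GSE81916.RNAseq/RNA-Seq/SRR3589959_1.fastq.gz
```

For paired-end data split with --split-3, the count for the `_1` file should normally match the number of spots, unless some reads lost their mate.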
fastq-dump options:
fastq-dump -h # show the help text
INPUT
-A|--accession <accession> # accession, or the name of the file under the given path
--table <table-name>
OUTPUT
-O|--outdir <path> # output directory for the result files
-Z|--stdout # write to standard output
--gzip # compress the output with gzip
--bzip2 # compress the output with bzip2
Multiple File Options
--split-files
--split-3 # for paired-end (PE) sequences, writes two files, *_1.fastq and *_2.fastq; reads without a mate go to *.fastq
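FASTQ files
FASTQ file format
FASTQ is a text-based file format that stores biological sequences together with their corresponding quality scores; each base and each quality value is encoded as a single ASCII character.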
@E00491:115:H3G7WCCXY:1:1101:30787:1731 1:N:0:ACAAGCTA
TGAATAAGTTGGTTCTAGCGGAGTTTCTGTTCCTTGTCCATAAAGCATCTAACCGCCCTGTGCTCAACTCACGCCGTCTAAAGACAGGAAAGGGAAGTGTCAAGCAGTGTACGATTTGTTTCTAAACTGTACAGTGGCGATTTTTCTAGA
+
AAAFAJAAF7AJFF<JFJFF-JJFJJ-F----F7<F-<--<-<FAAF--<FFJJAJ<FAF------<--7<F--77----7--------7---<77-<------7---77A-----7-<--7AF-<A<---7----)))))-77------
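Line 1: the sequence ID, starting with @, optionally followed by a space and a description;
Line 2: the sequence (the bases, i.e. the nucleotide sequence);
Line 3: starts with +, optionally followed by the same sequence ID and description; to save storage space, often only the + is kept;
Line 4: the sequencing quality string, with exactly one quality character for each base in the sequence.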
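The four rules above can be turned into a small structural check. This is only a sketch under stated assumptions: `check_fastq` is a hypothetical helper name, the input is uncompressed FASTQ on standard input, and sequences are not wrapped across several lines.

```bash
# Check the four-line record structure of a FASTQ stream read from stdin.
# Prints the number of records if all are well formed; otherwise reports
# the first malformed record and returns a non-zero exit status.
check_fastq() {
    awk '
        { rec = int((NR + 3) / 4) }                      # 1-based record number
        NR % 4 == 1 && substr($0, 1, 1) != "@" { msg = "line 1 does not start with @" }
        NR % 4 == 2 { seq_len = length($0) }             # remember the sequence length
        NR % 4 == 3 && substr($0, 1, 1) != "+" { msg = "line 3 does not start with +" }
        NR % 4 == 0 && length($0) != seq_len { msg = "sequence and quality lengths differ" }
        msg != "" { print "record " rec ": " msg; bad = 1; exit }
        END {
            if (bad) exit 1
            if (NR % 4 != 0) { print "input ends with an incomplete record"; exit 1 }
            print NR / 4
        }
    '
}

# Hypothetical usage on a compressed file:
# gzip -dc reads_1.fastq.gz | check_fastq
```

FastQC reports far more than this; the point here is only to make the line-by-line layout concrete.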
ASCII characters used to encode sequencing quality (ordered from lowest to highest):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Illumina sequence identifiers
@HWUSI-EAS100R:6:73:941:1973#0/1
Field | Meaning |
---|---|
HWUSI-EAS100R | the unique instrument name |
6 | flowcell lane |
73 | tile number within the flowcell lane |
941 | 'x'-coordinate of the cluster within the tile |
1973 | 'y'-coordinate of the cluster within the tile |
#0 | index number for a multiplexed sample (0 for no indexing) |
/1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
Since CASAVA 1.8, the identifier format has changed:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Field | Meaning |
---|---|
EAS139 | the unique instrument name |
136 | the run id |
FC706VJ | the flowcell id |
2 | flowcell lane |
2104 | tile number within the flowcell lane |
15343 | 'x'-coordinate of the cluster within the tile |
197393 | 'y'-coordinate of the cluster within the tile |
1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
Y | Y if the read is filtered, N otherwise |
18 | 0 when none of the control bits are on, otherwise it is an even number |
ATCACG | index sequence |
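To make the field layout in the table concrete, here is a minimal sketch that splits a new-style identifier into its parts. `parse_illumina_header` is a hypothetical helper, and the code assumes the identifier has exactly the two space-separated, colon-delimited groups shown in the table.

```bash
# Split a new-style Illumina identifier into labelled fields.
# Assumption: "@instrument:run:flowcell:lane:tile:x:y pair:filtered:control:index"
parse_illumina_header() {
    printf '%s\n' "$1" | awk '{
        sub(/^@/, "", $1)        # drop the leading @
        split($1, id, ":")       # instrument:run:flowcell:lane:tile:x:y
        split($2, desc, ":")     # pair:filtered:control:index
        print "instrument=" id[1]
        print "run=" id[2]
        print "flowcell=" id[3]
        print "lane=" id[4]
        print "tile=" id[5]
        print "x=" id[6]
        print "y=" id[7]
        print "pair=" desc[1]
        print "filtered=" desc[2]
        print "control=" desc[3]
        print "index=" desc[4]
    }'
}

parse_illumina_header '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'
```

Applied to the header of the example read earlier in this post, the same function would report lane 1, tile 1101 and index ACAAGCTA.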
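Calculating the quality score Q:
Q corresponds to p, where p is the probability that a given base call is wrong.
There are two conversion formulas: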
The first is the standard Sanger variant to assess reliability of a base call, otherwise known as Phred quality score:
$$Q_{\text{sanger}} = -10 \log_{10} p$$
The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p:
$$Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$$
Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13).
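Relationship between Q and p: the red curve is the Phred equation and the black curve is the Illumina (Solexa) equation; the dashed line marks p = 0.05, which corresponds to a quality score of Q ≈ 13.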
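The two mappings are easy to compare numerically. The sketch below evaluates both for a given error probability p, using awk for the floating-point arithmetic; `phred_q` and `solexa_q` are names invented for this example.

```bash
# Phred (Sanger) mapping: Q = -10 * log10(p)
phred_q() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", -10 * log(p) / log(10) }'
}

# Early Solexa mapping: Q = -10 * log10(p / (1 - p))
solexa_q() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", -10 * log(p / (1 - p)) / log(10) }'
}

phred_q 0.05     # prints 13.01
solexa_q 0.05    # prints 12.79
phred_q 0.001    # prints 30.00
solexa_q 0.001   # prints 30.00
```

At p = 0.05 the two scores already differ by about 0.2, while at p = 0.001 they agree to two decimal places, which matches the statement that the mappings only diverge at low quality.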
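Phred+33 and Phred+64:
Phred+64 (used by Illumina pipelines 1.3 to 1.7): ASCII value of the quality character = Q + 64
Phred+33 (Sanger, and Illumina 1.8 and later): ASCII value of the quality character = Q + 33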
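Going the other way, a quality string can be decoded back into numeric scores by subtracting the offset from each character's ASCII value. This is a minimal sketch; `decode_quality` is a hypothetical helper whose second argument is the offset (33 by default), and it relies on `od` to obtain the byte values.

```bash
# Decode a FASTQ quality string into space-separated Phred scores.
# $1: quality string, $2: ASCII offset (33 or 64; defaults to 33)
decode_quality() {
    printf '%s' "$1" | od -An -v -tu1 | awk -v offset="${2:-33}" '
        { for (i = 1; i <= NF; i++) printf "%s%d", (n++ ? " " : ""), $i - offset }
        END { print "" }'
}

decode_quality 'AAAFAJ'   # prints 32 32 32 37 32 41
decode_quality 'h' 64     # prints 40
```

The first call uses the opening characters of the example read shown earlier. That read also contains characters such as `-` and `)`, whose ASCII values are below 64, so it cannot be Phred+64 encoded.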