Transcriptome Primer (3): Understanding FASTQ Sequencing Data

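Use the installed sratoolkit to convert the SRA files into FASTQ sequencing files, then use FastQC to check the quality of those files.
Assignment: understand sequencing reads, GC content, quality values, adapters, the index, and every section of the FastQC report.
Source: 生信技能樹, http://www.biotrainee.com/forum.php?mod=viewthread&tid=1750#lastpost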

Converting SRA data to FASTQ with fastq-dump

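# Convert SRR3589959 to SRR3589962; the .sra files are expected in the current directory
for i in {59..62}
do
    echo "$i"
    # --gzip compresses the output, --split-3 separates the paired reads, -O sets the output directory
    fastq-dump --gzip --split-3 -O /mnt/hgfs/Labubuntu_data/GSE81916.RNAseq/RNA-Seq -A "SRR35899$i.sra"
done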
59
Written 30468155 spots for SRR3589959.sra
Written 30468155 spots total
60
Written 52972617 spots for SRR3589960.sra
Written 52972617 spots total
61
Written 36763726 spots for SRR3589961.sra
Written 36763726 spots total
62
Written 43802631 spots for SRR3589962.sra
Written 43802631 spots total
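fastq-dump options:
fastq-dump -h    # show the help text
INPUT
 -A|--accession <accession>    # accession; here, the name of the .sra file in the current path
 --table <table-name>
OUTPUT
  -O|--outdir <path>    # output directory for the result files
  -Z|--stdout     # write to standard output
  --gzip    # compress the output with gzip
  --bzip2    # compress the output with bzip2
Multiple File Options
  --split-files
  --split-3  # for paired-end reads, writes *_1.fastq and *_2.fastq; reads without a mate go to a third file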
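
After a conversion like the one above, a quick sanity check is to count the records in each output file and compare the result with the "Written N spots" lines that fastq-dump printed. The sketch below assumes gzip-compressed output with exactly four lines per record; the function name `count_reads` and the tiny demo file are made up for illustration.

```shell
# count_reads FILE.fastq.gz
# Prints the number of FASTQ records, assuming exactly 4 lines per record.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Demo on a made-up two-read file (not real data).
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n' | gzip -c > "$demo_dir/demo.fastq.gz"
count_reads "$demo_dir/demo.fastq.gz"   # prints 2
rm -rf "$demo_dir"
```

For paired-end data, the *_1 and *_2 files should report the same number of records.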

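FASTQ files

FASTQ file format

FASTQ is a text-based format that stores a biological sequence together with its per-base quality scores; each base and each quality score is encoded as a single ASCII character.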

@E00491:115:H3G7WCCXY:1:1101:30787:1731 1:N:0:ACAAGCTA
TGAATAAGTTGGTTCTAGCGGAGTTTCTGTTCCTTGTCCATAAAGCATCTAACCGCCCTGTGCTCAACTCACGCCGTCTAAAGACAGGAAAGGGAAGTGTCAAGCAGTGTACGATTTGTTTCTAAACTGTACAGTGGCGATTTTTCTAGA
+
AAAFAJAAF7AJFF<JFJFF-JJFJJ-F----F7<F-<--<-<FAAF--<FFJJAJ<FAF------<--7<F--77----7--------7---<77-<------7---77A-----7-<--7AF-<A<---7----)))))-77------

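Line 1: the sequence ID, starting with @; descriptive content may follow after a space.
Line 2: the sequence (the nucleotide bases).
Line 3: starts with +, optionally followed by the same sequence ID and description; often only the + is kept to save storage space.
Line 4: the sequencing quality; each quality character corresponds one-to-one to a character in the sequence.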
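
The four-line layout described above can be checked mechanically. The following is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the function name `check_fastq` is hypothetical and the demo record is made up.

```shell
# check_fastq: reads FASTQ text on stdin and prints "ok" or "invalid".
# Checks: line 1 starts with "@", line 3 starts with "+",
# and line 4 (quality) is as long as line 2 (sequence).
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen   { bad = 1; exit }
        END {
            # An empty input or a truncated last record is also invalid.
            if (bad || NR == 0 || NR % 4 != 0) {
                print "invalid"
                exit 1
            }
            print "ok"
        }'
}

# Demo on a made-up record.
printf '@r1\nACGT\n+\nIIII\n' | check_fastq   # prints ok
```

A real file can be piped in the same way, for example after decompressing it with `gzip -dc`.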
ASCII characters used for sequencing quality (in ascending order, lowest quality first):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
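
To make the character-to-score mapping concrete, the sketch below turns a quality string into numbers by subtracting 33 from each character's ASCII code. The offset of 33 is an assumption that fits the example read, which contains characters such as ) and - whose ASCII codes are below 64. With this offset, a character's score equals its position in the list above, counting from 0. The function name `decode_phred33` is hypothetical.

```shell
# decode_phred33 QUALITY_STRING
# Prints one score per character, assuming score = ASCII code - 33.
decode_phred33() {
    printf '%s\n' "$1" | awk '
        # Build a lookup table from printable character to ASCII code.
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            n = length($0)
            for (i = 1; i <= n; i++) {
                q = ord[substr($0, i, 1)] - 33
                printf "%s%d", (i > 1 ? " " : ""), q
            }
            printf "\n"
        }'
}

# First ten quality characters of the example read above.
decode_phred33 'AAAFAJAAF7'   # prints 32 32 32 37 32 41 32 32 37 22
```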

Illumina sequence identifiers
@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell lane
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
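
The fields listed above are separated by `:`, `#` and `/`, so the identifier can be split with a single awk call. This is only a sketch; the function name `parse_old_illumina_id` and the output labels are made up for illustration.

```shell
# parse_old_illumina_id ID
# Splits an older-style Illumina identifier on ":", "#" and "/".
parse_old_illumina_id() {
    printf '%s\n' "$1" | awk -F'[:#/]' '{
        sub(/^@/, "", $1)   # drop the leading "@"
        printf "instrument=%s lane=%s tile=%s x=%s y=%s index=%s pair=%s\n",
               $1, $2, $3, $4, $5, $6, $7
    }'
}

parse_old_illumina_id '@HWUSI-EAS100R:6:73:941:1973#0/1'
# prints instrument=HWUSI-EAS100R lane=6 tile=73 x=941 y=1973 index=0 pair=1
```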

Starting with Casava 1.8, the format of the identifier line changed:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

EAS139 the unique instrument name
136 the run id
FC706VJ the flowcell id
2 flowcell lane
2104 tile number within the flowcell lane
15343 'x'-coordinate of the cluster within the tile
197393 'y'-coordinate of the cluster within the tile
1 the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y Y if the read is filtered, N otherwise
18 0 when none of the control bits are on, otherwise it is an even number
ATCACG index sequence
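
The newer identifier layout can be split in the same spirit, this time on colons and the single space. The sketch below applies it to the header of the example read shown earlier in this post; the function name `parse_new_illumina_id` and the output labels are hypothetical.

```shell
# parse_new_illumina_id ID
# Splits a newer-style Illumina identifier on ":" and " ".
parse_new_illumina_id() {
    printf '%s\n' "$1" | awk -F'[: ]' '{
        sub(/^@/, "", $1)   # drop the leading "@"
        printf "instrument=%s run=%s flowcell=%s lane=%s tile=%s x=%s y=%s ",
               $1, $2, $3, $4, $5, $6, $7
        printf "pair=%s filtered=%s control=%s index=%s\n", $8, $9, $10, $11
    }'
}

parse_new_illumina_id '@E00491:115:H3G7WCCXY:1:1101:30787:1731 1:N:0:ACAAGCTA'
# prints instrument=E00491 run=115 flowcell=H3G7WCCXY lane=1 tile=1101 x=30787 y=1731 pair=1 filtered=N control=0 index=ACAAGCTA
```

For that read, the output says it came from lane 1, tile 1101, is the first member of its pair, was not filtered, and carries the index ACAAGCTA.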

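Calculating the quality value Q:

Q is derived from p, where p is the estimated probability that a base call is wrong.
There are two conversion formulas: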
The first is the standard Sanger variant to assess reliability of a base call, otherwise known as Phred quality score:

$$Q_{\text{sanger}} = -10 \log_{10} p$$

The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p:

$$Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$$

Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13).

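Figure: relationship between Q and p. The red curve is the Phred (Sanger) equation and the black curve is the Solexa (early Illumina) equation; the dashed line marks p = 0.05, which corresponds to a quality score of Q ≈ 13.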
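
The difference between the two mappings is easy to see numerically. The sketch below assumes the standard definitions, Q = -10·log10(p) for Phred and Q = -10·log10(p/(1-p)) for Solexa; the function names `phred_q` and `solexa_q` are made up for illustration.

```shell
# Both functions expect an error probability p with 0 < p < 1.
# awk has no log10(), so log(x) / log(10) is used instead.
phred_q() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", -10 * log(p) / log(10) }'
}
solexa_q() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", -10 * log(p / (1 - p)) / log(10) }'
}

phred_q 0.05     # prints 13.01
solexa_q 0.05    # prints 12.79
phred_q 0.2      # prints 6.99
solexa_q 0.2     # prints 6.02
phred_q 0.001    # prints 30.00
solexa_q 0.001   # prints 30.00
```

The two scores agree to two decimals at p = 0.001 and drift apart as p grows.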
Phred+33 and Phred+64:
Phred+64: ASCII value of the quality character = Q + 64
Phred+33: ASCII value of the quality character = Q + 33
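
As a worked example of the two offsets, the sketch below encodes a score as a character and makes a rough guess at which offset a quality string uses. The guess relies on one assumption: with Phred+64 and Q ≥ 0, no character can have an ASCII code below 64. It is only a heuristic, since a short high-quality Phred+33 string would be misread as Phred+64, and it ignores Solexa scores. Both function names are hypothetical.

```shell
# q_to_char Q OFFSET : prints the quality character for score Q.
q_to_char() {
    awk -v q="$1" -v off="$2" 'BEGIN { printf "%c\n", q + off }'
}

# guess_offset QUALITY_STRING : prints 33 or 64 (heuristic, see the text above).
guess_offset() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        {
            n = length($0)
            for (i = 1; i <= n; i++) {
                c = ord[substr($0, i, 1)]
                if (c < min) min = c
            }
        }
        END {
            # Any character below "@" (ASCII 64) rules out Phred+64.
            off = (min < 64) ? 33 : 64
            print off
        }'
}

q_to_char 40 33           # prints I
q_to_char 40 64           # prints h
guess_offset 'AAAFAJ-)'   # prints 33
```

In practice the guess is more trustworthy when it is run over the quality lines of many reads rather than a single short string.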

