Fastq格式是一種基于文本的存儲生物序列和對應堿基(或氨基酸)質量的文件格式。最初由桑格研究所(Wellcome Trust Sanger Institute)開發出來,現已成為存儲高通量測序數據的事實標準。以Illumina Casava 1.8+ 的fastq格式為例,fastq格式的形式如下:
每條序列由4行字符表示,上述樣例顯示有兩條序列:
第一行:必須以“@”開頭,后面跟著唯一的序列ID標識符,然后跟著可選的序列描述內容,標識符與描述內容用空格分開。
第二行:序列字符(核酸為[AGCTN]+,蛋白為氨基酸字符)。
第三行:必須以“+”開頭,后面跟著可選的ID標識符和可選的描述內容,如果“+”后面有內容,該內容必須與第一行“@”后的內容相同。
第四行:堿基質量字符,每個字符對應第二行相應位置堿基或氨基酸的質量,該字符可以按一定規則轉換為堿基質量得分,堿基質量得分可以反映該堿基的錯誤率。這行的字符數與第二行中的字符數必須相同。字符與錯誤率的具體關系見下文介紹。
在滿足上述要求的前提下,不同的測序儀廠商或數據存儲商對第一行和第四行的定義有些差別。
第一行,即標識行在Illumina和NCBI SRA中的樣式如下:
Illumina casava 1.8+(詳細的解釋可參考wiki):
@HWI-ST1276:97:D1DCYACXX:7:1101:1406:2170 1:N:0:CGACGT
NCBI SRA:
@SRR387514.1 ILLUMINA-C4D679_0049_FC:1:12:3317:1141 length=40
對于第四行的編碼,最初由Phred程序的開發者定義,一般稱為Phred qualitiy. 在Illumina早起版本(v1.3,v1.4)中,因為對quality的定義與Phred的不同,這行應該稱為 Solexa quality。但從Illumina v1.5以后,也開始采用Phred的定義。
堿基質量得分是怎么來的?
Phred最初是一個從測序儀中產生的熒光記錄數據中識別堿基的程序。在早起的熒光染料測序中,每次發生堿基合成時會釋放出熒光信號,該信號被CCD圖像傳感器捕獲。記錄下熒光信號的峰值,生成一個實時的軌跡數據(chromatogram)。因為不同的堿基用不用的顏色標記,檢測這些峰值即可判斷出對應的堿基。但由于這些信號的波峰、密度、形狀和位置等是不連續或模糊的,有時很難根據波峰判斷出正確的堿基。
圖1 chromatogram樣圖
Phred計算許多與波峰大小和分辨率相關的參數,根據這些參數,從一個巨大的查詢表中找出堿基質量得分。這個查詢表是根據對已知序列的測序數據分析得到的(應該是分析得到波峰參數與堿基錯誤率的關系,再通過公式把錯誤率轉換成質量得分,得到波峰參數與質量得分的直接對應表)。不同的測序試劑和機器用不同的查詢表。為了節約磁盤空間,質量得分(可能占用兩個字符)按一定規則(Phred+33或Phred+64)被轉換為單個字符表示。
堿基錯誤率與質量得分的關系有如下兩種:
Qphred = -10log10p
Qillumina-prior to v.1.4 = -10log10(p/(1-p))
圖 2 質量得分Q和錯誤率p的關系,紅色的為phred,黑色的為Illumina早期版本,虛線表明p=0.05,對應的質量得分為Q≈13
在不同版本的編碼中,除了質量得分與錯誤率有所差別外,在字符與得分的轉換上也有差別。
圖3 不同版本質量得分與質量字符ASCII值的關系
質量字符的ASCII值和質量得分的關系有如下兩種:
Phred+64 質量字符的ASCII值 - 64
Phred+33: 質量字符的ASCII值 - 33
可以粗略分為 Phred+33和Phred+64,這里的33和64就是指ASCII值轉換為得分該減去的數值。
在處理測序數據時,因為一些軟件會根據堿基質量得分的不同做不同的處理,常要指定正確的編碼方式,有必要對質量字符與質量得分的關系(Phred+33或Phred+64)作出正確的判斷。當然,如果處理的是最近兩年產生的測序數據,基本上都是Phred+33的,但從NCBI SRA數據庫下載的舊數據就不一定了。
根據圖3中Phred+33與Phred+64所使用的質量字符范圍的不同,可以對fastq文件中質量得分的編碼方式做出判斷。圖3中顯示,ASCII值小于等于58(相應的質量得分小于等于25)對應的字符只有在Phred+33的編碼中被使用,所有Phred+64所使用的字符的ASCII值都大于等于59。在通常情況下,ASCII值大于等于74的字符只出現在Phred+64中。利用這些信息即可在程序中進行判斷。
文章末尾是一個對Phred+33或Phred+64做區分的perl腳本。
該腳本的判斷思想如下:
默認讀取1000條序列,在這1000條序列中:
1. 如果有2個以上的質量字符ASCII值小于等于58(即有兩個堿基的得分小于等于25),同時沒有任何質量字符的ASCII值大于等于75,即判斷是Phred+33。
2. 如果有2個以上的質量字符ASCII值大于等于75(即有兩個堿基的得分大于等于10),同時沒有任何質量字符的ASCII值小于等于58,即判斷是Phred+64。
3. 如果所有質量字符的ASCII值介于59到74之間,即判斷可能是Phred+33,但建議使用更多的序列做進一步測試(出現這種結果可能有兩種情況:1, Phred+33編碼,所有堿基質量得分介于26到42之間;2,Phred+64編碼,所有堿基質量得分介于-5到10;是前者的可能性大)。
4. 如果出現上述3種以外的情況,建議打印出質量字符的ASCII值人工判斷。
理解錯誤的地方歡迎指正。
附錄檢測格式的perl:
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
#
# fastq_phred.pl - Script for judge the fastq's encoding, whether it is phred33 or phred64.
#
# Version: 0.3 ( May 19, 2014)
# Author: Wencai Jie (jiewencai<@>qq.com), NJAU, China.
#
# Permission is granted to anyone to use this software for any purpose, without
# any express or implied warranty. In no event will the authors be held liable
# for any damages arising from the use of this software.
#
#Get options.
my ($help, $print_score, $detail, $print_ascii, $reads_num, $reads_start_arg, $reads_end_arg);
my $reads_end_turn;
GetOptions(
'help|h!' => \$help,
'score|s!' => \$print_score,
'detail|d!' => \$detail,
'ascii|a!' => \$print_ascii,
'reads_num|n=i' => \$reads_num,
'reads_start|b=i' => \$reads_start_arg,
'reads_end|e=i' => \$reads_end_arg,
);
my $usage = "
fastq_phred.pl:
This program can print fastq file's reads quality scores, ASCII value, and help to judge it's
encoding by the ASCII value range, whether it is phred33 or phred64.
Usage:
perl fastq_phred.pl [options] <file1.fq [file2.fq ...]>
Options:
-h|--help print this help message.
-s|--score print scores. [default: Do not print scores]
-d|--detail print detail scores or ASCII value when [default: Do not print scores in detail]
--score or --ascii set.
-a|--ascii print quality character's ASCII value. if [default: Do not print ASCII vaule]
this option set, the --score will disabled.
-n|--reads_num reads number used to test phred encoding [default: 1000]
and print scores. It's advised to use more
than 100 reads to do the test.
-b|--reads_start reads start position used to test phred [default: 1]
encoding and print scores.
-e|--reads_end reads end position used to test phred [default: the length of the read]
encoding and print scores.
";
if ($#ARGV < 0 or $help){
print "$usage";
exit;
}
#Check parameters.
unless ($reads_num){
$reads_num = 1000;
}
if ($reads_start_arg && $reads_start_arg < 0){
print STDERR "ERROR:The reads start position should great than 0.\n\n";
exit;
}
if ($reads_end_arg && $reads_end_arg < 0){
print STDERR "ERROR:The reads end position should great than 0.\n\n";
exit;
}
if ($reads_start_arg && $reads_end_arg && $reads_end_arg < $reads_start_arg){
print STDERR "ERROR:The reads start position should great than end position.\n\n";
exit;
}
&main;
sub main{
my $filename = '';
while ($filename = shift @ARGV){
my @FQ = ();
my @all_ascii = ();
my ($file_end, $phred_result) = ('','');
my ($Q, $count, $lt_58, $gt_75) = (0, 0, 0, 0);
open FQ,"<$filename" or die "Can not open $filename:$!\n";
#Read sequences.
while($count < $reads_num){
$count++;
@FQ=();
#read four lines from fastq file.
for(my $i=0; $i<=3; $i++){
if (eof(FQ)){
$file_end = 'yes';
last;
}
$FQ[$i]=<FQ>;
if ($FQ[0] !~ m/^@/){
my $line = $count*4-3;
print STDERR "ERROR:\n$filename: It's not a correct fastq format.\nline '$line': $FQ[0]\n";
exit;
}
}
if ( $file_end eq 'yes'){
next;
}
my @ascii_ref = &cal_ascii($FQ[3], $reads_start_arg, $reads_end_arg);
push @all_ascii, [@ascii_ref];
}
#print ASCII.
if ($print_ascii){
print "\n","."x50," ASCII Value: $filename ","."x50,"\n";
&print_array_of_array(\@all_ascii, 0, $detail);
next;
}
#Stastic ASCII value range.
foreach my $ascii_ref (@all_ascii){
$lt_58 += (grep { $_ <= 58} @{$ascii_ref});
$gt_75 += (grep { $_ >= 75} @{$ascii_ref});
}
#Guess the Phred with ASCII value range.
if ($lt_58 > 1 && $gt_75 == 0 ){
$Q = 33;
$phred_result = "$filename: The encoding should be Phred33.\nThe quality score character number that ASCII value less than 58 : $lt_58\nThe quality score character number that ASCII value great than 75: $gt_75";
}elsif($lt_58 == 0 && $gt_75 > 1){
$Q = 64;
$phred_result = "$filename: The encoding should be Phred64.\nThe quality score character number that ASCII value less than 58 : $lt_58\nThe quality score character number that ASCII value great than 75: $gt_75";
}elsif($lt_58 == 0 && $gt_75 == 0){
print STDERR "$filename: The encoding should be Phred33 that all of the nucleotide quality score great than 25 and less than 41, but it's advised to send more reads to be tested with '-n <int>' options.\n";
exit;
}else{
print STDERR "$filename\nWarning: Abnormal endoding, Please test again with more reads or make a judgement by yourself with ASCII value by '-ascii' options.\n";
exit;
}
#print score.
if ($print_score){
print "\n","."x50," Quality Score: $filename ","."x50,"\n";
&print_array_of_array(\@all_ascii, $Q, $detail);
}
#Print the phred encoding result.
print STDERR "$phred_result\n\n";
}
}
#Print Score or ASCII value.
sub print_array_of_array{
my ($array_of_array_ref, $Q, $detail) = @_;
my ($average_value, $total_value, $value_num) = (0, 0, 0);
my @array_of_array = @{$array_of_array_ref};
my %value_h;
foreach my $array_ref (@array_of_array){
for (my $i=0;$i<=$#{$array_ref};$i++){
my $out_value = ${$array_ref}[$i] - $Q;
$value_h{$out_value}++;
$total_value += $out_value;
$value_num ++;
print "$out_value " if ($detail);
}
print "\n" if ($detail);
}
unless ($detail){
foreach my $out_value (sort {$a <=> $b} keys %value_h){
print "$out_value\t$value_h{$out_value}\n";
}
}
$average_value = (int ($total_value/$value_num)*100) /100;
print "Average: $average_value\n";
}
#Calculate phred score.
sub cal_ascii{
my ($read,$reads_start, $reads_end) = @_;
my @all_ascii = ();
my $ascii = 0;
#The $read string's end is a "\n";
my $reads_len = length($read) - 1;
#The $reads_end should be less than the read length.
if( $reads_end_arg && $reads_end_arg <= $reads_len){
$reads_end = $reads_end_arg;
}else{
$reads_end = $reads_len;
}
#If the the reads start position set, the $reads_start equal to it,
#else the reads start position set to 1.
if ( $reads_start_arg && $reads_start_arg <= $reads_end){
$reads_start = $reads_start_arg;
}else{
$reads_start = 1;
}
#Convert 1 base coordinate system to 0 base coordinate system.
for(my $j=$reads_start-1; $j<=$reads_end-1; $j++){
$ascii = ord(substr($read,$j,1));
push @all_ascii, $ascii;
}
return @all_ascii;
}
```
參考資料:
1. [https://en.wikipedia.org/wiki/FASTQ_format](https://en.wikipedia.org/wiki/FASTQ_format)
2. [https://en.wikipedia.org/wiki/Phred_quality_score](https://en.wikipedia.org/wiki/Phred_quality_score)
3. [https://en.wikipedia.org/wiki/Phred_base_calling](https://en.wikipedia.org/wiki/Phred_base_calling)
4. [http://maq.sourceforge.net/fastq.shtml](http://maq.sourceforge.net/fastq.shtml)
5. [http://maq.sourceforge.net/qual.shtml](http://maq.sourceforge.net/qual.shtml)
6. [http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf](http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf)