FASTQ Format Explained & (Phred33 or Phred64)

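FASTQ is a text-based file format for storing biological sequences together with the quality of each base (or amino acid). It was originally developed at the Wellcome Trust Sanger Institute and has since become the de facto standard for storing high-throughput sequencing data. Taking the FASTQ output of Illumina Casava 1.8+ as an example, the format looks like this:

[Image: two example FASTQ records in Illumina Casava 1.8+ format]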

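Each sequence is represented by four lines; the sample above contains two sequences:

Line 1: must start with "@", followed by a unique sequence identifier and then an optional description. The identifier and the description are separated by a space.

Line 2: the sequence characters ([AGCTN]+ for nucleic acids, amino acid letters for proteins).

Line 3: must start with "+", optionally followed by the identifier and the description. If anything follows the "+", it must be identical to what follows the "@" on line 1.

Line 4: the quality characters. Each character gives the quality of the base or amino acid at the same position on line 2. By a fixed rule it can be converted into a base quality score, which reflects the error probability of that base. This line must contain exactly as many characters as line 2. The relationship between characters and error probability is described below.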
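The four rules above can be checked mechanically. The following is a minimal Perl sketch, not a full FASTQ validator: `check_fastq_record` is a hypothetical helper written for this illustration, and the record it is called on is made up.

```perl
use strict;
use warnings;

# Check one four-line FASTQ record against the rules listed above.
# Returns a list of problems; an empty list means the record looks valid.
# NOTE: check_fastq_record is a hypothetical helper, not a library function.
sub check_fastq_record {
    my ($header, $seq, $plus, $qual) = @_;
    my @problems;
    my $header_ok = $header =~ /^\@/;
    push @problems, 'line 1 must start with "@"' unless $header_ok;
    push @problems, 'line 3 must start with "+"' unless $plus =~ /^\+/;
    # Anything after "+" must repeat what follows "@" on line 1.
    if ($header_ok && length($plus) > 1
        && substr($plus, 1) ne substr($header, 1)) {
        push @problems, 'text after "+" differs from line 1';
    }
    push @problems, 'sequence and quality lengths differ'
        unless length($seq) == length($qual);
    return @problems;
}

# Made-up record for illustration (lines already stripped of newlines).
my @record   = ('@read1 example', 'GATTACAN', '+', 'IIIIHHG#');
my @problems = check_fastq_record(@record);
print @problems ? join("\n", @problems) . "\n" : "record OK\n";
```

The sketch assumes each line has already been read and chomped; it does not check that the sequence characters themselves are valid.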

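Within these requirements, sequencer vendors and data repositories define lines 1 and 4 somewhat differently.

Line 1, the identifier line, looks like this in Illumina and NCBI SRA files:

Illumina Casava 1.8+ (see the Wikipedia article on the FASTQ format for a detailed explanation):

@HWI-ST1276:97:D1DCYACXX:7:1101:1406:2170 1:N:0:CGACGT

NCBI SRA:

@SRR387514.1 ILLUMINA-C4D679_0049_FC:1:12:3317:1141 length=40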
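As a rough illustration, the Illumina Casava 1.8+ identifier above can be split into its colon-separated fields. The field names used here (instrument, run, flowcell, lane, tile, x, y, then read, filtered, control, index) follow the description in the Wikipedia FASTQ article; `parse_casava_header` is a hypothetical helper, and the sketch does not handle other header styles such as the NCBI SRA one.

```perl
use strict;
use warnings;

# Split an Illumina Casava 1.8+ identifier line into named fields.
# NOTE: parse_casava_header is a hypothetical helper for this example;
# the field names are taken from the Wikipedia FASTQ article.
sub parse_casava_header {
    my ($line) = @_;
    $line =~ s/^\@//;                       # drop the leading "@"
    my ($id, $desc) = split / /, $line, 2;  # identifier, then description
    my %f;
    @f{qw(instrument run flowcell lane tile x y)} = split /:/, $id;
    @f{qw(read filtered control index)} = split /:/, $desc if defined $desc;
    return \%f;
}

my $h = parse_casava_header(
    '@HWI-ST1276:97:D1DCYACXX:7:1101:1406:2170 1:N:0:CGACGT');
print "$_=$h->{$_}\n" for qw(instrument flowcell lane read filtered index);
```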

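The encoding of line 4 was originally defined by the developers of the Phred program and is generally called Phred quality. Early Solexa/Illumina pipelines (before v1.3) defined quality differently from Phred, so in those files this line should properly be called Solexa quality. From Illumina v1.3 onward, the Phred definition is used.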

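Where do base quality scores come from?

Phred began as a program for calling bases from the fluorescence trace data produced by sequencing instruments. In early fluorescent dye sequencing, a fluorescent signal is emitted each time a base is incorporated, and the signal is captured by a CCD image sensor. The peaks of the signal are recorded to produce a trace (chromatogram). Because each base is labeled with a different color, detecting the peaks identifies the corresponding bases. However, the height, density, shape, and position of these peaks can be irregular or ambiguous, so it is sometimes hard to call the correct base from the peaks alone.

[Figure 1: sample chromatogram]

Phred computes a number of parameters related to peak size and resolution and uses them to look up a base quality score in a large lookup table. The table is derived from analyzing sequencing data of known sequences (presumably the analysis relates the peak parameters to the base error rate, and a formula then converts the error rate into a quality score, giving a direct mapping from peak parameters to quality scores). Different sequencing chemistries and machines use different lookup tables. To save disk space, each quality score (which may take two digits) is converted to a single character by a fixed rule (Phred+33 or Phred+64).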

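The base error probability p and the quality score Q are related in one of two ways:

$$Q_{\text{phred}} = -10 \log_{10} p$$

$$Q_{\text{Illumina, prior to v1.3}} = -10 \log_{10} \frac{p}{1-p}$$

Figure 2. Relationship between the quality score Q and the error probability p. Red: Phred; black: early Illumina versions. The dashed line marks p = 0.05, which corresponds to Q ≈ 13.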
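To make the two definitions concrete, here is a small Perl sketch that evaluates both for a few error probabilities. The function names (`log10_of`, `phred_from_p`, `solexa_from_p`, `p_from_phred`) are hypothetical and chosen only for this example. For p = 0.05 the Phred formula gives roughly 13, in line with the dashed line in Figure 2.

```perl
use strict;
use warnings;

# Perl's built-in log() is the natural logarithm, so divide by log(10).
sub log10_of { my ($x) = @_; return log($x) / log(10) }

# Quality score from error probability p under each definition.
# Both require 0 < p < 1 (solexa_from_p divides by 1 - p).
sub phred_from_p  { my ($p) = @_; return -10 * log10_of($p) }
sub solexa_from_p { my ($p) = @_; return -10 * log10_of($p / (1 - $p)) }

# Error probability from a Phred score (inverse of phred_from_p).
sub p_from_phred { my ($q) = @_; return 10 ** (-$q / 10) }

for my $p (0.05, 0.01, 0.001) {
    printf "p=%-6g Phred=%6.2f Solexa=%6.2f\n",
        $p, phred_from_p($p), solexa_from_p($p);
}
printf "Q=20 -> p=%g\n", p_from_phred(20);
```

The two scores differ most for large p and become nearly identical as p gets small, because p/(1-p) approaches p.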

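Besides the relationship between quality score and error probability, the encodings of different versions also differ in how characters are converted to scores.

[Figure 3: relationship between quality scores and the ASCII values of quality characters in different versions]

The ASCII value of a quality character maps to a quality score in one of two ways:

Phred+64: quality score = ASCII value of the quality character - 64

Phred+33: quality score = ASCII value of the quality character - 33

The encodings can therefore be roughly divided into Phred+33 and Phred+64, where 33 and 64 are the values subtracted from the ASCII value to obtain the score.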
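The offset subtraction can be sketched in a few lines of Perl. `decode_quality` is a hypothetical helper and both quality strings are made up; they were chosen so that the same scores come out under the two encodings.

```perl
use strict;
use warnings;

# Convert a quality string into numeric scores by subtracting the offset
# (33 for Phred+33, 64 for Phred+64).
# NOTE: decode_quality is a hypothetical helper for this example.
sub decode_quality {
    my ($qual, $offset) = @_;
    return map { ord($_) - $offset } split //, $qual;
}

my @scores33 = decode_quality('II5#', 33);   # made-up Phred+33 string
my @scores64 = decode_quality('hhTB', 64);   # made-up Phred+64 string
print "Phred+33: @scores33\n";   # prints "Phred+33: 40 40 20 2"
print "Phred+64: @scores64\n";   # prints "Phred+64: 40 40 20 2"
```

Decoding a Phred+64 string with offset 33 would inflate every score by 31, which is why the encoding has to be known before the scores are used.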

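Some software treats bases differently depending on their quality scores, so when processing sequencing data you often have to specify the correct encoding. That requires correctly determining whether the quality characters are Phred+33 or Phred+64. Recently generated sequencing data (Illumina 1.8+) is almost always Phred+33, but older data downloaded from the NCBI SRA database may not be.

Phred+33 and Phred+64 use different ranges of quality characters (Figure 3), which makes it possible to determine the encoding of a FASTQ file. As Figure 3 shows, characters with ASCII value less than or equal to 58 (quality score less than or equal to 25) are used only in Phred+33; every character used by Phred+64 has an ASCII value of at least 59. Conversely, characters with ASCII value greater than or equal to 75 normally appear only in Phred+64. A program can use these facts to decide.

A Perl script that distinguishes Phred+33 from Phred+64 is given at the end of this article.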

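The script decides as follows:

By default it reads 1000 sequences, and among those 1000 sequences:

1. If two or more quality characters have an ASCII value less than or equal to 58 (i.e., at least two bases have a score less than or equal to 25) and no quality character has an ASCII value greater than or equal to 75, the encoding is judged to be Phred+33.

2. If two or more quality characters have an ASCII value greater than or equal to 75 (i.e., at least two bases have a score greater than or equal to 11) and no quality character has an ASCII value less than or equal to 58, the encoding is judged to be Phred+64.

3. If all quality characters have ASCII values between 59 and 74, the encoding is judged to be probably Phred+33, but testing with more sequences is recommended. (This result can arise in two ways: 1. Phred+33 encoding with all base quality scores between 26 and 41; 2. Phred+64 encoding with all base quality scores between -5 and 10. The former is more likely.)

4. In any other case, it is recommended to print the ASCII values of the quality characters and judge manually.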
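The four rules can be condensed into one small function. This is a simplified sketch of the decision logic only, not the full script from the appendix: `guess_encoding` and its return strings are hypothetical, it takes quality strings directly instead of reading a file, and the thresholds 58 and 75 are the ones stated in the rules above.

```perl
use strict;
use warnings;

# Apply the four rules above to a list of quality strings.
# Returns 'Phred+33', 'Phred+64', 'probably Phred+33' or 'undetermined'.
# NOTE: guess_encoding is a hypothetical name used for this sketch.
sub guess_encoding {
    my @quals = @_;
    my ($low, $high) = (0, 0);
    for my $qual (@quals) {
        for my $ascii (map { ord($_) } split //, $qual) {
            $low++  if $ascii <= 58;   # only used by Phred+33
            $high++ if $ascii >= 75;   # normally only used by Phred+64
        }
    }
    return 'Phred+33'          if $low >= 2  && $high == 0;   # rule 1
    return 'Phred+64'          if $high >= 2 && $low == 0;    # rule 2
    return 'probably Phred+33' if $low == 0  && $high == 0;   # rule 3
    return 'undetermined';                                    # rule 4
}

# Made-up quality strings for illustration.
print guess_encoding('II5#', 'IIII'), "\n";   # prints "Phred+33"
print guess_encoding('hhTB', 'hhhh'), "\n";   # prints "Phred+64"
print guess_encoding('IIII'), "\n";           # prints "probably Phred+33"
```

Requiring at least two characters at one extreme, rather than one, gives a little protection against a single corrupted character deciding the result.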

If I have misunderstood anything, corrections are welcome.

Appendix: Perl script for detecting the encoding:

```perl
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
#
# fastq_phred.pl - Script to judge a FASTQ file's quality encoding: phred33 or phred64.
#
# Version: 0.3 ( May 19, 2014)
# Author: Wencai Jie (jiewencai<@>qq.com), NJAU, China.
#
# Permission is granted to anyone to use this software for any purpose, without
# any express or implied warranty. In no event will the authors be held liable 
# for any damages arising from the use of this software.
#

#Get options.
my ($help, $print_score, $detail, $print_ascii, $reads_num, $reads_start_arg, $reads_end_arg);
my $reads_end_turn;
GetOptions(
    'help|h!' => \$help,
    'score|s!' => \$print_score,
    'detail|d!' => \$detail,
    'ascii|a!' => \$print_ascii,
    'reads_num|n=i' => \$reads_num,
    'reads_start|b=i' => \$reads_start_arg,
    'reads_end|e=i' => \$reads_end_arg,
);

my $usage = "
fastq_phred.pl:
This program prints the quality scores or ASCII values of the reads in a fastq file and
helps to judge its encoding (phred33 or phred64) from the ASCII value range.

Usage:
    perl fastq_phred.pl [options] <file1.fq [file2.fq ...]>
Options:
    -h|--help         print this help message.
    -s|--score        print scores.                                 [default: Do not print scores] 
    -d|--detail       print detail scores or ASCII value when       [default: Do not print scores in detail] 
                          --score or --ascii set.
    -a|--ascii        print quality characters' ASCII values. If    [default: Do not print ASCII values]
                          this option is set, --score is disabled.
    -n|--reads_num    reads number used to test phred encoding      [default: 1000]
                          and print scores. It's advised to use more
                          than 100 reads to do the test.               
    -b|--reads_start  reads start position used to test phred       [default: 1]
                          encoding and print scores.
    -e|--reads_end    reads end position used to test phred         [default: the length of the read]
                          encoding and print scores.

";
    
if ($#ARGV < 0 or $help){ 
    print "$usage";
    exit;
}

#Check parameters.
unless ($reads_num){
    $reads_num = 1000;
}
if ($reads_start_arg && $reads_start_arg  < 0){
    print STDERR "ERROR: The reads start position should be greater than 0.\n\n";
    exit 1;
}
if ($reads_end_arg && $reads_end_arg  < 0){
    print STDERR "ERROR: The reads end position should be greater than 0.\n\n";
    exit 1;
}
if ($reads_start_arg && $reads_end_arg && $reads_end_arg < $reads_start_arg){
    print STDERR "ERROR: The reads end position should not be less than the start position.\n\n";
    exit 1;
}

&main;

sub main{
    my $filename = '';
    while ($filename = shift @ARGV){
    my @FQ = ();
    my @all_ascii = ();
    my ($file_end, $phred_result) = ('','');
    my ($Q, $count, $lt_58, $gt_75) = (0, 0, 0, 0);
    open FQ,"<$filename" or die "Can not open $filename:$!\n";
    #Read sequences.
    while($count < $reads_num){
        $count++;
        @FQ=(); 
        #read four lines from fastq file.
        for(my $i=0; $i<=3; $i++){
            if (eof(FQ)){
                $file_end = 'yes';
                last;
            }
            $FQ[$i]=<FQ>;
            if ($FQ[0] !~ m/^@/){
                my $line = $count*4-3;
                print STDERR "ERROR:\n$filename: It's not a correct fastq format.\nline '$line': $FQ[0]\n";
                exit;
            }
        }
        if ( $file_end eq 'yes'){
            #No complete record is left: stop reading instead of looping until $reads_num.
            last;
        }
        my @ascii_ref = &cal_ascii($FQ[3], $reads_start_arg, $reads_end_arg);
        push @all_ascii, [@ascii_ref];
    }

    #print ASCII.
    if ($print_ascii){
        print "\n","."x50," ASCII Value: $filename ","."x50,"\n";
        &print_array_of_array(\@all_ascii, 0, $detail);
        next;
    }

    #Stastic ASCII value range.
    foreach my $ascii_ref (@all_ascii){
        $lt_58 += (grep { $_ <= 58} @{$ascii_ref});
        $gt_75 += (grep { $_ >= 75} @{$ascii_ref});
    }

    #Guess the Phred offset from the ASCII value range.
    if ($lt_58 > 1 && $gt_75 == 0 ){
        $Q = 33;
        $phred_result = "$filename: The encoding should be Phred33.\nNumber of quality characters with ASCII value <= 58: $lt_58\nNumber of quality characters with ASCII value >= 75: $gt_75";
    }elsif($lt_58 == 0 && $gt_75 > 1){
        $Q = 64;
        $phred_result = "$filename: The encoding should be Phred64.\nNumber of quality characters with ASCII value <= 58: $lt_58\nNumber of quality characters with ASCII value >= 75: $gt_75";
    }elsif($lt_58 == 0 && $gt_75 == 0){
        print STDERR "$filename: The encoding is probably Phred33 with all quality scores between 26 and 41, but it is advised to test more reads with the '-n <int>' option.\n";
        #Move on to the next file instead of aborting the whole run.
        next;
    }else{
        print STDERR "$filename\nWarning: Abnormal encoding. Please test again with more reads, or judge for yourself from the ASCII values printed by the '--ascii' option.\n";
        next;
    }

    #print score.
    if ($print_score){
        print "\n","."x50," Quality Score: $filename ","."x50,"\n";
        &print_array_of_array(\@all_ascii, $Q, $detail);
    }

    #Print the phred encoding result.
    print STDERR "$phred_result\n\n";
    }
}

#Print Score or ASCII value.
sub print_array_of_array{
    my ($array_of_array_ref, $Q, $detail) = @_;
    my ($average_value, $total_value, $value_num) = (0, 0, 0);
    my @array_of_array = @{$array_of_array_ref};
    my %value_h;
    foreach my $array_ref (@array_of_array){
        for (my $i=0;$i<=$#{$array_ref};$i++){
            my $out_value  = ${$array_ref}[$i] - $Q;
            $value_h{$out_value}++;
            $total_value += $out_value;
            $value_num ++;
            print "$out_value " if ($detail);
        }
        print "\n" if ($detail);
    }
    unless ($detail){
        foreach my $out_value (sort {$a <=> $b} keys %value_h){
            print "$out_value\t$value_h{$out_value}\n";
        }
    }
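    #Truncate the average to two decimal places; skip it when no values were read.
    if ($value_num > 0){
        $average_value = int($total_value / $value_num * 100) / 100;
        print "Average: $average_value\n";
    }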
}

#Collect the ASCII values of the quality characters of one read.
sub cal_ascii{
    my ($read, $start_arg, $end_arg) = @_;
    my @all_ascii = ();
    #Strip the line terminator (LF or CRLF) so it is not counted as a quality character.
    $read =~ s/\r?\n\z//;
    my $reads_len = length($read);
    #The end position must not exceed the read length.
    my $reads_end = ($end_arg && $end_arg <= $reads_len) ? $end_arg : $reads_len;
    #Use the requested start position if it is valid; otherwise start at 1.
    my $reads_start = ($start_arg && $start_arg <= $reads_end) ? $start_arg : 1;
    #Convert 1-based coordinates to 0-based string offsets.
    for my $j ($reads_start - 1 .. $reads_end - 1){
        push @all_ascii, ord(substr($read, $j, 1));
    }
    return @all_ascii;
}



```




References:

1. [https://en.wikipedia.org/wiki/FASTQ_format](https://en.wikipedia.org/wiki/FASTQ_format)

2. [https://en.wikipedia.org/wiki/Phred_quality_score](https://en.wikipedia.org/wiki/Phred_quality_score)

3. [https://en.wikipedia.org/wiki/Phred_base_calling](https://en.wikipedia.org/wiki/Phred_base_calling)

4. [http://maq.sourceforge.net/fastq.shtml](http://maq.sourceforge.net/fastq.shtml)

5. [http://maq.sourceforge.net/qual.shtml](http://maq.sourceforge.net/qual.shtml)

6. [http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf](http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf)

Last edited: 2020.02.19 11:04:16
