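This post is copied in full from the Jianshu article "Using fastp for data quality control".

It is kept here only for my own study. If it infringes on your rights, please let me know and I will delete it. Thanks!

Using fastp for data quality control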

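fastp is a relatively new data quality control tool. I came across it because the tools currently available each do some things well but none covers everything. For example, I recently worked with an RNA-seq dataset of poor quality: it needed adapter removal, contained many N bases, and had low-quality bases at the start of the reads, so a few bp had to be trimmed there. My original plan was to use Trimmomatic to remove adapters and the first few bp, plus cutadapt to discard reads with many N bases, but that felt somewhat complicated. Let's see what fastp can do.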

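Features of fastp:

Comprehensive, automatic quality control of the data, with a human-friendly report;

Filtering (low quality, too short, too many N, ...);

For the head or tail of each read, it computes the mean quality within a sliding window and cuts off subsequences whose mean is low (similar to Trimmomatic, but much faster);

Global trimming (at the head/tail, without affecting deduplication); for Illumina data, the last one or two cycles often need this treatment;

Adapter removal. Conveniently, you do not need to supply the adapter sequence, because the algorithm detects it automatically and trims it;

For paired-end (PE) data, fastp automatically finds the overlap region of each read pair and corrects mismatched bases in that region according to their quality values;

Trimming of polyG tails. Illumina NextSeq/NovaSeq instruments use two-color chemistry, so polyG tails are common in their data, and this feature is enabled by default for those two platforms;

Preprocessing of data with unique molecular identifiers (UMI); whether the UMI sits in the insert or in the index, it is handled easily;

Splitting of the output, in two modes: by the number of output files, or by the number of lines per file;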

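Most of the features above need very few parameters. Some are enabled by default and can be switched off with an option. fastp fully supports gzip input and output, handles both SE and PE data, and supports not only short reads from platforms such as Illumina but also, to some extent, long reads from PacBio/Nanopore.

fastp generates a report in HTML format. The report contains no static images: every chart is drawn dynamically with JavaScript, which makes it highly interactive. A sample report is available at: http://opengene.org/fastp/fastp.html

The developers also had automated analysis in mind. In addition to the human-readable HTML report, fastp writes a machine-readable JSON result that contains 100% of the information in the HTML report. The JSON layout is custom-formatted, so it is easy for programs to parse and also easy to read at a glance in any text editor. A sample JSON result is available at: http://opengene.org/fastp/fastp.json

Let's first look at fastp's parameters: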

usage: fastp -i <in1> -o <out1> [-I <in2> -O <out2>] [options...]
options:
  # I/O options   (input/output file settings)
  -i, --in1                          read1 input file name (string)
  -o, --out1                         read1 output file name (string [=])
  -I, --in2                          read2 input file name (string [=])
  -O, --out2                         read2 output file name (string [=])
  -6, --phred64                      indicates the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
  -z, --compression                  compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 2. (int [=2])
    --reads_to_process               specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

  # adapter trimming options   (adapter trimming settings)
  -A, --disable_adapter_trimming     adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
  -a, --adapter_sequence               the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
      --adapter_sequence_r2            the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=])

  # global trimming options   (number of bases to trim from the start and end of each read)
  -f, --trim_front1                  trimming how many bases in front for read1, default is 0 (int [=0])
  -t, --trim_tail1                   trimming how many bases in tail for read1, default is 0 (int [=0])
  -F, --trim_front2                  trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
  -T, --trim_tail2                   trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])

  # polyG tail trimming, useful for NextSeq/NovaSeq data   (polyG trimming)
  -g, --trim_poly_g                  force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
      --poly_g_min_len                 the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
  -G, --disable_trim_poly_g          disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

  # polyX tail trimming
  -x, --trim_poly_x                    enable polyX trimming in 3' ends.
      --poly_x_min_len                 the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

  # per read cutting by quality options   (sliding-window trimming)
  -5, --cut_by_quality5              enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere deduplication for both PE/SE data)
  -3, --cut_by_quality3              enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data)
  -W, --cut_window_size              the size of the sliding window for sliding window trimming, default is 4 (int [=4])
  -M, --cut_mean_quality             the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20])

  # quality filtering options   (filter reads by base quality)
  -Q, --disable_quality_filtering    quality filtering is enabled by default. If this option is specified, quality filtering is disabled
  -q, --qualified_quality_phred      the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
  -u, --unqualified_percent_limit    how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
  -n, --n_base_limit                 if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])

  # length filtering options   (filter reads by length)
  -L, --disable_length_filtering     length filtering is enabled by default. If this option is specified, length filtering is disabled
  -l, --length_required              reads shorter than length_required will be discarded, default is 15. (int [=15])

  # low complexity filtering
  -y, --low_complexity_filter          enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
  -Y, --complexity_threshold           the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

  # filter reads with unwanted indexes (to remove possible contamination)
      --filter_by_index1               specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
      --filter_by_index2               specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
      --filter_by_index_threshold      the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

  # base correction by overlap analysis options   (correct bases using the read overlap)
  -c, --correction                   enable base correction in overlapped regions (only for PE data), default is disabled

  # UMI processing
  -U, --umi                          enable unique molecular identifier (UMI) preprocessing
      --umi_loc                      specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read), default is none (string [=])
      --umi_len                      if the UMI is in read1/read2, its length should be provided (int [=0])
      --umi_prefix                   if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
      --umi_skip                       if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])

  # overrepresented sequence analysis
  -p, --overrepresentation_analysis    enable overrepresented sequence analysis.
  -P, --overrepresentation_sampling    One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])

  # reporting options
  -j, --json                         the json format report file name (string [=fastp.json])
  -h, --html                         the html format report file name (string [=fastp.html])
  -R, --report_title                 should be quoted with ' or ", default is "fastp report" (string [=fastp report])

  # threading options   (set the number of threads)
  -w, --thread                       worker thread number, default is 3 (int [=3])

  # output splitting options
  -s, --split                        split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
  -S, --split_by_lines               split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
  -d, --split_prefix_digits          the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])

  # help
  -?, --help                         print this message

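Although the parameter list looks long, the commonly used options fall into the following groups:

Input/output file settings
Adapter handling
Global trimming (i.e., directly cutting low-quality bases off the start and end of reads)
Sliding-window quality trimming (similar to Trimmomatic)
Filtering of reads that are too short
Base correction (for paired-end sequencing)
Quality filtering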
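As a starting point for the groups listed above, here is a minimal sketch of the two basic invocation shapes from the usage line, one for SE and one for PE input. The helper `basic_cmd` and all file names are placeholders I made up for illustration; the function only prints the command so it can be inspected before running.

```shell
# Print (without running) a minimal fastp command for SE or PE input.
# basic_cmd and every file name below are hypothetical placeholders.
basic_cmd() {
    case "$1" in
        se) echo "fastp -i in.fq.gz -o out.fq.gz -j sample.json -h sample.html" ;;
        pe) echo "fastp -i in_R1.fq.gz -o out_R1.fq.gz -I in_R2.fq.gz -O out_R2.fq.gz -j sample.json -h sample.html" ;;
        *)  echo "usage: basic_cmd se|pe" >&2; return 1 ;;
    esac
}

basic_cmd se
basic_cmd pe
```

Naming the reports with -j and -h keeps each sample's JSON and HTML output separate from the default fastp.json and fastp.html.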

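1. Adapter handling

Adapter trimming is enabled by default and can be turned off with -A. fastp can find adapter sequences automatically and trim them, so you do not have to supply any adapter sequence at all. For SE data you can still pass your adapter with -a. For PE data this is generally unnecessary: fastp finds adapters more accurately through overlap analysis of the read pair, trims them more cleanly, and copes better with adapters that themselves contain mismatched bases. fastp also includes a summary of adapter removal in its report.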
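The three adapter modes described here can be sketched as command lines. This is only an illustration: `adapter_cmd` is a hypothetical helper, the file names are placeholders, and the adapter sequence is a commonly seen Illumina adapter prefix used as an example, so substitute the one for your own library.

```shell
# Print (without running) a fastp command for one adapter-handling mode.
# adapter_cmd, the file names and the adapter sequence are illustrative assumptions.
adapter_cmd() {
    base="fastp -i in.fq.gz -o out.fq.gz"
    case "$1" in
        auto)     echo "$base" ;;                    # default: adapter is auto-detected
        explicit) echo "$base -a AGATCGGAAGAGC" ;;   # SE data, adapter supplied with -a
        off)      echo "$base -A" ;;                 # adapter trimming disabled
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

adapter_cmd auto
adapter_cmd explicit
adapter_cmd off
```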

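2. Global trimming

fastp can trim every read by a fixed amount at the head and tail. This is useful for removing cycles with poor sequencing quality. For example, in 2x151 PE sequencing the last cycle is usually of very low quality and should be trimmed off. Use -f and -t to set the head and tail trimming for read1, and -F and -T to set them for read2. If -F or -T is not given, read2 follows the read1 setting.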
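To make the effect concrete, here is a toy sketch of what trimming 3 bases from the front and 2 from the tail does to a single read, which corresponds to passing -f 3 -t 2 for read1. The helper `global_trim` is hypothetical and is not fastp's own code.

```shell
# Toy illustration of global trimming: drop a fixed number of bases from the
# front and tail of one read. global_trim is a made-up helper, not part of fastp.
global_trim() {
    read_seq=$1; n_front=$2; n_tail=$3
    keep=$(( ${#read_seq} - n_front - n_tail ))
    if [ "$keep" -le 0 ]; then
        echo ""            # nothing is left after trimming
        return 0
    fi
    # cut -c takes 1-based, inclusive character positions
    printf '%s\n' "$read_seq" | cut -c "$((n_front + 1))-$((n_front + keep))"
}

global_trim ACGTACGTAC 3 2   # prints TACGT
```

For the 2x151 case mentioned above, the matching options would be -t 1 for read1 and -T 1 for read2.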

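3. Sliding-window quality trimming

In many cases the low-quality bases of a read are concentrated at its end, and less often at its start. Like Trimmomatic, fastp can compute the mean quality of the bases inside a sliding window and cut off windows that fall below the threshold. Both directions are disabled by default: use -5 to enable trimming at the 5' end (the start of the read) and -3 to enable it at the 3' end (the end of the read). Use -W to set the window size (default 4) and -M to set the required mean quality (default 20, i.e. Q20).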
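The idea of 3' sliding-window trimming can be sketched on a list of Phred scores. This is a simplified reading of the technique under my own assumptions (the window moves one base at a time from the tail), not fastp's actual algorithm, and `trim_tail_len` is a hypothetical helper.

```shell
# Toy 3' sliding-window trim on the Phred scores of one read.
# $1 = space-separated Phred scores, $2 = window size (-W), $3 = mean quality (-M)
# Simplified illustration only; fastp's real implementation may differ in detail.
trim_tail_len() {
    echo "$1" | awk -v w="$2" -v m="$3" '{
        end = NF
        while (end >= w) {
            sum = 0
            for (i = end - w + 1; i <= end; i++) sum += $i
            if (sum / w >= m) break     # window is good enough: stop cutting
            end--                       # otherwise drop the last base and retry
        }
        print end                       # number of bases kept
    }'
}

trim_tail_len "30 30 30 30 30 30 10 8 5 2" 4 20   # prints 7
```

In this toy version, a read in which every window fails is cut down to fewer bases than the window size, and a read shorter than the window is left untouched.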

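4. Filtering of reads that are too short

Filtering of short reads is enabled by default, with a minimum length of 15. Use -L (--disable_length_filtering) to turn this default off, or -l (--length_required) to set your own minimum read length.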
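What the length filter does to a FASTQ stream can be sketched with a few lines of awk. This is a toy stand-in for -l/--length_required that assumes well-formed four-line FASTQ records; `length_filter` is a hypothetical helper.

```shell
# Keep only FASTQ records whose sequence has at least min_len bases.
# Toy sketch of the -l idea; assumes plain four-line FASTQ records on stdin.
length_filter() {
    awk -v min_len="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min_len) print header "\n" seq "\n" plus "\n" $0 }
    '
}

# Two records: r1 has 8 bases, r2 has 3; only r1 survives a minimum length of 5.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n' | length_filter 5
```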

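5. Base correction (for paired-end sequencing)

For PE data, fastp can analyze each read pair and find its overlap region. For bases in the overlap that disagree between the two reads, if one base has very high quality and the other very low quality, the low-quality base is replaced with the high-quality one. This option is off by default and can be enabled with -c (--correction).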
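The decision at a single mismatched overlap position can be sketched as follows. The thresholds of 30 and 15 are my own illustrative assumptions for "very high" and "very low" quality, not documented fastp values, and the sketch assumes read2 has already been reverse-complemented into read1's orientation. `correct_base` is a hypothetical helper.

```shell
# Decide which base to keep at one mismatched position in the overlap.
# $1/$2 = base and Phred score from read1, $3/$4 = the same from read2.
# The thresholds below are assumptions chosen for illustration only.
correct_base() {
    hi=30; lo=15
    if [ "$2" -ge "$hi" ] && [ "$4" -le "$lo" ]; then
        echo "$1 $1"          # trust read1, overwrite read2
    elif [ "$4" -ge "$hi" ] && [ "$2" -le "$lo" ]; then
        echo "$3 $3"          # trust read2, overwrite read1
    else
        echo "$1 $3"          # no clear winner: leave both unchanged
    fi
}

correct_base A 38 G 7     # prints "A A"
correct_base A 22 G 20    # prints "A G"
```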

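6. Quality filtering

fastp can filter out low-quality reads and reads that contain many N bases. This filter is enabled by default and can be turned off with -Q. Use -q to set the phred quality value at which a base counts as qualified; for example, -q 15 means that bases with quality Q15 or higher are qualified. Then use -u to set the maximum percentage of unqualified bases allowed. For example, -q 15 -u 40 means that at most 40% of the bases in a read may have quality below Q15; otherwise the read is discarded. Use -n to limit how many N bases a read may contain.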
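The keep/discard rule described above can be written out for a single read. This is a toy sketch that takes the Phred scores as plain numbers rather than decoding a FASTQ quality string; `read_passes` is a hypothetical helper and not fastp's implementation.

```shell
# Toy version of the -q/-u/-n decision for one read.
# $1 = sequence, $2 = space-separated Phred scores (one per base),
# $3 = qualified quality (-q), $4 = unqualified percent limit (-u), $5 = N limit (-n)
read_passes() {
    # Count N bases; the arithmetic expansion strips any padding wc may add.
    n_count=$(( $(printf '%s' "$1" | tr -cd 'N' | wc -c) ))
    # too_low is 1 when more than $4 percent of the bases are below $3.
    too_low=$(echo "$2" | awk -v q="$3" -v u="$4" '{
        low = 0
        for (i = 1; i <= NF; i++) if ($i < q) low++
        r = (low * 100 > u * NF) ? 1 : 0
        print r
    }')
    if [ "$n_count" -gt "$5" ] || [ "$too_low" -eq 1 ]; then
        echo discard
    else
        echo keep
    fi
}

read_passes ACGTNNACGT "30 30 30 30 2 2 30 30 30 30" 15 40 5      # prints keep
read_passes ACGTACGTAC "10 10 10 10 10 30 30 30 30 30" 15 40 5    # prints discard
```

In the first call only 20% of the bases are below Q15 and there are 2 N bases, so the read is kept. In the second call 50% of the bases are below Q15, which exceeds the 40% limit.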

Example

Finally, here is a simple example:

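#!/bin/bash

# Process ten paired-end samples in parallel, one background job per sample.
# Per-sample --json/--html names keep the jobs from overwriting the default
# fastp.json and fastp.html in the working directory.
# Note: -Q turns off quality filtering; per section 6 that filter also covers
# the N-base check, so --n_base_limit may have no effect unless -Q is dropped.
for i in 74 75 76 82 83 84 85 86 87 88; do
    {
    fastp -i ~/RNAseq/cleandata/SRR17343${i}_1.fastq.gz -o SRR17343${i}_1.fastq.gz \
        -I ~/RNAseq/cleandata/SRR17343${i}_2.fastq.gz -O SRR17343${i}_2.fastq.gz \
        --json=SRR17343${i}.json --html=SRR17343${i}.html \
        -Q --thread=5 --length_required=50 --n_base_limit=6 --compression=6
    }&
done
# Wait for all background jobs to finish.
wait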

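The author of the software says it is very fast, but in my tests it did not seem that fast, possibly because the lab server was running other programs at the same time. The other point is the QC report: when there are QC results for many samples, it would be even better if fastp could produce a single combined report the way MultiQC does.
References:
fastp: an ultra-fast, all-in-one tool for automated FASTQ quality control, filtering, correction, and preprocessing
https://github.com/OpenGene/fastp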
