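This post is copied in full from the Jianshu article "Using fastp for data quality control".
It is kept here only for my own study. If it infringes any rights, please let me know and I will remove it. Thank you!
Using fastp for data quality control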
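fastp is a relatively new quality-control tool. I came to it because the tools currently available each do some things well but none covers everything. For example, I recently worked with an RNA-seq dataset of poor quality: the adapters had to be removed, many reads contained a lot of N bases, and the first few bases of each read were bad enough to need trimming. My original plan was to use Trimmomatic to remove the adapters and the first few bp, and then cutadapt to discard the N-rich reads, but that felt more complicated than it should be. Let's see what fastp can do.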
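Features of fastp:
- Automatic, comprehensive quality control of the data, with a user-friendly report;
- Filtering (low quality, too short, too many N, and so on);
- For the head or tail of each read, computes the mean quality within a sliding window and cuts off the low-quality subsequence (similar to Trimmomatic, but much faster);
- Global trimming (at the head/tail, without affecting deduplication); for Illumina data the last one or two cycles often need this;
- Adapter removal. You do not have to supply the adapter sequence, because fastp detects it automatically and trims it;
- For paired-end (PE) data, automatically finds the overlap region of each read pair and corrects mismatched bases in that region according to their quality values;
- Trimming of polyG tails. Illumina NextSeq/NovaSeq data use two-color chemistry, so polyG tails are common; this feature is enabled by default for data from these two platforms;
- Preprocessing of data with unique molecular identifiers (UMI), whether the UMI is located in the insert or in the index;
- Splitting of the output, in two modes: by the number of output files, or by the number of lines per file.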
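Most of these features need few parameters. Some are enabled by default and can be turned off with an option. fastp supports gzip input and output, handles both SE and PE data, and in addition to short reads from platforms such as Illumina it also supports PacBio/Nanopore long reads to some extent.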
fastp produces an HTML report that contains no static images at all: every chart is drawn dynamically with JavaScript, so the report is highly interactive. A sample report is available at: http://opengene.org/fastp/fastp.html
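The developers also had automated analysis pipelines in mind. Besides the human-readable HTML report, fastp writes a JSON result that is easy for programs to parse. The JSON report contains 100% of the information in the HTML report, and its layout is specially designed so that it is also easy to read in any text editor. A sample JSON result is available at: http://opengene.org/fastp/fastp.json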
First, let's look at the full list of fastp parameters:
usage: fastp -i <in1> -o <out1> [-I <in2> -O <out2>] [options...]
options:
# I/O options (input and output files)
-i, --in1 read1 input file name (string)
-o, --out1 read1 output file name (string [=])
-I, --in2 read2 input file name (string [=])
-O, --out2 read2 output file name (string [=])
-6, --phred64 indicates the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 2. (int [=2])
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
# adapter trimming options (settings for removing adapter sequences)
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=])
# global trimming options (number of bases to trim from the start and end of each read)
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])
# polyG tail trimming, useful for NextSeq/NovaSeq data
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
# polyX tail trimming
-x, --trim_poly_x enable polyX trimming in 3' ends.
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
# per read cutting by quality options (sliding-window trimming)
-5, --cut_by_quality5 enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere deduplication for both PE/SE data)
-3, --cut_by_quality3 enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data)
-W, --cut_window_size the size of the sliding window for sliding window trimming, default is 4 (int [=4])
-M, --cut_mean_quality the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20])
# quality filtering options (filter reads by base quality)
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
# length filtering options (filter reads by length)
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
# low complexity filtering
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
# filter reads with unwanted indexes (to remove possible contamination)
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
# base correction by overlap analysis options (correct bases using the overlap)
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
# UMI processing
-U, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read), default is none (string [=])
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
# overrepresented sequence analysis
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
-P, --overrepresentation_sampling One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
# reporting options
-j, --json the json format report file name (string [=fastp.json])
-h, --html the html format report file name (string [=fastp.html])
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
# threading options (number of threads)
-w, --thread worker thread number, default is 3 (int [=3])
# output splitting options
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
# help
-?, --help print this message
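The list of parameters looks long, but the commonly used ones fall into a few groups:
- Input and output files
- Adapter handling
- Global trimming (cutting a fixed number of low-quality bases from the start and end of every read)
- Sliding-window quality trimming (similar to Trimmomatic)
- Filtering of reads that are too short
- Base correction (for paired-end data)
- Quality filtering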
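1. Adapter handling
Adapter trimming is enabled by default in fastp and can be turned off with -A. fastp can find the adapter sequences automatically and trim them, so you do not have to supply any adapter sequence yourself. For SE data you can still pass your adapter with -a. For PE data this is not necessary: fastp finds adapters through overlap analysis of each read pair, which is more accurate, removes them more cleanly, and copes better with adapters that themselves contain mismatched bases. fastp also includes a summary of adapter removal in its report.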
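As a rough illustration of the two cases above, the sketch below builds (but does not run) one SE command with an explicit adapter and one PE command that relies on automatic detection. The file names are placeholders invented for this example, and the adapter shown is the common Illumina TruSeq read 1 adapter, used here only as a sample value; substitute the adapter of your own library.

```shell
#!/bin/bash
# Placeholder file names (assumptions, not taken from the original post).
se_in=sample.fastq.gz
r1_in=sample_1.fastq.gz
r2_in=sample_2.fastq.gz

# SE: pass the adapter explicitly with -a (auto-detection is used if -a is omitted).
# The sequence is the Illumina TruSeq read 1 adapter, given only as an example.
se_cmd=(fastp -i "$se_in" -o clean.fastq.gz -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA)

# PE: no adapter options; fastp relies on overlap analysis of each read pair.
pe_cmd=(fastp -i "$r1_in" -o clean_1.fastq.gz -I "$r2_in" -O clean_2.fastq.gz)

# Dry run: print the commands. Use "${se_cmd[@]}" or "${pe_cmd[@]}" to execute them.
echo "${se_cmd[*]}"
echo "${pe_cmd[*]}"
```

Keeping each command in an array makes it easy to inspect before running it and avoids quoting problems if a path contains spaces.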
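2. Global trimming
fastp can trim a fixed number of bases from the head and tail of every read. This is useful for removing cycles with poor sequencing quality. In 2×151 PE sequencing, for example, the last cycle is usually of very low quality and should be trimmed. Use -f and -t to set the head and tail trimming for read1, and -F and -T to set the head and tail trimming for read2.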
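A minimal sketch of such a command is shown below, assuming placeholder file names and arbitrarily chosen trim lengths (3 bases at the head, 1 at the tail). It only prints the command and works out how many bases of a 151 bp read would remain.

```shell
#!/bin/bash
# Placeholder file names and trim lengths (assumptions for this example).
trim_cmd=(
    fastp
    -i sample_1.fastq.gz -o trimmed_1.fastq.gz
    -I sample_2.fastq.gz -O trimmed_2.fastq.gz
    -f 3 -t 1    # read1: cut 3 bases from the head and 1 from the tail
    -F 3 -T 1    # read2: same values, written out explicitly
)

# A 151 bp read keeps 151 - 3 - 1 bases after global trimming.
read_len=151
kept_len=$(( read_len - 3 - 1 ))
echo "bases kept per read: $kept_len"

# Dry run: print the command; run it with "${trim_cmd[@]}" once the paths are real.
echo "${trim_cmd[*]}"
```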
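3. Sliding-window quality trimming
In most cases the low-quality bases of a read are concentrated at its end, and less often at its start. Like Trimmomatic, fastp can compute the mean quality of the bases in a sliding window and cut away the windows that fall below the requirement. Use -5 to enable trimming at the 5' end (the start of the read) and -3 to enable trimming at the 3' end (the end of the read). Use -W to set the window size (default 4) and -M to set the required mean quality (default 20, i.e. Q20).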
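To make the idea concrete, here is a simplified bash sketch of 3' sliding-window trimming. It is an illustration only and not fastp's actual implementation: trim_tail is a hypothetical helper, qualities are assumed to be Phred+33 encoded, and the window is moved one base at a time from the 3' end until a window reaches the required mean quality.

```shell
#!/bin/bash
# Simplified sketch of 3' sliding-window trimming (not fastp's actual code).
# Arguments: sequence, Phred+33 quality string, window size, required mean quality.
trim_tail() {
    local seq=$1 qual=$2 w=${3:-4} min_q=${4:-20}
    local len=${#qual} start i sum q
    # Slide the window one base at a time from the 3' end toward the 5' end.
    for (( start = len - w; start >= 0; start-- )); do
        sum=0
        for (( i = start; i < start + w; i++ )); do
            printf -v q '%d' "'${qual:i:1}"   # ASCII code of the quality character
            sum=$(( sum + q - 33 ))           # Phred+33 decoding
        done
        # Stop at the first window whose mean quality reaches the threshold.
        # Comparing sum with min_q * w avoids integer division.
        if (( sum >= min_q * w )); then
            echo "${seq:0:start+w}"           # keep everything up to the window's end
            return
        fi
    done
    echo ""   # no window passed (or the read is shorter than the window)
}

# 'I' encodes Q40 and '#' encodes Q2.
trim_tail ACGTACGTAC 'IIIIII####' 4 20   # prints ACGTACGT
```

Because the decision is based on the window mean, the kept read can still end in a couple of low-quality bases, as in the example. With fastp itself, the corresponding options would be -3 -W 4 -M 20.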
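4. Filtering of reads that are too short
Filtering of short reads is enabled by default, with a minimum length of 15. Use -L (--disable_length_filtering) to turn it off, or -l (--length_required) to set your own minimum length.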
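The length rule itself is simple enough to sketch outside fastp. The snippet below is a minimal illustration in awk, assuming an uncompressed FASTQ with four lines per record on standard input. filter_short is a hypothetical helper name, and this is not how fastp implements the filter.

```shell
#!/bin/bash
# Keep only FASTQ records whose sequence has at least min_len bases.
# Assumes plain four-line FASTQ records on standard input.
filter_short() {
    local min_len=${1:-15}
    awk -v min="$min_len" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            # $0 is the quality line; print the record only if it is long enough.
            if (length(seq) >= min) {
                print header; print seq; print plus; print $0
            }
        }
    '
}

# Two toy reads: r1 has 4 bases (dropped), r2 has 20 bases (kept).
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' \
    | filter_short 15
```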
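5. Base correction (for paired-end data)
For PE data, fastp can analyze each read pair and find its overlap region. For a mismatched position inside the overlap, if one base has very high quality and the other very low quality, the low-quality base is replaced by the high-quality one. This option is off by default and can be enabled with -c (--correction).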
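The decision made at a single mismatched position can be sketched as follows. This is only an illustration of the rule described above: correct_pair is a hypothetical helper, the read2 base is assumed to be already reverse-complemented into read1's orientation, and the HIGH_Q and LOW_Q thresholds are values chosen for this example, not values taken from fastp.

```shell
#!/bin/bash
# Illustrative thresholds (assumptions for this sketch, not fastp's values).
HIGH_Q=30
LOW_Q=15

# Arguments: base and Phred quality from read1, then base and quality from read2.
# Prints the base that each read carries after correction.
correct_pair() {
    local b1=$1 q1=$2 b2=$3 q2=$4
    if [[ $b1 == "$b2" ]]; then
        echo "$b1 $b2"        # already consistent, nothing to do
    elif (( q1 >= HIGH_Q && q2 <= LOW_Q )); then
        echo "$b1 $b1"        # trust read1, correct read2
    elif (( q2 >= HIGH_Q && q1 <= LOW_Q )); then
        echo "$b2 $b2"        # trust read2, correct read1
    else
        echo "$b1 $b2"        # no clear winner: leave both bases unchanged
    fi
}

correct_pair A 38 G 8     # prints "A A"
correct_pair A 25 G 22    # prints "A G"
```

Requiring both a high quality on one side and a low quality on the other keeps the correction conservative: when the two reads disagree with similar confidence, nothing is changed.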
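6. Quality filtering
fastp can filter out low-quality reads and reads with too many N bases. This feature is enabled by default and can be turned off with -Q. Use -q to set the phred quality at which a base counts as qualified; for example, -q 15 means that bases with quality of Q15 or higher are qualified. Then use -u to set the maximum percentage of unqualified bases allowed. For example, -q 15 -u 40 means that at most 40% of the bases in a read may have quality below Q15; otherwise the read is discarded. Use -n to limit the number of N bases allowed in a read.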
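The rule can be checked by hand on a single read. The bash sketch below applies the -q, -u and -n criteria as described above to one sequence and its Phred+33 quality string. passes_filter is a hypothetical helper written for this illustration, not part of fastp.

```shell
#!/bin/bash
# Sketch of the quality filter for one read (not fastp's actual code).
# Arguments: sequence, Phred+33 quality string, then the -q, -u and -n values.
passes_filter() {
    local seq=$1 qual=$2 min_q=${3:-15} max_pct=${4:-40} max_n=${5:-5}
    local len=${#seq} i q low=0 n_count=0
    for (( i = 0; i < len; i++ )); do
        printf -v q '%d' "'${qual:i:1}"      # ASCII code of the quality character
        if (( q - 33 < min_q )); then        # base is unqualified
            low=$(( low + 1 ))
        fi
        if [[ ${seq:i:1} == N ]]; then
            n_count=$(( n_count + 1 ))
        fi
    done
    # Discard when more than max_pct percent of the bases are unqualified.
    if (( low * 100 > max_pct * len )); then
        echo "fail: low quality"
    elif (( n_count > max_n )); then
        echo "fail: too many N"
    else
        echo "pass"
    fi
}

# 'I' encodes Q40 and '#' encodes Q2.
passes_filter ACGTACGTAC 'IIIII#####'   # 50% below Q15, prints "fail: low quality"
passes_filter ACGTACGTAC 'IIIIII####'   # 40% below Q15, prints "pass"
```

With the defaults, a read with exactly 40% unqualified bases is still kept; only reads above the limit are dropped.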
Example
Finally, here is a simple example:
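#!/bin/bash
# Run fastp on ten paired-end samples in parallel (10 jobs x 5 threads each).
for i in 74 75 76 82 83 84 85 86 87 88; do
    {
        # -Q disables quality filtering. --json/--html give each sample its own
        # report so parallel jobs do not overwrite the default fastp.json/fastp.html.
        fastp -i ~/RNAseq/cleandata/SRR17343${i}_1.fastq.gz -o SRR17343${i}_1.fastq.gz \
            -I ~/RNAseq/cleandata/SRR17343${i}_2.fastq.gz -O SRR17343${i}_2.fastq.gz \
            -Q --thread=5 --length_required=50 --n_base_limit=6 --compression=6 \
            --json=SRR17343${i}.json --html=SRR17343${i}.html
    } &
done
wait   # block until all background jobs have finished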
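The author of the tool says it is very fast, but in my tests it did not seem that fast, possibly because our lab server was running other jobs at the same time. As for the QC report, when there are results for many samples it would be even better if fastp could produce a single summary report, the way MultiQC does.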
References:
fastp: an ultra-fast, all-in-one tool for automated quality control, filtering, correction, and preprocessing of FASTQ files
https://github.com/OpenGene/fastp