pb-assembly=0.0.6 parameter settings

Input

[General]
input_fofn=input.fofn
input_type=raw
pa_DBdust_option=true
pa_fasta_filter_option=streamed-median

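input_type: either raw or preads. If preads is specified, the pipeline skips the entire 0-rawreads pre-assembly stage.

pa_fasta_filter_option: defaults to streamed-internal-median. It decides which subread to keep when a ZMW yields more than one. "pass": no filtering, every subread is kept; "streamed-median": keep the median-length subread; "streamed-internal-median": keep the longest subread when a ZMW has fewer than 3 subreads, and the median-length subread when it has 3 or more.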
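
The three filter modes described above can be sketched roughly as follows. This is only an illustration of the selection rule, not the code pb-assembly runs: the function name `select_subreads`, the `(zmw_id, sequence)` input shape, and the choice of the upper middle element as the "median" for even counts are all assumptions.

```python
from collections import defaultdict


def select_subreads(subreads, mode="streamed-internal-median"):
    """Pick subreads per ZMW (hypothetical helper, not part of pb-assembly).

    subreads: iterable of (zmw_id, sequence) pairs (assumed input shape).
    """
    modes = ("pass", "streamed-median", "streamed-internal-median")
    if mode not in modes:
        raise ValueError(f"unknown filter mode: {mode}")
    if mode == "pass":
        # No filtering: every subread is kept.
        return list(subreads)

    by_zmw = defaultdict(list)
    for zmw, seq in subreads:
        by_zmw[zmw].append(seq)

    chosen = []
    for zmw, seqs in by_zmw.items():
        ordered = sorted(seqs, key=len)
        if mode == "streamed-internal-median" and len(ordered) < 3:
            # Fewer than 3 subreads: keep the longest one.
            pick = ordered[-1]
        else:
            # Otherwise keep the median-length subread
            # (assumption: upper middle element for even counts).
            pick = ordered[len(ordered) // 2]
        chosen.append((zmw, pick))
    return chosen
```

With subreads of length 10, 30 and 20 from one ZMW, the sketch keeps the 20-base subread; with only two subreads under streamed-internal-median, it keeps the longer one.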

Data Partitioning

# large genomes
pa_DBsplit_option=-x500 -s200
ovlp_DBsplit_option=-x500 -s200

# small genomes (<10Mb)
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

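These settings are passed to DBsplit, which partitions the data into multiple blocks; all subsequent computation operates on those blocks. -s sets the DB block size in Mbp, and -x sets the minimum read length kept in the database.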
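
To get a feel for how -s affects the number of blocks, here is a back-of-the-envelope estimate. It assumes blocks are filled to the target size and ignores how DBsplit actually places block boundaries, so treat the result as approximate; `estimate_blocks` is a hypothetical helper.

```python
import math


def estimate_blocks(read_lengths, min_len=500, block_mbp=200):
    """Estimate the DB block count for given -x (min_len) and -s (block_mbp).

    Hypothetical helper: a rough estimate, not DBsplit's own partitioning.
    """
    # Reads shorter than the -x threshold are left out of the trimmed DB.
    total_bp = sum(n for n in read_lengths if n >= min_len)
    # -s is expressed in megabases.
    return math.ceil(total_bp / (block_mbp * 1_000_000))
```

For example, 500 Mbp of reads gives about 3 blocks with -s200 and about 10 blocks with -s50. More blocks means more, smaller overlap jobs.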

If you set pa_fasta_filter_option=pass above, add the -a option to pa_DBsplit_option here so that DBsplit keeps all reads from each well rather than only one.

Repeat Masking

pa_HPCTANmask_option=
pa_REPmask_code=0,300;0,300;0,300

Repeat masking occurs in two phases, Tandem and Interspersed. Tandem repeat masking is run with a modified version of daligner called datander and thus uses a similar parameter set. Whatever settings you use for pre-assembly daligner overlapping in the next section (pa_daligner_option) will be used here for tandem repeat masking. You can supply additional arguments for tandem repeat masking that will be passed to HPC.TANmask with the pa_HPCTANmask_option.

The second phase of masking deals with interspersed repeats and can be run in up to 3 iterations, specified with the pa_REPmask_code option. Each iteration takes a group size and a coverage, written as group,coverage pairs separated by semicolons as seen above.
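
The group,coverage format can be read as in the sketch below. `parse_repmask_code` is a hypothetical helper written for illustration; it only demonstrates the string layout and is not how FALCON parses the option internally.

```python
def parse_repmask_code(code):
    """Split a pa_REPmask_code string into (group, coverage) pairs.

    Hypothetical helper for illustration only.
    """
    rounds = []
    for chunk in code.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        group, coverage = (int(x) for x in chunk.split(","))
        rounds.append((group, coverage))
    if len(rounds) > 3:
        raise ValueError("at most 3 repeat-masking iterations are supported")
    return rounds
```

Applied to the setting above, this yields three identical rounds of (0, 300).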

For information and theory on how to set up your rounds of repeat masking, consult this blog post.

Pre-assembly

genome_size=1000000000
seed_coverage=30
length_cutoff=-1
pa_HPCdaligner_option=-v -B128 -M24
pa_daligner_option=-e0.8 -l2000 -k18 -h480 -w8 -s100
falcon_sense_option=--output-multi --min-idt 0.70 --min-cov 3 --max-n-read 400
falcon_sense_greedy=False

During pre-assembly, the PacBio subreads are aligned and error correction is performed. The longest subreads are chosen as seed reads and all shorter reads are aligned to them and consensus sequences are generated from the alignments. These consensus sequences are called pre-assembled reads or preads and generally have accuracy greater than 99% or QV20.

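To have the seed-read length cutoff calculated automatically, set genome_size and seed_coverage and set length_cutoff=-1. We generally recommend 20-40x seed coverage.
Alternatively, if you do not know the genome size, are unsure what seed_coverage to use, or simply want to use all reads above a specific length, you can set that limit manually with length_cutoff.

Note that whatever value length_cutoff is set to also acts as a limit for falcon-unzip: reads shorter than the cutoff are not used for phasing. For assembly alone, a high length_cutoff is unlikely to do harm unless you expect a specific feature such as microchromosomes or short circular plasmids. If you plan to unzip, however, a high cutoff artificially limits your phasing dataset, so a lower length_cutoff may serve you better. Most of the computation happens during pre-assembly, so if compute time matters, raising length_cutoff improves efficiency, subject to the trade-offs above.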
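
The idea behind the automatic cutoff can be sketched as follows: take reads from longest to shortest until their total length reaches genome_size * seed_coverage, and use the length of the last read taken as the cutoff. This is an illustrative sketch only; `calc_length_cutoff` is a hypothetical function, and the pipeline's own calculation may differ in detail.

```python
def calc_length_cutoff(read_lengths, genome_size, seed_coverage):
    """Return a seed-read length cutoff for the requested seed coverage.

    Hypothetical helper: an illustration of the idea, not FALCON's code.
    """
    target_bases = genome_size * seed_coverage
    total = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    raise ValueError("not enough data to reach the requested seed coverage")
```

With a 10 kb toy genome, ten 10 kb reads and twenty 5 kb reads, asking for 15x seed coverage gives a cutoff of 5000, while asking for 5x gives 10000.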

Overlap options for daligner are set with the pa_HPCdaligner_option and pa_daligner_option flags. Previous versions of FALCON had a single parameter. This is now split into two flags: one that affects requested resources (pa_HPCdaligner_option) and one that affects the overlap search (pa_daligner_option). For pa_HPCdaligner_option, the -v parameter is passed to the LAsort and LAmerge programs, while the -B and -M parameters are passed to the daligner sub-commands.

To understand the theory and how to configure daligner see this blog post and this command reference guide.

For daligner, in general we recommend the following:

-e: average correlation rate (average sequence identity)

0.70 (low quality data) - 0.80 (high quality data). A higher value will help prevent haplotype collapse.

-l: minimum length of overlap

1000 (shorter library) - 5000 (longer library)

-k: kmer size

14 (low quality data) - 18 (high quality data)

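Lower values of -k give higher sensitivity at the cost of more disk space, more memory and longer run times, and work best with lower quality data. Conversely, larger k-mer values give higher specificity, use fewer system resources and run faster, but are only appropriate for high quality data.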

You can configure basic pre-assembly consensus calling options with the falcon_sense_option flag:

--output-multi: necessary for generating proper fasta headers
--min-idt: minimum alignment identity
--min-cov: minimum coverage necessary
--max-n-read: max number of reads for calling consensus to make the preads
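
To make the shape of the falcon_sense_option string concrete, the sketch below parses it with Python's argparse. The parser is a stand-in built only from the four flags listed above; it is not the consensus tool's real argument parser, and `parse_falcon_sense_option` is a hypothetical name.

```python
import argparse
import shlex


def parse_falcon_sense_option(option_string):
    """Parse a falcon_sense_option value into a dict (illustrative only)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--output-multi", action="store_true")
    parser.add_argument("--min-idt", type=float)   # minimum alignment identity
    parser.add_argument("--min-cov", type=int)     # minimum coverage
    parser.add_argument("--max-n-read", type=int)  # max reads per consensus
    # shlex.split handles the whitespace-separated option string.
    return vars(parser.parse_args(shlex.split(option_string)))
```

Parsing the value from the config above gives output_multi=True, min_idt=0.7, min_cov=3 and max_n_read=400.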

By default, -fo are the parameters passed to LA4Falcon. The option falcon_sense_greedy changes this parameter set to -fog which essentially attempts to maintain relative information between reads that have been broken due to regions of low quality.

Pread overlapping

ovlp_HPCdaligner_option=-v -M24 -l500
ovlp_daligner_option=-e.96 -s1000 -h60

The second phase of error-corrected read overlapping occurs in a similar fashion to the overlapping performed in the pre-assembly, however no repeat masking is performed and no consensus is called. Overlaps are identified and fed into the final assembly. The parameter options work the same way as described above in the pre-assembly section.

Recommendation for preads:

-e: average correlation rate (average sequence identity)

0.93 (inbred) - 0.96 (outbred)

-l: minimum length of overlap

1800 (poor preassembly, short/low quality library) - 6000 (long, high quality library)

-k: kmer size

18 (low quality) - 24 (most cases)

Final Assembly

# experiment with "--min-idt" to collapse (98-99) or split haplotypes (up to 99.9) during contig assembly
# if you plan to unzip, collapse first using ~98, lower for very divergent haplotypes
# ignore indels looks at only substitutions in overlaps, allows higher overlap stringency to reduce repeat-induced errors
overlap_filtering_setting = --max-diff 400 --max-cov 400 --min-cov 2 --n-core 24 --min-idt 99.9 --ignore-indels

# alternative setting; keep only one overlap_filtering_setting line in the config
overlap_filtering_setting=--max-diff 100 --max-cov 100 --min-cov 2
fc_ovlp_to_graph_option=
length_cutoff_pr=1000

The option overlap_filtering_setting sets the criteria for filtering pread overlaps. --max-diff filters overlaps whose coverage difference between the 5' and 3' ends is larger than specified. --max-cov filters highly represented overlaps, typically caused by contaminants or repeats, and --min-cov specifies a minimum overlap coverage.

Setting --min-cov too low allows more overlaps to be detected, at the cost of possible additional chimeric assemblies or misassemblies.
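
The three coverage criteria can be pictured as a simple per-read test, sketched below. It assumes you already have overlap counts at the 5' and 3' ends of a pread; `passes_overlap_filter` is a hypothetical function, and the real filter may apply the thresholds differently.

```python
def passes_overlap_filter(cov_5p, cov_3p, max_diff=100, max_cov=100, min_cov=2):
    """Decide whether a pread's overlaps survive filtering (illustrative only).

    cov_5p, cov_3p: overlap counts at the 5' and 3' ends (assumed inputs).
    """
    if abs(cov_5p - cov_3p) > max_diff:
        return False  # coverage too unbalanced between the two ends
    if cov_5p > max_cov or cov_3p > max_cov:
        return False  # over-represented, likely repeat or contaminant
    if cov_5p < min_cov or cov_3p < min_cov:
        return False  # too little support at one end
    return True
```

A pread with 10 overlaps at one end and 150 at the other fails with --max-diff 100 --max-cov 100 but passes with --max-diff 400 --max-cov 400.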

length_cutoff_pr is the minimum length of pre-assembled preads used for the final assembly. Typically, this value is set to allow for approximately 15 to 30-fold coverage of corrected reads in the final assembly.
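
To check whether a candidate length_cutoff_pr lands in the 15 to 30-fold range, you can compute the coverage of the preads that survive the cutoff. This is a minimal sketch; `pread_coverage` is a hypothetical helper and assumes you have the pread lengths and an estimate of genome size.

```python
def pread_coverage(pread_lengths, length_cutoff_pr, genome_size):
    """Fold coverage of preads at least length_cutoff_pr long (illustrative)."""
    used_bases = sum(n for n in pread_lengths if n >= length_cutoff_pr)
    return used_bases / genome_size
```

If raising the cutoff drops the coverage below about 15-fold, the cutoff is probably too high for the data at hand.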

Miscellaneous configuration options

Additional configuration options that don't necessarily fit into one of the previous categories are described here.

target=assembly
skip_checks=False
LA4Falcon_preload=false

FALCON can be configured to stop after any of its three stages by setting the target flag to pre-assembly, overlapping or assembly. Each option stops the pipeline at the end of its corresponding stage: 0-rawreads, 1-preads_ovl or 2-asm-falcon respectively. The default is the full assembly pipeline.

The flag skip_checks disables .las file checks with LAcheck which has been known to cause errors on certain systems in the past.

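The option LA4Falcon_preload passes the -P parameter to LA4Falcon, which loads all reads into memory. On slower filesystems this can speed things up considerably, but it greatly increases the memory requirement of the consensus stage.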
