An assembly script for a highly heterozygous fungal genome (revised-code version)

Introduction and background

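What is de novo genome sequencing? It means sequencing a species at high throughput and assembling the reads with high-performance computing and bioinformatics methods, without relying on a reference genome, to produce a whole-genome sequence map of that species. Genomes are commonly divided by their characteristics into two classes: ordinary (simple) genomes and complex genomes. A simple genome is a haploid, a homozygous diploid, or a diploid with heterozygosity < 0.5%, repeat content < 50%, and GC content between 35% and 65%. A complex genome is a diploid or polyploid with heterozygosity > 0.5%, repeat content > 50%, or GC content outside the normal range (< 35% or >= 65%). Novogene (諾禾致源) further subdivides diploid complex genomes into slightly heterozygous genomes (0.5% < heterozygosity <= 0.8%), highly heterozygous genomes (heterozygosity > 0.8%), and highly repetitive genomes (repeat fraction > 50%). Assembling complex genomes has long been a headache for researchers, who have kept working on the problem. Third-generation sequencing platforms have iterated, subread lengths have kept growing, and long-range technologies such as optical mapping, Hi-C, and 10x Genomics have appeared, so the assembly difficulties caused by repeats have gradually eased. High heterozygosity, however, still troubles researchers.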
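The thresholds quoted above can be written down as a small lookup. This is only a sketch of those published cut-offs, not a tool used in the pipeline; the function name `classify_genome` is made up for illustration, and the GC value of 50 in the usage line is a placeholder because the post does not report the GC content of this fungus.

```shell
#!/usr/bin/env bash
# classify_genome HET_PCT REPEAT_PCT GC_PCT
# Prints "simple" or a comma-separated list of complexity labels,
# using the thresholds quoted in the paragraph above.
classify_genome() {
    awk -v het="$1" -v rep="$2" -v gc="$3" 'BEGIN {
        n = 0
        if (het > 0.8)            labels[n++] = "high-heterozygosity"
        else if (het > 0.5)       labels[n++] = "slight-heterozygosity"
        if (rep > 50)             labels[n++] = "high-repeat"
        if (gc < 35 || gc >= 65)  labels[n++] = "abnormal-GC"
        if (n == 0) { print "simple"; exit }
        out = labels[0]
        for (i = 1; i < n; i++) out = out "," labels[i]
        print out
    }'
}

# Heterozygosity 2.4% and repeats 20% come from the post; GC 50% is a placeholder.
classify_genome 2.4 20 50
```

With the values reported later in this post (2.4% heterozygosity, about 20% repeats), the only label that fires is the heterozygosity one, which is why the rest of the pipeline concentrates on removing haplotigs rather than on resolving repeats.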

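Many approaches have been proposed to deal with heterozygous assembly. The overall strategies are: 1. design the experiment so that a single haplotype is obtained (for example an interspecific hybrid; case: Sequencing a Juglans regia × J. microcarpa hybrid yields high-quality genome assemblies of parental species); 2. design software suited to heterozygous genomes and optimize the assembly algorithm; 3. assemble first, then use haplotig-removal software to remove the redundant heterozygous sequences.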

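The first strategy is clear-cut: obtain a single haplotype and sequence it, and the heterozygosity problem disappears. In practice, though, obtaining a haploid can be extremely hard, so the approach is quite limited.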

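A lot of software has been developed along the second line: NOVOheter, Platanus, MSR-CA, Platanus-allee (an upgraded Platanus that supports Hi-C, 10x Genomics, and similar data), and others. Some assemblers also offer modules or modes for highly heterozygous genomes, such as Canu, SPAdes, and Falcon. In my hands-on tests, however, none of them performed very well.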

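The core of the third strategy is cutting based on sequence similarity. The tools I have tried are Redundans, HaploMerger2, and Purge Haplotigs (purge_haplotigs). Besides similarity, Purge Haplotigs has another key feature: it uses the read depth of the aligned reads to decide which contig stays and which goes. In addition, HaploMerger2 and Purge Haplotigs support masking of repeat regions.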

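This year I took on a fungus with a genome size of 86 Mb, heterozygosity of about 2.4%, and repeat content of roughly 20%. The sequencing strategy was PacBio Sequel I + PE 150. I originally planned to add Hi-C, but the sequencing companies I consulted all said they had little successful experience with fungi and that the return might not justify the cost, so I dropped the idea.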

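With the data in hand, I first assembled with canu (haplotype assembly; canu polyploid mode, run with "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"), MECAT2, MaSuRCA, Falcon, Flye, wtdbg2, and other assemblers. I then removed haplotigs with Purge Haplotigs, Redundans, and HaploMerger2, upgraded the assembly with FinisherSC, and finally polished it with NextPolish. In my tests canu, MECAT2 (after polishing with NextPolish), and MaSuRCA performed relatively well. Among the haplotig-removal tools, Purge Haplotigs was more suitable than Redundans and HaploMerger2: the BUSCO assessment was essentially unchanged before and after purging. Because this work is part of a manuscript, I can only share the best-performing script for now; I will go into more depth once the paper is out.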

Assembly example


# Sequencing data: BAM to FASTQ or FASTA

samtools fastq -0 012m.subreads.fq -@ 32 subreads.bam
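Before assembling it can help to know how deep the long reads actually are, since the MaSuRCA settings further down refer to long-read coverage (15x, LHE_COVERAGE). The following is a minimal sketch, assuming a plain 4-line-per-record FASTQ; `fastq_coverage` is a hypothetical helper name and the toy reads and 10 bp "genome" are made up purely to show the arithmetic.

```shell
#!/usr/bin/env bash
# fastq_coverage FASTQ GENOME_SIZE_BP
# Sums the sequence lines (every 2nd line of each 4-line record)
# and divides by the genome size to get fold coverage.
fastq_coverage() {
    awk -v g="$2" 'NR % 4 == 2 { bases += length($0) }
        END { printf "%.0f bases, %.1fx coverage\n", bases, bases / g }' "$1"
}

# Toy input: two reads of 8 and 12 bases against a made-up 10 bp genome.
toy=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' > "$toy"
fastq_coverage "$toy" 10
rm -f "$toy"
```

On the real data the call would look like `fastq_coverage 012m.subreads.fq 86000000`, using the 86 Mb genome size mentioned above.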

#Assessment of genome size and heterozygosity(raw PE reads)

mkdir genomescope

cd genomescope

ln -s ../012m_L1_?.fq ./

jellyfish count -C -m 21 -s 1000000000 -t 32 *.fq -o reads.jf

jellyfish histo -t 32 reads.jf > reads.histo

Rscript /opt/biosoft/genomescope/genomescope.R reads.histo 21 150 output

cd ../
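As a rough cross-check of the GenomeScope result, the genome size can also be approximated directly from the jellyfish histogram as (total k-mers) / (depth of the main peak). This is a simplified sketch, not what GenomeScope does internally: `estimate_genome_size` is a hypothetical helper, and the minimum depth of 5 used to skip error k-mers is an assumption. In a highly heterozygous genome the tallest peak may be the heterozygous (half-depth) peak, in which case this naive estimate comes out roughly twice too large, so treat it only as a sanity check.

```shell
#!/usr/bin/env bash
# estimate_genome_size HISTO [MIN_DEPTH]
# HISTO has two columns: k-mer depth and number of distinct k-mers at that depth.
# Depths below MIN_DEPTH (default 5, an assumed error cut-off) are ignored.
estimate_genome_size() {
    awk -v min="${2:-5}" '$1 >= min {
            total += $1 * $2                      # total k-mer occurrences
            if ($2 > best) { best = $2; peak = $1 }  # most frequent depth
        }
        END {
            if (peak > 0)
                printf "peak_depth=%d genome_size=%.0f\n", peak, total / peak
        }' "$1"
}

# Toy histogram (made-up numbers) with an error tail at depths 1-2.
toy=$(mktemp)
printf '1 1000\n2 200\n5 10\n10 40\n20 100\n21 90\n40 5\n' > "$toy"
estimate_genome_size "$toy"
rm -f "$toy"
```

On the real data this would be run as `estimate_genome_size reads.histo` inside the genomescope directory.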

# Quality control of the short-read (Illumina) data

mkdir Trimmomatic

cd Trimmomatic

java -jar /opt/biosoft/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 32 -phred33 ../012m_L1_1.fq ../012m_L1_2.fq 012m_Trimmomatic.1.fq 012m_Trimmomatic.unpaired.1.fq 012m_Trimmomatic.2.fq 012m_Trimmomatic.unpaired.2.fq ILLUMINACLIP:/opt/biosoft/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:75 TOPHRED33
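After trimming, the two paired output files should still contain the same number of reads, because downstream tools expect them in sync. A minimal check, assuming 4-line FASTQ records; `count_reads` and `check_pairs` are hypothetical helper names and the toy files are made up.

```shell
#!/usr/bin/env bash
# count_reads FASTQ: number of 4-line records in the file.
count_reads() { awk 'END { print NR / 4 }' "$1"; }

# check_pairs R1 R2: succeed only if both files hold the same number of reads.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 pairs"
    else
        echo "MISMATCH: $n1 vs $n2" >&2
        return 1
    fi
}

# Toy paired files with one read each.
r1=$(mktemp); r2=$(mktemp)
printf '@p1/1\nACGT\n+\nIIII\n' > "$r1"
printf '@p1/2\nTGCA\n+\nIIII\n' > "$r2"
check_pairs "$r1" "$r2"
rm -f "$r1" "$r2"
```

For the files produced above the call would be `check_pairs 012m_Trimmomatic.1.fq 012m_Trimmomatic.2.fq`.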

cd ../

# correct Illumina reads | Assessment of genome size and heterozygosity

mkdir Finderrors

cd Finderrors

source ~/.bash.pacbio

# ErrorCorrectReads.pl is a Perl program shipped with ALLPATHS-LG.

# The ../fastuniq/ inputs are the output of a deduplication step that is not shown in this post.

ErrorCorrectReads.pl PHRED_ENCODING=33 READS_OUT=012m FILL_FRAGMENTS=0 KEEP_KMER_SPECTRA=1 PAIRED_READS_A_IN=../fastuniq/012m.fastuniq.1.fastq PAIRED_READS_B_IN=../fastuniq/012m.fastuniq.2.fastq PLOIDY=2 PAIRED_SEP=251 PAIRED_STDEV=48

# ErrorCorrectReads.pl can also estimate genome characteristics, but for my strain, judged against published closely related strains, the GenomeScope result was more accurate.

# kmer plot

cd 012m.fastq.kspec

# Remove the hard-coded @fns file list from KmerSpectrumPlot.pl (this edits the installed script in place, so it only needs to be run once).

perl -p -i -e 's/\@fns = \(\"frag_reads_filt.25mer.kspec\", \"frag_reads_edit.24mer.kspec\", \"frag_reads_corr.25mer.kspec\"\);\n//' /opt/biosoft/ALLPATHS-LG/bin/KmerSpectrumPlot.pl

KmerSpectrumPlot.pl SPECTRA=1 FREQ_MAX=255

convert kmer_spectrum.cumulative_frac.log.lin.eps kmer_spectrum.cumulative_frac.log.lin.png

convert kmer_spectrum.distinct.lin.lin.eps kmer_spectrum.distinct.lin.lin.png

convert kmer_spectrum.distinct.log.log.eps kmer_spectrum.distinct.log.log.png

cd ../../

# Using LoRDEC to modify PacBio Reads

mkdir LoRDEC

cd LoRDEC

lordec-correct -2 ../Finderrors/012m.paired.A.fastq ../Finderrors/012m.paired.B.fastq -i ../012m.subreads.fq -k 19 -o pacbio.LoRDEC.corrected.fasta -s 3 -T 32 &> lordec-correct.log

seqkit seq -u pacbio.LoRDEC.corrected.fasta > pacbio.corrected.fasta
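LoRDEC writes corrected regions in upper case and leaves uncorrected regions in lower case, which is why the output is upper-cased with seqkit before assembly. Before that conversion, the lower-case fraction gives a quick feel for how much of the long-read data the short reads could not correct. A minimal sketch; `lowercase_fraction` is a hypothetical helper and the toy FASTA is made up.

```shell
#!/usr/bin/env bash
# lowercase_fraction FASTA
# Fraction of sequence characters that are lower case (a, c, g, t, n).
lowercase_fraction() {
    awk '!/^>/ {
            total += length($0)                # count bases before editing the line
            lower += gsub(/[acgtn]/, "")       # gsub returns the number of matches
        }
        END { if (total > 0) printf "%.3f\n", lower / total }' "$1"
}

# Toy input: 16 bases, 4 of them lower case.
toy=$(mktemp)
printf '>r1\nACGTacgt\n>r2\nACGTACGT\n' > "$toy"
lowercase_fraction "$toy"
rm -f "$toy"
```

On the real data this would be run on pacbio.LoRDEC.corrected.fasta, not on the upper-cased pacbio.corrected.fasta.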

cd ../

###Genome assembly###

mkdir MaSuRCA

cd MaSuRCA

echo '# example configuration file

DATA

#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>

#if single-end, do not specify <reverse_reads>

#MUST HAVE Illumina paired end reads to use MaSuRCA

PE= pe 251 48 /home/bioinfo/data/012m/012m_L1_1.fq  /home/bioinfo/data/012m/012m_L1_2.fq

#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>

#JUMP= sh 3600 200 /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq

#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped

#if you have both types of reads supply them both as NANOPORE type

PACBIO=/home/bioinfo/data/012m/LoRDEC/pacbio.corrected.fasta

#NANOPORE=/FULL_PATH/nanopore.fa

#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many

#OTHER=/FULL_PATH/file.frg

#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data

#REFERENCE=/FULL_PATH/nanopore.fa

END

PARAMETERS

#PLEASE READ all comments to essential parameters below, and set the parameters according to your project

#set this to 1 if your Illumina jumping library reads are shorter than 100bp

EXTEND_JUMP_READS=0

#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content

GRAPH_KMER_SIZE = auto

#set this to 1 for all Illumina-only assemblies

#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)

USE_LINKING_MATES = 0

#specifies whether to run the assembly on the grid

USE_GRID=0

#specifies grid engine to use SGE or SLURM

GRID_ENGINE=SGE

#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY

GRID_QUEUE=all.q

#batch size in the amount of long read sequence for each batch on the grid

GRID_BATCH_SIZE=500000000

#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads

#can increase this to 30 or 35 if your reads are short (N50<7000bp)

LHE_COVERAGE=25

#set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1

MEGA_READS_ONE_PASS=0

#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms

#LIMIT_JUMP_COVERAGE = 300

#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.

#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.

CA_PARAMETERS =  cgwErrorRate=0.15

#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina  or long read data

CLOSE_GAPS=1

#auto-detected number of cpus to use, set this to the number of CPUs/threads per node you will be using

NUM_THREADS = 32

#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20

JF_SIZE = 1740000000

#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data

SOAP_ASSEMBLY=0

#Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY.  Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.  A lot faster than CABOG, at the expense of some contiguity. Works well even when MEGA_READS_ONE_PASS is set to 1.  DO NOT use if you have less than 15x coverage by long reads.

FLYE_ASSEMBLY=0

END ' > config.txt

/opt/biosoft/MaSuRCA-3.3.4/bin/masurca config.txt

./assemble.sh
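The JF_SIZE value in the config follows the rule of thumb stated in the MaSuRCA template comment (estimated genome size × 20). A tiny sketch of that arithmetic; `jf_size` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# jf_size GENOME_SIZE_BP: jellyfish hash size as genome size * 20.
jf_size() { echo $(( $1 * 20 )); }

# 86 Mb is the genome size quoted earlier in this post.
jf_size 86000000
```

For 86 Mb this gives 1720000000; the 1740000000 used in the config corresponds to 87 Mb. Since the template only calls the value a safe estimate, either should work.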

mkdir purge_haplogs

cd purge_haplogs

minimap2 -t 32 -ax map-pb ../final.genome.scf.fasta /home/bioinfo/data/012m/canu/012m.correctedReads.fasta.gz | samtools view -hF 256 - | samtools sort -@ 32 -m 2G -o aligned.bam

purge_haplotigs  hist  -b aligned.bam  -g ../final.genome.scf.fasta -t 20

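# -l / -m / -h are the low, midpoint, and high read-depth cut-offs read off the histogram produced by the previous step.

# As far as I know, recent Purge Haplotigs releases call this subcommand `cov` (older ones used `contigcov`); use whichever name your installed version provides.

purge_haplotigs contigcov -i aligned.bam.gencov -o coverage_stats.csv -l 18 -m 76 -h 134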

purge_haplotigs purge -g ../final.genome.scf.fasta -c coverage_stats.csv -b aligned.bam -t 4 -a 50
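To make the role of the three depth cut-offs more concrete, here is a deliberately simplified sketch of the idea. It is not Purge Haplotigs' actual algorithm: the real tool looks at the proportion of each contig's bases falling in each depth band and then confirms suspects by alignment, whereas this toy uses a single mean depth per contig. `flag_contig` is a hypothetical name; the cut-offs 18/76/134 are the ones used in the command above.

```shell
#!/usr/bin/env bash
# flag_contig MEAN_DEPTH LOW MID HIGH
#   depth < LOW or > HIGH  -> "junk"    (too little or too much coverage)
#   LOW <= depth <= MID    -> "suspect" (haploid-level depth, possible haplotig)
#   MID < depth <= HIGH    -> "keep"    (diploid-level depth, collapsed region)
flag_contig() {
    awk -v d="$1" -v l="$2" -v m="$3" -v h="$4" 'BEGIN {
        if (d < l || d > h) print "junk"
        else if (d <= m)    print "suspect"
        else                print "keep"
    }'
}

# A made-up contig with mean depth 40, using the cut-offs from this post.
flag_contig 40 18 76 134
```

A contig sitting around the haploid peak is only a candidate; whether it is actually removed depends on it aligning to another, better-covered contig.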

mkdir finisherSC

cd finisherSC

ln -s ../curated.fasta ./contigs.fasta

ln -s ~/data/012m/canu/012m.correctedReads.fasta ./raw_reads.fasta

# Do not enable multithreading for this step, otherwise it may fail; MUMmer4 is much faster than MUMmer3

python /opt/biosoft/finishingTool/finisherSC.py ./ /opt/biosoft/mummer4/bin/

mkdir NextPolish

cd NextPolish

ls ~/data/012m/Finderrors/012m.paired.?.fastq > sgs.fofn

echo '/home/bioinfo/data/012m/canu/012m.correctedReads.fasta' > lgs.fofn

echo '[General]

job_type = local

job_prefix = nextPolish

task = default

rewrite = yes

rerun = 10

parallel_jobs = 5

multithread_jobs = 6

genome = ../improved3.fasta

genome_size = auto

workdir = ./01_rundir

polish_options = -p {multithread_jobs}

[sgs_option]

sgs_fofn = ./sgs.fofn

sgs_options = -max_depth 100 -bwa

[lgs_option]

lgs_fofn = ./lgs.fofn

lgs_options = -min_read_len 10k -max_read_len 150k -max_depth 60

lgs_minimap2_options = -x map-pb

[polish_options]

-ploidy 2 ' > run.cfg

/opt/biosoft/NextPolish/nextPolish run.cfg

cat ./01_rundir/01.kmer_count/*polish.ref.sh.work/polish_genome*/genome.nextpolish.part*.fasta > genome.nextpolish.fasta
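To compare assemblies across the steps (raw assembly, after purging, after FinisherSC, after polishing), basic contiguity numbers are useful alongside BUSCO. A minimal sketch that prints contig count, total length, and N50 for a FASTA file; `assembly_stats` is a hypothetical helper and the toy FASTA is made up.

```shell
#!/usr/bin/env bash
# assembly_stats FASTA
# Step 1: print one length per sequence (multi-line records are summed).
# Step 2: sort lengths in descending order.
# Step 3: walk down the list until half of the total length is reached (N50).
assembly_stats() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) {
                     printf "contigs=%d total=%.0f N50=%d\n", NR, total, lens[i]
                     exit
                 }
             }
         }'
}

# Toy assembly with contigs of 10, 6, and 4 bases.
toy=$(mktemp)
printf '>a\nACGTACGTAC\n>b\nACG\nTAC\n>c\nACGT\n' > "$toy"
assembly_stats "$toy"
rm -f "$toy"
```

On the real output this would be `assembly_stats genome.nextpolish.fasta`, and the same call on curated.fasta or improved3.fasta shows how each step changed contiguity.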

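References

Common questions on de novo sequencing of plant and animal genomes (動植物基因組de novo常見問題)

Research progress in sequencing technologies for complex genomes

NextPolish

Sequencing a Juglans regia × J. microcarpa hybrid yields high-quality genome assemblies of parental species

LoRDEC: correcting PacBio data with second-generation (short-read) data

genomescope

MaSuRCA

purge_haplotigs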
