Building a BSgenome package for wheat

Install the R package BSgenome

#R
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BSgenome")

Input file 1: the genome sequence in 2bit format

#shell
# Download faToTwoBit from UCSC
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
chmod 755 faToTwoBit

# Download the wheat genome sequence file
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/triticum_aestivum/dna/Triticum_aestivum.IWGSC.dna.toplevel.fa.gz

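# Convert the FASTA file to 2bit format (./ because faToTwoBit was downloaded into the current directory and is not on PATH)
./faToTwoBit Triticum_aestivum.IWGSC.dna.toplevel.fa.gz Triticum_aestivum.IWGSC.dna.toplevel.2bit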
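
Before writing the seed file, it can help to confirm how the sequences are named in the FASTA, since those names (for example 1A, 1B, 1D, Un) are the ones used later when accessing chromosomes. The following is a minimal sketch on a tiny made-up FASTA; `list_seq_names` and `toy.fa` are hypothetical names used only for illustration, and the header lines are invented rather than copied from the real file.

```shell
#!/bin/sh
# Print the sequence name (the first word after '>') of every FASTA record on stdin.
list_seq_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

# Toy FASTA for illustration only (not real wheat sequence).
cat > toy.fa <<'EOF'
>1A dna:chromosome chromosome:IWGSC:1A
ACGTACGT
>1B dna:chromosome chromosome:IWGSC:1B
GGCCTTAA
>Un dna:chromosome chromosome:IWGSC:Un
NNNNACGT
EOF

list_seq_names < toy.fa
```

For the real download, the same function could be fed from the compressed file, for example `gzip -dc Triticum_aestivum.IWGSC.dna.toplevel.fa.gz | list_seq_names`.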

Input file 2: the seed file

The BSgenome package ships seed files for a number of genomes; in this installation they are under ~/anaconda3/envs/python3.8/lib/R/library/BSgenome/extdata/GentlemanLab/. Copy the Arabidopsis seed file from there and edit it for wheat.

The wheat seed file

#shell
#cat Triticum_aestivum.IWGSC-seed
Package: BSgenome.Triticum.aestivum.IWGSC
Title: Full genome sequences for Triticum aestivum
Description: Full genome sequences for Triticum aestivum as provided by IWGSC and stored in Biostrings objects.
Version: 1.0.0
organism: Triticum aestivum
common_name: bread wheat
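genome: TAIR10.1  # this field must be a genome name from UCSC or NCBI, but the wheat genome was downloaded from EnsemblPlants, so the Arabidopsis name is used as a stand-in; this does not affect later use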
provider: EnsemblPlants
release_date: 2018/03/15  # taken from the Arabidopsis seed file, left unchanged
source_url: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/triticum_aestivum/dna/
organism_biocview: Triticum.aestivum
BSgenomeObjname: Triticum.aestivum.IWGSC
SrcDataFiles: Triticum_aestivum.IWGSC.dna.toplevel.fa.gz from https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/triticum_aestivum/dna/
PkgExamples: genome[["1A"]]  # the wheat genome from EnsemblPlants names its chromosomes 1A, 1B, 1D, ..., so the Arabidopsis example genome[["1"]] is changed to genome[["1A"]]
seqs_srcdir: ~/genome/ChineseSpring/BSgenome  # directory containing the 2bit file
seqfile_name: Triticum_aestivum.IWGSC.dna.toplevel.2bit  # name of the 2bit file
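
The `#` notes in the listing above are explanatory annotations. Assuming they are not meant to be part of the field values (a seed file is read as plain `Field: value` lines, so a trailing note would likely end up inside the value), a small sketch for removing them before forging could look like this. `strip_seed_notes` and the sample file names are hypothetical, and the approach assumes that no genuine field value contains a `#`.

```shell
#!/bin/sh
# Remove everything from the first '#' to the end of the line,
# then drop any trailing spaces or tabs that are left behind.
# Assumption: no real field value in the seed file contains a '#'.
strip_seed_notes() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//'
}

# Small annotated sample for illustration.
cat > annotated-seed.txt <<'EOF'
genome: TAIR10.1  # placeholder borrowed from Arabidopsis
release_date: 2018/03/15#unchanged
provider: EnsemblPlants
EOF

strip_seed_notes < annotated-seed.txt > clean-seed.txt
cat clean-seed.txt
```

Lines without a note, such as the `provider` line, pass through unchanged, so the filter can be run over the whole seed file.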

Build the wheat BSgenome in R

#R
library(BSgenome)
forgeBSgenomeDataPkg("./Triticum_aestivum.IWGSC-seed", seqs_srcdir = getwd(), destdir = getwd(), verbose = TRUE)
# seqs_srcdir sets the directory containing the 2bit file and destdir sets the output directory; an unset value means the current directory

When it finishes, a BSgenome.Triticum.aestivum.IWGSC directory is created in the current directory, with the following structure

#shell
BSgenome.Triticum.aestivum.IWGSC
├── DESCRIPTION
├── inst
│   └── extdata
│       └── single_sequences.2bit
├── man
│   └── package.Rd
├── NAMESPACE
└── R
    └── zzz.R

5 directories, 5 files

Package the files under BSgenome.Triticum.aestivum.IWGSC into an R package

#shell
R CMD build BSgenome.Triticum.aestivum.IWGSC
R CMD check BSgenome.Triticum.aestivum.IWGSC_1.0.0.tar.gz
R CMD INSTALL BSgenome.Triticum.aestivum.IWGSC_1.0.0.tar.gz
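
The tarball name used in the commands above follows the `<Package>_<Version>.tar.gz` pattern, with both values taken from the package's DESCRIPTION file. As a minimal sketch, the name can be derived rather than typed by hand; `dcf_field` and `sample-DESCRIPTION` are hypothetical names, and the sample file contains only the two fields needed here.

```shell
#!/bin/sh
# Print the value of a "Field: value" line from a DCF-style file.
# $1 = field name, $2 = file path
dcf_field() {
    awk -v key="$1" -F': *' '$1 == key { print $2; exit }' "$2"
}

# Two-line stand-in for the generated DESCRIPTION file.
cat > sample-DESCRIPTION <<'EOF'
Package: BSgenome.Triticum.aestivum.IWGSC
Version: 1.0.0
EOF

pkg=$(dcf_field Package sample-DESCRIPTION)
ver=$(dcf_field Version sample-DESCRIPTION)
tarball="${pkg}_${ver}.tar.gz"
echo "$tarball"
```

In a real run, `sample-DESCRIPTION` would be replaced by BSgenome.Triticum.aestivum.IWGSC/DESCRIPTION, and `$tarball` could then be passed to `R CMD check` and `R CMD INSTALL`.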

Once the package is built and installed, it can be loaded directly in R

#R
library("BSgenome.Triticum.aestivum.IWGSC")
# Show the chromosome names
seqnames(BSgenome.Triticum.aestivum.IWGSC)
# Get the chromosome lengths
seqlengths(BSgenome.Triticum.aestivum.IWGSC)
# Get the sequence of a given chromosome. R names cannot start with a digit, so $1A must be written as $"1A", whereas $Un can be written as is
A1 <- BSgenome.Triticum.aestivum.IWGSC$"1A"
Un <- BSgenome.Triticum.aestivum.IWGSC$Un

