Install the R package BSgenome
#R
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome")
Input file 1: the genome sequence in 2bit format
#shell
#Download faToTwoBit from UCSC
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
chmod 755 faToTwoBit
#Download the wheat genome sequence
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/triticum_aestivum/dna/Triticum_aestivum.IWGSC.dna.toplevel.fa.gz
#Convert FASTA to 2bit (the binary was downloaded into the current directory, so it is called as ./faToTwoBit)
./faToTwoBit Triticum_aestivum.IWGSC.dna.toplevel.fa.gz Triticum_aestivum.IWGSC.dna.toplevel.2bit
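Before or after the conversion, it can help to confirm how the sequences are named in the FASTA, because those names are what the finished package will expose. The following is a minimal sketch, assuming gzip and awk are available; `fasta_names` is a hypothetical helper and `toy.fa.gz` is a made-up file used only for illustration.

```shell
# Hypothetical helper: print the sequence names (first word of each header
# line) from a gzip-compressed FASTA file.
fasta_names() {
    gzip -dc "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

# Toy input, made up for illustration only.
printf '>1A dna:chromosome\nACGT\n>1B dna:chromosome\nGGCC\n>Un dna:chromosome\nTTAA\n' \
    | gzip -c > toy.fa.gz

# Prints one name per line: 1A, 1B, Un
fasta_names toy.fa.gz
```

On the real data the call would be `fasta_names Triticum_aestivum.IWGSC.dna.toplevel.fa.gz`.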
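Input file 2: the seed file
The BSgenome package ships seed files for a number of genomes. In my setup they are under ~/anaconda3/envs/python3.8/lib/R/library/BSgenome/extdata/GentlemanLab/ (in general, the directory returned by system.file("extdata", "GentlemanLab", package = "BSgenome") in R). Copy the Arabidopsis seed file from there and adapt it for wheat.
The wheat seed file is shown below. The text after each # is an annotation for this write-up only and should be left out of the real file, since it would otherwise be read as part of the field value.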
#shell
#cat Triticum_aestivum.IWGSC-seed
Package: BSgenome.Triticum.aestivum.IWGSC
Title: Full genome sequences for Triticum aestivum
Description: Full genome sequences for Triticum aestivum as provided by IWGSC and stored in Biostrings objects.
Version: 1.0.0
organism: Triticum aestivum
common_name: bread wheat
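genome: TAIR10.1#This field expects a UCSC or NCBI assembly name, but the wheat genome was downloaded from EnsemblPlants, so the Arabidopsis value is kept as a stand-in; this does not affect later use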
provider: EnsemblPlants
release_date: 2018/03/15#Taken unchanged from the Arabidopsis seed file
source_url: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/triticum_aestivum/dna/
organism_biocview: Triticum.aestivum
BSgenomeObjname: Triticum.aestivum.IWGSC
SrcDataFiles: Triticum_aestivum.IWGSC.dna.toplevel.fa.gz from https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/triticum_aestivum/dna/
PkgExamples: genome[["1A"]]#The wheat chromosomes from EnsemblPlants are named 1A, 1B, 1D, ..., so the Arabidopsis example genome[["1"]] is changed to genome[["1A"]]
seqs_srcdir: ~/genome/ChineseSpring/BSgenome#Directory containing the 2bit file
seqfile_name: Triticum_aestivum.IWGSC.dna.toplevel.2bit#Name of the 2bit file
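If the seed file is kept in annotated form as in the listing above, a clean copy can be produced by stripping everything from the first # on each line. This is a minimal sketch, assuming no field value legitimately contains a # character; `strip_annotations` is a hypothetical helper, and the two-field input is a made-up excerpt for illustration.

```shell
# Hypothetical helper: drop trailing "#..." annotations and trailing blanks
# from an annotated seed listing, writing the clean text to stdout.
# Assumption: no field value contains a literal '#'.
strip_annotations() {
    sed -e 's/#.*$//' -e 's/[[:space:]]*$//' "$1"
}

# Toy annotated input (two fields only), made up for illustration.
cat > annotated-seed.txt <<'EOF'
genome: TAIR10.1#borrowed from the Arabidopsis seed file
seqfile_name: Triticum_aestivum.IWGSC.dna.toplevel.2bit#name of the 2bit file
EOF

strip_annotations annotated-seed.txt > clean-seed.txt
cat clean-seed.txt
```

The clean copy is the one that would be passed to the forge step.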
Build the wheat BSgenome in R
#R
library(BSgenome)
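forgeBSgenomeDataPkg("./Triticum_aestivum.IWGSC-seed", seqs_srcdir = getwd(), destdir = getwd(), verbose = TRUE)
#seqs_srcdir is the directory containing the 2bit file and destdir is the output directory; if left unset, the current directory is used
#Note: recent Bioconductor releases provide forgeBSgenomeDataPkg() in the separate BSgenomeForge package; if the function is not found after library(BSgenome), try library(BSgenomeForge)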
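One thing worth checking before running the call above is that the file named in the seed file's seqfile_name field is actually present in the directory passed as seqs_srcdir. This is a minimal sketch of such a pre-flight check; `seed_field` and `check_seqfile` are hypothetical helpers, and the toy seed file and directory are made up for illustration.

```shell
# Hypothetical helper: print the value of one "key: value" field from a seed file.
seed_field() {
    awk -v key="$2" 'index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }' "$1"
}

# Hypothetical check: succeeds only if the seed's seqfile_name exists in the given directory.
check_seqfile() {
    name=$(seed_field "$1" seqfile_name)
    [ -n "$name" ] && [ -f "$2/$name" ]
}

# Toy seed file and source directory, made up for illustration.
mkdir -p demo_src
: > demo_src/toy.2bit
printf 'Package: BSgenome.Toy\nseqfile_name: toy.2bit\n' > toy-seed

if check_seqfile toy-seed demo_src; then
    echo "2bit file found"
else
    echo "2bit file missing"
fi
```

With the real files, the call would be `check_seqfile Triticum_aestivum.IWGSC-seed "$PWD"`, matching the seqs_srcdir = getwd() argument.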
When it finishes, a BSgenome.Triticum.aestivum.IWGSC directory is created in the current directory, with the following structure:
#shell
BSgenome.Triticum.aestivum.IWGSC
├── DESCRIPTION
├── inst
│   └── extdata
│       └── single_sequences.2bit
├── man
│   └── package.Rd
├── NAMESPACE
└── R
    └── zzz.R
5 directories, 5 files
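The listing above can also be checked mechanically rather than by eye. This is a minimal sketch that looks for the five files shown in the tree; `check_pkg_layout` is a hypothetical helper, and `mock_pkg` is a stand-in directory created only for the demonstration.

```shell
# Hypothetical helper: report any file from the listing above that is missing
# under the given package directory; returns non-zero if something is absent.
check_pkg_layout() {
    pkg=$1
    rc=0
    for f in DESCRIPTION NAMESPACE R/zzz.R man/package.Rd inst/extdata/single_sequences.2bit; do
        if [ ! -f "$pkg/$f" ]; then
            echo "missing: $f"
            rc=1
        fi
    done
    return $rc
}

# Mock directory with the 2bit file deliberately left out, for illustration only.
mkdir -p mock_pkg/R mock_pkg/man mock_pkg/inst/extdata
touch mock_pkg/DESCRIPTION mock_pkg/NAMESPACE mock_pkg/R/zzz.R mock_pkg/man/package.Rd
check_pkg_layout mock_pkg || echo "layout incomplete"
```

For the real output the call would be `check_pkg_layout BSgenome.Triticum.aestivum.IWGSC`.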
Build the contents of BSgenome.Triticum.aestivum.IWGSC into an R package
#shell
R CMD build BSgenome.Triticum.aestivum.IWGSC
R CMD check BSgenome.Triticum.aestivum.IWGSC_1.0.0.tar.gz
R CMD INSTALL BSgenome.Triticum.aestivum.IWGSC_1.0.0.tar.gz
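The tarball name used in the check and install steps follows from the Package and Version fields of DESCRIPTION, since R CMD build names its output Package_Version.tar.gz. This is a minimal sketch that derives the name instead of typing it by hand; `tarball_name` is a hypothetical helper, and `mock_build` holds a two-line mock DESCRIPTION for illustration.

```shell
# Hypothetical helper: derive "<Package>_<Version>.tar.gz" from the
# DESCRIPTION file inside a package source directory.
tarball_name() {
    awk '
        index($0, "Package: ") == 1 { pkg = substr($0, 10) }
        index($0, "Version: ") == 1 { ver = substr($0, 10) }
        END { print pkg "_" ver ".tar.gz" }
    ' "$1/DESCRIPTION"
}

# Mock DESCRIPTION with the two relevant fields, for illustration only.
mkdir -p mock_build
printf 'Package: BSgenome.Triticum.aestivum.IWGSC\nVersion: 1.0.0\n' > mock_build/DESCRIPTION

# Prints BSgenome.Triticum.aestivum.IWGSC_1.0.0.tar.gz
tarball_name mock_build
```

With the real directory, the install step could then be written as `R CMD INSTALL "$(tarball_name BSgenome.Triticum.aestivum.IWGSC)"`.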
Once built and installed, the package can be loaded directly in R
#R
library("BSgenome.Triticum.aestivum.IWGSC")
#List the chromosome names
seqnames(BSgenome.Triticum.aestivum.IWGSC)
#Get the chromosome lengths
seqlengths(BSgenome.Triticum.aestivum.IWGSC)
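#Get the sequence of a given chromosome. A name that starts with a digit is not syntactically valid in R, so $1A must be written as $"1A", whereas $Un can be written directly
A1 <- BSgenome.Triticum.aestivum.IWGSC$"1A"
Un <- BSgenome.Triticum.aestivum.IWGSC$Un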