blast database

How do I download the NCBI nr and nt databases?

First, get to know the BLAST databases.

1. Quick Start

Get all numbered files for a database with the same base name: Each of these files represents a subset (volume) of that database, and all of them are needed to reconstitute the database.

After extraction, there is no need to concatenate the resulting files: call the database with the base name; for the nr database files, use "-db nr". These databases have already been processed with makeblastdb, so they can be used directly after download.

For easy download, use the update_blastdb.pl script from the blast+ package.

Incremental update is not available.

2. General Introduction

BLAST search pages under the Basic BLAST section of the NCBI BLAST home page (http://blast.ncbi.nlm.nih.gov/) use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases are made available as compressed archives (in pre-formatted form) and can be downloaded from the /db directory of the BLAST ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). The FASTA files reside under the /FASTA subdirectory.

The pre-formatted databases offer the following advantages:

Pre-formatting removes the need to run makeblastdb, the database-building command;

Species-level taxonomy ids are included for each database entry;

Databases are broken into smaller-sized volumes and are therefore easier to download;

Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility, so FASTA files can be exported from the database files;

A convenient script (update_blastdb.pl) is available in the blast+ package to download the pre-formatted databases; the same script can be used to update them.

Pre-formatted databases must be downloaded using the update_blastdb.pl script or via FTP in binary mode. Documentation for this script can be obtained by running the script without any arguments; Perl installation is required.
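As a rough illustration of the download step, the sketch below builds and runs an update_blastdb.pl call from Python. It assumes the script is on the PATH and directly executable (on Windows it would normally be run through perl), and that the script's --decompress option is used to unpack the archives after download. The function names are hypothetical helpers, not part of blast+.

```python
import shutil
import subprocess


def update_blastdb_command(db_name, decompress=True, script="update_blastdb.pl"):
    """Build the argument list for downloading one pre-formatted database."""
    cmd = [script]
    if decompress:
        # Ask the script to unpack the downloaded archives as well.
        cmd.append("--decompress")
    cmd.append(db_name)
    return cmd


def download_database(db_name, target_dir="."):
    """Run the download inside target_dir; raises if the script is not found."""
    cmd = update_blastdb_command(db_name)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    subprocess.run(cmd, cwd=target_dir, check=True)


if __name__ == "__main__":
    # Only prints the command; call download_database("nr") to actually fetch.
    print(" ".join(update_blastdb_command("nr")))
```

Building the command as a list rather than a single string avoids shell quoting problems if the database name or path ever contains spaces.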

The compressed files downloaded must be inflated with gzip or other decompression tools. The BLAST database files can then be extracted out of the resulting tar file using the tar utility on Unix/Linux, or WinZip and StuffIt Expander on Windows and Macintosh platforms, respectively.
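For readers who prefer to script the unpacking, here is a minimal sketch using Python's standard tarfile module, which handles the gzip and tar steps in one pass. The function name and the destination directory are illustrative assumptions.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Decompress and unpack one .tar.gz volume; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar file for reading.
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling extract_archive once per downloaded volume, all into the same directory, leaves the database files side by side, which is what BLAST expects.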

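Large databases are formatted in multiple one-gigabyte volumes, which are named using the basename.##.tar.gz convention. All volumes with the same base name are required. An alias file is provided to tie individual volumes together so that the database can be called using the base name (without the .nal or .pal extension). For example, to call the est database, simply use the "-db est" option in the command line (without the quotes). Large databases are therefore usually split across several archives; nr, for example, had 11 archives at the time of writing. Every archive for a database must be downloaded and extracted. Extraction produces the database files for each volume together with an alias file (nr.pal for nr), and nr is then searched with -db nr (-d nr in the legacy blastall program).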
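Because every volume is required, it can help to check a download directory before searching. The sketch below groups file names that follow the basename.##.tar.gz convention and reports gaps in the numbering. It assumes volume numbers start at 00, and it cannot tell whether volumes beyond the highest one present are missing. The helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as nr.00.tar.gz or est_human.12.tar.gz.
VOLUME_RE = re.compile(r"^(?P<base>.+?)\.(?P<num>\d+)\.tar\.gz$")


def group_volumes(filenames):
    """Map each database base name to the sorted volume numbers found."""
    groups = defaultdict(list)
    for name in filenames:
        match = VOLUME_RE.match(name)
        if match:
            groups[match.group("base")].append(int(match.group("num")))
    return {base: sorted(nums) for base, nums in groups.items()}


def missing_volumes(volume_numbers):
    """Return the gaps between 0 and the highest volume number present."""
    if not volume_numbers:
        return []
    present = set(volume_numbers)
    return [n for n in range(max(present) + 1) if n not in present]
```

Single-archive databases such as taxdb.tar.gz carry no volume number and are simply ignored by the pattern.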

Additional BLAST databases that are not provided in pre-formatted formats may be available in the FASTA subdirectory. For other genomic BLAST databases, please check the genomes ftp directory at: ftp://ftp.ncbi.nlm.nih.gov/genomes/

3. Contents of the /blast/db/ directory

The pre-formatted BLAST databases are archived in this directory. The names of these databases and their contents are listed below.

+-------------------------------+------------------------------------------------+
File Name                       | Content Description
+-------------------------------+------------------------------------------------+
16SMicrobial.tar.gz             | Bacterial and Archaeal 16S rRNA sequences from BioProjects 33175 and 33117
FASTA/                          | Subdirectory for FASTA formatted sequences
README                          | README for this subdirectory (this file)
Representative_Genomes.*tar.gz  | Representative bacterial/archaeal genomes database
cdd_delta.tar.gz                | Conserved Domain Database sequences for use with stand alone deltablast
cloud/                          | Subdirectory of databases for BLAST AMI; see http://1.usa.gov/TJAnEt
env_nr.*tar.gz                  | Protein sequences for metagenomes
env_nt.*tar.gz                  | Nucleotide sequences for metagenomes
est.tar.gz                      | This file requires est_human.*.tar.gz, est_mouse.*.tar.gz, and est_others.*.tar.gz files to function. It contains the est.nal alias so that searches against est (-db est) will include est_human, est_mouse and est_others.
est_human.*.tar.gz              | Human subset of the est database from the est division of GenBank, EMBL and DDBJ
est_mouse.*.tar.gz              | Mouse subset of the est database
est_others.*.tar.gz             | Non-human and non-mouse subset of the est database
gss.*tar.gz                     | Sequences from the GSS division of GenBank, EMBL, and DDBJ
htgs.*tar.gz                    | Sequences from the HTG division of GenBank, EMBL, and DDBJ
human_genomic.*tar.gz           | Human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs
nr.*tar.gz                      | Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq
nt.*tar.gz                      | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS
other_genomic.*tar.gz           | RefSeq chromosome records (NC_) for non-human organisms
pataa.*tar.gz                   | Patent protein sequences
patnt.*tar.gz                   | Patent nucleotide sequences. Both patent databases are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ
pdbaa.*tar.gz                   | Sequences for the protein structure from the Protein Data Bank
pdbnt.*tar.gz                   | Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding sequences for the corresponding pdbaa entries.
refseq_genomic.*tar.gz          | NCBI genomic reference sequences
refseq_protein.*tar.gz          | NCBI protein reference sequences
refseq_rna.*tar.gz              | NCBI Transcript reference sequences
sts.*tar.gz                     | Sequences from the STS division of GenBank, EMBL, and DDBJ
swissprot.tar.gz                | Swiss-Prot sequence database (last major update)
taxdb.tar.gz                    | Additional taxonomy information for the databases listed here providing common and scientific names
tsa_nt.*tar.gz                  | Sequences from the TSA division of GenBank, EMBL, and DDBJ
vector.tar.gz                   | Vector sequences from 2010, see Note 2 in section 4
wgs.*tar.gz                     | Sequences from Whole Genome Shotgun assemblies
+-------------------------------+------------------------------------------------+

4. Contents of the /blast/db/FASTA directory

This directory contains FASTA formatted sequence files. The file names and database contents are listed below. These files must be unpacked and processed through makeblastdb before they can be used by the BLAST programs.

+-------------------+-----------------------------------------------------+
File Name           | Content Description
+-------------------+-----------------------------------------------------+
alu.a.gz            | translation of alu.n repeats
alu.n.gz            | alu repeat elements (from 2003)
drosoph.aa.gz       | CDS translations from drosophila.nt
drosoph.nt.gz       | genomic sequences for drosophila (from 2003)
env_nr.gz*          | Protein sequences for metagenomes, taxid 408169
env_nt.gz*          | Nucleotide sequences for metagenomes, taxid 408169
est_human.gz*       | human subset of the est database (see Note 1)
est_mouse.gz*       | mouse subset of the est database
est_others.gz*      | non-human and non-mouse subset of the est database
gss.gz*             | sequences from the GSS division of GenBank, EMBL, and DDBJ
htgs.gz*            | sequences from the HTG division of GenBank, EMBL, and DDBJ
human_genomic.gz*   | human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs
igSeqNt.gz          | human and mouse immunoglobulin variable region nucleotide sequences
igSeqProt.gz        | human and mouse immunoglobulin variable region protein sequences
mito.aa.gz          | CDS translations of complete mitochondrial genomes
mito.nt.gz          | complete mitochondrial genomes
nr.gz*              | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq
nt.gz*              | nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant.
other_genomic.gz*   | RefSeq chromosome records (NC_) for organisms other than human
pataa.gz*           | patent protein sequences
patnt.gz*           | patent nucleotide sequences. Both patent sequence files are from the USPTO, or EPO/JPO via EMBL/DDBJ
pdbaa.gz*           | protein sequences from pdb protein structures
pdbnt.gz*           | nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries.
sts.gz*             | database for sequence tag site entries
swissprot.gz*       | swiss-prot database (last major release)
vector.gz           | vector sequences from 2010 (see Note 2)
wgs.gz*             | whole genome shotgun genome assemblies
yeast.aa.gz         | protein translations from yeast.nt
yeast.nt.gz         | yeast genomes (from 2003)
+-------------------+-----------------------------------------------------+

NOTE:

(1) NCBI does not provide the complete est database in FASTA format. One needs to get all three subsets (est_human, est_mouse, and est_others) and concatenate them into the complete est FASTA database.

(2) For screening for vector contamination, use the UniVec database: ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/

* marked files have pre-formatted counterparts.
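The concatenation described in Note 1 could be scripted along these lines. This is a sketch, assuming the three subsets have been downloaded as gzip-compressed FASTA files and that each one ends with a newline, so that records from adjacent files do not run together. The function name and the file names in the usage comment are illustrative.

```python
import gzip
import shutil


def concatenate_fasta_gz(parts, output_path):
    """Decompress each gzipped FASTA part and append it to one plain FASTA file."""
    with open(output_path, "wb") as out:
        for part in parts:
            with gzip.open(part, "rb") as src:
                # Stream in chunks so large files are never held in memory.
                shutil.copyfileobj(src, out)


# Example call (file names follow the table above without the trailing marker):
# concatenate_fasta_gz(["est_human.gz", "est_mouse.gz", "est_others.gz"], "est")
```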

5. Database updates

The BLAST databases are updated regularly. There is no established incremental update scheme. We recommend downloading the complete databases regularly to keep their content current.

6. Non-redundant defline syntax

The non-redundant databases are nr, nt (partially) and pataa. In them, identical sequences are merged into one entry. To be merged, two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters, which are invisible to most programs. In the example below, both entries gi|1469284 and gi|1477453 have the same sequence, in every respect:

>gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC ^Agi|1469284|gb|AAB05030.1| afuC gene product ^Agi|1477453|gb|AAB17216.1| afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
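To make the merged defline concrete, the sketch below splits one header line on the control-A character (byte 0x01, shown as ^A in the example) and returns one identifier and description per merged entry. The function name is hypothetical, and the input is assumed to contain the real control character rather than the two printable characters "^A".

```python
CTRL_A = "\x01"  # the separator displayed as ^A in the example


def split_defline(defline):
    """Split a merged non-redundant defline into (identifier, description) pairs."""
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        part = part.strip()
        if not part:
            continue
        # The identifier runs up to the first space; the rest is free text.
        seq_id, _, description = part.partition(" ")
        entries.append((seq_id, description.strip()))
    return entries
```

Applied to the header shown here, this yields three entries, one for each of the gi identifiers that share the sequence.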

The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table at http://www.ncbi.nlm.nih.gov/toolkit/doc/book/ch_demo/?report=objectonly#ch_demo.T5 lists the supported FASTA identifiers. Some BLAST databases are not provided in pre-formatted form; these can be downloaded from the FASTA directory.

For databases whose entries are not from official NCBI sequence databases, such as the Trace database, the gnl| convention is used. For custom databases, this convention should be followed and the id for each sequence must be unique if one would like to take advantage of the indexed database, which enables specific sequence retrieval using the blastdbcmd program included in the BLAST executable package. One should refer to the documents distributed in the standalone BLAST package for more details.
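For a custom database, identifiers in the gnl|database|identifier form can be generated and checked for uniqueness before formatting. The sketch below is one possible way to do that; the database label "mydb" in the usage comment and both function names are illustrative assumptions.

```python
from collections import Counter


def gnl_id(database, identifier):
    """Compose a general identifier of the form gnl|database|identifier."""
    return f"gnl|{database}|{identifier}"


def duplicate_ids(fasta_lines):
    """Return the sequence ids that occur on more than one header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The id is the first whitespace-delimited token of the header.
                ids.append(fields[0])
    return sorted(seq_id for seq_id, count in Counter(ids).items() if count > 1)


# Example: gnl_id("mydb", "seq1") gives "gnl|mydb|seq1"
```

An empty result from duplicate_ids means every header carries a distinct id, which is the precondition stated above for indexed retrieval.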

7. Formatting a FASTA file into a BLASTable database

FASTA files need to be formatted with makeblastdb before they can be used in local blast search. For those from NCBI, the following makeblastdb commands are recommended:

For nucleotide fasta file:

makeblastdb -in input_db -dbtype nucl -parse_seqids

For protein fasta file:

makeblastdb -in input_db -dbtype prot -parse_seqids
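When formatting several FASTA files, the same commands can be driven from a script. The following is a minimal sketch that assumes makeblastdb is on the PATH; the function names are hypothetical, and the optional -out argument for choosing the database name is an addition to the two commands shown in the text.

```python
import subprocess


def makeblastdb_command(fasta_path, dbtype, out_name=None):
    """Build the makeblastdb argument list for a nucleotide or protein FASTA file."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-parse_seqids"]
    if out_name:
        # Optional: name the database differently from the input file.
        cmd += ["-out", out_name]
    return cmd


def format_database(fasta_path, dbtype, out_name=None):
    """Run makeblastdb and raise if it exits with an error."""
    subprocess.run(makeblastdb_command(fasta_path, dbtype, out_name), check=True)
```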
