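1. Prepare the local database files

NR (Non-Redundant Protein Sequence Database) is the non-redundant protein database: it holds the non-redundant protein sequences from GenBank, EMBL, DDBJ and PDB. Taxonomy is the species classification database; it holds the names and lineages of more than 70,000 species, each of which has at least one nucleotide or protein sequence in the genetic databases. NR and Taxonomy are both NCBI sub-databases, so between them they provide a fairly complete mapping from sequence to species. To split a local database by species, you must download files from both.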

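1.1 Downloading the NR database

FTP download address: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
NR is updated very frequently. If you always want the newest release you could re-download it every month or even every week, but the database is very large, so a production pipeline cannot realistically be refreshed that often; updating every six months or once a year is reasonable.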

1.2 Downloading the Taxonomy database

FTP download address: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
Taxonomy is likewise updated frequently.

Splitting the database requires two files. The first is prot.accession2taxid, found in the accession2taxid directory.

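This file maps each accession to its taxid. (It also carries GI numbers; before 2016 people used GI-to-taxid mapping files, which have since been retired.) It is tab-separated: a header line followed by four columns, namely accession, accession.version, taxid and gi.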
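
As a rough illustration of reading this file, here is a minimal Python sketch. It assumes the four-column, tab-separated layout with a header line described above; the function name `load_acc2taxid` and the optional `wanted` filter are hypothetical names chosen for this example, not part of any NCBI tool.

```python
import gzip


def load_acc2taxid(path, wanted=None):
    """Return {accession.version: taxid} from a prot.accession2taxid(.gz) file.

    Assumes four tab-separated columns with a header line:
    accession, accession.version, taxid, gi.
    `wanted` is an optional set of accession.version strings to keep; the
    full file is very large, so filtering keeps the dictionary manageable.
    """
    opener = gzip.open if path.endswith(".gz") else open
    mapping = {}
    with opener(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore empty or malformed lines
            acc_version, taxid = fields[1], fields[2]
            if wanted is None or acc_version in wanted:
                mapping[acc_version] = int(taxid)
    return mapping
```

The sketch keys on accession.version because that is the identifier that appears in the NR FASTA headers.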

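The second is the taxdump archive, which contains the taxonomy hierarchy, the species names and related tables. Extracting it yields a set of *.dmp files and a readme.txt.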

readme.txt explains every column of every file (note that | is the column separator, not a column of its own):

*.dmp files are bcp-like dump from GenBank taxonomy database.

General information.
Field terminator is "\t|\t"
Row terminator is "\t|\n"

nodes.dmp file consists of taxonomy nodes. The description for each node includes the following
fields:
    tax_id                  -- node id in GenBank taxonomy database (the Taxonomy record number)
    parent tax_id               -- parent node id in GenBank taxonomy database (the tax_id of the next level up)
    rank                    -- rank of this node (superkingdom, kingdom, ...) (the taxonomic level of this tax_id)
    embl code               -- locus-name prefix; not unique
    division id             -- see division.dmp file
    inherited div flag  (1 or 0)        -- 1 if node inherits division from parent
    genetic code id             -- see gencode.dmp file
    inherited GC  flag  (1 or 0)        -- 1 if node inherits genetic code from parent
    mitochondrial genetic code id       -- see gencode.dmp file
    inherited MGC flag  (1 or 0)        -- 1 if node inherits mitochondrial gencode from parent
    GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage
    hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet
    comments                -- free-text comments and citations

Taxonomy names file (names.dmp):
    tax_id                  -- the id of node associated with this name (the Taxonomy record number)
    name_txt                -- name itself (the species name for this tax_id)
    unique name             -- the unique variant of this name if name not unique
    name class              -- (synonym, common name, ...)

Divisions file (division.dmp):
    division id             -- taxonomy database division id
    division cde                -- GenBank division code (three characters)
    division name               -- e.g. BCT, PLN, VRT, MAM, PRI...
    comments

Genetic codes file:
    genetic code id             -- GenBank genetic code id
    abbreviation                -- genetic code name abbreviation
    name                    -- genetic code name
    cde                 -- translation table for this genetic code
    starts                  -- start codons for this genetic code

Deleted nodes file (delnodes.dmp):
    tax_id                  -- deleted node id

Merged nodes file (merged.dmp):
    old_tax_id                              -- id of nodes which has been merged
    new_tax_id                              -- id of nodes which is result of merging

Citations file (citations.dmp):
    cit_id                  -- the unique id of citation
    cit_key                 -- citation key
    pubmed_id               -- unique id in PubMed database (0 if not in PubMed)
    medline_id              -- unique id in MedLine database (0 if not in MedLine)
    url                 -- URL associated with citation
    text                    -- any text (usually article name and authors).
                        -- The following characters are escaped in this text by a backslash:
                        -- newline (appear as "\n"),
                        -- tab character ("\t"),
                        -- double quotes ('\"'),
                        -- backslash character ("\\").
    taxid_list              -- list of node ids separated by a single space
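
The field and row terminators described in this readme can be handled with a few lines of Python. The following is only a sketch: `parse_dmp_line` is a hypothetical helper name, and it assumes every row ends with the "\t|" row terminator before the newline.

```python
def parse_dmp_line(line):
    """Split one row of a taxdump *.dmp file into a list of stripped fields."""
    row = line.rstrip("\r\n")
    # Drop the row terminator "\t|" so it does not stick to the last field.
    if row.endswith("\t|"):
        row = row[:-2]
    # Fields are separated by "\t|\t"; empty fields are kept as "".
    return [field.strip() for field in row.split("\t|\t")]


# A names.dmp row: tax_id, name_txt, unique name, name class
print(parse_dmp_line("1\t|\tall\t|\t\t|\tsynonym\t|\n"))
```

Splitting on the full "\t|\t" terminator rather than on "|" alone keeps the empty third field in place, so column positions stay aligned with the readme.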

The key files are names.dmp and nodes.dmp. names.dmp has four columns, of which only the first two matter here: the taxid and the species name.

nodes.dmp has 13 columns, of which only the first three matter here: the taxid, the parent taxid and the rank.
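
To show how the first three columns of nodes.dmp encode the hierarchy, here is a small Python sketch that walks from a taxid up to the root. The names `load_nodes` and `lineage` are hypothetical; the sketch relies on the "\t|\t" field terminator from the readme and on the root node (tax_id 1) listing itself as its parent.

```python
def load_nodes(path):
    """Return {tax_id: (parent_tax_id, rank)} from nodes.dmp."""
    nodes = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split("\t|\t")
            if len(fields) < 3:
                continue  # skip blank or malformed rows
            nodes[int(fields[0])] = (int(fields[1]), fields[2].strip())
    return nodes


def lineage(tax_id, nodes):
    """Return [(tax_id, rank), ...] from tax_id up to the root."""
    path = []
    seen = set()  # guards against accidental cycles in a damaged file
    while tax_id in nodes and tax_id not in seen:
        seen.add(tax_id)
        parent, rank = nodes[tax_id]
        path.append((tax_id, rank))
        if parent == tax_id:  # the root node is its own parent
            break
        tax_id = parent
    return path
```

For example, `lineage(9606, load_nodes("nodes.dmp"))` would list the human species node followed by each of its ancestors.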

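To keep the classification simple, we follow the Taxonomy database's own scheme, the divisions listed in division.dmp, which gives 12 categories: Bacteria, Invertebrates, Mammals, Phages, Plants and Fungi, Primates, Rodents, Synthetic and Chimeric, Unassigned, Viruses, Vertebrates, and Environmental samples.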
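
One possible way to attach a division to every taxid is to join the division id column of nodes.dmp (the fifth field in the readme) against division.dmp. The Python sketch below does this; `read_dmp` and `load_taxid_to_division` are hypothetical helper names, and this is only one approach, not necessarily the one the original script used.

```python
def read_dmp(path):
    """Yield each row of a taxdump *.dmp file as a list of stripped fields."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            row = line.rstrip("\r\n")
            if not row:
                continue  # skip blank lines
            if row.endswith("\t|"):
                row = row[:-2]  # drop the row terminator
            yield [field.strip() for field in row.split("\t|\t")]


def load_taxid_to_division(nodes_path, division_path):
    """Return {tax_id: division name} for every node with a known division."""
    # division.dmp columns: division id, division code, division name, comments
    division_names = {int(row[0]): row[2] for row in read_dmp(division_path)}
    taxid_to_division = {}
    for row in read_dmp(nodes_path):
        if len(row) < 5:
            continue
        # nodes.dmp field 5 (index 4) is the division id of this node
        name = division_names.get(int(row[4]))
        if name is not None:
            taxid_to_division[int(row[0])] = name
    return taxid_to_division
```

With the real files, taxid 9606 (human) would be expected to map to "Primates".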

2. Splitting the NR database by species

2.1 Step 1: Map accessions to species categories

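Using the prot.accession2taxid.gz, nodes.dmp and division.dmp files described above, write a script that maps each accession to one of the 12 categories. The script is left for you to write. Assume the result file is named acc2sp.xls and lists each accession together with its category.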
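
Since the script is left to the reader, the following Python sketch shows one way the mapping file could be produced. It assumes a taxid-to-category dictionary has already been built from nodes.dmp and division.dmp; the function name `write_acc2sp` and the two-column output layout (accession.version, category) are assumptions made for this example and may differ from the layout used in the original post.

```python
import gzip


def write_acc2sp(acc2taxid_path, taxid_to_category, out_path):
    """Write "accession.version<TAB>category" lines; return how many were written."""
    opener = gzip.open if acc2taxid_path.endswith(".gz") else open
    written = 0
    with opener(acc2taxid_path, "rt") as src, open(out_path, "w") as out:
        next(src, None)  # skip the header line
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore empty or malformed lines
            category = taxid_to_category.get(int(fields[2]))
            if category is None:
                continue  # taxid not present in the taxonomy dump
            out.write(f"{fields[1]}\t{category}\n")
            written += 1
    return written
```

The input is streamed line by line, so only the taxid-to-category dictionary needs to fit in memory.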

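2.2 Step 2: Extract the sequences for each category

With acc2sp.xls and the full NR sequence file nr.gz, we can extract the sequences for each category. Beyond the 12 categories that the Taxonomy database defines, we can also merge them into custom sub-databases. For example, none of the 12 categories is simply "animals", so Invertebrates.fa, Mammals.fa, Primates.fa, Rodents.fa and Vertebrates.fa can be merged into a single animal category. Likewise, "Bacteria", "fungi", "Viruses", "Phages" and "Environmental.samples" can be merged into a single microbial category, which is common in meta-omics annotation. NR also contains sequences that fall into none of the 12 categories; these can go into unknown.fa (which differs from Unassigned.fa in that its sequences have no species information at all).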

This script is also left for you to write; the end result is one FASTA file per sub-database.
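
A minimal Python sketch of such a splitting script follows. It assumes an accession-to-category dictionary (for example, loaded from acc2sp.xls) and that NR headers join the deflines of identical sequences with Ctrl-A (\x01), each defline starting with its accession. The function name `split_fasta_by_category` and the rule of taking the first recognised accession are choices made for this example.

```python
import gzip
import os


def split_fasta_by_category(nr_path, acc_to_category, out_dir):
    """Write each record to <out_dir>/<category>.fa; return counts per category.

    A record's category comes from the first accession in its header that is
    found in acc_to_category; records with none go to unknown.fa.
    """
    opener = gzip.open if nr_path.endswith(".gz") else open
    os.makedirs(out_dir, exist_ok=True)
    handles, counts = {}, {}
    current = None
    try:
        with opener(nr_path, "rt") as src:
            for line in src:
                if line.startswith(">"):
                    category = "unknown"
                    # One header may hold several deflines joined by Ctrl-A.
                    for defline in line[1:].split("\x01"):
                        parts = defline.split(None, 1)
                        if parts and parts[0] in acc_to_category:
                            category = acc_to_category[parts[0]]
                            break
                    if category not in handles:
                        # "Environmental samples" -> Environmental.samples.fa
                        name = category.replace(" ", ".") + ".fa"
                        handles[category] = open(os.path.join(out_dir, name), "w")
                    current = handles[category]
                    counts[category] = counts.get(category, 0) + 1
                if current is not None:
                    current.write(line)  # header or sequence line
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

A custom grouping such as "animals" could be obtained by remapping the category values in the dictionary before calling the function. For the full NR, the dictionary may not fit in memory, in which case an on-disk lookup or processing in batches would be needed.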

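2.3 Step 3: Build the databases and run the alignment

Build a database from each sub-database FASTA file with blast or diamond, then choose the matching sub-database when running later alignments.
With blastall: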

formatdb -p T  -i Plants.fa
blastall -i query.fa -d Plants.fa -o blastout.nr -p blastp -F F -m 7 -e 1e-5 -b 10 -v 10 -a 5

Or with diamond:

diamond makedb --in Plants.fa -d Plants.fa
diamond blastp --evalue 1e-5 --threads 4 --outfmt 5 -q query.fa  -d Plants.fa.dmnd -o blastout.nr --seg no --max-target-seqs 20 --more-sensitive -b 0.5 --salltitles

