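Organizing KEGG pathway annotations
Obtaining KEGG annotations
Both eggnog-mapper and InterProScan (each a tool with its own database) can provide KEGG ORTHOLOGY (KO) annotations, that is, the K number assigned to each gene or transcript. See the wiki of each tool for details.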
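As a rough illustration of what "gene to K number" looks like in practice, the sketch below reads an eggnog-mapper style annotation table into a dictionary. It assumes a tab-separated file whose header line contains a `#query` column and a `KEGG_ko` column with values such as `ko:K00844,ko:K12407` (and `-` when there is no hit). Those column names and the inline sample rows are assumptions for illustration, so check the header of your own file. `read_gene_to_ko` is a hypothetical helper, not part of eggnog-mapper.

```python
import csv
import io


def read_gene_to_ko(handle):
    """Return {query_id: [K numbers]} from an eggnog-mapper style table."""
    # Skip '##' metadata lines; the header line itself starts with '#query'.
    rows = (line for line in handle if not line.startswith("##"))
    reader = csv.DictReader(rows, delimiter="\t")
    gene_to_ko = {}
    for row in reader:
        value = row.get("KEGG_ko") or "-"
        if value == "-":
            continue  # no KO assigned to this query
        # Assumed value format: "ko:K00844,ko:K12407"; keep only the K number.
        gene_to_ko[row["#query"]] = [ko.split(":")[-1] for ko in value.split(",")]
    return gene_to_ko


# Made-up rows, only to show the expected shape of the input.
sample = (
    "## metadata line\n"
    "#query\tseed_ortholog\tKEGG_ko\n"
    "gene1\t9606.ENSP1\tko:K00844,ko:K12407\n"
    "gene2\t9606.ENSP2\t-\n"
)
gene_to_ko = read_gene_to_ko(io.StringIO(sample))
print(gene_to_ko)
```

For a real file, pass an open file handle instead of the `io.StringIO` sample.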
Obtaining the KO-to-pathway mapping
Go to the KEGG website and click KEGG BRITE to open that database, where the manually curated hierarchy files (BRITE hierarchy files) can be downloaded. The file needed here is the one that maps pathways to KOs: click KEGG Orthology (KO) and download the JSON version.
The script below parses this file and writes a table that is easier to work with.
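To get a feel for the file before parsing it, here is a minimal sketch of the nesting it is assumed to use: every node has a `name`, and every non-leaf node has a `children` list, with three pathway levels above the KO entries. The fragment is hand-written for illustration (it is not copied from the real download), and `walk` is a hypothetical helper.

```python
import json

# Hand-written fragment mimicking the assumed layout of ko00001.json.
fragment = json.loads("""
{"name": "ko00001", "children": [
  {"name": "09100 Metabolism", "children": [
    {"name": "09101 Carbohydrate metabolism", "children": [
      {"name": "00010 Glycolysis / Gluconeogenesis [PATH:ko00010]", "children": [
        {"name": "K00844  HK; hexokinase [EC:2.7.1.1]"}
      ]}
    ]}
  ]}
]}
""")


def walk(node, depth=0):
    """Yield (depth, name) for every node, parents before children."""
    yield depth, node["name"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


outline = list(walk(fragment))
for depth, name in outline:
    print("  " * depth + name)
```

Running the same `walk` on the real file (loaded with `json.load`) is a quick way to confirm the depth at which KO entries appear before relying on fixed nesting levels.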
import json
import re

# Load the BRITE hierarchy downloaded from KEGG.
with open("ko00001.json") as f:
    ko_map_data = json.load(f)

header = [
    "level1_pathway_id", "level1_pathway_name",
    "level2_pathway_id", "level2_pathway_name",
    "level3_pathway_id", "level3_pathway_name",
    "ko", "ko_name", "ko_des", "ec",
]

with open("KEGG_pathway_ko.txt", "w") as oh:
    oh.write("\t".join(header) + "\n")
    for level1 in ko_map_data["children"]:
        # Level 1 and level 2 names have the form "<id> <name>".
        m = re.match(r"(\S+)\s+(.+)", level1["name"])
        level1_pathway_id = m.group(1).strip()
        level1_pathway_name = m.group(2).strip()
        for level2 in level1["children"]:
            m = re.match(r"(\S+)\s+(.+)", level2["name"])
            level2_pathway_id = m.group(1).strip()
            level2_pathway_name = m.group(2).strip()
            for level3 in level2["children"]:
                # Keep only the text before any trailing "[...]" tag.
                m = re.match(r"(\S+)\s+([^\[]*)", level3["name"])
                level3_pathway_id = m.group(1).strip()
                level3_pathway_name = m.group(2).strip()
                # Some level 3 entries have no KO children.
                for ko in level3.get("children", []):
                    # Expected form: "<K number>  <name>; <description> [EC:...]".
                    # The name field may hold several comma-separated symbols,
                    # so match up to the first ";" instead of a single token.
                    m = re.match(
                        r"(\S+)\s+([^;]+);\s+([^\[]+)\s*(\[EC:\S+(?:\s+[^\[\]]+)*\])*",
                        ko["name"],
                    )
                    if m is None:
                        continue  # entry does not follow the expected form
                    ko_id = m.group(1).strip()
                    ko_name = m.group(2).strip()
                    ko_des = m.group(3).strip()
                    ec = m.group(4)
                    if ec is None:
                        ec = "-"
                    fields = [
                        level1_pathway_id, level1_pathway_name,
                        level2_pathway_id, level2_pathway_name,
                        level3_pathway_id, level3_pathway_name,
                        ko_id, ko_name, ko_des, ec,
                    ]
                    oh.write("\t".join(fields) + "\n")
This writes the file KEGG_pathway_ko.txt. Next, remove duplicate rows.
import pandas as pd

# Read every column as a string so that IDs are not parsed as numbers.
data = pd.read_csv("KEGG_pathway_ko.txt", sep="\t", dtype=str)
data = data.drop_duplicates()
data.to_csv("KEGG_pathway_ko_uniq.txt", index=False, sep="\t")
The result is KEGG_pathway_ko_uniq.txt. This file contains the KO-to-pathway mapping together with the pathway hierarchy (KEGG pathways are organized into three levels), as shown below:
level1_pathway_id  level1_pathway_name  level2_pathway_id  level2_pathway_name      level3_pathway_id  level3_pathway_name           ko      ko_name  ko_des       ec
9100               Metabolism           9101               Carbohydrate metabolism  10                 Glycolysis / Gluconeogenesis  K00844  HK       hexokinase   [EC:2.7.1.1]
9100               Metabolism           9101               Carbohydrate metabolism  10                 Glycolysis / Gluconeogenesis  K12407  GCK      glucokinase  [EC:2.7.1.2]
9100               Metabolism           9101               Carbohydrate metabolism  10                 Glycolysis / Gluconeogenesis  K00845  glk      glucokinase  [EC:2.7.1.2]
Merging the results
With everything in tabular form, it is easy to merge the mappings above for downstream analysis. For example, the KEGG annotations can be summarized by pathway category or by hierarchy level, or a specific gene set can be tested for KEGG pathway enrichment.
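The sketch below shows one way such a merge and enrichment test could look, using only the standard library. It is an illustration under stated assumptions, not a complete workflow: the gene IDs are made up, the KO-to-pathway dictionary is typed in by hand from the three example rows above (in practice it would be built from the `ko` and `level3_pathway_id` columns of KEGG_pathway_ko_uniq.txt), the background is taken to be every gene in the gene-to-KO mapping, and `pathway_gene_sets` and `hypergeom_pvalue` are hypothetical helper names. The test is a one-sided hypergeometric test with no multiple-testing correction.

```python
from collections import defaultdict
from math import comb


def pathway_gene_sets(gene_to_ko, ko_to_pathways):
    """Invert gene->KO and KO->pathway mappings into pathway->set of genes."""
    sets = defaultdict(set)
    for gene, kos in gene_to_ko.items():
        for ko in kos:
            for pathway in ko_to_pathways.get(ko, ()):
                sets[pathway].add(gene)
    return dict(sets)


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


# Made-up gene IDs; genes with an empty list have no KO annotation.
gene_to_ko = {
    "g1": ["K00844"], "g2": ["K12407"], "g3": ["K00845"],
    "g4": [], "g5": [], "g6": ["K00844"],
}
# Hand-typed from the example rows; pathway 00010 is Glycolysis / Gluconeogenesis.
ko_to_pathways = {"K00844": ["00010"], "K12407": ["00010"], "K00845": ["00010"]}

background = set(gene_to_ko)      # assumed background: all genes in the mapping
interest = {"g1", "g2", "g3"}     # hypothetical gene set of interest

gene_sets = pathway_gene_sets(gene_to_ko, ko_to_pathways)
pvalues = {}
for pathway, genes in gene_sets.items():
    overlap = len(genes & interest)
    pvalues[pathway] = hypergeom_pvalue(
        overlap, len(interest), len(genes), len(background)
    )
print(pvalues)
```

The same `gene_sets` dictionary also covers the summary use case: `len(genes)` per pathway gives the number of annotated genes, and grouping by the level 1 or level 2 columns instead of level 3 gives coarser summaries.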