kegg pathway解析之前寫過,鏈接在此:kegg pathway三級(jí)層級(jí)結(jié)構(gòu)轉(zhuǎn)對(duì)應(yīng)表格,今天重新運(yùn)行發(fā)現(xiàn)運(yùn)行報(bào)錯(cuò),經(jīng)檢查原來是網(wǎng)頁內(nèi)容變了,需要修改部分代碼,重新解析網(wǎng)頁。
#!/usr/bin/python3
import re, sys
import lxml
import requests
from bs4 import BeautifulSoup
pyfile, outfile = sys.argv
class kegg(object):
def __init__(self):
self.url = 'https://www.genome.jp/kegg/pathway.html'
self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.55'}
def get_html(self, url):
response = requests.get(url, headers=self.headers)
if response.status_code == 200:
return response
else:
return None
def keggPathway(self, url):
if self.get_html(url):
html = self.get_html(url).text
soup = BeautifulSoup(html, "lxml")
Kmain = soup.find(class_="main")
# print(Kmain)
fp = []
sp = []
for i in Kmain:
if re.findall(r"<h4 id=.*>(.*?)</h4>", str(i)):
fp.append(re.findall(r"<h4 id=.*>(.*?)</h4>", str(i)))
if re.findall(r"<b.*>(.*?)</b>", str(i)):
sp.append(re.findall(r"<b.*>(.*?)</b>", str(i)))
del sp[0]
# print(fp,sp)
kt = Kmain.find_all(class_="list")
print(len(fp), len(sp), len(kt))
tpw = []
mapc = 0
for i in fp:
# print(i[0])
for j, k in zip(sp, kt):
# print(j[0])
if i[0][0] == j[0][0]:
fs = re.sub(r'^[^A-Z]*', '', i[0]) + "\t" + re.sub(
r'^[^A-Z]*', '', j[0])
ka = k.find_all('a')
for l in ka:
kau = l.get("href")
kat = l.get_text()
mapN = re.findall(r'(\w\w\w=?\d{5})', str(kau))[0]
mapN = re.sub(r'=', '', mapN)
if "map" in mapN:
mapc += 1
tpw.append(fs + "\t" + kat + "\t" + mapN)
print("map has :", mapc)
return tpw
def main(self):
tpw = self.keggPathway(self.url)
with open(outfile, "w") as f:
f.write("level1\tlevel2\tlevel3\tmap\n")
for i in tpw:
f.write(i + "\n")
if __name__ == "__main__":
kp = kegg()
kp.main()
添加了命令行參數(shù),可以指定輸出文件了,還有添加了表頭。
keggpathway.jpg
解析后的表
keggmap.png
這個(gè)表格暫時(shí)還缺了map和ko的對(duì)應(yīng),后面加上去。
明天即是黨成立100周年,在此熱烈慶祝中國共產(chǎn)黨成立100周年!
image.png
不忘初心,牢記使命,為共產(chǎn)主業(yè)事業(yè)奮斗終身。