Assuming you've already installed the bs4 and requests libraries.
The novel was picked at random, so don't read too much into the choice (for learning purposes only).
Implemented in Python 3 — there are plenty of Python 2 crawler tutorials online, but relatively few for Python 3.
- The URL being crawled:
https://www.qu.la/book/12763/10664294.html
- Consolidate the chapters and write them to a file
- If you haven't read Step 1 yet, you can check it out via the link below first:
View Step 1
- If you haven't read Step 2 yet, you can check it out via the link below first:
View Step 2
Step 3: consolidate the chapters and write them to a file
- Basic novel crawling
- Use randomized delays between requests to avoid getting banned
- Drawback: the crawl must be seeded with the link of a starting chapter (which link doesn't matter much — reading simply begins from that chapter)
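The crawl below relies on a small Python idiom worth calling out: a `for` loop over a list reads it by index, so items appended to the list during iteration are visited too. That is how a pool seeded with a single starting URL can keep growing as each page reveals the next one. A minimal sketch (the page names and stop condition here are made up for illustration):

```python
# URL-pool trick: Python's for loop walks the list by index, so items
# appended during iteration are also visited.
pool = ["page1"]
visited = []
for item in pool:
    visited.append(item)
    if len(pool) < 3:                        # hypothetical stop condition
        pool.append(f"page{len(pool) + 1}")  # enqueue the "next" page
print(visited)  # -> ['page1', 'page2', 'page3']
```

In the real crawler, the "next page" comes from the chapter's next-link, and the loop ends when no further link is found.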
```python
import requests
import time
import random
from bs4 import BeautifulSoup

begin_url = "https://www.qu.la/book/12763/10664294.html"
base = begin_url[:begin_url.rindex('/') + 1]

urls = [begin_url]  # initialize the URL pool
first = True
for url in urls:
    req = requests.get(url)
    req.encoding = 'utf-8'
    soup = BeautifulSoup(req.text, 'html.parser')
    content = soup.find(id='content')
    bookname = soup.find(attrs={"class": "bookname"})
    if content is None or bookname is None:  # layout changed or no more chapters
        break
    title = bookname.find('h1').text
    # strip layout characters and normalize quotes
    string = content.text.replace('\u3000', '').replace('\t', '').replace('\n', '') \
        .replace('\r', '').replace('『', '“').replace('』', '”').replace('\ufffd', '')
    string = string.split('\xa0')  # split on non-breaking spaces (encoding workaround)
    string = list(filter(lambda x: x, string))  # drop empty fragments
    for i in range(len(string)):
        string[i] = ' ' + string[i]
        if '本站重要通知' in string[i]:  # cut the site notice at the end of the chapter
            t = string[i].index('本站重要通知')
            string[i] = string[i][:t]
    string = '\n'.join(string)
    string = '\n' + title + '\n' + string
    # 'w' truncates the file on the first chapter; 'a' appends afterwards
    mode = 'w' if first else 'a'
    first = False
    with open('E:/Code/Python/Project/txtGet/1.txt', mode, encoding='utf-8') as f:
        f.write(string)
    print(title + ' written')
    next_ = soup.find(attrs={"class": "next"})
    if next_ is None or not next_.get('href'):
        break
    urls.append(base + next_['href'])  # enqueue the next chapter
    time.sleep(random.randint(1, 5))  # random pause: don't hammer the site, don't look robotic either
```
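The chapter-cleanup pass in the middle of the loop is pure string manipulation, so it can also be pulled out into a standalone function that is easy to test in isolation. A minimal sketch (the function name `clean_chapter` is my own; the replacement rules mirror the ones used above):

```python
def clean_chapter(raw: str) -> str:
    """Strip layout characters, normalize quotes, split paragraphs on
    non-breaking spaces, and cut the site's trailing notice."""
    for ch in ('\u3000', '\t', '\n', '\r', '\ufffd'):
        raw = raw.replace(ch, '')
    raw = raw.replace('『', '“').replace('』', '”')
    parts = [p for p in raw.split('\xa0') if p]  # drop empty fragments
    lines = []
    for p in parts:
        if '本站重要通知' in p:  # everything from the notice onward is boilerplate
            p = p[:p.index('本站重要通知')]
        lines.append(' ' + p)   # indent each paragraph with a space
    return '\n'.join(lines)
```

Factoring it out this way keeps the crawl loop focused on navigation and I/O, and lets you tweak the cleanup rules without touching the network code.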
Follow-up articles:
Crawling a novel with Python (Step 4)
Crawling a novel with Python (Step 5)
Crawling a novel with Python (Step 6)