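The goal is simply to repost someone else's article to a WeChat official account. The formatting usually can't be copied and pasted, so what to do? Scrape the page source.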
import requests
import re
import time
from lxml import html
from selenium import webdriver
r = requests.get(url='https://mp.weixin.qq.com/s?__biz=MzA5NjgxNjgxNQ==&mid=403557217&idx=1&sn=3b8038565f9c699a0121f64aed2f5d22&mpshare=1&scene=1&srcid=1206O2RAeNX16c88CbMrryCI&key=f57fc7001c9b61fadf60eb0d80c982c3f9b772f324115b802c9c69eba4603a5f6da7bf5ee9975261ac5812427e154113c8c2eba3f19dbf10c35ae2251b4f6aed955bd68532a3f4248069b54851973942&ascene=0&uin=MjEyODY1MzIwMQ%3D%3D&devicetype=iMac+MacBookPro11%2C1+OSX+OSX+10.12.3+build(16D32)&version=11000003&pass_ticket=5jR8RnNSI7woS8zm30GvzXC2C8NHS5ayD4%2B7qltAzc%2FzfQgzX4KOt1d3LtJrvfVD') # the most basic GET request
re.S makes "." match newlines as well, so the pattern can span multiple lines. Without it the search finds nothing, which cost me a lot of time.
content = re.findall(r'<div class="rich_media_content " id="js_content">.*?</div>',r.text, re.S)
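To see the effect of re.S in isolation, here is a minimal sketch on a made-up snippet. The `sample` string is an assumption for illustration only, not the real article page.

```python
import re

# Hypothetical snippet standing in for the article page; the real page is much larger.
sample = '<div class="rich_media_content " id="js_content">\n<p>hello</p>\n</div>'
pattern = r'<div class="rich_media_content " id="js_content">.*?</div>'

# Without re.S, "." stops at newlines, so a div spanning several lines never matches.
without_flag = re.findall(pattern, sample)

# With re.S (DOTALL), "." also matches "\n" and the whole div is captured.
with_flag = re.findall(pattern, sample, re.S)

print(len(without_flag), len(with_flag))
```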
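There is a problem, though: what requests fetches is the raw page source, which does not match the elements you see in the browser. The browser shows the page after all of its JS requests have run. I searched around and found no fix for this with requests, so time for a different approach.
Use the automation tool selenium, which drives a real browser and therefore handles dynamic pages.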
from selenium import webdriver
import time
browser = webdriver.Chrome()
browser.get('https://mp.weixin.qq.com/s?__biz=MzA5NjgxNjgxNQ==&mid=403557217&idx=1&sn=3b8038565f9c699a0121f64aed2f5d22&mpshare=1&scene=1&srcid=1206O2RAeNX16c88CbMrryCI&key=f57fc7001c9b61fadf60eb0d80c982c3f9b772f324115b802c9c69eba4603a5f6da7bf5ee9975261ac5812427e154113c8c2eba3f19dbf10c35ae2251b4f6aed955bd68532a3f4248069b54851973942&ascene=0&uin=MjEyODY1MzIwMQ%3D%3D&devicetype=iMac+MacBookPro11%2C1+OSX+OSX+10.12.3+build(16D32)&version=11000003&pass_ticket=5jR8RnNSI7woS8zm30GvzXC2C8NHS5ayD4%2B7qltAzc%2FzfQgzX4KOt1d3LtJrvfVD')
time.sleep(60)
The sleep gives the page time to finish loading before the needed content is fetched.
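A fixed 60-second sleep works but always waits the full minute. One alternative, sketched here under assumptions, is to poll until the content shows up. `wait_until` is a hypothetical helper written for this example, not part of selenium (selenium ships its own `WebDriverWait` for the same purpose). With a real browser the condition could be something like `lambda: 'id="js_content"' in browser.page_source`; below, a fake condition stands in for the page.

```python
import time

def wait_until(condition, timeout=60.0, interval=0.5):
    """Call condition() repeatedly until it returns True or timeout seconds pass.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in for a page that finishes loading on the third check.
checks = {'count': 0}

def fake_page_loaded():
    checks['count'] += 1
    return checks['count'] >= 3

loaded = wait_until(fake_page_loaded, timeout=5.0, interval=0.01)
```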
import codecs
content = re.findall(r'<div class="rich_media_content " id="js_content">.*?</div>',browser.page_source, re.S)
# Strip the newline characters and write content to a file
new_content = content[0].replace('\n', '')
# print(new_content)
file_obj = codecs.open("/Users/xxx/Desktop/markdown/7.8.md", 'w', 'utf-8')
file_obj.write(new_content)
file_obj.close()
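One caveat with the non-greedy pattern: `.*?</div>` stops at the first `</div>`, so if the article body ever contains a nested `<div>`, the match is cut short. Below is a minimal sketch of a depth-counting alternative. `extract_div` is a hypothetical helper and the sample page is made up; it also assumes no `<div` text appears inside scripts or comments.

```python
import re

def extract_div(page, open_tag):
    """Return the div that starts with open_tag, including nested divs, or None."""
    start = page.find(open_tag)
    if start == -1:
        return None
    depth = 0
    # Walk over every opening and closing div tag from the start position onward.
    for m in re.finditer(r'<div\b|</div>', page[start:]):
        if m.group() == '</div>':
            depth -= 1
            if depth == 0:
                # m.end() is relative to the slice, so add start back.
                return page[start:start + m.end()]
        else:
            depth += 1
    return None

open_tag = '<div class="rich_media_content " id="js_content">'
sample_page = ('<body>' + open_tag +
               '<p>a</p><div>inner</div><p>b</p></div><div>other</div></body>')
article = extract_div(sample_page, open_tag)
```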
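At this point, opening 7.8.md with vim gives a page that looks exactly like the original one. It turns out some images don't display, though, since they are hosted for someone else's official account.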
imgs = re.findall(r'\"http://.*?\"', content[0], re.S)
for img in imgs:
    print(img)
    print('')
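Note that the pattern above returns each URL with its surrounding quotes, and the same image may appear more than once. A small sketch that strips the quotes with a capture group and drops duplicates follows; `extract_image_urls` and the example.com URLs are made up for illustration.

```python
import re

def extract_image_urls(html_text):
    """Return quoted http:// URLs from html_text, without quotes, in first-seen order."""
    urls = re.findall(r'"(http://.*?)"', html_text, re.S)
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

sample = ('<img src="http://example.com/a.jpg">'
          '<img src="http://example.com/b.jpg">'
          '<img src="http://example.com/a.jpg">')
image_urls = extract_image_urls(sample)
```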
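The images can be uploaded to your own official account. An account that has not been verified can only upload temporary media.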
import json
# Developers have an API for obtaining the token; I copied mine in here to use it
access_token="_RyG5BzY0Ait19ctrYtCmHe5-FT5VVqUy14HFFsa7BZbtq9btBE6diEFem6yjiuinZD7xApbqbJO6nwKhx99N9V2ClmPeUHHIthUqhkjH2XPKqB7S8u6Yc0bprsjh8GDVEEjAEALUU"
pp=requests.get("http://mmbiz.qpic.cn/mmbiz/x0QjkAOuB5YoQpVBrCWVdouMKd1UxjYhiaXnfQ3vF7KHiaFhQe91Gtsd1cNXZYzHoaGSpv2ak2M8pb9icSEkBKic1A/0?wx_fmt=jpeg").content # get the online png data (binary data)
files = {'media': ('temp2.png',pp)} # the first item "temp2.png" is the file name, the second one is the file data
upload_url="https://api.weixin.qq.com/cgi-bin/media/upload?access_token="+access_token+"&type=image" # set your access_token
r1 =requests.post(upload_url, files=files) # upload
media_id=json.loads(r1.content)['media_id'] # if it is success, you get media id
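The upload can fail (for example when the token has expired), in which case the JSON carries `errcode` and `errmsg` and there is no `media_id`, so the line above raises a KeyError. A hedged sketch of checking for that first; `parse_upload_response` is a hypothetical helper and both payloads are made-up samples, not real responses.

```python
import json

def parse_upload_response(body):
    """Return media_id from an upload response body, or raise RuntimeError on an API error."""
    data = json.loads(body)
    if data.get('errcode'):
        raise RuntimeError('upload failed: %s %s' % (data['errcode'], data.get('errmsg', '')))
    return data['media_id']

# Made-up sample payloads for a successful and a failed upload.
ok_body = '{"type": "image", "media_id": "MEDIA_ID_PLACEHOLDER", "created_at": 1488888888}'
err_body = '{"errcode": 40001, "errmsg": "invalid credential"}'

media_id_demo = parse_upload_response(ok_body)
try:
    parse_upload_response(err_body)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

With the real request this would be called as `parse_upload_response(r1.text)`.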
Then use the media_id to fetch the image and obtain its URL.
getload_url = "https://api.weixin.qq.com/cgi-bin/media/get?access_token="+access_token+"&media_id="+media_id
pp=requests.get(getload_url) # get the online png data (binary data)
print(dir(pp))
print(pp.url)
Just swap this URL in for the original image URL in the page, and the article is ready.
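The final swap can be sketched as a plain string replacement over the saved HTML. `replace_image_urls`, the sample page, and the example.com URLs are all assumptions for illustration; in practice the mapping would pair each original image URL with the URL obtained for its uploaded copy.

```python
def replace_image_urls(html_text, url_map):
    """Replace each old image URL in html_text with its new URL from url_map."""
    for old_url, new_url in url_map.items():
        html_text = html_text.replace(old_url, new_url)
    return html_text

page = ('<p>text</p>'
        '<img src="http://old.example.com/1.jpg">'
        '<img src="http://old.example.com/2.jpg">')
url_map = {
    'http://old.example.com/1.jpg': 'http://new.example.com/a',
    'http://old.example.com/2.jpg': 'http://new.example.com/b',
}
rewritten = replace_image_urls(page, url_map)
```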