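My girlfriend has spent the last few evenings watching movies ╮(╯▽╰)╭ and leaving me to entertain myself. Hmph, it's only movies, right? I'll build you a whole library!
No sooner said than done. A few days ago, with guidance from Brother Cheng (程老哥), I finally understood how data gets passed from one callback to the next when crawling multi-level pages. The site I picked today, 陽光電影網 (Sunshine Movies), has exactly this kind of structure: http://www.ygdy8.com/ I went with my favorite category, European and American films (歐美).
Heh heh heh.
The start page looks like this: http://www.ygdy8.com/html/gndy/oumei/list_7_1.html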
Enough talk, here's the code first:
```python
# -*- coding: utf-8 -*-
import scrapy

from yangguang.items import YangguangItem


class Ygdy8ComSpider(scrapy.Spider):
    # A plain Spider is used here: CrawlSpider reserves parse() for its own
    # rule handling, so overriding it there is discouraged.
    name = "ygdy8.com"
    allowed_domains = ["ygdy8.com"]
    start_urls = ['http://www.ygdy8.com/html/gndy/oumei/list_7_1.html']

    def parse(self, response):
        """Parse a list page and follow every movie link on it."""
        print(response.url)
        infos = response.xpath('//table[@border="0"]/tr[2]/td[2]/b/a[2]')
        for info in infos:
            item = YangguangItem()
            next_page_link = info.xpath('@href')[0].extract()
            next_page_name = info.xpath('text()')[0].extract()
            # The http:// prefix is required, otherwise Request raises an error.
            full_page_link = 'http://www.ygdy8.com' + next_page_link
            item['next_page_name'] = next_page_name
            item['full_page_link'] = full_page_link
            # As usual, hand the detail-page URL to the next callback and
            # carry the partially filled item along in meta.
            yield scrapy.Request(url=full_page_link, meta={'item_1': item},
                                 callback=self.parse_page)

        # Queue the remaining list pages (2 to 163). Every list page runs this
        # loop again; Scrapy's duplicate filter drops the repeated requests.
        for i in range(2, 164):
            url = 'http://www.ygdy8.com/html/gndy/oumei/list_7_%s.html' % i
            yield scrapy.Request(url, callback=self.parse)

    def parse_page(self, response):
        """Parse a detail page reached from parse()."""
        list_item = response.meta['item_1']  # item passed down from parse()
        item = YangguangItem()
        # Browsers often insert <tbody> themselves; if this matches nothing,
        # try the same XPath without /tbody.
        dy_link = response.xpath('//table[@border="0"]/tbody/tr/td/a/@href').extract()
        item['dy_link'] = dy_link  # a list of download links
        item['next_page_name'] = list_item['next_page_name']
        item['full_page_link'] = list_item['full_page_link']
        print(item)
        yield item
```
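The comment about prefixing `http://www.ygdy8.com` is really about turning the relative `href` from the list page into an absolute URL, since `scrapy.Request` refuses a URL without a scheme. Here is a minimal sketch of that step using only the standard library's `urljoin`. The relative path is a made-up example rather than a real page on the site, and `list_page_url` is a hypothetical helper that does not appear in the spider.

```python
from urllib.parse import urljoin

BASE_LIST_URL = 'http://www.ygdy8.com/html/gndy/oumei/list_7_1.html'


def list_page_url(page):
    """Build the URL of a numbered list page (hypothetical helper)."""
    return 'http://www.ygdy8.com/html/gndy/oumei/list_7_%s.html' % page


# Made-up relative href, shaped like the ones found on the list page.
relative_href = '/html/gndy/dyzz/example.html'

# urljoin resolves the href against the page it was found on.
absolute_url = urljoin(BASE_LIST_URL, relative_href)
print(absolute_url)

# Pages 2..163, the same range the spider loops over.
later_pages = [list_page_url(i) for i in range(2, 164)]
print(len(later_pages), later_pages[0], later_pages[-1])
```

Inside a Scrapy callback, `response.urljoin(next_page_link)` performs the same resolution without hard-coding the domain.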
items
```python
import scrapy


class YangguangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    full_page_link = scrapy.Field()
    dy_link = scrapy.Field()
    next_page_name = scrapy.Field()
```
pipelines
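```python
import pymysql


def dbHandle():
    """Open a connection to the local MySQL server."""
    conn = pymysql.connect(
        host="localhost",
        user="root",
        passwd="password",  # replace with your own password
        charset="utf8",
        use_unicode=False
    )
    return conn


class YangguangPipeline(object):
    def open_spider(self, spider):
        # One connection for the whole crawl instead of a new one per item.
        self.conn = dbHandle()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        sql = "insert into ygdy.dy(dy_link,next_page_name,full_page_link) values (%s,%s,%s)"
        # dy_link is a list of links; join it so it fits in a single column.
        dy_link = ';'.join(item['dy_link'])
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(sql, (dy_link, item['next_page_name'], item['full_page_link']))
            self.conn.commit()
        except Exception as e:
            print("Error here >>>>", e, "<<<<<< error here")
            self.conn.rollback()
        return item
```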
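To try the insert logic without a MySQL server, here is a rough sketch that swaps in `sqlite3` from the standard library as a stand-in. This is not what the pipeline above uses: the column types are my assumption (the post never shows the definition of `ygdy.dy`), `save_item` is a hypothetical helper, and the item is made-up data. Note that `sqlite3` uses `?` placeholders where `pymysql` uses `%s`. Since `dy_link` comes out of `.extract()` as a list, the sketch joins it into one string before storing it.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table ygdy.dy.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE dy (dy_link TEXT, next_page_name TEXT, full_page_link TEXT)'
)


def save_item(conn, item):
    """Insert one scraped item; dy_link may be a list of links."""
    links = item['dy_link']
    if isinstance(links, (list, tuple)):
        links = ';'.join(links)  # one column, so flatten the list
    conn.execute(
        'INSERT INTO dy (dy_link, next_page_name, full_page_link) VALUES (?, ?, ?)',
        (links, item['next_page_name'], item['full_page_link']),
    )
    conn.commit()


# Made-up item with the same three fields as YangguangItem.
save_item(conn, {
    'dy_link': ['ftp://example.invalid/a.mkv', 'ftp://example.invalid/b.mkv'],
    'next_page_name': 'Example Movie',
    'full_page_link': 'http://www.ygdy8.com/html/gndy/dyzz/example.html',
})

rows = conn.execute('SELECT dy_link, next_page_name FROM dy').fetchall()
print(rows)
```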
settings
At first nothing would get saved to the database. I asked Brother Cheng, and the culprit turned out to be this block:
```python
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
}
```
This block is commented out by default, so you have to uncomment it in settings.py before the pipeline will run.
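The number after the pipeline path is its order: Scrapy runs the enabled pipelines from the lowest value to the highest, and the values are conventionally kept between 0 and 1000. Here is a small sketch of that ordering in plain Python. `CleanupPipeline` is a hypothetical second pipeline added only for illustration; it is not part of this project.

```python
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
    'yangguang.pipelines.CleanupPipeline': 100,  # hypothetical
}

# Lower numbers run first, so the hypothetical cleanup step would see
# each item before it is written to MySQL.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```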
Finally, a look at what got saved.

Honey, come ask me for movies..
