These are my notes from the whole debugging process, shared here:
Scraping 20 Million Douban Book Summaries with Python: (1) Analyzing the Target API
Scraping 20 Million Douban Book Summaries with Python: (2) Simple Python Requests with urllib2
Scraping 20 Million Douban Book Summaries with Python: (3) Exception Handling
Scraping 20 Million Douban Book Summaries with Python: (4) Multi-Process Concurrency
Scraping 20 Million Douban Book Summaries with Python: (5) Database Design
Scraping 20 Million Douban Book Summaries with Python: (6) Database Access Class
Scraping 20 Million Douban Book Summaries with Python: (7) Proxy IPs
Scraping 20 Million Douban Book Summaries with Python: (8) Summary
Database Design
The point of scraping the data is to store it for later analysis, so it needs a matching database design.
The structure of each response is roughly fixed, and I did not want an elaborate design, so I kept things as simple as possible: the final design stores one row per book and drops the longer fields.
Sample request for a single book:
https://api.douban.com/v2/book/10554308
The response looks like this:
{"rating":{"max":10,"numRaters":99295,"average":"9.2","min":0},"subtitle":"","author":["東野圭吾"],"pubdate":"2013-1-1","tags":[{"count":28691,"name":"東野圭吾","title":"東野圭吾"},{"count":13320,"name":"懸疑推理","title":"懸疑推理"},{"count":10713,"name":"日系推理","title":"日系推理"},{"count":9958,"name":"推理","title":"推理"},{"count":9058,"name":"日本文學","title":"日本文學"},{"count":8985,"name":"日本","title":"日本"},{"count":8460,"name":"小說","title":"小說"},{"count":7434,"name":"模糊式愛情","title":"模糊式愛情"}],"origin_title":"","image":"https://img1.doubanio.com\/mpic\/s29384019.jpg","binding":"精裝","translator":["劉姿君"],"catalog":"","ebook_url":"https:\/\/read.douban.com\/ebook\/680843\/","pages":"538","images":{"small":"https://img1.doubanio.com\/spic\/s29384019.jpg","large":"https://img1.doubanio.com\/lpic\/s29384019.jpg","medium":"https://img1.doubanio.com\/mpic\/s29384019.jpg"},"alt":"https:\/\/book.douban.com\/subject\/10554308\/","id":"10554308","publisher":"南海出版公司","isbn10":"7544258602","isbn13":"9787544258609","title":"白夜行","url":"https:\/\/api.douban.com\/v2\/book\/10554308","alt_title":"","author_intro":"東野圭吾\n日本著名作家。\n1985年,憑《放學后》獲第31屆江戶川亂步獎,開始專職寫作;\n1999年,《秘密》獲第52屆日本推理作家協會獎;此后《白夜行》、《單戀》、《信》、《幻夜》先后入圍直木獎。\n2005年出版的《嫌疑人X的獻身》史無前例地將第134屆直木獎、第6屆本格推理小說大獎,以及年度三大推理小說排行榜第1名一并斬獲;\n2008年,《流星之絆》獲第43屆新風獎。\n2009年出版的《新參者》獲兩大推理小說排行榜年度第1名;\n2012年,《浪矢雜貨店的奇跡》獲第7屆中央公論文藝獎。","summary":"東野圭吾萬千書迷心中的無冕之王\n周刊文春推理小說年度BEST10第1名\n本格推理小說年度BEST10第2名\n《白夜行》是東野圭吾迄今口碑最好的長篇杰作,具備經典名著的一切要素:\n一宗離奇命案牽出跨度近20年步步驚心的故事:悲涼的愛情、吊詭的命運、令人發指的犯罪、復雜人性的對決與救贖……\n-------------------------------------------------------------------\n1973年,大阪的一棟廢棄建筑中發現一名遭利器刺死的男子。案件撲朔迷離,懸而未決。\n此后20年間,案件滋生出的惡逐漸萌芽生長,綻放出惡之花。案件相關者的人生逐漸被越來越重的陰影籠罩……\n“我的天空里沒有太陽,總是黑夜,但并不暗,因為有東西代替了太陽。雖然沒有太陽那么明亮,但對我來說已經足夠。\n憑借著這份光,我便能把黑夜當成白天。\n我從來就沒有太陽,所以不怕失去。”\n“只希望能手牽手在太陽下散步”,這句象征本書故事內核的絕望念想,有如一個美麗的幌子,隨著無數凌亂、壓抑、悲涼的事件片段如紀錄片一樣一一還原,最后一絲溫情也被完全拋棄,萬千讀者在一曲救贖罪惡的愛情之中悲切動容。","ebook_price":"15.00","series":{"id":"868","title":"新經典文庫·東野圭吾作品"},"price":"CNY 39.50"}
Reluctantly, I dropped the longest field, the book summary, and flattened the tags into a single one-dimensional string of the form aaaa&bbbb&cccc.
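The flattening step could be sketched roughly as follows. This is only an illustration, not the code used in this series: the function name `flatten_book` is hypothetical, joining multiple authors with `&` is an assumption (the post only states that format for tags), and the sample dict is a trimmed copy of the response above.

```python
def flatten_book(book):
    """Turn one parsed API response (a dict) into a flat row for the books table."""
    rating = book.get("rating") or {}
    images = book.get("images") or {}
    row = {
        "id": int(book["id"]),
        "rating": rating.get("average", ""),
        "numRaters": str(rating.get("numRaters", "")),
        "large_image": images.get("large", ""),
        # Assumption: authors are joined the same way as tags.
        "author": "&".join(book.get("author") or []),
        # Tags arrive as a list of objects; keep only the names.
        "tags": "&".join(tag["name"] for tag in book.get("tags") or []),
    }
    # Short scalar fields are copied unchanged. Long fields such as
    # summary are never copied, which is how they get dropped.
    for key in ("isbn13", "publisher", "pages", "title", "image", "alt",
                "isbn10", "subtitle", "price", "binding", "pubdate"):
        row[key] = book.get(key, "")
    return row


# Trimmed version of the sample response shown above.
sample = {
    "id": "10554308",
    "title": "白夜行",
    "author": ["東野圭吾"],
    "rating": {"max": 10, "numRaters": 99295, "average": "9.2", "min": 0},
    "tags": [{"count": 28691, "name": "東野圭吾", "title": "東野圭吾"},
             {"count": 13320, "name": "懸疑推理", "title": "懸疑推理"}],
    "images": {"large": "https://img1.doubanio.com/lpic/s29384019.jpg"},
    "summary": "a very long text that is not stored",
}

row = flatten_book(sample)
print(row["tags"])
```

Missing keys fall back to an empty string here, which is one possible choice; storing NULL instead would match the `DEFAULT NULL` columns more literally.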
The final schema is as follows:
SET NAMES utf8;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for `books`
-- ----------------------------
DROP TABLE IF EXISTS `books`;
CREATE TABLE `books` (
`id` int(10) NOT NULL,
`isbn13` varchar(250) DEFAULT NULL,
`publisher` varchar(30) DEFAULT NULL,
`pages` varchar(20) DEFAULT NULL,
`title` varchar(250) DEFAULT NULL,
`image` varchar(250) DEFAULT NULL,
`alt` varchar(250) DEFAULT NULL,
`isbn10` varchar(250) DEFAULT NULL,
`large_image` varchar(250) DEFAULT NULL,
`subtitle` varchar(250) DEFAULT NULL,
`price` varchar(250) DEFAULT NULL,
`rating` varchar(250) DEFAULT NULL,
`numRaters` varchar(230) DEFAULT NULL,
`binding` varchar(250) DEFAULT NULL,
`author` varchar(250) DEFAULT NULL,
`tags` varchar(250) DEFAULT NULL,
`pubdate` varchar(250) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SET FOREIGN_KEY_CHECKS = 1;
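These fields alone are enough for my later analysis.
Below are the SQL statements I use most often:
SQL usage notes
PS: the command line is much nicer to use than a GUI, mainly because it is fast.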
mac:~ caobo56$ mysql -u root -p
// log in to the MySQL server
mysql> show databases;
// list the databases
mysql> USE doubanbook;
// switch to the given database
mysql> show tables;
// list all tables in the database
mysql> show create table books;
// show the CREATE TABLE statement; useful for checking columns, character sets, lengths, etc.
mysql> alter table books change title title varchar(20) CHARSET utf8;
// change the character set of one column (note that this also redefines its length as 20)
mysql> alter table books default character set utf8;
// change the default character set of a table
mysql> truncate table books;
// empty a table
mysql> select * from books;
// select all rows in the table
mysql> select max(id) from books;
// find the largest id in the table
mysql> update books set summary=replace(summary,' ','');
// strip the spaces from the summary column of the books table
mysql> alter table books drop summary;
// drop the summary column from the books table
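mysql> select title from books where rating>9;
// titles of books with a rating above 9
mysql> select title,rating from books where rating>9;
// title and rating of books with a rating above 9
mysql> select id,title,rating,numRaters from books where rating>9;
// id, title, rating and numRaters of books with a rating above 9
mysql> insert into books select * from book group by id;
// copy the rows of book into books, grouped by id so that each id appears once (with ONLY_FULL_GROUP_BY enabled, the default since MySQL 5.7, this statement may be rejected)
mysql> update thread_index set bookid = 1030897 where id = 3;
// set bookid to 1030897 for the thread_index row whose id is 3
mysql>
select id+1 from books a
where not exists(select * from books b where b.id = a.id + 1)
and id < (select MAX(id) from books) ;
// find the positions where the ids stop being contiguous; this returns the first missing id of each gap, not every missing id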
mysql> SELECT title FROM books WHERE title LIKE '%健康%';
// fuzzy search for titles containing "健康" (health)
mysql> SELECT sum(DATA_LENGTH)/1024/1024+sum(INDEX_LENGTH)/1024/1024 FROM information_schema.TABLES where TABLE_SCHEMA='doubanbook';
// compute the size of the database in MB (data plus indexes)
mysql> INSERT INTO books (id, isbn13, publisher, pages, title, image, alt, isbn10, subtitle, price, binding,pubdate, large_image, rating, numRaters,tags, author) VALUES (1715056,"9787533913311","浙江文藝出版社","255","\"一九九七\"之夜","https://img3.doubanio.com\/mpic\/s6250116.jpg","https:\/\/book.douban.com\/subject\/1715056\/","7533913310","","12.40元","平裝","2000-2","https://img3.doubanio.com\/lpic\/s6250116.jpg","0.0","0","","陶然");
// insert statement; double quotes inside a value must be escaped with a backslash
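When the insert is issued from Python rather than typed by hand, the manual escaping can usually be avoided by passing values as query parameters. The sketch below uses the standard-library `sqlite3` module purely as a stand-in so that it runs without a MySQL server; the helper `build_insert` and the three-column table are hypothetical. MySQL drivers such as MySQLdb or PyMySQL follow the same DB-API pattern but use `%s` as the placeholder instead of `?`.

```python
import sqlite3


def build_insert(table, columns, placeholder="?"):
    """Build a parameterized INSERT statement for the given columns."""
    cols = ", ".join(columns)
    marks = ", ".join([placeholder] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(table, cols, marks)


# Stand-in database with a cut-down version of the books table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")

columns = ["id", "title", "author"]
values = [1715056, '"一九九七"之夜', "陶然"]  # the title contains double quotes

sql = build_insert("books", columns)
conn.execute(sql, values)  # the driver handles quoting of the values
conn.commit()

stored = conn.execute("SELECT title FROM books WHERE id = ?",
                      (1715056,)).fetchone()[0]
print(stored)
```

Only the values are parameterized; the table and column names are still formatted into the string, so they should come from a fixed list in the code and never from scraped data.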
Join query
SELECT works.id,works.title,works.author_id,works.author,works.dynasty,works.kind,works.kind_cn,works.foreword,works.content,works.intro,works.layout,dynasties.name,dynasties.start_year AS worksDy
FROM works INNER JOIN dynasties
ON works.dynasty = dynasties.name
ORDER BY dynasties.start_year;