url:
the url or a list of urls to be crawled.
callback:
the method to parse the response. default: __call__
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
age:
the period of validity of the task. The page is regarded as not modified during this period. default: -1 (never recrawl)
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
priority:
the priority of the task to be scheduled; the higher, the better. default: 0
def index_page(self, response):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)
exetime:
the scheduled execution time of the task, as a unix timestamp. default: 0 (immediately)
import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time()+30*60)
retries:
number of retries when the fetch fails. default: 3
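A minimal sketch (the url and retry count here are illustrative):

def on_start(self):
    # allow up to 5 attempts before the task is marked as failed
    self.crawl('http://www.example.org/', callback=self.callback,
               retries=5)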
itag:
a marker from the frontier page that reveals a potential modification of the task. It is compared to its last value, and the task is recrawled when it changes. default: None
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())
auto_recrawl:
when enabled, the task is recrawled every age seconds. default: False
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
method:
HTTP method to use. default: GET
params:
dictionary of URL parameters to append to the URL.
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
data:
the body to attach to the request. If a dictionary is provided, form-encoding will take place.
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
files:
dictionary of {field: {filename: 'content'}} files for multipart upload.
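A hedged sketch of a multipart upload; the field name, filename, and content are illustrative (httpbin.org/post simply echoes what it receives):

def on_start(self):
    # upload one in-memory file as a multipart/form-data POST
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               files={'file': {'report.txt': 'hello world'}})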
user_agent:
the User-Agent of the request
headers:
dictionary of headers to send.
cookies:
dictionary of cookies to attach to this request.
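A minimal sketch combining user_agent, headers, and cookies (all values are illustrative; httpbin.org/get echoes the headers it receives):

def on_start(self):
    # send a custom identity, an extra header, and a cookie with the request
    self.crawl('http://httpbin.org/get', callback=self.callback,
               user_agent='MyCrawler/1.0',
               headers={'Accept-Language': 'en'},
               cookies={'session': 'abc123'})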
connect_timeout:
timeout for initial connection in seconds. default: 20
timeout:
maximum time in seconds to fetch the page. default: 120
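A short sketch that tightens both timeouts (the values are illustrative):

def on_start(self):
    # fail fast: 10 seconds to connect, 60 seconds for the whole fetch
    self.crawl('http://www.example.org/', callback=self.callback,
               connect_timeout=10, timeout=60)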
allow_redirects:
follow 30x redirects. default: True
validate_cert:
For HTTPS requests, validate the server’s certificate? default: True
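A hedged sketch for a host with a self-signed certificate (the url is illustrative):

def on_start(self):
    # accept an untrusted certificate by skipping validation
    self.crawl('https://self-signed.example.org/', callback=self.callback,
               validate_cert=False)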
proxy:
proxy server to use, in the form username:password@hostname:port; only HTTP proxies are currently supported.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
etag:
use the HTTP ETag mechanism to skip processing if the content of the page has not changed. default: True
last_modified:
use the HTTP Last-Modified header mechanism to skip processing if the content of the page has not changed. default: True
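A minimal sketch that forces a full re-fetch by disabling both cache-validation mechanisms (the url is illustrative):

def on_start(self):
    # always re-download, even if the server reports the page unchanged
    self.crawl('http://www.example.org/', callback=self.callback,
               etag=False, last_modified=False)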
fetch_type:
set to 'js' to enable the JavaScript fetcher. default: None
js_script:
JavaScript to run before or after the page is loaded; it should be wrapped in a function, e.g. function() { document.write("binux"); }.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0, document.body.scrollHeight);
                   return 123;
               }
               ''')
js_run_at:
run the JavaScript specified via js_script at document-start or document-end. default: document-end
js_viewport_width/js_viewport_height:
set the size of the viewport used by the JavaScript fetcher during the layout process.
load_images:
load images when the JavaScript fetcher is enabled. default: False
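A hedged sketch combining the JavaScript-fetcher options above (the url, viewport size, and script are illustrative):

def on_start(self):
    # run a script at document-start, in a fixed viewport, with images loaded
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_run_at='document-start',
               js_viewport_width=1024, js_viewport_height=768,
               load_images=True,
               js_script='function() { console.log("start"); }')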
save:
an object passed to the callback method; it can be accessed via response.save.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
taskid:
unique id to identify the task; the default is the MD5 checksum of the URL. It can be overridden by the method def get_taskid(self, task).
import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))
force_update:
force updating the task params even if the task is in ACTIVE status.
cancel:
cancel a task; it should be used together with force_update to cancel an active task. To cancel an auto_recrawl task, you should set auto_recrawl=False as well.
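A minimal sketch of cancelling a previously scheduled auto_recrawl task (the url is illustrative):

def on_start(self):
    # stop the recurring task and remove it from the schedule
    self.crawl('http://www.example.org/', callback=self.callback,
               cancel=True, force_update=True, auto_recrawl=False)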
cURL command
self.crawl(curl_command)
cURL is a command line tool for making HTTP requests. A cURL command can easily be obtained from the Chrome DevTools > Network panel: right-click the request and choose "Copy as cURL".
You can use a cURL command as the first argument of self.crawl. It will be parsed, and the HTTP request will be made just like curl does.
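A hedged sketch; the command string is an illustrative stand-in for one copied from DevTools:

def on_start(self):
    # the cURL command is parsed into url, method, headers, and body
    self.crawl("curl 'http://httpbin.org/get' -H 'Accept: application/json'",
               callback=self.callback)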
@config(**kwargs)
default parameters of self.crawl when the decorated method is used as the callback. The config follows the callback of the created task: in the example below, tasks with callback=self.index_page use age=15*60, while tasks with callback=self.detail_page use age=10*24*60*60.
@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}
Handler.crawl_config = {}
default parameters of self.crawl for the whole project.
The parameters in crawl_config for the scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are joined when the task is created; the parameters for the fetcher and processor are joined when the task is executed.
You can use this mechanism to change the fetch config (e.g. cookies) afterwards.
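A minimal sketch of a project-wide config; the header and timeout values are illustrative assumptions:

class Handler(BaseHandler):
    # defaults applied to every self.crawl call in this project
    crawl_config = {
        'headers': {'User-Agent': 'MyCrawler/1.0'},
        'timeout': 60,
    }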