Crawler framework: php-spider
Introduction:
The easiest way to install PHP-Spider is with Composer. You can find it on Packagist.
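If you are adding the library to an existing project rather than cloning the repository, a typical Composer setup looks like the following (the package is published on Packagist under the `vdb` vendor):

```shell
# Install php-spider into an existing Composer project.
# Assumes the composer binary is on your PATH.
composer require vdb/php-spider
```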
Features:
(1) supports two traversal algorithms: breadth-first and depth-first
(2) supports crawl depth limiting, queue size limiting and max downloads limiting
(3) supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
(4) comes with a useful set of URI filters, such as domain limiting
(5) supports custom URI filters, both prefetch (URI) and postfetch (resource content)
(6) supports custom request handling logic
(7) comes with a useful set of persistence handlers (memory, file; Redis soon to follow)
(8) supports custom persistence handlers
(9) collects statistics about the crawl for reporting
(10) dispatches useful events, allowing developers to add even more custom behavior
(11) supports a politeness policy
(12) will soon come with many default discoverers: RSS, Atom, RDF, etc.
(13) will soon support multiple queueing mechanisms (file, memcache, redis)
(14) will eventually support distributed spidering with a central queue
Tutorial:
Installing on Windows requires a Composer environment.
(1) Download https://github.com/mvdbos/php-spider and place it locally, e.g. under htdocs in an XAMPP environment.
(2) Enter the directory and run composer update:
cd php-spider-master // enter the directory
composer update // install the dependencies
Then open the example in a browser:
http://localhost/php-spider-master/example/example_simple.php
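A web server is not strictly required: the example can also be run from the command line, assuming the PHP CLI is installed and dependencies have been fetched:

```shell
# Run the simple example with the PHP CLI instead of a browser.
cd php-spider-master
composer update
php example/example_simple.php
```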
Usage tips (freely translated, not literal; corrections welcome):
(1) This is a very simple example; the code can be found in example/example_simple.php. For a more complete example with logging, caching and filters, see example/example_complex.php, which is closer to a real-world setup.
First, create the spider:
$spider = new Spider('http://www.dmoz.org');
(2) Add a URI discoverer. Without one, the spider does nothing. In this case, we want all <a> nodes from a certain <div>:
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
(3) Set some sane options for this example. In this case, we only get the first 10 items from the start page:
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;
(4) Add a listener to collect stats from the Spider and the QueueManager. There are more components that dispatch events you can use:
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);
Then execute the crawl:
$spider->crawl();
(5) When crawling is done, we can get some info about the crawl:
echo "\n ENQUEUED: " . count($statsHandler->getQueued());
echo "\n SKIPPED: " . count($statsHandler->getFiltered());
echo "\n FAILED: " . count($statsHandler->getFailed());
echo "\n PERSISTED: " . count($statsHandler->getPersisted());
(6) Finally, we can do some processing on the downloaded resources. In this example, we echo the title of each resource:
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}
[Contributing](https://github.com/mvdbos/php-spider#contributing)
Demo:
Edit the example file and change the seed URL. This step makes network requests to fetch the page source:
$spider = new Spider('http://www.hao123.com/');
Then add a URI discoverer to filter out the nodes you want. In this case, we want <a> tags from a certain <div>:
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='layoutContainer']//a"));
Common usage, step by step:
1. Set the URI we want to start crawling with:
$seed = 'http://www.dmoz.org';
2. Decide whether subdomains may be crawled. Here we allow all subdomains of dmoz.org:
$allowSubDomains = true;
3. Create the spider and limit it to 10 downloads:
$spider = new Spider($seed);
$spider->getDownloader()->setDownloadLimit(10);
4. Set up stats and log handlers and an in-memory queue manager, and subscribe the handlers to the queue manager's events:
$statsHandler = new StatsHandler();
$LogHandler = new LogHandler();
$queueManager = new InMemoryQueueManager();
$queueManager->getDispatcher()->addSubscriber($statsHandler);
$queueManager->getDispatcher()->addSubscriber($LogHandler);
5. Set some sane defaults: only visit the first level of www.dmoz.org, and stop at 10 queued resources:
$spider->getDiscovererSet()->maxDepth = 1;
$queueManager->maxQueueSize = 10;
6. This time, set the traversal algorithm to breadth-first; the default is depth-first:
$queueManager->setTraversalAlgorithm(InMemoryQueueManager::ALGORITHM_BREADTH_FIRST);
$spider->setQueueManager($queueManager);
7. Add a URI discoverer. Without one, the spider wouldn't get past the seed resource:
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//*[@id='cat-list-content-2']/div/a"));
8. Tell the spider to save all found resources on the filesystem:
$spider->getDownloader()->setPersistenceHandler(
new \VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler(__DIR__ . '/results')
);
9. Add some prefetch filters. These are executed before a resource is requested; the more of these you have, the fewer HTTP requests and the less work for the processors:
$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http')));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed), $allowSubDomains));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());
$spider->getDiscovererSet()->addFilter(new UriWithQueryStringFilter());
10. Add an event listener to the crawler that implements a politeness policy: the spider waits between every request to the same domain (here 100 ms):
$politenessPolicyEventListener = new PolitenessPolicyListener(100);
$spider->getDownloader()->getDispatcher()->addListener(
SpiderEvents::SPIDER_CRAWL_PRE_REQUEST,
array($politenessPolicyEventListener, 'onCrawlPreRequest')
);
$spider->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($LogHandler);
11. Add a listener that lets the user stop the script:
$spider->getDispatcher()->addListener(
SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
function (Event $event) {
echo "\nCrawl aborted by user.\n";
exit();
}
);
12. Add a CLI progress meter:
echo "\nCrawling";
$spider->getDownloader()->getDispatcher()->addListener(
SpiderEvents::SPIDER_CRAWL_POST_REQUEST,
function (Event $event) {
echo '.';
}
);
13. Set up some caching, logging and profiling on the spider's HTTP client. Note that $timerMiddleware must have been created earlier; the repository's example code provides a TimerMiddleware helper for this:
$guzzleClient = $spider->getDownloader()->getRequestHandler()->getClient();
$tapMiddleware = Middleware::tap([$timerMiddleware, 'onRequest'], [$timerMiddleware, 'onResponse']);
$guzzleClient->getConfig('handler')->push($tapMiddleware, 'timer');
14. Execute the crawl:
$result = $spider->crawl();
15. Report the results:
echo "\n\nSPIDER ID: " . $statsHandler->getSpiderId();
echo "\n ENQUEUED: " . count($statsHandler->getQueued());
echo "\n SKIPPED: " . count($statsHandler->getFiltered());
echo "\n FAILED: " . count($statsHandler->getFailed());
echo "\n PERSISTED: " . count($statsHandler->getPersisted());
16. With the information from the plugins and listeners, we can compute some metrics. Note that $start must hold a microtime(true) timestamp taken when the script began:
$peakMem = round(memory_get_peak_usage(true) / 1024 / 1024, 2);
$totalTime = round(microtime(true) - $start, 2);
$totalDelay = round($politenessPolicyEventListener->totalDelay / 1000 / 1000, 2);
echo "\n\nMETRICS:";
echo "\n PEAK MEM USAGE: " . $peakMem . 'MB';
echo "\n TOTAL TIME: " . $totalTime . 's';
echo "\n REQUEST TIME: " . $timerMiddleware->getTotal() . 's';
echo "\n POLITENESS WAIT TIME: " . $totalDelay . 's';
echo "\n PROCESSING TIME: " . ($totalTime - $timerMiddleware->getTotal() - $totalDelay) . 's';
17. Finally, we can do some processing on the downloaded resources:
echo "\n\nDOWNLOADED RESOURCES: ";
$downloaded = $spider->getDownloader()->getPersistenceHandler();
foreach ($downloaded as $resource) {
$title = $resource->getCrawler()->filterXpath('//title')->text();
$contentLength = (int) $resource->getResponse()->getHeaderLine('Content-Length');
// do something with the data
echo "\n - " . str_pad("[" . round($contentLength / 1024), 4, ' ', STR_PAD_LEFT) . "KB] $title";
}
echo "\n";