A free, ready-to-use crawler for highly anonymous proxies

golang-proxy v3.0

golang-proxy is a ready-to-use crawler for highly anonymous proxies, and it is language-agnostic.
Project address: https://github.com/storyicon/golang-proxy

golang-proxy

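Chinese Documentation

Golang-Proxy is a simple, efficient crawler for free proxies. It crawls free proxies published on the web to maintain your own pool of highly anonymous proxies, for use in web crawling, resource downloading, and similar tasks.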

What's new in v3.0

  1. Still provides a highly flexible API: once the main program is running, you can open localhost:9999/all or localhost:9999/random in a browser to get crawled proxies directly. You can even use localhost:9999/sql?query= to run simple SQL statements and define your own proxy filtering rules.
  2. Still provides ready-to-use builds for Windows, Linux, and Mac
    Download Release v3.0
  3. Automatically detects the proxy type; schemeType indicates how far a proxy supports http and https
  4. Adds support for MySQL; see Config for details
  5. Supports starting services individually: when launching the compiled binary, use -mode= to start only producer, consumer, assessor, or service
  6. Redesigned the data tables. Note that this means the API has changed
  7. Redesigned the data structure of sources and removed fields such as filter. Note that this means v2.0 sources may cause problems when used directly with v3.0
  8. Updated some sources
  9. The -source startup flag is no longer supported

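How to use golang-proxy

1. Use the ready-to-use release

The Release page provides archives for each supported system; extract the one for your system and run it.

Ready-to-use download: Download Release v3.0

After downloading, extract the binary and the source directory from the archive into the same location and start the binary. The program starts the following services:

  1. producer: periodically crawls the sources defined in the source directory and writes the crawled proxies to the crude_proxy table
  2. consumer: periodically reads a batch of proxies from crude_proxy, determines their proxy type and availability, and writes them to the proxy table
  3. assessor: periodically reads a batch of proxies from the proxy table and evaluates their quality
  4. service: the HTTP API provided by golang-proxy, which lets you filter and fetch proxies from the crude_proxy and proxy tables through three endpoints: localhost:9999/all, localhost:9999/random, and localhost:9999/sql?query=

When you start the compiled binary, these services start one after another by default. In v3.0 you can add the -mode startup flag to start a single service, for example: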

golang-proxy -mode=service

Run this way, only the service module starts. Once it is running, you can open the following endpoints in a browser to get proxies:

URL                                      Description
localhost:9999/all                       Get all crawled proxies in the proxy table
localhost:9999/all?table=proxy           Get all crawled proxies in the proxy table
localhost:9999/all?table=crude_proxy     Get all crawled proxies in the crude_proxy table
localhost:9999/random                    Get one random proxy from the proxy table
localhost:9999/random?table=proxy        Get one random proxy from the proxy table
localhost:9999/random?table=crude_proxy  Get one random proxy from the crude_proxy table
localhost:9999/sql?query=                Append an SQL statement after query= to get its result; only fairly simple queries are supported
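
As a minimal sketch of how a client might build these URLs, the Python helper below assembles the three endpoints from the table above. The function name endpoint_url and the BASE constant are illustrative assumptions, not part of golang-proxy; the only facts taken from the table are the paths /all, /random, and /sql and the table and query parameters.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9999"  # default address used throughout this document


def endpoint_url(path, base=BASE, **params):
    """Build a URL for one of the API paths: 'all', 'random', or 'sql'.

    Keyword arguments become query-string parameters and are URL-encoded,
    which matters for /sql because SQL contains spaces and symbols.
    """
    if path not in ("all", "random", "sql"):
        raise ValueError(f"unknown endpoint: {path}")
    url = f"{base}/{path}"
    if params:
        url += "?" + urlencode(params)
    return url


if __name__ == "__main__":
    print(endpoint_url("all"))
    print(endpoint_url("random", table="crude_proxy"))
    print(endpoint_url("sql", query="SELECT * FROM proxy WHERE score > 5"))
```

Encoding the query explicitly avoids relying on the browser to escape spaces and the > sign in the SQL statement.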

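Note that crude_proxy is only a temporary table for crawled proxies, and their quality is not guaranteed. Proxies in the proxy table are evaluated continuously by the assessor; the score field in the proxy table gives a fairly complete picture of a proxy's quality, and proxies whose quality is too low are deleted.

API example: localhost:9999/sql

For example, visiting localhost:9999/sql?query=SELECT * FROM PROXY WHERE SCORE > 5 ORDER BY SCORE DESC returns all proxies in the proxy table with a score above 5, sorted by score from high to low: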

{
    "error": "",
    "message": [
        {
            "id": 2,
            "ip": "45.113.69.177",
            "port": "1080",
            // scheme_type takes one of the following values:
            // 0: the proxy supports http only
            // 1: the proxy supports https only
            // 2: the proxy supports both http and https
            "scheme_type": 0,
            "content": "45.113.69.177:1080",
            // number of evaluations
            "assess_times": 9,
            // number of successful evaluations; success_times/assess_times gives the proxy's connection success rate
            "success_times": 9,
            // average response time
            "avg_response_time": 0.098,
            // number of consecutive failures
            "continuous_failed_times": 0,
            // score; proxies scoring above 5 are recommended
            "score": 68.45106053570785,
            "insert_time": 1540793312,
            "update_time": 1540797880
        }
    ]
}
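
The following Python sketch shows one way a client might consume a response of this shape. It assumes that a non-empty error field signals a failed query and that the proxies are plain HTTP proxies reachable as http://ip:port; neither is confirmed by the text above. The helper names (parse_response, success_rate, proxies_mapping) are hypothetical.

```python
import json

# A trimmed copy of the sample response, with comments removed so it is valid JSON.
SAMPLE = """
{
    "error": "",
    "message": [
        {
            "id": 2,
            "ip": "45.113.69.177",
            "port": "1080",
            "scheme_type": 0,
            "content": "45.113.69.177:1080",
            "assess_times": 9,
            "success_times": 9,
            "score": 68.45106053570785
        }
    ]
}
"""


def parse_response(text):
    """Decode a response body; raise if 'error' is non-empty (an assumption)."""
    body = json.loads(text)
    if body["error"]:
        raise RuntimeError(body["error"])
    return body["message"]


def success_rate(proxy):
    """Return success_times / assess_times, or 0.0 if never assessed."""
    if proxy["assess_times"] == 0:
        return 0.0
    return proxy["success_times"] / proxy["assess_times"]


def proxies_mapping(proxy):
    """Map scheme_type (0: http only, 1: https only, 2: both) to a
    {scheme: proxy_url} dict of the kind urllib's ProxyHandler accepts."""
    address = "http://" + proxy["content"]  # assumed proxy URL form
    mapping = {}
    if proxy["scheme_type"] in (0, 2):
        mapping["http"] = address
    if proxy["scheme_type"] in (1, 2):
        mapping["https"] = address
    return mapping


if __name__ == "__main__":
    for proxy in parse_response(SAMPLE):
        print(proxy["content"], success_rate(proxy), proxies_mapping(proxy))
```

The mapping returned by proxies_mapping can be passed to urllib.request.ProxyHandler so that only the schemes the proxy supports are routed through it.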

2. Build from source

go get -u github.com/storyicon/golang-proxy

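Enter the golang-proxy directory, run go build main.go, and then run the generated binary.

Note:

The ./source folder in the project root is required at run time; it stores the website sources. All other folders contain only project source code. After compiling, you can therefore move the resulting main binary together with the source folder to any location, and you may rename the main binary freely.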

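Why use Golang-Proxy

  1. Stable and fast.
    The crawling module can reach 1000 pages per second with concurrency on a single core.
  2. Highly configurable and extensible.
    You do not need to write any code; filling in a configuration file takes a minute or two and adds a new website source.
  3. Evaluation.
    The Assessor module periodically tests proxy quality and computes a combined score from independent factors: test success rate, anonymity, number of tests, volatility, and response speed. The algorithm is highly configurable, and the weight of each factor can be adjusted independently to suit your project.
  4. Provides a highly flexible API: once the main program is running, you can open localhost:9999/all or localhost:9999/random in a browser to get crawled proxies directly. You can even use localhost:9999/sql?query= to run SQL statements and define your own proxy filtering rules.
  5. Does not depend on any server-based database: one download, ready to use.

How to configure a new source

Every yml file under ./source/ is a source. You can add sources, or prefix a file name with a . to make the program ignore that source; you can also delete the file to remove the source permanently. The Source parameters are described below: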

#Page settings
page:
    entry: "https://xxx/1.html"
    template: "https://xxx/{page}.html"
    from: 2
    to: 10
#The producer first crawls entry, i.e. https://xxx/1.html,
#then uses template, from, and to to crawl, in order:
#  https://xxx/2.html
#  https://xxx/3.html
#  https://xxx/4.html
#  ...
#  https://xxx/10.html
#Selector settings
selector:
    iterator: ".table tbody tr"
    ip: "td:nth-child(1)"
    port: "td:nth-child(2)"
# The configuration above is for crawling an HTML structure like this:
# <table class="table">
#     <tbody>
#         <tr>
#             <td>187.3.0.1</td>
#             <td>8080</td>
#             <td>HTTP</td>
#         </tr>
#         <tr>
#             <td>164.23.1.2</td>
#             <td>80</td>
#             <td>HTTPS</td>
#         </tr>
#         <tr>
#             <td>131.9.2.3</td>
#             <td>8080</td>
#             <td>HTTP</td>
#         </tr>
#     </tbody>
# </table>
# Selectors are standard jQuery selectors. iterator selects the repeated element, for example a table row holding one proxy per row; the ip and port selectors are then resolved against the child elements of each iterator match.
category:
    # Number of parallel workers
    parallelnumber: 1
    # For this source, after each page is crawled,
    # wait a random 5~20s before crawling the next page
    delayRange: [5, 20]
    # How often this source is run
    # @every 10s , @every 10h...
    interval: "@every 10m"
debug: true
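
To make the page settings concrete, here is a small Python sketch of the crawl order they describe: the entry page first, then the template with {page} replaced by every number from from to to, inclusive. The function names are hypothetical, and drawing the delay uniformly from delayRange is an assumption; the source only says the wait is random within that range.

```python
import random


def expand_pages(entry, template, start, stop):
    """Return the entry URL followed by template with {page} = start..stop."""
    urls = [entry]
    for page in range(start, stop + 1):  # 'to' is inclusive
        urls.append(template.replace("{page}", str(page)))
    return urls


def pick_delay(delay_range):
    """Pick a wait in seconds within delayRange (uniform draw is assumed)."""
    low, high = delay_range
    return random.uniform(low, high)


if __name__ == "__main__":
    for url in expand_pages("https://xxx/1.html", "https://xxx/{page}.html", 2, 10):
        print(url)
    print("next wait:", pick_delay([5, 20]))
```

With the values from the sample configuration, expand_pages yields ten URLs, from https://xxx/1.html through https://xxx/10.html.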

Request for comments

  1. If you run into any problem, open an issue
  2. If you find a good new source, you are welcome to submit it and share it
  3. Since you are here, leave a Star before you go : )

English Document

Golang-proxy is an efficient crawler for free proxies that ensures the captured proxies are highly anonymous and of good quality. You can use these proxies to download network resources while keeping your own identity private.

1. Features

  • A fast crawler: a single core can fetch up to 1000 pages per second.
  • Customizable proxy sources, defined in a very simple configuration file.
  • Ships as a compiled binary with a built-in SQLite database, and also supports MySQL.
  • Comes with an HTTP API, so every function is available out of the box.
  • A proxy evaluation system that maintains the quality of the proxy pool.

2. How to use

golang-proxy provides compiled binaries, so you do not need Go installed on the machine. Download the binary archive from the Release page.
Pick the archive that matches your system, extract it, and run it. After a few minutes, you can open localhost:9999/all in a browser to see the crawled proxies.

Before I go into the detailed introduction of golang-proxy, I think it's best to tell you the most useful information first.

API interface

After you start the binary, you can open the following endpoints in a browser to get proxies:

URL                                      Description
localhost:9999/all                       Get all highly available proxies
localhost:9999/all?table=proxy           Get all highly available proxies
localhost:9999/random                    Get one random highly available proxy
localhost:9999/all?table=crude_proxy     Get the proxies in the temporary table (their quality cannot be guaranteed)
localhost:9999/random?table=crude_proxy  Get one random proxy from the temporary table (its quality cannot be guaranteed)
localhost:9999/sql?query=                Write the SQL statement you want to execute after query= to apply your own filter rules

With the above, you can already use about 50% of golang-proxy's functionality. The last endpoint lets you execute custom SQL statements, and for that you need to know at least the structure of the tables, which the following sections describe.

3. Advanced

golang-proxy consists of the following parts:

  • two data tables
  • one configuration file
  • one source folder
  • four modules

two data tables

1. Table Crude Proxy

To store temporary proxies, we designed the data table crude_proxy. The table is defined as follows.

field        type    example          description
id           int     -                -
ip           string  192.168.0.1      -
port         string  255              -
content      string  192.168.0.1:255  -
insert_time  int     1540798717       -
update_time  int     1540798717       -

The crude_proxy table stores proxies as they are crawled, so their quality cannot be guaranteed.

2. Table Proxy

When a proxy in the crude_proxy table passes pre-assessment (which roughly verifies its availability and tests its support for https and http), it enters the proxy table.

field                    type    example          description
id                       int     -                -
ip                       string  192.168.0.1      -
port                     string  255              -
scheme_type              int     2                How far the proxy supports http and https: 0 = http only, 1 = https only, 2 = https and http
content                  string  192.168.0.1:255  -
assess_times             int     5                Number of times the proxy has been evaluated
success_times            int     5                Number of evaluations the proxy passed
avg_response_time        float   0.001            -
continuous_failed_times  int     0                Number of consecutive failures during evaluation
score                    float   25               The higher the better
insert_time              int     1540798717       -
update_time              int     1540798717       -

Proxies in the proxy table are evaluated periodically and their scores are updated. Proxies with low scores are deleted.
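
To try queries against this table structure without a running instance, the Python sketch below recreates the proxy table in an in-memory SQLite database and runs the kind of statement you might pass to /sql?query=. The column names come from the table above; the SQL column types and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Column names follow the proxy table; the SQL types are assumptions.
SCHEMA = """
CREATE TABLE proxy (
    id INTEGER PRIMARY KEY,
    ip TEXT,
    port TEXT,
    scheme_type INTEGER,
    content TEXT,
    assess_times INTEGER,
    success_times INTEGER,
    avg_response_time REAL,
    continuous_failed_times INTEGER,
    score REAL,
    insert_time INTEGER,
    update_time INTEGER
)
"""

# Made-up rows: (ip, port, scheme_type, content, assess_times, success_times,
# avg_response_time, continuous_failed_times, score, insert_time, update_time)
ROWS = [
    ("192.168.0.1", "255", 2, "192.168.0.1:255", 5, 5, 0.001, 0, 25.0, 1540798717, 1540798717),
    ("192.168.0.2", "80", 0, "192.168.0.2:80", 4, 1, 1.2, 3, 2.5, 1540798717, 1540798717),
    ("192.168.0.3", "8080", 1, "192.168.0.3:8080", 9, 9, 0.098, 0, 68.4, 1540798717, 1540798717),
]


def good_proxies(conn, min_score=5):
    """Return (content, score) for proxies above min_score, best first."""
    cur = conn.execute(
        "SELECT content, score FROM proxy WHERE score > ? ORDER BY score DESC",
        (min_score,),
    )
    return cur.fetchall()


def demo():
    """Create the table in memory, insert the sample rows, and query them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO proxy (ip, port, scheme_type, content, assess_times, success_times,"
        " avg_response_time, continuous_failed_times, score, insert_time, update_time)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        ROWS,
    )
    return good_proxies(conn)


if __name__ == "__main__":
    print(demo())
```

The row with a score of 2.5 is filtered out, and the remaining two come back ordered by score.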

one configuration file

For convenience, golang-proxy stores proxies in the portable SQLite database by default. To use MySQL instead, add a config.yml file to the directory that contains the executable.

For details, see Config page.

one source folder

golang-proxy needs source to define its crawling contents and rules. Therefore, the run directory of golang-proxy needs at least one source folder, and the source folder should have at least one source in yml format.
The source is defined as follows:

page: 
    entry: "http://www.xxx.com/http/?page=1"
    template: "http://www.xxx.com/http/?page={page}"
    from: 1
    to: 2000
selector:
    iterator: ".list .item"
    ip: ".ip"
    port: ".port"
category:
    parallelnumber: 3
    delayRange: [10, 30]
    interval: "@every 10m"
debug: true

In the definition above, producer will first crawl the entry page, then crawl:

http://www.xxx.com/http/?page=1      
http://www.xxx.com/http/?page=2      
http://www.xxx.com/http/?page=3      
...      
http://www.xxx.com/http/?page=2000     

This source definition expects pages in the following format:

<html>
    ...
    <div class="list">
        <div class="item">
            <div class="ip"> 127.0.0.1 </div>
            <div class="port"> 80 </div>
            ...
        </div>
        <div class="item">
            <div class="ip"> 125.4.0.1 </div>
            <div class="port"> 8080 </div>
            ...
        </div>
        ...
    </div>
    ...
</html>

When producer parses a single page, it always traverses the nodes defined by iterator first, and then gets the elements defined by ip and port selectors from these nodes. The source definition above is still valid for the following HTML structure.

<html>
    ...
    <div class="list">
        <div class="item">
            <div class="ip"> 127.0.0.1:80 </div>
        </div>
        <div class="item">
            <div class="ip"> 125.4.0.1:8080</div>
        </div>
        ...
    </div>
    ...
</html>

This works because, when the port selector matches nothing, the producer tries to parse the port from the text selected by the ip selector.
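
The fallback can be sketched in a few lines of Python. This is an illustration of the behaviour described here, not the project's actual parsing code; extract_address is a hypothetical name, and its inputs stand for the text matched by the ip and port selectors.

```python
def extract_address(ip_text, port_text=""):
    """Return (ip, port) from the text matched by the ip and port selectors.

    If the port text is empty, fall back to splitting 'ip:port' out of
    the ip text.
    """
    ip = ip_text.strip()
    port = port_text.strip()
    if not port and ":" in ip:
        ip, port = ip.rsplit(":", 1)
    return ip.strip(), port.strip()


if __name__ == "__main__":
    print(extract_address(" 127.0.0.1 ", " 80 "))   # separate ip and port nodes
    print(extract_address(" 125.4.0.1:8080"))       # port embedded in the ip node
```

Both HTML structures shown above therefore produce the same (ip, port) pair for a given proxy.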

Once a source definition is complete, store it in the source folder in yml format. golang-proxy reads it and adds it to the crawl list the next time it starts.

If a source file name starts with a . , the source will not be read.
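
As a rough illustration of this rule, the Python sketch below lists the source files that would be read from a folder. It assumes that only files ending in .yml count as sources, which matches the text but is not checked against the implementation; the file names in the demo are made up.

```python
from pathlib import Path
import tempfile


def active_sources(folder):
    """List the .yml files in folder, skipping names that start with '.'."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix == ".yml" and not p.name.startswith(".")
    )


def demo():
    """Create a throwaway folder with one active and one disabled source."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.yml", ".disabled.yml", "notes.txt"):
            (Path(tmp) / name).write_text("debug: true\n")
        return active_sources(tmp)


if __name__ == "__main__":
    print(demo())
```

Renaming a.yml to .a.yml is thus enough to disable a source without deleting it.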

four modules

golang-proxy consists of four modules that work together:

module name  description
producer     Periodically fetches the sources defined in the source directory and writes the fetched proxies to the crude_proxy table.
consumer     Periodically reads a certain number of proxies from crude_proxy, determines their scheme type and availability, and writes them to the proxy table.
assessor     Periodically reads a number of proxies from the proxy table and evaluates their quality.
service      Serves the HTTP API provided by golang-proxy, which lets you filter and obtain the proxies in the crude_proxy and proxy tables through localhost:9999/all, localhost:9999/random, and localhost:9999/sql.

When you start the golang-proxy executable, these modules start in turn. You can add the -mode startup parameter after the executable name to make golang-proxy start only one module, as shown below:

golang-proxy -mode=service

This will only start the HTTP API interface service.

At this point, you have mastered about 95% of golang-proxy's functionality. If you want to learn more, read the source code at the project address above and help improve it.

Request for comments

Issues are welcome.
If golang-proxy helps you, consider giving it a star or watching the repository. Thanks!
