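Introduction
This tutorial is for anyone who has just heard of Nutch, is curious to try it, and has no idea where to start.
Downloading the Nutch source
Jianshu has no place to upload files, which is a little sad, so I had to host the source on <a >CSDN</a> (don't miss it while you're passing by: only 2 C coins, an honest deal).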
Configuration before building Nutch
- Enable MySQL support
<!-- Edit ivy/ivy.xml -->
<!-- Ivy is a dependency manager, much like Maven; this step adds the SQL dependencies -->
<!-- Uncomment: -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<!-- Change: -->
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
<!-- to: -->
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
<!-- Reason: -->
<!-- Uncomment this to use SQL as Gora backend. It should be noted that the gora-sql 0.1.1-incubating artifact is NOT compatable with gora-core 0.3. Users should downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
- Configure the MySQL parameters
# File: conf/gora.properties
# Comment out the "Default SqlStore properties" block and add the MySQL properties below
# MySQL properties
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=password
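Because this step both comments out the defaults and adds new keys, it can help to confirm which value actually remains active. The following is a minimal sketch, assuming a POSIX shell with awk and no spaces around "=" in the file. `get_prop` is a hypothetical helper (not part of Nutch or Gora), the commented-out driver name in the demo is a made-up placeholder, and the demo runs against a throwaway file rather than the real conf/gora.properties.

```shell
# Hypothetical helper (not part of Nutch or Gora): print the active value of
# one key from a Java-style properties file.
# $1 = path to the properties file, $2 = exact key name
# Assumption: keys are written as key=value with no spaces around "=".
get_prop() {
  # Fields are split on "="; a commented-out line never matches because its
  # first field is "#key" rather than "key". If a key repeats, the last wins.
  awk -F= -v key="$2" '$1 == key { sub(/^[^=]*=/, ""); val = $0 } END { print val }' "$1"
}

# Demo on a throwaway file so the real configuration is left untouched
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
#gora.sqlstore.jdbc.driver=example.placeholder.Driver
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
EOF
get_prop "$demo_file" gora.sqlstore.jdbc.driver
get_prop "$demo_file" gora.sqlstore.jdbc.url
rm -f "$demo_file"
```

Against the real file, the same check would be `get_prop conf/gora.properties gora.sqlstore.jdbc.url`, run from the Nutch root directory.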
- Modify the Nutch parameters
<!-- Rename nutch-site.xml.template to nutch-site.xml -->
<!-- Add the following inside the <configuration> element of conf/nutch-site.xml -->
<property>
<name>http.agent.name</name>
<value>LiuXun Nutch Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting a non-English language as the default one to retrieve.
It is a useful setting for search engines built for a certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
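Hadoop-style configuration files such as nutch-site.xml keep every <property> inside the root <configuration> element, so a quick check that the five names above made it into the file can catch a misplaced paste. This is a minimal sketch, assuming a POSIX shell with grep. `missing_props` is a hypothetical helper, and it only looks for the <name> lines; it does not validate the XML.

```shell
# Hypothetical helper: print every property name from the list that does NOT
# appear as <name>...</name> in the given file. Prints nothing if all are present.
# $1 = path to nutch-site.xml, remaining arguments = property names
missing_props() {
  file="$1"
  shift
  for name in "$@"; do
    # -F treats the pattern as a fixed string, so dots in the name are literal
    grep -qF "<name>$name</name>" "$file" || echo "$name"
  done
}

site=conf/nutch-site.xml
if [ -f "$site" ]; then
  missing_props "$site" \
    http.agent.name http.accept.language parser.character.encoding.default \
    storage.data.store.class generate.batch.id
else
  echo "$site not found; run this from the Nutch root directory"
fi
```

Any name it prints is one that still needs to be added to the file.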
Installing the Nutch build tool
Download <a >ant</a> and add it to your PATH (it really is that simple).
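For a POSIX shell, configuring the path might look like the sketch below. The install location /opt/apache-ant is an assumption (a hypothetical path); substitute wherever the archive was actually unpacked.

```shell
# Assumption: Ant was unpacked to /opt/apache-ant (hypothetical location).
# An ANT_HOME already set in the environment takes precedence.
ANT_HOME="${ANT_HOME:-/opt/apache-ant}"
export ANT_HOME
PATH="$ANT_HOME/bin:$PATH"
export PATH

# Confirm the shell can now find ant
if command -v ant >/dev/null 2>&1; then
  ant -version || echo "ant was found but failed to start; check the Java installation"
else
  echo "ant not found; check that $ANT_HOME/bin exists"
fi
```

To make the setting survive new terminal sessions, the two assignments and exports would go into the shell's startup file (for example ~/.profile).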
Building Nutch
- Configuration
Place <a >sonar-ant-task-2.1.jar</a> in the Nutch root directory and modify build.xml:
<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
<classpath path="${ant.library.dir}" />
<classpath path="${mysql.library.dir}" />
<classpath><fileset dir="." includes="sonar*.jar" /></classpath>
</taskdef>
- Build with ant
Run the ant runtime command in the Nutch root directory, then sit through a long wait while the dependencies download.
It is truly frightening: [image: Paste_Image.png]
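The build step can be sketched as below. It assumes ant is already on the PATH and that the command is run from the Nutch source root. The runtime/local/bin path is where the build normally places the locally runnable scripts, but treat it as an assumption to verify against your own checkout.

```shell
# Run from the Nutch source root (the directory containing build.xml).
if [ -f build.xml ] && command -v ant >/dev/null 2>&1; then
  if ant runtime; then
    build_status="ok"
    # Assumption: a successful build leaves runnable scripts under runtime/local/bin
    ls runtime/local/bin 2>/dev/null || echo "runtime/local/bin not found"
  else
    build_status="failed"
  fi
else
  build_status="skipped"
  echo "build.xml or ant is missing; nothing was built"
fi
echo "build status: $build_status"
```

The first run needs network access because Ivy fetches every dependency; later runs reuse Ivy's local cache and finish much faster.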
Coming up
Next post: <a href="http://www.lxweimin.com/p/6c8d59d1f920">Building a crawler platform with nutch2.2.1 + hbase1.1.1 on ubuntu15.10 (a failed attempt)</a>