How To Create a Web Crawler and Data Miner

A web crawler, often called a web spider, is an internet bot that systematically browses the World Wide Web. The best-known web crawler is Googlebot. A web crawler starts with a list of URLs to visit (the seeds). It then identifies all the hyperlinks on each page it fetches and adds them to the list of URLs to visit. In this article, I will show you how to create a web crawler. There are many ways to create one; one of them is to use Apache Nutch.
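The seed-and-frontier loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Apache Nutch is implemented. The names `LinkExtractor`, `crawl`, and `PAGES` are hypothetical, and the pages come from an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: visit the seeds, then every link discovered.

    `fetch` is a caller-supplied function mapping a URL to HTML text,
    or None if the page cannot be retrieved.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid repeats
    visited = []              # URLs in the order they were crawled

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

print(crawl(["http://example.com/"], PAGES.get))
```

A real crawler would also need politeness delays, robots.txt handling, and persistent storage, which is the kind of work a tool such as Apache Nutch takes care of.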

Apache Nutch is a scalable and robust tool for web crawling. It can be integrated with the Python programming language. You can use it to crawl your own data for better indexing. Once you understand Apache Nutch well, you can build your own search engine, much like Google.

Apache Nutch can run on a single machine as well as in a distributed environment such as Apache Hadoop. It is written in Java. Apache Nutch also integrates easily with Apache Solr (a search platform that can be used to search any type of data, including web pages), so we can pass all the pages crawled and indexed by Apache Nutch to Apache Solr.

Set Up Your Web Crawler

To start using Apache Nutch, we first need to install it along with its dependencies.

The required components are:

Apache Nutch
HBase
Ant
JDK
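Before building, it can help to confirm that the JDK and Ant are reachable from the command line. The following is a minimal sketch, assuming the executables are named `java`, `javac`, and `ant` on your system; the helper name `find_tools` is made up for this example.

```python
import shutil

# Assumed executable names; adjust if your installation differs.
REQUIRED_TOOLS = ("java", "javac", "ant")


def find_tools(tools=REQUIRED_TOOLS):
    """Return a dict mapping each tool name to its path, or None if absent."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

Any tool reported as not found should be installed, or added to your PATH, before you continue.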

In this tutorial, we will use Apache Nutch 2.2.1. These are the steps for installing and configuring it:

1. Download Apache Nutch

2. Extract it by using this command:

# tar -zxvf apache-nutch-2.2.1-src.tar.gz

3. Download HBase, the Apache Hadoop database.

4. Extract it by using this command:

# tar -zxvf Hbase.x.x.tar.gz

5. Configure HBase. Open hbase-site.xml, found in <Your HBase home>/conf, and modify it as shown in the image below.
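For readers who cannot see the image, here is a hedged sketch of what a minimal standalone hbase-site.xml often contains. It is an assumption, not a copy of the original image. `hbase.rootdir` and `hbase.zookeeper.property.dataDir` are real HBase settings, but the paths below are placeholders you must replace, and `build_site_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict as Hadoop-style <configuration> XML text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder paths (assumptions): point these at directories you own.
HBASE_SETTINGS = {
    "hbase.rootdir": "file:///path/to/hbase-data",
    "hbase.zookeeper.property.dataDir": "/path/to/zookeeper-data",
}

print(build_site_xml(HBASE_SETTINGS))
```

The printed XML can be pasted into <Your HBase home>/conf/hbase-site.xml once the paths have been adjusted.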

6. Specify the Gora backend in nutch-site.xml (you can find it in $NUTCH_HOME/conf):

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
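A mistyped closing tag is enough to make a configuration file malformed, so it is worth checking the XML before you build. The sketch below parses Hadoop-style configuration text into a dictionary; the helper name `read_site_properties` and the `SAMPLE` text are illustrative only.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict.

    Raises xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


SAMPLE = """
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
"""

props = read_site_properties(SAMPLE)
print(props["storage.data.store.class"])
```

To check your real file, read its contents and pass them to the same function, for example `read_site_properties(open("conf/nutch-site.xml").read())`.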

7. Ensure that the gora-hbase dependency for HBase is available in ivy.xml by adding the following configuration:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

8. Make sure HBaseStore is set as the default data store by adding the following line to gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
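If you want to confirm the setting programmatically, a small reader for key=value files is enough. This is a simplified sketch: it ignores parts of the Java properties format such as line continuations, and `read_properties` is a hypothetical helper name.

```python
def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


GORA_SAMPLE = """
# Default data store
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
"""

settings = read_properties(GORA_SAMPLE)
print(settings.get("gora.datastore.default"))
```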

9. Go to the Apache Nutch home directory and run the following command:

ant runtime

10. At this point, the build will have created the runtime directories (runtime/local and runtime/deploy) under the Apache Nutch home directory.

11. Make sure HBase is working properly by going to the HBase home directory and running the following command:

./bin/hbase shell

If everything goes well, you will see this output:

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.4, r1001068, Fri Sep 24 13:55:42 PDT 2010

Start Crawling Your First Website Using Apache Nutch

After finishing the installation steps for Apache Nutch, you can start crawling by following these steps:

1. Add your agent name to the value field in nutch-site.xml with the following configuration:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Private Spider Bot</value>
  </property>
</configuration>

2. Go to the local runtime directory of Apache Nutch, located at <your Apache Nutch home>/runtime/local, and create a directory called urls inside it.

3. Create seed.txt inside the urls directory and add the URLs you want to crawl first, one per line. For example:

http://technotif.com
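Creating the urls directory and the seed file can also be scripted. The sketch below is one way to do it; `write_seed_file` is a hypothetical helper, and the demo writes into a temporary directory rather than a real Nutch installation.

```python
import tempfile
from pathlib import Path


def write_seed_file(runtime_dir, urls):
    """Create <runtime_dir>/urls/seed.txt with one URL per line."""
    urls_dir = Path(runtime_dir) / "urls"
    urls_dir.mkdir(parents=True, exist_ok=True)
    seed_file = urls_dir / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


# Demo in a throwaway directory; pass your Nutch runtime directory instead.
with tempfile.TemporaryDirectory() as tmp:
    seed = write_seed_file(tmp, ["http://technotif.com"])
    print(seed.read_text(encoding="utf-8"), end="")
```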

4. Now, with HBase running, you can start the crawl by using the following commands:

cd <Respective directory of Apache Nutch>/runtime/local
bin/crawl urls/seed.txt TestCrawl
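If you prefer to launch the crawl from Python, for example from a scheduled job, the command above can be wrapped with the standard subprocess module. This is a sketch that mirrors the arguments shown in this article; the crawl script in your Nutch version may expect additional arguments, and `crawl_command` and `run_crawl` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def crawl_command(runtime_dir, seed_path="urls/seed.txt", crawl_id="TestCrawl"):
    """Build the argument list for the crawl script used in this article."""
    script = Path(runtime_dir) / "bin" / "crawl"
    return [str(script), seed_path, crawl_id]


def run_crawl(runtime_dir, seed_path="urls/seed.txt", crawl_id="TestCrawl"):
    """Run the crawl script from `runtime_dir` and return its exit code."""
    command = crawl_command(runtime_dir, seed_path, crawl_id)
    if not Path(command[0]).is_file():
        raise FileNotFoundError(f"crawl script not found: {command[0]}")
    # cwd makes the relative seed path resolve inside the runtime directory.
    result = subprocess.run(command, cwd=runtime_dir)
    return result.returncode


# Only prints the command; call run_crawl(...) to actually start a crawl.
print(" ".join(crawl_command("/path/to/nutch/runtime")))
```

A non-zero return code from `run_crawl` means the script reported a failure, in which case its console output is the first place to look.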

If you get errors when starting Apache Nutch, check for common errors.

February 21, 2014 | Technology Tips | Software | 3 Comments

3 thoughts on “How To Create a Web Crawler and Data Miner”

Trover says:

June 10, 2014 at 2:16 am

At compile time, an error is shown: “[FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar:…”, “Impossible to resolve dependencies:…”. My OS is Ubuntu 14.04. Any idea? Thanks.

  James Howard says:

  June 10, 2014 at 5:23 am

  Make sure the dependencies are set correctly.

  Try deleting the entire .ivy directory and re-running ant.

Leonardo says:

December 16, 2015 at 11:15 pm

And the data miner? 🙂
