Home » Software » How To Create a Web Crawler and Data Miner

How To Create a Web Crawler and Data Miner

How-To-Create-a-Web-Crawler-and-Data-Miner-2 How To Create a Web Crawler and Data Miner

Web Spider

A web crawler is an internet bot that browses the Internet World Wide Web, Its often to be called a web spider. Most known web crawler is googlebot. A web crawler starting to browse a list of URL to visit (seeds). After that, it identifies all the hyperlink in the web page and adds them to list of URLs to visit. In this article, i will show you How To Create A Web Crawler. There are many ways to create a web crawler, One of them is using Apache Nutch.

Apache Nutch is a scalable and very robust tool for web crawling. Apache Nutch can be integrated with Phyton programming language for web crawling. You can use it to crawl on your data, for a better indexing. If you understand Apache Nutch clearly, you can create your own search engine like Google.

Apache Nutch can run on a single machine as well as on a distributed environment like Apache Hadoop. It’s written in java. Apache Nutch can also integrated with Apache Solr (Solr is a search platform that can be used for searching any type of data and web pages) easily, so we can pass all the indexed and crawled page by Apache Nutch to Apache Solr.

Set Up Your Web Crawler

To start using Apache Nutch, First we need to install it. First thing to do is installing dependencies in Apache Nutch.

The dependencies are :

  1. Apache Nutch
  2. HBase
  3. Ant
  4. JDK

In this tutorial, we will use Apache Nutch 2.2.1 version. These are the steps for installation and configuration of Apache Nutch 2.2.1

1. Download Apache Nutch

2.Extract it by using this command # tar -zxvf apache-nutch.2.2.1-src.tar.gz

3.Download HBase Apache Hadoop

4.Extract it by using this command # tar -zxvf Hbase.x.x.tar.gz

5.Configure HBase. Go to hbase-site.xml and find <Your HBase home>/conf and modify it like image below

How-To-Create-a-Web-Crawler-and-Data-Miner How To Create a Web Crawler and Data Miner

6.Specify Gora backend in nutch-site.xml (You can find it at $NUTCH_HOME/conf)

7. Ensure that HBasegora-hbase dependency is available in ivy.xml by putting the following configuration

8. Make sure HBaseStore is set as default data by putting the following configuration into gora.properties

9. Go to Apache Nutch home directory and type following command

10. At this point, Apache Nutch will create respective directories.

11. Make sure Hbase is working properly by go to the home directory of hbase and type the following command

If everything goes well you will see this output

Start to Crawling Your First Website Using Apache Nutch

After finished installation steps of Apache Nutch, you can start crawling by use following steps

1. Add your agent name in value field in nutch-site.xml by add following configuration

2.Go to the local directory of Apache Nutch which located at <your Apache Nutch home>/runtime and create a directory called urls inside it

3.Create seed.txt inside urls directory and put whatever you want to crawl first. for example

4. Now you can start to crawl by starting Apache Nutch and HBase by using following command

If you got errors when starting Apache Nutch, Check for common errors

3 thoughts on “How To Create a Web Crawler and Data Miner

  1. Trover says:

    In the momento of compilation, show an error: “[FAILED ] org.hasqldb#hsqldb;2.2.8!hsqldb.jar:…” , “Imposible to resolve dependencies:…, My OS is Ubuntu 14.0.4 Any Idea? Thanks.

    1. James Howard says:

      Make sure dependecies set correctly

      Try to delete entire .ivy directory and re-run ant

  2. Leonardo says:

    And the data miner? 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

Name *
Email *
Website

This site uses Akismet to reduce spam. Learn how your comment data is processed.