How To Create a Web Crawler and Data Miner
A web crawler is an internet bot that systematically browses the World Wide Web. It is often called a web spider. The best-known web crawler is Googlebot. A web crawler starts with a list of URLs to visit, called the seeds. It then identifies all the hyperlinks in each page it fetches and adds them to the list of URLs to visit. In this article, I will show you how to create a web crawler. There are many ways to create one; one of them is to use Apache Nutch.
Apache Nutch is a scalable and very robust tool for web crawling. Apache Nutch can be integrated with the Python programming language for web crawling. You can use it to crawl your own data for better indexing. If you understand Apache Nutch well, you can use it as the foundation for your own search engine.
Apache Nutch can run on a single machine as well as in a distributed environment such as Apache Hadoop. It is written in Java. Apache Nutch also integrates easily with Apache Solr (a search platform that can be used for searching any type of data, including web pages), so we can pass all the pages crawled by Apache Nutch to Apache Solr for indexing.
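The crawl loop described above (start from the seeds, extract the hyperlinks from each page, add them to the list of URLs to visit) can be sketched in a few lines of Python. This is only an illustration of the idea, not how Apache Nutch is implemented. The names `LinkExtractor`, `crawl`, and `fetch`, the `max_pages` limit, and the example.com URLs are all assumptions made for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=10):
    """Breadth-first crawl: visit the seeds, then the links found on each page.

    `fetch` is a caller-supplied (hypothetical) function that takes a URL
    and returns the page's HTML as a string, or None on failure.
    """
    frontier = deque(seeds)  # URLs still to visit
    seen = set(seeds)        # URLs already queued, so nothing is queued twice
    visited = []             # URLs actually fetched, in visit order

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


if __name__ == "__main__":
    # A tiny in-memory "web" so the sketch runs without network access.
    pages = {
        "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.com/a": '<a href="/">home</a>',
        "http://example.com/b": "",
    }
    print(crawl(["http://example.com/"], pages.get))
```

Run as a script, the sketch visits the three in-memory pages in breadth-first order. Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library. A real crawler such as Apache Nutch adds much more on top of this loop, for example politeness rules, persistent storage, and distributed execution.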
Set Up Your Web Crawler
To start using Apache Nutch, we first need to install it together with its dependencies.
The required components are:
- Apache Nutch
- HBase
- Ant
- JDK
In this tutorial, we will use Apache Nutch version 2.2.1. These are the steps for installing and configuring Apache Nutch 2.2.1:
1. Download Apache Nutch
2. Extract it by using this command
# tar -zxvf apache-nutch-2.2.1-src.tar.gz
3. Download Apache HBase, the Hadoop database
4. Extract it by using this command
# tar -zxvf Hbase.x.x.tar.gz
5. Configure HBase. Go to <Your HBase home>/conf, open hbase-site.xml, and modify it as shown in the image below
6. Specify the Gora backend in nutch-site.xml (you can find it at $NUTCH_HOME/conf)
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
7. Ensure that the HBase gora-hbase dependency is available in ivy.xml by adding the following configuration
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
8. Make sure HBaseStore is set as the default data store by putting the following configuration into gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
9. Go to the Apache Nutch home directory and type the following command
ant runtime
10. At this point, Apache Nutch will create the respective directories.
11. Make sure HBase is working properly by going to the HBase home directory and typing the following command
./bin/hbase shell
If everything goes well, you will see this output
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.4, r1001068, Fri Sep 24 13:55:42 PDT 2010
Start Crawling Your First Website Using Apache Nutch
After finishing the installation steps for Apache Nutch, you can start crawling by following these steps
1. Add your agent name to the value field in nutch-site.xml by adding the following configuration
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Private Spider Bot</value>
  </property>
</configuration>
2. Go to the local directory of Apache Nutch, which is located under <your Apache Nutch home>/runtime, and create a directory called urls inside it
3. Create seed.txt inside the urls directory and add the URLs you want to crawl first, one per line. For example
http://technotif.com
4. Now you can start crawling. With HBase running, start Apache Nutch by using the following commands
cd <Respective directory of Apache Nutch>/runtime
bin/crawl urls/seed.txt TestCrawl
If you get errors when starting Apache Nutch, check for common errors
At compile time, I get an error: “[FAILED ] org.hasqldb#hsqldb;2.2.8!hsqldb.jar:…”, “Impossible to resolve dependencies:…”. My OS is Ubuntu 14.04. Any idea? Thanks.
Make sure the dependencies are set correctly
Try deleting the entire .ivy directory and re-running ant
And the data miner? 🙂