How To Create a Simple Search Engine
According to Wikipedia, a web search engine is a software system designed to search for information on the World Wide Web. This article covers how to create a simple search engine like Google.
What You Need To Create A Simple Search Engine
To create a search engine, you need two main parts:
- A web crawler, to collect information from the internet.
- A search platform, used for searching the collected data, in this case web pages.
To set up a web crawler, you can use Apache Nutch. You can check my previous article to create a web crawler using Apache Nutch. Once the web crawler is working, you need a search platform to index and display the information crawled by Apache Nutch. The search platform that we will use is Apache Solr.
Apache Solr is a search platform built on top of Apache Lucene. It is a very powerful search platform because it provides full-text search, dynamic clustering, database integration, rich document handling, and much more.
How To Install Apache Solr
Follow these steps to install Apache Solr:
1. Download Apache Solr from Apache's website.
2. Extract the downloaded file by using the following commands:
$ sudo tar xzf apache-solr-4.6.1.tgz
$ sudo mv apache-solr-4.6.1/ solr
These commands extract the Apache Solr archive and rename the extracted folder to solr.
3. Open your ~/.bashrc file (for example, with gedit ~/.bashrc) and add the following lines:
# set SOLR home
export SOLR_HOME=/usr/local/solr/example/solr
This creates an environment variable called SOLR_HOME that points to the Solr home directory. Open a new terminal or run source ~/.bashrc for the change to take effect.
4. Test your Apache Solr installation by navigating to the example directory of Apache Solr and typing the following command to start Apache Solr:
java -jar start.jar
If everything is set up correctly, you will get output like this:
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)
23 Jan, 2014 4:25:24 AM org.apache.solr.servlet.SolrUpdateServlet init
INFO: SolrUpdateServlet.init() done
2014-01-23 04:25:24.762:INFO::Started SocketConnector@0.0.0.0:8983
5. Verify that Apache Solr is running by browsing to the following URL:
http://localhost:8983/solr/admin/
If Apache Solr is running, its admin page will load in your browser.
6. At this point, both Apache Nutch and Apache Solr are installed correctly. Next, we need to integrate Apache Solr with Apache Nutch.
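The manual checks in steps 3 to 5 can also be scripted. The following is only a minimal sketch, not part of the original setup: the function names check_solr_home and solr_is_up are made up for illustration, and it assumes that curl is installed and that Solr listens on the default address used above.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking the installation described above.

# Succeeds only if SOLR_HOME is set and points to an existing directory.
check_solr_home() {
    if [ -z "${SOLR_HOME:-}" ]; then
        echo "SOLR_HOME is not set" >&2
        return 1
    fi
    if [ ! -d "$SOLR_HOME" ]; then
        echo "SOLR_HOME does not point to a directory: $SOLR_HOME" >&2
        return 1
    fi
    echo "SOLR_HOME=$SOLR_HOME"
}

# Succeeds only if the given URL answers without an HTTP error
# (curl --fail exits non-zero for status codes of 400 and above).
# $1 = URL to probe; defaults to the admin URL from step 5.
solr_is_up() {
    url=${1:-http://localhost:8983/solr/admin/}
    curl --silent --fail --max-time 5 --output /dev/null "$url"
}

# Example use, once Solr has been started:
#   check_solr_home && solr_is_up && echo "Solr is reachable"
```

Both functions report their result through the exit status, so they can be chained with && or used in an if statement before starting a crawl.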
Integrate Apache Solr with Apache Nutch
Integration is required so that the pages crawled by Apache Nutch are indexed in Apache Solr: once Apache Nutch is done crawling, the information is indexed by Apache Solr. To integrate Apache Solr with Apache Nutch, follow these steps:
1. Locate the schema.xml file in <Apache Nutch directory>/conf. It must be copied into the conf directory of Apache Solr.
2. Enter the following command to copy schema.xml:
cp <Apache Nutch directory>/conf/schema.xml <Apache Solr directory>/example/solr/conf/
3. Navigate to the example directory and type the following command to restart Apache Solr:
java -jar start.jar
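4. Now you can start Apache Nutch by using these commands:
cd <Apache Nutch directory>/runtime
bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Here urls/seed.txt holds the seed URLs, TestCrawl is the directory where the crawl data is stored, http://localhost:8983/solr/ is the Solr URL, and 2 is the number of crawl rounds.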
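Before running the crawl, it can be worth confirming that the schema.xml copy from step 2 actually landed where Solr expects it. The sketch below is one way to do that, under the assumption that both installations use the directory layout shown in step 2; the function name schema_in_sync is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if Solr's schema.xml is byte-for-byte
# identical to the one shipped with Nutch.
# $1 = Apache Nutch directory, $2 = Apache Solr directory (layout as in step 2).
schema_in_sync() {
    nutch_schema="$1/conf/schema.xml"
    solr_schema="$2/example/solr/conf/schema.xml"
    if [ ! -f "$nutch_schema" ] || [ ! -f "$solr_schema" ]; then
        echo "schema.xml is missing on one side" >&2
        return 1
    fi
    # cmp -s prints nothing and reports the result through its exit status.
    cmp -s "$nutch_schema" "$solr_schema"
}

# Example use (replace the placeholders with your own directories):
#   schema_in_sync "<Apache Nutch directory>" "<Apache Solr directory>" \
#       || echo "schema.xml differs, repeat step 2"
```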
You now have the pieces of a simple search engine in place. Apache Nutch provides many parameters that you can adjust according to your requirements.
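Once a crawl has finished, the indexed pages can be searched over HTTP. The following is a rough sketch rather than a definitive recipe. It assumes that curl is installed, that Solr exposes its standard select handler under the same base URL that was passed to bin/crawl, and that the Nutch schema.xml provides the title, url, and content fields. The function name search_solr is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a search against the Solr index and print the
# raw JSON response.
# $1 = Solr query, for example "content:apache"
# $2 = Solr base URL (defaults to the one passed to bin/crawl)
# $3 = maximum number of results (defaults to 10)
search_solr() {
    terms=$1
    base=${2:-http://localhost:8983/solr}
    rows=${3:-10}
    # -G sends the --data-urlencode pairs as a GET query string;
    # ${base%/} drops a trailing slash so the URL has no double slash.
    curl --silent --fail --max-time 10 -G "${base%/}/select" \
        --data-urlencode "q=$terms" \
        --data-urlencode "fl=title,url" \
        --data-urlencode "rows=$rows" \
        --data-urlencode "wt=json"
}

# Example use, after a crawl has been indexed:
#   search_solr "content:apache"
```

Naming the field in the query (content:apache) avoids relying on whichever default search field the Solr configuration happens to define.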