17 Mar
Posted by Ganesh H S , Bangalore, India as zend framework
In this article I will discuss creating an index with Zend Lucene search (Zend_Search_Lucene).
Conventionally, site search is powered by database queries.
Consider my blog. If a visitor searches for a keyword, I have to look in the articles table and the comments table to produce results. Running SQL queries against two tables is acceptable, but in an e-commerce application we may have to search across a large number of categories and products, and because database queries are costly, this consumes far more resources. Another important point is that plain SQL matching cannot easily rank results, so we cannot show the most relevant results first.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java, and it is used by many Web 2.0 websites. Zend_Search_Lucene is derived from the Apache Lucene project and is written in PHP.
<?php
// Index the blog articles
require_once 'Zend/Search/Lucene.php';

// Sample data; in a real application these rows would come from the database
$articlesData = array(
    0 => array(
        "url" => "http://ganeshhs.com/url-1",
        "title" => "Google suggest : pick right search keyword",
        "contents" => "Picking the right keywords for the websites is the success of search engine marketing. When i started search engine optimization, i used to use overture keyword selector tool and check the search counts what other users have searched. ",
        "category" => "Google",
        "postedDateTime" => "2007-12-26 12:20:00",
        "articleId" => 1),
    1 => array(
        "url" => "http://ganeshhs.com/url-2",
        "title" => "zend framework tutorial | part 9 Zend Auth",
        "contents" => "Zend Auth is easy to set up and provides a system that secures our site with an easy to use authentication mechanism. Zend Auth(Zend_Auth) provides an API for authentication. ",
        "category" => "zend-framework",
        "postedDateTime" => "2007-12-26 12:20:00",
        "articleId" => 2));

if (is_array($articlesData) && count($articlesData))
{
    // Create a new index at the given path
    $index = Zend_Search_Lucene::create('/var/www/lucene-data/blog-index');
    foreach ($articlesData as $articleData)
    {
        // One document per article
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Keyword('url',
            $articleData["url"]));
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId',
            $articleData["articleId"]));
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime',
            $articleData["postedDateTime"]));
        $doc->addField(Zend_Search_Lucene_Field::Text('title',
            $articleData["title"]));
        $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
            $articleData["contents"]));
        $doc->addField(Zend_Search_Lucene_Field::Text('category',
            $articleData["category"]));
        echo "\nAdding: " . $articleData["title"] . "\n";
        $index->addDocument($doc);
    }
    // Write pending changes to disk, then merge the index segments
    $index->commit();
    $index->optimize();
}
?>
$index = Zend_Search_Lucene::create('/var/www/lucene-data/blog-index');
This specifies the path of the Zend Lucene index where the documents will be stored.
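The script in this article rebuilds the index from scratch on every run. If you would rather keep an existing index and only create one when it is missing, a common idiom is to try open() first and fall back to create(). The sketch below assumes Zend Framework 1 is on the include path; openOrCreateIndex is a hypothetical helper name of mine, not part of the library.

```php
<?php
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper (not part of Zend Framework): open the index at
 * $indexPath if one already exists, otherwise create a new, empty one.
 */
function openOrCreateIndex($indexPath)
{
    try {
        // open() throws Zend_Search_Lucene_Exception when no index exists at the path
        return Zend_Search_Lucene::open($indexPath);
    } catch (Zend_Search_Lucene_Exception $e) {
        // create() starts a fresh, empty index at the path
        return Zend_Search_Lucene::create($indexPath);
    }
}

$index = openOrCreateIndex('/var/www/lucene-data/blog-index');
echo "Documents in index: " . $index->numDocs() . "\n";
```

Whether to rebuild or reuse depends on your data: a full rebuild is simpler and fine for a small blog, while reusing the index avoids re-indexing everything when only a few articles change.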
In each iteration we create a document:
$doc = new Zend_Search_Lucene_Document();
Once the document is created, we add the fields and their contents to it:
Since the URL is unique to the article, we index it as a Keyword field.
We may need the article id and the posted date and time when displaying results, but they won't be used for searching, so we store them as UnIndexed fields.
The title and category are stored as Text fields.
The content/description is indexed but not stored in the index. The description takes up a lot of space and would create a larger index on disk, so when we need to search a field but not redisplay it, the UnStored field type is preferred.
$doc->addField(Zend_Search_Lucene_Field::Keyword('url', $articleData["url"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId', $articleData["articleId"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime', $articleData["postedDateTime"]));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $articleData["title"]));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $articleData["contents"]));
$doc->addField(Zend_Search_Lucene_Field::Text('category', $articleData["category"]));
Once the document is created and its fields are added, we add the document to the index:
$index->addDocument($doc);
After all the iterations we commit the index:
$index->commit();
The following call optimizes the index by merging its segments, which speeds up searching:
$index->optimize();
Understanding Field Types:
| Field Type | Stored | Indexed | Tokenized | Binary |
| Keyword | yes | yes | no | no |
| UnIndexed | yes | no | no | no |
| Binary | yes | no | no | yes |
| Text | yes | yes | yes | no |
| UnStored | no | yes | yes | no |
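To make the table easier to apply, here is a small plain-PHP sketch that encodes the same flags as an array and looks up which field type matches a given requirement. It does not use Zend Framework at all; $fieldTypes and findFieldTypes are hypothetical names used only for illustration.

```php
<?php
// Flags for each Zend_Search_Lucene_Field type, copied from the table above.
$fieldTypes = array(
    'Keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false, 'binary' => false),
    'UnIndexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false, 'binary' => false),
    'Binary'    => array('stored' => true,  'indexed' => false, 'tokenized' => false, 'binary' => true),
    'Text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true,  'binary' => false),
    'UnStored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true,  'binary' => false),
);

/**
 * Hypothetical helper: return the names of the field types whose flags
 * match every entry in $wanted, e.g. array('indexed' => true, 'stored' => false).
 */
function findFieldTypes(array $fieldTypes, array $wanted)
{
    $matches = array();
    foreach ($fieldTypes as $name => $flags) {
        // Keep only the flags the caller asked about, then compare them
        if (array_intersect_key($flags, $wanted) == $wanted) {
            $matches[] = $name;
        }
    }
    return $matches;
}

// Searchable but not stored: the choice made for 'contents' in this article
echo implode(', ', findFieldTypes($fieldTypes, array('indexed' => true, 'stored' => false))) . "\n"; // UnStored

// Searchable as one exact value: the choice made for 'url'
echo implode(', ', findFieldTypes($fieldTypes, array('indexed' => true, 'tokenized' => false))) . "\n"; // Keyword
```

Reading the table this way mirrors the reasoning in the article: decide first whether a field must be searchable and whether it must be shown in results, then pick the type that matches.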
Related articles:
Zend Lucene Search - part2 - Real time indexing
Zend Lucene Search - part3 - retrieving the indexed data
Zend Lucene Search - part4 - Search Results Highlighting
4 Responses
Martenick Antunes Penchel
May 27th, 2008 at 4:40 pm
I want to be able to search for numeric content in indexed documents using Zend Lucene with PHP. When I add a document with text like "Brasil123", Lucene indexes only "Brasil", and it is not possible to find the string "123" using the search mechanism. Can you help me? I am not finding any articles about it on the Internet. Thank you very much.
Martenick
Anh
May 29th, 2008 at 4:27 am
Thank you very much
savitha
June 2nd, 2008 at 1:50 am
To recognize numerics, use
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
as the default analyzer.
For more info refer
http://framework.zend.com/manual/en/zend.search.lucene.extending.html
Ganesh H S , Bangalore, India
June 2nd, 2008 at 1:56 am
Martenick Antunes Penchel -
As mentioned in Savitha's comment, add that setDefault() expression to your bootstrap file; this should fix your problem.