Ganesh H S

Thoughts on open source technologies, search engine optimization, website security

Zend Lucene Search - part5 - search engine results page formatting

In the previous article, Zend Lucene Search - part4 - Search Results Highlighting, I talked about highlighting the keywords in search results.

In this article I will highlight the keywords and also format the output so that it looks much like the result page of most search engines, using Zend Lucene search.

<?php
require_once 'Zend/Search/Lucene.php';

$queryStr = "php";
$snapshotTextLength = 155;

$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
$index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");
$results = $index->find($query);

echo "Index contains ".$index->count()." documents.\n\n";

if ($index->count())
{
    echo displaySearchResults($results, $query, $snapshotTextLength);
}

// Format and display the search results
function displaySearchResults($results, $query, $snapshotTextLength)
{
    $searchResultsContent = "";
    $data = array();
    $count = 0;

    if (is_array($results) && count($results))
    {
        foreach ($results as $result)
        {
            $data[$count]["article_url"]               = $result->url;
            $data[$count]["article_title"]             = $query->highlightMatches($result->title);
            $data[$count]["article_description"]       = $query->highlightMatches($result->contents);
            $data[$count]["article_created_date_time"] = $result->postedDateTime;
            $data[$count]["article_id"]                = $result->articleId;

            // title of each article with URL as link
            $searchResultsContent .= sprintf('<a href="%s">%s</a><br>', $data[$count]["article_url"], $data[$count]["article_title"]);

            // snapshot of the description
            $searchResultsContent .= sprintf('%s<br>', substr($data[$count]["article_description"], 0, $snapshotTextLength));

            // url
            $searchResultsContent .= $data[$count]["article_url"];

            // leave 2 lines after each search result
            $searchResultsContent .= "<br> <br> <br>";

            // move to the next slot only after this result has been rendered
            $count++;
        }
    }
    else
    {
        $searchResultsContent = "No results found, try using different keywords";
    }

    return $searchResultsContent;
}
?>

This program is similar to the one in Zend Lucene Search - part3 - retrieving the indexed data. The only difference is that I am formatting the output, so that it looks much like what you get on the search engine result page of google.com or search.yahoo.com.
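The substr call in the listing cuts the description at a fixed byte count, so the cut can land in the middle of a word or of the highlight markup. Below is a minimal sketch of one way around that, assuming it is acceptable to build the snapshot from plain text first. The helper name makeSnapshot is my own and is not part of Zend Lucene search.

```php
<?php
// Hypothetical helper (not part of Zend): build a plain-text snapshot
// that ends on a word boundary instead of a fixed byte offset.
function makeSnapshot($html, $maxLength)
{
    // Drop tags and collapse whitespace so the cut never lands inside markup.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

    if (strlen($text) <= $maxLength) {
        return $text;
    }

    // Cut to the limit, then back up to the last space if there is one.
    $cut = substr($text, 0, $maxLength);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }

    return $cut . '...';
}

echo makeSnapshot('<p>PHP is a <b>popular</b> scripting language for the web.</p>', 20), "\n";
```

One possible way to combine the two is to build the snapshot from the raw contents first and only then pass the shortened text to highlightMatches, so the highlight tags are never cut.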

Related articles:
Zend Lucene Search - part1 - creating index
Zend Lucene Search - part2 - Real time indexing
Zend Lucene Search - part3 - retrieving the indexed data
Zend Lucene Search - part4 - Search Results Highlighting

Enter the world of Perl

It has been 4 years since I started my career. Perl was one of the theory subjects in the 7th semester of my B.E. All these years I have enjoyed coding in PHP a lot, and it is still very exciting to work with.

At Yahoo!, though, I see a lot of very interesting tools developed in Perl. I always wondered why they couldn't be coded in PHP; maybe I kept asking myself that because I came from a PHP background. Many of my colleagues do a lot of coding in Perl and are pretty excited about it, but I did not want to give up my comfort zone. Finally, after 11 months, I gave it a try and completed my first package coded entirely in Perl.

I have just entered the world of Perl. If you are a PHP programmer and feel you don't need to learn Perl because you already know PHP, I would recommend you give it a try; you will love both PHP and Perl.

The best book for beginners in Perl is -
Learning Perl

Related links -
PERL
CPAN - The Comprehensive Perl Archive Network

disallow website search results

We often let search engines index our website's own search results pages, with the intention of getting more back links from the search engine results page (SERP).

However, website search results pages don't add much value for users coming from search engines, so we should use robots.txt to disallow crawling of them. It's one of the quality guidelines Google mentions in its webmaster guidelines, and an important item on the SEO checklist.

Possible reasons:

  1. Duplicate content -
    A search results page holds the article title, a short description and a link to each article, while the article page itself also has the article title and description. So if a search bot crawls both the search results and the individual articles, it can potentially lead to duplicate content.
  2. Doorway pages -
    Search results pages act like doorway pages to the individual articles. Doorway pages are a technique that search engines treat as spam, so they should be avoided.
  3. Search engine subsystem -
    A search engine as a whole strives to provide unique results. Clicking an indexed website search page in the search engine results page only takes the user to the website's own search results, which means running another search on the website.

Example -
Disallow: /search?p=* in health.yahoo.com/robots.txt
Disallow: /search in http://search.yahoo.com/robots.txt
Disallow: /results in http://youtube.com/robots.txt

SEO checklist robots.txt

In my earlier post I wrote about robots.txt and the robots meta tag.

The following is the search engine optimization (SEO) checklist related to robots.txt -

1. robots.txt http status code

Before crawling a website, a search bot (e.g. Googlebot) always requests robots.txt, reads the rules defined in it and crawls only the locations that are allowed. So it's important for the webmaster to check the HTTP status code of the site's robots.txt and make sure it returns either 200 or 404.

Why is it so important to check the HTTP status code?

Before crawling, the search bot requests robots.txt -

If HTTP status code 200 is returned, it reads the file and crawls the locations of the website that are allowed.

If HTTP status code 404 is returned, the search bot goes ahead with its job with no restriction on crawling the website.

If the request takes a long time and no response code is returned, the search bot waits and after some time skips crawling, because it always respects robots.txt. This can adversely affect the crawling of our website.
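The three cases above can be sketched as a small check. This is only an illustration of the behaviour described in this post, not how any particular search bot is implemented; the function names robotsTxtOutcome and parseStatusLine are hypothetical.

```php
<?php
// Hypothetical helper: map the robots.txt HTTP status to the crawler
// behaviour described in this post. $statusCode is null when no
// response arrived in time.
function robotsTxtOutcome($statusCode)
{
    if ($statusCode === null) {
        return 'no response: crawling may be skipped';
    }
    if ($statusCode === 200) {
        return 'rules are read and followed';
    }
    if ($statusCode === 404) {
        return 'no restrictions on crawling';
    }
    // Anything else is outside the cases covered here.
    return 'unexpected status: investigate';
}

// Extract the status code from a raw status line such as "HTTP/1.1 200 OK".
function parseStatusLine($statusLine)
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

echo robotsTxtOutcome(parseStatusLine('HTTP/1.1 200 OK')), "\n";
```

In a real check the status line could come from the first element of the array returned by PHP's get_headers() for the robots.txt URL, which returns false when the request fails.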

2. URLs restricted by robots.txt

Consider the impact of the following robots.txt definition -

User-agent: *
Disallow: /

It blocks all search bots from crawling the entire website. We should make sure we block only those areas that really need to be blocked.
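To see which URLs a robots.txt definition would restrict, a rough check like the following can help. It is a simplified sketch under stated assumptions: it only looks at the "User-agent: *" group and does plain prefix matching, with no wildcard and no Allow support, so real search bots may behave differently. The function names are hypothetical.

```php
<?php
// Hypothetical helper: collect the Disallow prefixes listed under "User-agent: *".
// Simplified: plain prefix matching only, no wildcards and no Allow lines.
function disallowedPrefixes($robotsTxt)
{
    $prefixes = array();
    $inStarGroup = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $inStarGroup = ($value === '*');
        } elseif ($field === 'disallow' && $inStarGroup && $value !== '') {
            $prefixes[] = $value;
        }
    }

    return $prefixes;
}

// True when $path starts with one of the disallowed prefixes.
function isBlocked($robotsTxt, $path)
{
    foreach (disallowedPrefixes($robotsTxt) as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

$robotsTxt = "User-agent: *\nDisallow: /search\n";
var_dump(isBlocked($robotsTxt, '/search?p=php'));
var_dump(isBlocked($robotsTxt, '/articles/zend'));
```

Running a list of the site's important URLs through a check like this before publishing robots.txt is one way to catch a rule that blocks more than intended.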

submit site to google yahoo dmoz msn

You had a plan for a business and you needed a website. Now the website is done. What next?

How do you inform search engines that your website exists and ask them to index it?

When I started working on search engine optimization (SEO) for 3 ecommerce sites in 2006, this was the first question I had in mind.

The following are the ways of getting your website indexed by search engines -

SEO - set preferred domain

I always thought the following links were the same -
http://ganeshhs.com/search-engine-optimization-seo/noindex-nofollow
http://www.ganeshhs.com/search-engine-optimization-seo/noindex-nofollow

Both links lead to the same page and differ only in the www. prefix.
But search engines treat them as two different links. I have seen cases where some of our links leave out www. and others include it.

So what are the impacts?

  1. Search engines keep both versions of the URLs. When people click search result links that lead to our site under different versions of these URLs, page rank and traffic can be split between the versions.
  2. These URLs look like different documents to crawlers and create excessive crawling on our website.

How do we instruct search engines to treat both URLs as the same? Google webmaster tools has an option to set the preferred domain.

So what's the advantage of setting a preferred domain? If I set my preferred domain to ganeshhs.com, then the next time Google crawls my website and finds a link starting with www.ganeshhs.com, it will follow it as ganeshhs.com, and when Google displays my website's links in search results it will show them as ganeshhs.com.

It also helps us to fix external site referrals. A few people have started linking to my website. Suppose their referral link is http://www.ganeshhs.com/category/search-engine-optimization-seo whereas my actual article URL is http://ganeshhs.com/category/search-engine-optimization-seo; when Google crawls our website through that referral link, it will keep the version of the domain we preferred.
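The same idea can be applied to the links we generate ourselves, so that our own pages always use one version of the domain. Here is a minimal sketch, assuming the two versions differ only by the www. prefix; the function name toPreferredHost is hypothetical, and ports and fragments are ignored for brevity.

```php
<?php
// Hypothetical helper: rewrite a URL onto the preferred host when it
// differs from it only by a leading "www.".
function toPreferredHost($url, $preferredHost)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url;
    }

    // Compare hosts with any leading "www." removed.
    $strip = function ($host) {
        return preg_replace('/^www\./i', '', strtolower($host));
    };

    // Leave links to other sites untouched.
    if ($strip($parts['host']) !== $strip($preferredHost)) {
        return $url;
    }

    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $path   = isset($parts['path']) ? $parts['path'] : '/';
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $scheme . '://' . $preferredHost . $path . $query;
}

echo toPreferredHost('http://www.ganeshhs.com/category/search-engine-optimization-seo', 'ganeshhs.com'), "\n";
```

This only normalizes links inside our own pages; it does not replace the preferred domain setting, which is what tells Google how to treat links coming from other sites.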

noindex nofollow

The robots meta tag, <meta name="robots" content="noindex, nofollow">, tells robots not to index the content of a page and/or not to scan it for links to follow. Keeping this meta tag on pages that we don't want indexed, or whose links we don't want followed, is helpful.

In some cases we need to keep links to external sites. But what are the impacts of this?

  1. Part of the page rank is shared with the external website -
    When we link to other websites, part of our website's page rank is shared with
    those external sites, and we may end up sending the search engine crawlers to the other site.
  2. Leading search engine crawlers to crawl the external website -
    A crawler enters our website to crawl more pages, which helps us get more pages indexed in search engines. But by keeping external links we give the crawler a way to leave our website and crawl the external websites instead.

We have to keep external links, so how do we prevent the above scenario?

  • If google.com is an external link, we could use <a href="http://www.google.com" rel="nofollow">. When the crawler comes across this external link, the attribute tells it not to follow that link. (noindex is a value for the robots meta tag, not for the rel attribute.)
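When links are generated from code, the nofollow attribute can be added automatically for external hosts. This is a minimal sketch, assuming a link counts as external whenever its host differs from our own; the function name buildLink is hypothetical.

```php
<?php
// Hypothetical helper: build an anchor tag, adding rel="nofollow" when
// the link points to a host other than our own.
function buildLink($url, $text, $ownHost)
{
    // parse_url returns null for the host of a relative URL.
    $host = parse_url($url, PHP_URL_HOST);
    $isExternal = is_string($host) && strcasecmp($host, $ownHost) !== 0;

    $rel = $isExternal ? ' rel="nofollow"' : '';

    return sprintf(
        '<a href="%s"%s>%s</a>',
        htmlspecialchars($url, ENT_QUOTES),
        $rel,
        htmlspecialchars($text, ENT_QUOTES)
    );
}

echo buildLink('http://www.google.com', 'Google', 'ganeshhs.com'), "\n";
```

Note that this simple comparison treats www.ganeshhs.com and ganeshhs.com as different hosts, so the own host passed in should match the version used in the site's links.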

    ganeshhs.com google page rank

    My blog site ganeshhs.com now has a Google Page Rank of 2/10.

    When I started my first project with Zend Framework in May 2007, there were very few articles and tutorials, and my first way of getting information was to use a search engine. Then I realised it would be a great idea if my own articles were listed in search engines, and so I turned my attention to search engine optimization.

    Looking at my website analytics, I noticed that my recent posts on Zend Lucene search had a higher number of unique visits, which raised my website's daily visits to an average of 100 (with more unique visits). I also started getting backlinks from other websites (such as http://www.phpimpact.com/), which contributed to this page rank as well.

    Most importantly, keywords relevant to the context of the website or article help the articles get indexed by search engines. The following lists some of the blog articles and keywords I targeted, and their positions in Google and Yahoo! -

    Keyword               Google Position   Yahoo! Position
    Zend Lucene Search    Page 1            Page 1
    Zend Auth             Page 1            Page 2
    Zend Registry         Page 1            -
    Zend Debug            Page 1            -
    Zend Exception        Page 1            -
    Zend Config           Page 1            -
    Zend Loader           Page 1            -

    Zend Lucene Search - part4 - Search Results Highlighting

    The Zend_Search_Lucene_Search_Query::highlightMatches() method allows the developer to highlight the terms of an HTML document in the context of a search query.

    In the previous article, Zend Lucene Search - part3 - retrieving the indexed data, I talked about retrieving the search results. Highlighting the searched keyword in the results is an important feature that most search engines provide. In this article I will write about highlighting the search results retrieved using Zend Lucene search.

    <?php
    require_once 'Zend/Search/Lucene.php';

    $queryStr = "php";

    $query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");
    $results = $index->find($query);

    echo "Index contains ".$index->count()." documents.\n\n";

    // Collect the fields of every hit, highlighting the matched keywords
    $data = array();

    if ($index->count())
    {
        $count = 0;

        foreach ($results as $result)
        {
            $data[$count]["article_url"]               = $result->url;
            $data[$count]["article_title"]             = $query->highlightMatches($result->title);
            $data[$count]["article_description"]       = $query->highlightMatches($result->contents);
            $data[$count]["article_created_date_time"] = $result->postedDateTime;
            $data[$count]["article_id"]                = $result->articleId;
            $count++;
        }
    }

    print_r($data);
    ?>

    This program is the same as the one in Zend Lucene Search - part3 - retrieving the indexed data. The only difference is that I now call highlightMatches on the search results returned.
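To make the idea of highlighting concrete without needing an index, here is a plain-PHP stand-in that wraps whole-word, case-insensitive matches of a single term in bold tags. It is only an illustration of the concept on plain text, not how highlightMatches is implemented in Zend; the function name highlightTerm is hypothetical.

```php
<?php
// Hypothetical stand-in for term highlighting (not Zend's implementation):
// wrap whole-word, case-insensitive matches of $term in <b> tags.
function highlightTerm($text, $term)
{
    // preg_quote keeps characters in the term from acting as regex syntax.
    $pattern = '/\b(' . preg_quote($term, '/') . ')\b/i';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

echo highlightTerm('Learning PHP with php examples', 'php'), "\n";
```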

    Related articles:
    Zend Lucene Search - part1 - creating index
    Zend Lucene Search - part2 - Real time indexing
    Zend Lucene Search - part3 - retrieving the indexed data

    Zend Lucene Search - part3 - retrieving the indexed data

    Once the index is created, we are ready to use Zend Lucene search to search the website. In the following example, php is the search keyword used to fetch the relevant search results from the already indexed data.

    <?php
    require_once 'Zend/Search/Lucene.php';

    $query = "php";

    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");
    $results = $index->find($query);

    echo "Index contains ".$index->count()." documents.\n\n";

    // Collect the stored fields of every hit
    $data = array();

    if ($index->count())
    {
        $count = 0;

        foreach ($results as $result)
        {
            // $query is a plain string here, so the fields are read as stored
            $data[$count]["article_url"]               = $result->url;
            $data[$count]["article_title"]             = $result->title;
            $data[$count]["article_description"]       = $result->contents;
            $data[$count]["article_created_date_time"] = $result->postedDateTime;
            $data[$count]["article_id"]                = $result->articleId;
            $count++;
        }
    }

    print_r($data);
    ?>

    To retrieve the indexed data, the first thing we need to do is open the index path.

    $index = Zend_Search_Lucene::open("/var/www/lucene-data/blog-index");

    Suppose the user's search input is -

    $query = "php";

    We have to use the find method of zend search lucene -

    $results = $index->find($query);

    The find method returns an array of hits, so the number of results for the search is count($results). The count method of Zend Lucene search returns the total number of documents in the index -

    echo "Index contains ".$index->count()." documents.\n\n";

    To limit the number of search results we use setResultSetLimit of Zend Lucene search, which has to be called before find -

    $index->setResultSetLimit(10);

    Related articles:
    Zend Lucene Search - part1 - creating index
    Zend Lucene Search - part2 - Real time indexing
    Zend Lucene Search - part4 - Search Results Highlighting