Phrase Based Indexing (PBI) & Information Retrieval

Introduction:

  • Phrase based indexing is an information retrieval approach that uses phrases to index, retrieve, organize and describe documents.
  • It was described in a U.S. patent application submitted by Google engineer Anna Lynn Patterson.
  • The application was filed in July 2004.
  • It was published in January 2006.

Background of Invention:

  • Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpora such as the Internet.
  • A document is retrieved in response to a query containing a number of query terms, typically based on having some number of the query terms present in the document.
  • The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like.
  • Concepts, however, are often expressed in phrases, such as “Australian Shepherd,” “President of the United States,” or “Sundance Film Festival.”
  • Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases.

Summary:

An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.

  1. Identifying Phrases and Related Phrases
  2. Indexing Documents with respect to Phrases
  3. Ranking Documents with respect to Phrases
  4. Creating Descriptions for Documents
  5. Eliminating Duplicate Documents

1. Identifying Phrases and Related Phrases:

  • Phrases are identified based on a phrase’s ability to predict the presence of other phrases in a document.
  • The system looks for phrases that have frequent and/or distinguished (unique) usage.
  • A prediction measure is used for identifying related phrases.
  • The prediction measure relates the actual co-occurrence rate of two phrases to the expected co-occurrence rate of the two phrases.
  • Information gain = actual co-occurrence rate / expected co-occurrence rate
  • Two phrases are related to each other when the prediction measure exceeds the prediction threshold.
  • Example: The phrase “President of the United States” predicts related phrases such as “White House” and “George Bush”.
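
The ratio above can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: it assumes each document is already reduced to a set of phrases, and it assumes the expected co-occurrence rate is the product of the two individual rates. The function names and the threshold value are made up for the example.

```python
def information_gain(docs, phrase_a, phrase_b):
    """Ratio of actual to expected co-occurrence rate for two phrases.

    docs is a list of sets, each holding the phrases found in one document.
    Assumption: the expected rate is the product of the two individual
    document rates (what we would see if the phrases were independent).
    """
    total = len(docs)
    count_a = sum(1 for phrases in docs if phrase_a in phrases)
    count_b = sum(1 for phrases in docs if phrase_b in phrases)
    count_both = sum(
        1 for phrases in docs if phrase_a in phrases and phrase_b in phrases
    )
    if total == 0 or count_a == 0 or count_b == 0:
        return 0.0
    actual = count_both / total
    expected = (count_a / total) * (count_b / total)
    return actual / expected


def is_related(docs, phrase_a, phrase_b, threshold=1.5):
    # threshold is an illustrative value, not one taken from the patent
    return information_gain(docs, phrase_a, phrase_b) > threshold


sample_docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"sundance film festival"},
]
print(is_related(sample_docs, "president of the united states", "white house"))
```

In the sample collection the two phrases always appear together, so the actual rate is well above the expected rate and the pair counts as related.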

2. Indexing Documents based on Related Phrases:

  • An information retrieval system indexes documents in the document collection by the valid or good phrases.
  • Posting list: for each phrase, the documents that contain the phrase.
  • Second list: for each document containing the given phrase, data indicating which of the related phrases of that phrase are also present in the document.
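
As a rough sketch of the two lists, the Python below keeps them together: each posting entry holds a document and the related phrases also found in it. The data layout and the names `build_phrase_index`, `docs` and `related_phrases` are assumptions for illustration; the source does not say how the second list is stored.

```python
from collections import defaultdict


def build_phrase_index(docs, related_phrases):
    """Build a posting list per phrase.

    docs: dict mapping doc_id -> set of good phrases in that document.
    related_phrases: dict mapping phrase -> set of phrases related to it.
    Returns a dict mapping phrase -> list of (doc_id, related_present),
    where related_present lists the related phrases found in that document.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        phrases = docs[doc_id]
        for phrase in sorted(phrases):
            related_present = sorted(related_phrases.get(phrase, set()) & phrases)
            index[phrase].append((doc_id, related_present))
    return dict(index)


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related_phrases = {
    "president of the united states": {"white house", "george bush"},
}
index = build_phrase_index(docs, related_phrases)
print(index["president of the united states"])
```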

3. Ranking Documents with respect to Phrases:

  • Documents are ranked on two factors:
      * Ranking based on contained phrases
      * Ranking based on anchor phrases
  • Document Score = Body Hit Score + Anchor Hit Score
  • For example, with Body Hit Score = 0.30 and Anchor Hit Score = 0.70:
  • Document Score = 0.30 + 0.70 = 1.00
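
The sum above can be turned into a small ranking sketch. It assumes the body and anchor hit scores have already been computed for each document; how they are computed is not covered here, and the function names are hypothetical.

```python
def document_score(body_hit_score, anchor_hit_score):
    # Document Score = Body Hit Score + Anchor Hit Score
    return body_hit_score + anchor_hit_score


def rank_documents(hit_scores):
    """hit_scores: dict mapping doc_id -> (body_hit_score, anchor_hit_score).

    Returns (doc_id, score) pairs, highest score first.
    """
    scored = [
        (doc_id, document_score(body, anchor))
        for doc_id, (body, anchor) in hit_scores.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_documents({"a": (0.30, 0.70), "b": (0.10, 0.20), "c": (0.50, 0.90)})
print([doc_id for doc_id, _ in ranking])
```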

Phrase Extension:

  • The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.
  • A user may enter an incomplete phrase in a search query, such as “President of the”.
  • Incomplete phrases such as these may be identified and replaced by a phrase extension, such as “President of the United States”.
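
One simple way to picture phrase extension is a prefix lookup against the known phrases. The sketch below is an assumption-heavy illustration: the source does not say how an extension is chosen, so this version picks the most frequent known phrase that starts with the incomplete one. `extend_phrase` and `phrase_counts` are hypothetical names.

```python
def extend_phrase(incomplete, phrase_counts):
    """Return a known phrase that extends the incomplete phrase, or None.

    phrase_counts: dict mapping known phrase -> number of documents that
    contain it (hypothetical data). Assumption: when several phrases
    extend the input, the most frequent one is chosen.
    """
    prefix = incomplete.strip().lower()
    candidates = [
        phrase for phrase in phrase_counts
        if phrase.lower().startswith(prefix + " ")
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda phrase: phrase_counts[phrase])


phrase_counts = {
    "President of the United States": 120,
    "President of the Senate": 15,
    "Australian Shepherd": 40,
}
print(extend_phrase("President of the", phrase_counts))
```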

4. Creating Descriptions for Documents:

  • Phrase information is used to create a description of a document.
  • The system identifies the query phrases, related phrases and phrase extensions present in each sentence and keeps a count for each sentence.
  • It ranks the sentences based on the count.
  • It selects some number of the top-ranking sentences as the description and includes it in the search results.
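
The counting and ranking steps can be sketched as follows. The sketch assumes the document is already split into sentences and that the query phrases, related phrases and phrase extensions arrive as one list. Matching is a plain case-insensitive substring test, and putting the chosen sentences back in document order is a choice made here, not something the source specifies.

```python
def describe_document(sentences, phrases, max_sentences=2):
    """Pick the sentences that contain the most phrases of interest.

    sentences: list of sentence strings from one document.
    phrases: query phrases, related phrases and phrase extensions.
    """
    lowered = [phrase.lower() for phrase in phrases]
    counted = []
    for position, sentence in enumerate(sentences):
        text = sentence.lower()
        count = sum(1 for phrase in lowered if phrase in text)
        counted.append((count, position, sentence))
    # Highest count first; earlier sentences win ties.
    ranked = sorted(counted, key=lambda item: (-item[0], item[1]))
    top = [item for item in ranked[:max_sentences] if item[0] > 0]
    top.sort(key=lambda item: item[1])  # restore document order
    return " ".join(item[2] for item in top)


sentences = [
    "The President of the United States lives in the White House.",
    "The weather was mild.",
    "The White House hosts state dinners.",
]
phrases = ["President of the United States", "White House"]
print(describe_document(sentences, phrases))
```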

5. Eliminating Duplicate Documents:

  • Duplicate documents are identified and eliminated while crawling documents or when processing a search query.
  • The description of each document is stored in association with that document in a hash table.
  • The system checks a newly crawled page against the hash values stored in the hash table. A match indicates that the current document is a duplicate.
  • The system keeps the document that has the higher page rank or greater document significance and removes the duplicate, which will not appear in future search results for any query.
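
A minimal sketch of this check, under stated assumptions: the description is normalized and hashed with SHA-256 (the source names no hash function), the hash table is a plain Python dict, and only page rank is used to decide which copy to keep. All names are hypothetical.

```python
import hashlib


def description_key(description):
    # Normalize case and whitespace so trivially different copies match.
    normalized = " ".join(description.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_document(table, doc_id, description, page_rank):
    """Record a crawled document; return the doc_id dropped as a duplicate.

    table: dict mapping description hash -> (doc_id, page_rank).
    Returns None when the description has not been seen before.
    """
    key = description_key(description)
    existing = table.get(key)
    if existing is None:
        table[key] = (doc_id, page_rank)
        return None
    if page_rank > existing[1]:
        table[key] = (doc_id, page_rank)  # the new copy replaces the old one
        return existing[0]
    return doc_id  # the stored copy wins, the new one is the duplicate


table = {}
add_document(table, "a", "Phrase based indexing.", 0.4)
print(add_document(table, "b", "phrase  based indexing.", 0.9))
```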

Functions of Indexing System:

  • Identifies phrases in documents
  • Indexes documents according to their phrases by accessing various websites

Functions of Front End Server:

  • Receives queries from a user
  • Provides those queries to the search system

Functions of Searching System:

  • Searches for documents relevant to the search query
  • Identifies the phrases in the search query
  • Ranks the documents

Functions of Presentation System:

  • Modifies the search results, including removing duplicate content
  • Generates topical descriptions of documents and provides the modified search results

Spam Detection Process:

  • “SPAM” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”
  • Pages containing specific words and phrases that advertisers might be interested in are often called “HoneyPots,” and are created for search engines to display along with paid advertisements
  • A phrase based indexing system knows the number of related phrases in a document
  • A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection
  • A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases
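
The comparison above can be sketched as a simple count against a cutoff. The cutoff of 100 is an assumption taken from the low end of the spam range quoted above, and the way related phrases are counted here (distinct phrases in the document that are related to some other phrase in it) is one plausible reading, not the patent's stated procedure.

```python
def related_phrases_present(doc_phrases, related_phrases):
    """Distinct phrases in the document that are related to another phrase in it.

    doc_phrases: set of phrases found in the document.
    related_phrases: dict mapping phrase -> set of phrases related to it.
    """
    present = set()
    for phrase in doc_phrases:
        present |= related_phrases.get(phrase, set()) & doc_phrases
    return present


def looks_like_spam(doc_phrases, related_phrases, threshold=100):
    # threshold=100 is an assumed cutoff, not a value fixed by the source
    return len(related_phrases_present(doc_phrases, related_phrases)) >= threshold


# Synthetic data: phrase "p0" has 150 related phrases "p1" .. "p150".
related = {"p0": {f"p{i}" for i in range(1, 151)}}
normal_doc = {"p0"} | {f"p{i}" for i in range(1, 11)}
stuffed_doc = {"p0"} | {f"p{i}" for i in range(1, 151)}
print(looks_like_spam(normal_doc, related), looks_like_spam(stuffed_doc, related))
```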

Advantages of Phrase Based Indexing:

  • Detecting Duplicate Pages
  • Spam Detection
  • Save Time

Other Patent Applications:

  • Phrase identification in an information retrieval system
  • Phrase-based searching in an information retrieval system
  • Phrase-based generation of document descriptions
  • Detecting spam documents in a phrase based information retrieval system
  • Efficient Phrase Based Document Indexing for Document Clustering

According to data collected from users of European web analytics provider OneStat, most people use two- or three-word queries in search engines:

  • Two-word phrases — 28.38 percent
  • Three-word phrases — 27.15 percent
  • Four-word phrases — 16.42 percent
  • One-word phrase — 13.48 percent
  • Five-word phrases — 8.03 percent
  • Six-word phrases — 3.67 percent
  • Seven-word phrases — 1.63 percent
  • Eight-word phrases — 0.73 percent
  • Nine-word phrases — 0.34 percent
  • Ten-word phrases — 0.16 percent
