Phrase Based Indexing (PBI) & Information Retrieval
Introduction:
- An information retrieval system uses phrases to index, retrieve, organize and describe documents.
- It is described in a U.S. patent application submitted by Google engineer Anna Lynn Patterson
- Application filed in July 2004
- Published in January 2006
Background of Invention:
- Information retrieval systems, generally called search engines, are now an essential tool for finding information in large-scale, diverse, and growing corpora such as the Internet
- A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document
- The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like
- Concepts are often expressed in phrases, such as “Australian Shepherd,” “President of the United States,” or “Sundance Film Festival”
- Accordingly, there is a need for an information retrieval system and methodology that can identify phrases, index documents according to phrases, and search and rank documents in accordance with their phrases
Summary:
An information retrieval system and methodology uses phrases to index, search, rank, and describe documents in the document collection.
- Identifying Phrases and Related Phrases
- Indexing Documents with respect to Phrases
- Ranking Documents with respect to Phrases
- Creating Descriptions for Documents
- Elimination of Duplicate Documents
1. Identifying Phrases and Related Phrases:
- Based on a phrase’s ability to predict the presence of other phrases in a document.
- The system identifies phrases that have frequent and/or distinctive usage
- A prediction measure is used for identifying related phrases
- The prediction measure relates the actual co-occurrence rate of two phrases to the expected co-occurrence rate of the two phrases
- Information gain = actual co-occurrence rate / expected co-occurrence rate
- Two phrases are related to each other when the prediction measure exceeds the prediction threshold
- Example: The phrase “President of the United States” predicts related phrases such as “White House” and “George Bush”
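The prediction measure above can be sketched in a few lines of Python. This is only an illustration, not the method from the patent application: the function name related_phrases, the threshold value, and the assumption that the expected co-occurrence rate is the product of the two phrases' individual document rates are all choices made for this example.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, threshold):
    """Return {(phrase_a, phrase_b): gain} for pairs whose gain exceeds threshold.

    docs is a list of sets, each set holding the phrases found in one document.
    """
    n = len(docs)
    phrase_count = Counter()  # number of documents containing each phrase
    pair_count = Counter()    # number of documents containing both phrases
    for phrases in docs:
        phrase_count.update(phrases)
        pair_count.update(combinations(sorted(phrases), 2))

    related = {}
    for (a, b), together in pair_count.items():
        actual = together / n
        # Assumption: phrases are expected to co-occur as if independent.
        expected = (phrase_count[a] / n) * (phrase_count[b] / n)
        gain = actual / expected
        if gain > threshold:
            related[(a, b)] = gain
    return related
```

With a gain above 1, two phrases appear together more often than chance alone would suggest, so a threshold somewhat above 1 keeps only the pairs with a real association.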
2. Indexing Documents based on Related Phrases:
- An information retrieval system indexes documents in the document collection by the valid or good phrases.
- Posting List = documents that contain the phrase
- Second List = used to store data indicating which of the related phrases of the given phrase are also present in each document containing the given phrase
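One way to picture the two lists is a mapping from each phrase to the documents that contain it, with a list of flags per document recording which related phrases are also present. The sketch below is a simplified illustration; build_index and its data layout are assumptions for this example and not the storage format used by the actual system.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build {phrase: {doc_id: [flag, ...]}}.

    docs maps a document id to the set of phrases it contains.
    related maps a phrase to an ordered list of its related phrases.
    The keys of index[phrase] play the role of the posting list; the flag
    list says which related phrases also occur in that document.
    """
    index = defaultdict(dict)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            flags = [r in phrases for r in related.get(phrase, [])]
            index[phrase][doc_id] = flags
    return dict(index)
```

For a phrase with related phrases ["white house", "george bush"], a document containing only the first of them would be stored with the flags [True, False].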
3. Ranking Documents with respect to Phrases:
- Ranking documents is based on two factors
* Ranking Documents based on Contained Phrases
* Ranking Documents based on Anchor Phrases
- Document Score = Body Hit Score + Anchor Hit Score
- For example: Body Hit Score = 0.30, Anchor Hit Score = 0.70
- Document Score = 0.30 + 0.70 = 1.00
Phrase Extension:
- The information retrieval system is also adapted to use the phrases when searching for documents in response to a query.
- A user may enter an incomplete phrase in a search query, such as “President of the”
- Incomplete phrases such as these may be identified and replaced by a phrase extension, such as “President of the United States”
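A phrase extension lookup can be approximated as a prefix match against the known phrases. The sketch below is a minimal illustration; extend_phrase is a hypothetical helper, and ordering candidates by length is an assumption made here, not the selection rule of the described system.

```python
def extend_phrase(incomplete, known_phrases):
    """Return known phrases that begin with the incomplete phrase, shortest first."""
    # Require a following space so the match ends on a word boundary.
    prefix = incomplete.lower().strip() + " "
    matches = [p for p in known_phrases if p.lower().startswith(prefix)]
    return sorted(matches, key=len)
```

Calling extend_phrase("President of the", ["President of the United States", "Sundance Film Festival"]) returns a list containing only "President of the United States".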
4. Creating Descriptions for Documents:
- Phrase information is used to create a description of a document
- For each sentence, the system counts the occurrences of phrases present in the query, related phrases, and phrase extensions
- It ranks the sentences based on the count
- It selects some number of top-ranking sentences as the description and includes it in the search results
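The counting and ranking steps above can be sketched as follows. This is a rough illustration under simplifying assumptions: sentences are split on end punctuation with a regular expression, phrase matching is a case-insensitive substring count, and describe and top_n are names invented for this example.

```python
import re


def describe(text, phrases, top_n=2):
    """Return the top_n sentences of text ranked by how many phrase hits they contain."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def hit_count(sentence):
        lowered = sentence.lower()
        return sum(lowered.count(p.lower()) for p in phrases)

    # sorted() is stable, so sentences with equal counts keep document order.
    ranked = sorted(sentences, key=hit_count, reverse=True)
    return ranked[:top_n]
```

The phrases argument would hold the query phrases together with their related phrases and phrase extensions.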
5. Eliminating Duplicate Documents:
- Duplicate documents are identified and eliminated while crawling a document or when processing a search query
- The description of every document is stored in association with that document in a hash table
- The system compares the description of a newly crawled page with the hash values stored in the hash table. A match indicates that the current document is a duplicate
- The system keeps the document with the higher page rank or greater document significance and removes the duplicate, which will not appear in future search results for any query
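The duplicate check can be illustrated by hashing each document description and keeping one document per hash. The sketch below is an assumption-laden example: deduplicate is a hypothetical function, SHA-256 is chosen only for illustration, and page rank is passed in as a plain number.

```python
import hashlib


def deduplicate(docs):
    """Split documents into (kept, removed) ids based on matching descriptions.

    docs is an iterable of (doc_id, description, page_rank) tuples.
    """
    best = {}      # description hash -> (doc_id, page_rank) of the document kept
    removed = []
    for doc_id, description, page_rank in docs:
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        if key not in best:
            best[key] = (doc_id, page_rank)
        elif page_rank > best[key][1]:
            # The new document outranks the stored one, so swap them.
            removed.append(best[key][0])
            best[key] = (doc_id, page_rank)
        else:
            removed.append(doc_id)
    kept = [doc_id for doc_id, _ in best.values()]
    return kept, removed
```

Two pages with identical descriptions end up under the same key, and only the one with the higher page rank stays in the kept list.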
Functions of Indexing System:
- Identifies phrases in documents
- Indexes documents according to their phrases by accessing various websites
Functions of Front End Server:
- Receives queries from a user
- Provides those queries to the search system
Functions of Searching System:
- Searches for documents relevant to the search query
- Identifies the phrases in the search query
- Ranks the documents
Functions of Presentation System:
- Modifies the search results, including removing duplicate content
- Generates topical descriptions of documents and provides the modified search results
Spam Detection Process:
- “SPAM” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages”
- Pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements
- A phrase based indexing system knows the number of related phrases in a document
- A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection
- A spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases
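The related-phrase count lends itself to a simple threshold test. The sketch below is illustrative only: related_phrase_count and is_spam are hypothetical helpers, and the default cutoff of 100 is taken from the lower end of the spam range quoted above rather than from any documented setting.

```python
def related_phrase_count(doc_phrases, related):
    """Count distinct phrases in the document that are related to another phrase in it.

    doc_phrases is the set of phrases found in the document.
    related maps a phrase to the set of phrases related to it.
    """
    present = set()
    for phrase in doc_phrases:
        present |= related.get(phrase, set()) & doc_phrases
    return len(present)


def is_spam(doc_phrases, related, max_related=100):
    """Flag a document whose related-phrase count exceeds max_related (assumed cutoff)."""
    return related_phrase_count(doc_phrases, related) > max_related
```

A normal document with 8 to 20 related phrases stays well under the cutoff, while a keyword-stuffed page with hundreds of related phrases is flagged.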
Advantages of Phrase Based Indexing:
- Detecting Duplicate Pages
- Spam Detection
- Save Time
Other Patent Applications:
- Phrase identification in an information retrieval system
- Phrase-based searching in an information retrieval system
- Phrase-based generation of document descriptions
- Detecting spam documents in a phrase based information retrieval system
- Efficient Phrase Based Document Indexing for Document Clustering
According to data collected from users of the European web analytics provider OneStat, most people use two- or three-word queries in search engines:
- Two-word phrases — 28.38 percent
- Three-word phrases — 27.15 percent
- Four-word phrases — 16.42 percent
- One-word phrase — 13.48 percent
- Five-word phrases — 8.03 percent
- Six-word phrases — 3.67 percent
- Seven-word phrases — 1.63 percent
- Eight-word phrases — 0.73 percent
- Nine-word phrases — 0.34 percent
- Ten-word phrases — 0.16 percent


