Why are search engines free? Should we not be charged for this service?


Key Questions

1. How does a search engine find your site?
2. Why do you get different results when you enter the same search terms on different search engines?
3. How do search engines make money? Why do they offer their services for free?

Search engines search the WWW indirectly

* When you type in search terms, they are matched against a database containing the full text of web pages selected from the WWW (a toy sketch of this matching appears after this list).
* This database is ‘stale’ in that the most up-to-date copies of each page may not be there.
* A list of links with some brief information is compiled to match your search term.
* When you click on a link in the search engine's results list, it takes you to the actual page on the WWW.
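To make the indirection concrete, here is a toy Python sketch (hypothetical data and function names, not any real engine's code) that matches search terms against a small 'stale' database of stored page text rather than the live web:

# A toy 'stale' database: URL -> text copied from the page at crawl time.
stale_db = {
    "http://example.com/a": "search engines index copies of pages",
    "http://example.com/b": "spiders follow links between pages",
}

def search(terms):
    # Match against the stored copies, not the live WWW.
    words = terms.lower().split()
    return [url for url, text in stale_db.items()
            if all(word in text for word in words)]

print(search("spiders links"))   # -> ['http://example.com/b']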


How do copies of the web pages get into this database?

Crawler-based Search Engines:


* Crawler-based search engine databases are selected and built by computer robot programs called spiders.
* Crawler-based search engines have three major elements.


1. The spider, also called the crawler.

* The spider visits a web page, reads it, and then follows links to other pages within the site.
* Spiders find pages for potential inclusion by following the links in pages they already have in their database (i.e., already "know about") - this link-following is what it means when someone refers to a site being "spidered" or "crawled."
* Spiders cannot think, type a URL, or use judgment to "decide" to go look something up and see what's on the web about it.
* The spider returns to the site on a regular basis, such as every month or two, to look for changes. (A minimal sketch of such a crawl loop appears after this list.)
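What such a crawl loop looks like varies by engine, but a minimal Python sketch (standard library only; the starting URL is hypothetical) of visit-read-follow-links might be:

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Gathers the href targets of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=10):
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                 # skip pages the spider cannot fetch
        pages[url] = html            # keep a copy for the indexing step
        collector = LinkCollector()
        collector.feed(html)
        # Only links found in pages we already "know about" are followed.
        queue.extend(urljoin(url, link) for link in collector.links)
    return pages

pages = crawl("http://example.com/")   # hypothetical starting point

Real spiders add politeness delays, robots.txt checks (see below), and revisit schedules on top of this basic loop.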


2. The Index

* Everything the spider finds goes into the second part of the search engine, the index.
* After a spider finds a page, it is passed on to another computer program for "indexing."
* This indexing program identifies the text, links, and other content in the page and stores it in the search engine database's files.
* The index is a giant database containing a copy of every web page that the spider finds.
* The index can be searched by keyword(s) and the page will be found if the search matches its content.
* If a web page changes, then the index is updated with new information.
* Sometimes it can take a while for new pages or changes that the spider finds to be added to the index.
* A web page may have been "spidered" but not yet "indexed"; until it is added to the index, it is not available to those searching with the search engine. (A toy inverted index illustrating the indexing step follows this list.)
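As a concrete (toy) illustration of this step, a minimal Python inverted index maps each word to the set of pages containing it; it could be built from the pages dictionary produced by the crawl sketch above:

import re

def build_index(pages):
    # pages: dict of URL -> page text, e.g., from the crawl sketch above
    index = {}
    for url, text in pages.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, set()).add(url)
    return index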


3. Search Engine Software

* This is the program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in the order it believes is most relevant, as sketched below.
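Real ranking algorithms weigh many signals, but a toy sketch of this step, continuing the inverted index above, might simply score each page by how many of the query's words it matches:

def rank(index, query):
    # Count how many of the query's words each indexed page contains.
    scores = {}
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1
    # Best matches first.
    return sorted(scores, key=scores.get, reverse=True)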


How do I get my web page added to the index?


* If a web page is never linked to from any other page, search engine spiders cannot find it.
* The only way a brand-new page - one that no other page has ever linked to - can get into a search engine is for a human to submit its URL to the search engine companies as a request that the new page be included.
* All search engine companies offer ways to do this.


What if I don’t want my website indexed?

The following information is mostly taken from the article: Creating and Using a robots.txt File

Pages that are excluded are referred to as the "Invisible Web" - what you don't see in search engine results. The Invisible Web is estimated to be two to three or more times bigger than the visible web.


Use robots.txt files


Some types of pages and links are excluded from most search engines by giving instructions to the search engine spider.

'A robots.txt is a file placed on your server to tell the various search engine spiders not to crawl or index certain sections or pages of your site. You can use it to prevent indexing totally, prevent certain areas of your site from being indexed, or to issue individual indexing instructions to specific search engines. The file itself is a simple text file, which can be created in Notepad. It needs to be saved to the root directory of your site, that is, the directory where your home page or index page is.'

There are many reasons why you may not want to have your website indexed. Here are some (also from the same source as above):
  1. You are still building the site, or certain pages, and do not want the unfinished work to appear in search engines.
  2. You have information that, while not sensitive enough to bother password-protecting, is of no interest to anyone but those it is intended for, and you would prefer it did not appear in search engines.
  3. Most people will have some directories they would prefer were not crawled - for example, do you really need to have your cgi-bin indexed? Or a directory that simply contains thank-you or error pages?

Here are examples of robots.txt files:


1. Exclude a file from an individual Search Engine
You have a file, privatefile.htm, in a directory called 'private' that you do not wish to be indexed by Google. You know that the spider that Google sends out is called 'Googlebot'. You would add these lines to your robots.txt file:
User-agent: Googlebot
Disallow: /private/privatefile.htm

2. Exclude a section of your site from all spiders and bots
You are building a new section to your site in a directory called 'newsection' and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude, you can simply use a wildcard character, '*', to exclude them all.
User-agent: *
Disallow: /newsection/


Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.

3. Allow all spiders to index everything
Once again you can use the wildcard, '*', to let all spiders know they are welcome. Leave the second, Disallow, line empty; an empty Disallow excludes nothing.
User-agent: *
Disallow:
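Well-behaved crawlers check this file before fetching a page. Python's standard library ships a parser for the convention, so a compliant spider might do the following (the site URL is hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # hypothetical site
rp.read()                                     # fetch and parse the file
print(rp.can_fetch("Googlebot", "http://example.com/private/privatefile.htm"))
print(rp.can_fetch("*", "http://example.com/newsection/page.htm"))

Note that robots.txt is advisory: a spider has to choose to honour it.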



Spiders cannot access the page

Other pages are excluded because search engine spiders cannot reach them: if no other page links to a page, and no one has informed the search engine about the site, the spider has no way to find it.





Some Answers to Questions:



1. How many pages are on the WWW?

Kamol - please add your content here


2. How many pages does a search engine such as Google index?


According to the Techworld article Google's Storage Strategy, Google uses a system of Linux servers to store information. There are 18,000 servers, giving approximately 5 PB of storage (5 × 10^15 bytes) - roughly 280 GB per server.


3. You have a brand new website - so how do you get your page listed with a search engine?

Poom - please add your information here and link


4. Define the following terms: deep crawl, paid inclusion, stop words, meta tag, meta robots tag, search algorithm, link bomb


Link Bomb
A collection of web pages containing easily detectable methods of artificially increasing web ranking, created with the intention of getting the targeted site penalized or de-listed by search engines.
www.geneffects.com/briarsearch/index.html
Meta Tag
A Meta tag is a tag (that is, a coding statement) in the Hypertext Markup Language (HTML) that describes the contents of a Web page.
searchsoa.techtarget.com/sDefinition/0,,sid26_gci542231,00.html
Deep Crawl
Once a month, Googlebot will crawl all of the links it has listed in its database on your site. This is known as the Deep Crawl.
www.accuracast.com/resources/glossary/
Stop Words
Frequently occurring words that are not searchable. Some search engines include common words such as 'the', while others ignore such words in a query. Stop words may include numbers and frequent HTML strings. Some search engines only search stop words if they are part of a phrase search or are the only search terms in the query.
www.searchengineshowdown.com/defs/stop.html



5. How often is your site ‘spidered/crawled’ by Google and other popular search engines?

Doly - please add content here with link


6. Search engines provide their services for free, so how do they make money? Describe methods that search engines use to earn money (clue: how does Google earn money - does it support 'paid inclusion'?)




7. How does Google decide which pages to display and in what order? Outline how Google's PageRank algorithm works.


Google's PageRank™ algorithm is, of course, a closely guarded secret; however, Google's website does outline the basic method by which it ranks pages.

"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at considerably more than the sheer volume of votes, or links a page receives; for example, it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important." Using these and other factors, Google provides its views on pages' relative importance." (Google)

Beyond this, Google also uses sophisticated text-matching technology to ensure that the pages displayed are relevant to your query: it looks not simply at how frequently certain text appears on a page, but also at other factors such as the rest of the page's content and the content of the pages linking to it.
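The precise formula Google uses is not public, but the iteration described in the original PageRank paper is well known. A minimal Python sketch over a tiny hypothetical link graph (a simplified version, ignoring refinements such as dangling pages) shows the 'votes' idea at work:

def pagerank(links, damping=0.85, iterations=50):
    # links: page -> list of pages it links to (a small, closed toy graph)
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                # Each link is a vote whose weight depends on the voter's own rank.
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical three-page web: A and C link to B; B links back to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))

B ends up with the highest rank because it receives the most, and best-weighted, votes.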

8. Different search engines give different results – what are the factors that affect the results?



9. What can you do to ensure that your site is listed at the top of a search? With Google? With Yahoo?








Ethical and Social Issues

What are the social and ethical issues that may arise with search engines?
What does Google do with all the information from searches? Can you track someone’s search history? Who owns this information?
Can you find some relevant news items that raise negative issues arising from the use of search engines?