Search Engines



Why are search engines free? Should we not be charged for this service?

Key Questions
1. How does a search engine find your site?
2. Why is it that when you enter the same search terms on different search engines you get different results?
3. How do search engines make money? Why do they offer their services for free?

Search engines search the WWW indirectly
* When you type in search terms, they are matched against a database holding the full text of web pages that have been selected from the WWW.
* This database is 'stale' in that it may not hold the most up-to-date copy of each page.
* A list of links, each with some brief information, is compiled to match your search terms.
* When you click a link in the search engine's results list, it takes you to the actual page on the WWW.

Crawler-based Search Engines:
* Crawler-based search engine databases are selected and built by computer robot programs called spiders.
* Crawler-based search engines have three major elements.

1. The spider, also called the crawler.
* The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled."
* Spiders find pages for potential inclusion by following the links in the pages they already have in their database (i.e., already "know about").
* Spiders cannot think, type a URL, or use judgment to "decide" to go and look something up on the web; they can only follow links.
* The spider returns to the site on a regular basis, such as every month or two, to look for changes.
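The link-following behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a small hypothetical in-memory dictionary (`FAKE_WEB`) rather than real pages, and the names `LinkExtractor` and `crawl` are made up for this example. No real search engine's spider is shown here.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the WWW: URL -> HTML text.
FAKE_WEB = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/news.html": '<a href="/archive.html">Archive</a>',
    "/archive.html": "No links here.",
    "/orphan.html": "Nobody links to this page.",
}


def crawl(start, fetch):
    """Visit pages breadth-first, following only links found in pages already read."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # broken link: nothing to read
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl("/index.html", FAKE_WEB.get)
print(visited)
```

Starting from `/index.html`, the sketch reaches every page that is linked from somewhere, but never `/orphan.html`, because the spider has no link to follow to it.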

2. The Index
* Everything the spider finds goes into the second part of the search engine, the index.
* After a spider finds a page, it is passed on to another computer program for "indexing."
* This indexing program identifies the text, links, and other content in the page and stores it in the search engine database's files.
* The index is a giant database containing a copy of every web page that the spider finds.
* The index can be searched by keyword(s), and a page will be found if the search matches its content.
* If a web page changes, the index is updated with the new information.
* Sometimes it can take a while for new pages or changes that the spider finds to be added to the index.
* A web page may have been "spidered" but not yet "indexed." Until it is indexed (added to the index), it is not available to those searching with the search engine.
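One common way to make stored pages searchable by keyword is an inverted index, which maps each word to the pages that contain it. The sketch below assumes that approach; the page texts and the function names `tokenize`, `build_index` and `search` are hypothetical, and real search engine indexes store far more than this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages that a spider has already collected.
pages = {
    "/spiders.html": "Spiders crawl the web and follow links.",
    "/index.html": "The index stores a copy of every web page.",
}
index = build_index(pages)
print(search(index, "web page"))
```

A page added to `pages` after `build_index` has run is not found until the index is rebuilt, which mirrors the gap between a page being "spidered" and being "indexed."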

3. Search Engine Software
* This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
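As a rough illustration of "find matches and rank them," the sketch below scores each page by how many times the query words appear in it and sorts the matches by that score. Counting word frequency is an assumption made for this example only. Real search engine software combines many more signals, and the names `tokenize` and `rank_pages` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_pages(pages, query):
    """Return matching page URLs, highest word-count score first."""
    terms = tokenize(query)
    scores = {}
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:  # pages with no matching word are not results at all
            scores[url] = score
    # Sort by descending score; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical indexed pages.
pages = {
    "/a.html": "spider spider web",
    "/b.html": "a spider",
    "/c.html": "cats and dogs",
}
print(rank_pages(pages, "spider"))
```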

How do I get my web page added to the index?
* If a web page is never linked to in any other page, search engine spiders cannot find it.
* The only way a brand new page - one that no other page has ever linked to - can get into a search engine is for its URL to be sent by some human to the search engine companies as a request that the new page be included.
* All search engine companies offer ways to do this.

What if I don’t want my website indexed?
The following information is mostly taken from the article: [|Creating and Using a robots.txt File]

Pages that are excluded are referred to as the "Invisible Web" - what you don't see in search engine results. The Invisible Web is estimated to be two to three or more times bigger than the visible web.

Use robots.txt files
Some types of pages and links are excluded from most search engines by giving instructions to the search engine spider.

'A robots.txt is a file placed on your server to tell the various search engine spiders not to crawl or index certain sections or pages of your site. You can use it to prevent indexing totally, to prevent certain areas of your site from being indexed, or to issue individual indexing instructions to specific search engines. The file itself is a simple text file, which can be created in Notepad. It needs to be saved to the root directory of your site, that is, the directory where your home page or index page is.'

There are many reasons why you may not want to have your website indexed. Here are some (also from the same source as above).
 * 1) You are still building the site, or certain pages, and do not want the unfinished work to appear in search engines
 * 2) You have information that, while not sensitive enough to bother password protecting, is of no interest to anyone but those it is intended for and you would prefer it did not appear in search engines.
 * 3) Most people will have some directories they would prefer were not crawled - for example do you really need to have your cgi-bin indexed? Or a directory that simply contains thank you or error pages.

Here are examples of robots.txt files:

**1. Exclude a file from an individual search engine**
You have a file, privatefile.htm, in a directory called 'private' that you do not wish to be indexed by Google. You know that the spider that Google sends out is called 'Googlebot'. You would add these lines to your robots.txt file:
 User-Agent: Googlebot
 Disallow: /private/privatefile.htm

**2. Exclude a section of your site from all spiders and bots**
You are building a new section of your site in a directory called 'newsection' and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude; you can simply use a wildcard character, '*', to exclude them all.
 User-Agent: *
 Disallow: /newsection/
Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.

**3. Allow all spiders to index everything**
Once again you can use the wildcard, '*', to let all spiders know they are welcome. The second line, Disallow, you just leave empty; that is, you disallow nothing.
 User-agent: *
 Disallow:
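To check how a well-behaved spider would read the three examples above, you can feed the same rules to Python's standard `urllib.robotparser` module. This is a sketch for experimentation: the helper name `is_allowed` and the spider name `OtherBot` are made up for illustration, and each search engine's own spider may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, path):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Example 1: exclude one file from Googlebot only.
EXCLUDE_ONE_FILE = """User-Agent: Googlebot
Disallow: /private/privatefile.htm
"""

# Example 2: exclude a whole directory from every spider.
EXCLUDE_SECTION = """User-Agent: *
Disallow: /newsection/
"""

# Example 3: an empty Disallow line allows everything.
ALLOW_ALL = """User-agent: *
Disallow:
"""

print(is_allowed(EXCLUDE_ONE_FILE, "Googlebot", "/private/privatefile.htm"))
print(is_allowed(EXCLUDE_ONE_FILE, "OtherBot", "/private/privatefile.htm"))
print(is_allowed(EXCLUDE_SECTION, "OtherBot", "/newsection/draft.htm"))
print(is_allowed(ALLOW_ALL, "Googlebot", "/anything.htm"))
```

Keep in mind that robots.txt is a request, not a lock: it only works for spiders that choose to obey it.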

Spiders cannot access the page
Other pages are excluded because search engine spiders cannot reach them. This usually means that no other page links to them and the search engine has not been told about the site, so the spider has no way to find them.


Some Answers to Questions:

1. How many pages are on the WWW?
Kamol - please add your content here


2. How many pages does a search engine such as Google index?
According to the Techworld article [|Google's Storage Strategy], Google uses a system of Linux servers to store information. There are 18,000 servers, which together provide approximately 5 PB of storage (5 × 10^15 bytes).
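To get a feel for these figures, a quick calculation shows roughly how much storage each server would hold if the 5 PB were spread evenly across the 18,000 servers. The even split is an assumption made here for illustration; the article does not say how storage is actually distributed.

```python
total_bytes = 5 * 10**15   # 5 PB, as quoted above
servers = 18_000           # number of servers, as quoted above

# Assumption: storage is divided evenly between servers.
bytes_per_server = total_bytes / servers
gigabytes_per_server = bytes_per_server / 10**9

print(f"About {gigabytes_per_server:.0f} GB per server")
```

That works out to roughly 278 GB per server.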

3. You have a brand new website - so how do you get your page listed with a search engine?
Poom - please add your information here and link


4. Define the following terms: deep crawl, paid inclusion, stop words, meta tag, meta robots tag, search algorithm, link bomb
** Link Bomb ** A collection of web pages containing easily detectable methods of artificially increasing web ranking, created with the intention of getting the targeted site penalized or de-listed by search engines. (Source: www.geneffects.com/briarsearch/index.html)
** Meta Tag ** A meta tag is a tag (that is, a coding statement) in the Hypertext Markup Language (HTML) that describes the contents of a Web page. (Source: searchsoa.techtarget.com/sDefinition/0,,sid26_gci542231,00.html)
** Deep Crawl ** Once a month, Googlebot will crawl all of the links it has listed in its database on your site. This is known as the Deep Crawl. (Source: www.accuracast.com/resources/glossary/)
** Stop Words ** Frequently occurring words that are not searchable. Some search engines include common words such as 'the' while others ignore such words in a query. Stop words may include numbers and frequent HTML strings. Some search engines only search stop words if they are part of a phrase search or the only search terms in the query. (Source: www.searchengineshowdown.com/defs/stop.html)
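The stop word behaviour described above can be sketched as a small query filter. The word list `STOP_WORDS` and the function name `strip_stop_words` are assumptions made for this example. Every search engine keeps its own list and rules.

```python
# Illustrative stop word list only; real engines use their own lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def strip_stop_words(query):
    """Drop stop words from a query, unless they are the only search terms."""
    words = query.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    # If every word was a stop word, search for them anyway.
    return kept if kept else words


print(strip_stop_words("the history of the web"))
print(strip_stop_words("The The"))
```

The second query shows the exception mentioned in the definition: when stop words are the only search terms, they are kept.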


5. How often is your site ‘spidered/crawled’ by Google and other popular search engines?
Doly - please add content here with link

===6. Search engines provide their services for free so how do they make money? Describe methods that search engines use to earn money (clue: how does Google earn money – does it support ‘paid inclusion’?)===


7. How does Google decide which pages to display and in what order? Outline how Google’s PageRank algorithm works.
The full details of Google's PageRank™ algorithm are a closely guarded secret; however, Google's [|website] does outline the basic method by which it ranks pages.

"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at considerably more than the sheer volume of votes, or links a page receives; for example, it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important." Using these and other factors, Google provides its views on pages' relative importance." ([|Google])

In addition, Google uses sophisticated text-matching technology to ensure that the pages displayed are relevant to your query. It looks not simply at how often certain text appears on the page, but also at other factors such as the rest of the page's content and the content of the pages linking to it.
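The "links as votes" idea in the quotation can be illustrated with the simplified PageRank formula that has been published openly, in which each page repeatedly shares its current score among the pages it links to. The sketch below is a classroom simplification, not Google's actual ranking system. The damping value of 0.85, the function name `pagerank` and the three-page example graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page receives a small base score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical mini-web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

In this example C ends up with the highest score because it receives votes from both A and B, while B ranks lowest because its only vote is half of A's.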

8. Different search engines give different results – what are the factors that affect the results?

9. What can you do to ensure that your site is listed on the top of a search? With Google? With Yahoo?


Some References:
 * Franklin, Curt. "How Internet Search Engines Work." 27 September 2000. HowStuffWorks.com. <http://computer.howstuffworks.com/search-engine.htm> 06 May 2008.
 * [|www.searchenginehistory.com/]

Ethical and Social Issues
What are the social and ethical issues that may arise with search engines?
What does Google do with all the information from searches? Can you track someone’s search history? Who owns this information?
Can you find some relevant news items that raise negative issues arising from the use of search engines?