THE DEEP WEB


CHAPTER ONE

1.0     INTRODUCTION

The volume of information on the Web is already vast and is increasing at a very fast rate, according to Deepweb.com [1]. The Deep Web is a vast repository of web pages, usually generated by database-driven websites, that are available to web users yet hidden from traditional search engines. The crawler (the computer program that searches the Internet for newly accessible information to be added to a search engine's index [2]) used by these search engines cannot reach most of the pages created on-the-fly by dynamic sites such as e-commerce, news and major content sites [1].

According to a study by Bright Planet [3], the deep web is estimated to be up to 550 times larger than the ‘surface web’ accessible through traditional search engines, and over 200,000 database-driven websites are affected (i.e. not fully accessible through traditional search engines). Sherman & Price [4] estimate that the deep web contains 3 to 4 times more quality pages than are accessible through search engines such as Google, About.com and Yahoo!. While the actual figures are debatable, they make it clear that the deep web is far bigger than the surface web and is growing at a much faster pace [1].

In a simplified description, the Web consists of two parts: the surface Web and the deep Web (also called the invisible Web or hidden Web). The deep Web came into public awareness only recently, with the publication of the landmark book by Sherman & Price [4], “The Invisible Web: Uncovering Information Sources Search Engines Can’t See”. Since then, many books, papers and websites have emerged to help further explore this vast landscape, and these also deserve the reader's attention.

1.1     Statement of Problem

Most people access Web content with surface search engines, yet 99% of Web content is not accessible through these engines.

A complete approach to conducting research on the Web incorporates both surface search engines and deep web databases. However, while most Internet users are skilled in at least elementary use of search engines, skill in accessing the deep web is limited to a much smaller population. It is desirable that most users of the Web be able to access most of the Web's content. This work therefore seeks to address how the Deep Web affects search engines, websites and searchers, and to proffer solutions.

1.2     Objective of the study

The broad objective of this study is to aid IT researchers in finding quality information in less time. The specific objectives of the project work are as follows:

  1. To describe the Deep Web and the Surface Web
  2. To compare the Deep Web and the Surface Web
  3. To develop a piece of software that implements a Deep Web search technique

1.3     Significance of the study

The study of the deep web is necessary because it brings into focus the problems encountered by search engines, websites and searchers. More importantly, the study will provide information on the results of searches made using both surface search engines and deep web search tools. Finally, it presents the deep web not as a substitute for surface search engines, but as a complement within a complete search approach that is highly relevant to academia and the general public.

1.4     Literature review

What is the Deep Web?

Wikipedia [5] defines the surface Web (also known as the visible Web or indexable Web) as that portion of the World Wide Web that is indexed by conventional search engines. Search engines construct a database of the Web by using programs called spiders or Web crawlers that begin with a list of known Web pages. For each page it knows of, the spider retrieves and indexes the page. Any hyperlinks to new pages are added to the list of pages to be crawled. Eventually all reachable pages are indexed, unless the spider runs out of time or disk space. The collection of reachable pages defines the surface Web.
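The crawl-and-index loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular search engine works: the seed URL is a placeholder, and a real crawler would also respect robots.txt, parse HTML robustly and distribute the work across many machines.

# Minimal sketch of the crawl-and-index loop described above (illustrative only).
import re
import urllib.request
from urllib.parse import urljoin
from collections import deque

def crawl(seed_url, max_pages=50):
    frontier = deque([seed_url])      # list of known pages still to visit
    index = {}                        # URL -> page text (stand-in for a real index)
    seen = {seed_url}

    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # unreachable pages are simply skipped

        index[url] = html             # "index" the retrieved page

        # Any hyperlink to a new page is added to the list of pages to be crawled.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

    return index

Pages generated on-the-fly by database queries never appear as static links in such a crawl, which is exactly why they end up in the deep Web.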

For various reasons (e.g., the Robots Exclusion Standard, links generated by JavaScript and Flash, password-protection) some pages cannot be reached by the spider. These ‘invisible’ pages are referred to as the Deep Web.

Bergman [6] defines the deep Web (also known as the Deepnet, invisible Web or hidden Web) as World Wide Web content that is not part of the surface Web indexed by search engines. Dr. Jill Ellsworth coined the term “Invisible Web” in 1994 to refer to websites that are not registered with any search engine.

Sherman and Price [4] define the deep Web as text pages, files, or other often high-quality, authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. It is sometimes referred to as the “invisible Web” or “dark matter”.

Origin of the Deep Web

In 1994, Dr. Jill H. Ellsworth, a university professor and Internet consultant for Fortune 500 companies, was the first to coin the term “Invisible Web” [6]. In a January 1996 article, Ellsworth stated: “It would be a site that’s possibly reasonably designed, but they didn’t bother to register it with any of the search engines. So, no one can find them! You’re hidden. I call that the invisible Web.”

The first commercial deep Web tool (although it was referred to as the “Invisible Web”) was AT1 (@1) from Personal Library Software (PLS), announced on December 12, 1996 in partnership with large content providers. According to the press release of that date, AT1 started with 5.7 terabytes of content, which was estimated to be 30 times the size of the nascent World Wide Web.

Another early use of the term “invisible web” was by Bruce Mount (Director of Product Development) and Dr. Matthew B. Koll (CEO/Founder) of Personal Library Software (PLS) when describing AT1 (@1) to the public. PLS was acquired by AOL in 1998 and AT1 (@1) was abandoned [7], [8].

AT1 was promoted as an invisible Web tool that allowed users to find content “below, behind and inside the Net”, enabling them to identify high-quality content amidst multiple terabytes of data on the AT1 Invisible Web, with top publishers joining as charter members.

1.4.1  The Internet and the Visible Web

The primary focus of this project work is the Web and, more specifically, the parts of the Web that search engines cannot see (known as the invisible Web). In order to fully understand the phenomenon called the Invisible Web, however, it is important to first understand the fundamental differences between the Internet and the Web.

Most people tend to use the words “Internet” and “Web” interchangeably, but they are not synonyms. The Internet is a network of networks held together by a common set of networking protocols (rules) that allow computers of all types to connect to and communicate with other computers on the Internet. The Internet’s origin traces back to a project sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) in 1969 as a means for researchers and defense contractors to share information. The World Wide Web (Web), on the other hand, is a software protocol that allows users to easily access files stored on computers connected to the Internet. The Web was created in 1990 by Tim Berners-Lee, a computer programmer working for the European Organization for Nuclear Research (CERN). Prior to the Web, accessing files on the Internet was a challenging task, requiring specialized knowledge and skills. The Web made it easy to retrieve a wide variety of files, including text, images, audio and video, by the simple mechanism of clicking a hypertext link. Hypertext is a system that allows computerized objects (text, images, sounds, etc.) to be linked together, while a hypertext link points to a specific object, or a specific place within a text; clicking the link opens the file associated with the object [4].

The Internet is a massive network of networks, a networking infrastructure. It connects millions of computers together globally, forming a network in which any computer can communicate with any other computer as long as they are both connected to the Internet. Information that travels over the Internet does so via a variety of languages known as protocols.

The World Wide Web, or simply the Web, is a way of accessing information over the medium of the Internet. The Web uses the Hypertext Transfer Protocol (HTTP), one of the languages spoken over the Internet, to transmit data. The Web also relies on browsers to access Web documents, called Web pages, that are linked to each other via hyperlinks. Web documents may contain graphics, sounds, text and video.

The Web is just one of the ways that information can be disseminated over the Internet. The Internet, not the Web, is also used for e-mail, which relies on the Simple Mail Transfer Protocol (SMTP, the standard Internet protocol for mail transfer and forwarding), as well as for Usenet newsgroups, instant messaging and the File Transfer Protocol (FTP, a network protocol used to transfer data from one computer to another over a network such as the Internet). So the Web is just a portion of the Internet, albeit a large portion; the two terms are not synonymous and should not be confused.
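To make the distinction concrete, the short sketch below speaks plain HTTP, the Web's own protocol, over the Internet using only Python's standard library; the host name is a placeholder, and the same Internet connection could just as well carry SMTP or FTP traffic instead.

# Minimal sketch: fetching a Web page over HTTP, one of several protocols
# that run over the Internet. The host name is a placeholder.
import http.client

conn = http.client.HTTPConnection("example.com", 80, timeout=10)
conn.request("GET", "/")                  # ask the server for its home page
response = conn.getresponse()
print(response.status, response.reason)   # e.g. 200 OK
body = response.read().decode("utf-8", errors="replace")
print(body[:200])                         # first few characters of the HTML document
conn.close()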

1.4.2  How the Internet came to be

Up until the mid-1960s, most computers were stand-alone machines that did not connect to or communicate with other computers. In 1962, J.C.R. Licklider, a professor at the Massachusetts Institute of Technology (MIT, a private, coeducational research university located in Cambridge, Massachusetts), wrote a paper envisioning a globally connected “Galactic Network” of computers [4]. The idea was far-out at the time, but it caught the attention of Larry Roberts, a project manager at the U.S. Defense Department’s Advanced Research Projects Agency (ARPA). In 1966, Roberts submitted a proposal to ARPA that would allow the agency’s numerous and different computers to be connected in a network similar to Licklider’s Galactic Network.

Roberts’ proposal was accepted, and work began on the “ARPANET”, which would in time become what we know today as the Internet. The first “node” on the ARPANET was installed at UCLA in 1969 and, gradually throughout the 1970s, universities and defense contractors working on ARPA projects began to connect to the ARPANET.

In 1973, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated another research program to allow networked computers to communicate transparently across multiple linked networks. Whereas the ARPANET was just one network, the new project was designed to be a “network of networks”. According to Vint Cerf, widely regarded as one of the “fathers” of the Internet, the Internetting project and the system of networks which emerged from the research were known as the “Internet” [9].

It wasn’t until the mid-1980s, with the simultaneous explosion in the use of personal computers and the widespread adoption of a universal standard of Internet communication called Transmission Control Protocol/Internet Protocol (TCP/IP), that the Internet became widely available to anyone desiring to connect to it. Other government agencies fostered the growth of the Internet by contributing communications “backbones” that were specifically designed to carry Internet traffic. By the late 1980s, the Internet had grown from its initial network of a few computers to a robust communications network supported by government and commercial enterprises around the world.

Despite this increased acceptance, the Internet was still primarily a tool for academics and government contractors well into the early 1990s. As more and more computers were connected to the Internet, users began to demand tools that would allow them to search for and locate text and other files on computers anywhere on the Net.

1.4.3  Early Net Search Tools

In this section, we trace the development of the early Internet search tools and show how their limitations ultimately spurred the popular acceptance of the Web. This historical background, while fascinating in its own right, lays the foundation for understanding why the Invisible Web could arise in the first place.

Although sophisticated search and information retrieval techniques date back to the late 1950s and early 1960s, these techniques were used primarily in closed or proprietary systems. Early Internet search and retrieval tools lacked even the most basic capabilities, primarily because it was thought that traditional information retrieval techniques would not work on an open, unstructured information universe like the Internet.

Accessing a file on the Internet was a two-part process. First, you needed to establish a direct connection to the remote computer where the file was located using a terminal emulation program called Telnet. Telnet runs on your computer and allows you to access a remote computer via a TCP/IP network and execute commands on that computer as if you were directly connected to it. Many libraries offered Telnet access to their catalogs. Then you needed to use another program, called a File Transfer Protocol (FTP) client, to fetch the file itself. FTP is a set of rules for sending and receiving files of all types between computers connected to the Internet. For many years, to access a file it was necessary to know both the address of the computer and the exact location and name of the file you were looking for; there were no search engines or other file-finding tools like the ones we are familiar with today.

Thus “search” often meant sending a request for help to an e-mail message list or discussion forum and hoping some kind soul would respond with the details you needed to fetch the file you were looking for. The situation improved somewhat with the introduction of “anonymous” FTP servers, which were centralized file-servers specifically intended for enabling the easy sharing of files. The servers were anonymous because they were not password protected, that is, anyone could simply log on and request any file on the system.
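The sketch below, using Python's standard ftplib, mirrors that manual process: you must already know the server's address, the directory and the exact file name, because nothing in the protocol searches for anything. The host, directory and file name are placeholders, not real resources.

# Minimal sketch of retrieving a file from an anonymous FTP server.
# Host, directory and file name are placeholders; in the early Internet you
# had to know all three in advance, since no search tool could find them for you.
from ftplib import FTP

with FTP("ftp.example.org", timeout=30) as ftp:
    ftp.login()                            # anonymous login: no password required
    ftp.cwd("/pub/docs")                   # you must already know the directory...
    with open("rfc959.txt", "wb") as out:  # ...and the exact file name
        ftp.retrbinary("RETR rfc959.txt", out.write)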

Files on FTP servers were organized in hierarchical directories, much like files are organized in hierarchical folders on personal computer systems today. The hierarchical structure made it easy for the FTP server to display a directory listing of all the files stored on the server, but you still needed good knowledge of the contents of the FTP server. If the file you were looking for didn’t exist on the FTP server you were logged into, you were out of luck.

The first true search tool for files stored on FTP servers was called Archie, created in 1990 by a small team of systems administrators and graduate students at McGill University in Montreal. Archie was the prototype of today’s search engines, but it was primitive and extremely limited compared to what we have today. Archie roamed the Internet searching for files available on anonymous FTP servers, downloading the directory listings of every anonymous FTP server it could find. These listings were stored in a central, searchable database called the Internet Archives Database at McGill University and were updated monthly.

Although it represented a major step forward, the Archie database was still extremely primitive, limiting searches to a specific file name or to computer programs that performed specific functions. Nonetheless, it proved extremely popular: according to Deutsch [10], who headed the McGill University Archie team, nearly 50% of Internet traffic to Montreal in the early ’90s was Archie-related.
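The idea behind Archie can be illustrated with a rough sketch: harvested directory listings are flattened into one searchable table, and a query is simply a match against file names. The servers and listings below are invented for illustration; the real Archie database was far larger and updated monthly.

# Rough, illustrative sketch of an Archie-style filename index.
# The servers and listings are invented examples.
listings = {
    "ftp.example.edu": ["/pub/gnu/emacs-18.59.tar.gz", "/pub/tex/latex.zip"],
    "ftp.sample.org":  ["/mirrors/gnu/gcc-2.0.tar.gz", "/docs/readme.txt"],
}

def build_index(listings):
    """Flatten per-server listings into (server, path, filename) records."""
    index = []
    for server, paths in listings.items():
        for path in paths:
            filename = path.rsplit("/", 1)[-1]
            index.append((server, path, filename))
    return index

def search(index, term):
    """Return every record whose file name contains the query term."""
    term = term.lower()
    return [(server, path) for server, path, name in index if term in name.lower()]

idx = build_index(listings)
for server, path in search(idx, "emacs"):
    print(f"{server}:{path}")   # -> ftp.example.edu:/pub/gnu/emacs-18.59.tar.gz

As in the original Archie, nothing here looks inside the files; only the names are searchable, which is why the tool felt so limited compared with later full-text search engines.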

“In the brief period following the release of Archie, there was an explosion of Internet-based research projects, including WWW, Gopher, WAIS, and others” [4]. Each explored a different area of the Internet information problem space, and each offered its own insights into how to build and deploy Internet-based services. The team licensed Archie to others, with the first shadow sites launched in Australia and Finland in 1992. The Archie network reached a peak of 63 installations around the world by 1995.

Gopher, an alternative to Archie, was created by Mark McCahill and his team at the University of Minnesota in 1991 and was named for the university’s mascot, the Golden Gopher. Gopher essentially combined the Telnet and FTP protocols, allowing users to click hyperlinked menus to access information on demand without resorting to additional commands. Using a series of menus that allowed the user to drill down through successively more specific categories, users could ultimately access the full text of documents, graphics, and even music files, though not integrated in a single format. Gopher made it easy to browse for information on the Internet.

According to Gopher creator McCahill, “Before Gopher there wasn’t an easy way of having the sort of big distributed system where there were seamless pointers between stuff on one machine and another machine. You had to know the name of this machine and if you wanted to go over here you had to know its name.

“Gopher takes care of all that stuff for you. So navigating around Gopher is easy. It points and clicks typically. So it’s something that anybody could use to find things. It’s also very easy to put information up so a lot of people started running servers themselves and it was the first of the easy-to-use, no fuss, you can just crawl around and look for information tools. It was the one that wasn’t written for techies” [4].

Gopher’s “no muss, no fuss” interface was an early precursor of what later evolved into popular Web directories like Yahoo!. “Typically you set this up so that you can start out with a sort of overview or general structure of a bunch of information, choose the items that you’re interested in to move into a more specialized area and then either look at items by browsing around and finding some documents or submitting searches” [4].

A problem with Gopher was that it was designed to provide a listing of files available on computers in a specific location – the University of Minnesota, for example. While individual Gopher servers were searchable, there was no centralized directory for searching all the other computers that were both using Gopher and connected to the Internet, or “Gopherspace” as it was called. In November 1992, Fred Barrie and Steven Foster of the University of Nevada System Computing Services group solved this problem with a program called Veronica, a centralized, Archie-like search tool for Gopher files. In 1993 another program, called Jughead, added keyword search and Boolean operator capabilities to Gopher search. A keyword is a word or phrase entered in a query form that a search system attempts to match in the text documents of its database. Boolean refers to a system of logical operators (AND, OR, NOT) that allows true/false operations to be performed on search queries, potentially narrowing or expanding the results when combined with keywords.
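To make the keyword and Boolean ideas concrete, the sketch below evaluates simple flat AND/OR/NOT queries against a handful of invented menu titles, in the spirit of what Veronica and Jughead offered for Gopherspace; the actual programs worked over harvested Gopher menus and differed in detail.

# Illustrative sketch of keyword search with Boolean operators (AND, OR, NOT).
# The documents are invented; real Gopher search tools indexed harvested menus.
documents = {
    1: "minnesota gopher server campus weather",
    2: "ftp archive of gnu software",
    3: "campus library catalog telnet access",
    4: "gopher menu of internet search tools",
}

def match(doc_text, query):
    """Evaluate a flat Boolean query such as 'gopher AND NOT weather' against one document."""
    words = set(doc_text.lower().split())
    result, op, negate = True, "AND", False
    for token in query.split():
        upper = token.upper()
        if upper in ("AND", "OR"):
            op = upper
        elif upper == "NOT":
            negate = True
        else:
            hit = (token.lower() in words) ^ negate
            result = (result and hit) if op == "AND" else (result or hit)
            negate = False
    return result

def search(documents, query):
    return [doc_id for doc_id, text in documents.items() if match(text, query)]

print(search(documents, "gopher AND NOT weather"))   # -> [4]
print(search(documents, "library OR ftp"))           # -> [2, 3]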
