CHAPTER ONE
1.0 INTRODUCTION
The volume of information on the web is
already vast and is increasing at a very fast rate, according to Deepweb.com [1].
The Deep Web is a vast repository of web pages, usually generated by
database-driven websites, that are available to web users yet hidden from
traditional search engines. The crawler, the computer program that searches the
Internet for newly accessible information to be added to the index examined by a
standard search engine [2], cannot reach most of the pages created on-the-fly by
dynamic sites such as e-commerce, news and major content sites [1].
According to a study by Bright Planet [3],
the deep web is estimated to be up to 550 times larger than the ‘surface web’
accessible through traditional search engines, and over 200,000 database-driven
websites are affected (i.e. not accessible through traditional search engines).
Sherman and Price [4] estimate the number of quality pages in the deep web
to be three to four times more than the pages accessible through search engines like
Google, About, Yahoo, etc. While the actual figures are debatable, it is clear
that the deep web is far bigger than the surface web, and is growing at a much
faster pace [1].
In a simplified description, the Web
consists of two parts: the surface Web and the deep Web
(the invisible Web or hidden Web). The deep Web came into public awareness only
recently, with the publication of the landmark book by Sherman and Price [4],
“The Invisible Web: Uncovering Information Sources Search Engines Can’t See”.
Since then, many books, papers and websites have emerged to help further
explore this vast landscape, and this work aims to bring it to wider notice as well.
1.1 Statement of the Problem
Most people access Web content with surface search
engines, yet an estimated 99% of Web content is not accessible through them.
A complete approach to conducting
research on the Web therefore incorporates both surface search engines and deep web
databases. However, while most Internet users are skilled in at least
elementary use of search engines, skill in accessing the deep web is
limited to a much smaller population. It is desirable
for most users of the Web to be able to access most of the Web's content. This work therefore
examines how the deep Web affects search engines, websites and
searchers, and proffers solutions.
1.2 Objective of the Study
The broad objective of this study is
to aid IT researchers in finding quality information in less time. The
specific objectives of the project work are as follows:
- To describe the Deep Web and the Surface Web
- To compare the deep web and the surface web
- To develop a piece of software that implements a Deep Web search technique
1.3 Significance of the Study
The study of the deep web is necessary
because it brings into focus the problems encountered by search engines, websites
and searchers. More importantly, the study will provide information on the
results of searches made using both surface search engines and deep web search
tools. Finally, it presents the deep web not as a substitute for surface
search engines, but as a complement within a complete search approach that is
highly relevant to academia and the general public.
1.4 Literature Review
What is the Deep Web?
Wikipedia [5] defines the surface Web (also known as the visible Web or indexable Web) as that portion of the World Wide Web that is indexed by conventional search engines. Search engines construct a database of the Web by using programs called spiders or Web crawlers that begin with a list of known Web pages. For each page the spider knows of, it retrieves the page and indexes it. Any hyperlinks to new pages are added to the list of pages to be crawled. Eventually all reachable pages are indexed, unless the spider runs out of time or disk space. The collection of reachable pages defines the surface Web.
For various reasons (e.g., the Robots Exclusion Standard, links generated by JavaScript and Flash, password-protection) some pages cannot be reached by the spider. These ‘invisible’ pages are referred to as the Deep Web.
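To make the crawl-and-index loop just described concrete, here is a minimal sketch in Python (an illustrative addition, not taken from any cited source). It starts from a placeholder seed URL, fetches and “indexes” each page it can reach, queues newly discovered hyperlinks, and honours the Robots Exclusion Standard; the seed, page limit and in-memory index are assumptions made only for the example.

```python
# Minimal crawl-and-index sketch (illustrative only; the seed URL and limits
# are placeholders, and the "index" is just an in-memory dictionary).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    """Fetch pages starting from seed, store their text, follow hyperlinks."""
    index = {}                         # url -> raw page text (a toy "index")
    frontier = deque([seed])           # known pages still to be visited
    seen = {seed}

    robots = RobotFileParser(urljoin(seed, "/robots.txt"))
    try:
        robots.read()                  # the Robots Exclusion Standard mentioned above
        robots_loaded = True
    except OSError:
        robots_loaded = False          # robots.txt unreachable; crawl anyway

    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        if robots_loaded and not robots.can_fetch("*", url):
            continue                   # politely skip disallowed pages
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                   # unreachable page; move on
        index[url] = html              # "index" the page by storing its text

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:   # newly discovered links join the frontier
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index


if __name__ == "__main__":
    pages = crawl("https://example.com/")   # example.com is only a placeholder seed
    print(len(pages), "page(s) reached from the seed")
```

Anything reachable only by filling in a search form or logging in is never added to the frontier, so it never appears in the index; this is exactly how content ends up in the deep Web.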
Bergman [6] defines the deep Web (also known as Deepnet, the invisible Web or the hidden Web) as World Wide Web content that is not part of the surface Web indexed by search engines. Dr. Jill Ellsworth coined the term “Invisible Web” in 1994 to refer to websites that are not registered with any search engine.
Sherman and Price [4] define the deep web as text pages, files, or other
often high-quality, authoritative information available via the World Wide Web
that general-purpose search engines cannot, due to technical limitations, or
will not, due to deliberate choice, add to their indices of Web pages.
It is sometimes referred to as the “invisible web” or “dark matter”.
Origin of the Deep Web
In 1994, Dr. Jill H. Ellsworth, a university professor who was also an
Internet consultant for Fortune 500 companies, was
the first to coin the term “Invisible Web” [6]. In a January 1996 article, Ellsworth states: “It would be a site that’s possibly reasonably designed, but they didn’t
bother to register it with any of the search engines. So, no one can find them!
You’re hidden. I call that the invisible Web”.
The first commercial deep Web tool (although it was referred to as the “Invisible Web”)
was AT1 (@1) from Personal Library Software (PLS), announced on December 12,
1996 in partnership with large content providers. According to the press release,
AT1 started with 5.7 terabytes of content, which was
estimated to be 30 times the size of the nascent World Wide Web.
Another early use of the term
“invisible web” was by Bruce Mount (Director of Product Development) and Dr.
Matthew B. Koll (CEO/Founder) of Personal Library Software (PLS) when
describing AT1 (@1) to the public. PLS was acquired by AOL in 1998 and AT1 (@1)
was abandoned [7], [8].
AT1 was promoted as an invisible Web tool that allowed users to find content “below,
behind and inside the Net”, so that users could identify high-quality
content amidst multiple terabytes of data on the AT1 Invisible Web, with top
publishers joining as charter members.
1.4.1 The Internet and the Visible Web
The primary focus of this project work
is on the Web and, more specifically, on the parts of the Web that search engines
cannot see (known as the invisible Web). In order to fully understand the
phenomenon called the Invisible Web, however, it is important to first understand the
fundamental differences between the Internet and the Web.
Most people tend to use the words “Internet”
and “Web” interchangeably, but they are not synonyms. The Internet
is a global network of computers built on a common networking protocol (set of rules)
that allows computers of all types to
connect to and communicate with other computers on the Internet. The Internet’s
origin can be traced back to a project sponsored by the U.S. Defense Advanced Research
Projects Agency (DARPA) in 1969 as a means for researchers and defense contractors to
share information. The World Wide Web (Web), on the other hand, is a
software protocol that allows users to easily access files stored on
computers connected to the Internet. The Web was created in 1990 by Tim Berners-Lee, a computer
programmer working for the European Organization for Nuclear Research (CERN).
Prior to the Web, accessing files on the Internet was a challenging task,
requiring specialized knowledge and skills. The Web made it easy to retrieve a
wide variety of files, including text, images, audio, and video, by the simple
mechanism of clicking a hypertext link. Hypertext is a system that allows
computerized objects (text, images, sounds, etc.) to be linked together, while
a hypertext link points to a specific object, or a specific place within a text;
clicking the link opens the file associated with the object [4].
The Internet is a massive network of networks, a networking infrastructure. It connects millions of computers together globally, forming a network in which any computer can communicate with any other computer as long as they are both connected to the Internet. Information that travels over the Internet does so via a variety of languages known as protocols.
The World Wide Web, or simply the Web, is a way of accessing information over the medium of the Internet. The Web uses the Hypertext Transfer Protocol (HTTP), one of the languages spoken over the Internet, to transmit data. The Web also utilizes browsers to access Web documents, called Web pages, that are linked to each other via hyperlinks. Web documents also contain graphics, sounds, text and video.
The Web is just one of the ways that information can be disseminated over the Internet. The Internet, not the Web, is also used for e-mail, which relies on the Simple Mail Transfer Protocol (SMTP, the standard protocol used on the Internet for mail transfer and forwarding), as well as for Usenet newsgroups, instant messaging and the File Transfer Protocol (FTP, a network protocol used to transfer data from one computer to another through a network such as the Internet). So the Web is just a portion of the Internet, albeit a large portion, but the two terms are not synonymous and should not be confused.
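As a small illustration of the point that HTTP is only one of the “languages” spoken over the Internet, the sketch below (an illustrative addition, with example.com standing in for any Web server) opens a raw TCP connection and speaks HTTP/1.1 by hand; e-mail or file transfer would use the same underlying network but a different protocol.

```python
# Hand-written HTTP request over a raw TCP socket (illustrative; example.com
# is a placeholder host).
import socket

HOST = "example.com"

# HTTP is plain text: a request line, some headers, then a blank line.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: " + HOST + "\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection((HOST, 80), timeout=10) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):      # read until the server closes the connection
        response += chunk

# The reply begins with a status line and headers, followed by the HTML body.
print(response.split(b"\r\n")[0].decode())   # e.g. "HTTP/1.1 200 OK"
```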
1.4.2 How the Internet came to be
Up until the mid-1960s, most computers
were stand-alone machines that did not connect to or communicate with other
computers. In 1962 J.C.R. Licklider, a professor at the Massachusetts Institute
of Technology (MIT), a private research
university in Cambridge, Massachusetts, wrote a paper envisioning a
globally connected “Galactic Network” of computers [4]. The idea seemed far-fetched at
the time, but it caught the attention of Larry Roberts, a project manager at
the U.S. Defense Department’s Advanced Research Projects Agency (ARPA). In 1966
Roberts submitted a proposal to ARPA that would allow the agency’s numerous and
different computers to be connected in a network similar to Licklider’s Galactic
Network.
Roberts’s proposal was accepted, and work
began on the “ARPANET”, which would in time become what we know as today’s
Internet. The first “node” on the ARPANET was installed at UCLA in 1969 and
gradually, throughout the 1970s, universities and defense contractors working
on ARPA projects began to connect to the ARPANET.
In 1973, the U.S. Defense Advanced
Research Projects Agency (DARPA) initiated another research program to allow
networked computers to communicate transparently across multiple linked
networks. Whereas the ARPANET was just one network, the new project was
designed to be a “network of networks”. According to Vint Cerf, widely regarded
as one of the “fathers” of the Internet, the Internetting project and the
system of networks which emerged from the research were known as the “Internet”
[9].
It wasn’t until the mid-1980s, with the
simultaneous explosion in the use of personal computers, and the widespread
adoption of a universal standard of Internet communication called Transmission
Control Protocol/Internet Protocol (TCP/IP), that the Internet became widely
available to anyone desiring to connect to it. Other government agencies
fostered the growth of the Internet by contributing communications “backbones”
that were specifically designed to carry Internet traffic. By the late 1980s,
the Internet had grown from its initial network of a few computers to a robust
communications network supported by government and commercial enterprises
around the world.
Despite this increased acceptability,
the Internet was still primarily a tool for academics and
government contractors well into the early 1990s. As more and more computers were
connected to the Internet, users began to demand tools that would allow them to
search for and locate text and other files on computers anywhere on the Net.
1.4.3 Early Net Search Tools
In this section, we trace the development of the early Internet
search tools and show how their limitations ultimately spurred the popular
acceptance of the Web. This historical background, while fascinating
in its own right, lays the foundation for understanding why the Invisible Web
could arise in the first place.
Although sophisticated search and
information retrieval techniques dated back to the late 1950s and early ‘60s,
these techniques were used primarily in closed or proprietary systems. Early
Internet search and retrieval tools lacked even the most basic capabilities,
primarily because it was thought that traditional information retrieval techniques
would not work on an open, unstructured information universe like the Internet.
Accessing a file on the Internet was a
two-part process. First, you needed to establish a direct connection to the
remote computer where the file was located using a terminal emulation program
called Telnet. Telnet is a terminal emulation program that runs on your
computer, allowing you to access a remote computer via a TCP/IP network and
execute commands on that computer as if you were directly connected to it. Many
libraries offered Telnet access to their catalogs. Then you needed to use
another program, called a File Transfer Protocol (FTP) client, to fetch the
file itself. File Transfer Protocol (FTP) is a set of rules for sending and
receiving files of all types between computers connected to the Internet. For
many years, to access a file it was necessary to know both the address of the
computer and the exact location and name of the file you were looking for;
there were no search engines or other file-finding tools like the ones we
are familiar with today.
Thus “search” often meant sending a
request for help to an e-mail message list or discussion forum and hoping some
kind soul would respond with the details you needed to fetch the file you were
looking for. The situation improved somewhat with the introduction of
“anonymous” FTP servers, which were centralized file-servers specifically
intended for enabling the easy sharing of files. The servers were anonymous
because they were not password protected, that is, anyone could simply log on
and request any file on the system.
Files on FTP servers were organized in
hierarchical directories, much like files are organized in hierarchical folders
on personal computer systems today. The hierarchical structure made it easy for
the FTP server to display a directory listing of all the files stored on the
server, but you still needed good knowledge of the contents of the FTP server.
If the file you were looking for didn’t exist on the FTP server you were logged
into, you were out of luck.
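The workflow just described, knowing the server and the exact location of a file before you can retrieve anything, can be sketched with Python's standard ftplib; the host name and directory below are hypothetical placeholders rather than real servers.

```python
# Anonymous FTP sketch (illustrative; ftp.example.org and /pub are placeholders).
from ftplib import FTP, error_perm

HOST = "ftp.example.org"     # hypothetical anonymous FTP server
DIRECTORY = "/pub"           # hypothetical directory within its hierarchy

try:
    with FTP(HOST, timeout=30) as ftp:
        ftp.login()                  # blank user and password = anonymous login
        ftp.cwd(DIRECTORY)           # drill into the hierarchical directory tree
        for name in ftp.nlst():      # directory listing of this one server
            print(name)
        # Fetching a file still requires knowing its exact name in advance, e.g.:
        # with open("README", "wb") as fh:
        #     ftp.retrbinary("RETR README", fh.write)
except (OSError, error_perm) as exc:
    print("FTP access failed:", exc)
```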
The first true search tool for files
stored on FTP servers was called Archie, created in 1990 by a small team of
systems administrators and graduate students at McGill University in Montreal. Archie
was the prototype of today’s search engines, but it was primitive and extremely
limited compared to what we have today. Archie roamed the Internet searching
for files available on anonymous FTP servers, downloading directory listings
of every anonymous FTP server it could find. These listings were stored in a
central, searchable database called the Internet Archives Database at McGill
University, and were updated monthly.
Although it represented a major step
forward, the Archie database was still extremely primitive, limiting searches
to a specific file name or to computer programs that performed specific
functions. Nonetheless, it proved extremely popular: nearly 50% of
Internet traffic to Montreal in the early ’90s was Archie-related, according to
Deutsch [10], who headed the McGill University Archie team.
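Assuming nothing beyond the description above, an Archie-like index can be pictured as follows: harvested directory listings (hard-coded here as stand-ins for what Archie collected monthly) are flattened into a structure that can only be searched by file name, which is precisely the limitation just noted.

```python
# Toy Archie-style filename index (illustrative; the listings are invented).
from collections import defaultdict

# server -> list of (directory, filename) pairs, as harvested listings might look
LISTINGS = {
    "ftp.example.edu": [("/pub/gnu", "emacs-18.59.tar.Z"), ("/pub/doc", "rfc959.txt")],
    "ftp.example.org": [("/mirror/gnu", "emacs-18.59.tar.Z"), ("/incoming", "notes.txt")],
}


def build_index(listings):
    """Map each filename to every (server, directory) where it appears."""
    index = defaultdict(list)
    for server, entries in listings.items():
        for directory, filename in entries:
            index[filename].append((server, directory))
    return index


def search(index, query):
    """Archie-style lookup: the query is matched against file names only."""
    return {name: places for name, places in index.items() if query in name}


index = build_index(LISTINGS)
for name, places in search(index, "emacs").items():
    for server, directory in places:
        print(f"{name}  ->  {server}:{directory}")
```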
“In the brief period following the
release of Archie, there was an explosion of Internet-based research projects,
including WWW, Gopher, WAIS, and others” [4]. Each explored a different area of
the Internet information problem space, and each offered its own insights into
how to build and deploy Internet-based services. The team licensed Archie to
others, with the first shadow sites launched in Australia and Finland in 1992.
The Archie network reached a peak of 63 installations around the world by 1995.
Gopher, an alternative to Archie, was
created by Mark McCahill and his team at the University of Minnesota in 1991
and was named for the university’s mascot, the Golden Gopher. Gopher
essentially combined the Telnet and FTP protocols, allowing users to click
hyperlinked menus to access information on demand without resorting to
additional commands. Using a series of menus that allowed the user to drill
down through successively more specific categories, users could ultimately
access the full text of documents, graphics, and even music files, though not
integrated in a single format. Gopher made it easy to browse for information on
the Internet.
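The drill-down style of browsing that made Gopher easy to use can be pictured as a nested menu, as in the short sketch below; the menu entries and addresses are invented purely for illustration.

```python
# Gopher-style nested menu (illustrative; entries and addresses are invented).
MENU = {
    "Libraries": {
        "Catalog": "telnet://catalog.example.edu",
        "Electronic journals": {
            "Physics": "gopher://example.edu/physics.txt",
            "History": "gopher://example.edu/history.txt",
        },
    },
    "Weather": "gopher://example.edu/weather.txt",
}


def browse(menu, choices):
    """Follow a list of menu selections until a document address is reached."""
    node = menu
    for choice in choices:
        node = node[choice]      # each selection descends one level
    return node                  # either a more specific sub-menu or a document


print(browse(MENU, ["Libraries", "Electronic journals", "Physics"]))
```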
According to Gopher creator McCahill,
“Before Gopher there wasn’t an easy way of having the sort of big distributed
system where there were seamless pointers between stuff on one machine and
another machine. You had to know the name of this machine and if you wanted to
go over here you had to know its name”.
“Gopher takes care of
all that stuff for you. So navigating around Gopher is easy. It points and
clicks typically. So it’s something that anybody could use to find things. It’s
also very easy to put information up so a lot of people started running servers
themselves and it was the first of the easy-to-use, no fuss, you can just crawl
around and look for information tools. It was the one that wasn’t written for
techies” [4].
Gopher’s “no muss, no fuss” interface
was an early precursor of what later evolved into popular Web directories like
Yahoo!. “Typically you set this up so that you can start out with a sort of
overview or general structure of a bunch of information, choose the items that
you’re interested in to move into a more specialized area and then either look
at items by browsing around and finding some documents or submitting searches”
[4].
A problem with Gopher was that it was designed to provide a listing of files available on computers in a specific location – the University of Minnesota, for example. While individual Gopher servers were searchable, there was no centralized directory for searching all the other computers that were both using Gopher and connected to the Internet, or “Gopherspace” as it was called. In November 1992, Fred Barrie and Steven Foster of the University of Nevada System Computing Services group solved this problem by creating Veronica, a centralized Archie-like search tool for Gopher files. In 1993 another program, Jughead, added keyword search and Boolean operator capabilities to Gopher search. A keyword is a word or phrase entered in a query form that a search system attempts to match in the text documents in its database. Boolean logic is a system of logical operators (AND, OR, NOT) that allows true-false operations to be performed on search queries, potentially narrowing or expanding results when used with keywords.
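To illustrate what Veronica and Jughead added on top of Gopher, the sketch below (with invented menu titles and a deliberately tiny query grammar of a single AND, OR or NOT) performs keyword matching over Gopher-style menu titles.

```python
# Keyword search with one Boolean operator over invented Gopher menu titles.
MENU_TITLES = [
    "Campus weather reports",
    "Minnesota Golden Gophers schedules",
    "Library catalog search",
    "Weather satellite images",
]


def matches(title, query):
    """Evaluate a query of the form 'a', 'a AND b', 'a OR b' or 'a NOT b'."""
    words = set(title.lower().split())
    terms = query.lower().split()
    if len(terms) == 3 and terms[1] in ("and", "or", "not"):
        left, op, right = terms
        if op == "and":
            return left in words and right in words
        if op == "or":
            return left in words or right in words
        return left in words and right not in words   # NOT
    return all(term in words for term in terms)       # plain keyword search


for q in ["weather", "weather AND satellite", "weather NOT satellite"]:
    print(q, "->", [t for t in MENU_TITLES if matches(t, q)])
```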