[RC5] d.net project: indexing the web
AZilinskas at SolutionsIQ.com
Fri Jul 9 14:31:01 EDT 1999
>Prompted by the "Distributed Indexing" thread on /.:
Funny thing, I was thinking about posting something about a distributed
web crawler. Here is a dump of my ideas/thoughts:
* pro: A single web crawler system is limited by its own site's
internet connection. A company's T1 or equivalent line would get a
traffic jam as it sends out thousands of requests probing pages.
The users of the index would have to compete with the probes to
access the index's database.
A distributed system spread all over the world would use many
different internet connections, each hopefully at a low level but
all at the same time, to get high aggregate search bandwidth.
* pro: Certain web topologies can form a "spider trap". The probes
could keep requesting more and more pages from a server that creates
them on the fly. The index system could spend all its time in that
one web site and never collect anything useful. If you lose one
of your web-searching "ants" to a spider trap, oh well, there are
thousands of other "ants" still searching.
* con: Certain company policies frown upon accessing web sites declared "evil":
"But but, Boss, it wasn't my doing that my machine just
surfed all over the Playboy web site. It was just doing
a really good network index."
* con: (maybe) Akin to a spider trap, there may be malicious web sites
that will attack or attempt to infect your machine. Again, your boss
may frown upon awakening a hacker who then starts probing your site
to break in (so much for security through obscurity).
* con: (really more of a programming issue) With many web "ants"
searching at the same time, there will be points where two or more
ants find the same link and duplicate their search efforts. The
collector of the index will spend a lot of time removing duplicates.
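One way to cut the duplicate problem down is for the broker to normalize URLs and only hand out ones it hasn't seen before. A minimal sketch (all the names here are invented for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Lowercase scheme/host, drop fragments, strip trailing slash on paths.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

seen = set()

def is_new(url):
    """Return True only the first time a (normalized) URL is offered."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

This doesn't catch mirrors or pages reachable under different hostnames, but it kills the cheap duplicates before an ant is ever dispatched.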
* con: Right now, d.net tasks use a minimum of client resources. A
web-searching ant could tie up a company's internet connection and
download very large items from the outside.
Now, what might be needed for many web-searching "ants":
* a central broker that the ants get starting-point information from.
When an ant has visited and summarized a page, the summary is passed
back to the central broker to be included in the index.
* an ant's job will be to visit one or more pages and generate
keyword summaries of each page. A starting point is "checked out"
of the broker; the summary is checked back in.
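The check-out/check-in cycle could look something like this toy in-memory broker (a sketch only; the class and method names are made up):

```python
import time

class Broker:
    """Toy in-memory broker: hands out starting points, collects summaries."""
    def __init__(self, lease_seconds=3600):
        self.pending = []        # URLs not yet handed to any ant
        self.checked_out = {}    # url -> check-out timestamp
        self.index = {}          # url -> keyword summary
        self.lease_seconds = lease_seconds

    def check_out(self):
        """Give an ant a starting point, or None if nothing is pending."""
        if not self.pending:
            return None
        url = self.pending.pop()
        self.checked_out[url] = time.time()
        return url

    def check_in(self, url, summary, links):
        """Accept a summary; queue any links we haven't seen before."""
        self.checked_out.pop(url, None)
        self.index[url] = summary
        for link in links:
            if (link not in self.index and link not in self.checked_out
                    and link not in self.pending):
                self.pending.append(link)
```

The links an ant reports feed straight back into the pending list, which is how the index grows outward from a handful of seed pages.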
* a summary probably consists of a list of searchable keywords that
appeared on the web page. A cool thing, but probably hard to
implement, would be to also summarize images on that page
(for example, the page www.starwars.com contains a picture
of a spaceship). Links out of the page would also be listed
so that the central broker can use them as other starting points
for web searches.
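A crude version of the keywords-plus-links summary an ant might produce, assuming plain HTML and leaving the image problem aside (the length cutoff is an arbitrary stand-in for real stop-word filtering):

```python
from html.parser import HTMLParser

class PageSummarizer(HTMLParser):
    """Ant-side summarizer: collect keywords and outbound links."""
    def __init__(self):
        super().__init__()
        self.words = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every anchor tag as an outbound link.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep longer words as candidate keywords; skip short tokens.
        for word in data.split():
            word = word.strip(".,!?;:").lower()
            if len(word) > 3:
                self.words.add(word)

def summarize(html):
    """Return (sorted keyword list, outbound link list) for one page."""
    p = PageSummarizer()
    p.feed(html)
    return sorted(p.words), p.links
```

Both lists would be shipped back to the broker on check-in: the keywords go into the index, the links become new starting points.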
* A really big task for the central broker will be to store
all the summaries in some searchable database. Duplicates are
removed. Unsearched links are maintained as a list of starting points.
* Mildly tricky algorithms are needed. An ant may crash after checking
out a starting-point address. After some reasonable time, the check-out
will expire and the page becomes available for another ant to check out.
Somehow, "spider traps", "black holes" and other web phenomena that
cause any ant reaching that starting point to die will have to be
noted (the last 20 ants going to www.crashme.com never came back,
so mark that spot as a "bad place").
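The expiry and "bad place" bookkeeping might be sketched like this (the lease length and the 20-ant threshold are invented numbers):

```python
import time

LEASE_SECONDS = 3600      # how long an ant may hold a starting point
BAD_PLACE_THRESHOLD = 20  # consecutive lost ants before a URL is written off

checked_out = {}    # url -> check-out timestamp
lost_counts = {}    # url -> consecutive expired check-outs
bad_places = set()  # urls no ant should be sent to again

def expire_leases(now=None):
    """Return expired URLs to the pool; flag repeat offenders as bad places."""
    now = time.time() if now is None else now
    reclaimed = []
    for url, stamp in list(checked_out.items()):
        if now - stamp > LEASE_SECONDS:
            del checked_out[url]
            lost_counts[url] = lost_counts.get(url, 0) + 1
            if lost_counts[url] >= BAD_PLACE_THRESHOLD:
                bad_places.add(url)    # the last N ants never came back
            else:
                reclaimed.append(url)  # available for another ant
    return reclaimed
```

A successful check-in would reset the lost count for that URL, so only pages that consistently kill ants get marked bad.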
* Thinking about crashing ants: the ant software must protect the
host machine as much as possible. Nobody will tolerate this d.net
task if the machine running it crashes every couple of hours.
This gets messy when it comes to safely interpreting DHTML, ActiveX
and scripts in web pages.
* Of course, ants will be limited to machines with fast and probably
permanent net connections.
It looks like it would be a very big task. The hardest part is probably
the storage scheme required to hold all the searchable indexes.
Another issue is self-maintenance. It's one thing to snapshot
the internet, another to keep rechecking all the sites already indexed
for changes. Some sites/pages may go down temporarily, others go down
for good. Some cleverness will be needed to keep the information in good
shape and to keep up with the growth and change of the web.
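The recheck side of that maintenance might reduce to a priority queue of "next visit" times, with failing pages backed off and eventually dropped. A sketch, with all intervals and thresholds invented:

```python
import heapq

RECHECK_INTERVAL = 7 * 24 * 3600   # revisit a live page weekly (arbitrary)
MAX_FAILURES = 5                   # drop a page after this many dead visits

schedule = []   # min-heap of (next_visit_time, url)
failures = {}   # url -> consecutive failed fetches

def plan_visit(url, when):
    heapq.heappush(schedule, (when, url))

def record_result(url, now, reachable):
    """Reschedule a page after a visit; retire it if it keeps failing."""
    if reachable:
        failures[url] = 0
        plan_visit(url, now + RECHECK_INTERVAL)
    else:
        failures[url] = failures.get(url, 0) + 1
        if failures[url] < MAX_FAILURES:
            # Back off: wait longer after each consecutive failure.
            plan_visit(url, now + RECHECK_INTERVAL * failures[url])
        # else: page is presumed gone for good and simply not rescheduled

def next_due(now):
    """Pop the next page whose recheck time has arrived, if any."""
    if schedule and schedule[0][0] <= now:
        return heapq.heappop(schedule)[1]
    return None
```

Pages that come back up before hitting the failure limit fall back into the normal weekly cycle; pages that don't quietly drop out of the index.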
azilinskas at solutionsiq.com