[RC5] d.net project: indexing the web

Adam Zilinskas AZilinskas at SolutionsIQ.com
Fri Jul 9 14:31:01 EDT 1999


>From: Matt.Wilkie 
>Prompted by the "Distributed Indexing" thread on /.:
>(http://slashdot.org/article.pl?sid=99/07/08/136255&mode=thread)

Funny thing, I was thinking about posting something about a distributed
 web crawler. Here is a dump of my ideas/thoughts:

*  pro: A single web crawler system is limited by that
 site's internet connection. A company's T1 or equivalent line 
 would get a traffic jam as it sends out thousands of requests 
 probing pages. The users of the index would have to compete
 with the probes for access to the index's database.
 A distributed system spread all over the world would use many
  different internet connections, each at a hopefully low level but
  all at the same time, to get a high aggregate search bandwidth.

* pro: Certain web topologies can form a "spider trap". The probes
 could keep requesting more and more pages from a server that creates 
 them on the fly. The indexing system could end up spending all its time 
 on that web site and not collect anything useful. If you lose one
 of your web-searching "ants" to a spider trap, oh well, there are
 thousands of other "ants" still searching.

* con: Certain company policies frown upon accessing declared-evil web
sites.
  "But but, Boss, it wasn't my doing that my machine just 
   surfed all over the Playboy web site. It was just doing 
   a really good network index."

* con: (maybe) Akin to a spider trap, there may be malicious web sites
  that will attack or attempt to infect your machine. Again, your boss
  may frown upon awakening a hacker who then starts probing your site
  to break in (so much for the security-through-obscurity scheme).

* con: (really more of a programming issue) With many web "ants"
  searching at the same time, there will be points where two or more
 ants find the same link and duplicate their search efforts. The 
 collector of the index will spend a lot of time removing duplicates
 (a small URL de-duplication sketch follows this list).

* con: Right now, d.net tasks use minimal client resources. A 
  web-searching ant could tie up a company's internet connection and 
  download very large items from the outside.
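
To make the duplicate problem concrete, here is a tiny sketch of the
kind of URL normalization the index collector (or broker) could do so
that trivially different spellings of the same address compare equal.
This is only an illustration in Python; the canonicalize() and
is_new() names are made up here, not part of any existing d.net code.

    # A minimal sketch, assuming the collector keeps one in-memory
    # set of every URL it has already accepted.
    from urllib.parse import urlsplit, urlunsplit

    seen = set()   # URLs already handed out or indexed

    def canonicalize(url):
        """Normalize a URL so trivial variants compare equal."""
        scheme, host, path, query, _fragment = urlsplit(url)
        host = host.lower()
        if host.endswith(':80'):       # drop the default HTTP port
            host = host[:-3]
        if not path:
            path = '/'
        return urlunsplit((scheme.lower(), host, path, query, ''))

    def is_new(url):
        """True the first time a page is reported, False afterwards."""
        key = canonicalize(url)
        if key in seen:
            return False
        seen.add(key)
        return True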

Now, here is what might be needed for many web-searching "ants":

* a central broker that the ants get starting-point information from. 
  When an ant has visited and summarized a page, the summary is passed
  back to the central broker to be included in the index (a rough
  broker sketch follows this list).

* an ant's job will be to visit one or more pages and generate 
  keyword summaries of them. A starting point is "checked out"
  from the broker; the summary is checked back in. 

* a summary probably consists of a list of searchable keywords that
  appear on the web page. A cool thing, but probably hard to 
  implement, would be to also summarize images on that page
  (for example, the page www.starwars.com contains a picture
   of a spaceship). Links out of the page would also be listed 
   so that the central broker can use them as other starting points
   for web searches (a toy summarizing sketch also follows this list).

* A really big task for the central broker will be to store
  all the summaries in some searchable database. Duplicates are
  removed. Unsearched links are maintained as a list of starting
  points.

* Mildly tricky algorithms are needed. An ant may crash after checking
  out a starting-point address. After some reasonable time, the check-out
  will expire and the page becomes available for another ant to check out.
  Somehow, "spider traps", "black holes" and other web phenomena that
  cause any ant reaching that starting point to die will have to be 
  noted (the last 20 ants going to www.crashme.com never came back,
  so mark that spot as a "bad place").

* Thinking more about crashing ants: the ant software must protect the 
  host machine as much as possible. Nobody will tolerate this d.net 
  task if the machine running it keeps crashing every couple of hours.
  This gets messy when it comes to safely interpreting DHTML, ActiveX 
  and scripts in web pages.

* Of course, ants will be limited to machines with fast, probably 
  permanent, net connections. 
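
Below is a rough sketch, in Python, of how the broker's check-out /
check-in bookkeeping could work, including the check-out expiry and
the "bad place" counting from the items above. Everything in it is my
own assumption (in-memory tables, a 30-minute expiry, 20 lost ants
before a host is written off); a real broker would of course sit on
top of a proper database.

    import time
    from urllib.parse import urlsplit

    CHECKOUT_TTL = 30 * 60   # assumed: reclaim a page after 30 minutes
    MAX_FAILURES = 20        # assumed: 20 lost ants marks a "bad place"

    class Broker:
        def __init__(self, seeds):
            self.frontier = list(seeds)  # unsearched starting points
            self.checked_out = {}        # url -> time it was handed out
            self.seen = set(seeds)       # every URL ever handed out
            self.failures = {}           # host -> ants that never returned
            self.index = {}              # url -> keyword summary

        def checkout(self):
            """Hand a starting point to an ant, reclaiming expired ones."""
            now = time.time()
            for url, t in list(self.checked_out.items()):
                if now - t > CHECKOUT_TTL:        # the ant probably died
                    host = urlsplit(url).netloc
                    self.failures[host] = self.failures.get(host, 0) + 1
                    del self.checked_out[url]
                    if self.failures[host] < MAX_FAILURES:
                        self.frontier.append(url)  # let another ant try
            while self.frontier:
                url = self.frontier.pop(0)
                if self.failures.get(urlsplit(url).netloc, 0) >= MAX_FAILURES:
                    continue                       # known "bad place"
                self.checked_out[url] = now
                return url
            return None

        def checkin(self, url, keywords, links):
            """Accept a summary and queue any unseen outbound links."""
            self.checked_out.pop(url, None)
            self.index[url] = keywords
            for link in links:
                if link not in self.seen:
                    self.seen.add(link)
                    self.frontier.append(link)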
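
And a matching sketch of what an ant might do with each checked-out
page: pull it down, extract a crude keyword list plus the outbound
links, and hand the result back. The tag stripping and the "any word
of four or more letters is a keyword" rule are placeholders of my own,
not a real summarizing algorithm, and image summarizing is left out
entirely.

    import re
    from urllib.request import urlopen
    from urllib.parse import urljoin

    def summarize(url):
        """Very crude page summary: keywords plus outbound links."""
        html = urlopen(url).read().decode('latin-1', 'replace')
        links = [urljoin(url, href)
                 for href in re.findall(r'href="([^"]+)"', html, re.I)]
        text = re.sub(r'<[^>]+>', ' ', html)       # strip tags
        words = re.findall(r'[A-Za-z]{4,}', text)  # toy keyword rule
        keywords = sorted(set(w.lower() for w in words))
        return keywords, links

    # An ant's main loop against the Broker sketch above (hypothetical):
    #   url = broker.checkout()
    #   keywords, links = summarize(url)
    #   broker.checkin(url, keywords, links)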

It looks like it would be a very big task. The hardest part probably
 is the storage scheme required to hold all the searchable indexes.
Another issue is self-maintenance. It's one thing to snapshot 
 the internet, another to keep rechecking all the sites already indexed
 for changes. Some sites/pages may go down temporarily, others go down
 for good. Some cleverness will be needed to keep the information in good 
 shape and to keep up with the growth and changes of the web (a small
 re-check sketch follows).
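
As a rough illustration of the re-checking idea: if the broker also
remembers when each page was last summarized, stale pages can simply
be fed back into the list of starting points. The one-week interval
and the shape of the bookkeeping are assumptions of mine, nothing
d.net actually does.

    import time

    RECHECK_AFTER = 7 * 24 * 3600  # assumed: revisit pages after a week

    def requeue_stale(index_times, frontier):
        """Push pages whose summaries have gone stale onto the frontier.

        index_times maps url -> time the page was last summarized.
        """
        now = time.time()
        for url, last_checked in index_times.items():
            if now - last_checked > RECHECK_AFTER:
                frontier.append(url)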
 

                  Adam Zilinskas
                  Solutions IQ
                  azilinskas at solutionsiq.com

