alterego
Search 11 billion web pages on archive.org

This new search engine is one to keep in our crosshairs. I have done some research and exchanged e-mails with anna@cs.stanford.edu, who, you will be interested to know, now works at Google and continues to work in her free time on recall.archive.org, a new text search of archive.org's 11 billion web pages. Here is an example search for the term 'Gmail': Term 'Gmail' on recall.archive.org

I want to point out that I manually changed the '&login=' parameter in the URL string to 'true'. Initially it is blank, and I'm not sure whether this does much, but the page does indicate afterwards that you are logged in. Restricting the search to 1996-1997 turns up mentions of Gmail as early as January 1997, which is very near the beginning of the archive's records. This, however, is merely informative next to a Google Groups search for 'Gmail', which yields results as early as July 1986. We all know that Google employs nothing but the best, from brain surgeons to trademark attorneys, so this is all for the sake of research.
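For anyone who would rather script the URL edit described above than retype it by hand, here is a minimal sketch using Python's standard library. The host and path in `example` are placeholders I made up, since the real search URL is not reproduced here, and `set_query_param` is just an illustrative helper name; only the 'login' parameter comes from the search itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url: str, key: str, value: str) -> str:
    """Return url with the query parameter `key` set to `value`."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as 'login='
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    updated = [(k, value if k == key else v) for k, v in pairs]
    if not any(k == key for k, _ in pairs):
        updated.append((key, value))  # add the parameter if it was absent
    return urlunsplit(parts._replace(query=urlencode(updated)))


# Hypothetical URL: the host and path are placeholders, not the real ones.
example = "http://search.example/results?q=Gmail&login="
edited = set_query_param(example, "login", "true")
print(edited)
```

Whether the server does anything meaningful with the changed value is a separate question; the sketch only rewrites the string.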

In the FAQ Anna mentions that she uses "the content measure of the page to determine its worth rather than the now standard popularity measure." This proves to be an inherent weakness, and the results displayed for the search term "Microsoft" highlight this, as the first result is a stuffed spam page for a pornography site. If you follow Google, this should ring some bells. This style of black hat marketing is the bane of search existence.
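To see why a purely content-based measure invites keyword stuffing, consider a toy scorer that ranks pages by how often the query term appears relative to page length. This is only an illustration under my own assumptions: the FAQ quote does not say how the engine actually measures content, and the two sample pages and the function name are invented for the example.

```python
from collections import Counter


def term_frequency_score(text: str, term: str) -> float:
    """Fraction of the words in text that equal term (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)


# Invented sample pages: one ordinary sentence, one keyword-stuffed spam page.
pages = {
    "genuine": "Microsoft announced a new version of its operating system today",
    "stuffed": "microsoft microsoft microsoft cheap pills microsoft microsoft",
}

# Rank by content score alone, with no popularity or link signal.
ranked = sorted(
    pages,
    key=lambda name: term_frequency_score(pages[name], "microsoft"),
    reverse=True,
)
print(ranked)
```

With nothing but page content to go on, the stuffed page outranks the genuine one, which is the kind of result the "Microsoft" search turned up. A popularity measure resists this because repeating a word on your own page costs nothing, while earning links from other sites does.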

So far, so disappointing, yet the archive does show promise. Firstly, a Google employee is working on it (see here), and secondly, it is still an emerging technology. As the largest web database in the world, on a limited budget to boot, they are no doubt faced with problems Google has yet to encounter. In order to save the resources needed for either loading their index into RAM or doing a realtime search of all 11 billion web pages, they have resorted to a search that is limited in scope for popular queries but gains accuracy as the terms become more arcane. A "whiz pop bang boom!" search for the term "Onomatopoeia" yields much more relevant results than the search for Microsoft above, with equally exciting graphs.

Archive.org's new search engine may not trump existing players, but with its current capability and enormous potential, I'd say it definitely earns a place in my hall of bookmarks.

