Original article text at www.ranish.com
The Internet and its Search Engines have become an important part of our lives. Every day we rely on them more and more to find and deliver relevant information. With the current amount of raw data on the web, both tasks have become non-trivial, and it is clear that serious research must be done to meet rising expectations of speed and quality in an exponentially growing environment. How do we separate relevant knowledge from junk? Is it even possible, short of creating artificial intelligence? These are the questions I am going to address.
People who develop search engines tend to agree that the best theory that applies to the web is one from the social sciences that says: "members of the community tend to be easily identified as highly connected nodes in the social network graph of their community." Therefore, you can understand the web and judge its content intelligently just by analyzing the links between web pages.
That would be true if all the links on the web were made by honest people with good intentions and in their right minds. The truth is that half of the links are created by drunk teenagers pointing from one junk blog to another, and the second half is generated by the scripts of Search Engine Optimizers (SEOs) hired by greedy corporations to increase their rankings. While the third half of the links is simply broken.
If you want to apply the laws of any science to the web, it has to be physics. The thermodynamics of hot gases and plasma inside an exploding star would perfectly suit the purpose.
I must say that, intuitively, the creators of the best search technology available today were heading in the right direction. Just look at the formula for the PageRank algorithm:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
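For readers who prefer code to formulas, here is a minimal sketch of the iteration that formula describes, using the customary damping factor d = 0.85. The three-page link graph is invented purely for illustration.

```python
# Toy PageRank: PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)),
# where T1..Tn link to A and C(T) is T's number of outgoing links.
# The pages and links below are made up for this example.
links = {
    "A": ["B", "C"],   # A links to B and C
    "B": ["C"],
    "C": ["A"],
}
d = 0.85
pr = {page: 1.0 for page in links}            # initial ranks

for _ in range(50):                           # iterate to convergence
    new = {}
    for page in links:
        # sum PR(T)/C(T) over every page T that links to `page`
        inbound = sum(pr[t] / len(out)
                      for t, out in links.items() if page in out)
        new[page] = (1 - d) + d * inbound
    pr = new

print({p: round(r, 3) for p, r in pr.items()})
```

Note how C, the page with the most inbound links, ends up with the highest rank; no gas molecules were harmed in the computation.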
It looks suspiciously similar to Dalton's Law of Partial Pressures:
P total = ( RT(n1)/V(1) + RT(n2)/V(2) + ... + RT(nn)/V(n) )
Note that the PageRank formula is a special case of Dalton's Law in which the temperature stays constant. Treating temperature as a variable would immediately improve the PageRank algorithm. For example, we could say that a lone link that somebody placed on his or her page by hand has a completely different energy (temperature) level from the thousands of links generated by a spambot. The search engine has to sense these different energy levels and treat the links accordingly.
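One way to read this "temperature" proposal as code: attach an energy weight to each inbound link and fold it into the PageRank sum. The weights and numbers below are entirely hypothetical, chosen only to show a hand-placed link outweighing a spambot link.

```python
# Hypothetical energy-weighted PageRank contribution for one page.
# Each inbound link carries (source_rank, outdegree, energy), where
# energy is an invented coefficient: 1.0 for a lone hand-placed link,
# near zero for one of thousands emitted by a spambot.
d = 0.85
inbound = [
    (1.2, 3, 1.0),     # hand-placed link: full energy
    (0.9, 5000, 0.01), # spambot link: almost no energy
]
pr_a = (1 - d) + d * sum(energy * rank / outdeg
                         for rank, outdeg, energy in inbound)
print(round(pr_a, 4))
```

With these numbers the spambot link contributes almost nothing, exactly the behavior the variable-temperature argument calls for.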
Adding an extra coefficient to the formula, however, wouldn't solve all the problems. Even John Dalton himself assumed that he was dealing with an ideal gas, whose molecules are spaced far apart and don't interact. The same can't be said about web pages so crowded with links that you would hardly find any content between them.
A real model of the Web, with its dynamic energy flows and changing trends, has to replace the idealistic academic view of the World Wide Web. If half of the content doesn't make it through the parental control filter, we have to say so openly instead of blushingly trying to hide and ignore the issue.
Another problem is that search engines are no longer objective measurement instruments. Every time a measurement is made, it significantly affects the system: people keep putting links on their pages not because they are the best fit, but because they came up at the top of the search results. There is a whole discipline dedicated to performing objective measurements, and its experience has to be applied to the web search business.
In the meantime, the only viable solution I see would be to make people who link to the top search results personally accountable, and have them pay severe fines to the FCC, the search engine company, and the local charity of their choice.
The current ranking of a page is calculated by analyzing the links pointing to that page. However, to determine the true importance of a page, one must also look at both the content of the page and where the page itself points. The outbound link flow is particularly important for understanding the information subdomain to which the page belongs.
In fact, despite denying it, Google does look at outgoing links. After a number of people complained that Google ranks pages that link to Google higher, I did the experiment myself. When I placed a link to Google at the top of my main page, my site's ranking increased from PR5 to PR6 literally overnight! Placing a second link didn't get me PR7, though...
The VaporRank algorithm is deceptively simple. We use a vapor stage recovery technique to separate the real content of a page from the irrelevant stuffing. The vapor stage recovery has a functional efficiency of 93% at saturated hyperlink levels and a 10% relevant keyword density threshold.
Or, in layman's terms: if a page contains a bunch of links, 93% of them are probably junk, and if you see the word "free" more than ten times on a page, you ought to utilize the functionality of the "Back" button.
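For the skeptics, here is a toy filter built from exactly those two rules of thumb. The 93% and the ten-"free" threshold come straight from the text above; the scoring function and the sample page are invented for illustration.

```python
# A toy VaporRank-style junk filter: discount 93% of a page's links
# as probable junk, and flag any page where "free" appears more than
# ten times. Thresholds are from the article; everything else is a
# hypothetical sketch.
import re

def vapor_score(html_text: str) -> dict:
    links = re.findall(r"<a\s", html_text, flags=re.IGNORECASE)
    free_hits = len(re.findall(r"\bfree\b", html_text, flags=re.IGNORECASE))
    return {
        "probable_real_links": round(len(links) * (1 - 0.93), 1),
        "hit_back_button": free_hits > 10,
    }

page = "<a href=x>FREE stuff</a> " * 20   # 20 links, 20 "free"s
print(vapor_score(page))
```

On this sample page, 20 links boil down to about 1.4 real ones, and the "Back" button is strongly advised.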
Just go to any Internet portal and count the total number of links versus the number of links you would actually like to visit. You will be surprised at the accuracy of the approximation at the heart of VaporRank.
VaporRank relies on the uniquely democratic nature of the web, which guarantees an abundance of junk pages and no one to clean them up. Especially with the increasing number of pages thinking of themselves as "important", VaporRank offers a truly unique and scalable solution at minimal administration cost.
We ran an extensive web crawl on a Beowulf cluster of Top-10 supercomputers. After passing through the stages of superlinear and then hyperlinear speedup, the algorithm yielded the result 42.7. This amounts to about a 1.67% statistical error from the theoretically ideal answer, 42. We will refine the algorithm to the point where it is accurate - the dew point. Some of our opponents tried to insert a sick joke here mentioning coffee and Mountain Dew, but their attempts, fortunately, have failed.
VaporRank is now in the vapor stage of its lifecycle. Two alternative development strategies have been suggested: the first is to increase the system's energy level to the point where the project content enters the plasma phase, and the second is to condense it using a revolutionary cooling method developed in the vicinity of the NASA Johnson Space Center. More research has to be conducted to determine which solution is best.
A good user interface (or the lack thereof) is another important issue. At present, once you enter a set of keywords, you can only go up and down in the result list. There has to be a mechanism to adjust the importance of each search term and watch the result list change in real time.
Just imagine a gigantic five-scroll-button mouse floating in the vacuum of 5-dimensional space, performing fast floating-point computations over the billions of cached web pages that silently collide with each other in the multi-terabyte suffix array in the search engine's memory... Is that impressive or what?
[Adams] Douglas Adams. The Hitchhiker's Guide to the Galaxy.
[Moran] M. J. Moran, H. N. Shapiro. Fundamentals of Engineering Thermodynamics.
[Page] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
Your remarkable VaporRank is already obsolete. Here's a nickel, kid. Buy yourself a couple of PR7 hyperlinks.
Arguably VaporRank is the greatest discovery since the invention of the whistling kettle.
You are telling me, guys, "The content is the King." Then who is the Queen?
The VaporRank algorithm: Bringing chaos to the web...
In Soviet Russia keywords enter you!
And last, but not least: