Friday, October 7, 2011

Computing Algorithm used in Google Search

Google Search is a search engine owned by Google Inc. According to traffic statistics from Alexa, Google Search is the most visited search engine on the Internet. It receives billions of queries every day and answers each one in a fraction of a second. Since its launch in 1997 by Larry Page and Sergey Brin, Google Search has gone through a series of substantial changes in its algorithms. Every year, Google Inc. invests heavily in the research and development of these algorithms, and their innovation and novelty form the basis of Google's rise to success.

Google Inc. holds patents on the algorithms employed in Google Search, but it is still possible to outline their core logic. Simply put, whenever a query is keyed in, the search algorithm initially accepts it as plain text and breaks it up into a series of search terms. These terms are usually words, which are matched against the content of the web pages in Google's index; the pages containing the required words are then returned. However, nearly every search engine on the Internet uses a retrieval step very similar to this one. What, then, differentiates Google from other search engines? The answer lies in the algorithm used to rank the pages that the retrieval step produces. Google's ranking algorithm, called PageRank, is what renders Google more successful than its competitors, and it is therefore the main focus of this critique essay.

PageRank is a patented algorithm that aims to rank web pages according to their relevance to the query keyed in. According to researchers at Google Inc., PageRank, unlike other ranking algorithms, tries to order pages according to the human notions of relevance and importance. To analyze PageRank qualitatively, it is important to develop a general understanding of the basic logic of its computation. The central assumption is that a web page linked from other highly ranked pages is itself likely to be important. In simple words, the World Wide Web acts as a giant recommendation system in which pages vote for other pages by linking to them, and votes from "more important" pages carry more weight.

According to Google, PageRank calculates the probability that a person randomly clicking on links will arrive at any particular web page. It first counts the human-generated links on each page, then weights every link by the number of links leaving its source page. This means that the PageRank carried by any outgoing link equals the source document's own PageRank divided by that document's number of outgoing links; the incoming PageRank of a page is then the sum of the PageRank carried by all of its incoming links.

Furthermore, since the algorithm models the probability of visiting a page, PageRank assumes that the user who is randomly clicking on links will eventually stop. The probability that the user continues clicking after reaching a particular page is called the damping factor, d. This damping factor is multiplied by the sum of the incoming PageRank to give the probability of reaching the page through links. A page can also be visited directly by typing its URL into the browser, and the probability of arriving this way is taken to be 1 - d. The final PageRank of a page is the sum of these two contributions.
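The computation described in the last three paragraphs can be sketched in a few lines of Python. This is a minimal illustration of the (1 - d) + d x (sum of incoming rank) form of the formula described above, not Google's actual implementation; the four-page link graph is invented for the example.

```python
# Sketch of the PageRank iteration described above. Each source page q
# passes on PR(q) / L(q) along every outgoing link, where L(q) is q's
# number of outgoing links, and d is the damping factor.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum the rank carried by every incoming link.
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            # (1 - d): probability of arriving directly by URL;
            # d * incoming: probability of arriving by following links.
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(graph)
```

In this toy graph, page C collects rank from three sources and comes out on top, while page D, which nothing links to, is left with only the direct-visit probability of 1 - d.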

A simple analysis shows that PageRank is relatively easy to implement for practical purposes. The ranking problem has an optimal substructure, which means that the results generated by PageRank can be processed using a greedy method: pages with higher PageRank are simply displayed higher in the list, and this greedy ordering is guaranteed to yield the desired result.
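The greedy step referred to above amounts to a single sort in decreasing rank: always pick the highest-ranked remaining page next. A minimal sketch, with invented scores:

```python
# Hypothetical PageRank scores; the page names and values are illustrative.
ranks = {"page_a": 0.43, "page_b": 1.21, "page_c": 0.87}

# Greedily pick the highest-ranked page first, then the next, and so on.
ordered = sorted(ranks, key=ranks.get, reverse=True)
# ordered == ["page_b", "page_c", "page_a"]
```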

Furthermore, the PageRank algorithm is easy to understand. This means that programmers can readily modify it, debug it, and keep it up to date, which gives PageRank a dynamic character and ensures that it can cope with technological changes in the future.

In addition to these strategic advantages, PageRank improves on traditional ranking algorithms because it carries out link analysis: it considers not only a page's incoming links but also the importance of the sources of those links. This yields results of far greater relevance to the user than traditional methods. Link analysis also helps protect users from spammers. Ranking algorithms that consider only the content of web pages can easily be spammed. Spammers usually have financial motives attached to their websites, and because they control the content of their own pages, they can stuff meta-tags and special keywords into the HTML that content-based algorithms use to evaluate a page, even when the visible content is quite different. In this way they try to mislead such algorithms into conferring an unfairly high rank so that their pages appear at the top of the results list. Because PageRank relies on link analysis, however, these spammers are far less successful: they have little or no control over the pages that send incoming links to theirs.
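The contrast between a spammable content-only score and a link-based score can be illustrated with a deliberately simplified sketch. The keyword-counting function and the tiny link data below are invented for the example; real content ranking is far more sophisticated.

```python
# A content-only ranker can be gamed because the spammer controls the text.
def keyword_score(page_text, term):
    # Naive content-based ranking: count occurrences of the search term.
    return page_text.lower().split().count(term)

honest_page = "guide to search algorithms"
spam_page = "search search search search buy pills search"

# The spammer wins under pure keyword counting...
assert keyword_score(spam_page, "search") > keyword_score(honest_page, "search")

# ...but link analysis counts recommendation votes from other pages,
# which the spammer does not control. (Vote counts are invented data.)
incoming_links = {"honest_page": ["site1", "site2", "site3"], "spam_page": []}
link_votes = {p: len(srcs) for p, srcs in incoming_links.items()}
assert link_votes["honest_page"] > link_votes["spam_page"]
```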

On the other hand, PageRank also has several grave limitations. One major question is whether it is adequately scalable, that is, whether the algorithm can cope efficiently when the amount of data to be processed becomes extremely large. Faced with a very large database of web pages, PageRank would require a very large amount of memory to store them. Moreover, since the algorithm maps not only page IDs but also individual terms, growth in the number of pages translates into much larger memory requirements, which is both less efficient and very expensive.

In addition, the runtime of PageRank is longer than that of other ranking algorithms. Its calculation involves not only visiting all the links (which can be viewed as edges, each traversed once per iteration) but also computing the weight of PageRank that each link carries. This makes the calculation more complex, so PageRank is expected to be slower than other ranking algorithms.

Another problem with an algorithm such as PageRank is that the pages at the top of the sorted list are not necessarily relevant to the query. There are several reasons for this. Many of the highest-ranked websites, such as Google, Yahoo, or the BBC, are linked from a vast number of other pages and are inhomogeneous in theme, so they may be placed at the top of the results while being thematically unrelated to the query. Furthermore, spammers can buy links in order to obtain higher ranks for their pages. These issues remain unaddressed by the PageRank algorithm.

Careful analysis of the algorithm reveals another limitation. If a web page has incoming links but no outgoing links, the PageRank it receives is never redistributed, so the page acts as a rank sink. This sinking of rank creates a disequilibrium and makes the calculation of PageRank far less reliable.

There is another significant reason to doubt the accuracy of PageRank. The links to and from web pages are dynamic and change over time, as hundreds of websites are launched and several are removed every day. Shifts in interest can also regroup links over time; political websites, for example, may see a surge of incoming links during polling seasons. Accuracy therefore requires the database to be updated constantly, yet the PageRank database is refreshed only on a quarterly basis. This can make the calculated PageRank unfair.

The discussion above shows that the current version of PageRank still falls far short of Larry Page's claim that it can "understand exactly what you mean and give you back exactly what you want." It can nevertheless be called the best of the contemporary ranking algorithms, thanks to its distinctive use of link analysis. Although PageRank is expected to have a longer runtime than other algorithms, it is still 'super-fast' by human standards: it completes its calculations in less than human reaction time, which makes it acceptable to users. The rank-sink problem can also be addressed systematically; one simple approach is to create outgoing links from the rank-sink pages to every page in the database, distributing their PageRank evenly and thereby eliminating the sink.
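The simple fix proposed above can be sketched as a preprocessing pass over the link graph: any page with no outgoing links is treated as if it linked to every page in the database, so its rank is spread evenly instead of being lost. The graph below is an invented example.

```python
def patch_sinks(links):
    """Return a copy of the link graph with sinks linked to all pages."""
    all_pages = list(links)
    return {page: (targets if targets else list(all_pages))
            for page, targets in links.items()}

graph = {
    "A": ["B"],
    "B": ["C"],
    "C": [],        # a rank sink: incoming links only, no outgoing links
}
patched = patch_sinks(graph)
# patched["C"] == ["A", "B", "C"]
```

After this pass, every page has at least one outgoing link, so the PageRank iteration always has somewhere to send each page's rank.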

To conclude, if Google makes adequate modifications of this kind and updates the PageRank database more frequently, PageRank will come much closer to what Larry Page intends it to be.

References
  1. Wikipedia
  2. http://www.beninbrown.com/2010/09/15/time-reevaluate-google-pagerank/
  3. http://dev.whydomath.org/node/google/math.html
  4. http://www.voelspriet2.nl/PageRank.pdf
  5. http://delivery.acm.org/10.1145/1150000/1145629/p233-desikan.pdf?key1=1145629&key2=5619550921&coll=DL&dl=ACM&CFID=115606932&CFTOKEN=11892472
