
Thursday, November 10, 2011

Recursion, randomization, sorting and computations

Comment 1:

Divide-and-conquer vs. recursion?

Follow-up:

Divide-and-conquer is a general technique in which we divide a problem into smaller parts and then solve those smaller parts independently. Recursion is closely related to divide-and-conquer in that it is usually the most concise way to express a divide-and-conquer idea. However, a divide-and-conquer idea does not always need to be realized using recursion. Indeed, sometimes we would like to avoid recursion because it can be very slow, as you have seen (or will see) in Tutorial 1.
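To make this concrete, here is a tiny Python sketch (an illustration of mine, not taken from Tutorial 1) of one divide-and-conquer idea, computing x^n by halving the exponent, written once with recursion and once with a plain loop that realizes the same idea without any recursive calls.

def power_recursive(x, n):
    """Divide-and-conquer: x^n is (x^(n//2))^2, times an extra x if n is odd."""
    if n == 0:
        return 1
    half = power_recursive(x, n // 2)
    return half * half * (x if n % 2 else 1)

def power_iterative(x, n):
    """The same halving idea, realized with a loop instead of recursion."""
    result = 1
    while n > 0:
        if n % 2:          # current lowest bit of n is 1
            result *= x
        x *= x             # square the base
        n //= 2            # halve the exponent
    return result

print(power_recursive(2, 10))  # 1024
print(power_iterative(2, 10))  # 1024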

Comment 2:

Randomization?

Follow-up:

Let us consider a simple problem in order to illustrate the usefulness of randomization. This problem touches on several important and related concepts: worst-case analysis, average-case analysis, and probabilistic analysis. Consider the following “hiring” algorithm:

1) Set candidate best to be unknown;

2) For each of the n candidates, do the following:

3) Interview candidate i ;

4) If candidate i is better than candidate best, then hire candidate i and set best to be i ;

Assume that interviewing each candidate has a cost of c_In and hiring a candidate has a cost of c_H (where c_H > c_In under normal circumstances).

(a)

Can you give the worst case total cost of the above hiring algorithm?

(b)

Assume that the candidates come to the interview in a random order, i.e., each candidate is equally likely to be the best. Specifically, candidate i has a probability of 1/i of being the best among the first i candidates. Can you give the average case total cost of the above hiring algorithm?


Hint: You can consider using the “indicator random variable” X_i, which is equal to 1 if candidate i is hired and 0 otherwise. Hence, the average number of candidates that are actually hired is equal to the “expected value” (i.e., the average) of ∑ X_i.

Answers:

(a)

The worst case is that every interviewed candidate is hired. Thus, the total cost is c_In·n + c_H·n.

(b)

In the average case, the only change to the total cost is the hiring part, so let's focus just on that part. Specifically, as given in the Hint, the average number of candidates that will be hired is the expected value of ∑ X_i, which, by linearity of expectation and the fact that candidate i is hired with probability 1/i, is equal to ∑_{i=1}^{n} 1/i. A good bound for this sum (the n-th harmonic number) is log n. Thus, the average case hiring cost is just c_H·log n, which is much smaller than the worst case's hiring cost.

The lesson we learn is that the average case is sometimes much better than the worst case.

As you can see, it is helpful to assume that all permutations of the input are equally likely so that a probabilistic analysis can be used. Now, here is the power of randomization—instead of assuming a distribution of inputs (i.e., the candidates), we impose a distribution. In particular, before running the algorithm, we randomly permute the candidates in order to enforce the property that every permutation is equally likely. This modification does not change our expectation of hiring a new person roughly log n times. It means, however, that for any input we expect this to be the case, rather than for inputs drawn from a particular distribution.
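To see this in action, here is a minimal Python sketch (using hypothetical candidate “quality” scores) that imposes the distribution by shuffling the candidates before running the hiring loop and then counts the hires; averaged over many trials, the count comes out close to ln n.

import random
import math

def count_hires(scores):
    """Run the hiring loop on candidates in the given order and
    count how many times a new 'best' candidate is hired."""
    best = float('-inf')
    hires = 0
    for score in scores:          # interview each candidate in turn
        if score > best:          # better than the current best: hire
            best = score
            hires += 1
    return hires

n, trials = 1000, 2000
total = 0
for _ in range(trials):
    candidates = list(range(n))   # hypothetical candidate quality scores
    random.shuffle(candidates)    # impose the uniform random order
    total += count_hires(candidates)

print("average hires:", total / trials)   # close to the harmonic number H_n, roughly ln(n)
print("ln(n)        :", math.log(n))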

Comment 3:

Quantum computing? Parallel processing? Biological computing?

Follow-up:

These are really exotic computing models that we will elaborate on in a later part of the course. Please be patient. Thanks!

Comment 4:

There are a lot of different mathematical ways of documenting/calculating numbers/codings. How come only a few can be applied to computing processing algorithms?

Follow-up:

Good point. But please note that not many mathematical methods of calculation can be realized, in a mechanical manner, as a computing procedure (i.e., one that can be carried out by a computer). For instance, think about integration in calculus: there are many integration problems that need very good “inspection” or insight to solve. Agree?

Comment 5:

Insertion sort?

Follow-up:

Please find the sketch of a computing procedure using insertion sort below.

(1) Given a list of numbers A[1], A[2], ..., A[n]
(2) for i = 2 to n do:
(3)     move A[i] forward to the position j <= i such that
(4)     A[i] < A[k] for j <= k < i, and
(5)     either A[i] >= A[j-1] or j = 1
Now, it is not difficult to see that the number of checkings/swappings in lines (3) to (5) above cannot be larger than i. Thus, the total number of steps, i.e., the estimated running time, would be at most ∑_{i=2}^{n} i ≈ n^2/2, i.e., on the order of n^2.
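For those who prefer running code, here is one possible Python rendering of the same procedure (a sketch only; note that Python lists are 0-indexed rather than 1-indexed).

def insertion_sort(a):
    """Sort the list a in place: for each i, move a[i] backwards
    until it sits after every smaller element (lines (3) to (5) above)."""
    for i in range(1, len(a)):           # line (2): i = 2..n in 1-based terms
        key = a[i]
        j = i
        while j > 0 and a[j - 1] > key:  # shift larger elements one slot to the right
            a[j] = a[j - 1]
            j -= 1
        a[j] = key                       # a[i] lands at position j
    return a

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]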

Comment 6:

Quicksort? Randomization?

Follow-up:

The Quicksort algorithm looks very similar to the algorithm that you have worked on (or will work on) in Problem 3 of Tutorial 1 (about “searching”). So I leave it to you to write up the computing procedure. You can also prove that the estimated running time is on the order of n log n.

On the other hand, I would like to add a bit more about the “randomization” part used in Quicksort. Similar to the “hiring problem” in Comment 2 above, we need a certain “distribution” in the input list in order to realize the potential of Quicksort (or of divide-and-conquer, for that matter).
Specifically, in Quicksort, we would like to choose a pivot so that the resulting two partitions are of more or less equal size. It is reasonable to assume that if the list is somehow “totally random” (we will talk more about generating randomness later on), then a randomly selected number from the list is likely to have a value near the middle, i.e., it will divide the list into two roughly equal halves. So, just like in the hiring problem, we randomly shuffle the list before sorting and then, statistically, we would expect the list to be divided into roughly equal halves when we partition it.
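Here is one possible Python sketch of Quicksort with randomization (an illustration only, not the write-up expected for the tutorial); instead of shuffling the whole list up front, it injects the randomness by choosing each pivot uniformly at random from the current sublist, which serves the same purpose.

import random

def quicksort(a):
    """Return a sorted copy of a, using a randomly chosen pivot."""
    if len(a) <= 1:                 # base case: nothing left to divide
        return a
    pivot = random.choice(a)        # random pivot: roughly equal halves on average
    left   = [x for x in a if x < pivot]
    middle = [x for x in a if x == pivot]
    right  = [x for x in a if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print(quicksort([9, 3, 7, 1, 8, 2, 5]))  # [1, 2, 3, 5, 7, 8, 9]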

Comment 7:

P2P should be discussed/elaborated.

Follow-up:

We will spend some time discussing P2P systems in a later part of the course. Please be patient.
Thanks!

Comment 8:

We talked about our actions being monitored, even in P2P because we are accessing the Trackers. But what about the ISPs? They track everything we do. What about VPN (virtual private network)? Can it prevent ISPs from tracking us?

Follow-up:

Yes, it is true that the ISPs are keeping track of our moves all the time. So when law enforcement needs the information (with a warrant), the ISPs will supply it. Even a VPN (i.e., setting up a so-called private link, in the form of an encrypted channel) cannot help, because ultimately your IP address has to be revealed; only the data can be encrypted. We will discuss more about Internet security and privacy in a later part of the course.

Comment 9:

Feasibility of parallel processing? For example, in the Tower of Hanoi problem we are limited by the number of pegs and the rules of the game.

Follow-up:

Yes you are right. How to do things in parallel in a computer has been baffling researchers for decades.

We will discuss more about these difficulties later in the course.

Comment 10:

Isn’t it true that “recursion” is something just like mathematical induction?

Follow-up:

Yes you are absolutely right! Very good observation. Indeed, recursion, or even divide-and-conquer, is closely related to the “induction” concept. We try to “extrapolate” solutions of smaller problems to larger ones. That is the idea.

Comment 11:

CPUs are evolving quickly nowadays. Their computational speeds increase exponentially, and this lowers the significance of how effective an algorithm is for solving a problem, since the CPUs can carry out the tasks fast and well regardless. Still, thinking of an effective algorithm remains challenging and worth pursuing.

Follow-up:

Oh, on this one I cannot agree with you. Indeed, as you will find out in this course (and as we will also discuss in more detail soon), there are some problems that cannot be solved practically without a smart algorithm, even if you have thousands of processors at your service.

Friday, October 7, 2011

Computing Algorithm used in Google Search

Google Search is a search engine owned by Google Inc. According to a statistical analysis carried out by ‘Alexa Traffic Rank’, Google Search is the most visited search engine on the Internet. It receives an enormous number of queries every day and carries out each search in a fraction of a second. Since its launch in 1997 by Larry Page and Sergey Brin, Google Search has gone through a series of substantial changes in its algorithms. Every year, Google Inc. invests large sums of money in the research and development of the algorithms used. It is the innovation and novelty of these algorithms that forms the basis of Google’s rise to success.

Google Inc. holds patent rights to the algorithms employed in Google Search. However, it is possible to outline the core logic behind these complex algorithms. Simply put, whenever a query is keyed in, the search algorithm initially accepts it as plain text. It then breaks the query up into a series of search terms. These search terms are usually words which are matched against the content of the websites in Google’s database. The websites containing the required words are then displayed. However, every search engine on the Internet uses a search algorithm which is very similar to Google’s. So the question arises: what differentiates Google from other search engines? The answer lies in the algorithm used to rank the web pages returned by the search algorithm. The ranking algorithm used by Google Search is called ‘PageRank’, and it is what renders Google more successful than other search engines. Hence, PageRank will be the main focus of this critique essay.

PageRank is a patented algorithm which aims to rank webpages according to their ‘relevance’ to the query keyed in. According to researchers at Google Inc., PageRank, unlike other ranking algorithms, tries to list web pages according to the human concepts of relevance and importance. In order to qualitatively analyze PageRank, it is important to develop a general understanding of the basic logic of the computation it carries out. The main assumption underlying PageRank is that webpages linked to from other highly ranked pages are likely to be more important. So, in simple words, the World Wide Web acts as a giant recommendation system in which webpages vote for other pages by sending outgoing links to them. Moreover, the votes from “more important” pages carry more weight.

According to Google, PageRank calculates the probability that a person randomly clicking on links will arrive at any particular web page. It first counts the human-generated links on each webpage. It then assigns a weight to every link based on the number of links going out from the source webpage. This means that the PageRank carried by any outgoing link is equal to the document’s own PageRank divided by the number of outgoing links of that document. So, the incoming PageRank of any webpage is the sum of the PageRanks carried by all its incoming links.

Furthermore, as the algorithm is used to find the probability of visiting a webpage, PageRank further assumes that the user who is randomly clicking on links will stop clicking after some finite time. The probability that the user continues to click after reaching a particular page is called the ‘damping factor’ (d). So, logically, this damping factor is multiplied by the sum of incoming PageRanks in order to calculate the probability of reaching a webpage through external links. However, a webpage can also be visited directly by typing its URL into the browser. Intuitively, the probability of reaching a page in this way is obtained by subtracting d from 1. So the final PageRank is the sum of these two probabilities.
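Putting the last two paragraphs together, the computation described corresponds to the well-known formula PR(A) = (1 - d) + d · Σ PR(B)/L(B), where the sum runs over the pages B linking to A and L(B) is the number of outgoing links of B. Below is a small illustrative Python sketch of this iterative computation on a hypothetical three-page web (a toy example, not Google’s actual implementation).

def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank; links maps each page to the pages it links to.
    d = 0.85 is the commonly quoted damping factor."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # initial guess
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum of the PageRank carried by all incoming links of p
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web used only for illustration:
toy_web = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
}
print(pagerank(toy_web))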

A simple analysis of the PageRank algorithm shows that PageRank is relatively easy to implement for practical purposes. It has an optimal substructure, which means that the result generated by the PageRank algorithm can be processed using the ‘greedy method’: the pages with higher PageRanks are simply displayed higher in the list of results, and this greedy method is sure to yield the desired ordering.

Furthermore, the PageRank algorithm is easy to understand. This property means that programmers can readily modify the algorithm, debug it, and keep it up to date. This gives PageRank a dynamic nature and helps ensure that it can cope with technological changes in the future.

In addition to these strategic advantages, PageRank is also much better than traditional ranking algorithms because it carries out ‘link analysis’. This means that PageRank considers not only the incoming links but also the importance of the sources of those links. This results in much greater relevance of the ranked results returned to the user compared with traditional methods. ‘Link analysis’ also helps protect users from spammers. Ranking algorithms which only consider the content of webpages to generate results can easily be spammed. Spammers usually have financial motives attached to their websites. Since spammers control the content of their webpages, they can embed meta-tags and special keywords into the HTML of their pages, which content-based algorithms use to evaluate the data on a page, even though the visible content of the page may be very different. In this way they try to mislead these algorithms into conferring an unfairly high rank so that their webpages appear at the top of the results list. However, due to the use of ‘link analysis’ in PageRank, such spammers are far less successful, because they have little or no control over the webpages which send incoming links to their pages.

On the other hand, PageRank also has several grave limitations. One of the major questions that can be raised is whether PageRank is adequately scalable, that is, whether the algorithm can cope efficiently when the size of the data to be processed becomes extremely large. Faced with a very large database of web pages, the PageRank algorithm would require a very large amount of memory to store them. Furthermore, as the algorithm maps not only page ids but also separate terms, an increase in the number of webpages would result in a much larger memory requirement, which would be not only less efficient but also very expensive.

In addition, the runtime of PageRank is longer than that of other ranking algorithms. The reason is that the calculation of PageRank not only involves considering all the links (which can be viewed as edges that each have to be visited during the calculation) but also requires calculating the weight carried by each link. This makes the calculations more complex, and hence PageRank is expected to be slower than other ranking algorithms.

Another problem with an algorithm such as PageRank is that the pages appearing at the top of the sorted list are not necessarily relevant to the query keyed in. There can be several reasons for this. Many of the highest ranked webpages (since they are linked to by many other webpages), for example Google, Yahoo, or the BBC, are inhomogeneous in theme. This means that such websites may be placed highest in the ranked results even though they are thematically unrelated to the query. Furthermore, links can be bought by spammers in order to attain higher ranks for their pages. These issues are still unaddressed by the PageRank algorithm.

In addition, careful analysis reveals another limitation of PageRank. Webpages that have only incoming links and no outgoing links keep the PageRank given to them undistributed, and hence act as sinks of PageRank. This sinking of rank results in a disequilibrium and makes the calculation of PageRank much less reliable.

There is another significant reason to doubt the accuracy of PageRank. The links to and from webpages are dynamic and may change over time, as hundreds of websites are launched and several are removed every day. Moreover, shifts in interest can also regroup links over time; for example, political websites may see an increase in incoming links during polling seasons. Accuracy therefore requires the database to be updated constantly. However, the PageRank database is only updated on a quarterly basis, which may result in an unfair PageRank calculation!

The above discussion shows that the current version of PageRank is still far from Larry Page’s claim that PageRank is able to “understand exactly what you mean and give you back exactly what you want.” However, it can be argued that it is still the best of all contemporary ranking algorithms, thanks to its specialized feature of ‘link analysis’. Although PageRank is expected to have a longer run time than other algorithms, it is still ‘super-fast’ by human perception: it completes its calculations in less than human reaction time, which makes it acceptable to users. The problem of rank sink can be addressed with a systematic method; for example, a simple approach is to create outgoing links from the ‘rank-sink webpages’ to all webpages in the database. This would distribute their PageRank evenly and hence eliminate the sinking of rank.

So, to conclude, if Google makes such modifications and the PageRank database is updated more frequently, PageRank would come much closer to what Larry Page intends it to be.

References
  1. Wikipedia
  2. http://www.beninbrown.com/2010/09/15/time-reevaluate-google-pagerank/
  3. http://dev.whydomath.org/node/google/math.html
  4. http://www.voelspriet2.nl/PageRank.pdf
  5. http://delivery.acm.org/10.1145/1150000/1145629/p233-desikan.pdf?key1=1145629&key2=5619550921&coll=DL&dl=ACM&CFID=115606932&CFTOKEN=11892472

Tuesday, July 19, 2011

HW: Sorting in the daily life

Problem (10 points):

This problem is about a common daily-life application of sorting. Specifically, sorting (e.g., QuickSort) is usually done before we want to search for something. For example, bank accounts are always sorted according to the account numbers in order to facilitate searching for a particular account on request.
Consider a sorted (in ascending order) list of n numbers (assume n is a power of 2, i.e., n = 2^k for some positive integer k). To search for a number x in the list, we can use the following algorithm. Compare x with the number in the middle of the list (call it m). If they are equal, then our job is done. If x is smaller than m, then we repeat this search process using just the left half of the list; otherwise, we repeat this search process using just the right half of the list. (A small Python sketch of this process is given after part (c) below.)

(a) Why can we use just the left half of the list if the first comparison tells us that x is smaller than the middle number?

(b) Please express the above algorithm as a recursive method.

(c) Please give the number of comparisons needed in the worst case. Explain your answer.
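As a hypothetical illustration of the search process described in the problem (a Python sketch assuming a 0-indexed list; it is not meant to replace your own write-up for part (b)):

def search(a, x, lo, hi):
    """Search for x in the sorted slice a[lo..hi]; return its index, or -1 if absent."""
    if lo > hi:                      # empty range: x is not in the list
        return -1
    mid = (lo + hi) // 2             # the middle number m
    if a[mid] == x:
        return mid                   # found: our job is done
    if x < a[mid]:
        return search(a, x, lo, mid - 1)   # repeat on the left half
    return search(a, x, mid + 1, hi)       # repeat on the right half

nums = [2, 5, 8, 12, 16, 23, 38, 56]       # n = 8 = 2^3
print(search(nums, 23, 0, len(nums) - 1))  # 5
print(search(nums, 7, 0, len(nums) - 1))   # -1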