Showing posts with label crawling. Show all posts
Showing posts with label crawling. Show all posts

Sunday, July 24, 2011

Programming using stack

Problem:

A stack is a useful structure for storing data because it has many useful applications. For example, it can be applied in implementing recursive computing method. A stack is a “last-in-first-out” structure — data items are stored linearly as one item is placed on top of the other. The only two operations are: pop and push. The pop operation retrieves the top data item on the stack, while the push operation puts a new data item on the top of the stack. The figure below shows how a stack works:


(a) Can you write up a computing procedure for DFS using a stack instead of using recursion?

(b) Can you suggest a practical situation where we must use a stack instead of using recursion?

Follow-Up:

(a) The first possible version is as follows.

Stack_DFS1(x):
  1. Mark x; Push(x);
  2. if stack is empty, then end of algorithm; otherwise, do the following:
  3.    y = top of stack (without popping it);
  4.    if there is a neighbor z of y that is not marked, then do:
  5.       Mark z; Push(z);
  6.    else
  7.       Pop;
  8. go to Step 2.

Another possible version is as follows.

Stack_DFS2(x):
  1. Push(x);
  2. if stack is empty, then end of algorithm; otherwise, do the following:
  3.    y <- Pop;
  4.    if y is not marked, then Mark y;
  5. for every neighbor z of y that is not marked, do the following:
  6.    Push(z);
  7.    go to Step 2.

(b) In the Web crawling application, we need to use a stack. In fact, crawling is done by many computers running the “search” (or explore) simultaneously. Each computer takes the next node to be explored from the top of the stack, downloads the HTML file, and then scans it for further hyperlinks. But when a new HTML document is found (i.e., a “neighbor”), no recursive call is made; instead, the new node is pushed onto the stack.

Yet one interesting question remains: when we see a HTML document, how do we know it is indeed “new” (i.e., not marked) so that we want to push it on stack?

Sunday, July 17, 2011

Computing using depth-first-search (DFS)

Problem:

In class today, we learn about a computing method for traversing a graph (see Presentation #2, slide #9), called depth-first-search (DFS).

(a) Can you write up a recursive computing procedure for doing so?

(b) Can you analyze its estimated running time?

(c) Can you name some “everyday computing” applications of this method?

Follow-Up:

(a) A possible recursive algorithm is as follows:

DFS(x):
  1. Mark x;
  2. For every neighbor y of x that is not marked:
  3. DFS(y);

(b) As discussed in class, the DFS procedure visits each and every node once, also each and every link once. Thus, the estimated running time should be on the order of N + E , where N and E are the number of nodes and number of links, respectively.

(c) We need to use DFS for many useful daily life applications:

  • playing chess: searching for the best move in the game tree;
  • playing puzzle: searching for the exit in a maze;
  • crawling the Web: the crawler program (e.g., in Google) needs to search the Web graph (i.e., visualizing the Web as a graph of connected documents, where the connections are the hyper-links) in a depth-first way.