
Sunday, July 24, 2011

Speed-up expressions for multicore processors

Problem:

In class we learnt about Amdahl’s Law (see Presentation #4, pp. 11–13). In this problem, we try to derive speedup expressions for three variations of multicore processors:

1. A symmetric multicore processor composed of n base core equivalents (BCEs); each BCE is the smallest realization of a processor core, as illustrated in Figure 1(a);

2. A symmetric multicore processor composed of n/r cores, each of which is made by combining BCEs, as illustrated in Figure 1(b);

3. An asymmetric multicore processor composed of one r-BCE core and n − r BCEs, as illustrated in Figure 1(c).


Now, we try to derive speedup expressions for a certain program. We use the following notation.

  • f : the fraction of the program that must be executed sequentially
  • perf ( r ) : normalized performance of an r-BCE core (note: perf ( 1 ) = 1 ), defined as:

perf ( r ) = (absolute performance of an r-BCE core) / (absolute performance of a single BCE)

where absolute performance can be the processing speed (e.g., Millions of Instructions Per Second, MIPS), or some other performance measure.

Speedup S is defined as:

S = (execution time using a single BCE) / (execution time using the multicore processor)

(a) Derive S for a symmetric multicore processor chip with n BCEs in terms of f and n.

(b) Derive S for a symmetric multicore processor chip with (n/r) r-BCE cores in terms of f , n , r , and perf ( r ) .

(c) Derive S for an asymmetric multicore processor chip with one r-BCE core and n – r BCEs in terms of f , n , r , and perf ( r ) .

Follow-up:

(a) Let T denote the time required to execute the whole program using 1 BCE. Now, the time required to execute the whole program with a symmetric multicore processor chip of n BCEs is given by:

T_new = f T + (1 − f) T / n

Consequently we have:

S = T / T_new = 1 / ( f + (1 − f) / n )

(b) Let T denote the time required to execute the whole program using 1 BCE. Now, the time required to execute the whole program with a symmetric multicore processor chip with (n/r) r-BCE cores is given by:

T_new = f T / perf ( r ) + (1 − f) T / ( (n/r) × perf ( r ) )

Consequently we have:

S = T / T_new = 1 / ( f / perf ( r ) + (1 − f) r / ( n × perf ( r ) ) )

(c) Let T denote the time required to execute the whole program using 1 BCE. Now, the time required to execute the whole program with an asymmetric multicore processor chip with one r-BCE core and n − r BCEs is given by:

T_new = f T / perf ( r ) + (1 − f) T / ( perf ( r ) + n − r )

Here the sequential part runs on the r-BCE core alone, while the parallel part runs on the r-BCE core and the n − r BCEs together. Consequently we have:

S = T / T_new = 1 / ( f / perf ( r ) + (1 − f) / ( perf ( r ) + n − r ) )
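The three expressions above can be sanity-checked with a short script. The sketch below assumes a particular performance model, perf(r) = √r, purely for illustration (the derivations hold for any perf with perf(1) = 1); the function names and the sample values of f, n, and r are mine:

```python
import math

def speedup_symmetric_bce(f, n):
    # (a) n single-BCE cores: the sequential fraction f runs on one BCE,
    # the rest is spread perfectly across all n BCEs
    return 1.0 / (f + (1.0 - f) / n)

def speedup_symmetric(f, n, r, perf):
    # (b) n/r cores of r BCEs each; both parts run at speed perf(r),
    # and the parallel part uses all n/r cores
    return 1.0 / (f / perf(r) + (1.0 - f) * r / (n * perf(r)))

def speedup_asymmetric(f, n, r, perf):
    # (c) one r-BCE core plus n - r BCEs; the sequential part runs on the
    # big core, the parallel part on the big core and the BCEs together
    return 1.0 / (f / perf(r) + (1.0 - f) / (perf(r) + n - r))

# perf(r) = sqrt(r) is only an illustrative assumption (it satisfies perf(1) = 1)
perf = math.sqrt

print(speedup_symmetric_bce(0.1, 16))        # ≈ 6.4
print(speedup_symmetric(0.1, 16, 4, perf))   # ≈ 6.15
print(speedup_asymmetric(0.1, 16, 4, perf))  # ≈ 8.75
```

With these sample numbers the asymmetric design wins: its big core speeds up the sequential bottleneck without giving up much parallel throughput.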

Thursday, July 21, 2011

Sequential processing

Comment 3:

What exactly is “sequential processing”?

Follow-up:

Sequential processing refers to those computations that can only be handled by one single processor; putting in more processors does not help. For example, in a typical parallel or multi-task program, there are some “managerial” tasks to be carried out. One of the most commonly encountered tasks is the division of the whole job into pieces. Specifically, if we consider an image processing task, dividing up the image into pieces is one such managerial task. Obviously, this action is not parallelizable.

Can you think of some daily-life examples of sequential processing tasks?

Parallel processing: data parallel and functional parallel

Comment:

Differences between data parallel and functional parallel approaches?

Follow-up:

The data-parallel approach is suitable for situations where exactly the same actions can be taken on different pieces of data. A very good example is filtering or quantization in image processing, because these tasks operate on each pixel of the image independently. Another daily-life example that we discussed in class is the painting of a large wall (in the same color!) by a group of workers in parallel.

The functional-parallel approach is suitable for situations where different actions can be taken on the data (possibly the same data) independently and simultaneously. A very good example is the recognition of different objects in the same image. Here, we apply different “actions”, which are the matching of different known object models, over the same image. These actions are usually independent, and they are different because matching different models requires different kinds of processing. Another daily-life example that we discussed in class is the remodeling of an apartment — we usually need to do many different things such as installing electrical wiring, painting walls, installing fixtures, etc. These functions are (sometimes) independent and can be carried out in parallel.

Finally, one very important difference between the two approaches is the “degree of parallelism”, which can be defined as the number of processors that you can use in a productive manner (i.e., having some useful work to do). Obviously, in the data-parallel approach, if the data involved is large (e.g., a very large image), then we can use many processors (e.g., up to thousands) to enhance the speed-up. On the other hand, a task usually does not contain many different (and independent) functions (e.g., remodeling an apartment), so the number of processors that can be put into productive use is usually limited (e.g., on the order of 10 or lower).
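The two styles can be sketched in a few lines of Python. The “image” and the worker functions below are toy stand-ins of my own, not examples from the class:

```python
from concurrent.futures import ThreadPoolExecutor

image = list(range(12))  # toy stand-in for an image: a flat list of pixels

# Data parallel: the SAME operation is applied to every chunk of the data.
def brighten(chunk):
    return [p + 10 for p in chunk]

with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = [image[i:i + 3] for i in range(0, len(image), 3)]
    bright = [p for part in pool.map(brighten, chunks) for p in part]

# Functional parallel: DIFFERENT, independent operations on the same data.
def total(img):
    return sum(img)

def peak(img):
    return max(img)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fn, image) for fn in (total, peak)]
    results = [f.result() for f in futures]

print(bright[:3], results)  # [10, 11, 12] [66, 11]
```

Note that the data-parallel half could use as many workers as there are chunks, while the functional-parallel half can never use more workers than there are distinct functions (here, two): exactly the “degree of parallelism” point above.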

Think about “weather forecasting” computing. Which approach do you think is more suitable?

Comment:

Functional parallel computing can only be used when the different functions are independent of each other. So, for example, in the case of “object tracking”, the different functions are actually dependent on each other, i.e., object tracking can only be done after background subtraction, so the functions need to be done one after another. Hence, functional parallel computing cannot be used in this case. In a sense, this is “completely sequential”?

Follow-up:

This is an excellent observation.

Sunday, July 17, 2011

Sequential mode or parallel mode?

Problem:

A uniprocessor computer can operate in either sequential mode or parallel mode. In parallel mode, computations can be performed nine times faster than in sequential mode. A certain benchmark program took time T to run on this computer. Furthermore, suppose that 25% of T was spent in parallel mode whereas the remaining portion was in sequential mode.

(a) What is the effective speedup of the above execution as compared with the condition when parallel mode is not used at all?
 
(b) What is the fraction α of parallelized code of the benchmark program?

(c) Suppose we double the speed ratio between parallel mode and the sequential mode by hardware improvements. What is the new effective speedup?

(d) Suppose the speedup you calculated in (c) is to be obtained by software improvement alone (i.e., by increasing the parallelization fraction), instead of by any hardware improvement. What is the new parallelization fraction required?

Follow-up:

(a) If parallel mode were not used at all, the execution time would be 0.75T + 9 × 0.25T = 3T.
Therefore, the speedup of the program is 3T ⁄ T = 3.

(b) The fraction of parallelized code is measured relative to fully sequential execution. In sequential mode the parallel portion would take 9 × 0.25T = 2.25T out of the 3T total, so the fraction is given by:

α = 2.25T / 3T = 0.75

(c) The speed ratio is now 2 × 9 = 18, so the new execution time becomes:

0.75T + (9 × 0.25T) / 18 = 0.75T + 0.125T = 0.875T

And the new speedup is:

3T / 0.875T = 24/7 ≈ 3.43

(d) Parallelization in general is done by the compiler (an important piece of software for translating your program into machine language). Since the new speedup is to be obtained by software instead of hardware, the speed of parallelized code is still 9 times faster than sequential mode. Let β be the new parallelization fraction. Then, we have:

1 / ( (1 − β) + β/9 ) = 24/7

Thus, we have

β = 153/192 ≈ 0.80

That is, the compiler has to parallelize about 5% more of the benchmark program (from 0.75 to about 0.80).
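The four answers can be double-checked with exact rational arithmetic; the variable names below are mine:

```python
from fractions import Fraction

T = Fraction(1)  # the measured run time; any unit works

# (a) Without parallel mode, the 25% parallel portion takes 9x longer.
seq_only = Fraction(3, 4) * T + 9 * Fraction(1, 4) * T   # = 3T
speedup_a = seq_only / T                                  # = 3

# (b) Fraction of parallelized code, relative to sequential-only time.
alpha = (9 * Fraction(1, 4) * T) / seq_only               # = 3/4

# (c) Doubling the parallel/sequential speed ratio from 9 to 18.
t_new = (1 - alpha) * seq_only + alpha * seq_only / 18    # = 7T/8
speedup_c = seq_only / t_new                              # = 24/7

# (d) Same speedup from parallelization alone (ratio stays 9):
#     1 / ((1 - beta) + beta/9) = 24/7  =>  beta = (9/8)(1 - 7/24)
beta = Fraction(9, 8) * (1 - Fraction(7, 24))

print(speedup_a, alpha, speedup_c, beta)  # 3 3/4 24/7 51/64
```

Note that 51/64 = 0.796875, and 0.796875 − 0.75 ≈ 0.05, confirming the “5% more” conclusion.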

Parallel processing: Amdahl's Law, speed-up and overhead

Comment:

The “speed-up” formula (i.e., Amdahl’s Law) is still a bit confusing?

Follow-up:

First of all, it is important to understand that this formula assumes that a program (or a task, or a job) can be divided into two parts, i.e., sequential part (must be carried out by one single processor only), and parallel part. Most importantly, one key component in the assumption is that the parallel part can be perfectly divided into totally independent (i.e., no communication is required) pieces for different processors to work on.

Consequently, given n processors, we can perfectly speed up the computation of the parallel part by n times. Using the terminology in the presentation slide, the Speedup of enhancement is n. Thus, if the original total execution time of the program using one processor is T and the fraction of the sequential part is x, then the new total execution time is:

T_new = x T + (1 − x) T / n

The first term in the above expression is the time required to execute the sequential part. The second term is the time required to execute the parallel part, using n processors simultaneously. Finally, speed-up, which is defined as the ratio of original time to the new time, is given by:

speed-up = T / T_new = 1 / ( x + (1 − x) / n )

We can see that perfect speed-up, i.e., n, is achievable only when x = 0, which means that the whole program is perfectly parallelizable into totally independent pieces. This is a situation not likely to happen in a typical computer program (or any daily life task, for that matter).
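The diminishing returns are easy to see numerically. A quick sketch (the 5% sequential fraction is just a sample value I chose):

```python
def amdahl(x, n):
    # x: sequential fraction, n: processor count
    return 1.0 / (x + (1.0 - x) / n)

x = 0.05  # even a 5% sequential part caps the achievable speed-up
for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl(x, n), 2))
# The printed speed-ups approach, but never exceed, 1/x = 20,
# no matter how many processors are added.
```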

Comment:

Does the parallel processing (speed-up) formula really work?

If we let the processor count tend to infinity, the time spent on communication should also grow on a large scale. Should we consider the running time to be larger? Or does the formula simply neglect the communication time between processors?

Follow-up:

This is another excellent observation. As mentioned in the Follow-Up to the comment above, in deriving the formula, we assume that the parallel part can be divided up perfectly, without any communication. In a real-life parallel program, of course there will be communications. Thus, in deriving a speed-up formula for such a program, we might need an additional term for the total execution time:

T_new = x T + (1 − x) T / n + c n²

In the above expression, the third term denotes the communication overhead time introduced by parallelization, which grows as n² (i.e., n × n) with a multiplicative constant c. This is not an “artificial” example but represents real-life communication overhead that you can find in many practical parallel programs.

What new conclusions can you draw from this new expression?
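One conclusion falls out immediately from a numerical sketch (x, T, and c below are illustrative values I picked, not values from the discussion): with the c·n² term present, adding processors helps only up to a point, after which the speed-up declines.

```python
def total_time(x, T, n, c):
    # sequential part + perfectly parallel part + quadratic communication overhead
    return x * T + (1.0 - x) * T / n + c * n * n

def speedup(x, T, n, c):
    return T / total_time(x, T, n, c)

x, T, c = 0.05, 1.0, 1e-5   # illustrative values only
best_n = max(range(1, 200), key=lambda n: speedup(x, T, n, c))

print(best_n, round(speedup(x, T, best_n, c), 2))
# Unlike pure Amdahl, the speed-up now peaks at a moderate n and then
# falls, because the n^2 overhead eventually dominates.
```

Setting the derivative of the total time to zero gives n³ = (1 − x) T / (2c), about 36 with these values, matching the exhaustive search.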

Comment:

Still quite confused about Slide 14?

Follow-up:

Slide 14 illustrates the need for extra communication overhead when we perform parallel processing. Specifically, in the Main Memory, there is a variable u, which has a value of 5. This variable is needed for some parallel computations to be carried out by three different processors as shown. Now, each processor has to read this variable from the Main Memory into its own memory in order to perform the computation. The problem here is that after P1 and P3 have read the variable u, P3 subsequently modifies its value (from 5 to 7), leaving different copies of u with different values (an inconsistent situation).

Obviously, inconsistency in shared data variables is unacceptable because it will lead to mistakes. Thus, to rectify the inconsistency, some extra communication has to be carried out. For instance, after P3 modifies u, it has to immediately inform every processor that is currently using u (i.e., having u in its own memory).
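The “immediately inform every processor” idea is essentially a write-update scheme: the writer broadcasts the new value to every current sharer at once. A toy sketch (all class and variable names here are mine):

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.local = {}          # this processor's private copy of variables

    def read(self, memory, var):
        self.local[var] = memory[var]

class Bus:
    # Write-update sketch: every write is broadcast to all sharers
    # immediately, so no processor is left holding a stale copy.
    def __init__(self, memory, processors):
        self.memory = memory
        self.processors = processors

    def write(self, var, value):
        self.memory[var] = value
        for p in self.processors:
            if var in p.local:   # only current sharers need the update
                p.local[var] = value

memory = {"u": 5}
p1, p2, p3 = Processor("P1"), Processor("P2"), Processor("P3")
bus = Bus(memory, [p1, p2, p3])

p1.read(memory, "u")
p3.read(memory, "u")
bus.write("u", 7)                # P3 modifies u; P1's copy is updated too

print(p1.local["u"], memory["u"])  # 7 7
```

An alternative worth considering for the question below is to mark the other copies stale instead of sending the new value, so each processor re-reads u the next time it needs it.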

Can you think of some other way to handle this inconsistent situation?