
Sunday, July 24, 2011

Speed-up expressions for multicore processors

Problem:

In class we learnt about Amdahl's Law (see Presentation #4, pp. 11–13). In this problem, we try to derive speedup expressions for three variations of multicore processors:

1. A symmetric multicore processor composed of n base core equivalents (BCEs); each BCE is the smallest realization of a processor core, as illustrated in Figure 1(a);

2. A symmetric multicore processor composed of n/r cores, each of which is made by combining BCEs, as illustrated in Figure 1(b);

3. An asymmetric multicore processor composed of one r-BCE core and n − r BCEs, as illustrated in Figure 1(c).


Now, we try to derive speedup expressions for a certain program. We use the following notation.

  • f: the fraction of the program that must be executed sequentially
  • perf(r): normalized performance of an r-BCE core (note: perf(1) = 1), defined as:

perf(r) = (absolute performance of an r-BCE core) / (absolute performance of a single BCE)

where absolute performance can be the processing speed (e.g., Millions of Instructions Per Second, MIPS), or some other performance measure.

Speedup S is defined as:

S = (execution time of the whole program using a single BCE) / (execution time of the whole program using the multicore chip)

(a) Derive S for a symmetric multicore processor chip with n BCEs in terms of f and n.

(b) Derive S for a symmetric multicore processor chip with (n/r) r-BCE cores in terms of f, n, r, and perf(r).

(c) Derive S for an asymmetric multicore processor chip with one r-BCE core and n − r BCEs in terms of f, n, r, and perf(r).

Follow-up:

(a) Let T denote the time required to execute the whole program using 1 BCE. Now, the time required to execute the whole program with a symmetric multicore processor chip of n BCEs is given by:

f × T + (1 − f) × T / n

The first term is the time for the sequential part (run on a single BCE), and the second term is the time for the parallel part (shared evenly among all n BCEs). Consequently, we have:

S = T / [ f × T + (1 − f) × T / n ] = 1 / [ f + (1 − f) / n ]

(b) Let T denote the time required to execute the whole program using 1 BCE. Now, the time required to execute the whole program with a symmetric multicore processor chip with (n/r) r-BCE cores is given by:

f × T / perf(r) + (1 − f) × T / [ perf(r) × (n/r) ]

Here the sequential part runs on one r-BCE core (at performance perf(r)), while the parallel part is shared by all n/r such cores. Consequently, we have:

S = 1 / [ f / perf(r) + (1 − f) × r / ( perf(r) × n ) ]


(c) Let T denote the time required to execute the whole program using 1 BCE. Now, the time required to execute the whole program with an asymmetric multicore processor chip with one r-BCE core and n − r BCEs is given by:

f × T / perf(r) + (1 − f) × T / [ perf(r) + (n − r) ]

Here the sequential part runs on the r-BCE core alone, while in the parallel part the r-BCE core (performance perf(r)) and the n − r BCEs (performance 1 each) work together. Consequently, we have:

S = 1 / [ f / perf(r) + (1 − f) / ( perf(r) + n − r ) ]
Sunday, July 17, 2011

Sequential mode or parallel mode?

Problem:

A uniprocessor computer can operate in either sequential mode or parallel mode. In parallel mode, computations can be performed nine times faster than in sequential mode. A certain benchmark program took time T to run on this computer. Furthermore, suppose that 25% of T was spent in parallel mode whereas the remaining portion was in sequential mode.

(a) What is the effective speedup of the above execution as compared with the condition when parallel mode is not used at all?
 
(b) What is the fraction α of parallelized code of the benchmark program?

(c) Suppose we double the speed ratio between parallel mode and the sequential mode by hardware improvements. What is the new effective speedup?

(d) Suppose the speedup you calculated in (c) is to be obtained by software improvement alone in terms of parallelization fraction, instead of by any hardware improvement. What is the new parallelization fraction required?

Follow-up:

(a) If parallel mode is not used at all, the execution time will be 0.75T + 9 × 0.25T = 3T .
Therefore, the speedup of the program is 3T ⁄ T = 3 .

(b) The parallel portion took 0.25T in parallel mode, so it would have taken 9 × 0.25T = 2.25T in sequential mode. The fraction of parallelized code is therefore given by:

α = (9 × 0.25T) / 3T = 0.75

(c) With the speed ratio doubled to 18, the new execution time becomes:

0.75T + (9 × 0.25T) / 18 = 0.75T + 0.125T = 0.875T

And the new speedup is:

3T / 0.875T = 24/7 ≈ 3.43

(d) Parallelization in general is done by the compiler (an important piece of software that translates your program into machine language). Since the new speedup is to be obtained by software instead of hardware, parallelized code still runs only 9 times faster than in sequential mode. Let β be the new parallelization fraction. Then, we have:

3T / [ (1 − β) × 3T + (β × 3T) / 9 ] = 24/7

Thus, we have:

(1 − β) + β/9 = 7/24, which gives β = 153/192 ≈ 0.80

That is, the compiler has to parallelize 5% more of the benchmark program.
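
As a quick sanity check on the arithmetic in parts (a) to (d), here is a small Python sketch (not part of the original solution); the variable names are purely illustrative.

    # Quick numerical check of parts (a)-(d); T is taken as 1.0 for convenience.
    T = 1.0
    ratio = 9                     # parallel mode is 9x faster than sequential mode
    seq_part = 0.75 * T           # time spent in sequential mode
    par_part = 0.25 * T           # time spent in parallel mode

    all_sequential = seq_part + ratio * par_part           # (a) = 3T
    print("speedup (a):", all_sequential / T)               # 3.0

    alpha = (ratio * par_part) / all_sequential             # (b)
    print("alpha   (b):", alpha)                            # 0.75

    new_time = seq_part + (ratio * par_part) / (2 * ratio)  # (c) ratio doubled to 18
    print("speedup (c):", all_sequential / new_time)        # ~3.43

    target = all_sequential / new_time                      # (d) same speedup via software
    beta = (1 - 1 / target) / (1 - 1 / ratio)               # solve 1/((1-b) + b/9) = target
    print("beta    (d):", beta)                             # ~0.80
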

Parallel processing: Amdahl's Law, speed-up and overhead

Comment:

The “speed-up” formula (i.e., Amdahl’s Law) still a bit confusing?

Follow-up:

First of all, it is important to understand that this formula assumes that a program (or a task, or a job) can be divided into two parts: a sequential part (which must be carried out by one single processor) and a parallel part. Most importantly, a key assumption is that the parallel part can be perfectly divided into totally independent pieces (i.e., requiring no communication) for different processors to work on.

Consequently, given n processors, we can perfectly speed up the computation of the parallel part by n times. Using the terminology in the presentation slide, the Speedup of enhancement is n. Thus, if the original total execution time of the program using one processor is T and the fraction of the sequential part is x, then the new total execution time is:

x × T + (1 − x) × T / n

The first term in the above expression is the time required to execute the sequential part. The second term is the time required to execute the parallel part, using n processors simultaneously. Finally, the speed-up, which is defined as the ratio of the original time to the new time, is given by:

S = T / [ x × T + (1 − x) × T / n ] = 1 / [ x + (1 − x) / n ]

We can see that perfect speed-up, i.e., n, is achievable only when x = 0, which means that the whole program is perfectly parallelizable into totally independent pieces. This is a situation not likely to happen in a typical computer program (or any daily-life task, for that matter).
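
To see this effect numerically, here is a tiny Python sketch (the value x = 0.05 is just an example) showing how the speed-up levels off as the number of processors grows:

    def speedup(x, n):
        # Amdahl's Law with sequential fraction x and n processors
        return 1.0 / (x + (1.0 - x) / n)

    x = 0.05    # example: 5% of the program must run sequentially
    for n in (1, 2, 4, 16, 64, 1024, 10**6):
        print(f"n = {n:>7}: speed-up = {speedup(x, n):.2f}")
    # However large n becomes, the speed-up can never exceed 1/x = 20 in this example.
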

Comment:

Does the parallel processing (speed-up) formula really work?

As the processor count tends to infinity, the communication time should also grow on a large scale. Should we consider the running time to be larger? Or does the formula simply neglect the communication time between processors?

Follow-up:

This is another excellent observation. As mentioned in the Follow-Up to the comment above, in deriving the formula we assume that the parallel part can be divided up perfectly, without any communication. In a real-life parallel program, of course, there will be communication. Thus, in deriving a speed-up formula for such a program, we might need an additional term in the total execution time:

x × T + (1 − x) × T / n + c × n × n

In the above expression, the third term denotes the communication overhead time introduced by parallelization, which grows as n × n (i.e., quadratically in n) with a multiplicative constant c. This is not an "artificial" example but really represents some real-life communication overhead that you can find in many practical parallel programs.

What new conclusions can you draw from this new expression?
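
One way to explore that question is to plug some numbers into the new expression. The following Python sketch does exactly that; the overhead constant c = 1e-5 and the sequential fraction x = 0.05 are purely illustrative choices.

    def speedup_with_overhead(x, n, c, T=1.0):
        # total time = x*T + (1 - x)*T/n + c*n*n, as in the expression above
        new_time = x * T + (1.0 - x) * T / n + c * n * n
        return T / new_time

    x, c = 0.05, 1e-5    # example sequential fraction and overhead constant
    for n in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
        print(f"n = {n:>3}: speed-up = {speedup_with_overhead(x, n, c):.2f}")
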

Comment:

Still quite confused about Slide 14?

Follow-up:

Slide 14 illustrates the need for extra communication overhead when we perform parallel processing. Specifically, in the Main Memory there is a variable u, which has a value of 5. This variable is needed for some parallel computations to be carried out by three different processors, as shown. Now, each processor has to read this variable from the Main Memory into its own memory in order to perform the computation. The problem here is that after P1 and P3 have read the variable u, P3 subsequently modifies its value (from 5 to 7), so that different copies of u hold different values (an inconsistent situation).

Obviously, inconsistency in shared data variables is unacceptable because it will lead to mistakes. Thus, to rectify the inconsistency, some extra communication has to be carried out. For instance, after P3 modifies u, it has to immediately inform every processor that is currently using u (i.e., having u in its own memory).
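
As a very rough illustration of this "inform every processor" idea, here is a toy Python sketch of a write-update style fix. It models the bookkeeping only and is nothing like how real cache-coherence hardware is implemented.

    # Toy model: each processor keeps a private copy of the shared variable u.
    main_memory = {"u": 5}
    private_copies = {"P1": {}, "P2": {}, "P3": {}}

    def read(proc, var):
        # A processor copies the variable from Main Memory into its own memory.
        private_copies[proc][var] = main_memory[var]

    def write_and_update(proc, var, value):
        # After modifying the variable, immediately inform every processor
        # that currently holds a copy, so no stale values remain.
        main_memory[var] = value
        private_copies[proc][var] = value
        for other, copies in private_copies.items():
            if other != proc and var in copies:
                copies[var] = value

    read("P1", "u")
    read("P3", "u")                  # P1 and P3 both read u = 5
    write_and_update("P3", "u", 7)   # P3 changes u to 7 and updates all copies
    print(main_memory, private_copies)
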

Can you think of some other way to handle this inconsistent situation?