Lecture Notes in PARALLEL PROCESSING

Prepared by Rza Bashirov

High-performance computers are increasingly in demand in the areas of structural analysis, weather forecasting, petroleum exploration, medical diagnosis, aerodynamics simulation, artificial intelligence, expert systems, genetic engineering, and signal and image processing, among many other scientific and engineering applications. Without supercomputers, many of these challenges to advance human civilization could not be met within a reasonable time period. Achieving high performance depends not only on using faster and more reliable hardware devices but also on major improvements in computer architecture and processing techniques.

FLYNN’S TAXONOMY

In general, digital computers may be classified into four categories according to the multiplicity of instruction and data streams. This classification scheme was introduced by Michael J. Flynn. The essential computing process is the execution of a sequence of instructions on a set of data. The term stream is used here to denote a sequence of items (instructions or data) as executed or operated upon by a single processor. Instructions and data are defined with respect to a reference machine. An instruction stream is a sequence of instructions as executed by the machine; a data stream is a sequence of data, including input, partial, or temporary results, called for by the instruction stream. Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:

• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data stream (SIMD)
• Multiple instruction stream, single data stream (MISD)
• Multiple instruction stream, multiple data stream (MIMD)

SISD computer organization

This organization represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages.

SIMD computer organization

In this organization, multiple processing elements (PEs) are supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams.

MISD computer organization

There are n processor units, each receiving distinct instructions but operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the macropipe.

MIMD computer organization

Most multiprocessor systems and multiple-computer systems can be classified in this category. An MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.

The last three classes (SIMD, MISD, and MIMD) are the classes of parallel computers.
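The SISD/SIMD distinction can be sketched in a few lines of hypothetical Python (not part of the original notes): an SISD machine applies each instruction to a single data stream, while an SIMD machine broadcasts each instruction to several processing elements that apply it to distinct data streams in lockstep.

```python
# Toy illustration of Flynn's SISD vs. SIMD execution (illustrative only).

def sisd_execute(instructions, data):
    """One processor: each instruction operates on a single data stream."""
    for op in instructions:
        data = op(data)          # one instruction, one datum at a time
    return data

def simd_execute(instructions, data_sets):
    """One control unit broadcasts each instruction to all PEs;
    every PE applies it to its own data set in lockstep."""
    for op in instructions:
        data_sets = [op(d) for d in data_sets]  # same op, distinct data streams
    return data_sets

program = [lambda x: x + 1, lambda x: x * 2]

print(sisd_execute(program, 3))          # (3 + 1) * 2 = 8
print(simd_execute(program, [1, 2, 3]))  # [4, 6, 8]
```

The point of the sketch is that in the SIMD case the control flow (the loop over instructions) exists once, while the data operations are replicated across the PEs.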


PIPELINING: AN OVERLAPPED PARALLELISM

Pipelining offers an economical way to realise temporal parallelism in digital computers. The concept of pipeline processing in a computer is similar to assembly lines in an industrial plant. To achieve pipelining, one must subdivide the input task (process) into a sequence of subtasks, each of which can be executed by a specialised hardware stage that operates concurrently with other stages in the pipeline. Successive tasks are streamed into the pipe and get executed in an overlapped fashion at the subtask level. The subdivision of labour in assembly lines has contributed to the success of mass production in modern industry. By the same token, pipeline processing has led to the improvement of system throughput in the modern digital computer.

2.1 Principles of linear pipelining

Assembly lines have been used in automated industrial plants in order to increase productivity. Their original form is a flow line (pipeline) of assembly stations where items are assembled continuously from separate parts along a moving conveyor belt. Ideally, all the assembly stations should have equal processing speed; otherwise, the slowest station becomes the bottleneck of the entire pipe. This bottleneck problem, plus the congestion caused by improper buffering, may result in many idle stations waiting for new parts. The subdivision of the input task into a proper sequence of subtasks is therefore a crucial factor in determining the performance of the pipeline.

In a uniform-delay pipeline, all tasks have equal processing time in all station facilities, and the stations in such an ideal assembly line can operate synchronously with full resource utilisation. In reality, however, successive stations have unequal delays. The optimal partition of the assembly line depends on a number of factors, including the quality (efficiency and capability) of the working units, the desired processing speed, and the cost effectiveness of the entire assembly line.

The precedence relation over a set of subtasks {T1, ..., Tk} of a task T implies that a subtask Tj cannot start until some earlier subtask Ti (i < j) finishes. A linear pipeline can process a succession of subtasks with a linear precedence graph. A linear pipeline consists of a cascade of processing stages separated by high-speed interface latches. The latches are fast registers that hold the intermediate results between stages. Information flow between adjacent stages is under the control of a common clock applied to all the latches simultaneously.

Clock period

The logic circuitry in each stage Si has a time delay denoted by τi. Let τl be the time delay of each interface latch. The clock period of a linear pipeline is defined by

τ = max{τi : 1 ≤ i ≤ k} + τl = τm + τl.

The reciprocal of the clock period is called the frequency, f = 1/τ.

Ideally, a linear pipeline with k stages can process n tasks in Tk = k + (n − 1) clock periods, where k cycles are used to fill up the pipeline, that is, to complete execution of the first task, and n − 1 further cycles complete the remaining n − 1 tasks. The same number of tasks (operand pairs) can be executed by a nonpipeline processor with an equivalent function in T1 = n ⋅ k time.

Speedup

We define the speedup of a k-stage linear pipeline processor over an equivalent nonpipeline processor as

Sk = T1/Tk = n ⋅ k / (k + (n − 1)).
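The relation Tk = k + (n − 1) can be checked with a small cycle-by-cycle simulation. The following is a hypothetical Python sketch, not part of the original notes, assuming unit-time stages and no stalls:

```python
def pipeline_cycles(k, n):
    """Count the clock periods needed to push n tasks through a
    k-stage linear pipeline, one stage advance per cycle."""
    stages = [None] * k          # stages[i] = task occupying stage i+1 this cycle
    completed = 0
    cycles = 0
    next_task = 1
    while completed < n:
        cycles += 1
        # tasks move one stage forward; last cycle's stage-k occupant has left
        stages = stages[:-1]
        # a new task (if any remain) enters stage 1 this cycle
        head = next_task if next_task <= n else None
        if head is not None:
            next_task += 1
        stages = [head] + stages
        # the task occupying stage k completes at the end of this cycle
        if stages[-1] is not None:
            completed += 1
    return cycles

print(pipeline_cycles(4, 99))  # 102, matching k + (n - 1)
```

Task i enters stage 1 at cycle i and reaches stage k at cycle i + k − 1, so the last of the n tasks completes at cycle k + (n − 1), exactly as the formula predicts.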

It should be noted that the maximum speedup is Sk → k for n >> k. In other words, the maximum speedup that a linear pipeline can provide is k, where k is the number of stages in the pipe. The maximum speedup is never fully achievable because of data dependencies between instructions, interrupts, and other factors.

Efficiency

The efficiency of a linear pipeline is measured by the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let n, k, τ be the number of tasks (instructions), the number of pipeline stages, and the clock period of a linear pipeline, respectively. The pipeline efficiency is defined by

η = n ⋅ k ⋅ τ / (k ⋅ [kτ + (n − 1)τ]) = n / (k + (n − 1)).

Note that η → 1 as n → ∞. This implies that the larger the number of tasks flowing through the pipeline, the better is its efficiency. Moreover, since η = Sk/k, the efficiency of a linear pipeline can also be viewed as the ratio of its actual speedup to the ideal speedup k. In the steady state of a pipeline we have n >> k, so the efficiency η should approach 1. However, this ideal case may not hold all the time because of program branches and interrupts, data dependency, and other reasons.

Throughput

The number of results (tasks) that can be completed by a pipeline per unit time is called its throughput. This rate reflects the computing power of a pipeline. In terms of the efficiency η and clock period τ of a linear pipeline, we define the throughput as

w = n / (kτ + (n − 1)τ) = η/τ,

where n equals the total number of tasks processed during the observation period kτ + (n − 1)τ. In the ideal case, η → 1 and w = 1/τ = f. This means that the maximum throughput of a linear pipeline equals its frequency, which corresponds to one output result per clock period.

According to the level of processing, pipeline processors can be classified into the classes arithmetic, instruction, and processor pipelines, and further as unifunctional vs. multifunctional, static vs. dynamic, and scalar vs. vector pipelines.

Arithmetic pipelining

The arithmetic logic units of a computer can be segmentized for pipeline operations in various data formats. Well-known arithmetic pipeline examples are the four-stage pipes used in the STAR-100, the eight-stage pipelines used in the TI-ASC, the up to 14 pipeline stages used in the CRAY-1, and the up to 26 stages per pipe in the CYBER-205.

Instruction pipelining

The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead. Almost all high-performance computers are now equipped with instruction-execution pipelines.
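The speedup, efficiency, and throughput formulas above can be collected into small helper functions. The following is a hypothetical Python sketch, not part of the original notes; the example values of k and n are arbitrary:

```python
def speedup(k, n):
    """S_k = n*k / (k + (n-1)): k-stage pipeline vs. equivalent
    nonpipeline processor, for n tasks."""
    return n * k / (k + n - 1)

def efficiency(k, n):
    """eta = n / (k + (n-1)), which also equals S_k / k."""
    return n / (k + n - 1)

def throughput(k, n, tau):
    """w = eta / tau: results per unit time for clock period tau."""
    return efficiency(k, n) / tau

# An illustrative 8-stage pipeline processing 1000 tasks:
print(round(speedup(8, 1000), 2))     # 7.94, approaching the ideal k = 8
print(round(efficiency(8, 1000), 3))  # 0.993, approaching 1 as n grows
```

Note how the two quantities are linked: efficiency(k, n) is exactly speedup(k, n) / k, mirroring η = Sk/k in the text.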


Processor pipelining

This refers to the pipeline processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor. The second processor then passes the refined results to the third, and so on.

Unifunctional vs. multifunctional pipelines

A pipeline unit with a fixed and dedicated function, such as a floating-point adder, is called unifunctional. A multifunctional pipe may perform different functions, either at different times or at different subsets of stages in the pipeline.

Static vs. dynamic pipelines

A static pipeline may assume only one functional configuration at a time. Static pipelines can be either unifunctional or multifunctional. Pipelining is possible in static pipes only if instructions of the same type are to be executed continuously. The function performed by a static pipeline should not change frequently; otherwise, its performance may be very low. A dynamic pipeline processor permits several functional configurations to exist simultaneously. In this sense, a dynamic pipeline must be multifunctional; conversely, a unifunctional pipe must be static. The dynamic configuration needs much more elaborate control and sequencing mechanisms than those for static pipelines. Most existing computers are equipped with static pipes, either unifunctional or multifunctional.

Scalar vs. vector pipelines

Depending on instruction or data types, pipeline processors can also be classified as scalar pipelines and vector pipelines. A scalar pipeline processes a sequence of scalar operands under the control of a DO loop. Instructions in a small DO loop are often prefetched into the instruction buffer, and the scalar operands required by repeated scalar instructions are moved into a data cache in order to continuously supply the pipeline with operands. Vector pipelines are specially designed to handle vector instructions over vector operands. Computers having vector instructions are often called vector processors. The design of a vector pipeline is expanded from that of a scalar pipeline.

[Figure: a sample three-stage pipeline with feed-forward and feedback connections among stages S1, S2, S3]
2.2 General pipelines and reservation tables

What we have studied so far are linear pipelines without feedback connections; the inputs and outputs of such pipelines are totally independent. In some computations, such as linear recurrence, the outputs of the pipeline are fed back as future inputs; in other words, the inputs may depend on previous outputs. Pipelines with feedback may have a nonlinear flow of data. The utilization history of the pipeline determines its present state, and the timing of the feedback inputs becomes crucial to the nonlinear data flow. Improper use of feed-forward or feedback inputs may destroy the inherent advantages of pipelining; on the other hand, proper sequencing with nonlinear data flow may enhance pipeline efficiency. In practice, many arithmetic pipeline processors allow nonlinear connections as a mechanism to implement recursion and multiple functions.

Consider a simple pipeline that has both feed-forward and feedback connections, as shown in figure. Assume that this pipeline is dual-functional, with the functions denoted A and B. We number the pipeline stages S1, S2, S3 from the input end to the output end. The one-way connections between adjacent stages form the original linear cascade of the pipeline. A feed-forward connection connects a stage Si to a stage Sj such that j ≥ i + 2, and a feedback connection connects a stage Si to a stage Sj such that j ≤ i. In this sense, a "pure" linear pipeline is a pipeline without any feedback or feed-forward connections. The circles in the figure refer to data multiplexers.

The two reservation tables shown below correspond to the two functions of the sample pipeline. The rows correspond to pipeline stages and the columns to clock time units. The total number of clock units in the table is called the evaluation time for the given function. A reservation table represents the flow of data through the pipeline for one complete evaluation of a given function.

Function A:

      t0  t1  t2  t3  t4  t5  t6  t7
 S1   A           A           A
 S2       A                       A
 S3           A       A   A

Function B:

      t0  t1  t2  t3  t4  t5  t6
 S1   B               B
 S2           B           B
 S3       B       B           B
A marked entry in the (i, j)th square of the table indicates that stage Si will be used j time units after the initiation of the function evaluation. For a unifunctional pipeline, one can simply use an "x" to mark the table entries. For a multifunctional pipeline, different marks are used for different functions, such as the A's and B's in the two reservation tables for the sample pipeline. Different functions may have different evaluation times, such as 8 and 7 for functions A and B, respectively. The 8 steps required to evaluate A are S1, S2, S3, S1, S3, S3, S1, S2. Similarly, the 7 steps needed to evaluate B are S1, S3, S2, S3, S1, S2, S3. The data-flow pattern in a static, unifunctional pipeline can be fully described by one reservation table, while a multifunctional pipeline may use different reservation tables for the different functions to be performed. On the other hand, a given reservation table does not uniquely correspond to one particular hardware pipeline: several hardware pipelines with different interconnection structures can use the same reservation table.
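A reservation table can be represented directly as data, which makes it easy to recover the evaluation time and the stage sequence of each function. The following is a hypothetical Python sketch (not part of the original notes), encoding the two tables above:

```python
# The two reservation tables from the text, one dict per function:
# stage -> set of clock units at which that stage is busy.
TABLE_A = {"S1": {0, 3, 6}, "S2": {1, 7}, "S3": {2, 4, 5}}
TABLE_B = {"S1": {0, 4}, "S2": {2, 5}, "S3": {1, 3, 6}}

def evaluation_time(table):
    """Total number of clock units in the table."""
    return max(t for times in table.values() for t in times) + 1

def stage_sequence(table):
    """Order in which stages are visited, one per clock unit
    (valid here because each clock unit uses exactly one stage)."""
    slot = {t: s for s, times in table.items() for t in times}
    return [slot[t] for t in range(evaluation_time(table))]

print(evaluation_time(TABLE_A), stage_sequence(TABLE_A))
# 8 ['S1', 'S2', 'S3', 'S1', 'S3', 'S3', 'S1', 'S2']
print(evaluation_time(TABLE_B), stage_sequence(TABLE_B))
# 7 ['S1', 'S3', 'S2', 'S3', 'S1', 'S2', 'S3']
```

The printed sequences match the 8-step and 7-step evaluations of A and B given in the text.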

2.3 Multifunction and Array Pipelines

[Figure: the eight-stage TI-ASC arithmetic pipeline: Receive, Multiply, Accumulate, Exponent Subtract, Align, Add, Normalize, Output]

Texas Instruments’ Advanced Scientific Computer (TI-ASC) was the first vector processor installed with multifunction pipelines in its arithmetic processors. The ASC arithmetic pipeline consists of eight stages, as illustrated in figure. All the interconnection routes among the eight stages are shown. This pipeline can perform either fixed-point or floating-point arithmetic functions and many logical-shifting operations over scalar and vector operands of length 16, 32, or 64 bits.

[Figure: fixed-point add path: Receive, Add, Output]

Different arithmetic-logic instructions are allowed to use different connecting paths through the pipeline. Figure shows four interconnection patterns of the ASC pipeline for the evaluation of the functions fixed-point add, fixed-point multiply, floating-point add, and floating-point multiply. It is not difficult to see that the receiver and output stages are used by all instructions. The multiply stage performs multiplication; it produces two results, called pseudo sum and pseudo carry, which are sent to the accumulator stage or the add stage to produce the desired product. The exponent subtract stage determines the exponent difference and sends this shift count to the align stage to align fractions for floating-point add or subtract instructions. All right-shift operations are also implemented in this align stage. The normalize stage does the floating-point normalization, all left-shift operations, and conversions between fixed-point and floating-point operands.
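The four connecting paths just described can be tabulated as stage lists. The following is a hypothetical Python sketch, not part of the original notes; the stage names and orderings are read from the figures and are assumptions rather than ASC documentation:

```python
# Stage paths through the TI-ASC arithmetic pipeline for four instruction
# types, as read from the interconnection figures (orderings assumed).
ASC_PATHS = {
    "fixed-point add":        ["Receive", "Add", "Output"],
    "fixed-point multiply":   ["Receive", "Multiply", "Add", "Output"],
    "floating-point add":     ["Receive", "Exponent Subtract", "Align",
                               "Add", "Normalize", "Output"],
    "floating-point multiply": ["Receive", "Multiply", "Accumulate",
                                "Exponent Subtract", "Align", "Add",
                                "Normalize", "Output"],
}

for instr, path in ASC_PATHS.items():
    # the receiver and output stages are shared by all instructions
    assert path[0] == "Receive" and path[-1] == "Output"
    print(f"{instr}: {' -> '.join(path)}")
```

Encoding the paths as data makes the shared structure explicit: every instruction enters through Receive and leaves through Output, while the interior stages vary per instruction type.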

[Figure: floating-point add path: Receive, Exponent Subtract, Align, Add, Normalize, Output]

[Figure: fixed-point multiply path: Receive, Multiply, Add, Output]

[Figure: floating-point multiply path: Receive, Multiply, Accumulate, Exponent Subtract, Align, Add, Normalize, Output]

Array pipelines are two-dimensional pipelines with multiple data-flow streams for high-level arithmetic computations, such as matrix multiplication, inversion, and L-U decomposition. The pipeline is usually constructed with a cellular array of arithmetic units. The cellular array is usually regularly structured and suitable for microprocessor implementation. The basic building blocks in the array are the M cells. Each M cell performs an additive inner-product operation, as illustrated in figure: the cell has three input operands a, b, and c and three outputs a' = a, b' = b, d = a × b + c. Fast latches are used at all input/output terminals and along all interconnecting paths in the array pipeline. All the latches are synchronously controlled by the same clock. The array shown in figure performs the multiplication of two 3 × 3 dense matrices, A ⋅ B = C.

[Figure: a cellular array of M cells multiplying two 3 × 3 matrices; the rows of A and the columns of B (padded with zeros) are skewed in time as they stream into the array at t1, t2, t3, and the products cij emerge at t6 through t8]

a11  A ⋅ B = a 21 a 31

a12 a 22 a 32

a13  b11 b12   a 23  ⋅ b21 b22 a 33  b31 b32

b13   b23  = b33 

c11  c21 c31

c13   c23  = C c33  .

c12 c22 c32

In general, multiplying two (n × n) matrices requires 3n² − 4n + 2 cells and takes 3n − 1 clock periods to complete.
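The M-cell operation and the cell and time counts can be sketched as follows. This is a hypothetical Python sketch, not part of the original notes; it computes the same inner products sequentially and ignores the systolic timing of the real array:

```python
# Sketch of the M cell and its use for matrix multiplication.

def m_cell(a, b, c):
    """Additive inner-product cell: passes a and b through, emits a*b + c."""
    return a, b, a * b + c

def array_multiply(A, B):
    """Multiply n x n matrices by streaming partial sums through M cells."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c = 0
            for k in range(n):
                # each step is one M-cell evaluation: d = a*b + c
                _, _, c = m_cell(A[i][k], B[k][j], c)
            C[i][j] = c
    return C

n = 3
print(array_multiply([[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                     [[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# identity * B returns B unchanged
print(3 * n * n - 4 * n + 2, "cells,", 3 * n - 1, "clock periods for n =", n)
# 17 cells, 8 clock periods for n = 3
```

In the real array pipeline these n inner-product steps per entry run concurrently across the cells; the sequential loop above only verifies that cascaded d = a × b + c evaluations produce the matrix product.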

2.4 Problems

1. Consider a four-segment floating-point adder with a 10 ns clock period. Find the minimum number of clock periods required to perform 99 floating-point additions using the pipeline adder.

[Figure: a four-segment pipeline adder computing Z = X + Y through stages S1 to S4]

Solution. The problem of computing A1 + ... + A100 is equivalent to processing a stream of 99 problems of type A + B. Substituting n = 99 and k = 4 into Tk = k + (n − 1), we obtain T4 = 4 + (99 − 1) = 102. So 1020 ns is enough to evaluate A1 + ... + A100 in the four-segment pipeline.

2. Find the frequency, efficiency, throughput, and speedup of the four-segment linear pipeline of Problem 1 over an equivalent nonpipeline processor.

Solution. Substituting n = 99 and k = 4 into the respective formulae, we obtain

f = 1/τ = 0.1 ns⁻¹,

Sk = T1/Tk = n ⋅ k / (k + (n − 1)) = 99 ⋅ 4 / (4 + 98) ≈ 3.88,

η = n ⋅ k ⋅ τ / (k ⋅ [kτ + (n − 1)τ]) = n / (k + (n − 1)) ≈ 0.97,

w = η/τ = n / (kτ + (n − 1)τ) ≈ 0.097 ns⁻¹.
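These worked answers can be reproduced numerically. The following is a hypothetical Python sketch, not part of the original notes:

```python
# Recomputing the answers to Problems 1 and 2:
# k = 4 stages, n = 99 tasks, clock period tau = 10 ns.
k, n, tau = 4, 99, 10.0

T_k = k + (n - 1)                 # clock periods to finish all tasks
time_ns = T_k * tau               # total time in nanoseconds
f = 1 / tau                       # frequency, results per ns at best
S_k = (n * k) / (k + n - 1)       # speedup over nonpipeline processor
eta = n / (k + n - 1)             # efficiency
w = eta / tau                     # throughput, results per ns

print(T_k, time_ns)               # 102 1020.0
print(round(f, 2), round(S_k, 2), round(eta, 2), round(w, 3))
# 0.1 3.88 0.97 0.097
```

Note that w stays below f (0.097 < 0.1 per ns) because η < 1 for any finite task stream; w approaches f only as n grows without bound.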

3. Draw the four-segment pipeline realizing the two functions A and B described in the following reservation tables.


Function A:

      t0  t1  t2  t3  t4  t5
 S1   A           A
 S2       A           A
 S3           A           A

Function B:

      t0  t1  t2  t3  t4  t5
 S1   B
 S2           B           B
 S3       B       B
 S4               B

Solution.

[Figure: a four-segment pipeline S1 → S2 → S3 → S4 with additional input connections into the later stages, realizing functions A and B]
4. Describe the following terminologies associated with pipeline computers:

(a) Static pipeline
(b) Dynamic pipeline
(c) Unifunctional pipeline
(d) Multifunctional pipeline
(e) Instruction pipeline
(f) Vector pipeline

Solution.

(a) A static pipeline may assume only one functional configuration at a time. Static pipelines can be either unifunctional or multifunctional. Pipelining is possible in static pipes only if instructions of the same type are to be executed continuously. The function performed by a static pipeline should not change frequently; otherwise, its performance may be very low.

(b) A dynamic pipeline processor permits several functional configurations to exist simultaneously. In this sense, a dynamic pipeline must be multifunctional; conversely, a unifunctional pipe must be static. The dynamic configuration needs much more elaborate control and sequencing mechanisms than those for static pipelines.

(c) A pipeline unit with a fixed and dedicated function, such as a floating-point adder, is called unifunctional.

(d) A multifunctional pipe may perform different functions, either at different times or at different subsets of stages in the pipeline.

(e) The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead. Almost all high-performance computers are now equipped with instruction-execution pipelines.

(f) Vector pipelines are specially designed to handle vector instructions over vector operands. Computers having vector instructions are often called vector processors. The design of a vector pipeline is expanded from that of a scalar pipeline.
