Journal of VLSI Signal Processing, 10, 295-310 (1995). © 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Resource Constrained Scheduling of Uniform Algorithms

LOTHAR THIELE
Swiss Federal Institute of Technology Zurich (ETHZ), Computer Engineering and Communication Networks Lab. (TIK), Gloriastrasse 35, CH-8092 Zurich, Switzerland
Abstract. A method for optimizing the schedule and allocation of uniform algorithms onto processor arrays is derived. The main results described in the following paper are: (1) single (integer) linear programs are given for the optimal schedule of regular algorithms with and without resource constraints, (2) the class of algorithms is extended by allowing certain non-convex index domains, (3) efficient branch and bound techniques are used such that problems of relevant size can be solved. Moreover, additional constraints such as cache memory, bus bandwidths and access conflicts can also be considered. The results are applied to an example of relevant size.

1 Introduction
This paper deals with techniques for mapping regular algorithms onto massively parallel architectures. First, some remarks on different mathematical approaches to the design of (piecewise) regular architectures are given. The potential of VLSI for massively parallel computations was first recognized by Kung and Leiserson [1]. The class of architectures they proposed is characterized by massively parallel computation, intensive multilevel pipelining, a local interconnection scheme, distributed memory, and distributed computing power. The term systolic array has been introduced for regular mesh connected synchronous processor arrays consisting of identical processing elements that perform a time invariant processor function, see e.g. [1]-[4]. About the same time, synthesis methods for VLSI processor arrays received much attention. In principle, behavioral descriptions are transformed into functions that distribute operations over time and space. The first pioneering result has been obtained by Kuhn [5], who used linear index transformations for relating algorithms and their systolic implementation. This result has been rediscovered later by Moldovan [6]. The extensions of Quinton [7], Miranker et al. [8], and Capello et al. [9] to this work resulted in the theory of regular iterative algorithms as proposed by Rao [10], [11]. He linked results obtained so far to the work of Karp, Miller, and Winograd [12] on uniform recurrence equations and he introduced considerable extensions to the synthesis methodology. These methods are used increasingly to parallelize loop programs for massively parallel architectures, see e.g. [13], [14].
Scheduling and allocation are among the most important problems to be solved in mapping algorithms onto either domain specific hardware or programmable massively parallel architectures, since the operations of a program are assigned to control steps and processing elements. In case of unlimited resources, scheduling tries to parallelize the operations in order to minimize the overall execution time of an algorithm. On the other hand, if only limited resources are available, scheduling procedures attempt to serialize some operations to meet resource constraints. If unlimited computing resources are available, efficient solutions to the scheduling problem are known. In the acyclic case, there is a close relation to the PERT/CPM scheduling problem as described in [15]. Either all activities are scheduled as early as possible (ASAP) or as late as possible (ALAP). Moreover, solutions to certain scheduling problems can be obtained by solving the longest path problem on the given signal flow graph. The weights associated to its edges correspond to execution times of operations. In [16], a unified approach to the scheduling of signal flow graphs without resource constraints is given. The approach is based on a circuit transformation called retiming: registers are added or removed without changing the functional behavior of the circuit. There are many interpretations of this transformation scheme, as the transformation of registers is equivalent to changing the schedule of operations. In particular, in [16] polynomial time algorithms are proposed for the solution of the following scheduling problems: clock period minimization and register minimization. Again, there is a close relation to the solution of the longest path problem
on a weight-transformed signal flow graph. Rao [10] converted the framework of [16] into a linear program setting and extended the results to the scheduling of regular processor arrays. The computational complexity of the corresponding optimization problems is independent of the size of the algorithms or arrays. In [17], the optimal scheduling of systolic arrays has been reduced to the solution of a single (integer) linear program. Many other results are available on minimizing the overall computation time, e.g. [18]. If finite resources are taken into account, scheduling and allocation determine the cost/speed trade-off of the architecture. Very often, the required throughput does not necessitate a complete parallelization of the algorithm. Moreover, there may be modules available which are specialized to certain classes of operations. Unfortunately, most of the resource constrained scheduling problems are known to be NP-complete, see e.g. [19], [20]. Numerous heuristic methods have been proposed to find a compromise between the run-time of the scheduling algorithm and the quality of the solution, see [21]. List scheduling [22] and force-directed scheduling [23] improve the results of simple longest-path procedures by using a global criterion for selecting the operation which is scheduled next. Moreover, it is possible to model the scheduling and allocation problem as an integer linear program (ILP), see e.g. [24]-[27]. In [24], ILPs for the following problems have been derived: time constrained scheduling, resource constrained scheduling, scheduling using pipelined arithmetic units (functional pipelining), and scheduling of loop programs or signal flow graphs (loop folding). The solution of the corresponding ILPs yields optimal solutions to the resource and time constrained scheduling problems. This approach has been generalized in [28] to the synthesis of heterogeneous multiprocessor systems including memory and bus design.
In [29], hierarchical methods to solve the resource constrained scheduling problem for uniform recurrence equations are proposed. Here, at first a conventional scheduling and allocation for programs with acyclic data dependencies is performed. Then the results are used in an affine-by-statement scheduling of the uniform algorithm. In this paper, the concepts described in [17], [24], and [30] are combined so as to yield the following results:
• Single (integer) linear programs are given for the optimal schedule of regular algorithms with and without resource constraints.
• The class of algorithms used in [17] is extended by allowing certain non-convex index domains, i.e. linearly bounded lattices [30], [31].
• Efficient branch and bound techniques are used such that problems of relevant size can be solved. Moreover, additional constraints such as cache memory, bus bandwidths and access conflicts can also be considered. To this end, the approach taken in [28] can be adopted.
2 Informal Introduction of the Problem
In this section, two instances of the problem solved in this paper are described. At first, the following algorithm

x1[i, j] = f1(x2[i - 1, j])
x2[i, j] = f2(x1[i, j], x3[i, j - 1])   (1)
x3[i, j] = f3(x1[i, j])
is considered, where the three equations must be executed for all index points (i, j) ∈ I within the index domain I = {(i, j): 0 ≤ i ≤ N ∧ 0 ≤ j ≤ M}. The computations can be visualized by means of a dependence graph shown in Fig. 1. In particular, for each index point (i, j) ∈ I there are three vertices corresponding to f1, f2 and f3, respectively. The edges represent the data dependencies of the algorithm.

Fig. 1. Dependence graph for N = M = 2. The labels associated to the nodes correspond to the two schedules a/b.

The reduced dependence graph of this algorithm as defined in Definition 3 is shown in Fig. 2. To each variable of the algorithm there is associated one node. The edges and their labels correspond to the data dependencies and their index displacements, respectively. For example, the edge from vertex 3 to vertex 2 represents the fact that variable x2 is computed using variable x3. The label (0 1)^T characterizes the index displacement, i.e. x2[i, j] directly depends on x3[i - 0, j - 1].

Fig. 2. Reduced dependence graph.

Now, a linear allocation of the operations to processing elements is carried out, i.e. all operations in the direction of a projection direction v are computed within one processing element of a linear processor array. If v = (1 0)^T is chosen, the processor array shown in Fig. 1 is obtained. Moreover, linear scheduling functions will be used: the calculation of a variable x_l[I] is finished at time instance t_l[I] = πI + γ_l ∈ Z.

Schedule a: Now let us suppose that each operation takes 1 time unit and that each processing element contains unlimited computing resources. Then a feasible schedule satisfies

π = (2 1),  γ1 = 0,  γ2 = 1,  γ3 = 1

as

t1[i, j] - t2[i - 1, j] ≥ 1
t2[i, j] - t1[i, j] ≥ 1
t2[i, j] - t3[i, j - 1] ≥ 1
t3[i, j] - t1[i, j] ≥ 1

The time potentials t_l[I] are associated to the nodes of the dependence graph in Fig. 1. The total evaluation time is T_a = t3[N, M] - t1[0, 0] = 2N + M + 1. Obviously, only two operations are executed simultaneously within each processing element. Therefore, only two modules are necessary. Their occupation with operations is shown in Fig. 3. The iteration interval, i.e. the number of time instances between two successive iterations, is λ_a = 2. Methods to determine a time-optimal linear schedule are well known, see e.g. [17].

Usually, there is only a limited number of available modules within each processing element. For example, let us suppose that there is only one module capable of executing functions f1, f2 and f3. A feasible schedule is

Schedule b: π = (3 0),  γ1 = 0,  γ2 = 2,  γ3 = 1

which yields T_b = 3N + 1, λ_b = 3. The schedule and bar chart are shown in Figs. 1 and 3, respectively. It is interesting to note that schedule b is faster than schedule a if M > N, despite the fact that schedule a uses one more module within the processing elements and that the iteration interval λ_b is larger than λ_a. It should be obvious that it is not sufficient to consider a time-optimal schedule for one iteration only, i.e. without taking into account the dependencies across iterations and between processing elements. The optimal solution to the resource constrained scheduling problem
Fig. 3. Bar charts corresponding to schedules a and b.

Fig. 4. Partially unfolded reduced dependence graph for partitioning.
requires the consideration of the complex relation between intra- and inter-processor communication.

Another example of finite resources is the partitioning problem, see e.g. [32] and the references therein. The following example is influenced by the results presented in [33]. Let us suppose that for the given algorithm only a limited number of processing elements is available, and each processing element has a limited number of modules only. Then one possibility would be to assign several rows of the dependence graph shown in Fig. 1 to one processing element. If two adjacent rows are assigned to one PE, the reduced dependence graph shown in Fig. 4 must be mapped onto a linear array with half the number of processing elements. Again, the problem to be solved consists in assigning operations 1, 2, 3, 1', 2', 3' to modules and time instances such that the overall computation time of the algorithm is minimized. In general, a partial unfolding of the reduced dependence graph must be carried out. The resulting reduced dependence graph represents all operations which are to be assigned to one processing element. This can be done for more general tilings of the dependence graph in a similar way.

It is possible to add additional limited resources such as
• available cache memory, see [28],
• limited communication bandwidth within a PE,
• access conflicts to memory and buses, and
• limited number and bandwidth of inter-processor links.

Some remarks concerning these obvious but important extensions are given in the last chapter.
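The paper gives no code for the partial unfolding step; the following Python sketch is our own illustration (names and the round-robin copy convention are assumptions, not the paper's construction). It unfolds the reduced dependence graph of the example by a factor of 2, creating copies c ∈ {0, 1} of each node (copy c holding row j with j mod 2 = c) and recomputing each edge's target copy and inter-cluster displacement:

```python
# Illustrative sketch (not the paper's algorithm): unfold the reduced
# dependence graph of Fig. 2 by f = 2, assuming two adjacent rows j, j+1
# are assigned to one PE and copy c of a node holds rows with j mod 2 = c.
f = 2
# original edges (k, l) -> displacement (di, dj):
edges = {(1, 2): (0, 0), (1, 3): (0, 0), (2, 1): (1, 0), (3, 2): (0, 1)}

unfolded = {}
for (k, l), (di, dj) in edges.items():
    for c in range(f):                 # copy of the target node l
        src = (c - dj) % f             # copy holding source x_k at row j - dj
        dj_cluster = -((c - dj) // f)  # displacement in cluster coordinates
        unfolded[((k, src), (l, c))] = (di, dj_cluster)
```

Applied to the example this yields eight edges over the nodes (1, 0), …, (3, 1), corresponding to the operations 1, 2, 3, 1', 2', 3' of Fig. 4; the edge from x3 in copy 1 to x2 in copy 0 acquires an inter-cluster displacement, while the copy-internal instance of the same dependence keeps displacement (0, 0).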
Fig. 5. Examples of linearly bounded lattices.
3 Regular Algorithms, Scheduling and Allocation

3.1 Regular Algorithms
The class of algorithms we are dealing with is related to uniform recurrence equations [12] or regular iterative algorithms [10]. But, similar to the class of piecewise regular algorithms, see [30], [31], the index domain can be a linearly bounded lattice. Therefore, more general input specifications for the scheduling procedure are allowed. In particular, this generalization is necessary if the regular algorithm to be scheduled results from previous design transformations such as partitioning. Moreover, there are input specifications which naturally lead to non-convex index domains, see the final example of this paper.

DEFINITION 1 (Regular Algorithm). A regular algorithm consists of a set of |V| quantified indexed equations S_l[I]

S_1[I] ... S_|V|[I]   ∀I ∈ I

where I ∈ Z^r is the index vector and the linearly bounded lattice I ⊂ Z^r is the index domain. Each equation S_l[I] is of the form

x_l[I] = F_l(..., x_k[I - d_kl], ...)

where x_l[I] are indexed variables, F_l are arbitrary functions and d_kl ∈ Z^r are constant dependence vectors.

The example used in the previous chapter is a regular algorithm. Finally, the class of index domains, i.e. linearly bounded lattices, see [30], [31], needs to be defined.

DEFINITION 2 (Linearly Bounded Lattice). A linearly bounded lattice is an index space of the form

I = {I: I = Lκ + p ∧ Aκ ≥ b ∧ κ ∈ Z^l}

where L ∈ Z^{r×l}, p ∈ Z^r, A ∈ Z^{m×l} and b ∈ Z^m.
Obviously, {κ: Aκ ≥ b ∧ κ ∈ Z^l} defines the set of all integer vectors within a polytope. The polytope is characterized by a set of linear inequalities. This set is mapped on I using an affine function, i.e. I = Lκ + p. Throughout this paper, we require that the matrix L is square and invertible. The index domain of the final example is a linearly bounded lattice of the form given above, see Fig. 10. The following example serves to clarify the above notations.

EXAMPLE 1. Let us suppose that an index set I1 is defined by L = E, where E denotes the unity matrix, and p = 0. Then we have I1 = {I: I = κ ∧ Aκ ≥ b ∧ κ ∈ Z^l}, or equivalently, I1 = {I: AI ≥ b ∧ I ∈ Z^l}. For I = (i j)^T and a suitable choice of A ∈ Z^{4×2} and b ∈ Z^4 (i.e. m = 4 inequalities), the index set shown on the left hand side of Fig. 5 is obtained. Moreover, Fig. 5 also shows the linearly bounded lattice I2 = {i: i = κ1 + 3κ2 ∧ κ2 ≤ κ1 ≤ 4 ∧ κ2 ≥ 1}. It is obvious that I2 is not necessarily a lattice or a convex set.

In order to describe the new scheduling procedures it is useful to define the reduced dependence graph, see e.g. [10], [12].

DEFINITION 3 (Reduced Dependence Graph). The reduced dependence graph DG = (V, E, D) associated to a regular algorithm as defined above is defined as follows: To each variable x_l[I] there is associated a node v_l ∈ V. There is an edge (v_k, v_l) ∈ E with distance vector d_kl if the variable x_l directly depends on x_k via the dependence vector d_kl. The reduced dependence graph can also be represented algebraically. To this end we will use the node-arc incidence matrix C ∈ {0, 1, -1}^{|V|×|E|} and
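As a quick numerical illustration (ours, not the paper's; plain Python set comprehension), the linearly bounded lattice I2 of Example 1 can be enumerated directly from its definition. The bounds κ2 ≤ κ1 ≤ 4 and κ2 ≥ 1 make the parameter range finite:

```python
# Enumerate I2 = {i : i = k1 + 3*k2, k2 <= k1 <= 4, k2 >= 1} from Example 1.
I2 = sorted({k1 + 3 * k2
             for k2 in range(1, 5)        # 1 <= k2 <= 4 (since k2 <= k1 <= 4)
             for k1 in range(k2, 5)})     # k2 <= k1 <= 4
# The gaps at 11, 14 and 15 show that I2 is neither a lattice nor convex.
```

The result is {4, 5, 6, 7, 8, 9, 10, 12, 13, 16}, matching the non-convex point set sketched on the right-hand side of Fig. 5.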
the distance matrix D ∈ Z^{r×|E|}, whose column vector associated to an edge (k, l) is d_kl. The above described notations will be explained by the example introduced in the previous section.

EXAMPLE 2. The example is concerned with the algorithm

x1[i, j] = f1(x2[i - 1, j])   ∀(i, j) ∈ I
x2[i, j] = f2(x1[i, j], x3[i, j - 1])   ∀(i, j) ∈ I
x3[i, j] = f3(x1[i, j])   ∀(i, j) ∈ I

where I = {(i, j): 0 ≤ i ≤ N ∧ 0 ≤ j ≤ M}. Obviously, the index domain is a linearly bounded lattice with

L = ( 1  0
      0  1 )   p = 0

A = ( 1  0
      0  1
     -1  0
      0 -1 )   b = ( 0  0  -N  -M )^T
The incidence and distance matrices corresponding to the reduced dependence graph shown in Fig. 2 are

C = ( -1 -1  1  0
       1  0 -1  1
       0  1  0 -1 )

D = ( 0  0  1  0
      0  0  0  1 )

where the columns are ordered according to (1, 2), (1, 3), (2, 1), (3, 2).

The allocation and scheduling of operations to processing elements and time instances, respectively, closely follows the principles known from systolic array design, see e.g. [5], [6], [34].

3.2 Allocation to Processing Elements

The allocation of operations to processing elements is determined by a projection direction v ∈ Z^r. Then for any given index vector I0 ∈ I the variables x_l[I0 + αv] are computed in one processing element for all 1 ≤ l ≤ |V| and for all α ∈ Q such that I0 + αv ∈ I. In this definition, v determines a direction only. It will be useful for the derivation of the scheduling procedure to determine a corresponding projection vector u, unique up to its sign, with the following properties:

DEFINITION 4 (Projection Vector). Given a linearly bounded lattice according to Definition 2 and a projection direction v ∈ Z^r. Then the corresponding projection vector u satisfies
1. u has the same direction as v, i.e. u = βv ∈ Z^r for some β ∈ Q.
2. There exists an index vector I0 ∈ I such that I0 + u ∈ I holds.
3. For all I0 ∈ I, the relation I0 + γu ∈ I with γ ∈ Q ∧ |γ| < 1 holds for γ = 0 only.

EXAMPLE 3. Let us suppose that I = {i: i = 2κ ∧ 0 ≤ κ ≤ 10} and v = 1. If u = v = 1, then condition 2 is not satisfied as all i0 are even but all i0 + u are odd. If u = 4v = 4, then condition 3 is not satisfied as e.g. for i0 = 4 we have i0 + (1/2)u = 6 ∈ I. Finally, if u = 2v = 2 is chosen, all conditions are satisfied.

Intuitively, Definition 4 requires that the length of u is not too small, i.e. there must be at least two index vectors in the index domain whose distance is u. On the other hand, u is not too large, i.e. there are no two index vectors whose distance has the direction of v but is smaller than u. It remains to give a procedure which enables the determination of a projection vector given the corresponding direction.

THEOREM 1. Given a projection direction v ∈ Z^r. Then a projection vector according to the above Definition 4 satisfies

u = βv,   β = |det(L)| / gcd{[adj(L)v]_i: 1 ≤ i ≤ r}

where L^{-1} = (1/det(L)) adj(L), det(L) ∈ Z and adj(L) ∈ Z^{r×r}.

PROOF. Condition 1 is satisfied as u = βv. Let us suppose that I0 = Lκ0 + p for some κ0 ∈ Z^r. Then condition 2 is satisfied if there exists κ1 ∈ Z^r such that I0 + u = Lκ1 + p. Therefore, condition 2 is satisfied if there exists κ ∈ Z^r such that Lκ = u. Indeed, κ = L^{-1}u is integral as

L^{-1}u = L^{-1}βv = adj(L)v / gcd{[adj(L)v]_i: 1 ≤ i ≤ r}

Moreover, because of the gcd-operation in the denominator, all other vectors u' of smaller length, i.e. u' = γu with |γ| < 1, lead to a non-integral κ'. Therefore, condition 3 is satisfied too. □
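Theorem 1 is directly computable. The following sketch (Python; the helper names are ours, and the adjugate is built naively by cofactor expansion, which is fine for the small matrices used here) evaluates u = βv with β = |det(L)| / gcd_i [adj(L)v]_i:

```python
# Sketch of Theorem 1: projection vector u from lattice matrix L and
# projection direction v, via integer determinant and adjugate.
from functools import reduce
from math import gcd

def minor(m, i, j):
    return [row[:j] + row[j + 1:] for r, row in enumerate(m) if r != i]

def det(m):
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det(minor(m, 0, c)) for c in range(len(m)))

def adjugate(m):
    n = len(m)
    if n == 1:
        return [[1]]
    # adj(m)[i][j] is the (j, i) cofactor (transpose of the cofactor matrix)
    return [[(-1) ** (i + j) * det(minor(m, j, i)) for j in range(n)]
            for i in range(n)]

def projection_vector(L, v):
    a = adjugate(L)
    av = [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    g = reduce(gcd, (abs(x) for x in av))
    beta = abs(det(L)) // g
    return [beta * vi for vi in v]
```

For the one-dimensional lattice of Example 3 (L = (2), v = 1) this returns u = 2, and for L = E it simply returns u = v, as expected.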
3.3 Scheduling
Corresponding to the design of systolic arrays we will use a linear scheduling function. Therefore, the calculation of a variable x_l[I] for I ∈ I is finished at time instance

t_l(I) = πI + γ_l

where t_l(I) ∈ Z, π ∈ Q^r and γ_l ∈ Z. The iteration interval λ is an important quantity of the scheduling procedures which will be described next. It is closely related to the previously defined projection vector u and the scheduling parameter π.

DEFINITION 5 (Iteration Interval). The iteration interval λ of an allocated and scheduled uniform algorithm is the number of time instances between the evaluation of two successive instances of a variable within one processing element.

Using the definitions given above, the iteration interval can be computed according to the following theorem.

THEOREM 2. The iteration interval λ of an allocated and scheduled uniform algorithm is

λ = |πu|

PROOF. The computations of variables x_l[I0 + αu] assigned to one processing element are finished at times t_l = πI + γ_l = πI0 + απu + γ_l. Two neighbouring instances, i.e. x_l[I0 + αu] and x_l[I0 + (α + 1)u], have a distance in time of Δt = ±πu. As the iteration interval is positive, we have λ = |πu|. □

Now, the problem which we are going to solve can be characterized as follows: Given a regular algorithm according to Definition 1 and an allocation, i.e. a projection direction v, determine under various implementation models an optimal scheduling function which minimizes the time interval between the first and the last calculation.

In order to simplify the notation, we will suppose that the index domains are linearly bounded lattices of the form I = {I: I = Lκ ∧ Aκ ≥ b ∧ κ ∈ Z^r}, i.e. in contrast to Definition 2 we suppose that p = 0. It can be seen that this simplification does not restrict the generality. If the given index domain does not satisfy p = 0, then at first a schedule with π̃, γ̃_l and t̃_l(Ĩ) = π̃Ĩ + γ̃_l for a shifted index domain Ĩ = I - p can be determined. If we set π = π̃ and γ_l = γ̃_l - π̃p, we obtain an equivalent schedule for the original problem and vice versa, i.e. t_l(I) = t̃_l(Ĩ).
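The linear schedules of Section 2 make Theorem 2 easy to check numerically. The following sketch (Python, our notation; the paper itself contains no code) evaluates both schedules of the running example, verifies the four dependence constraints, and computes the iteration intervals λ = |πu| for u = (1, 0):

```python
# Sketch: evaluate schedules a and b of Section 2 and check Theorem 2.
N, M = 10, 20
u = (1, 0)                              # projection vector

def t(pi, gamma, l, i, j):
    """Finish time t_l[i, j] = pi . (i, j) + gamma_l."""
    return pi[0] * i + pi[1] * j + gamma[l - 1]

def feasible(pi, gamma):
    """Dependence constraints t_l - t_k >= 1 for the four edges of Fig. 2."""
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if not (t(pi, gamma, 1, i, j) - t(pi, gamma, 2, i - 1, j) >= 1 and
                    t(pi, gamma, 2, i, j) - t(pi, gamma, 1, i, j) >= 1 and
                    t(pi, gamma, 2, i, j) - t(pi, gamma, 3, i, j - 1) >= 1 and
                    t(pi, gamma, 3, i, j) - t(pi, gamma, 1, i, j) >= 1):
                return False
    return True

def lam(pi):
    """Theorem 2: iteration interval lambda = |pi . u|."""
    return abs(pi[0] * u[0] + pi[1] * u[1])

pi_a, g_a = (2, 1), (0, 1, 1)           # schedule a
pi_b, g_b = (3, 0), (0, 2, 1)           # schedule b
Ta = t(pi_a, g_a, 3, N, M) - t(pi_a, g_a, 1, 0, 0)   # = 2N + M + 1
Tb = t(pi_b, g_b, 3, N, M) - t(pi_b, g_b, 1, 0, 0)   # = 3N + 1
```

With N = 10 and M = 20 this reproduces T_a = 41, T_b = 31 and λ_a = 2, λ_b = 3, i.e. schedule b wins here because M > N.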
4 Scheduling Procedures

4.1 Unlimited Resources
At first, the model used in this section is defined.
• Each processing element contains |V| modules. Module l with 1 ≤ l ≤ |V| is capable of executing the function F_l. The module evaluates variables x_l[I] for some I ∈ I. Therefore, there is one module for any node of the reduced dependence graph within each processing element.
• To each input (k, l) of the module l there is associated an evaluation time w_kl ∈ Z. These evaluation times can be combined in a weight vector w ∈ Z^{1×|E|}. If these weights are included in the reduced dependence graph, a weighted reduced dependence graph DG = (V, E, D, w) is obtained.
• If the inputs (k, l) of the module l are available at time instances t_k, then the output variable is available at time instance t_l ≥ max{t_k + w_kl: (k, l) ∈ E}.
• The time difference λ between two successive iterations is called iteration interval. It has to satisfy the inequality λ ≥ ŵ_max. For example, if module l is not pipelined, it can accept new input data every ŵ_l = max{w_kl: (k, l) ∈ E} time instances. In this case we have ŵ_max = max{ŵ_l: 1 ≤ l ≤ |V|}.
There are many other similar models and simple extensions. We restrict ourselves to the above for the sake of brevity. Now, we are going to determine a time optimal scheduling function under the above implementation model. The solution is similar to [17]. These results are generalized to non-convex iteration spaces, i.e. linearly bounded lattices as defined in Definition 2.
THEOREM 3. Given is a regular algorithm and its weighted reduced dependence graph. Then a schedule which minimizes the maximal value of π(J - I) for all I, J ∈ {Lκ: Aκ ≥ b ∧ κ ∈ Q^r} under the above described implementation model is given by

π = π'L^{-1}   (2)

where π' and γ are a solution of the integer linear program

minimize -(y1 + y2)b
subject to
y1 A = π',  y1 ≥ 0,  y1 ∈ Q^{1×m}   (3)
y2 A = -π',  y2 ≥ 0,  y2 ∈ Q^{1×m}   (4)
π' ∈ Z^{1×r}   (5)
γ ∈ Z^{1×|V|}   (6)
π' adj(L) D + |det(L)| γ C ≥ |det(L)| w   (7)
±π' adj(L) u ≥ |det(L)| ŵ_max   (8)

Here we used the representation of a linearly bounded lattice according to Definition 2, the incidence and distance matrices C and D, and the weight vector w. Moreover, we define L^{-1} = (1/det(L)) adj(L) where det(L) ∈ Z and adj(L) ∈ Z^{r×r}.

PROOF. The execution of operation l at index point I is finished at t_l = πI + γ_l ∈ Z where I = Lκ for some κ ∈ Z^r with Aκ ≥ b. Therefore, t_l = πLκ + γ_l. In order to guarantee that t_l is integer for all κ ∈ Z^r with Aκ ≥ b we have πL = π' ∈ Z^r in (2). Using L^{-1} = (1/|det(L)|) adj(L) we can also write π |det(L)| = π' adj(L). The total evaluation time is approximated by max{π(J - I): I, J ∈ {Lκ: Aκ ≥ b ∧ κ ∈ Q^r}} or the solution to the following linear program

maximize π'(κ2 - κ1)
subject to Aκ1 ≥ b, κ1 ∈ Q^r, Aκ2 ≥ b, κ2 ∈ Q^r

The dual linear program has the form

minimize -(y1 + y2)b
subject to y1 A = π', y1 ≥ 0, y1 ∈ Q^{1×m}, y2 A = -π', y2 ≥ 0, y2 ∈ Q^{1×m}

which leads to (3, 4, 5, 6). In addition, π and γ must satisfy the dependence constraints πD + γC ≥ w. As the iteration interval is either λ = πu or λ = -πu, one of the constraints πu ≥ ŵ_max or -πu ≥ ŵ_max must be satisfied. Using the parameter π' according to (2) leads to (7) and (8). □

The signs ± in (8) denote that the integer linear program must be solved for both signs, i.e. + and -, and the better result must be used. It is possible to combine both problems into one integer linear program. To this end, (8) can be replaced by

π' adj(L) u ≥ |det(L)| (ŵ_max - σ(λ_max + ŵ_max))
π' adj(L) u ≤ |det(L)| (λ_max - σ(λ_max + ŵ_max))
σ ∈ {0, 1}

where λ_max is an arbitrary upper bound on the iteration interval, i.e. the final solution satisfies λ ≤ λ_max, see also the remarks in Section 5.

EXAMPLE 4. For the regular algorithm introduced in Example 2 we choose v = u = (1 0)^T. Moreover, we suppose that w12 = w13 = w21 = w32 = ŵ_max = 1. With N = 10 and M = 20 we obtain as a solution of the integer linear program in Theorem 3

π = π' = (3 0),  γ = (0 2 1)
y1 = (3 0 0 0),  y2 = (0 0 3 0),  -(y1 + y2)b = 30

In other words, the computation of the variables x1[i, j], x2[i, j] and x3[i, j] is finished at time instances t1 = 3i, t2 = 3i + 2 and t3 = 3i + 1, respectively. This is the schedule b shown in Figs. 1 and 3. If we choose N = 20 and M = 10, the results

π = π' = (2 1),  γ = (0 1 1)
y1 = (2 1 0 0),  y2 = (0 0 2 1),  -(y1 + y2)b = 50
t1 = 2i + j,  t2 = 2i + j + 1,  t3 = 2i + j + 1

are obtained, see schedule a in Figs. 1 and 3.

It is also possible to solve the relaxation of (3-8) in Theorem 3 by dropping the integrality constraints and using the nonlinear schedule ⌊πI + γ_l⌋ for the evaluation of x_l[I]. Obviously, better solutions in terms of the total evaluation time can be obtained, but this idea cannot be applied directly to the case of limited resources.

4.2 Limited Resources
Again, it is useful to start with the implementation model for this section. In the above described case of unlimited resources, any processing element contained |V| modules. Each module has been responsible for the calculation of one variable, i.e. x_l[I] for all I ∈ I. Now, the number and functionality of the modules within each processing element is restricted. To model this situation, both a weighted reduced dependence graph and a bipartite resource graph are used.
• The weighted reduced dependence graph DG = (V_D, E_D, D, w) is defined as in the previous section. In particular, if the input variables x_k[I - d_kl] necessary to compute a variable x_l[I] for some I are available at times t_k, then the time instance where x_l[I] is available satisfies t_l ≥ max{t_k + w_kl: (k, l) ∈ E_D}.
• The number of clock cycles a module is reserved for the calculation of x_l via F_l is denoted as ŵ_l.
• In the bipartite resource graph RG = (V_R, E_R, M) with V_R = V_D ∪ V_T there is exactly one edge leaving each node l ∈ V_D and there are no edges ending in some l ∈ V_D. To each vertex k ∈ V_T there is associated a non-negative integer M(k) ∈ Z≥0.
• Each processing element consists of modules. There are |V_T| different types of modules, i.e. to each vertex k ∈ V_T there is associated a module type. If there is an edge (l, k) ∈ E_R, then the function F_l associated to l ∈ V_D can be executed by a module of type k ∈ V_T. There are at most M(k) modules of type k in each processing element.

EXAMPLE 5. The following example serves to explain the notation of a resource graph. Again, we use the algorithm described in Example 2. Let us suppose that within the processing elements there is just one module a available. Therefore, we have M(a) = 1. All operations f1, f2 and f3 can be executed on this module. Therefore, the resource graph has three nodes in V_D corresponding to f1, f2 and f3 and one node in V_T corresponding to module type a. There are edges from all nodes in V_D to the node a in V_T, see Fig. 6.

Again, we are going to determine an optimal scheduling function under the above implementation model. The optimal solution is formulated as a mixed integer linear program again. There are many different formulations possible. Only one basic approach is given in this paper. It is based on results described in [24] for high level synthesis.
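The bipartite resource graph of Example 5 is small enough to write down explicitly. The encoding below is our own illustration (plain dictionaries and sets; the paper prescribes no data structure); it also checks the structural requirement that every operation node has exactly one outgoing edge into V_T:

```python
# Sketch: resource graph of Example 5 as plain Python containers.
V_D = {1, 2, 3}                     # operation nodes (f1, f2, f3)
V_T = {'a'}                         # module types
E_R = {(1, 'a'), (2, 'a'), (3, 'a')}  # each operation runs on type a
M = {'a': 1}                        # at most one module of type a per PE

# Structural check: exactly one edge leaves each node of V_D.
out_deg = {l: sum(1 for (src, _) in E_R if src == l) for l in V_D}
```

Here all three operations compete for the single module of type a, which is exactly the situation that forces λ = 3 in Example 6 below.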
4.3 Basic Procedure
There are two main concepts behind the proposed approach: (1) binary variables are used to represent the vector γ and (2) a convolution-like operator is used to consider resource constraints. The main result is described in the following theorem.

THEOREM 4. There is given a regular algorithm, its weighted reduced dependence graph, a resource graph, the iteration interval λ and lower and upper bounds on γ, i.e. r(l) ≤ γ_l ≤ s(l) for l ∈ V_D. Then the parameters π and γ of a scheduling which minimizes the maximal value of π(J - I) for all I, J ∈ {Lκ: Aκ ≥ b ∧ κ ∈ Q^r} can be obtained as

π = π'L^{-1},   γ_l = Σ_{i=r(l)}^{s(l)} i γ'_{l,i}   ∀l ∈ V_D   (9)

where π' and γ'_{l,i} are a solution of the following integer linear program

minimize -(y1 + y2)b   (10)
subject to
y1 A = π',  y2 A = -π'   (11)
y1 ≥ 0,  y2 ≥ 0,  y1 ∈ Q^{1×m},  y2 ∈ Q^{1×m}   (12)
π' ∈ Z^{1×r},  γ'_{l,i} ∈ {0, 1}  ∀l ∈ V_D, r(l) ≤ i ≤ s(l)   (13)
Σ_{i=r(l)}^{s(l)} γ'_{l,i} = 1   ∀l ∈ V_D   (14)
π' adj(L) d_kl + |det(L)| (Σ_{i=r(l)}^{s(l)} i γ'_{l,i} - Σ_{i=r(k)}^{s(k)} i γ'_{k,i} - w_kl) ≥ 0   ∀(k, l) ∈ E_D   (15)
π' adj(L) u = ±|det(L)| λ   (16)
Σ_{l: (l,k) ∈ E_R} Σ_{p=0}^{ŵ_l - 1} Σ_{p': r(l) ≤ i + p + p'λ ≤ s(l)} γ'_{l, i+p+p'λ} ≤ M(k)   ∀k ∈ V_T, 0 ≤ i ≤ λ - 1   (17)

Fig. 6. Resource graph.
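The occupancy bookkeeping behind constraint (17) can be illustrated numerically. The sketch below (our helper, not part of the paper's formulation) counts, per residue i modulo λ, how many operations occupy a module; with the paper's finish-time convention, an operation with finish offset γ_l and reservation ŵ_l occupies the residues (γ_l - p) mod λ for p = 0, …, ŵ_l - 1:

```python
# Sketch (hypothetical helper): periodic module occupancy behind (17).
def occupancy_ok(gamma, w_hat, lam, capacity):
    """True if at most `capacity` operations occupy the module per residue."""
    load = [0] * lam
    for g, w in zip(gamma, w_hat):
        for p in range(w):
            load[(g - p) % lam] += 1
    return max(load) <= capacity

# Example 6: three unit-time operations sharing one module of type a.
ok3 = occupancy_ok((0, 2, 1), (1, 1, 1), 3, 1)   # lambda = 3: feasible
ok2 = occupancy_ok((0, 2, 1), (1, 1, 1), 2, 1)   # lambda = 2: infeasible
```

For λ = 3 the offsets γ = (0, 2, 1) occupy each residue exactly once, while for λ = 2 three unit-time operations cannot fit into two slots of a single module, matching the infeasibility reported in Example 6.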
PROOF. (9) defines the transformed scheduling parameter π' as in Theorem 3 and the representation of γ as a weighted sum of binary variables γ'. Constraint (14) guarantees a unique representation, i.e. exactly one of the variables γ'_{l,i} for r(l) ≤ i ≤ s(l) equals one. Constraints (11, 12) are identical to those used in Theorem 3. Using the above defined notation, (15) directly corresponds to (7). (16) guarantees that the given iteration interval λ is achieved. Finally, (17) is responsible for the consideration of finite resources. Let us suppose that γ'_{l,k} = 1 for some operation l ∈ V_D. Then Σ_{p=0}^{ŵ_l - 1} γ'_{l,i+p} = 1 for all k - ŵ_l < i ≤ k. This does not yet correspond to the time interval where a resource is used. Remember that an operation l is executed repeatedly, i.e. with a distance of λ time instances. Therefore, operations in different iterations may overlap and multiples of the iteration interval λ must be added to i + p, i.e. Σ_{p=0}^{ŵ_l - 1} Σ_{p': r(l) ≤ i + p + p'λ ≤ s(l)} γ'_{l, i+p+p'λ}. It can be seen that (17) guarantees the resource constraints. □

Concerning the ± in (16), the same remark as in Theorem 3 can be made. In order to combine both problems, (16) can be replaced by

π' adj(L) u ≥ |det(L)| (1 - 2σ) λ
π' adj(L) u ≤ |det(L)| (1 - 2σ) λ
σ ∈ {0, 1}

EXAMPLE 6. We again consider the regular algorithm introduced in Example 2. L, p, A, b, u and D are as in the previous examples. The weighted reduced dependence graph is DG = (V_D, E_D, D, w) with V_D = {1, 2, 3}, E_D = {(1, 2), (1, 3), (2, 1), (3, 2)} and w = (1 1 1 1). The resource graph has the form RG = (V_R, E_R, M) with V_R = V_D ∪ {a}, E_R = {(1, a), (2, a), (3, a)} and M(a) = 1. There is only one resource type a available; on one instance of it, all operations are executed, see Fig. 6. We choose the constraints 0 ≤ γ_{1,2,3} ≤ 2, i.e. r(l) = 0 and s(l) = 2 for all l ∈ V_D. Methods to determine these bounds systematically are described in later sections. With N = 20 and M = 10 the following results are obtained:
• For λ = 2 there is no solution to the set of mixed integer constraints.
• For λ = 3 we find

π = π' = (3 0)
y1 = (3 0 0 0),  y2 = (0 0 3 0),  -(y1 + y2)b = 60
γ'_{1,0}, γ'_{1,1}, γ'_{1,2} = 1, 0, 0
γ'_{2,0}, γ'_{2,1}, γ'_{2,2} = 0, 0, 1
γ'_{3,0}, γ'_{3,1}, γ'_{3,2} = 0, 1, 0

Therefore, we have t1 = 3i, t2 = 3i + 2 and t3 = 3i + 1. This solution corresponds to schedule b as shown in Figs. 1 and 3.

5 Remarks, Variations and Extensions
The following remarks mainly concern the optimal scheduling in Theorem 4. A closer look at the main resource constraint (17) in Theorem 4 reveals that operations are not guaranteed to be assigned to a single module during their evaluation time. Therefore, it may happen that the first half of an operation is assigned to one module and the second half to another one. Usually, this is not acceptable. Figure 7 gives an example. There are three operations 1, 2 and 3 with evaluation time 2, and there are two modules. The figure at the top shows a feasible schedule according to Theorem 4, as there is no time instance with more than two simultaneous operations. On the other hand, there is no assignment to two modules which has a period of three time instances. Two solutions to this problem are given below:
Fig. 7. Allocation of operations.
• Module Selection: For each module (and not for each module type!) one vertex of the resource graph is generated. As a consequence, we have M = 1 for all vertices in V_R. If an operation can be executed by a certain module type, there are edges from the corresponding node to all vertices corresponding to this type. This extension of the resource graph can also be used to model the following situation: An operation can be executed either on a module of type a or on a module of type b. As a consequence, the formulation of the linear integer program becomes more involved. Details can be found in [28].
• Loop Unfolding: It can be shown that there is a consistent allocation of the operations to modules whose period is a multiple of the iteration interval. An example is given at the bottom of Fig. 7. Here, the final period is twice the iteration interval. The situation generalizes to an arbitrary number of modules and operations.
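Both the difficulty and the loop-unfolding remedy can be reproduced on the data of Fig. 7. The sketch below is illustrative only (the function names, the start times and the greedy allocator are my own, not the paper's construction): it first verifies that the three operations of evaluation time 2 never exceed two simultaneous operations under λ = 3, and then shows that a conflict-free assignment to two modules exists with period 2λ = 6 but not with period λ = 3.

```python
# Fig. 7 scenario: three operations with evaluation time 2, iteration
# interval lam = 3, two modules. The start times below are illustrative.

def max_simultaneous(starts, w, lam):
    """Peak number of operations busy in any time instance modulo lam."""
    usage = [0] * lam
    for t in starts:
        for p in range(w):
            usage[(t + p) % lam] += 1
    return max(usage)

def unfold_and_assign(starts, w, lam, unroll, num_modules):
    """Greedily assign every operation instance of an unrolled schedule
    (period = unroll * lam) to a module; returns None on conflict."""
    period = unroll * lam
    instances = []
    for t in starts:
        for k in range(unroll):
            s = (t + k * lam) % period
            instances.append((s, {(s + p) % period for p in range(w)}))
    instances.sort(key=lambda inst: inst[0])
    busy = [set() for _ in range(num_modules)]
    assignment = []
    for s, slots in instances:
        for m in range(num_modules):
            if not busy[m] & slots:   # module m is free during all slots
                busy[m] |= slots
                assignment.append(m)
                break
        else:
            return None               # no conflict-free module found
    return assignment

starts, w, lam = [0, 1, 2], 2, 3
print(max_simultaneous(starts, w, lam))          # 2: counting constraint holds
print(unfold_and_assign(starts, w, lam, 1, 2))   # None: no period-3 assignment
print(unfold_and_assign(starts, w, lam, 2, 2))   # [0, 1, 0, 1, 0, 1]: period 6
```

The greedy heuristic happens to succeed here; in general the unfolded assignment has to be constructed systematically.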
Next, one notes that the iteration interval λ must be given but it is not known beforehand. There are several possibilities to solve this problem:

1. λ is required to be integral. If there is an interval for all admissible λ, then Theorem 4 can be applied for any λ within this interval. Lower bounds on λ can be obtained as follows:
(a) Determine a systolic schedule according to Theorem 3 without taking into account resource constraints. The term |π′ adj(L) u| is a lower bound for |det(L)| λ.
(b) Another lower bound for λ is ω_max = max_l ω_l.
(c) Let us suppose that there are N_k time instances to be executed on modules of resource type k ∈ V_R, i.e. N_k = Σ_{l ∈ V_D: (l,k) ∈ E_R} ω_l, where ω_l is the number of time instances a module is reserved for operation l. Then
⌈max{N_k / M(k) : k ∈ V_R}⌉
is another lower bound for λ.
2. The most elegant but also most complex way is to use λ as a linear parameter in the ILP formulation. Here only the main concepts can be described. In Theorem 4, Eqs. (10)-(16) remain essentially equivalent. But instead of shifting the time potentials y by multiples of λ in (17), i.e. adding νλ to the time index, new potentials y(ν) are introduced. The shift in time by νλ is obtained by additional sets of equations where λ is a linear parameter. The resulting solution time is inferior to the above described possibilities.

Other necessary but unknown parameters are the bounds r(l) and s(l) on the time potentials y_l. Here again, there are several possibilities to determine appropriate values:

1. It is possible to reduce the number of unknowns to one parameter y_max. A trivial possibility would be to set r(l) = 1 and s(l) = y_max for all l ∈ V_D.
2. A more involved possibility is adopted from [24]. On the acyclic dependence graph obtained after removing all edges with d_kl ≠ 0, ASAP and ALAP schedules using y_max are determined. The corresponding schedules of the operations yield r(l) from the ASAP and s(l) from the ALAP schedule. It can easily be seen that with the constraint 1 ≤ y_l ≤ y_max no other time potentials are feasible. As a result, the number of binary variables can be reduced to a great extent.
3. It is possible to determine an ILP formulation of the scheduling problem without the need of a parameter y_max. To this end, additional integer variables g_l for all l ∈ V_D are introduced. In particular, a time potential y_l is now replaced by y_l + g_l λ. Obviously, y_l can now be bounded by 1 ≤ y_l ≤ λ, i.e. r(l) = 1 and s(l) = λ. The terms of the form Σ_{i=r(l)}^{s(l)} i y_{l,i} in (15) are replaced by Σ_{i=1}^{λ} i y_{l,i} + g_l λ, and (17) is replaced by
Σ_{l: (l,k) ∈ E_R} Σ_{p=0}^{ω_l−1} y_{l, (i+p−1) mod λ + 1} ≤ M(k).
This transformation can be interpreted as a cutset transformation, see [16], which has been chosen such that all time potentials are within the interval 1 ≤ y_l ≤ λ. The modulo operation is caused by the fact that the time interval a module is reserved may wrap around from the end of one iteration to the beginning of the next one.
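Bound (c) above is cheap to evaluate. The following sketch is a hypothetical helper (the names are my own); fed with the operation mix used later in the experimental section (21 operations reserving 1 time instance on 2 adders, 6 operations reserving 2 time instances on 1 multiplier) it yields the lower bound max(⌈21/2⌉, ⌈12/1⌉) = 12.

```python
import math

def resource_lower_bound(omega, uses, M):
    """Lower bound on the iteration interval from finite resources.
    omega: operation -> reserved time instances (omega_l),
    uses:  operation -> resource type k with (l, k) in E_R,
    M:     resource type -> number of available modules."""
    N = {}                          # N_k = total reserved instances per type
    for l, k in uses.items():
        N[k] = N.get(k, 0) + omega[l]
    return max(math.ceil(Nk / M[k]) for k, Nk in N.items())

# Operation mix of the PDE-solver example of Section 6 (names illustrative).
omega = {**{f"a{i}": 1 for i in range(21)}, **{f"m{i}": 2 for i in range(6)}}
uses = {**{f"a{i}": "Add" for i in range(21)}, **{f"m{i}": "Mul" for i in range(6)}}
print(resource_lower_bound(omega, uses, {"Add": 2, "Mul": 1}))  # 12
```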
6 Experimental Results

The solution of resource constrained scheduling problems of relevant complexity has been made possible by carefully choosing the branch-and-bound strategies in the ILP solver used. Otherwise, the exponential complexity of the proposed method makes its practical application infeasible. In particular,
• at first the binary variables with low time indices are made integral, and
• the local branching strategy has been adapted to the fact that most of the binary variables are 0.
The execution times are taken on a SUN SPARCstation.

As an example, we have chosen a wave digital filter algorithm for the solution of certain partial differential equations, see Fig. 9 in [35]. The algorithm contains 21 add-operations and 6 multiply-operations. The reduced dependence graph is shown in Fig. 8.

Fig. 8. Reduced dependence graph of a PDE solver.

All displacement vectors are 0 except those whose edges contain the boxes with d_kl inside. The add- and multiply-operations need 1 and 2 time instances, respectively. We suppose that the implementation consists of two module types, i.e. adders and multipliers. Moreover, within one processing element of the processor array there are only 2 adders and 1 multiplier available. The corresponding resource graph is shown in Fig. 9. For completeness, the whole specification of the problem in form of the weighted reduced dependence graph and resource graph is given:

V_D = {a11, a12, a13, a14, a15, m11, m12, a21, a22, a23, a24, a31, a32, a33, a34, a41, a42, a43, a44, m41, m42, a51, a52, a53, a54, m51, m52}
E_D = {(a11, a15), (a12, a15), (a12, a23), (a12, a24), (a13, a15), (a13, a33), (a13, a34), (a14, a15), (a14, m11), (a14, m12), (a15, a14), (m11, a11), (m12, a12), (m12, a13), (a21, a12), (a21, a14), (a22, a41), (a22, a42), (a23, a21), (a23, a22), (a24, a21), (a24, a22), (a31, a13), (a31, a14), (a32, a51), (a32, a52), (a33, a31), (a33, a32), (a34, a31), (a34, a32), (a41, m41), (a41, m42), (a41, a44), (a42, a23), (a42, a24), (a42, a44), (a43, a44), (a44, a41), (m41, a42), (m42, a43), (a51, m51), (a51, m52), (a51, a54), (a52, a33), (a52, a34), (a52, a54), (a53, a54), (a54, a51), (m51, a52), (m52, a53)}

All distances but the following satisfy d_kl = (0 0 0)^t:

d_{a15,a14} = (0 0 2)^t     d_{a23,a21} = (−1 0 1)^t
d_{a23,a22} = (−1 0 1)^t    d_{a24,a21} = (1 0 1)^t
d_{a24,a22} = (1 0 1)^t     d_{a33,a31} = (0 −1 1)^t
d_{a33,a32} = (0 −1 1)^t    d_{a34,a31} = (0 1 1)^t
d_{a34,a32} = (0 1 1)^t     d_{a44,a41} = (0 0 2)^t
d_{a54,a51} = (0 0 2)^t                              (18)
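The hand-transcribed specification can be cross-checked mechanically. The sketch below is my own consistency check, not part of the paper: it confirms 27 operations (21 additions, 6 multiplications) and that removing the eleven nonzero-distance edges of (18) leaves an acyclic graph, the property needed for a spanning tree whose edges satisfy d_kl = 0.

```python
# Sanity checks on the PDE-solver specification: 27 operations, 50
# dependence edges, 11 of which carry a nonzero displacement vector.

edges = [  # (source, sink) pairs of E_D, transcribed from the text
    ("a11","a15"),("a12","a15"),("a12","a23"),("a12","a24"),("a13","a15"),
    ("a13","a33"),("a13","a34"),("a14","a15"),("a14","m11"),("a14","m12"),
    ("a15","a14"),("m11","a11"),("m12","a12"),("m12","a13"),("a21","a12"),
    ("a21","a14"),("a22","a41"),("a22","a42"),("a23","a21"),("a23","a22"),
    ("a24","a21"),("a24","a22"),("a31","a13"),("a31","a14"),("a32","a51"),
    ("a32","a52"),("a33","a31"),("a33","a32"),("a34","a31"),("a34","a32"),
    ("a41","m41"),("a41","m42"),("a41","a44"),("a42","a23"),("a42","a24"),
    ("a42","a44"),("a43","a44"),("a44","a41"),("m41","a42"),("m42","a43"),
    ("a51","m51"),("a51","m52"),("a51","a54"),("a52","a33"),("a52","a34"),
    ("a52","a54"),("a53","a54"),("a54","a51"),("m51","a52"),("m52","a53"),
]
nonzero = {  # edges whose displacement vector is nonzero, from (18)
    ("a15","a14"),("a23","a21"),("a23","a22"),("a24","a21"),("a24","a22"),
    ("a33","a31"),("a33","a32"),("a34","a31"),("a34","a32"),("a44","a41"),
    ("a54","a51"),
}

nodes = {v for e in edges for v in e}
assert len(nodes) == 27 and sum(v.startswith("a") for v in nodes) == 21

def is_acyclic(edge_list):
    """Topological-sort test: True iff the directed graph has no cycle."""
    succ, indeg = {}, {}
    for s, t in edge_list:
        succ.setdefault(s, []).append(t)
        indeg[t] = indeg.get(t, 0) + 1
        indeg.setdefault(s, 0)
    queue = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for t in succ.get(v, []):
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    return seen == len(indeg)

print(is_acyclic(edges))                                   # False: cycles exist
print(is_acyclic([e for e in edges if e not in nonzero]))  # True: d = 0 subgraph
```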
Fig. 9. Resource graph (M(Add) = 2, M(Mul) = 1).
The resource graph is given by

V_R = {Add, Mul}
E_R = {(a11, Add), (a12, Add), (a13, Add), (a14, Add), (a15, Add), (m11, Mul), (m12, Mul), (a21, Add), (a22, Add), (a23, Add), (a24, Add), (a31, Add), (a32, Add), (a33, Add), (a34, Add), (a41, Add), (a42, Add), (a43, Add), (a44, Add), (m41, Mul), (m42, Mul), (a51, Add), (a52, Add), (a53, Add), (a54, Add), (m51, Mul), (m52, Mul)}

The evaluation times satisfy w_kl = 2 ∀(k, l): (l, Mul) ∈ E_R and w_kl = 1 ∀(k, l): (l, Add) ∈ E_R. As we suppose non-pipelined arithmetic units we also have ω_l = 2 ∀l: (l, Mul) ∈ E_R and ω_l = 1 ∀l: (l, Add) ∈ E_R. The numbers of available modules are M(Mul) = 1 and M(Add) = 2.

The index vector of the above algorithm is of the form I = (i j t)^t where 0 ≤ i ≤ N − 1, 0 ≤ j ≤ M − 1 and 0 ≤ t ≤ T − 1. A closer examination of the dependence vectors d_kl reveals that the dependence graph of the regular algorithm consists of two connected components, i.e. the algorithm is able to compute two independent sets of problems. This is mentioned also in [35]. Consequently, if the solution of only one problem is of concern, the index space is a linearly bounded lattice. The method which enables the computation of this connected component is closely related to the problems described in [36]. At first, cutset transformations on the given reduced dependence graph must be carried out such that there is a spanning tree whose edges satisfy d_kl = 0. Then the Hermite Normal Form, see e.g. [37], of the transformed distance matrix is the required lattice matrix L. In the example, the reduced dependence graph already contains the required spanning tree. The distance matrix D contains as nonzero columns those given in (18). Elementary column operations, i.e. adding integer multiples of one column to another column, lead to the required lattice matrix

L = ( 1 0 0
      0 1 0
      1 1 2 )

The final index space of the algorithm has the form

I = {Lκ : 0 ≤ κ₁ ≤ N − 1 ∧ 0 ≤ κ₂ ≤ M − 1 ∧ 0 ≤ κ₁ + κ₂ + 2κ₃ ≤ T − 1}

where κ = (κ₁ κ₂ κ₃)^t. Figure 10 serves to explain the lattice structure of the index space. Only every second index point is used for computations. The lattice structure is shifted from one layer with constant t to the next one. As the projection direction we choose v = (0 0 1)^t. According to Theorem 4, the projection vector u satisfies u = (0 0 2)^t.

For the following calculations, N = M = T = 1000 is chosen. The final processor array has the form shown in Fig. 11.
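The lattice structure can be made concrete. The matrix below is a reconstruction (the printed entries are partly illegible, so treat it as an assumption that is merely consistent with the stated index-space constraints and with u = (0 0 2)^t): it generates exactly the index points (i, j, t) with t − i − j even, reproducing the checkerboard pattern of Fig. 10.

```python
# Assumed lattice matrix: an index point (i, j, t) = L @ (k1, k2, k3) has
#   i = k1, j = k2, t = k1 + k2 + 2*k3,
# i.e. the lattice contains exactly the points with t - i - j even.

L = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 2]]

# Image of the projection direction v = (0 0 1)^t: u = L v.
u = [row[2] for row in L]
print(u)  # [0, 0, 2]

def on_lattice(i, j, t):
    """Membership test for the assumed lattice."""
    return (t - i - j) % 2 == 0

layer0 = [(i, j) for i in range(4) for j in range(4) if on_lattice(i, j, 0)]
layer1 = [(i, j) for i in range(4) for j in range(4) if on_lattice(i, j, 1)]
print(len(layer0), len(layer1))   # 8 8: half of each 4x4 layer is used
print(set(layer0) & set(layer1))  # set(): the pattern shifts between layers
```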
Fig. 10. Lattice structure of the index space.
Fig. 11. Processor array for PDE solver.
At first, the optimal systolic schedule according to Theorem 3 is determined. The solution takes less than 0.1 seconds of CPU time. The objective function has the value (y¹ + y²)b = 5994, where

π = (0 0 6)   λ = 12
y = (5 5 5 2 6 4 4 1 1 6 6 1 1 6 6 2 5 5 6 4 4 2 5 5 6 4 4)

and the elements of y are ordered as the list of nodes V_D given above. A corresponding implementation needs 6 multipliers and 7 adders. The integer linear program has 57 equations and uses 42 variables.

Now, the resource constrained scheduling according to Theorem 4 is evaluated. The solution of the ILP takes about 40 seconds of CPU time. The results are

(y¹ + y²)b = 7992   π = (0 0 8)   λ = 16
y = (1 4 7 9 4 16 13 6 11 8 8 3 3 10 1 0 2 6 13 14 4 1 1 4 9 16 18 8 15)^t

The bar charts of the two available adders and the multiplier are shown in Fig. 12. Here, the numbers of inequalities and variables are 743 and 106, respectively. The branch-and-bound algorithm visited 189 branching nodes.
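From the reported iteration interval λ = 16 and the operation mix (21 add-operations reserving 1 time instance each on 2 adders, 6 multiply-operations reserving 2 time instances each on 1 multiplier), a back-of-the-envelope utilization figure can be derived. This is my own computation, not a result from the paper:

```python
# Module utilization over one iteration interval lam = 16 of the
# resource constrained schedule.
lam = 16
util_add = 21 * 1 / (2 * lam)   # 21 of 32 adder time slots used
util_mul = 6 * 2 / (1 * lam)    # 12 of 16 multiplier time slots used
print(round(util_add, 3), round(util_mul, 3))  # 0.656 0.75
```

The multiplier is the bottleneck resource, which matches the lower bound ⌈12/1⌉ = 12 on the iteration interval derived in Section 5.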
Fig. 12. Bar charts.
7 Concluding Remarks
Finally, some possible extensions to the methods described in this paper are outlined. In a similar way as in [24], additional degrees of freedom or constraints can easily be taken into account, such as chaining, pipelined arithmetic units and multi-cycle operations. It has been shown in [28] that it is possible to explicitly include also the memory design, i.e. optimizing the total amount of local memory needed and avoiding access conflicts to memory banks, and to include the bus design, i.e. taking into account limited bandwidth and avoiding access conflicts on the interconnection structure. Moreover, it is possible to assign more than one resource type to an operation. In this case, the operation will be allocated to one of the resources (module selection). These results can be applied to the class of algorithms considered in this paper. In the case of parameterized index spaces, i.e. when b in the definition of the linearly bounded lattices depends on some constant but unknown parameters, the optimal solution can be determined using methods to solve parameterized integer programs, see [38].
Notes

1. The column vector c_kl of C associated to an edge (k, l) satisfies c_kl = e_l − e_k, where e_k and e_l denote the kth and lth unit vector, respectively.
2. In order to exclude degenerate cases, we suppose that I is sufficiently fat with respect to v, i.e. there exist I₀ ∈ I and μ ∈ Q, μ ≠ 0, such that I₀ + μv ∈ I.
3. This expression is an approximation of the total evaluation time max{π(J − I) + y_k − y_l : I, J ∈ I ∧ k, l ∈ V}.
4. This condition is also necessary if the polyhedron Aκ ≥ b is sufficiently fat, i.e. contains a unit cube.
5. The iteration interval of a schedule denotes the number of time instances between two successive iterations.
References

1. H.T. Kung and C.E. Leiserson, "Systolic arrays for VLSI," SIAM Sparse Matrix Proceedings, Philadelphia, 1978, pp. 245-282.
2. H.T. Kung, "Let's design algorithms for VLSI systems," Proc. Caltech Conf. on VLSI, pp. 65-90, 1979.
3. H.T. Kung, "Why systolic architectures?" IEEE Computer, pp. 37-46, 1982.
4. S.Y. Kung, "On supercomputing with systolic/wavefront array processors," Proceedings of the IEEE, 1984, pp. 39-46.
5. R.H. Kuhn, "Transforming algorithms for single-stage and VLSI architectures," Workshop on Interconnection Networks for Parallel and Distributed Processing, 1980.
6. D.I. Moldovan, "On the design of algorithms for VLSI systolic arrays," Proceedings of the IEEE, 1983, pp. 113-120.
7. P. Quinton, "Automatic synthesis of systolic arrays from uniform recurrent equations," The IEEE/ACM 11th Annual Int'l Symp. on Computer Architecture, pp. 208-214, Ann Arbor, MI, USA, 1984.
8. W.L. Miranker and A. Winkler, "Space-time representation of computational structures," Computing, pp. 93-114, 1984.
9. P.R. Capello and K. Steiglitz, "Unifying VLSI array design with linear transformations of space-time," Advances in Computing Research, Vol. 2, pp. 23-65, 1984.
10. S.K. Rao, Regular Iterative Algorithms and Their Implementations on Processor Arrays, Ph.D. Thesis, Stanford University, 1985.
11. S.K. Rao and T. Kailath, "Regular iterative algorithms and their implementations on processor arrays," Proceedings of the IEEE, Vol. 76, pp. 259-282, March 1988.
12. R.M. Karp, R.E. Miller, and S. Winograd, "The organization of computations for uniform recurrence equations," Journal of the ACM, Vol. 14, pp. 563-590, 1967.
13. M. Wolfe, "Massive parallelism through program restructuring," Proc. IEEE Conf. on Frontiers of Massively Parallel Computation, 1990, pp. 407-415.
14. M.E. Wolf and M.S. Lam, "A loop transformation theory and an algorithm to maximize parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, pp. 452-471, 1991.
15. F.S. Hillier and G.J. Lieberman, An Introduction to Operations Research, San Francisco: Holden Day, 1980.
16. C.E. Leiserson, F.M. Rose, and J.B. Saxe, "Optimizing synchronous circuitry by retiming," Proc. Third Caltech Conf. on VLSI, pp. 87-116, 1983.
17. A. Darte and Y. Robert, "Scheduling uniform loop nests," Technical Report 92-10, ENSL, Lyon, France, 1992.
18. Y. Wong and J.M. Delosme, "Optimization of computation time for systolic arrays," 26th Allerton Conference on Communication, Control and Computing, September 1988.
19. E.G. Coffman, Computer and Job-Shop Scheduling Theory, New York: John Wiley and Sons, 1976.
20. M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: Freeman, 1979.
21. M.C. McFarland, A.C. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proceedings of the IEEE, Vol. 78, pp. 301-318, 1990.
22. S. Davidson, D. Landskov, B.D. Shriver, and P.W. Mallett, "Some experiments in local microcode compaction for horizontal machines," IEEE Transactions on Computers, Vol. C-30, pp. 460-477, 1981.
23. P.G. Paulin and J.P. Knight, "Force-directed scheduling for the behavioral synthesis," IEEE Transactions on Computer-Aided Design, Vol. CAD-8, pp. 661-679, 1989.
24. C.T. Hwang, J.H. Lee, and Y.C. Hsu, "A formal approach to the scheduling problem in high level synthesis," IEEE Transactions on Computer-Aided Design, Vol. CAD-10, pp. 464-475, 1991.
25. N. Park and A.C. Parker, "Theory of clocking for maximum execution overlap of high-speed digital systems," IEEE Transactions on Computers, Vol. C-37, pp. 678-690, 1988.
26. L. Hafer and A.C. Parker, "A formal method for the specification, analysis and design of register-transfer digital logic," IEEE Transactions on Computer-Aided Design, Vol. CAD-2, pp. 4-18, 1983.
27. N. Park and A.C. Parker, "Sehwa: a software package for synthesis of pipelines from behavioral specifications," IEEE Transactions on Computer-Aided Design, Vol. CAD-7, pp. 356-370, 1988.
28. A. Bachmann, M. Schöbinger, and L. Thiele, "Synthesis of domain specific multiprocessor systems including memory design," VLSI Signal Processing VI, pp. 417-425, New York: IEEE Press, 1993.
29. M.V. Swaaij, J. Rossel, F. Catthoor, and H.D. Man, "Synthesis of ASIC regular arrays for real-time image processing systems," Ed F. Deprettere and A.-J. van der Veen (Eds.), Algorithms and Parallel VLSI-Architectures, Volume B, pp. 329-341, Amsterdam: Elsevier, 1991.
30. L. Thiele, "On the design of piecewise regular processor arrays," Proc. IEEE Symp. on Circuits and Systems, pp. 2239-2242, Portland, 1989.
31. L. Thiele, "Compiler techniques for massive parallel architectures," P. Dewilde (Ed.), The State of the Art in Computer Systems and Software Engineering, pp. 101-151, Boston: Kluwer Academic Publishers, 1992.
32. J. Teich and L. Thiele, "Partitioning of processor arrays: A piecewise regular approach," INTEGRATION: The VLSI Journal, Vol. 14, pp. 297-332, 1993.
33. W.H. Chou and S.Y. Kung, "Scheduling partitioned algorithms on processor arrays with limited communication supports," Proc. Conf. on Appl. Spec. Array Processors, pp. 53-64, Venice, Italy, 1993.
34. S.Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice Hall, 1988.
35. A. Fettweis and G. Nitsche, "Numerical integration of partial differential equations using principles of multidimensional wave digital filters," Journal of VLSI Signal Processing, Vol. 3, pp. 7-24, 1991.
36. J.-K. Peir and R. Cytron, "Minimum distance: a method for partitioning recurrences for multiprocessors," IEEE Transactions on Computers, Vol. 38, pp. 1203-1211, 1989.
37. A. Schrijver, Theory of Linear and Integer Programming, Wiley Interscience Series in Discrete Mathematics and Optimization, 1986.
38. P. Feautrier, "Parametric integer programming," RAIRO Recherche Opérationnelle, Vol. 22, pp. 243-268, 1988.
Lothar Thiele was born in Aachen, Germany on April 7, 1957. He received the Diplom-Ingenieur and Dr.-Ing. degrees in electrical engineering from the Technical University of Munich, West Germany, in 1981 and 1985, respectively. From 1981, he was a research associate with Professor R. Saal at the Institute of Network Theory and Circuit Design of the Technical University of Munich. After finishing his Habilitation thesis, he joined the group of Professor T. Kailath at the Information Systems Laboratory, Stanford University, in 1987. In 1988, he took up the chair of microelectronics at the Faculty of Engineering, University of Saarland, Saarbruecken, West Germany. He joined ETH Zurich, Switzerland, as a full professor in Computer Engineering in 1994. His research interests include models, methods and software tools for the design of hardware/software systems and array processors, and the development of parallel algorithms for signal and image processing, combinatorial optimization and cryptography. Lothar Thiele has authored and co-authored more than eighty papers. In 1986, he received the award of the Technical University of Munich for his Ph.D. thesis. He received the "1987 Outstanding Young Author Award" of the IEEE Circuits and Systems Society. In 1988, he was the recipient of the "1988 Browder J. Thompson Memorial Prize Award" of the IEEE. thiele@tik.ee.ethz.ch