Journal of VLSI Signal Processing, 2, 9-16 (1990) 9 1990 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Configurable Hardware: Two Case Studies of Micro-Grain Computation TOM KEAN AND JOHN GRAY Algotronix Ltd., TIC, Kings Buildings, Mayfield Road, Edinburgh EH9 3JL, Scotland Received March 26, 1990, Revised May 31, 1990.
Abstract. This paper describes a new VLSI architecture--Configumble Array Logic (CAL) which, at its lowest level,
can be programmed electrically to implement any circuit composed of logic gates. At higher levels the technology provides a medium for the direct implementation of algorithms. It particularly addresses systolic and cellular automaton algorithms where the basic computational elements perform computations unsuited to conventional processors. 1. Introduction
In their seminal paper [1] Foster and Kung argued for a computer architecture based around the classic von Neumann processor and memory but with a number of special purpose VLSI chips added to the bus. These chips would implement systolic algorithms to provide high-performance computation of important functions such as pattern matching, fast Fourier transform and sorting. Foster and Kung describe a methodology for the implementation of such chips based on careful algorithm design and simplified and formalized layout techniques. Despite the considerable potential performance advantages of this architecture it has not been successfully adopted in any common computer design to date. Commercial designers have not been able to justify the large design cost of designing many special purpose chips, all of which would require different support from applications programs in order to function effectively. In this paper we will present an alternative approach to the implementation of systolic algorithms within a conventionalcomputersystem:insteadofanumberofspecial purpose systolic chips a single configurable computing surface is provided. This architecture is termed configurable logic because it is best viewed as a reconfigurable hardware implementation style: that is, algorithms are programmed by specifying connections between active logic elements (via a programmable switching structure) rather than as instruction data to be interpreted by processing units. The advantage of this is that the programmable structure can implement bit level systolic algorithms unsuited to arrays of conventional processors, This approach is illustrated by direct conversion of two well known systolic algorithms previously implemented in special purpose silicon to the programmable
structure. The ease with which this transformation can be done suggests that the programmable structure is suitable for a very general purpose systolic coprocessor. 2. Architecture
The basic configurable logic structure is a rectangular array of identical cells with the same orientation and nearest neighbor connections, see figure 1. In reality, 1024 cell CAL chips are interconnected as in figure 2. Each cell has a simple function unit and a permutation network. Nonlocal connections must be routed through intermediate cells. A range of such architectures is l [ l ~ Cell ~ C ~ [ ~ _~ Unit I
1
l ~ l ]Cell~ml~n~"] ~ C ~
~
l ]Cel~
~
l ~Cell
1
]Unit I ~ C eIl ~Unit I I
~
l ~ C ~
l ~ C ~
1 I
] Unit l
l
~
l
Fig. 1. Basic structure. Reproduced from "Systolic Array Processors "" by McCanny et al. with kind permission of Prentice Hall.
10
Kean and Gray
such as Shoup's Cell Overhead Factor (COF) [3], defined as the ratio of the total number of components in the cell to the number in the cell function unit as measures of efficiency, and to cell designs, such as the cutpoint array [2], in which nearly all the area is devoted to the function unit. The cutpoint array cell has a CI of 3 (it needs 6 distinct logic functions and has fixed routing). The design presented in this paper has a CI of 17: about 40% of cell area is used by inter-cell routing, 20% by intra-cell routing and 40% by cell function. This change in the allocation of cell area to resources is motivated by several considerations.
Fig. 2. CAL system.
possible according to the complexity of the function unit and routing network within each cell. The regularity of these structures makes them very suitable for implementation in VLSI. They have the important properties of having a single resource (cells) and infinite expandability since large arrays can be built up by connecting multiple chips at the board level. 2.1. Design Considerations
A member of the class of cellular systems described above can be characterized completely by the design of the basic cell. This is composed of three resources: a function block or kernel which provides the computation and two kinds of routing. These are inter-cell routing which routes inputs from other cells or the fimction block output to adjacent cells and intra-cell routing which routes inputs from other cells to the function unit inputs. Each of these resources has associated with it a control store. A good measure of a cell's complexity is the logarithm (to base 2) of the number of functionally distinct cell configurations. This is the lower bound on the number of bits of RAM required in the cells control store. We will call this the complexity index (CI). Early work in this area [2] focused on low-generality arrays capable of synthesizing arbitrary combinational logic functions of a set of input variables. Given that inputs and outputs occur on fixed sides of the array only very simple routing functions are required. Concentration on this problem led to the employment of metrics
hnprovements in processing technology. These have changed the cost analysis in two ways: firstly because memory cells have scaled better than logic higher generality arrays are now more practical, secondly the number of cells that can be placed on a single chip is now so great that it makes sense to implement whole systems rather than just single combinational blocks within the array. This implies that the ease of implementation of routing areas between logic blocks, the ability to rotate and mirror subunits to get a squarer floorplan and flexibility of aspect ratio and port placement in logic blocks are all important factors in obtaining efficient cellular systems. Experience with VLSI design. This has led to an appreciation that routing area within logic blocks is not an overhead but a computational resource. As an illustration the PLA and standard cell implementations of many systems require approximately the same space. In the PLA all the area is used by logic devices whereas in the standard cell implementation there are many fewer logic gates and half the area may be taken up by wiring channels. The extra wires have allowed a more general network of gates resulting in less recomputation of sub-results. Ease of Use. It was decided to support arbitrary interconnection of function units to allow the processes of logic synthesis and placement and routing to be separated (without a general routing network the logic synthesis step must take into account routing considerations to ensure its design can be realized). This separation can provide a major simplification of the design process whether it is done manually or automatically. 2.2. Function Unit Design
The simplest function units capable of general computation when connected together will have two one-bit wide
Two Case Studies of Micro-Grain Computation inputs and one one-bit wide output. The simplest such general function unit is a NAND or NOR gate but too many of these cells would be required to realize useful designs with the array. To minimize the number of function blocks required we choose to make any of the 16 two input one output combinational logic function implementable within one cell. The present cell also provides 4 kinds of D latches since the ability to obtain a latch with only one cell is important to allow efficient implementation of pipelined and systolic designs (table 1).
x2
11
Y2
Table 1. CAL Programming Table. Number
Function
Number
Function E
0
ZERO
8
X1 9 X2
1
ONE
9
X1 9 X2
2
XI
10
X1 + X 2
3
XI
ll
X1 + X 2
4
X2
12
X1 + X 2
5
X2
13
XI G X 2
6
XI .X2
14
X1 + X 2
7
XI .X2
15
X1 (~ X2
16
D Clk Latch
17
D Clk Latch
18
D Clk Latch
19
D Clk Latch
A block diagram of the function unit within each cell is given as figure 3. It relies on the fact that all functions of two variables (X1 and X2) can be computed by a 2:1 multiplexer (marked F in the diagram) which selects F2 if Y1 = 1 and Y3__ifY1 =Ogiven appropriate inputs chosen from {X1, X1, X2, X2, 0, 1}. Note that the degenerate functions ZERO, ONE, X1, X1, X2 and X2 are computed using two input functions with the same cell input routed to the X1 and X2 function block inputs, e.g., ZERO could be computed using XOR with X1 = X2.
2.3. Routing Network Design The simplest routing network on an array of identical cells capable of implementing any legal circuit composed of two-input one output logic gates requires single bit input and output connections between each cell and its neighbors to the North, South, East and West. Connection schemes in which signals only flow, for example, from east to west cannot accommodate circuits with feedback connections. A fundamental problem with having only nearest neighbor connections is that propagation delay increases
Func. O u t Fig. 3. CAL function block design.
quickly for long wires. This problem is particularly severe in the case of clock wires and special measures have been taken to overcome it: two global signals are routed to all cells in the array without passing through any multiplexers. These signals are intended to be used as clocks by user designs. This has the useful side effect of freeing a large number of connections for use by data. Global wires running along rows and columns of the array were considered at an early stage of the design and rejected for several reasons, one of the most important being that because of limitations on device pinout it would be impossible to extend these signals across chip boundaries in a multi-chip array. Longer wires crossing several cell boundaries were also considered and rejected because they considerably complicate the task of automatic place and route software and do not offer sufficient performance advantages over carefully designed neighbor routing systems to compensate for the considerable control logic overhead. These points are covered in more detail in [4]. The propagation delay problem has two main consequences for CAL users: if the CAL is being used to prototype systems which will eventually be implemented in silicon, then the prototype operating speed will be less than that of the final implementation; if a system is being designed for implementation using CAL, then architectures with mainly local connections and pipelining between stages should be used where possible. Minimization of cells needed for routing could be achieved by having the maximum routing complexity, i.e., a crossbar switch with 6 outputs (N, S, E, W,
12
Kean and Gray
2 function block inputs) and 5 inputs (N, S, E, W and function block output). Some permutations, for example, routing North input to North output are not useful since the cell to the North will already have the signal available or be receiving an undefined value (if it has connected South output to South input) so these possibilities are removed. Also, connections from function block outputs to function block inputs are unnecessary. A slight complication is the two global signals mentioned above: these can be selected as sources by one of the function block inputs. This leaves a switching function which can be realized by five 4:1 and one 6:1 multiplexer controlled by 13 bits of RAM (figure 4).
North
design to illustrate the ease with which silicon implementations of systolic algorithms can be mapped to the programmable structure. In order to implement the string matching algorithm two functional cells must be designed: the one bit comparator and the accumulator. These cells are then arrayed as shown in figure 5 (based on one in [1]).
P
LSB
d
In
MSB
e l _ _
G2
Fig. 4. Cell routing,
3. CAL Implementation The present CAL chip is implemented in 1.5/zm double metal CMOS. This chip contains a 32• array of configurable cells and is packaged in a 176 pin PGA package. A novel pad sharing scheme using ternary signaling [5] is used to allow all 32 inputs and outputs on each side of the array to be communicated to neighboring chips without time-multiplexing pads. Communication to non-CAL chips takes place at normal logic levels and has no special timing requirements. CAL chips are programmed by writing into a static RAM which controls the configuration circuits. There is a parallel, byte wide, interface to the control store and normal address decoding so that it may be mapped into the memory of a host computer. Routing delay through a single cell on the prototype chip has been measured at approximately 1.6 ns and the combined time to perform a logic computation and route the result to a neighboring cell at about 8.7 ns.
Fig. 5. Systolic string pattern matcher.
4.1. One Bit Comparator
The comparator computes the following function: Pout Pin, so~ ~ sin, wout ~ win " (Pi,~ = si,~). The variables p and s represent the single bits of the pattern we are matching against and the string respectively, w indicates that the match has been successful up to this bit position. The cellular layout is shown in figure 6.
PIN
~'i:~:!:!:~:!:~:!:i:! ::'~ :""::i:i:i:i:~:~:~:~:i:i::" UT
ii~iiii::?:!ii!i::::i!iiiii i::
4. Systolic String Matching
.g:::i
~OUT
The systolic string matching algorithm of" Foster and Kung [1] is a classic example of a linear systolic array. In this section we will convert the VLSI layouts of the original paper into an equivalent configurable logic
;!ii51~i!ii~?15i5;i;2~i5j5i~ Fig. 6, O n e bit comparator layout.
Two Case Studies of Micro-Grain Computation
On this diagram the grey boxes represent individual cells within the array, the black box at the center of some cells represents the function unit, its two inputs are at the left (X1) and right (X2) hand side and the output comes from the center. The function being computed is printed at the top left hand corner of the grey box. Black lines represent connections through the routing structure, in some cases the label G1 or G2 beside the left function unit input (the clock line when the function unit implements a latch) to indicate it is being driven by a global signal. Note that this unit is designed to allow a one cell vertical overlap with the adjacent unit in the array.
13
allows wild card matching at this position, fl indicates a successful character match at this position. 4.3. Overall Design
The overall sting matcher is composed of an array of these two cell types, the comparator cells on the top are slightly different because there is no win from the previous comparators. Figure 8 shows a 3 bit 4 stage comparator implemented in a 20x8 array of cells. i:::
9 .................. :: ::::,:::s ~. ::::::::::::::::::::::: : "-::ii!~i;!!i!!!Ey -!~!i!:!:.? ! :,-: :-":!ii:!i!~}!i::ii~!:~i
4.2. Accumulator
sour:://~i~!::::iii::iii!i The accumulator computes the following functions: Xout ~ Xin, f~out ~ ~i,,, TEMP 9 Xin + min " Xo,,r mout, Xin + TEMP(f~in + win)-'* TEMP. T h e state variable TEMP indicates that so far there has been a match up to this character position, m is the result, X is used to mark the end of the pattern to initialize TEMP, w
WIN
O1/4
LIN
Motrl Fig. 7. Accumulatorlayout.
Fig. 8. String matcher layout.
.:zi:~
14
K e a nand Gray
5. Bit Serial Multiplication latch Implementation of multiplication is arguably the most critical component of most hardware Digital Signal Processing systems, ha this section we will illustrate the implementation of a pipelined bit-serial multiplier using configurable logic. The multiplier design is based on that used in the FIRST silicon compiler [6] allowing direct comparison of the two implementation styles. The implementation of the multiplier (by far the most complex of the FIRST library elements) also illustrates the potential to build a complete library of FIRST primitives in configurable logic allowing implementation of FIRST DSP designs using the configurable technology. In many applications, e.g., fixed response FIR filter, the multiplier coefficients are parameters which are fixed or changed much less frequently than the multiplicand data. In the case of a VLSI multiplier it is still necessary to support variable coefficients in order to increase the range of usefulness of the chip but in a eonfigurable logic implementation it makes much more sense to write a fixed coefficient multiplier generator, i.e., a program which can generate fixed coefficient multiplier designs for particular coefficients, and reconfigure the chips as necessary. Note that this is an example of a general technique: analogous techniques are often used for computation expensive processes on conventional computers. For example, the GOALIE program for artwork analysis of integrated circuits generates a program for a particular set of design rules which is then run on the input data rather than using a parameterized program capable of handling general design rules. This allows conditionals to be eliminated with key inner loop code which, combined with loop unrolling and other transforms, provides a factor of two speedup [7]. This technique is central to the efficient use of configurable logic in many applications and it is conjectured that in many cases it can recover much of the time/area penalties over conventional hardware incurred by the programming circuitry. The FIRST multiplier uses a modified Booth algorithm and its block diagram is shown in figure 9. Data and control signals are fed from right to left through the stage with 3 bits of delay, requiring 6 latches per signal. The block LATCH__AND__COEFF is concerned with latching the coefficient which is fed in bit serially like the data. A latch signal is shifted-in in parallel with the coefficient which, when high, causes the coefficient bits to be latched into the recoder unit prior to multiplication. The block labelled LATCH___AND RECODE
~
latch
LATCH-AND-COEFF COel~
9
coe~]~
I LATCH_AND_RECODE
data
DATA.LINE
--- d a t a
SELECT ~init
~I-LINE
cl
+/-
--,- c l ~init ~sign e x t e n d
!PP
ADDSUB pps Fig. 9.
__,
--,-pps out
Serial multiplier cell--block diagram.
decodes three bits of the coefficient according to table 2 and generates control signals for the selector and the programmable adder/subtractor (ADDSUB). The block labelled SELECTOR outputs 0, a or 2a to ADDSUB depending on the control inputs (xo, xl, x2), a and 2a are available on the three bit delay line DATA_LINE. The block CL___LINEis concerned with an initialization signal. ADDSUB is a programmable carry save adder/ subtractor with sign extension and settable carry. It adds the appropriate value to the current partial product sum pps and passes the result to the next stage. Table 2.
Multiplier recoding. Operation
hi+ 1
bi
bi_ 1
0 0 0
0 0 1
0 1 0
P P +-- P P PP ~ PP + a PP ~ PP + a
0 1
1 0
1 0
1
0
1
1 1
1 1
0 1
PP PP PP PP PP
+-'PP ~ PP *-- P P ~ FP "- P P
+ 2a - 2a -
a
- a
Fixed coefficient multipliers can be much smaller and faster than variable coefficient multipliers. The savings are greatest when a recoding scheme is used since the recoding can be chosen to minimize the number of stages where addition and subtraction are required. In
Two Case Studies of Micro-Grain Computation many cases fewer than half the stages will require adders or subtractors [8]. In a fixed coefficient design we can eliminate the LATCH_AND__COEFFICIENT, LATCH__AND__RECODE and SELECT blocks. We can also replace the complex programmable block marked ADDSUB with a simple carry-save adder or subtractor or eliminate it entirely depending on the coefficient bits for a given stage. Thus at each stage of the multiplier we replace a single complex design with one of a number of much simpler designs selected automaticallyby a multiplier-generator program according to the coefficient. The fixed coefficient stages have approximately 1Athe hardware of programmable stages and are considerably faster because the critical path is much smaller. The layout of a typical stage in the multiplier is shown in figure 10, this consists of delay elements and a carry save adder with presettable carry. This stage design corresponds to the operation P P ~ (1/4)PP + a in the recoding table 2. The other operations in the table are computed similarly by substituting a subtractor and taking the 2a signal from after the third delay element on the data input as required. Since two bits of coefficient are dealt with in each stage, the number of stages required is half the number of coefficient bits. The first and last stage designs in the array are slightly simpler than the general stage.
iiiiiii?iiii! iN
N Fig. I0. Complete multiplier layout.
15
6. Conclusions A novel approach to implementing systolic algorithms for important applications within a conventional computer has been described. Instead of providing a variety of special purpose chips a single fine grain computing surface is used which can be configured to emulate arbitrary gate level circuits. Systolic algorithms typically implemented in silicon can easily be mapped onto this surface. This mapping is illustrated for two well known systolic algorithms. Further studies of the performance and area tradeoffs of CAL based solutions and comparisons with other implementation techniques such as custom VLSI and Programmable Gate Arrays [91 are provided in [10], [11], [12]. References 1. M.J. Foster and H.T. Kung, "The Design of Special-Purpose VLS1 Chips," IEEE Computer, January 1980, pp. 26-40. 2. R.C. Minnick, ' ~ Survey of Microcellular Research," J. ACM, Vol. 14: 203-241, 1967. 3. R.G. Shoup, Programmable Cellular Logic Arrays, Ph.D. Thesis, Computer Science Department, Carnegie-Mellon University, 1970. 4. Tom Kean, Configurable Logic: A Dynamically Programmable Cellular Architecture and its VLSI Implementation, Ph.D. Thesis, University of Edinburgh, Department of Computer Science, 1989. 5. Algotronix Ltd., CAIA024 Preliminary Data Sheet, Edinburgh, UK, 1989. 6. Peter Denyer and David Renshaw, VLSI Signal Processing: A Bit-Serial Approach, Reading, MA: Addison Wesley, 1985. 7. T.G. Szymanski and C.J. Van Wyk, "GOALIE: A Space Efficient System for VLSI Artwork Analysis" Proc. Int. Conference on Computer Aided Design, 1984. 8. R.E Lyon, "Two's Complement Pipeline Multipliers," 1EEE Transactions on Communications, April 1976. 9. Xilinx Inc. The Programmable Gate Array Design Handbook. San Jose, CA, 1986. 10. J.P. Gray and T.A. Kean, "Configurable Hardware: A New Paradigm for Computation;' Proc. Decennial Caltech Conference on VLS1, Pasadena, CA, March 1989. 11. Jouko Viitanen and Tom Kean, "Image Pattern Recognition using Configurable Logic Cell Arrays," Proc. Computer Graphics International, Leeds, UK, June 1989. 12. Tom Kean, Genbao Feng, Irene Buchanan, and John Gray, 'N Novel ImplementationStyle for TeachingVLSI," Proc. 1989 VLSI Education Conference, Santa Clara, CA, June 1989.
16
Kean and Gray
Tom Kean is a member of the technical staff of Algotronix Ltd., where he was responsible for the design of the CALl024 chip. He has co-authored five technical papers on configurable logic. He received his Ph.D. degree from the University of Edinburgh in 1989 for his thesis describing the CAL architecture. He is a member of the IEEE.
John Gray is the managing director of Algotronix Ltd. Previously he was with European Silicon Structures and Lattice Logic where he directed the development of silicon compiler products. He has more than twenty years experience in electronics and computing and has written extensively on design technology. He received his Ph.D. from the University of Newcastle and he is a member of the ACM.