A NEW MODEL FOR LARGE MEMORIES(I)
R. FRANCESE (2) - L. PAGLI (3) - D. PARENTE (2) - M. TALAMO (4)
ABSTRACT - In this paper a new Computational Memory Model (CMM) is introduced. In CMM, a RAM processor accesses a memory of x cells in log x time. In fact, the usual assumption of the RAM model, that all memory cells are accessed in constant time, becomes impractical as x increases. With a very simple modification of the boolean circuits of the memory, CMM makes it possible to access in constant time a memory cell consecutive to another already accessed cell. Problems of size n requiring time T(n) in the RAM model can be solved in CMM with a multiplicative factor O(log x) in time complexity. Ad hoc algorithms are instead designed for other basic problems, such as searching and sorting.
Received 31 March 1992. (1) This work has been supported in part by the C.N.R. project <>. (2) Dipartimento di Informatica e Applicazioni, Università di Salerno, 84081 Baronissi, Salerno, Italy. (3) Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56100 Pisa, Italy. (4) Dipartimento di Informatica e Sistemistica, Università di Roma La Sapienza, Via Salaria 113, 00136 Roma, Italy.

1. Introduction

In the classical RAM model of computation, the cost of a random access to the memory is generally assumed to be independent of the memory size, and no limitation is posed on the number of memory cells (registers). The assumption of constant access time is realistic if we consider a fast memory of fixed size, such as a random access memory, and a very large amount of slower (secondary) memory. However, the enormous evolution of the technology of integrated circuits makes it possible to design random access memories of larger and larger size, and over a certain size it is no longer possible to consider the access time constant. From VLSI theory, we know that logic gates have limited fan-in and fan-out, and a constant access time becomes unacceptable as the memory size increases [6]. An access time of log x seems to be correct, for a memory of size x, if we consider the above limitations. Some recent papers have addressed this problem from a computational complexity point of view, by defining alternative models with different access time cost functions. In the models proposed in [1,2,3], the main memory is composed of two or more hierarchically organized levels (e.g. cache and main memory), and an access to location i requires time log i or another cost function such as f(i) = i^a, 0 < a ≤ 1. The overheads for the access time can be optimized by bringing data into fast memory and using them several times before returning them to slower memory. In [2] the model allows a «block transfer» (BT): a memory block of size t, starting at location i, can be copied sequentially in time f(i) + t. Efficient sequential algorithms are designed for the models in [1,2], while parallel algorithms are considered in [3]. A different approach is taken in [5], where the memory is a single bank of size x and the extra access time is log x for all memory cells; pipelining is also allowed in the access channels (LPM model). In our work, we follow this last approach for the memory organization: we introduce a Computational Memory Model, CMM, where an access either takes log x time or requires constant time, if the memory cell accessed is adjacent to (immediately follows) the one already accessed. Our approach is different from the «BT-model» [2], since both the memory organization and the access cost function do not depend on the memory address.
It also differs from the one proposed in [5], since the constant-time access is realized in a very simple way, by slightly modifying the boolean circuits of the memory, and does not require complex synchronizations to allow pipelining techniques. From a theoretical point of view, we shall see that our model, in the proposed realization, is a restriction of the LPM: pipelining is only allowed between consecutive cells. As a consequence, lower bounds valid for the LPM also hold here, while algorithms designed for the CMM (upper bounds) can be used in the LPM with obvious modifications. Complexity results found in [1,2] are not valid for CMM. In our model we assume that the number of internal registers of the RAM processor is fixed and
the RAM program is independent of the input size n. We also assume that the memory is properly chosen to host the problem, that is, the memory size is polynomially related to n, i.e. x = n^k. We then have O(log x) = O(log n), that is, the access time is proportional to log n. In the following, without affecting the order of magnitude, we will use log n instead of O(log n). A problem of size n, requiring time T(n) in the RAM model, can obviously be solved in CMM in O(T(n) log n) time but, for some particular problems, it is possible to make use of the constant access facility, which leads to better time complexities. For example, operations of read, write and transfer of consecutive sets of n data can easily be performed in O(n) time. Linear problems in the RAM model, where the entire input has to be examined in sequence (e.g. finding the sum or maximum of a given set of numbers), can be directly implemented in this model still in linear time: only the first access has cost log n. In particular, we present an algorithm for the searching problem implemented in such a way that consecutive data are examined before performing a new random access. The algorithm takes O(log² n / log log n) time instead of the O(log² n) required by binary search in this model. Since in the LPM the complexity of the searching problem is Ω(log² n / log log n), our algorithm is also optimal in CMM. Moreover, we study a sorting scheme based on a generalization of heapsort, requiring time O(n log² n / log log n). The memory organization is presented in Section 2. Searching and sorting algorithms are introduced in Section 3, while Section 4 is devoted to a general discussion.
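The access cost rule just described can be made concrete with a small sketch (our own illustration, not part of the paper): a random access is charged log n, while an access to the cell immediately following the previous one is charged a single step.

```python
import math

def cmm_cost(accesses, n):
    """Total CMM time for a sequence of cell addresses in a memory of size n:
    log n for a random access, one step for an access consecutive to the last."""
    cost = 0.0
    prev = None
    logn = math.log2(n)
    for a in accesses:
        if prev is not None and a == prev + 1:
            cost += 1        # consecutive cell: constant time
        else:
            cost += logn     # random access: log n time
        prev = a
    return cost

# Scanning n consecutive cells costs log n + (n - 1), not n log n:
print(cmm_cost(range(1024), 1024))   # → 1033.0 (10 + 1023)
```

This is why linear scans keep their RAM complexity in CMM, while access patterns that jump (such as binary search) pay log n per probe.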
2. The memory organization

Let us consider a random access memory, consisting of n cells, a Memory Address Register (MAR), a Memory Buffer Register (MBR) and a decoder (DEC), performing the standard read/write operations. The MAR register plus the decoder select the cell involved in the operation, by a cell selection command. In a read operation, the selected cell transmits its contents to the MBR; for this purpose the outputs of all cells are OR-ed, column by column, and sent to the MBR. The read command to the memory is a store (write) command to the MBR. The write command is instead sent to all memory cells, and only the selected one performs a store operation. (Note that the actual organization of a read/write memory is somewhat different [7]. For our purpose, however, we will refer to the above memory scheme, which is simple to visualize and has the same properties as the actual one.)
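This read path can be simulated in a few lines (a sketch of our own, with invented helper names): the decoder produces a one-hot selection, each cell gates its output bits with its selection line, and the MBR receives the column-wise OR of all the gated outputs.

```python
def decode(addr, n):
    # one-hot decoder: n output lines, only line `addr` is high
    return [1 if i == addr else 0 for i in range(n)]

def read_cell(cells, addr):
    # each cell ANDs its bits with its selection line; the MBR stores
    # the column-wise OR of all the gated cell outputs
    sel = decode(addr, len(cells))
    width = max(c.bit_length() for c in cells)
    mbr = 0
    for bit in range(width):
        column_or = 0
        for c, s in zip(cells, sel):
            column_or |= ((c >> bit) & 1) if s else 0
        mbr |= column_or << bit
    return mbr

print(read_cell([5, 12, 7], 1))   # → 12
```

Because exactly one selection line is high, the OR of each column equals the corresponding bit of the selected cell, which is what the scheme relies on.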
Under our hypothesis that the memory is very large, the write command to be transmitted to all memory cells, and the contents of the MBR in a write operation, are broadcast through a binary tree. Moreover, the OR-function of the outputs of all the memory cells is not performed in a single step, because of the bounded fan-in degree, but through a binary tree of log n levels. Intrinsically, the decoder is also a set of binary trees, since it connects log n inputs to n outputs. For these reasons a write operation takes log n steps, and a read operation takes 2 log n steps (log n steps to produce the cell selection and log n steps to transfer the result to the MBR). In order to speed up the read and write operations, data and addresses could be sent in pipeline to the memory, as proposed in [5]. This is obviously possible only if such information is known a priori. We propose a new
Fig. 1 - The new memory scheme, showing the selection signals a_{i−1}, the clock and write command lines, and the broadcasting trees.
memory scheme where the speed-up is obtained with some extra hardware, connecting each memory cell with the consecutive one. We define a global read (global write) operation as the read (write) of a set of r consecutive cells: after the random access to the first of the r memory cells involved, which requires log n time, the access to each of the consecutive cells is performed in constant time. Hence global read and global write operations require O(log n) + r − 1 time. A possible organization of the memory is shown in Figure 1. Besides the broadcasting and OR-trees, the new scheme contains an additional counter register RI of size log n, which stores the number r of cells involved in the global operation; the bits of RI are OR-ed and broadcast to all memory cells (signal c) for r consecutive clock times. A simple read/write operation is performed in the standard way (RI contains 0). The n outputs of the decoder are stored in n flip-flops s_i. To perform a global operation on a set of r cells, the address of the first cell is sent to the MAR, while the number r − 1 is stored in RI. After log n + 2 clock times (log n to produce a_i, one to produce s_i and one more to access the first memory cell), register RI sends r − 1 successive c-values equal to 1 to all memory cells. The first memory access is activated by the 1 of the flip-flop s_i selected by the decoder, while the c-value is 0 (that is, the cell-selection signals are zero except for the one selected by the decoder). At the subsequent clock time we have c = 1, and the successive cell is selected by the preceding one, which is no longer selected. Indeed, let a_i, 0 ≤ i ≤ n − 1, be the outputs of the decoder; the cell selection signal s_i, 0 ≤ i ≤ n − 1 (the bit stored in flip-flop s_i), is obtained as:

   s_i := ¬c · a_i + c · s_{i−1}
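The selection recurrence s_i := ¬c·a_i + c·s_{i−1} can be played out clock tick by clock tick. The sketch below is our own simplified simulation (decoder delay and the MBR path are ignored): with c = 0 the flip-flops load the decoded address, and with c held at 1 the single selected bit shifts to the next cell at each tick.

```python
def step(s, a, c):
    """One clock tick of the selection flip-flops:
    s_i := (not c) and a_i  or  c and s_{i-1}."""
    n = len(s)
    return [(a[i] if not c else (s[i - 1] if i > 0 else 0)) for i in range(n)]

n, first, r = 16, 5, 4                            # global operation on cells 5..8 (r = 4)
a = [1 if i == first else 0 for i in range(n)]    # decoder outputs (one-hot)
s = step([0] * n, a, c=0)                         # c = 0: load the decoded address
selected = [s.index(1)]
for _ in range(r - 1):                            # c = 1 for r - 1 ticks: selection shifts
    s = step(s, a, c=1)
    selected.append(s.index(1))
print(selected)                                   # → [5, 6, 7, 8]
```

At every tick exactly one flip-flop holds a 1, so exactly one cell is active, and the active cell advances by one position per clock, which is what makes the pipelined global operation possible.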
After the first cell is read, in r − 1 successive clock times the read information arrives at the MBR in pipeline. To perform a global write operation, the write command must be AND-ed with s_i, obtained as in the read operation. The r data to be written must be stored in the MBR in r clock times, starting at time 2: once the first is written (after log n + 1 steps), the others are present at the memory cells at the subsequent r − 1 clock times. The usual execution of a computer program is performed by the fetch-execution phase for each instruction. A global operation has to be specified by a single program instruction, because the contents of the MAR register must not be modified during the execution of the whole operation. For this purpose, a special format for global operations must be provided for all the instructions involving memory access. Besides the usual information, the instruction format must include the number of consecutive cells considered (called the rank r). For instance, the operation «sum the content of cell 100 to the content of the accumulator (ACC)» may have the form:

   op code: ADD    rank: 0    address: 100

The same operation replicated on the cells 100, 101, 102, 103 is then expressed as:

   op code: ADD    rank: 4    address: 100
An example of the μ-sequence associated to the ADD operation is shown in Figure 2, where the μ-instructions separated by a comma are executed in parallel, while those separated by a semicolon are executed in sequence. Recall that before
address → MAR, r − 1 → RI;
M(MAR) → MBR;    * read *
α: MBR + ACC → ACC,
   if RI = 0 then stop
   else "read the consecutive", RI − 1 → RI, goto α fi

Fig. 2 - A possible μ-sequence associated to the ADD instruction for the new model.
the first execution of M(MAR) → MBR, log n time must be allowed to attain the proper cell selection, and after the execution of this operation log n time must be allowed before the read is completed. The μ-sequences of all the other operations involving memory accesses can be derived in a similar way. We remark that in CMM, as in the LPM model of [5], the addresses of the data involved in the operations have to be known a priori. As pointed out before, from the point of view of computational complexity, our model can be seen as a particular case of the LPM, where pipelining is restricted to consecutive cells. However, the synchronization problem related to pipelining is only studied in [5] for particular algorithms, and not treated in general. Our scheme instead makes it possible to define, in a simple way and for each program instruction involving the memory, the corresponding global operation.
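The μ-sequence of the global ADD can be executed on a toy machine (a sketch under our own register naming; timing details such as the initial log n wait are omitted, and the rank here counts cells, so rank ≥ 1): the address goes to MAR, the rank minus one to RI, and the accumulate step repeats while RI counts down.

```python
def global_add(memory, address, rank, acc=0):
    """Toy execution of the global ADD micro-sequence:
    address -> MAR, rank-1 -> RI; then repeat: M(MAR) -> MBR,
    MBR + ACC -> ACC, advance to the consecutive cell, RI - 1 -> RI."""
    mar, ri = address, rank - 1
    while True:
        mbr = memory[mar]      # M(MAR) -> MBR
        acc += mbr             # MBR + ACC -> ACC
        if ri == 0:
            return acc         # stop
        mar += 1               # "read the consecutive" cell
        ri -= 1                # RI - 1 -> RI

memory = {100: 3, 101: 5, 102: 7, 103: 9}
print(global_add(memory, address=100, rank=4))   # → 24
```

A single instruction thus sweeps r consecutive cells without reloading the MAR, which is exactly what the special global instruction format is for.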
3. Upper Bounds

An algorithm is obviously optimal in CMM if the extra time required is merely due to the initial access to the memory and to the store of the last result. For example, finding the sum of all the elements of a vector of n elements, and finding the maximum in the vector, can be solved in O(n) time. Computing prefix sums or finding all the 1's in a boolean vector can also be solved in optimal time. Let us now introduce more complex CMM algorithms for sorting and searching. For both problems, we represent the set of n elements as an m-ary tree, where a node contains m elements x_0, …, x_{m−1} and partitions a set of subsequent elements in m (possibly empty) disjoint subtrees T(x_0), …, T(x_{m−1}). The root is at level one. An m-ary tree T of depth h is full if each level of T is complete, except possibly for level h, where the leaves are sons of the leftmost nodes of level h − 1. A complete m-ary tree of height h contains Σ_{i=0}^{h} m^i nodes, that is, Σ_{i=0}^{h} m^{i+1} elements. In Figure 3 an example of a full m-ary tree is given, for m = 3.
Fig. 3 - An example of a 3-ary search tree.
Given an element numbered i, let SON(i) be the number j associated to the first element of the root of T(i); we also indicate i as FATHER(j + y), 0 ≤ y ≤ m − 1.
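With the breadth-first numbering (elements 1..n, m elements and m sons per node), these functions reduce to simple index arithmetic. The closed forms below are our own derivation from the numbering, and they agree with the recursive call SEARCH(y, j·m+1, n) used later in the search procedure.

```python
def son(i, m):
    # SON(i): index of the first element of the root of T(i)
    return m * i + 1

def father(j, m):
    # FATHER(j): the element i such that j lies in son(i)..son(i)+m-1
    return (j - 1) // m

# For m = 3: the root node holds elements 1, 2, 3; the subtree of element 1
# starts at element 4, and elements 4, 5, 6 all have father 1.
print(son(1, 3), [father(j, 3) for j in (4, 5, 6)])   # → 4 [1, 1, 1]
```

Both functions are pure arithmetic on the element index, so navigating the tree costs no extra memory accesses.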
3.1. Searching

We now introduce m-ary search trees. These are full m-ary trees containing n distinct elements, numbered in increasing order with a breadth-first traversal of
the tree, visiting the elements in each node from left to right. Note that each node has m sons instead of m + 1, as in ordinary search trees. As preprocessing for the search operation, the n elements are first sorted; the sorted sequence is then partitioned into three sets A, B and C, where |A| = k1·s1 + k1, |C| = k2·s2 + k2 and |B| = n − (k1·s1 + k1) − (k2·s2 + k2). Each set corresponds to one of the groups of subtrees defined in Figure 4, and each subtree is followed by the element to be assigned to the root. At this point, the elements to be assigned to the root can be selected, and the procedure is applied again to each subtree. At each step m elements are arranged in the tree; the procedure is thus executed n/m times and requires O(n) time in the RAM model. By considering the access time delay, the time required by the preprocessing, after sorting, is actually O(n log n).

Fig. 4 - The shape of the full m-ary search tree. k1 is the number of complete m-ary search subtrees of height j − 1 and size s1; k2 is the number of complete m-ary search subtrees of height j − 2 and size s2.
The search algorithm is then performed in a standard way (see Figure 5). The correctness of the algorithm is trivial to prove. The time complexity of the search algorithm is given by the following:

THEOREM 1. Procedure SEARCH(y, 1, n) takes O((m + log n) log_m n) time and is optimal in CMM.
Proof. The while statement takes O(m + log n) time, since log n is the time to access a node of the m-ary search tree and m is the time to scan it. The algorithm is recursively called O(log n / log m) times. The thesis then follows. □

Procedure SEARCH(y, i, n);
/ given an m-ary search tree MTREE[1:n], determine if y is present /
/ in the m-ary search subtree whose root is MTREE[i] /
if i > n then return ("not found") endif;
j := i;
while (j ≠ i + m) and (MTREE[j] < y) do j := j + 1;
case
  : j = i + m : return ("not found");
  : MTREE[j] = y : return (j);
  : otherwise : call SEARCH(y, j·m + 1, n);
endcase
end.
Fig. 5 - The search algorithm.

Obviously, m = ⌊log n⌋ is optimal for the algorithm. With this choice, the overall complexity is O(log² n / log log n). Procedure SEARCH is optimal in CMM because a lower bound of Ω(log² n / log log n) has been shown in [5] for the same problem.
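The algorithm of Figure 5 can be made runnable once the breadth-first array is available. The builder below is our own sketch for complete trees only (a simplification of the paper's three-set preprocessing), filling each node with the separators of a sorted segment; the search itself follows Figure 5 directly.

```python
def subtree_size(m, h):
    # elements in a complete m-ary search tree of height h (0 for an empty tree)
    return sum(m ** i for i in range(1, h + 1))

def build(tree, i, seg, m, h):
    """Fill the node whose first element index is i with the separators of the
    sorted segment seg, recursing into the m child subtrees (SON(j) = m*j + 1)."""
    if h == 0:
        return
    s = subtree_size(m, h - 1)              # size of each child subtree
    for p in range(m):
        chunk = seg[p * (s + 1):(p + 1) * (s + 1)]
        tree[i + p] = chunk[-1]             # each subtree is followed by its root element
        build(tree, (i + p) * m + 1, chunk[:-1], m, h - 1)

def search(tree, y, i, n, m):
    # Figure 5: scan one node sequentially, then jump to SON(j) = m*j + 1
    if i > n:
        return None
    j = i
    while j != i + m and tree[j] < y:
        j += 1
    if j == i + m:
        return None
    if tree[j] == y:
        return j
    return search(tree, y, j * m + 1, n, m)

m, h = 3, 2
n = subtree_size(m, h)                      # 12 elements
tree = [None] * (n + 1)                     # 1-based array, as in the paper
build(tree, 1, list(range(1, n + 1)), m, h)
print(all(search(tree, y, 1, n, m) is not None for y in range(1, n + 1)))  # → True
print(search(tree, 13, 1, n, m))            # → None
```

Each recursive call scans up to m consecutive cells before the next random access, which is the blocked access pattern that yields the O((m + log n) log_m n) bound.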
3.2. Sorting

The sorting problem can be solved with a generalization of the heapsort algorithm. The data structure supporting the algorithm is still an m-ary tree, called an m-heap, with the following additional properties:
1. the greatest element of the i-th son (from left to right) of a node X is equal to the i-th element in X, for 1 ≤ i ≤ m;
2. each level j, 0 ≤ j ≤ k, where k is the height of the tree, may be incomplete, and the s nodes of level j occur in the s leftmost positions.

Given a set A of n distinct elements to be sorted, let us first see how to construct an m-heap, arranged in an array MHEAP. Though some elements are replicated (by property 1), the size of MHEAP will be O(n). The m-heap is constructed in a bottom-up fashion. It will not result full and balanced, as in the classical version [4], but it still maintains the property of logarithmic height. Suppose that the height k of the m-heap is known. Then insert n' = n − k(m − 1) elements in the leaf level k of the m-heap, while the other k(m − 1) are temporarily put aside. The elements of level k are then grouped in nodes of size m. According to the definition of the m-heap, some other elements may be required (at most m − 1) to complete the rightmost node of level k. These elements are taken from those put aside. The greatest element of each node is then determined, and replicated to be the father of the node at level k − 1. If the number of elements at level k − 1 is not a multiple of m (see property 2), the rightmost node is completed as before. The procedure is repeated until the root is completed. If not all the elements have found an allocation in the m-heap, the remaining elements are inserted in a top-down way, preserving the heap properties. The height k must still be computed; in Theorem 2 we show that the minimum height k must be equal either to ⌊log_m n⌋ or to ⌈log_m n⌉.
THEOREM 2. The minimum height k of an m-heap with n distinct elements is either ⌊log_m n⌋ or ⌈log_m n⌉.

Proof. Let us suppose that the height is k ≤ ⌊log_m n⌋ − 2; then n ≥ m^{k+2}. Since level k contains at most m^{k+1} elements, it results n = n' + k(m − 1) ≤ m^{k+1} + k(m − 1) < m^{k+2}, which is a contradiction. □

Now let us derive k in a constructive way. First set k = ⌊log_m n⌋, compute the number of elements in each level, and how many of these should be taken from those put aside. If at level 0 there are still more than m elements of the initial n', then reset k to ⌈log_m n⌉ and repeat the above computation. This procedure first requires n' log n time to allocate the n' = n − k(m − 1) elements, then (n − n') log² n / log m steps to allocate, in top-down fashion, the remaining elements; these are
(n − n') in the worst case. Hence the overall complexity for the construction of the m-heap is O(n log n) in CMM. The Procedure MHEAPSORT is performed in a classical way, by calling the ADJUST procedure O(n) times. Each time, MHEAPSORT determines the maximum of the elements of the root, which is stored in the array OUTPUT, and replaces it by the rightmost element of the last level. ADJUST is then called to maintain the m-heap properties. The ADJUST procedure is also realized in a standard way (see Figure 6).

Procedure ADJUST (T, size);
/ the procedure maintains the heap property of the subtree /
/ rooted in MHEAP[T] /
r := MHEAP[T]; j := 2; i := SON(T, j);
while i ≤ size do
  max := 0;
  if i + m − 1 < TAB[j, 1] then p := m − 1 else p := TAB[j + 1, 2] − 1 − i endif;
  for s := i to i + p do
    if MHEAP[s] > max then max := MHEAP[s]; t := s endif
  endfor;
  if r ≥ max then exit
  else MHEAP[T] := max; T := t; j := j + 1; i := SON(T, j)
  endif
endwhile;
MHEAP[T] := r
end.
Fig. 6 - The ADJUST procedure.
In the following procedures it is assumed that the number of elements in each level is known (this can easily be determined during the construction of MHEAP). Moreover, the functions SON(i, j) and FATHER(i, j) give the index of the son and of the father of the i-th element at level j, respectively (see Figure 7).

THEOREM 3. Procedure MHEAPSORT(A, n, OUT, ⌈log n⌉) takes O(n log² n / log log n) time in CMM.
Proof. As derived above, the m-heap can be constructed in O(n log n) time. The ADJUST procedure processes at most one node for each level in the worst case,
Procedure MHEAPSORT (A, n, OUT, MHEAP, m);
j := n + 1; t := k;
for i := ntot to 1 by −1 do
  max := 0; y := 0;
  for s := 1 to TAB[1, 1] do    / find the maximum of the m-heap /
    if MHEAP[s] > max then max := MHEAP[s]; y := s endif;
  endfor;
  if OUT[j] ≠ max then j := j − 1; OUT[j] := max endif;    / output the element if not yet considered /
  MHEAP[y] := MHEAP[i];    / copy the last element of the m-heap in the right position of the root /
  TAB[t, 1] := TAB[t, 1] − 1;    / update the size of the m-heap /
  if TAB[t, 1] = 0 then t := t − 1 endif;
  call ADJUST(y, i);
endfor;
end.

Fig. 7 - The sorting algorithm. TAB[i, 1] contains the number of elements at level i and TAB[i, 2] contains the index of the first element at level i; ntot is the number of elements in the m-heap.
hence it takes O((m + log n) log n / log m) time. So the overall complexity of MHEAPSORT is O(n(m + log n) log n / log m). As in the searching problem, the choice m = ⌈log n⌉ is optimal for the algorithm, leading to a complexity of O(n log² n / log log n). □ Note that sorting in CMM seems to be computationally harder than in the LPM (where it is optimally solved in O(n log n)).
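The generalization can be illustrated with an ordinary m-ary max-heap (our own simplification of the paper's m-heap: no element replication, no TAB bookkeeping, and a standard sift-down playing the role of ADJUST), where each sift step examines one node of at most m children per level:

```python
def sift_down(a, root, size, m):
    # move a[root] down; at each level pick the largest of at most m children
    while True:
        first = m * root + 1                 # children of node i: m*i+1 .. m*i+m
        if first >= size:
            return
        largest = max(range(first, min(first + m, size)), key=a.__getitem__)
        if a[largest] <= a[root]:
            return
        a[root], a[largest] = a[largest], a[root]
        root = largest

def m_heapsort(a, m):
    size = len(a)
    for i in range(size // m, -1, -1):       # build phase, bottom-up
        sift_down(a, i, size, m)
    for end in range(size - 1, 0, -1):       # repeatedly extract the maximum
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end, m)
    return a

print(m_heapsort([9, 1, 8, 2, 7, 3, 6, 4, 5], 3))   # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A larger m makes the tree shallower (log_m n levels) at the price of scanning m children per level, which is the same trade-off that makes m = Θ(log n) the right choice in CMM.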
4. Concluding remarks

This paper introduces a new model of computation called CMM. The new model differs from the RAM in that a memory access takes log n time, or constant time if the location accessed is consecutive to the one just accessed. Obviously, the RAM polynomial time algorithms run on CMM with at most an O(log n) multiplicative factor in the time complexity. Some problems, such as finding the maximum of an unsorted list, or computing the sum or other arithmetic functions of the elements of the list, are performed optimally thanks to the consecutive constant access facility.
For the problems of searching and sorting, we have provided algorithms that reduce the O(log n) factor by carefully rearranging the data elements in blocks and scanning each block before accessing a new block. This strategy is the same as that used to access secondary storage, where however the block size is fixed and the access function is of higher order than the memory access time, hence greater than log n. In CMM, the block size can be suitably defined; in the given searching and sorting algorithms a block size of log n yields optimal complexity values. For other problems, the factor O(log n) does not seem to be improvable. Further studies will be directed towards defining smart algorithms for other classical problems, or showing that the time T(n) log n is a lower bound. As a conclusion, we point out that our memory organization is very simple and can be a useful reference scheme; if we introduce a bank of fast memory (e.g. a cache memory) with the consecutive access facility, such that a block of t cells can be moved in t + log n time, this constitutes a practical realization of the BT-model of [2]. Finally, we observe that the values of the constants occurring in the complexities of the searching and sorting procedures are comparable to those of the equivalent procedures for the traditional RAM model [4]. Thus our algorithms are reasonable in the new model.
REFERENCES
[1] A. AGGARWAL, B. ALPERN, A. K. CHANDRA, M. SNIR, A model for hierarchical memory, Proc. 19th ACM Symp. on Theory of Computing, (1987), 305-314.
[2] A. AGGARWAL, B. ALPERN, A. K. CHANDRA, M. SNIR, Hierarchical memory with block transfer, Proc. 28th IEEE F.O.C.S., (1987), 1-13.
[3] A. AGGARWAL, A. K. CHANDRA, M. SNIR, Communication latency in PRAM computation, Proc. 1st ACM-SPAA, (1989), 11-21.
[4] A. V. AHO, J. E. HOPCROFT, J. D. ULLMAN, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974.
[5] F. LUCCIO, L. PAGLI, Sequential computation based on pipelined access to memory, Proc. 27th Allerton Conference, (1989), 702-711.
[6] J. D. ULLMAN, Computational Aspects of VLSI, Computer Science Press, Rockville, MD, 1984.
[7] F. P. PREPARATA, Introduction to Computer Engineering, Harper & Row, New York, NY, 1985.