Journal of VLSI Signal Processing 49, 3–18, 2007
© 2007 Springer Science + Business Media, LLC. Manufactured in The United States. DOI: 10.1007/s11265-007-0058-5
A Decimal Floating-Point Divider Using Newton–Raphson Iteration
LIANG-KAI WANG AND MICHAEL J. SCHULTE
Department of ECE, University of Wisconsin-Madison, Madison, WI, USA
Received: 9 March 2005; Revised: 9 May 2006; Accepted: 15 August 2006
Abstract. Increasing chip densities and transistor counts provide more room for designers to add functionality for important application domains into future microprocessors. As a result of rapid growth in financial, commercial, and Internet-based applications, hardware support for decimal floating-point arithmetic is now being considered by various computer manufacturers, and specifications for decimal floating-point arithmetic have been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE P754). In this paper, we present an efficient arithmetic algorithm and hardware design for decimal floating-point division. The design uses an efficient piecewise linear approximation, a modified Newton–Raphson iteration, a specialized rounding technique, and a simplified decimal incrementer and decrementer. Synthesis results show that a 64-bit (16-digit) implementation of the decimal divider, which is compliant with the current version of IEEE P754, has an estimated critical path delay of 0.69 ns (around 13 FO4 inverter delays) when implemented using LSI Logic's 0.11 micron Gflx-P standard cell library.

Keywords: decimal, division, floating-point, Newton–Raphson iteration, initial approximation, computer arithmetic, hardware design
1. Introduction
Humans typically represent numerical data and perform computations using decimal arithmetic. One of the first electronic computers, the ENIAC, used decimal arithmetic to perform the basic arithmetic operations [1]. Currently, however, commercial microprocessors do not provide instructions or hardware support for decimal floating-point arithmetic. Consequently, decimal numbers are often read into computers, converted to binary numbers, and then processed using binary floating-point arithmetic. Results are then converted back to decimal before being stored. Besides being time-consuming, this process is error-prone, since most decimal numbers, such as 0.2, cannot be exactly represented as binary numbers. Thus, if binary floating-point arithmetic is
used to process decimal data, unexpected results may occur after a few computations [2–4]. In many commercial applications, including financial analysis, banking, tax calculation, currency conversions, insurance, and accounting, the errors introduced by converting between decimal and binary numbers are unacceptable and may violate legal accuracy requirements [5]. Therefore, these applications often store data in decimal format and use software to perform decimal floating-point arithmetic. For example, a survey of numerical data in commercial databases of 51 major organizations found that 55.0% were decimal numbers and 43.6% were integers, which could have been stored as decimal numbers [6, 7]. Several programming languages, including COBOL [8], Java [9], Visual Basic [10], C [11], and C# [12], provide support for
decimal arithmetic, through either primitive data types or libraries. Because of the growing importance of decimal floating-point arithmetic, specifications for it have recently been added to the draft revision of the IEEE-754 Standard for Floating-Point Arithmetic (IEEE P754) [13]. The IEEE P754 Standard is still being reviewed and may change. In the remainder of the paper, we use IEEE P754 to refer to the current version of the IEEE P754 Standard [13]. Although decimal arithmetic operations can be emulated using decimal software libraries, software emulation of decimal floating-point arithmetic operations is typically 100 to 1,000 times slower than hardware implementations of equivalent binary floating-point arithmetic operations [7, 14]. In [15], Schulte et al. evaluate and compare the performance of decimal floating-point arithmetic operations when implemented on superscalar processors using the decNumber library [11] and specialized hardware designs. The processors are configured based on information given in [16] and the hardware cycle counts are from either the results presented in [17–19] or the authors' estimates. Their paper demonstrates that hardware implementations of decimal floating-point arithmetic operations are one to two orders of magnitude faster than their software counterparts. Because commercial benchmarks often have a relatively low percentage of floating-point division instructions, chip designers usually focus on the design of floating-point adders and multipliers. Floating-point division is often performed using software libraries, microcode, or relatively slow floating-point dividers [20]. As a result, programmers often avoid the long latency of floating-point division by developing division-free methods to approximate the original algorithms. However, division-free methods may overflow or display poor numerical behavior [20].
Although floating-point division accounts for a relatively low percentage of all instructions, Oberman and Flynn show that the latency of binary floating-point division has a significant impact on the overall performance of commercial benchmarks due
to data dependencies [21]. We anticipate similar results for decimal floating-point division [15]. This paper presents an arithmetic algorithm and hardware design for decimal floating-point division. Compared to previous decimal dividers [22–26], which use digit recurrence algorithms and only produce one quotient digit each iteration, our divider doubles the number of accurate digits each iteration. This is especially useful for large operand sizes. Furthermore, the proposed decimal divider can be modified to share hardware with decimal multiplication and square root units. Several novel techniques, including an efficient piecewise linear approximation, a modified Newton–Raphson iteration, a specialized rounding technique, and a simplified decimal incrementer/decrementer, are used to improve the divider's area and performance. To the best of our knowledge, our design is the first decimal floating-point divider that is compliant with IEEE P754 using the Densely Packed Decimal encoding. The rest of the paper is organized as follows. Section 2 describes the decimal floating-point formats and exceptions specified in IEEE P754. Section 3 describes our decimal division algorithm, along with a detailed error analysis. Section 4 presents the hardware design for our decimal floating-point divider. Section 5 gives synthesis results and analyzes the performance of our division algorithm for different implementations. Section 6 summarizes previous research on decimal division. Section 7 gives our conclusions. This paper is an extension of the research presented in [19].

2. Decimal Floating-point Numbers in IEEE P754
Figure 1 shows the basic decimal number format using the Densely Packed Decimal (DPD) encoding specified in IEEE P754 [13]. The 1-bit Sign Field, S, indicates the sign of the number. The 5-bit Combination Field, G, provides the two leading exponent bits and the leading digit of the significand.
Figure 1. Basic decimal floating-point format: a 1-bit Sign Field S, a 5-bit Combination Field G, a w-bit Following Exponent Field F, and a t = (10 × J/3)-bit Trailing Significand Field T holding J digits.
Table 1. Parameters for different decimal floating-point formats.

Format name                          | Decimal32 | Decimal64 | Decimal128
Total storage width                  | 32        | 64        | 128
Sign field width                     | 1         | 1         | 1
Combination field width              | 5         | 5         | 5
Exponent field width (w)             | 6         | 8         | 12
Trailing significand field width (t) | 20        | 50        | 110
Total significand digits (n = J + 1) | 7         | 16        | 34
Exponent bias                        | 101       | 398       | 6,176
emax                                 | +96       | +384      | +6,144
emin                                 | −95       | −383      | −6,143
This field is also used to represent the special values NaN (not-a-number) and ∞. The w-bit Following Exponent Field, F, combined with the two leading exponent bits from G, represents a (w + 2)-bit biased binary exponent, E. The t-bit Trailing Significand Field, T, uses the Densely Packed Decimal (DPD) encoding, which represents three decimal digits using a 10-bit block, or declet. Consequently, for a (J + 1)-digit decimal number, the J least significant digits are encoded in DPD format, such that only 10J/3 bits are required in the Trailing Significand Field, T. More details on the DPD encoding are given in [27]. In IEEE P754, there are three basic decimal formats: decimal32, decimal64, and decimal128, where the trailing number corresponds to the total storage width of the decimal number in bits. Table 1 shows the parameters for each format. The value of a finite IEEE P754 decimal number is

v = (−1)^S × 10^e × m

where S is the sign of the decimal number, e is the exponent, and m is the significand, which represents a decimal fraction of the form d0.d1 d2 ... d(n−1), n = J + 1, 0 ≤ di < 10, with at least one di nonzero if the decimal number is not zero. The smallest positive normal number is 10^emin and the largest is 10^emax × (10 − 10^(1−n)). A decimal number with an absolute value less than 10^emin is called subnormal. For example, 0.001 × 10^emin is a representable subnormal number, but 0.001 × 10^(emin+3) is a normal number. Unlike binary arithmetic, in decimal arithmetic the significand is usually represented as an integer. Therefore, instead of the previous representation, the following representation is often used:

v = (−1)^S × 10^q × c

Here, q is the new exponent and can be any integer such that emin − (n − 1) ≤ q ≤ emax − (n − 1), and c is the new significand, which is represented as a decimal integer. Since decimal numbers are not normalized, a subnormal decimal number is difficult to detect. Leading zero detection circuitry, along with some combinational logic, is used to determine if a decimal operation generates a subnormal result, in which case an underflow exception is raised. Since the exponent of a decimal number is biased such that E ≥ 0, the bias is equal to −(emin − (n − 1)). There are several exceptions that need to be handled during decimal floating-point division: Division by Zero, Inexact, Invalid Operation, Overflow, and Underflow. Table 2 shows the conditions and the default output for each exception. As defined in IEEE P754, there are some important differences between binary and decimal floating-

Table 2.
Exceptions, conditions, and default outputs for decimal division.
Exception         | Conditions                                                                               | Default output
Division by Zero  | The significand of the divisor is zero and the dividend is a finite nonzero number.      | Correctly signed ∞
Inexact           | The quotient is not exact.                                                               | Rounded quotient
Invalid Operation | Either operand is a signaling NaN, or the dividend and divisor are both zero or both ∞.  | Quiet NaN
Overflow          | The quotient exceeds the largest finite number in the designated format.                 | ±∞ or ±10^emax × (10 − 10^(1−n))
Underflow         | The quotient is between −10^emin and 10^emin.                                            | 0, subnormal, or ±10^emin
point numbers. First, the significand of a decimal number is not normalized, which means that a single decimal floating-point value may have multiple representations. A set of these equivalent decimal numbers is called a cohort. Because of this characteristic, the standard defines the Preferred Representation Exponent, which specifies a required exponent (and implicitly the significand) after a decimal operation. For example, for division, if the quotient is not exact, the preferred exponent is the least possible exponent of the quotient, so as to preserve the maximum precision in the significand. If the quotient is exact, the preferred exponent of the result is ey − ex, where ex and ey are the exponents of X and Y, respectively. More details on the preferred exponent are given in Section 4.2.
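Python's `decimal` module follows the same underlying decimal arithmetic specification, so the preferred-exponent rule can be observed directly (a quick illustration of the rule, not of the hardware in this paper):

```python
from decimal import Decimal, getcontext

getcontext().prec = 16                    # decimal64-like precision

# Exact quotient: the preferred exponent ey - ex is kept, so the
# result is delivered as 1.20 (exponent -2), not 1.2.
print(Decimal("2.40") / Decimal("2"))     # 1.20

# Inexact quotient: rounded to full precision, i.e., the least
# possible exponent, preserving maximum significance.
print(Decimal("1") / Decimal("3"))        # 0.3333333333333333
```

The two results have the same numerical value as `1.2` and a 16-digit approximation of one third, but their exponents (and thus their cohort members) are fixed by the preferred-exponent rule.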
3. Decimal Division Algorithm
Newton–Raphson iteration provides a high-speed method for performing division. With Newton–Raphson iteration, the division Q = Y/X is performed by first obtaining an initial approximation to the divisor's reciprocal, R0 ≈ 1/X. Next, m Newton–Raphson iterations are performed to produce an improved reciprocal approximation, Rm. The dividend, Y, is then multiplied by Rm to obtain an approximate quotient, Q′, which is adjusted and rounded to obtain the final quotient, Q. Prior to obtaining the initial approximation, the operand significands are converted from DPD to Binary Coded Decimal (BCD) and then normalized. The decimal division algorithm presented in this section uses several novel techniques, including an efficient piecewise linear initial approximation, a modified Newton–Raphson iteration, a specialized rounding technique, and a simplified decimal incrementer/decrementer, to improve the divider's area and performance. For the remainder of the paper, Y and X correspond to normalized decimal significands that are viewed as fractions, such that 0.1 ≤ Y < 1.0 and 0.1 ≤ X < 1.0. Without loss of generality, we also assume that Y ≤ X, which gives 0.1 < Q ≤ 1.
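The quadratic convergence that motivates this approach is easy to demonstrate (a Python sketch with the `decimal` module; the seed value 1.4 is an arbitrary illustrative estimate, not the initial approximation of Section 3.1):

```python
from decimal import Decimal, getcontext

getcontext().prec = 40          # generous working precision

x = Decimal("0.7")              # divisor significand, 0.1 <= X < 1
r = Decimal("1.4")              # crude estimate of 1/X (illustrative seed)
for i in range(4):
    r = r * (2 - x * r)         # one Newton-Raphson step
    err = abs(r - 1 / x)
    print(i, r, err)            # the error roughly squares each pass
```

Starting from an estimate that is good to about two digits, four iterations are already accurate to more than 25 digits, which is why the divider needs only a handful of iterations after a modest table lookup.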
3.1. Initial Reciprocal Approximation

The number of Newton–Raphson iterations needed depends on the accuracy of the initial approximation and the desired accuracy of the final quotient. In [28], Takagi proposes a general framework to compute X^p for certain values of p. The analysis presented in [28] focuses on binary operands and radix-2^i signed-digit representations. In this paper, we modify Takagi's method and error analysis for decimal floating-point division. Like Takagi's method, our method is based on a piecewise first-order Taylor series expansion, which approximates a function f(X) close to the point A as:

f(X) ≈ f(A) + f′(A) × (X − A)    (1)

To obtain the initial approximation, the n-digit divisor significand, X = [0.X(n−1) X(n−2) ... X0], is divided into a k-digit more significant part, XM = [0.X(n−1) X(n−2) ... X(n−k)], and an (n − k)-digit less significant part, XL = [0.X(n−k−1) X(n−k−2) ... X0] × 10^−k. Using XM as the input to a lookup table divides the original input interval [0.1, 1) into subintervals of size 10^−k. On the subinterval [XM, XM + 10^−k), a standard piecewise Taylor series expansion of f(X) = 1/X about the subinterval midpoint A = XM + 5 × 10^(−k−1) has the form:

1/X ≈ 1/(XM + 5 × 10^(−k−1)) − (X − XM − 5 × 10^(−k−1))/(XM + 5 × 10^(−k−1))²
    = (2 × XM − X + 10^−k)/(XM + 5 × 10^(−k−1))²    (2)

Since 2 × XM − X = XM − XL and (10^−k − XL) corresponds to the ten's complement of XL, Eq. (2) can be rewritten as:

1/X ≈ (XM + 10^−k − XL)/(XM + 5 × 10^(−k−1))²
    = (XM + X̄L + 10^−n)/(XM + 5 × 10^(−k−1))²    (3)

where X̄L is the nine's complement of XL and 10^−n is added to obtain the ten's complement of XL. Thus, the reciprocal approximation, R0, can be obtained as 1/X ≈ R0 = C0 × X′, where

C0 = 1/(XM + 5 × 10^(−k−1))²,   X′ = XM + X̄L + 10^−n    (4)
This allows the linear approximation to be performed using:

- A table lookup on XM to obtain C0
- Operand modification of X to obtain X′
- Multiplication of C0 by X′ to obtain R0.
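These three steps can be checked numerically with a short model (a Python sketch using the `decimal` module; the digit split and the choice k = 3 are illustrative, and computing C0 = 1/A² directly stands in for the hardware lookup table):

```python
from decimal import Decimal, getcontext

getcontext().prec = 30

def r0_via_operand_modification(x_digits, k=3):
    """x_digits: the n-digit significand of X as a string, X = 0.x_digits.
    Returns R0 = C0 * X' per Eqs. (3)-(4), with exact arithmetic."""
    n = len(x_digits)
    xm = Decimal("0." + x_digits[:k])              # k-digit upper part XM
    # Nine's complement of the lower digits plus 10^-n gives the
    # ten's complement of XL without a carry-propagate subtraction.
    xl_bar = "".join(str(9 - int(d)) for d in x_digits[k:])
    x_prime = xm + Decimal("0." + "0" * k + xl_bar) + Decimal(10) ** -n
    a = xm + 5 * Decimal(10) ** (-k - 1)           # subinterval midpoint A
    c0 = 1 / (a * a)                               # table lookup in hardware
    return c0 * x_prime

x = Decimal("0.7346298")
r0 = r0_via_operand_modification("7346298")
# R0 under-approximates 1/X and is accurate to roughly 2k - 3 digits
print(1 / x - r0)
```

Running this confirms that R0 computed from the operand-modified X′ matches the direct evaluation of Eq. (2) and stays below 1/X, as the error analysis below requires.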
The approximation error, ε_approx, from this method is upper-bounded by the second-order term of the Taylor series expansion at A, which gives:

|ε_approx| ≤ |(1/2) × f″(A) × (X − A)²|
          = 1/(XM + 5 × 10^(−k−1))³ × (X − XM − 5 × 10^(−k−1))²
          = 1/(XM + 5 × 10^(−k−1))³ × (XL − 5 × 10^(−k−1))²    (5)

Since 0.1 ≤ XM < 1 and 0 ≤ XL < 10^−k, ε_approx is bounded by:

|ε_approx| < 1/(0.1 + 5 × 10^(−k−1))³ × 25 × 10^(−2k−2)
           < 25 × 10^(−2k−2)/0.1³
           = 10^(−2k+3)/4    (6)

When computed with infinite precision, ε_approx ≤ 0, such that R0 ≤ 1/X, since the piecewise linear Taylor series expansion of 1/X always under-approximates 1/X. In practice, the nine's complement of the most significant digits of XL is used instead of the ten's complement of XL, only the most significant digits of C0 are stored in the table, and the product C0 × X′ is truncated. The nine's complement of XL is obtained from XL using only simple two-level logic. Since the approximation error in the initial estimate is less than 10^(−2k+3)/4, our goal is to limit the overall error in the initial estimate to less than 10^(−2k+3), so that the initial approximation is still accurate to at least (2k − 3) fraction digits. A second goal is to ensure that the error in the initial approximation is less than zero, since this simplifies the Newton–Raphson iterations and final rounding. For the initial approximation, only the 2k most significant digits of X′ and C0 are used, and R0 is truncated to 2k − 1 digits. Thus, the value actually computed for the initial approximation is

R0 = (C0 + ε_C0) × (X′ + ε_X′) + ε_trunc = 1/X + ε_R0    (7)

where ε_C0, ε_X′, and ε_trunc correspond to the errors due to truncating C0, X′, and R0, respectively. Consequently,

ε_R0 = X′ × ε_C0 + C0 × ε_X′ + ε_C0 × ε_X′ + ε_trunc + ε_approx    (8)

Since 0.1 ≤ X′ < 1.0, 1.0 < C0 < 100, and 1 < R0 ≤ 10, we have −10^(−2k+2) < ε_C0 ≤ 0, −10^(−2k) < ε_X′ ≤ 0, and −10^(−2k+2) < ε_trunc ≤ 0, which gives the bounds:

−(1.0 × 10^(−2k+2) + 100 × 10^(−2k) + 10^(−2k+2) + 2.5 × 10^(−2k+2)) < ε_R0 < 0
−0.55 × 10^(−2k+3) < ε_R0 < 0    (9)

Thus, if the k most significant digits of X are used to access a lookup table, where each entry contains 2k digits, the initial approximation is accurate to at least 2k − 3 fraction digits. If the lookup table is accessed using k 4-bit BCD digits and the output of the table is 2k 4-bit BCD digits, the total table size is 2^4k × 8k bits. To reduce the size of the lookup table, it is accessed using a Densely Packed Decimal (DPD) encoded version of XM, and the output, C0, is also in DPD format. Simple conversion logic, which takes roughly two gate delays, is used before and after the lookup table to convert between BCD and DPD formats. Since DPD represents three decimal digits using just 10 bits, this approach reduces the size of the lookup table to roughly 2^j × 2j bits, where j = ⌈(k × 10)/3⌉. For example, with k = 3 the size of the memory lookup is reduced from 12 KB to only 2.5 KB. Compared to conventional piecewise linear approximations, which require two coefficients to be read from memory and a decimal multiply-accumulate unit, our method only requires one coefficient to be read from memory and a decimal multiplier. Our use of DPD encoding reduces the memory requirements by more than a factor of four, and only introduces a tiny conversion overhead. Finally, the same decimal multiplier is used to help perform the Newton–Raphson iterations, final multiplication, and rounding. Important differences between our initial approximation method and the method presented in [28]
include (1) the values of the coefficients stored in the lookup table, (2) the use of the DPD encoding to reduce the size of the table lookup, and (3) the method for modifying the input operand to produce X′. We also perform a detailed error analysis to determine the accuracy of R0 and suitable lengths for X′, C0, and R0.

3.2. Newton–Raphson Iteration

In this paper, we use an efficient version of the first-order Newton–Raphson division algorithm to approximate 1/X given the initial reciprocal approximation R0. We decrease the latency of the algorithm by performing reduced-precision calculations during the early stages of Newton–Raphson iteration, while guaranteeing that the number of accurate digits doubles each iteration. We also replace subtraction by a much simpler nine's complement operation, as described below. Although similar techniques have been applied to binary Newton–Raphson division, applying them to decimal Newton–Raphson division requires a new error analysis and changes the precision that is needed in intermediate results. The first-order Newton–Raphson iterative equation for division is [29]:

R(i+1) = Ri × (2 − X × Ri)    (10)

Since Ri = 1/X + ε_Ri, where ε_Ri is the error in iteration i, Eq. (10) can be rewritten as:

R(i+1) = (1/X + ε_Ri) × (2 − X × (1/X + ε_Ri))    (11)

Since Ri < 1/X, we get X × Ri < 1 and X × Ri ≈ 1, which allows 2 − X × Ri to be approximated by taking the nine's complement of the fraction digits of X × Ri and setting the integer digit to one. We also avoid full-precision multiplications, which results in the new iterative equation:

R(i+1) = (1/X + ε_Ri) × (1 − X × ε_Ri + ε_m1) + ε_m2    (12)

where ε_m1 is the error due to truncating X × Ri and taking its nine's complement to get V ≈ 2 − X × Ri, and ε_m2 is the error due to truncating Ri × V. Eq. (12) can then be rewritten as:

ε_R(i+1) = R(i+1) − 1/X = −X × ε²_Ri + ε_Ri × ε_m1 + ε_m1/X + ε_m2    (13)

The error in the initial reciprocal approximation R0 is bounded by −0.55 × 10^−G < ε_R0 < 0, where G = 2k − 3. Truncating X × R0 to 2G + 2 fraction digits and taking its nine's complement results in an error ε_m1, which is bounded by −10^(−2G−2) ≤ ε_m1 < 0. Similarly, truncating R0 × V to 2G + 1 fraction digits results in an error ε_m2 that is bounded by −10^(−2G−1) < ε_m2 ≤ 0. In Eq. (13), if ε_Ri < 0, then ε_Ri × ε_m1 > 0, ε_m1/X < 0, −X × ε²_Ri < 0, and ε_m2 ≤ 0. Since |ε_Ri × ε_m1| < |ε_m1/X|, it can be ignored when computing the error bounds, which gives:

|ε_R1| < 10^(−2G−2)/0.1 + 1.0 × (0.55 × 10^−G)² + 10^(−2G−1) < 0.5025 × 10^(−2G)    (14)

It is worthwhile to note that Eq. (14) shows the worst-case error, which occurs if the required precision in the last iteration is exactly 2G or if all the intermediate values are truncated due to hardware optimization. In general, however, if k is carefully selected, no rounding is required on the results from the inner multiplication in the early iterations of Newton–Raphson iteration. In the last iteration, the approximation error from X × ε²_Ri can be ignored and only the errors from ε_m1 and ε_m2 contribute to the total error. Each Newton–Raphson iteration more than doubles the number of accurate digits. By either keeping or truncating the result of each multiplication and taking the nine's complement of X × Ri, it is guaranteed that R(i+1) < 1/X, which simplifies computing 2 − X × Ri and the final rounding. Newton–Raphson iteration continues for m iterations, until |ε_Rm| < 10^(−n−2).

3.3. Rounding Scheme

Since the Newton–Raphson iterations only produce a reciprocal approximation, Rm, our algorithm needs an
additional multiplication to generate a preliminary quotient, which is then adjusted to produce the correctly rounded IEEE quotient. However, a small error between the true quotient and the preliminary quotient can cause the preliminary quotient to round incorrectly. Consequently, an important consideration for the Newton–Raphson division algorithm is how to adjust the preliminary quotient to obtain the correctly rounded quotient. Unlike binary floating-point division, decimal floating-point division can produce infinitely precise quotients that are exactly halfway between two floating-point numbers. Decimal division also has to correctly handle guard digits that can have values between 0 and 9. Therefore, the rounding scheme presented here for decimal division is more complex than rounding schemes for binary division. The intuitive way to handle rounding with an approximate quotient is to compute the remainder, N, using the preliminary quotient, Q′, dividend, Y, and divisor, X, as:

N = Y − Q′ × X    (15)

and then use the sign of the remainder, and whether the remainder is zero, to help determine the rounding direction. The dividend is normalized to 0.1 ≤ Y < 1.0 and we assume that Y ≤ X, which gives 0.1 < Q ≤ 1. The preliminary quotient is obtained by multiplying the dividend by the divisor's reciprocal, such that Q′ = Y × Rm. However, a small error between the true quotient, Q, and the preliminary quotient, Q′, may cause Q and Q′ to round in different directions. For example, if Q = 0.19 + 10^−10 and Q′ = 0.19 − 10^−10, when the rounding mode is round toward zero and the rounded quotient has n = 7 digits, then Q rounds to 0.1900000, but Q′ rounds to 0.1899999. To correctly round the quotient, we first adjust Q′ to obtain the (n + 1)-digit quotient, Q″, by truncating Q′ to (n + 1) digits and then adding 10^−(n+1) to the result. This technique is similar to what is done in [30] for binary division. The error then becomes:

−10^−(n+1) ≤ Q − Q″ ≤ 10^−(n+1)    (16)

Q″ is now used instead of Q′ to determine the sign of the remainder and whether the remainder is equal to zero. The nth fraction digit of Q″ is called its least significant digit (LSD) and the (n + 1)th fraction digit of Q″ is called its guard digit (GD). By observing the LSD and GD of Q″, and the sign and equality with zero of the remainder, the correctly rounded quotient is selected as Q″_T, Q″_T + 10^−n, or Q″_T − 10^−n, where Q″_T corresponds to Q″ truncated to n digits. An action table is used to determine the correct quotient. Schwarz [31] and Oberman [32] propose simplified methods for determining the correct quotient. For example, Oberman showed that for binary division (1) the sign of the remainder can be determined by comparing the least significant bit (LSB) of Y and the corresponding bit of Q″ × X, and (2) the remainder is equal to zero if all the bits to the right of the nth fraction bit of Q″ × X are zero. In our decimal division algorithm, a similar method is used, but the action table and its associated logic become more complex. In binary, the guard bit can only be 0 or 1, while in decimal, we need to consider four regions for the guard digit: GD = 0, 1 ≤ GD ≤ 4, GD = 5, and 6 ≤ GD ≤ 9. Moreover, with decimal floating-point division, a special case is needed to deal with quotients that are exactly halfway between two floating-point numbers, and additional rounding modes are required. With decimal division, the maximum difference between Y = Q × X and Q″ × X is bounded by

−10^−(n+1) × (1 − 10^−n) ≤ Q × X − Q″ × X < 10^−(n+1) × (1 − 10^−n)    (17)

Thus, the maximum absolute difference between Y and Q″ × X is less than 10^−(n+1). Consequently, the sign of the remainder can be determined by comparing the LSD of Y and the corresponding digit of Q″ × X. If the LSD of Y is not equal to the nth fraction digit of Q″ × X, the remainder is positive. Otherwise, the remainder is negative or zero. The remainder is zero if all of the digits to the right of the nth digit in Q″ × X are zero, since this means that Q″ is the exact quotient. Table 3 explains the abbreviations of the rounding modes supported by our rounding scheme. Of these rounding modes, RNE, RNA, RPI, RMI, and RTZ are required for decimal arithmetic in IEEE P754. Our rounding scheme also supports the RNT and RAZ rounding modes, which are included in the Java BigDecimal class and considered useful in financial applications.
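The two remainder tests can be modeled directly (a Python behavioral sketch; the hardware derives the same answers from digit comparisons on Q″ × X rather than from the full products and comparisons shown here):

```python
from decimal import Decimal

def remainder_tests(y, q_pp, x, n):
    """Sign/zero tests on N = Y - Q''*X for an n-digit quotient.
    y has n fraction digits; q_pp is Q'' with n+1 fraction digits."""
    p = q_pp * x                                # exact decimal product
    nth = lambda v: int((v * 10 ** n) % 10)     # nth fraction digit of v
    is_zero = (p == y)                          # all digits right of the nth are zero
    is_positive = nth(p) != nth(y)              # digit mismatch => N > 0
    return is_positive, is_zero

# Q = 3/7 = 0.4285714..., n = 7; two candidate adjusted quotients Q''
y, x = Decimal("0.3000000"), Decimal("0.7")
print(remainder_tests(y, Decimal("0.42857143"), x, 7))   # (False, False): N < 0
print(remainder_tests(y, Decimal("0.42857142"), x, 7))   # (True, False):  N > 0
```

The single-digit comparison works only because |Y − Q″ × X| < 10^−(n+1), as established by Eq. (17); without that bound the mismatch test would be ambiguous.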
Table 3. Explanation of rounding mode abbreviations.
Abbreviation | Explanation
RNE          | Round to nearest, ties round to even number
RNA          | Round to nearest, ties round away from zero
RNT          | Round to nearest, ties round toward zero
RPI          | Round toward positive infinity
RMI          | Round toward negative infinity
RTZ          | Round toward zero
RAZ          | Round away from zero
Table 4 is the action table for rounding in decimal division. In Table 4, LSB corresponds to the least significant bit of the least significant digit of Q″, GD corresponds to the guard digit of Q″, X corresponds to don't care, and (+/−) next to each rounding mode corresponds to the sign of Q″. Based on the LSB, GD, remainder, rounding mode, and sign of Q″, the correctly rounded quotient is selected as Q″_T, Q″_T+ = Q″_T + 10^−n, or Q″_T− = Q″_T − 10^−n. For example, if LSB is 0, GD is 5, the remainder is 0, the rounding mode is RNA, and the sign of Q″ is negative, then the final quotient is equal to Q″_T+ = Q″_T + 10^−n.

3.4. Summary of the Decimal Division Algorithm

In summary, our decimal division algorithm computes Q = Y/X with the following steps:

1. Use the k most significant digits of X (i.e., XM) to perform a table lookup, which provides C0 = 1/(XM + 5 × 10^(−k−1))² truncated to 2k digits (two integer digits and 2k − 2 fraction digits). The table inputs and outputs are in DPD format to reduce the size of the lookup table.
Table 4. Action table for rounding in decimal division.
2. Obtain X′ = XM + X̄L by leaving the k most significant digits of X (i.e., XM) unchanged and taking the nine's complement of the k next most significant digits of X (i.e., XL).

3. Compute the initial reciprocal approximation R0 = C0 × X′ and truncate the result to 2k − 1 digits (one integer digit and 2k − 2 fraction digits), where −0.55 × 10^(−2k+3) < ε_R0 < 0.

4. Perform m Newton–Raphson iterations to obtain an improved reciprocal approximation Rm, where −10^(−n−2) < ε_Rm < 0. Each Newton–Raphson iteration consists of the computations:
(a) V = (2 − X × Ri), where V is obtained by taking the nine's complement of the 2G + 2 most significant fraction digits of X × Ri and setting the integer digit to one.
(b) R(i+1) = Ri × V, where Ri × V is truncated to 2G + 1 fraction digits. Each Newton–Raphson iteration doubles the number of accurate digits in the reciprocal approximation.

5. Compute Q″ by truncating Q′ = Rm × Y to n + 1 fraction digits and then adding 10^−(n+1) to the result.

6. Perform the multiplication Q″ × X and compare its nth fraction digit to the LSD of Y to determine the sign of the remainder. Also, examine the n least significant digits of Q″ × X to determine if the remainder is zero.

7. Use the LSD and GD of Q″, the sign and equality with zero of the remainder, the rounding mode, and the sign of the final quotient to select the correctly rounded quotient as Q″_T, Q″_T + 10^−n, or Q″_T − 10^−n, where Q″_T corresponds to Q″ truncated to n digits.
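The steps above can be assembled into a compact behavioral model (a Python sketch for the round-toward-zero case only; exact `decimal` arithmetic stands in for the truncated fixed-point datapath, and computing C0 directly models the lookup table, so this is an algorithmic illustration rather than the hardware):

```python
from decimal import Decimal, getcontext, ROUND_DOWN

getcontext().prec = 60        # generous working precision for the model

def decimal_divide_rtz(y, x, n=16, k=3):
    """Behavioral model of Section 3.4 (round-toward-zero only).
    y, x are normalized Decimal fractions with 0.1 <= y <= x < 1.0;
    returns the n-digit RTZ quotient of y/x."""
    ten = Decimal(10)
    # Steps 1-3: piecewise linear initial approximation R0 = C0 * X'
    xm = x.quantize(ten ** -k, rounding=ROUND_DOWN)
    a = xm + 5 * ten ** (-k - 1)          # subinterval midpoint A
    c0 = 1 / (a * a)                      # read from lookup table in hardware
    x_prime = 2 * xm - x + ten ** -k      # = XM + nine's compl of XL + 10^-n
    r = c0 * x_prime                      # accurate to ~2k-3 digits, r < 1/x
    # Step 4: Newton-Raphson iterations double the accurate digits
    g = 2 * k - 3
    while g < n + 2:
        r = r * (2 - x * r)               # 2 - X*Ri: nine's compl in hardware
        g *= 2
    # Step 5: adjusted quotient Q'' = trunc_{n+1}(Rm * Y) + 10^-(n+1)
    q_pp = (r * y).quantize(ten ** -(n + 1), rounding=ROUND_DOWN) + ten ** -(n + 1)
    # Step 6: remainder sign (hardware uses digit comparisons on Q''*X)
    rem = y - q_pp * x
    # Step 7: for RTZ, only a negative remainder with guard digit 0
    # requires decrementing the truncated quotient
    q_t = q_pp.quantize(ten ** -n, rounding=ROUND_DOWN)
    gd = int((q_pp * ten ** (n + 1)) % 10)
    if rem < 0 and gd == 0:
        return q_t - ten ** -n
    return q_t

print(decimal_divide_rtz(Decimal("0.3"), Decimal("0.7")))   # 0.4285714285714285
```

The RTZ-only selection logic follows from the bound |Q − Q″| ≤ 10^−(n+1): a positive or zero remainder always leaves trunc(Q) = Q″_T, and a negative remainder changes the result only when the guard digit is 0; the full action table generalizes this to all seven rounding modes.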
4. Decimal Floating-Point Divider
Figure 2 shows the block diagram of our IEEE P754 compliant decimal divider, which is composed of five main blocks. Block 1 includes the Data Preprocessing Logic, the DPD-to-BCD Logic, the BCD-to-DPD Logic, the 9's Complement Logic, the Significand Comparator, the Exception Handler, and the Exponent Generator. The Data Preprocessing Logic converts the dividend and the divisor from the IEEE P754 decimal format to BCD for internal processing. It also normalizes the dividend and the divisor and records their lengths by using a variation of binary leading zero detection logic [33] and a decimal barrel shifter. The BCD-to-DPD Logic and DPD-to-BCD Logic convert the data to and from the Lookup Table so as to reduce the memory requirements of the table. The Exception Handler detects special input operand values (NaNs, ∞, and 0) and exceptions (e.g., Invalid Operation and Division by Zero) by examining the dividend and the divisor. The Significand Comparator compares the dividend and the divisor to facilitate data alignment. The Exponent Generator computes the initial exponent of the quotient and signals the potential for overflow or underflow. Block 2 is composed of the Decimal Multiplier, the 9's Complement Logic, the Simplified Decimal Adder, the Decimal Barrel Shifter, and the latch for
the intermediate divisor reciprocal. The Decimal Multiplier is a 20-digit multiplier that helps perform the initial approximation, Newton–Raphson iterations, and multiplications to obtain the preliminary and final quotients. The 9's Complement Logic approximates (2 − X × Ri) from (X × Ri). The Simplified Decimal Adder adds a bias value of 10^−(n+1) to the preliminary quotient. The Decimal Barrel Shifter shifts the intermediate divisor reciprocal and pads the unused leading zeros in the intermediate divisor reciprocal with early exit identifiers to reduce the latency of decimal multiplication in the early stages of Newton–Raphson iteration. Block 3 consists of the Decimal Incrementer and Decrementer, the Rounding and Correction Logic, and the Data Postprocessing Logic. The Decimal Incrementer and Decrementer calculates Q″_T + 10^−n or Q″_T − 10^−n according to the signal from the rounding logic. The Rounding and Correction Logic outputs a signal to the output multiplexor and the Decimal Incrementer/Decrementer to select the correctly rounded quotient. The Data Postprocessing Logic adjusts the initial exponent and significand if the quotient is exact and combines the sign, the final exponent, and the significand to generate the final quotient in the IEEE P754 decimal format. Block 4 is the Lookup Table, which is indexed by XM in DPD format and outputs C0 in DPD format. Block 5 is the control logic for the decimal divider. It provides
Figure 2. Block diagram of the decimal floating-point divider.
12
Wang and Schulte
signals to mask the intermediate result and supplies the adjustment for rounding. It also provides a division_done signal.

Our decimal divider uses a decimal multiplier to help perform the initial approximation, the Newton–Raphson iterations, and the final multiplication and rounding. Therefore, a high-speed sequential or parallel decimal multiplier is desired. The initial implementation of our decimal floating-point divider uses the sequential fixed-point decimal multiplier proposed by Erle in [18]. Erle's decimal multiplier uses decimal carry-save addition to accumulate the partial products, which leads to a short critical path delay. It performs multiplication in (n_mult + 4) cycles, where n_mult is the number of significant digits in the multiplier operand. Erle's multiplier uses several novel techniques, including fast generation of multiplicand multiples, decimal (3:2) counters and (4:2) compressors, and a simplified decimal carry-propagate adder to produce the final product. Furthermore, early exit provides the opportunity to finish the multiplication in less time when the multiplier operand is short, which reduces the time needed to perform the initial reciprocal approximation and the early Newton–Raphson iterations. In our implementation, we slightly modify Erle's multiplier to reduce the cycle time.
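As a software cross-check of this flow, the modified iteration R_{i+1} = R_i × (2 − X × R_i) can be sketched with Python's decimal module standing in for the 20-digit multiplier. The function name and operand values below are illustrative, not taken from the hardware.

```python
# Sketch of the Newton-Raphson reciprocal refinement used by the divider,
# with Python's decimal module in place of the 20-digit decimal multiplier.
from decimal import Decimal, getcontext

def nr_reciprocal(x, r0, iterations, digits=20):
    """x: normalized divisor; r0: initial approximation (from the lookup table);
    each iteration computes R_{i+1} = R_i * (2 - x * R_i)."""
    getcontext().prec = digits
    r = r0
    for _ in range(iterations):
        e = Decimal(2) - x * r   # in hardware, (2 - x*R_i) is approximated by
        r = r * e                # taking the 9's complement of (x*R_i)
    return r

# 1/0.4 = 2.5; start from a deliberately coarse approximation
x = Decimal("0.40")
r = nr_reciprocal(x, Decimal("2.4"), 3)
q = Decimal("0.32") * r          # preliminary quotient, ~0.8
```

Because the iteration converges quadratically, the coarse starting value 2.4 (one accurate digit) reaches well beyond 16 accurate digits in three iterations, matching the iteration count derived in Section 5.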
4.1. Decimal Incrementer/Decrementer

The decimal incrementer/decrementer, which helps to round the quotient, is much simpler than a decimal adder and subtracter, since it only needs to add or subtract 10^(−n). The decimal incrementer/decrementer uses high-speed carry lookahead logic to improve performance. When incrementing, a digit carry is propagated only when the current digit, A_i, is equal to nine, which can be detected as:

PI_i = A_i[3] ∧ A_i[0]   (18)
where A_i[3], A_i[2], A_i[1], and A_i[0] are the four bits that make up the BCD digit A_i and ∧ denotes logical AND. When decrementing, a digit borrow is propagated only when the digit is equal to zero, which can be detected as:

PD_i = ¬(A_i[3] ∨ A_i[2] ∨ A_i[1] ∨ A_i[0])   (19)
where ∨ denotes logical OR and ¬ denotes logical NOT. The combined propagate signal for addition and subtraction is

P_i = (¬dec ∧ PI_i) ∨ (dec ∧ PD_i)   (20)
where dec is set to zero when incrementing and to one when decrementing. Since a carry (borrow) can only be generated when there is a carry (borrow) in and the propagate signal is 1, the carry (borrow) out signal is computed as

C_{i+1} = P_i ∧ C_i   (21)
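A bit-level model of Eqs. (18)–(21) can be sketched in Python; digit-value comparisons stand in for the BCD bit tests, and the function name is illustrative.

```python
# Sketch of the carry-lookahead decimal incrementer/decrementer of Section 4.1.
# Digit equality tests stand in for the BCD bit tests of Eqs. (18)-(19).

def inc_dec(digits, dec):
    """digits[0] is the least significant BCD digit; dec=0 increments by one
    unit in the last place, dec=1 decrements by one."""
    n = len(digits)
    # Combined propagate, Eq. (20): PI_i = (A_i == 9), PD_i = (A_i == 0)
    P = [(d == 9) if dec == 0 else (d == 0) for d in digits]
    # Carries, Eq. (21): C_0 = 1 (the injected ulp); C_{i+1} = P_i AND C_i
    C, c = [], 1
    for i in range(n):
        C.append(c)
        c = c & (1 if P[i] else 0)
    # Result digits: A_i unchanged when C_i = 0, else (A_i +/- 1) mod 10
    delta = -1 if dec else 1
    return [(d + delta) % 10 if C[i] else d for i, d in enumerate(digits)]

print(inc_dec([9, 9, 9], dec=0))   # carry ripples through the nines: [0, 0, 0]
print(inc_dec([0, 5, 2], dec=1))   # borrow ripples through the zero: [9, 4, 2]
```

The hardware computes all carries at once with a parallel prefix tree rather than this sequential loop, but the propagate/carry relations are the same.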
All values of C_{i+1} (0 ≤ i ≤ n − 1) are computed using a parallel prefix tree. When incrementing, each result digit, Z_i, is set to A_i when C_i is zero and to (A_i + 1) mod 10 when C_i is one. When decrementing, Z_i is set to A_i when C_i is zero and to (A_i − 1) mod 10 when C_i is one. The values of (A_i + 1) mod 10 or (A_i − 1) mod 10 are produced from A_i and dec using simple two-level logic to obtain A′_i. Once the carries emerge from the parallel prefix tree, Z_i is set to A_i when C_i = 0 or to A′_i when C_i = 1.

4.2. Preferred Exponent

As mentioned in Section 2, a decimal number may have several equivalent IEEE number representations, which are collectively referred to as a cohort. Since one of the results in the cohort must be selected as the final result, IEEE P754 explicitly defines the preferred exponent of the result (and implicitly the significand) after each decimal arithmetic operation. For decimal division, if the quotient is inexact, the preferred exponent is the least possible value of the exponent, to preserve the maximum precision in the significand. If the quotient is exact, the preferred exponent is equal to e_Y − e_X, where e_Y and e_X are the exponents of the dividend and the divisor, respectively. Since handling an inexact quotient is fairly simple, we only discuss the cases where the quotient is exact.

Suppose both operands are integers. We consider two cases: a quotient with integer digits only, and a quotient with fraction digits and possibly integer digits. For an exact quotient with integer digits only, the length of the quotient can be determined from the dividend and divisor significands. Suppose C_Y ≥ C_X, where C_Y and C_X are the integer significands for the
dividend and the divisor, respectively. If L(C_Y) and L(C_X) are the lengths of the significands of the dividend and the divisor, and norm(C_Y) and norm(C_X) are the normalized dividend and divisor, the length of the quotient is

L(Q_int) = L(C_Y) − L(C_X) + M   (22)

where M = 1 when norm(C_Y) ≥ norm(C_X), and M = 0 otherwise. For example, if Y = 320 and X = 40, then L(C_Y) = 3, L(C_X) = 2, and M = 0. This gives L(Q_int) = 1, which is the length of Q = 8. On the other hand, when the exact quotient has both integer and fraction parts, predicting the length of the quotient is much more difficult. In this case, we simply count the number of nonzero digits of the quotient once it is computed, which increases the latency slightly but saves hardware. Suppose the number of nonzero digits of the quotient is L(Q_fp). To combine the two cases (i.e., the exact quotient has only an integer part or both integer and fraction parts), we compare the values from these two cases and select the maximum as:

L(Q) = max(L(Q_int), L(Q_fp))   (23)

Figure 3 shows a potential implementation to obtain the final quotient with the IEEE P754 preferred exponent. Since L(C_Y) and L(C_X) are already computed during normalization, we can reuse them and perform the computation in parallel with the Newton–Raphson iterations. The Trailing Zero Detection Logic performs a variation of leading zero detection to obtain L(Q_fp), and the Shift Computation Logic produces L(Q_int) based on Eq. (22). The Comparator then determines the value of L(Q) based on Eq. (23). After this, the preliminary quotient is shifted by either L(Q) or zero digits to obtain the final quotient.

Figure 3. Hardware implementation for preferred exponent representation.

5. Experimental Results and Analysis

We modeled a 64-bit decimal floating-point divider, which uses our Newton–Raphson division algorithm, in structural Verilog. The divider is a latch-based design. It was simulated using ModelSim to ensure correct operation and then synthesized using Synopsys Design Compiler and LSI Logic's Gflx-P 0.11 micron standard cell library [34]. Under nominal operating conditions and a supply voltage of 1.2 V, the divider has a critical path delay of 0.69 ns when the design is optimized for delay. The critical path occurs in the decimal barrel shifter when we perform synthesis from the top module, set boundary optimization to true, and ungroup all the sub-modules. When implemented using a lookup table with k = 3 and a 20-digit sequential fixed-point multiplier that processes one digit per cycle, our decimal floating-point divider takes 150 cycles to perform 64-bit (16-digit) decimal division. Figure 4 shows the estimated area for each block in our implementation, except the lookup table. As shown in this figure, Block 2, which includes the decimal multiplier, consumes most of the area.
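The quotient-length computation of Section 4.2, Eqs. (22) and (23), can be sketched in Python; the helper names are illustrative, not taken from the hardware.

```python
# Sketch of the preferred-exponent quotient-length computation (Eqs. (22)-(23)).

def length(c):
    """Number of significant digits, L(C)."""
    return len(str(c))

def norm(c):
    """Normalized significand, e.g. 320 -> 3.20, 40 -> 4.0."""
    return c / 10 ** (length(c) - 1)

def quotient_length(cy, cx, q_digits):
    """cy, cx: integer significands with cy >= cx; q_digits: the number of
    nonzero digits counted in the computed quotient, L(Q_fp)."""
    m = 1 if norm(cy) >= norm(cx) else 0
    l_int = length(cy) - length(cx) + m      # Eq. (22)
    return max(l_int, q_digits)              # Eq. (23)

# The paper's example: Y = 320, X = 40 gives L(C_Y) = 3, L(C_X) = 2, M = 0,
# so L(Q_int) = 1, which is the length of Q = 8
print(quotient_length(320, 40, 1))   # 1
```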
Figure 4. Area estimate for each block in the decimal divider (total area = 450,877 μm²; Block 2: 64%, Block 1: 19%, Block 3: 16%, Control Logic: 1%).

The number of digits used to access the lookup table, k, influences the memory size and the number of Newton–Raphson iterations required. Figure 5 shows how the memory size (in KB) and the number of iterations vary with the number of digits used to access the lookup table. Since the inputs and outputs of the lookup table use DPD encoding, which represents three decimal digits using 10 bits, the table size is roughly 2^j × 2j bits, where j = ⌈(k × 10)/3⌉. For example, if k = 3, the memory size is 2.5 KB, while if k = 4, the memory size is 54 KB. Since R_0 is accurate to at least 2k − 3 fraction digits, R_m must be accurate to at least n + 2 fraction digits, and each iteration doubles the number of accurate digits, the number of iterations, m, required to guarantee correct rounding is determined by

(2k − 3) × 2^m ≥ n + 2, i.e., m ≥ log2(n + 2) − log2(2k − 3)   (24)

Figure 5. Memory size and number of iterations vs. number of lookup table input digits (k) for Decimal32, Decimal64, and Decimal128 operands.

For example, if n = 16 and k = 3, three iterations are needed, but if n = 16 and k = 4, only two iterations are needed. The number of cycles needed to execute our Newton–Raphson floating-point division algorithm depends on the latency of multiplication, the size of the initial lookup table, and the quotient's precision. Figure 6 shows the number of cycles required to execute our division algorithm when k = 3 and the multiplier is capable of processing from one to four multiplier operand digits each cycle. The figure assumes multiplications are performed sequentially with a design similar to the one presented in [18], such that each multiplication takes 4 + ⌈n_mult/d⌉ cycles, where n_mult is the number of digits in the multiplier operand and d is the number of digits processed each cycle. For 64-bit operands, increasing d from one to two decreases the number of cycles needed for division by about 32%. Further increasing d yields smaller improvements, due to constant terms in the multiplier and divider latencies and because n_mult is less likely to be a multiple of d. If a parallel multiplier, which can perform decimal multiplication in P cycles, is used and the number of Newton–Raphson iterations required is m, our Newton–Raphson division algorithm has a latency of 10 + P × (3 + 2m) cycles. Figure 7 shows the number of cycles required to execute our division algorithm when k = 3 and P varies from 3 to 11. For example, if a 20-digit by 20-digit decimal fixed-point multiply has a latency of three cycles, then a 16-digit decimal floating-point division has a latency of 37 cycles. As illustrated in Figure 7, a linear increase in the latency of the fixed-point multiplier leads to a linear increase in the divider latency.

Figure 6. Division latency as the number of multiplier digits per cycle varies (Decimal32, Decimal64, Decimal128).
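The iteration count of Eq. (24) and the parallel-multiplier latency estimate can be checked with a few lines of Python; the function names are illustrative.

```python
# Sketch of the iteration-count and cycle-count estimates of Section 5:
# Eq. (24) and the 10 + P*(3 + 2m) parallel-multiplier latency formula.
import math

def iterations(n, k):
    """Newton-Raphson iterations so an initial approximation accurate to
    2k-3 digits reaches at least n+2 digits: (2k-3) * 2**m >= n + 2."""
    return math.ceil(math.log2(n + 2) - math.log2(2 * k - 3))   # Eq. (24)

def division_latency(n, k, P):
    """Cycle count with a parallel multiplier of latency P cycles."""
    return 10 + P * (3 + 2 * iterations(n, k))

print(iterations(16, 3))           # 3 iterations for 16-digit operands, k = 3
print(iterations(16, 4))           # 2 iterations for k = 4
print(division_latency(16, 3, 3))  # 37 cycles, as in the text
```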
6. Related Work on Decimal Division
Many techniques for decimal division have been invented, and most of them have been patented [22–26]. However, these implementations are for integer division and only produce one quotient digit each iteration. In this section, we summarize these inventions and analyze their potential latencies.

In [22], Yabe uses a digit recurrence algorithm to perform decimal division. In each iteration, the quotient digit is predicted based on the three most significant digits of the partial remainder (initially the dividend) and the two most significant digits of the divisor. The predicted quotient digit is then used to index a table to select the corresponding divisor multiple, which serves as the subtrahend in the partial remainder computation. If the quotient digit is mis-predicted, the true quotient digit is equal to the predicted quotient digit plus one, and the remainder is recomputed. The novelty of their design is the use of a quotient prediction table to speed up the quotient selection process. This table, however, may be as large as 0.5 MB (2^(5×4) × 4 bits), or 64 KB if the table uses DPD encoding. Moreover, the generation of the divisor multiples introduces some overhead during each division and becomes inefficient unless carry-save decimal addition and fast divisor multiple generation are used [18]. If 16-digit decimal addition/subtraction takes two cycles and the quotient prediction takes one cycle, a 16-digit decimal integer division (with no normalization, format conversion, rounding, exception handling, or post-processing) completes in around 80 clock cycles. This method may be appropriate when execution time is much more important than area and a high-speed decimal multiplier is not available.
Figure 7. Division latency as the multiplication latency varies (Decimal32, Decimal64, Decimal128).

Instead of using a prediction table to guess the quotient digit, Yamaoka uses a simpler method to compute the quotient digit in each iteration [23]. In this design, the divisor is repetitively subtracted from the partial remainder until the partial remainder becomes negative. A counter is employed to count the number of subtractions in each iteration, and the final value of the counter is the quotient digit for that iteration. A novel aspect of their design is that the subtraction and digit shifting are performed in parallel. Consequently, no restoring needs to be performed when the partial remainder becomes negative. This technique saves up to n clock cycles, where n is the number of digits in the dividend. If 16-digit decimal addition/subtraction takes two clock cycles, a 16-digit integer division (with no normalization, format conversion, rounding, exception handling, or post-processing) takes at most 288 cycles and on average 144 cycles. This may be quite competitive with our division algorithm when a fast decimal multiplier is not available.

Rather than using a non-restoring scheme, the division algorithm presented in [24] implements restoring decimal integer division on the IBM z900 microprocessor. First, the processor computes the length N_1 of the dividend and the length N_2 of the divisor to determine the number of iterations required as N_q = N_1 − N_2 + 1. The operands are then normalized such that each operand has one leading zero. After this, N_q iterations are performed, in which the quotient digit computation is the same as in Yamaoka's method [23]. This method uses microcode to perform the division, and its latency is similar to that of Yamaoka's method.

The technique presented in [26] is designed for software implementation and uses a non-restoring algorithm to implement decimal division. Similar to
Yabe's method, it uses the four most significant digits from the divisor and the two most significant digits from the partial remainder to generate a trial quotient digit. However, instead of using all of these digits to access the table, Ferguson's table, shown in Table 5, is simplified such that only the most significant digits of the partial remainder and divisor, and some comparisons, are required. The process starts with normalization of the divisor if the most significant digit of the divisor is less than 5. This is followed by using the normalized divisor to select the sub-table, and then using the two most significant digits of the partial remainder to select the quotient digit from the sub-table. At the end, a final correction is performed to clean up the intermediate negative quotient digits and to recover the remainder if it is negative. The novel aspect of this design is the trial quotient table. Ferguson's method only requires a small table and random logic, and the table can be further optimized since many entries are identical. This design, however, has overhead for the final quotient correction and extra comparison logic to select the correct quotient digits. Moreover, if the most significant digit of the divisor is less than 5, pre-normalization and post-denormalization need to be performed to obtain the correct result.

The technique presented in [25] uses millicode (vertical microcode) instructions to implement decimal division as a sequence of simple operations. Millicode instructions are a special instruction type used in the IBM POWER4 system. In POWER4, instructions are dispatched in groups, where each group contains up to five internal operations (IOPs). Usually, a simple instruction takes one IOP; however, complex operations, such as decimal division, are split into several smaller instructions, called millicode instructions. To perform a decimal division, they use a digit recurrence algorithm in which the most significant digit of the normalized divisor and partial remainder are used to provide a trial quotient digit. The trial quotient digit accesses a dividend multiple table, and a single addition or subtraction determines if the trial quotient digit needs to be adjusted. Several restrictions apply to group issue and retire, so a decimal division may take more cycles than expected. Readers are referred to [35] for further details.

Table 5. Trial quotient digit table used in [26].
The table lists a trial quotient digit for each combination of the leading divisor digit (5 through 9), a range for the next three divisor digits (e.g., < 625, ≥ 625 and < 750, ≥ 750), and the two leading digits of the partial remainder.
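The one-quotient-digit-per-iteration recurrence common to [22]–[26] can be illustrated with a simplified restoring division loop. This sketch is generic and does not reproduce any one patent's exact quotient-selection method; the repeated-subtraction counter mirrors the scheme of [23].

```python
# Generic restoring digit-recurrence decimal integer division, sketched to
# illustrate the one-quotient-digit-per-iteration schemes discussed above.

def decimal_divide(dividend, divisor, n=16):
    """Returns (quotient, remainder), producing one decimal digit per iteration."""
    q, r = 0, 0
    digits = [int(d) for d in str(dividend).zfill(n)]
    for d in digits:
        r = r * 10 + d          # shift the partial remainder left by one digit
        qd = 0
        while r >= divisor:     # repeated subtraction, as in Yamaoka [23]
            r -= divisor
            qd += 1             # the counter value becomes the quotient digit
        q = q * 10 + qd
    return q, r

print(decimal_divide(320, 40))  # (8, 0)
```

Schemes such as [22] and [26] replace the repeated-subtraction loop with a table-driven prediction of each quotient digit, trading table storage for fewer addition/subtraction steps per digit.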
7. Conclusions

In this paper, we first reviewed the decimal number format specified in the IEEE P754 standard. Then, we introduced a method for performing piecewise linear decimal reciprocal approximation that only requires one coefficient from the lookup table, a nine's complement operation, and a multiplication. We also developed a modified version of Newton–Raphson iteration, which works well for decimal numbers and can benefit from a decimal multiplier with early exit. We then adapted an efficient rounding scheme from binary floating-point division so that it can be used in decimal floating-point division. This new rounding scheme supports all the rounding modes of the IEEE P754 standard, plus additional rounding modes supported in the Java BigDecimal class. We provided a detailed error analysis for the initial approximation, the Newton–Raphson iterations, and the rounding scheme to show the correctness of our algorithm and to demonstrate improvements that can be made to the decimal division algorithm.

We then presented the hardware design of a decimal floating-point divider, which implements our proposed algorithm and is compliant with the IEEE P754 standard. The divider has several useful features, including a lookup table that uses DPD encodings, a high-frequency sequential multiplier, and a simplified incrementer/decrementer. Our design, when synthesized using Synopsys Design Compiler and LSI Logic's 0.11 micron Gflx-P standard cell library, has a critical path delay of 0.69 ns. Next, we analyzed the relationship between the number of digits used for the lookup table, the overall table size, and the number of Newton–Raphson iterations required for different operand sizes. We also studied the impact of the multiplication latency on the overall latency of our division algorithm. Finally, we described related research on hardware implementations of decimal division and compared these implementations to our design.
Our proposed decimal floating-point division algorithm and hardware implementation should help to improve the execution times of financial, commercial, and Internet-based applications, which frequently perform operations on decimal data.
Acknowledgment This research was supported in part by an IBM Faculty Award.
References

1. M. H. Weik, "The ENIAC Story," http://ftp.arl.mil/~mike/comphist/eniac-story.html, 1961.
2. M. F. Cowlishaw, "Decimal Arithmetic FAQ: Part 1—General Questions," available at http://www2.hursley.ibm.com/decimal/decifaq1.htm, 2003.
3. M. Blair, S. Obenski, and P. Bridickas, "Patriot Missile Defense—Software Problem Led to System Failure at Dhahran, Saudi Arabia," http://www.fas.org/spp/starwars/gao/im92026.htm, 1992.
4. W. Kahan, "How Futile are Mindless Assessments of Roundoff in Floating-point Computation?," available at http://www.cs.berkeley.edu/~wkahan/Mind1ess.pdf, 2004.
5. European Commission, "The Introduction of the Euro and the Rounding of Currency Amounts," http://europa.eu.int/comm/economy_finance/publications/euro_papers/200%1/eup22en.pdf, March 1998.
6. A. Tsang and M. Olschanowsky, "A Study of Database 2 Customer Queries," IBM Technical Report 03.413, April 1991.
7. M. F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic, June 2003, pp. 104–111.
8. JTC1/SC22/WG4, "Proposed Revision of ISO 1989:1985 Information Technology—Programming Languages, Their Environments and System Software Interfaces—Programming Language COBOL," available at http://www.cobolstandard.info/wg4/wg4.html, December 2001.
9. Sun Microsystems, "Java 2 Platform Standard Edition v1.3.1," available at http://java.sun.com/j2se/1.3/docs/api/, 2001.
10. Microsoft Corporation, "Visual Basic Language and Run-time Reference," available at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/csspec%/html/CSharpSpecStart.asp, 2002.
11. IBM Corporation, "The decNumber Library," http://www2.hursley.ibm.com/decimal/decnumber.html, 2004.
12. Microsoft Corporation, "C# Language Specification," available at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/csspec%/html/CSharpSpecStart.asp, 2002.
13. 754 Working Group, "DRAFT IEEE Standard for Floating-Point Arithmetic," available at http://754r.ucbtest.org/drafts/, April 2004.
14. M. A. Erle, M. J. Schulte, and J. M. Linebarger, "Potential Speedup Using Decimal Floating-Point Hardware," in Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems and Computers, November 2002, pp. 1073–1077.
15. M. J. Schulte, M. Lindberg, and A. Laxminarain, "Performance Evaluation of Decimal Floating-Point Arithmetic," in IBM Austin Center for Advanced Studies Conference, February 2005.
16. A. Parashar, S. Gurumurthi, and A. Sivasubramaniam, "A Complexity-Effective Approach to ALU Bandwidth Enhancement for Instruction-Level Temporal Redundancy," in 31st Annual International Symposium on Computer Architecture, June 2004.
17. J. Thompson, M. J. Schulte, and N. Karra, "A 64-bit Decimal Floating-Point Adder," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, Lafayette, LA, February 2004.
18. M. A. Erle and M. J. Schulte, "Decimal Multiplication via Carry-Save Addition," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, June 2003, pp. 337–347.
19. L.-K. Wang and M. J. Schulte, "Decimal Floating-Point Division Using Newton–Raphson Iteration," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, September 2004, pp. 84–95.
20. P. Soderquist and M. Leeser, "Area and Performance Tradeoffs in Floating-Point Divide and Square-Root Implementations," ACM Comput. Surv., vol. 28, no. 3, 1996, pp. 518–564.
21. M. J. Flynn and S. F. Oberman, Advanced Computer Arithmetic Design, Wiley, 2001.
22. Yabe et al., Binary Coded Decimal Number Division Apparatus, U.S. Patent 4,634,220, January 1987.
23. Yamaoka et al., Coded Decimal Non-Restoring Divider, U.S. Patent 4,692,891, September 1987.
24. F. Y. Busaba et al., "The IBM z900 Decimal Arithmetic Unit," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, November 2001, pp. 1335–1339.
25. C. F. Webb et al., Specialized Millicode Instructions for Packed Decimal Division, U.S. Patent 6,067,617, May 2000.
26. D. E. Ferguson, Non-Heuristic Decimal Divide Method and Apparatus, U.S. Patent 5,587,940, December 1996.
27. M. F. Cowlishaw, "Densely Packed Decimal Encoding," IEE Proc., Comput. Digit. Tech., vol. 149, May 2002, pp. 102–104.
28. N. Takagi, "Powering by a Table Look-up and a Multiplication with Operand Modification," IEEE Trans. Comput., vol. 47, no. 4, 1998, pp. 1216–1222.
29. S. F. Oberman and M. J. Flynn, "Division Algorithms and Implementations," IEEE Trans. Comput., vol. 46, no. 8, 1997, pp. 833–854.
30. S. F. Oberman and M. J. Flynn, "An Analysis of Division Algorithms and Implementations," Technical Report, Stanford University, 1995.
31. E. M. Schwarz, "Rounding for Quadratically Converging Algorithms for Division and Square Root," in Proceedings of the 29th Asilomar Conference on Signals, Systems and Computers, 1995, pp. 600–603.
32. S. F. Oberman, Design Issues in High Performance Floating Point Arithmetic Units, PhD thesis, Stanford University, Stanford, CA, 1996.
33. V. G. Oklobdzija, "An Algorithmic and Novel Design of a Leading Zero Detector Circuit: Comparison with Logic Synthesis," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 2, March 1994, pp. 124–128.
34. LSI Logic, "Gflx-P 0.11 Micron SoC Technology," available at http://www.lsilogic.com/files/docs/marketing_docs/asic/gflx_pb.pdf, 2005.
35. J. M. Tendler, J. S. Dodson, J. S. Fields Jr., H. Le, and B. Sinharoy, "POWER4 System Microarchitecture," IBM J. Res. Develop., vol. 46, no. 1, 2002, pp. 5–26.
Liang-Kai Wang is currently a PhD candidate in the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison. He is working with Professor Schulte on the topics of high-performance processor design, domain-specific processors, and decimal floating-point arithmetic. In the past, he worked at Intel, where he helped to develop a new methodology for testing the Intel PXA800F cellular processor and other cellular/PDA processors. Liang-Kai Wang came to Madison in 2001 and received his MS degree in Electrical Engineering in 2003. In 1999, he received his BS degree in Electronic Engineering from the National Chiao Tung University, Hsinchu, Taiwan, where he focused on audio signal processing for musical instruments.
Michael J. Schulte received a B.S. degree in Electrical Engineering from the University of Wisconsin-Madison, and M.S. and Ph.D. degrees in Electrical Engineering from the University of Texas at Austin. He is currently an associate professor at the University of Wisconsin-Madison, where he leads the Madison Embedded Systems and Architectures Group. His research interests include high-performance embedded processors, computer architecture, domain-specific systems, and computer arithmetic. He is an associate editor for the IEEE Transactions on Computers and the Journal of VLSI Signal Processing.