Design Automation for Embedded Systems, 3, 255–290 (1998)
© 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Design Methodology for a DVB Satellite Receiver ASIC

MARTIN VAUPEL ([email protected]), UWE LAMBRETTE, HERBERT DAWID, OLAF JOERESSEN, STEFAN BITTERLICH, HEINRICH MEYR
Lehrstuhl für Integrierte Systeme der Signalverarbeitung, ISS–611810, RWTH Aachen University of Technology, D–52056 Aachen, Germany

FOCKO FRIELING ([email protected]), KARSTEN MÜLLER, GÖTZ KLUGE
Siemens AG, Munich, Germany
Abstract. This contribution describes design methodology and implementation of a single-chip timing and carrier synchronizer and channel decoder for digital video broadcasting over satellite (DVB-S). The device consists of an A/D converter with AGC, timing and carrier synchronizer with matched filter, Viterbi decoder including node synchronization, byte and frame synchronizer, convolutional de-interleaver, Reed Solomon decoder, and a descrambler. The system was designed in accordance with the DVB specifications. It is able to perform Viterbi decoding at data rates up to 56 Mbit/s and to sample the analog input values with up to 88 MHz. The chip allows automatic acquisition of the convolutional code rate and the position of the puncturing mask. The symbol synchronization is performed fully digitally by means of interpolation and controlled decimation. Hence, no external analog clock recovery circuit is needed. For algorithm design, system performance evaluation, co-verification of the building blocks, and functional hardware verification an advanced design methodology and the corresponding tool framework are presented
which guarantee both short design time and highly reliable results. The chip has been fabricated in a 0.5 µm CMOS technology with three metal layers. A die photograph is included.

Keywords: Methodology, DVB, algorithm and architecture design, performance analysis, verification.

Table 1. DVB specifications (excerpt).

modulation            QPSK (quadrature phase shift keying)
pulse shaping         square-root raised cosine impulse, roll-off 0.35
frame format          MPEG transport stream
interleaving scheme   convolutional
channel coding        concatenated (convolutional and block code)
Viterbi code rates    1/2, 2/3, 3/4, 5/6, 7/8
Reed Solomon (RS)     (204,188,8)
BER behind Viterbi    2·10^−4 at SNR of 4.2 to 6.4 dB (dep. on code rate)
BER behind RS         1·10^−11 (quasi error free)
symbol rate           not specified (up to 44 MSym/s)
1. Introduction
A modulation and channel decoding system for digital multi-program television broadcasting is standardized in the digital video broadcasting (DVB) standard [1] (see Table 1 for an excerpt of the specifications). The satellite system is intended to provide direct-to-home services for consumer integrated receiver decoders (either as set-top boxes or integrated in the television set), as well as cable television head-end stations. For the consumer market, inexpensive and robust implementations are required. Therefore, the goal of the system designer is to implement as many functions as possible on a single chip and to minimize the use of expensive analog components which may cause long-term-stability problems. The standard specifies a BER (bit error rate) of 2·10^−4 at the Viterbi decoder output for an inner code rate of 1/2 and an E_b/N_0 of 4.2 dB. This corresponds to quasi-error-free operation (QEF, i.e., a BER of 10^−11 or one error per hour) behind RS decoding. Note that the symbol rates are not specified but left open to the provider of the digital video programs. In addition to the requirements imposed by the standard, the device has to fulfill demands such as a high degree of flexibility, observability, easy integration into customer products, robustness, and competitive performance and costs. A proper design methodology must ensure the satisfaction of these requirements within a short design time. The paper is organized as follows. Section 2 presents a brief summary of the building blocks' algorithms and architectures. Section 3, the main part of this paper, describes the design methodology and the corresponding framework used in this project. Requirements for a design flow suited to support the development from functional specification to final hardware verification are discussed. Crucial phases of the flow will be highlighted using the building blocks of Section 2 as examples. Section 4 summarizes results in terms of performance and implementation complexity.
Figure 1. Block diagram of the system.
2. Algorithms and Architectures of Building Blocks
A block diagram of the system [2] is shown in Figure 1. A three-stage down-conversion scheme is applied. The LNB (low noise block-downconverter) output signal is within the first intermediate frequency (IF) range between 950 MHz and 2150 MHz. It is pre-filtered and down-converted to the second IF (480 MHz) in the tuner and subsequently filtered with a surface acoustic wave (SAW) filter. The following down-conversion unit generates a complex-valued baseband signal consisting of in-phase and quadrature components and performs a first analog automatic gain control. The analog I and Q signals are fed into the on-chip A/D converter that includes a second, digitally controlled AGC. Within the timing synchronizer, timing offset correction and adaptation of the sample rate to the different symbol rates are performed fully digitally by means of interpolation and controlled decimation. Carrier phase and frequency offsets are corrected in the carrier synchronizer unit. The output of the matched filter that is embedded in the carrier synchronizer serves as input of the de-puncturing unit of the Viterbi decoder. After byte and MPEG transport multiplex packet synchronization, the de-interleaved byte stream is fed into the Reed-Solomon (RS) decoder. The decoded information bytes are de-scrambled and fed to the MPEG2 decoder. Loop parameters, acquisition and tracking performance of all synchronizing units, and acquisition strategies are configurable via the standardized I2C bus interface. In addition, internal states and important system information can be read out.
Figure 2. Block diagram of the timing and carrier recovery.
2.1. A/D Conversion and AGC
The analog I and Q input signals are converted to the digital domain by two 5-bit flash A/D converters running at up to 88 MHz. The sample clock is not synchronized to the symbol rate. For best isolation of the two converters, separate power supplies and reference voltages are provided. An automatic offset compensation is included. An internal automatic gain control (AGC) adapts the amplitudes of the incoming signals to the A/D conversion range. The RMS (root-mean-square) amplitude is detected. The AGC output controls the conversion gain within a range of ±6 dB in 32 steps.
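To make the control loop concrete, the following C sketch models a digitally controlled AGC of this kind. The paper fixes only the RMS detector and the 32-step, ±6 dB gain range; the block length, target level, and single-step update rule used here are illustrative assumptions.

```c
#include <math.h>

/* Digitally controlled AGC model: an RMS detector compares the measured
 * amplitude of a block of samples with a target value and nudges a 5-bit
 * gain index (32 steps spanning +/-6 dB, i.e. 0.375 dB per step).
 * Block length, target level and the single-step update are assumptions. */
typedef struct {
    int    gain_index;   /* 0..31, 16 corresponds to 0 dB            */
    double target_rms;   /* desired RMS amplitude at the A/D input   */
} agc_t;

static double agc_gain(const agc_t *a)
{
    double db = (a->gain_index - 16) * 12.0 / 32.0;   /* -6..+5.625 dB */
    return pow(10.0, db / 20.0);
}

/* Process one block of n complex samples (interleaved I/Q values) and
 * return the gain that was applied to this block. */
double agc_step(agc_t *a, const double *iq, int n)
{
    double g = agc_gain(a), p = 0.0;
    for (int k = 0; k < 2 * n; k++)
        p += (g * iq[k]) * (g * iq[k]);
    double rms = sqrt(p / n);                 /* per complex sample   */

    if (rms > a->target_rms && a->gain_index > 0)  a->gain_index--;
    if (rms < a->target_rms && a->gain_index < 31) a->gain_index++;
    return g;
}
```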
2.2. Synchronization of QPSK Signals
The implemented system can be conceptually divided into two parts: the inner receiver consisting of timing and carrier recovery and the outer receiver comprising the concatenated channel decoders. The outer receiver has the task to optimally decode the transmitted sequence. The sole task of the inner receiver is to generate a good channel for the decoder such that the ‘synchronized’ channel has a capacity close to that of the information theoretic channel. This task is performed by estimation of the (unknown) channel parameters and adjustment of the synchronizers which results (ideally) in an AWGN channel with a small amount of additive noise. The synchronization of PAM signals is a well-explored concept within the engineering community. For a thorough treatment of the subject refer to Part D of [3]. The synchronizer of the implemented receiver consists of timing and carrier recovery (see Figure 2) which are described in more detail in the next two subsections.
Figure 3. Principle of modified bitplanes.
2.2.1. Timing Synchronization
The timing synchronizer is basically a second-order digital phase-locked loop (DPLL) consisting of a timing error detector, a loop filter, a numerically controlled oscillator (NCO), and an interpolator with a consecutive controlled decimator. In order to achieve carrier-independent timing acquisition, the non-data-aided (NDA) Gardner timing error detector [4], which is known to produce an error estimate that leads to timing estimates approaching the Cramér-Rao bound (CRB) [5], is used. A more detailed discussion of the algorithm developed for timing synchronization of variable sample rates can be found in [6], [7]. The structure of the timing synchronizer is depicted on the left side of Figure 2. After interpolation and consecutive decimation, a loop-in-lock criterion is computed within the blocks 'power estimation' and 'lock detection'. This in-lock criterion is evaluated in the acquisition control unit. Steered by the acquisition control unit, the (first-order) loop filter processes the output samples of the timing error detector. The output of the loop filter is connected to the NCO that provides the interpolator filters with filter coefficients and controls the decimation. For each quadrature component, the interpolator consists of a real-valued FIR filter with variable coefficients. These independently operating filters are implemented according to the modified bitplane approach [8] which yields a small silicon area in conjunction with a high sample rate. The number of full adder cells between two consecutive pipeline register cells has been chosen as 2 in order to achieve the smallest possible area while satisfying the constraints on the data rate. This results in a modified structure [9] compared to [8]. The principle of the structure is depicted in Figure 3, exemplified by a filter with three taps and a coefficient word length of four bits with the transfer function

G(z) = Y(z)/X(z) = z^{-4} \sum_{i=0}^{2} \sum_{j=0}^{3} c_i^j 2^j z^{-i}    (1)
By re-ordering the add operations, the partial products with the smallest possible values are added up first, leading to a smaller word length of the intermediate results. To increase efficiency further, modified Booth encoding of the coefficients is applied.
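As an illustration of the timing error detector feeding this loop, the sketch below implements the published non-data-aided Gardner formula on a stream with two samples per symbol; the sign convention, averaging window, and fixed-point scaling of the chip are not given in the paper and are assumed here.

```c
/* Non-data-aided Gardner timing error detector on a stream with two
 * samples per symbol: x[2k] are the (approximately) symbol-spaced
 * strobes, x[2k+1] the mid-symbol samples. The block average below
 * would drive the loop filter; sign convention and scaling are assumed. */
typedef struct { double re, im; } cplx;

double gardner_ted(const cplx *x, int num_symbols)
{
    double e = 0.0;
    for (int k = 1; k < num_symbols; k++) {
        const cplx *curr = &x[2 * k];        /* strobe k              */
        const cplx *prev = &x[2 * k - 2];    /* strobe k - 1          */
        const cplx *mid  = &x[2 * k - 1];    /* sample in between     */
        e += mid->re * (prev->re - curr->re)
           + mid->im * (prev->im - curr->im);
    }
    return e / (num_symbols - 1);
}
```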
Figure 4. Block diagram of one branch of the matched filter.
2.2.2. Carrier Recovery
Carrier recovery [10], [3] is based on a decision-directed maximum likelihood (DD-ML) phase error detector that feeds a first-order loop filter whose output is then passed to an NCO steering a phase rotator. Carrier recovery itself is performed at the symbol rate, while carrier frequency and phase correction is carried out before the matched filter and hence runs at the sample rate of 2/T. The structural block diagram of the carrier synchronizer can be seen on the right side of Figure 2. The output samples of the interpolator are rotated in a CORDIC [11], [12] processor and consecutively filtered in a matched filter, an FIR filter with fixed coefficients. Since the Viterbi decoder requires a sign-magnitude representation at its input, the two's complement encoded samples are converted in the scaling block. Additionally, the output samples of the matched filter are scaled according to the input requirements of the phase error detector and the carrier lock detection unit. Parameters of the carrier loop filter are set by the carrier acquisition control unit in conjunction with the I2C bus parameters. The output of the loop filter is accumulated in a numerically controlled oscillator that provides the CORDIC processor with the rotation angle. The complex-valued matched filter is implemented as two equivalent real-valued FIR filters with fixed identical coefficients which are encoded in canonical signed digit (CSD) format in order to increase efficiency. Exploiting a carry-save representation as internal data format, the filters are implemented as rows of adder cells (bitplanes). The optimum pipeline depth (the number of full adder cells between two registers) is three, thus a re-ordering of the bitplanes similar to [9] has been applied to reduce silicon area. Figure 4 illustrates this principle. In order to provide the adder cells with the correctly delayed values, the input samples are delayed in one shift register chain.
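For reference, a floating-point C sketch of the CORDIC rotation used conceptually by the phase rotator is given below; the iteration count, the ±90° pre-rotation, and the gain compensation are illustrative choices and do not reflect the fixed-point parameters of the ASIC.

```c
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Iterative CORDIC rotation of the vector (x, y) by the angle phi
 * (radians), as used conceptually by the phase rotator in front of the
 * matched filter. Iteration count, pre-rotation and gain compensation
 * are illustrative; the ASIC uses a fixed-point variant. */
#define CORDIC_ITER 12

void cordic_rotate(double *x, double *y, double phi)
{
    /* Pre-rotate by +/-90 degrees so the remaining angle is in [-pi/2, pi/2]. */
    if (phi >  M_PI / 2) { double t = *x; *x = -*y; *y =  t; phi -= M_PI / 2; }
    if (phi < -M_PI / 2) { double t = *x; *x =  *y; *y = -t; phi += M_PI / 2; }

    double gain = 1.0;
    for (int i = 0; i < CORDIC_ITER; i++) {
        double d  = (phi >= 0.0) ? 1.0 : -1.0;
        double xs = *x - d * ldexp(*y, -i);        /* x - d * y * 2^-i */
        double ys = *y + d * ldexp(*x, -i);        /* y + d * x * 2^-i */
        *x = xs; *y = ys;
        phi  -= d * atan(ldexp(1.0, -i));
        gain *= sqrt(1.0 + ldexp(1.0, -2 * i));
    }
    *x /= gain;                                    /* constant CORDIC gain */
    *y /= gain;
}
```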
2.3. Viterbi Decoder
The Viterbi decoder operates on all DVB compliant code rates (1/2, 2/3, 3/4, 5/6, and 7/8) by means of de-puncturing. The decoder consists of the Viterbi core, a de-puncturing unit, an error correction rate (ECR) measurement unit, and a synchronization controller. The basic Viterbi decoder core consists of a transition metric unit (TMU), an add compare select unit (ACSU), and a survivor memory unit (SMU) with an implemented survivor depth of 128. The de-puncturing unit steers the input FIFO to convert the data rates according to the code rates and performs the actual de-puncturing operation according to the current synchronization state. It is able to perform a 90 degree rotation of the received QPSK
symbol prior to the actual de-puncturing for synchronization purposes. Since up to 4 QPSK symbols belong to one de-puncturing period (for code rate 7/8), an offset is input to the unit to adjust the de-puncturing sequence to possible offsets of the received sequence. The error correction rate (ECR) of the Viterbi decoder, i.e., the rate of differing bits between the hard-decision decoder input and the re-encoded output data stream, is detected. This rate is an estimate of the bit error rate of the channel and can thus be used to estimate the channel SNR. The synchronization controller performs node synchronization automatically, based on a choice of programmable code rates and thresholds on the correction rate which indicate out-of-sync conditions. The performance of the Viterbi core in terms of output BER depends mainly on the internal quantization and the length of the survivor memory.

Figure 5. Block diagram of the Viterbi decoder.
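A minimal sketch of the ECR measurement described above: the hard-decision decoder input is compared with the re-encoded decoder output over a window, and the resulting rate is checked against a programmable threshold. The window length, threshold, and interface names are hypothetical, and the re-encoding itself is assumed to be done elsewhere.

```c
#include <stddef.h>

/* Error correction rate (ECR) measurement: compare the hard-decision
 * decoder input with the re-encoded decoder output over a window and
 * flag a possible out-of-sync condition against a programmable
 * threshold. Window length, threshold and names are illustrative. */
typedef struct {
    size_t   window;      /* number of compared code bits              */
    unsigned threshold;   /* max. tolerated disagreements when in sync */
} ecr_cfg_t;

/* Returns 1 if the measured rate indicates loss of node synchronization. */
int ecr_out_of_sync(const unsigned char *hard_in,
                    const unsigned char *reencoded,
                    const ecr_cfg_t *cfg, double *ecr_out)
{
    unsigned diff = 0;
    for (size_t i = 0; i < cfg->window; i++)
        diff += (hard_in[i] != reencoded[i]);

    if (ecr_out)
        *ecr_out = (double)diff / (double)cfg->window;  /* channel BER estimate */
    return diff > cfg->threshold;
}
```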
2.4. Frame and Byte Synchronization
The frame structure of the interleaved data is depicted in Figure 6. An MPEG-2 transport MUX packet consists of 187 information bytes and one leading sync byte (47 hex). The RS-encoder adds 16 bytes of redundancy to each packet. Every eighth packet (super frame) is indicated by an inversion of the sync byte. On the transmitter side, all data bytes except the sync bytes are scrambled prior to RS-encoding. This structure of the data stream is exploited in the frame synchronizer to perform

1. byte synchronization of the infinite bit stream,
2. frame synchronization, which is needed to synchronize the deinterleaver and the RS-decoder, and
3. resolution of the π-ambiguity of the output data stream of the Viterbi decoder.
Figure 6. DVB frame structure.
Figure 7. Phase transitions.
After reset or after a loss of sync has occurred, the synchronizer is in the acquisition phase (see Figure 7). The incoming bit stream is bitwise loaded into a shift register of length eight and bitwise correlated with both the sync byte (47 hex) and the inverted sync byte (B8 hex). If no more than K mismatched bits are found, the consecutive correlations are done at frame-spaced positions only. Whenever the match condition (with K allowed mismatched bits) is met, a counter is incremented, otherwise it is decremented. If this match counter has reached the programmable threshold SC, the tracking phase (either with false or with correct lock) is started. If the counter equals zero, the next acquisition trial starts. The principle of the tracking algorithm is the same. The acquisition and tracking performance can be controlled via the IIC bus. It depends partly on the bit error rate. For a typical parameter set and a BER of 2·10^−4, the mean time until correctly detecting in-sync is below 0.5 ms and the mean time until loss of sync is above 10^50 s.
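The bitwise correlation against the sync byte can be sketched as follows; the surrounding up/down counter with threshold SC and the frame-spaced search are left to the caller, and all names are illustrative.

```c
#include <stdint.h>

/* Bitwise sync-byte search: shift the incoming channel bits into an
 * 8-bit register and accept a position if it differs from the sync
 * byte 0x47 (or the inverted sync byte 0xB8) in at most K bit
 * positions. The up/down counter with threshold SC described in the
 * text is left to the caller; all names here are illustrative. */
static int popcount8(uint8_t v)
{
    int c = 0;
    while (v) { c += v & 1u; v >>= 1; }
    return c;
}

typedef struct { uint8_t shift_reg; } byte_sync_t;

/* Feed one bit; returns 1 on a (possibly inverted) sync-byte hit. */
int byte_sync_bit(byte_sync_t *s, int bit, int K, int *inverted)
{
    s->shift_reg = (uint8_t)((s->shift_reg << 1) | (bit & 1));
    if (popcount8((uint8_t)(s->shift_reg ^ 0x47u)) <= K) {
        if (inverted) *inverted = 0;
        return 1;
    }
    if (popcount8((uint8_t)(s->shift_reg ^ 0xB8u)) <= K) {
        if (inverted) *inverted = 1;
        return 1;
    }
    return 0;
}
```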
2.5. Convolutional De-Interleaving
Following RS encoding and prior to convolutional coding, the error-protected data packets of 204 bytes (sync byte, information and redundancy bytes) are interleaved in the transmitter in order to distribute the burst errors produced by the Viterbi decoder. Therefore, a deinterleaver has to process the byte stream before the RS decoder is able to decode the packets. The deinterleaver is a convolutional interleaver with I = 12 branches [13], [14]. Each branch j consists of a shift register with M·(11 − j) cells (M = 17). Each register has a word length of eight bits. The data are (de)interleaved byte-wise. For synchronization purposes, the (possibly inverted) sync bytes are always routed to branch '0' of the deinterleaver (see Figure 8). Due to the large consumption of silicon area, implementing the deinterleaver using register cells would be very inefficient. Instead, a RAM-based solution was implemented. In order to obtain the minimum possible memory size, an addressing scheme was developed that allows in-place updating.
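A behavioral sketch of this de-interleaver, assuming simple per-branch circular buffers instead of the ASIC's single-RAM in-place addressing scheme (which the paper does not detail):

```c
#include <stdlib.h>
#include <string.h>

/* Convolutional (Forney) de-interleaver with I = 12 branches; branch j
 * delays by M*(11 - j) bytes, M = 17, and bytes are distributed
 * cyclically over the branches. Frame synchronization must ensure that
 * sync bytes hit branch 0. The ASIC's RAM-based in-place addressing is
 * replaced by simple circular buffers here for clarity. */
#define DI_I 12
#define DI_M 17

typedef struct {
    unsigned char *fifo[DI_I];
    int depth[DI_I], pos[DI_I], branch;
} deinterleaver_t;

int di_init(deinterleaver_t *d)
{
    memset(d, 0, sizeof *d);
    for (int j = 0; j < DI_I; j++) {
        d->depth[j] = DI_M * (DI_I - 1 - j);
        if (d->depth[j] > 0) {
            d->fifo[j] = calloc((size_t)d->depth[j], 1);
            if (!d->fifo[j]) return -1;
        }
    }
    return 0;
}

/* Push one byte into the current branch and get one (delayed) byte back. */
unsigned char di_process(deinterleaver_t *d, unsigned char in)
{
    int j = d->branch;
    unsigned char out = in;
    if (d->depth[j] > 0) {                /* the last branch has zero delay */
        out = d->fifo[j][d->pos[j]];
        d->fifo[j][d->pos[j]] = in;
        d->pos[j] = (d->pos[j] + 1) % d->depth[j];
    }
    d->branch = (d->branch + 1) % DI_I;
    return out;
}
```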
Figure 8. De-interleaving scheme.
2.6. Reed Solomon Decoder
The DVB standard specifies a shortened (204, 188, 8) Reed Solomon (RS) code, derived from the original systematic RS(255, 239, 8) code. One codeword consists of 204 bytes, separated into 188 information bytes (including the packet sync byte) and 16 parity check bytes. Since errors-only decoding is employed (no erasure processing), the RS decoder is able to detect and correct up to t = 8 byte errors per codeword (a byte error denotes an erroneous byte, independent of the number of corrupted bits) which can be arbitrarily distributed within the data. This code is designed to achieve QEF (quasi error free) performance. The code is characterized by the code generator polynomial

g(z) = \prod_{i=0}^{d-2} (z - \alpha^i)    (2)

with the distance d = 2t + 1 and α = 02 hex as specified in the DVB standard. The DVB Reed Solomon decoder uses a finite Galois field (GF) of size 2^8 which is specified in the DVB standard by the field generator polynomial

f(x) = x^8 + x^4 + x^3 + x^2 + 1    (3)
For the DVB application the 'traditional' method, given by syndrome calculation in the frequency domain and calculation of the Error Locator (Λ) and Evaluator (Ω) polynomials using the Berlekamp-Massey algorithm, is considered to be optimum and was therefore implemented. The entire decoding process, which has to be performed for each codeword, can be roughly divided into the following steps:

•   Syndrome calculation
•   Calculation of the Error Locator (Λ) and Evaluator (Ω) polynomials
•   Chien search (determination of the roots of the Error Locator polynomial)
•   Calculation of the correction values
•   Correction and output of the codeword

Figure 9. Block diagram of RS decoding architecture.

Figure 10. Galois Field ALU.
These steps are reflected in the top-level structure which is shown in Figure 9. Due to the high throughput requirements, every block is implemented as a separate hardware unit. Given a syndrome, a time budget of 204·4 = 816 clock cycles is available for solving the key equation using the Berlekamp-Massey algorithm. In order to minimize area consumption while meeting this throughput constraint, a special ALU supporting Galois field arithmetic was developed (see Figure 10). The polynomial coefficients are stored intermediately in two register files, one for the Ω and one for the Λ polynomial. A large hard-wired state machine steers the operations in the ALU and the register files. This design approach leads to a highly efficient implementation of the Berlekamp-Massey algorithm, implementing exactly the amount of parallel processing which is necessary to meet the given throughput constraint. The input data, which are stored in a dual-port RAM (the codeword buffer, see Figure 9), are finally read out and corrected.
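As a concrete example of the Galois field arithmetic handled by this ALU, the sketch below implements GF(2^8) multiplication with the field polynomial of equation (3) and the syndrome computation S_i = r(α^i) over the 16 roots of g(z); it is a behavioral reference only and does not mirror the hard-wired data path.

```c
#include <stdint.h>

/* GF(2^8) arithmetic with the DVB field polynomial
 * f(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D) and syndrome computation
 * S_i = r(alpha^i), i = 0..2t-1, evaluated by Horner's rule.
 * This mirrors the first decoding step only; Berlekamp-Massey and
 * the Chien search are not shown. */
#define RS_N  204
#define RS_2T 16

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0)); /* mod 0x11D */
        b >>= 1;
    }
    return p;
}

/* r[0] is the first received byte (highest-order coefficient). */
void rs_syndromes(const uint8_t r[RS_N], uint8_t S[RS_2T])
{
    uint8_t alpha_i = 1;                       /* alpha^0 = 1          */
    for (int i = 0; i < RS_2T; i++) {
        uint8_t s = 0;
        for (int k = 0; k < RS_N; k++)         /* Horner evaluation    */
            s = (uint8_t)(gf_mul(s, alpha_i) ^ r[k]);
        S[i] = s;                              /* all zero => no error */
        alpha_i = gf_mul(alpha_i, 0x02);       /* alpha = 02 hex       */
    }
}
```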
3. Design Methodology
The principle of a design flow for the development of integrated circuits in the field of digital communications is depicted in Figure 11.
Figure 11. Abstraction levels of design flow.
Starting with a functional specification of the system, all of the design tasks depicted on the left side have to be performed. The representations of the intermediate results (e.g. algorithmic block diagram or VHDL netlist) differ in the degree of abstraction: at the highest levels implementation-related details are hidden, while the importance of physical information increases at lower levels. The right side of Figure 11 shows the tasks of performance analysis and verification by means of comparing the intermediate results of different design phases. In case of a discrepancy or insufficient performance, the corresponding development phase has to be repeated. Thus, the design flow is an iterative rather than a one-pass process. A proper design methodology must satisfy two major requirements: the process has to be reliable and fast. A closer analysis reveals the following more detailed demands which contribute to the two main goals and must be fulfilled for each design phase and each level of abstraction:

•   seamless flow from functional specification to final hardware verification
•   high quality of results
•   design time
    –   high simulation efficiency
    –   easy verification by means of co-verification and co-simulation
    –   high degree of automation
    –   reusability
        ∗   easily reusable existing components
        ∗   generic components
        ∗   flexible and parameterizable designs
        ∗   re-use of design and verification setup
•   modeling style
    –   appropriate modeling style of components
    –   appropriate simulation paradigm
    –   support of hierarchy and modularity
•   manageability
    –   concurrent designing
    –   project management and monitoring
    –   visualization of results
    –   predictable and reproducible results
The decision on the entry point into the design flow for different components of the system depends on the nature of the specifications within the standard [1]. We distinguish between two cases (cf. Table 1): in the first case, the algorithm is completely specified by the standard. Therefore, algorithmic performance evaluations are not necessary and the design space is limited to the architectural choices. To this class belong in particular the de-interleaver and the de-scrambler, whose schemes are specified on the bit level, and the Reed-Solomon decoder, whose output stream is fixed for a given set of input stimuli. In the second case, the algorithm is not completely specified; rather, only the required functionality is described. Here, the designer is free to choose both the algorithm and the architecture. The timing and carrier synchronizer, the byte and frame synchronizer, and the node synchronizer belong to this class. The Viterbi decoder is in between: while the code is specified, the architecture and the algorithmic properties such as survivor memory depth and internal quantization are a trade-off between performance and implementation complexity (and hence cost). The following sections give examples of the tasks within each step of the design flow, highlight crucial points, and show how we have met the requirements mentioned above.

3.1. Algorithm Design and Performance Evaluation
The algorithm design process can be broken into four phases: the designer must 1) develop and partition a global system structure, 2) design algorithms, 3) determine algorithm-specific parameters (e.g. loop-bandwidth, threshold values) as well as the accuracies of the signal representations (i.e., internal quantization), and 4) analyze system performance.
To compare the algorithmic performance against the specification (i.e., to perform a validation), the designer must assess the system performance by means of analysis, numerical computation, or simulation, and must establish criteria for the comparison. The theoretical performance analysis of systems that include non-linear operations is complicated, if not impossible. For the DVB receiver, this is the case for the timing and carrier synchronizer loops and all components where internal quantizations are performed. A more appropriate and faster method is to obtain performance measures by means of Monte-Carlo simulations. The simulation efficiency depends on both the proper modeling style of the building block under consideration and the respective environment (e.g. the channel), as well as on the appropriate simulation paradigm. Where applicable, performance evaluation of components by means of numerical analysis has to be considered as well. The only quality measure that is specified in the standard is the output bit error rate (BER) at a certain signal-to-noise ratio (SNR). While this system-wide metric is sufficient for the analysis of the outer receiver, for the inner receiver it is not (although, of course, the implementation of the inner receiver influences the BER). Thus, the designer must establish explicit criteria for the performance assessment of the inner receiver components. As discussed in [3], proper quality measures of the synchronizer loops are the variance values of the timing or phase error estimates. For all building blocks performing synchronization, the set of significant metrics includes the mean acquisition time and the mean time until loss of sync. Absolute bounds on the majority of these measures can be derived by theoretical analysis. They are used to detect modeling or simulation errors and to identify degradations due to approximations of optimal algorithms. The Cramér-Rao bound, for example, serves as a lower bound on the variance of the timing and phase error. The degradation from the optimum can be divided into the detection loss caused by imperfect synchronization and the implementation loss due to finite word length effects. Together, these two losses describe the decrease in SNR with respect to an implementation with perfect synchronization and perfect implementation. To allow the analysis of the implementation loss by means of simulations, the models of the receiver components must be converted from a floating-point number representation into a fixed-point format with limited word length. In addition to the theoretical bounds, the actual requirements must be determined in conjunction with customer needs and are a trade-off between performance and cost. All algorithm design and simulative performance evaluation was performed using the system-level design tool COSSAP [15]. This environment makes algorithm specification and verification convenient through its libraries of pre-defined blocks and through the embedding of newly developed blocks whose functions can be described in the programming language C.
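A hypothetical helper illustrating such a floating-point to fixed-point conversion, with rounding and saturation, as used for implementation-loss studies (the actual word lengths are design-specific and not given here):

```c
#include <math.h>

/* Hypothetical fixed-point model used when converting floating-point
 * algorithm blocks for implementation-loss studies: quantize x to a
 * signed format with int_bits integer bits (excluding the sign) and
 * frac_bits fractional bits, with round-to-nearest and saturation. */
double quantize(double x, int int_bits, int frac_bits)
{
    double scale = ldexp(1.0, frac_bits);            /* 2^frac_bits    */
    double max_q = ldexp(1.0, int_bits) - 1.0 / scale;
    double min_q = -ldexp(1.0, int_bits);

    double q = floor(x * scale + 0.5) / scale;       /* round          */
    if (q > max_q) q = max_q;                        /* saturate       */
    if (q < min_q) q = min_q;
    return q;
}
```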
3.1.1. Partitioning
The partitioning of the system into major building blocks is quite obvious for most of the functional tasks specified in the standard: symbol synchronization has to precede decoding, and the sequence of components for concatenated channel decoding is fixed by the coding schemes and the data structures. However, the inner structure of the main blocks has to be carefully chosen. For the symbol synchronization, for instance, we used two
feedback loops for timing and carrier recovery, which are completely separated (cf. Figure 2). This approach leads to a simpler acquisition strategy and increases design robustness. In addition, performance analysis and the specification of algorithmic parameter ranges and quantization are facilitated, and thus design time and test complexity are reduced. The timing synchronization error feedback loop precedes the carrier synchronization loop in which the matched filter is embedded. While exploiting the advantages of error feedback loops, we are faced with two structural possibilities for the derivation of samples that are synchronized to the symbol timing (i.e., to the transmitter clock) from the continuous-time input signal: analog (precisely: hybrid [3]) and fully digital timing recovery. The former comprises an analog voltage-controlled oscillator (VCO). The digital processing units control the VCO in such a way that the frequency and the time instances of the sampling process within the A/D converters are synchronized to the symbol timing. The latter alternative, fully digital timing recovery, consists of an NCO, interpolation, and controlled decimation. In this structure, the system clock (and hence the sampling process) is not synchronized to the symbol rate but is generated by a fixed, free-running oscillator which runs at a (slightly) higher rate than the symbol rate. The shift of the samples to the correct time instances is then performed by the digital interpolator. The reduction of the sample rate to the symbol rate is done in a decimator. Both units are controlled by the NCO. For the implementation of the DVB receiver, the latter solution was chosen because it possesses several advantages. The avoidance of external analog clock components decreases the I/O complexity, minimizes cost by reducing the pin count and the number of devices necessary to build up an entire system, and increases design robustness. In addition, design time and test complexity are decreased as no joint analog/digital design style has to be employed.

Figure 12. Analog processing.

3.1.2. Modeling of Channel and Analog Front-End
The analog front-end influences the input signal of the digital receiver. Therefore, it must be modeled to allow the performance evaluation of the inner receiver. Prerequisites for the performance analysis by means of Monte-Carlo simulations are simulatable discrete-time models of the channel, the analog front-end consisting of an analog AGC and an anti-aliasing filter (AAF), the A/D converter, and the second (digitally controlled) AGC. The analog processing units are displayed in Figure 12.
Figure 13. Signal spectra and transfer function.
The analog AGC (AGC1 in Figure 12) is modeled as follows: the signal power P_AGC within a band of b_AGC = 50 MHz centered around the carrier is held constant. Thus,

P_AGC = N_0 b_AGC + E_s^C / T = const.    (4)

holds, where E_s^C is the channel symbol energy. As N_0 and b_AGC do not vary significantly during operation and E_s^C = 2 R_C E_b, where R_C is the code rate, equivalently

p_AGC = 1 + (2 E_b R_C) / (N_0 b_AGC T)    (5)

can be held constant. This result can be further simplified if the (double-sided) AGC bandwidth Δf_A is reduced to Δf_A ≤ 1/T. In this case, T and b_AGC cancel and the result also becomes independent of Δf_A. Here it was assumed that Δf_A and N_0 are independent of any operational condition. In order to realize this type of AGC, b_AGC T in equation (5) equals one. The control of the amplitude is executed by scaling the incoming signal with k/√p_AGC, which can be directly transferred into the digital discrete-time simulation. In case short-term variations affect the functionality of the AGC, their effect can be modeled by choosing k ≠ 1. Before sampling and A/D conversion, the received signal is filtered in the AAF with transfer function A(f). The AAF usually has a bandwidth much smaller than 2/T in order to prevent noise from being sampled with the A/D converter, since this would require a larger dynamic range of the A/D converter. The bandwidth of the anti-aliasing filter is locked to the sampling rate (not to the symbol rate) in such a way that for T_s = 0.5T the used bandwidth including the signal's roll-off soft edge can pass through. For a roll-off of α = 0.35, this leads ideally to

f_{g,AAF,ideal} = 1.35 / (4 T_s)    (6)

The corresponding signal spectra are displayed in Figure 13. The filtered noise n_f(·) has the power spectral density N_f(f) and the signal has the power spectral density S(f). The timing offset is always measured with respect to T. The bandwidth used by the signal transmission is 2a, where

a = (1 + α) / (2T) + Δf    (7)
Figure 14. COSSAP model of channel and off-chip analog processing.
Here Δf is the maximum allowable frequency offset and α is the roll-off factor of the root raised cosine pulses used. We assume that A(f) = const. for |f| < a. We also assume that the various AGCs manage to keep the signal power at the A/D input at a constant value. For the COSSAP realization of the channel, the noise power spectral density is assumed to be constant. Different E_b/N_0 values are thus achieved by variation of the bit (or symbol) energy. Once the proper signal energy is obtained, additive Gaussian noise is added. The noise power spectral density is N_0 = 1 and per sample noise of power σ² = 1/T_s is added. The channel model also contains functions to simulate a frequency offset and a phase offset. The above models have been implemented in COSSAP, leading to the schematic depicted in Figure 14. In this netlist, the MUXSR blocks serve to enable/disable the AGC. Since almost all receiver settings (T_s, R_c, T, E_b/N_0) affect the AGC and the analog pre-filter, several simulations have been carried out to evaluate the capabilities of the AGC. The uncertainty of the A/D converter is modeled by adding a random variable to the quantized signal value. The result of the addition is then clipped again to the range of the A/D converter.
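A discrete-time sketch of this channel model under the stated conventions (N_0 = 1, noise power 1/T_s per sample, E_s^C = 2 R_C E_b), with frequency and phase offset applied as a complex rotation. Whether the noise power is meant per quadrature component or for the complex sample, and the exact signal normalization, are modeling assumptions here; the code does not reproduce the COSSAP blocks.

```c
#include <math.h>
#include <stdlib.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Discrete-time AWGN channel sketch: N0 = 1, noise of power 1/Ts is
 * added per quadrature component (one possible reading of the text),
 * the target Eb/N0 is set by scaling the unit-energy signal, and
 * frequency/phase offsets are applied as a complex rotation. */
typedef struct { double re, im; } cpx_t;

static double gauss(void)                 /* Box-Muller, unit variance */
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

/* ebn0_db: target Eb/N0; rc: code rate; ts: sample period; df, phi0:
 * frequency and phase offset. The n samples are modified in place. */
void channel(cpx_t *x, int n, double ebn0_db, double rc,
             double ts, double df, double phi0)
{
    double ebn0  = pow(10.0, ebn0_db / 10.0);
    double gain  = sqrt(2.0 * rc * ebn0);   /* Es^C = 2*Rc*Eb, N0 = 1  */
    double sigma = sqrt(1.0 / ts);          /* noise per component     */

    for (int k = 0; k < n; k++) {
        double phi = phi0 + 2.0 * M_PI * df * ts * k;
        double re  = gain * (x[k].re * cos(phi) - x[k].im * sin(phi));
        double im  = gain * (x[k].re * sin(phi) + x[k].im * cos(phi));
        x[k].re = re + sigma * gauss();
        x[k].im = im + sigma * gauss();
    }
}
```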
3.1.3. Simulation Paradigm
The simulation paradigm of the stream-driven simulation engine [16] is well suited to the data-flow dominated nature of the application under consideration [17]. The processing of a functional block is scheduled not for each single arriving data item, but only if a configurable amount of values is accessible. This block processing results in a high simulation efficiency. In the receiver, we are faced with multi-rate signal processing. The incoming continuous signal is sampled with rate R_s = 1/T_s, which is generated by a free-running oscillator. Samples at this rate are denoted by x_m in Figure 15. It is important to distinguish between the asynchronously sampled signal at rate 1/T_s and the synchronized samples (denoted by x_k) at an average rate 2/T which are derived by means of interpolation and controlled
decimation. R_s = 1/T_s is always slightly larger than 2/T. The samples x_l at the average rate R = 1/T are synchronized to the symbols (and hence to the transmitter clock). They are derived by a slaved decimation. Therefore, the samples x_{k=2l,2l+1} correspond to x_l.

Figure 15. Multi-rate and dynamic data flow.

The modeling and simulation of this multi-rate and even dynamic, data-dependent data flow is possible without additional overhead or a loss of simulation efficiency. Since there exists no explicit time or clock in COSSAP, invalid data items can simply be discarded (see Figure 15). In contrast, in a time-driven simulation, the invalid data items would have to be transferred to each consecutive block and marked as invalid. This adds to the overall system load without gaining any additional benefit. Of course, this modeling style has to be carefully taken into account when mapping the algorithmic model onto architectures (see Section 3.2.2). Although the system under consideration is mainly data-flow dominated, a certain amount of control flow has to be implemented and, thus, modeled and simulated. In principle, there exist two possibilities for handling control flow: 1) modeling the flow and, thereby, visualizing it at the system level, or 2) hiding it by means of a description in a programming language like C together with an encapsulation inside a data-flow block. The decision for one of these two choices depends on the scope of the control flow. System-inherent control is mainly encapsulated within one block (e.g. the finite state machine of the frame synchronizer or the hard-wired FSM of the Reed-Solomon decoder, see Section 2.6). These encapsulated state machines do not communicate with their environment and therefore their control flow need not be modeled at the system level. Threshold values and acquisition strategies for the different synchronizers can be parameterized and controlled from the outside via the IIC bus. The acquisition control units of timing and carrier synchronization control the settings of the respective loop filters. Thus, this control flow is 'visible' at the outside of these components and, therefore, has to be modeled in COSSAP. Since a mapping to synchronous hardware is the ultimate goal of the algorithm design phase, these control signals are conveniently modeled with the same data rate as the respective data-flow signals. As a result, no special treatment of the control-flow parts was necessary.
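The controlled decimation behind this dynamic data flow can be sketched as a phase-accumulator NCO that marks, for every input sample at rate 1/T_s, whether a 2/T output is produced and which of these outputs also forms the 1/T stream; the increment value and the return-code convention are illustrative.

```c
/* Controlled decimation: a phase-accumulator NCO decides for every
 * input sample at rate 1/Ts whether it contributes a sample of the
 * 2/T stream (otherwise it is dropped), and every second accepted
 * sample additionally forms the 1/T stream (slaved decimation).
 * The increment value and return-code convention are illustrative. */
typedef struct {
    double phase;      /* accumulator in [0, 1)                        */
    double increment;  /* = 2*Ts/T, slightly below 1                   */
    int    toggle;     /* selects every second 2/T sample for 1/T      */
} decimator_t;

/* Returns 0: input sample dropped, 1: valid 2/T sample,
 * 2: valid 2/T sample that is also a 1/T strobe. */
int decimate_step(decimator_t *d)
{
    d->phase += d->increment;
    if (d->phase < 1.0)
        return 0;              /* no output in this input cycle (gate = 0) */
    d->phase -= 1.0;
    d->toggle ^= 1;
    return d->toggle ? 1 : 2;
}
```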
Figure 16. False lock probability p_fl versus the threshold SC for K = 0, 1 and BER = 0.05 and 2·10^−4.
3.1.4. Analytical/Numerical Performance Investigations
For the performance analysis of the frame, carrier, and timing synchronizer in terms of mean acquisition time, loss-of-sync probability, etc., a simulation of the data flow behavior is not practical. To compute these statistics, large amounts of data would have to be processed, which would result in very long simulation times. Instead, the performance has to be calculated as far as possible by theoretical or, where not applicable, numerical analysis. For instance, the statistical behavior of the frame synchronizer was investigated with MATLAB [18], [19]. As we recall from Section 2.4, the synchronizer is always in one out of three global phases (acquisition, and tracking while either in lock or in false lock, cf. Figure 7). For each global phase, the synchronizer can be modeled as a discrete-time Markov chain because 1) the synchronizer possesses a finite number of states and 2) the next local state only depends on the current state. The static state probabilities and the mean absorbing time of a Markov chain can be computed by means of numerical analysis. These values correspond to meaningful performance measures (e.g. the mean time of an acquisition process T_ac and the probability of detecting an in-lock falsely, p_fl). A detailed description of the performance assessment is given in the Appendix. Different parameter sets lead to different shapes of the state diagrams and different dimensions of the corresponding transition matrices. This has to be taken into account when computing the values for p_fl and T_ac depending on BER, K, and SC (cf. Section 2.4) and visualizing them using MATLAB. The results are shown in Figures 16 and 17, respectively.

Figure 17. Mean acquisition time T_ac in number of frames versus the threshold SC for K = 0, 1 and BER = 0.05 and 2·10^−4.
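The mean absorbing time referred to above can be obtained numerically by solving (I − Q)t = 1, where Q is the transition matrix restricted to the transient states. The small C sketch below does this with Gauss-Jordan elimination; the actual analysis in the project was carried out in MATLAB, and the state space size here is only an example value.

```c
/* Mean absorption time of a discrete-time Markov chain: with Q the
 * transition matrix restricted to the NS transient states, the vector
 * of mean times to absorption solves (I - Q) t = 1. Gauss-Jordan
 * elimination is sufficient for the small state spaces of the frame
 * synchronizer model; NS = 4 is only an example value. */
#define NS 4

int mean_absorption_time(const double Q[NS][NS], double t[NS])
{
    double A[NS][NS + 1];
    for (int i = 0; i < NS; i++) {                /* build [I - Q | 1]  */
        for (int j = 0; j < NS; j++)
            A[i][j] = (i == j ? 1.0 : 0.0) - Q[i][j];
        A[i][NS] = 1.0;
    }
    for (int c = 0; c < NS; c++) {                /* Gauss-Jordan       */
        int p = c;                                /* partial pivoting   */
        for (int r = c + 1; r < NS; r++)
            if (A[r][c] * A[r][c] > A[p][c] * A[p][c]) p = r;
        if (A[p][c] == 0.0) return -1;            /* singular system    */
        for (int j = 0; j <= NS; j++) { double s = A[c][j]; A[c][j] = A[p][j]; A[p][j] = s; }
        for (int r = 0; r < NS; r++) {
            if (r == c) continue;
            double f = A[r][c] / A[c][c];
            for (int j = c; j <= NS; j++) A[r][j] -= f * A[c][j];
        }
    }
    for (int i = 0; i < NS; i++) t[i] = A[i][NS] / A[i][i];
    return 0;
}
```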
3.2. Architecture Design
All architectures of the building blocks (see chapter 2) are described in VHDL [20]. However, depending on the time period allowed for the execution of an operation, different
description styles were applied. We have to distinguish between two cases: 1) the processing of the data is allowed to take a large number of clock cycles, making the sharing of execution units possible, and 2) only one clock cycle per operation is available. The latter case has to be further differentiated depending on the clock rate. For the blocks with the largest time budget (e.g. the Reed-Solomon decoder with about 800 clock cycles per frame), resource sharing was exploited to a high degree inside the component. The resulting architecture bears resemblance to a digital signal processor: a special Galois field arithmetic logic unit (ALU) was designed executing the Berlekamp-Massey algorithm in order to compute the error locator and evaluator polynomials. However, in contrast to a programmable device, this data path is controlled by a hard-wired state machine. (Remark: the resulting architecture, consisting of an application-specific data path and a hard-wired state machine, is the target platform of the Behavioral Compiler (BC) [21] of SYNOPSYS. Today, this high-level synthesis tool is used in conjunction with an environment that automatically generates synchronizing interfaces for the different BC processes [22], [23].) For the parts of the design where the clock rate is in a medium range (e.g. the loop filters of the timing and carrier synchronizer), the designer can rely on the DesignWare [24] library components for the arithmetic units. However, we have increased efficiency where applicable by substituting, for example, multipliers with shifters. In this clock domain, pipelining is only necessary between components. Silicon area was minimized by carefully reducing the word length after each operation by clipping, rounding, or truncation in accordance with performance evaluations.
274
a design environment which allows the capturing of this design knowledge and enables design re-use by the generation of VHDL modules. The design environment that was used in the course of this project, ComBox [25], enhances reusability and was developed at RWTH Aachen.

Figure 18. Generator library organization. Example: Viterbi decoder class.

3.2.1. ComBox
The design environment for high-throughput data-flow-dominated VLSI systems, ComBox, allows the exploitation of algorithmic and architectural trade-offs. It links the system level to the architectural level by means of its four-tiered library. In Figure 18 the organization of the information within ComBox libraries is depicted, exemplified by generators for Viterbi decoders. The upper two layers (named class and group) contain information about the algorithmic behavior. They are needed for hardware generation from a functional data-flow block diagram [26], [27]. For the DVB receiver, the group describing the 'trace-back' behavior was chosen. The two lower levels provide implementation-related details which are necessary in order to generate a hardware description. For each pair consisting of a primary and a secondary unit, a single VHDL description is produced which possesses an identical behavior compared to the data-flow simulation model of the enclosing group. In addition, customized synthesis scripts can be generated for each component. As can be seen in Figure 19, three different generation schemes are supported: 1) source code templates, 2) hierarchical generation, and 3) use of generation scripts. Source code templates are an extended format of VHDL source code or synthesis scripts which are processed by a simple macro processor and allow code customization depending on user-provided parameters of the generator. The hierarchical generation reflects the hierarchical structure of systems for digital communications. It is possible to call subgenerators from a generator. This reduces the complexity of the library and the number of modules to be maintained. The use of generation scripts is the most flexible way of specifying code generation. They are necessary in all cases where the generic properties of
the synthesizable subset of VHDL are not sufficient to obtain a large degree of flexibility and parameterizability of a model with reasonable effort. The complexity of the generators that were used in this project ranges from simple logic functions or microprocessor interfaces to quite complex blocks like Viterbi decoders [28], CORDIC processors [29], or filters with fixed or variable coefficients [9]. The ComBox library concept allows the capturing of expert knowledge in a well-defined manner. The convenient way of VHDL generation and the implicit documentation enhance reusability [30]. Thus, rapid design of digital communication systems is supported while maintaining a high quality of results. In order to speed up the design cycles even further, a methodology was developed that allows the estimation of implementation costs on the system level [31].

Figure 19. Generation schemes.

3.2.2. Modeling and Implementation of Dynamic Data Flow
As discussed, COSSAP is well suited for modeling of the dynamic, data-dependent data flow caused by the controlled decimation in the timing synchronizer. In hardware, this dynamic multi-rate data flow was realized using gated clocks. The alternative approach, using 'enable' or 'data-valid' signals, would lead to a more area-consuming solution due to the additional MUX cell for each register. Additionally, in components of the data path with the highest sample rate one would encounter severe problems in satisfying the timing constraints. With regard to power consumption the chosen alternative is preferable as well, since the large clock networks are switched at a lower rate. Corresponding to the three different sample rates 1/T_s, 2/T, and 1/T we identify three
different clock domains (clock, clock1g, and clock2) in the symbol synchronizer (see Figure 20). (A fourth clock domain drives the Viterbi decoder and the consecutive units.) The domain with the highest average rate (1/T_s) and the non-gated clock clock consists only of the interpolator and the timing NCO. The input samples of the DVB receiver are fed in at this clock rate. As shown in Figure 20, the timing NCO controls the gating process by a control signal gate. This signal is input to the clock gating unit that works with the non-gated clock clock and generates the signals clock1g and clock2. When the signal gate goes '0', the clock clock1g remains high for the next cycle. clock2 is derived by simply dividing clock1g by a factor of two (slaved decimation). To avoid setup and hold time violations, the senses of adjacent clocks are negated. This means that each transition from 'low' to 'high' of the clock corresponding to 2/T occurs only on a falling edge of the clock corresponding to 1/T_s. Figure 21 depicts the simulated waveforms of the reset, the gate, and the different clock signals.

Figure 20. Clock domains in the symbol synchronizer.

Figure 21. Simulated waveforms of the clock signals.

This behavior is modeled in COSSAP with dynamic data flow. The gate signal of the timing NCO controls several dmux blocks, one of whose outputs is killed (see Figure 22). The clock division is modeled with a block that generates an alternating sequence of zeros and ones. This signal controls several dmux blocks. The behavior of the two possible clock2 senses can be modeled by simply exchanging two consti blocks. For synchronization purposes, it is necessary to latch each data signal which crosses the
clock domain borders (e.g. the TED output) with registers that are clocked with the clock of the receiving domain (cf. Figure 20). In COSSAP, these units are modeled in two different blocks depending on the direction of the data flow: a transition from higher data rates to slower ones is modeled with the hierarchical block in_gate (see Figure 22), which simply discards data items steered by the controlling enable signal. The behavior at a transition from lower to higher rates is modeled with the COSSAP block out_gate (Figure 23), which duplicates internally stored data items when necessary. As a result, this modeling approach of dynamic data flow in VHDL and COSSAP combines high simulation efficiency with high-quality implementation results. Using the bit and cycle true models results in a one-to-one mapping between system simulation and the behavior of the hardware, which is crucial for the verification tasks (see below).

Figure 22. Synchronization register modeling: in_gate.

Figure 23. Synchronization register modeling: out_gate.
3.2.3. Synthesis Strategy
Together with an efficient VHDL coding style, a proper synthesis strategy is necessary to guarantee high-quality and reproducible results. We have applied a bottom-up synthesis approach relying on nested synthesis scripts which first characterize the design entity under consideration and then compile it in a second step. The organization of these scripts reflects the design hierarchy. Where applicable, we have aspired to place a (pipeline) register at the output of each VHDL entity in order to ease the characterization of the building blocks at each hierarchy level. To take into account the clock skews imposed by the different clock trees and domains as well as the uncertainties resulting from placement and routing, a 10% safety margin was applied, which was found to be sufficient.
Table 2. Lines of VHDL code and synthesis scripts.

Component               lines VHDL   lines dc-script
Timing & Carrier Sync   7000         1000
Viterbi decoder         4000         340
Frame sync              700          100
De-interleaver          640          100
RS decoder              5400         630
Descrambler             360          100
Table 2 shows the amount of VHDL code and synthesis scripts, measured by the number of lines, for each major building block. The number of gates per code line decreases with increasing architectural timing requirements because the coding style becomes very granular and structural, and each level of hierarchy adds to the code size without referring to actual gates. As discussed earlier, especially the parts of the code which describe high-throughput components were generated by proprietary software tools. The resulting hierarchy of synthesis scripts allows late design changes and ensures reproducibility of results by automating the synthesis process and the inherent documentation of the proper component-specific synthesis strategy.

3.3. VHDL Verification
The VHDL descriptions of the components were verified against the corresponding COSSAP models using a coupling of the system simulator and the VHDL simulator [32], [33]. As depicted in Figure 24, test stimuli are generated within COSSAP, re-using the simulation setup of the algorithm definition phase consisting of source and channel models and blocks that produce configuration signals. Via an automatically generated interface, these data are fed into the VHDL simulator that contains the VHDL model under test and an entity which provides clock and reset signals. The graphical post-processing features of COSSAP make the comparison of the outputs very convenient. In addition, all debugging facilities of the VHDL debugger (e.g. breakpoints, waveform displays) are usable. Therefore, no VHDL test-bench had to be written in the course of this project. This led to significant savings in design time compared to a more conventional HDL-based verification methodology. In addition, generating the test stimuli and simulating the behavior of surrounding components within the data flow simulation environment speeds up the simulation runs considerably compared to producing the input stimuli inside a VHDL test-bench. Since this verification methodology is applicable at each hierarchy level, the VHDL model of the complete receiver can eventually be verified against the bit and cycle true COSSAP model including the multi-rate and dynamic data-flow behavior.

3.4. Hardware Verification
Besides the post production testing, the final device has to be functionally evaluated in a real world environment. In the following sections we describe two possible approaches.
Figure 24. Co-simulation of system level and VHDL.
3.4.1. Evaluation Board
A hardware evaluation board was designed that allows functional verification and demonstration of the capabilities of the receiver chip. The board comprises a tuner with I/Q demodulator, an I2C interface controller, a transport stream connector, and the receiver ASIC. The RF input of the tuner can be connected with signal generation equipment or directly with a satellite dish. The output is the MPEG transport stream, which can be fed into an MPEG2 decoder. All functions are controlled by the I2C bus, which is connected via an interface module with the parallel port of a PC. The PC then acts as the external micro controller in Figure 1. This setup allows the designer to easily configure the receiver, read out all important status information (which synchronizers are in lock, code rate of the Viterbi decoder, position of the de-puncturing mask) and measure performance figures such as the estimated channel BER, the number of corrected bytes, the values of the lock criteria, etc. While this functional verification method is impressive to demonstrate at the customer site and therefore supports marketing, the costs for developing the PCB and acquiring test equipment that is able to generate DVB-compliant modulated data streams have to be taken into account. An alternative approach, which allows modeling of sources, channel, and external controller in software, is used at RWTH Aachen. It is described in the following section.

Figure 25. Evaluation board.
3.4.2. Hardware in the Simulation Loop
The real-time analysis and verification environment RAVEN [34] allows embedding the hardware into a system simulation. Figure 26 illustrates the principle. Via an interface
block, the input data is written to the SCSI interface of the workstation. Upon arrival at the hardware test interface, the samples are stored into an internal memory. When the SCSI transfer is completed, the clock signal driving the hardware under test is switched on and a real time test cycle starts. During this phase, the output data of the ASIC is stored into the RAM of the interface and after completion transferred back to the workstation where the stimuli for the following cycle are generated. The number of samples of a single hardware test cycle can be between 1 and 32000 and is pre-configurable by a parameter in the interfacing module. In addition, the mapping between COSSAP signals and the 128 I/O pins of the hardware test interface is parameterizable in terms of position, wordlength, number representation, and direction. Via control signals inside the COSSAP netlist the predefined values for number of samples inside a cycle and port direction can be changed dynamically. Like the VHDL co-verification, again all simulation setups of the algorithm definition phase can be reused. Besides the manufacturing of a very simple board (the chip carrier in Figure 26) that connects the ASIC under test to the hardware test interface (cf. Figure 27) and the configuration of the COSSAP interface model, no additional effort is necessary to 1) functionally verify the fabricated device, 2) run high-speed performance evaluations under different environmental conditions, or 3) demonstrate the fabricated solution. The device under test, which can be a single chip or a PCB containing a system of several components, is steered and observed by means of a dynamic control of the simulation parameters and the graphical output and post-processing capabilities of COSSAP. Thus, the device can be
Figure 26. Hardware in the simulation loop.
Thus, the device can be compared with the bit- and cycle-true system model under easily accessible and reproducible conditions. In addition, the RAVEN environment can be used to embed already existing (third-party) hardware components (e.g., the MPEG decoder) into enclosing virtual systems in order to investigate their behavior and performance efficiently.
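To make the test cycle described above concrete, the following Python sketch outlines one hardware-in-the-loop iteration. The interface functions (scsi_write_block, run_realtime_cycle, scsi_read_block), the block handling, and the reference model are placeholders standing in for the actual RAVEN/COSSAP interface, which is not detailed here; the sketch only illustrates the download–run–upload–compare structure.

```python
import numpy as np

BLOCK = 32000  # samples per hardware test cycle (1 ... 32000, as stated above)

def reference_model(stimuli):
    """Placeholder for the bit- and cycle-true COSSAP reference model."""
    return stimuli  # identity model, for illustration only

def scsi_write_block(samples):
    """Placeholder: SCSI transfer of the stimuli into the test-interface RAM."""

def run_realtime_cycle():
    """Placeholder: switch on the DUT clock and run one real-time burst."""

def scsi_read_block(n):
    """Placeholder: fetch the DUT responses from the test-interface RAM."""
    return np.zeros(n, dtype=np.int16)

def hil_cycle(stimuli):
    """One hardware-in-the-loop iteration: download, run, upload, compare."""
    scsi_write_block(stimuli)                 # 1) stimuli -> interface RAM
    run_realtime_cycle()                      # 2) clock the ASIC for one cycle
    dut_out = scsi_read_block(len(stimuli))   # 3) responses -> workstation
    ref_out = reference_model(stimuli)        # 4) compare against the reference
    return np.array_equal(dut_out, ref_out)

stimuli = np.zeros(BLOCK, dtype=np.int16)     # produced by the COSSAP source models
print("DUT matches reference:", hil_cycle(stimuli))
```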
3.5. Project Management
A vital point for the success of projects of this complexity is the proper planning of tasks in advance. Microsoft Project [35] aided in partitioning the workflow, allocating resources, and scheduling. Analogous to the design hierarchy itself, dividing the work into subtasks has to result in reasonable entities and must not lead to additional effort. Since the cost of fixing errors increases dramatically with project progress, verification has to ensure as early as possible that a model is flawless before consecutive design phases are entered. In spite of the advanced verification methodologies, which rely heavily on reuse, the sum of the verification and evaluation efforts of all design phases was estimated at about 50% of the overall design time in the original project plan, which proved sufficient. The concurrency of the tasks complicates the monitoring of progress. The risks of concurrent engineering are alleviated by a common tool platform, which must be fixed during the project definition phase. This tool set must support multi-level and hierarchical component descriptions, as is the case with the COSSAP environment. The possibility of sharing model libraries and of conveniently exchanging precise design information by mailing model files further eases the work of a team whose members are located at different sites. On the other hand, concurrency can improve reliability when code reviews are performed and verification setups are developed independently of the designer. This method was applied in our case, a collaboration between industry and university.
Figure 27. RAVEN in use.
The careful definition of responsibilities and of interfaces (technical and human) contributed to the success. Bi-weekly team-internal and monthly project-wide reviews made it possible to identify critical issues and to take measures against them. In addition, these meetings eased the technology transfer related to design methodology. Beyond such formal measures, however, one of the major responsibilities of project management is to create an open and success-oriented team environment.
4. Results

4.1. Performance Results
For a symbol rate of R = 33 MHz, frequency offsets of ±12.5% normalized to the symbol rate, and a typical parameter setting, the acquisition times of the carrier and the timing synchronizer are below 20 ms and below 2 ms, respectively. For the frame synchronizer, Figures 28 and 29 show the computed mean time until in-sync is correctly detected, Tsync, and the mean time until loss-of-sync is falsely declared, Tls, both expressed in numbers of frames. For a typical parameter set and a Viterbi decoder output data rate of 50 Mbit/s, Tsync equals 0.5 ms. In order to assess the performance of the device, bounds must be established for the tracking performance of the synchronizers as well as for the overall system. An ideal implementation serves as the benchmark. Any degradation from the theoretical limits is due either to imperfect synchronization (detection loss) or to implementation effects like quantization or clipping (implementation loss). Figures 30 and 31 display the variances of the phase and timing error estimates, respectively, in relation to the Cramér-Rao bound (CRB).
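As a quick consistency check of these figures, converting a synchronization time given in frames into an absolute time only requires the frame length of 204 bytes and the decoder output bit rate. A minimal sketch (Python; the 15-frame value is read off Figure 28 and is approximate):

```python
FRAME_BITS = 204 * 8     # bits per frame behind the Viterbi decoder
BIT_RATE = 50e6          # Viterbi decoder output data rate [bit/s]

def frames_to_seconds(n_frames):
    """Convert a mean synchronization time in frames into seconds."""
    return n_frames * FRAME_BITS / BIT_RATE

# About 15 frames (cf. Figure 28) correspond to roughly 0.5 ms:
print(f"Tsync = {frames_to_seconds(15) * 1e3:.2f} ms")
```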
Figure 28. Tsync at BER = 2 · 10−4 (T_sync [frames] as a function of SC and LSC; BER = 0.0002, match_init = 1, K = 0, L = 0).
Figure 29. Tls at BER = 2 · 10−4 (mean time until loss of sync, T_ls [frames], as a function of LSC for L = 0 … 3).
At the design point (Es/N0 = 3.4 dB), the normalized variance of the phase estimate var(θ̂)/(2B_L T) is about 8.5 dB above the CRB. A simulation of the quantized system with known symbols in the error detector reveals that this degradation can be split into 7 dB detection loss and 1.5 dB implementation loss (cf. Figure 30). For the timing synchronizer, a difference of about 6.5 dB between the simulated results and the CRB is determined at the design point. The main part of this loss results from the fact that the NDA algorithm does not reach the CRB. The variance is lower bounded by the quantization of the fractional delay µ in the interpolator, which asymptotically leaves about 0.45 dB of implementation loss to other word length effects. A detailed performance analysis can be found in [6], [7], [3]. The most important performance measure is the bit error rate, which is measured at the Viterbi decoder output. Figure 32 displays the resulting bit error rate as a function of Eb/N0 for the convolutional code rates R = 1/2 and R = 7/8. In the figure, the ideal implementation (perfect synchronization, no implementation loss due to survivor path truncation or quantization) is compared to the bit-true model of the receiver, which also includes the impairments of the A/D converter. The overall degradation is about 0.4 dB, leaving a 0.6 dB margin for the other analog circuitry.
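The split of the total degradation into detection and implementation loss follows directly from the two simulation runs mentioned above: the gap to the CRB that remains with known symbols is attributed to implementation effects, and the additional gap of the decision-directed system to detection. A minimal sketch (Python, dB values taken from the phase-estimate example):

```python
# All quantities in dB relative to the Cramér-Rao bound (CRB).
gap_decision_directed = 8.5   # quantized system, decision-directed error detector
gap_known_symbols = 1.5       # quantized system with known symbols in the error detector

implementation_loss = gap_known_symbols                       # quantization, clipping, ...
detection_loss = gap_decision_directed - gap_known_symbols    # caused by decision errors

print(f"detection loss     : {detection_loss:.1f} dB")   # 7.0 dB
print(f"implementation loss: {implementation_loss:.1f} dB")
```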
Figure 30. Normalized var(θ̂) (var(θ̂)/(2B_L T) [dB] versus Es/N0 [dB]; design point marked).

Figure 31. Variance of the timing error (var(ε̂) [dB] versus Es/N0 [dB]).

Figure 32. Bit-true receiver model performance.
Table 3. Cell and chip areas.

Component                                            accum. cell area    silicon area
Synchronizer                                          32 %                 9 %
Viterbi                                               40 %                48 %
RS-decoder incl. deinterl., frame sync, and descr.    28 %                17 %
A/D-converter                                          —                   4 %
Clock-PLL                                              —                   1 %
Pad frame                                              —                  21 %
Sum                                                  100 %               100 %

4.2. Implementation Results
The chip was implemented in a 0.5 µm CMOS technology with three metal layers and fits into a P-QFP-64 package. The single supply voltage is 3.3 V. The power consumption amounts to 1.2 W at the maximum sampling rate of the analog input values of 88 MHz. The maximum output bit rate amounts to 56 Mbit/s. Table 3 summarizes the relative standard cell areas of the main components and the normalized silicon areas including analog components and RAM. The RAM modules occupy 34% of the core area. Figure 33 shows a chip photograph. The data flow direction is from left to right. In the upper left corner, the two clock synthesizer PLLs for the timing and carrier recovery and for the channel decoders are located. Below these, the A/D converters for the in-phase and quadrature components are placed. The memory blocks in the middle region are the RAMs of the survivor memory unit, which enclose the Viterbi decoder. On the right-hand side, the memories for the deinterleaver (at the top) and the Reed-Solomon decoder (bottom) can be identified.
5. Conclusion
The implementation of a single-chip timing and carrier synchronizer and channel decoder for digital video broadcasting over satellite (DVB-S) was described. The fully digital timing and carrier synchronization minimizes the number of external analog components. The chip is fully compliant with the DVB standard and allows automatic acquisition of different symbol rates and convolutional code rates. Acquisition and tracking parameters of the various synchronizing units, and even the acquisition strategies, are freely configurable via the I2C bus interface. In addition, internal states and important system information can be read out. The joint development of algorithms and dedicated architectures, in conjunction with careful quantization investigations, led to a highly efficient single-chip solution. The important role of well-suited modeling styles with respect to simulation and verification efficiency as well as implementation quality was highlighted. For each phase of the design flow, the simulation paradigm, the verification environment, and the modeling style of the components must match the specific requirements. The tool framework used in this project is based on COSSAP, which, owing to its block-based simulation paradigm and the resulting high simulation efficiency, served as the design entry tool for algorithm design as well as for algorithm verification and system performance evaluation.
Figure 33. Chip photograph.
The intensive reuse of simulation setups from previous design phases accelerated verification throughout the whole design flow. In addition, concurrent engineering and design iterations between different levels of abstraction were facilitated by the support of hierarchy and the seamless integration of models. A hierarchical COSSAP model exists which is bit-true and cycle-true identical to the VHDL model. A hierarchy of synthesis scripts aided logic synthesis using the SYNOPSYS Design Compiler, which increased the reproducibility and the quality of the results. Test patterns for the resulting gate-level netlist were also generated from COSSAP. Even the functional verification of the fabricated hardware could be simplified by embedding it into a simulation loop within the RAVEN environment. The design methodology presented ensures both short time to market and high design integrity.

Acknowledgements

We would like to thank Ralf Schwendt (Siemens AG) and Christian von Reventlow (now with Robert Bosch GmbH), who made the cooperation between industry and university fruitful and the successful results possible.
Figure A.1. Phase transitions (states: acquisition phase; tracking, lock; tracking, false lock; transitions with probabilities p_fl and 1 − p_fl and mean times Tac, Trfl).
Appendix A. Frame Synchronizer Performance Analysis

To assess the performance of the frame and byte synchronization, the synchronizer is modeled by different discrete-time Markov chains whose shape depends on the global phase (see Figure A.1). As discussed in Section 2.4, during acquisition (or tracking) the incoming bitstream is bit-wise correlated with both the sync byte and the inverted sync byte. For a successful correlation, mismatches are allowed at up to K (or L) positions. Different counters are used to compare the number of successful (tracking: failed) correlations with the programmable threshold SC (LSC). Each counter value corresponds to a single state. The states that correspond to a change into another synchronizing phase (acquisition or tracking) are called absorbing states. In the acquisition phase two absorbing states exist: 1) going into lock and 2) going into false lock (see Figure A.2 for a state diagram for SC = 3). The transition rates λ per frame are

λ_0+ = 1 − p_B:  there exists exactly one sync byte per frame; during bit-wise correlation it is missed with probability p_B,
λ_+ = 1 − p_B:  during frame-spaced correlation, the sync byte is missed with probability p_B,
λ_− = p_B:  a sync byte disturbed by more than K bit errors is not detected; this happens with probability p_B,
λ'_0+ = 1617 p_K:  there exist 1617 random 8-bit patterns per frame, each of which is recognized as a sync pattern during bit-wise correlation with probability p_K,
λ'_+ = p_K:  correlation within the data part of a frame detects a sync pattern with probability p_K,
λ'_− = 1 − p_K:  correlation within the data part of a frame detects no sync pattern with probability 1 − p_K,
where p_B equals

p_B = \sum_{i=K+1}^{8} \binom{8}{i}\, p_{BER}^{\,i}\, (1 - p_{BER})^{8-i}     (A.1)
Figure A.2. Markov model for the acquisition phase, SC = 3.
and p_K equals

p_K = \begin{cases} \dfrac{2}{256}\sum_{n=0}^{K}\binom{8}{n}, & 0 \le K \le 3,\\[4pt] 1, & 4 \le K \le 8. \end{cases}     (A.2)
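Both probabilities are easy to evaluate numerically; the following sketch (Python, standard library only) implements (A.1) and (A.2) and feeds the transition rates used further below.

```python
from math import comb

def p_b(p_ber, k):
    """Eq. (A.1): probability that the sync byte itself is missed,
    i.e. more than K of its 8 bits are in error."""
    return sum(comb(8, i) * p_ber**i * (1 - p_ber)**(8 - i) for i in range(k + 1, 9))

def p_k(k):
    """Eq. (A.2): probability that a random byte is accepted as (inverted) sync byte."""
    return 1.0 if k >= 4 else 2 / 256 * sum(comb(8, n) for n in range(k + 1))

print(p_b(5e-2, 0))   # ~0.337 for the example used below (K = 0, p_BER = 5e-2)
print(p_k(0))         # 2/256 ~ 0.0078
```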
If the state transition diagram is known, the matrix Q of the transition rates can be determined [36], which allows one to compute, e.g., the probability of false lock p_fl by deriving the transition probability matrix

P = I + Q/q     (A.3)
and computing the stationary state probabilities by

z^{(\infty)} = \lim_{n \to \infty} z^{(0)} P^{n}     (A.4)
In addition, the matrix Q_m containing only the non-absorbing (i.e., transient) states can be derived, which allows one to assess the mean absorption time of a Markov chain. This value corresponds to various performance measures (e.g., Tac, the acquisition time). Let b be a column vector of all ones; then the elements of the vector E (E^T = (E_0, E_1, \ldots)) describe the mean absorption times for the starting states s_0, s_1, \ldots, respectively, where E is the solution of the following equation

-Q_m E = b     (A.5)

For the state diagram of Figure A.2 (row and column order: 0, 1, 2, lock, 1', 2', false lock), the Q-matrix equals

Q = \begin{pmatrix}
-(\lambda'_{0+} + \lambda_{0+}) & \lambda_{0+} & 0 & 0 & \lambda'_{0+} & 0 & 0 \\
\lambda_{-} & -1 & \lambda_{+} & 0 & 0 & 0 & 0 \\
0 & \lambda_{-} & -1 & \lambda_{+} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\lambda'_{-} & 0 & 0 & 0 & -1 & \lambda'_{+} & 0 \\
0 & 0 & 0 & 0 & \lambda'_{-} & -1 & \lambda'_{+} \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
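Assuming the state ordering of the matrix above, the quantities quoted below can be reproduced numerically; the sketch (Python/NumPy) follows (A.1)–(A.5) and chooses the uniformization constant q as the largest diagonal magnitude of Q, which is an assumption since q is not specified in the text.

```python
from math import comb
import numpy as np

p_ber, K = 5e-2, 0
pB = sum(comb(8, i) * p_ber**i * (1 - p_ber)**(8 - i) for i in range(K + 1, 9))  # (A.1)
pK = 2 / 256 * sum(comb(8, n) for n in range(K + 1))                             # (A.2)

l0p, lp, lm = 1 - pB, 1 - pB, pB            # lambda_0+, lambda_+, lambda_-
l0p_, lp_, lm_ = 1617 * pK, pK, 1 - pK      # lambda'_0+, lambda'_+, lambda'_-

# Generator matrix for SC = 3; state order: 0, 1, 2, lock, 1', 2', false lock.
Q = np.array([
    [-(l0p_ + l0p), l0p,  0.0, 0.0, l0p_,  0.0, 0.0],
    [lm,           -1.0,  lp,  0.0, 0.0,   0.0, 0.0],
    [0.0,           lm,  -1.0, lp,  0.0,   0.0, 0.0],
    [0.0,           0.0,  0.0, 0.0, 0.0,   0.0, 0.0],   # lock (absorbing)
    [lm_,           0.0,  0.0, 0.0, -1.0,  lp_, 0.0],
    [0.0,           0.0,  0.0, 0.0, lm_,  -1.0, lp_],
    [0.0,           0.0,  0.0, 0.0, 0.0,   0.0, 0.0],   # false lock (absorbing)
])

q = np.abs(np.diag(Q)).max()                 # uniformization constant (assumed choice)
P = np.eye(7) + Q / q                        # (A.3)

z = np.zeros(7); z[0] = 1.0                  # start in state 0
transient = [0, 1, 2, 4, 5]
while z[transient].sum() > 1e-9:             # iterate (A.4) until (almost) absorbed
    z = z @ P
p_fl = z[6]                                  # ~0.0021, as stated in the text

Qm = Q[np.ix_(transient, transient)]
E = np.linalg.solve(-Qm, np.ones(len(transient)))   # (A.5)
Tac = E[0] * 204 * 8 / 50e6                  # ~40.5 frames -> ~1.32 ms at 50 Mbit/s
print(p_fl, E[0], Tac)
```

Other thresholds SC or LSC simply enlarge the chain, so the same construction covers the other parameter sets discussed below.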
For K = 0 and p_BER = 5 · 10−2 we are able to compute z^{(\infty)} numerically and finally obtain p_fl = 0.0021. By solving equation (A.5) numerically we get the mean number of frames E_1 until the tracking phase is entered (either with lock or with false lock), E_1 = 40.48 frames, which corresponds to Tac = 1.32 ms at a data rate of 50 Mbit/s (and 204 · 8 bits per frame). Different parameter sets lead to different shapes of the state diagrams and to different dimensions of the corresponding matrices. This has to be taken into account when computing the values of p_fl and Tac as functions of BER, K, and SC and visualizing them using MATLAB. The results are shown in Figures 16 and 17, respectively. Similar considerations led to the determination of the tracking performance, the performance of the ambiguity resolver, and, in conjunction with the global phase diagram (Figure A.1), to the mean synchronization time and the mean time until loss-of-sync (cf. Figures 28 and 29).

References

1. European Telecommunications Standards Institute. Digital broadcasting system for television, sound and data services; Framing structure, channel coding and modulation for 11/12 GHz satellite services. Draft ETS 300421, ETSI Secretariat, 06921 Sophia Antipolis, France, August 1994.
2. K. Müller, F. Frieling, H. Kriedt, C. v. Reventlow, R. Schwendt, M. Haas, F. Kuttner, H. Dawid, O. Joeressen, U. Lambrette, M. Vaupel, and H. Meyr. A low-cost DVB compliant Viterbi and Reed Solomon decoder. In Intl. Conference on Consumer Electronics, Rosemont, Ill., June 1997.
3. H. Meyr et al. Digital Communication Receivers: Synchronization, Channel Estimation and Signal Processing. John Wiley & Sons, 1997.
4. F. Gardner. A BPSK/QPSK timing-error detector for sampled receivers. IEEE Transactions on Communications, COM-34: 423–429, May 1986.
5. M. Oerder and H. Meyr. Derivation of Gardner's timing error detector from the maximum likelihood principle. IEEE Transactions on Communications, COM-35, pp. 684–685, June 1987.
6. U. Lambrette, K. Langhammer, and H. Meyr. Variable sample rate digital feedback NDA timing synchronization. In Proceedings of the IEEE Global Telecommunications Conference GLOBECOM, 1996.
7. U. Lambrette, K. Langhammer, and H. Meyr. An aliasing-free receiver with variable sample rate digital feedback NDA timing synchronization. J. Wireless Personal Communications, Sept. 1998, scheduled for publication.
8. T. G. Noll. Semi-systolic maximum rate transversal filters with programmable coefficients. In W. M. et al., editors, Systolic Arrays, pp. 103–112, Adam Hilger, Bristol, 1987.
9. M. Vaupel and H. Meyr. High speed FIR-filter architectures with scalable sample rates. In Proceedings of the IEEE International Symposium on Circuits and Systems, 4.127–4.130, London, May 1994.
10. H. Meyr and G. Ascheid. Synchronization in Digital Communications, vol. 1. John Wiley & Sons, 1990.
11. J. E. Volder. The CORDIC trigonometric computing technique. IRE Trans. Electronic Computing, EC-8: 330–334, September 1959.
12. H. Dawid and H. Meyr. The differential CORDIC algorithm: Constant scale factor redundant implementation without correcting iterations. IEEE Transactions on Computers, 45: 307–318, March 1996.
13. G. D. Forney. Burst-correcting codes for the classic bursty channel. IEEE Transactions on Communications, COM-19: 772–781, Oct. 1971.
14. J. L. Ramsey. Realization of optimum interleavers. IEEE Transactions on Information Theory, IT-16: 338–345, May 1970.
15. Synopsys. COSSAP Overview and Documentation Roadmap. Synopsys, Inc., Mountain View, CA, 1996.
16. J. Kunkel. COSSAP: A stream driven simulator. In IEEE International Workshop on Microelectronics in Communications, Interlaken, Switzerland, March 1991.
17. G. Jennings. A case against event driven simulation of digital system design. In A. H. Rutan, editor, The 24th Annual Simulation Symposium, pp. 170–176, IEEE Computer Society Press, Los Alamitos, California, April 1991.
18. The MathWorks. MATLAB Reference Guide. Cochituate Place, Natick, Mass., 1994. Documentation of MATLAB 4.2.
19. U. Lambrette, B. Schmandt, G. Post, and H. Meyr. COSSAP – MATLAB Cosimulation. In Proc. Int. Conf. on Signal Processing Application and Technology (ICSPAT), October 1995.
20. Institute of Electrical and Electronics Engineers Inc. IEEE Standard VHDL Language Reference Manual, IEEE Std 1076-1987, New York, NY, March 1988.
21. Synopsys Inc. Behavioral Compiler User Guide. Mountain View, CA.
22. J. Horstmannshoff, T. Grötker, H. Meyr, M. Wloka, and K. Djigande. DSP system synthesis: Integration of reusable building blocks. In Proc. Int. Conf. on Signal Processing Application and Technology (ICSPAT), pp. 774–778, San Diego, Sep. 1997.
23. J. Horstmannshoff, T. Grötker, and H. Meyr. Mapping multirate dataflow to complex RT level hardware models. In ASAP, IEEE, 1997.
24. DesignWare Developer Guide. Mountain View, CA.
25. M. Vaupel, T. Grötker, and H. Meyr. ComBox: Library-based generation of VHDL modules. In T. M. W. Burleson and K. Konstantinides, editors, VLSI Signal Processing IX, pp. 293–302, IEEE, 1996.
26. P. Zepter, T. Grötker, and H. Meyr. Digital receiver design using VHDL generation from data flow graphs. In Proc. 32nd Design Automation Conf., June 1995.
27. T. Grötker, P. Zepter, and H. Meyr. ADEN: An environment for digital receiver ASIC design. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3243–3246, Detroit, May 1995.
28. O. J. Joeressen, M. Vaupel, and H. Meyr. High-speed VLSI architectures for soft-output Viterbi decoding. Journal of VLSI Signal Processing, 8: 169–181, October 1994.
29. H. Dawid and H. Meyr. High speed bit-level pipelined architectures for redundant CORDIC implementation. In Proceedings of the Int. Conf. on Application Specific Array Processors, pp. 358–372, IEEE Computer Society Press, Oakland, August 1992.
30. E. Girczyc and S. Carlson. Increasing design quality and engineering productivity through design reuse. In Proc. of the 30th Design Automation Conf., pp. 48–53, 1993.
31. T. Grötker, M. Vaupel, and H. Meyr. DFG-Abschlussbericht Kommunikationssystem-Synthese. Tech. Rep. Me 651/12-4, ISS, July 1997.
32. P. Zepter. Simulator Coupling: COSSAP—Synopsys VSS. Internal Memo 715/16, ISS, RWTH Aachen, September 1993.
33. P. Zepter. Kopplung eines VHDL Simulators an einen Simulator für Signalverarbeitungsalgorithmen. In D. Seitzer, editor, GME Fachberichte 11 Mikroelektronik, pp. 127–132, VDE Verlag, March 1993. (In German.)
34. A. Müller, G. Post, M. Vaupel, and H. Meyr. RAVEN—A real-time analysis and verification environment. In Proc. of DSP Deutschland 97, München, Oct. 1997.
35. Microsoft. Microsoft Project User's Guide.
36. P. Buchholz, J. Dunkeland, B. Müller-Clostermann, M. Sczittnick, and S. Zäske. Quantitative Systemanalyse mit Markovschen Ketten. B. G. Teubner Verlagsgesellschaft, 1994.