FPGA Based Hardware Co-Simulation of an Area and Power Efficient FIR Filter for Wireless Communication Systems

In this paper, FPGA-based hardware co-simulation of an area- and power-efficient FIR filter for wireless communication systems is presented. The implementation is based on distributed arithmetic (DA), which substitutes multiply-and-accumulate operations with look-up table (LUT) accesses. A parallel distributed arithmetic (PDA) look-up table approach is used to implement the FIR filter in VHDL, taking advantage of the look-up table structure of the FPGA. The proposed design is hardware co-simulated using System Generator 10.1, synthesized with Xilinx ISE 10.1, and implemented on a Virtex-4 xc4vlx25-10ff668 target device. Results show that the proposed design operates at a throughput of 17.5 MHz and consumes 0.468 W, with a considerable reduction in required resources compared to Coregen and add/shift based design styles. Because of this reduction, the proposed design can also be implemented on a Spartan-3 FPGA to provide a cost-effective solution for DSP and wireless communication applications.


INTRODUCTION
Today's consumer electronics, such as cellular phones and other multimedia and wireless devices, often require digital signal processing (DSP) algorithms for several crucial operations (Allred et al., 2004). Due to a growing demand for such complex DSP applications, high-performance, low-cost SoC implementations of DSP algorithms are receiving increased attention among researchers and design engineers. There is a constant requirement for efficient use of FPGA resources (Macpherson and Stewart, 2006), since occupying less hardware for a given system can yield significant cost-related benefits: (i) reduced power consumption; (ii) area for additional application functionality; (iii) the potential to use a smaller, cheaper FPGA.
Finite impulse response (FIR) digital filters are common DSP functions and are widely used in applications such as telecommunications, wireless/satellite communications, video and audio processing, and biomedical signal processing. On one hand, the high development costs and time-to-market factors associated with ASICs can be prohibitive for certain applications; on the other hand, programmable DSP processors can be unable to meet the desired performance because of their sequential-execution architecture (Longa and Miri, 2006). In this context, reconfigurable FPGAs offer a very attractive solution that balances flexibility, time-to-market, cost and performance. Therefore, in this paper, an important DSP function, the FIR filter, is implemented on a Virtex-4 FPGA. The output of a K-tap FIR filter may be expressed as:

$$y = \sum_{k=1}^{K} C_k x_k \qquad (1)$$

where $C_1, C_2, \ldots, C_K$ are fixed coefficients and $x_1, x_2, \ldots, x_K$ are the input data words. A typical digital implementation requires K multiply-and-accumulate (MAC) operations, which are expensive in hardware in terms of logic complexity, area usage, and throughput (White, 1989). Alternatively, the MAC operations may be replaced by a series of look-up table (LUT) accesses and summations. Such an implementation of the filter is known as distributed arithmetic (DA).

DISTRIBUTED ARITHMETIC
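Distributed arithmetic (DA) is an efficient method for computing inner products when one of the input vectors is fixed. It uses look-up tables and accumulators instead of multipliers and has been widely used in DSP applications such as the DFT, DCT, convolution, and digital filters (White, 1989). Direct DA inner-product generation starts from equation (1), $y = \sum_{k=1}^{K} C_k x_k$, where $x_k$ is a 2's-complement binary number scaled such that $|x_k| < 1$. We may express each $x_k$ as

$$x_k = -b_{k0} + \sum_{n=1}^{B-1} b_{kn} 2^{-n} \qquad (2)$$

where the $b_{kn}$ are the bits (0 or 1) of $x_k$, $b_{k0}$ is the sign bit, and $B$ is the word length. Substituting (2) into (1) and interchanging the order of summation gives

$$y = \sum_{n=1}^{B-1} \left[ \sum_{k=1}^{K} C_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} C_k (-b_{k0}) \qquad (3)$$

The bracketed term can take only $2^K$ possible values, which can be precomputed and stored in a ROM addressed by the bits $b_{1n}, \ldots, b_{Kn}$. The second sum also has $2^K$ possible values and occurs at the sign-bit time. As a consequence, a $2 \times 2^K$-word ROM is needed. Figure 1 shows the simple structure that can be used to compute these equations. The S signal is the sign-bit timing signal.

The memory size can be halved by recoding the input. The term $x_k$ may be written as

$$x_k = \tfrac{1}{2} \left[ x_k - (-x_k) \right] \qquad (4)$$

and in 2's-complement notation the negative of $x_k$ may be written as

$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{B-1} \bar{b}_{kn} 2^{-n} + 2^{-(B-1)} \qquad (5)$$

where the overbar denotes the complement of a bit. In order to simplify the notation later, it is convenient to define the new variables

$$c_{kn} = b_{kn} - \bar{b}_{kn} \ (n \neq 0), \qquad c_{k0} = -(b_{k0} - \bar{b}_{k0}) \qquad (6)$$

each of which takes the value $\pm 1$, together with

$$Q(b_n) = \sum_{k=1}^{K} \frac{C_k}{2} c_{kn}, \qquad Q(0) = -\sum_{k=1}^{K} \frac{C_k}{2} \qquad (7)$$

so that

$$y = \sum_{n=0}^{B-1} Q(b_n) 2^{-n} + 2^{-(B-1)} Q(0) \qquad (8)$$

It may be seen that $Q(b_n)$ has only $2^{(K-1)}$ possible amplitude values, with a sign that is given by the instantaneous combination of bits. The computation of y is obtained by using a $2^{(K-1)}$-word memory, a one-word initial condition register for $Q(0)$, and a single parallel adder/subtractor with the necessary control-logic gates.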
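The look-up-and-accumulate idea described above can be illustrated with a small software model. The following is a minimal Python sketch, not the paper's VHDL design: it assumes integer coefficients and integer 2's-complement samples (rather than fractions with magnitude below 1), uses a full 2^K-word table with the sign-bit term subtracted, and the function names `build_lut` and `da_inner_product` are hypothetical.

```python
def build_lut(coeffs):
    """Precompute, for every K-bit address, the sum of the coefficients
    whose address bit is 1 (the bracketed term of the DA expansion)."""
    K = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << K)]


def da_inner_product(coeffs, samples, bits):
    """Bit-serial DA: one LUT access per input bit, no multipliers.

    Assumes each sample fits in `bits`-bit two's complement.
    """
    lut = build_lut(coeffs)
    acc = 0
    for n in range(bits):
        # Gather bit n of every sample into one LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> n) & 1) << k
        term = lut[addr] << n  # weight the table output by 2^n
        # The most significant (sign) bit carries negative weight.
        acc += -term if n == bits - 1 else term
    return acc


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]    # illustrative values, not the paper's filter
    samples = [-8, 7, -1, 4]  # 4-bit two's-complement inputs
    direct = sum(c * x for c, x in zip(coeffs, samples))
    print(da_inner_product(coeffs, samples, bits=4) == direct)
```

The inner loop touches the table once per bit position, so the number of table accesses scales with the word length rather than with the number of taps, while the table itself grows as 2^K.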

CIRCUIT DESCRIPTION
The basic LUT-DA scheme on an FPGA consists of three main components, as shown in Figure 1: input registers, a 4-input LUT unit, and a shifter/accumulator unit. Input registers: to reduce the consumption of logic elements, RAM resources are used to implement the shift registers (Allred et al., 2004).
Journal of Technology Management for Growing Economies, Volume 1, Number 1, April 2010

PROPOSED WORK
In a DA implementation, as the filter size K increases, the memory requirement grows exponentially as $2^K$. This problem is solved in this paper by breaking up the filter into smaller base DA filtering units that require less memory and less area. If the K-tap filter is divided into m units of k-tap base units (K = m × k), then the total memory requirement is $m \times 2^k$ memory words. The total number of clock cycles required for this implementation is $B + \lceil \log_2(m) \rceil$, where B is the input word length in bits; the additional second term is the number of clock cycles required by an adder tree that sums the outputs of the units. Thus the decrease in throughput of this implementation is marginal. For instance, in the proposed design K = 41; instead of $2^{41}$ words in a full LUT implementation, we have chosen 12 partitions, five units with k = 4 and seven units with k = 3, which require only $5 \times 2^4 + 7 \times 2^3 = 136$ memory words.
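The memory and latency figures above can be checked with a few lines of arithmetic. This is a minimal Python sketch; the helper name `partitioned_da_cost` is hypothetical, and the word length of 16 bits is an assumed value used only for illustration, since the source does not state B.

```python
from math import ceil, log2


def partitioned_da_cost(unit_taps, bits):
    """Return (memory words, clock cycles) for a DA filter split into units.

    unit_taps: number of taps handled by each base DA unit.
    bits:      input word length B (assumed value, not given in the source).
    """
    words = sum(2 ** k for k in unit_taps)       # one 2^k-word LUT per unit
    cycles = bits + ceil(log2(len(unit_taps)))   # bit-serial pass + adder tree
    return words, cycles


if __name__ == "__main__":
    units = [4] * 5 + [3] * 7                    # the 12 partitions of the 41-tap filter
    words, cycles = partitioned_da_cost(units, bits=16)
    print(sum(units), words, cycles)
    print(2 ** 41)                               # words needed by a single full LUT
```

With 12 units the adder tree adds 4 cycles to the bit-serial pass, which is why the throughput penalty of partitioning stays small compared with the memory saving.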
In the proposed work, a 41-tap low-pass filter has been designed. The first step in the design flow is to develop optimized VHDL code using the distributed arithmetic algorithm and to import it through the black box of System Generator to build the proposed model. Figure 2 shows the model of the proposed design, built from various Simulink and System Generator blocks. The part of the model enclosed in the green boundary is the software-based simulation, whose output can be seen in Figure 3; the part enclosed in the orange boundary is the hardware-based simulation, whose output can be seen in Figure 4; and the spectrum scope in the blue boundary compares the software- and hardware-based simulations, with its output shown in Figure 5. The green output waveform in Figure 5 indicates that the software-based simulation matches the hardware-based simulation without errors.

RESULTS
The proposed design is implemented on a Virtex-4 xc4vlx25-10ff668 target FPGA. Table 1 compares the proposed PDA design with the published add/shift and Coregen-based PDA designs (Mirzaei et al., 2006) implemented on a Virtex-4 device. It can be seen from the table that the throughput and performance of the proposed design are 17.50 MHz and 210 Msps respectively, which are almost equal to those of the other designs. Figure 6 compares the area utilization of the add/shift, PDA (Coregen) and proposed PDA (PPDA) designs for 41-tap filters. It can be observed that the PPDA uses considerably fewer resources on the target device than the other designs. Because of this reduction in required resources, the proposed design can also be implemented on a Spartan-3 FPGA, as shown in Table 2.

CONCLUSIONS
In this paper, a parallel distributed arithmetic algorithm for a high-performance reconfigurable FIR filter is presented to enhance area and power efficiency. The proposed design takes advantage of the look-up table structure of the target FPGA. The throughput and performance of the proposed design are 17.5 MHz and 210 Msps respectively, with a considerable reduction in used resources. Because of this reduction, the proposed design can also be implemented on a Spartan-3 FPGA to provide a cost-effective solution for DSP and wireless communication applications.